Episode 13 – Inside Ping Proxies with Joseph Dye.
In this episode of Netstack.fm, Glen from Plabayo talks with Joseph Dye (Joe), founding engineer at Ping Proxies, about building large-scale proxy infrastructure in Rust. Joe shares how he went from art student to programmer, joining Ping when it was a tiny startup running on Python and Squid. He explains how they rebuilt everything in Rust, creating performant HTTP and SOCKS5 proxies and managing massive IP networks for web scraping. The conversation covers the evolution of their stack, challenges with HTTP versions, TCP/IP fingerprinting, user-space networking with DPDK, and the adoption of MASQUE and HTTP/3. Joe also reflects on Rust’s safety benefits, being the only Rust engineer at Ping, and how the company stays competitive through technical innovation rather than size.
If you like this podcast you might also like our modular network framework in Rust: https://ramaproxy.org
00:00 Intro
00:41 Introduction to Proxies and Joe's background
03:42 Understanding Ping Proxies and Their Offerings
06:52 The Technical Journey: From Squid to Rust
09:47 Proxy Types: Data Center vs. Residential
12:42 Building a Proxy Infrastructure
15:44 Challenges with HTTP Protocols
18:39 The Importance of Customization in Proxy Development
21:38 Team Dynamics and Future Growth
29:32 Transitioning to Rust Development
30:59 Understanding HTTP Protocols
32:40 Exploring HTTP/2 and HTTP/3
34:05 The Future of Proxying with MASQUE
36:14 Evaluating New Technologies for Proxies
37:51 Developing for End User Devices
39:49 Challenges in Network Stack Development
41:15 Proxying Non-HTTP Traffic
42:51 TCP/IP Fingerprinting Explained
47:57 The Importance of TCP/IP Fingerprinting
53:28 Performance Considerations in User Space TCP
58:22 Competing in the Proxy Market
01:00:05 Cancellation Safety in Rust Concurrency
01:03:53 Outro
Music for this episode was composed by Dj Mailbox. Listen to his music at https://on.soundcloud.com/4MRyPSNj8FZoVGpytj .
Elizabeth (Plabayo)
0:14
This is netstack.fm, your weekly podcast about networking, Rust and everything in between. You are listening to episode 13, recorded on the 10th of November, 2025, where Glen has a conversation with Joe, founding engineer and first employee at Ping Proxies, where they build proxy infrastructure in Rust to collect data at scale.

Welcome to another week of Netstack.fm. Today my guest is Joe. He works at Ping Proxies, a proxy company which we will learn all about. We will also explain a bit about what proxies are and what proxy companies do. They use Rust heavily, but they do not make use of Rama, as it did not yet exist at that point. So welcome, Joe. Can you maybe start by explaining your own background? How did you get into programming, and specifically where did the love of, or interest in, network programming come from?

Yeah, sure. I think, like most people, I started quite young, maybe when I was 12 or 13. But I only ever did it as a hobby; it was never my intention to do programming as a career. I actually went to university to do art, and it was while on my art degree that I realised there's nothing like doing art as a career to kill any passion you once had for the subject. So I started spending more time on my hobby of programming rather than attending lectures and doing work. At some point I decided I might as well just drop out and pursue programming, and very quickly after dropping out of university I got a job at Ping Proxies.

Well, that's a very quick journey indeed. So what was your experience at that point with actually building proxies, or anything related to that?

None, I had no experience at all. I was entirely self-taught. I'd been using Rust a lot in my free time.
I kind of got into using Rust for art-related things (there's something called Nannou), and then I moved into writing raytracers and things like that. But I was entirely self-taught; I had no experience writing proxies or anything like that. I just turned up at Ping one day and had to learn it all from there.

And so how did they find you, or how did you find them, and why did you choose to join? Because it doesn't seem like an obvious choice at all.

Yeah, it was more that it was available and they were interested in hiring me. Ping was a small company at this point. It was something that the boss, Tim, had started at university, kind of out of his bedroom. It was a small company, it wasn't making much money, but he knew he needed a developer and he wanted one at a similar point in life to him. So he looked for students or, in my case, I guess, an ex-student. When we started, we didn't have any experience. We were very scrappy and we just figured it out as we went along. For that first year I wouldn't even say we were a proper company; we were just trying to do something, not really knowing what we were trying to get to or achieve.

Okay. I always explain to people that a proxy is kind of like the combination of a server and a client, because you accept incoming connections and you make a client connection onward, and you basically forward traffic, maybe modifying it along the way, maybe rerouting it. There are a lot of things that proxies do, but what specifically does Ping Proxies mean by a proxy, and what different kinds of proxies do you have?

We have two products that we sell: a per-IP product and a per-gigabyte product. With one you sell people an IP, with the other you sell people one gigabyte of bandwidth. But both of these products operate quite simply, to be honest.
99% of our traffic is HTTP CONNECT tunnels, which means a client handshakes with us, we establish a connection onwards to the target, and then we just tunnel bytes back and forth transparently, without touching or modifying them. Because we're not trying to adjust their headers or terminate SSL, it's really quite simple at its core, and a lot of the challenges come more from handling scale and doing this performantly than from the actual mechanics of proxying.

And so why would anybody want a proxy to begin with? It seems like it just establishes a connection and forwards traffic, so some folks might wonder what the point is.

It's mainly for accessing the internet at scale. If you're trying to web scrape and you want 100 million pages a day, you can't really do that from your own machine or from just one bare-metal server that you got from a provider like Scaleway. You need different IPs to prevent yourself from being banned, so you don't get detected. That's where we come in: we offer a lot of IPs to enterprise customers so they can do their web scraping at scale.

Okay, and you explained that this company started basically from a college bedroom. It was Tim, right? How did he get started with that, and how did he get access to those IPs?

So, Tim did an economics degree, took a couple of modules of Python and just put it together from there. He didn't start out having IPs; he started by reselling other people's products, basically focusing on marketing better than the companies he was reselling. But then he realized that there's more money to be made if you're selling IPs that you own yourself.
And so he slowly started introducing subnets that we controlled to our network, until now, where all of the IPs that we sell for our data center and ISP products (I think it's something like a hundred thousand) are controlled and owned by us or, if not owned, leased by us.

And in all these locations, do you basically just rent something like VMs or VPCs from companies and request IP addresses there, or how does it actually work? Is the machine separate from the IPs, or do you rent data center resources that come with IPs?

I kind of wish it was that simple. The way it works is we get a server in a data center somewhere, then we have to communicate with a carrier like, I don't know, Verizon (I think they do internet in America, right?). We have to ask them if they will provide us with a five-gigabyte line or something, and then ask them if they'd be willing to announce IPs for us. That will set limits, such as announcing 4,000 IPs for us on a single machine. Then we have to go onto a platform like IPXO, which is kind of a marketplace for IPs, and find IPs that have the characteristics we want on our network. Then we lease them, get Verizon to announce them on the servers, and then we're good to go. It's a slow process with lots of back and forth. It's not automated, and it can be quite annoying.

Okay, that indeed seems very non-technical, basically just a lot of communication, which is probably why it's so slow. Very fascinating. You already mentioned data center proxies. These days people often also have something called residential proxies, meaning that instead of data center IPs, you have IP addresses which are seen as residential, as if they're coming from a house. Is that also the kind of proxy you sell?
Yeah, so under our per-IP product line we have data center, and then we have what we call ISP, which is, like you said, static residential. They look like an IP that might be coming from someone's home, but they are hosted on servers in a data center somewhere.

Okay, but as far as I know, in the market of proxies, and specifically IP proxies, a distinction is usually made between ISP and actual residential proxies, meaning actual ones, maybe living on the router of an actual house, where someone might for example use a service that pays them for any excess bandwidth they have. That is actual residential. But it seems that you only offer ISP ones, meaning IPs which are seen as residential but which you still host from within a data center.

We also do real residential proxies. That's under our per-gigabyte offering, where someone buys a gigabyte from us, we route them through our network to a device that has one of our SDKs on it, and then we proxy them out of that mobile device. So that's what you're talking about there. We have data center; ISP, or static residential, which is hosted in a data center; and then true residential, which comes through something like a mobile.

Yeah, and of course a mobile device is an even more expensive one, because there you're running on an actual mobile device, but you also have just regular residential, where they might be living on a router or something close to a router.

I mean, that really depends. We partner with SDK providers and they give us entry points to these devices. The majority of the time that is through someone who's installed an app. I think there's one called Pawns, which some people might have heard of: you put the app on your phone, you turn it on, and you get paid for every gigabyte that goes through your mobile device. That's how we acquire the majority of our residential peers.

Hmm.
Okay, very interesting. I didn't know that these things were also done by middlemen. I was always under the assumption that the proxy companies themselves would be developing those SDKs, but it makes sense, as it seems like a very specialized business. Doesn't it mean that you don't have much control over these, that you don't know what's going on there, or only at a very high level?

On a technical level we unfortunately don't have the sort of control that I want. Building a residential network with your own app is a lot of work, right? You've got to market it, you've got to build the apps, so it's not something we've focused on too much at this time. But it's something I want to work on, because these SDKs are not, let's say, up to modern proxying standards. Many new protocols and new ways of doing things have come out that would really let us offer better products. So going into next year, that's something I want to focus on: building our own, so we can do everything we want to do.

Okay, so to recap what we've learned so far: you joined Ping Proxies as basically the second employee. We will learn more later about how things evolved from there, but we know that at this point you offer several kinds of proxies: the ones with data center IPs, the ones with residential IPs within data centers, and next to that, residential IPs through SDK providers. So far so good, right? Now that we've learned a bit about where you came from and what kind of products you offer at a high level, I want to start filling in the gaps. When you joined, you said Tim had started with some Python, if I recall correctly. So how did you move forward from there? And at what point, for example, did Rust come onto the table?

Yeah, so.
I want to give you an overview of how bad it was when I joined. Tim was making money, but he was not technologically competent. I think he'd be okay with me saying that. We were running Squid, which is an open source proxy, with just a Flask Python API wrapping it. That meant we had basically no flexibility in what we wanted to do. And to really understand the state of things: Tim hadn't even set up Git. We didn't have version control. So the first year I was there was really just putting out fires: adding database backups, adding version control, making it less of a house of cards that was likely to fall down. After that first year we started looking at how we could improve the company and our products, and we started looking outside the company for someone who could build us the proxy software that we needed. After we got quotes, I just said, you know what, I'm all right with Rust. Why don't I give it a go? It'll probably save us money. And that was it. We started introducing Rust, slowly building up our own proxy server so that we could offer all the features we wanted.

What time scale are we talking about? Which years, for example?

That would have been about a year into me being at Ping, so about four and a half, five years ago.

Okay, that was well before something like Rama even came on the scene.

Yeah, unfortunately there was no Rama, which I would have loved because it would have saved us a lot of time. Pingora hadn't been released yet. So I don't think there was anything that would let us do the sort of proxying we wanted to do, and we just had to build everything from scratch.

Yeah, I understand. At that point there was already plenty of other stuff: Tokio was mature, and there were things like Hyper. So I would imagine that is the kind of stack you built upon, or maybe I'm wrong.
We do use Tokio. I love Tokio, although I'm doing some tests and writing a blog post about alternatives to Tokio. We want to test them in production and see if these thread-per-core runtimes actually offer benefits for us. But we don't use Hyper at all, actually. We use H3, which is the library underlying Hyper's HTTP/3 capabilities, and H2, which is the same for HTTP/2. I was hopeful that Hyper would offer an H1 library which, in the vein of H2 and H3, does HTTP/1 proxying, but it didn't. So we wrote our own HTTP/1 library around ureq-proto, which is the sans-IO state machine underlying the ureq HTTP client.

Okay, very interesting. I am familiar with them and they are solid libraries, so that's very cool. So what capabilities did the original setup have? Even though it maybe was not technically very profound, it was still functional, it was working. What features were already there and how did you port them over?

To be honest, our original Squid setup didn't have many features beyond reading authentication usernames and passwords from a file and then checking whether someone had them when they proxy-connected to the server. We didn't have concurrency or rate limiting or throughput limiting or anything like that. So we had basically no features to speak of to port over, and it was quite easy to get an MVP that could do everything Squid was doing for us. From there, we took our time to decide what we as a business needed in order to sell proxies and offer the best features possible, and that took quite a lot of time.

Okay. As you are running in data centers, does that mean you just deploy straight onto VMs, or do you use something like Kubernetes? How exactly do you deploy these?

We run everything on bare-metal servers, and then we have Kubernetes orchestrating everything. I think we have quite a simple infrastructure, even right now.
I think a lot of the time it's easy to overcomplicate and make premature optimizations for use cases you think might arise in the future but aren't sure about. So we kept it simple: Kubernetes, Postgres and a lot of replication logic to get that data down to our servers. For the most part it's really quite simple.

Okay, very interesting. You mentioned HTTP proxies, meaning there is a CONNECT request, the proxy establishes a connection for you, and then it bidirectionally copies the bytes between the source and the target. Do you also offer something like SOCKS5 proxies, where the handshake happens before the HTTP layer is even used?

We do. We offer SOCKS5. When implementing that we were looking for libraries, and there's one that's quite good called fast-socks5. I didn't like its API, though, so we wrote our own in a sans-IO manner. I'm a big fan of sans-IO. So we offer SOCKS5 connect; SOCKS5 associate, which is UDP proxying; and then HTTP/1, HTTP/2 and HTTP/3 proxying.

Okay, yeah. For certain use cases you just want to use SOCKS5, but not all proxy providers support it. And we are also fans of sans-IO, which is basically the approach where you decouple your protocol processing from your actual I/O, meaning the networking or the file system or any other I/O. In a future episode we will talk with Martin, who is also very much into sans-IO, and we will devote an entire episode to that. Now, to recap a bit: you offer SOCKS5 proxies and HTTP proxies. What about HTTP versions? Were there any difficulties there? I imagine HTTP/1 was the simplest one to proxy, even though there are probably some war stories as well. But what about HTTP/2 and HTTP/3? How well did proxying that kind of traffic go?
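To make the sans-IO idea from this exchange concrete, here is a minimal sketch of a parser for the SOCKS5 client greeting (RFC 1928, section 3). It is written in Python purely for illustration and is not Ping Proxies' code; the class name `Socks5Greeting` and the preference for username/password authentication are assumptions. The point is that the parser never touches a socket: the caller feeds it bytes and sends back whatever reply it returns.

```python
from dataclasses import dataclass, field
from typing import Optional

SOCKS_VERSION = 0x05
NO_AUTH = 0x00
USERNAME_PASSWORD = 0x02
NO_ACCEPTABLE_METHODS = 0xFF


@dataclass
class Socks5Greeting:
    """Hypothetical sans-IO parser for the SOCKS5 client greeting."""

    _buf: bytearray = field(default_factory=bytearray)

    def feed(self, data: bytes) -> Optional[bytes]:
        """Buffer incoming bytes.

        Returns the 2-byte method-selection reply once the greeting is
        complete, or None if more bytes are needed.
        """
        self._buf += data
        if len(self._buf) < 2:
            return None  # need VER and NMETHODS first
        version, n_methods = self._buf[0], self._buf[1]
        if version != SOCKS_VERSION:
            raise ValueError(f"unsupported SOCKS version: {version}")
        if len(self._buf) < 2 + n_methods:
            return None  # method list not fully received yet
        methods = bytes(self._buf[2:2 + n_methods])
        # Assumed policy: prefer username/password, fall back to no auth.
        for preferred in (USERNAME_PASSWORD, NO_AUTH):
            if preferred in methods:
                return bytes([SOCKS_VERSION, preferred])
        return bytes([SOCKS_VERSION, NO_ACCEPTABLE_METHODS])
```

Because the state machine only maps bytes in to bytes out, the same code can sit behind a blocking socket, an async runtime, or a unit test that feeds it one byte at a time.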
To be honest, HTTP/2 was very easy, because the H2 library that we use is really quite full-featured and does everything we need. HTTP/3 was not hard, not easy. There were a few things that we had to modify. I got a two-line PR into H3, so I can say I'm a contributor. HTTP/1 was actually the most difficult one, and again that was just because there's no HTTP/1 library in the Rust ecosystem that's asynchronous and operates the way I wanted, outside of Hyper, and I deliberately chose to avoid Hyper. So writing my own part of that was the most difficult bit, and I introduced many bugs, many issues with the reading of bodies and the handling of headers, that took a long time to nail down and get solid enough for production.

Yeah, I can imagine, because HTTP/1 is actually quite a complicated set of protocols. I call it a set of protocols because there are so many different RFCs involved. And around payloads, I can imagine. Normally you have just your payload, which is defined by a content length, but then you also have chunked transfer encoding. I imagine there might have been issues there, because it's definitely tricky.

Yeah, things like that. And then you also get close-delimited bodies, which is like a relic from HTTP 0.9 but still occurs. That was an interesting bug to find, to realize that a customer was actually doing something like that. There's a lot to HTTP/1. It makes me wish that I could have used Hyper, but there are a few things about it. I'm sure you've used Hyper for Rama. You've probably encountered some issues there as well, with how its API isn't always what you might want.

Yeah, originally we were based on Hyper, but since then we forked it. We are still keeping in sync with it and we contribute where we can, but we sometimes need more control or customization.
And the issue with a lot of libraries and frameworks in general, not specific to Rust, is that they often try to be spec compliant. But as a proxy, as you know yourself, you cannot really be 100% spec compliant, because things in the wild, clients and servers alike, are often not spec compliant, or where there is ambiguity they might do things which are not logical but which still became a kind of pseudo-standard for those traffic flows. And if you use something like Hyper, you might error on some traffic which you really should accept, or you might modify something which you shouldn't modify.

We had many issues like that, and it annoys me that things aren't spec compliant, but I understand that you have to approach it that way. We had issues where, if we sent lowercase headers back to certain clients, they would just drop them and reject our responses. We had issues with the HTTP CONNECT handshake. The RFC is very clear that you should never send back a body or certain headers, like a Content-Length header, on a response to an HTTP CONNECT, and yet we had clients and end targets that would still do that and cause issues for us. So a lot of the time, any bugs that we still encounter will be some random client from 2004 that's just doing things its own way.

Yeah, and it's only once you're a proxy in the middle that you notice it and have to handle it. I've built proxies in many different languages, and in one case we used Go and forked the entire Go standard library because we had to customize the HTTP library, which is very much baked into the standard library but gives you little control. The classic one you already mentioned is definitely headers being lowercased or something like that. But another one, also related to headers, is cookies.
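The two header problems raised here can be sketched in a few lines. RFC 9110 (section 9.3.6) says a server must not send Content-Length or Transfer-Encoding in a 2xx response to CONNECT, and that a client must ignore them if they arrive anyway. The function below is a hypothetical illustration in Python, not anyone's production code: it drops those two fields from a successful CONNECT response while leaving every other header's casing and order exactly as received, which is the behaviour the picky clients described above depend on.

```python
from typing import List, Tuple

Header = Tuple[str, str]

# Framing headers that carry no meaning on a successful CONNECT response.
_IGNORED_ON_CONNECT_SUCCESS = {"content-length", "transfer-encoding"}


def sanitize_connect_response(status: int, headers: List[Header]) -> List[Header]:
    """Return the headers to forward for a response to CONNECT.

    Names are compared case-insensitively but forwarded with their
    original casing and in their original order.
    """
    if not 200 <= status < 300:
        # A failed CONNECT is an ordinary response and may carry a body.
        return list(headers)
    return [
        (name, value)
        for name, value in headers
        if name.lower() not in _IGNORED_ON_CONNECT_SUCCESS
    ]
```

Storing headers as an ordered list of pairs rather than a lowercased dictionary is the design choice that makes pass-through fidelity possible in the first place.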
Cookies are often very strictly validated by libraries, but actual web browsers and servers are often very relaxed about them, and things completely break because some weird thing that you should never do with cookies according to the specs is still something that they do all the time. It's not like a developer does it on purpose; they just use it because it's possible and they didn't even think about it, and then they break because they expect that specific weird formatting. Most libraries would just drop it, and for a proxy that's of course a no-go.

Yeah, you can't do that. Interesting that you mention the Go standard library for HTTP, though, because we've been having issues with it recently. I'm not a major fan of its implementation at this point.

Yeah, I mean, one of the things that Go is often celebrated for is how good its standard library is. And in a way I agree: it has a very rich standard library, including all the cryptographic primitives. It's wonderful. But as soon as you stray a bit from the standard path, as soon as you need to do something a bit special, you're kind of on your own. In fact, they don't even allow you to do much about it, because despite the fact that Go has what are called interfaces, if I recall correctly, they don't really use them in the standard library for those things; they use concrete types. So you cannot do much about it, because even libraries built on top of the standard library will have completely... Yeah, I mean, it's just a mess. But it's not specific to Go. I feel it's the same for most libraries, especially server libraries. They usually work very well if you have a very standard case, but for proxies, in my opinion, you don't want to touch your traffic where you don't need to, and only touch it where you do.
And especially, don't modify it if you don't want to modify it. Any modification should be on purpose.

Yeah, I think that's roughly the issue we're facing: it doesn't want to give you that control for some of these operations that are incredibly niche and that most people won't encounter.

Yeah. And at this point, how big is Ping Proxies? If you can tell, of course; I'm not sure if you're allowed to. How many people are working there now?

I think right now there's five of us in the office doing engineering work, and then we've got front-end developers, content writers and others who work remotely. So we're still quite small. We want to grow, but I think it's best to grow a company slowly and with intention rather than hiring for the sake of hiring. Sometimes throwing more people at a problem doesn't mean you start developing faster; it can actually slow you down. So we're trying to be very careful with how we grow the company.

I understand, and I totally agree. You can't just throw people at a problem; they are not simple resources, and you want to be very mindful about it. Now, you also mentioned that for now you are still the only one in the company who knows Rust. Say there's a small issue: could others fix it, or is it even then still you?

Even then it's still me, and it's kind of an issue. I think one of the problems with introducing Rust to a small company is hiring for Rust, because there aren't that many developers, and they may not be in the locations you want. Also, Rust commands a premium in salary, which can make it hard for a small company to hire the people it wants to hire.
So I'm encouraging the other people on the team to get into Rust, maybe use Advent of Code this year to learn a little, because right now any small feature is something that I have to do. That's not really an issue: I like working with Rust, and working on the proxy software is my main job, so it's not like it's taking me away from other work. But it would be nice if, when they have a small issue, they could go in and fix it. And to be honest, now that all the hard work of developing the proxy software is done (I've defined the APIs and how we want things to work, and I've handled all the protocol logic), I do think it's the sort of thing that someone with a week's worth of Rust knowledge could go in and change, to fix a small bug, say. So I think we could very quickly get to the point where I'm not the only one working on this code base all the time.

Okay, that's fair enough. That's one thing some companies do: they don't really hire Rust developers, because the pool is very small, but they hire people who have enough experience in other languages that they could easily transfer over to Rust. And with someone like you in the position of mentor, they probably get up to speed quite quickly, especially since, as you said, you already defined the trenches and set up the pathways, and now it's up to them to follow what has been developed before and to continue it. That is often a lot easier than starting a new project, especially in a language you don't know much about.
I also think that Rust is a language that's actually quite easy for beginners to be productive with if they're working on existing code, because the type system and the borrow checker really do keep them in line and let them know whether what they're doing is working, so they can't go too wrong. I'd rather have these other developers learning Rust to contribute than, say, have written the code base in C++ and now they've got to learn C++ to contribute.

Yeah, exactly. Now, you mentioned something interesting which I didn't pick up on at the time. We talked a bit about the difficulties in HTTP/1 and you mentioned a specific HTTP 0.9 relic. I forgot the name, but I had never heard of it before, so could you elaborate a bit on what it is, how it works and why it was used?

Yeah, I mean, I could be wrong here; it's been a while since I read the spec. But there's something called a close-delimited body, which is an HTTP body where you don't get its length ahead of time and it's not using chunked encoding. Someone just sends you a body, and when they close their connection, that's when the body has ended. So it's quite a simplistic way of sending bodies. The issue it was causing us was that if we got anything other than a proper body that we knew the length of or that was chunk-encoded, and the TCP stream then closed in what looked like an incorrect way, we'd essentially report it to the user as an error. It wasn't; it was actually a successful proxying attempt, but we'd still report it as an error in our dashboard and in our logs. So it was mainly a matter of figuring out why that was happening and then making sure it wouldn't happen.

So right now you basically handle that by optimistically assuming, if you don't have any of these headers, that it might be that, and then when the connection is closed you try to parse whatever payload you got so far.

Yeah, that's exactly it.
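The decision discussed in this exchange follows the message body length rules in RFC 9112, section 6.3. The sketch below is a simplified Python illustration of those rules for responses, not the logic of any particular proxy; the function name and the string labels it returns are made up for this example, and validation of malformed or conflicting Content-Length values is deliberately left out.

```python
from typing import Dict


def response_body_framing(method: str, status: int, headers: Dict[str, str]) -> str:
    """Decide how the end of an HTTP/1.1 response body is signalled.

    `headers` is assumed to map lowercased field names to their values.
    """
    # Responses that never carry a body.
    if method == "HEAD" or 100 <= status < 200 or status in (204, 304):
        return "none"
    # A successful CONNECT turns the connection into a tunnel.
    if method == "CONNECT" and 200 <= status < 300:
        return "tunnel"
    codings = [
        c.strip().lower()
        for c in headers.get("transfer-encoding", "").split(",")
        if c.strip()
    ]
    if codings:
        # Transfer-Encoding takes precedence over Content-Length.
        return "chunked" if codings[-1] == "chunked" else "close-delimited"
    if "content-length" in headers:
        return "content-length"
    # No length information: the body ends when the server closes.
    return "close-delimited"
```

With a function like this, a connection close that follows a `"close-delimited"` response can be logged as a normal end of body instead of an error, which is the misreporting described above.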
Now we just assume that if there is no body length header, it's a close-delimited body, and try to go from there.

Okay. In that way, something like HTTP/2 and HTTP/3 is of course a lot nicer, because you are just working with proper data frames, where the frames indicate exactly what is payload, and so on.

Exactly. Big fan of H2 and H3. I'm glad we've got them on our network; I'm just annoyed no one's using them.

Yeah, well, it's also not entirely their fault, which is something very interesting you mentioned. You support HTTP/3 proxying, but basically no single client right now even supports proxying over HTTP/3. So what kind of customers are those, then?

Yeah, you're right that it's on the client, and unfortunately there are no clients that do HTTP/3. It seems like the majority of clients still do their proxy handshakes in HTTP/1, so they can't use any of the benefits of H2, like connection reuse. But this wasn't really something we added for current customers so much as to bring in customers or open pathways for existing ones. If we provide the ability to handshake using H3, maybe they'll start using it, maybe they'll find use cases for it that they hadn't considered before. We've got more things along those lines coming in the next few weeks, actually. I don't know if you know about MASQUE proxying?

Yeah, we talked about this in episode 11, where we had a conversation with Max Inden from Firefox. They are now also on their way to supporting MASQUE within Firefox. So it's not there yet, but they are working on it.

Yeah, so I'm in the middle of a few PRs that are going to try to support connect-ip and connect-udp, the two pseudo-methods that MASQUE introduces, on our side. And it's interesting, actually, because the methods are backwards compatible.
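As a rough illustration of that backwards compatibility: RFC 9298 lets an HTTP/1.1 client ask for UDP proxying with an ordinary GET plus an `Upgrade: connect-udp` header, targeting a URI built from a template. The Python sketch below builds such a request using the default template from the RFC; the function name and the proxy host are placeholders, and a real client would still have to handle the 101 response and then speak the capsule protocol on the upgraded connection.

```python
from urllib.parse import quote


def build_connect_udp_request(proxy_host: str, target_host: str, target_port: int) -> bytes:
    """Build an HTTP/1.1 connect-udp upgrade request (illustrative only)."""
    if not 0 < target_port <= 65535:
        raise ValueError("target_port must be in 1..65535")
    # Percent-encode the target so IPv6 colons become %3A.
    encoded_host = quote(target_host, safe="")
    target = (
        f"https://{proxy_host}/.well-known/masque/udp/"
        f"{encoded_host}/{target_port}/"
    )
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {proxy_host}",
        "Connection: Upgrade",
        "Upgrade: connect-udp",
        "Capsule-Protocol: ?1",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

For an existing HTTP/1 proxy client, emitting this request is a small change compared with adopting a full QUIC and HTTP/3 stack.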
So in HTTP/1 you can still use them, which I'm hoping makes it easier for people to use these features. On an old client, introducing a whole HTTP stack with QUIC and everything is a lot of work, but it's not so much work for someone who has created a client to just add a new method, essentially, to their HTTP/1 proxy support. So I'd really like to see how that works, and if it gives anyone a reason to stop using SOCKS, because you can do UDP proxying over HTTP with those methods, I'll love it.

Very interesting, yeah. I should probably do an entire episode about MASQUE one day. That would be something for the future. But first I need to dive into it myself a bit more, because I didn't really get the time to catch up with that yet.

If you want to get into it, Cloudflare open sourced a library a couple of days ago called tokio-quiche, which supports the CONNECT-IP and CONNECT-UDP pseudo-methods. So you can start proxying HTTP/3 traffic, UDP datagrams, and IP basically very easily. So, you know, respect to Cloudflare for that. I'm going to see if I can get it integrated on our side of things and if it works for us.

Yeah, I mean, that makes sense. Now, another question I have: let's say right now you're five years or something like that into your journey, and you have quite a well-established code base which seems to be well developed. You put some thought into it, you made deliberate choices, it's functioning, it works. Is there at that point still a lot of reason for a team and a company like yours to even switch to something new, let's say something like Rama? Would there still be any use in that?

I think for us, for the software that we're running on our proxy servers, I don't think so, because we developed the software in line with what we need as a business. And so it's very niche and specific to our use cases.
We considered open sourcing it at one point, but it just wouldn't be useful for anyone who's not selling proxies like we do. But we have some projects that I think Rama could be really useful for. So I talked about the residential networks and how we use SDK providers, but we want to move to providing our own apps and peering with those, and for something like that I think Rama could be a much better option. It's probably going to be a lot lighter weight than the software we're running, and it's going to be a lot easier to develop with and do things with. So that's the sort of thing I think Rama would be really good for. For any new project involving proxies, Rama would pretty much be my first choice.

Okay, very interesting. Are you then still talking about an actual end-user device, or are you also talking about things like routers and that kind of stuff?

In this case it would be end-user devices, so running Rama on a mobile phone or something, perhaps as the final hop in a residential proxy.

Yeah, very cool. I mean, yeah, those things are definitely possible, especially nowadays. Well, about Android I'm not too certain, but I do have experience with iOS, and they have very nice ways to hook from user space into the actual DNS and networking stacks there. You can basically take over those things and do whatever you want. I mean, it's very similar to eBPF. So it's very nice.

Mm-hmm. Interesting, I assumed that iOS would be more locked down and more difficult to work with. But yeah, that's going to be a big project where I'll learn loads just messing around with things, trying to get them to work.
Yeah, for now I have plenty of stuff around that which is closed source, because I don't really know yet what to do there around open sourcing it. It's basically just gluing together Rama and the SDKs of Apple, and so far I wasn't sure what the value would be there, or what kind of crate to make of it, and whether it should be iOS specific or whether I should make something generic for different platforms. Because Windows also has specific stuff: with older Windows it used to be that you had to hook into the kernel, but nowadays, especially after the entire mess that was caused by CrowdStrike, I think they definitely want to have more and more in user space, which they actually do. So now you can also do a lot of that in user space, which is very interesting. I mean, it works pretty nicely, you can hook into it all, it just takes a bit of conversion. And still, I guess, be careful, because even if you cannot crash the system, you can still easily, I don't know, break the entire network stack and drop all the traffic or something. So it's definitely something to be careful about, and it's a more subtle issue, because the phone will be working but nothing will be connecting, which is, I guess, a bit frustrating.

That'll be an interesting bug for me to have to sort out at some point in the future.

Yeah, yeah. Look, luckily, at least from my experience, it's usually pretty pleasant to develop for something like iOS, because it's such a closed system, like you said, but also there are only so many devices, and not many people even run old devices. So you can usually just care about the last couple of phones. In my experience, developing on something like Android is usually the more interesting one, as in, not interesting, but very annoying, because, I don't know,
there are just so many devices, which is kind of the same issue, I guess, as on something like Linux or Windows, because there are just so many different hardware configurations possible. I mean, it's good for the end user, but it's not so fun for, I would say, the developer having to support all those kinds of setups.

So, okay, that's very interesting. And then of course people might be proxying HTTP traffic; that's the normal one, or at least the one you expect in that kind of setup, where you say it's also for, for example, data extraction and those kinds of things. WebSockets are based on HTTP, so I would assume that you don't really have to do anything for that, is that correct? I would say that's probably right, since you've already established a connection and you're just piping bytes, I suppose.

Yeah, exactly. If it operates over TCP, it will just work through us. If it operates over UDP, you can also make that work. We really don't have much insight into what is going through the tunnel, mainly because it's usually TLS protected. But anything anyone wants to do through a proxy, they pretty much can. Just because you did the handshake with HTTP doesn't mean you're limited to HTTP.

No, no, exactly. I mean, that's also kind of how all kinds of different things work. For example, if you use a PAC file you can also enable HTTP proxies, even if it's not for HTTP traffic. So I understand that. But then let's say something like WebTransport, which is based on QUIC. I suppose if you use something like SOCKS5 UDP or something like MASQUE, you don't really care either, because at that point you're once again just piping bytes.

Yeah, exactly. Many competitors advertise that they support proxying HTTP/3 or QUIC. What they mean is they support SOCKS5 UDP, and that allows you to do that.
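The "just piping bytes" point can be illustrated with the SOCKS5 UDP request header from RFC 1928: the proxy only needs a small prefix naming the destination, and the QUIC packet itself travels as an opaque payload. The function name `encode_socks5_udp` below is hypothetical, and the sketch covers only IP-address destinations with no fragmentation; domain-name targets (address type `0x03`) are left out.

```rust
use std::net::SocketAddr;

/// Wrap a UDP payload in the SOCKS5 UDP request header:
/// RSV (2 bytes) | FRAG (1) | ATYP (1) | DST.ADDR | DST.PORT (2) | DATA
fn encode_socks5_udp(dst: SocketAddr, payload: &[u8]) -> Vec<u8> {
    // Two reserved bytes, then FRAG = 0 for a standalone (unfragmented) datagram.
    let mut out = vec![0x00, 0x00, 0x00];
    match dst {
        SocketAddr::V4(addr) => {
            out.push(0x01); // ATYP: IPv4
            out.extend_from_slice(&addr.ip().octets());
        }
        SocketAddr::V6(addr) => {
            out.push(0x04); // ATYP: IPv6
            out.extend_from_slice(&addr.ip().octets());
        }
    }
    out.extend_from_slice(&dst.port().to_be_bytes());
    // The payload is not inspected: a QUIC packet is copied through unchanged.
    out.extend_from_slice(payload);
    out
}
```

Nothing in this framing is specific to HTTP/3, which is why SOCKS5 UDP support is enough for a provider to claim QUIC proxying.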
But that's where MASQUE proxying is really interesting for me, because I'd really like it to be possible for customers to use HTTP to proxy their HTTP traffic. So if you're using QUIC, you can still use HTTP to establish that tunnel and then proxy over that, and that's what the CONNECT-UDP pseudo-method will be really good for.

Okay. And you already said that right now you don't offer any support for, let's say, meddling with headers or anything like that. You don't really emulate traffic, you're just proxying traffic. But at the same time you did tell me that you might be working on a project in the future to do with TCP/IP fingerprinting, where at that point you are emulating, just not on the HTTP layer but only on the TCP layer. I suppose that is because that's the only part where the user has no control, given that they are not really in control of your TCP connections and UDP connections. Is that why that is the only thing you might be offering related to emulation in the future?

Yeah, exactly right. If we were to mess with their HTTP requests, we could mess something up with their application logic. But they're already using our TCP stack to connect to the target, so changes we make there aren't really going to affect anything, as long as we don't, you know, introduce major bugs. But that's the project I'm working on now. It's really, really interesting. And I think it probably wouldn't be possible, actually, if I didn't know Rust and wasn't using Rust at this company.

Yeah, I mean, I can imagine. Also, at some point a project like yours gets to such a size that, given you're the only developer, I would even say Rust is one of the only languages which can handle that without you going crazy, because...
like one hidden superpower of Rust, I find, is the fearless refactoring, and I imagine you did your fair share of that. From my experience, in most languages outside of Rust, if you had to do such a refactor as a single person, you might not even want to do it, because you might be breaking things here and there, even in what they call memory-safe languages like Go or Python. I mean, I wouldn't dare to do such refactors at such a scale, but in Rust it is just possible.

Mm-hmm, totally. I mean, I have such confidence in Rust's compiler that when I make these big refactors, I won't even run my code until the end. I won't run the tests until the end. I've made 2,000-line PR changes and not even bothered to run the tests until I'm merging into main. And it's just because Rust gives me such confidence: there's memory safety, the type system, and the abstractions I've set up, and it all just lets me make changes and know that they're going to work. I think also, you know, working on the same code base for five years kind of gives you some confidence there as well. But I would not want to be working in any other language without a similar sort of type system and memory safety for any major refactoring.

Yeah, I mean, the type system is pretty rich, so it's a very pleasant experience. The only other language I've ever had that in is something Haskell-based like Elm. I'm not sure if you've ever experienced that, but that was a very nice experience too. It didn't really turn out to be a language that ever caught on, but yeah, I had some fun with it in the past. So, what I do realize is we talked about TCP/IP fingerprinting, but we didn't really explain what it is. So I will explain a bit what I think you mean by it, and then you can tell me if that's what you also mean by it. It is the fact that...
Well, fingerprinting is always about the fact that any implementation of a protocol has its specifics in how it implements it, because the specification doesn't specify every exact detail of how to do it. That might mean there is ambiguity about what you have to do, or maybe there are seven steps and you can do them in any order, or it may be in terms of what defaults or what values you use. So that's everything we in general mean by fingerprinting: the fact that you can track, as a server, what the client has established, or the other way around, the client could fingerprint what the server sends in any layer of the network stack. And then when we talk about TCP/IP fingerprinting, for the IP part I would assume you just mean the IP address, the fact that there you are in control of which IP address the server sees, so they might see, I don't know, a San Francisco-based IP address instead of one from Germany, as an example. But then in the case of TCP, you open a TCP socket on your side. For example, when they do an HTTP CONNECT, you have to open a TCP socket on your side to the server and attach certain options to that TCP socket, and I assume those are the kind of options you want to modify based on whether you want to be a Windows-based OS or a Linux-based one or any of those.

Yeah, that's nearly exactly it. I'll just clarify a few things. So the reason it matters in the first place is: if I send a GET request to Google, my user agent will say I'm using Linux and Firefox. But if I'm going through a proxy, the proxy will connect with its TCP stack, and Google could fingerprint that incoming connection based off the SYN that happens and detect what operating system I'm actually using.
And there's this sort of mismatch between the declared operating system in the user agent and the detected one from the TCP SYN packet that will let them say, okay, he says he's Linux and Firefox but he's actually on a Windows machine, there's likely some sort of shenanigans going on. And so TCP fingerprinting is a way to change the values within the first packet of the TCP handshake to appear as any operating system, and it works on both the TCP and IP layers. With the IP layer it's quite minimal; I'd say maybe 20% of a fingerprint score comes from the IP layer. But things like whether the Don't Fragment flag is set but then there are still fragment IDs, or things like that, can be unique quirks of an operating system that make it identifiable; maybe only Windows does that. And then the majority of the score does come from the TCP layer. That's things like the MSS, which can tell you if you're on a mobile device or a laptop or a desktop, because mobile devices typically have a much lower MSS because of the overhead of cellular communication. And then a lot of the time it's just the options that are in the TCP segment. Mainly it comes down to the options that are there and the order in which they're presented. Because, as you said, these code bases are old, no one's rewriting a TCP stack from scratch, and there's no definitive way to do it. So all the different operating systems approach it differently. Honestly, you could probably identify these operating systems by eye, just looking at the options.
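Reading the options "by eye" amounts to walking the options area of the SYN and noting which option kinds appear and in what order. The sketch below does that for the standard kinds (0 end-of-options, 1 no-op, 2 MSS, 3 window scale, 4 SACK permitted, 8 timestamps). The function names `parse_options`, `layout`, and `mss` are hypothetical, and the short labels (`mss`, `sok`, `ts`, and so on) are just a compact notation chosen for this example.

```rust
/// Split a TCP options area into (kind, data) pairs in wire order.
/// Strictly, kind 0 ends the list; we keep walking so that trailing
/// padding bytes stay visible, since they are part of the fingerprint.
fn parse_options(mut opts: &[u8]) -> Vec<(u8, &[u8])> {
    let mut parsed = Vec::new();
    while let Some(&kind) = opts.first() {
        let len = match kind {
            0 | 1 => 1, // end-of-options and no-op are single bytes
            _ => match opts.get(1) {
                // every other option is: kind, total length, data
                Some(&l) if l >= 2 && usize::from(l) <= opts.len() => usize::from(l),
                _ => break, // truncated or malformed option: stop parsing
            },
        };
        let data_start = len.min(2);
        parsed.push((kind, &opts[data_start..len]));
        opts = &opts[len..];
    }
    parsed
}

/// Render the order of option kinds as a comma-separated string.
fn layout(parsed: &[(u8, &[u8])]) -> String {
    parsed
        .iter()
        .map(|&(kind, _)| match kind {
            0 => "eol",
            1 => "nop",
            2 => "mss",
            3 => "ws",
            4 => "sok",
            8 => "ts",
            _ => "?",
        })
        .collect::<Vec<_>>()
        .join(",")
}

/// Extract the advertised maximum segment size, if present.
fn mss(parsed: &[(u8, &[u8])]) -> Option<u16> {
    parsed.iter().find_map(|&(kind, data)| match (kind, data) {
        (2, &[hi, lo]) => Some(u16::from_be_bytes([hi, lo])),
        _ => None,
    })
}
```

Fed the 20 option bytes of a SYN that carries MSS, SACK permitted, timestamps, a no-op, and window scale in that order, `layout` yields `mss,sok,ts,nop,ws`. A detector would compare that string and the `mss` value against known per-OS patterns and against the operating system claimed in the User-Agent header.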
Apple, for example: their TCP options always end with two end-of-options bytes, whereas none of the other operating systems even send end-of-options bytes. Or there are things like the TCP options needing to be padded to four bytes, and the ways in which operating systems do that padding can tell you what operating system is connecting to you. And so the way we're trying to work around this is to use a user-space TCP stack that we have full control over to form these connections. On every connection we can determine what operating system we want to look like and set the correct options and the correct MSS. And then we should, theoretically, on paper, from my prototyping, look like any operating system we want to be.

Okay, I mean, that's very interesting. I'm surprised you went that far. I was probably wrongly assuming that you could get away with simply setting the correct socket options, but I suppose that doesn't translate exactly to the actual TCP traffic on the network then.

Yeah, because the socket options are there to help you control the behavior of the TCP connection. But the things we actually care about in a fingerprint don't really affect behavior for the most part. The ordering of the options in the headers, for example, doesn't affect behavior, so there's no ability to control that. So we really need to go as low level as possible. We're using smoltcp, which is a Rust package. It's a Rust user-space networking stack and it's brilliant. It's missing a few things like selective acknowledgements and timestamping, which I'm working on PRs to introduce, but it should, again, on paper, from my prototyping, work basically perfectly. And the most complicated part will actually be introducing this user-space TCP stack to all of our servers and how we do that kernel bypass to use our own TCP stack.
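The reason a user-space stack is needed becomes clearer when the SYN options are written out by hand: the order of the options and the padding byte are free choices that ordinary socket options do not expose. The following sketch serializes a caller-chosen option order and padding style. It does not use smoltcp's API; the types `TcpOpt` and `Padding` and the function `encode_options` are hypothetical names for illustration only.

```rust
/// The SYN options this sketch knows how to emit.
#[derive(Clone, Copy)]
enum TcpOpt {
    Mss(u16),
    WindowScale(u8),
    SackPermitted,
    Timestamps { val: u32, echo: u32 },
    Nop,
}

/// Which byte fills the options area up to a multiple of four bytes.
#[derive(Clone, Copy)]
enum Padding {
    Nop, // kind 1
    Eol, // kind 0, end of option list
}

/// Serialize options exactly in the order given, then pad.
/// Returns `None` if the result exceeds the 40 bytes a TCP header can hold.
fn encode_options(opts: &[TcpOpt], padding: Padding) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for opt in opts {
        match *opt {
            TcpOpt::Mss(v) => {
                out.extend_from_slice(&[2, 4]);
                out.extend_from_slice(&v.to_be_bytes());
            }
            TcpOpt::WindowScale(shift) => out.extend_from_slice(&[3, 3, shift]),
            TcpOpt::SackPermitted => out.extend_from_slice(&[4, 2]),
            TcpOpt::Timestamps { val, echo } => {
                out.extend_from_slice(&[8, 10]);
                out.extend_from_slice(&val.to_be_bytes());
                out.extend_from_slice(&echo.to_be_bytes());
            }
            TcpOpt::Nop => out.push(1),
        }
    }
    let pad_byte = match padding {
        Padding::Nop => 1,
        Padding::Eol => 0,
    };
    while out.len() % 4 != 0 {
        out.push(pad_byte);
    }
    if out.len() <= 40 { Some(out) } else { None }
}
```

A per-connection "profile" is then just an ordered list of `TcpOpt` values plus a `Padding` choice; for example, a 22-byte option list padded with `Padding::Eol` ends in the two end-of-options bytes described above.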
Yeah, I mean, I guess at that point, I suppose you're on Linux, which means you probably just hook it into something wired together with eBPF.

Yeah, we were looking at XDP, eXpress Data Path, which is, I think, eBPF related. I've not looked too much into that, actually. But the other option that I've tried and am still trying is DPDK, the Data Plane Development Kit, which is what companies at scale that need faster networking typically use. I think I read a paper by ByteDance about how they use DPDK and a user-space networking stack to power TikTok. So DPDK is what I'm thinking will work best for us.

Very interesting, yeah. Please do share the article with me, I will put it in the footnotes. Now, yeah, I mean, given that it will be in user space, does that mean that you might see some regression in performance compared to the default kernel-based TCP sockets?

Mm-hmm. Okay. If we bypass the kernel correctly, say with DPDK, we'd actually expect a performance increase. The kernel is right now for us the limiting factor in how many connections we can do per second. So bypassing the kernel will actually speed things up. The only issue is if the stack we're using is itself slow. And right now it is. smoltcp doesn't do selective acknowledgements, which is one of the most important extensions to TCP for high performance. So it's a big PR. I'm working on that, and I'm thinking that once that's introduced, even if smoltcp is slower, the kernel bypass will give us enough performance that we are at baseline at the very least.

Okay, very cool. And you mentioned smoltcp, but as far as I know that's only for TCP. Do you not have similar issues for UDP, or is fingerprinting on that protocol less common?

Fingerprinting on UDP, as far as I'm aware, isn't really possible. There's not the variety of headers with UDP that would allow it to be fingerprinted.
If fingerprinting were to happen, I think it would be on QUIC. And it's something I want to look into in my free time, to see if it is possible, to see if I can get a prototype working. But we wouldn't have to worry about that as a proxy company, because when we forward someone's QUIC packets, it's whatever fingerprint they're using on their machine. It's transparent; we wouldn't need to fingerprint it, or emulate it, sorry.

Okay. Yeah, I suppose that makes sense. Very interesting, I wasn't aware of that. And is that something that was requested by users? Because, I mean, these days it's not that uncommon to jump around between load balancers and this and that. And so the only way for, let's say, a security company to actually fingerprint you is if they are on the edge and are actually the one receiving the exact connection from the client. Is that something that happens that much?

It really depends on the site you're trying to access. The majority of sites don't really care if bots are seeing them or not. Other sites really do care. Perhaps they're selling products or they've got data that they want to protect. It depends on the site, and in that case, if they really care, they'll use something like Akamai or DataDome, or I think there's a company called Castle that will do things like that. Fingerprinting isn't an on-off signal of whether or not someone's a bot or a proxy, but it's a signal, probably quite a strong signal, that these companies combine with a bunch of other signals to then determine and classify what type of traffic you are. So anything that can help reduce one signal when someone's trying to profile you is a good thing. And since this isn't something that a customer can control, it's something that we really need to do ourselves, otherwise they'll never be able to, you know, bypass those protections.

Yeah, yeah, I mean, for sure.
And another signal that they use, and which I'm not sure you have an answer to, is timings, where they just know based on the timings and some heuristics that someone is going via proxies. Again, that's definitely not a single source of truth for knowing if it's a bot, but it definitely adds to the score. Is that something you can do much about, or is it just, yeah.

That's just something they will always have. It's not really something we can do too much about. The best we can do there is just to be as fast as possible, to route connections optimally, to use fast nodes, and just hope that the time it takes to actually reach the target isn't so long that something like that can be used. I mean, in comparison to many other companies in the industry, we're considerably faster because we've put a lot of time into how we optimize and choose routing. But that isn't really one that could be worked around easily.

Yeah, I can imagine it's quite difficult. When I was still working a lot with proxy companies, I don't think Ping Proxies existed, which seems to align with the timeline you mentioned at the start. But there are quite a couple of big ones, and there are of course also smaller ones. How easy is it to even still start a proxy company? Because these days there are a couple of players which are so big that I would imagine they can also invest in all the technology you might want to invest in, such as the SDKs for residential proxies, going quite far there, and the entire optimal routing. How do you even compete against those?

Yeah, it's really difficult. If we were to try and start a proxy company now, I think it would fail. It wouldn't matter how many features we had. In the end, these companies are now so huge that their marketing budgets can just destroy you on SEO. And right now as well, it's a very crowded market. I think there are 250 different proxy companies.
I think we could consider ourselves in the top 10. We've had some reviews and things that kind of put us up there. It's a difficult market to compete in, and we're kind of fortunate that a lot of these companies, as I think happens with many companies in other industries, reach a certain size and become almost complacent, and they stop pushing out new features and things like that. Maybe they focus on other products, but they're not focused so much on adding the next protocol to their proxies to let their customers do what they want. So that's really where we're trying to compete: pumping out features that aren't found in the industry, to try and show that we are better than these companies technologically and to attract the customers that want these features but can't find them yet.

Okay, very interesting. And then one tricky issue that companies at your scale sometimes run into, which Rust does not have a definite answer to yet, and where it's not always clear how to fully prevent it other than just making others aware of the kind of issue, is everything around cancellation safety. Meaning you might have a future which has been polled into a certain state but is then cancelled, and that might not be entirely safe to do. If it's a simple future it's very easy to prevent, but once you get into complicated enough stacks, you might be in a certain state machine where you are halfway through the future and then it got dropped, or maybe it never got dropped but it just doesn't get polled anymore. Are those kinds of issues something you ever ran into in your time at Ping Proxies?

Unfortunately, not really. I'm always very careful when doing anything to read the docs and make sure I understand what the safety is regarding cancellation. But I think something that helps us is that we took a structured concurrency approach. We never spawn tasks.
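The cancellation-safety problem described in the question can be reproduced with the standard library alone. In the sketch below, `take_pair` removes an item from a queue, suspends at an `.await`, and is then dropped before it is polled again, so the item it already removed is lost. Everything here is illustrative: `YieldOnce`, `take_pair`, `NoopWake`, and `cancel_midway` are hypothetical names, and the manual `poll` call stands in for what an async runtime would do.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns `Pending` once, standing in for any `.await` that waits on I/O.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// NOT cancellation safe: while suspended at the await point, `first`
/// exists only inside this future's state and is no longer in the queue.
async fn take_pair(queue: &RefCell<VecDeque<u8>>) -> (u8, u8) {
    let first = queue.borrow_mut().pop_front().expect("first item");
    YieldOnce(false).await;
    let second = queue.borrow_mut().pop_front().expect("second item");
    (first, second)
}

/// A waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll `take_pair` once, drop it, and report how many items remain queued.
fn cancel_midway(items: &[u8]) -> usize {
    let queue = RefCell::new(items.iter().copied().collect::<VecDeque<u8>>());
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(take_pair(&queue));
    // First poll runs up to the await point and suspends there.
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    // Cancellation is just a drop: the popped `first` item goes with it.
    drop(fut);
    let remaining = queue.borrow().len();
    remaining
}
```

Starting from three items, `cancel_midway` leaves two in the queue and no caller ever receives the first one. Knowing which parent owns (and may drop) each future, as in the tree-shaped task structure described in the answer, is what makes it tractable to reason about where such a drop can occur.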
Every task has a clear owner, and the subtasks kind of form a tree. And that means that if we cancel something at the top, we know exactly what else is going to be cancelled. And that makes it easier to understand when it's safe to cancel something.

Yeah, that makes sense, that's indeed the kind of situation where you wouldn't run into it too much. Where you do run into it, it's often because you're mixing concurrency and parallelism. If you're always just using tasks, you don't run into it, but it's usually where you have, I don't know, nested branches where you might be polling a future from a previous branch and then get into something which looks a lot like a deadlock. But it's a lot harder to debug, because it's not a deadlock between threads. If you're just doing regular parallelism, there are plenty of Unix tools with which you can deal with that, but these things can get pretty tricky, especially as they usually don't happen often. They're edge-case issues, and if they do happen, then yeah, you can spend quite some time figuring out why this thing hangs. I mean, I'm fortunate, I never ran into it myself, but I had a couple of customers who ran into it with some other stacks, and yeah, that was a fun one to find, for sure.

Yeah, I think as well the nesting of futures can make that more complicated, and I wouldn't say it's the main reason, but it's one of the reasons we stayed away from hyper and, you know, Tower. A lot of the time it seems like you are nesting futures within futures within futures. And one of the issues there is that if you have these sorts of issues, you don't really know what layer it's happening in and where these things are going wrong. We deliberately took a very flat approach to how we do things, where we don't really nest futures.
We kind of execute them one after the other. So we don't often do intra-task concurrency, where one task is awaiting multiple futures or there are any nested futures.

Yeah, yeah, makes sense. Okay, very cool. So I think we've covered a lot already. Is there something else that you think we should cover before we start to wrap up?

I don't think so. No, I'm happy.

Well, if you're happy, I am very happy as well. I'm very thankful that you had some time to come on the podcast today, because the work you do is quite interesting, and it seems that you did a pretty good job, because Ping Proxies grew from basically nowhere to, like you said, somewhere in the top 10 of proxy companies. So that's a pretty remarkable achievement, I would say. Congratulations on that. Okay, thank you very much for coming on, and I'll talk to you another time.

Thank you for having me. I've really enjoyed this. I'll speak to you later.

Elizabeth (Plabayo)
1:03:56
Netstack.fm is brought to you by Plabayo, building secure, open, and resilient infrastructure with Rust, protocols, and purpose. This show is also made possible by Rama, the open source networking framework. Plabayo offers service contracts and welcomes sponsorships to keep building and supporting its ecosystem. The theme music of this podcast was composed by DJ Mailbox. If you enjoyed this episode, don't forget to subscribe on your favorite podcast platform and leave a five-star review. It really helps others discover the show. Thanks for tuning in. We'll see you next time for the next handshake.