episode 19 — Firezone and Zero-Trust Network Security with Thomas Eizinger.
In this episode of Netstack.fm, Glen talks with Thomas Eizinger from Firezone about designing a zero trust enterprise VPN built on top of WireGuard. They break down how modern VPNs work in practice, covering virtual network adapters, split tunneling, DNS interception, policy based access, and secure packet routing using WireGuard, ICE, and TURN relays.
The discussion highlights how Firezone differs from legacy VPNs by focusing on performance, reliability, and minimal user friction, while also touching on the role of Rust and Elixir in Firezone’s architecture and the long term importance of IPv6 adoption.
If you like this podcast you might also like our modular network framework in Rust: https://ramaproxy.org
00:00 Intro
00:42 Introduction to Thomas Eizinger
05:19 Firezone's TURN implementation
11:00 Understanding VPNs and Firezone's Approach
29:27 Legacy VPNs vs. Firezone: A New Era of Networking
36:19 Firezone is open source
37:27 Zero-Trust VPNs
40:28 What is WireGuard
43:36 Firezone's Integration with WireGuard
50:19 Handling Connection Failures
58:00 Geolocation and Relay Selection
01:04:45 Elixir Developer Experience (DX)
01:10:19 IPv6 Adoption and Future Considerations
01:15:03 Outro
Music for this episode was composed by Dj Mailbox. Listen to his music at https://on.soundcloud.com/4MRyPSNj8FZoVGpytj
Elizabeth (Plabayo)
0:13
This is netstack.fm, your weekly podcast about networking, Rust, and everything in between. You are listening to episode 19, recorded on December 17th, 2025, where Glen has a conversation with Thomas Eizinger about a zero trust VPN built on top of WireGuard at OSI layer three.
Welcome to another week of netstack.fm. Today with me is Thomas. He works at a startup called Firezone, where they are building a zero trust VPN on top of WireGuard. There are a lot of goodies to unpack here, and I'm super excited, as we haven't touched many of these topics yet and I believe Thomas is the perfect guest to help us understand it all. So welcome, Thomas.
Thank you very much. Super excited to be here.
And you are joining us all the way from beautiful Australia. Does that mean you maybe attended the Rustforge conference in New Zealand earlier this year?
That's correct, I'm in Australia. Unfortunately, I couldn't go to Rustforge because I had to look after one of my other hobbies, which is not writing software but sailing. There was a regatta that overlapped with the conference. I had already signed up for it and then saw that the conference was at the same time, which was a bummer because I would have really liked to go.
Super cool. I like sailing myself as well. I don't have a boat, but I sometimes used to join a crew in the past. Does that mean you could sail all the way to Rustforge? Or would that be difficult?
That would be quite difficult, I think. The Tasman Sea, which is the part of the ocean between Australia and New Zealand, is quite gnarly because it's exposed to all these winds from the south, with no other landmass in between. So you get what we call southerly busters, which can be winds of 40 to 50 knots, and it can be very unpleasant to be out there on the ocean.
Okay, fair enough.
So as tradition goes, we can perhaps first go over your background a bit, because the work you're doing now is very exciting. But what always interests me as well is how you got to where you are now. Could you tell us a bit about that?
Yeah, of course. I grew up in Austria, actually, and studied computer science in Vienna. I kind of got into computers because my dad bought my brother and me one of these Lego Mindstorms sets. We started programming it with a visual programming language, and I was fascinated that you could build something and then make it do different things. So I decided to go into computer science, and after uni I wanted to travel a bit, so I traveled to Australia. I didn't actually have any plans of moving here, but I met up with an old colleague who, I think a month later or so, rang me up and said: hey mate, I've got this job offer here, if you want you can move over to Australia and together we can do some research on blockchains. That's what got me into the networking space and also into Rust. That's the job I learned Rust for, or on the job, I guess. From there, I became a maintainer of rust-libp2p. This is where I met Max, who was also a previous guest on the show. From rust-libp2p I got even further into networking and was eventually contracted by Firezone, initially to write a TURN server. Then, bit by bit, as I got more involved, I got into more parts of the clients that make up Firezone and more into layer 3 networking, DNS, WireGuard, all these protocols.
Okay, very cool. It's amazing how many people in the industry got their deep networking and cryptography knowledge because of blockchain and peer-to-peer, decentralized, web3 companies. There was a lot of...
movement there. Even for me: in the past I was already doing a lot of networking for the game industry and related industries, and I was always interested in security and cryptography, but it was only when things like Bitcoin started to exist, with all the hype around that, that there were so many opportunities to learn a lot. It seems it was similar for you. I also did a couple of things around libp2p, and that's also where I know Max from. By the way, we had episode 16 with Martin, the maintainer of a WebRTC crate, and they also have a TURN server, but I think you have your own implementation, right?
Yes. We decided to write our own TURN server because it's quite fundamental to the product working. I'm sure we're going to go over this in a bit, but one of the fundamentals of Firezone is that you don't need to open any ports. We try to hole punch a direct connection between the exit node of the VPN and your client. If that doesn't work, we need to fall back to a relay, which is the TURN server. Having that work very well is really important in order to provide a good user experience. We consider it mission critical for our product, and therefore we wanted to have full control over its design. It actually turned out to be a very interesting development, because it's not at all what I wrote initially. I wrote the first implementation and it has changed a lot. I'm not sure we would have been able to take another implementation and shape it into what we want as easily if we hadn't had full control over it.
Yeah, I understand that. Together with my company, I also develop and maintain Rama, a network framework, and...
maybe it's also a bit because of my background in C++ in the game industry, but I feel that if you own these components, these different modules, you can continue to shape them as your business needs grow and as the needs of your customers and users grow. If you work with dependencies, first of all you need to be a bit lucky to get alignment, so that the maintainers want the change too. Secondly, you still need to be able to get it upstreamed and patched. And even if you are on the lucky side, it usually means you get some form of what you wanted, but you still need to work around certain awkward abstractions. So given how critical the TURN server is, as you describe, I totally understand that Firezone went for its own implementation. Now, on the flip side of that coin, would it be possible for other users outside Firezone to make use of it, or is it very specific to Firezone?
At the end of the day, it is an implementation of the TURN specification, so somebody else would be able to make use of it. There are a few Firezone-specific things in there, for example how we handle authentication of clients. The TURN server needs some kind of mechanism to control who is actually allowed to use it, because it needs to be publicly accessible on the internet, otherwise it can't really function as a relay. So you need authentication to make sure that only, in our case, our own clients and exit nodes can access it. If you wanted to use it for your own software, you'd have to rip out the authentication code and replace it with whatever you want to do. There are a few other bits: we are TURN compliant in the sense that what we implement is as per the spec, but there are missing bits where we don't care about the functionality.
So we didn't implement it, just to make it simpler and easier to work with.
Okay, and are there industry-accepted test suites that you can use to test such an implementation? I know that for things like WebSocket, HTTP/2 and similar there are test suites that people use, and the same goes for browser APIs. Is there something like that for TURN implementations?
I'm not aware of one. I would love to know if there is. What we ended up doing is that we had some big integration tests initially. Over time those became very annoying to work with, because in order to test the TURN server you need a TURN client, right? You need the other side that follows the same protocol. Obviously our software is a TURN client, otherwise it wouldn't be able to use the server. So what we're doing now is we have a pretty big integration test suite that tests the entire thing without doing any IO itself. It's all just passing byte buffers back and forth. All of Firezone is written in a sans-IO way. You talked about that with Martin from str0m, right? We took inspiration from that, and it allows us to test clients, relays, and gateways in one setting with a simulated network, without actually having to send anything on the wire. After we introduced that, we realized we don't need the integration tests that check whether the TURN server is compliant with the specification anymore, because what we really care about is that it works with our clients and gateways.
I understand, that's very smart. I think we're almost ready to start doing some really deep technical dives and learn a lot about your stack and the different protocols involved. But before that: Firezone describes itself as replacing legacy VPNs, so there is a lot to unpack there. First of all, what is a VPN?
VPN is short for virtual private network.
I guess when most people hear VPN, they think of what I would call a consumer VPN, which is used to hide your IP address or access blocked content or something like that. Essentially you install some kind of client software on your computer, and then for any remote thing that you access, it looks like you're actually in Singapore instead of Australia, for example. Firezone is more of an enterprise VPN, where we don't actually run the exit nodes of the VPN. We just provide the client and exit node software, and we run the control plane that connects the two with each other. Our customers run these gateways in their own data centers and thereby allow their workforce to access protected resources. So it's quite a difference in terms of what you use it for and how it's designed under the hood.
Okay, and is it comparable to, let's say, WARP from Cloudflare? We had them on in episode 15 and they talked a bit about it as well.
Yes, it sort of serves the same purpose. I think there's an enterprise version of WARP. But I think WARP can also be used as a consumer VPN, if I'm not mistaken. I should actually know this, I realize.
Okay. Not everybody is familiar with this, and what we also aim for with this podcast is to educate each other and the listeners. Could you describe, in your own words, how enterprise VPNs like yours work? How does it operate on the end user's device, and what happens all the way to the server? We could try to describe the life of a packet: say I'm an employee of a corporation and I need to access a web platform that should only be accessible by employees like me.
So how would that work, from the employee's device through the different relays and network layers, all the way to finally ending up at the server?
There's a lot to unpack there, so I'm going to try to do this in a structured way. One of the first things we need to know is that, at least for Firezone, the way we actually get the packets is by creating a virtual network adapter. When you sign into Firezone, the software is booted up, and you check your network settings, for example on Linux with the ip command, you will see a new network device right next to your Wi-Fi or Ethernet connection. This network device has an IPv4 and an IPv6 address. And the same as all the other network devices, it has routes attached. Depending on the resources that are configured in the backend, this network device might have a route to, for example, capture all traffic that goes to 10.0.0.0/24, which is a common private subnet. When anything on your system then sends a packet into this address space, the kernel asks: where does this packet need to go? If it falls within the routes of the Firezone network adapter, the kernel sends the packet into this adapter. Because it's a virtual network device, there is no hardware attached to it that can handle this packet. So what happens is that within Firezone there is essentially an event loop, and we read from a file descriptor. The content of this file descriptor is the IP packet that the operating system crafted for us. We can then do whatever we want with this IP packet. In the case of a VPN, because it's meant to be a virtual private network, one of the first things you usually do is encrypt the packet, because nobody should actually see what the traffic is.
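The route matching described above can be sketched in a few lines. This is only an illustration of longest-prefix matching, not Firezone's code or the kernel's actual routing implementation; the interface names `wlan0` and `tun-firezone` and the contents of the table are made up for the example.

```python
import ipaddress

# Hypothetical routing table: (destination network, interface name).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "wlan0"),            # default route via Wi-Fi
    (ipaddress.ip_network("10.0.0.0/24"), "tun-firezone"),   # route added by the VPN client
]


def pick_interface(destination: str) -> str:
    """Return the interface of the most specific route that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, iface) for net, iface in ROUTES if addr in net]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # Longest-prefix match: the route with the largest prefix length wins.
    _, iface = max(matches, key=lambda match: match[0].prefixlen)
    return iface
```

With this table, a packet to `10.0.0.7` is handed to the virtual adapter, while a packet to any other address leaves through the regular interface. That is the split tunnel behaviour: only traffic for the configured subnet ever reaches the VPN client.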
So you encrypt the packet, then you wrap it in a UDP packet, and then you can send this UDP packet wherever you want. In our case, you send it to an exit node, and the exit node unwraps it and does the same thing in reverse. It receives the IP packet, it also has a virtual network adapter, and it sends that IP packet into this adapter, which passes it back to the kernel. The kernel then looks at its routing rules again to see where this packet needs to go, and it performs things like masquerading to actually send it out to the network where it needs to go. So that's the high level overview.
Yeah, I think that's a good start. Before we go too deep, let's zoom in on some bits there. First of all, I like it a lot. It's quite simple when you explain it at this high level, so that's a very nice explanation. What I do miss is this: I understand from your explanation that the reason some of the traffic goes through the VPN is that it's addressed to this private subnet. What I don't understand yet is, if I as an employee enter, let's say, corporation.com, how does that translate to one of these private IP addresses?
Very good question, and the different VPNs handle this differently. The way we do it is that we install ourselves as a stub resolver on the system. Firezone boots up and we set an IP address as the stub resolver that is also covered by a route going into the network adapter, which means that as soon as anything on the system tries to resolve a domain name, we receive that DNS query. Then we look at the DNS query and decide: is this a DNS query that we want to intercept, or do we just want to pass it on to an actual DNS resolver and return the response?
If we want to intercept the DNS query, for example, let's say company.com is a protected resource and we get a DNS query for the A records of company.com: instead of resolving the real records, we allocate a set of proxy IPs. In our case, we use the CGNAT range, and we return those proxy IPs to the application that made the DNS query. If it's a UDP DNS query, we just send back a UDP packet with those IPs in it. And then, and this is the clever part, the application knows: okay, company.com resolves to 100.96.0.1. It will then start, for example, opening a TCP connection to that IP. Because we also installed routes for this big CGNAT range, we receive the TCP traffic that is meant to go to company.com and can do with it whatever we want. We can route it to the gateway, and the gateway then needs to resolve the actual IP of company.com and basically perform the role of a network address translation device to send it to the real IP.
Okay, and does that mean that you receive all the DNS queries and have to pass them on? And can these things stack? Let's say there are multiple systems on the machine, all with their own stub resolver. Can they stack, or do they just override each other?
What we found is that this doesn't play very well together with other systems. How you resolve DNS on a system is almost like a global resource where there can only be one way. You can have multiple DNS servers, but they're usually used as fallbacks. You contact the first DNS server, and if that one gives you a response, this is what you use. It's the same with VPNs: only one can really route your traffic. So you've got to decide which one you want to use and then only have that one active.
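The interception and proxy IP idea from this answer can be sketched as follows. This is a simplified model and not Firezone's implementation: the class and method names are hypothetical, it hands out one proxy IP per domain rather than a set, it matches domains exactly, and it uses the whole shared address space `100.64.0.0/10` as the pool, which is an assumption made for the example.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")  # shared address space (RFC 6598)


class ProxyIpAllocator:
    """Hypothetical stub-resolver logic: protected domains get a proxy IP."""

    def __init__(self, protected_domains, pool=CGNAT):
        self.protected = set(protected_domains)
        self._free = pool.hosts()   # lazy iterator over usable addresses in the pool
        self.by_domain = {}         # domain -> proxy IP, so repeated queries get the same answer
        self.by_proxy_ip = {}       # proxy IP -> domain, what a gateway would need for NAT

    def resolve(self, domain):
        """Return a proxy IP for a protected domain, or None to forward the query upstream."""
        if domain not in self.protected:
            return None
        if domain not in self.by_domain:
            ip = next(self._free)
            self.by_domain[domain] = ip
            self.by_proxy_ip[ip] = domain
        return self.by_domain[domain]

    def domain_for(self, proxy_ip):
        """Reverse lookup: which protected domain does this proxy IP stand for?"""
        return self.by_proxy_ip.get(ipaddress.ip_address(proxy_ip))
```

The reverse mapping is the part that makes the scheme work end to end: traffic arriving for a proxy IP can be traced back to the domain it stands for, so the gateway can resolve the real address and translate the destination.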
Okay, there are two things there. First, how do you actually register yourself as the resolver? Is it through the OS SDKs, where you say you want to receive all the DNS traffic? Which system calls and protocols are used there?
This is where the fun of cross-platform development begins. On my desk here, I actually have three machines sitting, because you can't really emulate this stuff very well. If you want to develop for Windows, you've got to have a real Windows machine, and the same goes for Mac and Linux. It's entirely different on the different operating systems. On Linux, we rely on systemd-resolved, which is a daemon that is part of the systemd suite. That one already runs a stub resolver on the system on the IP 127.0.0.53, and you can tell systemd-resolved what the DNS server should be. So basically we tell this daemon that we want to be the DNS server. It then receives the DNS queries and forwards them to us, and we forward them to, for example, the DNS server given by DHCP if we don't actually want to handle the query. On Apple platforms, it's very different. In order to be a VPN there, and if you want to be in the App Store as well, you need to use the Network Extension framework that Apple provides. It has various callbacks where you set the configuration of your virtual network adapter, and one of the fields of this configuration is the DNS servers that you would like to use. On Windows, we set the DNS servers through the registry, I believe. I would have to double check. On Android, it's similar to what we do on macOS, and iOS is the same as macOS.
I'm very surprised that you say Android is similar to macOS. I would think it's pretty much Linux, no? Or am I wrong there?
It's the same in the sense that it's completely managed for you.
You need to implement a certain interface to be a VPN, and that gives you the ability to set a certain configuration. Android is pretty locked down in terms of what you can do. Whilst it is Linux underneath, or I guess a very heavily modified version of it, unless you have root you have to play by the rules of whatever interfaces you're given. That's what I meant by it being very similar to macOS. You have to stay within the APIs that are given to you.
Okay, very cool. So I get the DNS aspect. We now understand that you work as a stub resolver, that there can only be one, and that you just pass through the queries you don't care about. I get that this part doesn't play well with other systems. But let's say you somehow did have a stub resolver that handles all your different systems, so it makes sure that the domains you care about for each VPN go to the right address, and everything else goes elsewhere. What I understand so far about the other parts of the stack is that you work by registering a new virtual network device. So I don't understand yet why I couldn't have, let's say, three different VPNs that each have their own little network device and their own little private subnet. Given the knowledge I got from you so far, I would think that as long as my single stub resolver resolves the relevant domains into the right IP ranges for all three network adapters, that should just work, no? There should be no reason why I cannot have three of those, or I'm probably missing something.
That is correct. I guess we need to distinguish what kind of routing we're doing.
What I've just described is known as split tunnel routing, where you decide on the IP level which traffic goes through the VPN and which traffic just goes out via the main interface. There is also a configuration that is called full tunnel VPN. We call that the internet resource. It means you create a route that captures all traffic, so 0.0.0.0/0, the entire IPv4 address space. Once you start doing that, you run into very interesting configurations. For one, you need to make sure that you don't have any routing loops with your own software. Firezone, or the VPN in general, emits packets to be sent, and those should go out via the real network interface. But on most systems, if you have a route registered that captures all traffic and you don't take any precautions, you end up in a packet loop where the packets that you send come back in via the virtual network device. Once you get into these situations and you have more than one VPN, I think it's pretty non-deterministic what is going to happen.
And is that something you offer, or do you only do split tunneling, where you say you only care about this subnet and don't touch the rest?
We offer full tunnel configurations, yes.
Okay. I'm surprised that this still works at the IP level. I would have thought that you could tunnel in a way similar to what you were describing. For example, on macOS I know you have the Network Extension framework, which allows you to handle DNS queries. But as far as I know, you can also capture entire network interfaces and basically take them over. So I would think that at that level you can capture all the IP packets and wrap them without ever needing to worry about DNS.
Yeah, so when you're in a full tunnel configuration, obviously all the traffic to your DNS servers is also captured there, right?
Contacting a DNS server is also just sending another packet to an IP. So if you're capturing the entire address space of IPv4 and IPv6, you also receive the DNS traffic. If you don't filter for it, you will just wrap it and send it to your exit node, and happy days. The tricky part is grabbing the DNS traffic without grabbing all the traffic. Just because someone is signed in to Firezone and needs to access protected company resources doesn't mean that when they stream a YouTube video at lunchtime we should tunnel this traffic to the company's exit node. That doesn't really make any sense.
Yeah, that totally makes sense to me. In a bit I want to talk about the specific layer you operate at, what WireGuard is, what it offers you, and why you build on top of it. But before that, I want to take a step back. We have covered the VPN aspect: we now know at a high level what a VPN is, and I'm sure we left many details and nuances on the table, but we are at least starting to understand how Firezone works, which is very exciting. What I first want to understand is what legacy VPNs are, what they do differently from something like Firezone, and why it matters.
When we say legacy VPN, I guess we're mostly talking about things like OpenVPN. The main difference there is in fact the encryption algorithms and the wire format. WireGuard, which is what we're building on top of, is kind of two things. WireGuard is a packet format, and then you also have the kernel module, which allows you to create a similar device as we do with Firezone. The WireGuard cryptography is actually extremely simple. I've once heard it described as just good opinions about cryptography, all packaged into a protocol.
It uses the Noise IK handshake and then ChaCha20-Poly1305 for symmetric encryption. That's essentially where the performance gains come from, because all of those cryptographic algorithms are implemented partly in hardware or with easily optimized assembly code using SIMD instructions and the like. That allows you to very efficiently encrypt individual packets and send them out.
Okay, so that's on a technical level. But for a typical company, let's say a law firm or a little mom-and-pop shop, why do they care?
I guess the main question is whether what they're using is working for them. If you talk to people and ask them how much they love their VPN, most of them say: I don't like it, it's constantly buggy, it loses connectivity, and it's in my way. I want to do a certain piece of work, but the VPN is disconnecting or doing whatever it's doing, and it's standing between me and getting my work done. If whatever your little shop is using works for them, then I guess there is not much reason to switch. But in our experience, most people are very frustrated with the VPNs they use because they're either slow or unreliable. So the selling point is that we want to build one that is basically invisible. You shouldn't even notice that it's running on your system, because it performs extremely well, it's non-intrusive, and it just always works.
That totally makes sense. So in the ideal scenario, and maybe that's already the case today, it's just a little tray icon at the top that you don't really think about, and it's on pretty much 24/7.
Yeah, that's the beauty of a split tunnel approach. You don't really need to log in and out of your VPN, because it doesn't capture your entire traffic, right?
This is this new wave of VPNs that is also part of the idea of zero trust networking. If you think back to working remotely 10 years ago: you have your company laptop, and if you want to access anything, you first need to log into your VPN, and then your entire traffic gets tunneled through the company's servers. That means a lot of people think: okay, I now want to sign out because I don't need this anymore. Whereas the idea of a split tunnel VPN is that it's just running in the background, giving you additional capabilities like accessing protected resources, without interfering with your other traffic, like browsing YouTube or reading the news.
Yeah, I love it. It's the beauty and magic of networking, so well done. Now, for DNS: in episode 7 of netstack.fm we had a guest, Dirkjan Ochtman, who works on rustls but also on Hickory DNS. Let's Encrypt, for example, is adopting Hickory DNS, and they have a tracking issue with the open work. Do you also use Hickory DNS, or did you go for your own implementation there as well?
We use it again. We used it in an earlier version and then decided that we wanted to do DNS resolution ourselves, but we've brought it back in for a different component. Specifically, on the gateway: like I was saying, the traffic goes to these proxy IPs, and the gateway needs to translate which IPs this traffic actually needs to go to. That DNS resolution step is where we use Hickory, because it gives us the full DNS metadata, including the time to live, meaning how long a DNS record is valid, which is important for caching. We want to know how up to date a DNS record is. What we used to do was call libc's resolver, but that only gives you the result of the DNS query.
For instance, example.com resolves to these four IPs, without any metadata such as the TTL. So that's why we brought Hickory back in.
Okay. I also know that Firezone is on GitHub. Is all of your technology there open source, or are there also components that are closed source?
It's pretty much all open source. What we have closed source is our infrastructure, because it's not really anyone's business how we run our backend servers and so on. But the product itself is entirely open source, and there is a community that self-hosts Firezone as well. We don't really recommend it at this point because it's a lot of work. The backend is written in Elixir, so it's a distributed actor system, and we make very heavy use of various Postgres features. Operating that at scale is non-trivial, but if somebody wants to, they can go to the repository and set it up themselves. The code's all there.
Okay, very cool. Now that we know that, let's go back to where we were earlier. We already know Firezone is a VPN, that it's different from legacy VPNs, and a bit about how it works, but it also claims to be a zero trust VPN. What does that mean in this context?
The idea of zero trust networking is that, from a business point of view, joining a network doesn't actually give you access to anything that's critical. Zero trust means there is no trust between the network devices in a network. In a more legacy environment, you would have, let's say, an office Wi-Fi. As soon as you have the password or can sign into the office Wi-Fi, you have access to the printer or the server because you're in the same subnet.
There might be additional authentication to actually access a certain service, but you do have access on a network level. The idea of zero trust is that this is not the case: there's no implicit trust between the network devices. The way you achieve that is by firewalling all of these components off so that no inbound traffic is allowed. Then you deploy what we call a gateway, which is essentially the exit node, into these private subnets with your protected resources, like your web server or whatever you need to secure. The traffic then only gets routed to that server by establishing an on-the-fly tunnel. That tunnel is only authorized to be established if the client is installed on your machine and you are logged in with your user. When you try to access something, it gets checked against a policy in the backend to see: is this user actually allowed to access this website, this printer, or this server? Only if yes will it establish a connection between the two nodes by handshaking the necessary encryption keys, and only then can the two nodes talk to each other and the traffic gets funneled through.
Okay, very cool. Now it's time to talk a bit about WireGuard. It's a crucial part of this episode because, as far as I know, Firezone is built on top of WireGuard. As this is the first episode where we even mention WireGuard, I want to make sure we explain a bit what it is. So maybe in your own words, can you explain what it is and where it comes from?
WireGuard is, on the one hand, a packet format. It's quite simple, in fact. You have a couple of different message types, and the idea of WireGuard is that you wrap IP packets. Technically, you can put whatever you want inside a WireGuard packet; nothing necessarily restricts you to sending IP packets.
But the idea is that you put IP packets inside, and WireGuard itself then forms the payload of a UDP packet that you can send around. The specification is refreshingly simple. By design, as far as I know, it has basically no configuration options: there's no pluggable cryptography whatsoever. Everything is locked down, and this is the only way you can run WireGuard. That makes it quite easy to audit, right? There's one set of cryptography and it always uses that, so it's quite easy to say whether it is secure in my environment or not. On top of this packet format, you have different implementations, some in user space, some in kernel space. The kernel space implementation just creates a network interface for you and gives you a private key. Then you have to share the public key side of that with whatever other node you want to talk to. So you perform the key exchange out of band, and that needs to happen before you can actually establish a connection. For Firezone, we use a user space implementation of WireGuard, originally developed by Cloudflare, called boringtun. That one allows you to pick and choose more of what you would like to do. In our case, we perform the key exchange ourselves; that's part of the handshake when we create the initial connection. This gives us more control over which socket we send these packets out over, and it allows us to skip some of what WireGuard calls cryptokey routing, where it figures out which peer is associated with which key. There are multiple layers, and with the user space implementation you can choose more clearly what you would like to implement and what you leave off.
Okay, very cool.
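As an aside on the cryptokey routing idea mentioned above, here is a minimal Python sketch of the concept: a table that associates allowed IP ranges with peer public keys, used to pick a peer for outbound packets and to validate the inner source address of inbound ones. This is only an illustration under stated assumptions. The class name `CryptokeyTable`, its methods, and the placeholder key strings are hypothetical, and it is not code from WireGuard, boringtun, or Firezone (which, per the interview, skips parts of this layer).

```python
import ipaddress


class CryptokeyTable:
    """Toy table associating allowed IP ranges with peer public keys."""

    def __init__(self):
        self._entries = []  # list of (network, peer_public_key)

    def add_peer(self, public_key, allowed_ips):
        for cidr in allowed_ips:
            self._entries.append((ipaddress.ip_network(cidr), public_key))

    def peer_for_destination(self, dst_ip):
        """Outbound: choose the peer whose most specific range contains dst_ip."""
        addr = ipaddress.ip_address(dst_ip)
        matches = [
            (net, key)
            for net, key in self._entries
            if net.version == addr.version and addr in net
        ]
        if not matches:
            return None  # no peer is responsible for this address
        # Longest prefix wins when ranges overlap.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

    def accepts_source(self, public_key, src_ip):
        """Inbound: the decrypted packet's source IP must map back to the sender."""
        return self.peer_for_destination(src_ip) == public_key


# Placeholder keys; real public keys are 32-byte Curve25519 values.
table = CryptokeyTable()
table.add_peer("peer-A-pubkey", ["10.0.0.0/24"])
table.add_peer("peer-B-pubkey", ["10.0.0.128/25", "fd00::/64"])
```

With this table, a packet for `10.0.0.5` would go to peer A, while `10.0.0.200` falls into the more specific `/25` range and would go to peer B. A packet decrypted with peer A's session but carrying the inner source `10.0.0.200` would be rejected.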
I mean, I like WireGuard as well, and I think many of our listeners probably also use it, some directly, some as part of other software, similar to how Firezone makes use of it. Now, what does Firezone add on top of WireGuard? Which parts are Firezone, and which parts does WireGuard offer you out of the box that you piggyback on?
Yeah, so like I said, part of it is a packet format. But what WireGuard doesn't do is tell you how to send the packets, right? It's really just a state machine where you put the IP packet in, it encrypts it, and the encrypted blob comes back out. And then you're like, okay, what do I do with this? What Firezone adds on top is establishing the connection between the client and the exit node, which is a UDP connection, so to speak, and then sending those encrypted packets over this connection to the other side. That's what we call our data layer. On top of that, you have the policy layer that decides whether or not you're even allowed to access something, which would then internally trigger making such a connection. Maybe this is a good point to zoom in a bit more on what actually happens when we get a packet and how we establish the connection to send it. Because by default, if you try to send a UDP packet across the internet, it's going to bounce off the other person's firewall, because there are no open ports. So what we do is a process called hole punching, which has sometimes been described as the greatest hack of the internet. What we need to know for that is that because the IPv4 address space is small, we have routers and network address translation devices in between, right?
From your private home network, you have a certain address space, and then you have a public IP. The router in between translates packets from the outside world to the inside world and back. That means if I have my laptop in my home network and I want to access something that is in a subnet somewhere in the cloud, but the inbound traffic is firewalled off, then I need to somehow establish a connection even though both sides cannot talk directly to each other. If my laptop tries to contact the server, the firewall of the server blocks it. And if the server tries to talk to my laptop, the firewall of my router blocks it. What you can do is leverage the stateful nature of these network address translation devices, these NATs. If I send a packet to a website, the return traffic obviously has to be allowed back in; otherwise I can't actually view the website. So what happens when you send an outbound packet is that the NAT device makes an entry saying: I'm seeing UDP or TCP traffic on the outgoing port, say 30,000, to this IP, and any traffic coming back in should be allowed, even though there is a firewall. So if both parties know what their external address is and they start to simultaneously send traffic to each other, the inbound traffic from either end is going to look like the return traffic of the outbound traffic. You trick the NAT device into letting traffic back in, even though it's not actually the response to the request that you sent. This is called hole punching. So when Firezone receives a packet that we need to route to a gateway and we don't have a connection yet, we ping our control plane and say: this user would like to access the website company.com.
Then the control plane looks up the allow rules. This is what the network administrator of a company would configure: who has access to what. We evaluate these rules in our policy engine, and if that gives us a green light, it sends a message to both the gateway and the client with a set of credentials and, most importantly, the information about what the other party's public IP addresses are. Then both sides can simultaneously start talking to each other, and we hole punch a connection, which means we now have a way of sending UDP traffic back and forth to this particular exit node. What comes next is a WireGuard handshake. As part of the message that got sent to either end of this connection, we also have the other party's public key. We pass that to the WireGuard protocol, and the WireGuard protocol performs a handshake. That handshake gives us a symmetric encryption key. Once that is complete, we can take the IP packets that are coming in, encrypt them, wrap them in a UDP packet, and send them to the other party. The gateway then decrypts them with the same session key, which reveals the original IP packet, performs any network address translation that it still needs to do, and sends it out to the real resource. That's the case for when hole punching succeeds, which is the good case with low latency. But there's a whole alternative code path for what happens when we cannot hole punch a connection. Unfortunately, not all NAT devices are generous enough to allow this traffic back in, even though it comes from a different IP. There are different algorithms for how such a NAT device allocates the source port. If my laptop is making a connection to a website and I have multiple devices within my network, we all share the same external IP, right?
So it's possible that whatever source port my computer chose to create a socket on, somebody else's computer on the same network uses the same source port. We obviously cannot both have the same source port in the outside world; one of us needs to use a different one. If this source port allocation algorithm is randomized, then we don't actually know what our source port is going to be when we try to contact the other party. So this hack of trying to hole punch is not going to work, and all the packets will bounce. That is when TURN servers, or relays, come into play. A relay kind of acts like a remote socket. We have a control protocol with a TURN server and we can allocate an address on it, which gives us the public IP of this TURN server plus some port, similar to how you call the kernel's bind API to create a socket. You just use a protocol with the TURN server to create a socket. The TURN server then uses a very simple framing mechanism to send packets back and forth: whatever it receives on the port that it allocated for you, it forwards to your local device. What we do as part of the connection setup is concurrently try to make a direct connection and, at the same time, check whether we can reach the other party via a TURN server. This entire protocol is called ICE, Interactive Connectivity Establishment. It's an RFC standardized by the IETF. That is the bit where we heavily integrate with the str0m library, the WebRTC implementation that you already talked with Martin about. And that's how we coordinate this connection establishment.
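To make the hole punching discussion above more concrete, here is a small Python simulation of two NAT devices. It is a toy model under stated assumptions, not Firezone's or any real NAT's implementation: the `Nat` class, the `try_hole_punch` helper, the documentation-range IP addresses, and the port numbers are all hypothetical. A NAT that reuses one external port for every destination lets the punch succeed. A NAT that picks a new external port per destination makes it fail; a simple counter stands in here for the unpredictable, randomized allocation described in the interview.

```python
class Nat:
    """Toy NAT: remembers outbound UDP flows and filters inbound packets."""

    def __init__(self, public_ip, per_destination_ports=False):
        self.public_ip = public_ip
        self.per_destination_ports = per_destination_ports
        self._next_port = 40000
        self._mappings = {}    # mapping key -> external port
        self._allowed = set()  # (external port, remote ip, remote port)

    def send(self, src_port, dst_ip, dst_port):
        """Record an outbound packet; return the (ip, port) the remote side sees."""
        if self.per_destination_ports:
            key = (src_port, dst_ip, dst_port)  # new external port per destination
        else:
            key = src_port                      # same external port for everyone
        if key not in self._mappings:
            self._mappings[key] = self._next_port
            self._next_port += 1
        ext_port = self._mappings[key]
        # Return traffic for exactly this flow is allowed back in.
        self._allowed.add((ext_port, dst_ip, dst_port))
        return (self.public_ip, ext_port)

    def accepts(self, ext_port, src_ip, src_port):
        return (ext_port, src_ip, src_port) in self._allowed


def try_hole_punch(nat_a, nat_b):
    stun = ("198.51.100.1", 3478)  # hypothetical server that reports external addresses
    a_ext = nat_a.send(5000, *stun)
    b_ext = nat_b.send(6000, *stun)
    # Both sides now send to the address the other one learned.
    a_seen = nat_a.send(5000, *b_ext)  # where A's packet really appears to come from
    b_seen = nat_b.send(6000, *a_ext)
    a_to_b = nat_b.accepts(b_ext[1], *a_seen)
    b_to_a = nat_a.accepts(a_ext[1], *b_seen)
    return a_to_b and b_to_a


friendly = try_hole_punch(Nat("203.0.113.10"), Nat("203.0.113.20"))
hostile = try_hole_punch(
    Nat("203.0.113.10", per_destination_ports=True), Nat("203.0.113.20")
)
```

In this model `friendly` ends up `True` and `hostile` ends up `False`, which is the situation where a relayed path becomes necessary.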
As a result of that, we know, for example, that we need to use this remote socket to send traffic to the other party because the hole punch attempt failed. It's better than no connection, I guess, but the downside is that the packet makes an extra hop, which means increased latency, so the connection is not quite as fast as it could be if it were direct. It's an unfortunate reality of today's internet, and it all boils down to the IPv4 address space being small and us running out of IPs, right?
Is that also where most of the issues are pinpointed to, let's say if customers complain, or are there different parts of the stack that are equally sensitive?
It has definitely been a big problem in the past. The interesting part about a relay is that, from an implementation point of view, what it needs to do is very simple, right? It reads data from one socket, strips a little header, and sends it out somewhere else. The initial implementation of the TURN server actually did not have great performance, because it spent about 95% of its time in syscalls: copying memory from kernel space to user space, stripping four bytes off the payload, and sending it out somewhere else. You really only spend time copying data back and forth. So what we're now doing is leveraging eBPF. In particular, we're leveraging the eXpress Data Path (XDP) part of eBPF to do this data routing directly in the kernel. We install an eBPF program as part of the network driver on our relays.
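For readers wondering what "stripping four bytes" could look like, here is a minimal Python sketch of TURN's ChannelData framing as defined in RFC 8656: a 2-byte channel number, a 2-byte payload length, then the payload. That this is the exact framing meant in the interview is an assumption, and the function names are hypothetical. The sketch only illustrates the user space view of the header; it says nothing about how the eBPF program is written.

```python
import struct

# Channel numbers valid for ChannelData messages (RFC 8656).
CHANNEL_MIN, CHANNEL_MAX = 0x4000, 0x4FFF
HEADER_LEN = 4  # 2 bytes channel number + 2 bytes payload length


def encode_channel_data(channel, payload):
    """Prefix the payload with the 4-byte ChannelData header."""
    if not CHANNEL_MIN <= channel <= CHANNEL_MAX:
        raise ValueError("channel number out of range")
    return struct.pack("!HH", channel, len(payload)) + payload


def decode_channel_data(datagram):
    """Strip the 4-byte header and return (channel, payload)."""
    if len(datagram) < HEADER_LEN:
        raise ValueError("datagram too short for a ChannelData header")
    channel, length = struct.unpack("!HH", datagram[:HEADER_LEN])
    if not CHANNEL_MIN <= channel <= CHANNEL_MAX:
        raise ValueError("not a ChannelData message")
    if length > len(datagram) - HEADER_LEN:
        raise ValueError("payload is truncated")
    return channel, datagram[HEADER_LEN:HEADER_LEN + length]


frame = encode_channel_data(0x4001, b"wireguard-bytes")
channel, payload = decode_channel_data(frame)
```

The relay's job on the data path is then essentially a lookup from channel number to peer address followed by forwarding the payload, which is why doing it without copying data into user space pays off.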
That gives you a chance to look at the byte buffer that the network card received even before the kernel starts to look at whether this is Ethernet, IPv4, or IPv6. It hasn't analyzed it at all; it just gives you the byte buffer that was received. Then we look into that and probe: is this a packet that we should relay? Then we make a few targeted modifications and send it back out again. Ever since we did that, the reliability and speed of relayed connections have improved tremendously. Sometimes you don't even notice that you're on a relayed connection because it's working so well.
It's actually magical. So that means that as part of your offering, you are also hosting all these different servers, right? The STUN servers, the relays, everything else you might need. Can you maybe give an overview of the kind of infrastructure you host in total?
Yeah, so we have relays in every region that Azure offers. It doesn't really matter where you are in the world; you should have a relay within 30 milliseconds of you. Then we have the control plane nodes, which, as I briefly touched on already, are written in Elixir, which is actor based, right? There are multiple servers, and all the little processes in there talk to each other, so it's distributed as well. The STUN and TURN servers are combined, so our relays also perform the STUN functionality. Those are really the only two parts of infrastructure that we have.
Okay, very cool. I'm not sure if we ever covered this on our podcast, but you said you have a version of the service in every region that Azure offers. How does the end user device know which is the closest one to route to? It's actually an interesting problem.
Yeah. What we do right now is essentially geolocating IPs.
When the client connects to the control plane, we obviously see its public IP. Through geolocation databases, we can make an educated guess about which relay is probably going to be your closest. I have to admit that doesn't work quite as well as we would like. What we plan to do in the future is send a bigger list of possible relays down to the client, and the client will then make a ping-based check: based on RTT, which one is actually the closest, and then primarily use that. But that's still to be built, so to speak.
Yeah, I mean, as always, I'm sure there's plenty to improve and expand. Not sure I want to see your backlog; I'm sure it's filled plenty.
I mean, you can look at it, it's on GitHub.
You also mentioned that one of those components is in Elixir, and I love it. It's a nice language. I used to have more time to explore different languages and try to build products with them or find production-level work to enhance my knowledge of them, and I've also done that with Elixir in the past. So I definitely like the language, but still I wonder: as you also already use Rust for Firezone, is there a reason why you kept that split? Or does it just work, because maybe the developers at the time knew Elixir, it works fine, and there's no reason to align those technologies?
I didn't really know any Elixir when I joined Firezone. And I have to say, I think for what the control plane needs to do, it's a very good choice of technology. While you could build the same thing in Rust, I think you'd probably make your life unnecessarily hard. I like Rust a lot, and if I had to build it, I would probably build it in Rust just because I'm much more comfortable with that language.
But to give an example: each of our clients maintains a WebSocket connection to the backend, and that WebSocket connection is tied to an Elixir process. So us sending messages to the portal just involves a WebSocket message, which then gets analyzed internally. If we, for example, want to send a message to the gateway via the control plane, that's just as easy as sending a message between two Elixir processes and then handing it back down via the WebSocket on the gateway side. It's similar for other background processes and workers. For example, the admin portal web UI is also just another set of processes with a live WebSocket connection. This is pretty cool because it means you can have your browser window open and things will update in basically real time as your devices go online or offline, because it's all connected through this big PubSub network. You can build that in Rust if you want to, but there are a lot of things you need to build that Erlang and Elixir basically give you for free.
Okay, makes total sense, and I'm definitely not the person to say rewrite everything in Rust. I think it's nice to have different languages that have their own strengths, and I totally agree with your assessment there. So it's a proper choice. I do have to say I have more experience with Erlang than Elixir. They're both running on the BEAM. Is there an advantage to using Elixir over Erlang? Would it be similar in Erlang itself, or are there some good reasons there?
I'm not the right person to ask about that. As far as I understand, Elixir is a language built on top of it, right? As far as I know, we don't use any Erlang per se; it's essentially all Elixir. We make very heavy use of Phoenix LiveView and all of these ready-made parts.
Just because it makes development very quick. So I can't really say what it would look like if it were all written in Erlang.
Okay. I always find that every family of programming languages that you learn to work with, especially in production systems where you really discover all the different corners and have to work with it in a pretty deep manner, changes the way you think and work. As you mentioned yourself, before joining Firezone you had never worked with the Erlang virtual machine, the BEAM, or with Elixir. Now I imagine, and maybe I'm wrong here, that you have worked with it plenty and that it has changed you as a developer. Are there things you can remark on that made you a better developer by using it?
I've not worked that much with the Elixir backend, I have to say. One thing I found very impressive was how easy it is to write tests for pretty big things. For example, like I said, we make heavy use of LiveView, and I'm a bit envious of how easy it is to test that a very interactive web page does the things you want, without even launching a browser. One of the things that has been quite tricky is figuring out how we actually test Firezone, especially the data layer. Seeing how easy the integration tests are on the backend side, I wish it were this easy for the data plane. The abstractions they found there are quite nice, as is the way you send these messages between processes on a very dynamic level. If you want to do the same thing in Rust, it can feel a bit verbose, because you've got to spawn a task, create a channel, clone the sender end, move it into the task, create a loop, and all these kinds of things.
Unless you're actually building an actor framework on top, and even with an actor framework, Rust wants you to be very explicit about what you're doing: this is a future, you need to await it, all these kinds of things. Whereas on the Elixir side, it's much more like: just send this message, and I know there is a process somewhere listening that will handle it.
Is it easy for you to reason about the flow? Do you understand at all times how the different pieces move through the system? Or is it sometimes a bit too magical and abstract?
The latter. I rely very heavily on full text search. This is where I think Rust works really well: because it's so explicit, the LSP can help you. You want to see the definition of this function? Here you go. You want to see all call sites? Here you go. In a dynamic language like Elixir, it's not that easy. So I find myself grepping, or full text searching, through the code base for certain keywords where I know, okay, this process sends this message, so I now need to find a function that's probably called handle and receives this token. Then I can link the two together and actually follow what the data is doing. So it can be quite magical at times. I'm sure you get used to it as you work more in it, but initially it can be a bit daunting to think, I don't understand how it achieves this functionality.
Yeah, I understand. I for one don't like that magic at all, but I do agree that if you accept it and work with it enough, I'm sure you get a feeling for it, and especially for the happy paths, I'm sure you know where it's going.
But it always frustrates me when issues come up, because it makes it pretty difficult to really understand what's going on. You have to second guess, and maybe your findings line up with what you think, but that still nags at me, because maybe it doesn't work like that: the things I see align with that kind of behavior, but they might have a different reason. Anyway, as long as it works, it's okay, I guess.
Yeah. You really notice how much it depends on what you're used to. I've been working in Rust for quite a while now, so if I know my code compiles, I don't even question certain parts of it, because it wouldn't compile if this weren't true, you know? Whereas I find that when you move to a dynamic language, you start programming, then you run it, and something blows up because of "function not found". And you're like, what? How is this possible? Well, of course you can run code that calls a function that doesn't exist. It's an error case that you don't even think about. So when you're bug hunting, you take a completely different approach to what the bug could possibly be, depending on what mindset your language is currently anchored on.
Yeah, I agree. I believe we already covered a lot. I now have a much better understanding of Firezone, and because it's open source, everybody can continue to study it with this high-level knowledge in mind to navigate through the code base. Before we wrap up, are there things that you think we didn't mention yet, that you would like to elaborate on, or that you would like to give a shout out to?
I guess more of a general one, having worked on this product for so long.
One of the things that I became more passionate about is IPv6 adoption. So I guess my shout out would be: make your software IPv6 compatible, ask your ISPs to give you a /64 block, and start using IPv6 so you can enjoy direct connections.
Very cool. I mean, can everybody just request an IPv6 /64 block?
Well, it's typically the default that you get, which sounds insane initially. But if you actually do the math on how many IPv6 addresses there are... I saw a recent comparison where someone said: imagine all the stars in the universe, and you can basically pack those into a /64 block. With your own /64 block, you can address so many things, it's unimaginable. And you still have another 64 bits on top of that. It's so big, it's unimaginable how many addresses there are. So what ISPs typically do once they are IPv6 enabled is that every customer just gets a /64 block, and you can do with that whatever you want.
Yeah, I mean, I get the reasoning, but at the same time, I feel it was very similar with IPv4, where the reasoning was, how can we ever run out, and they were just handing them out. Now there are universities and corporations who are parking so many IP addresses that could be used, but they're just hogging them, because the value keeps rising and they never know when they might need them. So there are plenty of IP addresses not being used. I'm just not sure I follow the reasoning why you would then make a similar mistake in IPv6 and hand out an entire /64 block. It seems a bit ridiculous to me. I'm probably wrong, but I have a feeling that it's a similar mistake, just at a bigger scale.
And that might even out in the future, when suddenly every bacterium has its own IP address or something, and we're like, my God, now we've run out again. How could that happen? Maybe because everybody has a /64 block. I don't know.
Yeah, maybe. I think the point is that it's just unimaginable how big this address space actually is. I agree with the argument in principle: we also thought that the IPv4 addresses were enough. Maybe it's a bit short-sighted, but I would say the last 30 or 35 years of trying to connect the world have probably taught us a lot about how these scales can grow. In the design of IPv6, it is just such a large address space that it's fine if every human on earth gets a /64 block, because that's still only a small fraction of the entire address space. So I guess it's going to be fine, but we'll see. I don't know. Maybe in 500 years, as you said, every bacterium has its own IPv6 address or something, and maybe we have problems then.
Okay, very cool. And is there a shout out you want to give to someone or to other things, like maybe your team or your company?
I think, yeah, shout out to Max for sending me the link to your podcast where Firezone got a mention. And also thank you for having me as a guest. I really enjoyed chatting about Firezone.
Thank you very much for your time. I wish you all the best, and I'm looking forward to continuing to see you and your team work on Firezone and how it evolves.
Thank you. Yeah, it's an exciting future ahead for sure.
Elizabeth (Plabayo)
1:15:06 | 🔗
Netstack.fm is brought to you by Plabayo, building secure, open, and resilient infrastructure with Rust, protocols, and purpose. This show is also made possible by Rama, the open source networking framework. Plabayo offers service contracts and welcomes sponsorships to keep building and supporting its ecosystem. The theme music of this podcast was composed by DJ Mailbox. If you enjoyed this episode, don't forget to subscribe on your favorite podcast platform and leave a five-star review. It really helps others discover the show. Thanks for tuning in. We'll see you next time for the next handshake.