Episode 12 – Oxide Networking with Ryan Goodfellow.
A conversation with Ryan Goodfellow about Rust networking at Oxide. We will explore the Oxide computer stack with a focus on network, including their fully integrated cloud computer, programmable networking with P4 and Dendrite, the Maghemite routing stack, and OPTE — a Rust-based packet engine running inside the kernel. Ryan also shares how his background in large-scale network testbeds led him to help design Oxide’s rack-scale system and its modern approach to routing, observability, and hardware–software co-design.
If you like this podcast you might also like our modular network framework in Rust: https://ramaproxy.org
00:00 Intro
00:44 Meet Ryan Goodfellow
06:23 Building Large-Scale Test Beds
07:46 The future of the internet
10:54 Overview of Oxide's Rack Scale Computer
19:36 Exploring BGP and Routing Protocols
26:02 The X4C Compiler and Its Origins
39:43 Programming for Tofino and Observability
45:10 Life of packets of an HTTP Web (Oxide Rack) server
01:01:58 Exploring Maghemite: The Routing Stack
01:12:45 Future Directions: Rust-Based Operating Systems
01:19:28 Testing Strategies and the Falcon Framework
01:27:25 Outro
Music for this episode was composed by Dj Mailbox. Listen to his music at https://on.soundcloud.com/4MRyPSNj8FZoVGpytj .
Elizabeth (Plabayo)
0:16 | 🔗
This is netstack.fm, your weekly podcast about networking, Rust and everything in between. You are listening to episode 12, recorded on the 31st of October, 2025, where Glen has a conversation with Ryan Goodfellow, an engineer working on the network stack of the Oxide cloud computer. Buckle up, as this will be a very technical episode.
Hello everyone, and welcome to another week of netstack.fm. Today I am joined by Ryan Goodfellow. He's doing a lot of Rust networking at Oxide. So welcome, Ryan.
Thank you very much for having me. It's wonderful to be here.
Yes, and a lot of people listening to this podcast will probably know Oxide, but some don't. And we also don't really know you yet. So could you start by introducing yourself a bit?
Sure. So I am, I guess now, a networking engineer. I've been working professionally in networking for, I guess, the past few decades now. Before coming to Oxide, I was at a place called Information Sciences Institute, where I created network test beds. And the best way to describe the work that I was doing there is by analogy to the public clouds that we have today. The public cloud allows companies and developers to not have to own and manage hardware, but still be able to deploy complex systems onto a cloud that's managed by Google or Amazon or somebody like that, and be able to scale things up and just really not have to worry about hardware management. In my previous life, in the testbed systems that I built, we provided a similar kind of cloud-like feel, but for our users the goal was not to deploy applications, it was to experiment with networks. And so we had users from academia, from government, from industry that would want to, say, experiment with what the next generation internet would look like, or we had people from power systems that wanted to develop power grid control algorithms that were resilient to certain types of attacks.
And we would build testbed environments where they could basically model and deploy very complex, internet-style networks that could be very large, as big as the Western Interconnect of the United States, and deploy systems onto those networks that they had modeled and deployed in our testbed environment and experiment with them. I actually came into that environment from the power grid space. My formal background was actually in computation of differential algebraic equations, which is very far from networking. I was kind of building simulation algorithms for the power grid, but started working with a lot of internet people. And ISI is actually a very famous place for the internet. It was one of the first internet sites. A lot of the RFCs were developed at ISI. My original boss there was Bob Braden, who was very influential on me and how I looked at networks and how I thought about networking. And I kind of drifted from the computational mathematics side into networking slowly, and wound up just kind of falling in love with networking and building these really large scale test beds that were used as national infrastructure and infrastructure for universities and things like that. And I really loved that work. It was hugely rewarding. I had a lot of fun, had a small team that built and deployed these systems, which ranged anywhere from a couple of racks at a university to dozens of racks deployed at an Equinix colo facility. But I really reached a limit, I felt like, with what I could do with the technology of today. If you are deploying systems with your standard HP or Dell servers with some Mellanox or Cisco switches, things like that, there are limits to the level that you can go down to to really understand how the metal beneath you is operating.
And because we were doing kind of crazy things with networks and letting our users deploy not even IP networks, just networks that they had imagined, you very quickly run into corner cases with a lot of these commodity switches, commodity routers, commodity systems that they didn't really anticipate. And it's really opaque to you as the operator in trying to understand what's actually going on. Why is the system failing? Why aren't packets forwarding in the way that I expect? If one of my servers has fallen over, if IPMI isn't working or Redfish isn't working or something like that, how do I actually build a robust system out of that? And it became deeply frustrating over time, and we came up with mechanisms to manage it. But then I became aware of Oxide and what Oxide was doing in terms of building this new generation of computer that was a rack scale computer designed for scale up designs, built on an open source foundation, and built on programmable networking, because I saw the P4 programmable switch hanging out in the materials that they had publicly available. I decided that's really what I wanted to be a part of: building that next generation of hardware. And if you look at what the cloud providers are building their systems on, like Google and Amazon and Facebook, they're building everything in-house based on the needs that they have to build robust cloud infrastructure. And at Oxide, what we're trying to do is take that same type of approach of looking at what the cloud computing problem is and how we design a fully integrated hardware-software system to approach that problem. And so that's kind of how I left my previous life of building network test beds, because I was frustrated with the technology that I had, to being a part of the solution that I wanted to see happen in the world.
Yeah, that's a whole history and a very fascinating one.
Now these test beds were at scale, I imagine, no? I mean, you were saying how large the networks they were simulating were, but as you were simulating, I imagine you're not actually having deployments go over the entire country, or do you?
Yeah, so we had a geo-distributed set of test beds kind of all over the country. We had some test beds in Europe, and so they were kind of all over the world. The test bed infrastructure that I had created at ISI was called Merge. And the entire idea behind that was that different places were going to have different expertise at building different types of test beds. And so you would have certain university labs that were really good at wireless networking, and they could build great wireless testbed labs. Some labs were going to be great at IoT stuff. Some labs were going to be great at SDN and software defined networking. Some labs were going to have NetFPGAs and resources like that. And the idea behind Merge was to be able to actually interconnect all those specialized types of testbeds and put network emulation infrastructure in between them, so you could actually have the networks that you wanted to model interconnecting all these different testbed facilities.
Okay. And you were also saying that some of them were not even IP packets. I was also not aware of how much work was going on outside of these protocols, even.
Yeah, and so for people who are looking at what the future of networking could potentially look like, like some of the named data networking and things like this that don't necessarily follow the IP protocols, we built substrates that were based on overlay networks that could transit those protocols, which didn't have to be IP based at all, that people could experiment with. And so it did provide a very flexible substrate.
And then for our network emulation engines, you could define the characteristics of certain links you wanted in terms of, what should the loss rates be? What should the jitter rates be? All this type of stuff. So if you wanted to simulate systems in a lossy or unreliable environment and see how a distributed system would respond to that, that's something you could do in that environment.
Okay, and some listeners might hear this and they might think, okay, they know the issues of something like IPv4, but they thought maybe IPv6 solved all we had. What is there then in the future that we would be missing, even if you continue to stay IP based?
I mean, I think that remains to be seen. There's an interesting paper out there on this called, I think, the Tussle for Cyberspace. It came out a few years ago, but it nicely articulates the different paths forward that we might have for what the future of the internet would look like. A lot of people like to talk about how the internet is very sender-driven, right? When you initiate communication with somebody else, that somebody else is just going to take your packets, kind of whether they like it or not, right? And this is why we have DDoS attacks and things like that. And communication isn't really receiver driven. And so there's some interesting research that's out there in the academic space about driving things toward a more receiver-driven model, or a pub/sub type of model, if you will. And so there's lots of interesting work in that space, but I've been at Oxide for about five years now, almost. And so for the last five years, my brain space has been filled with, let's say, more immediate and practical concerns than what the next version of the internet is gonna look like.
Yeah, I wonder if I would even still see it within my lifetime, because having to move entire systems over to a new internet feels like a very big task, not something that will go very lightly, I imagine.
Yeah, it's a huge task. If listeners are interested in it, it's fun to look at the academic conferences like SIGCOMM and NSDI and things like that to see what ideas are kicking around there and see what people are thinking about and, you know, join and participate in the conversation. It's a lot of fun.
Yeah, very cool. Let's definitely put some links in the show notes. Now, I don't think a lot of people know what's going on at a company like Oxide. Maybe we could start by giving an overview of, okay, we have this computer or whatever we call it, what are its components, how is it connected, how much do you control there, et cetera. I think it's probably appropriate to start there with some kind of overview and some tour.
Sure. Yeah, so the Oxide cloud computer is what we call a rack scale computer. The minimum amount of product that we sell is a rack of compute sleds and a couple of switches. And so the rack that we are selling today comes with 32 compute sleds in it that are based on an AMD SoC. The first generation was Milan. The next generation of CPUs is going to be coming out pretty soon. But basically there are 32 compute sleds that slide into cubbies in the Oxide rack. And then there are two switches in the middle of it. And that's the unit that we sell: that one 32 compute sled, two switch unit. And an important thing about the Oxide platform is it's actually a fully integrated and co-designed hardware-software substrate. So we're not just selling the hardware all by itself. It comes with complete software installed that gives you a cloud-like interface.
And so the experience of buying an Oxide computer is that you basically get the whole thing. You can roll it into your data center, plug it up to power, plug it up to the network, give it a few configuration parameters for how you wanna connect to upstream networks, like if it's gonna be BGP or static routing, what have you. And then you start it up, and within the hour you have your own cloud that presents AWS or GCP style interfaces, or APIs, where you can immediately start creating instances, connecting those instances to one another and to the outside world, assigning external IP spaces, doling that IP space out to the various different instances and things like that. And so that's what we're putting together: this fully integrated computer. In some ways you could think of it like the approach that Apple has for their computers, where when you buy an Apple computer, you're buying the software and the hardware that are designed together. A difference with Oxide is that we are completely open source with the software that we ship. The only software that really doesn't ship with the platform is software that is encumbered by NDAs, and we work tirelessly to get that software out from under NDAs. A good example that we're going to be talking about today is our Dendrite software, which we could not release for a very long time because of NDAs with Intel revolving around the Tofino 2. But we put a lot of pressure on Intel to provide an open source SDK for the Tofino and an open source compiler. And eventually that did happen. And that's why Dendrite is open and available today, because of years of working on Intel to be like, hey guys, we got to open this up. And it finally did happen. And so that's an overview of the system as a whole.
Okay, very exciting.
Now, I still don't fully understand the entire thing. I get the hardware aspect, because I also have some experience there from the past. What I understand less is how we have to understand the software stack, because as you said, you have an entire integrated computer. It comes with hardware and software. Does that mean it runs an OS, or that it has compatibility layers? Because you were saying, okay, within minutes or within some time frame you have some AWS-like interface. Does that mean it comes with all these services like S3 and all those things? How do I have to imagine it?
Yeah, so the system comes pre-installed with the operating systems that are running on all the compute sleds. Our operating system is the illumos operating system, which is a descendant of Solaris. And so there's an illumos distribution that's going to come pre-installed on the compute sleds. And when you start the system up for the first time, basically when you apply power to it, all the compute sleds are going to fire up. And then actually a lot of things are going to happen from there. There are some basic services that are going to start up on all of those compute sleds. They'll start some of the networking subsystems that we're going to talk about in detail later to basically construct a routed network inside the rack, so all these services can talk to one another. Our control plane, which is an open source system called Omicron that's available on GitHub, is what drives all of this. And so every single sled is going to have a sled agent that comes up and basically reports into the overall brain of the control plane, which is called Nexus. And then Nexus is going to keep tabs on all the hardware that's coming up in the system.
Nexus also provides an API that is going to be accessible to any networks that are connecting to the front-facing QSFP ports of the device. And there are also a couple of technician ports on the front of the device, which are RJ45, and they're for the initial configuration of the system. So after you turn the system on, within a few minutes, all of that is gonna happen. All the core system services are gonna be running, and you'll be able to basically talk to Nexus, which is the brain of our control plane, to say, here is the configuration for how to connect the rack to the upstream network. It's going to chew on that for a little bit. It's going to set up routing tables or set up BGP sessions if that's what it needs to do. You have to give it an IP space that you're assigning to it. You can delegate a DNS zone to it. And a few minutes later there will be an API that becomes available over the IP space and the DNS space that you have delegated to the rack. That is the API for the Oxide rack. There's a web server that will also be running, so you can log in to manage the device, similar to the AWS console if you're familiar with things like that, and start to onboard users. It integrates with external IdPs like Okta, Google, things like that, for user management.
That does mean that on your part you can ship one computer and it's the same for all customers, right?
Yeah, it is. It's the same platform for everybody. We do have, like I mentioned at the top of the hour, our next generation compute sled that's coming out, which maintains backwards compatibility with the existing racks. And so if customers want to upgrade their compute sleds, it's just a matter of sliding one of the compute sleds out of its socket and replacing it with this next generation compute sled.
The Oxide control plane will automatically onboard it, and then that compute sled is available to users to run virtual machines on and things like that. The basic compute abstraction that we provide to users, much like the public clouds, is the virtual machine instance.
Okay, and so to start diving in and chipping away a bit at the protocols: one of the protocols you already mentioned was BGP, and it's a very important protocol, but not many people are aware of it. And I have a feeling, at least from my point of view, that a lot of people became aware of BGP when Facebook had an outage, which was apparently BGP related. Now, for our listeners, can you maybe explain a bit what BGP is and also in what context it is important to an Oxide rack?
Sure, so BGP is a pervasive routing protocol. It's the routing protocol that a big chunk of the internet is built on. And it's what we call a distance vector routing protocol, which means that no individual router inside of the network has to understand the entire topology of the network. They just basically need to know about their BGP peers, and they understand through advertisements that are sent from those peers what network prefixes they can reach through those peers. And because of this decentralized aspect of BGP, where you don't have to really understand the whole model of the network, it's very popular for large networks like the internet. It's also very popular in data center networks for constructing certain types of topologies like fat trees, which are very common in data center networks, because basically you just have to spin up a BGP daemon, assign an address space to that BGP daemon that it's going to advertise to its BGP neighbors, and the network kind of assembles itself. You don't have to go in and do a bunch of manual routing configuration, anything like that. It has nice features
in it like loop breaking and things like that, where if it sees a loop in the network, it's going to go ahead and break that loop so you don't have an active loop that is spinning traffic around to infinity. So it's a very common protocol that's used for assembling data center networks, which is why it was one of our early targets at Oxide for our external networking interface for the rack. Inside of the rack, we don't actually use BGP. One kind of funny name I have for BGP is Big Grumpy Protocol, because BGP is quite old. It's a little bit long in the tooth at this point. And inside of the rack, we actually have a different routing protocol that myself and a few other people at Oxide designed, called Delay-Driven Multipath. It's also a distance vector routing protocol, but it's a little bit more of a modern take on the control plane that routers communicate with one another over. And the big thing about delay-driven multipath is the way that it makes multipath decisions. Inside the Oxide rack, there are two switches, and every single compute sled connects to both of those switches. And so in a steady state, when there have been no hardware failures or anything like that, when a virtual machine on one sled wants to talk to a virtual machine on another sled, there's a multipath decision that it has to make, because it has two options for what path it can take through either one of those switches. And what delay-driven multipath does is take active telemetry across the network to be able to figure out which of those two paths is the most loaded from the perspective of that particular sled. And then we will probabilistically take the least loaded path, and that will dynamically balance the network on the rack. This is in contrast to the approach that is typically taken with BGP, which is something called ECMP, or equal cost multipathing.
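To make that contrast concrete, here is a minimal Python sketch of the two selection strategies: ECMP pins a flow to a path by hashing its headers, while a delay-aware scheme picks the less loaded path more often. This is not Oxide's delay-driven multipath implementation or any switch vendor's ECMP hash. The function names, the CRC32 hash over the flow tuple, the inverse-delay weighting, and the delay values are all assumptions chosen for illustration.

```python
import random
import zlib


def ecmp_choose(flow, paths):
    """Static choice: hash the flow's L3/L4 fields and index into the paths.

    CRC32 is an assumption for illustration; real switches use their own
    hash functions. The same flow always maps to the same path.
    """
    key = "|".join(str(field) for field in flow).encode()
    return paths[zlib.crc32(key) % len(paths)]


def delay_weighted_choose(delays_us, rng):
    """Probabilistic choice: lower measured delay means higher probability.

    The inverse-delay weighting is a hypothetical stand-in, not the actual
    delay-driven multipath algorithm. Delays must be positive.
    """
    paths = list(delays_us)
    weights = [1.0 / delays_us[path] for path in paths]
    return rng.choices(paths, weights=weights, k=1)[0]


flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 443)  # src, dst, protocol, ports
paths = ["switch0", "switch1"]
print(ecmp_choose(flow, paths))  # same switch every time for this flow

rng = random.Random(0)
delays_us = {"switch0": 10.0, "switch1": 90.0}  # hypothetical telemetry
picks = [delay_weighted_choose(delays_us, rng) for _ in range(10_000)]
print(picks.count("switch0"), picks.count("switch1"))
```

With these made-up delays, the lightly loaded switch should be picked far more often, while the hashed choice never changes for a given flow no matter how congested its path becomes.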
Every single packet is basically going to be hashed in terms of its layer 3 and layer 4 headers. And based on that hash, an index is going to be calculated, and that index is going to index into the different path options that you have. And so that particular flow is always going to stay along that particular path, independent of how congested that path may or may not be. And so that's just, I guess, a quick introduction to routing protocols at Oxide. We anticipate that for external routing we'll eventually have OSPF and IS-IS and things like that available to our users, but there hasn't been a whole lot of demand for that yet, so we haven't quite gotten there.
Okay, and within the rack, how are these different sleds connected on the network? Are these fiber cables, or what do I have to imagine on the hardware level?
Yeah, so that's actually one of the really cool things about the platform: we have a cabled backplane inside of the rack itself. It's copper based. There's just a couple of meters from top to bottom inside the rack. And so we can safely put 200 gigabit signals across that copper, running at PAM4 56G for the signaling. That cabling never has to be touched by operators. That's just fundamentally built into how the chassis is constructed. And so the compute sleds have a blind mating mechanism in the back of them. When you slide the sled into the rack, it just automatically connects into the backplane of the network that goes up to our switch ASICs, which are the Tofino 2 based switch ASICs that are in the rack today. And all of that is just completely automatic and fixed. Facing out the front of the rack, each of the two rack switches has 32 QSFP ports that users can connect in any way that they see fit to connect to their upstream networks.
So if they're in a data center, their data center provider could be dropping fiber into the data center, or they might be connecting to their own upstream routers or firewalls, what have you. And then we provide a generic API for them to configure BGP or static routing or things like that through those interfaces that they have populated.
Okay, and then I also came across your in-progress P4 compiler, which is called X4C, I believe. Of course, it's a bit of a random jump, but maybe we could start there, because anyway, there is so much to cover, unless you had a more natural one to jump to.
Sure, yeah, we can talk about X4C. I think the place that makes the most sense is to go to the origin story of X4C: why does X4C exist in the first place? And this goes all the way back. This is a great Rust story also, because Rust played such a huge role here. When we were working on the product, before we were even in market, we knew that we were using the Tofino 2 switch, which was a P4 programmable switch. But we didn't actually have hardware prototypes available for that. And something about the Oxide computer is that we're not just wrapping somebody else's hardware design. We're not taking an Accton or an Edgecore packaging of a Tofino ASIC, putting a nice shiny cover on it, and calling it a day. Our switching platform is actually designed by us from the PCB all the way up, which gives us a bunch of really nice security properties in terms of how we lock down the system and how we establish trust inside the system. I won't go too much into that today, and I'm also not really the expert to talk about that kind of stuff, but for now the point is that we didn't actually have the hardware
available to us to start developing the low level software and the high level software on the platform that we needed very early on in the company, when we were designing the product. And so we needed a way to represent that. When you look at the Oxide rack and you ask yourself, how do I represent that thing? When you look at the compute sleds, you're like, okay, I can use virtualization to represent those compute sleds. I can just represent them as virtual machines. I can connect them together, but how do I represent the network part of that? And specifically, how do I represent a P4 programmable network part of that? And so we were thinking about this for a little while. The Tofino ASIC, the P4 programmable ASIC that we have, is provided by Intel, or I guess I should say was provided by Intel, because they've cancelled it. But one of the things that came with that platform was a simulator. It was a simulator where you could compile a P4 program, put that P4 program down on the simulator, and the simulator would execute that P4 program. And you could hook up Linux veth devices to it and things like that and push packets through it. But the thing about this is it was a cycle accurate simulator. What it was simulating was actually a gate level representation of the Tofino ASIC. And what that means is that you got a ton of useful information about how your P4 programs are running on the Tofino architecture. But what it also meant is that it was extremely slow. Once you started putting a couple of hundred packets per second through that simulator, latencies would spike up into the tens or hundreds of seconds. And so if you think about using that kind of tool to develop software around, it's a non-starter, right?
You can't. Once you get up to five or six BGP routers or DDM routers that are just sending their hello and keepalive messages around, you're already at a few hundred packets per second at rack scale. And so we needed another answer for this. One of the other really interesting technologies that was developed at Oxide is our hypervisor, or I should say our hypervisor user space. Typically hypervisors are separated into the kernel hypervisor part, like KVM on Linux, and then the user space part, the QEMU type of thing. At Oxide our kernel space part is bhyve, and the user space part is something called Propolis, and Propolis is Rust-based. And one of the neat things about Propolis is that it's very easy to develop virtual hardware inside of Propolis. And so I started looking at this and I was like, hey, you know, we can use Propolis to actually develop a virtual switch, which we eventually wound up calling SoftNPU, for software-based network processing unit. And the idea behind SoftNPU was that it could take a compiled P4 program and just run that compiled P4 program like a regular P4 programmable ASIC would. And so with that in mind, we're like, okay, this is a great start. We can easily do that. We started out by implementing some fixed function ASICs in Propolis, and that was really promising. We could get up to, you know, a gigabit per second per port, which was several orders of magnitude more than we could get with the simulator. And so we're like, okay, now we got to make this thing P4 programmable. How are we going to go about doing that? What we ultimately landed on was actually building our own little compiler for P4, or I should say at that time, transpiler for P4, that would take P4 as input and produce basically Rust code as output, which you could then compile into a shared library object.
And then the SoftNPU device that was running inside of the Propolis hypervisor could dynamically load and execute that shared library. And the shared library came with a predictable, what we call pipeline trait, a Rust trait. The code that's loading that shared library, which has knowledge of the structure of that trait, could use that trait to send packets through that pipeline object, to populate tables in that pipeline object, things like that. And that all worked out really, really nicely. The initial motivation for building that was that we didn't have hardware at the time, but now that we do have hardware, the hardware is very expensive. It takes up a lot of power. And so it's not practical for every single engineer at Oxide, whatever work they're deploying on the Oxide platform, to have a dedicated pile of hardware to do that. And so even today, this emulation environment that we had created is one of the go-to platforms for a good cross-section of the company that is developing on Oxide to be able to develop things quickly. The benefits that we get from Rust for this are a few-fold. The first one is observability. When you're running code on an actual bare metal ASIC, like a Tofino ASIC or an X2 or something like that, you don't have a whole lot of observability about what's going on inside of that ASIC. But with the code that we're transpiling from P4 to Rust, we don't just generate the Rust code, we are inserting something called probes into that code.
And this is somewhat unique to the illumos operating system that we're running on, in that it has something called DTrace, or dynamic tracing, where you can essentially insert probes into programs that you compile. Those probes, when they're not activated, are no-ops, but when the probes are activated, they branch to a different point in the program structure, can collect a lot of information about the architectural state of the processor that's running that program, and then return to the regular execution of the program. Because we were compiling onto Rust that was running on illumos, we used another open source crate from Oxide called usdt. These are user space probes that we can run in any arbitrary program. And we compiled those into the P4 code that we're running on these virtual ASICs. And that allowed us to ask some very interesting questions about what was happening when packets are going sideways, because this is the age old question that you're asking in networking when you have a communication flow that you know should be working, but it's not working. It's falling over somewhere, and you have narrowed it down to this particular switch or this particular router. And you want to ask what is going on with the packets that are going through this router. What these probes allow us to do is say, okay, for any packet that gets dropped going through this SoftNPU ASIC, please tell me the exact path that it took through the P4 program, so essentially a stack trace. Please tell me what the results of all the table lookups were for that particular dropped packet, or even what all the table state was at the time that that packet was dropped.
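The probe idea described here can be sketched in a few lines of Python. This is only an illustration of the concept, not DTrace, the usdt crate, or the code that the P4 compiler generates: the `Probe` and `Pipeline` classes, the probe names, and the single routing table are all hypothetical.

```python
from typing import Callable, Optional


class Probe:
    """A named probe point that does nothing until a handler is attached."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handler: Optional[Callable[[dict], None]] = None

    def fire(self, **details) -> None:
        if self.handler is None:  # disabled: nothing is recorded
            return
        self.handler({"probe": self.name, **details})


class Pipeline:
    """Toy one-table forwarding pipeline with two probe points."""

    def __init__(self, routes: dict) -> None:
        self.routes = dict(routes)
        self.table_lookup = Probe("table_lookup")
        self.packet_dropped = Probe("packet_dropped")

    def process(self, dst: str) -> Optional[int]:
        port = self.routes.get(dst)
        self.table_lookup.fire(table="routes", key=dst, hit=port is not None)
        if port is None:
            # Report the table state at the moment of the drop.
            self.packet_dropped.fire(key=dst, table_state=dict(self.routes))
        return port


pipeline = Pipeline({"10.0.0.2": 1})
events = []
pipeline.table_lookup.handler = events.append
pipeline.packet_dropped.handler = events.append

pipeline.process("10.0.0.2")   # forwarded: one lookup event
pipeline.process("10.0.0.99")  # dropped: a lookup event and a drop event
for event in events:
    print(event)
```

One simplification worth noting: in this sketch the probe arguments are still computed when the probe is disabled, whereas the point of the no-op probes described above is that a disabled probe costs essentially nothing.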
And by being able to ask those questions, we can figure out a lot of the places where we've made mistakes or we haven't anticipated some particular type of interaction with an external system, and get through that very quickly, instead of looking at a device that you can't get that information out of and just beating your head against the device and trying to think very hard about what could possibly be going wrong there. And so that was a huge win for us.

Yeah, for sure. Is it only in the emulation environment, or once you actually move it to the actual hardware, do you still get those probes?

So it's a great question, because this is the dream, right? To be able to do this on real hardware. It's obviously much more challenging to do this on an ASIC that is running at multiple terabit speeds, because you just have some constraints when you're working with that type of hardware. I'll say it's going to be challenging to pull an observability mechanism like that off on the Tofino 2. But for our next generation switching ASIC hardware, with the Xsight Labs X2 platform, we are actually working on building exactly this. So this is the realization of a long-time dream for me personally, of being able to ask these questions about production systems without taking an actual production hit. So this is something that we're actually working on developing for the next generation of the platform: giving that kind of debug environment and observability, but for real production networks that run at a dozen terabits per second per switch.

Yeah, I mean, I can imagine you've had this past career and you've felt the pain, so I imagine you are in a very good position to know exactly what kind of observability you would want for all kinds of edge case issues now.
Given how much traffic is flowing through the rack, how many packets, does it mean you will always have to sample, I imagine? Because it's not like you're just going to be able to see everything.

Yeah, so this is an active area of research and work for us. I mean, there's obviously going to be limits to what you can do. When you just look at the architecture of a switching platform, you have your data plane interfaces that are running at 100 gigabit, 200 gigabit, or these days 800 gigabit per port. And then the host that's connected to that ASIC is going to be connected to the ASIC through PCI Express. And even if you're connected at, let's say, PCI Express generation 5 x8, there's a serious bandwidth limitation there, right? You're not gonna scoop up terabits of traffic through that Gen 5 x8 interface. I think that Gen 5 x8 interface is gonna give you just shy of about 200 gigabits per second when you factor in the overhead for PCI Express encoding. And so there are definitely limits to what you're gonna be able to pull out of a production ASIC. But being able to formulate probes that allow you to ask specific questions of interest, that are not necessarily gonna be a fire hose that's gonna overwhelm that PCI Express link, is a hugely valuable thing. And being able to extract that information out and being able to ask the why question, which is the question that, as a network operator in my previous life, I was never able to ask of my switching ASICs: why is this happening? I can see the configuration is correct in terms of what I feel like it should look like, but the packets just aren't moving how I think they should be moving with this.
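The bandwidth gap described above can be put into rough numbers. The sketch below uses the published PCI Express 5.0 figures (32 GT/s per lane, 128b/130b line encoding); the per-packet overhead of 24 bytes on a 256-byte payload and the 12,000 Gb/s switch capacity are illustrative assumptions, not measurements of Oxide hardware. Usable throughput in practice is lower than these ceilings, which is consistent with the rough "about 200 gigabits" figure quoted from memory in the conversation.

```rust
/// Link rate after 128b/130b line encoding, in Gb/s.
/// `gt_per_s_per_lane` is the raw transfer rate per lane (32.0 for PCIe 5.0).
pub fn encoded_gbps(gt_per_s_per_lane: f64, lanes: u32) -> f64 {
    gt_per_s_per_lane * f64::from(lanes) * 128.0 / 130.0
}

/// Share of the link left for payload when every transaction carries
/// `payload_bytes` of data plus `overhead_bytes` of headers and framing.
/// Both byte counts are assumptions chosen by the caller.
pub fn payload_gbps(link_gbps: f64, payload_bytes: u32, overhead_bytes: u32) -> f64 {
    let payload = f64::from(payload_bytes);
    link_gbps * payload / (payload + f64::from(overhead_bytes))
}

/// Fraction of the switch's aggregate traffic that could cross the host link.
pub fn visible_fraction(host_link_gbps: f64, switch_gbps: f64) -> f64 {
    host_link_gbps / switch_gbps
}
```

With `encoded_gbps(32.0, 8)` as the host link and an assumed 12,000 Gb/s of switch capacity, `visible_fraction` comes out to a few percent at most, which is why the probes have to ask targeted questions rather than mirror everything.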
We're looking at having CPU-style debugging capabilities and really being able to have that higher level of comprehension for our operators and allow them to ask why.

Okay. And now, I don't think a lot of people will be familiar with something like P4, but some might be familiar with FPGAs, where they would program them with something like VHDL or Verilog. Is it similar to that, where instead of working on a processor, you're working on network nodes and how they connect and the entire flow there?

Yeah, I mean, I guess it depends on the P4 programmable device that you're working on. I will say that programming for the Tofino ASIC is actually a lot like programming for an FPGA. And I say that because when we program software for CPUs, you can write slow code. As you pack more and more functionality into the code that you're running on the CPU, it's going to take more and more instruction cycles and it's going to slow down more and more. You're going to get more latency and things like that, but that's the trade-off that you have with CPUs. They're extremely flexible. You essentially have, practically speaking, an infinite number of instructions that you can put into a program, but the more instructions and the more things that you do, the more memory accesses that you have, the more it's going to slow things down. The Tofino is a lot different than that. You have a P4 program and you can compile it for the Tofino, and it's either going to fit on the Tofino or it's not going to fit on the Tofino. And if it fits, it's pretty much guaranteed that it's going to run at line rate. And line rate means the aggregate bandwidth capacity of all the ports on that device. And that's similar in spirit to programming, or synthesizing, Verilog or VHDL programs for an FPGA, right?
If you run out of resources, you run out of resources, and that FPGA is going to run at the clock rate that you're driving those clocks at. The clocks aren't going to slow down because you've added more stuff. Yes, you can make trade-offs with the widths of interfaces and things like that, but it's either going to fit or it's not going to fit. And if it fits, then it's going to run at the rate that you've specified. And the Tofino is very similar in that regard, in that you don't really create slow Tofino programs. I mean, you can if you do things like resubmits, where you circulate packets through the ASIC multiple times. But in terms of a single pipeline execution for a packet, it's nearly impossible to make the thing go slow.

And then, without going too deep, because in a bit I want to do some kind of life of a packet based on some actual traffic, but I did read a bit around the projects, and a project like, for example, Dendrite is all about supporting Sidecar switches in the Oxide rack. Now given that the Oxide rack is an integrated computer, what is the need to then have a Sidecar switch?

Yeah, so, I mean, what is the need to have the switch at all?

Yeah, because in the end you already have switches in the Oxide rack itself, which are for the compute sleds. So I just have for...

Yeah, yeah. Okay, okay, I see. Yeah, okay.

Yeah, it's about the name Sidecar itself.

Okay, so, yeah, the name of the Sidecar, or the origins of it, is interesting. When you look at what traditional switch or router hardware looks like today, especially some of the white box switches.
You basically have a switching ASIC, and then you have a little mezzanine card above the board that the switching ASIC is on, which is typically going to have an Intel Xeon D or maybe an Atom processor or maybe an ARM processor on there that is going to run the switch operating system. And it's kind of its own little entity. That's not how things are put together in the Oxide rack. As I mentioned before, we build everything in the platform from the PCB up. And so the way that the switch is connected to the rest of the system is that there's no mezzanine card inside of our switch. There's no dedicated small little processor that is driving that switch. What there is, is a PCI Express cable that is going from the Sidecar switch up to one of our compute sleds, and then that compute sled has access to that switch over PCI Express. And the joke initially was that this was just a gigantic network card that has 32 QSFP ports in the front and 32 going out the back. It's like a, you know, 12 terabit network card that is connected to our compute sled, and it got the nickname the Sidecar, as the switch was just kind of the sidecar hanging off one of the compute sleds. And so that's where the origin of that name is. But the Sidecar is actually what's providing all of the connectivity between the compute sleds in the rack, and from the compute sleds to the outside world.

Okay, thank you very much for clarifying that. So I think a lot of people will be at least familiar with something like an HTTP web server. And I suppose you could host, let's say you have your rack, it's in your data center. That does mean that you can also have on there somehow an HTTP web server. And for people to reach that, well, the clients will then use the server from a browser or whatever, they will connect to it.
And normally I would start with: you have your HTTP packet, and then below that, whatever, TLS, TCP. But in the case of Oxide, the story goes even much lower, right? Because before we even reach TCP, we have to reach the actual compute sled to begin with. So maybe we can talk a bit, starting from the perspective of the customer from within the browser trying to connect to one of these eventual compute sleds, and then do the entire traffic flow there.

Yeah, sure. That's a fantastic example, and we can just walk through a day in the life of those packets coming into an Oxide platform. When a client web browser or something like that is going to submit an HTTP request to an instance that's running on the Oxide rack, the first thing that it's going to hit is going to be our P4 programmable switch. Let's say that this rack is in a colocation facility, and let's just make it simple and say it's hooked directly to the backbone of the internet, right? Your Oxide rack is connected to an AT&T or Verizon backbone router, if you're so fortunate to do that. And that's where it is getting its packets from. And so there's a BGP session where the Oxide rack is announcing its IP space to that ISP router. So the ISP router is pointing to the Oxide rack for a particular IP space. And so when this HTTP request comes into the AT&T or Verizon router, it's like, okay, my next hop for this is going to be this Oxide switch hanging out over there. So it forwards that packet to our switch, and the packet is going to an address, let's just say 1.2.3.4, right? And so the first thing that the Oxide switch needs to do is figure out, what compute sled do I actually need to send this traffic to? And so for that, we need to pop up the stack a little bit.
We need to say, okay, some user or some owner of this Oxide rack has created an instance, a VM instance, that is actually going to be listening on that IP address. And so when you create instances on Oxide, much like any public cloud, you create the instances and then you can create floating or ephemeral IPs from an IP space that is associated with that instance. So for this packet walkthrough, we'll assume that some developer has launched an HTTP server with that external IP address 1.2.3.4, right? And when they do that, that's going to set up a whole bunch of state on the Oxide rack. So just going from the top down, the first piece of state that's gonna get set up is a NAT entry, a network address translation entry, in the Oxide switches themselves that basically says: when you see this 1.2.3.4 address, please send this to the following IPv6 address, where that IPv6 address belongs to the compute server that is actually hosting that user's instance. And so when the user created that instance, it's going to go set up all of that state. It's going to create a mapping between that 1.2.3.4 address and a physical address of the compute sled that is hosting that instance. Similar to the public clouds, when users create instances, they're created in what's called a VPC, or virtual private cloud. And that's an isolated little network that is just for a group of instances, within a project that a user is creating. When the P4 switch is getting that packet, it's not only looking at what the physical address is of the compute sled that it has to send this to, but it also needs to encapsulate that onto an isolated network that is just for that VPC.
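The switch-side NAT state described above boils down to a lookup from an external address to a sled address plus a VPC identifier. The following is a minimal sketch of that mapping using only the Rust standard library. The type and field names (`NatTable`, `NatTarget`, `sled`, `vni`) are hypothetical, the addresses are made up, and the real switch table lives in P4 match-action tables programmed by the control plane rather than in a `HashMap`.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv6Addr};

/// Where traffic for an external IP should be sent inside the rack.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NatTarget {
    /// Underlay IPv6 address of the sled hosting the instance.
    pub sled: Ipv6Addr,
    /// Virtual network identifier of the instance's VPC (24 bits).
    pub vni: u32,
}

/// External IP -> (sled address, VNI) mapping, filled in by the control plane.
#[derive(Default)]
pub struct NatTable {
    entries: HashMap<IpAddr, NatTarget>,
}

impl NatTable {
    /// Returns false (and stores nothing) if the VNI does not fit in 24 bits.
    pub fn insert(&mut self, external: IpAddr, target: NatTarget) -> bool {
        if target.vni > 0x00FF_FFFF {
            return false;
        }
        self.entries.insert(external, target);
        true
    }

    /// A miss means the rack has no instance behind this address.
    pub fn lookup(&self, dst: IpAddr) -> Option<NatTarget> {
        self.entries.get(&dst).copied()
    }
}
```

In the walkthrough's terms, creating the instance would insert an entry keyed by 1.2.3.4, and every inbound packet to that address would trigger a `lookup` before encapsulation.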
And when that user or that developer that set up this virtual machine that's running on the Oxide platform launched that instance, it was a part of a VPC, and the control plane is going to set up some state on that P4 switch that says, okay, not only does 1.2.3.4 belong to this particular compute sled and this compute sled address, it also has to have this VPC identifier. The way that we do VPC encapsulation on our physical networks inside the rack is through Geneve encapsulation. Our underlay network is 100% IPv6. And so we have Geneve over UDP over IPv6 for our underlay network. And so the P4 programmable switch at this point is basically gonna take that packet destined to 1.2.3.4. It's going to wrap it up in an IPv6 packet with a UDP header and a Geneve header that has the VNI that our control plane has calculated is supposed to be associated with that user's VPC. And it's going to then send that packet down to the compute sled that has that IPv6 address and is hosting that particular developer's instance that is running this web server. So the packet has now made it to the physical compute server, but these compute servers are hosting a bunch of virtual machines on them, you know, potentially dozens, quite a lot. And so we have to take a look at that packet when it comes into the kernel's networking stack and figure out where this packet is supposed to go. And that's where a system on the Oxide platform called OPTE, the Oxide Packet Transformation Engine, comes in. Basically, any time a packet hits the physical network interfaces of a compute sled, we have a kernel module. So OPTE is a kernel module.
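The Geneve wrapping mentioned above uses a small fixed header defined in RFC 8926: 8 bytes carrying a version, an option length, a protocol type, and the 24-bit VNI, sent over UDP port 6081. The sketch below encodes and decodes just that fixed header. Which protocol type Oxide places in the header is not stated in the conversation, so the use of `0x6558` (an inner Ethernet frame) in the usage is an assumption, and the function names are hypothetical.

```rust
/// IANA-assigned UDP destination port for Geneve.
pub const GENEVE_UDP_PORT: u16 = 6081;
/// Protocol type for an encapsulated Ethernet frame.
pub const ETHERTYPE_TEB: u16 = 0x6558;

/// Build the 8-byte fixed Geneve header with no options and no flags set.
/// Returns None if the VNI does not fit in 24 bits.
pub fn encode_geneve(vni: u32, protocol: u16) -> Option<[u8; 8]> {
    if vni > 0x00FF_FFFF {
        return None;
    }
    let p = protocol.to_be_bytes();
    let v = vni.to_be_bytes(); // v[0] is zero for any 24-bit value
    // Byte 0: version (2 bits) + option length (6 bits); byte 1: flags.
    Some([0x00, 0x00, p[0], p[1], v[1], v[2], v[3], 0x00])
}

/// Parse a Geneve header and return (VNI, protocol type, inner payload).
/// Options are skipped, not interpreted.
pub fn decode_geneve(buf: &[u8]) -> Option<(u32, u16, &[u8])> {
    if buf.len() < 8 {
        return None;
    }
    if buf[0] >> 6 != 0 {
        return None; // only version 0 is defined
    }
    // Option length is expressed in 4-byte words.
    let opt_len = usize::from(buf[0] & 0x3F) * 4;
    let protocol = u16::from_be_bytes([buf[2], buf[3]]);
    let vni = u32::from_be_bytes([0, buf[4], buf[5], buf[6]]);
    let payload = buf.get(8 + opt_len..)?;
    Some((vni, protocol, payload))
}
```

On the wire this header sits after the outer IPv6 and UDP headers, so the receiving sled can recover the VNI before it looks at the inner packet.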
It's listening directly on the network stack of the compute sled, and it's looking for this Geneve encapsulated traffic on these IPv6 addresses. It's going to parse every single packet that comes into it, to parse out the Ethernet, the IP, the UDP, the Geneve headers and whatnot, and then it is going to essentially look at all those headers and ask, is this a packet for a virtual machine that I know about? And how does it know about virtual machines? That all goes back to our control plane. So going back to when this developer set up this virtual machine instance, the control plane is going to say, okay, you're setting up this VM, and it's going to have the address 1.2.3.4 assigned to it. It's going to be a part of a VPC that's going to have this particular Geneve virtual network identifier. And so the control plane of Oxide is going to basically set up a virtual port that is connected to the hypervisor, with OPTE sandwiched in between the physical network and the hypervisor. And OPTE is going to perform processing on every single packet that is going between instances and the physical network. So OPTE is gonna receive this HTTP request for this web server, and it's going to go through several layers of processing. And so the first layer that it's gonna go through is firewalling. Developers are able to set up firewall rules through the Oxide platform that say basic things like only allow traffic on port 443 for SSL or, you know, port 22 for SSH if that's something that you want. So first off, those rules are gonna be consulted to see if this packet should even be allowed to go to that host.
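The firewall layer described above can be pictured as an ordered list of rules consulted per packet. Here is a minimal stateless sketch. The rule shape, the first-match-wins ordering, and the default-deny fallback are all assumptions made for illustration, and none of these names come from OPTE.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Proto {
    Tcp,
    Udp,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Allow,
    Deny,
}

/// One firewall rule: match on protocol and destination port.
#[derive(Debug, Clone, Copy)]
pub struct Rule {
    pub proto: Proto,
    pub dst_port: u16,
    pub action: Action,
}

/// Return the action of the first matching rule, or Deny if nothing matches.
pub fn evaluate(rules: &[Rule], proto: Proto, dst_port: u16) -> Action {
    rules
        .iter()
        .find(|r| r.proto == proto && r.dst_port == dst_port)
        .map(|r| r.action)
        .unwrap_or(Action::Deny)
}
```

With rules allowing TCP 443 and TCP 22, the HTTPS request from the walkthrough passes this layer, while a packet to any other port is rejected before any further processing.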
And then it's going to do some lookups in state that it has internally to figure out, so this was for 1.2.3.4 in this particular VNI, what is the interface internally in the operating system that I use to get that packet to that actual virtual machine. And so it's doing this layered processing. It's broadly based on Microsoft's VFP design. State is put into OPTE by the control plane. It's compiled state, where we have these pre-compiled match-action filters, as we call them in networking parlance, and the packets are going to go through all of these filters and processing engines on the way to the instance. And right before the packet goes to the instance itself, OPTE is going to de-encapsulate that packet. So it's going to take it out of that IPv6, UDP, Geneve wrapper packet. So it's just back to the raw IPv4 or IPv6 packet, or I guess in this case, 1.2.3.4, so the IPv4 packet that was being sent by the sender, and send that packet to the instance itself. Something else that is happening in here is that today in the Oxide platform, everything is actually NAT based. And so when you have an address 1.2.3.4, you don't actually see that address inside of the instance itself. What you'll see is a private VPC address, but OPTE is gonna perform network address translation to translate that 1.2.3.4 to be whatever the association is for the NAT target inside of that VM. And that allows us to do some nice things like live migration of virtual machines without network disruption and things like that. And yeah, OPTE is completely Rust based, running inside of the kernel, and Rust has been a huge benefit for us in doing this. The first benefit of Rust is memory safety, memory safety, memory safety. That is really the big thing for us. There's a whole class of bugs that we don't even have to worry about occurring. Obviously,
when you're writing kernel code in Rust, there are places where you have to use what's called unsafe code in Rust. But the thing is, that's a microscopic portion of our code base. And so if we engineer that part of the code base very, very carefully, we can lean on the guarantees that Rust provides us for the rest of the code base to really build up a robust substrate for doing this packet processing, which is, you know, processing things at 100 gigabits per second, at least, per sled. And so that's a lot of packets per second going through this platform. And Rust's ability to let us build things like no-cost abstractions, which one of my colleagues, Dave, likes to call obviously correct abstractions, that allow us to express code in simple ways, is really great. And for a good example of that, people can go take a look at something called ingot, I-N-G-O-T. It's a crate that was developed by my colleague Kyle for OPTE. And it's a very robust packet processing framework that he has put together in Rust that allows you to manage hybrid scenarios where you have both zero-copy and owned representations of packets at the same time. And you can go between zero-copy and owned packets without a lot of the pitfalls that you run into when you're manually doing that. And so that's been really nice. Another amazing thing about doing Rust in the kernel is that if you implement the GlobalAlloc trait to get the alloc subsystem in Rust, then you have all of the core subsystems in Rust that exist on top of that. And so you get things like vectors and B-tree maps and things like that that are just absolutely wonderful to have. And you get it inside the kernel, which is amazing. So that is a very quick tour of getting packets from point A to point B in terms of the data plane. If you want, we can talk a little bit about the control plane that's running inside the rack, or wherever you want to go from here.
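One way to picture the "zero-copy and owned at the same time" idea is a header view that borrows from the packet buffer until something needs to change, and only then takes its own copy. The sketch below does this with the standard library's `Cow` on a UDP header. It is a copy-on-write illustration only: it is not ingot's API, and the `UdpHeader` type and its methods are hypothetical.

```rust
use std::borrow::Cow;

/// A UDP header that is either a borrowed view into the packet buffer
/// (zero-copy) or an owned copy that can be modified freely.
pub struct UdpHeader<'a> {
    bytes: Cow<'a, [u8]>,
}

impl<'a> UdpHeader<'a> {
    /// Split the 8-byte UDP header off the front of `buf` without copying.
    pub fn parse(buf: &'a [u8]) -> Option<(Self, &'a [u8])> {
        if buf.len() < 8 {
            return None;
        }
        let (hdr, rest) = buf.split_at(8);
        Some((Self { bytes: Cow::Borrowed(hdr) }, rest))
    }

    /// Destination port lives in bytes 2..4, big-endian.
    pub fn dst_port(&self) -> u16 {
        u16::from_be_bytes([self.bytes[2], self.bytes[3]])
    }

    /// Rewriting a field converts the view into an owned copy first,
    /// so the original packet buffer is never modified.
    pub fn set_dst_port(&mut self, port: u16) {
        self.bytes.to_mut()[2..4].copy_from_slice(&port.to_be_bytes());
    }

    pub fn is_borrowed(&self) -> bool {
        matches!(self.bytes, Cow::Borrowed(_))
    }
}
```

The read path stays allocation-free, and the borrow checker ensures the borrowed view cannot outlive the buffer it points into, which is the kind of pitfall that is easy to hit when managing the two representations by hand.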
Yeah, well, I also want to clarify, because you mentioned, for example, those firewall rules, and that was within the context of OPTE. Is it also the one responsible for those things, or what is doing that?

Yeah, so OPTE is the one that's implementing the firewall rules. The reason for that design choice is that we wanted to distribute the firewall across the compute sleds so we didn't have a bottleneck for implementing the firewall rules. OPTE is also able to do stateful firewalling. And so when it needs to be able to do things that depend on TCP session state, for example, we can readily do that in OPTE. That is something that is very difficult to do in P4 on a switching ASIC, because P4 is mostly a stateless language. You operate a packet at a time in P4, and keeping state between packets is very tricky, if not impossible, depending on exactly what you're trying to do inside of P4. But we can readily do it inside of OPTE. The performance we've been able to get in OPTE has been quite remarkable, surprising even to me. For the up-down times for a packet going through OPTE, we're sub-microsecond. And so it's been amazing.

Okay, and so you mentioned stateful. Do I have to imagine something like rate limiting, or where do I need the state in my firewall rules?

So for example, say you want to have a firewall rule that says you want to allow incoming TCP traffic, but only if it's been initiated from within the instance itself, right? So basically only outbound. So if you have some code that's running inside of your instance that wants to reach out to the internet at large, but you don't want the internet to be able to reach out to your instance, then that relies on being aware of the TCP protocol state and making sure that the directionality of that connection is correct.
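The outbound-only rule described above needs memory between packets: an inbound packet is allowed only if the instance previously opened a flow to that peer. A minimal sketch of that bookkeeping follows. The `ConnTracker` name is hypothetical, and this deliberately ignores what a real stateful firewall also has to track, such as TCP flags, connection teardown, and timeouts.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Remembers flows that were opened from inside the instance.
#[derive(Default)]
pub struct ConnTracker {
    /// (inside address, outside address) pairs seen on outbound packets.
    flows: HashSet<(SocketAddr, SocketAddr)>,
}

impl ConnTracker {
    /// Record an outbound packet from the instance to a remote peer.
    pub fn outbound(&mut self, inside: SocketAddr, outside: SocketAddr) {
        self.flows.insert((inside, outside));
    }

    /// Inbound traffic is allowed only as a reply on a known flow.
    pub fn allow_inbound(&self, outside: SocketAddr, inside: SocketAddr) -> bool {
        self.flows.contains(&(inside, outside))
    }
}
```

A per-packet, table-driven language has nowhere convenient to keep the `flows` set, which is the difficulty with doing this in P4 that the conversation points to.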
Interesting. And then, I imagine you went over it, but we didn't mention it by name: you have this project called Maghemite, which is the upper half of the routing, about discovering other routers. And you mentioned of course at the beginning that in order for that packet to arrive at your compute sled, or even at your switch, the route would first have to have been set up via BGP, so I suppose that's where Maghemite comes in.

Yeah, so Maghemite is our routing stack, which is also written entirely in Rust and also completely open source. And like I think I said before, we have our external routing protocols there, our BGPs, our BFDs, our static routing interfaces for the rack. And then we also have our internal routing protocol called DDM, Delay Driven Multipath. If folks are interested in reading about DDM, we have a public RFD. RFD stands for Request for Discussion. It's similar in spirit to the internet RFC concept. DDM is described in RFD 347. But Maghemite is the Rust repository that has the implementations of all of these routing protocols. Routing in the Oxide platform is a little bit different than the general control plane, in that it's not directly built into Omicron, and the reason for that is that Omicron actually depends on the routing platform to come up. So when I was talking earlier about when the rack first gets powered on, the sled agents are going to come online on each one of the compute sleds, and they're going to phone home to Nexus, which is driving the workflows around all of those sled agents, and things are going to start to communicate within the rack. But we're not layer two, broadcast domain driven, at all inside of the rack. It's all layer three and it's all routed. And so it actually depends on routing tables getting set up on the nodes themselves and on the switches.
And so we run our DDM routing instances on all of the compute sleds that are inside the rack. Then additionally, DDM is running on top of the switch. And so when the compute sleds first come online, DDM, which is part of the base operating system image, is gonna come up and work to discover what switches it's connected to. The DDM that's running on the switches is gonna discover all of the compute sleds that it's connected to. And it'll autonomously build what we call the IPv6 bootstrap network, which allows everybody to communicate with one another. And so that's the bones of how that is built. But then inside of Maghemite itself, there are, I guess, a few interesting properties that distinguish it from a lot of the routing stacks that are out there today. One of the really popular routing stacks that's out there is FRR, which is built on a bunch of different daemons. You have your BGP daemon and your OSPF daemon and your BFD daemon, and then you have your zebra daemon below all of this, which is synchronizing state from what we call the upper half, the protocol daemons, down to zebra, which is the lower half that's synchronizing state down onto the Linux kernel's routing tables or onto a switching ASIC, something like that. With Maghemite, we've actually decided to take a little bit more of a monolithic approach. And so mgd, which is our daemon that implements all of our external routing protocols, actually implements all those protocols in one process. And the lower half is also implemented in one process. And we leverage a lot of the concurrency machinery that is inside Rust to be able to do that in a robust and safe way. I know a lot of the arguments behind putting together routing stacks across different daemons are about having different failure domains, saying, okay, if your BGP daemon fails, it doesn't affect your OSPF daemon, what have you. I think that architecture is somewhat an artifact of the time in which it came around. Whereas today, when we have programming languages like Rust that provide strong memory safety guarantees and we can write really, really robust software, I think it does present the opportunity to create more monolithic architectures for routing stacks, and that's what we've done in Maghemite. And we have very tight integration between our routing protocol daemons, our upper halves, and the lower half being the synchronizer, and they all share a common routing information base. We've built a message bus based on Rust channels, and so we have a nice channel communication model between the different routing daemons, and between the routing daemons in the upper and the lower half, and it has really enabled us to build a really cool architecture there that has turned out to be quite robust so far. And yeah, if folks are interested in taking a look at Maghemite, it's out in open source and available.

Hmm, very well. We will need to put it in the show notes. Now, two questions I also have around all of this. One: you're dealing with a lot of different protocols and types, for example, all the different network types, like an IP address, but many more of these kinds of things. Do you have your own crates to define all these types, or how does it work?

Yeah, so the things that are specific to the protocols themselves, like the types that are associated with BGP messages and things like that, that's definitely defined in the Maghemite crate. One of the cool things that we were able to do with Maghemite is we actually used a combinator parser framework called nom for parsing BGP messages, which can actually become quite complicated. But nom, or combinator parsing in general, allowed us to do that parsing of these very complex messages in a safe way. And so that's been very nice.
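To give a feel for what parsing a BGP message involves, here is a hand-written sketch of just the fixed 19-byte message header from RFC 4271: a 16-byte marker of all ones, a 2-byte length between 19 and 4096, and a 1-byte type. It uses plain slices from the standard library rather than nom, so it is not how Maghemite does it, the names are hypothetical, and it only recognizes the four message types defined in RFC 4271 itself.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BgpType {
    Open,
    Update,
    Notification,
    Keepalive,
}

#[derive(Debug, PartialEq)]
pub struct BgpHeader {
    /// Total message length in bytes, header included.
    pub length: u16,
    pub kind: BgpType,
}

/// Parse the fixed header and return it along with the message body.
/// Returns None on a bad marker, bad length, short buffer, or unknown type.
pub fn parse_bgp_header(buf: &[u8]) -> Option<(BgpHeader, &[u8])> {
    if buf.len() < 19 {
        return None;
    }
    // The marker field is 16 bytes, all set to 0xFF.
    if buf[..16].iter().any(|&b| b != 0xFF) {
        return None;
    }
    let length = u16::from_be_bytes([buf[16], buf[17]]);
    if !(19..=4096).contains(&length) || buf.len() < usize::from(length) {
        return None;
    }
    let kind = match buf[18] {
        1 => BgpType::Open,
        2 => BgpType::Update,
        3 => BgpType::Notification,
        4 => BgpType::Keepalive,
        _ => return None,
    };
    Some((BgpHeader { length, kind }, &buf[19..usize::from(length)]))
}
```

Even this small piece needs four separate validity checks, and UPDATE bodies nest variable-length attributes inside variable-length lists, which is where composing small parsers with a combinator library pays off.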
But getting to your question of generic and common network data structures: I mean, the Rust standard library has some of the most common stuff that you're going to run across, in terms of IP addresses and things like that. For stuff that's not covered in that, we do have a crate called oxnet that is available in our GitHub that covers the stuff that's not covered in the Rust standard library. So it has things like prefixes, MAC addresses, things like that, in a way that is useful for us, in the sense that they provide Serde implementations, so they're easy to serialize and deserialize. And they provide JSON Schema implementations, so we can plug them into our OpenAPI infrastructure. And again, two more wonderful Rust crates that have come out of Oxide are Dropshot and Progenitor, which are two crates that we have for building HTTP-based services. And so the types that we provide in our oxnet crate have machinery to easily integrate into those web server type crates.

Yeah, and I believe, as I'm also a listener to your Oxide and Friends podcast, that you mentioned this technology at least once, maybe twice or even more, so maybe we can also link to those episodes, as I believe you already go very much in depth on what Dropshot is.

Yeah, yeah, absolutely. I mean, yeah, I can speak to Dropshot a little bit. Basically, if you look at frameworks that are out there like Thrift or gRPC and things like that, they take an IDL-first, Interface Definition Language first, type of approach, where you have the IDL and then you can generate a server in your language of choice, whether that's Rust or Go or Python, what have you. Dropshot flips that model on its head, where the code that you write in Rust is the source of truth. And then we generate an OpenAPI spec from the code. And so it's just a little bit of a different model that has worked out really, really well for us.
And all of the stuff that I've talked about today, in terms of Dendrite and Maghemite and these network control plane services inside the rack, all have Dropshot-based interfaces, and that's how we communicate inside the rack.

Okay. So for example, let's say someone has their own web framework. How easy will it be for them to also support generating OpenAPI using Dropshot for their own framework?

I think quite easy. I think Dropshot is one of the more popular Oxide crates, with pretty wide uptake outside the walls of Oxide. And so yeah, it's pretty straightforward, I think. And then the complementary crate to Dropshot is Progenitor. Progenitor basically consumes OpenAPI specifications and then generates Rust clients for those OpenAPI specifications, so you don't have to manually do that.

And again, let's say you want to generate a client using your own framework. Is that something that you need to publish upstream or try to patch inside of the actual repo? Or is that something where you can also just have your own community crates to support other clients?

You should just be able to have your own. So the OpenAPI specifications for the Oxide product itself are all published. People have built Python clients around them, have built Go clients. I think at Oxide we do actually maintain a Go client ourselves as a mechanism for providing Terraform provider support for the platform. But yeah, the OpenAPI specs that are generated by Dropshot should be readily consumable by any code generation platform that understands OpenAPI 3.

Okay, very exciting. Now, as you mentioned, you built on top of illumos, which is, I believe at least, a Solaris-derived system. I mean, it's not old, but it's not new either. And especially with an eye on the future, I'm wondering, given how much you already do in Rust, is there ever...
I imagine there must be talks within the company about having an entire Rust-based operating system. And then of course, maybe without even reinventing the wheel: we had a conversation with some of the Google folks in episodes 8 and 10, and they are developing Fuchsia, which seems like it might be up to the task as an operating system for the kind of stuff Oxide does. Or do I see that wrong?
Yeah, so there's been more than talk at Oxide about building a Rust-based operating system. We have actually built one. And I'm using "we" in the sense of Oxide; I had absolutely nothing to do with it. We have an operating system called Hubris, which is a microcontroller operating system written entirely in Rust that runs on what we call our service processor. If you look at servers today, most are going to have some type of management platform on them, whether that's a BMC providing an IPMI interface, or a Dell system providing iDRAC, or HPE's iLO system. They all have this kind of out-of-band management platform that manages the power state of the server, provides you console access, helps manage the thermals of the platform, things like that. We did not want to use a third-party BMC-style component for that, because we viewed these systems as one of the worst parts of dealing with today's technology systems. And so we built our own platform. We use a processor, the STM32, I believe, a little ARM processor, as our service processor on the platform. And that runs the open source, Rust-based Hubris operating system, which drives the entire Oxide rack at a platform and infrastructure level.
Okay. And so, especially speaking as a non-expert, what stops you, once you have dipped your toes in that, from also replacing something like illumos with something similar?
I mean, they're whole different worlds of complexity, right? One is a microcontroller operating system, and the other, illumos, is a general-purpose operating system that is able to drive the most complex SoCs coming out of AMD today, which have a lot of complicated mechanisms inside them, like DMA and PCI Express, along with multitasking and user management. Our microcontroller operating system is kind of a statically scheduled operating system, or an operating system based on a static task structure. Building a general-purpose operating system is a humongous lift. And you probably got into that when you were talking to the Google folks about Fuchsia. Oxide definitely doesn't have Google-scale resources at this point to throw behind a project like that. But more to the point, I think, illumos is actually serving us quite well. It is a very stable operating system, and we have a core team of really good kernel engineers who are working on it and making it better every single day. It's moving forward very quickly and we're quite happy with it. And we're writing Rust kernel modules for it all the time. OPTE is our first foray into writing Rust-based kernel modules for illumos. Our next-generation NIC, which will be coming out probably sometime next year, is completely Rust-based in terms of its driver stack. So we're pushing Rust into the kernel, but we don't necessarily see a strong need at this point in time to replace 20-plus years of stability that we have from illumos.
Okay, now I totally understand, and that makes sense to me.
Now, when you say you push these kernel modules, is that something you can just plug into your own version of the OS, or is it something you have to upstream as well?
So OPTE is not in the kernel tree of illumos. It is open source, but it's not an in-tree kernel driver. We have had discussions at Oxide about what it means to have Rust source code in-tree inside of illumos, and those discussions are ongoing. But for right now, illumos provides a very stable interface to drivers through what it calls its DDI interfaces. I think that stands for device driver interface. That's what our drivers consume inside the kernel. That has worked out quite well, and because that interface has been so stable, it actually reduces the need to have in-tree kernel drivers for these types of things. You can just write your Rust module, put it in the spot where the kernel is going to pick it up and dynamically load it, and you're off and running. It's actually a delightfully simple workflow for driver development.
Okay, very cool. It's definitely something I should play with at some point. Now, given your background, I especially want to ask this: what kind of testing do you do now, and how did your background help with that, in terms of, I don't know, chaos testing or stress testing, all kinds of tests, I imagine?
Yeah, my background, dating all the way back to, I guess, my mathematics background, gives me a strong desire to model things. And so one of the tools that we've developed at Oxide is something called Falcon, also out there in open source, which allows us to develop emulated models of the Oxide rack, but not just a rack in isolation.
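To make the idea of modeling a rack and its neighbors in Rust a little more concrete, here is a minimal, std-only sketch of a programmatic topology description. Every name in it (`Topology`, `NodeKind`, `example_rack`) and the two-switch, dual-homed layout are assumptions made for illustration; this is not Falcon's API, and it only builds an in-memory graph rather than launching virtual machines.

```rust
/// Role of an emulated node (hypothetical categories for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    Sled,
    RackSwitch,
    UpstreamRouter,
}

/// Index of a node inside a `Topology`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(usize);

/// In-memory description of nodes and point-to-point links between them.
#[derive(Debug, Default)]
pub struct Topology {
    nodes: Vec<(String, NodeKind)>,
    links: Vec<(NodeId, NodeId)>,
}

impl Topology {
    /// Add a named node and return its handle.
    pub fn node(&mut self, name: &str, kind: NodeKind) -> NodeId {
        self.nodes.push((name.to_string(), kind));
        NodeId(self.nodes.len() - 1)
    }

    /// Add a bidirectional link between two nodes.
    pub fn link(&mut self, a: NodeId, b: NodeId) {
        self.links.push((a, b));
    }

    /// Number of nodes of a given kind.
    pub fn count(&self, kind: NodeKind) -> usize {
        self.nodes.iter().filter(|(_, k)| *k == kind).count()
    }

    /// Nodes directly linked to `id`.
    pub fn neighbors(&self, id: NodeId) -> Vec<NodeId> {
        self.links
            .iter()
            .filter_map(|&(a, b)| {
                if a == id {
                    Some(b)
                } else if b == id {
                    Some(a)
                } else {
                    None
                }
            })
            .collect()
    }

    /// Depth-first walk from the first node; true if every node is reachable.
    pub fn is_connected(&self) -> bool {
        if self.nodes.is_empty() {
            return true;
        }
        let mut seen = vec![false; self.nodes.len()];
        let mut stack = vec![NodeId(0)];
        seen[0] = true;
        while let Some(n) = stack.pop() {
            for m in self.neighbors(n) {
                if !seen[m.0] {
                    seen[m.0] = true;
                    stack.push(m);
                }
            }
        }
        seen.iter().all(|&s| s)
    }
}

/// Assumed example layout: one upstream router, two rack switches,
/// and `sleds` sleds each linked to both switches.
pub fn example_rack(sleds: usize) -> Topology {
    let mut t = Topology::default();
    let upstream = t.node("upstream", NodeKind::UpstreamRouter);
    let sw0 = t.node("switch0", NodeKind::RackSwitch);
    let sw1 = t.node("switch1", NodeKind::RackSwitch);
    t.link(upstream, sw0);
    t.link(upstream, sw1);
    for i in 0..sleds {
        let sled = t.node(&format!("sled{i}"), NodeKind::Sled);
        t.link(sled, sw0);
        t.link(sled, sw1);
    }
    t
}
```

Describing the topology as ordinary Rust values means it can be checked before anything is deployed, for example by asserting `example_rack(4).is_connected()` in a test. A real emulation framework would go further and turn such a description into running virtual machines and links.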
We can model racks that are interconnected to upstream networks through virtual switches from Arista, Cumulus Linux virtual switches, Cisco virtual switches, things like that. It's a Rust-based framework where we can programmatically define what the topology looks like in terms of the upstream network routers and the Oxide platforms themselves, deploy Oxide software onto that emulated hardware substrate, and then run workloads over it and stress our control plane to the max that we possibly can. And I guess my biggest thing about testing is that I like to see things working end to end. This type of environment allows you to describe an end-to-end environment in Rust, and it generates a CLI where you can say, "please create this environment for me", and then it uses the Propolis hypervisor that we have to create that environment as an interconnected set of virtual machines connected by our P4 switches and things like that. It allows us to exercise a whole lot of our system: the entire control planes, our routing stacks, our forwarding planes, all of that type of stuff. So I do take a lot of my testbed background to work with me here at Oxide in trying to create these environments where we can really test things end to end, and it's really a multi-fidelity approach. This environment is our starting point, where we get a lot of our initial work done. And at Oxide we obviously have hardware resources where we can deploy new versions of our entire stack onto a little mini rack, or a set of mini racks, or a full rack in a data center. So we have these different levels of fidelity: developing first in my virtual environment, then moving to a mini rack, then to a set of mini racks, then to a full-blown rack. Each of these has increasing costs in terms of the time involved to interact with the environment. And so that approach of increasing fidelity but also increasing cost, or, more importantly, decreased cost when you're first trying to get things up and working, has been extremely valuable for us.
Okay, that was very exciting, and I believe it's also about time to wrap up, as we've covered a lot and there's a lot of content to digest. Given that, is there something you think we forgot to discuss that you'd still like to cover now? Maybe some part of the stack or some other project that we didn't mention yet, or, I don't know, whatever you want to share.
I guess we could close with a few future-looking things. Right now we're actively working on our next-generation hardware, both for our switching ASIC and for our network interface cards, which we're calling our communication processor for the sleds. A big part of doing that is taking a real hard look at the hardware-software interface and asking ourselves: can we do better there? How can we introduce more robustness at the hardware-software interface? And there are really two interfaces that we're looking at. One is the instruction set architecture, or the ISA: for these new chips that we're going to be taking on for the next version of our product, how do we build infrastructure around the instruction set architecture to build really good assemblers, really good debuggers, really good tracers?
And so we can really, really understand at runtime when things go sideways, which they always do, no matter how good a job you do in developing things: how do we understand those things through the architectural state of those processors? We've recently open sourced the first effort around that, which is called ISF. It's an open source Oxide repository. It stands for Instruction Set Specification Format. It basically allows you to describe an ISA, and then it generates Rust code around that ISA that can act as a foundation for building things like debuggers, assemblers, and tracers. The other side of the hardware-software interface that we're taking a hard look at is the register interface for devices. We have a repository called RSF out there, which stands for register specification format. It allows you to rigorously describe the register interface for a particular hardware device and then generate Rust code, actually no_std Rust code that can run in a kernel, that provides memory-safe, and in particular sub-byte memory-safe, interfaces for interacting with those devices in a robust way. It takes away some of the pitfalls that you often see with register interfaces, like manual address manipulation, by building the addressing model into the specification for the register interface and then building clients around that with a well-defined process. So you'll never fat-finger an address or get something like a sub-byte boundary wrong. So that's RSF, available in our GitHub repositories. Those are two things we're working on at the hardware-software interface for the next generation of hardware we're bringing out, and they bring a lot of the benefits of Rust to bear at that part of the stack.
Okay, very cool. Thank you for sharing that.
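To illustrate what a typed, sub-byte-safe register interface can look like, here is a small hand-written sketch. The `CtrlReg` layout, field names, and `FieldOverflow` error are all hypothetical and do not come from RSF or from any real device; RSF is described as generating such code from a specification, whereas this example only shows the general shape of the accessors.

```rust
/// Hypothetical 32-bit control register (not a real device layout):
///   bit 0        enable
///   bits 1..=3   mode        (3 bits)
///   bits 8..=15  queue depth (8 bits)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct CtrlReg(u32);

/// Returned when a value does not fit in the target field.
#[derive(Debug, PartialEq, Eq)]
pub struct FieldOverflow {
    pub field: &'static str,
    pub value: u32,
    pub width: u32,
}

impl CtrlReg {
    const MODE_SHIFT: u32 = 1;
    const MODE_WIDTH: u32 = 3;
    const DEPTH_SHIFT: u32 = 8;
    const DEPTH_WIDTH: u32 = 8;

    /// Extract `width` bits starting at `shift` (width is always below 32 here).
    fn get(self, shift: u32, width: u32) -> u32 {
        (self.0 >> shift) & ((1u32 << width) - 1)
    }

    /// Write a field, refusing values wider than the field so that
    /// neighboring bits can never be clobbered.
    fn set(
        &mut self,
        field: &'static str,
        shift: u32,
        width: u32,
        value: u32,
    ) -> Result<(), FieldOverflow> {
        let mask = (1u32 << width) - 1;
        if value > mask {
            return Err(FieldOverflow { field, value, width });
        }
        self.0 = (self.0 & !(mask << shift)) | (value << shift);
        Ok(())
    }

    pub fn enable(self) -> bool {
        self.get(0, 1) == 1
    }

    pub fn set_enable(&mut self, on: bool) {
        self.0 = (self.0 & !1) | u32::from(on);
    }

    pub fn mode(self) -> u32 {
        self.get(Self::MODE_SHIFT, Self::MODE_WIDTH)
    }

    pub fn set_mode(&mut self, v: u32) -> Result<(), FieldOverflow> {
        self.set("mode", Self::MODE_SHIFT, Self::MODE_WIDTH, v)
    }

    pub fn queue_depth(self) -> u32 {
        self.get(Self::DEPTH_SHIFT, Self::DEPTH_WIDTH)
    }

    pub fn set_queue_depth(&mut self, v: u32) -> Result<(), FieldOverflow> {
        self.set("queue_depth", Self::DEPTH_SHIFT, Self::DEPTH_WIDTH, v)
    }

    /// Raw value, as it would be written to the device.
    pub fn bits(self) -> u32 {
        self.0
    }
}
```

Callers of this sketch never write shifts or masks themselves: `set_mode(8)` fails with a `FieldOverflow` instead of silently spilling into the adjacent bits. The code uses only core language features, so the same pattern should carry over to a `no_std` setting, though that is not demonstrated here.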
And it also reminded me that you have these RFDs, as you mentioned earlier in the conversation, and they are very interesting and very elaborate. For some things it makes sense that they are just RFDs, because they are quite specific or more internal. But if I look at something like DDM, the delay-driven multipath, I wonder: are there ever conversations about actually making an RFC for it?
Yeah, we've kicked the idea around a little bit. When we get more experience and more widespread DDM deployments, I think it would definitely make sense at some point to make an RFC out of it, for sure.
Okay, very cool. With that covered, I would like to thank you for coming on our show today. It was a great pleasure. I learned a lot about a world that I'm definitely not that familiar with.
Yeah, well, thank you. Thank you very much for having me. It's a pleasure to come on.
Elizabeth (Plabayo)
1:27:27 | 🔗
Netstack.fm is brought to you by Plabayo, building secure, open, and resilient infrastructure with Rust, protocols, and purpose. This show is also made possible by Rama, the open source networking framework. Plabayo offers service contracts and welcomes sponsorships to keep building and supporting its ecosystem. The theme music of this podcast was composed by DJ Mailbox. If you enjoyed this episode, don't forget to subscribe on your favorite podcast platform and leave a five-star review. It really helps others discover the show. Thanks for tuning in. We'll see you next time for the next handshake.