In this remastered episode, Chris Batterbee, co-founder of Metoro, discusses the importance of observability in modern software systems, particularly in Kubernetes environments. He explains how Metoro leverages eBPF technology to simplify observability by automatically instrumenting applications. The discussion also covers the integration of OpenTelemetry, the challenges faced by developers in implementing observability, and the potential of AI in diagnosing issues. Chris shares insights from his experience with Y Combinator and the competitive landscape of observability tools, emphasizing the unique position of Metoro in the market.
Learn more about OpenTelemetry, eBPF and Metoro:
If you like this podcast you might also like our modular network framework in Rust: https://ramaproxy.org
00:00 Intro
00:34 Intro to Chris Battarbee
02:36 Understanding eBPF and Its Role in Observability
10:13 Integrating OpenTelemetry with eBPF
15:05 Challenges and Experiences with OpenTelemetry
22:28 Future of eBPF and OpenTelemetry in Different Environments
27:53 Y Combinator Experience
29:31 Insights from Y Combinator Experience
32:31 Networking and Community in Y Combinator
34:43 Raising Funds Post-Y Combinator
35:39 Post-Y Combinator Relationship with YC
36:27 OpenTelemetry Pain Points
38:31 The Future of OpenTelemetry and Standards
39:47 Competition with Major Cloud Providers
41:40 Understanding eBPF and Its Challenges
43:24 Prometheus vs. OpenTelemetry
44:40 AI Integration in Telemetry
45:36 Metoro
49:03 Profiling
51:43 Metoro Payment Models
53:57 Customer Engagement and Support
55:09 Outro
Music for this episode was composed by Dj Mailbox. Listen to his music at https://on.soundcloud.com/4MRyPSNj8FZoVGpytj .
This is a remastering of episode 3. It was originally recorded and aired in 2025 and is the last of a series of episodes that we have remastered to improve the audio quality. We hope you enjoy this episode about OpenTelemetry and Metoro. Elizabeth (Plabayo)
0:34 | 🔗
This is netstack.fm, your weekly podcast about networking, Rust and everything in between. You are listening to episode three, recorded on the 29th of August, 2025, where Glen has a conversation with Chris Batterbee, co-founder of Metoro, a company providing instant in-depth observability for services running in Kubernetes.
This is another episode and another week of netstack.fm. Today with me is Chris. Welcome.
Yeah, fine.
Thank you for accepting our invitation. Today we will talk about OpenTelemetry and about Metoro, the company that Chris co-founded. We will begin by asking: what's your background story, Chris?
Yeah, so I'm a software engineer primarily. I've done a few different things. I worked at Palantir out of university, doing a lot of Kubernetes stuff there, mainly around scheduling really large compute jobs: how do you get those to run efficiently over large distributed clusters, primarily in Kubernetes, but with some other schedulers as well. Since then I went on to found Metoro, which is primarily a company that does observability for Kubernetes: how can we make observability in Kubernetes the easiest thing possible? Primarily based on my experiences at Palantir, yeah.
Okay, so can you maybe tell a bit more about that experience at Palantir?
Yes, I mean, Palantir runs a lot of different compute jobs. A huge amount of it is based on Spark, so a lot of big data stuff. You have huge amounts of data coming in, and you want to run custom code on that to produce different data artifacts at the other side. A huge bit of that is done with Spark. I worked on a team which was responsible for orchestrating that compute. Essentially we get things in which are custom applications, but based on Spark. We then need to make sure that they are run across large compute clusters that we scale up effectively.
We handle any issues with communication, nodes going offline, nodes coming online, and the whole orchestration flow around that. While I was doing that, I realized that you need pretty good observability to be able to drill into things and figure out exactly what's going wrong in certain situations. And I thought there was a lot you could do there without having to do a lot of manual instrumentation. That's really where Metoro was born: can we use some cool technology, primarily eBPF, which is some tech inside the Linux kernel, to automatically instrument a lot of applications for you?
Okay, very interesting, and I've seen that work. It's pretty magical, and as you said, there's not a lot you have to do out of the box. It just seems to work. Now, at what point did you add OpenTelemetry to the mix?
Yeah, so eBPF is great. Maybe I should explain a little bit about how that works in case people aren't familiar. Essentially, it's some technology in the Linux kernel which lets you run code that hooks in when certain events happen. In our case, we monitor read and write syscalls. So whenever you write something to the network, we inspect that write and see: does this match any protocol that we instrument? In this case, HTTP, MySQL, Postgres, all these L7 protocols. We look at the data that's being written, and if it matches the standard specification for those protocols, then we know that someone's trying to make an HTTP call to a different service, and we pull that information out. We inspect essentially the headers and things about the body to then produce a trace from that. It means that we can do this for any arbitrary application, because it's using eBPF and operating at the kernel level. So you don't have to change your code, and that's primarily the reason.
Yeah. So we do some TLS decryption stuff.
So in that case, there's a bunch of standard libraries for TLS, and we hook the function calls to those. When you write the unencrypted content into the library, we look at that content before it gets encrypted. That's how we can deal with TLS. It works for a number of different TLS libraries, and thankfully most applications use the same TLS libraries under the hood.
Does that work even if it's TLS encrypted? So I suppose you mean things like kTLS, OpenSSL, stuff like that.
Exactly. Yeah, BoringSSL, these sorts of things.
And then let's say someone would use an entirely different implementation of TLS, like Rustls. Would that still work, or is that different?
Yeah, so for any new implementation of TLS, we'll have to do some additional instrumentation on our side. Thankfully there are very few TLS implementations. BoringSSL is the big one which is used in Rust in my experience, but there are some other ones coming up as well, I think.
Okay, very cool. And so you have all of that data. Of course we are an audio-only podcast, so we can't show you how it looks, but we will put some links to your website, where you have some demos, I imagine. But what can people expect out of the box if they have this? Say they have a couple of services, a database, an entire setup. What can they expect to see or do with that data?
Yes, so primarily you'll be able to see all these services and every single call they make to any other service, as long as they're using a supported L7 protocol: gRPC, HTTP, Postgres, MySQL, and we have a bunch of other ones. As long as you're using those protocols, you can expect to see pretty much every call that's being made between those applications. And that's a really great starting point.
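The protocol matching described above (look at the bytes handed to a write syscall and check whether they follow a known L7 format) can be illustrated in user space. This is a minimal sketch only, assuming the bytes have already been captured; it is not Metoro's code and not an eBPF program. The function name `classify_write` and the record fields are hypothetical, and only HTTP/1.x is handled.

```python
import re
from typing import Optional

# Request line of HTTP/1.x, e.g. b"GET /users HTTP/1.1\r\n"
HTTP_REQUEST = re.compile(
    rb"^(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS) (\S+) HTTP/1\.[01]\r\n"
)
# Status line of HTTP/1.x, e.g. b"HTTP/1.1 200 OK\r\n"
HTTP_RESPONSE = re.compile(rb"^HTTP/1\.[01] (\d{3}) ")


def classify_write(buf: bytes) -> Optional[dict]:
    """Return a small span-like record if buf looks like HTTP/1.x, else None.

    Hypothetical helper: a real agent would also handle partial writes,
    other protocols (gRPC, MySQL, Postgres) and timing information.
    """
    m = HTTP_REQUEST.match(buf)
    if m:
        return {
            "protocol": "http",
            "kind": "request",
            "method": m.group(1).decode(),
            "path": m.group(2).decode(),
        }
    m = HTTP_RESPONSE.match(buf)
    if m:
        return {"protocol": "http", "kind": "response", "status": int(m.group(1))}
    # Not a protocol this sketch recognises (for example an encrypted TLS record).
    return None
```

A buffer that starts with a TLS record header falls through to `None`, which is why the conversation turns to hooking the TLS library before encryption.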
But to your question on why we bring in OpenTelemetry and why we started using OpenTelemetry as well: it's around joining those together. What we can give you with eBPF is the individual calls. In terms of tracing, that corresponds to an individual span. So you'll be able to see that microservice A talks to microservice B. That normally happens in a chain: A will talk to B, which will then talk to C, and maybe that will talk to a database on the back end. When you want to tie all those together, that's very hard to do at the level of eBPF, because a lot of the time you do things like cross thread boundaries. Let's say a request comes in, it gets put onto a thread pool, and then a thread on that thread pool makes another request out. That tracking across thread boundaries is really hard to do at the kernel level. So when you want to start to do this more intricate instrumentation, and you want to get really detailed and add your own spans, that's when OpenTelemetry comes in. The idea here is that we give you the base stuff, and then when you want to add more specific instrumentation on top, that's when you can use OTEL.
Okay, yeah, and of course a lot of languages these days also manage their own threads. At the core they might use operating system threads, but within those threads they might do all kinds of what they call green threads. So I suppose you will never have access to those anyway.
Yes. Yeah, you can do some funky things. Say they're using a VM: you can try to traverse the memory of the VM to understand what the actual, I guess I'm going to call them language-level threads are on top of the actual threads. And there are some cool things that people do, actually.
There's a very interesting project called DeepFlow which tries to do distributed tracing using eBPF, and they do that by tracking threads. If a request comes in on one thread and the same thread makes a request out, they can link those two things and say this is part of the same request. They can also do that for Go, which has its own implementation of threads on top. So they can do certain things like this, but it starts to get really tricky when you have things explicitly crossing thread boundaries as well, which is why at that point you're often better off doing something yourself rather than relying on auto-instrumentation.
Yeah, exactly. So we, for starters, are very excited about OpenTelemetry, because we've all run production setups, and things go wrong. It's these kinds of edge cases which used to be hard to reproduce, but once you have telemetry at runtime at any time, they just become very visible. At the same time, it still feels like a very immature ecosystem. How was your experience adopting this tech?
Yes, OpenTelemetry is, I want to say, a very verbose system. As an implementer of it, it takes a lot of work to translate between different formats, and there's a lot of different libraries. As someone using OpenTelemetry, I think the experience is much better. But as you mentioned, I think your experience is coming from Rust, right? Essentially the implementation of OpenTelemetry for Rust.
Yeah, I've mainly used it in Rust, in JavaScript, and in Go, but my core focus is definitely Rust.
Yeah, and I think depending on the language that you're using, they all have their own implementations, and they're all at various stages across this sort of stable lifecycle, right?
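The thread-tracking idea mentioned earlier (link an outbound call to the inbound request most recently seen on the same thread) can be sketched in a few lines, along with the case where it stops working. This is an illustration of the heuristic as described in the conversation, not the algorithm of any particular project; `Event` and `link_by_thread` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    thread_id: int
    direction: str  # "in" for a request received, "out" for a request sent
    name: str


def link_by_thread(events: List[Event]) -> List[Tuple[Optional[str], str]]:
    """Pair each outbound call with the latest inbound request on the same thread.

    Returns (parent, child) pairs; parent is None when the thread that sent
    the outbound call never received an inbound request itself.
    """
    current_inbound: Dict[int, str] = {}
    links: List[Tuple[Optional[str], str]] = []
    for ev in events:
        if ev.direction == "in":
            current_inbound[ev.thread_id] = ev.name
        else:
            links.append((current_inbound.get(ev.thread_id), ev.name))
    return links


# Same thread handles the request and makes the database call: the link is found.
same_thread = link_by_thread([
    Event(1, "in", "GET /checkout"),
    Event(1, "out", "SELECT orders"),
])

# The request is handed to a pool thread before the outbound call: the link is lost.
handed_off = link_by_thread([
    Event(1, "in", "GET /checkout"),
    Event(7, "out", "POST /payments"),
])
```

In the second case the parent comes back as `None`, which is the gap that explicit context propagation with OpenTelemetry is meant to close.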
Some have been around forever, like the ones in Java. Python's pretty good, especially where auto-instrumentation is possible. So a lot of the things where you have essentially non-native binaries, like Python, which uses a VM, can do very good auto-instrumentation for you. But when you start using languages like Go, for example, or Rust or C++, something which is compiled, you have to do a lot more yourself. At that point, it can become quite intense to implement as a single developer, or it requires a pretty big effort to say: okay, we are going to use OpenTelemetry, we're going to use it properly, and let's dedicate the resources for that.
All right, and so what's your typical kind of customer, and what languages do they usually work with?
Yeah, so I think where we really shine is when you have a pretty large microservice system across different languages. If you want to implement OpenTelemetry from the start, you can do that, and if you have one consistent language, it's much easier to do that across all of your services. Whereas a lot of our customers use many different languages, and where eBPF really shines there is that it can get you that base level of instrumentation across all of them without you doing any work. So our average customer would probably have maybe five different languages used across, I would say, maybe 20 or 30 microservices. And at that point they can get pretty decent instrumentation with a one-minute install.
Okay, and so given it's eBPF based, does it also mean that it works even if someone doesn't use Kubernetes, but is just running Linux on bare metal? Are there any customers like that who still make use of your services?
So you can do that. Our node agent will run on those machines.
But for the most part, we focus specifically on Kubernetes, just because we can get a lot more information as well: you can combine the eBPF stuff with the actual metadata provided by Kubernetes around pods, nodes, these sorts of things. You get a very rich experience out the other side of that. Whereas if it's just on a host, you have a lot of processes which aren't necessarily linked to something you're running in production. Maybe they're just system daemons, things like this. At that point, it gets harder to distinguish what we should actually be monitoring.
Yeah, gotcha. And then, I would expect not, but I'm going to ask anyway: did you notice any differences between the cloud platforms? Let's say Google Cloud or AWS or some of the other ones? Did you see more difficulties in some?
Yeah. So I think for the most part, it's pretty good. As long as we can get access to the Linux kernel, we're pretty good with eBPF. There are some providers, and I think Google Cloud is the biggest one here, where you have this thing called Autopilot mode inside of Kubernetes, and they often restrict the things that you can run which have privileged access to the kernel. If we don't have privileged access to the kernel, we can't do this instrumentation. So in that situation, if you're running a Google Kubernetes Engine cluster and it's using Autopilot, then we can't see inside the kernel, and if we can't see inside the kernel, we can't instrument. I think that's been the main problem with Google, but apart from that.
And is that something you can easily surface to the customer? Would they notice, or would it be a pretty confusing mess?
Yeah, no, it will be surfaced, because essentially we'll say: okay, we can't instrument these particular hosts that are running in this cluster. And then they'll see that as a banner inside the platform.
So they'll see it. But it would require them to give that exact privileged access to the kernel.
Okay. First of all, do you have your own implementation of the OpenTelemetry, what is it called, ingester? Or how does it work?
Yeah, so we receive a number of different formats. At an architecture level, we have these node agents, and these are the things that live on each of the hosts inside the Kubernetes cluster. They collect the telemetry and generate all the stuff via eBPF. That then gets sent to another component inside the cluster called the exporter, and that's done over the OTEL protocol. So in this case, we'll use the OTEL trace protocol to send traces, same with logs. At that point, you can start ingesting your own stuff as well. That component will take stuff from our node agents, and it will also take any of your own custom OTEL stuff that you want to send to us. That will all be forwarded to our ingesters, which live in the cloud or on-prem. And again, they can receive a number of different formats, and one of those is OTEL.
What are some of the other formats you support?
Yes, it's primarily metrics, things like this, so Prometheus and a few other ones as well, and for logs, Vector and some things like this. But around traces, we try to translate everything in the exporter to OpenTelemetry and then forward that on.
Okay, how well is it working so far to combine metrics and tracing and spans and all that stuff?
Yeah, so metrics and traces are an interesting one. When you think about how you debug an issue, at least the flow that I tend to go through is: you first see something is generally wrong in metrics, right? Then you want to go from there to where exactly something is failing. You may see in your metrics that there's a spike in 500s in this service, but when there are many dependencies, how do you solve that?
At that point, you then want to try and find a trace. So you'd probably hop to the trace view and say: okay, cool, this trace from this service is failing. That will give you the whole distributed trace, which will isolate it to an individual service for you, and then at that point ideally you can see the logs for it, right? That's the overall flow: metrics, traces, logs. Traces to logs is normally pretty good, because when you emit a log, you can include something like a trace ID. If you include a trace ID inside the log line itself, it's very easy to correlate those. You can just search for that attribute inside your logs and you'll find it. Metrics to traces tends to be a little bit harder, because there's often either a one-to-many relationship, or in general they just don't track these things across metrics and traces. So I would say metrics to traces is the hardest part, but traces to logs is much easier if you support it. But you have to emit that in your log line.
Yeah.
Which some automatic tracing libraries do. I believe if you use the OTEL libraries for Python, they'll automatically include the trace ID in any logs where there is a trace associated with it.
Hmm. Yeah, it's tricky. That's definitely where I feel it is very immature: there are a lot of limitations and constraints around how much data you can put into the metrics, with your tags, with your labels. And so I imagine that adding something as unique as a trace ID would completely explode your metrics data.
Yeah, the dimensionality problem is pretty big with metrics. I think in general this is why traces and logs tend to be quite nice, because they don't really care about cardinality too much.
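The traces-to-logs step described above (put the trace ID into every log line, then search the logs for that ID) can be sketched with Python's standard `logging` module. This is a hand-rolled illustration under stated assumptions, not the OpenTelemetry logging integration: the `TraceIdFilter` class, the `trace_id=` field name, the logger name and the example IDs are all hypothetical.

```python
import io
import logging
from typing import List


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the logger."""

    def __init__(self, trace_id: str) -> None:
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True  # never drop the record, only annotate it


def find_logs_for_trace(log_text: str, trace_id: str) -> List[str]:
    """Return the log lines that carry the given trace ID."""
    needle = f"trace_id={trace_id}"
    return [line for line in log_text.splitlines() if needle in line]


# Write logs into an in-memory buffer so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
logger = logging.getLogger("checkout-sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

trace_filter = TraceIdFilter("4bf92f3577b34da6a3ce929d0e0e4736")
logger.addFilter(trace_filter)
logger.error("payment provider returned 500")

# A different request, with a different trace ID, logs next.
trace_filter.trace_id = "00000000000000000000000000000001"
logger.info("health check ok")

matches = find_logs_for_trace(stream.getvalue(), "4bf92f3577b34da6a3ce929d0e0e4736")
```

Only the error line belongs to the failing trace, so a plain text search on the ID is enough to jump from a trace to its logs.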
I mean, when you index them, you tend to index them on very low-cardinality things, and that allows you to add really high-cardinality stuff, like a full error message, right? That tends to be the best way of solving problems. If you have the full stack trace and you embed that inside your trace, then you can just see that immediately when you search for a distributed trace. So for me, I think traces and logs is the most mature part of that. There needs to be a little bit of work done to combine that with metrics, to then give you a trace out the other side.
Yeah. And so if we just take an example: let's say you're tracking status codes per second, so 200s, 500s. These are emitted from some kind of HTTP server, which is also emitting spans and logs. Is it also a thing where, as part of these spans, it includes maybe an average or some kind of aggregation of these metrics at that point? Is that something that people do, or that is possible?
I mean, I think you could do it. For example, if you wanted to emit a trace and then go and get the value of a metric at that point and add that as an attribute, you could do it. There's nothing stopping you from doing that. You could introspect whatever OTEL library you're using to emit those metrics. I find it's very rarely done in practice, though. I've never seen that be done. But I think there are also, and I might be wrong here, I've not looked at the spec in a while for this, some metric formats now which allow you to embed specific trace IDs inside of them as exemplars as well. And I think that might be a good way of doing it. But again, even though that may be in the metric itself, it's something which is not necessarily indexed.
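The difference between a trace ID as a label and a trace ID as an exemplar can be made concrete with a toy counter. This is a sketch of the idea only, assuming a simplified model in which each distinct label set is one time series; `CounterWithExemplars` is a hypothetical class, not the API of any metrics library.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


class CounterWithExemplars:
    """Toy counter: one series per label set, plus one optional exemplar each."""

    def __init__(self) -> None:
        self.values: Dict[LabelKey, int] = defaultdict(int)
        self.exemplars: Dict[LabelKey, str] = {}

    def inc(self, labels: Dict[str, str], trace_id: Optional[str] = None) -> None:
        key = tuple(sorted(labels.items()))
        self.values[key] += 1
        if trace_id is not None:
            # The exemplar rides along with the series; it is not part of the
            # key, so it never creates a new series. Keep only the latest one.
            self.exemplars[key] = trace_id

    def series_count(self) -> int:
        return len(self.values)


trace_ids = [f"{i:032x}" for i in range(1, 1001)]

# Trace ID as a label: every request becomes its own series.
as_label = CounterWithExemplars()
for tid in trace_ids:
    as_label.inc({"status": "500", "trace_id": tid})

# Trace ID as an exemplar: one series, with a pointer to a sample trace.
as_exemplar = CounterWithExemplars()
for tid in trace_ids:
    as_exemplar.inc({"status": "500"}, trace_id=tid)
```

With a thousand failing requests the first counter holds a thousand series, while the second holds one series whose exemplar still points at a concrete trace to open.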
It's part of the metric, and you can do a scan for it later when we're emitting them. But while I think it exists, I don't believe I've ever seen anybody actually use it. So it's one of those things where there needs to be a little bit of work there, and actually making sure the exemplars are used.
Hmm. You mean in the metrics? Is it because it's very new, or is it because it's still very experimental?
It's a good question, actually. I actually don't know why I've never seen it be used. I think it's one of those kind of niche things that goes overlooked for the most part when people are implementing stuff. Yeah, it's a good question. I actually don't know why it's not used.
Yeah, maybe you can link it to me afterwards and I can include it in the show notes, because it seems like an interesting thing to explore. So if we go back a bit in the conversation: you started with eBPF, and then you also have OpenTelemetry. Is it also possible, because you're at the edge, right? You could be at the edge with eBPF, and this is where the connection is really coming from, from somewhere on the internet, somewhere in the world, and from there on it goes through my entire service chain. Now, you have the concept of parent traces. Is it also possible for your, well, I guess eBPF service or whatever you call it, to already inject the root, let's say the parent ID, or at least what it considers its root, so that it can bubble up and you can follow the entire thing? Or is that something that just already happens automatically?
Yeah, so following the entire thing: the way this ideally works is that someone at the beginning of the chain injects that into the HTTP call, and then it needs to be propagated throughout. It's really hard for us to propagate that at the eBPF level.
Because essentially what's going to happen is you're going to send a packet over the network, and we would then need to somehow inject this ID into that packet, right? That's a fairly hard thing to do, because you end up overwriting things that you don't want to overwrite, and things start to break. I think it's probably possible to do it for certain implementations, but it can get quite error prone, which is why we don't do it. And as well, with us, we tend to make the promise that we don't touch any of your network traffic, right? It means you can install this, you can uninstall this, and everyone's fairly happy with the fact that we're not actually injecting anything which could break production traffic. Whereas the moment we start injecting that, it becomes a bigger discussion.
Hmm. Yeah, yeah. No, actually, thinking about that, it definitely makes sense. But doesn't that mean that your eBPF metrics, or I don't know, data, is a bit separate from everything else, right? It's going to be hard to connect. Or am I wrong in that? Is it connected to your other, let's say OpenTelemetry-based data?
Yeah, so the way that this tends to work is: your OpenTelemetry trace ID and span ID are included, if it's an HTTP call, in the HTTP headers. And if you're using different protocols, you can inject it in other places as well. So when we're generating our own traces, we will look at that if it's available. It means that if something comes in and there's already a trace ID, and we emit any spans from eBPF, we'll attach them to that specific trace ID. So you'll be able to see them in combination with your OTEL stuff. But what we can't do is, if there's nothing there in the first place, subsequently inject the trace ID.
Gotcha.
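The behaviour described above (reuse a trace ID that is already in the HTTP headers, never inject one) can be sketched against the W3C Trace Context `traceparent` header, whose version 00 form is `version-traceid-parentid-flags` in lowercase hex. This is a minimal illustration, assuming the headers arrive as a dict with lower-cased names; `parse_traceparent` and `span_for_observed_request` are hypothetical helpers, not Metoro's implementation.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from a traceparent header, or None."""
    value = headers.get("traceparent")
    if value is None:
        return None
    m = TRACEPARENT.match(value.strip())
    if not m:
        return None
    version, trace_id, parent_id, _flags = m.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id


def span_for_observed_request(headers: Dict[str, str], name: str) -> dict:
    """Build a span record for an observed call without touching the headers."""
    ctx = parse_traceparent(headers)
    if ctx is not None:
        trace_id, parent_id = ctx  # join the trace the application started
    else:
        # Nothing to join: the span stands alone under a locally generated ID.
        trace_id, parent_id = secrets.token_hex(16), None
    return {
        "name": name,
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "span_id": secrets.token_hex(8),
    }
```

Note that the function only reads `headers`. When no context is present, the locally generated trace ID is never written back, so downstream services cannot pick it up, which matches the limitation Chris describes.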
And as you are, I would say, quite an important player in this telemetry industry, and there are definitely not that many commercial players that I know of that are at your level: are you also in close contact with OpenTelemetry, whether it's the foundation, or organization, or I don't know, with the folks behind it?
So we're actually not. I mean, I'd love to talk with them more in general, but no, for us it's been very much: OpenTelemetry exists and is the standard that everyone's using, so let's get involved and implement it on our side. We've not really had too much scope in shaping the actual direction of the initiative at all, but I'd love to get involved sometime. That'd be pretty good.
Okay, yeah, I mean, as the ecosystem is shaping up, it definitely also needs some proper feedback, because sometimes I'm afraid that they are maybe a bit too much in their ivory tower and don't really know what the reality on the ground is. And I imagine you as a company are pretty close to the ground. I also know that you have excellent customer support, talk to your customers, and are very involved. I can't imagine much better players to give feedback on how it's actually being used and where the pain points maybe are.
Yeah, we'd love to do that. And I think we'll try to get more involved with the community as we grow as well. We're a fairly new company, so resources are always tight, and we have to decide where we're going to spend our energy. Ideally, as we grow, we can start to give a lot of the experience we have around OpenTelemetry back to the community. That'd be ideal.
And so, right now you're very much eBPF based, and you have this Kubernetes thing. And while I think Kubernetes will still be around in 20 or 30 years, I imagine, what is also happening, or we don't know yet if it's for sure happening, but there's definitely a lot of money behind it and excitement.
And it's getting more mature: it's the entire Wasm community and WASI community, where folks might have some kind of orchestrated services, very beefy services that host a lot of different, I don't know what you call them, Wasmlets or Wasm services that you can just hot patch. Would that be something that you can ever support? Because that's a very different thing, right? I mean, while you could run that within Kubernetes, at that point there is also little reason to still be within Kubernetes, because, for example, the cold startup time is almost instant with Wasm.
Yeah, that's a good question. I haven't actually looked into Wasm an enormous amount. Essentially, how does that work in practice? In a production system with Wasm, is there essentially just a bunch of standardized nodes, and then you give your Wasm components to it and they get compiled at runtime and they run? How does that work exactly?
Yeah. So you could compare it to some kind of virtual machine. You are basically working with a virtual processor, et cetera, and the bounds are pretty clear. So you have a host service, which is running that runtime and executing the instructions. And while you could run that in a Kubernetes setting, which I am sure some people do, I imagine that especially if you're an operator or something that hosts a lot of these, there is little reason to do it. At that point, we would go more back to my original question, I think: you are almost running on bare metal, because you will still probably be in a Linux environment, because that's at this point still the most common server environment, but you will no longer be in a Kubernetes environment.
Which makes me think that maybe your eBPF thing is still fine, because in the end it will still boil down to kernel calls and this and that. So maybe it's not a big issue at all, but I do wonder if there's work needed there.
Yeah, I think there probably definitely will be work needed. From our perspective, as long as this thing is making kernel calls at the end of the day, which pretty much everything is, and we can map the process ID of whatever is making the call to some service, then we can work in that environment. We need to be able to do just those two things, which it sounds like we should be able to do, but I'd need to look into it more.
Yeah, gotcha. Then of course, once we touch bare metal, you also have the Windows servers, which I get is a minority, but there's still a lot of money involved there anyway. Is there actually something like eBPF for Windows? Because I'm not even sure if there is.
Mmm. So I believe that there is an effort to essentially have an eBPF-like system inside of Windows. I think that's actively been going on. I think it's in Wacom. Do you know that there was a... I forgot what it was called. It was CrowdStrike, where they were running kernel modules in Windows and essentially they brought down a huge number of things because there was a bug. So I think that spurred Microsoft to say: actually, we need something like eBPF, where people can still inspect system calls, but it runs in a VM which we control, and therefore we can guarantee that it won't actually crash the kernel itself. So I think there is an eBPF thing happening inside of Windows. I can have a little look for you after the show, but I'm pretty sure that that's starting to happen, and it will be very interesting to see where that goes.
Yeah, yeah, I think you're correct, because I look into that space as well, since I'm involved in security. For now, I'm not supporting Windows yet, but I do know that on Macintosh you also used to have to hook into the kernel, and then since, I don't know, almost a decade or less, I think about five years ago, you can actually do it within user space with your background, they're not called tasks, I forgot the correct name. There you can choose to inspect the network or all kinds of calls. So I suppose on Windows it will be something similar, and that might be enough for what you guys want to do. It would just need to be installed at that point, I suppose. But yeah, maybe it's not even worth looking at right now, because the Linux world is, I guess, still the majority. So I imagine you, yeah.
It's the vast majority. Yeah, I just quickly looked this up: it is being worked on by Microsoft. They have a repository called eBPF for Windows. So I think that will be interesting. I'd actually be curious, as they adopt more Linux-style stuff in general, how many companies will choose to run Kubernetes on top of Windows machines. That'd actually be very interesting to understand.
Maybe not. But again, we also have maybe people running bare metal, or they might run Wasm components. Maybe then they might be on Azure because everything else is already in the Azure world. Even though I know in Azure you can also run Linux now. So it might not be a biggie anyway, but yeah. Okay, so you're also our first guest who is not just a co-founder of a commercial company, but who also was in a Y Combinator batch. How was that experience for you, and what were the takeaways for you?
Yeah, I mean, Y Combinator is great.
I think if you're ever thinking of starting a company and you don't have the network in Silicon Valley, it's the best place to go. It sort of gives you that network of venture capitalists, investors, people that otherwise I would never be able to really talk to or get in front of. I had the opportunity to get in front of them basically purely because I went through YC. So the way I see YC in general is that it kind of bridges the gap between Silicon Valley and technical people who otherwise don't have the connections to get there themselves. And I found that really useful. And then on top of that, you have the community, which is just really nice as well. I'm sure people probably know, but starting a company is a pretty tough experience. There's all sorts of things that go wrong all of the time. And having a bunch of people to share that experience with is just really nice. They're all at the same point in their company. They're all going through the batch at the same time. So when you're there, you get to meet some great people, and they're really good colleagues and people to talk to in general about stuff that's happening in your life that maybe other people you've worked with in the past don't quite understand. So I found that to be really valuable. And the lessons they teach as well are just generally good for running a startup. I would say we're much further ahead now because we went through Y Combinator. So, you know, super grateful to all the YC people. Hmm, gotcha. And can you maybe summarize from start to finish how the process was for you? First of all, how did you have to apply, or how did you start with this process? Yeah, so we always thought about founding a company. It was myself and my co-founder, Tom. We were always brainstorming ideas, coming up with, well, what would it be cool to do? Right. And it was very much on a whim, actually.
I think we came up with an idea completely different from the thing we're working on now, which is very common for startups. Startups, especially going through YC, often change their idea quite frequently. So maybe don't get too attached to the idea itself. But we had an idea for something, and we just saw that the YC applications were in progress. So we just created an application. It was super short, I think one, maybe two hours max of filling in the application. And then we sort of forgot about it because we never thought we'd get accepted, right? The acceptance rate was low. And then we got an email just saying, okay, could you be here for a meeting at this time? And we got there and we spoke to the partners. Oh. Where was here? Was it in the UK or? That was a video call at the time. And I think they still are video calls. But when we were doing this, it was coming off the back end of COVID. So a lot of the batches were virtual. And then the one that I did was one of the first ones where they were back in San Francisco. But that was a virtual call. So yeah, we hopped on the call with our group partner. And they interviewed us, asked us a bunch of questions. And then we got the decision a couple of days later. It was a very smooth, very simple process that we really didn't expect to get through. Interesting. And then once you got accepted, how was it then? What was the expectation on their side? What were the things you had to go through? Yeah, so they expect you to go to San Francisco. Essentially the batch starts at a particular point in time, and you need to be in San Francisco for that time. It's a three-month process and it's really, really useful. The rough structure is, essentially, you have a three-month batch, and at the end of that batch you talk to other investors and you raise more money. So Y Combinator themselves invest money, and then at the end of the batch you raise money from different investors.
And the process is really along the lines of: how can you get your company in front of customers, getting customers, and how can you accelerate that process within that three-month timeline? It's structured in two-week blocks where essentially you have a meeting with your group partners and you say, okay, well, these are the things that we plan on doing in these two weeks. And they give you feedback on whether or not that's useful or what you should be doing. You go away, you do that for two weeks, and you repeat the same process, and just do this iterative cycle of saying, okay, we're going to do this thing, and then see if that worked or failed. It's this really quick hypothesis-testing loop. You get to be in a pretty decent place after three months. Okay, very nice. So that's pretty intense. And does it mean they cover everything? Like, do they give you housing and pay your tickets, or how is it? Yeah, it's very intense. So you do that yourself, but you do that from the money that YC invests. So you yourself don't have to cover that; your company will cover that after they invest that money in it. And it arrives immediately, like you have that, or do you have to prepay it first or... Yes, it depends on where you go. They're pretty hands-off about this whole finding-housing thing. Essentially, they give you a list of places that have historically been pretty good for founders in the past, like accommodation blocks. And then you go and talk to those people and you sort all this out yourself. But yeah, your company can cover that. And there are also a lot of networking events where you talk with maybe other founders, other investors? Yeah, all of the time. The main cycle is these two-week blocks, but on top of that, there's loads of events where they get lots of different founders, people from industry, and investors to go to those events, and you can do a lot of networking there and just meet a lot of really interesting people.
So it's been, yeah, I would say probably the most intense three months of my life, but probably some of the most rewarding as well. Very nice, I mean, yeah. I'm also thinking, for example, right now I'm a parent of three kids, very young kids, so that's definitely not an experience I can go through, but it definitely sounds interesting for the future. I will keep it in mind. I think, yeah. I mean, by the way, there are parents who go through YC, and they manage to do it. Don't get me wrong, I think juggling both the company and kids while going through YC is a tough thing to do, but there are people who do it. So it's possible. That's all I'll say. Yeah. Yeah, yeah. I mean, it could be exciting, especially if you can take them with you and they can also maybe go to summer school there, or some kind of exchange program could definitely be exciting. So let me think. So you did Y Combinator. Then you say at the end you have to raise some money again. Is it because by that point you have talked to plenty of investors, and so you know which kind of investor matches your profile or your goal? Yeah, and I think as well it's just a great platform for talking to investors. I mean, this is the standard procedure for every YC batch. So every investor who is investing in these YC companies knows, okay, well, this is the time when all of these companies are going to essentially ask for more money, right? To do the things that they need to do. So it means that in that time period, I think we spoke to, I can't remember the exact number, but I think it was around a hundred investors in a block of a few weeks, right? And in terms of efficiency of time spent raising money, that's great, because for a lot of people this process can go on for a long time. But if you can talk to all these investors over a short time period, you can get that process done very quickly.
And it's a pretty big advantage not to waste time doing that later. Okay, and so because Y Combinator is both an investor and an incubator, they also have a stake in your company. How hands-off are they afterwards? Yes. So it's really on you, I think, to decide how much you want their input. It is very possible to sort of fade off into the distance and never talk to the YC group partners again. I don't think that's the best use of your time as a founder. And you're definitely not extracting the value from YC if you do that. But you can schedule meetings with them pretty frequently. And in general, you'll talk to the community, or you should be talking to the community, quite a lot, because there's a lot of other people who have gone through exactly the same problems that you're facing. So if you're not talking to those people about how to solve those problems, you're essentially wasting time trying to solve them yourself. Yeah. Understood. But you are based in the UK again now. Is there also a huge YC community within, I guess, London or... Yes. Yeah, so YC London is pretty big. It's at least a few hundred people, potentially more. So there's a pretty big community in London. It's nowhere near the size of San Francisco. San Francisco is, I want to say, 90, 95% of all people are in San Francisco. But then there are some smaller communities around the world. Okay. And so if we dive back a bit into the technicalities and go from business back to OpenTelemetry, what are some of the pain points there that you are currently struggling with and that you would like to see changed? Yeah, I think a lot of people have trouble instrumenting their apps with OpenTelemetry when they have to do it manually.
The auto-instrumentation stuff is great, but when you want to start adding metrics and things like this, the overhead can be quite high to understand all the different types of metrics, how exactly you should use things, or what the standard tags are to add to all of your applications so that your provider knows exactly how to use your traces, right? There is, for sure, a big learning curve associated with OpenTelemetry. I think really one of the best things that the foundation could do, and I'm not even exactly sure how to do this, is potentially adding an abstraction over the top which makes it easier for people to get into OpenTelemetry. It would essentially be something which looks very much like, I don't know, the Prometheus metrics that people are used to, right? Where they can translate very easily between the two, and it allows people to get into the ecosystem more easily. Because when you're setting it up for the first time, it's just cognitive overhead, and it happens with pretty much every customer that we talk to. It takes them a good few days to get used to that. And if you can reduce that, people will definitely use OpenTelemetry more. Okay, and how do you see it as a standard? Do you see it as definitely what we want to go for, or if one day soon someone came along with something else, would you gladly jump on board with it? So I think there are problems with the spec for sure, but I think it's good that there is a standard. I think having one main standard is really useful, especially just for customers. It means that you can collect all this data on your side, and then if you're not happy with your provider, so let's say you're using us or you're using, I don't know, someone else like Grafana, if you're not happy with the service that we're providing, it gives you a really easy way to switch to someone else,
because the data is all standardized. So you just change, essentially, the endpoint you're sending data to, and now you can fully use another provider, right? That's really, really valuable to a customer. And I think that's something which should exist. Whether or not it's specifically the OpenTelemetry specification, I'm not too invested in. But I think that having one clear standard is definitely the way to go. And then, they often say platform is king. So I wonder what stops something like a Google Cloud or an AWS from just eating your lunch? Because I imagine they already have everything you need and they could just provide it all. I know they don't really, and where they do it's either very expensive or not very user-friendly. I mean, I've not seen anything like Metoro. So I just wonder what stops them from doing it, and how scared are you of it? So I think fundamentally, they could, right? I think a lot of the cloud providers can eat the lunch of a lot of different people. I think a lot of the time it comes down to scope, what it is they're willing to invest in, and how niche you are as a company, right? We have a fairly niche focus, and I think that gives us a little bit of protection, in that it would require them to invest quite a lot of work to get to where we're at, when they could potentially be serving a bigger market somewhere else, right? So I think that's potentially one sort of defense, in quotation marks; it's not really a defense. And am I scared about that? Partially, but I think as a startup, we have other, bigger problems to solve. From our perspective, this is a good place to start. And there's a lot of other places that we want to go into. And I think the combination of the things that we build is quite hard to replicate, but fundamentally, yeah, someone could do it. This is software, right?
I mean, anyone can kind of do anything. Yeah. First of all, I like eBPF and I've also used it in very good ways, but I'm pretty certain that most companies you talk to are not at all familiar with the technology, because it's pretty low level. It's pretty systems-programming heavy. And even among systems programmers, it's a very niche thing. What is the most common thing that people are confused about? Do they understand fundamentally what's going on there? I think some do, especially since we normally partner pretty closely with each of the customers that we have. When we start talking to the engineers, they'll start to understand it, and they'll ask questions about it and get to know it that way. Most people, when they first sign up, don't know about it. They just see the outcome, and that's what they're interested in. But I think in general, a lot of the time, people are concerned that we will introduce bugs, right? Into our stuff, which will then bring down their systems. And I think that's the biggest misconception that they have. Fundamentally, eBPF is designed in such a way that even if we ship bugs, they're contained inside this VM that the kernel is running, right? So it means that the worst thing that will happen is that our telemetry will stop being produced. We will never impact your production systems, because essentially the way eBPF is designed stops us from doing so. I think that's probably the biggest concern. Okay, and do they understand the metrics? For example, if it says, I don't know, X amount of requests per second, or they see maybe timings, like how long something takes. Do they understand how that is generated? What it really means? Mmm. Yeah, I think some do, some don't. That's on us to essentially make that clear in the platform, I think.
So in general, yes, it's all well and good to say, okay, well, this request took this long to process, right? But what that means is how long it takes from us reading the write for that request, as in the write syscall, to subsequently processing the read of the response as well. And there's sort of a small additional overhead on either side. That's probably something that people don't quite understand, yeah. Okay. And so how do you see things like Prometheus going into the future? Because I know Prometheus was a big thing, and developers especially liked it, even though most developers I talked to liked it at first but then all started to see limitations, of course. OpenTelemetry also has its limitations, but many platforms which were very centered around Prometheus are now also supporting OpenTelemetry and are even recommending it. Do you see it phasing out at some point, or is it something that will always live next to OpenTelemetry, you think? I think maybe you will see usage decline slightly, but Prometheus is such a massive ecosystem and it's built into so many different products that it's going to take a while for that to change significantly. And I think it might be around for a long time as well. There are plenty of people who use just Prometheus in a standard way, and they're perfectly happy with that system. They're happy with how that works. And the whole scraping methodology definitely suits people more sometimes than the push methodology. So I think there's pros and cons to both. In my ideal world, I think we would standardize on some approach, and it seems like OpenTelemetry is the way that's going. But I think we'll definitely see Prometheus for a long time. I don't think it's going anywhere anytime soon, but I think we'll start to see OpenTelemetry eat more and more into that market share.
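To make the request-timing definition from the start of this exchange more concrete, here is a minimal sketch that pairs an observed write syscall (the request going out) with the next observed read on the same connection (the response coming back) and reports the elapsed time. The `SyscallEvent` shape and the `request_durations_ns` function are hypothetical and only illustrate the idea; they are not Metoro's data model, and a real collector would also have to handle pipelining, partial reads, and connection reuse.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SyscallEvent:
    """One observed syscall on a socket (hypothetical event shape)."""
    timestamp_ns: int
    pid: int
    fd: int
    kind: str  # "write" = request bytes sent, "read" = response bytes received


def request_durations_ns(events: List[SyscallEvent]) -> List[int]:
    """Measure each request from its first write to the first read that follows it."""
    pending: Dict[Tuple[int, int], int] = {}  # (pid, fd) -> time of first write
    durations: List[int] = []
    for event in sorted(events, key=lambda e: e.timestamp_ns):
        key = (event.pid, event.fd)
        if event.kind == "write":
            # Keep the earliest write so multi-chunk requests are timed from the start.
            pending.setdefault(key, event.timestamp_ns)
        elif event.kind == "read" and key in pending:
            # First read after a write closes the measurement; later reads are ignored.
            durations.append(event.timestamp_ns - pending.pop(key))
    return durations
```

Because the clock starts and stops at the syscall boundary, the measured duration includes a little time on either side of the application's own processing, which is the small overhead described above.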
And then of course, as we are in 2025, I have to ask how AI or LLMs are used within your company. It could be that you maybe provide some telemetry around their usage, or it could be in how you consume, aggregate, or present data. Is there a use for AI within Metoro? Yeah, we actually think it's a pretty big one. There's a bunch of different things, but the biggest use case that we've found so far is helping people to automatically diagnose issues. Because we export all this data via eBPF, it's all in the exact same format everywhere, right? So it means that we can write very specific algorithms, excuse me, and allow AI to interact with it in a very standardized way, which means that it can very easily spot problems or, essentially, failure modes in a lot of Kubernetes clusters. So the biggest use case that we find is allowing AI to essentially look at this data stream and say, okay, well, this data stream is anomalous, and then it can start to investigate individual components. It can look at traces, it can look at logs, drill down into the root cause, and then try and show you that root cause as an outcome, if that makes any sense. Yeah, that totally makes sense. I've not seen it in Metoro yet, because since then I no longer have a customer who uses Metoro, so maybe that was after my time. But where I did see something like that used is in something like Sentry, where they try to automatically give a possible solution for what the issue could be. And I found that each time it was just saying some very vague stuff, which didn't apply at all to the problem. Of course, that was already more than a year ago, which is like a century in terms of AI. So I wonder how the success rate is there, or the satisfaction rate with those kinds of diagnoses.
Yeah, so this is anecdotal, but we've now rolled it out, it's now GA, and we're seeing pretty good success rates. It's not perfect by any means; no AI system will be perfect, quite frankly. But I think, and I've not checked the metrics in a few days, 80% plus have been positive. Essentially there's a little thumbs up, thumbs down for whether this was useful or not, and it's been 80% thumbs up, which I think is pretty good. And then we also go a little bit further. We also actually produce PRs as well. So in this situation, if we find an application bug and we know how to fix it, we will actually raise a PR as well. And those have been getting pretty high success rates, similar to the thumbs up, thumbs down ratio of the investigation itself. But that means that next to the fact that you ingest their data, they also give you access to their code repository. Exactly, yeah. In that situation, you can put an annotation on your Kubernetes pod saying this is the repository where the code for this service lives, and we will go and inspect that code. We support GitHub and GitLab. So as long as you have it hosted there and we have an integration that you've set up between the two, we can look at the code as well. Some people choose to do that, some people don't. It sort of depends on what your standpoint is on security and whether you want us to look at your code. Okay, totally got you. And then I mentioned Sentry, and Sentry was traditionally used so that, based on your logs, it raises issues and it might also group them together. You can easily filter them. It can create alerts. Then they also ventured into spans and traces. I don't think they do metrics, I might be wrong, but still there is a lot of overlap with what Metoro is doing, and I feel there is very little
that you would need to fully capture what they capture. Is that something that you would ever consider? Because otherwise, I feel it means that even though someone might be a Metoro customer, they might still need something like Sentry for maybe the automatic bug tickets based on their logs and those kinds of things, maybe linking to the metrics. And actually you even have a lot more data than they have, because you also have the entire eBPF side and what else? Yeah, so I think that's where we want to go, fundamentally. We want to be able to give people this holistic view of their entire system, where they integrate the code and everything like this. That's where we want to go. I'd say with Sentry you need to actually implement it inside your code, right? They have an SDK that you need to instrument your code with, as far as I'm aware, which I think is, yeah. Yeah, but that's kind of what you have OpenTelemetry for, right? Because they don't do that much more than what OpenTelemetry does. It's basically just about exporting logs and adding some metadata to them, which you kind of get for free already with OpenTelemetry, I would say. Yeah. Exactly. I think we have all of those data sources that we need to do this effectively. So yeah, we have logs, we have the metrics, we have the traces, and now we have the code. I think there's really the profiling as well. It's an interesting thing to talk about, but yeah, we have those five pillars, I think. And as long as we have those five pillars, we can do a pretty effective job of fixing things. Okay, very nice. Yeah, I mean, we can definitely talk a bit about profiling, which again depends a lot on the language and on the technology used, of course, because it used to be a lot simpler, let's say, when everything was single-threaded or where everything was just using OS threads.
But then again, if you look at something like Rust, and it's not just Rust, many modern languages manage their own threads. And then you have very language-specific tools to really inspect what is happening within these asynchronous tasks or, well, there's all kinds of terminology used, but fundamentally it's all things that the OS doesn't manage, that are juggled between OS threads and that have their own thing. And then I wonder how easy it is to really support all these different things effectively, because if you don't, you do miss a lot of potential data points, I would say. 100%. And I think really, with anything that's native, like, yeah, Rust, C++, anything that compiles down to a binary, as long as you include debug symbols, it tends to be pretty good. You're right on the async front. Yeah, that makes it much harder, because the whole point of a profile, right, is that you see this thing calls this thing, this thing calls this thing, and you see that whole stack trace, effectively. Async stuff makes that much harder. So you'll see, okay, well, this particular function call was taking a long time, but where that was called from, you don't really know. So I think that's definitely a problem. Yeah, exactly. So allocations you will of course still see, but you might not see why some kind of task is stuck for a long time, or why it's being called all the time, or I don't know. I think so, yeah. And I think that's an interesting problem in async programming in general. I think maybe there's a good solution to that, but I don't know what that is. With the more synchronous stuff, it tends to be much easier. And yeah, so that's for the native side. And then for the non-native side, so things like Python and Java, there are some interesting solutions. For Python, what some people do is that they essentially see, okay, well, what's running here?
And then they look inside the VM to reconstruct the actual stack trace. That's interesting. But for the most part, a lot of the time there, you're looking into the underlying library implementations of these things. So in this case, if you're running Java, you'd want to take, you know, a Java profile, and then you have that integrated. So you actually integrate at the language level for most of those things. Yeah. Hmm. Okay, and so I know Metoro can be fully managed by you, but is it also something you can self-host? Yeah, we do. We recommend the SaaS version for most people just because it's the easiest thing to install. But you can self-host it. It's a bit more work, but it's definitely doable. Yeah. And so what are the different payment models? Are there different tiers, different ways to pay, different ways to be a Metoro customer? Yeah, so it's pretty simple. We try and keep it as simple as we can. You normally pay a per-node fee. So let's say you have a Kubernetes cluster with 20 nodes: we charge $20 per node, so you'd pay $400 a month at that point. And then that gives you 100 gigabytes of ingest per node. And if you use more than that ingest, you then pay 20 cents per gigabyte ingested. So it's a fairly simple model. We try and keep it as simple as we can without... Because we need to basically cover our different dimensions there, because, you know, we pay for ingest, effectively. And is that the same for self-hosted versus fully managed? Sorry. Yeah, for self-hosted you don't pay for any data. You just pay the flat fee for the management of the nodes. Yeah, because we have to cover that cost. Okay, gotcha. Okay, I understand. And we talked about a lot already, but is there something else you would like to talk about or plug on the podcast? No, I think it's been good. I wish more people started with tracing. I think it can be really interesting.
It can help you debug a lot of stuff that you're otherwise struggling with. So yeah, just plug tracing. And let's say someone is still not convinced of why they would want this. How could you, as a kind of elevator pitch, convince them to become a Metoro customer, even if just on trial? Yeah, I mean, I think for us, you can just try it really easily. It's a one-minute install using eBPF, so you don't have to do anything. So you can install it, and if you don't like it, you can uninstall it one day later, two days later. It's all free for the first two weeks. So you can try it, and if you don't like it, you can leave. That's my overall thing. All right, very cool. And then I know that I've seen you very engaged with some customers, where you even join their chat community channel or whatever. Is that something you offer as an extra, or is that just something you do for new customers, or how does that work? Yeah, that's right. All of our customers. So if you go with Metoro, I'll probably be in a Slack room with you. Especially at this stage in the company, we want to talk to all of our customers. We want to get as much feedback as we can from everyone. So I'll be there. Any questions you have, I'm there to help you out pretty much directly. Maybe a few other engineering members will come in as well, but for the most part, I think you'll be talking directly to me. Very cool and very personal, I like it. And I also like how you are based on open standards. That has a lot of advantages, and I hope that people appreciate it, because there are not enough commercial activities that do that. Thanks, I appreciate it, Glen. It was great chatting. Bye. Elizabeth (Plabayo)
55:50 | 🔗
Netstack.fm is brought to you by Plabayo, building secure, open, and resilient infrastructure with Rust, protocols, and purpose. This show is also made possible by Rama, the open source networking framework. Plabayo offers service contracts and welcomes sponsorships to keep building and supporting its ecosystem. The theme music of this podcast was composed by DJ Mailbox. If you enjoyed this episode, don't forget to subscribe on your favorite podcast platform and leave a five-star review. It really helps others discover the show. Thanks for tuning in. We'll see you next time for the next handshake.