Netdata helps you troubleshoot slowdowns and anomalies in your infrastructure with thousands of per-second metrics, meaningful visualizations, and insightful health alarms, all with zero configuration. Observability is a busy space, so how does Netdata compare? I speak with Robin Schumacher to find out.


Chris Ward 0:04
Welcome to another Chinchilla Squeaks, my semi-regular, spontaneous video and audio interviews with tech luminaries, tech leaders, and people working on interesting technologies. My guest, tonight for me, is Robin Schumacher of Netdata. How are you doing, Robin? I’m great. Thanks for having me. Good. Where are you joining me from? I am joining from the technological hotbed of Louisville, Kentucky. Okay. I don’t really know where that is in the country. I was in Denver a couple of years ago, and I didn’t really know where I was then either; it was kind of disconcerting.

Robin 0:52
We’re semi on the East Coast, kind of the Midwest or whatever, kind of nestled right there in the middle.

Chris Ward 0:58
Okay, cool. So, Netdata. As far as I can tell from the website, it is one of the many products and companies in the increasingly popular observability space. It’s almost becoming like the joke about JavaScript frameworks a few years ago: there seems to be a new observability option almost every couple of months. But I’m pretty sure Netdata has been around for a little while, hasn’t it? I feel like I’ve seen the name for some time.

Robin 1:31
Yeah, formally, things started about seven years ago, when the founder of our company open sourced what he had been working on. Like a lot of software, its evolution came about from his own needs. He couldn’t find the types of monitoring capabilities that were required for his particular enterprise at the time and decided to build his own. So yeah, it was about seven years ago. It’s GPL 3.0 in terms of open source licensing, and that’s when things kind of kicked off.

Chris Ward 2:05
And so, those initial itches, as we say, that the creator wanted to scratch, which sounds really bad when I put it that way. Was there anything specific that he was trying to do? Or was it just a general observability and monitoring kind of need?

Robin 2:28
Yeah. So you’re right in saying a minute or so ago that our industry is very crowded. There are a lot of entries, a lot of vendors, and what have you. The wild thing is this: one of the analyst groups I like to listen to is 451 Group; I like their reports. Late last year, they put out a research note that said 89% of all enterprises right now are unhappy with their monitoring solution. I mean, think about that. Almost 90% of companies feel that what they have right now just isn’t doing the job. And a lot of times they have multiple solutions in there, including roll-your-own, do-it-yourself types of things. So there are a lot of reasons for that, I guess. But if the question is basically what caused our founder to start building out his own monitoring solution, I think one of the key things had to do with the granularity of the information collected and reported. If you look at a lot of the monitoring solutions out there, the collection window that you can look at is sometimes upwards of five minutes. This can become very problematic, because in very complicated IT stacks, problems can come and go and you’re unable to actually capture them. If you don’t have very fine-grained reporting mechanisms, you’ll be getting spikes, they’ll be driving you crazy, and you can’t find them. One of the distinguishing characteristics of the Netdata monitoring solution is that we have per-second granular collection and monitoring, which is fairly unique in the industry.
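To make the granularity point concrete, here is a minimal sketch. It is not Netdata code; the sample values and the `downsample_mean` helper are made up for illustration. It shows how a three-second spike that is obvious at per-second resolution almost vanishes once the same readings are averaged into a single five-minute point.

```python
# Hypothetical data: one CPU reading per second for five minutes (300 samples).
# The load sits at 20% except for a three-second burst at 95%.
per_second = [20.0] * 300
for t in range(120, 123):
    per_second[t] = 95.0


def downsample_mean(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# A collector that reports one averaged value every five minutes.
five_minute_view = downsample_mean(per_second, 300)

peak_per_second = max(per_second)         # the spike is clearly visible
peak_five_minute = max(five_minute_view)  # the spike is averaged away

print(peak_per_second, peak_five_minute)
```

With these assumed numbers, the per-second view peaks at 95, while the five-minute average barely moves off the 20% baseline.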

Chris Ward 4:21
And seven years ago, were any of the other still-major options around, like Prometheus and Grafana, and I guess some of the more commercial ones like Datadog and all the others? It’s late and I’m having a complete blank on what they are, but some of them have been around for some time. Were they around as well? And what were they not doing, I suppose?

Robin 4:51
Yeah, sure. So some were, some weren’t. But if we were to generalize things: on the one hand, you’ve got some of the vendors that you mentioned, which are proprietary in nature and tend to be fairly costly. When you’re going to roll them out on an enterprise scale, in a lot of organizations you’re going to have, obviously, many different servers and lots of different pieces to the IT stack. And depending on how things are priced, the check that you have to write for those can be fairly substantial. So you’ve got that on the one hand. On the other hand, you have some of the open source solutions, which were fairly rudimentary and today still tend to be fairly rudimentary, in terms of: okay, we’ll give you the framework that you need to be able to do some of the monitoring, but now you have to craft almost everything from scratch. You have to write your own queries, using our proprietary language, to populate dashboards, which we don’t give you; you have to create your own. So on the one hand, you have things that are given to you, but there’s a decent price tag that comes along with them. On the other hand, you have open source solutions that may be fairly bare bones. With a lot of elbow grease, you can get things going. Yeah, they’re free, but you’re going to spend time to save that money. And so Netdata sits somewhat in the middle here. First, one of the promises that Netdata makes to every user is that all the monitoring functionality that you have, or will have, is free. We’re committed to providing every piece of monitoring capability that comes with our solutions for free. There are two pieces to our solution: there’s the open source agent that people install on the various servers that they want to monitor, and then we have a cloud solution that, while not open source, is still free.
So no matter how you’re using Netdata, you’re not having to write a check for it. Here, we want to make sure that we’re giving everyone that comes to us very sophisticated monitoring capabilities that are affordable to everyone. Again, you can’t beat free, right? So that’s step one. Step two, though, is to ensure that we’re getting people off the ground and out of the blocks very quickly, whether you are very learned in the monitoring and troubleshooting space or you’re fairly brand new. At least according to the surveys that we run with the folks that use our technology, it’s about an even split: 50% power users, 50% not. And whether you have a power user or a novice user, oftentimes they’re having a problem and they’re looking for help right now. They don’t have the time to spend on all that setup and all that crafting. So the wonderful thing about Netdata is that you can download it, there’s a single-line installation command, and you will immediately be up and running with statistics. It auto-discovers all the components of your IT stack. It’s pre-configured with smart dashboards that the monitoring experts here at Netdata have either used in the past in their own environments or have built based on user specifications. And we’ve also set up a predefined set of alerts that operate in a very proactive sense, so that people can immediately begin discovering those needles in a haystack. In fact, one of our marketing team members forwarded me today a post from a user who did exactly what I was talking about. Things weren’t running quite as they’d hoped, so they downloaded Netdata, posted a picture of it, and said, it’s amazing how this thing has been up for under a minute.
And it’s already pointed out to me three, five things that are going wrong in my infrastructure that I need to work on. So, again, that’s just a couple of things that I think Netdata provides people that maybe the competition and other open source projects aren’t doing right now.

Chris Ward 9:02
Okay. I’m just digging into the open source project a bit, because that’s where I tend to like to start looking at things. I can see, for example, that the repository is tagged with Prometheus and Graphite, but is it leaning on any other libraries or frameworks? Or was it mostly from scratch?

Robin 9:28
Written from scratch. So it’s written in C, and obviously there are going to be various open source libraries and other things that we’ll make use of, for sure. But our future is not dependent on any other open source provider from a monitoring perspective, if that makes sense.

Chris Ward 9:46
Yeah, that makes sense. And that has been controversial in this community in the past month or so. So yes. Okay, cool. I was just wondering: is even the time series part custom? Or is that leveraging something pre-existing?

Robin 10:05
It’s a combination of both. There are certain things that you take. For example, one of our features is anomaly detection, being able to really find, again, those small needles in a haystack. So there are going to be some things that we might reuse from the industry, but our engineers that are doing machine learning, artificial intelligence, whatever, they’re building out their projects and their models all on their own. And all of that is Netdata’s.

Chris Ward 10:36
Okay. And there’s this quite fascinating, large infographic, about three quarters of the way down the README of the GitHub repository, where there are a lot of logos in one place, and I’m not 100% sure what it’s showing me. So you’ve got things like RabbitMQ, Mongo, Puppet, Maria. Oh, yeah. Cool. Are those sources of information? Or sources of monitoring? Is it that kind of thing?

Robin 11:06
Yeah, correct. So those are examples of an IT stack that may exist in a particular infrastructure, right? And again, one of the beauties of Netdata is that you don’t have to have intimate knowledge of your IT stack to be able to monitor it effectively, because the auto-discovery capability built into the solution is going to do that for you and categorize those things smartly for you, so that you can very easily navigate between components of your IT stack. So for example, you mentioned MySQL. That was a company where I was formerly head of product for five years. And I really appreciate the fact that Netdata auto-discovers the various daemons and things that are running behind the scenes, configures various charts very smartly based on the data that your MySQL troubleshooters and your DevOps teams care the most about, and pre-configures some of the alarms that will help you get started when it comes to MySQL troubleshooting, if you happen to be a novice.

Chris Ward 12:07
Yep, I was actually just quite fascinated, because there are obviously a lot of very common ones, and I actually mostly mentioned the common ones just so you knew what I was talking about, because you don’t necessarily know every image I’m referring to. But there are some quite obscure ones here as well. I was quite fascinated to see IPFS. I spent some time working in the Ethereum space, and this is kind of this Ethereum-based file storage. I found that quite a fascinating one to see there. And some of the operating system support is also very extensive.

Robin 12:42
We try to drive these based on user feedback. And so you will find some of the usual and customary names on there. But sometimes the folks that use this in a particular industry might have slightly more obscure needs. That doesn’t mean we’ll jump on it immediately. But if we see a consistent pattern, if we’re asking questions in the forums and we see people upvote these things constantly, then we begin to take notice and begin to move those things into our roadmap.

Chris Ward 13:16
And without getting too much into the weeds, how does this discovery work? Is it a pre-configured list of things you’re looking for? Or is it looking for network connections or common ports or something like that?

Robin 13:35
All right, you’ve got it. With each one of those pieces of software that you mentioned, our engineers do the homework to understand how they are routinely installed and configured and, when they are running, what the operating system signs are that they are open and available. Those signs are then used by the Netdata agent in its auto-discovery to locate those things. And again, we’re working with users to make sure that we’re looking for the right things. These are what we call collectors, and they’re developed in conjunction with the open source community, not just internally by our engineers. So we’re making sure that we’re looking for the right and most current things that signify a particular piece of the stack is present and running.
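As a rough illustration of the "operating system signs" idea, the sketch below probes for a listening TCP port. This is an assumption for illustration only, not how Netdata collectors are implemented: the `KNOWN_SERVICES` table and both function names are hypothetical, and real discovery would look at more signals than a default port.

```python
import socket

# Hypothetical signature table: service name -> its usual default TCP port.
# A real collector would check more than a port; this is illustrative only.
KNOWN_SERVICES = {"mysql": 3306, "rabbitmq": 5672, "mongodb": 27017}


def port_is_open(host, port, timeout=0.2):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def discover(host="127.0.0.1", services=KNOWN_SERVICES):
    """Return the sorted names of services whose default port is reachable."""
    return sorted(
        name for name, port in services.items() if port_is_open(host, port)
    )


if __name__ == "__main__":
    # Prints whichever of the assumed services happen to be listening locally.
    print(discover())
```

The result depends entirely on what is running on the machine, so no expected output is shown.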

Chris Ward 14:25
And so we’re speaking, I think, only the week after KubeCon, where a lot of this observability, metrics, and monitoring ecosystem is kind of coming into its own. But with Netdata being about seven years old, I’m getting the impression from the repository, for various reasons, that the originator was something of a Linux kind of person. So you kind of get that move into Kubernetes relatively for free in some respects. But was any of that transition into the container-based, Kubernetes-based space difficult for Netdata at all, or has its history made it a relatively easy transition?

Robin 15:12
Yeah, I mean, obviously for new features such as the most recently announced Kubernetes support, there’s going to be some freshness involved in terms of implementing that particular piece of the stack into the software. But at the same time, we’re following our high-level roadmap toward making infrastructure monitoring easy for everyone. And when I say that, it doesn’t matter what kind of monitor you’re looking at. It can be a database monitor, which is the kind I’ve been most associated with throughout my career, an operating system monitor, or something that does Kubernetes. What does the end user really need? First off, they’re not so much interested in getting their coffee every morning, looking at screens, and staying in front of those screens. This is not Netflix. This is not something that people typically want to do constantly. They may have the job to do it, but that’s not necessarily what they want to do. Instead, they want to ensure a couple of things: the uptime of their infrastructure, and that it’s performing well. And to do that, if you’ve got a monitor that does what it does well, it’s going to be able to answer four questions for you. And again, this applies to the recently announced Kubernetes support. When I’m using a monitor to troubleshoot, I have four questions I want to ask. Do I have a problem? Where’s the problem? Who or what is causing the problem? What do I do about the problem?
Okay, if I can expeditiously move through those four questions and ensure, again, that issues are solved in a proactive sense, so that my IT infrastructure stays up, has consistent levels of performance, and meets my service level agreements, then I’ve done my job well. And the very last thing that you want, and this does apply to something like Kubernetes, where you’ve got containers and you’ve got pods, is what you’ve really got in a lot of cases: black boxes. And I don’t know about you, but when I was in the field, and I did this for about 10 years before moving into software, I hated black boxes. Because I’m responsible for those things that I just mentioned, I need to know what’s going on underneath the covers, and to be able to find out the intricate details of what’s moving and grooving in those things, regardless of whether it’s in a container, or just a database, or whatever else. And so that’s the path that we follow when it comes to providing the level of detail that people need to do their job.

Chris Ward 17:59
Okay, yeah. And so then, when it comes to the cloud offering, what are you adding on top of the open source project at the moment?

Robin 18:08
Yeah, so there are a number of things. The agent is typically where people begin, because this is what’s going to be installed on every piece of hardware or server, and it’s what actually does the discovery that we’re talking about and the collection of the needed information at a very granular level. One thing I didn’t mention that’s very important is that collection has to be done with minimal overhead, because the very last thing you need is to add any additional burden to already mostly overburdened servers. So you want to make sure that you’re very streamlined in what you’re doing in that sense. Then, with the agent, you’re going to get a basic dashboard of the particular machine that you’re monitoring. So I install our agent on machine A, and I have a nice visual dashboard where I can look at the statistics of machine A and have at my disposal some of the things that we’ve been talking about. All right, but what if I have 10 machines, 100 machines, 1,000 machines, something like that, and I want to be able to collectively monitor and troubleshoot all of those at the same time in one place? This is where the cloud offering comes into play. With Netdata Cloud, you’re able to stream the information from the various servers that you’re monitoring into a collectivized interface that’s in the cloud, that’s obviously browser based, and that you can organize at your leisure and get set up properly. That way you can, at a glance, answer those four questions I talked about, not just for a single machine, but understand when you sit down: all right, where do I need to spend my time? Which parts of my infrastructure and stack are under duress, and where are they? And then you can very quickly navigate to those things. So that’s the first thing that’s in the cloud. The second thing that you’re going to find is increasing levels of sophistication.
By that I mean the parts of observability that you were talking about earlier that become somewhat difficult to handle, especially when it comes to question three, who or what is causing the problem, because oftentimes that’s an interrelated conclusion that you’ve got to come to. So you’re going to have things we call metric correlations: being able, in the cloud, to smartly look at not one but upwards of thousands of statistics, correlate them, and see which affect one another. Then you can see this bundle, if you will, of particular issues that may be causing a spike or threatening your uptime or your performance. And you can do some of that anomaly detection and correlation that you couldn’t otherwise do. So that’s just another example of what you find in the cloud. But in general, to answer your question: the agent monitors the machine, and the cloud monitors your infrastructure. You’re going to be able to perform the core things you need to do with the agent when it comes to monitoring and troubleshooting, but you’re going to have increased levels of sophistication when it comes to the cloud solution.
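One simple way to picture "correlate them and see which affect one another" is to rank candidate metrics by how strongly they track a metric of interest. The sketch below uses plain Pearson correlation as a stand-in; the interview does not say which method Netdata Cloud uses, and the metric names and values here are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_related(target, candidates):
    """Sort candidate metrics by absolute correlation with `target`, strongest first."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical per-second samples around a latency spike.
latency = [10, 11, 10, 50, 55, 12, 10]
metrics = {
    "disk_io_wait": [1, 1, 1, 9, 10, 1, 1],
    "free_memory": [70, 71, 70, 69, 70, 71, 70],
    "fan_speed": [5, 5, 5, 5, 5, 5, 5],
}

ranking = rank_related(latency, metrics)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

In this made-up data, `disk_io_wait` rises and falls with the latency spike and ranks first, while the flat `fan_speed` series ranks last. Correlation only suggests where to look; it does not by itself establish what caused the problem.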

Chris Ward 21:34
And how old is the cloud in terms of product?

Robin 21:40
A little over a year. The first cut was done actually before my time. I’ve been with the company just since probably late November of last year, so it came out a little bit before I joined. We’re continually making improvements, and there are going to be some really interesting things on the horizon coming in 2021 that are going to, I think, increase the cloud’s capabilities and reach when it comes to what I call the average user. And I don’t mean to sound unkind, but

Chris Ward 22:15
a lot of people.

Robin 22:16
Yeah, a lot of people today. If you look at some of the Stack Overflow surveys, for example, you’re seeing that the people who are in the seat doing the job have five years or less of experience. And these are people that don’t necessarily understand complicated charts and algorithms and correlations and everything else. They need a little bit of a different kind of hand-holding. And that’s some of what we’re focused on for 2021.

Chris Ward 22:47
So just out of interest, and this is something that always interests me with companies that have taken some time between the open source project and the commercial project: how was Netdata keeping the lights on in those four or five years, I guess?

Robin 23:04
Yes, it was the same way we’re keeping the lights on right now, because we’re still not charging for anything. Commercialization of the software is not a primary goal at the moment. We have received plenty of funding for the small company that we are, and we’re very prudent in our spend. This allows us to focus on providing functionality that will increase the reach and ubiquity of the software. And we’ve not been shy, it’s on our website, that one day we will commercialize. But we will do so and still keep all of the monitoring functionality free. So what does that mean? Well, we plan to differentiate with perhaps advanced security and more visual and automated administration capabilities for handling, again, those hundreds or thousands of servers that are running behind the scenes. So it’s going to be a lot of ease of use, ease of management, and extra layers of security to protect your demographic data as well as your monitoring data, even though we do that pretty well today. Those are at least the plans at present in terms of commercialization. But again, it’s not our primary focus at the moment.

Chris Ward 24:15
And so there have been lots of announcements, actually quite a lot in May. The Kubernetes support you mentioned was just announced this year, and then there were quite a few different versions and updates to 1.30. I also just noticed the latest blog post, and I don’t know if this is something recent or something you were enhancing, but you also have eBPF support, which seems to be the latest buzzword in all of observability right now. Is that something that was new, or is that something you’ve had for a while?

Robin 24:59
Yeah. The emphasis on it is increasing, so it’s new in that sense. I will tell you that even we were surprised at the popularity of the support. When we announced it, we saw spikes across the board. More coverage on that is actually rolling out the door here very shortly as it pertains to disk monitoring and what have you. But that is, again, even to us, a surprisingly popular aspect of the Netdata solution.

Chris Ward 25:37
And there’s also a headline in this release about machine learning powered collectors. Oh, this is for cloud. Yeah, I was just about to ask, is this cloud or agent? It would have to be cloud. So I was kind of interested. Is that anomaly detection, largely, I guess?

Robin 25:56
Oh, yeah, it is. Historically, Netdata has done a really good job at the first two questions that I mentioned earlier: do I have a problem, and where is the problem? So, being able to understand immediately if something requires my attention, and its general location in terms of the server or servers involved. But answering that third question can get quite tricky. This is where the machine learning models that our very talented engineering staff are developing come into play, with anomaly detection. Today, we have some very sophisticated, very colorful charts that will show you these things. And we’re now beginning to look at how we can uplevel some of this and put it in more general verbiage, to be able to address that person who is looking at this chart and saying, okay, that’s great, but what does this mean to me? So those are some of the things that we’re working on internally to uplevel and be able to answer that question three for people really well. Some of this will be coming later in the year. We put out what we sometimes call experimental releases, where we’ll have new features and functionality, especially as it relates to machine learning, and have the community kick the tires and give us their feedback. We will go back and retrofit based on the feedback. And then eventually we’ll have production-level features that will finally poke their heads out.
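For readers new to the term, the sketch below shows the general shape of anomaly detection on a metric. It deliberately swaps in a simple rolling z-score rather than the machine learning models discussed in the interview, whose details are not given here. The `flag_anomalies` function, its `window` and `threshold` parameters, and the sample data are all assumptions for illustration.

```python
from statistics import mean, pstdev


def flag_anomalies(samples, window=10, threshold=3.0):
    """Return indexes whose value lies more than `threshold` standard
    deviations from the mean of the `window` samples that precede it."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if samples[i] != mu:
                flagged.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical requests-per-second readings with one obvious outlier.
requests_per_second = [100, 102, 99, 101, 100, 98, 101, 100, 99, 102,
                       100, 250, 101, 99]

anomalies = flag_anomalies(requests_per_second)
print(anomalies)  # -> [11], the index of the 250 reading
```

Only the 250 reading is flagged. The samples right after it are not, because the outlier itself inflates the spread of the recent history, which is one of the reasons production systems tend to use something more robust than this.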

Chris Ward 27:21
Okay. So May has seemingly been quite a busy month of announcements. I’m sure you didn’t make them all in May. So in the next few months, six months, the rest of the year, what’s on the roadmap that you can talk about?

Robin 27:38
Yeah. So there are three basic quadrants, if you will, pillars of the roadmap. The first just has to do with, again, very gracefully onboarding people to the cloud product. Because sometimes, if you have hundreds or thousands of servers, being able to easily install agents across all of those in a visual or administrative fashion, and getting people signed up and ready to go very quickly, can be challenging. Today we do it in a fine, expeditious sense, but there’s more that we can do on that front. So there are a number of features and functions coming that will improve a new person’s initial experience of onboarding to Netdata, whether you have one server, hundreds of servers, or thousands of servers. So that’s one area of focus. The second pillar, and this really was the number one area of user feedback for us, has to do with, call it what you will, headless monitoring or exception-based monitoring. It goes back to the statement I made earlier about people not being so interested in looking at a screen and charts all the time. This basically boils down to the alert functionality that you’ll find in most types of monitoring solutions. Again, I mentioned that we have alerts today. They’re configurable, they’re user definable, they work, and that’s fine. But now we’re focused on eliminating the clutter when it comes to receiving too many notifications. We’re concerned with upleveling the information into not dashboards, but what we’re calling smart boards.
This means that we’re trying to provide more intelligence and communicate more, especially, again, to that person that is maybe not as experienced in monitoring, and to move people through those four questions even more quickly through the alert mechanisms and these new smart boards. That’s a really good way to do that and get people to where they need to be very quickly. And then lastly, we’re moving from being a monitoring solution, which has its own definition, to a more observability-based solution. Again, it does questions one and two very well: do I have a problem, and where’s the problem? But now we’re beginning to look at how we can better answer who or what is causing the problem, and then go on to the even more difficult one, what do I do about the problem? That takes a whole new level of intelligence built in, so that you’re able to educate the user in terms of what they’re seeing, help them understand why it’s important, why the things that are being flagged to them are critical, and the implications that they have. And then, where possible, being able to provide them with guidance in terms of what to do about the problem, and perhaps links to further education or something like that. So you’re going to be seeing announcements from us in those three areas, those three pillars of our 2021 roadmap.
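The "too many notifications" problem in the second pillar can be pictured with a very small sketch. This is not Netdata’s alerting logic; `suppress_repeats`, the five-minute `cooldown`, and the event tuples are hypothetical. It only shows one common way to cut clutter: do not resend the same alert for the same host until a cooldown has passed.

```python
def suppress_repeats(events, cooldown=300):
    """Keep an alert only if the same (host, alert) pair has not been sent
    within the last `cooldown` seconds.

    `events` is an iterable of (timestamp_seconds, host, alert_name) tuples,
    assumed to be sorted by timestamp.
    """
    last_sent = {}
    kept = []
    for timestamp, host, alert in events:
        key = (host, alert)
        if key not in last_sent or timestamp - last_sent[key] >= cooldown:
            kept.append((timestamp, host, alert))
            last_sent[key] = timestamp
    return kept


# Hypothetical alert stream: the same CPU alert fires repeatedly on web-1.
events = [
    (0, "web-1", "cpu_high"),
    (30, "web-1", "cpu_high"),
    (60, "web-1", "cpu_high"),
    (90, "db-1", "disk_full"),
    (400, "web-1", "cpu_high"),
]

notifications = suppress_repeats(events)
for item in notifications:
    print(item)
```

Of the five events, three notifications survive: the first `cpu_high` on web-1, the `disk_full` on db-1, and the `cpu_high` that recurs after the cooldown has elapsed.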

Chris Ward 30:50
Cool. So if anyone is interested in finding out more, it’s netdata.cloud. I mean, you can use the open source version and play around to your heart’s content. And it’s C, so I guess it’s pretty efficient. And then, as you also mentioned, Netdata Cloud is free to sign up for. I mean, there must be some kind of restrictions on that free tier, I’m guessing.

Robin 31:15
No, sir. That’s, again, something that the company is pretty proud of. We do not restrict based on number of servers, data storage, or any typical limitation that you might find in free offerings. Crippleware, or whatever you want to refer to it as; those limits do not apply to us.

Chris Ward 31:38
Wow, okay. Sounds like an interesting problem to worry about in the future. Maybe we’ll talk in a year’s time and see how you handle that. I’d love that. Cool. And is there anything that you do outside of netdata.cloud? Are you a particularly frequent Twitter user or blogger that you want to tell people about?

Robin 32:07
You know, I’m not a huge social media guy. You can definitely find blog posts from me on the Netdata website and what have you. But yeah, if you’re a current Netdata user, I’m very easy to reach. I’m just robin, R-O-B-I-N, at netdata.cloud. Feel free to ping me with any questions you’ve got or any product ideas. I’m very open.

Chris Ward 32:30
Nice. All right. Thank you so much for your time, and I wish you all the best of luck with the roadmap. It sounds quite ambitious. All right. Well, thank you very much. Thank you again for having me.