Rockset promises “the real-time indexing database for modern data applications”. How does it compare to the competition? I speak with Venkat Venkataramani to find out.

Transcript

Chris Ward 0:03
Welcome to another Chinchilla Squeaks, where I speak with tech luminaries, tech entrepreneurs, and tech people generally. Tonight, for me anyway, I am joined by Venkat from Rockset. How are you doing, Venkat?

Venkat
I’m doing very well. Thanks for hosting me. Pleasure to be here.

Chris Ward
Nice background. You have a design team. I keep wanting to do more with mine, but I haven’t really got around to it yet. I have way too much shadow here. I really need something like you have, with the cool graphics and so on.

Venkat
Are you stealing it? I’m totally fine.

Chris Ward
I’m always quite happy when guests are prepared, so it’s completely fine. So, Rockset. I have my own ideas about where it fits into the whole ecosystem, but why don’t you describe what Rockset is and the problem you’re trying to solve for people?

Venkat 1:06
What is Rockset? Rockset is the real-time database in the cloud. Databases have largely fallen into two categories: OLTP, or transaction processing, on one end, and data lakes and warehouses on the other. OLTP systems give you speed: everything is fast when you build applications. But they are very hard to scale. If your computational complexity grows and your queries get very complex, it’s hard. If your data volumes grow, it’s hard. If your write volumes grow, it’s hard. On the other hand, lakes and warehouses are amazing at scaling. They can manage petabytes of data, but they are slow. You really can’t expect to build an application on them where the queries never stop coming and you want interactive speed, and where the data never stops coming in and you want real-time access to fresh data sets. So I think Rockset is the only system that gives you a combination of speed and scale with the simplicity of the cloud. It’s a fully managed offering in the cloud, built for the cloud. And it’s a real-time database: you can send massive streams and massive amounts of data to Rockset and instantly build very fast interactive applications and dashboards, simply using SQL. Sorry, I can’t hear you.

Chris Ward 2:50
Hardware, yes. Is it an observability database for metrics? Or is it a database for application data? What data are you storing? Perfect.

Venkat 3:04
It is business data; it is application data. I’ll give you a good example. One of our customers has a huge platform for supply chain management for heavy construction. Any time a truck is driving around, all of those things are tracked in real time in a transaction processing system, in this case Amazon DynamoDB, with hundreds of millions of records getting updated. How would you build real-time search and real-time analytics on that, where lots of records are coming in and you want to be able to join them and provide real-time reporting, search and analytics? They want to build all of that for their customers as part of their platform. And here the data is changing all the time. It gets updated based on where the truck happens to be, every time it crosses a turnstile. So this is all business data: logistics, supply chain. A lot of the time it’s business data for sales operations, marketing operations, customer support operations, security operations, risk operations in the finance industry, anything where real time matters. Don’t tell me that something bad happened hours ago or days ago. Tell me what’s happening now.

Chris Ward 4:27
You mention on the website that it’s a real-time indexing database. Is it an index of data held elsewhere, or do you replace your databases with Rockset?

Venkat 4:39
Extremely good question. It is a lot more of the former. So why is indexing important? A lot of people say, why don’t you just say real-time database? Why indexing? The thing with real time is that two things happen the minute you start looking at real-time analytics. A lot of people will say, well, I don’t really need real time. I only look at this thing once a week, so it’s fine if it’s two hours behind. But as soon as real-time analytics comes into the picture, people don’t want humans to look at the data. People want automation; people want alerting. Machines, programs and applications are going to be looking at that data, and they’re going to be looking 24/7, because they never get tired and they don’t need overtime. They will alert a human on the other end when something actually demands attention. So it very quickly turns into an application. Indexing is very important there, because if your data is fresh but every query takes 20 minutes or 40 minutes to come back, which is what you expect from a warehouse, or hours to come back, well, you don’t really need real-time data, because your queries are slow. So indexing is one of the core techniques that Rockset has built in completely. You have to take extra steps to not index data in Rockset. By default, when data comes in, in real time, within one to two seconds the entire data set gets fully indexed across all fields. The data could be structured or semi-structured, and it automatically gets indexed into fully typed, fully indexed SQL tables, so that you get very fast SQL queries out of the box. So indexing is a very key component, because we end up serving applications. Now, you asked whether we are indexing data held elsewhere. For transaction systems, yeah.
So if your system of record is DynamoDB, MongoDB, MySQL, Postgres,

Chris Ward 6:42
You have a few here: Dynamo, Mongo, MySQL, Postgres. Then you have some cloud object storage, like S3. And then Kafka, which is also kind of interesting.

Venkat 6:57
Yes, yes. We have all three types. We have transactional database sources. We have real-time streams, Kafka and Kinesis, which you can just plug in. Or you can point us to whatever data is sitting in your data lake, in S3 or GCS. We will index all of that data in real time and give you fast SQL queries on the other side for application development.
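
To make the idea of indexing every field more concrete, here is a minimal Python sketch of an inverted index over semi-structured records. It is only an illustration of the general technique, not Rockset’s implementation; the function names `build_index` and `lookup` and the sample truck records are assumptions made up for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map every (field, value) pair to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[(field, value)].add(doc_id)
    return index

def lookup(index, **conditions):
    """Return ids of documents matching all field=value conditions, using only the index."""
    id_sets = [index.get((field, value), set()) for field, value in conditions.items()]
    return sorted(set.intersection(*id_sets)) if id_sets else []

# Hypothetical sample records, loosely modelled on the truck-tracking example.
docs = [
    {"truck": "T1", "site": "north", "status": "loaded"},
    {"truck": "T2", "site": "north", "status": "empty"},
    {"truck": "T3", "site": "south", "status": "loaded"},
]
index = build_index(docs)
matches = lookup(index, site="north", status="loaded")
print(matches)  # [0]
```

Because every field is indexed as it arrives, a filter on any combination of fields becomes a set intersection rather than a scan of all records, which is the property that keeps queries fast as data grows.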

Chris Ward 7:18
Okay. And I can already see, lurking on your website, comparisons to some of the obvious competitors, which are relatively mature. So how does Rockset compare to something like Elasticsearch and all the projects that came before it, like Lucene and Solr, etc.?

Venkat 7:42
There’s a bunch of open source, born-in-the-data-center kind of technologies out there. Elasticsearch comes to mind; Apache Druid comes to mind. I think they’re really, really good systems on premises. They’re just not built for the cloud. They don’t have compute-storage separation, and they also don’t have full-featured SQL. When you don’t have full-featured SQL, the operational complexity of these systems goes through the roof, because you can’t do real-time joins; you have to do write-time joins. The minute you do write-time joins, you have to denormalize your data, and the data gets inconsistent. Your real-time analytics doesn’t agree with your batch analytics, and it’s just a total mess. So a lot of people look at those systems and say, oh my God, it’s operationally complex. What people are really saying is that installing it and configuring it is not really the problem. When I build solutions on top of systems that don’t have joins and that aren’t built for the cloud, it’s not just the server operations that get complex; the data administration aspects also get tremendously complex. So in short, none of those systems are really built for the cloud. They’re not born in the cloud, and they don’t have full-featured SQL. The reverse is also true. If you’re really looking to build a great real-time analytics solution in your data center, say you’re a big bank and you have a huge data center, well, Rockset can’t help you. You have to be in the cloud to be able to use Rockset, which is where we think we have a much, much better offering, because we are born in the cloud and we only run in the cloud.
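
The inconsistency that comes with write-time joins can be shown with a small Python sketch. The customer and order records below are hypothetical, and the code only illustrates the general denormalization problem, not any particular product’s behaviour.

```python
# Hypothetical source-of-truth records.
customers = {"c1": {"name": "Acme", "tier": "basic"}}
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c1"},
]

# Write-time join: customer fields are copied into each order when it is written.
denormalized = [{**o, "tier": customers[o["customer_id"]]["tier"]} for o in orders]

# The customer record changes after the copies were made.
customers["c1"]["tier"] = "premium"

def join_at_query_time(orders, customers):
    """Look up the current customer record each time the question is asked."""
    return [{**o, "tier": customers[o["customer_id"]]["tier"]} for o in orders]

stale = [row["tier"] for row in denormalized]
fresh = [row["tier"] for row in join_at_query_time(orders, customers)]
print(stale)  # ['basic', 'basic']
print(fresh)  # ['premium', 'premium']
```

The denormalized copies keep reporting the old tier until something rewrites them, which is the kind of disagreement between data sets described above. Joining at query time always reads the current record.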

Chris Ward 9:28
So I think this leads nicely into my next question, which is: where did you come from? Why build it? I’m getting the impression, from what you said and from browsing around the website, that it’s not open source and it’s not self-hosted; it’s all cloud hosted. So where were you when you had this problem that you wanted to resolve and decided to create Rockset?

Venkat 9:56
Awesome. A bit about me and the team. I was managing all online data infrastructure at Facebook. I was there between 2007 and 2015, and these were the hypergrowth years. By the time I left Facebook, the online data infrastructure was serving about 5 billion queries a second. So, massive scale. Among many other things, we saw the transition of batch-based applications and systems moving to real time. That was one of the biggest trends, I would say, that Facebook went through. I tell this story to a lot of people, and most are surprised to hear it: the first version of Facebook News Feed was a batch-based system. It would basically run an ETL on everybody’s activity and then build the news feed every night for every person. Then they tried to do it every few hours, and it couldn’t scale. It couldn’t even last until 2008, and News Feed was only launched in, was it 2006? It couldn’t even last that long. Then we switched over to real-time indexing and a real-time system, and the generalized version of that, in some ways, is Rockset. So you can build massive-scale real-time applications. News Feed is one example, but it doesn’t have to be a news feed. It could even be a real-time dashboard that gives you ad hoc slicing and dicing of your business data. So that’s really where we saw a lot of this movement from batch to real time, and that’s where I spent a lot of time. My co-founder, Dhruba Borthakur, was also one of the creators of the Hadoop file system, back at Yahoo when he was there. We also worked together on this project called RocksDB.

Chris Ward
Hence the namesake, Rocks.

Venkat
Yes. RocksDB is open source, and that is our storage engine.

Chris Ward
I actually used to work for a company here that used RocksDB.

Venkat
RocksDB came out of my team. I was the manager and the overhead, I guess, and Dhruba was the key person who really created it and shepherded the project forward. We open sourced it in 2013, I think, and since then any new distributed data management system that has come out has been built on top of RocksDB. RocksDB got extended to work very well in the cloud. That’s called RocksDB-Cloud, and it is also open source. So we have a lot of roots in open source. The rest is the SQL engine and everything else that makes Rockset, because Rockset is not just RocksDB in the cloud. It’s a lot more than that. It’s just like the Lucene and Elasticsearch comparison that you gave. The relationship that Elasticsearch has with Lucene is very similar to the relationship Rockset has with RocksDB. And RocksDB is still open source.

Chris Ward 13:04
From memory, Facebook had a reasonably well-known SQL processing library that I cannot for the life of me remember the name of. Presto, yeah. Is that what you’re using?

Venkat 13:16
Oh, no. There’s an open source project, it’s called Trino now, I think; they’ve changed the name, for whatever reason. But no. When Facebook built Presto, it was really, again, for batch systems. It was for big data and batch analytics. Nobody in their right mind would use Presto to power an application, because it’s not built for application processing. It’s built for batch analytics. It’s a wonderful SQL engine, and we’re huge fans of that work. If you have, for example, a huge data lake in Amazon S3 and you’re looking for a massively distributed, massively scalable SQL engine for batch computation, where you want to generate reports on a daily or weekly basis for your analysts, oh my God, Presto is amazing. But I wouldn’t build a real-time application on top of Presto.

Chris Ward
Okay. So when did you start? When did Rockset begin?

Venkat
I left Facebook in 2015, so I would say Rockset was probably officially incorporated by the tail end of 2016 or something like that. For two years we were just deep in R&D. Building a new database, especially a real-time database, is no easy feat. We came out of stealth sometime in late 2018, and since then we’ve been in the market.

Chris Ward 14:49
You’ve mentioned high speed, real-time indexing and SQL querying, which is, I can’t quite get the right word, a golden thing that many modern databases have attempted to replicate, because it’s a language that many DBAs recognize. Apart from those two fairly big, fairly useful features, what are some of the other features you have?

Venkat 15:29
So this is great. Those are kind of the bread and butter of what we’re talking about: the core real-time indexing engine. Whether the data is structured or semi-structured, coming from stream-based systems or batch-based systems, doesn’t matter; Rockset will automatically turn it into indexed SQL tables. But there is more on both ends. To make it easy for people to bring their data set from wherever it’s being stored, Rockset comes with a lot of built-in connectors. If you’re a MongoDB Atlas customer or an Amazon DynamoDB customer, or you just have a lot of data in S3, you literally just have to create an account with Rockset and point us at your data set. Instantly, we will go make a full copy, and our connectors will automatically transition into the change data capture stream. So you continue to use your MongoDB Atlas collections and DynamoDB tables, and whenever your application makes a change, within one to two seconds it is reflected automatically in Rockset’s indexes, in Rockset’s collections. The built-in connectors shorten the time to build a new solution so much that one of our customers said Rockset took their six-month roadmap and shrunk it to a single afternoon, because otherwise they had to build all of those connectors and everything else, even if there was a quote-unquote database on the other end. There’s also a lot of innovation on the other side, where applications are trying to connect to Rockset and query Rockset. We have this functionality called Query Lambdas. So what’s a Query Lambda? Take a SQL query, put some parameters on it, just like you do on a normal SQL query, and with a click of a button in Rockset you can turn that into a fully versioned, dedicated REST endpoint. So you can basically go from SQL to a data API, a REST API, that’s fully versioned.
So you can integrate with your CI/CD pipelines, you can run tests on it and have test versions and production versions, and you can have tags and what have you. That, I think, is a small thing, but it has had a profound impact on large teams being able to build solutions on Rockset very quickly. The simple reason is that when data teams clean up data, put it in tables, and then give it to the other stakeholders, the internal dev teams, there is an expertise mismatch. The people who know how to query the data, and know how the data is organized and indexed, are not the same people constructing the SQL query. And on any table, you can construct a SQL query that comes back in 10 milliseconds or one that comes back in 10 days. So when you’re exporting APIs with Query Lambdas, it really accelerates development, accelerates innovation, and reduces time to market for building your solution. So both data connectors for bringing data in and Query Lambdas for building applications on top of Rockset have been a tremendous value proposition for our customers.
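
As a rough illustration of the idea behind a named, versioned, parameterized query, here is a small Python sketch built on the standard `sqlite3` module. It is not Rockset’s API: the `SavedQueryRegistry` class, its methods, the table and the query names are all hypothetical, and the HTTP endpoint that would sit in front of `run` is left out.

```python
import sqlite3

class SavedQueryRegistry:
    """Hypothetical stand-in for a store of named, versioned, parameterized queries."""

    def __init__(self, conn):
        self.conn = conn
        self.versions = {}  # name -> list of SQL texts; version n lives at index n - 1
        self.tags = {}      # (name, tag) -> version number

    def save(self, name, sql):
        """Store a new version of the query and return its version number."""
        self.versions.setdefault(name, []).append(sql)
        return len(self.versions[name])

    def tag(self, name, tag, version):
        """Point a tag such as 'production' or 'test' at a specific version."""
        self.tags[(name, tag)] = version

    def run(self, name, tag, params):
        """Execute the tagged version with caller-supplied parameters."""
        sql = self.versions[name][self.tags[(name, tag)] - 1]
        return self.conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trucks (id TEXT, site TEXT, loads INTEGER)")
conn.executemany(
    "INSERT INTO trucks VALUES (?, ?, ?)",
    [("T1", "north", 4), ("T2", "north", 7), ("T3", "south", 2)],
)

registry = SavedQueryRegistry(conn)
v1 = registry.save("trucks_by_site", "SELECT id FROM trucks WHERE site = :site ORDER BY id")
v2 = registry.save(
    "trucks_by_site",
    "SELECT id, loads FROM trucks WHERE site = :site ORDER BY loads DESC",
)
registry.tag("trucks_by_site", "production", v1)
registry.tag("trucks_by_site", "test", v2)

prod_rows = registry.run("trucks_by_site", "production", {"site": "north"})
test_rows = registry.run("trucks_by_site", "test", {"site": "north"})
print(prod_rows)  # [('T1',), ('T2',)]
print(test_rows)  # [('T2', 7), ('T1', 4)]
```

The point of the design is that application developers only pass a name, a tag and parameters, while the people who understand the data own the SQL text and decide which version each tag points to.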

Chris Ward 18:33
Something I’m just trying to understand: you mentioned these various data sources. Do you actually merge the indexes from multiple data sources, or is every index connected to its own source?

Venkat 18:48
Great question. Most sources become their own table. Typically what people want is this: there is a table that is sourced from Kafka, so a Kafka topic becomes a table in Rockset. There’s another table whose source is DynamoDB, another table that’s sourced from MongoDB, and what have you. You can join them at query time. This is the power of joins, so you don’t need to merge them at all, unless you really need that for de-duping or something like that. That is also supported, but it’s quite uncommon for people to want to de-duplicate and merge from multiple sources. These all just show up as tables in a single database, and you can join them with standard SQL: left joins, right joins, outer joins, inner joins, what have you. Window functions, aggregations, sort by, everything is supported.
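
A query-time join across tables that were loaded from different sources might look like the following Python sketch. It uses an in-memory SQLite database purely as a stand-in; the table names, columns and rows are hypothetical, and the comments about where each table came from are assumptions for the sake of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Imagine this table is fed by an event stream (hypothetical).
conn.execute("CREATE TABLE truck_events (truck_id TEXT, gate TEXT)")
# Imagine this table mirrors a transactional database (hypothetical).
conn.execute("CREATE TABLE trucks (truck_id TEXT, owner TEXT)")

conn.executemany(
    "INSERT INTO truck_events VALUES (?, ?)",
    [("T1", "A"), ("T1", "B"), ("T2", "A")],
)
conn.executemany(
    "INSERT INTO trucks VALUES (?, ?)",
    [("T1", "Acme"), ("T2", "Birch"), ("T3", "Cole")],
)

# The two tables stay separate and are only combined when the query runs.
rows = conn.execute(
    """
    SELECT t.owner, COUNT(e.gate) AS crossings
    FROM trucks AS t
    LEFT JOIN truck_events AS e ON e.truck_id = t.truck_id
    GROUP BY t.owner
    ORDER BY t.owner
    """
).fetchall()
print(rows)  # [('Acme', 2), ('Birch', 1), ('Cole', 0)]
```

The left join keeps trucks that have no events yet, which is why the third owner appears with a count of zero.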

Chris Ward 19:41
So you don’t index across the databases, but you allow people to query across databases if they so wish and it’s relevant. And actually the same question goes for this feature you have called Smart Schemas. Is that a schema per source, I guess?

Venkat 19:59
Correct. Per table. Why is it smart? Again, it goes back to real-time analytics. In real-time analytics, gone are the days where you can pause the world and say ALTER TABLE, add a new column, and it’s going to be of this type. Nothing is going to wait for you. There are millions of records coming in every second, and you have to pick the schema up on the fly. So what Smart Schemas basically does is automatically turn NoSQL data into SQL tables, and we have to be very smart about that. When JSON data, NoSQL data, is coming in, new fields can appear out of thin air and fields can change types. We have a very sophisticated system to automatically infer that, and it’s still strongly typed. There is no type coercion in Rockset. We still preserve the type, even if a column is changing types on the fly or new columns are appearing on the fly. You can still do a DESCRIBE on the table like you would on any other SQL database, and you get a very detailed schema back: these are all the fields we have seen, these are all the types we have seen for each of those fields, and so on. So that’s why it’s smart. The schema still exists, and the SQL is still strongly typed in Rockset, but the schema is built on the fly and you never have to manage it.
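
The general idea of building a schema on the fly, while recording every type seen for a field instead of coercing values, can be sketched in a few lines of Python. This is an illustration only and not how Rockset implements Smart Schemas; the `observe` helper and the sample documents are hypothetical.

```python
from collections import defaultdict

def observe(schema, doc, prefix=""):
    """Record the type of every field in one document, using dot paths for nested fields."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            observe(schema, value, path + ".")
        else:
            schema[path].add(type(value).__name__)

schema = defaultdict(set)
# Hypothetical incoming documents: "weight" changes type and "driver" appears later.
stream = [
    {"truck": "T1", "weight": 12},
    {"truck": "T2", "weight": "12.5t"},
    {"truck": "T3", "weight": 9, "driver": {"name": "Ana"}},
]
for doc in stream:
    observe(schema, doc)

# A "describe"-style view: every field seen so far and every type seen for it.
described = {field: sorted(types) for field, types in sorted(schema.items())}
print(described)
# {'driver.name': ['str'], 'truck': ['str'], 'weight': ['int', 'str']}
```

No value is converted along the way: the `weight` field simply ends up with two recorded types, and the new nested field is added to the schema when it first appears.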

Chris Ward 21:25
And the other one I’d like to quickly dig into is that you say full SQL. I know many companies and projects that claim SQL on top of things where SQL doesn’t really belong, and then you dig into the small print and it’s never full SQL. Is it really full SQL?

Venkat 21:48
Extremely good question. Clearly you’ve been around. I’ll give you a little tip: any project that claims to have a SQL-like API doesn’t have SQL, because people who have SQL will just say they have SQL. They will never say SQL-like. So we are not SQL-like. It’s a full-fledged SQL engine, following the ANSI SQL standard. The real magic happens when you throw NoSQL data at it and can then just run SQL on it. That flies in the face of what people have been told, that SQL and NoSQL are two islands with no bridges in between. That’s completely untrue. In fact, SQL and the relational data model were two distinct inventions that just merged together in the relational database management system. SQL is just a declarative data retrieval language, and it has tremendous properties that also apply to NoSQL data. So we have a very simple, elegant dot notation and simple extensions that allow you to also work with nested data and what have you, which is very common in NoSQL. But it’s a full-featured SQL engine. In fact, our distributed SQL engine is extremely high performance if you compare it against batch systems like some of the open source projects we talked about, such as Presto. A SQL query comes into Rockset and gets broken down into fragments, sometimes thousands of fragments, that need to get scheduled on a massively distributed back-end infrastructure. The whole compilation, query planning, scheduling and initiation of the actual query execution usually takes many seconds, maybe minutes, in Hadoop-based systems. In more modern warehouses, it will probably take 500 milliseconds to maybe one to two seconds. In Rockset, that whole process takes 1.2 milliseconds.
And this tells you a little bit about why: for applications, every millisecond matters. So we take a SQL engine with fully supported SQL: window functions, GROUP BY, ORDER BY, joins, what have you. We compile that, and we have a massively distributed execution engine that can supercharge your query and start execution in about one millisecond.

Chris Ward 24:29
Just looking around from the cloud perspective, as far as I can see, unless I’m just not understanding the product names, I can see AWS and I can see Google, but I don’t see Azure. Is there any reason for that? Or am I just not understanding the product names?

Venkat 24:48
Work in progress. No, I think data sources can be anywhere. We have our API; even on Azure, you can use our API. Okay, you’re looking at the built-in connectors. We are just building connectors based on the kind of customer base we happen to have right now, but building a new connector is very easy with our APIs.

Chris Ward 25:13
But when you talk about S3 or GCS, what are people querying there? With S3 especially, you can store all sorts of things. That, I guess, is the question.

Venkat 25:28
Yeah. You can store data in S3, but you can’t really query it. It’s a storage system, not a database. What usually happens is this. We have a hedge fund customer that has hundreds of data sources that they look at. They get all of that into S3 and run very complex ETL jobs that read from S3 and dump the results back into S3: Spark jobs and what have you. Then they get a three-terabyte or four-terabyte golden data set that they want to put in the hands of every analyst and every investor in their company, who want to run interactive queries on it. So what they do is put that into S3, come to Rockset and say, hey, go index this data so that I can run full-featured SQL on it. So we are the indexing and serving layer for data that has just been processed and dumped into S3. To give you an example, they were using a modern cloud data warehouse, and the queries were taking two to five seconds. They had certain interactions that involved 30, 40, 50 queries. So you do the math: it was taking many minutes to load. They moved to Rockset and indexed this data, and the queries come back in 18 milliseconds. So the entire application got about 100 times faster when they started indexing it.
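
The “do the math” step can be written out explicitly. The Python sketch below uses only the figures quoted above and adds one assumption that the interview does not state: that the queries behind a single interaction run one after another rather than in parallel.

```python
def page_load_seconds(queries, seconds_per_query):
    """Total time for one interaction, assuming its queries run sequentially."""
    return queries * seconds_per_query

# Warehouse figures quoted above: 30 to 50 queries at 2 to 5 seconds each.
warehouse_low = page_load_seconds(30, 2.0)     # 60 seconds
warehouse_high = page_load_seconds(50, 5.0)    # 250 seconds, a little over 4 minutes

# Indexed figure quoted above: 18 milliseconds per query.
indexed_high = page_load_seconds(50, 0.018)    # roughly 0.9 seconds

# Per-query speedup implied by the quoted latencies.
speedup_low = 2.0 / 0.018                      # roughly 111x
speedup_high = 5.0 / 0.018                     # roughly 278x

print(round(warehouse_low), round(warehouse_high), round(indexed_high, 2))
print(round(speedup_low), round(speedup_high))
```

Under that assumption, a worst-case interaction drops from around four minutes to under a second, and the implied per-query speedup of roughly 111 to 278 times is in line with the “about 100 times faster” figure.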

Chris Ward 26:51
I usually have a question about what’s next, but I’m going to ask specifically in terms of data sources: what’s next? Are there any popular data sources out there that you don’t support right now but are planning to?

Venkat 27:10
Yeah, very good question. I think the key set of data sources on which people want to build real-time analytics is, more and more, transactional databases. We have MongoDB Atlas, we have DynamoDB, and we just released MySQL and Postgres. For that connector, we leverage AWS Database Migration Service (DMS). So what’s coming next is that no matter what your transactional database is, Oracle, SQL Server, maybe even on-prem, you should be able to use it. Your data can still live on-prem, but your indexes will be in the cloud. So very soon you can expect that no matter where your data is, in terms of transactional database management systems, if you’re on AWS you can leverage DMS plus Rockset to build your indexes in the cloud.

Chris Ward 28:14
I guess the final question I have is how you stay ahead of, or competitive with, the obvious contenders we’ve mentioned. Elasticsearch, and the company Elastic, is probably the biggest elephant in the room for you. They have their own challenges, of course, but how do you stay ahead of the limited but well-known competition in this space?

Venkat 28:49
I think for the application use cases, where you’re trying to build applications with nice joins on business data, Elastic is usually not even the biggest competitor, because Elastic is used a lot more for log analytics. That is data that is very valuable for the next hour, maybe a day, but after that it’s not that valuable. And Elastic’s future, I think, is going much more toward becoming a SIEM, a security company with incident management and a full-fledged enterprise log analytics solution. So if you really want business analytics, you shouldn’t be using Elastic in the first place, and even if you were, I think Rockset is a much better match. And you shouldn’t be using Rockset for log analytics. You should be using Elasticsearch for that, because they have a much better solution for it. The competition really comes from complex on-prem technologies that are very hard to manage, like Apache Druid and things like that. As I said, they’ve done amazing work. If I had my own data center, and I’m a CIO who has to figure out how to get some real-time analytics going in my on-prem data center, and I have lots of servers, I think Druid is a pretty good bet. But you shouldn’t be using that in the cloud. So in the cloud, is real-time analytics possible? I think the real competition in the cloud is a lot of DIY. People have traditionally been solving this problem with duct-tape infrastructure, a lot of spit and glue and duct tape, where disparate systems are all concatenated and a Rube Goldberg machine of sorts is built to solve real-time analytics.
This is why the time to value, the time to market, in terms of turning your data assets into applications and being able to quickly build and scale those applications, is the value prop that resonates. Because it’s possible: you can do real-time analytics in the cloud yourself. It’s just going to take you a lot of people and a lot of time.

Chris Ward 31:05
So we talked about new data sources. But the final question I always ask people is: apart from that, what else is next in the next six months or so?

Venkat 31:16
Lots of stuff. We’re actively growing. We closed our Series B, $40 million, with Sequoia and Greylock, for the third time, leading an investment with us. So every part of the company is growing. We’re hiring more people in sales, marketing, product, engineering, you name it. We’re even looking for office managers, because our office is growing. So yeah, we’re growing a lot, and there’s a lot of very exciting development coming up. On real-time analytics, we will do everything to not only make it easy for you to bring in data from anywhere, but also invest heavily in the app connectors, whether you want to build the application using whatever language, or maybe the application is just a dashboard. So there will be more and more connectors to the real-time BI layers that people use to visualize the data. There’s also going to be a lot of focus on building out the ecosystem, so that Rockset becomes the fastest place to go from data to apps, and real-time apps.

Chris Ward 32:28
Seems like a good aim to have. So if anyone is interested, that’s rockset.com. It’s not open source, so you can’t roll your own, and you’ve mentioned the cloud-native aspect several times. But there is a free tier if people want to kick the tires and see what it’s all about. As far as I can tell, you get a reasonable amount for free, and then it charges per hour. So you can always experiment for a bit without a cost. So great, there’s

Venkat 33:07
a free tier for hobby projects, and up to a few gigabytes of data you don’t have to pay anything. Only at larger scale does the pricing come in, and even then the $300 in free trial credits will take you far enough to really build and try something out. So yeah, go ahead and kick the tires. rockset.com also has a blog, where we write a lot of content. So go check us out: go to rockset.com and you can find the blog, or just add /blog. And if you’re interested in any of the technical details that we talked about, there are white papers, blogs, and videos made by the actual engineers who built it. So your listeners and your viewers will have a lot of fun; if they’re interested, they can go check that

Chris Ward 33:55
out. Yes. Okay, thank you. Thank you very much for your time, and good luck. Awesome. Thank you. Bye.