Video of conference talk at Big Data London
TRANSCRIPT
All right, everybody. Thanks for coming out. Appreciate the standing room. This is pretty awesome. Really glad to be here at Big Data London to talk about DuckDB at scale and how DuckDB scales and how we at MotherDuck have probably the largest DuckDB fleet in the world. Although given the rate at which DuckDB is growing, chances are that may change in the future.
How many people are DuckDB users, actively DuckDB users? Okay. How many people are DuckDB curious? How many people are just here to throw stuff at me? I see some ClickHouse shirts in the back. I bet—no, just kidding.
But first let me introduce the flock. Who am I? I'm Jordan Tigani. I've been a database person for about the past 15 years. I worked on BigQuery for 10 years; I was trying to calculate, and I think we ended up querying a yottabyte or a zettabyte by the time I left. I worked at SingleStore for a couple of years, and about 3 years ago I encountered DuckDB and saw, wow, this is an amazing piece of technology; somebody should really take this and build a cloud service out of it. At first I talked to Hannes and Mark, the DuckDB founders, to see if they were interested in hiring me to maybe help them build a cloud service, and they said, "Well, we really just want to focus on the technology, but we'd be happy to partner with you." After that moment, the DuckDB and MotherDuck partnership was formed.
What is DuckDB, in case people don't know? It's an embedded SQL database tuned for analytics, and it has some really nice properties. It scales down really nicely. It's embedded, so you can link it into anything you're building. It has really nice connectors to most languages. Here's an example: in just four lines of Python you can write your first DuckDB query.
DuckDB has also been really accelerating in popularity. I think it has 30-some thousand GitHub stars at this point. The number of downloads—they just shared with me that they're doing a petabyte of extension downloads a month. That's a lot of people running DuckDB.
Why do people like it so much? I like to think of it as the full burger. Most databases focus just on the patty of the burger. Certainly when I was at BigQuery we did this. We said, "Okay, once you give us your data, you send us your query, then we'll try to optimize that query, run it really fast and give you the results as fast as we can." But anything outside of that is your own problem.
DuckDB, for example, has a full-time engineer with a PhD working on their CSV parser and they've been working on it for years and they're still working on it because they want to make it better and better and better. In BigQuery, we had a new grad working on our CSV parser. They spent a couple months on it and it was like, "Okay, cool. That's done. Now we're on to the next thing." But if you think about what actually takes you time as a data analyst, data engineer, data person, it's like, "How do I get this damn data that's over here into this other place that's over here?" And DuckDB makes this really nice. In fact, for my demo, I was going to show comparisons to BigQuery and I just got frustrated with how hard it was to load data from S3 and I gave up. That's the kind of thing that DuckDB focuses on doing really, really well.
What's MotherDuck? We're taking DuckDB and we're hosting it in the cloud. We're building a SaaS version of DuckDB. We run lots of DuckDBs. One way of thinking about it is just it's a data warehouse. It's a really low latency, easy, inexpensive, great data warehouse. On the other hand, it's being able to take the DuckDB you know and love and just have somebody else run it for you. These are the same four lines of code that I had for DuckDB. The only thing you need to do to start using MotherDuck is change the file name to start with "md:". If you start your file name with "md:" it loads the MotherDuck extension and it'll actually run in the cloud.
MotherDuck is serverless and spins up instances on your query: within 100 milliseconds we'll assign you a DuckDB instance, which we call a duckling. We use warm storage, so cold start can actually be very fast, which allows us to spin instances up and down. We also have some very large instances. The tenancy model allows you to run one DuckDB instance per end user, because one of the tricky parts about DuckDB is that it's really focused on a single workload at a time. We say fine, use a single workload at a time; but if you have lots of workloads, each one gets a different DuckDB instance, and each one can scale independently.
The other exciting thing is that next month MotherDuck is launching our first European region in Frankfurt at AWS. Because DuckDB was created in Amsterdam, there's a lot of excitement around DuckDB. We're really happy to partner with a bunch of folks in Europe to help spread the duckiness as much as possible.
What have we learned from running giant DuckDB instances? Well, first, I want to address the elephant in the room. We're at Big Data London. I also wrote a blog post a couple of years ago called "Big Data is Dead" where I talked about how big data is no longer a relevant term. So what am I doing at Big Data London?
Well, given that we're in London, I figured I would start with a Shakespeare quote from Julius Caesar, suitably duckified: "I come not to praise big data, but to bury it." I'm not quite that anti-big data. It turns out sometimes in order to get a catchy title, you need something very black and white, but of course there's lots of shades of gray.
We looked at the data about what data people are using. A couple of data sets have been published by other data warehouse providers about how their technology is being used: one is called Snowset, Snowflake's data set, and another is called Redset, Redshift's data set. The accompanying paper basically claimed that the shape of data analytics benchmarks is not quite right for modern workloads, but it also provided a really nice window into the actual sizes and shapes of data that people are using in the real world. It turns out that most databases are smaller than 10 terabytes; we're going to call those not really big data. And when you do have a big data database, you end up only querying a small portion of it: only 1% of the time do you actually query the big data tables, and only half a percent of the time do you also read large amounts from them, because often when people read these giant databases they read smaller slices.
Of course, there are a bunch of patterns that do lead people to big data. One of the key insights, I think, is that when you think about big data, it's actually two different things. There's big data in the sense of the size of the data: the data can't be stored on a single machine, you have giant amounts of it. But with separation of storage and compute these days, everybody just stores their data on S3 or GCS or whatever your object store du jour is, and it turns out not to be a particularly interesting problem. Storing a lot of data is easy: you just throw it in an S3 bucket, the capacity is virtually unlimited, and it's inexpensive.
What's actually interesting is big compute: the size of compute that you need in order to access that data. Do you actually need distributed computation to solve your problem? The one thing we tend to take for granted is just the size of machines these days. When we started BigQuery, getting 64 gigs of RAM was virtually impossible; we had to slice things very, very narrowly, and getting more than a couple of cores in a particular process was very difficult. Now there are machines with hundreds of cores, terabytes of RAM, and tens if not hundreds of gigabits of network throughput. It really changes the equation of what big data means.
I like to draw things in a 2x2 chart: the bottom left is small data, small compute. The top left is small data, big compute. The bottom right is big data, small compute. And the top right, finally, is big data, big compute. I think it can be useful to segregate those by use cases.
If you think about how much these things cost, obviously small data, small compute doesn't cost very much. Small data, big compute—compute is what actually drives the costs, so if you have to do a lot of computation it's going to be more expensive. If you have big data and small compute, often actually it doesn't cost that much because data storage costs tend to be so low. Really where you're going to spend a ton of money is in this top right corner, which also leads people to avoid running queries in this part of the landscape. It's what leads them to do things like the medallion architecture where you take your data, you land it in this giant size and then you kind of create smaller and smaller slices of it in order to reduce your query costs.
Another interesting thing is that for these different quadrants there are different things you optimize for. In the small data case, you really want latency. You very often have a user involved, and users care how long things take: the difference between 2 seconds and 200 milliseconds is a huge deal if you're showing your end users some sort of analytic or graph. Whereas at the large data sizes, what you really care about is throughput: can we get stuff done, can we just push this through as fast as possible? I remember one time my manager in BigQuery said that if you added a second to every query, he didn't care; adding a second to every query was considered not a big deal, not a problem. That was because BigQuery was so far in this top right corner.
The different design goals for throughput versus latency really drive the kind of architecture you need to build. And if you think about where 99% of all the usage is, the design goal is latency, yet we tend to be using systems that are designed for throughput.
Here's where some of the workloads map. I'm going to go into a bunch of these in more detail but you can kind of see how there are real things in each part of these quadrants.
Digging into the big data versus big compute. The first quadrant is small data, small compute. DuckDB, I think probably everybody here admits or agrees that DuckDB can work great on this. You're doing ad hoc analytics, you're doing something on your laptop, you're doing some data science exploration. You have the platinum or gold tier if you're using the medallion architecture or you're authoring some SQL. What's important is the pace of iteration and if you want fast iteration then it's nice to just have small data and small use cases.
Here's a representation of some benchmark results from the ClickBench benchmark, which was created by the ClickHouse folks. It's too small to read, but what's important isn't actually the vendors on here: less expensive is toward the top, faster is toward the right, so you want to be in the top right corner. We have a bunch of small distributed databases that are pretty inexpensive but really slow, often an order of magnitude less performant. Then you have the large-scale distributed databases, which do get a lot faster and move toward the right, but are of course very expensive. And then here are the DuckDB versions, which land in the top right corner: they're not expensive and they're super fast, super low latency.
There's a typical problem when you have a shared data warehouse, which is how we typically see people building data warehouses. You might have a single instance for your team or your department, a lot of users hammering that same warehouse, and you end up having to provision for peak. One of the things about analytics is that the capacity you need can change rapidly by orders of magnitude; even with autoscaling, it's very hard to keep up.
The way MotherDuck works is that every user gets their own DuckDB instance. If you have hundreds of users, each of your hundreds of users gets a DuckDB instance and then we can scale all those independently because some of those are going to need a lot of data, some of them need a lot of computational power, some of them are going to be pretty tiny, some of them are going to be very ephemeral and we shut them down when they're not being used.
That was small data, small compute. Now what about small data, big compute? I was originally thinking this quadrant didn't really exist. Then I realized we actually do see a lot of cases where people hit the warehouse harder than a single DuckDB instance could really handle. Let's say you're doing cloud-based BI with Tableau or Omni or Looker. There might be a lot of users hitting it at the same time, and even if they're only looking at the platinum tier, only a small amount of data, that can knock over the data warehouse you're using unless you scale it up very large. And if you scale it up very large, you have the same overprovisioning problems.
Cloud business intelligence is a key workload for this. There's also a model of building SaaS applications that show data to end users; sometimes everybody is slicing and dicing the same data, and that's also big compute, small data. And finally, AI agents, which I'll talk about in a second. There's something in MotherDuck we do to help make this work: I mentioned that every user can get their own DuckDB instance, but provisioning all of those can be a pain. If you enable read scaling, every end user of your BI tool or your SaaS application gets their own DuckDB instance, which we can scale up and scale down. We also have mechanisms to avoid overspending. This allows you to scale by running lots of different instances.
There's an interesting paper I read recently about what does it take to build agents. Everybody wants to do agents these days—it's the cool topic. But in order for those to be interesting in analytics, to be able to answer deeper types of questions like, "Which of my customers is most likely to churn?"—it doesn't matter how good you are at writing SQL, it's pretty unlikely you're going to be able to come up with a single awesome SQL statement that can answer that question. What you're going to want to do is fire off a bunch of SQL statements, get some answers, make some decisions, maybe pull in data from other locations. The tenancy model is great for that because each of these agents basically can get a clone of the data. They can even modify the data as they want and then they can get destroyed when they're not being used.
All right, on to the third quadrant: big data, small compute. These workloads are very common, and there are basically two big patterns. The first is independent data SaaS, where you're building a SaaS application in which each user has their own data. The second is time series: you're creating data over time, but usually you only look at the last hour, day, or week of it. For the first, hyper-tenancy, that tenancy model again, can help. For the second, DuckLake can work. I'm going to skip past some of these so we can make sure to get to the demo.
I do want to talk about DuckLake. DuckLake is an Iceberg alternative. It solves a lot of the problems inherent in Iceberg, particularly around transactions, writing data rapidly, low latency, and simplicity. It can be really difficult to set up Iceberg; it takes seconds to provision a DuckLake.
The last quadrant is the big data, big compute. You see this in model training, you see this if you're rebuilding data. We talked to people who are like, "Yeah, these smaller instances are fine, but I have this thing once a week or once a month where I rebuild my data to make sure that it doesn't drift from the data that I have under the covers." Then you need some giant larger instances. One of the things that we do in MotherDuck is we actually have two very large instances. One of them has 64 cores and a quarter terabyte of RAM. The other one has 192 cores and a terabyte and a half of RAM. If you were to ballpark compare that to Snowflake instances, that's between a 2XL and a 3XL in terms of the amount of hardware. There are very few workloads that won't fit in a Snowflake 3XL.
I go back to my old roots in BigQuery and Dremel. When the Dremel paper came out, people were amazed by how fast it was. They showed some examples of a 30,000 node Dremel cluster running an aggregation query over 85 billion records in 20 seconds. Can we do this on a single node? If you just do the math, the larger instance size that I showed has enough CPU and enough memory; it doesn't have enough network, but it does have some pretty big local disks. If your data is on the local disks, we can actually do this pretty fast. Let me switch over to my demo.
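The "just do the math" step can be sketched as simple arithmetic; the record count and timing come from the Dremel paper, and the core count is the large instance described above:

```python
# Dremel paper numbers: ~85 billion records aggregated in ~20 seconds,
# originally on a 30,000-node cluster.
records, seconds = 85e9, 20

# The large MotherDuck instance mentioned above: 192 cores.
cores = 192

per_core_per_sec = records / seconds / cores
print(f"{per_core_per_sec:,.0f} records per core per second")  # ≈ 22 million
```

On the order of 20 million records per core per second is plausible for a vectorized engine scanning local columnar data, which is why a single large node is in the right ballpark.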
Hopefully this shows. Cool. This is the MotherDuck UI. Running SQL queries is usually not all that exciting, so I will narrate what's going on here. We have 60 billion multi-word strings, and we're doing a word count on each one. If everything's working, it takes about 20 seconds to compute the word counts. This is a real benchmark that's used in computing term frequencies. It's a little bit smaller than the table in the paper, because I ran out of time when I was loading data this morning, but it should scale to the same size.
I will show one other thing, the CLI, because I mentioned latency is important. This is a TPC benchmark. We can see the tables here: it actually only took six milliseconds to return the catalog, because we cache a bunch of it locally. Running... ah, I knew there were going to be demo gremlins. Typically this has been about 38 milliseconds; I'll blame it on the conference Wi-Fi. But you can get very low latency, and this is client-measured time. One of the tricks database vendors often play is measuring time from when they receive the query; we measure from when the query actually gets sent.
This is what it looks like in DuckLake. In DuckLake it's a little slower, about 3 minutes. But creating the DuckLake takes just one line of SQL, and taking a bunch of Parquet files we had generated and adopting them into the lake is only one more line. We more or less replicated the Dremel results: about 5x slower on DuckLake, and probably 30% slower currently on the DuckDB side. If we'd wanted to run the same DuckLake in Spark, it would be about 30 lines of code; it's all boilerplate setup for partitioning, and this is the only thing you need to run Spark over DuckLake.
Small data, small compute—DuckDB is awesome. Small data, big compute—you can use read scaling. Small compute, big data—hyper-tenancy and DuckLake can really help. And then finally, if you really do have big data and big compute, you can either use these giant instances or you can use DuckLake with external access and run whatever engine you want over it—Spark, Trino, etc.
Thank you. And stop by booth B37 for more info on MotherDuck. Thank you.
Transcript
0:01All right, everybody. Uh, thanks thanks for coming out. Appreciate the uh the uh this the standing room. This is uh this is this is pretty awesome. Uh really glad to be here at uh Big Data London to talk about DUTDTB at scale and how and how DuctTB scales and how we at Motherduct have um you know probably the
0:21largest uh the largest DUTDB fleet uh in in the world. Although you know given the rate uh at which duct DB is growing you know chances are that's going to change uh that may change in the future.
0:35Uh so how many people are ductb users act actively ductb users? Okay so I guess there's how many people are ductb curious?
0:45How many how many people are just here to throw stuff at me? I see some clickout shirts in the back.
0:51I bet uh I bet uh no just just kidding. Um uh but first let me introduce the uh the flock. Um so who am I? Uh I'm I'm Jordan
1:02Tagani. Uh I've been a database person for uh for about the past 15 years. I worked on BigQuery for for 10 years. Um uh I had I I was trying to calculate I think I think we ended up querying a yatabyte zeta zetabyte. Um by the time by the time I left um worked single store for a couple years and about 3
1:24years ago uh I encountered duct DB uh and saw like wow this is an amazing this is amazing piece of technology somebody should really take this and build a cloud service out of it. At first I I talked to Hannis and Mark the uh the DuckDB founders to see if they were interested in hiring me to maybe uh you
1:43know help them build a cloud service and they said well we really would just want to focus on the technology and so but we'd have be happy to partner with you and so kind of after that uh after that moment the uh the duct DB the DuckDB partnership and motherduck partnership was formed.
1:59So what is duct DB in case people don't know? It's uh it's an embedded SQL database uh tuned for analytics and uh
2:09it has some really nice properties. It scales down really nicely. It's embedded so you can actually link it in to any anything you're actually building. Uh it has really nice connectors to most to most languages. uh you know here's an example in you know just four lines of Python you can uh you can write your first ductb query
2:33uh ductb has also been really accelerating in in popularity I think it has 30ome thousand GitHub stars at this at this point uh the number of downloads I um they just shared with me that um they're doing a pabyte of extension downloads a month uh so that's a lot of people running that's a lot of people running DOCTDB. So why do people like it
2:56so much? Um I like to think of it as the
3:00as the burgers. It's most most databases they they focus just on the patt patty of the burger. Certainly when I was at at uh at BigQuery we did this like we we said okay once you give us your data you send us your query then we'll uh we'll try to optimize that query run it really fast and then give you give you the
3:19results uh as fast as we can but like anything outside of that is your own problem uh and ductb for example they have like full-time engineer PhD um working on their CSV parser and they've been working on it for years and they're still working on it because they want to make it better and better and better. In
3:43BigQuery, we had a new grad working on our CSV parser. There's like they spent a couple months on it and it was like, "Okay, cool. That's done. Uh, now we're on to the next thing." But if you think about like what actually takes you time as a data analyst, data engineer, uh, data person, it's like how do I get this
4:00damn data that's over here into this other place that's over here? And DocTV makes this really nice. In fact, for my demo, I was going to show comparisons to BigQuery and I just got frustrated with how hard it was to load data from from S3 and uh and so I I gave up. Uh and that's the kind of thing that DuctTb
4:18just they focus on um doing really really well. So what's Mother Duck? So we're taking DuckDB and we're hosting it in the cloud. So we're building a SAS version of uh of DuckDB. Uh we run lots of duct DBs. One way of thinking about it is just it's just a data warehouse. It's a really low latency uh easy inexpensive
4:41uh uh great data warehouse. On the other hand, it's like it's you know being able to take the duct DB you know and love and just have somebody else run it for you. Um these are the same four lines of code that I had for uh for for duct tb.
4:55The only thing you need to do to start using motherduck is change the file name to start with md. And if you start with MD your file name with MD colon it loads the motherduck extension and it'll actually run in the cloud.
5:09So motherduck is serverless spins up spins up instances on your query. So within 100 milliseconds we'll uh we'll assign you a duct DB instance which we call duckling. Um we use warm storage so it uh it actually can be very fast for cold start uh which allows us to sort of spin spin up instances and spin down
5:30instances. We also have some very large instances. Um and and kind of the uh the
5:36tenency model uh allows you to run one duct DB instance per per user per end user because one of the tricky parts about duct DB is you know it's really focused for a single workload at a time.
5:49And so we so we say fine use a single workload at a time. Um if you but if you have lots of workloads each one is going to get the different duct DB instance and each one can scale independently.
6:01Um the other exciting thing is that um next month duct DB is uh is launching our our first European region in uh in Frankfurt uh at AWS. Uh because duct DB was created in Amsterdam uh there's you know a lot of excitement for you know around around DuckDB. So, we're we're really happy to uh to partner with uh
6:24with a bunch of folks uh uh in uh in Europe to uh to help spread uh spread the duckiness uh as much as possible.
6:35Um so, what have we learned from running giant ductb instances? Uh well, first, uh I want to address the elephant in the room. We're at big data London.
6:48I suppose everybody knows that. Um, I also wrote a blog post a couple of years ago called Big Data is dead where I talked about, hey, you know, big data is no longer a relevant term. Um, so what am I doing at uh at Big Data London?
7:04Uh, well, so it, you know, given that we're in in London, I figured I would start with a Shakespeare quote um from Julius Caesar. Uh, suitably uh suitably duckified. Uh, I come not to uh to praise big data, but to but to bury it.
7:18Um, and uh I'm not quite that uh that anti anti- big data. Um, you know, it turns out sometimes in order to get a, you know, catchy catchy title, you need something very uh um uh black black and
7:33white, but of course there's lots of shades of gray. Um, so we we looked at the data about what data people are using. So uh there's a couple of data sets that have been have been provided by some of the other data warehouse providers about how their how their technology is going to be used. So one of them is called snowset um snowflakes
7:54data set and another one is called reds set uh which is redshift data set and they wrote a paper on you know they basically were making a uh claim that the the shape of uh the data analytics benchmarks are not quite right for sort of what for modern workloads. Um, but it also provided a really nice window into
8:16the actual sizes and shapes of data that are that that people are that people are actually using in the real world. You know, it turns out that like most databases are smaller than 10 terabytes.
8:26So, we're going to call those not really big data. Of the time when you do have a big data database, you end up only querying a small portion of it. So you know the um only 1% of the time when you
8:41have big data do you actually query you know the uh query the tables that are that are the big data tables and then you know of that only half a percent of the time um when you have big data and you're you're actually reading big data tables you also read large amounts because often when people are reading these large
9:02these giant giant tables these giant databases you read uh you read smaller slices. Of course, you know, there are a bunch of there are a bunch of patterns that do lead people to uh that lead people to big data. And this is some of the things that we've that we've seen. Um so, one of the key uh insights I think is that when you
9:23think about big data, it's actually two different things. There's um big data, the size of
9:31the data, you know, the size of the data can't be stored on a single machine. you have giant amounts of data but if you have separation of storage and compute these days everybody just stores their data on S3 or GCS or whatever your object store uh duour is and uh and then it turns out to be not a particularly
9:47interesting problem I mean like storing a lot of data is you know you just throw it in S3 bucket and it's actually kind of virtually unlimited how much data that you can store and it's inexpensive the uh the what's actually the real interesting part of it is is big compute is the size of compute that you need in
10:07order to access that data. Um, you know, do you need actually distributed computation in order to in order to solve in order to solve your problem? Um, and I the one thing that we sort of like take for granted is just the size of machines these days. And I'll show some examples, you know, later on. But like when um when we started
10:31BigQuery, the like getting 64 gigs of
10:37RAM was virtually impossible. Like it was basically we had to we had to slice things very very very narrowly. Getting more than a couple cores was very very difficult in a particular process. Um now there's machines with hundreds of cores, gigabytes, terabytes of RAM. um and tens if not hundreds of gigabits of uh of network throughput. So it really
11:02changes the the the equation of what what big data means. Um so I like to sort of like draw
11:11things in a 2 by 2 chart with um the bottom left is small data small compute.
11:17The uh top left is uh small data big compute. the bottom right small compute big data and then finally is uh big data big big compute uh and I think it can be useful to to segregate those by by by use cases um and if you think about where you know what how much these things cost well
11:43obviously small data small compute doesn't cost very much um small data big compute compute is what actually drives the costs so if you have to do a lot of computation it's going to be more expensive. Um, if you have big data and small compute, often actually it doesn't cost that much because data storage costs prices are so tend tend to be so
12:05low. Uh, and then really the where you're going to spend a ton of money is in this sort of this top top right corner, which also leads people to avoid running queries in this uh in this part of the uh landscape. It's it's what leads them to do things like the you know medallion architecture where you take your data you land it in this giant
12:27size and then you kind of create smaller and smaller slices of it uh in order to reduce in order to reduce your query costs.
Another interesting thing is that for these different quadrants there are different things you optimize for. In the small data case, you really want latency: you very often have a user involved, and users care how long things take. The difference between 2 seconds and 200 milliseconds is a huge deal if you're showing your end users some sort of analytic or graph. Whereas at the large data sizes, what you really care about is: can we get stuff done? Can we just push this through as fast as possible? I remember one time my manager on BigQuery, when somebody was working on a feature, said, "If you add a second to every query, I don't care." Adding a second to every query was not considered a problem, because BigQuery was so far in this top right corner.
So the different design goals for throughput versus latency really drive the kind of architecture you need to build. And if you think about where 99% of all the usage is, the design goal is latency, yet we tend to be using systems that are designed for throughput. Here's where some of the workloads map. I'm going to go into a bunch of these in more detail, but you can see that there are real workloads in each of these quadrants.
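The quadrant framing can be written down as a toy classifier. The thresholds below are invented for illustration, since the talk doesn't define exact cutoffs:

```python
# Toy classifier for the data-size / compute 2x2 quadrant framing.
# Thresholds are invented for illustration; the talk gives no exact cutoffs.

def quadrant(data_gb: float, cpu_hours: float) -> str:
    big_data = data_gb >= 10_000        # assume ~10 TB marks "big data"
    big_compute = cpu_hours >= 100      # assume ~100 CPU-hours marks "big compute"
    if big_data and big_compute:
        return "big data, big compute"      # the expensive top-right corner
    if big_data:
        return "big data, small compute"    # often cheap: storage prices are low
    if big_compute:
        return "small data, big compute"    # compute drives the cost
    return "small data, small compute"      # cheap; optimize for latency

# Example workloads from the talk, with made-up sizes:
print(quadrant(5, 0.1))          # laptop data science
print(quadrant(50_000, 2_000))   # weekly full rebuild of a large warehouse
```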
So, digging into big data versus big compute. The first quadrant is small data, small compute. I think probably everybody here agrees that DuckDB works great on this: you're doing ad hoc analytics, you're doing something on your laptop, you're doing some data science exploration. You have the platinum or gold tier if you're using the medallion architecture, or you're authoring some SQL. What's important here is the pace of iteration, and if you want fast iteration, it's nice to have small data and small use cases.
So here's a representation of some benchmark results from ClickBench, which was created by the ClickHouse folks. It's too small to read, but what's important isn't actually the vendors on here. Less expensive is on the top, faster is on the right, so you want to be in the top right corner. We have a bunch of small distributed databases that are pretty inexpensive, but of course they're really slow, often an order of magnitude less performant. Then you have the large-scale distributed databases, which do get a lot faster and certainly move toward the right, but of course those are very expensive. And then here are the DuckDB versions of these, which are basically in the top right corner: they're not expensive, and they're super fast, super low latency.
There's a typical problem when you have a shared data warehouse, which is how we typically see people building data warehouses. You might have a single instance for your team or your department, with a lot of users hammering that same warehouse, and you end up having to provision for peak. One of the things about analytics is that the size you need can change rapidly, by orders of magnitude, so even if you're doing autoscaling, it's very hard to keep up.
The way MotherDuck works is that every user gets their own DuckDB instance. So if you have hundreds of users, each of those hundreds of users gets a DuckDB instance, and we can scale all of those independently, because some of them are going to need a lot of data, some of them need a lot of computational power, some of them are going to be pretty tiny, and some of them are going to be very ephemeral, and we shut them down when they're not being used.
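This is not MotherDuck's actual scheduler, but the per-user model described here can be sketched minimally: one instance per user, spun up on demand, reaped when idle. The class, names, and timeout are all hypothetical:

```python
import time

IDLE_LIMIT = 60.0  # seconds; hypothetical idle timeout

class InstancePool:
    """One (mock) per-user DuckDB instance, shut down when idle."""
    def __init__(self):
        self.instances = {}        # user_id -> last-used timestamp

    def query(self, user_id, sql, now=None):
        now = time.monotonic() if now is None else now
        self.instances[user_id] = now   # spin up on demand, mark as used
        return f"ran {sql!r} on instance for {user_id}"

    def reap_idle(self, now=None):
        now = time.monotonic() if now is None else now
        dead = [u for u, t in self.instances.items() if now - t > IDLE_LIMIT]
        for u in dead:
            del self.instances[u]       # shut down idle instances
        return dead

pool = InstancePool()
pool.query("alice", "SELECT 1", now=0.0)
pool.query("bob", "SELECT 2", now=100.0)
print(pool.reap_idle(now=120.0))  # alice has been idle longer than the limit
```

The point of the sketch is the isolation property: because each user has their own instance, scaling decisions (and shutdowns) are per-user rather than for one shared warehouse.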
So that was small data, small compute. Now what about small data, big compute? I originally thought this quadrant didn't exist, but actually we do see a lot of cases where people are hitting harder than you could really handle in a single DuckDB instance. Let's say you're doing cloud-based BI, so you might have Tableau or Omni or Looker, with a lot of users hitting your data warehouse at the same time. Maybe they're only looking at the platinum tier, only a small amount of data, but that can knock over the data warehouse you're using unless you scale it up very large. And if you scale it up very large, then you have the same overprovisioning problems.
So cloud business intelligence is a key workload for this. There's also a model of building SaaS applications where the application wants to show data to end users, and everybody is slicing and dicing the same data: that's also big compute, small data. And finally, AI agents, which I'll talk about in a second. There's something we do in MotherDuck to help make this work. I mentioned that every user can get their own DuckDB instance, but it can be a pain to provision all of those. So what we do is, if you enable read scaling, every end user of your BI tool or your SaaS application gets their own DuckDB instance, which we can scale up and scale down. We also have mechanisms to avoid overspending. This allows you to scale by running lots of different instances.
There's also an interesting paper I read recently about what it takes to build agents. Everybody wants to do agents these days; it's the cool topic. But in order for agents to be interesting in analytics, to be able to answer deeper types of questions, say, which of my customers is most likely to churn, it doesn't matter how good you are at writing SQL: it's pretty unlikely you're going to come up with a single awesome SQL statement that can answer that question. So what you're going to want to do is fire off a bunch of SQL statements, get some answers, make some decisions, and maybe pull in data from other locations. This tenancy model is great for that, because each of these agents can get a clone of the data. They can even modify the data as they want, and then the clones get destroyed when they're not being used.
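That agent pattern can be sketched with stdlib pieces. Here sqlite3 stands in for an ephemeral per-agent DuckDB clone, and the schema and "churn" heuristic are invented purely for illustration:

```python
# Sketch of the agent pattern: fire off several SQL statements against an
# ephemeral clone, combine the answers, then throw the clone away.
# sqlite3 stands in for a per-agent DuckDB instance; the schema and the
# churn heuristic are invented for illustration.
import sqlite3

def churn_risk_report():
    con = sqlite3.connect(":memory:")   # ephemeral clone of the data
    con.executescript("""
        CREATE TABLE orders (customer TEXT, day INTEGER);
        INSERT INTO orders VALUES ('a', 1), ('a', 30), ('b', 1);
    """)
    # Agent step 1: who are the customers?
    customers = [r[0] for r in con.execute(
        "SELECT DISTINCT customer FROM orders")]
    # Agent step 2: a follow-up query per customer, then a decision
    at_risk = []
    for c in customers:
        (last_day,) = con.execute(
            "SELECT MAX(day) FROM orders WHERE customer = ?", (c,)).fetchone()
        if last_day < 15:               # invented "gone quiet" threshold
            at_risk.append(c)
    con.close()                         # clone destroyed when the agent is done
    return at_risk

print(churn_risk_report())  # ['b']
```

No single SQL statement answers "who will churn?"; the agent issues several queries and applies judgment between them, which is why cheap, disposable per-agent instances fit this workload.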
All right, on to the third quadrant: big data, small compute. These are very common, and there are basically two really big patterns. The first pattern is independent-data SaaS, where you're building a SaaS application in which each user has their own data. The second one is time series: you're creating data over time, but usually you only look at the last hour, day, or week of that data.
For the first one, hyper-tenancy, this tenancy model again, can help. For the second one, DuckLake can work. I'm going to skip past some of these so we can make sure to get to the demo, but I do want to talk about DuckLake. DuckLake is an Iceberg alternative, and it solves a lot of the problems that are inherent in Iceberg, particularly around transactions, around being able to write data rapidly, around low latency, and around simplicity. It can be really difficult to set up Iceberg; it takes seconds to provision DuckLake.
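The design move behind DuckLake, as described later in this page, is keeping table metadata in a relational database rather than in files on object storage. Here's a toy sketch of that idea, with sqlite3 standing in for the catalog database; this is an invented schema, not DuckLake's actual schema or API:

```python
# Toy sketch of the DuckLake idea: the catalog (which Parquet files make up
# each table at each snapshot) lives in a relational database, so adding
# files is a cheap transactional INSERT rather than rewriting metadata files
# on object storage. sqlite3 stands in for the catalog DB; schema is invented.
import sqlite3

catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE data_files (
        table_name TEXT, snapshot INTEGER, path TEXT)""")

def add_files(table, snapshot, paths):
    # Registering new Parquet files is one transactional insert.
    with catalog:
        catalog.executemany(
            "INSERT INTO data_files VALUES (?, ?, ?)",
            [(table, snapshot, p) for p in paths])

def files_for(table, snapshot):
    # A reader at a given snapshot sees all files registered up to it.
    return [r[0] for r in catalog.execute(
        "SELECT path FROM data_files WHERE table_name=? AND snapshot<=?",
        (table, snapshot))]

add_files("events", 1, ["s3://bucket/part-0.parquet"])
add_files("events", 2, ["s3://bucket/part-1.parquet"])
print(files_for("events", 2))  # both files visible at snapshot 2
```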
The last quadrant is big data, big compute. You see this in model training, and you see this if you're rebuilding data. We've talked to people who say, "These smaller instances are fine, but I have this thing once a week or once a month where I rebuild my data to make sure it doesn't drift from the data I have under the covers," and for that you need some giant instances. So one of the things we do in MotherDuck is offer two very large instances. One has 64 cores and a quarter terabyte of RAM; the other has 192 cores and a terabyte and a half of RAM. If you ballpark-compare that to Snowflake instances, that's between a 2XL and a 3XL in terms of hardware, and there are very few workloads that won't fit in a
Snowflake 3XL. So let me go back to my old roots in BigQuery and Dremel. When the Dremel paper came out, people were amazed by how fast it was, and they showed some examples of a 30,000-node Dremel cluster running an aggregation query over 85 billion records in 20 seconds. So can we do this on a single node? If you just do the math, that larger instance size I showed has enough CPU and enough memory. It doesn't have enough network, but it has some pretty big local disks, and if your data is on the local disks, then we can actually do this pretty fast. So let me switch over to my demo.
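As a quick aside, the back-of-envelope math referred to here can be checked in a few lines. This is a rough derivation from the figures in the talk, not a measured benchmark:

```python
# Rough feasibility math for the Dremel comparison: 85 billion records
# aggregated in 20 seconds, mapped onto the 192-core instance mentioned
# in the talk.
records = 85e9
seconds = 20
cores = 192

rows_per_sec = records / seconds          # 4.25e9 rows/s overall
per_core = rows_per_sec / cores           # roughly 22 million rows/s per core
print(f"{rows_per_sec:.2e} rows/s total, {per_core:.2e} rows/s per core")
```

Tens of millions of rows per second per core is a plausible scan rate for a vectorized columnar engine over local disk, which is the intuition behind "a single node can do this."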
Hopefully this shows. Cool. So this is the MotherDuck UI. Running SQL queries is usually not all that exciting, so I'll narrate what's going on here. We have 60 billion multi-word strings, and we're doing a word count on each one of them. If everything's working, it takes about 20 seconds to compute the word counts. This is actually a real benchmark that's used in computing term frequencies. It's a little bit smaller than the table in the paper; the reason it's smaller is that I ran out of time when I was loading data this morning, but it should scale to the same size. I'll show one other thing, which is the CLI, because I mentioned latency is important. This is a TPC benchmark, and we can see the tables here. It actually took only six milliseconds to return the catalog of tables, because we cache a bunch of it locally.
And so running... ah, I knew there were going to be demo gremlins. Typically this has been about 38 milliseconds; I'll blame it on the conference Wi-Fi. But you can get very low latency, and this is client-measured time. One of the tricks database vendors often play is measuring time from when they receive the query, but we measure time from when the query actually gets sent.
This is what this would look like in DuckLake. In DuckLake it's a little bit slower, about 3 minutes. But creating the DuckLake takes just one line of SQL to create that table, and taking a bunch of Parquet files we had generated and adopting those is only one more line of code. So we basically more or less replicated the Dremel results: we were about 5x slower on DuckLake, and probably 30% slower currently on the DuckDB side. If we'd wanted to run the same DuckLake query in Spark, it would be about 30 lines of code. This is all boilerplate setup, this is the setup for partitioning, and this is the only thing you need to run that Spark query over DuckLake. So: for small data, small compute, DuckDB is awesome. For small data, big compute, you can use read scaling. For small compute, big data, hyper-tenancy and DuckLake can really help. And finally, if you really do have big data and big compute, you can either use these giant instances or use DuckLake with external access and run whatever engine you want over it: Spark, Trino, etc.
Thank you. And stop by booth B37 for more info on MotherDuck. Thank you.
FAQs
How does MotherDuck scale DuckDB for multiple users?
MotherDuck assigns each user their own DuckDB instance, called a "duckling," which can spin up in under 100 milliseconds. This per-user isolation means every workload scales independently. Some users may need heavy compute while others stay lightweight. When a duckling is no longer in use, it shuts down automatically, so you never over-provision. Learn more about getting started with MotherDuck.
Is most data actually 'big data' that requires distributed systems?
According to analysis of real-world datasets from Snowflake and Redshift, most databases are smaller than 10 terabytes. Even when databases are large, users query the big tables only about 1% of the time, and of those queries, only about half a percent actually scan large amounts of data. Modern machines with hundreds of cores and terabytes of RAM have shifted the equation, so most workloads fit comfortably on a single node.
What is DuckLake and how does it compare to Apache Iceberg?
DuckLake is an open table format alternative to Iceberg that stores metadata in a relational database rather than in files on object storage. This solves common Iceberg pain points around transactions, rapid data writing, and low-latency queries. DuckLake can be set up in seconds with just one line of SQL, compared to the much more complex Iceberg configuration. For a deeper dive, see the DuckLake open lakehouse guide.
Can DuckDB handle workloads that traditionally required large distributed systems like Dremel?
Yes. MotherDuck offers instances with up to 192 cores and 1.5 terabytes of RAM, comparable to a Snowflake 2XL-3XL. In a live demo, a single-node DuckDB instance ran an aggregation query over 60 billion strings in about 20 seconds, replicating results from Google's original Dremel paper, which used a 30,000-node cluster. For workloads exceeding a single node, DuckLake allows external engines like Spark or Trino to access the same data.
