
DuckDB At Scale

2025/09/24

Video of conference talk at Big Data London

TRANSCRIPT

All right, everybody. Thanks for coming out. Appreciate the standing room. This is pretty awesome. Really glad to be here at Big Data London to talk about DuckDB at scale and how DuckDB scales and how we at MotherDuck have probably the largest DuckDB fleet in the world. Although given the rate at which DuckDB is growing, chances are that may change in the future.

How many people are DuckDB users, actively DuckDB users? Okay. How many people are DuckDB curious? How many people are just here to throw stuff at me? I see some ClickHouse shirts in the back. I bet—no, just kidding.

But first let me introduce the flock. Who am I? I'm Jordan Tigani. I've been a database person for about the past 15 years. I worked on BigQuery for 10 years; I was trying to calculate it, and I think we ended up querying a yottabyte or a zettabyte by the time I left. I worked at SingleStore for a couple of years, and about 3 years ago I encountered DuckDB and saw, wow, this is an amazing piece of technology; somebody should really take this and build a cloud service out of it. At first I talked to Hannes and Mark, the DuckDB founders, to see if they were interested in hiring me to maybe help them build a cloud service, and they said, "Well, we really just want to focus on the technology, but we'd be happy to partner with you." And from that moment, the DuckDB and MotherDuck partnership was formed.

What is DuckDB, in case people don't know? It's an embedded SQL database tuned for analytics, and it has some really nice properties. It scales down really nicely. It's embedded, so you can link it into anything you're building. It has really nice connectors for most languages. Here's an example: in just four lines of Python you can write your first DuckDB query.
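
The slide itself isn't in the transcript, but a minimal sketch of those few lines looks something like this (the file name and query are illustrative):

```python
import duckdb

con = duckdb.connect("my_data.db")     # creates or opens a local database file
con.sql("SELECT 42 AS answer").show()  # your first DuckDB query
```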

DuckDB has also been really accelerating in popularity. I think it has 30-some thousand GitHub stars at this point. The number of downloads—they just shared with me that they're doing a petabyte of extension downloads a month. That's a lot of people running DuckDB.

Why do people like it so much? I like to think of it as the full burger. Most databases focus just on the patty of the burger. Certainly when I was at BigQuery we did this. We said, "Okay, once you give us your data, you send us your query, then we'll try to optimize that query, run it really fast and give you the results as fast as we can." But anything outside of that is your own problem.

DuckDB, for example, has a full-time engineer with a PhD working on their CSV parser and they've been working on it for years and they're still working on it because they want to make it better and better and better. In BigQuery, we had a new grad working on our CSV parser. They spent a couple months on it and it was like, "Okay, cool. That's done. Now we're on to the next thing." But if you think about what actually takes you time as a data analyst, data engineer, data person, it's like, "How do I get this damn data that's over here into this other place that's over here?" And DuckDB makes this really nice. In fact, for my demo, I was going to show comparisons to BigQuery and I just got frustrated with how hard it was to load data from S3 and I gave up. That's the kind of thing that DuckDB focuses on doing really, really well.

What's MotherDuck? We're taking DuckDB and hosting it in the cloud; we're building a SaaS version of DuckDB. We run lots of DuckDBs. One way of thinking about it is that it's just a data warehouse: a really low latency, easy, inexpensive, great data warehouse. Another way is that it's the DuckDB you know and love, with somebody else running it for you. These are the same four lines of code that I had for DuckDB. The only thing you need to do to start using MotherDuck is change the file name to start with "md:". If your file name starts with "md:", it loads the MotherDuck extension and your query will actually run in the cloud.
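
A minimal sketch of that change, assuming a MotherDuck token is already configured in your environment (the database name is illustrative):

```python
import duckdb

con = duckdb.connect("md:my_data")     # "md:" loads the MotherDuck extension and runs in the cloud
con.sql("SELECT 42 AS answer").show()  # same query, now executed on a duckling
```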

MotherDuck is serverless and spins up instances in response to your queries. Within 100 milliseconds we'll assign you a DuckDB instance, which we call a duckling. We use warm storage, so cold starts can actually be very fast, which allows us to spin instances up and down. We also have some very large instances. The tenancy model allows you to run one DuckDB instance per end user, because one of the tricky parts about DuckDB is that it's really designed for a single workload at a time. We say fine, run a single workload at a time; but if you have lots of workloads, each one gets a different DuckDB instance and each one can scale independently.

The other exciting thing is that next month MotherDuck is launching our first European region, in Frankfurt on AWS. Because DuckDB was created in Amsterdam, there's a lot of excitement around DuckDB in Europe, and we're really happy to be partnering with a bunch of folks there to help spread the duckiness as much as possible.

What have we learned from running giant DuckDB instances? Well, first, I want to address the elephant in the room. We're at Big Data London. I also wrote a blog post a couple of years ago called "Big Data is Dead" where I talked about how big data is no longer a relevant term. So what am I doing at Big Data London?

Well, given that we're in London, I figured I would start with a Shakespeare quote from Julius Caesar, suitably duckified: "I come not to praise big data, but to bury it." I'm not quite that anti-big data. It turns out sometimes in order to get a catchy title, you need something very black and white, but of course there's lots of shades of gray.

We looked at the data about what data people are actually using. There are a couple of data sets that have been provided by other data warehouse vendors about how their technology is used. One of them is called SnowSet, Snowflake's data set, and another is called RedSet, Redshift's data set. They wrote a paper basically making the claim that the shapes of data analytics benchmarks are not quite right for modern workloads, but it also provided a really nice window into the actual sizes and shapes of data that people use in the real world. It turns out that most databases are smaller than 10 terabytes; we're going to call those not really big data. And when you do have a big data database, you end up querying only a small portion of it. Only 1% of the time when you have big data do you actually query the big data tables, and only half a percent of the time do you also read large amounts of them, because when people query these giant tables they usually read smaller slices.

Of course, there are a bunch of patterns that do lead people to big data. One of the key insights, I think, is that when you talk about big data, it's actually two different things. There's big data, the size of the data: the data can't be stored on a single machine, you have giant amounts of data. But with separation of storage and compute these days, everybody just stores their data on S3 or GCS or whatever your object store du jour is, and it turns out to be not a particularly interesting problem. Storing a lot of data is easy: you just throw it in an S3 bucket, it's virtually unlimited how much you can store, and it's inexpensive.

What's actually the real interesting part of it is big compute—the size of compute that you need in order to access that data. Do you need actually distributed computation in order to solve your problem? The one thing that we sort of take for granted is just the size of machines these days. When we started BigQuery, getting 64 gigs of RAM was virtually impossible. We had to slice things very, very narrowly. Getting more than a couple cores was very difficult in a particular process. Now there's machines with hundreds of cores, terabytes of RAM, and tens if not hundreds of gigabits of network throughput. It really changes the equation of what big data means.

I like to draw things in a 2x2 chart, with the bottom left being small data, small compute. The top left is small data, big compute. The bottom right is big data, small compute. And then finally, the top right is big data, big compute. I think it can be useful to segment those by use cases.

If you think about how much these things cost, obviously small data, small compute doesn't cost very much. For small data, big compute, compute is what actually drives the costs, so if you have to do a lot of computation it's going to be more expensive. If you have big data and small compute, it often doesn't cost that much, because data storage costs tend to be so low. Where you're really going to spend a ton of money is in the top right corner, which also leads people to avoid running queries in that part of the landscape. It's what leads them to do things like the medallion architecture, where you take your data, land it at its giant size, and then create smaller and smaller slices of it in order to reduce your query costs.

Another interesting thing is that in these different quadrants you optimize for different things. In the small data case, you really want latency. You very often have a user involved, and users care how long things take. The difference between 2 seconds and 200 milliseconds is a huge deal if you're showing your end users some sort of analytic or graph. Whereas at the large data sizes, what you really care about is, "Can we get stuff done? Can we just push this through as fast as possible?" I remember one time my manager on BigQuery said that if you added a second to every query, he didn't care; adding a second to every query was considered not a big deal, not a problem. That was because BigQuery was so far in the top right corner.

The different design goals, throughput versus latency, really drive the kind of architecture you need to build. And if you think about where 99% of all the usage is, the design goal should be latency, yet we tend to be using systems that are designed for throughput.

Here's where some of the workloads map. I'm going to go into a bunch of these in more detail but you can kind of see how there are real things in each part of these quadrants.

Digging into big data versus big compute, the first quadrant is small data, small compute. I think probably everybody here agrees that DuckDB works great for this: you're doing ad hoc analytics, you're doing something on your laptop, you're doing some data science exploration, you're working with the platinum or gold tier if you're using the medallion architecture, or you're authoring some SQL. What's important is the pace of iteration, and if you want fast iteration, it's nice to have small data and small use cases.

Here's a representation of some results from the ClickBench benchmark that was created by the ClickHouse folks. It's too small to read, but what's important isn't actually the vendors on here: less expensive is toward the top, faster is toward the right, so you want to be in the top right corner. There are a bunch of small distributed databases that are pretty inexpensive, but of course they're really slow, often an order of magnitude less performant. Then you have the large-scale distributed databases, which do get a lot faster and move toward the right, but those are very expensive. And then here are the DuckDB versions, which land basically in the top right corner: not expensive, and super fast, super low latency.

There's a typical problem when you have a shared data warehouse, which is how we typically see people building data warehouses. You might have a single instance for your team or your department, you have a lot of users hammering that same warehouse, and you end up having to provision for peak. One of the things about analytics is that the capacity you need can change rapidly, by orders of magnitude, and even if you're doing autoscaling, it's very hard to keep up.

The way MotherDuck works is that every user gets their own DuckDB instance. If you have hundreds of users, each of your hundreds of users gets a DuckDB instance and then we can scale all those independently because some of those are going to need a lot of data, some of them need a lot of computational power, some of them are going to be pretty tiny, some of them are going to be very ephemeral and we shut them down when they're not being used.

That was small data, small compute. Now what about small data, big compute? I was originally thinking this quadrant didn't really exist, but actually we do see a lot of cases where people are hitting harder than a single DuckDB instance could handle. Let's say you're doing cloud-based BI; you might have Tableau or Omni or Looker. There might be a lot of users hitting at the same time, and if those are all hitting your data warehouse, maybe they're only looking at the platinum tier, maybe they're only looking at a small amount of data, but that can knock over the data warehouse you're using unless you scale it up very large. And if you scale it up very large, you have the same overprovisioning problems.

Cloud business intelligence is a key workload for this. There's also a model of building SaaS applications where you build an application and that application wants to show data to end users. Sometimes everybody is looking at the same thing, slicing and dicing the same data; that's also big compute, small data. And then finally, AI agents, which I'll talk about in a second. There's something in MotherDuck that we do to help make this work. I mentioned that every user can get their own DuckDB instance, but provisioning all of those can be a pain. What we do is, if you enable read scaling, every end user of your BI tool or your SaaS application gets their own DuckDB instance, which we can scale up and scale down (a sketch of that fan-out follows below). We also have mechanisms to avoid overspending. This basically allows you to scale by running lots of different instances.
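
Here's a sketch of that fan-out. Treat the connection-string details (the read scaling token and session_hint parameter) as assumptions based on my reading of MotherDuck's read scaling docs, and the database and table names as made up:

```python
import duckdb

READ_SCALING_TOKEN = "..."  # a read scaling token, not a full read/write token

def connection_for_end_user(user_id: str) -> duckdb.DuckDBPyConnection:
    # Each end user of the BI tool or SaaS app gets their own duckling;
    # the session hint keeps a given user pinned to the same replica.
    return duckdb.connect(
        f"md:my_db?motherduck_token={READ_SCALING_TOKEN}&session_hint={user_id}"
    )

# Hundreds of dashboard users means hundreds of independent instances.
con = connection_for_end_user("user-42")
con.sql("SELECT region, sum(revenue) FROM sales GROUP BY region").show()
```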

There's an interesting paper I read recently about what it takes to build agents. Everybody wants to do agents these days; it's the cool topic. But for agents to be interesting in analytics, to be able to answer deeper questions like, "Which of my customers is most likely to churn?", it doesn't matter how good you are at writing SQL; it's pretty unlikely you're going to come up with a single awesome SQL statement that answers that question. What you're going to want to do is fire off a bunch of SQL statements, get some answers, make some decisions, maybe pull in data from other locations. The tenancy model is great for that, because each of these agents can basically get a clone of the data. They can even modify the data as they want, and then they get destroyed when they're not being used.
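
A toy sketch of that pattern; the database, tables, and queries here are all made up for illustration:

```python
import duckdb

EXPLORATORY_QUERIES = [
    "SELECT count(*) FROM customers WHERE last_login < now() - INTERVAL 90 DAY",
    "SELECT plan, count(*) FROM customers GROUP BY plan",
    "SELECT avg(open_tickets) FROM support_summary",
]

def churn_agent() -> list:
    # Each agent gets its own instance (and can work on its own clone of
    # the data), fires off several queries, and reasons over the results;
    # the instance goes away when the agent is done.
    con = duckdb.connect("md:my_db")
    results = [con.sql(q).fetchall() for q in EXPLORATORY_QUERIES]
    con.close()
    return results
```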

All right, on to the third quadrant: big data, small compute. These are very common, and there are basically two really big patterns. The first pattern is independent-data SaaS, where you're building a SaaS application in which each user has their own data. The second one is time series: you're creating data over time, but you usually only look at the last hour, day, or week of that data. For the first one, hyper-tenancy, this tenancy model again, can help. For the second one, DuckLake can work, and I'm going to skip past some of these so we can make sure to get to the demo.

I do want to talk about DuckLake. DuckLake is an Iceberg alternative. It solves a lot of the problems that are inherent in Iceberg, particularly around transactions, being able to write data rapidly, low latency, and simplicity. It can be really difficult to set up Iceberg; it takes just seconds to provision DuckLake.

The last quadrant is big data, big compute. You see this in model training, and you see it if you're rebuilding data. We talk to people who say, "Yeah, these smaller instances are fine, but once a week or once a month I rebuild my data to make sure it doesn't drift from the data I have under the covers." Then you need some giant instances. One of the things we do in MotherDuck is offer two very large instance sizes. One has 64 cores and a quarter terabyte of RAM; the other has 192 cores and a terabyte and a half of RAM. If you were to ballpark compare that to Snowflake instances, that's between a 2XL and a 3XL in terms of hardware, and there are very few workloads that won't fit in a Snowflake 3XL.

I go back to my old roots in BigQuery and Dremel. When the Dremel paper came out, people were amazed by how fast it was. It showed examples like a 30,000-node Dremel cluster running an aggregation query over 85 billion records in 20 seconds. Can we do this on a single node? If you just do the math, the larger instance size that I showed has enough CPU and enough memory. It doesn't have enough network, but it does have some pretty big local disks, and if your data is on the local disks, then we can actually do this pretty fast. Let me switch over to my demo.
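
Here's the "just do the math" part as a back-of-the-envelope; the bytes-per-record figure is an assumption:

```python
# Dremel paper workload: an aggregation over 85 billion records in 20 seconds.
records, seconds, cores = 85e9, 20, 192     # 192 cores: the largest instance size

rows_per_core = records / seconds / cores
print(f"~{rows_per_core / 1e6:.0f}M rows/s per core")         # ~22M rows/s per core

bytes_per_record = 100                                        # assumed record width
scan_bandwidth = records * bytes_per_record / seconds
print(f"~{scan_bandwidth / 1e9:.0f} GB/s of scan bandwidth")  # ~425 GB/s
# ~425 GB/s is far beyond any network link, which is why the data needs to
# already be sitting on big local disks.
```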

Hopefully this shows. Cool. This is the MotherDuck UI. Running SQL queries is usually not all that exciting, so I will narrate what's going on here. We have 60 billion multi-word strings, and we're doing a word count on each one of them. If everything's working, it takes about 20 seconds to compute the word counts. This is actually a real benchmark that's used for computing term frequencies. It's a little bit smaller than the table in the paper; the reason is that I ran out of time when I was loading data this morning, but it should scale to the same size.
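
The query itself isn't shown in the transcript, but a query of roughly that shape might look like this (the database, table, and column names are assumptions):

```python
import duckdb

con = duckdb.connect("md:word_bench")  # illustrative database name

# Words per string, summed across the table:
con.sql("SELECT sum(len(string_split(text, ' '))) AS total_words FROM strings").show()

# Or the term-frequency flavor of the benchmark:
con.sql("""
    SELECT word, count(*) AS freq
    FROM (SELECT unnest(string_split(text, ' ')) AS word FROM strings)
    GROUP BY word
    ORDER BY freq DESC
    LIMIT 10
""").show()
```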

I will show one other thing, which is the CLI, because I mentioned that latency is important. This is a TPC benchmark. We can see the tables here: it actually took only six milliseconds to return the catalog of tables, because we cache a bunch of it locally. Running... ah, I knew there would be demo gremlins. Typically this has been about 38 milliseconds; I'll blame it on the conference Wi-Fi. But you can get very low latency, and this is client-measured time. One of the tricks database vendors often play is measuring time from when they receive the query; we measure from when the query actually gets sent.

This is what this would look like in DuckLake. In DuckLake it's a bit slower, about 3 minutes. But creating the DuckLake takes just one line of SQL, creating the table another, and adopting the bunch of Parquet files we had already generated is only one more line of code. So we basically more or less replicated the Dremel results: probably 30% slower currently on the DuckDB side, and about 5x slower on DuckLake. If we'd wanted to run this in Spark, the same DuckLake in Spark, it would be about 30 lines of code. Almost all of that is boilerplate setup for partitioning, and this is the only thing you need to run Spark over DuckLake.
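
For reference, here's a sketch of those DuckLake lines, assuming the DuckLake extension's ATTACH syntax and its ducklake_add_data_files call; the paths and names are illustrative, so check the DuckLake docs for the exact signatures:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# One line to create the DuckLake:
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/')")

# Create the table, then one more line to adopt an existing Parquet file in place:
con.sql("CREATE TABLE lake.words (text VARCHAR)")
con.sql("CALL ducklake_add_data_files('lake', 'words', 's3://my-bucket/gen/words-0.parquet')")
```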

Small data, small compute—DuckDB is awesome. Small data, big compute—you can use read scaling. Small compute, big data—hyper-tenancy and DuckLake can really help. And then finally, if you really do have big data and big compute, you can either use these giant instances or you can use DuckLake with external access and run whatever engine you want over it—Spark, Trino, etc.

Thank you. And stop by booth B37 for more info on MotherDuck. Thank you.
