
The Unbearable Bigness of Small Data

2025/11/05

TRANSCRIPT

Small Data Conference Keynote: Designing for the Bottom Left

Thanks so much for coming out to our second Small Data Conference. I'm super excited to have so many people here, so many familiar faces, so many people from around the data community, so many practitioners. I think we had a nice workshop day yesterday where people actually felt like they got some real stuff done.

A Personal Story About "Small Data"

To start out with, I'd like to go into a little bit of a personal story. About five years ago when I was working at SingleStore, we were thinking about open-sourcing a single node version of SingleStore. When I pitched this to some internal folks, the CTO said to me, "If we have a single node version, a small data version of the database, people are gonna laugh at us."

I thought that seemed sort of silly, because he didn't say it was a bad idea. He didn't say it wasn't going to work, or that it wasn't going to work with real workloads—we had actually seen some of our biggest customers, like Sony, running on these giant scale-up machines, and it was working great. The objection was just, "Oh yeah, the database people are going to laugh at you."

As an aspiring entrepreneur, if there's an area where somebody might laugh at you for building something or thinking something, maybe it's not a bad idea. There are a lot of examples of people who built amazing things that others laughed at. If somebody's going to laugh at you, the best way to deal with it is to own it: "No, no, no, no. This is my joke, and I'm going to let you in on it. Then we can all laugh together."

You may wonder why we sometimes dress up as silly things at MotherDuck. It's not just a marketing ploy, although the attention is nice—first they ignore you, and as a startup, people are going to ignore you unless you come up with a way to make them not to. But it's also so that we can invite people in to laugh together: "Hey, isn't this all ridiculous?"

The Real Data Engineer

A couple years later, I was thinking about starting a company with some other folks like Ryan who's here. I was talking to some people about what we were thinking about building, and one of them said, "I go to these big data conferences and I get all fired up and I'm like, yeah, the Netflix architecture is so cool. I go home and it's like, well, what am I gonna do—run a one-node Spark cluster?"

He felt like he wasn't a real data engineer because he wasn't operating at this huge scale. I thought that was sort of unfortunate and unfair because the scale at which you're operating has nothing to do with how important what you're doing is, how hard what you're doing is, or how impactful what you're doing is.

Sheila kind of touched on it earlier—we wanted this to be sort of a tongue-in-cheek conference because we're in on the joke. It's small data, and we believe that the size of the data you're operating on, the scale you're operating at, has nothing to do with how important it is. And we're going to have some fun.

The Small Data Mantra

I want everybody to repeat after me. The small data mantra is: I've got small data.

Ready? I've got small data.

One more. One more time. Everybody together. I've got small data.

Awesome. Thank you so much. That was my big job for the day—I got everybody to say I've got small data.

Rethinking Data Scale

I also want to talk a little bit about what we're seeing in the shapes of data. How data scales is a little richer than the way we tend to talk about it.

Once upon a time there were boxes. You bought a box. This box was your database. If you ran out of space on that box, you had to buy a bigger box, and the bigger box was probably a lot more expensive.

Then came the cloud. We separated storage and compute—we got to break up those boxes. And we realized there are actually two different axes here. There's compute: in general, adding compute now scales roughly linearly in cost, not exponentially, which is a big difference. And there's storage: you just put it on object store and it's kind of boring. You put it on S3. S3 is virtually infinite, with virtually infinite bandwidth. You don't really have to worry about it anymore.

When you have separation of storage and compute, big data becomes two different things. What you used to call big data—because you had this big box with a bunch of storage and compute—splits into two separate axes. First, there's literally the size of the data you have. If it's data that can't fit on a single machine—won't fit on your laptop, won't fit on your workstation—that's a real thing. But typically you just put it on object store and don't think about it again.

Big Compute vs. Big Data

Big compute is probably the more interesting one—it's, I think, one of the reasons we built these super-complicated scale-out distributed systems. But machines are huge now. What doesn't fit on a single machine today is very, very different from what didn't fit on a single machine 15 years ago.

I published a blog post a while ago called "Big Data is Dead." Sheila referenced it earlier. It's not really that big data is dead—saying it's dead doesn't make it go away. It's really big compute that isn't as important as we make it out to be.

The Quadrants of Data

If we look at the landscape, we have big data on one axis and big compute on the other axis. The vast, vast majority of workloads are in the small data, small compute quadrant. In fact, somebody was saying to me yesterday that in Supabase, the median database size is 100 rows. Not even 100 megabytes or gigabytes—it's 100 rows. There's just a lot of small data out there.

If you look at cost, obviously the small data, small compute doesn't tend to cost you very much. If you go to big data, big compute—well, compute tends to be a little bit more expensive than storage, so that can be pretty expensive. Big data, small compute—if you have a lot of data, maybe you're generating some logs over time, they just sort of sit there, and you're doing small amounts of compute over the recent data. That's a little bit more expensive than when you didn't have a lot of data. And then when it gets really expensive is when you have big data and big compute.

If you look at the workloads that fall in each of these boxes, a lot of what a SQL analyst does tends to be on the small data, small compute side of things—gold tier work. Your BI may actually push into big compute, because very often you have a lot of users—a bunch of people all hitting the same dataset, refreshing their dashboards, drilling into different things. That does take more compute.

Then on the big data side there's what I call "independent data SaaS"—you're building a SaaS application where each of your users has separate data. In that case, you might not need a whole lot of compute per user, but in total, the amount of storage might be a lot. And then, every once in a while, you need to rebuild your datasets or run model training over the whole dataset. Yes, those workloads do exist.

Design for the Primary Use Case

I was a software engineer for 20 years, and one of my primary rules of thumb when building something was to make sure the design point—the primary thing you're building for, the thing that drives your architecture—is the main use case, not the corner cases.

As a bad example: I needed to remove some roots from my yard and needed a backhoe, so of course I get a backhoe and drive it to work every day. That's a bit of an absurd example, but it's kind of like saying, "Hey, every once in a while I need to rebuild my tables, so I'm going to use this giant distributed system every day," when it's totally unnecessary.

The Old Way: Designed for Big Data, Big Compute

If you think about how a lot of the older school modern data stack systems are designed, they were designed for the top right corner—"Hey, we can handle the biggest scale, the biggest compute, the biggest data." And then like, "Yeah, I'm sure it'll work if you scale down the amount of compute," because of course it's going to work. "And of course it's going to work if you scale down the data size. And the bottom left corner stuff? I know that's 98% of what you're doing, but I'm not even going to worry about that or care about that."

Just as an example, in BigQuery at one point we ended up making a change and it added a second to every query. The tech lead at the time was like, "It's fine," because in general the thing that we cared about was the top right corner, and the stuff that people were doing that was trying to be interactive didn't matter.

If you think about the performance goals, in the top right corner you want throughput, because you have a lot of data to churn through and you're willing to add a second—you're not concerned about latency. But for the vast majority of things you're doing, latency is the important part.

The New Way: Design for the Bottom Left

What if instead we designed for the bottom left corner? We'd make sure it works and have solutions for when you scale up the data size, make sure it works when you scale up the compute size, and figure out the top right corner when you get there. It is a requirement—you can't ignore it; that stuff has to work—but you can apply a little bit of elbow grease to make it work.

Building from Scratch

So if you were building a system from scratch, what would you do? Well, I believe you'd want to use scale up, not scale out, because you can scale up really, really far and scale out is a lot of work.

You store data at rest on object store, so you get that near-infinite scalability. That means you have to change some of the semantics—the data is immutable, and you have to do a bunch of fun things. But if data is stored on object store, it's highly durable, so your compute can be ephemeral. You can clone it; you can stamp out lots of instances.

Hypertenancy

Who's familiar with hypertenancy? Heard that as a word? Glauber is, because that's what he started calling what they're doing at Turso: running lots and lots of SQLite instances. Each user gets their own SQLite instance, and you scale not by having one giant database but by having hundreds of thousands or millions of users, each with their own database.

Introducing DuckDB

You may have known this was coming. I wanted to talk a little bit about how we are handling some of these things, and to do that I want to give a little bit of background.

These days probably most people have heard of DuckDB. It's an in-process analytical data management system, and it's been taking the world by storm. This is the code you need in Python to install DuckDB and start running queries—it's super easy. The GitHub stars have that nice exponential shape. The downloads have that shape too. I think it's actually probably among the top five websites in the Netherlands by the amount of traffic it gets. It's been growing by a lot.
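The slide itself isn't reproduced in the transcript; this is a minimal sketch of what that snippet looks like using DuckDB's standard Python API (the Parquet file name is made up):

```python
# pip install duckdb
import duckdb

# No server, no setup -- DuckDB runs in-process and executes SQL directly.
duckdb.sql("SELECT 42 AS answer").show()

# It can also query files in place, e.g. a local Parquet file.
duckdb.sql("SELECT count(*) FROM 'events.parquet'").show()
```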

The reason people like it is because they just make it easy. Having worked on other databases and at other database companies: a database company tends to focus on the patty of the burger, if what you're serving is a burger—how you get data in, how you get data out, how you integrate with other things. The overall experience gets treated as, "Oh, somebody else will deal with that—that's partners, that's something else." DuckDB does a really good job of making the whole experience great.

To give an example, I think they have the world's best CSV parser. If you have a nasty CSV with some goofy null characters in the middle and values that change type from one part of the file to another, wrestling it in can take a lot longer than you'll ever spend waiting for queries to run. So solving those problems actually matters.
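As a hedged illustration of what that looks like in practice (messy.csv is a made-up file; read_csv and its ignore_errors option are part of DuckDB's CSV reader):

```python
import duckdb

# DuckDB's CSV sniffer samples the file to infer the delimiter, quoting,
# header, and column types automatically.
duckdb.sql("SELECT * FROM read_csv('messy.csv')").show()

# For truly nasty files, skip unparseable rows instead of failing the load.
duckdb.sql("SELECT * FROM read_csv('messy.csv', ignore_errors = true)").show()
```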

MotherDuck: DuckDB in the Cloud

At MotherDuck we're taking DuckDB and running it in the cloud. This is the code you need to run DuckDB in the cloud. It's exactly the same code as before, except the database name now has the md: prefix—and that means it runs in the cloud. That's all you have to do.
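Again the slide isn't in the transcript, but per the description, the only change is the prefix (my_db is a placeholder; authentication is typically supplied via the motherduck_token environment variable):

```python
import duckdb

# The md: prefix routes the connection to MotherDuck instead of a local file.
con = duckdb.connect("md:my_db")
con.sql("SELECT 42 AS answer").show()
```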

Small Data, Small Compute

So big data versus big compute—how does this work for MotherDuck?

If you look at the quadrant, the bottom left quadrant—small data, small compute—I mean, everybody knows DuckDB works great here. The things you want to do here are ad hoc analytics, you're doing your platinum and gold tier stuff, you're writing a bunch of SQL queries, you're doing data science. You can scale this up as needed. That's pretty straightforward. That's right in the sweet spot of the design.

This is a visualization of a database benchmark from ClickHouse. Inexpensive goes up, faster goes to the right. If you look at inexpensive but slow, you've got the distributed small databases. If you look at the expensive but pretty fast, we've got the distributed large databases. And then kind of up on the top right, both inexpensive and fast, is DuckDB. This is actually from several months ago—if you look at the most recent results, they're actually further up and to the right.

Small Data, Big Compute

One of the problems with traditional data warehouses and their tenancy model is that you basically have lots of users hitting the same thing. I think it's a legacy of the days when you had the box—everybody shares this one box. You need to provision for the peak rather than the instantaneous demand. One user can often stomp on other users or impact their access. Autoscaling tends to lag. So from a price-performance perspective, it's not ideal.

In MotherDuck, everybody gets a duckling—that's what we call our DuckDB instances. MotherDuck marshals and cares for the ducklings in the cloud. When a new user shows up, we can assign them a duckling in less than 100 milliseconds—less than human reaction time. We keep things on warm storage so we can run queries super fast. Every user gets their own duckling, so they're all isolated. Ducklings can scale up to essentially the largest size we need, and they shut down immediately once they're no longer being used.

This can be helpful for small data, big compute—the case where I may not have a lot of data, but I have a lot of users using that data. A BI tool is the classic example—the Omni folks are here. Omni supports MotherDuck using read scaling, which means we can run lots of DuckDB instances against the same BI data.

Analytics Agents

Agents are also a really interesting one. Joe talked a lot about agents. If you have an analytics agent operating over data, you can have lots of analytics agents all operating over the same data. That's a lot of compute, a lot of work they're doing, but it may not be a lot of data.

The way read scaling works in MotherDuck: every user gets their own duckling, but each end user of the BI tool also gets their own duckling, and we route them to a separate replica. The routing is stable, so the same user tends to hit the same replica and query the same cached data. And you can decide how many replicas you want, so you don't end up with essentially infinite costs.
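A hedged sketch of what this looks like from a BI backend, based on my reading of MotherDuck's read-scaling setup (read scaling tokens and the session_hint connection parameter; the database name and user ID below are made up):

```python
import duckdb

def connection_for(end_user_id: str) -> duckdb.DuckDBPyConnection:
    # With a read scaling token supplied via the motherduck_token
    # environment variable, reads are served by a pool of replica
    # ducklings. The session hint keeps a given end user pinned to
    # the same replica, so their cache stays warm.
    return duckdb.connect(f"md:analytics_db?session_hint={end_user_id}")

con = connection_for("user-42")
con.sql("SELECT count(*) FROM page_views").show()
```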

On the subject of agents, I'm actually really excited about them—and this may be a preview of the talk I'm doing a little later with some folks from BI, observability, and transformation—because text-to-SQL has some limitations if you're trying to ask questions of your data. I think agents are a really good way of solving some of those problems, because agents mean you don't have to one-shot it. You don't have to come up with one perfect query that solves your problem.

Here's an interesting question to ask a human analyst: "Which of my customers are at risk of churning?" A human analyst is not going to one-shot that query. They're not going to type out a single query and boom, "Oh, it's these three." They're going to investigate. They're going to look at a bunch of things, pull in data from different sources, think about it and go, "Oh, maybe I need this." That's the kind of thing an agent can do.

What would you need from your underlying system? You need to be able to spin up lots of instances, because each of those agents is going to be a different system. As Joe mentioned, there's a good chance they'll melt down your single server—but if each one can scale individually, you have a much better chance of handling that load. You need to be able to clone data: the agents may even be modifying data as they go, and you may want to branch and return to a previous point. The tenancy model we have tends to work pretty nicely for that.

Big Data, Small Compute

On to the third quadrant—big data, small compute. The biggest thing here is time series or log analytics workloads. There are just a lot of big datasets. At Google we used to say all big data is created over time. Giant datasets don't just show up all of a sudden.

Typically what people end up doing is adding a small bit at a time, or looking at a small bit at a time. They're looking at what happened in the last day, the last week. They're looking at their Datadog, at their observability data—what's going on right now. This is where hypertenancy comes into play, and then DuckLake.

Typically the way SaaS provisioning works, if you're using a monolithic database, is you have lots of customers, you funnel them into a web application, and the application talks to a database. That's pretty standard. But you have to provision for peak, you have to be able to handle the scale, and users aren't isolated.

With MotherDuck, each end user can talk directly to the database—you don't even have to route things through a backend—and ducklings can be provisioned on demand, scale up and down, stay isolated, and so on.

DuckLake

I mentioned DuckLake. Iceberg is all the rage these days. DuckLake is an alternative to Iceberg that stores the metadata in a database instead of in S3. It makes things a lot cleaner. You don't have that goofy web of JSON and Avro files pointing at all this metadata on disk. You have a database that knows how to do transactions and knows how to do filtering and pushdown very, very fast.

I think DuckLake is also a key to operating at larger scale, because it's a data lake—or a lakehouse. The data sits on S3; you can add as much as you want. The metadata is in a database, and as long as the query you're running only operates over a reasonable amount of that data, it should just work.
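To make that concrete, here's an illustrative query over a hypothetical events table in a DuckLake (my_lake and the schema are made up): the lake may hold years of logs on S3, but because per-file statistics live in the metadata database, a time predicate prunes the scan down to only the recent Parquet files.

```python
import duckdb

con = duckdb.connect("md:my_lake")  # hypothetical DuckLake database

# Only the files whose statistics overlap the last 7 days get scanned,
# even if the table holds years of history.
con.sql("""
    SELECT status, count(*) AS hits
    FROM events
    WHERE event_time >= now() - INTERVAL 7 DAY
    GROUP BY status
""").show()
```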

DuckLake was created by the creators of DuckDB—Hannes and Mark and DuckDB Labs—and they've done benchmarking on petabyte-scale DuckLakes, and it just works.

Big Data, Big Compute

The last quadrant is big data, big compute. Every once in a while you do have to do some of these giant transformations. You have to rebuild tables. You want to run model training over your whole dataset.

You can still handle this in MotherDuck. First of all, we have giant instances. We just released these—we call them mega and giga. The largest has 192 cores and a terabyte and a half of memory. That's more memory than a Snowflake 3XL, and a Snowflake 3XL is a million dollars a year. So if you have a single workload that needs more than a 3XL, you might need something bigger. But the vast, vast, vast majority of things can be handled.

And if they can't, one of the nice things about DuckLake is that we can give you physical access to the data, and you can just run Spark. You have an outlet valve, because it's an open storage system.

The Evolution of Performance

The Dremel paper came out in 2010, and at the time it was seen as science fiction. Some of the queries they ran, we can now basically do on a single machine with similar or better performance—especially if some of the data is pre-cached. If you have to read it from S3, there are potential bottlenecks, because object stores are really not great as a database.

To create a DuckLake in the MotherDuck UI (it's the same in the DuckDB CLI), it's just a couple lines of code: CREATE DATABASE ... TYPE DUCKLAKE. That's really all you need to do. Then you adopt your existing Parquet files and you're up and running.
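A minimal sketch of that flow, assuming the CREATE DATABASE ... TYPE DUCKLAKE form mentioned above (my_lake is a placeholder, and the CTAS at the end stands in for adopting existing Parquet files—check the MotherDuck docs for the exact adoption syntax):

```python
import duckdb

con = duckdb.connect("md:")

# Metadata goes into a database; the data files live on object store.
con.sql("CREATE DATABASE my_lake (TYPE DUCKLAKE)")
con.sql("USE my_lake")

# From here it's ordinary SQL -- writes land as Parquet files in the lake.
con.sql("CREATE TABLE events AS SELECT * FROM 'seed_events.parquet'")
```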

I also wanted to show one of the cool things about DuckLake: this is a working Spark connector in Python that is 34 lines of code, and most of that is boilerplate setup. It's super easy to do. Contrast that with how much code you'd need to build a working Iceberg connector that's properly distributed—I guarantee it would be a lot, lot more than that.
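The 34-line connector itself isn't reproduced in the transcript. As a rough illustration of why it can be that small: DuckLake's catalog is ordinary SQL tables, so a minimal read path just asks the metadata database which Parquet files make up a table and hands them to Spark's stock Parquet reader. Everything below—the metadata file path, the ducklake_data_file table and its columns—is an assumption based on the DuckLake spec, not the actual connector.

```python
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ducklake-read-sketch").getOrCreate()

# DuckLake keeps its catalog in plain SQL tables; assume here that a local
# DuckDB file holds the metadata (it could equally be Postgres or MySQL).
meta = duckdb.connect("metadata.ducklake")  # hypothetical path

# Assumed catalog layout per the DuckLake spec: ducklake_data_file lists
# the Parquet files that belong to each table.
paths = [
    row[0]
    for row in meta.sql(
        "SELECT path FROM ducklake_data_file WHERE table_id = 1"
    ).fetchall()
]

# Spark's ordinary distributed Parquet reader does the heavy lifting.
df = spark.read.parquet(*paths)
df.groupBy("status").count().show()
```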

Summary: Design Points

Getting back to the design points we're looking at: small data, small compute—DuckDB rocks. If you increase the data size, we have DuckLake and we have hypertenancy. If you increase the compute side, we have read scaling. And for actual big data, big compute, we have giant instances—plus DuckLake, which also allows external access.

Thank you.
