
Ducks on a Lake: Scaling Data Lakes to Warehouse Performance

2025/09/18


Transcript

Hi, I'm here to talk about ducks on a lake: scaling data lakes to warehouse performance. I'm one of the co-founders of MotherDuck, a company taking DuckDB to the cloud as a data warehouse, used for both internal BI and analytics and for customer-facing applications.

I have to jet back to our company team offsite after this, so feel free to reach out to me on social or email at ryan@motherduck.com if you have any questions I don't answer here today.

Making Andy Ducky

I'm feeling pretty ducky today. Someone at this conference speaking later today, Andy, is a little less ducky. So let's fix that. [Shows slide transformation] Beautiful. That's like the only slide transition I've ever used in my life, and it worked really well. We fixed him—he's now appropriately duckified.

If you want a chance to be transformed in real life, there's a QR code and postcards at our sponsor table where you can get free swag and your own duck shirts. Today we're talking about lakehouses and making them as easy as duck.

The Eras Tour: Evolution of Analytics Engines

My daughter is nine years old, so I'm calling this next section the Eras Tour: a look at the evolution of analytics engines and how we got to DuckLake.

The Warehouse Era (1980s-1990s)

Data warehousing started in the late 1980s at IBM but wasn't popularized until Bill Inmon did so in the 1990s. The vision was grand: build one central place to store all your business data and be able to easily retrieve it, all while sporting an amazing haircut.

These projects were difficult and expensive: expensive to build, but most importantly expensive to scale. Once you had figured out how to put your data in a small box and wanted to scale up, it got exponentially more expensive. It took many months of data modeling before loading even started, and many of these projects ended up failing, something like 80% of the time.

I asked Claude about data warehouse statistics, and it offered to give me a joke. Here's what it came up with: "Post-mortem on a data warehouse project: 30 months, $8.7 million, resignation of three project managers. They sunset their data warehouse. Their initial goal was to provide a single source of truth, and they encountered a series of minor complications that rendered it largely unusable."

Why did they consider it unusable? Their ambitions for scope surpassed NASA's. They wanted a single source of truth. ETL became "eternally transforming and loading." Their daily job ran for 19 hours, and when they parallelized it, it ran for 22 hours (this is kind of why we build on top of DuckDB). They successfully loaded their January 1997 data by September 1998, and then their executive champion disappeared, went off for a yoga retreat in Sedona, and never came back.

As data infrastructure people, we have one job: to make infrastructure that makes people successful, which makes technology a superpower and not a burden. We obviously failed in the data warehouse era.

The Data Lake Era (Early 2000s)

It got a little bit better in the early 2000s. We had improvements to software, great papers out of Google—Google File System, MapReduce, BigTable. Put these together and you created Hadoop. All of a sudden we had something that could scale at a reasonable cost, especially when the cloud came along and offered us separation of compute and storage.

We decided we're no longer going to spend months or years modeling all our data before we store it. We're just going to have this beautiful lake and put all our data in distributed storage like HDFS, and query it with Hive or Drill (an open-source take on the engine behind BigQuery).

This was awesome in a handful of ways: you can store any of your data—structured and unstructured. You get proven scalability of distributed storage, and it's cost-effective. Cloud storage was cheap, orders of magnitude less than storage in the on-premise world.

The pendulum was swinging. We decided we don't need to think about modeling in advance, so our projects will succeed now instead of failing. Only, we created a mess—the data swamp. We stored all our data without worrying about format or schema evolution. It was very polluted.

The Lakehouse Era

This is why the lakehouse came along. What happens if we can do ACID transactions in our data lake? What happens if we can evolve our schema and keep our data lake organized? We still threw a bunch of stuff in at first, but at least we had a way to clean it up.

Open table formats gave us awesome capabilities: all the standard features of a data warehouse like ACID transactions, data versioning, schema enforcement and evolution. We added awesome things like time travel—the ability to run queries on last week's, last month's, or last year's data.
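To make time travel concrete, here's a hedged sketch in the DuckDB syntax we'll use later for DuckLake (the attached catalog and the events table are hypothetical; Iceberg and Delta expose their own equivalents of the same idea):

    -- Query the table as it existed at an earlier snapshot, or at a point in time.
    SELECT count(*) FROM lake.events AT (VERSION => 3);
    SELECT count(*) FROM lake.events AT (TIMESTAMP => now() - INTERVAL 7 DAY);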

Importantly, it allowed multiple engines to work with the data. People really wanted to avoid lock-in and be able to query data from any compute engine they have.

The Hidden Complexity Problem

One hard requirement when building these lakehouse formats was storing all the metadata and data in blob storage without using a traditional database, so there were no external dependencies. Overall, this seems elegant enough. So why is it an issue?

Here's what it looks like to do a very simple query against a single table. Imagine joining three or four tables together. You're having to do so many lookups of JSON files and Avro files (in the case of Iceberg) just to retrieve the list of files you're then going to fetch to run your actual query. We haven't even fetched any data yet and we've already done 15 different HTTP requests.

This is where the hidden complexity of lakehouse open table formats comes in. Multiple round trips—HTTP has gotten better over the years, but it's still 15 or 20 requests for one table. For three tables, that's conservatively probably 45 different HTTP requests before you even start making requests for the data. This is really a killer and causes performance bottlenecks.

You also have a small file problem. If you do a lot of writes and transactions, those produce even more metadata files.

The Catalog Solution

These problems can be solved using a catalog server. You store all the metadata, or at least the high-level metadata, in it. This allows you to do multi-table transactions and more quickly fetch information about which files to query. The metadata files generally still live in storage as well, but you get an abstraction that makes it easier to discover where all the data in an organization lives.

Enter DuckLake

Our friends Hannes and Mark, the co-creators of DuckDB, come from CWI, the Dutch national research institute for mathematics and computer science, and founded DuckDB Labs. They care a lot about the aesthetics of systems; that's a big part of why DuckDB has become so popular.

They said: wait, the catalog is such an important part of the open table format ecosystem, yet it's not actually in the standards. It's a separate thing, a required add-on to make the format operate at reasonable performance and to make it easy to discover what data exists in an organization. It's a separate product from Iceberg and Delta themselves, something like Polaris or Unity Catalog, and those catalogs actually use a database.

Why not consider all of this as part of a standard? Databases are designed for metadata. Let's use them to back the lakehouse.

This does violate one of the core principles used to create the open table formats—they said they didn't want to use an extra system, an extra database. But things evolved and we're here to break the rules.

The standalone card catalog is officially dead, as is the real card catalog, sadly. We have ducks. Ducks on a lake.

Three Core Principles of DuckLake

DuckLake was designed around three principles: simplicity, scalability, and speed.

Simple

We radically simplified things by just having a database. Instead of storing all the metadata on the file system, or in a catalog that stores it in a database that then accesses a file system with more metadata, let's just use a database to start with. This can be any SQL database that supports ACID transactions; Postgres is the typical reference point. We actually use DuckDB in our current implementation, but we'll probably move to a more traditional transactional database in the long run.

You have your Parquet files. Instead of keeping a metadata layer separate from the catalog database, with the catalog holding only some of the metadata, we combine it all into the same database. Simple.
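As a minimal sketch of that simplicity, assuming the DuckDB ducklake extension and made-up table and path names (option spellings may vary by version):

    INSTALL ducklake;
    LOAD ducklake;
    -- One DuckDB file holds the catalog metadata; Parquet data files land under lake_data/.
    ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/');
    CREATE TABLE lake.events (id INTEGER, ts TIMESTAMP, payload VARCHAR);
    INSERT INTO lake.events VALUES (1, now(), 'hello ducks');

That's the whole lakehouse: a metadata database plus a directory of Parquet files.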

Scalable

The process is just as scalable as any other data lake because it's stored in object storage like S3, which is proven to handle large volumes of read and write transactions.

You may wonder: doesn't introducing a transactional database server reduce scale? Generally, no. Companies have already gotten really good at running and operationalizing transactional databases like Postgres. Almost every company already has one; they're used to it and have staff that understands it. Postgres is designed to handle thousands of transactions per second, and if you think of the metadata as being roughly one hundred-thousandth the size of the data itself, Postgres holds up just fine.
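Here's roughly what the same setup looks like with Postgres holding the metadata; the connection details and bucket are placeholders, and you'll also need the relevant DuckDB extensions (postgres, httpfs) available:

    -- Postgres stores the catalog; the data files still live in object storage.
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal user=lake_writer' AS lake
        (DATA_PATH 's3://my-bucket/lake/');
    SELECT count(*) FROM lake.events;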

Speed

You get speed. Instead of doing query planning by retrieving 50 different metadata files, you issue one SQL query (which Postgres is pretty good at) that tells you which files to read. You still have to read those files from the object store, but planning is a lot easier and faster: sub-second, regardless of the overall size of the data.
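To make that concrete, the planning step boils down to a single metadata lookup along these lines; this is a simplified illustration using table and column names from the DuckLake spec as I understand it, not the exact query the extension issues:

    -- Which data files are visible for table 'events' as of snapshot 42?
    SELECT f.path
    FROM ducklake_data_file f
    JOIN ducklake_table t ON t.table_id = f.table_id
    WHERE t.table_name = 'events'
      AND f.begin_snapshot <= 42
      AND (f.end_snapshot IS NULL OR f.end_snapshot > 42);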

With DuckLake, you just have your SQL database and metadata. Or you can use other open table formats like Iceberg and deal with the manifest maze.

We've found a way to simplify, scale, and speed up traditional lakehouse formats by providing this new open table format.

Is DuckLake Just for DuckDB?

The question we often get is: it's called DuckLake, so is it just for DuckDB? Yes, it's created by the creators of DuckDB, the same people who like ducks way too much, but it is an open standard. The first reference implementation is built in DuckDB. Visit ducklake.dev for more information.

Due to its really simple architecture, it's designed to be easy for you all to implement in any system—very much unlike trying to implement a writer for Iceberg.

Why MotherDuck Cares

You might wonder why we at MotherDuck care. For those who know a little about MotherDuck: we created a data warehouse based on DuckDB with our own storage format, designed to keep storage close to the compute to achieve warehouse speeds. So why would I be up here talking about a lakehouse format? The world seems to be in one camp or the other. Why would we be in both?

Especially because we wrote a blog post called "Big Data is Dead." Aren't lakehouse formats only for big data?

No. One of the foundational arguments in the blog post is that even if we have a ton of data in our tables, our actual analytics workload only touches a small amount of it. People often query only the most recent week, month, or quarter of data. So yes, big data is dead in the sense that your individual queries often don't touch big data, but that doesn't mean you don't have a lot of historical data.

Even when you query that historical data, you typically only touch a few columns. It's heavily compressed, and we've built the infrastructure so that not much of the data needs to be read.

Challenges DuckLake Solves for MotherDuck

1. Historical Data

Even if we only touch a few percent of our data in any given week or month, now and then we're told, "Oh crap, we messed something up in our business. We need to analyze the last two years of data."

What if we had layered storage tiers? Our data warehouse can keep the most recent, heavily accessed data in its own storage tier while keeping historical data in more of a lakehouse tier, and that split stays relatively transparent to users. The same query engine processes both datasets regardless of where they're stored.
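A hypothetical sketch of what that looks like from the query side, with recent data in a warehouse database and history in a DuckLake database, both attached to the same engine (all names made up):

    -- One query spans both tiers; the engine reads each table from wherever it lives.
    SELECT day, count(*) AS events
    FROM (
        SELECT ts::DATE AS day FROM warehouse.events          -- hot, recent data
        UNION ALL
        SELECT ts::DATE AS day FROM lake.events_history       -- cold, historical data
    ) AS both_tiers
    GROUP BY day
    ORDER BY day;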

2. High Throughput Concurrent Writes

One challenge of building a data warehouse, or columnar analytics databases in general, and with the DuckDB engine and file format specifically, is getting high-throughput concurrent writes. DuckDB was originally created to handle your local data on your desktop. It's getting better at high-throughput scenarios, and many types of workloads don't need that, but some do.

Some folks have hundreds of Lambda jobs transforming and writing data into the data warehouse. What if each of those jobs can independently write to the object store? We can dramatically improve write concurrency.
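As a sketch, each of those independent writers might run something like the following (connection string, bucket, and table are hypothetical; every job commits its own transaction against the shared catalog):

    -- Run independently by each Lambda job.
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal' AS lake
        (DATA_PATH 's3://my-bucket/lake/');
    INSERT INTO lake.events
    SELECT * FROM read_parquet('s3://my-bucket/staging/batch_0042.parquet');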

DuckLake also has something special called data inlining to help with concurrent writing. For analytics jobs, you want your data in a columnar format, but columnar formats are typically slower to write to. If you have a large volume of rows to insert, that takes a lot of separate operations on a columnar database.

What if we take this SQL database for metadata and slightly abuse it to insert data really quickly? If we have data we want to stream in a fast, concurrent fashion, Postgres is really good at that. You can stream data into the metadata store, and it can periodically flush that out to the columnar data files.
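A sketch of how inlining is surfaced; the option and function names below are how I recall them from the ducklake extension and may differ in current releases, so treat this as illustrative:

    -- Small inserts below the row limit are written into the catalog database itself...
    ATTACH 'ducklake:postgres:dbname=lake_catalog' AS lake (
        DATA_PATH 's3://my-bucket/lake/',
        DATA_INLINING_ROW_LIMIT 1000
    );
    INSERT INTO lake.events VALUES (1, now(), 'clickstream row');
    -- ...and flushed out to Parquet files later.
    CALL ducklake_flush_inlined_data('lake');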

3. System Interoperability

How do we achieve system interoperability? Data warehouse formats typically aren't as accessible outside their own engines as a lakehouse format is. DuckLake gives us that. The modern data stack was a great unbundling of all the data infrastructure; now that it's unbundled, how do we make sure all these different compute engines can read and write that data? We might have Spark jobs that need to write alongside our SQL queries.

4. Scale Anxiety

We wrote that blog post saying "Big Data is Dead" with a lot of intention. It is a bit of hyperbole, we have to admit. But people naturally say, "DuckDB, will it scale? It's this in-process analytics engine. How could it possibly scale to be my lakehouse?"

Part of that is the messaging around things like the blog post. The reality is we've actually built a new type of distributed compute at MotherDuck. We don't call it that—we want it to be simple. But MotherDuck does scale through hypertenancy: every user gets their own DuckDB instance. We have horizontal read scaling where we can scale out read operations to dozens of machines, and we have vertical scaling.

The idea with DuckLake is that it makes that a lot better. People don't need to just hear that it scales; they need to see that it scales. People see DuckLake and say, "Oh, it's built on an object store. Okay, it scales." That checkbox is complete.

DuckLake Performance

Speed is really where DuckLake shines. It still may not be quite as performant on small reads as a traditional data warehouse; it may not get you the tens-of-milliseconds reads a traditional warehouse can. But it is a duckload better than the performance of other open table formats.

At MotherDuck we continue to offer our standard data warehouse for sub-second, tens-of-milliseconds queries, which is very important for customer-facing analytics. That storage is based on the DuckDB format, but we've now augmented it to support DuckLake.

You can either bring your own bucket and directly access that bucket from MotherDuck as well as compute engines like Spark, or use a fully managed system where you can just say "CREATE DATABASE TYPE DUCKLAKE" and that's the end of it.
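For the fully managed path, that really is about the whole setup; the exact syntax below is how I understand the current beta, so check the MotherDuck docs:

    -- MotherDuck hosts both the catalog and the storage for this database.
    CREATE DATABASE my_lake (TYPE DUCKLAKE);
    CREATE TABLE my_lake.events AS SELECT 1 AS id, now() AS ts;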

What About Iceberg?

Some of you might wonder: what about Iceberg? Isn't that the de facto standard?

I would say that standards are meant to be challenged. We're here as innovators. Through competition, we make the industry better. DuckLake does build upon the innovations in these other lakehouse formats. I don't like the negative mentality around it. I like to think of it as evolving: learning from what's going on in the industry and re-examining and challenging the constraints we've seen.

Key Features

DuckLake has lots of features, but I want to call out multi-table ACID transactions in particular. Many table formats, whether traditional Parquet or lakehouse formats built on top of Parquet, think in units of a single table. How many of you really use just a single table at a time?

In school, it was all about denormalizing everything. Yes, you go through cycles where you normalize and then denormalize. But we do think in multiple tables, and we want to run transactions across multiple tables, for example updating a view column while updating a column in the underlying data. DuckLake really thinks of the database as a whole, not just table by table.
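A minimal sketch of what that enables, with hypothetical table names in an attached DuckLake catalog; either both changes commit together or neither does:

    BEGIN TRANSACTION;
    UPDATE lake.orders SET status = 'shipped' WHERE order_id = 42;
    INSERT INTO lake.order_events VALUES (42, 'shipped', now());
    COMMIT;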

Will DuckLake Overtake Iceberg?

I don't know. I do care that it's making things easier and that people can get up and running with DuckLake super quickly.

We also want to support data engineers where they're at. Iceberg support was greatly enhanced with write support in DuckDB 1.4, which was released two days ago. And we added the ability to both import data into and export data out of DuckLake in DuckLake 0.3, released yesterday. We're continuing to invest in Iceberg while also asking how we can make things better.

Q&A Highlights

On vertical vs. horizontal scaling: DuckDB was initially created because Hannes and Mark found it frustrating that their fellow academics were using Python or R code to analyze data instead of real databases. It started really small. DuckDB runs amazingly well on our MacBook Pros. As we do larger workloads in a warehouse environment, we've learned to scale up vertically with different instance sizes. But we've also learned that having an individual engine per person, per workload, or per company is useful for many use cases, to isolate compute and maintain consistent performance. You can scale both horizontally and vertically; it comes down to your use case.

On managed DuckLake: Right now, the fully managed DuckLake uses DuckDB as the transactional database. It's not as good as Postgres for that use case, so we'll likely manage Postgres instead as we move toward GA.

On partitioning: DuckDB itself doesn't do partitioning of data. That's actually one of the great advantages of using DuckLake: the table format supports partitioning.
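For example, partitioning in DuckLake is declared in the table metadata and applied to newly written files; the syntax below is as I recall it from the ducklake extension, with a hypothetical table:

    -- New data files for lake.events get laid out by year and month of ts.
    ALTER TABLE lake.events SET PARTITIONED BY (year(ts), month(ts));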

On query planning performance: Hannes and Mark have done performance testing of the query planning component, which is the most expensive part in other lakehouse formats. They can plan a petabyte-scale TPC-H query in under a second. You still have to fetch and analyze the data, but try doing the planning in another open table format; it takes a lot longer than a second.

On file format support: DuckLake stores Parquet files. DuckDB itself has Arrow support and interoperates with Arrow, but DuckLake specifies Parquet as its data file format.

On suitability for enterprise datasets: I would take issue with the idea that it's not designed for terabytes of data. We have at least two or three terabytes of data in our internal data warehouse and use it every single day with people across the company querying it. It depends on how you architect that data. Some things need to be done a little bit differently with DuckDB, but for that you get pretty awesome ergonomics.

Evolution is the important thing here with every technology. We started with DuckDB used just on a desktop by scientists analyzing a few megabytes of files, and now we're at the point where, yes, multi-terabyte data warehouses can be analyzed with DuckDB. We bill based on consumption, so we see the benefit every time a new DuckDB release comes out and things get more efficient. The awesome thing about DuckDB is that it has such an active community: so much development going on, so many new releases.

Thank you!
