
Ducks on a Lake: Scaling Data Lakes to Warehouse Performance

2025/09/18


Transcript

Hi, I'm here to talk about ducks on a lake: scaling data lakes to warehouse performance. I'm one of the co-founders of MotherDuck, a company taking DuckDB to the cloud as a data warehouse, used for both internal BI and analytics and for customer-facing applications.

I have to jet back to our company team offsite after this, so feel free to reach out to me on social or email at ryan@motherduck.com if you have any questions I don't answer here today.

Making Andy Ducky

I'm feeling pretty ducky today. Someone at this conference speaking later today, Andy, is a little less ducky. So let's fix that. [Shows slide transformation] Beautiful. That's like the only slide transition I've ever used in my life, and it worked really well. We fixed him—he's now appropriately duckified.

If you want a chance to be transformed in real life, there's a QR code and postcards at our sponsor table where you can get free swag and your own duck shirts. Today we're talking about lakehouses and making them as easy as duck.

The Eras Tour: Evolution of Analytics Engines

My daughter is nine years old, so I'm calling this next section the Eras Tour: a look at the evolution of analytics engines and how we got to DuckLake.

The Warehouse Era (1980s-1990s)

Data warehousing started in the late 1980s at IBM but wasn't popularized until Bill Inmon did so in the 1990s. The vision was grand: build one central place to store all your business data and be able to easily retrieve it, all while sporting an amazing haircut.

These projects were difficult and expensive: expensive to build, but most importantly expensive to scale. Once you had figured out how to put your data in a small box and wanted to scale up, it got exponentially more expensive. It took many months of data modeling before loading even started, and many of these projects ended up failing, something like 80% of the time.

I asked Claude about data warehouse statistics, and it offered to give me a joke. Here's what it came up with: "Post-mortem on a data warehouse project: 30 months, $8.7 million, resignation of three project managers. They sunset their data warehouse. Their initial goal was to provide a single source of truth, and they encountered a series of minor complications that rendered it largely unusable."

Why did they consider it unusable? Their ambitions for scope surpassed NASA's. They wanted a single source of truth. ETL became "eternally transforming and loading." Their daily job ran for 19 hours, and when they parallelized it, it ran for 22 hours (this is kind of why we build on top of DuckDB). They successfully loaded their January 1997 data by September 1998, and then their executive champion disappeared, went off for a yoga retreat in Sedona, and never came back.

As data infrastructure people, we have one job: to make infrastructure that makes people successful, which makes technology a superpower and not a burden. We obviously failed in the data warehouse era.

The Data Lake Era (Early 2000s)

It got a little bit better in the early 2000s. We had improvements to software, great papers out of Google—Google File System, MapReduce, BigTable. Put these together and you created Hadoop. All of a sudden we had something that could scale at a reasonable cost, especially when the cloud came along and offered us separation of compute and storage.

We decided we're no longer going to spend months or years modeling all our data before we store it. We're just going to have this beautiful lake and put all our data in distributed storage like HDFS, and query it with Hive or Drill (an open-source take on the engine behind BigQuery).

This was awesome in a handful of ways: you can store any of your data—structured and unstructured. You get proven scalability of distributed storage, and it's cost-effective. Cloud storage was cheap, orders of magnitude less than storage in the on-premise world.

The pendulum was swinging. We decided we don't need to think about modeling in advance, so our projects will succeed now instead of failing. Only, we created a mess—the data swamp. We stored all our data without worrying about format or schema evolution. It was very polluted.

The Lakehouse Era

This is why the lakehouse came along. What happens if we can do ACID transactions in our data lake? What happens if we can evolve our schema and keep our data lake organized? We still threw a bunch of stuff in at first, but at least we had a way to clean it up.

Open table formats gave us awesome capabilities: all the standard features of a data warehouse like ACID transactions, data versioning, schema enforcement and evolution. We added awesome things like time travel—the ability to run queries on last week's, last month's, or last year's data.
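To make time travel concrete, here's a hedged sketch in the DuckDB syntax we'll use later for DuckLake (the attached catalog and the events table are hypothetical; Iceberg and Delta expose their own equivalents of the same idea):

    -- Query the table as it existed at an earlier snapshot, or at a point in time.
    SELECT count(*) FROM lake.events AT (VERSION => 3);
    SELECT count(*) FROM lake.events AT (TIMESTAMP => now() - INTERVAL 7 DAY);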

Importantly, it allowed multiple engines to work with the data. People really wanted to avoid lock-in and be able to query data from any compute engine they have.

The Hidden Complexity Problem

One hard requirement when building these lakehouse formats was storing all the metadata and data in blob storage without using a traditional database, so there were no external dependencies. Overall, this seems elegant enough. So why is it an issue?

Here's what it looks like to do a very simple query against a single table. Imagine joining three or four tables together. You're having to do so many lookups of JSON files and Avro files (in the case of Iceberg) just to retrieve the list of files you're then going to fetch to run your actual query. We haven't even fetched any data yet and we've already done 15 different HTTP requests.

This is where the hidden complexity of lakehouse open table formats comes in. Multiple round trips—HTTP has gotten better over the years, but it's still 15 or 20 requests for one table. For three tables, that's conservatively probably 45 different HTTP requests before you even start making requests for the data. This is really a killer and causes performance bottlenecks.

You also have a small file problem. If you do a lot of writes and transactions, those produce even more metadata files.

The Catalog Solution

These problems can be solved using a catalog server. You store all the metadata, or at least the high-level metadata, in it. This allows you to do multi-table transactions and more quickly fetch information about which files to query. The metadata files generally still live in storage as well, but you get an abstraction that makes it easier to discover where all the data in an organization lives.

Enter DuckLake

Our friends Hannes and Mark, the co-creators of DuckDB, come from CWI, the Dutch national research institute for mathematics and computer science, and founded DuckDB Labs. They care a lot about the aesthetics of systems; that's a big part of why DuckDB has become so popular.

They said: wait, the catalog is such an important part of the open table format ecosystem, yet it's not actually in the standards. It's a separate thing, a required add-on to make the format operate at reasonable performance and to make it easy to discover what data exists in an organization. It's a separate product from Iceberg and Delta themselves, something like Polaris or Unity Catalog, and those catalogs actually use a database.

Why not consider all of this as part of a standard? Databases are designed for metadata. Let's use them to back the lakehouse.

This does violate one of the core principles used to create the open table formats—they said they didn't want to use an extra system, an extra database. But things evolved and we're here to break the rules.

The standalone card catalog is officially dead, as is the real card catalog, sadly. We have ducks. Ducks on a lake.

Three Core Principles of DuckLake

DuckLake was designed around three principles: simplicity, scalability, and speed.

Simple

We radically simplified things by just having a database. Instead of storing all the metadata on the file system, or in a catalog that stores it in a database that then accesses a file system with more metadata, let's just use a database to start with. This can be any SQL database that supports ACID transactions; Postgres is the typical reference point. We actually use DuckDB in our current implementation, but we'll probably move to a more traditional transactional database in the long run.

You have your Parquet files. Instead of keeping a metadata layer separate from the catalog database, with the catalog holding only some of the metadata, we combine it all into the same database. Simple.
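As a minimal sketch of that simplicity, assuming the DuckDB ducklake extension and made-up table and path names (option spellings may vary by version):

    INSTALL ducklake;
    LOAD ducklake;
    -- One DuckDB file holds the catalog metadata; Parquet data files land under lake_data/.
    ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/');
    CREATE TABLE lake.events (id INTEGER, ts TIMESTAMP, payload VARCHAR);
    INSERT INTO lake.events VALUES (1, now(), 'hello ducks');

That's the whole lakehouse: a metadata database plus a directory of Parquet files.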

Scalable

The process is just as scalable as any other data lake because it's stored in object storage like S3, which is proven to handle large volumes of read and write transactions.

You may wonder: doesn't introducing a transactional database server reduce scale? Generally, no. Companies have already gotten really good at running and operationalizing transactional databases like Postgres. Almost every company already has one; they're used to it and have staff that understands it. Postgres is designed to handle thousands of transactions per second, and if you think of the metadata as being roughly one hundred-thousandth the size of the data itself, Postgres holds up just fine.
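Here's roughly what the same setup looks like with Postgres holding the metadata; the connection details and bucket are placeholders, and you'll also need the relevant DuckDB extensions (postgres, httpfs) available:

    -- Postgres stores the catalog; the data files still live in object storage.
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal user=lake_writer' AS lake
        (DATA_PATH 's3://my-bucket/lake/');
    SELECT count(*) FROM lake.events;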

Speed

You get speed. Instead of doing query planning by retrieving 50 different metadata files, you issue one SQL query (which Postgres is pretty good at) that tells you which files to read. You still have to read those files from the object store, but planning is a lot easier and faster: sub-second, regardless of the overall size of the data.
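To make that concrete, the planning step boils down to a single metadata lookup along these lines; this is a simplified illustration using table and column names from the DuckLake spec as I understand it, not the exact query the extension issues:

    -- Which data files are visible for table 'events' as of snapshot 42?
    SELECT f.path
    FROM ducklake_data_file f
    JOIN ducklake_table t ON t.table_id = f.table_id
    WHERE t.table_name = 'events'
      AND f.begin_snapshot <= 42
      AND (f.end_snapshot IS NULL OR f.end_snapshot > 42);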

With DuckLake, you just have your SQL database and metadata. Or you can use other open table formats like Iceberg and deal with the manifest maze.

We've found a way to simplify, scale, and speed up traditional lakehouse formats by providing this new open table format.

Is DuckLake Just for DuckDB?

The question we often get is: it's called DuckLake, so is it just for DuckDB? Yes, it's created by the creators of DuckDB, the same people who like ducks way too much, but it is an open standard. The first reference implementation is built in DuckDB. Visit ducklake.dev for more information.

Due to its really simple architecture, it's designed to be easy for you all to implement in any system—very much unlike trying to implement a writer for Iceberg.

Why MotherDuck Cares

You might wonder why we at MotherDuck care. For those who know a little about MotherDuck: we created a data warehouse based on DuckDB with our own storage format, designed to keep storage close to the compute to achieve warehouse speeds. So why would I be up here talking about a lakehouse format? The world seems to be in one camp or the other. Why would we be in both?

Especially because we wrote a blog post called "Big Data is Dead." Aren't lakehouse formats only for big data?

No. One of the foundational arguments in the blog post is that even if we have a ton of data in our tables, our actual analytics workload only touches a small amount of it. People often query only the most recent week, month, or quarter of data. So yes, big data is dead in the sense that your individual queries often don't touch big data, but that doesn't mean you don't have a lot of historical data.

Even when you query that historical data, you typically only touch a few columns. It's heavily compressed, and we've built the infrastructure so that not much of the data needs to be read.

Challenges DuckLake Solves for MotherDuck

1. Historical Data

Even if we only touch a few percent of our data in any given week or month, now and then we're told, "Oh crap, we messed something up in our business. We need to analyze the last two years of data."

What if we had layered storage tiers? Our data warehouse can keep the most recent, heavily accessed data in its own storage tier while keeping historical data in more of a lakehouse tier, and that split stays relatively transparent to users. The same query engine processes both datasets regardless of where they're stored.
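A hypothetical sketch of what that looks like from the query side, with recent data in a warehouse database and history in a DuckLake database, both attached to the same engine (all names made up):

    -- One query spans both tiers; the engine reads each table from wherever it lives.
    SELECT day, count(*) AS events
    FROM (
        SELECT ts::DATE AS day FROM warehouse.events          -- hot, recent data
        UNION ALL
        SELECT ts::DATE AS day FROM lake.events_history       -- cold, historical data
    ) AS both_tiers
    GROUP BY day
    ORDER BY day;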

2. High Throughput Concurrent Writes

One challenge of building a data warehouse, or columnar analytics databases in general, and with the DuckDB engine and file format specifically, is getting high-throughput concurrent writes. DuckDB was originally created to handle your local data on your desktop. It's getting better at high-throughput scenarios, and many types of workloads don't need that, but some do.

Some folks have hundreds of Lambda jobs transforming and writing data into the data warehouse. What if each of those jobs can independently write to the object store? We can dramatically improve write concurrency.
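As a sketch, each of those independent writers might run something like the following (connection string, bucket, and table are hypothetical; every job commits its own transaction against the shared catalog):

    -- Run independently by each Lambda job.
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal' AS lake
        (DATA_PATH 's3://my-bucket/lake/');
    INSERT INTO lake.events
    SELECT * FROM read_parquet('s3://my-bucket/staging/batch_0042.parquet');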

DuckLake also has something special called data inlining to help with concurrent writing. For analytics jobs, you want your data in a columnar format, but columnar formats are typically slower to write to. If you have a large volume of rows to insert, that takes a lot of separate operations on a columnar database.

What if we take this SQL database for metadata and slightly abuse it to insert data really quickly? If we have data we want to stream in a fast, concurrent fashion, Postgres is really good at that. You can stream data into the metadata store, and it can periodically flush that out to the columnar data files.
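A sketch of how inlining is surfaced; the option and function names below are how I recall them from the ducklake extension and may differ in current releases, so treat this as illustrative:

    -- Small inserts below the row limit are written into the catalog database itself...
    ATTACH 'ducklake:postgres:dbname=lake_catalog' AS lake (
        DATA_PATH 's3://my-bucket/lake/',
        DATA_INLINING_ROW_LIMIT 1000
    );
    INSERT INTO lake.events VALUES (1, now(), 'clickstream row');
    -- ...and flushed out to Parquet files later.
    CALL ducklake_flush_inlined_data('lake');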

3. System Interoperability

How do we achieve system interoperability? Data warehouse formats typically aren't as accessible outside their own engines as a lakehouse format is. DuckLake gives us that. The modern data stack was a great unbundling of all the data infrastructure; now that it's unbundled, how do we make sure all these different compute engines can read and write that data? We might have Spark jobs that need to write alongside our SQL queries.

4. Scale Anxiety

We wrote that blog post saying "Big Data is Dead" with a lot of intention. It is a bit of hyperbole, we have to admit. But people naturally say, "DuckDB, will it scale? It's this in-process analytics engine. How could it possibly scale to be my lakehouse?"

Part of that is the messaging around things like the blog post. The reality is we've actually built a new type of distributed compute at MotherDuck. We don't call it that—we want it to be simple. But MotherDuck does scale through hypertenancy: every user gets their own DuckDB instance. We have horizontal read scaling where we can scale out read operations to dozens of machines, and we have vertical scaling.

The idea with DuckLake is that it makes that a lot better. People don't need to just hear that it scales; they need to see that it scales. People see DuckLake and say, "Oh, it's built on an object store. Okay, it scales." That checkbox is complete.

DuckLake Performance

Speed is really where DuckLake shines. It still may not be quite as performant on small reads as a traditional data warehouse; it may not get you the tens-of-milliseconds reads a traditional warehouse can. But it is a duckload better than the performance of other open table formats.

At MotherDuck we continue to offer our standard data warehouse for sub-second, tens-of-milliseconds queries, which is very important for customer-facing analytics. That storage is based on the DuckDB format, but we've now augmented it to support DuckLake.

You can either bring your own bucket and directly access that bucket from MotherDuck as well as compute engines like Spark, or use a fully managed system where you can just say "CREATE DATABASE TYPE DUCKLAKE" and that's the end of it.
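For the fully managed path, that really is about the whole setup; the exact syntax below is how I understand the current beta, so check the MotherDuck docs:

    -- MotherDuck hosts both the catalog and the storage for this database.
    CREATE DATABASE my_lake (TYPE DUCKLAKE);
    CREATE TABLE my_lake.events AS SELECT 1 AS id, now() AS ts;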

What About Iceberg?

Some of you might wonder: what about Iceberg? Isn't that the de facto standard?

I would say that standards are meant to be challenged. We're here as innovators. Through competition, we make the industry better. DuckLake does build upon the innovations in these other lakehouse formats. I don't like the negative mentality around it. I like to think of it as evolving: learning from what's going on in the industry and re-examining and challenging the constraints we've seen.

Key Features

DuckLake has lots of features, but I want to call out multi-table ACID transactions in particular. Many table formats, whether traditional Parquet or lakehouse formats built on top of Parquet, think in units of a single table. How many of you really use just a single table at a time?

In school, it was all about denormalizing everything. Yes, you go through cycles where you normalize and then denormalize. But we do think in multiple tables, and we want to run transactions across multiple tables, for example updating a view column while updating a column in the underlying data. DuckLake really thinks of the database as a whole, not just table by table.
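A minimal sketch of what that enables, with hypothetical table names in an attached DuckLake catalog; either both changes commit together or neither does:

    BEGIN TRANSACTION;
    UPDATE lake.orders SET status = 'shipped' WHERE order_id = 42;
    INSERT INTO lake.order_events VALUES (42, 'shipped', now());
    COMMIT;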

Will DuckLake Overtake Iceberg?

I don't know. I do care that it's making things easier and that people can get up and running with DuckLake super quickly.

We also want to support data engineers where they're at. Iceberg support was greatly enhanced with write support in DuckDB 1.4, which was released two days ago. And we added the ability to both import data into and export data out of DuckLake in DuckLake 0.3, released yesterday. We're continuing to invest in Iceberg while also asking how we can make things better.

Q&A Highlights

On vertical vs. horizontal scaling: DuckDB was initially created because Hannes and Mark found it frustrating that their fellow academics were using Python or R code to analyze data instead of real databases. It started really small. DuckDB runs amazingly well on our MacBook Pros. As we do larger workloads in a warehouse environment, we've learned to scale up vertically with different instance sizes. But we've also learned that having an individual engine per person, per workload, or per company is useful for many use cases, to isolate compute and maintain consistent performance. You can scale both horizontally and vertically; it comes down to your use case.

On managed DuckLake: Right now, the fully managed DuckLake uses DuckDB as the transactional database. It's not as good as Postgres for that use case, so we'll likely manage Postgres instead as we move toward GA.

On partitioning: DuckDB itself doesn't do partitioning of data. That's actually one of the great advantages of using DuckLake: the table format supports partitioning.
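For example, partitioning in DuckLake is declared in the table metadata and applied to newly written files; the syntax below is as I recall it from the ducklake extension, with a hypothetical table:

    -- New data files for lake.events get laid out by year and month of ts.
    ALTER TABLE lake.events SET PARTITIONED BY (year(ts), month(ts));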

On query planning performance: Hannes and Mark have done performance testing of the query planning component, which is the most expensive part in other lakehouse formats. They can plan a petabyte-scale TPC-H query in under a second. You still have to fetch and analyze the data, but try doing the planning in another open table format; it takes a lot longer than a second.

On file format support: DuckLake stores Parquet files. DuckDB itself has Arrow support and interoperates with Arrow, but DuckLake specifies Parquet as its data file format.

On suitability for enterprise datasets: I would take issue with the idea that it's not designed for terabytes of data. We have at least two or three terabytes of data in our internal data warehouse and use it every single day with people across the company querying it. It depends on how you architect that data. Some things need to be done a little bit differently with DuckDB, but for that you get pretty awesome ergonomics.

Evolution is the important thing here with every technology. We started with DuckDB used just on a desktop by scientists analyzing a few megabytes of files, and now we're at the point where, yes, multi-terabyte data warehouses can be analyzed with DuckDB. We bill based on consumption, so we see the benefit every time a new DuckDB release comes out and things get more efficient. The awesome thing about DuckDB is that it has such an active community: so much development going on, so many new releases.

Thank you!
