DuckLake: Making BIG DATA feel small (Coalesce 2025)
2025/10/14
TL;DR: DuckLake is a new open lakehouse format that combines the simplicity of a database catalog with the scalability of open data formats, eliminating the "big data tax" and enabling a full lakehouse setup in just 5 minutes.
The Big Data Tax
Current cloud data warehouses were designed in 2012 when hardware was much weaker. Their distributed architecture comes with penalties:
- Latency: Small queries take longer than they should due to coordination overhead
- Cost: Network shuffling between nodes isn't free
- Complexity: Scheduling, planning, and routing across nodes adds operational burden
The key insight: "Big compute is dead" (though it's not as catchy as "big data is dead"). Most queries (P99) touch under 256GB of data—well within single-node capability.
DuckDB: Pushing Single-Node Performance
- In-process: Runs inside Python, Node, Go, Rust, and 15+ languages
- Lightweight: 20MB binary, zero dependencies, installs in seconds
- Fast: #1 on ClickBench, beating ClickHouse, Snowflake, Redshift, and BigQuery
DuckLake vs Iceberg Architecture
| Iceberg | DuckLake |
|---|---|
| Multiple metadata layers (manifests, metadata files, catalog) | Single transactional database holds all metadata |
| Metadata overhead grows with commits | Database scales efficiently |
| Complex setup | 5-minute setup |
| Requires Java ecosystem | Pure SQL, any language that wraps DuckDB |
Key insight: DuckLake uses the same architecture as Snowflake (FoundationDB) and BigQuery (Spanner)—a transactional database for metadata.
5-Minute Lakehouse Demo with dbt
The demo shows setting up a complete lakehouse using:
- A dbt profile configured to use DuckDB with the DuckLake extension
- Postgres as the metadata catalog backend
- Local or cloud storage for the actual data files
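As a rough sketch of this setup, attaching a DuckLake catalog from DuckDB looks roughly like the following. The connection string, database name, and bucket path are placeholders, not values from the talk; check the DuckLake extension docs for the exact syntax.

```sql
-- One-time setup: install the extensions
INSTALL ducklake;
INSTALL postgres;

-- Attach a DuckLake catalog whose metadata lives in Postgres and whose
-- data files live in object storage (both locations are illustrative)
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_lake
    (DATA_PATH 's3://my-bucket/lake-data/');

USE my_lake;
```

From here, ordinary `CREATE TABLE` and `INSERT` statements write parquet files to the data path and record metadata transactionally in Postgres.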
Maintenance operations include merging small files, expiring old snapshots, and cleaning up expired files—all callable through dbt run-operation.
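The maintenance functions exposed by the DuckLake extension look roughly like this; the catalog alias is illustrative and the exact function names and parameters should be checked against the current DuckLake docs.

```sql
-- Compact small parquet files left behind by frequent incremental loads
CALL ducklake_merge_adjacent_files('my_lake');

-- Mark snapshots older than the retention window as expired
CALL ducklake_expire_snapshots('my_lake', older_than => now() - INTERVAL 7 DAY);

-- Physically delete data files referenced only by expired snapshots
CALL ducklake_cleanup_old_files('my_lake', cleanup_all => true);
```

Wrapping these calls in a dbt macro makes them schedulable via `dbt run-operation`.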
Production Considerations
- Cloud compute: Serverless preferred for simplicity
- Large instances: Sometimes you need beefy compute for repartitioning or full scans
- Access control: Lock down your lakehouse
- Caching: Lakehouse files are immutable—perfect for caching
- Scheduled maintenance: Automate file compaction and snapshot expiration
MotherDuck: Ducklings of Unusual Size
| Size | Specs | Use Case |
|---|---|---|
| Standard | Various | Day-to-day queries |
| Mega | 64 cores, 256GB RAM | Heavy transformations |
| Giga | 192 cores, 1.5TB RAM | Most problems fit here |
Real-World Migration
A customer replaced a 5-server distributed cluster (largest AWS instances) running Iceberg with one serverless DuckLake on MotherDuck.
- Migration: Metadata-only (no data copying)
- Iceberg import: Supported for bringing in existing Iceberg data
- Iceberg export: Also supported for interoperability
Key Takeaways
- 10-100x data scale with existing SQL/dbt skills—no new stack or team required
- Instant import from Iceberg—leverage existing data investments
- Local dev parity: Same lakehouse runs on laptop and in production
- Future: Spark connector in development for multi-engine support
Transcript
0:00All right, thanks everybody for joining us today. Uh we're going to talk a little bit about making big data feel small.
0:08And to start this out, we're going to do a four-part history lesson that walks us into how we got to where we got to. Um the first thing we're going to talk about is the history of big data. Then we're gonna talk about how that brought us to kind of DuckDB and what it does.
Then we're gonna talk about where this takes us into DuckLake and data lakes.
And then lastly, what that means for you as a practitioner using uh dbt.
0:34Uh let's do some quick intros. Um I am
Jacob. Uh hello everybody. I work at MotherDuck in DevRel. Uh you may know me from the things that are on the slide, but uh for those of you that don't, I started my career actually in accounting. And um that means to me everything, a database is a ledger, right? Um a transaction is a journal entry. Um that's how I just think about
0:57things. Sorry, you have to deal with that if you're dealing with me. Um, but the through line here that led me into data is really a lot about uh accounting is uh adding numbers and counting them on a time series. And if that sounds a lot like data analysis and like data science, that's because it is. Um, so
that's kind of how I got to where I got to. Started working obviously on Excel and moving my way into things like SQL Server, and then now uh enjoying my time uh working on OLAP databases, which is DuckDB and MotherDuck.
1:27Sweet. Well, howdy. Hi, I'm Alex Monahan and uh my background is industrial systems engineering from Virginia Tech and then I just got bit by the data bug.
1:35So, I spent nine years at Intel breaking into data. Spent a couple of those years as a SQL server jockey writing store procs as one does. And then in 2021, I discovered DuckDB and I just became a huge fan. Tweeted a lot about it and started working part-time for DuckDB Labs. Uh first on documentation, later doing some blogging. And for the last
two years, I've been at MotherDuck. I'm a customer software engineer. So working with the folks using MotherDuck in prod.
So what can we help with today? What can MotherDuck help you with? And it's really to avoid the big data tax. So what is the big data tax? Well, when the current crop of cloud data warehouses were created back in 2012, hardware was really wimpy. It meant to get anything done of any size, you had to split one
2:18SQL query up to many, many, many compute nodes. And to do that comes with a ton of overhead. You have to schedule that, plan that, route that. You have to shuffle your data back and forth over the network multiple times. And all of that has penalties. Penalties in terms of latency. Small queries take much longer than they should. They come with
penalties in terms of cost. All those network hops are not free. And so we believe it's time for this architecture to change.
Diving a little bit deeper, there's kind of two levels to this. We coined the term at our company of big data is dead.
2:55Well, there's a couple pieces here. There's the big data part and that's fairly solved by object storage. You kind of store about as much data as you want. It's not super expensive. It's okay performance. That's not really the main problem. The main part that is really painful where the tax really comes from is big compute. If you have
to process data across multiple nodes, that's where you really feel and pay that tax. It just happens that "big compute is dead" is not as catchy as "big data is dead." But that's what we're talking about.
>> Yeah. So I want to make visual what Alex was just talking about, which is what we kind of notionally call the data singularity. Those of you who have seen talks maybe from Hannes at uh DuckDB Labs, we believe we are there at the point of that arrow, and that is the theory that our ability to generate
data has been eclipsed by our hardware's ability to handle that data. Right. And that kind of sounds ridiculous, but Alex, if you want to go to the next slide here. Um, we have a very timely tweet. Actually, this is I think from almost two years ago now from uh George Fraser. Hi, George, if you're out there. Um, uh, most of our queries are
not that big, right? Uh, I think the P99 is like 256 gigabytes.
They're not that big. Um, so this is where we are, and this is kind of the reality that DuckDB was built into. And you know, I think the question that we're asking at MotherDuck is maybe we don't need these MPP systems anymore.
And so that takes us to DuckDB. What is DuckDB? Well, DuckDB was designed to take a single node and to push it as far as possible in terms of performance. It is an in-process analytical database, which means it runs inside of another language, inside of another process. So in our case, it's running inside of Python right next to dbt. It is open
4:47source, so it's MIT licensed, so you can use it for anything you'd like. And it's really, really lightweight. Uh you can install it with just a really quick pip install. It'll take a couple seconds.
It's a 20 megabyte binary with no dependencies. And then in three lines of code, you're querying any CSV you can imagine. So one of the key things about DuckDB is that form factor. It's just so ergonomic, so easy to use to get started with. It's really lightweight, but it's also fast. When we say also fast, we mean for analytics, it is really fast.
If you look at ClickBench, one of the industry leading benchmarks for measuring analytical database performance, DuckDB is now number one.
So when we say we're fast, we mean it. Uh we beat ClickHouse on their own benchmark.
Um other notable databases on that benchmark include Snowflake, Redshift, and BigQuery, right? So we're right up there with all the big dogs to prove to you that you don't need the tax that they come with.
So what is DuckLake and how does that relate? DuckLake is an open table lakehouse specification. So it's similar to Iceberg or uh Delta Lake. It's open in that anyone can implement it. The first implementation is a DuckDB extension. So that's where we're starting. So let's look a little bit at the lineage comparing and contrasting data warehouses and data lakes. Kind of
talk about how we got here, and that'll help explain DuckLake. So initially, after we had traditional transactional databases, the first databases designed for analytics were created: data warehouses. They had a lot of scaling challenges and they also used proprietary data formats that kept your data locked up, and it meant they could charge quite a lot of money to do it.
Enter data lakes. Data lakes bring open formats like Parquet and ORC, and they use distributed file systems like S3 to uh have a very open approach to storing files. Um but they have challenges with concurrency. They have challenges with um consistency, and they frequently became data swamps. Data lakehouses were created to address this, where they have extra metadata files on that object
store to try and prevent corruption when there's concurrent operations going on, and that's borrowing those kind of ACID principles back from databases. Cloud data warehouses borrowed some back from data lakes: they separated storage and compute, and they use object storage uh similarly to store their data files. But a key um feature that we don't talk as
much about, a key innovation there, is that to manage the metadata they use a transactional database. So Snowflake uses FoundationDB, you know, BigQuery uses Spanner, and that allows them to handle the metadata in a database. So DuckLake is really combining the best of both. It's bringing an open lakehouse style format with Parquet and sitting on object storage, but it's using a
transactional database for the metadata to simplify it and improve performance. So really getting the best of both.
So there's four things that I want us to take away here, like what should you know about a lakehouse and what it does. And Alex hit on a lot of these, so I'm just going to kind of uh cover them a little bit briefly here. But number one, open data formats: you can kind of bring any compute to it. It's really
7:50important. Um, often partition for scalability, right? So, we can break these pieces of our data into much smaller chunks. So, then we're doing our scans. They're very much constrained.
7:59Um, we can do high ingestion really easily. Just drop a bunch of data in there and then there's some janitorial optimization jobs that are in the background uh to kind of make that easier to query later. And then the last thing is we have the acid consistency guarantees. And you know what this really means is that uh you can have
multiple transactions running at the same time, and if there's conflicts they will get rolled back and rejected. Um so
there's two paradigms we can kind of talk about here. Um the one on the left, well, my left, is that your left? Okay, perfect. The one on the left is Iceberg. Um you'll notice there's multiple layers happening in here, right? There's a data layer at the very bottom that's parquet files typically. Um and then we have manifests
and then we have metadata, and then we have our catalog on top. So we have all these complex parts, and as you add and commit more and more data, it gets slower and slower and there's more uh metadata overhead looking at what's in those files. Um on the right hand side we have what the same system looks
like with uh DuckLake, which is just a database holding metadata, right? And like Alex mentioned, if this looks familiar to you, it's because this is the same architecture that Snowflake and BigQuery implement as well. Um so you know, I think what we're trying to really get at here is this shows you how to get
from something that's very complex but valuable to something that's a simple distillation. >> You bet. And you'll notice those data files at the bottom are identical. So we're actually using the same data format as Iceberg.
9:33So it's one thing to talk about simplicity in the abstract, but how does that help me as I'm crunching data?
Well, first of all, it really helps with the local development experience. How many of you have run an Iceberg lakehouse entirely on your laptop?
9:47We have one person in the audience. You get a prize afterwards. >> Come find us.
>> Come find us. Come find us at the booth. Um, so that's one, right? In five minutes, we're going to set up a DuckLake warehouse, a lakehouse in five minutes. So truly an unmatched local experience.
10:03You can also use it for CI/CD and testing. So it's not just that first initial setup where you get a lot of benefit. It's every time you're doing development work. It's every time you're running a test or a CI/CD deployment.
10:14Not only that, the simplicity actually adds to performance. All those layers that Jacob was talking about, those are all round trips back and forth to object storage, which is not fast.
With DuckLake, it's one database request to a fast transactional database and then you go read your files. So, it's faster to read and write to DuckLake.
And because we have the same data format as Iceberg, you can use one function call to import directly into DuckLake and another function call to export directly back to Iceberg. So, it's incredibly easy to adopt for those extra simplicity and performance advantages.
There's lower maintenance. Uh if you had a million metadata files on object store, it'd be a little easier to have a million rows in a Postgres database instead. You know, slap a B-tree on there, good to go. We also can support multiple languages because we are a DuckDB extension in our
first, you know, implementation here. DuckDB is so portable it can go everywhere. Run it in Python, you can run it in Node, you can run it in Go, Rust, um and in Java. So it's not locked into the Java ecosystem. It's anything that can wrap DuckDB, and that's 15 or more languages.
>> Great. So I'm going to put us in context real quickly, and then we're going to jump into a demo. So this is what we're going to show you today. Uh we're going to generate some data using TPC data gen. We're going to then connect to that with uh DuckDB. We're actually going to ingest it. We're going to use dbt to
load our data with DuckDB, and we're going to transform it with dbt and use DuckLake kind of in the back end. So these are the components. This is it. This is all you need to build a data lakehouse. So shall we jump in? Okay, let's see if the
demo gods are kind to us today. All right, so the first thing I'm going to do, as I discussed, is we're going to generate some data. I am using this handy tpchgen. We're using the TPC-H data set, which is a benchmarking data set. I'm also using uv. How's that look for size for those of you that are in
the back? Okay, I see thumbs up. Okay, great. Let's run it. So you'll see what's happening here is we're just getting parquet files populated here on the left. Great. We just generated at scale factor 10. This is about 10 gigabytes of data in uh uncompressed CSV. It's a little bit less obviously in parquet. So we generated our data. Um
the next thing we're going to do is actually connect to our DuckLake using DuckDB. So I'm going to just open my terminal and I'm going to make this slightly bigger. Run DuckDB. So you can see I'm connected, and I'm going to attach my DuckLake. So this is what it looks like to connect to my DuckLake
in my uh DuckDB. And now I'm actually going to start my UI as well. So I'm going to do call start UI like this. And it should pop open a local server looking at our database here. So what I'll show you is on the left we have our databases. So this is our DuckDB running locally. We
have our DuckLake, which has nothing in it at the moment. We have our memory database, which is also empty. And then we have our DuckLake metadata database.
Now this is actually running Postgres. So when I connected, I connected to Postgres and I'm viewing it with DuckDB. So we can actually just look at our snapshot table, for example. If I hit command-enter, we can see it's empty, which is great. That means my setup worked. Um okay. So now we're going to
jump back here. And how do we get this data that's sitting in these data files into our DuckLake? Well, let's take a look at our configuration.
13:46The first thing I'm going to show you is this is what our dbt profile looks like.
Right? So we have our uh DuckDB type.
We've got our threads, right? We can run on four or more. Um we have our extensions. We have our secret. It's just a set of meta keys that we're going to use and pass to our Postgres database for authentication. Um, you can also use like environment variables in here. I'm obviously running it all locally for the sake of this demo, but
you can use any Postgres database here. Um, and then I'm attaching right here. So, um, a couple things that I'm defining here. The attach notion is kind of like a foreign data wrapper in Postgres or a linked server in SQL Server. Um, I'm defining a data path, and then I'm saying, hey, use these secrets up here to connect. So
that's the first thing I'm doing in my profile. The next thing I'm doing is I'm telling it what database to use. So when I open my profile this way, I'm using an in-memory DuckDB database as kind of my core compute engine, but I can tell it where to write. And so I'm telling it to write into the catalog here in my uh
config. This is probably pretty um common for those of you who have used uh dbt this way, but just calling it out.
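A dbt profile along the lines of the one described might look like the sketch below. All names, hosts, and credentials here are illustrative, not the ones from the demo, and the exact keys (`extensions`, `secrets`, `attach`) should be checked against the dbt-duckdb adapter docs.

```yaml
# profiles.yml -- illustrative sketch of a DuckLake-backed dbt profile
ducklake_demo:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ":memory:"        # in-memory DuckDB as the compute engine
      threads: 4
      extensions:
        - ducklake
        - postgres
      secrets:
        - type: postgres      # credentials for the metadata catalog
          host: localhost
          port: 5432
          database: ducklake_catalog
          username: "{{ env_var('PG_USER') }}"
          password: "{{ env_var('PG_PASSWORD') }}"
      attach:
        - path: "ducklake:postgres:dbname=ducklake_catalog"
          alias: my_lake
```

With the attach in place, models configured to write to the `my_lake` catalog land as parquet files in the DuckLake data path while the metadata commits go to Postgres.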
Last thing I'll show is in the models I have a sources file that just tells me, hey, here's the location of these files. And so now I can just reference them with dbt to load them into my uh lakehouse. So that's it um from that perspective. And so now we can just run our dbt project. And so
I'm going to do this, and then we're going to tab over here and we're just going to see that our data is now coming in. Right? So we've loaded all that data, and you can see here now the queries are populating. We go back here, we can see dbt ran. And so now, in 10 seconds, we've taken this data, this raw
parquet file, and loaded it into our uh data lake. So now I'm just going to rerun this just so we can look at it. Perfect. So we can see we have our snapshots. So there were 32 objects built. So we wrote 32 snapshots. Of course, I can query it just like a database. Because I'm using DuckDB SQL, I can do fun things like write
my SELECT after my FROM, uh which is a fun party trick I think. Um so there's that. And then of course we can interact with it just like we interact with anything else. So I can, for example, what, okay, thank you for the update, Cursor. We can do this where we use our DuckDB in the command line, it's at the very
bottom of the screen here, but um, to pass a query and just look at the raw parquet files, right? So this is all showing these interactions. Because we're using DuckLake, we can have multiple Python or multiple DuckDB processes running at once and they're not conflicting with each other, which is really cool. Um, so one
thing we love about, you know, uh, data lakes is we can of course rerun our jobs whenever we want. So let's run it again.
uv run dbt build. So when I run it again, one thing I'll show you is we'll notice that we now have duplicated parquet files for every file that we loaded into our data lake, right? Because the default materialization in this case is just a table, right? We're rewriting full tables. That's the normal dbt kind of
workflow. And when we do that, the delete is a metadata-only operation in DuckLake, right? So, we're just saying, hey, these files, just mark them as totally deleted and then create new ones, and we'll point to those instead. Right.
17:03>> And it gives you full time travel capability. That's the big benefit. >> That's right. Yeah. So, you get time travel kind of baked in. But, of course, what this also means is we've got a bunch of data in here that we kind of need to maintain a little bit. And so how do we do that in an easy way? Well,
I wrote a little macro which has a lot of Jinja and not very much other stuff.
But um there's three things we really care about in this macro, which I'm going to highlight here. We'll go through them. The first thing we have is this call for merging adjacent files, right? So what this does is it looks at the data that's been written in and puts it together. Um in this case,
we're not going to do anything because we didn't write any small files, but you have it there. We can also expire our snapshots. Those of you who used Iceberg, Iceberg works the same way. You expire your snapshots and then you do your delete operation second. In this case, I'm actually setting this to 1 minute. So, anything that's older than
17:51one minute, I just want to expire. Um, you can set that to whatever you want depending on your retention rules. Um, and then, uh, we're going to run the cleanup. And what the cleanup does is say, hey, once we've marked these for deletion, let's actually do the delete.
18:05And because we put this in a macro in dbt, we can simply invoke it like this.
uv run dbt run-operation maintain_ducklake, and we will see in our data files that now we just have one representation of each query and one file for most tables.
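A macro like the one described might be sketched as follows. The macro name, catalog alias, and retention window are illustrative, and the DuckLake function signatures should be verified against the current extension docs.

```sql
-- macros/maintain_ducklake.sql (illustrative sketch)
{% macro maintain_ducklake() %}
  {# 1. Merge small adjacent parquet files into larger ones #}
  {% do run_query("CALL ducklake_merge_adjacent_files('my_lake')") %}

  {# 2. Expire snapshots older than the retention window (1 minute for the demo) #}
  {% do run_query("CALL ducklake_expire_snapshots('my_lake', older_than => now() - INTERVAL 1 MINUTE)") %}

  {# 3. Physically delete files referenced only by expired snapshots #}
  {% do run_query("CALL ducklake_cleanup_old_files('my_lake', cleanup_all => true)") %}
{% endmacro %}
```

Invoked from the command line with `dbt run-operation maintain_ducklake`, which can be put on a schedule in production.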
The line item table is actually partitioned, so it's split up in files, but everything else we just see one. So um I think that's it for the demo. Anything else you want to call out, Alex? I mean, five minutes and we've got a lakehouse on your laptop. Easy peasy.
18:42Perfect.
18:48Sweet. So, that was Hello World. You guys could do Hello World uh anytime pretty quick. Um we also wanted to talk about production. What does it look like to actually run this in the real world?
So we've been working with a customer where they were running a five server cluster with five of the biggest machines you can rent on AWS, running a distributed compute engine with Iceberg as the storage layer. And they were able to replace it with one serverless duckling running on MotherDuck and DuckLake. So really dramatic scalability advantages. And to do this, it was a
metadata-only migration. They didn't have to recopy all their data. So all the time that gets invested in centralizing data on Iceberg, you still get to benefit from all that, and get the simplicity and speed and performance advantages of moving to DuckLake instead.
19:35What are some other things that I need to run this in production? The first thing that I'm going to need is I'm probably going to want to run some compute in the cloud. You could run this all on your laptop and object store, but you're going to have to download quite a lot to your laptop every time you want to do something. So,
19:48you're probably going to want to run some cloud compute. If it's serverless, it's going to make your life easier. You don't have to maintain clusters.
19:55Every once in a while though, you're probably going to want a really big compute instance if you need to repartition your data or resort your data or scan the whole data set and aggregate it for a data science workflow. Sometimes you really do need a lot of beefy compute. So you don't want to run this on like a lambda with 10
20:10gigs of RAM. Sometimes you do really need a lot of compute in the cloud.
20:15You're also going to want to lock this down, of course. Oh, and we'll get to that next. So you'll want to lock it down with access control, but you're also going to want to do some caching for better performance. So while S3 is bottomless and has a lot of throughput, it has a pretty high latency. So if you
20:29can cache things to avoid going to S3 every time, you get a huge benefit. And since lakehouses are immutable, once you write a file, it never changes. It's perfect for caching. And when you do that caching, you probably want to cache it across all your users in the cloud.
Lastly, you have that scheduled maintenance. Again, you probably don't want to cron on your laptop. You probably want to cron in the cloud. So if you're looking to do that, MotherDuck does wrap this up in a convenient managed service. We can manage the full storage layer for you if you'd like, or you can actually bring your own bucket and manage the
storage and see all those files and retain ownership of your data, and use us as the catalog and the compute engine.
21:10So we talked about large instances. How can we help with that? We like to call them ducklings of unusual size. So we have our mega. This is our Godzilla duck, as you'll see. Um, that's 64 cores and 256 gigs of RAM. That's a lot of horsepower to throw at a problem. If you have a really tough problem, well, we've
got our planet sized duckling here, the Giga, which has the cutest icon we've got. And that is 192 cores and a terabyte and a half of RAM, not to mention all the SSD storage that's attached.
21:42There are a lot of problems that fit in a terabyte and a half of RAM, right? So, you don't have to fall back to a distributed system. If you need it, it's there.
So, in conclusion, the combination of DuckLake and MotherDuck really makes your big data feel small. And we do that by DuckLake being the simplest lakehouse by far. And if you want it even simpler, the MotherDuck hosted version will take care of some of those cloud oriented pieces as well.
22:10We also made a lakehouse in 5 minutes. It's an unmatched local development experience. And when you want to go to production, you can have a serverless production environment that is similarly easy. And lastly, there are some lakehouse tasks where you do want a lot of horsepower. And for that, we have our ducklings of unusual size because a
terabyte and a half of RAM will get a lot done. So how does that help me as a practitioner crunching through numbers, doing analysis? It means that with my existing skill set of dbt and SQL, I can operate with 10 to 100 times the data size, and I don't have to stand up a whole other stack. I don't have to learn
a whole other language or have a whole other team maintain a platform on my behalf. I can just do it with DuckLake, and you can do it today with that instant import from Iceberg.
22:56So we want to say thank you so much for your time and attention and we look forward to your questions
23:07and we do have a couple of minutes. Do we have a microphone for the questions?
I'll repeat them. Go ahead. >> Sure. First of all, cool shoes. >> Thank you.
>> My question is what exactly is inside of the DuckLake extension? >> The question is what is inside of the DuckLake extension? Um the DuckLake extension contains the logic for how to write the SQL statements to run against the catalog. Uh so it's mostly just a collection of functions to communicate with your catalog and then uh write to
parquet files. The parquet writer is another extension. The DuckDB compute engine is, you know, built into DuckDB, and then we have a Postgres extension if that's your catalog. But it's pretty self-contained, and that's by design, so it's easier to implement with other systems as well.
>> That's a great question. >> Yep. >> Next question. I was just curious.
>> Oh, sure. >> The question was we had a magical line where we attached a DuckLake. Is there anything behind the curtain? What did we do before we did that step? And that's a very good question. So um uh the answer to that is very simple: I just created an empty Postgres database, and once that
database exists, I can use that attach command and it will do everything else for me. So there's one thing in advance, which is create a Postgres database. That's all you need to do.
>> Yeah >> we do have the option of using a DuckDB database or a SQLite database if you want to get it up and running on your local machine, but we expect Postgres for production.
24:40>> Let me just see if they have the getting started here somewhere. Where did that go?
Okay, there it is. I think it's right here. So, this is what it looks like for DuckDB here. We're using Postgres, so it's a little bit more work. Um, yeah, it looks like this.
24:56>> Sweet. Do you want to bump up the size? Zoom in. >> Oh, sorry.
>> Sweet. >> Other questions? >> Yes. >> What do we charge for a Giga duck? Uh, that one you talk to us about and we'll hook you up. Um uh we do have our pricing disclosed on our page up to the Mega. The Giga is the only one we ask you to talk to us about. Um we
25:20still think it's very price competitive. Um but talk to us. We're at the booth.
Other questions? >> Yes. >> Curious about BI tool connectivity. >> Yes, the question is about BI tool connectivity. That's a great one. Um to connect to MotherDuck and DuckLake, um we use the DuckDB driver. So if it works with DuckDB, it works with us. That includes a wide variety of legacy tools. Sorry to say legacy, like
Power BI and Tableau. I guess we can call them legacy. Can we agree? Okay.
All right. Um and then you know some of the other ones as well, like you know Omni and Superset and Metabase. So there's open source options, there's commercial options, quite a few. Um but if you have one in mind, come see us at the booth. We'll hook you up.
26:04>> Great question. Next question. >> Yes.
So the question was around the migration from Iceberg over. What did that look like behind the scenes?
>> Great question. So if you already have things in Iceberg format, there's a function called add data files, and you call that and you basically just tell your metadata database about those files. You say, hey, here are the files that I want you to consider. Um, and you can just run that in a loop or run it
with a long list of files, and you just tell the catalog about it, and then we're in business. So the files live where they are, you know, in your own object storage bucket, and then you just tell the catalog about it.
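The function mentioned in the answer looks roughly like this in the DuckLake extension; the catalog alias, table name, and file path below are placeholders, and the exact signature should be checked against the current docs.

```sql
-- Register an existing parquet file with the DuckLake catalog without
-- copying any data: a metadata-only operation.
CALL ducklake_add_data_files(
    'my_lake',                                   -- attached DuckLake catalog
    'lineitem',                                  -- target table in the catalog
    's3://my-bucket/iceberg/data/part-00001.parquet'
);
```

Running this over the list of files from an existing Iceberg table is what makes the migration metadata-only: the parquet files stay in place in your bucket.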
26:50>> Yep. Good question. Thank you. >> Yes.
>> It's an excellent question. And the question was that part of the value prop of Iceberg is being able to use multiple compute engines. That's absolutely true.
Uh we addressed part of that by DuckDB already being able to run everywhere. So for example, Trino has a DuckDB connector. You can use that to access DuckDB directly inside of Trino. Uh we are working on a Spark connector as well to natively integrate so that Spark can also write to DuckLake. Um there's about a 30 line
script where you can uh wrap DuckDB to do the same operation, even in parallel,
uh today with Spark, because DuckDB is so uh modular. But we absolutely want that to be the case, and it is an open spec and it is um mostly SQL based. So we want to expand out to more transactional databases as well for the catalog, and then also more engines. But thank you.
>> Yes.
28:17It's a great question. So the question was: is there a workflow we can envision where DuckLake is the local development workflow, possibly with Iceberg as the production deployment? >> I can answer that. I think the answer is potentially.
28:33I don't think we have a reference implementation that matches that today, but I don't see anything that would hard-block it at the moment.
28:46I'm definitely happy to talk more about that at the booth if you want to stop by. It's very interesting. I do think that's something we're very interested in: do all the really frequent transactional work with DuckLake and then publish to Iceberg for consumption by other engines. >> Exactly, some way to do that.
29:02>> We do see such an advantage on the write side for DuckLake that we think we can publish and still keep the benefits of the Iceberg interop, but not have to pay the penalty every time you write to it.
29:11>> Yep. >> Thank you for the question. We might have time for one more lucky winner.
29:19All right. Yes. So just curious, what are the benefits of using MotherDuck instead of just using DuckDB? What are the things you're providing beyond the core engine?
29:34>> Sure. So we are a fully hosted platform, so we're serverless compute. Compared with local... oh sorry, the question was how would you compare and contrast DuckDB and MotherDuck. So MotherDuck is the cloud data warehouse with DuckDB as the engine. We love DuckDB; it's what powers the heart of MotherDuck.
29:51We have a lot of things around it, like access control, user management, the ability to spin up multiple instances horizontally, and the ability to have really large instances ready on demand. We have completely replaced the storage layer, and we have a caching layer on top. And then we also have a managed DuckLake service as well.
30:10I think of it sort of like single player to multiplayer. >> Yep. All right, I think we're done. Thanks everybody. We really appreciate it.
30:18Thank you for your questions. Cheers.
FAQs
What is DuckLake and how does it compare to Apache Iceberg?
DuckLake is an open table lakehouse specification that stores metadata in a transactional database (like PostgreSQL or SQLite) rather than as files on object storage. Like Iceberg, it uses Parquet files for data storage on object storage, but replaces Iceberg's multi-layered metadata files (manifests, manifest lists, metadata JSON) with simple database rows. This makes reads and writes faster (one database request vs. multiple S3 round-trips), simplifies maintenance, and gives you an excellent local development experience. Learn more about DuckLake.
How do you set up a DuckLake lakehouse with dbt?
Setting up a DuckLake lakehouse with dbt takes about five minutes. In your dbt profile, configure DuckDB as the engine type, specify extensions to load (including ducklake), and use the ATTACH command to connect to your catalog database (PostgreSQL for production, SQLite or DuckDB for local development). Define your data source paths in a sources.yml file, then run dbt build. In the demo shown, 10 GB of TPC-H data was loaded into DuckLake in about 10 seconds.
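As a sketch of what the dbt-duckdb adapter does under the hood, the connection boils down to SQL along these lines. The catalog connection string, alias, and data paths here are placeholders, so check the DuckLake documentation for the exact `ATTACH` options in your version:

```sql
-- Load the DuckLake extension inside DuckDB.
INSTALL ducklake;
LOAD ducklake;

-- Production: PostgreSQL holds the metadata catalog,
-- Parquet data files live in object storage.
ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.example.com'
    AS lake (DATA_PATH 's3://my-bucket/lake/');

-- Local development alternative: a single-file catalog
-- and a local data directory.
-- ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/');

USE lake;
```

In dbt itself, the same connection is expressed declaratively in your profile via the adapter's extension and attach settings rather than written as raw SQL.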
Can you migrate from Iceberg to DuckLake without copying data?
Yes, DuckLake supports metadata-only migration from Iceberg. Since both formats use identical Parquet data files on object storage, you can use DuckLake's add_data_files function to tell the metadata catalog about your existing files. No data copying is required. The Parquet files stay exactly where they are in your object storage bucket. One production customer replaced a five-server distributed Iceberg cluster with a single serverless MotherDuck duckling running DuckLake using this approach.
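A minimal sketch of such a metadata-only migration, assuming a DuckLake catalog attached as `lake` and an existing `orders` table; the exact name and signature of the add_data_files call may differ between DuckLake versions, so treat this as illustrative:

```sql
-- Register a Parquet file that already exists in object storage with
-- the DuckLake catalog; no data is copied or rewritten.
CALL ducklake_add_data_files('lake', 'orders',
    's3://my-bucket/warehouse/orders/data/part-00001.parquet');
```

As described in the Q&A, this can be run in a loop or with a long list of files to register an entire Iceberg table's worth of Parquet files in place.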
What maintenance tasks does a DuckLake lakehouse require?
DuckLake requires three maintenance operations: (1) merging adjacent files, combining small files written during incremental loads into larger, more efficient ones; (2) expiring snapshots, marking old snapshots for deletion based on your retention policy; and (3) cleanup, actually deleting expired data files from object storage. In the dbt demo, these were packaged into a single macro that can be invoked with dbt run-operation maintain_ducklake, making scheduled maintenance straightforward.
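Inside such a macro, the three operations map onto DuckLake catalog calls roughly as follows. The function names and parameters below follow recent DuckLake releases but may vary, so verify them against the documentation (the catalog is assumed to be attached as `lake`):

```sql
-- 1. Compact small incremental-load files into larger, more efficient ones.
CALL ducklake_merge_adjacent_files('lake');

-- 2. Mark snapshots older than the retention window as expired.
CALL ducklake_expire_snapshots('lake', older_than => now() - INTERVAL 7 DAY);

-- 3. Physically delete data files no live snapshot references.
CALL ducklake_cleanup_old_files('lake', cleanup_all => true);
```

With these wrapped in a dbt macro, `dbt run-operation maintain_ducklake` can then be scheduled like any other job.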
What instance sizes does MotherDuck offer for large DuckLake workloads?
MotherDuck offers instance sizes up to "Giga", which is 192 CPU cores and 1.5 terabytes of RAM with attached SSD storage. This handles demanding tasks like repartitioning data, full dataset aggregations, or data science workflows. The key insight is that single-node compute has outpaced data growth: 83% of cloud data warehouse users query datasets under 1 TB, and 94% under 10 TB. MotherDuck's serverless model means you only pay for the seconds you use these large instances.