What you'll learn
Matt Martin and Alex Monahan — coauthors of O'Reilly's DuckLake: The Definitive Guide — sit down for a live conversation covering what DuckLake is, why it exists, and who should care. The session includes the release of Chapter 1 to all registered attendees.
What is DuckLake?
DuckLake is an open lakehouse table format (MIT licensed) that stores metadata in a SQL database instead of scattered files on object storage. The data files are standard Parquet, similar to Iceberg, but the catalog and metadata layer is replaced by a single database — DuckDB, SQLite, or Postgres. Matt describes getting a working DuckLake connected to S3 in two lines of code, compared to the extensive configuration typically required with Iceberg.
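The two-line setup Matt describes looks roughly like this with the DuckDB Python package (a sketch, not a transcript quote: the catalog filename and S3 bucket are placeholders, and a real cloud setup additionally needs credentials configured):

```python
import duckdb  # assumes the duckdb package; INSTALL fetches the ducklake extension

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# One ATTACH names the catalog database (a local DuckDB file here) and the
# storage location for the Parquet data files (placeholder S3 bucket).
con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/')")
con.sql("USE my_lake")
```

From there, ordinary `CREATE TABLE` and `INSERT` statements against `my_lake` write Parquet to the data path and record metadata in the catalog database.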
How DuckLake compares to Iceberg and Delta Lake
The authors walk through an honest comparison. Iceberg pioneered schema evolution, time travel, and Parquet storage — all things DuckLake builds on. The key difference is the metadata layer: Iceberg uses thousands of small JSON and Avro files for metadata tracking, while DuckLake puts that in a SQL database. This makes catalog operations 10–100x faster and eliminates the compaction jobs and file-management overhead that data engineers deal with today.
The book and what it covers
The guide is published through O'Reilly with the same editorial standards as their definitive guides to Kafka and Spark. It covers DuckLake's architecture, honest comparisons with Iceberg and Delta Lake, migration strategies, and practical getting-started guidance. The authors emphasize it is not a vendor pitch — it covers DuckLake's limitations alongside its strengths.
Live Q&A highlights
The audience asked about multi-engine support (Spark and DataFusion implementations exist), Iceberg interop (metadata-only copy to migrate), governance and access control patterns, and how DuckLake handles concurrent writes. Matt and Alex fielded questions throughout the session.
Transcript
1:03Hello, everybody, and welcome.
1:05Great to have you guys here today.
1:08You bet.
1:08Well, hi, I'm Alex Monahan.
1:09Here is Matt Martin with me.
1:12Hey, guys.
1:13How y'all doing?
1:14Good, good.
1:15Well, we're excited to have you guys today to talk about Duck Lake, the definitive guide
1:20that Matt and I have been working on together.
1:22So I'm really excited.
1:24I think a key part of today, part of why we're doing it live, is for your questions.
1:29So as we're talking, as we're chatting, please pop them in the chat.
1:32Mehdi is also asking where everyone's dialing in from.
1:36So I'd love to hear where you guys are as you are all around the world watching this.
1:40But keep those questions coming.
1:42We'll go through them as they come in, as we can pop them in, and we'll definitely save
1:46a bunch of time at the end for your questions, too.
1:48And also stay tuned in about 10 minutes or so.
1:52We got an extra bonus surprise for you.
1:54So stay tuned for that.
1:56All right.
1:57We've got to kick things off.
1:59Matt, tell us a bit about yourself and how you got into data.
2:02Sure.
2:03So for those of you who don't know, my name is Matt Martin.
2:06I've been a data engineer for almost two decades now.
2:10Started off probably, and a lot of you can resonate with this, I started off with Microsoft Excel
2:16and VBA and a SQL Server running under somebody's desk at my office.
2:22Yep, that actually did happen.
2:25But really just kind of latched on to data engineering over my career.
2:29I really liked taking complicated processes and getting them down to quick push-button solutions.
2:35And so I spent a lot of time at the beginning of my career at Home Depot, about 10 years there.
2:43And then I joined State Farm back in 2020 and been working on cloud, AWS, data engineering projects ever since.
2:51And it's been fun.
2:52So that's a little about me.
2:54Alex, what about you?
2:56You bet.
2:56Same gateway drug into data.
2:59Excel VBA.
2:59I showed up as an intern at GE in Cincinnati, and they said, hey, the last intern built this spreadsheet in this VBA thing.
3:06You need to fix it.
3:07I said, cool.
3:08What's VBA?
3:09So I had no idea.
3:11And at that point, I realized, oh, wow, I can make the machine do what I want.
3:14And the first time I wrote the script, it took like two hours to run.
3:17I actually told the IT guy, we need to buy a more powerful computer.
3:21He's like, dude, it should not take two hours to run.
3:24Go figure out how to do this right.
3:25And then when I got it to work right, it took 11 seconds.
3:28Yeah, that sounds familiar.
3:31Similar path.
3:33I finally figured out how to do it correctly.
3:35Picked up SQL at another internship 10 years ago and been doing that for 10 years and did Python and JavaScript as well.
3:43I was at Intel for nine years in the supply chain, and we were basically a shadow IT.
3:48You guys have heard the term.
3:49It's sort of where the business side builds their own IT stuff.
3:52So built our own internal BI tool, built a data virtualization platform, kind of like off-brand Trino for Windows.
4:01Quite a lot of fun.
4:02And a key ingredient to that was DuckDB.
4:04And that's how I got into the Duck world, the Duck pond, the lake house, as we're talking about today.
4:09Wow.
4:10Yeah.
4:11So as you guys can probably pick up, Alex and I kind of took somewhat similar parallel paths because I was also guilty as charged part of a shadow IT org at Home Depot.
4:22State Farm, I'm not in shadow IT.
4:24I'm in real IT this time.
4:27But yeah, you know, I picked up Python.
4:29I think it was in 2016 when I started using Python, when we started replatforming from our on-prem Teradata warehouse to Google BigQuery.
4:40And that was the first time I ever had to figure out how Python works and had to figure out how to stream data into a BigQuery table.
4:50That was fun stuff.
4:51So, yep.
4:53That's awesome.
4:53Well, I'll also pop in this comment from Jason Wirth.
4:57Always blame the intern.
4:59I agree.
5:01I've been on that side.
5:02So I get that.
5:03So I appreciate that comment.
5:05Well, awesome.
5:06Well, maybe one more intro question before we jump into the Duck Lake.
5:10Sure.
5:11So what's your favorite data engineering war story, the gnarliest pipeline you've had to wrangle, or the data set that you remember having to really fight with?
5:22Yeah.
5:23Yeah.
5:23So there's a couple.
5:24Well, so one of them that I know I spent an extensive amount of time on, but granted, this was also my ramp-up in understanding how to parse JSON correctly.
5:36It was the UPS tracking API.
5:39Because I had to set up a job to ingest that into Google BigQuery every hour and ping tracking numbers to figure out where they were.
5:49And this was the first time I ever had to understand that, oh, some JSON payloads come back with all these different attributes, but then some of them don't because they simply don't exist on a package.
6:01And having to deal with, you know, check if this key actually exists.
6:07And if this key exists, check if there's actually a value there and it's not null or an empty string or some weird voodoo hidden character.
6:15So that one was pretty gnarly because, as you can imagine, especially with UPS, FedEx, all of them, customers can put in the comments free form text, whatever they want.
6:26And some people, I think, out of pure spite would stick a hidden character like that backslash bell or something.
6:34Just so, yeah, there was that was fun.
6:37I had to do a lot of sanitizing of the data on that and flattening, but it was a great learning experience.
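The defensive parsing Matt is describing (check that the key exists, then check that the value is not null, empty, or padded with hidden control characters) can be sketched in a few lines of Python. The payload fields here are made up for illustration, not the actual UPS API schema:

```python
import json

def clean_text(payload: dict, key: str):
    """Return a sanitized string for `key`, or None if absent, null, or empty."""
    if key not in payload:          # some packages simply lack the attribute
        return None
    value = payload[key]
    if value is None:
        return None
    if not isinstance(value, str):
        value = str(value)
    # Strip hidden control characters (e.g. a stray backslash-bell) and whitespace.
    value = "".join(ch for ch in value if ch.isprintable()).strip()
    return value or None

raw = json.loads('{"comments": "left at door\\u0007  ", "city": null}')
assert clean_text(raw, "comments") == "left at door"   # BEL character stripped
assert clean_text(raw, "city") is None                 # null value
assert clean_text(raw, "weight") is None               # key doesn't exist
```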
6:42How about you, Alex?
6:44Yeah, I think for me, it's mostly that I had a very daisy-chained, Rube Goldberg machine.
6:49So coming off some of the factory equipment at Intel, we sent information through the Elastic Beats kind of system, through Kafka, and to Logstash.
6:58And then it ended up in Elasticsearch.
7:01And then I pulled it out of Elasticsearch with their SQL dialect, which, you know, it is technically a SQL dialect, but it was pretty rough.
7:08I then converted it into Pandas.
7:10Then I converted it into SQLite.
7:13Is that the animal that roams around the zoo that climbs trees and eats bamboo?
7:17Is that what you're talking about when you say panda?
7:20Python pandas.
7:21Yes, indeed.
7:23Yes, a lot of fun there.
7:25Then it went into SQLite to take it from a crappy semiconductor industry format into actual XML, which is still a crappy format.
7:34And then I converted it to JSON, piped it into DuckDB.
7:37Then I sent it to the browser with Apache Arrow and back into DuckDB in the browser and visualized it.
7:44So, you know, throwing everything at the wall for that one.
7:48So, yeah.
7:49I mean, pro tip for everybody here.
7:50XML still runs a lot of, at least last time I checked, the logistics world, especially trucking companies.
7:58They're still sending stuff either over that or as the old school EDI 214 bitmaps.
8:05It's a lot of them have finally started to modernize to JSON.
8:08It only took them a couple decades.
8:10But, yeah, XML is still very prevalent in the world.
8:15Modernized to JSON.
8:16You don't hear that every day.
8:17I know.
8:17Yeah, you're right.
8:20XML is also great for agents as well because it's a little more structured, harder for them to break.
8:24So there's still a purpose for that.
8:28Didn't they invent some new spec recently for agents?
8:36I forget.
8:37There were a lot of memes on it because people said, this looks like a CSV.
8:39I forget what it was called.
8:41Oh, there's like new line separated values?
8:43Yeah, I'm not sure.
8:44But, yeah.
8:44There's some new ones.
8:45I digress.
8:46Well, Steve Dodson says that XML is dead.
8:49Long live XML.
8:51Yeah, yeah.
8:51We're with you there.
8:53Absolutely.
8:54Pretty much.
8:55You bet.
8:56So with that, after having some fun down memory lane there, I figured we could talk a little bit about Duck Lake.
9:03So, Matt, if you were going to talk about Duck Lake, how would you describe it to folks?
9:10Yeah.
9:10So if I were to distill Duck Lake into its simplistic terms, I would say that I am, and I'm not trying to toot my own horn or anything or brag, but I would say I'm very advanced and seasoned with Spark and Apache Iceberg at this point.
9:29I'm relatively seasoned with Delta.
9:31I get that there are some fundamental differences on how they record their transaction logs versus Apache Iceberg.
9:36But I can tell you, with Iceberg, as much as, and this isn't an Iceberg bashing session, this is real world emotional battle scars.
9:45As much as organizations want to say, oh, yeah, Iceberg's the way and it's the easiest thing since sliced bread, it's not.
9:56There's still a lot of configurations that you have to do simply just to get it to do things like talk to a Google Cloud Storage or even an AWS S3 bucket.
10:07The first time I tried Duck Lake after it was literally announced, and I don't even know if that was an alpha or a beta version.
10:15The first time I tried it, I was connected.
10:18I had a Duck Lake up and running and connected to AWS S3 in just two lines of code.
10:24I'm not kidding.
10:25It was two lines of code, and I was like, wow, are my eyes playing tricks on me, or is this the way that God intended it to be for data lakes or for lake houses for data engineers?
10:40Because what I'm getting at, folks, is the cognitive load that these lake house architectures put on data engineers today is pretty high, in my opinion.
10:50It's higher than I feel it needs to be.
10:52I know that there's been some work lately in the iceberg camp to where they're looking at ways to simplify their metadata files, process tree, and all that stuff.
11:00But literally, the first time I built the Duck Lake and I tried it, I was like, this is so brain dead easy, and this is the way it should be.
11:08And it helps you kind of flip the switch on where your time is spent.
11:13You can either spend 80% of your time configuring a warehouse and only 20% actually providing the business value to your consumers,
11:20or you can flip that and say, I'm only going to spend 20% of my time configuring the warehouse,
11:25and I'm going to give the consumers and myself 80% of that time to actually drive business value.
11:30So I know that was a mouthful right there, folks, but that's kind of like my take on where and why I think Duck Lake is an amazing leap forward in the stack for data engineers.
11:45So with that, Alex, what about you?
11:49I'm going to cheat and pull up a diagram.
11:52I think for me, these are the lines of code it takes that Matt was talking about.
11:58You install Duck Lake, you attach a Duck Lake, and then you use it.
12:02And it's a slight difference if you want to use object storage.
12:06It's a couple of parameters you pass to that one line as well.
12:09But it's really that simple.
12:12So what is Duck Lake?
12:15We had a really great question come in.
12:18Is Duck Lake a format like Iceberg or both a format and an implementation?
12:23It's an excellent question.
12:25So Duck Lake is a table lake house specification, and it is fully open source.
12:30So it's MIT licensed.
12:31You can use it in any way you could possibly imagine.
12:34It is a spec.
12:36So it can be implemented in any engine.
12:39The first implementation is in DuckDB.
12:42Part of that was the creators of Duck Lake at DuckDB Labs.
12:45Hannes Mühleisen, Mark Raasveldt, Pedro Holanda, and others really wanted to make sure that it wasn't just a spec,
12:53that it was a spec that had been battle tested by having to go implement the spec.
12:58And so there is an initial implementation in DuckDB as a DuckDB extension.
13:02There are also implementations in Spark, kind of in alpha mode, and then also DataFusion.
13:09Apache DataFusion is working on a read implementation, and we're open to others.
13:13So it's very, very open, really a very, very wide tent.
13:16Anyone who wants to use Duck Lake, happy to collaborate there.
13:20Yeah, and if you don't believe that it's open, just go to the Duck Lake org specification site.
13:24The schema, all of it is totally published there.
13:27You can see the whole thing.
13:28And they're not trying to hide a single thing on this one.
13:32You bet.
13:34And I think Matt talked about the feeling of Duck Lake and why we're excited about it and just the fact that it focuses on simplicity.
13:40But kind of nuts and bolts, really, it's an open lake house format.
13:44It uses parquet files that are in a very similar structure to Iceberg, but it changes how it handles metadata and the catalog.
13:53And it puts both of those in a SQL database.
13:55And SQL databases are something we've been working with for a long time.
14:00And the challenge with that is we can't use a SQL database for everything.
14:03Sometimes it's sort of the industry perspective is, oh, no, it won't scale as large as we need.
14:09Well, that's kind of what Duck Lake is designed to address is use parquet storage on object storage for bottomless storage, infinite scalability,
14:18and then use the SQL database to really simplify and speed up the very transactional work of managing a catalog and tracking where your files are,
14:27but have that work be much smaller so the SQL database can handle it.
14:31And so that's really the Goldilocks approach of Duck Lake.
14:35I will probably flip to one other diagram just because I think that also helps me understand a little bit.
14:41This is a bit of a contrast from Iceberg.
14:42It is worth saying we are fans of Iceberg at Mother Duck and in general.
14:48It's really pioneered a ton of incredibly innovative things.
14:52And we continue to use the Iceberg storage format because they got a ton of things right.
14:58You know, schema evolution, time travel, parquet as the storage format were all excellent choices,
15:04and it continues to be a solid option.
15:06I think this diagram is really just designed to show a little bit of the focus area difference where Duck Lake replaces the metadata layer and catalog layer of Iceberg with a single database.
15:19And that database can be DuckDB or SQLite or Postgres today and expanding to more as we go.
15:27And the data files are the same, so you can even import from Iceberg with a metadata-only copy.
15:32So if you already have Iceberg, you can already copy your data over to Duck Lake very easily.
15:39So also, I did promise a quick surprise as well.
15:43So if you guys check your email inboxes, the ones you use to register for this event,
15:48you'll find Chapter 1 of Duck Lake, the Definitive Guide, in there.
15:52So go ahead and hop over to the inbox, check it out, and you'll already see the first chapter in there.
15:58So when you guys are-
16:00And to level-set expectations, Alex and I reserve the right to change the contents of Chapter 1 up to 50 more times.
16:06But no, we feel pretty good about the introduction on that, folks.
16:10Definitely spent a lot of time on that first chapter to kind of give a frame of reference of what this book will be looking at
16:17and what it's not going to be looking at.
16:19One thing we want to make very clear is like, look, we're not-
16:21This isn't a hit piece on Iceberg or Delta.
16:23This is simply showing you an alternative method that we feel has a very good place in the data stack for data engineers
16:32and organizations that are looking to adopt something of a little different shift to address their concerns on their data.
16:43You bet. Absolutely.
16:45Well, thank you, Matt.
16:46I probably have one more overview slide that helps frame things,
16:50and then we can dive into a bit about what we're really excited about.
16:53So for me, Duck Lake really allows you to scale three pieces of your Data Lakehouse platform totally independently.
17:04And these are the same pieces that any Data Lakehouse is going to be built out of.
17:08You've got your storage format.
17:09That is Apache Parquet files.
17:13You've got your metadata storage, which in this case for Duck Lake, it lives in a SQL database.
17:20And then you have your compute engine.
17:23And your compute engine is what's actually running and executing the queries, pulling data from those Parquet files.
17:28And a unique thing about Duck Lake is that it's very flexible about which pieces you pick.
17:34So this diagram is set up.
17:35It looks kind of like a slot machine where, you know, you pull the lever and you see the various reels spinning.
17:39And you can kind of pick in each dial, each of those three dials, what you'd like to use.
17:45And you've got a lot of choices.
17:47Some of those choices run on your laptop.
17:49So you can test Duck Lake on your laptop, no cloud connection, no cloud storage or anything, using just your laptop's hard drive.
17:58You can also use an on-premise shared drive.
18:02It doesn't have to be an S3 style file system.
18:04It can be any POSIX file system as well.
18:07So any standard place you can save files.
18:11And the metadata, if you're working locally, you can use DuckDB or a SQLite file to just have your full catalog in one file.
18:17If you use SQLite, you can do it across multiple processes on your machine.
18:21So that's really nice.
18:21And then for the compute, while you're testing locally, while you're doing your development, you're just using your laptop, which you've already paid for.
18:28And it's gotten pretty fast these days, especially the new Macs.
18:32They're quite speedy.
18:34And then when you want to push to production, if you want to go to cloud, that's when you can choose a couple of different options.
18:39So typically we see folks using object storage, like AWS S3, Google Cloud Storage, or Azure for their blob storage.
18:46And then in the metadata space, you can use Postgres as an open source option.
18:52MotherDuck is also an option here as well.
18:55And on the compute side, you can use DuckDB on your laptop still if you'd like, and just have everybody accessing Duck Lake, each user on their own laptop.
19:04You can also run DuckDB anywhere, cloud servers, serverless functions.
19:08And in MotherDuck, we run DuckDB as well.
19:11Yeah, you know, one thing I want to touch on, folks, and again, this just comes from personal experience here.
19:16Something very key here that Alex is calling out is, you know, you can run it on your laptop, you can run it in the cloud.
19:22Setting up the environment locally for DuckLake is so easy compared to all the hoops that at least I have to jump through if I want to instead use something like AWS Glue, for instance.
19:37With that, really the only way you can get a one-for-one experience is to do a full-blown Docker image with Amazon's Glue instance stuff.
19:46And then you got to start going through like, okay, well, is it Glue version 3, 4, 5?
19:50What's the Python library analogous to Glue 5?
19:53And there's a lot of stuff you got to think through versus with DuckLake.
19:57It's like, okay, let me test this here.
19:59Okay, I got it working here locally.
20:01All right, now let me go stick it in prod.
20:03And the transition is nearly seamless.
20:05So, again, less time spent on the plumbing and the infrastructure, more time delivering business value, in my opinion.
20:13You bet.
20:16I guess I wasn't entirely truthful.
20:18I did have one more slide.
20:19This is another compare and contrast slide.
20:21Again, just to give you a different mental model.
20:24DuckLake at the bottom, you'll see the relational database contains both the catalog and the metadata.
20:29Whereas with something like Iceberg or Delta, you would actually be running a separate catalog service, like Apache Polaris or the Unity catalog.
20:36And then you would be storing your metadata in an Iceberg or Delta world on object storage as well.
20:42And the challenge there is the small files problem.
20:46So, maybe that's a lead in here, Matt, to talk about kind of one of the core problems that DuckLake is helping solve with kind of file management and things in the lake house.
20:56How does DuckLake help?
20:57Sure.
20:58And I think we just got a question here.
21:00And I'm sorry if I pronounced your name incorrectly.
21:02Chandra Kant put in there and said, hey, you know, does this mean that the metadata has to sit in a database rather than files?
21:10And the answer to that is yes.
21:12And that is by design.
21:14That is intentional.
21:15Specifically to start to work on these issues of small file problems.
21:19So, in today's...
21:21I'll jump in very briefly and say, if you would like it in files, if you use a SQLite file, that's a database in a file.
21:28So, you could have your metadata in a file if you want.
21:32Yeah.
21:32And so, oh, well, thank you.
21:35Okay.
21:35Sounds like I got the name close enough.
21:37And so, if you think about it today, with the current Iceberg spec now, of course, I think they're working on some new stuff.
21:43Whenever you run a transaction in Iceberg, it's roughly going to create three-plus-one files: three metadata files, plus a data file.
21:52And what I mean by that is, let's say you create a table.
21:55What's going to happen is Iceberg is going to go build...
22:00It's going to go, you know, do an object store scan, check if that table exists.
22:05If not, it's going to go create it.
22:06It's going to create a metadata JSON file, which actually provides the schema.
22:11It's going to then create a couple Avro files.
22:15One's a manifest list that actually points to any corresponding, you know, manifest files.
22:20And it's going to create a manifest file.
22:22No data is created yet in that transaction, so you get three.
22:25But then the minute you want to insert a row, guess what happens?
22:29You get three more files, and you get also a parquet data file.
22:33So, if you start to magnify that over time, folks, what happens is the number of metadata files that you accumulate in these catalogs starts to add up tremendously.
22:44And where that becomes problematic is the minute you want to do things like say, hey, I want to, you know, do some time travel and pull a table in an as-was state.
22:55At that point, you know, you can use your REST API, but it's going to be issuing numerous object store listing calls.
23:02And it's going to go through a resolution process in the tree to look up and scan hundreds, if not thousands, of metadata files to figure out what is the correct set of files for this point in time to get me to the answer this person's wanting.
23:16You issue the same time travel query in Duck Lake.
23:20It's a database lookup, low latency, highly tuned.
23:26And, you know, like Alex mentioned, databases, you know, relational databases have been around for 40-plus years at this point.
23:33So, we've had a lot of learnings at this point on what makes them tick.
23:36And so, you can get query plans resolved in just milliseconds, you know, with a Duck Lake catalog, and then it can hit the ground running and start querying the actual data files needed to give you your answer.
23:50With Iceberg and even with Delta, there's a lot more work involved in those situations.
23:54So, that's how Duck Lake kind of addresses the small file problem by saying, you know what, we're not going to put a bunch of metadata files out there in object store.
24:03We're just going to keep them as records in a database.
24:06And then we can, you know, resolve the query plan from there based on what the user wants.
24:09We can say, here's the list of data files and columns out of these parquet data files you need and go.
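The catalog lookup Matt walks through can be sketched with SQLite, one of the databases DuckLake supports as a catalog. The two-table schema below is a toy stand-in, not the published DuckLake schema; the point is that resolving the file list for a point in time is a single indexed SQL query rather than a walk through hundreds of metadata files on object storage:

```python
import sqlite3

# Toy catalog: every snapshot is a row, and each data file records the
# snapshot range during which it is live (end_snapshot NULL = still live).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, committed_at TEXT);
    CREATE TABLE data_files (
        path TEXT,
        begin_snapshot INTEGER,
        end_snapshot   INTEGER
    );
""")
con.execute("INSERT INTO snapshots VALUES (1, '2024-01-01'), (2, '2024-01-02')")
con.execute("INSERT INTO data_files VALUES ('s3://lake/a.parquet', 1, NULL)")
con.execute("INSERT INTO data_files VALUES ('s3://lake/b.parquet', 2, NULL)")

def files_as_of(snapshot_id: int):
    """Resolve the file list for a time-travel query with one SELECT."""
    rows = con.execute(
        """SELECT path FROM data_files
           WHERE begin_snapshot <= ?
             AND (end_snapshot IS NULL OR end_snapshot > ?)
           ORDER BY path""",
        (snapshot_id, snapshot_id),
    ).fetchall()
    return [r[0] for r in rows]

assert files_as_of(1) == ["s3://lake/a.parquet"]
assert files_as_of(2) == ["s3://lake/a.parquet", "s3://lake/b.parquet"]
```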
24:17You bet.
24:17And I'll add to that and say, we started out talking a whole lot about simplicity.
24:21Simplicity.
24:22And simplicity offers us a lot as the practitioners.
24:26And that's probably our favorite thing about Duck Lake, right?
24:28Matt talked about the two lines to get started.
24:31That's the best part.
24:32But simplicity also can mean speed.
24:35And that's really what Matt was getting to, which is if all you need to do is talk to a database once to find out what files you need, you can have very low latency queries on a lake house in a way that you just can't have in the other system.
24:47So to go and find what files you need to look at for iceberg is four round trips.
24:54One round trip to the catalog and a round trip to each layer of your metadata in this diagram.
25:00And you can't do that in parallel because each query depends on the prior one.
25:06And so when you talk to object storage that many times, it adds up to half a second or a second before you even start pulling your parquet files.
25:15And with Duck Lake, that can be 10 milliseconds.
25:19So it's really a 10x difference.
25:21The other piece where it really comes in handy is this small files problem means that your data has to be moving only at a certain speed with a traditional lake house.
25:31You can only insert data so frequently because if you insert it too often, you get too many of these tiny files.
25:37So if you have a streaming workload, you have to buffer your data before you can write it out to your data lake house.
25:43And that means you have to add a lot more complexity into your stack.
25:47Yeah, you're talking Kafka or Flink queues, all that stuff.
25:51So again, you're adding a lot more in.
25:52And Alex, I think I know you're going with this one.
25:55You're going to be segueing into inlining with this, which is something that Duck Lake has,
26:00which basically is another part of how it attacks the small data file problem head on.
26:07And it basically buffers new rows up to a certain degree inside the actual Duck Lake catalog and then will flush once a certain threshold is met.
26:18And this is really helpful because, you know, in a simplistic view, you can think about it this way.
26:24I think the default, correct me if I'm wrong, Alex, for inlining is 10 rows on the table.
26:28So if I insert, if I run a statement that inserts 10 records in a single batch, it's not going to write a parquet file immediately.
26:35If I do another insert with 11 rows, it's going to write the parquet file for those 11 rows.
26:41But guess what?
26:41The 10 rows from the prior transaction are still held in the catalog.
26:47But then, you know, you can also do periodic flushes if you want to get it out of the catalog and actually persist it to the parquet file.
26:55There's Duck Lake commands that can do that.
26:58But, you know, rest assured, those rows that are sitting in the quote unquote buffer are persisted in the catalog.
27:04They are safe.
27:05They're not going to disappear if all of a sudden there's a system restart.
27:09Once they're committed, they're there and they're available for you the next time.
27:13And then Duck Lake's smart enough on a query to blend rows that are sitting both in the buffer and on the parquet files together to give you the full snapshot of what you're asking for.
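A minimal sketch of the inlining behavior described above, using plain Python lists to stand in for the catalog buffer and for Parquet files. The 10-row threshold mirrors the default Matt mentions; the function names and storage layout are invented for illustration:

```python
INLINE_LIMIT = 10

inlined_rows = []      # small inserts persisted inside the catalog database
parquet_files = []     # each entry stands in for one flushed Parquet file

def insert(rows):
    """Inline small batches in the catalog; write big batches straight to Parquet."""
    if len(rows) <= INLINE_LIMIT:
        inlined_rows.extend(rows)        # committed and durable, no new file
    else:
        parquet_files.append(list(rows)) # one Parquet file for the large batch

def scan():
    """A query blends flushed files and still-inlined rows into one snapshot."""
    return [row for f in parquet_files for row in f] + list(inlined_rows)

def flush():
    """Explicitly persist the inlined rows out to a Parquet file."""
    if inlined_rows:
        parquet_files.append(list(inlined_rows))
        inlined_rows.clear()

insert([{"id": i} for i in range(10)])  # 10 rows: inlined, zero files written
insert([{"id": i} for i in range(11)])  # 11 rows: one file; the 10 stay inlined
assert len(parquet_files) == 1 and len(scan()) == 21
flush()                                  # now the inlined rows become a file too
assert len(parquet_files) == 2 and len(scan()) == 21
```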
27:23Exactly right.
27:24I'm a huge fan of inlining.
27:25I'll pop up this comment from Harris Ward as well.
27:28Data inlining is so smart.
27:30Well, thank you.
27:30I'm glad we didn't have to say that.
27:31So thank you for chiming in there.
27:33I love that small data stays in the database until it's ready to be persisted to object storage.
27:38And in many ways, Duck Lake is a right tool for the right job situation.
27:43You know, Postgres is very good at inserting 10 rows.
27:47Object storage with, you know, a couple of metadata files and a separate parquet file is not very good at inserting 10 rows.
27:53So we're using the right tool for the job.
27:55And it can make a very big difference when you have low latency needs.
27:59Yeah.
27:59Yeah.
27:59And also, too, to think about this, folks, back to the writing capabilities of current lake houses, you know, they're all using these optimistic concurrency control models.
28:09And so if I try to run a transaction and somebody else tries to run another one within a millisecond or two of me and insert the row, one of us is going to win and the other is going to fail.
28:19And guess what happens with the failed transaction?
28:21It's going to stop.
28:23It's going to roll back.
28:24And it's going to have to retry.
28:25Now, this is all under the covers in Iceberg.
28:29You don't actually see that.
28:30But that's what they're having to do because they're having to maintain this optimistic concurrency model.
28:36And if somebody writes the metadata file, which now, you know, is, quote, unquote, deemed as the next version of truth, yet I was in the process of trying to write my own metadata file with my own snapshot of rows, I got to now go undo what I did and then retry again.
28:51And again, that's all managed internally with the Iceberg and Delta specs.
28:56But that's what's going on.
28:57And so how does Duck Lake fit into this?
29:00Well, Duck Lake's just going to get the transaction done faster.
29:04And so, yes, there still might be a situation, a very rare one, where if me and Alex decide to hit F5 at the same time on the keyboard and try to insert the row at the same time, I might win, he might lose, and his transaction gets retried.
29:17But those retries are going to be a lot less frequent.
29:19And so what that means is lower latency, you know, more ability to insert a lot more data, you know, at once and concurrent workloads that are writing to the catalog.
29:33You bet.
29:34And with Duck Lake, when you do that retry, you don't actually have to rewrite the Parquet file at all in most cases.
29:41Usually you just have to retry your transaction into the transactional database, usually Postgres.
29:47And that's, again, a couple of milliseconds.
29:49So even if you have contention, you can retry your way out of it.
29:52Again, it's fully automatic in Duck Lake.
29:55It automatically retries a few times by default.
29:57So you can retry your way out of it in milliseconds and maintain that nice ACID consistency.
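To make the optimistic concurrency discussion above concrete, here is a minimal Python sketch of the read-version, commit, retry-on-conflict loop. It is a toy illustration of the general pattern, not DuckLake's or Iceberg's actual implementation, and all names in it are invented for the example.

```python
class Conflict(Exception):
    pass

class Catalog:
    """Toy catalog: a single integer tracks the current snapshot version."""
    def __init__(self):
        self.version = 0
        self.rows = []

    def commit(self, expected_version, new_rows):
        # Optimistic concurrency: the commit only lands if nobody else
        # has advanced the version since we read it.
        if self.version != expected_version:
            raise Conflict
        self.rows.extend(new_rows)
        self.version += 1

def write_with_retry(catalog, new_rows, max_retries=5):
    """Re-read the version and retry on conflict, the loop that lakehouse
    writers run under the covers. Returns how many conflicts were hit."""
    for attempt in range(max_retries):
        snapshot = catalog.version        # read the current state
        try:
            catalog.commit(snapshot, new_rows)
            return attempt
        except Conflict:
            continue                      # somebody beat us; go around again
    raise RuntimeError("gave up after retries")
```

The point of the comparison in the session is not the loop itself, which every format runs, but how cheap each iteration is: re-reading a version from a transactional database takes milliseconds, while re-reading and rewriting metadata files on object storage does not.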
30:03Hoyt chimed in.
30:03The inlining feature is brilliant.
30:05Thank you, Hoyt.
30:05Great to see you.
30:06And we do have a specific inlining question I can jump to.
30:10Again, love the questions.
30:12Thank you for keeping them coming.
30:13I think we might do one or two more of our pre-prepared questions, and then it's going to be Q&A.
30:18So load up the Q&A.
30:19So Dimitro says, is there a limit on the size of the inlining buffer?
30:25And does Duck Lake flush automatically?
30:28So there is not a limit.
30:30You can set it with a configuration.
30:32So you can set it to whatever you'd like.
30:33It's 10 rows by default.
30:35There is a balance point here.
30:37At some point, Parquet is more efficient than Postgres.
30:40So I probably wouldn't set your inlining limit to a million rows.
30:46Probably less than that.
30:47You know, there's some tweaking you can do, but somewhere in the tens of thousands range.
30:52Very, very reasonable.
30:54Then Duck Lake doesn't flush automatically.
30:57It lets you choose when to flush.
30:59And it's just another SQL command that you can run to do that flush.
31:02And it will convert it into Parquet files.
31:05Either into one Parquet file, or if you're using partitions, it'll split it out into the right partition files.
31:09And write out all the files you need.
31:12So very configurable.
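The inlining behavior described above, small inserts held in the catalog database until an explicit flush writes them out as Parquet, partition by partition, can be sketched with a toy buffer. The class and all of its names are invented for illustration; this is not DuckLake's API.

```python
class InlineBuffer:
    """Toy model of data inlining: small inserts stay in the catalog
    database; an explicit flush converts them into (pretend) Parquet
    files, one per partition value. Purely illustrative."""
    def __init__(self, partition_key=None):
        self.inlined = []          # rows living in the catalog DB
        self.parquet_files = {}    # partition value -> rows "on disk"
        self.partition_key = partition_key

    def insert(self, row):
        # A couple of milliseconds in the catalog DB; no Parquet written.
        self.inlined.append(row)

    def flush(self):
        # User-triggered, like DuckLake's explicit flush command.
        for row in self.inlined:
            part = row.get(self.partition_key, "all") if self.partition_key else "all"
            self.parquet_files.setdefault(part, []).append(row)
        flushed = len(self.inlined)
        self.inlined.clear()
        return flushed
```

The design choice the sketch mirrors is that flushing is under your control rather than automatic, so you decide when the buffered rows are worth the cost of writing Parquet files.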
31:15And then last thing on this slide, just because it was up there and I didn't really talk about it out loud.
31:20At the nuts-and-bolts level, the number of transactions per second you can do is about two orders of magnitude more with DuckLake.
31:28So you can do single-digit transactions per second in a traditional lakehouse.
31:33And you can do around 100 with DuckLake.
31:35And so that really means there's a wide variety of streaming use cases that can just work without having to buffer in Kafka and things of that nature.
31:45So maybe our final kind of pre-made question.
31:50What are you excited about with the future of Duck Lake, Matt?
31:53I know we're talking about just Duck Lake overall, but what makes you even more excited as we look ahead?
31:57Yeah, that one, you know, that's a tough question to answer because things in the data space move so fast.
32:09I think what I'm most excited with Duck Lake on is relieving the cognitive load of your data engineers to where they can spend, like I keep harping on this,
32:20they can spend more time solving the business problem, less time just getting the plumbing in place.
32:26And I think Duck Lake is going to be a catalyst for that.
32:29And I hope it is also, I hope it also kind of sets a standard and opens other, you know, vendors eyes into,
32:37oh, maybe we could do things a little differently here and not try to, you know, force one specification on people.
32:46And so I really see it as a key enabler for teams to move a lot faster and to really just kind of hone in on actually providing value back to their business partners
32:59versus having to tinker around with configurations nearly all day while Java call stacks barf up errors.
33:07And this is my emotional battle scars from Spark manifesting here in a live stream, folks.
33:12But, you know, when you have those weird Java call stacks that in my case just kind of vomit up a bunch of nonsensical information and you're like,
33:21oh, what you were really trying to say on this thousand line call stack was,
33:26hey, you're trying to run Spark 4.1 and you're on OpenJDK 16 or something, I forget which version.
33:35That is incompatible.
33:37You need to go to this JDK version in order to run Spark 4.1, something like that.
33:41So with Duck Lake, it's a lot more direct, especially working through different versions with Duck Lake.
33:47I've seen, you know, from 1.4 to 1.5, sometimes in my demos and testing,
33:52I'll try to run something that's on 1.5 against a 1.4 spec.
33:56And it'll actually error and tell me, like, hey, you are trying to do this.
33:59In order to do this one thing, you have to add this flag to override an existing, you know, parameter.
34:05So it's very informative, again, just trying to help guide you through and get you out of the business of just having to tinker around with configurations all day.
34:15So, yes.
34:17What am I excited about, Duck Lake, in the future?
34:19Speed to market of delivering products to, you know, teams.
34:24That's what I'm excited about.
34:26How about you, Alex?
34:27I'm in a similar boat, of course.
34:29I love the tech.
34:31That's why I'm so excited to be writing the book together with you.
34:35It's just the tech is so exciting to me.
34:37The thing I'm looking forward to is that DuckLake 1.0 is coming in around a month.
34:43So sometime in a month or so, we're going to already be at DuckLake 1.0.
34:50And I think at that point, it's really game on.
34:54And I'm very excited for that.
34:56We're already to the point where there's been a ton of recent work on maturity in DuckLake.
35:03A lot of the features were there out of the gate because we learned so much from what Delta and Iceberg have built.
35:09And we appreciated what they built.
35:11And so we took similar approaches in a lot of ways.
35:13And now we're fine-tuning and ready to go.
35:19Well, fantastic.
35:20Well, thank you all for your questions.
35:21Please keep those coming.
35:22I'm going to put in a link to the chat.
35:25And that is a link if you want to share the first chapter with other folks, or if, for some reason, you didn't get it in your inbox already.
35:36So check your email inbox you signed up with for this webinar, this live stream.
35:41And then if you need to, that link will also allow you to go ahead and sign up and get not only Chapter 1 but also every subsequent chapter in your inbox as we write them.
35:50And those are coming out every couple weeks.
35:52So you'll – a lot more to come there.
35:57So I figured we could go into Q&A mode here.
36:00Great.
36:02So we'll kind of go back in time a little bit to what – some of the earlier questions.
36:07So Harris Ward asked a question.
36:11Is Z-ordering something that can be added to DuckLake?
36:14It's an excellent question.
36:15So this is a performance-tuning question.
36:18And the next chapter that Matt and I are going to write, I think maybe two chapters away, is the performance-tuning chapter.
36:23I'm really excited.
36:24Duck Lake is incredibly simple to set up, just like Postgres is incredibly simple to set up.
36:31But there's a few things you can do to tune Postgres.
36:34It's been around for 30 years.
36:35There's a couple of knobs, just a few.
36:37And there are even a few companies that will help you tune those knobs automatically.
36:40So there's a lot you can do to get the most out of it.
36:43Again, works out of the box incredibly well, but there's ways to push it.
36:46So one way to do that is by choosing good partitioning strategies and good sorting strategies.
36:52Another way of describing it is clustering strategies.
36:55And a great way to do that is to use a concept called Z-ordering.
36:59Z-ordering uses Morton curves, which is a way to say instead of sorting by one column and then sorting by another, you sort approximately by both at the same time.
37:11And the quintessential example of this is latitude and longitude in geospatial land.
37:15If I want to find, you know, cafes near my location, I need to look for things that are close together in both latitude and longitude.
37:25And if you sort by latitude, latitude goes out, you know, eight decimal places or whatever.
37:30You're going to have essentially a random longitude.
37:32But if you do a Z-order, you get a pseudo-sort on both, and dramatically faster query performance, sometimes 10 to 20x, because you just read a lot less data.
37:43This is in Duck Lake today.
37:45It came out in DuckLake 0.4, which launched alongside DuckDB 1.5 just a few weeks ago.
37:50So you can actually sort your data when you compact it, or when you flush that inlined data we talked about.
38:01You can set a sort expression, and that can be any expression, including Z-ordering.
38:07So that's absolutely supported today and it's a very good performance tuning technique.
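For readers who want to see the Morton-curve idea from that answer in code, here is a generic sketch: interleave the bits of two columns into one key, so that values close in both dimensions sort near each other. This is the textbook technique, not DuckLake's internal implementation.

```python
def morton(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order key.
    Points close in both x and y get nearby keys, so sorting by
    this key clusters them together on disk."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return key

# Sort points by their Z-order key instead of by x alone: (0, 0) and
# (1, 1) end up adjacent, while a plain sort on x would separate them.
points = [(0, 0), (7, 7), (1, 1), (0, 7), (7, 0)]
zsorted = sorted(points, key=lambda p: morton(*p))
```

A range scan over a Z-ordered file then touches far fewer row groups when filtering on both columns at once, which is where the 10 to 20x numbers mentioned above come from.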
38:11One funny quick side note on latitude, longitude.
38:15I heard this on StarTalk, Neil deGrasse Tyson's podcast, for all you astronomy nerds.
38:22And he had a very interesting thing.
38:25And I never thought about this until he explained it.
38:27You know, when you think of latitudes and longitudes, why are latitude lines also called parallels?
38:32It's because their lines never intersect when you look at them.
38:34But longitudinal lines do.
38:36They all intersect at the North and South Pole.
38:38And I had to sit there.
38:39I was like, man, how did I not realize that?
38:41But that's where the term came from, when they refer to, like, the 40th parallel or the 50th parallel, stuff like that.
38:48So just an interesting side tidbit.
38:51Love that.
38:52Absolutely.
38:52Well, sticking in the same vein around geospatial data, we did have a question about that.
38:58Let me find that and pop it up for you guys.
39:00So what is the current status of support for geometry types and spatial functions using DuckLake?
39:06So there is support for geometry in DuckLake already.
39:11It was one of the very first data lake house formats to even support it.
39:16The DuckDB Labs folks have been partnering closely with the folks that are doing the parquet specifications, specifically around geoparquet.
39:23And so there's been a lot of great collaboration there.
39:27And so DuckDB supports a ton of geospatial analysis.
39:32And DuckLake has some support there as well.
39:34If you have specific questions, definitely send those to us.
39:37We'll take a look at more detail.
39:38But geometry is a first class citizen in DuckDB now.
39:42In version 1.5, it's now a core fundamental type of the engine with or without any extensions.
39:48And that's because we're very committed to geospatial.
39:51We think that it's getting more and more common, more and more useful.
39:57Thank you for the question.
39:59Let's jump back up to one a little bit earlier.
40:02So Matt did answer this one in the chat, but I'll pop it up here just so we can talk about it out loud as well.
40:08So Juan Ortiz asks, is DuckLake only a good fit for very large data sets?
40:12Or is it also good for data sets with a few million rows?
40:15So I'll pass that to you, Matt.
40:17Sure.
40:17No, it's fit for both.
40:19So, you know, there's been some interesting discussions that have been going on for the last couple of years, I'd say, in terms of, hey, what's DuckDB's fundamental overall scale?
40:31How much can you throw at it?
40:33I recently published an article.
40:35It was probably a few months ago in coordination with Zach Wilson.
40:40And I was able to work with a terabyte data set and scan specific columns off of it in under 17 seconds.
40:47I'm sure some of the purists out there might say, well, you didn't process the entire terabyte, you know, data set file.
40:53It's like, well, yeah, you know, it's going to be very rare that somebody is going to run an analytical query and say, I want to query all the columns out of this data set.
41:00But just what I'm trying to get at is to kind of showcase the scale of the engine of DuckDB.
41:06And that directly translates over to DuckLake.
41:08And sometimes it's going to be even better because DuckLake's been specifically optimized to work with object store and parquet files, not just internal DuckDB storage.
41:17And so to answer the point, though, from Juan, yeah, it works great on small data and large data.
41:25In terms of petabyte scale, that would be an interesting use case.
41:30What does Mother Duck call their jumbo, their galactic instance or something?
41:35If you go on their website, they have the various T-shirt sizes for their compute instances.
41:42And one of them, I kid you not, is a picture of a duck, I think, wearing a solar system as a floaty tube or something looking like it's in orbit.
41:50And it's like the galactic size.
41:51So if you wanted to work with an excruciatingly large data set, petabyte-scale size, you could do it.
41:59It just might take a little time to do that.
42:02So anyways, I hope that kind of answers the question.
42:05It's a great question.
42:05And, you know, that's our Giga instance.
42:08We like those.
42:09It's like a galactic.
42:11I knew it started with a G, Giga.
42:13Oh, yeah.
42:13Oh, yeah.
42:14All good.
42:16It's got over a terabyte of RAM.
42:20So, yeah, you could throw a lot at it, folks.
42:23Yeah.
42:24But that's part of the ethos of Duck Lake and Mother Duck is that single node compute is incredibly powerful now.
42:32Latency is actually the primary thing you want to optimize for, for 99% of your queries.
42:37Right?
42:38Most of your queries, you don't want to be scanning a petabyte every time because that's not cheap to do.
42:44Yeah.
42:44The only time people are scanning petabyte-scale data sets in their workloads these days is for threat detection and fraud, folks.
42:52That's it.
42:53Maybe NASA has some edge cases for their Artemis II rocket where they're constantly reviewing petabyte scale data.
43:01But other than that, those are the two core use cases I see where somebody's like, I have to actually scan a petabyte of data every two minutes to see how are our threat levels?
43:11Is there a DDoS attack going on?
43:13Or is this transaction fraudulent?
43:16Does this person, if I go through a daisy chain?
43:18But then, you know, and this is me going off on a rabbit hole here.
43:22That's where different technologies come into the picture, too, like graph databases and stuff saying, you know, it's no longer a relational problem.
43:29It's more of a relationship problem that I need to be solving for.
43:34So, you know, full circle, yes, you can use Duck Lake on very large data sets.
43:41Whereas petabyte, you might want to start kind of reexamining what you're trying to answer in terms of the problem you're trying to solve.
43:46But, yes, you can use it on terabyte size data sets and it performs pretty darn well, in my opinion.
43:51You bet.
43:52There's been some testing and benchmarking on petabyte scale, so it is possible.
43:55Definitely hundreds of terabytes are well within reason as well.
43:59If you are interested in graph analytics, there is a DuckDB extension for that as well.
44:03Wow.
44:04You can actually do some pretty cool stuff there.
44:06All kinds of fun stuff.
44:07There's lots of, there's all kinds of extensions to DuckDB and Duck Lake is an extension and you can combine all these extensions together.
44:16So we've had some questions around the spatial extension.
44:19There's community extensions.
44:20So now with you and Claude or ChatGPT, you can build almost anything that you want.
44:27People built, you know, very industry-specific file readers.
44:30I need to go read biomedical data.
44:32And you can do it with DuckDB now because there's an extension and they, you know, collaborated with an agent to do that.
44:40Let me look at a couple other questions.
44:47One question was around kind of the way to connect to Duck Lake.
44:52And so today, you know, other lake houses, you connect with a REST API where that's the way that you're kind of talking to the catalog and then making these requests.
45:02I think the key difference with how Duck Lake works is that you're making a SQL connection to your catalog.
45:09And it's typically just using a Postgres driver to go connect.
45:13And DuckDB has a Postgres extension that ships with the full driver.
45:18So you can do it all native to DuckDB or you can do it with any other kind of Postgres driver that you would need.
45:23So that's kind of the one shift is that you would use a Postgres connection instead of a REST API when you connect.
45:31But the good news is that's something that in data engineering, it's bread and butter.
45:36And Alex, I do want to touch on that, too, because I have been seeing some chatter in here on BI integration stuff.
45:43And how does Duck Lake play out with that?
45:46So there's an easy way and a hard way.
45:49The easy solution to this one.
45:51And again, I'm not trying to make a plug for it, but a managed service such as Mother Duck actually makes the integration, especially with their support for Postgres now, very easy.
46:00You can do it. And I'll drop a link in here.
46:02The timing is interesting, but this is not me trying to promote my own Substack.
46:07I did recently prove out that I was able to take Tableau locally and connect to DuckLake.
46:17And the key ingredient on that one for a local Duck Lake instance is you have to have an initial SQL command to actually tell it to attach the Duck Lake database.
46:28But again, with Mother Duck, that's all handled for you to where it actually lives and breathes like an actual database already sitting up there.
46:33So you don't have to worry about any of that stuff.
46:36And then for those of you that are Tableau haters, you're like, well, why'd you use Tableau, not Power BI?
46:40It's because I have a Mac.
46:41They don't have Power BI desktop on Mac.
46:43Otherwise, I promise you I would have tested that as well.
46:46So anyway, I just want to kind of throw that out there.
46:49That's great.
46:49Yeah, I've done testing on both of those.
46:51So yes, absolutely.
46:51Something we're excited about for sure.
46:54Well, let's see.
46:56We've got a nice broad question here from Juan.
46:59Can you talk about the use cases of Duck Lake and compare it with maybe regular Mother Duck or DuckDB?
47:05That's an excellent question.
47:07I think in a lot of ways it comes down to choosing a lake house, choosing a warehouse.
47:12I think what's nice is in 2026, that's less of a hard line and much more of a kind of a choose what you'd like to use at the moment.
47:20So for example, Mother Duck, you can use Mother Duck built-in storage or Duck Lake storage and switch back and forth, even in the same SQL query.
47:28So really on the Mother Duck side, you can choose table by table.
47:32Really not a lot of hard choices there.
47:36Excuse me.
47:37In terms of in general, Duck Lake has partitioning built in, and it has this kind of compaction idea and clustering idea.
47:44So it does work with even larger data sets than you would associate with DuckDB or Mother Duck traditionally.
47:52So it's really great for that large range of scale, hundreds of terabytes up in that kind of petabyte range.
47:58And it's also good if your workload really fits well with partitioning.
48:02So if you have, you know, a thousand customers and you only ever query things one customer at a time, partitions can really be very fast.
48:10And so Duck Lake is a very good fit for kind of partitionable workloads, if you will.
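A toy sketch of why partitionable workloads fit so well: if the data is laid out one file per customer, a single-customer query prunes down to one partition instead of scanning everything. All names here are illustrative, not DuckLake syntax.

```python
from collections import defaultdict

def partition_rows(rows, key):
    """Lay rows out Hive-style: one 'file' per distinct partition value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

def query_customer(parts, customer):
    """Partition pruning: touch only the one matching partition,
    never the rest of the data set."""
    files_scanned = 1 if customer in parts else 0
    return parts.get(customer, []), files_scanned

rows = [{"customer": c, "amount": i}
        for i, c in enumerate(["a", "b", "a", "c"])]
parts = partition_rows(rows, "customer")
hits, scanned = query_customer(parts, "a")
```

With a thousand customer partitions, the per-customer query still scans exactly one of them, which is the "thousand customers, query one at a time" case described above.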
48:16Duck Lake is also multiplayer in a way that DuckDB by itself is not.
48:21So Duck Lake allows you to really use it across your organization instead of just on your machine.
48:26So that's probably the biggest difference between open source DuckDB and open source Duck Lake.
48:31Mother Duck is multiplayer as well as another option there.
48:34Anything you would add, Matt?
48:36Sure.
48:37Yeah.
48:37So another thing, too, I think also you want to understand from your organizational landscape, you know, what you would be wanting to support versus, you know, delegate to others.
48:50And so if you think about it, like if you're comfortable supporting your own Postgres instance in AWS, that's the cloud service I'm most familiar with these days.
48:59So Aurora Postgres to manage the Duck Lake catalog as well as S3.
49:05You can do that if you're staffed for it and you have all of your security stuff in place and you want to have fun fighting with IAM and KMS keys.
49:14And now you're starting to bring up some PTSD for me with AWS.
49:18But or the alternative is if you're like, you know what, I don't want to spend the time managing the infrastructure.
49:24That's when something like Mother Duck would make more sense because with them it's more plug and play.
49:28You just say, hey, here's where my data sits and lives and breathes.
49:32And I want to be able to just query it out of your system.
49:34And I want to make sure, you know, I can scale on demand as needed.
49:38And I don't want to have to worry about KMS keys and bucket policies and all this other stuff.
49:45You bet.
49:46Yeah, there's great options.
49:48And that's what's nice about the openness of Duck Lake is that you can really move around.
49:52You can prototype anywhere.
49:53You can push to prod fully open source.
49:56And then Mother Duck also has some nice ease of use there.
50:01Let's see.
50:02So this is a question about the book.
50:04When will the full book be available?
50:05So if you sign up, you'll get the chapters as you go.
50:08And you'll actually get the PDF copy of the book before the print copy is done in that early release format.
50:14So I highly encourage you to sign up.
50:16You'll get the book in pieces faster than you'll get it in total.
50:20I'm not sure we have an end date that we're talking about too publicly.
50:22But it is going to be a new chapter every several weeks to a month.
50:27And we've got a decent number of chapters, not 30, you know, but a decent number of chapters.
50:32So towards, you know, kind of it's a multi-month process.
50:36Stay tuned and join us on the ride.
50:38That's what I would say there.
50:41All right.
50:42And we have just a couple minutes left.
50:45Definitely get questions in.
50:46And I'm looking at a couple of things.
50:51All right.
50:52So we had a question around security.
50:57So that's a very interesting one.
50:58So how would we, you know, handle security?
51:01Would we handle it at the bucket layer?
51:04How does it compare to kind of Unity and Iceberg?
51:06I'm paraphrasing a little bit.
51:07Thanks for your question, Dimitro.
51:10That is something where there is a guide for how to set that up on the ducklake.select website.
51:16It's a great domain.
51:17ducklake.select.
51:19So great.
51:19So in ducklake.select, there's the full documentation there.
51:23And there is a section on security.
51:25And there's a couple options that you have.
51:27You can manage it with your buckets on S3.
51:31You can also manage some of it at the catalog layer, through a unique feature that DuckLake has for full encryption capability.
51:40So what does that mean?
51:40It means that, because DuckLake only uses the Parquet format, and not plain-text formats like JSON or formats that don't support encryption like Avro, the storage of DuckLake can be entirely encrypted.
51:54And the keys can be stored in your catalog.
51:57So what that means is you can do all your security and access control in Postgres.
52:02And then have your object storage minimal security because it's fully encrypted.
52:09And you can manage your keys in your metadata database.
52:12So that's a new innovation that DuckLake provides, one that isn't possible with a format that also stores its metadata on object storage.
52:20So it's just not possible in other lakehouses.
52:22And so that's a new thing DuckLake enables.
52:24So that's something I would encourage you to check out there.
52:27On the MotherDuck side, we do also manage the security for you as well.
52:30That's one more option.
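The keys-in-the-catalog design can be illustrated with a toy: ciphertext sits in a pretend object store, the per-file key sits in a pretend catalog, and reading requires both. The XOR keystream below is a deliberately simple stand-in for real Parquet encryption; nothing here reflects DuckLake's actual cipher or key handling.

```python
import hashlib
import secrets

def xor_stream(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR against a SHA-256-derived keystream.
    A stand-in for real Parquet encryption, nothing more."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

catalog = {}        # file name -> encryption key (lives in the SQL catalog)
object_store = {}   # file name -> ciphertext   (lives on object storage)

def write_file(name: str, plaintext: bytes):
    key = secrets.token_bytes(32)
    catalog[name] = key                      # access control happens here
    object_store[name] = xor_stream(plaintext, key)

def read_file(name: str) -> bytes:
    # Without the catalog key, the object-store bytes are useless.
    return xor_stream(object_store[name], catalog[name])
```

The structural point survives the toy crypto: whoever controls the catalog database controls access to the data, so bucket policies can stay minimal.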
52:33Let's see.
52:35Any other questions jump out to you, Matt?
52:39Any plans for on-prem?
52:41I'm curious about that.
52:42I'm looking at that one.
52:45All right.
52:45So Balraj, is that, and I'm sorry if I'm pronouncing your name incorrectly.
52:50Are you referring to a MotherDuck-like managed on-prem instance versus SaaS cloud offering?
52:59If you want to put in the chat here.
53:01I'm not sure.
53:02I'll pop this up here just as the diagram here of this.
53:05So one of the beauties of Ducklake is that you can self-host it almost anywhere.
53:10Yes.
53:10You don't even need to self-host with an S3 style API.
53:14You can host with a drive.
53:16So like a network-attached storage device or an on-premise shared drive.
53:20You can host your own Postgres on-prem, and you can run your own computer on-prem.
53:25So it's something you can completely sandbox if you'd like.
53:30Likewise, there's a full cloud option as well where you can do it on object storage, cloud-hosted Postgres, and run it on any kind of cloud compute you'd like.
53:38So really a lot of flexibility there, but it is something that's a unique empowerment capability for those that would prefer to stay on-prem for various reasons.
53:48It is a really useful technology in that situation.
53:52So thank you for the question.
53:55Let's see.
53:59Excellent.
54:00All right.
54:05Well, I think – thank you all for the excellent questions.
54:10Definitely really appreciate it.
54:11We'd highly recommend go ahead and sign up to receive the chapters of the book.
54:15And, Matt, it has been a tremendous pleasure talking with you today about Ducklake.
54:19Yep.
54:19Same with you, Alex.
54:20And thank you all for joining.
54:22It looks like we had some fantastic chatter here in the chat side.
54:26You know where to find us both.
54:28I'm – you know, as you probably all know, I frequently post on LinkedIn, stuff like that.
54:34So if you've got any questions, feel free to either send me a DM there or if you want to, you know, call me out in the public square of LinkedIn, feel free to do that too.
54:42I have thick skin, so I'm okay if you have some criticism for me.
54:45I always feel like I learn the most through criticism and, honestly, through failures.
54:50That's where, at least in my work career, I've learned the most when I've broken production, which has only happened a couple times.
54:56Only a couple times, at least that I'll admit to.
54:59But that's where I've learned the most.
55:01Oh, that's where the previous comment comes back, right?
55:03You blame the intern.
55:05Yeah.
55:05No, everybody breaks production.
55:07I've done that as well.
55:08It's a rite of passage.
55:11You always learn a ton, for sure.
55:14Thank you all for your great questions.
55:16And please keep the conversation coming.
55:17You can be in the MotherDuck community Slack, the DuckDB Discord, and on LinkedIn.
55:22You can find Matt and me there as well.
55:24Quack on and prosper, folks.
55:25Great time chatting.
55:27Thank you all.
55:27Have a good day.
FAQs
What is DuckLake?
DuckLake is an open source (MIT licensed) lakehouse table format that stores metadata in a SQL database — DuckDB, SQLite, or Postgres — instead of scattered files on object storage. Data files are standard Parquet, similar to Iceberg, but catalog operations are dramatically faster because they run as SQL queries against a database rather than file-system round trips.
How does DuckLake compare to Apache Iceberg?
DuckLake builds on many of the same foundations as Iceberg — Parquet storage, schema evolution, time travel — but replaces the metadata layer. Iceberg tracks metadata across thousands of small JSON and Avro files, which requires compaction jobs and complex configuration. DuckLake puts all of that in a single SQL database, making catalog operations 10–100x faster and dramatically simpler to set up and operate. Both formats support ACID transactions, but DuckLake gets them from the underlying database engine rather than file-level coordination.
Can I migrate from Iceberg to DuckLake?
Yes. DuckLake supports importing Iceberg tables with a metadata-only copy. Your Parquet data files stay where they are — DuckLake just reads the Iceberg metadata and converts it to its own catalog format. You don't need to move or rewrite any data.
Is the DuckLake book free?
Yes. The early release chapters are available for free. Sign up on the DuckLake: The Definitive Guide page to receive chapters as they are released.
Who are the authors?
Matt Martin is a Staff Engineer at State Farm with over 20 years of data engineering experience. Alex Monahan is a Developer Advocate at MotherDuck and a DuckLake contributor. The book is published through O'Reilly Media with the same peer-reviewed editorial standards as their other definitive technology guides.
Related Videos

43:41
2026-03-26
Beyond Charts: Building Interactive Data Apps with MotherDuck Dives
Learn to build interactive data apps with MotherDuck Dives. Go beyond static charts with live SQL, React components, and shareable URLs.
Webinar
MotherDuck Features
BI & Visualization

1:00:14
2026-03-11
Building an Analytics Chatbot for your SaaS app in 1 day
Learn how to build a conversational AI chatbot for your SaaS product using the MotherDuck MCP server, with scoped data access and streaming responses.
Webinar
AI, ML and LLMs
Tutorial
MotherDuck Features

1:00:10
2026-02-25
Shareable visualizations built by your favorite agent
You know the pattern: someone asks a question, you write a query, share the results — and a week later, the same question comes back. Watch this webinar to see how MotherDuck is rethinking how questions become answers, with AI agents that build and share interactive data visualizations straight from live queries.
Webinar
AI, ML and LLMs
MotherDuck Features
