The DuckLake Lakehouse: From Getting Started to Going Fast
2026/04/28

TL;DR: DuckLake stores the catalog and metadata in a single SQL database instead of object storage files. That one change drops metadata query latency from seconds to milliseconds and lets you ingest data tens of times per second. The session covers the architecture, a live setup demo, and the specific settings that make DuckLake fast in production.
What makes DuckLake different
Most lakehouse formats follow the Iceberg pattern: a database-backed catalog talks to metadata files in object storage, which point to manifest lists, which point to manifests, which point to data files. That's four sequential round trips in object storage before you read a single byte of actual data. At 100ms per request, you're waiting seconds just to start a query.
DuckLake skips all of that. The catalog and metadata both live in a relational database — DuckDB, SQLite, Postgres, or MotherDuck. When a query runs, it hits the database once, gets back a precise list of files to read, and goes straight to the Parquet data. That database call takes single-digit milliseconds. Complex queries that used to take multiple seconds now finish in hundreds of milliseconds.
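To make that concrete, the catalog lookup is an ordinary SQL query over the metadata tables. The sketch below is illustrative only: the table and column names follow the DuckLake metadata schema as published in the spec, but the actual query the extension issues may differ.

```sql
-- Illustrative sketch of the single planning query: which Parquet files
-- does table 'orders' consist of at snapshot 42? Table and column names
-- (ducklake_table, ducklake_data_file, begin_snapshot, end_snapshot, path)
-- follow the published DuckLake metadata schema; treat as an approximation.
SELECT f.path
FROM ducklake_data_file AS f
JOIN ducklake_table AS t USING (table_id)
WHERE t.table_name = 'orders'
  AND f.begin_snapshot <= 42
  AND (f.end_snapshot IS NULL OR f.end_snapshot > 42);
```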
This also fixes the small files problem. Ingests write to the database first rather than creating a new Parquet file for every insert, so you can ingest thirty times per second without drowning in tiny files.
Getting started
Three commands get you running (a runnable sketch follows the list):
- INSTALL ducklake — adds the extension
- ATTACH — connects a catalog database
- Start using it with standard SQL
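Here is a minimal sketch of those three steps in a DuckDB shell. The catalog file name, DATA_PATH, and the orders table are placeholders; the ducklake: ATTACH prefix and the DATA_PATH option follow the DuckLake extension docs.

```sql
-- Quickstart sketch: DuckDB-backed catalog on local disk.
-- 'my_catalog.ducklake', 'data/', and the orders table are placeholders.
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:my_catalog.ducklake' AS my_ducklake (DATA_PATH 'data/');
USE my_ducklake;

-- From here on it is standard SQL.
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount DECIMAL(10, 2));
INSERT INTO orders VALUES (1, 42, 19.99);
SELECT * FROM orders;
```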
The setup demo walks through a live example on MotherDuck, including attach options for managed and bring-your-own-bucket storage.
Production tuning
The second half of the session covers six settings worth changing before you go to production (a sketch of the corresponding SQL follows the list):
- Parquet v2 — improved compression with broad ecosystem compatibility
- ZStandard compression — better than the default Snappy, still widely supported
- Row group size — target 8MB per column for cloud storage so DuckDB can read in efficient chunks
- Data inlining threshold — batches small inserts into the catalog DB before flushing to Parquet; controls the write frequency vs. file count tradeoff
- Partitioning — hundreds to low thousands of partitions is the sweet spot; millions create too many small files
- Clustering and sorting — aim for about ten row groups per file; sorting on the right columns can give a 10x read speedup
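As a sketch, the first four of those settings map onto set_option calls like the ones below. The set_option mechanism itself is confirmed later in the session; the specific option names and value formats are taken from the DuckLake extension docs and should be verified against your version.

```sql
-- Production tuning sketch. Verify option names and value formats
-- against the DuckLake version you run; these follow recent docs.
CALL my_ducklake.set_option('parquet_version', '2');                 -- Parquet v2
CALL my_ducklake.set_option('parquet_compression', 'zstd');          -- ZSTD over Snappy
CALL my_ducklake.set_option('parquet_row_group_size_bytes', '80MB'); -- ~8MB x 10 columns
CALL my_ducklake.set_option('data_inlining_row_limit', '100');       -- buffer small inserts
```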
For a deeper look at the open lakehouse stack architecture, the session also covers where DuckLake fits relative to Iceberg and Delta Lake, and how ACID transactions across tables work in practice.
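On the ACID point: because every table's metadata lives in one transactional database, a cross-table write is just a SQL transaction. A minimal sketch with placeholder tables:

```sql
-- Cross-table ACID sketch: both changes commit as one atomic snapshot,
-- or neither does. Table names are placeholders.
BEGIN TRANSACTION;
INSERT INTO my_ducklake.orders VALUES (2, 42, 5.00);
UPDATE my_ducklake.inventory SET quantity = quantity - 1 WHERE item_id = 7;
COMMIT;
```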
FAQs
What is DuckLake and how does it differ from Apache Iceberg?
DuckLake is an open lakehouse format that stores catalog and metadata in a relational database instead of object storage files. With Iceberg, a query makes four sequential round trips through object storage before reading any data — each taking up to 100ms, which can add seconds of overhead per query. DuckLake replaces that with a single database call that takes milliseconds. The getting started guide walks through setup from scratch. Beyond DuckDB, DuckLake already has implementations in Spark, Trino, and DataFusion.
How does DuckLake data inlining work?
Data inlining writes small inserts to the catalog database instead of immediately creating a new Parquet file. The default threshold is 10 rows, though you can adjust it. Rows stored in the catalog are still queryable, and a background process flushes them to Parquet later. This way you can ingest data up to thirty times per second without accumulating thousands of tiny files that bog down query planning. You get Kafka-style write buffering without actually running Kafka.
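A sketch of inlining in action. The data_inlining_row_limit option and the flush_inlined_data call follow the DuckLake extension docs; double-check the names on your version, and treat the events table as a placeholder.

```sql
-- Data inlining sketch: inserts below the threshold land in the catalog
-- database rather than new Parquet files, and are queryable immediately.
CALL my_ducklake.set_option('data_inlining_row_limit', '100');  -- default is 10

CREATE TABLE events (id INTEGER, kind VARCHAR);
INSERT INTO events VALUES (1, 'click');   -- 1 row < 100: inlined, no new file
SELECT count(*) FROM events;              -- inlined rows are visible

-- Flush accumulated inlined rows to Parquet explicitly
-- (a background process can also handle this).
CALL my_ducklake.flush_inlined_data();
```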
What Parquet settings should I change when running DuckLake in production?
Three settings matter for cloud deployments. Switch to Parquet v2 — it compresses better and has been around long enough that most readers handle it fine. Swap the default Snappy compression for ZStandard (ZSTD), which gets significantly better ratios on Parquet files. And bump your row group size so you're targeting roughly 8MB per column; DuckDB will read object storage in larger chunks instead of making a ton of small requests. If your table has ten columns, that works out to an 80MB row group size. You set all of these with CALL my_ducklake.set_option(...).
How should I partition a DuckLake table for better query performance?
Hundreds to low thousands of partitions works well for most workloads. People usually partition by time (year, month, day) or by something high-cardinality like customer ID. But if you push it too far — a million partitions for individual customers, say — your files end up tiny, and the engine spends more time figuring out which files to open than actually reading anything.
Bucket partitioning splits the difference. With 1,000 customer buckets you can still skip 99.9% of the data on a single-customer query without drowning in small files. The tradeoff is on the write side: more partitions mean more overhead at ingest. Whether that matters depends on whether reads or writes are your bottleneck.
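A sketch of both schemes. SET PARTITIONED BY is the DuckLake partitioning clause; the modulo bucket is one way to cap the partition count, and it assumes your DuckLake version accepts that expression, so treat it as illustrative.

```sql
-- Time partitioning sketch: lands in the hundreds-to-thousands sweet spot.
ALTER TABLE events SET PARTITIONED BY (year(event_time), month(event_time));

-- Bucket-style sketch: 1,000 buckets instead of one partition per customer.
-- A single-customer query can then skip 999 of the 1,000 buckets. Assumes
-- modulo expressions are accepted here; applies to newly written data.
ALTER TABLE events SET PARTITIONED BY (customer_id % 1000);
```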
Can I use DuckLake without MotherDuck?
Yes. DuckLake is an open specification that works with DuckDB, SQLite, or Postgres as the catalog database, and any S3-compatible object storage. You can run it entirely on-prem or in your own cloud account with no dependency on MotherDuck. If you're under contractual restrictions that prevent sending data to third-party vendors, open source DuckLake with your own bucket is the way to go. MotherDuck offers a managed version where setup is a single SQL statement and compute scales automatically, but it's optional. The session covers the data lake vs. data warehouse vs. lakehouse tradeoffs if you want more on that comparison.
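A self-hosted sketch: Postgres as the catalog, an S3-compatible bucket you control for the data. All connection values are placeholders; the ducklake:postgres: prefix, the DATA_PATH option, and the CREATE SECRET syntax follow the DuckDB and DuckLake docs.

```sql
-- Self-hosted sketch: no MotherDuck dependency. Postgres holds the
-- catalog; your own bucket holds the Parquet files. Values are placeholders.
INSTALL ducklake;
INSTALL postgres;  -- driver for the Postgres catalog

CREATE SECRET (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');

ATTACH 'ducklake:postgres:dbname=lake host=db.internal user=etl'
    AS my_ducklake (DATA_PATH 's3://my-company-bucket/lake/');
```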