The DuckLake Lakehouse: From Getting Started to Going Fast
2026/04/28

TL;DR: DuckLake stores the catalog and metadata in a single SQL database instead of object storage files. That one change drops metadata query latency from seconds to milliseconds and lets you ingest data tens of times per second. The session covers the architecture, a live setup demo, and the specific settings that make DuckLake fast in production.
What makes DuckLake different
Most lakehouse formats follow the Iceberg pattern: a database-backed catalog talks to metadata files in object storage, which point to manifest lists, which point to manifests, which point to data files. That's four sequential round trips in object storage before you read a single byte of actual data. At 100ms per request, you're waiting seconds just to start a query.
DuckLake skips all of that. The catalog and metadata both live in a relational database — DuckDB, SQLite, Postgres, or MotherDuck. When a query runs, it hits the database once, gets back a precise list of files to read, and goes straight to the Parquet data. That database call takes single-digit milliseconds. Complex queries that used to take multiple seconds now finish in hundreds of milliseconds.
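To make that concrete, the catalog lookup is an ordinary SQL query over the metadata tables. The sketch below is illustrative only: the table and column names follow the DuckLake metadata schema as published in the spec, but the actual query the extension issues may differ.

```sql
-- Illustrative sketch of the single planning query: which Parquet files
-- does table 'orders' consist of at snapshot 42? Table and column names
-- (ducklake_table, ducklake_data_file, begin_snapshot, end_snapshot, path)
-- follow the published DuckLake metadata schema; treat as an approximation.
SELECT f.path
FROM ducklake_data_file AS f
JOIN ducklake_table AS t USING (table_id)
WHERE t.table_name = 'orders'
  AND f.begin_snapshot <= 42
  AND (f.end_snapshot IS NULL OR f.end_snapshot > 42);
```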
This also fixes the small files problem. Ingests write to the database first rather than creating a new Parquet file for every insert, so you can ingest thirty times per second without drowning in tiny files.
Getting started
Three commands get you running (a runnable sketch follows the list):
- INSTALL ducklake — adds the extension
- ATTACH — connects a catalog database
- Start using it with standard SQL
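Here is a minimal sketch of those three steps in a DuckDB shell. The catalog file name, DATA_PATH, and the orders table are placeholders; the ducklake: ATTACH prefix and the DATA_PATH option follow the DuckLake extension docs.

```sql
-- Quickstart sketch: DuckDB-backed catalog on local disk.
-- 'my_catalog.ducklake', 'data/', and the orders table are placeholders.
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:my_catalog.ducklake' AS my_ducklake (DATA_PATH 'data/');
USE my_ducklake;

-- From here on it is standard SQL.
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount DECIMAL(10, 2));
INSERT INTO orders VALUES (1, 42, 19.99);
SELECT * FROM orders;
```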
The setup demo walks through a live example on MotherDuck, including attach options for managed and bring-your-own-bucket storage.
Production tuning
The second half of the session covers six settings worth changing before you go to production (a sketch of the corresponding SQL follows the list):
- Parquet v2 — improved compression with broad ecosystem compatibility
- ZStandard compression — better than the default Snappy, still widely supported
- Row group size — target 8MB per column for cloud storage so DuckDB can read in efficient chunks
- Data inlining threshold — batches small inserts into the catalog DB before flushing to Parquet; controls the write frequency vs. file count tradeoff
- Partitioning — hundreds to low thousands of partitions is the sweet spot; millions create too many small files
- Clustering and sorting — aim for about ten row groups per file; sorting on the right columns can give a 10x read speedup
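As a sketch, the first four of those settings map onto set_option calls like the ones below. The set_option mechanism itself is confirmed later in the session; the specific option names and value formats are taken from the DuckLake extension docs and should be verified against your version.

```sql
-- Production tuning sketch. Verify option names and value formats
-- against the DuckLake version you run; these follow recent docs.
CALL my_ducklake.set_option('parquet_version', '2');                 -- Parquet v2
CALL my_ducklake.set_option('parquet_compression', 'zstd');          -- ZSTD over Snappy
CALL my_ducklake.set_option('parquet_row_group_size_bytes', '80MB'); -- ~8MB x 10 columns
CALL my_ducklake.set_option('data_inlining_row_limit', '100');       -- buffer small inserts
```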
For a deeper look at the open lakehouse stack architecture, the session also covers where DuckLake fits relative to Iceberg and Delta Lake, and how ACID transactions across tables work in practice.
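On the ACID point: because every table's metadata lives in one transactional database, a cross-table write is just a SQL transaction. A minimal sketch with placeholder tables:

```sql
-- Cross-table ACID sketch: both changes commit as one atomic snapshot,
-- or neither does. Table names are placeholders.
BEGIN TRANSACTION;
INSERT INTO my_ducklake.orders VALUES (2, 42, 5.00);
UPDATE my_ducklake.inventory SET quantity = quantity - 1 WHERE item_id = 7;
COMMIT;
```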
FAQs
What is DuckLake and how does it differ from Apache Iceberg?
DuckLake is an open lakehouse format that stores catalog and metadata in a relational database instead of object storage files. With Iceberg, a query makes four sequential round trips through object storage before reading any data — each taking up to 100ms, which can add seconds of overhead per query. DuckLake replaces that with a single database call that takes milliseconds. The getting started guide walks through setup from scratch. Beyond DuckDB, DuckLake already has implementations in Spark, Trino, and DataFusion.
How does DuckLake data inlining work?
Data inlining writes small inserts to the catalog database instead of immediately creating a new Parquet file. The default threshold is 10 rows, though you can adjust it. Rows stored in the catalog are still queryable, and a background process flushes them to Parquet later. This way you can ingest data up to thirty times per second without accumulating thousands of tiny files that bog down query planning. You get Kafka-style write buffering without actually running Kafka.
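A sketch of inlining in action. The data_inlining_row_limit option and the flush_inlined_data call follow the DuckLake extension docs; double-check the names on your version, and treat the events table as a placeholder.

```sql
-- Data inlining sketch: inserts below the threshold land in the catalog
-- database rather than new Parquet files, and are queryable immediately.
CALL my_ducklake.set_option('data_inlining_row_limit', '100');  -- default is 10

CREATE TABLE events (id INTEGER, kind VARCHAR);
INSERT INTO events VALUES (1, 'click');   -- 1 row < 100: inlined, no new file
SELECT count(*) FROM events;              -- inlined rows are visible

-- Flush accumulated inlined rows to Parquet explicitly
-- (a background process can also handle this).
CALL my_ducklake.flush_inlined_data();
```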
What Parquet settings should I change when running DuckLake in production?
Three settings matter for cloud deployments. Switch to Parquet v2 — it compresses better and has been around long enough that most readers handle it fine. Swap the default Snappy compression for ZStandard (ZSTD), which gets significantly better ratios on Parquet files. And bump your row group size so you're targeting roughly 8MB per column; DuckDB will read object storage in larger chunks instead of making a ton of small requests. If your table has ten columns, that works out to an 80MB row group size. You set all of these with CALL my_ducklake.set_option(...).
How should I partition a DuckLake table for better query performance?
Hundreds to low thousands of partitions works well for most workloads. People usually partition by time (year, month, day) or by something high-cardinality like customer ID. But if you push it too far — a million partitions for individual customers, say — your files end up tiny, and the engine spends more time figuring out which files to open than actually reading anything.
Bucket partitioning splits the difference. With 1,000 customer buckets you can still skip 99.9% of the data on a single-customer query without drowning in small files. The tradeoff is on the write side: more partitions mean more overhead at ingest. Whether that matters depends on whether reads or writes are your bottleneck.
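A sketch of both schemes. SET PARTITIONED BY is the DuckLake partitioning clause; the modulo bucket is one way to cap the partition count, and it assumes your DuckLake version accepts that expression, so treat it as illustrative.

```sql
-- Time partitioning sketch: lands in the hundreds-to-thousands sweet spot.
ALTER TABLE events SET PARTITIONED BY (year(event_time), month(event_time));

-- Bucket-style sketch: 1,000 buckets instead of one partition per customer.
-- A single-customer query can then skip 999 of the 1,000 buckets. Assumes
-- modulo expressions are accepted here; applies to newly written data.
ALTER TABLE events SET PARTITIONED BY (customer_id % 1000);
```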
Can I use DuckLake without MotherDuck?
Yes. DuckLake is an open specification that works with DuckDB, SQLite, or Postgres as the catalog database, and any S3-compatible object storage. You can run it entirely on-prem or in your own cloud account with no dependency on MotherDuck. If you're under contractual restrictions that prevent sending data to third-party vendors, open source DuckLake with your own bucket is the way to go. MotherDuck offers a managed version where setup is a single SQL statement and compute scales automatically, but it's optional. The session covers the data lake vs. data warehouse vs. lakehouse tradeoffs if you want more on that comparison.
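A self-hosted sketch: Postgres as the catalog, an S3-compatible bucket you control for the data. All connection values are placeholders; the ducklake:postgres: prefix, the DATA_PATH option, and the CREATE SECRET syntax follow the DuckDB and DuckLake docs.

```sql
-- Self-hosted sketch: no MotherDuck dependency. Postgres holds the
-- catalog; your own bucket holds the Parquet files. Values are placeholders.
INSTALL ducklake;
INSTALL postgres;  -- driver for the Postgres catalog

CREATE SECRET (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');

ATTACH 'ducklake:postgres:dbname=lake host=db.internal user=etl'
    AS my_ducklake (DATA_PATH 's3://my-company-bucket/lake/');
```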