What is DuckLake? A Simpler Data Lake & Warehouse with DuckDB
2025/06/17

The data industry is buzzing with excitement about the lakehouse, and one technology is at the center of the storm: Apache Iceberg. This "Iceberg mania" isn't without reason. Developers and data engineers are grappling with real, persistent pain points that these new open table formats promise to solve:
- Avoiding Vendor Lock-in: The desire to own your data's destiny, free from the silos of proprietary systems.
- Escaping High Costs: The punishing expense of storing massive, infrequently accessed historical data in traditional cloud data warehouses.
- Enabling Interoperability: The need for different engines—like Spark for large-scale ETL and DuckDB for fast, interactive analytics—to work together on the same data.
- Achieving Transactional Guarantees: The critical ability to perform reliable, atomic updates on data stored in an object storage data lake.
These are valid, pressing problems. But as we've rushed to embrace the solution, a critical question has emerged: Have we traded one form of complexity for another?
In this conversation, Hannes Mühleisen, the creator of DuckDB, described his "aesthetic reaction" to the state of Iceberg. It’s a reaction that sparked the creation of DuckLake, a new approach to the lakehouse that challenges the prevailing complexity with a solution rooted in first principles and elegant simplicity.
A Creator's Critique: The "Aesthetic Problem" with Iceberg
The initial concept of Iceberg was clean. It was founded on a simple, powerful idea: tables could be represented as a collection of files on an object store, with metadata files tracking the state. The beauty was in its self-containment. All you needed was access to the object store, and you could read the table.
But according to Hannes, that philosophical purity was broken when a critical piece of infrastructure was bolted on.
"The aesthetics issue really started when they slapped the catalog server on top to solve some really fundamental issues... they ended up putting this catalog server on top that completely broke with that design philosophy."
— Hannes Mühleisen [11:35]
This wasn't just a minor addition; it was a fundamental architectural shift that introduced a cascade of complexity. The so-called "REST Catalog" requires:
- A Separate, Always-On Service: A catalog server (typically a Java service) that exposes a REST API.
- A Backing Database: Typically, a PostgreSQL server to manage the catalog's state.
- Complex Infrastructure: A fleet of containers and services that must be deployed, monitored, and maintained.
Suddenly, the simple "files on S3" model ballooned into a distributed system. This complexity is most acute for developers tasked with building writers for the format. As Jordan Tigani, CEO of MotherDuck, notes, to build a correct and efficient Iceberg writer, "you basically have to build half of a database."
Hannes goes even further: "You have to write an entire database." A competent writer needs to manage transactions, handle concurrent commits with optimistic locking, perform compaction to prevent the proliferation of small files, and manage snapshot expiration. You're not just writing to a spec; you're implementing the core logic of a database management system.
This has led to a situation Hannes and Jordan compare to Hadoop at its peak. While "knows Iceberg" is becoming a hot skill on a data engineer's resume, the underlying architecture has deep-seated issues. This isn't just a flawed implementation; it's a conceptual problem.
"Iceberg's problems are conceptual, fundamental, and baked in the specification, which is an entirely different way of being wrong."
— Hannes Mühleisen [17:26]
Introducing DuckLake: A First-Principles Data Lake Architecture
This is where "spite engineering," a term Hannes uses to describe a principled refusal to accept flawed solutions, comes into play. The DuckDB team asked a simple, almost heretical question: If we accept that a database is now required to manage the metadata, why not design the entire system around that fact from the beginning?
"Why don't we use a database? ... We are extremely surprised that people think this is a good idea. This is somehow revolutionary because this is, in retrospect... the most obvious thing ever, right?"
— Hannes Mühleisen [24:55]
This question is the foundation of DuckLake. Instead of treating the database as a bolted-on component hidden behind a complex API, DuckLake embraces it as the central nervous system of the lakehouse.
The DuckLake Data Warehouse Architecture: SQL + Parquet
The DuckLake model radically simplifies the technology stack. It consists of just two core components:
- Metadata: Stored in standard SQL tables within any transactional database. This could be PostgreSQL, a local DuckDB file, or even a cloud data warehouse. The entire DuckLake specification is just a SQL schema.
- Data: Stored as standard Parquet files in object storage (like S3, GCS, or R2).
That's it. The stack is SQL + Parquet. There is no custom REST API to learn, no separate Java service to maintain, and no Avro or JSON metadata files to parse. This dramatically reduces the number of technologies a developer needs to understand, implement, and manage.
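To make "the entire specification is just a SQL schema" concrete, here is a deliberately simplified sketch of what catalog tables for snapshots and data files could look like. The table and column names below are illustrative assumptions for this article, not the actual DuckLake schema; the real specification defines its own set of tables.

-- Toy metadata layout, for illustration only (not the DuckLake spec).
-- One row per snapshot; one row per Parquet file sitting in object storage.
CREATE TABLE snapshots (
    snapshot_id  BIGINT PRIMARY KEY,
    created_at   TIMESTAMP
);

CREATE TABLE data_files (
    file_id         BIGINT PRIMARY KEY,
    table_name      VARCHAR,
    begin_snapshot  BIGINT,   -- snapshot that added the file
    end_snapshot    BIGINT,   -- snapshot that removed it (NULL = still visible)
    file_path       VARCHAR,  -- e.g. 's3://bucket/data_files/part-0001.parquet'
    row_count       BIGINT,
    min_order_date  DATE,     -- per-file statistics used for pruning; a real design
    max_order_date  DATE      -- would track per-column statistics in their own table
);

Because this is all plain SQL, any transactional database can host it, and every catalog operation (creating a table, committing a snapshot, listing files) is an ordinary transaction against these tables.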
How It Works: A Practical Look at DuckLake
This architectural simplicity translates directly into a superior developer experience. Getting started with a local DuckLake is astonishingly easy.
The "Three-Step Program" to Your First Lakehouse
Here's how you can get a DuckLake up and running on your machine:
Step 1: Install DuckDB (a single, dependency-free binary)
curl https://install.duckdb.org | sh
Step 2: Start DuckDB
duckdb
Step 3: Attach a DuckLake database using a local file for metadata
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'data_files');
USE my_ducklake;
With just two commands, you have a fully functional lakehouse. Creating tables, inserting data, and running queries works exactly as you'd expect. For a cloud-based setup with MotherDuck, the ATTACH command is just as simple, pointing to your MotherDuck account.
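For instance, once the lakehouse is attached, everyday DuckDB SQL works unchanged. The table and data below are placeholders, but the flow is exactly what the three steps above set up:

-- Ordinary SQL against the attached DuckLake catalog.
CREATE TABLE orders (order_id INTEGER, customer VARCHAR, amount DECIMAL(10, 2));
INSERT INTO orders VALUES (1, 'acme', 129.90), (2, 'globex', 54.00);
SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
-- The inserted rows end up as Parquet files under the 'data_files' path,
-- while the metadata database records the tables, files, and snapshots.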
Key Technical Advantages
This design isn't just simpler; it's more powerful. By leveraging a real database for metadata, DuckLake unlocks capabilities that are difficult or impossible in other formats.
- High-Frequency Updates: Because transactions are handled by a battle-tested SQL database, DuckLake can support thousands of transactions per second. It can even cache small, frequent updates directly in the database before transparently flushing them to Parquet, making it ideal for streaming use cases without creating a storm of tiny files.
- Virtually Infinite Snapshots: In Iceberg, metadata for every historical snapshot lives in the metadata files, causing them to grow over time and requiring active maintenance to prune. In DuckLake, snapshots are simply rows in a table. This means you can have millions of snapshots without any performance degradation.
- Blazing-Fast Query Planning at Scale: Hannes and his team ran a benchmark on a petabyte-scale virtual DuckLake with 100 million snapshots. Even at this massive scale, the SQL query to identify the necessary Parquet files for a query—including all partition pruning—remained sub-second. A simplified sketch of such a planning query is shown below.
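To see why planning stays fast, remember that in this model file pruning is nothing more than a filtered query over the metadata tables. The sketch below reuses the hypothetical snapshots/data_files layout from earlier; it illustrates the idea and is not the actual DuckLake planning query:

-- Hypothetical planning query: which Parquet files must a scan of 'orders'
-- read as of snapshot 42, after pruning on per-file statistics?
SELECT file_path
FROM data_files
WHERE table_name = 'orders'
  AND begin_snapshot <= 42
  AND (end_snapshot IS NULL OR end_snapshot > 42)   -- file visible at snapshot 42
  AND max_order_date >= DATE '2025-01-01'           -- prune files outside the
  AND min_order_date <= DATE '2025-03-31';          -- queried date range
-- With ordinary indexes on these columns, a query like this stays fast even
-- when the snapshot and file tables hold millions of rows.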
The MotherDuck Advantage: The Best Hosted DuckLake
At MotherDuck, our mission is to make analytics with DuckDB easy, scalable, and delightful. DuckLake is a core part of that vision.
"Our intent is to be the best hosted Duck Lake. We hope there's other hosted Duck Lakes. We hope that everybody uses Duck Lake."
— Jordan Tigani [48:32]
We are building on DuckLake's powerful foundation to solve the "last mile" problems for developers. A key area is access management. By connecting to MotherDuck, you will be able to manage permissions to your lakehouse using familiar SQL grants. When a query is executed, MotherDuck can generate secure, time-bound signed URLs for the underlying Parquet files, ensuring that even with direct access to your data lake, security is never compromised.
A Call for Simplicity
DuckLake isn't just another standard. It's a fundamental simplification of the lakehouse architecture. It challenges the notion that we need to build complex, bespoke systems to manage what is ultimately structured data.
By leveraging the power, maturity, and transactional integrity of SQL databases for what they do best—managing metadata—DuckLake provides a more robust, more performant, and vastly more developer-friendly path forward. It's a return to first principles, demonstrating that the most elegant solution is often the most obvious one.
Ready to see for yourself? Get started with MotherDuck and DuckLake today. Experience the simplicity and power of a lakehouse built on SQL + Parquet.