DuckDB just introduced a new table format named DuckLake. If you work with data, you’ve probably heard about the "table format wars"—Iceberg and Delta Lake—over the past few years.
If you haven't, or if these terms are still confusing, don’t worry. I’ll start with a quick recap of what led to Iceberg and Delta Lake in the first place. Then we’ll dive into DuckLake with some practical code examples. The source code is available on GitHub.
And as always, if you're too lazy to read, you can also watch this content.
Table Format Recap
To understand table formats, we need to start with file formats like Parquet and Avro.
But first—why should we, as developers, even care about file formats? Aren’t databases supposed to handle storage for us?
Originally, databases were used for data engineering (and still are). But there were two main challenges with traditional OLAP databases:
Vendor lock-in: Data was often stored in proprietary formats, making migrations painful.
Scaling: Traditional databases weren’t always built to scale storage independently from compute.
That’s where decoupling compute from storage started to make sense. Instead of relying on a database engine to store everything, engineers started storing analytical data as files—mainly in open, columnar formats like Parquet—on object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage).
These formats are designed for heavy analytical queries, such as aggregations across billions of rows—unlike transactional databases like Postgres, which are optimized for row-by-row updates. Today, Parquet is a general standard supported by all cloud data warehouses (MotherDuck, BigQuery, Redshift, Snowflake, etc.) and compute engines (Polars, Apache Spark, etc.).
This architecture is what we call a data lake: raw Parquet files on blob storage, queried by compute engines of your choice—like Apache Spark, Dask, or, of course, DuckDB.
But there's a trade-off.
You lose database-like guarantees:
No atomicity: You can’t update a Parquet file in-place. They are immutable—you often have to rewrite the entire file.
No schema evolution: It’s hard to add or remove columns without manually tracking changes.
No time travel: You can’t easily query the state of data “as of yesterday.”
That’s where table formats come in. They sit on top of file formats like Parquet and add database-like features:
Metadata tracking (usually in JSON or Avro)
Snapshot isolation and time travel
Schema evolution
Partition pruning
These features are stored as separate metadata files in the same blob storage system.
However, this introduces new challenges:
You generate lots of small metadata files, which are "expensive" to read over networks like S3.
You often need an external catalog (like Unity or AWS Glue) to tell your query engine where the root folder of the table is and what versions exist.
Query engines must now perform multiple round trips just to resolve a query plan (see example below).
So while table formats brought huge improvements, they also introduced overhead and complexity—especially around metadata management.
DuckLake: A New Table Format
Enter DuckLake—a brand-new table format developed by the creators of DuckDB.
First of all: DuckLake is not tied to DuckDB, despite the name.
“DuckLake is not a DuckDB-specific format… it’s a convention of how to manage large tables on blob stores, in a sane way, using a database.” — Hannes Mühleisen, co-creator of DuckDB
So while today the easiest way to use DuckLake is through DuckDB, it’s not a technical requirement.
Second, unlike Iceberg or Delta Lake—where metadata is stored as files on blob storage—DuckLake stores metadata in a relational database.
Now you see why that earlier context was useful—we're kind of returning to a database architecture, to some extent.
That catalog database can be:
PostgreSQL or MySQL (preferred, especially for multi-user read/write)
DuckDB (great for local use or playgrounds)
SQLite (for multi-client local use)
You might wonder: if I can use DuckDB for the metastore, why would I use a transactional database like PostgreSQL?
Because these systems are designed to handle small, frequent updates with transactional guarantees. Metadata operations (like tracking versions, handling deletes, updating schemas) are small but frequent—and transactional databases are a great fit for that.
Also, the metadata is tiny—often less than 1% the size of the actual data. Storing it in a database avoids the overhead of scanning dozens of metadata files on blob storage.
While metadata is stored in a database, the data itself is still stored—like other table formats—as Parquet on the blob storage of your choice. Thanks to this architecture, DuckLake can be very fast.
Let’s take a quick example. If you want to query an Iceberg table, here are roughly the operations:
As you can see, there are a lot of round trips just to get the metadata before scanning the actual data. If you’re updating or reading a single row, that’s a huge overhead.
DuckLake flips the script. Since metadata lives in a database, a single SQL query can resolve everything—current snapshot, file list, schema, etc.—and you can then query the data. No more chasing dozens of files just to perform basic operations.
DuckLake supports nearly everything you’d expect from a modern lakehouse table format:
ACID transactions across multiple tables
Complex types like nested lists and structs
Full schema evolution (add/remove/change column types)
Metadata: Stored in SQL tables—on DuckDB, SQLite, but realistically Postgres or MySQL.
Data: Still in Parquet, on your blob storage.
DuckLake is not just "yet another table format"—it rethinks the metadata layer entirely.
Setting up DuckLake
Now that we’ve covered the background of table formats and introduced DuckLake, let’s get practical.
To run the next demo, you’ll need three components:
Data storage: an AWS S3 bucket with read/write access.
Metadata storage: a PostgreSQL database—we'll use a serverless free Supabase database.
Compute engine: any DuckDB client—we'll use the DuckDB CLI.
For the PostgreSQL database, Supabase is a great option. You can spin up a fully managed Postgres database in one minute. It has a generous free tier—just create an account, a project, and retrieve your connection parameters (IPv4-compatible).
You can install the DuckDB CLI with one command or through a package manager like homebrew on macOS.
Copy code
curl https://install.duckdb.org | sh
Creating your first DuckLake table
As a best practice, authenticate on AWS using:
Copy code
aws sso login
Once your AWS credentials are refreshed, create a DuckDB secret:
Copy code
CREATEOR REPLACE SECRET secret(
TYPE s3,
PROVIDER credential_chain
);
Also create a PostgreSQL secret using the connection information you retrieved from Supabase:
Copy code
CREATE SECRET(
TYPE postgres,
HOST '<your host>',
PORT 6543,
DATABASE postgres,
USER'<your user>',
PASSWORD '<your password>'
);
Now install the ducklake and postgres DuckDB extensions:
Copy code
INSTALL ducklake;
INSTALL postgres;
INFO: Extension Status Check
You can check the list of DuckDB extensions and their state (installed, loaded) using FROM duckdb_extension();.
Now create your DuckLake metastore using the ATTACH command:
And a first state of our data has been created. Now let's go to our Supabase UI through Table editor.
As we can see, a bunch of metadata tables has been created. For instance, we have also statistics about table and of course where the Parquet files are located. You can see the full schema definition of these tables on the documentation.
INFO: Querying Metadata Tables
You can also query these tables directly from the DuckDB client or the DuckDB UI. You can find them through SELECT database_name, table_name FROM duckdb_tables WHERE schema_name='public' and database_name LIKE '_ducklake_metadata%'; , database name will follow the pattern ducklake_metadata<your_catalog_name>
Now let’s alter the table by adding a new column—say we want to add a two-letter country code (iso2) in addition to the existing three-letter code (iso3):
DuckDB 1.3 has launched, with performance boosts, faster Parquet reads and writes, and new SQL syntax for ducking awesome analytics with full support in MotherDuck. Read on for highlights from this major release.
DuckDB news: DuckLake combines catalog and table format with ACID metadata in SQL. Radio extension adds WebSocket and Redis Pub/Sub. Top CSV benchmark results.