Understanding DuckLake: A Table Format with a Modern Architecture
2025/06/05
The Evolution from Databases to Table Formats
Modern data engineering has undergone a significant transformation in how analytical data is stored and processed. Traditional OLAP databases once handled both storage and compute, but this approach led to two major challenges: vendor lock-in through proprietary formats and the inability to scale storage independently from compute.
This limitation gave birth to the data lake architecture, where analytical data is stored as files (primarily in columnar formats like Parquet) on object storage systems such as AWS S3, Google Cloud Storage, or Azure Blob Storage. This decoupling allows any compute engine - Apache Spark, Trino, or DuckDB - to query the same data.
The Table Format Revolution
While storing data as Parquet files on blob storage provides flexibility, it sacrifices essential database features:
- No atomicity: Parquet files are immutable, so updates mean rewriting entire files, and there is no mechanism to commit those rewrites atomically
- No schema evolution: Adding or removing columns requires manual tracking
- No time travel: Querying historical states of data becomes complex
Table formats like Apache Iceberg and Delta Lake emerged to bridge this gap. They add a metadata layer on top of file formats, enabling:
- Metadata tracking (typically in JSON or Avro format)
- Snapshot isolation and time travel capabilities
- Schema evolution support
- Partition pruning optimization
However, these solutions introduce new complexities. They generate numerous small metadata files that are expensive to read over networks, and often require external catalogs like Unity Catalog or AWS Glue to track table locations and versions.
DuckLake: A Fresh Approach to Table Formats
DuckLake represents a fundamental rethink of table format architecture. Despite its name, DuckLake is not tied to DuckDB - it's an open standard for managing large tables on blob storage.
The Key Innovation: Database-Backed Metadata
Unlike Iceberg or Delta Lake, which store metadata as files on blob storage, DuckLake stores metadata in a relational database. This can be:
- DuckDB (ideal for local development)
- SQLite
- PostgreSQL (typical for production)
- MySQL
This architectural decision leverages what relational databases do best: handle small, frequent updates with transactional guarantees. Since metadata operations (tracking versions, handling deletes, updating schemas) are exactly this type of workload, a transactional database is the perfect fit.
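To make this concrete, here is a minimal sketch of attaching a DuckLake catalog from DuckDB with the ducklake extension. The file name, connection string, and bucket path are placeholders, and the PostgreSQL variant also assumes DuckDB's postgres extension is installed.

```sql
INSTALL ducklake;
LOAD ducklake;

-- Local development: metadata lives in a DuckDB file, data as Parquet files in a local folder.
ATTACH 'ducklake:metadata.ducklake' AS dev_lake (DATA_PATH 'lake_files/');

-- Production-style: metadata lives in PostgreSQL, data as Parquet files on S3.
-- Connection details and the bucket name are placeholders.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=db.example.com user=ducklake'
    AS prod_lake (DATA_PATH 's3://example-bucket/lake/');
```

Either way, the attached catalog behaves like a regular database from the query engine's point of view; only the location of the metadata changes.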
Performance Benefits
The metadata typically represents less than 1/100,000th of the actual data size. By storing it in a database, DuckLake eliminates the overhead of scanning dozens of metadata files on blob storage. A single SQL query can resolve all metadata operations - current snapshots, file lists, and more - dramatically reducing the round trips required for basic operations.
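As a rough illustration of what "metadata as SQL tables" means, the query below asks the catalog database which Parquet files back the latest snapshot of each table. The table and column names follow the DuckLake specification's metadata schema (ducklake_snapshot, ducklake_table, ducklake_data_file), but treat the details as a sketch rather than a reference.

```sql
-- Which Parquet files are visible in the most recent snapshot?
WITH latest AS (
    SELECT max(snapshot_id) AS snapshot_id FROM ducklake_snapshot
)
SELECT t.table_name, f.path, f.record_count, f.file_size_bytes
FROM ducklake_data_file AS f
JOIN ducklake_table AS t USING (table_id)
CROSS JOIN latest AS l
WHERE f.begin_snapshot <= l.snapshot_id
  AND (f.end_snapshot IS NULL OR f.end_snapshot > l.snapshot_id);
```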
DuckLake in Practice
Architecture Overview
DuckLake maintains a clear separation of concerns:
- Metadata: Stored in SQL tables within a relational database
- Data: Stored as Parquet files on blob storage (S3, Azure, GCS)
Key Features
DuckLake supports all the features expected from a modern lakehouse format; a couple of them are sketched in SQL after this list:
- ACID transactions across multiple tables
- Full schema evolution with column additions and updates
- Snapshot isolation and time travel queries
- Efficient metadata management through SQL
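Assuming a catalog attached as lake (as in the earlier sketch) with a hypothetical events table, schema evolution and time travel look roughly like this. The AT clause follows DuckDB's syntax; the version number and timestamp are placeholders.

```sql
-- Schema evolution: add a column without rewriting existing Parquet files.
ALTER TABLE lake.events ADD COLUMN country VARCHAR;

-- Time travel: read an earlier snapshot, by version or by timestamp.
SELECT count(*) FROM lake.events AT (VERSION => 3);
SELECT count(*) FROM lake.events AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');
```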
Practical Implementation
Setting up DuckLake requires three components, which the sketch after this list wires together:
- Data storage: A blob storage bucket (e.g., AWS S3) with read/write access
- Metadata storage: A PostgreSQL or MySQL database (services like Supabase work well)
- Compute engine: DuckDB or any compatible query engine
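Wiring the three components together might look like the following sketch. The S3 credentials, Supabase host, and bucket name are placeholders, and the httpfs and postgres extensions are assumed to be available alongside ducklake.

```sql
INSTALL ducklake;
INSTALL postgres;
INSTALL httpfs;
LOAD ducklake;

-- Credentials for the bucket that will hold the Parquet data files (placeholders).
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);

-- Metadata in PostgreSQL (e.g. a Supabase database), data on S3.
ATTACH 'ducklake:postgres:dbname=postgres host=db.example.supabase.co user=postgres password=...'
    AS lake (DATA_PATH 's3://example-ducklake-bucket/');
USE lake;
```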
When creating a DuckLake table, the system automatically generates metadata tables in the specified database while storing the actual data as Parquet files in the designated blob storage location. Updates to tables create new Parquet files and deletion markers, maintaining immutability while providing a mutable interface.
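The write path can be seen with a small hypothetical table; the names below are made up for illustration.

```sql
CREATE TABLE lake.events (id BIGINT, user_id BIGINT, kind VARCHAR);

INSERT INTO lake.events VALUES (1, 10, 'click'), (2, 11, 'view');
-- The inserted rows land in a new Parquet file under DATA_PATH, and the catalog
-- database records that file plus a new snapshot.

DELETE FROM lake.events WHERE id = 2;
-- No Parquet file is rewritten: the deletion is recorded separately (a small delete
-- file and a new snapshot), so the existing data files stay immutable.
```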
The Future of Table Formats
DuckLake's approach solves many of the metadata management challenges that plague current table formats. By leveraging proven relational database technology for metadata while maintaining open file formats for data, it offers a pragmatic solution to the complexities of modern data lakes.
While still in its early stages, DuckLake shows promise for organizations looking to simplify their data lake architecture without sacrificing the flexibility and scalability that made data lakes popular in the first place. As the ecosystem matures and more compute engines add support, DuckLake could become a compelling alternative to established formats like Iceberg and Delta Lake.