Understanding DuckLake: A Table Format with a Modern Architecture
2025/06/05
The Evolution from Databases to Table Formats
Modern data engineering has undergone a significant transformation in how analytical data is stored and processed. Traditional OLAP databases once handled both storage and compute, but this approach led to two major challenges: vendor lock-in through proprietary formats and the inability to scale storage independently from compute.
This limitation gave birth to the data lake architecture, where analytical data is stored as files (primarily in columnar formats like Parquet) on object storage systems such as AWS S3, Google Cloud Storage, or Azure Blob Storage. This decoupling allows any compute engine - Apache Spark, Trino, or DuckDB - to query the same data.
The Table Format Revolution
While storing data as Parquet files on blob storage provides flexibility, it sacrifices essential database features:
- No atomicity: Parquet files are immutable, so updates mean rewriting entire files, and there is no mechanism to commit those rewrites atomically
- No schema evolution: Adding or removing columns requires manual tracking
- No time travel: Querying historical states of data becomes complex
Table formats like Apache Iceberg and Delta Lake emerged to bridge this gap. They add a metadata layer on top of file formats, enabling:
- Metadata tracking (typically in JSON or Avro format)
- Snapshot isolation and time travel capabilities
- Schema evolution support
- Partition pruning optimization
However, these solutions introduce new complexities. They generate numerous small metadata files that are expensive to read over networks, and often require external catalogs like Unity Catalog or AWS Glue to track table locations and versions.
DuckLake: A Fresh Approach to Table Formats
DuckLake represents a fundamental rethink of table format architecture. Despite its name, DuckLake is not tied to DuckDB - it's an open standard for managing large tables on blob storage.
The Key Innovation: Database-Backed Metadata
Unlike Iceberg or Delta Lake, which store metadata as files on blob storage, DuckLake stores metadata in a relational database. This can be:
- DuckDB (ideal for local development)
- SQLite
- PostgreSQL (typical for production)
- MySQL
This architectural decision leverages what relational databases do best: handle small, frequent updates with transactional guarantees. Since metadata operations (tracking versions, handling deletes, updating schemas) are exactly this type of workload, a transactional database is the perfect fit.
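To make this concrete, here is a minimal sketch of attaching a DuckLake catalog from DuckDB with the ducklake extension. The file name, connection string, and bucket path are placeholders, and the PostgreSQL variant also assumes DuckDB's postgres extension is installed.

```sql
INSTALL ducklake;
LOAD ducklake;

-- Local development: metadata lives in a DuckDB file, data as Parquet files in a local folder.
ATTACH 'ducklake:metadata.ducklake' AS dev_lake (DATA_PATH 'lake_files/');

-- Production-style: metadata lives in PostgreSQL, data as Parquet files on S3.
-- Connection details and the bucket name are placeholders.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=db.example.com user=ducklake'
    AS prod_lake (DATA_PATH 's3://example-bucket/lake/');
```

Either way, the attached catalog behaves like a regular database from the query engine's point of view; only the location of the metadata changes.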
Performance Benefits
The metadata typically represents less than 1/100,000th of the actual data size. By storing it in a database, DuckLake eliminates the overhead of scanning dozens of metadata files on blob storage. A single SQL query can resolve all metadata operations - current snapshots, file lists, and more - dramatically reducing the round trips required for basic operations.
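As a rough illustration of what "metadata as SQL tables" means, the query below asks the catalog database which Parquet files back the latest snapshot of each table. The table and column names follow the DuckLake specification's metadata schema (ducklake_snapshot, ducklake_table, ducklake_data_file), but treat the details as a sketch rather than a reference.

```sql
-- Which Parquet files are visible in the most recent snapshot?
WITH latest AS (
    SELECT max(snapshot_id) AS snapshot_id FROM ducklake_snapshot
)
SELECT t.table_name, f.path, f.record_count, f.file_size_bytes
FROM ducklake_data_file AS f
JOIN ducklake_table AS t USING (table_id)
CROSS JOIN latest AS l
WHERE f.begin_snapshot <= l.snapshot_id
  AND (f.end_snapshot IS NULL OR f.end_snapshot > l.snapshot_id);
```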
DuckLake in Practice
Architecture Overview
DuckLake maintains a clear separation of concerns:
- Metadata: Stored in SQL tables within a relational database
- Data: Stored as Parquet files on blob storage (S3, Azure, GCS)
Key Features
DuckLake supports all the features expected from a modern lakehouse format; a couple of them are sketched in SQL after this list:
- ACID transactions across multiple tables
- Full schema evolution with column additions and updates
- Snapshot isolation and time travel queries
- Efficient metadata management through SQL
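Assuming a catalog attached as lake (as in the earlier sketch) with a hypothetical events table, schema evolution and time travel look roughly like this. The AT clause follows DuckDB's syntax; the version number and timestamp are placeholders.

```sql
-- Schema evolution: add a column without rewriting existing Parquet files.
ALTER TABLE lake.events ADD COLUMN country VARCHAR;

-- Time travel: read an earlier snapshot, by version or by timestamp.
SELECT count(*) FROM lake.events AT (VERSION => 3);
SELECT count(*) FROM lake.events AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');
```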
Practical Implementation
Setting up DuckLake requires three components, which the sketch after this list wires together:
- Data storage: A blob storage bucket (e.g., AWS S3) with read/write access
- Metadata storage: A PostgreSQL or MySQL database (services like Supabase work well)
- Compute engine: DuckDB or any compatible query engine
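Wiring the three components together might look like the following sketch. The S3 credentials, Supabase host, and bucket name are placeholders, and the httpfs and postgres extensions are assumed to be available alongside ducklake.

```sql
INSTALL ducklake;
INSTALL postgres;
INSTALL httpfs;
LOAD ducklake;

-- Credentials for the bucket that will hold the Parquet data files (placeholders).
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);

-- Metadata in PostgreSQL (e.g. a Supabase database), data on S3.
ATTACH 'ducklake:postgres:dbname=postgres host=db.example.supabase.co user=postgres password=...'
    AS lake (DATA_PATH 's3://example-ducklake-bucket/');
USE lake;
```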
When creating a DuckLake table, the system automatically generates metadata tables in the specified database while storing the actual data as Parquet files in the designated blob storage location. Updates to tables create new Parquet files and deletion markers, maintaining immutability while providing a mutable interface.
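The write path can be seen with a small hypothetical table; the names below are made up for illustration.

```sql
CREATE TABLE lake.events (id BIGINT, user_id BIGINT, kind VARCHAR);

INSERT INTO lake.events VALUES (1, 10, 'click'), (2, 11, 'view');
-- The inserted rows land in a new Parquet file under DATA_PATH, and the catalog
-- database records that file plus a new snapshot.

DELETE FROM lake.events WHERE id = 2;
-- No Parquet file is rewritten: the deletion is recorded separately (a small delete
-- file and a new snapshot), so the existing data files stay immutable.
```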
The Future of Table Formats
DuckLake's approach solves many of the metadata management challenges that plague current table formats. By leveraging proven relational database technology for metadata while maintaining open file formats for data, it offers a pragmatic solution to the complexities of modern data lakes.
While still in its early stages, DuckLake shows promise for organizations looking to simplify their data lake architecture without sacrificing the flexibility and scalability that made data lakes popular in the first place. As the ecosystem matures and more compute engines add support, DuckLake could become a compelling alternative to established formats like Iceberg and Delta Lake.