DuckLake
In us-east-1, MotherDuck supports DuckLake from client version 1.3.2; in eu-central-1, from client version 1.4.1. Preview features may be operationally incomplete and may offer limited backward compatibility.
DuckLake is an integrated data lake and catalog format. By storing data as Parquet files and metadata in a SQL database, DuckLake delivers advanced data lake features without the complexity of a traditional lakehouse.
MotherDuck provides two main options for creating and integrating with DuckLake databases:
- Fully managed: Create a DuckLake database where MotherDuck manages both data storage and metadata
- Bring your own bucket: Connect your own S3-compatible object storage for data storage with:
- MotherDuck compute + MotherDuck catalog: Use MotherDuck for both compute and catalog services
- Own compute + MotherDuck catalog: Use your own DuckDB client for compute while MotherDuck provides catalog services
Creating a fully managed DuckLake database
Create a fully managed DuckLake with the following command:
CREATE DATABASE my_ducklake (TYPE DUCKLAKE);
MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.
The my_ducklake database can be accessed like any other MotherDuck database.
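For example, you can create and query a table in it directly; the table and values below are purely illustrative:
CREATE TABLE my_ducklake.main.events (id INTEGER, event VARCHAR);
INSERT INTO my_ducklake.main.events VALUES (1, 'created'), (2, 'updated');
SELECT * FROM my_ducklake.main.events;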
Data Inlining
Data inlining is currently experimental and requires explicit enablement using the DATA_INLINING_ROW_LIMIT parameter during database creation.
Data inlining is an optimization feature that stores small data changes directly in the metadata catalog rather than creating individual Parquet files for every insert operation. This eliminates the overhead of creating small Parquet files while maintaining full query and update capabilities.
How Data Inlining Works
When you enable data inlining with the DATA_INLINING_ROW_LIMIT parameter, any insert operation writing fewer rows than your specified threshold is automatically stored as inlined data in the metadata catalog. Larger inserts continue to use Parquet files as usual.
For example, if you set DATA_INLINING_ROW_LIMIT to 100, inserts with fewer than 100 rows are stored inline, while inserts with 100 or more rows create Parquet files.
Creating a DuckLake Database with Data Inlining
To create a (fully managed) DuckLake database with data inlining enabled:
CREATE DATABASE my_ducklake (
TYPE DUCKLAKE,
DATA_INLINING_ROW_LIMIT 100
);
This configuration will inline all inserts with fewer than 100 rows directly into the metadata catalog.
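As an illustration (the sensor_readings table is hypothetical), a small insert like the first statement below would be inlined into the catalog, while the second, larger insert would still produce a Parquet file:
CREATE TABLE my_ducklake.main.sensor_readings (id INTEGER, ts TIMESTAMP, temperature DOUBLE);
-- Fewer than 100 rows: stored inline in the metadata catalog
INSERT INTO my_ducklake.main.sensor_readings VALUES
  (1, '2025-01-01 00:00:00', 21.5),
  (2, '2025-01-01 00:01:00', 21.7);
-- 100 or more rows: written to a Parquet file as usual
INSERT INTO my_ducklake.main.sensor_readings
SELECT i, TIMESTAMP '2025-01-01' + i * INTERVAL 1 MINUTE, random() FROM range(1000) t(i);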
Flushing Inlined Data
You can manually convert inlined data to Parquet files using the ducklake_flush_inlined_data function.
-- Flush inlined data for a specific table
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema.my_table');
-- Flush all inlined data in a schema
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema');
-- Flush all inlined data in the database
SELECT ducklake_flush_inlined_data('my_ducklake');
When to Use Data Inlining
Data inlining is particularly beneficial for:
- High-frequency, small-batch inserts: Applications with streaming data or frequent small updates
- Incremental data loading: ETL processes that append small batches of data regularly
- Transactional workloads: Systems where individual transactions insert small numbers of rows
Bring your own bucket
You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own S3-compatible object store (such as AWS S3, GCS, Cloudflare R2, and Tigris) for data storage. Additionally, you can bring your own compute (BYOC) using your DuckDB client to query and write data directly to your DuckLake.
Setup
Configure a custom data path when creating your DuckLake to use your own S3-compatible object storage:
MotherDuck is currently available on AWS in two regions: US East (N. Virginia, us-east-1) and Europe (Frankfurt, eu-central-1). For optimal performance and cost, we recommend using an S3 bucket in the same region as your MotherDuck Organization.
CREATE DATABASE my_ducklake (
TYPE DUCKLAKE,
DATA_PATH 's3://mybucket/my_optional_path/'
);
Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket:
CREATE SECRET my_secret IN MOTHERDUCK (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region',
SCOPE 'my-bucket-path'
);
For service workloads like ETL, static keys are typically required. For queries in the DuckDB UI or CLI, we recommend using aws sso login to generate temporary credentials. See Create Secrets: Amazon S3 for details.
You can then create DuckLake tables as you would with a standard DuckDB database using either MotherDuck or local compute as shown in the examples below.
Required IAM Permissions for DuckLake
The minimum required IAM permissions are:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "${s3_bucket_arn}"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "${s3_bucket_arn}/*"
    }
  ]
}
Using MotherDuck compute
Connect to MotherDuck:
./duckdb md:
Create your first DuckLake table from a hosted Parquet file:
CREATE TABLE my_ducklake.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
Query using MotherDuck:
SELECT
year,
AVG(pm25_concentration) AS avg_pm25,
AVG(pm10_concentration) AS avg_pm10,
AVG(no2_concentration) AS avg_no2
FROM my_ducklake.air_quality
WHERE city = 'Berlin'
GROUP BY year
ORDER BY year DESC;
Using own compute
To use your own compute (e.g., your DuckDB client), you must:
- Ensure you have appropriate S3-compatible credentials in your compute environment to read/write to your defined DATA_PATH (specified at database creation)
- Attach the metadata database
Create a secret in your compute environment if you have authenticated using aws sso login:
CREATE OR REPLACE SECRET my_secret (
TYPE S3,
PROVIDER credential_chain
);
Alternatively, provide static AWS keys:
CREATE SECRET my_secret (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region',
SCOPE 'my-bucket-path'
);
Attach the metadata database to your DuckDB session:
ATTACH 'ducklake:md:__ducklake_metadata_<database_name>' AS <database_alias>;
Every DuckLake database in MotherDuck has a corresponding metadata database that stores internal state, including schema definitions, snapshots, file mappings, and more.
Create a table using your own compute:
CREATE TABLE <alias>.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
With this configuration, your own compute can directly access or write data to your DuckLake (assuming appropriate credentials are configured). Data uploaded via your own compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.
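For example, after creating the table above from your own compute, you can verify it from a MotherDuck connection (assuming the DuckLake database is named my_ducklake):
-- Run from a MotherDuck session (e.g., ./duckdb md:)
SELECT count(*) FROM my_ducklake.air_quality;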
Performing metadata operations on a DuckLake
DuckLake databases provide additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and your own compute environments. For example, you can list the snapshots backing your DuckLake.
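As a sketch, snapshots can be listed with the DuckLake extension's snapshots() table function (available function names and signatures may vary with your DuckLake extension version):
-- Each row describes one snapshot of the DuckLake database
SELECT * FROM my_ducklake.snapshots();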
Current limitations
- Limited sharing options: Read-only sharing is supported through the existing share functionality, restricted to auto-update shares only
- Single-account write access: Write permissions are currently limited to one account per database. This account can perform multiple concurrent writes, as long as they are append-only. If multiple queries attempt to update or delete from the same table concurrently, only the first to commit will succeed. Concurrent DDL operations are also not allowed. Support for multi-account write access is planned for a future release.
To support multiple concurrent readers of a MotherDuck DuckLake database, you can create a read scaling token.
Data file maintenance is not automatically performed by MotherDuck. You can manually trigger these maintenance functions as needed from either MotherDuck or your own compute environments.
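As a sketch, assuming the maintenance functions provided by the DuckLake extension (names and parameters may differ across versions), compaction and snapshot expiry could be triggered like this:
-- Merge small adjacent data files into larger Parquet files
CALL ducklake_merge_adjacent_files('my_ducklake');
-- Expire snapshots older than one week so their data files can later be cleaned up
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL 7 DAY);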