
DuckLake

⚠️Preview Feature

DuckLake is in preview. The file format and related metadata structure are subject to change. Consider your workload requirements carefully and test thoroughly before using with important data.

DuckLake is an integrated data lake and catalog format. It delivers advanced data lake features without traditional lakehouse complexity by storing data in Parquet files and metadata in a SQL database.

MotherDuck provides two main options for creating and working with DuckLake databases:

Creating a fully managed DuckLake database

Create a fully managed DuckLake with the following command:

CREATE DATABASE my_ducklake (TYPE DUCKLAKE);

MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.

The my_ducklake database can be accessed like any other MotherDuck database.
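For example, you can create and query tables in it directly (a minimal sketch; the table name and values are illustrative):

CREATE TABLE my_ducklake.birds (id INTEGER, name TEXT);
INSERT INTO my_ducklake.birds VALUES (1, 'mallard'), (2, 'teal');
SELECT * FROM my_ducklake.birds;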

Bring your own bucket

You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own S3-compatible object store (such as AWS S3, GCS, Cloudflare R2, and Tigris) for data storage. Additionally, you can use local compute (your DuckDB client) to query and write data directly to your DuckLake.

Setup

Configure a custom data path when creating your DuckLake to use your own S3-compatible object storage:

warning

MotherDuck operates in the AWS us-east-1 region. For optimal performance and to avoid cross-region transfer costs, we highly recommend using an S3 bucket in the us-east-1 region.

CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_PATH 's3://mybucket/my_optional_path/'
);

Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket:

CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'my_s3_access_key',
    SECRET 'my_s3_secret_key',
    REGION 'my-bucket-region'
);

info

For service workloads like ETL, static keys are typically required. For queries in the DuckDB UI or CLI, we recommend using aws sso login to generate temporary credentials. See Create Secrets: Amazon S3 for details.

You can then create DuckLake tables as you would in a standard DuckDB database, using either MotherDuck or local compute, as shown in the examples below.

Using MotherDuck compute

Connect to MotherDuck:

./duckdb md:
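Once connected, a quick way to confirm that your DuckLake appears alongside your other databases (output columns may vary by DuckDB version):

SHOW DATABASES;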

Create your first DuckLake table from a hosted Parquet file:

CREATE TABLE my_ducklake.air_quality AS 
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';

Query using MotherDuck:

SELECT
    year,
    AVG(pm25_concentration) AS avg_pm25,
    AVG(pm10_concentration) AS avg_pm10,
    AVG(no2_concentration) AS avg_no2
FROM my_ducklake.air_quality
WHERE city = 'Berlin'
GROUP BY year
ORDER BY year DESC;

Using local compute

To use local compute (e.g., your DuckDB client), you must:

  1. Ensure you have appropriate S3-compatible credentials locally to read/write to your defined DATA_PATH (specified at database creation)
  2. Attach the metadata database

Create a local secret if you have authenticated using aws sso login:

CREATE OR REPLACE SECRET my_secret (
    TYPE S3,
    PROVIDER credential_chain
);

Alternatively, provide static AWS keys:

CREATE SECRET my_secret (
    TYPE S3,
    KEY_ID 'my_s3_access_key',
    SECRET 'my_s3_secret_key',
    REGION 'my-bucket-region'
);
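Either way, you can verify that the secret is registered in your local session using DuckDB's built-in duckdb_secrets() table function (sensitive fields are redacted in the output):

SELECT * FROM duckdb_secrets();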

Attach the metadata database to your DuckDB session:

ATTACH 'ducklake:md:__ducklake_metadata_<database_name>' AS <alias>;

Every DuckLake database in MotherDuck has a corresponding metadata database that stores internal state, including schema definitions, snapshots, file mappings, and more.
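For example, to attach the metadata database for the my_ducklake database created above (the alias is your choice):

ATTACH 'ducklake:md:__ducklake_metadata_my_ducklake' AS my_ducklake;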

Create a table in your DuckLake from your local client:

CREATE TABLE <alias>.air_quality AS 
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';

With this configuration, local compute can directly access or write data to your DuckLake (assuming appropriate credentials are configured). Data uploaded via local compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.
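For instance, after the local write above, the same table is visible from a MotherDuck connection (a quick sanity check, assuming the air_quality table created earlier):

SELECT count(*) FROM my_ducklake.air_quality;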

Performing metadata operations on a DuckLake

DuckLake databases provide additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and local compute environments. For example, you can list the snapshots backing your DuckLake.
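A sketch of listing snapshots, assuming the ducklake_snapshots() table function from the DuckLake extension (the argument is the attached catalog name; the exact interface may change while DuckLake is in preview):

SELECT * FROM ducklake_snapshots('my_ducklake');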

Current limitations

MotherDuck-managed DuckLakes currently do not support:

Additionally, write access is currently limited to a single account (duckling) per database. Support for multi-account write access is planned for a future release.

info

To support multiple concurrent readers of a MotherDuck DuckLake database, you can create a read scaling token.

Data file maintenance is not automatically performed by MotherDuck. You can manually trigger these maintenance functions as needed from either MotherDuck or local compute environments.
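As a sketch of what manual maintenance might look like, assuming the DuckLake extension's ducklake_expire_snapshots and ducklake_merge_adjacent_files functions (names and parameters follow the DuckLake documentation at the time of writing and may change during the preview):

-- Expire snapshots older than one week, then compact small adjacent data files.
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL 1 WEEK);
CALL ducklake_merge_adjacent_files('my_ducklake');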