DuckLake
DuckLake is in preview. The file format and related metadata structure are subject to change. Consider your workload requirements carefully and test thoroughly before using it with important data.
DuckLake is an integrated data lake and catalog format. DuckLake delivers advanced data lake features without traditional lakehouse complexity by using Parquet files and a SQL database.
MotherDuck provides two main options for creating and integrating with DuckLake databases:
- Fully managed: Create a DuckLake database where MotherDuck manages both data storage and metadata
- Bring your own bucket: Connect your own S3-compatible object storage for data storage, with either:
  - MotherDuck compute + MotherDuck catalog: Use MotherDuck for both compute and catalog services
  - Local compute + MotherDuck catalog: Use a local DuckDB client for compute while MotherDuck provides catalog services
Creating a fully managed DuckLake database
Create a fully managed DuckLake with the following command:
CREATE DATABASE my_ducklake (TYPE DUCKLAKE);
MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.
The my_ducklake database can be accessed like any other MotherDuck database.
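For example, you can create and query tables in it directly (the table name below is purely illustrative):
-- Create an illustrative table and read it back
CREATE TABLE my_ducklake.greetings AS SELECT 'hello, ducklake' AS msg;
SELECT * FROM my_ducklake.greetings;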
Bring your own bucket
You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own S3-compatible object store (such as AWS S3, GCS, Cloudflare R2, and Tigris) for data storage. Additionally, you can use local compute (your DuckDB client) to query your DuckLake and write data to it directly.
Setup
Configure a custom data path when creating your DuckLake to use your own S3-compatible object storage:
MotherDuck operates in the AWS us-east-1 region. For optimal performance and to avoid cross-region transfer costs, we highly recommend using an S3 bucket in the us-east-1 region.
CREATE DATABASE my_ducklake (
TYPE DUCKLAKE,
DATA_PATH 's3://mybucket/my_optional_path/'
);
Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket:
CREATE SECRET my_secret IN MOTHERDUCK (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region'
);
For service workloads like ETL, static keys are typically required. For queries in the DuckDB UI or CLI, we recommend using aws sso login to generate temporary credentials. See Create Secrets: Amazon S3 for details.
You can then create DuckLake tables as you would with a standard DuckDB database, using either MotherDuck or local compute as shown in the examples below.
Using MotherDuck compute
Connect to MotherDuck:
./duckdb md:
Create your first DuckLake table from a hosted Parquet file:
CREATE TABLE my_ducklake.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
Query using MotherDuck:
SELECT
year,
AVG(pm25_concentration) AS avg_pm25,
AVG(pm10_concentration) AS avg_pm10,
AVG(no2_concentration) AS avg_no2
FROM my_ducklake.air_quality
WHERE city = 'Berlin'
GROUP BY year
ORDER BY year DESC;
Using local compute
To use local compute (e.g., your DuckDB client), you must:
- Ensure you have appropriate S3-compatible credentials locally to read/write to your defined DATA_PATH (specified at database creation)
- Attach the metadata database
Create a local secret if you have authenticated using aws sso login:
CREATE OR REPLACE SECRET my_secret (
TYPE S3,
PROVIDER credential_chain
);
Alternatively, provide static AWS keys:
CREATE SECRET my_secret (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region'
);
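To verify that a secret is registered in your local session, you can list it with DuckDB's built-in duckdb_secrets() table function (secret values are redacted in the output):
-- List locally registered secrets
SELECT name, type, provider FROM duckdb_secrets();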
Attach the metadata database to your DuckDB session:
ATTACH 'ducklake:md:__ducklake_metadata_<database_name>' AS <alias>;
Every DuckLake database in MotherDuck has a corresponding metadata database that stores internal state, including schema definitions, snapshots, file mappings, and more.
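For example, for the my_ducklake database created earlier, the statement could look like this (the alias is your choice; my_ducklake is used here for consistency):
ATTACH 'ducklake:md:__ducklake_metadata_my_ducklake' AS my_ducklake;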
Create a table in your DuckLake using local compute:
CREATE TABLE <alias>.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
With this configuration, local compute can read from and write to your DuckLake directly (assuming appropriate credentials are configured). Data written via local compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.
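For example, after the write above, the table should be queryable from a MotherDuck connection (assuming the database was created as my_ducklake):
-- Verify the locally written table from MotherDuck
SELECT COUNT(*) AS row_count FROM my_ducklake.air_quality;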
Performing metadata operations on a DuckLake
DuckLake databases provide additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and local compute environments. For example, you can list the snapshots backing your DuckLake.
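As a sketch, assuming the ducklake_snapshots() table function exposed by the DuckLake extension at the time of writing (the available functions may change while DuckLake is in preview):
-- Each committed write creates a new snapshot
SELECT * FROM ducklake_snapshots('my_ducklake');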
Current limitations
MotherDuck-managed DuckLakes currently do not support:
- Time travel
- Data inlining
- Shares — you cannot create MotherDuck shares from a DuckLake database, but support is forthcoming
Additionally, write access is currently limited to a single account (duckling) per database. Support for multi-account write access is planned for a future release.
To support multiple concurrent readers on a MotherDuck DuckLake database, you can create a read scaling token.
Data file maintenance is not automatically performed by MotherDuck. You can manually trigger these maintenance functions as needed from either MotherDuck or local compute environments.
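As a sketch, using the maintenance functions exposed by the DuckLake extension at the time of writing (names and signatures may change while DuckLake is in preview):
-- Compact many small data files into fewer, larger ones
CALL ducklake_merge_adjacent_files('my_ducklake');
-- Expire snapshots older than a week, then delete the data files they referenced
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL '7 days');
CALL ducklake_cleanup_old_files('my_ducklake', cleanup_all => true);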