DuckLake
In us-east-1, MotherDuck supports DuckLake from client version 1.3.2; in eu-central-1, from client version 1.4.1. Preview features may be operationally incomplete and may offer limited backward compatibility.
DuckLake is an integrated data lake and catalog format. By storing data as Parquet files and metadata in a SQL database, DuckLake delivers advanced data lake features without the complexity of a traditional lakehouse.
MotherDuck provides two main options for creating and integrating with DuckLake databases:
- Fully managed: Create a DuckLake database where MotherDuck manages both data storage and metadata
- Bring your own bucket: Connect your own S3-compatible object storage for data storage with:
- MotherDuck compute + MotherDuck catalog: Use MotherDuck for both compute and catalog services
- Own compute + MotherDuck catalog: Use your own DuckDB client for compute while MotherDuck provides catalog services
Creating a fully managed DuckLake database
Create a fully managed DuckLake with the following command:
CREATE DATABASE my_ducklake (TYPE DUCKLAKE);
MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.
The my_ducklake database can be accessed like any other MotherDuck database.
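For example, you can create and query a table in it directly; the table and values below are purely illustrative:
CREATE TABLE my_ducklake.main.events (id INTEGER, event VARCHAR);
INSERT INTO my_ducklake.main.events VALUES (1, 'created'), (2, 'updated');
SELECT * FROM my_ducklake.main.events;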
Data Inlining
Data inlining is currently experimental and requires explicit enablement using the DATA_INLINING_ROW_LIMIT parameter during database creation.
Data inlining is an optimization feature that stores small data changes directly in the metadata catalog rather than creating individual Parquet files for every insert operation. This eliminates the overhead of creating small Parquet files while maintaining full query and update capabilities.
How Data Inlining Works
When you enable data inlining with the DATA_INLINING_ROW_LIMIT parameter, any insert operation writing fewer rows than your specified threshold is automatically stored as inlined data in the metadata catalog. Larger inserts continue to use Parquet files as usual.
For example, if you set DATA_INLINING_ROW_LIMIT to 100, inserts with fewer than 100 rows are stored inline, while inserts with 100 or more rows create Parquet files.
Creating a DuckLake Database with Data Inlining
To create a (fully managed) DuckLake database with data inlining enabled:
CREATE DATABASE my_ducklake (
TYPE DUCKLAKE,
DATA_INLINING_ROW_LIMIT 100
);
This configuration will inline all inserts with fewer than 100 rows directly into the metadata catalog.
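As an illustration (the sensor_readings table is hypothetical), a small insert like the first statement below would be inlined into the catalog, while the second, larger insert would still produce a Parquet file:
CREATE TABLE my_ducklake.main.sensor_readings (id INTEGER, ts TIMESTAMP, temperature DOUBLE);
-- Fewer than 100 rows: stored inline in the metadata catalog
INSERT INTO my_ducklake.main.sensor_readings VALUES
  (1, '2025-01-01 00:00:00', 21.5),
  (2, '2025-01-01 00:01:00', 21.7);
-- 100 or more rows: written to a Parquet file as usual
INSERT INTO my_ducklake.main.sensor_readings
SELECT i, TIMESTAMP '2025-01-01' + i * INTERVAL 1 MINUTE, random() FROM range(1000) t(i);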
Flushing Inlined Data
You can manually convert inlined data to Parquet files using the ducklake_flush_inlined_data function.
-- Flush inlined data for a specific table
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema.my_table');
-- Flush all inlined data in a schema
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema');
-- Flush all inlined data in the database
SELECT ducklake_flush_inlined_data('my_ducklake');
When to Use Data Inlining
Data inlining is particularly beneficial for:
- High-frequency, small-batch inserts: Applications with streaming data or frequent small updates
- Incremental data loading: ETL processes that append small batches of data regularly
- Transactional workloads: Systems where individual transactions insert small numbers of rows
Bring your own bucket
You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own S3-compatible object store (such as AWS S3, GCS, Cloudflare R2, and Tigris) for data storage. Additionally, you can bring your own compute (BYOC) using your DuckDB client to query and write data directly to your DuckLake.
Setup
Configure a custom data path when creating your DuckLake to use your own S3-compatible object storage:
MotherDuck is currently available on AWS in two regions: US East (N. Virginia, us-east-1) and Europe (Frankfurt, eu-central-1). For optimal performance and cost, we recommend using an S3 bucket in the same region as your MotherDuck Organization.
CREATE DATABASE my_ducklake (
TYPE DUCKLAKE,
DATA_PATH 's3://mybucket/my_optional_path/'
);
Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket:
CREATE SECRET my_secret IN MOTHERDUCK (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region',
SCOPE 'my-bucket-path'
);
For service workloads like ETL, static keys are typically required. For queries in the DuckDB UI or CLI, we recommend using aws sso login to generate temporary credentials. See Create Secrets: Amazon S3 for details.
You can then create DuckLake tables as you would with a standard DuckDB database using either MotherDuck or local compute as shown in the examples below.
Required IAM Permissions for DuckLake
The minimum required IAM permissions are:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "${s3_bucket_arn}"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "${s3_bucket_arn}/*"
    }
  ]
}
Using MotherDuck compute
Connect to MotherDuck:
./duckdb md:
Create your first DuckLake table from a hosted Parquet file:
CREATE TABLE my_ducklake.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
Query using MotherDuck:
SELECT
year,
AVG(pm25_concentration) AS avg_pm25,
AVG(pm10_concentration) AS avg_pm10,
AVG(no2_concentration) AS avg_no2
FROM my_ducklake.air_quality
WHERE city = 'Berlin'
GROUP BY year
ORDER BY year DESC;
Using own compute
To use your own compute (e.g., your DuckDB client), you must:
- Ensure you have appropriate S3-compatible credentials in your compute environment to read/write to your defined DATA_PATH (specified at database creation)
- Attach the metadata database
Create a secret in your compute environment if you have authenticated using aws sso login:
CREATE OR REPLACE SECRET my_secret (
TYPE S3,
PROVIDER credential_chain
);
Alternatively, provide static AWS keys:
CREATE SECRET my_secret (
TYPE S3,
KEY_ID 'my_s3_access_key',
SECRET 'my_s3_secret_key',
REGION 'my-bucket-region',
SCOPE 'my-bucket-path'
);
Attach the metadata database to your DuckDB session:
ATTACH 'ducklake:md:__ducklake_metadata_<database_name>' AS <database_alias>;
Every DuckLake database in MotherDuck has a corresponding metadata database that stores internal state, including schema definitions, snapshots, file mappings, and more.
Create a table using your own compute:
CREATE TABLE <alias>.air_quality AS
SELECT * FROM 's3://us-prd-motherduck-open-datasets/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
With this configuration, your own compute can directly access or write data to your DuckLake (assuming appropriate credentials are configured). Data uploaded via your own compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.
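For example, after creating the table above from your own compute, you can verify it from a MotherDuck connection (assuming the DuckLake database is named my_ducklake):
-- Run from a MotherDuck session (e.g., ./duckdb md:)
SELECT count(*) FROM my_ducklake.air_quality;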
Performing metadata operations on a DuckLake
DuckLake databases provide additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and your own compute environments. For example, you can list the snapshots backing your DuckLake.
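As a sketch, snapshots can be listed with the DuckLake extension's snapshots() table function (available function names and signatures may vary with your DuckLake extension version):
-- Each row describes one snapshot of the DuckLake database
SELECT * FROM my_ducklake.snapshots();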
Current limitations
- Limited sharing options: Read-only sharing is supported through the existing share functionality, restricted to auto-update shares only
- Single-account write access: Write permissions are currently limited to one account per database. This account can perform multiple concurrent writes, as long as they are append-only. If multiple queries attempt to update or delete from the same table concurrently, only the first to commit will succeed. Concurrent DDL operations are also not allowed. Support for multi-account write access is planned for a future release.
To support multiple concurrent readers of a MotherDuck DuckLake database, you can create a read scaling token.
Data file maintenance is not automatically performed by MotherDuck. You can manually trigger these maintenance functions as needed from either MotherDuck or your own compute environments.
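As a sketch, assuming the maintenance functions provided by the DuckLake extension (names and parameters may differ across versions), compaction and snapshot expiry could be triggered like this:
-- Merge small adjacent data files into larger Parquet files
CALL ducklake_merge_adjacent_files('my_ducklake');
-- Expire snapshots older than one week so their data files can later be cleaned up
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL 7 DAY);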