---
sidebar_position: 1
title: DuckLake
description: Use DuckLake for transactional lakehouse analytics with MotherDuck-managed metadata and object storage.
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import Admonition from '@theme/Admonition';
import Versions from '@site/src/components/Versions';

::::warning[DuckLake Active Development]
DuckLake is on version 0.4. Functionality is expected to change without notice [until v1.0 is released later in 2026](https://ducklake.select/release_calendar).
::::

::::note
MotherDuck supports DuckDB <Versions region="global" bound="max" />. In **US East (N. Virginia) -** `us-east-1`, MotherDuck is compatible with client versions <Versions region="us-east-1" bound="min" /> through <Versions region="us-east-1" bound="max" />. In **Europe (Frankfurt) -** `eu-central-1`, MotherDuck supports client version <Versions region="eu-central-1" bound="min" /> through <Versions region="eu-central-1" bound="max" />.
::::

[DuckLake](https://ducklake.select) is an integrated data lake and catalog format. DuckLake delivers advanced data lake features without traditional lakehouse complexity by storing table data as Parquet files and metadata in a SQL database.

MotherDuck provides two main options for creating and integrating with DuckLake databases:

- **[Fully managed](#creating-a-fully-managed-ducklake-database)**: Create a DuckLake database where MotherDuck manages both data storage and metadata
- **[Bring your own bucket (BYOB)](#bring-your-own-bucket)**: Connect your own S3 or R2 bucket for data storage with:
  - **[MotherDuck compute + MotherDuck catalog](#using-motherduck-compute)**: Use MotherDuck for both compute and catalog services
  - **[Own compute + MotherDuck catalog](#using-own-compute)**: Use your own DuckDB client for compute while MotherDuck provides catalog services

## Creating a fully managed DuckLake database

Create a fully managed DuckLake with the following command:

```sql
CREATE DATABASE my_ducklake (TYPE DUCKLAKE);
```

MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.

The `my_ducklake` database can be accessed like any other MotherDuck database.
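For example, you can create and query tables in it directly. This is a minimal sketch; the `events` table and its columns are purely illustrative:

```sql
CREATE TABLE my_ducklake.main.events (id INTEGER, name VARCHAR);
INSERT INTO my_ducklake.main.events VALUES (1, 'first'), (2, 'second');
SELECT * FROM my_ducklake.main.events;
```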

You can attach the DuckLake metadata with:

```sql
ATTACH 'md:__ducklake_metadata_<database_name>' AS <database_alias>;
```

::::note
The metadata database can only be attached by the database owner.
::::

## Data inlining

Data inlining is an optimization feature that stores small data changes directly in the metadata catalog rather than creating individual Parquet files for every insert operation. This eliminates the overhead of creating small Parquet files while maintaining full query and update capabilities.

### How data inlining works

Starting with DuckLake 0.4, data inlining is **enabled by default** with a threshold of 10 rows. Any insert writing fewer than 10 rows is automatically stored inline in the metadata catalog rather than creating a Parquet file.

You can customize the threshold with the `DATA_INLINING_ROW_LIMIT` parameter. For example, if you set it to 100, inserts with fewer than 100 rows are stored inline, while inserts with 100 or more rows create Parquet files. Set it to 0 to disable inlining.

### Creating a DuckLake database with custom inlining

To create a (fully managed) DuckLake database with a custom inlining threshold:

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_INLINING_ROW_LIMIT 100
);
```

This configuration will inline all inserts with fewer than 100 rows directly into the metadata catalog.
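As a sketch of the resulting behavior (the `events` table below is hypothetical), a small insert stays in the catalog while a large one produces a Parquet file:

```sql
-- Fewer than 100 rows: stored inline in the metadata catalog
INSERT INTO my_ducklake.main.events VALUES (1, 'a'), (2, 'b');

-- 100 rows or more: written out as a Parquet file in the data path
INSERT INTO my_ducklake.main.events
SELECT range AS id, 'row_' || range AS name FROM range(1000);
```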

### Flushing inlined data

You can manually convert inlined data to Parquet files using the `ducklake_flush_inlined_data` function.

```sql
-- Flush inlined data for a specific table
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema.my_table');

-- Flush all inlined data in a schema
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema');

-- Flush all inlined data in the database
SELECT ducklake_flush_inlined_data('my_ducklake');
```

### When to use data inlining

Data inlining is particularly beneficial for:

- **High-frequency, small-batch inserts**: Applications with streaming data or frequent small updates
- **Incremental data loading**: ETL processes that append small batches of data regularly
- **Transactional workloads**: Systems where individual transactions insert small numbers of rows

## Bring your own bucket (BYOB) {#bring-your-own-bucket}

You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own [AWS S3](/integrations/cloud-storage/amazon-s3/) or [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2/) object store for data storage. Additionally, you can bring your own compute (BYOC) using your DuckDB client to query and write data directly to your DuckLake.

### Setup

Configure a custom data path when creating your DuckLake to use your own bucket.

<Tabs>
<TabItem value="s3" label="AWS S3">

:::note
Your S3 bucket must be in the same region as your MotherDuck organization:

- **US organizations**: `us-east-1` (US East - N. Virginia)
- **EU organizations**: `eu-central-1` (Europe - Frankfurt)

Other AWS regions are not supported for BYOB. This does not affect reading files directly from S3-compatible object stores (for example, CSV or Parquet): you can still query data from buckets in any region.
:::

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_PATH 's3://mybucket/my_optional_path/'
);
```

</TabItem>
<TabItem value="r2" label="Cloudflare R2">

:::tip
Cloudflare R2 buckets are regionless, so you can use them with any MotherDuck organization regardless of region. When creating your R2 bucket, set a [location hint](https://developers.cloudflare.com/r2/reference/data-location/) close to your MotherDuck region to minimize latency (for example, `enam` for US organizations, `weur` for EU organizations).
:::

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_PATH 'r2://mybucket/my_optional_path/'
);
```

</TabItem>
</Tabs>

Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket. See [Cloud Storage integrations](/integrations/cloud-storage/) for instructions on creating secrets for your provider.

You can then create DuckLake tables as you would in a standard DuckDB database, using either MotherDuck or local compute, as shown in the examples below.

#### Required permissions for DuckLake

<Tabs>
<TabItem value="s3" label="AWS S3">

The minimum required IAM permissions are:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "${s3_bucket_arn}"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "${s3_bucket_arn}/*"
    }
  ]
}
```

</TabItem>
<TabItem value="r2" label="Cloudflare R2">

Your R2 API token needs the following permissions on the bucket:

- **Object Read** - read data files
- **Object Write** - write and delete data files
- **Bucket List** - list objects in the bucket

See the [Cloudflare R2 API tokens documentation](https://developers.cloudflare.com/r2/api/s3/tokens/) for instructions on creating an API token.

</TabItem>
</Tabs>

### Using MotherDuck compute

Connect to MotherDuck:

```bash
./duckdb md:
```

Create your first DuckLake table from a hosted Parquet file:

```sql
CREATE TABLE my_ducklake.air_quality AS 
SELECT * FROM 'https://us.data.motherduck.com/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
```

Query using MotherDuck:

```sql
SELECT
    year,
    AVG(pm25_concentration::double) AS avg_pm25,
    AVG(pm10_concentration::double) AS avg_pm10,
    AVG(no2_concentration::double) AS avg_no2
FROM my_ducklake.air_quality 
WHERE city = 'Berlin/DEU'
GROUP BY year
ORDER BY year DESC;
```

### Using own compute

To use your own compute (for example, your DuckDB client), you must:

1. Ensure you have appropriate credentials in your compute environment to read/write to your defined `DATA_PATH` (specified at database creation)
2. Attach the metadata database

Create a secret in your compute environment:

<Tabs>
<TabItem value="s3" label="AWS S3">

If you have authenticated using `aws sso login`:

```sql
CREATE OR REPLACE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    PROVIDER credential_chain,
    CHAIN 'sso',
    PROFILE '<your_sso_profile>'
);
```

:::note
Run `aws sso login --profile <your_sso_profile>` before creating the secret to refresh your SSO token. You may need to restart your DuckDB CLI session after logging in for the credentials to be picked up. Starting with DuckDB v1.4.0, credentials are validated at creation time: if validation fails, confirm your SSO session is active and that you are using the correct `CHAIN` and `PROFILE`.
:::

Alternatively, provide static AWS keys:

```sql
CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'my_s3_access_key',
    SECRET 'my_s3_secret_key',
    REGION 'my-bucket-region',
    SCOPE 'my-bucket-path'
);
```

</TabItem>
<TabItem value="r2" label="Cloudflare R2">

```sql
CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE R2,
    KEY_ID 'your_r2_access_key',
    SECRET 'your_r2_secret_key',
    ACCOUNT_ID 'your_account_id'
);
```

</TabItem>
</Tabs>

Attach the metadata database to your DuckDB session:

```sql
ATTACH 'md:__ducklake_metadata_<database_name>' AS <database_alias>;
```

Every DuckLake database in MotherDuck has a corresponding **metadata database** that stores internal state, including schema definitions, snapshots, file mappings, and more.
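For instance, you can inspect the snapshots recorded there. This sketch assumes the `ducklake_snapshot` catalog table and its columns as described in the DuckLake specification; names may evolve while DuckLake is pre-1.0:

```sql
-- List snapshots stored in the attached metadata database
SELECT snapshot_id, snapshot_time
FROM <database_alias>.ducklake_snapshot
ORDER BY snapshot_id DESC;
```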

Create a table using your own compute:

```sql
CREATE TABLE <alias>.air_quality AS 
SELECT * FROM 'https://us.data.motherduck.com/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
```

With this configuration, your own compute can directly access or write data to your DuckLake (assuming appropriate credentials are configured). Data uploaded using your own compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.

## What's new in DuckLake 0.4

DuckLake 0.4 ships with DuckDB 1.5 and introduces several improvements:

- **VARIANT type support**: DuckLake tables can store [`VARIANT`](/sql-reference/duckdb-sql-reference/data-types/) columns, which significantly speed up semi-structured data workloads compared to plain JSON columns.
- **Data inlining enabled by default**: Inserts with fewer than 10 rows are automatically inlined into the metadata catalog. You can customize or disable this with the `DATA_INLINING_ROW_LIMIT` parameter.
- **Deletion inlining**: Similar to data inlining for inserts, delete operations on small batches can now be stored directly in the metadata catalog, reducing the overhead of small deletes.
- **Sorted compaction and flush**: Data is automatically sorted during compaction and inline flush operations, improving read performance for selective queries after maintenance.
- **Stats-only `COUNT(*)`**: Simple `COUNT(*)` queries can now be answered directly from metadata statistics without scanning data files.
- **TopN file pruning**: `LIMIT` queries with an `ORDER BY` can skip data files that fall outside the requested range, making paginated and top-N queries faster.
- **Expressions as default values**: Column defaults can use expressions like `now()` instead of only literal values.
- **Macros**: DuckLake catalogs now support storing [macros](https://duckdb.org/docs/sql/statements/create_macro.html).
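As a short sketch of expression defaults (the `audit_log` table below is hypothetical):

```sql
CREATE TABLE my_ducklake.main.audit_log (
    id INTEGER,
    created_at TIMESTAMP DEFAULT now()
);

-- created_at is filled in automatically at insert time
INSERT INTO my_ducklake.main.audit_log (id) VALUES (1);
```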

## Performing metadata operations on a DuckLake

DuckLake databases provide additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and your own compute environments. For example, you can [list the snapshots](https://ducklake.select/docs/stable/duckdb/usage/snapshots) backing your DuckLake.
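For example, assuming the `snapshots()` table function described in the linked DuckLake documentation:

```sql
SELECT * FROM my_ducklake.snapshots();
```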

## Current limitations

- **Limited sharing options**: Read-only sharing is supported through the [existing share functionality](/key-tasks/sharing-data/), restricted to auto-update shares only
- **Single-account write access**: Write permissions are limited to one account per database. This account can perform multiple concurrent writes, as long as they are append-only. If multiple queries attempt to update or delete from the same table concurrently, only the first to commit will succeed. Concurrent DDL operations are also not allowed. Support for *multi-account* write access is planned for a future release.
- **Limited BYOB storage providers**: Bring Your Own Bucket is supported for [AWS S3](/integrations/cloud-storage/amazon-s3/) and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2/) storage. Other clouds are under consideration for future support.

:::info
For multiple concurrent readers to a MotherDuck DuckLake database, you can create a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/).
:::

[Data file maintenance](https://ducklake.select/docs/stable/duckdb/maintenance/recommended_maintenance) is not automatically performed by MotherDuck. You can manually trigger these maintenance functions as needed from either MotherDuck or your own compute environments.
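A minimal maintenance pass might look like the following, using the functions described in the linked guide (exact names and parameters may change while DuckLake is pre-1.0):

```sql
-- Merge small adjacent data files into larger ones
CALL ducklake_merge_adjacent_files('my_ducklake');

-- Expire snapshots older than one week...
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL 7 DAY);

-- ...then delete data files no longer referenced by any snapshot
CALL ducklake_cleanup_old_files('my_ducklake', cleanup_all => true);
```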
