# DuckLake
> DuckLake is an integrated data lake and catalog format for large scale data analytics.
::::note
MotherDuck supports DuckDB 1.5.4. In **US East (N. Virginia) -** `us-east-1`, MotherDuck is compatible with client versions 1.4.0 through 1.5.4. In **US West (Oregon) -** `us-west-2`, MotherDuck supports client versions 1.4.1 through 1.5.4. In **Europe (Frankfurt) -** `eu-central-1`, MotherDuck supports client version 1.4.1 through 1.5.4. In **Europe (Dublin) -** `eu-west-1`, MotherDuck supports client versions 1.4.1 through 1.5.4.
::::

[DuckLake](https://ducklake.select) is an integrated data lake and catalog format. DuckLake delivers advanced data lake features without traditional lakehouse complexity by using Parquet files and a SQL database.

MotherDuck provides two main options for creating and integrating with DuckLake databases:

- **[Fully managed](#creating-a-fully-managed-ducklake-database)**: Create a DuckLake database where MotherDuck manages both data storage and metadata
- **[Bring your own bucket (BYOB)](#bring-your-own-bucket)**: Connect your own S3 or R2 bucket for data storage with:
  - **[MotherDuck compute + MotherDuck catalog](#using-motherduck-compute)**: Use MotherDuck for both compute and catalog services
  - **[Own compute + MotherDuck catalog](#using-own-compute)**: Use your own DuckDB client for compute while MotherDuck provides catalog services

## Creating a fully managed DuckLake database

Create a fully managed DuckLake with the following command:

```sql
CREATE DATABASE my_ducklake (TYPE DUCKLAKE);
```

MotherDuck stores both data and metadata in MotherDuck-managed storage (not externally accessible at the moment), providing a streamlined way to evaluate DuckLake functionality.

The `my_ducklake` database can be accessed like any other MotherDuck database, including over the [Postgres endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/) for clients that don't use the DuckDB SDK.
To inspect the metadata catalog backing the DuckLake, see [Performing metadata operations on a DuckLake](#performing-metadata-operations-on-a-ducklake).

You can attach the DuckLake metadata with:

```sql
ATTACH 'md:__ducklake_metadata_<database_name>' AS <database_alias>;
```

::::note
The metadata database can only be attached by the database owner.
::::

## Data inlining

Data inlining is an optimization feature that stores small data changes directly in the metadata catalog rather than creating individual Parquet files for every insert operation. This eliminates the overhead of creating small Parquet files while maintaining full query and update capabilities.

### Creating a DuckLake database with custom inlining

To create a (fully managed) DuckLake database with a custom inlining threshold:

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_INLINING_ROW_LIMIT 100
);
```

This configuration will inline all inserts with fewer than 100 rows directly into the metadata catalog.

### How data inlining works

Data inlining is **enabled by default** with a threshold of 10 rows. Any insert writing fewer than 10 rows is automatically stored inline in the metadata catalog rather than creating a Parquet file.

You can customize the threshold with the `DATA_INLINING_ROW_LIMIT` parameter. For example, if you set it to 100, inserts with fewer than 100 rows are stored inline, while inserts with 100 or more rows create Parquet files. Set it to 0 to disable inlining.

The inlining threshold applies **per insert operation**. For example, if the limit is set to 100, four separate inserts of 50 rows each will all be stored inline (200 total rows), because each individual insert is below the threshold.

When an insert exceeds the threshold, that insert writes directly to a Parquet file, but any previously inlined data remains in the metadata catalog. Larger inserts do not automatically flush existing inlined data.

:::note
For [BYOB](#bring-your-own-bucket) databases, inlined data is stored in the MotherDuck-managed metadata catalog, not in your bucket. Small inserts only appear in your bucket as Parquet files after they are flushed. Set `DATA_INLINING_ROW_LIMIT` to 0 if all data must reside in your own storage.
:::

### Flushing inlined data

Because inlined data can accumulate, it is good practice to periodically flush it to parquet storage using the `ducklake_flush_inlined_data` function:

```sql
-- Flush inlined data for a specific table
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema.my_table');

-- Flush all inlined data in a schema
SELECT ducklake_flush_inlined_data('my_ducklake.my_schema');

-- Flush all inlined data in the database
SELECT ducklake_flush_inlined_data('my_ducklake');
```

For workloads with frequent small inserts, schedule regular flushes to prevent excessive inlined data accumulation.

> Automatic background flush operations are in active development.

### Configuring inlining

You can override the database-level inlining threshold for individual tables:

```sql
-- Disable inlining for a specific table
CALL my_ducklake.set_option('data_inlining_row_limit', 0, table_name => 'my_table');

-- Set a custom threshold for a specific table
CALL my_ducklake.set_option('data_inlining_row_limit', 50, table_name => 'my_table');
```

You can also set a session-level default that applies to new tables:

```sql
SET ducklake_default_data_inlining_row_limit = 0;
```

## DuckLake configuration

DuckLake provides configuration options that you can set at the database or table level using the `set_option` function. For example, you can adjust the `parquet_row_group_size` to control how data is organized in Parquet files:

```sql
-- Set row group size for the entire database
CALL my_ducklake.set_option('parquet_row_group_size', 50000);

-- Set row group size for a specific table
CALL my_ducklake.set_option('parquet_row_group_size', 50000, table_name => 'my_table');
```

Note that calls the `set_option` take precedence over configuration passed when creating the database.

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_INLINING_ROW_LIMIT 100 -- sets database level inlining row limit to 100
);

-- overrides the prior value and sets database level row limit to 250
CALL my_ducklake.set_option('data_inlining_row_limit', 250);

-- overrides prior value _ONLY_ for `my_table`.
CALL my_ducklake.set_option('data_inlining_row_limit', 0, table_name => 'my_table');

```

For the full list of available configuration options, see the [DuckLake configuration reference](https://ducklake.select/docs/stable/duckdb/usage/configuration#setting-config-values-1).

## Bring your own bucket (BYOB) {#bring-your-own-bucket}

You can use MotherDuck as a compute engine and managed DuckLake catalog while connecting your own [AWS S3](/integrations/cloud-storage/amazon-s3/) or [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2/) object store for data storage. Additionally, you can bring your own compute (BYOC) using your DuckDB client to query and write data directly to your DuckLake.

### Setup

Configure a custom data path when creating your DuckLake to use your own bucket.

### AWS S3

:::note
Your S3 bucket must be in the same AWS region as your MotherDuck organization: `us-east-1` (US East - N. Virginia), `us-west-2` (US West - Oregon), `eu-central-1` (Europe - Frankfurt), or `eu-west-1` (Europe - Dublin). Creating a DuckLake on a bucket in a different region fails with an error.

This restriction only applies to BYOB DuckLake data paths. Reading files directly from S3-compatible object stores (for example, CSV or Parquet) is not affected: you can still query data from buckets in any region.
:::

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_PATH 's3://mybucket/my_optional_path/'
);
```

### Cloudflare R2

:::tip
Cloudflare R2 buckets are not bound to a specific region, so you can use them with any MotherDuck organization regardless of region. When creating your R2 bucket, set a [location hint](https://developers.cloudflare.com/r2/reference/data-location/) close to your MotherDuck region to minimize latency (for example, `enam` for US organizations, `weur` for EU organizations).
:::

```sql
CREATE DATABASE my_ducklake (
    TYPE DUCKLAKE,
    DATA_PATH 'r2://mybucket/my_optional_path/'
);
```

Create a corresponding secret in MotherDuck to allow MotherDuck compute to access your bucket. For example, for AWS S3:

```sql
CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'my_s3_access_key',
    SECRET 'my_s3_secret_key',
    REGION 'my-bucket-region',
    SCOPE 's3://mybucket'
);
```

See [Cloud Storage integrations](/integrations/cloud-storage/) for instructions on creating secrets for your provider. The secret can be created before or after the database: bucket access is not validated at database creation time, but the bucket's region is checked against your organization's region if a matching secret exists.

You can then create DuckLake tables as you would with a standard DuckDB database using either MotherDuck or local compute as shown in the examples below.

#### Required permissions for DuckLake

### AWS S3

The minimum required IAM permissions are:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "${s3_bucket_arn}"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "${s3_bucket_arn}/*"
    }
  ]
}
```

### Cloudflare R2

Your R2 API token needs the following permissions on the bucket:

- **Object Read** - read data files
- **Object Write** - write and delete data files
- **Bucket List** - list objects in the bucket

See the [Cloudflare R2 API tokens documentation](https://developers.cloudflare.com/r2/api/s3/tokens/) for instructions on creating an API token.

### Using MotherDuck compute

Connect to MotherDuck:

```sql
./duckdb md:
```

Create your first DuckLake table from an hosted Parquet file:

```sql
CREATE TABLE my_ducklake.air_quality AS
SELECT * FROM 'https://us.data.motherduck.com/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
```

Query using MotherDuck:

```sql
SELECT
    year,
    AVG(pm25_concentration::double) AS avg_pm25,
    AVG(pm10_concentration::double) AS avg_pm10,
    AVG(no2_concentration::double) AS avg_no2
FROM my_ducklake.air_quality
WHERE city = 'Berlin/DEU'
GROUP BY year
ORDER BY year DESC;
```

### Using own compute

To use your own compute (for example, your DuckDB client), you must:

1. Ensure you have appropriate credentials in your compute environment to read/write to your defined `DATA_PATH` (specified at database creation)
2. Attach the DuckLake using the `ducklake:` prefix so compute runs locally against the MotherDuck-managed metadata catalog

Secrets created `IN MOTHERDUCK` are stored in MotherDuck and automatically available to your DuckDB client, so a single static-key secret can serve both MotherDuck compute and your own compute. If you already created one in the [Setup](#setup) step, you can skip ahead to attaching the DuckLake.

### AWS S3

If you have authenticated using `aws sso login`:

```sql
CREATE OR REPLACE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    PROVIDER credential_chain,
    CHAIN 'sso',
    PROFILE '<your_sso_profile>'
);
```

:::note
Run `aws sso login --profile <your_sso_profile>` before creating the secret to refresh your SSO token. You may need to restart your DuckDB CLI session after logging in for the credentials to be picked up. Starting with DuckDB v1.4.0, credentials are validated at creation time: if validation fails, confirm your SSO session is active and that you are using the correct `CHAIN` and `PROFILE`.
:::

:::warning
Credential-chain secrets are resolved on the machine where the query runs. The SSO secret above works for your own compute, but MotherDuck compute cannot resolve your local SSO session. Queries executed by MotherDuck against the same DuckLake will fail with an authentication error (for example, `InvalidToken`). If you also want to query the DuckLake with MotherDuck compute, create a static-key secret as shown below.
:::

Alternatively, provide static AWS keys:

```sql
CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'my_s3_access_key',
    SECRET 'my_s3_secret_key',
    REGION 'my-bucket-region',
    SCOPE 'my-bucket-path'
);
```

### Cloudflare R2

```sql
CREATE SECRET my_secret IN MOTHERDUCK (
    TYPE R2,
    KEY_ID 'your_r2_access_key',
    SECRET 'your_r2_secret_key',
    ACCOUNT_ID 'your_account_id'
);
```

Attach the DuckLake to your DuckDB session, pointing at the MotherDuck-managed metadata catalog and your data bucket:

```sql
ATTACH 'ducklake:md:__ducklake_metadata_<database_name>' AS <database_alias>
    (DATA_PATH '<bucket_path_to_data>');
```

This tells DuckLake to:

- Use `ducklake:md:__ducklake_metadata_<database_name>` as the metadata catalog (through MotherDuck)
- Use `<bucket_path_to_data>` for reading and writing data files
- Run all compute locally on your DuckDB client rather than on MotherDuck

The `ducklake:` prefix is what enables local compute. Attaching with `ATTACH 'md:__ducklake_metadata_<database_name>'` (without the prefix) gives you the metadata catalog for inspection only. See [Performing metadata operations on a DuckLake](#performing-metadata-operations-on-a-ducklake).

Create a table using your own compute:

```sql
CREATE TABLE <alias>.air_quality AS
SELECT * FROM 'https://us.data.motherduck.com/who_ambient_air_quality/parquet/who_ambient_air_quality_database_version_2024.parquet';
```

With this configuration, your own compute can directly access or write data to your DuckLake (assuming appropriate credentials are configured). Data uploaded using your own compute will appear in the MotherDuck catalog and be queryable as a standard MotherDuck database.

## What's new in DuckLake 1.0

DuckLake 1.0 is the first production-ready release, with a stable specification and backward-compatibility guarantees. Highlights include:

- **Stable specification and multi-engine support**: The DuckLake 1.0 spec is stable with backward-compatibility guarantees going forward, and is designed to be used across multiple query engines.
- **Full inlining for inserts, updates, and deletes**: Updates now join inserts and deletes in being inlined into the metadata catalog when under the row threshold (10 by default). Customize with `DATA_INLINING_ROW_LIMIT`.
- **Clustering with `SET SORTED BY`**: Declare sort keys on columns or arbitrary SQL expressions. DuckLake applies the sort during compaction and inline flush (and optionally on insert), improving row-group and file pruning for filtered queries.
- **Bucket partitioning**: Iceberg-compatible `bucket(N, column)` transforms for high-cardinality columns, giving a middle ground between traditional partitioning and avoiding the small-files problem.
- **GEOMETRY enhancements**: Per-file bounding-box statistics enable file pruning on spatial filters, and `GEOMETRY` can now be nested inside `STRUCT`, `LIST`, and `MAP`.
- **VARIANT type with shredded statistics**: `VARIANT` sub-fields receive file-level statistics, enabling filter pushdown and faster selective queries over semi-structured data.
- **Deletion vectors (experimental)**: Iceberg v3-compatible deletion vectors, stored as Puffin files as an alternative to delete files.

See the [DuckLake 1.0 release post](https://ducklake.select/2026/04/13/ducklake-10/) and the [MotherDuck announcement](https://motherduck.com/blog/announcing-ducklake-1-0-on-motherduck/) for more detail.

## Additional DuckLake features

DuckLake on MotherDuck also supports:

- **Stats-only `COUNT(*)`**: Simple `COUNT(*)` queries are answered directly from metadata statistics without scanning data files.
- **TopN file pruning**: `LIMIT` queries with an `ORDER BY` skip data files that fall outside the requested range, making paginated and top-N queries faster.
- **Expressions as default values**: Column defaults can use expressions like `now()`, not only literal values.
- **Macros**: DuckLake catalogs can store [macros](https://duckdb.org/docs/sql/statements/create_macro.html).

## Performing metadata operations on a DuckLake

Using DuckLake provides additional metadata operations for introspection and maintenance. These operations can be performed from both MotherDuck and your own compute environments. For example, you can [list the snapshots](https://ducklake.select/docs/stable/duckdb/usage/snapshots) backing your DuckLake.

Each DuckLake in MotherDuck has a corresponding **metadata database** that stores internal state, including schema definitions, snapshots, file mappings, and more. To inspect this metadata catalog directly from any DuckDB session (this works for both fully managed and BYOB databases):

```sql
ATTACH 'md:__ducklake_metadata_<database_name>' AS <database_alias>;
```

::::note
The metadata database can only be attached by the database owner. This form attaches the metadata catalog for inspection only. To run DuckLake compute locally against your data, use the `ducklake:` ATTACH form shown in [Using own compute](#using-own-compute).
::::

## Current limitations

- **Limited sharing options**: Read-only sharing is supported through the [existing share functionality](/key-tasks/sharing-data/), restricted to auto-update shares only
- **Single-account write access**: Write permissions are limited to one account per database. This account can perform multiple concurrent writes, as long as they are append-only. If multiple queries attempt to update or delete from the same table concurrently, only the first to commit will succeed. Concurrent DDL operations are also not allowed. Support for *multi-account* write access is planned for a future release.
- **Limited BYOB storage providers**: Bring Your Own Bucket is supported for [AWS S3](/integrations/cloud-storage/amazon-s3/) and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2/) storage. Other clouds are under consideration for future support.

:::info
For multiple concurrent readers to a MotherDuck DuckLake database, you can create a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/).
:::


---

## Docs feedback

MotherDuck accepts optional user-submitted feedback about this page at `GET https://motherduck.com/docs/api/feedback/agent`.
For agents and automated tools, feedback submission should be user-confirmed before sending.

URL-encode query parameter values and send a GET request:

```text
GET https://motherduck.com/docs/api/feedback/agent?page_path=%2Fintegrations%2Ffile-formats%2Fducklake%2F&page_title=DuckLake&text=<url-encoded user feedback, max 2000 characters>
```

Optionally append `&source=<url-encoded interface identifier>` such as `claude.ai` or `chatgpt`.

`page_path` and `text` are required; `page_title` and `source` are optional. Responses: `200 {"feedback_id": "<uuid>"}`, `400` for malformed query parameters, and `429` when rate-limited.