dbt with DuckDB and MotherDuck

Data Build Tool (dbt) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses by defining SQL in model files. It bring the composability of programming languages to SQL while automating the mechanics of updating tables.

dbt-duckdb is the adapter which allows dbt to use DuckDB and MotherDuck. The adapter also supports DuckDB extensions and any of the additional DuckDB configuration options.

Installation

Since dbt is a Python library, it can be installed through pip:

pip3 install dbt-duckdb will install both dbt and duckdb.

Configuration for Local DuckDB

This configuration allows you to connect to S3 and perform read/write operations on Parquet files using an AWS access key and secret.

profiles.yml

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      threads: 4
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

tip

The path attribute specifies where your DuckDB database file will be created. By default, this path is relative to your profiles.yml file location. If the database doesn't exist at the specified path, DuckDB will automatically create it.

You can find more information about these connections profiles in the dbt documentation.

Configuration for MotherDuck

The only change needed for motherduck is the path: setting.

default:
  outputs:
    dev:
      type: duckdb
      path:  "md:my_db?motherduck_token={{env_var('MOTHERDUCK_TOKEN')}}"
      threads: 4
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

This assumes that you have setup MOTHERDUCK_TOKEN as an environment variable. To know more about how to persist your authentication credentials, read Authenticating to MotherDuck using an access token. If you don't set the motherduck_token in your path, you will be prompted to authenticate to MotherDuck when running your dbt run command.

auth_md

Follow the instructions and it will export the service account variable for the current dbt run process.

DuckDB will parallelize a single write query as much as possible, so the gains from running more than one query at a time are minimal on the database side. That being said, our testing indicates that setting threads: 4 typically leads to the best performance.

Attaching Additional Databases

dbt-duckdb supports attaching additional databases to your main DuckDB connection, allowing you to work with multiple databases simultaneously. This is particularly useful when you need to reference data from different sources or when working with separate databases for different purposes.

Configuration

To attach additional databases, add an attach section to your profile configuration:

default:
  outputs:
    dev:
      type: duckdb
      path: "md:my_db?motherduck_token={{env_var('MOTHERDUCK_TOKEN')}}"
      threads: 4
      extensions:
        - httpfs
        - parquet
      attach:
        - path: "md:other_db?motherduck_token={{env_var('MOTHERDUCK_TOKEN')}}"
          alias: other_db
        - path: "md:third_db?motherduck_token={{env_var('MOTHERDUCK_TOKEN')}}"
          alias: third_db
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

tip

The alias parameter is optional. If not specified, dbt-duckdb will use the filename (without extension) as the alias for the attached database.

Usage Example

Once you have attached databases, you can use the database config parameter in your dbt models to specify which database to write to:

-- models/my_model.sql
{{ config(database='other_db') }}

SELECT 
    id,
    name,
    created_at
FROM {{ ref('source_table') }}
WHERE created_at >= '2024-01-01'

You can also specify the database for source tables in your sources.yml file:

# models/sources.yml
version: 2

sources:
  - name: external_data
    database: other_db
    tables:
      - name: customers
        description: Customer data from external database
      - name: orders
        description: Order data from external database

Then reference these sources in your models, from the correct database:

-- models/combined_data.sql
SELECT 
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.order_date
FROM {{ source('external_data', 'customers') }} c
JOIN {{ source('external_data', 'orders') }} o ON c.customer_id = o.customer_id

Extra resources

Take a look at our video guide on DuckDB and dbt provided below, along with the corresponding demo tutorial on GitHub.

Installation​

Configuration for Local DuckDB​

Configuration for MotherDuck​

Attaching Additional Databases​

Configuration​

Usage Example​

Extra resources​