dbt
Data Build Tool (dbt) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses by defining SQL in model files. It bring the composability of programming languages to SQL while automating the mechanics of updating tables.
dbt-duckdb is the adapter which allows dbt to use DuckDB and MotherDuck. The adapter also supports DuckDB extensions and any of the additional DuckDB configuration options.
Installation
Since dbt is a Python library, it can be installed through pip:
pip3 install dbt-duckdb
will install both dbt
and duckdb
.
Configuration for Local DuckDB
This configuration allows you to connect to S3 and perform read/write operations on Parquet files using an AWS access key and secret.
profiles.yml
default:
outputs:
dev:
type: duckdb
path: /tmp/dbt.duckdb
threads: 1
extensions:
- httpfs
- parquet
settings:
s3_region: my-aws-region
s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
target: dev
You can find more information about these connections profiles in the dbt documentation.
Configuration for MotherDuck
The only change needed for motherduck is the path:
setting.
default:
outputs:
dev:
type: duckdb
path: "md:my_db?motherduck_token={{env_var('MOTHERDUCK_TOKEN')}}"
threads: 1
extensions:
- httpfs
- parquet
settings:
s3_region: my-aws-region
s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
target: dev
This assumes that you have setup MOTHERDUCK_TOKEN
as an environment variable. To know more about how to persist your authentication credentials, read Authenticating to MotherDuck using an access token. If you don't set the motherduck_token
in your path, you will be prompted to authenticate to MotherDuck when running your dbt run
command.
Follow the instructions and it will export the service account variable for the current dbt run
process.
DuckDB will parallelize a single write query as much as possible, so running one at a time is the most efficient use of compute resources. For most use cases, we recommend setting threads: 1
for the best performance.
Extra resources
Take a look at our video guide on DuckDB and dbt provided below, along with the corresponding demo tutorial on GitHub.