Local dev and cloud prod for faster dbt development
2025/01/16 - 7 min read
Introducktion
I hate waiting for slow pipelines to run, so I am delighted to share some strategies to iterate on your data problems at maximum speed - MotherDuck even gave a talk on this concept at dbt Coalesce in 2024. By harnessing the capabilities of DuckDB locally, backed by MotherDuck in the cloud, we can unlock an incredibly fast and efficient development cycle. We'll explore how to configure your dbt profile for dual execution and share some tips on how much data to bring local. By implementing these techniques, you can significantly accelerate your data pipeline development and iterate even faster to solve business problems.
Looking to follow along in the code?
Check out the example repo!
Setting up your Profile
In order to take advantage of these capabilities, we need to configure our dbt profile to execute in the correct place, as well as define the behavior that we want in our sources. In the example dbt profile below, the prod target runs entirely in the cloud, while the local target runs mostly on your machine but is also attached to MotherDuck so it can read cloud data into your local database.
dual_execution:
  outputs:
    local:
      type: duckdb
      path: local.db
      attach:
        - path: "md:" # attaches all MotherDuck databases
    prod:
      type: duckdb
      path: "md:jdw"
  target: local
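With this profile saved, local is the default target, so day-to-day runs stay on your machine and you only opt into the cloud when you need it. A minimal sketch of what that looks like on the CLI:

# run everything locally (local is the default target in this profile)
dbt build

# run the same project against MotherDuck in the cloud
dbt build --target prod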
Sources & Models
With your sources, you need to define which ones to replicate entirely, which ones are fine as views, and which ones to sample. Keep in mind that for sampling, you need to think about your data model and make sure that related samples are hydrated (for example, if you only bring in 100 customers, you need to make sure you also bring in their orders); see the sketch after the snippet below.
In my example project, which uses TPC-DS as the source data, I sample 1% of the large tables when running locally. In general, I aim to keep each table under a million rows, although there is no hard limit. The remaining tables are small enough that I replicate them locally in full.
The way we conditionally sample our models is by using the target variable: check target.name and apply the sampling clause only when running locally. An example SQL snippet (using Jinja) is below.
from {{ source("tpc-ds", "catalog_sales") }}
{% if target.name == 'local' %} using sample 1% {% endif %}
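To keep related samples hydrated, one option is to avoid sampling child tables independently and instead filter them against the parent rows that already made it into the local sample. A rough sketch, where the staging model name and key columns are assumptions rather than part of the example project:

-- child table: only keep rows whose customer survived the local sample
select ss.*
from {{ source("tpc-ds", "store_sales") }} ss
{% if target.name == 'local' %}
where ss.ss_customer_sk in (
    select c_customer_sk from {{ ref("stg_customer") }} -- hypothetical sampled parent model
)
{% endif %}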
As an example of a simple “create local table from cloud”, consider the following query plan. The “L” indicates Local and the “R” indicates Remote (i.e. MotherDuck).
🦆 explain create table
"local"."main"."call_center"
as (
from "jdw_dev"."jdw_tpcds"."call_center"
);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ BATCH_CREATE_TABLE_AS (L) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ DOWNLOAD_SOURCE (L) │
│ ──────────────────── │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BATCH_DOWNLOAD_SINK (R) │
│ ──────────────────── │
│ bridge_id: 1 │
│ parallel: true │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN (R) │
│ ──────────────────── │
│ call_center │
│ │
│ Projections: │
│ cc_call_center_sk │
│ cc_call_center_id │
│ cc_rec_start_date │
│ cc_rec_end_date │
│ cc_closed_date_sk │
│ cc_open_date_sk │
│ cc_name │
│ cc_class │
│ cc_employees │
│ cc_sq_ft │
│ cc_hours │
│ cc_manager │
│ cc_mkt_id │
│ cc_mkt_class │
│ cc_mkt_desc │
│ cc_market_manager │
│ cc_division │
│ cc_division_name │
│ cc_company │
│ cc_company_name │
│ cc_street_number │
│ cc_street_name │
│ cc_street_type │
│ cc_suite_number │
│ cc_city │
│ cc_county │
└───────────────────────────┘
This can also be extended to your sources.yml if necessary for testing local datasets (e.g. JSON or Parquet files from experimental pipelines that have not yet made it to your data lake). Configuring these is similar:
{%- if target.name == 'local' -%}
meta:
  external_location: "data/tpcds/{name}.parquet"
{%- endif -%}
Running your pipeline
Once you have this configuration in place, you can simply run your pipeline as normal. For ease of use, you may want to add tags to the models you are working on so you can avoid going back to the cloud dataset too often. This can be set simply in the dbt_project.yml like this:
models:
  dual_execution:
    tpcds:
      raw:
        +tags: ['raw']
        +materialized: table
      queries:
        +materialized: view
        +tags: ['queries']
From there, it is as simple as running dbt build -s tag:raw to load your raw data, and then, for subsequent query iteration, running dbt build -s tag:queries in the CLI.
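Put together, a typical local iteration loop might look something like this (a sketch that assumes the tags and targets defined above):

# hydrate the local database from MotherDuck (run occasionally)
dbt build --target local -s tag:raw

# iterate quickly on downstream models against the local copy
dbt build --target local -s tag:queries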
Shipping dev to the cloud
Certain tables may need to be available in your cloud data warehouse for testing, even in the local workflow. This might be the case for a BI tool that is connected to your cloud instance and is difficult to point at a local database. You can accomplish this by setting the database attribute in your model, so that after the model runs, the result is available in the cloud as well.
{{ config(
    database="jdw_dev",
    schema="local_to_prod",
    materialized="table"
) }}
It should be noted that this is a static configuration that is best used for testing. If you don’t want to manually flip models between dev / prod destinations, you can define the database as an attribute of a specific model in your dbt_project.yml file.
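For example, a folder-level version of that configuration might look like this in dbt_project.yml (the local_to_prod folder name is an assumption for illustration):

models:
  dual_execution:
    tpcds:
      local_to_prod: # hypothetical folder of models that should also land in the cloud
        +database: jdw_dev
        +schema: local_to_prod
        +materialized: table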
Wrapping up
As you can see from this example, MotherDuck’s dual execution lets you leverage the unique strengths of DuckDB to run an accelerated development cycle on your local machine. With some basic optimization, we can get ~5x faster dbt runs by making the data smaller and using local compute. This is a very powerful combination for rapidly iterating on your pipeline and then pushing a high-quality change back into your production environment.
Want to learn more? Join our webinar about Local Dev & Cloud Prod on February 13th, 2025.