Local dev and cloud prod for faster dbt development

2025/01/16 - 7 min read

Introducktion

I hate waiting for slow pipelines to run, so I am delighted to share some strategies to iterate on your data problems at maximum speed - MotherDuck even gave a talk on this concept at dbt Coalesce in 2024. By harnessing the capabilities of DuckDB locally, backed by MotherDuck in the cloud, we can unlock an incredibly fast and efficient development cycle. We'll explore how to configure your dbt profile for dual execution and share some tips on how much data to bring local. By implementing these techniques, you can significantly accelerate your data pipeline development and iterate even faster to solve business problems.

Looking to follow along in the code?

Check out the example repo!

[Diagram: instant feedback loop]

Setting up your Profile

In order to take advantage of these capabilities, we need to configure our dbt profile to execute in the correct place, as well as define the behavior that we want in our sources. In the example dbt profile below, prod runs entirely in the cloud, while local runs mostly on your local machine but stays attached to MotherDuck so it can read cloud data into your local database.


dual_execution:
  outputs:
    local:
      type: duckdb
      path: local.db
      attach:
        - path: "md:"  # attaches all MotherDuck databases
    prod:
      type: duckdb
      path: "md:jdw"
  target: local
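
With this profile in place, the target decides where a run executes; target: local makes the local output the default, and the standard dbt --target flag switches the same project over to the cloud. For example:

# default target is "local", so models build into local.db
dbt build

# run the same project entirely against MotherDuck
dbt build --target prod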

Sources & Models

With your sources, you need to define which ones to replicate entirely, which ones are fine as views, and which ones to sample. For sampling, keep your data model in mind and make sure that related samples are hydrated (i.e. if you only bring in 100 customers, you also need to bring in their orders).
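
One way to keep related samples hydrated is to sample the parent table and then filter the child table against it, rather than sampling both independently. Below is a rough sketch of that pattern using the TPC-DS customer and store_sales tables; the raw_customer and raw_store_sales model names are illustrative, not taken from the example repo.

-- models/raw/raw_store_sales.sql (illustrative model name)
from {{ source("tpc-ds", "store_sales") }}
{% if target.name == 'local' %}
-- only keep sales rows whose customer made it into the sampled raw_customer model
where ss_customer_sk in (select c_customer_sk from {{ ref("raw_customer") }})
{% endif %}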

In my example project, which uses TPC-DS as the source data, I sample 1% of the large tables when running locally. In general, I aim to keep each dataset under a million rows per table, although there is no hard limit. The remaining tables are small enough that I replicate them locally in full.

We conditionally sample our models using the target variable: check target.name inside the model and apply the sampling clause only when running against your local target.

An example SQL snippet (using Jinja) is below.


from {{ source("tpc-ds", "catalog_sales") }}
{% if target.name == 'local' %}
using sample 1%
{% endif %}

As an example of a simple “create local table from cloud”, consider the following query plan. The “L” indicates Local and the “R” indicates Remote (i.e. MotherDuck).


🦆 explain create table "local"."main"."call_center" as ( from "jdw_dev"."jdw_tpcds"."call_center" );
┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ BATCH_CREATE_TABLE_AS (L) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    DOWNLOAD_SOURCE (L)    │
│   ────────────────────    │
│       bridge_id: 1        │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│  BATCH_DOWNLOAD_SINK (R)  │
│   ────────────────────    │
│       bridge_id: 1        │
│      parallel: true       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       SEQ_SCAN (R)        │
│   ────────────────────    │
│        call_center        │
│                           │
│       Projections:        │
│     cc_call_center_sk     │
│     cc_call_center_id     │
│     cc_rec_start_date     │
│      cc_rec_end_date      │
│     cc_closed_date_sk     │
│      cc_open_date_sk      │
│          cc_name          │
│         cc_class          │
│       cc_employees        │
│         cc_sq_ft          │
│         cc_hours          │
│        cc_manager         │
│         cc_mkt_id         │
│       cc_mkt_class        │
│        cc_mkt_desc        │
│     cc_market_manager     │
│        cc_division        │
│     cc_division_name      │
│        cc_company         │
│      cc_company_name      │
│     cc_street_number      │
│      cc_street_name       │
│      cc_street_type       │
│      cc_suite_number      │
│          cc_city          │
│         cc_county         │
└───────────────────────────┘

This approach can also be extended to your sources.yml if you need to test against local datasets (e.g. JSON or Parquet files from experimental pipelines that have not yet made it to your data lake). Configuring these is similar:


{%- if target.name == 'local' -%}
meta:
  external_location: data/tpcds/{name}.parquet
{%- endif -%}
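
For context, here is a rough sketch of how that conditional could sit inside a source definition. The source and table names follow the examples above, while the schema value and overall layout are assumptions about the project rather than its actual sources.yml:

sources:
  - name: tpc-ds
    schema: jdw_tpcds   # assumption based on the query plan above
    {%- if target.name == 'local' %}
    meta:
      external_location: data/tpcds/{name}.parquet
    {%- endif %}
    tables:
      - name: catalog_sales
      - name: call_center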

Running your pipeline

Once you have this configuration in place, you can run your pipeline as normal. For ease of use, though, you may want to add tags to the models you are working on so you can avoid going back to the cloud data set too often. This can be set in dbt_project.yml like this:


models:
  dual_execution:
    tpcds:
      raw:
        +tags: ['raw']
        +materialized: table
      queries:
        +materialized: view
        +tags: ['queries']

From there, it is as simple as running dbt build -s tag:raw to load your raw data, then iterating on your queries with dbt build -s tag:queries in the CLI. These runs can be visualized like this:

[Diagram: data flow from cloud to local]
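
In practice, the day-to-day loop looks roughly like this on the command line, assuming the tag names from the dbt_project.yml snippet above:

# one-off: pull the (sampled) raw source data down into local.db
dbt build -s tag:raw

# fast iteration: rebuild only the query models against local data
dbt build -s tag:queries

# once the change looks good, run it against the cloud warehouse
dbt build -s tag:queries --target prod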

Shipping dev to the cloud

Certain tables may need to be available in your cloud data warehouse for testing even in the local workflow. One example is a BI tool that is connected to your cloud instance and is difficult to point at a local database. This can be accomplished by setting the database attribute in your model config, so that after the model is run, it is available in the cloud as well.


{{
    config(
        database="jdw_dev",
        schema="local_to_prod",
        materialized="table"
    )
}}

It should be noted that this is a static configuration that is best used for testing. If you don’t want to manually flip models between dev / prod destinations, you can define the database as an attribute of a specific model in your dbt_project.yml file.
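
A rough sketch of what that could look like, using a hypothetical model named bi_summary inside the queries folder:

models:
  dual_execution:
    tpcds:
      queries:
        bi_summary:   # hypothetical model name
          +database: jdw_dev
          +schema: local_to_prod
          +materialized: table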

Wrapping up

As you can see from this example, MotherDuck's dual execution lets you leverage the unique strengths of DuckDB to run an accelerated development cycle on your local machine. With some basic optimization, we can get ~5x faster dbt runs by making the data smaller and using local compute. This is a powerful combination for rapidly iterating on your pipeline and then pushing a high-quality change back into your production environment.

Want to learn more? Join our webinar about Local Dev & Cloud Prod on February 13th, 2025.

