Sources & Models
With your sources, you need to define which ones to replicate entirely, which ones are ok as views, and which ones to sample. Keep in mind for sampling, you need to think about your data model and make sure that related samples are hydrated (i.e. if you only bring in 100 customers, you need to make sure you also bring in their orders too).
In my example project using TPC-DS as the source data, I am sampling 1% of the data when running locally on the large tables. In general, I am aiming to keep the datasets less than a million rows per table, although there is no hard limit. For the remaining tables, I am replicating the entire data set locally since they are so small.
The way that we conditionally sample our models is by using the ‘target’ variable. You can add this parameter by checking your target and running it conditionally on your model.
An example sql snippet is below (using jinja).
from {{ source("tpc-ds", "catalog_sales") }}
{% if target.name == 'local' %} using sample 1 % {% endif %}
As an example of a simple “create local table from cloud”, consider the following query plan. The “L” indicates Local and the “R” indicates Remote (i.e. MotherDuck).
🦆 explain create table
"local"."main"."call_center"
as (
from "jdw_dev"."jdw_tpcds"."call_center"
);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ BATCH_CREATE_TABLE_AS (L) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ DOWNLOAD_SOURCE (L) │
│ ──────────────────── │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BATCH_DOWNLOAD_SINK (R) │
│ ──────────────────── │
│ bridge_id: 1 │
│ parallel: true │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN (R) │
│ ──────────────────── │
│ call_center │
│ │
│ Projections: │
│ cc_call_center_sk │
│ cc_call_center_id │
│ cc_rec_start_date │
│ cc_rec_end_date │
│ cc_closed_date_sk │
│ cc_open_date_sk │
│ cc_name │
│ cc_class │
│ cc_employees │
│ cc_sq_ft │
│ cc_hours │
│ cc_manager │
│ cc_mkt_id │
│ cc_mkt_class │
│ cc_mkt_desc │
│ cc_market_manager │
│ cc_division │
│ cc_division_name │
│ cc_company │
│ cc_company_name │
│ cc_street_number │
│ cc_street_name │
│ cc_street_type │
│ cc_suite_number │
│ cc_city │
│ cc_county │
└───────────────────────────┘
This can also be extended to your sources.yml if necessary for testing local datasets (i.e. json or parquet on experimental pipelines that have not yet made it to your data lake). Configuring these is similar:
{%if- target.name == 'local' -%}
meta:
external_location:
data/tpcds/{name}.parquet
{%- endif -%}