Data Warehousing Guide

Introduction to MotherDuck for Data Warehousing

MotherDuck is a cloud-native data warehouse built on top of DuckDB that adds enterprise features like cloud storage, sharing, and collaboration to DuckDB's fast analytical engine. The platform serves these needs through its serverless architecture, sharing model, and WASM capabilities. It benefits data analysts with AI-assisted SQL, data engineers with familiar tools like dbt, and data scientists with hybrid local-cloud processing.

img_duck_stack

MotherDuck integrates with popular data tools including Estuary, Fivetran, and Airbyte for data ingestion, dbt for transformations, Tableau and PowerBI for visualization, and Airflow and Dagster for orchestration. This enables teams to build data warehousing solutions using their existing tools.

Data Ingestion

An easy way to get into MotherDuck is using ecosystem partners like Estuary, Fivetran, dlthub, and Airbyte but you can also create custom data engineering pipelines.

MotherDuck is very flexible with how to load your data:

From data you have on your filesystem: If you have CSVs, JSON files or DuckDB databases sitting around, It's easy to load it into your MotherDuck data warehouse.
From a data lake on a cloud object store: If you already have your data in a data lake, as parquet, delta, iceberg or other formats, DuckDB has abstractions for Secrets, Object Storage, and many file types. When combined, this means that many file types can be read into DuckDB from Object Storage with only SQL. Though not as performant as MotherDuck's native storage layer, you can also query your infrequently-accessed data directly from your data lake with MotherDuck.
Using Native APIs in many languages: DuckDB supports numerous languages such as C++, Python, and Java, in addition to its own mostly Postgres-compatible SQL dialect. Using these languages, Data Engineers and Developers can easily integrate with MotherDuck without having to pick up yet-another-language.

Best Practices for Programmatic Loading

The fastest way to load data is to load single tables in large batches, saturating the network connection between MotherDuck and the source data. DuckDB is incredibly good at handling both files and some kinds of in-memory objects, like Arrow dataframes. As an aside, Parquet files compress at 5-10x compared to CSV, which means you can get 5-10x more throughput simply by using Parquet files. Similarly, open table formats like Delta & Iceberg share those performance gains.

On the other hand, small writes on multiple tables will lead to suboptimal performance. While MotherDuck does indeed offer ACID compliance, it is not an OLTP system like Postgres! Significantly better performance can be achieved by using queues to batch writes to tables. While some latency is introduced with this methodology, the improvement in throughput should far outweigh the cost of doing small writes.

Streaming workloads are better suited to be handled with queues in front of MotherDuck.

Transforming Data

Once data is loaded into MotherDuck, it must be transformed into a model that matches the business purpose and needs. This can be done directly in MotherDuck using the powerful library of SQL functions offered by DuckDB. Many data engineers prefer to use data transformation tools like the open source dbt Core. More details specifically about using dbt with MotherDuck can be read in the blog on this topic.

For more in-depth reading, the free DuckDB in Action eBook explores these concepts with real-world examples.

Once your data is loaded into MotherDuck and appropriately transformed for use by your analysts, you can make that data available using MotherDuck's sharing capabilities. This can allow every user in your organization to access the data warehouse in the MotherDuck UI, in their Python code or with other tools. Admins don't need to worry that the queries run by users will impact their data pipelines as users have isolated compute.

Serving Data Analytics

Do you want to serve reports or dashboards for your users? MotherDuck provides tokens that can be used with popular tools like Tableau & Power BI to access your data warehouse to serve business intelligence to end users.

Ducks all the Way Down: Building Data Apps

MotherDuck is built on DuckDB because it is an extremely efficient SQL engine inside a ~20MB executable. This allows you to run the same DuckDB engine which powers your data warehouse inside your web browser, creating highly-interactive visualizations with near-zero latency. This enhances your experience when using the Column Explorer in the MotherDuck UI.

One thing that is unique to MotherDuck is its capabilities for serving data into the web layer via WASM. These capabilities enable novel analytical user actions, including very intensive queries that would be prohibitively expensive in other query engines. It also supports data mashup from various sources, so that data in the warehouse can easily be combined with other sources, like files in CSV, JSON, or Parquet.

Scaling up & out for DWH use cases

Furthermore, MotherDuck has a unique scaling model, of which there are four key concepts relevant for Data Warehousing.

Vertical Scaling

Compute can scale up with larger instances - currently, we offer 4 sizes by default: Pulse, Standard, Jumbo, and Mega. To get access to even larger instance types, please contact us.

Unlike other data warehouses, every instance is isolated from each other: one user's queries will not impact another user's from completing. This per-user tenancy concept assures you can size your warehouse correctly and use your resources very efficently.

Horizontal Scaling

For serving data to BI tools or other spiky consumers, Read-Scaling Replicas can absorb the loads and maintain low latency on user interactivity. These should be owned by the same user or service accounts that run production jobs, although they can also leverage SHARES depending on preferences.

Per-user tenancy

Especially for production runs, user should leverage seperate user or service accounts with their own dedicated compute for updating and maintaining core tables.

Distributed DuckDB

DuckDB and MotherDuck work together as a distributed system that automatically optimizes query execution between local and cloud resources through Dual Execution, enabling efficient data access regardless of location.

Orchestration

In order to keep data up to date inside of MotherDuck, often an orchestrator like Airflow or Dagster can be used. This runs jobs in specific orders to load & transform data, as well managing workflow and observability, which is necessary for handling more complex data engineering pipelines.

If this is your first data warehouse, you might consider starting with something as simple as GitHub actions or cron jobs to orchestrate your data pipelines.

info

For a more in-depth guide, check out the Data Warehousing Guide

Need Help Along the Way?

Please do not hesitate to contact us if you need help along your journey. We are here to help you succeed with your data warehouse!

Introduction to MotherDuck for Data Warehousing​

Data Ingestion​

Best Practices for Programmatic Loading​

Transforming Data​

Sharing Data​

Serving Data Analytics​

Ducks all the Way Down: Building Data Apps​

Scaling up & out for DWH use cases​

Vertical Scaling​

Horizontal Scaling​

Per-user tenancy​

Distributed DuckDB​

Orchestration​

Need Help Along the Way?​