Building a Data Warehouse with MotherDuck
What is a Data Warehouse?
A data warehouse consolidates data from multiple sources into one place for storage and analysis. It uses a common query language (SQL) and is often the jumping-off point for reporting, analytics, and strategic decision making. The data warehouse serves as the bridge from raw data to a governed, scalable dataset that serves downstream consumers. While DuckDB is excellent at processing and serving large datasets, MotherDuck adds the missing components to make it a true data warehouse.
Some common tools in a data stack are:
- BI tools for data visualization and reporting
  - Omni, Tableau, Power BI
- Ingestion tools to load data from business apps across your enterprise
  - Fivetran, Airbyte, dltHub
- Transformation tools to make the data more usable
  - dbt, SQLMesh, Paradime.io
- Orchestration tools to stitch it all together
  - Airflow, Kestra, Dagster
The MotherDuck Ecosystem
MotherDuck plugs into this ecosystem, integrating with partners across each of these categories.
MotherDuck Architectural Concepts
MotherDuck is a cloud-native data warehouse built on top of DuckDB, a fast in-process analytical database. It inherits features from DuckDB that present opportunities to think differently about data warehousing methods, achieving high performance while simplifying the experience.
- Isolated Compute Tenancy: Each user is allocated their own “duckling,” an isolated unit of compute that sits on top of the MotherDuck storage layer. MotherDuck is designed this way to lessen contention between users, a common challenge with other data warehouses.
- Aggressively Serverless: Unlike conventional data warehouses, DuckDB automatically parallelizes the work you send to it. The implication is that scheduling multiple queries at a time does not meaningfully increase throughput, as DuckDB has already parallelized the workload across all available resources.
- Database-Level Security Model: MotherDuck has a simplified access model: users either have access to an entire database, or none at all. As a result, users frequently interact with data at the database level. This is unusual compared to other warehouses, which often present multiple database files as a single logical database from the user's perspective.
- Database Sharing: MotherDuck separates storage and compute, which means one user cannot see another user's writes to a database until the share is refreshed for that user. MotherDuck therefore has its own concept of SHARES within Organizations: zero-copy, read-only clones of a database that enable analytics workloads to scale out.
- Dual Execution: Every MotherDuck client is also a DuckDB engine, so you can efficiently query local data and combine it (JOIN, UNION) with data stored in your MotherDuck data warehouse. The query planner automatically decides the best place to execute each part of your query.
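As a loose illustration of dual execution, here is a minimal Python sketch. The database name (my_db), the customers table, and the local_orders.parquet file are all hypothetical, and it assumes your MotherDuck token is already configured in the environment.

```python
import duckdb

# Connect to MotherDuck (assumes the motherduck_token environment
# variable is set; "my_db" is an illustrative database name).
con = duckdb.connect("md:my_db")

# Join a local Parquet file with a table that lives in MotherDuck.
# The planner decides where each part of the query executes.
con.sql("""
    SELECT o.order_id, o.amount, c.region
    FROM 'local_orders.parquet' AS o   -- read by the local DuckDB engine
    JOIN customers AS c                -- resides in the MotherDuck database
      ON o.customer_id = c.customer_id
""").show()
```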
Data Ingestion
An easy way to get data into MotherDuck is through ecosystem partners like Fivetran, dltHub, and Airbyte, but you can also build custom data engineering pipelines.
MotherDuck is very flexible about how you load your data:
- From files on your filesystem: If you have CSVs, JSON files, or DuckDB databases sitting around, it's easy to load them into your MotherDuck data warehouse (see the sketch after this list).
- From a data lake on a cloud object store: If your data already lives in a data lake as Parquet, Delta, Iceberg, or other formats, DuckDB provides abstractions for Secrets, Object Storage, and many file types. Combined, these mean many file types can be read into DuckDB from object storage using only SQL. Though not as performant, you can also query infrequently-accessed data directly from your data lake with MotherDuck.
- Using native APIs in many languages: DuckDB supports numerous languages such as C++, Python, and Java, in addition to its own mostly Postgres-compatible SQL dialect. Data Engineers and Developers can therefore integrate with MotherDuck easily without picking up yet another language.
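To make the first two paths concrete, here is a minimal sketch using DuckDB's Python API. The database (my_db), file names, table names, and bucket path are all illustrative, and the S3 secret assumes ambient AWS credentials picked up via DuckDB's credential_chain provider.

```python
import duckdb

con = duckdb.connect("md:my_db")  # illustrative database name

# Load a local CSV straight into a cloud table.
con.sql("""
    CREATE OR REPLACE TABLE raw_events AS
    SELECT * FROM read_csv_auto('events.csv')
""")

# Register cloud credentials once, then read Parquet from object
# storage with plain SQL (bucket and path are illustrative).
con.sql("CREATE OR REPLACE SECRET my_s3 (TYPE s3, PROVIDER credential_chain)")
con.sql("""
    CREATE OR REPLACE TABLE raw_clicks AS
    SELECT * FROM read_parquet('s3://my-bucket/clicks/*.parquet')
""")
```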
Best Practices for Programmatic Loading
The fastest way to load data is to load single tables in large batches, saturating the network connection between MotherDuck and the source data. DuckDB is incredibly good at handling both files and certain kinds of in-memory objects, like Arrow tables. As an aside, Parquet files compress 5-10x compared to CSV, which means you can get 5-10x more throughput simply by using Parquet files. Open table formats like Delta and Iceberg, which store their data as Parquet, share those performance gains.
On the other hand, small writes across multiple tables will lead to suboptimal performance. While MotherDuck does offer ACID compliance, it is not an OLTP system like Postgres! Significantly better performance can be achieved by using queues to batch writes to tables: this introduces some latency, but the improvement in throughput far outweighs the cost of many small writes. Streaming workloads are therefore best handled by placing a queue in front of MotherDuck, as sketched below.
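As a rough sketch of that queue-in-front pattern, the loop below buffers incoming rows and flushes them as one multi-row insert. The events table (assumed to have three columns), batch size, and flush interval are all illustrative.

```python
import queue
import time
import duckdb

con = duckdb.connect("md:my_db")  # illustrative database name
buf = queue.Queue()               # producers put row tuples here

BATCH_SIZE = 10_000   # flush when this many rows accumulate...
FLUSH_SECS = 5.0      # ...or when this much time has passed

def flush(rows):
    # One multi-row insert is far cheaper than thousands of
    # single-row writes.
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

rows = []
last_flush = time.time()
while True:
    try:
        rows.append(buf.get(timeout=1.0))
    except queue.Empty:
        pass
    if len(rows) >= BATCH_SIZE or (rows and time.time() - last_flush >= FLUSH_SECS):
        flush(rows)
        rows = []
        last_flush = time.time()
```

Tuning BATCH_SIZE and FLUSH_SECS is the trade-off described above: larger batches mean higher throughput at the cost of slightly staler data.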
Transforming Data
Once data is loaded into MotherDuck, it must be transformed into a model that matches the business purpose and needs. This can be done directly in MotherDuck using the powerful library of SQL functions offered by DuckDB, as sketched below. Many data engineers prefer to use data transformation tools like the open source dbt Core; more details about using dbt with MotherDuck can be found in the blog post on this topic.
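For a flavor of transforming data in place with SQL, here is a minimal sketch; the raw_events table, its columns, and the daily_revenue model are all hypothetical.

```python
import duckdb

con = duckdb.connect("md:my_db")  # illustrative database name

# Build a business-ready model from a raw table using plain DuckDB SQL.
con.sql("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT
        date_trunc('day', event_time) AS day,
        sum(amount)                   AS revenue,
        count(DISTINCT user_id)       AS buyers
    FROM raw_events
    WHERE event_type = 'purchase'
    GROUP BY 1
""")
```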
For more in-depth reading, the free DuckDB in Action eBook explores these concepts with real-world examples.
Sharing Data
Once your data is loaded into MotherDuck and appropriately transformed for your analysts, you can make it available using MotherDuck's sharing capabilities, as sketched below. Every user in your organization can then access the data warehouse in the MotherDuck UI, in their Python code, or with other tools. Admins don't need to worry that user queries will impact their data pipelines, because each user has isolated compute.
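A minimal sketch of publishing a share follows; the share name, database name, and the exact shape of the share URL are illustrative, so check the MotherDuck docs for the precise syntax.

```python
import duckdb

con = duckdb.connect("md:my_db")  # illustrative database name

# Publish a zero-copy, read-only share of the database.
con.sql("CREATE SHARE analytics_share FROM my_db")

# Consumers attach the share by its URL in their own session, e.g.:
#   ATTACH 'md:_share/my_db/<share-id>' AS analytics;
# When the producer runs UPDATE SHARE analytics_share, readers see
# the refreshed snapshot.
```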
Serving Data Analytics
Do you want to serve reports or dashboards to your users? MotherDuck provides access tokens that can be used with popular tools like Tableau and Power BI to connect to your data warehouse and serve business intelligence to end users.
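As a sketch of token-based access from a headless job (the same token goes into a BI tool's connection settings), assuming the token is stored in a MOTHERDUCK_TOKEN environment variable of your choosing:

```python
import os
import duckdb

# Pass the service token in the connection string so no interactive
# login is needed (env var name is illustrative).
token = os.environ["MOTHERDUCK_TOKEN"]
con = duckdb.connect(f"md:my_db?motherduck_token={token}")
con.sql("SELECT 42 AS answer").show()
```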
Ducks all the Way Down: Building Data Apps
MotherDuck is built on DuckDB because it is an extremely efficient SQL engine that fits in a roughly 20 MB executable. This allows the same DuckDB engine that powers your data warehouse to run inside your web browser, creating highly interactive visualizations with near-zero latency. This is what enhances your experience when using the Column Explorer in the MotherDuck UI.
One thing unique to MotherDuck is its ability to serve data into the web layer via WASM. This enables novel analytical user interactions, including very intensive queries that would be prohibitively expensive in other query engines. It also supports data mashup across sources, so data in the warehouse can easily be combined with other sources, like files in CSV, JSON, or Parquet.
Orchestration
To keep data up to date inside MotherDuck, an orchestrator like Airflow or Dagster is often used. An orchestrator runs jobs in a specific order to load and transform data, as well as managing workflow and observability, which is necessary for more complex data engineering pipelines.
If this is your first data warehouse, you might consider starting with something as simple as GitHub Actions or cron jobs to orchestrate your data pipelines, as in the sketch below.
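As a sketch of the simplest possible orchestration, here is a single Python script that a cron entry could run on a schedule; the database, table, and bucket names are all illustrative.

```python
# pipeline.py -- a minimal load-then-transform job, scheduled with a
# cron entry such as:  0 * * * *  python /opt/pipelines/pipeline.py
import duckdb

con = duckdb.connect("md:my_db")  # illustrative database name

# Step 1: ingest the latest raw files from object storage.
con.sql("""
    INSERT INTO raw_events
    SELECT * FROM read_parquet('s3://my-bucket/events/new/*.parquet')
""")

# Step 2: rebuild the downstream model used by dashboards.
con.sql("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT date_trunc('day', event_time) AS day, sum(amount) AS revenue
    FROM raw_events
    GROUP BY 1
""")
```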
Need Help Along the Way?
Please do not hesitate to contact us if you need help along your journey. We are here to help you succeed with your data warehouse!