Chapter 8

Building Data Pipelines with DuckDB

This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

8.1 The Meaning and Relevance of Data Pipelines

A data pipeline is the infrastructure that moves data from source to value. It is usually set up to retrieve and ingest data from various sources—such as live streams, APIs, or flat files—into a data store like a data lake or warehouse. However, ingestion is just the beginning.

Before data is usable for products like Machine Learning (ML) models, dashboards, or APIs, it must undergo rigorous batch processing. This involves:

  • Transformation: Joining disparate datasets, filtering noise, and aggregating metrics.
  • Standardization: Ensuring schema consistency across inputs.
  • Masking: Anonymizing or distorting confidential and regulated data for privacy compliance.

While ETL (Extract, Transform, Load) is the traditional approach, modern cloud architectures often favor ELT, where data is loaded first and transformed in place. This chapter focuses specifically on batch processing workflows, where the output of one command becomes the input for the next, ultimately driving business value through reliable data delivery.

DuckDB's Role in Data Pipelines

Where does DuckDB fit in a modern data stack? While it is an excellent storage system, its unique architecture allows it to shine in the transformation and processing stages of a pipeline.

DuckDB offers a simple yet powerful execution model: a single binary capable of handling massive datasets that exceed memory limits. It acts as a "glue" component because it can:

  1. Ingest from anywhere: Read directly from JSON, CSV, Parquet, or other databases.
  2. Speak the common language: Use a complete SQL engine to transform data.
  3. Integrate seamlessly: Work within Python environments, interacting natively with libraries like pandas, Apache Arrow, and Polars.

In many pipelines, DuckDB is not the final destination but the high-performance engine that cleans and prepares data before it lands in a data lake (often as Parquet files) or a cloud warehouse.
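
To make this concrete, here is a minimal sketch of DuckDB acting as that glue from Python. The file paths, column names, and the players DataFrame are hypothetical placeholders, not code from the chapter.

  import duckdb
  import pandas as pd

  # An in-memory DataFrame produced earlier in the pipeline (hypothetical data).
  players = pd.DataFrame({"player_id": [1, 2], "name": ["Alice", "Bob"]})

  con = duckdb.connect()  # in-memory database; pass a file path to persist

  # Read a raw CSV directly, join it against the DataFrame, and write the
  # cleaned result out as Parquet -- all in one SQL statement.
  con.execute("""
      COPY (
          SELECT m.*, p.name
          FROM read_csv_auto('raw/matches.csv') AS m
          JOIN players AS p USING (player_id)
      ) TO 'clean/matches.parquet' (FORMAT PARQUET)
  """)

Because nothing here requires a server or a schema migration, the same few lines run unchanged on a laptop or inside a scheduled job.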

8.2 Data Ingestion with dlt

Ingestion is often the messiest part of a pipeline. APIs change, schemas drift, and scripts break. To solve this, the chapter introduces the data load tool (dlt), an open-source Python library designed to automate data loading.

Why use dlt with DuckDB?

  • Schema Inference: dlt automatically detects schema from inputs like JSON or DataFrames and maps them to DuckDB tables.
  • Incremental Loading: It tracks state (via metadata tables like _dlt_loads), ensuring you only load new data rather than reprocessing the entire history.
  • Resilience: It handles the normalization of nested data structures automatically.

Case Study: The Chess Pipeline

The chapter demonstrates building a pipeline that pulls player profiles and game archives from the Chess.com API. Using dlt init chess duckdb (see the initialization guide), developers can spin up a pipeline that fetches nested JSON data, flattens it, and loads it into a local chess_pipeline.duckdb file, ready for analysis.

  • Get the full source code for this pipeline here.
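
As a rough illustration of the pattern (not the chapter's packaged chess source, which dlt init generates for you), a hand-rolled dlt resource for Chess.com player profiles could look like the following. The usernames, table name, and primary key are assumptions made for the sketch.

  import dlt
  import requests

  @dlt.resource(table_name="players_profiles", write_disposition="merge", primary_key="player_id")
  def player_profiles(usernames=("magnuscarlsen", "hikaru")):
      for username in usernames:
          # Chess.com's public API returns one JSON document per player.
          response = requests.get(f"https://api.chess.com/pub/player/{username}")
          response.raise_for_status()
          yield response.json()

  pipeline = dlt.pipeline(
      pipeline_name="chess_pipeline",
      destination="duckdb",      # creates chess_pipeline.duckdb next to the script
      dataset_name="chess_data",
  )

  load_info = pipeline.run(player_profiles())
  print(load_info)  # run metadata of the kind dlt records in _dlt_loads

Schema inference, flattening of nested fields, and the load bookkeeping all happen inside pipeline.run(), which is what keeps the ingestion step short and repeatable.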

8.3 Data Transformation and Modeling with dbt

Once data is ingested, it needs shaping. The data build tool (dbt) is the industry standard for managing SQL-centric transformation workflows. It brings software engineering best practices—like modularity, version control, and testing—to data analytics.

Through the dbt-duckdb adapter, DuckDB becomes the processing engine for dbt models. This combination allows engineers to:

  • Transform in SQL: Write SELECT statements that clean and model data.
  • Output to Files: Instead of just creating tables, dbt with DuckDB can read raw CSVs and output optimized Parquet files to a local directory or S3 bucket.
  • Ensure Quality: Implement rigorous tests. The chapter details using schema.yml to assert that columns are not_null and referentially consistent, and using the dbt_expectations package to validate data ranges (e.g., ensuring match dates fall within a specific year).

The example provided processes Jeff Sackmann’s tennis dataset, transforming raw CSVs into a clean Parquet dataset while validating the integrity of tournament surfaces and player IDs.
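
The dbt project itself is written in SQL and YAML; as a complementary Python sketch, the snippet below runs a dbt build programmatically (dbt-core 1.5+ and the dbt-duckdb adapter assumed) and then spot-checks the resulting Parquet file with DuckDB, mirroring the kind of not_null and date-range tests described above. The project directory, file paths, column names, and year are illustrative.

  import duckdb
  from dbt.cli.main import dbtRunner

  # Run the dbt project programmatically -- equivalent to `dbt build` on the CLI.
  result = dbtRunner().invoke(["build", "--project-dir", "tennis_dbt"])
  assert result.success

  # Spot-check the Parquet output outside dbt, mirroring the declared tests.
  con = duckdb.connect()
  violations = con.execute("""
      SELECT count(*)
      FROM read_parquet('output/matches.parquet')
      WHERE winner_id IS NULL
         OR tourney_date NOT BETWEEN DATE '2018-01-01' AND DATE '2018-12-31'
  """).fetchone()[0]
  print(f"rows violating the checks: {violations}")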

8.4 Orchestrating Pipelines with Dagster

ETL scripts and SQL models need a conductor to tell them when to run. Dagster is a cloud-native orchestration tool that introduces the concept of Software-Defined Assets. Unlike simple cron jobs (see crontab.guru for syntax help), Dagster allows you to define assets (tables, files, ML models) in Python code, making dependencies explicit and verifiable.

Using the dagster-duckdb library, the chapter builds an orchestrated pipeline that:

  1. Defines Assets: Uses the @asset decorator to define Python functions that produce data tables (view the asset definition code; a sketch follows this list).
  2. Manages Dependencies: Ensures the "Players" table is fully loaded and cleaned before the "Match Results" asset attempts to query it.
  3. Visualizes Lineage: Provides a UI to view the entire graph of data flow, making debugging and auditing significantly easier.
  4. Integrates Python & SQL: Demonstrates loading a pandas DataFrame (tournament levels) directly into DuckDB as part of the orchestrated workflow.
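
A minimal sketch of this setup with dagster and dagster-duckdb is shown below (recent versions of both libraries assumed); the database path, CSV locations, table names, and tournament-level values are placeholders rather than the chapter's exact code.

  import pandas as pd
  from dagster import Definitions, asset
  from dagster_duckdb import DuckDBResource

  @asset
  def players(duckdb: DuckDBResource) -> None:
      # Load the raw players file into DuckDB.
      with duckdb.get_connection() as conn:
          conn.execute(
              "CREATE OR REPLACE TABLE players AS "
              "SELECT * FROM read_csv_auto('raw/atp_players.csv')"
          )

  @asset
  def tournament_levels(duckdb: DuckDBResource) -> None:
      # Load a small in-memory pandas DataFrame directly into DuckDB.
      levels = pd.DataFrame({"code": ["G", "M"], "label": ["Grand Slam", "Masters"]})
      with duckdb.get_connection() as conn:
          conn.execute("CREATE OR REPLACE TABLE tournament_levels AS SELECT * FROM levels")

  @asset(deps=[players, tournament_levels])
  def match_results(duckdb: DuckDBResource) -> None:
      # Materialized only after its upstream assets have completed.
      with duckdb.get_connection() as conn:
          conn.execute(
              "CREATE OR REPLACE TABLE match_results AS "
              "SELECT m.* FROM read_csv_auto('raw/atp_matches.csv') m "
              "JOIN players p ON m.winner_id = p.player_id"
          )

  defs = Definitions(
      assets=[players, tournament_levels, match_results],
      resources={"duckdb": DuckDBResource(database="tennis.duckdb")},
  )

Running dagster dev against these definitions brings up the UI where the asset graph and its lineage can be inspected.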

Uploading to MotherDuck

Local pipelines are great for development, but production requires the cloud. The chapter concludes by demonstrating how to seamlessly migrate the entire architecture to MotherDuck, a managed cloud service for DuckDB.

The transition requires minimal code changes. By simply updating the connection string from a local file path to md:my_database?motherduck_token=..., the pipeline is "lifted and shifted." Dagster continues to orchestrate the logic, but the compute and storage happen in the cloud. This allows data teams to easily publish datasets, share findings with stakeholders, and scale their storage without managing infrastructure. You can then confirm the database creation in the MotherDuck UI.
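
As a sketch, the switch looks roughly like this in Python (an existing MotherDuck account is assumed, with the token supplied in the connection string or via the MOTHERDUCK_TOKEN environment variable; database and table names are illustrative):

  import duckdb

  # Development: a local database file.
  # con = duckdb.connect("tennis.duckdb")

  # Production: the same code pointed at MotherDuck. The "..." stands for
  # your actual token, or drop the parameter and set MOTHERDUCK_TOKEN instead.
  con = duckdb.connect("md:my_database?motherduck_token=...")

  # The rest of the pipeline is unchanged -- only the connection string moved.
  con.execute(
      "CREATE TABLE IF NOT EXISTS matches AS "
      "SELECT * FROM read_parquet('clean/matches.parquet')"
  )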
