Why use DuckDB in your data pipelines ft. Niels Claeys
2023/10/06

Many data lake pipelines are over-engineered. The default reliance on powerful, distributed engines like Apache Spark for every task creates unnecessary complexity and cost, especially when 80-90% of workloads are too small to justify that level of power. This "one-size-fits-all" approach can slow development cycles, increase infrastructure costs, and create a bottleneck around a small group of specialized data engineers.
A more modern, pragmatic approach involves right-sizing the processing engine for the task at hand. By combining the strengths of dbt (data build tool) and DuckDB, teams can build a lightweight, fast, and SQL-native alternative for the majority of data lake workloads. This stack empowers skilled data analysts, who are proficient in SQL but not necessarily distributed systems, to build and maintain production-grade data pipelines.
The result is increased team velocity, reduced complexity, and a more efficient data platform.
The Hidden Costs of a One-Size-Fits-All Data Stack
One of the most significant challenges in data platform management is not just technological, but organizational. Good data engineers who can manage complex distributed systems like Spark are difficult to find and retain. This scarcity often makes the data engineering team a bottleneck, unable to keep up with the business's demand for new data products and insights.
At the same time, most organizations have highly skilled data analysts who possess deep business knowledge and an intimate understanding of the company's data. These analysts are universally proficient in SQL, the lingua franca of data. By enabling them to build their own data transformation pipelines, organizations can unlock tremendous productivity. The key is to provide them with tools that leverage their existing SQL skills within a framework that ensures production-level quality and reliability. This is where dbt enters the picture, bringing the best practices of software engineering to the world of SQL.
Bringing Engineering Discipline to SQL with dbt
dbt applies software engineering best practices to SQL-based data transformation, letting analysts and engineers build pipelines with modularity, version control, testing, and documentation. By creating models that depend on each other, dbt builds a clear dependency graph (or lineage), making complex pipelines easier to understand, maintain, and debug.
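To make the modularity concrete, here is a minimal sketch of a dbt model. The model and column names (`stg_orders`, `daily_revenue`) are hypothetical, but the `ref()` call is the standard dbt pattern that produces the dependency graph described above:

```sql
-- models/daily_revenue.sql (hypothetical model name)
-- Selecting from another model via ref() is what lets dbt build the lineage graph
-- and run models in dependency order.
select
    order_date,
    sum(order_amount) as total_revenue
from {{ ref('stg_orders') }}   -- depends on an upstream staging model
group by order_date
```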
However, dbt is a transformation orchestrator, not a query engine. To operate directly on files in a data lake, it needs a powerful SQL engine capable of reading from and writing to cloud object storage. While one could load data lake files into a traditional data warehouse for processing, this adds cumbersome and often unnecessary steps to the workflow. A more elegant solution is to use an engine that can query the data where it lives.
DuckDB: The Missing Engine for Data Lake Analytics
DuckDB is the ideal query engine for running dbt jobs directly on a data lake. Its design offers several key advantages for this use case:
- Direct Data Lake Access: DuckDB can natively read and write common file formats like Parquet and CSV directly from object storage systems such as AWS S3 or Azure Blob Storage (see the sketch after this list). This eliminates the need for a separate data loading step, simplifying the entire architecture.
- Simplified Execution Model: As an in-process, single-node database, DuckDB is incredibly easy to reason about. When a query fails or performs poorly, developers do not need to debug the complexities of a distributed system, such as data shuffling or memory allocation across multiple nodes. This simplicity makes it more accessible to a broader range of technical users.
- Versatile and Easy to Start: DuckDB can be run anywhere, from a local Python environment for testing to a container on Kubernetes for production. This allows for a consistent development experience, where integration tests can be run locally with the same engine that powers the production pipeline, a task that is often more cumbersome with Spark.
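The direct-access point is easy to see in practice. A minimal sketch of querying Parquet on S3 from a local DuckDB session follows; the bucket and path are hypothetical, and S3 credentials are assumed to be configured separately (for example via DuckDB's `s3_*` settings):

```sql
-- Enable S3/HTTP access via DuckDB's httpfs extension.
INSTALL httpfs;
LOAD httpfs;

-- Query Parquet files in place on S3, no loading step required.
-- Bucket and path are placeholders.
SELECT customer_id, count(*) AS order_count
FROM read_parquet('s3://my-data-lake/orders/*.parquet')
GROUP BY customer_id;
```

The same snippet runs unchanged on a laptop or inside a container on Kubernetes, which is what makes local integration testing with the production engine straightforward.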
How dbt and DuckDB Fit into a Hybrid Data Ecosystem
Perhaps the most compelling aspect of using dbt and DuckDB is their ability to coexist seamlessly with existing Spark-based pipelines. Because both Spark and DuckDB can use Parquet files on S3 as a common interface, they are fully interoperable. A pipeline built with dbt and DuckDB can read the output of an upstream Spark job, and a downstream Spark job can consume the Parquet files generated by a dbt and DuckDB pipeline.
This interoperability means teams can adopt this new stack incrementally without undertaking a massive migration project. They can choose the right tool for each job, using dbt and DuckDB for a new, medium-sized workload while leaving existing large-scale Spark jobs untouched.
Community-driven tools like the dbt-duckdb adapter, originally created by Josh Wills, make this integration possible. The adapter solves a critical problem: since each dbt run starts with a fresh, in-memory DuckDB instance, it needs a way to know about tables created in previous runs. The adapter provides a mechanism to register these upstream models, effectively creating a catalog of the Parquet files stored in S3. This allows the new DuckDB instance to query the output of previous jobs as if they were native tables, enabling complex dbt projects to be broken down into multiple, independent steps.
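Conceptually, the registration the adapter performs amounts to exposing each upstream model's Parquet output as a queryable relation inside the fresh DuckDB instance. The sketch below illustrates that idea in plain DuckDB SQL; it is not the adapter's actual implementation, and the model name and S3 path are hypothetical:

```sql
-- Rough sketch of upstream-model registration (the dbt-duckdb adapter automates
-- this through its configuration): expose a previous run's Parquet output as a
-- view so the new in-memory DuckDB instance can query it like a native table.
CREATE VIEW stg_orders AS
SELECT * FROM read_parquet('s3://my-data-lake/dbt/stg_orders/*.parquet');

-- Downstream models can now reference stg_orders as if it were built in this run.
SELECT count(*) FROM stg_orders;
```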
Performance Deep Dive: DuckDB vs. Spark and Trino on the Data Lake
To validate this approach, a benchmark compared its performance against established distributed engines using the 100GB TPC-DS dataset, a standard for testing analytical database performance. To ensure a fair comparison, all tests followed the same workflow: read Parquet files from S3, execute the transformation query, and write the results back to S3. This mirrors a real-world data lake ETL pipeline.
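For reference, the read-transform-write pattern used in the benchmark looks roughly like this in DuckDB SQL. The bucket, paths, and query are placeholders, not the actual TPC-DS queries from the benchmark:

```sql
-- Sketch of the benchmark workflow: read Parquet from S3, run a transformation,
-- and write the result back to S3 as Parquet. All names are placeholders.
INSTALL httpfs;
LOAD httpfs;

COPY (
    SELECT item_id, sum(sales_price) AS total_sales
    FROM read_parquet('s3://benchmark-bucket/store_sales/*.parquet')
    GROUP BY item_id
) TO 's3://benchmark-bucket/results/total_sales.parquet' (FORMAT PARQUET);
```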
The results highlight the strengths and weaknesses of each engine for this data scale.
- Low-Overhead Advantage: DuckDB’s speed is immediately apparent. Over half of the TPC-DS queries finished in under 15 seconds, a timeframe that is often shorter than Spark’s startup time alone. For small and medium-sized jobs, this low overhead eliminates a major source of inefficiency.
- Optimized Vectorized Engine: Across the full benchmark, DuckDB consistently outperformed both Spark and Trino on most queries. Its highly optimized vectorized query engine excels at analytical workloads on a single node.
- Knowing the Limits: Transparency about limitations is crucial. As a single-node engine, DuckDB can run out of memory on extremely large or complex queries. In the benchmark, it failed on a few of the most resource-intensive queries where a distributed engine like Spark, which can spill to disk and leverage multiple nodes, remains the superior choice.
A Simple Framework for Choosing Your Data Lake Engine
The goal is not to replace Spark entirely but to add a more efficient and accessible tool to the modern data stack. By embracing a hybrid approach, data teams can become more productive and cost-effective. The decision of which tool to use can be guided by simple heuristics based on data size and pipeline complexity.
- Use dbt with DuckDB for data pipelines with simple-to-medium complexity that process small-to-medium datasets (up to roughly 100GB). This covers a large percentage of common ETL and analytics engineering tasks.
- Use Spark or another distributed engine for pipelines that process truly large-scale data or involve highly complex transformations that benefit from distributed execution.
By adopting this pragmatic approach, organizations can build more efficient data platforms. They can reduce infrastructure costs by avoiding over-provisioning for common tasks and, most importantly, empower a wider range of data professionals to build the pipelines the business needs. This shift frees up specialized data engineers to focus on the truly complex challenges where their expertise is most valuable.