Why use DuckDB in your data pipelines ft. Niels Claeys
2023/10/06 | Featuring: Niels Claeys
Many data lake pipelines are over-engineered. The default reliance on powerful, distributed engines like Apache Spark for every task creates unnecessary complexity and cost, especially when 80-90% of workloads are too small to justify that level of power. This "one-size-fits-all" approach can slow development cycles, increase infrastructure costs, and create a bottleneck around a small group of specialized data engineers.
A more modern, pragmatic approach involves right-sizing the processing engine for the task at hand. By combining the strengths of dbt (data build tool) and DuckDB, teams can build a lightweight, fast, and SQL-native alternative for the majority of data lake workloads. This stack empowers skilled data analysts, who are proficient in SQL but not necessarily distributed systems, to build and maintain production-grade data pipelines.
The result is increased team velocity, reduced complexity, and a more efficient data platform.
The Hidden Costs of a One-Size-Fits-All Data Stack
One of the most significant challenges in data platform management is not just technological, but organizational. Good data engineers who can manage complex distributed systems like Spark are difficult to find and retain. This scarcity often makes the data engineering team a bottleneck, unable to keep up with the business's demand for new data products and insights.
At the same time, most organizations have highly skilled data analysts who possess deep business knowledge and an intimate understanding of the company's data. These analysts are universally proficient in SQL, the lingua franca of data. By enabling them to build their own data transformation pipelines, organizations can unlock tremendous productivity. The key is to provide them with tools that leverage their existing SQL skills within a framework that ensures production-level quality and reliability. This is where dbt enters the picture, bringing the best practices of software engineering to the world of SQL.
Bringing Engineering Discipline to SQL with dbt
The data build tool brings the best practices of software engineering to SQL-based data transformation. It allows analysts and engineers to build data pipelines with modularity, version control, testing, and documentation. By creating models that depend on each other, dbt builds a clear dependency graph (or lineage), making complex pipelines easier to understand, maintain, and debug.
However, dbt is a transformation orchestrator, not a query engine. To operate directly on files in a data lake, it needs a powerful SQL engine capable of reading from and writing to cloud object storage. While one could load data lake files into a traditional data warehouse for processing, this adds cumbersome and often unnecessary steps to the workflow. A more elegant solution is to use an engine that can query the data where it lives.
DuckDB: The Missing Engine for Data Lake Analytics
DuckDB is the ideal query engine for running dbt jobs directly on a data lake. Its design offers several key advantages for this use case:
- Direct Data Lake Access: DuckDB can natively read and write common file formats like Parquet and CSV directly from object storage systems such as AWS S3 or Azure Blob Storage (see the sketch after this list). This eliminates the need for a separate data loading step, simplifying the entire architecture.
- Simplified Execution Model: As an in-process, single-node database, DuckDB is incredibly easy to reason about. When a query fails or performs poorly, developers do not need to debug the complexities of a distributed system, such as data shuffling or memory allocation across multiple nodes. This simplicity makes it more accessible to a broader range of technical users.
- Versatile and Easy to Start: DuckDB can be run anywhere, from a local Python environment for testing to a container on Kubernetes for production. This allows for a consistent development experience, where integration tests can be run locally with the same engine that powers the production pipeline, a task that is often more cumbersome with Spark.
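A minimal sketch of the first and third points, assuming the duckdb Python package and the httpfs extension are installed and that S3 credentials are already configured in the environment (bucket, path, and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()           # in-process, nothing to provision
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")           # enables reading s3:// paths

# The same query shape works on local files during development...
con.sql("SELECT count(*) FROM read_parquet('data/orders/*.parquet')").show()

# ...and directly against Parquet files in the data lake in production.
con.sql("""
    SELECT order_date, sum(amount) AS revenue
    FROM read_parquet('s3://my-datalake/orders/*.parquet')
    GROUP BY order_date
""").show()
```

Because the engine is identical locally and in a container, the only thing that changes between an integration test and a production run is the file path.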
How dbt and DuckDB Fit into a Hybrid Data Ecosystem
Perhaps the most compelling aspect of using dbt and DuckDB is its ability to coexist seamlessly with existing Spark-based pipelines. Because both Spark and DuckDB can use Parquet files on S3 as a common interface, they are perfectly interoperable. A pipeline built with dbt and DuckDB can read the output of an upstream Spark job, and a downstream Spark job can consume the Parquet files generated by a dbt and DuckDB pipeline.
This interoperability means teams can adopt this new stack incrementally without undertaking a massive migration project. They can choose the right tool for each job, using dbt and DuckDB for a new, medium-sized workload while leaving existing large-scale Spark jobs untouched.
Community-driven tools like the dbt-duckdb adapter, originally created by Josh Wills, make this integration possible. The adapter solves a critical problem: since each dbt run starts with a fresh, in-memory DuckDB instance, it needs a way to know about tables created in previous runs. The adapter provides a mechanism to register these upstream models, effectively creating a catalog of the Parquet files stored in S3. This allows the new DuckDB instance to query the output of previous jobs as if they were native tables, enabling complex dbt projects to be broken down into multiple, independent steps.
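Conceptually, that registration amounts to exposing each upstream model's Parquet output as a view in the fresh DuckDB instance so downstream SQL can treat it as a table. The sketch below hand-rolls the idea with plain DuckDB; the paths are hypothetical, and the dbt-duckdb adapter automates this wiring as part of its external materializations rather than requiring it to be written by hand:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")   # assumes S3 credentials are available in the environment

# An upstream model was materialized as Parquet on S3 by a previous dbt run (or a Spark job).
con.sql("""
    CREATE OR REPLACE VIEW stg_orders AS
    SELECT * FROM read_parquet('s3://my-datalake/dbt/stg_orders/*.parquet')
""")

# Models in the current run can reference it as if it were a native table,
# and write their own output back to the lake for the next job to pick up.
con.sql("""
    COPY (SELECT customer_id, count(*) AS n_orders FROM stg_orders GROUP BY customer_id)
    TO 's3://my-datalake/dbt/fct_customer_orders.parquet' (FORMAT PARQUET)
""")
```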
Performance Deep Dive: DuckDB vs. Spark and Trino on the Data Lake
To validate this approach, a benchmark compared its performance against established distributed engines using the 100GB TPC-DS dataset, a standard for testing analytical database performance. To ensure a fair comparison, all tests followed the same workflow: read Parquet files from S3, execute the transformation query, and write the results back to S3. This mirrors a real-world data lake ETL pipeline.
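The per-query round trip for the DuckDB runs can be pictured roughly like the sketch below; the paths and the sample aggregation are illustrative placeholders, not the actual benchmark code, which executes the TPC-DS queries through dbt:

```python
import time
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")   # S3 access; credentials assumed to come from the environment

query = """
    SELECT ss_store_sk, sum(ss_net_paid) AS revenue
    FROM read_parquet('s3://my-datalake/tpcds/store_sales/*.parquet')
    GROUP BY ss_store_sk
"""

start = time.perf_counter()
# Read from S3, transform, and write the result back to S3 in one statement.
con.sql(f"COPY ({query}) TO 's3://my-datalake/benchmark/out.parquet' (FORMAT PARQUET)")
print(f"query finished in {time.perf_counter() - start:.1f}s")
```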
The results highlight the strengths and weaknesses of each engine for this data scale.
- Low-Overhead Advantage: DuckDB’s speed is immediately apparent. Over half of the TPC-DS queries finished in under 15 seconds, a timeframe that is often shorter than Spark’s startup time alone. For small and medium-sized jobs, this low overhead eliminates a major source of inefficiency.
- Optimized Vectorized Engine: Across the full benchmark, DuckDB consistently outperformed both Spark and Trino on most queries. Its highly optimized vectorized query engine excels at analytical workloads on a single node.
- Knowing the Limits: Transparency about limitations is crucial. As a single-node engine, DuckDB can run out of memory on extremely large or complex queries. In the benchmark, it failed on a few of the most resource-intensive queries where a distributed engine like Spark, which can spill to disk and leverage multiple nodes, remains the superior choice.
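When DuckDB is pushed toward the upper end of that range, a few session settings can lower (though not remove) the risk of hitting the single-node memory ceiling. A minimal sketch, assuming a reasonably recent DuckDB version and a pod with local scratch disk:

```python
import duckdb

con = duckdb.connect()
# Cap DuckDB below the container's memory limit so queries fail gracefully
# (or spill, where the operator supports it) instead of triggering the OOM killer.
con.sql("SET memory_limit = '24GB'")
# Give DuckDB a place to spill intermediate results for larger-than-memory operations.
con.sql("SET temp_directory = '/scratch/duckdb_tmp'")
# Match the node's vCPUs (the benchmark node has 8).
con.sql("SET threads = 8")
```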
A Simple Framework for Choosing Your Data Lake Engine
The goal is not to replace Spark entirely but to add a more efficient and accessible tool to the modern data stack. By embracing a hybrid approach, data teams can become more productive and cost-effective. The decision of which tool to use can be guided by simple heuristics based on data size and pipeline complexity.
- Use dbt with DuckDB for data pipelines with simple-to-medium complexity that process small-to-medium datasets (up to roughly 100GB). This covers a large percentage of common ETL and analytics engineering tasks; a sketch of this routing rule follows the list.
- Use Spark or another distributed engine for pipelines that process truly large-scale data or involve highly complex transformations that benefit from distributed execution.
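As a concrete (if simplistic) illustration, the routing rule can be encoded wherever pipelines are scheduled; the thresholds below are the article's rough rules of thumb, not hard limits:

```python
def choose_engine(input_size_gb: float, is_complex: bool) -> str:
    """Route a pipeline to a lightweight or distributed engine based on rough heuristics."""
    if input_size_gb <= 100 and not is_complex:
        return "dbt + DuckDB"
    return "Spark (or another distributed engine)"

# A 40GB join-and-aggregate job stays on the lightweight stack;
# a 500GB job moves to a distributed engine.
print(choose_engine(40, is_complex=False))    # dbt + DuckDB
print(choose_engine(500, is_complex=False))   # Spark (or another distributed engine)
```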
By adopting this pragmatic approach, organizations can build more efficient data platforms. They can reduce infrastructure costs by avoiding over-provisioning for common tasks and, most importantly, empower a wider range of data professionals to build the pipelines the business needs. This shift frees up specialized data engineers to focus on the truly complex challenges where their expertise is most valuable.
Transcript
0:04 So good evening everyone. Tonight I'm going to talk to you about what we use at Data Minded and how we use DuckDB, and primarily I'm going to focus on using DuckDB in your data pipelines. Let's start with the obligatory slide about myself. Who am I? I'm one of the lead data engineers at Data
0:26 Minded. I've worked in data for over six years, primarily building use cases in streaming and batch, but the past four years I've focused more on building data platforms. Basically, I try to make the use case teams of our customers more efficient, more effective and more productive. I blog a bit about the problems I see or the
0:50 solutions I have on Medium, so you can find me there, and on LinkedIn as well. Before diving into DuckDB, let's take a very high bird's-eye view of what a typical batch data platform looks like at many of our customers. This figure is probably nothing surprising for any
1:12 of you. We have your operational data on the left; you want to do some analytics, you want to get some insights, a customer wants to become more data driven. So what do you do? To provide these insights you ingest all the sources you need and you put them in
1:30 either your data warehouse or a data lake, depending on which use case you're talking about. And of course we need to do some processing: the data we ingest from an operational system isn't perfect, there are some data quality issues, sometimes you want to enrich this data or combine it, so we need to do some
1:50 processing. In the end the goal is of course to serve products: either BI dashboards, some machine learning, or exposing this data as an API. What I'm going to talk about in this presentation is mostly the data processing on top of the data lake. So the data resides in blob
2:14 storage, in S3, Azure Blob Storage, whatever, it doesn't really matter, and you want to do some processing on it. I'm not going to talk about data warehouses and how you can use those, because my main focus, what I deal with the most, is processing using flat files basically,
2:35 using Parquet files, etc. So if we
2:39 change this picture and look at the technologies, it's a bit like the slide shown earlier: nothing surprising here, you probably already know all the technologies that are on there. I just want to focus on the processing part. Historically we saw, and we still see at a lot of our customers,
3:00 that they typically have two types of processing. On one side they use Spark for their ETL pipelines, for their complex pipelines but also for their easy pipelines, to aggregate the data and to join the different sources together. Those are typically built by data engineers, data
3:24 engineering teams that are technically well skilled and can do this job. On the other side they often have a part that uses pandas, and that's primarily the data scientists, because they want to build their ML models, and of course it's a lot easier to do ML if you pull
3:41 in all the data on a single node and process it there instead of using a distributed engine. So this is what we see at a lot of our customers, and this works, of course; by itself it's not a bad solution. But we see, or well, I observe,
3:59 that there are quite some inefficiencies in always using Spark. Of course you could say, well, we standardize on one tool, and if the use cases increase, if the data increases, we could still use that same tool. That's true, and that is a valid argument,
4:17 but on the other hand, if you do that and your data stays small, then it might be very inefficient to always use Spark. So what do I typically observe at those customers? I see that 80 to 90% of their use cases use small or medium-sized data, and for medium-sized my rule of thumb is up to 100 gigabytes of
4:40 processing data; that's medium-sized, and if you go above that it becomes large. So for 80% of their pipelines it's small and medium-sized data, and only for a small portion of their pipelines do they really need a distributed engine like Spark or other solutions, cloud data warehouses, etc. So this becomes a bit inefficient. Additionally,
5:05 another problem we typically encounter at many of our customers is that they find it very difficult to find good engineering people. Nothing surprising, you all know this: it's very hard to find good data engineers, and also to keep them. But on the other side, a lot of these customers do have very good data
5:24 analysts. And why are they very good? Because they know the company's data very well, and they are close to the business, so they know which use cases it typically wants to develop. So it would be a win-win for the company if these data analysts were enabled, or
5:42 had the possibility, to write a part of these pipelines, certainly the non-complex part, instead of always relying on the data engineering team, which is typically understaffed, overburdened, a bottleneck, and doesn't know where to start. So for efficiency purposes and also for delivery purposes it would be good if these data analysts could do a
6:06 bit more. And looking at data analysts, instead of focusing on teaching them Python or teaching them how to code, they all know SQL, they have typically been using it for years, they all get it taught, so it would be good if you would use a tool, if you would use SQL,
6:28 for them. Then the third and last observation is that while you can write very complex things in Spark, and I've seen crazy complex pipelines, it's typically again a very small portion of these pipelines that is really complex. In a lot of cases it's just: take some sources, add some columns, add some additional features, join two or three sources together,
6:52 and that becomes the output. Why is that important? Well, this could easily be done in SQL. If you write something really complex in SQL it quickly becomes unreadable for me, but if you keep it simple then these SQL pipelines are as readable to me as code, and they
7:12 could be developed by an analyst instead of a data engineer. So with these observations in mind, I think there is a possibility, of course no surprise, for a different tech stack than just relying on Spark to do your data processing. I'm going to build this around dbt and DuckDB. So why dbt? As
7:37 I mentioned, the data analysts know SQL; we all know SQL quite well too, we're pretty proficient in it I guess, so that's a good starting point. Additionally, what I like as an engineer about dbt and writing SQL is firstly that it decouples part of the interface:
7:59 functionally, what does it need to do, which is done by the analyst, and how does it execute, which I can still modify. I can still change the underlying execution engine without needing to change everything on the front end, the functional side, let's say. Of course there are some dialect differences between different
8:19 databases, different implementations, but overall the large portion of the functionality can stay the same. A second thing: dbt brings a bit of what I typically like, the software best practices, towards SQL. It gives you templating, it creates modularity, you can create models that depend on other models; those are things that are really nice to have.
8:44 And last but not least, you have some quite neat features that come out of the box: the lineage part, but also the documentation you get out of the box. If you document your models well, then these documentation pages are just automatically created, and they're quite nice. So, first let's say
9:08 that with dbt alone we cannot replace Spark, because dbt on top of flat files doesn't work; it needs a SQL engine to run. So with just dbt, what would you do? Well, you could push your data from your data lake towards a data warehouse, could be a Postgres or a Snowflake or Redshift, and do your
9:28 queries there, but that feels like a very cumbersome way of writing ETL pipelines, and I think that's not a great solution for replacing these Spark pipelines. So at the beginning of this year, well, let's say the end of last year, was the first time I encountered DuckDB, and initially I
9:51 liked it, but it didn't click yet for me. Then I went to Brussels, to the DuckDB conference, where I met some people, we got to talking, and from that moment on it felt like, well, DuckDB can solve a lot of the challenges or the
10:10 problems that I encounter at many of our customers. And why is that? For me, one of the most important things about DuckDB is the easy integration with external storage. You can easily pull in files from blob storage, from S3, from Azure Blob Storage, etc. without needing to do too
10:31 much work; if there are Parquet files, if there are CSV files, it's supported out of the box. It's simple in its execution: of course it's single-node, which makes it very easy to reason about. I don't need to explain to my colleagues or to customers why a certain Spark job fails
10:51 because they added a transformation which caused it to shuffle all the data around and eventually die because it goes out of memory. With single-node execution they will soon realize when they did something wrong or pulled in all the data, because it can of course still go out of memory, but I have
11:10 nothing to explain to them about this.
11:14 And the last thing is that it's very easy to get started with. We can package it and run it on Kubernetes, but you can just as well run some tests easily on your local setup, in your Python environment; you just launch it and run some integration tests. So it's very versatile in that way, and that's
11:33 also something I like about it. I'm not going to say that the data analysts find this a very compelling reason, but for me it really is: you can get a very similar experience locally versus remotely. For Spark it's more difficult; of course you can write your unit tests with Spark,
11:52 but launching a local cluster is a bit more cumbersome and often too difficult to explain at our customers. So if we combine these two technologies, dbt and DuckDB, why do I think this is a valuable setup, or that this works very well? As I said, for
12:13 dbt to work you need a SQL engine, could be a data warehouse, but we can also use DuckDB. The most compelling part for me is that we could swap Spark with dbt and DuckDB
12:29 in one pipeline while still keeping Spark to run all the other pipelines, since the input for my Spark job is just my Parquet files on S3 and the output is the same for dbt and DuckDB; it's exactly the same interface. So they're actually a drop-in replacement, and I can choose, depending
12:48 on the use case, which one I want to use, and I'm not limited by the
12:55 fact that, let's say, a dependent Spark job wrote its output to S3 and that I'd need to do some transformation; no, the output is exactly the same. So they are very interoperable, which at a lot of our customers is a good reason, because you don't need to invest in a
13:14 large migration path. You can just say, as you go: you have a use case which processes just a small set of data, you can use dbt and DuckDB, and for the other use cases you can still keep Spark for your pipelines. I know that of course depends on the infrastructure
13:34 you run on. We typically use Kubernetes quite a lot, so we schedule everything on Kubernetes, and there swapping between them is a lot easier. If, of course, you have a customer that uses Databricks and you want to support dbt and DuckDB, it's a bit more effort because the infrastructure doesn't match.
13:54 So that depends on how you set it up, of course, but in our case this works very well. The last part of my talk will be to show you that it's actually very fast and very efficient on these data sets. But
14:11 before going there I want to give a small shout-out to Josh Wills, the guy who wrote the dbt-duckdb adapter, which is of course the adapter we use to be able to run dbt against DuckDB. It's actually a really great adapter. I helped him a bit on a couple of parts
14:35 where we had missing links, but overall he provided a great adapter for us to use. I worked with him on the credential provider chain, which we need for Kubernetes; I don't want to use static credentials, I want to use temporary credentials and web identity tokens in my jobs. So
14:53 this is one thing where we collaborated, and the rest of the support is actually great. Of course, you know, DuckDB
15:04 runs in-process. If you have a couple of jobs that are chained together, the models that are produced by one job are not available when your next job launches, because the DuckDB database will be empty. There is a great hook provided by the adapter that allows you to register
15:25 dependent models. What does that mean? Let's say you have model A, and model B depends on it, and model A ran in a previous job, so the data is on S3, but when you launch model B it's not in the database yet. The hook says: register these upstream models, meaning register this model in dbt so it
15:44 knows where to look for those files. This means that you don't need to run all the models within the same job when you execute dbt; you can split them up depending on what you want, what makes sense. And then there was already Iceberg support in the adapter, but I heard from Mehdi
16:02 that it's probably coming in DuckDB itself, which would be even better, because it was still in beta support and there were some issues; I wouldn't use it in production yet. Now, the last part is about
16:16 performance. I did quite some benchmarks to be able to compare different setups and see from a performance standpoint what makes sense. It's performance, but for our customers it typically relates one-to-one to the cost of running the pipeline, because most of the cost is related to how long these
16:38 instances run for the job. So what do I use? I use 100 gigabytes of input data from the TPC-DS benchmark, I don't know if you guys know it; it's a standard benchmark to test the performance of databases. And there
16:57 is one important thing I wanted to add: a lot of these benchmarks prepopulate the data inside a database before they run their queries, which makes sense if you compare databases, but if I want to compare the performance of Spark against something like DuckDB I need to use
17:16 exactly the same setup. So I want the data to be in S3, load it in DuckDB, run the query and write the output back to S3, which is exactly the same as what Spark would do: if you run a Spark job, it reads from S3 or any other blob storage, does the processing and writes it back to S3,
17:33 and this gives, in my opinion, a fairer comparison. As mentioned, for the execution environment we typically use Kubernetes; I used one of the regular nodes we have, a 2xlarge with 32 GB of RAM and eight vCPUs. I did two benchmarks, and in the end I wanted to compare DuckDB, Spark and Trino,
17:56 which are three, let's say, open source SQL engines, and I wanted to see how they stood up against each other. Benchmark one is actually dbt and DuckDB against Spark in different configurations. As you know, Spark is a cluster; you can run it on a single node or add more nodes, so I wanted
18:18 to see what the impact is of using local mode, which means the driver and the executors are on the same node, versus a driver with one and a driver with two executors when I run this benchmark. Again, the input data set is about 100 GB; that doesn't mean every query uses 100 gigabytes, but that's
18:41 just to give you an idea of what the input data set is. And these are the results of a couple of the queries; if you look at my GitHub repo you can find the results of all queries, but this is just a highlight, they all look similar. So we
18:56 see that DuckDB actually performs very well. It does best in all queries except for the last one, where it goes out of memory, I think due to some issue with the predicate pushdown; it doesn't filter correctly, so there it doesn't work, but for all the other queries it does very
19:16 well. You also see what you typically expect for Spark: if you add more nodes it becomes more performant. Of course this is not always so easy to reason about, because if you add more nodes there might also be more data shuffling, which could be an issue, and that
19:34 could result in slower performance. So this was the initial benchmark I ran, and I said, well okay, Spark against DuckDB, DuckDB seems more performant on these types of data sets. Then a second thing we wanted to test, because a lot of our customers were asking about it, is the comparison with Trino.
19:55 Trino is also a federated, distributed execution engine, so I ran a benchmark with exactly the same setup: I used dbt with Spark, dbt with Trino and dbt with DuckDB. What you see here is for the full benchmark, which encompasses about 100 queries. On the x-axis you see
20:19 basically time buckets, and on the y-axis you see the number of queries that finished within that time. What's important to note here is that this shows that DuckDB finishes, let's say, 55, so more than half of the queries, within 15 seconds. This is faster than the first
20:41 query that Spark returns a result for, showing that for this medium-sized data set Spark has of course a lot of overhead: creating the Spark context, analyzing which queries need to run; Spark is not efficient at that, so for really fast queries Spark is actually not the way to go. The other thing you
21:01 notice is that Trino does fairly well. And, yeah, I think I mentioned this part: there are some queries that are really slow on both Spark and Trino. It looks like DuckDB does a lot better, but this is actually not the case, because there are some queries where it goes out of memory, and those
21:20 are not taken into account in these results, so don't mind this too much.
21:26 But what I want to show is that dbt with DuckDB is actually a
21:33 good solution, given two constraints, or let's say
21:41 depending on which use case you have: if you have medium-sized data I would use it, if you have simple pipelines I would use it. If that's not the case, if you're doing really complex stuff, I wouldn't bother, or if the data sets get too large, then of course use Spark, use Trino,
21:59 use whatever else exists, because it will be more performant; otherwise you will have too many issues and the cost and performance benefit won't add up anymore. So that was it about the use case I wanted to talk about. Are there any
22:24 questions?