DuckDB: Run dbt build with sub-second execution times

2025/03/13

This article is based on a tech talk by Gabor Szarnyas, Developer Relations Advocate at DuckDB Labs.

The Agony of a Slow Data Pipeline

I want to tell you a story. A few years ago, I inherited a project that involved generating large graph datasets. The original pipeline was built on Hadoop. It was slow and incredibly difficult to work with. We migrated it to Spark, hoping for an improvement. While it was a step up, it was still complex and, more painfully, expensive. We were spending tens of thousands of euros just running Spark jobs to generate this data [13:51].

The pipeline was so cumbersome that we decided to pre-generate all possible variations of the data. This led to a combinatorial explosion of files, growing to tens of terabytes that we had to store on archival tape. If you wanted to use the data, you had to request it, wait for it to be retrieved from tape, and then download it. As I said to my team, "this is obviously no way to live" [13:58].

This experience set me on a path to find a better way. What if we could take that massive, expensive cloud pipeline and run it on a laptop? What if we could do it in under a second? This is the story of how DuckDB and dbt-core make that possible. To see how this leap is possible, we first need to understand what makes DuckDB's architecture so different.

What Makes DuckDB Different? The "In-Process" Revolution

When developers first encounter DuckDB, the immediate question is: what makes it different from other databases like PostgreSQL or Snowflake? The answer lies in its architecture. DuckDB is an in-process OLAP database.

Traditional databases use a client-server model. You have a database server running somewhere—on-prem or in the cloud—and your application connects to it over a network. This model works, but it has inherent friction. There is the operational overhead of setting up and managing a server, the hassle of configuring credentials, and the network itself, which is often the biggest bottleneck. Fetching a billion-row table from a remote server will always be slow and expensive [01:54].

DuckDB eliminates this entirely by running directly inside your application. There's no server to manage, making it a truly serverless architecture—unlike cloud "serverless" options where a server is still running somewhere [02:21]. You just import duckdb in Python, and you have a full-powered analytical database at your fingertips.

We were heavily inspired by SQLite, which proved the power of a single-file, embedded database. We like to think of DuckDB's place in the ecosystem like this:

"DuckDB has the form factor of SQLite... but it has the workload style of Snowflake." [03:03]

It gives you the simplicity and portability of an embedded library with the analytical power of a modern cloud data warehouse.

How DuckDB Achieves Sub-Second Speed

How can a library running on a laptop outperform a distributed cluster for certain workloads? The magic is in three core design principles.

1. Columnar Storage and Compression

Unlike row-based databases (like SQLite or Postgres) that store all values for a given row together, DuckDB stores all values for a given column together. This is a game-changer for analytics.

Imagine a table of railway data sorted by date [03:47]. In a columnar format, all the date values are contiguous on disk. If they are sorted, we can use simple Delta encoding to compress the entire column down to a starting value and a series of tiny integers [04:03]. If a delay column contains many zeros because the trains were on time, we can compress that entire block into a single constant value [04:18]. This leads to massive reductions in data size and I/O.
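As a toy illustration of these two tricks, here is a sketch in plain Python. This is purely illustrative; DuckDB chooses and implements its encoders internally.

```python
# Sketch of two lightweight encodings DuckDB can apply per column block.
# Illustration only -- not DuckDB's actual implementation.

def delta_encode(values):
    """Sorted data compresses to a starting value plus tiny deltas."""
    return values[0], [b - a for a, b in zip(values, values[1:])]

def delta_decode(start, deltas):
    out = [start]
    for d in deltas:
        out.append(out[-1] + d)
    return out

def constant_encode(values):
    """A block where every value is identical collapses to one value + count."""
    if all(v == values[0] for v in values):
        return ("constant", values[0], len(values))
    return ("raw", values)

# A sorted date column (as epoch days) becomes a start value and small deltas:
dates = [19874, 19875, 19876, 19877, 19878]
start, deltas = delta_encode(dates)   # 19874, [1, 1, 1, 1]

# A delay column that is all zeros (trains on time) becomes one constant:
delays_encoded = constant_encode([0, 0, 0, 0, 0])
```

The deltas and the constant block take far less space than the raw values, which is exactly the effect described above.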

2. Vectorized Execution

When processing data, you could go "tuple-at-a-time," processing one row after another. This is slow and doesn't use modern CPUs efficiently [04:38]. The other extreme is "column-at-a-time," where you load an entire column into memory. This is why Pandas can famously run out of memory and crash [04:58].

DuckDB uses a vectorized execution model, which is a powerful middle ground. We process data in "vectors"—chunks of 2048 values at a time [05:13]. These chunks are small enough to fit comfortably within a CPU's L1 cache, the fastest memory available [05:51]. By operating on these cache-sized chunks, DuckDB avoids the main memory bottleneck and achieves incredible processing speed. Modern compilers can even auto-vectorize this code into SIMD (Single Instruction, Multiple Data) instructions, squeezing every drop of performance out of the hardware.
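A minimal sketch of the access pattern in plain Python. The 2048 figure matches the vector size mentioned above, but the loop itself only illustrates the idea, not DuckDB's engine:

```python
# Vector-at-a-time processing: neither one row at a time (high per-row
# overhead) nor the whole column at once (memory blow-up), but fixed-size
# chunks sized to fit in the CPU's L1 cache.
VECTOR_SIZE = 2048  # DuckDB's vector size

def chunked_sum(column, vector_size=VECTOR_SIZE):
    """Aggregate a column one cache-sized vector at a time."""
    total = 0
    for i in range(0, len(column), vector_size):
        vector = column[i:i + vector_size]  # one vector of at most 2048 values
        total += sum(vector)                # tight loop over the chunk
    return total

delays = list(range(10_000))  # a toy "delay" column
result = chunked_sum(delays)
```

The intermediate state never exceeds one vector, which is why this pattern avoids the column-at-a-time memory problem while keeping the per-value overhead low.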

3. Lightweight "Pruning" Indexes

You might think that for fast queries, you need to add lots of indexes. For transactional systems, that's true. But for analytical workloads, heavy indexes can slow down data loading and updates without providing much benefit [08:22].

Instead, DuckDB uses lightweight metadata stored for each data block. For every column in every block, we store the minimum and maximum value in what's called a "zone map" [08:51]. When you run a query like WHERE date BETWEEN '2024-06-01' AND '2024-08-31', DuckDB first checks these zone maps. It can instantly determine which data blocks cannot possibly contain matching data and skips reading them entirely. This "pruning" dramatically reduces the amount of data we need to scan, and it works not just on DuckDB's native files but even on remote Parquet files over S3 [10:19].
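The pruning idea can be sketched in a few lines of plain Python. The row-group size and the integer date encoding here are invented for the example:

```python
# Toy zone-map pruning: store (min, max) per row group, then skip any group
# whose range cannot overlap the query's filter range.

def build_zone_map(column, group_size):
    """One (min, max) pair per row group."""
    return [(min(column[i:i + group_size]), max(column[i:i + group_size]))
            for i in range(0, len(column), group_size)]

def groups_to_scan(zone_map, lo, hi):
    """Indices of row groups whose [min, max] overlaps the filter [lo, hi]."""
    return [g for g, (gmin, gmax) in enumerate(zone_map)
            if gmax >= lo and gmin <= hi]

# A sorted date column as YYYYMMDD integers, split into row groups of 4:
dates = [20240101, 20240102, 20240215, 20240301,
         20240601, 20240615, 20240701, 20240820,
         20241001, 20241015, 20241101, 20241231]
zone_map = build_zone_map(dates, 4)

# A "summer 2024" filter prunes everything except the middle row group:
summer_groups = groups_to_scan(zone_map, 20240601, 20240831)  # -> [1]
```

Only one of the three row groups needs to be read; the other two are skipped using nothing but their min/max metadata.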

Hands-On: Migrating a Spark Pipeline to dbt-duckdb

Let's go back to my slow, expensive Spark pipeline. Here’s how we can replace it with a simple, elegant dbt-duckdb project that runs in under a second.

Step 1: Prepare and Optimize Your Source Data

My original problem started with large, unoptimized CSV files [14:55]. While Parquet is the standard for columnar analytics, we can optimize even further. I used DuckDB to convert the raw CSVs into highly compressed Parquet files, using the newer Parquet format version together with the ZSTD compression codec, which resulted in roughly a 4x overall file size reduction.

You can do this with a simple COPY command in DuckDB:

```sql
-- Gabor's optimization technique
COPY person TO 'person.csv';                                                      -- 13 MB
COPY person TO 'person-v1.parquet' (PARQUET_VERSION v1);                          -- 4.4 MB
COPY person TO 'person-v2.parquet' (PARQUET_VERSION v2);                          -- 3.6 MB
COPY person TO 'person-v2-zstd.parquet' (PARQUET_VERSION v2, COMPRESSION 'ZSTD'); -- 2.6 MB
```

Each COPY statement reads the source table and writes it out as a new, increasingly optimized file.

Step 2: Build the dbt Model

Now, we set up our dbt project. The configuration is straightforward; you just need a profile that uses dbt-duckdb as the adapter.
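For reference, a minimal profiles.yml for the dbt-duckdb adapter looks roughly like the sketch below; the project name and file names are illustrative, not taken from the talk:

```yaml
# profiles.yml -- minimal dbt-duckdb profile (illustrative names)
graph_pipeline:
  target: dev
  outputs:
    dev:
      type: duckdb      # the dbt-duckdb adapter
      path: dev.duckdb  # local database file created during the run
      threads: 4
```

Your dbt_project.yml then selects this profile via its profile: key.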

The real power comes from the model itself. Instead of loading data into a database and then writing it out, we can use materialized='external' [15:40]. This tells DuckDB to execute the transformation and write the result directly to a file (like a CSV or another Parquet file) without ever storing it in the main database file.

Here's a simplified version of my dbt model that reads the optimized Parquet file and generates one of the required CSV layouts:

Input model:

```sql
FROM 'input/person.parquet'
```

Output model:

```sql
{{ config(
    materialized='external',
    location='person-project-fk.csv'
) }}

SELECT
    creationDate,
    id,
    firstName,
    lastName,
    gender,
    birthday,
    locationIP,
    browserUsed
    -- some transformations here
FROM {{ ref('person_merged_fk') }}
```

Step 3: Run and Debug

With the model in place, you simply run:

```shell
dbt build
```

The result? The entire pipeline, which used to take ages and cost a fortune on Spark, now completes in less than half a second on my laptop [16:19].

Pro-Tip for Debugging: As I was building this, I discovered a fantastic debugging trick. dbt-duckdb creates a dev.duckdb file in your project directory during a run. You can connect to this file directly with the new DuckDB UI:

```shell
duckdb -ui dev.duckdb
```

"You can peek into the import and the export tables and just troubleshoot your queries, which is again much nicer than working with Spark." [16:38]

This local-first approach is incredibly powerful, but you might be wondering: what happens when I need to collaborate with a team or work with larger cloud datasets? This is where DuckDB's ecosystem comes into play.

From Local Core to Cloud Scale: The Broader Ecosystem

DuckDB is a fully open-source project with a permissive MIT license. To guarantee it stays that way forever, we've moved the intellectual property into the non-profit DuckDB Foundation in the Netherlands [17:43]. This ensures the core project will always remain free and open, addressing a common concern in today's database landscape.

The core development happens at DuckDB Labs, a revenue-funded company (no VC!) that provides support and consulting. But what happens when your local project needs to scale? What if you need to collaborate with a team?

This is where our partner, MotherDuck, comes in.

MotherDuck takes the powerful local core of DuckDB and intelligently builds back the most useful components of the client-server model: collaboration, cloud storage, and easy data sharing [18:47]. It offers a serverless data warehouse based on DuckDB, but with a unique twist: hybrid query execution [18:51].

Imagine you have sensitive PII data on your laptop that can't leave your machine, but you need to join it with a large public dataset in the cloud. MotherDuck allows you to write a single query where one part executes locally on your sensitive data, and the other part executes in the cloud, with the results seamlessly combined. It's the best of both worlds.

Conclusion: A New Default for Your Data Stack

By pairing the lightning-fast, in-process power of DuckDB with the robust framework of dbt, we've solved the problems of speed, cost, and complexity for local data transformation. That monstrous pipeline that once cost tens of thousands of euros on Spark and lived on tape archives now runs on a laptop in the blink of an eye.

DuckDB is pushing the boundaries of what's possible on a single machine, making it the new default for interactive analysis and local development pipelines.

Ready to try it yourself?

  1. Get started: Install dbt-duckdb and run through the tutorial.
  2. Explore: Check out the official DuckDB documentation and the new DuckDB UI.
  3. Scale to the cloud: See how you can take your local workflows to the next level with MotherDuck.

Transcript

0:04 Thanks everyone for coming. It's great to be here. My name is Gabor Szarnyas and I'm a Developer Relations Advocate at DuckDB Labs. Tonight I will talk about DuckDB and show some short examples with DuckDB and dbt. Briefly about my background: I spent a decade in academia. I studied here at the Technical University of Budapest. Then I

0:24 moved to Amsterdam to a research institute, and that research institute happened to be where DuckDB was created. So that got me close to DuckDB. During research I worked on benchmarks for graph database systems. I designed benchmarks that measure the performance of analytical graph databases, and now I'm sort of the chronicler of the DuckDB project. I

0:45 edit the technical documentation of DuckDB and I contribute to and edit the blog of DuckDB. So what is DuckDB? In a single sentence, DuckDB is a universal data wrangling tool. It speaks SQL, you can do analytical queries on your tabular data, and it runs in-process. How many of you have heard about DuckDB

1:06 before? Okay, so it's a great audience. It's 80%. So I chopped all the

1:13 slides that are the high-level overview of DuckDB, and I want to talk more about the deep-dive and in-depth questions of DuckDB. First of all, why did we choose an in-process architecture? The in-process architecture stems from the observation that the traditional client-server divide, which works really well, has a few inherent drawbacks. One is that someone

1:35 has to operate the server. So that already incurs time and money costs. Second, someone has to set up the server and the connection, which is cumbersome.

1:44 You need to authenticate. You need to copy tokens around. That's just not nice. And then the third is that these arrows in between actually form a huge bottleneck. If you want to fetch a table with a billion rows and the computer is sitting in AWS EC2, you have to download it. It's going to be expensive. It's going to be slow. So DuckDB takes the

2:02 database and brings it into the client application. And when it does so, it's very easy to connect to. You just say import duckdb and you have your DuckDB instance. And most people just run DuckDB in an in-memory setup. So it's very comfortable and very easy to interact with. This is what I call the truly

2:22 serverless architecture, because, you know, in Snowflake, even if you choose serverless, there is always a server.

2:28 Here DuckDB is just sitting on your laptop. People think DuckDB is an in-memory database, and while it can work in memory, it's primarily a disk-based database. In fact, it has a format where we use a single-file database. So all your data is in one file. Can you hazard a guess where we got this idea

2:48 from? This is SQLite. Correct. SQLite, with a trillion deployments, the most popular database project, or maybe the most popular software project, was the inspiration for DuckDB. And the way we like to think about where DuckDB belongs in the area of database systems is that DuckDB has the form factor of SQLite, so it's in-process and it has a

3:10 single-file format, but it has the workload style of Snowflake. So it's geared toward analytical queries. Let's move on to the internals: mainly storage, execution, and the indexing of DuckDB. So DuckDB has a column-based storage, which means that unlike Postgres, MySQL, and SQLite, which do row-based storage where every row is stored physically together on disk,

3:35 DuckDB uses a columnar storage. This has a bunch of advantages. For example, we have this railway data set where we have some dates, some train stations, some of the delays that the trains have accumulated, and we have organized this in a calendar layout. And what you can see on the slide is that this data, like quite a few data sets, is actually sorted

3:57 along the date column. So it's ordered along the date column, meaning it's very easy to compress it. We can use something like a delta encoding and just turn it into a single integer with a starting value. Similarly, sometimes the delay is zero, because we had a lucky day and the trains were on time, and then we can just compress away the entire column

4:19 into a single constant value. So this can save a lot in disk storage. Execution-wise, there is also a big difference. One of the two main ideas that would come to mind is to do tuple-at-a-time execution. You can read the first row, do your processing, read the second row, the third row, and the fourth. But this is difficult to parallelize. It's

4:40 prone to cache misses, and it's usually not how modern CPUs like to execute code. Column-at-a-time operations, on the other hand, are great. You can just put all your dates in, and then you can aggregate on them, and then you can do the same on the delays. But this is prone to running out of memory. This is

4:58 the reason why pandas runs out of memory. Pandas has a column-at-a-time execution, and if you have too much data it will simply crash because it runs out of memory. So vectorized execution is kind of an in-between approach. We try to reap the benefits of columnar execution by taking chunks of the data. These chunks are the vectors

5:19 in DuckDB. This means 2048 entries are in every given vector, and then we iterate along those vectors.

5:28 And these vectors are actually quite great, because they allow us to chunk the data along these vectors and then distribute them along the row groups of the data to the different CPU threads.

5:40 And modern CPUs have quite big L1 caches, meaning that with rightly chosen vector sizes, all the vectors for a given operation on a given chunk fit into the L1 cache of the CPU. And of course, the L1 cache is much faster than the L2 cache, the L3 cache, and main memory. So 32 to 128 kilobytes is

6:03 actually quite a large amount of memory if the data has this compressed columnar layout with vectorized processing.

6:11 And one of the things that people like to ask is, you know, why didn't DuckDB exist 15 years before? And one of the main answers is that you need a modern compiler to auto-vectorize your code. We make sure that our code uses tight loops. We use branchless code so that the compiler can do its magic

6:28 and turn it into SIMD (single instruction, multiple data) instructions. This is something that only became available in modern compilers roughly 10 years ago.

6:39 15 years ago you would have to do it by basically writing SIMD intrinsics, which is not much easier than programming in assembly. If you run your workloads on a laptop, I found that you can usually read a CSV file from the laptop's disk into DuckDB's native database format at more than 1 gigabyte per second. In one

7:01 hour, you can complete something like the TPC-H benchmark's queries. This is a well-known and rather rigorous benchmark that's designed to torture databases and is designed for analytical workloads.

7:13 And this can run on a modern MacBook on the scale factor 3000 data set, which is three terabytes of CSV files. This is something that 10-15 years ago was the territory of big iron workstations that you bought for hundreds of thousands of dollars, and now it works on a MacBook.

7:31 It also works in other places. DuckDB is really proud of its portability. DuckDB has a bunch of clients for languages like Python, R, the command line, npm, Java, and so on. And DuckDB also runs on macOS, Linux, and Windows. DuckDB is so portable that you can compile it to WebAssembly, and then it runs in your

7:51 browser, and it also runs in the browser on the phone. You can just type shell.duckdb.org in your phone's browser, and you can run SQL queries as shown in the example. A bunch of neat features about DuckDB: I promised that I would talk about indexing. And kind of the counterintuitive thing about indexing is that people learn in

8:13 graduate school that you should put indexes on your data if you want to make queries fast. This is very true for transactional systems if you know the queries in advance. But in analytical systems, it's usually not a good idea to overindex your data. It will cause the loading phase to be very slow. It will cause the updates to be slow. And

8:32 sometimes it doesn't even help the queries. What does help the queries is to have some sort of lightweight indexing to help pruning. And this is quite similar to the micro-partitioning figure that Tibor has shown. Here we have a date column, an ID, a station, and the delay column. And for every single one of those columns, for every

8:49 single row group, we store the minimum value and the maximum value. And why is this great? This is great because if you say, I want something about the trains that ran during the summer, we can just look at the minimum and the maximum values and conclude that this is only row group two. Only row group two on the

9:08 right has trains in the summer, and then we can further prune along the column storage, and that means we can basically just read 1/8 of the data compared to reading the two row groups. DuckDB supports quite a few formats and protocols. We do CSV, Parquet, and JSON, which are the standard textual and binary formats. And we do Delta and

9:31 Iceberg, which are the lakehouse formats that are getting increasingly popular. We can read data through HTTP, HTTPS, S3, and from Azure Blob Storage. And we can also connect directly into databases and run analytical queries using DuckDB's engine while reading from the storage of these systems. And finally, we integrate quite tightly with the data science world. You can run DuckDB

9:57 queries on pandas or NumPy, or use DuckDB with

10:02 dplyr. And the nice thing about all these integrations is that DuckDB is kind of the fabric in the middle that weaves these together. And what this means is that these integrations usually work in synergy. For example, you can read something like a remote S3- or HTTPS-stored Parquet file. And the same tricks that I showed about DuckDB's native

10:23 format, meaning that it can prune along the columnar layout and the min-max indexes: this works well even if you query a Parquet file on a remote endpoint. And this is thanks to some quite obscure things in the HTTP standard. That means you can formulate your HTTP requests in a way where you have range requests, just seeking into the

10:43 file and reading given bytes from the HTTP endpoint. DuckDB also has something called friendly SQL. So DuckDB speaks a dialect based on PostgreSQL, and that's actually something that helped us a lot with integrating with other tools.

10:58 But of course Postgres is not perfect. And if you have something like this, which is a matrix of the train station differences in the Netherlands, and you would try to turn it from a wide table into a long table, which is very popular and common in data science operations, you would need to write something like UNPIVOT ... ON and then

11:18 list all the train stations, 400 or so. DuckDB has something called COLUMNS, where the COLUMNS asterisk expression is substituted with the list of columns in the table, and then you can exclude given columns from it. So this is a really neat and short way of expressing it. DuckDB also has something called CREATE OR REPLACE, which is increasingly

11:41 common in modern SQL dialects but not supported by systems like Postgres yet. CREATE OR REPLACE is neat because you can write idempotent scripts with it. You can just keep running your script and it will always replace the tables. If it crashes halfway, you don't have to throw away the temp tables. Just write over them in the next run. And

12:01 something that we announced yesterday is the DuckDB local UI. So most people so far have interacted with DuckDB using either the command line client or Python or a Python notebook. But now you can just launch DuckDB with the duckdb -ui flag, and it will run this browser-based neat user interface where you have notebooks, you

12:23 have a table analyzer, and you have your SQL editor in your browser, without doing cumbersome magic in the command line. This is of course a dbt meetup, so let's talk about DuckDB's dbt integration. This is a use case that we like to think of as a pipeline. The three major DuckDB use cases are: interactive analysis for data science;

12:47 embedded architectures, where people run DuckDB on Raspberry Pis, on phones, smartwatches, and smart TVs; and dbt falls in the middle with the pipeline components, where it's not an interactive setup. It's just DuckDB that is spun up quickly, does something, usually in in-memory mode, and then just shuts down. So, dbt-duckdb was originally developed by Josh Wills in Silicon Valley,

13:14 and he used it for a pet project and ended up using it for some work use cases too. DuckDB fits dbt really well because it's lightweight, it can be started quickly, and it is rather fast.

13:27 And when I tried to come up with an example of how I would use DuckDB with dbt, I was reminded of this use case of mine where we had large graph data sets, and I inherited a code base which generated them with Hadoop. But of course it was rather difficult to work with and very slow. So we migrated it to

13:48 Spark, and it was still rather difficult to work with and very expensive to run.

13:54 We spent tens of thousands of euros running Spark jobs. So ultimately we said, okay, this is obviously no way to live. Users will not go through the hoops of getting all these AWS quota increases and running the Spark jobs on AWS. So we said, let's just pre-generate the data. But what then happened is that we had this combinatorial problem of

14:16 let's generate all sorts of different layouts with all sorts of different date formats. And we ended up distributing eight somewhat similar but different layouts for the data. And this grew to a size where it was tens of terabytes. And it was so expensive that I had to talk to a research institute to store it on tape. So this is somewhere

14:37 in Amsterdam. And if you want to download it, you have to fetch it from tape to disk and then download it from disk, which is obviously not nice. So I have been wanting to reduce the size of this for some time. And one thing that we should have done in retrospect is obviously move from CSV and compressed CSV files to a nicer

15:01 compressed binary format like Parquet. So I did some experiments, and it turns out that plain CSV is very easy to beat with Parquet, of course, but compressed CSV, which is around 3 megabytes, is quite difficult to reach. So what I ended up doing was I used DuckDB and I changed to the new Parquet

15:20 version, which has been supported as of the latest DuckDB release, and I turned on the ZSTD compression, which then managed to get under 3 megabytes. So this is overall a 4x save. And then I created a simple dbt file, and I created an input model that just reads the Parquet, and then every single layout is just one SQL

15:43 query, which takes the input, creates an externally materialized table, and writes it into the CSV. And this is basically the whole dbt model. You can run dbt build on it. And this is something that takes less than a second.

15:58 And this is where the lightweight, dependency-free nature of DuckDB really comes into play, because obviously with these small data sets the processing time would be fast both in DuckDB and in other systems like Spark, but with Spark you have this big overhead of setting up the system, starting it, waiting for it, and so on. DuckDB does

16:20 everything combined in less than half a second. And the neat thing about the DuckDB UI, which I found out during debugging, is that you can just find the dev.duckdb file that dbt-duckdb leaves behind, connect to it with duckdb -ui, and then you can peek into the import and the export tables and just troubleshoot your queries, which

16:42 is again much nicer than working with Spark and something that is on a remote endpoint with a big delay. So what's the business model behind DuckDB? This is something that people usually are curious about. DuckDB is developed by a company called DuckDB Labs. This is where I work. It's based in Amsterdam, and we are a small team of roughly 15

17:03 people. And the interesting thing about our business model is that we are not funded by venture capital. We are plainly funded by revenue, and that revenue comes from other companies which we provide consulting and support services for, and many of those companies have venture capital. So we work with MotherDuck, we work with Databricks, and so on.

17:25 And if you follow the database world, in the last five years in particular it has been very common for companies to do a license change. MongoDB changed licenses. So did Redis, Elastic, Neo4j, and so on. And this is something that users are quite concerned with. So to alleviate this concern, we actually moved all the intellectual property of

17:45 DuckDB to a Netherlands-based foundation. This is called the DuckDB Foundation, and it owns the IP of the project. So we cannot relicense it, because it's not part of the DuckDB Labs organization; it's in the foundation. Even if Labs gets acquired by a big tech player, it stays there and it stays MIT licensed forever. The foundation is

18:07 funded by donations from contributors: companies like MotherDuck; Posit, who develop RStudio and Positron; and Voltron Data, who are very active in the Arrow ecosystem. And speaking of MotherDuck: MotherDuck is our close partner company, a venture-capital-funded American company based in Seattle. MotherDuck does this interesting trick. We said, okay, let's not do

18:32 the client-server architecture, because it has all these inherent problems, and DuckDB is this simple embedded system.

18:39 MotherDuck takes that system and says, well, actually those things are rather useful if you want to collaborate with people, if you want to work in the cloud, if you want to share your data, and it builds back many of those components in the cloud. So MotherDuck is a DuckDB-based cloud data warehouse with quite a

18:54 few cool ideas, like hybrid query execution, where one part of the query runs locally, maybe having access to your sensitive personal health data, and the other part of the query runs in the cloud and has access to some open data set. So to sum up, I like to think about DuckDB as a kind of good old-school

19:14 system. DuckDB would feel at home in a research institute on an old CentOS machine 20 years ago, because it is open source. It runs from the command line. It is built on open standards. And if you open the DuckDB website, we go to great lengths to make sure you have a smooth experience. We don't have cookies. We don't have telemetry. You can download

19:35 our documentation and it works offline. And if you think about this spectrum of data processing systems, we like to say that DuckDB is positioned around the middle. Obviously, for small data sets, Excel and pandas are great.

19:49 And for large data sets, Spark is also great. But with DuckDB being increasingly popular and having more and more integrations, you can now read Excel files with DuckDB. You can read pandas with DuckDB. And with DuckDB being more and more scalable, I think DuckDB is pushing what's doable on a single machine and leaves Spark only for the 10

20:11 terabyte-plus big jobs. And DuckDB integrates really well with dbt. Thank you. And if you're interested, follow our socials, and I'm happy to answer any [Applause]

20:27questions. Q&A questions. Question one, DD runs on CPU. any plans on supporting GPU based calculation. So this comes up quite frequently because you have pretty amazing GPUs in computers mostly thanks to the AI boom but this would hinder the portability of duct DB. So we have thought about this and there are currently no plans of doing GPU

20:52calculations um because of the portability issue and also because GPUs are only really good at fixedsiz data. As soon as you eat something like strings, it starts to become more and more cumbersome to program them and then the benefits kind of start vanishing. Second question, full delta support read write Unity catalog when when someone steps up and funds it.

21:16That's uh part of the road map, but our road map has a big part that says open for funding and this is full delta support is one of them. Your new local UI is awesome. Do you plan to open source it? will add the SQL formatter for lazy people. Um, so the UI is shipped as a duck DB

21:38extension and that's open source. The component that runs in the browser, the JavaScript that's not open source obviously you get the source but it's an offiscusated piece of JavaScript source.

21:51I talked to the developers yesterday and we are considering open sourcing it. It's going to be a bit difficult to do that but it is under consideration. Will we add the SQL formatter? I have seen a SQL formatter a couple of days ago. I think it's in the awesome Doug DB repository. Uh and someone also developed a formatter as a Doug DB

22:14community extension where I just write a query and then it prints you out the nicely formatted one but this is not part of the UI yet.

The last question: I read some interesting news lately about distributed DuckDB via smallpond; do you think it will get traction? It got a lot of attention because it was built by DeepSeek, the Chinese company that became famous for its R1 model [22:22]. It is a bit early to tell whether it will get traction. They built the stack on their own distributed file system, with their own distributed orchestrator for DuckDB that uses Ray internally [22:39]. Judging from the comments, people are a bit skeptical about it becoming an industry standard. Even if their distributed file system has some inherent benefits, you would not necessarily move over from something like AWS S3, because S3 is the industry standard and this is a shiny new idea [22:57]. So I think this is something hackers will experiment with; it is not necessarily something you should start using in production just yet. But it is very interesting, and if you program it meticulously and make sure smallpond can execute your queries, it could alleviate many of the scalability concerns people raise: the common advice is that with a petabyte of data you should not use DuckDB alone, but with smallpond you may actually be able to make such queries finish [23:18]. Any more questions? Seeing none, I will hand the microphone back to Alan [23:40].

Thanks! [Applause] [23:58]

FAQs

How does DuckDB achieve sub-second dbt build execution times?

DuckDB achieves sub-second dbt builds because of its lightweight, dependency-free, in-process architecture. Unlike Spark, which requires significant overhead to start up a cluster, DuckDB spins up instantly and processes data in-memory. In the demonstrated example, a complete dbt build that converts data across multiple layouts ran in under half a second. Compare that to Spark jobs that cost tens of thousands of euros and required complex AWS setup.
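As a minimal sketch of how little setup this takes, a dbt-duckdb profile needs scarcely more than a path to a local database file (the project and file names below are illustrative, not from the talk):

```yaml
# profiles.yml -- a minimal dbt-duckdb profile (illustrative names)
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.db   # local DuckDB database file, created on first run
      threads: 4
```

With this in place, `dbt build` runs entirely in-process; there is no cluster, server, or credential setup to wait on.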

What is DuckDB's vectorized execution engine and why is it fast?

DuckDB uses vectorized execution, processing data in chunks of 2,048 values called vectors rather than row-at-a-time or column-at-a-time. These vectors are sized to fit within the CPU's L1 cache, making operations extremely fast. DuckDB also uses tight loops and branchless code that modern compilers can auto-vectorize into SIMD instructions, avoiding the hand-coded assembly that other databases rely on. This is why DuckDB can run TPC-H benchmark queries on a laptop against 3 terabytes of data.

How does DuckDB compare to Spark for data pipeline workloads?

For small to medium data pipeline workloads, DuckDB is significantly faster and simpler than Spark. While both can process small datasets quickly, Spark has substantial overhead from cluster setup, JVM startup, and distributed coordination. DuckDB's in-process architecture eliminates this overhead entirely. DuckDB also provides a much nicer debugging experience for data scientists and engineers. You can connect the DuckDB local UI directly to the dev.db file that dbt-duckdb creates.

What file formats and data sources does DuckDB support?

DuckDB supports CSV, Parquet, and JSON as standard textual and binary formats, plus Delta Lake and Apache Iceberg as lakehouse formats. It can read data over HTTP, HTTPS, S3, and Azure Blob Storage. DuckDB also connects directly to databases like Postgres and MySQL, running analytical queries with DuckDB's engine while reading from their storage. It integrates tightly with pandas DataFrames, NumPy arrays, and Apache Arrow tables.
