Video of a conference talk highlighting pg_duckdb, a new open-source Postgres extension that makes it easy and fast to run analytics queries straight from within Postgres. You can often use pg_duckdb to speed up your slowest Postgres queries significantly without having to make any changes to your query, data or schema! And if you want to speed up your queries even more, that's possible too. pg_duckdb makes it very easy to store some of your data remotely in columnar format in S3, Azure blob storage, MotherDuck and more.
In this talk you'll learn:
- How you can use pg_duckdb to speed up your Postgres queries
- How pg_duckdb can be so much faster than Postgres for analytics queries
- How you can utilize remote columnar data storage to make your queries even faster
Jelte Fennema-Nio currently works at MotherDuck, a DuckDB-as-a-Service company. He spends most of his time there working on pg_duckdb. Before that, Jelte worked for 5 years at Microsoft on Citus, Postgres and Pgbouncer, and is still one of the maintainers of the PgBouncer project. He studied at the University of Amsterdam where he got his BSc in Computer Science and MSc in System and Network Engineering.
Transcript
0:08 Hi, I'm Jelte, and I work at MotherDuck. I'll talk about how to make analytics queries in Postgres faster and easier to write,
0:20 and how it makes ducking awesome analytics in Postgres possible, with pg_duckdb.
0:32 And well, pg_duckdb is a Postgres extension
0:36 that embeds DuckDB inside Postgres.
0:40 Let me explain what that means.
0:47 So first, let's take a bit of a step back. Postgres
0:51 is a pretty amazing database. It's open source, it has many contributors, and there are many other good things about it.
1:01 One of those is that every release gets extra functionality.
1:08 And then it's very extensible: you can install extensions that add functionality.
1:16 Postgres is great at transactional workloads, which means it can very easily
1:21 find a specific user in the database, or show your current shopping basket.
1:30 You can do all that from the database,
1:34 and it can do that very fast, very efficiently. But there's another type of workload: analytics, or sometimes called OLAP.
1:43 What that means is that, to answer a query, instead of finding one specific row, like the right user or a shopping basket,
1:56 you scan all orders to find out, say,
2:00 the 10 most sold products from the last month,
2:04 or from the last 10 months, and display those,
2:09 say, in a nice graph in some dashboard.
2:14 Postgres is usually not great at that. It feels, yeah, like this elephant under the books:
2:22 it doesn't like it.
2:27 So why is it not great at OLAP? One reason is that there's no columnar storage in Postgres.
2:34 Postgres stores rows of data on the actual disk.
2:39 So every row is one specific item, and all the data for that row is stored together. That's great if you want to show a user profile.
2:51 But for analytics, that means that maybe you don't
2:57 need most of the columns, while you do want to see all the rows.
3:04 You maybe only care about like 3 columns, but you read everything for all the rows, maybe a hundred columns, because that's how many columns the table has.
3:18 So the read is much bigger than it actually needs to be, and you throw away the data that you don't care about, right when you read it.
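To make the row-versus-column difference concrete, here is a toy sketch in Python (the table and numbers are invented, not from the talk). With row layout, every record keeps all of its fields together, so reading one column means loading whole records; with columnar layout, a SUM over one column touches exactly one array.

```python
# Toy illustration of row-oriented vs column-oriented storage.
# (Invented data; this is not how Postgres or DuckDB store bytes internally.)

# Row-oriented: each record keeps all of its columns together,
# so fetching one field still means loading the whole record from disk.
rows = [
    {"id": 1, "name": "alice", "country": "NL", "order_total": 30},
    {"id": 2, "name": "bob",   "country": "DE", "order_total": 50},
    {"id": 3, "name": "carol", "country": "NL", "order_total": 20},
]

# Column-oriented: each column is its own contiguous array.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "country": ["NL", "DE", "NL"],
    "order_total": [30, 50, 20],
}

# An analytics query like SELECT SUM(order_total):
total_from_rows = sum(r["order_total"] for r in rows)   # walks every record
total_from_columns = sum(columns["order_total"])        # reads one array

assert total_from_rows == total_from_columns
print(total_from_rows)
```

Both layouts give the same answer; the difference is how much data has to be read to get there.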
3:28 The second reason is parallel processing.
3:32 These days computers have lots of cores,
3:37 so using all of them is very important for making things go fast.
3:42 Postgres is not really designed with that in mind. It does have some parallelism, but it's sort of bolted on a bit.
3:49 It wasn't built from the start with parallel processing in mind.
3:54 So not all, like, many of the Postgres
3:58 internals don't work in a parallel way.
4:03 They're very sequential. Only specific things are made parallel in Postgres.
4:12 And this is where DuckDB can come to the rescue.
4:17 Let me tell you about the engine that is underneath. DuckDB is an
4:22 in-process SQL analytics engine.
4:28 It does have columnar storage on disk, and it's built with multi-threading in mind. It's one of its core design principles: lots of things need to be parallel.
4:40 And it tries to make stuff go fast in parallel. So what does this in-process stuff mean?
4:48 Well, for comparison, we have Postgres and SQLite. Postgres is
4:53 a transactional database, like I explained before, and it runs as a server,
4:58 which means that a client connects to it,
5:03 usually over the network.
5:07 And that makes it very nice for certain applications, where you have many clients all connected to the same database.
5:19 But SQLite is a different kind of database. It doesn't run as a server; instead, it is something that you load inside
5:28 a process. So, for instance, your web browser uses it for some things that the web browser needs to store.
5:35 And then you can use that without going over the network. So it's a database that is well-suited for transactional workloads,
5:46 without having a dedicated server.
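What "in-process" means is easy to show with SQLite itself, since it ships with Python's standard library. The database engine below runs inside the Python process: there's no server and no network round-trip, just function calls. (The table and data are made up for illustration.)

```python
import sqlite3  # SQLite: an in-process database bundled with Python

# The whole database engine lives inside this process; opening and
# querying it are ordinary function calls, no server, no network.
con = sqlite3.connect(":memory:")  # throwaway in-memory database

con.execute("CREATE TABLE basket (item TEXT, qty INTEGER)")
con.executemany(
    "INSERT INTO basket VALUES (?, ?)",
    [("duck plush", 2), ("postgres book", 1)],
)

total = con.execute("SELECT SUM(qty) FROM basket").fetchone()[0]
print(total)
```

DuckDB, discussed next, embeds into a host process in exactly the same way, but with an analytical engine instead of a transactional one.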
5:51 So it's sort of two different use cases. And DuckDB is another type of database.
5:59 It's similar to SQLite in that it's in-process, so you can load it in Python,
6:07 or it runs inside your browser in Wasm:
6:12 you can compile it to a WebAssembly binary and execute it inside your browser.
6:20 But, like I said, it's an analytical database,
6:26 and it's very fast at that. So, Postgres and DuckDB are
6:34 good at different things: Postgres being good at transactional workloads, DuckDB being good at analytical workloads.
6:40 And that's sort of the idea: we're trying to make them one.
6:47 DuckDB is in-process, so maybe we can put it inside the Postgres process.
6:53 Postgres is a process, so we can put DuckDB in there. And then everything will be faster, and we get a database that is both transactional and analytical.
7:09 And that's sort of what pg_duckdb tries to be.
7:16 So what does that look like? Like, how do you put DuckDB inside Postgres?
7:23 And what does that mean? Well, I kind of want it to be like this:
7:28 a duck and an elephant next to a sunset, drinking a nice cold beer,
7:35 and just being general friends and happy together.
7:41 But that's not entirely what it always looks like. Sometimes it's more like this sort of monster picture,
7:50 this Frankenstein-y mess of things.
7:57 And now, after a lot of work, it's a lot better, but this is sort of how it started, for sure.
8:06 And the reason is that Postgres and DuckDB, they're very different. They're like ducks and elephants, basically.
8:13 It's not just that one is good at transactional workloads
8:17 and the other is good at analytical workloads. They also differ a lot in how they're actually designed
8:27 to accommodate those different workloads. Partly that's because Postgres is much older than DuckDB. Postgres is from the previous millennium,
8:35 and DuckDB is from a few years ago.
8:39 A project differs a lot in how it's designed based on
8:46 when it got started, and there's a lot in Postgres stemming from those early days.
8:55 And one of those things is that Postgres is written in C and DuckDB is written in C++.
9:03 And so there's a lot of mismatch in how things are done. To make development a bit easier, the C++ code needs to interoperate
9:16 well with the Postgres C code.
9:21 You have to make some translations here and there,
9:26 but it's doable...
9:31 C++ is designed to work relatively well with plain C.
9:39 It's at least sort of similar. But one of the main problems is the way exceptions work.
9:48 Exceptions just exist in C++. It's a built-in thing. You can throw an exception where an unexpected error happens.
9:58 For instance, on a divide by zero, you throw an exception saying, oh, there's an error.
10:05 And C, which Postgres is written in, doesn't provide any way of doing that by default.
10:13 So Postgres implements its own exceptions with its own try and catch.
10:22 C++ has try and catch, but Postgres has PG_TRY, PG_CATCH. They're basically the same, but also very different.
10:30 The problem is that the different try-catch blocks
10:36 don't understand each other's exceptions. If C++ throws an exception and you have a PG_TRY, PG_CATCH block around that,
10:44 those handlers will not get executed. And the cleanup that might happen in the
10:53 PG_CATCH handler is not executed, so you might leak some memory. And the same is true the other way around:
10:59 if Postgres throws an exception, C++ cannot catch it with its normal try-catch handler.
11:09 And a lot of cleanup in C++ relies on destructors, even locking cleanup:
11:16 when an exception is thrown, a destructor runs
11:23 and releases a lock, stuff like that.
11:29 So it's tricky, but we manage.
11:34 It is a very hard problem to
11:38 make work well. The second big difference is that DuckDB uses threads.
11:45 DuckDB uses a ton of threads, and Postgres doesn't use any of them, basically. And the Postgres code is also not thread-safe. It assumes,
11:55 oh, there's only one thread in this process. That's the one that's doing everything. So I can change global variables,
12:04 I can do anything that I want.
12:09 And so, yeah, that's
12:13 problematic, because there are all these DuckDB threads that might want to call a Postgres function, but that Postgres function
12:23 cannot be called in a thread-safe way. So we had to build things around how to handle that.
12:31 And similarly, the other way around: DuckDB doesn't expect to be opened, for instance, from multiple processes.
12:39 It needs to open its files from only one process. And Postgres runs a process per connection, which makes that problematic:
12:47 you get 2 DuckDB instances in 2 processes that maybe want to open the same file. So we handled all of that as best as we could.
12:59 And that paid off. Now the duck and the elephant can sit together in the sunset that you saw in the picture before.
13:10 So how does it work? That's sort of an important question: how does it actually work?
13:21 First, let's look at how Postgres works. When a query comes in from a client over the network,
13:31 the first thing Postgres will do is give it to the parser. The parser takes this string of characters
13:37 and turns it into what's called an abstract syntax tree,
13:42 which is a representation of what this query actually means: it's a SELECT, and it selects these columns from this table.
13:50 It's much easier to work with that tree of objects and structures
13:54 than with a plain string.
13:59 And then, once it has that, the tree goes to the planner, and the planner will then decide on the best way of executing this. The planner will give the plan that it's made to the executor,
14:11 and that will actually do the work: it will do the computations, it will do the GROUP BY, stuff like that.
14:22 And that's how it works in Postgres. So, how does it work in pg_duckdb? What pg_duckdb does is it basically steals the query.
14:29 And it doesn't do this always. There's sort of a check: is this a query that pg_duckdb should care about?
14:36 And if it is, then it will just take over
14:40 once the query is in this sort of meta query structure.
14:45 It will not let the Postgres planner do anything. It will just take the parsed query and give it to the DuckDB planner, and then the DuckDB executor will execute that.
15:00 And an important thing is that it can still read Postgres data. So, pg_duckdb, it can still read the data that's in Postgres.
15:13 So, that's sort of a small but important detail.
15:19 What can this thing do? There are three main use cases of using pg_duckdb.
15:29 I'll list them now and go into a little bit more detail after. The first one is running the DuckDB engine on Postgres tables.
15:37 And that's sort of it. You don't have to change where your data lives. You can just use pg_duckdb on the data that you already have.
15:48 The second one is that you can read and write data in Blob Storage.
15:52 There's support for Azure Blob Storage and S3, and
15:57 for formats like JSON, Parquet, Iceberg.
16:02 Lots of different ones. So you can read and write to Blob Storage.
16:08 And the third one is that you can offload analytics to MotherDuck.
16:14 So, let's start with the first one: the DuckDB engine running on Postgres tables. So, how does that work?
16:24 Well, it's extremely simple. There's a setting, duckdb.force_execution, and you set it to true.
16:30 And then your queries will use the DuckDB planner and the DuckDB plan.
16:35 It would be nice if I could say it's always faster, but it depends on your query.
16:43 It depends on your data. It depends on a lot of things.
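Concretely, flipping the setting looks something like this (`duckdb.force_execution` is the pg_duckdb setting named in the talk; the table and query are made-up examples, not from the slides):

```sql
-- Route this session's queries through DuckDB's planner and executor.
SET duckdb.force_execution = true;

-- An aggregation-heavy query over a hypothetical orders table:
-- the kind of scan-and-group workload where DuckDB tends to shine.
SELECT product_id, count(*) AS times_sold
FROM orders
WHERE ordered_at > now() - interval '1 month'
GROUP BY product_id
ORDER BY times_sold DESC
LIMIT 10;

-- Turn it back off for queries that run better on the Postgres executor.
SET duckdb.force_execution = false;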
16:49 But, sometimes, yes, it's much faster. For example, there's this ClickBench benchmark. That's a public benchmark created by ClickHouse.
16:57 It runs analytics queries on a single table, mainly aggregation. It's very aggregation-heavy.
17:05 And there, for some of the queries,
17:10 you get huge speedups if you just set this new setting to true.
17:17 And so, that's pretty impressive. Some queries also get slower. So, it's not like a silver bullet.
17:24 But it is sort of worth trying out. I noticed that myself while trying some benchmarks. I also ran TPC-DS, which is another common analytics benchmark.
17:34 I ran it on plain Postgres tables and no indexes. But here, I wanted to see
17:41 how it would work without them. And for analytics, that's fairly normal, because
17:48 the indexes you have are usually for your transactional workloads, not for your analytics queries.
17:58 So, the first query in TPC-DS just kept running on plain Postgres,
18:03 and I just gave up.
18:08 And then, I set this setting to true, and I ran it again. And now, it's done in 450 milliseconds.
18:15 You press enter and pretty much it's finished. That's one of the best query optimizations I ever did: change one setting and everything is much better. As I said, this is an extreme example.
18:28 This is not something you can normally expect, because maybe you get 2x,
18:38 maybe you get a 10% performance increase from
18:43 changing a single setting. And then we get to the second use case, which is using Blob Storage.
18:52 With pg_duckdb, we can just read from Blob Storage.
18:56 You do a SELECT and then read_parquet with a URL, and it will just return the results,
19:06 like return the first 5 rows in this case.
19:10 And it's very easy, because you can just do, like, a normal query,
19:15 and then you get a table full of the parquet data.
19:23 You can also do aggregations and ORDER BY on this parquet data.
19:29 One nice thing is that parquet is actually stored in columnar format.
19:35 And so DuckDB knows that, and it will not fetch everything in this parquet file to your Postgres server.
19:44 It will just fetch the columns that you need. Here we just need title, days in top 10, and type.
19:49 So it will only fetch the data for those 3 columns and leave all the rest, maybe like a hundred different columns, and not touch it.
19:57 So the query will be faster and the bandwidth used will be less.
20:02 One thing to note is this syntax, like the r['stuff'] that you see.
20:08 That's because read_parquet returns sort of arbitrary columns: what columns it's going to return is based on what's actually in the file.
20:19 And Postgres doesn't really support having this sort of dynamic column stuff.
20:25 So instead, what read_parquet does is return a row object that you can index, just like you could index a JSON column.
20:33 So you index on the columns of the row, basically.
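Putting the pieces together, a read_parquet query looks roughly like this. The bucket path and exact column names are illustrative (the talk's example used columns like title, days in top 10, and type); `r` is the row object being indexed:

```sql
-- Read a Parquet file in S3 straight from Postgres.
-- Only the referenced columns are fetched, not the whole file.
SELECT r['title'], r['days_in_top_10'], r['type']
FROM read_parquet('s3://my-bucket/top10.parquet') r
LIMIT 5;

-- Aggregations and ORDER BY work on the same data.
SELECT r['type'], count(*) AS titles
FROM read_parquet('s3://my-bucket/top10.parquet') r
GROUP BY r['type']
ORDER BY titles DESC;
```

The `r['column']` indexing is what lets one function expose whichever columns the file happens to contain, much like indexing into a JSON value.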
20:40 Then, before we go to the last thing, we have to talk a bit about resources.
20:46 This is what a Postgres server normally looks like:
20:51 it's busy with some transactional queries that take up a few resources. But if you run analytics on it, like with pg_duckdb,
21:04 well, with the two examples I gave before of using it, those queries take a lot more resources, because
21:12 it's going to use more CPU, and in general it will use a lot more resources of the machine.
21:18 So you cannot run as many transactional queries concurrently as you could previously.
21:23 So that's where MotherDuck comes in. This is the company I work for.
21:29 With MotherDuck, you can offload this computation to the MotherDuck cloud and just get the results on the Postgres server.
21:42 Then the resources needed for the analytics queries on Postgres are much smaller. So you keep the resources of your Postgres instance, and you can still combine it
21:50 with the transactional data that you might have. All the heavy computation is happening in the MotherDuck cloud, so that makes it much easier to scale
22:01 and stuff like that. To show what that looks like: you copy data into MotherDuck.
22:10 You do CREATE TABLE and then USING duckdb.
22:16 And that's sort of the key part. Then you insert some data from a Postgres table,
22:22 and then you have the data in MotherDuck. So there's now a table in MotherDuck with the same data as in the hacker_news table
22:29 that was originally in Postgres. And then you can just query that like normal: you can do whatever you want,
22:38 and it will sort of work transparently without you noticing,
22:46 except that it's a lot faster, suddenly.
22:51 You can also join it with some data inside Postgres. So you can have, say, recent data that you want to store in Postgres because you update it,
23:02 but the archive from before, that's still in MotherDuck. And you can join them and sort of get a shared result, coming partly from Postgres and partly from MotherDuck.
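The flow described above can be sketched like this. The hacker_news table name comes from the talk; the column list and the archive/recent split are illustrative, and the MotherDuck connection setup itself is not shown:

```sql
-- Create a DuckDB-backed table; with MotherDuck enabled, it lives in
-- the MotherDuck cloud. USING duckdb is the key part.
CREATE TABLE hacker_news_archive (
    id     bigint,
    title  text,
    author text,
    score  int
) USING duckdb;

-- Copy data over from the ordinary Postgres table.
INSERT INTO hacker_news_archive
    SELECT id, title, author, score FROM hacker_news;

-- Query it like normal; the heavy computation runs in MotherDuck.
SELECT author, count(*) AS stories
FROM hacker_news_archive
GROUP BY author
ORDER BY stories DESC
LIMIT 10;

-- A join can mix the MotherDuck archive with a local Postgres table,
-- e.g. recent rows kept in Postgres because they are still updated.
SELECT a.title, r.comment_count
FROM hacker_news_archive a
JOIN hacker_news_recent r ON r.id = a.id;
```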
23:18 But is that actually fast? And, well, yes. For analytics, it is really fast. You can see that in the ClickBench benchmark:
23:35 there's a big difference between MotherDuck and pg_duckdb compared to just regular Postgres,
23:40 even compared with Postgres with indexes in this case.
23:45 So yeah, that's it. It's MIT-licensed. It's open source.
23:50 You can build it from source or install a Docker image.
23:55 Feedback is very welcome, and I'm looking forward to hearing from any of you who try it out.
FAQs
What is pg_duckdb and how does it make Postgres analytics faster?
pg_duckdb is an open-source, MIT-licensed Postgres extension that embeds the DuckDB analytical engine directly inside Postgres. When duckdb.force_execution is set to true, analytical queries are intercepted and executed by DuckDB's columnar engine instead of Postgres's row-based executor. On the ClickBench benchmark, some queries see up to 1000x speedups. One TPC-DS query that never completed on Postgres finished in 450 milliseconds with pg_duckdb.
How does pg_duckdb read Parquet files and external data from Postgres?
With pg_duckdb installed, you can use DuckDB's read_parquet function directly in Postgres SQL to query Parquet files from S3 or Azure Blob Storage. DuckDB fetches only the columns needed for your query rather than downloading entire files, which cuts bandwidth. You can also read JSON, Iceberg tables, and other formats supported by DuckDB's extensions, all from within the familiar Postgres interface.
Why is pg_duckdb needed when Postgres already supports parallel queries?
While Postgres has added some parallel processing, it was not built with parallelism as a core principle. It is bolted on and only works for specific operations. Postgres also uses row-oriented storage, which forces analytical queries to scan entire rows even when only a few columns are needed. DuckDB was built from the ground up with columnar storage and multi-threaded execution, making it fundamentally more efficient for analytical workloads.
How can you offload Postgres analytics to MotherDuck using pg_duckdb?
pg_duckdb lets you copy Postgres data into MotherDuck using CREATE TABLE ... USING duckdb statements, moving heavy analytical compute to MotherDuck's cloud. This keeps your Postgres instance focused on transactional workloads while analytics run on dedicated resources. You can even join data across both systems, combining data stored in Postgres with data in MotherDuck, to get shared results transparently in a single query.