What can Postgres learn from DuckDB? (PGConf.dev 2025)
2025/06/13
Presented by Jelte Fennema-Nio at PGConf.dev 2025 (https://2025.pgconf.dev)
DuckDB is an open source SQL analytics engine that is quickly growing in popularity. There are three main reasons for this:
- It is very fast at analytics workloads
- It is extremely extensible
- It has made some nice improvements to its SQL syntax

This raises the question: what can Postgres learn from DuckDB? In this talk I'll discuss how Postgres could learn from some of the things that DuckDB did differently.
Transcript
0:16 I'm Jelte. I'm a Postgres contributor, a maintainer of PgBouncer and of pg_duckdb, and I maintain pg_duckdb while being employed by MotherDuck, there at the top. So I know a bunch of things about DuckDB these days.
0:31 So I'll tell you what Postgres can learn. First: what is DuckDB?
0:37 Because this is a Postgres conference, I can imagine many of you don't actually know it. It's a lightweight in-process SQL analytics engine, which means it's similar
0:50 to SQLite in that it runs in-process,
0:54 except that it is an analytical database instead of a transactional database.
1:08 It was created by these two guys at the CWI, a research center in the Netherlands. They started a company to maintain it, DuckDB Labs, and they've been doing that ever since. I work for MotherDuck, which is another DuckDB company: we host DuckDB in the cloud and we pay for consultancy
1:29 from DuckDB Labs' consultancy business. So DuckDB has become very popular very quickly. You can see this nice graph going up, and it only seems to go further up. It shows GitHub stars, and DuckDB has actually surpassed Postgres in stars, so it's clearly more popular than Postgres already.
1:52 So, I mean, it has to be good. Why is it so popular? Because it's the Swiss Army knife of data.
2:01 That's at least what it's trying to be. And what does that mean? It means every data scientist can use it. One benefit is that it is very good at parsing CSVs and figuring out the types of the columns, and data scientists always have CSVs lying around,
2:21 sent to them through email and sitting in weird Google Drives. So parsing CSVs is very important.
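That CSV sniffing is a one-liner (a minimal sketch; 'ratings.csv' is a hypothetical file name):

    -- DuckDB detects the delimiter, header row, and column types automatically.
    SELECT * FROM read_csv('ratings.csv');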
2:36 Secondly, it's very easy to install. Postgres can be quite hard to install if you don't know much about it, but with DuckDB you just ask your package manager for duckdb and it's done. Or you copy-paste a curl command into a shell, and then you have a DuckDB CLI that you can run, and suddenly you have a SQL interface.
2:51 But a SQL interface to what? A SQL interface to basically anything. That's where the power of DuckDB for data science and data analysis comes from. It has its own database file, a DuckDB file, just like SQLite has a SQLite file, but it can also read CSVs,
3:12 JSON, and Parquet. It can read from and write to pandas, Polars, and all kinds of data frame libraries, and Arrow is also something people use it with. And it's not just good at reading and writing all these formats; it can read and write them from everywhere:
3:36 from S3, over HTTP. You can connect to Postgres, connect to MySQL, and just get all the tables and run queries that join CSV files with some Postgres table, for instance.
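For example, something like this works (the paths here are made up; reading from S3 and HTTP needs the httpfs extension loaded):

    -- Join a Parquet file on S3 with a CSV served over HTTP.
    -- Run INSTALL httpfs; LOAD httpfs; first.
    SELECT p.customer_id, p.amount, c.country
    FROM 's3://my-bucket/payments.parquet' AS p
    JOIN 'https://example.com/customers.csv' AS c
        USING (customer_id);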
3:51 So then the next question is: is it actually fast? Because if you can join all these things together, fun, but if it's not fast, why would you do it? And basically the answer is yes, it is really quite fast. These are a few of the
4:06 results from the ClickBench benchmark, a benchmark from ClickHouse. It's a fairly simple analytical benchmark in that it only has one table, and you just run aggregations and filters on that single table; there's not a single join in any of the queries. DuckDB is actually quite good at joins, but apparently it's also
4:27 very good at scanning a single table, because it's faster than ClickHouse at its own benchmark. It's also faster than Snowflake on a much larger instance, and it's a lot faster than Spark.
4:39 And you might be wondering why Postgres is missing from this picture; that's because then all the bars would basically disappear.
4:46 I have a theory that that's mainly because the dataset just doesn't fit in RAM for Postgres, so you're essentially benchmarking how fast AWS's gp2 disk can read all the data for every query. And I have a theory that maybe ClickHouse
5:08 chose that dataset size and that instance type precisely so it wouldn't fit, but no one ever knows.
5:15 So now we come to the real question: what can Postgres learn from DuckDB, now that you have some introduction into what DuckDB even is? There are basically three things: ease of use, because DuckDB is very, very easy to use; performance; and extensibility. We'll go over those three points step by step. So
5:40 let's start with ease of use. DuckDB made what they call friendly SQL: SQL that anyone can write, not just SQL experts. And that's quite useful.
5:53 So, I'll give a quick demo, which hopefully works. I think it does. Let's see.
6:02 Let's go. Yes. Let me make this a bit smaller.
6:19 All right. So,
6:24 this demo is using a feature that MotherDuck added, because it makes the demo really nice and smooth.
6:30 It's called Instant SQL, and you'll see how it works. You can basically type FROM and then it will automatically
6:39 run everything. So we're going to look at this query, at this dataset.
6:43 It's an S3 file, a Parquet file. And you just type FROM, paste the path, and it works: you suddenly see the results. I don't even have to press run.
6:53 I don't have to do anything; basically anything I type is automatically executed, and you see the results. So as you can see, FROM without any SELECT just works, and you don't have to specify types; it figures out where to get this data and does all of that magically behind the scenes. So let's select a few
7:11 columns. Let's start with the title and the rank, and let's see the view...
7:27 Oh, I forgot a comma. That's why it doesn't autocomplete. It does autocomplete with viewership_score, and it autocompletes because it already knows what's there: the FROM is there, so it has figured out what the columns are. So let's also add the
7:44 type. Fine. So now we have a bunch of things. Okay, let's order this by rank, just to see how that goes.
7:55 ORDER BY rank. Okay, but it's all ones. Oh, right: it's the Netflix daily top 10.
8:02 All right, so we have to do something else. Let's count the number of
8:09 times each title was number one.
8:13 We'll do a sum of rank equals
8:21 one, and call that the number-one count. All right. So now we need to do something with the title, because
8:29 suddenly it's a grouped query. So we do GROUP BY title. Let me make this a bit bigger again so
8:39 the whole query fits. For the viewership score you can do average, and you can call avg as a method instead of as a normal function, just the way people who don't write SQL every day expect: these basic conveniences exist in basically every language except SQL. So if you
8:58 come from Python or JavaScript, SQL can be quite daunting because it's missing so many of these nice things.
9:06 So let's remove the type column, and as you can see it already works even though I didn't remove this last comma: DuckDB doesn't care about the trailing comma,
9:16 so I can just remove columns at the end without having to worry about it. Also, this is a capital T and this is a lowercase t,
9:25 so it matches column names case-insensitively,
9:29 which makes sense. But you can see it's all zeros at the top, so let's add DESC, and now this looks all right.
9:41 But this is all on-the-fly execution, which does a bit of sampling, so you have to run the final thing to get the real result. This is very big; let's close that. There's the real result. And apparently Cobra Kai was the most
9:59 frequent number one on Netflix for whatever period this was. All right, so that was the demo.
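The final query looked roughly like this (a hedged reconstruction; the S3 path and exact column names are assumptions):

    -- FROM-first syntax, a trailing comma in the SELECT list, and
    -- avg() called with dot syntax (function chaining) all work.
    FROM 's3://example-bucket/netflix_daily_top_10.parquet'
    SELECT
        title,
        viewership_score.avg() AS avg_viewership,
        count(*) FILTER (WHERE "rank" = 1) AS times_number_one,
    GROUP BY title
    ORDER BY times_number_one DESC;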
10:13 Let's go back to the slides. So yeah, DuckDB has syntax that just makes sense to people who don't use SQL every day. And I don't even use SQL every day; I mean, I might use it, but I'm not a SQL expert.
10:28 I'm a programming expert, or hopefully I am. But I'm not
10:35 a data scientist constantly writing SQL queries. So every time I write Postgres queries I get lots of errors, all the time. With DuckDB that happens a lot less.
10:46 The other thing DuckDB does well is generic types. It has a MAP type and a UNION type, and you actually get the real types back. There's also a STRUCT. In Postgres we have composite types, which are similar, but you always have to create them and they have
11:04 to have a name, while these can be anonymous: a function can return a new kind of STRUCT on the fly. Arrays are also sensible. If you've ever looked closely at arrays in Postgres, you know how weird and strange they are: you can
11:21 tell a table that it contains two-dimensional arrays, but you can put a one-dimensional or a five-dimensional array in it and everything is totally fine until you start to read it. In DuckDB you can either specify
11:35 the actual dimensions, and then it will ensure that that's the case, or leave them unspecified and put a list of any length in there, and you can do multi-dimensional stuff like this.
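A minimal sketch of those types (the table and column names are illustrative):

    CREATE TABLE features (
        tags    MAP(VARCHAR, INTEGER),            -- generic map type
        val     UNION(num INTEGER, str VARCHAR),  -- tagged union type
        point   STRUCT(x DOUBLE, y DOUBLE),       -- anonymous struct, no CREATE TYPE needed
        samples DOUBLE[],                         -- list: any length allowed
        matrix  DOUBLE[2][3]                      -- fixed-size array: dimensions are enforced
    );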
11:49 So what about performance? What can Postgres learn performance-wise? The most obvious thing is that Postgres uses row-based storage.
11:59 For analytics that's generally not good. It's good for transactional workloads, but for analytical workloads you're scanning tons of data, and if you only need a few columns, you're still reading all the columns you're not using, so you're doing a lot more I/O than necessary. Instead, DuckDB stores data in what it calls a row
12:16 group: it takes about 120k rows, groups them together, and within the group stores the data in columnar format. So pickup_at and dropoff_at values, say, are stored together with the other values of their column, in the same chunk of data in the same row group.
12:39 That way you can do some nice things like compression. Instead of using general-purpose compression like gzip or zstd, DuckDB uses lightweight compression, which is basically a set of simple compression methods. For example: are all the values of this column in these 120k rows null? All right,
13:05 then we just store one null instead of 120k of them. And if
13:13 they're not all the same, you can do run-length encoding: you say, first there are three W's, then two A's, then four X's, then a B. You compact all that data together and store less.
13:29 And the nice thing about these kinds of methods, compared to heavy ones like zstd, is that they're much cheaper to decompress.
13:43 There's also dictionary encoding, which is similar to run-length encoding except the values don't have to be in the same order:
13:50 you basically assign enum-like codes to the actual strings, for instance when the same strings occur many times. And there are lots of other fancy, smart lightweight compression techniques. One of them is FSST, the Fast Static Symbol Table, where you build strings out of common substrings. If you
14:12 have email addresses, lots of them end in @gmail.com, so you store @gmail.com once for your 120k rows and then append the unique bits to it.
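You can actually inspect which of these schemes DuckDB picked per column segment (a hedged example; 'trips' is a placeholder table name):

    -- Shows, per row group and column segment, the compression DuckDB
    -- chose: Constant, RLE, Dictionary, FSST, and so on.
    PRAGMA storage_info('trips');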
14:25 But storage is only one part. Execution is also very important, and that is done using morsel-driven parallelism.
14:36 That's a fancy word for creating batches of rows, not super small ones, which are kept in this columnar format even in memory, even in the execution engine. So you have vectors of each column for all
14:58 of the rows, and DuckDB uses 2,048 rows per batch by default. They're called morsels in the paper, but DuckDB calls them batches.
15:10 It's different from the Volcano model in that it doesn't process one row at a time; you get a whole batch of rows.
15:18 And what's especially nice is the pipelining. It's not just the batches;
15:23 the parallelism side is also very important: you keep working on the batch you've got, as far as you can, on your own CPU core, so the data stays close in the caches and you don't have to transfer it across caches.
15:40 And if a CPU core is not doing anything, it can steal work from the queues of the other cores. The other nice thing is that all this data can stay compressed in memory too, and there are a bunch of benefits to that:
16:05 you can still vectorize operations, even over all-different values, to sum them together for instance, and with
16:21 dictionary encoding, if you want to append a C to the A and the B, you only have to do it twice, to the dictionary, instead of five times to all of the values. So yeah, then finally there's
16:38 extensibility. Postgres is kind of known for its extensibility; it's one of the first, maybe the first, extensible databases. DuckDB is also extensible, but one thing it adds is a package manager, so you can actually install extensions instead of having to figure out how to get all the right bytes onto
16:56 your computer for your OS, which is generally kind of a hassle with Postgres. Sometimes you have to compile, sometimes you have to install other dependencies, sometimes you have to add a yum or apt repo.
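In DuckDB it's two statements (httpfs is just one example extension):

    INSTALL httpfs;  -- downloads the extension from DuckDB's extension repository
    LOAD httpfs;     -- loads it into the running process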
17:11 And it has what I like to call FDWs on steroids: a way of attaching
17:19 other databases. This is how it works for Postgres: you just say ATTACH, a Postgres connection string, and a name, and it creates a new database that you can query as if it were a DuckDB database; you can use DuckDB syntax on Postgres tables.
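It looks like this (the connection string is a placeholder; this needs DuckDB's postgres extension):

    -- Attach a Postgres database and query its tables with DuckDB syntax.
    ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres);
    SELECT count(*) FROM pg.public.some_table;  -- database.schema.table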
17:33 Now, in Postgres we also allow this triple-name syntax, database.schema.table, but we force the first part: we error out if it's anything other than the current database. It could very well be made
17:55 extensible, so that extensions can claim the first part; then you could have an FDW that does this kind of thing. You can also have custom file systems: extensions can implement how and where data is fetched. DuckDB supports reading Parquet files and Iceberg tables, but where that data comes from is
18:20 itself extensible. So if I build my own blob storage, or if I have some FTP server with some random JSON on it, I could write an extension to read from FTP.
18:31 And finally, one of my main personal gripes about writing extensions in Postgres is that you have this split between SQL files and shared library
18:44 files. If you want a function that's executed in C, you still have to create the function on the SQL side and tell it that it needs to execute some C. In DuckDB
18:58 that's just not necessary; you just add it to a list of functions.
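For contrast, the Postgres side of that split looks something like this (my_add is a hypothetical example function):

    -- The C function already exists in the shared library, but it
    -- still has to be declared in the extension's SQL script.
    CREATE FUNCTION my_add(integer, integer) RETURNS integer
    AS 'MODULE_PATHNAME', 'my_add'
    LANGUAGE C STRICT;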
19:06 In DuckDB you only have a shared library; there's no SQL file. Everything that happens is in the shared library: it adds functions, it adds some tables. It's a dynamic catalog, not a catalog that you have to update with SQL. In Postgres that's kind of
19:26 annoying, because installing an extension is not so bad the first time, but when you update the shared library, your SQL objects are still the old ones. So if you write that shared library, you have to take
19:44 care that it still works and doesn't crash if it's called with the old arguments. With DuckDB, if you just load the new shared library, you have the new functions. It's that simple.
19:57 So yeah, that's about what I have to say. There's a bunch more, but I can't cover everything. DuckDB has a very good blog on its internals and all kinds of things, so I definitely recommend reading that. I've put up a few links: DuckDB itself, and the pg_duckdb project I'm working on.
20:16 Both of them are MIT licensed. So, if there are any questions, I'm happy to answer them now or in the hallway.
20:26 Thank you very much. [Applause]
20:34 We still have the QR codes to scan over here and at the door. So before you leave, you can scan and give him feedback. And right now we have a coffee break. Thank you.