pg_duckdb: Postgres analytics just got faster with DuckDB
2024/10/25
PostgreSQL is on a roll. It was named DBMS of the Year in 2023 and continues to be the most popular database among developers, according to the 2024 Stack Overflow survey. It's robust, reliable, and fantastic for transactional workloads (OLTP), the bread and butter of most applications.
Many of us start our analytics journey right inside Postgres. Our application data is already there, so it's the easiest place to start asking questions. You might want to know, "How much revenue did we generate in a specific country last quarter?" or, "How many of our customers have names that rhyme with 'duck'?"
But as you ask more complex questions across larger datasets, you inevitably hit a wall. Postgres, a row-oriented database optimized for transactions, can become painfully slow for these analytical (OLAP) queries. The common solution involves setting up a separate data warehouse and a complex ETL pipeline to move data, which adds overhead and delays insights.
This is where DuckDB shines. As a fast, open-source, in-process OLAP database, it’s built from the ground up for analytics. Its popularity is growing exponentially for good reason.
So, here's the central question: What if you could get the analytical power of DuckDB without ever leaving your favorite PostgreSQL client?
Meet pg_duckdb, a new PostgreSQL extension that integrates DuckDB’s analytical engine directly into Postgres. It's an open-source project born from a partnership between Hydra and MotherDuck, designed to bring the best of both worlds together.
Today, you can. Let's get your psql client warmed up.
Part 1: Accelerate Local Analytics by 500x
Instead of moving your data, pg_duckdb brings the query engine to your data, giving you a massive performance boost right where you work.
Hands-On Demo: The 500x Performance Gain
Let's see it in action. The easiest way to get started is with the official Docker image, which comes with everything pre-configured.
# Run the container with a permissive password setting for this demo
# In production, you should manage secrets properly!
docker run -d --name pgduckdb -e POSTGRES_HOST_AUTH_METHOD=trust ghcr.io/hydradatabase/pg_duckdb:main
Once the container is running, you can connect directly with psql:
docker exec -it pgduckdb psql -U postgres
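The Docker image ships with the extension already installed, so there is no CREATE EXTENSION step to run. If you want to double-check, or you plan to install it manually later, here is a small sketch (the manual-install details live in the project README):
-- Confirm the extension is present (the Docker image should return one row)
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_duckdb';
-- For a manual install, the README describes adding pg_duckdb to
-- shared_preload_libraries in postgresql.conf and then running:
-- CREATE EXTENSION pg_duckdb;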
For this demo, we've pre-loaded a 1GB TPC-DS dataset, a standard benchmark for analytical systems. First, let's run a complex analytical query using the native PostgreSQL engine. We'll use \timing to measure how long it takes.
-- psql
\timing
-- TPC-DS Query 1
SELECT
c.c_customer_id AS customer_id,
c.c_first_name AS customer_first_name,
c.c_last_name AS customer_last_name,
-- ... (rest of the complex query)
FROM
customer c,
customer_address ca,
customer_demographics cd
WHERE
ca.ca_city = 'Hopewell'
-- ... (more joins and conditions)
GROUP BY
c.c_customer_id,
c.c_first_name,
c.c_last_name;
Running this query on our instance took 1 minute and 29 seconds.
Time: 89000.123 ms (01:29.000)
Now, let's unleash the duck. With pg_duckdb, all we have to do is enable the DuckDB engine with a simple SET command and run the exact same query.
-- Force DuckDB execution for this session
SET duckdb.force_execution = true;
-- Run the exact same TPC-DS query again
SELECT
c.c_customer_id AS customer_id,
-- ... (same query as before)
The result? 137 milliseconds.
Time: 137.000 ms
That's not a typo. We went from nearly a minute and a half to just over a tenth of a second, a roughly 500x speedup. This is the power of DuckDB's columnar engine, which is purpose-built for analytical queries and reads only the columns a query actually needs. This incredible boost is achieved with zero changes to how your data is stored; it's still a regular PostgreSQL table, you're just swapping in a more efficient engine for the job.
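If you're curious which engine actually handled a statement, EXPLAIN is a quick sanity check. As a sketch (exact plan node names may vary across pg_duckdb versions), with duckdb.force_execution still enabled the plan for an analytical statement should show a DuckDB custom scan rather than a normal Postgres plan:
-- With duckdb.force_execution = true still set for the session
EXPLAIN SELECT count(*) FROM customer;
-- Expect a top-level node along the lines of: Custom Scan (DuckDBScan)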
Part 2: Turn Postgres into a Data Lake Gateway
The power of pg_duckdb goes far beyond just speeding up queries on local tables. It brings the entire DuckDB extension ecosystem into PostgreSQL, turning your database into a true data hub. This allows you to query Parquet files on S3, read from Apache Iceberg tables, and more, all from within psql.
For instance, you can query Parquet files directly from your data lake. DuckDB's read_parquet() function works seamlessly, as shown in this query that reads a public dataset from S3 to find top TV shows.
-- Remember to keep duckdb.force_execution = true;
SELECT "Title", "Days In Top 10"
FROM read_parquet('s3://us-west-2.opendata.source.coop/netflix/daily_top_10/day=*/country_name=United States/*.parquet')
WHERE "Days In Top 10" > 200
ORDER BY "Days In Top 10" DESC;
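The Netflix bucket above is public, so no credentials are needed. For private buckets, you first have to register S3 credentials with pg_duckdb; the exact mechanism has changed between releases, so treat the following as an illustrative sketch (the duckdb.secrets table and the placeholder values are assumptions based on early releases) and check the README for the syntax your version supports:
-- Illustrative sketch only: register S3 credentials so DuckDB can read private buckets
-- (table name and columns are assumptions; all values below are placeholders)
INSERT INTO duckdb.secrets (type, key_id, secret, region)
VALUES ('S3', 'my_access_key_id', 'my_secret_access_key', 'us-east-1');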
You can even connect to modern table formats like Apache Iceberg (currently read-only) by installing the necessary DuckDB extension on the fly.
-- Install the Iceberg extension; depending on the pg_duckdb version you may
-- need to reconnect (restart your psql session) before it takes effect
SELECT duckdb.install_extension('iceberg');
-- Query an Iceberg table stored on S3
SELECT *
FROM iceberg_scan('s3://my-iceberg-bucket/warehouse/db/table')
LIMIT 10;
It's also a two-way street. You can use pg_duckdb to export data from PostgreSQL back to your data lake. The standard COPY command, when used with the DuckDB engine, can write to Parquet on S3.
-- Export a Postgres table to Parquet on S3
COPY (SELECT * FROM my_postgres_table)
TO 's3://my-backup-bucket/my_table.parquet'
WITH (FORMAT 'parquet');
This opens up powerful new workflows, like backing up large tables, exporting data for your data team, or importing valuable datasets from the lake directly into Postgres to support your applications.
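That import direction can be as simple as a CREATE TABLE ... AS over a lake query. As a sketch (the target table name is made up here, and this assumes your pg_duckdb version supports CREATE TABLE ... AS over DuckDB table functions), you could materialize the Netflix result from earlier as a regular Postgres table:
-- Materialize a slice of the data lake as a local Postgres table (illustrative name)
CREATE TABLE top_netflix_shows AS
SELECT "Title", "Days In Top 10"
FROM read_parquet('s3://us-west-2.opendata.source.coop/netflix/daily_top_10/day=*/country_name=United States/*.parquet')
WHERE "Days In Top 10" > 200;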
Part 3: Scale Analytics in the Cloud with MotherDuck
Running large analytical queries, even with DuckDB's speed, can still consume significant CPU and memory. On a production PostgreSQL instance that's also handling application transactions, this can create resource contention that slows down your application.
This is where MotherDuck comes in. MotherDuck is a serverless analytics platform powered by DuckDB. With pg_duckdb, you can seamlessly offload heavy analytical workloads to MotherDuck's cloud compute, protecting your production database without ever leaving your Postgres environment.
Connecting Postgres to MotherDuck
Connecting is simple. First, sign up for a free MotherDuck account and get a service token from the UI.
Next, you need to add this token to your postgresql.conf file.
# postgresql.conf
# Add this line with your token
duckdb.motherduck_token = 'md_my_super_secret_token'
After adding the token, restart your PostgreSQL instance.
Security Best Practice: Hardcoding secrets is not ideal. For a more secure setup, pg_duckdb also supports reading the token from an environment variable. You can set motherduck_token in your environment and use this line in postgresql.conf instead: duckdb.motherduck_token = '${motherduck_token}'
Hybrid Queries: The Best of Both Worlds
Once connected, you can query MotherDuck directly from psql. MotherDuck includes a shared database, sample_data, which you can query immediately. Let's count the mentions of "DuckDB" in Hacker News titles from 2022.
-- This query runs on MotherDuck's cloud infrastructure
SELECT
date_trunc('month', "timestamp") AS month,
count(*) AS mentions
FROM sample_data.hacker_news.stories_2022
WHERE
lower(title) LIKE '%duckdb%'
GROUP BY 1
ORDER BY 1;
The true power lies in moving data effortlessly between your local Postgres instance and MotherDuck.
1. Pulling analytical results from MotherDuck into a local PostgreSQL table:
-- Create a local Postgres table from a MotherDuck query result
CREATE TABLE local_duckdb_mentions AS
SELECT
date_trunc('month', "timestamp") AS month,
count(*) AS mentions
FROM sample_data.hacker_news.stories_2022
WHERE
lower(title) LIKE '%duckdb%'
GROUP BY 1;
2. Pushing a local PostgreSQL table up to MotherDuck:
-- Create a table in MotherDuck from a local Postgres table
CREATE TABLE my_motherduck_backup.public.customer_archive
USING duckdb AS
SELECT * FROM public.customer;
This seamless, bi-directional data movement gives you ultimate flexibility, all without leaving the comfort of your psql prompt.
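As a last sanity check (a sketch, assuming the qualified name used above resolves for reads the same way it did for the CREATE), you can read the new MotherDuck table back from the same psql session:
-- Verify the archive landed in MotherDuck
SELECT count(*) FROM my_motherduck_backup.public.customer_archive;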
Conclusion: The Best of Both Worlds
With pg_duckdb, you truly get the power of the duck in the elephant's hands. You can accelerate local analytics by orders of magnitude without changing your data storage, query your data lake (S3, Iceberg) directly from your operational database, and seamlessly scale your analytics by offloading heavy work to MotherDuck's serverless platform.
pg_duckdb is currently in Beta, and we're excited about what comes next. The success of DuckDB is all about simplicity, and we're thrilled to bring that simplicity directly to PostgreSQL users in their existing database.
Get Started Today
- Try it now: The fastest way to start is with the Docker container.
- Check out the code: The project is open source on GitHub.
- Share your feedback: We're actively developing the roadmap. Please open an issue with feature requests or feedback!
In the meantime, keep quacking and keep coding.
Transcript
0:00 Postgres is generating a lot of excitement. It has been named the DBMS of the Year in 2023 by DB-Engines and recognized as the most popular database in the Stack Overflow survey in 2024, and it's popular for good reason: it's a robust way to create, update, and store your application's data, and there are also a lot of
0:24 robust cloud-managed Postgres services ready for you to use. So Postgres is great at a lot of things, but when you try to use it for analytics you hit a wall pretty quickly, because while it's great at creating and finding individual rows, if you want to understand what is going on in your data set it can start to be painfully slow.
0:42 For example, you might want to know how much revenue your business has generated in a specific country, or how many of your customers have names that rhyme with duck. Yes, that's a really important question. These analytical queries often require a separate way of storing and processing the data to operate efficiently. Until now. This is where DuckDB comes in. DuckDB is an
1:06 open-source, in-process OLAP database designed for these types of queries. Its popularity has been growing exponentially and the team has been shipping incredible features. The particularity of DuckDB, as I said, is that it is in-process. So what if you could use your favorite analytical database without leaving Postgres and gain a huge performance boost? That's what we are going to talk about
1:30 in this video. So let's get your psql client warmed up. Postgres has a rich extension model, and it's a good way to extend Postgres capabilities without adopting a new database each time; we have a lot of databases nowadays. The more popular extensions include vector search and geospatial, and now we have a brand new extension for DuckDB.
1:54 pg_duckdb is a PostgreSQL extension that integrates DuckDB's analytical engine directly into Postgres. It's open source and a joint partnership between Hydra and MotherDuck. But why don't we use an analytical database to handle this? Well, Postgres is actually often used for first-steps analytics, even if it's not specifically designed for that purpose. This is because your initial data is often there,
2:20 making it easy to start. You don't have to move the data elsewhere, you just start querying. And if you still need to scale, you have even more options to move your data smoothly out of Postgres, also using pg_duckdb. But enough talking, let's start quacking. The easy way to get started with pg_duckdb is to use the
2:39 Docker image; there is a reference on the pg_duckdb repository. You have this
2:46 image, and you can pick different flavors and Postgres versions, and pin it to a specific release. At this point of this video there is a first release, v0.1, but I'm just going to bring it in on the main branch. So you see here I'm just pulling the Docker image, naming it pgduckdb, and what I'm doing here is
3:08 just using a parameter so that I don't have to set up a Postgres password. This is just for the sake of the demonstration; you should not do that in production. Anyway, you shouldn't copy-paste anything you see in a tutorial into production, right? So once you've run the Docker container, you can just access it
3:26 with this command, mentioning the container name we gave, pgduckdb, and I'm going to go directly into a psql client. What I've done already is load some data, some TPC-DS data, which is a dataset for benchmarking. I used a scale factor of one, which means the data size is roughly 1 GB.
3:50 So if I look at the tables, you see I have a couple of tables that have been generated. Now what I'm going to do is turn on the psql timer to measure
4:01 the time of each query, and I have here the first query of the TPC-DS benchmark. So I'm just going to run this query first using
4:17 Postgres. All right, that is done, and you see that it took 1 minute and 29 seconds. So now I'm going to run the same query using the DuckDB engine through pg_duckdb. With the Docker container the extension is already installed, so I don't need to run a CREATE EXTENSION command; as you can see, if I show the installed extensions I
4:37 can see pg_duckdb. You can check out the README if you want to install it manually, but in short you run a CREATE EXTENSION command, and then you're going to have to add it here; this is an extract of my postgresql.conf file, and I have pg_duckdb in shared_preload_libraries. All right, so we're going to force
4:55 execution on the DuckDB engine by setting this parameter to true, and so now when we run the same query again it is going to use the DuckDB engine. And you see that was pretty fast: 137 milliseconds compared to 1 minute 29 seconds,
5:16 which is something like 500 times faster. Of
5:23 course, this is just one query of the TPC-DS benchmark, so it doesn't give us the whole story, but these are already pretty exciting results. This is really a huge performance boost, and it's achieved without any change to how your data is stored here in Postgres: everything is stored as a regular Postgres table, you're just using the DuckDB engine and its magic.
5:48 However, we can do better if we store the data in an appropriate format like a columnar format, because Postgres stores data in a row-oriented format, which is great for transactional workloads but not for analytical queries. And then, the power of DuckDB is also its extensions: not only do you now have a powerful analytical engine, you also have a Swiss Army
6:11 knife that can read and write over data lake and lakehouse formats, with Parquet, CSV, and even Apache Iceberg. So let's have a look. The following query is actually querying a public bucket on S3, and it returns the top TV shows in the US during 2020 to 2022, which was the COVID
6:33 lockdown time. So I'm going to run this query now, again using the DuckDB engine. You see it's using the read_parquet function from DuckDB, and it understands directly that it's reading from a public S3 bucket, so I don't need specific credentials here. You can of course configure S3 credential settings; check out the README
6:57 for more information about that. But you see it was pretty fast: with just one extension I can read an S3 Parquet file and get the query result directly, still within Postgres. So now let's start reading over Apache Iceberg. DuckDB supports read-only access to Iceberg tables at this point of this video, and for that you
7:19 need to install the DuckDB extension, and you do that by using the DuckDB install function and then the name of the extension, which here is iceberg. After installing it you can query against it. Here I have an Iceberg table stored on S3. Small caveat: at the moment you just need to restart your psql client to make it
7:41 work. So if I now run the query against the Iceberg table, you're going to see the results in a few seconds. All right, that's not the only thing you can do: you can also write back to your data lake, so it's not just about read operations. You can use it, for instance, to export one of your large
8:01 Postgres tables to external data lake storage, and this always uses the COPY command. So here, for the sake of demonstrating the syntax, I'm just showing you the same query we just ran, reading the Parquet file of Netflix daily top 10 TV shows, and you see that I'm wrapping this up in a
8:21 COPY TO an example S3 bucket
8:25 that I have access to, and that's pretty much it. This opens many possibilities for performing operations directly in Postgres: querying existing data from a data lake, backing up specific Postgres tables to object storage, and importing data from a data lake to support operational applications. If your data team has exported a valuable dataset to a data lake, you can query it
8:50 directly and store it as a Postgres table. All right, so far we played with a local Postgres; what happens if you need to scale further? Analytical queries require a lot more hardware than transactional ones, so a Postgres instance that is perfectly fine for handling high numbers of transactions per second might not actually be enough if you start running
9:12 large analytical queries. And this is where MotherDuck can help: you can leverage MotherDuck's storage and cloud compute to get great analytical performance without impacting your production Postgres instance, and again without leaving Postgres. MotherDuck has a generous free tier, and with the link below in the description you get, on top
9:33 of that, a 30-day trial without a credit card; maybe we will just ask what your favorite duck is. So once you have your account set up, you can retrieve your token in the UI here, and I would advise you to generate a new token with an automatic expiry date and give it a name for your Postgres playground or
9:55 usage. So now that you have your token, the only thing you need to do is add it to your PostgreSQL
10:04 configuration file, and here I put a template showing how you do that. So I'll paste here basically my MotherDuck token, and I'm going to exit the container for a moment and just run this command. What it's going to do is basically add this line entry with the token into my postgresql.conf. When it's done, you
10:25 just need to restart the container, and now when I launch back into the psql client I'll be connected directly to MotherDuck. Two things: first, don't try to steal that token, it has already expired at this point of this video, so this is just for the sake of the demo. Usually you will store the MotherDuck token as an
10:45 environment variable, which is the motherduck_token variable, and you would just add this parameter to your PostgreSQL
10:55 configuration file; that avoids hard-coding sensitive information into the PostgreSQL configuration file. All right, so now I have restarted my psql client and I'm connected to MotherDuck, and how I know that is that I can actually query the shared database. MotherDuck has a shared database feature, so you can share a database
11:19 either within an organization or publicly, and every MotherDuck user has access to the sample_data database. You can find more information about those datasets in our docs if you go to the example datasets page. So you see we have datasets around Hacker News, Stack Overflow data, PyPI data,
11:39 and what I'm going to use here is basically one dataset from Hacker News from 2022, and this query
11:49 basically just counts the number of appearances of mentions of DuckDB in the title or in the text, and it counts them by month. As you can see, first, that was pretty fast, and you see I have for each month the number of mentions of DuckDB. So now, here, basically I just use
12:12 Postgres as a client, and thanks to pg_duckdb I can connect to MotherDuck to basically offload my query. But that's not the only thing you can do, because we mentioned that we can copy data to object storage, but now you can also move data, so move a Postgres table to MotherDuck or the other way around.
12:34 So here I'm doing the same query, but let's say that I actually want to store the analytical result in Postgres for serving
12:43 operational needs. So I'm just going to do a CREATE TABLE and name it this way, and now if I look at my database you see I have here the table I just created, which is stored as a Postgres table, and of course I can query this one directly using either the DuckDB or the Postgres engine.
13:05 Finally, if you want to push back some results from Postgres, as I said, or move data, you can just use the keyword USING duckdb, and this will create a table in MotherDuck in the currently selected default database; you can specify the database and the schema, and so that will push the current Postgres
13:27 table to a MotherDuck table. As you can see, moving data from Postgres to MotherDuck is really simple, and again you are not leaving the Postgres client. So I guess you realize by now that with pg_duckdb it's basically the power of the duck in the elephant's hands, and while pg_duckdb is still in beta at this point of this video, we
13:49 are excited about what comes next. You can check out what's on the roadmap on the GitHub repository, and we still need to trim it based on priorities, so if you have certain requests please let us know on GitHub. DuckDB's success is all about simplicity, and we are bringing it directly to Postgres users in their existing
14:08 database. In the meantime, keep quacking and keep coding. [Music]