pg_duckdb beta release: Even faster analytics in Postgres
2024/10/23 - 12 min read
In August, we announced the pg_duckdb extension, a collaborative open-source project with Hydra, DuckDB Labs, and MotherDuck. pg_duckdb is a PostgreSQL extension that integrates DuckDB's analytics engine directly into PostgreSQL, allowing for rapid analytical queries alongside traditional transactional workloads.
Two months later, we are happy to share a beta release of the extension, which includes exciting features like using the DuckDB engine to query PostgreSQL data, querying data in object storage, and much more.
The best way to do analytics in PostgreSQL is to use your favorite Duck database under the hood.
The easiest way to get started is to use the Docker image provided, which includes PostgreSQL with the latest build of the pg_duckdb extension pre-installed.
If you want to install the extension on your own PostgreSQL instance, see the repository's README for instructions.
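In broad strokes, a manual install usually boils down to a few steps; the sketch below is only an outline, and the README remains the authoritative reference:
-- Rough sketch of a manual install; see the README for the exact steps.
-- 1) Build and install the extension from the repository:   make install
-- 2) Add it to postgresql.conf:   shared_preload_libraries = 'pg_duckdb'
-- 3) Restart PostgreSQL, then in the database where you want to use it:
CREATE EXTENSION pg_duckdb;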
Let's first start the container, which will also start a PostgreSQL server:
docker run -d --name pg_duckdb -e POSTGRES_HOST_AUTH_METHOD=trust pgduckdb/pgduckdb:17-v0.1.0
Normally, when initializing PostgreSQL, a superuser password must be set. For the sake of this demonstration, we’ve allowed all connections without a password by using POSTGRES_HOST_AUTH_METHOD=trust. This is not recommended for production usage.
Now you can connect to PostgreSQL using the psql command line client:
docker exec -it pg_duckdb psql
If you want to see this in action, check out the video we made:
Separation of concerns
PostgreSQL is a transactional database, not an analytical one. It is well-suited for lookups, small updates, and running queries when you have carefully set up your indexes and join relationships. It isn’t, however, great when you want to run ad-hoc analytical queries across the full dataset.
PostgreSQL is often used for analytics, even though it's not specifically designed for that purpose. This is because the data is readily available, making it easy to start. However, as the data volume grows and more complex analytical queries involving aggregation and grouping are needed, users often encounter limitations. This is where an analytical database engine like DuckDB comes to the rescue.
With pg_duckdb, you can use the DuckDB execution engine within PostgreSQL to work with data already stored there, and for some queries this can result in a dramatic performance improvement. Below is an example query that shows such an improvement; this obviously does not apply to all queries, and some may actually perform slower when executed in DuckDB.
Let’s try the first query of the TPC-DS benchmark suite, which is included in the TPC-DS DuckDB extension. Using that extension we created a small script to load the TPC-DS dataset without indexes into PostgreSQL. On a recent Lenovo laptop this results in the following timings for that first query when using scale factor 1 (aka 1GB of total data):
$ ./load-tpcds.sh 1
$ psql "options=--search-path=tpcds1" -o /dev/null
psql (17.0)
Type "help" for help.
postgres=# \timing on
Timing is on.
postgres=# \i 01.sql -- I ran this twice to warm the cache
Time: 81783.057 ms (01:21.783)
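In case you're curious what loading TPC-DS data without indexes can look like, here is a rough sketch for a single table, generating the data with DuckDB's tpcds extension and ingesting the resulting CSV with psql's \copy. Paths and the table choice are illustrative; the actual load-tpcds.sh script may differ in its details.
-- In DuckDB: generate the scale-factor-1 dataset and export one table to CSV
INSTALL tpcds;
LOAD tpcds;
CALL dsdgen(sf = 1);
COPY store_sales TO '/tmp/store_sales.csv' (HEADER, DELIMITER ',');
-- In psql: create the matching table in the tpcds1 schema (DDL omitted),
-- then ingest the CSV without creating any indexes:
-- \copy tpcds1.store_sales FROM '/tmp/store_sales.csv' WITH (FORMAT csv, HEADER true)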
We ingested the TPC-DS datasets into PostgreSQL without indexes for two main reasons:
- Currently, pg_duckdb does not support indexes, which makes a direct comparison impossible. Addressing this limitation is a high priority for us.
- While indexes are common in real-world PostgreSQL scenarios, optimizing them for specific analytic queries can be complicated and bring extra overhead. Considering this, we believe there is value in looking at the performance of queries without any indexes.
Running this query on standard PostgreSQL took 81.8 seconds. That’s pretty slow. Now let’s give it a try with pg_duckdb. We can force it to run using the DuckDB query engine by running SET duckdb.force_execution = true;
postgres=# SET duckdb.force_execution = true; -- causes execution to use DuckDB
Time: 0.287 ms
postgres=# \i 01.sql
Time: 52.190 ms
Executing this specific query using the DuckDB engine, while the data is stored in PostgreSQL, takes only 52 ms, which is more than 1500x faster than running it in the native engine!
The performance improvement holds even when you scale up to larger data sizes and a production machine. If we run this on EC2 in AWS¹, using 10x the data (TPC-DS scale factor 10 instead of 1), this query takes more than 2 hours with the native PostgreSQL execution engine, while it only takes ~400ms when using pg_duckdb.
This huge performance boost is achieved without any need to change how your data is stored or updated. Everything is still stored in the regular PostgreSQL tables that you're already used to.
However, we can do even better if we store the data in a format that is better for analytics. PostgreSQL stores data in a row-oriented format, which is ideal for transactional workloads but can make it harder to run queries that need to scan full columns or do aggregations. By storing the data in a columnar format you can get even better performance. The sections below outline how you can use Parquet files and MotherDuck to achieve this in pg_duckdb.
Using pg_duckdb with your Data Lake or Lakehouse
DuckDB has native support for reading and writing files on external object stores like AWS S3, so it is ideal for querying data in your Data Lake. DuckDB can also read Iceberg and Delta Lake tables, so you can also take advantage of a Lakehouse approach. The following snippets use datasets from a public bucket, so feel free to try them out yourself!
Reading a Parquet file
The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.
SELECT Title, max("Days In Top 10")::int as MaxDaysInTop10
FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet')
AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;
Title | MaxDaysInTop10
--------------------------------+----------------
Cocomelon | 99
Tiger King | 44
Jurassic World Camp Cretaceous | 31
Tiger King: Murder, Mayhem … | 9
Ozark | 9
(5 rows)
Reading an Iceberg table
In order to query data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, installing DuckDB extensions is done using the duckdb.install_extension(<extension name>) function.
-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');
-- Total quantity of items ordered for each `l_shipmode`
SELECT l_shipmode, SUM(l_quantity) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg', allow_moved_paths := true)
AS l(l_shipmode varchar, l_quantity int)
GROUP BY l_shipmode
ORDER BY total_quantity DESC;
l_shipmode | total_quantity
------------+----------------
TRUCK | 219078
MAIL | 216395
FOB | 214219
REG AIR | 214010
SHIP | 213141
RAIL | 212903
AIR | 211154
(7 rows)
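Delta Lake works along similar lines: DuckDB's delta extension provides a delta_scan function, and the extension can be installed the same way as the Iceberg one. The snippet below is only a sketch, assuming your pg_duckdb build exposes delta_scan like it does iceberg_scan; the bucket path and column list are placeholders.
-- Install the delta extension
SELECT duckdb.install_extension('delta');
-- Placeholder path and column list: point this at one of your own Delta tables
SELECT count(*)
FROM delta_scan('s3://my-bucket/path/to/delta_table')
AS t(placeholder_column varchar);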
Writing back to your Data Lake
Access to Data Lakes is not just read-only in pg_duckdb; you can also write back using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export data from your PostgreSQL tables to external Data Lake storage.
COPY (
SELECT Title, max("Days In Top 10")::int AS MaxDaysInTop10
FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5
) TO 's3://my-bucket/results.parquet';
This opens up many possibilities for performing the following operations directly in PostgreSQL:
- Query existing data from a Data Lake
- Back up specific PostgreSQL tables to an object store
- Import data from the Data Lake to support operational applications (sketched just below)
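For the last point, here is a small sketch of what such an import can look like, reusing the Netflix Parquet file from earlier; the target table name is made up:
-- Materialize Data Lake data as a regular PostgreSQL (heap) table;
-- netflix_top10_tv is an illustrative name
CREATE TABLE netflix_top10_tv AS
SELECT Title, "Days In Top 10"
FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show';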
Scaling further with MotherDuck
Analytical queries typically require a lot more hardware than transactional ones. So a PostgreSQL instance that is perfectly fine for handling high numbers of transactions per second may be severely underpowered if you start running analytics.
MotherDuck can help here by letting you leverage its storage and cloud compute resources to get great analytical performance without impacting your production PostgreSQL instance.
With pg_duckdb, you can leverage MotherDuck to push your analytical workloads to the cloud, all without leaving PostgreSQL.
In addition to a generous free tier, MotherDuck has a free trial where you can get started for 30 days without a credit card. To get started, you can sign up for MotherDuck here. Next, you'll need to generate and retrieve an access token for authentication.
The only thing you need to do to make pg_duckdb work with MotherDuck is to set your MotherDuck token in the postgresql.conf config file, using the duckdb.motherduck_token parameter. To add it directly to your running pg_duckdb container, you can run:
docker exec -it pg_duckdb sh -c 'echo "duckdb.motherduck_token = '\''<YOUR_MOTHERDUCK_TOKEN>'\''" >> /var/lib/postgresql/data/postgresql.conf'
After that, you will need to restart the container and relaunch a psql session:
docker restart pg_duckdb
docker exec -it pg_duckdb psql
If it is more convenient, you can also store the token as an environment variable and add duckdb.motherduck_enabled = true to your postgresql.conf. Additional details are available in the README.
Now, within PostgreSQL, you can start querying MotherDuck databases or shares. The query below uses the sample_data share database accessible to all MotherDuck users.
-- number of mentions of duckdb on HackerNews in 2022
SELECT
EXTRACT(YEAR FROM timestamp) AS year,
EXTRACT(MONTH FROM timestamp) AS month,
COUNT(*) AS keyword_mentions
FROM ddb$sample_data$hn.hacker_news
WHERE
(title LIKE '%duckdb%' OR text LIKE '%duckdb%')
GROUP BY year, month
ORDER BY year ASC, month ASC;
year | month | keyword_mentions
------+-------+------------------
2022 | 1 | 6
2022 | 2 | 4
2022 | 3 | 10
2022 | 4 | 9
2022 | 5 | 43
2022 | 6 | 8
2022 | 7 | 15
2022 | 8 | 6
2022 | 9 | 19
2022 | 10 | 10
2022 | 11 | 9
You can join your data in MotherDuck with your live data in PostgreSQL, and you can also easily copy data from one to the other.
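As an illustration of such a join, here is a hedged sketch that compares the HackerNews mentions against an imaginary local PostgreSQL table; hn_monthly_targets and its columns are hypothetical.
-- hn_monthly_targets is a hypothetical PostgreSQL heap table (month int, target int);
-- hacker_news lives in the MotherDuck sample_data share
SELECT t.month, t.target, COUNT(*) AS actual_mentions
FROM ddb$sample_data$hn.hacker_news hn
JOIN hn_monthly_targets t
ON EXTRACT(MONTH FROM hn.timestamp) = t.month
WHERE hn.title LIKE '%duckdb%' OR hn.text LIKE '%duckdb%'
GROUP BY t.month, t.target
ORDER BY t.month;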
As for copying data between the two: if you create a table using the USING duckdb clause, it will be created in MotherDuck; otherwise it will be a regular PostgreSQL table.
Let’s take the same MotherDuck query as above, but this time create a PostgreSQL table:
CREATE TABLE hacker_news_duckdb_postgres AS
SELECT
EXTRACT(YEAR FROM timestamp) AS year,
EXTRACT(MONTH FROM timestamp) AS month,
COUNT(*) AS keyword_mentions
FROM ddb$sample_data$hn.hacker_news
WHERE
(title LIKE '%duckdb%' OR text LIKE '%duckdb%')
GROUP BY year, month
ORDER BY year ASC, month ASC;
If we display the existing tables in PostgreSQL, we’ll see this one stored as a PostgreSQL table (its Access method is heap).
postgres=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+-----------------------------+-------+----------+-------------+---------------+------------+-------------
public | hacker_news_duckdb_postgres | table | postgres | permanent | heap | 8192 bytes |
Now, we can also copy this PostgreSQL table to MotherDuck using:
CREATE TABLE hacker_news_duckdb_motherduck USING duckdb AS SELECT * FROM hacker_news_duckdb_postgres;
The power of the duck in the elephant's hand
While pg_duckdb is still in beta, we are excited about what comes next. You can check out the milestone for the next release to see what’s already on our radar. We still need to trim it based on priorities, though, so if you have certain requests that you think are important, please let us know so they have a higher chance of being part of the next release.
DuckDB's success is all about simplicity, and we are bringing it directly to PostgreSQL users in their existing database.
Check the extension repository for more information, and start playing with your PostgreSQL and MotherDuck account!
¹ OS: Ubuntu 24.04, PostgreSQL version: 17.0 (from https://www.postgresql.org/download/linux/ubuntu/), Instance Type: c7a.4xlarge, vCPUs: 16, RAM: 32GB, Disk type: EBS gp3, Disk size: 500 GiB, Disk IOPS: 6000, Disk Throughput: 250MiB/s. PostgreSQL config: shared_buffers = 12GB (scale-factor 10 fits fully in memory), work_mem = 4GB, duckdb.max_memory = 4GB.