Analytics on PostgreSQL with ducks
Traditionally, scaling analytics workloads in PostgreSQL means maintaining a fleet of replicas. Each replica receives data from the primary instance WAL and applies changes while staying available for analytical queries. Adding indexes to your replicas will improve performance for analytical queries, but here’s the problem: the indexes must be maintained on the primary in order to read on the replicas. Updating indexes leads to a constant negotiation between the team maintaining the primary database and the team using replicas for analytical workloads.
Thankfully, the pg_duckdb extension adds DuckDB to the mix which can read directly from PostgreSQL storage format and quickly return datasets without having to replicate it into yet another storage format or add indexes. When used appropriately, this can massively accelerate queries, up to 1000x in some cases (less if indexes already exist).
It's important to note that when querying PostgreSQL tables directly with pg_duckdb, you're still working with PostgreSQL's row-oriented storage—you don't get DuckDB's columnar storage benefits or compression advantages. The performance gains come from DuckDB's vectorized execution engine, which is optimized for analytical workloads even when operating on row-oriented data.
Already a PostgreSQL expert? You can run pg_duckdb directly by using a Docker image:
docker run -d -e POSTGRES_PASSWORD=duckdb pgduckdb/pgduckdb:16-main
Then, query a PostgreSQL table directly–or, query an external Parquet file like our open dataset containing Netflix top 10 program data:
SET duckdb.force_execution = true; SELECT count(*) FROM your_pg_table WHERE status = 'active';
SELECT COUNT(*) FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet');
Keep in mind: PostgreSQL requires that extensions on primary and replicas are identical, so the pg_duckdb extension must also be installed on the primary. Since DuckDB can be very resource-hungry, you’ll want controls in place to prevent use on the primary. Additionally, each connection to PostgreSQL gets its own DuckDB instance–DuckDB should be appropriately configured with resource limits that match the size of the replica.