Using MotherDuck at MotherDuck: Loading Data from Postgres with DuckDB
2025/03/07 - 7 min read
Introduction
At MotherDuck, we use MotherDuck internally as our own cloud data warehouse. As such, we need visibility into how database tables from some of our internal services change over time. Specifically, we need to analyze five critical operational tables from a Postgres database that tracks user interactions, database states, and system performance. And since MotherDuck is built on DuckDB, we can use DuckDB's Postgres Scanner extension to easily read that data straight out of Postgres.
Using MotherDuck’s dual execution as a bridge, we've created a simple, reliable workflow that runs every 6 hours via a scheduled job. The entire process typically completes in about 10 minutes, replicating about 150GB of data.
In this post, you'll see exactly how we implemented this solution, with concrete examples you can test yourself.
The Problem
Our operational Postgres database contains tables that track essential metrics about our service. These tables update frequently, and our analytics team needs reliable and performant access to up-to-date copies without impacting production performance. We considered traditional approaches like:
- Direct queries to production (too resource-intensive)
- Complex ETL pipelines (too much maintenance overhead)
- CDC solutions (often complex to set up and maintain)
All fell short of our requirements for simplicity and reliability, and, critically, each would have added extra dependencies for our SWE team to maintain.
The Architecture
Our solution leverages three components in a straightforward architecture: Postgres as the source, DuckDB with the Postgres Scanner extension as the bridge, and MotherDuck as the destination. Here's exactly what happens during each sync:
- DuckDB connects to our Postgres database using the Postgres Scanner extension
- It reads the complete contents of the five tables we need to replicate
- Using CREATE OR REPLACE TABLE, it pushes fresh, complete copies to MotherDuck, replacing the previous versions
You can verify this process works by checking that data in MotherDuck exactly matches the source Postgres database at the recorded sync time.
The Code: Concrete Implementation
Here's the implementation we use in production. You can run this code yourself to test the approach.
Note that this is simply SQL, wrapped in Python. We'll share the SQL first, then the actual Python.
$ duckdb md:
INSTALL postgres;
LOAD postgres;
-- Strings PG_CONNECTION_STRING and MD_DATABASE get replaced.
ATTACH 'PG_CONNECTION_STRING' AS pg (TYPE POSTGRES, READ_ONLY);
ATTACH 'md:MD_DATABASE';
USE MD_DATABASE;
CREATE OR REPLACE TABLE first_table AS SELECT * FROM pg.first_table;
-- continue on for N tables using this pattern
Step 1: Attach both databases
First, we establish connections to both Postgres and MotherDuck:
import duckdb

def run():
    # Read from environment variables in production
    pg_connection_string = "postgresql://username:password@hostname:5432/dbname"
    md_database = "analytics_replica"

    # Create a local DuckDB connection as the intermediary
    duck_con = duckdb.connect()

    # Attach to Postgres (read-only to ensure safety)
    duck_con.sql(f"ATTACH '{pg_connection_string}' AS pg (TYPE POSTGRES, READ_ONLY);")

    # Attach to MotherDuck and set it as the active database
    duck_con.sql(f"ATTACH 'md:{md_database}'; USE {md_database}")

    # Execute replication
    ctas_from_diff_db(duck_con)
    last_sync_time(duck_con)
You can test this by replacing the connection strings with your own and running the script.
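As the comment above hints, in production we read these values from the environment rather than hard-coding them. A minimal sketch of that pattern, where the variable names are just examples and not our actual configuration:

import os

# Hypothetical environment variables; pick names that match your own setup
pg_connection_string = os.environ["PG_CONNECTION_STRING"]
md_database = os.environ.get("MD_DATABASE", "analytics_replica")
# MotherDuck can also pick up its auth token from the motherduck_token environment variable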
Step 2: Replicate the tables
Here's the exact function that handles the table replication:
import time

def ctas_from_diff_db(duck_con):
    # Replicate the first table
    start_time = time.time()
    duck_con.sql("CREATE OR REPLACE TABLE first_table AS SELECT * FROM pg.first_table;")
    print(f"Replicated first_table in {time.time() - start_time:.2f} seconds")

    # repeat for each table you want to replicate
    ...
The output will show you exactly how long each table takes to replicate.
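If you'd rather not repeat that block once per table, the same pattern can be driven by a list of table names. This is a sketch, not our exact production code, and the table names are placeholders:

import time

def ctas_from_diff_db(duck_con, tables=("first_table", "second_table", "third_table")):
    # Full-refresh each Postgres table into MotherDuck, timing each copy
    for table in tables:
        start_time = time.time()
        duck_con.sql(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.{table};")
        print(f"Replicated {table} in {time.time() - start_time:.2f} seconds")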
Step 3: Track the sync timestamp
To maintain an audit trail of sync operations, we record the exact time when each sync completes. This is useful so that end consumers understand the freshness of the data when they use it to make decisions, and for automated freshness checks.
def last_sync_time(duck_con):
    duck_con.sql(
        "CREATE OR REPLACE TABLE last_sync_time AS SELECT current_timestamp AS last_sync_time;"
    )
    # Verify the timestamp was recorded
    result = duck_con.sql("SELECT * FROM last_sync_time").fetchall()
    print(f"Sync completed and recorded at: {result[0][0]}")
You can verify the synchronization by comparing data in your source and destination:
# Check row counts match between source and destination
source_count = duck_con.sql("SELECT COUNT(*) FROM pg.databases").fetchone()[0]
dest_count = duck_con.sql("SELECT COUNT(*) FROM databases").fetchone()[0]
print(f"Source database has {source_count} rows")
print(f"Destination has {dest_count} rows")
assert source_count == dest_count, "Row counts don't match!"
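The automated freshness checks mentioned above can then be as simple as comparing the recorded timestamp against a threshold. A hypothetical example, with the 12-hour threshold (twice the sync cadence) chosen arbitrarily:

# Hypothetical freshness check: alert if the last sync is older than the threshold
stale = duck_con.sql(
    "SELECT current_timestamp - last_sync_time > INTERVAL 12 HOURS FROM last_sync_time"
).fetchone()[0]
assert not stale, "Data is stale: last sync was more than 12 hours ago"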
Why This Workflow Works (And When It Doesn't)
This approach has specific strengths and limitations that you should understand before implementing:
Strengths:
- Zero additional infrastructure: The entire process runs using just DuckDB, Postgres, and MotherDuck - no need for additional services or middleware.
- Simplicity: Using CREATE OR REPLACE TABLE means we don't need complex incremental logic or change tracking mechanisms.
- Transactional consistency: Since each table is copied as a complete snapshot in a single transaction, consistent point-in-time copies are assured. Transactions could also be used explicitly in your SQL statements if desired.
- Low maintenance: No need to track deltas, manage watermarks, or handle complex merge logic.
Limitations:
- Only practical for smaller tables: Since we're doing a full refresh each time, this approach is only practical for tables with up to tens of millions of rows. We've found it works well into the hundreds of GBs.
- Potential write amplification: In MotherDuck, we're rewriting entire tables even if only a small portion changed.
- Not suitable for very frequent syncs: Given the full-table approach, running this more frequently than every few minutes would be inefficient.
You can test these limitations yourself by trying tables of different sizes and observing how sync time scales with row count.
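If you want to put numbers on that, a rough benchmark along these lines works; the table names are placeholders, and the timings include network transfer, so treat the results as indicative rather than precise:

import time

# Rough benchmark: time a full refresh for tables of different sizes
for table in ("small_table", "medium_table", "large_table"):  # placeholder names
    rows = duck_con.sql(f"SELECT count(*) FROM pg.{table}").fetchone()[0]
    start = time.time()
    duck_con.sql(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.{table};")
    print(f"{table}: {rows:,} rows replicated in {time.time() - start:.2f}s")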
Using MotherDuck at MotherDuck: Real-World Application
We've been running this exact process in production for months. Here's what our actual workflow looks like:
- Our function executes every 6 hours
- It replicates the five tables described above by completely refreshing them
- Our analytics team has dashboards that show:
- Database growth trends over time
- Snapshot creation patterns
- System performance metrics
The concrete benefit: Our team can analyze operational data without writing complex queries against production or managing elaborate data pipelines.
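The every-6-hours cadence comes from a scheduled job, and the scheduler itself is interchangeable. If you just want to reproduce the cadence while experimenting, a minimal standard-library loop like this sketch is enough, though in practice you'd more likely reach for cron or an orchestrator:

import time

SIX_HOURS = 6 * 60 * 60

# Stand-in for a real scheduler: run the sync, then sleep until the next cycle
while True:
    run()  # the run() function defined earlier
    time.sleep(SIX_HOURS)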
You can verify the value of this approach yourself by setting up a similar workflow and measuring:
- Time spent on maintaining data pipelines before vs. after
- Query performance on MotherDuck vs. direct Postgres queries
- Ability to perform temporal analysis with historical data
How to Implement This Yourself: A Concrete Guide
- Set up your environment:
pip install duckdb
- Create this test script (replace with your connection details):
import duckdb
# Connect to DuckDB
duck_con = duckdb.connect()
# Install and load the postgres extension if needed
duck_con.sql("INSTALL postgres; LOAD postgres;")
# Attach to your Postgres database (replace with your connection string)
duck_con.sql("ATTACH 'postgresql://user:pass@localhost:5432/mydb' AS pg (TYPE POSTGRES, READ_ONLY);")
# Attach to MotherDuck (replace with your token and database)
duck_con.sql("ATTACH 'md:mydb' (TOKEN='your_token'); USE mydb")
# Replicate a test table
duck_con.sql("CREATE OR REPLACE TABLE test_table AS SELECT * FROM pg.test_table LIMIT 1000;")
# Record sync time
duck_con.sql("CREATE OR REPLACE TABLE last_sync_time AS SELECT current_timestamp AS last_sync_time;")
# Verify
print("Source data preview:")
print(duck_con.sql("SELECT * FROM pg.users LIMIT 5").fetchall())
print("nReplicated data preview:")
print(duck_con.sql("SELECT * FROM test_table LIMIT 5").fetchall())
print("nSync completed at:")
print(duck_con.sql("SELECT * FROM last_sync_time").fetchall())
- Run the script and verify the results: check that data in your source and destination match, and that the sync time is recorded correctly; a quick check is sketched below.
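For example, a hypothetical check in the spirit of the earlier row-count comparison, adjusted for the LIMIT 1000 in the test script:

# Row-count check for the test copy above (which was capped at 1,000 rows)
source_count = duck_con.sql("SELECT count(*) FROM pg.test_table").fetchone()[0]
dest_count = duck_con.sql("SELECT count(*) FROM test_table").fetchone()[0]
print(f"Source: {source_count} rows, replica: {dest_count} rows")
assert dest_count == min(source_count, 1000), "Replicated row count doesn't match!"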
Next Steps
Now that you've seen a concrete implementation of our approach, you can:
- Create a MotherDuck account and get your API token
- Install DuckDB and the Postgres extension
- Run the test script with your own connection details
- Adapt our production script to replicate your own tables
If you implement this solution, you can verify its effectiveness by:
- Comparing query performance between direct Postgres queries and MotherDuck queries
- Measuring the time it takes to replicate different table sizes
- Testing how schema changes affect the replication process
We'd love to hear about your experience implementing this solution. Does it match our results? Did you find ways to improve it? Let us know!
