Using MotherDuck at MotherDuck: Loading Data from Postgres with DuckDB
2025/03/07 - 7 min read
Introduction
At MotherDuck, we use MotherDuck internally as our own cloud data warehouse. As such, we need visibility into how database tables from some of our internal services change over time. Specifically, we need to analyze five critical operational tables from a Postgres database that tracks user interactions, database states, and system performance. And since MotherDuck is built on DuckDB, we can use DuckDB's Postgres Scanner extension to easily read that data straight out of Postgres.
Using MotherDuck’s dual execution as a bridge, we've created a simple, reliable workflow that runs every 6 hours via a scheduled job. The entire process typically completes in about 10 minutes, replicating about 150GB of data.
In this post, you'll see exactly how we implemented this solution, with concrete examples you can test yourself.
The Problem
Our operational Postgres database contains tables that track essential metrics about our service. These tables update frequently, and our analytics team needs reliable and performant access to up-to-date copies without impacting production performance. We considered traditional approaches like:
- Direct queries to production (too resource-intensive)
- Complex ETL pipelines (too much maintenance overhead)
- CDC solutions (often complex to set up and maintain)
All fell short of our requirements for simplicity and reliability, and, critically, each would have added extra dependencies for our SWE team to maintain.
The Architecture
Our solution leverages three components in a straightforward architecture: Postgres as the source, DuckDB with the Postgres Scanner extension as the bridge, and MotherDuck as the destination. Here's exactly what happens during each sync:
- DuckDB connects to our Postgres database using the Postgres Scanner extension
- It reads the complete contents of the five tables we need to replicate
- Using CREATE OR REPLACE TABLE, it pushes fresh, complete copies to MotherDuck, replacing the previous versions
You can verify this process works by checking that data in MotherDuck exactly matches the source Postgres database at the recorded sync time.
The Code: Concrete Implementation
Here's the implementation we use in production. You can run this code yourself to test the approach.
Note that this is simply SQL, wrapped in Python. We'll share the SQL first, then the actual Python.
$ duckdb md:
INSTALL postgres;
LOAD postgres;
-- Strings PG_CONNECTION_STRING and MD_DATABASE get replaced.
ATTACH 'PG_CONNECTION_STRING' AS pg (TYPE POSTGRES, READ_ONLY);
ATTACH 'md:MD_DATABASE';
USE MD_DATABASE;
CREATE OR REPLACE TABLE first_table AS SELECT * FROM pg.first_table;
-- continue on for N tables using this pattern
Step 1: Attach both databases
First, we establish connections to both Postgres and MotherDuck:
import duckdb

def run():
    # Read from environment variables in production
    pg_connection_string = "postgresql://username:password@hostname:5432/dbname"
    md_database = "analytics_replica"

    # Create a local DuckDB connection as the intermediary
    duck_con = duckdb.connect()

    # Attach to Postgres (read-only to ensure safety)
    duck_con.sql(f"ATTACH '{pg_connection_string}' AS pg (TYPE POSTGRES, READ_ONLY);")

    # Attach to MotherDuck and set it as the active database
    duck_con.sql(f"ATTACH 'md:{md_database}'; USE {md_database}")

    # Execute replication
    ctas_from_diff_db(duck_con)
    last_sync_time(duck_con)
You can test this by replacing the connection strings with your own and running the script.
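As the comment above hints, in production we read these values from the environment rather than hard-coding them. A minimal sketch of that pattern, where the variable names are just examples and not our actual configuration:

import os

# Hypothetical environment variables; pick names that match your own setup
pg_connection_string = os.environ["PG_CONNECTION_STRING"]
md_database = os.environ.get("MD_DATABASE", "analytics_replica")
# MotherDuck can also pick up its auth token from the motherduck_token environment variable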
Step 2: Replicate the tables
Here's the exact function that handles the table replication:
import time

def ctas_from_diff_db(duck_con):
    # Replicate the first table
    start_time = time.time()
    duck_con.sql("CREATE OR REPLACE TABLE first_table AS SELECT * FROM pg.first_table;")
    print(f"Replicated first_table in {time.time() - start_time:.2f} seconds")

    # repeat for each table you want to replicate
    ...
The output will show you exactly how long each table takes to replicate.
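If you'd rather not repeat that block once per table, the same pattern can be driven by a list of table names. This is a sketch, not our exact production code, and the table names are placeholders:

import time

def ctas_from_diff_db(duck_con, tables=("first_table", "second_table", "third_table")):
    # Full-refresh each Postgres table into MotherDuck, timing each copy
    for table in tables:
        start_time = time.time()
        duck_con.sql(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.{table};")
        print(f"Replicated {table} in {time.time() - start_time:.2f} seconds")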
Step 3: Track the sync timestamp
To maintain an audit trail of sync operations, we record the exact time when each sync completes. This is useful so that end consumers understand the freshness of the data when they use it to make decisions, and for automated freshness checks.
def last_sync_time(duck_con):
    duck_con.sql(
        "CREATE OR REPLACE TABLE last_sync_time AS SELECT current_timestamp AS last_sync_time;"
    )
    # Verify the timestamp was recorded
    result = duck_con.sql("SELECT * FROM last_sync_time").fetchall()
    print(f"Sync completed and recorded at: {result[0][0]}")
You can verify the synchronization by comparing data in your source and destination:
# Check row counts match between source and destination
source_count = duck_con.sql("SELECT COUNT(*) FROM pg.databases").fetchone()[0]
dest_count = duck_con.sql("SELECT COUNT(*) FROM databases").fetchone()[0]
print(f"Source database has {source_count} rows")
print(f"Destination has {dest_count} rows")
assert source_count == dest_count, "Row counts don't match!"
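The automated freshness checks mentioned above can then be as simple as comparing the recorded timestamp against a threshold. A hypothetical example, with the 12-hour threshold (twice the sync cadence) chosen arbitrarily:

# Hypothetical freshness check: alert if the last sync is older than the threshold
stale = duck_con.sql(
    "SELECT current_timestamp - last_sync_time > INTERVAL 12 HOURS FROM last_sync_time"
).fetchone()[0]
assert not stale, "Data is stale: last sync was more than 12 hours ago"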
Why This Workflow Works (And When It Doesn't)
This approach has specific strengths and limitations that you should understand before implementing:
Strengths:
- Zero additional infrastructure: The entire process runs using just DuckDB, Postgres, and MotherDuck - no need for additional services or middleware.
- Simplicity: Using CREATE OR REPLACE TABLE means we don't need complex incremental logic or change tracking mechanisms.
- Transactional consistency: Since each table is copied as a complete snapshot in a single transaction, consistent point-in-time copies are assured. Transactions could also be used explicitly in your SQL statements if desired.
- Low maintenance: No need to track deltas, manage watermarks, or handle complex merge logic.
Limitations:
- Only practical for smaller tables: Since we're doing a full refresh each time, this approach is only practical for tables with up to tens of millions of rows. We've found it works well into the hundreds of GBs.
- Potential write amplification: In MotherDuck, we're rewriting entire tables even if only a small portion changed.
- Not suitable for very frequent syncs: Given the full-table approach, running this more frequently than every few minutes would be inefficient.
You can test these limitations yourself by trying tables of different sizes and observing how sync time scales with row count.
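If you want to put numbers on that, a rough benchmark along these lines works; the table names are placeholders, and the timings include network transfer, so treat the results as indicative rather than precise:

import time

# Rough benchmark: time a full refresh for tables of different sizes
for table in ("small_table", "medium_table", "large_table"):  # placeholder names
    rows = duck_con.sql(f"SELECT count(*) FROM pg.{table}").fetchone()[0]
    start = time.time()
    duck_con.sql(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.{table};")
    print(f"{table}: {rows:,} rows replicated in {time.time() - start:.2f}s")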
Using MotherDuck at MotherDuck: Real-World Application
We've been running this exact process in production for months. Here's what our actual workflow looks like:
- Our function executes every 6 hours
- It replicates the five tables described above by completely refreshing them
- Our analytics team has dashboards that show:
- Database growth trends over time
- Snapshot creation patterns
- System performance metrics
The concrete benefit: Our team can analyze operational data without writing complex queries against production or managing elaborate data pipelines.
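The every-6-hours cadence comes from a scheduled job, and the scheduler itself is interchangeable. If you just want to reproduce the cadence while experimenting, a minimal standard-library loop like this sketch is enough, though in practice you'd more likely reach for cron or an orchestrator:

import time

SIX_HOURS = 6 * 60 * 60

# Stand-in for a real scheduler: run the sync, then sleep until the next cycle
while True:
    run()  # the run() function defined earlier
    time.sleep(SIX_HOURS)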
You can verify the value of this approach yourself by setting up a similar workflow and measuring:
- Time spent on maintaining data pipelines before vs. after
- Query performance on MotherDuck vs. direct Postgres queries
- Ability to perform temporal analysis with historical data
How to Implement This Yourself: A Concrete Guide
- Set up your environment:
pip install duckdb
- Create this test script (replace with your connection details):
import duckdb
# Connect to DuckDB
duck_con = duckdb.connect()
# Install and load the postgres extension if needed
duck_con.sql("INSTALL postgres; LOAD postgres;")
# Attach to your Postgres database (replace with your connection string)
duck_con.sql("ATTACH 'postgresql://user:pass@localhost:5432/mydb' AS pg (TYPE POSTGRES, READ_ONLY);")
# Attach to MotherDuck (replace with your token and database)
duck_con.sql("ATTACH 'md:mydb' (TOKEN='your_token'); USE mydb")
# Replicate a test table
duck_con.sql("CREATE OR REPLACE TABLE test_table AS SELECT * FROM pg.test_table LIMIT 1000;")
# Record sync time
duck_con.sql("CREATE OR REPLACE TABLE last_sync_time AS SELECT current_timestamp AS last_sync_time;")
# Verify
print("Source data preview:")
print(duck_con.sql("SELECT * FROM pg.users LIMIT 5").fetchall())
print("nReplicated data preview:")
print(duck_con.sql("SELECT * FROM test_table LIMIT 5").fetchall())
print("nSync completed at:")
print(duck_con.sql("SELECT * FROM last_sync_time").fetchall())
- Run the script and verify the results: check that data in your source and destination match, and that the sync time is recorded correctly; a quick check is sketched below.
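For example, a hypothetical check in the spirit of the earlier row-count comparison, adjusted for the LIMIT 1000 in the test script:

# Row-count check for the test copy above (which was capped at 1,000 rows)
source_count = duck_con.sql("SELECT count(*) FROM pg.test_table").fetchone()[0]
dest_count = duck_con.sql("SELECT count(*) FROM test_table").fetchone()[0]
print(f"Source: {source_count} rows, replica: {dest_count} rows")
assert dest_count == min(source_count, 1000), "Replicated row count doesn't match!"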
Next Steps
Now that you've seen a concrete implementation of our approach, you can:
- Create a MotherDuck account and get your API token
- Install DuckDB and the Postgres extension
- Run the test script with your own connection details
- Adapt our production script to replicate your own tables
If you implement this solution, you can verify its effectiveness by:
- Comparing query performance between direct Postgres queries and MotherDuck queries
- Measuring the time it takes to replicate different table sizes
- Testing how schema changes affect the replication process
We'd love to hear about your experience implementing this solution. Does it match our results? Did you find ways to improve it? Let us know!
