dbt cloud with MotherDuck via pg_duckdb
dbt cloud is a managed service for dbt core. MotherDuck is used with dbt cloud by deploying a Postgres proxy with pg_duckdb installed.
Getting Started
You will need the following items to get started:
- A Postgres instance with pg_duckdb installed.
- A MotherDuck token.
- A dbt cloud account.
Configuring pg_duckdb
The full documentation for pg_duckdb can be found on GitHub, but a simple way to set it up is using Docker on EC2.
In our testing, we have used m7g.xlarge, which is a 4-core, 16GB instance. Since Postgres exists as a proxy for MotherDuck, it only needs to have enough working space to stream results back to dbt. Even smaller instances could suffice as well, i.e. a1.large, although it has not been tested thoroughly. The memory limits set below assumes a 16GB limit.
Once you have added your MotherDuck Token and Postgres password to your environment, you can execute the docker run
statement below:
docker run -d \
--name pgduckdb \
-p 5432:5432 \
-e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
-e MOTHERDUCK_TOKEN="$MOTHERDUCK_TOKEN" \
-v ~/pgduckdb_data_v17:/var/lib/postgresql/data \
--restart unless-stopped \
--memory=12288m \
pgduckdb/pgduckdb:17-main
The default configuration of Postgres is sub-optimal for m7g.xlarge. Consider making the following changes to the postgresql.conf
file.
# Memory configuration optimized for AWS m7g.xlarge with more conservative settings
work_mem = '32MB' # Per-operation memory for sorts, joins, etc.
maintenance_work_mem = '512MB' # Memory for maintenance operations
shared_buffers = '2GB' # ~12.5% of RAM for shared buffer cache
effective_cache_size = '6GB' # Conservative estimate of OS cache
max_connections = 100 # Reduced maximum concurrent connections
Upgrading to newer builds of pg_duckdb
New containers are built for pg_duckdb on every release. Since we are using docker to run the container, the pg_duckdb server can be stopped, pruned, and then rebuilt with the above docker run command. It is recommended to use a script to rebuild docker image on some cadence. Terraform or similar is recommended to handle this maintenance process.
An example shell script can be seen below:
Shell script
#!/bin/bash
# Error handling function
handle_error() {
local line_no=$1
local exit_code=$2
echo "ERROR: An error occurred at line ${line_no}, exit code ${exit_code}"
exit ${exit_code}
}
# Set up error trap
trap 'handle_error ${LINENO} $?' ERR
# Script to install Docker and run PGDuckDB with MotherDuck on AWS EC2
# Usage: POSTGRES_PASSWORD=your_secure_password MOTHERDUCK_TOKEN=your_md_token ./setup_pgduckdb.sh
# Detect OS
if grep -q 'Amazon Linux release 2023' /etc/os-release; then
OS_VERSION="Amazon Linux 2023"
elif grep -q 'Amazon Linux release 2' /etc/os-release; then
OS_VERSION="Amazon Linux 2"
elif grep -q 'Ubuntu' /etc/os-release; then
OS_VERSION="Ubuntu"
else
OS_VERSION="Linux"
fi
echo "Starting setup for PGDuckDB with MotherDuck on $OS_VERSION..."
# Check if required environment variables are set
if [ -z "$POSTGRES_PASSWORD" ]; then
echo "ERROR: POSTGRES_PASSWORD environment variable is not set."
echo "Usage: POSTGRES_PASSWORD=your_secure_password MOTHERDUCK_TOKEN=your_md_token ./setup_pgduckdb.sh"
exit 1
fi
if [ -z "$MOTHERDUCK_TOKEN" ]; then
echo "ERROR: MOTHERDUCK_TOKEN environment variable is not set."
echo "Usage: POSTGRES_PASSWORD=your_secure_password MOTHERDUCK_TOKEN=your_md_token ./setup_pgduckdb.sh"
exit 1
fi
# Update package lists - continue even if there are errors with some repositories
echo "Updating package lists..."
if [[ "$OS_VERSION" == "Ubuntu" ]]; then
sudo apt-get update -y || true
elif [[ "$OS_VERSION" == "Amazon Linux 2023" ]]; then
sudo dnf update -y || true
else
sudo yum update -y || true
fi
# Check if Docker is already installed
if command -v docker &>/dev/null; then
echo "Docker is already installed, skipping installation."
else
# Install prerequisites based on OS
echo "Installing prerequisites..."
if [[ "$OS_VERSION" == "Ubuntu" ]]; then
sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release
elif [[ "$OS_VERSION" == "Amazon Linux 2023" ]]; then
# Use --allowerasing to handle curl package conflicts
sudo dnf install -y --allowerasing \
device-mapper-persistent-data \
lvm2 \
ca-certificates
else
sudo yum install -y \
device-mapper-persistent-data \
lvm2 \
ca-certificates
fi
# Install Docker based on OS
echo "Installing Docker..."
if [[ "$OS_VERSION" == "Ubuntu" ]]; then
# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# Set up the repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update and install
sudo apt-get update -y
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
elif [[ "$OS_VERSION" == "Amazon Linux 2023" ]]; then
# Amazon Linux 2023 - use the standard package
sudo dnf install -y docker
elif [[ "$OS_VERSION" == "Amazon Linux 2" ]]; then
# Amazon Linux 2 - use extras
sudo amazon-linux-extras install -y docker
else
# Fallback
sudo yum install -y docker
fi
# Verify Docker was installed
if ! command -v docker &>/dev/null; then
echo "ERROR: Docker installation failed."
exit 1
fi
fi
# Start Docker service
echo "Starting Docker service..."
sudo systemctl start docker || sudo service docker start
sudo systemctl enable docker || sudo chkconfig docker on
# Add current user to docker group to avoid using sudo with docker commands
echo "Adding current user to docker group..."
sudo usermod -aG docker "$USER"
# Create a new data directory for PostgreSQL 17
echo "Creating new data directory for PostgreSQL 17..."
mkdir -p ~/pgduckdb_data_v17
# Fix permissions on the data directory
echo "Setting correct permissions on data directory..."
sudo chown -R 999:999 ~/pgduckdb_data_v17 # 999 is the standard UID for postgres user in Docker
sudo chmod 700 ~/pgduckdb_data_v17
# Check architecture
ARCH=$(uname -m)
echo "Detected architecture: $ARCH"
if [[ "$ARCH" == "aarch64" || "$ARCH" == "arm64" ]]; then
echo "Using ARM64 architecture (Graviton3)..."
else
echo "Using x86_64 architecture..."
fi
# Check if container already exists and remove it if necessary
if sudo docker ps -a | grep -q pgduckdb; then
echo "Found existing pgduckdb container. Removing it..."
sudo docker stop pgduckdb || true
sudo docker rm pgduckdb || true
fi
# Pull the Docker image
echo "Pulling Docker image..."
sudo docker pull pgduckdb/pgduckdb:17-main
# Check available system memory
echo "Checking system memory..."
TOTAL_MEM_KB=$(grep MemTotal /proc/meminfo | awk '{print $2}')
TOTAL_MEM_MB=$((TOTAL_MEM_KB / 1024))
echo "Total system memory: ${TOTAL_MEM_MB}MB"
# Calculate 75% of system memory for Docker container limit
DOCKER_MEM_LIMIT=$((TOTAL_MEM_MB * 75 / 100))
echo "Setting Docker container memory limit to: ${DOCKER_MEM_LIMIT}MB"
# Run the Docker container with memory limit
echo "Starting PostgreSQL container..."
sudo docker run -d \
--name pgduckdb \
-p 5432:5432 \
-e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
-e MOTHERDUCK_TOKEN="$MOTHERDUCK_TOKEN" \
-v ~/pgduckdb_data_v17:/var/lib/postgresql/data \
--restart unless-stopped \
--memory=${DOCKER_MEM_LIMIT}m \
pgduckdb/pgduckdb:17-main
# Wait for PostgreSQL to start
echo "Waiting for PostgreSQL to start..."
sleep 10
# Configure PostgreSQL
echo "Configuring PostgreSQL and DuckDB..."
# Append settings to the main PostgreSQL configuration file
echo "Appending settings to PostgreSQL configuration file..."
sudo docker exec -i pgduckdb bash -c "cat >> /var/lib/postgresql/data/postgresql.conf << 'EOT'
# DuckDB integration settings
duckdb.motherduck_enabled = true
# Memory configuration optimized for AWS m7g.xlarge with more conservative settings
work_mem = '32MB' # Per-operation memory for sorts, joins, etc.
maintenance_work_mem = '512MB' # Memory for maintenance operations
shared_buffers = '2GB' # ~12.5% of RAM for shared buffer cache
effective_cache_size = '6GB' # Conservative estimate of OS cache
max_connections = 100 # Reduced maximum concurrent connections
# Detailed query logging
log_min_duration_statement = 0 # Log all queries
log_statement = 'all' # Log all SQL statements
log_duration = on # Log duration of each SQL statement
log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u ' # Prefix format
EOT"
# Restart PostgreSQL to apply all configuration settings
echo "Restarting PostgreSQL container to apply all configuration settings..."
sudo docker restart pgduckdb
# Wait for PostgreSQL to restart
echo "Waiting for PostgreSQL container to restart..."
sleep 10
# Verify PostgreSQL is running with new settings
echo "Verifying PostgreSQL configuration..."
sudo docker exec -i pgduckdb psql -U postgres << EOF
-- Check if PostgreSQL is running
SELECT version();
EOF
# Create monitoring script
echo "Creating monitoring script..."
cat > ~/monitor_pg.sh << 'EOF'
#!/bin/bash
echo "=== PostgreSQL Container Status ==="
docker ps -a -f name=pgduckdb
echo -e "\n=== Resource Usage ==="
docker stats --no-stream pgduckdb
echo -e "\n=== Recent Logs ==="
docker logs --tail 10 pgduckdb
echo -e "\n=== Connection Test ==="
docker exec -it pgduckdb pg_isready -U postgres
if [ $? -eq 0 ]; then
echo "PostgreSQL is accepting connections."
else
echo "PostgreSQL is not accepting connections."
fi
EOF
chmod +x ~/monitor_pg.sh
# Create startup script
echo "Creating startup script..."
cat > ~/start_pg.sh << 'EOF'
#!/bin/bash
echo "Starting PostgreSQL container..."
docker start pgduckdb
echo "Container status:"
docker ps -a -f name=pgduckdb
EOF
chmod +x ~/start_pg.sh
# Check if container is running or restarting
echo "Checking container status..."
CONTAINER_STATUS=$(sudo docker inspect -f '{{.State.Status}}' pgduckdb 2>/dev/null || echo "not_found")
if [[ "$CONTAINER_STATUS" == "restarting" ]]; then
echo "WARNING: Container is restarting. Checking logs for errors..."
sudo docker logs pgduckdb
echo "
Try reducing the memory settings in the PostgreSQL configuration if the container keeps restarting."
echo "You can manually adjust settings by connecting to the container once it's stable."
elif [[ "$CONTAINER_STATUS" != "running" && "$CONTAINER_STATUS" != "not_found" ]]; then
echo "WARNING: Container is not running (status: $CONTAINER_STATUS). Checking logs for errors..."
sudo docker logs pgduckdb
fi
# Final status check
echo "=== Setup Complete ==="
echo "PostgreSQL with DuckDB is now running."
echo "Container status:"
sudo docker ps -a -f name=pgduckdb
echo -e "\n=== Connection Information ==="
echo "Host: localhost"
echo "Port: 5432"
echo "User: postgres"
echo "Password: [The password you provided]"
echo "Database: postgres"
echo -e "\n=== Useful Commands ==="
echo "Monitor status: ./monitor_pg.sh"
echo "Start after reboot: ./start_pg.sh"
echo "Connect to PostgreSQL: docker exec -it pgduckdb psql -U postgres"
echo "View logs: docker logs pgduckdb"
echo -e "\n=== Note ==="
echo "You may need to log out and log back in for the docker group changes to take effect."
echo "After that, you can run docker commands without sudo."
dbt cloud configuration
dbt cloud is configured as standard Postgres, with a couple of key details.
- You will need to create a schema in MotherDuck for each user as well as production, as using pg_duckdb to create new schemas in MotherDuck is not supported.
- You will need to set an environmental variable for
DBT_SCHEMA
that uses the pg_duckdb schema format, which isddb$[database]$[schema]
since Postgres only supports a single databse per instance. This will need to be set for each user as well as production with{{ env_var('DBT_SCHEMA')}}
. - The recommended thread count follow our dbt-core recommendation, which is 4 threads.
If dbt is configured incorrectly, data may write to Postgres, which is much slower than MotherDuck. In that case, the easiest fix is to rebuild the docker container per above, to assure that no data accidently ends up in Postgres.
Usage notes
There are a few things to know about using dbt cloud with pg_duckdb that are unusual.
- You write Postgres dialect SQL that is executed against DuckDB. As such, there is some ideosyncracies that are neither Postgres nor DuckDB, but a secret, third thing (pg_duckdb SQL). The details of this are described in the pg_duckdb documentation.
- Views are only stored in Postgres without any artifacts in MotherDuck. As such, they can be used for interim data but not final datasets to be consumed by end-users. As such, changing materialization type from view to table, or table to view, is a hybrid MotherDuck & Postgres transaction, and unsupported.
- Running on multiple threads can occasionally cause deadlocks with the pg_duckdb catalog maintenance service. This can be resolved with
dbt retry
in your production pipeline runs. - DuckDB types are more specific than Postgres, so model builds using numeric types will throw errors that can be resolved with specific typing.
- From time-to-time the Postgres catalog can get out of sync, and will show tables that do not exist in MotherDuck. To resolve this, create the missing object in MotherDuck, i.e.
CREATE TABLE my_schema.model_name AS SELECT 1;
, which will unblock your dbt model.