Going from 0 to 1 with MotherDuck - Webinar - Thursday, January 23 9am PSTRegister

Skip to main content

Loading data to MotherDuck with Python

As you ingest data using Python, typically coming from API or other sources, you have different options to load data to MotherDuck.

  1. (fast) Using a Pandas/Polars/PyArrow dataframe as an in memory buffer before bulk loading to MotherDuck.
  2. (fast) Write to a temporary file and load it to MotherDuck using a COPY command.
  3. (slow) Using executemany method to perform serveral INSERT statements in a single transaction and load data to MotherDuck.

Option 1 is the easiest as dataframe libraries are optimized for bulk insert.

Option 2 involve writing to disk but COPY command is faster than INSERT statement.

Option 3 should be discouraged unless data is very small (< 500 rows).

tip

No matter which options you are picking, we recommend loading data in chunks (typically 100k rows to match row group size) to avoid memory issues and making sure your transaction is not too large, typically finishing around a minute maximum.

info

Next to the below recommendation, we suggest reading our guidelines around connections and threading which will help you to optimize your data loading process.

1. Using Pandas/Polars/PyArrow to load data to MotherDuck

When using a dataframe library you can load data to MotherDuck in a single transaction.

import duckdb
import pyarrow as pa

# Create a PyArrow table
data = {
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data)

con = duckdb.connect('md:')
con.sql('CREATE TABLE my_table (id INTEGER, name VARCHAR) as SELECT * FROM arrow_table')

Buffering data

When you have a large dataset, it's recommended you chunk your data and load it in batches. This will help you to avoid memory issues and make sure your transaction is not too large. Here's a class example to load data in chunks using PyArrow and DuckDB.

import duckdb
import os
import pyarrow as pa
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class ArrowTableLoadingBuffer:
def __init__(
self,
duckdb_schema: str,
pyarrow_schema: pa.Schema,
database_name: str,
table_name: str,
destination="local",
chunk_size: int = 100_000, # Default chunk size
):
self.duckdb_schema = duckdb_schema
self.pyarrow_schema = pyarrow_schema
self.database_name = database_name
self.table_name = table_name
self.total_inserted = 0
self.conn = self.initialize_connection(destination, duckdb_schema)
self.chunk_size = chunk_size

def initialize_connection(self, destination, sql):
if destination == "md":
logging.info("Connecting to MotherDuck...")
if not os.environ.get("motherduck_token"):
raise ValueError(
"MotherDuck token is required. Set the environment variable 'MOTHERDUCK_TOKEN'."
)
conn = duckdb.connect("md:")
logging.info(
f"Creating database {self.database_name} if it doesn't exist"
)
conn.execute(f"CREATE DATABASE IF NOT EXISTS {self.database_name}")
conn.execute(f"USE {self.database_name}")
else:
conn = duckdb.connect(database=f"{self.database_name}.db")
conn.execute(sql) # Execute schema setup on initialization
return conn

def insert(self, table: pa.Table):
total_rows = table.num_rows
for batch_start in range(0, total_rows, self.chunk_size):
batch_end = min(batch_start + self.chunk_size, total_rows)
chunk = table.slice(batch_start, batch_end - batch_start)
self.insert_chunk(chunk)
logging.info(f"Inserted chunk {batch_start} to {batch_end}")
self.total_inserted += total_rows
logging.info(f"Total inserted: {self.total_inserted} rows")

def insert_chunk(self, chunk: pa.Table):
self.conn.register("buffer_table", chunk)
insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
self.conn.execute(insert_query)
self.conn.unregister("buffer_table")

Using the above class, you can load your data in chunks.

import pyarrow as pa

# Define the explicit PyArrow schema
pyarrow_schema = pa.schema([
('id', pa.int32()),
('name', pa.string())
])

# Sample data to create a PyArrow table based on the schema
data = {
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data, schema=pyarrow_schema)

# Define the DuckDB schema as a DDL statement
duckdb_schema = "CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)"

# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
duckdb_schema=duckdb_schema,
pyarrow_schema=pyarrow_schema,
database_name="my_db", # The DuckDB database filename or MotherDuck database name
table_name="my_table", # The name of the table in DuckDB or MotherDuck
destination="md", # Set "md" for MotherDuck or "local" for a local DuckDB database
chunk_size=2 # Example chunk size for illustration
)

# Load the data
loader.insert(arrow_table)

Typing your dataset

When working with production pipeline, it's recommended to type your dataset to avoid any issues with inference. Pyarrow is our recommendation to type your dataset as it's the easiest way to type your dataset, especially for complex data types.

In the above example, the schema is defined explicitly on both the PyArrow table and the DuckDB schema.

# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
duckdb_schema=duckdb_schema, # prepare a DuckDB DDL statement
pyarrow_schema=pyarrow_schema, # define explictely your PyArrow schema
database_name="my_db",
table_name="my_table",
destination="md",
chunk_size=2
)

2. Write to a temporary file and load it to MotherDuck using a COPY command

When you have a large dataset, another method is to write your data to temporary files and load it to MotherDuck using a COPY command. This also works great if you have existing data on a blob storage like AWS S3, Google Cloud Storage or Azure Blob Storage as you will benefit from cloud network speed.

import pyarrow as pa
import pyarrow.parquet as pq
import duckdb
import os

# Step 1: Define the schema and create a large PyArrow table
schema = pa.schema([
('id', pa.int32()),
('name', pa.string())
])

# Example data - multiply the data to simulate a large dataset
data = {
'id': list(range(1, 1000001)), # Simulating 1 million rows
'name': ['Name_' + str(i) for i in range(1, 1000001)]
}

# Create the PyArrow table with the schema
large_table = pa.table(data, schema=schema)

# Step 2: Write the large PyArrow table to a Parquet file
parquet_file = "large_data.parquet"
pq.write_table(large_table, parquet_file)

# Step 3: Load the Parquet file into MotherDuck using the COPY command
conn = duckdb.connect("md:") # Connect to MotherDuck
conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)")

# Use the COPY command to load the Parquet file into MotherDuck
conn.execute(f"COPY my_table FROM '{os.path.abspath(parquet_file)}' (FORMAT 'parquet')")

print("Data successfully loaded into MotherDuck")