Loading data to MotherDuck with Python
When you ingest data with Python, typically from an API or other source, you have three options to load it into MotherDuck:
- FAST: Use a Pandas, Polars, or PyArrow dataframe as an in-memory buffer before bulk loading. This is the easiest approach because dataframe libraries are optimized for bulk insert.
- FAST: Write to a temporary file and load it with a
COPYcommand. This involves writing to disk, but theCOPYcommand is faster thanINSERTstatements. - SLOW: Use
executemanyto perform severalINSERTstatements in a single transaction. This should be discouraged unless data is very small (fewer than 500 rows).
No matter which options you are picking, we recommend loading data in chunks (typically 100k rows to match row group size) to avoid memory issues and making sure your transaction is not too large, typically finishing around a minute maximum. You can further optimize the data loading by reading our guidelines on connections and threading.
1. load data to MotherDuck with Pandas, Polars, or PyArrow
When using a dataframe library you can load data to MotherDuck in a single transaction. DuckDB uses Apache Arrow as its internal data interchange format. This means PyArrow and Polars (which are Arrow-native) benefit from zero-copy data transfer, making them the most memory-efficient choice. Pandas with the default NumPy backend copies data during transfer, which doubles memory usage. If you use Pandas, consider using Arrow-backed dtypes (dtype_backend="pyarrow") to avoid the extra copy.
# Creating your table with PyArrow
import duckdb
import pyarrow as pa
data = {
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data)
con = duckdb.connect('md:')
con.sql('CREATE TABLE my_table AS SELECT * FROM arrow_table')
Batching data
When you have a large dataset, it's recommended you chunk your data and load it in batches. This will help you to avoid memory issues and make sure your transaction is not too large. This example uses PyArrow and DuckDB in a class to:
- Initialize a connection
- Create a database and table if they do not already exist
- Accept a PyArrow table to insert
- Insert the data in chunks
import duckdb
import os
import pyarrow as pa
import logging
# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
class ArrowTableLoadingBuffer:
def __init__(
self,
duckdb_schema: str,
pyarrow_schema: pa.Schema,
database_name: str,
table_name: str,
destination="local",
chunk_size: int = 100_000, # Default chunk size
):
self.duckdb_schema = duckdb_schema
self.pyarrow_schema = pyarrow_schema
self.database_name = database_name
self.table_name = table_name
self.total_inserted = 0
self.conn = self.initialize_connection(destination, duckdb_schema)
self.chunk_size = chunk_size
def initialize_connection(self, destination, sql):
if destination == "md":
logging.info("Connecting to MotherDuck...")
if not os.environ.get("motherduck_token"):
raise ValueError(
"MotherDuck token is required. Set the environment variable 'MOTHERDUCK_TOKEN'."
)
conn = duckdb.connect("md:")
logging.info(
f"Creating database {self.database_name} if it doesn't exist"
)
conn.execute(f"CREATE DATABASE IF NOT EXISTS {self.database_name}")
conn.execute(f"USE {self.database_name}")
else:
conn = duckdb.connect(database=f"{self.database_name}.db")
conn.execute(sql) # Execute schema setup on initialization
return conn
def insert(self, table: pa.Table):
total_rows = table.num_rows
for batch_start in range(0, total_rows, self.chunk_size):
batch_end = min(batch_start + self.chunk_size, total_rows)
chunk = table.slice(batch_start, batch_end - batch_start)
self.insert_chunk(chunk)
logging.info(f"Inserted chunk {batch_start} to {batch_end}")
self.total_inserted += total_rows
logging.info(f"Total inserted: {self.total_inserted} rows")
def insert_chunk(self, chunk: pa.Table):
self.conn.register("buffer_table", chunk)
insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
self.conn.execute(insert_query)
self.conn.unregister("buffer_table")
Using the above class, you can load your data in chunks.
import pyarrow as pa
# Define the explicit PyArrow schema
pyarrow_schema = pa.schema([
('id', pa.int32()),
('name', pa.string())
])
# Sample data to create a PyArrow table based on the schema
data = {
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data, schema=pyarrow_schema)
# Define the DuckDB schema as a DDL statement
duckdb_schema = "CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)"
# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
duckdb_schema=duckdb_schema,
pyarrow_schema=pyarrow_schema,
database_name="my_db", # The DuckDB database filename or MotherDuck database name
table_name="my_table", # The name of the table in DuckDB or MotherDuck
destination="md", # Set "md" for MotherDuck or "local" for a local DuckDB database
chunk_size=2 # Example chunk size for illustration
)
# Load the data
loader.insert(arrow_table)
Typing your dataset
When working with production pipeline, it's recommended to type your dataset to avoid any issues with inference. Pyarrow is our recommendation to type your dataset as it's the easiest way to type your dataset, especially for complex data types.
In the above example, the schema is defined explicitly on both the PyArrow table and the DuckDB schema.
# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
duckdb_schema=duckdb_schema, # prepare a DuckDB DDL statement
pyarrow_schema=pyarrow_schema, # define explictely your PyArrow schema
database_name="my_db",
table_name="my_table",
destination="md",
chunk_size=2
)
2. write to a temporary file and load with COPY
When you have a large dataset, another method is to write your data to temporary files and load it to MotherDuck using a COPY command. This also works great if you have existing data on a blob storage like AWS S3, Google Cloud Storage or Azure Blob Storage as you will benefit from cloud network speed.
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb
import os
# Step 1: Define the schema and create a large PyArrow table
schema = pa.schema([
('id', pa.int32()),
('name', pa.string())
])
# Example data - multiply the data to simulate a large dataset
data = {
'id': list(range(1, 1000001)), # Simulating 1 million rows
'name': ['Name_' + str(i) for i in range(1, 1000001)]
}
# Create the PyArrow table with the schema
large_table = pa.table(data, schema=schema)
# Step 2: Write the large PyArrow table to a Parquet file
parquet_file = "large_data.parquet"
pq.write_table(large_table, parquet_file)
# Step 3: Load the Parquet file into MotherDuck using the COPY command
conn = duckdb.connect("md:") # Connect to MotherDuck
conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)")
# Use the COPY command to load the Parquet file into MotherDuck
conn.execute(f"COPY my_table FROM '{os.path.abspath(parquet_file)}' (FORMAT 'parquet')")
print("Data successfully loaded into MotherDuck")
3. use executemany for small datasets
For small datasets (fewer than 500 rows), you can use the executemany method to insert data row by row in a single transaction. This approach is the slowest of the three options and should only be used when working with very small amounts of data.
import duckdb
# Sample data as a list of tuples
data = [
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eva')
]
con = duckdb.connect('md:')
con.execute('CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)')
con.executemany('INSERT INTO my_table VALUES (?, ?)', data)
print("Data successfully loaded into MotherDuck")
The executemany method sends individual INSERT statements, which is significantly slower than the dataframe or COPY approaches. Use Option 1 or Option 2 for datasets larger than a few hundred rows.