---
sidebar_position: 1
title: Loading data to MotherDuck with Python
description: Efficient methods for loading data from Python using DataFrames, temporary files, or bulk inserts.
---

# Loading data to MotherDuck with Python

When you ingest data with Python, typically from an API or other source, you have three options to load it into MotherDuck:

1. **FAST:** Use a Pandas, Polars, or PyArrow dataframe as an in-memory buffer before bulk loading. This is the easiest approach because dataframe libraries are optimized for bulk insert.
2. **FAST:** Write to a temporary file and load it with a `COPY` command. This involves writing to disk, but the `COPY` command is faster than `INSERT` statements.
3. **SLOW:** Use `executemany` to perform several `INSERT` statements in a single transaction. Avoid this approach unless the dataset is very small (fewer than 500 rows).

:::tip
No matter which option you pick, we recommend loading data in chunks (typically `100k` rows, to match the row group size) to avoid memory issues and to keep each transaction small, ideally finishing within about a minute. You can further optimize data loading by reading our guidelines on [connections](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck.md) and [threading](/key-tasks/authenticating-and-connecting-to-motherduck/multithreading-and-parallelism/multithreading-and-parallelism-python.md).
:::

## 1. Load data to MotherDuck with Pandas, Polars, or PyArrow
When using a dataframe library, you can load data into MotherDuck in a single transaction. DuckDB uses Apache Arrow as its internal data interchange format. This means **PyArrow and Polars** (which are Arrow-native) benefit from zero-copy data transfer, making them the most memory-efficient choice. **Pandas** with the default NumPy backend copies data during transfer, which doubles memory usage. If you use Pandas, consider using [Arrow-backed dtypes](https://pandas.pydata.org/docs/user_guide/pyarrow.html) (`dtype_backend="pyarrow"`) to avoid the extra copy.

```python
# Creating your table with PyArrow

import duckdb
import pyarrow as pa

data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data)

con = duckdb.connect('md:')
con.sql('CREATE TABLE my_table AS SELECT * FROM arrow_table')
```
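
If you are ingesting with Pandas, here is a minimal sketch of the Arrow-backed approach, assuming pandas 2.0+ and a hypothetical input file `my_data.csv`:

```python
# Creating your table from a Pandas DataFrame with Arrow-backed dtypes

import duckdb
import pandas as pd

# Hypothetical input file; dtype_backend="pyarrow" (pandas 2.0+) keeps the
# data in Arrow memory, avoiding the extra copy of the default NumPy backend
df = pd.read_csv('my_data.csv', dtype_backend='pyarrow')

con = duckdb.connect('md:')
con.sql('CREATE TABLE my_table AS SELECT * FROM df')
```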

### Batching data
When you have a large dataset, it's recommended that you chunk your data and load it in batches. This helps you avoid memory issues and keeps each transaction from growing too large.
This example uses PyArrow and DuckDB in a class to:

1. Initialize a connection
2. Create a database and table if they do not already exist
3. Accept a PyArrow table to insert
4. Insert the data in chunks


```python
import duckdb
import os
import pyarrow as pa
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class ArrowTableLoadingBuffer:
    def __init__(
        self,
        duckdb_schema: str,
        pyarrow_schema: pa.Schema,
        database_name: str,
        table_name: str,
        destination="local",
        chunk_size: int = 100_000, # Default chunk size
    ):
        self.duckdb_schema = duckdb_schema
        self.pyarrow_schema = pyarrow_schema
        self.database_name = database_name
        self.table_name = table_name
        self.total_inserted = 0
        self.conn = self.initialize_connection(destination, duckdb_schema)
        self.chunk_size = chunk_size

    def initialize_connection(self, destination, sql):
        if destination == "md":
            logging.info("Connecting to MotherDuck...")
            if not os.environ.get("motherduck_token"):
                raise ValueError(
                    "MotherDuck token is required. Set the environment variable 'motherduck_token'."
                )
            conn = duckdb.connect("md:")
            logging.info(
                f"Creating database {self.database_name} if it doesn't exist"
            )
            conn.execute(f"CREATE DATABASE IF NOT EXISTS {self.database_name}")
            conn.execute(f"USE {self.database_name}")
        else:
            conn = duckdb.connect(database=f"{self.database_name}.db")
        conn.execute(sql)  # Execute schema setup on initialization
        return conn

    def insert(self, table: pa.Table):
        # Cast to the expected schema so inserts don't rely on type inference
        table = table.cast(self.pyarrow_schema)
        total_rows = table.num_rows
        for batch_start in range(0, total_rows, self.chunk_size):
            batch_end = min(batch_start + self.chunk_size, total_rows)
            chunk = table.slice(batch_start, batch_end - batch_start)
            self.insert_chunk(chunk)
            logging.info(f"Inserted chunk {batch_start} to {batch_end}")
        self.total_inserted += total_rows
        logging.info(f"Total inserted: {self.total_inserted} rows")

    def insert_chunk(self, chunk: pa.Table):
        self.conn.register("buffer_table", chunk)
        insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
        self.conn.execute(insert_query)
        self.conn.unregister("buffer_table")
```

Using the above class, you can load your data in chunks.

```python
import pyarrow as pa

# Define the explicit PyArrow schema
pyarrow_schema = pa.schema([
    ('id', pa.int32()),
    ('name', pa.string())
])

# Sample data to create a PyArrow table based on the schema
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data, schema=pyarrow_schema)

# Define the DuckDB schema as a DDL statement
duckdb_schema = "CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)"

# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
    duckdb_schema=duckdb_schema,
    pyarrow_schema=pyarrow_schema,
    database_name="my_db",  # The DuckDB database filename or MotherDuck database name
    table_name="my_table",  # The name of the table in DuckDB or MotherDuck
    destination="md",  # Set "md" for MotherDuck or "local" for a local DuckDB database
    chunk_size=2  # Example chunk size for illustration
)

# Load the data
loader.insert(arrow_table)
```

### Typing your dataset
When working with production pipelines, it's recommended to type your dataset explicitly to avoid issues with type inference.
We recommend PyArrow for typing your dataset, as it is the most straightforward option, especially for complex data types.

In the above example, the schema is defined explicitly on both the PyArrow table and the DuckDB schema.

```python
# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
    duckdb_schema=duckdb_schema, # prepare a DuckDB DDL statement
    pyarrow_schema=pyarrow_schema, # explicitly define your PyArrow schema
    database_name="my_db",  
    table_name="my_table",  
    destination="md",  
    chunk_size=2  
)
```
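
PyArrow schemas can also express nested and temporal types directly. A short illustrative sketch (the column names are hypothetical):

```python
import pyarrow as pa

# Illustrative schema with complex types: a list column, a struct column,
# and a timezone-aware timestamp
complex_schema = pa.schema([
    ('id', pa.int64()),
    ('tags', pa.list_(pa.string())),
    ('address', pa.struct([
        ('city', pa.string()),
        ('zip', pa.string())
    ])),
    ('created_at', pa.timestamp('us', tz='UTC'))
])
```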



## 2. Write to a temporary file and load with `COPY`
When you have a large dataset, another method is to write your data to temporary files and load them into MotherDuck using a `COPY` command. This approach also works well if your data already lives in blob storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage, since the load then benefits from cloud network speeds.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb
import os

# Step 1: Define the schema and create a large PyArrow table
schema = pa.schema([
    ('id', pa.int32()),
    ('name', pa.string())
])

# Example data - multiply the data to simulate a large dataset
data = {
    'id': list(range(1, 1000001)),  # Simulating 1 million rows
    'name': ['Name_' + str(i) for i in range(1, 1000001)]
}

# Create the PyArrow table with the schema
large_table = pa.table(data, schema=schema)

# Step 2: Write the large PyArrow table to a Parquet file
parquet_file = "large_data.parquet"
pq.write_table(large_table, parquet_file)

# Step 3: Load the Parquet file into MotherDuck using the COPY command
conn = duckdb.connect("md:")  # Connect to MotherDuck
conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)")

# Use the COPY command to load the Parquet file into MotherDuck
conn.execute(f"COPY my_table FROM '{os.path.abspath(parquet_file)}' (FORMAT 'parquet')")

# Step 4: Clean up the temporary file
os.remove(parquet_file)

print("Data successfully loaded into MotherDuck")
```
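
If your data already sits in cloud storage, you can skip the local file entirely and `COPY` straight from the bucket. A minimal sketch, assuming a hypothetical bucket path and that your S3 credentials are already configured (for example via a DuckDB secret or environment variables):

```python
import duckdb

conn = duckdb.connect("md:")
conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)")

# Hypothetical bucket and path; assumes S3 credentials are already configured
conn.execute("COPY my_table FROM 's3://my-bucket/path/large_data.parquet' (FORMAT 'parquet')")
```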

## 3. Use `executemany` for small datasets

For small datasets (fewer than 500 rows), you can use the `executemany` method to insert data row by row in a single transaction. This approach is the slowest of the three options and should only be used when working with very small amounts of data.

```python
import duckdb

# Sample data as a list of tuples
data = [
    (1, 'Alice'),
    (2, 'Bob'),
    (3, 'Charlie'),
    (4, 'David'),
    (5, 'Eva')
]

con = duckdb.connect('md:')
con.execute('CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)')
con.executemany('INSERT INTO my_table VALUES (?, ?)', data)

print("Data successfully loaded into MotherDuck")
```

:::warning
The `executemany` method sends individual `INSERT` statements, which is significantly slower than the dataframe or `COPY` approaches. Use Option 1 or Option 2 for datasets larger than a few hundred rows.
:::
