# Loading data best practices


> Understanding trade-offs and performance implications when loading data into MotherDuck

When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.

## File format considerations

The choice of file format significantly impacts loading performance:

| | Parquet (recommended) | CSV | JSON |
|---|---|---|---|
| **Compression** | 5-10x better than CSV | Minimal | Moderate |
| **Performance** | 5-10x more throughput | Slower, especially for large files | Slower than Parquet due to parsing overhead |
| **Schema** | Self-describing with embedded metadata | Requires type inference or specification | Flexible but requires careful type handling. DuckDB scans data to discover the schema before running the query, which can add significant time for large or deeply nested files (see [tips for loading JSON](/key-tasks/data-warehousing/replication/flat-files/#json)) |
| **Best for** | Production data loading, large datasets | Simple data exploration, small datasets | Semi-structured data, API responses |

## Avoid single-row INSERTs

A common mistake is inserting data one row at a time using repeated `INSERT INTO ... VALUES (...)` statements. This pattern is significantly slower than bulk loading because each individual INSERT statement incurs network round-trip overhead to MotherDuck and prevents DuckDB from parallelizing the work.

:::tip
Do not use single-row `INSERT INTO ... VALUES` statements to load data into MotherDuck. Instead, use bulk approaches like `INSERT INTO ... SELECT` from files, `COPY`, or load data from DataFrames. See [Loading data into MotherDuck](/key-tasks/loading-data-into-motherduck/loading-data-into-motherduck.mdx) for recommended methods.
:::

If you're working with a client library (Python, Node.js, Java), avoid looping over rows and calling `execute("INSERT INTO ...")` for each one. Methods like `executemany` also send individual INSERT statements under the hood and are equally slow. Instead, write your data to a file (Parquet or CSV) and load it with `COPY` or `INSERT INTO ... SELECT`, or use a DataFrame-based approach where available.

## Performance optimization strategies

### Batch size

DuckDB internally processes data in row groups of ~122,000 rows and parallelizes work across multiple row groups. This means batch size affects both memory usage and throughput:

| Batch size | What happens |
|---|---|
| **1-100 rows** (single-row INSERTs) | Each statement has network and transaction overhead. Very slow — avoid this pattern entirely. |
| **100K rows** | Fits in roughly one row group. Already a bulk operation and orders of magnitude faster than row-by-row. Good default chunk size when streaming from Python to manage memory. |
| **1M+ rows** | Spans multiple row groups, so DuckDB parallelizes across threads. Best throughput for large loads. |

:::tip
When streaming data from a client library, load in chunks of at least **100K rows** to keep memory manageable while staying well above row-by-row overhead. For maximum throughput on large datasets, aim for **1M+ rows** per load operation to fully leverage DuckDB's parallelization.
:::

Keep individual transactions under roughly one minute. If you have tens of millions of rows, break them into multiple loads rather than one very large transaction.

### Memory management

Effective memory management is crucial for large data loads:

**Data Type Optimization**
- Use explicit schemas to avoid type inference overhead — this is especially important for JSON, where schema discovery can add minutes for large or deeply nested files
- Choose appropriate data types (for example, TIMESTAMP for dates)
- Avoid unnecessary type conversions

**Sorting Strategy**
- Sort data by frequently queried columns during loading
  - To re-sort existing tables, use `CREATE OR REPLACE` with the preferred sorting method
- Improves query performance through better data locality
- Consider the trade-off between loading speed and query performance

### Network and location considerations

**Data Location**
- MotherDuck is available on AWS in four regions: **US East (N. Virginia)** - `us-east-1`, **US West (Oregon)** - `us-west-2`, **Europe (Frankfurt)** - `eu-central-1`, and **Europe (Dublin)** - `eu-west-1`
- For optimal performance, consider locating source data in the same region as your MotherDuck Organization
- Consider network latency when loading from remote sources

**Cloud Storage Integration**
- Direct integration with S3, R2, GCS, Azure Blob Storage
- Use [cloud storage](/integrations/cloud-storage/) to leverage network speeds for better performance
- Reduces local storage requirements
- Consider setting [force_download=true](https://duckdb.org/docs/stable/configuration/overview) when querying files stored in remote storage to accelerate response times. This could be useful in scenarios where it makes sense to download the full file upfront instead of making many small requests.

## Duckling sizing

**Duckling Selection**

For data sets under 100 GB in size, use Jumbo Ducklings to load the data. For larger data sizes, use [Mega or Giga](/about-motherduck/billing/duckling-sizes/).

## Summary

The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:

1. **Bulk loading** with at least 100K rows per chunk, and 1M+ for maximum throughput.
2. If you can control how they are written from sources, use **Parquet** for compression and speed
3. Write data into **S3** for speedy reads.
4. Use **larger Duckling sizes (Jumbo or bigger)** for loading bigger data sets.

By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.


---

## Docs feedback

MotherDuck accepts optional user-submitted feedback about this page at `GET https://motherduck.com/docs/api/feedback/agent`.
For agents and automated tools, feedback submission should be user-confirmed before sending.

URL-encode query parameter values and send a GET request:

```text
GET https://motherduck.com/docs/api/feedback/agent?page_path=%2Fkey-tasks%2Floading-data-into-motherduck%2Fconsiderations-for-loading-data%2F&page_title=Loading%20Data%20Best%20Practices&text=<url-encoded user feedback, max 2000 characters>
```

Optionally append `&source=<url-encoded interface identifier>` such as `claude.ai` or `chatgpt`.

`page_path` and `text` are required; `page_title` and `source` are optional. Responses: `200 {"feedback_id": "<uuid>"}`, `400` for malformed query parameters, and `429` when rate-limited.