Loading data best practices
When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.
File format considerations
The choice of file format significantly impacts loading performance:
| Parquet (recommended) | CSV | JSON | |
|---|---|---|---|
| Compression | 5-10x better than CSV | Minimal | Moderate |
| Performance | 5-10x more throughput | Slower, especially for large files | Slower than Parquet due to parsing overhead |
| Schema | Self-describing with embedded metadata | Requires type inference or specification | Flexible but requires careful type handling. DuckDB scans data to discover the schema before running the query, which can add significant time for large or deeply nested files (see tips for loading JSON) |
| Best for | Production data loading, large datasets | Simple data exploration, small datasets | Semi-structured data, API responses |
Avoid single-row INSERTs
A common mistake is inserting data one row at a time using repeated INSERT INTO ... VALUES (...) statements. This pattern is significantly slower than bulk loading because each individual INSERT statement incurs network round-trip overhead to MotherDuck and prevents DuckDB from parallelizing the work.
Do not use single-row INSERT INTO ... VALUES statements to load data into MotherDuck. Instead, use bulk approaches like INSERT INTO ... SELECT from files, COPY, or load data from DataFrames. See Loading data into MotherDuck for recommended methods.
If you're working with a client library (Python, Node.js, Java), avoid looping over rows and calling execute("INSERT INTO ...") for each one. Methods like executemany also send individual INSERT statements under the hood and are equally slow. Instead, write your data to a file (Parquet or CSV) and load it with COPY or INSERT INTO ... SELECT, or use a DataFrame-based approach where available.
Performance optimization strategies
Batch size
DuckDB internally processes data in row groups of ~122,000 rows and parallelizes work across multiple row groups. This means batch size affects both memory usage and throughput:
| Batch size | What happens |
|---|---|
| 1-100 rows (single-row INSERTs) | Each statement has network and transaction overhead. Very slow — avoid this pattern entirely. |
| 100K rows | Fits in roughly one row group. Already a bulk operation and orders of magnitude faster than row-by-row. Good default chunk size when streaming from Python to manage memory. |
| 1M+ rows | Spans multiple row groups, so DuckDB parallelizes across threads. Best throughput for large loads. |
When streaming data from a client library, load in chunks of at least 100K rows to keep memory manageable while staying well above row-by-row overhead. For maximum throughput on large datasets, aim for 1M+ rows per load operation to fully leverage DuckDB's parallelization.
Keep individual transactions under roughly one minute. If you have tens of millions of rows, break them into multiple loads rather than one very large transaction.
Memory management
Effective memory management is crucial for large data loads:
Data Type Optimization
- Use explicit schemas to avoid type inference overhead — this is especially important for JSON, where schema discovery can add minutes for large or deeply nested files
- Choose appropriate data types (for example, TIMESTAMP for dates)
- Avoid unnecessary type conversions
Sorting Strategy
- Sort data by frequently queried columns during loading
- To re-sort existing tables, use
CREATE OR REPLACEwith the preferred sorting method
- To re-sort existing tables, use
- Improves query performance through better data locality
- Consider the trade-off between loading speed and query performance
Network and location considerations
Data Location
- MotherDuck is available on AWS in three regions: US East (N. Virginia) -
us-east-1, US West (Oregon) -us-west-2, and Europe (Frankfurt) -eu-central-1 - For optimal performance, consider locating source data in the same region as your MotherDuck Organization
- Consider network latency when loading from remote sources
Cloud Storage Integration
- Direct integration with S3, R2, GCS, Azure Blob Storage
- Use cloud storage to leverage network speeds for better performance
- Reduces local storage requirements
- Consider setting force_download=true when querying files stored in remote storage to accelerate response times. This could be useful in scenarios where it makes sense to download the full file upfront instead of making many small requests.
Duckling sizing
Duckling Selection
For data sets under 100 GB in size, use Jumbo Ducklings to load the data. For larger data sizes, use Mega or Giga.
Summary
The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:
- Bulk loading with at least 100K rows per chunk, and 1M+ for maximum throughput.
- If you can control how they are written from sources, use Parquet for compression and speed
- Write data into S3 for speedy reads.
- Use larger Duckling sizes (Jumbo or bigger) for loading bigger data sets.
By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.