Skip to main content

Loading data best practices

When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.

File format considerations

The choice of file format significantly impacts loading performance:

Parquet (recommended)CSVJSON
Compression5-10x better than CSVMinimalModerate
Performance5-10x more throughputSlower, especially for large filesSlower than Parquet due to parsing overhead
SchemaSelf-describing with embedded metadataRequires type inference or specificationFlexible but requires careful type handling. DuckDB scans data to discover the schema before running the query, which can add significant time for large or deeply nested files (see tips for loading JSON)
Best forProduction data loading, large datasetsSimple data exploration, small datasetsSemi-structured data, API responses

Avoid single-row INSERTs

A common mistake is inserting data one row at a time using repeated INSERT INTO ... VALUES (...) statements. This pattern is significantly slower than bulk loading because each individual INSERT statement incurs network round-trip overhead to MotherDuck and prevents DuckDB from parallelizing the work.

tip

Do not use single-row INSERT INTO ... VALUES statements to load data into MotherDuck. Instead, use bulk approaches like INSERT INTO ... SELECT from files, COPY, or load data from DataFrames. See Loading data into MotherDuck for recommended methods.

If you're working with a client library (Python, Node.js, Java), avoid looping over rows and calling execute("INSERT INTO ...") for each one. Methods like executemany also send individual INSERT statements under the hood and are equally slow. Instead, write your data to a file (Parquet or CSV) and load it with COPY or INSERT INTO ... SELECT, or use a DataFrame-based approach where available.

Performance optimization strategies

Batch size

DuckDB internally processes data in row groups of ~122,000 rows and parallelizes work across multiple row groups. This means batch size affects both memory usage and throughput:

Batch sizeWhat happens
1-100 rows (single-row INSERTs)Each statement has network and transaction overhead. Very slow — avoid this pattern entirely.
100K rowsFits in roughly one row group. Already a bulk operation and orders of magnitude faster than row-by-row. Good default chunk size when streaming from Python to manage memory.
1M+ rowsSpans multiple row groups, so DuckDB parallelizes across threads. Best throughput for large loads.
tip

When streaming data from a client library, load in chunks of at least 100K rows to keep memory manageable while staying well above row-by-row overhead. For maximum throughput on large datasets, aim for 1M+ rows per load operation to fully leverage DuckDB's parallelization.

Keep individual transactions under roughly one minute. If you have tens of millions of rows, break them into multiple loads rather than one very large transaction.

Memory management

Effective memory management is crucial for large data loads:

Data Type Optimization

  • Use explicit schemas to avoid type inference overhead — this is especially important for JSON, where schema discovery can add minutes for large or deeply nested files
  • Choose appropriate data types (for example, TIMESTAMP for dates)
  • Avoid unnecessary type conversions

Sorting Strategy

  • Sort data by frequently queried columns during loading
    • To re-sort existing tables, use CREATE OR REPLACE with the preferred sorting method
  • Improves query performance through better data locality
  • Consider the trade-off between loading speed and query performance

Network and location considerations

Data Location

  • MotherDuck is available on AWS in three regions: US East (N. Virginia) - us-east-1, US West (Oregon) - us-west-2, and Europe (Frankfurt) - eu-central-1
  • For optimal performance, consider locating source data in the same region as your MotherDuck Organization
  • Consider network latency when loading from remote sources

Cloud Storage Integration

  • Direct integration with S3, R2, GCS, Azure Blob Storage
  • Use cloud storage to leverage network speeds for better performance
  • Reduces local storage requirements
  • Consider setting force_download=true when querying files stored in remote storage to accelerate response times. This could be useful in scenarios where it makes sense to download the full file upfront instead of making many small requests.

Duckling sizing

Duckling Selection

For data sets under 100 GB in size, use Jumbo Ducklings to load the data. For larger data sizes, use Mega or Giga.

Summary

The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:

  1. Bulk loading with at least 100K rows per chunk, and 1M+ for maximum throughput.
  2. If you can control how they are written from sources, use Parquet for compression and speed
  3. Write data into S3 for speedy reads.
  4. Use larger Duckling sizes (Jumbo or bigger) for loading bigger data sets.

By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.