Loading Data Best Practices

When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.

Data Loading Methods and Their Trade-offs

File Format Considerations

The choice of file format significantly impacts loading performance:

Parquet (Recommended)

  • Compression: 5-10x better compression than CSV
  • Performance: 5-10x more throughput due to compression
  • Schema: Self-describing with embedded metadata
  • Use Case: Production data loading, large datasets

CSV

  • Compression: Minimal compression benefits
  • Performance: Slower loading, especially for large files
  • Schema: Requires type inference at load time or a manually specified schema
  • Use Case: Simple data exploration, small datasets

JSON

  • Compression: Moderate compression
  • Performance: Slower than Parquet due to parsing overhead
  • Schema: Flexible but requires careful type handling
  • Use Case: Semi-structured data, API responses
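
As a rough sketch, the snippet below loads the same dataset from Parquet and from CSV with the DuckDB Python client. The database name ("md:my_db") and file paths are placeholders, and authentication via the MOTHERDUCK_TOKEN environment variable is assumed.

```python
import duckdb

# Placeholder database name; MOTHERDUCK_TOKEN is assumed to be set in the
# environment for authentication.
con = duckdb.connect("md:my_db")

# Parquet: column names and types come from the file's embedded metadata,
# and the compressed files move far less data over the network.
con.execute("""
    CREATE TABLE trips AS
    SELECT * FROM read_parquet('trips/*.parquet')
""")

# CSV: types must be inferred (or declared explicitly), and the
# uncompressed text is slower to transfer and parse.
con.execute("""
    CREATE TABLE trips_from_csv AS
    SELECT * FROM read_csv('trips.csv', header = true)
""")
```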

Performance Optimization Strategies

Batch Size Optimization

MotherDuck can handle both batch operations and real-time streaming. For optimal insert performance, use the guidelines below.

The size of your data batches directly impacts performance and resource usage:

Optimal Batch Size: 1,000,000+ rows

  • DuckDB processes data in row groups of 122,880 rows
  • A 1.2M-row insert spans roughly 10 row groups, so it parallelizes across 10 threads automatically
  • 100k-row and 1M-row inserts perform roughly the same, because per-insert overhead, not row count, dominates at these sizes
  • Aim for batches of more than 1M rows to get the full benefit of parallelization
Tip: Load data in batches of at least 1M rows to leverage DuckDB's parallelization. Smaller batches (such as 100k rows) don't provide meaningful performance benefits and may actually be slower due to per-insert overhead; the sketch below shows one way to batch large inserts.
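
As one way to put this into practice, the sketch below streams a large Parquet file into MotherDuck in batches of roughly 1.2M rows, using the DuckDB Python client together with pyarrow. The database name, file name, and table name are placeholders.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder database and file names; MOTHERDUCK_TOKEN is assumed to be set.
con = duckdb.connect("md:my_db")

BATCH_ROWS = 1_200_000  # roughly 10 row groups of 122,880 rows each

reader = pq.ParquetFile("events.parquet")
for i, record_batch in enumerate(reader.iter_batches(batch_size=BATCH_ROWS)):
    batch_tbl = pa.Table.from_batches([record_batch])
    con.register("batch_tbl", batch_tbl)  # expose the Arrow data to DuckDB
    if i == 0:
        # Create the target table from the first batch.
        con.execute("CREATE TABLE events AS SELECT * FROM batch_tbl")
    else:
        con.execute("INSERT INTO events SELECT * FROM batch_tbl")
```

Each INSERT covers enough row groups for DuckDB to parallelize; inserting row by row or in small chunks would pay the fixed per-insert overhead many times over.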

Memory Management

Effective memory management is crucial for large data loads:

Data Type Optimization

  • Use explicit schemas to avoid type inference overhead (see the sketch after this list)
  • Choose appropriate data types (e.g., TIMESTAMP for dates)
  • Avoid unnecessary type conversions
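
For example, a CSV load with an explicit schema might look like the sketch below; the table, file, and column names are illustrative.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder database name

# Declaring column types up front skips DuckDB's type-inference pass and
# avoids surprises such as timestamps being read as VARCHAR.
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM read_csv(
        'orders.csv',
        header = true,
        columns = {
            'order_id':   'BIGINT',
            'ordered_at': 'TIMESTAMP',
            'amount':     'DECIMAL(12,2)'
        }
    )
""")
```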

Sorting Strategy

  • Sort data by frequently queried columns during loading
    • To re-sort existing tables, recreate them with CREATE OR REPLACE and an ORDER BY on the preferred columns (see the sketch after this list)
  • Improves query performance through better data locality
  • Consider the trade-off between loading speed and query performance
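
A minimal sketch of re-sorting an existing table, assuming a table named events with an event_time column:

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder database name

# Rewrite the table ordered by a frequently filtered column so related
# rows land in the same row groups (better data locality for queries).
con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM events
    ORDER BY event_time
""")
```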

Network and Location Considerations

Data Location

  • MotherDuck is hosted in AWS us-east-1
  • Locate source data in the same region for optimal performance
  • Consider network latency when loading from remote sources

Cloud Storage Integration

  • Direct integration with S3, GCS, and Azure Blob Storage
  • Loading from cloud storage takes advantage of high cloud-to-cloud network bandwidth (see the sketch after this list)
  • Reduces local storage requirements
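
A sketch of loading Parquet files directly from S3 is shown below; the bucket path and credential values are placeholders, and the CREATE SECRET syntax assumes a recent DuckDB version (0.10 or later).

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder database name

# Register S3 credentials as a DuckDB secret; the key values here are
# placeholders. In practice, prefer a credential provider over literals.
con.execute("""
    CREATE SECRET (
        TYPE S3,
        KEY_ID 'AKIA_PLACEHOLDER',
        SECRET 'SECRET_PLACEHOLDER',
        REGION 'us-east-1'
    )
""")

# Read the Parquet files straight from S3 into a MotherDuck table.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')
""")
```

Keeping the bucket in us-east-1, the same region as MotherDuck, avoids cross-region transfer and latency.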

Instance Sizing

Instance Selection

For datasets under 100 GB, use a Jumbo instance to load the data. For larger datasets, use Mega or Giga.

Summary

The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:

  1. Batch inserts of at least 1,000,000 rows for the fastest performance.
  2. If you control how source files are written, use Parquet for its compression and speed.
  3. Write data to S3 (or other cloud storage) for fast reads.
  4. Use larger instance types (Jumbo or bigger) when loading larger datasets.

By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.