Loading Data Best Practices
When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.
Data Loading Methods and Their Trade-offs
File Format Considerations
The choice of file format significantly impacts loading performance:
Parquet (Recommended)
- Compression: 5-10x better compression than CSV
- Performance: 5-10x more throughput due to compression
- Schema: Self-describing with embedded metadata
- Use Case: Production data loading, large datasets
CSV
- Compression: Minimal compression benefits
- Performance: Slower loading, especially for large files
- Schema: Requires manual type inference or specification
- Use Case: Simple data exploration, small datasets
JSON
- Compression: Moderate compression
- Performance: Slower than Parquet due to parsing overhead
- Schema: Flexible but requires careful type handling
- Use Case: Semi-structured data, API responses
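As a rough illustration, the sketch below loads the same hypothetical dataset from Parquet and from CSV; the bucket path and table names are placeholders, not part of MotherDuck's documentation. The Parquet path carries its own schema and compressed column data, while the CSV path relies on type inference and transfers more bytes.

```sql
-- Parquet: schema and types come from the file's own metadata,
-- and column data is already compressed.
CREATE TABLE events AS
    SELECT * FROM read_parquet('s3://example-bucket/events/*.parquet');

-- CSV: types must be inferred (or declared explicitly), and far more
-- bytes typically travel over the network for the same rows.
CREATE TABLE events_from_csv AS
    SELECT * FROM read_csv('s3://example-bucket/events.csv');
```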
Performance Optimization Strategies
Batch Size Optimization
MotherDuck can handle both batch operations and real-time streaming. For optimal insert performance, follow the guidance below.
The size of your data batches directly impacts performance and resource usage:
Optimal Batch Size: 1,000,000+ rows
- DuckDB writes data in row groups of 122,880 rows
- A 1.2M row insert splits into roughly 10 row groups and can be written by up to 10 threads in parallel
- A 100k row insert and a 1M row insert often take roughly the same amount of time: fixed per-statement overhead dominates, and the larger batch is absorbed by parallel row-group writes
- For optimal performance, batch more than 1M rows per insert
Load data in batches of at least 1M rows to leverage DuckDB's parallelization. Smaller batches (like 100k rows) don't provide meaningful performance benefits and may actually be slower due to overhead.
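A minimal sketch of what this looks like in practice, assuming a staging area of Parquet files in S3 (the schema, table, and path are hypothetical): one large INSERT ... SELECT lets DuckDB split the work across row groups, whereas a loop of small inserts pays per-statement overhead with no parallelism.

```sql
-- One large insert (~1.2M rows) splits into ~10 row groups of
-- 122,880 rows each and can be written in parallel.
INSERT INTO analytics.events
    SELECT * FROM read_parquet('s3://example-bucket/staging/2024-06-01/*.parquet');

-- Anti-pattern: repeated small inserts, e.g.
--   INSERT INTO analytics.events VALUES (...);  -- called thousands of times
-- Each call pays round-trip and transaction overhead and never fills
-- a row group, so the work cannot parallelize.
```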
Memory Management
Effective memory management is crucial for large data loads:
Data Type Optimization
- Use explicit schemas to avoid type inference overhead
- Choose appropriate data types (e.g., TIMESTAMP for dates)
- Avoid unnecessary type conversions
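For example, declaring the schema up front when loading a CSV skips type sniffing and guarantees the intended types. The table, columns, and path below are illustrative only.

```sql
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount      DECIMAL(12, 2)
);

-- Passing explicit column types to read_csv avoids inference overhead
-- and prevents later casts (e.g. VARCHAR to TIMESTAMP).
INSERT INTO orders
    SELECT * FROM read_csv(
        's3://example-bucket/orders.csv',
        header = true,
        columns = {
            'order_id': 'BIGINT',
            'customer_id': 'BIGINT',
            'order_ts': 'TIMESTAMP',
            'amount': 'DECIMAL(12,2)'
        }
    );
```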
Sorting Strategy
- Sort data by frequently queried columns during loading
- To re-sort existing tables, use CREATE OR REPLACE with the preferred sort order (see the sketch after this list)
- Improves query performance through better data locality
- Consider the trade-off between loading speed and query performance
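A minimal re-sort sketch, assuming a hypothetical events table that is usually filtered by date and customer:

```sql
-- Rewrite the table sorted by the columns queries filter on most.
-- Better data locality lets row-group metadata skip irrelevant data
-- at query time.
CREATE OR REPLACE TABLE events AS
    SELECT * FROM events
    ORDER BY event_date, customer_id;
```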
Network and Location Considerations
Data Location
- MotherDuck is hosted in AWS us-east-1; locate source data in the same region for optimal performance
- Consider network latency when loading from remote sources
Cloud Storage Integration
- Direct integration with S3, GCS, Azure Blob Storage
- Loading from cloud object storage keeps transfers cloud-to-cloud, which is typically much faster than uploading from a local machine (see the sketch after this list)
- Reduces local storage requirements
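As an illustration, DuckDB's secrets mechanism plus a direct read from the bucket keeps the transfer in the cloud; the secret name, credential values, and paths below are placeholders.

```sql
-- Register S3 credentials once (values here are placeholders).
CREATE SECRET example_s3 (
    TYPE S3,
    KEY_ID 'my-key-id',
    SECRET 'my-secret',
    REGION 'us-east-1'
);

-- Read straight from the bucket; bytes flow from S3 to MotherDuck
-- rather than through your local machine.
CREATE TABLE trips AS
    SELECT * FROM read_parquet('s3://example-bucket/trips/*.parquet');
```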
Instance Sizing
Instance Selection
For data sets under 100 GB, use a Jumbo instance to load the data. For larger data sets, use Mega or Giga.
Summary
The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:
- Use batch inserts of at least 1,000,000 rows for the fastest performance.
- If you control how source files are written, use Parquet for its compression and speed.
- Stage data in cloud object storage such as S3 for fast reads.
- Use larger instance types (Jumbo or bigger) when loading bigger data sets.
By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.