Considerations for Loading Data
When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.
Data Loading Methods and Their Trade-offs
Batch vs. Streaming Approaches
MotherDuck is optimized for batch operations rather than real-time streaming. This fundamental design choice affects how you should approach data loading:
Batch Loading (Recommended)
- Performance: Significantly faster throughput when loading large datasets
- Cost: More efficient use of compute resources and network bandwidth
- Reliability: Better error handling and recovery mechanisms
- Use Case: Data warehousing, analytics, reporting
Streaming/Real-time
- Performance: Suboptimal for small, frequent writes
- Cost: Higher overhead per operation
- Reliability: More complex error handling
- Use Case: Consider using queues or buffering systems in front of MotherDuck
For streaming workloads, place a queue or buffer in front of MotherDuck and batch writes to tables. While this introduces some latency, the throughput improvement far outweighs the overhead of many small writes; a sketch of this pattern follows.
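As a minimal sketch, assume incoming events are buffered into Parquet files under an illustrative staging path and flushed on a schedule:

```sql
-- Flush one buffered micro-batch into the target table (staging path is hypothetical).
-- Running this every few minutes turns many small writes into one large,
-- parallelizable insert.
INSERT INTO analytics.events
SELECT * FROM read_parquet('s3://my-staging-bucket/events/batch_*.parquet');
```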
File Format Considerations
The choice of file format significantly impacts loading performance, as the loading examples after these summaries illustrate:
Parquet (Recommended)
- Compression: 5-10x better compression than CSV
- Performance: 5-10x more throughput due to compression
- Schema: Self-describing with embedded metadata
- Use Case: Production data loading, large datasets
CSV
- Compression: Minimal compression benefits
- Performance: Slower loading, especially for large files
- Schema: Requires manual type inference or specification
- Use Case: Simple data exploration, small datasets
JSON
- Compression: Moderate compression
- Performance: Slower than Parquet due to parsing overhead
- Schema: Flexible but requires careful type handling
- Use Case: Semi-structured data, API responses
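For illustration, each format can be read directly with DuckDB's reader functions; the file paths and column names below are placeholders:

```sql
-- Parquet: schema and types come from the file's embedded metadata.
CREATE TABLE trips AS
SELECT * FROM read_parquet('s3://my-bucket/trips/*.parquet');

-- CSV: an explicit schema avoids type-inference surprises on large files.
CREATE TABLE trips_csv AS
SELECT * FROM read_csv('trips.csv',
    columns = {'trip_id': 'BIGINT', 'pickup_at': 'TIMESTAMP', 'fare': 'DOUBLE'});

-- JSON: types are inferred from the documents; verify them after loading.
CREATE TABLE api_events AS
SELECT * FROM read_json_auto('events.json');
```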
Performance Optimization Strategies
Batch Size Optimization
The size of your data batches directly impacts performance and resource usage:
Optimal Batch Size: 1,000,000+ rows
- DuckDB writes data in row groups of 122,880 rows
- A 1.2M row insert spans roughly 10 row groups and can parallelize across up to 10 threads automatically
- 100k and 1M row inserts often take roughly the same time, because fixed per-batch overhead dominates at those sizes
- Aim for batches of more than 1M rows to get meaningful parallelization gains
Too Small (< 1,000,000 rows)
- Higher overhead per batch due to parallelization setup
- Increased network round trips
- Suboptimal resource utilization
- 100k row batches don't provide meaningful performance benefits
Very Large (> 1,000,000,000 rows)
- Risk of memory issues
- Longer transaction times
- Potential timeout issues
- Consider breaking into multiple 100M row batches
Load data in batches of at least 1M rows to leverage DuckDB's parallelization. Smaller batches (such as 100k rows) don't provide meaningful performance benefits and may actually be slower due to overhead. For very large loads, split the data into several large chunks, as sketched below.
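One way to keep each statement in this range is to load by natural partition; the hive-style monthly layout below is illustrative:

```sql
-- Each statement loads one large partition: big enough to parallelize well,
-- small enough to avoid memory pressure and long-running transactions.
INSERT INTO analytics.events
SELECT * FROM read_parquet('s3://my-bucket/events/month=2024-01/*.parquet');

INSERT INTO analytics.events
SELECT * FROM read_parquet('s3://my-bucket/events/month=2024-02/*.parquet');
```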
Memory Management
Effective memory management is crucial for large data loads:
Connection Pooling
- Reuse connections to reduce overhead
- Implement connection recycling (every 15-20 minutes)
- Use appropriate pool sizes based on concurrency needs
Data Type Optimization
- Use explicit schemas to avoid type inference overhead
- Choose appropriate data types (e.g., TIMESTAMP for dates)
- Avoid unnecessary type conversions; where possible, cast once at load time (see the sketch below)
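A minimal sketch of casting during the load; the table and column names are illustrative:

```sql
-- Cast once while loading instead of converting repeatedly at query time.
CREATE TABLE orders AS
SELECT
    order_id::BIGINT              AS order_id,
    CAST(ordered_at AS TIMESTAMP) AS ordered_at,
    amount::DECIMAL(12, 2)        AS amount
FROM read_csv('orders.csv');
```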
Sorting Strategy
- Sort data by frequently queried columns during loading
- To re-sort existing tables, use `CREATE OR REPLACE` with the preferred sort order (see the example below)
- Improves query performance through better data locality
- Consider the trade-off between loading speed and query performance
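For example, assuming a table named events that is usually filtered by event_time:

```sql
-- Rewrite the table sorted by a frequently filtered column for better data locality.
CREATE OR REPLACE TABLE events AS
SELECT * FROM events ORDER BY event_time;
```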
Network and Location Considerations
Data Location
- MotherDuck is hosted in AWS `us-east-1`
- Locate source data in the same region for optimal performance
- Consider network latency when loading from remote sources
Cloud Storage Integration
- Direct integration with S3, GCS, and Azure Blob Storage (see the S3 example below)
- Leverages cloud network speeds for better performance
- Reduces local storage requirements
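A minimal sketch using DuckDB's secrets manager to read from S3; the bucket name and credentials are placeholders:

```sql
-- Register S3 credentials once, then query the bucket directly.
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'my-access-key-id',
    SECRET 'my-secret-access-key',
    REGION 'us-east-1'
);

CREATE TABLE raw_events AS
SELECT * FROM read_parquet('s3://my-bucket/raw/events/*.parquet');
```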
Cost Implications
Compute Costs
Instance Selection
- Pulse: Not advised for data loading
- Standard: Good for most data loading workloads
- Jumbo: Better for large, complex transformations during loading
- Consider the 60-second cooldown period when planning batch sizes
Query Optimization
- Monitor query history during large loads
- Use `EXPLAIN` and/or `EXPLAIN ANALYZE` to understand execution plans (example below)
- Optimize to minimize compute time
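For example, to see how a load statement executes end to end (paths are illustrative):

```sql
-- EXPLAIN ANALYZE runs the statement and reports per-operator timings.
EXPLAIN ANALYZE
INSERT INTO analytics.events
SELECT * FROM read_parquet('s3://my-bucket/events/month=2024-01/*.parquet');
```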
Reliability and Error Handling
Transaction Management
Transaction Size Limits
- Keep transactions under 1 minute for optimal performance
- Use appropriate batch sizes to avoid timeouts
- Implement retry logic for failed transactions
ACID Compliance
- MotherDuck provides ACID compliance but is not an OLTP system
- Use for analytical workloads, not high-frequency transactional data
- Consider eventual consistency for read scaling scenarios
Data Quality Considerations
Schema Validation
- Use explicit schemas to catch data quality issues early
- Implement data validation before loading
- Consider using data quality tools in your pipeline
Error Recovery
- Implement idempotent loading processes
- Use `CREATE OR REPLACE` for table recreation (see the sketch below)
- Maintain audit trails for troubleshooting
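A minimal idempotent-reload sketch; because the table is rebuilt from the source files, re-running it after a failure cannot create duplicates (paths are illustrative):

```sql
-- Safe to re-run: the table is replaced atomically from the source files.
CREATE OR REPLACE TABLE daily_sales AS
SELECT * FROM read_parquet('s3://my-bucket/sales/date=2024-06-01/*.parquet');
```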
Loading Patterns and Best Practices
Loading from Different Sources
Local Files
- Good for development and testing
- Limited by local network and storage
- Use for smaller datasets or initial exploration
Cloud Storage
- Better performance for large datasets
- Leverages cloud infrastructure
- Supports multiple formats and compression
Database Sources
- Use DuckDB's postgres and mysql extensions to attach external databases (example below)
- Consider network latency and bandwidth
- Implement appropriate connection pooling
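For example, with the postgres extension; the connection details are placeholders:

```sql
-- Attach a read-only PostgreSQL source and copy a table into MotherDuck.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=shop host=pg.internal user=reader' AS pg (TYPE postgres, READ_ONLY);

CREATE TABLE orders AS
SELECT * FROM pg.public.orders;
```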
API Sources
- Buffer data before loading to MotherDuck
- Consider rate limits and error handling
- Use appropriate data types for API responses
Monitoring and Troubleshooting
Performance Monitoring
Key Metrics to Track
- Loading time per batch
- Memory usage during loads
- Network throughput
- Error rates and types
Tools and Queries
- Use `PRAGMA database_size` for storage monitoring (example below)
- Check `query_history` for performance analysis
- Monitor connection pool health
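For example, to check storage usage from a SQL client:

```sql
-- Report the storage used by attached databases.
PRAGMA database_size;
```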
Common Issues and Solutions
Memory Issues
- Reduce batch sizes
- Use more efficient data types
- Implement proper connection management
Timeout Issues
- Reduce transaction size
- Optimize network connectivity
- Use appropriate instance types
Storage Issues
- Monitor storage lifecycle stages
- Implement data retention policies
- Optimize data compression
Summary
The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:
- Batch operations over streaming for better performance (unless streaming is required)
- Appropriate file formats (Parquet preferred) for compression and speed, when you can control how source data is written
- Optimal batch sizes (1,000,000+ rows) to balance memory and throughput
- Proper connection management to avoid resource issues
- Monitoring and optimization to maintain performance over time
By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.