Considerations for Loading Data

When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability.

Data Loading Methods and Their Trade-offs

Batch vs. Streaming Approaches

MotherDuck is optimized for batch operations rather than real-time streaming. This fundamental design choice affects how you should approach data loading:

Batch Loading (Recommended)

  • Performance: Significantly faster throughput when loading large datasets
  • Cost: More efficient use of compute resources and network bandwidth
  • Reliability: Better error handling and recovery mechanisms
  • Use Case: Data warehousing, analytics, reporting

Streaming/Real-time

  • Performance: Suboptimal for small, frequent writes
  • Cost: Higher overhead per operation
  • Reliability: More complex error handling
  • Use Case: Consider using queues or buffering systems in front of MotherDuck
tip

For streaming workloads, use queues to batch writes to tables. While this introduces some latency, the improvement in throughput far outweighs the cost of small writes.
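
As an illustration, a simple in-process buffer can accumulate incoming events and flush them as one batched insert once a threshold is reached. This is a minimal sketch using the DuckDB Python client; the database name, table schema, and batch size are assumptions, not part of the original guidance.

```python
import duckdb
import pandas as pd

# Connect to MotherDuck (assumes the motherduck_token environment variable is set;
# "my_db" is a placeholder database name).
con = duckdb.connect("md:my_db")
con.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, payload VARCHAR, ts TIMESTAMP)")

BATCH_SIZE = 100_000   # flush threshold; tune for your latency/throughput trade-off
buffer = []

def handle_event(event):
    """Buffer a single incoming event; flush once the threshold is reached."""
    buffer.append((event["id"], event["payload"], event["ts"]))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Write all buffered rows as a single batched insert instead of many small writes."""
    if not buffer:
        return
    df = pd.DataFrame(buffer, columns=["id", "payload", "ts"])
    con.register("buffer_df", df)
    con.execute("INSERT INTO events SELECT * FROM buffer_df")
    con.unregister("buffer_df")
    buffer.clear()
```

Pushing the buffered rows through a DataFrame lets DuckDB ingest the whole batch as one INSERT rather than one statement per row.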

File Format Considerations

The choice of file format significantly impacts loading performance:

Parquet (Recommended)

  • Compression: 5-10x better compression than CSV
  • Performance: 5-10x more throughput due to compression
  • Schema: Self-describing with embedded metadata
  • Use Case: Production data loading, large datasets

CSV

  • Compression: Minimal compression benefits
  • Performance: Slower loading, especially for large files
  • Schema: Requires manual type inference or specification
  • Use Case: Simple data exploration, small datasets

JSON

  • Compression: Moderate compression
  • Performance: Slower than Parquet due to parsing overhead
  • Schema: Flexible but requires careful type handling
  • Use Case: Semi-structured data, API responses
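
As an illustration, DuckDB's file readers handle each of these formats directly, so a MotherDuck table can be created straight from the source files. The bucket paths and table names below are placeholders, and reading from S3 assumes credentials are already configured (for example via a DuckDB secret).

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# Parquet: compressed and self-describing -- preferred for production loads.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales AS
    SELECT * FROM read_parquet('s3://my-bucket/sales/*.parquet')
""")

# CSV: column types are inferred unless specified explicitly.
con.execute("""
    CREATE TABLE IF NOT EXISTS staging_csv AS
    SELECT * FROM read_csv('s3://my-bucket/export.csv')
""")

# JSON: flexible, but parsing overhead makes it slower than Parquet.
con.execute("""
    CREATE TABLE IF NOT EXISTS api_events AS
    SELECT * FROM read_json_auto('s3://my-bucket/events.json')
""")
```

Of the three, the single CREATE TABLE AS over Parquet is typically the fastest path for large datasets.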

Performance Optimization Strategies

Batch Size Optimization

The size of your data batches directly impacts performance and resource usage:

Optimal Batch Size: 1,000,000+ rows

  • DuckDB processes data in row groups of 122,880 rows
  • A 1.2M row insert spans roughly 10 row groups, so it parallelizes across 10 threads automatically
  • A 100k row insert and a 1M row insert often take roughly the same wall-clock time, because the larger insert spreads its extra work across more threads
  • Aim for batches of more than 1M rows to get the full benefit of this parallelization

Too Small (< 1,000,000 rows)

  • Higher overhead per batch due to parallelization setup
  • Increased network round trips
  • Suboptimal resource utilization
  • 100k row batches don't provide meaningful performance benefits

Very Large (> 1,000,000,000 rows)

  • Risk of memory issues
  • Longer transaction times
  • Potential timeout issues
  • Consider breaking into multiple 100M row batches
tip

Load data in batches of at least 1M rows to leverage DuckDB's parallelization. Smaller batches (like 100k rows) don't provide meaningful performance benefits and may actually be slower due to overhead.
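
A minimal sketch of this guidance using the DuckDB Python client; the table name, source glob, and batch producer are illustrative assumptions.

```python
import duckdb
import pandas as pd

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database
con.execute("CREATE TABLE IF NOT EXISTS trips (pickup TIMESTAMP, fare DOUBLE)")

# File-based sources: one INSERT ... SELECT over the whole glob loads everything
# as a single large batch that parallelizes across 122,880-row groups.
con.execute("""
    INSERT INTO trips
    SELECT pickup, fare FROM read_parquet('s3://my-bucket/trips/*.parquet')
""")

def produce_batches():
    """Stand-in for an application source; in practice yield >= 1M rows per DataFrame."""
    yield pd.DataFrame({
        "pickup": pd.to_datetime(["2024-01-01 00:00:00"] * 3),
        "fare": [12.5, 7.0, 23.1],
    })

# Application-generated data: one INSERT per large batch rather than many small writes.
for batch in produce_batches():
    con.register("batch_df", batch)
    con.execute("INSERT INTO trips SELECT pickup, fare FROM batch_df")
    con.unregister("batch_df")
```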

Memory Management

Effective memory management is crucial for large data loads:

Connection Pooling

  • Reuse connections to reduce overhead
  • Implement connection recycling (every 15-20 minutes), as in the sketch below
  • Use appropriate pool sizes based on concurrency needs
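
One way to implement this, assuming the DuckDB Python client and a placeholder database name, is a small wrapper that reuses a single connection and replaces it on the recycling interval mentioned above.

```python
import time
import duckdb

RECYCLE_SECONDS = 15 * 60   # recycle roughly every 15-20 minutes

class RecyclingConnection:
    """Reuse one MotherDuck connection and periodically replace it."""

    def __init__(self, dsn="md:my_db"):   # placeholder database name
        self._dsn = dsn
        self._con = duckdb.connect(dsn)
        self._opened_at = time.monotonic()

    def execute(self, sql, params=None):
        # Recycle the underlying connection once it has been open long enough.
        if time.monotonic() - self._opened_at > RECYCLE_SECONDS:
            self._con.close()
            self._con = duckdb.connect(self._dsn)
            self._opened_at = time.monotonic()
        return self._con.execute(sql, params)
```

For concurrent loaders, a pool of such connections, sized to your concurrency needs, follows the same pattern.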

Data Type Optimization

  • Use explicit schemas to avoid type inference overhead (see the sketch below)
  • Choose appropriate data types (e.g., TIMESTAMP for dates)
  • Avoid unnecessary type conversions
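
For example, declaring column types when reading CSV skips inference and avoids later conversions; the file path, table, and column names below are placeholders.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# Explicit column types: no inference pass, and values land in the intended types.
con.execute("""
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT * FROM read_csv(
        's3://my-bucket/orders.csv',
        columns = {
            'order_id':   'BIGINT',
            'ordered_at': 'TIMESTAMP',
            'amount':     'DECIMAL(10,2)'
        }
    )
""")
```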

Sorting Strategy

  • Sort data by frequently queried columns during loading
    • To re-sort existing tables, use CREATE OR REPLACE with the preferred sort order, as in the sketch after this list
  • Improves query performance through better data locality
  • Consider the trade-off between loading speed and query performance
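
A minimal sketch of the re-sorting approach above, with placeholder table and column names.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# Rewrite the table ordered by a frequently filtered column so related rows
# land in the same row groups (better data locality for later queries).
con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM events
    ORDER BY event_date
""")
```

The rewrite costs an extra pass over the data at load time, but queries that filter on the sort column then touch fewer row groups.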

Network and Location Considerations

Data Location

  • MotherDuck is hosted in AWS us-east-1
  • Locate source data in the same region for optimal performance
  • Consider network latency when loading from remote sources

Cloud Storage Integration

  • Direct integration with S3, GCS, Azure Blob Storage
  • Leverages cloud network speeds for better performance
  • Reduces local storage requirements

Cost Implications

Compute Costs

Instance Selection

  • Pulse: Not advised for data loading
  • Standard: Good for most data loading workloads
  • Jumbo: Better for large, complex transformations during loading
  • Consider the 60-second cooldown period when planning batch sizes

Query Optimization

  • Monitor query history during large loads
  • Use EXPLAIN or EXPLAIN ANALYZE to understand execution plans (see the sketch below)
  • Optimize to minimize compute time
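
For example, EXPLAIN ANALYZE can be run from the Python client against a representative load query; the source path below is a placeholder.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# EXPLAIN ANALYZE executes the statement and reports per-operator timings,
# which helps show where a load or transformation spends its compute.
result = con.execute("""
    EXPLAIN ANALYZE
    SELECT count(*) FROM read_parquet('s3://my-bucket/sales/*.parquet')
""").fetchall()

for row in result:
    print(row[-1])   # the rendered plan text is in the last column
```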

Reliability and Error Handling

Transaction Management

Transaction Size Limits

  • Keep transactions under 1 minute for optimal performance
  • Use appropriate batch sizes to avoid timeouts
  • Implement retry logic for failed transactions (see the sketch below)
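
A minimal retry sketch, assuming the DuckDB Python client; the backoff policy and attempt count are illustrative.

```python
import time
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

def insert_batch_with_retry(sql, params=None, attempts=3):
    """Run one batch inside a short transaction, retrying with backoff on failure.

    Keep each batch small enough that the transaction finishes well under a minute.
    """
    for attempt in range(1, attempts + 1):
        try:
            con.execute("BEGIN TRANSACTION")
            con.execute(sql, params)
            con.execute("COMMIT")
            return
        except duckdb.Error:
            try:
                con.execute("ROLLBACK")
            except duckdb.Error:
                pass                    # rollback may fail if the connection dropped
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)    # simple exponential backoff
```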

ACID Compliance

  • MotherDuck provides ACID compliance but is not an OLTP system
  • Use for analytical workloads, not high-frequency transactional data
  • Consider eventual consistency for read scaling scenarios

Data Quality Considerations

Schema Validation

  • Use explicit schemas to catch data quality issues early
  • Implement data validation before loading
  • Consider using data quality tools in your pipeline

Error Recovery

  • Implement idempotent loading processes
  • Use CREATE OR REPLACE for table recreation (see the sketch below)
  • Maintain audit trails for troubleshooting
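
For example, rebuilding the target table from its source in a single statement makes the load safe to re-run after a failure; the table name and path below are placeholders.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# Re-running this statement always produces the same end state, so a failed
# or repeated load does not leave duplicate rows behind.
con.execute("""
    CREATE OR REPLACE TABLE daily_metrics AS
    SELECT * FROM read_parquet('s3://my-bucket/metrics/2024-06-01/*.parquet')
""")
```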

Loading Patterns and Best Practices

Loading from Different Sources

Local Files

  • Good for development and testing
  • Limited by local network and storage
  • Use for smaller datasets or initial exploration

Cloud Storage

  • Better performance for large datasets
  • Leverages cloud infrastructure
  • Supports multiple formats and compression

Database Sources

  • Use DuckDB extensions for PostgreSQL/MySQL (see the sketch below)
  • Consider network latency and bandwidth
  • Implement appropriate connection pooling
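
As a sketch, the DuckDB postgres extension can attach a PostgreSQL database and copy a table across in one statement. The connection string, schema, and table names are placeholders, and the Postgres host must be reachable from wherever the query executes.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach the source database read-only, then copy the table in one statement.
con.execute("""
    ATTACH 'host=pg.example.com dbname=app user=loader' AS pg (TYPE postgres, READ_ONLY)
""")
con.execute("CREATE TABLE IF NOT EXISTS customers AS SELECT * FROM pg.public.customers")
```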

API Sources

  • Buffer data before loading to MotherDuck (see the sketch below)
  • Consider rate limits and error handling
  • Use appropriate data types for API responses
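
A minimal sketch of buffering a paginated API before a single insert; the endpoint, pagination fields, and table schema are assumptions.

```python
import duckdb
import pandas as pd
import requests

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database
con.execute("CREATE TABLE IF NOT EXISTS issues (id BIGINT, title VARCHAR, created_at TIMESTAMP)")

# Collect all pages in memory first, then load them in one batched insert.
records = []
url = "https://api.example.com/issues"           # placeholder endpoint
while url:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()                       # surface rate-limit and server errors
    body = resp.json()
    records.extend(body["items"])
    url = body.get("next")                        # placeholder pagination field

df = pd.DataFrame(records)
df["created_at"] = pd.to_datetime(df["created_at"])  # coerce API strings to real types
con.register("issues_df", df)
con.execute("INSERT INTO issues SELECT id, title, created_at FROM issues_df")
```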

Monitoring and Troubleshooting

Performance Monitoring

Key Metrics to Track

  • Loading time per batch
  • Memory usage during loads
  • Network throughput
  • Error rates and types

Tools and Queries

  • Use PRAGMA database_size for storage monitoring (see the sketch below)
  • Check query_history for performance analysis
  • Monitor connection pool health
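
For example, PRAGMA database_size can be logged before and after a large load from the Python client; the database name is a placeholder.

```python
import duckdb

con = duckdb.connect("md:my_db")  # placeholder MotherDuck database

# PRAGMA database_size reports storage statistics worth logging around large loads.
for row in con.execute("PRAGMA database_size").fetchall():
    print(row)
```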

Common Issues and Solutions

Memory Issues

  • Reduce batch sizes
  • Use more efficient data types
  • Implement proper connection management

Timeout Issues

  • Reduce transaction size
  • Optimize network connectivity
  • Use appropriate instance types

Storage Issues

  • Monitor storage lifecycle stages
  • Implement data retention policies
  • Optimize data compression

Summary

The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:

  1. Batch operations over streaming for better performance (unless streaming is required)
  2. Appropriate file formats (Parquet preferred) for compression and speed, when you can control how source data is written
  3. Optimal batch sizes (1,000,000+ rows) to balance memory and throughput
  4. Proper connection management to avoid resource issues
  5. Monitoring and optimization to maintain performance over time

By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.