Skip to main content

S3 import best practices

Loading data from Amazon S3 is one of the fastest ways to get data into MotherDuck. Because MotherDuck runs queries against S3 directly from the cloud, the file layout in your bucket has a significant impact on loading speed and cost. This guide covers how to organize files in S3 for optimal performance.

For general loading advice (batch sizes, memory management, Duckling sizing), see Loading data best practices.

Choose the right file format

Parquet is the best format for most S3 imports. It compresses well, includes schema metadata, and lets DuckDB read only the columns and row groups it needs.

FormatBest forAvoid when
ParquetMost workloads, large files, production pipelinesFiles under ~1 MB, where metadata overhead outweighs benefits
CSVSmall files (under 5 MB), quick exploration, simple schemasLarge datasets, complex types, multi-line text
JSONSmall files (under 5 MB), Semi-structured data, API responsesLarge files without a known schema (schema discovery is slow)
tip

For very small files (under ~1 MB), CSV or JSON can be faster than Parquet because Parquet's metadata and footer add overhead that outweighs the compression benefits at small sizes. However, you want to avoid the 'small files problem' where your bottleneck becomes listing and reading each individual small file with the same schema when they could have been aggregated in one or more bigger Parquet files.

Parquet settings

When writing Parquet files destined for MotherDuck:

  • Compression: Use Snappy (default) or ZSTD. Snappy offers faster decompression; ZSTD gives better compression ratios for cold storage.
  • Row group size: Aim for 100K-1M rows per row group. DuckDB processes row groups in parallel, so multiple groups per file improve throughput.
  • Column encoding: Leave this at the writer's default. DuckDB and most Parquet writers choose efficient encodings automatically.

Optimize file size

File size is the single most impactful factor for S3 import performance. Files that are too small create per-file overhead (HTTP requests, file listing, metadata parsing). Files that are too large limit parallelism.

File sizeImpact
Under 1 MBToo small. Per-file overhead dominates. Merge small files into larger ones.
1-10 MBAcceptable for low-volume or infrequent loads.
10-256 MBOptimal range. Good balance of parallelism and minimal overhead.
Over 256 MBStill works fine into the multiple gigabytes, but DuckDB can only parallelize within a single file by row group.
tip

Aim for 10-256 MB per file in Parquet format. If your pipeline produces many small files (for example, one file per API call or per minute), batch them before writing to S3 or use a compaction step to merge them periodically.

Row count guidelines

Row count guidelines follow from file size, but as a rough reference:

Rows per fileTypical file size (Parquet)Recommendation
Under 1,000Under 100 KBToo small, merge files
1,000-100,000100 KB - 10 MBAcceptable for small tables
100,000-10,000,00010 MB - 500 MBOptimal range
Over 10,000,000Over 500 MBConsider splitting into multiple files

Organize your S3 bucket

A consistent file layout in S3 makes it easier to load data incrementally and query subsets efficiently.

Use Hive-style partitioning for large datasets

If your dataset is large and you query it by date or category, partition your files using Hive-style paths:

s3://my-bucket/events/year=2025/month=03/data.parquet
s3://my-bucket/events/year=2025/month=04/data.parquet

DuckDB automatically detects Hive partitioning and prunes partitions during queries:

SELECT *
FROM read_parquet('s3://my-bucket/events/**/*.parquet', hive_partitioning=true)
WHERE year = 2025 AND month = 3;

Use consistent naming conventions

  • Use lowercase paths (MotherDuck URLs are case-sensitive)
  • Avoid dots in bucket names (causes SSL issues)
  • Include timestamps or sequence numbers in file names for incremental loads:
s3://my-bucket/orders/orders_20250323_001.parquet
s3://my-bucket/orders/orders_20250323_002.parquet

Set up continuous loading from S3

For pipelines that continuously land files in S3, keep these guidelines in mind:

Loading frequency

FrequencyRecommendation
Under 1 minuteNot recommended. Per-file overhead and small file sizes make this inefficient. Instead consider Ducklake which will inline data until the batch is big enough to write to a file.
1-5 minutesPossible for time-sensitive workloads, but files will be small. Ensure each file is at least 1 MB.
5-15 minutesGood balance of freshness and file size for most use cases.
Hourly or dailyIdeal for batch workloads. Produces well-sized files with minimal overhead.
tip

If your source system produces data continuously, buffer at least 5-15 minutes of data before writing a file to S3. This produces files in the optimal 10-256 MB range and avoids the small-file problem.

Incremental loading pattern

For incremental loads, use a landing zone pattern:

  1. Land new files in an incoming/ prefix
  2. Load them into MotherDuck with a timestamp filter or file listing
  3. Move processed files to a processed/ prefix
-- Load new files from the incoming prefix
INSERT INTO my_table
SELECT * FROM read_parquet('s3://my-bucket/incoming/*.parquet');

For more complex incremental workflows with state management, use an ingestion tool.

Use ingestion tools for production pipelines

For production pipelines that need scheduling, error handling, retries, and schema evolution, use a dedicated ingestion tool rather than writing raw SQL scripts. Many tools support MotherDuck as a destination and handle S3 file management automatically.

Ingestion tools with MotherDuck support:

  • dlt (data load tool) supports loading from APIs, databases, and files into MotherDuck with automatic schema evolution
  • Streamkap provides real-time CDC from databases to MotherDuck

Orchestration tools like Dagster, Airflow, Prefect, and Kestra can schedule S3-to-MotherDuck pipelines. Browse the full list of ingestion and orchestration tools in the MotherDuck ecosystem.

Colocate data with MotherDuck

MotherDuck connects to S3 directly from the cloud, so network distance between your S3 bucket and MotherDuck's region matters.

  • MotherDuck is available in US East (N. Virginia) (us-east-1), US West (Oregon) (us-west-2), and Europe (Frankfurt) (eu-central-1)
  • Place your S3 bucket in the same region as your MotherDuck organization for best performance

Summary

AreaRecommendation
File formatParquet for most workloads; CSV/JSON for files under 1 MB
File size10-256 MB per file
Row count100K-10M rows per file
Loading frequency5-15 minutes minimum; hourly or daily for batch
PartitioningHive-style for large, time-series datasets
RegionSame region as your MotherDuck organization
Production pipelinesUse a dedicated ingestion or orchestration tool