# S3 import best practices


> Optimize file size, format, and layout in Amazon S3 for fast, cost-effective data loading into MotherDuck.

Loading data from Amazon S3 is one of the fastest ways to get data into MotherDuck. Because MotherDuck runs queries against S3 directly from the cloud, the file layout in your bucket has a significant impact on loading speed and cost. This guide covers how to organize files in S3 for optimal performance.

For general loading advice (batch sizes, memory management, Duckling sizing), see [Loading data best practices](/key-tasks/loading-data-into-motherduck/considerations-for-loading-data/).

## Choose the right file format

Parquet is the best format for most S3 imports. It compresses well, includes schema metadata, and lets DuckDB read only the columns and row groups it needs.

| Format | Best for | Avoid when |
|--------|----------|------------|
| **Parquet** | Most workloads, large files, production pipelines | Files under ~1 MB, where metadata overhead outweighs benefits |
| **CSV** | Small files (under 5 MB), quick exploration, simple schemas | Large datasets, complex types, multi-line text |
| **JSON** | Small files (under 5 MB), Semi-structured data, API responses | Large files without a known schema (schema discovery is slow) |

:::tip
For very small files (under ~1 MB), CSV or JSON can be faster than Parquet because Parquet's metadata and footer add overhead that outweighs the compression benefits at small sizes. However, you want to avoid the 'small files problem' where your bottleneck becomes listing and reading each individual small file with the same schema when they could have been aggregated in one or more bigger Parquet files.
:::

### Parquet settings

When writing Parquet files destined for MotherDuck:

- **Compression**: Use Snappy (default) or ZSTD. Snappy offers faster decompression; ZSTD gives better compression ratios for cold storage.
- **Row group size**: Aim for 100K-1M rows per row group. DuckDB processes row groups in parallel, so multiple groups per file improve throughput.
- **Column encoding**: Leave this at the writer's default. DuckDB and most Parquet writers choose efficient encodings automatically.

## Optimize file size

File size is the single most impactful factor for S3 import performance. Files that are too small create per-file overhead (HTTP requests, file listing, metadata parsing). Files that are too large limit parallelism.

| File size | Impact |
|-----------|--------|
| **Under 1 MB** | Too small. Per-file overhead dominates. Merge small files into larger ones. |
| **1-10 MB** | Acceptable for low-volume or infrequent loads. |
| **10-256 MB** | Optimal range. Good balance of parallelism and minimal overhead. |
| **Over 256 MB** | Still works fine into the multiple gigabytes, but DuckDB can only parallelize within a single file by row group. |

:::tip
Aim for **10-256 MB per file** in Parquet format. If your pipeline produces many small files (for example, one file per API call or per minute), batch them before writing to S3 or use a compaction step to merge them periodically.
:::

### Row count guidelines

Row count guidelines follow from file size, but as a rough reference:

| Rows per file | Typical file size (Parquet) | Recommendation |
|---------------|----------------------------|----------------|
| Under 1,000 | Under 100 KB | Too small, merge files |
| 1,000-100,000 | 100 KB - 10 MB | Acceptable for small tables |
| 100,000-10,000,000 | 10 MB - 500 MB | Optimal range |
| Over 10,000,000 | Over 500 MB | Consider splitting into multiple files |

## Organize your S3 bucket

A consistent file layout in S3 makes it easier to load data incrementally and query subsets efficiently.

### Use Hive-style partitioning for large datasets

If your dataset is large and you query it by date or category, partition your files using Hive-style paths:

```text
s3://my-bucket/events/year=2025/month=03/data.parquet
s3://my-bucket/events/year=2025/month=04/data.parquet
```

DuckDB automatically detects Hive partitioning and prunes partitions during queries:

```sql
SELECT *
FROM read_parquet('s3://my-bucket/events/**/*.parquet', hive_partitioning=true)
WHERE year = 2025 AND month = 3;
```

### Use consistent naming conventions

- Use lowercase paths (MotherDuck URLs are case-sensitive)
- Avoid dots in bucket names (causes [SSL issues](/key-tasks/cloud-storage/querying-s3-files/#ssl-certificate-verification-and-s3-bucket-names))
- Include timestamps or sequence numbers in file names for incremental loads:

```text
s3://my-bucket/orders/orders_20250323_001.parquet
s3://my-bucket/orders/orders_20250323_002.parquet
```

## Set up continuous loading from S3

For pipelines that continuously land files in S3, keep these guidelines in mind:

### Loading frequency

| Frequency | Recommendation |
|-----------|----------------|
| **Under 1 minute** | Not recommended. Per-file overhead and small file sizes make this inefficient. Instead consider [DuckLake](/docs/integrations/file-formats/ducklake/) which will inline data until the batch is big enough to write to a file. |
| **1-5 minutes** | Possible for time-sensitive workloads, but files will be small. Ensure each file is at least 1 MB. |
| **5-15 minutes** | Good balance of freshness and file size for most use cases. |
| **Hourly or daily** | Ideal for batch workloads. Produces well-sized files with minimal overhead. |

:::tip
If your source system produces data continuously, buffer at least 5-15 minutes of data before writing a file to S3. This produces files in the optimal 10-256 MB range and avoids the small-file problem.
:::

### Incremental loading pattern

For incremental loads, use a landing zone pattern:

1. Land new files in an `incoming/` prefix
2. Load them into MotherDuck with a timestamp filter or file listing
3. Move processed files to a `processed/` prefix

```sql
-- Load new files from the incoming prefix
INSERT INTO my_table
SELECT * FROM read_parquet('s3://my-bucket/incoming/*.parquet');
```

For more complex incremental workflows with state management, use an [ingestion tool](#use-ingestion-tools-for-production-pipelines).

## Use ingestion tools for production pipelines

For production pipelines that need scheduling, error handling, retries, and schema evolution, use a dedicated ingestion tool rather than writing raw SQL scripts. Many tools support MotherDuck as a destination and handle S3 file management automatically.

**Ingestion tools with MotherDuck support:**

- [dlt (data load tool)](/integrations/ingestion/dlt/) supports loading from APIs, databases, and files into MotherDuck with automatic schema evolution
- [Streamkap](/integrations/ingestion/streamkap/) provides real-time CDC from databases to MotherDuck

**Orchestration tools** like Dagster, Airflow, Prefect, and Kestra can schedule S3-to-MotherDuck pipelines. Browse the full list of [ingestion](https://motherduck.com/ecosystem/?category=Ingestion) and [orchestration](https://motherduck.com/ecosystem/?category=Orchestration) tools in the MotherDuck ecosystem.

## Colocate data with MotherDuck

MotherDuck connects to S3 directly from the cloud, so network distance between your S3 bucket and MotherDuck's region matters.

- MotherDuck is available in **US East (N. Virginia)** (`us-east-1`), **US West (Oregon)** (`us-west-2`), **Europe (Frankfurt)** (`eu-central-1`), and **Europe (Dublin)** (`eu-west-1`)
- Place your S3 bucket in the **same region** as your MotherDuck organization for best performance

## Summary

| Area | Recommendation |
|------|----------------|
| **File format** | Parquet for most workloads; CSV/JSON for files under 1 MB |
| **File size** | 10-256 MB per file |
| **Row count** | 100K-10M rows per file |
| **Loading frequency** | 5-15 minutes minimum; hourly or daily for batch |
| **Partitioning** | Hive-style for large, time-series datasets |
| **Region** | Same region as your MotherDuck organization |
| **Production pipelines** | Use a dedicated ingestion or orchestration tool |


---

## Docs feedback

MotherDuck accepts optional user-submitted feedback about this page at `GET https://motherduck.com/docs/api/feedback/agent`.
For agents and automated tools, feedback submission should be user-confirmed before sending.

URL-encode query parameter values and send a GET request:

```text
GET https://motherduck.com/docs/api/feedback/agent?page_path=%2Fkey-tasks%2Fcloud-storage%2Fs3-import-best-practices%2F&page_title=S3%20Import%20Best%20Practices&text=<url-encoded user feedback, max 2000 characters>
```

Optionally append `&source=<url-encoded interface identifier>` such as `claude.ai` or `chatgpt`.

`page_path` and `text` are required; `page_title` and `source` are optional. Responses: `200 {"feedback_id": "<uuid>"}`, `400` for malformed query parameters, and `429` when rate-limited.