Parquet
Overview
Apache Parquet is a columnar storage file format designed for efficient data processing and analytics. Unlike row-based formats like CSV, Parquet stores data by column rather than by row, which enables better compression and faster querying for analytical workloads. Parquet files also contain metadata about the schema and statistics about the data, allowing query engines to skip reading irrelevant data blocks.
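Because the schema travels with the file, it can be inspected directly. As a rough illustration (the file name mydata.parquet is just a placeholder), DuckDB's parquet_schema function lists the column names and types recorded in the file footer:
SELECT name, type FROM parquet_schema('mydata.parquet');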
Key Benefits
Parquet provides excellent compression because similar values are stored together within each column. The format supports compression codecs and encoding schemes, such as dictionary and run-length encoding, that work especially well for repeating values. When querying Parquet files, systems like DuckDB can skip reading columns that a query does not reference (known as column pruning or projection pushdown) and can skip row groups whose min/max statistics show they cannot match the filter conditions (known as predicate pushdown).
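To see these optimizations at work, you can prefix a query with EXPLAIN. In the sketch below the file and column names are hypothetical; the resulting plan should show that only the referenced columns are projected and that the date filter is pushed down into the Parquet scan:
EXPLAIN SELECT order_id, amount
FROM 'orders.parquet'
WHERE order_date >= DATE '2024-01-01';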
Usage with DuckDB
DuckDB has native support for reading and writing Parquet files. Here are some common usage patterns:
Reading a Parquet file is as simple as:
SELECT * FROM 'mydata.parquet';
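The same syntax extends to multiple files. Assuming a directory of Parquet files (the paths below are placeholders), a glob pattern or the read_parquet function reads them all as a single table:
-- read every Parquet file in the directory as one table
SELECT * FROM 'data/*.parquet';
-- read_parquet exposes extra options, e.g. adding the source file name as a column
SELECT * FROM read_parquet('data/*.parquet', filename = true);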
You can write query results to Parquet using:
COPY (SELECT * FROM mytable) TO 'output.parquet' (FORMAT PARQUET);
DuckDB supports configuring Parquet write options like compression codec and row group size:
COPY (SELECT * FROM mytable) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'SNAPPY', ROW_GROUP_SIZE 100000);
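The per-row-group statistics mentioned above can be examined with DuckDB's parquet_metadata function, which reports details such as min/max values and compressed sizes for each column chunk. For example, reusing the output.parquet file written above:
SELECT row_group_id, path_in_schema, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('output.parquet');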
Integration with Data Lakes
Parquet is a popular format for data lakes built on cloud storage systems like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. When combined with table formats like Apache Iceberg or Delta Lake, Parquet enables building efficient and scalable data lake architectures.
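As a quick sketch of that pattern, DuckDB can query Parquet files directly from object storage through its httpfs extension. The bucket and path below are placeholders, and credentials would need to be configured separately (for example via a DuckDB secret or environment variables):
INSTALL httpfs;
LOAD httpfs;
-- count rows across all Parquet files under the (hypothetical) prefix
SELECT count(*) FROM 's3://my-bucket/events/*.parquet';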