SUMMARIZE
Back to DuckDB Data Engineering Glossary
Overview
SUMMARIZE
is a powerful DuckDB command that automatically generates descriptive statistics and metadata about your data. Unlike traditional SQL databases where you'd need to write multiple aggregation queries, SUMMARIZE
gives you a comprehensive overview of your data in a single command.
Basic Usage
The simplest way to use SUMMARIZE
is to apply it directly to a table or query:
Copy code
SUMMARIZE SELECT * FROM my_table;
This generates statistics including:
- Count of rows
- Number of null values
- Approximate count of unique values
- Minimum and maximum values
- Average and standard deviation (for numeric columns)
- 25th, 50th, and 75th percentiles (for numeric columns)
Advanced Usage
You can also SUMMARIZE
specific columns or expressions:
Copy code
SUMMARIZE (SELECT age, income FROM customers WHERE region = 'West');
Since DuckDB 0.10.0, SUMMARIZE
can be used as a source for a SELECT
statement, allowing you to filter or transform the summary statistics:
Copy code
SELECT column_name, min, max, approx_unique
FROM SUMMARIZE my_table
WHERE null_percentage > 0;
Performance Considerations
SUMMARIZE
is optimized to scan data only once while computing multiple statistics, making it much more efficient than running separate aggregate queries. However, for very large datasets, it may take longer than simpler aggregations since it computes comprehensive statistics for all columns by default.
Differences from Other Databases
SUMMARIZE
is unique to DuckDB - most other databases require you to write multiple separate queries to get the same information. While some databases like PostgreSQL have similar functions (like pg_stats
), they typically only show basic statistics and require administrator privileges to access.