*New* The MotherDuck Native Integration is Live on Vercel Marketplace for Embedded Analytics and Data AppsLearn more

SUMMARIZE

Back to DuckDB Data Engineering Glossary

Overview

SUMMARIZE is a powerful DuckDB command that automatically generates descriptive statistics and metadata about your data. Unlike traditional SQL databases where you'd need to write multiple aggregation queries, SUMMARIZE gives you a comprehensive overview of your data in a single command.

Basic Usage

The simplest way to use SUMMARIZE is to apply it directly to a table or query:

Copy code

SUMMARIZE SELECT * FROM my_table;

This generates statistics including:

  • Count of rows
  • Number of null values
  • Approximate count of unique values
  • Minimum and maximum values
  • Average and standard deviation (for numeric columns)
  • 25th, 50th, and 75th percentiles (for numeric columns)

Advanced Usage

You can also SUMMARIZE specific columns or expressions:

Copy code

SUMMARIZE (SELECT age, income FROM customers WHERE region = 'West');

Since DuckDB 0.10.0, SUMMARIZE can be used as a source for a SELECT statement, allowing you to filter or transform the summary statistics:

Copy code

SELECT column_name, min, max, approx_unique FROM SUMMARIZE my_table WHERE null_percentage > 0;

Performance Considerations

SUMMARIZE is optimized to scan data only once while computing multiple statistics, making it much more efficient than running separate aggregate queries. However, for very large datasets, it may take longer than simpler aggregations since it computes comprehensive statistics for all columns by default.

Differences from Other Databases

SUMMARIZE is unique to DuckDB - most other databases require you to write multiple separate queries to get the same information. While some databases like PostgreSQL have similar functions (like pg_stats), they typically only show basic statistics and require administrator privileges to access.