Advanced Aggregation and Analysis of Data
This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

4.1 PRE-AGGREGATING DATA WHILE INGESTING
Data is rarely perfect upon arrival. In analytical workflows, you often face high-frequency sensor data or inconsistent intervals that need normalization before storage. The examples in this chapter use real-world photovoltaic data (e.g., from the NREL API) to demonstrate how DuckDB allows for pre-aggregation during the ingestion phase.
- Time Bucketing: The chapter introduces the `time_bucket()` function, which truncates timestamps to specific intervals (e.g., 15 minutes). This is essential for aligning inconsistent time-series data from different sources into a uniform schema.
- On-the-fly Cleaning: By combining ingestion with aggregation functions like `avg()`, you can clean data as it is read (sketched below).
- Documentation: For a complete list of date parsing and bucketing capabilities, the book recommends consulting the DuckDB Timestamp Function documentation.
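A minimal sketch of this pattern, assuming a hypothetical raw_readings.csv file with ts and power_w columns:

```sql
-- Bucket raw readings into 15-minute averages while ingesting.
-- File name and column names (ts, power_w) are hypothetical.
CREATE TABLE readings AS
SELECT
    time_bucket(INTERVAL '15 minutes', ts) AS period,
    avg(power_w)                           AS avg_power_w
FROM read_csv('raw_readings.csv')
GROUP BY period
ORDER BY period;
```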
4.2 SUMMARIZING DATA
Before diving into complex joins and aggregates, analysts need to understand the shape of their data. DuckDB provides a unique and powerful SQL extension called SUMMARIZE.
- Instant Statistics: Running `SUMMARIZE table_name` returns a wealth of meta-information, including column types, min/max values, null percentages, and approximate unique counts.
- Distribution Metrics: It automatically calculates statistical quartiles (q25, q50, q75), standard deviation, and averages.
- Zero-Copy Analysis: A key feature of DuckDB is that `SUMMARIZE` works directly on CSV or Parquet files without requiring data ingestion, allowing for rapid exploration of data lakes.
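Both forms below are sketches; the table and file names are hypothetical:

```sql
-- Summarize a table already loaded into DuckDB.
SUMMARIZE readings;

-- Summarize a Parquet file directly, without ingesting it first.
SUMMARIZE SELECT * FROM 'power_readings.parquet';
```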
4.3 ON SUBQUERIES
Subqueries are essential for staging computations, such as calculating nested aggregates (e.g., the average of a sum), which standard SQL cannot do in a single step.
- Uncorrelated Subqueries: These operate independently of the outer query. DuckDB treats them as if the outer query runs on the result of the inner query.
- Correlated Subqueries: These use values from the outer query to drive the inner query. DuckDB features a subquery decorrelation optimizer that rewrites these into efficient joins automatically, allowing developers to write expressive, readable SQL without worrying about performance penalties.
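For example, a nested aggregate such as the average of a daily sum needs a subquery as a staging step. A sketch, assuming a hypothetical readings table with read_on timestamps and kwh values:

```sql
-- Inner query: daily totals; outer query: the average of those totals.
SELECT avg(daily_kwh) AS avg_daily_kwh
FROM (
    SELECT date_trunc('day', read_on) AS day, sum(kwh) AS daily_kwh
    FROM readings
    GROUP BY day
) per_day;
```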
4.3.1 SUBQUERIES AS EXPRESSIONS
Subqueries act as expressions when they are not used as a relation in a JOIN.
- Scalar Comparisons: Using standard operators (`=`, `<`, `>`) requires the subquery to return a single scalar value.
- Set Comparisons: For subqueries returning multiple rows, DuckDB supports operators like `IN`, `EXISTS`, `ANY`, and `ALL`. This facilitates tasks such as identifying rows that match any value in a target set or all values in a generated list.
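Two brief sketches, again assuming hypothetical readings and systems tables:

```sql
-- Scalar comparison: the subquery must return a single value.
SELECT * FROM readings
WHERE kwh > (SELECT avg(kwh) FROM readings);

-- Set comparison: the subquery may return many rows.
SELECT * FROM readings
WHERE system_id IN (SELECT system_id FROM systems);
```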
4.4 GROUPING SETS
When building drill-down reports or dashboards, you often need aggregates at multiple levels of granularity (e.g., total sales, sales by year, and sales by year and system).
- GROUPING SETS: Instead of writing multiple queries and unioning the results, DuckDB supports `GROUPING SETS` to define multiple buckets of aggregation in a single pass.
- ROLLUP: This shorthand clause creates hierarchical aggregates (e.g., Year+Month, Year, and Grand Total).
- CUBE: This generates aggregates for all possible combinations of the grouping columns (2^n grouping sets).
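A sketch of a hierarchical ROLLUP over the same hypothetical readings table:

```sql
-- Produces (year, month) totals, per-year totals, and a grand total in one pass.
-- Equivalent explicit form:
--   GROUP BY GROUPING SETS ((year(read_on), month(read_on)), (year(read_on)), ())
SELECT year(read_on) AS year, month(read_on) AS month, sum(kwh) AS total_kwh
FROM readings
GROUP BY ROLLUP (year(read_on), month(read_on));
```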
4.5 WINDOW FUNCTIONS
Window functions are a cornerstone of modern analytics. Unlike standard aggregates that collapse rows, window functions allow you to calculate values based on a set of rows (a window) while retaining the individual identity of each row.
4.5.1 DEFINING PARTITIONS
To perform calculations relevant to specific groups (like the top 3 power readings per system), you must partition the data.
- PARTITION BY: Breaks the relation into independent pieces. Window functions cannot access values outside the current partition.
- Ordering: The `ORDER BY` clause within the window definition determines the order in which rows are evaluated inside each partition, which is vital for ranking functions like `rank()` (gaps in ranking) and `dense_rank()` (no gaps).
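A sketch of the "top 3 power readings per system" ranking (hypothetical schema):

```sql
-- dense_rank() restarts at 1 within each partition and leaves no gaps after ties.
SELECT system_id, read_on, power,
       dense_rank() OVER (PARTITION BY system_id ORDER BY power DESC) AS power_rank
FROM readings;
```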
4.5.2 FRAMING
Framing specifies exactly which rows relative to the current row are included in the calculation. This is used to create Moving Averages.
- Range Definition: You can define frames using specific intervals, such as `RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING`, to calculate rolling statistics (e.g., a 7-day moving average) that smooth out volatile data.
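A sketch of a 7-day moving average, assuming a hypothetical daily_production table with one row per day (a DATE column named day):

```sql
-- The RANGE frame spans 3 days before and after the current row's date.
SELECT day, kwh,
       avg(kwh) OVER (
           ORDER BY day
           RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING
       ) AS kwh_7day_avg
FROM daily_production;
```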
4.5.3 NAMED WINDOWS
Complex queries often reuse the same window definition for multiple aggregates (e.g., calculating min, max, and quantiles over the same 7-day period).
- The WINDOW Clause: DuckDB allows you to define a Named Window (e.g., `WINDOW seven_days AS (...)`) after the `FROM` clause. This improves code readability and maintainability by preventing the repetition of verbose window definitions.
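The frame from the previous section can be named once and reused by several aggregates; a sketch over the same hypothetical daily_production table:

```sql
SELECT day, kwh,
       min(kwh) OVER seven_days AS min_kwh,
       max(kwh) OVER seven_days AS max_kwh,
       avg(kwh) OVER seven_days AS avg_kwh
FROM daily_production
WINDOW seven_days AS (
    ORDER BY day
    RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING
);
```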
4.5.4 ACCESSING PRECEDING OR FOLLOWING ROWS IN A PARTITION
For time-series analysis, you often need to compare the current row to a specific previous or future row.
- Lag and Lead: The `lag()` function accesses a preceding row (e.g., yesterday's price), while `lead()` accesses a following row.
- Delta Calculation: These functions are essential for computing day-over-day changes or spotting trends without complex self-joins.
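A sketch of a day-over-day delta using `lag()` (hypothetical schema):

```sql
-- lag(kwh) returns the previous row's value within the ordering; NULL for the first row.
SELECT day, kwh,
       kwh - lag(kwh) OVER (ORDER BY day) AS change_vs_previous_day
FROM daily_production;
```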
4.6 CONDITIONS AND FILTERING OUTSIDE THE WHERE CLAUSE
Standard SQL uses WHERE to filter raw rows, but advanced analysis requires filtering based on computed values.
4.6.1 USING THE HAVING CLAUSE
The HAVING clause is applied after the GROUP BY aggregation. It is used to filter groups based on the result of an aggregate function, such as filtering for days where the sum of energy production exceeded 900 kWh.
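That example might look like this (a sketch, hypothetical schema):

```sql
SELECT date_trunc('day', read_on) AS day, sum(kwh) AS total_kwh
FROM readings
GROUP BY day
HAVING sum(kwh) > 900;   -- filters whole groups, not individual rows
```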
4.6.2 USING THE QUALIFY CLAUSE
QUALIFY is a powerful, non-standard clause supported by DuckDB that filters the results of window functions.
- Simplifying Queries: Instead of wrapping a window function in a CTE to filter on a rank (e.g., `WHERE rank <= 3`), you can write `QUALIFY dense_rank() OVER (...) <= 3` directly in the main query.
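A sketch of the "top 3 per system" query from section 4.5.1, now filtered with QUALIFY:

```sql
SELECT system_id, read_on, power
FROM readings
QUALIFY dense_rank() OVER (PARTITION BY system_id ORDER BY power DESC) <= 3;
```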
4.6.3 USING THE FILTER CLAUSE
Sometimes you need to feed only specific rows into an aggregate without filtering them out of the entire query.
- Selective Aggregation: You can append `FILTER (WHERE ...)` to any aggregate function. For example, calculating the average of only non-negative readings: `avg(power) FILTER (WHERE power >= 0)`. This allows you to compute "clean" metrics alongside "raw" counts in a single pass.
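A sketch combining a raw count with a filtered average in one pass (hypothetical schema):

```sql
SELECT count(*)                             AS raw_readings,
       avg(power) FILTER (WHERE power >= 0) AS avg_valid_power
FROM readings;
```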
4.7 THE PIVOT STATEMENT
Reporting often requires transforming data from a "long" format (normalized) to a "wide" format (spreadsheet style).
- Dynamic Pivoting: DuckDB simplifies this with the `PIVOT` statement. Unlike other databases that require hardcoding every column, DuckDB can dynamically pivot on arbitrary expressions using the `ON` clause.
- Example: `PIVOT table ON year(date) USING sum(kwh)` instantly turns years into column headers.
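Fleshed out slightly, and assuming the same hypothetical readings table:

```sql
-- One row per system, one column per year, each cell the summed kWh.
PIVOT readings ON year(read_on) USING sum(kwh) GROUP BY system_id;
```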
4.8 USING THE ASOF JOIN
One of DuckDB's standout features for financial and time-series analysis is the ASOF JOIN (As-Of Join).
- The Problem: Joining two time-series tables (e.g., Sales and Prices) is difficult because timestamps rarely align perfectly.
- The Solution: `ASOF JOIN` joins on inequality, finding the "closest preceding" value. This allows you to easily look up the valid price at the specific moment a sale occurred, even if the price table wasn't updated at that exact second.
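A sketch with hypothetical sales and prices tables:

```sql
-- For each sale, pick the most recent price whose valid_from is not after the sale.
SELECT s.sold_at, s.quantity, p.price
FROM sales s
ASOF JOIN prices p ON s.sold_at >= p.valid_from;
```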
4.9 USING TABLE FUNCTIONS
DuckDB treats functions that return collections of rows as Table Functions. These are versatile tools that can appear anywhere a table is expected in a SQL statement.
- Data Generation: Functions like `generate_series()` and `range()` allow analysts to create driving tables, such as a continuous list of dates, to fill gaps in sparse datasets or project future trends.
- External Data: Readers like `read_csv`, `read_parquet`, and spatial extensions are all implemented as table functions.
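A sketch of a driving table of dates used to fill gaps in a sparse, hypothetical daily_production table:

```sql
-- One row per calendar day, even if production data is missing for that day.
SELECT cal.day::DATE AS day, coalesce(p.kwh, 0) AS kwh
FROM generate_series(TIMESTAMP '2023-08-01', TIMESTAMP '2023-08-31', INTERVAL 1 DAY) AS cal(day)
LEFT JOIN daily_production p ON p.day = cal.day::DATE;
```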
4.10 USING LATERAL JOINS
For scenarios requiring "for-loop"-style logic within SQL, DuckDB supports LATERAL joins.
- Correlated Execution: A `LATERAL` join allows the inner subquery to reference columns from the preceding tables in the `FROM` clause.
- Unnesting API Data: This is particularly powerful for unnesting arrays. The book illustrates this by querying solar irradiance data from the Open-Meteo API and using `LATERAL` to unnest the JSON response into rows for analysis.
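A much-reduced sketch of that idea, assuming a hypothetical api_responses table whose hourly_irradiance column holds a list per row:

```sql
-- The lateral subquery runs once per outer row, turning each list into multiple rows.
SELECT r.valid_date, h.irradiance
FROM api_responses r,
     LATERAL (SELECT unnest(r.hourly_irradiance) AS irradiance) AS h;
```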
SUMMARY
- Modern SQL Compliance: DuckDB supports a broad range of modern SQL features including CTEs, window functions (SQL:2003), and list aggregations.
- Advanced Aggregation: Features like `GROUPING SETS`, `ROLLUP`, and `CUBE` allow for multi-level drill-down reports.
- Window Optimization: Window functions include support for named windows, framing, and optimized statistics.
- Flexible Filtering: `HAVING`, `QUALIFY`, and `FILTER` clauses provide precise control over data selection at different stages of the query lifecycle.
- Time-Series Mastery: `ASOF JOIN` solves common misalignment issues in time-series data.
- Dynamic Reshaping: `PIVOT` and `LATERAL` joins allow for complex data reshaping and "fan-out" operations.


