Chapter 4


Advanced Aggregation and Analysis of Data

This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

'DuckDB In Action' book cover

4.1 PRE-AGGREGATING DATA WHILE INGESTING

Data is rarely perfect upon arrival. In analytical workflows, you often face high-frequency sensor data or inconsistent intervals that need normalization before storage. The examples in this chapter use real-world photovoltaic data (e.g., from the NREL API) to demonstrate how DuckDB allows for pre-aggregation during the ingestion phase.

  • Time Bucketing: The chapter introduces the time_bucket() function, which truncates timestamps to specific intervals (e.g., 15 minutes). This is essential for aligning inconsistent time-series data from different sources into a uniform schema.
  • On-the-fly Cleaning: By combining ingestion with aggregation functions like avg(), you can clean data as it is read.
  • Documentation: For a complete list of date parsing and bucket capabilities, the book recommends consulting the DuckDB Timestamp Function documentation.
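The bucketing-while-ingesting idea can be sketched as follows. This is a minimal example, assuming a hypothetical readings.csv file with columns ts (a timestamp) and power (a numeric reading); the file and column names are illustrative, not from the book.

```sql
-- Ingest raw sensor data and normalize it to 15-minute buckets in one pass.
-- time_bucket() truncates each timestamp to the containing interval;
-- avg() collapses all readings that fall into the same bucket.
CREATE TABLE readings AS
SELECT time_bucket(INTERVAL '15 minutes', ts) AS period,
       avg(power)                             AS avg_power
FROM read_csv('readings.csv')
GROUP BY period;
```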

4.2 SUMMARIZING DATA

Before diving into complex joins and aggregates, analysts need to understand the shape of their data. DuckDB provides a unique and powerful SQL extension called SUMMARIZE.

  • Instant Statistics: Running SUMMARIZE table_name returns a wealth of meta-information, including column types, min/max values, null percentages, and approximate unique counts.
  • Distribution Metrics: It automatically calculates statistical quartiles (q25, q50, q75), standard deviation, and averages.
  • Zero-Copy Analysis: A key feature of DuckDB is that SUMMARIZE works directly on CSV or Parquet files without requiring data ingestion, allowing for rapid exploration of data lakes.
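Both usages can be sketched in two lines; the table and file names here are hypothetical placeholders.

```sql
-- Profile an existing table: one row of statistics per column.
SUMMARIZE readings;

-- Profile a Parquet file in place, without ingesting it first.
SUMMARIZE SELECT * FROM 'readings.parquet';
```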

4.3 ON SUBQUERIES

Subqueries are essential for staging computations, such as calculating nested aggregates (e.g., the average of a sum), which standard SQL cannot do in a single step.

  • Uncorrelated Subqueries: These operate independently of the outer query. DuckDB treats them as if the outer query runs on the result of the inner query.
  • Correlated Subqueries: These use values from the outer query to drive the inner query. DuckDB features a subquery decorrelation optimizer that rewrites these into efficient joins automatically, allowing developers to write expressive, readable SQL without worrying about performance penalties.
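A nested aggregate such as "the average of a sum" can be staged with an uncorrelated subquery. This is a minimal sketch, assuming a hypothetical readings table with columns system_id, ts, and kwh.

```sql
-- Inner query: total production per system per day.
-- Outer query: the average of those daily totals.
SELECT avg(daily_kwh) AS avg_daily
FROM (
    SELECT system_id,
           date_trunc('day', ts) AS day,
           sum(kwh)              AS daily_kwh
    FROM readings
    GROUP BY system_id, day
) AS daily;
```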

4.3.1 SUBQUERIES AS EXPRESSIONS

Subqueries act as expressions when they are not used as a relation in a JOIN.

  • Scalar Comparisons: Using standard operators (=, <, >) requires the subquery to return a single scalar row.
  • Set Comparisons: For subqueries returning multiple rows, DuckDB supports operators like IN, EXISTS, ANY, and ALL. This facilitates tasks such as identifying rows that match any value in a target set or all values in a generated list.
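A set comparison can be sketched like this, again assuming a hypothetical readings table; system 34 is an arbitrary illustrative reference value.

```sql
-- Keep only readings at least as high as every reading of system 34.
SELECT ts, power
FROM readings
WHERE power >= ALL (SELECT power FROM readings WHERE system_id = 34);
```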

4.4 GROUPING SETS

When building drill-down reports or dashboards, you often need aggregates at multiple levels of granularity (e.g., total sales, sales by year, and sales by year and system).

  • GROUPING SETS: Instead of writing multiple queries and unioning the results, DuckDB supports GROUPING SETS to define multiple buckets of aggregation in a single pass.
  • ROLLUP: This shorthand clause creates hierarchical aggregates (e.g., Year+Month, Year, and Grand Total).
  • CUBE: This generates aggregates for all possible combinations of the grouping columns (2^n grouping sets).
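The three granularities from the drill-down example can be produced in one pass. A minimal sketch, assuming the same hypothetical readings table (system_id, ts, kwh):

```sql
-- One scan, three levels: grand total, per year, and per year and system.
SELECT year(ts) AS yr, system_id, sum(kwh) AS total
FROM readings
GROUP BY GROUPING SETS ((), (year(ts)), (year(ts), system_id));

-- ROLLUP (yr, system_id) would be shorthand for the same hierarchy.
```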

4.5 WINDOW FUNCTIONS

Window functions are a cornerstone of modern analytics. Unlike standard aggregates that collapse rows, window functions allow you to calculate values based on a set of rows (a window) while retaining the individual identity of each row.

4.5.1 DEFINING PARTITIONS

To perform calculations relevant to specific groups (like the top 3 power readings per system), you must partition the data.

  • PARTITION BY: Breaks the relation into independent pieces. Window functions cannot access values outside the current partition.
  • Ordering: The ORDER BY clause within the window definition determines the order in which rows are processed inside each partition, which is vital for ranking functions like rank() (which leaves gaps after ties) and dense_rank() (which does not).
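Partitioned ranking can be sketched as follows, assuming the hypothetical readings table used above; the ranking restarts at 1 for each system.

```sql
-- Rank each system's readings from highest to lowest power.
-- dense_rank() produces 1, 2, 2, 3 for tied values; rank() would produce 1, 2, 2, 4.
SELECT system_id, ts, power,
       dense_rank() OVER (PARTITION BY system_id ORDER BY power DESC) AS rnk
FROM readings;
```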

4.5.2 FRAMING

Framing specifies exactly which rows relative to the current row are included in the calculation. This is commonly used to compute moving averages.

  • Range Definition: You can define frames using specific intervals, such as RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING, to calculate rolling statistics (e.g., a 7-day moving average) that smooth out volatile data.
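A 7-day moving average with that frame can be sketched as follows, assuming a hypothetical daily_production table with one row per day (columns ts and kwh).

```sql
-- For each day, average the surrounding 7-day window (3 days before,
-- the current day, and 3 days after) to smooth out volatility.
SELECT ts, kwh,
       avg(kwh) OVER (
           ORDER BY ts
           RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
                     AND INTERVAL 3 DAYS FOLLOWING
       ) AS moving_avg
FROM daily_production;
```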

4.5.3 NAMED WINDOWS

Complex queries often reuse the same window definition for multiple aggregates (e.g., calculating min, max, and quantiles over the same 7-day period).

  • The WINDOW Clause: DuckDB allows you to define a Named Window (e.g., WINDOW seven_days AS (...)) after the FROM clause. This improves code readability and maintainability by preventing the repetition of verbose window definitions.
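Reusing one definition for several aggregates can be sketched like this, with the same hypothetical daily_production table; the window name seven_days is arbitrary.

```sql
-- Define the 7-day window once, reference it three times.
SELECT ts,
       min(kwh) OVER seven_days AS lo,
       avg(kwh) OVER seven_days AS mid,
       max(kwh) OVER seven_days AS hi
FROM daily_production
WINDOW seven_days AS (
    ORDER BY ts
    RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
              AND INTERVAL 3 DAYS FOLLOWING
);
```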

4.5.4 ACCESSING PRECEDING OR FOLLOWING ROWS IN A PARTITION

For time-series analysis, you often need to compare the current row to a specific previous or future row.

  • Lag and Lead: The lag() function accesses a preceding row (e.g., yesterday's price), while lead() accesses a following row.
  • Delta Calculation: These functions are essential for computing day-over-day changes or spotting trends without complex self-joins.
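A day-over-day delta can be sketched in one expression, again assuming a hypothetical daily_production table.

```sql
-- Compare each day's production with the previous day's.
-- lag(kwh) returns NULL for the first row, so the first delta is NULL.
SELECT ts, kwh,
       kwh - lag(kwh) OVER (ORDER BY ts) AS day_over_day
FROM daily_production;
```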

4.6 CONDITIONS AND FILTERING OUTSIDE THE WHERE CLAUSE

Standard SQL uses WHERE to filter raw rows, but advanced analysis requires filtering based on computed values.

4.6.1 USING THE HAVING CLAUSE

The HAVING clause is applied after the GROUP BY aggregation. It is used to filter groups based on the result of an aggregate function, such as filtering for days where the sum of energy production exceeded 900 kWh.
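The 900 kWh example can be sketched as follows, assuming the hypothetical readings table (ts, kwh); note that the condition references the aggregate, which WHERE cannot do.

```sql
-- Keep only the days whose total production exceeded 900 kWh.
SELECT date_trunc('day', ts) AS day,
       sum(kwh)              AS total_kwh
FROM readings
GROUP BY day
HAVING sum(kwh) > 900;
```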

4.6.2 USING THE QUALIFY CLAUSE

QUALIFY is a powerful, non-standard clause supported by DuckDB that filters the results of window functions.

  • Simplifying Queries: Instead of wrapping a window function in a CTE to filter on a rank (e.g., WHERE rank <= 3), you can write QUALIFY dense_rank() OVER (...) <= 3 directly in the main query.
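The top-3-per-system query then collapses into a single statement. A minimal sketch with the hypothetical readings table:

```sql
-- Top 3 power readings per system, without a CTE or subquery:
-- QUALIFY filters on the window function's result directly.
SELECT system_id, ts, power
FROM readings
QUALIFY dense_rank() OVER (PARTITION BY system_id
                           ORDER BY power DESC) <= 3;
```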

4.6.3 USING THE FILTER CLAUSE

Sometimes you need to restrict an aggregate to specific rows without filtering those rows out of the entire query.

  • Selective Aggregation: You can append FILTER (WHERE ...) to any aggregate function. For example, calculating the average of only non-negative readings: avg(power) FILTER (WHERE power >= 0). This allows you to compute "clean" metrics alongside "raw" counts in a single pass.
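Computing a raw count and a cleaned average side by side can be sketched as follows, assuming the hypothetical readings table with a power column that may contain negative sensor glitches.

```sql
-- One scan: the raw row count alongside an average restricted to
-- non-negative readings. The FILTER applies only to its own aggregate.
SELECT count(*)                                  AS raw_count,
       avg(power) FILTER (WHERE power >= 0)      AS clean_avg
FROM readings;
```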

4.7 THE PIVOT STATEMENT

Reporting often requires transforming data from a "long" format (normalized) to a "wide" format (spreadsheet style).

  • Dynamic Pivoting: DuckDB simplifies this with the PIVOT statement. Unlike other databases that require hardcoding every column, DuckDB can dynamically pivot on arbitrary expressions using the ON clause.
  • Example: PIVOT table ON year(date) USING sum(kwh) instantly turns years into column headers.
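Spelled out against a concrete (hypothetical) daily_production table with columns ts and kwh, the statement looks like this:

```sql
-- Turn each distinct year of ts into its own column,
-- filled with that year's total kwh.
PIVOT daily_production ON year(ts) USING sum(kwh);
```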

4.8 USING THE ASOF JOIN

One of DuckDB's standout features for financial and time-series analysis is the ASOF JOIN (As-Of Join).

  • The Problem: Joining two time-series tables (e.g., Sales and Prices) is difficult because timestamps rarely align perfectly.
  • The Solution: ASOF JOIN joins on inequality, finding the "closest preceding" value. This allows you to easily look up the valid price at the specific moment a sale occurred, even if the price table wasn't updated at that exact second.
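The price-lookup scenario can be sketched as follows, assuming hypothetical sales (sale_id, ts) and prices (ts, price) tables where prices records each time the price changed.

```sql
-- For each sale, find the most recent price change at or before the sale.
-- The inequality condition makes this an as-of lookup rather than an
-- exact-timestamp match.
SELECT s.sale_id, s.ts, p.price
FROM sales s
ASOF JOIN prices p ON s.ts >= p.ts;
```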

4.9 USING TABLE FUNCTIONS

DuckDB treats functions that return collections of rows as Table Functions. These are versatile tools that can appear anywhere a table is expected in a SQL statement.

  • Data Generation: Functions like generate_series() and range() allow analysts to create driving tables—such as a continuous list of dates—to fill gaps in sparse datasets or project future trends.
  • External Data: Readers like read_csv, read_parquet, and spatial extensions are all implemented as table functions.
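Using generate_series() as a driving table to fill gaps can be sketched like this; the date range and the hypothetical readings table are illustrative.

```sql
-- Build a continuous list of days, then left-join sparse readings onto it
-- so that days with no data still appear (with a total of 0).
SELECT d.day, coalesce(sum(r.kwh), 0) AS total_kwh
FROM generate_series(DATE '2023-01-01',
                     DATE '2023-01-31',
                     INTERVAL 1 DAY) AS d(day)
LEFT JOIN readings r ON date_trunc('day', r.ts) = d.day
GROUP BY d.day
ORDER BY d.day;
```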

4.10 USING LATERAL JOINS

For scenarios requiring a "for-loop" logic within SQL, DuckDB supports LATERAL joins.

  • Correlated Execution: A LATERAL join allows the inner subquery to reference columns from the preceding tables in the FROM clause.
  • Unnesting API Data: This is particularly powerful for unnesting arrays. The book illustrates this by querying solar irradiance data from the Open-Meteo API and using LATERAL to unnest the JSON response into rows for analysis.
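The fan-out pattern can be sketched as follows. This assumes a hypothetical api_response table whose hourly column is a LIST of numeric values (e.g., one irradiance reading per hour), which is not the book's exact schema.

```sql
-- For each response row, the LATERAL subquery unnests its hourly list,
-- producing one output row per list element ("for-loop" semantics).
SELECT a.location, h.irradiance
FROM api_response a,
     LATERAL (SELECT unnest(a.hourly) AS irradiance) AS h;
```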

SUMMARY

  • Modern SQL Compliance: DuckDB supports a broad range of modern SQL features including CTEs, window functions (SQL:2003), and list aggregations.
  • Advanced Aggregation: Features like GROUPING SETS, ROLLUP, and CUBE allow for multi-level drill-down reports.
  • Window Features: Window functions support partitioning, framing, named windows, and access to neighboring rows via lag() and lead().
  • Flexible Filtering: HAVING, QUALIFY, and FILTER clauses provide precise control over data selection at different stages of the query lifecycle.
  • Time-Series Mastery: ASOF JOIN solves common misalignment issues in time-series data.
  • Dynamic Reshaping: PIVOT and LATERAL joins allow for complex data reshaping and "fan-out" operations.