Chapter 3

Executing SQL Queries

This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

3.1 A Quick SQL Recap

DuckDB is designed to "tickle your SQL brain" by supporting a highly standard-compliant SQL dialect. In this section, the authors review the anatomy of a SQL command within the DuckDB CLI and other clients.

Key takeaways include:

  • Structure: Commands are composed of statements (like SELECT) and clauses (like WHERE, GROUP BY, ORDER BY).
  • Flexibility: DuckDB handles whitespace freely and is case-insensitive for keywords and identifiers, allowing users to format queries for maximum readability.
  • Execution: Commands in the CLI are terminated with a semicolon.
  • Core Clauses: The chapter reviews the logical order of operations: WHERE filters rows before aggregation, GROUP BY buckets rows by their grouping keys, and ORDER BY sorts the final result set (a minimal example follows this list).
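A minimal sketch of that shape, assuming a hypothetical readings table (the actual dataset is introduced in section 3.2):

```sql
-- Filter first (WHERE), then bucket (GROUP BY), then sort (ORDER BY)
SELECT system_id,
       count(*) AS num_readings
FROM readings
WHERE power > 0
GROUP BY system_id
ORDER BY num_readings DESC;
```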

3.2 Analyzing Energy Production

To demonstrate OLAP capabilities in a real-world context, the chapter introduces a concrete dataset: Photovoltaic Data Acquisition (PVDAQ) from the U.S. Department of Energy.

3.2.1 Downloading the dataset

This section outlines the data sources and ingestion methods. The dataset consists of time-series data representing energy readings (in 15-minute intervals) and market prices, fully documented on GitHub. The chapter demonstrates using the httpfs extension to load CSV data directly from S3 URLs (such as the systems.csv file) or via the NREL API without downloading files locally first.
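A minimal sketch of that remote ingestion; the URL below is a placeholder, not the book's actual location:

```sql
-- The httpfs extension enables querying files over HTTP(S) and S3
INSTALL httpfs;
LOAD httpfs;

-- Read a remote CSV directly, without downloading it first
SELECT *
FROM read_csv_auto('https://example.com/pvdaq/systems.csv');
```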

3.2.2 The target schema

The data is modeled into a normalized schema to support joins and aggregation:

  • systems: Metadata about the solar panels (using an external ID as a surrogate key).
  • readings: The actual power output (composite keys using system ID and timestamps).
  • prices: Energy prices per kilowatt-hour (kWh), utilizing sequences for IDs.
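For the sequence-backed IDs in prices, a sketch might look like this (column names are assumptions, not necessarily the book's exact schema):

```sql
-- A sequence supplies auto-incrementing IDs for the prices table
CREATE SEQUENCE prices_id_seq;

CREATE TABLE prices (
    id         INTEGER PRIMARY KEY DEFAULT nextval('prices_id_seq'),
    value      DECIMAL(5,2) NOT NULL,  -- price per kWh
    valid_from DATE NOT NULL
);
```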

3.3 Data Definition Language (DDL) Queries

DuckDB is a full-fledged Relational Database Management System (RDBMS). This section details how to structure data persistence using Data Definition Language (DDL), referencing supported SQL statements.

3.3.1 The CREATE TABLE statement

The authors demonstrate creating tables with specific data types (INTEGER, DECIMAL, TIMESTAMP). They emphasize the use of constraints—PRIMARY KEY, FOREIGN KEY, CHECK, and NOT NULL—to ensure data integrity. Readers are advised to consult the documentation on indexes and their limitations, as these constraints can impact bulk-loading performance.
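A sketch of how the systems and readings tables might declare those constraints (column names are assumptions):

```sql
CREATE TABLE systems (
    id   INTEGER PRIMARY KEY,  -- the external system ID
    name VARCHAR NOT NULL
);

CREATE TABLE readings (
    system_id INTEGER NOT NULL REFERENCES systems (id),   -- foreign key
    read_on   TIMESTAMP NOT NULL,
    power     DECIMAL(10,3) NOT NULL CHECK (power >= 0),  -- sanity check
    PRIMARY KEY (system_id, read_on)                      -- composite key
);
```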

3.3.2 The ALTER TABLE statement

Schema requirements often evolve. This section covers ALTER TABLE for adding columns (e.g., adding a validity date to prices) or renaming columns. It also introduces Create Table As Select (CTAS), a powerful shortcut to duplicate table structures and content in a single command, such as CREATE TABLE prices_duplicate AS SELECT * FROM prices.
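Sketches of both operations against the assumed prices table (the rename is illustrative only; later snippets keep the original column name):

```sql
-- Evolve the schema: add a validity date, rename a column
ALTER TABLE prices ADD COLUMN valid_until DATE;
ALTER TABLE prices RENAME COLUMN value TO price_per_kwh;

-- CTAS: duplicate structure and content in a single command
CREATE TABLE prices_duplicate AS SELECT * FROM prices;
```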

3.3.3 The CREATE VIEW statement

The chapter explains CREATE VIEW as a method to encapsulate complex logic—such as converting Watts to kWh per day—creating an internal API for the database.
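A sketch of such a view, assuming 15-minute readings measured in watts (each reading therefore contributes power / 4 watt-hours):

```sql
-- Encapsulate the W -> kWh-per-day conversion behind a view
CREATE VIEW v_power_per_day AS
SELECT system_id,
       date_trunc('day', read_on)      AS day,
       round(sum(power) / 4 / 1000, 2) AS kwh
FROM readings
GROUP BY system_id, date_trunc('day', read_on);
```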

3.3.4 The DESCRIBE statement

The DESCRIBE statement is highlighted as a crucial tool for introspection. It works not just on tables, but on views, queries, and even remote CSV/Parquet files to preview schemas before ingestion.
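For example (the remote URL is a placeholder):

```sql
DESCRIBE readings;                               -- a table
DESCRIBE v_power_per_day;                        -- a view
DESCRIBE SELECT system_id, power FROM readings;  -- an arbitrary query
DESCRIBE SELECT * FROM 'https://example.com/data.parquet';  -- a remote file
```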

3.4 Data Manipulation Language (DML) Queries

DML is the engine of data analysis. This section moves beyond simple queries to complex data transformations. Code examples are available in the GitHub repository.

3.4.1 The INSERT statement

The chapter explores robust data entry strategies:

  • Idempotency: Using INSERT INTO ... ON CONFLICT DO NOTHING to safely handle duplicate data.
  • Insert from Select: Data pipelines often feed the result of a SELECT query over a raw file directly into a structured table, as sketched below.
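A sketch combining both ideas (the file name is hypothetical):

```sql
-- Feed raw CSV rows into the structured table, silently skipping duplicates
INSERT INTO readings (system_id, read_on, power)
SELECT system_id, read_on, power
FROM read_csv_auto('raw_readings.csv')
ON CONFLICT DO NOTHING;
```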

3.4.2 Merging data

For refining messy datasets, the authors demonstrate ON CONFLICT DO UPDATE. This "upsert" capability allows new data to merge with existing records, such as averaging a new reading with an existing one if a conflict occurs on the composite key.
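A sketch of that upsert, assuming the composite key from the earlier schema (the literal values are made up):

```sql
-- On a duplicate (system_id, read_on), average the old and new power values
INSERT INTO readings VALUES (1200, '2023-06-01 12:15:00', 430.0)
ON CONFLICT (system_id, read_on)
DO UPDATE SET power = (power + EXCLUDED.power) / 2;
```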

3.4.3 The DELETE statement

The authors show how to clean data sets (e.g., removing readings taken at irregular minute intervals) using DELETE combined with date functions like date_part.
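For example:

```sql
-- Keep only readings taken on the quarter hour
DELETE FROM readings
WHERE date_part('minute', read_on) NOT IN (0, 15, 30, 45);
```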

3.4.4 The SELECT statement

This is the core of the chapter, detailing how to retrieve and reshape data (see the SELECT documentation).

  • The VALUES clause: DuckDB allows VALUES to be used as a standalone statement (e.g., to mock data) or within a FROM clause to generate virtual tables on the fly.
  • The JOIN clause: The chapter covers INNER, LEFT/RIGHT OUTER, and FULL OUTER joins. It specifically highlights the ASOF JOIN as a solution for the energy use case, allowing readings to be matched with the most recent valid price ("as of" the reading time) without requiring exact timestamp matches (sketched after this list).
  • The COPY TO command: A sidebar highlights how to build data pipelines using COPY (SELECT ...) TO 'file.csv'. This allows users to join data from multiple files and export a clean, deduplicated result to a single output file.
  • The WITH clause (CTEs): Using Common Table Expressions to break complex logic into readable, modular parts.
  • Recursive Queries: A detailed example of WITH RECURSIVE to query graph-shaped structures (tree traversal).
  • Aggregates: Beyond standard SUM and COUNT, the chapter introduces advanced DuckDB aggregates like arg_max (finding the row associated with a maximum value), list (creating arrays), and statistical functions.
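A sketch of the ASOF JOIN for the energy use case, wrapped in COPY TO to illustrate the export pipeline as well (column names follow the earlier assumed schema):

```sql
-- Match each reading with the most recent price valid at reading time,
-- then export the joined result to a single CSV file
COPY (
    SELECT r.system_id, r.read_on, r.power, p.value AS price_per_kwh
    FROM readings r
    ASOF JOIN prices p ON r.read_on >= p.valid_from
) TO 'readings_with_prices.csv' (HEADER);
```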

3.5 DuckDB-Specific SQL Extensions

DuckDB aims to make SQL more "user-friendly." This section showcases dialect extensions specific to DuckDB that solve common pain points in standard SQL.

3.5.1 Dealing with SELECT

  • Exclude: SELECT * EXCLUDE (col_name) returns all columns except specific ones.
  • Replace: SELECT * REPLACE (expression AS col_name) modifies a column (e.g., rounding values) while keeping the rest of the star-select intact.
  • Columns Expression: Using regular expressions or lambda functions to select columns dynamically. For example, SELECT COLUMNS('valid.*') grabs all columns starting with "valid".
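One example of each, against the assumed prices table:

```sql
SELECT * EXCLUDE (id) FROM prices;                        -- drop one column
SELECT * REPLACE (round(value, 2) AS value) FROM prices;  -- tweak one column
SELECT COLUMNS('valid.*') FROM prices;                    -- regex selection
```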

3.5.2 Inserting by name

INSERT INTO ... BY NAME automatically maps source columns to target columns by name, reducing the fragility of positional inserts.
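A sketch; note that the source columns can appear in any order:

```sql
-- Columns are matched by name, not position; omitted columns get defaults
INSERT INTO prices BY NAME
SELECT DATE '2023-01-01' AS valid_from, 0.25 AS value;
```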

3.5.3 Accessing aliases everywhere

Unlike standard SQL, DuckDB allows you to reuse column aliases defined in the SELECT clause immediately within the WHERE and GROUP BY clauses, significantly reducing code duplication.
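A sketch of alias reuse (standard SQL would force repeating the expressions):

```sql
SELECT date_part('year', read_on) AS year,
       sum(power) / 4 / 1000      AS kwh
FROM readings
WHERE year >= 2020  -- alias reused in WHERE
GROUP BY year;      -- and in GROUP BY
```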

3.5.4 Grouping and ordering by all relevant columns

The introduction of GROUP BY ALL and ORDER BY ALL removes boilerplate: GROUP BY ALL automatically groups by every non-aggregated column in the SELECT list, and ORDER BY ALL sorts by all selected columns from left to right, making ad-hoc analysis faster and less verbose.
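For example:

```sql
SELECT system_id,
       date_part('year', read_on) AS year,
       sum(power) / 4 / 1000      AS kwh
FROM readings
GROUP BY ALL   -- groups by system_id and year
ORDER BY ALL;  -- sorts by all three columns, left to right
```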

3.5.5 Sampling data

The USING SAMPLE clause allows analysts to instantly query a percentage or fixed number of rows from massive datasets. The sampling documentation covers the specific algorithms used (Bernoulli or Reservoir methods).
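For example:

```sql
SELECT * FROM readings USING SAMPLE 10 PERCENT (bernoulli);  -- ~10% of rows
SELECT * FROM readings USING SAMPLE 1000 ROWS;               -- reservoir sample
```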

3.5.6 Functions with optional parameters

DuckDB functions, such as read_json_auto, often support named, optional parameters. This allows users to specify configuration like dateformat without needing to provide arguments for every single parameter in a specific order.
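For example (the file name is hypothetical):

```sql
-- Name only the parameters that matter; everything else keeps its default
SELECT *
FROM read_json_auto('prices.json', dateformat = '%Y-%m-%d');
```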

Summary

  • Structure: SQL queries are built from statements and clauses. They are broadly categorized into Data Definition Language (DDL) for structure and Data Manipulation Language (DML) for data operations.
  • Scope of DML: DML queries cover creating, reading, updating, and deleting rows. In DuckDB, reading data is considered manipulation as it involves transforming existing relations into new result sets.
  • Persistence: DDL queries (CREATE TABLE, CREATE VIEW) define the database schema and work identically whether DuckDB runs in-memory or against a persistent file on disk.
  • Data Integrity: Defining a rigid schema helps identify data inconsistencies that might be missed in schemaless systems. Constraint errors during ingestion can be handled gracefully using ON CONFLICT clauses.
  • Innovation: DuckDB significantly lowers the barrier to writing complex SQL with features like SELECT * EXCLUDE, SELECT * REPLACE, and intuitive alias usage that works across different clauses.