Chapter 3


Executing SQL Queries

INFO: This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

3.1 A Quick SQL Recap

DuckDB is designed to "tickle your SQL brain" by supporting a highly standard-compliant SQL dialect. In this section, the authors review the anatomy of a SQL command within the DuckDB CLI and other clients.

Key takeaways include:

  • Structure: Commands are composed of statements (like SELECT) and clauses (like WHERE, GROUP BY, ORDER BY).
  • Flexibility: DuckDB handles whitespace freely and is case-insensitive for keywords and identifiers, allowing users to format queries for maximum readability.
  • Execution: Commands in the CLI are terminated with a semicolon.
  • Core Clauses: The chapter reviews the logical order of operations: WHERE filters rows before aggregation, GROUP BY buckets rows by their grouping keys, and ORDER BY sorts the final result set.
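
For illustration, here is a minimal sketch of that clause order, using a hypothetical readings table (system_id, read_on, power) resembling the one introduced later in the chapter:

```sql
SELECT system_id,
       avg(power) AS avg_power   -- aggregate computed per group
FROM readings
WHERE power > 0                  -- WHERE filters rows before aggregation
GROUP BY system_id               -- GROUP BY buckets rows by key
ORDER BY avg_power DESC;         -- ORDER BY sorts the final result set
```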

3.2 Analyzing Energy Production

To demonstrate OLAP capabilities in a real-world context, the chapter introduces a concrete dataset: Photovoltaic Data Acquisition (PVDAQ) from the U.S. Department of Energy.

3.2.1 Downloading the dataset

This section outlines the data sources and ingestion methods. The dataset consists of time-series data representing energy readings (in 15-minute intervals) and market prices, fully documented on GitHub. The chapter demonstrates using the httpfs extension to load CSV data directly from S3 URLs (such as the systems.csv file) or via the NREL API without downloading files locally first.
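
As a rough sketch of this approach (the URL below is a placeholder, not the chapter's actual file location), the httpfs extension lets a query read a remote CSV in place:

```sql
INSTALL httpfs;  -- one-time installation of the extension
LOAD httpfs;     -- load it into the current session

-- Placeholder URL; the book's GitHub repository documents the real locations
SELECT count(*)
FROM read_csv_auto('https://example.com/path/to/systems.csv');
```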

3.2.2 The target schema

The data is modeled into a normalized schema to support joins and aggregation:

  • systems: Metadata about the solar panels (using an external ID as a surrogate key).
  • readings: The actual power output (composite keys using system ID and timestamps).
  • prices: Energy prices per kilowatt-hour (kWh), utilizing sequences for IDs.
Figure 3.1: Energy consumption schema showing the relationships between readings, systems, and prices tables

3.3 Data Definition Language (DDL) Queries

DuckDB is a full-fledged Relational Database Management System (RDBMS). This section details how to structure data persistence using Data Definition Language (DDL), referencing supported SQL statements.

DDL vs DML: SQL Statement Categories

| Category | Purpose | Key Statements | When to Use |
| --- | --- | --- | --- |
| DDL (Data Definition) | Define database structure | CREATE, ALTER, DROP | Schema design, table creation |
| DML (Data Manipulation) | Work with data | SELECT, INSERT, UPDATE, DELETE | Querying, data changes |
| DCL (Data Control) | Manage permissions | GRANT, REVOKE | Access control |
| TCL (Transaction Control) | Manage transactions | COMMIT, ROLLBACK | Data integrity |

3.3.1 The CREATE TABLE statement

The authors demonstrate creating tables with specific data types (INTEGER, DECIMAL, TIMESTAMP). They emphasize the use of constraints—PRIMARY KEY, FOREIGN KEY, CHECK, and NOT NULL—to ensure data integrity. Readers are advised to check the documentation on indexes and limitations as these constraints can impact bulk loading performance.
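
A sketch of what such a definition could look like for the readings table (column names follow this summary; the book's exact DDL may differ):

```sql
CREATE TABLE IF NOT EXISTS readings (
    system_id INTEGER NOT NULL,                  -- NOT NULL: a reading must belong to a system
    read_on   TIMESTAMP NOT NULL,                -- time of the 15-minute reading
    power     DECIMAL(10, 3) NOT NULL DEFAULT 0
              CHECK (power >= 0),                -- CHECK: no negative power output
    PRIMARY KEY (system_id, read_on)             -- composite primary key
);
```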

DuckDB Table Constraints Reference

| Constraint | Purpose | Example | Performance Impact |
| --- | --- | --- | --- |
| PRIMARY KEY | Unique row identifier | id INTEGER PRIMARY KEY | Creates index |
| FOREIGN KEY | Referential integrity | REFERENCES other_table(id) | Creates index |
| NOT NULL | Prevent missing values | name VARCHAR NOT NULL | Minimal |
| CHECK | Custom validation | CHECK(power >= 0) | Evaluated on insert |
| UNIQUE | Prevent duplicates | UNIQUE (valid_from) | Creates index |
| DEFAULT | Auto-fill values | DEFAULT 0 | None |

3.3.2 The ALTER TABLE statement

Schema requirements often evolve. This section covers ALTER TABLE for adding columns (e.g., adding a validity date to prices) or renaming columns. It also introduces Create Table As Select (CTAS), a powerful shortcut to duplicate table structures and content in a single command, such as CREATE TABLE prices_duplicate AS SELECT * FROM prices.
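
Hedged sketches of both operations, assuming the prices table described above (column names here are illustrative):

```sql
-- Evolve the schema: add a validity date to prices
ALTER TABLE prices ADD COLUMN valid_until DATE;

-- Renaming a column works the same way
ALTER TABLE prices RENAME COLUMN valid_until TO valid_to;

-- CTAS: duplicate structure and content in a single statement
CREATE TABLE prices_duplicate AS SELECT * FROM prices;
```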

3.3.3 The CREATE VIEW statement

The chapter explains CREATE VIEW as a method to encapsulate complex logic—such as converting Watts to kWh per day—creating an internal API for the database.
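
A sketch of such a view over the readings table; the view name is illustrative, and the conversion shown (four 15-minute watt readings per hour, divided by 1,000) is one plausible way to express kWh per day, not necessarily the book's exact formula:

```sql
CREATE OR REPLACE VIEW v_power_per_day AS
SELECT system_id,
       date_trunc('day', read_on)      AS day,
       round(sum(power) / 4 / 1000, 2) AS kwh   -- 15-minute watt readings -> kWh per day
FROM readings
GROUP BY ALL;
```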

3.3.4 The DESCRIBE statement

The DESCRIBE statement is highlighted as a crucial tool for introspection. It works not just on tables, but on views, queries, and even remote CSV/Parquet files to preview schemas before ingestion.
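
For example (the CSV path below is a placeholder):

```sql
DESCRIBE readings;                       -- columns and types of a table
DESCRIBE SELECT * FROM v_power_per_day;  -- shape of a view or arbitrary query
DESCRIBE SELECT * FROM 'systems.csv';    -- preview a file's schema before ingesting it
```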

3.4 Data Manipulation Language (DML) Queries

DML is the engine of data analysis. This section moves beyond simple queries to complex data transformations. Code examples are available in the GitHub repository.

3.4.1 The INSERT statement

The chapter explores robust data entry strategies:

  • Idempotency: Using INSERT INTO ... ON CONFLICT DO NOTHING to safely handle duplicate data.
  • Insert from Select: Pipelines often pipe data directly from a SELECT query on a raw file into a structured table.
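
A sketch combining both ideas (the values and file name are placeholders):

```sql
-- Idempotent insert: rows whose key already exists are silently skipped
INSERT INTO readings (system_id, read_on, power)
VALUES (10, '2023-06-05 10:00:00', 925.0)
ON CONFLICT DO NOTHING;

-- Insert from select: pipe rows from a raw file straight into the structured table
INSERT INTO readings (system_id, read_on, power)
SELECT system_id, read_on, power
FROM read_csv_auto('raw_readings.csv')
ON CONFLICT DO NOTHING;
```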

3.4.2 Merging data

For refining messy datasets, the authors demonstrate ON CONFLICT DO UPDATE. This "upsert" capability allows new data to merge with existing records, such as averaging a new reading with an existing one if a conflict occurs on the composite key.

Handling Duplicates: ON CONFLICT Options

| Strategy | Syntax | Use Case |
| --- | --- | --- |
| Ignore duplicates | ON CONFLICT DO NOTHING | Idempotent inserts, skip existing |
| Update on conflict | ON CONFLICT DO UPDATE SET col = value | Upsert/merge new data |
| Replace entirely | INSERT OR REPLACE | Shorthand for full replacement |
| Conditional update | ON CONFLICT DO UPDATE SET col = CASE... | Complex merge logic |
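
A sketch of the averaging upsert described above; excluded refers to the incoming row that triggered the conflict:

```sql
INSERT INTO readings (system_id, read_on, power)
VALUES (10, '2023-06-05 10:00:00', 875.0)
ON CONFLICT (system_id, read_on)                     -- conflict on the composite key
DO UPDATE SET power = (power + excluded.power) / 2;  -- average existing and new reading
```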

3.4.3 The DELETE statement

The authors show how to clean data sets (e.g., removing readings taken at irregular minute intervals) using DELETE combined with date functions like date_part.
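
One way such a cleanup might look, assuming readings are expected on quarter-hour boundaries:

```sql
-- Remove readings that were not taken at a regular 15-minute interval
DELETE FROM readings
WHERE date_part('minute', read_on) % 15 <> 0;
```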

3.4.4 The SELECT statement

This is the core of the chapter, detailing how to retrieve and reshape data (see the SELECT documentation).

  • The VALUES clause: DuckDB allows VALUES to be used as a standalone statement (e.g., to mock data) or within a FROM clause to generate virtual tables on the fly.
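
Two minimal sketches of both usages:

```sql
-- Standalone VALUES: a quick way to mock up a result set
VALUES (1, 'first'), (2, 'second');

-- VALUES in a FROM clause: a virtual table with named columns
SELECT *
FROM (VALUES (1, 'first'), (2, 'second')) AS t(id, label);
```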

The GROUP BY Clause

The GROUP BY clause produces one output row per unique combination of the specified columns. Within each group, values are aggregated with functions such as count, sum, avg, min, or max.

Figure 3.2: Visualization of how GROUP BY groups data by year and applies aggregate functions
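
A sketch of such a grouping over the readings table (column names follow this summary):

```sql
-- One output row per system and year
SELECT system_id,
       year(read_on) AS year,
       count(*)      AS num_readings,
       max(power)    AS peak_power
FROM readings
GROUP BY system_id, year(read_on)
ORDER BY system_id, year;
```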
  • The JOIN clause: The chapter covers INNER, LEFT/RIGHT OUTER, and FULL OUTER joins. It specifically highlights the ASOF JOIN as a solution for the energy use case, allowing readings to be matched with the most recent valid price ("as of" the reading time) without requiring exact timestamp matches.

Understanding INNER JOIN

An inner join matches rows from the left-hand side with rows from the right-hand side whose join columns hold the same value. Rows without a match are excluded.

Figure 3.3: Inner join diagram showing how only matching pairs with equal keys are returned

Understanding OUTER JOINs

Outer joins fill in NULL values for rows that have no matching entry on the other side. LEFT OUTER includes all left rows, RIGHT OUTER includes all right rows, and FULL OUTER includes all rows from both tables.

Figure 3.4: Types of outer joins - LEFT, FULL, and RIGHT outer joins visualized

DuckDB JOIN Types Comparison

| JOIN Type | Returns | NULL Handling | Best For |
| --- | --- | --- | --- |
| INNER JOIN | Only matching rows from both tables | Excludes non-matches | Required relationships |
| LEFT OUTER JOIN | All left rows + matching right | NULL for unmatched right | Optional enrichment |
| RIGHT OUTER JOIN | All right rows + matching left | NULL for unmatched left | Reverse of LEFT |
| FULL OUTER JOIN | All rows from both tables | NULL for either side | Complete dataset merge |
| CROSS JOIN | Cartesian product (all combinations) | N/A | Generating combinations |
| ASOF JOIN | Nearest match by inequality | Temporal/range matching | Time-series data |
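
A sketch of the ASOF JOIN for the energy use case; the prices columns (value, valid_from) follow this summary and may differ from the book's exact schema:

```sql
-- Pair each reading with the price that was valid at (or before) its timestamp
SELECT r.system_id,
       r.read_on,
       r.power,
       p.value AS price_per_kwh
FROM readings r
  ASOF JOIN prices p
    ON r.read_on >= p.valid_from;
```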
  • The COPY TO command: A sidebar highlights how to build data pipelines using COPY (SELECT ...) TO 'file.csv'. This allows users to join data from multiple files and export a clean, deduplicated result to a single output file.
  • The WITH clause (CTEs): Using Common Table Expressions to break complex logic into readable, modular parts.
  • Recursive Queries: A detailed example of WITH RECURSIVE to query graph-shaped structures (tree traversal).
  • Aggregates: Beyond standard SUM and COUNT, the chapter introduces advanced DuckDB aggregates like arg_max (finding the row associated with a maximum value), list (creating arrays), and statistical functions.
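
Picking up the WITH clause from the list above, a minimal non-recursive sketch (the names and the watt-to-kWh conversion are illustrative):

```sql
WITH daily AS (                              -- named, reusable intermediate step
    SELECT system_id,
           date_trunc('day', read_on) AS day,
           sum(power) / 4 / 1000      AS kwh
    FROM readings
    GROUP BY ALL
)
SELECT system_id,
       avg(kwh) AS avg_daily_kwh
FROM daily
GROUP BY system_id;
```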

Essential DuckDB Aggregate Functions

| Function | Purpose | Example |
| --- | --- | --- |
| COUNT(*) | Count all rows | SELECT COUNT(*) FROM readings |
| SUM(col) | Sum values | SELECT SUM(power) FROM readings |
| AVG(col) | Average value | SELECT AVG(power) FROM readings |
| MIN/MAX(col) | Extreme values | SELECT MAX(power) FROM readings |
| arg_max(expr, col) | Value at maximum | SELECT arg_max(read_on, power) |
| arg_min(expr, col) | Value at minimum | SELECT arg_min(read_on, power) |
| list(col) | Aggregate into array | SELECT list(name) FROM systems |
| first(col) | First value in group | SELECT first(power) FROM readings |
| any_value(col) | Any value from group | SELECT any_value(name) |
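
For instance, arg_max can answer "when did each system reach its peak output?" in a single aggregate, as in this sketch:

```sql
SELECT system_id,
       max(power)              AS peak_power,
       arg_max(read_on, power) AS peak_time   -- read_on of the row holding the maximum power
FROM readings
GROUP BY system_id;
```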

3.5 DuckDB-Specific SQL Extensions

DuckDB aims to make SQL more "user-friendly." This section showcases DuckDB-specific extensions that solve common pain points in standard SQL.

DuckDB SQL Extensions vs Standard SQL

| Feature | Standard SQL | DuckDB Extension | Benefit |
| --- | --- | --- | --- |
| Exclude columns | Must list all wanted columns | SELECT * EXCLUDE (col) | Simpler queries |
| Replace columns | Requires full column list | SELECT * REPLACE (expr AS col) | In-place transforms |
| Dynamic columns | Not available | SELECT COLUMNS('pattern') | Regex column selection |
| Alias in WHERE | Not allowed | Fully supported | Less repetition |
| Auto GROUP BY | Must list all columns | GROUP BY ALL | Faster ad-hoc queries |
| Auto ORDER BY | Must list all columns | ORDER BY ALL | Convenient sorting |
| Data sampling | Vendor-specific | USING SAMPLE n% | Built-in sampling |

3.5.1 Dealing with SELECT

  • Exclude: SELECT * EXCLUDE (col_name) returns all columns except specific ones.
  • Replace: SELECT * REPLACE (expression AS col_name) modifies a column (e.g., rounding values) while keeping the rest of the star-select intact.
  • Columns Expression: Using regular expressions or lambda functions to select columns dynamically. For example, SELECT COLUMNS('valid.*') grabs all columns starting with "valid".
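
Hedged sketches of all three, assuming the prices and readings columns used earlier in this summary:

```sql
-- All columns except the internal id
SELECT * EXCLUDE (id) FROM prices;

-- Keep the star-select, but round power in place
SELECT * REPLACE (round(power, 0) AS power) FROM readings;

-- All columns whose names start with "valid"
SELECT COLUMNS('valid.*') FROM prices;
```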

3.5.2 Inserting by name

INSERT INTO ... BY NAME automatically maps source columns to target columns by name, reducing the fragility of positional inserts.
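
A sketch (the systems columns id and name follow this summary, and the values are placeholders):

```sql
-- The SELECT's column order no longer has to match the table definition
INSERT INTO systems BY NAME
SELECT 'Test system' AS name, 9999 AS id;
```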

3.5.3 Accessing aliases everywhere

Unlike standard SQL, DuckDB allows you to reuse column aliases defined in the SELECT clause immediately within the WHERE and GROUP BY clauses, significantly reducing code duplication.
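
For example:

```sql
SELECT power / 1000 AS kw
FROM readings
WHERE kw > 1          -- alias from the SELECT clause, reused directly
ORDER BY kw DESC;
```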

3.5.4 Grouping and ordering by all relevant columns

The introduction of GROUP BY ALL and ORDER BY ALL automatically infers the non-aggregated columns, making ad-hoc analysis faster and less verbose.
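
A sketch of both, again over the readings table:

```sql
SELECT system_id,
       year(read_on) AS year,
       sum(power)    AS total_power
FROM readings
GROUP BY ALL    -- infers system_id and year as grouping keys
ORDER BY ALL;   -- sorts by all selected columns, left to right
```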

3.5.5 Sampling data

The USING SAMPLE clause allows analysts to instantly query a percentage or fixed number of rows from massive datasets. The sampling documentation covers the specific algorithms used (Bernoulli or Reservoir methods).
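
A few sketches of the clause (the percentages and row counts are arbitrary):

```sql
-- Roughly 10% of the rows (system sampling is the default for percentages)
SELECT * FROM readings USING SAMPLE 10%;

-- Row-by-row Bernoulli sampling
SELECT * FROM readings USING SAMPLE 10 PERCENT (bernoulli);

-- A fixed number of rows via reservoir sampling, reproducible with a seed
SELECT * FROM readings USING SAMPLE reservoir(1000 ROWS) REPEATABLE (42);
```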

3.5.6 Functions with optional parameters

DuckDB functions, such as read_json_auto, often support named, optional parameters. This allows users to specify configuration like dateformat without needing to provide arguments for every single parameter in a specific order.
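
A sketch with read_json_auto (the file name is a placeholder; only the parameters that matter are named):

```sql
SELECT *
FROM read_json_auto('events.json',
                    format     = 'array',      -- file contains a JSON array of records
                    dateformat = '%Y-%m-%d');  -- how DATE fields should be parsed
```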

Summary

  • Structure: SQL queries are built from statements and clauses. They are broadly categorized into Data Definition Language (DDL) for structure and Data Manipulation Language (DML) for data operations.
  • Scope of DML: DML queries cover creating, reading, updating, and deleting rows. In DuckDB, reading data is considered manipulation as it involves transforming existing relations into new result sets.
  • Persistence: DDL queries (CREATE TABLE, CREATE VIEW) define a persistent schema. This works identically whether DuckDB is running in-memory or on disk.
  • Data Integrity: Defining a rigid schema helps identify data inconsistencies that might be missed in schemaless systems. Constraint errors during ingestion can be handled gracefully using ON CONFLICT clauses.
  • Innovation: DuckDB significantly lowers the barrier to writing complex SQL with features like SELECT * EXCLUDE, SELECT * REPLACE, and intuitive alias usage that works across different clauses.

Continue Reading: Chapter 2: The DuckDB CLI | Chapter 4: Advanced Analytics

Related: Learn how MotherDuck extends DuckDB to the cloud with serverless analytics.


FAQs

What is the difference between DDL and DML in SQL?

DDL (Data Definition Language) defines database structure using statements like CREATE TABLE, ALTER TABLE, and DROP TABLE. DML (Data Manipulation Language) works with the actual data using SELECT, INSERT, UPDATE, and DELETE. Think of DDL as building the container (tables, views, schemas) and DML as filling and querying that container. In DuckDB, both work identically whether running in-memory or with persistent storage.

How do JOINs work in DuckDB?

DuckDB supports all standard SQL JOIN types: INNER JOIN returns only matching rows, LEFT/RIGHT OUTER JOIN includes all rows from one side with NULLs for non-matches, and FULL OUTER JOIN includes all rows from both tables. DuckDB also offers ASOF JOIN for time-series data, which matches rows based on the nearest timestamp rather than exact equality—perfect for joining readings with the most recent valid price. Use JOIN ... USING (column) when both tables share a column name, or JOIN ... ON for different column names.

What is a Common Table Expression (CTE) in DuckDB?

A Common Table Expression (CTE) is a temporary named result set defined with the WITH clause that exists only for the duration of a query. CTEs make complex queries more readable by breaking them into logical steps. Unlike subqueries, CTEs can reference each other and can be recursive (WITH RECURSIVE) for traversing hierarchical data like trees. Example: WITH monthly_totals AS (SELECT month, SUM(power) FROM readings GROUP BY month) SELECT * FROM monthly_totals WHERE ...

What SQL extensions does DuckDB add beyond standard SQL?

DuckDB adds several productivity features: (1) SELECT * EXCLUDE (col) to select all columns except specific ones; (2) SELECT * REPLACE (expr AS col) to transform columns inline; (3) COLUMNS('regex') for dynamic column selection; (4) GROUP BY ALL and ORDER BY ALL to auto-infer columns; (5) Alias reuse in WHERE and GROUP BY clauses; (6) USING SAMPLE n% for built-in data sampling; (7) INSERT ... BY NAME for column-name matching. These extensions reduce boilerplate and make ad-hoc analysis faster.

How do I handle duplicate data when inserting into DuckDB?

DuckDB provides ON CONFLICT clauses for handling duplicates: (1) ON CONFLICT DO NOTHING silently skips duplicates, making inserts idempotent; (2) ON CONFLICT DO UPDATE SET col = value performs an upsert, merging new data with existing rows; (3) ON CONFLICT (column) DO UPDATE SET col = excluded.col lets you reference the incoming row's values using the excluded alias. You can also use complex expressions like CASE statements in the update logic.

What is an ASOF JOIN and when should I use it?

An ASOF JOIN matches rows based on inequality conditions (<=, <, >, >=) rather than exact equality—typically used for temporal data. For example, joining energy readings with prices where each reading should use the price that was valid 'as of' that timestamp. Instead of requiring exact timestamp matches, ASOF JOIN finds the nearest preceding (or following) match. This is essential for time-series analysis, financial data, and any scenario where reference data has validity periods.

How does GROUP BY ALL work in DuckDB?

GROUP BY ALL is a DuckDB extension that automatically groups by all non-aggregated columns in your SELECT clause. Instead of writing SELECT category, region, SUM(sales) FROM data GROUP BY category, region, you can write SELECT category, region, SUM(sales) FROM data GROUP BY ALL. DuckDB infers which columns need grouping. Similarly, ORDER BY ALL sorts by all columns from left to right. These features dramatically speed up ad-hoc analysis and reduce errors from mismatched column lists.

How can I sample data in DuckDB for faster exploration?

DuckDB's USING SAMPLE clause lets you query a subset of data: SELECT * FROM large_table USING SAMPLE 10% returns roughly 10% of rows. You can specify methods: bernoulli (row-by-row sampling, more even distribution) or system (vector-based, faster but less precise for small datasets). For exact row counts, use reservoir sampling: USING SAMPLE 1000 ROWS. Add REPEATABLE (seed) for reproducible results. This is invaluable for exploring massive datasets without loading everything into memory.
