*New* The MotherDuck Native Integration is Live on Vercel Marketplace for Embedded Analytics and Data AppsLearn more

query

Back to DuckDB Data Engineering Glossary

Definition

A query is a request made to a database to retrieve, modify, or analyze data. In data analytics, queries are most commonly written in SQL and allow you to ask specific questions of your data to get meaningful answers. Think of a query like asking a librarian to find all books published after 2020 that are about data science - you're defining specific criteria to get exactly the information you need.

Structure

A typical SQL query consists of several key components that work together:

  • The operation you want to perform (like SELECT to retrieve data)
  • The data you want to work with (specified in the FROM clause)
  • Any conditions or filters (using WHERE)
  • How you want the results organized (with clauses like GROUP BY or ORDER BY)

Examples in DuckDB

Here's a simple query that retrieves all columns from a table:

Copy code

SELECT * FROM employees;

A more complex query that filters and aggregates data:

Copy code

SELECT department, COUNT(*) as employee_count, AVG(salary) as avg_salary FROM employees WHERE hire_date >= '2020-01-01' GROUP BY department ORDER BY avg_salary DESC;

DuckDB supports both standard SQL queries and some extended functionality, like the ability to query directly from Parquet files:

Copy code

SELECT * FROM 'data/employees.parquet' WHERE salary > 50000;

Performance

DuckDB is specifically optimized for analytical queries, which typically involve scanning large amounts of data, performing aggregations, and joining tables. It uses vectorized query execution and parallel processing to make these operations particularly fast, especially compared to traditional row-oriented databases like SQLite.

Interactive Usage

Queries can be run through DuckDB's command-line interface, through programming languages like Python or R, or through various data tools that support DuckDB as a backend. The results can then be used for analysis, visualization, or further processing in your data pipeline.