filter
Back to DuckDB Data Engineering Glossary
Definition
A filter operation selectively includes or excludes records from a dataset based on one or more conditions. In data pipelines and SQL queries, filtering helps narrow down large datasets to just the relevant information needed for analysis or further processing.
SQL Implementation
In DuckDB SQL, filtering is primarily done using the WHERE clause in queries. The conditions in a WHERE clause can use comparison operators (like =, <, >), logical operators (AND, OR, NOT), and pattern matching.
Basic example:
Copy code
SELECT * FROM customers
WHERE age > 21 AND country = 'Canada';
DuckDB also supports filtering aggregations using the FILTER clause, which is more powerful than the standard SQL HAVING clause because it can be applied to individual aggregate functions:
Copy code
SELECT
category,
COUNT(*) AS total_items,
COUNT(*) FILTER (WHERE price > 100) AS expensive_items
FROM products
GROUP BY category;
Python Integration
When using DuckDB's Python API, filtering can be done through both SQL and the relational API. The relational API provides a filter() method that accepts SQL-style conditions:
Copy code
# Using DuckDB's relational API
duckdb.table('products').filter('price > 100').show()
Key Differences
Unlike some databases that only support HAVING for filtering aggregated results, DuckDB's FILTER clause allows for more granular control over individual aggregations within the same query. This makes complex analytical queries more concise and easier to understand.

