filter
Back to DuckDB Data Engineering Glossary
Definition
A filter
operation selectively includes or excludes records from a dataset based on one or more conditions. In data pipelines and SQL queries, filtering helps narrow down large datasets to just the relevant information needed for analysis or further processing.
SQL Implementation
In DuckDB SQL, filtering is primarily done using the WHERE
clause in queries. The conditions in a WHERE
clause can use comparison operators (like =
, <
, >
), logical operators (AND
, OR
, NOT
), and pattern matching.
Basic example:
Copy code
SELECT * FROM customers
WHERE age > 21 AND country = 'Canada';
DuckDB also supports filtering aggregations using the FILTER
clause, which is more powerful than the standard SQL HAVING
clause because it can be applied to individual aggregate functions:
Copy code
SELECT
category,
COUNT(*) AS total_items,
COUNT(*) FILTER (WHERE price > 100) AS expensive_items
FROM products
GROUP BY category;
Python Integration
When using DuckDB's Python API, filtering can be done through both SQL and the relational API. The relational API provides a filter()
method that accepts SQL-style conditions:
Copy code
# Using DuckDB's relational API
duckdb.table('products').filter('price > 100').show()
Key Differences
Unlike some databases that only support HAVING
for filtering aggregated results, DuckDB's FILTER
clause allows for more granular control over individual aggregations within the same query. This makes complex analytical queries more concise and easier to understand.