ETL
Overview
ETL (Extract, Transform, Load) is a data integration process that combines data from multiple sources into a single destination, typically a data warehouse or database. The process has been a cornerstone of data engineering since the 1970s and remains fundamental to modern data pipelines.
Process Components
During the Extract phase, data is copied from source systems like databases, APIs, files, or applications. This raw data might come from Salesforce, PostgreSQL databases, CSV files, or countless other sources.
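In DuckDB, extraction from several of these source types can be sketched directly in SQL. The file paths and connection string below are hypothetical, and the PostgreSQL example assumes the postgres extension is installed:

```sql
-- Extract from a CSV file (path is illustrative)
SELECT * FROM read_csv_auto('exports/customers.csv');

-- Extract from a Parquet file
SELECT * FROM read_parquet('exports/orders.parquet');

-- Extract from a live PostgreSQL database (requires the postgres extension;
-- connection details are placeholders)
ATTACH 'dbname=sales host=localhost' AS pg (TYPE postgres);
SELECT * FROM pg.public.orders;
```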
The Transform phase cleanses and restructures the extracted data to fit the target system's requirements. Common transformations include:
- Converting data types (like changing dates from strings to proper date formats)
- Aggregating or summarizing data
- Joining data from multiple sources
- Filtering out unwanted records
- Standardizing values (like converting all state codes to uppercase)
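Several of these transformations can be combined in a single SQL statement. The `raw_orders` table and its columns below are hypothetical, used only to illustrate the pattern:

```sql
-- Hypothetical raw_orders table for illustration
SELECT
    CAST(order_date AS DATE) AS order_date,  -- type conversion: string to date
    upper(state_code) AS state_code,         -- standardization: uppercase codes
    amount
FROM raw_orders
WHERE amount IS NOT NULL;                    -- filtering: drop incomplete records
```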
Finally, the Load phase writes the transformed data into the destination system, which could be a data warehouse like Snowflake or a database like DuckDB.
Modern Context
While traditional ETL tools like Informatica required data to be transformed before loading, modern data platforms often use ELT (Extract, Load, Transform) instead, where data is loaded into the destination before transformation. This approach, enabled by powerful cloud data warehouses, provides more flexibility and allows for easier debugging and data lineage tracking.
Tools like dbt have popularized the ELT approach by making it easier for data teams to transform data using SQL after it's been loaded into their data warehouse.
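The ELT ordering can be sketched in DuckDB as well: raw data is loaded first, and the transformation happens afterward inside the destination. The file name and table shapes here are illustrative:

```sql
-- ELT step 1-2: extract and load the raw data as-is
CREATE TABLE raw_events AS
SELECT * FROM read_csv_auto('events.csv');   -- hypothetical source file

-- ELT step 3: transform later, in place, using SQL
CREATE TABLE daily_events AS
SELECT
    date_trunc('day', event_time) AS day,
    COUNT(*) AS event_count
FROM raw_events
GROUP BY 1;
```

Because the untransformed data remains available in `raw_events`, transformations can be re-run or debugged without re-extracting from the source.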
DuckDB Example
Here's a simple ETL process using DuckDB:
-- Extract: Read from CSV file
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('sales.csv');
-- Transform: Clean and aggregate data
CREATE TABLE transformed_sales AS
SELECT
    date_trunc('month', sale_date) AS month,
    region,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM raw_sales
WHERE amount > 0
GROUP BY 1, 2;
-- Load: Write to Parquet file for downstream use
COPY transformed_sales TO 'monthly_sales.parquet' (FORMAT PARQUET);