ETL

Overview

ETL (Extract, Transform, Load) is a data integration process that combines data from multiple sources into a single destination, typically a data warehouse or database. The process has been a cornerstone of data engineering since the 1970s and remains fundamental to modern data pipelines.

Process Components

During the Extract phase, data is copied from source systems like databases, APIs, files, or applications. This raw data might come from Salesforce, PostgreSQL databases, CSV files, or countless other sources.
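
As a minimal sketch, DuckDB can pull data straight from local files and, with the httpfs extension, from remote ones; the file name and URL below are placeholders:

INSTALL httpfs;
LOAD httpfs;

-- Extract raw rows from a local CSV file
CREATE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv');

-- Extract raw rows from a CSV served over HTTPS
CREATE TABLE raw_customers AS
    SELECT * FROM read_csv_auto('https://example.com/customers.csv');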

The Transform phase cleanses and restructures the extracted data to fit the target system's requirements. Common transformations, a few of which are sketched after this list, include:

  • Converting data types (like changing dates from strings to proper date formats)
  • Aggregating or summarizing data
  • Joining data from multiple sources
  • Filtering out unwanted records
  • Standardizing values (like converting all state codes to uppercase)

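As a rough sketch of a few of these transformations, assuming a raw_orders table like the one extracted above (the column names are hypothetical):

-- Cast types, drop unwanted rows, and standardize a code column
CREATE TABLE clean_orders AS
    SELECT
        CAST(order_date AS DATE) as order_date,  -- string to proper date
        upper(state_code) as state_code,         -- standardize to uppercase
        amount
    FROM raw_orders
    WHERE amount IS NOT NULL;                    -- filter out unwanted records
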
Finally, the Load phase writes the transformed data into the destination system, which could be a data warehouse like Snowflake or a database like DuckDB.
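
For a Load step that targets a separate DuckDB database file, one hedged sketch (the warehouse.duckdb file name is a placeholder) looks like this:

-- Attach a persistent DuckDB database as the destination
ATTACH 'warehouse.duckdb' AS warehouse;

-- Load the cleaned rows into the destination
CREATE TABLE warehouse.clean_orders AS
    SELECT * FROM clean_orders;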

Modern Context

While traditional ETL tools like Informatica required data to be transformed before loading, modern data platforms often use ELT (Extract, Load, Transform) instead, where data is loaded into the destination before transformation. This approach, enabled by powerful cloud data warehouses, provides more flexibility and allows for easier debugging and data lineage tracking.
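
To make the contrast concrete, a hypothetical ELT-style flow in DuckDB loads the raw file as-is first and transforms it later, inside the destination (file and column names are placeholders):

-- Load: land the raw data untouched
CREATE TABLE raw_events AS
    SELECT * FROM read_csv_auto('events.csv');

-- Transform: build a cleaned model afterwards, directly in the warehouse
CREATE TABLE daily_events AS
    SELECT
        date_trunc('day', event_time) as event_day,
        count(*) as event_count
    FROM raw_events
    GROUP BY 1;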

Tools like dbt have popularized the ELT approach by making it easier for data teams to transform data using SQL after it's been loaded into their data warehouse.
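
For instance, a dbt model is simply a SELECT statement saved as a SQL file in the project; a sketch like the following (the model and column names are hypothetical) materializes a transformed table on top of data that has already been loaded:

-- models/monthly_sales.sql (hypothetical dbt model)
SELECT
    date_trunc('month', sale_date) as month,
    sum(amount) as total_sales
FROM {{ ref('stg_sales') }}
GROUP BY 1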

DuckDB Example

Here's a simple ETL process using DuckDB:

-- Extract: Read from CSV file
CREATE TABLE raw_sales AS
    SELECT * FROM read_csv_auto('sales.csv');

-- Transform: Clean and aggregate data
CREATE TABLE transformed_sales AS
    SELECT
        date_trunc('month', sale_date) as month,
        region,
        SUM(amount) as total_sales,
        COUNT(*) as transaction_count
    FROM raw_sales
    WHERE amount > 0
    GROUP BY 1, 2;

-- Load: Write to Parquet file for downstream use
COPY transformed_sales TO 'monthly_sales.parquet' (FORMAT PARQUET);