data ingestion

Data ingestion is the process of importing raw data from various sources into a system where it can be stored and analyzed. This crucial first step in the data pipeline involves collecting data from diverse origins such as databases, APIs, file systems, or streaming platforms, and then loading it into a target destination like a data warehouse or data lake.

Tools like Airbyte or Fivetran specialize in data ingestion, offering connectors to hundreds of data sources and handling the complexities of extracting and loading data. For developers, the open-source Singer specification provides a standard for writing scripts, called taps and targets, that can extract data from any source and load it into any destination.

In the context of DuckDB, data ingestion often involves using SQL commands to read data from external files or databases. For example:

-- Ingesting data from a CSV file
CREATE TABLE users AS
    SELECT * FROM read_csv_auto('users.csv');

-- Ingesting data from a Parquet file
CREATE TABLE sales AS
    SELECT * FROM read_parquet('sales.parquet');

-- Ingesting data from a PostgreSQL database
INSTALL postgres_scanner;
LOAD postgres_scanner;
CREATE TABLE orders AS
    SELECT * FROM postgres_scan('host=localhost port=5432 dbname=mydb', 'public', 'orders');
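
These statements create new tables from scratch. When the target table already exists, ingestion usually means appending rows instead. As a minimal sketch, assuming a hypothetical customers.csv whose columns match the table definition and include a header row, both DuckDB's COPY statement and INSERT ... SELECT handle this:

-- Define the target table up front (hypothetical schema)
CREATE TABLE customers (id INTEGER, name VARCHAR, signup_date DATE);

-- Append rows from a CSV file with COPY
COPY customers FROM 'customers.csv' (HEADER true);

-- Or append the contents of another file with INSERT ... SELECT
INSERT INTO customers
    SELECT * FROM read_csv_auto('customers_2024.csv');

COPY is typically the faster path for bulk loads, while INSERT ... SELECT lets you filter or transform rows on the way in.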

Effective data ingestion sets the foundation for all subsequent data processing and analysis tasks, making it a critical skill for aspiring data professionals to master.