dlt
dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from REST APIs, SQL databases, cloud storage, Python data structures, and many more.
dlt is designed to be easy to use, flexible, and scalable:
- dlt infers schemas and data types, normalizes the data, and handles nested data structures.
- dlt supports a variety of popular destinations and has an interface to add custom destinations to create reverse ETL pipelines.
- dlt can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice.
- dlt automates pipeline maintenance with schema evolution and schema and data contracts.
dlt integrates well with DuckDB (which it can also use as a local cache) and therefore with MotherDuck.
You can learn more about the MotherDuck integration in the official dlt documentation.
Authentication
To authenticate with MotherDuck, you have two options:
- Environment variable: export your `motherduck_token` as an environment variable (a Python equivalent is sketched after this list):
export motherduck_token="your_motherduck_token"
- For local development: add the token to `.dlt/secrets.toml`:
[destination.motherduck.credentials]
password = "your_motherduck_token"
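If you are working in a notebook or a script where exporting a shell variable is inconvenient, you can also set the token from Python before the pipeline is created. This is a minimal sketch equivalent to the environment-variable option above (the token value is a placeholder):
import os

# Placeholder token: replace with your real MotherDuck token.
# Setting the variable before dlt.pipeline(...) runs has the same
# effect as the `export motherduck_token=...` command above.
os.environ["motherduck_token"] = "your_motherduck_token"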
Minimal example
Below is a minimal example that uses dlt to load fake, GitHub-like data (standing in for a REST API response) into a MotherDuck database:
import dlt
from typing import Dict, Iterator, List, Optional, Sequence
import random
from datetime import datetime
from dlt.sources import DltResource
@dlt.source(name="dummy_github")
def dummy_source(repos: Optional[List[str]] = None) -> Sequence[DltResource]:
"""
A minimal DLT source that generates dummy GitHub-like data.
Args:
repos (List[str]): A list of dummy repository names.
Returns:
Sequence[DltResource]: A sequence of resources with dummy data.
"""
if repos is None:
repos = ["dummy/repo1", "dummy/repo2"]
return (
dummy_repo_info(repos),
dummy_languages(repos),
)
@dlt.resource(write_disposition="replace")
def dummy_repo_info(repos: List[str]) -> Iterator[Dict]:
"""
Generates dummy repository information.
Args:
repos (List[str]): List of repository names.
Yields:
Iterator[Dict]: An iterator over dummy repository data.
"""
for repo in repos:
owner, name = repo.split("/")
yield {
"id": random.randint(10000, 99999),
"name": name,
"full_name": repo,
"owner": {"login": owner},
"description": f"This is a dummy repository for {repo}",
"created_at": datetime.now().isoformat(),
"updated_at": datetime.now().isoformat(),
"stargazers_count": random.randint(0, 1000),
"forks_count": random.randint(0, 500),
}
@dlt.resource(write_disposition="replace")
def dummy_languages(repos: List[str]) -> Iterator[Dict]:
"""
Generates dummy language data for repositories in an unpivoted format.
Args:
repos (List[str]): List of repository names.
Yields:
Iterator[Dict]: An iterator over dummy language data.
"""
languages = ["Python", "JavaScript", "TypeScript", "C++", "Rust", "Go"]
for repo in repos:
# Generate 2-4 random languages for each repo
num_languages = random.randint(2, 4)
selected_languages = random.sample(languages, num_languages)
for language in selected_languages:
yield {
"repo": repo,
"language": language,
"bytes": random.randint(1000, 100000),
"check_time": datetime.now().isoformat(),
}
def run_minimal_example():
"""
Runs a minimal example pipeline that loads dummy GitHub data to MotherDuck.
"""
# Define some dummy repositories
repos = ["example/repo1", "example/repo2", "example/repo3"]
# Configure the pipeline
pipeline = dlt.pipeline(
pipeline_name="minimal_github_pipeline",
destination='motherduck',
dataset_name="minimal_example",
)
# Create the data source
data = dummy_source(repos)
# Run the pipeline with all resources
info = pipeline.run(data)
print(info)
# Show what was loaded
print("\nLoaded data:")
print(f"- {len(repos)} repositories")
print(f"- Languages for {len(repos)} repositories")
if __name__ == "__main__":
run_minimal_example()
dlt revolves around three core concepts:
- Sources: Define where the data comes from.
- Resources: Represent structured units of data within a source.
- Pipelines: Manage the data loading process.
In the example above:
- dummy_source defines a source that simulates GitHub-like data.
- dummy_repo_info and dummy_languages are resources producing repository and language data.
- A pipeline loads this data into MotherDuck.
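You do not have to load every resource on each run. Here is a small sketch, assuming the source and pipeline defined above, that uses the source's with_resources selector to load a subset:
# Load only the repository metadata and skip the language resource.
# with_resources() keeps just the named resources from the source.
info = pipeline.run(dummy_source(repos).with_resources("dummy_repo_info"))
print(info)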
The core integration with MotherDuck is defined in the pipeline configuration:
pipeline = dlt.pipeline(
pipeline_name="minimal_github_pipeline",
destination="motherduck",
dataset_name="minimal_example",
)
Setting destination="motherduck" tells dlt to load the data into MotherDuck.
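Once the pipeline has run, you can query the loaded tables directly from Python. Below is a minimal sketch using dlt's SQL client against the MotherDuck dataset (assuming the pipeline above has already run; the table names follow the resource names):
# Open a connection to the destination and run a query against the
# tables created by the resources above.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT full_name, stargazers_count FROM dummy_repo_info"
    )
    for full_name, stars in rows:
        print(full_name, stars)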