
dlt

dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from REST APIs, SQL databases, cloud storage, Python data structures, and many more.

dlt is designed to be easy to use, flexible, and scalable:

  • dlt infers schemas and data types, normalizes the data, and handles nested data structures (see the sketch right after this list).
  • dlt supports a variety of popular destinations and has an interface to add custom destinations to create reverse ETL pipelines.
  • dlt can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice.
  • dlt automates pipeline maintenance with schema evolution and schema and data contracts.
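
As a quick illustration of the first point, here is a minimal sketch, using made-up records and the local duckdb destination, in which dlt infers the schema and flattens the nested address field into child columns:

import dlt

# Made-up nested records; dlt infers column types and flattens the
# nested "address" dict into address__city / address__zip columns.
users = [
    {"id": 1, "name": "Alice", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "Bob", "address": {"city": "Paris", "zip": "75001"}},
]

pipeline = dlt.pipeline(
    pipeline_name="schema_demo",
    destination="duckdb",
    dataset_name="demo",
)
print(pipeline.run(users, table_name="users"))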

dlt integrates well with DuckDB (it also uses DuckDB as a local cache) and therefore with MotherDuck.

You can learn more about the MotherDuck integration in the official documentation.

Authentication

To authenticate with MotherDuck, you have two options:

  1. Environment variable: export your MotherDuck token as an environment variable:

     export motherduck_token="your_motherduck_token"

  2. Local development: add the token to .dlt/secrets.toml:

     [destination.motherduck.credentials]
     password = "my_motherduck_token"
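
If you prefer to set the secret from code (for example, after fetching it from a secret manager), you can rely on dlt's environment-variable naming convention. This is a minimal sketch, assuming the variable name mirrors the secrets.toml path above:

import os

# Hypothetical: the token is fetched from your own secret store.
# dlt's config convention maps [destination.motherduck.credentials]
# password to the variable DESTINATION__MOTHERDUCK__CREDENTIALS__PASSWORD.
os.environ["DESTINATION__MOTHERDUCK__CREDENTIALS__PASSWORD"] = "my_motherduck_token"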

Minimal example

Below is a minimal example of using dlt to load data from a simulated REST API (with fake, GitHub-like data) into a MotherDuck database:

import random
from datetime import datetime
from typing import Dict, Iterator, List, Optional, Sequence

import dlt
from dlt.sources import DltResource


@dlt.source(name="dummy_github")
def dummy_source(repos: Optional[List[str]] = None) -> Sequence[DltResource]:
    """
    A minimal dlt source that generates dummy GitHub-like data.

    Args:
        repos (Optional[List[str]]): A list of dummy repository names.

    Returns:
        Sequence[DltResource]: A sequence of resources with dummy data.
    """
    if repos is None:
        repos = ["dummy/repo1", "dummy/repo2"]

    return (
        dummy_repo_info(repos),
        dummy_languages(repos),
    )


@dlt.resource(write_disposition="replace")
def dummy_repo_info(repos: List[str]) -> Iterator[Dict]:
    """
    Generates dummy repository information.

    Args:
        repos (List[str]): List of repository names.

    Yields:
        Iterator[Dict]: An iterator over dummy repository data.
    """
    for repo in repos:
        owner, name = repo.split("/")
        yield {
            "id": random.randint(10000, 99999),
            "name": name,
            "full_name": repo,
            "owner": {"login": owner},
            "description": f"This is a dummy repository for {repo}",
            "created_at": datetime.now().isoformat(),
            "updated_at": datetime.now().isoformat(),
            "stargazers_count": random.randint(0, 1000),
            "forks_count": random.randint(0, 500),
        }


@dlt.resource(write_disposition="replace")
def dummy_languages(repos: List[str]) -> Iterator[Dict]:
    """
    Generates dummy language data for repositories in an unpivoted format.

    Args:
        repos (List[str]): List of repository names.

    Yields:
        Iterator[Dict]: An iterator over dummy language data.
    """
    languages = ["Python", "JavaScript", "TypeScript", "C++", "Rust", "Go"]

    for repo in repos:
        # Generate 2-4 random languages for each repo
        num_languages = random.randint(2, 4)
        selected_languages = random.sample(languages, num_languages)

        for language in selected_languages:
            yield {
                "repo": repo,
                "language": language,
                "bytes": random.randint(1000, 100000),
                "check_time": datetime.now().isoformat(),
            }


def run_minimal_example():
    """
    Runs a minimal example pipeline that loads dummy GitHub data to MotherDuck.
    """
    # Define some dummy repositories
    repos = ["example/repo1", "example/repo2", "example/repo3"]

    # Configure the pipeline
    pipeline = dlt.pipeline(
        pipeline_name="minimal_github_pipeline",
        destination="motherduck",
        dataset_name="minimal_example",
    )

    # Create the data source
    data = dummy_source(repos)

    # Run the pipeline with all resources
    info = pipeline.run(data)
    print(info)

    # Show what was loaded
    print("\nLoaded data:")
    print(f"- {len(repos)} repositories")
    print(f"- Languages for {len(repos)} repositories")


if __name__ == "__main__":
    run_minimal_example()

dlt revolves around three core concepts:

  • Sources: Define where the data comes from.
  • Resources: Represent structured units of data within a source.
  • Pipelines: Manage the data loading process.

In the example above:

  • dummy_source defines a source that simulates GitHub-like data.
  • dummy_repo_info and dummy_languages are resources producing repository and language data.
  • A pipeline loads this data into MotherDuck.

The core integration with MotherDuck is defined in the pipeline configuration:

pipeline = dlt.pipeline(
    pipeline_name="minimal_github_pipeline",
    destination="motherduck",
    dataset_name="minimal_example",
)

Setting destination="motherduck" tells dlt to load the data into MotherDuck.
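
Once the pipeline has run, you can sanity-check the load from the same script. Here is a minimal sketch using dlt's sql_client, assuming table names follow the resource names from the example above:

with pipeline.sql_client() as client:
    # Query the table created from the dummy_repo_info resource.
    rows = client.execute_sql(
        "SELECT full_name, stargazers_count FROM dummy_repo_info"
    )
    for row in rows:
        print(row)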