dlt
dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from REST APIs, SQL databases, cloud storage, Python data structures, and many more.
dlt is designed to be easy to use, flexible, and scalable:
- dlt infers schemas and data types, normalizes the data, and handles nested data structures.
- dlt supports a variety of popular destinations and has an interface to add custom destinations to create reverse ETL pipelines.
- dlt can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice.
- dlt automates pipeline maintenance with schema evolution and schema and data contracts.
dlt integrates well with DuckDB (which it can also use as a local cache) and therefore with MotherDuck.
You can learn more about the MotherDuck integration in the official dlt documentation.
Authentication
To authenticate with MotherDuck, you have two options:
- Environment variable: export your `motherduck_token` as an environment variable (a Python equivalent is sketched after this list):
export motherduck_token="your_motherduck_token"
- For local development: add the token to `.dlt/secrets.toml`:
[destination.motherduck.credentials]
password = "your_motherduck_token"
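If you are working in a notebook or a script where exporting a shell variable is inconvenient, you can also set the token from Python before the pipeline is created. This is a minimal sketch equivalent to the environment-variable option above (the token value is a placeholder):
import os

# Placeholder token: replace with your real MotherDuck token.
# Setting the variable before dlt.pipeline(...) runs has the same
# effect as the `export motherduck_token=...` command above.
os.environ["motherduck_token"] = "your_motherduck_token"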
Minimal example
Below is a minimal example that uses dlt to load fake, GitHub-like data (standing in for a REST API response) into a MotherDuck database:
import dlt
from typing import Dict, Iterator, List, Optional, Sequence
import random
from datetime import datetime
from dlt.sources import DltResource
@dlt.source(name="dummy_github")
def dummy_source(repos: Optional[List[str]] = None) -> Sequence[DltResource]:
"""
A minimal DLT source that generates dummy GitHub-like data.
Args:
repos (List[str]): A list of dummy repository names.
Returns:
Sequence[DltResource]: A sequence of resources with dummy data.
"""
if repos is None:
repos = ["dummy/repo1", "dummy/repo2"]
return (
dummy_repo_info(repos),
dummy_languages(repos),
)
@dlt.resource(write_disposition="replace")
def dummy_repo_info(repos: List[str]) -> Iterator[Dict]:
"""
Generates dummy repository information.
Args:
repos (List[str]): List of repository names.
Yields:
Iterator[Dict]: An iterator over dummy repository data.
"""
for repo in repos:
owner, name = repo.split("/")
yield {
"id": random.randint(10000, 99999),
"name": name,
"full_name": repo,
"owner": {"login": owner},
"description": f"This is a dummy repository for {repo}",
"created_at": datetime.now().isoformat(),
"updated_at": datetime.now().isoformat(),
"stargazers_count": random.randint(0, 1000),
"forks_count": random.randint(0, 500),
}
@dlt.resource(write_disposition="replace")
def dummy_languages(repos: List[str]) -> Iterator[Dict]:
"""
Generates dummy language data for repositories in an unpivoted format.
Args:
repos (List[str]): List of repository names.
Yields:
Iterator[Dict]: An iterator over dummy language data.
"""
languages = ["Python", "JavaScript", "TypeScript", "C++", "Rust", "Go"]
for repo in repos:
# Generate 2-4 random languages for each repo
num_languages = random.randint(2, 4)
selected_languages = random.sample(languages, num_languages)
for language in selected_languages:
yield {
"repo": repo,
"language": language,
"bytes": random.randint(1000, 100000),
"check_time": datetime.now().isoformat(),
}
def run_minimal_example():
"""
Runs a minimal example pipeline that loads dummy GitHub data to MotherDuck.
"""
# Define some dummy repositories
repos = ["example/repo1", "example/repo2", "example/repo3"]
# Configure the pipeline
pipeline = dlt.pipeline(
pipeline_name="minimal_github_pipeline",
destination='motherduck',
dataset_name="minimal_example",
)
# Create the data source
data = dummy_source(repos)
# Run the pipeline with all resources
info = pipeline.run(data)
print(info)
# Show what was loaded
print("\nLoaded data:")
print(f"- {len(repos)} repositories")
print(f"- Languages for {len(repos)} repositories")
if __name__ == "__main__":
run_minimal_example()
dlt revolves around three core concepts:
- Sources: Define where the data comes from.
- Resources: Represent structured units of data within a source.
- Pipelines: Manage the data loading process.
In the example above:
- dummy_source defines a source that simulates GitHub-like data.
- dummy_repo_info and dummy_languages are resources producing repository and language data.
- A pipeline loads this data into MotherDuck.
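You do not have to load every resource on each run. Here is a small sketch, assuming the source and pipeline defined above, that uses the source's with_resources selector to load a subset:
# Load only the repository metadata and skip the language resource.
# with_resources() keeps just the named resources from the source.
info = pipeline.run(dummy_source(repos).with_resources("dummy_repo_info"))
print(info)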
The core integration with MotherDuck is defined in the pipeline configuration:
pipeline = dlt.pipeline(
pipeline_name="minimal_github_pipeline",
destination="motherduck",
dataset_name="minimal_example",
)
Setting destination="motherduck" tells dlt to load the data into MotherDuck.
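Once the pipeline has run, you can query the loaded tables directly from Python. Below is a minimal sketch using dlt's SQL client against the MotherDuck dataset (assuming the pipeline above has already run; the table names follow the resource names):
# Open a connection to the destination and run a query against the
# tables created by the resources above.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT full_name, stargazers_count FROM dummy_repo_info"
    )
    for full_name, stars in rows:
        print(full_name, stars)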