---
title: dlt (data load tool)
description: Use dlt to extract and load data from APIs and databases into MotherDuck with automatic schema inference.
---

[dlt](https://dlthub.com/docs/intro) is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from REST APIs, SQL databases, cloud storage, Python data structures, and more.

dlt is designed to be easy to use, flexible, and scalable:

* dlt infers schemas and data types, normalizes the data, and handles nested data structures.
* dlt supports a variety of popular destinations and has an interface to add custom destinations to create reverse ETL pipelines.
* dlt can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice.
* dlt automates pipeline maintenance with schema evolution and schema and data contracts (see the sketch after this list).
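
As an illustration of schema contracts, here is a minimal sketch of a resource-level contract, assuming the contract modes described in the dlt docs (`"evolve"`, `"freeze"`); the `users` resource itself is made up:

```python
import dlt


# Hypothetical resource: let new tables evolve, but freeze columns so that
# records with unexpected fields fail the load instead of silently
# changing the schema.
@dlt.resource(schema_contract={"tables": "evolve", "columns": "freeze"})
def users():
    yield {"id": 1, "name": "Alice"}
```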

dlt integrates well with DuckDB (the dlt team also uses it as a local [cache](https://dlthub.com/blog/dltplus-project-cache-in-early-access)), and therefore with MotherDuck.

You can learn more about the MotherDuck integration in the [official documentation](https://dlthub.com/docs/dlt-ecosystem/destinations/motherduck).
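
To run the examples on this page, install dlt with the MotherDuck extra:

```bash
pip install "dlt[motherduck]"
```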

## Authentication

To authenticate with MotherDuck, you have two options:

1. **Environment variable:** export your `motherduck_token` as an environment variable:

```bash
export motherduck_token="your_motherduck_token"
```

2. **Local development:** add the token to `.dlt/secrets.toml`:

```toml
[destination.motherduck.credentials]
password = "your_motherduck_token"
```
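
You can also pass credentials explicitly in code. Here is a minimal sketch using dlt's `motherduck` destination factory with a MotherDuck connection string (the database name `my_db`, the pipeline and dataset names, and the token are placeholders):

```python
import dlt

# Hypothetical setup: supply the token via a MotherDuck connection string
# instead of secrets.toml or the motherduck_token environment variable.
pipeline = dlt.pipeline(
    pipeline_name="auth_example",
    destination=dlt.destinations.motherduck(
        credentials="md:///my_db?motherduck_token=your_motherduck_token"
    ),
    dataset_name="auth_example_data",
)
```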

## Minimal example

Below is a minimal example of using dlt to load dummy, GitHub-like data (standing in for a REST API source) into a MotherDuck database:

```python
import dlt
import random
from datetime import datetime
from typing import Dict, Iterator, List, Optional, Sequence

from dlt.sources import DltResource


@dlt.source(name="dummy_github")
def dummy_source(repos: Optional[List[str]] = None) -> Sequence[DltResource]:
    """
    A minimal dlt source that generates dummy GitHub-like data.

    Args:
        repos (Optional[List[str]]): A list of dummy repository names.

    Returns:
        Sequence[DltResource]: A sequence of resources with dummy data.
    """
    if repos is None:
        repos = ["dummy/repo1", "dummy/repo2"]
        
    return (
        dummy_repo_info(repos),
        dummy_languages(repos),
    )


@dlt.resource(write_disposition="replace")
def dummy_repo_info(repos: List[str]) -> Iterator[Dict]:
    """
    Generates dummy repository information.
    
    Args:
        repos (List[str]): List of repository names.
        
    Yields:
        Iterator[Dict]: An iterator over dummy repository data.
    """
    for repo in repos:
        owner, name = repo.split("/")
        yield {
            "id": random.randint(10000, 99999),
            "name": name,
            "full_name": repo,
            "owner": {"login": owner},
            "description": f"This is a dummy repository for {repo}",
            "created_at": datetime.now().isoformat(),
            "updated_at": datetime.now().isoformat(),
            "stargazers_count": random.randint(0, 1000),
            "forks_count": random.randint(0, 500),
        }


@dlt.resource(write_disposition="replace")
def dummy_languages(repos: List[str]) -> Iterator[Dict]:
    """
    Generates dummy language data for repositories in an unpivoted format.
    
    Args:
        repos (List[str]): List of repository names.
        
    Yields:
        Iterator[Dict]: An iterator over dummy language data.
    """
    languages = ["Python", "JavaScript", "TypeScript", "C++", "Rust", "Go"]
    
    for repo in repos:
        # Generate 2-4 random languages for each repo
        num_languages = random.randint(2, 4)
        selected_languages = random.sample(languages, num_languages)
        
        for language in selected_languages:
            yield {
                "repo": repo,
                "language": language,
                "bytes": random.randint(1000, 100000),
                "check_time": datetime.now().isoformat(),
            }


def run_minimal_example():
    """
    Runs a minimal example pipeline that loads dummy GitHub data to MotherDuck.
    """
    # Define some dummy repositories
    repos = ["example/repo1", "example/repo2", "example/repo3"]
    
    # Configure the pipeline
    pipeline = dlt.pipeline(
        pipeline_name="minimal_github_pipeline",
        destination="motherduck",
        dataset_name="minimal_example",
    )
    
    # Create the data source
    data = dummy_source(repos)
    
    # Run the pipeline with all resources
    info = pipeline.run(data)
    print(info)
    
    # Show what was loaded
    print("\nLoaded data:")
    print(f"- {len(repos)} repositories")
    print(f"- Languages for {len(repos)} repositories")


if __name__ == "__main__":
    run_minimal_example()
```

dlt revolves around three core concepts:

* **Sources** define where the data comes from.
* **Resources** represent structured units of data within a source.
* **Pipelines** manage the data loading process.

In the example above:

* `dummy_source` defines a source that simulates GitHub-like data.
* `dummy_repo_info` and `dummy_languages` are resources producing repository and language data.
* A pipeline loads this data into MotherDuck.

The core integration with MotherDuck is defined in the pipeline configuration:

```python
pipeline = dlt.pipeline(
    pipeline_name="minimal_github_pipeline",
    destination="motherduck",
    dataset_name="minimal_example",
)
```

Setting `destination="motherduck"` tells dlt to load the data into MotherDuck.
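
Once the pipeline has run, you can check what was loaded by querying the destination. Below is a minimal sketch using dlt's SQL client, which connects with the same credentials and dataset the pipeline wrote to (the table and column names match the `dummy_repo_info` resource above):

```python
# Query the loaded table through the pipeline's own SQL client.
with pipeline.sql_client() as client:
    with client.execute_query(
        "SELECT full_name, stargazers_count FROM dummy_repo_info"
    ) as cursor:
        for full_name, stars in cursor.fetchall():
            print(f"{full_name}: {stars} stars")
```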
