# Build Hacker News Models From S3 With dbt
> Queries a public Hacker News Parquet file in S3 and builds three dbt models on top of it, run locally against DuckDB or in the cloud against MotherDuck. Use when you want a dbt-on-MotherDuck recipe that reads Parquet/CSV directly from object storage without copying it first.
<ExampleHeader
  githubUrl={"https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3"}
  features={[]}
  tags={["dbt"]}
/>

A small dbt project that reads the public Hacker News Parquet file straight from
S3 as a dbt source, then builds three table models from it. It shows the
MotherDuck pattern of querying object storage in place (no copy step) and running
the same dbt project two ways: locally against a DuckDB file, or in the cloud
against MotherDuck.

## How it works

`models/sources.yml` defines the S3 file as a dbt source using DuckDB's external
location support, so dbt reads the Parquet in place instead of copying it. The
`{name}` placeholder is filled from the table name:

```yaml
sources:
  - name: hn_external
    meta:
      external_location: "s3://us-prd-motherduck-open-datasets/hacker_news/parquet/{name}.parquet"
    tables:
      - name: hacker_news_2024_2025
```

Models reference that source with `{{ source('hn_external', 'hacker_news_2024_2025') }}`,
so swapping the data source is a one-line change in `sources.yml`. The three
models are:

- `top_story_by_comments.sql`: top story per month by comment count, using a
  windowed `ROW_NUMBER()` partition over year/month.
- `duckdb_keyword_mentions.sql`: monthly count of stories mentioning `duckdb` in
  the title or text.
- `top_domains.sql`: top 20 story domains, extracted from the URL with
  `regexp_extract`.

The same project runs unchanged against a local DuckDB file or against
MotherDuck; only the dbt target changes.

## What you'll adjust

| Setting | Purpose | Options / example |
|---|---|---|
| `models/sources.yml` `external_location` | The object-storage path dbt reads as its source. | `s3://us-prd-motherduck-open-datasets/hacker_news/parquet/{name}.parquet`; swap for your own S3/HTTPS Parquet or CSV |
| Source table `name` | Which file under that location to query (`{name}` in the path). | `hacker_news_2024_2025` |
| `models/*.sql` | The three analytical models built from the source. | `top_story_by_comments`, `duckdb_keyword_mentions`, `top_domains`; add or replace your own |
| `dbt_project.yml` `models.+materialized` | How models are persisted. | `table` (default) or `view` |
| `profiles.yml` targets | Local vs cloud destination. | `local` (`local.db` DuckDB file) or `prod` (`md:hacker_news_stats`) |
| `profiles.yml` `prod` `path` | The MotherDuck database dbt builds into. | `md:hacker_news_stats`; create the database first |
| `MOTHERDUCK_TOKEN` (env) | Auth for MotherDuck runs. | a read/write token from your account |

## Questions to answer

- What is the source data: which S3/HTTPS path and file, and is it Parquet or CSV?
- Which models are actually needed (keep the three samples, replace them, or select a subset)?
- Target MotherDuck database and schema for the built tables.
- Local DuckDB run or MotherDuck run?
- Is there a MotherDuck account and access token available?

## Run it

Prerequisites: a MotherDuck account and token for cloud runs, and `uv` for the
local Python runtime. The S3 dataset is public, so no AWS credentials are needed.

Local DuckDB run (writes to `local.db`):

```bash
uv run --with dbt-duckdb dbt run --target local
```

MotherDuck run (create the database once, then build):

```bash
export MOTHERDUCK_TOKEN=your_token_here
uv run --with dbt-duckdb dbt run --target prod
```

Create the `prod` database first if it does not exist. dbt does not create the
database for you, and a missing database makes the `prod` run fail:

```sql
CREATE DATABASE IF NOT EXISTS hacker_news_stats;
```

`profiles.yml` ships two targets: `local` (a `local.db` DuckDB file) and `prod`
(`md:hacker_news_stats`). The default target is `local`, so a bare `dbt run`
stays on disk. There is no `dev` target; use `local`.

## Files

- [`models/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/models/): the dbt project content. Three table models ([`top_story_by_comments.sql`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/models/top_story_by_comments.sql), [`duckdb_keyword_mentions.sql`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/models/duckdb_keyword_mentions.sql), [`top_domains.sql`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/models/top_domains.sql)) plus [`sources.yml`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/models/sources.yml), which declares the public Hacker News S3 Parquet as a dbt source using DuckDB's `external_location`.
- [`dbt_project.yml`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/dbt_project.yml): dbt project config (profile name `dbt_ingestion_s3`, models materialized as `table`).
- [`profiles.yml`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/profiles.yml): the two targets, `local` (a `local.db` DuckDB file) and `prod` (`md:hacker_news_stats`).
- [`pyproject.toml`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/pyproject.toml): Python project metadata for local `uv run` (pins `dbt-duckdb==1.9.3`).
- [`uv.lock`](https://github.com/motherduckdb/motherduck-examples/blob/main/dbt-ingestion-s3/uv.lock): resolved lockfile for the local `uv` environment. [`.python-version`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/.python-version) pins Python 3.12.
- [`analyses/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/analyses/), [`macros/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/macros/), [`seeds/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/seeds/), [`snapshots/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/snapshots/), [`tests/`](https://github.com/motherduckdb/motherduck-examples/tree/main/dbt-ingestion-s3/tests/): standard dbt scaffold directories, empty for now (each holds a `.gitkeep`).

## Caveats

- **The target database must already exist.** dbt does not create it. The `prod`
  target connects to `md:hacker_news_stats`, which fails if the database is
  missing. Run `CREATE DATABASE IF NOT EXISTS ...` first.
- **No `dev` target exists.** `profiles.yml` defines `local` and `prod` only. dbt
  errors on `--target dev`.
- **Swapping to a private bucket needs a secret.** The default S3 dataset is
  public. Pointing `external_location` at a private bucket requires a DuckDB/
  MotherDuck `SECRET`; the project ships none.
- **Do not put a token in source or config.** The `local` and `prod` runs read
  `MOTHERDUCK_TOKEN` from the environment; keep it out of the repo.

## Learn more

- Deeper MotherDuck or DuckDB questions (querying S3, dbt-duckdb behavior): use the
  `ask_docs_question` MCP tool, or see the [dbt-duckdb adapter](https://github.com/duckdb/dbt-duckdb)
  and [MotherDuck docs](https://motherduck.com/docs).


---

## Docs feedback

MotherDuck accepts optional user-submitted feedback about this page at `POST https://motherduck.com/docs/api/feedback/agent`.
For agents and automated tools, feedback submission should be user-confirmed before sending.

Payload:

```json
{
  "page_path": "/cookbook/dbt-ingestion-s3/",
  "page_title": "Build Hacker News Models From S3 With dbt",
  "text": "<the user's feedback, max 2000 characters>",
  "source": "<optional identifier for your interface, for example 'claude.ai' or 'chatgpt'>"
}
```

`page_path` and `text` are required; `page_title` and `source` are optional. Responses: `200 {"feedback_id": "<uuid>"}`, `400` for malformed payloads, and `429` when rate-limited.
