Close the Loop: Faster Data Pipelines with MCP, DuckDB & AI

2025/04/15 - 11 min read

As data engineers, we constantly face the challenge of slow feedback loops when building data pipelines. Unlike the rapid iteration cycles often seen in web development (write some JavaScript/HTML, refresh, and boom, you see a page), data pipelines frequently involve multiple tools, complex transformations, and a heavy reliance on data storage. This complexity creates bottlenecks and slows down development.

But what if there was a way to accelerate this process and get quicker insights from your data? The Model Context Protocol (MCP) has been a hot topic lately. Could it play a role in speeding up data engineering workflows? Let's explore.

Understanding the development lifecycle

The typical data engineering lifecycle involves several stages: ingestion, transformation, storage, serving, and finally, analysis as defined in the excellent book of Fundamentals of Data Engineering.

Source: Fundamentals of Data Engineering by Joe Reis & Matt Housley

Each step critically depends on the data itself. Mocking realistic data is challenging, often requiring use of samples of production data to properly develop and test data transformation pipelines and analytical models.

Even for ingestion, it's really hard to proceed without looking at the data first. For instance, you might have CSV files that you want to convert to Parquet. Relying only on schema inference can be dangerous; a column that initially appears to be boolean might actually contain string values further down in the file.

The solution to avoid these traps during development isn't a secret: you have to query the source data and inspect it directly.

AI Copilots: A Step in the Right Direction

AI copilots like GitHub Copilot and Cursor have emerged as valuable tools for accelerating code generation. The typical workflow involves:

Writing a prompt describing the desired code.
Letting the AI generate the code snippet.
Testing the generated code against your data.

However, this process can still be inefficient. If the AI produces inaccurate code (which often happens when dealing with specific data schemas or complex logic), you need to revise the prompt, regenerate, and re-test against the data, leading to frustrating delays.

MCP: Closing the Feedback Loop

The Model Context Protocol (MCP) is an emerging open protocol designed to connect AI copilots (like Cursor, GitHub Copilot, or Claude) to local and cloud-based tools. Think of it as an API layer that allows Large Language Models (LLMs) to query, inspect, and interact with various tools – databases, code repositories, APIs, etc. – either directly guided by you or through an autonomous agent.

Originally introduced by Anthropic in 2024, MCP quickly gained traction among AI-first developer tools like Zed, Replit, and Sourcegraph. It offers a model-agnostic, extensible way for AI applications to work with structured data, code, or documents residing in external systems.

Under the hood, MCP typically uses a client–host–server architecture:

The host is your AI tool (e.g., your local IDE like Cursor or VS Code with an extension).
The client is a lightweight connector managing communication.
The server exposes specific tools (like a database connection or a file system browser) to the AI via a standardized interface.

Each MCP session is scoped, secure, and focused on a specific domain (e.g., querying a particular database, Browse a specific repository).

Today, MCP is primarily used to accelerate AI workflows within development environments or through automated agents. In the context of data engineering, this means MCP can enable AI copilots to perform tasks ranging from running SQL queries against databases to understanding complex schemas and metadata, bringing a new level of context-awareness to AI assistance.

While the standard is still evolving – with related efforts and forks like GitHub Copilot Apps and Google’s Agent-to-Agent (A2A) Communication emerging – MCP is shaping up to be a foundational piece for agent-tool communication.

You can find a growing list of community MCP servers on mcp.so.

Using MCP for Building Data Pipelines

Let's walk through how MCP can help building data pipelines, specifically using a DuckDB+dbt stack.

Setup

To set up our working environment for this demo, we'll need:

An IDE that supports MCP: Cursor is used here, but others like VS Code (with extensions) might support it.
Install the MCP Server: We need the specific MCP server for our chosen tool. In this case, we'll use the MotherDuck/DuckDB MCP server.

In Cursor, you can easily set up an MCP server via the Settings :

The DuckDB/MotherDuck MCP server allows the AI copilot (Cursor) to directly run queries against local DuckDB databases and/or remote MotherDuck databases and interpret the results. This drastically shortens the feedback loop compared to manually running queries and pasting results back into the AI prompt.

To install the DuckDB/MotherDuck MCP server in Cursor, go to Settings > Cursor Settings > MCP > Add a new Global MCP Server and add the following JSON configuration:

Copy code
{
  "mcpServers": {
    "mcp-server-motherduck": {
      "command": "uvx", // Assumes uvx is installed and in PATH for running Python CLI tools
      "args": [
        "mcp-server-motherduck",
        "--db-path",
        "md:", // Connects to MotherDuck by default. Use a local file path like "my_local_db.duckdb" for local DBs.
        "--motherduck-token",
        "<YOUR_MOTHERDUCK_TOKEN_HERE>" // Required if connecting to MotherDuck
      ]
    }
  }
}

INFO: Authentication with MotherDuck Replace <YOUR_MOTHERDUCK_TOKEN_HERE> with your actual MotherDuck service token if you intend to connect to MotherDuck. You can omit the token lines if only querying local DuckDB files.

Adding Documentation Context

Modern AI copilots benefit greatly from having access to relevant and updated documentation. Cursor, for instance, supports adding documentation sources. You can then reference this documentation in your prompts (e.g., @docs/my_doc) to provide context to the LLM.

To add documentation in Cursor, navigate to Settings -> Cursor Settings and look under the 'Features' tab (or similar, depending on the version).

TIP: Other options to add documentation context If you are not using Cursor, you can also use repomix to repack code base (e.g public documentation) into friendly format for AI.

Cursor supports simply adding the main documentation website URL; it will then crawl and index the content for you!

Relatedly, a new standard called llms.txt is emerging (see llmstxt.org), and several documentation sites have started adopting it. In a nutshell, websites provide:

/llms.txt: A file listing key documentation pages (often linking to Markdown versions).
/llms-full.txt: A file containing the aggregated content of the documentation.

This standard helps LLMs and services like Cursor quickly access up-to-date documentation efficiently. Luckily, both MotherDuck and DuckDB have adopted this standard:

MotherDuck:llms.txt,llms-full.txt
DuckDB:llms.txt,llms-full.txt
Adding these documentation sources makes the AI copilot much more effective when generating database-specific code or queries.

Demo: Querying data and building dbt models with MCP

Now that our MCP setup is complete, let's see it in action. In the following demo, I'll use an extensive prompt within Cursor, leveraging the DuckDB/MotherDuck MCP server and documentation context.

Here's the prompt :

Copy code
I want to analyze data tool trends using the following datasets:

- **GitHub language usage (bytes)** for DuckDB, Spark, Polars, Arrow, and Pandas. This reflects *actual codebase usage*.
  - Use the GitHub API directly from DuckDB via `httpfs` extension if possible, or guide me on how to fetch this. Assume the relevant repositories are known (e.g., duckdb/duckdb, apache/spark, polars-rs/polars, apache/arrow, pandas-dev/pandas).

- **Stack Overflow Developer Survey** data. This reflects *developer-reported preferences and usage*.
  - Stored in MotherDuck cloud storage:
    - `s3://us-prd-motherduck-open-datasets/stackoverflow_survey/2017_2024/survey_results.parquet`
    - `s3://us-prd-motherduck-open-datasets/stackoverflow_survey/2017_2024/survey_schemas.parquet`

- **Hacker News data**. This reflects *community interest and discussion* (the "buzz").
  - Stored in MotherDuck cloud storage:
    - `s3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet`

### Workflow

- Use the DuckDB/MotherDuck MCP server configured as `mcp-server-motherduck` to preview data structures and sample contents.
- My local project base path is: `/Users/mehdio/repos/tmp/mcp-playground`
- The goal is to create dbt models for a final table showing how data tools align across developer usage, perception, and community interest.
- Use the existing dbt project structure located in the `mcp_demo` subfolder within my base path.

### Tasks

1.  **Inspect Data:** Use the MCP server to run `DESCRIBE` or `SELECT * ... LIMIT 5` queries on the S3 parquet files to understand their schemas and contents. Show me the output.
2.  **GitHub Data Query:** Suggest a DuckDB query using `httpfs` to get language bytes for the specified GitHub repositories. If direct API access is complex, outline the steps needed.
3.  **dbt Model Generation:** Based on the schemas and goals, suggest valid `dbt` models (staging and final).
4.  **Staging Models:** Create initial SQL files for staging tables within the `mcp_demo/models/staging/` directory.
5.  **Testing:** Use the MCP DuckDB server to test run the generated staging model queries against the source data.
6.  **dbt Tests:** Add appropriate basic `dbt` tests (e.g., `not_null`, `unique`) to the staging models' `.yml` configuration file.

I'm providing the data sources link (here AWS s3 paths) and asking the AI (Cursor) to help create the dbt models. I have a rough idea of the goal but haven't specified the exact transformations, relying on the AI and MCP interaction to explore the data first.

Optimizing Workflows with MCP Interaction

When processing this prompt, the LLM identifies the need to query the S3 data. It recognizes that the mcp-server-motherduck MCP server can fulfill this request and prepares the necessary SQL query (e.g., a DESCRIBE or SELECT LIMIT 5). Cursor then prompts for confirmation before executing the query via MCP.

INFO: DuckDB's powerful integrations DuckDB's in-process nature and extensive integrations make it highly suitable for data exploration. The MCP exemplifies this by running a local DuckDB instance capable of directly querying various object storage systems, including AWS S3, Google Cloud Storage, and Cloudflare's R2.

AI suggests running a query via MCP, awaiting user confirmation.

Once the query is executed through the MCP server, the LLM receives the results (schema information or sample data) directly, enriching its context.

Optimization: Schema First

Interestingly, LLMs sometimes make assumptions about data structure instead of explicitly retrieving metadata first. This can lead to generating incorrect queries.

Since our source files are Parquet, running a simple command to get the schema is fast, easy, and cheap using DuckDB: DESCRIBE SELECT * FROM read_parquet('s3://path/to/your/data/*.parquet');

It's highly beneficial to instruct the AI to perform this step before attempting complex transformations. This recommendation can be included directly in your prompt or potentially configured via custom rules within your AI tool (like Cursor's .cursorrules).

Example Prompt Instruction: "Before generating any transformation query on a Parquet file path, first use the DuckDB MCP server to run DESCRIBE SELECT * FROM read_parquet('<path>'); and incorporate the resulting schema information."

This simple step avoids iterative loops of failing queries and trial-and-error debugging caused by schema mismatches (e.g., treating an integer column as a string).

After a few interactions facilitated by MCP (often within the context of a single, well-crafted prompt!), the AI can generate the required dbt models, like this staging model for Hacker News data:

Copy code
-- models/staging/stg_hacker_news.sql
{{ config(materialized='view') }}

WITH source AS (
    SELECT *,
           to_timestamp(time) AS event_timestamp -- Rename and cast timestamp
    FROM read_parquet('s3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet')
),

final AS (
    SELECT
        id AS hn_id,
        DATE_TRUNC('month', event_timestamp) AS event_month,
        title,
        text AS story_text,
        score,
        "by" AS author,
        descendants AS num_comments
    FROM source
    WHERE type = 'story' -- Filter for stories, not comments/jobs etc.
)

SELECT * FROM final

And eventually, it can propose a final model to unify insights from the different sources.
The key is that the AI could test parts of this query logic directly against the data using MCP during the generation process.

Key takeaways and future outlook

MCP represents a significant step forward for data pipeline development. By enabling AI copilots to directly interact with data sources and tools like DuckDB, it accelerates the data engineering feedback loop that often slows data engineering progress. This direct interaction leads to faster iteration cycles, more accurate AI-generated code, and ultimately, quicker insights from your data.

To make the most of AI and MCP in your data workflows, consider this :

Provide rich context: Equip your AI copilot with necessary information. This includes referencing relevant documentation (@docs/duckdb), specifying the correct MCP servers to use (Use mcp-server-motherduck), outlining your project structure, and leveraging llms.txt sources when available for up-to-date context. When using Cursor, you can also leverage .cursorrules.
Prioritize schema inspection first: Explicitly instruct the AI to use MCP for retrieving schema metadata (e.g., DESCRIBE SELECT * FROM read_parquet(...)) before attempting complex data transformations. This proactive step prevents many common errors caused by incorrect assumptions about data types or column names.
Use sampling for large datasets (Optional): When dealing with very large datasets, consider using MCP to create a smaller, local sample (CREATE TABLE local_sample AS SELECT * FROM read_parquet('s3://...') LIMIT 1000;). Iterating on this faster local sample can significantly speed up development before applying logic to the full dataset.

While MCP and the surrounding ecosystem of AI agents and tools are still evolving, the potential impact on data engineering is clear.

We encourage you to experiment with MCP in your next data project to experience the benefits firsthand.

Have a look at our DuckDB/MotherDuck MCP's documentation, keep quacking and keep coding.