

Building Analytics Agents on MotherDuck

Analytics agents are AI-powered systems that allow users to interact with data using natural language. Instead of writing SQL queries or building dashboards, users can ask questions like "What were our top-selling products last quarter?" and get immediate answers.

This guide covers best practices for building production-ready analytics agents on MotherDuck.

Prerequisites

  • Agent framework: Claude Agent SDK, OpenAI Agents SDK, or Claude Desktop with MotherDuck MCP connector
  • MotherDuck account with the data you want to query
  • Clean, well-structured data: The better your schema and metadata, the better your agent performs

Step 1: Define Your Agent's Interface

Choose the interface your agent will use to query your MotherDuck database.

Option A: Generated SQL

The agent generates SQL queries and executes them via a tool/function call. This provides maximum flexibility - agents can answer any question your data supports - but requires good SQL generation capabilities.

Implementation approaches:

MCP Server: Use our Model Context Protocol server for Claude Desktop, Cursor, or custom agents with local MCP server support

Custom tool calling: Create a function that accepts SQL strings and executes them:

import duckdb

def execute_sql(query: str) -> str:
    """Execute SQL query against MotherDuck"""
    conn = duckdb.connect('md:my_database?motherduck_token=<token>', read_only=True)
    try:
        result = conn.execute(query).fetchdf()
        return result.to_string()
    except Exception as e:
        return f"Error: {str(e)}"

Option B: Parameterized Query Templates

The agent receives structured parameters that fill predefined SQL templates. This provides strict correctness guarantees and is easier to validate, but it is less flexible and requires more upfront development, since queries are limited to predefined questions.

Example: the agent calls a custom tool with a domain-specific signature like get_sales_by_region(region: str, start_date: date, end_date: date) instead of generating custom SQL.

Recommendation: Start with Option A (SQL generation) unless you have strict correctness requirements or very limited query patterns.

Step 2: Give Your Agent SQL Knowledge

Your LLM needs to know how to write good DuckDB queries.

System Prompt for DuckDB and MotherDuck

A system prompt is the foundational instruction set that guides your agent's behavior and capabilities. It's critical for ensuring your agent generates correct, efficient SQL queries and understands how to explore data effectively.

The query guide below should be added to your system prompt because it contains:

  • DuckDB SQL syntax and conventions
  • Common patterns and best practices
  • How to explore schemas efficiently
query_guide.md
# DuckDB SQL Query Syntax and Performance Guide 

## General Knowledge

### Basic Syntax and Features

**Identifiers and Literals:**
- Use double quotes (`"`) for identifiers with spaces/special characters or case-sensitivity
- Use single quotes (`'`) for string literals

**Flexible Query Structure:**
- Queries can start with `FROM`: `FROM my_table WHERE condition;` (equivalent to `SELECT * FROM my_table WHERE condition;`)
- `SELECT` without `FROM` for expressions: `SELECT 1 + 1 AS result;`
- Support for `CREATE TABLE AS` (CTAS): `CREATE TABLE new_table AS SELECT * FROM old_table;`

**Advanced Column Selection:**
- Exclude columns: `SELECT * EXCLUDE (sensitive_data) FROM users;`
- Replace columns: `SELECT * REPLACE (UPPER(name) AS name) FROM users;`
- Pattern matching: `SELECT COLUMNS('sales_.*') FROM sales_data;`
- Transform multiple columns: `SELECT AVG(COLUMNS('sales_.*')) FROM sales_data;`

**Grouping and Ordering Shortcuts:**
- Group by all non-aggregated columns: `SELECT category, SUM(sales) FROM sales_data GROUP BY ALL;`
- Order by all columns: `SELECT * FROM my_table ORDER BY ALL;`

**Complex Data Types:**
- Lists: `SELECT [1, 2, 3] AS my_list;`
- Structs: `SELECT {'a': 1, 'b': 'text'} AS my_struct;`
- Maps: `SELECT MAP([1,2],['one','two']) AS my_map;`
- Access struct fields: `struct_col.field_name` or `struct_col['field_name']`
- Access map values: `map_col[key]`

**Date/Time Operations:**
- String to timestamp: `strptime('2023-07-23', '%Y-%m-%d')::TIMESTAMP`
- Format timestamp: `strftime(NOW(), '%Y-%m-%d')`
- Extract parts: `EXTRACT(YEAR FROM DATE '2023-07-23')`

### Database and Table Qualification

**Fully Qualified Names:**
- Tables are accessed by fully qualified names: `database_name.schema_name.table_name`
- There is always one current database: `SELECT current_database();`
- Tables from the current database don't need database qualification: `schema_name.table_name`
- Tables in the main schema don't need schema qualification: `table_name`
- Shorthand: `my_database.my_table` is equivalent to `my_database.main.my_table`

**Switching Databases:**
- Use `USE my_other_db;` to switch current database
- After switching, tables in that database can be accessed without qualification

### Schema Exploration

**Get database and table information:**
- List all databases: `SELECT alias as database_name, type FROM MD_ALL_DATABASES();`
- List tables in database: `SELECT database_name, schema_name, table_name, comment FROM duckdb_tables() WHERE database_name = 'your_database';`
- List views in database: `SELECT database_name, schema_name, view_name, comment, sql FROM duckdb_views() WHERE database_name = 'your_database';`
- Get column information: `SELECT column_name, data_type, comment, is_nullable FROM duckdb_columns() WHERE database_name = 'your_database' AND table_name = 'your_table';`

**Sample data exploration:**
- Quick preview: `SELECT * FROM table_name LIMIT 5;`
- Column statistics: `SUMMARIZE table_name;`
- Describe table: `DESCRIBE table_name;`

### Performance Tips

**QUALIFY Clause for Window Functions:**
```sql
-- Get top 2 products by sales in each category
SELECT category, product_name, sales_amount
FROM products
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales_amount DESC) <= 2;
```

**Efficient Patterns:**
- Use `arg_max()` and `arg_min()` for "most recent" queries
- Filter early to reduce data volume
- Use CTEs for complex queries
- Prefer `GROUP BY ALL` for readability
- Use `QUALIFY` instead of subqueries for window function filtering

**Avoid These Patterns:**
- Functions on the left side of WHERE clauses (prevents pushdown)
- Unnecessary ORDER BY on intermediate results
- Cross products and cartesian joins

Function Documentation

MotherDuck maintains function_docs.jsonl - compact, LLM-friendly documentation for every DuckDB/MotherDuck function available at: https://app.motherduck.com/assets/docs/function_docs.jsonl

How to use:

  1. When user asks a question, search function docs using FTS or semantic search
  2. Add the 5 most relevant function descriptions to the agent's context
  3. This helps with specialized functions (window functions, date arithmetic, JSON operations, etc.)
Preview Feature

Function documentation is in Preview. The schema, content, and availability of function_docs.jsonl may change over time as we improve the resource.
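Before wiring up FTS or semantic search, a naive keyword-overlap search over the file can serve as a starting point. The "name" and "description" fields below are assumptions about the Preview schema; verify them against the actual file:

```python
import json

def top_function_docs(question: str, docs_path: str = 'function_docs.jsonl', k: int = 5) -> list:
    """Return the k docs whose name/description share the most words with the question."""
    terms = set(question.lower().split())
    scored = []
    with open(docs_path) as f:
        for line in f:
            doc = json.loads(line)
            # "name" and "description" fields are assumed; the Preview schema may differ
            text = f"{doc.get('name', '')} {doc.get('description', '')}".lower()
            score = sum(1 for t in terms if t in text)
            if score:
                scored.append((score, doc))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

The top-k results can then be appended to the agent's context alongside the query guide.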

Step 3: Give Your Agent Schema Context

Your agent needs to understand your database structure to generate correct queries.

Finding Relevant Tables

Our query_guide.md explains how agents can explore schemas autonomously to find relevant tables. For faster, non-agentic identification, use the built-in __MD_FILTER_TABLES() function with fuzzy keyword search:

-- Use MotherDuck's smart table filtering function
-- Replace 'sales', 'customer' with relevant search terms
SELECT
  database_name,
  schema_name,
  table_name,
  table_comment,
  column_comments,
  table_similarity,
  column_matches
FROM __MD_FILTER_TABLES(['sales', 'customer'], current_database())
ORDER BY table_similarity DESC LIMIT 15;
Preview Feature

The __MD_FILTER_TABLES function is in 'Preview'. The function name, signature, or filtering behavior may change in the future.

If you want to build your own filtering function, these are the raw tables to start with: duckdb_tables(), duckdb_columns(), duckdb_views(), and MD_ALL_DATABASES().

Make Schemas Agent-Friendly

Use clear naming: Choose explicit, unambiguous table and column names

❌ Bad: ord_dtl, cust_id, amt
✅ Good: order_details, customer_id, total_amount

Add context with COMMENT ON:

COMMENT ON TABLE orders IS 'Customer orders since 2020. Join to customers via customer_id';
COMMENT ON COLUMN orders.status IS 'Possible values: pending, shipped, delivered, cancelled';
COMMENT ON COLUMN orders.total_amount IS 'Total in USD including tax and shipping';

Comments help agents understand table relationships, valid values, and business logic. Learn more: COMMENT ON documentation

Step 4: Configure Access Controls

Secure your agent's database access with appropriate permissions and isolation.

Read-Only Access

For single-user or internal agents:

import duckdb
# Using read-only connection parameter
con = duckdb.connect('md:?motherduck_token=<token>', read_only=True)

For multi-tenant customer-facing analytics agents:

Use service accounts for your agents. You can grant these service accounts read-only access to specific databases using shares:

ATTACH 'md:_share/my_org/abc123' AS shared_data;

Consider creating separate service accounts per user/tenant for full compute isolation, or use read-scaling tokens to distribute query load across dedicated read replicas.

Capacity planning: Choose the number of read scaling replicas and Duckling size according to the expected query complexity and concurrency.

Read-Write Access & Sandboxing

For agents that need to create tables, modify data, or experiment safely, use zero-copy clones to create an isolated sandbox. This provides safe experimentation completely isolated from production data, with instant creation through zero-copy operations. Agents get full capabilities to create tables, modify data, and experiment freely, with easy sharing of results back to production when ready.

-- Create instant writable copy (clones must match source retention type)
CREATE DATABASE my_sandbox FROM SHARE my_database_share;

-- Agent can now read/write without affecting production data
-- Changes are isolated to this copy

Learn more: CREATE DATABASE documentation

Step 5: Implement Your Agent

Build your agent using an SDK or framework that supports function calling.

Quick start option: For immediate experimentation, try Claude Desktop with the MotherDuck MCP connector - no coding required.

Custom agent option: Here's a simple example using the OpenAI Agents SDK:

import duckdb
from agents import Agent, Runner, function_tool

# Connect to MotherDuck
conn = duckdb.connect('md:?motherduck_token=<token>', read_only=True)

@function_tool
def query_motherduck(sql: str) -> str:
    """Execute SQL query against MotherDuck database.

    Args:
        sql: The SQL query to execute against the MotherDuck database.
    """
    try:
        result = conn.execute(sql).fetchdf()
        return result.to_string()
    except Exception as e:
        return f"Error executing query: {str(e)}"

# Load the DuckDB query guide (copy the system prompt template above into a local file)
with open('query_guide.md', 'r') as f:
    query_guide = f.read()

# Create agent with database tool
agent = Agent(
    name="MotherDuck Analytics Agent",
    instructions=f"""You are a data analyst helping users query a MotherDuck database.

Use the query_motherduck tool to execute SQL queries against the database.

Always start with schema exploration before querying specific tables.

{query_guide}
""",
    tools=[query_motherduck],
)

# Run the agent
result = Runner.run_sync(
    agent,
    "What were the top 5 products by revenue last month?"
)
print(result.final_output)

Validating Queries Before Showing to Users

If your agent has a human in the loop to review, edit, or learn from generated queries, you face a challenge: you don't want to show users queries that are syntactically wrong or that reference nonexistent tables or columns, but you also don't want to execute queries without human approval. Use try_bind() to check for errors without executing: it validates syntax and verifies that all referenced tables and columns exist, in just a few milliseconds.

-- Valid query - empty result means success
CALL try_bind (
  'SELECT customer_id, total FROM orders WHERE status = ''shipped'''
);

-- Invalid query - returns error message
CALL try_bind ('SELECT * FORM orders');

Example integration:

def generate_query_for_review(question: str) -> str:
    """Generate and validate SQL before showing to user."""
    error_msg = None
    for attempt in range(3):
        sql = agent.generate_sql(question, error_feedback=error_msg)

        # Validate before showing; try_bind returns no rows on success
        validation = conn.execute("CALL try_bind(?)", [sql]).fetchone()

        if validation is None:  # Empty result means success
            return f"Generated query:\n{sql}"

        error_msg = validation[0]  # Pass error to agent for next attempt

    return "Could not generate a valid query to answer the question"

The agent should incorporate the error feedback from try_bind() into subsequent generation attempts to fix syntax errors or incorrect table/column references.

Step 6: Test and Iterate

Validate your agent's performance and refine its behavior based on real-world usage.

Testing and Quality

Choose a set of realistic user questions that cover simple filters ("Show me sales from last month"), complex analysis ("What's the trend in customer retention by region?"), and edge cases like empty results ("Show me sales for December 2019") or ambiguous requests ("Show me the best customers"). Test each question and check the agent's behavior. Focus on SQL correctness, result accuracy and query performance. See the next section for how to tackle common issues.
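A minimal harness for running such a question set might look like the following; run_agent is a stand-in for however you invoke your agent (for example, a wrapper around Runner.run_sync from the earlier example):

```python
# Minimal evaluation-harness sketch; the questions mirror the categories above.
TEST_CASES = [
    ("Show me sales from last month", "simple filter"),
    ("What's the trend in customer retention by region?", "complex analysis"),
    ("Show me sales for December 2019", "edge case: empty result"),
    ("Show me the best customers", "ambiguous request"),
]

def evaluate(run_agent, cases=TEST_CASES):
    """Run each question through the agent; record answers and failures for review."""
    results = []
    for question, category in cases:
        try:
            answer = run_agent(question)
            results.append({"question": question, "category": category,
                            "status": "ok", "answer": answer})
        except Exception as exc:
            results.append({"question": question, "category": category,
                            "status": "error", "answer": str(exc)})
    return results
```

Reviewing the collected answers by hand (or scoring them against known-good results) gives you a regression suite to rerun after every prompt or schema change.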

Common Issues and Solutions

| Issue | Solution |
| --- | --- |
| Invalid SQL generation | Improve system prompt, add function docs to context |
| Wrong tables queried | Add COMMENT ON, improve schema descriptions, implement table filtering |
| Misunderstood questions | Add domain-specific examples to system prompt |
| Query performance | Use EXPLAIN ANALYZE to diagnose query inefficiencies; adjust Duckling size to scale compute resources |

Next Steps