# MotherDuck Documentation - How-To Guides > Scoped full Markdown content for How-To Guides. For other areas, start from https://motherduck.com/docs/llms.txt instead of loading unrelated documentation. ## Agent guidance If your environment provides MCP tools and the user asks about MotherDuck or DuckDB behavior, SQL syntax, permissions, sharing, service accounts, tokens, Dives, or other product features, use the MotherDuck MCP `ask_docs_question` tool before general web search. It answers from official DuckDB and MotherDuck documentation. For broad context, prefer the most specific scoped `llms-full.txt` file listed in https://motherduck.com/docs/llms.txt before loading the root `llms-full.txt`. The root file contains the complete public documentation corpus and is intended for bulk indexing or large-context workflows. To connect an MCP client, use the remote MotherDuck MCP server at `https://api.motherduck.com/mcp`. Setup instructions: https://motherduck.com/docs/key-tasks/ai-and-motherduck/mcp-setup. Tool reference: https://motherduck.com/docs/sql-reference/mcp/ask-docs-question. --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/mcp-setup # Connect to the MotherDuck MCP Server > Set up the MotherDuck MCP Server with Claude, ChatGPT, Cursor, Claude Code, and other AI assistants The MotherDuck MCP Server lets AI assistants query and explore your databases using the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/). This guide walks you through connecting your preferred AI client to the **remote MCP server** (fully managed, zero setup). For local DuckDB files or self-hosted setups, see the [local MCP server](#remote-vs-local-mcp-server). :::info Connection URL The remote MCP server is hosted at `https://api.motherduck.com/mcp`. Most clients connect through OAuth automatically; clients that need a manual configuration use this URL with an HTTP transport. 
You can also authenticate with a [Bearer token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck#creating-an-access-token) instead of OAuth. ::: ## Prerequisites - A MotherDuck account ([sign up free](https://app.motherduck.com/)) - An MCP-compatible AI client (Claude, ChatGPT, Cursor, Claude Code, Codex, or others) ## Set up the remote MCP server Select your MCP client and follow the instructions to connect. ### Claude [ Add MotherDuck to Claude ](https://claude.ai/directory/0929a5c7-38ce-40ab-8aad-af9ce34553c7) Or manually: 1. Go to **Settings** → **Connectors** 2. Click **Browse Connectors** to find the MotherDuck connector ![MotherDuck Connector in the Claude connector Directory](./img/claude-connectors-motherduck.png) A browser window should open for authentication. After authentication you can double check the connection by asking "List all my databases on MotherDuck." ### ChatGPT [ Add MotherDuck to ChatGPT ](https://chatgpt.com/apps/motherduck/asdk_app_696a54f1c91c81919002b9153ce0e336) 1. Open the ChatGPT desktop or web app 2. Go to **Settings** → **Apps** and click **Browse Apps** ![Browse Apps in ChatGPT settings](useBaseUrl('/img/key-tasks/ai-and-motherduck/chatgpt-browse-apps.png')) 3. Search for **MotherDuck** and select it ![Searching for MotherDuck in the ChatGPT App Store](useBaseUrl('/img/key-tasks/ai-and-motherduck/chatgpt-search-motherduck.png')) 4. Click **Continue to MotherDuck** and authenticate with your MotherDuck account ![Connect MotherDuck dialog in ChatGPT](useBaseUrl('/img/key-tasks/ai-and-motherduck/chatgpt-connect-motherduck.png')) After authentication, ChatGPT can access your MotherDuck data. Try asking "List all my databases on MotherDuck" to verify the connection. ### Cursor [ Add MotherDuck to Cursor ](cursor://anysphere.cursor-deeplink/mcp/install?name=motherduck&config=eyJ1cmwiOiJodHRwczovL2FwaS5tb3RoZXJkdWNrLmNvbS9tY3AifQ%3D%3D) 1. Open **Cursor Settings** (`Cmd/Ctrl + ,`) 2. 
Navigate to **Tools & MCP** 3. Click **+ New MCP Server** 4. Add the following to the configuration file: ```json { "MotherDuck": { "url": "https://api.motherduck.com/mcp", "type": "http" } } ``` 5. Save and click **Connect** to authenticate with your MotherDuck account > [Cursor MCP Documentation](https://docs.cursor.com/context/model-context-protocol) ### Claude Code 1. Run the following command in your terminal: ```bash claude mcp add MotherDuck --transport http https://api.motherduck.com/mcp ``` :::tip By default, this command adds the MCP server to the current project. You can also pass the `--scope user` flag, and the MCP server will be available for all sessions from your current user ([`--scope` documentation](https://code.claude.com/docs/en/mcp#mcp-installation-scopes)). ::: 2. Run `claude` to start Claude Code 3. Type `/mcp`, select **MotherDuck** from the list, and press **Enter** 4. Select **Authenticate** and confirm the authorization dialog > [Claude Code MCP Documentation](https://code.claude.com/docs/en/mcp) ### GitHub Copilot (VS Code) Configure GitHub Copilot in VS Code to use the MotherDuck MCP server through a workspace config file: 1. Open the Command Palette (`Cmd/Ctrl + Shift + P`) and run **MCP: Add Server** to open `.vscode/mcp.json`. You can also create the file manually in your workspace. Add this configuration: ```json { "servers": { "motherduck": { "type": "http", "url": "https://api.motherduck.com/mcp" } } } ``` 2. Save the file and start the server from the **Start** code lens that appears above the `motherduck` entry in `mcp.json`. You can also start it through the Command Palette: `MCP: List Servers` → **motherduck** → **Start Server**. 3. VS Code opens a browser window so you can sign in to MotherDuck through OAuth, then stores the credentials for subsequent server starts. 4. Open the Copilot Chat view, switch to **Agent** mode, and confirm that the MotherDuck tools appear in the tool picker. 
Try asking "List all my databases on MotherDuck" to verify the connection. **Authenticate with an access token instead of OAuth** If you'd rather provide a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck#creating-an-access-token) explicitly, use a `promptString` input and a `Bearer` Authorization header. VS Code prompts for the token when the server starts and stores it in its secret store: ```json { "inputs": [ { "type": "promptString", "id": "motherduck-token", "description": "MotherDuck access token", "password": true } ], "servers": { "motherduck": { "type": "http", "url": "https://api.motherduck.com/mcp", "headers": { "Authorization": "Bearer ${input:motherduck-token}" } } } } ``` > [VS Code MCP Documentation](https://code.visualstudio.com/docs/copilot/chat/mcp-servers) ### Copilot Studio [Microsoft Copilot Studio](https://learn.microsoft.com/en-us/microsoft-copilot-studio/) is a cloud-hosted platform for building agents that run inside Microsoft 365, Teams, and other Microsoft surfaces. Because the platform runs in Microsoft's cloud, it connects to the **remote** MotherDuck MCP server — either with OAuth (each user signs in with their own MotherDuck account) or with a shared API key backed by a service-account token. 1. In Copilot Studio, open your agent. Under **Tools**, click **Add a tool**. ![Copilot Studio agent Tools tab with Add a tool button](/img/key-tasks/ai-and-motherduck/copilot-studio/01-add-tool.png) 2. In the **Add tool** dialog, under **Create new**, click **Model Context Protocol**. ![Add tool dialog with Model Context Protocol highlighted under Create new](/img/key-tasks/ai-and-motherduck/copilot-studio/02-mcp-option.png) 3. 
Fill in the MCP server details and pick an authentication method: - **Server name**: `MotherDuck MCP` - **Server description**: `Connect to MotherDuck, query your data, create Dives and more!` - **Server URL**: `https://api.motherduck.com/mcp` - **Authentication**: either `OAuth 2.0` or `API key` (see below) **Option A — OAuth 2.0 (dynamic discovery).** Each end user signs in to MotherDuck with their own account when they first use the agent. Select **OAuth 2.0** and leave **Dynamic discovery** as the type, then click **Create**. ![MCP server configuration with OAuth 2.0 Dynamic discovery selected](/img/key-tasks/ai-and-motherduck/copilot-studio/03a-oauth-auth.png) **Option B — API key (shared service-account token).** All end users share a single MotherDuck token. Useful when you don't want every user to provision a MotherDuck account, for example a Teams bot exposed to a wide audience. Select **API key**, set **Type** to `Header`, enter `Authorization` as the **Header name**, and click **Create**. ![MCP server configuration with API key authentication, Header type, and Authorization header name](/img/key-tasks/ai-and-motherduck/copilot-studio/03b-api-key-auth.png) :::caution **Header name** must be `Authorization` — not `Bearer`. The `Bearer` prefix belongs in the *value* you enter in step 5. ::: 4. Back in the **Add tool** dialog for MotherDuck MCP, open the **Connection** dropdown and click **Create new connection**. ![Connection dropdown showing Create new connection option](/img/key-tasks/ai-and-motherduck/copilot-studio/04-create-connection.png) The next step depends on the authentication method you picked in step 3: - **OAuth 2.0**: Copilot Studio opens a browser window that redirects to MotherDuck. The end user signs in to their MotherDuck account and approves the request. The connection is created once authentication completes — skip to step 6. - **API key**: Copilot Studio shows the token entry dialog described in step 5. 5. 
In the **Connect to MotherDuck MCP** dialog, enter your MotherDuck access token prefixed with `Bearer `:

   ```text
   Bearer <your_token>
   ```

   Replace `<your_token>` with an actual token from [MotherDuck → Settings → Access Tokens](https://app.motherduck.com/settings/tokens), then click **Create**.

   ![Connect to MotherDuck MCP dialog with the Bearer token entered](/img/key-tasks/ai-and-motherduck/copilot-studio/05-bearer-token.png)

   :::tip
   If the agent is published and used by many end users, create a dedicated [service account](/key-tasks/service-accounts-guide/) and use a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) so the agent can't modify data. See [Restricting to read-only access](/key-tasks/ai-and-motherduck/securing-read-only-access/) for details.
   :::

6. Once the connection shows a green check mark, click **Add and configure**. Copilot Studio confirms the tool was added successfully.

7. The MotherDuck MCP entry opens with the full tool list. Enable or disable tools based on what the agent should be allowed to do (for example, disable `query_rw` if the agent should stay read-only), then click **Save**.

   ![MotherDuck MCP tool list with toggles for query, query_rw, list_databases, list_tables, list_columns, search_catalog, ask_docs_question, and others](/img/key-tasks/ai-and-motherduck/copilot-studio/07-tools-list.png)

8. Open the agent's connection manager and click **Connect** on the MotherDuck MCP entry, then submit. This reuses the connection you created in step 5.

9. Switch to the **Test** pane and ask a question that exercises the tools, for example *"What's the highest rated movie with over 10k votes in my IMDB database?"*. The agent calls the MotherDuck tools and responds with live data from your databases.
![Copilot Studio test pane showing the agent calling the query tool and returning IMDB results from MotherDuck](/img/key-tasks/ai-and-motherduck/copilot-studio/09-test-agent.png) :::note When you authenticate with an API key, all users of the Copilot Studio agent share the same MotherDuck token. Queries run by any end user are attributed to the service account that owns the token, not to the individual Microsoft 365 user. Use OAuth 2.0 if you need per-user attribution. ::: > [Copilot Studio MCP documentation](https://learn.microsoft.com/en-us/microsoft-copilot-studio/mcp-add-existing-server-to-agent)
Alternative: Power Automate custom connector (OpenAPI)

If you'd rather wire the MotherDuck MCP server in as a [Power Automate custom connector](https://learn.microsoft.com/en-us/connectors/custom-connectors/) (for example, to share the connector across Copilot Studio and Power Automate flows in the same environment), you can import the following OpenAPI 2.0 spec. The `x-ms-agentic-protocol: mcp-streamable-1.0` extension tells Copilot Studio to treat the connector as a streamable MCP server.

```yaml
swagger: '2.0'
info:
  title: MotherDuck Remote MCP
  description: The remote MCP to connect to MotherDuck tools, docs and more
  version: 1.0.0
host: api.motherduck.com
basePath: /
schemes:
  - https
paths:
  /mcp:
    post:
      summary: MotherDuck Remote MCP
      description: The remote MCP to connect to MotherDuck tools, docs and more
      operationId: InvokeServer
      x-ms-agentic-protocol: mcp-streamable-1.0
      responses:
        '200':
          description: Immediate Response
securityDefinitions:
  api_key:
    type: apiKey
    in: header
    name: Authorization
security:
  - api_key: []
```

In Power Automate, go to **Custom connectors → New custom connector → Import an OpenAPI file**, paste the spec above, and save. When you create a connection, enter `Bearer <your_token>` as the API key value, the same format as the native MCP flow described above.
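Both the native MCP flow and the custom connector send the token in the same way: the header name is `Authorization` and the `Bearer ` prefix is part of the value. The sketch below only illustrates that split. The function name `authorization_header` and the token `example-token` are hypothetical and are not part of any MotherDuck or Microsoft API.

```python
def authorization_header(token):
    """Return the header name and value pair described in the steps above."""
    token = token.strip()
    if token.lower().startswith("bearer "):
        # Guard against pasting a value that already carries the prefix.
        raise ValueError("pass the raw token; the 'Bearer ' prefix is added here")
    # The header NAME is 'Authorization'; 'Bearer ' belongs to the VALUE.
    return {"Authorization": f"Bearer {token}"}


print(authorization_header("example-token"))
```

The value of the returned dictionary is the string you would paste into the connection dialog.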
### Others If you're using **Windsurf**, **Zed**, or another MCP-compatible client, use the following JSON configuration: ```json { "mcpServers": { "MotherDuck": { "url": "https://api.motherduck.com/mcp", "type": "http" } } } ``` :::tip Authentication The remote MCP server uses OAuth, so you'll authenticate with your MotherDuck account during setup. Some clients also support [token-based authentication](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck#creating-an-access-token) through a Bearer header. ::: ## Configuring tool permissions Most MCP clients let you control how the AI uses each tool. The exact UI varies by client, but the general permission levels are: | Permission | Behavior | |------------|----------| | **Always allow** | The AI uses the tool automatically without asking. Faster iteration when errors occur, but no human confirmation before each action. | | **Needs approval** | The AI asks for your confirmation before each tool use. Gives you visibility into every action. | | **Blocked** | The AI cannot use this tool. | :::tip The MCP Server provides both read-only (`query`) and read-write (`query_rw`) tools. For exploratory analysis, setting read-only tools to "Always allow" enables faster back-and-forth when the AI needs to retry or refine queries. You can keep `query_rw` on "Needs approval" or block it if you only need read access. See [Restricting to read-only access](/key-tasks/ai-and-motherduck/securing-read-only-access/) for more options. 
::: ## Remote vs local MCP server MotherDuck offers two MCP server options: | Server | Best for | Setup | Access | |--------|----------|--------|--------| | **Remote** (hosted by MotherDuck) | Most users who query and modify data on MotherDuck cloud | Zero setup; connect through URL and OAuth | Read-write | | **Local** ([mcp-server-motherduck](https://github.com/motherduckdb/mcp-server-motherduck)) | Self-hosted use; local DuckDB files; or when you need full customization | Install and run the server yourself | Fully customizable | The **remote server** is recommended for most use cases. Use the **local server** when you need to work with local DuckDB files, want custom tool configurations, or require full control over the server environment. [**Local MCP Server GitHub Repository** – Self-host the open-source MCP server for DuckDB and MotherDuck](https://github.com/motherduckdb/mcp-server-motherduck) ## Where to go from here - **[AI Data Analysis Getting Started](/getting-started/mcp-getting-started/)**: 5-minute walkthrough of querying data and creating Dives - **[MCP Workflows Guide](/key-tasks/ai-and-motherduck/mcp-workflows/)**: Best practices for getting accurate results from AI-powered analysis - **[MCP Server Reference](/sql-reference/mcp/)**: Server capabilities, available tools, and regional availability - **[Restricting to Read-Only Access](/key-tasks/ai-and-motherduck/securing-read-only-access/)**: Restrict your AI assistant to read-only queries --- Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/considerations-for-loading-data # Loading data best practices > Understanding trade-offs and performance implications when loading data into MotherDuck When loading data into MotherDuck, understanding the trade-offs between different approaches helps you make informed decisions that optimize for your specific use case. This guide explains the key considerations that impact performance, cost, and reliability. 
## File format considerations

The choice of file format significantly impacts loading performance:

| | Parquet (recommended) | CSV | JSON |
|---|---|---|---|
| **Compression** | 5-10x better than CSV | Minimal | Moderate |
| **Performance** | 5-10x more throughput | Slower, especially for large files | Slower than Parquet due to parsing overhead |
| **Schema** | Self-describing with embedded metadata | Requires type inference or specification | Flexible but requires careful type handling. DuckDB scans data to discover the schema before running the query, which can add significant time for large or deeply nested files (see [tips for loading JSON](/key-tasks/data-warehousing/replication/flat-files/#json)) |
| **Best for** | Production data loading, large datasets | Simple data exploration, small datasets | Semi-structured data, API responses |

## Avoid single-row INSERTs

A common mistake is inserting data one row at a time using repeated `INSERT INTO ... VALUES (...)` statements. This pattern is significantly slower than bulk loading because each individual INSERT statement incurs network round-trip overhead to MotherDuck and prevents DuckDB from parallelizing the work.

:::tip
Do not use single-row `INSERT INTO ... VALUES` statements to load data into MotherDuck. Instead, use bulk approaches like `INSERT INTO ... SELECT` from files, `COPY`, or load data from DataFrames. See [Loading data into MotherDuck](/key-tasks/loading-data-into-motherduck/loading-data-into-motherduck.mdx) for recommended methods.
:::

If you're working with a client library (Python, Node.js, Java), avoid looping over rows and calling `execute("INSERT INTO ...")` for each one. Methods like `executemany` also send individual INSERT statements under the hood and are equally slow. Instead, write your data to a file (Parquet or CSV) and load it with `COPY` or `INSERT INTO ... SELECT`, or use a DataFrame-based approach where available.
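The file-based approach described above can be sketched with the Python standard library alone. This is a minimal illustration, not the only way to do it: the helper names `write_rows_to_csv` and `bulk_load_sql`, the table name `orders`, and the sample rows are all hypothetical, and the sketch assumes the target table already exists and that the table name and path are trusted values.

```python
import csv
import os
import tempfile


def write_rows_to_csv(rows, header, path):
    """Write every row to one CSV file so it can be loaded with a single statement."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def bulk_load_sql(table, path):
    """Build one INSERT INTO ... SELECT statement that reads the whole file."""
    # Hypothetical helper: table and path are assumed to be trusted inputs.
    return f"INSERT INTO {table} SELECT * FROM '{path}';"


# Sample data standing in for rows produced by an application.
rows = [(i, f"customer_{i}") for i in range(1000)]
csv_path = write_rows_to_csv(
    rows, ["id", "name"], os.path.join(tempfile.mkdtemp(), "orders.csv")
)
statement = bulk_load_sql("orders", csv_path)

# With the duckdb package installed and a MotherDuck connection `con`,
# the whole batch would be loaded in one call: con.execute(statement)
print(statement)
```

The point of the design is that the client sends one statement for the whole batch instead of one statement per row.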
## Performance optimization strategies

### Batch size

DuckDB internally processes data in row groups of ~122,000 rows and parallelizes work across multiple row groups. This means batch size affects both memory usage and throughput:

| Batch size | What happens |
|---|---|
| **1-100 rows** (single-row INSERTs) | Each statement has network and transaction overhead. Very slow; avoid this pattern entirely. |
| **100K rows** | Fits in roughly one row group. Already a bulk operation and orders of magnitude faster than row-by-row. Good default chunk size when streaming from Python to manage memory. |
| **1M+ rows** | Spans multiple row groups, so DuckDB parallelizes across threads. Best throughput for large loads. |

:::tip
When streaming data from a client library, load in chunks of at least **100K rows** to keep memory manageable while staying well above row-by-row overhead. For maximum throughput on large datasets, aim for **1M+ rows** per load operation to fully leverage DuckDB's parallelization.
:::

Keep individual transactions under roughly one minute. If you have tens of millions of rows, break them into multiple loads rather than one very large transaction.

### Memory management

Effective memory management is crucial for large data loads:

**Data Type Optimization**

- Use explicit schemas to avoid type inference overhead. This is especially important for JSON, where schema discovery can add minutes for large or deeply nested files
- Choose appropriate data types (for example, TIMESTAMP for dates)
- Avoid unnecessary type conversions

**Sorting Strategy**

- Sort data by frequently queried columns during loading
- To re-sort existing tables, use `CREATE OR REPLACE` with the preferred sorting method
- Sorting improves query performance through better data locality
- Consider the trade-off between loading speed and query performance

### Network and location considerations

**Data Location**

- MotherDuck is available on AWS in three regions: **US East (N.
Virginia)** - `us-east-1`, **US West (Oregon)** - `us-west-2`, and **Europe (Frankfurt)** - `eu-central-1`
- For optimal performance, consider locating source data in the same region as your MotherDuck Organization
- Consider network latency when loading from remote sources

**Cloud Storage Integration**

- Direct integration with S3, R2, GCS, and Azure Blob Storage
- Use [cloud storage](/integrations/cloud-storage/) to leverage network speeds for better performance
- Reduces local storage requirements
- Consider setting [force_download=true](https://duckdb.org/docs/stable/configuration/overview) when querying files stored in remote storage to accelerate response times. This could be useful in scenarios where it makes sense to download the full file upfront instead of making many small requests.

## Duckling sizing

**Duckling Selection**

For data sets under 100 GB in size, use Jumbo Ducklings to load the data. For larger data sizes, use [Mega or Giga](/about-motherduck/billing/duckling-sizes/).

## Summary

The key to successful data loading in MotherDuck is understanding the trade-offs between different approaches and optimizing for your specific use case. Focus on:

1. **Bulk loading** with at least 100K rows per chunk, and 1M+ for maximum throughput.
2. If you control how source files are written, use **Parquet** for compression and speed.
3. Write data into **S3** for speedy reads.
4. Use **larger Duckling sizes (Jumbo or bigger)** for loading bigger data sets.

By following these guidelines and understanding the underlying principles, you can build efficient, reliable data loading pipelines that scale with your needs while managing costs effectively.

---

Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-local-machine

# From Your Local Machine

> Moving data from local to MotherDuck through the UI or programmatically.
## Single file

### CLI

Using the CLI, you can connect to MotherDuck, create a database, and load a single local file (JSON, Parquet, CSV, etc.) into a MotherDuck table.

First, connect to MotherDuck using the `ATTACH` command.

```sql
ATTACH 'md:';
```

Create a cloud database (or point to any existing one) and load a local file into a table.

```sql
CREATE DATABASE test01;
USE test01;
CREATE OR REPLACE TABLE orders AS SELECT * FROM 'orders.csv';
```

### UI

In the MotherDuck UI, you can add JSON, CSV, or Parquet files directly using the **Add data** button in the top left of the UI. See the [Getting Started Tutorial](../../../getting-started/e2e-tutorial/part-2#loading-your-data) for details.

## Multiple files or database

To upload multiple files at once, or data in other formats supported by DuckDB, you can use the DuckDB CLI or any other supported [DuckDB client](https://duckdb.org/docs/data/multiple_files/overview.html).

### CLI

If all your files belong in a single table, you can use the [glob syntax to load all files into a single table](https://duckdb.org/docs/data/multiple_files/overview.html). For example, to load all CSV files from a directory into a single table, you can use the following SQL command:

```sql
ATTACH 'md:';
CREATE DATABASE test01;
USE test01;
CREATE OR REPLACE TABLE orders AS SELECT * FROM 'dir/*.csv';
```

If your files are in different formats or you want to load them into different tables, you can first load the files into different tables in a local DuckDB database and then copy the entire database into MotherDuck.
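The first half of that workflow, one table per file in a local DuckDB database, can be sketched as a small statement generator. This is only an illustration under stated assumptions: the helper `table_statements` and the file names are hypothetical, and the sketch assumes each file stem is already a valid SQL identifier.

```python
from pathlib import PurePosixPath


def table_statements(paths):
    """Return one CREATE OR REPLACE TABLE statement per file, named after the file stem."""
    statements = []
    for p in paths:
        # 'data/orders.csv' -> 'orders'; assumes the stem is a valid identifier.
        table = PurePosixPath(p).stem
        statements.append(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM '{p}';")
    return statements


# Hypothetical local files in different formats.
for stmt in table_statements(["data/orders.csv", "data/customers.parquet"]):
    print(stmt)
```

Each generated statement would be run against the local database before the copy step.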
To copy the entire local DuckDB database into MotherDuck, you can use the following SQL commands:

```sql
ATTACH 'md:';
```

```sql
ATTACH 'local.ddb';
CREATE DATABASE cloud_db FROM 'local.ddb';
```

---

Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/authenticating-to-motherduck

# Authenticating to MotherDuck

> Authenticate to a MotherDuck account

MotherDuck supports the following authentication methods:

- **Manual authentication**, typically used by the MotherDuck UI (Google, GitHub, or email and password)
- **Access token authentication**, more convenient for Python, CLI, or other clients
- **[Single Sign-On (SSO)](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/)**, for organizations that want to authenticate through their corporate identity provider (available on Business and Enterprise plans)

## Manual authentication

The MotherDuck UI authenticates using several methods:

- Google
- GitHub
- Username and password

You can use multiple modes of authentication in your account. For example, you can authenticate both through Google and with a username and password as you see fit.

When you authenticate in the CLI or Python, you are redirected to an authentication web page. This happens every session. To avoid having to re-authenticate, you can save your access token, as described in the [Authenticate With an Access Token](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#authentication-using-an-access-token) section.

## Authentication using an access token

If you are using Python or the CLI and don't want to authenticate every session, you can securely save your credentials locally.
### Creating an access token

To create an access token:

- Go to the [MotherDuck UI](https://app.motherduck.com)
- In the top left, click the organization name and then `Settings`
- Click `+ Create token`
- Specify a name for the token that you'll recognize (like "DuckDB CLI on my laptop")
- Specify the type of token you want. Tokens can be Read/Write (default) or [Read Scaling](/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/).
- Choose whether you want the token to expire and then click `Create token`
- Copy the access token to your clipboard by clicking the copy icon

![access token example](../img/creating_access_token.jpg)

### Storing the access token as an environment variable

You can save the access token as `motherduck_token` in your environment variables. An example of setting this in a terminal:

```bash
export motherduck_token='<your_token>'
```

You can also add this line to your `~/.zprofile` or `~/.bash_profile`, or store it in a `.env` file in your project root.

Once this is done, your authentication token is saved and you can connect to MotherDuck with the following connection string:

```bash
duckdb "md:my_db"
```

:::info
This is the best practice for security reasons. The token is sensitive information and should be kept safe. Do not share it with others.
:::

Alternatively, you can specify an access token in the MotherDuck connection string: `md:my_db?motherduck_token=<your_token>`.

```bash
duckdb "md:my_db?motherduck_token=<your_token>"
```

When in the DuckDB CLI, you can use the `.open` command and specify the connection string as an argument.
```CLI
.open md:my_db?motherduck_token=<your_token>
```

## Using connection string parameters

### Authentication using SaaS mode

You can limit MotherDuck's ability to interact with your local environment using `SaaS Mode`:

- Disable reading or writing local files
- Disable reading or writing local DuckDB databases
- Disable installing or loading any DuckDB extensions locally
- Disable changing any DuckDB configurations locally

This mode is useful for third-party tools, such as BI vendors, that host DuckDB themselves and require additional security controls to protect their environments.

You can enable SaaS mode in two ways:

1. **Using a configuration setting** (recommended for persistent configuration):

   ```sql
   SET motherduck_saas_mode = true;
   ```

2. **Using a connection string parameter** (for connection-time configuration):

### CLI

```cli
.open md:[<database_name>]?[motherduck_token=<your_token>]&saas_mode=true
```

### Python

```python
import duckdb

conn = duckdb.connect("md:[<database_name>]?[motherduck_token=<your_token>]&saas_mode=true")
```

:::info
Using the connection string parameter requires `.open` in the DuckDB CLI or `duckdb.connect` in Python. This initiates a new connection to MotherDuck and detaches any existing connection to a local DuckDB database. You cannot provide a token to `ATTACH md:` directly, only when connecting.
:::

### Using attach mode

By default, MotherDuck connects in **workspace mode**, which attaches every database in your saved workspace and keeps attachment changes in sync across parallel connections. To scope the connection to a single database instead, use **single mode** by appending `?attach_mode=single` to the connection string. Single mode is useful for BI tools and other clients that get confused by multiple attached databases. For full details, see [Attach modes](/key-tasks/authenticating-and-connecting-to-motherduck/attach-modes/).
For example, to connect to a database named `my_database` in single mode, run:

```bash
duckdb 'md:my_database?attach_mode=single'
```

:::note
A `<database_name>` that starts with a number cannot be connected to directly. You will need to connect without a database specified and then `CREATE` and `USE` it with a double-quoted name. For example: `USE DATABASE "1database"`
:::

---

Source: https://motherduck.com/docs/key-tasks/database-operations/basics-operations

# Basics database operations

> Create, list, and drop MotherDuck databases using SQL commands.

While embedded DuckDB uses files on your local filesystem to represent databases, MotherDuck implements SQL syntax for creating, listing, and dropping databases.

## Create database

### SQL

```sql
-- [OR REPLACE] and [IF NOT EXISTS] are optional modifiers.
CREATE [OR REPLACE | IF NOT EXISTS] DATABASE <database_name>;
USE <database_name>;
```

Creating copies of databases in MotherDuck in this manner is a metadata-only operation that copies no data.

Learn more in the [`CREATE DATABASE`](/sql-reference/motherduck-sql-reference/create-database/) overview documentation.

## Listing databases

### SQL

```sql
-- returns all connected local and remote databases
SHOW DATABASES;

-- returns current database
SELECT current_database();
```

Learn more in the [`SHOW ALL DATABASES`](/sql-reference/motherduck-sql-reference/show-databases/) overview documentation.

## Delete database

### SQL

```sql
USE <other_database_name>;
DROP DATABASE <database_name>;
```

Example usage:

```sql
> SHOW DATABASES;
test01

-- Let's put two different t1 tables into two different databases
> CREATE TABLE test01.t1 AS (SELECT range AS r FROM range(12));
> SELECT * FROM t1;

-- now for the other database
> CREATE DATABASE test02;
> CREATE TABLE test02.t1 AS (SELECT 'test02' AS dbname);

-- show the databases we've created
> SHOW DATABASES;
test01
test02
```

Learn more in the [`DROP DATABASE`](/sql-reference/motherduck-sql-reference/show-databases/) overview documentation.
---

Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/postgres-endpoint

# Connect via the Postgres endpoint

> Connect to MotherDuck using any Postgres-compatible client via the Postgres wire protocol endpoint

MotherDuck's Postgres endpoint lets you query your databases using any client that speaks the [PostgreSQL wire protocol](https://www.postgresql.org/docs/current/protocol.html), without installing a DuckDB client library. This is ideal for serverless environments, BI tools, or languages without a DuckDB SDK.

For full-featured access (including hybrid execution, local caching, and the complete DuckDB extension ecosystem), use the [DuckDB SDK](/getting-started/interfaces/client-apis/) instead.

## Before you start

You'll need a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck). Set it as an environment variable:

```bash
export MOTHERDUCK_TOKEN="your_token_here"
```

## Connect with psql

```bash
PGPASSWORD=$MOTHERDUCK_TOKEN psql \
  -h pg.us-east-1-aws.motherduck.com \
  -p 5432 \
  -U postgres \
  "dbname=md: sslmode=verify-full sslrootcert=system"
```

## Connect with a URI

```sh
postgresql://postgres:$MOTHERDUCK_TOKEN@pg.us-east-1-aws.motherduck.com:5432/md:?sslmode=verify-full&sslrootcert=system
```

Use `md:` as the database name to connect to your default database. To connect to a specific database, replace `md:` with the database name, for example `sample_data`.

:::info
For security, always use environment variables for your MotherDuck token. Never hardcode tokens in your application code.
:::

## Secure your connection

Always connect with SSL enabled. The recommended approach is `sslmode=verify-full` with `sslrootcert=system`, which verifies the server certificate against your operating system's trusted roots.
If your client doesn't support this, you can download the [ISRG Root X1](https://letsencrypt.org/certs/isrgrootx1.pem) certificate from Let's Encrypt and set `sslrootcert` to its path. Some libraries (psycopg2, JDBC, node-postgres) handle SSL differently — see the language-specific guides below or the [SSL reference](/sql-reference/postgres-endpoint#ssl-and-certificate-verification) for details. ## Key things to know - You're writing **DuckDB SQL**, not PostgreSQL SQL. Queries and MotherDuck SQL that run entirely inside MotherDuck generally work, but the Postgres endpoint is not a full DuckDB client. - Commands that depend on **local files, local attachments, or extension management** are not supported over the Postgres endpoint. Examples: local-file `COPY`, `EXPORT DATABASE`, `IMPORT DATABASE`, `ATTACH ':memory:'`, `ATTACH '/path/to/file.duckdb'`, `CREATE DATABASE ... FROM '/path/to/file.duckdb'`, `MD_RUN=LOCAL` on table functions, `INSTALL`, and `LOAD`. - Use the Postgres endpoint for query execution, DDL and DML on MotherDuck tables, metadata inspection, and server-side reads from remote storage. - Avoid using `SET` statements, temporary tables, or result-creation commands — those are not supported in Postgres-endpoint server mode. - Prefer **long-lived connections** rather than opening and closing per query. For high-concurrency applications, use a connection pooler. ## DuckLake databases You can query and write to MotherDuck-managed [DuckLake](/concepts/ducklake/) databases over the Postgres endpoint the same way as native-storage MotherDuck databases — connect with a [read-write token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#authentication-using-an-access-token) and run `SELECT`, DDL, and DML against them. The standard Postgres endpoint limitations above still apply (for example, client-side `COPY` from local files is not supported). 
Using the Postgres endpoint as the metadata catalog for a self-hosted DuckLake by pointing a DuckDB client running DuckLake at the endpoint as its catalog backend, is not supported yet. ## Language and platform guides - [Connect from Python (psycopg2 / psycopg3)](./python) - [Connect from Java (JDBC)](./java) - [Connect from Node.js](./nodejs) - [Connect from Cloudflare Workers](./cloudflare-workers) - [Connect from Drizzle](./drizzle) ## Reference For connection parameters, SSL options, session settings, and limitations, see the [Postgres Endpoint reference](/sql-reference/postgres-endpoint). --- Source: https://motherduck.com/docs/key-tasks/service-accounts-guide/create-and-configure-service-accounts # Create and configure service accounts > Learn how to create service accounts, create access tokens, and configure Duckling resources. A service account is a non-human user identity for workloads that need to connect to MotherDuck without using a person's credentials. Use service accounts for backend services, scheduled pipelines, BI connections, embedded analytics, and customer-facing analytics workloads. Each service account has its own credentials and Duckling configuration. This gives the workload isolated compute and makes it easier to rotate credentials without disrupting human users. :::warning[Admin access required] Creating service accounts, creating service account tokens, and configuring service account Ducklings requires an organization Admin. REST API examples use a read/write access token generated by an Admin user. Pass the token in the `Authorization` header as `Bearer `. ::: ## Create a service account Choose a stable username for the service account. The username must be unique within your organization and can contain letters, numbers, and underscores. ### UI ![Service account creation form](../img/sa_ui.png) 1. In the MotherDuck UI, go to **Settings** > **Service Accounts**. 2. Click **Create service account**. 3. 
Enter a username for the service account. 4. Click **Create service account**. ### API using curl Use the [`POST /v1/users`](/sql-reference/rest-api/users-create-service-account/) endpoint to create a service account. ```bash curl -X POST \ https://api.motherduck.com/v1/users \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ -d '{ "username": "analytics_service_account" }' ``` The response includes the service account `username`. Store this username in your provisioning system. The REST API doesn't provide an endpoint for listing all service accounts in an organization. ### API using Python Use the [`POST /v1/users`](/sql-reference/rest-api/users-create-service-account/) endpoint to create a service account. ```python import requests response = requests.post( "https://api.motherduck.com/v1/users", headers={ "Authorization": "Bearer ", "Content-Type": "application/json", }, json={"username": "analytics_service_account"}, ) response.raise_for_status() print(response.json()["username"]) ``` The response includes the service account `username`. Store this username in your provisioning system. The REST API doesn't provide an endpoint for listing all service accounts in an organization. ## Create an access token Create a token for the service account after you create the account. The token value is shown only once, so store it in a secret manager before closing the modal or discarding the API response. ### UI ![Service account details page](../img/sa_details.png) 1. In **Settings** > **Service Accounts**, open the service account details page. 2. Click **Create token**. 3. Enter a token name. 4. Choose the token type: - **Read/Write Token** for writes, administration, and general service workloads. - **Read Scaling Token** for read-heavy workloads that should use [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/). 5. To set an expiration, select **Automatically expire this token** and choose a time-to-live. 6. 
Click **Create token**, then copy the token and store it securely. ### API using curl Use the [`POST /v1/users/{username}/tokens`](/sql-reference/rest-api/users-create-token/) endpoint to create a token for a known service account username. ```bash curl -X POST \ https://api.motherduck.com/v1/users/analytics_service_account/tokens \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ -d '{ "name": "analytics-service-token", "token_type": "read_write" }' ``` Set `token_type` to `read_scaling` when you need a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/). To create an expiring token, include `ttl` as seconds between `300` and `31536000`. ### API using Python Use the [`POST /v1/users/{username}/tokens`](/sql-reference/rest-api/users-create-token/) endpoint to create a token for a known service account username. ```python import requests response = requests.post( "https://api.motherduck.com/v1/users/analytics_service_account/tokens", headers={ "Authorization": "Bearer ", "Content-Type": "application/json", }, json={ "name": "analytics-service-token", "token_type": "read_write", }, ) response.raise_for_status() token = response.json()["token"] print(token) ``` Set `token_type` to `read_scaling` when you need a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/). To create an expiring token, include `ttl` as seconds between `300` and `31536000`. :::note If you create a service account through the API and plan to use read scaling, connect as that service account with a read/write token before using read scaling tokens for that account. ::: ## Configure Ducklings Configure Duckling resources for the service account based on the workload it runs. The read/write Duckling handles writes and general queries. The read scaling pool handles read-only connections that use read scaling tokens. ### UI ![Service account Duckling size settings](../img/sa_set_instance_size.png) 1. 
In **Settings** > **Service Accounts**, find the service account. 2. Use the **Read/Write Duckling** dropdown to choose the read/write Duckling size. 3. If you use [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/), choose the read scaling Duckling size and pool size. ### API using curl Use [`GET /v1/users/{username}/instances`](/sql-reference/rest-api/ducklings-get-duckling-config-for-user/) to inspect the current configuration before updating it. ```bash curl -X GET \ https://api.motherduck.com/v1/users/analytics_service_account/instances \ -H "Authorization: Bearer " ``` Then use [`PUT /v1/users/{username}/instances`](/sql-reference/rest-api/ducklings-set-duckling-config-for-user/) to update the service account's Ducklings. ```bash curl -X PUT \ https://api.motherduck.com/v1/users/analytics_service_account/instances \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ -d '{ "config": { "read_write": { "instance_size": "standard" }, "read_scaling": { "instance_size": "pulse", "flock_size": 4 } } }' ``` The update request requires both `read_write` and `read_scaling` configuration blocks. ### API using Python Use [`GET /v1/users/{username}/instances`](/sql-reference/rest-api/ducklings-get-duckling-config-for-user/) to inspect the current configuration before updating it. ```python import requests headers = {"Authorization": "Bearer "} current_config = requests.get( "https://api.motherduck.com/v1/users/analytics_service_account/instances", headers=headers, ) current_config.raise_for_status() print(current_config.json()) ``` Then use [`PUT /v1/users/{username}/instances`](/sql-reference/rest-api/ducklings-set-duckling-config-for-user/) to update the service account's Ducklings. 
```python import requests response = requests.put( "https://api.motherduck.com/v1/users/analytics_service_account/instances", headers={ "Authorization": "Bearer ", "Content-Type": "application/json", }, json={ "config": { "read_write": {"instance_size": "standard"}, "read_scaling": { "instance_size": "pulse", "flock_size": 4, }, } }, ) response.raise_for_status() print(response.json()) ``` The update request requires both `read_write` and `read_scaling` configuration blocks. ## Connect as the service account Use the service account token anywhere you would use a MotherDuck access token. For example, set `motherduck_token` in a DuckDB connection string or set `MOTHERDUCK_TOKEN` in your environment. See [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/) for connection string examples. ## Related content - [Manage service accounts and tokens](/key-tasks/service-accounts-guide/manage-service-accounts-and-tokens/) - [Impersonate service accounts](/key-tasks/service-accounts-guide/impersonate-service-accounts/) - [MotherDuck REST API](/sql-reference/rest-api/motherduck-rest-api/) - [Read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) --- Source: https://motherduck.com/docs/key-tasks/data-warehousing/orchestration/github-action-cron # GitHub Actions > Schedule MotherDuck SQL and dbt jobs with GitHub Actions as a lightweight cron-based orchestrator. GitHub Actions works well as a lightweight orchestrator for simple MotherDuck jobs: nightly SQL scripts, small ELT steps, dbt builds, smoke tests, and periodic exports. It is not a full data orchestrator, but it is often enough when a pipeline has one or two steps and can tolerate GitHub's scheduler behavior. 
## When to use this pattern | Use GitHub Actions when | Use a dedicated orchestrator when | |-------------------------|-----------------------------------| | The job has a small number of steps | Jobs have complex dependencies or branching | | A missed or delayed run can be retried manually | Every run needs strict service-level guarantees | | The pipeline can run from repository files | State, retries, and backfills need first-class tracking | | GitHub is already where you review pipeline changes | Multiple teams need a shared orchestration UI | For larger workflows, use a tool from the [MotherDuck orchestration ecosystem](https://motherduck.com/ecosystem/?category=Orchestration). ## Set up authentication Create a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#creating-an-access-token), preferably from a service account dedicated to the pipeline. Store it as a GitHub repository secret named `MOTHERDUCK_TOKEN`: ```bash gh secret set MOTHERDUCK_TOKEN ``` Use the token as an environment variable in workflow steps. Avoid putting tokens directly into SQL files, command arguments, artifacts, or logs. ## Choose the trigger Most MotherDuck cron jobs should support both manual and scheduled runs with GitHub Actions [`workflow_dispatch`](https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#onworkflow_dispatch) and [`schedule`](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#schedule) triggers: ```yaml on: workflow_dispatch: schedule: - cron: "17 2 * * *" ``` Keep these GitHub Actions scheduling details in mind: - Scheduled workflows run from the latest commit on the default branch. - Cron schedules use UTC by default. - The shortest supported interval is every 5 minutes. - Jobs scheduled at the top of the hour can be delayed or dropped during periods of high GitHub Actions load. Pick a non-zero minute such as `17` or `43`. 
- `workflow_dispatch` lets you test the same workflow manually and rerun failed jobs after a fix. ## Example: run a SQL file on a schedule This example runs a checked-in SQL script every night and on demand. It uses: - Least-privilege repository permissions - A timeout so failed jobs do not burn runner minutes indefinitely - A concurrency group so two runs do not write to the same target at once - The MotherDuck install script for a compatible DuckDB CLI Create `.github/workflows/motherduck-nightly-sql.yml`: ```yaml name: motherduck nightly sql on: workflow_dispatch: schedule: - cron: "17 2 * * *" permissions: contents: read concurrency: group: motherduck-nightly-sql cancel-in-progress: false jobs: run-sql: runs-on: ubuntu-24.04 timeout-minutes: 15 env: motherduck_token: ${{ secrets.MOTHERDUCK_TOKEN }} steps: - name: Check out repository uses: actions/checkout@v6 - name: Install DuckDB CLI run: | install_home="$RUNNER_TEMP/motherduck" mkdir -p "$install_home" curl -s https://install.motherduck.com | env -u motherduck_token HOME="$install_home" sh echo "$install_home/.duckdb/cli/latest" >> "$GITHUB_PATH" - name: Run nightly SQL run: duckdb "md:" < sql/nightly_orders.sql ``` Create `sql/nightly_orders.sql`: ```sql CREATE DATABASE IF NOT EXISTS analytics; USE analytics; CREATE SCHEMA IF NOT EXISTS orchestration; CREATE TABLE IF NOT EXISTS orchestration.github_action_runs ( run_id VARCHAR, workflow_name VARCHAR, run_started_at TIMESTAMP ); DELETE FROM orchestration.github_action_runs WHERE run_id = getenv('GITHUB_RUN_ID'); INSERT INTO orchestration.github_action_runs VALUES ( getenv('GITHUB_RUN_ID'), getenv('GITHUB_WORKFLOW'), current_timestamp ); ``` Replace `analytics` with the MotherDuck database your pipeline should write to. The example creates the database if it does not already exist so a new repository can run without extra setup. The GitHub secret is named `MOTHERDUCK_TOKEN`, while the workflow exposes it as `motherduck_token`. 
The DuckDB CLI can use that environment variable to connect to MotherDuck non-interactively in GitHub Actions. The install step uses `RUNNER_TEMP` as `HOME` and unsets `motherduck_token` for the installer process so the install script does not try to update the runner's shell profile or validate the connection before the SQL step runs. ## Example: run dbt on a schedule For dbt projects, keep the dbt profile in the repository and read the MotherDuck token from the GitHub secret. Create `.github/workflows/motherduck-dbt.yml`: ```yaml name: motherduck dbt on: workflow_dispatch: schedule: - cron: "43 3 * * *" permissions: contents: read concurrency: group: motherduck-dbt-prod cancel-in-progress: false jobs: dbt-build: runs-on: ubuntu-24.04 timeout-minutes: 30 env: MOTHERDUCK_TOKEN: ${{ secrets.MOTHERDUCK_TOKEN }} steps: - name: Check out repository uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.12" cache: pip - name: Install dbt run: python -m pip install -r requirements.txt - name: Install dbt packages run: dbt deps - name: Build dbt project run: dbt build --profiles-dir .github/dbt --target prod ``` Create `requirements.txt`: ```text dbt-duckdb>=1.9,<2.0 ``` Create `.github/dbt/profiles.yml`: ```yaml motherduck: target: prod outputs: prod: type: duckdb path: "md:analytics?motherduck_token={{ env_var('MOTHERDUCK_TOKEN') }}" threads: 4 ``` In `dbt_project.yml`, set the same profile name: ```yaml profile: motherduck ``` ## Production checklist | Area | Recommendation | |------|----------------| | Authentication | Use a service account token stored as `MOTHERDUCK_TOKEN`. Rotate it on the same cadence as other production secrets. | | Permissions | Set `permissions: contents: read` unless the workflow must write to the repository or call GitHub APIs. | | Scheduling | Use non-zero cron minutes and keep `workflow_dispatch` enabled for manual retries. 
| | Concurrency | Use a `concurrency` group for jobs that write to the same tables. | | Idempotency | Make SQL safe to rerun. Prefer `CREATE TABLE IF NOT EXISTS`, `CREATE OR REPLACE TABLE`, `MERGE`, or delete-and-insert patterns keyed by the run or partition. | | Timeouts | Set `timeout-minutes` on every job. | | Dependencies | Pin dependencies in `requirements.txt` or an equivalent lock file. Use dependency caching for Python/dbt jobs. | | Environments | Use separate service accounts and databases for development, staging, and production. | | Observability | Write a run record to a small audit table and rely on GitHub Actions notifications for failures. | ## Related content - [Authenticating to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/) - [dbt with DuckDB and MotherDuck](/integrations/transformation/dbt/) - [DuckDB CLI](/getting-started/interfaces/connect-query-from-duckdb-cli/) - [Orchestration integrations](https://motherduck.com/ecosystem/?category=Orchestration) --- Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-md-python # Loading data to MotherDuck with Python > Efficient methods for loading data from Python using DataFrames, temporary files, or bulk inserts. When you ingest data with Python, typically from an API or other source, you have three options to load it into MotherDuck: 1. **FAST:** Use a Pandas, Polars, or PyArrow dataframe as an in-memory buffer before bulk loading. This is the easiest approach because dataframe libraries are optimized for bulk insert. 2. **FAST:** Write to a temporary file and load it with a `COPY` command. This involves writing to disk, but the `COPY` command is faster than `INSERT` statements. 3. **SLOW:** Use `executemany` to perform several `INSERT` statements in a single transaction. This should be discouraged unless data is very small (fewer than 500 rows). 
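The buffering idea behind option 1 can be sketched independently of any dataframe library: collect rows as they arrive, then hand them over in fixed-size batches. This is a minimal illustration using only the standard library; `iter_chunks` is a hypothetical helper and the `(id, name)` row shape is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # assumed row shape: (id, name)


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `chunk_size` rows from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Simulate rows arriving from an API, then group them into batches of 3.
rows = [(i, f"name_{i}") for i in range(1, 8)]
for chunk in iter_chunks(rows, chunk_size=3):
    print(len(chunk), chunk[0], chunk[-1])
```

Each batch would then be converted to a dataframe or written to a file and loaded in bulk, rather than inserted row by row.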
:::tip
Whichever option you pick, we recommend loading data in chunks (typically `120K` rows to match row group size) to avoid memory issues, and keeping each transaction small enough to finish within about a minute. You can further optimize data loading by reading our guidelines on [connections](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck.md).
:::

## 1. load data to MotherDuck with Pandas, Polars, or PyArrow

When using a dataframe library you can load data to MotherDuck in a single transaction.

DuckDB uses Apache Arrow as its internal data interchange format. This means **PyArrow and Polars** (which are Arrow-native) benefit from zero-copy data transfer, making them the most memory-efficient choice. **Pandas** with the default NumPy backend copies data during transfer, which doubles memory usage. If you use Pandas, consider using [Arrow-backed dtypes](https://pandas.pydata.org/docs/user_guide/pyarrow.html) (`dtype_backend="pyarrow"`) to avoid the extra copy.

```python
# Creating your table with PyArrow
import duckdb
import pyarrow as pa

data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data)

con = duckdb.connect('md:')
con.sql('CREATE TABLE my_table AS SELECT * FROM arrow_table')
```

### Batching data

When you have a large dataset, we recommend chunking your data and loading it in batches. This helps you avoid memory issues and keeps each transaction from growing too large. This example uses PyArrow and DuckDB in a class to:

1. Initialize a connection
2. Create a database and table if they do not already exist
3. Accept a PyArrow table to insert
4. Insert the data in chunks

```python
import duckdb
import os
import pyarrow as pa
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


class ArrowTableLoadingBuffer:
    def __init__(
        self,
        duckdb_schema: str,
        pyarrow_schema: pa.Schema,
        database_name: str,
        table_name: str,
        destination="local",
        chunk_size: int = 100_000,  # Default chunk size
    ):
        self.duckdb_schema = duckdb_schema
        self.pyarrow_schema = pyarrow_schema
        self.database_name = database_name
        self.table_name = table_name
        self.total_inserted = 0
        self.conn = self.initialize_connection(destination, duckdb_schema)
        self.chunk_size = chunk_size

    def initialize_connection(self, destination, sql):
        if destination == "md":
            logging.info("Connecting to MotherDuck...")
            # Environment variable names are case-sensitive on most systems,
            # so check both spellings of the token variable.
            token = os.environ.get("motherduck_token") or os.environ.get("MOTHERDUCK_TOKEN")
            if not token:
                raise ValueError(
                    "MotherDuck token is required. Set the environment variable 'MOTHERDUCK_TOKEN'."
                )
            conn = duckdb.connect("md:")
            logging.info(
                f"Creating database {self.database_name} if it doesn't exist"
            )
            conn.execute(f"CREATE DATABASE IF NOT EXISTS {self.database_name}")
            conn.execute(f"USE {self.database_name}")
        else:
            conn = duckdb.connect(database=f"{self.database_name}.db")
        conn.execute(sql)  # Execute schema setup on initialization
        return conn

    def insert(self, table: pa.Table):
        total_rows = table.num_rows
        for batch_start in range(0, total_rows, self.chunk_size):
            batch_end = min(batch_start + self.chunk_size, total_rows)
            chunk = table.slice(batch_start, batch_end - batch_start)
            self.insert_chunk(chunk)
            logging.info(f"Inserted chunk {batch_start} to {batch_end}")
        self.total_inserted += total_rows
        logging.info(f"Total inserted: {self.total_inserted} rows")

    def insert_chunk(self, chunk: pa.Table):
        self.conn.register("buffer_table", chunk)
        insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
        self.conn.execute(insert_query)
        self.conn.unregister("buffer_table")
```

Using the above class, you can load your data in chunks.

```python
import pyarrow as pa

# Define the explicit PyArrow schema
pyarrow_schema = pa.schema([
    ('id', pa.int32()),
    ('name', pa.string())
])

# Sample data to create a PyArrow table based on the schema
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
}
arrow_table = pa.table(data, schema=pyarrow_schema)

# Define the DuckDB schema as a DDL statement
duckdb_schema = "CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)"

# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
    duckdb_schema=duckdb_schema,
    pyarrow_schema=pyarrow_schema,
    database_name="my_db",  # The DuckDB database filename or MotherDuck database name
    table_name="my_table",  # The name of the table in DuckDB or MotherDuck
    destination="md",  # Set "md" for MotherDuck or "local" for a local DuckDB database
    chunk_size=2  # Example chunk size for illustration
)

# Load the data
loader.insert(arrow_table)
```

### Typing your dataset

When working with a production pipeline, we recommend typing your dataset explicitly to avoid issues with type inference. PyArrow is our recommendation because it makes typing straightforward, especially for complex data types. In the above example, the schema is defined explicitly on both the PyArrow table and the DuckDB schema.

```python
# Initialize the loading buffer
loader = ArrowTableLoadingBuffer(
    duckdb_schema=duckdb_schema,  # prepare a DuckDB DDL statement
    pyarrow_schema=pyarrow_schema,  # define your PyArrow schema explicitly
    database_name="my_db",
    table_name="my_table",
    destination="md",
    chunk_size=2
)
```

## 2. write to a temporary file and load with `COPY`

When you have a large dataset, another method is to write your data to temporary files and load them into MotherDuck using a `COPY` command. This also works well if you have existing data in blob storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage, because you benefit from cloud network speed.
```python import pyarrow as pa import pyarrow.parquet as pq import duckdb import os # Step 1: Define the schema and create a large PyArrow table schema = pa.schema([ ('id', pa.int32()), ('name', pa.string()) ]) # Example data - multiply the data to simulate a large dataset data = { 'id': list(range(1, 1000001)), # Simulating 1 million rows 'name': ['Name_' + str(i) for i in range(1, 1000001)] } # Create the PyArrow table with the schema large_table = pa.table(data, schema=schema) # Step 2: Write the large PyArrow table to a Parquet file parquet_file = "large_data.parquet" pq.write_table(large_table, parquet_file) # Step 3: Load the Parquet file into MotherDuck using the COPY command conn = duckdb.connect("md:") # Connect to MotherDuck conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)") # Use the COPY command to load the Parquet file into MotherDuck conn.execute(f"COPY my_table FROM '{os.path.abspath(parquet_file)}' (FORMAT 'parquet')") print("Data successfully loaded into MotherDuck") ``` ## 3. use `executemany` for small datasets For small datasets (fewer than 500 rows), you can use the `executemany` method to insert data row by row in a single transaction. This approach is the slowest of the three options and should only be used when working with very small amounts of data. ```python import duckdb # Sample data as a list of tuples data = [ (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David'), (5, 'Eva') ] con = duckdb.connect('md:') con.execute('CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)') con.executemany('INSERT INTO my_table VALUES (?, ?)', data) print("Data successfully loaded into MotherDuck") ``` :::warning The `executemany` method sends individual `INSERT` statements, which is significantly slower than the dataframe or `COPY` approaches. Use Option 1 or Option 2 for datasets larger than a few hundred rows. 
:::

---

Source: https://motherduck.com/docs/key-tasks/data-warehousing/replication/postgres

# PostgreSQL

> Replicate PostgreSQL tables to MotherDuck using DuckDB and the PostgreSQL extension.

This page shows SQL patterns for connecting DuckDB to PostgreSQL, connecting to MotherDuck, and writing data from PostgreSQL into MotherDuck. For more complex replication scenarios, use one of our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion).

If you are looking for the [pg_duckdb extension](https://github.com/duckdb/pg_duckdb), see the [pg_duckdb explainer page](/concepts/pgduckdb).

To skip the documentation and look at the entire script, see the full SQL script below:

```sql
-- install the PostgreSQL extension in DuckDB
INSTALL postgres;
LOAD postgres;

-- tune the local DuckDB client for a larger initial load
SET threads = 4;
SET memory_limit = '4GB';
SET pg_connection_limit = 4;
SET pg_pages_per_task = 250;

-- attach PostgreSQL as pg_db
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS pg_db (TYPE POSTGRES, READ_ONLY);

-- connect to MotherDuck
ATTACH 'md:';
USE my_db;

-- copy a PostgreSQL table into MotherDuck
CREATE OR REPLACE TABLE main.postgres_table AS
SELECT * FROM pg_db.public.some_table;
```
## Loading the PostgreSQL extension and authenticating :::info MotherDuck does not yet support the PostgreSQL and MySQL extensions, so you need to perform the following steps on your own computer or cloud computing resource. We are working on supporting the PostgreSQL extension on the server side so that this can happen within the MotherDuck app in the future with improved performance. ::: The first step is to install and load the PostgreSQL extension using the [DuckDB CLI](/getting-started/interfaces/connect-query-from-duckdb-cli): ```sql INSTALL postgres; LOAD postgres; ``` Once this is completed, you can connect to PostgreSQL by attaching it to your DuckDB session: ```sql ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS pg_db (TYPE POSTGRES, READ_ONLY); ``` More detailed information can be found on the [DuckDB documentation](https://duckdb.org/docs/extensions/postgres.html#connecting). For larger initial loads, tune the DuckDB client explicitly instead of relying on defaults: ```sql SET threads = 8; SET memory_limit = '8GB'; SET pg_connection_limit = 8; SET pg_pages_per_task = 250; ``` `pg_connection_limit` controls how many PostgreSQL connections DuckDB may open for the scan, while `pg_pages_per_task` controls how much table work is grouped into each scan task. ## Connecting to MotherDuck and inserting the table Once you are connected to your PostgreSQL database, you need to connect to MotherDuck. To learn more, see [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck). ```sql ATTACH 'md:'; USE my_db; ``` Once you have authenticated, you can use `CREATE TABLE AS SELECT` to replicate data from PostgreSQL into MotherDuck. ```sql CREATE OR REPLACE TABLE main.postgres_table AS SELECT * FROM pg_db.public.some_table ``` Congratulations! You have now replicated data from PostgreSQL into MotherDuck. 
## Choosing the right PostgreSQL workflow ### Use DuckDB's PostgreSQL extension for client-side movement Use DuckDB's PostgreSQL extension when you want to copy a PostgreSQL table into MotherDuck for analytics, backfill a MotherDuck table from PostgreSQL, or export a DuckDB or MotherDuck result set back into PostgreSQL from a controlled DuckDB client. Keep the client close to both systems, use `READ_ONLY` for PostgreSQL sources, and chunk large writes when the destination is PostgreSQL so you do not overload an OLTP database. ### Use the Postgres endpoint for PostgreSQL-compatible clients Use the [Postgres endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint) when an application, BI tool, or serverless runtime needs to connect to MotherDuck through the PostgreSQL wire protocol. It is the preferred path for PostgreSQL-compatible clients because it does not require installing or operating a PostgreSQL extension. ### Use pg_duckdb when the query must run inside PostgreSQL Use `pg_duckdb` only when you specifically need PostgreSQL itself to host the integration. This is useful when queries must run inside an existing PostgreSQL database, when PostgreSQL-local tables need to be joined with DuckDB or MotherDuck data from that PostgreSQL environment, or when a tool must connect to a PostgreSQL server that you control. For ongoing production replication from PostgreSQL into MotherDuck, prefer an ingestion or CDC partner. Those tools handle scheduling, retries, incremental state, schema changes, and operational monitoring better than a one-off SQL script. ## Best practices Here are a few tips to keep large PostgreSQL replication jobs predictable. ### Run DuckDB close to both systems The DuckDB client is the data mover in this workflow. Run it on a machine with a good network path to both PostgreSQL and MotherDuck, and avoid running large backfills on the same host as a production PostgreSQL instance when possible. 
### Tune scan parallelism explicitly Start with `threads` set to the available CPU count on the client and `memory_limit` set below total system memory. For larger tables, start with `pg_connection_limit` in the `4-8` range and `pg_pages_per_task` in the `250-1000` range, then tune after observing the source database. ::::warning[Watch Out] Increasing `pg_connection_limit` can increase pressure on the source PostgreSQL instance. If PostgreSQL memory or connection pressure climbs, reduce `pg_connection_limit` before reducing DuckDB `threads`. :::: ### Keep PostgreSQL sources read-only Use `READ_ONLY` when attaching PostgreSQL for an initial replication job. For long-lived scripts, use PostgreSQL environment variables, the PostgreSQL password file, or DuckDB secrets instead of embedding credentials directly in the connection string. ### Reduce each statement's working set The DuckDB side of this workflow is usually streaming, so out-of-memory risk is often driven by the source PostgreSQL instance and total host headroom rather than DuckDB buffering the full table. Project only the columns you need when source rows are wide, and replicate very large tables in smaller primary key or time ranges. ### Load in chunks For a very large initial backfill, create the target table once and then insert one range at a time. ```sql INSTALL postgres; LOAD postgres; SET threads = 4; SET memory_limit = '4GB'; SET pg_connection_limit = 4; SET pg_pages_per_task = 250; ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS pg_db (TYPE POSTGRES, READ_ONLY); ATTACH 'md:'; USE my_db; CREATE TABLE IF NOT EXISTS main.postgres_table AS SELECT * FROM pg_db.public.some_table WHERE 1 = 0; INSERT INTO main.postgres_table SELECT * FROM pg_db.public.some_table WHERE updated_at >= TIMESTAMP '2026-01-01' AND updated_at < TIMESTAMP '2026-02-01'; ``` Repeat the `INSERT` statement for each chunk until the backfill is complete. 
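Writing one `INSERT` per range by hand gets tedious for long backfills. The following is a minimal sketch that generates the statements, assuming monthly ranges on `updated_at` and the table names from the example above; `monthly_ranges` and `backfill_statements` are hypothetical helpers, not part of DuckDB or MotherDuck.

```python
from datetime import date
from typing import Iterator, List, Tuple


def monthly_ranges(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield half-open (lower, upper) ranges that split [start, end) at month boundaries."""
    lower = start
    while lower < end:
        # First day of the month after `lower`.
        if lower.month == 12:
            upper = date(lower.year + 1, 1, 1)
        else:
            upper = date(lower.year, lower.month + 1, 1)
        upper = min(upper, end)
        yield lower, upper
        lower = upper


def backfill_statements(
    start: date,
    end: date,
    target: str = "main.postgres_table",       # names taken from the example above
    source: str = "pg_db.public.some_table",
    column: str = "updated_at",                # assumed range column
) -> List[str]:
    """Build one INSERT statement per monthly range."""
    return [
        f"INSERT INTO {target} SELECT * FROM {source} "
        f"WHERE {column} >= TIMESTAMP '{lo.isoformat()}' "
        f"AND {column} < TIMESTAMP '{hi.isoformat()}';"
        for lo, hi in monthly_ranges(start, end)
    ]


for statement in backfill_statements(date(2026, 1, 1), date(2026, 4, 1)):
    print(statement)
```

Running each statement separately, for example in a loop over a DuckDB connection that has both databases attached, keeps every transaction limited to one range, so a failed chunk can be retried without restarting the whole backfill.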
## Handling more complex workflows

Production use cases tend to be much more complex and include things like incremental builds and state management. In those scenarios, take a look at our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion), which include many options, some of which offer native Python. An overview of the MotherDuck Ecosystem is shown below.

![Diagram](../../../img/md-diagram.svg)

---
Source: https://motherduck.com/docs/key-tasks/sharing-data/sharing-overview

# Sharing data in MotherDuck

> MotherDuck data sharing model concepts including read-only shares and scope options.

MotherDuck's data sharing model has the following key characteristics:

- Sharing is at the granularity of a MotherDuck database.
- Sharing is read-only.
- Sharing is done through **share** objects.
- You can make shares discoverable and queryable by all users in your [Organization](../managing-organizations/managing-organizations.mdx).
- You can create restricted shares, where access to each is managed with an [Access Control List (ACL)](./sharing-with-users.md).
- Alternatively, you can use hidden share URLs to limit access to specific people in your organization you share the URL with.
- You can also configure the URL of a hidden share to be accessible by anyone with a MotherDuck account in the same cloud region as your Organization.

:::note
Shares are **region-scoped** based on your Organization's cloud region. Each MotherDuck Organization is scoped to a single cloud region that must be chosen at Org creation when signing up. MotherDuck is available on AWS in three regions:

- **US East (N. Virginia):** `us-east-1`
- **US West (Oregon):** `us-west-2`
- **Europe (Frankfurt):** `eu-central-1`
:::

Sharing in MotherDuck works as follows:

1. The **data provider** shares their database in MotherDuck by creating a share.
2. The **data consumer** attaches said share, which creates a database clone in their workspace.
The data consumer can now query this database.
3. The **data provider** periodically updates the share to push updates to the database to **data consumers**.

## Creating a share

The first step in sharing databases in MotherDuck is to create a share, which can be done in both the UI and SQL. Creating a share does not incur additional costs, and no actual data is copied or transferred; creating a share is a zero-copy, metadata-only operation.

### UI

Click on the "trident" next to the database you'd like to share. Select "share". Then:

![trident](./img/ui-share_new.png)

1. Optionally, choose a share name. The default is the database name.
2. Choose whether the share should be accessible by all users in your Organization, specified users in your Organization, or any MotherDuck user in the same cloud region who has access to the share link.
3. Choose whether the share should be automatically updated or not. The default is `MANUAL`.

### SQL

The following example creates a share from database "birds":

- Share is also named "birds".
- This share can only be accessed by accounts authenticated in your [Organization](../managing-organizations/managing-organizations.mdx).
- This share is discoverable. Users in your Organization can find this share.

```sql
USE birds;
CREATE SHARE; -- Shorthand syntax. Share name is optional. By default, shares are Organization-scoped and Discoverable.
CREATE SHARE IF NOT EXISTS birds FROM birds (ACCESS ORGANIZATION, VISIBILITY DISCOVERABLE, UPDATE MANUAL); -- This query is identical to the previous one but with explicit options.
```

Learn more about the [CREATE SHARE](/sql-reference/motherduck-sql-reference/create-share.md) SQL command.

### Organization shares

When creating a share, you may choose the scope of access to this share:

- **Organization**. Only users authenticated in your Organization will have access to this share.
- **Restricted**. Only the share owner and users specified with `GRANT` commands can access the share.
- **Unrestricted**.
Any user signed into any MotherDuck organization in the same cloud region can access this share using the share URL.

### Discoverable shares

When creating a share, you may choose to make this share **Discoverable**. All authenticated users in your Organization can find this share in the UI. You can create **Discoverable** shares that are **Unrestricted**, but only members of your Organization can find this share in the UI. Non-members can still access this share using the share URL.

### Share URLs

When you create a share, a URL for this share is generated:

- If the share is **Discoverable**, members of your Organization can find this share without the share URL. Alternatively, they can use the URL directly.
- If the share is **Hidden** (i.e. not Discoverable), other users will not be able to find the share URL. You will need to send this URL directly to the users with whom you want to share this data.

## Consuming shared data

The **data consumer** needs to attach the share to their workspace, thereby creating a read-only zero-copy clone of the source database. This is a free, metadata-only operation.

When you attach a share, it gets an alias that defaults to the source database name. If you already have a database with that name, the attach fails. Use `AS` to pick a different alias, or [detach](/key-tasks/database-operations/detach-and-reattach-motherduck-database/) the conflicting database first. See [share alias conflicts](/sql-reference/motherduck-sql-reference/attach/#share-alias-conflicts) for details.

### Views and fully-qualified table references

If the shared database contains views, those views may reference tables using fully-qualified paths that include the original database name. For example, a view in a database called `org_dwh` might reference `org_dwh.main.sales`. When you attach the share, make sure the database alias matches the original database name.
Otherwise, the views fail because they can't resolve the original database name in your namespace.

```sql
-- The share was created from a database called "org_dwh".
-- Views inside reference the tables as "org_dwh.main.".
-- This will cause view errors because the alias doesn't match:
ATTACH 'md:_share/org_dwh/id_abc123' AS dwh;

-- Use the original database name as the alias:
ATTACH 'md:_share/org_dwh/id_abc123' AS org_dwh;
```

This applies to any object in the shared database that uses fully-qualified references, including views, macros, and stored procedures.

### Consuming discoverable shares

If the **data provider** created a Discoverable share you have access to, you should be able to find this share in the UI.

### UI

1. Select the share you want under "Shared with me".
2. Optionally roll over the share to see the tooltip that tells you the share owner, when it was last updated, and share access scope.
3. Click "attach".
4. You can query the resulting database.

### Consuming hidden shares

If the **data provider** created a Hidden (i.e. non-Discoverable) share, they need to pass the share URL to the **data consumer**. The **data consumer**, in turn, needs to attach the share URL.

```sql
ATTACH 'md:_share/ducks/0a9a026ec5a55946a9de39851087ed81' AS birds; -- attaches the share as database `birds`
```

## Updating shared data

If the **data provider** chose automatic updates when creating the share, the share is updated periodically. If the share was created with `MANUAL` updates, the **data provider** needs to manually update the share.

```sql
UPDATE SHARE birds;
```

Learn more about [UPDATE SHARE](/sql-reference/motherduck-sql-reference/update-share.md) and [data replication timing and checkpoints](./updating-shares.md).

## Consuming updated data

By default, shares automatically update every minute.
However, if you need the most up-to-date data sooner, the consumer can manually refresh the share after the producer executes `UPDATE SHARE`. To manually refresh the data:

```sql
REFRESH DATABASES; -- Refreshes all connected databases and shares
REFRESH DATABASE my_share; -- Alternatively, refresh a specific database/share
```

Learn more about [REFRESH DATABASES](/sql-reference/motherduck-sql-reference/refresh-database.md).

---
Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/mcp-workflows

# Using the MotherDuck MCP Server

> Effective workflows and best practices for getting the most out of the MotherDuck MCP Server with AI assistants

The MotherDuck **remote** MCP Server, available at `https://api.motherduck.com/mcp`, connects AI assistants like Claude, ChatGPT, and Cursor to your data. This guide covers workflows for getting accurate, useful analysis results. If you haven't already, [set up your remote MCP connection](/key-tasks/ai-and-motherduck/mcp-setup/).

:::info Remote vs local MCP
This guide is written for the **remote MCP** (fully managed by MotherDuck). Most of the tips apply to the **local MCP** (fully customizable, self-hosted) as well. For local MCP setup and details, see the [MCP reference](/sql-reference/mcp/#local-mcp-server).
:::

## Prerequisites

To use the MotherDuck remote MCP server, you will need:

- A MotherDuck account with at least one database
- An AI client like Claude, Cursor, or ChatGPT already connected to the remote MCP server ([setup instructions](/key-tasks/ai-and-motherduck/mcp-setup/))

:::note Read vs write tools
The remote MCP server exposes two query tools: `query` for read-only SQL and `query_rw` for SQL that can change data or schema. See the [query](/sql-reference/mcp/query/) and [query_rw](/sql-reference/mcp/query-rw/) references for details. To enforce read-only access, see [Restricting to read-only access](/key-tasks/ai-and-motherduck/securing-read-only-access/).
:::

## How it works

When you ask an AI assistant a question about your data, here's what happens behind the scenes:

1. **Schema exploration**: The AI examines your database structure to understand available tables and columns
2. **Query generation**: Based on your question, the AI writes DuckDB SQL
3. **Query execution**: The remote MCP Server runs the query on MotherDuck
4. **Results interpretation**: The AI explains the results in natural language

You can inspect which SQL query the MCP executed by expanding the tool call in the conversation:

![Inspecting the query executed by MCP](./img/mcp_inspect_query.png)

When you create a Dive:

1. **Data analysis**: The AI agent queries your database to understand the data relevant to your request
2. **Visualization generation**: The agent generates an interactive React component with the SQL queries and chart configuration
3. **Inline preview**: The Dive renders in the conversation so you can iterate before saving. In clients that support the Dive Viewer MCP App (Claude web and desktop at launch), the preview runs against live data with the same components used in the MotherDuck UI. In other clients, you see a static preview with sample data, and the Dive queries live data once you open it in MotherDuck.
4. **Save to MotherDuck**: Each save is stored in your workspace and always queries live data, so there are no stale snapshots. You can find the Dive in the [MotherDuck UI](/key-tasks/ai-and-motherduck/dives/#finding-your-dives) under the Object Explorer or **Settings** → **Dives**. With the Dive Viewer, every edit creates a separate version automatically.
5. **Share with your team**: The agent can [share the underlying data](/sql-reference/mcp/share-dive-data) with your organization so others can view and interact with the Dive

## Start with schema exploration

Before diving into analysis, help the AI understand your data.
This is a form of **context engineering**: by exploring your schema upfront, you hydrate the conversation with knowledge about your tables, columns, and relationships. This context carries forward, helping the AI write more accurate queries throughout your session.

Start conversations by asking about your database structure:

**Good first prompts:**

- *"What databases and tables do I have access to?"*
- *"Describe the schema of my `analytics` database"*
- *"What columns are in the `orders` table and what do they contain?"*

The remote MCP server provides tools for schema exploration that surface table relationships, data types, and any documentation you've added to your schema.

:::tip
If you have well-documented tables with [`COMMENT ON`](https://duckdb.org/docs/stable/sql/statements/comment_on.html) descriptions, the AI can use these to better understand your data's business meaning.
:::

## Frame questions with context

The more context you provide, the better the results. Include relevant details like:

- **Time ranges**: *"Show me orders from the last 30 days"* vs *"Show me orders"*
- **Filters**: *"Analyze customers in the US with more than 5 purchases"*
- **Metrics**: *"Calculate revenue as `quantity * unit_price`"*
- **Output format**: *"Return results as a summary table with percentages"*

**Example - Vague vs. Specific:**

| ❌ Vague | ✅ Specific |
|----------|-------------|
| "Show me sales data" | "Show me total sales by product category for Q4 2024, sorted by revenue descending" |
| "Find top customers" | "Find the top 10 customers by total order value in the last 12 months" |
| "Analyze trends" | "Compare monthly active users month-over-month for 2024, showing growth rate" |

## Iterate

Complex analysis works best as a conversation. Start simple, validate the results, then build up. Each exchange adds shared context, helping the AI write better queries as you go.
While there is a temptation to get the perfect query in one shot, insight often comes as part of the process of data exploration.

When iterating, it can be helpful to have source data nearby to help verify outputs. Our users have noted that using their existing BI dashboard to quickly validate that metrics are correct helps to develop intuition about the information provided by the AI assistants.

## Common workflow patterns

### Data profiling

Quickly understand a new dataset:

```text
"Profile the `transactions` table - show me:
- Row count and date range
- Distribution of key categorical columns
- Summary statistics for numeric columns
- Any null values or data quality issues"
```

:::tip DuckDB functions for EDA
DuckDB has a few SQL functions that are great for hydrating context:

- `DESCRIBE`, which retrieves the metadata for a specific table
- `SUMMARIZE`, which gets summary stats for a table (can be large)
- The `USING SAMPLE 10` clause (at the end of the query), which samples the data (can be large). Using it with a `WHERE` clause to narrow down is very helpful for performance
:::

### Generating charts

Some AI clients can generate visualizations directly from your query results. ChatGPT on the web and Claude Desktop both support creating charts as "artifacts" alongside your conversation.

Visualizations help you spot trends and outliers faster than scanning tables, validate that query results make sense at a glance, and share insights with stakeholders who prefer visual formats.

**Example prompts:**

- *"Chart monthly revenue for 2024 as a line graph"*
- *"Create a bar chart showing the top 10 customers by order count"*
- *"Visualize the distribution of order values as a histogram"*
- *"Show me a time series of daily active users with a 7-day moving average"*

Once you have a chart, you can iterate on it just like query results: *"Add a trend line"*, *"Change to a stacked bar chart"*, or *"Break this down by region"*.
:::note
When using the MCP with more IDE-like interfaces, the MCP plays very nicely with libraries like `matplotlib` for building more traditional charts.
:::

### Querying private S3 buckets

You can use the MCP to analyze files in private S3 buckets (Parquet, CSV, JSON) by storing your AWS credentials as a [secret in MotherDuck](/sql-reference/motherduck-sql-reference/create-secret/).

### MotherDuck UI

You can create secrets directly in the [MotherDuck UI](https://app.motherduck.com) under **Settings → Secrets**.

![The MotherDuck secrets UI](./img/md_create_secret_ui.png)

### AWS SSO with credential chain

This is recommended for desktop AI clients. If you use AWS SSO, you can refresh your credentials and store them in MotherDuck:

1. Create an AWS credential profile

   ```bash
   aws configure sso
   ```

2. Authenticate with AWS SSO:

   ```bash
   aws sso login --profile
   ```

3. Open a DuckDB client (for example, the CLI) and create a secret using the credential chain:

   ```sql
   ATTACH 'md:';

   CREATE OR REPLACE SECRET IN MOTHERDUCK (
       TYPE s3,
       PROVIDER credential_chain,
       CHAIN 'sso',
       PROFILE ''
   );
   ```

This stores your AWS credentials in MotherDuck, making them available to the remote MCP server.

:::note
Run `aws sso login --profile ` before creating the secret to refresh your SSO token. Starting with DuckDB v1.4.0, credentials are validated at creation time. If your local credentials are not resolvable, the command will fail: use the correct `CHAIN` and `PROFILE` for your credential type, or add `VALIDATION 'none'` as a last resort to skip local validation.
:::

:::note Credential expiration
If you use temporary credentials (SSO, IAM roles), you'll need to refresh the secret when they expire by running the `CREATE OR REPLACE SECRET` command again.
:::

Once your credentials are set up, you can ask your AI assistant to query any S3 bucket you have access to:

```text
"Give me some analytics about s3://my-bucket/sales-data.parquet"
```

![Exploring S3 data with MCP](./img/mcp_explore_s3.png)

### Use DuckDB and MotherDuck from Claude's remote sandbox

Claude on the web can run Python and shell commands in a remote code execution sandbox. This is separate from Claude Code or Claude Desktop running on your machine.

Use the remote MCP server for schema discovery, query generation, and server-side analysis. Use DuckDB directly when the task needs a running client process for local code execution or file handling. In Claude web, that DuckDB client can run inside Claude's remote sandbox.

For example, if a teammate uploads a CSV or Parquet file to Claude and wants to enrich it with data from MotherDuck, Claude can use DuckDB in the sandbox to read the uploaded file, query MotherDuck, join the data, and write a downloadable result file. That avoids sending a large file or result set through MCP tool responses, which are designed for conversation context rather than bulk file transfer.

To let Claude install DuckDB, load the MotherDuck extension, and query MotherDuck from the sandbox, organization owners can configure **Settings** → **Capabilities** → **Code execution and file creation** → **Allow network egress**. The **All domains** option gives the sandbox enough network access for this workflow, subject to your organization's policy. See Anthropic's [code execution and file creation documentation](https://support.claude.com/en/articles/12111783-create-and-edit-files-wit) for the security tradeoffs.

The same requirement applies in other sandboxed agent environments: the DuckDB Python package or CLI runs as a client process, and the sandbox must allow that process to reach the package host, DuckDB extension download host, and MotherDuck service.
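As a rough sketch of what that client process might do in Python, the helper below builds a single `COPY` statement that joins an uploaded Parquet file to a MotherDuck table and writes the result to a new file. The function names, the file names, the table `my_db.main.customers`, and the join key `customer_id` are all hypothetical. The commented lines assume the `duckdb` package is installed and that a `MOTHERDUCK_TOKEN` environment variable is set in the sandbox.

```python
def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_enrich_sql(upload_path: str, md_table: str, key: str, out_path: str) -> str:
    """Build a COPY statement joining an uploaded Parquet file to a MotherDuck table.

    md_table and key are interpolated as identifiers, so they should come
    from trusted input rather than from the uploaded file.
    """
    return (
        "COPY ("
        "SELECT * "
        f"FROM read_parquet({quote_literal(upload_path)}) AS u "
        f"JOIN {md_table} AS m USING ({key})"
        f") TO {quote_literal(out_path)} (FORMAT parquet)"
    )


# Hypothetical file, table, and column names, for illustration only.
sql = build_enrich_sql(
    "upload.parquet", "my_db.main.customers", "customer_id", "enriched.parquet"
)

# Assumed usage inside the sandbox:
# import os
# import duckdb
# conn = duckdb.connect("md:", config={"motherduck_token": os.environ["MOTHERDUCK_TOKEN"]})
# conn.execute(sql)
```

Writing the joined rows straight to a file keeps the result set out of the conversation, which is the point of running a client process instead of returning rows through MCP tool responses.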
Add your MotherDuck token as an environment variable in `.env` format:

```text
MOTHERDUCK_TOKEN=
```

Use a scoped token that matches the task. A [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) is enough when Claude only needs to read from MotherDuck and write output files in its sandbox. Only add tokens to cloud environments whose users should have that access.

![Updating a Claude cloud environment with full network access and MotherDuck token environment variables](./img/claude-cloud-environment-env-vars.png)

Changes to a cloud environment apply to new sessions. Before you start the workflow, select the cloud environment that has network access and `MOTHERDUCK_TOKEN` configured.

![Selecting a Claude cloud environment before starting a session](./img/claude-select-cloud-environment.png)

Start with a small connection test:

```text
Install the duckdb Python package and use it to run SELECT 42 from my MotherDuck account. Use the MotherDuck token I provide, and don't print the token.
```

Example CSV or Parquet workflow prompt:

```text
Use Python with DuckDB for this file workflow. Connect to MotherDuck with the token I provide, read the uploaded CSV or Parquet file, join it to the relevant MotherDuck table, and write the enriched result as a downloadable CSV or Parquet file.
```

If direct DuckDB access isn't available, keep the heavy work in MotherDuck:

```text
Use the MotherDuck MCP to create a table with the result instead of returning all rows in the chat. Tell me the table name and the SQL you used so I can export it from MotherDuck.
```

This fallback works when Claude's sandbox can't reach the MotherDuck extension download host or can't make outbound requests to MotherDuck. It also keeps large intermediate results out of the model's context window.

### Ad-hoc investigation

The MCP is especially useful for exploratory debugging when you're not sure what you're looking for.
Rather than writing queries upfront, you can describe the problem and let the AI help you dig in.

```text
"I noticed a spike in errors on Dec 10th. Help me investigate:
- What types of errors increased?
- Were specific users or endpoints affected?
- What changed compared to the previous week?"
```

One pattern we use at MotherDuck is loading logs or event data into a database and using the MCP to interrogate it conversationally. Instead of manually crafting regex patterns or grep commands, you can ask questions like *"What are the most common error messages in the last hour?"* or *"Show me all requests from user X that resulted in a 500 error"*. This turns log analysis from a tedious grep session into an interactive investigation where each answer informs the next question.

## Working with query results

### Refining results

Results rarely come out perfect on the first try. The conversational nature of MCP means you can refine incrementally rather than rewriting queries from scratch. If you're seeing test data mixed in, just say *"Add a filter to exclude test accounts"*. If the granularity is wrong, ask to *"Change the grouping from daily to weekly"*. Small adjustments like changing sort order or adding a column are easy follow-ups.

### Understanding queries

When the AI generates complex SQL, don't hesitate to ask for an explanation. This is useful both for validating the approach and for learning. Ask *"Explain what this query is doing step by step"* to understand the logic, or *"Are there any edge cases this query might miss?"* to sanity-check the results before relying on them.

### Exporting for further use

Once you have the results you need, ask for output in the format that fits your workflow. Small result sets can be returned as a markdown table, spreadsheet-friendly CSV, or written summary. For larger exports, don't ask the MCP to stream all rows into the chat.
Ask the AI to keep the result in MotherDuck with `CREATE TABLE AS SELECT ...` and give you the table name, or run a DuckDB client somewhere that can access both MotherDuck and the file destination. That client can be on your machine, in Claude Code, or in Claude's remote sandbox when its network rules allow the required hosts. Asking for the final SQL is also useful when you want to hand the analysis to another teammate or tool.

## Tips for better results

### Be explicit about assumptions

Your data likely has business rules that aren't obvious from the schema alone. If a "completed" order means status is either 'shipped' or 'delivered', say so. If revenue calculations should exclude refunds, mention it upfront. The AI can't infer these domain-specific rules, so stating them early prevents incorrect results and saves iteration time.

### Reference specific tables and columns

When you already know your schema, being specific helps the AI get it right the first time. Instead of asking about "the timestamp", say *"Use the `user_events.event_timestamp` column"*. If you know how tables relate, specify the join: *"Join `orders` to `customers` on `customer_id`"*. This is especially helpful in larger schemas where column names might be ambiguous.

### Ask for validation

When accuracy matters, ask the AI to sanity-check its own work. Questions like *"Does this total match what you'd expect based on the row counts?"* or *"Can you verify this join doesn't create duplicates?"* can catch subtle bugs before you rely on the results. The AI can run quick validation queries to confirm the logic is sound.

## Troubleshooting

:::tip Beyond querying
The remote MCP server includes tools beyond just running queries. Most are metadata lookups or search functions for finding tables and columns, but the [ask docs question](/sql-reference/mcp/ask-docs-question) tool is particularly useful when you're stuck on tricky syntax or DuckDB-specific features.
If the AI is struggling with a query pattern, try asking it to look up the relevant documentation first.
:::

| Issue | Solution |
|-------|----------|
| AI queries wrong table | Ask: *"What tables are available?"* then specify the correct one |
| Results don't look right | Ask: *"Show me sample data from the source table"* to verify the data |
| Query is slow | Ask: *"Can you optimize this query?"*, add filters to reduce data scanned, or [increase your Duckling size](/about-motherduck/billing/duckling-sizes/) |
| AI doesn't understand the question | Rephrase with more specific column names and business context |
| Can't type fast enough | Use voice-to-text to interact with your AI assistant |

## Related resources

- [Connect to MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) - Setup instructions for all supported AI clients
- [AI Features in the UI](/key-tasks/ai-and-motherduck/ai-features-in-ui/) - Built-in AI features for the MotherDuck interface
- [Building Analytics Agents](/key-tasks/ai-and-motherduck/building-analytics-agents/) - Build custom AI agents with MotherDuck

---
Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/python

# Connect from Python via Postgres endpoint

> Connect to MotherDuck from Python using psycopg2 or psycopg3 via the Postgres wire protocol

You can query MotherDuck from Python using standard PostgreSQL client libraries. No DuckDB installation is required. This guide covers [psycopg2](https://www.psycopg.org/docs/) and [psycopg (v3)](https://www.psycopg.org/psycopg3/docs/).

For connection parameters, SSL options, and limitations, see the [Postgres Endpoint reference](/sql-reference/postgres-endpoint).

## Prerequisites

You need a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck).
Set it as an environment variable:

```bash
export MOTHERDUCK_TOKEN="your_token_here"
```

## Connect

### psycopg (v3)

```python
# /// script
# dependencies = ["psycopg"]
# ///
import os

import psycopg

with psycopg.connect(
    host="pg.us-east-1-aws.motherduck.com",  # or us-west-2-aws or eu-central-1-aws
    port=5432,
    dbname="md:",
    user="postgres",
    password=os.environ["MOTHERDUCK_TOKEN"],
    sslmode="verify-full",
    sslrootcert="system",  # available in libpq 16+
) as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT title, score
            FROM sample_data.hn.hacker_news
            WHERE type = 'story'
            ORDER BY score DESC
            LIMIT 5
            """
        )
        for row in cur:
            print(row)
```

You can also use a connection URI:

```python
import os

import psycopg

token = os.environ["MOTHERDUCK_TOKEN"]

with psycopg.connect(
    f"postgresql://postgres:{token}@pg.us-east-1-aws.motherduck.com:5432/md:?sslmode=verify-full&sslrootcert=system"
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT current_database()")
        print(cur.fetchone())
```

### psycopg2

```python
# /// script
# dependencies = ["psycopg2-binary", "certifi"]
# ///
import os

import certifi
import psycopg2

conn = psycopg2.connect(
    host="pg.us-east-1-aws.motherduck.com",  # or us-west-2-aws or eu-central-1-aws
    port=5432,
    dbname="md:",
    user="postgres",
    password=os.environ["MOTHERDUCK_TOKEN"],
    sslmode="verify-full",
    sslrootcert=certifi.where(),
)

with conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT title, score
            FROM sample_data.hn.hacker_news
            WHERE type = 'story'
            ORDER BY score DESC
            LIMIT 5
            """
        )
        for row in cur.fetchall():
            print(row)
```

Use `md:` as the database name to connect to your default database, or replace it with a specific database name such as `sample_data`.

## Loading data from Python

For loading through the Postgres endpoint, the recommended pattern is server-side reads from remote storage:

- Use `psycopg` or SQLAlchemy to execute `CREATE TABLE AS SELECT` or `INSERT INTO ... SELECT`.
- Point `read_parquet`, `read_csv`, or `read_json` at S3, GCS, R2, Azure, or HTTPS.
- Set `MD_RUN = REMOTE` on those file reads.

Example with SQLAlchemy:

```python
import os

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg://postgres:@pg.us-east-1-aws.motherduck.com:5432/md:",
    connect_args={
        "password": os.environ["MOTHERDUCK_TOKEN"],
        "sslmode": "require",
    },
)

with engine.begin() as conn:
    conn.execute(
        text(
            """
            CREATE OR REPLACE TABLE my_db.main.weather_events AS
            SELECT *
            FROM read_csv(
                'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv',
                HEADER = true,
                AUTO_DETECT = true,
                MD_RUN = REMOTE
            )
            """
        )
    )
```

The following patterns are not supported from Python over the Postgres endpoint:

- `COPY ... FROM '/local/file.csv'`
- `cursor.copy(...)` / `COPY FROM STDIN`
- `psql \copy`
- `MD_RUN = LOCAL`
- SQLAlchemy's default `executemany` path for bulk ingest

If the rows exist only in application memory and the volume is modest, prefer explicit multi-values `INSERT` statements. For large local bulk loads, switch to a DuckDB client path instead. See [Loading data through the Postgres endpoint](/key-tasks/loading-data-into-motherduck/loading-data-via-postgres-endpoint) for the full decision guide.

## SSL notes

- **psycopg (v3)** wraps libpq and supports `sslrootcert=system` directly.
- **psycopg2** bundles its own statically linked OpenSSL, so `sslrootcert=system` is not supported. Use the `certifi` package to point to CA certificates, or download the [ISRG Root X1](https://letsencrypt.org/certs/isrgrootx1.pem) certificate and set `sslrootcert` to its path.

For more details on SSL options, see [SSL and certificate verification](/sql-reference/postgres-endpoint#ssl-and-certificate-verification).
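For the in-memory case described under "Loading data from Python", a multi-values `INSERT` can be assembled with one placeholder group per row. This is a minimal sketch, not an official pattern: `build_multi_insert` and the table `my_db.main.items` are hypothetical names, and the commented lines assume an open psycopg (v3) cursor like the ones in the connection examples. Whether the endpoint accepts bound parameters for a given statement is worth confirming against the Postgres Endpoint reference.

```python
def build_multi_insert(table: str, columns: list[str], rows: list[tuple]):
    """Return (sql, params) for one INSERT statement carrying every row.

    Values travel as bound parameters; table and column names are
    interpolated directly and must come from trusted code.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(%s, %s, ...)" group per row, sized to the column list.
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([group] * len(rows))
    )
    # Flatten the row tuples into one parameter list, in row order.
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table and rows, for illustration only.
sql, params = build_multi_insert(
    "my_db.main.items",
    ["item", "value", "count"],
    [("jeans", 20.0, 1), ("hammer", 42.2, 2)],
)

# With an open psycopg cursor `cur`:
# cur.execute(sql, params)
```

Sending all rows in one statement avoids the per-row round trips of an `executemany`-style loop; for anything beyond modest volumes, the DuckDB client path is still the better fit.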
---
Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck

# Connecting to MotherDuck

> Create one or more connections to a MotherDuck database

There are two ways to connect to MotherDuck:

| Method | Client needed | Best for |
|--------|--------------|----------|
| **DuckDB SDK** | DuckDB client library | Python, Node.js, Java, CLI: full feature set, hybrid execution, local caching |
| **[Postgres Endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint)** | Any PostgreSQL client | Thin clients, serverless environments, BI tools, languages without a DuckDB SDK |

This page covers connecting with the **DuckDB SDK**. For the Postgres endpoint, see [Postgres Endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint).

## Connecting with the DuckDB SDK

A single DuckDB connection executes one query at a time and aims to maximize the performance of that query, which makes reusing a single connection both simple and performant.

We recommend starting with the simplest way of connecting to MotherDuck and running queries. If that does not meet your requirements, explore the advanced use cases described in subsequent sections.

## Create a connection

![Image](useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/one-connection.png'))

The code snippets below show how to create a connection to a MotherDuck database from the CLI, Python, JDBC, and Node.js language APIs.

:::info
For security reasons, it's generally recommended to use environment variables to store your MotherDuck token rather than hardcoding it in your application.
:::

:::tip
The `INSERT INTO` statements below are for illustration only. For loading real data, do not insert rows one at a time; use bulk methods like `INSERT INTO ... SELECT` from files, `COPY`, or DataFrame-based approaches.
See [Loading data into MotherDuck](/key-tasks/loading-data-into-motherduck/loading-data-into-motherduck.mdx) for recommended approaches.
:::

### Python

To connect to your MotherDuck database, use `duckdb.connect("md:my_database_name")`. This returns a `DuckDBPyConnection` object that you can use to interact with your database. There are two ways to provide your access token in Python to authenticate your user session.

### Within a config dictionary

```python
import os

import duckdb

# Read the token from an environment variable (or load it from a .env file first)
token = os.environ["MOTHERDUCK_TOKEN"]

# Create connection to your default database
conn = duckdb.connect("md:my_db", config={"motherduck_token": token})

# Run query
conn.sql("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)")
conn.sql("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")
res = conn.sql("SELECT * FROM items")

# Close the connection
conn.close()
```

### Included in the connection string

```python
import os

import duckdb

# Read the token from an environment variable (or load it from a .env file first)
token = os.environ["MOTHERDUCK_TOKEN"]

# Create connection to your default database
conn = duckdb.connect(f"md:my_db?motherduck_token={token}")

# Run query
conn.sql("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)")
conn.sql("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")
res = conn.sql("SELECT * FROM items")

# Close the connection
conn.close()
```

### JDBC

To connect to your MotherDuck database, you can create a `Connection` by using the `"jdbc:duckdb:md:databaseName"` connection string format. For authentication, you need to provide a MotherDuck token.
There are two ways to provide the token:

### As a connection property

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.sql.ResultSet;
import java.util.Properties;

// Create properties with your MotherDuck token
Properties props = new Properties();
props.setProperty("motherduck_token", "");

// Create connection to your database
try (Connection conn = DriverManager.getConnection("jdbc:duckdb:md:my_db", props);
     Statement stmt = conn.createStatement()) {
    stmt.executeUpdate("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)");
    stmt.executeUpdate("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)");
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM items")) {
        while (rs.next()) {
            // Column 2 is `value` (the price); column 3 is `count`
            System.out.println("Item: " + rs.getString(1) + " costs " + rs.getBigDecimal(2));
        }
    }
}
```

### As part of the connection string

```java
// Create connection with token in the connection string
try (Connection conn = DriverManager.getConnection("jdbc:duckdb:md:my_db?motherduck_token=");
     Statement stmt = conn.createStatement()) {
    stmt.executeUpdate("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)");
    stmt.executeUpdate("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)");
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM items")) {
        while (rs.next()) {
            // Column 2 is `value` (the price); column 3 is `count`
            System.out.println("Item: " + rs.getString(1) + " costs " + rs.getBigDecimal(2));
        }
    }
}
```

:::info
If an environment variable named `motherduck_token` is set, it is used automatically.
:::

### Node.js

To connect to your MotherDuck database, you can create a `DuckDBInstance` with the `'md:databaseName'` connection string format. For authentication, you need to provide a MotherDuck token.
There are two ways to provide the token: ### Within a config dictionary ```javascript import { DuckDBInstance } from '@duckdb/node-api'; // Create connection to your default database const instance = await DuckDBInstance.create('md:my_db', { motherduck_token: '', }); const conn = await instance.connect(); // Run queries await conn.run('CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)'); await conn.run("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)"); const result = await conn.runAndReadAll('SELECT * FROM items'); console.table(result.getRowObjects()); ``` ### Included in the connection string ```javascript import { DuckDBInstance } from '@duckdb/node-api'; // Create connection to your default database const instance = await DuckDBInstance.create('md:my_db?motherduck_token='); const conn = await instance.connect(); // Run queries await conn.run('CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)'); await conn.run("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)"); const result = await conn.runAndReadAll('SELECT * FROM items'); console.table(result.getRowObjects()); ``` :::info If an environment variable named `motherduck_token` is set, it's used automatically. ::: ### CLI To connect to your MotherDuck database, run `duckdb md:`. ```shell duckdb "md:my_db" ``` Now, you will enter the DuckDB interactive terminal to interact with your database. ```sql D CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER); D INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2); D SELECT * FROM items; ``` ## Session names The `session_name` connection string parameter lets you give your session a name. You can set it in the connection string (`md:my_db?session_name=my_label`) or as a DuckDB setting before connecting to MotherDuck (`SET motherduck_session_name='my_label'`). :::note The older `session_hint` parameter still works as an alias for `session_name`. 
::: ### Read scaling with session names If you are planning on multiple end users connecting with a [Read Scaling Token](/documentation/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/read-scaling.mdx), ensure each user can get a dedicated backend (up to the maximum configured pool size) by passing a `session_name` in the connection string. Session names ensure that all the queries from the same end user are routed to the same backend duckling, even if they originate from different services/servers. This allows for optimal caching and resource allocation for each specific user's needs. After establishing the connection, it can be used the same way as any DuckDB/MotherDuck connection -- to run queries, and then either be closed explicitly or go out of scope, as in the examples above. ### Annotating queries with session names The `session_name` value appears in the `SESSION_NAME` column of [query history](/sql-reference/motherduck-sql-reference/md_information_schema/query_history/), making it easy to identify and group queries. This works for both read scaling and read/write connections. ### Python ```python import duckdb # Create a connection and allocate a stable backend for user123. con = duckdb.connect( "md:my_db?session_name=user123", config = {'motherduck_token': ''} ) ``` ### JDBC ```java import java.sql.Connection; import java.sql.DriverManager; import java.sql.Statement; import java.sql.ResultSet; import java.util.Properties; // Create properties with your MotherDuck token Properties props = new Properties(); props.setProperty("motherduck_token", ""); // Create a connection and allocate a stable backend for user123. try (Connection conn = DriverManager.getConnection("jdbc:duckdb:md:my_db?session_name=user123", props)) { // ... } ``` ### Node.js ```javascript import { DuckDBInstance } from '@duckdb/node-api'; // Create a connection and allocate a stable backend for user123. 
const instance = await DuckDBInstance.create(
  'md:my_db?session_name=user123',
  { motherduck_token: '' }
);
// ...
```

## Multiple connections and the database instance cache

DuckDB clients in Python, Go, R, JDBC, and ODBC prevent redundant reinitialization by keeping instances of database-global context cached by the database path. This usually makes external connection pools unnecessary for the DuckDB client. If your application uses connection pooling libraries, they may not be aware of this behavior. In that case, consider using the [Postgres Endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint) as a drop-in replacement for the DuckDB client.

When connecting to MotherDuck, the instance is cached for an additional 15 minutes after the last connection is closed (see [Setting Custom Database Instance Cache TTL](#setting-custom-database-instance-cache-time-ttl) for how to override this value). For an application that creates and closes connections frequently, this can significantly speed up connection creation, because the same catalog data is reused across connections. Only the first of multiple connections to the same database takes the time to load the MotherDuck extension, verify its signature, and fetch the catalog metadata.

### Python

```python
import duckdb

con1 = duckdb.connect("md:my_db")  # MotherDuck catalog fetched
con2 = duckdb.connect("md:my_db")  # MotherDuck catalog reused
```

### Java

```java
// Create properties with your MotherDuck token
Properties props = new Properties();
props.setProperty("motherduck_token", "");

try (var con1 = DriverManager.getConnection("jdbc:duckdb:md:my_db", props); // MotherDuck catalog fetched
     var con2 = DriverManager.getConnection("jdbc:duckdb:md:my_db", props); // MotherDuck catalog reused
     ) {
    // ...
} ``` ### Node.js :::warning Node.js does not cache instances automatically Unlike some other clients, the Node.js client (`@duckdb/node-api`) does **not** cache database instances by default. Each call to `DuckDBInstance.create()` creates a new instance, which means the MotherDuck extension is reloaded and the catalog metadata is re-fetched every time. Depending on the size of your catalog this can cause significant connection delays. To avoid this, use `DuckDBInstance.fromCache()` or create a `DuckDBInstanceCache` as shown below. ::: In Node.js, you must explicitly opt in to instance caching by using `DuckDBInstance.fromCache()` instead of `DuckDBInstance.create()`. This uses a built-in default cache to ensure only one instance is created per database path, avoiding reloading the MotherDuck extension and re-fetching catalog metadata on subsequent connections. ```javascript import { DuckDBInstance } from '@duckdb/node-api'; // First call creates the instance and fetches the MotherDuck catalog const instance = await DuckDBInstance.fromCache('md:my_db', { motherduck_token: '', }); const connection1 = await instance.connect(); // Second call reuses the cached instance — no reinitialization needed const instance2 = await DuckDBInstance.fromCache('md:my_db'); const connection2 = await instance2.connect(); ``` For more control, you can create your own `DuckDBInstanceCache`: ```javascript import { DuckDBInstanceCache } from '@duckdb/node-api'; const cache = new DuckDBInstanceCache(); // Retrieves an existing instance or creates one if it doesn't exist const instance = await cache.getOrCreateInstance('md:my_db'); const connection = await instance.connect(); ``` ## Setting custom database instance cache time (TTL) By default, connections to MotherDuck established through the database instance caching supporting DuckDB APIs will reuse the same database instance for 15 minutes after the last connection is closed. 
In some cases, you may want to make that period longer (to avoid the redundant reinitialization) or shorter (to connect to the same database with a different configuration). The database TTL value can be set either at the initial connection time, or by using the `SET` command at any point. Any valid [DuckDB Instant part specifiers](https://duckdb.org/docs/stable/sql/functions/datepart.html#part-specifiers-usable-as-date-part-specifiers-and-in-intervals) can be used for the TTL value, for example '5s', '3m', or '1h'. :::note The examples below assume you have configured your MotherDuck token using one of the authentication methods described in the [Create a connection](#create-a-connection) section above. ::: ### Python ```python con = duckdb.connect("md:my_db?dbinstance_inactivity_ttl=1h") con.close() # different database connection string (without `?dbinstance_inactivity_ttl=1h`), no instance cached; TTL is 15 minutes (default) con2 = duckdb.connect("md:my_db") # allow the database instance to expire immediately con2.execute("SET motherduck_dbinstance_inactivity_ttl='0s'") # the database instance can only expire after the last connection is closed con2.close() # new database instance with a new TTL (the 15 minute default) con3 = duckdb.connect("md:my_db") con3.close() # the last TTL for this database was 15 minutes; the cached database instance will be reused con4 = duckdb.connect("md:my_db") ``` ### Java The TTL can be set either through the connection string or through Properties. However, be careful when using Properties as the database instance cache is keyed by the connection string. This means that if you change the TTL in Properties between connections, you'll get an error as it's trying to connect to the same database with different configurations. 
Here's an example that will fail: ```java Properties props = new Properties(); props.setProperty("motherduck_dbinstance_inactivity_ttl", "2m"); // First connection works fine try (var con = DriverManager.getConnection("jdbc:duckdb:md:my_db", props)) { // TTL is set to 2m } // Changing TTL in properties will fail props.setProperty("motherduck_dbinstance_inactivity_ttl", "5m"); try (var con = DriverManager.getConnection("jdbc:duckdb:md:my_db", props)) { // This will throw: "Can't open a connection to same database file // with a different configuration than existing connections" } ``` For this reason, it's generally safer to set the TTL through the connection string: ```java // Set TTL through connection string try (var con = DriverManager.getConnection("jdbc:duckdb:md:my_db?dbinstance_inactivity_ttl=1h")) { // TTL is set to 1h } // Different TTL creates a new instance try (var con = DriverManager.getConnection("jdbc:duckdb:md:my_db?dbinstance_inactivity_ttl=30m")) { // This works - creates a new instance with 30m TTL } // Can also set TTL using SQL try (var con = DriverManager.getConnection("jdbc:duckdb:md:my_db"); var st = con.createStatement()) { // allow the database instance to expire immediately st.executeUpdate("SET motherduck_dbinstance_inactivity_ttl='0s'"); } ``` :::note When using Properties, you must include the `motherduck_` prefix for the TTL property name (i.e., `motherduck_dbinstance_inactivity_ttl`). This prefix is only optional when passing the TTL through the connection string. 
:::

### Node.js

```javascript
import { DuckDBInstance } from '@duckdb/node-api';

// Set TTL to 1 hour through the connection string
const instance = await DuckDBInstance.fromCache('md:my_db?dbinstance_inactivity_ttl=1h');
const conn = await instance.connect();

// Or set the TTL using SQL after connecting
await conn.run("SET motherduck_dbinstance_inactivity_ttl='30m'");

// Allow the database instance to expire immediately after the connection closes
await conn.run("SET motherduck_dbinstance_inactivity_ttl='0s'");
```

## Connect to multiple databases

If you need to connect to MotherDuck and run one or more queries in succession on the same account, you can use a [single database connection](#create-a-connection). If you want to connect to another database in the same account, you can either [reuse the same connection](#example-1-reuse-the-same-duckdb-connection), or [create copies](#example-2-create-copies-of-the-initial-duckdb-connection) of the connection.

### Python

If you need to connect to multiple databases, you can either directly reuse the same `DuckDBPyConnection` instance, or create copies of the connection using the `.cursor()` method.

:::note
`FROM tbl` is a shorthand version of `SELECT * FROM tbl`.
:::

### Example 1: Reuse the same DuckDB connection

![Image](useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/one-connection.png'))

To connect to different databases in the same MotherDuck account, you can use the same connection object and fully qualify the names of the tables in your query.

```python
import duckdb

conn = duckdb.connect("md:my_db")

res1 = conn.sql("FROM my_db1.main.tbl")
res2 = conn.sql("FROM my_db2.main.tbl")
res3 = conn.sql("FROM my_db3.main.tbl")

conn.close()
```

### Example 2: Create copies of the initial DuckDB connection

![Image](useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/one-connection-threads.png'))

`conn.cursor()` returns a copy of the DuckDB connection, with a reference to the existing DuckDB database instance. Closing the original connection also closes all associated cursors.

```python
import duckdb

conn = duckdb.connect("md:my_db")

cur1 = conn.cursor()
cur2 = conn.cursor()
cur3 = conn.cursor()

cur1.sql("USE my_db1")
cur2.sql("USE my_db2")
cur3.sql("USE my_db3")

res = []
for cur in [cur1, cur2, cur3]:
    res.append(cur.sql("SELECT * FROM tbl"))

# This closes the original DuckDB connection and all cursors
conn.close()
```

:::note
`duckdb.connect(path)` creates and caches a DuckDB instance. Subsequent calls with the same path reuse this instance. New connections to the same instance are independent, similar to `conn.cursor()`, but closing one doesn't affect others. To create a new instance instead of using the cached one, make the path unique (e.g., `md:my_db?user=`).
:::

### Example 3: Create multiple connections

![Image](useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/multiple-connections.png'))

You can also create multiple connections to the same MotherDuck account using different DuckDB instances.
However, keep in mind that each connection takes time to establish, and if connection times are an important factor for your application, it might be beneficial to consider [Example 1](#example-1-reuse-the-same-duckdb-connection) or [Example 2](#example-2-create-copies-of-the-initial-duckdb-connection). ### JDBC If you need to connect to multiple databases, you typically won't need to create multiple DuckDB instances. You can either directly reuse the same `DuckDBConnection` instance, or create copies of the connection using the `.duplicate()` method. ```java // Create connection with your MotherDuck token Properties props = new Properties(); props.setProperty("motherduck_token", ""); try (DuckDBConnection duckdbConn = (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:md:my_db", props)) { Connection conn1 = duckdbConn.duplicate(); Connection conn2 = duckdbConn.duplicate(); Connection conn3 = duckdbConn.duplicate(); // ... } ``` ### Node.js If you need to connect to multiple databases, you can re-use the same `DuckDBInstance` and connection. Use `fromCache` to ensure the instance is reused efficiently. ```javascript import { DuckDBInstance } from '@duckdb/node-api'; const instance = await DuckDBInstance.fromCache('md:', { motherduck_token: '', }); const conn = await instance.connect(); const result1 = await conn.runAndReadAll('FROM my_db1.main.tbl'); const result2 = await conn.runAndReadAll('FROM my_db2.main.tbl'); ``` --- Source: https://motherduck.com/docs/key-tasks/data-warehousing/orchestration/dagster # Dagster > Orchestrate an incremental S3-to-MotherDuck data loading pipeline with Dagster and Python. Use Dagster when you want asset lineage, schedules, retries, and run history around a Python data loading job. This guide builds a minimum viable Dagster asset that reads Parquet data from S3, loads rows newer than the last successful run, upserts them into MotherDuck, and stores a watermark for the next run. 
The example uses a public S3 Parquet file from the MotherDuck sample data bucket. Replace the S3 path and column mapping with your own bucket layout when you move from the demo to your pipeline. ## How the pipeline works ```mermaid graph LR S3[("S3 Parquet file")]:::yellow A["Dagster asset
taxi_trips"]:::watermelon W[("ingestion_watermarks")]:::yellow T[("taxi_trips")]:::yellow W --> A S3 --> A A --> T A --> W ``` The asset keeps the state in MotherDuck: - `taxi_trips` is the target table. - `ingestion_watermarks` stores the latest `pickup_at` value loaded by this pipeline. - Each run reads only rows where `tpep_pickup_datetime` is greater than the stored watermark. - The target table has a primary key, so reprocessing the same row updates the existing row instead of creating a duplicate. ## Prerequisites Before you start, ensure you have: - Python 3.10 or later. - `uv` for Python project and dependency management. - A MotherDuck access token in `MOTHERDUCK_TOKEN`. - A MotherDuck database name for the pipeline. The example creates the database if it doesn't exist. - For private S3 buckets, a MotherDuck S3 secret. See [Amazon S3 credentials](/integrations/cloud-storage/amazon-s3/) for setup. :::tip Use a dedicated MotherDuck service account for scheduled ingestion jobs. This keeps ingestion compute, permissions, and cost attribution separate from analyst and application workloads. See [Hypertenancy](/concepts/hypertenancy/) for the compute isolation model. ::: ## Create the Dagster project Create a small Python project and add Dagster with DuckDB: ```bash > uv init dagster-motherduck-s3 > cd dagster-motherduck-s3 > uv add dagster dagster-webserver duckdb ``` Create `definitions.py`: ```python import os import re import dagster as dg import duckdb S3_URI = os.getenv( "S3_URI", "s3://us-prd-motherduck-open-datasets/nyc_taxi/parquet/yellow_cab_nyc_2022_11.parquet", ) MOTHERDUCK_DATABASE = os.getenv("MOTHERDUCK_DATABASE", "dagster_s3_demo") PIPELINE_NAME = "dagster_s3_taxi_trips" # Optional cap for running the demo quickly. Leave unset for a real pipeline. 
INGESTION_END_TS = os.getenv("MOTHERDUCK_INGESTION_END_TS") PUBLIC_DEMO_SCOPE = "s3://us-prd-motherduck-open-datasets/" def database_identifier(name: str) -> str: if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name): raise ValueError("Use a database name with letters, numbers, and underscores.") return name def open_motherduck_connection() -> duckdb.DuckDBPyConnection: database = database_identifier(MOTHERDUCK_DATABASE) con = duckdb.connect("md:") con.execute(f"CREATE DATABASE IF NOT EXISTS {database}") con.execute(f"USE {database}") if S3_URI.startswith(PUBLIC_DEMO_SCOPE): con.execute(""" CREATE OR REPLACE TEMPORARY SECRET public_motherduck_open_data ( TYPE S3, PROVIDER config, REGION 'us-east-1', SCOPE 's3://us-prd-motherduck-open-datasets/' ) """) return con @dg.asset def taxi_trips(context: dg.AssetExecutionContext) -> dg.MaterializeResult: con = open_motherduck_connection() try: con.execute(""" CREATE TABLE IF NOT EXISTS taxi_trips ( trip_id VARCHAR PRIMARY KEY, pickup_at TIMESTAMP, dropoff_at TIMESTAMP, passenger_count DOUBLE, trip_distance DOUBLE, total_amount DOUBLE, source_file VARCHAR, loaded_at TIMESTAMP DEFAULT now() ) """) con.execute(""" CREATE TABLE IF NOT EXISTS ingestion_watermarks ( pipeline_name VARCHAR PRIMARY KEY, last_pickup_at TIMESTAMP ) """) con.execute(""" INSERT INTO ingestion_watermarks VALUES (?, TIMESTAMP '1970-01-01') ON CONFLICT (pipeline_name) DO NOTHING """, [PIPELINE_NAME]) last_pickup_at = con.execute( "SELECT last_pickup_at FROM ingestion_watermarks WHERE pipeline_name = ?", [PIPELINE_NAME], ).fetchone()[0] con.execute(""" CREATE OR REPLACE TEMP TABLE new_taxi_trips AS SELECT md5(concat_ws('|', VendorID::VARCHAR, tpep_pickup_datetime::VARCHAR, tpep_dropoff_datetime::VARCHAR, PULocationID::VARCHAR, DOLocationID::VARCHAR, total_amount::VARCHAR )) AS trip_id, tpep_pickup_datetime AS pickup_at, tpep_dropoff_datetime AS dropoff_at, passenger_count, trip_distance, total_amount, filename AS source_file, now() AS loaded_at FROM 
read_parquet(?, filename = true) WHERE tpep_pickup_datetime > ? AND (? IS NULL OR tpep_pickup_datetime < ?::TIMESTAMP) """, [S3_URI, last_pickup_at, INGESTION_END_TS, INGESTION_END_TS]) rows_loaded = con.execute("SELECT count(*) FROM new_taxi_trips").fetchone()[0] con.execute(""" INSERT INTO taxi_trips BY NAME SELECT * FROM new_taxi_trips ON CONFLICT (trip_id) DO UPDATE SET pickup_at = excluded.pickup_at, dropoff_at = excluded.dropoff_at, passenger_count = excluded.passenger_count, trip_distance = excluded.trip_distance, total_amount = excluded.total_amount, source_file = excluded.source_file, loaded_at = excluded.loaded_at """) max_pickup_at = con.execute( "SELECT max(pickup_at) FROM new_taxi_trips" ).fetchone()[0] if max_pickup_at is not None: con.execute( "UPDATE ingestion_watermarks SET last_pickup_at = ? WHERE pipeline_name = ?", [max_pickup_at, PIPELINE_NAME], ) total_rows = con.execute("SELECT count(*) FROM taxi_trips").fetchone()[0] context.log.info("Loaded %s rows into taxi_trips", rows_loaded) return dg.MaterializeResult( metadata={ "rows_loaded": rows_loaded, "total_rows": total_rows, "last_pickup_at": str(max_pickup_at or last_pickup_at), } ) finally: con.close() daily_s3_ingestion = dg.ScheduleDefinition( name="daily_s3_taxi_trips", cron_schedule="0 2 * * *", target=[taxi_trips], ) defs = dg.Definitions( assets=[taxi_trips], schedules=[daily_s3_ingestion], ) if __name__ == "__main__": result = dg.materialize([taxi_trips]) if not result.success: raise RuntimeError("Dagster materialization failed.") ``` ## Run the ingestion Set the MotherDuck token and database name: ```bash > export MOTHERDUCK_TOKEN="" > export MOTHERDUCK_DATABASE="dagster_s3_demo" ``` For the public demo file, you can cap the first run to one day of taxi trips so the example finishes quickly: ```bash > export MOTHERDUCK_INGESTION_END_TS="2022-11-02" ``` Run the asset once from Python: ```bash > uv run python definitions.py ``` Run the same command again. 
The second run should load `0` rows because the first run advanced the watermark. Verify the loaded rows in MotherDuck: ```sql SELECT count(*) FROM taxi_trips; SELECT pipeline_name, last_pickup_at FROM ingestion_watermarks; ``` When you use your own S3 data, remove `MOTHERDUCK_INGESTION_END_TS` and replace: - `S3_URI` with your `s3:////*.parquet` path. - The `SELECT` list in `new_taxi_trips` with your source columns. - The watermark column with a stable source timestamp, such as `updated_at` or `created_at`. - The primary key expression with the source system's durable row key. ## Run it in Dagster Start the Dagster UI from the same directory: ```bash > uv run dagster dev -f definitions.py ``` Open `http://localhost:3000`, select the `taxi_trips` asset, and materialize it. Dagster records the asset materialization, metadata, logs, and schedule definition. To use the schedule in a long-running Dagster deployment, keep the `daily_s3_taxi_trips` schedule enabled and run a Dagster daemon. For local one-off testing, `uv run python definitions.py` is enough. ## Production considerations This example is intentionally small. Before using the pattern in production: - Use a dedicated service account token with only the permissions needed for ingestion. - Store private bucket credentials as a MotherDuck S3 secret instead of embedding AWS keys in code. - Keep S3 files in Parquet and avoid very small files. See [S3 import best practices](/key-tasks/cloud-storage/s3-import-best-practices/). - Use a source-provided primary key for upserts. Hashing source fields is useful for demos but less stable than a real key. - Use a source timestamp that only moves forward for watermarking. If your source sends late-arriving records, add a small overlap window and deduplicate by primary key. 
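The overlap-window advice in the last point can be sketched in plain Python. This is a minimal illustration, assuming rows arrive as dictionaries with a durable key and a source timestamp; the names `overlap_start` and `dedupe_by_key` and the 15-minute window are hypothetical choices, not part of the Dagster or MotherDuck APIs.

```python
from datetime import datetime, timedelta


def overlap_start(watermark, overlap=timedelta(minutes=15)):
    """Return the lower bound to read from: the watermark minus an overlap."""
    return watermark - overlap


def dedupe_by_key(rows, key="trip_id", ts="pickup_at"):
    """Keep one row per primary key, preferring the latest timestamp."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return list(latest.values())


watermark = datetime(2022, 11, 1, 12, 0)
rows = [
    {"trip_id": "a", "pickup_at": datetime(2022, 11, 1, 11, 50)},  # late arrival
    {"trip_id": "b", "pickup_at": datetime(2022, 11, 1, 12, 5)},
    {"trip_id": "b", "pickup_at": datetime(2022, 11, 1, 12, 5)},  # re-read duplicate
    {"trip_id": "c", "pickup_at": datetime(2022, 11, 1, 11, 0)},  # outside the window
]

# Re-read a little before the watermark so late rows are not missed
start = overlap_start(watermark)
candidates = [r for r in rows if r["pickup_at"] > start]

# Collapse rows that the overlap caused to be read more than once
unique_rows = dedupe_by_key(candidates)

# Never move the watermark backwards
new_watermark = max([watermark] + [r["pickup_at"] for r in unique_rows])
```

In the asset above, the same idea would mean filtering on the stored watermark minus the overlap and letting `ON CONFLICT (trip_id) DO UPDATE` absorb the rows that are read again.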
## Related content

- [Amazon S3 credentials](/integrations/cloud-storage/amazon-s3/)
- [S3 import best practices](/key-tasks/cloud-storage/s3-import-best-practices/)
- [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/)
- [Hypertenancy](/concepts/hypertenancy/)

---
Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-cloud-or-https

# From cloud storage or over HTTPS

> Load data into MotherDuck from S3, Azure, GCS, or public HTTPS URLs.

## From public cloud storage

MotherDuck supports several cloud storage providers, including [Amazon S3](/integrations/cloud-storage/amazon-s3.mdx), [Azure](/integrations/cloud-storage/azure-blob-storage.mdx), [Google Cloud](/integrations/cloud-storage/google-cloud-storage.mdx), and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2).

:::note
MotherDuck is available on AWS in three regions: **US East (N. Virginia)** - `us-east-1`, **US West (Oregon)** - `us-west-2`, and **Europe (Frankfurt)** - `eu-central-1`. For the best performance, we strongly encourage you to locate your data in the same region as your MotherDuck Organization.
:::

:::tip
If you want to inspect storage paths from SQL before loading data, see [`MD_LIST_FILES()`](/sql-reference/motherduck-sql-reference/md-list-files). It supports S3 and Azure paths. For S3 bucket discovery by secret, see [`MD_LIST_BUCKETS_FOR_SECRET()`](/sql-reference/motherduck-sql-reference/md-list-buckets-for-secret).
:::

The following example uses Amazon S3.

### UI

1. In the left panel of the UI, click **Add data**
2. Select **From cloud storage** ![Image](useBaseUrl('/img/key-tasks/loading-data-into-motherduck/from-cloud-storage.png'))
3. For a publicly accessible bucket, skip creating a secret ![Image](useBaseUrl('/img/key-tasks/loading-data-into-motherduck/skip-create-secret.png'))
4.
Enter the S3 bucket path (e.g., `s3://motherduck-demo`) and select the files you want, or use Wildcard mode to choose files with a matching pattern 5. Preview the files and select the table names and destination database 6. Click **Create tables** ![Image](useBaseUrl('/img/key-tasks/loading-data-into-motherduck/create-multiple-tables-browse.png')) ### SQL Connect to MotherDuck if you haven't already by doing the following: ```sql -- assuming the db my_db exists ATTACH 'md:my_db'; ``` ```sql -- CTAS a table from a publicly available demo dataset stored in s3 CREATE OR REPLACE TABLE pypi_small AS SELECT * FROM 's3://motherduck-demo/pypi.small.parquet'; -- JOIN the demo dataset against a larger table to find the most common duplicate urls -- Note you can directly refer to the url as a table! SELECT pypi_small.url, COUNT(*) FROM pypi_small JOIN 's3://motherduck-demo/pypi_downloads.parquet' AS s3_pypi ON pypi_small.url = s3_pypi.url GROUP BY pypi_small.url ORDER BY COUNT(*) DESC LIMIT 10; ``` ## From a secure cloud storage provider MotherDuck supports several cloud storage providers, including [Amazon S3](/integrations/cloud-storage/amazon-s3.mdx), [Azure](/integrations/cloud-storage/azure-blob-storage.mdx), [Google Cloud](/integrations/cloud-storage/google-cloud-storage.mdx), and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2). To access them securely, you first must [create a secret](/sql-reference/motherduck-sql-reference/create-secret/). :::info When you load data from cloud storage while connected to MotherDuck, the query runs on MotherDuck's cloud execution engine, not your local machine. MotherDuck connects to your storage provider directly and can use any matching secret, including temporary secrets from your local DuckDB session. For more details, see [CREATE SECRET](/sql-reference/motherduck-sql-reference/create-secret/#querying-with-secrets). 
:::

:::note
For SQL-based object discovery, [`MD_LIST_FILES()`](/sql-reference/motherduck-sql-reference/md-list-files) supports only `s3://`, `azure://`, and `az://` paths. It does not accept `gcs://`, `gs://`, or `r2://` paths.
:::

### UI

You can set cloud storage secrets directly from the UI under **Settings** → **Integrations** → **Secrets**, or with the **Add data** button in the left panel.

First, create a secret for your cloud storage credentials:

1. Go to **Settings** → **Integrations** → **Secrets** ![The MotherDuck UI for adding a new secret](./img/loading_data__secrets_overview.png)
2. Click **Add secret** and select your cloud storage provider (S3, R2, GCS, Azure) ![Image](useBaseUrl('/img/key-tasks/loading-data-into-motherduck/loading_data__secrets_add_new.png'))
3. Enter the access key and secret for your service account in your cloud storage provider.
4. For S3 credentials, you can test and verify your connection before saving.

Once your secret is configured, load data from your secure bucket:

1. In the left panel of the notebook UI, click **Add data**
2. Select **From cloud storage**
3. Enter the bucket path and select the files you want, or use Wildcard mode to choose files with a matching pattern
4. Preview the files and select the table names and destination database
5. Click **Create tables**

:::note
When loading data from [Azure](/integrations/cloud-storage/azure-blob-storage) or [Hugging Face](https://duckdb.org/docs/extensions/httpfs/hugging_face), you must use Wildcard mode to select files. Browse mode is not supported for these providers.
:::

### SQL

To create a secret in MotherDuck using the CLI or SQL notebooks, you need to explicitly add the `IN MOTHERDUCK` clause.
```sql CREATE SECRET IN MOTHERDUCK ( TYPE S3, KEY_ID 'access_key', SECRET 'secret_key', REGION 'us-east-1', SCOPE 'my-bucket-path' ); -- Now you can query from a secure S3 bucket CREATE OR REPLACE TABLE mytable AS SELECT * FROM 's3://...'; ``` ## Over HTTPS MotherDuck supports loading data over HTTPS, including CSV exports from public Google Sheets. ### SQL ```sql SELECT * FROM read_csv( 'https://docs.google.com/spreadsheets/d//export?format=csv&gid=', MD_RUN = REMOTE ); ``` For a full Google Sheets walkthrough, including private sheets with HTTP authentication, see the [Google Sheets integration](/integrations/file-formats/google-sheets/). ## Related content - [Troubleshooting AWS S3 Secrets](/docs/troubleshooting/aws-s3-secrets/) --- Source: https://motherduck.com/docs/key-tasks/service-accounts-guide/impersonate-service-accounts # Impersonate service accounts > Use UI impersonation to troubleshoot and inspect resources as a service account. Organization Admins can impersonate a service account in the MotherDuck UI. Impersonation is useful when you need to inspect resources, run one-off queries, or troubleshoot service account-specific behavior from that account's point of view. Impersonation is different from using a service account token. Tokens are for applications and automation. Impersonation is an interactive UI workflow for Admin users. :::warning[UI only] Service account impersonation is available only in the MotherDuck UI. DuckDB clients, the CLI, and the REST API don't support impersonation sessions. Use service account tokens for non-UI access. ::: ## Start an impersonation session ![Service account impersonation action](../img/sa_impersonate_option.png) 1. In the MotherDuck UI, go to **Settings** > **Service Accounts**. 2. Open the three-dot menu for the service account. 3. Click **Impersonate this account**. 4. The UI refreshes and signs you in as the service account. 
While impersonating, MotherDuck shows a banner with controls to refresh the session or return to your Admin account. ![Service account impersonation banner](../img/sa_impersonate_banner.png) Impersonation sessions expire after two hours. Refresh the browser tab to reset the expiry countdown. :::tip You can bookmark the URL while impersonating a service account. Opening the bookmark starts a new impersonation session for the same service account when you're signed in as an Admin user. ::: ## Use impersonation for troubleshooting Use impersonation when you need to: - Verify which databases, shares, secrets, and Dives the service account can access. - Run read-write actions as the service account from the MotherDuck UI. - Inspect query history and ongoing query activity for that service account. - Confirm that a service account-specific setup works before wiring it into an application. ## Use tokens for applications Applications and DuckDB clients should connect with a service account token instead of impersonation. Create a read/write token for workloads that need to write data or manage resources. Create a read scaling token for read-heavy workloads that should use [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/). ## Related content - [Create and configure service accounts](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/) - [Manage service accounts and tokens](/key-tasks/service-accounts-guide/manage-service-accounts-and-tokens/) - [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/) --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/agent-skills # Install MotherDuck Skills for coding agents > Install the MotherDuck Skills plugin catalog to teach Claude Code, Cursor, Codex, Copilot CLI, and Gemini CLI to work with MotherDuck. 
[MotherDuck Skills](https://github.com/motherduckdb/agent-skills/) is an opinionated, installable catalog of [agent skills](https://agentskills.io/home) that teaches coding agents how to work with MotherDuck. The skills cover picking the right connection path, writing DuckDB SQL (not Postgres-shaped SQL), inspecting a live workspace, and shipping production analytics patterns safely. Skills work with any DuckDB client — they teach the agent behavior, not how to connect. Pairing them with the [MotherDuck MCP server](/key-tasks/ai-and-motherduck/mcp-setup/) is recommended but not required: MCP gives the agent live access to your workspace so it can inspect real schemas while it applies the guidance from the skills. ## What agent skills are Agent skills are reusable instruction bundles for AI coding agents. They give the agent domain-specific guidance it can apply during a task, such as which tools to use, which SQL dialect rules matter, what safety checks to run, and what good output should look like. MotherDuck Skills do not connect to your account or run queries on their own. Instead, your agent loads the relevant skill when a task calls for MotherDuck-specific knowledge. Use them when you want the agent to: - Choose between the MotherDuck MCP server, a Postgres-compatible endpoint, a native `md:` DuckDB connection, or the REST API. - Inspect a workspace and summarize databases, schemas, tables, and columns before writing queries. - Write DuckDB SQL that works in MotherDuck instead of PostgreSQL-shaped SQL. - Load files or application data into MotherDuck with repeatable validation steps. - Design a dashboard, Dive, customer-facing analytics app, or data pipeline on top of MotherDuck. - Plan a migration to MotherDuck, including validation, rollout, and rollback steps. ## Prerequisites Before you install: - Git available on your `PATH`. - Node.js 18 or later (required for the Skills CLI). - A MotherDuck account and one of the supported agent harnesses below. 
For live MotherDuck work, authenticate through your normal path — a `MOTHERDUCK_TOKEN`, the [Postgres endpoint](/sql-reference/postgres-endpoint/), a native `md:` DuckDB connection, or [MotherDuck MCP](/key-tasks/ai-and-motherduck/mcp-setup/). Do not paste tokens into prompts or skill files. ## Install Pick your agent harness and run the command. Each install pulls the full MotherDuck Skills catalog. | Harness | Install | |---|---| | Claude Code | `/plugin marketplace add motherduckdb/agent-skills` then `/plugin install motherduck-skills@motherduck-skills` | | GitHub Copilot CLI | `/plugin marketplace add motherduckdb/agent-skills` then `/plugin install motherduck-skills@motherduck-skills` | | Codex | `codex plugin marketplace add motherduckdb/agent-skills`, then install **MotherDuck Skills** from `/plugins` | | Cursor | `npx -y skills add motherduckdb/agent-skills --agent cursor --skill '*' --yes --global` | | Gemini CLI | `gemini extensions install https://github.com/motherduckdb/agent-skills --consent` | For other agents, project-scoped installs, or to install individual skills, use Vercel's portable [Skills CLI](https://github.com/vercel-labs/skills). See Vercel's [Agent Skills documentation](https://vercel.com/docs/agent-resources/skills) for more details. ```bash npx -y skills add motherduckdb/agent-skills --skill '*' --yes --global ``` Check what got installed: ```bash npx -y skills ls -g ``` ## Verify the installation After install, try this prompt to confirm the skills are wired up: > Use MotherDuck Skills to choose the best connection path for this project. You should get MotherDuck-specific connection guidance, including the Postgres endpoint and native DuckDB tradeoffs. 
## Prompts to try Once the skills are installed, these prompts route to the right skill automatically: - `Use MotherDuck Skills to connect this app to MotherDuck.` - `Explore my MotherDuck workspace and identify the best table for a dashboard.` - `Write a DuckDB SQL query for this KPI and validate the syntax.` - `Design a Dive-backed dashboard from these tables.` - `Plan a Snowflake-to-MotherDuck migration with validation and rollback steps.` - `Design a customer-facing analytics architecture on MotherDuck.` - `Decide whether this workload needs DuckLake or native MotherDuck storage.` - `Use the MotherDuck REST API guidance to manage service accounts and tokens safely.` ## How the catalog is organized The catalog has three layers. Agents pick the right layer based on the task. **Utility skills** cover exact MotherDuck mechanics: connect, explore, query, use the REST API, or check DuckDB SQL behavior. Start here for narrow technical work. **Workflow skills** cover multi-step work with MotherDuck-specific tradeoffs: loading data, modeling, sharing, building [Dives](/key-tasks/ai-and-motherduck/dives/), evaluating DuckLake, planning security and governance, or framing pricing and ROI. **Use-case skills** cover designing or shipping a product surface: building customer-facing analytics, a dashboard, or a data pipeline; planning a migration to MotherDuck; rolling out self-serve analytics; or delivering repeatable partner implementations. For the full skill list and the latest install paths, see the [agent-skills repository](https://github.com/motherduckdb/agent-skills/). --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/securing-read-only-access # Restricting to read-only access > Restrict the remote MCP server to read-only queries using client-side blocking, read scaling tokens, or proxy filtering The remote MCP server exposes both the read-only `query` tool and the read-write `query_rw` tool. 
If you want to ensure your AI assistant can only read data, there are three approaches depending on your setup. | Approach | Enforcement | Setup | Works with OAuth connectors | |----------|------------|-------|-----------------------------| | [Block the tool at the client](#block-the-query_rw-tool-at-the-client) | Client-side | Low (UI toggle) | Yes | | [Use a read scaling token](#use-a-read-scaling-token) | Server-side | Medium (manual config) | No (replaces OAuth) | | [Proxy filtering](#proxy-filtering) | Application-side | Varies | N/A (custom backend) | ## Block the `query_rw` tool at the client The simplest approach: keep using the OAuth connector, but configure your MCP client to never call the `query_rw` tool. The server still exposes the tool, but the client will never invoke it. Most clients support this at the **individual user** level. ChatGPT also lets **organization admins** enforce tool restrictions across all workspace members. ### Claude Each user can block tools individually. Go to **Settings → Connectors → MotherDuck**, expand **Write/delete tools**, and select the blocked icon next to `query_rw`: ![Blocking the query_rw tool in Claude's connector settings](./img/query-rw-blocked.png) :::note Claude does not support org-level per-tool blocking. Team/Enterprise admins can remove a connector entirely from **Organization settings → Connectors**, but cannot selectively disable individual tools like `query_rw` for all members. ::: > [Claude connector permissions documentation](https://support.claude.com/en/articles/11175166-get-started-with-custom-connectors-using-remote-mcp) ### ChatGPT **Enterprise/Edu admins:** Admins can [enable or disable specific app actions after publishing](https://help.openai.com/en/articles/12584461-developer-mode-and-full-mcp-connectors-in-chatgpt-beta). Go to **Workspace Settings → Apps**, click the `...` menu next to MotherDuck, select **Action control**, and deselect `query_rw`. 
New tools added by the MCP server are disabled by default — admins must explicitly enable them. **Business plans:** Per-tool Action control is not available for custom MCP apps after publishing. To change which tools are exposed, remove and recreate the app ([developer mode documentation](https://help.openai.com/en/articles/12584461-developer-mode-and-full-mcp-connectors-in-chatgpt-beta)). ### Cursor Open **Cursor Settings** → **Tools & MCP**, expand the MotherDuck server entry, and toggle off `query_rw`. :::note Tool toggles are stored locally in Cursor's database, not in the `mcp.json` config file. They cannot be shared across a team through config files. ::: ### Claude Code Add a deny rule to your `.claude/settings.json` (project-level) or `~/.claude/settings.json` (user-level): ```json { "permissions": { "deny": ["mcp__MotherDuck__query_rw"] } } ``` > [Claude Code permissions documentation](https://code.claude.com/docs/en/permissions) ### Copilot Studio Open your agent in Copilot Studio, go to **Tools**, and open the MotherDuck MCP entry. Toggle `query_rw` off in the tool list and click **Save**. The agent only sees `query` and the schema exploration tools. ![MotherDuck MCP tool list in Copilot Studio with query_rw toggled off](/img/key-tasks/ai-and-motherduck/copilot-studio/07-tools-list.png) ## Use a read scaling token For server-side enforcement, authenticate with a [read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) instead of a regular access token. Read scaling tokens connect to dedicated read replicas that reject all write operations — even if the client calls `query_rw`, writes will fail. This requires manual configuration instead of the one-click OAuth connectors. :::note Read scaling connections are [eventually consistent](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/#ensuring-data-freshness). Results may lag a few minutes behind the latest database state. 
::: You can create a read scaling token from the [MotherDuck UI](https://app.motherduck.com) under **Settings → Access Tokens** or through the [REST API](/sql-reference/rest-api/users-create-token/). Read scaling tokens also unlock concurrent MCP sessions: each MCP instance that connects with a read scaling token is assigned to a read replica (duckling) from a pool. Up to the pool size (default 4, max 16), each connection gets its own duckling; once the pool is full, new connections are assigned to existing ducklings in round-robin. This means you can run many MCP sessions in parallel from the same account—for example, multiple AI agents or team members querying simultaneously. See [Read Scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) for details on pool sizing and how replicas are assigned. ### Claude Claude's web connector only supports OAuth, so you need to use the desktop config instead. Open **Settings → Developer → Edit Config** and add: ```json { "mcpServers": { "MotherDuck": { "command": "npx", "args": [ "mcp-remote", "https://api.motherduck.com/mcp", "--header", "Authorization: Bearer ${MOTHERDUCK_TOKEN}" ], "env": { "MOTHERDUCK_TOKEN": "" } } } } ``` This uses [`mcp-remote`](https://www.npmjs.com/package/mcp-remote) to bridge the remote MCP server into Claude Desktop's local stdio transport. ### ChatGPT ChatGPT connectors can't set static headers. To use a read scaling token, run a proxy that injects the `Authorization` header and connect ChatGPT to that proxy. 
Example proxy (Cloudflare Worker): ```js export default { async fetch(request, env) { const upstreamUrl = new URL(request.url); upstreamUrl.protocol = "https:"; upstreamUrl.hostname = "api.motherduck.com"; upstreamUrl.pathname = "/mcp"; const upstreamRequest = new Request(upstreamUrl, request); upstreamRequest.headers.set( "Authorization", `Bearer ${env.MOTHERDUCK_READ_SCALING_TOKEN}` ); upstreamRequest.headers.delete("cookie"); return fetch(upstreamRequest); }, }; ``` 1. Deploy the proxy and store the read scaling token as a secret (for example, `MOTHERDUCK_READ_SCALING_TOKEN`). 2. In [ChatGPT Settings → Connectors](https://chatgpt.com/#settings/Connectors), click **Create App**. 3. Enter: - **Name:** `MotherDuck (Read Only)` - **MCP Server URL:** `` - **Authentication:** `No authentication` 4. Open a chat, select the connector, and run a query (for example: `SELECT * FROM information_schema.tables LIMIT 5`). `query_rw` may still appear, but writes fail because read scaling tokens are read-only. ### Cursor Open **Cursor Settings** → **Tools & MCP** → **+ New MCP Server** and add the following configuration: ```json { "MotherDuck": { "url": "https://api.motherduck.com/mcp", "type": "http", "headers": { "Authorization": "Bearer " } } } ``` ### Claude Code ```bash claude mcp add --transport http \ --header "Authorization: Bearer " \ MotherDuck https://api.motherduck.com/mcp ``` ### Copilot Studio Follow the [Copilot Studio MCP setup](/key-tasks/ai-and-motherduck/mcp-setup/?mcp-client=copilot-studio) with **API key** authentication, and when prompted for the connection value, enter your read scaling token: ```text Bearer ``` The `query_rw` tool may still appear in the agent's tool list, but writes fail at the server because read scaling replicas reject write operations. For belt-and-braces, also toggle `query_rw` off in the tool list so the model never sees it as an option. 
![MotherDuck MCP tool list in Copilot Studio with query_rw toggled off](/img/key-tasks/ai-and-motherduck/copilot-studio/07-tools-list.png) ### Others For MCP-compatible clients that support simple authentication, use the following JSON configuration with a read scaling token as the Bearer value: ```json { "mcpServers": { "MotherDuck": { "url": "https://api.motherduck.com/mcp", "type": "http", "headers": { "Authorization": "Bearer " } } } } ``` For clients that only support local (stdio) servers, use `mcp-remote` to bridge the connection: ```json { "mcpServers": { "MotherDuck": { "command": "npx", "args": [ "mcp-remote", "https://api.motherduck.com/mcp", "--header", "Authorization: Bearer ${MOTHERDUCK_TOKEN}" ], "env": { "MOTHERDUCK_TOKEN": "" } } } } ``` ## Proxy filtering If you're integrating the remote MCP server into a backend service or custom agent framework, you can restrict access at the application layer. When proxying MCP tool calls, omit or reject calls to the `query_rw` tool and only forward calls to the read-only `query` tool and schema exploration tools. See [Building Analytics Agents](/key-tasks/ai-and-motherduck/building-analytics-agents) for patterns on building custom agent integrations with read-only access controls. --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup # Setting up SSO > Configure Single Sign-On (SSO) for your MotherDuck organization using your identity provider. Single Sign-On (SSO) allows your organization to authenticate MotherDuck users through your existing identity provider (IdP). When SSO is enabled, users with a verified email domain are automatically redirected to your corporate login page, removing the need for separate MotherDuck credentials. :::note SSO is available on **Business** and **Enterprise** plans. ::: ## How SSO works When you configure SSO, MotherDuck connects to your identity provider using either the SAML or OIDC protocol. 
The login flow works as follows: 1. A user enters their email on the MotherDuck login page. 2. MotherDuck looks up the email domain. If the domain is verified and SSO is enabled, the user is redirected to your corporate IdP. 3. The user authenticates with the IdP. 4. MotherDuck receives the authentication response and creates or updates the user's session. Users with personal email addresses or domains without SSO configured continue to use standard login methods (Google, GitHub, or email and password). ## Supported SSO configurations MotherDuck supports four SSO configuration options: | Configuration | Protocol | Use when | | --- | --- | --- | | **Okta** | OIDC | Your organization uses Okta Workforce Identity | | **Microsoft Entra ID** | OIDC | Your organization uses Microsoft Entra ID (formerly Azure AD) | | **SAML** | SAML | Your IdP supports SAML but is not Okta or Entra ID | | **OIDC** | OIDC | Your IdP supports OpenID Connect but is not Okta or Entra ID | The generic SAML and OIDC options allow you to connect any compatible identity provider, such as Google Workspace, PingFederate, or Keycloak. ### SAML vs. OIDC **SAML** (Security Assertion Markup Language) is an XML-based protocol widely used in enterprise environments for browser-based SSO. Most traditional enterprise IdPs support SAML. **OIDC** (OpenID Connect) is a JSON-based protocol built on top of OAuth 2.0. It is more common in cloud-native and modern environments. Both protocols achieve the same result: authenticating users through your IdP. Choose the protocol that your IdP supports or that your IT team is most familiar with. 
## Prerequisites

Before setting up SSO, ensure you have:

- **Admin** role in your MotherDuck organization
- A **Business** or **Enterprise** plan
- Admin access to your company's identity provider
- A **custom domain name** for your organization (for example, `acme.com`) and the ability to add a DNS TXT record to the domain for verification
- **Non-aliased email addresses** for all users in your organization (addresses like `user+tag@company.com` are not supported)

:::caution
SSO is supported for organizations where all users belong to a **single MotherDuck organization**. If your users are spread across multiple MotherDuck organizations (for example, separate US and EU orgs), do not enable SSO. Multi-organization SSO support is planned for a future release.
:::

## Setting up SSO

### Step 1: Start SSO configuration in MotherDuck

1. In the MotherDuck UI, click your organization name in the top left and select **Settings**.
2. Navigate to the **Authentication** tab.
3. Click **Set up SSO** to begin the setup process. ![MotherDuck Settings showing the Authentication tab with the Set up SSO button](./img/sso-authentication-settings.png)
4. Select your identity provider from the list, or choose **Custom SAML** or **Custom OIDC** if your IdP is not listed. ![Select your identity provider for SSO configuration](./img/sso-select-identity-provider.png)

### Step 2: Create a MotherDuck application in your identity provider

1. Log in to your identity provider's admin console.
2. Create a new application and name it **MotherDuck**.
3. Select the appropriate protocol (SAML or OIDC) based on your chosen configuration.

### Step 3: Configure the connection

The MotherDuck setup wizard provides step-by-step instructions for each provider. Follow the instructions on the SSO onboarding portal to configure the connection between your IdP and MotherDuck.
For example, the Okta configuration walks you through creating an OIDC application: ![Okta OIDC SSO configuration wizard showing the Create Application step](./img/sso-okta-create-application.png) ### Step 4: Map user attributes In your IdP, map the following attributes to the MotherDuck application: | Attribute | Required | Description | | --- | --- | --- | | `email` | Yes | The user's email address (primary login identifier) | | `given_name` | No | The user's first name | | `family_name` | No | The user's last name | ### Step 5: Assign users Assign yourself (and optionally other users) to the MotherDuck application in your IdP. ### Step 6: Verify your domain MotherDuck requires domain ownership verification before SSO can be enabled. Follow the instructions to add a DNS TXT record for your domain. Once the record is detected, your domain is verified. ![SSO configuration status showing pending domain verification](./img/sso-pending-domain-verification.png) ### Step 7: Enable SSO After domain verification succeeds, return to the setup wizard and click **Done** to complete the configuration, then click **Enable SSO** to activate the connection. ![SSO configuration dialog to confirm enabling SSO](./img/sso-enable-sso-dialog-confirmation.png) :::warning Enabling SSO is **not reversible** without contacting MotherDuck support. Before enabling, ensure that: - All users in your organization use non-aliased email addresses on the verified domain - Your users belong to **only this** MotherDuck organization - You have tested the IdP configuration by assigning yourself to the application ::: When SSO is enabled: - All existing non-SSO login methods (Google, GitHub, email/password) are **deactivated** for users with the verified domain - Any pending invitations matching the SSO domain will need to **sign up through SSO** - Users must authenticate through the configured IdP going forward ### Step 8: Test SSO login 1. Log out of MotherDuck. 2. 
On the login page, enter your corporate email address. 3. You should be redirected to your IdP's login page. 4. After authenticating, you are returned to the MotherDuck UI. ## Just-in-Time (JIT) user provisioning When SSO is enabled, new users from your verified domain can be automatically provisioned on their first login. This is called Just-in-Time (JIT) provisioning. JIT provisioning is enabled by default the first time you activate SSO. Admins can change this setting at any time from the organization **Settings** page (see below). With JIT enabled: - A user enters their corporate email on the MotherDuck login page - They are redirected to your IdP and authenticate - The user is automatically given the option to join your organization at signup ### Controlling access with JIT and invite settings Admins can configure JIT provisioning and organization invite policies from the organization **Settings** page. These two settings work together to control how new users join your organization: | Setting | When enabled | When disabled | | --- | --- | --- | | **JIT provisioning** | Users who authenticate through your IdP can join the organization on first login *(default on first SSO activation)* | New users must be invited by an Admin | | **Organization invites** | Any member can invite new users to the organization | Only Admins can invite new users, giving you tighter control over who has access | When both organization invites and JIT provisioning are disabled, new users can only join if an Admin invites them. When JIT is enabled but invites are disabled, users who have been given access in your IdP can still join on first login, but members cannot send invitations. ![invite policy](./img/org-invite-policy.png) For more information on managing organization members and roles, see [Managing organizations](/docs/key-tasks/managing-organizations/). JIT provisioning handles initial account creation only. 
It does not manage role changes or account deletion after provisioning. For automated user lifecycle management, see [SCIM provisioning](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/scim/). ### How SCIM affects JIT and invites When SCIM provisioning is enabled, MotherDuck delegates user lifecycle to your IdP. SCIM replaces JIT as the auto-provisioning mode, and organization invites are automatically disabled (the **Invite policy** setting is locked). To re-enable manual invites or fall back to JIT, [disable SCIM](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/scim/#disabling-scim) from the Authentication settings page. :::warning If you disable SCIM and then change members from inside MotherDuck, the user state in MotherDuck and your IdP will drift. Either keep SCIM enabled and manage users in your IdP, or disable SCIM and accept that the two systems are no longer in sync. ::: ## Managing members Managing users with SSO works the same as before. You can invite any new user by supplying their email address. If the email domain matches one of your verified domains, the user will be redirected to their IdP for authentication. If you have [SCIM provisioning](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/scim/) enabled, manual invites are disabled. Users are created automatically when you assign them to the MotherDuck application in your IdP, and deprovisioned when you unassign them. To hard-delete a user's record from MotherDuck, explicitly delete the user in your IdP — deprovisioning alone keeps the record for later reprovisioning. ## Limitations - **Single organization only**: SSO is supported for users who belong to a single MotherDuck organization. Multi-org SSO is planned for a future release. - **No aliased emails**: Email addresses with aliases (for example, `user+tag@company.com`) are not supported when SSO is enabled. 
- **One connection per domain**: Each verified domain can have only one SSO connection. Users with an email address on that domain in any MotherDuck organization will be redirected to their IdP.
- **Non-reversible**: Enabling SSO cannot be undone without contacting [MotherDuck support](mailto:support@motherduck.com).
- **CLI and SDK authentication**: Users authenticating through the SDKs continue to use [access tokens](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#authentication-using-an-access-token). SSO applies to browser-based login flows for the WebUI, CLI and MCP.

---
Source: https://motherduck.com/docs/key-tasks/sharing-data/sharing-within-org

# Sharing data with your organization

> Share databases with all members of your MotherDuck organization.

MotherDuck makes it easy to share data with all members of your Organization and to make that data discoverable and queryable. This is a common use case for small, highly collaborative data teams.

1. **Data provider** creates an **Organization** scoped, **Discoverable** share.
2. **Data consumers** find the share and **attach** it.
3. **Data provider** periodically updates the share to push new data to **data consumers**.

:::note
Shares are **region-scoped** based on your Organization's cloud region. Each MotherDuck Organization is scoped to a single cloud region that must be chosen at Org creation when signing up.

MotherDuck is available on AWS in three regions:

- **US East (N. Virginia):** `us-east-1`
- **US West (Oregon):** `us-west-2`
- **Europe (Frankfurt):** `eu-central-1`
:::

## 1. Create an organization-scoped, discoverable share

To share a database with your Organization, create a share. No actual data is copied and no additional costs are incurred in this process.

### UI

![trident](./img/ui-share_new.png)

Click on the "trident" next to the database you'd like to share. Select "share". Then:

1. Optionally, choose a share name. The default is the database name.
2. Choose whether the share should be accessible to all users in your Organization, to specified users in your Organization, or to any MotherDuck user in the same cloud region who has access to the share link.
3. Choose whether the share should be automatically updated or not; the current default is `MANUAL`

### SQL

```sql
USE birds;
CREATE SHARE; -- Shorthand syntax. Share name is optional. By default, shares are Organization-scoped and Discoverable.
CREATE SHARE birds FROM birds (ACCESS ORGANIZATION, VISIBILITY DISCOVERABLE); -- Equivalent to the previous statement, with the defaults spelled out.
```

## 2. Find and consume shares

The **data consumer** in your Organization can use the UI to find the share, attach it, and start querying it.

### UI

1. Select the share you want under "Shared with me"
2. Click "attach" and optionally name the resulting database.
3. You can query the resulting database.

:::note
The ability to list and discover Discoverable shares in SQL is coming shortly.
:::

## 3. Update shared data

If the **data provider** chose automatic updates when creating the share, the share is updated periodically. If the share was created with `MANUAL` updates, the **data provider** needs to update the share manually.

```sql
UPDATE SHARE birds;
```

Learn more about [UPDATE SHARE](/sql-reference/motherduck-sql-reference/update-share.md) and [data replication timing and checkpoints](./updating-shares.md).

---
Source: https://motherduck.com/docs/key-tasks/data-warehousing/replication/sql-server

# Replicating SQL Server tables to MotherDuck

> Replicate SQL Server tables to MotherDuck using Python and dataframes.

This page shows basic patterns for using Python to connect to SQL Server, read data into a dataframe, connect to MotherDuck, and write the data from the dataframe into MotherDuck.
For more complex replication scenarios, please take a look at our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion). To skip the documentation and look at the entire script, expand the element below:
Python script

```py
import pyodbc

# Define your connection parameters
server = 'ip_address'
database = 'master'  # or use your database name
username = 'your_username'
password = 'your_password'  # consider using a secret manager or .env
port = 1433  # default SQL Server port

# Define the connection string for ODBC Driver 17
connection_string = (
    f"DRIVER={{ODBC Driver 17 for SQL Server}};"
    f"SERVER={server},{port};"
    f"DATABASE={database};"
    f"UID={username};"
    f"PWD={password};"
)

# Connect to SQL Server
connection = None  # keeps the finally block safe if connect() raises
try:
    connection = pyodbc.connect(connection_string)
    print("Connection successful.")
except pyodbc.Error as e:
    print(f"Error: {e}")
finally:
    if connection is not None:
        connection.close()

import pandas as pd

# Open a fresh connection outside the try block so that finally
# only closes a connection that was actually opened
connection = pyodbc.connect(connection_string)
try:
    query = "SELECT * FROM AdventureWorks2022.Production.BillOfMaterials"

    # Execute the query using pyodbc
    cursor = connection.cursor()
    cursor.execute(query)

    # Fetch the column names and data
    columns = [column[0] for column in cursor.description]
    data = cursor.fetchall()

    # Convert the data into a DataFrame
    df = pd.DataFrame.from_records(data, columns=columns)
finally:
    connection.close()

import duckdb

motherduck_token = 'your_token'

# Attach using the MOTHERDUCK_TOKEN
duckdb.sql(f"ATTACH 'md:my_db?MOTHERDUCK_TOKEN={motherduck_token}'")

# Create or replace table in the attached database
duckdb.sql(
    """
    CREATE OR REPLACE TABLE my_db.main.BillOfMaterials AS
    SELECT * FROM df
    """
)
```
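The script above hardcodes credentials for readability, and its comments suggest a secret manager or `.env` file instead. The following is a minimal sketch of one way to do that, assuming the credentials are exposed as environment variables. The variable names (`MSSQL_SERVER`, `MSSQL_DATABASE`, `MSSQL_USERNAME`, `MSSQL_PASSWORD`, `MSSQL_PORT`) and the `build_connection_string` helper are illustrative assumptions, not part of pyodbc or MotherDuck.

```python
import os


def build_connection_string(env=os.environ):
    """Build an ODBC Driver 17 connection string from environment variables.

    The variable names are hypothetical; use whatever your secret manager
    or .env loader exposes.
    """
    server = env["MSSQL_SERVER"]  # required: raises KeyError if missing
    database = env.get("MSSQL_DATABASE", "master")
    username = env["MSSQL_USERNAME"]
    password = env["MSSQL_PASSWORD"]
    port = int(env.get("MSSQL_PORT", "1433"))  # default SQL Server port
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server},{port};"
        f"DATABASE={database};"
        f"UID={username};"
        f"PWD={password};"
    )
```

The returned string can be passed to `pyodbc.connect` in place of the hardcoded `connection_string`. A missing required variable raises `KeyError` before any connection is attempted.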
## SQL Server Authentication

SQL Server supports [multiple methods of authentication](https://learn.microsoft.com/en-us/sql/relational-databases/security/choose-an-authentication-mode?view=sql-server-ver16). For this example, we use username/password authentication and [pyodbc](https://github.com/mkleehammer/pyodbc/), along with [ODBC Driver 17 for SQL Server](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). Note that 'ODBC Driver 18 for SQL Server' is also available and includes support for some newer SQL Server features, but for the sake of compatibility, this example uses 17.

Consider the following authentication example:

```py
import pyodbc

# Define your connection parameters
server = 'ip_address'
database = 'master'  # or use your database name
username = 'your_username'
password = 'your_password'  # consider using a secret manager or .env
port = 1433  # default SQL Server port

# Define the connection string for ODBC Driver 17
connection_string = (
    f"DRIVER={{ODBC Driver 17 for SQL Server}};"
    f"SERVER={server},{port};"
    f"DATABASE={database};"
    f"UID={username};"
    f"PWD={password};"
)

# Connect to SQL Server
connection = None  # keeps the finally block safe if connect() raises
try:
    connection = pyodbc.connect(connection_string)
    print("Connection successful.")
except pyodbc.Error as e:
    print(f"Error: {e}")
finally:
    if connection is not None:
        connection.close()
```

This sets your credentials, attempts to connect to your server with `pyodbc.connect`, and prints an error if the connection fails.

## Reading a SQL Server table into a dataframe

Once you have authenticated, you can define arbitrary queries, execute them through a `pyodbc` cursor, and load the rows into a dataframe with `pd.DataFrame.from_records`. For this example, we are using SQL Server 2022 along with the AdventureWorks OLTP database.

:::note
While `pandas` is a great library, it is not particularly well-suited for very large tables.
To learn more about using buffers and alternative libraries, check out [Loading data with Python](/key-tasks/loading-data-into-motherduck/loading-data-md-python/).
:::

```py
import pandas as pd

# Connect outside the try block so that finally only closes
# a connection that was actually opened
connection = pyodbc.connect(connection_string)
try:
    query = "SELECT * FROM AdventureWorks2022.Production.BillOfMaterials"

    # Execute the query using pyodbc
    cursor = connection.cursor()
    cursor.execute(query)

    # Fetch the column names and data
    columns = [column[0] for column in cursor.description]
    data = cursor.fetchall()

    # Convert the data into a DataFrame
    df = pd.DataFrame.from_records(data, columns=columns)
finally:
    connection.close()
```

## Inserting the table into MotherDuck

Now that the data has been loaded into a dataframe object, we can connect to MotherDuck and insert the table.

:::note
You will need to [generate a token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#creating-an-access-token) in your MotherDuck account. For production use cases, make sure to use a secret manager and never commit your token to your codebase.
:::

```py
import duckdb

motherduck_token = 'your_token'

# Attach using the MOTHERDUCK_TOKEN
duckdb.sql(f"ATTACH 'md:my_db?MOTHERDUCK_TOKEN={motherduck_token}'")

# Create or replace table in the attached database
duckdb.sql(
    """
    CREATE OR REPLACE TABLE my_db.main.BillOfMaterials AS
    SELECT * FROM df
    """
)
```

This will create the table, or replace it if the table already exists.

## Handling More Complex Workflows

Production use cases tend to be much more complex and include things like incremental builds and state management. In those scenarios, please take a look at our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion), which include many options, some of which offer native Python. An overview of the MotherDuck Ecosystem is shown below.
![Diagram](../../../img/md-diagram.svg)

---
Source: https://motherduck.com/docs/key-tasks/database-operations/specifying-different-databases

# Specifying different databases

> Reference tables across databases using fully qualified names with database.schema.table syntax.

MotherDuck enables you to specify an active/current database and an active/current schema within that database.

Queryable objects (e.g. tables) that belong to the current database are resolved with just `<object_name>`. MotherDuck will automatically search all schemas within the current database. If there are overlapping names within different schemas, objects can be qualified with `<schema_name>.<object_name>`.

Queryable objects in your account outside of the active/current database are resolved with `<database_name>.<object_name>`. However, if a schema in the current database shares the same name as another database, the fully qualified name must be used: `<database_name>.<schema_name>.<object_name>` (an error will be thrown to indicate the ambiguity). This applies both to databases that live in MotherDuck and to databases in your local DuckDB environment.

For example:

### CLI

```sql
-- check your current database
SELECT current_database();
dbname

-- check your current schema
SELECT current_schema();
main

-- query a table mytable that exists in the current database dbname
SELECT count(*) FROM mytable;
34

-- query a table mytable2 that exists in the database dbname2
SELECT count(*) FROM dbname2.mytable2;
41

-- query a table mytable3 that exists in schema2
-- note that the syntax is identical to the database name syntax above and
-- MotherDuck will detect whether a database or schema is involved
SELECT count(*) FROM schema2.mytable3;
42

-- query a table in another database when a schema exists with the same name in the current database
-- (overlappingname is both a database name and a schema name)
SELECT count(*) FROM overlappingname.myschemaname.mytable4;
43
```

You can also reference local databases in the same MotherDuck queries. This type of query is known as a [hybrid query](/key-tasks/running-hybrid-queries.md).
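The name-resolution rules described in this section can be summarized with a small toy model. This is only an illustrative sketch in plain Python, not MotherDuck's implementation: the `resolve` function and the nested-dict catalog layout are hypothetical, and the model simply rejects a bare name that exists in more than one schema instead of consulting a search path.

```python
def resolve(name, catalog, current_db):
    """Resolve a dotted object name against a toy catalog.

    catalog maps database -> schema -> set of object names.
    Returns a (database, schema, object) tuple or raises LookupError.
    """
    parts = name.split(".")
    schemas = catalog[current_db]

    if len(parts) == 3:  # database.schema.object is always unambiguous
        db, schema, obj = parts
        if obj in catalog.get(db, {}).get(schema, set()):
            return (db, schema, obj)
        raise LookupError(f"{name} not found")

    if len(parts) == 1:  # bare name: search every schema of the current database
        matches = [s for s, objs in schemas.items() if parts[0] in objs]
        if len(matches) == 1:
            return (current_db, matches[0], parts[0])
        raise LookupError(f"{name} not found or present in several schemas")

    # Two parts: the prefix may be a schema in the current database
    # or the name of another database.
    prefix, obj = parts
    is_schema = prefix in schemas
    is_database = prefix in catalog
    if is_schema and is_database:
        raise LookupError(
            f"{prefix} is both a schema and a database; use database.schema.object"
        )
    if is_schema and obj in schemas[prefix]:
        return (current_db, prefix, obj)
    if is_database:
        matches = [s for s, objs in catalog[prefix].items() if obj in objs]
        if len(matches) == 1:
            return (prefix, matches[0], obj)
    raise LookupError(f"{name} not found")


# Toy catalog mirroring the names used in the CLI example
catalog = {
    "dbname": {"main": {"mytable"}, "schema2": {"mytable3"}, "overlappingname": {"t"}},
    "dbname2": {"main": {"mytable2"}},
    "overlappingname": {"myschemaname": {"mytable4"}},
}
print(resolve("mytable", catalog, "dbname"))
print(resolve("dbname2.mytable2", catalog, "dbname"))
print(resolve("schema2.mytable3", catalog, "dbname"))
print(resolve("overlappingname.myschemaname.mytable4", catalog, "dbname"))
```

In this model, `overlappingname.mytable4` raises an error because `overlappingname` is both a schema in the current database and a separate database, which is the case where the three-part name is required.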
To change the active database, schema, or database/schema combination, execute a `USE` command. See the documentation on [switching the current database](./switching-the-current-database.md) for details.

---
Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/java

# Connect from Java via Postgres endpoint

> Connect to MotherDuck from Java using the PostgreSQL JDBC driver via the Postgres wire protocol

You can query MotherDuck from Java using the standard [PostgreSQL JDBC driver](https://jdbc.postgresql.org/); no DuckDB installation is required.

For connection parameters, SSL options, and limitations, see the [Postgres Endpoint reference](/sql-reference/postgres-endpoint).

## Prerequisites

You'll need a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck). Set it as an environment variable:

```bash
export MOTHERDUCK_TOKEN="your_token_here"
```

Add the PostgreSQL JDBC driver to your project:

### Maven

```xml
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.7.5</version>
</dependency>
```

### Gradle

```groovy
implementation 'org.postgresql:postgresql:42.7.5'
```

## Connect

```java
import java.sql.*;

public class MotherDuckExample {
    public static void main(String[] args) throws SQLException {
        String token = System.getenv("MOTHERDUCK_TOKEN");
        String url = "jdbc:postgresql://pg.us-east-1-aws.motherduck.com:5432/md:"
            + "?sslmode=verify-full"
            + "&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory";

        try (Connection conn = DriverManager.getConnection(url, "postgres", token);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT title, score FROM sample_data.hn.hacker_news WHERE type='story' LIMIT 10")) {
            ResultSetMetaData meta = rs.getMetaData();
            int columnCount = meta.getColumnCount();
            while (rs.next()) {
                for (int i = 1; i <= columnCount; i++) {
                    System.out.print(meta.getColumnName(i) + "=" + rs.getString(i));
                    if (i < columnCount) System.out.print(", ");
                }
                System.out.println();
} } } } ``` You can also configure the connection using a `Properties` object: ```java import java.sql.*; import java.util.Properties; Properties props = new Properties(); props.setProperty("user", "postgres"); props.setProperty("password", System.getenv("MOTHERDUCK_TOKEN")); props.setProperty("sslmode", "verify-full"); props.setProperty("sslfactory", "org.postgresql.ssl.DefaultJavaSSLFactory"); Connection conn = DriverManager.getConnection( "jdbc:postgresql://pg.us-east-1-aws.motherduck.com:5432/md:", props ); ``` ## SSL notes The PostgreSQL JDBC driver looks for a root certificate at `~/.postgresql/root.crt` by default. To use your JVM's built-in truststore instead (which includes standard CAs like Let's Encrypt), set `sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory`. If certificate verification doesn't work in your environment, you can fall back to `sslmode=require`, which encrypts the connection but doesn't verify the server certificate. For more details on SSL options, see [SSL and certificate verification](/sql-reference/postgres-endpoint#ssl-and-certificate-verification). --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/dives/index # Creating Visualizations with Dives > Build interactive visualizations from natural language using AI agents and the MotherDuck MCP Server Dives are interactive visualizations you create with natural language, directly on top of your data in MotherDuck. Ask a question to your AI agent, and MotherDuck generates a persistent, interactive component that lives in your workspace alongside your SQL. Think of Dives as a bridge between one-off questions and always-up-to-date dashboards. Instead of building a full dashboard or writing complex queries, you can ask a question and save the answer as a Dive that stays current with your data. ## How Dives work When you create a Dive with the [MotherDuck MCP](/sql-reference/mcp/) through an AI agent: 1. 
You ask a question in natural language (for example, "Show me monthly revenue trends by product category") 2. The AI agent queries your MotherDuck database through the [MCP Server](/sql-reference/mcp/) to understand the data 3. The agent creates an interactive visualization, with the necessary SQL to query the data 4. In clients that support the Dive Viewer MCP App, the Dive renders inline in the chat against live data. In other clients, the agent shows a static preview with sample data until you open the Dive in MotherDuck 5. MotherDuck saves the Dive to your workspace Dives use MotherDuck's [hypertenancy](/concepts/hypertenancy) architecture to serve sub-second queries. Every user gets dedicated compute, so there's no slowdown when your whole team explores data at once. ### Inline preview with the Dive Viewer On clients that support [MCP Apps](https://apps.extensions.modelcontextprotocol.io/), the MotherDuck MCP Server serves a **Dive Viewer MCP App** that renders your Dive directly in the chat with the same React components used in the MotherDuck UI. At launch, this is supported in Claude web and desktop; other clients fall back to a sample-data preview. With the Dive Viewer: - The preview queries **live data** through the MCP Server, so what you see in the chat matches what you'll see in MotherDuck. - Every edit is applied incrementally and saved as a separate version of the Dive, rather than rewritten from scratch. You can browse versions from the version picker in the MotherDuck UI. - You iterate conversationally (*"add a filter for US region"*, *"switch to a bar chart"*) and the Viewer updates in place. ## Prerequisites To create a Dive, you will need: - A MotherDuck account with at least one database - An [AI client](/docs/getting-started/mcp-getting-started/) connected to the [MotherDuck MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) (Claude, ChatGPT, Cursor, or others) Dives are available on all MotherDuck plans at no additional charge. 
## Creating a Dive Connect your AI assistant to the MotherDuck MCP Server, then ask it to create a visualization. The key is to ask for a "Dive" specifically as this tells the agent to persist the visualization in your MotherDuck workspace. **Example prompts:** - *"Create a Dive showing monthly revenue trends for the last 12 months"* - *"Make a Dive that breaks down customer sign-ups by region"* - *"Build a Dive with a chart of our top 10 products by sales volume. Use MotherDuck's brand colors"* The AI agent handles the SQL, chart configuration, styling and saving. You just describe what you want to see. ### Iterating on a Dive Once you have a Dive, you can refine it through conversation: - *"Add a filter for the US region only"* - *"Change the chart to a stacked bar chart"* - *"Add a trend line to show the overall direction"* Each update modifies the Dive in place, keeping your visualization current. ## Finding your Dives Dives appear in two places in the MotherDuck UI: ### Object explorer Your recent Dives appear in the left sidebar, above your Notebooks. Click any Dive to load it in the main view. The list shows your most recent Dives first. ![A screenshot of a dives dashboard in the MotherDuck UI](./img/dives_airquality_eastcoats_westcoast.png) ### Settings page For a complete list of all Dives in your organization, go to **Settings** → **Dives**. This view makes it easier to find Dives created by others in your team. ![A screenshot of the dives settings and overview in the MotherDuck UI](./img/dives_settings_ui.png) ## Sharing Dives with your team When you save a Dive, the AI agent checks whether the databases it queries are shared with your organization. If not, it will suggest sharing them so your team can view the Dive. You can also explicitly ask: > *"Share the data for my revenue Dive with my team"* This creates org-scoped shares for any private databases referenced in the Dive's queries and updates the Dive to use the shared references. 
See [`share_dive_data`](/sql-reference/mcp/share-dive-data) for details. ## Sharing the current view Dives can also share their current interactive state through the URL. If a Dive uses [`useDiveState`](/sql-reference/motherduck-sql-reference/ai-functions/dives/use-dive-state) for controls such as filters, sorting, selected tabs, or drill-downs, the state is encoded into the URL. When someone copies the URL, another viewer opens the same Dive with the same selections applied. (Embedded Dives surface the same data through `postMessage` events — see [Handle Dive state updates from embedded Dives](/key-tasks/ai-and-motherduck/dives/embedding-dives/#handle-dive-state-updates-from-embedded-dives).) Small amounts of state are encoded directly in the URL fragment. Larger state is stored on the server and referenced by a short, opaque ID, so the URL stays compact even when selections grow to many kilobytes. Reference resolution is best-effort: if a reference can't be resolved, for example, because the underlying Dive was deleted, the Dive opens at its default state instead of failing to load. Use this for state that should survive a refresh or travel with a link. Temporary interface state, such as an open dialog or unsaved text input, should stay in React's `useState`. ## Version history Every time you update a Dive, MotherDuck saves a version. You can browse previous versions directly in the MotherDuck UI using the version picker in the top-right corner of a Dive. The dropdown shows each version with its description and when it was created. ![A screenshot of the version history dropdown in the MotherDuck Dives UI](./img/dives_version_history.png) Selecting a previous version lets you view what the Dive looked like at that point. Version browsing is read-only: switching to an older version does not overwrite the latest version. You can also retrieve versions programmatically. 
Use [`list_dives`](/sql-reference/mcp/list-dives) to see the `current_version` for each Dive, and [`read_dive`](/sql-reference/mcp/read-dive) with the `version` parameter to inspect a specific version. ## What makes Dives different Unlike traditional dashboards: - **Natural language creation**: Describe what you want in plain English instead of clicking through a UI or writing visualization code - **Always current**: Dives query live data—no manual refreshes or stale snapshots - **Workspace-native**: Dives live alongside your SQL in MotherDuck, not in a separate tool - **Instant exploration**: Filter, drill down, and explore without waiting for queries to run Unlike one-off AI-generated charts: - **Persistent**: Dives save to your workspace so you can return to them anytime - **Shareable**: Team members can view and interact with Dives you create—[share the underlying data](/sql-reference/mcp/share-dive-data) to give them access, and share the URL to preserve supported filters and view state - **Interactive**: Filter and explore the data, not just view a static image ## Walkthrough: Building a Dive step by step ### Claude Desktop/Web Connect the [MotherDuck MCP Server](/sql-reference/mcp/) to Claude for desktop or Claude on the web, then open a new conversation. **Step 1: Explore your data** Don't ask for a finished Dive right away. Start vague: *"Take a look at what tables I have in my analytics database."* Claude lists tables, reads column names, samples rows, and figures out how things connect. Doing this first saves you from chasing down SQL errors later. When it reports back, keep asking questions. *"How do the orders and customers tables connect? What date range am I working with?"* The more Claude knows about your schema upfront, the fewer corrections you'll need. **Step 2: Shape the analysis** Point Claude at what you want to see. 
If you're not sure what to look for, go open-ended: *"What are the most interesting patterns in this data?"* Claude runs queries and pulls out trends you might have missed. If you already have something in mind, say so: *"I want to see how revenue breaks down by product category over the last 12 months."* You can also paste in a SQL query or a screenshot of a dashboard you want to recreate. Mention specifics like calculated columns, filters, or date ranges before asking Claude to build the Dive. **Step 3: Iterate on the live preview** Claude renders the Dive inline in the chat with the Dive Viewer MCP App, using the same components as the MotherDuck UI and running against live data. Dive edits are versioned. Users can ask their agent to refer to and clone prior versions for continued iterations. They can also browse through past versions directly in the MotherDuck UI. Explain *why* you want a change, not just *what*. *"I want to spot outliers quickly"* gives Claude more to work with than *"make the dots bigger."* Group related tweaks into one message. Keep unrelated changes separate. If something isn't working after two or three rounds, try a different approach. If you know what you want to change specifically, go ahead and do it. Even beyond the charts and visuals themselves, there are so many ways to enhance your Dive. Every type of custom interaction you've seen on the web is available to you. Ask for features like drill downs, cross-filtering, zooming, and more. You don't have to finish in one sitting. **Step 4: Find it in MotherDuck** Every edit from the Dive Viewer is saved to your workspace as a separate version, so the Dive is already there when you're done iterating. 
If you want to force a save or name a checkpoint explicitly, ask Claude: *"Save this as a Dive in MotherDuck."* Find the Dive in the [Object Explorer sidebar](#object-explorer) or on the [Settings page](#settings-page), share it with your team, and come back to Claude when you want to change anything. ### Claude Code Claude Code can allow you to iterate very quickly when building Dives. With Claude Code, you can preview your changes in a local environment for instant feedback loops - and Claude can get that environment set up for you! To get started, connect the [MotherDuck MCP Server](/sql-reference/mcp/) to Claude Code, then open a new conversation. **Step 1: Explore your data** Don't ask for a finished Dive right away. Start vague: *"Take a look at what tables I have in my analytics database."* Claude lists tables, reads column names, samples rows, and figures out how things connect. Doing this first saves you from chasing down SQL errors later. When it reports back, keep asking questions. *"How do the orders and customers tables connect? What date range am I working with?"* The more Claude knows about your schema upfront, the fewer corrections you'll need. **Step 2: Shape the analysis** Point Claude at what you want to see. If you're not sure what to look for, go open-ended: *"What are the most interesting patterns in this data?"* Claude runs queries and pulls out trends you might have missed. If you already have something in mind, say so: *"I want to see how revenue breaks down by product category over the last 12 months."* You can also paste in a SQL query or a screenshot of a dashboard you want to recreate. Mention specifics like calculated columns, filters, or date ranges before asking Claude to build the Dive. **Step 3: Create a Dive local preview** Next, ask Claude to create a Dive based on your analysis thus far and any other open questions on your mind. 
Claude will ask if you would like to see a local preview, and if you accept, the MotherDuck MCP will give Claude the instructions to set up a preview on your local machine. To set up the preview, Claude will make some local folders and run some npm commands, and after a moment your environment will be ready. You will receive a message like this: > `The preview is running at http://localhost:5177/.` > `Open that in your browser to see the Dive with live data from MotherDuck.` So, cmd + click on that localhost URL (or ctrl + click if you are in Windows), and you'll have a live preview in your browser of the Dive you just created. **Step 4: Iterate with the preview** Now you get to tap into the power of Agents for follow up analysis and enhancing the visual. Explain *why* you want a change, not just *what*. *"I want to spot outliers quickly"* gives Claude more to work with than *"make the dots bigger."* Group related tweaks into one message. Keep unrelated changes separate. If something isn't working after two or three rounds, try a different approach. If you know what you want to change specifically, go ahead and do it. Feel free to keep questions open ended. Things like, *"What other columns are correlated with revenue? What other interesting patterns should I investigate?"* can let Claude uncover hidden patterns on your behalf. Even beyond the charts and visuals themselves, there are so many ways to enhance your Dive. Every type of custom interaction you've seen on the web is available to you. Ask for features like drill downs, cross-filtering, zooming, and more. **Step 5: Publish to MotherDuck** Tell Claude to save it: *"Save this as a Dive in MotherDuck."* The Dive runs against live data. Find it in the [Object Explorer sidebar](#object-explorer) or on the [Settings page](#settings-page), share it with your team, and come back to Claude when you want to change anything. 
### ChatGPT Connect the [MotherDuck MCP Server](/sql-reference/mcp/) to ChatGPT and follow the general steps in [Creating a Dive](#creating-a-dive). The workflow is similar to the Claude Desktop/Web tab: explore your data, shape the analysis, then ask ChatGPT to save the result as a Dive. ### Cursor Connect the [MotherDuck MCP Server](/sql-reference/mcp/) to Cursor and follow the general steps in [Creating a Dive](#creating-a-dive). The workflow is similar to the Claude Code tab: explore your data, shape the analysis, preview locally, then publish the Dive to MotherDuck. ## Tips for better Dives ### Be specific about the visualization Include details about chart type, time ranges, and groupings: | Less effective | More effective | |----------------|----------------| | "Show me sales data" | "Create a Dive with a line chart of weekly sales for 2024, broken down by product category" | | "Make a customer chart" | "Build a Dive showing customer count by signup month as a bar chart" | ### Use your schema knowledge If you know your table and column names, include them: > "Create a Dive from the `orders` table showing `total_amount` by `order_date`, grouped by month" ### Start simple, then iterate Begin with a basic visualization, then add complexity: 1. *"Create a Dive showing revenue by month"* 2. *"Add a breakdown by region"* 3. 
*"Filter to show only the top 5 regions"*

## Troubleshooting

| Issue | Solution |
|-------|----------|
| AI creates a chart but doesn't save it as a Dive | Explicitly ask to "create a Dive" or "save this as a Dive in MotherDuck" |
| Dive shows unexpected data | Ask the AI to explain the query it used, then refine your request |
| Can't find a Dive | Check **Settings** → **Dives** for the complete list |
| Dive is slow to load | The underlying query may be scanning a lot of data; ask the AI to add filters or optimize |

## Declaring required databases

When your Dive queries a database that viewers might not have attached, export a `REQUIRED_DATABASES` constant from your component. MotherDuck automatically attaches these databases (including shared databases) before running any queries, so your teammates don't see "Catalog does not exist" errors.

```jsx
export const REQUIRED_DATABASES = [
  { type: 'share', path: 'md:_share/<share_name>/<share_id>', alias: '<alias>' }
];
```

Each entry describes one database:

| Field | Description |
|-------|-------------|
| `type` | `"share"` for shared databases, `"database"` for owned databases |
| `path` | The share URL (for example, `md:_share/galactic_coffee/af03aa17-...`) or database name |
| `alias` | The local alias used in your SQL queries |

You can find your share URLs by running `FROM MD_INFORMATION_SCHEMA.OWNED_SHARES;` or by asking the AI agent to use the [`share_dive_data`](/sql-reference/mcp/share-dive-data) tool.

This approach is preferred over calling `ATTACH` inside `useSQLQuery`, because it lets MotherDuck handle the attachment before any data queries fire.
## Related resources - [Embedding Dives in your website](/key-tasks/ai-and-motherduck/dives/embedding-dives) - [Dives SQL Functions](/sql-reference/motherduck-sql-reference/ai-functions/dives/) — Manage Dives directly from SQL - [`useSQLQuery` hook](/sql-reference/motherduck-sql-reference/ai-functions/dives/use-sql-query) — React hook reference for querying data inside Dives - [`useDiveState` hook](/sql-reference/motherduck-sql-reference/ai-functions/dives/use-dive-state) — React hook reference for shareable Dive state - [Connect to MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) — Set up the MCP server with your AI assistant - [MCP Workflows](/key-tasks/ai-and-motherduck/mcp-workflows/) — Tips for effective AI-powered data analysis - [AI Features in MotherDuck](/docs/key-tasks/ai-and-motherduck/ai-features-in-ui/) — Explore instant SQL and automatic SQL fixes. --- Source: https://motherduck.com/docs/key-tasks/service-accounts-guide/manage-service-accounts-and-tokens # Manage service accounts and tokens > Use the MotherDuck UI and REST API to view, delete, and rotate service account tokens. Use the MotherDuck UI for service account inventory and one-off administration. Use the REST API when your automation already knows the target service account username. :::warning[Admin access required] Managing service accounts and service account tokens requires an organization Admin. REST API examples use a read/write access token generated by an Admin user. 
::: ## Check what each interface supports | Task | MotherDuck UI | REST API | |---|---|---| | List all service accounts in an organization | Yes | No | | Create a service account | Yes | Yes, with [`POST /v1/users`](/sql-reference/rest-api/users-create-service-account/) | | View tokens for a known service account | Yes | Yes, with [`GET /v1/users/{username}/tokens`](/sql-reference/rest-api/users-list-tokens/) | | Create a token for a known service account | Yes | Yes, with [`POST /v1/users/{username}/tokens`](/sql-reference/rest-api/users-create-token/) | | Revoke a known token | Yes | Yes, with [`DELETE /v1/users/{username}/tokens/{token_id}`](/sql-reference/rest-api/users-delete-token/) | | Delete a known service account | Yes | Yes, with [`DELETE /v1/users/{username}`](/sql-reference/rest-api/users-delete/) | | View or configure Ducklings for a known service account | Yes | Yes, with the [Duckling configuration endpoints](/sql-reference/rest-api/ducklings-get-duckling-config-for-user/) | | Impersonate a service account | Yes | No | The REST API doesn't provide an endpoint for listing all service accounts in an organization. If you provision service accounts through the API, store the returned usernames in your own system. ## View service accounts ### UI ![Service account management page](../img/sa_manage_details.png) 1. In the MotherDuck UI, go to **Settings** > **Service Accounts**. 2. Review the service account list. 3. Click a username to view that service account's details and tokens. 4. Use the Duckling size and pool size dropdowns to review compute configuration. ### API The REST API doesn't provide a service account list endpoint. Use the UI to view organization-level service account inventory. For automated provisioning, persist the `username` returned by [`POST /v1/users`](/sql-reference/rest-api/users-create-service-account/) when you create each service account. 
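Because there is no list endpoint, automation has to keep its own record of the usernames it creates. The following is a minimal sketch of that bookkeeping, assuming a local JSON file as the store. The `load_service_accounts` and `record_service_account` helpers and the file layout are hypothetical; a real deployment would more likely use a database or configuration service.

```python
import json
from pathlib import Path


def load_service_accounts(path):
    """Return the recorded service account usernames (empty list if none)."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text())


def record_service_account(path, username):
    """Add a username to the inventory file, keeping entries unique and sorted."""
    usernames = set(load_service_accounts(path))
    usernames.add(username)
    ordered = sorted(usernames)
    Path(path).write_text(json.dumps(ordered, indent=2))
    return ordered
```

One way to use it is to call `record_service_account("service_accounts.json", username)` with the `username` value from the create response right after each successful `POST /v1/users` call, so that later token and deletion requests can look the account up.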
## View tokens for a service account

The token list shows token metadata, including token ID, name, type, creation time, and expiration time. It doesn't return the token secret.

### UI

1. In **Settings** > **Service Accounts**, open the service account details page.
2. Review the token list.

### API using curl

Use [`GET /v1/users/{username}/tokens`](/sql-reference/rest-api/users-list-tokens/) to list tokens for a known service account username.

```bash
curl -X GET \
  https://api.motherduck.com/v1/users/analytics_service_account/tokens \
  -H "Authorization: Bearer <admin_access_token>"
```

### API using Python

Use [`GET /v1/users/{username}/tokens`](/sql-reference/rest-api/users-list-tokens/) to list tokens for a known service account username.

```python
import pprint

import requests

response = requests.get(
    "https://api.motherduck.com/v1/users/analytics_service_account/tokens",
    headers={"Authorization": "Bearer <admin_access_token>"},
)
response.raise_for_status()
pprint.pp(response.json()["tokens"])
```

## Rotate a service account token

Rotate tokens by creating a replacement token before revoking the old token.

1. Create a replacement token for the service account.
2. Update your secret manager or application configuration to use the replacement token.
3. Deploy or restart clients that use the token.
4. Verify that the workload can connect to MotherDuck with the replacement token.
5. Revoke the old token.

## Revoke a token

### UI

![Service account token actions](../img/sa_revoke_token_option.png)

1. In **Settings** > **Service Accounts**, open the service account details page.
2. Open the token's three-dot menu.
3. Click **Revoke token**.
4. Confirm the revocation.

### API using curl

Use [`DELETE /v1/users/{username}/tokens/{token_id}`](/sql-reference/rest-api/users-delete-token/) to revoke a known token.
```bash
# Replace <token_id> with the ID of the token to revoke and <token> with a MotherDuck access token.
curl -X DELETE \
  "https://api.motherduck.com/v1/users/analytics_service_account/tokens/<token_id>" \
  -H "Authorization: Bearer <token>"
```

### API using Python

Use [`DELETE /v1/users/{username}/tokens/{token_id}`](/sql-reference/rest-api/users-delete-token/) to revoke a known token.

```python
import requests

# Replace <token_id> with the ID of the token to revoke and <token> with a MotherDuck access token.
response = requests.delete(
    "https://api.motherduck.com/v1/users/analytics_service_account/tokens/<token_id>",
    headers={"Authorization": "Bearer <token>"},
)
response.raise_for_status()
```

## Delete a service account

Deleting a service account immediately revokes its tokens and permanently deletes data owned by that account.

:::warning[This action can't be undone]
Verify the service account username before deleting it. Data and users deleted through the API can't be recovered.
:::

### UI

1. In **Settings** > **Service Accounts**, find the service account.
2. Open the service account's three-dot menu.
3. Click **Delete account**.
4. Confirm the deletion.

### API using curl

Use [`DELETE /v1/users/{username}`](/sql-reference/rest-api/users-delete/) to delete a known service account.

```bash
# Replace <token> with a MotherDuck access token.
curl -X DELETE \
  https://api.motherduck.com/v1/users/analytics_service_account \
  -H "Authorization: Bearer <token>"
```

### API using Python

Use [`DELETE /v1/users/{username}`](/sql-reference/rest-api/users-delete/) to delete a known service account.
```python
import requests

# Replace <token> with a MotherDuck access token.
response = requests.delete(
    "https://api.motherduck.com/v1/users/analytics_service_account",
    headers={"Authorization": "Bearer <token>"},
)
response.raise_for_status()
print(response.json()["username"])
```

## Related content

- [Create and configure service accounts](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/)
- [Impersonate service accounts](/key-tasks/service-accounts-guide/impersonate-service-accounts/)
- [MotherDuck REST API](/sql-reference/rest-api/motherduck-rest-api/)

---
Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/scim

# Setting up SCIM provisioning

> Automate user lifecycle management in MotherDuck using SCIM with your identity provider.

SCIM (System for Cross-domain Identity Management) keeps your MotherDuck users in sync with your identity provider. When you assign, update, or remove a user in your IdP, the change is automatically applied in MotherDuck.

:::note
SCIM provisioning is available on **Business** and **Enterprise** plans, and requires an active [SSO connection](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/).
:::

## How SCIM complements SSO

SSO and SCIM solve different problems:

| | SSO | SCIM |
| --- | --- | --- |
| **Purpose** | Authentication: controls **how** users sign in | Provisioning: controls **which** users exist |
| **Handles** | Sign-in redirects, session management | Account creation, updates, deprovisioning |
| **Trigger** | User-initiated (at sign-in) | IdP-initiated (when staff changes) |

With SSO alone, MotherDuck uses [just-in-time (JIT) provisioning](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/#just-in-time-jit-user-provisioning) to create accounts on first sign-in.
JIT does not handle changes after the account is created — if an employee leaves your company, their MotherDuck account stays active until an admin [deprovisions or removes them](/docs/key-tasks/managing-organizations/#deprovisioning-users) manually. SCIM closes that gap by making your IdP the source of truth for who has access. SCIM replaces JIT as the auto-provisioning mode and disables manual invite flows, so the IdP becomes the only place where members are added or removed. ## Prerequisites Before you enable SCIM, confirm: - **Admin** role in MotherDuck. - A **Business** or **Enterprise** plan. - An [SSO connection](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/) that is **active** (not pending). SCIM cannot be enabled on a pending connection. - Admin access to the IdP application that's already linked to your SSO connection. Configuring SCIM uses your existing SSO connection, so any of the supported enterprise SSO connection types work — **SAML**, **OIDC**, **Okta Workforce**, or **Microsoft Entra ID** (Azure AD). ## Supported operations | IdP action | Effect in MotherDuck | | --- | --- | | Assign user to the MotherDuck application | Creates a MotherDuck user with the **Member** role on first SCIM event | | Update user attributes (name, email) | Updates the MotherDuck user record | | Deprovision user | Deprovisions the user — sign-in is blocked, all access tokens are revoked, but data is retained and the account can be reprovisioned | | Reprovision user | Restores a deprovisioned user to active status | | Unassign / delete user | Removes the user from the organization (hard delete) | Role assignment through SCIM is not yet supported — all SCIM-provisioned users start with the **Member** role. Use the MotherDuck **Members** page to change a user's role after provisioning. 
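To make the table above concrete, the sketch below builds the kind of SCIM 2.0 request bodies an IdP typically sends when a user is assigned and later deprovisioned. It follows the standard SCIM schemas (RFC 7643 and RFC 7644), not anything MotherDuck-specific. The helper names and the sample user are hypothetical, and real IdPs often include additional attributes.

```python
import json

# Standard SCIM 2.0 schema URNs (RFC 7643 / RFC 7644).
USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def scim_create_user(email: str, given_name: str, family_name: str) -> dict:
    """Body of a SCIM create request for a newly assigned user (hypothetical helper)."""
    return {
        "schemas": [USER_SCHEMA],
        "userName": email,
        "emails": [{"value": email, "primary": True}],
        "name": {"givenName": given_name, "familyName": family_name},
        "active": True,
    }


def scim_set_active(active: bool) -> dict:
    """Body of a SCIM PATCH that deprovisions (False) or reprovisions (True) a user."""
    return {
        "schemas": [PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": active}],
    }


# Sample user for illustration only.
print(json.dumps(scim_create_user("ada@example.com", "Ada", "Lovelace"), indent=2))
print(json.dumps(scim_set_active(False), indent=2))
```

Your IdP generates these requests for you; the sketch is only meant to help when reading IdP provisioning logs or debugging an attribute mapping.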
## Attribute mapping MotherDuck reads the following attributes from each SCIM request: | SCIM attribute | Required | Description | | --- | --- | --- | | `userName` | Yes | The user's email address. Must be on a [verified domain](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/#step-6-verify-your-domain) of the SSO connection. | | `emails[].value` | Yes (if `userName` is not set to email) | Used as a fallback for the email address. | | `name.givenName` | No | The user's first name. | | `name.familyName` | No | The user's last name. | | `active` | Yes | Drives deprovisioning and reprovisioning. | User email addresses with aliases (for example, `user+tag@company.com`) are not supported, matching the SSO requirement. ## Enabling SCIM ### Step 1: Generate the SCIM endpoint and token in MotherDuck 1. In the MotherDuck UI, click your organization name in the top left and select **Settings**. 2. Open the **Authentication** tab. 3. In the **SCIM** section, click **Enable SCIM**. 4. Confirm in the dialog. MotherDuck generates a SCIM endpoint URL and a SCIM token. The endpoint URL has this shape: ```text https://auth.motherduck.com/scim/v2/connections/ ``` :::warning The SCIM token is shown **once**. Copy it immediately and store it in your IdP — MotherDuck cannot show it again. If you lose the token, you can regenerate it (which revokes the previous token). ::: ### Step 2: Configure SCIM provisioning in your identity provider In your IdP's admin console, open the application that's linked to your MotherDuck SSO connection and turn on SCIM provisioning. The exact path depends on the IdP: - **Okta**: open the application's **Provisioning** tab and switch to **SCIM**. - **Microsoft Entra ID**: open the application's **Provisioning** blade and set the mode to **Automatic**. When prompted, supply: - **Tenant URL** (also called **SCIM endpoint URL** or **Base URL**): paste the URL from Step 1. 
- **Secret token** (also called **Bearer token**): paste the SCIM token from Step 1. Use the IdP's **Test Connection** button to verify connectivity before assigning users. ### Step 3: Map attributes in your identity provider Map your IdP's user attributes to the SCIM attributes [listed above](#attribute-mapping). Most IdPs ship a default mapping that already covers `userName`, `emails`, `name.givenName`, `name.familyName`, and `active`. ### Step 4: Assign users Assign users (or groups) to the MotherDuck application in your IdP. Each assignment triggers a SCIM `create` request, which provisions the user in MotherDuck with the **Member** role. For ongoing changes, your IdP automatically sends: - A SCIM `update` request when a user's attributes change. - A SCIM `update` or `patch` request with `active=false` when a user is deprovisioned or unassigned. - A SCIM `delete` request when the user is fully removed. ## Managing SCIM after enablement ### Regenerating the token If the SCIM token is lost or compromised, regenerate it from **Settings → Authentication → SCIM → Regenerate token**. The previous token is revoked immediately, so update the new token in your IdP right away to avoid provisioning failures. ### Disabling SCIM To stop SCIM provisioning, click **Disable SCIM** on the **Authentication** page. Disabling: - Removes the SCIM connection from MotherDuck's identity layer (your IdP can no longer make SCIM requests). - Switches auto-provisioning back to [JIT](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/#just-in-time-jit-user-provisioning). - Leaves existing user accounts untouched. Previously deprovisioned users remain deprovisioned. You can re-enable SCIM later, but a new endpoint URL and token will be issued. 
### Manual invites are disabled and JIT is replaced When SCIM is enabled, MotherDuck switches the auto-provisioning mode from JIT to SCIM and disables manual invite flows, so the IdP stays the single source of truth for who can sign in: - The **Invite** action in the org menu and on the Members page is disabled. - The **Invite policy** setting on the org details page is disabled. To grant a new user access, assign them to the MotherDuck application in your IdP. ### Deprovisioned users on the members page Deprovisioned users appear on the **Members** page with a `deprovisioned` badge. Hover the badge for a reminder that the user can no longer sign in and that all of their access tokens have been revoked. The Members page status filter lets you narrow the list to **active**, **invited**, or **deprovisioned** users. Admins can impersonate **deprovisioned users and service accounts** — useful for inspecting their data before deciding whether to reprovision or delete the account. Admins cannot impersonate active users. ## Deletion vs. deprovisioning Deprovisioning and deletion are distinct user states with different recovery semantics: | State | What happens | How to enter | How to exit | | --- | --- | --- | --- | | **Deprovisioned** | The user record and data are retained, the identity is disabled, and all PATs and short-lived tokens are revoked. The user cannot sign in. Admins can still impersonate the account. | IdP sends `active=false` (PATCH or PUT) | Reprovision the user in your IdP — works at any time, including past the deletion fail-safe window | | **Deleted** | The user is removed from the organization. Email is freed for reuse. No one can impersonate a deleted account. 
| IdP sends a SCIM `delete` request | Restore the account within the **7-day fail-safe** window through MotherDuck support | ### How deletion is triggered - **From the IdP (SCIM orgs)**: removing the user from the MotherDuck application — or deleting them from the IdP entirely — sends a SCIM `delete` event when your IdP is configured to forward delete events. SCIM-enabled organizations cannot hard-delete users from inside MotherDuck; the IdP is the only authoritative path. - **From inside MotherDuck (non-SCIM orgs only)**: the in-app **Remove member** action is available only when SCIM is disabled. :::note Some IdPs do not forward delete events to applications by default — they only mark the user inactive on their side. In that case, MotherDuck sees the inactive signal and the user appears as **deprovisioned** rather than deleted. Configure your IdP to forward delete events if you want hard deletes to flow through. ::: ### Restoring after deprovisioning or deletion - A **deprovisioned** user can be reprovisioned at any time from your IdP. The next SCIM event will restore their account. Their data is preserved, but issued access tokens are not — affected users need to mint new tokens. - A **deleted** user can be restored within the **7-day fail-safe** window by contacting [MotherDuck support](mailto:support@motherduck.com). After the window elapses, the account and its data are gone. ## Limitations - **One SCIM connection per organization**: SCIM uses the same Auth0 connection as SSO. Each MotherDuck organization can have only one SCIM connection, matching its single SSO connection. - **No role mapping yet**: SCIM-provisioned users start as **Member**. Adjust roles in MotherDuck after provisioning. - **No aliased emails**: Addresses like `user+tag@company.com` are rejected, the same as for SSO. - **PATCH `remove` operations are ignored**: MotherDuck handles SCIM `add` and `replace` operations on `active`. 
`remove` operations are logged and skipped to keep behavior predictable across IdPs. ## Related - [Setting up SSO](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/) - [Managing organizations](/docs/key-tasks/managing-organizations/) --- Source: https://motherduck.com/docs/key-tasks/sharing-data/sharing-with-users # Sharing data with specific users > Grant read access to specific users for multi-tenant applications and collaboration. MotherDuck lets you securely share data with specific users. Common scenarios include: - Building data applications, in which each tenant should only have access to their own data. - Sharing sensitive data within your Organization. - Sharing data outside of your Organization. :::note Shares are **region-scoped** based on your Organization's cloud region. Each MotherDuck Organization is scoped to a single cloud region that must be chosen at Org creation when signing up. MotherDuck is available on AWS in three regions: - **US East (N. Virginia):** `us-east-1` - **US West (Oregon):** `us-west-2` - **Europe (Frankfurt):** `eu-central-1` ::: Sharing data with individuals is easy. MotherDuck supports two approaches: - Creating a share with **Restricted** access, limiting access to a list of specified users within your organization (known as an "ACL" or "Access Control List"). - Creating a **Hidden** share and providing individuals with the share URL. ## Creating a share with restricted access (ACL) **Overview** 1. **Data provider** creates a share with **Restricted** access. 2. **Data provider** _(share owner)_ specifies which **data consumers** _(users)_ can read from the share. 3. **Data consumer** **attaches** the share. 4. **Data provider** periodically updates the share to push new data to **data consumers**. Anyone within your organization that is _not_ included in the list will **not** be able to access the share, even if they have a share link. 
### UI

Click on the "trident" next to the database you'd like to share. Select "Share".

![trident](useBaseUrl('/img/key-tasks/sharing-data/share_acl_ui.png'))

1. Optionally name the share.
2. Under "Who has access" choose "Specified users with the share link". Search for and add the users within your Organization that should have access to read the share.
3. Choose whether the share should be [automatically updated or not](../sharing-overview/#updating-shared-data). Default is `MANUAL`.
4. Create the share.
5. For the specified users, the share will appear in their UI under 'Shared with me' and can be attached.

### SQL

```sql
USE birds;

-- Create a share accessible only by organization users specified with GRANT commands
CREATE SHARE birds FROM birds (ACCESS RESTRICTED);

-- Give the users with usernames 'duck1' and 'duck2' access to the share 'birds'
GRANT READ ON SHARE birds TO duck1, duck2;
```

The **data consumer** must `ATTACH` the restricted share before querying it. See [consuming restricted shares](./#consuming-restricted-shares).

:::note
Restricted shares default to **Discoverable** visibility for users who have been granted access to the share. (Learn more about ["Discoverable shares"](../sharing-overview/#discoverable-shares)).
:::

### Consuming restricted shares

The **data consumers** in your Organization with access to the restricted share can use the UI or SQL to **attach** the share and start querying it.

### UI

1. Select the restricted share you want to attach under "Shared with me".
2. Click "attach" and optionally name the resulting database.
3. You can query the resulting database.

### SQL

Run the `ATTACH` command to attach the share as a queryable database. This is a zero-cost metadata-only operation.

```sql
-- Creates a zero-copy clone database called birds
ATTACH 'md:_share/birds/e9ads7-dfr32-41b4-a230-bsadgfdg32tfa';
```

Learn more about [ATTACH](/sql-reference/motherduck-sql-reference/attach.md).
### Modifying share access **Data providers** _(share owners)_ can modify which users within your Organization have access to the share. ### UI 1. Find the target share in the "Shares I've created" section of the Object Explorer, and choose the 'Alter' option from the context menu. 2. From here, you can add and remove users with access to the share. 3. You may also alter the share to use a different **access** scope. Learn more about [share access scopes](../sharing-overview/#organization-shares). For more details on how to configure access controls for restricted shares, see the [`GRANT READ ON SHARE` reference page](/sql-reference/motherduck-sql-reference/grant-access/). ### SQL ```sql GRANT READ ON SHARE birds TO duck3; -- Gives the user with username 'duck3' access to the share 'birds' REVOKE READ ON SHARE birds FROM penguin; -- Revokes access to the share 'birds' from the user with username 'penguin' ``` For more details on configuring access controls for restricted shares, see the [`GRANT READ ON SHARE` reference page](/sql-reference/motherduck-sql-reference/grant-access/). ## Creating hidden shares **Overview** 1. **Data provider** creates the share URL and passes this URL to the **data consumer**. 2. **Data consumer** **attaches** the share. 3. **Data provider** periodically updates the share to push new data to **data consumers**. To share a database, first create a Hidden share. No actual data is copied and no additional costs are incurred in this process. ### UI Click on the "trident" next to the database you'd like to share. Select "share". ![trident](useBaseUrl('/img/key-tasks/sharing-data/ui-share3.png')) 1. Optionally name the share. 2. To share the data with MotherDuck users inside or outside of your Organization, choose the "Anyone with the share link" option. This will enable anyone with the share link in the same cloud region to attach and query the share, including users outside your Organization. 3. Create the share. 4. 
Copy the resulting **ATTACH** command to your clipboard and send it to your **data consumers**.

### SQL

```sql
USE birds;

-- Create a Hidden share accessible by anyone with the share link in the same cloud region,
-- including users outside your Organization
CREATE SHARE birds FROM birds (ACCESS UNRESTRICTED, VISIBILITY HIDDEN);
-- > md:_share/birds/e9ads7-dfr32-41b4-a230-bsadgfdg32tfa
```

Save the returned share URL and pass it to **data consumers**.

### Consuming hidden shares

The **data consumer** can use SQL to attach the share and start querying it.

### SQL

Run the `ATTACH` command to attach the share as a queryable database. This is a zero-cost metadata-only operation.

```sql
-- Creates a zero-copy clone database called birds
ATTACH 'md:_share/birds/e9ads7-dfr32-41b4-a230-bsadgfdg32tfa';
```

Learn more about [ATTACH](/sql-reference/motherduck-sql-reference/attach.md).

## Updating shared data

If the **data provider** chose automatic updates when creating the share, the share is updated periodically. If the share was created with `MANUAL` updates, the **data provider** needs to update the share manually.

```sql
UPDATE SHARE birds;
```

Learn more about [UPDATE SHARE](/sql-reference/motherduck-sql-reference/update-share.md) and [data replication timing and checkpoints](./updating-shares.md).

---
Source: https://motherduck.com/docs/key-tasks/database-operations/switching-the-current-database

# Switching the current database

> Change the active database and schema context using USE statements.
The examples below show how to check the current (active) database and schema and how to switch between databases and schemas:

### CLI

```sql
-- check your current database
SELECT current_database();
dbname

-- list all tables in the current database
SHOW TABLES;
table1
table2

-- list all databases
SHOW DATABASES;
dbname
dbname2

-- switch to database named 'dbname2'
USE dbname2;

-- verify that you've successfully switched databases
SELECT current_database();
dbname2

-- check your current schema
SELECT current_schema();
main

-- list all schemas across all databases
SELECT * FROM duckdb_schemas();
```

| oid | database_name | database_oid | schema_name | internal | sql |
|------|---------------|--------------|--------------------|----------|------|
| 986 | my_db | 989 | information_schema | true | NULL |
| 974 | my_db | 989 | main | false | NULL |
| 972 | my_db | 989 | my_schema | false | NULL |
| 987 | my_db | 989 | pg_catalog | true | NULL |
| 1508 | system | 0 | information_schema | true | NULL |
| 0 | system | 0 | main | true | NULL |
| 1509 | system | 0 | pg_catalog | true | NULL |
| 1510 | temp | 1453 | information_schema | true | NULL |
| 1454 | temp | 1453 | main | true | NULL |
| 1511 | temp | 1453 | pg_catalog | true | NULL |

```sql
-- switch to schema my_schema within the same database
USE my_schema;

-- verify that you've successfully switched schemas
SELECT current_schema();
my_schema

-- switch to database my_db and schema main
USE my_db.main;

-- verify that both the database and schema have been changed
SELECT current_database(), current_schema();
```

| current_database() | current_schema() |
|--------------------|------------------|
| my_db | main |

---
Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/dives/theming-and-styling-dives

# Theming and styling your Dives

> Control the visual appearance of your Dives with theme definitions, chart selection, and interactive filters

When you create a Dive, you can go beyond the default look
and feel. By providing a **theme definition** in your prompt, you control colors, typography, chart types, and interaction patterns — turning a basic visualization into a polished, branded data experience. This guide covers how to structure a theme prompt, pick the right chart types for your data, and add interactivity through filters and cross-filtering. You can explore and play with themed Dives in the [live theme gallery](https://duck-dives.vercel.app/snippets/galactic-coffee-theme-gallery), or browse our [curated theme gallery](/key-tasks/ai-and-motherduck/dives/dive-theme-gallery/) with screenshots and ready-to-copy prompts. ## How theming works in Dives A Dive is a React component that renders charts using [Recharts](https://recharts.org/) and queries live MotherDuck data through `useSQLQuery`. When you describe a visual style in your prompt, the AI agent translates it into: - A **color palette** (background, text, muted, and chart colors) - **Typography** (font family, title weight, text transform) - **Chart configuration** (grid lines, stroke width, curve type, bar radius) - **Layout** (grid columns, spacing, card styling) You don't need to write any code — describe the style and the agent handles the implementation. ## Writing a theme prompt A good theme prompt has four parts: **colors**, **typography**, **chart rules**, and **feel**. Here's an example that produces a Financial Times-inspired Dive: ```text Create a Dive with an FT Salmon style. Inspired by: Financial Times Visual Journalism. Visual rules: - Background: #FFF1E5 (signature salmon). Text: #33302E. Muted: #807973. - Chart colors: ["#0F5499", "#990F3D", "#FF7FAA", "#00A0DD"]. - Font: Georgia, serif. Titles: semibold. - Interactive: year & metric toggles, click-to-filter cross-filtering. Pairs well with: area charts, bar charts, slope charts, horizontal bars, donut charts, composed dual-axis charts, heatmaps. Feel: Financial authority — the pink paper, digitized. 
``` ### What to include in your prompt | Section | What to specify | Example | |---------|----------------|---------| | Colors | Background, text, muted accent, 3-5 chart colors | `Background: #0d1117. Chart colors: ["#58a6ff", "#3fb950"]` | | Typography | Font family, title weight, text transform | `Font: Georgia, serif. Titles: bold, UPPERCASE` | | Chart rules | Grid lines, stroke width, curve type, bar radius | `No gridlines, 1.5px strokes, linear interpolation` | | Chart types | Which charts to include | `Pairs well with: area charts, bar charts, heatmaps` | | Interactivity | Filters and cross-filtering behavior | `Interactive: year toggle, metric toggle, click-to-filter` | | Feel | One-line mood descriptor | `Feel: Midnight studio — data glowing in the dark` | ### Tips for effective theme prompts **Reference real-world styles.** Naming a specific design tradition helps the agent make consistent decisions. "Tufte minimal" or "Neon 80s synthwave" gives more coherent results than listing individual properties. **Specify chart colors as an array.** Providing 3-5 hex colors as a JSON array (for example, `["#2563eb", "#16a34a", "#dc2626"]`) gives the agent an explicit palette instead of leaving it to guess. **Pick colors that work in charts, not just colors that look nice together.** General-purpose palette generators often produce colors that clash or become indistinguishable when applied to bars, lines, and slices. Use tools designed for data visualization: - [ColorBrewer 2.0](https://colorbrewer2.org/) — the gold standard for cartography and charts. Pick sequential, diverging, or qualitative palettes and get hex values ready to paste. Every palette is tested for perceptual uniformity and colorblind safety. - [Viz Palette](https://projects.susielu.com/viz-palette) — paste your candidate colors and preview them on actual chart types (bars, lines, scatter). It flags pairs that are too similar or hard to distinguish with color vision deficiencies. 
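A quick numeric check can complement those tools. The sketch below computes the WCAG 2.x contrast ratio between each chart color and the background, using the FT Salmon palette from the example prompt earlier in this guide. The 3:1 threshold is the WCAG guideline for graphical objects; the helper names are illustrative, and the check says nothing about colorblind safety or how distinguishable the chart colors are from each other.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of an sRGB color given as '#RRGGBB'."""
    digits = hex_color.lstrip("#")
    channels = [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Convert gamma-encoded sRGB channels to linear light.
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4 for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 (identical) to 21.0 (black on white)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Background and chart colors taken from the FT Salmon example prompt.
background = "#FFF1E5"
palette = ["#0F5499", "#990F3D", "#FF7FAA", "#00A0DD"]
for color in palette:
    ratio = contrast_ratio(color, background)
    flag = "ok" if ratio >= 3.0 else "low contrast"
    print(f"{color} on {background}: {ratio:.2f}:1 ({flag})")
```

Colors flagged as low contrast can still work for large filled areas, but they are worth a second look for thin lines and small marks.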
As a rule of thumb, limit your palette to 5-7 chart colors. More than that and the colors start blending together, especially in legends. If you have more categories than colors, consider grouping smaller categories into an "Other" bucket. **Mention the "feel" in one sentence.** This guides the agent on ambiguous decisions like spacing, border radius, and animation. "Sugar rush — joyful and bold" produces different results than "Quiet authority — the data speaks for itself." ## Choosing chart types Different chart types serve different purposes. When building a Dive with multiple charts, pick a mix that covers different analytical angles of your data. ### Chart type reference | Chart type | Best for | Data shape | |------------|----------|------------| | Line chart | Trends over time | Time series | | Area chart | Volume over time, part-to-whole trends | Time series | | Bar chart | Comparing categories | Categorical | | Horizontal bar | Ranked lists, long category names | Categorical, sorted | | Stacked area | Composition over time | Multi-series time | | Composed chart (bar + line) | Dual metrics on shared timeline | Time series, two metrics | | Heatmap | Density across two dimensions | Matrix (for example, station x month) | | Pie / donut | Part-to-whole — ideally aim for 2 or 3 slices, max 5. A horizontal bar or donut is almost always easier to read. If you still want a pie chart, label slices directly. | Categorical, proportional | | Radar | Multi-dimensional profile comparison | Categorical, normalized | | Scatter | Correlation between two measures | Two continuous variables | | Table | Exact values, detailed comparison | Any structured data | ### Chart pairing recommendations A 6-chart grid works well with this pattern: 1. **Trend chart** (line, area, or stepped line) — shows how metrics move over time 2. **Comparison chart** (bar or horizontal bar) — ranks categories side by side 3. 
**Composition chart** (pie, donut, or stacked area) — shows part-to-whole relationships 4. **Detail view** (table or direct-labeled bars) — provides exact values 5. **Dual-axis chart** (composed bar + line) — overlays two related metrics 6. **Density chart** (heatmap or scatter) — reveals patterns across dimensions This mix gives viewers both the big picture and the ability to drill into specifics. ## Adding interactivity Interactive filters make a Dive more useful than a static dashboard. You can ask for several types of interactivity in your prompt. ### Time filters Time filters are the most common interactive control. Two patterns work well depending on your data: **Relative time windows** work best for operational data that updates continuously — think logs, events, or transactions. Users care about what happened in the last few hours or days, not a specific calendar year: ```text Add time filter pills: Last 24h | Last 7 days | Last 30 days | Last 90 days | All time. Filter all charts when a time range is selected. Default to Last 30 days. ``` **Year or period toggles** work better for data with natural calendar boundaries — annual reports, quarterly metrics, or fiscal comparisons: ```text Add year toggle pills: 2024 | 2025 | All. Filter all charts when a year is selected. ``` Pick whichever pattern matches how your users think about the data. If they ask "what happened this week?" go with relative windows. If they ask "how did Q4 compare to Q3?" go with period toggles. ### Metric toggles Let users switch which measure the charts display: ```text Add a metric toggle between Revenue and Cups Sold. The hero KPI and all chart Y-axes should update when toggled. ``` This changes the `dataKey` used by line, area, and bar charts, and swaps which metric appears as the primary KPI. ### Cross-filtering with click interactions Cross-filtering means clicking an element in one chart filters every other chart in the Dive. 
This is different from putting a filter dropdown on each individual chart — and the difference matters. **Why cross-filtering over individual filters?** When each chart has its own filter controls, users end up in a state where Chart A shows "US only," Chart B shows "all regions," and Chart C shows "Europe." The charts look coherent but they're answering different questions, and comparing them leads to wrong conclusions. Cross-filtering avoids this by keeping every chart in sync: click "US" on any chart and the entire Dive updates to show the US view. The user always sees one consistent story across all charts. **When individual filters make sense.** There are cases where a per-chart filter is the right choice — when a chart has a dimension that doesn't exist in the other charts. For example, a chart showing data broken down by warehouse location doesn't need to cross-filter a chart that doesn't have a warehouse column. In that case, a local filter on just that chart is appropriate. A good rule of thumb: use cross-filtering for shared dimensions (time, region, product category) and individual filters for dimensions unique to a single chart. 
Enable cross-filtering in your prompt: ```text Add click-to-filter cross-filtering: - Click a bar in the station chart to filter by that station - Click a pie slice to filter by that coffee type - Non-selected items render at 30% opacity - Show dismissible filter pills when filters are active ``` Cross-filtering works best when: - **Bar charts** filter on their categorical axis (for example, clicking a station bar filters by station) - **Pie and donut charts** filter on slice category (for example, clicking a product slice filters by product) - **Unselected items** dim to 30% opacity rather than disappearing, so users keep the full context while focusing on a subset - **Filter pills** appear below the controls showing active filters with a dismiss button ### Filter pills When cross-filters are active, visible pills show what's filtered and let users clear filters with one click: ```text Show active filters as colored pills with ✕ dismiss buttons. Only show the pills row when filters are active. ``` ### Tooltips and accordions Interactive Dives let you keep the visual layout clean while still providing rich context. Move descriptions, methodology notes, and supporting text into **tooltips** and **accordions** so they're available on demand without cluttering the charts: ```text Add an info tooltip on each chart title that explains the metric. Add an expandable accordion below the charts with methodology notes. ``` This works well for Dives shared with a broad audience — power users can expand the details, while casual viewers get an uncluttered experience. ## Laying out a multi-chart Dive For Dives with multiple charts, specify the grid layout in your prompt: ```text Use a 3×2 grid layout (3 columns, 2 rows) with 6 charts. Each chart card should have a title, subtle border, and 160px chart height. 
``` Common layouts: | Charts | Layout | Use case | |--------|--------|----------| | 2-4 | `repeat(2, 1fr)` | Focused analysis, fewer metrics | | 5-6 | `repeat(3, 1fr)` | Dashboard-style overview | | 8+ | `repeat(4, 1fr)` | Small multiples, sparkline grids | ## Example: full theme prompt Here's a complete prompt that produces a themed, interactive Dive: ```text Create a Dive showing sales data from my galactic_coffee database. Theme: Corporate Dashboard - Background: #f5f5f5. Text: #333. Muted: #777. - Chart colors: ["#2563eb", "#16a34a", "#dc2626", "#f59e0b", "#8b5cf6"]. - Font: system-ui, sans-serif. Titles: semibold, UPPERCASE. - Layout: 3×2 grid with card borders and 8px border radius. Charts: 1. Line chart — Revenue trend over time 2. Pie chart — Product mix breakdown 3. Table — Station performance details 4. Bar chart — Station comparison 5. Composed chart — Revenue bars + Cups sold line (dual Y-axis) 6. Heatmap — Station × Month revenue density Interactivity: - Year toggle: 2024 | 2025 | All - Metric toggle: Revenue | Cups - Click a bar to filter by station, click a pie slice to filter by product - Show filter pills with ✕ dismiss when filters are active KPIs: Show total revenue, total cups sold, and average rating above the charts. 
``` ## Related resources - [Dive theme gallery](/key-tasks/ai-and-motherduck/dives/dive-theme-gallery/) — Screenshots and ready-to-copy prompts for 15 themes - [Creating Visualizations with Dives](/key-tasks/ai-and-motherduck/dives/) — Get started with your first Dive - [Managing Dives as code](/key-tasks/ai-and-motherduck/dives/managing-dives-as-code/) — Version control and CI/CD for Dives - [Dives SQL functions](/sql-reference/motherduck-sql-reference/ai-functions/dives/) — Manage Dives directly from SQL - [MCP Server tools](/sql-reference/mcp/) — Reference for all MCP tools including Dive operations --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/dives/dive-theme-gallery # Dive theme gallery > Ready-to-use theme prompts for Dives with screenshots showing each style applied to the same dataset Dives give you unlimited abilities in creating visualizations, but that does not automatically mean *good* visualizations. Use the following themes to guide your AI agent to learn from decades of experienced, excellent data visualizers. Pick a theme, copy the prompt, and paste it into your AI agent alongside your data question. The live theme gallery Dive below lets you switch between all 15 themes interactively. ## Tufte Minimal Inspired by Edward Tufte, *The Visual Display of Quantitative Information* (1983). ![A Dive styled with the Tufte Minimal theme showing monochrome charts with generous whitespace and no gridlines](./img/theme_gallery_tufftle_minimal.png) ```text Create a Dive with a Tufte Minimal style. Inspired by: Edward Tufte, The Visual Display of Quantitative Information (1983). Visual rules: - Background: #FFFFF8. Text: #111. Muted: #666. - Chart colors: monochrome ["#111","#666","#999"]. - Font: Georgia, serif. Titles: normal weight, no transform. - Layout: generous whitespace, no gridlines, no chart borders. - Charts: no gridlines, thin strokes (1.5px), linear interpolation. - Direct labeling instead of legends. 
Small multiples preferred. - Interactive: year toggle, metric toggle, click-to-filter on bars/pies. Pairs well with: small multiples, sparklines, scatter plots, slope charts, direct-labeled values, heatmaps, composed dual-axis charts. Avoid: pie charts, 3D charts, heavy gridlines. Feel: Quiet authority — the data speaks for itself. ``` ## Ink & Paper Inspired by the New York Times Graphics Desk. ![A Dive styled with the Ink and Paper theme showing clean left-aligned charts with subtle gridlines](./img/theme_gallery_ink_and_paper.png) ```text Create a Dive with an Ink & Paper style. Inspired by: New York Times Graphics Desk. Visual rules: - Background: #fff. Text: #121212. Muted: #666. - Chart colors: ["#326fa8","#e15759","#59a14f","#edc949","#af7aa1"]. - Font: Georgia, serif. Titles: bold. - Layout: clean, left-aligned, subtle gridlines. - Charts: light gridlines, 2px strokes, linear interpolation. - Interactive: year toggle, metric toggle, click-to-filter cross-filtering. Pairs well with: annotated line charts, bar charts, horizontal bars, step charts, small multiples, tables, composed dual-axis charts, heatmaps. Feel: Authoritative journalism — clarity above all. ``` ## Corporate Dashboard Inspired by classic BI tools (Tableau, Power BI). ![A Dive styled with the Corporate Dashboard theme showing card-based charts with structured grid and uppercase titles](./img/theme_gallery_corporate_dashboard.png) ```text Create a Dive with a Corporate Dashboard style. Inspired by: Classic BI tools (Tableau, Power BI). Visual rules: - Background: #f5f5f5. Text: #333. Muted: #777. - Chart colors: ["#2563eb","#16a34a","#dc2626","#f59e0b","#8b5cf6"]. - Font: system-ui, sans-serif. Titles: semibold, UPPERCASE. - Layout: card-based, subtle borders, structured grid. - Interactive: year & metric toggles, click-to-filter cross-filtering. Pairs well with: line charts, pie charts, KPI cards, data tables, bar charts, combo charts, heatmaps. 
Feel: Boardroom-ready — structured and professional. ``` ## FT Salmon Inspired by Financial Times Visual Journalism. ![A Dive styled with the FT Salmon theme showing charts on a signature salmon background with serif typography](./img/theme_gallery_ft_salmon.png) ```text Create a Dive with an FT Salmon style. Inspired by: Financial Times Visual Journalism. Visual rules: - Background: #FFF1E5 (signature salmon). Text: #33302E. Muted: #807973. - Chart colors: ["#0F5499","#990F3D","#FF7FAA","#00A0DD"]. - Font: Georgia, serif. Titles: semibold. - Interactive: year & metric toggles, click-to-filter cross-filtering. Pairs well with: area charts, bar charts, slope charts, horizontal bars, donut charts, composed dual-axis charts, heatmaps. Feel: Financial authority — the pink paper, digitized. ``` ## Soft Infographic Inspired by David McCandless, *Information is Beautiful*. ![A Dive styled with the Soft Infographic theme showing rounded bar charts and pastel colors on a light background](./img/theme_gallery_soft_infographic.png) ```text Create a Dive with a Soft Infographic style. Inspired by: David McCandless, Information is Beautiful. Visual rules: - Background: #fafafa. Text: #2d2d2d. Muted: #888. - Chart colors: ["#FF6B6B","#4ECDC4","#45B7D1","#FFA07A","#98D8C8"]. - Font: system-ui, sans-serif. Titles: bold. - Charts: rounded bars (8px radius), smooth curves. - Interactive: year & metric toggles, click-to-filter cross-filtering. Pairs well with: rounded bar charts, donut charts, line charts, radar charts, composed charts, heatmaps. Feel: Friendly and approachable — data for everyone. ``` ## Du Bois Inspired by W.E.B. Du Bois, Paris Exposition (1900). ![A Dive styled with the Du Bois theme showing bold horizontal bars on a parchment background with crimson and gold accents](./img/theme_gallery_dubois.png) ```text Create a Dive with a Du Bois style. Inspired by: W.E.B. Du Bois, Paris Exposition (1900). Visual rules: - Background: #e8d4b8 (parchment). Text: #1a1a1a. 
Muted: #654321. - Chart colors: ["#dc143c","#228b22","#000","#ffd700","#654321"]. - Charts: horizontal bars, no gridlines, sharp edges (0 radius). - Interactive: year & metric toggles, click-to-filter cross-filtering. Pairs well with: horizontal bar charts, pie charts, heatmaps, composed dual-axis charts. Feel: Bold proclamation — data as civil rights evidence. ``` ## More themes The live gallery includes 9 additional themes you can explore and copy: | Theme | Category | Feel | |-------|----------|------| | Knowledge Beautiful | Modern | Dense and layered — every pixel earns its place | | Film Flowers | Artistic | Organic and poetic — data as a living garden | | Dark Canvas | Modern | Midnight studio — data glowing in the dark | | Playful Sketch | Artistic | Personal and intimate — a handwritten letter in data | | Neon 80s | Fun | Arcade at midnight — data goes synthwave | | Pirate Map | Fun | X marks the data — adventure on the high seas | | Vaporwave | Fun | Digital sunset — nostalgia rendered in pastel neon | | Terminal | Fun | `> data.query --style=hacker` — pure terminal vibes | | Candy Pop | Fun | Sugar rush — joyful, bold, unapologetically fun | Explore all 15 themes in the dive: ## Using a gallery prompt with your own data These prompts are designed to be mixed with your data question. Replace the dataset-specific parts and keep the visual rules: ```text Create a Dive showing monthly active users from my analytics database. Theme: FT Salmon - Background: #FFF1E5 (signature salmon). Text: #33302E. Muted: #807973. - Chart colors: ["#0F5499","#990F3D","#FF7FAA","#00A0DD"]. - Font: Georgia, serif. Titles: semibold. - Interactive: time filter (Last 7 days | Last 30 days | Last 90 days | All time), click-to-filter cross-filtering. Charts: 1. Area chart — DAU trend over time 2. Bar chart — Users by country 3. Donut — Traffic source breakdown 4. Table — Top pages by session count 5. Composed chart — Sessions bars + Bounce rate line (dual Y-axis) 6. 
Heatmap — Country × Day of week activity ``` For more on structuring theme prompts, see [Theming and styling your Dives](/key-tasks/ai-and-motherduck/dives/theming-and-styling-dives/). ## Related resources - [Theming and styling your Dives](/key-tasks/ai-and-motherduck/dives/theming-and-styling-dives/) — How to write theme prompts, pick chart types, and add interactivity - [Creating Visualizations with Dives](/key-tasks/ai-and-motherduck/dives/) — Get started with your first Dive - [Managing Dives as code](/key-tasks/ai-and-motherduck/dives/managing-dives-as-code/) — Version control and CI/CD for Dives --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/nodejs # Connect from Node.js via Postgres endpoint > Connect to MotherDuck from Node.js using the pg (node-postgres) library via the Postgres wire protocol You can query MotherDuck from Node.js using [node-postgres](https://node-postgres.com/) (`pg`) — no DuckDB installation required. For connection parameters, SSL options, and limitations, see the [Postgres Endpoint reference](/sql-reference/postgres-endpoint). ## Prerequisites You'll need a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck). Set it as an environment variable: ```bash export MOTHERDUCK_TOKEN="your_token_here" ``` Install the `pg` package: ```bash npm install pg ``` ## Connect Use a configuration object to connect. Do **not** pass `sslrootcert=system` in a connection string — node-postgres tries to read `system` as a file path and throws an `ENOENT` error. 
```js import pg from "pg"; const client = new pg.Client({ host: "pg.us-east-1-aws.motherduck.com", port: 5432, user: "postgres", password: process.env.MOTHERDUCK_TOKEN, database: "md:", ssl: { rejectUnauthorized: true }, }); await client.connect(); const { rows } = await client.query( "SELECT title, score FROM sample_data.hn.hacker_news WHERE type='story' LIMIT 10" ); console.log(rows); await client.end(); ``` ## SSL notes By default, Node.js verifies TLS certificates against its bundled set of trusted root CAs (derived from Mozilla's CA store) rather than the operating system's certificate store. Setting `ssl: { rejectUnauthorized: true }` tells node-postgres to use TLS and verify the server certificate against these trusted roots, which is roughly the equivalent of `sslmode=verify-full` with `sslrootcert=system` in libpq. If you need to specify a custom CA certificate (for example, the [ISRG Root X1](https://letsencrypt.org/certs/isrgrootx1.pem) certificate from Let's Encrypt): ```js import fs from "fs"; import pg from "pg"; const client = new pg.Client({ host: "pg.us-east-1-aws.motherduck.com", port: 5432, user: "postgres", password: process.env.MOTHERDUCK_TOKEN, database: "md:", ssl: { rejectUnauthorized: true, ca: fs.readFileSync("/path/to/isrgrootx1.pem").toString(), }, }); ``` For more details on SSL options, see [SSL and certificate verification](/sql-reference/postgres-endpoint#ssl-and-certificate-verification). :::info Cloudflare Workers Cloudflare Workers use a different socket implementation (`pg-cloudflare`) that handles SSL differently. See [Connect from Cloudflare Workers](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/cloudflare-workers) for Workers-specific setup. ::: --- Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-duckdb-database # Load a DuckDB database into MotherDuck > Upload a local DuckDB database file to MotherDuck cloud storage. MotherDuck supports uploading local DuckDB databases to the cloud with the [CREATE DATABASE](/sql-reference/motherduck-sql-reference/create-database.md) statement. 
### CLI To create a remote database from the currently active local database, execute the following command: ```sql CREATE OR REPLACE DATABASE remote_database_name FROM CURRENT_DATABASE(); ``` To upload an attached local DuckDB database, execute the following commands: ```sql ATTACH '/path/to/local/database.ddb' AS local_db_name; ATTACH 'md:'; CREATE OR REPLACE DATABASE remote_database_name FROM local_db_name; ``` To upload a DuckDB file on disk: ```sql ATTACH 'md:'; CREATE OR REPLACE DATABASE remote_database_name FROM '/path/to/local/database.ddb'; ``` Here's a full end-to-end example: ```sql -- Let's generate some data based on the tpch extension (DuckDB autoloads it). -- This will create a couple of tables in the current database. CALL dbgen(sf=0.1); -- Connect to MotherDuck ATTACH 'md:'; CREATE OR REPLACE DATABASE remote_tpch FROM CURRENT_DATABASE(); ``` :::note Uploading a database does not alter the context: you are still in the local context after the upload, and subsequent queries run locally. ::: --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/dives/managing-dives-as-code # Managing Dives as Code > Set up a Git-based workflow for developing, previewing, and deploying Dives with GitHub Actions and Claude Code Creating Dives through an AI agent is fast, but as your team relies on them for decision-making, you may want the same rigor you apply to production code: version history, code review, and automated deployments. Since Dives are React components and SQL queries under the hood, you can manage them with Git and CI/CD — just like the rest of your codebase. This guide walks through setting up that workflow: local development with hot reload, PR-based preview deployments, and automated production updates on merge. A [starter repo](https://github.com/motherduckdb/blessed-dives-example) with GitHub Actions pipelines is ready to fork and use. 
## Quick start Fork the [starter repo](https://github.com/motherduckdb/blessed-dives-example) to get up and running immediately. It includes: - A working example Dive - The Vite preview setup for local development - GitHub Actions for deploy and cleanup - A `CLAUDE.md` that teaches the agent the repo conventions Fork the repo, set a `MOTHERDUCK_TOKEN` secret, and you're deploying Dives on merge. ## Prerequisites - A [MotherDuck account](https://app.motherduck.com/) with at least one Dive already published - A GitHub repository to store your Dive source files (or fork the [starter repo](https://github.com/motherduckdb/blessed-dives-example)) - [Claude Code](https://docs.anthropic.com/en/docs/build-with-claude/claude-code/overview) connected to the [MotherDuck MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) - A MotherDuck API token set as a GitHub secret (`MOTHERDUCK_TOKEN`) ## Pull a dive for local development Start with a Dive that's already published in MotherDuck. Copy its share link from the MotherDuck UI, then tell Claude Code to set it up locally: ```text Set up this dive for local development: https://app.motherduck.com/dives/... ``` The agent uses the MotherDuck MCP Server to: 1. Read the Dive source through the SQL API using the share link 2. Pull down the file into a local directory in your repo 3. Register the Dive for CI 4. Start a lightweight Vite development server for live preview The MCP Server's `get_dive_guide` tool provides the agent with everything it needs — the React component contract, dependency setup, and instructions for the local dev server. No additional skills or context files are required beyond what the MCP server provides. ![Claude Code spinning up the Vite dev server after pulling down a Dive for local development.](./img/claude_code_vite_terminal_1ffa9f80a9.png) ## Edit locally with an AI agent With the local dev server running, you can iterate on the Dive using Claude Code. 
The agent can restyle charts, rewrite SQL queries, add filters, swap visualizations — anything you can express as a prompt. ```text Make this much better visually. Top-tier style please. ``` The Vite dev server hot-reloads changes, so you see updates instantly in the browser. The MCP server provides schema context so the agent writes accurate SQL against your live data. ![A Dive running locally, showing the updated dashboard with improved styling and layout.](./img/dive_local_preview_9bccdb19bf.png) If your repo includes a `CLAUDE.md` file (the [starter repo](https://github.com/motherduckdb/blessed-dives-example) includes one), the agent also knows the folder conventions and how to register new Dives for CI — so you can go from "pull this Dive down" to "push up a PR" without explaining any plumbing. ## Deploy a preview with GitHub Actions Once you're happy with your changes, tell the agent to push a PR: ```text Put up a PR on a new feature branch ``` When a PR is opened (or updated with new commits), a GitHub Action detects which Dive folders changed and deploys a **preview** Dive to MotherDuck. The preview uses the same live environment as production but has a branch-tagged title so it's clearly labeled. A comment appears on the PR with a direct link. ![A GitHub Actions bot comment on a PR showing a preview Dive link — click Open Dive to see it live in MotherDuck.](./img/pr_preview_comment_13ca302ff9.png) Your reviewer clicks the link and sees the Dive running with live queries — no local setup needed. The deploy action uses path filters to detect which Dive folders changed, then calls a shared deploy script (`scripts/deploy-dive.sh`) for each one. The script reads the Dive's source and metadata, and uses the DuckDB CLI with the MotherDuck extension to create or update the Dive. ## Merge to production When the preview looks right, merge the PR. A separate deploy job then creates or updates the production Dive, matched by title. 
The production Dive is now live and shareable with anyone in your organization. ![The deploy GitHub Action after a merge to main, completing in 20 seconds.](./img/deploy_action_success_f763894ae0.png) ## Clean up preview dives Delete the feature branch after merging. A cleanup action fires that removes the preview Dive from your MotherDuck account — no orphaned Dives cluttering your workspace. The entire pipeline is two GitHub Actions and one secret (`MOTHERDUCK_TOKEN`). At MotherDuck, we use a dedicated service account so anyone with repo access can edit and deploy with the same ownership scope. ## Related resources - [Creating Visualizations with Dives](/key-tasks/ai-and-motherduck/dives/) — Create Dives from natural language with AI agents - [Dives SQL Functions](/sql-reference/motherduck-sql-reference/ai-functions/dives/) — Manage Dives directly from SQL - [Connect to MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) — Set up the MCP server with your AI assistant - [Starter repo](https://github.com/motherduckdb/blessed-dives-example) — Fork and start deploying --- Source: https://motherduck.com/docs/key-tasks/sharing-data/managing-shares # Managing shares > View share details, modify permissions, and manage shared database access. ## Getting details about a share You can learn more about a specific share that you've created by using the [`DESCRIBE SHARE`](/sql-reference/motherduck-sql-reference/describe-share.md) command. For example: ### SQL ```sql -- if you are the share owner, use the share name DESCRIBE SHARE "duckshare"; -- if you are the share viewer, use the full url DESCRIBE SHARE "md:_share/sample_data/23b0d623-1361-421d-ae77-62d701d471e6"; ``` ### UI In the UI you can roll over a share to see a tooltip that tells you the share owner, when it was last updated, and access scope. ## Listing Shares You can list the shares you have created via the [`LIST SHARES`](/sql-reference/motherduck-sql-reference/list-shares.md) statement. 
For example: ### SQL ```sql LIST SHARES; ``` ### UI 1. You can see shares that you've created under "Shares I've created". 2. You can find **Discoverable** **Organization** shares that members of your Organization created under "Shared with me". To view the URLs of shares created by others that you have currently attached, use the [`SHOW ALL DATABASES`](/sql-reference/motherduck-sql-reference/show-databases/) command. The `fully_qualified_name` column gives you the share URL of the attached share. ## Deleting a share Shares can be deleted with the [`DROP SHARE`](/sql-reference/motherduck-sql-reference/drop-share.md) or `DROP SHARE IF EXISTS` statement. Users who have [`ATTACH`](/sql-reference/motherduck-sql-reference/attach.md)-ed the share will lose access. For example: ### SQL ```sql DROP SHARE "share1"; ``` ### UI 1. Roll over the share you'd like to delete. 2. Click on the "trident" on the right side. 3. Select "Drop". 4. Confirm. ## Updating a share Sharing a database creates a point-in-time snapshot of the database at the time it is shared. To publish changes, you need to explicitly run `UPDATE SHARE <share name>`. When updating a `SHARE` with the same database, the URL does not change. ### SQL ```sql UPDATE SHARE <share name>; ``` In the following example, the database 'mydb' was previously shared by creating a share 'myshare', and 'mydb' has been updated since. The owner of the database would like their colleagues to receive the new version: ### SQL ```sql -- 'myshare' was previously created on the database 'mydb' UPDATE SHARE "myshare"; ``` If you lost your database share URL, you can use the `LIST SHARES` command to list all your shares or `DESCRIBE SHARE <share name>` to get specific details about a given share. ## Editing/Altering a share You can change the configuration of shares you've created in the UI. The SQL operation `ALTER SHARE` is in the works. ### UI 1. Roll over the share you'd like to edit. 2. Click on the "trident" on the right side. 3. Select "Alter". 4. 
Change the share configuration as you see fit. 5. Confirm "Alter share". **Error handling:** If you don't see the trident icon, you may not have permission to edit this share. --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/ai-features-in-ui # AI Features in the MotherDuck UI > Use AI-powered SQL editing, FixUp, and natural language queries in the MotherDuck web interface. :::tip Quick overview For a hands-on walkthrough of FixIt and Edit in the web UI, see the [Web UI guide](/getting-started/interfaces/motherduck-quick-tour/#fix-errors-and-edit-queries-with-ai). ::: ## Automatically Edit SQL Queries in the MotherDuck UI Edit is a MotherDuck AI-powered feature that lets you edit SQL queries in the MotherDuck UI. The AI is aware of DuckDB-specific SQL features and relevant database schemas to provide effective suggestions. Select the specific part of the query you want to edit, then press the keyboard shortcut to open the Edit dialog: * Windows/Linux: `Ctrl + Shift + E` * macOS: `⌘ + Shift + E` In the Edit dialog, enter your prompt (e.g., "extract the domain from the url, using a regex") and click Suggest edit. ![Edit](../img/edit-prompt.png) If the suggestion is not what you wanted, you can refine it with follow-up prompts. ![Edit](../img/edit-follow-up.png) When you're happy with the change, click 'Apply edit' to apply it to the query. ![Edit](../img/edit-follow-up-2.png) ## Automatically Fix SQL Errors in the MotherDuck UI FixIt is a MotherDuck AI-powered feature that helps you resolve common SQL errors by offering fixes in-line. Read more about it in our [blog post](https://motherduck.com/blog/introducing-fixit-ai-sql-error-fixer/). FixIt can also be called programmatically using the `prompt_fix_line` function. Find more information in the [prompt_fix_line documentation](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-fix-line). ### How FixIt works By default, FixIt is enabled for all users. 
If you run a query that has an error, FixIt will automatically analyze the query and suggest in-line fixes. When accepting a fix, MotherDuck will automatically update your query and re-execute it. ![FixIt](../img/fixit-suggestion.png) When 'Auto-suggest' is un-toggled, FixIt will not automatically suggest fixes anymore. FixIt can still be manually triggered by clicking 'Suggest fix' at the bottom of the error message. ![FixIt](../img/fixit-manual-suggestion.png) ## Access SQL Assistant functions MotherDuck provides built-in AI features to help you write, understand and fix DuckDB SQL queries more efficiently. These features include: - [Answer questions about your data](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-query) using the `prompt_query` pragma. - [Generate SQL](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-sql) for you using the `prompt_sql` table function. - [Correct and fix up your SQL query](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-fixup) using the `prompt_fixup` table function. - [Correct and fix up your SQL query line-by-line](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-fix-line) using the `prompt_fix_line` table function. - [Help you understand a query](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-explain) using the `prompt_explain` table function. - [Help you understand contents of a database](/sql-reference/motherduck-sql-reference/ai-functions/sql-assistant/prompt-schema) using the `prompt_schema` table function. ### Example usage of prompt_sql We use MotherDuck's sample [Hacker News dataset](/getting-started/sample-data-queries/hacker-news) from [MotherDuck's sample data database](/getting-started/sample-data-queries/datasets). 
```sql CALL prompt_sql('what are the top domains being shared on hacker_news?'); ``` Output of this SQL statement is a single column table that contains the AI-generated SQL query. | **query** | |-----------------| | ```sql SELECT COUNT(*) as domain_count, SUBSTRING(SPLIT_PART(url, '//', 2), 1, POSITION('/' IN SPLIT_PART(url, '//', 2)) - 1) as domain FROM hn.hacker_news WHERE url IS NOT NULL GROUP BY domain ORDER BY domain_count DESC LIMIT 10``` | --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/cloudflare-workers # Connect from Cloudflare Workers > Query MotherDuck from Cloudflare Workers using the Postgres wire protocol Cloudflare Workers do not support native DuckDB bindings, but they can connect to MotherDuck through the [Postgres endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint) using the [`pg`](https://www.npmjs.com/package/pg) npm package. This gives you a thin-client path to query MotherDuck from edge functions without any DuckDB dependencies. This guide walks through building a Worker that queries NYC taxi data from MotherDuck's built-in `sample_data` database. The full source code is available in the [motherduck-examples](https://github.com/motherduckdb/motherduck-examples/tree/main/cloudflare-workers) repository. 
## Prerequisites - [Node.js](https://nodejs.org/) v18+ - A [Cloudflare account](https://dash.cloudflare.com/sign-up) - A [MotherDuck account](https://motherduck.com/) and [access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck) ## Project setup Create a new directory and install dependencies: ```bash mkdir motherduck-worker && cd motherduck-worker npm init -y npm install pg@^8.16.3 npm install --save-dev wrangler @types/pg ``` ### Configure wrangler.toml ```toml name = "motherduck-taxi-stats" main = "src/index.ts" compatibility_date = "2026-04-02" compatibility_flags = ["nodejs_compat"] [vars] MOTHERDUCK_HOST = "pg.us-east-1-aws.motherduck.com" MOTHERDUCK_DB = "sample_data" ``` The `nodejs_compat` flag is required — it enables the `node:net` module that the `pg` package uses for TCP connections. Use a `compatibility_date` on or after `2024-09-23`; in practice, set it to today's date when you create the project. Generate the Worker binding types after you save `wrangler.toml`: ```bash npx wrangler types ``` ### Store your token as a secret ```bash npx wrangler secret put MOTHERDUCK_TOKEN ``` This prompts you to paste your MotherDuck token. It's stored encrypted and injected as an environment variable at runtime — it never appears in your source code or `wrangler.toml`. For local development, create a `.dev.vars` file (add this to `.gitignore`): ```text MOTHERDUCK_TOKEN="your_token_here" ``` ## Write the Worker Create `src/index.ts`. We'll build this in two parts: first the connection and routing, then the route handlers. 
### Connect and route requests ```typescript import { Client } from "pg"; interface Env { MOTHERDUCK_HOST: string; MOTHERDUCK_DB: string; MOTHERDUCK_TOKEN: string; } function createClient(env: Env): Client { return new Client({ connectionString: `postgresql://user:${env.MOTHERDUCK_TOKEN}@${env.MOTHERDUCK_HOST}:5432/${env.MOTHERDUCK_DB}?sslmode=require`, }); } export default { async fetch(request: Request, env: Env): Promise { const url = new URL(request.url); if (url.pathname === "/stats") { return handleStats(env, url); } return handleDefault(env); }, }; ``` The connection string is assembled from the environment variables defined in `wrangler.toml` and the secret token. The `?sslmode=require` parameter tells `pg` to open a TLS connection, and the Workers runtime performs certificate verification. The `fetch` handler routes first and opens a database connection only inside the route handlers. That keeps validation failures on `/stats` returning `400` instead of depending on database connectivity. ### Handle route logic Add the two handler functions to the same file. The `/stats` route accepts date range parameters and returns aggregated fare data. It validates inputs before querying and uses parameterized queries (`$1`, `$2`) to prevent SQL injection — never interpolate user input directly into SQL strings. ```typescript async function handleStats(env: Env, url: URL): Promise { const startDate = url.searchParams.get("start"); const endDate = url.searchParams.get("end"); if (!startDate || !endDate) { return Response.json( { error: "Both 'start' and 'end' query parameters are required. Use YYYY-MM-DD format." }, { status: 400 } ); } const datePattern = /^\d{4}-\d{2}-\d{2}$/; if (!datePattern.test(startDate) || !datePattern.test(endDate)) { return Response.json( { error: "Invalid date format. Use YYYY-MM-DD." 
}, { status: 400 } ); } const client = createClient(env); try { await client.connect(); const result = await client.query( `SELECT sum(passenger_count)::INTEGER AS total_passengers, round(sum(fare_amount), 2) AS total_fare FROM nyc.taxi WHERE tpep_pickup_datetime >= $1 AND tpep_pickup_datetime < $2`, [`${startDate} 00:00:00`, `${endDate} 00:00:00`] ); return Response.json({ start: startDate, end: endDate, ...result.rows[0], }); } finally { await client.end(); } } ``` The default route returns a sample of recent taxi trips — no user input needed: ```typescript async function handleDefault(env: Env): Promise<Response> { const client = createClient(env); try { await client.connect(); const result = await client.query( `SELECT tpep_pickup_datetime AS pickup, tpep_dropoff_datetime AS dropoff, passenger_count, trip_distance, fare_amount, tip_amount, total_amount FROM nyc.taxi ORDER BY tpep_pickup_datetime DESC LIMIT 20` ); return Response.json(result.rows); } finally { await client.end(); } } ``` ## Test locally ```bash npx wrangler dev ``` Then open `http://localhost:8787/` or try the stats endpoint with a date range: ```text http://localhost:8787/stats?start=2022-11-01&end=2022-12-01 ``` If `wrangler dev` starts successfully but direct Postgres queries fail locally with `Connection terminated`, switch to the Hyperdrive setup below and use a `localConnectionString` for local testing, or run `npx wrangler dev --remote` to exercise the Cloudflare runtime directly. ## Deploy ```bash npx wrangler deploy ``` ## Using Hyperdrive for connection pooling For production workloads, [Cloudflare Hyperdrive](https://developers.cloudflare.com/hyperdrive/) provides built-in connection pooling. This reduces latency by reusing connections across Worker invocations instead of opening a new connection per request. ### 1. 
create a Hyperdrive configuration ```bash npx wrangler hyperdrive create motherduck-db \ --connection-string="postgresql://user:$MOTHERDUCK_TOKEN@pg.us-east-1-aws.motherduck.com:5432/sample_data?sslmode=require" ``` ### 2. update wrangler.toml ```toml name = "motherduck-taxi-stats" main = "src/index.ts" compatibility_date = "2026-04-02" compatibility_flags = ["nodejs_compat"] [[hyperdrive]] binding = "MD_HYPERDRIVE" id = "" ``` ### 3. update the connection code Replace the connection string construction with: ```typescript const client = new Client({ connectionString: env.MD_HYPERDRIVE.connectionString, }); ``` Hyperdrive handles connection pooling and credential injection automatically. For local development with Hyperdrive, configure a direct connection string for `wrangler dev`: ```bash export CLOUDFLARE_HYPERDRIVE_LOCAL_CONNECTION_STRING_MD_HYPERDRIVE="postgresql://user:$MOTHERDUCK_TOKEN@pg.us-east-1-aws.motherduck.com:5432/sample_data?sslmode=require" npx wrangler dev ``` ## SSL notes Cloudflare Workers use `pg-cloudflare` for socket connections, which delegates TLS to the Workers runtime through `cloudflare:sockets`. The runtime encrypts the connection and verifies the server certificate against Cloudflare's trust store, but those verification settings are not exposed through the `pg` client. In this environment, application code uses the runtime-managed TLS configuration rather than supplying `rejectUnauthorized`, custom CA certificates, or `sslmode=verify-full`. Use `?sslmode=require` in the connection string. This tells `pg` to initiate TLS using STARTTLS, and the Workers runtime handles the actual certificate verification at the socket level. For standard Node.js environments where you can configure certificate verification directly, see [Connect from Node.js](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/nodejs). 
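Once the Worker is running locally or deployed, a small client script can smoke-test the `/stats` route from outside the Workers runtime. The sketch below is a hedged example and not part of the Worker itself: `build_stats_url` and `fetch_stats` are hypothetical helper names, and the date check simply mirrors the Worker's `YYYY-MM-DD` pattern so malformed input fails before any request is sent.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Same shape the Worker enforces: four digits, two digits, two digits (ASCII only).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)


def build_stats_url(base_url: str, start: str, end: str) -> str:
    """Build the /stats URL, rejecting dates the Worker would answer with 400."""
    for name, value in (("start", start), ("end", end)):
        if not DATE_PATTERN.fullmatch(value):
            raise ValueError(f"{name!r} must use YYYY-MM-DD format, got {value!r}")
    query = urlencode({"start": start, "end": end})
    return f"{base_url.rstrip('/')}/stats?{query}"


def fetch_stats(base_url: str, start: str, end: str) -> dict:
    """Call the Worker and decode its JSON body (needs a reachable Worker)."""
    with urlopen(build_stats_url(base_url, start, end), timeout=30) as response:
        return json.load(response)
```

With `npx wrangler dev` running, `fetch_stats("http://localhost:8787", "2022-11-01", "2022-12-01")` requests the same URL shown under "Test locally".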
--- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/building-analytics-agents # Building analytics agents with MotherDuck > Build AI-powered analytics agents using MotherDuck's SQL functions and MCP server integration. Analytics agents are AI-powered systems that allow users to interact with data using natural language. Instead of writing SQL queries or building dashboards, users can ask questions like "What were our top-selling products last quarter?" and get immediate answers. This guide covers best practices for building production-ready analytics agents on MotherDuck. ## Prerequisites - **Agent framework**: [Claude Agent SDK](https://docs.anthropic.com/en/api/agent-sdk/overview), [OpenAI Agents SDK](https://openai.github.io/openai-agents-python/), or Claude Desktop with MotherDuck remote MCP connector - **MotherDuck account** with the data you want to query - **Clean, well-structured data**: The better your schema and metadata, the better your agent performs ## Step 1: Define your agent's interface Choose the interface your agent will use to query your MotherDuck database. ### Option A: Generated SQL The agent generates SQL queries and executes them through a tool/function call. This provides maximum flexibility - agents can answer any question your data supports - but requires good SQL generation capabilities. 
**Implementation approaches:** **MCP Server**: Use our [remote MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) (or [local MCP server](/key-tasks/ai-and-motherduck/mcp-setup/#remote-vs-local-mcp-server) for self-hosted, read-write) for Claude Desktop, Cursor, ChatGPT, or Claude Code **Custom tool calling**: Create a function that accepts SQL strings and executes them: ### Python ```python import duckdb def execute_sql(query: str) -> str: """Execute SQL query against MotherDuck""" conn = duckdb.connect('md:my_database?motherduck_token=') try: result = conn.execute(query).fetchdf() return result.to_string() except Exception as e: return f"Error: {str(e)}" finally: conn.close() ``` ### Option B: Parameterized query templates The agent receives structured parameters that fill predefined SQL templates. This provides strict correctness guarantees and is easier to validate, but it is less flexible: it requires more upfront development, and the agent can only answer the questions you have predefined. **Example**: The agent calls a custom tool with a domain-specific signature like `get_sales_by_region(region: str, start_date: date, end_date: date)` instead of generating custom SQL. **Recommendation**: Start with Option A (SQL generation) unless you have strict correctness requirements or very limited query patterns. ## Step 2: Give your agent SQL knowledge Your LLM needs to know how to write good DuckDB queries. ### System prompt for DuckDB and MotherDuck A system prompt is the foundational instruction set that guides your agent's behavior and capabilities. It's critical for ensuring your agent generates correct, efficient SQL queries and understands how to explore data effectively. The query guide below should be added to your system prompt because it contains: - DuckDB SQL syntax and conventions - Common patterns and best practices - How to explore schemas efficiently
query_guide.md ```text # DuckDB SQL Query Syntax and Performance Guide ## General Knowledge ### Basic Syntax and Features **Identifiers and Literals:** - Use double quotes (`"`) for identifiers with spaces/special characters or case-sensitivity - Use single quotes (`'`) for string literals **Flexible Query Structure:** - Queries can start with `FROM`: `FROM my_table WHERE condition;` (equivalent to `SELECT * FROM my_table WHERE condition;`) - `SELECT` without `FROM` for expressions: `SELECT 1 + 1 AS result;` - Support for `CREATE TABLE AS` (CTAS): `CREATE TABLE new_table AS SELECT * FROM old_table;` **Advanced Column Selection:** - Exclude columns: `SELECT * EXCLUDE (sensitive_data) FROM users;` - Replace columns: `SELECT * REPLACE (UPPER(name) AS name) FROM users;` - Pattern matching: `SELECT COLUMNS('sales_.*') FROM sales_data;` - Transform multiple columns: `SELECT AVG(COLUMNS('sales_.*')) FROM sales_data;` **Grouping and Ordering Shortcuts:** - Group by all non-aggregated columns: `SELECT category, SUM(sales) FROM sales_data GROUP BY ALL;` - Order by all columns: `SELECT * FROM my_table ORDER BY ALL;` **Complex Data Types:** - Lists: `SELECT [1, 2, 3] AS my_list;` - Structs: `SELECT {'a': 1, 'b': 'text'} AS my_struct;` - Maps: `SELECT MAP([1,2],['one','two']) AS my_map;` - Access struct fields: `struct_col.field_name` or `struct_col['field_name']` - Access map values: `map_col[key]` **Date/Time Operations:** - String to timestamp: `strptime('2023-07-23', '%Y-%m-%d')::TIMESTAMP` - Format timestamp: `strftime(NOW(), '%Y-%m-%d')` - Extract parts: `EXTRACT(YEAR FROM DATE '2023-07-23')` ### Database and Table Qualification **Fully Qualified Names:** - Tables are accessed by fully qualified names: `database_name.schema_name.table_name` - There is always one current database: `SELECT current_database();` - Tables from the current database don't need database qualification: `schema_name.table_name` - Tables in the main schema don't need schema qualification: 
`table_name` - Shorthand: `my_database.my_table` is equivalent to `my_database.main.my_table` **Switching Databases:** - Use `USE my_other_db;` to switch current database - After switching, tables in that database can be accessed without qualification ### Schema Exploration **Get database and table information:** - List all databases: `SELECT alias as database_name, type FROM MD_ALL_DATABASES();` - List tables in database: `SELECT database_name, schema_name, table_name, comment FROM duckdb_tables() WHERE database_name = 'your_database';` - List views in database: `SELECT database_name, schema_name, view_name, comment, sql FROM duckdb_views() WHERE database_name = 'your_database';` - Get column information: `SELECT column_name, data_type, comment, is_nullable FROM duckdb_columns() WHERE database_name = 'your_database' AND table_name = 'your_table';` **Sample data exploration:** - Quick preview: `SELECT * FROM table_name LIMIT 5;` - Column statistics: `SUMMARIZE table_name;` - Describe table: `DESCRIBE table_name;` ### Performance Tips **QUALIFY Clause for Window Functions:** -- Get top 2 products by sales in each category SELECT category, product_name, sales_amount FROM products QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales_amount DESC) <= 2; **Efficient Patterns:** - Use `arg_max()` and `arg_min()` for "most recent" queries - Filter early to reduce data volume - Use CTEs for complex queries - Prefer `GROUP BY ALL` for readability - Use `QUALIFY` instead of subqueries for window function filtering **Avoid These Patterns:** - Functions on the left side of WHERE clauses (prevents pushdown) - Unnecessary ORDER BY on intermediate results - Cross products and cartesian joins ```
### Function documentation MotherDuck maintains `function_docs.jsonl` - compact, LLM-friendly documentation for every DuckDB/MotherDuck function available at: https://app.motherduck.com/assets/docs/function_docs.jsonl **How to use**: 1. When a user asks a question, search function docs using FTS or semantic search 2. Add the 5 most relevant function descriptions to the agent's context 3. This helps with specialized functions (window functions, date arithmetic, JSON operations, etc.) ## Step 3: Give your agent schema context Your agent needs to understand your database structure to generate correct queries. ### Finding relevant tables Our `query_guide.md` explains how agents can explore schemas autonomously to find relevant tables. For faster, non-agentic identification, use the built-in `INFORMATION_SCHEMA`. ```sql -- adjust the search terms and database(s) to your needs SELECT table_schema, table_name, table_comment FROM information_schema."tables" WHERE table_catalog = current_database() AND ( table_name LIKE '%sales%' OR table_name LIKE '%customer%' OR table_name LIKE '%cust%' OR table_comment LIKE '%sales%' OR table_comment LIKE '%customer%' ); ``` The parentheses keep the `table_catalog` filter applied to every search term. For column-level information you can use `information_schema.columns`. ### Make schemas agent-friendly **Use clear naming**: Choose explicit, unambiguous table and column names ❌ Bad: `ord_dtl`, `cust_id`, `amt` ✅ Good: `order_details`, `customer_id`, `total_amount` **Add context with COMMENT ON**: ```sql COMMENT ON TABLE orders IS 'Customer orders since 2020. Join to customers via customer_id'; COMMENT ON COLUMN orders.status IS 'Possible values: pending, shipped, delivered, cancelled'; COMMENT ON COLUMN orders.total_amount IS 'Total in USD including tax and shipping'; ``` Comments help agents understand table relationships, valid values, and business logic. 
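When table and column descriptions already live in a data dictionary, the `COMMENT ON` statements can be generated instead of written by hand. The following is a minimal sketch under stated assumptions: `sql_literal` and `build_comment_statements` are hypothetical helpers (not part of DuckDB or MotherDuck), and the sketch deliberately accepts only plain, unquoted identifiers and rejects anything else rather than trying to quote it.

```python
import re

# Plain, unquoted identifiers only; anything else is rejected rather than quoted.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_comment_statements(
    table: str, table_comment: str, column_comments: dict[str, str]
) -> list[str]:
    """Return one COMMENT ON statement for the table and one per column."""
    for name in (table, *column_comments):
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    statements = [f"COMMENT ON TABLE {table} IS {sql_literal(table_comment)};"]
    for column, comment in column_comments.items():
        statements.append(
            f"COMMENT ON COLUMN {table}.{column} IS {sql_literal(comment)};"
        )
    return statements
```

Each returned string can be executed with the same connection your agent uses for setup tasks; keeping the descriptions in one dictionary makes it easier to review them alongside schema changes.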
Learn more: [COMMENT ON documentation](https://duckdb.org/docs/stable/sql/statements/comment_on.html) ## Step 4: Configure access controls Secure your agent's database access with appropriate permissions and isolation. ### Read-only access Use [read-scaling tokens](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) to ensure your agent only has read access. Read-scaling tokens connect to dedicated read replicas that cannot modify data. ### Python ```python import duckdb # Using a read-scaling token ensures read-only access con = duckdb.connect('md:my_database?motherduck_token=') ``` **For multi-tenant [customer-facing analytics](/getting-started/customer-facing-analytics/) agents**: Use [service accounts](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/) for your agents. You can grant these service accounts read-only access to specific databases using [shares](/key-tasks/sharing-data/sharing-overview/): ```sql ATTACH 'md:_share/my_org/abc123' AS shared_data; ``` Consider creating separate service accounts per user/tenant for full compute isolation. **Capacity planning**: Choose the number of [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) replicas and [Duckling size](/about-motherduck/billing/duckling-sizes/) according to the expected query complexity and concurrency. ### Read-write access & sandboxing For agents that need to create tables, modify data, or experiment safely, use zero-copy clones to create an isolated sandbox. This provides safe experimentation completely isolated from production data, with instant creation through zero-copy operations. Agents get full capabilities to create tables, modify data, and experiment freely, with easy sharing of results back to production when ready. 
```sql -- Create instant writable copy (clones must match source retention type) CREATE DATABASE my_sandbox FROM my_database_share; -- Agent can now read/write without affecting production data -- Changes are isolated to this copy ``` Learn more: [CREATE DATABASE documentation](/sql-reference/motherduck-sql-reference/create-database/) ## Step 5: Implement your agent Build your agent using an SDK or framework that supports function calling. **Quick start option**: For immediate experimentation, try [Claude Desktop with the MotherDuck remote MCP Server](/key-tasks/ai-and-motherduck/mcp-setup/) - no coding required. **Custom agent option**: Here's a simple example using the [OpenAI Agents SDK](https://openai.github.io/openai-agents-python/): ### Python ```python import duckdb from agents import Agent, Runner, function_tool # Connect to MotherDuck (use a read-scaling token for read-only access) conn = duckdb.connect('md:?motherduck_token=') @function_tool def query_motherduck(sql: str) -> str: """Execute SQL query against MotherDuck database. Args: sql: The SQL query to execute against the MotherDuck database. """ try: result = conn.execute(sql).fetchdf() return result.to_string() except Exception as e: return f"Error executing query: {str(e)}" # Load the DuckDB query guide (copy the system prompt template above into a local file) with open('query_guide.md', 'r') as f: query_guide = f.read() # Create agent with database tool agent = Agent( name="MotherDuck Analytics Agent", instructions=f"""You are a data analyst helping users query a MotherDuck database. Use the query_motherduck tool to execute SQL queries against the database. Always start with schema exploration before querying specific tables. {query_guide} """, tools=[query_motherduck] ) # Run the agent result = Runner.run_sync( agent, "What were the top 5 products by revenue last month?" 
) print(result.final_output) ``` ### Validating queries before showing to users If a human reviews generated queries before execution, use `try_bind()` to validate SQL without running it. It checks syntax and referenced tables/columns in milliseconds. **Structured output:** `try_bind()` returns `error_message` (VARCHAR) and `error_type` (VARCHAR). Use `error_type` to decide what to do next: `ok` means validation passed, `parser` means SQL syntax is invalid, and `binder` means object resolution failed (for example, a missing table/column or invalid reference). On `parser` or `binder`, pass `error_message` back into the next generation attempt so the model can repair the query. ```sql -- Valid query - error_type is 'ok', error_message is empty CALL try_bind('SELECT customer_id, total FROM orders WHERE status = ''shipped'''); -- Invalid query - returns error_message and error_type (e.g. 'parser' or 'binder') CALL try_bind('SELECT * FORM orders'); ``` **Example integration:** ### Python ```python def generate_query_for_review(question: str) -> str: """Generate and validate SQL before showing to user.""" error_msg = None for attempt in range(3): sql = agent.generate_sql(question, error_feedback=error_msg) # Validate before showing (error_message, error_type) row = conn.execute("CALL try_bind(?)", [sql]).fetchall()[0] error_message, error_type = row[0], row[1] if error_type == "ok": return f"Generated query:\n{sql}" error_msg = error_message or f"Validation failed: {error_type}" return "Could not generate a valid query to answer the question" ``` Feed `error_message` and `error_type` from `try_bind()` into retries to fix syntax and binding errors. ## Step 6: Test and iterate Validate your agent's performance and refine its behavior based on real-world usage. 
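One way to make this step repeatable is a small regression harness that runs a fixed list of questions through the agent and records which answers pass a check. This is a hedged sketch rather than a prescribed workflow: `run_eval` is a hypothetical helper, `ask` stands for whatever callable wraps your agent, and each check is a function you write for that question.

```python
from typing import Callable, Optional

# A case pairs a user question with a predicate over the agent's answer.
Case = tuple[str, Callable[[str], bool]]


def run_eval(ask: Callable[[str], str], cases: list[Case]) -> list[dict]:
    """Run every case and record pass/fail without stopping on errors."""
    report = []
    for question, check in cases:
        error: Optional[str] = None
        try:
            answer = ask(question)
            passed = bool(check(answer))
        except Exception as exc:  # an agent crash counts as a failed case
            answer, passed, error = "", False, str(exc)
        report.append(
            {"question": question, "passed": passed, "answer": answer, "error": error}
        )
    return report
```

With the agent from Step 5, `ask` could be `lambda q: Runner.run_sync(agent, q).final_output`. Rerunning the same cases after each prompt or schema change shows whether a fix for one question broke another.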
### Testing and quality Choose a set of realistic user questions that cover simple filters ("Show me sales from last month"), complex analysis ("What's the trend in customer retention by region?"), and edge cases like empty results ("Show me sales for December 2019") or ambiguous requests ("Show me the best customers"). Test each question and check the agent's behavior. Focus on SQL correctness, result accuracy and query performance. See the next section for how to tackle common issues. ### Common issues and solutions | Issue | Solution | |-------|----------| | Invalid SQL generation | Improve system prompt, add [function docs](#function-documentation) to context | | Wrong tables queried | Add [COMMENT ON](https://duckdb.org/docs/stable/sql/statements/comment_on.html), improve schema descriptions, implement table filtering | | Misunderstood questions | Add domain-specific examples to system prompt | | Query performance | [EXPLAIN ANALYZE](/sql-reference/motherduck-sql-reference/explain-analyze/) to diagnose query inefficiencies, adjust [Duckling size](/about-motherduck/billing/duckling-sizes/) to scale compute resources | ## Next steps - Explore our [MCP Server](/sql-reference/mcp/) docs (remote and local) - Try [AI Features in the MotherDuck UI](/key-tasks/ai-and-motherduck/ai-features-in-ui/) with Generate SQL & Edit - Learn about [Read Scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) for multi-tenant agents - Review [Shares](/key-tasks/sharing-data/sharing-overview/) for read-only data access --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/dives/embedding-dives # Embedding Dives in your web application > Embed interactive MotherDuck Dives in your web app using iframes and embed sessions You can embed Dives in your own web application so your users can interact with live data dashboards without signing in to MotherDuck. Your backend creates an embed session, and your frontend loads the Dive in a sandboxed iframe. 
Embedding Dives is available on the **Business plan**. ## Prerequisites Before you start, you need: - A **MotherDuck Business plan** account - A read/write access token for an account with the Admin role. For production, we recommend using a dedicated [service account](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/) - A Dive you want to embed, with its [data shared](/sql-reference/mcp/share-dive-data) to the target service account that the embedded Dive will run as - A backend server that can make authenticated API calls :::tip Use a dedicated service account We recommend using a service account that does not own databases with the same names as the databases your Dives query. When the service account attaches shared Dive data, the share alias defaults to the source database name. If the service account already has a database with that name, the attach fails. Using a dedicated, empty service account for embedding avoids this conflict. ::: ## How it works Embedded Dives follow a short server-side flow: 1. **Your backend** calls the MotherDuck API with your access token to create an embed session: an opaque string that contains a read-only session string and the information needed to load the Dive. 2. **Your frontend** renders a sandboxed iframe that loads the Dive from `embed-motherduck.com`, passing the session string. 3. **MotherDuck** loads the Dive and runs live SQL queries. Your end-users see an interactive dashboard without needing a MotherDuck account. ::::info[Two tokens are in play] Your service account's access token is a **high-privilege read-write admin token** that stays on your backend and is used only to create embed sessions. The session string it produces contains a **separate, read-only token** that is limited in scope and expires after 24 hours. Only the session string should ever reach the frontend. 
:::: ```mermaid sequenceDiagram participant M as MotherDuck participant B as Your backend participant F as Your frontend participant E as Embed iframe Note over B: Holds your access token B->>M: POST /v1/dives//embed-session M-->>B: Session string B-->>F: Return session string F->>E: Load iframe /sandbox/#session= Note over F,E: The session stays in the
URL fragment, not the request E->>M: Fetch Dive metadata and content M-->>E: Return the Dive ``` ## Step 1: Create an embed session Your backend calls the MotherDuck API to create an embed session. The access token used for this call must belong to an account with admin-level access. The session string contains a read-only token that expires after 24 hours. ::::warning[Important] **Never expose your access token in client-side code.** The access token stays on your backend. Only the session string reaches the browser. :::: ### Node.js ```javascript const DIVE_ID = ""; const VERSION = 12; const response = await fetch( `https://api.motherduck.com/v1/dives/${DIVE_ID}/embed-session`, { method: "POST", headers: { // Access token of the admin account used to generate the embed session. Authorization: `Bearer ${MOTHERDUCK_TOKEN}`, "Content-Type": "application/json", }, // The service account whose compute and permissions will be used for the Dive. body: JSON.stringify({ username: SERVICE_ACCOUNT_USERNAME, // Optional: render a specific Dive version. version: VERSION, }), } ); if (!response.ok) { throw new Error(`Failed to create embed session: ${response.status}`); } const { session } = await response.json(); // Return this session string to your frontend ``` ### Python ```python import httpx DIVE_ID = "" VERSION = 12 response = httpx.post( f"https://api.motherduck.com/v1/dives/{DIVE_ID}/embed-session", headers={ "Authorization": f"Bearer {MOTHERDUCK_TOKEN}", "Content-Type": "application/json", }, json={ "username": SERVICE_ACCOUNT_USERNAME, # Optional: render a specific Dive version. "version": VERSION, }, ) response.raise_for_status() session = response.json()["session"] # Return this session string to your frontend ``` Set `DIVE_ID` to the ID of your Dive. You can find this in **Settings** > **Dives** or through the [`list_dives`](/sql-reference/mcp/list-dives) MCP tool. 
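Because `version` is optional, it can help to assemble the request in one place so that unpinned sessions simply leave the field out. The sketch below is a minimal example, assuming the endpoint and body fields shown above; `build_embed_session_request` is a hypothetical helper name, and it only builds the request without sending it.

```python
from typing import Any, Optional

API_BASE = "https://api.motherduck.com/v1"


def build_embed_session_request(
    dive_id: str,
    username: str,
    access_token: str,
    version: Optional[int] = None,
) -> dict[str, Any]:
    """Return url, headers, and JSON body for the embed-session call."""
    if not dive_id:
        raise ValueError("dive_id is required")
    body: dict[str, Any] = {"username": username}
    if version is not None:
        # Pin the embed to one Dive version; omitted means no pinned version.
        body["version"] = version
    return {
        "url": f"{API_BASE}/dives/{dive_id}/embed-session",
        "headers": {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        "json": body,
    }
```

The returned dictionary matches the keyword arguments of `httpx.post`, so `httpx.post(**build_embed_session_request(...))` sends the request from your backend. The access token is passed in as an argument so it never needs to appear in code that ships to the browser.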
To render a specific version of a Dive, for example for an embedded Dive in a production environment, pass `version` when you create the embed session. MotherDuck validates that the requested version exists for the Dive before returning the session. If you omit `version`, the embedded Dive renders the latest saved version when it loads. The `version` value maps to the Dive version number, not the version UUID. Use the `current_version` value from [`MD_LIST_DIVES`](/sql-reference/motherduck-sql-reference/ai-functions/dives/md-list-dives), the `version` value from [`MD_LIST_DIVE_VERSIONS`](/sql-reference/motherduck-sql-reference/ai-functions/dives/md-list-dive-versions), or the `version` argument to [`MD_GET_DIVE_VERSION`](/sql-reference/motherduck-sql-reference/ai-functions/dives/md-get-dive-version). Your application owns which version to embed. Store the approved version alongside your own release or customer configuration, then pass it when generating sessions for that embed. Each session is tied to a single Dive. If you embed multiple Dives on the same page, create a separate embed session for each one. You can use the same service account and access token for all of them. The session string is base64-encoded but **not encrypted** — it contains a read-only (read scaling) token, the Dive ID, and endpoint URLs. Treat it like a short-lived credential: do not log it or store it in persistent storage. The embedded Dive runs queries as the service account specified in the session. If you need data isolation (for example, separate databases per region), use separate service accounts scoped to only the data each should access. ## Customize the embed session (optional) `POST /v1/dives//embed-session` accepts optional fields that let you tailor each session. ### Override required databases By default, an embedded Dive uses the [`REQUIRED_DATABASES`](/key-tasks/ai-and-motherduck/dives/#declaring-required-databases) declared in the Dive's source code. 
To point the same Dive at different databases on a per-session basis — for example, to render the same dashboard for each of your tenants against their own database — pass a `required_resources` array when creating the embed session: ### Node.js ```javascript body: JSON.stringify({ username: SERVICE_ACCOUNT_USERNAME, required_resources: [ { url: "md:_share/tenant_a_data/", alias: "tenant_data", }, ], }), ``` ### Python ```python json={ "username": SERVICE_ACCOUNT_USERNAME, "required_resources": [ { "url": "md:_share/tenant_a_data/", "alias": "tenant_data", }, ], }, ``` Each entry describes one database: | Field | Required | Description | |-------|----------|-------------| | `url` | Yes | Share URL (`md:_share//`) or owned database identifier (`md:`). | | `alias` | No | Local alias used in the Dive's SQL. Defaults to the database name from the URL. | When you set `required_resources`, it **replaces** the Dive's source-declared `REQUIRED_DATABASES` for that session. Omit the field to use the source-declared list. ### Preconfigure the starting UI state Dives can use the `useDiveState` hook from `@motherduck/react-sql-query` to store interactive state such as filters, sort order, selected views, and drill-downs. To seed that state for a given session — for example, to render the same Dive against each customer's selected date range — pass an `initial_state` object when creating the embed session: ### Node.js ```javascript body: JSON.stringify({ username: SERVICE_ACCOUNT_USERNAME, initial_state: { region: "emea", dateRange: { start: "2026-01-01", end: "2026-03-31" }, }, }), ``` ### Python ```python json={ "username": SERVICE_ACCOUNT_USERNAME, "initial_state": { "region": "emea", "dateRange": {"start": "2026-01-01", "end": "2026-03-31"}, }, }, ``` Each key in `initial_state` matches a key used in `useDiveState(key, ...)` inside the Dive's code. Values must be JSON-serializable. 
Keys absent from `initial_state` fall back to the `initialValue` declared in the Dive's source. Viewer interactions update the Dive's UI state, but those changes are not persisted server-side. To capture viewer changes from your host page, listen for [`dive-state-update` messages](#handle-dive-state-updates-from-embedded-dives). :::note `required_resources` is capped at 8 KB on the encoded session. `initial_state` is capped at 64 KB; bags larger than 8 KB are stored server-side rather than inlined on the session. ::: ## Step 2: Embed the iframe Add a sandboxed iframe to your page that points to the MotherDuck embed URL. Pass the session string in the URL fragment: ```html ``` Replace `` with the session string your backend generated. The `sandbox` attribute must include `allow-scripts allow-same-origin` for the embed to function. ### Query modes Embedded Dives use **dual mode** by default, where queries can use browser DuckDB WASM or run server-side through MotherDuck depending on the query. Dual mode is required for browser DuckDB features such as data exports. You can force **server mode** for embeds that only need server-side SQL queries. To use server mode, add `?queryMode=server` to the iframe URL: ```html ``` #### Server mode data type limitations Server mode runs queries through the Postgres wire protocol, which does not support all DuckDB data types. Basic types (integers, strings, floats) work fine, but nested types (structs, lists) and some less common timestamp types may not render correctly. If you encounter issues with specific columns, try dual (WASM) mode, which supports the full range of DuckDB types. 
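If your backend renders the host page, the iframe markup for this step can be generated from the session string. The sketch below is a hedged example: `build_embed_url` and `build_iframe_html` are hypothetical helpers, the `https` scheme and the `motherduck-dive` element ID are assumptions chosen to match the host-page examples in this guide, and the session string is inserted into the fragment verbatim.

```python
from html import escape

EMBED_BASE = "https://embed-motherduck.com/sandbox/"


def build_embed_url(session: str, server_mode: bool = False) -> str:
    """Build the embed URL: optional query mode first, session in the fragment."""
    if not session:
        raise ValueError("session is required")
    query = "?queryMode=server" if server_mode else ""
    return f"{EMBED_BASE}{query}#session={session}"


def build_iframe_html(
    session: str, server_mode: bool = False, element_id: str = "motherduck-dive"
) -> str:
    """Render a sandboxed iframe tag with the attributes the embed requires."""
    src = escape(build_embed_url(session, server_mode), quote=True)
    safe_id = escape(element_id, quote=True)
    return (
        f'<iframe id="{safe_id}" src="{src}" '
        'sandbox="allow-scripts allow-same-origin"></iframe>'
    )
```

Sizing and styling are left to your page. Since the session string is a short-lived credential, render it into the response for the current viewer only and avoid writing the generated markup to logs or caches.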
### URL structure | Part | Description | |------|-------------| | `embed-motherduck.com/sandbox/` | The MotherDuck embed host | | `?queryMode=server` | Optional: forces server-only query mode | | `#session=` | The session string, passed in the URL fragment so it is never sent to the server | The session is placed in the URL fragment (after `#`) rather than the query string. Browsers strip fragments before making HTTP requests, so the session does not appear in server logs or Referer headers. ## Handle link navigation from embedded Dives Embedded Dives run inside an isolated MotherDuck sandbox iframe. Dive code cannot directly navigate the parent page or open popups. When someone clicks a link in an embedded Dive, or Dive code calls `window.open()`, the sandbox blocks the browser navigation and sends a `postMessage` to the parent page. The message has the following shape: ```typescript type NavigationRequest = { type: "navigation-request"; url: string; source: "anchor-click" | "window-open"; target: "_blank" | "_self" | null; rel: string | null; }; ``` The parent page decides how to handle the request. Listen for `navigation-request`, validate the event origin and URL, and apply your own policy before opening anything. 
The following example uses `window.confirm`; replace it with your application's confirmation UI: ```typescript const iframe = document.querySelector<HTMLIFrameElement>("#motherduck-dive"); if (!iframe) { throw new Error("MotherDuck Dive iframe not found"); } const motherduckEmbedOrigin = new URL(iframe.src).origin; window.addEventListener("message", (event) => { if (event.origin !== motherduckEmbedOrigin) return; if (event.source !== iframe.contentWindow) return; const message = event.data; if (message?.type !== "navigation-request") return; let url: URL; try { url = new URL(message.url); } catch { return; } if (!["https:", "http:"].includes(url.protocol)) return; const confirmed = window.confirm(`Open ${url.toString()}?`); if (!confirmed) return; window.open(url.toString(), "_blank", "noopener,noreferrer"); }); ``` ::::warning[Important] Treat `navigation-request` as untrusted user intent from sandboxed content, not as a command. The parent page should not navigate, submit forms, mutate application state, or grant permissions based only on the message. :::: ### Use absolute URLs in Dive links If you plan to embed a Dive, use absolute URLs in links inside the Dive. Avoid app-relative links like this: ```html <a href="/settings/members">Settings</a> ``` In an embedded Dive, `/settings/members` resolves against the embed origin, not the MotherDuck app. The parent page receives a URL such as: ```text https://embed-motherduck.com/settings/members ``` Use absolute URLs instead: ```html <a href="https://motherduck.com/docs/">Docs</a> <a href="https://app.motherduck.com/dives/">Another Dive</a> ``` For embedded Dives, the parent page owns the policy for whether a navigation request opens a new tab, replaces the current page, or is blocked. ## Handle data exports from embedded Dives Dives can include export buttons created with the `exportAs` return value from `useSQLQuery()` or the `useExport()` hook. 
When a user starts an export, the Dive runs the export SQL with DuckDB `COPY TO` and sends the generated file to the parent page. Because embedded Dives run in a sandboxed iframe, the iframe cannot download the file directly. Your parent page must listen for export messages, validate the event, and decide how to offer the file to your user. Embedded exports support `csv`, `json`, `parquet`, and `xlsx` formats. Exports require dual mode because file generation uses browser DuckDB. If you force `?queryMode=server`, export controls return an error. The parent page receives these message types: ```typescript type ExportStarted = { type: "export-started"; requestId: string; format: "csv" | "json" | "parquet" | "xlsx"; title?: string; filename: string; }; type ExportFile = { type: "export-file"; requestId: string; format: "csv" | "json" | "parquet" | "xlsx"; title?: string; filename: string; mimeType: string; byteLength: number; previewOptions?: Record<string, unknown>; data: ArrayBuffer; }; type ExportError = { type: "export-error"; requestId: string; format: "csv" | "json" | "parquet" | "xlsx"; title?: string; filename?: string; error: string; }; ``` The following example stores the completed export and shows a host-page download button. 
Replace the status and button UI with your application's pattern: ```html ``` ```javascript const iframe = document.querySelector("#motherduck-dive"); const status = document.querySelector("#dive-export-status"); const downloadButton = document.querySelector("#dive-export-download"); if (!iframe || !status || !downloadButton) { throw new Error("MotherDuck Dive export controls not found"); } const motherduckEmbedOrigin = new URL(iframe.src).origin; let pendingExport = null; function isArrayBuffer(value) { return Object.prototype.toString.call(value) === "[object ArrayBuffer]"; } function isExportFile(message) { return ( message?.type === "export-file" && typeof message.requestId === "string" && typeof message.filename === "string" && typeof message.mimeType === "string" && typeof message.byteLength === "number" && isArrayBuffer(message.data) ); } window.addEventListener("message", (event) => { if (event.origin !== motherduckEmbedOrigin) return; if (event.source !== iframe.contentWindow) return; const message = event.data; if (message?.type === "export-started") { status.textContent = `Preparing ${message.filename}`; downloadButton.hidden = true; pendingExport = null; return; } if (message?.type === "export-error") { status.textContent = `Export failed: ${message.error}`; downloadButton.hidden = true; pendingExport = null; return; } if (!isExportFile(message)) return; pendingExport = message; status.textContent = `${message.filename} is ready to download`; downloadButton.hidden = false; }); downloadButton.addEventListener("click", () => { if (!pendingExport) return; const blob = new Blob([pendingExport.data], { type: pendingExport.mimeType || "application/octet-stream", }); const url = URL.createObjectURL(blob); const link = document.createElement("a"); link.href = url; link.download = pendingExport.filename; document.body.appendChild(link); link.click(); link.remove(); URL.revokeObjectURL(url); pendingExport = null; downloadButton.hidden = true; status.textContent = 
"Export downloaded"; }); ``` ::::warning[Important] Treat export messages as untrusted content from sandboxed Dive code. After you validate the event origin and source, use the message to offer a download to your user. Do not upload the file, attach it to another account, or trigger backend workflows based only on the message. :::: Exports run the full SQL passed by the Dive, not the rows already rendered in React. Large exports can use significant browser memory because the generated file is transferred to the parent page as an `ArrayBuffer`. For larger data delivery workflows, consider creating a server-side export flow outside the embedded Dive. ## Handle Dive state updates from embedded Dives When a viewer interacts with a Dive built using the `useDiveState` hook, the embed sends a `dive-state-update` message to the parent page each time the state changes. MotherDuck does not persist these changes server-side — the parent page decides whether to capture the snapshot. A common use is to save it to your backend so the viewer's selections survive across sessions; you can then [seed the next session](#preconfigure-the-starting-ui-state) with the saved bag. The message has the following shape: ```typescript type DiveStateUpdate = { type: "dive-state-update"; state: Record; }; ``` `state` is the **full snapshot** of every key the Dive holds, not a delta. Dropped or out-of-order messages are safe to ignore — the next snapshot supersedes them. MotherDuck debounces updates (~100 ms) to limit chatter during rapid interactions. The following example saves each snapshot to `localStorage` keyed by the Dive ID. 
Replace the storage with whatever persistence layer fits your application: ```javascript const iframe = document.querySelector("#motherduck-dive"); const motherduckEmbedOrigin = new URL(iframe.src).origin; const STORAGE_KEY = `dive-state:${DIVE_ID}`; window.addEventListener("message", (event) => { if (event.origin !== motherduckEmbedOrigin) return; if (event.source !== iframe.contentWindow) return; if (event.data?.type !== "dive-state-update") return; localStorage.setItem(STORAGE_KEY, JSON.stringify(event.data.state)); }); ``` To replay the saved snapshot on the viewer's next visit, pass it as `initial_state` when creating the next embed session. ::::warning[Important] Treat `dive-state-update` payloads as untrusted content from sandboxed Dive code. After validating the event origin and source, only use the bag for storage or to seed the next session — do not interpret it as a command, attach it to other accounts, or feed it into backend workflows that grant permissions. :::: ## Session lifecycle Embed sessions expire after 24 hours. You have two options for handling expiration: - **Generate a fresh session per page load.** The simplest approach. Each time a user loads the page, your backend creates a new embed session and passes it to the iframe. - **Cache and refresh.** Your backend caches the session and refreshes it before it expires. This reduces API calls but adds complexity. If a session expires while a Dive is open, the embed displays a "Session expired" message. The user needs to reload the page to get a new session. ## Security best practices - **Keep your access token server-side.** Never include your access token in client-side JavaScript, HTML, or any code that reaches the browser. - **Use a dedicated service account.** Create a [service account](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/) specifically for embedding, separate from your personal account. 
The account needs a read/write, Admin-level access token to create embed sessions, but the sessions it generates are always read-only. - **Sessions are read-only.** The embed session always contains a read scaling token, so it can only read data, not modify it. - **Session in URL fragment.** The fragment (`#session=...`) is never sent to the server in HTTP requests, keeping the session out of access logs and referrer headers. - **Scope service accounts for data isolation.** If you need to restrict which data different users can see (for example, per-region databases), create separate service accounts with access scoped to the appropriate data. The embedded Dive queries data as the service account used to create the session. ## CSP configuration If your site uses a restrictive [Content Security Policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP), add `embed-motherduck.com` to your `frame-src` directive: ```text Content-Security-Policy: frame-src https://embed-motherduck.com; ``` Without this, the browser blocks the iframe from loading. ## Troubleshooting Errors from the embed itself (expired token, Dive not found) appear as messages **inside the iframe**. CSP or network-related errors typically appear only in the **browser developer console**. | Error message | Cause | Solution | |---------------|-------|----------| | "Dive embedding requires a Business plan." | Your organization is not on the Business plan | Upgrade to a [Business plan](https://motherduck.com/pricing/) | | "Invalid or expired token. Please reload the page." | The session has expired or is malformed | Create a fresh embed session from your backend | | "Dive not found." | The Dive ID is incorrect or the Dive has been deleted | Verify the Dive ID in **Settings** > **Dives** | | "Failed to load dive. Please try again." 
| A generic error occurred while loading | Check your session string and network connectivity, then reload | | "Can't open share: Share alias cannot be the same as an existing database name. _name_ is already taken and used as a database name." | Your service account already has a database with the same name as one of the Dive's shared databases | Rename or [detach](/key-tasks/database-operations/detach-and-reattach-motherduck-database/) the conflicting database on the service account. See [share alias conflicts](/sql-reference/motherduck-sql-reference/attach/#share-alias-conflicts) for details. | | Links in the embedded Dive do not open | Embedded Dives cannot directly navigate the parent page or open popups from the sandbox | Listen for `navigation-request` messages in the parent page, validate the URL, and decide whether to open it | | Export buttons do not download a file | The iframe cannot download files directly from the sandbox, or the embed is using server mode | Listen for `export-file` messages in the parent page and offer the file for download. Use dual mode for Dives that include export controls. | | Iframe does not load (blank or blocked) | Your site's CSP blocks `embed-motherduck.com`; the browser developer console reports a CSP violation | Add `frame-src https://embed-motherduck.com` to your CSP header | | "User role "restricted" does not meet minimum role "admin" required for dashboards.createEmbedSession" | The user associated with the token is not an admin. Generating embed tokens requires the user or service account to have admin permissions. | In the service accounts panel under settings, change the role of the service account to 'Admin' | | unauthorized_client: Callback URL mismatch. `` is not in the list of allowed callback URLs | Embedded Dives use MotherDuck's authorization system to determine permissions, which limits the URLs that can be used for authorization.
| For local development, make sure you are running on `localhost` rather than an address such as `127.0.0.1` | ## Related resources - [Creating visualizations with Dives](/key-tasks/ai-and-motherduck/dives/) - [Dives SQL functions](/sql-reference/motherduck-sql-reference/ai-functions/dives/) - [Managing Dives as code](/key-tasks/ai-and-motherduck/dives/managing-dives-as-code) --- Source: https://motherduck.com/docs/key-tasks/cloud-storage/querying-s3-files # Querying Files in Amazon S3 > Query Parquet, CSV, and JSON files in S3 with automatic cloud execution routing. Because MotherDuck is hosted in the cloud, it offers better and faster interoperability with Amazon S3. MotherDuck's [Dual Execution](/concepts/architecture-and-capabilities#dual-execution) automatically routes queries against cloud storage to MotherDuck's execution runtime in the cloud rather than executing them locally. :::note MotherDuck supports several cloud storage providers, including [Azure](/integrations/cloud-storage/azure-blob-storage.mdx), [Google Cloud](/integrations/cloud-storage/google-cloud-storage.mdx) and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2). ::: :::info How MotherDuck accesses cloud storage When you query cloud storage while connected to MotherDuck (for example, `read_parquet('s3://...')`), the query runs on MotherDuck's cloud execution engine, not on your local machine. MotherDuck connects to your storage provider directly from the cloud. To authenticate, MotherDuck can use **any** of your secrets, including temporary, in-memory secrets created in your local DuckDB session. This means that even if you create a secret locally without `IN MOTHERDUCK` or `PERSISTENT`, MotherDuck's cloud service can still use it to read your data. Your local DuckDB client does not connect to cloud storage directly. For details on secret storage options and how secrets are resolved, see [CREATE SECRET](/sql-reference/motherduck-sql-reference/create-secret/).
::: :::tip To browse objects before you query them, use [`MD_LIST_FILES()`](/sql-reference/motherduck-sql-reference/md-list-files): ```sql FROM md_list_files('s3:////'); ``` To discover buckets exposed by an S3 secret, use [`MD_LIST_BUCKETS_FOR_SECRET()`](/sql-reference/motherduck-sql-reference/md-list-buckets-for-secret). ::: MotherDuck supports the [DuckDB dialect](https://duckdb.org/docs/guides/import/s3_import) to query data stored in Amazon S3. Such queries are automatically routed to MotherDuck's cloud execution engines for faster and more efficient execution. Here are some examples of querying data in Amazon S3: ```sql SELECT * FROM read_parquet('s3:///'); SELECT * FROM read_parquet(['s3:///', ... ,'s3:///']); SELECT * FROM read_parquet('s3:///*'); SELECT * FROM 's3:////*'; SELECT * FROM iceberg_scan('s3:///', ALLOW_MOVED_PATHS=true); SELECT * FROM delta_scan('s3:///'); ``` See [Apache Iceberg](/integrations/file-formats/apache-iceberg.mdx) for more information on reading Iceberg data. See [Delta Lake](/integrations/file-formats/delta-lake.mdx) for more information on reading Delta Lake data. ## Accessing private files in S3 Protected Amazon S3 files require an AWS access key and secret. You can configure MotherDuck using [CREATE SECRET](/sql-reference/motherduck-sql-reference/create-secret.md) ### SSL certificate verification and S3 bucket names Because of SSL certificate verification requirements, S3 bucket names that contain dots (.) cannot be accessed using virtual-hosted style URLs. This is due to AWS's SSL wildcard certificate (*.s3.amazonaws.com) which only validates single-level subdomains. When a bucket name contains dots, it creates multi-level subdomains that don't match the wildcard pattern, causing SSL verification to fail. If your bucket name contains dots, you have two options: 1. **Rename your bucket** to remove dots (e.g., use dashes instead) 2. 
**Use path-style URLs** by adding the `URL_STYLE 'path'` option to your secret: ```sql CREATE OR REPLACE SECRET my_secret IN MOTHERDUCK ( TYPE s3, URL_STYLE 'path', SCOPE 's3://my.bucket.with.dots' ); ``` For more information, see [Amazon S3 Virtual Hosting documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html). --- Source: https://motherduck.com/docs/key-tasks/sharing-data/updating-shares # Updating shares > Learn about data replication timing, checkpoints, and how to ensure your latest data is available in shares and read-only Ducklings. ## Data replication speed **Use this when you need to:** Understand how quickly data changes become available in shares and read-only Ducklings. **Prerequisites:** You should have shares or read-only Ducklings configured in your MotherDuck environment. **You'll know you're done when:** You understand the timing characteristics and can optimize data availability when needed. MotherDuck automatically replicates data to shares and read-only Ducklings with the following timing characteristics: ### Auto-updated shares For shares configured with auto-update enabled, MotherDuck polls for new data **once per minute**. When new data is detected, it becomes available in the share after the next checkpoint occurs. ### Checkpoints and data availability Data is written to shares whenever there is a checkpoint. Checkpoints occur automatically based on your database's configuration. Starting with DuckDB 1.5, checkpoints run in the background, so reads, writes, and deletes can continue while a checkpoint is in progress. For read scaling Ducklings, you can force a snapshot using [`CREATE SNAPSHOT`](/sql-reference/motherduck-sql-reference/create-snapshot/) to make data available sooner. 
### SQL For read scaling Ducklings, to force a snapshot and make data immediately available: ```sql CREATE SNAPSHOT OF ; ``` **Expected result:** A new read-only snapshot is created, ensuring read scaling connections can access the most up-to-date data. **Use case:** Run this when you need to ensure the latest data is available to read scaling Ducklings immediately. **Important:** This command will wait for any ongoing write queries to complete and prevent new ones from starting during snapshot creation. ### UI 1. Navigate to your database in the MotherDuck interface 2. Look for snapshot options in the database management section 3. Trigger a snapshot to ensure your latest data is available in read scaling Ducklings immediately **Expected result:** Your latest data becomes immediately available in all read scaling Ducklings. ### Read-only Ducklings Data replication to read-only Ducklings within the same account follows the same timing as shares - data becomes available after checkpoints, with polling occurring once per minute for auto-updated configurations. ## Manual share updates **Use this when you need to:** Publish recent changes from your database to make them available in the share. **Prerequisites:** You must be the owner of the share and have made changes to the source database since the last share update. **You'll know you're done when:** The share reflects the latest version of your database and the last updated timestamp changes. Sharing a database creates a point-in-time snapshot of the database at the time it is shared. To publish changes, you need to explicitly run `UPDATE SHARE `. When updating a `SHARE` with the same database, the URL does not change. ### SQL ```sql UPDATE SHARE ; ``` **Example:** Database 'mydb' was previously shared by creating a share 'myshare', and the database 'mydb' has been updated since. 
The owner wants colleagues to receive the latest version:

```sql
-- 'myshare' was previously created on the database 'mydb'
UPDATE SHARE "myshare";
```

**Expected result:** The share is updated with the latest data from the source database.

**Recovery:** If you lose your share URL, use the `LIST SHARES` command to list all your shares, or `DESCRIBE SHARE ` to get the details of a specific share.

## Refreshing shared data (consumer side)

**Use this when you need to:** Get the most up-to-date data from a share or read scaling Duckling after the producer has made updates.

**Prerequisites:** You must have attached a share or be connected to a read scaling Duckling.

**You'll know you're done when:** Your local copy reflects the latest data from the producer.

By default, shares and read scaling Ducklings _automatically sync every minute_. If you need the latest data sooner, you can refresh manually after the producer runs their update command.

### Complete workflow for maximum freshness

For the freshest possible data, follow this two-step process:

1. **Producer side:** Either wait for normal checkpoints or force an update
2.
**Consumer side:** Run `REFRESH DATABASE` to pull the latest changes ### Read-scaling workflow **Producer (writer connection):** ```sql -- Make your changes INSERT INTO my_db.my_table VALUES (...); -- Option 1: Wait for normal checkpoint (automatic) -- Data becomes available after the next checkpoint occurs -- Option 2: Force a snapshot to make data immediately available CREATE SNAPSHOT OF my_db; ``` **Consumer (read scaling connection):** ```sql -- Refresh to get the latest snapshot REFRESH DATABASES; -- Refreshes all connected databases and shares -- OR REFRESH DATABASE my_db; -- Refresh just one specific database ``` ### Share workflow **Producer (share owner):** ```sql -- Make your changes INSERT INTO my_db.my_table VALUES (...); -- Option 1: Wait for normal checkpoint (automatic) -- Data becomes available after the next checkpoint occurs -- Option 2: Force a share update to make data immediately available UPDATE SHARE "myshare"; ``` **Consumer (share recipient):** ```sql -- Refresh to get the latest share data REFRESH DATABASES; -- Refreshes all connected databases and shares -- OR REFRESH DATABASE my_share; -- Refresh just one specific share ``` ### Understanding the refresh output When you run `REFRESH DATABASES`, you'll see output showing which databases were refreshed: ```sql REFRESH DATABASES; ┌─────────┬───────────────────┬──────────────────────────┬───────────┐ │ name │ type │ fully_qualified_name │ refreshed │ │ varchar │ varchar │ varchar │ boolean │ ├─────────┼───────────────────┼──────────────────────────┼───────────┤ │ my_db │ motherduck │ md:my_db │ false │ │ myshare │ motherduck share │ md:_share/myshare/uuid │ true │ └─────────┴───────────────────┴──────────────────────────┴───────────┘ ``` The `refreshed` column shows `true` for databases that were successfully refreshed with new data. Learn more about [`REFRESH DATABASE`](/sql-reference/motherduck-sql-reference/refresh-database.md). 
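
If you run `REFRESH DATABASES` from application code, a small helper can pick out the entries that actually received new data. The following TypeScript sketch assumes the result rows arrive as plain objects whose keys match the column names in the output above. The `RefreshRow` type and the `refreshedNames` function are illustrative names, not part of any MotherDuck client, and how you obtain the rows depends on your driver.

```typescript
// Row shape assumed from the columns shown in the sample REFRESH DATABASES output.
interface RefreshRow {
  name: string;
  type: string;
  fully_qualified_name: string;
  refreshed: boolean;
}

// Hypothetical helper: returns the names of databases and shares
// whose `refreshed` column is true.
function refreshedNames(rows: RefreshRow[]): string[] {
  return rows.filter((row) => row.refreshed).map((row) => row.name);
}

// Sample rows copied from the output above.
const sampleRows: RefreshRow[] = [
  { name: "my_db", type: "motherduck", fully_qualified_name: "md:my_db", refreshed: false },
  { name: "myshare", type: "motherduck share", fully_qualified_name: "md:_share/myshare/uuid", refreshed: true },
];

// Only "myshare" was refreshed in the sample output.
console.log(refreshedNames(sampleRows));
```

An empty result means nothing changed since the last sync, which is normal when the producer has not checkpointed or updated the share yet.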
--- Source: https://motherduck.com/docs/key-tasks/cloud-storage/writing-to-s3 # Writing Data to Amazon S3 > Export data from MotherDuck to Amazon S3 or transform S3 files in place. You can use MotherDuck to transform files on Amazon S3 or export data from MotherDuck to Amazon S3. :::note MotherDuck supports several cloud storage providers, including [Azure](/integrations/cloud-storage/azure-blob-storage.mdx), [Google Cloud](/integrations/cloud-storage/google-cloud-storage.mdx) and [Cloudflare R2](/integrations/cloud-storage/cloudflare-r2). ::: MotherDuck supports the [DuckDB dialect](https://duckdb.org/docs/guides/import/s3_export) to write data to Amazon S3. The examples here write data in Parquet format. For other formats and options, refer to the [documentation for DuckDB's COPY command](https://duckdb.org/docs/stable/sql/statements/copy.html). ## Syntax ```sql COPY
TO 's3:///[]/'; ``` ## Example usage ```sql -- write entire ducks_table table to parquet file in S3 COPY ducks_table to 's3://ducks_bucket/ducks.parquet'; -- writing the output of a query will also work COPY (SELECT * FROM ducks_table LIMIT 100) to 's3://ducks_bucket/ducks_head.parquet'; ``` --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/drizzle # Connect from Drizzle via Postgres endpoint > Use Drizzle as a typed wrapper around the pg driver to query MotherDuck via the Postgres wire protocol [Drizzle](https://orm.drizzle.team/) is a TypeScript ORM with both relational and SQL-like query APIs. It runs in Node.js servers, Vercel functions, Cloudflare Workers, and other edge runtimes. You can use Drizzle with MotherDuck through the Postgres endpoint. Drizzle's `drizzle-orm/node-postgres` integration wraps the `pg` driver, so you get the typed `db.execute(sql\`...\`)` API and connection lifecycle management on top of the same Postgres-protocol connection covered in [Connect from Node.js](./nodejs.md). Use Drizzle here as a **typed query executor over `pg`**, not as a schema-and-migrations ORM. Drizzle's schema introspection, code-first migrations (`drizzle-kit pull` / `migrate` / `push`), and query-builder code generation all assume a Postgres backend with `pg_catalog` and Postgres DDL semantics — none of which the pg endpoint exposes. Define your MotherDuck schema separately (DuckDB client, MotherDuck UI, or SQL scripts) and use Drizzle for query execution. For connection parameters, SSL options, and limitations, see the [Postgres Endpoint reference](/sql-reference/postgres-endpoint). ## Prerequisites You'll need a [MotherDuck access token](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck). 
Set it as an environment variable: ```bash export MOTHERDUCK_TOKEN="your_token_here" ``` Install Drizzle and `pg`: ```bash npm install drizzle-orm pg npm install --save-dev @types/pg ``` ## Connect Wrap a `pg` client with `drizzle()`. As with the bare `pg` client, pass SSL through the config object — do **not** put `sslrootcert=system` in a connection string, since node-postgres tries to read `system` as a file path and throws `ENOENT`. ```ts import pg from "pg"; import { drizzle } from "drizzle-orm/node-postgres"; import { sql } from "drizzle-orm"; const client = new pg.Client({ host: "pg.us-east-1-aws.motherduck.com", port: 5432, user: "postgres", password: process.env.MOTHERDUCK_TOKEN, database: "md:", ssl: { rejectUnauthorized: true }, }); await client.connect(); const db = drizzle(client); const { rows } = await db.execute(sql` SELECT title, score FROM sample_data.hn.hacker_news WHERE type = ${'story'} LIMIT 10 `); console.log(rows); await client.end(); ``` Using `md:` as the database name connects to your default database in `workspace` [attach mode](key-tasks/authenticating-and-connecting-to-motherduck/attach-modes/attach-modes.md), so all databases attached in your MotherDuck workspace are accessible. To connect to a specific database, pass its name in `database` (e.g., `database: "my_db"`) — this uses `single` attach mode by default. The `sql` template tag is what you'll use most. It produces parameterized queries against the pg endpoint and lets you write DuckDB SQL directly, including three-part names (`database.schema.table`), DuckDB functions, and DuckDB-specific syntax. For pure dynamic SQL with no parameters, `sql.raw("...")` works too. ## Read scaling and concurrency For concurrent workloads, MotherDuck's pg endpoint can route each session to a separate read replica using the `session_hint` startup option — this dramatically improves throughput under concurrency. 
See [Session affinity and routing](/concepts/scaling-patterns/#session-affinity-and-routing) for the underlying scaling pattern. Drizzle's `Pool` doesn't expose per-connection startup options, so for read scaling you'll want a raw `pg.Client` per session: ```ts const client = new pg.Client({ host: "pg.us-east-1-aws.motherduck.com", port: 5432, user: "postgres", password: process.env.MOTHERDUCK_TOKEN, database: "md:", ssl: { rejectUnauthorized: true }, options: "-c session_hint=user_1", // unique per concurrent session }); await client.connect(); const db = drizzle(client); ``` In benchmarking, `session_hint` cut 5-user concurrent latency from ~16s to ~1.3s on the same workload. ## What doesn't work The pg endpoint speaks DuckDB SQL, not Postgres SQL, and doesn't expose Postgres system catalogs. Drizzle features that depend on either will fail: - **`drizzle-kit migrate`, `push`, `generate`** — these execute Postgres DDL and assume Postgres migration tracking. Manage your MotherDuck schema separately. - **`drizzle-kit pull` / `introspect`** — schema introspection queries `pg_catalog` tables that don't exist on the pg endpoint. - **`pgTable(...)` schema definitions for query-builder calls** (`db.select().from(...)`) work for simple cases but are brittle: Drizzle treats the table name as a single quoted identifier, so three-part DuckDB names (`database.schema.table`) need careful handling. Prefer `db.execute(sql\`...\`)` with explicit SQL until you know the shape you need. - **Standard pg endpoint limits** — local-file `COPY`, `INSTALL` / `LOAD`, `SET`, temp tables, and result-creation commands are not supported. See the [main pg endpoint reference](/sql-reference/postgres-endpoint) for the full list. ## SSL notes Setting `ssl: { rejectUnauthorized: true }` is the equivalent of `sslmode=verify-full` with `sslrootcert=system` in libpq — node-postgres uses Node's built-in trusted root store. 
For a custom CA, see the [Node.js page](./nodejs.md#ssl-notes); the same approach applies when wrapping the client with `drizzle()`. For more details on SSL options across drivers, see [SSL and certificate verification](/sql-reference/postgres-endpoint#ssl-and-certificate-verification). --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/multithreading-and-parallelism # Multithreading and parallelism > Run concurrent queries against MotherDuck, and learn when to use Read Scaling or the Postgres endpoint instead of managing connection pools. Most applications don't need to manage threads or connection pools to get good concurrency from MotherDuck. The DuckDB client and MotherDuck's architecture cover the cases that connection pooling traditionally solved. This page explains what to reach for instead. ## You probably don't need a connection pool DuckDB clients in Python, Go, R, JDBC, and ODBC keep a single database instance cached by database path, and minting connections off that instance is cheap. Because of this, external connection pools are usually unnecessary, and they can work against the instance cache rather than with it. Within a single process, share one connection and create lightweight copies per thread instead of opening a new instance for every query. In Python, a single connection object [is not thread-safe](https://duckdb.org/docs/api/python/overview.html#using-connections-in-parallel-python-programs), so call `.cursor()` to get a copy for each thread. See [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck.md#multiple-connections-and-the-database-instance-cache) for how the instance cache works and how to reuse connections. 
For background on when concurrency improves performance, see the DuckDB documentation on [concurrency](https://duckdb.org/docs/stable/connect/concurrency.html) and [parallelism](https://duckdb.org/docs/guides/performance/how_to_tune_workloads.html#parallelism-multi-core-processing). ## Run many concurrent read-only queries To serve a high volume of concurrent read-only queries against the same database, use a [Read Scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) token. Read scaling replicas handle the fan-out, so you don't have to coordinate a pool of connections yourself. ## Use the Postgres endpoint for connection pooling If your application relies on a connection-pooling library, or you need to manage the connection lifecycle beyond a single DuckDB instance, connect through the [Postgres endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint). It speaks the PostgreSQL wire protocol, so it works as a drop-in replacement with standard pooling libraries. --- Source: https://motherduck.com/docs/key-tasks/database-operations/time-travel # Querying historical data with time travel > Use MotherDuck snapshots to query past database states, compare data across time periods, debug pipeline issues, reproduce reports, and create audit checkpoints. MotherDuck's [snapshot system](/concepts/snapshots) automatically captures your database state whenever you insert, delete, or update rows in a table, or create a new table. This means you can query your database as it existed at any point within your [retention window](/concepts/snapshots#snapshot-retention): this is called **time travel**, though there is no flux capacitor involved. Unlike the traditional backup strategy of copy-paste and restore workflows, time travel lets you read historical data directly alongside your current data without modifying anything. 
This guide covers practical patterns for querying historical database states: - [**Compare data across time periods**](#comparing-data-across-time-periods) — Diff today vs. yesterday, detect changed records, and spot anomalies - [**Debug data pipeline issues**](#debugging-data-pipeline-issues) — Find exactly when and how bad data entered your system - [**Reproduce past reports**](#reproducing-past-reports) — Re-run a query against the exact data a dashboard showed last week - [**Create audit checkpoints**](#creating-audit-checkpoints-with-named-snapshots) — Preserve database state at key moments for compliance and regulatory needs :::info Prerequisites Time travel requires a paid plan with `snapshot_retention_days` > 0. See [snapshot features per plan](/concepts/snapshots#snapshot-features-per-plan) for details. ::: ## Try it yourself: sample data setup The examples in this guide all use the same `shop_db` database. Run the following to create it and follow along. ```sql CREATE DATABASE IF NOT EXISTS shop_db; USE shop_db; -- Customers table CREATE OR REPLACE TABLE customers AS SELECT * FROM (VALUES (1, 'Alice Johnson', 'alice@example.com', 'US-West', '2025-11-01'::DATE), (2, 'Bob Smith', 'bob@example.com', 'US-East', '2025-11-15'::DATE), (3, 'Carol Williams', 'carol@example.com', 'EU-West', '2025-12-01'::DATE) ) AS t(customer_id, name, email, region, created_at); -- Orders table CREATE OR REPLACE TABLE orders AS SELECT * FROM (VALUES (101, 1, 250.00, '2026-01-15'::DATE, 'completed'), (102, 2, 89.99, '2026-01-16'::DATE, 'completed'), (103, 3, 450.00, '2026-01-20'::DATE, 'completed'), (104, 1, 125.50, '2026-02-01'::DATE, 'completed'), (105, 2, 67.25, '2026-02-10'::DATE, 'completed'), (106, 3, 215.75, '2026-02-14'::DATE, 'pending'), (107, 1, 175.00, '2026-02-15'::DATE, 'pending') ) AS t(order_id, customer_id, amount, order_date, status); ``` Now create a snapshot to mark this as a known-good baseline: ```sql CREATE SNAPSHOT baseline OF shop_db; ``` To simulate changes 
over time (for testing the examples below), apply some modifications and snapshot again:

```sql
-- Simulate a data update: customer email change + new customer
UPDATE customers SET email = 'alice.j@newdomain.com' WHERE customer_id = 1;
INSERT INTO customers VALUES (6, 'Dave Miller', 'dave@example.com', 'US-East', '2026-02-16');

-- Simulate a pipeline issue: accidentally delete some orders
DELETE FROM orders WHERE order_id IN (106, 107);

-- Insert a new order
INSERT INTO orders VALUES (108, 6, 95.00, '2026-02-16', 'pending');

CREATE SNAPSHOT after_changes OF shop_db;
```

You now have two named snapshots (`baseline` and `after_changes`) you can use with the patterns below.

## Core pattern: clone a point-in-time snapshot

The fundamental time travel pattern is to create a temporary database from a historical snapshot, then query it alongside your current data:

```sql
-- Create a zero-copy clone of your database at a past point in time
CREATE DATABASE shop_db_yesterday FROM shop_db (
    SNAPSHOT_NAME 'baseline'
);

-- Query the historical clone
SELECT * FROM shop_db_yesterday.main.orders;
```

To avoid storing data unnecessarily, drop the clone when you are done with it:

```sql
DROP DATABASE shop_db_yesterday;
```

The clone is a [zero-copy clone](/concepts/database-concepts/#motherduck-architectural-concepts), so no data is duplicated. It points to the same underlying storage objects as the source database.

To see what snapshots are available and find the right timestamp, query:

```sql
SELECT snapshot_id, created_ts, active_bytes
FROM md_information_schema.database_snapshots
WHERE database_name = 'shop_db'
ORDER BY created_ts DESC
LIMIT 10;
```

## Comparing data across time periods

Your operations team notices that order volume looks off this morning. Rather than waiting for a full data audit, you can instantly diff today's data against yesterday's snapshot to find new records, deleted rows, or unexpected changes. This is useful for anomaly detection, daily change tracking, and operational monitoring.
```sql
-- Clone yesterday's state
CREATE DATABASE shop_yesterday FROM shop_db (
    SNAPSHOT_NAME 'baseline'
    -- or use a time-based reference: SNAPSHOT_TIME '2026-02-15 00:00:00'
);

-- Find new customers added since yesterday
SELECT c.customer_id, c.name, c.created_at
FROM shop_db.main.customers c
ANTI JOIN shop_yesterday.main.customers y
    ON c.customer_id = y.customer_id;

-- Compare daily order totals
SELECT 'today' AS period, count(*) AS order_count, sum(amount) AS total_revenue
FROM shop_db.main.orders
WHERE order_date = CURRENT_DATE
UNION ALL
SELECT 'yesterday' AS period, count(*) AS order_count, sum(amount) AS total_revenue
FROM shop_yesterday.main.orders
WHERE order_date = CURRENT_DATE - INTERVAL 1 DAY;

-- Detect changed records (e.g. email updates)
SELECT c.customer_id, y.email AS old_email, c.email AS new_email
FROM shop_db.main.customers c
JOIN shop_yesterday.main.customers y
    ON c.customer_id = y.customer_id
WHERE c.email != y.email;

DROP DATABASE shop_yesterday;
```

## Debugging data pipeline issues

A dashboard that was showing correct numbers yesterday is now off. You suspect a pipeline run corrupted or dropped data, but you're not sure when it happened. Time travel lets you clone the database at a known-good point and compare it to the current state to find exactly which records disappeared, changed, or were introduced incorrectly.
```sql
-- List recent snapshots to narrow down the issue
SELECT snapshot_id, created_ts, active_bytes
FROM md_information_schema.database_snapshots
WHERE database_name = 'shop_db'
  AND created_ts >= '2026-02-14 00:00:00'
ORDER BY created_ts;
```

```sql
-- Clone the database at a known-good time
CREATE DATABASE shop_before FROM shop_db (
    SNAPSHOT_ID 'b1ecf2f3-4567-8901-b23f-45c67890b12'
);

-- Compare row counts to spot unexpected changes
SELECT 'before' AS state, count(*) AS row_count, count(DISTINCT customer_id) AS unique_customers
FROM shop_before.main.orders
UNION ALL
SELECT 'current' AS state, count(*) AS row_count, count(DISTINCT customer_id) AS unique_customers
FROM shop_db.main.orders;

-- Find records that disappeared
SELECT b.order_id, b.customer_id, b.amount, b.order_date
FROM shop_before.main.orders b
ANTI JOIN shop_db.main.orders c
    ON b.order_id = c.order_id;

DROP DATABASE shop_before;
```

## Reproducing past reports

A stakeholder asks "why did last week's revenue report show different numbers?" Instead of guessing what data has changed since then, you can clone the exact database state from when the report ran and re-execute the same query. This is also useful for validating past analyses, debugging metric discrepancies, and ensuring reproducibility of historical results.
```sql
-- Recreate the database state from last Tuesday morning
CREATE DATABASE shop_last_tuesday FROM shop_db (
    SNAPSHOT_NAME 'baseline'
    -- or use a time-based reference: SNAPSHOT_TIME '2026-02-15 00:00:00'
);

-- Re-run the same report query against the historical state
SELECT region, sum(amount) AS total_revenue, count(DISTINCT customer_id) AS active_customers
FROM shop_last_tuesday.main.orders o
JOIN shop_last_tuesday.main.customers c USING (customer_id)
WHERE order_date BETWEEN '2026-02-01' AND '2026-02-09'
GROUP BY region
ORDER BY total_revenue DESC;

DROP DATABASE shop_last_tuesday;
```

## Creating audit checkpoints with named snapshots

Regulatory audits, end-of-quarter financial reviews, and legal discovery often require proof of what data looked like at a specific moment. [Named snapshots](/concepts/snapshots#2-named-snapshots) let you preserve the exact database state at key business milestones. Unlike automatic snapshots, named snapshots are not subject to garbage collection; they persist until you explicitly remove them. This feature is available on the Business plan.
```sql
-- Create a named snapshot at end-of-quarter close
CREATE SNAPSHOT q1_2026_close OF shop_db;

-- Months later, an auditor needs to verify the numbers
CREATE DATABASE audit_q1 FROM shop_db (
    SNAPSHOT_NAME 'q1_2026_close'
);

-- Re-run the audit query against the exact data from that moment
SELECT c.region, count(*) AS order_count, sum(o.amount) AS total_revenue
FROM audit_q1.main.orders o
JOIN audit_q1.main.customers c USING (customer_id)
WHERE o.order_date BETWEEN '2026-01-01' AND '2026-03-31'
GROUP BY c.region;

DROP DATABASE audit_q1;
```

To manage your named snapshots:

```sql
-- List all named snapshots
SELECT snapshot_id, snapshot_name, database_name, created_ts
FROM md_information_schema.database_snapshots
WHERE snapshot_name IS NOT NULL;

-- Rename a snapshot
ALTER SNAPSHOT q1_2026_close SET snapshot_name = 'audit_fy2026_q1';

-- Remove a snapshot name (makes it subject to garbage collection)
ALTER SNAPSHOT old_checkpoint SET snapshot_name = '';
```

## Best practices

- **Clean up clones promptly.** Snapshot clones are zero-copy, but they may hold `historical_bytes` longer than necessary unless they are dropped. When the original database is deleted, the clone may still hold `retained_for_clone_bytes`.
- **Use `SNAPSHOT_TIME` for exploration, `SNAPSHOT_ID` for precision, `SNAPSHOT_NAME` for reusability.** When narrowing down a time range, timestamps are convenient. Once you've identified the exact snapshot, switch to the ID to avoid ambiguity. See [restoring a database to a historical snapshot](/concepts/data-recovery#restoring-a-database-to-a-historical-snapshot).
- **Set retention to match your needs.** Longer `snapshot_retention_days` gives you a wider time travel window but increases `historical_bytes` storage. See [snapshot retention](/concepts/snapshots#snapshot-retention).
- **Use named snapshots for fixed checkpoints.** Automatic snapshots are garbage-collected after the retention window.
For audit or compliance points that need to persist, create a [named snapshot](/concepts/snapshots#2-named-snapshots).

## See also

- [Database Snapshots](/concepts/snapshots): Snapshot types, retention, and plan availability
- [Data Recovery](/concepts/data-recovery): Step-by-step restore workflows
- [Storage Lifecycle](/concepts/storage-lifecycle): How historical bytes affect your storage bill
- [`CREATE DATABASE FROM`](/sql-reference/motherduck-sql-reference/create-database): Clone from a snapshot
- [`ALTER DATABASE SET SNAPSHOT`](/sql-reference/motherduck-sql-reference/alter-database-snapshot): Restore a database in-place

---

Source: https://motherduck.com/docs/key-tasks/cloud-storage/s3-import-best-practices

# S3 import best practices

> Optimize file size, format, and layout in Amazon S3 for fast, cost-effective data loading into MotherDuck.

Loading data from Amazon S3 is one of the fastest ways to get data into MotherDuck. Because MotherDuck runs queries against S3 directly from the cloud, the file layout in your bucket has a significant impact on loading speed and cost. This guide covers how to organize files in S3 for optimal performance.

For general loading advice (batch sizes, memory management, Duckling sizing), see [Loading data best practices](/key-tasks/loading-data-into-motherduck/considerations-for-loading-data/).

## Choose the right file format

Parquet is the best format for most S3 imports. It compresses well, includes schema metadata, and lets DuckDB read only the columns and row groups it needs.
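
As a minimal sketch of what reading only the needed columns and row groups can look like, the query below selects two columns and filters on a date, which lets DuckDB skip the remaining columns and any row groups whose statistics rule out the filter. The bucket, path, and column names (`order_id`, `amount`, `order_date`) are hypothetical and only illustrate the shape of such a query.

```sql
-- Hypothetical bucket, path, and columns, for illustration only.
-- Naming the columns explicitly (instead of SELECT *) limits which
-- Parquet column chunks are fetched; the WHERE clause lets DuckDB
-- skip row groups that cannot contain matching rows.
SELECT order_id, amount
FROM read_parquet('s3://my-bucket/orders/*.parquet')
WHERE order_date >= DATE '2025-03-01';
```

With CSV or JSON, the same query would generally need to read each file in full, which is one reason Parquet tends to be the better default for larger imports.
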

| Format | Best for | Avoid when |
|--------|----------|------------|
| **Parquet** | Most workloads, large files, production pipelines | Files under ~1 MB, where metadata overhead outweighs benefits |
| **CSV** | Small files (under 5 MB), quick exploration, simple schemas | Large datasets, complex types, multi-line text |
| **JSON** | Small files (under 5 MB), semi-structured data, API responses | Large files without a known schema (schema discovery is slow) |

:::tip
For very small files (under ~1 MB), CSV or JSON can be faster than Parquet because Parquet's metadata and footer add overhead that outweighs the compression benefits at small sizes. However, avoid the "small files problem", where the bottleneck becomes listing and reading many small files with the same schema that could have been combined into one or more larger Parquet files.
:::

### Parquet settings

When writing Parquet files destined for MotherDuck:

- **Compression**: Use Snappy (default) or ZSTD. Snappy offers faster decompression; ZSTD gives better compression ratios for cold storage.
- **Row group size**: Aim for 100K-1M rows per row group. DuckDB processes row groups in parallel, so multiple groups per file improve throughput.
- **Column encoding**: Leave this at the writer's default. DuckDB and most Parquet writers choose efficient encodings automatically.

## Optimize file size

File size is the single most impactful factor for S3 import performance. Files that are too small create per-file overhead (HTTP requests, file listing, metadata parsing). Files that are too large limit parallelism.

| File size | Impact |
|-----------|--------|
| **Under 1 MB** | Too small. Per-file overhead dominates. Merge small files into larger ones. |
| **1-10 MB** | Acceptable for low-volume or infrequent loads. |
| **10-256 MB** | Optimal range. Good balance of parallelism and minimal overhead. |
| **Over 256 MB** | Still works well into the multi-gigabyte range, but DuckDB can only parallelize within a single file by row group. |

:::tip
Aim for **10-256 MB per file** in Parquet format. If your pipeline produces many small files (for example, one file per API call or per minute), batch them before writing to S3 or use a compaction step to merge them periodically.
:::

### Row count guidelines

Row count guidelines follow from file size, but as a rough reference:

| Rows per file | Typical file size (Parquet) | Recommendation |
|---------------|----------------------------|----------------|
| Under 1,000 | Under 100 KB | Too small, merge files |
| 1,000-100,000 | 100 KB - 10 MB | Acceptable for small tables |
| 100,000-10,000,000 | 10 MB - 500 MB | Optimal range |
| Over 10,000,000 | Over 500 MB | Consider splitting into multiple files |

## Organize your S3 bucket

A consistent file layout in S3 makes it easier to load data incrementally and query subsets efficiently.

### Use Hive-style partitioning for large datasets

If your dataset is large and you query it by date or category, partition your files using Hive-style paths:

```text
s3://my-bucket/events/year=2025/month=03/data.parquet
s3://my-bucket/events/year=2025/month=04/data.parquet
```

DuckDB automatically detects Hive partitioning and prunes partitions during queries:

```sql
SELECT *
FROM read_parquet('s3://my-bucket/events/**/*.parquet', hive_partitioning=true)
WHERE year = 2025 AND month = 3;
```

### Use consistent naming conventions

- Use lowercase paths (MotherDuck URLs are case-sensitive)
- Avoid dots in bucket names (causes [SSL issues](/key-tasks/cloud-storage/querying-s3-files/#ssl-certificate-verification-and-s3-bucket-names))
- Include timestamps or sequence numbers in file names for incremental loads:

```text
s3://my-bucket/orders/orders_20250323_001.parquet
s3://my-bucket/orders/orders_20250323_002.parquet
```

## Set up continuous loading from S3

For pipelines that continuously land files in
S3, keep these guidelines in mind:

### Loading frequency

| Frequency | Recommendation |
|-----------|----------------|
| **Under 1 minute** | Not recommended. Per-file overhead and small file sizes make this inefficient. Consider [DuckLake](/docs/integrations/file-formats/ducklake/) instead, which inlines data until the batch is big enough to write to a file. |
| **1-5 minutes** | Possible for time-sensitive workloads, but files will be small. Ensure each file is at least 1 MB. |
| **5-15 minutes** | Good balance of freshness and file size for most use cases. |
| **Hourly or daily** | Ideal for batch workloads. Produces well-sized files with minimal overhead. |

:::tip
If your source system produces data continuously, buffer at least 5-15 minutes of data before writing a file to S3. This produces files in the optimal 10-256 MB range and avoids the small-file problem.
:::

### Incremental loading pattern

For incremental loads, use a landing zone pattern:

1. Land new files in an `incoming/` prefix
2. Load them into MotherDuck with a timestamp filter or file listing
3. Move processed files to a `processed/` prefix

```sql
-- Load new files from the incoming prefix
INSERT INTO my_table
SELECT * FROM read_parquet('s3://my-bucket/incoming/*.parquet');
```

For more complex incremental workflows with state management, use an [ingestion tool](#use-ingestion-tools-for-production-pipelines).

## Use ingestion tools for production pipelines

For production pipelines that need scheduling, error handling, retries, and schema evolution, use a dedicated ingestion tool rather than writing raw SQL scripts. Many tools support MotherDuck as a destination and handle S3 file management automatically.

**Ingestion tools with MotherDuck support:**

- [dlt (data load tool)](/integrations/ingestion/dlt/) supports loading from APIs, databases, and files into MotherDuck with automatic schema evolution
- [Streamkap](/integrations/ingestion/streamkap/) provides real-time CDC from databases to MotherDuck

**Orchestration tools** like Dagster, Airflow, Prefect, and Kestra can schedule S3-to-MotherDuck pipelines.

Browse the full list of [ingestion](https://motherduck.com/ecosystem/?category=Ingestion) and [orchestration](https://motherduck.com/ecosystem/?category=Orchestration) tools in the MotherDuck ecosystem.

## Colocate data with MotherDuck

MotherDuck connects to S3 directly from the cloud, so network distance between your S3 bucket and MotherDuck's region matters.

- MotherDuck is available in **US East (N. Virginia)** (`us-east-1`), **US West (Oregon)** (`us-west-2`), and **Europe (Frankfurt)** (`eu-central-1`)
- Place your S3 bucket in the **same region** as your MotherDuck organization for best performance

## Summary

| Area | Recommendation |
|------|----------------|
| **File format** | Parquet for most workloads; CSV/JSON for files under 1 MB |
| **File size** | 10-256 MB per file |
| **Row count** | 100K-10M rows per file |
| **Loading frequency** | 5-15 minutes minimum; hourly or daily for batch |
| **Partitioning** | Hive-style for large, time-series datasets |
| **Region** | Same region as your MotherDuck organization |
| **Production pipelines** | Use a dedicated ingestion or orchestration tool |

---

Source: https://motherduck.com/docs/key-tasks/running-hybrid-queries

# Running dual execution (or hybrid) queries

> Query local and cloud data together using MotherDuck's dual execution hybrid query engine.

MotherDuck can use local data and remote data in the same query. The editors on this page connect to your `my_db` MotherDuck database, so you can run each example against your own account.
"Local" data in these examples comes from an inline `VALUES` clause, which DuckDB evaluates in the browser; the sales table lives in MotherDuck, so the planner runs it remotely.

## Create a remote sales table

The editor below writes to `my_db.main.remote_sales_table`. The preview shows what the second statement returns; run the query to materialize the table in your own account.

In your own CLI or notebook you can use any database name. For example, run `CREATE OR REPLACE DATABASE remote_db;` followed by `CREATE TABLE remote_db.sales AS ...`.

## Join local and remote data

The query below joins inline pricing data (local) with the sales table you created above (remote) to produce revenue by month. DuckDB executes the `VALUES` clause locally and reads `remote_sales_table` from MotherDuck.

Preview of the result (revenue by month):

| mo | rev |
|----|-----|
| 2024-09-01 | 9241.39 |
| 2024-10-01 | 14226.12 |
| 2024-11-01 | 13136.55 |
| 2024-12-01 | 7783.26 |

## Inspect the hybrid query plan

Prefix the query with `EXPLAIN` to see which operators run locally and which run on MotherDuck. The editor renders DuckDB's JSON plan as a tree; each operator carries an `L` (local) or `R` (remote) tag and the JOIN branches into its two inputs.
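
As a rough sketch, an `EXPLAIN` statement for a hybrid join of this shape could look like the following. The column names `item`, `dt`, `tally`, and `price` and the `price > 2` filter are taken from the plan preview on this page; the inline pricing values and the table aliases are hypothetical, and the query used by the editor may differ.

```sql
-- Sketch only: the pricing rows below are made-up illustration values.
EXPLAIN
SELECT
    date_trunc('month', s.dt) AS mo,
    sum(p.price * s.tally)    AS rev
FROM my_db.main.remote_sales_table AS s          -- remote: stored in MotherDuck
JOIN (
    VALUES ('widget', 2.50), ('gadget', 4.00), ('gizmo', 1.25)
) AS p(item, price)                              -- local: inline VALUES clause
    ON s.item = p.item
WHERE p.price > 2
GROUP BY mo
ORDER BY mo;
```

Removing the `EXPLAIN` keyword runs the query itself instead of printing its plan.
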

Preview of the physical plan (`L` = local, `R` = remote):

- `DOWNLOAD_SOURCE` (L): bridge_id 1
  - `BATCH_DOWNLOAD_SINK` (R): bridge_id 1, parallel
    - `ORDER_BY` (R): Order By `date_trunc('month', sales.dt) ASC`
      - `HASH_GROUP_BY` (R): Groups `#0`, Aggregates `sum(#1)`
        - `PROJECTION` (R): Projections `mo`, `price * tally`
          - `HASH_JOIN` (R): Join Type INNER, Conditions `item = item`
            - `SEQ_SCAN` (R): Table `remote_sales_table`, Projections `item`, `dt`, `tally`
            - `UPLOAD_SOURCE` (R): bridge_id 2
              - `BATCH_UPLOAD_SINK` (L): bridge_id 2, parallel
                - `SEQ_SCAN` (L): Table `pricing (VALUES)`, Projections `item`, `price`, Filters `price > 2`

Data is transferred between local and remote with matching pairs of sinks and sources, identified by `bridge_id`.

A dual execution (or hybrid) query can run on any database format supported by DuckDB, including [sqlite](https://duckdb.org/docs/stable/core_extensions/sqlite), [postgres](https://duckdb.org/docs/stable/core_extensions/postgres.html) and many others.

---

Source: https://motherduck.com/docs/key-tasks/database-operations/copying-databases

# Copying MotherDuck and DuckDB Databases

> Duplicate databases between MotherDuck cloud and local DuckDB using COPY FROM DATABASE.

The `COPY FROM DATABASE` statement creates an exact duplicate of an existing database, including both schema and data.
This functionality enables the following operations:

[Interact with MotherDuck Databases](#copy-a-motherduck-database-to-a-motherduck-database)

- Copy between MotherDuck databases

[Interact with Local Databases](#interacting-with-local-databases)

- Import local database to MotherDuck
- Export MotherDuck database to local filesystem
- Copy between local databases

The `COPY FROM DATABASE` command is implemented as a multi-statement macro, which is not supported in WebAssembly. As a result, simultaneous schema and data copying is not available in the MotherDuck Web UI. However, the Web UI supports copying schema only (`SCHEMA` option) or data only (`DATA` option). All functionality is available in other drivers, including the DuckDB CLI.

:::caution No zero-copy clone
`COPY FROM DATABASE` creates a *physical* copy of both the schema and the data. It **does not** use MotherDuck's zero-copy cloning, so the operation may take longer to run and will consume additional storage proportional to the size of the source database.
:::

## Syntax

The syntax for `COPY FROM DATABASE` is:

```sql
COPY FROM DATABASE <source_database> TO <target_database> [ (SCHEMA) | (DATA) ]
```

### Parameters

- `<source_database>`: The name or path of the source database to copy from
- `<target_database>`: The name or path of the target database to create
- `(SCHEMA)`: Optional parameter to copy only the database schema without data
- `(DATA)`: Optional parameter to copy only the database data without schema

## Example Usage

### Copy a MotherDuck database to a MotherDuck database

This is the same as [creating a new database from an existing one](/sql-reference/motherduck-sql-reference/create-database.md).

```sql
COPY FROM DATABASE my_db TO my_db_copy;
```

### Interacting with Local Databases

These operations require access to the local filesystem, for example from inside the DuckDB CLI.

#### Copy a local database to a MotherDuck database

```sql
ATTACH 'local_database.db';
ATTACH 'md:';
CREATE DATABASE md_database;
COPY FROM DATABASE local_database TO md_database;
```

#### Copy a MotherDuck database to a local database

Copying a MotherDuck database to a local database requires some extra steps.

```sql
ATTACH 'md:';
ATTACH 'local_database.db' AS local_db;
COPY FROM DATABASE my_db TO local_db;
```

#### Copy a local database to a local database

To copy a local database to a local database, see the [DuckDB documentation](https://duckdb.org/docs/stable/sql/statements/copy.html#copy-from-database--to).

### Copying the Database Schema

```sql
COPY FROM DATABASE my_db TO my_db_copy (SCHEMA);
```

This will copy the schema of the database, but not the data.

### Copying the Database Data

```sql
COPY FROM DATABASE my_db TO my_db_copy (DATA);
```

This will copy the data of the database, but not the schema.

---

Source: https://motherduck.com/docs/key-tasks/data-warehousing/replication/flat-files

# Replicating flat files to MotherDuck

> Load CSV, Parquet, and JSON files into MotherDuck from local storage or cloud sources.

This guide shows simple examples of loading data from flat file sources into MotherDuck. Examples are shown for both the MotherDuck Web UI and the DuckDB CLI.

To install the DuckDB CLI, [check out the instructions first.](/getting-started/interfaces/connect-query-from-duckdb-cli)

## CSV

### MotherDuck UI

From the UI, follow these steps:

1. Navigate to the **Add Data** section.
2. Select the file. This file will be uploaded into your browser so that it can be queried by DuckDB.
3. Execute the generated query which will create a table for you.
    1. Modify the query as needed to suit the correct Database / Schema / Table name.

### DuckDB CLI

In the CLI, you can load a CSV file using the `read_csv` function.
For example:

### Local file

```sql
CREATE TABLE my_table AS
SELECT * FROM read_csv('path/to/local_file.csv');
```

### S3 file

To load from S3, ensure your DuckDB instance is configured with [S3 secrets](/documentation/integrations/cloud-storage/amazon-s3.mdx). Then:

```sql
CREATE TABLE my_table AS
SELECT * FROM read_csv('s3://bucket-name/path-to-file.csv');
```

## JSON

### MotherDuck UI

From the UI, follow these steps:

1. Navigate to the **Add Data** section.
2. Select the file. This file will be uploaded into your browser so that it can be queried by DuckDB.
3. Execute the generated query which will create a table for you.
    1. Modify the query as needed to suit the correct Database / Schema / Table name.

### DuckDB CLI

In the CLI, use the `read_json` function to load JSON files.

### Local file

```sql
CREATE TABLE my_table AS
SELECT * FROM read_json('path/to/local_file.json');
```

### S3 file

Make sure S3 support is enabled as described in the [S3 secrets documentation](/documentation/integrations/cloud-storage/amazon-s3.mdx).

```sql
CREATE TABLE my_table AS
SELECT * FROM read_json('s3://bucket-name/path-to-file.json');
```

:::tip Provide a schema for large or deeply nested JSON
When loading large JSON files, DuckDB scans the data to discover the schema during query planning. For deeply nested or complex JSON, this can add significant time.
To speed things up, provide the schema directly with the `columns` parameter:

```sql
CREATE TABLE my_table AS
SELECT * FROM read_json(
    'path/to/local_file.json',
    columns={
        id: 'BIGINT',
        name: 'VARCHAR',
        amount: 'DECIMAL(10,2)'
    }
);
```

If you already have a table with the right schema, use `INSERT INTO` instead of `CREATE TABLE AS`. DuckDB skips schema discovery when the target schema is known:

```sql
INSERT INTO my_table
SELECT * FROM read_json('path/to/local_file.json');
```

You can also limit how deep DuckDB looks into nested structures with `maximum_depth`, or reduce the number of sampled objects with `sample_size` (default: 20480). See the [DuckDB JSON documentation](https://duckdb.org/docs/stable/data/json/loading_json) for all available options.
:::

## Parquet

### MotherDuck UI

From the UI, follow these steps:

1. Navigate to the **Add Data** section.
2. Select the file. This file will be uploaded into your browser so that it can be queried by DuckDB.
3. Execute the generated query which will create a table for you.
    1. Modify the query as needed to suit the correct Database / Schema / Table name.

### DuckDB CLI

In the CLI, use the `read_parquet` function to load Parquet files.

### Local file

```sql
CREATE TABLE my_table AS
SELECT * FROM read_parquet('path/to/local_file.parquet');
```

### S3 file

Ensure S3 support is enabled as described in the [S3 secrets documentation](/documentation/integrations/cloud-storage/amazon-s3.mdx).

```sql
CREATE TABLE my_table AS
SELECT * FROM read_parquet('s3://bucket-name/path-to-file.parquet');
```

## Handling more complex workflows

Production use cases tend to be much more complex and include things like incremental builds and state management. In those scenarios, take a look at our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion), which include many options, some of which offer native Python support. An overview of the MotherDuck Ecosystem is shown below.

![Diagram](../../../img/md-diagram.svg)

---

Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-postgres

# From a PostgreSQL or MySQL Database

> Learn to load a table from your PostgreSQL or MySQL database into MotherDuck.

## Using PostgreSQL or MySQL DuckDB extensions

DuckDB's [PostgreSQL extension](https://duckdb.org/docs/extensions/postgres.html) and [MySQL extension](https://duckdb.org/docs/extensions/mysql.html) make it easy to connect to OLTP databases and copy data into MotherDuck from a DuckDB client running on your own machine or compute resource.

In this guide we demonstrate the workflow with PostgreSQL. Consult the [DuckDB MySQL extension documentation](https://duckdb.org/docs/extensions/mysql) to adapt the same pattern for MySQL.

:::info
MotherDuck does not yet support the PostgreSQL and MySQL extensions, so you need to perform the following steps on your own computer or cloud computing resource. We are working on supporting the PostgreSQL extension on the server side so that this can happen within the MotherDuck app in the future with improved performance.
:::

### Prerequisites

- **PostgreSQL Database Credentials**: Ensure you have access details to the PostgreSQL database, including host address, port, and user credentials. You can put the user credentials in the [PostgreSQL Password File](https://www.postgresql.org/docs/current/libpq-pgpass.html), [store them in environment variables](https://duckdb.org/docs/extensions/postgres.html#configuring-via-environment-variables), or pass them inline in the script below.
- **Network Connectivity**: Your machine must be able to connect to the target PostgreSQL database.
- **MotherDuck Credentials**: MotherDuck credentials should be set up. If not, follow the steps in [Authenticating to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/authenticating-to-motherduck.md).
- **DuckDB**: Either the DuckDB command-line interface or Python + the DuckDB package should be installed and operational. See the [Getting Started tutorials](../../getting-started/getting-started.mdx) for instructions to install DuckDB.

### Steps

The following SQL script installs and loads DuckDB's PostgreSQL extension, tunes a few settings that matter for larger bulk loads, and copies one PostgreSQL table into the MotherDuck table `my_db.pg_data_schema.first_pg_table`.

Fill in the placeholders `<dbname>`, `<host>`, `<user>`, `<password>`, `<schema>`, and `<table>` with the appropriate values and save the script to a file, for example `ingest_data_from_postgres.sql`.

```sql
INSTALL postgres;
LOAD postgres;

-- Tune the local DuckDB client for a larger initial load.
SET threads = 8;
SET memory_limit = '8GB';
SET pg_connection_limit = 8;
SET pg_pages_per_task = 250;

-- Connect to MotherDuck.
ATTACH 'md:';
USE my_db;

-- Optionally create a schema. By default MotherDuck uses the main schema.
CREATE SCHEMA IF NOT EXISTS pg_data_schema;

-- Ingest data from PostgreSQL to a MotherDuck table.
CREATE OR REPLACE TABLE pg_data_schema.first_pg_table AS
SELECT *
FROM postgres_scan(
    'dbname=<dbname> host=<host> port=5432 user=<user> password=<password> connect_timeout=10',
    '<schema>',
    '<table>'
);

-- Optional: verify the number of rows in the MotherDuck table.
SELECT count(1) FROM pg_data_schema.first_pg_table;
```

If you only want to smoke-test the connection first, add `LIMIT 1000` to the `SELECT` before running the full load.

### Best practices

Here are a few tips to keep larger PostgreSQL loads predictable.

#### Run DuckDB close to both systems

This workflow is client-side, so the DuckDB client becomes the data mover. Run DuckDB on a machine with a good network path to both PostgreSQL and MotherDuck, and use separate client compute when possible instead of competing with the production PostgreSQL instance for the same RAM.

#### Tune scan parallelism explicitly

Start with `SET threads = <n>` and `SET memory_limit = '<size>'`, then tune `pg_connection_limit` and `pg_pages_per_task` for your source table. For larger tables, start with `pg_connection_limit` in the `4-8` range and `pg_pages_per_task` in the `250-1000` range rather than relying on defaults.

::::warning[Watch Out]
Increasing `pg_connection_limit` can increase pressure on the source PostgreSQL instance. If PostgreSQL memory or connection pressure climbs, reduce `pg_connection_limit` before reducing DuckDB `threads`.
::::

#### Reduce each statement's working set

The DuckDB side of this workflow is typically streaming rather than loading the full source table into RAM. Out-of-memory risk is usually driven more by the source PostgreSQL instance and the host's overall headroom than by DuckDB itself. Select only the schema and columns you need, and attach PostgreSQL with `READ_ONLY` if you use `ATTACH` instead of `postgres_scan`.

#### Keep credentials out of long-lived scripts

Use PostgreSQL environment variables, the PostgreSQL password file, or DuckDB secrets instead of embedding credentials directly in production scripts.

#### Load in chunks

For very large tables, break the initial load into ranges and insert them one chunk at a time.
```sql INSTALL postgres; LOAD postgres; SET threads = 8; SET memory_limit = '8GB'; SET pg_connection_limit = 8; SET pg_pages_per_task = 250; ATTACH 'md:'; USE my_db; CREATE SCHEMA IF NOT EXISTS pg_data_schema; CREATE TABLE IF NOT EXISTS pg_data_schema.first_pg_table AS SELECT * FROM postgres_scan( 'dbname= host= port=5432 user= password= connect_timeout=10', '', '
' ) WHERE 1 = 0; INSERT INTO pg_data_schema.first_pg_table SELECT * FROM postgres_scan( 'dbname= host= port=5432 user= password= connect_timeout=10', '', '
' ) WHERE updated_at >= TIMESTAMP '2026-01-01' AND updated_at < TIMESTAMP '2026-02-01'; ``` Repeat the `INSERT` statement for each key range or time window until the backfill is complete. If you need recurring replication, change data capture (CDC), or production orchestration, prefer a dedicated ingestion partner over a one-off client-side script. ### Run with DuckDB CLI After filling out the placeholders, you can either execute the statements line by line in the DuckDB CLI, or save the commands in a file, for example `ingest_data_from_postgres.sql`, and run: ```sh > duckdb < ingest_data_from_postgres.sql ``` ### Run with Python You can also execute it using Python with the DuckDB package. ```python import duckdb with open("ingest_data_from_postgres.sql", 'r') as f: s = f.read() duckdb.sql(s) ``` After completing these steps, you should see the new table show up in the MotherDuck Web UI. ## Using MotherDuck ingestion partners MotherDuck collaborates with various integration partners to facilitate data transfer in diverse ways—including change data capture (CDC)—from your PostgreSQL or MySQL database to MotherDuck. For example, you can refer to our [Estuary guide](https://motherduck.com/blog/streaming-data-to-motherduck/) that demonstrates how to stream data from Neon, a PostgreSQL-based database, to MotherDuck. To explore the full range of solutions tailored to your needs, visit our [MotherDuck ecosystem partners page](https://motherduck.com/ecosystem/). --- Source: https://motherduck.com/docs/key-tasks/database-operations/detach-and-reattach-motherduck-database # Detach and re-attach a MotherDuck database > Temporarily disconnect from a MotherDuck database using DETACH and reconnect with ATTACH. After [creating a remote MotherDuck database](/sql-reference/motherduck-sql-reference/create-database.md), the [`DETACH` command](/sql-reference/motherduck-sql-reference/detach.md) may be used to detach it. 
This will prevent access and modifications to the database until it is re-attached using the [`ATTACH` command](/sql-reference/motherduck-sql-reference/attach.md). This pattern can be used to isolate queries and changes to a specific set of databases. Note that this is a convenience feature and not a security feature, as a MotherDuck database may be reattached at any time. Database shares behave slightly differently than non-shared databases, so if you want to `ATTACH` and `DETACH` shares, please have a look at how to [manage shared MotherDuck databases](/key-tasks/sharing-data/sharing-data.mdx). ## Creating, detaching, and re-attaching a database This guide will show how to `CREATE`, `DETACH`, and `ATTACH` a database using the CLI and the UI. ### CLI ```sql CREATE DATABASE my_new_md_database; DETACH my_new_md_database; ATTACH 'my_new_md_database'; -- OR ATTACH 'md:my_new_md_database'; ``` ### UI To create a database, add a new cell and enter the SQL command `CREATE DATABASE `. Click the Run button. ![create_database](./img/create_database.png) Click on the menu of the database you would like to detach and select `Detach`. ![detach_database](./img/detach_database.png) The database will be moved to the "Detached Databases" section of the object explorer. ![detached_databases](./img/detached_databases.png) To re-attach, click on the menu of the database in the "Detached Databases" section and select `Attach`. ![attach_database](./img/attach_database.png) The database will be returned to the "My Databases" section. ![my_databases_post_attach](./img/my_databases_post_attach.png) ## Show All Databases To see all databases, both attached and detached, use the [`SHOW ALL DATABASES` command](/sql-reference/motherduck-sql-reference/show-databases.md). 
### CLI ```sql SHOW ALL DATABASES; ``` Example output: ```bash ┌──────────────────────────────────────────┬─────────────┬──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐ │ alias │ is_attached │ type │ fully_qualified_name │ │ varchar │ boolean │ varchar │ varchar │ ├──────────────────────────────────────────┼─────────────┼──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤ │ TEST_DB_02d6fc2158094bd693b6f285dbd402f7 │ true │ motherduck │ md:TEST_DB_02d6fc2158094bd693b6f285dbd402f7 │ │ TEST_DB_62b53d968a4f4b6682ed117a7251b814 │ true │ motherduck │ md:TEST_DB_62b53d968a4f4b6682ed117a7251b814 │ │ base │ false │ motherduck │ md:base │ │ base2 │ true │ motherduck │ md:base2 │ │ db1 │ false │ motherduck │ md:db1 │ │ integration_test_001 │ false │ motherduck │ md:integration_test_001 │ │ my_db │ true │ motherduck │ md:my_db │ │ my_share_1 │ true │ motherduck share │ md:_share/integration_test_001/18d6dbdb-e130-4cdf-97c4-60782ed5972b │ │ sample_data │ false │ motherduck │ md:sample_data │ │ source_db │ true │ motherduck │ md:source_db │ │ test_db_115 │ false │ motherduck │ md:test_db_115 │ │ test_db_28d │ false │ motherduck │ md:test_db_28d │ │ test_db_cc9 │ false │ motherduck │ md:test_db_cc9 │ │ test_share │ true │ motherduck share │ md:_share/source_db/b990b424-2f9a-477a-b216-680a22c3f43f │ │ test_share_002 │ true │ motherduck share │ md:_share/integration_test_001/06cc5500-e49a-4f62-9203-105e89a4b8ae │ ├──────────────────────────────────────────┴─────────────┴──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┤ │ 15 rows (15 shown) 4 columns │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` --- Source: 
https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-via-postgres-endpoint # Loading data via the Postgres endpoint > Best practices for loading data into MotherDuck efficiently when you are connected through the Postgres endpoint. MotherDuck's Postgres endpoint is a good thin-client loading path when your application, BI tool, or serverless runtime already speaks PostgreSQL and you want to run SQL in MotherDuck without installing a DuckDB client. It is best suited to server-side loading from remote data sources. :::tip Best practice If your files already live in object storage or are available over HTTPS, use the Postgres endpoint to run `CREATE TABLE AS SELECT` or `INSERT INTO ... SELECT` and let MotherDuck read the files remotely. ::: If your data is on your laptop, application server disk, or in a local DuckDB file, a DuckDB client path is usually a better fit. In that case, either: - Upload the files to object storage first, then load them remotely through the Postgres endpoint. - Use a DuckDB client path instead, such as `duckdb`, Python DuckDB, or another DuckDB client connected to `md:`. ## Recommended patterns ### Load directly from cloud storage or HTTPS This is the preferred pattern for the Postgres endpoint. The examples below use public sample files so you can run them directly. ```sql CREATE OR REPLACE TABLE my_db.main.orders_raw AS SELECT * FROM read_parquet( 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet', MD_RUN = REMOTE ); ``` You can use the same approach with CSV or JSON: ```sql CREATE OR REPLACE TABLE my_db.main.weather_events AS SELECT * FROM read_csv( 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv', HEADER = true, AUTO_DETECT = true, MD_RUN = REMOTE ); ``` This keeps the work inside MotherDuck and avoids sending rows one statement at a time over the Postgres wire. 
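As a rough illustration of the pattern above, the following Python sketch builds the `CREATE OR REPLACE TABLE ... AS SELECT` statement for a remote file and picks `read_parquet`, `read_csv`, or `read_json` from the file extension. The helper name `build_remote_ctas` and the extension mapping are assumptions made for this example, not part of MotherDuck. The sketch only produces SQL text; you would send the result through whichever PostgreSQL client you already use.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to DuckDB reader function.
READERS = {
    ".parquet": "read_parquet",
    ".csv": "read_csv",
    ".json": "read_json",
}


def build_remote_ctas(table: str, url: str) -> str:
    """Return a CTAS statement that reads `url` with MD_RUN = REMOTE.

    `table` is assumed to be a trusted, fully qualified identifier;
    it is interpolated as-is and not escaped.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    # Double any single quotes so the URL stays a valid SQL string literal.
    safe_url = url.replace("'", "''")
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM {READERS[suffix]}('{safe_url}', MD_RUN = REMOTE);"
    )


print(build_remote_ctas(
    "my_db.main.orders_raw",
    "https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet",
))
```

Reader options such as `HEADER = true` are left out to keep the sketch short; add them to the generated statement if your files need them.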
### Load into a staging table, then transform For repeatable pipelines, stage the raw data first and then publish into the final table. ```sql CREATE SCHEMA IF NOT EXISTS my_db.ingest; CREATE OR REPLACE TABLE my_db.ingest.orders_stage AS SELECT * FROM read_parquet( 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet', MD_RUN = REMOTE ); CREATE OR REPLACE TABLE my_db.main.orders_curated AS SELECT o_orderkey AS order_id, o_custkey AS customer_id, o_orderdate::TIMESTAMP AS order_ts, o_totalprice::DOUBLE AS total_amount FROM my_db.ingest.orders_stage; ``` This keeps ingestion and transformation separate, which makes validation, retries, and backfills easier. ### Batch rows if the data exists only in application memory If your source data exists only in application memory, use multi-row `INSERT` statements instead of row-by-row inserts. Recommended: ```sql CREATE OR REPLACE TABLE my_db.main.orders_batch ( id INTEGER, note VARCHAR, amount DOUBLE ); INSERT INTO my_db.main.orders_batch VALUES (1, 'a', 10.0), (2, 'b', 20.0), (3, 'c', 30.0); ``` Less efficient: ```sql INSERT INTO my_db.main.orders_batch VALUES (1, 'a', 10.0); INSERT INTO my_db.main.orders_batch VALUES (2, 'b', 20.0); INSERT INTO my_db.main.orders_batch VALUES (3, 'c', 30.0); ``` Single-row inserts create unnecessary round trips and are much slower for loading. When loading rows from an application: - fewer, larger batches - append-only staging tables - transactions that stay comfortably below a minute ## Use a DuckDB client path instead when The Postgres endpoint is not currently intended for workflows that depend on local DuckDB-client capabilities. Use a DuckDB client path instead when you need: - local-file `COPY` - `EXPORT DATABASE` - `IMPORT DATABASE` - `ATTACH ':memory:'` - `ATTACH '/path/to/file.duckdb'` - `CREATE DATABASE ... 
FROM '/path/to/file.duckdb'` - `MD_RUN = LOCAL` - `INSTALL` and `LOAD` In practice, that means the Postgres endpoint is not the primary interface for: - loading directly from local files - attaching local or in-memory DuckDB databases - extension-based workflows - local execution paths such as `MD_RUN = LOCAL` ## Protected cloud storage If you are loading from protected S3, GCS, R2, or Azure storage, make sure the required MotherDuck secret already exists. Cloud-storage secret creation requires DuckDB extension support and is not currently supported through the Postgres endpoint. The recommended workflow is: 1. Create the secret using a DuckDB client path or another supported MotherDuck workflow. 2. Then use the Postgres endpoint to run the load query. ## Decision guide | Situation | Best approach | |---|---| | Files already in S3, GCS, R2, Azure, or public HTTPS | Use `read_parquet`, `read_csv`, or `read_json` with `MD_RUN = REMOTE` over the Postgres endpoint | | Data is local on your machine | Prefer a DuckDB client path, or upload the files to object storage first | | Data exists only in app memory and volume is modest | Use explicit large multi-row `INSERT` batches over the Postgres endpoint | | Very large local bulk load | Use a DuckDB client path instead | ## Summary For the best mix of throughput and simplicity: 1. Write source files as Parquet when you can. 2. Put them in object storage close to your MotherDuck region. 3. Use the Postgres endpoint to run `CREATE TABLE AS SELECT` or `INSERT INTO ... SELECT` with `MD_RUN = REMOTE`. 4. Stage first, validate row counts and schemas, then publish into the final table. 
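The multi-row batching recommended earlier for data that exists only in application memory can be sketched as follows. This is a minimal illustration, assuming a PostgreSQL driver that uses `%s` placeholders (as psycopg does). The helper name `batched_inserts` and the default batch size are hypothetical choices, and the sketch only builds statements and parameter lists without connecting anywhere. Whether your driver and the endpoint handle parameterized statements of a given size is something to verify in your own environment.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def batched_inserts(
    table: str,
    rows: Iterable[Sequence],
    batch_size: int = 1000,
) -> Iterator[Tuple[str, List]]:
    """Yield (sql, params) pairs, one multi-row INSERT per batch of rows.

    `table` is assumed to be a trusted identifier; row values are passed
    as parameters rather than interpolated into the SQL text.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        width = len(batch[0])
        # One "(%s, %s, ...)" group per row in the batch.
        row_group = "(" + ", ".join(["%s"] * width) + ")"
        sql = f"INSERT INTO {table} VALUES " + ", ".join([row_group] * len(batch))
        params = [value for row in batch for value in row]
        yield sql, params


rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
for sql, params in batched_inserts("my_db.main.orders_batch", rows, batch_size=2):
    print(sql, params)
```

Each yielded pair corresponds to one round trip, so three rows with `batch_size=2` produce two statements instead of three.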
## Related pages - [Postgres Endpoint reference](/sql-reference/postgres-endpoint) - [Loading data best practices](./considerations-for-loading-data.mdx) - [From cloud storage or HTTPS](./loading-data-from-cloud-or-https.md) - [From your local machine](./loading-data-from-local-machine.md) - [Loading a DuckDB database](./loading-duckdb-database.md) - [Connect from Python via Postgres endpoint](/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint/python) --- Source: https://motherduck.com/docs/key-tasks/customer-facing-analytics/customer-facing-analytics # Build a customer-facing analytics app > Build customer-facing analytics applications with read scaling tokens and isolated tenant data. To build your first application with **Customer-Facing Analytics (CFA)** on MotherDuck, use this overview as a starting point. You'll know you're done when: - Each of your customer tenants (or organizations) has its own service account and database(s) in MotherDuck. - Your application can query customer-specific analytics data with predictable performance and isolation. - You understand which detailed guide to follow next for implementation. Use this overview to choose a **tenancy model** and learn the building blocks before the step-by-step 3-tier guide. ## Customer provisioning Every [Duckling](https://motherduck.com/blog/scaling-duckdb-with-ducklings/) is an isolated bucket of compute. For Customer-Facing Analytics, this usually means: - Each **customer tenant or organization** has **one service account** dedicated to serving analytics (and often also ingestion and transformation). - Your backend mediates all access; customers typically do not log into MotherDuck directly. You manage service accounts and tokens using: - [`users-create-service-account`](/sql-reference/rest-api/users-create-service-account/) – create a service account per customer tenant. 
- [`users-create-token`](/sql-reference/rest-api/users-create-token/) – create tokens for ingestion and read workloads. With accounts and tokens in place, you can: - Create databases under each service account. - Load data into those databases using your orchestrator. - Use dedicated read tokens from your application to serve analytics. For a concrete example of this pattern in a 3-tier web app, see the **[CFA Guide](/key-tasks/customer-facing-analytics/3-tier-cfa-guide/)**. ## Data modeling and loading One database per customer tenant or organization scales cleanly because: - Each database is tied to a tenant's service account. - Each tenant's workloads are isolated from the others. - You can scale Duckling (compute instance) sizes independently based on tenant needs using [different sizes (Pulse, Standard, etc)](/about-motherduck/billing/duckling-sizes/). You can also: - Use a single "landing" service account to ingest raw data from upstream systems. - Use [ATTACH](/sql-reference/motherduck-sql-reference/attach.md) and [zero-copy cloning](/key-tasks/sharing-data/sharing-overview/#consuming-shared-data) to fan that data out into per-customer databases owned by their respective service accounts. High-level patterns for data pipelines: ```mermaid graph LR; A[Source Systems]-->D[(Landing Database)]:::database; D-->F[(Transform & Clone)]:::database; F-->G[(Customer DB A)]:::database; F-->H[(Customer DB B)]:::database; F-->I[(Customer DB C)]:::database; subgraph App E[Serve Analytics] end G-->E; H-->E; I-->E; ``` Check out the detailed [Builder's Guide](/key-tasks/customer-facing-analytics/3-tier-cfa-guide/) for instructions on loading data into per-customer MotherDuck databases and orchestrating customer-facing analytics pipelines. 
## Other considerations Since MotherDuck [Shares](/key-tasks/sharing-data/sharing-overview/) are read-only, in more real-time scenarios it may make sense to use: - [`CREATE SNAPSHOT`](/sql-reference/motherduck-sql-reference/create-snapshot/) to force a checkpoint on the writer. - [`REFRESH DATABASE`](/sql-reference/motherduck-sql-reference/refresh-database/) to get the latest version of the data on the reader. This pattern can help enforce consistency between writer and reader databases that power your customer-facing dashboards. For high-scale, high-concurrency applications, MotherDuck offers [Read Scaling Replicas](https://motherduck.com/blog/read-scaling-preview/) for applications that send hundreds or thousands of queries in a few seconds, such as BI tools or busy embedded dashboards. Read replicas: - Can be created and modified in the UI. - Can be managed using the [MotherDuck REST API](/sql-reference/rest-api/motherduck-rest-api/). - Follow the same consistency considerations as Shares, and can be checkpointed and refreshed more frequently if needed. When you're ready to implement a full 3-tier architecture with per-customer service accounts, scheduled data loading, and a backend API, continue to the [**Customer-Facing Analytics Guide**](/key-tasks/customer-facing-analytics/3-tier-cfa-guide/). --- Source: https://motherduck.com/docs/key-tasks/data-warehousing/replication/spreadsheets # Using Excel and Google Sheets data in MotherDuck > Load Excel and Google Sheets data into MotherDuck using the DuckDB CLI or HTTPS CSV export URLs. Key bits of data and side schedules often exist in spreadsheets like Excel and Google Sheets. It is useful to add that data to your data warehouse and query it. This guide shows how to perform this workflow using the DuckDB CLI for both [Excel](#microsoft-excel) and [Google Sheets](#google-sheets). :::tip To use these extensions, you will need to first install the DuckDB CLI. 
[Instructions can be found here](/getting-started/interfaces/connect-query-from-duckdb-cli).
:::

## Microsoft Excel

:::note
The purpose of this guide is to show you how to _load_ data from Excel into MotherDuck. If you'd like to _retrieve_ MotherDuck data in Excel, you can [follow this guide](/integrations/bi-tools/excel/).
:::

To read from an Excel spreadsheet, open the DuckDB CLI by typing `duckdb 'md:'` in your terminal. This will ask you for access to your MotherDuck account if you haven't already provided it.

You can read Excel files directly with `SELECT * FROM 'movies.xlsx'`, which automatically loads the DuckDB Excel extension. If you want more control, use [the `read_xlsx` function](https://duckdb.org/docs/stable/core_extensions/excel) directly.

```sql
SELECT * FROM read_xlsx('movies.xlsx', sheet = 'Action Movies');
```

The previous query returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

```sql
CREATE OR REPLACE TABLE my_db.main.my_movies AS -- use fully qualified table name
SELECT * FROM 'C:\users\documents\movies.xlsx';
```

Sometimes there is data in multiple tabs. In that case, you can use the `sheet` parameter to pass the tab name and, depending on the context, even union multiple tabs into a single table.
```sql CREATE OR REPLACE TABLE my_db.main.my_movies AS -- use fully qualified table name SELECT * FROM read_xlsx('C:\users\documents\movies.xlsx', sheet = 'Action Movies') UNION ALL SELECT * FROM read_xlsx('C:\users\documents\movies.xlsx', sheet = 'Romance Movies'); ``` ## Google Sheets ### Query Google Sheets as CSV over HTTPS If a Google Sheet is publicly accessible, or can be accessed with HTTP authentication, query it from MotherDuck with DuckDB's `read_csv()` function and the Google Sheets CSV export URL: ```sql SELECT * FROM read_csv( 'https://docs.google.com/spreadsheets/d//export?format=csv&gid=', MD_RUN = REMOTE ); ``` The `sheet_id` is the value between `/d/` and `/edit` in the Google Sheet URL. The `gid` identifies the worksheet tab. When you run this while connected to MotherDuck, the HTTPS read can execute server side in MotherDuck. To keep the spreadsheet queryable as live source data, create a view: ```sql CREATE OR REPLACE VIEW my_db.main.sheet_source AS SELECT * FROM read_csv( 'https://docs.google.com/spreadsheets/d//export?format=csv&gid=', MD_RUN = REMOTE ); ``` To snapshot the current spreadsheet data into MotherDuck, create a table instead: ```sql CREATE OR REPLACE TABLE my_db.main.sheet_snapshot AS SELECT * FROM read_csv( 'https://docs.google.com/spreadsheets/d//export?format=csv&gid=', MD_RUN = REMOTE ); ``` For private sheets, create an HTTP secret with an OAuth bearer token and scope it to Google Sheets: ```sql CREATE SECRET google_sheets_http IN MOTHERDUCK ( TYPE HTTP, SCOPE 'https://docs.google.com', EXTRA_HTTP_HEADERS MAP { 'Authorization': 'Bearer ' } ); ``` See the [DuckDB HTTP authentication documentation](https://duckdb.org/docs/current/core_extensions/httpfs/https#authenticating) for more `httpfs` authentication options. For more detail on this Google Sheets URL pattern, see [Swimming in Google Sheets with MotherDuck](https://motherduck.com/blog/google-sheets-motherduck/). 
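The URL pattern described above can be derived from a regular Google Sheets link. The following Python sketch is one way to do that, under stated assumptions: the helper name `csv_export_url` is hypothetical, the spreadsheet id `abc123` is a made-up placeholder, and falling back to `gid=0` when no tab is specified is an assumption of this example (it typically refers to the first tab, but check your own sheet).

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# The spreadsheet id is the path segment right after "/spreadsheets/d/".
SHEET_ID_PATTERN = re.compile(r"/spreadsheets/d/([^/]+)")


def csv_export_url(sheet_url: str, gid: Optional[str] = None) -> str:
    """Build the CSV export URL for a Google Sheets URL."""
    parsed = urlparse(sheet_url)
    match = SHEET_ID_PATTERN.search(parsed.path)
    if match is None:
        raise ValueError("No spreadsheet id found after /spreadsheets/d/")
    sheet_id = match.group(1)
    if gid is None:
        # Edit URLs usually carry the tab as "gid=<n>" in the fragment or query.
        for part in (parsed.fragment, parsed.query):
            values = parse_qs(part).get("gid")
            if values:
                gid = values[0]
                break
    if gid is None:
        gid = "0"  # Assumption: default to gid 0 when no tab is given.
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


# "abc123" is a placeholder id used only for illustration.
print(csv_export_url("https://docs.google.com/spreadsheets/d/abc123/edit#gid=42"))
```

The resulting string can be used as the first argument to `read_csv()` in the queries shown in this section.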
### Query with the Google Sheets extension

::::info
While the Excel extension is a core DuckDB extension, the Google Sheets extension is a community extension maintained by Evidence.
::::

The first step to handle Google Sheets is to install the [duckdb-gsheets](https://duckdb-gsheets.com/) extension. Run these commands after starting the DuckDB CLI with `duckdb 'md:'`:

```sql
INSTALL gsheets FROM community;
LOAD gsheets;
```

Since Google Sheets is a hosted application, we need to use [DuckDB Secrets](https://duckdb.org/docs/configuration/secrets_manager.html) to handle authentication. This is as simple as:

```sql
CREATE SECRET (TYPE gsheet);
```

:::note
This workflow requires interaction with a browser, so if you need to run it from a job (for example, Airflow or similar), consider setting up a [Google API access token](https://duckdb-gsheets.com/#getting-a-google-api-access-token).
:::

To read from a Google Sheet, we need at minimum the sheet id, which is found in the URL, for example `https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit`. The string between `d/` and `/edit` represents the spreadsheet id. It can therefore be queried with:

```sql
SELECT * FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');
```

The previous query returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

```sql
CREATE OR REPLACE TABLE my_db.main.my_table AS -- use fully qualified table name
SELECT * FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');
```

For convenience, the spreadsheet id itself can be queried as well.

```sql
SELECT * FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8');
```

To query data from a specific tab, pass the tab name with the `sheet` parameter.
```sql
SELECT * FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8', sheet='Sheet2');
```

For more detailed documentation, including writing to Google Sheets, review the [duckdb-gsheets documentation](https://duckdb-gsheets.com/#getting-a-google-api-access-token).

## Handling more complex workflows

Production use cases tend to be much more complex and include things like incremental builds & state management. In those scenarios, take a look at our [ingestion partners](https://motherduck.com/ecosystem/?category=Ingestion), which include many options, some of which offer native Python support. An overview of the MotherDuck Ecosystem is shown below.

![Diagram](../../../img/md-diagram.svg)

---

Source: https://motherduck.com/docs/key-tasks/customer-facing-analytics/3-tier-cfa-guide

# 3-tier customer-facing analytics guide

> Step-by-step guide to building a 3-tier customer-facing analytics application with MotherDuck.

To build a **Customer-Facing Analytics (CFA) application** on MotherDuck, use this step-by-step guide. This guide focuses on patterns for a traditional 3-tier architecture, but you can also run 1.5-tier apps using Wasm, as seen in the [1.5-tier architecture guide](/getting-started/customer-facing-analytics/#15-tier-architecture-duckdb-wasm).

You'll know you're done when:

- Your application (`B2B Tool`) can run analytics queries for a customer (`Goose Inc`) against MotherDuck from a backend service.
- Data from a transactional database is synced into a per-customer MotherDuck database on a schedule using your orchestrator.
- You understand when to add more service accounts, databases, and read scaling capacity as your product grows.

Use this guide when you want to:

- Build a 3-tier web app (browser → app server → MotherDuck) with embedded analytics.
- Use per-customer service accounts and databases to isolate data and compute.
- Keep analytics data in MotherDuck in sync with your transactional database.
Before starting, ensure you have: - A MotherDuck account and an organization you can use for development. - Basic familiarity with Python and SQL. - Access to a PostgreSQL database (or a test instance) with an `orders`-style schema. - Python installed locally (DuckDB is compatible with the latest Python LTS version). > This guide assumes you've read the conceptual overview [**Customer-Facing Analytics Getting Started**](/getting-started/customer-facing-analytics). ## 1. understand the 3-tier CFA architecture In this guide, you are building `B2B Tool`, a SaaS product that serves analytics to employees at many customer companies. Each customer company gets: - Its own **service account** in MotherDuck. - Its own **database(s)** for analytics tables. - Its own **compute** (Ducklings) for queries and data loading. Your high-level architecture: ```mermaid graph LR; subgraph Users["End Users"] U1{{"Kate (Goose Inc)"}}:::green; U2{{"John (Goose Inc)"}}:::green; U3{{"Hari (Duck Co)"}}:::green; end subgraph App["Your Application"] FE["Frontend"]; BE["Backend API"]; TX["Transactional DB"]; end MDORG["MotherDuck"]; U1 --> FE; U2 --> FE; U3 --> FE; FE -->|"HTTP / JSON APIs"| BE; BE -->|"User + Company lookup"| TX; BE -->|"Analytics queries"| MDORG; ``` [Hypertenancy](/concepts/hypertenancy) here means each company (`Goose Inc`, `Duck Co`) owns its MotherDuck database(s) (that store only that company's analytics data), that compute is isolated (each company has its own Ducklings) and heavy workloads for one customer cannot slow down others. You will: 1. Set up a dev organization and add other developers on the team. 2. Create a service account for your first customer company (`Goose Inc`). 3. Sync data from your transactional DB, such as Postgres, into Goose Inc’s MotherDuck analytical database using your chosen replication method. 4. Connect your backend service to MotherDuck with a **read token** to serve analytics queries. 5. 
Plan how to scale to many customer companies and higher concurrency. ### Alternative to per-customer service accounts The per-customer service account pattern is the strongest isolation model. Some teams, especially B2C or lighter multi-tenant apps, opt for a simpler setup: - Keep a **single writer service account** that owns all customer databases. - Create a **[read scaling token](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/)** for that account and configure the pool size to target one duckling per concurrent end user (default max 16, adjustable through support). For cost control, users can share a duckling, but that increases contention. - Have each end user connect in **[single attach mode](/key-tasks/authenticating-and-connecting-to-motherduck/attach-modes/)** to the one database they should see (`md:?attach_mode=single`), which avoids carrying other attachments from the workspace. - Use [`session_name`](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#session-names) in the connection string to keep an end user pinned to the same read scaling duckling for cache reuse and steadier latency. This model trades away service-account isolation in favor of operational simplicity. Ensure your security and compliance needs allow a shared service account before choosing it. Read scaling replicas are eventually consistent. If you need fresher reads on demand, combine `CREATE SNAPSHOT` on the writer with `REFRESH DATABASE` on the read scaling connections. Example connection string for an end user: ```text md:customer_db?attach_mode=single&session_name= ``` ## 2. set up your dev environment and organization Prepare your dev environment: 1. **Create your dev organization and account** 1. Go to `https://motherduck.com` and sign up or log in with your work email (for example, `manager@b2btool.com`). 2. Create or select an organization you’ll use for development (for example, `B2B Tool Co`). 3. 
In the MotherDuck UI, open the default database (`my_db`) and confirm you can run a simple query such as: ```sql SELECT 1; ``` You should see a single row with the value `1`. 2. **Upload a small CSV to confirm data ownership and access** 1. In the MotherDuck web UI, upload a small example CSV (for example, `orders_sample.csv`) into `my_db`. If this step is unclear, check out the [MotherDuck tutorial on loading data](/getting-started/e2e-tutorial/part-2/#loading-your-data). 2. Run a query like: ```sql SELECT COUNT(*) AS row_count FROM orders_sample; ``` You should see the number of rows you uploaded. 3. **Invite a second developer and share data** 1. Invite `devlead@b2btool.com` to your `B2B Tool Co` organization. 2. Create a new database in your personal account (for example, `b2btool_dev`) and copy or create a simple table. 3. Share that database with your colleague following the [**Sharing Data** guide](/key-tasks/sharing-data/sharing-overview/). 4. Ask your colleague to query the shared database from their account. At this point: - You have a dev org with two human users. - You’ve seen how database ownership and read-only sharing works. Conceptually, your dev setup looks like this: ```mermaid graph LR; DM["devlead@b2btool.com"] <-->|"read/write"| DB1[("DB: b2btool_dev")]:::database; DB1 -->|"read only"| DC{{Colleague}}:::green; ``` ## 3. create a service account for a customer company For customer-facing analytics, your customers usually do **not** log into MotherDuck directly. Instead: - Your application mediates access. - Each customer company gets a **service account** in your MotherDuck organization. - Your backend uses that service account’s tokens to load and query data. In this guide, you’ll create a service account for your first customer company: `Goose Inc`. ### 3.1 create a service account in the MotherDuck UI 1. In the MotherDuck UI, go to the **Service Accounts** section for your organization. 2. Click **Create Service Account**. 3. 
Name it something like `goose-inc-service-account`. 4. Save the generated access token in your secret manager or a secure store. For more detail, see [Create and configure service accounts](/key-tasks/service-accounts-guide/create-and-configure-service-accounts/). ### 3.2 (optional) create service accounts through REST API Later, you will likely automate service account creation. To create a service account programmatically: - Use the [`users-create-service-account`](/sql-reference/rest-api/users-create-service-account/) REST API endpoint. - Use the [`users-create-token`](/sql-reference/rest-api/users-create-token/) endpoint to create an access token for that service account. Your provisioning workflow should: **(1)** detect a new customer signup, **(2)** call `users-create-service-account` for that company, **(3)** call `users-create-token`, and **(4)** store the token metadata (or an alias) in your transactional database so your backend can look it up later. ## 4. model and load customer data in MotherDuck Next, populate data for `Goose Inc` into its own MotherDuck database. Assume: - Your transactional system (`B2B Tool`) uses PostgreSQL. - Each customer company is an e-commerce store with: - `orders` table: order-level facts. - `fulfillments` table: shipment or delivery events. 
Example schema: ```sql CREATE TABLE orders ( order_id BIGINT PRIMARY KEY, company_id BIGINT, order_date TIMESTAMP, customer_email TEXT, total_amount NUMERIC(18, 2), status TEXT ); CREATE TABLE fulfillments ( fulfillment_id BIGINT PRIMARY KEY, order_id BIGINT REFERENCES orders(order_id), fulfilled_at TIMESTAMP, carrier TEXT, status TEXT ); ``` Example data: ```sql INSERT INTO orders SELECT row_number() OVER () AS order_id, (random() * 9 + 1)::BIGINT AS company_id, current_timestamp - INTERVAL (random() * 365) DAY AS order_date, 'customer' || (random() * 999 + 1)::INT || '@example.com' AS customer_email, (random() * 9999 + 1)::NUMERIC(18, 2) AS total_amount, (['pending', 'processing', 'shipped', 'delivered', 'cancelled'])[(random() * 4)::INT + 1] AS status FROM range(1000); INSERT INTO fulfillments SELECT row_number() OVER () AS fulfillment_id, (random() * 999 + 1)::BIGINT AS order_id, current_timestamp - INTERVAL (random() * 300) DAY AS fulfilled_at, (['UPS', 'FedEx', 'USPS', 'DHL', 'Amazon Logistics'])[(random() * 4)::INT + 1] AS carrier, (['pending', 'in_transit', 'out_for_delivery', 'delivered', 'failed'])[(random() * 4)::INT + 1] AS status FROM range(1000); ``` :::info Use your [orchestrator](/integrations/orchestration/) and [ingestion tool](/integrations/ingestion/) to keep this data in sync for each customer company. ::: ### 4.1 create a MotherDuck database for `Goose Inc` Use the `Goose Inc` service account’s token to create a database for that customer: ```sql CREATE DATABASE goose_inc; ``` Run this in the UI after impersonating the `Goose Inc` service account or connect as that service account from Python and issue the `CREATE DATABASE` statement. :::note To move forward, replicate your data into `goose_inc`. [This page](/key-tasks/data-warehousing/replication/postgres/) shows a simple example for replicating a Postgres database to MotherDuck. ::: ## 5. 
run analytics queries from your backend

With data in Goose Inc’s MotherDuck database, your backend can run analytics queries. At a high level:

1. Your user (`Kate` at Goose Inc) logs into `B2B Tool`.
2. Your backend authenticates Kate and determines she belongs to the `Goose Inc` customer company.
3. Your backend looks up Goose Inc’s **read token** for its service account from your transactional database or secret store.
4. Your backend uses that read token to run analytics queries against the `goose_inc` database in MotherDuck.

### 5.1 create a read token for `Goose Inc`

For production, you’ll usually create a token dedicated to **reading** analytics data:

1. In the MotherDuck UI, impersonate the Goose Inc service account.
2. Create a new access token intended only for read workloads.
3. Store this token securely and associate it with Goose Inc in your transactional database.

You can also create tokens through the REST API using the [`users-create-token`](/sql-reference/rest-api/users-create-token/) endpoint.

### 5.2 connect from Python using DuckDB

Your backend service connects to MotherDuck using the DuckDB client and the `md:` connection string. Typically, you:

- Supply the Goose Inc read token, either through the `MOTHERDUCK_TOKEN` (or `motherduck_token`) environment variable or through the `motherduck_token` connection string parameter.
- Connect to the `goose_inc` database using DuckDB.

Example helper in your backend (for example, `analytics_client.py`):

```python
import os

import duckdb


def get_customer_connection(customer_id: str) -> duckdb.DuckDBPyConnection:
    """
    Get a DuckDB connection to a customer's MotherDuck database.

    Args:
        customer_id: Identifier for the customer (e.g., 'goose_inc', 'duck_co')

    Returns:
        DuckDB connection to the customer's database
    """
    # Normalize identifiers such as 'goose-inc' to the database name 'goose_inc'
    database = customer_id.replace("-", "_")

    # Look up the customer's read token from your secret store or environment
    # In production, you'd fetch this from your transactional DB or secret manager
    token_env_var = f"{database.upper()}_READ_TOKEN"
    read_token = os.environ.get(token_env_var)

    if not read_token:
        raise ValueError(f"Read token not found for customer: {customer_id}")

    # Pass the token in the connection string instead of overwriting the
    # process-wide MOTHERDUCK_TOKEN variable, so requests for different
    # customers served by the same process cannot pick up each other's tokens
    # Database name typically matches the customer_id
    return duckdb.connect(f"md:{database}?motherduck_token={read_token}")
```

Then, a simple analytics function in your API service:

```python
def get_customer_kpis(customer_id: str) -> list[dict]:
    conn = get_customer_connection(customer_id)
    query = """
        SELECT
          date_trunc('day', order_date) AS day,
          COUNT(*) AS orders_count,
          SUM(total_amount) AS gross_revenue
        FROM orders
        WHERE order_date >= current_date - INTERVAL 30 DAY
        GROUP BY 1
        ORDER BY 1
    """
    try:
        result = conn.execute(query).fetch_df()
    finally:
        # Release the connection even if the query fails
        conn.close()

    # Convert to a list of row dicts for your API layer to serialize
    return result.to_dict(orient="records")
```

Expose this from a REST endpoint such as `/api/customers/{customer_id}/kpis` and render the results in your frontend dashboards. The same code works for any customer by passing their identifier.

The runtime query flow looks like:

```mermaid
sequenceDiagram
    participant User as Kate (Goose Inc)
    participant FE as B2B Tool Frontend
    participant BE as B2B Tool Backend
    participant MD as MotherDuck (Goose Inc DB)
    User->>FE: Opens analytics dashboard
    FE->>BE: GET /api/customers/goose-inc/kpis
    BE->>BE: Lookup Goose Inc read token
    BE->>MD: Analytics query using DuckDB + md:goose_inc
    MD-->>BE: Result rows
    BE-->>FE: JSON KPIs
    FE-->>User: Render charts
```

## 6. scaling to many customer companies

As your product grows, add more customer companies. For each new company:

1.
**Create a service account** (through the UI or REST API). 2. **Create one or more databases** for that company’s analytics data. 3. **Configure your orchestrator** to run a `dlt` pipeline (or equivalent) for that company. 4. **Create a read token** for the company and store it in your transactional database. Your architecture naturally scales horizontally: ```mermaid graph LR; subgraph Org["Your MotherDuck Org"] SA1["Service Account: Goose Inc"]; SA2["Service Account: Swan Gmbh"]; SA3["Service Account: Duck Co"]; DB1[("DB: goose_inc")]:::db; DB2[("DB: swan_gmbh")]:::db; DB3[("DB: duck_co")]:::db; end SA1 --> DB1; SA2 --> DB2; SA3 --> DB3; ``` Each service account and database pair has its own compute, minimizing noisy neighbors and making performance a per-customer concern. ## 7. scaling a single customer to high concurrency When a customer (for example, `Goose Inc`) grows to hundreds or thousands of simultaneous users, use these levers: 1. **Increase the Duckling size** for the service account’s default compute Duckling to handle heavier transformation jobs (vertical scaling). 2. **Use read scaling** for high-concurrency read workloads: - Refer to [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) to create read scaling Ducklings for Goose Inc's read token. - Point your backend’s analytics queries at the read scaling token instead of the main read/write token. 3. **Optimize queries and models**: - Pre-aggregate frequently-used metrics. - Use summary tables to avoid scanning the full `orders` table on every request. For most applications, you start with a single Duckling per customer and introduce read scaling only when your monitoring shows sustained high concurrency or latency issues. ## 8. troubleshooting and when to add more service accounts As you operate your CFA deployment, you may run into several common situations. 
### 8.1 queries are slow or time out for one customer If you see slow queries or timeouts for a specific customer: - **Check query patterns**: - Are you scanning too much data on every request? - Can you pre-aggregate or cache common metrics? - **Scale compute for that customer**: - Increase the size for the service account’s Duckling. - Add read scaling Ducklings OR increase the Duckling size used for the read token used by that customer. You rarely need to change the number of service accounts in this case; focus on scaling and optimizing the existing one. ### 8.2 data loads interfere with reads If your hourly (or more frequent) data load jobs are locking tables and causing read queries to queue: - Consider: - Scheduling heavy load jobs during off-peak times. - Using zero-copy cloning (`CREATE SNAPSHOT` and `REFRESH DATABASE`) patterns so that readers query a snapshot database while writers update the primary. - Ensure you are using a **dedicated read token** and read scaling configuration for user-facing queries. ### 8.3 when to add more service accounts In most B2B scenarios: - You create **one service account per customer company**. - All users at that company share the same analytics data and compute through your application. You should consider adding **additional service accounts** when: - You need hard isolation between different environments (for example, separate service accounts for `Prod`, `Staging`, and `Sandbox` within the same customer). - A customer has sub-tenants of their own and you want to isolate compute and data at that sub-tenant level (for example, separate service accounts per region or per major business unit). When you add new service accounts: 1. Create the service account (UI or REST API). 2. Create dedicated databases for the new scope. 3. Create tokens and wire them into your application’s configuration. 
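The last step above, wiring tokens into your application's configuration, could be sketched roughly as follows. This is a minimal illustration and not part of any MotherDuck API: `TenantScope`, `resolve_read_token`, the `<customer>_<scope>` database naming, and the `*_READ_TOKEN` environment variable convention are all hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TenantScope:
    """One service account scope: a customer plus an environment or sub-tenant."""

    customer: str  # e.g. "goose_inc"
    scope: str     # e.g. "prod" or "staging" (hypothetical scope names)

    @property
    def database(self) -> str:
        # Assumed naming convention: <customer>_<scope>
        return f"{self.customer}_{self.scope}"

    @property
    def token_env_var(self) -> str:
        # Assumed convention for where this scope's read token is stored
        return f"{self.database.upper()}_READ_TOKEN"


def resolve_read_token(tenant: TenantScope, env: Mapping[str, str] = os.environ) -> str:
    """Look up the read token for a scope, failing loudly if it is not configured."""
    token = env.get(tenant.token_env_var)
    if not token:
        raise LookupError(f"No read token configured for {tenant.database}")
    return token
```

Keeping the naming rules in one place means a new service account only needs a new `TenantScope` entry plus its stored token. In production the `env` mapping would more likely be backed by your secret manager than by process environment variables.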
### 8.4 common token and permission issues If you see authentication or permission errors: - **Token expired or revoked**: - Rotate the token in MotherDuck and update your secret store. - **Permission denied on database or table**: - Confirm that the service account owns the database or has the necessary privileges. - Re-check sharing settings if you are using shared data. ## 9. next steps Once you have a basic 3-tier CFA deployment working: - **Automate provisioning**: - Automate service account and token creation using the [REST APIs](/sql-reference/rest-api/motherduck-rest-api/). - Automate database and schema creation for new customer companies. - **Automate data loading**: - Move your `dlt` jobs fully into your orchestrator so that new companies are onboarded with little manual work. - Monitor load durations and adjust scheduling as your data grows. - **Enhance your frontend**: - Add charts and drill-downs powered by MotherDuck. - Consider additional guides under `Customer-Facing Analytics` for advanced topics in your docs set. For a high-level conceptual overview and architecture comparison, see the [**Customer-Facing Analytics Getting Started**](/getting-started/customer-facing-analytics/) page. --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/attach-modes/attach-modes # Attach Modes > Understand Workspace and Single attach modes ## MotherDuck attach modes: workspace and single modes This guide explains MotherDuck's two connection modes: **workspace** and **single**. Workspace mode is designed for working with multiple databases persistently across sessions, while single mode uses a non-persistent, isolated session that does not reuse your saved workspace. ### Connection modes MotherDuck offers two connection modes: workspace and single. The mode you use determines how your attachments and detachments are handled and whether these changes are saved for future sessions. 
* **Workspace Mode**: This is the default mode when you want to work with all attached MotherDuck databases. When you attach or detach a database in this mode, that change is remembered for your next session. This is useful when you consistently work with the same set of databases. Parallel connections to MotherDuck in workspace mode keep their attachments in sync. For example, detaching a database in one client in workspace mode detaches it in all other clients that are connected in workspace mode.
* **Single Mode**: This mode is for when you want a one-time, non-persistent session that does not reuse or change your saved workspace. Any databases you attach or detach during this session will not affect your saved workspace for the next time you connect, and will not interfere with the attachment state of other parallel connections to MotherDuck. You can still attach multiple databases in a single-mode session, including databases shared with you. For example, you can start with your own database and then `ATTACH 'md:_share/...'` to attach a share. Single mode is useful with BI tools that only support a single attached database at a time.

:::tip
You can't switch between modes in the middle of a session. The mode is set by the first command you use to connect to MotherDuck.
:::

### Connecting to MotherDuck with a connection string

When you first connect to MotherDuck in a session, the connection string you use determines the attach mode. This applies to most clients, like the DuckDB CLI (`duckdb 'md:...'`) and Python (`duckdb.connect('md:...')`).

* **To connect in Workspace Mode (default):**
    * Use `md:` or `md:<database_name>`.
    * This connects to your MotherDuck workspace, attaching *all* databases from your last saved session.
    * If you specify a database name, it becomes the active database.
    * Any changes to attachments (attaching or detaching databases) are saved and will be restored in your next workspace session.
* **To connect in Single Mode:**
    * Use `md:<database_name>?attach_mode=single`.
    * This connects to the specified database without using your saved workspace.
    * Attachment changes are *temporary* and will *not* be saved.
    * Note: You must specify a database name to use single mode. Connecting with `md:?attach_mode=single` is not allowed, as this mode requires a specific database target.

### Connecting to MotherDuck using the ATTACH command

If you are already in a DuckDB session, but **not** connected to MotherDuck yet, your first ATTACH command that targets MotherDuck establishes the attach mode for that session.

* **To connect in Workspace Mode:**
    * Use `ATTACH 'md:'`.
    * This attaches your entire saved workspace.
    * The session is now in workspace mode, and any subsequent attachment changes will be persisted for future sessions.
* **To connect in Single Mode:**
    * Use `ATTACH 'md:<database_name>'`.
    * This attaches the specified database without using your saved workspace.
    * The session is implicitly set to single mode. Attachment changes are not saved.
    * Once in single mode, you cannot attach the entire workspace using `ATTACH 'md:'`.

### Tips & tricks

Further notes:

* You can also explicitly set the attach mode before connecting to MotherDuck.

    ```sql
    LOAD motherduck;
    SET motherduck_attach_mode = 'workspace'; -- or 'single'
    ATTACH 'md:foo'; -- database created by your account
    ```

* The MotherDuck UI always connects in workspace mode.

---

Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-and-connecting-to-motherduck

# Authenticating and connecting to MotherDuck

> Learn how to authenticate and connect to MotherDuck

These pages explain how to connect to MotherDuck using the CLI, Python, JDBC and NodeJS.
First, you need to [authenticate to MotherDuck](./authenticating-to-motherduck) by [manual authentication](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#manual-authentication) via the Web UI, or automatic authentication via an [access token](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#authentication-using-an-access-token). Organizations on Business or Enterprise plans can also configure [Single Sign-On (SSO)](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/) with their identity provider. To connect to a MotherDuck database, you can [create a connection](/docs/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/). ## Included pages - [Authenticating to MotherDuck](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck): Authenticate to a MotherDuck account - [Connecting to MotherDuck](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck): Create one or more connections to a MotherDuck database - [Connect via the Postgres endpoint](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/postgres-endpoint): Connect to MotherDuck using any Postgres-compatible client via the Postgres wire protocol endpoint - [Read Scaling](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling): Learn how to scale your data applications using read scaling tokens - [Attach Modes](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/attach-modes): Understand Workspace and Single attach modes - [Multithreading and parallelism](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/multithreading-and-parallelism): Run concurrent queries against MotherDuck, and learn when to use Read Scaling or the 
Postgres endpoint instead of managing connection pools. --- Source: https://motherduck.com/docs/key-tasks/data-warehousing/data-warehousing # Data Warehousing How-to > Data Warehousing How-to guides ## Introduction to MotherDuck for data warehousing MotherDuck is a cloud-native data warehouse built on top of [DuckDB](https://duckdb.org/docs/sql/introduction), a fast in-process analytical database. While DuckDB provides the core analytical engine capabilities, MotherDuck adds cloud storage, sharing, and collaboration features that make it a complete data warehouse solution. Key advantages include its serverless architecture that eliminates infrastructure management, an intuitive interface that simplifies data analysis, and hybrid execution that intelligently processes queries across local and cloud resources. MotherDuck is an ideal choice for organizations seeking a modern data warehouse solution. It excels at ad-hoc analytics by providing instant compute resources for each user, serves well as a departmental data mart with its simplified sharing model, and enables powerful embedded analytics through its WASM capabilities. Different personas benefit uniquely - data analysts get an intuitive SQL interface with AI assistance, engineers can leverage familiar APIs and tools like dbt, and data scientists can seamlessly combine local and cloud data processing. ![img_duck_stack](./img/md-diagram.svg) The modern data stack with MotherDuck integrates seamlessly with popular tools across the ecosystem. 
As shown in the ecosystem diagram, this includes ingestion tools like [Fivetran](https://fivetran.com/docs/destinations/motherduck#motherduck) and [Airbyte](https://docs.airbyte.com/integrations/destinations/motherduck) for loading data, transformation tools like [dbt](/docs/integrations/transformation/dbt) for modeling, BI tools like [Tableau](/integrations/bi-tools/tableau/) and [PowerBI](/integrations/bi-tools/powerbi/) for visualization, and orchestration tools like [Airflow](https://airflow.apache.org/docs/) and [Dagster](https://docs.dagster.io/integrations/libraries/duckdb/using-duckdb-with-dagster) for pipeline management. This comprehensive integration enables teams to build complete data warehousing solutions while leveraging their existing tooling investments. ## MotherDuck basics: concepts to understand before you start ![Architecture](./img/the-md-dwh.png) MotherDuck's core architecture is built on a serverless foundation that eliminates infrastructure management overhead. The platform handles data storage with enterprise-grade durability and security, while optimizing performance through intelligent data organization. Each user gets their own isolated compute resource called a "Duckling" that sits on top of the storage layer, and the separation of storage and compute enables independent scaling of these resources based on workload demands. The [dual execution model](/concepts/architecture-and-capabilities/#dual-execution) is a unique capability that allows MotherDuck to seamlessly query both local and cloud data. The query planner intelligently determines the optimal execution path, deciding whether to process data locally, in the cloud, or using a hybrid approach. This enables efficient querying across data sources while minimizing data movement and optimizing for performance. MotherDuck follows a familiar hierarchical structure with databases containing schemas and tables. 
Databases serve as the primary unit of organization and access control, while schemas help logically group related tables together. This structure provides a clean way to organize data while maintaining compatibility with common [SQL patterns](https://duckdb.org/docs/sql/introduction) and tools. Authentication in MotherDuck is handled through secure [token-based access](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#creating-an-access-token), with comprehensive user and organization management capabilities. The platform uses a simplified access model where users either have full access to a database or none at all. The [SHARES](/key-tasks/sharing-data/managing-shares/) feature enables secure data sharing within organizations and with external parties through zero-copy clones that maintain data consistency and security. The [MotherDuck user interface](/getting-started/interfaces/motherduck-quick-tour/) provides a modern notebook-style environment for data interaction. The SQL IDE includes powerful features like intelligent autocomplete, AI-powered query suggestions and fixes, and an interactive Column Explorer that helps users understand and analyze their data structure. These features combine to create an intuitive and productive environment for data analysis. While MotherDuck is designed for analytical workloads, it's important to note that it's not optimized for high-frequency small transactions like traditional OLTP databases. The platform works best with batch operations and [analytical queries](https://duckdb.org/docs/sql/introduction), and users should consider using queues for streaming workloads to achieve optimal performance. Additionally, the database-level security model means access cannot be controlled at the schema or table level. ## Data ingestion: getting your data in MotherDuck provides multiple strategies for ingesting data into your data warehouse. 
The platform leverages DuckDB's powerful data loading capabilities while adding cloud-native features for seamless data ingestion at scale. You can load data through direct file imports, cloud storage connections, database migrations, or specialized ETL tools like [Fivetran](https://fivetran.com/docs/destinations/motherduck#motherduck) and [Airbyte](https://docs.airbyte.com/integrations/destinations/motherduck) depending on your needs. The [MotherDuck Web UI](/getting-started/interfaces/motherduck-quick-tour/) provides an intuitive interface for data loading and exploration. ### Loading local data Loading data from local files supports common formats like CSV, Parquet, and JSON. The [MotherDuck UI](/getting-started/interfaces/motherduck-quick-tour/) provides an intuitive interface for uploading files directly, while the [Python client](https://duckdb.org/docs/api/python/overview) enables programmatic loading using DuckDB's native functions. For example, you can use [read_csv()](https://duckdb.org/docs/data/csv), [read_parquet()](https://duckdb.org/docs/data/parquet), or [read_json()](https://duckdb.org/docs/data/json) to efficiently load data files while taking advantage of DuckDB's parallel processing capabilities. ### Interacting with cloud storage (S3, GCS, etc) Cloud storage integration lets you directly query and load data from major providers including [AWS S3](https://duckdb.org/docs/guides/import/s3_import), [Google Cloud Storage](https://duckdb.org/docs/guides/import/gcs_import), [Azure Blob Storage](https://duckdb.org/docs/stable/extensions/azure), and [Cloudflare R2](https://duckdb.org/docs/guides/import/s3_import). Using SQL commands like SELECT FROM read_parquet('s3://bucket/file.parquet'), you can seamlessly access cloud data. MotherDuck handles credential management securely through [environment variables](/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck) or configuration settings. 
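As a rough illustration of the cloud storage pattern above, the sketch below builds a `CREATE TABLE ... AS SELECT * FROM read_parquet(...)` statement and runs it against a MotherDuck database from Python. The helper names `build_load_sql` and `load_parquet` are hypothetical, as is the bucket path in the usage note. The sketch assumes the `duckdb` package is installed, `MOTHERDUCK_TOKEN` is set, and credentials for the bucket are already configured.

```python
import re


def build_load_sql(table: str, uri: str) -> str:
    """Return a statement that (re)creates `table` from a Parquet file at `uri`."""
    # Accept only plain identifiers so the table name cannot inject SQL
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"Unsafe table name: {table!r}")
    # Escape single quotes so the URI is a valid SQL string literal
    escaped_uri = uri.replace("'", "''")
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_parquet('{escaped_uri}')"
    )


def load_parquet(database: str, table: str, uri: str) -> None:
    """Load one Parquet file into a MotherDuck table (hypothetical helper)."""
    import duckdb  # imported lazily so build_load_sql has no third-party dependency

    conn = duckdb.connect(f"md:{database}")
    try:
        conn.execute(build_load_sql(table, uri))
    finally:
        conn.close()
```

A call such as `load_parquet("my_db", "orders", "s3://bucket/file.parquet")` would then replace the `orders` table with the file's contents. Swapping `read_parquet` for `read_csv` or `read_json` follows the same shape.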
### Database-to-database data loading

For database migrations, MotherDuck supports importing data from other databases like [PostgreSQL](https://duckdb.org/docs/guides/import/query_postgres.html) and [MySQL](https://duckdb.org/docs/guides/import/query_mysql). You can directly connect to these sources using database connectors and execute queries to extract and load data. Existing [DuckDB databases](https://duckdb.org/docs/stable/data/multiple_files/overview) can be imported efficiently since MotherDuck is built on DuckDB's core engine.

### Fetching data from APIs

[Data ingestion](/integrations/ingestion/) tools like Fivetran, Airbyte, dltHub and Estuary integrate with MotherDuck to provide automated, reliable data pipelines. These tools handle complex ETL workflows, data validation, and transformation while offering features like scheduling, monitoring and error handling that simplify ongoing data operations.

For real-time data needs, MotherDuck works with streaming partners like [Estuary](https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/) to enable continuous data ingestion. While DuckDB is optimized for batch operations, these integrations allow you to build streaming pipelines that buffer and load data in micro-batches for near real-time analytics.

### Unstructured data integrations

When working with unstructured data like documents, emails or images, tools like [Unstructured.io](https://motherduck.com/blog/effortless-etl-unstructured-data-unstructuredio-motherduck/) can pre-process and structure the data before loading into MotherDuck. This lets you analyze unstructured data alongside your structured data warehouse tables.

### Loading performance notes

For optimal performance, follow DuckDB's recommended practices around batch sizes and data types. Load data in reasonably sized batches (at least 122k rows) to balance memory usage and throughput.
Use appropriate data types like TIMESTAMP for datetime values and avoid unnecessary type conversions. Sort data by columns that are frequently queried together such as TIMESTAMPs. Monitor [recent queries](/sql-reference/motherduck-sql-reference/md_information_schema/recent_queries/) during large loads and adjust batch sizes accordingly.

## Data transformation: shaping your data for analysis

Data transformation is a critical step in the data warehousing process that converts raw data into analysis-ready formats. MotherDuck provides powerful SQL capabilities inherited from DuckDB for transforming data directly within the warehouse. You can leverage DuckDB's rich library of SQL functions to clean, reshape, and model your data through operations like filtering, joining, aggregating and pivoting.

### Transformation tools

- **[dbt (data build tool)](/integrations/transformation/dbt/)**
    * Native MotherDuck adapter for seamless integration with dbt Core
    * Enables version controlled, modular SQL transformations
    * Supports testing, documentation and lineage tracking
    * Recommended for complex transformation workflows
    * See our [blog post](https://motherduck.com/blog/duckdb-dbt-e2e-data-engineering-project-part-2/) for detailed examples
- **[SQLMesh](https://sqlmesh.readthedocs.io/en/stable/integrations/engines/motherduck/)**
    * Compatible with MotherDuck through DuckDB support
    * Provides data pipeline and transformation management
    * Enables incremental processing and scheduling
- **[Paradime](https://docs.paradime.io/app-help/documentation/settings/connections/scheduler-environment/duckdb)**
    * Modern data transformation platform built for DuckDB/MotherDuck
    * Offers collaborative development environment
    * Includes version control and deployment tools

## Orchestration: automating your data pipelines

Orchestration is essential for keeping data up to date with MotherDuck.
Scheduling data loads and transformations ensures your data warehouse stays current by running ingestion jobs at appropriate intervals to capture new data from your sources. Managing dependencies between tasks lets you create reliable pipelines where transformations only run after their prerequisite data loads complete successfully. Monitoring and alerting capabilities help you track pipeline health and quickly address any issues that arise. For orchestrating MotherDuck workflows, you have several options: Popular workflow orchestration platforms like [Airflow, Dagster, Kestra, Prefect and Bacalhau](/integrations/orchestration/) provide robust scheduling, dependency management and monitoring capabilities. For simpler use cases, basic scheduling tools like cron jobs or [GitHub Actions](/key-tasks/data-warehousing/orchestration/github-action-cron/) can effectively orchestrate data pipelines. Many ingestion & transformation tools also come with built-in orchestration features, allowing you to schedule and monitor data loads without additional tooling. When orchestrating MotherDuck pipelines, follow these best practices: - Design idempotent jobs that can safely re-run without duplicating or corrupting data. - Implement proper error handling and retries to gracefully handle temporary failures. - Set up logging and monitoring to maintain visibility into pipeline health and performance. ## Connecting BI tools and data applications MotherDuck provides robust support for business intelligence and reporting through its cloud data warehouse capabilities. The platform enables organizations to build scalable analytics solutions by connecting their data warehouse to popular visualization and reporting tools. With isolated compute tenancy per user, analysts can run complex queries without impacting other users' performance. For connecting popular BI tools, MotherDuck offers several integration options. 
Tableau users can connect through the [cloud and server connectors](/integrations/bi-tools/tableau/), with support for both token-based and environment variable authentication methods. The platform works with both live and extracted connections, and Tableau Bridge enables cloud connectivity. [Microsoft Power BI](/integrations/bi-tools/powerbi/) integration is achieved through the DuckDB ODBC driver and Power Query connector, supporting both import and DirectQuery modes. Other supported BI tools include Omni, Metabase, Preset/Superset, and Rill, typically connecting through standard JDBC/ODBC interfaces. MotherDuck seamlessly integrates with data science and AI tools through its native APIs and connectors. Python users can leverage the DuckDB SDK and Pandas integration for data analysis workflows. The platform supports R for statistical computing, while AI applications can be built using LangChain or LlamaIndex integrations. Notebook tools like Hex and Jupyter provide both hosted and on-prem environments for data exploration. For building [custom data applications](/getting-started/customer-facing-analytics/), MotherDuck's unique architecture enables novel approaches through its WASM-powered 1.5-tier architecture. The platform runs DuckDB in the browser through WebAssembly, allowing for highly interactive visualizations with near-zero latency. Developers can use MotherDuck's APIs and SDKs in languages like Python and Go to create custom data applications that leverage both local and cloud-based data processing. ## Advanced topics & best practices ### Performance tuning and optimization in MotherDuck MotherDuck inherits DuckDB's powerful query optimization capabilities. You can analyze query performance using the `EXPLAIN` command to view execution plans and identify bottlenecks. While DuckDB doesn't use traditional indexes, it automatically creates statistics and metadata to optimize query execution with row groups. 
As a result, [sorting the data on insert](https://duckdb.org/2025/05/14/sorting-for-fast-selective-queries.html) is a very effective way to improve query performance.

### Data sharing and collaboration

MotherDuck implements a data sharing model through SHARES, which provide read-only access to specific databases. To create a share, use the [`CREATE SHARE`](/sql-reference/motherduck-sql-reference/create-share/) command and specify the database you want to share. Recipients can then access the shared data through their own MotherDuck account while maintaining data isolation.

### Monitoring and logging MotherDuck usage

DuckDB's meta-queries like `EXPLAIN ANALYZE` provide detailed query execution statistics. You can also use the platform's built-in profiling capabilities to monitor query performance and resource utilization, helping identify optimization opportunities and troubleshoot performance issues. [Recent queries](/sql-reference/motherduck-sql-reference/md_information_schema/recent_queries/) and [historical queries](/sql-reference/motherduck-sql-reference/md_information_schema/query_history/) can be observed as well, to further optimize the warehouse load.

### Cost management

While MotherDuck's pricing model is still evolving, you can optimize costs by efficiently managing compute resources. Consider implementing data lifecycle policies to archive or delete old data. Monitor query patterns to identify opportunities for optimization and avoid unnecessary data processing.

### Security best practices for your MotherDuck warehouse

- Implement robust security practices by following MotherDuck's database-level security model.
- Use token-based authentication for all connections and avoid sharing credentials.
- When integrating with tools, leverage environment variables for secure credential management.
- Regularly audit database access and maintain an inventory of active shares.
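The environment variable practice in the list above might look like the following sketch. `load_motherduck_token` and `redact` are hypothetical helper names, not part of any MotherDuck or DuckDB API. The only detail taken from the documentation is the `MOTHERDUCK_TOKEN` variable name.

```python
import os
from typing import Mapping


def load_motherduck_token(
    env: Mapping[str, str] = os.environ,
    var: str = "MOTHERDUCK_TOKEN",
) -> str:
    """Return the access token from the environment, or fail fast if it is missing."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"{var} is not set; refusing to connect without a token")
    return token


def redact(token: str) -> str:
    """Return a log-safe form that reveals at most the last four characters."""
    return "****" + token[-4:] if len(token) > 8 else "****"
```

Reading the token at startup and logging only its redacted form keeps credentials out of source control and out of application logs.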
### Leveraging AI features within MotherDuck MotherDuck enhances DuckDB with AI-powered features to improve productivity. The platform includes a [SQL AI fixer](/getting-started/interfaces/motherduck-quick-tour/#fix-errors-and-edit-queries-with-ai) that helps identify and correct query syntax issues. The `prompt()` function enables natural language interactions with your data warehouse, allowing users to generate SQL queries from plain English descriptions. These are just a few of the AI capabilities that help make data analysis more accessible while maintaining the power and flexibility of SQL. ## Further guides: ## Included pages - [GitHub Actions](https://motherduck.com/docs/key-tasks/data-warehousing/orchestration/github-action-cron): Schedule MotherDuck SQL and dbt jobs with GitHub Actions as a lightweight cron-based orchestrator. - [PostgreSQL](https://motherduck.com/docs/key-tasks/data-warehousing/replication/postgres): Replicate PostgreSQL tables to MotherDuck using DuckDB and the PostgreSQL extension. - [Dagster](https://motherduck.com/docs/key-tasks/data-warehousing/orchestration/dagster): Orchestrate an incremental S3-to-MotherDuck data loading pipeline with Dagster and Python. - [SQL Server](https://motherduck.com/docs/key-tasks/data-warehousing/replication/sql-server): Replicate SQL Server tables to MotherDuck using Python and dataframes. - [Flat Files](https://motherduck.com/docs/key-tasks/data-warehousing/replication/flat-files): Load CSV, Parquet, and JSON files into MotherDuck from local storage or cloud sources. - [Excel and Google Sheets](https://motherduck.com/docs/key-tasks/data-warehousing/replication/spreadsheets): Load Excel and Google Sheets data into MotherDuck using the DuckDB CLI or HTTPS CSV export URLs. ## Appendix ### Troubleshooting common issues When working with MotherDuck, you may encounter challenges around data loading, query performance, or connectivity. 
For data loading issues, refer to our [best practices for programmatic loading](/key-tasks/data-warehousing/), which cover optimizing batch sizes and file formats. For query performance, review our [dual execution capabilities](/concepts/architecture-and-capabilities/#dual-execution) to understand how MotherDuck optimizes query execution across local and cloud resources. For connectivity problems, check our [authentication guides](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck) and ensure you're following the recommended connection patterns.

### Useful SQL snippets for MotherDuck

MotherDuck supports a wide range of SQL functionality inherited from DuckDB. For data ingestion, refer to our [PostgreSQL replication examples](/key-tasks/data-warehousing/replication/postgres), which demonstrate common patterns for loading data. For building customer-facing analytics, check our [guide](/getting-started/customer-facing-analytics), which includes examples of data processing and visualization queries. The [DuckDB SQL documentation](https://duckdb.org/docs/sql/introduction.html) provides a comprehensive reference for the SQL dialect.

### Links to further resources (MotherDuck docs, community)

To deepen your understanding of data warehousing with MotherDuck, explore our [data warehousing concepts guide](/key-tasks/data-warehousing/), which covers architectural principles and best practices. For hands-on examples, the free [DuckDB in Action eBook](https://motherduck.com/duckdb-book-brief/) provides real-world scenarios and solutions. You can also explore our [ecosystem integrations](/integrations/) for additional tools and capabilities.

If you need help along your journey, please **[contact us](https://motherduck.com/customer-support/)**.
--- Source: https://motherduck.com/docs/key-tasks/database-operations/database-operations # Database operations > Learn how to work with databases and MotherDuck ## Included pages - [Basics database operations](https://motherduck.com/docs/key-tasks/database-operations/basics-operations): Create, list, and drop MotherDuck databases using SQL commands. - [Specifying different databases](https://motherduck.com/docs/key-tasks/database-operations/specifying-different-databases): Reference tables across databases using fully qualified names with database.schema.table syntax. - [Switching the current database](https://motherduck.com/docs/key-tasks/database-operations/switching-the-current-database): Change the active database and schema context using USE statements. - [Querying historical data with time travel](https://motherduck.com/docs/key-tasks/database-operations/time-travel): Use MotherDuck snapshots to query past database states, compare data across time periods, debug pipeline issues, reproduce reports, and create audit checkpoints. - [Copying DuckDB Databases](https://motherduck.com/docs/key-tasks/database-operations/copying-databases): Duplicate databases between MotherDuck cloud and local DuckDB using COPY FROM DATABASE. - [Detach and re-attach a MotherDuck database](https://motherduck.com/docs/key-tasks/database-operations/detach-and-reattach-motherduck-database): Temporarily disconnect from a MotherDuck database using DETACH and reconnect with ATTACH. --- Source: https://motherduck.com/docs/key-tasks/how-to-guides # How-to guides > How-to guides ## Included pages - [AI and MotherDuck](https://motherduck.com/docs/category/ai-and-motherduck): Practical guides for using AI with MotherDuck. 
- [Authenticating and connecting to MotherDuck](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck): Learn how to authenticate and connect to MotherDuck - [Data Warehousing How-to](https://motherduck.com/docs/key-tasks/data-warehousing): Data Warehousing How-to guides - [Database operations](https://motherduck.com/docs/key-tasks/database-operations): Learn how to work with databases and MotherDuck - [Interacting with cloud storage](https://motherduck.com/docs/key-tasks/cloud-storage): Learn how to work with databases and MotherDuck - [Loading Data into MotherDuck](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck): Learn how to load data into MotherDuck from various sources - [Managing organizations](https://motherduck.com/docs/key-tasks/managing-organizations): Learn how to manage your organization with MotherDuck - [Running dual execution (or hybrid) queries](https://motherduck.com/docs/key-tasks/running-hybrid-queries): Query local and cloud data together using MotherDuck's dual execution hybrid query engine. - [Service accounts](https://motherduck.com/docs/key-tasks/service-accounts-guide): Learn how to create, configure, manage, and impersonate MotherDuck service accounts. - [Sharing data in MotherDuck](https://motherduck.com/docs/key-tasks/sharing-data): Learn how to securely share data in MotherDuck - [Build a customer-facing analytics app](https://motherduck.com/docs/key-tasks/customer-facing-analytics): Build customer-facing analytics applications with read scaling tokens and isolated tenant data. - [3-tier customer-facing analytics guide](https://motherduck.com/docs/key-tasks/customer-facing-analytics/3-tier-cfa-guide): Step-by-step guide to building a 3-tier customer-facing analytics application with MotherDuck. 
--- Source: https://motherduck.com/docs/key-tasks/cloud-storage/cloud-storage # Interacting with cloud storage > Learn how to work with databases and MotherDuck ## Included pages - [Querying Files in Amazon S3](https://motherduck.com/docs/key-tasks/cloud-storage/querying-s3-files): Query Parquet, CSV, and JSON files in S3 with automatic cloud execution routing. - [Writing Data to Amazon S3](https://motherduck.com/docs/key-tasks/cloud-storage/writing-to-s3): Export data from MotherDuck to Amazon S3 or transform S3 files in place. - [S3 Import Best Practices](https://motherduck.com/docs/key-tasks/cloud-storage/s3-import-best-practices): Optimize file size, format, and layout in Amazon S3 for fast, cost-effective data loading into MotherDuck. --- Source: https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-into-motherduck # Loading Data into MotherDuck > Learn how to load data into MotherDuck from various sources You can leverage MotherDuck's managed storage to persist your data. MotherDuck storage provides a high level of manageability and abstraction, optimizing your data for secure, durable, performant, and efficient use. There are several ways to load data into MotherDuck storage. ## Before You Start: Understanding Trade-offs Before choosing a loading method, it's important to understand the performance implications and trade-offs involved. Our [Considerations for Loading Data](./considerations-for-loading-data.mdx) guide explains: - **Batch vs. streaming approaches** and when to use each - **File format choices** and their impact on performance - **Optimal batch sizes** for different scenarios - **Cost implications** of different loading strategies - **Common performance pitfalls** and how to avoid them This understanding will help you make informed decisions that optimize for your specific use case. 
## Included pages - [Loading Data Best Practices](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/considerations-for-loading-data): Understanding trade-offs and performance implications when loading data into MotherDuck - [From Your Local Machine](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-local-machine): Moving data from local to MotherDuck through the UI or programmatically. - [Loading data to MotherDuck with Python](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-md-python): Efficient methods for loading data from Python using DataFrames, temporary files, or bulk inserts. - [From Cloud Storage or over HTTPS](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-cloud-or-https): Load data into MotherDuck from S3, Azure, GCS, or public HTTPS URLs. - [Load a DuckDB database into MotherDuck](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-duckdb-database): Upload a local DuckDB database file to MotherDuck cloud storage. - [From a PostgreSQL or MySQL Database](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-from-postgres): Learn to load a table from your PostgreSQL or MySQL database into MotherDuck. - [Via the Postgres Endpoint](https://motherduck.com/docs/key-tasks/loading-data-into-motherduck/loading-data-via-postgres-endpoint): Best practices for loading data into MotherDuck efficiently when you are connected through the Postgres endpoint. --- Source: https://motherduck.com/docs/key-tasks/managing-organizations/managing-organizations # Managing organizations > Learn how to manage your organization with MotherDuck An organization is a top-level entity in MotherDuck that lets you perform administrative functions, such as managing users, setting up billing, configuring sharing, and monitoring security. A MotherDuck user can only belong to a single organization at a time. 
Multi-organization membership support is planned for a future release.

Organizations are helpful for:

- Grouping users together for tracking usage and billing.
- Sharing data with other users of the same organization.

:::note
MotherDuck is available in three AWS regions:

- **US East (N. Virginia):** `us-east-1`, supporting DuckDB versions between 1.4.0 and 1.5.3.
- **US West (Oregon):** `us-west-2`, supporting DuckDB versions between 1.4.1 and 1.5.3.
- **Europe (Frankfurt):** `eu-central-1`, supporting DuckDB versions between 1.4.1 and 1.5.3.

You can choose the region in which to create your organization. Organizations can only exist within a single cloud region.
:::

## Creating an organization

If you already have a MotherDuck account, MotherDuck has already created an organization for you. If you are a new MotherDuck user, you will be prompted to create a new organization during sign-up.

![create_org](./img/create_org.png)

:::note
If a coworker at your company already has an organization, you can create your own organization to get started with MotherDuck right away, and then ask them to invite you to their organization later (see ["Joining an existing organization"](#joining-an-existing-organization) below).
:::

## Inviting users to your organization

You can check if your teammates are in your organization by navigating to the MotherDuck Web UI -> **Settings** -> **Members**. There you can also invite your teammates to join your organization. You can invite both teammates without a MotherDuck account and existing MotherDuck users.

![members](./img/members.png)

Admins can control whether members are allowed to send invitations. When organization invites are disabled, only Admins can invite new users. This gives you tighter control over who has access to MotherDuck. You can configure this setting from the organization **Settings** page.
![invite policy](../authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/img/org-invite-policy.png)

:::tip
If your organization has [SSO enabled](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/) with [Just-in-Time (JIT) provisioning](/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/sso-setup/#just-in-time-jit-user-provisioning) enabled, users in your verified domains who authenticate through your identity provider can join the organization on first login without needing an invitation.
:::

## Joining an existing organization

If you'd like to join your teammates' existing MotherDuck organization, you must be invited by an Admin in that organization. Once an invite is generated, you will receive an email with a link to join the organization.

## Roles

Within an organization, a user can have an "Admin" or "Member" role. The first user in an organization is the Admin, and subsequent users have the Member role. Admin users can change the roles of other users in the organization or remove a user from the organization.

:::note
Sending invitations, changing between plans, and updating billing information require an Admin role.
:::

## Deprovisioning users

If you need to revoke a user's access without deleting their data, Admin users can deprovision them from the context menu in the [Members table](https://app.motherduck.com/settings/members). Deprovisioning is a reversible alternative to [removing](#removing-users) a user.

When you deprovision a user:

- They can no longer sign in to MotherDuck.
- Their personal access tokens and short-lived tokens are revoked.
- Their account, databases, and shares are retained.

To restore access later, choose **Reactivate** from the same context menu. The user can sign in again, but previously revoked tokens are not restored; they need to create new tokens.

Two actions are blocked:

- You can't deprovision yourself.
- You can't deprovision the last active user in the organization. :::note If your organization uses SCIM provisioning, user lifecycle is managed by your identity provider and the deprovision and reactivate actions are hidden from the Members table. ::: ## Removing users If a user leaves your team or no longer needs access, Admin users can remove them from the organization to restrict data access or clean up resources that are no longer used. This is done from the context menu in the [Members table](https://app.motherduck.com/settings/members). :::warning Because a user can only belong to one organization, removing them from the organization permanently deletes the user and all of their data. This action cannot be undone. To revoke access reversibly instead, [deprovision](#deprovisioning-users) the user. ::: ## Limitations - It is not possible to search for existing organizations to join. Please reach out to other MotherDuck users at your company or [contact us](../../troubleshooting/support.md) if you would like to find other existing users at your company. --- Source: https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/read-scaling # Read Scaling > Learn how to scale your data applications using read scaling tokens Connecting read-heavy applications or BI tools with many concurrent users through a single MotherDuck account can sometimes lead to performance bottlenecks. By default, all connections using the same account share a single cloud DuckDB instance, called a "duckling". In addition to your read/write duckling, you can use Read Scaling to spin up additional read-only ducklings for read-heavy workloads. These replicas are **eventually consistent**. Results may lag a few minutes behind the latest database state. This tradeoff prioritizes high availability and performance while achieving near real-time synchronization across all replicas. 
## Configuring a read scaling duckling pool ### Creating a read scaling token To use Read Scaling, you use a read scaling access token from the **MotherDuck UI** when [generating an access token][md-access-token] or through the [REST API](/docs/sql-reference/rest-api/users-create-token/). ### Connect with a read scaling token {#understanding-read-scaling-tokens} Once you have a read scaling token, you can use it to connect to MotherDuck from any DuckDB client as you would with any other authorization token. See [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#session-names). ### Duckling assignment Read scaling ducklings remain idle until a connection is initialized from a DuckDB client. When a DuckDB client connects to MotherDuck with a read scaling token, the connection is assigned to one of the read scaling replicas. As more users connect, additional ducklings are spun up until you reach your Read Scaling Duckling Pool size. If the number of connections exceeds your pool size, new connections are assigned to existing ducklings in a round-robin fashion. The default Read Scaling Duckling Pool Size is 4 and can be increased up to 16. This is a soft limit, so if you need more ducklings in your pool, please [contact support](https://motherduck.com/contact-us/support/). ### Permissions A read scaling token grants permission for **read operations** (`SELECT`) while restricting write and administrative operations (updating tables, creating new databases, attaching or detaching databases). ## Ensuring data freshness In read scaling mode, ducklings sync changes from the primary read-write instance within a few minutes which works for most use cases. If your application requires stricter synchronization, you can manually trigger updates to be more frequent by: 1. Calling [CREATE SNAPSHOT](/sql-reference/motherduck-sql-reference/create-snapshot.md) on the writer duckling 2. 
Calling [REFRESH DATABASES](/sql-reference/motherduck-sql-reference/refresh-database.md) on any read scaling ducklings This approach guarantees that readers see the most recent snapshot. ::::warning[Watch Out] Creating a snapshot of a database will interrupt any ongoing queries interacting with that database. :::: ## Best practices Here are a few tips to get the most out of MotherDuck's read scaling capabilities. ### Optimize your read scaling duckling pool size For the best experience, aim for one duckling per concurrent user to take advantage of DuckDB's single-node power and efficiency. You can scale up as much as you need by configuring a maximum pool size based on expected concurrency and cost considerations. Users are also able to share ducklings if needed. While the default limit is 16 replicas, this is a soft limit. [Get in touch with MotherDuck support](https://motherduck.com/contact-us/support/) if you need more. ### Leverage local processing where possible Consider using DuckDB WASM to run client instances directly in the browser when possible to fully utilize client resources. ### Maintain user-duckling affinity with `session_name` {#session-affinity-with-session-name} To ensure users consistently connect to the same replica (improving caching and consistency), the DuckDB connection string supports the [`session_name`](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#session-names) parameter: - Clients providing the same `session_name` value are directed to the same replica. This improves caching effectiveness, provides a more consistent view of data across queries for that user and offers better isolation between concurrent users. - This parameter can be set to the ID of a user session, a user ID, or a hashed value for privacy. By leveraging read scaling tokens and `session_name`, you can efficiently scale read operations and group user sessions for optimal performance. 
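
Returning to the manual refresh steps described under "Ensuring data freshness", the two calls can be sketched as the pair of statements below. This is a minimal illustration rather than a complete workflow: `my_db` is a hypothetical database name, and the first statement is assumed to run on a connection using a regular read/write token while the second runs on a connection using a read scaling token.

```sql
-- On the writer (read/write) connection: publish the latest state of the database.
-- my_db is a hypothetical database name used only for illustration.
CREATE SNAPSHOT OF my_db;

-- On each read scaling connection: pick up the newest snapshot.
REFRESH DATABASES;
```

Because creating a snapshot interrupts ongoing queries against that database, it is usually worth triggering this pair only after a batch of writes has finished rather than on every change.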
### Instance caching with `dbinstance_inactivity_ttl` Some DuckDB client library integrations support an *instance cache* to keep connections to the same database instance alive for a short period after use. This improves read scaling by helping maintain session affinity even across separate queries or short connection gaps. This caching behavior boosts the effectiveness of `session_name`, making it more likely that frequent queries from the same client land on the same duckling, even with short breaks between connections. See [Connecting to MotherDuck](/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#setting-custom-database-instance-cache-time-ttl) for more details. [md-access-token]: /key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#authentication-using-an-access-token --- Source: https://motherduck.com/docs/key-tasks/service-accounts-guide/index # Service accounts > Learn how to create, configure, manage, and impersonate MotherDuck service accounts. Service accounts are non-human user identities for workloads that need to connect to MotherDuck without using a person's credentials. Use these guides to create service accounts, configure their Ducklings, manage tokens, and troubleshoot through UI impersonation. ## Included pages - [Create and configure service accounts](https://motherduck.com/docs/key-tasks/service-accounts-guide/create-and-configure-service-accounts): Learn how to create service accounts, create access tokens, and configure Duckling resources. - [Impersonate service accounts](https://motherduck.com/docs/key-tasks/service-accounts-guide/impersonate-service-accounts): Use UI impersonation to troubleshoot and inspect resources as a service account. - [Manage service accounts and tokens](https://motherduck.com/docs/key-tasks/service-accounts-guide/manage-service-accounts-and-tokens): Use the MotherDuck UI and REST API to view, delete, and rotate service account tokens. 
--- Source: https://motherduck.com/docs/key-tasks/sharing-data/sharing-data # Sharing data in MotherDuck > Learn how to securely share data in MotherDuck :::note Shares are **region-scoped** based on your Organization's cloud region. Each MotherDuck Organization is scoped to a single cloud region that must be chosen at Org creation when signing up. MotherDuck is available on AWS in three regions: - **US East (N. Virginia):** `us-east-1` - **US West (Oregon):** `us-west-2` - **Europe (Frankfurt):** `eu-central-1` ::: You can securely share data in MotherDuck. MotherDuck's sharing model is specifically optimized for the following scenarios: - Sharing data with everyone in your Organization for easy discovery and low-friction access. Typical of small highly collaborative data teams. - Sharing data with specific accounts in your Organization. Popular with data application builders needing to isolate tenants. - Sharing data publicly with anyone with a MotherDuck account in the same cloud region as your Organization, including users outside your Organization. ## Included pages - [Sharing concepts and overview](https://motherduck.com/docs/key-tasks/sharing-data/sharing-overview): MotherDuck data sharing model concepts including read-only shares and scope options. - [Sharing data with your organization](https://motherduck.com/docs/key-tasks/sharing-data/sharing-within-org): Share databases with all members of your MotherDuck organization. - [Sharing data with specific users](https://motherduck.com/docs/key-tasks/sharing-data/sharing-with-users): Grant read access to specific users for multi-tenant applications and collaboration. - [Managing shares](https://motherduck.com/docs/key-tasks/sharing-data/managing-shares): View share details, modify permissions, and manage shared database access. 
- [Updating shares](https://motherduck.com/docs/key-tasks/sharing-data/updating-shares): Learn about data replication timing, checkpoints, and how to ensure your latest data is available in shares and read-only Ducklings. --- Source: https://motherduck.com/docs/key-tasks/ai-and-motherduck/text-search-in-motherduck # Text Search in MotherDuck > Text search strategies from pattern matching to semantic search with embeddings in MotherDuck. Text search is a fundamental operation in data analytics - whether you're finding records by name, searching documents for relevant content, or building question-answering systems. This guide covers search strategies available in MotherDuck, from simple pattern matching to advanced semantic search, and how to combine them for optimal results. ## Quick Start: Common Search Patterns Start here to identify the best search method for your use case. The right search approach depends on what you're searching, how you expect to use search, and what results you need. Most use cases fall into one of three patterns, each linking to detailed implementation guidance below: **Keyword Search Over Identifiers**: When searching for specific items like company names, product codes, or customer names, use [Exact Match](#exact-match) for precise and low-latency lookups. If you need typo tolerance (e.g., "MotheDuck" → "MotherDuck"), use [Fuzzy Search](#fuzzy-search-text-similarity). **Keyword Search Over Documents**: When searching longer text like articles, product descriptions, or documentation, use [Full-Text Search](#full-text-search-fts). This ranks documents by keyword relevance, and handles cases where users provide a few keywords that should appear in the content. **Semantic Search**: When searching by meaning and similarity rather than exact keywords, use [Embedding-based Search](#embedding-based-search). 
This covers: - Understanding synonyms (e.g., matching "data warehouse" with "analytics platform") - Understanding natural language queries (e.g., "wireless headphones with good battery life") - Finding similar content (e.g., support tickets describing similar customer issues) --- For answering natural language questions about *structured* data (e.g., "How many customers do we have in California?"), see [Analytics Agents](/key-tasks/ai-and-motherduck/building-analytics-agents/). ## Refining Your Search Strategy If the patterns above don't fully match your use case, use these four questions to navigate to the right method. Each question links to specific sections with implementation details: 1. **What is the search corpus?** Consider what you're searching through: - **Identifiers** like company names, product IDs, or person names → [Exact Match](#exact-match) or [Fuzzy Search](#fuzzy-search-text-similarity) - **Documents** like articles, descriptions, or reports → [Keyword search (regex)](#exact-match) or [Full-Text Search](#full-text-search-fts) (FTS) or [Embedding-Based Search](#embedding-based-search) or [Hybrid](#fts-pre-filtering-hybrid-search) (combining FTS + embeddings) - **Structured (numerical) data** → [Analytics Agents](/key-tasks/ai-and-motherduck/building-analytics-agents/) that convert natural language questions to SQL 2. **What is the user input?** Think about how users express their search: - **Single terms** like "MotherDuck" → [Exact Match](#exact-match) or [Fuzzy Search](#fuzzy-search-text-similarity) - **Keyword phrases** like "data warehouse analytics" → [Keyword search (regex)](#exact-match) or [Full-Text Search](#full-text-search-fts) or [Embedding-based search](#embedding-based-search) - **Questions** like "What companies offer cloud analytics?" 
→ [Embedding-based search](#embedding-based-search) with [HyDE](#hypothetical-document-embeddings-hyde) - **Example documents** (finding similar content) → [Embedding-based search](#embedding-based-search) 3. **What is the desired output?** Clarify what you're returning: - **Ranked list** (retrieval of documents/records) → Covered by this guide - **Generated text answers** (RAG-style Q&A, chatbots, summarization) → Use retrieval methods from this guide in combination with the [`prompt()`](/sql-reference/motherduck-sql-reference/ai-functions/prompt/#retrieval-augmented-generation-rag) function. 4. **What is the desired search behavior?** Think about what search qualities matter: - **Exact match** for specific words (IDs and codes) → [Exact Match](#exact-match) or [Keyword search (regex)](#using-regular-expressions) - **Typo resilience** to handle misspellings like "MotheDuck" → "MotherDuck" → [Fuzzy search](#fuzzy-search-text-similarity) - **Synonym resilience** to match "data warehouse" with "analytics platform" → [Embedding-based search](#embedding-based-search) - **Customizable ranking** → See [Reranking](#reranking) in the [Advanced Methods](#advanced-methods) section - **Latency and concurrency** → See [Performance Guide](#performance-guide) ## Search Methods ### Exact Match Use exact match search for specific identifiers, codes, or when you need guaranteed matches. This is the fastest search method. #### Using LIKE For substring matching, use `LIKE` (or `ILIKE` for case-insensitive). In patterns, `%` matches any sequence of characters and `_` matches exactly one character. 
```sql
-- Find places with 'Starbucks' in their name
SELECT name, locality, region
FROM foursquare.main.fsq_os_places
WHERE name LIKE '%Starbucks%'
LIMIT 10;
```

See also: [Pattern Matching](https://duckdb.org/docs/stable/sql/functions/pattern_matching.html) in DuckDB documentation

#### Using Regular Expressions

For more complex pattern matching or matching multiple keywords, use `regexp_matches()` with `(?i)` for case-insensitive searches:

```sql
-- Find Hacker News posts with 'python', 'javascript', or 'rust' in text
SELECT title, "by", score
FROM sample_data.hn.hacker_news
WHERE regexp_matches(text, '(?i)(python|javascript|rust)')
LIMIT 10;
```

See also: [Regular Expressions](https://duckdb.org/docs/stable/sql/functions/regular_expressions) in DuckDB documentation

### Fuzzy Search (Text Similarity)

Fuzzy search handles typos and spelling variations in entity names like companies, people, or products. Use `jaro_winkler_similarity()` for most fuzzy matching scenarios - it offers the best balance of accuracy and performance compared to `damerau_levenshtein()` or `levenshtein()`.

```sql
-- Find places similar to 'McDonalds' (handles typo 'McDonalsd')
SELECT
    name,
    locality,
    region,
    jaro_winkler_similarity('McDonalsd', name) AS similarity
FROM foursquare.main.fsq_os_places
ORDER BY similarity DESC
LIMIT 10;
```

See also: [Text Similarity Functions](https://duckdb.org/docs/stable/sql/functions/text#text-similarity-functions) in DuckDB documentation

### Full-Text Search (FTS)

Full-Text Search ranks documents by keyword relevance using BM25 scoring, which considers both how often terms appear in a document and how rare they are across all documents. Use this for articles, descriptions, or longer text where you need relevance ranking. FTS automatically handles word stemming (e.g., "running" matches "run") and removes common stopwords (like "the", "and", "or"), but requires exact word matches - it won't handle typos in search queries.
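
For intuition about the BM25 scoring mentioned above, the textbook Okapi BM25 formula is sketched below. This is the standard form from the information retrieval literature, not a statement of DuckDB's exact implementation, which may use a slightly different IDF variant. Here $f(q, D)$ is how often term $q$ occurs in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the corpus, and $\mathrm{IDF}(q)$ grows as $q$ becomes rarer across documents.

```latex
\mathrm{score}(D, Q) \;=\; \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left( 1 - b + b\,\frac{|D|}{\mathrm{avgdl}} \right)}
```

The parameter $k_1$ controls how quickly repeated occurrences of a term stop adding to the score, and $b$ controls how strongly long documents are penalized. DuckDB's `match_bm25` exposes these as the optional `k` and `b` arguments, with defaults of 1.2 and 0.75.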
#### Basic FTS Setup

FTS requires write access to the table. Since we're using a read-only example database, we first create a copy of the table in a read-write database we own:

```sql
CREATE TABLE hn_stories AS
SELECT id, title, text, "by", score, type
FROM sample_data.hn.hacker_news
WHERE type = 'story' AND LENGTH(text) > 100
LIMIT 10000;
```

Build the FTS index on the text column. This creates a new schema called `fts_{schema}_{table_name}` (in this case `fts_main_hn_stories`):

```sql
PRAGMA create_fts_index(
    'hn_stories',  -- table name
    'id',          -- document ID column
    'text'         -- text column to index
);
```

Search the index using the `match_bm25` function from the newly created schema:

```sql
SELECT
    id,
    title,
    text,
    fts_main_hn_stories.match_bm25(id, 'database analytics') AS score
FROM hn_stories
ORDER BY score DESC
LIMIT 10;
```

#### Index Maintenance

FTS indexes need to be updated when the underlying data changes. Rebuild the index using the `overwrite` parameter:

```sql
PRAGMA create_fts_index('hn_stories', 'id', 'text', overwrite := 1);
```

See also: [Full-Text Search Guide](https://duckdb.org/docs/stable/guides/sql_features/full_text_search.html) and [Full-Text Search Extension](https://duckdb.org/docs/stable/core_extensions/full_text_search) in DuckDB documentation

### Embedding-Based Search

Embedding-based search finds conceptually similar text by meaning, not keywords. Use this for natural language queries, handling synonyms, or when users search with questions. Embeddings handle synonyms and typos naturally without manual configuration.

:::note
Embedding generation and lookups are priced in [AI Units](/about-motherduck/billing/pricing#advanced-ai-functions). For paid organizations, Business and Lite plans have a default soft limit of 10 AI Units per user/day (sufficient to embed around 600,000 rows) to help prevent unexpected costs.
If you'd like to adjust these limits, [just ask!](/troubleshooting/support)
:::

:::info
The DuckDB [VSS extension](https://duckdb.org/docs/stable/core_extensions/vss) for approximate vector search (HNSW) is currently experimental, and not supported in MotherDuck's cloud service (Server-Side). [Learn more](/concepts/duckdb-extensions/) about MotherDuck's support for DuckDB extensions.
:::

#### Basic Embedding-Based Search Setup

Generate embeddings for your text data, then search using exact vector similarity. For search queries phrased as questions (like "What are the best practices for...?"), see [Hypothetical Document Embeddings](#hypothetical-document-embeddings-hyde).

```sql
-- Reusing the hn_stories table from the FTS section, add embeddings
ALTER TABLE hn_stories ADD COLUMN text_embedding FLOAT[512];
UPDATE hn_stories SET text_embedding = embedding(text);

-- Semantic search - this will also match texts with related concepts like 'neural networks', 'deep learning', etc.
SELECT
    title,
    text,
    array_cosine_similarity(
        embedding('machine learning and artificial intelligence'),
        text_embedding
    ) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;
```

See also: [MotherDuck Embedding Function](/sql-reference/motherduck-sql-reference/ai-functions/embedding/), and [array_cosine_similarity](https://duckdb.org/docs/stable/sql/functions/array#array_cosine_similarityarray1-array2) in DuckDB documentation

#### Document Chunking for Embedding-Based Search

When documents are longer than ~2000 characters, consider breaking them into smaller chunks to improve retrieval precision and focus results. For production pipelines with PDFs or Word docs, you can use the [MotherDuck integration for Unstructured.io](https://motherduck.com/blog/effortless-etl-unstructured-data-unstructuredio-motherduck/).
Otherwise, you can also do document chunking in the database - here are some helpful macros:

```sql
-- Fixed-size chunking with configurable overlap
CREATE MACRO chunk_fixed_size(text_col, chunk_size, overlap) AS TABLE (
    SELECT
        gs.generate_series as chunk_number,
        substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size) AS chunk_text
    FROM generate_series(1, CAST(CEIL(LENGTH(text_col) / (chunk_size - overlap * 1.0)) AS INTEGER)) gs
    WHERE LENGTH(substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size)) > 50
);

-- Paragraph-based chunking (splits on double newlines)
-- Note: E'\n\n' is an escape string literal, so \n means a real newline.
-- In a plain '\n\n' literal, DuckDB treats the backslashes as ordinary characters.
CREATE MACRO chunk_paragraphs(text_col) AS TABLE (
    WITH chunks AS (SELECT string_split(text_col, E'\n\n') as arr)
    SELECT
        UNNEST(generate_series(1, array_length(arr))) as chunk_number,
        UNNEST(arr) as chunk_text
    FROM chunks
);

-- Sentence-based chunking (splits on sentence boundaries)
CREATE MACRO chunk_sentences(text_col) AS TABLE (
    WITH chunks AS (SELECT string_split_regex(text_col, '[.!?]+\s+') as arr)
    SELECT
        UNNEST(generate_series(1, array_length(arr))) as chunk_number,
        UNNEST(arr) as chunk_text
    FROM chunks
);
```

Use one of the macros to create chunks from your documents.
Fixed-size chunks (300-600 chars with 10-20% overlap) work well for most use cases:

```sql
CREATE OR REPLACE TABLE hn_text_chunks AS
SELECT
    id AS post_id,
    title,
    chunks.chunk_number,
    chunks.chunk_text
FROM hn_stories
CROSS JOIN LATERAL chunk_fixed_size(text, 500, 100) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_paragraphs(text) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_sentences(text) chunks;
```

Generate embeddings for the chunks:

```sql
ALTER TABLE hn_text_chunks ADD COLUMN chunk_embedding FLOAT[512];
UPDATE hn_text_chunks SET chunk_embedding = embedding(chunk_text);
```

Once you have chunks with embeddings, search them the same way as full documents using `array_cosine_similarity()` - the chunk-level results often provide more precise matches than searching entire documents.

## Performance Guide

Search performance depends on several factors, from the chosen search method, to cold vs. warm reads, Duckling sizing, and tenancy model.

When running a search query against your data for the first time (cold read), it may have a higher latency than subsequent queries (warm reads). For production search workloads, ideally dedicate a service account's Duckling primarily to search, so other queries don't compete with search queries. Account for [Duckling cooldown periods](/about-motherduck/billing/duckling-sizes/) - the first search query after cooldown may experience more latency.

The DuckDB analytics engine divides data into chunks and processes them in parallel across threads. More data means more chunks to process in parallel, so larger datasets don't necessarily take proportionally longer to search - they just use more threads simultaneously.

**Duckling sizing:** Optimal latency requires warm reads and enough threads to process your data in parallel.
With the ideal [Duckling sizing](/about-motherduck/billing/duckling-sizes/) configuration matched to your dataset size, keyword search over identifiers ([exact match](#exact-match), [fuzzy match](#fuzzy-search-text-similarity)) typically achieves latencies in the range of a few hundred milliseconds, while document search ([regex](#using-regular-expressions), [Full-Text Search](#full-text-search-fts), [embedding search](#embedding-based-search)) typically achieves 0.5-3 second latency. Our team is happy to help advise on the right resource allocation for your specific workload and latency targets - [get in touch](/troubleshooting/support) to discuss how we can meet your needs.

**Handling Concurrent Requests:** For handling multiple simultaneous search requests effectively, consider using [read scaling](/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) to distribute load across multiple read scaling Ducklings. Alternatively, consider [hypertenancy](/concepts/hypertenancy), providing isolated compute resources for each user.

To optimize further, see the strategies below. For questions or requirements beyond this guide, please [get in touch](/troubleshooting/support).

### Search Optimization Strategies

When optimizing search performance, consider the following options.

#### Pre-filtering

Reduce the search space using structured metadata (e.g.
location, categories, date ranges) that can be inferred from the user's context, before running similarity searches:

```sql
-- Create a local copy with embeddings for place names (using a subset)
CREATE TABLE places AS
SELECT fsq_place_id, name, locality, region, fsq_category_labels
FROM foursquare.main.fsq_os_places
WHERE name IS NOT NULL
LIMIT 10000;

-- Add embeddings for semantic search
ALTER TABLE places ADD COLUMN name_embedding FLOAT[512];
UPDATE places SET name_embedding = embedding(name);

-- Pre-filter by location before semantic search
WITH filtered_candidates AS (
    SELECT fsq_place_id, name, locality, fsq_category_labels, name_embedding
    FROM places
    WHERE locality = 'New York'  -- Filter by location and region
      AND region = 'NY'
)
SELECT
    name,
    locality,
    fsq_category_labels,
    array_cosine_similarity(
        embedding('italian restaurant'),
        name_embedding
    ) AS similarity
FROM filtered_candidates
ORDER BY similarity DESC
LIMIT 20;
```

#### Reducing Embedding Dimensionality

Halving embedding dimensions roughly halves compute time. OpenAI embeddings can be truncated at specific dimensions (256 for `text-embedding-3-small`, 256 or 512 for `text-embedding-3-large`).
Use lower dimensions for initial pre-filtering, then rerank with full embeddings:

```sql
-- Setup: Create normalization macro
CREATE MACRO normalize(v) AS (
    CASE
        WHEN len(v) = 0 THEN NULL
        WHEN sqrt(list_dot_product(v, v)) = 0 THEN NULL
        ELSE list_transform(v, element -> element / sqrt(list_dot_product(v, v)))
    END
);

-- Add lower-dimensional column (e.g., 256 dims instead of 512)
ALTER TABLE hn_stories ADD COLUMN text_embedding_short FLOAT[256];
UPDATE hn_stories SET text_embedding_short = normalize(text_embedding[1:256]);
```

Then use a two-stage search:

```sql
-- Stage 1: Fast pre-filter with short embeddings
-- Embed the query with the same (default) model that produced text_embedding,
-- so that both vectors have 512 dimensions and live in the same embedding space
SET VARIABLE query_emb = embedding('machine learning algorithms');
SET VARIABLE query_emb_short = normalize(getvariable('query_emb')[1:256])::FLOAT[256];

WITH candidates AS (
    SELECT
        id,
        array_cosine_similarity(getvariable('query_emb_short'), text_embedding_short) AS similarity
    FROM hn_stories
    ORDER BY similarity DESC
    LIMIT 500  -- Get more candidates if needed
)
-- Stage 2: Rerank with full embeddings
SELECT
    p.title,
    p.text,
    array_cosine_similarity(getvariable('query_emb'), p.text_embedding) AS final_similarity
FROM hn_stories p
WHERE p.id IN (SELECT id FROM candidates)
ORDER BY final_similarity DESC
LIMIT 10;
```

#### FTS Pre-filtering (Hybrid Search)

FTS typically has lower latency than embedding search, making it effective as a pre-filter to reduce similarity comparisons.
Use a large LIMIT in the FTS stage to ensure good recall:

```sql
-- FTS pre-filter with large limit, then semantic rerank
SET VARIABLE search_query = 'artificial intelligence neural networks';

WITH fts_candidates AS (
    SELECT
        id,
        fts_main_hn_stories.match_bm25(id, getvariable('search_query')) AS fts_score
    FROM hn_stories
    WHERE fts_score IS NOT NULL  -- match_bm25 returns NULL for rows without a keyword match
    ORDER BY fts_score DESC
    LIMIT 10000  -- Large limit to ensure recall
)
SELECT
    h.id,
    h.title,
    h.text,
    array_cosine_similarity(
        embedding(getvariable('search_query')),
        h.text_embedding
    ) AS similarity
FROM hn_stories h
INNER JOIN fts_candidates f ON h.id = f.id
ORDER BY similarity DESC
LIMIT 10;
```

See also: [Search Using DuckDB Part 3 (Hybrid Search)](https://motherduck.com/blog/search-using-duckdb-part-3/)

## Advanced Methods

This section covers additional techniques to customize and improve your search. The methods below demonstrate common approaches - many other variants are possible.

:::note
Some methods in this section make use of the `prompt()` function, which is priced in [AI Units](/about-motherduck/billing/pricing#advanced-ai-functions). For paid organizations, Business and Lite plans have a default soft limit of 10 AI Units per user/day (sufficient to process around 80,000 rows) to help prevent unexpected costs.
If you'd like to adjust these limits, [just ask!](/troubleshooting/support)
:::

### LLM-Enhanced Keyword Expansion

Generate synonyms with an LLM, then use them in pattern matching:

```sql
-- Generate synonyms using LLM with structured output
SET VARIABLE search_term = 'programming';

WITH synonyms AS (
    SELECT prompt(
        'Give me 5 synonyms for ''' || getvariable('search_term') || '''',
        struct := {'synonyms': 'VARCHAR[]'}
    ).synonyms AS synonym_list
)
-- Search with expanded terms
SELECT title, text
FROM sample_data.hn.hacker_news, synonyms
WHERE regexp_matches(text, getvariable('search_term') || '|' || array_to_string(synonym_list, '|'))
LIMIT 10;
```

See also: [MotherDuck `prompt()` Function](/sql-reference/motherduck-sql-reference/ai-functions/prompt/)

### Hypothetical Document Embeddings (HyDE)

HyDE improves question-based retrieval by generating a hypothetical answer first, then searching with that answer's embedding. This works because questions and answers have different linguistic patterns - the hypothetical answer better matches actual document content. Use with semantic search or the semantic component of hybrid search.

```sql
-- HyDE: Generate hypothetical answer, then search with it
WITH hypothetical_answer AS (
    SELECT prompt(
        'Answer this question in 2-3 sentences: "What are the key challenges in building scalable distributed systems?" Focus on typical technical challenges and solutions.'
    ) AS answer
)
-- Search using the hypothetical answer's embedding
SELECT
    title,
    text,
    array_cosine_similarity(
        (SELECT embedding(answer) FROM hypothetical_answer),
        text_embedding
    ) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;
```

See also: [Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE paper)](https://arxiv.org/abs/2212.10496)

### Reranking

Reranking typically happens in two stages: initial retrieval to get top candidates (100-500 results), then precise reranking of that smaller set.
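
As a rough illustration of the two-stage pattern, the sketch below runs in plain DuckDB without any AI functions. Everything in it is made up for the example: the `toy_docs` table, its hand-written 3-dimensional vectors, the `points` column, and the 60/40 weights. With real data the vectors would come from `embedding()`.

```sql
-- Hypothetical toy corpus with hand-written 3-dimensional "embeddings"
CREATE OR REPLACE TEMP TABLE toy_docs AS
SELECT * FROM (VALUES
    (1, 'Intro to neural networks',   [0.9, 0.1, 0.0]::FLOAT[3], 120),
    (2, 'Deep learning in practice',  [0.8, 0.2, 0.1]::FLOAT[3], 40),
    (3, 'Gardening for beginners',    [0.0, 0.1, 0.9]::FLOAT[3], 300),
    (4, 'Machine learning pipelines', [0.7, 0.3, 0.0]::FLOAT[3], 200)
) AS t(id, title, emb, points);

WITH candidates AS (
    -- Stage 1: cheap retrieval, keep only the top 3 by cosine similarity
    SELECT
        id, title, points,
        array_cosine_similarity(emb, [1.0, 0.0, 0.0]::FLOAT[3]) AS similarity
    FROM toy_docs
    ORDER BY similarity DESC
    LIMIT 3
)
-- Stage 2: rerank the small candidate set with a blended score
-- (60% similarity + 40% points, normalized by the best candidate)
SELECT
    id,
    title,
    round(0.6 * similarity + 0.4 * points / max(points) OVER (), 3) AS final_score
FROM candidates
ORDER BY final_score DESC;
```

In this toy data the gardening post never reaches stage 2 even though it has the most points, and the post with id 4 moves from third place after retrieval to first place after reranking.
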
#### Rule-Based Reranking with Metadata

Refine results based on business rules and metadata like score, category, or freshness:

```sql
-- Find similar posts with metadata-based reranking
WITH initial_similarity AS (
    -- Step 1: Fast vector similarity for top candidates
    SELECT
        title,
        text,
        score as author_score,
        array_cosine_similarity(
            embedding('artificial intelligence and machine learning applications'),
            text_embedding
        ) AS emb_similarity
    FROM hn_stories
    ORDER BY emb_similarity DESC
    LIMIT 100
),
reranked_scores AS (
    -- Step 2: Rerank with metadata (author score)
    SELECT
        title,
        text,
        author_score,
        emb_similarity,
        -- Score boost (normalize to 0-1 range based on actual data)
        (author_score / MAX(author_score) OVER ()) AS author_score_norm,
        -- Combined final score: 60% semantic + 40% author score
        (emb_similarity * 0.6 + author_score_norm * 0.4) AS reranked_score
    FROM initial_similarity
)
SELECT
    title,
    text,
    author_score,
    ROUND(emb_similarity, 3) as semantic_score,
    ROUND(author_score_norm, 3) as author_score_normalized,
    ROUND(reranked_score, 3) as final_score
FROM reranked_scores
ORDER BY reranked_score DESC
LIMIT 10;
```

#### LLM-Based Reranking

For complex relevance criteria that are hard to express as rules, use an LLM to judge and score results. The [`prompt()` function](/sql-reference/motherduck-sql-reference/ai-functions/prompt/) is optimized for batch processing and processes requests in parallel - so reranking 50 results typically adds only a few hundred milliseconds.

```sql
-- LLM reranking for top search results
SET VARIABLE search_query = 'best practices for code review and software quality';

WITH top_candidates AS (
    -- Initial retrieval (e.g., via semantic search)
    SELECT
        id,
        title,
        text,
        array_cosine_similarity(
            embedding(getvariable('search_query')),
            text_embedding
        ) AS initial_score
    FROM hn_stories
    ORDER BY initial_score DESC
    LIMIT 20
),
llm_reranked AS (
    SELECT
        *,
        prompt(
            format(
                'Rate how well this post matches the query ''{}''.
Answer with an integer from 1 to 10. Post: {} - {}',
                getvariable('search_query'), title, text
            ),
            struct := {'rating': 'INTEGER'}
        ).rating AS llm_score
    FROM top_candidates
)
SELECT
    title,
    text,
    ROUND(initial_score, 3) as initial_score,
    llm_score,
    -- llm_score is divided by 10.0 to bring the 1-10 rating onto the same 0-1 scale as initial_score
    ROUND((0.6 * initial_score + 0.4 * llm_score / 10.0), 3) AS final_score
FROM llm_reranked
ORDER BY final_score DESC
LIMIT 10;
```

## Next Steps

- Check out the MotherDuck [Embedding Function](/sql-reference/motherduck-sql-reference/ai-functions/embedding/) and [Prompt Function](/sql-reference/motherduck-sql-reference/ai-functions/prompt/)
- Review the [Full-Text Search Guide](https://duckdb.org/docs/stable/guides/sql_features/full_text_search.html) in DuckDB documentation
- Read the MotherDuck blog series: [Search Using DuckDB Part 1](https://motherduck.com/blog/search-using-duckdb-part-1/), [Part 2](https://motherduck.com/blog/search-using-duckdb-part-2/), [Part 3](https://motherduck.com/blog/search-using-duckdb-part-3/)
- Explore [Building Analytics Agents with MotherDuck](/key-tasks/ai-and-motherduck/building-analytics-agents/)