---
sidebar_position: 1
title: Multithreading and Parallelism with Python and MotherDuck
sidebar_label: Python
description: Performance tuning via multithreading with multiple connections to MotherDuck with Python
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import useBaseUrl from '@docusaurus/useBaseUrl';

# Multithreading and parallelism with Python

Depending on the needs of your data application, you can use multithreading for improved performance. If your queries will benefit from concurrency, you can create [connections in multiple threads](#connections-in-multiple-threads). For multiple long-lived connections to one or more databases in one or more MotherDuck accounts, you can use [connection pooling](#connection-pooling). If you need to run many concurrent read-only queries on the same MotherDuck account, you can use a [Read Scaling](/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) token.

## Connections in multiple threads

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/one-connection-threads.png')} width="300" style={{maxWidth: '100%'}} />

If you have multiple parallelizable queries you want to run in quick succession, you could benefit from concurrency.

:::note
Concurrency is supported by DuckDB, across multiple Python threads, as described in the [Multiple Python Threads](https://duckdb.org/docs/guides/python/multiple_threads.html) documentation page. However, be mindful when using this approach, as parallelism does not always lead to better performance. Read the notes on [Parallelism](https://duckdb.org/docs/guides/performance/how_to_tune_workloads.html#parallelism-multi-core-processing) in the DuckDB documentation to understand the specific scenarios in which concurrent queries can be beneficial.
:::

A single DuckDB connection [is not thread-safe](https://duckdb.org/docs/api/python/overview.html#using-connections-in-parallel-python-programs). To use multiple threads, pass the connection object to each thread, and create a copy of the connection with the `.cursor()` method to run a query:

```python
import duckdb
from threading import Thread

duckdb_con = duckdb.connect('md:my_db')

def query_from_thread(duckdb_con, query):
    cur = duckdb_con.cursor()
    result = cur.execute(query).fetchall()
    print(result)
    cur.close()

queries = ["SELECT 42", "SELECT 'Hello World!'"]
threads = []

for i in range(len(queries)):
    threads.append(Thread(target = query_from_thread,
                            args = (duckdb_con, query,),
                            name = 'query_' + str(i)))
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()
```

## Connection pooling

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling-multiple-copies.png')} width="145" style={{maxWidth: '100%'}} />
<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling.png')} width="300" style={{maxWidth: '100%'}} />
<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling-multiple.png')} width="300" style={{maxWidth: '100%'}} />

If your application needs multiple read-only connections to a MotherDuck database, for example, to handle requests in a queue, you can use a Connection Pool. A Connection Pool keeps connections open for a longer period for efficient re-use. The connections in your pool can connect to one database in the same MotherDuck account, or multiple databases in one or more accounts. To run concurrent read-only queries on the same MotherDuck account, you can use a [Read Scaling](/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) token.

For connection pools, we recommend using [SQLAlchemy](https://docs.sqlalchemy.org/14/core/pooling.html). Below is an example implementation. For this implementation, you can connect to a user account by providing a `motherduck_token` in your database path.

```python
import logging
from itertools import cycle
from threading import Lock

import duckdb
import sqlalchemy.pool as pool
from sqlalchemy.engine import make_url

_log = logging.getLogger(__name__)

logging.basicConfig(level=logging.DEBUG)


class DuckDBPool(pool.QueuePool):
    """Connection pool for DuckDB databases (MD or local).
    When you run con = pool.connect(), it will return a cached copy of one of the
    database connections in the pool.
    When you run con.close(), it doesn't close the connection, it just
    returns it to the pool.

    Args:
        database_paths: A list of unique databases to connect to.
    """
    def __init__(
        self,
        database_paths,
        max_overflow=0,
        timeout=60,
        reset_on_return=None,
        *args,
        **kwargs
    ):
        self.database_paths = database_paths
        self.gen_database_path = cycle(database_paths)
        self.pool_size = kwargs.pop("pool_size", len(database_paths))
        self.lock = Lock()

        super().__init__(
            self._next_conn,
            *args,
            max_overflow=max_overflow,
            pool_size=self.pool_size,
            reset_on_return=reset_on_return,
            timeout=timeout,
            **kwargs
        )

    def _next_conn(self):
        with self.lock:
            path = next(self.gen_database_path)
            duckdb_conn = duckdb.connect(path)
            url = make_url(f"duckdb:///{path}")
            _log.debug(f"Connected to database: {url.database}")

            return duckdb_conn
```

### How to set `database_paths`

The `DuckDBPool` takes a list of `database_paths` and an optional input argument `pool_size` (defaults to the number of paths). Each path in the list will get a DuckDB connection in the pool, that readers can use to query the database(s) they connect to. If you have a `pool_size` that is larger than the number of paths, the pool will return thread-safe copies of those connections. This gives you a few options on how to configure the pool.

:::note
To learn more about database instances and connections, see [Connect to multiple databases](/docs/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#connect-to-multiple-databases).
:::

<Tabs>

<TabItem value="One database" label="One database">

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling-multiple-copies.png')} width="145" style={{maxWidth: '100%'}} />

To create a connection pool with 3 connections to **the same database**, you can pass a single database path, and set `pool_size=3`:

```python
path = "md:my_db?motherduck_token=<motherduck_token>&access_mode=read_only"
conn_pool = DuckDBPool([path], pool_size=3)
```

Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

</TabItem>

<TabItem value="One database with >1 instances" label="One database with >1 instances">

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling.png')} width="300" style={{maxWidth: '100%'}} />

You can also create multiple connections to **the same database** using **different DuckDB instances**. However, keep in mind that each connection takes time to establish. Create multiple paths and make them unique by adding `&user=<n>` to the database path:

```python
paths = [
    "md:my_db?motherduck_token=<motherduck_token>&access_mode=read_only&user=1",
    "md:my_db?motherduck_token=<motherduck_token>&access_mode=read_only&user=2",
    "md:my_db?motherduck_token=<motherduck_token>&access_mode=read_only&user=3",
]
conn_pool = DuckDBPool(paths)
```

Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

</TabItem>

<TabItem value=">1 Databases on 1 account" label=">1 Databases on one account">

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling.png')} width="300" style={{maxWidth: '100%'}} />

You can also create multiple connections to **separate databases** in **the same MotherDuck account** using **different DuckDB instances**. However, keep in mind that each connection takes time to establish. Create multiple paths where each uses a different database path:

```python
paths = [
    "md:my_db1?motherduck_token=<motherduck_token>&access_mode=read_only",
    "md:my_db2?motherduck_token=<motherduck_token>&access_mode=read_only",
    "md:my_db3?motherduck_token=<motherduck_token>&access_mode=read_only",
]
conn_pool = DuckDBPool(paths)
```

Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

</TabItem>

<TabItem value=">1 Databases on >1 accounts" label=">1 Databases on >1 accounts">

<img src={useBaseUrl('/img/key-tasks/authenticating-and-connecting-to-motherduck/connection-pooling-multiple.png')} width="300" style={{maxWidth: '100%'}} />

You can also create multiple connections to **separate databases** in **separate MotherDuck accounts** using *different DuckDB instances*. However, keep in mind that each connection takes time to establish. Create multiple paths where each uses a different database path:

```python
paths = [
    "md:my_db1?motherduck_token=<motherduck_token1>&access_mode=read_only",
    "md:my_db2?motherduck_token=<motherduck_token2>&access_mode=read_only",
    "md:my_db3?motherduck_token=<motherduck_token3>&access_mode=read_only",
]
conn_pool = DuckDBPool(paths)
```

Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

</TabItem>

</Tabs>

### How to run queries with a thread pool

You can then fetch connections from the pool, for example, to run queries from a queue. You can use a `ThreadPoolExecutor` with 3 workers to fetch connections from the pool and run the queries using a `run_query` function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(conn_pool: DuckDBPool, query: str):
    _log.debug(f"Run query: {query}")

    conn = conn_pool.connect()
    rows = conn.execute(query)
    res = rows.fetchall()
    conn.close()

    _log.debug(f"Done running query: {query}")
    return res

with ThreadPoolExecutor(max_workers=3) as executor:
    conn_pool = DuckDBPool(database_paths)
    futures = [executor.submit(run_query, conn_pool, query) for query in queries]

    for future, query in zip(futures, queries):
        result = future.result()
        print(f"Query [{query}] num rows: {len(result)}")
```

Reset the connection pool at least once every 24 hours, by closing and reopening all connections. This ensures that you are always running on the latest version of MotherDuck.

```python
conn_pool.dispose()
conn_pool.recreate()
```
