--- sidebar_position: 1 title: Multithreading and Parallelism with Python and MotherDuck sidebar_label: Python description: Performance tuning via multithreading with multiple connections to MotherDuck with Python --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import useBaseUrl from '@docusaurus/useBaseUrl'; # Multithreading and parallelism with Python Depending on the needs of your data application, you can use multithreading for improved performance. If your queries will benefit from concurrency, you can create [connections in multiple threads](#connections-in-multiple-threads). For multiple long-lived connections to one or more databases in one or more MotherDuck accounts, you can use [connection pooling](#connection-pooling). If you need to run many concurrent read-only queries on the same MotherDuck account, you can use a [Read Scaling](/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) token. ## Connections in multiple threads

If you have multiple parallelizable queries you want to run in quick succession, you could benefit from concurrency. :::note Concurrency is supported by DuckDB, across multiple Python threads, as described in the [Multiple Python Threads](https://duckdb.org/docs/guides/python/multiple_threads.html) documentation page. However, be mindful when using this approach, as parallelism does not always lead to better performance. Read the notes on [Parallelism](https://duckdb.org/docs/guides/performance/how_to_tune_workloads.html#parallelism-multi-core-processing) in the DuckDB documentation to understand the specific scenarios in which concurrent queries can be beneficial. ::: A single DuckDB connection [is not thread-safe](https://duckdb.org/docs/api/python/overview.html#using-connections-in-parallel-python-programs). To use multiple threads, pass the connection object to each thread, and create a copy of the connection with the `.cursor()` method to run a query: ```python import duckdb from threading import Thread duckdb_con = duckdb.connect('md:my_db') def query_from_thread(duckdb_con, query): cur = duckdb_con.cursor() result = cur.execute(query).fetchall() print(result) cur.close() queries = ["SELECT 42", "SELECT 'Hello World!'"] threads = [] for i in range(len(queries)): threads.append(Thread(target = query_from_thread, args = (duckdb_con, query,), name = 'query_' + str(i))) for thread in threads: thread.start() for thread in threads: thread.join() ``` ## Connection pooling

If your application needs multiple read-only connections to a MotherDuck database, for example, to handle requests in a queue, you can use a Connection Pool. A Connection Pool keeps connections open for a longer period for efficient re-use. The connections in your pool can connect to one database in the same MotherDuck account, or multiple databases in one or more accounts. To run concurrent read-only queries on the same MotherDuck account, you can use a [Read Scaling](/docs/key-tasks/authenticating-and-connecting-to-motherduck/read-scaling/) token. For connection pools, we recommend using [SQLAlchemy](https://docs.sqlalchemy.org/14/core/pooling.html). Below is an example implementation. For this implementation, you can connect to a user account by providing a `motherduck_token` in your database path. ```python import logging from itertools import cycle from threading import Lock import duckdb import sqlalchemy.pool as pool from sqlalchemy.engine import make_url _log = logging.getLogger(__name__) logging.basicConfig(level=logging.DEBUG) class DuckDBPool(pool.QueuePool): """Connection pool for DuckDB databases (MD or local). When you run con = pool.connect(), it will return a cached copy of one of the database connections in the pool. When you run con.close(), it doesn't close the connection, it just returns it to the pool. Args: database_paths: A list of unique databases to connect to. """ def __init__( self, database_paths, max_overflow=0, timeout=60, reset_on_return=None, *args, **kwargs ): self.database_paths = database_paths self.gen_database_path = cycle(database_paths) self.pool_size = kwargs.pop("pool_size", len(database_paths)) self.lock = Lock() super().__init__( self._next_conn, *args, max_overflow=max_overflow, pool_size=self.pool_size, reset_on_return=reset_on_return, timeout=timeout, **kwargs ) def _next_conn(self): with self.lock: path = next(self.gen_database_path) duckdb_conn = duckdb.connect(path) url = make_url(f"duckdb:///{path}") _log.debug(f"Connected to database: {url.database}") return duckdb_conn ``` ### How to set `database_paths` The `DuckDBPool` takes a list of `database_paths` and an optional input argument `pool_size` (defaults to the number of paths). Each path in the list will get a DuckDB connection in the pool, that readers can use to query the database(s) they connect to. If you have a `pool_size` that is larger than the number of paths, the pool will return thread-safe copies of those connections. This gives you a few options on how to configure the pool. :::note To learn more about database instances and connections, see [Connect to multiple databases](/docs/key-tasks/authenticating-and-connecting-to-motherduck/connecting-to-motherduck/#connect-to-multiple-databases). :::

To create a connection pool with 3 connections to **the same database**, you can pass a single database path, and set `pool_size=3`: ```python path = "md:my_db?motherduck_token=&access_mode=read_only" conn_pool = DuckDBPool([path], pool_size=3) ``` Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

You can also create multiple connections to **the same database** using **different DuckDB instances**. However, keep in mind that each connection takes time to establish. Create multiple paths and make them unique by adding `&user=` to the database path: ```python paths = [ "md:my_db?motherduck_token=&access_mode=read_only&user=1", "md:my_db?motherduck_token=&access_mode=read_only&user=2", "md:my_db?motherduck_token=&access_mode=read_only&user=3", ] conn_pool = DuckDBPool(paths) ``` Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

You can also create multiple connections to **separate databases** in **the same MotherDuck account** using **different DuckDB instances**. However, keep in mind that each connection takes time to establish. Create multiple paths where each uses a different database path: ```python paths = [ "md:my_db1?motherduck_token=&access_mode=read_only", "md:my_db2?motherduck_token=&access_mode=read_only", "md:my_db3?motherduck_token=&access_mode=read_only", ] conn_pool = DuckDBPool(paths) ``` Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases.

You can also create multiple connections to **separate databases** in **separate MotherDuck accounts** using *different DuckDB instances*. However, keep in mind that each connection takes time to establish. Create multiple paths where each uses a different database path: ```python paths = [ "md:my_db1?motherduck_token=&access_mode=read_only", "md:my_db2?motherduck_token=&access_mode=read_only", "md:my_db3?motherduck_token=&access_mode=read_only", ] conn_pool = DuckDBPool(paths) ``` Set `access_mode=read_only` to avoid potential write conflicts. This is especially important if your `pool_size` is larger than the number of databases. ### How to run queries with a thread pool You can then fetch connections from the pool, for example, to run queries from a queue. You can use a `ThreadPoolExecutor` with 3 workers to fetch connections from the pool and run the queries using a `run_query` function: ```python from concurrent.futures import ThreadPoolExecutor def run_query(conn_pool: DuckDBPool, query: str): _log.debug(f"Run query: {query}") conn = conn_pool.connect() rows = conn.execute(query) res = rows.fetchall() conn.close() _log.debug(f"Done running query: {query}") return res with ThreadPoolExecutor(max_workers=3) as executor: conn_pool = DuckDBPool(database_paths) futures = [executor.submit(run_query, conn_pool, query) for query in queries] for future, query in zip(futures, queries): result = future.result() print(f"Query [{query}] num rows: {len(result)}") ``` Reset the connection pool at least once every 24 hours, by closing and reopening all connections. This ensures that you are always running on the latest version of MotherDuck. ```python conn_pool.dispose() conn_pool.recreate() ``` --- ## Docs feedback MotherDuck accepts optional user-submitted feedback about this page at `POST https://motherduck.com/docs/api/feedback/agent`. For agents and automated tools, feedback submission should be user-confirmed before sending. Payload: ```json { "page_path": "/key-tasks/authenticating-and-connecting-to-motherduck/multithreading-and-parallelism/multithreading-and-parallelism-python/", "page_title": "Multithreading and Parallelism with Python and MotherDuck", "text": "", "source": "" } ``` `page_path` and `text` are required; `page_title` and `source` are optional. Responses: `200 {"feedback_id": ""}`, `400` for malformed payloads, and `429` when rate-limited.