Multithreading and parallelism with NodeJS
For multiple long-lived connections to one or more databases in one or more MotherDuck accounts, you can use connection pooling. Depending on the needs of your data application, you can use thread-based parallelism for improved performance, for example, if the queries are hybrid with CPU intensive work done locally. To enable thread-based parallelism, you can use Node worker threads with one database connection in each thread.
Connection pooling
If your application needs multiple read-only connections to a MotherDuck database, for example, to handle requests in a queue, you can use a Connection Pool. A Connection Pool keeps connections open for a longer period for efficient re-use, so you can avoid the overhead of creating a new database object for each query. The connections in your pool can connect to one database in the same MotherDuck account, or multiple databases in one or more accounts.
For connection pools, we recommend using generic-pool with duckdb-async and overriding the release
function to delete a connection if it's been in use for too long to optimize resource usage.
First, let's create a file md_connection_pool.js
to implement the connection pool class. Note that we are adding a new config option, recycleTimeoutMillis
, that will help us recreate any connections (active or idle) that have been open for a given time. This is different from idleTimeoutMillis
, which only destroys idle connections.
import { Database } from "duckdb-async";
import * as genericPool from "generic-pool";
export class RecyclingPool extends genericPool.Pool {
constructor(Evictor, Deque, PriorityQueue, factory, options) {
super(Evictor, Deque, PriorityQueue, factory, options);
// New _config option for when to recycle a non-idle connection
this._config['recycleTimeoutMillis'] = (typeof options.recycleTimeoutMillis == 'undefined') ? undefined : parseInt(options.recycleTimeoutMillis);
this._config['motherduckToken'] = (typeof options.motherduckToken == 'undefined') ? undefined : options.motherduckToken;
console.log('Creating a RecyclingPool');
}
release(resource) {
const loan = this._resourceLoans.get(resource);
const creationTime = typeof loan == 'undefined' ? 0 : loan.pooledResource.creationTime;
// If the connection has been in use for longer than the recycleTimeoutMillis, then destroy it instead of releasing it back into the pool.
// If that deletion brings the pool size below the min, a new connection will automatically be created within the destroy method.
if (new Date(creationTime + this._config.recycleTimeoutMillis) <= new Date()) {
return this.destroy(resource);
}
return super.release(resource);
}
}
You can then create an MDFactory
class to create the connection in the pool, and use it with createRecyclingPool
(equivalent to the createPool
function from generic-pool
).
export class MDFactory {
constructor(opts) {
this.opts = opts
}
async create() {
console.log("Creating a connection");
const connection = await Database.create(`md:my_db?motherduck_token=` + this.opts.motherduckToken);
// Run any connection initialization commands here
// For example, you can set THREADS = 1 if you want to limit duckdb to run on a single thread
await connection.all("SET THREADS='1';");
return connection;
}
async destroy(connection) {
console.log("Destroying a connection");
return connection.close();
}
};
export function createRecyclingPool(config) {
const factory = new MDFactory(config);
return new RecyclingPool(genericPool.DefaultEvictor, genericPool.Deque, genericPool.PriorityQueue, factory, config);
}
To try out the connection pool, you can create a file md_connection_pool_test.js
that creates a RecyclingPool
and submits a list of queries.
To create the pool instance, first set the configuration options specified by generic-pool
and pass them to the createRecyclingPool
function. You can find the list of options in the docs. Below are a few example values that we recommend for using with MotherDuck.
import { createRecyclingPool } from "./md_connection_pool.js";
// If an idle eviction would bring us below the min pool size, a new connection is made after the eviction
const opts = {
max: 10,
min: 3,
// Background idle connection detruction process runs every evictionRunIntervalMillis
// We don't want all connections to be evicted at the same time, so only destroy one at a time
// Connection must be idle for softIdleTimeoutMillis before it is recycled.
// (Additionally, we implemented recycleTimeoutMillis to also recycle active connections.)
evictionRunIntervalMillis: 30000,
numTestsPerEvictionRun: 1,
softIdleTimeoutMillis: 90000,
// Do not start to use a connection that is older than 20 minutes old. Recreate it first.
// Set this higher than recycleTimeoutMillis below so that recycling will happen proactively rather than delay query execution.
idleTimeoutMillis: 1200000,
// Before returning resource to pool, check if it has been in existence longer than this timeout and if so, destroy it.
// New connections will be added up to the min pool size during the destroy process, so this is proactive rather than reactive.
recycleTimeoutMillis: 900000,
// We don't want all the connections to recycle at the same time, so let's randomize it slightly.
// This number should be smaller than the recycleTimeoutMillis
recycleTimeoutJitter: 60000,
// This gets your MotherDuck token from an environment variable.
motherduckToken: process.env.motherduck_token,
};
const myPool = createRecyclingPool(opts);
Then, you can use the pool to asynchronously acquire connections from the pool and run a list of queries.
let promiseArray = [];
let queries = ["SELECT 42", "SELECT 'Hello World!'"];
for (let i=0; i < queries.length; i++) {
// Promise is resolved once a resource becomes available
console.log("Acquire connection from pool");
promiseArray.push(myPool.acquire());
promiseArray[i]
.then(async function(client) {
console.log("Starting query");
const results = await client.all(queries[i]);
console.log("Results: ", results[0]);
await new Promise(r => setTimeout(r, 200)); // Delay for testing
// Release the connection (or destroy if it exceeds recycleTimeoutMillis)
myPool.release(client);
})
.catch(function(err) {
console.log(err)
});
}
You can easily create additional connection pools that connect to different MotherDuck databases by changing the MotherDuck token.
const opts2 = { ...opts, motherduckToken: process.env.motherduck_token_2};
const myPool2 = createRecyclingPool(opts2);
To shutdown and stop using a pool, you can optionally run the following code in your application:
myPool.drain().then(function() {
myPool.clear();
});
To test the pool, run:
npm install duckdb-async
npm install generic-pool
export motherduck_token="" # Add your MotherDuck token here
node md_connection_pool_test.js