What about DuckDB?
DuckDB stands out in its capacity: its drivers open up an in-process OLAP query engine, equipped with an advanced SQL language, compatible with a wide array of applications. Moreover, DuckDB provides the capability to craft potent extensions, empowering developers to link various data sources using high-performance languages such as C++ or Rust. Though creating those connectors require some effort, the end users can enjoy a natural developer experience on the SQL end.
Many of these extensions, fostered by DuckDB Labs and its community, function as foreign data wrappers tailored for DuckDB. Examples include those designed for Postgres, MySQL, or Athena. While some are in their early stages and may not yet fully support pushdowns, the development of advanced features is actively underway.
What distinguishes DuckDB from larger platforms like Trino or Clickhouse? DuckDB excels with small and medium-sized datasets due to its single-machine architecture and in-process methodology, drastically reducing response times. Adding to this advantage is its effortless setup process: simply integrate the DuckDB driver into your application and seamlessly connect databases using SQL, treating them as if they were native.
A quick example
Let’s demonstrate the previously quoted example in action. Suppose the product data resides in a MySQL database, while the analytics data is stored as a DuckDB file on S3. Firstly, let’s load the extensions and connect to the databases. The procedure would resemble the following SQL commands:
INSTALL mysql_scanner
INSTALL httpfs
LOAD mysql_scanner
LOAD httpfs
CALL load_aws_credentials()
ATTACH 'host=127.0.0.1 user=root port=3306 database=product_db' AS product_db (TYPE MYSQL_SCANNER);
ATTACH 's3://<bucket>/product_stats.db' (READ_ONLY);
As you can observe, once the connections are established and initialized with the database attachments, we can retrieve the actual data seamlessly, as if the data were co-located:
SELECT product.id, product.name, product_stats.views_count, product_stats.in_basket_count
FROM product_db.product
JOIN product_stats.product_stats ON product.id = product_stats.product_id
WHERE product.name LIKE "%duck%"
LIMIT 100 OFFSET 0
With this approach in place, the developer’s journey becomes significantly smoother when tasked with implementing a product that necessitates filtering, pagination, and sorting functionalities.
An experiment
In a recent endeavor, I brought an idea to life by constructing a proof of concept on two MySQL servers, mirroring the previous approach. The steps were as follows:
- I initiated a connection pool from a Scala application to DuckDB, laying the groundwork for the database attachments.
- I crafted a query to unify two tables, each residing in a separate database.
- I executed the query, parsed the resulting data, and returned the content.
The response time clocked in at approximately five seconds. While this isn’t overly lengthy, it’s worth noting that bypassing DuckDB and opting for requests and in-memory joins could potentially trim this down to a brisk 200 milliseconds, given that each query takes about 70 milliseconds on a standalone SQL client.
You might be curious about the factors contributing to this duration. Here are a few insights:
- To push down predicates, the extension fetches the table schema information prior to constructing the actual MySQL query. Although this information is cached post the initial request, failing to run a pre-cache request for table schemas could tack on an extra 2–3 seconds to your response time.
- All requests are encapsulated in a transaction, which could introduce unnecessary overhead.
- Depending on the nature of the request, the absence of a connection pool might lead to sequential database queries, thereby slowing down the process.
- Lastly, I observed that executing the full request, once the schema was cached, took around 2.5 seconds (as measured by the time command in bash), while the profiling details reported a response time of approximately 1.5 seconds on DuckDB.
There’s ample scope for enhancement, but it’s crucial to remember that we’re still navigating the nascent stages of the DuckDB extensions ecosystem.