Hey, friend 👋
Hello. I'm Simon, and I am excited to share another monthly newsletter with highlights and the latest updates about DuckDB, delivered straight to your inbox.
In this February issue, I've curated ten insightful links and highlighted key takeaways from DuckCon #6 in Amsterdam. This edition explores DuckDB's latest features, including UNION ALL BY NAME and new community-driven extensions like Airport for Arrow Flight. Additionally, I cover practical benchmarks comparing DuckDB with tools like DataFusion and UNIX coreutils, along with exciting developments in real-time data processing (Debezium) and cloud integrations with platforms such as Databricks and MotherDuck.
PS: DuckDB 1.2 is out, and MotherDuck added support for it on release day!
If you have feedback, news, or any insights, they are always welcome. 👉 duckdbnews@motherduck.com.
Featured Community Member
Daniel Beach
Daniel Beach is a senior data engineer and an active writer in the data engineering community through Data Engineering Central and Confessions of a Data Guy.
He doesn't write to please people; he writes to share the truth as he sees it. He prefers to share his firsthand experience from working hands-on with tools and processes. If something doesn't work on the first try, it's probably a sign of valuable UX feedback rather than just user error.
Several of his blog posts have already been shared in this newsletter, so it was about time! Thank you for your contribution to the DuckDB community!
Top DuckDB Links this Month
DuckCon #6 in Amsterdam
DuckCon #6 took place last week in Amsterdam with great community talks, fantastic gatherings, and discussions during the breaks (I was there 🥳). Some highlights are:
- Mark talked about how DuckDB has achieved significant performance gains. Gábor successfully ran TPC-H at scale factor 300 (roughly 300GB of input data) on a Raspberry Pi, while scale factor 30,000 was achieved on a 96-core machine. The upcoming DuckDB v1.2 "Harlequin" release will feature improved CSV reader performance, enhanced SQL CLI autocomplete, various performance optimizations, and a new C API for easier extension development.
- The community gave impressive talks, including Rusty's "Airport" extension, which enables DuckDB to interact with Apache Arrow Flight servers and brings Delta Lake integration with write support, and Daniel ten Wolde's SQL/PGQ extension, which implements the SQL:2023 property graph query standard and outperforms Neo4j by 10-100x on certain analytical queries.
- Sam Jewell from Grafana Labs shared essential security considerations when implementing DuckDB in production environments, particularly regarding the CLI's dot commands and file access capabilities. The team also announced their focus on lakehouse format support for the upcoming year, with Mark indicating this would be a significant development priority.
DuckDB's extension ecosystem is maturing rapidly. The new C API and Airport extension make it easier to develop community extensions, while the core DuckDB engine continues to impress with its performance capabilities.
Definite: Duck Takes Flight
TL;DR: Mike discusses using Apache Arrow Flight to implement concurrent read/write operations in DuckDB.
DuckDB is limited by its inability to support concurrent writers and simultaneous reads during writes, which poses challenges for near real-time analytics. Mike introduces a solution that leverages Arrow Flight, enabling multiple writers to send data while queries execute simultaneously. The core implementation is a Flight server class where data can be registered and queried through the do_put and do_get methods. For example, the do_get method executes a query and fetches an Arrow table, allowing for real-time data processing. Mike emphasizes that this architecture facilitates parallel data loading and querying without DuckDB's traditional limitations.
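To make the architecture concrete, here is a minimal sketch of an Arrow Flight server backed by DuckDB. This is not the article's exact code: the port, table naming via the flight descriptor, and the "ticket is a SQL query" convention are assumptions.

```python
import duckdb
import pyarrow.flight as flight


class DuckDBFlightServer(flight.FlightServerBase):
    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.con = duckdb.connect()  # in-memory DuckDB instance

    def do_put(self, context, descriptor, reader, writer):
        # Register the uploaded Arrow data under the table name carried
        # in the flight descriptor (re-registering replaces the table).
        table_name = descriptor.path[0].decode()
        self.con.register(table_name, reader.read_all())

    def do_get(self, context, ticket):
        # Treat the ticket as a SQL query, run it in DuckDB,
        # and stream the result back as Arrow record batches.
        result = self.con.execute(ticket.ticket.decode()).fetch_arrow_table()
        return flight.RecordBatchStream(result)


if __name__ == "__main__":
    DuckDBFlightServer().serve()
```

With this shape, several clients can push data through do_put while others run queries through do_get against the same DuckDB connection.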
Real-time Data Replication with Debezium and Python
TL;DR: This article explores setting up a real-time Change Data Capture (CDC) pipeline using Debezium and Python, loading data into DuckDB via dlt.
Ismail showcases the integration of the Java-based Debezium with a Python environment to capture change data from a PostgreSQL database and load it into DuckDB using pydbzengine. The pipeline leverages Debezium to monitor PostgreSQL transaction logs, producing change event streams. These events are then consumed by a Python-based dlt pipeline built on pydbzengine, which simplifies configuring and running Debezium within Python. The code shows how to set up a PostgreSQL container using Testcontainers and configure Debezium with a Java Properties object, specifying details like the connector class and offset storage.
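For orientation, the kind of configuration the article passes through a Java Properties object looks roughly like the following. The keys are standard Debezium/Kafka Connect engine properties; the values are placeholders, and the pydbzengine call that consumes them is not shown here.

```python
# Illustrative Debezium engine properties for a PostgreSQL source;
# hostnames, credentials, and file paths are placeholders.
debezium_props = {
    "name": "postgres-cdc",
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "appdb",
    "topic.prefix": "app",
    "plugin.name": "pgoutput",
    # Store source offsets locally so the engine can resume where it left off.
    "offset.storage": "org.apache.kafka.connect.storage.FileOffsetBackingStore",
    "offset.storage.file.filename": "/tmp/offsets.dat",
}
```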
DuckDB Database File as a New Standard for Sharing Data?
TL;DR: DuckDB offers a promising solution for data sharing by encapsulating large datasets into a single compressed database file, simplifying data transfer processes.
In a detailed exploration, Josef examines DuckDB's potential to become the new data-sharing standard, particularly highlighting its efficiency in handling large datasets. DuckDB's ability to store data in a columnar format similar to Parquet allows for significant compression, especially with numerical data. For instance, a 10 GB CSV of string data resulted in an 11 GB PostgreSQL table (size on disk) and a 14 GB DuckDB database file, whereas a similar numerical dataset compressed to a 15 GB PostgreSQL table and just a 2.8 GB DuckDB file. This suggests that while DuckDB may initially appear less efficient for string-heavy datasets, it excels with numerical data. Integration with PostgreSQL and MySQL was straightforward, facilitating seamless data transfers without complex export-import processes.
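As an illustration of the sharing workflow (a sketch with placeholder connection details and table names, not the article's code), DuckDB's postgres extension can pull a table straight into a single shareable database file:

```python
import duckdb

# Write into a single-file database that can be handed to someone else.
con = duckdb.connect("shareable.duckdb")
con.sql("INSTALL postgres")
con.sql("LOAD postgres")
# Attach the source PostgreSQL database (connection string is a placeholder).
con.sql("ATTACH 'dbname=appdb user=me host=localhost' AS pg (TYPE postgres)")
# Copy the table; DuckDB stores it compressed in its columnar format.
con.sql("CREATE TABLE measurements AS SELECT * FROM pg.public.measurements")
con.close()
```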
DuckDB vs. Datafusion
TL;DR: An ongoing comparison of DuckDB and DataFusion, focusing on their SQL capabilities and performance characteristics.
Matt discusses the growing trend of using Rust-based tooling in data processing, highlighting the choice of Apache DataFusion as a backend by organizations like Databricks and the Apache Iceberg team. He presents benchmark results showing Rust's efficiency, noting it could write 1 billion rows in under 3 seconds. The article includes code examples illustrating the ease of use for DuckDB and DataFusion, such as the syntax for reading and exporting data. For instance, exporting to Parquet requires just one line of code in both engines. While both tools offer comparable syntax, Matt concludes that DuckDB has a richer SQL feature set. As a practical takeaway, if you're developing a Rust-based API and need a SQL engine, DataFusion might be the logical choice; however, for more advanced SQL features, DuckDB remains superior.
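To give a feel for the comparable syntax, here is a hedged sketch of the one-line Parquet export in both engines. File names are placeholders, and the DataFusion side assumes the datafusion Python bindings; this is not Matt's exact benchmark code.

```python
import duckdb
from datafusion import SessionContext

# DuckDB: read a CSV and export it to Parquet in a single statement.
duckdb.sql(
    "COPY (SELECT * FROM read_csv('data.csv')) TO 'data_duckdb.parquet' (FORMAT parquet)"
)

# DataFusion: the equivalent one-liner with its DataFrame API.
ctx = SessionContext()
ctx.read_csv("data.csv").write_parquet("data_datafusion.parquet")
```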
DuckDB processing remote (s3) JSON files
TL;DR: DuckDB can process remote JSON files stored in S3, highlighting its potential as a versatile SQL tool for data engineering tasks.
The article by Daniel explores the integration of DuckDB with remote JSON files stored in AWS S3, emphasizing its ease of use and efficiency. DuckDB's READ_JSON() function is pivotal, automatically inferring configuration flags from the JSON files. This simplifies data processing tasks that traditionally involve cumbersome transformations with tools like PySpark or AWS Glue. Daniel illustrates this by querying JSON files with minimal code and achieving effective results without intermediate data format conversions. The practical takeaway for database engineers is DuckDB's ability to streamline SQL operations on JSON files directly from S3, reducing code complexity and execution time and making it an appealing choice for handling large-scale JSON datasets in cloud environments. "DuckDB rarely disappoints with its integrations", Daniel notes, underscoring its position as a reliable tool for modern data engineering challenges.
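A minimal sketch of the pattern, with a placeholder bucket, path, and credentials (credentials can also come from the environment or an IAM role):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
# Register S3 credentials for this session (values are placeholders).
con.sql("""
    CREATE SECRET (
        TYPE s3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")
# read_json infers the format and schema directly from the remote files.
con.sql("SELECT * FROM read_json('s3://my-bucket/events/*.json') LIMIT 10").show()
```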
Local dev and cloud prod for faster dbt development
TL;DR: Speed up dbt development using local dev and cloud prod setups with DuckDB and MotherDuck.
Jacob highlights a dual execution setup using DuckDB locally and MotherDuck in the cloud to speed up dbt development. By configuring the dbt profile for dual execution, where the local environment uses path: local.db and the cloud environment uses path: "md:jdw", developers can leverage local resources for faster iterations. Jacob provides code snippets for setting up dbt profiles and conditionally sampling data with Jinja, such as from {{ source("tpc-ds", "catalog_sales") }} {% if target.name == 'local' %} using sample 1% {% endif %}. This method keeps datasets under a million rows for local development while using the cloud for larger datasets. A practical takeaway is the accelerated development cycle, with roughly 5X faster dbt runs achieved by minimizing local data size and optimizing compute resources.
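A sketch of what such a dual-target profile can look like; the profile and database names follow the article's md:jdw example, and the exact layout of your profiles.yml may differ:

```yaml
jdw:
  target: local          # default to the fast local target
  outputs:
    local:
      type: duckdb
      path: local.db     # local DuckDB file for quick iteration
    prod:
      type: duckdb
      path: "md:jdw"     # MotherDuck database for production runs
```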
Access Databricks UnityCatalog from DuckDB
TL;DR: Accessing Databricks UnityCatalog data using DuckDB improves efficiency by bypassing Spark for direct delta file reads.
The article by Matthias discusses leveraging DuckDB to query data from Databricks UnityCatalog. It highlights two primary methods: using PySpark to read data into Arrow format and querying it with DuckDB, or reading the Delta files directly from storage with DuckDB, which avoids Spark's overhead. The direct method uses Databricks' temporary table credentials API to obtain secure access to storage, allowing DuckDB to handle "larger than memory" datasets efficiently. Setting SET azure_transport_option_type = 'curl'; in DuckDB ensures proper certificate handling along the way. Matthias points out that DuckDB's ability to interact directly with data stored in UnityCatalog, without Spark intermediation, enhances performance, particularly for data that does not fit into memory.
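A rough sketch of the direct-read path: the storage URL is a placeholder, the credential wiring from Databricks' temporary table credentials API is omitted, and the delta and azure extensions are assumed to be available.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL delta")
con.sql("LOAD delta")
con.sql("INSTALL azure")
con.sql("LOAD azure")
# The article's workaround for certificate handling on Azure.
con.sql("SET azure_transport_option_type = 'curl'")
# Query the Delta table directly from storage, bypassing Spark.
con.sql("""
    SELECT count(*)
    FROM delta_scan('abfss://container@account.dfs.core.windows.net/path/to/table')
""").show()
```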
Vertical Stacking as the Relational Model Intended: UNION ALL BY NAME
TL;DR: UNION ALL BY NAME enhances vertical stacking by matching columns by name, supporting evolving schemas, and improving performance.
This feature lets you stack two SQL queries by automatically aligning columns by name, as opposed to UNION ALL, where the column order must match across the datasets you combine. It is especially relevant when dealing with evolving schemas in data lakes or combining datasets with different column orders. It simplifies operations by automatically filling missing columns with NULL values, reducing the need for explicit schema management.
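A small illustrative example (the column values and names are made up):

```python
import duckdb

duckdb.sql("""
    SELECT 1 AS id, 'alice' AS name
    UNION ALL BY NAME
    SELECT 'NL' AS country, 'bob' AS name, 2 AS id
""").show()
# Columns are matched by name regardless of order, and the missing
# 'country' value for the first row is filled with NULL.
```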
Upcoming Events
Local Dev to Cloud Prod
13 February, Online - 6 PM PT
Join us to discover how MotherDuck simplifies your development experience by eliminating the friction between local development and cloud production environments.
Getting Started with MotherDuck
20 February, Online
Looking to get started with MotherDuck and DuckDB? Join us for a live session to learn how MotherDuck makes analytics fun, frictionless, and ducking awesome!
Fast & Scalable Analytics Pipelines with MotherDuck & dltHub
26 February, Online
Explore how dltHub's Python-based ETL capabilities, paired with MotherDuck, empower you to effortlessly build fast, scalable analytics pipelines from local development to cloud-native production.