This Month in the DuckDB Ecosystem: November 2024
2024/11/04 - 7 min read
BYSubscribe to the newsletter
Hey, friend π
Hello. I'm Simon, and I am excited to share another monthly newsletter with highlights and the latest updates about DuckDB, delivered straight to your inbox.
In this November issue, I gathered twelve exciting links, ranging from using DuckDB as an HTTP OLAP server to integrating it with the OSS Unity catalog to exciting applications such as DuckDB as a DrugDB. Others are building high-performance and cost-efficient data pipelines with DuckDB and Python and lazy loading data frames with Hex. Notable features include Excel-style pivoting, enhanced ACID compliance, and MotherDuck's new LLM integration with SQL. Please enjoy.
As always, if you have feedback, news, or any insights, they are always welcome. ππ» duckdbnews@motherduck.com.
Featured Community Member |
Lorenzo Mangani
Lorenzo is CEO and Co-Founder at QXIP BV, Leaders in Open-Source Telecom Observability. But he's also recently contributed to the DuckDB community by creating some community DuckDB extensions (more on that below!). Thanks Lorenzo, for your creativity within the DuckDB community!
Top DuckDB Links this Month |
Building Cost-Efficient Data Pipelines with Python & DuckDB
Joseph writes about cost-efficient data pipelines and categorizes data pipelines with Python and DuckDB into three parts, helping visualize when best to use DuckDB. He also offers a GitHub repo with the code.
Ducklake: Integrate DuckDB with Unity Catalog
In this article, the team of Xebia integrated DuckDB using dbt and Jupyter notebooks with the open-source Unity Catalog and uses the advantages of both worlds. The integration provides real-time updates in the Unity Catalog UI, confirming that DuckDB and Unity Catalog are fully integrated.
Community Extensions: DuckDB HTTP GET/POST Client // HTTP Server
A new HTTP DuckDB Community Extension is out. With it, you can make HTTP requests directly from within DuckDB. For example, with your SQL query, you can create a GET request using http_get(url) and a POST request using http_post(url, headers, params). This can be useful for fetching data from REST APIs, extracting data from JSON payloads, and processing API responses using DuckDB's SQL capabilities. The extension is available in the community extension repository, but its status is still experimental.
Quackscience also released another extension, the HTTP Server Extension, which transforms any DuckDB instance into an HTTP OLAP API server. With just a few commands, you get a queryable HTTP API with authentication support, a built-in query UI, and the ability to work with local and MotherDuck datasets. This makes it perfect for spinning up quick data services or creating distributed query networks while maintaining DuckDB's simplicity and performance.
DuckDB User Survey Analysis
DuckDB Labs surveyed 500+ DuckDB users and shared their findings.
It's no surprise that DuckDB is often used on a laptop, but servers were also popular. The most popular clients are the Python API and the standalone CLI client. Most users don't have giant data sets but appreciate the high performance. Users would like performance optimizations related to time series and partitioned data. DuckDB is popular among data engineers, analysts, scientists, and software engineers. The survey includes many more findings, including some nice graphs.
Excel-Style Pivoting, read_excel() function and duckdb-gsheets
Excel never dies, and with it, the Pivot Tables π. This year, in the year of the return of Pivot Tables (I have seen them in Rill and Cube), DuckDB supports these now, too, with:
INSTALL pivot_table FROM community;Β
LOAD pivot_table;
The extension supports well-known SQL features, such as PIVOT, UNNEST, MACRO, GROUPING SETS, ROLLUP, UNION ALL BY NAME, COLUMNS, and many more.
On the same note, Thomas wrote, "Whereβs the read Excel() function in DuckDB?". Surprisingly, thereΒ isΒ no read_excel() function yet, but you can (mis)use DuckDB'sΒ SpatialΒ extension. But what does spatial have to do with Excel? Nothing, but it's rooted in the fact that historically and even now, many geospatial files were β and still are β shared in Excel data files. Archie took it further with duckdb-gsheets, reading, and writing to Google Sheets.
DuckDB as a DrugDB: a Free and Simple Multi-Model Drug and Trial Database
This is the fourth case study for clinical trials Sixing has made. He tried Google Spanner, Postgres, SurrealDB, and now DuckDB. He uses a combined dataset containing over 5000 drugs, 2000 disorders, and 2000Β clinical trials, as well as Superset, for visualization. He uses extensions for full-text search (fits) and vector similarity search (vss) as well as DuckPGQ and uses theΒ PGQΒ (Property Graph Query Language) for graph-related operations.
Sixing concluded that DuckDB's extension system successfully handles SQL, graph queries, vector searches, and full-text searches, making it suitable for complex healthcare data analysis. While the ecosystem needs development in areas like visualization tools, DuckDB's columnar storage and SQL/GQL compatibility make it an attractive alternative to traditional databases.
Building a High-Performance Data Pipeline Using DuckDB
Alireza showcases how to build an efficient data pipeline using DuckDB as a compute engine for data lakes, implementing a Medallion architecture (Bronze β Silver β Gold) with GitHub Archive data. His detailed guide shows how DuckDB's in-memory processing and SQL capabilities can handle JSON ingestion, Parquet serialization, and data aggregation with impressive performanceβprocessing nearly six million records in under a minute, with complete sample code available.
Changing Data with Confidence and ACID
Hannes and Mark explain the ACID principles behind DuckDB and how you can confidently change data with full ACID guarantees by default without additional configuration. Everything started with transactions, and eventually, the well-known ACID came with the principles that describe a set of guarantees that a data management system must provide to be considered safe. ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. Every one of these is explained in greater detail.
It's pretty novel for an OLAP database to have ACID. The article goes on to explain why ACID in OLAP makes sense. It summarizes that DuckDB has passed the specific ACID Transaction tests from the TPC-H Benchmark tests. Check the tests out at GitHub.
Optimizing Multi-Modal Analysis by Lazy Loading Dataframes
Hex, a notebook-based solution, has improved its performance by migrating to a DuckDB-based architecture that directly queries Arrow data from S3. This has enabled 5-10x speedups in execution times. Moving data processing from Python to DuckDB and Arrow in their backend service enabled lazy loading and more efficient data handling, eliminating pandas format limitations and reducing memory usage.
Introducing the prompt() Function: Use the Power of LLMs with SQL
MotherDuck released LLMs within SQL. Instead of a context switch, we can ask the LLM to summarize text into a short poem with :
Copy code
SELECT
prompt('summarize the comment in a Haiku: ' || text)
AS summary FROM sample_data.hn.hacker_news
limit 20;
Β
Upcoming Events |
PyData NYC: A Duck in the hand is worth two in the Cloud: Data preparation and analytics on your laptop with DuckDB
08 November, 11 Times Square, New York City, NY π½ - 2:30 PM US, Eastern
Guen Prawiroatmodjo & Jacob Matson will showcase how DuckDB replaces Spark for 10GB-1TB tasks on laptops with fast, seamless Python integration, enabling efficient analytics and easy Cloud deployment via MotherDuckβs serverless support.
Small Data NYC: Watch Party Wednesday with Altana, Jamsocket and MotherDuck
13 November, 25 Kent, Williamsburg, Brooklyn π½ - 6:00 PM America, New York
DataGalaxy Tech Summit NYC: How to put DuckDB to work today?
13 November New York City π½ - 3:30 PM US, Eastern
Nick Ursa of MotherDuck will present a talk on DuckDB. DataGalaxy brings together industry experts to share their insights on optimizing data models, choosing the best data storage formats, and insights on streamlining data ingestion processes.
AI Native Summit 2024
21 November, Computer History Museum, Mountain View, CA π - 12:00 PM America, Los Angeles
Join MotherDuck CEO Jordan Tigani and AI leaders across research, startups and global companies for a day of discussion about the state of enterprise AI.
Data Rock N' Roll at AWS re:Invent
3 December, Brooklyn Bowl Las Vegas π€ - 6:00 PM America, Los Angeles
Attendees will enjoy a fun-filled atmosphere where they can network with fellow AWS enthusiasts, industry leaders, and innovators while competing in friendly bowling matches.
CONTENT
- Hey, friend π
- Featured Community Member
- Top DuckDB Links this Month
- Upcoming Events
Subscribe to the newsletter