Hey, friend 👋
Hello, I'm Luciano, and I bring you your monthly dose of what's up in DuckDB. This month, we've put together a series of articles and videos to update you on the ecosystem.
Your insights and news are always welcome. Feel free to share by emailing duckdbnews@motherduck.com
Enjoy,
Luciano
Qiusheng Wu
Dr. Qiusheng Wu is the creator of advanced open-source geospatial tools like geemap, leafmap, and segment-geospatial, with thousands of users. His work is inspiring and serves as the foundation for numerous studies. He is an Associate Professor in the Department of Geography & Sustainability at the University of Tennessee, Knoxville, and also serves as an Amazon Visiting Academic and a Senior Research Fellow at the United Nations University. Dr. Wu specializes in geospatial data science and open-source software development, with a particular focus on utilizing big geospatial data and cloud computing to study environmental changes, especially surface water and wetland inundation dynamics.
Top DuckDB Links this Month
DuckDB is fast, there's no doubt about it, right? But a lot of work is also going into making it more robust and reliable at reading and ingesting data. Who hasn't had to deal with a CSV file with corrupted or incorrectly formatted lines? A series of improvements has been made to how DuckDB detects and handles formatting errors in CSV files. Pedro Holanda and Mehdi Ouazza walk through the latest in a relaxed and practical conversation.
Andy Pavlo gives a comprehensive overview of DuckDB and a detailed discussion of its internals, such as its execution model, vectorized query processing, and data storage and retrieval. He covers how DuckDB processes queries and uses hardware efficiently to keep analytical queries fast.
If you work with R, you need to try the 'duckplyr' package. It combines the efficiency of DuckDB with the familiar functionality of dplyr. 'duckplyr' lets data analysts run complex transformations directly on their data frames, improving performance without leaving the dplyr environment. That is a considerable advantage for daily R users, as it pairs ease of use with powerful data processing capabilities.
The possibilities with DuckDB are vast and continue to expand. Alvaro Huarte takes a detailed look at integrating geospatial images with DuckDB's spatial extension. Spatial analysis has become essential in many fields, from geographic information systems (GIS) to urban planning and beyond, and this integration opens up new options for such analyses. It has been gaining significant momentum in our community.
No spoilers, please. But can you guess which Python implementation performed the best? The One Billion Row Challenge is an opportunity to investigate how efficiently we can process a large text file and compute some general statistics. This video explores the most effective strategies for processing the rows using both pure Python and external libraries. Are you surprised by the result?
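If you want a feel for the task before watching, here is a minimal pure-Python sketch of the kind of aggregation involved. It assumes each line looks like `station;temperature` and that the goal is the min, mean, and max per station; the line format and the function name `aggregate` are assumptions for illustration, not taken from the video, and this is a naive baseline rather than any of the implementations it compares.

```python
from typing import Dict, Iterable, Tuple


def aggregate(lines: Iterable[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {station: (min, mean, max)} for 'station;temperature' lines.

    Assumed input format (hypothetical): one 'name;value' pair per line.
    """
    # Per station we keep [min, max, running sum, count].
    stats: Dict[str, list] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        station, _, value = line.partition(";")
        temp = float(value)
        entry = stats.get(station)
        if entry is None:
            stats[station] = [temp, temp, temp, 1]
        else:
            if temp < entry[0]:
                entry[0] = temp
            if temp > entry[1]:
                entry[1] = temp
            entry[2] += temp
            entry[3] += 1
    return {
        station: (low, total / count, high)
        for station, (low, high, total, count) in stats.items()
    }


sample = ["Oslo;1.0", "Lima;20.0", "Oslo;3.0"]
result = aggregate(sample)
```

Because `aggregate` accepts any iterable of strings, you can pass an open file handle and stream the rows one at a time instead of loading the whole file into memory.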
Fly high with the full potential of your Jupyter Notebooks using DuckDB! In this article, Deepa Vasanthkumar demonstrates how integrating these powerful tools enhances your data analysis experience with fast querying and robust data manipulation. Ideal for efficiently handling large datasets, this combination ensures you never compromise on performance or flexibility. Learn the simple steps to take flight and elevate your data skills.
PyIceberg could be the solution you've been looking for to integrate DuckDB with Snowflake. In this article, Julien Hurault presents a step-by-step guide to building a 'multi-engine data stack' that combines Snowflake, DuckDB, and Iceberg, offering efficiency, scalability, and integration between the two engines. While Iceberg is still in its early stages, enabling interoperability among different engines opens up many possibilities.
Who else loves testing out new technologies and exploring them through tutorials and end-to-end projects? If you're like me, check out this project, which builds a complete data stack with Mage, DuckDB, dbt Core, and Superset. It's a fantastic starting point for demos, templates, or learning how all these components work together. Have fun!
This month, our page is full of end-to-end projects, and this one focuses on building a data quality pipeline. If the topic of data quality keeps you up at night, check out this solution that integrates Prefect, Soda, MotherDuck, and YData Profiling. With YData Profiling providing exploratory analysis and Soda running the data quality checks, you can get back to having a peaceful night's sleep.
If you're looking for a practical introduction to using Supabase storage, this video is for you! With clear, step-by-step demonstrations, you'll learn how to connect DuckDB to your PostgreSQL database in Supabase, export data to storage buckets, and perform analyses directly on the files.
We aim to centralize all Duck-related events at motherduck.com/events, but here are some highlights:
20 May, Seattle, WA, USA
Join us for an exciting in-person MotherDuck / DuckDB meetup 🐥 at the MotherDuck office in Seattle on May 20, 2024, from 6:00 PM to 9:00 PM! We'll have engaging talks, networking opportunities with industry experts, and SWAG for attendees.
22 May, Online
Join Frances Perry, Engineering Manager at MotherDuck, for a talk and walkthrough of interactive visualizations done in-browser using Mosaic and WebAssembly (WASM), powered by DuckDB and extended to the cloud with MotherDuck’s serverless analytics platform.