Why Use DuckDB for Analytics?

2022/11/11 - 5 min read

Industries transform on the back of momentous technological change. For example, the modern cloud data warehouse arose a decade ago on a foundation of powerful cloud storage, compute, and networking. When we founded MotherDuck we recognized that DuckDB might just be the next major game changer thanks to its ease of use, portability, lightning-fast performance, and a rapid pace of community-driven innovation.

First, What is DuckDB?

DuckDB is an open source in-process SQL OLAP database management system. DuckDB can be thought of as “SQLite for analytics” - you can embed it in virtually any codebase and run it in virtually any environment with minimal complexity.

As an in-process database, DuckDB is a storage and compute engine that enables developers, data scientists and data analysts to power their code with extremely fast analyses using plain SQL. Additionally, DuckDB can analyze data wherever it lives, be it on your laptop or in the cloud.

DuckDB comes with a command-line interface for rapid prototyping, and you can try DuckDB right now using the hosted DuckDB shell.

Runs Anywhere

Thanks to DuckDB, practically any CPU in the world can now be mobilized to perform powerful analytics. DuckDB is portable and modular, with no external dependencies. Thus you can run DuckDB on your laptop, in the browser, on a cloud VM, in a cloud function, and even in a CDN edge point-of-presence.

You can use DuckDB in Python notebooks, R scripts, Javascript data apps, or Java backends. DuckDB is universally useful for data scientists, analysts, data engineers, and application developers.

Simplified Data Access

Analysts often tell us that they wish to analyze data that lives in disparate places - CSV files on their laptops, Parquet files on S3, dataframes in their Python notebooks, and even tables in relational databases. DuckDB challenges the status quo that needlessly complicates access to these diverse data sources. With DuckDB, you’re at most one or two commands away from querying data where it lies, whether it’s on your local hard drive, in the cloud, or in another database.

These are all valid SQL statements in DuckDB:


SELECT AVG(trip_distance) FROM 's3://yellow_tripdata_20[12]*.parquet';
SELECT * FROM '~/local/files/file.parquet';
SELECT * FROM dataframe;
SELECT * FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet';

Do you have Arrow tables, PostgreSQL databases or SQLite databases? DuckDB can directly query those too; no import required!

DuckDB rose in prominence thanks to its ease of use in Python alongside pandas, a hugely popular library for data science. While pandas enables rich and powerful data science transformations, DuckDB dramatically accelerates analytical workloads, with the added benefit of using a standard SQL interface. DuckDB can even treat pandas dataframes as DuckDB tables and query them directly.


import pandas as pd
import duckdb

mydf = pd.DataFrame({'a': [1, 2, 3]})
print(duckdb.query("SELECT sum(a) FROM mydf;").fetchall())

DuckDB enables users to connect to powerful BI tools like Tableau, Looker, or Superset with standard ODBC or JDBC drivers. Additionally, DuckDB is available in Python, R, Java, Node.js, Julia, C/C++, and WASM.

Fast Aggregation and Excellent SQL Support, the Key to Analytics

DuckDB is designed as an analytics database from the bottom up – aiming to squeeze out every ounce of performance while also allowing you to perform complex analytical queries using standardized SQL.

As an analytics database, DuckDB is optimized for read operations and can also perform updates in a transactional ACID-compliant fashion. It stores data in a compressed columnar format, which provides the best performance for large-scale aggregations. This is in contrast to a transactional database, which is optimized for high-frequency writes and typically stores data as rows (tuples) to support that.
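The row-versus-column difference can be sketched in plain Python (a toy illustration only, not DuckDB's actual storage code):

```python
from array import array

# Row-oriented: each record is a tuple; a column aggregate must
# walk every tuple and pick out one field
rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "c", 3.5)]
row_sum = sum(r[2] for r in rows)

# Column-oriented: each column is one contiguous array; the aggregate
# scans only the bytes it needs, which also compresses better and is
# friendlier to CPU caches
distances = array("d", [1.5, 2.5, 3.5])
col_sum = sum(distances)

assert row_sum == col_sum == 7.5
```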

Additionally, DuckDB has a vectorized query engine, enabling small batches of data to be processed at once on processors supporting SIMD (Single Instruction, Multiple Data). These small batches are optimized for locality to the CPU, utilizing the L1/L2/L3 caches which have the lowest latency, as opposed to only using main memory.
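A rough sketch of the idea in toy Python (standing in for DuckDB's C++ engine, which processes vectors of roughly 2,048 values):

```python
# Process a column in fixed-size batches ("vectors") rather than
# one value at a time; each batch fits comfortably in CPU caches
VECTOR_SIZE = 2048

def batched_sum(column):
    total = 0.0
    for start in range(0, len(column), VECTOR_SIZE):
        batch = column[start:start + VECTOR_SIZE]  # one cache-resident vector
        total += sum(batch)  # in DuckDB this inner loop is SIMD-friendly C++
    return total

print(batched_sum(list(range(10000))))  # 49995000.0
```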

The SQL engine is thoroughly tested and aims to support PostgreSQL-style SQL, along with some special analytical functions and custom syntax that’s helpful for analysts. You get window functions, statistical sampling, a good math library, and even support for working with nested data.

Open Source Community that Flocks Together

With hundreds of contributors and 7.1k GitHub stars at time of publication, DuckDB is home to a vibrant and rapidly expanding open source community. Contributors are working on core database functionality, improved integrations with external data formats and tooling, improved documentation and all other aspects of the project. The community flocks together on Discord, with over 1,100 members, supported by a growing DuckDB foundation.

Innovation at an Incredible Pace

The DuckDB project came out of academic research, so naturally its code base is very clean. Moreover, DuckDB is based on a very simple scale-up architecture, which enables an unparalleled velocity of innovation, and the DuckDB team habitually implements cutting-edge academic research (e.g., compression algorithms). As a consequence, DuckDB is getting faster, more efficient, and easier to use every single month.

Next Steps

At MotherDuck, we want to help the community, the DuckDB Foundation and DuckDB Labs build greater awareness and adoption of DuckDB, whether users are working locally or want a serverless always-on way to execute their SQL.
