Building data-driven components and applications doesn't have to be so ducking hard

DuckDB Tutorial For Beginners

2023/04/26

BY

Subscribe to MotherDuck Blog

If you haven't had the chance to get up to speed with DuckDB, this tutorial is for you! We'll go over the essentials, from installation to workflow, getting to know the command-line interface (CLI), and diving into your first analytics project. If are too lazy to read, I also made a video for this tutorial.

Let's start quacking some code!

What is DuckDB?

DuckDB is an in-process SQL OLAP database, which means it runs within the same process as the application using it. This unique feature allows DuckDB to offer the advantages of a database without the complexities of managing one. But, as with any software concept, the best way to learn is to dive in and get your hands dirty.

We’ll be showing examples using the DuckDB command-line client (CLI), but you can also use DuckDB from within Python, R, and other languages, or any tool supporting JDBC or ODBC drivers.

In the below snippets, any code example prefixed with $ means that it’s a bash command. Otherwise we assume that these would run within a DuckDB process.

Installation

Installing DuckDB is a breeze. Visit the DuckDB documentation website and download the binary for your operating system.

For MacOS users, you can leverage the famous Homebrew package manager to make the DuckDB CLI directly available in your PATH, simplifying upgrades and installations.

To install DuckDB using Homebrew, run the following command:

$ brew install duckdb

You can now launch DuckDB by simply calling the duckdb CLI command.

$ duckdb
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D 

Workflow with VSCode

To follow along with our exploration of DuckDB, check out this GitHub repository. I recommend working with an editor, a SQL file, and sending commands to the terminal for a lightweight setup. This approach offers visibility on all commands, enables you to safely version control them, and allows you to leverage formatting tools and AI friends like Copilot if you are missing them.

In our example, we'll use Visual Studio Code (VSCode). To configure a custom shortcut to send commands from the editor to the terminal, open the keyboard shortcuts JSON file and add a key binding to the following command :

{
    "key": "shift+enter",
    "command": "workbench.action.terminal.runSelectedText"
}

Of course, this workflow can be pretty easily replicated with any editor or IDE!

Data Persistence with DuckDB

By default, DuckDB is an in-memory process and won't persist any data. To demonstrate this, let's create a simple table based on a query result:

$ duckdb
CREATE TABLE ducks AS SELECT 3 AS age, 'mandarin' AS breed;
FROM ducks;
┌───────┬──────────┐
│  age  │  breed   │
│ int32 │ varchar  │
├───────┼──────────┤
│     3 │ mandarin │
└───────┴──────────┘

This query creates and populates a "ducks" table. Note that we are using the FROM statement without any SELECT * statement, this is a neat shortcut in DuckDB! However, if we exit the CLI and reopen it, the table will be gone.

To persist data, you have two options:

  1. Provide a path to a database file when starting DuckDB. The file can have any extension, but common choices are .db, .duckdb, or .ddb. If no database exists at the specified path, DuckDB will create one. You can also launch DuckDB with a database in read-only mode to avoid modifying the database.

    $ duckdb /data/myawesomedb.db
    
    
  2. If DuckDB is already running, use the attach command to connect to a database.

    ATTACH DATABASE '/path/to/your/database.db' AS mydb;
    

The database file uses DuckDB's custom single-file format (all tables are included), which supports transactional ACID compliance and stores data in a compressed columnar format for optimal aggregation performance.

Reading and Displaying Data

Next, let's explore reading and writing data in CSV and Parquet formats. We'll use a small dataset from Kaggle containing daily Netflix Top 10 Movie/TV Show data for the United States from 2020 to March 2022.

To load the CSV dataset, use the read_csv_auto command, which infers the schema and detects the delimiter. You can also use the read_csv command and pass the schema and delimiter as parameters.

SELECT * FROM read_csv_auto('path/to/your/file.csv');

When you use this command, the dataset is read, but an actual table is not created in your DuckDB database. To create a table, use the following command:

CREATE TABLE netflix_top10 AS SELECT * FROM read_csv_auto('path/to/your/file.csv');

To write data to a CSV file, use the COPY command and specify the delimiter. For Parquet files, simply specify the file format:

COPY ./data/netflix_top10.csv TO 'path/to/your/output/file.csv' WITH (FORMAT 'CSV', DELIMITER ',');
COPY ./data/netflix_top10.csv TO 'path/to/your/output/file.parquet' WITH (FORMAT 'PARQUET');

To read data from a Parquet file, use the read_parquet command:

SELECT * FROM read_parquet('path/to/your/file.parquet');

Display Modes, Output Options

DuckDB CLI offers various ways to enhance your experience by customizing the data display and output options.

You can use the .mode command to change the appearance of tables returned in the terminal output. For instance, if you are dealing with long nested JSON, you can change the mode to line or JSON to have a better view on your data.

.mode line
SELECT * FROM './data/sales.json';
sales_data = [{'order_id': 1, 'customer': {'id': 101, 'name': John Doe, 'email': john.doe@example.com}, 'items': [{'product_id': 301, 'product_name': Laptop, 'quantity': 1, 'price': 1200}, {'product_id': 302, 'product_name': Mouse, 'quantity': 1, 'price': 25}], 'total_amount': 1225, 'date': 2023-03-24}, {'order_id': 2, 'customer': {'id': 102, 'name': Jane Smith, 'email': jane.smith@example.com}, 'items': [{'product_id': 303, 'product_name': Keyboard, 'quantity': 1, 'price': 50}, {'product_id': 304, 'product_name': Monitor, 'quantity': 1, 'price': 200}], 'total_amount': 250, 'date': 2023-03-25}]

Next to that, you can output elsewhere the data by redirecting the terminal output to a file.

Let’s say you would like to output the result to a Markdown file, you can set the display mode to Markdown with .mode markdown. Combine this with the .output or .once command to write the result directly to a specific file. The .output command writes all the output of the different results you run, while .once does it just once.

.mode markdown
.output myfile.md

Running Commands and Exiting

DuckDB CLI allows you to run a command and exit using the -c option parameter. For example, if you use a SELECT statement to read a Parquet file:

$ duckdb -c "SELECT * FROM read_parquet('path/to/your/file.parquet');"

This feature is lightweight, fast, and easy. You can even build your own bash functions using the DuckDB CLI for various operations on CSV/Parquet files, such as converting a CSV to Parquet.

DuckDB also offers flags for configuration that you can fine-tune. You can find the full list here.

Working with Extensions

Extensions are like packages that you can install within DuckDB to enjoy specific feature. DuckDB supports a number of core extensions. Not all are included by default, but DuckDB has a mechanism for remote extension installation. To view the available core extensions, use the following command:

SELECT * FROM duckdb_extensions();

To install an extension, such as the popular httpfs extension that allows reading/writing remote files over HTTPS and S3, use the INSTALL command followed by the extension name. Once installed, DuckDB downloads the extension to the $HOME/.duckdb/ folder (modifiable by setting the extension_directory parameter).

Next, load the extension in the DuckDB process with the LOAD command.

INSTALL httpfs;
LOAD httpfs;

If you're using a third-party extension or your own extension not bundled by default, set the allow_unsigned_extensions flag to True, or use the -unsigned flag parameter when launching DuckDB.

$ duckdb -unsigned

Extensions are powerful and versatile. You can create your own using the template provided by the DuckDB Labs team to kickstart your extension development journey.

First analytics project

We have the mentioned Netflix dataset hosted on a public AWS S3 bucket. In this simple project, we will answer the most existential question : what were people in the US binge-watching during the COVID lockdown?

As the data is sitting on AWS S3, we’ll start by installing the extension httpfs.

-- Install extensions
INSTALL httpfs;
LOAD httpfs;
-- Minimum configuration for loading S3 dataset if the bucket is public
SET s3_region='us-east-1';

We can now read our dataset :


CREATE TABLE netflix AS SELECT * FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet');
FROM netflix;
┌────────────┬───────┬───────────────────┬───┬────────────────┬──────────────────┐
│   As ofRankYear to Date Rank │ … │ Days In Top 10Viewership Score │
│    date    │ int64 │      varchar      │   │     int64      │      int64       │
├────────────┼───────┼───────────────────┼───┼────────────────┼──────────────────┤
│ 2020-04-0111                 │ … │              990 │
│ 2020-04-0122                 │ … │              545 │
│ 2020-04-0133                 │ … │              976 │
│ 2020-04-0144                 │ … │              530 │
│ 2020-04-0155                 │ … │              955 │
│ 2020-04-0166                 │ … │              414

Finally, getting the top watched movies as follow :

-- Display the most popular TV Shows
SELECT Title, max("Days In Top 10") from netflix
where Type='Movie'
GROUP BY Title
ORDER BY max("Days In Top 10") desc
limit 5;
┌────────────────────────────────┬───────────────────────┐
│             Titlemax("Days In Top 10") │
│            varchar             │         int64         │
├────────────────────────────────┼───────────────────────┤
│ The Mitchells vs. The Machines31 │
│ How the Grinch Stole Christmas29 │
│ Vivo29 │
│ 365 Days28 │
│ Despicable Me 227 │
└────────────────────────────────┴───────────────────────┘

-- Copy the result to CSV
COPY (
SELECT Title, max("Days In Top 10") from netflix
where Type='TV Show'
GROUP BY Title
ORDER BY max("Days In Top 10") desc
limit 5
) TO 'output.csv' (HEADER, DELIMITER ',');

What’s fun is that for both Movies and TV shows, the top 5 mostly include kids show. We all know that kids doesn’t bother to see multiple time the same thing…

Exploring Beyond the Pond

That’s it for this tutorial! If you're interested in delving deeper into DuckDB, check out these resources:

To elevate your experience with DuckDB and scale it to the cloud, or to use a user-friendly web interface, explore MotherDuck! Dive into our end-to-end tutorial to discover all the features we offer.

Additionally, stay tuned to our monthly newsletter and YouTube channel, where we'll continue to share more DuckDB-related content!

Keep quacking, keep coding.

CONTENT
  1. What is DuckDB?
  2. Installation
  3. Workflow with VSCode
  4. Reading and Displaying Data
  5. Display Modes, Output Options
  6. Running Commands and Exiting
  7. Working with Extensions
  8. First analytics project
  9. Exploring Beyond the Pond

Subscribe to MotherDuck Blog