Subscribe to MotherDuck Blog
If you haven't had the chance to get up to speed with DuckDB, this tutorial is for you! We'll go over the essentials, from installation to workflow, getting to know the command-line interface (CLI), and diving into your first analytics project. If are too lazy to read, I also made a video for this tutorial.
Let's start quacking some code!
What is DuckDB?
DuckDB is an in-process SQL OLAP database, which means it runs within the same process as the application using it. This unique feature allows DuckDB to offer the advantages of a database without the complexities of managing one. But, as with any software concept, the best way to learn is to dive in and get your hands dirty.
Weβll be showing examples using the DuckDB command-line client (CLI), but you can also use DuckDB from within Python, R, and other languages, or any tool supporting JDBC or ODBC drivers.
In the below snippets, any code example prefixed with $
means that itβs a bash command. Otherwise we assume that these would run within a DuckDB process.
Installation
Installing DuckDB is a breeze. Visit the DuckDB documentation website and download the binary for your operating system.
For MacOS users, you can leverage the famous Homebrew package manager to make the DuckDB CLI directly available in your PATH, simplifying upgrades and installations.
To install DuckDB using Homebrew, run the following command:
Copy code
$ brew install duckdb
You can now launch DuckDB by simply calling the duckdb
CLI command.
Copy code
$ duckdb
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
Workflow with VSCode
To follow along with our exploration of DuckDB, check out this GitHub repository. I recommend working with an editor, a SQL file, and sending commands to the terminal for a lightweight setup. This approach offers visibility on all commands, enables you to safely version control them, and allows you to leverage formatting tools and AI friends like Copilot if you are missing them.
In our example, we'll use Visual Studio Code (VSCode). To configure a custom shortcut to send commands from the editor to the terminal, open the keyboard shortcuts JSON file and add a key binding to the following command :
Copy code
{
"key": "shift+enter",
"command": "workbench.action.terminal.runSelectedText"
}
Of course, this workflow can be pretty easily replicated with any editor or IDE!
Data Persistence with DuckDB
By default, DuckDB is an in-memory process and won't persist any data. To demonstrate this, let's create a simple table based on a query result:
Copy code
$ duckdb
CREATE TABLE ducks AS SELECT 3 AS age, 'mandarin' AS breed;
FROM ducks;
βββββββββ¬βββββββββββ
β age β breed β
β int32 β varchar β
βββββββββΌβββββββββββ€
β 3 β mandarin β
βββββββββ΄βββββββββββ
This query creates and populates a "ducks" table. Note that we are using the FROM statement without any SELECT *
statement, this is a neat shortcut in DuckDB!
However, if we exit the CLI and reopen it, the table will be gone.
To persist data, you have two options:
-
Provide a path to a database file when starting DuckDB. The file can have any extension, but common choices are
.db
,.duckdb
, or.ddb
. If no database exists at the specified path, DuckDB will create one. You can also launch DuckDB with a database in read-only mode to avoid modifying the database.Copy code
$ duckdb /data/myawesomedb.db
-
If DuckDB is already running, use the
attach
command to connect to a database.Copy code
ATTACH DATABASE '/path/to/your/database.db' AS mydb;
The database file uses DuckDB's custom single-file format (all tables are included), which supports transactional ACID compliance and stores data in a compressed columnar format for optimal aggregation performance.
Reading and Displaying Data
Next, let's explore reading and writing data in CSV and Parquet formats. We'll use a small dataset from Kaggle containing daily Netflix Top 10 Movie/TV Show data for the United States from 2020 to March 2022.
To load the CSV dataset, use the read_csv_auto command, which infers the schema and detects the delimiter. You can also use the read_csv
command and pass the schema and delimiter as parameters.
Copy code
SELECT * FROM read_csv_auto('path/to/your/file.csv');
When you use this command, the dataset is read, but an actual table is not created in your DuckDB database. To create a table, use the following command:
Copy code
CREATE TABLE netflix_top10 AS SELECT * FROM read_csv_auto('path/to/your/file.csv');
To write data to a CSV file, use the COPY
command and specify the delimiter. For Parquet files, simply specify the file format:
Copy code
COPY ./data/netflix_top10.csv TO 'path/to/your/output/file.csv' WITH (FORMAT 'CSV', DELIMITER ',');
COPY ./data/netflix_top10.csv TO 'path/to/your/output/file.parquet' WITH (FORMAT 'PARQUET');
To read data from a Parquet file, use the read_parquet
command:
Copy code
SELECT * FROM read_parquet('path/to/your/file.parquet');
Display Modes, Output Options
DuckDB CLI offers various ways to enhance your experience by customizing the data display and output options.
You can use the .mode
command to change the appearance of tables returned in the terminal output. For instance, if you are dealing with long nested JSON, you can change the mode to line
or JSON
to have a better view on your data.
Copy code
.mode line
SELECT * FROM './data/sales.json';
sales_data = [{'order_id': 1, 'customer': {'id': 101, 'name': John Doe, 'email': john.doe@example.com}, 'items': [{'product_id': 301, 'product_name': Laptop, 'quantity': 1, 'price': 1200}, {'product_id': 302, 'product_name': Mouse, 'quantity': 1, 'price': 25}], 'total_amount': 1225, 'date': 2023-03-24}, {'order_id': 2, 'customer': {'id': 102, 'name': Jane Smith, 'email': jane.smith@example.com}, 'items': [{'product_id': 303, 'product_name': Keyboard, 'quantity': 1, 'price': 50}, {'product_id': 304, 'product_name': Monitor, 'quantity': 1, 'price': 200}], 'total_amount': 250, 'date': 2023-03-25}]
Next to that, you can output elsewhere the data by redirecting the terminal output to a file.
Letβs say you would like to output the result to a Markdown file, you can set the display mode to Markdown with .mode markdown
. Combine this with the .output
or .once
command to write the result directly to a specific file. The .output
command writes all the output of the different results you run, while .once
does it just once.
Copy code
.mode markdown
.output myfile.md
Running Commands and Exiting
DuckDB CLI allows you to run a command and exit using the -c
option parameter. For example, if you use a SELECT statement to read a Parquet file:
Copy code
$ duckdb -c "SELECT * FROM read_parquet('path/to/your/file.parquet');"
This feature is lightweight, fast, and easy. You can even build your own bash functions using the DuckDB CLI for various operations on CSV/Parquet files, such as converting a CSV to Parquet.
DuckDB also offers flags for configuration that you can fine-tune. You can find the full list here.
Working with Extensions
Extensions are like packages that you can install within DuckDB to enjoy specific feature. DuckDB supports a number of core extensions. Not all are included by default, but DuckDB has a mechanism for remote extension installation. To view the available core extensions, use the following command:
Copy code
SELECT * FROM duckdb_extensions();
To install an extension, such as the popular httpfs
extension that allows reading/writing remote files over HTTPS and S3, use the INSTALL
command followed by the extension name. Once installed, DuckDB downloads the extension to the $HOME/.duckdb/
folder (modifiable by setting the extension_directory
parameter).
Next, load the extension in the DuckDB process with the LOAD
command.
Copy code
INSTALL httpfs;
LOAD httpfs;
If you're using a third-party extension or your own extension not bundled by default, set the allow_unsigned_extensions
flag to True
, or use the -unsigned
flag parameter when launching DuckDB.
Copy code
$ duckdb -unsigned
Extensions are powerful and versatile. You can create your own using the template provided by the DuckDB Labs team to kickstart your extension development journey.
First analytics project
We have the mentioned Netflix dataset hosted on a public AWS S3 bucket. In this simple project, we will answer the most existential question : what were people in the US binge-watching during the COVID lockdown?
As the data is sitting on AWS S3, weβll start by installing the extension httpfs.
Copy code
-- Install extensions
INSTALL httpfs;
LOAD httpfs;
-- Minimum configuration for loading S3 dataset if the bucket is public
SET s3_region='us-east-1';
We can now read our dataset :
Copy code
CREATE TABLE netflix AS SELECT * FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet');
FROM netflix;
ββββββββββββββ¬ββββββββ¬ββββββββββββββββββββ¬ββββ¬βββββββββββββββββ¬βββββββββββββββββββ
β As of β Rank β Year to Date Rank β β¦ β Days In Top 10 β Viewership Score β
β date β int64 β varchar β β int64 β int64 β
ββββββββββββββΌββββββββΌββββββββββββββββββββΌββββΌβββββββββββββββββΌβββββββββββββββββββ€
β 2020-04-01 β 1 β 1 β β¦ β 9 β 90 β
β 2020-04-01 β 2 β 2 β β¦ β 5 β 45 β
β 2020-04-01 β 3 β 3 β β¦ β 9 β 76 β
β 2020-04-01 β 4 β 4 β β¦ β 5 β 30 β
β 2020-04-01 β 5 β 5 β β¦ β 9 β 55 β
β 2020-04-01 β 6 β 6 β β¦ β 4 β 14 β
Finally, getting the top watched movies as follow :
Copy code
-- Display the most popular TV Shows
SELECT Title, max("Days In Top 10") from netflix
where Type='Movie'
GROUP BY Title
ORDER BY max("Days In Top 10") desc
limit 5;
ββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββ
β Title β max("Days In Top 10") β
β varchar β int64 β
ββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββ€
β The Mitchells vs. The Machines β 31 β
β How the Grinch Stole Christmas β 29 β
β Vivo β 29 β
β 365 Days β 28 β
β Despicable Me 2 β 27 β
ββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββ
-- Copy the result to CSV
COPY (
SELECT Title, max("Days In Top 10") from netflix
where Type='TV Show'
GROUP BY Title
ORDER BY max("Days In Top 10") desc
limit 5
) TO 'output.csv' (HEADER, DELIMITER ',');
Whatβs fun is that for both Movies and TV shows, the top 5 mostly include kids show. We all know that kids doesnβt bother to see multiple time the same thingβ¦
Exploring Beyond the Pond
Thatβs it for this tutorial! If you're interested in delving deeper into DuckDB, check out these resources:
- The official DuckDB docs : https://duckdb.org/
- The DuckDB discord : https://discord.com/invite/tcvwpjfnZx
To elevate your experience with DuckDB and scale it to the cloud, or to use a user-friendly web interface, explore MotherDuck! Dive into our end-to-end tutorial to discover all the features we offer.
Additionally, stay tuned to our monthly newsletter and YouTube channel, where we'll continue to share more DuckDB-related content!
Keep quacking, keep coding.
CONTENT
- What is DuckDB?
- Installation
- Workflow with VSCode
- Running Commands and Exiting
- Working with Extensions
- First analytics project
- Exploring Beyond the Pond
Subscribe to MotherDuck Blog