MotherDuck and DuckDB Tutorial
In this tutorial, you will go through a full end-to-end example on how to use MotherDuck and DuckDB, push and share data, take advantage of hybrid query execution and query data using SQL through the MotherDuck UI or DuckDB CLI.
MotherDuck currently supports DuckDB 1.1.3 and it is compatible with any client version 0.10.2 through 1.1.3
Prerequisites
For this tutorial, you will need:
- A MotherDuck account. Examples will be covering using the MotherDuck UI and the DuckDB CLI
- A csv dataset. You can download our example csv dataset hosted on our public AWS S3 bucket here. Feel free to use your own csv/parquet data!
- DuckDB CLI for running a hybrid query, check how to install this one here.
Understanding SQL is valuable, but no other programming experience is needed.
Running your first query
- UI
- CLI
Query from a shared database
Head over to the MotherDuck UI and after logging in, go through the notebook and add a new cell.
Before playing with the dataset we just downloaded, let's run a couple simple queries on the shared sample database. This database contains a series of MotherDuck's public datasets and it's auto-attached for each user, meaning it's accessible direclty within your MotherDuck session without any additional setup.
We will query the NYC 311 dataset first. This dataset contains over thirty million complaints citizens have filed with the New York City government. We'll select several columns and look at the compliants filed over a few days to demonstrate the Column Explorer feature of the MotherDuck UI.
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
Take a moment to explore the shape of this data with the Column Explorer.
For the remainder of this tutorial, we'll focus on the NYC taxi data and perform aggregation queries representative of the types of queries often performed in analytics databases. We will first get the average fare based on the number of passengers. The source dataset covers data for the whole month of November 2022.
SELECT
passenger_count,
avg(total_amount)
-- reading from shared sample database
FROM sample_data.nyc.taxi
GROUP BY passenger_count
ORDER by passenger_count;
The sample_data
database is auto-attached but for any other shared database you would like to read, you would need to use the ATTACH
statement. Read more about querying a shared MotherDuck database here.
You can also run the same queries using the DuckDB CLI. You just need to connect to MotherDuck first using the ATTACH 'md:';
command. You will be prompted to authenticate if there is no motherduck_token
found in your environment.
ATTACH 'md:';
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
SELECT
passenger_count,
avg(total_amount)
FROM sample_data.nyc.taxi
GROUP BY passenger_count
ORDER by passenger_count;
Query from S3
Our shared sample database is great to play with but you probably want to use your own data on AWS S3. Let's see how to do that.
The sample database source data is actually available on our public AWS S3 bucket. Let's run the exact same query but instead of pointing to a MotherDuck table, we will point to a parquet file on S3.
For a secured bucket, we need to pass the AWS credentials - check authenticating to S3 for more information.
Here's the updated query while reading from S3:
SELECT
passenger_count,
avg(total_amount)
-- reading from AWS S3 parquet files
FROM read_parquet('s3://us-prd-motherduck-open-datasets/nyc_taxi/parquet/yellow_cab_nyc_2022_11.parquet')
GROUP BY passenger_count
ORDER by passenger_count;
If your data is a csv, you can use the read_csv_auto()
method instead of the read_parquet()
. Similarly, for json it's read_json_auto()
.
Loading your dataset
- UI
- CLI
Head over to the button "ADD FILES" and select the dataset you just downloaded.
A cell will be automatically created with the following code:
CREATE TABLE my_db.popular_currency_rate_dollar AS SELECT * FROM read_csv_auto(['popular_currency_rate_dollar.csv']);
Run this cell in order to create a table that would contain the csv's data. You can also rename the table by changing the name after the CREATE TABLE
statement.
You can now run queries on this MotherDuck table. For example, let's see the top 10 rows of the table:
FROM my_db.popular_currency_rate_dollar limit 10;
ATTACH 'md:';
USE my_db;
CREATE TABLE my_db.popular_currency_rate_dollar AS SELECT * FROM read_csv_auto(['./popular_currency_rate_dollar.csv']);
Don't forget to adapt the file path to the actual local path where you downloaded the csv
file.
Running a hybrid query
To experience hybrid query execution, we'll need to use the DuckDB CLI. With the local csv popular_currency_rate_dollar.csv
which you should have downloaded in the steps provided above, we will run a query that combines local data with data from the sample_data cloud database.
Our file, popular_currency_rate_dollar.csv
, contains currency rates against the U.S. dollar over a few days. Let's utilize the same query we used above to determine the average fare. However, instead of presenting the results in dollars, we're interested in seeing them in British Pounds (GBP).
- CLI
SELECT
cr.currency_code,
t.passenger_count,
AVG(t.total_amount * cr.exchange_rate) as average_converted_amount
FROM
sample_data.nyc.taxi t
CROSS JOIN
-- reading from local csv, adapt the path where you downloaded the file
(SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY
cr.currency_code, t.passenger_count
ORDER by t.passenger_count ASC;
You can see the execution through the EXPLAIN method :
EXPLAIN SELECT
cr.currency_code,
t.passenger_count,
AVG(t.total_amount * cr.exchange_rate) as average_converted_amount
FROM
sample_data.nyc.taxi t
CROSS JOIN
-- reading from local csv, adapt the path where you downloaded the file
(SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY
cr.currency_code, t.passenger_count
ORDER by t.passenger_count ASC;
┌───────────────┬─────────────────┬──────────────────────────┐
│ currency_code │ passenger_count │ average_converted_amount │
│ varchar │ double │ double │
├───────────────┼─────────────────┼──────────────────────────┤
│ GBP │ 0.0 │ 15.932329767528195 │
│ GBP │ 1.0 │ 16.68244975354177 │
│ GBP │ 2.0 │ 18.939035313855573 │
│ GBP │ 3.0 │ 18.24300645274264 │
│ GBP │ 4.0 │ 19.073578370153896 │
│ GBP │ 5.0 │ 16.526609827337477 │
│ GBP │ 6.0 │ 16.91326606429221 │
│ GBP │ 7.0 │ 59.94501665999999 │
│ GBP │ 8.0 │ 48.38727310588234 │
│ GBP │ 9.0 │ 59.48654116 │
│ GBP │ │ 23.031504804070522 │
├───────────────┴─────────────────┴──────────────────────────┤
│ 11 rows 3 columns │
└──────────────────────────────────────── ────────────────────┘
Each operation is followed by either (L)
= Local or (R)
= Remote.
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌─── ────────────────────────┐
│ DOWNLOAD_SOURCE (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BATCH_DOWNLOAD_SINK (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ORDER_BY (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ t.passenger_count ASC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
│ avg(#2) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ currency_code │
│ passenger_count │
│ (total_amount * │
│ exchange_rate) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CROSS_PRODUCT (R) ├──────────────┐
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN (R) ││ UPLOAD_SOURCE (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ yellow_cab_nyc_2022_11 ││ bridge_id: 2 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ passenger_count ││ │
│ total_amount ││ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ EC: 3252717 ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴───── ────────┐
│ UPLOAD_SINK (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 2 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (currency_code = 'GBP') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 132 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_CSV_AUTO (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ LOCAL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ currency_code │
│ exchange_rate │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
└───────────────────────────┘
Finally, let's create a database with a table based on this query and share the result.
CREATE DATABASE holiday_budget;
CREATE TABLE holiday_budget.taxi_nyc_fare AS
SELECT
cr.currency_code,
t.passenger_count,
AVG(t.total_amount * cr.exchange_rate) as average_converted_amount
FROM
sample_data.nyc.taxi t
CROSS JOIN
-- reading from local csv, adapt the path where you downloaded the file
(SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY
cr.currency_code, t.passenger_count
ORDER by t.passenger_count ASC;
Sharing your database
Now that you have a new dataset and a database, you can share it with your colleagues. To do so, we'll create a share which creates a point-in-time snapshot of the database.
- UI
- SQL
Click on the drop down menu next to the database you want to share:
You will be prompted with a window to create a share.
The syntax to create a share visible to everyone in your Organization is CREATE SHARE <share name> from <database name>
.
CREATE SHARE duck_holiday_budget FROM holiday_budget (ACCESS ORGANIZATION , VISIBILITY DISCOVERABLE);
┌───────────────────────────────────────────────────────────────┐
│ share_url │
│ varchar │
├───────────────────────────────────────────────────────────────┤
│ md:_share/holiday_budget/b556630d-74f1-435c-9459-cfb87d349cb3 │
└───────────────────────────────────────────────────────────────┘
Now everyone in your Organization will see this share in the UI under "Shared with me". They simply need to press "Attach" to start querying!
Learn more about sharing in MotherDuck here.
Going further
Try it with your own data! Look at our supported integrations and keep coding, keep quacking.