MotherDuck and DuckDB Tutorial

In this tutorial, you will go through a full end-to-end example of how to use MotherDuck and DuckDB: pushing and sharing data, taking advantage of hybrid query execution, and querying data with SQL through the MotherDuck UI or the DuckDB CLI.

Prerequisites

For this tutorial, you will need:

  • A MotherDuck account. Examples cover both the MotherDuck UI and the DuckDB CLI
  • A CSV dataset. You can download our example CSV dataset hosted on our public AWS S3 bucket here. Feel free to use your own CSV/Parquet data!
  • The DuckDB CLI for running a hybrid query; see how to install it here

Understanding SQL is valuable, but no other programming experience is needed.
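Once the DuckDB CLI is installed, connecting it to your MotherDuck account takes a single statement. This is a sketch of the typical flow; it opens a browser window for authentication unless a motherduck_token is already configured in your environment:

```sql
-- Run inside the DuckDB CLI to connect to MotherDuck
ATTACH 'md:';
-- List the databases now visible in your session
SHOW DATABASES;
```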

Running your first query

Query from a shared database

Head over to the MotherDuck UI and after logging in, go through the notebook and add a new cell.

Before playing with the dataset we just downloaded, let's run a couple of simple queries on the shared sample database. This database contains a series of MotherDuck's public datasets and is auto-attached for each user, meaning it's accessible directly within your MotherDuck session without any additional setup.

We will query the NYC 311 dataset first. This dataset contains over thirty million complaints citizens have filed with the New York City government. We'll select several columns and look at the complaints filed over a few days to demonstrate the Column Explorer feature of the MotherDuck UI.

SELECT
    created_date, agency_name, complaint_type,
    descriptor, incident_address, resolution_description
FROM sample_data.nyc.service_requests
WHERE created_date >= '2022-03-27'
  AND created_date <= '2022-03-31';

UI Capability

Take a moment to explore the shape of this data with the Column Explorer.

For the remainder of this tutorial, we'll focus on the NYC taxi data and perform aggregation queries representative of the types of queries often performed in analytics databases. We will first get the average fare based on the number of passengers. The source dataset covers data for the whole month of November 2022.

SELECT
    passenger_count,
    avg(total_amount)
-- reading from shared sample database
FROM sample_data.nyc.taxi
GROUP BY passenger_count
ORDER BY passenger_count;

Query Result

info

The sample_data database is auto-attached, but to read any other shared database you need to use the ATTACH statement. Read more about querying a shared MotherDuck database here.
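For example, attaching another shared database from its share URL might look like this (the URL and table name below are placeholders, not a real share):

```sql
-- Attach a shared database by its share URL (placeholder shown)
ATTACH 'md:_share/<database_name>/<share-id>' AS my_shared_db;
-- Then query it like any other attached database
SELECT * FROM my_shared_db.some_table LIMIT 5;
```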

Query from S3

Our shared sample database is great to play with but you probably want to use your own data on AWS S3. Let's see how to do that.

The sample database's source data is also available on our public AWS S3 bucket. Let's run the exact same query, but instead of pointing to a MotherDuck table, we will point to a Parquet file on S3. For a public AWS S3 bucket, we only need to specify the AWS region as follows:

SET s3_region='us-east-1';

For a secured bucket, we need to pass AWS credentials; check authenticating to S3 for more information.
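For instance, credentials for a private bucket can be supplied with the corresponding S3 settings; a sketch, where all values are placeholders:

```sql
SET s3_region='us-east-1';
SET s3_access_key_id='<your-access-key-id>';
SET s3_secret_access_key='<your-secret-access-key>';
-- only needed when using temporary credentials
SET s3_session_token='<your-session-token>';
```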

Here's the updated query while reading from S3:

SELECT
    passenger_count,
    avg(total_amount)
-- reading from AWS S3 parquet files
FROM read_parquet('s3://us-prd-motherduck-open-datasets/nyc_taxi/parquet/yellow_cab_nyc_2022_11.parquet')
GROUP BY passenger_count
ORDER BY passenger_count;
info

If your data is a CSV, you can use the read_csv_auto() function instead of read_parquet(). Similarly, for JSON use read_json_auto().
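For example, reading similar data from a CSV file instead of Parquet would look like this (the bucket path below is a placeholder):

```sql
SELECT
    passenger_count,
    avg(total_amount)
-- read_csv_auto infers column names and types from the file
FROM read_csv_auto('s3://<your-bucket>/nyc_taxi/yellow_cab_nyc_2022_11.csv')
GROUP BY passenger_count
ORDER BY passenger_count;
```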

Loading your dataset

Head over to the button "ADD FILES" and select the dataset you just downloaded.

Add file

A cell will be automatically created with the following code:

create statement

CREATE TABLE my_db.popular_currency_rate_dollar AS SELECT * FROM read_csv_auto(['popular_currency_rate_dollar.csv']);

Run this cell to create a table containing the CSV's data. You can also rename the table by changing the name after the CREATE TABLE statement.
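If you want to rename the table after it has already been created, ALTER TABLE works as well. A quick sketch (the new name is just an example; we rename it back so the rest of the tutorial still applies):

```sql
-- Rename the table (example name)
ALTER TABLE my_db.popular_currency_rate_dollar RENAME TO currency_rates;
-- Rename it back
ALTER TABLE my_db.currency_rates RENAME TO popular_currency_rate_dollar;
```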

You can now run queries on this MotherDuck table. For example, let's see the top 10 rows of the table:

FROM my_db.popular_currency_rate_dollar LIMIT 10;

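The FROM-first syntax above is DuckDB shorthand for a regular SELECT *. You can also inspect the table's schema with DESCRIBE:

```sql
-- Equivalent to: FROM my_db.popular_currency_rate_dollar LIMIT 10;
SELECT * FROM my_db.popular_currency_rate_dollar LIMIT 10;
-- Inspect column names and types
DESCRIBE my_db.popular_currency_rate_dollar;
```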

Running a hybrid query

To experience hybrid query execution, we'll use the DuckDB CLI. With the local CSV popular_currency_rate_dollar.csv, which you downloaded in the steps above, we will run a query that combines local data with data from the sample_data cloud database.

Our file, popular_currency_rate_dollar.csv, contains currency rates against the U.S. dollar over a few days. Let's reuse the query from above to determine the average fare, but instead of presenting the results in dollars, we'll convert them to British Pounds (GBP).

SELECT
    cr.currency_code,
    t.passenger_count,
    AVG(t.total_amount * cr.exchange_rate) AS average_converted_amount
FROM sample_data.nyc.taxi t
CROSS JOIN
    -- reading from local csv, adapt the path where you downloaded the file
    (SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY cr.currency_code, t.passenger_count
ORDER BY t.passenger_count ASC;

┌───────────────┬─────────────────┬──────────────────────────┐
│ currency_code │ passenger_count │ average_converted_amount │
│    varchar    │     double      │          double          │
├───────────────┼─────────────────┼──────────────────────────┤
│ GBP           │             0.0 │       15.932329767528195 │
│ GBP           │             1.0 │        16.68244975354177 │
│ GBP           │             2.0 │       18.939035313855573 │
│ GBP           │             3.0 │        18.24300645274264 │
│ GBP           │             4.0 │       19.073578370153896 │
│ GBP           │             5.0 │       16.526609827337477 │
│ GBP           │             6.0 │        16.91326606429221 │
│ GBP           │             7.0 │        59.94501665999999 │
│ GBP           │             8.0 │        48.38727310588234 │
│ GBP           │             9.0 │              59.48654116 │
│ GBP           │                 │       23.031504804070522 │
├───────────────┴─────────────────┴──────────────────────────┤
│ 11 rows                                          3 columns │
└────────────────────────────────────────────────────────────┘

You can inspect how the query is executed with the EXPLAIN statement:

EXPLAIN SELECT
    cr.currency_code,
    t.passenger_count,
    AVG(t.total_amount * cr.exchange_rate) AS average_converted_amount
FROM sample_data.nyc.taxi t
CROSS JOIN
    -- reading from local csv, adapt the path where you downloaded the file
    (SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY cr.currency_code, t.passenger_count
ORDER BY t.passenger_count ASC;

Each operation in the plan is annotated with (L) for local execution or (R) for remote execution.

┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ DOWNLOAD_SOURCE (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BATCH_DOWNLOAD_SINK (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ORDER_BY (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ t.passenger_count ASC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
│ avg(#2) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ currency_code │
│ passenger_count │
│ (total_amount * │
│ exchange_rate) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CROSS_PRODUCT (R) ├──────────────┐
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN (R) ││ UPLOAD_SOURCE (R) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ yellow_cab_nyc_2022_11 ││ bridge_id: 2 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ passenger_count ││ │
│ total_amount ││ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ EC: 3252717 ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UPLOAD_SINK (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ bridge_id: 2 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (currency_code = 'GBP') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 132 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_CSV_AUTO (L) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ LOCAL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ currency_code │
│ exchange_rate │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
└───────────────────────────┘

Finally, let's create a database with a table based on this query and share the result.

CREATE DATABASE holiday_budget;
CREATE TABLE holiday_budget.taxi_nyc_fare AS
SELECT
    cr.currency_code,
    t.passenger_count,
    AVG(t.total_amount * cr.exchange_rate) AS average_converted_amount
FROM sample_data.nyc.taxi t
CROSS JOIN
    -- reading from local csv, adapt the path where you downloaded the file
    (SELECT * FROM read_csv_auto('./popular_currency_rate_dollar.csv')) cr
WHERE cr.currency_code = 'GBP'
GROUP BY cr.currency_code, t.passenger_count
ORDER BY t.passenger_count ASC;

Sharing your database

Now that you have a new dataset and a database, you can share it with your colleagues. To do so, we'll create a share which creates a point-in-time snapshot of the database.

Click on the drop-down menu next to the database you want to share:

share 1

You will be prompted with a window to create a share.

share 2

share 3

This will return a URL that you can share. Anyone wanting to query this database will need to run an ATTACH statement. More info here.
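Shares can also be created and consumed from SQL. A sketch of both sides, where the share name is an example and the URL is a placeholder for the one returned by CREATE SHARE:

```sql
-- Producer: create a share from the database
CREATE SHARE holiday_budget_share FROM holiday_budget;

-- Consumer: attach the share using the URL returned above
-- (the URL below is a placeholder)
ATTACH 'md:_share/holiday_budget/<share-id>' AS holiday_budget;
SELECT * FROM holiday_budget.taxi_nyc_fare;
```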

Going further

Try it with your own data! Look at our supported integrations and keep coding, keep quacking.