Quickstart Challenge
Use Hugging Face datasets and MotherDuck to enrich and prepare your dataset for your project.
Check out the example we've included below to see how you could explore endangered species, and a couple downstream ideas. Feel free to explore your own ideas and be creative!
- While the examples below cover Python and SQL, MotherDuck supports multiple clients, such as Node.JS, Golang, Java, and Rust. More information on these clients is available here.
- You can also use the MotherDuck Web UI to explore data, visualize your tables with Column Explorer, and take advantage of MotherDuck’s AI SQL error fixer, FixIt.
Getting Started with MotherDuck in Python
Head to https://motherduck.com, and create an account.
- Every new account receives a 30-day free trial of the MotherDuck Standard Plan, with no credit card required.
- After the end of your Standard Plan free trial, your account will automatically move to the MotherDuck Free Plan, no action needed on your part.
How to get started with MotherDuck in Python:
- In the MotherDuck UI, grab an access token to connect with Python.
- Next, run the following in Python:
Copy code
!pip install duckdb==1.0.0
import duckdb
# Connect to MotherDuck
con = duckdb.connect('md:?motherduck_token=<your_motherduck_token>')
# Run a sample query using MotherDuck
res = con.execute("""
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
""")
# Fetch MotherDuck query results to pandas df
df = res.df()
Reading a dataset from Hugging Face using Python:
Copy code
# Run a query on Hugging Face data, using MotherDuck
hf_query = con.execute("""
SELECT *
FROM read_parquet('hf://datasets/<user>/<dataset-name>/data/*.parquet'));
""")
Read more about using Hugging Face with DuckDB and MotherDuck in the documentation here.
Example Using SQL
Start with this dataset of 150k endangered species:
- The dataset has very little information about the species. So it would be ideal to enrich with data from Wikipedia to enable downstream tasks: https://huggingface.co/datasets/wikimedia/wikipedia
Copy code
-- Load endangered animal species dataset from Hugging Face (hf)
CREATE OR REPLACE TABLE animals AS (SELECT * FROM read_parquet(
'hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet'));
-- Load wiki en dataset from hf (this may take a few minutes)
CREATE OR REPLACE TABLE wiki AS (SELECT * FROM read_parquet('hf://datasets/wikimedia/wikipedia/20231101.en/*'));
-- Join both datasets, and create a table in MotherDuck
CREATE OR REPLACE TABLE animals_wiki AS (SELECT * FROM animals LEFT JOIN wiki ON wiki.title = animals.scientific_name);
-- Create a SHARE of your database, to share it with others in MotherDuck
(Learn more about shares here: https://motherduck.com/docs/key-tasks/sharing-data/sharing-overview)
CREATE SHARE hacknight FROM my_db (ACCESS UNRESTRICTED);
Sample Analysis Ideas with SQL
Copy code
-- Show a sample of endangered animal species, including their wikipedia info
SELECT * FROM animals_wiki LIMIT 100;
-- Check how many animal species have a wikipedia entry
SELECT count(*) FROM animals_wiki WHERE text IS NOT NULL;
-- Check the distribution of endangerment categories across all species
SELECT category, count(*) AS cnt FROM animals_wiki
GROUP BY category
ORDER BY cnt DESC;
-- Check how many wikipedia articles contain the word “endangered” and which endangerment category those animals are in
SELECT category, count(*) AS cnt FROM animals_wiki
WHERE text IS NOT NULL AND text LIKE '%endangered%'
GROUP BY category
ORDER BY cnt DESC;
Downstream Task Ideas
- How many endangered duck species are there?
- DuckDB versions are named after duck species. Which DuckDB version has the most endangered duck as its namesake?
- Help Wikipedia editors keep endangerment information in articles up-to-date.
- Try to find most relevant Wikipedia article for animals that didn’t have an exact match based on the scientific_name.
- Enrich endangered species data with structured data extracted from Wikipedia articles.