Exploring StackOverflow with DuckDB on MotherDuck (Part 2)
2023/10/02
BYSubscribe to MotherDuck Blog
From Local to Cloud - Loading our Database into MotherDuck and Querying it with AI Prompts
In theΒ first part of the seriesΒ we looked at the full StackOverflow dump as an interesting dataset to explore with DuckDB. We downloaded the data, converted it to CSV and loaded it into DuckDB and explored tags, users and posts a bit before exporting the database to Parquet. Today we want to move from our local evironment to MotherDuck, where we want to look at importing these parquet files into a database, sharing that database with you and exploring the data with the new AI prompt features.
Getting started with MotherDuck
DuckDB itself is focusing on local, and in-process execution of the analytical database engine. While you can access remote data, itβs downloaded to your machine every time you access the remote files, so you really might want to move your DuckDB execution to where the data lives.
To make it easier to query data that resides in other, remote locations, MotherDuck offers a managed service that allows you to run DuckDB in the cloud.
With MotherDuck you can query the data on your cloud storage transparently as if it was local. But whatβs even better, is you can join and combine local tables transparently with data in tables residing in the cloud. The MotherDuck UI runs a build of DuckDB WASM in your browser, so the operations in the database that can be executed and rendered locally, are executed inside your web-browser.
Here is a picture of the architecture from theΒ documentation:
Motherduck also integrates with Python and all the other access libraries and integrations for DuckDB.
If you already signed up, you can just log-in to MotherDuck, otherwise you can create an account on theΒ homepageΒ (via Google, GitHub or email auth).
Anywhere you can run DuckDB you can use MotherDuck as it connects through an official DuckDB extension which is downloaded & loaded as soon as you connect to a motherduck database throughΒ .open md:
Β or similar commands.
Copy code
.open md:
Attempting to automatically open the SSO authorization page
in your default browser.
1. Please open this link to login into your account:
https://auth.motherduck.com/activate
2. Enter the following code: XXXX-XXXX
Token successfully retrieved β
You can store it as an environment variable to avoid having to log in again:
$ export motherduck_token='eyJhbGciOiJI..._Jfo'
Once you have an account you get aΒ motherduck_token, which you need to connect to MotherDuck. Best to set the token as an environment variable, instead of a database variable, because opening a new database wipes the settings in DuckDB (trust me, I tried).
If you want to explore the MotherDuck UI first, feel free to do so, you can create new databases, upload files and create tables from those. You can run queries and get a nice pivotable, sortable output table with inline frequency charts in the header.
Loading our StackOverflow Data into MotherDuck
You have the option of uploading your local database with single command, which is really neat.
Copy code
CREATE DATABASE remote_database_name FROM CURRENT_DATABASE();
-- or more generally
CREATE DATABASE remote_database_name FROM '<local database name>';
There are only two caveats,Β the local and remote name must be different, otherwise you might get the error below.
Catalog Error: error while importing share: Schema with name <local-database-name> does not exist!
And for the size of our StackOverflow database and the it took quite some time to finish the upload, around 1 hour, sending 15GB of data for our 11GB database.
So we can either create the database on the MotherDuck UI and import our tables from our Parquet files on S3, or upload the database from our local system.
For creating the database and tables from Parquest, we use the web interface or DuckDB on the local machine, connected to MotherDuck. Here are the SQL commands you need to run.
Copy code
create database so;
create table users as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/users.parquet';
-- Run Time (s): real 10.401 user 0.006417 sys 0.003527
describe users;
ββββββββββββββββββ¬ββββββββββββββ
β column_name β column_type β
β varchar β varchar β
ββββββββββββββββββΌββββββββββββββ€
β Id β BIGINT β
β Reputation β BIGINT β
β CreationDate β TIMESTAMP β
β DisplayName β VARCHAR β
β LastAccessDate β TIMESTAMP β
β AboutMe β VARCHAR β
β Views β BIGINT β
β UpVotes β BIGINT β
β DownVotes β BIGINT β
β Id β BIGINT β
β Reputation β BIGINT β
β CreationDate β TIMESTAMP β
β DisplayName β VARCHAR β
β LastAccessDate β TIMESTAMP β
β AboutMe β VARCHAR β
β Views β BIGINT β
β UpVotes β BIGINT β
β DownVotes β BIGINT β
ββββββββββββββββββ΄ββββββββββββββ€
β 18 rows β
ββββββββββββββββββββββββββββββββ
Run Time (s): real 0.032 user 0.026184 sys 0.002383
-- do the same for the other tables
create table comments as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/comments.parquet';
create table posts as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/posts.parquet';
create table votes as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/votes.parquet';
create table badges as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/badges.parquet';
create table post_links as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/post_links.parquet';
create table tags as
from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/tags.parquet';
In the left sidebar of the web interface, now the databaseΒ so
Β and the tables should show up, if not, refresh the page.
Querying the Data with AI π€
A while ago MotherDuck released a newΒ generative AI featureΒ that allows you to
- query your data using natural language
- generate and fix SQL statements and
- describe your data.
As LLMs, GPT and foundational models areΒ close to my heart, I was really excited to try these out.
It works actually already quite well, letβs see how it does on this dataset.
The schema description is a bit uninspiring, I could have seen the same by just looking at the table list. As expected from probabilistic models it returns different results on each run.
Copy code
CALL prompt_schema();
summary = The database contains information related to posts, comments, votes, badges, tags, post links, and users for a platform.
Run Time (s): real 1.476 user 0.001069 sys 0.000778
summary = The database schema represents a collection of data about various aspects of a community platform, including users, posts, comments, tags, badges, votes, and post links.
Ok, letβs try a simple question:Β What are the most popular tags?
Copy code
.mode duckbox
pragma prompt_query('What are the most popular tags?');
ββββββββββββββ¬ββββββββββ
β TagName β Count β
β varchar β int64 β
ββββββββββββββΌββββββββββ€
β javascript β 2479947 β
β python β 2113196 β
β java β 1889767 β
β c# β 1583879 β
β php β 1456271 β
β android β 1400026 β
β html β 1167742 β
β jquery β 1033113 β
β c++ β 789699 β
β css β 787138 β
ββββββββββββββ΄ββββββββββ€
β 10 rows 2 columns β
ββββββββββββββββββββββββ
-- Run Time (s): real 3.763 user 0.124567 sys 0.001716
Nice, what is the SQL it might have used for that (probabilistically it could have been slightly different)?
Copy code
.mode line
call prompt_sql('What are the most popular tags?');
-- query = SELECT TagName, Count FROM tags ORDER BY Count DESC LIMIT 5;
-- Run Time (s): real 2.813 user 2.808042 sys 0.005866
Looks good to me, itβs even smart enough to use the attribute and ordering and limit to get "most popular" tags. The runtime for these AI prompts is between 2 and 10 seconds almost exclusively depending on the processing time of the LLM.
That was pretty easy, so letβs see how it deals a few more involved questions.
- What question has the highest score and what are itβs other attributes?
- Which 5 questions have the most comments, what is the post title and comment count
Copy code
pragma prompt_query("What question has the highest score and what are it's other attributes?");
Id = 11227809
PostTypeId = 1
AcceptedAnswerId = 11227902
CreationDate = 2012-06-27 13:51:36.16
Score = 26903
ViewCount = 1796363
Body =
OwnerUserId = 87234
LastEditorUserId = 87234
LastEditorDisplayName =
LastEditDate = 2022-10-12 18:56:47.68
LastActivityDate = 2023-01-10 04:40:07.12
Title = Why is processing a sorted array faster than processing an unsorted array?
Tags = <java><c++><performance><cpu-architecture><branch-prediction>
AnswerCount = 26
CommentCount = 9
FavoriteCount = 0
CommunityOwnedDate =
ContentLicense = CC BY-SA 4.0
call prompt_sql("What question has the highest score and what are it's other attributes?");
query = SELECT *
FROM posts
WHERE PostTypeId = 1
ORDER BY Score DESC
LIMIT 1;
Run Time (s): real 3.683 user 0.001970 sys 0.000994
Ok, not bad, itβs nice that it detects thatΒ PostTypeId = 1
Β are questions (or known that from its training data on Stackoverflow), now lets go for the next one.
Copy code
.mode duckbox
pragma prompt_query("Which 5 questions have the most comments, what is the post title and comment count");
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββ
β Title β comment_count β
β varchar β int64 β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββ€
β UIImageView Frame Doesnt Reflect Constraints β 108 β
β Is it possible to use adb commands to click on a view by finding its ID? β 102 β
β How to create a new web character symbol recognizable by html/javascript? β 100 β
β Why isnt my CSS3 animation smooth in Google Chrome (but very smooth on ot β 89 β
β Heap Gives Page Fault β 89 β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββ
Run Time (s): real 19.695 user 2.406446 sys 0.018353
.mode line
call prompt_sql("Which 5 questions have the most comments, what is the post title and comment count");
query = SELECT p.Title, COUNT(c.Id) AS comment_count
FROM posts p
JOIN comments c ON p.Id = c.PostId AND p.PostTypeId = 1
GROUP BY p.Title
ORDER BY comment_count DESC
LIMIT 5;
Run Time (s): real 4.795 user 0.002301 sys 0.001346
This is what it looks like in the MotherDuck UI:
Actually the comment count is a column on the posts table, so it could have used that, letβs see if we can make it use only the one table.
Copy code
call prompt_sql("System: No joins! User: Which 5 questions have the most comments, what is the post title and comment count");
query = SELECT Title, CommentCount
FROM posts
WHERE PostTypeId = 1
ORDER BY CommentCount DESC
LIMIT 5;
Run Time (s): real 3.587 user 0.001733 sys 0.000865
Nice, that worked!
You can also useΒ prompt_fixup
Β to fix the SQL for a query, e.g. the infamous, "I forgot GROUP BY".
Copy code
call prompt_fixup("select postTypeId, count(*) from posts");
query = SELECT postTypeId, COUNT(*) FROM posts GROUP BY postTypeId
Run Time (s): real 12.006 user 0.004266 sys 0.002980
Or fixing a wrong join column name, or two.
Copy code
call prompt_fixup("select count(*) from posts join users on posts.userId = users.userId");
query = SELECT COUNT(*) FROM posts JOIN users ON posts.OwnerUserId = users.Id
Run Time (s): real 2.378 user 0.001770 sys 0.001067
Thatβs a really neat feature, hope they use it in their UI when your query would encounter an error with an explain in the background.
Data Sharing
ToΒ make this data available to others, we can use theΒ CREATE SHARE
Β command.
If we run it, we will get a shareable link, that others can use withΒ ATTACH
Β toΒ attach our database. Currently it takes about a minute to create the share, but in the future it will be a zero-copy operation.
Copy code
-- CREATE SHARE <share name> [FROM <database name>];
CREATE SHARE so_2023_05 FROM so;
-- share_url = md:_share/so/373594a2-06f7-4c33-814e-cf59028482ca
-- Run Time (s): real 63.335 user 0.014849 sys 0.013110
-- ATTACH '<share URL>' [AS <database name>];
ATTACH 'md:_share/so/373594a2-06f7-4c33-814e-cf59028482ca' AS so;
-- show the contents of the share
DESCRIBE SHARE "so_2023_05";
LIST SHARES;
-- After making changes to the shared database, you need to update the share
UPDATE SHARE "so_2023_05";
Today we explored the MotherDuck interface, created a database and populated it with tables using Parquet data on S3. That worked really well and you should be able to do this with your own data easily.
Then we tried the new AI prompts on MotherDuck, which work quite well, of course not 100% but often good enough to get a starting point or learn something new. Given the amount of SQL information that was used to the train the LLMs plus the additional schema information, that is not surprising. SQL (derived from structured english query language SEQUEL) is just another langauge for the LLM to translate into, much like Korean or Klingon.
So while youβre waiting for the third part of the blog series, you can attach our share (which is public) and run your own queries on it.
In the third part we want to connect to our StackOverflow database on MotherDuck using Python and explore some more ways accessing, querying and visualizing our data.
Please share any interesting queries or issues on theΒ MotherDuck Slack channel.
CONTENT
- Getting started with MotherDuck
- Loading our StackOverflow Data into MotherDuck
Subscribe to MotherDuck Blog