StackOverflow Data
About the dataset
Stack Overflow is a website dedicated to providing professional and enthusiast programmers a platform to learn and share knowledge. It features questions and answers on a wide range of topics in computer programming and is renowned for its community-driven approach. Users can ask questions, provide answers, vote on questions and answers, and earn reputation points and badges for their contributions.
The dataset includes a complete data dump up to May 2023, covering posts, comments, users, badges, and related metrics.
You can read more about the dataset in our blog series part 1 and part 2.
How to query the dataset
As this dataset is quite large, it's not part of the sample_data
database. Instead, you can find it as a dedicated shared database.
To attach it to your workspace, you can use the following command:
ATTACH 'md:_share/stackoverflow/6c318917-6888-425a-bea1-5860c29947e5' AS stackoverflow;
Schema
Badges
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
UserId | BIGINT | YES | |||
Name | VARCHAR | YES | |||
Date | TIMESTAMP | YES | |||
Class | BIGINT | YES | |||
TagBased | BOOLEAN | YES |
Comments
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
PostId | BIGINT | YES | |||
Score | BIGINT | YES | |||
Text | VARCHAR | YES | |||
CreationDate | TIMESTAMP | YES | |||
UserId | BIGINT | YES | |||
ContentLicense | VARCHAR | YES |
Post Links
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
CreationDate | TIMESTAMP | YES | |||
PostId | BIGINT | YES | |||
RelatedPostId | BIGINT | YES | |||
LinkTypeId | BIGINT | YES |
Posts
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
PostTypeId | BIGINT | YES | |||
AcceptedAnswerId | BIGINT | YES | |||
CreationDate | TIMESTAMP | YES | |||
Score | BIGINT | YES | |||
ViewCount | BIGINT | YES | |||
Body | VARCHAR | YES | |||
OwnerUserId | BIGINT | YES | |||
LastEditorUserId | BIGINT | YES | |||
LastEditorDisplayName | VARCHAR | YES | |||
LastEditDate | TIMESTAMP | YES | |||
LastActivityDate | TIMESTAMP | YES | |||
Title | VARCHAR | YES | |||
Tags | VARCHAR | YES | |||
AnswerCount | BIGINT | YES | |||
CommentCount | BIGINT | YES | |||
FavoriteCount | BIGINT | YES | |||
CommunityOwnedDate | TIMESTAMP | YES | |||
ContentLicense | VARCHAR | YES |
Tags
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
TagName | VARCHAR | YES | |||
Count | BIGINT | YES | |||
ExcerptPostId | BIGINT | YES | |||
WikiPostId | BIGINT | YES |
Votes
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
PostId | BIGINT | YES | |||
VoteTypeId | BIGINT | YES | |||
CreationDate | TIMESTAMP | YES |
Users
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Id | BIGINT | YES | |||
Reputation | BIGINT | YES | |||
CreationDate | TIMESTAMP | YES | |||
DisplayName | VARCHAR | YES | |||
LastAccessDate | TIMESTAMP | YES | |||
AboutMe | VARCHAR | YES | |||
Views | BIGINT | YES | |||
UpVotes | BIGINT | YES | |||
DownVotes | BIGINT | YES |
Examples queries
The following queries assume that the current database connected is stackoverflow
. Run use stackoverflow
to switch to it.
List the top 5 posts that received the most votes
SELECT posts.Title, COUNT(votes.Id) AS VoteCount
FROM posts
JOIN votes ON posts.Id = votes.PostId
GROUP BY posts.Title
ORDER BY VoteCount DESC
LIMIT 5;
Find the top 5 posts with the highest view count:
SELECT Title, ViewCount
FROM posts
ORDER BY ViewCount DESC
LIMIT 5;