StackOverflow Data
About the dataset
Stack Overflow is a website dedicated to providing professional and enthusiast programmers a platform to learn and share knowledge. It features questions and answers on a wide range of topics in computer programming and is renowned for its community-driven approach. Users can ask questions, provide answers, vote on questions and answers, and earn reputation points and badges for their contributions.
The dataset includes a complete data dump up to May 2023, covering posts, comments, users, badges, and related metrics.
You can read more about the dataset in our blog series part 1 and part 2.
How to query the dataset
As this dataset is quite large, it's not part of the sample_data database. Instead, you can find it as a dedicated shared database.
To attach it to your workspace, you can use the following command:
ATTACH 'md:_share/stackoverflow/6c318917-6888-425a-bea1-5860c29947e5' AS stackoverflow;
Schema
Badges
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| UserId | BIGINT | YES | |||
| Name | VARCHAR | YES | |||
| Date | TIMESTAMP | YES | |||
| Class | BIGINT | YES | |||
| TagBased | BOOLEAN | YES |
Comments
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| PostId | BIGINT | YES | |||
| Score | BIGINT | YES | |||
| Text | VARCHAR | YES | |||
| CreationDate | TIMESTAMP | YES | |||
| UserId | BIGINT | YES | |||
| ContentLicense | VARCHAR | YES |
Post Links
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| CreationDate | TIMESTAMP | YES | |||
| PostId | BIGINT | YES | |||
| RelatedPostId | BIGINT | YES | |||
| LinkTypeId | BIGINT | YES |
Posts
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| PostTypeId | BIGINT | YES | |||
| AcceptedAnswerId | BIGINT | YES | |||
| CreationDate | TIMESTAMP | YES | |||
| Score | BIGINT | YES | |||
| ViewCount | BIGINT | YES | |||
| Body | VARCHAR | YES | |||
| OwnerUserId | BIGINT | YES | |||
| LastEditorUserId | BIGINT | YES | |||
| LastEditorDisplayName | VARCHAR | YES | |||
| LastEditDate | TIMESTAMP | YES | |||
| LastActivityDate | TIMESTAMP | YES | |||
| Title | VARCHAR | YES | |||
| Tags | VARCHAR | YES | |||
| AnswerCount | BIGINT | YES | |||
| CommentCount | BIGINT | YES | |||
| FavoriteCount | BIGINT | YES | |||
| CommunityOwnedDate | TIMESTAMP | YES | |||
| ContentLicense | VARCHAR | YES |
Tags
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| TagName | VARCHAR | YES | |||
| Count | BIGINT | YES | |||
| ExcerptPostId | BIGINT | YES | |||
| WikiPostId | BIGINT | YES |
Votes
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| PostId | BIGINT | YES | |||
| VoteTypeId | BIGINT | YES | |||
| CreationDate | TIMESTAMP | YES |
Users
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| Id | BIGINT | YES | |||
| Reputation | BIGINT | YES | |||
| CreationDate | TIMESTAMP | YES | |||
| DisplayName | VARCHAR | YES | |||
| LastAccessDate | TIMESTAMP | YES | |||
| AboutMe | VARCHAR | YES | |||
| Views | BIGINT | YES | |||
| UpVotes | BIGINT | YES | |||
| DownVotes | BIGINT | YES |
Examples queries
The following queries assume that the current database connected is stackoverflow. Run use stackoverflow to switch to it.
List the top 5 posts that received the most votes
SELECT posts.Title, COUNT(votes.Id) AS VoteCount
FROM posts
JOIN votes ON posts.Id = votes.PostId
GROUP BY posts.Title
ORDER BY VoteCount DESC
LIMIT 5;
Find the top 5 posts with the highest view count:
SELECT Title, ViewCount
FROM posts
ORDER BY ViewCount DESC
LIMIT 5;