Skip to main content

Hacker News

About the dataset

Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Y Combinator, a startup accelerator, and it's known for its minimalist interface. Users can post stories (such as links to articles), comment on them, and vote them up or down, affecting their visibility.

The dataset sample covers data from January 2022 to November 2022.

Here's the schema :

column_namecolumn_typenullkeydefaultextra
titleVARCHARYES
urlVARCHARYES
textVARCHARYES
deadBOOLEANYES
byVARCHARYES
scoreBIGINTYES
timeBIGINTYES
timestampTIMESTAMPYES
typeVARCHARYES
idBIGINTYES
parentBIGINTYES
descendantsBIGINTYES
rankingBIGINTYES
deletedBOOLEANYES

To read from the sample_data database, please refer to attach the sample datasets database

Example queries

Most shared websites

This query gives use the most top domains being shared on hacker_news.

SELECT
regexp_extract(url, 'http[s]?://([^/]+)/', 1) AS domain,
count(*) AS count
FROM sample_data.hn.hacker_news
WHERE url IS NOT NULL AND regexp_extract(url, 'http[s]?://([^/]+)/', 1) != ''
GROUP BY domain
ORDER BY count DESC
LIMIT 20;

Most Commented Stories Each Month

This query calculates the total number of comments for each story and identifies the most commented story of each month.

WITH ranked_stories AS (
SELECT
title,
'https://news.ycombinator.com/item?id=' || id AS hn_url,
descendants AS nb_comments,
YEAR(timestamp) AS year,
MONTH(timestamp) AS month,
ROW_NUMBER()
OVER (
PARTITION BY YEAR(timestamp), MONTH(timestamp)
ORDER BY descendants DESC
)
AS rn
FROM sample_data.hn.hacker_news
WHERE type = 'story'
)

SELECT
year,
month,
title,
hn_url,
nb_comments
FROM ranked_stories
WHERE rn = 1
ORDER BY year, month;

Most monthly voted stories

This query determines the most voted story for each month.

WITH ranked_stories AS (
SELECT
title,
'https://news.ycombinator.com/item?id=' || id AS hn_url,
score,
YEAR(timestamp) AS year,
MONTH(timestamp) AS month,
ROW_NUMBER()
OVER (PARTITION BY YEAR(timestamp), MONTH(timestamp) ORDER BY score DESC)
AS rn
FROM sample_data.hn.hacker_news
WHERE type = 'story'
)

SELECT
year,
month,
title,
hn_url,
score
FROM ranked_stories
WHERE rn = 1
ORDER BY year, month;

Keyword analysis

This query counts the monthly mentions a the keyword (here duckdb) in the title or text of Hacker News posts, organized by year and month.

SELECT
YEAR(timestamp) AS year,
MONTH(timestamp) AS month,
COUNT(*) AS keyword_mentions
FROM sample_data.hn.hacker_news
WHERE
(title LIKE '%duckdb%' OR text LIKE '%duckdb%')
GROUP BY year, month
ORDER BY year ASC, month ASC;