PyPi Data
About the dataset
PyPi is the Python Package Index, a repository of software packages for the Python programming language. It is a central repository that allows users to find and install software developed and shared by the Python community.
The dataset includes information about packages, releases, and downloads on the duckdb
python package.
It's refreshed weekly and you can visit the dashboard here
How to query the dataset
A dedicated shared database is maintained to query the dataset. To attach it to your workspace, you can use the following command:
ATTACH 'md:_share/duckdb_stats/1eb684bf-faff-4860-8e7d-92af4ff9a410' AS duckdb_stats;
Schema
pypi_file_downloads
This table contains the raw data. Each row represents a download from PyPi.
column_name | column_type | null |
---|---|---|
timestamp | TIMESTAMP | YES |
country_code | VARCHAR | YES |
url | VARCHAR | YES |
project | VARCHAR | YES |
file | STRUCT(filename VARCHAR, project VARCHAR, "version" VARCHAR, "type" VARCHAR) | YES |
details | STRUCT("installer" STRUCT("name" VARCHAR, "version" VARCHAR), "python" VARCHAR, "implementation" STRUCT("name" VARCHAR, "version" VARCHAR), "distro" STRUCT("name" VARCHAR, "version" VARCHAR, "id" VARCHAR, "libc" STRUCT("lib" VARCHAR, "version" VARCHAR)), "system" STRUCT("name" VARCHAR, "release" VARCHAR), "cpu" VARCHAR, "openssl_version" VARCHAR, "setuptools_version" VARCHAR, "rustc_version" VARCHAR, "ci" BOOLEAN) | YES |
tls_protocol | VARCHAR | YES |
tls_cipher | VARCHAR | YES |
pypi_daily_stats
This table is a daily aggregation of the raw data. It contains the following columns:
column_name | column_type | null |
---|---|---|
load_id | VARCHAR | YES |
download_date | DATE | YES |
system_name | VARCHAR | YES |
system_release | VARCHAR | YES |
version | VARCHAR | YES |
project | VARCHAR | YES |
country_code | VARCHAR | YES |
cpu | VARCHAR | YES |
python_version | VARCHAR | YES |
daily_download_sum | BIGINT | YES |
Examples queries
The following queries assume that the current database connected is duckdb_stats
. Run use duckdb_stats
to switch to it.
Get weekly download stats
SELECT
DATE_TRUNC('week', download_date) AS week_start_date,
version,
country_code,
python_version,
SUM(daily_download_sum) AS weekly_download_sum
FROM
duckdb_stats.main.pypi_daily_stats
GROUP BY
ALL
ORDER BY
week_start_date