Analyze Your X (Twitter) Data with Node.js and DuckDB
2023/11/08 - 7 min read
BYSubscribe to MotherDuck Blog
Would you like to know which of your X (the artist formerly known as Twitter) posts received the most favorites or reposts? How many times have you replied? In which months were you the most active? What were the first ten posts you wrote? In this tutorial, we will answer all these questions and give you the tools to discover even more!
Not very long ago, X started offering a way for its users to download an archive of their data. When you request and receive your X data archive, you get a static web app. It lets you browse and search your tweets and gives you some basic stats. Your archive not only has tweets and associated media but also includes followers, direct messages, likes, ad impressions and engagements, personalization data, and much more.
Unfortunately, the static web app only gives you a few options to view and discover your data. You're on your own to dive deeper. And, while the archive includes a "readme" document that describes the data, the data itself is not in a format you can easily query and analyze. Node.js and DuckDB to the rescue!
Node.js and DuckDB analytics project overview
Before you dive into the steps, it’s helpful to understand how and why these technologies work well together. Node.js is a minimal software development framework based on the JavaScript language. Since the X archive data is in the form of JavaScript code, Node.js is a good choice for converting the JavaScript code into a data format that is easier to consume and query. You will use the Node.js application to convert your X posts into a comma-separated values (CSV) data file. You will also learn how Node.js can be used to automate DuckDB to execute queries from JavaScript code.
DuckDB is a lightweight data analysis application that supports the structured query language (SQL) and can natively query common data formats such as CSV. You will learn to use the DuckDB command-line interface (CLI) to further analyze your X posts with SQL without writing any JavaScript code.
I heard DuckDB can read JSON files. Why do I need Node.js?
Yes, DuckDB can read JSON files! The problem is the X archive data files are not in JSON format. They are JavaScript code files with JSON inside designed to run in the front-end app. Code is needed to convert the JavaScript files to JSON. The script in this project could have stopped with creating a JSON file. However, it was helpful to transform the data from the original JSON to CSV to create a structure that was more comfortable to query.
Requirements and setup
- Download your X data archive. Your request may take 24 hours or more to process.
- Install Node.js version 18 or higher.
- Install DuckDB.
Note: You can use DuckDB in a Node.js application with the duckdb-async library without having to install the DuckDB app. However, you may want the standalone app to run your queries against the converted data.
- Clone the
analyze-x-nodejs-duckdb
project. If you're not familiar with usinggit
, you can also download and unzip the project. - Open the project in your terminal or command prompt and run
npm install
to install dependencies. - Extract (unzip) your X archive and copy or move the files into your project folder named
x-archive
. Thex-archive
folder should now look like the following.
Copy code
|__ x-archive |__ assets |__ data |__ readme.md |__ Your archive.html
Launch the conversion application
From your terminal or command window, make sure your current directory is the project folder. Run the following command:
Copy code
node .
If everything is set up correctly, your tweets archive will be converted to a CSV file, and you'll see the output of several queries. Scroll up to see the see the results!
Further analysis using DuckDB
Now that all of your posts have been converted to a CSV file, you can use DuckDB to query that data directly. With the power of SQL, you can answer all kinds of questions!
From the same terminal or command prompt, start the DuckDB application with the following command.
Copy code
duckdb
Your cursor should be before a D
prompt, waiting for a command or SQL statement. To see the first ten posts you created, enter the following query.
Copy code
SELECT created_at_date, link, full_text
FROM "./src/data/tweets.csv"
ORDER BY created_at_date, created_at_time
LIMIT 10;
Note: When typing in a SQL query, press the return or enter key to move to a new line. The query will not execute until you type the ending semicolon `;` and press return or enter.
The sky is the limit! Here are all the columns in the CSV file you can use as part of your queries.
Column Name | Description |
---|---|
id | The post ID assigned by X |
favorite_count | The number of favorites on the post |
retweet_count | The number of times the post was reposted |
created_at_date | The post creation date, formatted as `yyyy-mm-dd` |
created_at_time | The post creation time, formatted as `HH:mm:ss` (0-23 hour format) |
is_reply | Is this post a reply, `0` for false, `1` for true |
in_reply_to_user_id | If the post is a reply, this is the user ID of the account being replied to |
is_self_reply | If the post is a reply to self (part of a thread), `0` or `1` |
retweet | If the post is a retweet, `0` for false, `1` for true |
has_media | If the post contains media, such as an image, `0` for false, `1` for true |
hashtags | The number of hashtags in the post |
user_mentions | The number of user mentions in the post |
urls | The number of URLs in the post |
source | The name of the app used to create the post |
link | The post URL |
full_text | The text of the post |
Note: When you want to exit from DuckDB back to the console, type
.exit
and press return, or press CTRL+C twice.
An overview of the Node.js code
Node.js is a powerful software development environment that uses JavaScript for building all kinds of applications, from scripts like what you see in this project to full-blown web applications, mobile apps, desktop apps, and much more. This project uses the duckdb-async
library to execute DuckDB queries directly. Here is an example of the source code found in the duckdb.js
file in this project.
Copy code
import duckdb from "duckdb-async";
export async function analyzePosts( csvFilePath ) {
try {
// Create an instance of DuckDB using in-memory storage
const db = await duckdb.Database.create( ":memory:" );
await topRetweets( db, csvFilePath );
await topFavorites( db, csvFilePath );
await postStats( db, csvFilePath );
} catch ( err ) {
console.log( "Uh oh! There's an error!" );
console.log( err );
}
}
async function topRetweets( db, csvFilePath ) {
const topRetweets = await db.all( `
SELECT full_text,
created_at_date + created_at_time AS created_at,
retweet_count,
link
FROM read_csv_auto( '${ csvFilePath }' )
ORDER BY retweet_count DESC
LIMIT 3;` );
console.log( "\nTop Retweets!\n" );
console.log( topRetweets );
}
The first line imports the duckdb-async
library. The analyzePosts
takes a single argument, the path to the CSV file to query. The function creates an instance of DuckDB, which it uses to call functions to perform various queries. The topRetweets
function is shown next as an example of how to execute a DuckDB query from Node.js.
TIP: If you don't have a good code editor, download and install Visual Studio Code. Use it to explore the source code and the data in your X archive!
Next steps with Node.js and DuckDB
Now that you've tasted what's possible with Node.js and DuckDB, there's more data to analyze! As mentioned, the archive includes direct messages, followers, and interesting data like personalization and ads. Modify the code to build something that answers whatever questions you have!
Happy coding and querying with Node.js and DuckDB!
CONTENT
- Node.js and DuckDB analytics project overview
- Requirements and setup
- Launch the conversion application
- Further analysis using DuckDB
- An overview of the Node.js code
Subscribe to MotherDuck Blog