Analyze JSON Data Using SQL and DuckDB

2024/01/10 - 10 min read

BY
David Neal

Subscribe to MotherDuck Blog

You have a deadline flying in fast and a boatload of data to dive through. You extract the data archive to discover all the files have a “.json” extension. Oh, no. JSON is for programmers, right? What do you do now?

No need to get your feathers in a fluff! DuckDB to the rescue!

DuckDB is a featherlight yet powerful database that supports querying lots of data formats directly using SQL. It can query data locally on disk, in memory, in the cloud, or combine data from multiple sources in a single query!

In this post, we'll guide you through querying JSON data using DuckDB. You'll learn how to target the data you need with the precision of a duck snatching its favorite bread. Let’s get quacking on how to query JSON data with DuckDB!

Prerequisites

  • Install DuckDB (instructions below).
  • Optional: View or download the sample JSON data used in this tutorial.

What is JSON?

JSON, which stands for JavaScript Object Notation, is a lightweight data format. It is designed to be fairly easy for humans to read and write, and easy for machines to parse and generate, making it a great way to share data. It was originally created for Web applications to share data between the browser and server and has become a standard for storing and sharing data in many other types of applications. Outside of the browser, JSON is typically stored in a text file with a .json extension.

Let’s waddle through some of the basics of JSON! JSON is built on two basic structures:

  • A collection of one or more name/value pairs surrounded by curly braces {}, each pair separated by commas.
  • A list of one or more values surrounded by brackets [], each value separated by commas.

Here’s an example:

Copy code

{ "ducks": [ { "name": "Quackmire", "color": "green", "actions": [ "swimming", "waddling", "quacking" ] }, { "name": "Feather Locklear", "color": "yellow", "actions": [ "sunbathing" ] }, { "name": "Duck Norris", "color": "brown", "actions": [ "karate chopping bread" ] } ], "totalDucks": 3 }
  • All of the data is wrapped in curly braces {}, like a cozy nest.
  • Each duck is part of a "ducks" array (like a flock of ducks in a row), wrapped by square brackets [].
  • Each duck in the array is a set of "name/value" pairs. For example, "name": "Duck Norris" tells us one duck's name is Duck Norris.

Curly braces {} are used to represent an object. You might also think of an object as a record, thing, or entity. The name/value pairs are sometimes called properties. The value associated with the name can represent text (a string), a number, true/false (a boolean), a collection of values (an array), or a nested object. An array is represented by square brackets [] and can be an ordered list of strings, numbers, booleans, or objects.

The JSON format can represent data structures ranging from simple to complex with nested objects and arrays! This makes it a great way to express and exchange data.

Install and execute DuckDB

If you don’t already have DuckDB installed, flap on over to duckdb.org and follow the instructions for your operating system. In this tutorial, you’ll be using DuckDB from the command line.

  • Mac: Follow the Homebrew (brew) install instructions.
  • Windows: Follow the winget install instructions.
  • Linux: Download the appropriate archive for your OS and processor. Extract the duckdb executable binary from the archive to a folder where you easily execute it from your terminal.

Launch DuckDB from the command line

After installing DuckDB, open (or reopen) your terminal or command prompt and enter the following to start an in-memory session of DuckDB.

Copy code

duckdb

Note: If you are running Linux, you’ll want to change the current directory to where you extracted the binary and use ./duckdb to execute the binary

If all goes to plan, you should see a new D prompt ready for a command or SQL query, similar to the following.

Copy code

Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. D

Run your first DuckDB SQL query

From the D prompt on the command line, type in the following SQL query and press enter. Don’t forget to include the semicolon at the end! SQL queries can span multiple lines, and the semicolon lets DuckDB know you are finished writing the query and it’s ready to execute.

Copy code

SELECT current_date - 7;

The result returned should be the date from seven days ago. current_date is one of many SQL functions available, and can be useful for including in query results or filtering data.

Query JSON files with DuckDB

In many cases, you can query data directly from a JSON file by specifying a path to the file.

  • Create a new text file named ducks.json and open it in a text editor.
  • Paste the following JSON data into the file and save it.

Copy code

[ { "id": "kA0KgL", "color": "red", "firstName": "Marty", "lastName": "McFly", "gender": "male" }, { "id": "dx3ngL", "color": "teal", "firstName": "Duckota", "lastName": "Fanning", "gender": "female" }, { "id": "FQ4dU1", "color": "yellow", "firstName": "Duck", "lastName": "Norris", "gender": "male" }, { "id": "JqS7ZZ", "color": "red", "firstName": "James", "lastName": "Pond", "gender": "male" }, { "id": "ZM5uJL", "color": "black", "firstName": "Darth", "lastName": "Wader", "gender": "male" } ]

With DuckDB running at the command line, paste the following query and press ENTER.

Copy code

SELECT * FROM './ducks.json';

The results should look similar to the following.

Copy code

D SELECT * FROM './ducks.json'; ┌─────────┬─────────┬───────────┬──────────┬─────────┐ │ id │ color │ firstName │ lastName │ gender │ │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────┼─────────┼───────────┼──────────┼─────────┤ │ kA0KgL │ red │ Marty │ McFly │ male │ │ dx3ngL │ teal │ Duckota │ Fanning │ female │ │ FQ4dU1 │ yellow │ Duck │ Norris │ male │ │ JqS7ZZ │ red │ James │ Pond │ male │ │ ZM5uJL │ yellow │ Darth │ Wader │ male │ ├─────────┴─────────┴───────────┴──────────┴─────────┤ │ 5 rows 5 columns │ └────────────────────────────────────────────────────┘

Change DuckDB’s output display

If you’re not 100% satisfied with DuckDB’s output to the console, there are lots of choices to customize the output. Type the following command to list the available output modes.

Copy code

.help .mode

Try switching the output mode to column display and rerun the last query to see the difference.

Copy code

.mode column

Copy code

D SELECT * FROM './ducks.json'; id color firstName lastName gender ------ ------ --------- -------- ------ kA0KgL red Marty McFly male dx3ngL teal Duckota Fanning female FQ4dU1 yellow Duck Norris male JqS7ZZ red James Pond male ZM5uJL black Darth Wader male

Experiment with other output modes until you find the one you like the most! If you want to switch back to the default DuckDB output mode, use the following command.

Copy code

.mode duckbox

Query multiple JSON files at once

You can query across multiple files at once using path wildcards. For example, to query all files that end with .json:

Copy code

SELECT * FROM './*.json';

You can query from specific files, too, such as:

Copy code

SELECT * FROM './monthly-sales-2023*.json';

Join JSON files together

Just like joining tables together, if there is a common key in one or more different data files, you can join on that key.

In this example, we have one JSON file that contains a list of ducks in a sanctuary, including ID, name, and color. In another JSON file there is a log of all the the things the ducks were observed doing, surveyed every 10 minutes for a month. This second file has the date and time of the log, the action, and only the ID of the duck. To create a report that summarizes the ducks' activities, you would want to join them together.

Copy code

SELECT ducks.firstName || ' ' || ducks.lastName AS duck_name, samples.action, COUNT(*) AS observations FROM './samples.json' AS samples JOIN './ducks.json' AS ducks ON ducks.id = samples.id GROUP BY ALL ORDER BY 1, 3 DESC;

Copy code

┌────────────────┬─────────────────────────┬──────────────┐ │ duck_name │ action │ observations │ │ varchar │ varchar │ int64 │ ├────────────────┼─────────────────────────┼──────────────┤ │ Captain Quack │ sleeping │ 890 │ │ Captain Quack │ quacking │ 632 │ │ Captain Quack │ eating │ 623 │ │ Captain Quack │ annoying │ 594 │ │ Captain Quack │ swimming │ 356 │ │ Captain Quack │ waddling │ 351 │ │ Captain Quack │ sunbathing │ 348 │ │ Captain Quack │ twitching │ 125 │ │ Captain Quack │ flying │ 121 │ │ Captain Quack │ dancing │ 117 │ │ Captain Quack │ diving │ 106 │ │ Captain Quack │ posting on social media │ 57 │ ...

Import JSON data into DuckDB for further analysis

If you have a lot of different JSON files, it might make sense to import the data into tables in your local DuckDB database. In the following example, you'll import the ducks.json file and samples.json together into one table.

Copy code

CREATE OR REPLACE TABLE duck_samples AS SELECT CAST(samples.sampleTime AS date) AS sample_date, ducks.firstName || ' ' || ducks.lastName AS duck_name, samples.action, COUNT(*) AS observations FROM read_json('./samples.json', columns = { id: 'varchar', sampleTime: 'datetime', action: 'varchar' }) AS samples JOIN './ducks.json' AS ducks ON ducks.id = samples.id GROUP BY ALL;

This example uses the read_json function to customize the schema of the imported data, which can be useful for converting dates and times as the data is read and parsed from the JSON data.

With the duck_samples table populated, we can now use it to analyze the data in new ways, such as number of actions performed by all ducks on a given day.

Copy code

SELECT ds.sample_date, ds.action, ds.observations, round(( ds.observations / totals.total_obs ) * 100, 1) AS percent_total FROM ( SELECT sample_date, action, SUM(observations) AS observations FROM duck_samples GROUP BY ALL ) AS ds JOIN ( SELECT sample_date, SUM(observations) AS total_obs FROM duck_samples GROUP BY ALL ) AS totals ON ds.sample_date = totals.sample_date WHERE ds.sample_date = '2024-01-01' GROUP BY ALL ORDER BY 3 DESC; ┌─────────────┬─────────────────────────┬──────────────┬───────────────┐ │ sample_date │ action │ observations │ percent_total │ │ datevarchar │ int128 │ double │ ├─────────────┼─────────────────────────┼──────────────┼───────────────┤ │ 2024-01-01 │ sleeping │ 155121.5 │ │ 2024-01-01 │ quacking │ 97813.6 │ │ 2024-01-01 │ eating │ 97713.6 │ │ 2024-01-01 │ annoying │ 94713.2 │ │ 2024-01-01 │ swimming │ 6128.5 │ │ 2024-01-01 │ waddling │ 6008.3 │ │ 2024-01-01 │ sunbathing │ 5988.3 │ │ 2024-01-01 │ flying │ 2313.2 │ │ 2024-01-01 │ diving │ 2203.1 │ │ 2024-01-01 │ twitching │ 2082.9 │ │ 2024-01-01 │ dancing │ 1932.7 │ │ 2024-01-01 │ posting on social media │ 851.2 │ ├─────────────┴─────────────────────────┴──────────────┴───────────────┤ │ 12 rows 4 columns │ └──────────────────────────────────────────────────────────────────────┘

Query complex JSON data

Depending on the structure of the JSON data you are working with it may be necessary to extract values from nested objects or arrays. Nested objects are referred to in DuckDB as a struct data type. In some cases, it's possible to access data directly in a struct using syntax that resembles schema or table namespaces. For example, imagine you have JSON file named ducks-nested-name.json with the following data.

Copy code

[ { "color": "red", "name": { "firstName": "Marty", "lastName": "McFly" }, "gender": "male" }, { "color": "teal", "name": { "firstName": "Duckota", "lastName": "Fanning" }, "gender": "female" }, { "color": "yellow", "name": { "firstName": "Duck", "lastName": "Norris" }, "gender": "male" } ]

If you query the file directly, the results would like the following.

Copy code

D SELECT * FROM './ducks-nested-name.json'; ┌─────────┬─────────────────────────────────────────────┬─────────┐ │ color │ name │ gender │ │ varchar │ struct(firstname varchar, lastname varchar) │ varchar │ ├─────────┼─────────────────────────────────────────────┼─────────┤ │ red │ {'firstName': Marty, 'lastName': McFly} │ male │ │ teal │ {'firstName': Duckota, 'lastName': Fanning} │ female │ │ yellow │ {'firstName': Duck, 'lastName': Norris} │ male │ └─────────┴─────────────────────────────────────────────┴─────────┘

You can access the nested values under name using the following syntax.

Copy code

D SELECT color, name.firstName FROM './ducks-nested-name.json'; ┌─────────┬───────────┐ │ color │ firstName │ │ varcharvarchar │ ├─────────┼───────────┤ │ red │ Marty │ │ teal │ Duckota │ │ yellow │ Duck │ └─────────┴───────────┘

DuckDB provides the unnest function to help when dealing with nested data. Taking the first example with Quackmire, Feather Locklear, and Duck Norris, if you query this JSON data without using unnest, you'll see the following results.

Copy code

┌─────────────────────────────────────────────────────────────────┬────────────┐ │ ducks │ totalDucks │ │ struct("name" varchar, color varchar, actions varchar[])[] │ int64 │ ├─────────────────────────────────────────────────────────────────┼────────────┤ │ [{'name': Quackmire, 'color': green, 'actions': [swimming, wa… │ 3 │ └─────────────────────────────────────────────────────────────────┴────────────┘

To make better use of the data in the ducks column, use the unnest function to destructure and flatten the data into their own columns.

Copy code

D SELECT unnest(ducks, recursive:= true) AS ducks FROM './ducks-example.json'; ┌──────────────────┬─────────┬────────────────────────────────┐ │ name │ color │ actions │ │ varcharvarcharvarchar[] │ ├──────────────────┼─────────┼────────────────────────────────┤ │ Quackmire │ green │ [swimming, waddling, quacking] │ │ Feather Locklear │ yellow │ [sunbathing] │ │ Duck Norris │ brown │ [karate chopping bread] │ └──────────────────┴─────────┴────────────────────────────────┘

Query JSON data from an API

DuckDB can also parse data directly from APIs that return JSON. The following example uses the TVmaze API, a public API for TV shows.

Copy code

D SELECT show.name, show.type, show.summary FROM read_json('https://api.tvmaze.com/search/shows?q=duck', auto_detect=true); ┌──────────────────────┬──────────────┬────────────────────────────────────────────────────────────────┐ │ show_name │ show_type │ summary │ │ json │ json │ json │ ├──────────────────────┼──────────────┼────────────────────────────────────────────────────────────────┤ │ "Duck Dynasty" │ "Reality" │ "<p>In <b>Duck Dynasty</b>, A&amp;E Network introduces the R… │ │ "Darkwing Duck" │ "Animation" │ "<p>In the city of St. Canard, the people are plagued by the… │ │ "Duck Dodgers" │ "Animation" │ "<p>Animated sci-fi series based on the alter ego of Looney … │ │ "Duck Patrol" │ "Scripted" │ "<p><b>Duck Patrol</b> deals with the activities of the offi… │ └──────────────────────────────────────────────────────────────────────────────────────────────────────┘

Learn more about DuckDB

To learn more about what you can do with DuckDB, check out the DuckDB Snippets Library or download a free copy of DuckDB in Action.

CONTENT
  1. Prerequisites
  2. What is JSON?
  3. Install and execute DuckDB
  4. Query JSON files with DuckDB
  5. Query complex JSON data
  6. Query JSON data from an API
  7. Learn more about DuckDB

Subscribe to MotherDuck Blog