What is Advent of Code?
For the uninitiated, Advent of Code (AoC) is an advent calendar in the form of coding problems that runs from December 1-25. It has run every year since 2015.
The AoC about page describes it best:
Advent of Code is an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. People use them as interview prep, company training, university coursework, practice problems, a speed contest, or to challenge each other.
Each problem has two parts. Complete both parts and you get a gold star; completing just part one earns you a silver star. The problems generally get tougher as the days go on. This is best illustrated by the stats page, which clearly shows how the number of people completing each day's problems tails off.
I've attempted AoC each year since 2020. In 2020 I used Python. In 2021, I decided to try Snowflake SQL, but by day four it became too tedious and I returned to Python.
The daily solution threads on /r/adventofcode are full of people using any programming language you can imagine, including unconventional choices such as APL, Rockstar, and even Microsoft Excel. However, I was surprised not to find many SQL solutions posted. When I did find them, they were often written in T-SQL for Microsoft SQL Server, and occasionally I'd find a PostgreSQL solution.
Why DuckDB?
In November 2022 with AoC approaching I decided I would commit myself to using SQL for AoC, even if I didn't get very far. I like SQL because it's a very satisfying way to solve complex problems. The next question was which database I would use. I'd been familiar with DuckDB for years but never had a chance to use it in any practical sense. With its recent surge in popularity, DuckDB felt like an obvious choice. I could easily run it on my laptop, I didn't need to set up an account with a cloud data warehouse provider, and didn't need to mess with Docker to get PostgreSQL running.
While DuckDB could work perfectly fine on its own, I decided to pair it with dbt-duckdb, a DuckDB adapter for dbt. I use dbt daily in my job as an analytics engineer, so it felt like a natural way to structure my project. By using dbt I was able to add tests to ensure my solutions still worked after refactoring.
Patterns
After doing several AoC problems you start to see patterns appear. You're provided a puzzle input for each day. The puzzle inputs are text files, often in the form of long lists of numbers or strings. DuckDB has excellent support for reading CSV files, and both read_csv and read_csv_auto worked incredibly well for parsing my puzzle inputs, depending on how much flexibility I needed.
The data types provided by DuckDB are very comprehensive. While data types like lists are essential in procedural languages such as Python, I'd barely ever used them in SQL before. For many problems, I found a common pattern: use string_split to return rows of lists of strings, then use the incredibly powerful unnest function to turn those lists into one row per list item. This worked especially well for the grid problems often seen in AoC, reshaping the input into rows of x, y, and value, as in the sketch below.
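To make the pattern concrete, here is a minimal sketch (not taken from my solutions; the file name is hypothetical) that turns lines of characters into one row per grid cell. The two unnest calls expand in lockstep, pairing each column index with its character:
with lines as (
    select row_number() over () as y  -- one row per input line
         , line
    from read_csv_auto('input/grid.txt') as t(line)
)
select y
     , unnest(generate_series(1, length(line))) as x  -- 1-based column index
     , unnest(string_split(line, '')) as value        -- character at that position
from lines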
Recursive CTEs are essential for many of the more challenging AoC problems, especially ones that require building and walking graphs, as well as ones that require iterating over rows with conditional branching logic.
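As a toy illustration (hard-coded edges rather than a real AoC input), a recursive CTE can walk a small graph like this, collecting every node reachable from node 1:
with recursive edges(src, dst) as (
    select * from (values (1, 2), (2, 3), (2, 4), (4, 5)) t(src, dst)
), reachable(node) as (
    select 1        -- anchor: the starting node
    union           -- plain union deduplicates already-visited nodes
    select e.dst
    from reachable r
    join edges e on e.src = r.node
)
select * from reachable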
Window functions are also very useful. DuckDB is kind enough to maintain the order of rows after reading in a CSV, but you'll often want to add an identifier to keep track of those rows through transformations; row_number() over () will give you just that.
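For example, with a hypothetical file name (we'll use the same trick in the day three walkthrough below):
select row_number() over () as id  -- preserves the original file order
     , *
from read_csv_auto('input/example.csv')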
string_agg is a useful aggregate, window, and list function. However (at the time of writing), when using it as a list function it has an odd limitation: specifying the string separator does not work as expected. Thanks to the wonderful DuckDB Discord I found a workaround: list_aggr(['a', 'b', 'c'], 'string_agg', '') will join a list together. It looks odd, but it works.
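Running the workaround in the DuckDB CLI confirms it; this returns the single string 'abc':
select list_aggr(['a', 'b', 'c'], 'string_agg', '') as joined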
Walkthrough
Next, I'll walk through a couple of my solutions. Feel free to skip past if you plan on attempting these yourself and you don't want to be spoiled.
Day Three
Day three has you working with "items" (represented as letters) in rucksacks (one rucksack per line of input). The puzzle input contains 300 lines of random-looking text strings of varying lengths.
First, we start by reading in the puzzle input. Here we also add an elf identifier to each row to keep track of each elf's rucksack.
with input(elf, items) as (
    select row_number() over () as elf
         , *
    from read_csv_auto('input/03.csv')
)
The first real step is to split each rucksack into equal halves. We can do this by counting how many items each rucksack contains and then using string slicing to separate each half into its own column. Finally, we split the strings into lists.
, compartments as (
    select *
         , length(items) as len
         , string_split(items[1 : len / 2], '') as compartment_1
         , string_split(items[len / 2 + 1 : len], '') as compartment_2
    from input
)
Part one asks us to find the one item type that appears in both compartments of each rucksack. With both compartments now list types, we can use list_filter with a lambda that uses the contains function to keep only the item present in both. Finally, [1] indexes the resulting one-element list (DuckDB lists are 1-indexed) to return a single value.
, common_by_compartment as (
    select elf
         , list_filter(compartment_1, x -> contains(compartment_2, x))[1] as item
    from compartments
)
The final step for part one (and part two) is to calculate the priority. If you're familiar with ASCII character codes you'll probably notice the shortcut we can use: the ord function returns the ASCII code of the character passed in. From there we just need the right offset to match the priorities in the instructions: 'a' through 'z' are codes 97-122, so subtracting 96 gives 1-26, while 'A' through 'Z' are codes 65-90, so subtracting 38 gives 27-52. Finally, we sum the column to get the answer to part one.
select sum(case
        when ord(item) between 65 and 90 then ord(item) - 38 /* A-Z */
        else ord(item) - 96 /* a-z */
    end) as answer
from common_by_compartment
Part two ups the difficulty and asks us to find the common item between groups of three elves. The first step is to create the groups using a row_number window function partitioned by elf % 3. elf is the identifier we created in the first step and % is the modulo operator, which returns the remainder of dividing the two values. Elves 1, 2, and 3 fall into different partitions but each get row number 1, elves 4, 5, and 6 each get row number 2, and so on, which is exactly the group identifier we need.
, elf_groups as (
    select row_number() over (partition by elf % 3 order by elf) as elf_group
         , *
    from input
)
Next, we split each rucksack into a list and unnest to fan the results out to a single item per row. Note the distinct: it removes duplicate items within a single rucksack, which matters for the count in the next step.
, distinct_items_by_group as (
    select distinct
           elf_group
         , elf
         , unnest(string_split(items, '')) as item
    from elf_groups
)
Nearly finished, we can use a simple group by and having clause to find the item that appears in all three rucksacks of each group.
, common_by_group as (
    select elf_group
         , item
    from distinct_items_by_group
    group by 1, 2
    having count(*) = 3
)
The final step for part two is to calculate the priority the same way as in part one.
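For completeness, here is a sketch of that final query; it mirrors the part one version, just reading from common_by_group:
select sum(case
        when ord(item) between 65 and 90 then ord(item) - 38 /* A-Z */
        else ord(item) - 96 /* a-z */
    end) as answer
from common_by_group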
Day Six
Day six's problem asks you to help the elves decode signals from their communication system. The puzzle input is a single line of 4,096 lowercase letters. In part one, you're asked to find the first start-of-packet marker, which is defined as four sequential characters that are all different. Part two asks us to find the first start-of-message marker, which is the same as a start-of-packet marker except it is 14 characters long.
We start the same way as usual, by reading from our puzzle input. We also dive right in by splitting the characters into a list and unnesting to get each character onto its own row.
with input as (
    select unnest(str_split(char, '')) as buffer
    from read_csv_auto('input/06.csv') as chars(char)
)
The next step simply adds an identifier column that we'll later use to sort by.
, row_id as (
    select row_number() over () as id
         , buffer
    from input
)
Here is where the bulk of the work happens for both parts. First, we use list as a window function with a frame that looks backward the required number of characters. Then list_distinct returns the unique items from each windowed list. Finally, length gives us the length of each deduplicated list.
, markers as (
    select id
         , length(list_distinct(list(buffer) over (order by id
               rows between 3 preceding
               and current row))) as packet_marker
         , length(list_distinct(list(buffer) over (order by id
               rows between 13 preceding
               and current row))) as message_marker
    from row_id
    order by id
)
To get the final answer for both parts, we just find the minimum id that matches our criteria.
select 1 as part
     , min(id) as answer
from markers
where packet_marker = 4
union all
select 2 as part
     , min(id) as answer
from markers
where message_marker = 14
Wrapping Up
I enjoyed attempting AoC with DuckDB. It required a completely different way of thinking compared to Python. While I didn't complete as many days as in prior years using Python, I learned a ton. I also felt like several of my solutions were much more readable and elegant compared to those done in procedural languages.
I managed to get gold stars for the first eight days. On day nine I struggled to come up with a solution; I believe it's possible to solve it with a recursive CTE or even a lateral join (available in DuckDB development builds at the time of writing), but I didn't get very far. I did find a solution for day 10; however, on days 11, 12, and 13 I worked for many hours but ultimately did not manage to find solutions. Day 12 involved building a graph, which I did using a recursive CTE. While it ran successfully on the sample input, I could not get it to run quickly enough on my actual puzzle input.
Go give it a try! You don't have to wait until December, you can attempt any day from any of the prior years. I recommend joining or creating a private leaderboard for some friendly competition among friends or those with similar interests. If you're able to come up with DuckDB solutions I'd love to hear about them!
You can find my GitHub repo with my solutions here, and the best place to reach me is on Mastodon or LinkedIn.