Query Orchestration with Github Actions
GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline. You can create workflows that build and test every pull request to your repository, or deploy merged pull requests to production.
For the purposes of data warehousing, we can use GitHub Actions to extract, load, and transform data as a simple cron job. You can learn more about Github Actions on the documentation pages.
Triggering GitHub Actions
This How-to guide will cover two invocation examples: Actions invoked via workflow dispatch
(manually triggered by a button in Github) and via a scheduled job. After reviewing the job invocation methodology, it continues on to show the definition of a container, installation of DuckDB, and then execution of some basic operations in MotherDuck. It should be noted that this is not intended to be a complete document - rather, a narrow slice of useful code that can be directly applied to the types of problems that can be solved with MotherDuck.
Manually triggered actions
The most basic way to use Github Actions is to use workflow dispatch
so that the action can be triggered by clicking a button in GitHub. Detailed documentation about this can be found on the Github website.
Using workflow dispatch
in practice looks like this:
name: manual_build
on:
workflow_dispatch:
inputs:
name:
# Friendly description to be shown in the UI instead of 'name'
description: 'What is the reason to trigger this manually?'
# Default value if no value is explicitly provided
default: 'testing github actions'
# Input has to be provided for the workflow to run
required: false
jobs:
...
Running cron jobs
Many types of jobs are better suited for scheduled orchestration. This can be done with the schedule
attribute, which will use traditional cron syntax to determine when to run the job.
Using schedule
can look like this:
name: 'Scheduled Run'
on:
schedule:
- cron: '0 10 * * *' # This line sets the job to run every day at 10am UTC
jobs:
...
Defining Jobs & Steps
After invocation method is defined, jobs should be defined. This contains the specific steps required to accomplish the job. For this example, we will define the container, install DuckDB, and then run a script a MotherDuck.
Job definition can look like this:
jobs:
deploy:
name: 'Deploy'
runs-on: ubuntu-latest
We have now define the Action environment, which is the latest stable version of ubuntu. There are of course other places these can run on, but the ubuntu container is a great starting point because it can also be easily shared with Github Codespaces, which makes testing easier.
Github Actions are composable, but for simplicity this guide will not cover how to link actions to each other, or other more advanced steps. This can all be found in the documentation on Github.
Afer the Job is defined, we add the steps. Since is yaml, the spacing is important, which why the steps are tabbed over.
steps:
# check out master using the "Checkout" action
- name: Check out
uses: actions/checkout@master
# install duckdb binary
- name: Install DuckDB
run: |
wget https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64
rm duckdb_cli-linux-amd64.zip
# run sql script with a specific token
- name: Run SQL script
env:
MOTHERDUCK_TOKEN: ${{ secrets.MOTHERDUCK_TOKEN }}
run: ./duckdb < script.sql
The example script invoked above looks like this:
-- attach to motherduck
ATTACH 'md:';
-- set the database
USE my_db;
-- create the table if it doesn't exist
CREATE TABLE IF NOT EXISTS target (
source VARCHAR(255),
timestamp TIMESTAMP
);
-- insert a row
INSERT INTO target (source, timestamp)
VALUES ('github action', CURRENT_TIMESTAMP);
Other Considerations
In order to use this Action as currently written, you will need to create a secret in your repo called MOTHERDUCK_TOKEN with a token generated from your MotherDuck account.