Analyze Data in Azure with DuckDB

2023/11/01 - 4 min read

BY
David Neal

So, you've got some data in Azure blob storage, and you want to run some queries? You can do that with DuckDB or MotherDuck! DuckDB is a lightweight app you install on your computer and execute queries by typing in commands at your terminal. MotherDuck is essentially DuckDB in the cloud, with a UI running in your browser. There's nothing to install. The good news is that both now support querying data stored on Azure!

Find your Azure connection string

Whether you use DuckDB or MotherDuck, you need your Azure connection string to authenticate to the Azure platform. You can find your connection string in the Azure portal. Under Resources, click your storage container. Under Security + networking, click Access keys. If this is your first time using access keys, you may need to generate a new one. Next, click the Show button to reveal your connection string. Select the entire connection string and copy it to your clipboard.

Find your Azure connection string under access keys

As mentioned on this page in the Azure portal, keeping your connection string secure is very important. Learn more about Azure connection strings.

Query Azure with DuckDB

The DuckDB command-line interface (CLI) application supports Azure queries through an optional extension. To use this extension, you must start the application, install the extension, and configure your Azure connection string.

Launch DuckDB

If you haven't already, download and install DuckDB on your computer. Open your terminal or command prompt, and start the DuckDB application.

Copy code

./duckdb

Install and configure the Azure extension for DuckDB

Now that you have DuckDB running, you must install and activate the Azure extension. You can do this in the DuckDB CLI with the following commands.

Copy code

INSTALL azure; LOAD azure;

With the Azure extension loaded, you can configure the extension to use your Azure connection string. Use the following SET command, replacing <your_connection_string> with the value copied from the Azure portal.

Copy code

SET azure_storage_connection_string = '<your_connection_string>';

You are now ready to query data files stored in your Azure container!

Query data files in Azure from DuckDB

Here is the syntax for querying a file in Azure Blob storage.

Copy code

FROM 'azure://[container]/[file-name-or-file-pattern]'

For example, to query a file named survey_results.csv in a container named my_container, the SQL may look like the following.

Copy code

SELECT count(*) FROM 'azure://my_container/survey_results.csv';

You can also query across multiple files with a file-matching pattern. For example, if you have separate files for each month of the year named year-month-sales.csv, you could query across the entire year using the following.

Copy code

SELECT count(*) FROM 'azure://my_container/2023-*-sales.csv';

Query across multiple cloud storage providers using DuckDB

Combining the new Azure extension and the HTTPS extension, it's possible to query across multiple storage providers, should the need arise. For example, you may have historical data stored in Amazon S3 and more recent data stored in Azure and need to query across both.

Copy code

-- Load and configure Azure INSTALL azure; LOAD azure; SET azure_storage_connection_string = 'your-connection-string'; -- Load and configure Amazon S3 INSTALL httpfs; LOAD httpfs; SET s3_access_key_id='your-access-key-id'; SET s3_secret_access_key='your-secret-access-key'; SET s3_region='your-region'; SELECT t1.* FROM ( SELECT * FROM 's3://my-s3-bucket/sales-history.csv' UNION ALL SELECT * FROM 'azure://my-container/ytd-sales.csv' ) t1 ORDER BY "Gross Amt" DESC LIMIT 10;

Query data in Azure from MotherDuck

MotherDuck is a powerful, serverless analytics tool that enables you to run queries directly from your browser. And, there are fewer steps to configure MotherDuck to query Azure.

Configure your Azure connection in MotherDuck

MotherDuck provides a secure and convenient way to store your Azure connection string so you can query Azure whenever you need. To save your Azure connection string in MotherDuck, log in to your MotherDuck account and complete the following steps.

  1. Click your profile menu and click Settings.
  2. Under Secrets, click the ADD button.
  3. Click the Secret type and click Azure.
  4. Paste your connection string in the box labeled Azure storage connection string.
  5. Click Save.
Add MotherDuck Secret to connect to Azure

Query Azure data from MotherDuck

Next, create a new cell in your MotherDuck notebook. Then, write a SQL query to access your Azure storage account. For example, if you have a file saved in my_container named ytd-sales.csv, you might try the following.

Copy code

SELECT * FROM 'azure://my_container/ytd-sales.csv' ORDER BY "Gross Sales" LIMIT 10;

You are ready to duck and roll with MotherDuck and Azure!

Further reading

With the Azure extension for DuckDB, you can now query data in secure Azure Blob storage, including CSV, JSON, parquet, Apache Iceberg, and others. To learn more, you may be interested in the following.

CONTENT
  1. Find your Azure connection string
  2. Query Azure with DuckDB
  3. Query data in Azure from MotherDuck
  4. Further reading

Start using MotherDuck now!

blog subscription icon

Subscribe to motherduck blog