Analyze Data in Azure with DuckDB
2023/11/01 - 4 min read
BYSo, you've got some data in Azure blob storage, and you want to run some queries? You can do that with DuckDB or MotherDuck! DuckDB is a lightweight app you install on your computer and execute queries by typing in commands at your terminal. MotherDuck is essentially DuckDB in the cloud, with a UI running in your browser. There's nothing to install. The good news is that both now support querying data stored on Azure!
Find your Azure connection string
Whether you use DuckDB or MotherDuck, you need your Azure connection string to authenticate to the Azure platform. You can find your connection string in the Azure portal. Under Resources, click your storage container. Under Security + networking, click Access keys. If this is your first time using access keys, you may need to generate a new one. Next, click the Show button to reveal your connection string. Select the entire connection string and copy it to your clipboard.
As mentioned on this page in the Azure portal, keeping your connection string secure is very important. Learn more about Azure connection strings.
Query Azure with DuckDB
The DuckDB command-line interface (CLI) application supports Azure queries through an optional extension. To use this extension, you must start the application, install the extension, and configure your Azure connection string.
Launch DuckDB
If you haven't already, download and install DuckDB on your computer. Open your terminal or command prompt, and start the DuckDB application.
Copy code
./duckdb
Install and configure the Azure extension for DuckDB
Now that you have DuckDB running, you must install and activate the Azure extension. You can do this in the DuckDB CLI with the following commands.
Copy code
INSTALL azure; LOAD azure;
With the Azure extension loaded, you can configure the extension to use your Azure connection string. Use the following SET
command, replacing <your_connection_string>
with the value copied from the Azure portal.
Copy code
SET azure_storage_connection_string = '<your_connection_string>';
You are now ready to query data files stored in your Azure container!
Query data files in Azure from DuckDB
Here is the syntax for querying a file in Azure Blob storage.
Copy code
FROM 'azure://[container]/[file-name-or-file-pattern]'
For example, to query a file named survey_results.csv
in a container named my_container
, the SQL may look like the following.
Copy code
SELECT count(*) FROM 'azure://my_container/survey_results.csv';
You can also query across multiple files with a file-matching pattern. For example, if you have separate files for each month of the year named year-month-sales.csv
, you could query across the entire year using the following.
Copy code
SELECT count(*) FROM 'azure://my_container/2023-*-sales.csv';
Query across multiple cloud storage providers using DuckDB
Combining the new Azure extension and the HTTPS extension, it's possible to query across multiple storage providers, should the need arise. For example, you may have historical data stored in Amazon S3 and more recent data stored in Azure and need to query across both.
Copy code
-- Load and configure Azure
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = 'your-connection-string';
-- Load and configure Amazon S3
INSTALL httpfs;
LOAD httpfs;
SET s3_access_key_id='your-access-key-id';
SET s3_secret_access_key='your-secret-access-key';
SET s3_region='your-region';
SELECT t1.*
FROM (
SELECT * FROM 's3://my-s3-bucket/sales-history.csv'
UNION ALL
SELECT * FROM 'azure://my-container/ytd-sales.csv'
) t1
ORDER BY "Gross Amt" DESC
LIMIT 10;
Query data in Azure from MotherDuck
MotherDuck is a powerful, serverless analytics tool that enables you to run queries directly from your browser. And, there are fewer steps to configure MotherDuck to query Azure.
Configure your Azure connection in MotherDuck
MotherDuck provides a secure and convenient way to store your Azure connection string so you can query Azure whenever you need. To save your Azure connection string in MotherDuck, log in to your MotherDuck account and complete the following steps.
- Click your profile menu and click Settings.
- Under Secrets, click the ADD button.
- Click the Secret type and click Azure.
- Paste your connection string in the box labeled Azure storage connection string.
- Click Save.
Query Azure data from MotherDuck
Next, create a new cell in your MotherDuck notebook. Then, write a SQL query to access your Azure storage account. For example, if you have a file saved in my_container
named ytd-sales.csv
, you might try the following.
Copy code
SELECT *
FROM 'azure://my_container/ytd-sales.csv'
ORDER BY "Gross Sales"
LIMIT 10;
You are ready to duck and roll with MotherDuck and Azure!
Further reading
With the Azure extension for DuckDB, you can now query data in secure Azure Blob storage, including CSV, JSON, parquet, Apache Iceberg, and others. To learn more, you may be interested in the following.
CONTENT
- Find your Azure connection string
- Query Azure with DuckDB
- Query data in Azure from MotherDuck
- Further reading
Start using MotherDuck now!