PROMPT_SQL
Overview
The prompt_sql
function allows you to generate SQL queries using natural language. Simply describe what you want to analyze in plain English, and MotherDuck AI will translate your request into a valid SQL query based on your database schema and content.
This function helps users who are less familiar with SQL syntax to generate queries and experienced SQL users save time when working with unfamiliar schemas.
Syntax
CALL prompt_sql('<natural language question>'[, include_tables=<table_specification>]);
Parameters
Parameter | Type | Description | Required |
---|---|---|---|
natural language question | STRING | Your query in plain English describing the data you want to analyze | Yes |
include_tables | ARRAY or MAP | Specifies which tables and columns to consider for query generation. When not provided, all tables in the current database will be considered. | No |
Include tables parameter
You can specify which tables and columns should be considered during SQL generation using the include_tables
parameter. This is particularly useful when:
- You want to focus on specific tables in a large database
- You want to improve performance by reducing the schema analysis scope
The parameter accepts three formats:
-
Array of table names: include all columns from specified tables:
include_tables=['table1', 'table2']
-
Map of tables to columns: include only specific columns from tables:
include_tables={'table1': ['column1', 'column2'], 'table2': ['column3']}
-
Map with column regex patterns: include columns matching patterns:
include_tables={'table1': ['column_prefix.*', 'exact_column']}
Examples
Basic example
Let's start with a simple example using MotherDuck's sample Hacker News dataset:
CALL prompt_sql('what are the top domains being shared on hacker_news?');
Output:
query |
---|
SELECT regexp_extract(url, 'https?://([^/]+)') AS domain, COUNT(*) AS count FROM hn.hacker_news WHERE url IS NOT NULL GROUP BY domain ORDER BY count DESC; |
Intermediate example
This example demonstrates how to generate a more complex query with filtering, aggregation, and time-based analysis:
CALL prompt_sql('Show me the average score of stories posted by each author who has posted at least 5 stories in 2022, sorted by average score');
Output:
query |
---|
SELECT 'by', AVG(score) AS average_score FROM hn.hacker_news WHERE EXTRACT(YEAR FROM 'timestamp') = 2022 GROUP BY 'by' HAVING COUNT(id) >= 5 ORDER BY average_score; |
Advanced Example: Multi-table Analysis with Specific Columns
This example shows how to generate a query that focuses on specific columns:
CALL prompt_sql(
'Find the top 10 users who submitted the most stories with the highest average scores in 2023',
include_tables={
'hn.hacker_news': ['id', 'by', 'score', 'timestamp', 'type', 'title']
}
);
Output:
query |
---|
SELECT "by", AVG(score) AS avg_score, COUNT(*) AS story_count FROM hn.hacker_news WHERE "type" = 'story' AND EXTRACT(YEAR FROM "timestamp") = 2023 GROUP BY "by" ORDER BY story_count DESC, avg_score DESC LIMIT 10; |
Expert example
This example demonstrates generating a complex query with subqueries, window functions, and complex logic:
CALL prompt_sql('For each month in 2022, show me the top 3 users who posted stories with the highest scores, and how their average score compares to the previous month');
Output:
query |
---|
WITH monthly_scores AS ( SELECT "by" AS user, DATE_TRUNC('month', "timestamp") AS month, AVG(score) AS avg_score FROM hn.hacker_news WHERE "type" = 'story' AND DATE_PART('year', "timestamp") = 2022 GROUP BY user, month ), ... |
Failure example
This example shows that for some complex queries, the model might not generate a valid SQL query. Therefore the output will be the following error message:
CALL prompt_sql('Identify the most discussed technology topics in Hacker News stories from the past year based on title keywords, and show which days of the week have the highest engagement for each topic');
Output:
query |
---|
Invalid Input Error: The AI could not generate valid SQL. Try re-running the command or rephrasing your question. |
To generate a valid SQL query, you can try to break down the question into simpler parts.
Best practices
-
Be specific in your questions: the more specific your natural language query, the more accurate the generated SQL will be.
-
Start simple and iterate: begin with basic queries and gradually add complexity as needed.
-
Use the
include_tables
parameter: when working with large databases, specify relevant tables to improve performance and accuracy. -
Review generated SQL: always review the generated SQL before executing it, especially for complex queries.
-
Understand your schema: knowing your table structure helps you phrase questions that align with available data.
-
Use domain-specific terminology: include field names in your questions when possible.
-
Provide context in your questions: mention time periods, specific metrics, or business context to get more relevant results.
Notes
- By default, all tables in the current database are considered. Use the
include_tables
parameter to narrow the scope. - To target a specific database, first execute the
USE <database_name>
command (learn more about switching databases). - The quality of generated SQL depends on the clarity of your natural language question and the quality of your database schema (table and column names).
- This feature is powered by MotherDuck's integration with OpenAI's language models.
Troubleshooting
If you encounter issues with the prompt_sql
function, consider the following troubleshooting steps:
- Check your database schema: ensure that the tables and columns you're querying are present in the current database.
- Be specific in your questions: the more specific your natural language query, the more accurate the generated SQL will be.
- Use the
include_tables
parameter: when working with large databases, specify relevant tables to improve performance and accuracy.