Query-Log-Informed Schema Descriptions and their Impact on Text-to-SQL
2025/08/22
By Evgeniia Egorova
Overview
In practice, data warehouses continuously grow in both size and complexity, making them progressively harder for analysts and AI agents to navigate. One of the key issues lies in the inherent ambiguity of schema elements — obscure table or column names that are uninterpretable without tribal knowledge. This creates challenges in data discovery, data interpretability, and data usage.
This thesis, conducted at MotherDuck under the supervision of Hannes Mühleisen and Till Döhmen, proposes automating schema documentation by mining historical query logs. The generated descriptions serve as semantic context for both human users and LLM-powered applications like Text-to-SQL.
Figure: Traditional Text-to-SQL pipeline — a natural language question is combined with schema context and fed to an LLM to produce SQL.
The Approach: Mining Query Logs for Schema Descriptions
Two methods are explored for extracting useful information from query history:
1. Query Annotation
A random sample of historical queries is annotated by an LLM, which describes how each column is used in context. These annotations are then aggregated into unified descriptions per column. This approach is LLM-heavy but captures rich semantic context.
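The two-stage flow (annotate each sampled query, then aggregate annotations per column) can be sketched as follows. This is an illustrative outline, not the thesis code: `call_llm` is a hypothetical stand-in for any LLM client, and the prompts are placeholders.

```python
import random

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"annotation for: {prompt[:40]}"

def describe_columns(query_log, columns, sample_size=100, seed=0):
    # Stage 1: annotate a random sample of historical queries.
    random.seed(seed)
    sample = random.sample(query_log, min(sample_size, len(query_log)))
    annotations = [call_llm(f"Describe how columns are used in: {q}") for q in sample]
    # Stage 2: aggregate the per-query annotations into one description per column.
    return {col: call_llm(f"Summarize usage of {col}: " + " | ".join(annotations))
            for col in columns}
```

The sampling step is what makes the approach LLM-heavy: annotation cost scales with the number of sampled queries, while aggregation adds one call per column.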
2. Query Pattern Mining
Each query from the history is parsed into structural elements — column references, table references, and expressions — which are aggregated into usage profiles. These profiles report how frequently an identifier appears, in which SQL clauses, with which other identifiers, and as part of which expressions. This approach is more systematic and produces descriptions grounded in actual usage patterns.
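A deliberately naive sketch of the idea, counting which SQL clause each identifier appears in. This illustration splits queries with a regex and assumes simple single-statement SQL; a production version would use a real SQL parser to also capture co-occurring identifiers and full expressions.

```python
import re
from collections import Counter, defaultdict

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"]

def usage_profile(queries):
    """Map each identifier to a Counter of the clauses it appears in."""
    profile = defaultdict(Counter)
    pattern = "(" + "|".join(CLAUSES) + ")"
    for q in queries:
        clause = None
        # Splitting on a capturing group keeps the clause keywords as parts.
        for part in re.split(pattern, q, flags=re.IGNORECASE):
            if part.upper() in CLAUSES:
                clause = part.upper()
            elif clause:
                for ident in re.findall(r"\b[a-z_][a-z0-9_]*\b", part.lower()):
                    profile[ident][clause] += 1
    return profile

queries = [
    "SELECT revenue, item FROM sales WHERE item = 'x'",
    "SELECT SUM(revenue) FROM sales GROUP BY item",
]
prof = usage_profile(queries)
# prof["revenue"] now records two SELECT-clause appearances, grounding a
# description like "numeric column, typically aggregated in SELECT".
```

Frequencies like these are cheap to compute over the whole log, which is why this method needs far fewer LLM calls than per-query annotation.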
Figure: Text-to-SQL architecture used in the experiments — schema, samples, and generated descriptions are combined into a prompt for the LLM.
Both methods are compared against baselines using only raw schema or column profiling statistics.
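For intuition, a column-profiling baseline might collect statistics like the following. This is an assumed minimal form (the exact statistics used in the study may differ), shown here with SQLite for self-containedness and the `bond_type` values from the running example:

```python
import sqlite3

def profile_column(conn, table, column):
    # Note: table/column names are interpolated directly and assumed trusted.
    total, distinct, nulls = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    samples = [r[0] for r in conn.execute(
        f"SELECT DISTINCT {column} FROM {table} LIMIT 3")]
    return {"rows": total, "distinct": distinct, "nulls": nulls, "samples": samples}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bonds (bond_type TEXT)")
conn.executemany("INSERT INTO bonds VALUES (?)", [("=",), ("#",), ("-",), ("=",)])
profile = profile_column(conn, "bonds", "bond_type")
```

Unlike query-log descriptions, such profiles capture what values look like but not how analysts actually use the column in practice.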
Benchmarks: BIRD vs. Real-World MotherDuck Data
A key contribution is the creation of MDW-AMBIG, a custom benchmark built from MotherDuck’s own production data warehouse. The study reveals significant gaps between the widely used BIRD benchmark and real industrial settings:
Figure: Pairwise similarity between column embeddings for BIRD and MDW. Real-world columns in MDW tend to be much more similar to each other, making disambiguation harder.
- BIRD schemas are small with distinct column names; MDW has 2,730+ columns with many ambiguous, similar names
- BIRD queries use simple expressions; MDW queries involve complex expressions like `SUM(revenue) FILTER (WHERE item IN (...))`
- BIRD provides human-written "evidence" hints; real-world settings have no such luxury
Results
| Configuration | Gemini 2.5 Flash | Qwen 2.5-32B |
|---|---|---|
| Schema only | 34.8% | 20.2% |
| + Query annotation descriptions | 38.0% (+3.2%) | 19.4% (-0.8%) |
| + Query pattern descriptions | 36.8% (+2.0%) | 18.4% (-1.8%) |
| Schema + samples | 42.2% | 28.2% |
| + Query pattern descriptions | 44.0% (+1.8%) | 22.4% (-5.8%) |
Table: BIRD-Dev results (Execution Accuracy). Query-log descriptions improve Gemini performance by 2–3 percentage points.
On MDW-AMBIG, the improvements were dramatically larger:
| Configuration | Gemini 2.5 Flash | Qwen 2.5-32B |
|---|---|---|
| Schema only | 36% | 18% |
| + Query pattern descriptions | 52% (+16%) | 8% (-10%) |
| Schema + samples | 42% | 12% |
| + Query pattern descriptions | 52% (+10%) | 4% (-8%) |
Table: MDW-AMBIG results. Query pattern descriptions boost Gemini accuracy by up to 16 percentage points on real-world data, though smaller models like Qwen struggle with the added context length.
The descriptions help LLMs by surfacing exact value formats and common filter expressions that would otherwise require domain expertise. For example, knowing that bond_type values are '=', '#', and '-' (not 'double', 'triple', 'single') prevents common prediction errors.
Figure: Schema relationships recovered purely from query log analysis, without any predefined foreign key documentation.
Key Takeaways
- Query-log-informed descriptions significantly improve Text-to-SQL on real-world data (up to 16 percentage points on MDW), with moderate gains on benchmarks (2–3 points on BIRD)
- Query pattern mining outperforms query annotation, especially for complex industrial queries where exact expressions matter more than textual paraphrases
- Larger, more capable models benefit from added context; smaller models can be overwhelmed by it
- Real-world schemas are fundamentally different from benchmarks — more ambiguous names, more complex expressions, and larger schemas that exceed context windows
- Automated schema documentation is cheap (~$0.50 for an entire production warehouse) but requires ongoing maintenance as schemas evolve
- The approach can recover implicit schema relationships (joins, foreign keys) that are undocumented, directly from usage patterns