Query-Log-Informed Schema Descriptions and their Impact on Text-to-SQL

2025/08/22

By

Evgeniia Egorova

Overview

In practice, data warehouses continuously grow in both size and complexity, making them progressively harder for analysts and AI agents to navigate. One of the key issues lies in the inherent ambiguity of schema elements — obscure table or column names that are uninterpretable without tribal knowledge. This creates challenges in data discovery, data interpretability, and data usage.

This thesis, conducted at MotherDuck under the supervision of Hannes Mühleisen and Till Döhmen, proposes automating schema documentation by mining historical query logs. The generated descriptions serve as semantic context for both human users and LLM-powered applications like Text-to-SQL.

Figure: Traditional Text-to-SQL pipeline — a natural language question is combined with schema context and fed to an LLM to produce SQL.

The Approach: Mining Query Logs for Schema Descriptions

Two methods are explored for extracting useful information from query history:

1. Query Annotation

A random sample of historical queries is annotated by an LLM, which describes how each column is used in context. These annotations are then aggregated into unified descriptions per column. This approach is LLM-heavy but captures rich semantic context.
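The annotate-then-aggregate flow might be sketched as follows. Here `annotate_query` is a hypothetical stub standing in for the per-query LLM call, returning canned notes purely for illustration; only the aggregation step reflects the pipeline described above (in the real system, a second LLM pass would merge each column's note list into one unified description).

```python
from collections import defaultdict

def annotate_query(sql: str) -> dict:
    """Hypothetical stand-in for the LLM call that describes how each
    column is used in a single query. Canned notes, for illustration only."""
    notes = {}
    if "bond_type" in sql:
        notes["bond_type"] = "filtered against single-character bond codes"
    if "revenue" in sql:
        notes["revenue"] = "aggregated with SUM as a revenue metric"
    return notes

def aggregate_annotations(queries: list) -> dict:
    """Collect per-query annotations into a list of notes per column."""
    per_column = defaultdict(list)
    for sql in queries:
        for column, note in annotate_query(sql).items():
            per_column[column].append(note)
    return dict(per_column)

sampled_queries = [
    "SELECT COUNT(*) FROM bond WHERE bond_type = '='",
    "SELECT SUM(revenue) FROM sales",
]
profiles = aggregate_annotations(sampled_queries)
```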

2. Query Pattern Mining

Each query from the history is parsed into structural elements — column references, table references, and expressions — which are aggregated into usage profiles. These profiles report how frequently an identifier appears, in which SQL clauses, with which other identifiers, and as part of which expressions. This approach is more systematic and produces descriptions grounded in actual usage patterns.
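The clause-level counting behind a usage profile can be sketched with stdlib tools alone. This is a toy tokenizer, with a regex standing in for the real SQL parser the pipeline would use, and the table and column names are illustrative:

```python
import re
from collections import Counter, defaultdict

CLAUSES = {"SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"}
SPLIT_RE = re.compile(r"\b(SELECT|FROM|WHERE|GROUP BY|ORDER BY)\b", re.IGNORECASE)
IDENT_RE = re.compile(r"\b[a-z_][a-z0-9_]*\b", re.IGNORECASE)
SQL_FUNCS = {"sum", "count", "avg", "min", "max", "and", "or", "in", "not", "as"}

def usage_profile(queries):
    """Count, per identifier, how often it appears in each SQL clause.
    A toy sketch: string literals are stripped, functions/keywords skipped."""
    profile = defaultdict(Counter)
    for sql in queries:
        clause = None
        for part in SPLIT_RE.split(sql):
            if part.upper() in CLAUSES:
                clause = part.upper()
                continue
            if clause is None:
                continue
            part = re.sub(r"'[^']*'", " ", part)  # drop string literals
            for ident in IDENT_RE.findall(part):
                if ident.lower() not in SQL_FUNCS:
                    profile[ident.lower()][clause] += 1
    return profile

logs = [
    "SELECT item, SUM(revenue) FROM sales WHERE region = 'EU' GROUP BY item",
    "SELECT item FROM sales WHERE revenue > 100",
]
p = usage_profile(logs)
```

Each `Counter` in `p` records, e.g., that `revenue` appears once in SELECT (inside an aggregate) and once in WHERE, which is exactly the kind of co-occurrence signal the descriptions are grounded in.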

Figure: Text-to-SQL architecture used in the experiments — schema, samples, and generated descriptions are combined into a prompt for the LLM.

Both methods are compared against baselines that use only the raw schema or column profiling statistics.

Benchmarks: BIRD vs. Real-World MotherDuck Data

A key contribution is the creation of MDW-AMBIG, a custom benchmark built from MotherDuck’s own production data warehouse. The study reveals significant gaps between the widely used BIRD benchmark and real industrial settings:

Figure: Pairwise similarity between column embeddings for BIRD and MDW. Real-world columns in MDW tend to be much more similar to each other, making disambiguation harder.

  • BIRD schemas are small with distinct column names; MDW has 2,730+ columns with many ambiguous, similar names
  • BIRD queries use simple expressions; MDW queries involve complex expressions like SUM(revenue) FILTER (WHERE item IN (...))
  • BIRD provides human-written "evidence" hints; real-world settings have no such luxury
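The first of these gaps is easy to demonstrate. Using `difflib.SequenceMatcher` as a crude stdlib stand-in for the embedding similarity shown in the heatmaps, near-duplicate warehouse-style names score a far higher average pairwise similarity than BIRD-style distinct names (the column names below are illustrative, not actual MDW columns):

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def mean_pairwise_similarity(names):
    """Average string similarity over all pairs of column names.
    A cheap proxy for the embedding-based similarity in the figure."""
    return mean(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(names, 2)
    )

# BIRD-style: short, semantically distinct names.
bird_like = ["molecule_id", "bond_type", "element", "label"]
# Warehouse-style: many near-duplicate variants of the same concept.
mdw_like = ["event_ts", "event_ts_utc", "event_time", "event_timestamp"]

gap = mean_pairwise_similarity(mdw_like) - mean_pairwise_similarity(bird_like)
```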

Results

Configuration                      Gemini 2.5 Flash   Qwen 2.5-32B
Schema only                        34.8%              20.2%
+ Query annotation descriptions    38.0% (+3.2%)      19.4% (-0.8%)
+ Query pattern descriptions       36.8% (+2.0%)      18.4% (-1.8%)
Schema + samples                   42.2%              28.2%
+ Query pattern descriptions       44.0% (+1.8%)      22.4% (-5.8%)

Table: BIRD-Dev results (Execution Accuracy). Query-log descriptions improve Gemini performance by 2–3 percentage points.

On MDW-AMBIG, the improvements were dramatically larger:

Configuration                      Gemini 2.5 Flash   Qwen 2.5-32B
Schema only                        36%                18%
+ Query pattern descriptions       52% (+16%)         8% (-10%)
Schema + samples                   42%                12%
+ Query pattern descriptions       52% (+10%)         4% (-8%)

Table: MDW-AMBIG results. Query pattern descriptions boost Gemini accuracy by up to 16 percentage points on real-world data, though smaller models like Qwen struggle with the added context length.

The descriptions help LLMs by surfacing exact value formats and common filter expressions that would otherwise require domain expertise. For example, knowing that bond_type values are '=', '#', and '-' (not 'double', 'triple', 'single') prevents common prediction errors.
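A minimal sketch of how such value-format hints could be distilled from logs: a toy regex pulls the string literals that queries compare against each column, then renders them as a one-line description hint. The real pipeline derives this from parsed expressions rather than regexes, and the queries below are illustrative.

```python
import re
from collections import Counter

# Matches "column = 'literal'" and "column IN ('literal'" filter patterns.
FILTER_RE = re.compile(r"(\w+)\s*(?:=|IN\s*\()\s*'([^']*)'", re.IGNORECASE)

def value_hints(queries, top=5):
    """Collect literal values filtered against each column and render
    the most frequent ones as a description hint."""
    seen = {}
    for sql in queries:
        for column, literal in FILTER_RE.findall(sql):
            seen.setdefault(column, Counter())[literal] += 1
    return {
        col: "common filter values: "
        + ", ".join(repr(v) for v, _ in counts.most_common(top))
        for col, counts in seen.items()
    }

logs = [
    "SELECT COUNT(*) FROM bond WHERE bond_type = '='",
    "SELECT COUNT(*) FROM bond WHERE bond_type = '#'",
    "SELECT * FROM bond WHERE bond_type IN ('-')",
]
hints = value_hints(logs)
```

For `bond_type`, this yields a hint listing `'='`, `'#'`, and `'-'`, which is precisely the signal that steers an LLM away from guessing `'double'` or `'single'`.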

Figure: Schema relationships recovered purely from query log analysis, without any predefined foreign key documentation.

Key Takeaways

  • Query-log-informed descriptions significantly improve Text-to-SQL on real-world data (up to 16 percentage points on MDW-AMBIG), with moderate gains on benchmarks (2–3 points on BIRD)
  • Query pattern mining outperforms query annotation, especially for complex industrial queries where exact expressions matter more than textual paraphrases
  • Larger, more capable models benefit from added context; smaller models can be overwhelmed by it
  • Real-world schemas are fundamentally different from benchmarks — more ambiguous names, more complex expressions, and larger schemas that exceed context windows
  • Automated schema documentation is cheap (~$0.50 for an entire production warehouse) but requires ongoing maintenance as schemas evolve
  • The approach can recover implicit schema relationships (joins, foreign keys) that are undocumented, directly from usage patterns