Query-Log-Informed Schema Descriptions and their Impact on Text-to-SQL

2025/08/22

By

Evgeniia Egorova

Overview

In practice, data warehouses continuously grow in both size and complexity, making them progressively harder for analysts and AI agents to navigate. One of the key issues lies in the inherent ambiguity of schema elements — obscure table or column names that are uninterpretable without tribal knowledge. This creates challenges in data discovery, data interpretability, and data usage.

This thesis, conducted at MotherDuck under the supervision of Hannes Mühleisen and Till Döhmen, proposes automating schema documentation by mining historical query logs. The generated descriptions serve as semantic context for both human users and LLM-powered applications like Text-to-SQL.

Figure: Traditional Text-to-SQL pipeline — a natural language question is combined with schema context and fed to an LLM to produce SQL.

The Approach: Mining Query Logs for Schema Descriptions

Two methods are explored for extracting useful information from query history:

1. Query Annotation

A random sample of historical queries is annotated by an LLM, which describes how each column is used in context. These annotations are then aggregated into unified descriptions per column. This approach is LLM-heavy but captures rich semantic context.
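The annotate-then-aggregate flow might be sketched as follows. Here `annotate_query` is a hypothetical stub standing in for the per-query LLM call, returning canned notes purely for illustration; only the aggregation step reflects the pipeline described above (in the real system, a second LLM pass would merge each column's note list into one unified description).

```python
from collections import defaultdict

def annotate_query(sql: str) -> dict:
    """Hypothetical stand-in for the LLM call that describes how each
    column is used in a single query. Canned notes, for illustration only."""
    notes = {}
    if "bond_type" in sql:
        notes["bond_type"] = "filtered against single-character bond codes"
    if "revenue" in sql:
        notes["revenue"] = "aggregated with SUM as a revenue metric"
    return notes

def aggregate_annotations(queries: list) -> dict:
    """Collect per-query annotations into a list of notes per column."""
    per_column = defaultdict(list)
    for sql in queries:
        for column, note in annotate_query(sql).items():
            per_column[column].append(note)
    return dict(per_column)

sampled_queries = [
    "SELECT COUNT(*) FROM bond WHERE bond_type = '='",
    "SELECT SUM(revenue) FROM sales",
]
profiles = aggregate_annotations(sampled_queries)
```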

2. Query Pattern Mining

Each query from the history is parsed into structural elements — column references, table references, and expressions — which are aggregated into usage profiles. These profiles report how frequently an identifier appears, in which SQL clauses, with which other identifiers, and as part of which expressions. This approach is more systematic and produces descriptions grounded in actual usage patterns.
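The clause-level counting behind a usage profile can be sketched with stdlib tools alone. This is a toy tokenizer, with a regex standing in for the real SQL parser the pipeline would use, and the table and column names are illustrative:

```python
import re
from collections import Counter, defaultdict

CLAUSES = {"SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"}
SPLIT_RE = re.compile(r"\b(SELECT|FROM|WHERE|GROUP BY|ORDER BY)\b", re.IGNORECASE)
IDENT_RE = re.compile(r"\b[a-z_][a-z0-9_]*\b", re.IGNORECASE)
SQL_FUNCS = {"sum", "count", "avg", "min", "max", "and", "or", "in", "not", "as"}

def usage_profile(queries):
    """Count, per identifier, how often it appears in each SQL clause.
    A toy sketch: string literals are stripped, functions/keywords skipped."""
    profile = defaultdict(Counter)
    for sql in queries:
        clause = None
        for part in SPLIT_RE.split(sql):
            if part.upper() in CLAUSES:
                clause = part.upper()
                continue
            if clause is None:
                continue
            part = re.sub(r"'[^']*'", " ", part)  # drop string literals
            for ident in IDENT_RE.findall(part):
                if ident.lower() not in SQL_FUNCS:
                    profile[ident.lower()][clause] += 1
    return profile

logs = [
    "SELECT item, SUM(revenue) FROM sales WHERE region = 'EU' GROUP BY item",
    "SELECT item FROM sales WHERE revenue > 100",
]
p = usage_profile(logs)
```

Each `Counter` in `p` records, e.g., that `revenue` appears once in SELECT (inside an aggregate) and once in WHERE, which is exactly the kind of co-occurrence signal the descriptions are grounded in.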

Figure: Text-to-SQL architecture used in the experiments — schema, samples, and generated descriptions are combined into a prompt for the LLM.

Both methods are compared against baselines that use only the raw schema or column profiling statistics.

Benchmarks: BIRD vs. Real-World MotherDuck Data

A key contribution is the creation of MDW-AMBIG, a custom benchmark built from MotherDuck’s own production data warehouse. The study reveals significant gaps between the widely used BIRD benchmark and real industrial settings:

Figure: Pairwise similarity between column embeddings for BIRD and MDW. Real-world columns in MDW tend to be much more similar to each other, making disambiguation harder.

  • BIRD schemas are small with distinct column names; MDW has 2,730+ columns with many ambiguous, similar names
  • BIRD queries use simple expressions; MDW queries involve complex expressions like SUM(revenue) FILTER (WHERE item IN (...))
  • BIRD provides human-written "evidence" hints; real-world settings have no such luxury
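The first of these gaps is easy to demonstrate. Using `difflib.SequenceMatcher` as a crude stdlib stand-in for the embedding similarity shown in the heatmaps, near-duplicate warehouse-style names score a far higher average pairwise similarity than BIRD-style distinct names (the column names below are illustrative, not actual MDW columns):

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def mean_pairwise_similarity(names):
    """Average string similarity over all pairs of column names.
    A cheap proxy for the embedding-based similarity in the figure."""
    return mean(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(names, 2)
    )

# BIRD-style: short, semantically distinct names.
bird_like = ["molecule_id", "bond_type", "element", "label"]
# Warehouse-style: many near-duplicate variants of the same concept.
mdw_like = ["event_ts", "event_ts_utc", "event_time", "event_timestamp"]

gap = mean_pairwise_similarity(mdw_like) - mean_pairwise_similarity(bird_like)
```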

Results

Configuration                      Gemini 2.5 Flash   Qwen 2.5-32B
Schema only                        34.8%              20.2%
+ Query annotation descriptions    38.0% (+3.2%)      19.4% (-0.8%)
+ Query pattern descriptions       36.8% (+2.0%)      18.4% (-1.8%)
Schema + samples                   42.2%              28.2%
+ Query pattern descriptions       44.0% (+1.8%)      22.4% (-5.8%)

Table: BIRD-Dev results (Execution Accuracy). Query-log descriptions improve Gemini performance by 2–3 percentage points.

On MDW-AMBIG, the improvements were dramatically larger:

Configuration                      Gemini 2.5 Flash   Qwen 2.5-32B
Schema only                        36%                18%
+ Query pattern descriptions       52% (+16%)         8% (-10%)
Schema + samples                   42%                12%
+ Query pattern descriptions       52% (+10%)         4% (-8%)

Table: MDW-AMBIG results. Query pattern descriptions boost Gemini accuracy by up to 16 percentage points on real-world data, though smaller models like Qwen struggle with the added context length.

The descriptions help LLMs by surfacing exact value formats and common filter expressions that would otherwise require domain expertise. For example, knowing that bond_type values are '=', '#', and '-' (not 'double', 'triple', 'single') prevents common prediction errors.
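A minimal sketch of how such value-format hints could be distilled from logs: a toy regex pulls the string literals that queries compare against each column, then renders them as a one-line description hint. The real pipeline derives this from parsed expressions rather than regexes, and the queries below are illustrative.

```python
import re
from collections import Counter

# Matches "column = 'literal'" and "column IN ('literal'" filter patterns.
FILTER_RE = re.compile(r"(\w+)\s*(?:=|IN\s*\()\s*'([^']*)'", re.IGNORECASE)

def value_hints(queries, top=5):
    """Collect literal values filtered against each column and render
    the most frequent ones as a description hint."""
    seen = {}
    for sql in queries:
        for column, literal in FILTER_RE.findall(sql):
            seen.setdefault(column, Counter())[literal] += 1
    return {
        col: "common filter values: "
        + ", ".join(repr(v) for v, _ in counts.most_common(top))
        for col, counts in seen.items()
    }

logs = [
    "SELECT COUNT(*) FROM bond WHERE bond_type = '='",
    "SELECT COUNT(*) FROM bond WHERE bond_type = '#'",
    "SELECT * FROM bond WHERE bond_type IN ('-')",
]
hints = value_hints(logs)
```

For `bond_type`, this yields a hint listing `'='`, `'#'`, and `'-'`, which is precisely the signal that steers an LLM away from guessing `'double'` or `'single'`.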

Figure: Schema relationships recovered purely from query log analysis, without any predefined foreign key documentation.

Key Takeaways

  • Query-log-informed descriptions significantly improve Text-to-SQL on real-world data (up to 16 percentage points on MDW-AMBIG), with moderate gains on benchmarks (2–3 points on BIRD)
  • Query pattern mining outperforms query annotation, especially for complex industrial queries where exact expressions matter more than textual paraphrases
  • Larger, more capable models benefit from added context; smaller models can be overwhelmed by it
  • Real-world schemas are fundamentally different from benchmarks — more ambiguous names, more complex expressions, and larger schemas that exceed context windows
  • Automated schema documentation is cheap (~$0.50 for an entire production warehouse) but requires ongoing maintenance as schemas evolve
  • The approach can recover implicit schema relationships (joins, foreign keys) that are undocumented, directly from usage patterns