Towards Efficient Data Wrangling with LLMs using Code Generation
2024/06/09
By Xue Li and Till Döhmen
Overview
Every data analytics workflow hits the same bottleneck: messy data. Column formats are inconsistent, values need cleaning, entities need matching. The standard approach today is to throw an LLM at every single row — which works, but costs a fortune and takes forever on anything beyond a toy dataset.
This paper from MotherDuck and the University of Amsterdam asks a better question: what if, instead of asking an LLM to do the wrangling, you ask it to write the code that does the wrangling? Generate a Python function once, run it on a million rows for free. The results are compelling: an improvement of up to 37 F1 points on data transformations, at a fraction of the compute cost.
The Problem with Per-Row LLM Processing
The naive approach to LLM-powered data wrangling is straightforward: send each row to GPT-4, get back the cleaned version. It works surprisingly well for small datasets. But for a table with a million rows, you're looking at hours of API calls, hundreds of dollars in costs, and non-deterministic outputs — the same row might get different results on different runs.
Fine-tuned smaller models need labeled training data, which defeats the purpose of automation. Programming-by-example methods are deterministic and fast but struggle with semantically complex tasks. None of these approaches alone checks all the boxes that real users care about: quality, speed, cost, determinism, and interpretability.
The Code Generation Approach
The key insight is to use LLMs for what they're actually good at — reasoning about a task and writing code — rather than using them as a row-level execution engine. Given a task description and a few examples, the LLM generates a Python function. That function runs natively on the full dataset, with all the efficiency and determinism of regular code execution.
The framework follows a generate-verify-rank pipeline:
- Generate: The LLM receives a task prompt with a few demonstration examples and produces candidate code using the ReAct prompting framework (reason first, then act)
- Verify: Each candidate is checked for executability, then evaluated against the demonstration data for accuracy
- Rank: Multiple candidates are generated across iterations, and the best-performing one wins
If a candidate fails validation, the error message gets fed back to the LLM for refinement — a simple but effective feedback loop.
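The verify-and-rank steps can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each candidate is a string of generated Python defining a `transform` function (the entry-point name is an assumption), and it omits the LLM call and the error-feedback refinement loop.

```python
def verify_and_rank(candidates, demos):
    """Score each candidate function on the demonstration pairs
    and return the best one along with its accuracy."""
    best_fn, best_acc = None, -1.0
    for src in candidates:
        scope = {}
        try:
            exec(src, scope)          # executability check
            fn = scope["transform"]   # assumed entry-point name
            correct = sum(fn(x) == y for x, y in demos)
        except Exception:
            # in the real pipeline, the error message would be fed
            # back to the LLM for a refined candidate
            continue
        acc = correct / len(demos)
        if acc > best_acc:
            best_fn, best_acc = fn, acc
    return best_fn, best_acc
```

The winning function is then applied to the full dataset with ordinary, deterministic code execution.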
What Works (and What Doesn't)
The results paint a clear picture of where code generation shines and where it falls short.
Data Transformation is where code generation dominates. On the BingQL-semantics benchmark, it hit 91.6 F1 — blowing past both the programming-by-example baseline (32.0) and per-row LLM processing (54.0). Tasks like unit conversion, format standardization, and string manipulation are a natural fit: they follow consistent rules that can be expressed in code.
| Benchmark | PBE | Per-Row LLM | Code Gen (this paper) |
|---|---|---|---|
| BingQL-semantics | 32.0 | 54.0 | 91.6 |
| BingQL-Unit | 96.0 | N/A | 95.0 |
| Stack-overflow | 63.0 | 65.3 | 87.4 |
| FF-GR-Trifacta | 91.0 | N/A | 83.7 |
| Head cases | 82.0 | N/A | 74.6 |
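To make the "natural fit" concrete, here is an illustrative example (not taken from the paper) of the kind of function the pipeline might generate for a format-standardization task, normalizing mixed date formats to ISO 8601. The specific format list is an assumption:

```python
from datetime import datetime

def standardize_date(value: str) -> str:
    """Normalize mixed date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%d %b %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave unparseable rows untouched
```

Once generated and verified, a function like this runs over a million rows in seconds, with identical output on every run.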
Entity Matching is trickier. Some datasets (like Fodors-Zagats at 95.5 F1) can be handled well through string manipulation, but others (like DBLP-ACM at 19.7) require genuine semantic understanding that code alone can't capture. You can't write a regex that knows "J. Smith" and "John Smith" are the same person.
Error Detection shows the same split. Systematic errors like typos in a known vocabulary? Code nails it (100% on the Adult dataset). Errors that require domain knowledge to spot? Code struggles (23.5 F1 on Hospital).

The Routing Vision
The paper's most interesting contribution might be its honesty about limitations. Rather than claiming code generation solves everything, the authors propose a hybrid workflow with intelligent routing: a task router that decides whether a given wrangling task can be solved by code, and a data router that identifies which rows within a task can be handled by the generated code and which need to fall back to per-row LLM processing.
This is a pragmatic architecture. For a million-row table, maybe 95% of rows follow a clean pattern that code can handle. The remaining 5% of edge cases get routed to an LLM. You get the quality of per-row processing where you need it, and the efficiency of code execution everywhere else.
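A data router along these lines could be sketched as follows. Everything here is hypothetical: `code_fn` stands for the generated function, `llm_fn` for a per-row LLM call, and `validate` for a cheap output check that decides when to fall back.

```python
def route_and_clean(rows, code_fn, llm_fn, validate):
    """Apply the generated function where its output validates;
    fall back to per-row LLM processing for the rest."""
    results = []
    for row in rows:
        try:
            out = code_fn(row)
            if validate(out):
                results.append(out)
                continue
        except Exception:
            pass
        results.append(llm_fn(row))  # expensive fallback, rare path
    return results
```

With 95% of rows handled by `code_fn`, the LLM cost drops by roughly 20x while edge cases still get full semantic treatment.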

Why This Matters for MotherDuck
This work came directly from talking to MotherDuck users. They wanted wrangling automation that was fast, cheap, deterministic, and interpretable — not just accurate. The code generation approach delivers on all four, and the vision of generating SQL macros (not just Python) would let these transformations run directly inside DuckDB, making the entire pipeline even more efficient.
The next steps include expanding to SQL macro generation, building the routing components, and creating better benchmark datasets that reflect real-world data complexity.
About
Presented at the Workshop on Data Management for End-to-End Machine Learning (DEEM '24) at SIGMOD, June 9, 2024, Santiago, Chile. Authors: Xue Li (MotherDuck & University of Amsterdam) and Till Döhmen (MotherDuck). Code available at github.com/effyli/efficient_llm_data_wrangling.