LLMs Still Corrupt Documents When You Delegate Real Work
Delegating work to an AI assistant feels different from asking a chatbot a question.
In a normal chat, the model gives you an answer and you decide whether to trust it. In a delegated workflow, the model changes the thing you care about: a source file, a ledger, a subtitle track, a recipe, a circuit description, a music score, a calendar, or some other structured document. The output is no longer advice. It is the working copy.
That is the trust problem behind DELEGATE-52, a Microsoft Research benchmark described in the paper “LLMs Corrupt Your Documents When You Delegate.” The paper is not saying models are useless at document editing. It is saying the current failure mode is worse than a visible refusal or a bad answer. The model often completes the requested edit while quietly damaging unrelated parts of the document.
That distinction matters. A system can appear helpful at the task level and still be unsafe at the workflow level.
The Benchmark Is Built Around Round Trips
DELEGATE-52 tests long delegated document workflows across 52 professional domains. The domains are intentionally broad: Python, Docker, JSON, Graphviz, crystallography, Lean math, molecules, aviation, music notation, subtitles, 3D objects, accounting ledgers, genealogy, transit, recipes, job boards, and more.
The benchmark does not ask a model to answer trivia about those files. It asks the model to edit them.
The core trick is a round-trip task:
- Start with a real seed document.
- Ask the model to perform a structural edit.
- Ask the model to undo that edit.
- Compare the reconstructed document with the original.
For example, an accounting ledger might be split into category-specific files, then merged back into one chronological ledger. A perfect delegate should recover the original semantics. If the recovered ledger drops transactions, changes amounts, mangles account names, or loses ordering that matters, the score falls.
This round-trip structure is useful because it avoids needing hand-written reference answers for every task. The original document is the reference. The model is allowed to transform it, but after the inverse operation it should be back where it started.
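The round-trip idea can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's harness: the ledger rows and the names `split_by_category`, `merge_chronological`, and `round_trip_score` are hypothetical, and the two deterministic functions stand in for the two model calls.

```python
from collections import defaultdict

# Hypothetical ledger rows: (date, category, amount). Not the benchmark's format.
LEDGER = [
    ("2024-01-03", "rent", -1200.00),
    ("2024-01-05", "food", -54.20),
    ("2024-01-09", "salary", 3100.00),
    ("2024-01-12", "food", -23.75),
]

def split_by_category(ledger):
    """Forward edit: group rows into one list per category."""
    files = defaultdict(list)
    for row in ledger:
        files[row[1]].append(row)
    return dict(files)

def merge_chronological(files):
    """Inverse edit: merge the per-category lists back into date order."""
    rows = [row for group in files.values() for row in group]
    return sorted(rows, key=lambda row: row[0])

def round_trip_score(original, reconstructed):
    """Percentage of original rows that survive unchanged at the same position."""
    kept = sum(1 for a, b in zip(original, reconstructed) if a == b)
    return 100.0 * kept / len(original)

restored = merge_chronological(split_by_category(LEDGER))
print(round_trip_score(LEDGER, restored))  # 100.0 for this lossless pair
```

Because the original is the reference, any changed amount or dropped row in `restored` lowers the score without anyone having to write an expected answer by hand.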
The paper then chains these round trips into relays. Each round trip costs two model interactions, one for the edit and one for its inverse, so a 10-round relay means 20 interactions. That is much closer to how delegated work actually feels: not one edit, but a session of repeated changes where small and large mistakes can accumulate.
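A relay can be sketched as a loop over edit pairs that records the score after every round trip. The code below is an illustrative assumption rather than the paper's implementation: `run_relay`, `line_score`, and the toy edits are invented names, and the upper/lower pair is a deliberately lossy stand-in for a model that fails to restore the original.

```python
def line_score(original, current):
    """Percentage of original lines still intact at the same position."""
    kept = sum(1 for a, b in zip(original, current) if a == b)
    return 100.0 * kept / len(original)

def run_relay(document, edit_pairs, score=line_score):
    """Chain round trips. Each (forward, inverse) pair is two interactions."""
    current = list(document)
    trajectory = []
    for forward, inverse in edit_pairs:
        current = inverse(forward(current))
        # Always score against the ORIGINAL, so earlier damage stays visible.
        trajectory.append(score(document, current))
    return trajectory

def reverse(lines):
    return lines[::-1]

def to_upper(lines):
    return [line.upper() for line in lines]

def to_lower(lines):
    return [line.lower() for line in lines]

DOCUMENT = ["Alpha", "beta", "Gamma", "delta"]
# reverse/reverse is lossless; upper/lower loses the original capitalisation.
PAIRS = [(reverse, reverse), (to_upper, to_lower), (reverse, reverse)]

TRAJECTORY = run_relay(DOCUMENT, PAIRS)
print(TRAJECTORY)  # [100.0, 50.0, 50.0]
```

The third round trip is perfectly reversible, yet the score stays at 50.0: once damage enters the working copy, later correct edits do not repair it.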
The Documents Are Not Toy Prompts
Each work environment contains a seed document, 5 to 10 reversible edit pairs, and distractor context. The seed documents are real public documents, not synthetic templates. In the full benchmark there are 310 work environments; the public Hugging Face release includes 234 environments across 48 domains where redistribution is allowed.
That matters because real documents have boring, fragile details:
- file names that must stay exact
- numeric values that must not drift
- domain-specific syntax
- repeated sections that look similar but are not interchangeable
- metadata that is easy to drop
- ordering constraints that are not obvious from plain English
The benchmark also includes distractor files: related documents that are not needed for the task. This reflects a real retrieval-augmented workspace. When an assistant edits a project folder or a knowledge base, it often sees relevant and irrelevant material together. A good delegate has to know what to ignore.
The evaluation is domain-specific. A generic text similarity score would miss too much. A recipe evaluator should know that changing 200g of butter to 800g is serious. A subtitle evaluator should care about timing and text. A ledger evaluator should care about transactions. DELEGATE-52 therefore parses each domain into structured representations and scores semantic preservation with custom evaluators.
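The gap between generic similarity and a domain-aware score is easy to demonstrate. The sketch below assumes a made-up `name: quantity unit` recipe format and a hypothetical `recipe_score` function; the benchmark's real evaluators are more involved, so treat this only as an illustration of the principle.

```python
import difflib
import re

ORIGINAL = "flour: 500 g\nbutter: 200 g\nsugar: 150 g\neggs: 3\n"
EDITED = "flour: 500 g\nbutter: 800 g\nsugar: 150 g\neggs: 3\n"

def parse_recipe(text):
    """Parse 'name: quantity [unit]' lines into {name: (quantity, unit)}."""
    items = {}
    for line in text.splitlines():
        match = re.fullmatch(r"(\w+): (\d+(?:\.\d+)?)(?: (\w+))?", line.strip())
        if match:
            name, quantity, unit = match.groups()
            items[name] = (float(quantity), unit)
    return items

def recipe_score(original, edited):
    """Percentage of ingredients whose quantity and unit are preserved."""
    reference, candidate = parse_recipe(original), parse_recipe(edited)
    kept = sum(1 for name, value in reference.items()
               if candidate.get(name) == value)
    return 100.0 * kept / len(reference)

# Character-level similarity barely notices a single changed digit.
text_similarity = difflib.SequenceMatcher(None, ORIGINAL, EDITED).ratio()
print(f"text similarity: {text_similarity:.3f}")
print(f"recipe score:    {recipe_score(ORIGINAL, EDITED):.1f}")  # 75.0
```

One wrong digit leaves the character-level ratio close to 1.0, while the parsed comparison reports that a quarter of the ingredients are no longer correct.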
That is one of the strongest parts of the work. It treats document reliability as a domain problem, not just a language-model problem.
The Main Result: Damage Compounds
The headline result is blunt: all tested models degraded documents over long workflows.
The researchers evaluated 19 models across six families, including OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot models. After 20 interactions, the strongest frontier models still lost a substantial amount of document content or correctness. The paper reports that Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4 corrupted about 25% of document content on average by the end of the long workflow.
The spread between models is large:
- Gemini 3.1 Pro ended highest in the main table, with an RS@20 score of 80.9.
- Claude 4.6 Opus ended at 73.1.
- GPT-5.4 ended at 71.5.
- GPT-5.2 ended at 66.1.
- GPT-4o ended at 14.7.
- GPT-5 Nano ended at 10.0.
The important point is not only the ranking. It is the curve. Performance drops as the interaction continues. A model can look strong after two interactions and still fall apart after twenty.
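A quick back-of-the-envelope calculation shows how small the per-step loss can be and still produce these end scores. The snippet assumes RS@20 is on a 0 to 100 scale where 100 means perfectly preserved, and it assumes a uniform decay across interactions purely for illustration; the per-interaction figures are derived here, not reported in the paper.

```python
# RS@20 values for the three strongest models, as quoted in the list above.
RS_AT_20 = {"Gemini 3.1 Pro": 80.9, "Claude 4.6 Opus": 73.1, "GPT-5.4": 71.5}

mean_score = sum(RS_AT_20.values()) / len(RS_AT_20)
mean_loss = 100.0 - mean_score
print(f"mean loss after 20 interactions: {mean_loss:.1f}%")  # 24.8%

def per_interaction_retention(final_score, interactions=20):
    """Retention per interaction under an ASSUMED uniform geometric decay."""
    return (final_score / 100.0) ** (1.0 / interactions)

for name, score in RS_AT_20.items():
    retention = per_interaction_retention(score)
    print(f"{name}: keeps about {100 * retention:.2f}% per interaction")
```

Under that assumption, a model that preserves roughly 98 to 99 percent of the document on each interaction still ends near the scores above, which is why a single clean edit says little about a 20-step session.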
This is exactly where many product demos are misleading. A demo usually shows one clean edit. Delegated work is not one edit. It is edit after edit after edit, with the user progressively losing the ability to inspect every unchanged line.
Short Tests Hide Long Workflow Risk
One of the more practical findings is that short-term performance does not reliably predict long-term performance.
The paper gives examples where models have similar scores early but diverge sharply later. A model that survives a two-step edit is not necessarily a model that can carry a document through a day of changes. That should change how teams evaluate AI editing systems.
If your acceptance test is “make this one change and show me the diff,” you are testing the first round trip. You are not testing the workflow.
For real delegation, the questions need to be longer:
- What happens after 20 edits?
- What happens when unrelated files are nearby?
- What happens when the document is 10,000 tokens instead of 2,000?
- What happens when the user asks for split, merge, sort, classify, and restore operations in sequence?
- What happens when the model has to preserve obscure syntax it does not deeply understand?
DELEGATE-52 is valuable because it makes those questions measurable.
Tool Use Did Not Fix It
It is tempting to assume the answer is agents. Give the model tools. Let it read files, write files, delete files, and run Python. Surely that should reduce corruption.
In this benchmark, the basic agentic harness did not help. The tested models performed worse with tools than without tools, with an average additional degradation of about 6% by the end of the simulation.
The paper is careful about this. The harness was basic, not an optimized state-of-the-art agent system. But the result is still useful because it exposes a common assumption: tools do not automatically create reliability.
Tools add new burdens:
- the model has to decide which files to inspect
- it has to choose whether to edit manually or programmatically
- it spends more tokens managing tool calls
- it may read distractor files
- it may overwrite rather than patch
- it may create filename or workspace-state errors
In the paper’s experiments, models made 8 to 12 tool calls on average per task and consumed 2 to 5 times more input tokens than the no-tool setup. Better models used code execution more effectively, but the overall agentic mode still degraded documents more in the tested setup.
The lesson is not “never use tools.” The lesson is that an agent harness is not a verification layer. It is another execution path that needs its own reliability tests.
Larger Documents Make the Problem Worse
The benchmark also varies document size. For GPT-5.4, increasing the document from 1k to 10k tokens worsened degradation, and the gap widened over longer interactions.
This is the exact shape of a production problem. The first edit might be fine. The fifth edit might still look fine. But as the file grows and the session gets longer, hidden drift compounds. The paper describes document size and interaction length as multiplicative rather than isolated effects.
Distractor context behaves similarly. Removing distractors improves scores only modestly at the beginning, but the benefit grows by the end of the workflow. In other words, retrieval precision matters more over time than a short evaluation might suggest.
This should make teams cautious about “just give the agent the whole repo” or “just attach the whole folder” workflows. More context can help, but irrelevant context is not free.
The Failures Are Sparse And Severe
The most interesting analysis is about failure shape.
The models are not mostly failing through a smooth stream of tiny harmless edits. The paper finds that much of the total degradation comes from sparse critical failures: individual round trips that drop the score by at least 10 points.
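If you log a score after every round trip, you can measure this failure shape in your own sessions. The function below is a sketch with hypothetical names, and the trajectory is invented to show the arithmetic; only the 10-point threshold comes from the definition above.

```python
def critical_failure_share(trajectory, start=100.0, threshold=10.0):
    """Fraction of total score loss caused by single drops >= threshold."""
    drops = []
    previous = start
    for score in trajectory:
        drops.append(max(0.0, previous - score))  # ignore any recoveries
        previous = score
    total = sum(drops)
    critical = sum(drop for drop in drops if drop >= threshold)
    return critical / total if total else 0.0

# Invented scores after each of 10 round trips, for illustration only.
SCORES = [99.0, 98.5, 98.0, 84.0, 83.5, 83.0, 82.5, 70.0, 69.5, 69.0]

share = critical_failure_share(SCORES)
print(f"{share:.0%} of the loss came from critical failures")  # 85%
```

In this made-up session, two of the ten round trips account for most of the damage, which is the pattern to look for before concluding that degradation is gradual.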
That matches how AI editing failures often feel in practice. Most of the diff is fine, and then one unrelated section is gone. Or a table still exists, but one column is subtly wrong. Or a generated file looks plausible, but the identifiers no longer match the rest of the system.
For weaker models, degradation often comes from deletion: content disappears. For stronger frontier models, degradation is more often corruption: the content remains present but wrong.
That is the harder failure mode to catch. Missing content can be seen in a diff. Corrupted content can survive a glance because the shape of the document still looks right.
Python Is The Outlier
One bright spot is Python. In the paper, Python is the only domain where a majority of tested models reach the benchmark’s “ready” threshold after 20 interactions.
That result should not be overgeneralized. It does not mean AI coding agents are solved. It means code has properties that help models and evaluators:
- syntax is explicit
- errors often surface as soon as the code runs
- tests and linters can provide feedback
- code corpora are heavily represented in training data
- many transformations are structured and checkable
Other domains do not always have that advantage. A music notation file, accounting ledger, transit schedule, recipe, genealogy record, or crystallography file may be textual, but it is not “just text.” It has its own invariants.
This is the jagged frontier in a very practical form. The model can be impressive in one document type and unreliable in another.
What This Means For Product Builders
If you are building AI document workflows, the benchmark points to several design rules.
First, preserve originals aggressively. A delegated edit should never destroy the only copy. Keep snapshots, checkpoints, and reversible histories.
Second, prefer patch-based editing where possible. Regenerating an entire document to make a local change increases the surface area for unrelated damage.
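One way to enforce locality is to reject any edit that touches lines outside the region it was supposed to change. The sketch below uses Python's standard `difflib`; the function names and the sample config are assumptions made for illustration.

```python
import difflib

def changed_line_numbers(before, after):
    """Return the 1-based line numbers of `before` that an edit touched."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    touched = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i2 > i1:
            touched.update(range(i1 + 1, i2 + 1))  # replaced or deleted lines
        else:
            touched.add(max(i1, 1))  # pure insertion: blame the preceding line
    return touched

def edit_is_local(before, after, allowed_lines):
    """True if every touched line falls inside the allowed region."""
    return changed_line_numbers(before, after) <= set(allowed_lines)

BEFORE = ["name: api", "replicas: 2", "image: app:1.4", "port: 8080"]
GOOD = ["name: api", "replicas: 3", "image: app:1.4", "port: 8080"]
BAD = ["name: api", "replicas: 3", "image: app:1.4", "port: 8008"]

# The request was "change replicas", so only line 2 may differ.
print(edit_is_local(BEFORE, GOOD, allowed_lines=[2]))  # True
print(edit_is_local(BEFORE, BAD, allowed_lines=[2]))   # False
```

A check like this does not prove the requested change is right, but it does catch the unrelated damage that a full regeneration can introduce.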
Third, use domain-aware validators. Generic “looks good” review is weak. Ledgers need ledger checks. Subtitles need subtitle checks. Config files need parsers. Code needs tests. Documents with numeric facts need consistency checks.
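Validators do not need to be elaborate to be useful. As a rough sketch, here are two of them: a JSON parse check and a check for a hypothetical `date,account,amount` ledger format whose total must be conserved. The ledger layout and function names are assumptions, not anything defined by the benchmark.

```python
import json
from decimal import Decimal, InvalidOperation

def validate_json(text):
    """Return a list of problems; empty means the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    return []

def validate_ledger(text, expected_total):
    """Check row shape and that the amounts still sum to the known total."""
    problems = []
    total = Decimal("0")
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split(",")
        if len(parts) != 3:
            problems.append(f"line {number}: expected 3 fields, got {len(parts)}")
            continue
        try:
            total += Decimal(parts[2])
        except InvalidOperation:
            problems.append(f"line {number}: bad amount {parts[2]!r}")
    if total != Decimal(expected_total):
        problems.append(f"total is {total}, expected {expected_total}")
    return problems

LEDGER = "2024-01-03,rent,-1200.00\n2024-01-05,food,-54.20\n2024-01-09,salary,3100.00\n"
DRIFTED = LEDGER.replace("-54.20", "-45.20")  # two digits swapped

print(validate_ledger(LEDGER, "1845.80"))   # []
print(validate_ledger(DRIFTED, "1845.80"))  # ['total is 1854.80, expected 1845.80']
```

`Decimal` is used instead of floats so that the total comparison is exact, which matters when the whole point of the check is to catch a small numeric drift.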
Fourth, evaluate long sessions, not just single edits. If your product is meant for delegated work, test 10-, 20-, and 50-step workflows.
Fifth, treat retrieval as part of reliability. Irrelevant context can silently degrade editing quality. Better context selection is not just a cost optimization; it is a correctness feature.
Sixth, separate execution from verification. The same model that performed the edit should not be the only thing deciding whether the edit preserved the document.
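Snapshots and independent verification can be combined in a small wrapper: the edit runs on the document, separate checks compare the result with the snapshot, and the snapshot wins if any check fails. Everything here is a hypothetical sketch; in a real system `edit` would be the model call and the checks would be domain validators.

```python
import re
from collections import Counter

def numbers_unchanged(before, after):
    """Check for edits that should not touch any numeric value."""
    pattern = r"\d+(?:\.\d+)?"
    if Counter(re.findall(pattern, before)) != Counter(re.findall(pattern, after)):
        return ["numeric values changed"]
    return []

def guarded_edit(document, edit, checks):
    """Apply `edit`; keep the result only if every independent check passes."""
    snapshot = document  # strings are immutable, so this is a safe checkpoint
    candidate = edit(document)
    problems = [msg for check in checks for msg in check(snapshot, candidate)]
    if problems:
        return snapshot, problems  # roll back to the preserved original
    return candidate, []

DOC = "Acme, 1200.00\nGlobex, 310.50\n"

def good_edit(text):
    """Stand-in for a model call that renames a vendor correctly."""
    return text.replace("Acme", "Acme Corp")

def bad_edit(text):
    """Stand-in for a model call that also corrupts an amount."""
    return good_edit(text).replace("1200.00", "1220.00")

print(guarded_edit(DOC, good_edit, [numbers_unchanged]))
print(guarded_edit(DOC, bad_edit, [numbers_unchanged]))
```

The design choice that matters is that `guarded_edit` never asks the editor whether its own output is correct; the verdict comes from code that only sees the before and after states.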
What This Means For Users
For users, the practical advice is simple: do not delegate beyond your ability to verify.
That does not mean avoid AI assistants. It means match the workflow to the verification layer.
Good candidates for delegation:
- local edits with clear diffs
- code changes covered by tests
- structured files with parsers or validators
- transformations where the original can be restored
- repetitive changes with easy spot checks
Riskier candidates:
- large documents with many similar sections
- niche formats you cannot personally inspect
- financial, legal, medical, or compliance records
- workflows involving many sequential edits
- tasks where “mostly right” is still expensive
The danger zone is not the model saying “I cannot do that.” It is the model confidently producing a plausible artifact that has drifted away from the original.
The Benchmark Itself Has Limits
The paper is clear about limitations.
The simulated interactions are single-turn instructions. Real users often underspecify requests, ask follow-up questions, change their mind, and carry state across sessions. That may make real workflows harder, not easier.
The benchmark is also constrained to reversible document edits. Many knowledge-work tasks are not cleanly reversible. Planning, negotiation, communication, and creative development do not always have an obvious original document to reconstruct.
The evaluation favors domains where parsing is feasible. That is reasonable for a benchmark, but it means the hardest open-ended tasks are only partially covered.
Still, those limits do not weaken the main finding. If models already corrupt structured reversible documents under controlled conditions, then production delegation needs stronger guardrails.
The Real Lesson
The paper lands at an uncomfortable but useful point: delegation is not the same as generation.
Generation asks, “Can the model produce something useful?”
Delegation asks, “Can the model change my existing work without damaging what must remain true?”
That second question is much harder. It requires memory, precision, domain understanding, context filtering, and verification. It also requires product design that assumes silent corruption is possible.
The future of AI work will not be decided only by which model writes the best first draft. It will be decided by which systems can keep important artifacts intact while changing them over time.
DELEGATE-52 gives that problem a concrete shape. The current answer is sobering: models are improving quickly, but they are not yet trustworthy delegates across most professional document domains.
Use them. But keep the diff, run the checks, preserve the original, and make the verification layer stronger than the demo.