Does AGENTS.md Actually Help Coding Agents? A New Study Has Answers

Summary of Elvis Saravia’s AI Newsletter, Feb 26, 2026

Main Thesis

Developers widely assume that repository-level context files — CLAUDE.md, AGENTS.md, CONTRIBUTING.md — make coding agents meaningfully better. A new paper from ETH Zurich’s SRI Lab puts that assumption to a rigorous empirical test, and the results are more nuanced than most practitioners expect.

Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Background: The Problem

Context files have proliferated alongside coding agents, but adoption has outpaced evaluation: developers write them, agents read them, and everyone assumes the relationship is positive.
Standard benchmarks like SWE-bench mostly cover popular repositories, which tend not to have context files, making them a poor testbed for this question.

The New Benchmark: AGENTbench

The paper introduces AGENTbench: 138 task instances from 12 less-popular Python repositories, all of which already have developer-written context files.
Context files in AGENTbench average 641 words across 9.7 sections — detailed, real-world guidance, not trivial one-liners.
Three agents were tested: Claude Code (Sonnet-4.5), Codex (GPT-5.2 / GPT-5.1 mini), and Qwen Code (Qwen3-30b-coder).
Each agent ran tasks under three conditions: no context file, LLM-generated context file, and developer-written context file.
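The three-condition comparison comes down to a per-condition success rate and its difference from the no-context baseline. The sketch below illustrates that bookkeeping only; it is not the paper's evaluation harness. The condition labels (`none`, `llm`, `human`), the function names, and the toy results are assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_condition(results):
    """Map each condition to the fraction of task instances resolved.

    `results` is an iterable of (condition, resolved) pairs, a
    hypothetical format chosen for this sketch.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for condition, resolved in results:
        totals[condition] += 1
        solved[condition] += int(resolved)
    return {c: solved[c] / totals[c] for c in totals}


def delta_vs_baseline(rates, baseline="none"):
    """Percentage-point change of each condition relative to the baseline."""
    return {
        c: round((r - rates[baseline]) * 100, 1)
        for c, r in rates.items()
        if c != baseline
    }


# Made-up toy outcomes, NOT data from the study.
toy_results = [
    ("none", True), ("none", True), ("none", False), ("none", False),
    ("llm", True), ("llm", False), ("llm", False), ("llm", False),
    ("human", True), ("human", True), ("human", True), ("human", False),
]

rates = success_rate_by_condition(toy_results)
print(delta_vs_baseline(rates))
```

Reporting deltas in percentage points against the same baseline makes the generated and developer-written conditions directly comparable.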

Key Findings

🔴 LLM-Generated Context Files Hurt Performance

On SWE-bench Lite: LLM-generated files drop task success by ~0.5%.
On AGENTbench: the drop is ~2%.
Across all conditions, context files raise inference cost, with 14–22% more reasoning tokens and 2–4 additional steps per task, regardless of whether they help.
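To see what a 14–22% increase in reasoning tokens means at volume, a back-of-the-envelope calculation helps. In the sketch below only the 14% and 22% bounds come from the summary above; the workload size (10,000 reasoning tokens per task, 1,000 tasks) and the function name are hypothetical.

```python
def extra_reasoning_tokens(baseline_tokens_per_task, num_tasks,
                           pct_low=14, pct_high=22):
    """Return (low, high) extra reasoning tokens for a batch of tasks.

    Percentages are integers so the arithmetic stays exact.
    """
    total = baseline_tokens_per_task * num_tasks
    return total * pct_low // 100, total * pct_high // 100


# Hypothetical workload: 10,000 reasoning tokens per task, 1,000 tasks.
low, high = extra_reasoning_tokens(10_000, 1_000)
print(f"extra reasoning tokens: {low:,} to {high:,}")
# prints "extra reasoning tokens: 1,400,000 to 2,200,000"
```

The overhead applies whether or not the file improves success, which is why it is worth estimating before adding a context file to a high-volume pipeline.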

🟢 Human-Written Context Files Help (On Their Own Turf)

Human-written files produce a ~4% improvement over no context on average across both benchmarks.
The gain is real, but it depends on the benchmark and on the quality of the file.

⚡ The Instruction-Following Paradox

Agents follow context file instructions faithfully: when a file mentions uv, agents invoke it about 1.6 times per instance, versus fewer than 0.01 times per instance when it is not mentioned.
But more instruction-following ≠ better outcomes. Agents explore more, run more tests, traverse more files — without meaningfully reaching the right code faster.
“A map of the whole city doesn’t tell you which building to walk into.”
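A per-instance usage rate like the one quoted for uv can be computed by counting tool invocations in agent traces and dividing by the number of instances. This is a minimal sketch under a hypothetical trace format (one list of shell command strings per task instance); it is not how the paper extracted its numbers, and the toy traces are made up.

```python
import shlex


def invokes(command, tool):
    """True if the shell command's first token is the given tool."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Fall back to whitespace splitting on unbalanced quotes.
        tokens = command.split()
    return bool(tokens) and tokens[0] == tool


def tool_calls_per_instance(traces, tool="uv"):
    """Average number of `tool` invocations per task instance."""
    if not traces:
        return 0.0
    calls = sum(
        1 for commands in traces for cmd in commands if invokes(cmd, tool)
    )
    return calls / len(traces)


# Made-up traces, NOT data from the study.
with_context = [["uv sync", "uv run pytest"], ["uv run pytest", "git diff"], ["pytest"]]
without_context = [["pip install -e .", "pytest"], ["pytest"]]

print(tool_calls_per_instance(with_context))
print(tool_calls_per_instance(without_context))
```

A metric like this captures whether instructions were followed, not whether following them helped, which is the distinction the section draws.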

🔍 Why Human Files Win: The Redundancy Problem

LLM-generated files tend to restate information already in READMEs and docs — additive noise, not additive value.
When existing documentation was removed before generating context files, LLM-generated files improved by 2.7% and actually outperformed human-written ones.
Human-written files capture non-obvious, non-redundant information: quirky CI setups, non-default tooling choices, undocumented conventions.
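One rough way to spot a redundant context file is to measure how much of its wording already appears in the existing docs. The sketch below uses word-trigram overlap as a crude proxy. This is an assumption chosen for illustration, not the paper's methodology, and the sample texts are invented.

```python
import re


def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in the text."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def redundancy(context_text, docs_text, n=3):
    """Fraction of the context file's n-grams that also occur in the docs.

    1.0 means every n-gram is already in the docs; 0.0 means none are.
    """
    ctx = word_ngrams(context_text, n)
    if not ctx:
        return 0.0
    return len(ctx & word_ngrams(docs_text, n)) / len(ctx)


# Invented sample texts for illustration only.
readme = "Install the package with pip and run pytest to execute tests."
restated = "Install the package with pip and run pytest"
novel = "Tests need the REDIS_URL variable set; CI uses a non-default runner"

print(redundancy(restated, readme))
print(redundancy(novel, readme))
```

A high score suggests the file mostly restates the README. A low score does not prove the content is useful, only that it is not a copy.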

Limitations

The study is limited to Python repositories; generalisability to TypeScript, Rust, or multi-language codebases is unknown.
Only measures issue resolution success, not security, consistency, or convention adherence.
No longitudinal data on how context file quality or agent utilisation evolves over time.

Practical Takeaways

Principle	Detail
Write for the gap	Only encode what the repo doesn’t already explain — non-default tool choices, unusual test configs, hidden constraints.
Avoid restating the README	A `CLAUDE.md` that duplicates existing docs likely hurts more than it helps.
Respect the cost floor	Every context file adds ~20% to inference cost. High-volume pipelines should weigh this carefully.
Fix LLM-generated files	Auto-generators should be designed to explicitly avoid restating existing docs and focus on extracting non-obvious conventions.
Keep files minimal and specific	Less is more — specificity beats comprehensiveness.

Bottom Line

Context files are not magic, but they are not useless either. Human-written, specific, non-redundant files improve agent performance. Auto-generated files that recycle existing documentation actively reduce it. In both cases, the mechanism is the same: agents follow instructions, and outcome quality depends entirely on instruction quality. Getting instruction quality right is both a context file design problem and a model training problem.
