AI Agents Weekly: Evaluating Agents

· 2 min read · Alex


AI Agents Weekly: Evaluating AGENTS.md & More

From Elvis Saravia's AI Newsletter, February 28, 2026

Main Thesis

This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted practice: using repository-level context files (like AGENTS.md or CLAUDE.md) to guide coding agents. Counterintuitively, research shows these files may be doing more harm than good.


🔬 Key Finding: AGENTS.md Files Hurt Coding Agent Performance

Researchers from UIUC and Microsoft Research evaluated whether repository-level context files actually improve coding agent performance on SWE-bench benchmarks.

Surprising results:

  • ❌ Lower success rates: both LLM-generated and human-written context files caused agents to solve fewer tasks than agents given no repository context at all.
  • 💸 Higher inference costs: context files increased inference costs by over 20%.
  • 🔍 Broader but less effective exploration: agents with context files explored more (more testing, more file traversal), but the additional constraints made tasks harder, not easier.
  • ✅ Minimal is better: the authors recommend that context files describe only minimal requirements rather than comprehensive specifications, since unnecessary constraints actively hurt performance.

Practical takeaway: Developers should rethink how they write AGENTS.md, CLAUDE.md, and similar files: focus on essential guardrails only, not exhaustive instructions.
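As a concrete illustration, a minimal context file in this spirit might contain only the guardrails an agent cannot discover on its own. The repository details below are hypothetical, not from the paper:

```markdown
# AGENTS.md

## Build & test
- Run `npm test` before committing.

## Hard constraints
- Do not modify files under `migrations/`.
- Keep public function signatures in `src/api/` backward compatible.
```

Everything else (style preferences, exhaustive architecture notes, step-by-step workflows) is the kind of over-specification the study found counterproductive.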

Paper


| Story | Summary |
| --- | --- |
| Perplexity Computer | Perplexity launches a computer-use agent for end-to-end task automation |
| Google Nano Banana 2 | Google releases Nano Banana 2 model for free |
| Sakana AI Doc-to-LoRA & Text-to-LoRA | Tools for fine-tuning models directly from documents or text |
| Notion Custom Agents 3.3 | Notion launches custom agent capabilities in version 3.3 |
| Nous Research Hermes Agent | Open-source agent model released by Nous Research |
| GPT-5.3-Codex | OpenAI makes GPT-5.3-Codex available to all developers |
| Mercury 2 | New reasoning diffusion LLM ships from Mercury |
| Qwen 3.5 Medium Series | Alibaba drops a new medium-sized Qwen model series |
| Claude Code Auto-Memory | Anthropic ships auto-memory across sessions for Claude Code |
| RoguePilot | Security vulnerability exposed in GitHub Copilot |
| Vercel Chat SDK | Vercel open-sources a Chat SDK for multi-platform bot development |

💡 Practical Takeaways

  1. Less is more when writing agent context files; avoid over-specifying agent behaviour.
  2. Benchmark your context files; don't assume that more instructions mean better agent performance.
  3. The AI tooling ecosystem is rapidly expanding across coding, browser automation, fine-tuning, and memory management.
  4. Security remains a concern as tools like RoguePilot highlight vulnerabilities in popular AI coding assistants.

Infographic