AI Agents of the Week – LLM Watch (March 8, 2026)

Main Thesis

This week’s research roundup highlights meaningful progress across five core challenge areas for autonomous AI agents: memory management, long-horizon planning, multi-agent collaboration, safety/evaluation, and practical tooling.

Key Findings

🧠 Memory & Continual Learning

Memex(RL): Solves the context window bottleneck by indexing past interactions externally, allowing agents to retrieve full-fidelity memories on demand — avoiding lossy summarization.
SkillNet: Builds a library of 200,000+ reusable skills, improving average rewards by 40% and cutting execution steps by 30% across benchmarks. Agents stop “reinventing the wheel.”

🗺️ Planning & Environment Interaction

HiMAP-Travel: Hierarchical multi-agent framework splits travel planning into strategic + parallel day-level execution. Achieves 52.78% validation pass rate on TravelPlanner (+8.67pp over baselines) with 2.5x lower latency.
T2S-Bench: “Structure of Thought” prompting yields +5.7% improvement across 8 text tasks; fine-tuning pushes this to +8.6%. How agents organize information matters as much as what they know.

🤝 Multi-Agent Collaboration

HACRL: Enables heterogeneous agents to learn from each other via verified rollout sharing. Their HACPO algorithm beats GSPO by 3.3% using only half the rollout cost.
Vivaldi: Role-structured multi-agent system for physiological time series. Key nuance: agentic pipelines help non-thinking models (+6.9 to +9.7 points) but can hurt thinking models (−14 points in relevance). Agentic reasoning is not universally beneficial.

🔒 Trust, Verification & Safety

AgentVista: A rigorous benchmark across 25 sub-domains. Even the best model (Gemini-3-Pro with tools) achieves only 27.3% overall accuracy. Hard tasks require 25+ tool-calling turns, exposing how far agents are from reliable real-world use.
Vivaldi reinforces that explicit tool-based computation wins for codifiable metrics; subjective targets show limited gains — value lies in selective externalization of computation.

🛠️ Tools & Frameworks

DARE: Distribution-aware retrieval for R’s statistical ecosystem, achieving 93.47% NDCG@10 — outperforming state-of-the-art embedding models by up to 17% with fewer parameters. RCodingAgent shows strong downstream gains.
SkillNet releases an interactive platform and Python toolkit alongside its skill repository for immediate developer use.
Memex(RL) provides an RL framework specifically for optimizing memory retrieval operations.

Practical Takeaways

Don’t rely on context windows alone — external indexed memory (Memex(RL)) is a superior architecture for long-running agents.
Reusable skill libraries dramatically cut redundant computation; SkillNet’s open toolkit is worth exploring now.
Parallelized hierarchical planning (HiMAP-Travel) is a viable pattern for complex constrained tasks.
Test before assuming agentic = better — Vivaldi’s results show thinking models can be degraded by agentic pipelines.
Current agent benchmarks are brutally hard — AgentVista’s 27.3% ceiling should calibrate expectations for production deployment.
Structured prompting (Structure of Thought) is a low-cost, high-value technique worth applying immediately.

Infographic

Infographic wide