AI Agents Weekly - Claude Sonnet

Β· 3 min read Β· Alex

AI Agents Weekly - Claude Sonnet

Read the original article

AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro, Stripe Minions & More

From Elvis Saravia’s AI Newsletter β€” February 21, 2026

Overview

This issue covers a packed week of AI agent developments, spanning major model releases, infrastructure tooling, agentic benchmarks, and real-world agent deployments.


πŸ”₯ Top Stories (Publicly Accessible)

1. Claude Sonnet 4.6 β€” Anthropic

Anthropic released Claude Sonnet 4.6 as the new default model for all Claude users on February 17, 2026.

Key highlights:

  • Computer Use Breakthrough: OSWorld scores jumped from 14.9% β†’ 72.5% β€” a nearly 5x improvement β€” making it the most capable model for autonomous GUI-based agent workflows.
  • 1M Token Context Window: Available in beta, enabling agents to process entire codebases, long documents, and multi-session histories without losing earlier context.
  • User Preference: In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time, especially in coding, instruction following, and nuanced reasoning.
  • Pricing: $3/$15 per million input/output tokens β€” cost-efficient for high-volume agentic deployments.

Practical Takeaway: Sonnet 4.6 is positioned as the go-to model for autonomous agent workflows, particularly those involving computer use, long-context reasoning, and code generation.


2. EVMBench β€” AI Agents vs. Smart Contract Security

OpenAI and Paradigm introduced EVMBench, a benchmark evaluating AI agents on smart contract security tasks across 120 curated vulnerabilities from 40 audits.

Key findings:

  • Exploit tasks are handled best β€” agents perform well when the goal is explicit (e.g., drain funds iteratively).
  • Detect and Patch tasks are harder β€” agents struggle with exhaustive auditing and maintaining full contract functionality after patching.
  • Detection Gap: Agents tend to stop after finding a single vulnerability rather than performing comprehensive audits β€” a critical limitation for security-critical deployments.
  • Scenarios sourced from open code audit competitions and Tempo blockchain (a purpose-built L1 for high-throughput stablecoin payments).

Practical Takeaway: AI agents show promise in offensive security (exploit generation) but are not yet reliable enough for defensive, exhaustive smart contract auditing without human oversight.


πŸ“° Paywalled Headlines (Titles Only)

The following stories are mentioned but locked behind the paid subscription:

  • Gemini 3.1 Pro β€” Google launches with 77% ARC-AGI-2 score
  • Stripe Minions β€” Coding agents shipped at scale
  • Cloudflare Code Mode MCP β€” 99.9% token savings reported
  • Qwen 3.5 β€” Alibaba drops new model with agentic vision capabilities
  • ggml.ai joins Hugging Face β€” Local AI inference collaboration
  • Anthropic measures AI agent autonomy in practice
  • AI agent autonomously publishes a hit piece β€” Autonomous content generation controversy
  • dmux β€” Multiplexes AI coding agents in parallel
  • New benchmarks for agent memory and reliability

πŸ“„ Papers Mentioned

No direct arxiv.org links were included in the accessible portion of the article. EVMBench is referenced via a blog post (no arxiv link provided in the visible content).


🧠 Key Takeaways

  1. Claude Sonnet 4.6 represents a step-change in computer use capability β€” the 5x OSWorld improvement is significant for production agent deployments.
  2. EVMBench highlights that AI agents are better attackers than defenders in smart contract security β€” important for teams considering AI-assisted auditing.
  3. The week broadly signals a maturing agentic infrastructure layer β€” from MCP tooling (Cloudflare) to parallel agent orchestration (dmux) to memory benchmarking.
  4. Cost-efficient, long-context models like Sonnet 4.6 are making large-scale multi-agent systems increasingly viable.

Infographic

Infographic wide