Overview
- EVMbench evaluates AI agents on detecting, patching, and exploiting Ethereum smart contract vulnerabilities using a standardized, open framework.
- The benchmark draws on 120 high‑severity issues from 40 real‑world audits, including scenarios from Stripe’s Tempo security review.
- Paradigm reports that top models’ exploit success rate rose from under 20% at the project’s start to above 70%, indicating rapid capability gains.
- OpenAI expanded the private beta of its Aardvark security research agent and committed $10 million in API credits to support defensive crypto research.
- Parallel efforts push agent autonomy: Sigil Wen’s Conway equips agents with wallets, x402 pay‑for‑compute, and deployment tools, and the Automaton reference agent demonstrates those capabilities. Dragonfly’s new fund cites agentic payments as a core thesis, while observers flag significant dual‑use risk.