Saturday — quieter day. Five strong items below; the cyber-eval thread and the new agent-cost paper are the ones to actually read.
🎯 Top 3 Things to Know
1. AISI publishes Mythos cyber evaluation — and a parallel one for GPT-5.5. The UK AI Security Institute's writeups reframe the story: it's not "one model is dangerous," it's "frontier offensive cyber capability is here across labs." Mythos was the first to clear AISI's 32-step "The Last Ones" corporate-network takeover end-to-end (~20 human hours); GPT-5.5 reportedly matches it. The policy question stops being "should this one model ship" and becomes "what's the across-lab posture." AISI on Mythos · AISI on GPT-5.5 · Anthropic preview page
2. First systematic study of agent token spend lands on arXiv. "How Do AI Agents Spend Your Money?" measures eight frontier LLMs on SWE-bench Verified. Five takeaways: agentic coding is orders of magnitude more expensive than chat, input tokens dominate even with caching, token spend on the same task varies by 30× across runs, more spend ≠ more accuracy (it peaks mid-budget then degrades), and models are bad at predicting their own cost. This is the empirical baseline that cost-aware design has been waiting for. arXiv:2604.22750
3. Federal CIO signals caution on Mythos rollout despite Project Glasswing. The federal CIO is reportedly cautious on broader Mythos adoption inside US government even as Anthropic onboards Glasswing partners (Amazon, Apple, Cisco, Microsoft, Palo Alto Networks, Linux Foundation). First real test of "release dangerous-capability models only to defenders" as a doctrine. CyberScoop · Anthropic Glasswing
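A rough way to see why input tokens can dominate in item 2: in a typical agent loop, every step resends the full history, so input grows with the square of the step count while output grows linearly. The sketch below is an illustration under that assumption, not the paper's methodology; the function name and all token counts are hypothetical.

```python
def loop_token_totals(
    steps: int,
    system_tokens: int,
    tool_result_tokens: int,
    output_tokens: int,
) -> tuple[int, int]:
    """Total (input, output) tokens for a loop that resends its whole history each step.

    Assumption: fixed-size model outputs and tool results, no context truncation.
    """
    total_in = 0
    total_out = 0
    context = system_tokens  # system prompt plus task description
    for _ in range(steps):
        total_in += context          # the entire history is sent as input
        total_out += output_tokens   # the model emits one action
        # the action and its tool result are both appended to the history
        context += output_tokens + tool_result_tokens
    return total_in, total_out


if __name__ == "__main__":
    tokens_in, tokens_out = loop_token_totals(
        steps=30, system_tokens=2000, tool_result_tokens=800, output_tokens=150
    )
    print(f"input={tokens_in} output={tokens_out} ratio={tokens_in / tokens_out:.0f}x")
```

Prompt caching typically lowers the price of the resent prefix but not the number of tokens resent, which is consistent with input still dominating when caching is on.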
🚀 Frontier Models & Features
- Saturday is quiet. No headline release in the last 24 hours. Carryover worth flagging: NVIDIA's Nemotron 3 Nano Omni (30B MoE, omni-modal) shipped Tuesday and is still the most interesting open release of the week. NVIDIA blog
🔬 Research Worth Reading
🔖 SYNTHESIZE CANDIDATE — "How Do AI Agents Spend Your Money?" (arXiv:2604.22750). First systematic measurement of where token spend goes in agentic coding; input tokens dominate, accuracy is non-monotonic in budget. Implications for harness design are unusually clean. Verdict: read full paper. arXiv
"Budget-Aware Tool-Use Enables Effective Agent Scaling" (arXiv:2511.17006). Per-call tool budgeting as the lever that makes scaling tractable. Verdict: read abstract, skim §3 if tuning a tool-heavy harness. arXiv
"LeWorldModel: Stable End-to-End JEPA from Pixels" (arXiv:2603.19312). Mila/NYU/Samsung/Brown — cleanest pixel-space JEPA yet (one regularizer instead of six, ~48× faster planning, latents that linearly decode to physical quantities). The post-LLM architecture thread is real. Verdict: read abstract — full paper if you care about world models. project · arXiv
🏢 Enterprise in the Wild
JPMorgan's internal LLM Suite hits 200,000 employee users with reported 83% faster research cycles for portfolio managers and ~360,000 hours/year of manual work automated. Useful internal-platform sizing reference. Lyzr writeup citing JPMorgan disclosures
Global pharma agentic deployment goes fully operational. A Fortune Global 500 pharmaceutical company has its agentic AI platform now live across regulatory reporting and payroll — one of the first end-to-end agent deployments in a regulated domain that's actually past pilot. GlobeNewswire
🛠️ Tooling & Ecosystem
⭐ ANTHROPIC: Claude Code MCP reliability. Auto-retry on transient MCP startup failures (up to 3×), /mcp now surfaces claude.ai connectors hidden by manually-added duplicates, and the needs-auth suppression bug is fixed. Lift: copy the bounded-retry-on-startup pattern; silent disconnects are the dominant operational failure in MCP harnesses right now. official changelog
Harness engineering coverage is consolidating into a real subgenre. This week's most-circulated reads are AlphaSignal's "Closer Look at Harness Engineering from Top AI Companies" and Cobus Greyling's GPT-5.5 computer-use deconstruction. Lift: both converge on the same pattern, that the agent's environment (sandbox, file abstractions, retry/feedback loops) is doing more work than model choice. AlphaSignal · Cobus Greyling
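The bounded-retry-on-startup pattern from the Claude Code item can be sketched roughly as follows. This is a generic illustration, not Claude Code's implementation: `TransientStartupError`, `start_with_retry`, and the backoff values are hypothetical, and "up to 3×" is read here as three retries after the first attempt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientStartupError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a connect timeout."""


def start_with_retry(
    start: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call start(); on a transient failure, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return start()
        except TransientStartupError:
            if attempt >= max_retries:
                # Retry budget exhausted: surface the failure instead of hiding it.
                raise
            # Exponential backoff with a small random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            sleep(delay)
            attempt += 1
```

Only the error type marked transient is retried; anything else (bad config, missing auth) propagates immediately, so a bounded retry does not mask failures that need a human.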
⚖️ Policy & Regulation
Bartz v. Anthropic settlement fairness hearing set for May 14. Final approval on the $1.5B class settlement, the last meaningful procedural milestone before payouts ($3,000/work). The Alsup ruling stands: training on legally-acquired books is fair use, storing pirated copies is not. Authors Guild
EU AI Act high-risk deadline still August 2, 2026. Carryover from yesterday: the Digital Omnibus that would have pushed compliance to December 2027 didn't clear last week's trilogue. Plan against the original date. EU AI Act tracker
📌 Watch List
- Cost-aware / token-economy agents. New empirical baseline this week — input tokens dominate spend even with caching, accuracy peaks mid-budget. The field is converging on "predict-then-budget" as the next agent primitive. arXiv:2604.22750 · arXiv:2511.17006
- World models / non-LLM architectures. LeWorldModel is the JEPA result that finally makes pixel-space training stable with minimal hyperparameter tuning. Worth tracking whether a Meta-FAIR follow-up scales it. arXiv:2603.19312
- Cross-lab offensive cyber capability. AISI's parallel Mythos and GPT-5.5 evaluations make this a frontier-wide story, not a one-model story. AISI Mythos eval · AISI GPT-5.5 eval
- Multi-agent vs. single-agent under equal budget. The "one model with full context wins under equal token budget" argument from last week is still unrefuted. New survey landed this week — useful if you're picking a side. arXiv:2503.23037
- MCP server reliability patterns. Auto-retry-on-transient-startup-error is now in Claude Code's own MCP client; expect the pattern to become standard in third-party MCP hosts. official changelog