May 17, 2026 — Tandemly Briefings

🎯 Top 3 Things to Know

1. Anthropic published a research paper this weekend forecasting AGI by 2028 and arguing for tighter U.S. chip and model export controls on China. The paper landed during Trump's two-day Beijing visit and sketches two scenarios for the next three years. The headline is not the AGI date itself. It is the policy lever: in Anthropic's second scenario, looser export discipline lets Chinese labs close the capability gap by 2028 and erases the leverage Washington has been using to shape global AI norms. The argument is unusual because it comes from a model lab rather than a policy shop, and it commits Anthropic publicly to a hawkish position the same week the White House is reportedly weighing a separate pre-release vetting regime for frontier models. Worth watching whether OpenAI and Google DeepMind respond on the record, since compute and model-access restrictions cut both ways for them. Anthropic research

2. A new paper finds that grep can beat vector retrieval inside coding agents, but the harness around the search tool moves accuracy more than the choice of retrieval method. "Is Grep All You Need?" tests four agent harnesses on a 116-question slice of LongMemEval. Claude Code holds a persistent grep advantage with Opus and Haiku. Gemini CLI holds a persistent vector advantage with Gemini 3.1 Pro. Same data, different harnesses, different winners. Vector search tends to win at small context, when the retrieved bundle is still manageable. Grep tends to win later, when the agent has to separate needle from haystack in a noisy context window. Relevant for anyone running RAG inside an agent loop. Production teams should benchmark grep against their vector setup before assuming the embedding pipeline is the thing doing the work. arXiv 2605.15184

3. Thinking Machines previewed "interaction models," a native multimodal architecture aimed at OpenAI's Realtime stack. Mira Murati's lab claims a 0.4-second average response latency against GPT-Realtime-2.0's 1.18 seconds. The architecture replaces the request-response loop with 200ms micro-turns and splits the system in two: a live interaction model that stays open to the user while a background reasoning model runs tools asynchronously and shares full conversation context. The bet is that real-time voice and video collaboration is not a faster chatbot. It is a different system topology, where listening, speaking, seeing, and pausing are trained behaviors rather than stitched components. A limited research preview is open now for feedback. Worth watching whether the latency advantage holds in adversarial conditions like overlapping speech and noisy audio. Semafor coverage
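
To make the two-part topology concrete, here is a toy Python sketch of a live loop that keeps acknowledging input on fixed micro-turns while a background task works asynchronously over the same shared context. It is an illustration only, not Thinking Machines' implementation: the names live_loop and background_reasoner, the "ack" strings, and the 0.5-second tool delay are all hypothetical, and only the 200ms micro-turn figure comes from the item above.

```python
import asyncio

MICRO_TURN_S = 0.2  # micro-turn length quoted in the briefing


async def background_reasoner(context, delay_s):
    """Hypothetical slow worker: waits (standing in for tool calls), then reports."""
    await asyncio.sleep(delay_s)
    # Reads the same context list that the live loop keeps appending to.
    return f"tool result (context had {len(context)} chunk(s))"


async def live_loop(chunks, turn_s=MICRO_TURN_S, tool_delay_s=0.5):
    """Acknowledge each input chunk on a fixed cadence; surface tool output when ready."""
    context = []     # conversation context shared by both components
    transcript = []  # what the user-facing side emitted, in order
    task = asyncio.create_task(background_reasoner(context, tool_delay_s))
    surfaced = False
    for chunk in chunks:
        context.append(chunk)
        transcript.append(f"ack:{chunk}")    # live side never blocks on the tool
        if task.done() and not surfaced:
            transcript.append(task.result())  # fold the result into the next micro-turn
            surfaced = True
        await asyncio.sleep(turn_s)
    if not surfaced:
        transcript.append(await task)         # flush a late result after input ends
    return transcript


if __name__ == "__main__":
    print(asyncio.run(live_loop(["hello", "can you", "check my calendar"])))
```

The design point the sketch tries to capture is that the user-facing loop keeps its cadence regardless of how long the background work takes; a request-response design would instead stall until the tool returned.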

🚀 Frontier Models & Features

Salesforce expects $300M in Anthropic token spend in 2026. Marc Benioff disclosed the figure on the All-In podcast on May 16. The spend primarily covers coding and product work. Salesforce remains the largest public Claude reference customer. Let's Data Science
Otherwise a quiet weekend on the frontier-model lane.

🔬 Research Worth Reading

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search (Sen, Kasturi, Lumer, Gulati, Subbiah et al. — see arXiv link for affiliations). arXiv
- TL;DR: Empirical study showing the agent harness around a retrieval system shifts accuracy more than the choice between grep and vector retrieval. Tests four harnesses (Chronos, Claude Code, Codex, Gemini CLI) on a LongMemEval slice, with both inline and file-based tool result delivery.
- Stat: Same 116 questions, same backbone models, different harnesses, different winners. Claude Code shows a persistent grep advantage on Opus and Haiku; Gemini CLI shows a persistent vector advantage on Gemini 3.1 Pro.
- Apply it: Before tuning embeddings on a stuck agentic RAG stack, A/B the same questions through a grep-only path with file-based tool result presentation. If accuracy moves, the retrieval method is not the bottleneck. The harness is.
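
A minimal Python sketch of the A/B idea in "Apply it", under loud assumptions: grep_retrieve and bow_retrieve are toy stand-ins for a grep path and a vector path (a bag-of-words cosine, not a real embedding model), the documents and questions are invented, and "accuracy" here only checks whether the gold string appears in the retrieved text. A real test would route both paths through the agent harness itself, including how tool results are presented, which this sketch does not model.

```python
import math
import re
from collections import Counter


def grep_retrieve(query, docs, k=1):
    """Grep-like stand-in: rank docs by how many query terms occur as substrings."""
    terms = {t for t in re.findall(r"\w+", query.lower()) if len(t) > 2}
    scored = [(sum(1 for t in terms if t in d.lower()), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [docs[i] for score, i in scored[:k] if score > 0]


def bow_retrieve(query, docs, k=1):
    """Toy vector stand-in: rank docs by cosine similarity of word-count vectors."""
    def vec(text):
        return Counter(re.findall(r"\w+", text.lower()))

    def cos(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    q = vec(query)
    ranked = sorted(range(len(docs)), key=lambda i: (-cos(q, vec(docs[i])), i))
    return [docs[i] for i in ranked[:k]]


def accuracy(retrieve, questions, docs):
    """Fraction of questions whose gold answer string appears in the retrieved text."""
    hits = sum(
        1 for q, gold in questions
        if any(gold.lower() in d.lower() for d in retrieve(q, docs))
    )
    return hits / len(questions)


def ab_compare(questions, docs, paths):
    """Run the same questions through every retrieval path and report accuracy per path."""
    return {name: accuracy(fn, questions, docs) for name, fn in paths.items()}


# Invented demo data, for illustration only.
docs = [
    "The user adopted a dog named Biscuit in March.",
    "The user moved to Lisbon last year.",
    "The user prefers tea over coffee.",
]
questions = [
    ("What is the name of the dog?", "Biscuit"),
    ("Where did the user move?", "Lisbon"),
]
results = ab_compare(questions, docs, {"grep": grep_retrieve, "bow_vector": bow_retrieve})
print(results)
```

On this toy data the two paths disagree on the second question, because the substring match catches "move" inside "moved" while the word-count vectors do not. The useful part is the shape of the comparison: same questions, same documents, one variable changed at a time.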
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning (Zheng, Wang, Li, Song, Fang / HKUST). arXiv
- TL;DR: An automated agentic pipeline that synthesizes conceptual and computational scientific tasks grounded in academic evidence, then trains an 8B agent on the resulting data with tool-integrated reasoning and long-horizon objectives.
- Stat: SciResearcher-8B hits 19.46% on HLE-Bio/Chem-Gold, a new state of the art at its parameter scale and ahead of several larger proprietary agents, with 13 to 15 point absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature.
- Apply it: For domain-specific agent training where supervised data is scarce, treat data construction itself as an agentic loop with academic grounding rather than paying for annotation. The training-data pipeline is the moat.

🏢 Enterprise in the Wild

Salesforce-Anthropic at an expected $300M in 2026. Benioff's All-In disclosure on May 16 puts Salesforce among the largest token-volume customers in the Claude enterprise book. The use case mix (coding plus product work) tracks with what other heavy Claude shops report.
Virgin Voyages, highlighted this past week at NVIDIA GTC 2026, scaled its production agent fleet from 50 in October to more than 1,500 by May, cutting content production time by 60% and doubling promotional output. Useful proof point that horizontal scaling of narrow agents can outpace one-big-agent strategies. Crescendo summary

🛠️ Tooling & Ecosystem

Quiet weekend on this front. Claude Code's agent view (multi-session CLI manager) and the broader May 14 update remain the latest substantial releases worth re-reading if missed.

⚖️ Policy & Regulation

Anthropic's "2028" paper doubles as a policy intervention (covered in Top 3) and arrives against the backdrop of a reportedly active White House study on pre-release vetting of frontier models. Both point toward a more hands-on federal posture, even as several state-level AI laws keep advancing in parallel. The Hill coverage

📌 Watch List

Pre-release model vetting regimes. The White House study and the Anthropic export-control argument pull in the same direction.
Agent harness as the accuracy frontier. The grep paper makes the harness layer itself a benchmarkable surface, not just a developer-ergonomics choice.
Real-time voice and video models. Thinking Machines opens the first credible challenge to OpenAI Realtime on latency and topology.
Frontier-lab geopolitics. Anthropic's public stance on export controls may shift competitive dynamics with peer labs.