May 8, 2026 — Tandemly Briefings

🎯 Top 3 Things to Know

1. Zyphra released ZAYA1-8B, a reasoning MoE that matches frontier models on math with under 1B active parameters and was trained entirely on AMD GPUs. The model has 8.4B total parameters but activates only 760M per token. Zyphra reports it approaches or beats Claude 4.5 Sonnet, Gemini 2.5 Pro, and DeepSeek-V3.2 on math benchmarks under extended compute, and it surpasses GPT-OSS-120B on the APEX-shortlist. Two things make this matter beyond the score sheet. The first is the hardware: pretraining ran on a 1,024-GPU AMD MI300X cluster with Pensando Pollara networking, the cleanest public proof yet that frontier-class training works outside an NVIDIA stack. The second is the cost frontier: a sub-1B-active model at this quality shifts what teams can self-host on a single commodity GPU. Weights are on Hugging Face under Apache 2.0. Zyphra blog
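
A back-of-envelope check on the self-hosting claim, as a rough sketch rather than a sizing guide. The parameter counts come from the item above. The bytes-per-parameter values are the usual ones for each precision, and the estimate covers weights only (no KV cache, activations, or runtime overhead), which is an assumption of this sketch. The point it illustrates: memory scales with total parameters, while per-token compute scales with active parameters.

```python
# Parameter counts as reported in the item above.
TOTAL_PARAMS = 8.4e9    # all experts must be resident in memory
ACTIVE_PARAMS = 760e6   # parameters actually used per token


def active_fraction(active: float, total: float) -> float:
    """Share of the model that participates in each forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 bytes). Ignores KV cache and activations."""
    return params * bytes_per_param / 1e9


print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
for name, bytes_per_param in [("bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"{name} weights: {weight_memory_gb(TOTAL_PARAMS, bytes_per_param):.1f} GB")
```

Roughly 9% of the weights are active per token, and the full weights come to about 16.8 GB in bf16 or about 4.2 GB at 4-bit, before any runtime overhead.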

2. ServiceNow opened its enterprise system of action to outside agents, with Anthropic as the first design partner and an NVIDIA-built desktop agent called Project Arc as the headline demo. At Knowledge 2026 in Las Vegas, ServiceNow launched Action Fabric, a layer that lets third-party agents like Claude or Microsoft Copilot trigger governed actions inside ServiceNow rather than just reading and writing data through APIs. Project Arc is the second piece: a long-running desktop agent secured by NVIDIA OpenShell, with every action logged in ServiceNow's AI Control Tower. The pattern to notice is that the governance layer is becoming the integration surface. Whoever owns the audit log and the policy boundary becomes the place agents have to plug in. ServiceNow newsroom · NVIDIA blog

3. A new Stanford and MIT paper, Meta-Harness, shows that the wrapper code around a fixed LLM can swing performance by up to 6× on the same benchmark, and that this wrapper can be optimized automatically. The system uses a coding agent as its own search loop. It reads the source, scores, and execution traces of past harness candidates, then decides whether to make a small edit or a full rewrite. On online text classification, Meta-Harness beats a state-of-the-art context-management system by 7.7 points while using 4× fewer context tokens. On retrieval-augmented IMO-level math, it adds 4.7 points across five held-out models. The implication is that "harness engineering" is now an optimization problem with a closed loop, not a craft, and most teams are leaving large gains on the table by hand-tuning prompts and tools. arXiv 2603.28052

🚀 Frontier Models & Features

OpenAI made GPT-5.5 Instant the default in ChatGPT. Available to all users, exposed in the API as chat-latest. OpenAI reports a 52.5% reduction in hallucinated claims on high-stakes prompts in medicine, law, and finance. The release also adds memory sources and a global rollout of ChatGPT add-ins for Excel and Google Sheets. OpenAI blog · TechCrunch

Google Gemini API File Search added multimodal RAG. Page-level citations are now returned for image-and-text retrieval, and an event-driven webhook lets long-running tasks notify clients without polling. Greeden weekly summary

🔬 Research Worth Reading

Meta-Harness: End-to-End Optimization of Model Harnesses (Stanford / MIT). arXiv 2603.28052
- TL;DR: Treat the harness around a fixed LLM as the thing being trained. A coding agent reads its own prior attempts off the filesystem and proposes the next one.
- Stat: 7.7 points over a state-of-the-art context-management baseline at 4× fewer tokens; up to 6× spread on the same benchmark across harness variants.
- Apply it: Stop hand-tuning prompts as the optimization unit. Wrap your harness code in a CI-style scoring loop and let an agent propose rewrites against held-out tasks.
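
The "Apply it" advice above can be sketched as a small scoring loop. This is a toy illustration under stated assumptions, not the paper's implementation: `Candidate`, `evaluate`, `search`, `toy_run`, and `toy_propose` are hypothetical names, the harness is reduced to a plain string, and `toy_propose` is a rule-based stand-in for the coding agent. What it does show is the shape of the loop: keep every past candidate with its source, score, and trace, let the proposer read that history, and judge the winner on held-out tasks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str                      # harness code or config, kept as text
    score: float                     # mean score on the tasks it was run on
    trace: List[Tuple[str, float]]   # per-task results the proposer can read


Runner = Callable[[str, str], float]          # (harness source, task) -> score
Proposer = Callable[[List[Candidate]], str]   # history -> next harness source


def evaluate(source: str, tasks: List[str], run: Runner) -> Candidate:
    """Run one harness on every task and record score plus trace."""
    trace = [(task, run(source, task)) for task in tasks]
    mean = sum(score for _, score in trace) / len(trace)
    return Candidate(source, mean, trace)


def search(seed: str, propose: Proposer, run: Runner,
           search_tasks: List[str], heldout_tasks: List[str],
           rounds: int = 5) -> Tuple[Candidate, Candidate]:
    """Closed loop: propose from full history, score, repeat; then check held-out."""
    history = [evaluate(seed, search_tasks, run)]
    for _ in range(rounds):
        history.append(evaluate(propose(history), search_tasks, run))
    best = max(history, key=lambda c: c.score)
    return best, evaluate(best.source, heldout_tasks, run)


def toy_run(source: str, task: str) -> float:
    # Hypothetical scorer: a task "skill:id" passes if the harness names the skill.
    return 1.0 if task.split(":")[0] in source else 0.0


def toy_propose(history: List[Candidate]) -> str:
    # Stand-in for the coding agent: patch the first failure in the best candidate.
    best = max(history, key=lambda c: c.score)
    for task, score in best.trace:
        if score == 0.0:
            return best.source + " " + task.split(":")[0]
    return best.source


SEARCH_TASKS = ["retrieval:q1", "format:q2", "units:q3"]
HELDOUT_TASKS = ["retrieval:q9", "units:q8"]

best, heldout = search("base", toy_propose, toy_run, SEARCH_TASKS, HELDOUT_TASKS)
print(best.source, best.score, heldout.score)
```

In a real setup, `toy_run` would execute the harness against the model and `toy_propose` would be an agent call that can read the stored candidates. Keeping held-out tasks separate from the search tasks is what stops the loop from overfitting to its own benchmark.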

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery (AgentFlow). arXiv 2604.20801
- TL;DR: A typed graph DSL covers agent roles, prompts, tools, communication topology, and coordination protocol. A feedback loop reads runtime signals from the target program to diagnose which part of the harness failed and rewrite it.
- Stat: Reaches 84.3% on TerminalBench-2 and found ten zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape CVEs.
- Apply it: When a multi-agent loop fails, instrument the topology, not just the prompts. The diagnostic signal lives in the message graph between agents, not in any single turn.
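
One way to picture "instrument the topology" is to make the agent graph an explicit, typed object and attribute failures to edges rather than to turns. The sketch below is an assumption-laden illustration, not AgentFlow's DSL: `Agent`, `Edge`, `Harness`, `Message`, and `failure_rate_by_edge` are hypothetical names, and the message log is hand-written sample data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Agent:
    role: str
    prompt: str
    tools: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Edge:
    src: str   # sending agent's role
    dst: str   # receiving agent's role


@dataclass
class Harness:
    agents: Dict[str, Agent]
    edges: List[Edge]

    def validate(self) -> None:
        """Reject topologies that route messages to agents that do not exist."""
        for edge in self.edges:
            if edge.src not in self.agents or edge.dst not in self.agents:
                raise ValueError(f"edge {edge.src}->{edge.dst} references an unknown agent")


@dataclass(frozen=True)
class Message:
    edge: Edge
    ok: bool   # whether the receiver could act on the message


def failure_rate_by_edge(log: List[Message]) -> Dict[Edge, float]:
    """Aggregate the message log per edge so the weakest link is visible."""
    sent: Counter = Counter()
    failed: Counter = Counter()
    for message in log:
        sent[message.edge] += 1
        if not message.ok:
            failed[message.edge] += 1
    return {edge: failed[edge] / sent[edge] for edge in sent}


plan_to_fuzz = Edge("planner", "fuzzer")
fuzz_to_triage = Edge("fuzzer", "triager")
harness = Harness(
    agents={
        "planner": Agent("planner", "Pick targets."),
        "fuzzer": Agent("fuzzer", "Generate inputs.", tools=("run_target",)),
        "triager": Agent("triager", "Classify crashes."),
    },
    edges=[plan_to_fuzz, fuzz_to_triage],
)
harness.validate()

# Hand-written sample log: the fuzzer->triager handoff fails three times out of four.
log = [Message(plan_to_fuzz, True), Message(plan_to_fuzz, True),
       Message(fuzz_to_triage, True), Message(fuzz_to_triage, False),
       Message(fuzz_to_triage, False), Message(fuzz_to_triage, False)]
rates = failure_rate_by_edge(log)
worst = max(rates, key=rates.get)
print(f"{worst.src}->{worst.dst}: {rates[worst]:.0%} of messages failed")
```

A per-edge view like this points a rewrite at the handoff that is failing, which a per-prompt review of individual turns would not surface.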

🏢 Enterprise in the Wild

Anthropic + ServiceNow at Knowledge 2026. Claude Cowork is now wired directly into ServiceNow's governed system of action via Action Fabric, the first design partner integration. The pitch is enterprise execution inside the apps employees already have open, with policy enforcement at the action layer rather than the API. ServiceNow newsroom

Customers Bank earnings call run by an AI clone. The CEO appeared on the bank's earnings call as a voice-and-likeness clone trained on his prior commentary, alongside the company's new OpenAI partnership for finance workflow automation. Synthetic-executive presence on a regulated investor call is the precedent worth tracking. CNBC

🛠️ Tooling & Ecosystem

Perplexity launched Finance Search in the Agent API. One tool call returns licensed financial datasets, real-time prices, fundamentals, transcripts, and citations. Perplexity's benchmarks put it at the top of FinSearchComp T1 on accuracy, latency, and cost-per-correct-answer. Perplexity blog

ZAYA1-8B weights and serverless endpoint live. Apache 2.0 weights on Hugging Face, free serverless on Zyphra Cloud. Ships with Markovian RSA, a test-time compute method that combines parallel trace generation with fixed-length context chunking to keep memory flat as reasoning length grows. Zyphra blog

Project Arc + NVIDIA OpenShell. Sandboxed desktop agent with hard IT-defined capability boundaries. Notable as a concrete pattern for shipping a long-running desktop agent inside a regulated enterprise. NVIDIA blog

⚖️ Policy & Regulation

EU AI Act Omnibus, day-after detail. Yesterday's provisional agreement also added a new prohibition on AI systems generating non-consensual sexual or intimate content, postponed national regulatory sandbox deadlines to August 2, 2027, and clarified that national authorities retain GPAI competence for law-enforcement, border, judicial, and financial-sector use cases. Council release

US frontier-model pre-deployment testing expands. The Center for AI Standards and Innovation (CAISI), under Commerce, signed evaluation agreements with Google DeepMind, Microsoft, and xAI for pre-release model testing. The new agreements join Anthropic's existing arrangement, making pre-deployment evaluation a near-default for the major US labs. CNBC

📌 Watch List

AMD as a viable training stack: ZAYA1-8B is the cleanest public proof yet that NVIDIA-free pretraining hits frontier quality.
Governance-layer integration: ServiceNow Action Fabric makes the audit log the integration surface. Expect imitators.
Harness optimization: Meta-Harness and AgentFlow turn wrapper engineering into a measurable search problem.
Synthetic executives on regulated calls: The Customers Bank earnings-call clone is the first concrete data point.
Pre-deployment model evals as default: CAISI agreements with Google, Microsoft, and xAI normalize frontier gating beyond Anthropic.