Vercel proves passive docs crush active skills, Arcee ships a 400B model for $20M, and someone ported 100K lines of TypeScript to Rust with zero manual code.
> The best agent architecture right now isn't a sophisticated skill system. It's a markdown file sitting in context.
Vercel's evaluation research (407 HN points, 160 comments) produced a result that should reshape how we build AI agents. Testing against Next.js 16 APIs absent from training data, they compared skills-based agent approaches against a simple AGENTS.md file sitting in persistent context.
Results: AGENTS.md achieved 100% success. Skills without explicit instructions hit 53%. Skills with instructions reached 79%, but remained fragile: outcomes changed with instruction wording ("read docs first" vs. "explore project first" produced different results from the same skill).
The failure mode is fundamental. Skills require agents to decide when to invoke them, and that decision point fails nearly half the time without prompting (53% success) and stays brittle even with guidance. Passive context eliminates the decision entirely: capabilities remain available throughout every interaction. An 8KB compressed index achieved perfect scores while cutting context overhead 80% compared to the initial 40KB version.
The architectural implication: for framework knowledge, persistent passive context beats active retrieval. Reduce decision points. Keep relevant information always available rather than requiring the agent to know when to look for it.
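The contrast can be sketched in a few lines of Python. Everything here is illustrative (`AGENTS_MD`, `build_system_prompt`, `SKILLS` are hypothetical names, not Vercel's code): the passive approach concatenates the docs into every prompt, while a skill registry reintroduces the invocation decision the agent has to get right.

```python
# Stand-in for a compressed AGENTS.md index (hypothetical content).
AGENTS_MD = (
    "## Framework API (compressed index)\n"
    "- create_route(path, handler): register a route\n"
)

def build_system_prompt(agents_md: str, task: str) -> str:
    """Passive context: the docs ride along in every request, so the
    model never has to decide whether to fetch them."""
    return (
        "You are a coding agent for this repository.\n\n"
        f"# Framework API reference\n{agents_md}\n"
        f"# Task\n{task}\n"
    )

# Active skills, for contrast: the docs sit behind a tool the model must
# choose to invoke, which is exactly the decision point that failed.
SKILLS = {"framework-docs": lambda: AGENTS_MD}

prompt = build_system_prompt(AGENTS_MD, "Add a new route handler")
```

The passive version trades tokens for reliability: the docs cost context on every call, but there is no branch where the agent can fail to look them up.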
Arcee AI released Trinity Large — 400B parameters, sparse Mixture-of-Experts, trained for $20M in 33 days on 2,048 B300 GPUs. It matches frontier benchmarks (MMLU 87.2, AIME 2025: 24.0 vs. Llama 4 Maverick's 19.3) while running 2-3x faster at inference.
The architecture: 256 experts, 4 active per token, yielding only 13B active parameters per token despite 400B total. The 1.56% routing fraction is notably sparser than DeepSeek-V3 (3.13%), enabled by 6 dense layers maintaining stability. Native 512K context.
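The sparsity arithmetic is easy to verify (DeepSeek-V3's 8-of-256 routed experts is the comparison point; note the 13B active-parameter count also covers the dense layers and attention, which is why it exceeds the raw routing fraction):

```python
# Routing fraction: share of experts consulted per token.
trinity_fraction = 4 / 256        # Trinity Large: 4 active of 256 experts
deepseek_fraction = 8 / 256       # DeepSeek-V3: 8 active of 256 routed experts

print(f"Trinity:     {trinity_fraction:.2%}")   # 1.56%
print(f"DeepSeek-V3: {deepseek_fraction:.2%}")  # 3.12% (commonly quoted as 3.13%)

# Active-parameter share is higher than the routing fraction because
# the 6 dense layers and attention run for every token.
active_param_share = 13 / 400
print(f"Active params: {active_param_share:.2%}")  # 3.25%
```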
Key innovations: momentum-based expert load balancing with tanh clipping, z-loss regularization preventing logit drift, and HSDP with expert parallelism that enabled batch size increases after 5T tokens.
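Arcee hasn't published the exact update rule, but a momentum-plus-tanh balancing step plausibly looks like the sketch below (all names hypothetical; modeled loosely on auxiliary-loss-free bias balancing, not Arcee's code):

```python
import math

def update_expert_biases(biases, load_ema, momentum,
                         beta=0.9, lr=0.01, clip=1.0):
    """Hypothetical sketch: nudge each expert's routing bias toward
    balanced load. The imbalance signal is smoothed with momentum and
    squashed through tanh so no single step can destabilize the router."""
    target = 1.0 / len(biases)              # ideal share of tokens per expert
    new_b, new_m = [], []
    for b, load, m in zip(biases, load_ema, momentum):
        error = target - load               # > 0 if the expert is under-loaded
        m = beta * m + (1 - beta) * error   # momentum on the imbalance
        step = lr * math.tanh(m / clip)     # tanh clipping bounds the update
        new_b.append(b + step)
        new_m.append(m)
    return new_b, new_m
```

An under-loaded expert's bias rises so the router selects it more often; the tanh keeps each correction bounded even if the momentum term spikes.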
Three variants released: Preview (chat-ready, free on OpenRouter), Base (full pretrained checkpoint), and TrueBase (early 10T token checkpoint with no instruction data for pretraining research).
The $20M price tag is the headline. Frontier-class models no longer require $100M+ budgets. Sparse architectures provide both training efficiency and deployment speed advantages. The question shifts from "can you afford to train a frontier model?" to "can you afford not to train a specialized one?"
- **AI2 SERA** — 32B coding agents achieving 54.2% SWE-Bench Verified for $12K training cost. Includes the full recipe: models, training code, data, methodology. Codebase-specific fine-tuning on 8K examples matches 100B+ teacher models.
- **Kimi K2.5** — Open-source visual agentic model. 15T training tokens, agent swarm with up to 100 sub-agents, 1,500 parallel tool calls. 59.3% improvement over K2 Thinking on internal benchmarks.
- **Cloudflare Moltworker** — Proof-of-concept deploying AI agents on edge infrastructure via Workers + Sandboxes + R2 + browser rendering. Zero Trust security, no local hardware needed.
The AGENTS.md result is one of those findings that seems obvious in hindsight but changes how you architect. We've been building increasingly sophisticated retrieval and skill-invocation systems when a well-structured markdown file in context outperforms all of them. Sometimes the simple thing wins because it removes the failure mode entirely. That's an engineering principle worth internalizing.
— Aaron, from the terminal. See you next Friday.