Vercel proves passive docs crush active skills, Arcee ships a 400B model for $20M, and someone ported 100K lines of TypeScript to Rust with zero manual code.
> The best agent architecture right now isn't a sophisticated skill system. It's a markdown file sitting in context.
Vercel's evaluation research (407 HN points, 160 comments) produced a result that should reshape how we build AI agents. Testing against Next.js 16 APIs absent from training data, they compared skills-based agent approaches against a simple AGENTS.md file sitting in persistent context.
Results: AGENTS.md achieved 100% success. Skills without explicit instructions hit 53%. Skills with instructions reached 79%, but remained fragile: outcomes changed with instruction wording ("read docs first" vs. "explore project first" produced different results from the same skill).
The failure mode is fundamental. Skills require agents to decide when to invoke them, and that decision point fails nearly half the time without prompting (53% success) and stays brittle even with guidance. Passive context eliminates the decision entirely: capabilities remain available throughout every interaction. An 8KB compressed index achieved perfect scores while cutting context overhead 80% compared to the initial 40KB version.
The architectural implication: for framework knowledge, persistent passive context beats active retrieval. Reduce decision points. Keep relevant information always available rather than requiring the agent to know when to look for it.
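The contrast can be sketched in a few lines of Python. Everything here is illustrative (`AGENTS_MD`, `build_system_prompt`, `SKILLS` are hypothetical names, not Vercel's code): the passive approach concatenates the docs into every prompt, while a skill registry reintroduces the invocation decision the agent has to get right.

```python
# Stand-in for a compressed AGENTS.md index (hypothetical content).
AGENTS_MD = (
    "## Framework API (compressed index)\n"
    "- create_route(path, handler): register a route\n"
)

def build_system_prompt(agents_md: str, task: str) -> str:
    """Passive context: the docs ride along in every request, so the
    model never has to decide whether to fetch them."""
    return (
        "You are a coding agent for this repository.\n\n"
        f"# Framework API reference\n{agents_md}\n"
        f"# Task\n{task}\n"
    )

# Active skills, for contrast: the docs sit behind a tool the model must
# choose to invoke, which is exactly the decision point that failed.
SKILLS = {"framework-docs": lambda: AGENTS_MD}

prompt = build_system_prompt(AGENTS_MD, "Add a new route handler")
```

The passive version trades tokens for reliability: the docs cost context on every call, but there is no branch where the agent can fail to look them up.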
Arcee AI released Trinity Large — 400B parameters, sparse Mixture-of-Experts, trained for $20M in 33 days on 2,048 B300 GPUs. It matches frontier benchmarks (MMLU 87.2, AIME 2025: 24.0 vs. Llama 4 Maverick's 19.3) while running 2-3x faster at inference.
The architecture: 256 experts, 4 active per token, yielding only 13B active parameters per token despite 400B total. The 1.56% routing fraction is notably sparser than DeepSeek-V3 (3.13%), enabled by 6 dense layers maintaining stability. Native 512K context.
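The sparsity arithmetic is easy to verify (DeepSeek-V3's 8-of-256 routed experts is the comparison point; note the 13B active-parameter count also covers the dense layers and attention, which is why it exceeds the raw routing fraction):

```python
# Routing fraction: share of experts consulted per token.
trinity_fraction = 4 / 256        # Trinity Large: 4 active of 256 experts
deepseek_fraction = 8 / 256       # DeepSeek-V3: 8 active of 256 routed experts

print(f"Trinity:     {trinity_fraction:.2%}")   # 1.56%
print(f"DeepSeek-V3: {deepseek_fraction:.2%}")  # 3.12% (commonly quoted as 3.13%)

# Active-parameter share is higher than the routing fraction because
# the 6 dense layers and attention run for every token.
active_param_share = 13 / 400
print(f"Active params: {active_param_share:.2%}")  # 3.25%
```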
Key innovations: momentum-based expert load balancing with tanh clipping, z-loss regularization preventing logit drift, and HSDP with expert parallelism that enabled batch size increases after 5T tokens.
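Arcee hasn't published the exact update rule, but a momentum-plus-tanh balancing step plausibly looks like the sketch below (all names hypothetical; modeled loosely on auxiliary-loss-free bias balancing, not Arcee's code):

```python
import math

def update_expert_biases(biases, load_ema, momentum,
                         beta=0.9, lr=0.01, clip=1.0):
    """Hypothetical sketch: nudge each expert's routing bias toward
    balanced load. The imbalance signal is smoothed with momentum and
    squashed through tanh so no single step can destabilize the router."""
    target = 1.0 / len(biases)              # ideal share of tokens per expert
    new_b, new_m = [], []
    for b, load, m in zip(biases, load_ema, momentum):
        error = target - load               # > 0 if the expert is under-loaded
        m = beta * m + (1 - beta) * error   # momentum on the imbalance
        step = lr * math.tanh(m / clip)     # tanh clipping bounds the update
        new_b.append(b + step)
        new_m.append(m)
    return new_b, new_m
```

An under-loaded expert's bias rises so the router selects it more often; the tanh keeps each correction bounded even if the momentum term spikes.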
Three variants released: Preview (chat-ready, free on OpenRouter), Base (full pretrained checkpoint), and TrueBase (early 10T token checkpoint with no instruction data for pretraining research).
The $20M price tag is the headline. Frontier-class models no longer require $100M+ budgets. Sparse architectures provide both training efficiency and deployment speed advantages. The question shifts from "can you afford to train a frontier model?" to "can you afford not to train a specialized one?"
- **AI2 SERA** — 32B coding agents achieving 54.2% SWE-Bench Verified for $12K training cost. Includes the full recipe: models, training code, data, methodology. Codebase-specific fine-tuning on 8K examples matches 100B+ teacher models.
- **Kimi K2.5** — Open-source visual agentic model. 15T training tokens, agent swarm with up to 100 sub-agents, 1,500 parallel tool calls. 59.3% improvement over K2 Thinking on internal benchmarks.
- **Cloudflare Moltworker** — Proof-of-concept deploying AI agents on edge infrastructure via Workers + Sandboxes + R2 + browser rendering. Zero Trust security, no local hardware needed.
The AGENTS.md result is one of those findings that seems obvious in hindsight but changes how you architect. We've been building increasingly sophisticated retrieval and skill-invocation systems when a well-structured markdown file in context outperforms all of them. Sometimes the simple thing wins because it removes the failure mode entirely. That's an engineering principle worth internalizing.
— Aaron, from the terminal. See you next Friday.