Simon Willison's year-end review crowns reasoning as 2025's breakthrough while Chinese labs prove hardware isn't everything.
> 2025 was the year LLMs learned to think before answering. 2026 will be the year they learn to act.
Simon Willison's "2025: The Year in LLMs" (891 HN points, 548 comments) identifies reasoning as the defining breakthrough. Models trained against verifiable rewards — math correctness, code execution, logical validity — spontaneously developed multi-step problem-solving. This isn't incremental. It's a qualitative shift from pattern matching to systematic thinking, and every major lab converged on it independently.
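The key property of a "verifiable reward" is that a program, not a human rater, decides whether the model got it right. A minimal illustrative sketch (the function names and reward scheme here are hypothetical, not from any lab's published pipeline):

```python
# Hedged sketch of "verifiable rewards": signals that can be checked
# mechanically, e.g. exact-match math answers or executing generated code.
# Real RL pipelines sandbox execution and use richer scoring; this is
# only meant to show why such rewards are "verifiable".

def math_reward(model_answer: str, ground_truth: str) -> float:
    # Reward 1.0 only when the final answer matches the known-correct one.
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(program: str, test_input, expected) -> float:
    # Reward by actually running the generated code against a test case.
    namespace = {}
    try:
        exec(program, namespace)  # NOTE: real systems execute in a sandbox
        return 1.0 if namespace["solve"](test_input) == expected else 0.0
    except Exception:
        return 0.0  # crashes and wrong answers score identically: zero

print(math_reward("42", "42"))                          # 1.0
print(code_reward("def solve(x): return x * 2", 3, 6))  # 1.0
```

Because the reward is computed, not judged, it scales to millions of training samples without labeling cost, which is part of why every major lab could converge on the technique.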
The second headline: agents went from research to production. Willison defines them simply — "LLMs that run tools in a loop to achieve a goal." Every major lab shipped a CLI coding agent (Claude Code, Gemini CLI, Qwen Code, Mistral Vibe). The proliferation validates the market but raises an obvious question: when everyone has one, what's the moat?
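Willison's definition is compact enough to sketch directly. The following is a minimal illustration, with a stub standing in for the LLM (tool names, the stub, and the message format are all assumptions for demonstration, not any vendor's API):

```python
# Minimal sketch of "LLMs that run tools in a loop to achieve a goal".
# `fake_model` is a stand-in for an LLM call that returns either a tool
# request or a final answer; real agents parse this from model output.

TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def fake_model(goal: str, history: list) -> dict:
    # Stub LLM: requests one tool call, then declares the goal achieved.
    if not history:
        return {"tool": "add", "args": (2, 3)}
    return {"final": f"{goal}: result is {history[-1]}"}

def run_agent(goal: str, model, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):              # the loop
        action = model(goal, history)
        if "final" in action:               # model decides the goal is met
            return action["final"]
        result = TOOLS[action["tool"]](*action["args"])  # run the tool
        history.append(result)              # feed the result back in
    raise RuntimeError("step budget exhausted before goal was reached")

print(run_agent("compute 2+3", fake_model))  # compute 2+3: result is 5
```

Every CLI agent named above is, architecturally, an elaboration of this loop; the differentiation lives in the tool set, the context management, and the sandboxing around it.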
Chinese labs answered the geopolitical question definitively. DeepSeek-R1 matched OpenAI's o1 despite US chip export restrictions. Algorithmic innovation beat hardware access. Export controls delayed but did not prevent parity.
DeepSeek, Alibaba Qwen, Moonshot AI (Kimi K2), Z.ai, and MiniMax all hit top benchmark rankings in 2025 while working with previous-gen NVIDIA GPUs and domestic chips. The strategic implications are significant.
First, the technical playbook: more efficient training algorithms, architectural innovations squeezing better performance from available compute, and reinforcement learning techniques that independently converged with OpenAI's reasoning approach. Published research suggests combinations of all three.
Second, the open-weight strategy gives these models distribution advantages US companies lack. No cloud lock-in. Global accessibility. Community contributions. The economic model shifts from model-access monopoly to services and support.
Third, the policy lesson: hardware restrictions incentivize domestic innovation. Chinese labs closed the gap faster than anyone predicted. The multipolar AI landscape is here — sustained algorithmic innovation matters more than chip access.
For practitioners, this means frontier-quality open models are available for research and commercial deployment regardless of geography. The competitive pressure forces US labs to either match openness or articulate why proprietary access justifies the premium.
Gemma Scope 2 — Google DeepMind's interpretability toolkit for analyzing model internals and identifying failure modes before deployment. Open to the research community for distributed safety investigation.
CASCADE — Framework enabling agents to autonomously develop new skills through experience, transferring knowledge across domains without explicit retraining per task.
WeatherNext 2 — Google DeepMind's advanced forecasting model showing AI expanding beyond language into scientific prediction with measurable societal value.
2025 proved that reasoning and agents aren't hype cycles — they're architectural shifts. The real story of 2026 won't be who has the best model. It'll be who builds the best infrastructure for agents to operate safely and reliably in production. The capability gap is closing fast; the deployment gap is wide open.
— Aaron, from the terminal. See you next Friday.