Codex Hits 83% Merge Rate, Enterprises Hit Pilot Hell

AI coding agents ship real PRs at scale, AlphaEvolve saves Google 0.7% of global compute, and UK courts threaten sanctions for lawyers who file fabricated AI citations.

AI FRONTIER: Week 23, 2025

> OpenAI Codex merged 83% of 212K pull requests. That's not a demo -- it's production-scale autonomous coding, and it changes the economics of software.


The Big Story

A comprehensive analysis of AI coding agent performance across major platforms revealed the numbers everyone has been waiting for. OpenAI Codex leads with an 83.3% merge rate across 212,250 pull requests. Cursor Agents hit 77% on 717 PRs. Devin manages 60.8% across 27,910 PRs. GitHub Copilot sits at 39.2% on 14,274 PRs.
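
In absolute terms, the implied merged-PR counts are easy to back out from those figures (a quick Python sketch; the numbers come from the analysis above, the code is just arithmetic):

    # Back-of-envelope: merged-PR counts implied by the published rates.
    rates = {
        "OpenAI Codex":   (0.833, 212_250),
        "Cursor Agents":  (0.770, 717),
        "Devin":          (0.608, 27_910),
        "GitHub Copilot": (0.392, 14_274),
    }
    for tool, (rate, prs) in rates.items():
        print(f"{tool}: ~{rate * prs:,.0f} of {prs:,} PRs merged")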

These aren't toy benchmarks. These are real pull requests merged into real codebases. The question developers are asking -- "are these simple fixes or substantial features?" -- is valid, but beside the point. Even if most of these are routine changes, automating routine work at 83% accuracy fundamentally changes team structure and velocity.

The implication for engineering leaders: the cost of a unit of code is dropping fast. The bottleneck shifts from writing code to reviewing it, designing systems, and defining what to build. Teams that restructure around this reality -- fewer writers, more reviewers and architects -- will outperform those clinging to traditional ratios.


Deep Dive: AlphaEvolve and Self-Improving Algorithms

Google DeepMind's AlphaEvolve is a Gemini-powered system that tackles "machine-gradable" problems in math and science. Its innovation: an automatic evaluation loop that generates, critiques, and scores multiple solution approaches to reduce hallucinations.

The results are concrete. AlphaEvolve rediscovered known optimal solutions 75% of the time and found improved solutions in 20% of cases. In production at Google, it generated algorithms that recovered 0.7% of worldwide compute resources and reduced Gemini model training time by 1%.

Those percentages sound small until you consider Google's scale. 0.7% of Google's global compute is an enormous amount of hardware. A 1% reduction in Gemini training time translates to millions in saved costs per training run.

The architecture is worth studying -- a code sketch follows the list:

  1. Generation: Produce multiple candidate algorithms for a given problem
  2. Evaluation: Automatically test each candidate against objective criteria
  3. Critique: Use the model to analyze why certain approaches succeed or fail
  4. Iteration: Feed critique back into generation for the next round
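
To make the loop concrete, here's a minimal sketch in Python. The generate, evaluate, and critique callables are hypothetical stand-ins, not AlphaEvolve's actual API (which isn't public in this form); evaluate is assumed to return a higher-is-better score.

    # Sketch of the generation -> evaluation -> critique -> iteration loop.
    # generate, evaluate, and critique are hypothetical callables.
    def evolve(generate, evaluate, critique, rounds=10, pool_size=8):
        feedback, best = None, None
        for _ in range(rounds):
            # 1. Generation: propose candidates, conditioned on prior critique
            pool = [generate(feedback) for _ in range(pool_size)]
            # 2. Evaluation: score each candidate against objective criteria
            scored = sorted(((evaluate(c), c) for c in pool),
                            key=lambda sc: sc[0], reverse=True)
            if best is None or scored[0][0] > best[0]:
                best = scored[0]
            # 3. Critique: have the model explain why winners beat losers
            feedback = critique(scored)
            # 4. Iteration: the critique seeds the next round of generation
        return best  # (score, candidate) pair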

The key insight is that self-evaluation is the unlock. Most AI systems generate solutions but can't assess their quality without human feedback. AlphaEvolve closes this loop, enabling continuous improvement on problems where correctness can be verified programmatically.

This pattern generalizes beyond math. Any domain where you can define an objective evaluation function -- compiler optimization, circuit design, logistics routing -- is ripe for the same approach.
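
An "objective evaluation function" can be very small. For logistics routing, say, it might just score a candidate tour by negated total distance (purely illustrative; higher score means a shorter route, matching the sketch above):

    # Hypothetical evaluator for logistics routing: higher is better,
    # so return the negated total tour distance.
    def evaluate_tour(tour, dist):
        return -sum(dist[a][b] for a, b in zip(tour, tour[1:]))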


Open Source Radar

AI PR Watcher — Dashboard tracking merge rates across Codex, Cursor, Devin, Copilot, and Codegen in real time. Essential data for teams evaluating AI coding tools.

Local model runners — Updated tooling for running 24B+ parameter models on consumer GPUs. Simon Willison's analysis shows these now rival cloud models for many tasks.
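
As a taste of that tooling, a quantized model in GGUF format can be loaded with llama-cpp-python in a few lines. This is a generic sketch, not drawn from Willison's analysis; the model path is a placeholder for whatever GGUF file you've downloaded.

    # Sketch: run a quantized ~24B GGUF model locally with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/24b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=8192,       # context window
    )
    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain GGUF quantization."}],
        max_tokens=200,
    )
    print(reply["choices"][0]["message"]["content"])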

Figure AI humanoid frameworks — Even as Figure's CEO dodges demo questions, open-source frameworks for humanoid control are advancing rapidly in the background.


The Numbers

  • 83.3%: OpenAI Codex merge rate across 212,250 pull requests
  • 75%: Rate at which AlphaEvolve rediscovered known optimal solutions
  • 2/12: AI pilots that reached production at one enterprise -- the rest stuck in "pilot hell"


Aaron's Take

The Codex merge rate data is the story of the year for engineering teams. Forget the philosophical debates about whether AI can "really" code. 83% of 212K PRs merged means the economics have already shifted. The question isn't whether to adopt AI coding agents -- it's how fast you can restructure your team around the new ratio of code generation to code review. AlphaEvolve shows the same pattern at the algorithm level: close the evaluation loop and let the machine iterate. The teams that figure out evaluation-driven AI workflows first will compound their advantage.


— Aaron, from the terminal. See you next Friday.
