A contamination-free coding challenge exposes AI's real limits while infrastructure spending reaches absurd new heights.
> The gap between AI benchmarks and reality just got a number: 7.5%. Meanwhile, OpenAI signed a check that makes most countries' budgets look modest.
The K Prize, a contamination-free AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski, has published its first results. The winner, Brazilian prompt engineer Eduardo Rocha de Andrade, got just 7.5% of the test questions correct. Compare that with SWE-Bench's top scores of around 75%, a tenfold gap, and the picture snaps into focus: benchmark contamination has been flattering our AI coding tools for months.
The challenge tests models against real GitHub issues flagged after the submission deadline, making training data leakage impossible. Konwinski put it bluntly: "If we can't even get more than 10% on a contamination-free SWE-Bench, that's the reality check for me."
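The date cutoff behind that design is simple enough to sketch. The snippet below is a minimal illustration, not the K Prize's actual pipeline: the `Issue` type, the repository names, and the deadline date are all hypothetical placeholders. It only shows the core idea of keeping issues flagged strictly after the submission deadline, so nothing in the test set could have been seen by a model frozen before that date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    repo: str          # hypothetical repository name
    number: int        # hypothetical issue number
    flagged_on: date   # date the issue was flagged


def contamination_free(issues, submission_deadline):
    """Keep only issues flagged strictly after the submission deadline."""
    return [issue for issue in issues if issue.flagged_on > submission_deadline]


# Placeholder deadline and issues, invented purely for illustration.
DEADLINE = date(2025, 1, 1)
candidates = [
    Issue("example/repo-a", 101, date(2024, 12, 20)),  # before the deadline
    Issue("example/repo-b", 202, date(2025, 1, 1)),    # on the deadline
    Issue("example/repo-c", 303, date(2025, 1, 15)),   # after the deadline
]

test_set = contamination_free(candidates, DEADLINE)
print([(issue.repo, issue.number) for issue in test_set])
```

The comparison is strict on purpose: an issue flagged on the deadline day itself is excluded, which errs on the side of a smaller but cleaner test set.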
For anyone shipping AI-assisted code in production, this is the wake-up call. The tools are useful for boilerplate and pattern completion, but novel problem-solving on unfamiliar codebases remains firmly human territory. Plan your tooling investments accordingly.
OpenAI's annual commitment to Oracle — $30 billion for data center services — deserves a closer look. That's 4.5 gigawatts of capacity, equivalent to two Hoover Dams, all part of the $500 billion Stargate project with Oracle and SoftBank.
The math is striking: OpenAI's current ARR is roughly $10 billion, and the Oracle commitment alone is triple that, every year. This isn't a typical cloud services deal. It's a monster data center build in Abilene, Texas, with Oracle pouring nearly $50 billion over two years into the physical plant.
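For readers who want to check the ratios, here is a back-of-the-envelope sketch. The dollar and gigawatt figures come from the numbers quoted above. The Hoover Dam capacity of about 2.08 GW is an outside assumption based on its commonly cited nameplate rating, not a figure from the deal itself.

```python
# Figures quoted in this issue (billions of USD, gigawatts).
ORACLE_COMMITMENT_USD_B = 30.0   # annual commitment to Oracle
OPENAI_ARR_USD_B = 10.0          # approximate current ARR
CAPACITY_GW = 4.5                # contracted data center capacity

# Assumption: Hoover Dam nameplate capacity of roughly 2,080 MW.
HOOVER_DAM_GW = 2.08

commitment_to_arr = ORACLE_COMMITMENT_USD_B / OPENAI_ARR_USD_B
hoover_equivalents = CAPACITY_GW / HOOVER_DAM_GW

print(f"Commitment is {commitment_to_arr:.1f}x ARR")
print(f"Capacity is about {hoover_equivalents:.1f} Hoover Dams")
```

Under those assumptions the commitment works out to three times ARR and the capacity to a little over two Hoover Dams, which matches the rounded comparisons in the text.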
What this signals for the industry:
- Oracle's stock hit all-time highs.
- Larry Ellison became the second-richest person globally.
- The deal reshapes how we think about AI infrastructure economics.
LegalOn (Series E, $50M) — SoftBank-backed legal AI platform for in-house teams. Automates contract analysis and document review. Legal tech is quietly becoming one of AI's most bankable verticals.
Google Photos AI Remix — New features let users remix photos in different artistic styles and convert stills to video. Consumer AI is becoming invisible infrastructure rather than a feature you opt into.
GPU Calculator (inference.ai) — Community-built tool for matching transformer architectures to compatible GPUs. Limited to NVIDIA for now, but addresses a real pain point in hardware selection for ML workloads.
The K Prize results and the Oracle deal tell the same story from different angles: we're early. The models aren't as capable as the benchmarks suggest, and the infrastructure required to close that gap is staggering. The winners in this cycle won't be the ones with the biggest models — they'll be the ones who ship useful products despite the limitations.
— Aaron, from the terminal. See you next Friday.