The gauge broke
For two years I argued, from watching my own teams, that the feeling of AI speed had come apart from the fact of it. This summer it stopped being an anecdote. A controlled trial measured experienced developers feeling about 20% faster while running about 19% slower. The instrument we steer by reads backward.
In December 2023 I wrote that the feeling of speed and the fact of speed had come apart on my teams, and I admitted it was an anecdote, a thing I saw but could not yet prove. This summer the anecdote got a stopwatch on it, and the result is worse than I guessed.
METR ran a randomized controlled trial on experienced open-source developers, working in codebases they knew well, with current frontier AI tools. Before the work, the developers expected the tools to speed them up. After the work, they reported the tools had sped them up, by around 20%. Measured against the clock, they were about 19% slower. The self-report and the stopwatch pointed in opposite directions, nearly 40 points apart. The study is small, 16 developers across 246 tasks, and the authors are careful to say it does not prove AI slows everyone everywhere. They note the result may not carry over to juniors or to greenfield work, neither of which the trial tested. Read the caveats. Then read the part that does not have a caveat: the same people who believed the tool was speeding them up were the ones it was measurably slowing down.
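The arithmetic behind that gap is worth making explicit. Here is a minimal sketch in Python, assuming a hypothetical task that takes 100 minutes without AI. The minute values are illustrative, chosen to match the rounded percentages above, and are not data from the trial. The function names are mine.

```python
def time_change_pct(baseline_minutes: float, observed_minutes: float) -> float:
    """Percent change in completion time versus baseline. Negative means faster."""
    return (observed_minutes - baseline_minutes) / baseline_minutes * 100.0


def perception_gap_points(felt_pct: float, measured_pct: float) -> float:
    """Percentage points separating the felt change from the measured one."""
    return measured_pct - felt_pct


# Hypothetical numbers for illustration only, not trial data.
BASELINE = 100.0   # minutes the task takes without AI (assumed)
FELT = 80.0        # how long it felt with AI: about 20% less time
MEASURED = 119.0   # what the clock said with AI: about 19% more time

felt_pct = time_change_pct(BASELINE, FELT)
measured_pct = time_change_pct(BASELINE, MEASURED)
gap = perception_gap_points(felt_pct, measured_pct)

print(f"felt: {felt_pct:+.0f}%  measured: {measured_pct:+.0f}%  gap: {gap:.0f} points")
```

With these inputs the gap comes out at 39 points. The two readings do not merely differ in size. They carry opposite signs, which is what separates a backward gauge from a noisy one.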
That is the gauge breaking. The instrument every engineering leader steers by, the team’s own felt sense of velocity, does not just have noise in it. It reads backward, under exactly the conditions most real work happens in: experienced people, in code that already exists.
I had the shape of this for two years and I want to be precise about what I had wrong. I thought the felt-versus-real gap was a measurement problem, a thing you could fix by looking harder at the dashboard. It is worse than that. The feeling is not a noisy version of the truth. It is actively misleading, and it is the single input most decisions about AI adoption are running on. Every leadership deck claiming a team is twice as fast now is built on the one reading the data says is inverted.
The team-level telemetry says the same thing from the other side, and at a scale the small trial cannot. Faros AI, looking across more than 10,000 developers, found pull requests merged up 98%, pull request size up over 150%, and review time up 91%, for roughly no net change in delivery. 31% of pull requests merged with no review at all. DORA’s research found higher AI adoption associated with a measurable drop in delivery stability, and the damage persisted into this year. GitClear, reading 200 million changed lines, found copy-pasted code rising, code churn rising, and refactoring collapsing to under 10% of changes, with 2024 the first year on record that developers pasted more code than they reorganized. The pattern across every one of these is identical. More generated, more merged, more churned. Same amount delivered, shakier when it lands.
The sentence I have been circling since 2022 is now just true, with measurements under it. Generation got cheap. Verification got expensive. We removed the old bottleneck and shipped the work straight into a new one, and the new one is review. The volume exploded at the one stage we did not re-staff, and the dashboards we trust cannot see the cost because the cost lands downstream, in incidents and churn and reviewer burnout, on a different page from the velocity chart everyone is cheering.
You can watch the tool-builders concede the same point in where the money went this summer. Windsurf, the editor I have lived in since January, got pulled apart in a single weekend in July. Google paid billions to move Windsurf’s founders and core researchers into DeepMind, the remainder was absorbed by the maker of Devin, and the thing the founders left to build is an agent-first IDE. Strip the branding off agent-first and it says the quiet part out loud. You stop sitting at the keyboard generating, and you move to a dashboard where the job is to review what the agents produced and decide what to keep. The most aggressive bet in the tooling market is a bet that the work is now verification. They are building the cockpit for the exact bottleneck this study just put a stopwatch on.
Here is the honest counter, and it matters more than usual. This is most likely the dip in a J-curve, not the destination. New tools cost you before they pay you, and most of the felt-versus-real gap is the cost showing up before the payoff does. The trial covered neither juniors nor new code, where the effect may well flip, and that is a growing share of what gets built. DORA’s throughput recovered even as stability lagged, which is what a team climbing out of the dip looks like. I am not arguing the tool is bad. I am arguing the gauge is broken, which is a different and more dangerous claim, because a bad tool you eventually notice and a broken gauge you keep trusting.
So the discipline for whoever is running a team through this is one line. Stop steering by how fast it feels. The feeling is the one number we now know reads backward. Measure what reaches production and stays standing, re-staff the stage where the work actually piles up, and treat any productivity claim that lives in a feeling as unproven until the stopwatch agrees. The gauge broke this summer. The teams that win the next stretch are the ones that notice, and replace it, before they report a number they got from a feeling.