Faster Code, More Failures. The AI Paradox
THIS WEEK: A landmark METR trial found developers felt 20% faster with AI assistance. They were actually 19% slower. The data behind the efficiency illusion is in.

Dear Reader…
The pitch is seductive. Hand your codebase to an AI agent, step back, and watch production-ready software materialise at machine speed. Across enterprise boardrooms and developer Slack channels alike, the promise of autonomous code generation has become something close to received wisdom. But a growing body of empirical research tells a more complicated story, one in which the productivity gains are real but partial, the risks are structural rather than incidental, and the industry may be trading a short-term efficiency dividend for a long-term competency crisis.
This week, we examine the myth and reality of relying on agents for code development, drawing on data from Sonar, GitClear, Veracode, Gartner, and a randomised controlled trial from METR that should be required reading for any organisation currently building an AI-assisted engineering strategy.
The Numbers That Sell the Dream
Start with the headline figure that vendors love to cite. According to Sonar's 2026 State of Code survey, 42% of all committed code is now AI-generated or AI-assisted. That is not a rounding error. It represents a structural shift in how software is produced. A ShiftMag study of 4.2 million developers narrows that figure when applying a stricter definition of AI authorship, arriving at 26.9% of production code. Even at the lower bound, more than a quarter of the code running enterprise systems has been written, at least in part, by a machine.
The drafting speed argument is also legitimate. AI tools save an average of 35% in drafting time, and that compression is experienced as meaningful relief by developers under delivery pressure. The sensation of working faster is real. The problem, as we will examine, is that sensation and measurement are telling different stories.
The Productivity Paradox
In 2025, METR published the findings of a randomised controlled trial that ought to have caused considerably more disruption than it did. Developers with AI assistance believed they were working 20% faster. Actual measured productivity, when controlling for task complexity and codebase maturity, showed a net 19% slowdown compared to unassisted peers working on complex systems.
That 39-percentage-point gap between perceived and actual productivity is not a minor discrepancy. It points to something more fundamental about how AI tools reshape cognitive work. The METR researchers identified a mechanism they describe as a shift from visible execution to less visible but cognitively intensive verification and correction. As one framing in the research puts it: "The effort feels lower even when it is not, because drafting feels like the hard part."
This matters enormously for organisations benchmarking AI adoption through developer satisfaction surveys or velocity metrics that measure output rather than outcome. If your measurement framework captures how fast code is written but not how long it takes to safely commit, you are measuring the easy half of the equation.
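To make the distinction concrete, here is a minimal sketch of the two measurements side by side, assuming a hypothetical change record with drafting, merge, and stabilisation timestamps. The field names are ours for illustration, not taken from any of the studies cited:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeRecord:
    """Hypothetical per-change record; field names are illustrative only."""
    drafted: datetime      # first commit pushed
    merged: datetime       # merged to the main branch
    stabilised: datetime   # last follow-up fix or incident closed for this change

def output_metric(changes: list[ChangeRecord]) -> float:
    """The easy half: changes landed per cumulative week of draft-to-merge time."""
    weeks = sum((c.merged - c.drafted).total_seconds() for c in changes) / (7 * 24 * 3600)
    return len(changes) / weeks if weeks else 0.0

def outcome_metric(changes: list[ChangeRecord]) -> timedelta:
    """The harder half: average time from first draft until the change is safely
    in production, including review, rework and incident follow-up."""
    total = sum(((c.stabilised - c.drafted) for c in changes), timedelta())
    return total / len(changes) if changes else timedelta()
```

If the first number rises while the second deteriorates, the METR result is reproducing itself inside your own organisation.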
What the Code Itself Reveals
GitClear's 2025 analysis of AI-assisted codebases moves the conversation from developer experience to structural code quality, and the findings are difficult to dismiss. Code churn, the rate at which recently written code is subsequently modified or reverted, has doubled in AI-assisted environments. Duplicated code blocks have increased eightfold. These are not cosmetic problems. They are indicators of architectural incoherence accumulating at scale.
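For teams that want to track the same signal internally, here is a rough sketch of a churn calculation, assuming hypothetical line-level change records rather than GitClear's actual methodology:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LineChange:
    """Hypothetical line-level change record, e.g. derived from blame history."""
    path: str
    line_authored: datetime   # when the line being touched was originally written
    changed_at: datetime      # when that line was modified or deleted

def churn_rate(changes: list[LineChange], window: timedelta = timedelta(weeks=2)) -> float:
    """Share of changed lines that revise code authored within the window,
    a rough stand-in for the churn metric discussed above."""
    recent = [c for c in changes if c.changed_at - c.line_authored <= window]
    return len(recent) / len(changes) if changes else 0.0
```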
The underlying dynamic is one that data management practitioners will recognise immediately. AI tools optimise for local correctness, asking only whether a function does what it is supposed to do in isolation, rather than architectural coherence, which asks whether that function behaves appropriately within the broader system context. The result is what researchers describe as structurally incoherent codebases: systems where individual components pass their own tests but interact in ways that generate failures at integration and production boundaries.
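A toy example makes the distinction tangible. The helpers and the UTC convention below are invented for illustration, not drawn from the GitClear data; each function is locally correct, yet they disagree at the integration boundary:

```python
from datetime import datetime, timezone

# Existing system convention (assumed for this example): all timestamps are
# stored and compared as timezone-aware UTC datetimes.
def record_event(name: str) -> dict:
    return {"name": name, "at": datetime.now(timezone.utc)}

# Plausible generated addition: locally correct, and its own unit test passes,
# but it can produce naive timestamps when the input string has no offset.
def parse_event(name: str, raw: str) -> dict:
    return {"name": name, "at": datetime.fromisoformat(raw)}

def minutes_between(a: dict, b: dict) -> float:
    # Fails at the integration boundary: subtracting an aware datetime from a
    # naive one raises TypeError at runtime, not at review time.
    return (b["at"] - a["at"]).total_seconds() / 60
```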
The security picture compounds this further. Veracode's 2025 analysis found that 45% of AI-generated code samples contain OWASP Top 10 vulnerabilities. Java AI-generated code carries a security failure rate exceeding 70%. Across languages, AI-generated code introduces 1.57 times more security findings and 1.75 times more logic and correctness errors than human-written equivalents. Maintainability errors run 1.64 times higher. These are not edge-case anomalies. They are systematic properties of how current generation models produce code.
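To see what an OWASP Top 10 finding typically looks like in generated code, consider the canonical injection pattern, shown here as an illustrative example rather than a sample from the Veracode dataset, alongside the parameterised form a reviewer should insist on:

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Injection-prone: attacker-controlled input is spliced into the SQL text.
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterised query: the driver keeps data separate from the SQL code.
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```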
The Verification Bottleneck
Here is where the efficiency case begins to unravel in practice. A survey of developer behaviour reveals a striking trust deficit. Just 4% of developers fully trust that AI-generated code is functionally correct. Yet only 48% always verify AI code before committing it. The gap between distrust and verification behaviour suggests that deadline pressure, workflow friction, and the persuasive plausibility of AI output are combining to produce a systematic under-review of code that developers themselves do not believe is reliable.
The verification burden is also heavier than it appears. Thirty-eight per cent of developers report that reviewing AI-generated code requires more effort than reviewing code written by humans. The saved drafting time is frequently reinvested into what researchers describe as a gruelling cycle of review, testing, and correction, a workflow characterised, pointedly, as "vibe then verify." Developer trust in AI code accuracy has also deteriorated, falling from 40% in 2024 to 29% in 2025, a decline that correlates with wider production exposure to AI-generated failures.
The operational data reinforces this. Pull requests per developer increased by 20% with AI assistance. Incidents per pull request increased by 23.5%. Compound the two and each developer is shipping roughly 48% more incidents than before. More code is being shipped, but more of it is breaking things.
Agents and the Architecture of Optimism
The industry response to these challenges has been, in many cases, to propose more automation as the remedy. Agentic architectures, in which a supervisor agent decomposes human intent into subtasks for specialised worker agents, are being positioned as the solution to the limitations of single-model code generation. Self-healing infrastructure, where AI agents autonomously detect and remediate broken processes, achieves approximately 90% success rates on well-structured codebases. Observability layers in advanced pipelines can detect 96.4% of potential failures, with automated remediation handling 90% of common anomalies.
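In outline, the supervisor-worker pattern looks something like the sketch below. This is a structural sketch only; the role names, prompts, retry policy, and PASS convention are our assumptions, not any vendor's implementation:

```python
from dataclasses import dataclass
from typing import Callable

# Stand-in type for an LLM call: (role, prompt) -> text. A real system would
# wire this to a model client; tests can pass a stub.
ModelCall = Callable[[str, str], str]

@dataclass
class Task:
    description: str
    result: str | None = None
    attempts: int = 0

def supervise(intent: str, call_model: ModelCall, max_retries: int = 2) -> list[Task]:
    """Supervisor decomposes the intent into subtasks, hands each to a worker,
    and accepts a result only when a verifier agent signs off."""
    plan = call_model("planner", f"Break this into independent subtasks:\n{intent}")
    tasks = [Task(line.strip()) for line in plan.splitlines() if line.strip()]
    for task in tasks:
        while task.attempts <= max_retries and task.result is None:
            task.attempts += 1
            draft = call_model("worker", task.description)
            verdict = call_model("verifier", f"Task: {task.description}\nOutput: {draft}")
            if verdict.strip().upper().startswith("PASS"):
                task.result = draft
    return tasks
```

Everything interesting, including the 40% cancellation rate Gartner anticipates, happens in the gap between this tidy loop and a legacy estate full of ambiguous requirements.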
These are impressive figures in controlled conditions. The question that Gartner's research forces us to confront is whether those conditions describe real enterprise environments. Gartner projects that over 40% of agentic AI projects will be cancelled by 2027 due to reliability issues and inadequate risk controls. Named specifically as a driver of this anticipated collapse is what Gartner calls the "prototype mirage," the pattern of enterprises measuring success through compelling demonstrations rather than production performance. Agentic systems that perform elegantly in structured demos encounter an entirely different set of challenges when exposed to legacy architectures, ambiguous requirements, and the messy interdependencies of live data pipelines.
The Human Cost Not Being Budgeted For
Perhaps the most consequential finding in the current research landscape sits at the intersection of psychology and workforce economics. There are two dynamics operating in parallel that, taken together, represent a systemic risk to the engineering profession's capacity to sustain itself.
The first is what researchers are calling confidence-competence inversion. Junior developers trust AI output at a rate of 78%, nearly double the 39% rate observed among senior engineers. They are also substantially more likely to ship unreviewed AI code: 60.2% of junior developers do so, compared with 25.8% of seniors. The developers least equipped to identify AI-generated errors are the most likely to let those errors reach production.
The second dynamic is structural. Junior developer job postings have dropped 40 to 50% since early 2024. The roles that historically provided the apprenticeship pipeline through which senior engineers developed, including the debugging assignments, the code review cycles, and the exposure to system failure at manageable scale, are disappearing. As one researcher frames it: "By automating that friction, the industry is destroying its capacity to build the future verifiers required to sustain the system."
There is also a subtler phenomenon emerging at the individual level. Employees working with AI tools are working faster and longer, with the natural cognitive governor of human writing speed removed. The psychological toll of what is being termed "agent thrashing" (the state of overwhelm generated by intervening in agentic loops and managing stochastic failures in multi-agent systems) and of "machine-speed burnout" is not yet well understood at the organisational level, but both phenomena are appearing consistently enough in the research to warrant attention.
What Rigorous Adoption Actually Looks Like
None of this is an argument for abandoning AI-assisted development. The efficiency gains in drafting, scaffolding, and pattern-repetitive tasks are real and material. The question is whether organisations are building the governance infrastructure to capture those gains without importing the associated risks into their production environments.
That means treating AI code review as a distinct organisational discipline, not an extension of existing review practices. It means measuring outcome quality across incident rates, churn, and security findings, not just output volume. It means being honest about the difference between a compelling demo and a production-grade system. And it means taking seriously the workforce development implications of a hiring market eliminating the junior roles through which the next generation of senior verifiers would have learned to tell good code from bad.
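One low-tech starting point is to make verification of AI-assisted changes an explicit, auditable gate rather than an informal habit. The sketch below assumes a hypothetical team convention of AI-Assisted and Verified-By commit trailers; both the convention and the hook are illustrative, not an established tool:

```python
#!/usr/bin/env python3
"""Hypothetical commit-msg hook: block AI-assisted commits that lack an
explicit human verification trailer (illustrative convention only)."""
import sys

def check(message: str) -> bool:
    lines = [line.strip().lower() for line in message.splitlines()]
    ai_assisted = any(line.startswith("ai-assisted:") for line in lines)
    verified = any(line.startswith("verified-by:") for line in lines)
    return verified if ai_assisted else True

if __name__ == "__main__":
    # Git passes the path to the commit message file as the first argument.
    with open(sys.argv[1], encoding="utf-8") as f:
        if not check(f.read()):
            sys.stderr.write("AI-assisted commit requires a Verified-By: trailer.\n")
            sys.exit(1)
```

Installed as a commit-msg hook, it turns "did anyone actually look at this?" into a question the history can answer.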
The METR trial's central finding, that developers believed they were faster when they were slower, is a useful lens through which to evaluate your organisation's current AI development narrative. The question is not whether the technology is impressive. It clearly is. The question is whether your measurement framework is sophisticated enough to know the difference between the impression of productivity and the thing itself.


