For developers who have fully embraced AI agents, productivity is no longer the bottleneck. Agents produce code faster than humans, and often better. The bottleneck is correctness: whether the code solves the problem, whether the architecture can still hold the feature set you’re growing into, whether the implementation encodes the right premises. The practical shift is that a significant fraction of your agent compute should now go to validation (tests, reviewers, scope checks), rather than to producing more code.
Productivity Is No Longer the Bottleneck
A year ago, the argument for agentic programming was throughput: agents let you ship more code in less time.
That is not controversial anymore. With the current generation of models inside agentic harnesses (Claude Code, Codex, and the like), a single developer supervising one agent can run circles around a small team writing by hand.
For people fully inside this workflow, the question is no longer “how do I produce more?” It is “how do I know what I produced is right?”
Three Layers Where “Right” Can Fail
Correctness can fail at three distinct layers: product, architecture, and implementation. Each has its own remedies.
The product layer: does this match what a user wants?
Part of “right” is still human judgment. A human has to look at the running product and say whether it does what they actually need. No amount of agent compute substitutes for that.
(Incidentally, this will probably change fast, as more and more products are targeted at agents.)
The most you can do is make this human’s job easy: ship into a real environment quickly, with real data, so the human can poke at it and react.
The architecture layer: is this the right shape of code?
This is the hardest layer, and the one where current agents are weakest.
A lot of what experienced software engineers do is architectural judgment:
- structuring code so it can accommodate the features that will land on it, without knowing in advance what’s coming;
- knowing what features not to implement;
- recognizing when the feature set has outgrown the original architecture, and the codebase needs to be reshaped before more code can land cleanly.
Agents are not yet good at any of these. They will happily add the feature you asked for, in the place where it fits least, with abstractions that make the next three features harder.
They will implement the configurability you asked for instead of pushing back on whether you should want it. They will keep extending a structure that has outlived its assumptions, because each individual diff still looks reasonable in isolation.
The remedy at this layer is human-driven architectural review, backed by an agent whose explicit job is to check each diff against a source-of-truth document for scope and shape.
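In practice, this check can start small. Below is a minimal Python sketch, assuming the Claude Code CLI is installed and that `claude -p` runs it non-interactively; the `docs/SOT.md` path and the prompt wording are placeholders for your own setup:

```python
import subprocess
from pathlib import Path

SOT_PATH = Path("docs/SOT.md")  # placeholder: wherever your source of truth lives

def review_diff_against_sot(base: str = "main") -> str:
    """Send the current diff plus the SOT document to a fresh reviewer agent."""
    diff = subprocess.run(
        ["git", "diff", base], capture_output=True, text=True, check=True
    ).stdout
    prompt = (
        "You are an architectural reviewer. Check the diff below against the "
        "source-of-truth document: flag scope drift, abstractions that will "
        "hurt upcoming features, and anything the document says not to build.\n\n"
        f"--- SOURCE OF TRUTH ---\n{SOT_PATH.read_text()}\n\n"
        f"--- DIFF ---\n{diff}"
    )
    # `claude -p` runs Claude Code non-interactively and prints its reply.
    review = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return review.stdout

if __name__ == "__main__":
    print(review_diff_against_sot())
```

Wire something like this into a pre-commit hook, or run it after each agent turn. The mechanism matters less than the discipline: the source-of-truth document is in the reviewer’s context every single time.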
The implementation layer: is the code doing what it claims?
The third failure mode is the easiest to miss: code that looks right, has passing tests, and is actually wrong. The premises it encodes are off by a degree. There is a workaround in the middle that the tests happen not to exercise. A helper that “handles edge cases” is actually swallowing errors that should be propagated.
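An invented but representative example: the helper below reads as defensive, and its test passes, yet it erases exactly the signal a caller would need.

```python
import json

def load_config(path: str) -> dict:
    """Load a JSON config file, falling back to defaults on any problem."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        # "Handles edge cases": but a missing file, a permissions error,
        # and a corrupted file are now indistinguishable from an empty config.
        return {}

def test_load_config_missing_file():
    # Passes, and quietly certifies the swallowing behavior as correct.
    assert load_config("does_not_exist.json") == {}
```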
This is the failure mode the writer agent cannot see, because it inherited the premises from its own earlier reasoning. It needs a separate pair of eyes (a reviewer agent, a fresh-context subagent, or a different agent kind entirely) to read the diff without the writer’s assumptions.
Validation Is What You Spend Compute On
The remedy is simple to state and harder to commit to: spend a significant fraction of your agent budget on validation.
The point of running teams of coding agents is no longer to produce more code faster. It is to make it more likely that the code is correct.
Validation, in practice, means several things at once:
- Tests, including the expensive ones. Real end-to-end tests against a real database. Browser tests with Playwright (sketched after this list). Runs against representative data.
- Constant code review, not just at PR time. A reviewer agent that reads each diff as it lands, with the architectural document and the task list in its context.
- Scope rechecks against a source-of-truth document. Has the diff drifted from the original ask? Has the agent quietly expanded the scope? Has it implemented something the SOT explicitly said not to do?
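To make the first item concrete, here is what one of those expensive browser tests might look like with Playwright’s Python sync API. The staging URL and selectors are invented for illustration; the point is that validation compute buys a real browser against a real deployment:

```python
from playwright.sync_api import sync_playwright, expect

def test_signup_flow():
    """End-to-end: a real browser against a real (staging) deployment."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Placeholder URL and selectors; substitute your own application.
        page.goto("https://staging.example.com/signup")
        page.fill("#email", "reviewer@example.com")
        page.click("button[type=submit]")
        # Assert on what the user sees, not on internals.
        expect(page.locator(".welcome-banner")).to_be_visible()
        browser.close()
```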
The mistake is to treat validation as a tax on productivity. It isn’t. The output of one coding agent whose work has been validated is worth meaningfully more than the output of two coding agents whose work has not.
Three Patterns to Apply
There are several ways to deploy validation compute. They are not mutually exclusive.
- A code-reviewer subagent inside the same agent. Both Claude Code and Codex ship with strong code-reviewer subagents. The writer spawns one mid-task, gets local feedback, applies it. Catches bugs, missing edge cases, style violations. Cheap and easy first step.
- A different agent kind reviewing the writer. Have Claude Code spawn Codex (or vice versa) for review. Different training, different priors, catches a slightly different class of issue. Works, but is qualitatively close to the first pattern.
- A persistent, specialized reviewer agent. Two agents per significant task. One programmer, one reviewer; they persist, share a task list, and talk after each TDD cycle. This is the pattern I have found most useful. I wrote separately about it; in this article’s framing, it is the highest-context, most expensive, and most useful form of validation compute. A simplified sketch of one cycle follows this list.
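To give the third pattern a shape, here is a deliberately simplified Python sketch of one cycle. The `tasks.json` format is invented, the agents are restarted per step where the real pattern keeps them alive with shared context, and the `claude -p` / `codex exec` calls assume those CLIs’ non-interactive modes:

```python
import json
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def tests_pass() -> bool:
    # Any test runner works; pytest is just an example.
    return subprocess.run(["pytest", "-q"]).returncode == 0

# Invented task-list format. A real setup keeps both agents alive and
# shares richer state; this restarts them on every cycle for brevity.
with open("tasks.json") as f:
    tasks = json.load(f)

for task in tasks:
    # Writer implements the task test-first.
    run(["claude", "-p", f"Implement with TDD: {task['description']}"])
    if not tests_pass():
        continue  # in a real loop, the failure goes back to the writer
    # A different agent kind reviews the diff with fresh context.
    diff = run(["git", "diff", "main"])
    feedback = run(["codex", "exec", f"Review this diff critically:\n{diff}"])
    # Writer addresses the review before the next task is claimed.
    run(["claude", "-p", f"Address this review feedback:\n{feedback}"])
```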
The Shift
The earlier argument for agentic programming was: you can produce more code per hour of attention. That is still true, and still useful.
The argument that matters more for people already inside the workflow is: you can produce more correct code per hour of attention, if you spend the compute on correctness.
The bottleneck has moved. The budget should move with it.
Our team behind aweb uses this pattern daily. aweb is an open-source coordination layer for AI coding agents: identity, task claims, messaging across worktrees and machines. MIT-licensed. Hosted at aweb.ai.