Why I Stopped Trusting “Done” from AI Coding Agents
I learned this the hard way: a coding agent saying something is done does not mean the work is complete, especially when that claim has not yet survived a second-agent review.
Sometimes the code exists but does not match the plan. Sometimes the feature works in a narrow sense but misses key constraints. Sometimes the agent takes the shortest path to a plausible end state, even when that path drifts from the intended design or skips the parts that were hardest to verify. Seeing that pattern repeatedly changed how I work.
The biggest improvement in my AI-assisted workflow did not come from swapping one model for another. It came from treating verification as a first-class part of the process.
The Wrong Assumption
A lot of people still seem to assume that if an AI coding agent is strong enough, the main problem is solved. Give it a task, let it work, and review only when something obviously breaks. I do not think that assumption holds.
The issue is not that these systems never produce useful work. They do. The issue is that they often produce work that looks complete before it is actually complete, and that gap matters.
If you are moving quickly, a convincing partial result can be more dangerous than an obvious failure. Obvious failures stop the process. Convincing partial results move it forward under false confidence.
What Verification Started Catching
Once I began verifying one agent's work with another, the pattern became hard to ignore.
I kept finding the same kinds of problems:
- drift from the original plan
- incomplete execution presented as complete
- missing edge cases
- hidden assumptions left unchallenged
- requirements that were implemented loosely instead of precisely
- tests or checks that proved less than they seemed to prove
None of this means the executing agent was useless. It means self-reported completion is not a reliable quality signal.
What Changed My Workflow
The biggest shift was simple: I stopped treating execution and verification as the same job.
One agent did the building. A second agent audited the result against the plan and the expected behavior, and inspected the output itself. That separation improved outcomes immediately. The verifier was not magically perfect, but a second pass from a different model, with a different failure pattern and a different job, consistently surfaced issues the first pass missed.
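As a rough illustration, that split can be sketched in Python. Everything named here (Review, build_and_verify, the stub builder and verifier) is hypothetical and stands in for calls to two different agents; it is a sketch of the idea under those assumptions, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Review:
    """The verifier's verdict: a list of issues. Empty means nothing was found."""
    issues: List[str] = field(default_factory=list)

    @property
    def accepted(self) -> bool:
        return not self.issues


# Hypothetical signatures: each callable would wrap a different agent in practice.
Builder = Callable[[str, List[str]], str]   # (plan, issues from last review) -> output
Verifier = Callable[[str, str], Review]     # (plan, output) -> review


def build_and_verify(plan: str, build: Builder, verify: Verifier,
                     max_rounds: int = 3) -> Tuple[str, Review]:
    """Let the builder work, but let a separate verifier decide whether it is done."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    issues: List[str] = []
    for _ in range(max_rounds):
        output = build(plan, issues)
        review = verify(plan, output)
        if review.accepted:
            break
        issues = review.issues  # feed the findings back into the next build pass
    return output, review


# Stub agents for demonstration only.
def stub_builder(plan: str, issues: List[str]) -> str:
    # Reports a partial result first; adds the missing part only when told.
    return "login form + rate limit" if issues else "login form"


def stub_verifier(plan: str, output: str) -> Review:
    required = ("login form", "rate limit")
    return Review([f"missing: {part}" for part in required if part not in output])


output, review = build_and_verify("login form with rate limit",
                                  stub_builder, stub_verifier)
```

The detail that matters is that the loop never asks the builder whether it is finished. Only the verifier's review ends the loop early.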
A few changes in my workflow and verification process made a real difference for me.
Smaller waves
Large, vague tasks create too much room for interpretation. Breaking work into smaller waves made it easier to verify what changed, compare the output against the plan and catch drift before it spread.
Tighter acceptance criteria
The more room a task leaves for interpretation, the more likely the agent is to fill in the gaps with something plausible but wrong.
Clear acceptance criteria improved both execution and review. They gave the builder a tighter target and gave the verifier something concrete to test against.
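One way to make criteria concrete is to write each one as a named, executable check. The sketch below assumes an imaginary "slugify" task; the criteria, function names and candidate implementations are all invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical acceptance criteria for an imaginary slugify task.
# Each entry pairs a human-readable requirement with a predicate that tests it.
CRITERIA: Dict[str, Callable[[Callable[[str], str]], bool]] = {
    "lowercases input": lambda f: f("Hello") == "hello",
    "replaces spaces with hyphens": lambda f: f("a b") == "a-b",
    "collapses repeated spaces": lambda f: f("a   b") == "a-b",
    "handles empty string": lambda f: f("") == "",
}


def failed_criteria(candidate: Callable[[str], str]) -> List[str]:
    """Return the names of the criteria the candidate does not meet."""
    failures = []
    for name, check in CRITERIA.items():
        try:
            ok = check(candidate)
        except Exception:
            ok = False  # a crash counts as a failed criterion
        if not ok:
            failures.append(name)
    return failures


def loose_slugify(text: str) -> str:
    # Plausible but imprecise: repeated spaces become repeated hyphens.
    return text.lower().replace(" ", "-")


def strict_slugify(text: str) -> str:
    # split() with no arguments collapses runs of whitespace.
    return "-".join(text.lower().split())
```

Here failed_criteria(loose_slugify) reports the one requirement that was implemented loosely, while failed_criteria(strict_slugify) returns an empty list. The same table of checks serves the builder as a target and the verifier as a test.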
Independent review
The executing agent should not be the only judge of whether the work is complete.
Independent review catches a different class of failure, especially when the problem is not broken code but overconfident reporting. In practice, that reviewer was often a second agent reading the same task from a different angle.
Spec-first verification
It is not enough to ask whether code exists or whether tests passed.
The better question is whether the output matches the intended behavior, the stated constraints and the actual shape of the task. I also learned to check whether the agent had quietly changed course, taken a shortcut or skipped a required note about deviation instead of surfacing that choice explicitly. A second agent was especially useful here because it could compare the implementation back to the original task and spot drift the first pass had glossed over.
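A minimal sketch of that comparison, assuming requirements carry IDs and the reviewer records which ones it actually found implemented. The BuildReport structure and the requirement IDs are hypothetical. In a real workflow the report should come from the second agent's reading of the code, not from the builder's own summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class BuildReport:
    """What the review found, keyed by hypothetical requirement IDs."""
    implemented: Set[str]
    # Requirement ID -> reason, for deviations that were surfaced explicitly.
    declared_deviations: Dict[str, str] = field(default_factory=dict)


def audit_against_spec(required: Set[str], report: BuildReport) -> Dict[str, List[str]]:
    """Compare the work to the original spec and separate honest gaps from silent ones."""
    declared = set(report.declared_deviations)
    missing = required - report.implemented
    return {
        # Required, not implemented, and never mentioned: the dangerous case.
        "silently_skipped": sorted(missing - declared),
        # Required, not implemented, but flagged with a reason.
        "declared_deviations": sorted(missing & declared),
        # Implemented but never asked for: possible drift from the plan.
        "out_of_scope": sorted(report.implemented - required),
    }


required = {"R1", "R2", "R3"}
report = BuildReport(implemented={"R1", "X9"},
                     declared_deviations={"R2": "deferred, needs API key"})
result = audit_against_spec(required, report)
```

In this example R2 is an acknowledged gap, R3 was dropped without a note and X9 is work nobody requested. The last two are the kinds of findings a self-reported "done" tends to hide.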
Live checks where they matter
Passing tests are useful, but they are not the whole truth. Some of the most important problems only show up when you look at real behavior, real outputs or real integration points.
Those changes reshaped what I trust. I trust tighter scopes, explicit constraints, adversarial review and second-pass validation more. I trust self-assessment, vague completion claims and big one-shot execution a lot less.
This Is Not About Model Tribalism
This is not an argument that one coding agent is good and another is bad. It is an argument that different agents fail differently and that those differences become useful when you design the workflow around them.
One model may move faster. Another may review more skeptically. One may be strong at generating structure. Another may be better at spotting deviations, omissions or weak reasoning. In my experience, Codex has been especially useful as a second pass because it reads the task more holistically and checks the work against the broader requirement and the original prompt, not just the local implementation.
What matters is to stop pretending the models are equal, because they are not, and to stop assuming that any one of them will follow the plan cleanly, avoid drift and get everything right on a single pass.
Final Thought
What made AI coding more reliable for me was the review loop around it. In my work, that is where the bigger improvement came from.
The lesson was not that coding agents are useless. It was that verification changes the quality of the outcome more than optimism does.
If you are using AI to build customer-facing products, that shift matters.