
Stop Reviewing Code. Start Proving It Works. My Take on AI in the Quality Process of Software.


I had a flag in college. WMU Broncos. Wore it as a cape to hockey games at Lawson Arena, student section, chanting at refs who could not hear us. If you dig up ESPN footage from 2006 you might see it. That flag has been on my office wall ever since. In 2025, the Broncos won a national championship. You are welcome, Western Michigan.

Code review is my lucky flag. Not mine specifically, I will get to what I actually do in a minute, but code review as practiced across the industry in 2026. It is a ritual. A ceremony performed for comfort, not outcomes. You do it because the process says you do it, the same way you wear your lucky jersey and chant at officials who are not listening.

You know this. You have seen the data. You have lived the data.

What code review actually means

I do not know whether to shout out Bryan Finster, our biggest fan and the person who made me a SAD MF Fellow years ago (so I am technically speaking from authority here), or to shout out the framework itself. Either way, go look at SAD MF. It describes what most companies actually do when they say “code review” better than any conference talk I have seen.

Because “code review” is not a practice. It is a phrase. It means wildly different things depending on who says it.

Ask ten engineering leaders what code review is and you will get ten answers. One team uses it for knowledge sharing, where the senior walks the junior through the change, explains the trade-offs, and everybody learns. That is valuable. Another team uses it for risk reduction, a second pair of eyes before anything touches production. Also valuable. A third team uses it as a vanity metric, “we reviewed 100% of PRs this quarter” on a slide deck that nobody in the room believes. If you want to see what that looks like at scale, read SAD MF’s Code Review Comments per Convoy metric. They measure comment volume without any quality filter because “quality is subjective and subjectivity introduces bias.” The engineer who flags a SQL injection and the engineer who asks you to add a semicolon count the same. That is not parody. That is how most organizations actually measure review rigor.

And then there is the fourth team. The one that does not know why they do it. It is in the process document. It has always been in the process document. Nobody remembers who put it there.

The theory behind all of this is beautiful: a second set of eyes catches defects before they reach production, knowledge transfers across the team, design decisions get pressure-tested, junior engineers learn from senior engineers, and the codebase stays consistent. That is the brochure. Here is what actually happens.

Why this is personal

This is personal for me.

I built sensor networks for the US Army. Chemical weapon leak detection on a base where I sat with my team. If the software missed a reading or threw a false negative, the people breathing that air were us. We did not LGTM that code. We verified it, we validated it, and we ran it.

I wrote medication dosing software for neonates. Babies in the NICU whose bodies could not tolerate a rounding error. The difference between a therapeutic dose and a lethal dose for a two-pound infant is measured in micrograms. We did not skim that code in eleven minutes between meetings. We proved it was correct, and then we proved it again.

So when I tell you I take code review seriously, that is the context. When I review code today, I read every line. I trace the logic path from entry point through the change and back out. I check edge cases. I read the tests, not just that they exist, but that they test the right things. I look at mutation testing scores to see if the tests actually kill mutants or just cover lines, I check cyclomatic complexity to see if someone turned a straightforward function into a decision tree, and I look at afferent and efferent coupling to understand if this change is going to ripple through eight other modules next quarter. Cognitive complexity, change risk anti-patterns, dependency depth, Halstead volume if the method is dense enough to warrant it. The things that tell you a file is about to become everyone’s problem.
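To make the complexity check concrete, here is a deliberately crude sketch of what a cyclomatic count measures: one base path through a function, plus one for every branch point. Real tools (radon, lizard, SonarQube) handle many more cases; this toy version exists only to show what the number means.

```python
import ast

# Rough cyclomatic-complexity count: decision points + 1.
# Node list is a simplification; real analyzers cover more constructs.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                  ast.BoolOp, ast.IfExp)

def rough_complexity(source: str) -> int:
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES)
                    for node in ast.walk(tree))
    return decisions + 1  # one base path plus one per branch point

snippet = """
def classify(x):
    if x < 0:
        return "neg"
    elif x == 0:
        return "zero"
    for _ in range(3):
        if x > 10 and x < 100:
            x -= 1
    return "pos"
"""
print(rough_complexity(snippet))
```

When that number climbs, a straightforward function has become a decision tree, and the review question stops being "does this look fine" and becomes "which of these paths has a test."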

A thorough review of a 400-line PR takes me forty-five minutes to an hour. Sometimes longer.

I know, because I have worked with, led, observed, and been on engineering teams for twenty years, that almost nobody does this. And honestly, the stakes for most teams are not chemical weapons or infant dosing. The stakes are a broken checkout flow or a misaligned dashboard. So they LGTM it. Given the stakes, you cannot entirely blame them.

The LGTM economy

You know how many pull requests get merged every day with no review comments and an approval time under two minutes. You have watched it happen on your own teams. You may have done it yourself.

LGTM. Looks good to me.

You know what LGTM actually means. It means I am busy. It means I trust you. It means this PR has been open for three days and I feel guilty. It means I do not understand this area of the code well enough to say anything useful, but I am not going to admit that, so I will approve it and hope the tests catch whatever I missed.

That is not rigor, that is a rain dance.

SmartBear’s data says the average code review takes eleven minutes. Eleven minutes for a change that took eight hours to build. Microsoft’s research found that reviewers spend most of their time on style and formatting issues (the things a linter handles in milliseconds) and miss the semantic bugs that actually reach production. Google published a study showing that only 15% of review comments were about defects. The rest were style, naming, documentation, and social norms. Fifteen percent. You are running a process where fifteen percent of the output is about defects and you are calling it quality assurance.

So let me do some math. Take a 200-person engineering org. Assume 80 pull requests per day, which is conservative for that size. Eleven minutes per review average. That is 14.7 hours of senior engineer time per day spent reviewing code. At $90 an hour loaded cost, that is $1,320 a day. I am not really good at mathematics but I think that is about $343,000 a year, right?
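Spelled out, with the same assumptions as above (80 PRs per day, 11 minutes each, $90 loaded rate, roughly 260 working days a year):

```python
# Back-of-envelope cost of the review ritual, using the figures above.
prs_per_day = 80
minutes_per_review = 11
loaded_rate = 90          # dollars per senior-engineer hour, loaded
working_days = 260        # rough count of business days per year

hours_per_day = prs_per_day * minutes_per_review / 60
daily_cost = hours_per_day * loaded_rate
annual_cost = daily_cost * working_days
print(f"{hours_per_day:.1f} h/day, ${daily_cost:,.0f}/day, ${annual_cost:,.0f}/year")
```

Run it and you land at about $343,000 a year. Swap in your own org's numbers; the shape of the result does not change much.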

Three hundred and forty-three thousand dollars a year on a process where, by Google’s own data, 85% of the output is about style, naming, and social norms. Work a linter does for free.

Now look at the other side. Pull up your last twenty production incidents. How many of them were caught by a code reviewer? Not “could have been caught,” actually caught. In my experience, the answer is almost never. The incidents that reach production are the ones that pass review and pass tests. Semantic bugs, integration failures, edge cases nobody thought to check. The things eleven-minute reviews were never going to find.

You are spending $343,000 a year on ceremony and your incidents are coming from the gaps the ceremony does not cover.

Could it be that we keep doing code review because the process document says to? Or because the tooling makes it easy to measure (PR approved, yes or no) and we have confused measurability with value? Or is it because the people who built the process have moved on and nobody wants to be the one who says maybe we should rethink this?

Regardless of intent, where are the metrics that tie the economic investment in code review to the value it delivers back to your organization?

AI reviews better than your team

Good reviews do catch real bugs. I have caught race conditions and security holes that no automated tool would have flagged. Code review done well is real verification, and I do not want to dismiss that. The problem is that the version most teams practice is not the version that catches those things. And instead of fixing the practice, we keep measuring whether it happened. That is like tracking whether your hockey team wore helmets, but not whether they scored.

A frontier model today will do a better code review than most of your team. That is not hype. Give a diff to Claude, Gemini, or whatever model your team settled on this quarter, and it will read every line, trace the logic, check edge cases, flag security issues, and do it in thirty seconds. It will do the forty-five-minute review I described above, consistently, on every PR, without the guilt or the Friday fatigue.

But what is the AI reviewing against?

Patterns from its training data. It knows what good Express middleware looks like because it has seen ten million examples. It knows common security antipatterns because those are well-documented. It does not know that your billing service has a race condition when two invoices close in the same millisecond. It does not know that your state machine has an undocumented transition that three customers depend on. It does not know that the function it is reviewing was written to work around a vendor API bug that has been there since 2019. Without the context of your system, your requirements, and your failure history, the model is doing a very fast, very thorough generic review. It catches real defects, more than your eleven-minute human reviewer does, and it does it consistently. But it misses the things that only someone who knows your system would catch, and those are the things that cause your production incidents.

Will AI replace code review entirely at some point? Probably. The labs are working on approaches that look promising: agentic verification, self-healing pipelines, models that can reason about system-level behavior across services. Some of it is impressive. But the labs also have outages. Frequently. The organizations building the most advanced AI on the planet still ship bugs to production. If they cannot review their way to zero defects with frontier models and the best engineers in the world, you are not going to get there by pointing an API at your pull requests.

So I am going to follow Humble and Farley’s advice. Build the pipeline. Prove correctness. And use AI where it actually changes the equation, not as a reviewer, but as the engineer that builds the verification systems.

The pipeline is the gate

Dave Farley and Jez Humble wrote a book in 2010 called Continuous Delivery. Sixteen years ago. They made an argument that most of the industry still has not internalized, which is that the goal is not to find defects after they are introduced. The goal is to shorten the feedback loop until defects are caught within minutes of being created, automatically, every time. Not by a reviewer, but by the system itself.

Deming said it about manufacturing fifty years before that. Farley and Humble translated it into software terms that should have changed everything: build pipelines that verify correctness at every stage, automate the verification, and make the pipeline the authority on whether the software works. Not a human reviewer scrolling through a diff on a Tuesday afternoon.

Bryan Finster has been operationalizing this at enterprise scale (Defense Department scale) and proving it works in the environments where people swear it cannot. Minimum Viable Continuous Delivery. Stop batching. Stop inspecting quality in. Build it in. Every commit deployable, every deployment verified. No gates that require a human to form an opinion about a diff. Gates that execute and produce a verdict.

The moment you rely on a human gate to catch defects, you have accepted a defect rate. You have built a system that tolerates bugs and hopes to intercept them. Bryan has been saying this louder than almost anyone, and the data backs him up. The DORA research shows that teams deploying multiple times per day have lower change failure rates than teams deploying monthly. Not because they review more carefully, but because their systems verify more continuously.

What the shift looks like

The shift is to stop thinking about code review as a human reading code and start thinking about validation gates that prove the product is correct before it moves forward. A pipeline where every commit triggers a build that compiles, runs the full test suite (unit, integration, contract, end-to-end), checks security policy, validates performance benchmarks, confirms the deployment can roll back, and produces an artifact identical to what will run in production. Every stage is a gate. Every gate has a binary answer. Pass or fail. No judgment calls.
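The "every gate has a binary answer" idea can be sketched in a few lines. The gate names and checks here are illustrative stand-ins, not a real CI configuration; the point is the shape: gates execute in order, each produces a verdict, and a failure stops the line with no room for opinion.

```python
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    gate: str
    passed: bool

def run_pipeline(gates: list[tuple[str, Callable[[], bool]]]) -> list[Verdict]:
    """Run gates in order; stop at the first failure. Pass or fail, no overrides."""
    verdicts = []
    for name, check in gates:
        passed = check()
        verdicts.append(Verdict(name, passed))
        if not passed:  # fail fast: no judgment calls
            break
    return verdicts

# Illustrative gate list; each lambda stands in for a real check.
gates = [
    ("build",      lambda: True),
    ("unit-tests", lambda: True),
    ("security",   lambda: False),  # simulated policy violation
    ("perf-bench", lambda: True),   # never reached after a failure
]
result = run_pipeline(gates)
print([(v.gate, v.passed) for v in result])
```

In a real pipeline each lambda is a build, a test suite, a policy scan, or a benchmark run, but the contract is the same: the gate answers, the human does not.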

Farley has been teaching this for a decade and a half. What is new is that AI makes it achievable for teams that could never afford to build comprehensive gates before, and makes the gates themselves smarter than anything a static pipeline could do. The agent does not review your code. The agent builds the gates. It writes the tests that verify behavior, it generates the contract tests that catch integration drift, and it builds the performance benchmarks that detect regressions.

And then you embed AI into the pipeline itself. Not as a reviewer reading diffs, but as a participant in the verification. An AI gate that understands your domain model can validate that a pricing change does not create negative-margin scenarios across your entire product catalog, or that a schema change does not break downstream consumers in ways a static type checker cannot see. That is not AI reviewing code. That is AI verifying the product is correct, and the distinction matters.
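A domain-invariant gate like the pricing example might look like the sketch below. The catalog and the margin rule are made up for illustration; the real version would pull your actual catalog and encode your actual invariants.

```python
# Hypothetical pricing-invariant gate: before a pricing change merges,
# verify no SKU in the catalog would sell below cost.
catalog = [
    {"sku": "A-100", "cost": 40.0, "proposed_price": 55.0},
    {"sku": "B-200", "cost": 18.0, "proposed_price": 17.5},  # would sell at a loss
    {"sku": "C-300", "cost": 9.0,  "proposed_price": 12.0},
]

def negative_margin_violations(items):
    """Return the SKUs whose proposed price is below cost."""
    return [item["sku"] for item in items
            if item["proposed_price"] < item["cost"]]

violations = negative_margin_violations(catalog)
gate_passed = not violations
print(gate_passed, violations)
```

A static type checker will never catch `B-200`; the types are fine. Only a gate that knows what a margin is can fail this change, which is exactly the distinction between reviewing code and verifying the product.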

The gates are only as good as what they check, and writing the right tests is the real work. An agent that generates tests which verify implementation rather than behavior is not helping. The pipeline that catches the defect is downstream of the human who understood the requirement well enough to specify what correct behavior looks like. This is not a push-button solution. It is an investment in verification infrastructure that pays compound returns, and AI makes the compound rate faster than it has ever been.

A practical roadmap

If your team has 30% test coverage and a ten-year-old monolith, you do not stop doing code review on Monday. That would be reckless. You start building the gates that will replace it.

First quarter: take your highest-churn files and your most common production incident categories. Have your agents write the tests that cover those paths. Measure your change failure rate before and after. You should see movement within 90 days.

Second quarter: expand coverage. Add contract tests between services. Add performance benchmarks for critical paths. Start tracking what your pipeline catches that your reviewers did not.

Third quarter: shift the conversation. When the pipeline is catching more defects than reviewers are (and it will be, faster than you expect), you can start treating code review as a knowledge-sharing practice rather than a quality gate. Reviews become about design, mentoring, and shared understanding. Not about finding bugs. The pipeline finds the bugs.
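"Highest-churn files" is a mechanical query, not a judgment call. In practice you would feed it from `git log --name-only`; the commit data here is made up so the sketch runs standalone.

```python
from collections import Counter

# Hypothetical commit history: each entry is the list of files touched
# by one commit. Real input comes from `git log --name-only`.
commits = [
    ["billing/invoice.py", "billing/tax.py"],
    ["billing/invoice.py"],
    ["api/routes.py", "billing/invoice.py"],
    ["api/routes.py"],
]

def churn_ranking(history):
    """Rank files by how many commits touched them, highest first."""
    counts = Counter(path for commit in history for path in commit)
    return counts.most_common()

ranking = churn_ranking(commits)
print(ranking[0])  # the first file your agents should write tests for
```

Cross that ranking with your incident categories and you have the first-quarter backlog without a single meeting.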

Code review does not disappear. It just stops pretending to be something it is not. It stops being the gate and starts being the conversation.

For regulated industries (financial services, healthcare, defense), the pipeline solves your audit problem better than human review ever did. An automated gate that runs the same checks on every commit produces a traceable, reproducible audit trail. Every gate decision is logged, every verification is repeatable. That is a stronger compliance posture than “Steve reviewed it on Tuesday and clicked approve.” If your SOX controls require human oversight, the human reviews the gate design and the gate results, not every diff.
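The audit trail can be as simple as one structured record per gate decision. The field names below are illustrative; map them to whatever your compliance regime actually requires.

```python
import json
import datetime

# Sketch of a reproducible audit record for one gate decision.
def audit_record(commit: str, gate: str, passed: bool, details: str) -> str:
    """Serialize a gate verdict as one JSON line for the audit log."""
    return json.dumps({
        "commit": commit,
        "gate": gate,
        "passed": passed,
        "details": details,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

line = audit_record("a1b2c3d", "security-policy", True, "0 findings")
print(line)
```

Every commit gets the same checks, every verdict is logged, and every run is repeatable on demand, which is a claim "Steve clicked approve" can never make.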

The flag stays on the wall

I still do thorough code reviews. Old habits. And I still find things: design issues, naming problems, architectural decisions that will cost us later. Code review has value as a knowledge-sharing practice, as a mentoring tool, as a way to build shared understanding of a system. That value is real and I am not arguing you throw it away.

But I have stopped pretending it is quality assurance. It is not. It never was, for most teams. It was a ritual we performed because it felt rigorous, the same way wearing that flag at Lawson felt like I was helping the Broncos win.

The Broncos won a championship seventeen years after I graduated. The flag was on my wall. Coincidence is a powerful drug.

Maybe the best thing we could do is just get rid of the phrase “code review” entirely. Stop using the words. Describe what actually happens instead. “We run automated verification on every commit.” “We pair on design decisions before writing code.” “We measure change failure rate, not approval rate.” Say what you do, not what you call it.

Because if you keep using the phrase, the entire student section twenty years from now will still be doing the same chants and the same rituals that netted us one national championship in fifty years.

It was the chants and rituals, right?

The flag stays on the wall. It earned that. Your code review process has not.
