I have a three-year-old daughter and a five-year-old daughter, and I am learning how to trust them.
Not the greeting-card version of trust. The operational version.
My five-year-old can go anywhere in the house. Upstairs, downstairs, bedroom, playroom, bathroom, the kitchen as long as she is not climbing for something sharp or sticky. She has earned household autonomy. Enough that I do not follow her every time she disappears around a corner.
My three-year-old still needs a chaperone to go up the stairs.
That sounds unfair until you watch a three-year-old make a stair decision with full confidence and no evidence. She is brilliant. She is funny. She is also three. Her judgment is not a moral failing. It is a maturity problem, a feedback problem, and sometimes a sock-on-hardwood problem.
So I do not ask, “Do I trust my daughters?”
I ask where I trust each of them, under what conditions, with what guardrails, and what happens if I am wrong.
That is the actual trust question.
I have been having the same conversation with friends about trusting frontier models to write code.
The model made a bad abstraction choice. It missed a null edge case. It generated a test that asserted the mock instead of the behavior. The conclusion arrives fast: the model is not trustworthy.
Sometimes they are right.
Sometimes the more honest sentence is worse: they do not trust any of it.
They do not trust the model. They do not trust the contractor. They do not trust the low-rated developer. They do not trust the legacy team that knows the system but keeps making the same category of mistake.
So they validate everything, reread every diff, run every test, trace every deployment path, and wonder whether it would have been faster to do it themselves.
That is not a productivity question first.
It is a trust question.
Sometimes they are using last year’s model and expecting this year’s frontier result.
That is like evaluating cloud migration in 2013 and using the result to make a 2026 infrastructure strategy. It is not analysis. It is a timestamp.
Sometimes the model is working inside a legacy codebase where the real problem is not model intelligence. The organization refuses to change the system. Not cannot. Refuses. The team knows the service boundaries are wrong. They know the test suite is theater. They know the deployment path requires tribal knowledge and two people who remember why the customer export job preserves tenant ordering.
Changing it would be uncomfortable.
So they ask the model to behave perfectly inside a system humans have tolerated for years.
Then they call the model untrustworthy.
Meanwhile, a contractor sends an eight-hundred-line pull request across that same customer export service. He has been on the account for three weeks. Nobody has pair-programmed with him. Nobody has watched him debug production. Nobody knows whether he understands why tenant ordering matters because your largest customer built reconciliation around that behavior in 2019.
The review takes nine minutes.
“Looks good.”
Merged.
This is the trust problem nobody wants to name.
You are not deciding whether AI-generated code requires review. You are deciding which actors are trusted by default, which actors are suspicious by default, and whether that trust has anything to do with evidence.
The answer is not “trust the model.”
The answer is build a trustworthy delivery system.
That starts by rethinking what trust is, how you measure it, and where your bias toward familiar humans hides risk.
The frontier model is suspicious because it is new, fails in weird ways, and everyone remembers the demo where it hallucinated a library that did not exist. Fine. Skepticism is rational.
But look at the rest of the system.
Do you?
Do you trust the contractor because the vendor passed procurement?
Do you trust the low-rated developer because HR has not put them on a plan?
Do you trust the offshore team because the delivery manager sends a green report every Friday?
Do you trust the senior engineer because they were right about the cache invalidation incident two years ago?
Do you trust the framework upgrade because the migration guide said it was safe?
How much of that trust is earned?
How much of it is inherited?
How much of it is familiarity wearing a badge?
I keep hearing the same sentence: “We tried AI for code and we still had to review everything.”
What is “still” doing in that sentence? You review contractors, new hires, and senior engineers when the change touches billing, authentication, payments, customer data, infrastructure, or the weird batch job nobody wants to own.
So when the model writes code and you review it, what exactly failed?
Did the model fail because review was necessary, or did your trust model fail because review is the only trust mechanism you have?
Code review is not where trust is created. Code review is where missing trust becomes visible.
If the only way you know whether a change is safe is to have one tired senior engineer read the diff after lunch, you do not have a review process. You have a human bottleneck with syntax highlighting.
There is another complaint that sounds technical until you press on it.
“The model did not do it the way I would have done it.”
Okay. Is it wrong?
Usually there is a pause.
No, they say. Not wrong. It works. The tests pass. The edge case is handled. It is just not how I would structure it. It is a style thing. I am used to doing it my way.
I hear a version of that at bedtime from my three-year-old.
“Mommy does not do it like that.”
Correct. I do not do bedtime like Mommy does. I read the book in a different voice. I negotiate water differently. I probably put the blanket on wrong by the standards of a three-year-old with strong process opinions.
But the outcome is the same.
The kid is happy. The kid is asleep. The kid is in bed.
So here is the uncomfortable question for the professional reviewer: are you the toddler or the parent?
If you are being paid as the professional in the room, your job is not to reject every implementation that violates your bedtime ritual. Your job is to know the difference between unsafe and unfamiliar.
If the model creates a race condition, leaks customer data, or hides a domain assumption inside a helper nobody will notice until reconciliation breaks, reject it loudly.
That is review doing its job.
If the model picked a different-but-readable structure and the behavior is correct, what are you protecting?
Your standard?
Or your preference?
I have to talk about the math here.
I married an accountant. Money makes the world go around. If this were only about cool technology, GeoCities might have been the first trillion-dollar company.
So do the math on distrust.
Take a senior engineer at $280,000 fully loaded. That is roughly $135 an hour before meetings, context switching, incident response, mentoring, and the four production systems held together by one Confluence page written in 2021.
Put that engineer in reviews for eight hours a week. That is $1,080 a week, $56,000 a year, for one senior engineer doing review as an activity. Ten senior engineers doing that is $560,000 a year.
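Here is that napkin math as code, assuming a 2,080-hour work year. The numbers are rough on purpose.

```python
# Back-of-the-napkin cost of review as an activity.
# Assumes a 2,080-hour paid year; every figure is approximate.
fully_loaded_salary = 280_000                # one senior engineer, fully loaded
hourly_rate = fully_loaded_salary / 2_080    # roughly $135/hour
review_hours_per_week = 8

weekly_cost = review_hours_per_week * hourly_rate   # about $1,080
yearly_cost = weekly_cost * 52                      # about $56,000
team_of_ten = yearly_cost * 10                      # about $560,000

print(f"${weekly_cost:,.0f}/week, ${yearly_cost:,.0f}/year, ${team_of_ten:,.0f} for ten")
```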
What are you buying?
If the review catches payment edge cases, race conditions, broken migration order, missing telemetry, security mistakes, and real customer-risk defects, you are buying something.
If the review mostly catches naming, formatting, style preferences, “can we move this helper,” and “I would have structured this differently,” you are buying bedtime enforcement at senior-engineer rates.
Your CFO should hate that. Your CTO should hate that more. Because the bill is not just salary. It is cycle time, missed roadmap, senior attention, developer morale, and the quiet decision by good engineers to stop pushing hard because every path ends in the same review swamp.
If a frontier model writes the first implementation in twenty minutes and your human spends forty minutes reviewing the domain assumption, you may be ahead. If the model writes the tests, runs the suite, fixes the first failure, and leaves the human to review risk, you may be far ahead.
And Mark, if you keep uploading your specs so your teammates can review the spec, I am going to use your real name here.
Do not make the spec review the work.
Generate the code with the agent. Review the output against the outcome. Did the customer workflow work? Did the test prove the behavior? Did the migration preserve the data? Did the API contract hold?
Then generate again.
I just got you a bazillion story points and happier developers.
But if your process treats the model as suspicious and the contractor as normal, you are not measuring risk. You are measuring novelty.
Trust is not a feeling. Trust is a system property.
You do not trust a contractor because their badge says contractor. You trust them when their work lands safely over time, tests catch mistakes, blast radius is small, rollback is boring, and telemetry proves the customer did not become the test suite.
So what would make a frontier model trustworthy?
The same things.
Small changes. Clear contracts. Real tests. Behavior assertions instead of mock worship. Type checks. Contract tests. A local environment the agent can run. Continuous delivery with canaries, feature flags, roll-forward plans, rollback, and telemetry. A work log that shows what it tried, what failed, and what changed.
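If "mock worship" sounds abstract, here is a sketch. The service and the names are invented for illustration; the pattern is not.

```python
# Hypothetical example: apply_discount and pricing_service are invented names.
from unittest.mock import MagicMock

def apply_discount(order, pricing_service):
    """Apply the best available discount to an order total."""
    discount = pricing_service.best_discount(order)
    return order["total"] * (1 - discount)

# Mock worship: asserts the wiring, not the behavior.
# This test passes even if the discount math is completely wrong.
def test_calls_pricing_service():
    pricing = MagicMock()
    apply_discount({"total": 100.0}, pricing)
    pricing.best_discount.assert_called_once()

# Behavior assertion: pins down the outcome a customer would see.
def test_applies_twenty_percent_discount():
    pricing = MagicMock()
    pricing.best_discount.return_value = 0.20
    assert apply_discount({"total": 100.0}, pricing) == 80.0
```

The first test asserts that a method was called. The second asserts that the customer paid the right amount. Only one of those is trust.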
I am not advocating for blindly shipping code because a model wrote it.
Please tell me you are using good software engineering practices. Please tell me you have continuous delivery that can survive contact with production: roll-forward plans, automated checks, observable deployments, feature flags, canaries, and a way to know whether the thing you shipped is hurting customers before a customer tells you.
That used to be expensive.
Now it is close to free. Agents can write the contract tests you never got around to, generate edge-case coverage, wire smoke tests, build fixture data, check migration paths, and extend the platform's test scaffolding every time they touch the system.
If you are going to distrust the model, fine.
But do not distrust the model while refusing to build the verification machinery that would make any actor safer.
You do not trust the model.
You trust the delivery system.
Or you do not.
That is why “do we need to review the code?” is the wrong first question.
Review by whom? Review for what? Review after which automated checks? Review against which domain contract? Review with what production signal after deployment?
The question is why your review policy is organized around the author instead of the risk.
If the change touches money, identity, safety, compliance, customer data, production topology, or cross-service contracts, the author should matter less than the blast radius. Contractor, senior engineer, junior engineer, frontier model. Same risk class. Same verification bar.
If the change is a report label, a low-risk workflow, a migration note, a dead-code cleanup covered by tests, or a generated client from a schema you own, why is a human reading every line like the company depends on it?
What are you proving?
And to whom?
There is a reason people over-review model code.
The model’s failures feel alien. Humans fail in familiar ways. A weak developer forgets the edge case you expected. A senior engineer makes a judgment call you can argue with. The model makes a mistake with perfect confidence, and that feels worse.
It may be worse.
Or it may just be less familiar.
I have watched humans ship bugs with perfect confidence for twenty years. They just do it with better eye contact.
You know what to do when the contractor fails. Escalate to the vendor. Change the statement of work. Ask for a different person.
What do you do when the model fails?
Most organizations do not know, so they fall back to the oldest mechanism they have.
Read the code harder.
That is not a strategy. It is a reflex.
The trustworthy delivery system is boring.
Classify the change before anyone reads the diff. A CSS class on an internal admin screen is not an account lockout policy. A data migration is not a logging cleanup. A payment retry loop is not a generated DTO.
Define the verification bar by risk class. Low-risk changes get automated checks and spot review. Medium-risk changes get tests, contract checks, and human review against behavior. High-risk changes get domain-owner review, adversarial testing, staged rollout, telemetry, and rollback proof.
Track reliability over time. Human or model. Contractor or employee. Which changes passed without rework? Which caused incidents? Which required review churn? Which improved test coverage? Which reduced change failure rate? If the frontier model has a better thirty-day record than the contractor, what does your review policy do with that information?
Make the model run the checks. Run the tests. Fix failures. Explain the diff. Identify the risk class. List assumptions. Tell you what it could not verify. A frontier model that cannot do that is not ready for that class of work. A human who cannot do that should raise the same concern.
Put production feedback in the loop. Trust that never sees production is faith. Trust that watches error rates, latency, customer behavior, rollback frequency, and incident follow-up is engineering.
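What might that look like in practice? A sketch, with invented risk classes, paths, and check names; yours will differ.

```python
# A minimal sketch of review policy keyed to risk class instead of author.
# The classes, paths, and check names here are invented for illustration.
RISK_POLICY = {
    "low": {        # report labels, dead-code cleanup, generated clients
        "checks": ["unit_tests", "type_check", "lint"],
        "human_review": "spot_check",
        "rollout": "standard",
    },
    "medium": {     # workflow changes, refactors with behavior impact
        "checks": ["unit_tests", "type_check", "contract_tests"],
        "human_review": "behavior_review",
        "rollout": "feature_flag",
    },
    "high": {       # money, identity, customer data, migrations
        "checks": ["unit_tests", "contract_tests", "adversarial_tests"],
        "human_review": "domain_owner",
        "rollout": "canary_with_rollback_proof",
    },
}

HIGH_RISK_PATHS = ("billing/", "auth/", "payments/", "migrations/")

def classify(change_paths: list[str]) -> str:
    """Classify the change before anyone reads the diff."""
    if any(p.startswith(HIGH_RISK_PATHS) for p in change_paths):
        return "high"
    if any(p.endswith((".sql", ".tf")) for p in change_paths):
        return "high"
    return "medium"  # default up, never down; allowlist low-risk explicitly

required = RISK_POLICY[classify(["billing/retry.py"])]
```

Notice what is not an input to that function: the author.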
None of this answers whether you need to review the code. That is the point.
The uncomfortable possibility is that a current frontier model, inside a disciplined delivery system, may be more trustworthy for some work than a familiar human inside a loose process.
Not all work. Not all models. Not all teams.
But some work.
A model that runs the test suite, explains assumptions, checks migration order, and operates inside a small blast radius may be more reliable than a contractor pushing eight hundred lines through a nine-minute review.
If that bothers you, good.
It means the trust hierarchy in your head may not match the evidence in your system.
I am not asking you to trust the model.
I am asking whether you understand why you trust everyone else.
The contract house. The low-rated developer. The senior engineer who earned trust in one part of the system and now gets trusted everywhere. The offshore team whose work arrives in batches too large to review properly. The legacy service nobody tests because the person who understood it left in 2020.
Was that trust evidence, or habit?
Was it the system proving safety, or the org chart laundering risk?
A year from now, the teams that win will not be the ones that stopped reviewing code. They will not be the ones that made every deploy walk through the suspicious-process parade either, where twelve people squint at the diff, nobody trusts the pipeline, and someone named Dave asks if we checked the logs before the code has even shipped.
They will be the ones that rebuilt trust around evidence.
Not human versus model.
Trustworthy delivery system versus trusted story.
So when the next pull request lands, and the author line says it came from a frontier model, ask the question you should have been asking about the contractor all along.
What, exactly, has earned your trust?