I have a three-year-old daughter and a five-year-old daughter, and I am learning how to trust them. Not the greeting card version of trust. The operational version.
My five-year-old can go anywhere in the house. Upstairs, downstairs, bedroom, playroom, bathroom, kitchen if she is not climbing for something sharp or sticky. She has earned household autonomy. Enough that I do not follow her every time she disappears around a corner.
My three-year-old still needs a chaperone to go up the stairs.
That sounds unfair until you watch a three-year-old make a stair decision with full confidence and no evidence. She is brilliant. She is funny. She is also three. Her judgment is not a moral failing. It is a maturity problem, a feedback problem, and sometimes a sock-on-hardwood problem.
So I do not ask, "Do I trust my daughters?" I ask where I trust each of them, under what conditions, with what guardrails, and what happens if I am wrong. That is the actual trust question.
I have been having the same conversation with friends about trusting frontier models to write code. The model made a bad abstraction choice. It missed a null edge case. It generated a test that asserted the mock instead of the behavior. The conclusion arrives fast. The model is not trustworthy.
Sometimes they are right. Sometimes the more honest sentence is worse. They do not trust any of it. They do not trust the model. They do not trust the contractor. They do not trust the low-rated developer. They do not trust the legacy team that knows the system but keeps making the same category of mistake. So they validate everything, reread every diff, run every test, trace every deployment path, and wonder whether it would have been faster to do it themselves.
That is not a productivity question first. It is a trust question.
Sometimes they are using last year's model and expecting this year's frontier result. That is like evaluating cloud migration in 2013 and using the result to make a 2026 infrastructure strategy. It is not analysis. It is a timestamp.
Sometimes the model is working inside a legacy codebase where the real problem is not model intelligence. The organization refuses to change the system. Not cannot. Refuses. The team knows the service boundaries are wrong. They know the test suite is theater. They know the deployment path requires tribal knowledge and two people who remember why the customer export job preserves tenant ordering. Changing it would be uncomfortable. So they ask the model to behave perfectly inside a system humans have tolerated for years. Then they call the model untrustworthy.
Meanwhile, a contractor sends an 800-line pull request across that same customer export service. He has been on the account for three weeks. Nobody has pair-programmed with him. Nobody has watched him debug production. Nobody knows whether he understands why tenant ordering matters because your largest customer built reconciliation around that behavior in 2019. The review takes nine minutes. "Looks good." Merged.
This is the trust problem nobody wants to name. You are not deciding whether AI-generated code requires review. You are deciding which actors are trusted by default, which actors are suspicious by default, and whether that trust has anything to do with evidence.
The answer is not "trust the model." The answer is build a trustworthy delivery system. That starts by rethinking what trust is, how you measure it, and where your bias toward familiar humans hides risk. The frontier model is suspicious because it is new, fails in weird ways, and everyone remembers the demo where it hallucinated a library that did not exist. Fine. Skepticism is rational.
But look at the rest of the system. Do you? Do you trust the contractor because the vendor passed procurement? Do you trust the low-rated developer because human resources has not put them on a plan? Do you trust the offshore team because the delivery manager sends a green report every Friday? Do you trust the senior engineer because they were right about the cache invalidation incident two years ago? Do you trust the framework upgrade because the migration guide said it was safe?
How much of that trust is earned? How much of it is inherited? How much of it is familiarity wearing a badge?
I keep hearing the same sentence. "We tried AI for code and we still had to review everything." What is "still" doing in that sentence? You review contractors, new hires, and senior engineers when the change touches billing, authentication, payments, customer data, infrastructure, or the weird batch job nobody wants to own. So when the model writes code and you review it, what exactly failed? Did the model fail because review was necessary, or did your trust model fail because review is the only trust mechanism you have?
Code review is not where trust is created. Code review is where missing trust becomes visible. If the only way you know whether a change is safe is to have one tired senior engineer read the diff after lunch, you do not have a review process. You have a human bottleneck with syntax highlighting.
There is another complaint that sounds technical until you press on it. "The model did not do it the way I would have done it." Okay. Is it wrong?
Usually there is a pause. No, they say. Not wrong. It works. The tests pass. The edge case is handled. It is just not how I would structure it. It is a style thing. I am used to doing it my way.
I hear a version of that at bedtime from my three-year-old. "Mommy does not do it like that." Correct. I do not do bedtime like Mommy does. I read the book in a different voice. I negotiate water differently. I probably put the blanket on wrong by the standards of a three-year-old with strong process opinions. But the outcome is the same. The kid is happy. The kid is asleep. The kid is in bed.
So here is the uncomfortable question for the professional reviewer. Are you the toddler or the parent? If you are being paid as the professional in the room, your job is not to reject every implementation that violates your bedtime ritual. Your job is to know the difference between unsafe and unfamiliar. If the model creates a race condition, leaks customer data, or hides a domain assumption inside a helper nobody will notice until reconciliation breaks, reject it loudly. That is review doing its job.
If the model picked a different but readable structure and the behavior is correct, what are you protecting? Your standard? Or your preference?
I have to talk about the math here. I married an accountant. Money makes the world go around. If this were only about cool technology, GeoCities might have been the first trillion-dollar company. So do the math on distrust.
Take a senior engineer at $280,000 fully loaded. That is roughly $135 an hour before meetings, context switching, incident response, mentoring, and the four production systems held together by one Confluence page written in 2021. Put that engineer in reviews for eight hours a week. That is $1,080 a week, $56,000 a year, for one senior engineer doing review as an activity. Ten senior engineers doing that is $560,000 a year.
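If you want to run that arithmetic against your own organization, here is a back-of-the-envelope sketch in Python. Every input is an assumption; swap in your own loaded cost, review hours, and headcount.

```python
# Back-of-the-envelope cost of review as an activity. All inputs are assumptions.
loaded_cost_per_year = 280_000     # fully loaded senior engineer
working_hours_per_year = 2_080     # 52 weeks x 40 hours, before meetings and incidents
review_hours_per_week = 8
engineers = 10

hourly = loaded_cost_per_year / working_hours_per_year   # ~ $135 an hour
weekly = hourly * review_hours_per_week                   # ~ $1,080 a week
yearly = weekly * 52                                      # $56,000 a year per engineer
team_cost = yearly * engineers                            # $560,000 a year for ten

print(f"Review as an activity: ${team_cost:,.0f} per year")
```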
What are you buying? If the review catches payment edge cases, race conditions, broken migration order, missing telemetry, security mistakes, and defects that carry real customer risk, you are buying something. If the review mostly catches naming, formatting, style preferences, "can we move this helper," and "I would have structured this differently," you are buying bedtime enforcement at senior engineer rates. Your chief financial officer should hate that. Your chief technology officer should hate that more. Because the bill is not just salary. It is cycle time, missed roadmap, senior attention, developer morale, and the quiet decision by good engineers to stop pushing hard because every path ends in the same review swamp.
If a frontier model writes the first implementation in twenty minutes and your human spends forty minutes reviewing the domain assumption, you may be ahead. If the model writes the tests, runs the suite, fixes the first failure, and leaves the human to review risk, you may be far ahead.
And Mark, if you keep uploading your specs so your teammates can review the spec, I am going to use your real name here. Do not make the spec review the work. Generate the code with the agent. Review the output against the outcome. Did the customer workflow work? Did the test prove the behavior? Did the migration preserve the data? Did the API contract hold? Then generate again. I just got you a bazillion story points and happier developers.
But if your process treats the model as suspicious and the contractor as normal, you are not measuring risk. You are measuring novelty.
Trust is not a feeling. Trust is a system property. You do not trust a contractor because their badge says contractor. You trust them when their work lands safely over time, tests catch mistakes, blast radius is small, rollback is boring, and telemetry proves the customer did not become the test suite. So what would make a frontier model trustworthy? The same things.
Small changes. Clear contracts. Real tests. Behavior assertions instead of mock worship. Type checks. Contract tests. A local environment the agent can run. Continuous delivery with canaries, feature flags, roll-forward plans, rollback, and telemetry. A work log that shows what it tried, what failed, and what changed.
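To make the mock-worship point concrete, here is a minimal sketch in Python. The `charge` function and its names are hypothetical; the contrast is the point. The first test only proves a mock was called. The second proves the rule the business actually depends on.

```python
from unittest.mock import Mock

def charge(gateway, amount_cents):
    """Charge the customer; non-positive amounts never reach the gateway."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return gateway.charge(amount_cents)

# Mock worship: this passes even if charge() drops the validation entirely.
def test_charge_calls_gateway():
    gateway = Mock()
    charge(gateway, 500)
    gateway.charge.assert_called_once_with(500)

# Behavior assertion: this proves the rule a customer actually depends on.
def test_charge_rejects_nonpositive_amounts():
    gateway = Mock()
    try:
        charge(gateway, 0)
        assert False, "expected ValueError"
    except ValueError:
        pass
    gateway.charge.assert_not_called()
```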
I am not advocating for blindly shipping code because a model wrote it. Please tell me you are using good software engineering practices. Please tell me you have continuous delivery that can survive contact with production. Roll-forward plans, automated checks, observable deployments, feature flags, canaries, and a way to know whether the thing you shipped is hurting customers before a customer tells you.
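As a sketch of that last guardrail, here is a hypothetical canary check in Python. The `flags` and `metrics` objects stand in for whatever flag service and telemetry you actually run; none of these names are a real API. The shape is the point: the new change ramps behind a flag and gets rolled back by a number, not by an angry customer.

```python
# Hypothetical canary ramp: every interface here is an assumption, not a real API.
ERROR_RATE_THRESHOLD = 0.01   # assumed rollback trigger: 1% of requests failing

def evaluate_canary(flags, metrics, flag_name: str) -> str:
    """Ramp a feature flag while telemetry stays healthy; roll back the moment it does not."""
    error_rate = metrics.error_rate(flag_name, window_minutes=15)
    if error_rate > ERROR_RATE_THRESHOLD:
        flags.set_rollout(flag_name, percent=0)   # roll back before customers become the test suite
        return f"rolled back {flag_name}: error rate {error_rate:.2%}"
    next_percent = min(100, flags.rollout_percent(flag_name) + 10)
    flags.set_rollout(flag_name, percent=next_percent)   # ramp another 10%
    return f"ramped {flag_name} to {next_percent}%"
```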