I was on a call with my friend Nathan tonight. We were talking about what is possible in software right now — agent-driven development, POCs replacing specs, where all of this is heading — and then he said something that stuck.
“Norman, everyone keeps saying AI agents will eventually run the whole product loop. Market research, build, ship, measure, iterate. No humans deciding what to build. But nobody is actually doing it. It is all conference slides.”
He is right. It does not mean anything until someone builds it and shows you the results.
So I am going to build it. And you are going to be able to watch.
What This Actually Is
I am building a system of agents that runs the entire product lifecycle for a consumer browser game, end to end, without a human deciding what features to ship. The agents handle market research, development, deployment, measurement, and iteration.
Before I go further: if you are running a payments platform, a healthcare system, an ERP, anything where a wrong product decision has legal or safety consequences — do not do this. That would be reckless. I wrote about the importance of matching your AI approach to your domain risk and I meant it. A game is a domain where failure is cheap. If the agents ship something terrible, three people play it and forget about it. Nobody’s mortgage gets denied. That is why I picked it.
The Constraints
Every experiment needs a cage before you put anything in it. I learned that running transformation programs at F500 companies. You do not let people (or agents) run in an open field and hope they stay in bounds. You define the field first.
These are the hard constraints baked into the system. Some are guardrails on the agents themselves, some are rules about how the product gets built. All of them are non-negotiable.
- Browser game, runs on a school laptop. Web-based, no app store, no install, no GPU requirement. If your kid opens it in Chrome on a four-year-old Chromebook and it runs smoothly, it qualifies. The agents cannot build anything outside the game domain — no e-commerce bolt-ons, no data collection side projects. One game, one host, one deployment pipeline.
- Rated G. Family friendly. Legal in the US and EU. No violence, no gambling mechanics, no predatory monetization, no loot boxes, no dark patterns designed to keep a twelve-year-old playing past bedtime. A content filter runs on every asset the agents produce (code, copy, images, survey questions) before anything reaches production. Fails the filter, does not ship.
- No text boxes. No asking players what to build. This is the constraint that makes the experiment real. You do not get to put a text box in front of a player that says “What feature do you want?” That is cheating. That is outsourcing the hardest part of the job to the user. The agents have to do what a seasoned PM does: design structured surveys (multiple choice, rating scales, preference rankings), watch player behavior, analyze session patterns, monitor where players drop off and what they replay. They read social media — X, Reddit, forums, wherever players talk about the game. Product insight comes from inference and observation, not from handing someone a suggestion box.
- Non-identifiable data only. No IP addresses, no device fingerprints, no tracking cookies, no cross-session tracking. The agents can collect anonymous behavioral data (what a player clicked, how long they played, where they dropped off) and responses to the structured surveys they design. The data schema is locked before the agents touch it — they can read from it, but they cannot modify what gets collected.
- Production-worthy releases with automatic rollback. What goes live needs to work — tested, deployed, monitored. Every release is versioned. If a release degrades any key metric below a threshold the agents set in advance, the system auto-rolls back to the previous version before the next cycle runs. I wrote about what happens when you skip governance and I have no intention of running that experiment.
- Success is market signal, not vanity metrics. Page views do not matter. The agents measure retention, session depth, replay rates, survey responses, and social media sentiment. They decide what to measure, what the results mean, and what to build next.
- Hard cost cap. Monthly ceiling on inference spend. If the agents burn through the budget mid-month, the loop pauses until the next cycle. I have seen organizations blow six figures on unmonitored AI experiments (I wrote about the economics in Your AI ROI Dashboard Is Lying to You). I am not joining that club.
- Current frontier models only. I am not waiting for GPT-5 or Claude 5. I am using what exists today — Claude, GPT-4o, Gemini, whatever is best suited for each part of the pipeline. This is Gen 1. It is supposed to be imperfect.
How It Runs
Three agent groups, one closed loop.
Market intelligence agents do the product management work. They design surveys, run segmentation analysis on anonymous behavioral data, scrape public social media for sentiment, and watch what correlates with retention. They turn all of that into hypotheses about what to build next.
Build agents take those hypotheses and turn them into working software. Write the code, write the tests, run the tests, package for deployment. This is the part most people think about when they hear “AI building software,” but it is the least interesting part of this experiment. We already know agents can write code. The question is whether they can write the right code for the right feature at the right time.
Release and measurement agents handle deployment, A/B testing, monitoring, and analytics. They instrument the new features, watch the metrics, decide whether a release is performing or should be rolled back, and feed the results back to the market intelligence agents.
The trigger is a cron job. Every night at midnight, the cycle runs. Market intelligence agents pull the last 24 hours of behavioral data and survey responses, scrape social media for mentions, generate hypotheses, and hand them to the build agents. If everything passes and the content filter clears, the release agents deploy. By morning, there is a new version of the game live. Players play, generate new data, and the next night the cycle runs again.
I picked a nightly cadence because it is slow enough that I can check what the agents shipped each morning and verify the constraints held. If I ran the loop continuously I would lose the ability to audit. I am not in the decision loop, but I am watching the output daily. If the constraints fail, I kill the cron job and figure out why.
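Stripped of the agent internals, the nightly loop is a scheduler around three calls. A sketch of the orchestration, under the assumption of hypothetical agent interfaces (`pull_last_24h`, `generate_hypotheses`, `implement`, `deploy` are placeholders for illustration, not the actual API):

```python
# Triggered nightly, e.g. by cron: 0 0 * * * python run_cycle.py

def run_cycle(market_agents, build_agents, release_agents) -> str:
    """One closed loop: observe -> hypothesize -> build -> gate -> deploy."""
    data = market_agents.pull_last_24h()        # telemetry, surveys, sentiment
    hypotheses = market_agents.generate_hypotheses(data)
    for hypothesis in hypotheses:               # ranked by expected impact
        release = build_agents.implement(hypothesis)
        # Hard gate: tests and content filter, never skipped.
        if release.tests_passed and not release.filter_violations:
            release_agents.deploy(release)
            return f"deployed {release.version}"
    return "no release this cycle"              # nothing cleared the gate
```

Note the fallthrough: if no hypothesis produces a release that clears the gate, the night produces nothing, and that is a valid outcome. The loop never ships something just to ship.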
What the Prompts Look Like
Too many agent experiments describe the system in abstractions and never show you what the agents actually receive. So here are examples across the pipeline.
Market intelligence agent — behavioral analysis:
You are a product analyst for a browser-based casual game. Here is the anonymous player telemetry from the last 24 hours: [session data]. Here are the structured survey responses collected in-game: [survey data].
Analyze this data. Identify the top 3 patterns in player behavior. For each pattern, state what it suggests about what players find engaging or frustrating. Do not speculate beyond what the data supports. Do not recommend features yet. Output your findings as numbered observations with supporting data points.
Market intelligence agent — social media sentiment:
Search public posts on X, Reddit, and web forums mentioning [game name] from the last 48 hours. Categorize sentiment as positive, negative, or neutral. For negative sentiment, identify the specific complaint. For positive sentiment, identify what the player praised. Summarize the top 3 themes. Do not include any PII, usernames, or identifying information in your output.
Market intelligence agent — hypothesis generation:
You are a senior product manager. Here are the behavioral analysis findings from the last 7 nightly cycles: [findings]. Here is the social media sentiment summary for the same period: [sentiment]. Here are the current game mechanics and features: [feature list].
Generate 3 hypotheses about what change to the game would improve 7-day retention. For each hypothesis, state: what you would change, why the data supports it, how you would measure whether it worked, and what the success threshold is. Rank them by expected impact. You must be able to build and test each hypothesis within a single nightly cycle.
Build agent — feature implementation:
You are a frontend developer building a browser-based casual game. The game must run on a 4-year-old Chromebook in Chrome with no install required. All content must be rated G and family friendly.
Here is the current game codebase: [repo]. Here is the hypothesis to test: [hypothesis from market intelligence agent]. Here is the success metric and threshold: [metric].
Implement this change. Write tests that verify the feature works and does not break existing functionality. Instrument the new feature with anonymous telemetry events that the measurement agent can use to evaluate the hypothesis. Do not collect any PII. Output the complete diff and test results.
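The "do not collect any PII" instruction does not have to rely on the model following orders; it can be enforced mechanically against the locked schema. A sketch of what that check could look like, with illustrative field names (this is not the actual locked schema):

```python
# The locked anonymous schema: any field outside this set is rejected.
ALLOWED_FIELDS = {"event", "session_seconds", "level", "clicked", "dropped_at"}

# Defense in depth: reject names that smell like identifiers, even if a
# future schema change accidentally allowed one through.
FORBIDDEN_SUBSTRINGS = ("ip", "email", "user", "device", "cookie", "fingerprint")

def is_valid_event(event: dict) -> bool:
    """Accept only telemetry events that match the locked anonymous schema."""
    for key in event:
        if key not in ALLOWED_FIELDS:
            return False
        if any(bad in key.lower() for bad in FORBIDDEN_SUBSTRINGS):
            return False
    return True
```

A validator like this would sit between the game client and the data store, so even if a build agent emits an event it should not, the event never lands.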
Release agent — deploy decision:
Here is the diff from the build agent: [diff]. Here are the test results: [results]. Here is the content filter output: [filter results].
If all tests pass and the content filter shows no violations, deploy to production and output the deployment log. If any test fails or the content filter flags a violation, do not deploy. Output the reason for rejection and pass it back to the build agent for revision.
Measurement agent — post-release evaluation:
The following change was deployed 24 hours ago: [change description]. Here is the pre-release baseline for the success metric: [baseline]. Here is the post-release data: [24h data].
Did the change meet the success threshold defined in the hypothesis? State yes or no with supporting data. If no, recommend whether to roll back or iterate. If yes, recommend whether to keep the change and move to the next hypothesis. Output your recommendation and reasoning.
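Because the hypothesis pre-registers its success threshold, the measurement agent's yes/no is mechanical; judgment only enters in the roll-back-or-iterate recommendation. A sketch of the decision rule, assuming a relative-improvement threshold (my framing, not necessarily the real one):

```python
def evaluate(baseline: float, observed: float, threshold: float) -> str:
    """Pre-registered decision rule for a post-release metric.

    threshold is the minimum relative improvement the hypothesis promised,
    e.g. 0.05 means "at least 5% better than baseline".
    """
    if observed >= baseline * (1 + threshold):
        return "keep: move to next hypothesis"
    if observed >= baseline:
        return "iterate: positive but below threshold"
    return "roll back: metric regressed"
```

Pinning the threshold before release matters: it stops the agents (and me) from rationalizing a weak result after the fact.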
These prompts will change as I learn what works. But they are concrete enough that you can see what the agents are actually being asked to do.
The Napkin Math
Nathan asked me what this would cost to run. I did the math on a napkin (I am not good at math, but I think this is close).
Inference for the nightly cycles: $200 to $500 a month. Hosting: $10 to $20. Social media API access: free tier for most platforms, maybe $100 a month if I need premium. Call it roughly $400 a month total.
A single product manager fully loaded at $180K a year runs about $15,000 a month. Add a two-person engineering team at $160K each and you are at $42,000 a month. For one product, one team, one planning cycle that takes three weeks before anyone writes a line of code.
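The napkin math holds up if you run it. A quick sanity check, using the rough figures from the text (the $350 inference number is just a midpoint of the $200 to $500 range):

```python
# Agent loop, monthly (rough estimates)
inference = 350                     # midpoint of the $200-$500 range
hosting = 20
social_api = 30
agent_total = inference + hosting + social_api   # about $400/month

# Human team, monthly (fully loaded annual salary / 12)
pm = 180_000 / 12                   # $15,000
engineers = 2 * 160_000 / 12        # about $26,667
team_total = pm + engineers         # about $41,667/month, call it $42K
```

The ratio comes out around 100x, which is the point: even if the agents are far worse than the humans, the experiment pays for itself as information.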
$400 a month is a cheap way to find out how far the technology actually goes. I have been clear in other posts that agents do not replace the judgment of a good product manager. But that is exactly the question this experiment is designed to pressure-test.
What I Expect to Go Wrong
The build step will probably be the strongest part. I ship production software with agents every week on this site, so there is evidence for that already.
Market intelligence is where I expect the system to struggle. Understanding what humans want from a product has been the hardest problem in software for sixty years, and I would be surprised if 2026 frontier models cracked it in a nightly cron job. But the bar is not “solve product management.” The bar is: can the system make product decisions that are better than random? Can it build a game that retains players at a rate higher than chance? If the answer is yes, even marginally, that is the first real data point about what autonomous product development looks like in practice.
Why I Am Building This in the Open
Nathan said something else tonight. “The problem with AI experiments is everyone only publishes the wins. Nobody shows you the part where it fell apart.”
So when the agents ship something nobody plays, you will see it. When they misread a signal and spend a cycle on a feature that tanks retention, you will see that too. The data will be real because you watched it happen.
Every product decision in your organization goes through a human right now. A product manager, a director, a committee. That process takes weeks. By the time the feature ships, the market has already moved. You are always building for the market that existed when you started planning, not the market that exists when you deliver.
What happens when that loop runs every night instead of every quarter? Can your planning process outrun a system that never sleeps, never anchors to last quarter’s strategy, and ships before the signal decays?
I do not know. That is why I am building it instead of writing a whitepaper.
I will have the link for you soon.
