I was on a call with my friend Nathan tonight. We were talking about what is possible in software right now. Agent-driven development, proofs of concept replacing specifications, and where all of this is heading. Then he said something that stuck.
Nathan told me that everyone keeps saying AI agents will eventually run the whole product loop. Market research, build, ship, measure, and iterate. No humans deciding what to build. But nobody is actually doing it. He said it is all conference slides.
He is right. It does not mean anything until someone builds it and shows you the results. So I am going to build it. And you are going to be able to watch. Here is what this actually is.
I am building a system of agents that runs the entire product lifecycle for a consumer browser game, end to end, without a human deciding what features to ship. The agents handle market research, development, deployment, measurement, and iteration.
Before I go further, here is a warning. If you are running a payments platform, a healthcare system, an Enterprise Resource Planning system, or anything where a wrong product decision has legal or safety consequences, do not do this. That would be reckless. I wrote about the importance of matching your AI approach to your domain risk and I meant it. A game is a domain where failure is cheap. If the agents ship something terrible, three people play it and forget about it. Nobody's mortgage gets denied. That is why I picked it.
Every experiment needs a cage before you put anything in it. I learned that running transformation programs at Fortune 500 companies. You do not let people, or agents, run in an open field and hope they stay in bounds. You define the field first. There are hard constraints baked into the system.
Some are guardrails on the agents themselves. Some are rules about how the product gets built. All of them are non-negotiable.
First, the game must run on a school laptop. Web-based, no app store, no install, and no graphics processing unit requirement. If your kid opens it in Chrome on a four-year-old Chromebook and it runs smoothly, it qualifies. The agents cannot build anything outside the game domain. No e-commerce bolt-ons. No data collection side projects. One game, one host, one deployment pipeline.
Second, the content is rated G. Family friendly. Legal in the United States and the European Union. No violence, no gambling mechanics, no predatory monetization, no loot boxes, and no dark patterns designed to keep a twelve-year-old playing past bedtime. A content filter runs on every asset the agents produce. That includes code, copy, images, and survey questions. It runs before anything reaches production. If it fails the filter, it does not ship.
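To make that concrete, here is a minimal sketch of what that pre-production gate could look like. The asset shape, the banned-theme list, and the classifyAsset placeholder are all illustrative, assuming some moderation model or rules engine sits behind them; the real filter will be more than a keyword check.

```typescript
// Sketch of the pre-production content gate. Every asset type passes
// through the same check before the release step will touch it.
type Asset = {
  kind: "code" | "copy" | "image" | "survey_question";
  id: string;
  content: string; // source text, copy, or an image description
};

type FilterVerdict = {
  assetId: string;
  allowed: boolean;
  reasons: string[]; // empty when the asset is clean
};

// Placeholder for whatever moderation model or rules engine does the real check.
async function classifyAsset(asset: Asset): Promise<FilterVerdict> {
  const bannedThemes = ["violence", "gambling", "loot box", "dark pattern"];
  const hits = bannedThemes.filter((t) => asset.content.toLowerCase().includes(t));
  return { assetId: asset.id, allowed: hits.length === 0, reasons: hits };
}

// The gate: if any asset in the batch fails, nothing ships.
export async function contentGate(assets: Asset[]): Promise<FilterVerdict[]> {
  const verdicts = await Promise.all(assets.map(classifyAsset));
  const violations = verdicts.filter((v) => !v.allowed);
  if (violations.length > 0) {
    throw new Error(`Content filter blocked release: ${JSON.stringify(violations)}`);
  }
  return verdicts;
}
```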
Third, there are no text boxes. No asking players what to build. This is the constraint that makes the experiment real. You do not get to put a text box in front of a player that says, what feature do you want? That is cheating. That is outsourcing the hardest part of the job to the user. The agents have to do what a seasoned product manager does. They must design structured surveys with multiple choice, rating scales, and preference rankings. They watch player behavior. They analyze session patterns. They monitor where players drop off and what they replay. They read social media. X, Reddit, and forums. Product insight comes from inference and observation, not from handing someone a suggestion box.
Fourth, the system uses non-identifiable data only. No internet protocol addresses, no device fingerprints, no tracking cookies, and no cross-session tracking. The agents can collect anonymous behavioral data. They see what a player clicked, how long they played, and where they dropped off. They see responses to the structured surveys they design. The data schema is locked before the agents touch it. They can read from it, but they cannot modify what gets collected.
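For illustration, here is roughly what a locked, anonymous event schema could look like. The event names and fields are mine, not the final schema; the point is that every event keys to a throwaway session identifier and nothing else.

```typescript
// Illustrative shape of the locked telemetry schema: anonymous,
// session-scoped, nothing that survives across sessions.
type TelemetryEvent =
  | { type: "session_start"; sessionId: string; timestamp: number }
  | { type: "click"; sessionId: string; timestamp: number; target: string }
  | { type: "level_complete"; sessionId: string; timestamp: number; level: number; durationMs: number }
  | { type: "drop_off"; sessionId: string; timestamp: number; screen: string }
  | { type: "survey_response"; sessionId: string; timestamp: number; questionId: string; choice: string };

// sessionId is a random value minted per session and never linked to a
// device, an IP address, or a previous session. The agents read events;
// they do not get to add fields or new event types.
function newSessionId(): string {
  return crypto.randomUUID();
}
```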
Fifth, I require production-worthy releases with automatic rollback. What goes live needs to work. It must be tested, deployed, and monitored. Every release is versioned. If a release degrades any key metric below a threshold the agents set in advance, the system auto-rolls back to the previous version before the next cycle runs. I wrote about what happens when you skip governance and I have no intention of running that experiment.
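A sketch of that rollback check, assuming the agents have written their thresholds down before the release goes out. The metric names and numbers here are placeholders, not the real thresholds.

```typescript
// Post-release rollback check. The thresholds are committed by the agents
// before deployment; this check just enforces them.
type MetricThreshold = {
  metric: "d1_retention" | "session_depth" | "replay_rate";
  minimum: number; // roll back if the metric falls below this
};

type MetricSnapshot = Record<string, number>;

function shouldRollBack(thresholds: MetricThreshold[], current: MetricSnapshot): boolean {
  return thresholds.some((t) => (current[t.metric] ?? 0) < t.minimum);
}

// Example: thresholds set in advance, metrics observed after release.
const thresholds: MetricThreshold[] = [
  { metric: "d1_retention", minimum: 0.18 },
  { metric: "replay_rate", minimum: 0.25 },
];

const afterRelease: MetricSnapshot = { d1_retention: 0.14, session_depth: 6.1, replay_rate: 0.31 };

if (shouldRollBack(thresholds, afterRelease)) {
  // In the real system this would redeploy the previous tagged version.
  console.log("Metric below threshold: rolling back to previous release.");
}
```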
Sixth, success is market signal, not vanity metrics. Page views do not matter. The agents measure retention, session depth, replay rates, survey responses, and social media sentiment. They decide what to measure, what the results mean, and what to build next.
Seventh, there is a hard cost cap. I have set a monthly ceiling on inference spend. If the agents burn through the budget mid-month, the loop pauses until the next cycle. I have seen organizations blow six figures on unmonitored AI experiments. I wrote about the economics in my return on investment dashboard piece. I am not joining that club.
Finally, I use current frontier models only. I am not waiting for GPT-5 or Claude 5. I am using what exists today. Claude, GPT-4o, Gemini, and whatever is best suited for each part of the pipeline. This is Generation One. It is supposed to be imperfect.
Here is how it runs: three agent groups in one closed loop.
Market intelligence agents do the product management work. They design surveys, run segmentation analysis on anonymous behavioral data, and scrape public social media for sentiment. They watch what correlates with retention. They turn all of that into hypotheses about what to build next.
Build agents take those hypotheses and turn them into working software. They write the code, write the tests, run the tests, and package for deployment. This is the part most people think about when they hear AI building software, but it is the least interesting part of this experiment. We already know agents can write code. The question is whether they can write the right code for the right feature at the right time.
Release and measurement agents handle deployment, A/B testing, monitoring, and analytics. They instrument the new features, watch the metrics, and decide whether a release is performing or should be rolled back. They feed the results back to the market intelligence agents.
The trigger is a cron job. Every night at midnight, the cycle runs. Market intelligence agents pull the last twenty-four hours of behavioral data and survey responses. They scrape social media for mentions, generate hypotheses, and hand them to the build agents. The build agents implement the top hypothesis and run the tests. If the tests pass and the content filter clears, the release agents deploy. By morning, there is a new version of the game live. Players play, generate new data, and the next night the cycle runs again.
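Here is a skeleton of one pass through that loop, with every agent call stubbed out so the shape of the cycle is visible. The function names and return types are mine, not production code, and a cron entry like "0 0 * * *" would kick it off.

```typescript
// One pass through the nightly loop, stubbed end to end.
type Hypothesis = { change: string; metric: string; threshold: number };
type BuildResult = { testsPassed: boolean; artifactPath: string };

async function marketIntelligence(): Promise<Hypothesis[]> {
  // Real version: pull 24h of telemetry and surveys, scrape public mentions,
  // and ask the model for ranked hypotheses.
  return [{ change: "stub", metric: "d7_retention", threshold: 0.2 }];
}

async function buildAgents(h: Hypothesis): Promise<BuildResult> {
  // Real version: implement the change, write and run tests, instrument it.
  return { testsPassed: true, artifactPath: "/tmp/build" };
}

async function contentFilterClean(artifactPath: string): Promise<boolean> {
  // Real version: run the content filter over code, copy, images, and surveys.
  return true;
}

async function deploy(artifactPath: string): Promise<void> {
  console.log(`deploying ${artifactPath}`);
}

export async function runNightlyCycle(): Promise<void> {
  const hypotheses = await marketIntelligence();
  const build = await buildAgents(hypotheses[0]);
  if (build.testsPassed && (await contentFilterClean(build.artifactPath))) {
    await deploy(build.artifactPath);
  } else {
    console.log("Cycle rejected: failing tests or content filter violation.");
  }
}
```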
I picked a nightly cadence because it is slow enough that I can check what the agents shipped each morning and verify the constraints held. If I ran the loop continuously, I would lose the ability to audit. I am not in the decision loop, but I am watching the output daily. If the constraints fail, I kill the cron job and figure out why.
Too many agent experiments describe the system in abstractions and never show you what the agents actually receive. So let us look at the actual prompts across the pipeline.
The market intelligence agent for behavioral analysis gets told this. You are a product analyst for a browser-based casual game. Here is the anonymous player telemetry from the last twenty-four hours. Here are the structured survey responses collected in-game. Analyze this data. Identify the top three patterns in player behavior. For each pattern, state what it suggests about what players find engaging or frustrating. Do not speculate beyond what the data supports. Do not recommend features yet. Output your findings as numbered observations.
The prompt for social media sentiment is different. Search public posts on X, Reddit, and web forums mentioning the game from the last forty-eight hours. Categorize sentiment as positive, negative, or neutral. For negative sentiment, identify the specific complaint. For positive sentiment, identify what the player praised. Summarize the top three themes. Do not include any personally identifiable information in your output.
Then comes hypothesis generation. You are a senior product manager. Here are the behavioral analysis findings from the last seven nightly cycles. Here is the social media sentiment summary. Here are the current game mechanics and features. Generate three hypotheses about what change to the game would improve seven-day retention. For each hypothesis, state what you would change, why the data supports it, how you would measure whether it worked, and what the success threshold is. Rank them by expected impact. You must be able to build and test each hypothesis within a single nightly cycle.
The build agent for feature implementation has these instructions. You are a frontend developer building a browser-based casual game. The game must run on a four-year-old Chromebook in Chrome with no install required. All content must be rated G and family friendly. Here is the current game codebase. Here is the hypothesis to test. Here is the success metric and threshold. Implement this change. Write tests that verify the feature works and does not break existing functionality. Instrument the new feature with anonymous telemetry events that the measurement agent can use to evaluate the hypothesis. Do not collect any personally identifiable information. Output the complete code changes and test results.
The release agent has a clear deploy decision. Here are the code changes from the build agent. Here are the test results. Here is the content filter output. If all tests pass and the content filter shows no violations, deploy to production and output the deployment log. If any test fails or the content filter flags a violation, do not deploy. Output the reason for rejection and pass it back to the build agent for revision.
Finally, the measurement agent handles post-release evaluation. The following change was deployed twenty-four hours ago. Here is the pre-release baseline for the success metric. Here is the post-release data. Did the change meet the success threshold defined in the hypothesis? State yes or no with supporting data. If no, recommend whether to roll back or iterate. If yes, recommend whether to keep the change and move to the next hypothesis.
These prompts will change as I learn what works. But they are concrete enough that you can see what the agents are actually being asked to do.
Nathan asked me what this would cost to run. I did the math on a napkin. I am not good at math, but I think this is close.
Inference for the nightly cycles costs two hundred to five hundred dollars a month. Hosting is ten to twenty dollars. Social media API access is free for most platforms, maybe one hundred dollars if I need premium. Call it four hundred dollars a month total.
Compare that to the alternative. A single product manager fully loaded at one hundred eighty thousand dollars a year runs about fifteen thousand dollars a month. Add a two-person engineering team at one hundred sixty thousand dollars a year each and you are at roughly forty-two thousand dollars a month. That is for one product, one team, and one planning cycle that takes three weeks before anyone writes a line of code.
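Spelled out, using the same rough figures from the paragraph above:

```typescript
// Napkin math. All numbers are the rough estimates in this post, not quotes.
const agentLoopMonthly = 400;                   // inference + hosting + API access, USD
const pmMonthly = 180_000 / 12;                 // 15,000
const engineerMonthly = 160_000 / 12;           // ~13,333 each
const traditionalTeamMonthly = pmMonthly + 2 * engineerMonthly; // ~41,667

console.log(Math.round(traditionalTeamMonthly)); // ≈ 42,000 a month, against 400
```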
Four hundred dollars a month is a cheap way to find out how far the technology actually goes. I have been clear in other posts that agents do not replace the judgment of a good product manager. But that is exactly the question this experiment is designed to pressure-test.
So, what do I expect to go wrong? The build step will probably be the strongest part. I ship production software with agents every week on this site, so there is evidence for that already.
Market intelligence is where I expect the system to struggle. Understanding what humans want from a product has been the hardest problem in software for sixty years. I would be surprised if 2026 frontier models cracked it in a nightly cron job. But the bar is not solving product management. The bar is whether the system can make product decisions that are better than random. Can it build a game that retains players at a rate higher than chance? If the answer is yes, even marginally, that is the first real data point about what autonomous product development looks like in practice.
Here is why I am building this in the open. Nathan said something else tonight. He said the problem with AI experiments is everyone only publishes the wins. Nobody shows you the part where it fell apart.
So when the agents ship something nobody plays, you will see it. When they misread a signal and spend a cycle on a feature that tanks retention, you will see that too. The data will be real because you watched it happen.
Every product decision in your organization goes through a human right now. A product manager, a director, or a committee. That process takes weeks. By the time the feature ships, the market has already moved. You are always building for the market that existed when you started planning, not the market that exists when you deliver.
What happens when that loop runs every night instead of every quarter? Can your planning process outrun a system that never sleeps, never anchors to last quarter's strategy, and ships before the signal decays?
I do not know. That is why I am building it instead of writing a whitepaper.
I will have the link for you soon.