CxO + VP Engineering briefing 01 / 12

Slide 01

Everyone Talks About Autonomous AI Development. Nobody Is Actually Running It. I Am Building It and You Can Watch.

CxO + VP Engineering + Board
The experiment

A system of agents that runs the entire product lifecycle for a consumer game. Market research, build, ship, measure, iterate. No human in the product decision loop. Every night. In the open.

Everyone keeps presenting "future state" diagrams with gradient arrows pointing toward "autonomous product development." Conference slides. Nobody has built it and shown you the results. That changes now.

Why it matters to you Your product decision cycle takes weeks or months. This one runs every night. The question is not whether autonomous product loops are possible. The question is what happens to your planning process when someone proves they are.

Slide 02

The Domain Is Deliberately Low-Stakes. The Lessons Are Not.

Domain selection
Why a browser game

If the agents ship something terrible, three people play it and forget about it. Nobody's mortgage gets denied.

Payments platforms, healthcare systems, ERPs — domains where a wrong product decision has legal or safety consequences — are not candidates for this. That would be reckless.

A consumer game lets you push the autonomy boundary further than any responsible organization could push it in production. Then everyone gets to learn where the boundaries actually are instead of guessing from a conference stage.

What you learn from watching

Can agents make product decisions better than random? Can they build a game that retains players at a rate better than a naive baseline? If the answer is yes, even marginally, that is the first real data point about autonomous product development outside of a slide deck.

Your takeaway The blast radius is small. The signal about what is possible applies to every product organization in the world.

Slide 03

Every Experiment Needs a Cage Before You Put Anything In It.

Guardrails framework
Domain

Consumer browser game only

No e-commerce bolt-ons, no data collection side projects, no scope creep. If an agent tries to expand scope, the guardrail kills the run.

Content

Rated G, family safe, legal in US and EU

Content filter runs on every asset — code, copy, images, surveys — before anything reaches production. Fails the filter, does not ship.

Data

Zero PII. Anonymous telemetry only.

No IP logging, no device fingerprinting, no cross-session tracking. The data schema is locked before the agents touch it. They can read from it. They cannot modify what gets collected.
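As an illustration of what a locked, PII-free schema could look like in practice, here is a minimal sketch. The field names and forbidden list are hypothetical, not the experiment's actual schema; the point is that identity-bearing fields are rejected at the boundary and the event shape is frozen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the shape is locked; agents read events, they do not reshape them
class TelemetryEvent:
    session_id: str       # random per-session token, not linkable across sessions
    event: str            # e.g. "level_start", "level_complete", "session_end"
    level: int
    seconds_elapsed: float

# Illustrative denylist of identity-bearing fields that must never be collected.
FORBIDDEN_FIELDS = {"ip", "email", "device_id", "user_agent", "account_id"}

def validate_event(payload: dict) -> TelemetryEvent:
    """Reject any payload that smuggles in identifying fields, then freeze it."""
    leaked = FORBIDDEN_FIELDS & payload.keys()
    if leaked:
        raise ValueError(f"identifying fields rejected: {sorted(leaked)}")
    return TelemetryEvent(**payload)
```

Locking the schema in code rather than in a policy document means a scope-creeping agent cannot quietly add a tracking field; the write path simply refuses it.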

Cost

Hard monthly cap on inference spend

If the agents burn through the budget mid-month, the loop pauses until the next cycle. No runaway spend.

Deploy

One target environment. No infrastructure sprawl.

Agents cannot provision new infrastructure, spin up additional servers, or create external service accounts. One game, one host, one pipeline.

Rollback

Every release versioned. Auto-rollback on metric degradation.

If a release degrades any key metric below a threshold the agents set in advance, the system rolls back before the next cycle runs.
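A sketch of that rollback check, assuming the thresholds are expressed as a maximum tolerated relative drop per metric (metric names and numbers here are illustrative):

```python
def should_roll_back(baseline: dict, current: dict, thresholds: dict) -> bool:
    """Roll back if any key metric degrades past its pre-set threshold.

    thresholds maps metric name -> maximum tolerated relative drop,
    e.g. {"d1_retention": 0.10} means a drop of more than 10% of the
    baseline value triggers rollback before the next cycle runs.
    """
    for metric, max_drop in thresholds.items():
        before, after = baseline[metric], current[metric]
        if before > 0 and (before - after) / before > max_drop:
            return True  # one degraded metric is enough
    return False
```

The thresholds are set by the agents before the release ships, so the rollback decision is mechanical, not a judgment call made after the fact.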

Slide 04

Seven Rules That Make This an Experiment Instead of a YouTube Video.

Operating rules
Rule 1 School laptop

Browser game, no app store, no install, no GPU. Runs on a four-year-old Chromebook. Performance is a constraint, not a backlog item.

Rule 2 Rated G

No violence, no gambling mechanics, no predatory monetization, no loot boxes, no dark patterns. If a parent watches over their kid's shoulder, both should be comfortable. Absolute.

Rule 3 No text boxes

You do not get to ask players "What feature do you want?" That is outsourcing the hardest part of the job. Agents design structured surveys, watch behavior, scrape social media. Product insight from inference, not suggestion boxes.

Rule 4 Zero PII

Anonymous behavioral data only. What a player clicked, how long they played, where they dropped off. No IP addresses, no fingerprints, no tracking cookies.

Rule 5 Production-worthy

The agents do not ship prototypes and call them products. Tested, deployed, monitored. If it breaks, the agents detect it and respond.

Rule 6 Market signal

Page views do not matter. Retention, session depth, replay rates, survey responses, social sentiment. The agents decide what to measure and what to build next.

Rule 7 Frontier models

Current frontier models only: Claude, GPT-4o, Gemini. No waiting for the next generation. Gen 1 is supposed to be imperfect.

Slide 05

Three Agent Groups. One Closed Loop. No Human in the Decision Chain.

Agent architecture
Observe

Market intelligence agents

Design structured surveys. Run segmentation on anonymous behavioral data. Scrape public social media for sentiment. Watch where players drop off, what correlates with retention, what features get ignored. Turn all of it into hypotheses about what to build next.

Build

Build agents

Take hypotheses, write code, write tests, run tests, package for deployment. The least interesting part of the experiment. We already know agents can write code. The question is whether they can write the right code for the right feature at the right time.

Measure

Release and measurement agents

Handle deployment, A/B testing, monitoring, and analytics. Instrument new features, watch metrics, decide whether to roll back, feed results back to market intelligence. Loop closed.

Slide 06

Midnight. Every Night. No Standup. No Sprint. The Cron Job Does Not Care About Your Prioritization Framework.

Operating cadence
The nightly cycle

Every night at midnight, the loop runs. By morning, there is a new version of the game live.

Market intelligence agents pull the last 24 hours of behavioral data and survey responses. Scrape social media for mentions. Generate hypotheses. Hand them to the build agents. Code and tests written. Content filter clears. Release agents deploy.

Players wake up, play, generate new data. Next night, the cycle runs again.
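The control flow of one nightly pass can be sketched as a single function. The agent calls here are hypothetical stubs standing in for frontier-model invocations; what matters is the ordering of the gates: budget cap first, then tests, then the content filter, and only then a deploy.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    tests_passed: bool
    family_safe: bool

# Hypothetical stand-ins for the three agent groups.
def observe(telemetry: list, surveys: list, social: list) -> list:
    return ["hypothesis: shorten level 3"]   # market intelligence output

def build(hypotheses: list) -> Artifact:
    return Artifact(tests_passed=True, family_safe=True)

def nightly_cycle(telemetry, surveys, social, budget_left_usd: float) -> str:
    """One pass of the observe -> build -> gate -> ship loop."""
    if budget_left_usd <= 0:
        return "paused: monthly inference cap reached"  # cost guardrail
    artifact = build(observe(telemetry, surveys, social))
    if not artifact.tests_passed:
        return "skipped: tests failed"
    if not artifact.family_safe:
        return "blocked: content filter"                # Rated G gate
    return "shipped"                                    # release agents deploy
```

Note that every early return leaves the previous night's version live; the loop only ever replaces the game when all gates clear.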

Why nightly, not continuous

Slow enough that Norman can check what the agents shipped each morning and verify the guardrails held. If the loop ran continuously, he would lose the ability to audit.

Key distinction The human is not in the decision loop. The human is watching the output daily. If the guardrails fail, the cron job dies.

Slide 07

No Text Boxes. No Asking Players What to Build. That Is the Rule That Makes This Real.

Product intelligence
What the agents cannot do

Put a text box in front of a player that says "What feature do you want?" That is cheating. That is outsourcing the hardest part of the job to the user.

Every product organization in the world runs some version of this shortcut: customer advisory boards, feature request portals, NPS follow-up surveys with open text fields. You collect wishes and call it strategy.

What the agents must do

Design structured surveys: multiple choice, rating scales, preference rankings. Watch player behavior. Analyze session patterns. Monitor drop-off points and replay rates. Read public social media. Product insight comes from inference and observation.

The PM test A seasoned PM does not ask customers what to build. They watch what customers do and infer what to build. The agents have to do the same work.

Slide 08

$400 a Month vs. $42,000 a Month. Same Product Loop.

Economics
Agent loop ~$400/mo

Frontier model inference for a nightly observe-build-ship-measure cycle: $200-$500. Hosting a browser game: $10-$20. Social media API access: free tier to $100. Total: roughly $210-$620, call it $400.

Human team $42K/mo

One product manager fully loaded at $180K/year: $15,000/month. Two engineers at $160K each: $27,000/month. One product, one team, one planning cycle that takes three weeks before anyone writes code.

Ratio 105:1

The human team costs 105x more per month, ships on a three-week cycle instead of nightly, and still needs to guess what to build next. The agent loop ships every 24 hours.

$400 a month is a cheap way to find out how far the technology actually goes. Agents do not replace the judgment of a good product manager. But you will not know where that line is until you run the experiment.

The cost structure makes saying no harder than saying yes.

Slide 09

Real Prompts, Not Architecture Diagrams. You Can See What the Agents Actually Receive.

Implementation
Analyze

Behavioral analysis prompt

"Here is the anonymous player telemetry from the last 24 hours. Identify the top 3 patterns in player behavior. For each, state what it suggests about what players find engaging or frustrating. Do not speculate beyond what the data supports. Do not recommend features yet."

Hypothesize

Hypothesis generation prompt

"Here are findings from the last 7 nightly cycles plus social media sentiment. Generate 3 hypotheses about what change would improve 7-day retention. For each: what you would change, why the data supports it, how you would measure it, and the success threshold. Must be buildable in a single nightly cycle."

Evaluate

Post-release measurement prompt

"This change was deployed 24 hours ago. Here is the baseline. Here is the post-release data. Did it meet the success threshold? State yes or no with supporting data. If no, recommend rollback or iterate. If yes, recommend keep and move to next hypothesis."

Transparency Too many agent experiments describe the system in abstractions and never show you the prompts. That makes it impossible to tell whether the thing is real or a LinkedIn post with an architecture diagram.

Slide 10

The Build Step Will Work. Market Intelligence Is Where This Breaks.

Honest risk assessment

What will probably work

  • Code generation. Norman ships production software with agents every week already. Evidence exists.
  • Test generation and execution. Agents are strong at mechanical verification.
  • Deployment and rollback. Infrastructure automation is solved territory.
  • Content filtering. Binary pass/fail decisions are what models do well.

What will probably struggle

  • Understanding what humans want from a product. The hardest problem in software for sixty years.
  • Inferring product direction from behavioral data alone, without human intuition filling the gaps.
  • Distinguishing signal from noise in small-sample social media sentiment.
  • Making the leap from "players drop off here" to "this is the change that fixes it."

Slide 11

The Problem With AI Experiments Is Everyone Only Publishes the Wins.

Transparency
What you will see

When the agents ship something nobody plays, you will see it. When market intelligence misreads a signal and the build agents waste a cycle on a feature that tanks retention, you will see that too.

When it works — if it works — the data will be real because you watched it happen. The game will be live. You can play it. You can watch the iteration cycle in action.

Why this matters for your org

Your vendor evaluations, your board presentations, your AI strategy decks — they are all based on curated success stories. Nobody shows you the part where it fell apart.

This experiment runs in the open specifically so the failure modes are visible. That is where the learning is.

Commitment If the whole thing collapses, you will see that too.

Slide 12

Your Product Decision Cycle Takes Months. What Happens When That Loop Runs Every Night?

Decision close
The timing problem

Every product decision in your organization goes through a human right now. A product manager, a director, a committee. That process takes weeks, sometimes months. By the time the feature ships, the market has already moved.

You are always building for the market that existed when you started planning, not the market that exists when you deliver.

A system that never sleeps, never anchors to last quarter's strategy, and ships before the signal decays does not need to be perfect. It just needs to be faster than your planning process.