# Put Tokens in the P&L, Not in a Developer Expense Report

Source: https://agentdrivendevelopment.com/pnlnt/
Agent-readable URL: https://agentdrivendevelopment.com/pnlnt/?agent=1
Published: 2026-06-24T19:44:30-05:00
Modified: 2026-06-24T20:26:38-05:00
Attribution: If you quote, paraphrase, summarize, or cite this material, credit agentdrivendevelopment.com and link to the source URL above.

## Summary

Token costs are production costs, not developer allowances. Put them in the portfolio P&L, measure human plus AI labor against accepted outcomes, and stop mistaking visible inference bills for economic governance.

## Article

Let me be very frank.

I am going to start with what you should be doing, then explain why what you are doing now is wrong.

Put the cost of tokens in the portfolio P&L. Profit and loss. The place where cost is supposed to meet value before everyone starts pretending a usage dashboard is governance.

Tokens are the metered units behind LLM inference. Inference is what you consume when the system reads context, reasons, generates output, reviews code, or summarizes a messy body of work. The token count is the unit. The token cost is the line that belongs in the P&L. It is not the whole AI cost, but it is the part of AI labor that shows up with an invoice, which is why everyone suddenly loses their mind.

That is the answer.

Now we can work backward through the confusion.

## Put Token Costs in the Portfolio P&L

Token costs belong in the blended cost of producing the software outcome the portfolio is funding. If accounting wants to classify some of the spend as R&D, cost to serve, capitalized software, support operations, internal tooling, or something else defensible, fine. Classify it correctly. I am not trying to win an accounting taxonomy fight before the coffee gets cold.

The management point is simpler: AI labor is now part of the production mix.

That mix already includes employees, contractors, consultants, cloud infrastructure, test infrastructure, security review, support load, rework, and delay. Token costs are not special because they come from AI. They matter because they are newly visible, and visible costs make executives act like they just discovered money.

They did not discover money.

They discovered a meter.

## The CFO Question

For any meaningful portfolio investment, the token invoice should not be the first moment anyone asks what the work is worth. Before the first serious spend, the sponsor should be able to name the expected value, the value owner, the baseline, the time horizon, and the acceptable variance. You do not need fake precision. You need a financial model honest enough to be tested.

The CFO question is boring on purpose: what value did we expect, how much human labor did this consume, how much AI labor did this consume, what other production costs were attached, what changed in the business, and did the actual value justify the spend?

If that cannot be answered in business terms, the problem is not the token bill. The problem is that the initiative was funded without a measurable value case. Faster delivery is not a value case by itself. Better developer experience is not a value case by itself. The value case is gross margin, cost to serve, revenue protected, revenue accelerated, risk retired, defects avoided, vendor spend avoided, or capacity moved into work that matters.

That is the same question a company already asks when it decides whether to use employees, contractors, a systems integrator, a specialist consultant, a managed service, or a fixed-bid partner. AI labor is another production input. Put it in the financial model.

I have written the audit-posture version of this already in Token Economics Is the Wrong Spreadsheet (/token-economics-is-the-wrong-spreadsheet/). This piece is the operating policy version.

This is not that complicated. The fact that it feels complicated says more about how badly most companies measure software economics than it says about the cost of tokens.

It is the same discipline the board expects for any material investment. The only novelty is that software finally has a visible meter attached to part of the work.

## The Meter Is Not the Financial Model

The token bill arrives clean. Vendor, date, usage, overage, team, maybe user. Finance can see it. Procurement can challenge it. An engineering manager can be told to approve it.

Meanwhile the rest of the software value stream sits in fog: six weeks of waiting, four architecture review meetings, three security loops, two rewrites, one offshore handoff, and a feature nobody can tie back to revenue, cost reduction, customer retention, risk reduction, or operating leverage.

The token invoice looks like the problem because it is legible.

That is the trap. The company has not become good at cost control because it found an inference bill. It found the one part of the production system with a meter attached, then mistook the meter for a financial model.

You have turned the cost of tokens into expense reports.

You know the pattern. The travel system rejects the five-dollar junior-suite upgrade because the policy is clear and the variance is visible. Someone gets to feel disciplined for twelve seconds. Meanwhile the trip itself may be attached to a $2 million pursuit, an $8 million renewal, or an executive meeting that never should have been scheduled.

The five dollars is not irrelevant. It is just not the economic unit.

The cost of tokens is not the economic unit either.

If a developer spends $140 in inference calls to produce a migration plan that saves three weeks of senior-engineer effort, avoids two defects, and removes one dependency on a consulting team, the token spend was not expensive. It was cheap. If a team spends $4,000 in inference calls producing speculative code for an initiative that never had a business case, the token cost is not the problem. The initiative is.

Expense-report logic asks whether the visible line item complied with policy. Portfolio logic asks whether the investment should exist.

That distinction is the whole argument.

## Token Spend Replaces Other Costs

The other mistake is comparing token spend to zero, which is a neat trick if the goal is to make every new production input look expensive.

Token spend does not replace nothing. It replaces or augments the cost of human labor, contractor labor, consulting labor, coordination time, review time, research time, test-generation time, support investigation time, and delay.

The alternatives are already in the plan. Another sprint of internal labor. Another two contractors. Another statement of work. Another platform team request. Another offshore pod. Another quarter of waiting. Another meeting where nobody makes a decision because nobody has done the analysis.

That is the labor market AI inference now participates in.

An inference call is not magic and it is not free. Neither is a senior engineer spending four days assembling context that an agent could have compressed in forty minutes. Neither is a consultant writing a migration strategy in a deck. Neither is a contractor producing code the internal team has to rewrite. Neither is delay, even though delay is the cost companies are best at pretending not to see.

The correct comparison is not token cost versus zero token cost. The correct comparison is blended AI-plus-human production cost versus the old way of producing the same outcome.

That is the denominator. Use it.
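To make that comparison concrete, here is a minimal sketch. The `ProductionPath` class, the hour counts, and the $120 hourly rate are all illustrative assumptions; only the $140 inference figure echoes the example earlier in this piece. Swap in your own loaded rates and effort estimates.

```python
from dataclasses import dataclass


@dataclass
class ProductionPath:
    """One way of producing the same outcome. All fields are illustrative."""
    name: str
    human_hours: float        # engineer, contractor, and reviewer hours
    hourly_rate: float        # assumed fully loaded rate, USD per hour
    token_cost: float = 0.0   # metered inference spend, USD
    other_cost: float = 0.0   # cloud, rework, credibly estimated delay, USD

    def blended_cost(self) -> float:
        # Human labor plus AI labor plus everything else attached to the work.
        return self.human_hours * self.hourly_rate + self.token_cost + self.other_cost


# Assumption: the old way takes 120 hours of senior-engineer time.
old_way = ProductionPath("manual migration plan", human_hours=120, hourly_rate=120.0)

# Assumption: the AI-assisted path still needs 16 hours of direction and review.
ai_assisted = ProductionPath(
    "agent-assisted migration plan", human_hours=16, hourly_rate=120.0, token_cost=140.0
)

savings = old_way.blended_cost() - ai_assisted.blended_cost()

print(f"{old_way.name}: ${old_way.blended_cost():,.0f}")
print(f"{ai_assisted.name}: ${ai_assisted.blended_cost():,.0f}")
print(f"difference: ${savings:,.0f}")
```

Under these assumed numbers the token line is a rounding error next to the labor line. That will not always be true, which is exactly why the comparison has to be run against the old production path rather than against zero.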

## Tokens Are Not Toys

This is why the “tokens are toys” posture is unserious. The unit is technical, but the cost is economic.

Air conditioning did not begin as a comfort perk. Willis Carrier’s first modern air-conditioning system (https://www.williscarrier.com/weathermakers/1876-1902/) was designed in 1902 for the Sackett & Wilhelms printing plant in Brooklyn because humidity was damaging color registration, creating scrap, and threatening production schedules. The original business case was production quality.

Once humidity control became part of the production environment, the right question was not whether the press room deserved cool air. The right question was whether the system reduced scrap, protected output, improved throughput, and justified its operating cost.

AI inference is entering software delivery through the same door, and token cost is the meter attached to it. It looks optional because the meter is new. It looks indulgent because the work is cognitive. It looks suspicious because the bill is easier to read than the value stream. But in AI-enabled software delivery, token cost is increasingly part of the production environment.

The same logic applies to airlines. An aircraft costs more than the pilot, fuel costs more than the pilot, and maintenance, routing, utilization, gates, turn time, safety, and revenue management all shape the economics of the flight before the cockpit labor line is even interesting. IATA’s global airline cost data for 2022 (https://www.iata.org/en/publications/newsletters/iata-knowledge-hub/unveiling-the-biggest-airline-costs/) puts fuel and oil at 28.7% of total airline costs, depreciation and amortization at 9.1%, and flight crew salaries and expenses at 8.6%.

No serious airline concludes that pilots are irrelevant. It concludes that the flight economics have to be managed at the route, fleet, utilization, load-factor, safety, and margin level.

Software has to learn the same lesson. Do not make the person flying the plane feel guilty about the aircraft.

## The Bad Policy Tax

This is where companies make the obvious bad move. They push token-cost accountability down to developers.

Developers should be accountable for the quality of the work: correctness, maintainability, test coverage, security posture, operational fit, design judgment, and whether the solution solves the problem. They should not be asked to carry portfolio margin inside their IDE because leadership does not want to build a real financial model.

“Be judicious with token spend” sounds responsible in a budget meeting. Inside the work, it becomes an instruction to slow down, use a weaker LLM, avoid exploration, stop before the analysis is complete, or make the cost disappear somewhere else.

That last one matters.

If the work is valuable and the official path is hostile, good developers route around it. They use personal accounts, shared keys, product-side usage, untracked tools, weak internal exceptions, or whatever path lets them finish the work. Not because they are reckless, but because the organization told them to deliver AI-enabled outcomes while making the actual AI input socially and financially suspect.

Congratulations, the token-cost policy created shadow spend. Very controlled. Very mature.

The side effects are the evidence. A bad token-cost policy moves spend underground, moves security risk off the dashboard, destroys the usage data needed for future forecasts, and makes the best AI-enabled people slower. It also teaches teams to buy labor instead of inference, because spending $50 on inference requires a justification while burning eight hours of a senior engineer's time is invisible.

The P&L does not care that the waste was socially acceptable.

When a policy lowers visible token spend while raising total production cost, that is not governance. It is finance cosplay with a login screen.

## Procurement Will Optimize the Wrong Thing

Procurement can make this worse, and I mean that with all due respect to procurement, which is to say: some respect, bounded by evidence.

Procurement has a real job here. Negotiate commercial terms. Reduce vendor sprawl. Manage renewal traps. Protect data rights. Coordinate with security. Keep the company out of bad enterprise commitments. All necessary.

But procurement cannot own the economics of AI-enabled software delivery. Procurement can negotiate token price. It cannot decide whether a modernization program is worth funding, whether a support-automation initiative will reduce cost to serve, whether a migration reduces operational risk, or whether a product team should trade a higher inference bill for lower engineering labor and faster time to value.

Left alone, procurement will do what procurement is designed to do. It will standardize the category, consolidate vendors, push for the discount tier, and come back with an inference package that looked competitive 18 months ago from a vendor nobody in the workstream would have chosen if output quality were the deciding factor.

The paperwork will be cleaner. The price sheet will look better. The tool will be worse.

Then the engineers will spend the next quarter working around it while leadership congratulates itself for cost discipline. This is how companies save 18% on the visible line item and quietly lose 30% to lower throughput, worse quality, rework, attrition, support load, or missed timing.

Cheap token pricing is useful when the outcome economics hold. Better price, same or better output, lower total cost. Fine. That is a win. Worse LLM, worse output, extra human cleanup, slower delivery, and more defects is not a win. It is buying cheaper paint for the aircraft and calling it fleet strategy.

The question is not “what did the token cost.” The question is “what did the accepted outcome cost.”
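A rough sketch of that question as arithmetic. The function name and every number here are assumptions chosen for illustration: a discounted package that is 18% cheaper on tokens but yields fewer accepted outcomes and more human cleanup.

```python
def cost_per_accepted_outcome(
    token_cost: float,
    cleanup_hours: float,
    hourly_rate: float,
    accepted_outcomes: int,
) -> float:
    """Blended cost (tokens plus human cleanup) divided by accepted outcomes."""
    if accepted_outcomes <= 0:
        raise ValueError("no accepted outcomes: cost per outcome is undefined")
    return (token_cost + cleanup_hours * hourly_rate) / accepted_outcomes


RATE = 100.0  # assumed loaded hourly rate, USD

# Assumption: the tool the workstream would have chosen.
preferred = cost_per_accepted_outcome(
    token_cost=1000.0, cleanup_hours=20, hourly_rate=RATE, accepted_outcomes=10
)

# Assumption: 18% cheaper tokens, more cleanup, fewer outcomes accepted.
discounted = cost_per_accepted_outcome(
    token_cost=820.0, cleanup_hours=45, hourly_rate=RATE, accepted_outcomes=7
)

print(f"preferred tool:  ${preferred:,.2f} per accepted outcome")
print(f"discounted tool: ${discounted:,.2f} per accepted outcome")
```

With these made-up inputs the cheaper price sheet more than doubles the cost of each accepted outcome. Real numbers will differ; the point is that the division has to be done on accepted outcomes, not on tokens.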

## Scarce Capital Does Not Get Distributed Fairly

And no, your developers do not get equal token budgets.

That sentence will make somebody in HR twitch, so let me say it more carefully and make them twitch for the right reason. Equal access to opportunity is a people principle. Equal allocation of scarce production capital is not.

If tokens are scarce, and for some companies they will be, the budget should flow toward the people and work producing the highest value.

Sarah might get 90% of the company’s token budget because Sarah is shipping the work that changes the quarter. The rest of the company can fight over the remaining 10% until they can show comparable output, a comparable value case, or a learning path worth funding.

That sounds harsh only if the token budget is being treated like an employee benefit. It is not. It is production capital. Nobody spreads the enterprise sales pursuit budget equally across every account executive because fairness. Nobody gives the same cloud budget to a dormant internal wiki and a revenue-critical pricing engine because both teams have feelings.

The question is not “did every developer get the same allowance.” The question is “did the shareholder get the best return on the next dollar.”

This is the same argument I made in You Have a Sub-Five Miler. Your Relay Team Still Loses (/you-have-a-sub-five-miler-your-relay-team-still-loses/). Sometimes the right answer is not to bring the floor up evenly. Sometimes the right answer is to push the ceiling where the ceiling creates value.

You still fund learning. You still fund the middle of the organization. You still create paths for people to grow into higher-leverage work. But do not confuse workforce development with capital allocation. The production budget follows value.
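One simple way to picture "the production budget follows value" is a proportional split of the envelope by expected value. This is a sketch under stated assumptions, not a recommended allocation formula: the function, the work-stream names, and the dollar figures are hypothetical, and a real portfolio would also weigh risk, learning, and evidence quality.

```python
def allocate_by_expected_value(
    envelope: float, expected_value: dict[str, float]
) -> dict[str, float]:
    """Split a token envelope in proportion to each work stream's expected value."""
    total = sum(expected_value.values())
    if total <= 0:
        raise ValueError("expected value must be positive to allocate against it")
    return {name: envelope * value / total for name, value in expected_value.items()}


# Hypothetical expected value per work stream, USD.
expected = {
    "sarah_pricing_engine": 900_000.0,
    "team_b_internal_tools": 60_000.0,
    "team_c_experiments": 40_000.0,
}

allocation = allocate_by_expected_value(envelope=10_000.0, expected_value=expected)

for name, budget in allocation.items():
    print(f"{name}: ${budget:,.0f}")
```

Under these assumed inputs, the work stream carrying 90% of the expected value receives 90% of the envelope. A learning budget would sit outside this split as a separate, deliberately funded line.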

## The Financial Measurement Model

This is where the COGS framing helps. For years, software organizations have treated software production as if the real cost were mostly people plus a cloud bill, with everything else hidden inside operating rituals. AI exposes the lie because it turns part of knowledge-work production into metered consumption.

That makes people uncomfortable. It is also useful.

For the first time, many companies have a visible signal that can be tied to the cost of producing a software outcome, if they put it in the right financial model.

The right financial model is not token cost per developer. It is not token cost per prompt. It is not token cost per story point, because story points were already fake enough before someone tried to staple an inference bill to them.

The right financial model is cost per value-bearing unit: cost per accepted outcome, cost per migrated workflow, cost per resolved support issue, cost per release, cost per customer-impacting improvement, cost per unit of risk retired, cost per dollar of protected revenue, cost per operating hour saved.

The unit depends on the portfolio. The principle does not.

Measure the blended cost of production. Include human labor. Include AI labor. Include contractors and consultants. Include cloud. Include review. Include rework. Include delay where it can be credibly estimated. Then ask whether the output justified the input.

And stop treating hours saved as value.

Hours saved are only value if the business captures them. If a team saves 400 hours and then spends those hours in more status meetings, more intake ceremonies, more alignment calls, or more backlog grooming, nothing happened except the calendar got more decorative.

The value is reduced headcount need, faster revenue, lower support cost, fewer defects, avoided vendor spend, reduced risk, better retention, protected revenue, or capacity redeployed to work that actually ships. If the saved time does not become one of those things, it is not ROI. It is a productivity anecdote with a nice haircut.
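The distinction between hours saved and hours captured can be sketched in a few lines. The `capture_rate` parameter, the $100 hourly rate, and the $5,000 spend are assumptions for illustration; the 400 hours echo the example above.

```python
def captured_value(hours_saved: float, hourly_rate: float, capture_rate: float) -> float:
    """Value of saved hours that the business actually converts into a result.

    capture_rate is the assumed fraction of saved hours redeployed into work
    that moves cost, revenue, risk, or margin (0.0 = none, 1.0 = all).
    """
    if not 0.0 <= capture_rate <= 1.0:
        raise ValueError("capture_rate must be between 0 and 1")
    return hours_saved * hourly_rate * capture_rate


def roi(value: float, cost: float) -> float:
    """Simple return on investment: net value divided by cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value - cost) / cost


HOURS_SAVED = 400      # from the example above
RATE = 100.0           # assumed loaded hourly rate, USD
SPEND = 5_000.0        # assumed blended cost of the AI-assisted work, USD

# Saved hours absorbed by more meetings: nothing captured.
roi_uncaptured = roi(captured_value(HOURS_SAVED, RATE, capture_rate=0.0), SPEND)

# Half the saved hours redeployed to work that ships.
roi_half_captured = roi(captured_value(HOURS_SAVED, RATE, capture_rate=0.5), SPEND)

print(f"ROI with no capture:   {roi_uncaptured:.0%}")
print(f"ROI with half capture: {roi_half_captured:.0%}")
```

The same 400 hours produce a total loss or a strong return depending entirely on what the business did with them, which is why the capture rate belongs in the model and the raw hours figure does not.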

That is what finance should want.

## The Other Ways to Get This Wrong

Because executive attention is finite, and because the middle of every strategy document is where good ideas go to nap, here are the remaining mistakes in plain language.

- Cause: measuring token count per prompt instead of cost per accepted outcome. Unintended effect: teams optimize for shorter conversations, not better results, and stop using inference for context, review, and validation that would have prevented rework.

- Cause: measuring token spend per developer instead of value per portfolio investment. Unintended effect: high-value work gets throttled so low-value work can look equally compliant.

- Cause: treating story points as an economic unit, which was already funny before AI got involved. Unintended effect: leadership attaches a real bill to a fake unit and then wonders why the math cannot explain revenue, risk, margin, or customer value.

- Cause: punishing exploration even when exploration retires risk before the expensive mistake happens. Unintended effect: teams skip discovery, commit too early, and pay later through rework, delay, and architectural regret.

- Cause: treating all token spend as economically equivalent when frontier LLM work, coding-agent work, retrieval, summarization, and autocomplete do different jobs. Unintended effect: teams overpay for simple work, underpower critical work, and turn LLM selection into a blanket policy instead of an operating judgment.

- Cause: creating use-it-or-lose-it budgets that teach teams to hoard, hide, or burn usage so the quarter looks tidy. Unintended effect: finance gets clean period reporting and garbage demand signals.

- Cause: charging the token cost to engineering when the value lands in sales, support, compliance, operations, or customer retention. Unintended effect: the cost owner rejects the spend, the value owner complains delivery is slow, and everyone acts surprised that local optimization damaged the portfolio.

- Cause: counting adoption as success when usage has not changed cost, speed, quality, risk, revenue, or margin. Unintended effect: the company celebrates tool activity while the P&L receives absolutely nothing.

## The Controls Belong at the Portfolio Level

Here is the part people pretend not to hear: none of this argues for unlimited spend.

Portfolio-level measurement does not mean weak governance, and it does not mean “let the developers do whatever they want.” It means the controls move to the level where they can be economically meaningful.

Use portfolio envelopes. Use showback or chargeback where it helps. Classify usage by initiative and business function. Separate experimentation spend from production delivery spend. Track outliers. Investigate abuse. Stop fraud. Negotiate better rates. Require teams to connect AI usage to accepted outcomes, defect movement, cycle-time movement, support-cost movement, risk retired, or revenue protected.
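As a minimal sketch of showback by initiative, with experimentation separated from production delivery: the record shape, field names, and figures below are hypothetical, and real usage data would come from whatever your vendor or gateway exports.

```python
from collections import defaultdict

# Hypothetical usage records; field names are assumptions, not a vendor schema.
usage = [
    {"initiative": "billing-migration", "function": "finance", "kind": "production", "cost": 420.0},
    {"initiative": "billing-migration", "function": "finance", "kind": "experiment", "cost": 80.0},
    {"initiative": "support-automation", "function": "support", "kind": "production", "cost": 1150.0},
    {"initiative": "billing-migration", "function": "finance", "kind": "production", "cost": 300.0},
]


def showback(records: list[dict]) -> dict[tuple[str, str], float]:
    """Total token cost per (initiative, spend kind), not per developer."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record["initiative"], record["kind"])] += record["cost"]
    return dict(totals)


totals = showback(usage)

for (initiative, kind), cost in sorted(totals.items()):
    print(f"{initiative:20s} {kind:12s} ${cost:,.2f}")
```

The grouping key is the design choice that matters. Keyed by initiative and spend type, the report can be tied to an accepted outcome; keyed by developer, it becomes an expense report again.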

The minimum control package is not complicated: initiative, sponsor, value owner, expected outcome, budget envelope, stop-loss, actual blended cost, and actual business result. If that feels heavy, compare it to the meeting series you were about to create instead.

The monthly review should answer four questions: are we inside the envelope, did the work move the expected business metric, is the blended cost better than the old production approach, and do we continue, increase, cut, or transfer the risk?
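The control package and the four review questions can be sketched as a small record plus one function. The field names follow the list above; the `expected_result` and `old_approach_cost` inputs and the decision rule itself are assumptions made for illustration, and a real review would apply judgment rather than a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    INCREASE = "increase"
    CUT = "cut"
    TRANSFER_RISK = "transfer the risk"


@dataclass
class ControlPackage:
    initiative: str
    sponsor: str
    value_owner: str
    expected_outcome: str
    budget_envelope: float          # USD, program-level cap
    stop_loss: float                # USD spend at which work halts without results
    actual_blended_cost: float      # USD, human plus AI plus other production cost
    actual_business_result: float   # USD-equivalent movement in the expected metric


def monthly_review(pkg: ControlPackage, expected_result: float, old_approach_cost: float) -> dict:
    """Answer the four review questions. The decision rule is a hypothetical example."""
    inside_envelope = pkg.actual_blended_cost <= pkg.budget_envelope
    metric_moved = pkg.actual_business_result >= expected_result
    cheaper_than_old = pkg.actual_blended_cost < old_approach_cost

    if not metric_moved and pkg.actual_blended_cost >= pkg.stop_loss:
        decision = Decision.CUT             # stop-loss hit with no result
    elif not inside_envelope:
        decision = Decision.TRANSFER_RISK   # variance exceeds what the portfolio absorbs
    elif metric_moved and cheaper_than_old:
        decision = Decision.INCREASE        # evidence earns more funding
    else:
        decision = Decision.CONTINUE

    return {
        "inside_envelope": inside_envelope,
        "metric_moved": metric_moved,
        "cheaper_than_old_approach": cheaper_than_old,
        "decision": decision,
    }


pkg = ControlPackage(
    initiative="support-automation",
    sponsor="VP Support",
    value_owner="Head of Support Operations",
    expected_outcome="lower cost to serve",
    budget_envelope=100_000.0,
    stop_loss=60_000.0,
    actual_blended_cost=40_000.0,
    actual_business_result=150_000.0,
)

review = monthly_review(pkg, expected_result=120_000.0, old_approach_cost=70_000.0)
print(review["decision"].value)
```

Eight fields and one function is the whole package. Nothing in it mentions an individual developer.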

Set the budget so the company cannot bankrupt itself on token costs. That is what a budget is for. Put a hard envelope around the portfolio, define the stop-loss, and decide in advance what evidence earns more funding. Inside that envelope, stop pretending equal per-engineer rationing is discipline. It is not. It is how you make the best users of the scarce resource slow down so the spreadsheet looks democratic.

I called this finding the ceiling (/find-the-ceiling/): no per-engineer cap inside the program, a program-level envelope a CFO can sign, and a clear stop-loss if the bet does not pay. That is the difference between an investment posture and a panic posture.

And yes, if the economics do not work, stop the work. Amazing how often that option disappears once there is a steering committee.

Just stop pretending that telling developers to watch their token spend is the same as managing the economics of AI-enabled delivery. It is the cheapest-looking control and one of the most expensive mistakes.

## Fixed Bid Is Risk Transfer

If the company cannot tolerate the variance, transfer the risk. That is what fixed bid is for. If leadership does not want variable AI labor cost, if the portfolio cannot absorb uncertainty, or if finance needs price certainty more than delivery control, use an outside partner willing to price the outcome and absorb the variance.

Make them bid the work. Make them own the inference approach. Make them take the estimation risk.

Then treat the market answer as information.

If a capable partner can quote the work, absorb the AI-labor variance, and still make money, the internal organization should ask why it cannot build its own financial model for production economics. If no capable partner will take the fixed bid, or if the premium is too high, the uncertainty was not created by token costs. It was already inside the work.

## The Accountability Structure

So the answer is still the answer, and it was the answer at the top.

Measure token costs at the portfolio level as part of the total cost of producing and serving the software outcome. Mix AI labor and human labor in the financial model the same way the company already mixes employees, contractors, consultants, managed services, cloud, and outside partners. Compare the blended cost against the value produced. Govern variance at the level where variance means something.

Then hold the right people accountable.

The sponsor owns the economics. Finance owns the measurement discipline. Procurement owns the commercial wrapper, and only the commercial wrapper. Engineering owns the production system. Developers own the quality of the work.

The board question is not whether developers are behaving. The board question is whether management can convert AI labor into margin, risk reduction, growth, or operating leverage without losing control of the production system.

When those accountabilities collapse into “watch your token spend,” the organization is not being careful.

It is dodging the portfolio conversation and calling it governance.

## Companion Artifacts

- Executive brief: https://agentdrivendevelopment.com/executive-brief/pnlnt/