Every AI feature has two demos. The first is the one you give the board: the chat companion answering a message at two in the morning with something that sounds like a person who cares. The second is the one finance gives you three months later, which is a spreadsheet.
The distance between those two demos is the entire subject of this essay.
There is no shortage of writing about the cost of large language models. Most of it is produced by people optimising token throughput, which is a real and useful craft, and also not the question I get asked in a budget review. The question I get asked is simpler and worse. How much does it cost to run an AI feature once real people are using it, per user, per month, and is that number going up or down? That is a P&L question, not an engineering one, and the people who can answer it from the chair where the budget actually gets signed are rarer than you would think.
I have spent the last few years answering it for a regulated mental health platform — four million users, real enterprise contracts, generative features sitting directly in the path of clinical risk. Here is what I have learnt about the bill.
The unit you actually pay in
Software people carry a comfortable mental model around with them. You build the thing once, and serving the millionth user costs roughly what serving the thousandth did. Marginal cost drifts towards zero. It is the entire reason software valuations look the way they do, and it is the assumption underneath every gross-margin slide ever shown to an investor.
Generative AI quietly deletes that assumption while you are looking at the demo.
Every interaction now has a variable cost, because every interaction spends tokens, and tokens are metered like electricity. The millionth message is not free. It costs almost exactly what the thousandth cost, and rather more if the conversation runs long, which the good conversations always do. You have, without entirely noticing, rebuilt your software business with the cost structure of a utility. The chat companion that makes the product feel humane is also the line item that scales linearly with success. The better it is, the more people use it, the larger the invoice. This is the first thing nobody puts on the slide.
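A rough sketch of why the long conversations cost more than their share. It assumes a stateless chat API where each turn resends the whole history as input, which is common but not universal; the function name, the flat message size and the prices in the comments are hypothetical placeholders, not anyone's real rate card.

```python
def conversation_cost(turns: int,
                      tokens_per_message: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Estimate the cost of one conversation, in the currency of the prices.

    Assumption: every turn sends the full history plus the new user message
    as input, and the reply is the same size as the user message.
    """
    total = 0.0
    history_tokens = 0
    for _ in range(turns):
        input_tokens = history_tokens + tokens_per_message
        output_tokens = tokens_per_message
        total += input_tokens * input_price_per_million / 1_000_000
        total += output_tokens * output_price_per_million / 1_000_000
        # Both the user message and the reply join the history for next turn.
        history_tokens += 2 * tokens_per_message
    return total


# Hypothetical prices: 1.0 per million input tokens, 2.0 per million output.
short_chat = conversation_cost(10, 100, 1.0, 2.0)
long_chat = conversation_cost(20, 100, 1.0, 2.0)
```

Under these assumptions the twenty-turn conversation costs more than twice the ten-turn one, because the input side grows with the square of the number of turns. Prompt caching or history truncation would flatten that curve; neither is modelled here.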
Build versus buy is a margin decision wearing a quality costume
In a clinical context the instinct is to reach for the most capable model available and use it everywhere. If a smarter model is marginally less likely to mishandle someone in distress, what kind of person chooses the cheaper one?
The honest answer is that you would not choose the cheaper one for the things that matter, and you absolutely would for the things that do not. Most of what an AI feature does all day does not matter in that way. Treatment formatting, which turns a clinician's notes into structured output, is a mechanical task; it does not need the frontier. Triage routing needs judgement but not genius. The acute-crisis path, which we call Level 3A and where the system has decided a human being needs to intervene right now, is not the place to economise on a fraction of a cent.
So you route. The unglamorous discipline of the GenAI P&L is deciding, task by task, which jobs deserve the expensive model and which are perfectly well served by something a tenth of the price.
The model you choose is not a quality decision or a cost decision. It is both at once, a single decision made separately for every task your product performs.
Get this wrong in the generous direction and your margins quietly bleed. Get it wrong in the frugal direction and you have saved a few cents per session on the one path where saving money was never the point.
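The routing discipline can be written down as a table. This is a minimal sketch, not our production router: the tier names, the task keys and the choice to send unknown tasks to the most capable tier are all assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical model tiers, cheapest to most capable."""
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


# One explicit decision per task, so each can be argued about on its own.
ROUTES = {
    "treatment_formatting": Tier.SMALL,   # mechanical restructuring of notes
    "triage_routing": Tier.MID,           # needs judgement, not genius
    "crisis_level_3a": Tier.FRONTIER,     # never the place to save a cent
}


def route(task: str) -> Tier:
    """Return the tier for a task.

    Assumption: a task nobody has classified yet goes to the most capable
    tier, so that a mistake costs money rather than safety.
    """
    return ROUTES.get(task, Tier.FRONTIER)
```

The default is the design choice worth arguing over. Failing expensive means an unclassified task shows up as a cost anomaly someone will notice, rather than as a quality problem nobody does.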
The costs nobody puts in the demo
The token bill is the part you can see. The part you cannot see is larger.
Evaluation is not free: to know whether one model is doing its job you generally pay another model to grade it, which means your quality assurance has a metered cost of its own. Guardrails mean you frequently run the same request more than once — a generation, a check, sometimes a regeneration — so a single user-facing reply can quietly cost you two or three times what the demo implied. Observability, logging, the storage of every interaction in a domain where you may later need to prove exactly what was said: all metered, all real, none of it in the pitch.
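To see how a single reply ends up costing a multiple of the demo figure, here is a back-of-envelope model. It assumes at most one regeneration per reply and that only a sample of replies is graded by a second model; the parameter names are mine, and storage and logging are left out.

```python
def effective_cost_per_reply(generation_cost: float,
                             check_cost: float,
                             regeneration_rate: float,
                             eval_sample_rate: float = 0.0,
                             eval_cost: float = 0.0) -> float:
    """Expected cost of one user-facing reply, in the same unit as the inputs.

    Assumptions: every reply is generated once and checked once; a fraction
    `regeneration_rate` fails the check and is generated and checked a second
    time; a fraction `eval_sample_rate` is also graded by another model.
    """
    first_pass = generation_cost + check_cost
    expected_retry = regeneration_rate * (generation_cost + check_cost)
    expected_grading = eval_sample_rate * eval_cost
    return first_pass + expected_retry + expected_grading


# Hypothetical inputs: the check costs as much as the generation,
# and half of all replies are regenerated.
multiple = effective_cost_per_reply(1.0, 1.0, 0.5)
```

With those hypothetical inputs the reply costs three times the bare generation, before any grading. The point is the shape of the sum, not the particular values.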
And then there is the most expensive token of all, which is a human. In a high-stakes product the escalation path does not end at a cheaper model. It ends at a clinician. In a low-stakes product, the cost of being wrong is a refund. In ours, the cost of being wrong is a person. That single fact bends the entire P&L, because past a certain level of risk you are no longer buying inference at all. You are buying confidence, and confidence has a price curve that turns sharply upward exactly where you most need it.
There is also the tax you pay for moving at all. Every time you swap in a newer model (and you will, because they keep getting better and cheaper) you inherit a regression test across every surface you have ever shipped. The upgrade that shaves a real fraction off your cost per million tokens also costs you a fortnight of re-checking that nothing in the crisis path now behaves slightly differently from how it did last week. Nobody warns you that the savings have an invoice attached.
How to actually budget the thing
This is the AI product P&L, and it has a shape. If you are the person holding the number, here is the work.
Start with cost per active user per month, and refuse to look at it as one figure. Decompose it by feature, because the chat companion, the triage step, session prep and treatment formatting have wildly different cost profiles, and the blended average hides the one that is about to become a problem. The feature that worries me is never the expensive one I already watch. It is the cheap one whose usage is quietly compounding.
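A small sketch of that decomposition, assuming you can already attribute monthly spend to a feature. The feature keys and both helper functions are hypothetical; the second one ranks by growth rather than by size, which is the whole point of the paragraph above.

```python
def cost_per_user_by_feature(monthly_spend: dict[str, float],
                             active_users: int) -> dict[str, float]:
    """Split monthly spend into cost per active user, one figure per feature."""
    return {feature: spend / active_users
            for feature, spend in monthly_spend.items()}


def fastest_growing(previous: dict[str, float],
                    current: dict[str, float]) -> str:
    """Name the feature whose spend grew fastest between two periods.

    Features with no spend in the previous period are skipped, since their
    growth rate is undefined.
    """
    growth = {feature: current[feature] / previous[feature] - 1.0
              for feature in current
              if previous.get(feature)}
    return max(growth, key=growth.get)


# Hypothetical monthly spend figures, in any one currency.
last_month = {"chat_companion": 100.0, "triage": 20.0, "session_prep": 10.0}
this_month = {"chat_companion": 105.0, "triage": 21.0, "session_prep": 15.0}
worry_about = fastest_growing(last_month, this_month)
```

In this invented example the chat companion is still the largest line, but session prep is the one that grew by half in a month, and it is the one the blended average would have hidden.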
Then map each feature to what it is actually for. Some of these features exist to lift utilisation — the metric an enterprise buyer renews on, because HR did not purchase a product for employees to ignore. Those features earn their token cost by moving a number that shows up in the renewal conversation. Others are pure cost of doing business in a regulated category; they will never show up in a sales deck, and you pay for them anyway. Knowing which is which is the difference between a budget you can defend and a budget you can only apologise for.
On build versus buy, the calculus is duller than the debate around it. You fine-tune when volume multiplied by the per-call saving comfortably clears the engineering cost of doing so and maintaining it forever. Below that line you stay with prompt engineering and a hosted model, because the cheapest custom model is the one you never had to keep alive. Most teams fine-tune too early, for the same reason most people buy the professional camera before they have taken the amateur photographs.
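The break-even test above fits in a few lines. This is a sketch under stated assumptions: "forever" is replaced by a finite planning horizon, and "comfortably" by a safety margin that defaults to two. Both are my placeholders, as are all parameter names.

```python
def fine_tune_clears(monthly_calls: float,
                     saving_per_call: float,
                     build_cost: float,
                     monthly_maintenance: float,
                     horizon_months: int,
                     margin: float = 2.0) -> bool:
    """Return True if fine-tuning savings clear its cost by the given margin.

    Savings are volume times per-call saving over the horizon. Cost is the
    one-off engineering build plus maintenance over the same horizon.
    """
    savings = monthly_calls * saving_per_call * horizon_months
    cost = build_cost + monthly_maintenance * horizon_months
    return savings >= margin * cost


# Hypothetical figures: same saving per call, very different volumes.
small_volume = fine_tune_clears(1_000_000, 0.001, 50_000, 2_000, 24)
large_volume = fine_tune_clears(100_000_000, 0.001, 50_000, 2_000, 24)
```

With these invented figures the smaller volume does not clear the bar and the larger one does. The sketch ignores the regression-testing cost of each retrain and any drop in hosted-model prices over the horizon, either of which would push the line further out.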
Above all, watch the direction of the number, not its level. A high cost per user that is falling every quarter is a business getting better at its craft. A low cost per user that is creeping upward is a slow leak you have not found yet. The board does not need the cost to be small. It needs you to be the person who already knows which way it is moving, and why, before anyone has to ask.
The two demos never quite reconcile. The chat companion still answers at two in the morning, and finance still arrives three months later with the spreadsheet. The job is not to make the bill disappear. It is to be the only person in the room who can read it.