Talk about AI tends to fixate on training: the months of work and the rumored fortunes spent to build a frontier model. But training is largely a one-time cost per model. The bill that never stops arriving is inference: the price of actually answering each prompt, every time, for every user. Understanding where that money goes explains many of the product decisions you see across the industry.

Why generating text is expensive

When a model answers, it produces output one token at a time, and each token requires a forward pass through the network: billions of parameters' worth of matrix multiplications. A hundred-word reply is not one computation; at roughly 1.3 tokens per English word, it is well over a hundred forward passes through the entire model, in sequence. That work runs on GPUs, which are costly to buy, costly to rent, and power-hungry to operate. Every word the model writes burns a measurable amount of compute and electricity.
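To put rough numbers on this, here is a minimal sketch using the common rule of thumb that a dense model spends about 2 floating-point operations per parameter per generated token. The 7-billion-parameter model size and the 1.3 tokens-per-word ratio are illustrative assumptions, not figures from any particular provider.

```python
def decode_flops(params: float, output_tokens: int) -> float:
    """Approximate FLOPs to generate `output_tokens` tokens.

    Assumes a dense model and the rule of thumb of ~2 FLOPs
    per parameter per token (attention overhead is ignored).
    """
    return 2.0 * params * output_tokens


# Illustrative assumptions: a 7e9-parameter model, ~1.3 tokens per word.
words = 100
tokens = round(words * 1.3)          # one forward pass per generated token
total = decode_flops(7e9, tokens)

print(f"{tokens} forward passes, ~{total:.2e} FLOPs")
# prints: 130 forward passes, ~1.82e+12 FLOPs
```

Even under these simplified assumptions, a short reply lands in the trillions of operations, which is why per-token cost is the unit that matters.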

The memory wall

The less obvious cost is memory. To generate text, the GPU must hold the model's weights in fast memory, along with a growing cache of intermediate state for every token of the conversation so far (the "KV cache") that lets it avoid recomputing the past on every step. Longer prompts and longer outputs mean more memory pressure, and GPU memory is scarce and expensive. This is a big reason long context windows cost more: you are not just paying for more computation, you are paying to keep more of the conversation resident in the most expensive memory in the building.
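The growth of the KV cache can be estimated with simple arithmetic: for each token, each layer stores one key vector and one value vector per attention head. The sketch below assumes a hypothetical model with 32 layers, 32 key/value heads, a head dimension of 128, and 16-bit values; real models vary, and many use fewer key/value heads to shrink this cache.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
for context in (4_096, 32_768):
    size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=context)
    print(f"{context:>6} tokens -> {size / 2**30:.1f} GiB per conversation")
# prints:
#   4096 tokens -> 2.0 GiB per conversation
#  32768 tokens -> 16.0 GiB per conversation
```

Under these assumptions the cache scales linearly with context length and with the number of concurrent conversations, so a long context directly reduces how many users fit on one GPU.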

Batching: the economics trick

The single most important lever for inference cost is batching: running many users' requests through the GPU together. A GPU generating for one request at a time leaves most of its arithmetic units idle, because each step spends its time loading the model's weights from memory just to produce a single token. Pack dozens of requests into a batch and you amortize the hardware across all of them, driving the per-request cost down sharply. This is why API pricing can look cheap: providers are running enormous batches at high utilization. It is also why a self-hosted model serving a handful of users is often shockingly expensive per answer: you are paying for a whole GPU and using a sliver of it.
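The amortization argument can be sketched as a toy cost model. It assumes, as an idealization, that one decode step takes the same time regardless of batch size; in practice step time rises once the batch grows large enough. The hourly GPU price, step latency, and reply length are made-up illustrative values.

```python
def cost_per_request(gpu_dollars_per_hour: float, step_seconds: float,
                     tokens_per_request: int, batch_size: int) -> float:
    """Toy model of per-request serving cost.

    Idealized assumption: one decode step advances every request in the
    batch by one token and takes `step_seconds` whatever the batch size.
    """
    seconds_per_batch = step_seconds * tokens_per_request
    requests_per_hour = (3600 / seconds_per_batch) * batch_size
    return gpu_dollars_per_hour / requests_per_hour


# Hypothetical numbers: $2/hour GPU, 20 ms per step, 200-token replies.
solo = cost_per_request(2.0, 0.02, 200, batch_size=1)
batched = cost_per_request(2.0, 0.02, 200, batch_size=32)

print(f"batch of 1:  ${solo:.6f} per request")
print(f"batch of 32: ${batched:.6f} per request")
print(f"ratio: {solo / batched:.0f}x")
```

In this idealized model the per-request cost falls in direct proportion to the batch size, which is the gap between a busy shared API and a lightly used self-hosted GPU.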

Input is cheap, output is dear

Providers usually charge more for output tokens than input tokens, and the reason is structural. The model can read your entire prompt in one parallel pass, but it must generate the response sequentially, one token at a time, each requiring a full pass. Reading is parallel and fast; writing is serial and slow. So a verbose, rambling answer genuinely costs more to produce than a tight one — efficiency and good writing happen to align with lower cost.
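A small calculation shows how the price asymmetry plays out per request. The prices here ($1 per million input tokens, $4 per million output tokens) are hypothetical placeholders, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical prices: $1 per million input tokens, $4 per million output.
IN_PRICE, OUT_PRICE = 1.0, 4.0

# Same 1,000-token prompt, answered tightly versus verbosely.
tight = request_cost(1_000, 150, IN_PRICE, OUT_PRICE)
verbose = request_cost(1_000, 600, IN_PRICE, OUT_PRICE)

print(f"tight answer:   ${tight:.4f}")
print(f"verbose answer: ${verbose:.4f}")
# prints:
# tight answer:   $0.0016
# verbose answer: $0.0034
```

With these placeholder prices, the verbose answer costs roughly twice as much even though the prompt is identical.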

Why model size is a product decision

All of this is why "just use the biggest model" is rarely the right answer in production. A larger model costs more per token on every single request, forever. Teams increasingly route easy queries to a small, cheap model and reserve the expensive one for hard cases, or distill a big model's behavior into a smaller one for deployment. The art is matching model size to the difficulty of the task, because the difference compounds across millions of requests.
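The compounding effect of routing can be illustrated with a blended-cost estimate. The per-request costs and the 80% share of easy queries are invented for illustration, and the sketch ignores the cost of the routing decision itself and of any answers that have to be retried on the larger model.

```python
def blended_cost(easy_fraction: float, small_cost: float,
                 large_cost: float) -> float:
    """Average cost per request when easy queries go to the small model."""
    return easy_fraction * small_cost + (1 - easy_fraction) * large_cost


# Hypothetical per-request costs in dollars.
SMALL, LARGE = 0.0002, 0.004
REQUESTS = 10_000_000

all_large = LARGE * REQUESTS
routed = blended_cost(0.8, SMALL, LARGE) * REQUESTS

print(f"everything on the large model: ${all_large:,.0f}")
print(f"80% routed to the small model: ${routed:,.0f}")
```

Under these assumptions, routing cuts the bill by about three quarters, and the absolute saving grows with every additional request.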

Why it matters

Inference economics quietly shape the AI products you use: why free tiers are rate-limited, why context windows cost what they do, why responses get cut off, why companies push smaller on-device models. The headline number is always training, but the business runs on inference. Whoever serves capable answers at the lowest cost per token — through batching, smart routing, and right-sized models — has the durable advantage, regardless of who trained the flashiest model.

Analysis by GenZTech.