Token spend is the visible line item. It is rarely the largest cost. The hidden costs are latency risk, reliability dependency, and evaluation debt — and these compound in ways that token pricing does not.

Token cost is only the starting point

Teams model LLM cost as input tokens and output tokens, each multiplied by its own per-token rate. This is accurate as far as it goes. The problem is that it captures only the cheapest part of the total cost of operating a production LLM system.
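As a point of reference, the token-only model fits in a few lines. This is a minimal sketch: the TokenRates type, the prices, and the request volume are placeholder assumptions chosen for illustration, not any provider's real price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def token_cost(input_tokens: int, output_tokens: int, rates: TokenRates) -> float:
    """Return the baseline token spend for a single request, in USD."""
    return (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000


# Placeholder rates and volumes, assumed purely for illustration.
rates = TokenRates(input_per_million=3.0, output_per_million=15.0)
per_request = token_cost(input_tokens=2_000, output_tokens=500, rates=rates)
monthly = per_request * 1_000_000  # assuming one million requests per month
```

With these assumed numbers the model gives about 0.0135 USD per request, or about 13,500 USD per month. Everything it leaves out is the subject of the rest of this piece.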

Consider what is missing from that calculation:

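  • Retry spend. Every API timeout, rate limit error, or malformed output that triggers a retry re-sends the full request. Each billed attempt costs roughly as much as the original, so one retry can double the cost of that request and repeated retries cost more. In high-throughput systems, retry rates of 5–15% are common and expected. The token cost model does not include this.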
  • Prompt engineering overhead. Each iteration of a prompt that changes output quality requires re-running your evaluation set. At meaningful scale, evaluation runs are not free — either in API spend, in engineer time, or both.
  • Latency cost. A response time of 3–8 seconds is normal for frontier model APIs. For user-facing features, this is often unacceptable. The engineering investment required to make this acceptable — streaming, caching, speculative prefilling, model tiering — is significant and recurring.
  • Model deprecation risk. API providers deprecate model versions. When a model version you depend on is deprecated, you must re-evaluate your prompts against the replacement model. The cost of a model migration is not trivial and is not included in any per-token pricing calculation.
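The retry item in the list above can be turned into a rough multiplier on token spend. The sketch below rests on simplifying assumptions that are not from the text: each attempt fails independently with the same probability, every attempt is billed in full (pessimistic, since some failures such as rate limit rejections may not be billed), and retries stop after a fixed cap. The function name and parameters are hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of attempts per request.

    Assumes each attempt fails independently with probability
    `failure_rate` and a failed attempt is retried at most
    `max_retries` times.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (counting from 0) only happens if the previous k attempts
    # all failed, which has probability failure_rate ** k.
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Multiplier on token spend if every attempt is billed in full (assumption).
low = expected_attempts(failure_rate=0.05, max_retries=3)
high = expected_attempts(failure_rate=0.15, max_retries=3)
```

Under these assumptions, failure rates of 5% and 15% add roughly 5% and 18% to token spend respectively. The aggregate effect is smaller than the per-request doubling, but it is still absent from a token-only model.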

The evaluation debt problem

Every production LLM system accrues evaluation debt. The system starts with an evaluation set. The model changes. The prompt changes. The input distribution drifts. Each of these events requires updating the evaluation set — and teams consistently underinvest in this work because it does not ship features.

Evaluation debt manifests as degraded performance that no one notices until it causes an incident. The system has been getting slightly worse for months. The evaluation set has not kept up with production inputs. When the failure becomes visible, the cost to diagnose and remediate is far larger than the cost of maintaining the evaluation harness would have been.
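One way to make this kind of drift visible before an incident is to compare the mix of input categories in the evaluation set with a recent production sample. The sketch below uses total variation distance over category labels. The labels, the threshold, and the premise that inputs are already categorised are all assumptions made for illustration; this is one possible signal, not a complete measure of evaluation debt.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_shares(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the fraction of items that fall under each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(
    eval_labels: Iterable[Hashable], prod_labels: Iterable[Hashable]
) -> float:
    """0.0 means identical category mixes; 1.0 means no overlap at all."""
    p = category_shares(eval_labels)
    q = category_shares(prod_labels)
    return 0.5 * sum(
        abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in p.keys() | q.keys()
    )


# Hypothetical data: production now contains a category the eval set lacks.
eval_labels = ["billing"] * 50 + ["refund"] * 50
prod_labels = ["billing"] * 30 + ["refund"] * 30 + ["cancellation"] * 40

STALE_THRESHOLD = 0.2  # assumed cut-off; tune for your own system
drift = total_variation_distance(eval_labels, prod_labels)
needs_refresh = drift > STALE_THRESHOLD
```

Running a check like this on a schedule turns "the evaluation set has not kept up" from a post-incident finding into a routine alert.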

The practical implication for cost modelling

When modelling the cost of an LLM integration, token spend is the floor. Multiply it by 1.5x to 2x to account for retries and evaluation runs. Add a maintenance budget for evaluation set upkeep of at least 10–20% of initial build cost per quarter. Budget for at least one model migration per year.
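The rules of thumb above can be combined into a single annual estimate. This is a sketch under stated assumptions: the field names are hypothetical, the dollar figures are placeholders, and the cost of a single migration is an input you have to estimate yourself because the text gives no figure for it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CostAssumptions:
    monthly_token_spend: float      # USD, from the token-only model
    overhead_multiplier: float      # 1.5 to 2.0 for retries and evaluation runs
    initial_build_cost: float       # USD
    eval_upkeep_per_quarter: float  # 0.10 to 0.20 of initial build cost
    migrations_per_year: int        # at least 1
    cost_per_migration: float       # USD, your own estimate (assumption)


def annual_cost(a: CostAssumptions) -> Dict[str, float]:
    """Break the yearly operating cost into its main components."""
    tokens = a.monthly_token_spend * 12 * a.overhead_multiplier
    upkeep = a.initial_build_cost * a.eval_upkeep_per_quarter * 4
    migrations = a.migrations_per_year * a.cost_per_migration
    return {
        "tokens_with_overhead": tokens,
        "eval_upkeep": upkeep,
        "migrations": migrations,
        "total": tokens + upkeep + migrations,
    }


# Placeholder inputs at the low end of each range.
estimate = annual_cost(
    CostAssumptions(
        monthly_token_spend=10_000.0,
        overhead_multiplier=1.5,
        initial_build_cost=200_000.0,
        eval_upkeep_per_quarter=0.10,
        migrations_per_year=1,
        cost_per_migration=30_000.0,
    )
)
token_only = 10_000.0 * 12
```

With these placeholder inputs the token-only figure is 120,000 USD per year, while the fuller estimate comes to about 290,000 USD, even with every parameter at the low end of its range.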

The teams that are surprised by LLM operating costs are the ones who modelled only the token line item. The teams that are not surprised built the full cost model before they started.