LLM API Cost Structure for Agent Fleets: Per-Token Economics, Caching Strategies, and Intelligent Model Routing

Format: Course Lesson
Audience: Agent builders, technical practitioners, and infrastructure engineers deploying LLM-powered systems at scale
Level: Intermediate to Advanced


Executive Summary

Running a single LLM query is cheap. Running an agent fleet — where each task spawns multiple model calls, tool invocations, and iterative reasoning steps — is an engineering and economics problem simultaneously. The per-token pricing model used by every major LLM API provider creates compounding costs that are invisible at small scale and catastrophic at large scale if left unmanaged.

This lesson covers the full cost stack: how tokens are priced, why output tokens cost more than input tokens, how caching collapses repeated costs, how model routing assigns the right model to the right subtask, and how to measure cost in terms that actually matter for agent systems — cost per completed task, not cost per token.

The goal is not to minimize spend in isolation. It is to maximize the ratio of useful work completed to dollars spent.


1. Per-Token Economics Fundamentals

What a Token Is

A token is the atomic unit of LLM computation. Tokenizers split text into subword units — roughly 0.75 words per token in English, though this varies significantly by language, code syntax, and special characters. A 1,000-word document is approximately 1,300–1,500 tokens. A structured JSON payload with many repeated field names may tokenize less efficiently than prose.

Every major provider — OpenAI, Anthropic, Google, Mistral, Cohere — prices API access in dollars per million tokens ($/M tokens), split into two categories: input tokens and output tokens.

The Billing Model

Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)

This formula is simple. What makes it complex in agent systems:

  • Agents are multi-turn by design. Each reasoning step, tool call, and reflection loop adds tokens.
  • Context windows accumulate. In a long agent session, the full conversation history is re-sent with every new request, meaning early tokens are billed repeatedly.
  • Tool outputs are input tokens. When an agent calls a search API and receives results, those results enter the context as input tokens on the next call.
  • Structured outputs add tokens. Requiring JSON-formatted responses, chain-of-thought reasoning, or XML-tagged outputs increases output token counts.
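To make the accumulation effect concrete, the sketch below applies the billing formula to a multi-turn session in which the full history is re-sent on every call. The function names, prices, and token counts are illustrative assumptions, not any provider's actual rates.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one API call, given $/M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def session_cost(system_tokens: int, turns: list[tuple[int, int]],
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a session where the whole history is re-sent on every call.

    Each turn is (new_input_tokens, output_tokens); new input covers the
    user message or tool result added before that call.
    """
    history = system_tokens
    total = 0.0
    for new_input, output in turns:
        history += new_input   # tool results and user text join the context
        total += call_cost(history, output, input_price_per_m, output_price_per_m)
        history += output      # the model's reply is re-sent on the next turn
    return total


# Hypothetical session: 1,000-token system prompt, five turns that each add
# 500 input tokens and generate 300 output tokens, priced at $3 / $15 per M.
turns = [(500, 300)] * 5
total = session_cost(1_000, turns, 3.00, 15.00)
billed_input = sum(1_000 + 500 * (i + 1) + 300 * i for i in range(5))
print(f"billed input tokens: {billed_input}, session cost: ${total:.4f}")
```

In this hypothetical session only 3,500 distinct input tokens are ever written, yet 15,500 input tokens are billed, because every earlier token is charged again on each later call.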

Representative Pricing Tiers (Illustrative Ranges)

Prices shift frequently, but the structural relationships are stable:

Model Tier                                  | Input ($/M tokens)       | Output ($/M tokens)
Frontier (e.g., GPT-4o, Claude 3.5 Sonnet)  | $2–$15                   | $8–$75
Mid-tier (e.g., GPT-4o-mini, Claude Haiku)  | $0.10–$0.60              | $0.40–$2.50
Open-weight hosted (e.g., Llama 3, Mistral) | $0.05–$0.30              | $0.10–$0.80
Self-hosted open-weight                     | Infrastructure cost only | Infrastructure cost only

Output tokens are typically priced at 3–5× the input rate (an output-to-input price ratio of 3:1 to 5:1). Output tokens are more expensive because they require sequential autoregressive generation: each token depends on all previous tokens, so generation cannot be parallelized within a single sequence, whereas input tokens can be processed in parallel.


2. Cost Drivers: Input vs Output Tokens, Model Tiers, and Volume Discounts

Why Output Tokens Dominate Cost in Agent Systems

In a standard RAG (retrieval-augmented generation) query, input tokens dominate: a large retrieved context plus a short question, followed by a concise answer. In agent systems, the ratio inverts:

  • Chain-of-thought reasoning forces the model to generate extended intermediate steps before producing a final answer. A task that returns 50 tokens of final output may require 500 tokens of reasoning output.
  • Tool call formatting adds structured output overhead — function names, argument schemas, JSON wrappers.
  • Self-critique and reflection loops double or triple output token counts for tasks requiring verification.
  • Multi-agent delegation means orchestrator agents generate instructions (output) that become sub-agent inputs — costs compound across the hierarchy.

A useful heuristic: in well-designed agent pipelines, output tokens typically represent 60–80% of total cost despite being a minority of total token volume.

Model Tier Selection as a Cost Lever

The price difference between frontier and mid-tier models is often 10–50×. For tasks where mid-tier models perform adequately, using a frontier model is pure waste. The challenge is that "adequate" is task-dependent and must be measured, not assumed.

Tasks where mid-tier models typically suffice:

  • Classification and routing decisions
  • Structured data extraction from clean inputs
  • Summarization of well-formatted documents
  • Simple question answering with retrieved context
  • Format conversion (e.g., JSON to Markdown)

Tasks that typically require frontier models:

  • Multi-step reasoning with ambiguous constraints
  • Code generation for complex logic
  • Synthesis across contradictory sources
  • Tasks requiring nuanced judgment or domain expertise
  • Long-horizon planning with many interdependencies

Volume Discounts and Committed Use

Most providers offer volume discounts at scale — either through negotiated enterprise agreements or tiered pricing that activates above monthly spend thresholds. Batch APIs (discussed in Section 5) typically offer 50% discounts in exchange for relaxed latency requirements. For agent workloads that are not latency-sensitive, batch processing is one of the highest-leverage cost levers available.


3. Caching Strategies: Prompt Caching, KV Cache Optimization, and Cost Reduction Patterns

The Fundamental Insight

LLMs process input tokens by computing attention over the entire context. This computation is expensive. If the same prefix appears in multiple requests — a system prompt, a set of instructions, a large document — recomputing attention over that prefix for every request is wasteful. Caching stores the intermediate computation (the KV cache) and reuses it.

Prompt Caching: Provider-Level Mechanisms

Anthropic (Claude) offers explicit prompt caching via cache_control markers. Prefixes marked for caching are stored server-side for a defined TTL (typically minutes to hours). Cached input tokens are billed at a fraction of standard input token prices — often 10% of the normal rate for cache hits, with a one-time write cost slightly above normal for cache creation.

OpenAI implements automatic prompt caching for context prefixes exceeding a minimum length threshold. No explicit API changes are required; the system detects repeated prefixes and applies cache discounts automatically. Cache hit rates depend on how consistently the prefix is structured across requests.

Google (Gemini) offers explicit context caching with a minimum token threshold and a per-hour storage cost, plus reduced per-token rates for cached content.

Designing for Cache Efficiency

Cache hits require that the cached prefix be byte-identical across requests. This has direct implications for how agent prompts should be structured:

Structure prompts with stable content first:

[System instructions — stable, cache this]
[Tool definitions — stable, cache this]
[Background documents — stable, cache this]
[Dynamic user query — changes per request, not cached]

Anti-pattern — dynamic content in the prefix:

[Timestamp: 2024-01-15 14:32:07]  ← breaks cache on every request
[System instructions]
[User query]

Inserting timestamps, session IDs, or any dynamic content before stable content destroys cache hit rates. Even minor variations — trailing spaces, different newline characters — break cache matching.
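One way to enforce this ordering is to assemble the prompt from a fixed prefix and a per-request suffix, and to fingerprint the prefix so drift is detectable. The following is a minimal sketch. The constants, `build_prompt`, and `prefix_fingerprint` are hypothetical names, and the hash only approximates the provider's own prefix matching, which happens server-side.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical stable content shared by every request
SYSTEM_INSTRUCTIONS = "You are a research agent. Follow the tool protocol."
TOOL_DEFINITIONS = '{"tools": [{"name": "search"}, {"name": "fetch"}]}'


def build_prompt(user_query: str, now: datetime) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix); only the suffix varies per request."""
    stable_prefix = SYSTEM_INSTRUCTIONS + "\n" + TOOL_DEFINITIONS + "\n"
    # Dynamic values such as the timestamp go after the stable content
    dynamic_suffix = f"Current time: {now.isoformat()}\nUser query: {user_query}\n"
    return stable_prefix, dynamic_suffix


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the exact bytes; any whitespace drift changes the fingerprint."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


p1, s1 = build_prompt("Summarize Q3 results", datetime(2024, 1, 15, 14, 32, 7, tzinfo=timezone.utc))
p2, s2 = build_prompt("List open risks", datetime(2024, 1, 15, 14, 33, 0, tzinfo=timezone.utc))
print(prefix_fingerprint(p1) == prefix_fingerprint(p2))  # the prefix is unchanged across requests
```

Logging the fingerprint alongside each request makes it easy to spot a deployment that silently changed the prefix and reset the cache.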

KV Cache Optimization Patterns

Beyond provider-level caching, several patterns reduce effective token costs:

Shared system prompt consolidation: In a multi-agent fleet where all agents share the same base instructions, consolidate those instructions into a single cached prefix rather than duplicating them per-agent-type.

Document pre-loading: For agents that repeatedly query the same knowledge base, pre-load documents into a cached context rather than retrieving and re-injecting them per query.

Conversation compression: Long agent sessions accumulate context. Periodically summarize earlier turns into a compressed representation, replacing verbose history with a dense summary. This reduces input tokens on subsequent calls while preserving relevant state.

Prefix batching: When multiple requests share the same prefix, batch them together to maximize cache utilization. A cache loaded for one request is available for all requests sharing that prefix within the TTL window.

Realistic Cache Savings

For agent systems with consistent system prompts and shared context, prompt caching can reduce input token costs by 40–80%. The actual savings depend on:

  • Cache hit rate (what fraction of requests hit a warm cache)
  • Ratio of cached prefix length to total input length
  • Provider-specific pricing for cached vs. uncached tokens

A system prompt of 2,000 tokens that appears in every request, with a 90% cache hit rate and a 90% discount on cached tokens, reduces system prompt costs by ~81% (0.9 × 0.9 = 0.81, ignoring any cache-write surcharge). That is meaningful at scale.
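The arithmetic behind that estimate generalizes to other hit rates and discounts. The sketch below also accepts an optional write surcharge. It assumes every cache miss pays that surcharge, which is a simplification, and the function name is hypothetical.

```python
def cached_prefix_savings(hit_rate: float, cached_discount: float,
                          write_premium: float = 0.0) -> float:
    """Fractional reduction in prefix cost versus no caching.

    hit_rate:        fraction of requests served from a warm cache
    cached_discount: discount on cached tokens (0.9 means billed at 10%)
    write_premium:   surcharge on misses that write the cache (0.25 means 125%)
    """
    hit_cost = hit_rate * (1.0 - cached_discount)
    miss_cost = (1.0 - hit_rate) * (1.0 + write_premium)
    return 1.0 - (hit_cost + miss_cost)


no_premium = cached_prefix_savings(0.90, 0.90)
with_premium = cached_prefix_savings(0.90, 0.90, write_premium=0.25)
print(f"{no_premium:.3f} {with_premium:.3f}")
```

With a 25% write surcharge the same scenario saves about 78.5% rather than 81%, so the surcharge matters mainly when hit rates are low.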


4. Model Routing for Agent Fleets: Cost-Quality Trade-offs and Dynamic Selection

The Routing Problem

An agent fleet is not a monolithic system making uniform requests. It is a heterogeneous workload: some tasks require frontier reasoning, others need only pattern matching. Static model assignment — using the same model for every task — either overspends on simple tasks or underperforms on complex ones.

Model routing is the practice of dynamically assigning each request to the most cost-effective model capable of handling it adequately.

Routing Architectures

Complexity-based routing: A lightweight classifier (itself a small, cheap model or a rule-based system) evaluates incoming tasks and assigns them to model tiers based on estimated complexity. Features used for classification include query length, presence of multi-step reasoning markers, domain specificity signals, and historical performance on similar queries.

Cascade routing: Send every request to a cheap model first. If the response meets a quality threshold (measured by confidence scores, output format validation, or a separate evaluator), return it. If not, escalate to a more capable model. This works well when most requests are simple and the escalation rate is low.

Capability-based routing: Maintain a registry of model capabilities — which models handle code well, which handle multilingual content, which support function calling with high reliability — and route based on task type rather than complexity alone.

Cost-ceiling routing: Define a maximum acceptable cost per task type. Route to the most capable model that fits within the cost ceiling for that task category.
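Of the architectures above, cascade routing is the simplest to sketch. The example below uses output format validation as the quality gate. The model callables are hypothetical stubs standing in for real API calls, and a production gate would likely combine several checks.

```python
import json
from typing import Callable

ModelFn = Callable[[str], str]  # prompt -> raw model output


def is_valid_json_with(raw: str, required_keys: set[str]) -> bool:
    """Cheap format check used as the quality gate."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def cascade(prompt: str, tiers: list[tuple[str, ModelFn]],
            required_keys: set[str]) -> tuple[str, str, int]:
    """Try tiers cheapest-first; return (tier_name, output, escalations)."""
    for escalations, (name, model) in enumerate(tiers):
        output = model(prompt)
        if is_valid_json_with(output, required_keys):
            return name, output, escalations
    raise RuntimeError("all tiers failed validation")


def cheap_model(prompt: str) -> str:     # hypothetical stub for a mid-tier call
    return "Sure! The answer is 42."


def frontier_model(prompt: str) -> str:  # hypothetical stub for a frontier call
    return '{"answer": "42"}'


tier, output, escalations = cascade(
    "Return the answer as JSON.",
    [("mid-tier", cheap_model), ("frontier", frontier_model)],
    {"answer"},
)
print(tier, escalations)
```

Returning the escalation count alongside the result makes it straightforward to track escalation rates per task type.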

Routing Decision Signals

Effective routing requires signals that can be evaluated cheaply before the main inference call:

  • Task type classification: Is this a retrieval task, a generation task, a reasoning task, a classification task?
  • Input characteristics: Token count, language, presence of code, structured vs. unstructured input
  • Quality requirements: Is this a user-facing response (higher quality bar) or an internal pipeline step (lower quality bar)?
  • Latency requirements: Is this synchronous (user waiting) or asynchronous (background processing)?
  • Historical performance: What model tier has historically succeeded on similar tasks?

Routing Overhead

Routing adds latency and cost. A routing classifier that takes 200ms and costs $0.001 per decision is only worthwhile if it saves more than that in downstream model costs. For high-volume systems, routing overhead is negligible relative to savings. For low-volume or latency-critical systems, simpler static routing may be preferable.
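The break-even condition can be written down directly. This sketch compares routed traffic against sending every request to the frontier model. The per-request costs in the example are made-up figures, and the function name is hypothetical.

```python
def routing_net_savings(router_cost: float, frontier_cost: float,
                        cheap_cost: float, cheap_fraction: float) -> float:
    """Expected USD saved per request versus sending everything to the frontier model."""
    routed = (router_cost
              + cheap_fraction * cheap_cost
              + (1.0 - cheap_fraction) * frontier_cost)
    return frontier_cost - routed


# Hypothetical figures: $0.001 router, $0.03 frontier call, $0.002 mid-tier call,
# with 60% of requests routed to the cheaper tier.
savings = routing_net_savings(0.001, 0.03, 0.002, 0.60)
# Routing pays for itself once cheap_fraction exceeds this value
break_even_fraction = 0.001 / (0.03 - 0.002)
print(f"net savings per request: ${savings:.4f}, break-even share: {break_even_fraction:.1%}")
```

The calculation covers cost only. The added 200ms of latency has to be weighed separately for synchronous workloads.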

Fallback and Quality Assurance

Routing systems need fallback logic. If a mid-tier model produces an output that fails validation (malformed JSON, incomplete reasoning, factual inconsistency detected by a verifier), the system should automatically retry with a higher-tier model rather than returning a failed result. Track escalation rates by task type — high escalation rates signal that the routing threshold for that task type is miscalibrated.


5. Operational Patterns: Batch Processing, Request Coalescing, and Load Balancing

Batch Processing

Most major providers offer batch inference APIs that process requests asynchronously with a 24-hour completion window in exchange for significant price reductions — typically 50% off standard pricing. Batch APIs are appropriate for:

  • Offline data processing pipelines
  • Nightly analysis jobs
  • Bulk document processing
  • Evaluation and benchmarking runs
  • Any agent task where the result is needed within hours, not seconds

For agent fleets with mixed workloads, segregating latency-insensitive tasks into batch queues can halve the cost of those workloads with no quality impact.

Request Coalescing

When multiple agents or users submit similar or identical requests within a short time window, coalescing combines them into a single API call (or serves them from a shared cache). This is particularly effective for:

  • Common lookup queries (e.g., "what is the current price of X")
  • Shared context retrieval (multiple agents querying the same document)
  • Repeated system-level queries (e.g., routing classifiers evaluating similar inputs)

Coalescing requires a request deduplication layer — typically a short-TTL cache keyed on a hash of the request content — sitting in front of the LLM API.
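A deduplication layer of this kind might look like the following sketch. `CoalescingCache` and the `call` argument are hypothetical, and the example handles only sequential repeats. Merging requests that are in flight at the same time would need additional locking or futures.

```python
import hashlib
import json
import time
from typing import Callable


class CoalescingCache:
    """Short-TTL cache keyed on a hash of the request content."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(model: str, messages: list[dict]) -> str:
        # sort_keys gives a canonical serialization so equal requests hash equally
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, messages: list[dict],
                    call: Callable[[str, list[dict]], str]) -> str:
        k = self.key(model, messages)
        entry = self._store.get(k)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            return entry[1]  # served without an API call
        result = call(model, messages)
        self._store[k] = (self.clock(), result)
        return result
```

The TTL should match how quickly the underlying answer goes stale. A price lookup may tolerate seconds, while a document summary may tolerate hours.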

Load Balancing Across Providers

No single provider has the best price-performance ratio for all task types at all times. A multi-provider routing layer allows:

  • Cost arbitrage: Route to whichever provider currently offers the best price for a given model tier
  • Availability hedging: Failover to an alternative provider during outages
  • Capability matching: Use provider-specific features (e.g., Anthropic's prompt caching, OpenAI's structured outputs) for tasks that benefit from them
  • Rate limit management: Distribute load across provider accounts to avoid rate limiting

Libraries like LiteLLM provide a unified interface across providers, simplifying multi-provider routing implementation.

Concurrency and Rate Limit Management

LLM APIs impose rate limits in two dimensions: requests per minute (RPM) and tokens per minute (TPM). Agent fleets can hit TPM limits before RPM limits when processing large contexts. Strategies:

  • Token-aware request scheduling: Track token consumption in a sliding window and throttle requests before hitting limits
  • Priority queuing: Assign priority levels to requests; shed low-priority requests under load rather than failing high-priority ones
  • Exponential backoff with jitter: Standard retry logic for 429 (rate limit) responses, with randomized delays to prevent thundering herd on retry
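The first and third strategies can be sketched in a few lines. `TokenWindow` and `backoff_delay` are hypothetical helpers. The limiter is single-threaded and tracks only locally observed usage, so it approximates the provider's own accounting rather than mirroring it.

```python
import random
import time
from collections import deque
from typing import Callable


class TokenWindow:
    """Tracks tokens used in a trailing window and reports whether a request fits."""

    def __init__(self, tpm_limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = tpm_limit
        self.window = window_seconds
        self.clock = clock
        self._events: deque[tuple[float, int]] = deque()
        self._used = 0

    def _expire(self) -> None:
        # Drop usage records that have aged out of the window
        cutoff = self.clock() - self.window
        while self._events and self._events[0][0] <= cutoff:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Reserve tokens if they fit under the limit; otherwise leave state unchanged."""
        self._expire()
        if self._used + tokens > self.limit:
            return False
        self._events.append((self.clock(), tokens))
        self._used += tokens
        return True


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A scheduler would call `try_acquire` with the estimated token count before dispatching, and sleep for `backoff_delay(attempt)` seconds after each 429 response.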

6. Measuring True Cost: Beyond Per-Token Metrics to Agent Efficiency

The Problem with Per-Token Metrics

Per-token cost is a necessary metric but an insufficient one. An agent that completes a task in 3 API calls at $0.05 each ($0.15 total) is more efficient than one that completes the same task in 12 calls at $0.02 each ($0.24 total), even though the second agent has a lower per-call cost. The relevant unit of measurement is cost per completed task or cost per unit of value delivered.

Key Efficiency Metrics for Agent Systems

Cost per task completion: Total API spend divided by number of successfully completed tasks. Tracks overall efficiency.

Task completion rate: Fraction of initiated tasks that complete successfully. A cheap agent that fails 40% of tasks is not cheap — it's expensive when you account for retries, fallbacks, and human intervention.

Cost per successful output token: For generation tasks, the cost of producing one token of final, useful output — accounting for all the reasoning, tool calls, and intermediate steps that preceded it.

Escalation rate: Fraction of tasks that required escalation to a higher-tier model. High escalation rates indicate routing miscalibration.

Cache hit rate: Fraction of input tokens served from cache. A leading indicator of prompt structure efficiency.

Tokens per task: Total tokens consumed (input + output) per completed task. Tracks prompt efficiency over time — rising tokens per task signals prompt bloat or increasing task complexity.

Cost Attribution in Multi-Agent Systems

In hierarchical agent systems, cost attribution becomes complex. An orchestrator agent that delegates to five sub-agents is responsible for costs across the entire tree. Implement cost tracking at the task level, not just the request level:

  • Assign a task ID to each top-level agent invocation
  • Propagate the task ID through all sub-agent calls
  • Aggregate total cost per task ID for reporting

This enables cost-per-task measurement even when a single task spans dozens of API calls across multiple agents and model tiers.
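The three steps above can be sketched as a small aggregation over call records. `CallRecord` and its fields are hypothetical. In a real system the records would come from the request log rather than an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    task_id: str     # propagated unchanged from the top-level invocation
    agent: str
    model: str
    cost_usd: float


def cost_per_task(records: list[CallRecord]) -> dict[str, float]:
    """Sum spend across every call that carries the same task ID."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.task_id] += r.cost_usd
    return dict(totals)


def cost_per_completed_task(records: list[CallRecord], completed: set[str]) -> float:
    """Total spend, including failed tasks, divided by successfully completed tasks."""
    if not completed:
        raise ValueError("no completed tasks")
    return sum(r.cost_usd for r in records) / len(completed)


records = [
    CallRecord("t1", "orchestrator", "frontier", 0.020),
    CallRecord("t1", "extractor", "mid-tier", 0.003),
    CallRecord("t1", "summarizer", "mid-tier", 0.004),
    CallRecord("t2", "orchestrator", "frontier", 0.050),  # suppose t2 failed
]
totals = cost_per_task(records)
effective = cost_per_completed_task(records, completed={"t1"})
print(totals, effective)
```

Dividing total spend by completed tasks, rather than by all tasks, is what charges the cost of failures to the successes.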

Budgeting and Cost Forecasting

Agent workloads are harder to forecast than traditional API workloads because token consumption per task varies with task complexity, context length, and model behavior. Practical approaches:

  • Empirical baseline: Run a representative sample of tasks, measure actual token consumption, and use the distribution (not just the mean) for forecasting. P95 and P99 costs matter for budget planning.
  • Per-task cost caps: Implement hard limits on total spend per task. If a task exceeds its budget, fail gracefully rather than running indefinitely.
  • Anomaly detection: Alert on tasks consuming 3× or more the expected token budget — these often indicate prompt injection, runaway loops, or unexpected input complexity.
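A per-task cap and the 3× anomaly check can share one small object, and the percentile for budget planning can come from the standard library. The class and function names here are hypothetical, and the thresholds are illustrative.

```python
import statistics


class BudgetExceeded(Exception):
    """Raised when a task's cumulative spend passes its hard cap."""


class TaskBudget:
    """Hard per-task spend cap with a soft anomaly flag."""

    def __init__(self, expected_usd: float, hard_cap_usd: float,
                 anomaly_factor: float = 3.0):
        self.expected = expected_usd
        self.hard_cap = hard_cap_usd
        self.anomaly_factor = anomaly_factor
        self.spent = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record spend after a call; return True if the task now looks anomalous."""
        self.spent += cost_usd
        if self.spent > self.hard_cap:
            raise BudgetExceeded(
                f"spent {self.spent:.4f} USD, cap is {self.hard_cap:.4f} USD")
        return self.spent >= self.anomaly_factor * self.expected


def p95(costs: list[float]) -> float:
    """95th-percentile cost from an empirical sample (needs at least two points)."""
    return statistics.quantiles(costs, n=100, method="inclusive")[94]
```

The caller is expected to catch `BudgetExceeded` and fail the task gracefully, for example by returning a partial result with an explanatory status.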

7. Empirica's Structured APIs as Cost Optimization Layer

The Token Efficiency Problem with Raw Data

When agents consume raw, unstructured data sources — HTML pages, PDF extracts, verbose API responses — they spend significant tokens on noise: navigation elements, boilerplate text, redundant metadata, formatting artifacts. A web page that contains 200 tokens of relevant information may deliver 2,000 tokens of raw HTML to the agent's context.

This is not a minor inefficiency. At scale, it means agents are spending 80–90% of their input token budget on content that contributes nothing to task completion.

Structured Outputs as a Cost Lever

Empirica's research APIs deliver pre-structured, agent-optimized outputs — clean data with consistent schemas, relevant fields surfaced, noise removed. For an agent consuming research data, this means:

  • Lower input token counts: Structured data is denser than raw source material. The same information in fewer tokens.
  • Higher cache hit rates: Consistent schemas mean consistent prefixes. Structured outputs are more cacheable than variable raw content.
  • Reduced parsing overhead: Agents don't need to spend output tokens on extraction and normalization — the data arrives in a usable form.
  • Predictable token budgets: Consistent schemas enable accurate per-query token forecasting, which is difficult with variable raw sources.

Integration with Routing and Caching

Structured API outputs integrate naturally with the caching and routing patterns described above:

  • Structured responses can be cached at the application layer (not just the LLM layer), serving repeated queries without any API call
  • Consistent field structures enable routing decisions based on data content — an agent can inspect a structured response and decide whether to escalate to a frontier model or proceed with a mid-tier model
  • Schema-validated outputs reduce the need for output token-heavy validation and correction loops

8. Practical Implementation: Cost Monitoring, Budgeting, and Scaling Strategies

Instrumentation Requirements

Cost optimization requires measurement infrastructure. Minimum viable instrumentation for an agent fleet:

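# Sketch: per-request cost tracking. Prices are illustrative placeholders.
import time

# Hypothetical $/M-token prices per tier: (uncached input, cached input, output)
PRICES = {"frontier": (3.00, 0.30, 15.00), "mid-tier": (0.25, 0.025, 1.25)}
METRICS = []  # stand-in for a real metrics sink

def calculate_cost(model, input_tokens, output_tokens, cached_tokens):
    # Assumes input_tokens includes cached_tokens (provider conventions differ)
    in_price, cached_price, out_price = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * in_price + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000

def track_request(task_id, model, input_tokens, output_tokens, cached_tokens):
    cost = calculate_cost(model, input_tokens, output_tokens, cached_tokens)
    METRICS.append({
        "task_id": task_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached_tokens": cached_tokens,
        "cost_usd": cost,
        "timestamp": time.time(),
        "cache_hit_rate": cached_tokens / input_tokens if input_tokens > 0 else 0,
    })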

Track at minimum: task ID, model used, token counts by type, cost, and whether the request was a cache hit. Aggregate these into per-task totals for efficiency analysis.

Cost Monitoring Stack

  • Request-level logging: Every API call logged with token counts and cost
  • Task-level aggregation: Sum costs across all calls within a task
  • Alerting: Trigger alerts when per-task cost exceeds threshold, when cache hit rate drops below baseline, or when escalation rate spikes
  • Dashboards: Track cost per task over time, model tier distribution, cache efficiency, and total spend by task type

Scaling Cost Controls

As agent fleets scale, implement progressive cost controls:

Soft limits: Log a warning when a task exceeds expected cost. Useful for identifying outliers without disrupting production.

Hard limits: Terminate a task that exceeds a maximum cost threshold. Prevents runaway agents from consuming unbounded resources.

Rate limiting by task type: Allocate token budgets by task category. High-value tasks get larger budgets; routine tasks get tighter limits.

Cost-aware scheduling: During high-load periods, deprioritize expensive tasks and batch them for off-peak processing.

Prompt Engineering for Cost Efficiency

Prompt design directly affects token consumption:

  • Concise system prompts: Every token in a system prompt is billed on every request. Audit system prompts for redundancy regularly.
  • Explicit output format constraints: Telling the model to respond in 2–3 sentences reduces output token variance.
  • Avoid few-shot examples in the prompt when fine-tuning is available: Few-shot examples add hundreds of tokens per request. A fine-tuned model can achieve the same behavior with a shorter prompt.
  • Tool definition pruning: Only include tool definitions relevant to the current task. A 20-tool definition block adds significant tokens to every request even when most tools are never called.

Key Takeaways and Decision Framework

Core Principles

  1. Output tokens drive cost in agent systems. Design to minimize unnecessary generation — chain-of-thought only when it improves task success, structured outputs only when they add value downstream.

  2. Caching is the highest-leverage optimization for systems with repeated context. Structure prompts with stable content first. Measure cache hit rates. At a 90% cached-token discount, a 70% cache hit rate on a 2,000-token system prompt cuts the cost of that prefix by roughly 63%, with no change in model quality.

  3. Model routing is not optional at scale. Using frontier models for tasks that mid-tier models handle adequately is the most common source of preventable cost in agent fleets.

  4. Measure cost per task, not cost per token. A cheaper-per-token agent that requires more calls, more retries, and more human intervention is not cheaper.

  5. Batch what you can. 50% discounts for asynchronous processing are available today. Any workload that tolerates hours-scale latency should use batch APIs.

Decision Framework: Choosing Your Optimization Priority

Is your primary cost driver input tokens or output tokens?
├── Input tokens dominant → Focus on caching, prompt compression, structured data sources
└── Output tokens dominant → Focus on output constraints, reasoning efficiency, task decomposition

Is your workload latency-sensitive?
├── Yes → Real-time routing, cascade models, provider load balancing
└── No → Batch APIs, off-peak scheduling, aggressive caching

Is your task complexity uniform or variable?
├── Uniform → Static model assignment, optimized for that tier
└── Variable → Dynamic routing, complexity classification, cascade fallback

Are you at early scale or production scale?
├── Early → Instrument everything, establish baselines, don't optimize prematurely
└── Production → Cost per task tracking, anomaly detection, hard limits, multi-provider routing

The Compounding Effect

These optimizations compound. A system with 70% cache hit rates, intelligent model routing that uses mid-tier models for 60% of tasks, and batch processing for 40% of workload can achieve 70–85% cost reduction relative to a naive implementation using frontier models for all requests with no caching. At meaningful scale, this is the difference between a viable product and an uneconomical one.
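A rough way to sanity-check that range is to treat the three levers as independent multiplicative factors. The sketch below does so under loudly stated assumptions: a 20× frontier-to-mid-tier price gap, tasks of equal cost, input tokens at 40% of spend with 80% of that input cacheable at a 90% discount, and batch and cache discounts that stack. None of these figures come from a specific provider, and the function name is hypothetical.

```python
def compounded_cost_factor(
    mid_tier_share: float, mid_tier_price_ratio: float,
    batch_share: float, batch_discount: float,
    input_cost_share: float, cached_input_share: float,
    cache_hit_rate: float, cached_discount: float,
) -> float:
    """Fraction of the naive (all-frontier, no caching, no batch) cost that remains."""
    routing = (1 - mid_tier_share) + mid_tier_share * mid_tier_price_ratio
    batching = 1 - batch_share * batch_discount
    caching = 1 - input_cost_share * cached_input_share * cache_hit_rate * cached_discount
    return routing * batching * caching


remaining = compounded_cost_factor(
    mid_tier_share=0.60, mid_tier_price_ratio=0.05,  # assumed 20x price gap
    batch_share=0.40, batch_discount=0.50,
    input_cost_share=0.40, cached_input_share=0.80,  # assumed cost mix
    cache_hit_rate=0.70, cached_discount=0.90,
)
print(f"cost reduction: {1 - remaining:.1%}")
```

Under these assumptions the reduction is roughly 72%, at the lower end of the range quoted above. Routing contributes the largest share, so the result is most sensitive to the assumed price gap.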


This lesson is part of Empirica's agent infrastructure curriculum. Related topics: discovery infrastructure for AI agents, on-chain payment rails for autonomous agents, and structured research APIs as agent-readable data layers.