The complete guide to LLM token pricing: input, output, cached, and reasoning tokens

They are all quoted as “$/1M tokens,” but input, output, cached input, and reasoning tokens are billed in completely different ways — and the final number is usually decided by the part you cannot see. Using real model pricing (Claude caches at 10% of input, OpenAI o1 only at 50%), this guide explains how each token type is charged, why output is usually 3–5× input, and how to estimate and cut the bill.

1. What a token is and how it's counted

Every LLM API invoice lists a number that almost nobody has an intuitive feel for: tokens. The billing unit is not characters, not words, not sentences — it is the sub-word chunk that the model's tokenizer breaks text into before processing it. Understanding this is the first step to understanding why your bill is what it is.

The practical rule of thumb for English prose: roughly 0.75 words per token, which inverts to about 1.33 tokens per word. A 500-word document lands somewhere between 650 and 750 tokens. Code and structured data are often denser — Python variable names and JSON keys tokenize compactly, but verbose languages like XML or YAML can run closer to 1.5–2 tokens per word-equivalent because every angle bracket and colon is its own token or shares one with a neighbour. Non-English languages vary considerably: CJK text (Chinese, Japanese, Korean) frequently runs 1–2 characters per token, making it token-efficient compared with, say, agglutinative languages like Finnish where a single inflected word can become 3–4 tokens.

You can verify token counts before sending to the API. Most providers expose a tokenizer endpoint or client-side library (OpenAI's tiktoken, Anthropic's token counting API). Run your typical prompt through it once and calibrate against your actual invoices — the "0.75 words" rule breaks down at the tails (very short prompts, code-heavy prompts, multilingual prompts), and billing surprises almost always happen at the tails.

A subtlety worth knowing: tokenization is model-family-specific. GPT-4o and GPT-5 use different tokenizers than Claude, which uses a different one than Gemini. A 1,000-token system prompt measured on tiktoken may land at 950 or 1,100 tokens on the Claude tokenizer. When you are doing cross-model cost comparisons, measure token counts on each provider's actual tokenizer, not a generic approximation.

One final point that surprises many practitioners: whitespace and formatting tokens are real. A system prompt full of Markdown — headers, bullet lists, code fences — adds tokens for each formatting character. Compressing your system prompt from rich Markdown to minimal plain text can trim 5–15% off your input token count with no loss in model comprehension. That saving compounds every time the prompt is sent.

2. Input vs output: why output costs more

Nearly every LLM pricing page shows two numbers: an input price and an output price. In almost every case output is more expensive — often significantly so. Understanding why explains how to manage it.

The asymmetry has a hardware explanation. Processing input tokens is partially parallelisable: the model can attend across the full prompt in one forward pass. Generating output tokens is auto-regressive — each token must be produced sequentially, conditioned on everything before it, before the next token can start. Inference hardware is less efficiently utilised during generation, so the compute cost per token is higher. Providers pass this on as higher output rates.

The multiple varies across the market. Here are real 2026 examples (USD per 1M tokens):

Model	Input	Output	Output / Input
Claude Opus 4.8 (standard)	$5	$25	5×
GPT-5.5	$5	$30	6×
MiniMax M3	$0.30	$1.20	4×
DeepSeek V4 Pro	$0.435	$0.87	2×

The range is 2–6×. DeepSeek V4 Pro's relatively tight ratio (2×) reflects a pricing strategy aimed at high-throughput agentic tasks where output tokens dominate; GPT-5.5's 6× ratio reflects a different position in the market. The implication is important: the "cheapest" model on input price alone can easily become the most expensive once you account for your actual output volume.

For most real applications the input/output split is not 50/50. A customer-support bot sending a 1,000-token FAQ context and receiving a 100-token answer is spending 91% of its tokens on input. A code generator producing a 2,000-token implementation from a 200-token spec is spending 91% on output. These workloads should be evaluated on completely different model criteria. The cost calculator on this site lets you plug in your own input/output ratio and see which models win under your specific numbers.

3. Cached input: the biggest single cost lever

Prompt caching lets you pay a reduced rate for input tokens that the model has already processed in a previous request. When a request's prefix — typically a long system prompt or retrieved document — matches a cached prefix exactly, you are billed at the cached rate rather than the full input rate. In 2026 this has become the single largest lever most practitioners have on their monthly bill.

The discount, however, is not uniform across providers. This is where many teams get tripped up. Here are the actual numbers as of mid-2026:

Claude Opus 4.8: standard input $10/M → cached input $1/M. That is 10% of the standard rate — a 90% discount on cached prefixes.
Claude Opus 4.6: standard input $5/M → cached input $0.50/M. Also 10% of the standard rate.
OpenAI GPT-5 Image: standard input $10/M → cached input $1.25/M. That is 12.5% of the standard rate — close to Anthropic's depth of discount.
OpenAI o3 Deep Research: standard input $10/M → cached input $2.50/M. That is 25% of the standard rate — a smaller discount than Anthropic.
OpenAI o1: standard input $15/M → cached input $7.50/M. That is 50% — a much shallower discount. For o1, caching halves your input cost; for Claude, it cuts it to a tenth.

This means "cached input" is not a single feature — it spans from a 50% discount (OpenAI o1) to a 90% discount (Anthropic Claude). For an agent workload with a 4,000-token system prompt that repeats across 1,000 daily requests, the difference in monthly cost between a provider that charges 50% for cached tokens vs one that charges 10% is substantial. Run the maths against your own volume before you pick a provider.

The mechanics also differ. Anthropic's cache is prefix-matching and warms automatically once you include a cache_control breakpoint in your messages. OpenAI's automatic caching kicks in after the first request once a prefix exceeds 1,024 tokens. Neither guarantees a hit — if the model infrastructure routes you to a different pod, the cache may be cold. Real-world hit rates depend on your traffic pattern and provider infrastructure, not just your prompt structure. The cheapest-input ranking on this site shows cached input prices alongside standard input so you can compare directly.

4. Reasoning tokens: the hidden output you pay for

The o-series models from OpenAI (o1, o3, o4) and Claude's extended-thinking modes both have a feature that catches many users by surprise the first time they see their invoice: they bill for reasoning tokens. These are the chain-of-thought tokens the model generates internally — the scratchpad it uses to think through the problem — before producing the visible answer. You never see them in the response body, but they appear in your usage statistics and are billed as output tokens.

The practical impact is large. On a typical "think hard about this" prompt, the internal reasoning trace can be 5–20× longer than the visible answer. A response that looks like 100 output tokens can actually bill as 1,000–2,000 output tokens once the reasoning trace is included. Since output tokens are already the most expensive token type (section 2 above), this compounds fast.

There is no simple rule for how long the reasoning trace will be — it depends on the task complexity as perceived by the model, which is a function of your prompt phrasing. A prompt that says "think step by step, explore multiple approaches, then give me your best answer" will consistently generate longer reasoning traces than one that says "answer briefly." Some API parameters let you set a budget for reasoning tokens (thinking.budget_tokens in the Anthropic API, and reasoning effort tiers in the OpenAI API), but the model does not always respect the budget precisely.

Two practical implications follow. First, for cost-sensitive workloads, the cheapest reasoning model by headline output price may easily be more expensive than the most expensive non-reasoning model once the reasoning overhead is accounted for. Second, you cannot accurately compare reasoning vs non-reasoning models using $/M output tokens alone — you need to measure "cost per completed task" on your own prompt set. Publish your reasoning-token usage metrics alongside your standard usage so your finance team is not blindsided by the invoices.

One useful mental model: reasoning tokens are more like "background compute" than "visible output." You are buying the model's thinking time. Whether that thinking time produces quality improvements worth its cost depends entirely on the task type. For highly structured extraction tasks, a non-reasoning model tuned well often beats a reasoning model at one-fifth the cost. For complex multi-step planning tasks, the reasoning trace earns its keep.

5. Multimodal and other billing dimensions

Tokens are not the only billing dimension. As LLMs have extended to images, audio, video, and other modalities, providers have introduced per-unit billing that sits alongside token pricing. If you are only reading the text-token columns of a pricing table, you may be missing a significant share of your bill.

Image input is the most common non-text dimension. Most providers convert images to a token-equivalent (OpenAI's GPT-4o tiles images into 85- or 170-token chunks depending on resolution; Claude converts images to a flat token count based on pixel area). The practical effect: a high-resolution image can cost the same as several hundred words of text input. If your pipeline passes full-resolution screenshots or product photos, resizing to the model's minimum required resolution before sending is a straightforward cost reduction — typically 30–60% on image-heavy workloads — with no quality loss if the content is still legible at the smaller size.

Per-request fees appear on some models as a flat charge on top of token prices. This shows up most often on specialised models (web-search-augmented endpoints, image-generation models, and some fine-tuned variants). Per-request fees can dominate for short-prompt, high-volume workloads: if you are making 100,000 requests per day with very short prompts, a $0.01 per-request fee adds $1,000/day, potentially dwarfing the token cost.

Audio and video billing is still maturing. OpenAI's realtime audio API charges per second of audio rather than per token, at rates that make it expensive for long-form use cases. Video understanding models typically convert video to frames and charge per-frame or per-second equivalents. These numbers are evolving quickly and are best verified directly against the provider's current pricing page rather than from aggregators.

This site's prices are sourced from OpenRouter, which routes requests and may carry a small routing margin on top of provider costs. For the full picture of why routed prices differ from official prices — and when that matters for your budget — see the platform-differences article.

6. How to estimate and cut your bill

Estimating your bill before you build is more tractable than it looks. The key is to measure a few representative prompts, not to derive an exact model. Here is a practical framework:

Count your prompt tokens accurately. Run 10–20 representative prompts through the provider's tokenizer, not a generic estimator. Average the results. Note the spread — a wide range means your cost variance will be high.
Measure your input/output ratio. For each representative prompt, record how many tokens the model actually generated. The ratio often surprises people — a "generate a summary" task might output 3× more than expected once you account for the model's tendency to include caveats and context.
Measure your cache hit rate. If you have any repeated prefix (system prompt, retrieved context), simulate a week of traffic and measure what fraction of input tokens hit the cache. Even a 50% hit rate at a 90% discount is a 45% reduction in input cost.
Account for reasoning overhead if using extended-thinking models. Run your actual prompts through the model and inspect the usage response field for reasoning_tokens. Do not rely on estimates.
Use the cost calculator on this site to plug in your numbers and compare across models at once.

On the cost-reduction side, there are four high-leverage actions:

Structure prompts for cache hits. The cached prefix must be identical across requests — even a single changed character breaks the match. Put everything that is stable (system instructions, reference documents, few-shot examples) at the top of the prompt, and everything that is request-specific (the user's message, retrieved search results) at the bottom. This maximises the prefix that is eligible for caching.

Discipline output length. If your use case does not need long responses, say so explicitly in the system prompt. "Answer in one paragraph" or "respond in under 100 words" are both effective. Output tokens are the most expensive, so every token you prevent the model from generating is a token you do not pay for. This is also where model choice matters: a model with a 2× output/input ratio like DeepSeek V4 Pro charges far less for verbose tasks than one with a 6× ratio like GPT-5.5.

Use batch APIs for latency-insensitive work. OpenAI's Batch API and Anthropic's Message Batches both offer ~50% discounts on the processing rate in exchange for completing the request within 24 hours. For background tasks — nightly document processing, overnight fine-tuning data generation, weekly analysis — batch mode cuts your bill in half with no engineering changes beyond the API call pattern. Check the best-value ranking to see which models also have batch pricing available.

Match the model to the task complexity. Sending every request to the largest flagship model is the fastest way to overspend. For classification, routing, and simple extraction tasks, a model at $0.10/M input is typically indistinguishable from one at $5/M input. Reserve the expensive models for the tasks where intelligence per dollar genuinely makes a difference — complex reasoning, multi-step planning, code generation requiring correctness guarantees. The beginner's guide on this site covers how to set a quality floor before optimising for price.

Written by Allen Pan. Corrections or questions welcome — allen@xyzsleep.com.