Why the same model costs different amounts on different platforms

The same model can cost different amounts on the official API, on OpenRouter, and on Bedrock/Vertex/Azure. This piece explains where those gaps come from — routing margin, volume and committed-use discounts, region and hosting cost, batch APIs — and why this site shows OpenRouter’s routed price (not the official price; a feature, not a bug). You will leave knowing how to find the cheapest source for your own usage.

1. One model, several price tags

You find Claude Opus 4.8 listed at $5 per million input tokens on Anthropic's own pricing page. Then you check OpenRouter and see the same model — same weights, same capability — at a slightly different figure depending on which provider you route to. Then a colleague mentions they are running it through AWS Bedrock and paying yet another rate. You are not being deceived. These numbers are genuinely different, and they are different for structural reasons that are worth understanding before you commit to any platform at scale.

The same phenomenon appears across the entire market. GPT-5.4 at $2.50 input / $15 output on OpenAI's direct API may appear at a different rate on Azure OpenAI. Gemini 3.5 Flash at $1.50 input / $9 output on Google AI Studio may not match the price on Vertex AI. DeepSeek V4 Flash at $0.09 input / $0.18 output on OpenRouter reflects one particular routing arrangement. None of these quotes are wrong. They describe different commercial products built on the same model.

This article unpacks the layers: what routing marketplaces like OpenRouter actually do, how cloud-managed offerings like AWS Bedrock and Google Vertex add their own commercial wrapper, where the specific gaps come from, what this site chooses to display and why, and finally how to find the cheapest source for your own usage once you understand the architecture.

2. How OpenRouter routing prices work

OpenRouter is a routing marketplace, not a model provider. It does not train models and does not run its own GPU clusters for frontier models. Instead, it maintains a catalogue of models from dozens of providers — Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek, and many more — and routes your API requests to the appropriate provider endpoint on your behalf.

When you call a model via OpenRouter, the request goes to the upstream provider. The price you pay is a function of the upstream provider's rate, sometimes with a small OpenRouter margin layered on top, and sometimes passed through at cost depending on the specific model and the provider arrangement. This means the "OpenRouter price" for a model is not a fixed delta above the official price — it varies by provider, by model, and by the commercial terms OpenRouter has negotiated.

OpenRouter also allows you to choose which provider to route to for models that have multiple upstream options. The same model may have different latencies, different rate limits, and slightly different prices depending on which provider's endpoint you route to. This is genuinely useful: a model running on one provider's infrastructure may be cheaper or faster than the same model on another.

The practical consequence is that OpenRouter prices are routed prices, not official first-party API prices. They reflect a routing layer — and that is a feature, not a limitation. You get a single unified API and billing account across dozens of providers, with the ability to switch providers or fall back automatically when one is rate-limited or down.

3. Official API vs cloud-managed (Bedrock / Vertex / Azure)

The clearest illustration of price divergence is the gap between a model provider's direct API and the same model running on a major cloud's managed AI service. AWS Bedrock, Google Vertex AI, and Azure OpenAI Service all offer models from external providers — but they wrap them in their own infrastructure, SLA, compliance, and billing system.

What this means in practice: the model weights are identical. Claude Opus 4.8 on AWS Bedrock runs the same weights as Claude Opus 4.8 on Anthropic's direct API. GPT-5.4 on Azure runs the same weights as GPT-5.4 on OpenAI's direct API. What differs is everything around the weights: the infrastructure it runs on, the SLA guarantees, the data residency options, the compliance certifications (HIPAA, SOC 2, FedRAMP), and the pricing structure.

Cloud-managed offerings typically cost more per token than the provider's direct API for on-demand usage. You are paying for the integration into an existing cloud vendor relationship, the compliance overhead, and the enterprise SLA. For organisations that already run their workloads on AWS, Azure, or GCP, this can still be the right choice — simplified procurement, unified billing, and compliance coverage often justify the premium. But for cost-sensitive production workloads that do not need those features, the direct API is usually cheaper.

The cloud providers also offer volume and committed-use pricing that can close or reverse the gap with direct API pricing — but those arrangements require significant upfront commitment and are rarely the right choice for early-stage workloads.

4. Where the gaps come from

Once you understand the structural layers, the specific sources of price variation become clear. The main drivers are not arbitrary — they follow the economics of each part of the stack.

Factor	Effect on price	Notes
Routing margin	Small markup or pass-through	OpenRouter's commercial layer; varies by model and provider arrangement
Cloud managed overhead	Often higher than direct API	Bedrock / Vertex / Azure add SLA, compliance, and infra cost
Volume & committed-use discounts	Can significantly reduce effective rate	Available at providers and cloud vendors; requires commitment
Batch API	~50% off standard rate (typical)	Available from Anthropic, OpenAI, and others; requires async workflow
Prompt caching	10–50% of standard input rate	Varies by provider; can dominate total cost for agent workloads
Region & egress	Usually small; can matter at scale	Cross-region data transfer adds cost; inference in some regions costs more

Batch APIs are the most underused lever. When your workload does not require real-time responses — classification, document extraction, content moderation, offline summarisation — batch endpoints typically cut the bill by around 50%. Anthropic's batch API, OpenAI's batch endpoint, and their equivalents at other providers all operate on the same principle: you submit a file of requests, the provider processes them within a window (usually 24 hours), and you pay roughly half the on-demand rate. For high-volume offline workloads, this is often the single largest cost reduction available.

Prompt caching is the other major variable. When the same prefix appears repeatedly — a long system prompt, a retrieval context, a document to be analysed — cached input tokens are charged at a fraction of the standard input rate. The fraction varies: Anthropic caches at 10% of the standard input rate for Claude models; other providers are in the 20–50% range. For agent workloads that re-send a large system prompt on every turn, cache hit rates above 60–70% are achievable, and the difference between a 10% and a 50% cached rate compounds across millions of tokens.

Volume and committed-use discounts are the least visible but potentially the largest for enterprise customers. Both direct API providers and cloud vendors offer negotiated rates for customers who commit to minimum monthly spend or token volumes. These discounts are not publicly listed, making cross-platform comparison impossible without direct contact with a sales team. This is worth knowing: the public price is not always the price large customers actually pay.

5. What this site shows, and why

This site sources its pricing data from OpenRouter. Every price you see here — input, output, context window, and provider — is the OpenRouter routed price for that model, updated daily via OpenRouter's API. We are transparent about this because it matters.

OpenRouter prices are not the same as the model provider's official first-party API price. They may be higher (when OpenRouter applies a margin), lower (when OpenRouter has a favourable provider arrangement), or approximately equal. We do not have access to the exact pricing terms between OpenRouter and each upstream provider, so we cannot decompose the delta for you.

What we can offer is comparability and freshness. OpenRouter's unified API makes it possible to price hundreds of models on a consistent basis — same billing unit, same token definition, same daily update cycle. If you are trying to compare the relative cost of running Claude Opus 4.8 versus DeepSeek V4 Flash versus Gemini 3.5 Flash, the OpenRouter prices give you an accurate picture of the relative magnitudes. Claude's flagship at $5 input / $25 output is roughly 55× the input price of DeepSeek V4 Flash at $0.09 input — that relative difference is real and stable across sources, even if the absolute numbers shift slightly between platforms.

If you need to verify the official first-party price for procurement or financial planning, always cross-check with the provider's own pricing page. We link to providers from each model's detail page. See the about page for full disclosure on our data sources and methodology.

We chose OpenRouter as our source because it is: machine-readable (a stable JSON API, not scraping a webpage), comprehensive (hundreds of models across dozens of providers), and daily-updated (price changes appear within 24 hours). These properties make it the most reliable basis for a comparison site that aims to be genuinely useful rather than a one-time snapshot.

6. Finding the cheapest source for your usage

With the architecture in mind, here is a practical decision process for finding the cheapest source for a given workload. The right answer depends on your volume, your latency requirements, your compliance needs, and your prompt structure.

Step one: establish your input/output ratio. Use the cost calculator to model your expected monthly token volumes at different input/output ratios. The output price is typically 3–10× the input price — a workload that generates long outputs looks very different from one that generates short answers from long context. This single number changes which model and which platform wins the comparison.

Step two: check batch eligibility. If any significant fraction of your workload is offline — does not need a response in under a second — price-check the batch API rate, not the on-demand rate. For Anthropic and OpenAI models, batch pricing is typically around 50% of the standard rate. This often means a "more expensive" model in batch mode is cheaper than a "cheaper" model in on-demand mode.

Step three: model your cache structure. If you have a stable system prompt or a document that repeats across many requests, estimate your cache hit rate. The savings compound: at a 10% cached input rate (Anthropic's rate for Claude) and a 70% cache hit rate, your effective input cost is 0.1 × 0.7 + 1.0 × 0.3 = 0.37 of the standard rate. That is a 63% reduction on the input side, which can flip the cost ranking between models.

Step four: check committed-use options. If your monthly spend on a provider is above roughly $5,000–$10,000 per month, it is worth contacting the provider directly about volume discounts. At higher volumes, the difference between public prices and negotiated enterprise rates can be substantial — often 20–40% on top of the published price, occasionally more for very large customers.

Step five: factor in compliance and infrastructure requirements. If your workload requires data residency, SOC 2 compliance, HIPAA compliance, or integration with an existing cloud vendor's logging and access control, the cloud-managed path (Bedrock, Vertex, Azure) may be necessary regardless of price. These requirements have non-zero cost, but the alternative — building your own compliance layer on top of a direct API — is often more expensive than the price premium.

The best-value ranking on this site sorts models by quality per dollar using OpenRouter prices as the baseline. It is a reasonable starting point. But the cheapest model in absolute terms is rarely the cheapest model for a specific workload once you factor in batch eligibility, cache structure, and compliance requirements. The ranking tells you where to look; your own production logs tell you where to land.

Written by Allen Pan. Corrections or questions welcome — allen@xyzsleep.com.