RESEARCH PAPER

The Falling Price of Inference: Token Economics Across Hosted Providers

A preliminary cross-provider view of inference token pricing

Status: Preprint
Date: May 2026
Authors: gerra
Sample: 488 obs, 4 providers
01

Abstract

We assemble a preliminary cross-provider snapshot of hosted-inference token pricing to characterize how the cost of running language-model inference has moved. The sample covers 488 token-price observations across four hosted providers - Together, Fireworks, DeepInfra, and Replicate - spanning the Llama, Qwen, Mixtral, and DeepSeek model families.

The headline observation is a steep decline in the price of the cheapest available output tokens, from roughly $0.13 per million tokens in mid-2024 into the $0.01-$0.03 range in 2025-2026. We are deliberate about the data's limits. The sample mixes very different model tiers, so absolute price levels are noisy and we do not yet treat them as established. We report the direction of travel as robust and the precise per-tier curves as unfinished. This is a preprint, not a validated trading signal.

02

Data & Coverage

Prices are collected from the public pricing surfaces of four hosted-inference providers and normalized to a per-million-token basis for output tokens. Heterogeneous model listings are grouped by the underlying open-weight family so that comparisons are made within recognizable lineages rather than across arbitrary product names. The current sample:

Dimension                 Value
Token-price observations  488
Providers                 4 (Together, Fireworks, DeepInfra, Replicate)
Model families            Llama, Qwen, Mixtral, DeepSeek
Price basis               USD per million output tokens
Status                    Preliminary cross-provider snapshot

Coverage is uneven across model tiers. The observed models range from small open-weight checkpoints to frontier-scale hosted deployments, and the providers price these very differently. We treat this asymmetry explicitly throughout and do not pool tiers into a single price level.
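As a rough illustration of the normalization and family grouping described in this section, the sketch below rescales a quoted price to USD per million output tokens and assigns a listing to a family by substring match. The unit labels, the keyword-matching rule, and every name in the code are assumptions made for the example; the collection pipeline itself is not described at this level of detail.

```python
from dataclasses import dataclass

# Hypothetical matching rule: the paper names the four families but not
# how listings are mapped to them. Substring matching is an assumption.
FAMILY_KEYWORDS = {
    "llama": "Llama",
    "qwen": "Qwen",
    "mixtral": "Mixtral",
    "deepseek": "DeepSeek",
}

# Hypothetical unit labels: number of tokens covered by one quoted price.
UNIT_TOKENS = {"per_token": 1, "per_1k": 1_000, "per_1m": 1_000_000}


@dataclass(frozen=True)
class Observation:
    provider: str
    model: str
    family: str
    usd_per_m_output: float


def to_usd_per_million(price_usd: float, unit: str) -> float:
    """Rescale a quoted price to USD per million tokens."""
    return price_usd * 1_000_000 / UNIT_TOKENS[unit]


def family_of(model_name: str) -> str:
    """Map a listing name to a family; unmatched names fall into 'Other'."""
    lowered = model_name.lower()
    for keyword, family in FAMILY_KEYWORDS.items():
        if keyword in lowered:
            return family
    return "Other"


def normalize(provider: str, model: str, price_usd: float, unit: str) -> Observation:
    """Build one normalized observation from a raw price quote."""
    return Observation(
        provider=provider,
        model=model,
        family=family_of(model),
        usd_per_m_output=to_usd_per_million(price_usd, unit),
    )
```

A real pipeline would likely need an explicit per-listing mapping rather than substring matching, since product names do not always contain the family name.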

03

The Decline in Token Prices

The cheapest available output price in the sample has fallen sharply over time. Among the lowest-priced models, output price per million tokens dropped from roughly $0.13 in mid-2024 into the $0.01-$0.03 range across 2025-2026.

Cheapest output token price: ~$0.13 to $0.01-$0.03 per M tokens

Preliminary: the cheapest available output price per million tokens, from mid-2024 into 2025-2026, across Together, Fireworks, DeepInfra, and Replicate.

Period     Cheapest output ($/M tok)  Note
Mid-2024   ~$0.13                     Cheapest available output token price
2025-2026  $0.01 - $0.03              Range across cheapest hosted models

We read this as a directional result rather than a precise curve. Taken at face value, the move from roughly $0.13 to the $0.01-$0.03 range is a decline of about 4x to 13x, so the cheapest tier of hosted inference has deflated by up to roughly an order of magnitude over the window. This is consistent with growing competition among providers, more efficient open-weight models, and falling underlying compute cost. The magnitude of the decline at the cheap end is the robust part of this finding; the exact level at any point in time is subject to the caveat below.
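The size of the decline can be checked directly from the reported endpoints. The short sketch below uses only the figures in the table above ($0.13, and the $0.01-$0.03 range); the function names are illustrative.

```python
import math

START = 0.13                      # mid-2024 cheapest output price, USD per M tokens
END_LOW, END_HIGH = 0.01, 0.03    # 2025-2026 range, USD per M tokens


def fold_decline(start: float, end: float) -> float:
    """How many times cheaper the end price is than the start price."""
    return start / end


def orders_of_magnitude(start: float, end: float) -> float:
    """The same decline expressed in base-10 orders of magnitude."""
    return math.log10(start / end)


for label, end in (("high end of range", END_HIGH), ("low end of range", END_LOW)):
    print(
        f"{label}: {fold_decline(START, end):.1f}x, "
        f"{orders_of_magnitude(START, end):.2f} orders of magnitude"
    )
```

This gives about 4.3x (0.64 orders of magnitude) against the $0.03 end of the range and 13.0x (1.11 orders of magnitude) against the $0.01 end, so "an order of magnitude" describes the lower bound of the 2025-2026 range rather than the whole of it.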

04

The Mixed-Tier Caveat

The sample mixes very different model tiers. A small open-weight model and a frontier-scale hosted deployment can both appear as valid token-price observations, yet they price intelligence at radically different points. As a result, absolute price levels in the panel are noisy: the cheapest quote in any period may reflect a different model tier than the cheapest quote in another period.

We are explicit that this is preliminary. What survives the noise is the direction - a steep decline in the price of the cheapest intelligence available across these providers. What does not yet survive is a clean, per-tier price curve. We have not yet established separate, like-for-like trajectories for small open models versus frontier-scale hosted models, and we do not claim to. Until per-tier curves are pinned down, the levels here should be read as indicative, not definitive.
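The composition effect behind this caveat can be shown with a toy example. The rows below are invented purely for illustration and are not observations from the sample; the tier labels and function names are likewise assumptions. The sketch compares the pooled cheapest price per period with the cheapest price within each tier.

```python
from collections import defaultdict

# (period, tier, usd_per_m_output). All values are made up for illustration.
ROWS = [
    ("P1", "large", 0.90),
    ("P1", "large", 0.80),
    ("P2", "large", 0.75),
    ("P2", "small", 0.10),
]


def pooled_min(rows):
    """Cheapest price per period, ignoring which tier it came from."""
    best = {}
    for period, _tier, price in rows:
        best[period] = min(price, best.get(period, float("inf")))
    return best


def per_tier_min(rows):
    """Cheapest price per period, computed separately within each tier."""
    best = defaultdict(dict)
    for period, tier, price in rows:
        current = best[tier].get(period, float("inf"))
        best[tier][period] = min(price, current)
    return {tier: dict(periods) for tier, periods in best.items()}


print(pooled_min(ROWS))
print(per_tier_min(ROWS))
```

In this toy panel the pooled minimum falls from 0.80 to 0.10 only because a small model enters in the second period, while the like-for-like price within the large tier moves from 0.80 to 0.75. Separating tiers in this way is what a per-tier curve would require.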

05

Toward a Demand-Side Signal (Future Work)

Pricing is a supply-side view. A more useful indicator would capture demand pressure: when inference capacity tightens, latency and reliability degrade before list prices move. We outline a demand-side signal we intend to build, and we are explicit that it is not yet collected and therefore unmeasured here.

Proposed metric                     What it reads                               Status
Time-to-first-token (TTFT)          Queueing / scheduling latency under load    Not yet collected
Generation throughput (tokens/sec)  Per-request capacity available to a user    Not yet collected
Request success rate                Saturation / rejection under demand spikes  Not yet collected

The hypothesis is that TTFT, throughput, and success rate measured across providers could act as a leading indicator of capacity tightening - moving ahead of, and independently of, posted token prices. Building it requires inference-provider API access we have not yet wired up. We flag this as a research direction, not a result: none of these metrics are present in the current 488-observation sample.
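To make the proposed metrics concrete, the sketch below shows one way they could be computed from client-side request timings once such data exists. The record layout, field names, and choice of the median as the summary statistic are all assumptions for illustration; no such records are collected in the current sample, and no provider API is called here.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical client-side timing record for one inference request."""
    sent_at: float                    # seconds, client clock
    first_token_at: Optional[float]   # None if no token ever arrived
    finished_at: Optional[float]      # None if the request did not complete
    output_tokens: int
    ok: bool                          # True if the request completed successfully


def ttft_seconds(r: RequestRecord) -> float:
    """Time-to-first-token: delay between sending and the first token."""
    return r.first_token_at - r.sent_at


def tokens_per_second(r: RequestRecord) -> float:
    """Generation throughput, measured from first token to completion."""
    return r.output_tokens / (r.finished_at - r.first_token_at)


def summarize(records):
    """Success rate over all requests; medians over successful ones only."""
    succeeded = [
        r for r in records
        if r.ok
        and r.first_token_at is not None
        and r.finished_at is not None
        and r.finished_at > r.first_token_at  # avoid division by zero
    ]
    return {
        "success_rate": len(succeeded) / len(records) if records else None,
        "median_ttft_s": median(ttft_seconds(r) for r in succeeded) if succeeded else None,
        "median_tokens_per_s": (
            median(tokens_per_second(r) for r in succeeded) if succeeded else None
        ),
    }
```

Tracking these summaries per provider over time, alongside posted prices, is what would allow the leading-indicator hypothesis to be tested.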

06

Limitations

Several limits bound these results. The sample is small (488 observations) and spans only four providers, so it is a snapshot rather than a comprehensive panel. The model tiers are mixed, which makes absolute price levels noisy and precludes firm per-tier curves. The price basis is output tokens on cheapest-available models, which is a deliberately narrow slice of inference economics and not a usage-weighted view of real-world spending. Finally, the demand-side signal described above has not been built or collected. Each of these is a coverage gap addressable with deeper collection, but together they are why we label this work preliminary.

07

Conclusion

The cheapest hosted inference has become dramatically cheaper, falling by roughly 4x to 13x (up to about an order of magnitude) at the low end between mid-2024 and 2025-2026 across Together, Fireworks, DeepInfra, and Replicate. We report this direction as the robust core of a preliminary study, while holding back on precise price levels and per-tier curves until the sample is deeper and tiers are cleanly separated. The natural next step - a demand-side signal from TTFT, throughput, and success rate - remains future work, gated on inference-provider API access we have not yet established. We present this as a preprint and hold it to that standard.