Abstract
We assemble a preliminary cross-provider snapshot of hosted-inference token pricing to characterize how the cost of running language-model inference has moved. The sample covers 488 token-price observations across four hosted providers - Together, Fireworks, DeepInfra, and Replicate - spanning the Llama, Qwen, Mixtral, and DeepSeek model families.
The headline observation is a steep decline in the price of the cheapest available output tokens, from roughly $0.13 per million tokens in mid-2024 into the $0.01-$0.03 range in 2025-2026. We are deliberate about the data's limits. The sample mixes very different model tiers, so absolute price levels are noisy and we do not yet treat them as established. We report the direction of travel as robust and the precise per-tier curves as unfinished. This is a preprint, not a validated trading signal.
Data & Coverage
Prices are collected from the public pricing surfaces of four hosted-inference providers and normalized to a per-million-token basis for output tokens. Heterogeneous model listings are grouped by the underlying open-weight family so that comparisons are made within recognizable lineages rather than across arbitrary product names. The current sample:
| Dimension | Value |
|---|---|
| Token-price observations | 488 |
| Providers | 4 (Together, Fireworks, DeepInfra, Replicate) |
| Model families | Llama, Qwen, Mixtral, DeepSeek |
| Price basis | USD per million output tokens |
| Status | Preliminary cross-provider snapshot |
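The normalization and family grouping described above can be sketched as follows. This is a minimal illustration, not the collection pipeline itself: the `Observation` record, the `FAMILY_KEYWORDS` table, the substring-matching rule, and the model name in the usage line are all assumptions introduced here.

```python
from dataclasses import dataclass

# Assumed keyword table: maps a lowercase substring of a listing name to a family.
# Naive substring matching is an assumption; listings whose names mention two
# families would need a more careful rule.
FAMILY_KEYWORDS = {
    "llama": "Llama",
    "qwen": "Qwen",
    "mixtral": "Mixtral",
    "deepseek": "DeepSeek",
}


@dataclass(frozen=True)
class Observation:
    provider: str
    model_name: str
    family: str
    usd_per_million_output: float


def to_per_million(price_usd: float, per_tokens: int) -> float:
    """Convert a price quoted per `per_tokens` tokens to USD per million tokens."""
    if per_tokens <= 0:
        raise ValueError("per_tokens must be positive")
    return price_usd * 1_000_000 / per_tokens


def family_of(model_name: str) -> str:
    """Return the first family whose keyword appears in the listing name."""
    lowered = model_name.lower()
    for keyword, family in FAMILY_KEYWORDS.items():
        if keyword in lowered:
            return family
    return "Other"


def normalize(provider: str, model_name: str,
              price_usd: float, per_tokens: int) -> Observation:
    """Build one normalized observation from a raw provider listing."""
    return Observation(provider, model_name, family_of(model_name),
                       to_per_million(price_usd, per_tokens))


# Hypothetical listing quoted per 1,000 output tokens (not a row from the sample).
example = normalize("ExampleProvider", "example/qwen-hypothetical-7b", 0.0002, 1000)
```

Keeping the raw listing name alongside the derived family makes it possible to audit the grouping later, which matters when product names differ across providers.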
Coverage is uneven across model tiers. The observed models range from small open-weight checkpoints to frontier-scale hosted deployments, and the providers price these very differently. We treat this asymmetry explicitly throughout and do not pool tiers into a single price level.
The Decline in Token Prices
Tracking the cheapest available output price over time shows a sharp fall at the low end of the market. Among the cheapest available models in the sample, the output price per million tokens dropped from roughly $0.13 in mid-2024 into the $0.01-$0.03 range across 2025-2026.
Preliminary: the cheapest available output price per million tokens, from mid-2024 into 2025-2026, across Together, Fireworks, DeepInfra, and Replicate.
| Period | Cheapest output ($/M tok) | Note |
|---|---|---|
| Mid-2024 | ~$0.13 | Cheapest available output token price |
| 2025-2026 | $0.01 - $0.03 | Range across cheapest hosted models |
We read this as a directional result rather than a precise curve. Taken at face value, the endpoints imply that the cheapest tier of hosted inference has deflated by roughly 4x to 13x over the window (a 77% to 92% drop from $0.13), approaching an order of magnitude at the low end of the range. This is consistent with growing competition among providers, more efficient open-weight models, and falling underlying compute cost. The direction of the decline at the cheap end is the robust part of this finding; its exact size, like the exact level at any point in time, is subject to the caveat below.
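The size of the decline implied by the quoted endpoints can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above ($0.13 at the start, $0.01 and $0.03 as the bounds of the later range); the function names are illustrative.

```python
def fold_decline(start: float, end: float) -> float:
    """How many times cheaper `end` is than `start`."""
    return start / end


def pct_decline(start: float, end: float) -> float:
    """Percentage drop from `start` to `end`."""
    return (start - end) / start * 100


start = 0.13  # approximate mid-2024 cheapest output price, USD per million tokens
for end in (0.03, 0.01):  # bounds of the 2025-2026 range
    print(f"{start} -> {end}: {fold_decline(start, end):.1f}x cheaper, "
          f"{pct_decline(start, end):.0f}% lower")
# 0.13 -> 0.03: 4.3x cheaper, 77% lower
# 0.13 -> 0.01: 13.0x cheaper, 92% lower
```

Because the starting figure is itself approximate and the tiers are mixed, these ratios are best read as a bracket on the decline rather than a measured rate.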
The Mixed-Tier Caveat
The sample mixes very different model tiers. A small open-weight model and a frontier-scale hosted deployment can both appear as valid token-price observations, yet they price intelligence at radically different points. As a result, absolute price levels in the panel are noisy: the cheapest quote in any period may reflect a different model tier than the cheapest quote in another period.
We are explicit that this is preliminary. What survives the noise is the direction - a steep decline in the price of the cheapest intelligence available across these providers. What does not yet survive is a clean, per-tier price curve. We have not yet established separate, like-for-like trajectories for small open models versus frontier-scale hosted models, and we do not claim to. Until per-tier curves are pinned down, the levels here should be read as indicative, not definitive.
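A toy example makes the composition effect concrete. The rows below are invented purely for illustration and are not observations from the sample; the tier labels (`tiny`, `small`, `large`) are likewise a hypothetical scheme, since per-tier assignment has not been established here.

```python
# Toy rows: (period, tier, usd_per_million_output). Invented values, NOT sample data.
rows = [
    ("P1", "small", 0.20),
    ("P1", "large", 0.90),
    ("P2", "tiny", 0.05),   # a new, smaller tier appears only in the later period
    ("P2", "small", 0.18),
    ("P2", "large", 0.80),
]


def cheapest_by_period(rows):
    """Pooled minimum: the cheapest quote per period, with the tier it came from."""
    best = {}
    for period, tier, price in rows:
        if period not in best or price < best[period][1]:
            best[period] = (tier, price)
    return best


def cheapest_by_period_and_tier(rows):
    """Like-for-like minimum: the cheapest quote per (period, tier) pair."""
    best = {}
    for period, tier, price in rows:
        key = (period, tier)
        if key not in best or price < best[key]:
            best[key] = price
    return best


pooled = cheapest_by_period(rows)
per_tier = cheapest_by_period_and_tier(rows)

# Pooled minimum falls from 0.20 to 0.05, but it switches tier (small -> tiny).
# Within the "small" tier, the price only moves from 0.20 to 0.18.
print(pooled)
print(per_tier[("P1", "small")], per_tier[("P2", "small")])
```

In this toy case most of the pooled decline comes from a cheaper tier entering the panel rather than from any one tier getting cheaper, which is exactly the ambiguity that per-tier curves would resolve.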
Toward a Demand-Side Signal (Future Work)
Pricing is a supply-side view. A more useful indicator would capture demand pressure: when inference capacity tightens, latency and reliability degrade before list prices move. We outline a demand-side signal we intend to build, and we are explicit that it is not yet collected and therefore unmeasured here.
| Proposed metric | What it reads | Status |
|---|---|---|
| Time-to-first-token (TTFT) | Queueing / scheduling latency under load | Not yet collected |
| Generation throughput (tokens/sec) | Per-request capacity available to a user | Not yet collected |
| Request success rate | Saturation / rejection under demand spikes | Not yet collected |
The hypothesis is that TTFT, throughput, and success rate measured across providers could act as a leading indicator of capacity tightening - moving ahead of, and independently of, posted token prices. Building it requires inference-provider API access we have not yet wired up. We flag this as a research direction, not a result: none of these metrics are present in the current 488-observation sample.
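As a sketch of what collection might look like, the three proposed metrics can be derived from timestamps around a streamed response. Everything here is an assumption for illustration: `stream_fn` stands in for whatever provider client is eventually used, no real provider API is called, and the throughput definition (tokens after the first, divided by time since the first token) is one reasonable choice among several.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class RequestMetrics:
    success: bool
    ttft_s: Optional[float]        # seconds from request start to first token
    tokens_per_s: Optional[float]  # decode rate after the first token


def measure(stream_fn: Callable[[], Iterable[str]],
            clock: Callable[[], float] = time.perf_counter) -> RequestMetrics:
    """Time one streamed request; any exception or empty stream counts as a failure."""
    start = clock()
    first = None
    count = 0
    try:
        for _token in stream_fn():
            if first is None:
                first = clock()
            count += 1
        end = clock()
    except Exception:
        return RequestMetrics(False, None, None)
    if first is None:
        return RequestMetrics(False, None, None)
    decode_time = end - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return RequestMetrics(True, first - start, rate)


def success_rate(results: List[RequestMetrics]) -> Optional[float]:
    """Fraction of requests that completed; None when there are no results."""
    if not results:
        return None
    return sum(r.success for r in results) / len(results)


def fake_clock(times):
    """Deterministic clock for demonstration: returns the given times in order."""
    it = iter(times)
    return lambda: next(it)


def failing_stream():
    raise RuntimeError("simulated rejection")


# Simulated requests only; no network calls are made.
ok = measure(lambda: iter(["a", "b", "c"]), clock=fake_clock([0.0, 0.5, 0.7]))
failed = measure(failing_stream, clock=fake_clock([0.0]))
rate = success_rate([ok, failed])
```

Injecting the clock keeps the measurement logic testable without live traffic; a real collector would also need to fix prompt length, output length, and sampling schedule so that readings are comparable across providers.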
Limitations
Several limits bound these results. The sample is small (488 observations) and spans only four providers, so it is a snapshot rather than a comprehensive panel. The model tiers are mixed, which makes absolute price levels noisy and precludes firm per-tier curves. The price basis is output tokens on the cheapest available models, which is a deliberately narrow slice of inference economics and not a usage-weighted view. Finally, the demand-side signal described above is unbuilt and uncollected. Each of these gaps is addressable with deeper collection, but together they are why we label this work preliminary.
Conclusion
The cheapest hosted inference has become much cheaper, falling roughly 4x to 13x at the low end between mid-2024 and 2025-2026 across Together, Fireworks, DeepInfra, and Replicate. We report this direction as the robust core of a preliminary study, while holding back on precise price levels and per-tier curves until the sample is deeper and tiers are cleanly separated. The natural next step - a demand-side signal from TTFT, throughput, and success rate - remains future work, gated on inference-provider API access we have not yet established. We present this as a preprint and hold it to that standard.