INFERENCE DEMAND DATA
GPU Inference & Token-Demand Intelligence
Real-time inference economics across hosted-model providers - token prices, throughput, and latency as a demand-side read on AI compute.
See the data
Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.
Current collected record (output price)
Representative. The study measures output price per million tokens on cheapest-available models; tiers are not pooled.
{
  "observed_at": "2026-04-02T00:00:00Z",
  "provider": "provider-a",
  "model_family": "llama",
  "model_listing": "llama-3.1-8b-instruct",
  "output_price_per_mtok_usd": 0.02,
  "input_price_per_mtok_usd": 0.01,
  "model_tier": "small_open"
}
Proposed demand-side record (not yet collected)
Representative of the proposed schema only. TTFT, throughput, and success rate are not present in the current 488-observation sample - shown here as planned, not collected.
{
  "observed_at": "2026-04-02T00:05:00Z",
  "provider": "provider-b",
  "model_family": "qwen",
  "ttft_ms": null,
  "throughput_tok_s": null,
  "success_rate": null,
  "_status": "demand-side metrics not yet collected"
}
Record shape
Every field, its type, whether it can be null, and a representative value.
| Field | Type | Constraint | Description |
|---|---|---|---|
| observed_at | timestamp · UTC | required | When the price was collected from the provider pricing surface. e.g. 2026-04-02T00:00:00Z |
| provider | string | required | One of four hosted-inference providers. e.g. provider-a |
| model_family | string | required | Open-weight lineage the listing is grouped under (Llama, Qwen, Mixtral, DeepSeek). e.g. llama |
| model_listing | string | nullable | The provider's raw model product name, before family grouping. e.g. llama-3.1-8b-instruct |
| output_price_per_mtok_usd | float · USD / 1M output tokens | required | Normalized output-token price - the quantity the study actually measures. e.g. 0.02 |
| input_price_per_mtok_usd | float · USD / 1M input tokens | nullable | Input-token price where listed. e.g. 0.01 |
| model_tier | string | nullable | Coarse tier flag; the study does not pool tiers. e.g. small_open |
| ttft_ms | float · ms | nullable | Time-to-first-token. Proposed demand-side metric, not yet collected. e.g. null |
| throughput_tok_s | float · tokens/sec | nullable | Generation throughput. Proposed demand-side metric, not yet collected. e.g. null |
| success_rate | float · 0..1 | nullable | Request success rate under load. Proposed demand-side metric, not yet collected. e.g. null |
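As a minimal sketch of how a consumer might type and check the record shape above, the table can be mirrored in a Python dataclass. The names `PriceRecord`, `parse_record`, and `REQUIRED_FIELDS` are hypothetical and not part of the delivered product; the required/nullable split and the 0..1 range for `success_rate` are taken from the table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Fields the table marks as "required"; every other field may be null.
REQUIRED_FIELDS = ("observed_at", "provider", "model_family", "output_price_per_mtok_usd")


@dataclass(frozen=True)
class PriceRecord:  # hypothetical name, for illustration only
    observed_at: datetime
    provider: str
    model_family: str
    output_price_per_mtok_usd: float
    model_listing: Optional[str] = None
    input_price_per_mtok_usd: Optional[float] = None
    model_tier: Optional[str] = None
    ttft_ms: Optional[float] = None           # proposed, not yet collected
    throughput_tok_s: Optional[float] = None  # proposed, not yet collected
    success_rate: Optional[float] = None      # proposed, not yet collected


def parse_record(raw: dict) -> PriceRecord:
    """Validate a raw record against the table and return a typed copy."""
    missing = [name for name in REQUIRED_FIELDS if raw.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    success_rate = raw.get("success_rate")
    if success_rate is not None and not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in 0..1")
    return PriceRecord(
        # Replace the trailing "Z" so fromisoformat accepts it on older Pythons.
        observed_at=datetime.fromisoformat(raw["observed_at"].replace("Z", "+00:00")),
        provider=raw["provider"],
        model_family=raw["model_family"],
        output_price_per_mtok_usd=float(raw["output_price_per_mtok_usd"]),
        model_listing=raw.get("model_listing"),
        input_price_per_mtok_usd=raw.get("input_price_per_mtok_usd"),
        model_tier=raw.get("model_tier"),
        ttft_ms=raw.get("ttft_ms"),
        throughput_tok_s=raw.get("throughput_tok_s"),
        success_rate=success_rate,
    )


record = parse_record({
    "observed_at": "2026-04-02T00:00:00Z",
    "provider": "provider-a",
    "model_family": "llama",
    "model_listing": "llama-3.1-8b-instruct",
    "output_price_per_mtok_usd": 0.02,
    "input_price_per_mtok_usd": 0.01,
    "model_tier": "small_open",
})
print(record.model_family, record.output_price_per_mtok_usd)
```

Because the table marks `output_price_per_mtok_usd` as required, this sketch would reject the proposed demand-side example record, which carries no price. Unknown keys such as `_status` are ignored.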
Token Price Index
Input and output token rates per model per provider over time - the unit economics of inference, tracked continuously.
Latency & Throughput
Time-to-first-token, generation throughput, and latency percentiles - candidate congestion signals that may move before capacity announcements.
Reliability Signal
Request success and error rates across providers - a real-time read on where inference demand is outrunning supply.
How it is built
- 01
Collection
Collect from the public pricing surfaces of four hosted-inference providers.
- 02
Normalization
Normalize to a per-million-token basis for output tokens.
- 03
Family grouping
Group heterogeneous model listings by the underlying open-weight family so comparisons stay within recognizable lineages.
- 04
Tier separation
Observed models range from small open-weight checkpoints to frontier-scale hosted deployments, which are priced very differently; tiers are not pooled into a single level.
- 05
Direction-over-level reporting
Report the direction - a steep decline at the cheap end - as robust, and treat absolute price levels as preliminary, because the cheapest quote in one period may come from a different tier than the cheapest quote in another.
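Steps 02 to 04 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the helper names, the substring-based family match, and the monthly bucketing are all hypothetical choices made for the example.

```python
from typing import Optional

# Families named in the record shape; substring matching is an assumption.
FAMILIES = ("llama", "qwen", "mixtral", "deepseek")


def to_per_mtok(price_usd: float, per_tokens: int) -> float:
    """Convert a price quoted per `per_tokens` tokens to USD per 1M tokens."""
    return price_usd * 1_000_000 / per_tokens


def family_of(listing: str) -> Optional[str]:
    """Map a raw provider listing name to an open-weight family, if any."""
    name = listing.lower()
    for family in FAMILIES:
        if family in name:
            return family
    return None


def cheapest_by_month_and_tier(records: list) -> dict:
    """Cheapest output price per (YYYY-MM, tier); tiers are never pooled."""
    cheapest = {}
    for rec in records:
        key = (rec["observed_at"][:7], rec["model_tier"])
        price = rec["output_price_per_mtok_usd"]
        if key not in cheapest or price < cheapest[key]:
            cheapest[key] = price
    return cheapest


# Made-up rows for illustration only; not observations from the dataset.
rows = [
    {"observed_at": "2026-04-02T00:00:00Z", "model_tier": "small_open",
     "output_price_per_mtok_usd": 0.02},
    {"observed_at": "2026-04-15T00:00:00Z", "model_tier": "small_open",
     "output_price_per_mtok_usd": 0.03},
]
print(family_of("llama-3.1-8b-instruct"), cheapest_by_month_and_tier(rows))
```

Keying the minimum on tier as well as period is what keeps a cheap small-model quote from being compared against a frontier-scale quote in another period.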
How we validate
What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.
Cheapest-output price decline
Measures
How the price of the cheapest available output tokens moved over time.
Method
Track the cheapest available output price per million tokens across the four providers over the window.
Result
Descriptive, published as preliminary: the cheapest available output price fell from about $0.13 per million tokens in mid-2024 into the $0.01 to $0.03 range in 2025-2026. The direction is robust; the absolute level is preliminary because tiers are mixed.
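For scale, the approximate endpoints quoted in this result bracket the implied relative decline. The short calculation below is arithmetic on those rounded, preliminary figures only; it is not an additional measurement, and `relative_decline` is a hypothetical helper.

```python
def relative_decline(start: float, end: float) -> float:
    """Fractional drop from `start` to `end`."""
    return (start - end) / start


start_price = 0.13  # approx. USD per 1M output tokens, mid-2024 (from the result above)
for end_price in (0.03, 0.01):  # bounds of the 2025-2026 range
    print(f"{start_price} -> {end_price}: {relative_decline(start_price, end_price):.0%}")
# prints:
# 0.13 -> 0.03: 77%
# 0.13 -> 0.01: 92%
```

Because the endpoints may come from different tiers, these percentages inherit the same caveat as the absolute levels.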
Demand-side leading indicator
Measures
Whether time-to-first-token, throughput, and success rate signal capacity tightening before posted prices move.
Method
The hypothesis is stated; the metrics require inference-provider API access that is not yet wired up.
Result
Methodology-stage. Explicitly not built and not collected. The demand-side fields are a roadmap, shown in the schema as not-yet-collected.
Ground truth
What correct means for this data, and how it is established.
Ground truth
For the descriptive layer, the observed posted output-token prices across the four providers on a cheapest-available basis. For the proposed demand signal, there is no ground truth yet because the data is uncollected.
How it is established
Descriptive aggregation of the cheapest-available output price over time per provider and family. No predictive grader exists; the proposed demand-side validation is future work gated on API access.
Demand-Side Compute Read
Throughput collapse and rising latency across providers may signal surging model demand days to weeks before it shows up in chip orders; this lead is a stated hypothesis, not yet tested.
Inference Margin Tracking
Track the falling price of intelligence per token and model the margin structure of hosted-inference and model-API businesses.
Provider Competitive Map
Compare price, speed, and reliability across providers for the same model - who is winning the inference market in real time.
How you load it
Delivery
S3, REST API, Parquet
Formats
Parquet, JSON, CSV
Auth
Licensed for internal research and model development. Sourced from public pricing surfaces; no PII or MNPI.
Cadence
Continuous polling with daily aggregates.
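A minimal loading sketch for the JSON and CSV formats listed above, using only the Python standard library. The delivery layout is not specified here, so the assumptions are loud: JSON is taken to arrive as one record per line, CSV nulls are taken to be empty cells, and the function names are hypothetical. Parquet is omitted because reading it needs a third-party library.

```python
import csv
import io
import json

# Numeric fields from the record shape; CSV cells arrive as strings.
FLOAT_FIELDS = (
    "output_price_per_mtok_usd", "input_price_per_mtok_usd",
    "ttft_ms", "throughput_tok_s", "success_rate",
)


def coerce(row: dict) -> dict:
    """Turn empty cells into None and numeric fields into floats."""
    out = {}
    for key, value in row.items():
        if value is None or value == "":
            out[key] = None
        elif key in FLOAT_FIELDS:
            out[key] = float(value)
        else:
            out[key] = value
    return out


def load_csv(stream) -> list:
    """Read records from a CSV text stream with a header row."""
    return [coerce(row) for row in csv.DictReader(stream)]


def load_json_lines(stream) -> list:
    """Read records from a text stream holding one JSON object per line (assumed layout)."""
    return [coerce(json.loads(line)) for line in stream if line.strip()]


# In-memory sample shaped like the collected record; replace with a real file handle.
sample = io.StringIO(
    "observed_at,provider,model_family,output_price_per_mtok_usd,ttft_ms\n"
    "2026-04-02T00:00:00Z,provider-a,llama,0.02,\n"
)
records = load_csv(sample)
print(records[0]["output_price_per_mtok_usd"], records[0]["ttft_ms"])
```

Normalizing nulls at load time means the not-yet-collected demand-side fields read as `None` in either format.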
Request access.
Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.