INFERENCE DEMAND DATA

GPU Inference & Token-Demand Intelligence

Name: GPU Inference & Token-Demand Intelligence
Creator: Gerra

Real-time inference economics across hosted-model providers - token prices, throughput, and latency as a demand-side read on AI compute.

1 YEARParquet · JSON · CSVContinuous polling, daily aggregates

488

Token-price observations

Providers

~10x

Output price decline

Roadmap

Demand-side metrics

Sample Schema Included Methodology Evals Graders Application Integration Research

01Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Current collected record (output price)

Representative. The study measures output price per million tokens on cheapest-available models; tiers are not pooled.

inference_prices.jsonlrepresentative

{
  "observed_at": "2026-04-02T00:00:00Z",
  "provider": "provider-a",
  "model_family": "llama",
  "model_listing": "llama-3.1-8b-instruct",
  "output_price_per_mtok_usd": 0.02,
  "input_price_per_mtok_usd": 0.01,
  "model_tier": "small_open"
}

Proposed demand-side record (not yet collected)

Representative of the proposed schema only. TTFT, throughput, and success rate are not present in the current 488-observation sample - shown here as planned, not collected.

inference_demand.jsonlrepresentative

{
  "observed_at": "2026-04-02T00:05:00Z",
  "provider": "provider-b",
  "model_family": "qwen",
  "ttft_ms": null,
  "throughput_tok_s": null,
  "success_rate": null,
  "_status": "demand-side metrics not yet collected"
}

02Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

Field	Type	Constraint	Description
observed_at	timestamp · UTC	required	When the price was collected from the provider pricing surface. e.g. 2026-04-02T00:00:00Z
provider	string	required	One of four hosted-inference providers. e.g. provider-a
model_family	string	required	Open-weight lineage the listing is grouped under (Llama, Qwen, Mixtral, DeepSeek). e.g. llama
model_listing	string	nullable	Provider raw model product name before family grouping. e.g. llama-3.1-8b-instruct
output_price_per_mtok_usd	float · USD / 1M output tokens	required	Normalized output-token price - the quantity the study actually measures. e.g. 0.02
input_price_per_mtok_usd	float · USD / 1M input tokens	nullable	Input-token price where listed. e.g. 0.01
model_tier	string	nullable	Coarse tier flag; the study does not pool tiers. e.g. small_open
ttft_ms	float · ms	nullable	Time-to-first-token. Proposed demand-side metric, not yet collected. e.g. null
throughput_tok_s	float · tokens/sec	nullable	Generation throughput. Proposed demand-side metric, not yet collected. e.g. null
success_rate	float · 0..1	nullable	Request success rate under load. Proposed demand-side metric, not yet collected. e.g. null

03What's included

Token Price Index

Input and output token rates per model per provider over time - the unit economics of inference, tracked continuously.

Latency & Throughput

Time-to-first-token, generation throughput, and latency percentiles - congestion signals that move before capacity announcements.

Reliability Signal

Request success and error rates across providers - a real-time read on where inference demand is outrunning supply.

04Methodology

How it is built

01
Collection
Collect from the public pricing surfaces of four hosted-inference providers.
02
Normalization
Normalize to a per-million-token basis for output tokens.
03
Family grouping
Group heterogeneous model listings by the underlying open-weight family so comparisons stay within recognizable lineages.
04
Tier separation
Observed models span small open-weight checkpoints to frontier-scale hosted deployments priced very differently; tiers are not pooled into a single level.
05
Direction-over-level reporting
Report the direction - a steep decline at the cheap end - as robust, and treat absolute price levels as preliminary because the cheapest quote in any period may reflect a different tier than in another.

05Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

Cheapest-output price decline

Measures

How the price of the cheapest available output tokens moved over time.

Method

Track the cheapest available output price per million tokens across the four providers over the window.

Result

Descriptive, published as preliminary: fell from about $0.13 per million tokens in mid-2024 into the $0.01 to $0.03 range in 2025-2026. Direction robust; absolute level preliminary due to mixed tiers.

Demand-side leading indicator

Measures

Whether time-to-first-token, throughput, and success rate lead capacity tightening ahead of posted prices.

Method

The hypothesis is stated; the metrics require inference-provider API access that is not yet wired up.

Result

Methodology-stage. Explicitly not built and not collected. The demand-side fields are a roadmap, shown in the schema as not-yet-collected.

06Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

For the descriptive layer, the observed posted output-token prices across the four providers on a cheapest-available basis. For the proposed demand signal, there is no ground truth yet because the data is uncollected.

How it is established

Descriptive aggregation of the cheapest-available output price over time per provider and family. No predictive grader exists; the proposed demand-side validation is future work gated on API access.

07Application

Demand-Side Compute Read

Throughput collapse and rising latency across providers signal surging model demand days to weeks before it shows up in chip orders.

Inference Margin Tracking

Track the falling price of intelligence per token and model the margin structure of hosted-inference and model-API businesses.

Provider Competitive Map

Compare price, speed, and reliability across providers for the same model - who is winning the inference market in real time.

08Environment & integration

How you load it

Delivery

S3, REST API, Parquet

Formats

Parquet, JSON, CSV

Auth

Licensed for internal research and model development. Sourced from public pricing surfaces; no PII or MNPI.

Cadence

Continuous polling with daily aggregates.

09Related research

The Falling Price of InferenceRead →

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.

Talk to us team@gerra.com

GPU Inference & Token-Demand Intelligence

See the data

Record shape

Token Price Index

Latency & Throughput

Reliability Signal

How it is built

Collection

Normalization

Family grouping

Tier separation

Direction-over-level reporting

How we validate

Cheapest-output price decline

Demand-side leading indicator

Ground truth

Demand-Side Compute Read

Inference Margin Tracking

Provider Competitive Map

How you load it

Request access.

Product

Company

Connect