Dynamic pricing for inference

Every inference operator today posts a flat list: the same price per token whether the GPU has ten buyers queued or zero. Demand swings hour to hour; the list doesn’t. So capacity sits idle in slack windows (rent the operator pays anyway) and buyers queue at peak. Dynamic pricing, the mechanism airlines, hotels, electricity grids, and ride-share built for exactly this structure [1], fixes both ends. The economics are settled: gains of roughly 2–6% in the markets cited here that tried it [2][3]. The infrastructure to actually run it for inference is what’s missing, and the conditions for building it are converging now.

The arithmetic

Inference is a perishable-capacity business. An H100 rents anywhere from $1.50 to $7 per GPU-hour depending on tier and commitment, with the SemiAnalysis 1-year committed-contract index at $2.35, up roughly 40% in five months as on-demand sold out across providers [4]. Whatever the rate, the meter runs whether you serve one request or a thousand.

So unit cost reduces to one identity: $/token = $/hour ÷ tokens/hour. The numerator is contractual. The denominator — batch fullness — is the only operational lever. Every reduction in inference unit cost over the past three years has come from raising the denominator.
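As a rough illustration of the identity, the sketch below computes $/M tokens at a few batch-fullness levels. The $2.35 hourly rate is the contract index quoted above; the peak throughput figure and the assumption that throughput scales linearly with batch fullness are hypothetical simplifications, not measurements.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_hour: float) -> float:
    """$/token = $/hour / tokens/hour, scaled to $ per million tokens."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Assumed inputs: the hourly rate is the contract index quoted in the text;
# the peak throughput is a made-up round number for illustration only.
HOURLY_RATE = 2.35                  # $/GPU-hour
PEAK_TOKENS_PER_HOUR = 10_000_000   # hypothetical tokens/hour at a full batch

if __name__ == "__main__":
    for fullness in (0.1, 0.5, 1.0):
        # Simplifying assumption: throughput scales linearly with batch fullness.
        tokens = PEAK_TOKENS_PER_HOUR * fullness
        cost = cost_per_million_tokens(HOURLY_RATE, tokens)
        print(f"batch fullness {fullness:.0%}: ${cost:.3f} per million tokens")
```

The numerator never moves in this loop; only the denominator does, which is the point of the identity.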

Why hyperscalers win on $/token

Not better silicon. SemiAnalysis’s ClusterMAX rates GPU-cloud TCO on a normalized basis and finds hyperscalers in line with the top neoclouds on hardware cost [5]. What hyperscalers win on is demand pooling: thousands of concurrent buyers per batch instead of tens. Same hardware, same model, an order of magnitude apart on $/token.

The swing is direct. SemiAnalysis benchmarks DeepSeek R1 FP4 on B200/TRT-LLM at ~$0.56 per million output tokens at 50 tokens/sec/user, rising to ~$4/M at 125 tokens/sec/user: a 7× unit-cost swing on identical hardware, purely from how much per-user latency the operator trades against batch fullness [6]. On the latest gpt-oss runs the same B200 reached $0.02/M tokens, a 5× drop in two months from software optimization alone, on top of a 15× per-token cost reduction vs Hopper [7].

The implication is that hyperscaler $/token leadership is an aggregation moat, not a hardware one. An open market that pooled demand across the long tail of operators would close most of it.

The Pareto frontier

Each operator faces an achievable set of (utilization, revenue) outcomes. The upper boundary of that set — the Pareto frontier — is set entirely by buyers’ actual willingness to pay. A single static posted price is one point on the achievable set, and almost always strictly inside the frontier. The gap is the surplus that flat pricing leaves on the floor.

[Figure: realized revenue per GPU-hour (0 to max) plotted against batch utilization (0% to 100%), showing the Pareto frontier under dynamic pricing, a single static price, and the stranded surplus in the gap between them.]

The frontier of achievable outcomes. For any operator, there is an upper bound on revenue per GPU-hour at each level of batch utilization. The Pareto frontier (solid) is the boundary set by buyers' actual willingness to pay. A single posted price (dashed) is strictly inside the frontier whenever buyers differ in willingness to pay — and they always do. The shaded gap is the surplus that flat pricing leaves on the floor.

Static pricing forces the operator to pick one point and live there. Dynamic pricing rides the frontier as demand shifts: discounting into slack windows to pull in price-elastic buyers, lifting at peak to ration scarce capacity to whoever values it most. Same hardware, same model, same buyers — different mechanism.
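A toy calculation makes the gap concrete. The sketch below compares the best single posted price with the frontier upper bound (each served buyer paying their own value). The six willingness-to-pay values and the four-slot capacity are invented for illustration; real operators do not observe these values directly.

```python
def best_static_price(willingness_to_pay: list[float], capacity: int) -> tuple[float, float]:
    """Return (price, revenue) for the revenue-maximizing single posted price.

    Only prices equal to some buyer's willingness to pay need checking:
    any other price can be raised to the next value without losing a buyer.
    """
    best = (0.0, 0.0)
    for price in sorted(set(willingness_to_pay)):
        buyers = sum(1 for w in willingness_to_pay if w >= price)
        revenue = price * min(buyers, capacity)
        if revenue > best[1]:
            best = (price, revenue)
    return best


def frontier_revenue(willingness_to_pay: list[float], capacity: int) -> float:
    """Upper bound: serve the highest-value buyers, each at their own value."""
    return sum(sorted(willingness_to_pay, reverse=True)[:capacity])


# Hypothetical willingness to pay ($ per job) for six buyers competing for four slots.
wtp = [10.0, 8.0, 5.0, 3.0, 2.0, 1.0]
price, static_revenue = best_static_price(wtp, capacity=4)
gap = frontier_revenue(wtp, capacity=4) - static_revenue
print(f"static price {price}: revenue {static_revenue}, stranded surplus {gap}")
```

With these made-up numbers the best single price is 8, which fills only two of the four slots for revenue of 16, against a frontier of 26. The static point is inside the frontier on both axes: lower utilization and lower revenue.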

Reaching the frontier requires price discrimination: charging different buyers different prices, according to what each is individually willing to pay. Airlines call it yield management; hotels call it revenue management; every other capacity-constrained market built infrastructure for it in the 1980s. Smith, Leimkuhler and Darrow’s canonical 1992 Interfaces paper on American Airlines’s DINAMO system reports $500M/year in incremental revenue from exactly this move, a ~5% lift on fixed-capacity perishable inventory [2]. Hotel RM benchmarks land near 6% [8]; Uber surge pricing yields a 2.15% total welfare gain [3]. The cross-domain pattern is consistent: dynamic pricing captures 2–6% of total surplus that static pricing leaves stranded.

Price sensitivity is the lever

Riding the frontier requires that buyers actually differ in price sensitivity, and they do, dramatically. A real-time chat user has no tolerance for deferral: waiting hours is worth nothing to them, and they care more about model quality than about $0.10/M tokens. A nightly evaluation job will trade hours of waiting for most of the price; the same buyer routinely switches between Together, Fireworks and OpenRouter to chase a cheaper provider. A long-running agent sits between.

The aggregate elasticity in the data we have looks low. OpenRouter’s regression over ~100 trillion routed tokens estimates own-price elasticity at roughly −0.05 to −0.07: a 10% price cut moves usage only ~0.5–0.7% [9]. But that measurement is from a market that has never had dynamic prices. It captures how little buyers respond to slow, structural shifts (a cheaper model launches; a provider cuts list); it tells us almost nothing about how they would respond to a clearing-price signal that moves hour to hour. The ride-share analog is direct: pre-surge-pricing, taxi data showed flat demand curves; post, surge revealed segment-level elasticity orders of magnitude higher [3]. The price sensitivity was always there. There was just no mechanism to express it.
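For readers who want to check the elasticity arithmetic, here is a minimal sketch. It assumes a constant-elasticity demand curve, q = k · p^ε, which is the functional form a log-log regression implies; the function name is ours, not OpenRouter’s.

```python
def usage_change(price_change: float, elasticity: float) -> float:
    """Fractional change in usage for a fractional change in price,
    assuming constant-elasticity demand q = k * p**elasticity."""
    if price_change <= -1.0:
        raise ValueError("price cannot fall by 100% or more")
    return (1.0 + price_change) ** elasticity - 1.0


if __name__ == "__main__":
    # The two ends of the elasticity range quoted in the text, for a 10% price cut.
    for elasticity in (-0.05, -0.07):
        change = usage_change(-0.10, elasticity)
        print(f"elasticity {elasticity}: 10% cut -> {change:+.2%} usage")
```

Both results come out well under one percent, in line with the rough 0.5–0.7% range quoted above.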

A static price treats all three buyer archetypes identically and forces the operator to choose which segment to under-serve: too high and the deferrable batch buyer leaves; too low and inelastic chat surplus is left on the table. A continuous price reveals the gradient. In slack windows the price drops until deferrable jobs enter the queue and fill the GPU; at peak the price rises until elastic buyers defer voluntarily into the next slack window. The aggregate elasticity that will show up under dynamic pricing is an empirical question the market has never been allowed to answer — but it cannot be lower than the static-price measurement, only higher.

Scarcity is the strongest case

The intuitive worry is that dynamic pricing under scarcity is just surge pricing under a kinder name. The empirical record runs the other way. Castillo’s 2025 Econometrica paper on Uber finds that relative to a uniform-price counterfactual at the same overall level, surge increases total welfare by 2.15%, with rider surplus up 3.57% and gains at every income level; driver surplus falls 0.98% and platform profit falls 0.50% [3]. Riders gain everywhere because the static-price alternative isn’t “lower prices”; it’s longer waits and random rationing. Cramton’s canonical electricity-market review puts the same point in scarcity-pricing form: arbitrarily-high shortage prices are what generate the investment signal to build more capacity, “enhanc[ing] reliability by providing stronger investment incentives.” [10]

The inference case is sharper because the structural argument doesn’t depend on knowing the exact elasticity. Whether demand turns out to be near-inelastic (a peak price increase mostly captures surplus from latency-sensitive buyers) or substantially more elastic than the static-market measurement (a price increase mostly defers price-elastic buyers into the next slack window), the conclusion runs the same way: rationing by value beats rationing by queue position whenever buyers’ values differ. And they always differ.
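The claim that rationing by value beats rationing by queue position can be checked with a few lines. The sketch below compares total served value under the two rules; the buyer values are random draws and the capacity is arbitrary, both hypothetical, and arrival order is assumed to be unrelated to value.

```python
import random


def welfare_by_queue(values: list[float], capacity: int) -> float:
    """Serve buyers in arrival order (list order) until capacity runs out."""
    return sum(values[:capacity])


def welfare_by_value(values: list[float], capacity: int) -> float:
    """Serve the highest-value buyers first, as a clearing price would."""
    return sum(sorted(values, reverse=True)[:capacity])


# Hypothetical demand: 100 buyers, 40 slots, values drawn uniformly from [0, 10).
rng = random.Random(0)
values = [rng.uniform(0, 10) for _ in range(100)]
capacity = 40

if __name__ == "__main__":
    by_queue = welfare_by_queue(values, capacity)
    by_value = welfare_by_value(values, capacity)
    print(f"queue: {by_queue:.1f}  value: {by_value:.1f}")
```

Value rationing can never do worse than queue rationing, because it picks the largest values available. The two coincide only when all buyers value the slot equally or the queue happens to be sorted by value.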

The H100 market is constrained today precisely because flat list pricing has produced classic shortage symptoms: on-demand sold out across providers, contract index up ~40% in five months [4]. Silicon Data’s spike analysis attributes this directly to capacity hoarding under flat pricing: “Hyperscalers and major cloud providers prioritized their allocation policies toward reserved capacity and long-term contracts. That squeezed on-demand inventory in secondary markets… when demand spikes, there’s no quick way to flood the market with more H100 capacity.” [11] AWS Spot, the closest existing dynamic-pricing mechanism on compute, offers 70–90% off On-Demand with under-5% interruption; it is a working proof that algorithmic dynamic pricing on perishable compute is operationally viable at hyperscale [12]. The mechanism exists. It has not been built for the H100 long tail.

The honest caveat: Hall, Horton and Knoepfle find that in ride-share, dynamic-price rents largely erode within ~2 months as competitor supply responds [13]. In equilibrium the mechanism accelerates the supply response rather than persistently extracting from buyers. That is the point of building the layer early, while the cycle is still tight.

Why this happens now

Three forces converge to make the static-pricing regime increasingly untenable.

Supply. Blackwell capacity ramps through 2026; H100 supply finds buyers further down the demand curve as new clusters come online and reserved blocks expire. Operators staring at idle margin will need a mechanism to compete on time-tolerant workloads. A uniform list-price cut destroys margin on every buyer simultaneously, not just the deferrable ones — flat pricing has no way to discount selectively into the slack windows without giving up the peak.

Demand. Agentic workflows, evaluation pipelines, training-data generation, batch document jobs, scheduled retraining. The fraction of inference demand that doesn’t need sub-second latency is growing fast, and each new use case with non-trivial deferral tolerance widens the segment-elasticity gap. Flat pricing wastes most of this demand’s willingness to wait.

Settlement and routing. Two of the three things that historically blocked dynamic compute pricing, slow bilateral payments and the lack of an aggregation layer, are solved problems. USDC rails settle in seconds. Routers like OpenRouter already aggregate cross-provider demand across ~100T routed tokens a year [9]. Open-weight models make tokens substitutable: a million tokens of Llama from operator A is fungible with a million tokens of Llama from operator B in a way that proprietary-model tokens are not. The third blocker, neutral price discovery, is the only piece still missing.

The first-mover problem explains why this hasn’t already happened. An operator who alone offers dynamic prices gets adversely selected — buyers shop the static list and take the dynamic discount. The fix is structural: an aggregator that runs the clearing function for all operators eliminates the exposure. Every participant clears at the same price, settlement is atomic, no one is picked off bilaterally.

The window for building that aggregator is now. Operators have strong incentive to participate while supply is tight — the current shortage means flat pricing is leaving the most surplus on the table it ever has. The same infrastructure carries through to the slack regime that follows.

What the infrastructure looks like

A continuous auction takes (request, deadline) from buyers and (capacity, window) from operators, and finds the price that fills each batch. Markets have run exactly this structure on fixed-capacity perishable goods for a century — airline seats, electricity, hotel rooms, freight, ride-share. The mechanism is not a research problem.
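One way to picture the clearing step is a uniform-price match within a single window. The sketch below is an assumption-laden toy, not a description of any deployed system: the `Bid` and `Offer` types, the one-unit-per-bid simplification, and the midpoint pricing rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    buyer: str
    max_price: float  # most the buyer will pay for one unit
    deadline: int     # last window index in which the request is still useful


@dataclass(frozen=True)
class Offer:
    operator: str
    window: int       # window index in which this unit of capacity exists
    min_price: float  # operator's reserve price for the unit


def clear_window(window: int, bids: list[Bid], offers: list[Offer]):
    """Match one unit per bid/offer; return (price, matches, unfilled live bids).

    Bids past their deadline are dropped. Everyone matched pays the same price:
    the midpoint between the marginal matched bid and the marginal matched offer.
    """
    live_bids = sorted((b for b in bids if b.deadline >= window),
                       key=lambda b: b.max_price, reverse=True)
    live_offers = sorted((o for o in offers if o.window == window),
                         key=lambda o: o.min_price)
    matches = []
    for bid, offer in zip(live_bids, live_offers):
        if bid.max_price < offer.min_price:
            break  # no remaining pair can trade
        matches.append((bid, offer))
    if not matches:
        return None, [], live_bids
    last_bid, last_offer = matches[-1]
    price = (last_bid.max_price + last_offer.min_price) / 2
    return price, matches, live_bids[len(matches):]


# Hypothetical order book: a peak window (0) and a slack window (1).
bids = [Bid("chat", 9.0, 0), Bid("agent", 5.0, 1), Bid("eval", 2.0, 2)]
offers = [Offer("A", 0, 1.0), Offer("B", 0, 4.0), Offer("A", 1, 1.0)]

peak_price, peak_matches, carried = clear_window(0, bids, offers)
slack_price, slack_matches, _ = clear_window(1, carried, offers)
```

In this toy book the peak window clears at 4.5 with the chat and agent requests served, and the evaluation job rolls into the slack window, where it clears at 1.5. The deferrable buyer waits and pays less; the latency-sensitive buyer pays more and is served immediately.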

The novel piece for inference is what gets settled: deliverable tokens against a verified price, across operators that today price bilaterally and don’t see each other’s demand. The economics are settled. The infrastructure is the work.

Sources

  1. Sheryl E. Kimes. Yield Management: A Tool for Capacity-Constrained Service Firms. Journal of Operations Management, 1989. The canonical definition: fixed capacity, perishable inventory, time-variable demand, segmentable willingness to pay. Inference satisfies all four conditions.

  2. Smith, Leimkuhler and Darrow. Yield Management at American Airlines. Interfaces 22(1), 1992. AA’s SABRE-based DINAMO system credited with $1.4B incremental revenue over three years, ~$500M/year ongoing — a ~5% revenue lift on fixed-capacity perishable inventory. The canonical empirical anchor for yield-management gains.

  3. Juan Camilo Castillo. Who Benefits from Surge Pricing? Econometrica 93(5), 2025. Empirical spatial-equilibrium model on Houston Uber data. Relative to a uniform-price counterfactual at the same overall level: surge pricing increases total welfare by 2.15%; rider surplus +3.57% (every income level gains); driver surplus −0.98%; platform profit −0.50%. The strongest published empirical estimate of dynamic-vs-static welfare gains on a perishable matching market.

  4. SemiAnalysis. The Great GPU Shortage: H100 1-Year Rental Price Index, April 2026. Tracks 1-year committed-contract pricing (not on-demand) via monthly survey of 100+ neocloud providers, buyers, and brokers, reported as a 25th–75th percentile band. H100 contract pricing rose from ~$1.70/GPU-hour in October 2025 to $2.35 in March 2026 — a ~40% increase as on-demand capacity sold out.

  5. SemiAnalysis. ClusterMAX 2.0: The Industry Standard GPU Cloud Rating System, 2025. The rating system normalizes GPU-cloud TCO across storage, networking, control-plane and goodput, finding hyperscaler hardware cost broadly in line with the strongest independent operators.

  6. SemiAnalysis. InferenceX v2: NVIDIA Blackwell vs AMD vs Hopper, 2025. DeepSeek R1 FP4 on B200/TRT-LLM costs ~$0.56/M output tokens at 50 tokens/sec/user vs ~$4/M at 125 tokens/sec/user — a 7× unit-cost swing on identical hardware driven by batch fullness.

  7. NVIDIA Developer Blog. NVIDIA Blackwell Leads on New SemiAnalysis InferenceMAX Benchmarks, October 2025. On gpt-oss at 100 TPS/user, cost per million tokens fell from $0.11 at InferenceMAX launch to $0.02 two months later — a 5× drop from software optimization alone, on top of a 15× per-token cost reduction vs Hopper.

  8. Aziz et al. Dynamic Pricing for Hotel Revenue Management. Real-data analysis showing ~6% average revenue uplift vs fixed pricing across hotel RM datasets — consistent with the 2–6% cross-domain pattern.

  9. OpenRouter. State of AI 2025: 100T Token LLM Usage Study, with academic write-up Token Economics in the LLM Era. Log-log regression of platform usage vs $/M tokens over ~100T routed tokens (Nov 2024–Nov 2025): own-price elasticity ≈ −0.05 to −0.07. Should be read as a lower bound on dynamic-price-revealed elasticity: the data is from a market where prices move on quarterly cadence via model launches and provider list updates, not hour-to-hour via a clearing signal. Buyers cannot respond to a price signal that has never existed.

  10. Peter Cramton. Electricity Market Design. Oxford Review of Economic Policy 33(4), 2017. Energy-only markets with administrative scarcity prices via an Operating Reserve Demand Curve: “setting higher scarcity prices enhances reliability in providing stronger investment incentives.” The canonical articulation of scarcity pricing as an investment signal.

  11. Silicon Data. The H100 Price Spike, 2026. Diagnoses the structural cause of the current H100 supply squeeze: “Hyperscalers and major cloud providers prioritized their allocation policies toward reserved capacity and long-term contracts. That squeezed on-demand inventory in secondary markets… when demand spikes, there’s no quick way to flood the market with more H100 capacity.” Flat pricing + capacity hoarding produces shortage symptoms that dynamic pricing exists to resolve.

  12. AWS. New Amazon EC2 Spot Pricing (Algorithmic Mechanism), 2017, mechanism current. AWS abolished bidding in favor of algorithmically-set prices “determined by supply and demand for Amazon EC2 spare capacity, not bid prices.” Typical savings of 70–90% off On-Demand, with under-5% interruption frequency over the prior 30 days. The closest existing analog to dynamic GPU pricing — a working proof at hyperscale.

  13. Jonathan V. Hall, John J. Horton and Daniel T. Knoepfle. Ride-Sharing Markets Re-Equilibrate. NBER Working Paper 30883, 2023. After Uber fare increases, supply (driver hours) and demand both adjust; hourly earnings revert to their pre-increase level in roughly two months. Implication: in markets with competitive supply elasticity, dynamic-price rents are transient — the mechanism accelerates supply response rather than persistently extracting surplus.