
AI Inference Cost Economics in 2026: GPU FinOps Playbook

Written by Mitrasish, Co-founder · Apr 4, 2026
Tags: AI Inference, GPU FinOps, LLM Cost Optimization, Inference vs Training, Cost Per Token, GPU Cloud, Spot Instances, H100, B200

Training was the cost center in 2021-2023. Inference is the cost center now. Industry analysts estimate that 55-80% of enterprise AI GPU spend goes to inference. Estimates vary, but the trend is clear: once a model ships to production, serving costs accumulate every hour, every day, indefinitely. This playbook covers the four layers of optimization (model, runtime, infrastructure, FinOps), the math behind cost-per-token, and a real-world case study that took a 70B model deployment from $39K to $16K per month.

The Great Inversion: How Inference Ate the Training Budget

Training is a fixed compute job. You run it for days or weeks, it finishes, and the cost stops. Inference is the opposite: it starts when you ship and never stops as long as users are hitting your API.

The math compounds fast. Take a 70B model serving 1,000 daily active users averaging 1,000 requests per user per day at 500 tokens per request:

  • Daily token volume: 500M tokens/day (1,000 users × 1,000 requests × 500 tokens)
  • At $1.90/million tokens (on-demand H100 SXM5, 8x node): $950/day
  • Annualized: ~$347,000/year in compute alone, before egress and infrastructure overhead
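
The bullet math above can be checked in a few lines (all figures come from this scenario, not a benchmark):

```python
# Back-of-envelope serving cost for this scenario (numbers from above).
DAU = 1_000                 # daily active users
REQUESTS_PER_USER = 1_000   # requests per user per day
TOKENS_PER_REQUEST = 500
CPM_USD = 1.90              # $/million tokens, on-demand H100 SXM5 8x node

daily_tokens = DAU * REQUESTS_PER_USER * TOKENS_PER_REQUEST
daily_cost = daily_tokens / 1_000_000 * CPM_USD
annual_cost = daily_cost * 365

print(f"{daily_tokens:,} tokens/day -> ${daily_cost:,.0f}/day, ~${annual_cost:,.0f}/yr")
```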

That number is not unusual. For any team with real production traffic, inference costs will exceed training costs within weeks of launch. Most teams underestimate this because they plan around training budgets and treat inference as an afterthought until the bill arrives.

For foundational GPU cost strategies that apply to both training and inference, see the GPU Cost Optimization Playbook.

Anatomy of an Inference Bill: Cost Per Token Across GPU Types

The most useful metric for inference economics is cost per million tokens (CPM). It normalizes GPU price and throughput into a single comparable figure.

The formula:

CPM = (GPU $/hr) / (tokens_per_sec × 3600 / 1,000,000)
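
As a sketch, the formula translates directly to code; the worked example uses the 8x H100 figures that appear in the table below:

```python
def cpm(gpu_dollars_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per million tokens: $/hr divided by million-tokens/hr."""
    tokens_per_hr = tokens_per_sec * 3600
    return gpu_dollars_per_hr / (tokens_per_hr / 1_000_000)

# 8x H100 SXM5 node at $2.40/GPU/hr serving ~2,800 tok/s:
print(round(cpm(8 * 2.40, 2_800), 2))  # ~1.9
```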

Using live Spheron pricing and vLLM throughput benchmarks for Llama 3.1 70B at batch size 256 (512 input / 512 output tokens):

| GPU | Config | $/hr (on-demand) | vLLM Throughput (Llama 3.1 70B) | CPM ($) |
| --- | --- | --- | --- | --- |
| A100 80G SXM4 | 8x | $1.05/GPU = $8.40 | ~1,400 tok/s | ~$1.67 |
| H100 SXM5 | 8x | $2.40/GPU = $19.20 | ~2,800 tok/s | ~$1.90 |
| H200 SXM5 | 8x | $4.54/GPU = $36.32 | ~3,600 tok/s | ~$2.80 |
| B200 SXM6 | 8x | $7.43/GPU = $59.44 | ~5,200 tok/s | ~$3.18 |

Throughput figures are for 70B FP16 at batch 256. Actual numbers vary with sequence length and concurrency profile. For exact measured values across engines, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

Pricing fluctuates based on GPU availability. The prices above are based on 04 Apr 2026 and may have changed. Check current GPU pricing for live rates.

The A100 has the lowest CPM for 70B FP16 at this batch size because its lower per-GPU price offsets its lower throughput. The H100 and newer GPUs become more cost-efficient when you factor in FP8 quantization, which roughly doubles throughput without changing the $/hr. FP8 on H100 brings CPM for the 8x node from ~$1.90 to approximately $0.95-1.10, making it the better choice for most production deployments.

For per-GPU benchmark detail and workload-specific guidance, see Best GPU for AI Inference in 2026.

The Four Optimization Layers

Inference cost reduction is not one technique. It is four layers applied in sequence, each with independent savings potential:

| Layer | What It Covers | Typical Savings |
| --- | --- | --- |
| Model | Quantization, distillation, right-sizing | 30-75% cost reduction |
| Runtime | Batching, speculative decoding, KV cache | 40-80% throughput gain |
| Infrastructure | Spot vs on-demand, auto-scaling, GPU type | 40-65% unit cost reduction |
| FinOps | Attribution, metering, budgets | Prevents waste accumulation |

The order matters. Start with the model layer: model changes affect every subsequent dollar spent. Then optimize the runtime. Then choose infrastructure to match the workload pattern. FinOps runs continuously throughout.

Model Layer: Quantization, Distillation, and Right-Sizing

FP8 quantization is the first thing to apply on H100 (vLLM supports it natively). It gives 1.3-2x throughput gain over FP16 at under 2% quality loss on instruction-tuned models. For standard conversational AI, summarization, and code generation tasks, the quality difference is not perceptible in production.

FP4 on B200 via TensorRT-LLM adds a further 1.5-2x gain over FP8. It introduces more quantization error than FP8, so test against your eval suite before moving to production. For the full B200 quantization economics, see FP4 Quantization on Blackwell GPUs.

INT4 (GPTQ/AWQ): Suitable for less quality-sensitive workloads like classification and summarization. Not recommended for complex reasoning tasks. At INT4, a 70B model fits on a single H100 with room for KV cache, versus two GPUs minimum at FP16. For high-batch serving, INT4 lets you cut the pod from eight GPUs to two or three while maintaining throughput.
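
A weights-only VRAM estimate makes the footprint difference concrete. This sketch ignores KV cache, activations, and runtime overhead, so treat it as an illustrative lower bound:

```python
# Rough weights-only VRAM estimate for a 70B model at each precision.
# Ignores KV cache, activations, and engine overhead (lower bound only).
PARAMS = 70e9
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    fits_h100 = gb < 80  # single 80 GB H100
    print(f"{precision}: ~{gb:.0f} GB of weights (fits one H100: {fits_h100})")
```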

Distillation is the highest-impact model change, but also the most work. If your 70B model handles tasks that a 14B or 8B model manages adequately, distillation cuts GPU requirements by 4-8x. The cost math: 70B on 8x H100 at $19.20/hr versus 14B on 2x H100 at $4.80/hr for comparable throughput on the target task. For setup details, see Model Distillation on GPU Cloud.

Right-sizing is often skipped. A 14B model at FP8 frequently matches a 70B INT4 on standard NLP tasks. Run your eval before defaulting to the largest available model.

Runtime Layer: Batching, Speculative Decoding, and KV Cache

Continuous batching is the single most impactful runtime change for most teams. Static batching leaves the GPU idle between requests. vLLM's continuous batching (PagedAttention) processes new tokens as slots free up, raising GPU utilization from 15-30% to 60-80% at typical traffic patterns. That is a 3-4x improvement in effective throughput at the same GPU cost.
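
A simplified way to see the gain: if effective throughput is roughly proportional to GPU utilization, cost per token falls by the same factor. Illustrative numbers only:

```python
# Simplified model: effective throughput scales with utilization, so
# cost per token scales inversely with it (illustrative, not measured).
util_static, util_continuous = 0.22, 0.68
speedup = util_continuous / util_static
print(f"~{speedup:.1f}x effective throughput at the same $/hr")  # ~3.1x
```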

Speculative decoding uses a small draft model to generate candidate tokens and a larger target model to verify them in parallel. For output-heavy workloads (code generation, long-form text), this gives 2-4x throughput improvement. It works best when the draft and target models are from the same family. For production setup details, see Speculative Decoding Production Guide.
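
Under the common simplifying assumption that each drafted token is accepted independently with probability a, the expected tokens emitted per verification pass for a draft of length k is (1 - a^(k+1)) / (1 - a). A sketch of that expectation (not a benchmark):

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    assuming i.i.d. per-token acceptance (a common simplification)."""
    a, k = accept_prob, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# With an 80% acceptance rate and 4 drafted tokens per step:
print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36
```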

KV cache optimization matters most for long-context workloads. PagedAttention eliminates internal VRAM fragmentation. For 32K+ token contexts, KV cache quantization to INT8 or FP8 cuts VRAM usage by 30-50%, freeing capacity for more concurrent requests. See the KV Cache Optimization Guide for memory calculations.
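
For scale, here is a rough per-sequence KV cache estimate, assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128); actual usage varies by engine and paging behavior:

```python
# Approximate KV cache size per sequence for Llama 3.1 70B
# (80 layers, 8 KV heads under GQA, head dim 128 -- published config).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_cache_gb(seq_len: int, bytes_per_elem: float) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem  # 2x for K and V
    return seq_len * per_token / 1e9

print(f"FP16 @ 32K ctx: ~{kv_cache_gb(32_768, 2):.1f} GB per sequence")
print(f"FP8  @ 32K ctx: ~{kv_cache_gb(32_768, 1):.1f} GB per sequence")
```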

For a head-to-head comparison of how vLLM, TensorRT-LLM, and SGLang implement these techniques and how they affect throughput and latency numbers, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

Infrastructure Layer: Spot vs On-Demand, Auto-Scaling, and GPU Right-Sizing

The infrastructure decision changes based on workload type. Using live Spheron pricing:

| GPU | On-Demand | Spot |
| --- | --- | --- |
| H100 SXM5 | $2.40/hr | N/A |
| H200 SXM5 | $4.54/hr | N/A |
| B200 SXM6 | $7.43/hr | N/A |
| A100 80G SXM4 | $1.05/hr | N/A |
| L40S | — | $0.32/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 04 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Synchronous API with latency SLAs: Use on-demand H100 or H200. Spot interruptions are incompatible with sub-2-second P99 requirements.

Batch inference pipelines: Embeddings, async summarization, nightly report generation, evaluation runs. These are the right workloads for spot. For GPU models with spot availability (such as L40S at $0.32/hr), the savings versus on-demand rates are substantial. Build a job queue with retry logic and exponential backoff. A spot interruption causes a retry, not a user-visible error. See instance types on Spheron for current spot availability by GPU model.
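
A minimal sketch of that retry pattern, with hypothetical names and a simulated interruption (real batch jobs would also checkpoint and resume):

```python
import random
import time

def run_with_retries(job, max_retries=5, base_delay=1.0):
    """Retry a batch job after spot interruptions with exponential
    backoff plus jitter; `job` is any callable that raises on interruption."""
    for attempt in range(max_retries + 1):
        try:
            return job()
        except InterruptedError:
            if attempt == max_retries:
                raise
            # Wait for spot capacity to return, backing off each time.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# Simulated job that is interrupted twice, then succeeds.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise InterruptedError("spot instance reclaimed")
    return "done"

print(run_with_retries(flaky_job, base_delay=0.01))  # done
```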

Auto-scaling trigger: Scale GPU count based on pending request queue depth, not CPU or memory. A queue of more than 50 pending requests per GPU is typically the signal to scale out. CPU and memory give lagging signals for inference workloads; queue depth is real-time.
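
The policy fits in a pure function; the per-GPU queue target of 50 and the fleet bounds below are illustrative, not prescriptive:

```python
import math

def desired_gpu_count(pending_requests: int, queue_target_per_gpu: int = 50,
                      min_gpus: int = 1, max_gpus: int = 16) -> int:
    """One GPU per `queue_target_per_gpu` pending requests, clamped to
    the fleet's min/max bounds (illustrative scaling policy)."""
    needed = math.ceil(pending_requests / queue_target_per_gpu)
    return max(min_gpus, min(max_gpus, needed))

print(desired_gpu_count(0))      # 1
print(desired_gpu_count(260))    # 6
print(desired_gpu_count(10_000)) # 16
```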

GPU right-sizing: For models under 14B at FP8, a single H100 PCIe at $2.01/hr may match four A100 SXM4 GPUs at $4.20/hr, at lower cost and with simpler networking. Run benchmarks before committing.

For the billing model decision framework (serverless vs on-demand vs reserved), see Serverless vs On-Demand vs Reserved GPU. For GPU sharing strategies using MIG and time-slicing, see Run Multiple LLMs on One GPU.

FinOps Layer: Cost Attribution, Token Metering, and Budget Alerts

Without attribution, GPU spend is a black box. With it, you can see exactly which model, use-case, and team is driving cost.

Tag at submission. Every inference job should carry three tags: model name, use-case ID, and team or product. These flow through to your cost reporting without any additional work.

Emit token counts as metrics. GPU hours alone do not tell you if your cost is rising because traffic increased or because efficiency dropped. Token counts give you the denominator. Store both in your data warehouse and compute CPM as a weekly KPI.

Budget alerts at 80%, not 100%. By the time you hit 100% of budget, there is nothing left to do. An 80% alert gives you a week or two to investigate and course-correct before the next billing cycle.

Review cadence. Inference costs move fast. Weekly reviews catch regressions before they compound. Monthly reviews are appropriate for training, where costs are more predictable.
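
The metering and alerting logic above fits in a few lines; the spend and token figures below are hypothetical:

```python
def weekly_cpm(gpu_spend_usd: float, tokens_served: int) -> float:
    """Cost per million tokens served over the review window."""
    return gpu_spend_usd / (tokens_served / 1_000_000)

def budget_alert(spend_to_date: float, monthly_budget: float,
                 threshold: float = 0.80) -> bool:
    """Fire at 80% of budget, leaving time to course-correct."""
    return spend_to_date >= threshold * monthly_budget

# Hypothetical week: $3,325 of GPU spend serving 1.75B tokens.
print(round(weekly_cpm(3_325.0, 1_750_000_000), 2))  # 1.9
print(budget_alert(16_500, 20_000))                  # True
```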

Spheron vs AWS vs RunPod: Inference Cost Per Million Tokens

Provider pricing affects the baseline before any optimization. The table below compares Spheron's on-demand H100 SXM5 pricing against representative estimates for AWS and RunPod. AWS and RunPod prices are estimates based on public pricing pages as of March 2026 and may not reflect reserved or negotiated rates.

| Provider | GPU | 8-GPU Pod $/hr | vLLM 70B FP16 (tok/s) | CPM ($) | Egress/Networking |
| --- | --- | --- | --- | --- | --- |
| Spheron | H100 SXM5 | $19.20 (8 × $2.40) | ~2,800 | ~$1.90 | None |
| RunPod | H100 SXM5 | ~$21.52 (8 × $2.69) | ~2,800 | ~$2.13 | Included |
| AWS (p5.48xlarge) | H100 SXM5 | ~$55.04 | ~2,800 | ~$5.46 | $0.09/GB outbound |

Pricing fluctuates based on GPU availability. The prices above are based on 04 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Llama 3.1 70B in FP16 requires approximately 140 GB of VRAM, so it cannot run on a single 80 GB H100. An 8-GPU pod is the minimum configuration for this workload, which is why the table uses 8-GPU pod pricing. The AWS figure reflects the p5.48xlarge on-demand rate ($55.04/hr in us-east-1), which provides 8x H100 SXM5 GPUs. AWS inference workloads also accumulate ancillary costs: VPC NAT gateway, ALB for load balancing, and egress. A team pushing 10TB/month in API response data pays roughly $900/month in egress fees alone on AWS. Spheron charges a flat per-GPU-hour rate with no egress or networking overhead.

For teams running $20K/month in raw compute on AWS, migrating to Spheron typically reduces total spend by 20-35% after accounting for the rate difference and eliminated ancillary fees.

Real-World Savings: From ~$39K/month to $16K/month

This scenario is based on a team serving Llama 3.1 70B to an internal enterprise tool with 500 DAU. Traffic peaks 9am-6pm on weekdays, drops to near-zero overnight and on weekends.

Before:

| Item | Details | Monthly Cost |
| --- | --- | --- |
| 4x H100 SXM5 nodes (4 GPUs each) on prior cloud provider | On-demand, always-on, FP16, static batching at $2.50/GPU/hr (16 GPUs × $2.50 × 730 hrs) | $29,200 |
| AWS egress and networking | ALB, NAT gateway, 8TB outbound | $7,500 |
| Storage | Training artifacts, logs, checkpoints | $2,400 |
| Total | — | ~$39,100 |

Average GPU utilization: 22%. The cluster sits mostly idle from 7pm to 8am and all weekend.

After (all four layers applied):

| Item | Details | Monthly Cost |
| --- | --- | --- |
| 2x H100 SXM5 nodes on-demand (Spheron) | FP8, continuous batching, always-on | $14,016 |
| 4x L40S spot for batch jobs (Spheron) | Embeddings, async summarization | $935 |
| Storage | Reduced footprint after archive cleanup | $1,200 |
| Total | — | ~$16,151 |

Net result: with the AWS ancillary fees fully eliminated, the effective monthly bill came to approximately $16K, a 59% reduction from the prior $39,100/month.

The changes that drove it:

  1. FP8 quantization: Same 4-GPU footprint served 1.8x the traffic. Reduced the number of nodes needed at peak from four to two.
  2. Continuous batching: GPU utilization rose from 22% to 68%. Far fewer idle GPU-hours.
  3. Spot for batch: Moved all embedding and summarization jobs to L40S spot at $0.32/hr, a fraction of on-demand inference compute rates.
  4. Provider switch: Eliminated AWS egress, NAT gateway, and ALB costs entirely.

For a comparable cost reduction story on the training side, see Spot GPU Training Case Study.

| Metric | Before | After |
| --- | --- | --- |
| Monthly GPU cost | ~$29,200 | ~$14,951 |
| Infrastructure overhead | ~$9,900 | ~$1,200 |
| GPU utilization | 22% | 68% |
| Total monthly spend | ~$39,100 | ~$16,151 |

Decision Framework: When to Self-Host vs Use Inference APIs

Managed inference APIs (OpenAI, Anthropic, Google) trade cost for simplicity. Self-hosting on GPU cloud trades simplicity for cost. Here is how to pick:

| Condition | Self-Host on GPU Cloud | Use Inference API |
| --- | --- | --- |
| Token volume | Greater than 100M tokens/month | Less than 50M tokens/month |
| Latency SLA | Strict P99 requirements | Flexible |
| Model customization | Fine-tuned or private model | Foundation model |
| Ops capability | Dedicated infra team | Small team, no infra resources |
| Data privacy | Private or regulated data | Public or low-sensitivity data |
| Cost at scale | $0.10-0.50/M tokens self-hosted | $0.60-15/M tokens via API |

The break-even math: at 500M tokens/month with a 70B FP16 model on an 8x H100 pod via Spheron at roughly $1.90/M tokens, compute costs are ~$950/month. The same volume via OpenAI GPT-4o at $10.00/M output tokens is $5,000/month. With FP8 quantization cutting CPM to roughly $0.95-1.10/M, the gap widens further. Self-hosting breaks even at roughly 50-100M tokens/month for most 70B-class models when you factor in engineering overhead for running the infrastructure.
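
The break-even arithmetic, compute-only (self-hosting's engineering overhead is real but excluded here, which is why break-even lands at 50-100M tokens/month in practice rather than lower):

```python
# Compute-only comparison of self-hosting vs a managed API at several
# monthly volumes; engineering overhead is deliberately excluded.
SELF_HOST_CPM = 1.90   # $/M tokens, 70B FP16 on an 8x H100 pod
API_PER_M = 10.00      # $/M output tokens, GPT-4o list price

for tokens_m in (50, 100, 500):
    self_host = tokens_m * SELF_HOST_CPM
    api = tokens_m * API_PER_M
    print(f"{tokens_m}M tok/mo: self-host ${self_host:,.0f} vs API ${api:,.0f}")
```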

Below 20M tokens/month, managed APIs win on total cost including ops. Above 100M tokens/month, self-hosting almost always wins on unit economics.


Inference costs compound fast. If you are serving more than 50M tokens a month, running on-demand cloud instances with no cost optimization strategy leaves real money on the table. Spheron gives you transparent per-hour GPU pricing, spot instances on select GPU models, and no hidden egress fees.
