Engineering

GPU Cost Per Token: Benchmark 7 Major LLMs Across GPU Types in 2026

Written by Mitrasish, Co-founder · Apr 18, 2026
Tags: Cost Per Token · LLM Inference · GPU Benchmarks · H100 · A100 · B200 · Quantization · Spot Instances · Llama 4 · DeepSeek

Inference now accounts for 80%+ of AI GPU spend at production scale. The metric that actually controls that budget is cost per million tokens (CPM). What follows are cross-GPU, cross-model benchmarks for the models teams are running in 2026: Llama 4, Qwen 3, Gemma 4, DeepSeek V3, and Mistral Small, measured against A100, H100, H200, B200, and L40S with live Spheron pricing.

Why Cost Per Token Is the Only Metric That Matters

GPU hourly rate tells you almost nothing in isolation. An H100 at $2.90/hr sounds expensive next to an A100 at $1.64/hr until you account for what each delivers in tokens per hour.

The formula:

CPM = (cluster $/hr) / (tokens_per_sec × 3600 / 1,000,000)

Throughput and price collapse into one number, which lets you compare any GPU-model combination on equal footing. A GPU that costs 5x more but delivers 6x the throughput is cheaper per token. The only way to know which hardware actually wins for your model is to run the math.
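The formula drops straight into a script or spreadsheet. A minimal sketch (the helper name `cpm_dollars` is ours, not from any library):

```python
def cpm_dollars(cluster_usd_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per million tokens: cluster $/hr divided by millions of tokens per hour."""
    millions_of_tokens_per_hr = tokens_per_sec * 3600 / 1_000_000
    return cluster_usd_per_hr / millions_of_tokens_per_hr

# From the tables below: 1x H100 at $2.90/hr serving 4,200 tok/s
print(round(cpm_dollars(2.90, 4200), 2))  # 0.19
# 1x A100 at $1.64/hr serving 2,800 tok/s
print(round(cpm_dollars(1.64, 2800), 2))  # 0.16
```

Every CPM figure in this post is this one division applied to a cluster price and a measured throughput.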

For the full FinOps framework behind these numbers, including a real-world case study cutting from $39K to $16K per month, see AI Inference Cost Economics 2026.

Benchmark Methodology

All results use the following setup:

  • Hardware: A100 80G SXM4, H100 SXM5, H200 SXM5, B200 SXM6, L40S PCIe
  • Framework: vLLM with continuous batching and PagedAttention enabled
  • Precision: FP16 baseline unless noted
  • Batch config: 256 concurrent requests, 512 input / 512 output tokens per request
  • Pricing source: Spheron live API, on-demand rates, fetched 18 Apr 2026
  • Throughput source: Community benchmark data from vLLM and MLPerf results, cross-referenced against published reports

These are not real-time SLA benchmarks. Throughput figures reflect sustained batch throughput at the specified configuration. Exact numbers vary by vLLM version, exact hardware revision (SXM vs PCIe), driver stack, and sequence length distribution. Treat them as engineering approximations, not vendor-guaranteed peak specs.
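For reference, a serving command matching this configuration might look like the following. This is a sketch, not the exact harness used for these numbers: the model name and port are illustrative, and exact flags vary by vLLM version.

```shell
# Launch vLLM's OpenAI-compatible server with continuous batching
# (the default scheduler) at the benchmark settings: FP16 weights,
# up to 256 concurrent sequences, 8-way tensor parallelism.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --port 8000
```

Single-GPU configurations in the tables below drop `--tensor-parallel-size` to 1.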

Cost Per Token Results by Model

Llama 4 Scout 17B (MoE)

Llama 4 Scout uses a mixture-of-experts architecture with 17B active parameters per token. It fits on a single H100 (80 GB) or A100 (80 GB) with room for KV cache.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 1x | $2.90 | 4,200 | $0.19 |
| A100 80G SXM4 | 1x | $1.64 | 2,800 | $0.16 |

The A100 edges out the H100 on on-demand CPM for this model: its hourly rate is 1.77x lower while the H100's throughput advantage is only 1.5x, so the cheaper card wins per token. For deployment steps, see Deploy Llama 4 on GPU Cloud.

Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Qwen 3

Qwen 3 30B-A3B is a 30B MoE model with only 3B active parameters per token. Qwen 3 32B is a dense model that fits on 2x H100 in FP16 (64 GB weights).

| GPU | Config | Model | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|---|
| H100 SXM5 | 1x | Qwen 3 30B-A3B (MoE) | $2.90 | 3,900 | $0.21 |
| H100 SXM5 | 2x | Qwen 3 32B | $5.80 | 3,200 | $0.50 |

The MoE model is where Qwen 3 shines: $0.21/M on a single H100. The 32B dense model on 2x H100 gives you a substantial parameter budget at moderate cost, with CPM around $0.50/M on-demand.

Gemma 4 31B

Gemma 4 31B fits on a single H100 80GB in FP16 (62 GB weights). No tensor parallelism needed.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 1x | $2.90 | 3,100 | $0.26 |

$0.26/M on-demand is competitive for a 31B model. FP8 quantization reduces the model to 31 GB, freeing more VRAM for larger KV cache batches and improving throughput further. For deployment steps, see Deploy Gemma 4 on GPU Cloud.

DeepSeek V3

DeepSeek V3 is a 671B total parameter MoE model with approximately 37B active parameters per token. Despite the low active parameter count per forward pass, the full model weights must reside in VRAM. At FP16, the weights alone total roughly 1,342 GB (671B params × 2 bytes), which is approximately 167.75 GB per GPU across 8 devices, far exceeding the H100 SXM5's 80 GB VRAM. Serving DeepSeek V3 on 8x H100 80GB requires quantization: INT4 (AWQ/GPTQ) brings the per-GPU footprint to roughly 42 GB, which fits comfortably. Alternatively, the model can be served in FP8 on 16x H100 or on 8x H200 (141 GB VRAM each). The benchmarks below reflect a quantized deployment on 8x H100.
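The arithmetic above generalizes to any model, precision, and GPU count. A small sketch (helper names are ours; GB here means decimal gigabytes, matching the figures above):

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Total weight footprint in decimal GB (1B params x 1 byte = 1 GB)."""
    return params_billion * bytes_per_param

def per_gpu_gb(params_billion: float, bytes_per_param: float, num_gpus: int) -> float:
    """Per-GPU share of the weights under even tensor-parallel sharding."""
    return weights_gb(params_billion, bytes_per_param) / num_gpus

# DeepSeek V3 at FP16 (2 bytes/param) across 8 GPUs: far over an H100's 80 GB
print(per_gpu_gb(671, 2.0, 8))  # 167.75
# INT4 (0.5 bytes/param): fits under 80 GB with room left for KV cache
print(per_gpu_gb(671, 0.5, 8))  # ~41.9
```

The same function confirms the FP8 alternatives: 671 GB of FP8 weights split 16 ways (~42 GB/GPU on H100) or 8 ways (~84 GB/GPU, inside the H200's 141 GB).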

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 8x | $23.20 | 1,800 | $3.58 |

DeepSeek V3's CPM is high on on-demand: the 8-GPU requirement keeps it expensive even for background inference pipelines (document processing, RAG pre-computation) that could otherwise run on cheaper configurations. See Deploy DeepSeek V3 on GPU Cloud for setup details.

Mistral Small 3.1 24B

Mistral Small 3.1 24B runs here on 2x H100 in FP16 (48 GB weights), leaving ample KV-cache headroom at batch 256. FP8 halves the weights to 24 GB, fitting comfortably on a single H100 and halving the cluster cost.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 2x | $5.80 | 3,400 | $0.47 |

Competitive CPM for a high-quality 24B model. If you can accept FP8 quality (which is generally fine for conversational tasks), single-GPU operation cuts cost further.

GPU Hardware Comparison: Cost Efficiency for Inference

Using Llama 3.3 70B as the reference model (benchmarked here on 8-GPU clusters in FP16), here is how the GPU lineup compares on CPM.

| GPU | VRAM | Memory BW | On-demand $/GPU/hr | 70B CPM (on-demand) | 70B Spot CPM |
|---|---|---|---|---|---|
| A100 80G SXM4 | 80 GB | 2.0 TB/s | $1.64 | $2.60 | N/A |
| H100 SXM5 | 80 GB | 3.35 TB/s | $2.90 | $2.30 | N/A |
| H200 SXM5 | 141 GB HBM3e | 4.8 TB/s | $3.69 | $2.28 | N/A |
| B200 SXM6 | 192 GB HBM3e | 8.0 TB/s | spot-only ($2.07) | N/A | $0.88 |
| L40S PCIe | 48 GB GDDR6 | 864 GB/s | $0.72 | N/A* | N/A |

*70B FP16 requires ~140 GB VRAM. L40S (48 GB) cannot host 70B in FP16. Use INT4 with AWQ or GPTQ to fit on L40S, or use multi-GPU tensor parallelism.
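The footnote's fit check can be expressed directly. A sketch under the same simplification (weights only, no KV-cache headroom; the helper names are ours):

```python
# Bytes of weight storage per parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def fits(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in VRAM; real serving also needs KV cache."""
    return params_billion * BYTES_PER_PARAM[precision] <= vram_gb

print(fits(70, "fp16", 48))  # False: 140 GB of weights vs the L40S's 48 GB
print(fits(70, "int4", 48))  # True: 35 GB of weights
```

In practice you want meaningful headroom beyond the weights, so a near-exact fit is still a poor serving configuration.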

When A100 wins: For 30-70B models at budget-sensitive shops where throughput is not the bottleneck, the A100's lower hourly rate produces competitive CPM.

When H100 wins: For throughput-sensitive production APIs where you need the best on-demand CPM for 13-70B models, H100 edges out A100 despite the higher hourly rate. FP8 on H100 roughly doubles throughput without changing GPU count, making H100 the clear winner once quantization is in play.

When B200 wins: For very large MoE models (200B+) where memory bandwidth is the primary bottleneck, B200's 8.0 TB/s HBM3e shifts the CPM calculus. B200 SXM6 is available as spot-only on Spheron at $2.07/GPU/hr, making it the most cost-effective option for batch inference on models requiring this hardware tier. For B200 FP4 deployments via TensorRT-LLM, CPM drops further still.

Best GPU for AI Inference in 2026 has a full hardware decision matrix including L40S guidance for smaller models.

Impact of Quantization: FP16 vs FP8 vs INT4

Using Llama 3.3 70B on 8x H100 and 8x A100 as the reference configuration:

| GPU | Precision | Throughput (tok/s) | CPM ($) | Quality Impact vs FP16 |
|---|---|---|---|---|
| H100 SXM5 (8x) | FP16 | 2,800 | $2.30 | Baseline |
| H100 SXM5 (8x) | FP8 | 5,600 | $1.15 | <1-2% on MMLU, MT-Bench |
| H100 SXM5 (8x) | INT4 (AWQ) | 8,400 | $0.77 | 2-4%, task-dependent |
| A100 80G (8x) | FP16 | 1,400 | $2.60 | Baseline |
| A100 80G (8x) | INT8 (TRT-LLM) | 2,100 | $1.74 | <1-2% for most tasks |
| A100 80G (8x) | INT4 (GPTQ) | 3,500 | $1.04 | 2-5%, task-dependent |

Key takeaways:

FP8 is an H100/H200/B200 hardware feature. On A100, the equivalent path is INT8 via TensorRT-LLM or bitsandbytes, which gives a smaller throughput gain than native FP8 but still cuts CPM meaningfully.

INT4 (AWQ or GPTQ) delivers the lowest CPM but introduces more quantization error. For conversational AI and summarization, the quality delta is usually acceptable. For code generation, math reasoning, and precise factual recall, run your eval suite before committing.

B200 introduces FP4 support via TensorRT-LLM. For large models where FP4 weights are available, CPM drops by 30-40% versus FP8. See FP4 Quantization on Blackwell for the full breakdown.

The Tokens Per Watt Guide covers efficiency from a different angle if you are also optimizing for power and not just dollar cost.

Batch Size and Throughput: How Concurrent Users Change CPM

Batch size is the most impactful variable most teams ignore. Using Llama 3.3 70B on 8x H100 and 8x A100:

| Batch Size | Aggregate tok/s (H100 8x) | CPM H100 ($) | Aggregate tok/s (A100 8x) | CPM A100 ($) |
|---|---|---|---|---|
| 1 | ~25 | ~$258 | ~15 | ~$242 |
| 8 | ~200 | ~$32 | ~120 | ~$30 |
| 32 | ~720 | ~$8.95 | ~460 | ~$7.94 |
| 128 | ~2,100 | ~$3.07 | ~1,100 | ~$3.31 |
| 256 | ~2,800 | ~$2.30 | ~1,400 | ~$2.60 |

At batch size 1, the GPU spends most of its time streaming model weights from VRAM for a single active request, and compute utilization sits under 5%. At batch size 256 with continuous batching, the hardware runs near its sustained throughput ceiling. The CPM difference between these two states is roughly 100x for H100 and comparable for A100.

Most production teams are not running at batch 1. But a lot of teams are running at effective batch sizes in the 8-32 range due to under-configured inference servers, which puts them 10-30x higher on CPM than they need to be. The fix is configuring continuous batching properly and aligning max_num_batched_tokens to your actual traffic profile.
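One way to sanity-check your effective batch size against traffic is Little's law: in-flight requests ≈ arrival rate × average request duration. A sketch with illustrative numbers (not drawn from the benchmarks above):

```python
def effective_concurrency(requests_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: steady-state in-flight requests = arrival rate x residence time."""
    return requests_per_sec * avg_latency_sec

# 12 req/s with ~2 s end-to-end generations -> ~24 requests in flight on average,
# so max_num_seqs (and max_num_batched_tokens) should comfortably exceed that.
print(effective_concurrency(12, 2.0))  # 24.0
```

If this number sits far below your configured batch ceiling, your CPM is set by traffic, not hardware; if it sits above it, requests queue and you are leaving batching throughput on the table.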

For detailed serving optimization, see LLM Serving Optimization: Continuous Batching.

Spot vs On-Demand: GPU Availability and Cost Per Token

As of April 2026, spot GPU instances on Spheron are available for B200 SXM6 at $2.07/GPU/hr. H100 SXM5, A100 80G SXM4, and H200 SXM5 are currently on-demand only. A100 80G PCIe (not covered in the SXM4 benchmarks above) has spot availability at $1.14/hr for teams that can work with PCIe bandwidth.

| GPU | On-demand $/GPU/hr | Spot $/GPU/hr | On-demand CPM (70B) | Spot CPM (70B) |
|---|---|---|---|---|
| H100 SXM5 | $2.90 | N/A | $2.30 | N/A |
| A100 80G SXM4 | $1.64 | N/A | $2.60 | N/A |
| H200 SXM5 | $3.69 | N/A | $2.28 | N/A |
| B200 SXM6 | N/A (spot-only) | $2.07 | N/A | $0.88 |
| L40S PCIe | $0.72 | N/A | N/A* | N/A |

*70B FP16 does not fit on L40S. Check current pricing for live spot rates, as availability changes.

For B200, the spot-only pricing at $2.07/GPU/hr makes it unusually competitive for batch workloads that can tolerate interruption. An 8x B200 spot cluster at $16.56/hr delivers 5,200 tok/s for 70B models, giving a CPM of $0.88. That is lower than the on-demand CPM of any other GPU in this comparison for 70B inference.

The workloads that should use spot: embedding pipelines, offline document summarization, nightly eval runs, batch enrichment, RAG pre-computation. These can tolerate a restart if the instance is reclaimed.

The workloads that should not use spot: synchronous APIs with latency SLAs, streaming chat endpoints, any pipeline where request retry adds unacceptable latency.

For a deeper treatment of spot vs. on-demand vs. reserved trade-offs, see Serverless GPU vs On-Demand vs Reserved Instances.

Optimization Playbook: 5 Strategies to Halve Your Cost Per Token

  1. Switch to continuous batching. If you are still running a naive per-request inference loop, switching to vLLM or SGLang with continuous batching and PagedAttention is typically the highest-ROI move available. Batch size 8 vs batch size 256 is a 10-30x CPM difference on the same hardware.
  2. Apply FP8 on H100 or newer hardware. FP8 is the safest quantization option for production inference on Hopper and newer GPUs. Quality degradation is under 2% on most standard benchmarks, and throughput roughly doubles. This halves CPM without changing cluster size or model quality.
  3. Right-size GPU count. Don't run a 7B model on an 8-GPU pod because that is what your cluster defaults to. Match GPU count to model VRAM requirements, then add headroom for KV cache. A 7B FP16 model fits on a single L40S (48 GB). Running it on 8x H100 is throwing money away.
  4. Move batch workloads to spot instances where available. B200 SXM6 spot at $2.07/GPU/hr is the best current spot option on Spheron for large MoE models needing high memory bandwidth. For any async inference workload that fits on B200, this significantly undercuts on-demand H100 or H200 CPM.
  5. Use the right model for the task. A 70B model for a classification task that a fine-tuned 7B handles equally well costs 10-15x more per token. Audit your routing logic. Reserve large models for tasks where they clearly outperform smaller alternatives.
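The right-sizing step above can be sanity-checked with a few lines. This sketch sizes to weights plus a KV-cache headroom fraction; the 25% headroom default is our assumption, not a benchmark result:

```python
import math

def min_gpus(params_billion: float, bytes_per_param: float,
             vram_gb: float, kv_headroom: float = 0.25) -> int:
    """Smallest GPU count whose combined VRAM holds the weights while reserving
    a headroom fraction per GPU for KV cache. This is a VRAM floor only;
    throughput targets often justify more GPUs than this."""
    usable_per_gpu = vram_gb * (1 - kv_headroom)
    return math.ceil(params_billion * bytes_per_param / usable_per_gpu)

print(min_gpus(7, 2.0, 48))   # 1: a 7B FP16 model fits a single L40S
print(min_gpus(70, 2.0, 80))  # 3: the VRAM floor; the 8x configs here size for throughput
```

The point of the floor is to catch waste in the other direction: if your cluster defaults hand an 8-GPU pod to a model with a floor of 1, you are paying roughly 8x the necessary hourly rate.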

For a deeper treatment of each layer, see GPU Cost Optimization Playbook.

Full Cross-Model, Cross-GPU Reference Table

All CPM figures are on-demand unless noted. B200 SXM6 is spot-only; its cluster cost reflects spot pricing. H100, A100 80G SXM4, and H200 are currently on-demand only on Spheron.

| Model | GPU Config | Cluster $/hr | Throughput (tok/s) | On-demand CPM | Spot CPM |
|---|---|---|---|---|---|
| Llama 4 Scout 17B (MoE) | 1x H100 SXM5 | $2.90 | 4,200 | $0.19 | N/A |
| Llama 4 Scout 17B (MoE) | 1x A100 80G | $1.64 | 2,800 | $0.16 | N/A |
| Qwen 3 30B-A3B (MoE) | 1x H100 SXM5 | $2.90 | 3,900 | $0.21 | N/A |
| Qwen 3 32B | 2x H100 SXM5 | $5.80 | 3,200 | $0.50 | N/A |
| Gemma 4 31B | 1x H100 SXM5 | $2.90 | 3,100 | $0.26 | N/A |
| Mistral Small 3.1 24B | 2x H100 SXM5 | $5.80 | 3,400 | $0.47 | N/A |
| Llama 3.3 70B | 8x A100 80G | $13.12 | 1,400 | $2.60 | N/A |
| Llama 3.3 70B | 8x H100 SXM5 | $23.20 | 2,800 | $2.30 | N/A |
| Llama 3.3 70B | 8x H200 SXM5 | $29.52 | 3,600 | $2.28 | N/A |
| Llama 3.3 70B | 8x B200 SXM6 | $16.56 (spot) | 5,200 | N/A | $0.88 |
| DeepSeek V3 (671B MoE) | 8x H100 SXM5 | $23.20 | 1,800 | $3.58 | N/A |

All throughput figures are measured on vLLM with continuous batching at batch size 256, 512 input / 512 output tokens, FP16. They are engineering approximations, not guaranteed throughput under all conditions.

For most teams running 30-70B models, FP8 on H100 is the sweet spot: competitive CPM, safe quality, and broad model support. For batch workloads on very large models, B200 spot at $2.07/GPU/hr delivers the best available CPM. If you are running DeepSeek V3 or other 200B+ MoE models, the on-demand H100 numbers look steep but B200 spot brings CPM under $1/M, competitive with smaller dense models on on-demand.


Running inference at scale? Spheron's on-demand and spot H100, A100, B200, and H200 instances give you the GPU configurations benchmarked in this post, billed per minute with no reserved instance lock-in.

Rent H100 → | Rent A100 → | Rent B200 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.