Engineering

GPU Cost Per Token: Benchmark 7 Major LLMs Across GPU Types in 2026

Written by Mitrasish, Co-founder · Apr 18, 2026
Tags: Cost Per Token · LLM Inference · GPU Benchmarks · H100 · A100 · B200 · Quantization · Spot Instances · Llama 4 · DeepSeek

Inference now accounts for 80%+ of AI GPU spend at production scale. The metric that actually controls that budget is cost per million tokens (CPM). What follows are cross-GPU, cross-model benchmarks for the models teams are running in 2026: Llama 4, Qwen 3, Gemma 4, DeepSeek V3, and Mistral Small, measured against A100, H100, H200, B200, and L40S with live Spheron pricing.

Why Cost Per Token Is the Only Metric That Matters

GPU hourly rate tells you almost nothing in isolation. An H100 at $2.90/hr sounds expensive next to an A100 at $1.64/hr until you account for what each delivers in tokens per hour.

The formula:

CPM = (cluster $/hr) / (tokens_per_sec × 3600 / 1,000,000)

Throughput and price collapse into one number, which lets you compare any GPU-model combination on equal footing. A GPU that costs 5x more but delivers 6x the throughput is cheaper per token. The only way to know which hardware actually wins for your model is to run the math.
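The formula drops straight into a script or spreadsheet. A minimal sketch (the helper name `cpm_dollars` is ours, not from any library):

```python
def cpm_dollars(cluster_usd_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per million tokens: cluster $/hr divided by millions of tokens per hour."""
    millions_of_tokens_per_hr = tokens_per_sec * 3600 / 1_000_000
    return cluster_usd_per_hr / millions_of_tokens_per_hr

# From the tables below: 1x H100 at $2.90/hr serving 4,200 tok/s
print(round(cpm_dollars(2.90, 4200), 2))  # 0.19
# 1x A100 at $1.64/hr serving 2,800 tok/s
print(round(cpm_dollars(1.64, 2800), 2))  # 0.16
```

Every CPM figure in this post is this one division applied to a cluster price and a measured throughput.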

For the full FinOps framework behind these numbers, including a real-world case study cutting from $39K to $16K per month, see AI Inference Cost Economics 2026.

Benchmark Methodology

All results use the following setup:

  • Hardware: A100 80G SXM4, H100 SXM5, H200 SXM5, B200 SXM6, L40S PCIe
  • Framework: vLLM with continuous batching and PagedAttention enabled
  • Precision: FP16 baseline unless noted
  • Batch config: 256 concurrent requests, 512 input / 512 output tokens per request
  • Pricing source: Spheron live API, on-demand rates, fetched 18 Apr 2026
  • Throughput source: Community benchmark data from vLLM and MLPerf results, cross-referenced against published reports

These are not real-time SLA benchmarks. Throughput figures reflect sustained batch throughput at the specified configuration. Exact numbers vary by vLLM version, exact hardware revision (SXM vs PCIe), driver stack, and sequence length distribution. Treat them as engineering approximations, not vendor-guaranteed peak specs.
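For reference, a serving command matching this configuration might look like the following. This is a sketch, not the exact harness used for these numbers: the model name and port are illustrative, and exact flags vary by vLLM version.

```shell
# Launch vLLM's OpenAI-compatible server with continuous batching
# (the default scheduler) at the benchmark settings: FP16 weights,
# up to 256 concurrent sequences, 8-way tensor parallelism.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --port 8000
```

Single-GPU configurations in the tables below drop `--tensor-parallel-size` to 1.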

Cost Per Token Results by Model

Llama 4 Scout 17B (MoE)

Llama 4 Scout uses a mixture-of-experts architecture with 17B active parameters per token. It fits on a single H100 (80 GB) or A100 (80 GB) with room for KV cache.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 1x | $2.90 | 4,200 | $0.19 |
| A100 80G SXM4 | 1x | $1.64 | 2,800 | $0.16 |

The A100 edges out the H100 on on-demand CPM for this model: its hourly rate is 1.77x lower while the H100's throughput advantage is only 1.5x, so the cheaper card wins per token. For deployment steps, see Deploy Llama 4 on GPU Cloud.

Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Qwen 3

Qwen 3 30B-A3B is a 30B MoE model with only 3B active parameters per token. Qwen 3 32B is a dense model that fits on 2x H100 in FP16 (64 GB weights).

| GPU | Config | Model | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|---|
| H100 SXM5 | 1x | Qwen 3 30B-A3B (MoE) | $2.90 | 3,900 | $0.21 |
| H100 SXM5 | 2x | Qwen 3 32B | $5.80 | 3,200 | $0.50 |

The MoE model is where Qwen 3 shines: $0.21/M on a single H100. The 32B dense model on 2x H100 gives you a substantial parameter budget at moderate cost, with CPM around $0.50/M on-demand.

Gemma 4 31B

Gemma 4 31B fits on a single H100 80GB in FP16 (62 GB weights). No tensor parallelism needed.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 1x | $2.90 | 3,100 | $0.26 |

$0.26/M on-demand is competitive for a 31B model. FP8 quantization reduces the model to 31 GB, freeing more VRAM for larger KV cache batches and improving throughput further. For deployment steps, see Deploy Gemma 4 on GPU Cloud.

DeepSeek V3

DeepSeek V3 is a 671B total parameter MoE model with approximately 37B active parameters per token. Despite the low active parameter count per forward pass, the full model weights must reside in VRAM. At FP16, the weights alone total roughly 1,342 GB (671B params × 2 bytes), which is approximately 167.75 GB per GPU across 8 devices, far exceeding the H100 SXM5's 80 GB VRAM. Serving DeepSeek V3 on 8x H100 80GB requires quantization: INT4 (AWQ/GPTQ) brings the per-GPU footprint to roughly 42 GB, which fits comfortably. Alternatively, the model can be served in FP8 on 16x H100 or on 8x H200 (141 GB VRAM each). The benchmarks below reflect a quantized deployment on 8x H100.
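The arithmetic above generalizes to any model, precision, and GPU count. A small sketch (helper names are ours; GB here means decimal gigabytes, matching the figures above):

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Total weight footprint in decimal GB (1B params x 1 byte = 1 GB)."""
    return params_billion * bytes_per_param

def per_gpu_gb(params_billion: float, bytes_per_param: float, num_gpus: int) -> float:
    """Per-GPU share of the weights under even tensor-parallel sharding."""
    return weights_gb(params_billion, bytes_per_param) / num_gpus

# DeepSeek V3 at FP16 (2 bytes/param) across 8 GPUs: far over an H100's 80 GB
print(per_gpu_gb(671, 2.0, 8))  # 167.75
# INT4 (0.5 bytes/param): fits under 80 GB with room left for KV cache
print(per_gpu_gb(671, 0.5, 8))  # ~41.9
```

The same function confirms the FP8 alternatives: 671 GB of FP8 weights split 16 ways (~42 GB/GPU on H100) or 8 ways (~84 GB/GPU, inside the H200's 141 GB).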

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 8x | $23.20 | 1,800 | $3.58 |

DeepSeek V3's CPM is high on on-demand: the 8-GPU requirement keeps it expensive even for background inference pipelines (document processing, RAG pre-computation) that could otherwise run on cheaper configurations. See Deploy DeepSeek V3 on GPU Cloud for setup details.

Mistral Small 3.1 24B

Mistral Small 3.1 24B runs here on 2x H100 in FP16 (48 GB weights), leaving ample KV-cache headroom at batch 256. FP8 halves the weights to 24 GB, fitting comfortably on a single H100 and halving the cluster cost.

| GPU | Config | On-demand $/hr | Throughput (tok/s) | CPM ($) |
|---|---|---|---|---|
| H100 SXM5 | 2x | $5.80 | 3,400 | $0.47 |

Competitive CPM for a high-quality 24B model. If you can accept FP8 quality (which is generally fine for conversational tasks), single-GPU operation cuts cost further.

GPU Hardware Comparison: Cost Efficiency for Inference

Using Llama 3.3 70B as the reference model (benchmarked here on 8-GPU clusters in FP16), here is how the GPU lineup compares on CPM.

| GPU | VRAM | Memory BW | On-demand $/GPU/hr | 70B CPM (on-demand) | 70B Spot CPM |
|---|---|---|---|---|---|
| A100 80G SXM4 | 80 GB | 2.0 TB/s | $1.64 | $2.60 | N/A |
| H100 SXM5 | 80 GB | 3.35 TB/s | $2.90 | $2.30 | N/A |
| H200 SXM5 | 141 GB HBM3e | 4.8 TB/s | $3.69 | $2.28 | N/A |
| B200 SXM6 | 192 GB HBM3e | 8.0 TB/s | spot-only ($2.07) | N/A | $0.88 |
| L40S PCIe | 48 GB GDDR6 | 864 GB/s | $0.72 | N/A* | N/A |

*70B FP16 requires ~140 GB VRAM. L40S (48 GB) cannot host 70B in FP16. Use INT4 with AWQ or GPTQ to fit on L40S, or use multi-GPU tensor parallelism.
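The footnote's fit check can be expressed directly. A sketch under the same simplification (weights only, no KV-cache headroom; the helper names are ours):

```python
# Bytes of weight storage per parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def fits(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in VRAM; real serving also needs KV cache."""
    return params_billion * BYTES_PER_PARAM[precision] <= vram_gb

print(fits(70, "fp16", 48))  # False: 140 GB of weights vs the L40S's 48 GB
print(fits(70, "int4", 48))  # True: 35 GB of weights
```

In practice you want meaningful headroom beyond the weights, so a near-exact fit is still a poor serving configuration.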

When A100 wins: For 30-70B models at budget-sensitive shops where throughput is not the bottleneck, the A100's lower hourly rate produces competitive CPM.

When H100 wins: For throughput-sensitive production APIs where you need the best on-demand CPM for 13-70B models, H100 edges out A100 despite the higher hourly rate. FP8 on H100 roughly doubles throughput without changing GPU count, making H100 the clear winner once quantization is in play.

When B200 wins: For very large MoE models (200B+) where memory bandwidth is the primary bottleneck, B200's 8.0 TB/s HBM3e shifts the CPM calculus. B200 SXM6 is available as spot-only on Spheron at $2.07/GPU/hr, making it the most cost-effective option for batch inference on models requiring this hardware tier. For B200 FP4 deployments via TensorRT-LLM, CPM drops further still.

Best GPU for AI Inference in 2026 has a full hardware decision matrix including L40S guidance for smaller models.

Impact of Quantization: FP16 vs FP8 vs INT4

Using Llama 3.3 70B on 8x H100 and 8x A100 as the reference configuration:

| GPU | Precision | Throughput (tok/s) | CPM ($) | Quality Impact vs FP16 |
|---|---|---|---|---|
| H100 SXM5 (8x) | FP16 | 2,800 | $2.30 | Baseline |
| H100 SXM5 (8x) | FP8 | 5,600 | $1.15 | <1-2% on MMLU, MT-Bench |
| H100 SXM5 (8x) | INT4 (AWQ) | 8,400 | $0.77 | 2-4%, task-dependent |
| A100 80G (8x) | FP16 | 1,400 | $2.60 | Baseline |
| A100 80G (8x) | INT8 (TRT-LLM) | 2,100 | $1.74 | <1-2% for most tasks |
| A100 80G (8x) | INT4 (GPTQ) | 3,500 | $1.04 | 2-5%, task-dependent |

Key takeaways:

FP8 is an H100/H200/B200 hardware feature. On A100, the equivalent path is INT8 via TensorRT-LLM or bitsandbytes, which gives a smaller throughput gain than native FP8 but still cuts CPM meaningfully.

INT4 (AWQ or GPTQ) delivers the lowest CPM but introduces more quantization error. For conversational AI and summarization, the quality delta is usually acceptable. For code generation, math reasoning, and precise factual recall, run your eval suite before committing.

B200 introduces FP4 support via TensorRT-LLM. For large models where FP4 weights are available, CPM drops by 30-40% versus FP8. See FP4 Quantization on Blackwell for the full breakdown.

The Tokens Per Watt Guide covers efficiency from a different angle if you are also optimizing for power and not just dollar cost.

Batch Size and Throughput: How Concurrent Users Change CPM

Batch size is the most impactful variable most teams ignore. Using Llama 3.3 70B on 8x H100 and 8x A100:

| Batch Size | Aggregate tok/s (H100 8x) | CPM H100 ($) | Aggregate tok/s (A100 8x) | CPM A100 ($) |
|---|---|---|---|---|
| 1 | ~25 | ~$258 | ~15 | ~$242 |
| 8 | ~200 | ~$32 | ~120 | ~$30 |
| 32 | ~720 | ~$8.95 | ~460 | ~$7.94 |
| 128 | ~2,100 | ~$3.07 | ~1,100 | ~$3.31 |
| 256 | ~2,800 | ~$2.30 | ~1,400 | ~$2.60 |

At batch size 1, the GPU spends most of its time streaming model weights from VRAM for a single active request, and compute utilization sits under 5%. At batch size 256 with continuous batching, the hardware runs near its sustained throughput ceiling. The CPM difference between these two states is roughly 100x for H100 and comparable for A100.

Most production teams are not running at batch 1. But a lot of teams are running at effective batch sizes in the 8-32 range due to under-configured inference servers, which puts them 10-30x higher on CPM than they need to be. The fix is configuring continuous batching properly and aligning max_num_batched_tokens to your actual traffic profile.
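One way to sanity-check your effective batch size against traffic is Little's law: in-flight requests ≈ arrival rate × average request duration. A sketch with illustrative numbers (not drawn from the benchmarks above):

```python
def effective_concurrency(requests_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: steady-state in-flight requests = arrival rate x residence time."""
    return requests_per_sec * avg_latency_sec

# 12 req/s with ~2 s end-to-end generations -> ~24 requests in flight on average,
# so max_num_seqs (and max_num_batched_tokens) should comfortably exceed that.
print(effective_concurrency(12, 2.0))  # 24.0
```

If this number sits far below your configured batch ceiling, your CPM is set by traffic, not hardware; if it sits above it, requests queue and you are leaving batching throughput on the table.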

For detailed serving optimization, see LLM Serving Optimization: Continuous Batching.

Spot vs On-Demand: GPU Availability and Cost Per Token

As of April 2026, spot GPU instances on Spheron are available for B200 SXM6 at $2.07/GPU/hr. H100 SXM5, A100 80G SXM4, and H200 SXM5 are currently on-demand only. A100 80G PCIe (not covered in the SXM4 benchmarks above) has spot availability at $1.14/hr for teams that can work with PCIe bandwidth.

| GPU | On-demand $/GPU/hr | Spot $/GPU/hr | On-demand CPM (70B) | Spot CPM (70B) |
|---|---|---|---|---|
| H100 SXM5 | $2.90 | N/A | $2.30 | N/A |
| A100 80G SXM4 | $1.64 | N/A | $2.60 | N/A |
| H200 SXM5 | $3.69 | N/A | $2.28 | N/A |
| B200 SXM6 | N/A (spot-only) | $2.07 | N/A | $0.88 |
| L40S PCIe | $0.72 | N/A | N/A* | N/A |

*70B FP16 does not fit on L40S. Check current pricing for live spot rates, as availability changes.

For B200, the spot-only pricing at $2.07/GPU/hr makes it unusually competitive for batch workloads that can tolerate interruption. An 8x B200 spot cluster at $16.56/hr delivers 5,200 tok/s for 70B models, giving a CPM of $0.88. That is lower than the on-demand CPM of any other GPU in this comparison for 70B inference.

The workloads that should use spot: embedding pipelines, offline document summarization, nightly eval runs, batch enrichment, RAG pre-computation. These can tolerate a restart if the instance is reclaimed.

The workloads that should not use spot: synchronous APIs with latency SLAs, streaming chat endpoints, any pipeline where request retry adds unacceptable latency.

For a deeper treatment of spot vs. on-demand vs. reserved trade-offs, see Serverless GPU vs On-Demand vs Reserved Instances.

Optimization Playbook: 5 Strategies to Halve Your Cost Per Token

  1. Switch to continuous batching. If you are still running a naive per-request inference loop, switching to vLLM or SGLang with continuous batching and PagedAttention is typically the highest-ROI move available. Batch size 8 vs batch size 256 is a 10-30x CPM difference on the same hardware.
  2. Apply FP8 on H100 or newer hardware. FP8 is the safest quantization option for production inference on Hopper and newer GPUs. Quality degradation is under 2% on most standard benchmarks, and throughput roughly doubles. This halves CPM without changing cluster size or model quality.
  3. Right-size GPU count. Don't run a 7B model on an 8-GPU pod because that is what your cluster defaults to. Match GPU count to model VRAM requirements, then add headroom for KV cache. A 7B FP16 model fits on a single L40S (48 GB). Running it on 8x H100 is throwing money away.
  4. Move batch workloads to spot instances where available. B200 SXM6 spot at $2.07/GPU/hr is the best current spot option on Spheron for large MoE models needing high memory bandwidth. For any async inference workload that fits on B200, this significantly undercuts on-demand H100 or H200 CPM.
  5. Use the right model for the task. A 70B model for a classification task that a fine-tuned 7B handles equally well costs 10-15x more per token. Audit your routing logic. Reserve large models for tasks where they clearly outperform smaller alternatives.
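The right-sizing step above can be sanity-checked with a few lines. This sketch sizes to weights plus a KV-cache headroom fraction; the 25% headroom default is our assumption, not a benchmark result:

```python
import math

def min_gpus(params_billion: float, bytes_per_param: float,
             vram_gb: float, kv_headroom: float = 0.25) -> int:
    """Smallest GPU count whose combined VRAM holds the weights while reserving
    a headroom fraction per GPU for KV cache. This is a VRAM floor only;
    throughput targets often justify more GPUs than this."""
    usable_per_gpu = vram_gb * (1 - kv_headroom)
    return math.ceil(params_billion * bytes_per_param / usable_per_gpu)

print(min_gpus(7, 2.0, 48))   # 1: a 7B FP16 model fits a single L40S
print(min_gpus(70, 2.0, 80))  # 3: the VRAM floor; the 8x configs here size for throughput
```

The point of the floor is to catch waste in the other direction: if your cluster defaults hand an 8-GPU pod to a model with a floor of 1, you are paying roughly 8x the necessary hourly rate.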

For a deeper treatment of each layer, see GPU Cost Optimization Playbook.

Full Cross-Model, Cross-GPU Reference Table

All CPM figures are on-demand unless noted. B200 SXM6 is spot-only; its cluster cost reflects spot pricing. H100, A100 80G SXM4, and H200 are currently on-demand only on Spheron.

| Model | GPU Config | Cluster $/hr | Throughput (tok/s) | On-demand CPM | Spot CPM |
|---|---|---|---|---|---|
| Llama 4 Scout 17B (MoE) | 1x H100 SXM5 | $2.90 | 4,200 | $0.19 | N/A |
| Llama 4 Scout 17B (MoE) | 1x A100 80G | $1.64 | 2,800 | $0.16 | N/A |
| Qwen 3 30B-A3B (MoE) | 1x H100 SXM5 | $2.90 | 3,900 | $0.21 | N/A |
| Qwen 3 32B | 2x H100 SXM5 | $5.80 | 3,200 | $0.50 | N/A |
| Gemma 4 31B | 1x H100 SXM5 | $2.90 | 3,100 | $0.26 | N/A |
| Mistral Small 3.1 24B | 2x H100 SXM5 | $5.80 | 3,400 | $0.47 | N/A |
| Llama 3.3 70B | 8x A100 80G | $13.12 | 1,400 | $2.60 | N/A |
| Llama 3.3 70B | 8x H100 SXM5 | $23.20 | 2,800 | $2.30 | N/A |
| Llama 3.3 70B | 8x H200 SXM5 | $29.52 | 3,600 | $2.28 | N/A |
| Llama 3.3 70B | 8x B200 SXM6 | $16.56 (spot) | 5,200 | N/A | $0.88 |
| DeepSeek V3 (671B MoE) | 8x H100 SXM5 | $23.20 | 1,800 | $3.58 | N/A |

All throughput figures are measured on vLLM with continuous batching at batch size 256, 512 input / 512 output tokens, FP16. They are engineering approximations, not guaranteed throughput under all conditions.

For most teams running 30-70B models, FP8 on H100 is the sweet spot: competitive CPM, safe quality, and broad model support. For batch workloads on very large models, B200 spot at $2.07/GPU/hr delivers the best available CPM. If you are running DeepSeek V3 or other 200B+ MoE models, the on-demand H100 numbers look steep but B200 spot brings CPM under $1/M, competitive with smaller dense models on on-demand.


Running inference at scale? Spheron's on-demand and spot H100, A100, B200, and H200 instances give you the GPU configurations benchmarked in this post, billed per minute with no reserved instance lock-in.

Rent H100 → | Rent A100 → | Rent B200 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.