Inference now accounts for roughly two-thirds of all AI compute in 2026, having overtaken training as the dominant workload. Models are trained once; they serve requests millions of times. That shift means the GPU decision for inference deserves the same rigor that training GPU selection has always gotten. This guide gives you a decision framework for which GPU to rent for inference, based on real cost-per-token data across L40S, H100, H200, and B200. For deeper benchmarks across providers, see our GPU cloud benchmarks guide. For current GPU availability, visit GPU rental. New to Spheron? The quick-start guide walks through your first GPU deployment in minutes.
Why Inference Has Overtaken Training
Models are trained infrequently and inference runs constantly. A single training run for a 70B model takes weeks on a cluster of hundreds of GPUs. That same model then serves millions of requests per day, each one consuming GPU time. At scale, inference dominates compute spend by a large margin.
This matters architecturally because training and inference have different bottlenecks. Training is compute-bound: you want maximum TFLOPS to complete each forward and backward pass as fast as possible. Inference at low batch sizes is memory-bandwidth-bound: the GPU spends most of its time loading model weights from VRAM into compute units, not doing the actual matrix multiplications. The L40S has roughly a quarter of the H100's memory bandwidth (864 GB/s versus 3.35 TB/s), yet for a Llama 3.1 8B endpoint handling 1-4 concurrent requests, it runs at nearly the same throughput because both GPUs are waiting on memory, not compute.
At high batch sizes, the calculus flips. Once you're running 32+ concurrent requests, the GPU becomes more compute-bound and TFLOPS start to matter more. That's when the H100's 1,979 dense FP8 TFLOPS (versus the L40S's 733) produces proportional throughput advantages.
One more inference-specific bottleneck worth understanding: the KV cache. During generation, the model stores key and value tensors for each token in the context in GPU VRAM. The larger your context window and the more concurrent requests you handle, the more VRAM the KV cache consumes. For long-context workloads (128K+ token windows), VRAM capacity matters as much as bandwidth. For VRAM sizing guidance by model size, see our GPU memory requirements guide for LLMs.
The Key Metrics for Inference GPUs
Tokens Per Second (Throughput)
Tokens/sec is the primary throughput metric: how many output tokens the GPU can generate per second at a given batch size. It scales with batch size up to the GPU's memory-bandwidth ceiling, then starts to plateau as the GPU becomes compute-saturated. Always measure throughput at your actual expected batch size, not at batch 1 or batch 512.
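A useful back-of-envelope check on the memory-bound regime: at batch 1, every generated token requires streaming the full set of model weights from VRAM once, so tokens/sec is capped at bandwidth divided by weight bytes. A minimal sketch (the ceiling ignores KV cache reads and kernel overhead, so measured numbers land somewhat below it):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_b: float,
                         bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode throughput for a memory-bandwidth-bound GPU."""
    weight_gb = params_b * bytes_per_param  # GB of weights streamed per generated token
    return bandwidth_gb_s / weight_gb

# Llama 3.1 8B in FP16 (2 bytes/param) on an L40S (864 GB/s):
print(decode_ceiling_tok_s(864, 8, 2))  # → 54.0 tok/s ceiling
```

The measured ~46 tok/s figure cited later in this guide sits just under that 54 tok/s ceiling, which is the signature of a bandwidth-bound workload.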
Time to First Token (Latency)
TTFT measures how long before the first output token is returned to the user. It's driven by prefill speed: the GPU has to process the entire input prompt before generating any output. High HBM bandwidth reduces TTFT because the model weights can be loaded faster. The H100 and H200 have a consistent TTFT advantage over the L40S at equivalent batch sizes, which matters for interactive applications.
Tokens Per Second Per Dollar (Cost Efficiency)
The composite metric for comparing GPUs. Use this formula:
Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

Run this at your expected operating batch size, not peak or idle. The cheapest GPU per hour is often not the cheapest per token at production batch sizes.
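The formula translates directly into code. A minimal helper, worked through with the L40S numbers used later in this guide ($0.91/hr at ~336 tok/s):

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per 1M output tokens at a sustained throughput."""
    tokens_per_hr = tokens_per_sec * 3_600
    return price_per_hr / tokens_per_hr * 1_000_000

# L40S on-demand, Llama 3.1 8B at batch 8 FP16:
print(round(cost_per_million_tokens(0.91, 336), 2))  # → 0.75 ($/M tokens)
```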
KV Cache Capacity (Concurrency Headroom)
How many concurrent requests the GPU can hold in VRAM without eviction depends on your VRAM size and context length. For a 70B model served in FP8, model weights consume approximately 70 GB of VRAM. That leaves roughly 10 GB for KV cache on the H100 (80 GB total) versus roughly 71 GB on the H200 (141 GB total), a difference of about 7x KV cache headroom for that specific configuration. For smaller models, the ratio is closer to the raw VRAM ratio of 1.76x. The B200's 192 GB takes this further still. For high-concurrency deployments with long contexts and large models, VRAM capacity directly determines how many users you can serve from one GPU.
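The headroom arithmetic can be sketched in a few lines. The attention config below (80 layers, 8 grouped-query KV heads, head dim 128) is an assumption for a Llama-3-70B-class model, used purely for illustration; real per-token KV size varies by architecture:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_el: int) -> int:
    """Per-token KV cache: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el

def max_concurrent_requests(vram_gb: float, weights_gb: float,
                            context_len: int, per_tok_bytes: int) -> int:
    """Requests that fit in leftover VRAM, each holding a full-context KV cache."""
    headroom_bytes = (vram_gb - weights_gb) * 1e9  # decimal GB for simplicity
    return int(headroom_bytes // (context_len * per_tok_bytes))

# Assumed 70B-class config served in FP8 (1 byte/element):
per_tok = kv_bytes_per_token(80, 8, 128, 1)             # 163,840 bytes ≈ 160 KB/token
print(max_concurrent_requests(80, 70, 8192, per_tok))   # H100: ~7 requests at 8K context
print(max_concurrent_requests(141, 70, 8192, per_tok))  # H200: ~52 requests
```

The ~7x gap between those two outputs is the KV headroom difference described above, independent of context length.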
GPU Comparison for Inference: L40S vs H100 vs H200 vs B200
Quick reference:
| GPU | VRAM | Memory BW | FP8 TFLOPS | On-demand Price/hr | Best model size | Best for |
|---|---|---|---|---|---|---|
| L40S | 48 GB GDDR6 | 864 GB/s | 733 TFLOPS (dense) | from $0.91 | 7B–13B | Cost-efficient moderate-traffic |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS (dense) | from $2.01 | 13B–70B | High-throughput production |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS (dense) | from $4.54 | 70B–200B | Large-model, long-context |
| B200 | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS (dense) | from $6.03 | 160B+ or FP4 70B | Maximum throughput, FP4 |
Spheron on-demand pricing as of March 20, 2026. Prices can fluctuate over time based on GPU availability. Check current pricing for live rates.
For a broader ranked list of NVIDIA GPUs for LLM workloads including consumer and workstation options, see our best NVIDIA GPUs for LLMs guide. For a head-to-head comparison with the RTX 5090, see the RTX 5090 vs H100 vs B200 guide.
L40S: The 7B–13B Inference Workhorse
The L40S is an Ada Lovelace GPU with 48 GB GDDR6 and 864 GB/s bandwidth. It's not an HBM chip, which is why it often gets dismissed in spec-sheet comparisons. But for inference on 7B–13B parameter models, bandwidth is sufficient at moderate concurrency and the price advantage is real.
FP8 support via Transformer Engine (4th-gen Tensor Cores) gives it 733 dense FP8 TFLOPS. Benchmark data from real L40S deployments: Llama 3.1 8B runs at roughly 46 tok/s at batch 1 FP16, and approximately 336 tok/s at batch 8 FP16 (see the Spheron Llama 3 guide for step-by-step setup). At $0.91/hr on-demand and batch 8, that's $0.91 ÷ (336 × 3,600) × 1,000,000 = approximately $0.75/M tokens.
No MIG, no NVLink. Single-card only for practical inference serving. Spot pricing at $0.41/hr makes the L40S the most cost-accessible option for experimental and batch workloads where occasional interruptions are acceptable. Where it wins: endpoints under 30 concurrent requests, 7B–34B INT4 models, and any deployment where the fixed hourly cost matters more than peak throughput. See our L40S inference guide for detailed vLLM benchmarks, or rent an L40S on Spheron. For LLM serving setup guides including vLLM, Ollama, and SGLang on Spheron GPUs, see the Spheron LLM inference docs or follow the vLLM setup guide to get vLLM running on Spheron in minutes.
H100 SXM: The Production Inference Standard
The H100 SXM is the baseline for production inference in 2026. Hopper architecture, 80 GB HBM3, 3.35 TB/s bandwidth. The 3.9x bandwidth advantage over the L40S becomes decisive at batch sizes of 16+ concurrent requests.
MIG support enables up to 7 isolated GPU instances from a single H100, each with dedicated VRAM and compute. This is the right architecture for multi-tenant inference platforms where you need guaranteed isolation between customers. NVLink 4 at 900 GB/s makes 2-GPU tensor parallel serving for 70B models efficient.
Estimated cost-per-token at high concurrency FP8: approximately $0.182/M on-demand at $2.01/hr. At $0.99/hr spot pricing, that drops to around $0.090/M for batch workloads where interruptions are acceptable.
Where it wins: 13B–70B models at production concurrency, multi-tenant deployments using MIG, and latency-sensitive APIs where TTFT is constrained. See our H100 vs H200 comparison or rent an H100 on Spheron. For step-by-step vLLM deployment on H100, see the Spheron vLLM server guide.
H200: Large-Model and Long-Context Inference
The H200 uses the same Hopper compute die as the H100 but pairs it with 141 GB HBM3e at 4.8 TB/s. That's 76% more VRAM and 43% more memory bandwidth than the H100. The extra bandwidth produces approximately 40-43% faster LLM inference throughput than the H100 on Llama 2 70B, based on comparing best published MLPerf Inference results (H200 at ~34,988 tok/s offline in MLPerf v5.0; H100 best result at ~24,525 tok/s offline in MLPerf v4.1).
The KV cache advantage is substantial for large models. When running a 70B model in FP8, model weights consume roughly 70 GB, leaving only ~10 GB for KV cache on the H100 versus ~71 GB on the H200, a 7x difference in KV headroom for that configuration. For long-context workloads at 128K+ token windows with large models, this is often the deciding factor.
Cost-per-token at FP8 for 70B-class models: approximately $0.288/M on-demand at $4.54/hr. Spot pricing at $1.78/hr is available for interruptible workloads. At current on-demand pricing, the H100 on-demand delivers lower cost-per-token ($0.182/M) for standard 70B FP8 serving. The H200's advantage is its 141 GB VRAM, which enables 70B FP16 serving, 128K+ context windows without KV cache evictions, and higher sustained concurrency on large models.
Where it wins: 70B models in FP16 (which need 140 GB and don't fit on H100 at all), long-context workloads requiring 128K+ token windows, and deployments where KV cache headroom and TTFT matter more than lowest hourly cost. See our H200 rental guide or rent an H200 on Spheron. For serving Llama 3.1 70B and other large dense models, see the Spheron Llama 3 deployment guide.
B200: FP4 Throughput and Maximum VRAM
The B200 is NVIDIA's Blackwell GPU: 192 GB HBM3e at 8.0 TB/s, 4,500 dense FP8 TFLOPS, and exclusive FP4 tensor core support. FP4 reaches 9,000 dense TFLOPS, or up to approximately 18,000 TFLOPS with structured sparsity enabled.
MLPerf Inference v5.1 data (September 2025) for an 8-GPU B200 system in offline mode shows approximately 102,725 tok/s for Llama 2 70B using FP4, roughly 2.9x the throughput of comparable 8-GPU H200 systems from MLPerf v5.0 (34,988 tok/s). On a per-GPU basis, that's approximately 12,841 tok/s at scale. Cost-per-token with FP4 on 70B-class models: estimated approximately $0.130/M on-demand, versus the H100 on-demand rate of around $0.182/M.
The 192 GB VRAM is the other advantage. It enables single-card serving of large dense models that need more than H200's 141 GB: models in the 160B–190B range at FP8, or dense 70B models with substantial KV cache headroom for long-context workloads. For teams running 90B+ dense models from a single GPU, the B200 provides headroom that H100 and H200 cannot match. Very large MoE architectures like Llama 4 Maverick (400B total parameters, requiring 8xH200 for production deployment on Spheron per the Llama 4 Scout and Maverick deployment guide and our Llama 4 GPU cloud deployment guide) still require multi-GPU tensor parallelism even on B200, since all expert weights must reside in VRAM simultaneously.
Where it wins: maximum single-GPU throughput for 70B+ models, FP4-validated workloads, and deployments needing more than 141 GB VRAM for large models. FP4 requires quality validation per task before production use. See our complete B200 guide or rent a B200 on Spheron. For workloads requiring more than 192 GB VRAM, the B300 (288 GB HBM3e, 8.0 TB/s, approximately 1.56x the FP4 TFLOPS of B200 at 14 PFLOPS vs 9 PFLOPS) is now available in limited early access on Spheron. See current GPU pricing for B300 rates or rent a B300.
B200 FP4 per-GPU figure (~12,841 tok/s) derived from the MLPerf Inference v5.1 (September 2025) 8-GPU result (102,725 tok/s) divided by 8. B200 FP8 estimate uses the H100 SXM MLPerf v4.1 baseline scaled by the B200/H100 FP8 TFLOPS ratio. Run your own benchmarks before production decisions.
FP8 and FP4 Quantization: What They Do to Your Inference Cost
| Config | $/hr | Precision | Est. tok/s (70B) | Cost/M tokens |
|---|---|---|---|---|
| H100 SXM, FP8 (on-demand) | $2.01 | FP8 | ~3,066 | ~$0.182 |
| H100 SXM, FP8 (spot) | $0.99 | FP8 | ~3,066 | ~$0.090 |
| H200 SXM, FP8 (on-demand) | $4.54 | FP8 | ~4,374 | ~$0.288 |
| B200, FP8 | $6.03 | FP8 | ~6,972 | ~$0.240 |
| B200, FP4 | $6.03 | FP4 | ~12,841 | ~$0.130 |
Spheron on-demand pricing as of March 20, 2026. Prices can fluctuate over time based on GPU availability. H100 SXM throughput figure is a per-GPU estimate from the 8-GPU MLPerf Inference v4.1 result (24,525 tok/s) divided by 8. H200 SXM throughput figure is a per-GPU estimate from the 8-GPU MLPerf Inference v5.0 (April 2025) result (34,988 tok/s) divided by 8. B200 FP8 throughput (~6,972 tok/s) is derived from the H100 SXM per-GPU baseline (~3,066 tok/s) scaled by the B200/H100 FP8 TFLOPS ratio (4,500/1,979 ≈ 2.274), consistent with the methodology used in the cross-referenced FP4 quantization and Blackwell cost guide. B200 FP4 throughput (~12,841 tok/s) is from the MLPerf Inference v5.1 (September 2025) 8-GPU result (102,725 tok/s for Llama 2 70B offline, FP4 precision) divided by 8. Run your own benchmarks before production decisions. Check current pricing for live rates.
FP8 is the safe default. It's supported on H100, H200, and Blackwell GPUs, delivers roughly 1.5–2x throughput over FP16, and introduces less than 1–2% quality degradation on standard benchmarks for Llama, Mistral, and DeepSeek model families (see the Spheron DeepSeek R1 deployment guide or the Mistral and Mixtral deployment guide for working FP8 setups on Spheron). For virtually any production conversational AI, summarization, or code generation task, you'll never notice the difference. For CUDA driver versions by GPU architecture and ready-to-use Spheron environment templates, see the CUDA and NVIDIA drivers guide.
FP4 is Blackwell-only and delivers approximately 1.5-2x FP8 throughput on the same hardware depending on workload and model configuration. Memory-bandwidth-bound inference (small batch sizes) sees gains closer to 2x; compute-saturated batch workloads typically see 1.5-1.7x. But it requires task-specific quality validation. Get pre-calibrated FP4 weights from the nvidia/ namespace on Hugging Face, compare FP16 and FP4 outputs on your actual task, and only move to FP4 in production if the quality delta is acceptable. Fine for conversational AI and creative writing; needs more scrutiny for math, reasoning, or scientific tasks. See the FP4 quantization and Blackwell cost guide for the full framework support and quality breakdown.
The key insights from this table:

- B200 FP4 at $0.130/M delivers the lowest cost-per-token among on-demand options, beating H100 on-demand ($0.182/M) by about 29% while also offering significantly higher throughput per GPU.
- H100 on-demand at $0.182/M is the standard baseline for production 70B serving.
- B200 FP8 at $0.240/M costs more per token than H100 on-demand: the B200's approximately 2.3x TFLOPS advantage does not offset its 3x price premium at on-demand rates.
- H200 on-demand at $0.288/M carries a premium that reflects its 141 GB VRAM and lower TTFT, not raw cost-per-token efficiency.
- For batch workloads where occasional interruptions are acceptable, H100 spot at $0.99/hr drops the cost-per-token to around $0.090/M.
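The cost column in the table is reproducible from the $/hr and tok/s columns alone. A quick sanity-check script using the same figures:

```python
def cost_per_m(price_hr: float, tok_s: float) -> float:
    """Cost per 1M tokens: hourly price over tokens generated per hour."""
    return price_hr / (tok_s * 3_600) * 1_000_000

# ($/hr, per-GPU tok/s) pairs from the table above
table = {
    "H100 FP8 on-demand": (2.01, 3_066),
    "H100 FP8 spot":      (0.99, 3_066),
    "H200 FP8 on-demand": (4.54, 4_374),
    "B200 FP8":           (6.03, 6_972),
    "B200 FP4":           (6.03, 12_841),
}
for name, (price, tps) in table.items():
    print(f"{name}: ${cost_per_m(price, tps):.3f}/M")
# → 0.182, 0.090, 0.288, 0.240, 0.130 — matching the table
```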
Batch Inference vs Real-Time Serving: Different GPU Needs
Real-Time Serving (Chatbots, Copilots, APIs)
Real-time serving requires low TTFT (under 500ms) and consistent p99 latency. Runtime batch size is typically 1–8 concurrent requests. At these batch sizes, inference is memory-bandwidth-bound.
HBM GPUs win. The H100 and H200 have the bandwidth to load weights quickly even at batch 1. The L40S can serve 7B models at interactive speeds at batch 1–4, but at batch 8+ for 13B+ models, the H100's bandwidth advantage compounds.
- 7B–13B, cost-sensitive endpoints: L40S. You get approximately 46 tok/s at batch 1 for Llama 8B at $0.91/hr on-demand.
- 13B–70B, production API: H100 SXM. Better TTFT, MIG for multi-tenancy.
- 70B+, long-context: H200 SXM. The extra VRAM holds more KV cache; the bandwidth edge holds TTFT down.
Batch Inference (Offline Processing, Evals, Embeddings)
Batch workloads are different. Latency is flexible; you want maximum throughput and minimum cost-per-token. Run at batch 64–512 and the GPU becomes compute-saturated rather than bandwidth-limited.
At these batch sizes, the B200's 4,500 dense FP8 TFLOPS (versus the H100's 1,979) produces proportional throughput gains. And H100 spot at $0.99/hr makes it highly competitive for batch jobs where availability interruptions are acceptable.
- 7B–13B, high-volume batch jobs: L40S FP8. Lowest cost-per-token at this model size.
- 70B, batch processing: H100 spot FP8 at $0.99/hr. Best combination of cost and availability.
- 70B+ max throughput: B200 FP4. Only worthwhile if FP4 quality validates for your task.
Decision table:
| Workload | Latency goal | Model size | Recommended GPU |
|---|---|---|---|
| Interactive chatbot | < 500ms TTFT | 7B–13B | L40S |
| Interactive chatbot | < 500ms TTFT | 13B–70B | H100 SXM |
| Interactive chatbot | < 500ms TTFT | 70B+ | H200 SXM |
| Batch processing | Flexible | 7B–13B | L40S (FP8) |
| Batch processing | Flexible | 70B | H100 spot (FP8) |
| Batch processing | Flexible | 70B+ max throughput | B200 (FP4) |
| Multi-tenant API | SLA per tenant | Any | H100 SXM + MIG |
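The decision table maps cleanly to a routing helper. This is an illustrative sketch of the table's logic, not a Spheron API; the function name and parameters are made up for this example, and thresholds come straight from the rows above:

```python
def recommend_gpu(workload: str, model_b: float, max_throughput: bool = False,
                  multi_tenant: bool = False) -> str:
    """Pick a GPU per the decision table. workload is 'interactive' or 'batch'."""
    if multi_tenant:
        return "H100 SXM + MIG"        # per-tenant SLA, any model size
    if workload == "interactive":      # < 500ms TTFT target
        if model_b <= 13:
            return "L40S"
        if model_b <= 70:
            return "H100 SXM"
        return "H200 SXM"
    # batch: latency flexible, optimize cost-per-token
    if model_b <= 13:
        return "L40S (FP8)"
    if max_throughput:
        return "B200 (FP4)"            # only if FP4 quality validates for the task
    return "H100 spot (FP8)"

print(recommend_gpu("interactive", 70))                  # → H100 SXM
print(recommend_gpu("batch", 70))                        # → H100 spot (FP8)
print(recommend_gpu("batch", 70, max_throughput=True))   # → B200 (FP4)
```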
Multi-Model Serving Strategies
MIG: Multi-Instance GPU on H100, H200, and B200
H100, H200, and B200 support MIG, which partitions one GPU into up to 7 fully isolated instances, each with dedicated VRAM and compute. The use case: host multiple small models simultaneously on one GPU. A single H100 can run seven separate 7B endpoints in INT4, each isolated from the others. The B200 also supports MIG with FP4, FP8, and FP16 precision, though at $6.03/hr on-demand the H100 at $2.01/hr is far more cost-efficient for multi-tenant inference unless you need B200-scale throughput per instance.
This is the right architecture for multi-tenant inference platforms where different customers or models need guaranteed resource isolation. The L40S does not support MIG.
vLLM Continuous Batching
vLLM's continuous batching (PagedAttention) pipelines prefill and decode stages to maximize GPU utilization. A single L40S running vLLM can handle far more concurrent users than naive batch limits suggest, because the engine interleaves requests that are in different stages. See our vLLM production deployment guide for setup details, the Spheron vLLM setup guide for step-by-step deployment on Spheron GPUs, or the full LLM inference docs for other frameworks.
Additional framework guides:
- Ollama + Open WebUI
- SGLang
- TensorRT-LLM + Triton
- llama.cpp Server for GGUF model serving with CPU/GPU offload
- LMDeploy for TurboMind-based inference on A100/H100 with AWQ quantization
- LocalAI for a self-hosted OpenAI-compatible server supporting LLMs and Stable Diffusion in a single container
Model-specific deployment guides:
- Qwen3 Dense and MoE for the Qwen3 family (8B through 235B MoE variants) on Spheron GPUs
- Phi-4 for the 14B SLM with strong reasoning performance at low VRAM cost
- Gemma 3 for the 4B through 27B Gemma family from Google
Tensor Parallelism for Large Models
Models that don't fit on one GPU require tensor parallelism: splitting each transformer layer across multiple cards. The quality of inter-GPU communication matters here. H100 and H200 use NVLink 4 at 900 GB/s total bidirectional bandwidth; the L40S is limited to PCIe Gen4 at roughly 64 GB/s bidirectional (32 GB/s per direction). That's approximately a 14x bandwidth advantage for all-reduce operations in tensor-parallel inference, which becomes a real bottleneck at scale. For large multi-GPU deployments, use NVLink-connected SXM boards.
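A back-of-envelope illustration of why that 14x matters. The hidden size and token count below are assumptions for a 70B-class model at moderate batch, and real all-reduce cost also depends on link latency and compute overlap, so treat this as a rough transfer-time sketch only:

```python
# Bytes exchanged in one tensor-parallel all-reduce of a layer's activations
hidden_size = 8_192    # 70B-class hidden dimension (assumption)
batch_tokens = 32      # tokens in flight per decode step (assumption)
msg_bytes = hidden_size * batch_tokens * 2  # FP16 activations → 512 KiB

nvlink_s = msg_bytes / 900e9  # NVLink 4: 900 GB/s
pcie_s = msg_bytes / 64e9     # PCIe Gen4 x16: 64 GB/s bidirectional
print(f"{pcie_s / nvlink_s:.1f}x")  # → 14.1x — the bandwidth ratio carries through
```

Per message the absolute times are microseconds, but tensor parallelism performs an all-reduce per layer per step, so an 80-layer model pays this gap thousands of times per second.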
The B200 includes NVLink 5 at 1.8 TB/s GPU-to-GPU interconnect (2x the 900 GB/s of NVLink 4 on H100/H200) and is the right choice for tensor-parallel serving of 400B+ MoE models.
Model Colocation Across GPU Types
A practical strategy for larger inference platforms: run B200 or H200 for large 70B+ models that need high throughput, and route 7B–13B requests to L40S clusters where the per-token cost is lower. A single inference platform with tiered GPU pools routes by model size and latency SLA, minimizing total infrastructure cost without sacrificing performance on large-model requests.
Decision Framework: Which GPU for Your Workload
- Model fits in 48 GB GDDR6? (7B–34B at INT4) Start with the L40S at $0.91/hr on-demand.
- Model needs 70B+ at FP8, or you need MIG? H100 SXM. Use spot pricing ($0.99/hr) for batch workloads, on-demand ($2.01/hr) for production.
- Running 70B+ at FP16, or long context (128K+) with large concurrency? H200 SXM. Its 141 GB VRAM enables 70B FP16 serving and provides 7x more KV cache headroom than the H100 for large model configurations.
- Need maximum single-GPU throughput or more than 141 GB VRAM? B200. FP4 at $0.130/M beats H100 on-demand on cost-per-token. Only worthwhile if FP4 quality validates for your task. For requirements exceeding 192 GB VRAM, consider the B300 (288 GB HBM3e, approximately 1.56x B200 FP4 TFLOPS at 14 PFLOPS vs 9 PFLOPS), now available in limited early access on Spheron.
- Multi-GPU tensor parallelism for 400B+ models? H200 or B200 with NVLink. B300 NVLink bandwidth of 1.8 TB/s (same as B200) also works for distributed workloads requiring more than 192 GB per node.
Before committing to a GPU at scale, rent a single instance on Spheron at per-minute billing and run your own benchmark with vLLM and your production model. The cost-per-token numbers in this guide are a starting point. Real throughput depends on your model architecture, context length, and batch size. For strategies on reducing GPU spend, including spot instances and reserved pricing, see the Spheron cost optimization guide. For the difference between spot, dedicated, and cluster instances, see the Spheron instance types guide. Reserved GPUs offer 30-50% savings over on-demand rates for longer commitments; see the reserved GPU guide for details. For GPU availability by provider and data center region on Spheron, see the regions and providers guide. For programmatic GPU provisioning and management, see the Spheron API reference.
Spheron offers all GPUs covered in this guide at per-minute billing with no long-term commitment.
Rent L40S → | Rent H100 → | Rent H200 → | Rent B200 → | Rent B300 → | View all GPU pricing →
