Inference now accounts for roughly two-thirds of all AI compute in 2026, having overtaken training as the dominant workload. Models are trained once; they serve requests millions of times. That shift means the GPU decision for inference deserves the same rigor that training GPU selection has always gotten. This guide gives you a decision framework for which GPU to rent for inference, based on real cost-per-token data across L40S, H100, H200, and B200.

For deeper benchmarks across providers, see our GPU cloud benchmarks guide. For current GPU availability, visit GPU rental. New to Spheron? The quick-start guide walks through your first GPU deployment in minutes. For a full three-generation architecture comparison (Hopper, Blackwell, Rubin), see Rubin vs Blackwell vs Hopper. The April 2026 MLPerf Inference v6.0 results give the most current standardized throughput numbers for H200, B200, and MI355X. See the GPU cloud news feed for the latest hardware availability updates and pricing changes across providers.
Why Inference Has Overtaken Training
Models are trained infrequently and inference runs constantly. A single training run for a 70B model takes weeks on a cluster of hundreds of GPUs. That same model then serves millions of requests per day, each one consuming GPU time. At scale, inference dominates compute spend by a large margin.
This matters architecturally because training and inference have different bottlenecks. Training is compute-bound: you want maximum TFLOPS to complete each forward and backward pass as fast as possible. Inference at low batch sizes is memory-bandwidth-bound: the GPU spends most of its time loading model weights from VRAM into compute units, not doing the actual matrix multiplications. The L40S has 3.9x less memory bandwidth than the H100, yet for a Llama 3.1 8B endpoint handling 1-4 concurrent requests, it runs at nearly the same throughput because both GPUs are waiting on memory, not compute.
At high batch sizes, the calculus flips. Once you're running 32+ concurrent requests, the GPU becomes more compute-bound and TFLOPS start to matter more. That's when the H100's 1,979 dense FP8 TFLOPS (versus the L40S's 733) produces proportional throughput advantages.
One more inference-specific bottleneck worth understanding: the KV cache. During generation, the model stores key and value tensors for each token in the context in GPU VRAM. The larger your context window and the more concurrent requests you handle, the more VRAM the KV cache consumes. For long-context workloads (128K+ token windows), VRAM capacity matters as much as bandwidth. For VRAM sizing guidance by model size, see our GPU memory requirements guide for LLMs. For a full checklist of what makes inference slow and how to fix each issue, see Why Your LLM Inference Is Slow. If you have flexibility on model architecture, non-transformer models like Liquid AI's LFM family eliminate the KV cache entirely with a fixed-size recurrent state. LFM2 inference on GPU cloud shows how this shifts GPU tier requirements at 128K+ context lengths.
The Key Metrics for Inference GPUs
Tokens Per Second (Throughput)
Tokens/sec is the primary throughput metric: how many output tokens the GPU can generate per second at a given batch size. It scales with batch size up to the GPU's memory-bandwidth ceiling, then starts to plateau as the GPU becomes compute-saturated. Always measure throughput at your actual expected batch size, not at batch 1 or batch 512.
Time to First Token (Latency)
TTFT measures how long before the first output token is returned to the user. It's driven by prefill speed: the GPU has to process the entire input prompt before generating any output. High HBM bandwidth reduces TTFT because the model weights can be loaded faster. The H100 and H200 have a consistent TTFT advantage over the L40S at equivalent batch sizes, which matters for interactive applications.
Tokens Per Second Per Dollar (Cost Efficiency)
The composite metric for comparing GPUs. Use this formula:
Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

Run this at your expected operating batch size, not peak or idle. The cheapest GPU per hour is often not the cheapest per token at production batch sizes.
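As a minimal sketch, the cost formula can be wrapped in a small helper so the same calculation is applied to every GPU you compare. The function name is illustrative, and the inputs in the example are the L40S batch-8 figures quoted in this guide; substitute your own measured throughput.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at a sustained throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Example inputs: hourly price in dollars and throughput measured at your batch size.
l40s = cost_per_million_tokens(price_per_hour=0.72, tokens_per_sec=336)
print(f"L40S at batch 8: ${l40s:.2f} per 1M tokens")
```

Feeding the helper a throughput measured at batch 1 or at peak batch will give a misleading number, so take the measurement at the batch size you expect in production.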
The billing model you choose matters as much as the GPU itself. See our GPU billing models guide for a full cost breakdown by workload type. If you are choosing between a serverless platform and a dedicated GPU VM for inference, our Spheron vs Modal comparison breaks down the latency and cost trade-offs in detail.
For a full breakdown of inference cost optimization strategies across model, runtime, infrastructure, and FinOps layers, including a real-world case study cutting from $39K to $16K per month, see AI Inference Cost Economics in 2026. For a model-specific CPM breakdown across these GPU types, see GPU Cost Per Token: LLM Inference Benchmark 2026.
KV Cache Capacity (Concurrency Headroom)
How many concurrent requests the GPU can hold in VRAM without eviction depends on your VRAM size and context length. For a 70B model served in FP8, model weights consume approximately 70 GB of VRAM. That leaves roughly 10 GB for KV cache on the H100 (80 GB total) versus roughly 71 GB on the H200 (141 GB total), a difference of about 7x KV cache headroom for that specific configuration. For smaller models, the ratio is closer to the raw VRAM ratio of 1.76x. The B200's 192 GB takes this further still. For high-concurrency deployments with long contexts and large models, VRAM capacity directly determines how many users you can serve from one GPU.
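To make the headroom numbers concrete, here is a rough sketch that converts leftover VRAM into cached tokens. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumed Llama-3-70B-style configuration and the cache is assumed to be FP16; activation memory and runtime overhead are ignored, so real capacity will be lower.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(vram_gb: float, weights_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the VRAM left over after model weights."""
    free_bytes = max(vram_gb - weights_gb, 0) * 1e9
    return int(free_bytes // per_token_bytes)


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, 2-byte (FP16) cache entries.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for name, vram_gb in [("H100", 80), ("H200", 141), ("B200", 192)]:
    tokens = max_cached_tokens(vram_gb, weights_gb=70, per_token_bytes=per_token)
    print(f"{name}: ~{tokens:,} cached tokens shared across all concurrent requests")
```

Dividing the token budget by your typical context length gives a first estimate of how many requests one GPU can hold at once.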
GPU Comparison for Inference: L40S vs H100 vs H200 vs B200
Quick reference:
| GPU | VRAM | Memory BW | FP8 TFLOPS | On-demand Price/hr | Best model size | Best for |
|---|---|---|---|---|---|---|
| L40S | 48 GB GDDR6 | 864 GB/s | 733 TFLOPS (dense) | from $0.72 | 7B–13B | Cost-efficient moderate-traffic |
| H100 SXM5 | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS (dense) | from $2.50 | 13B–70B | High-throughput production |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS (dense) | from $4.54 | 70B–200B | Large-model, long-context |
| B200 | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS (dense) | from $6.02 | 160B+ or FP4 70B | Maximum throughput; spot $2.12 |
Spheron on-demand pricing as of 15 Apr 2026. Prices can fluctuate over time based on GPU availability. Check current pricing for live rates.
For a broader ranked list of NVIDIA GPUs for LLM workloads including consumer and workstation options, see our best NVIDIA GPUs for LLMs guide. For a head-to-head comparison with the RTX 5090, see the RTX 5090 vs H100 vs B200 guide. For a detailed RTX 5090 vs RTX 4090 benchmark including cost per million tokens, see our consumer GPU comparison for AI workloads. For diffusion model workloads specifically, see Deploy FLUX.2 on GPU Cloud for per-image cost benchmarks across H100 and A100.
L40S: The 7B–13B Inference Workhorse
The L40S is an Ada Lovelace GPU with 48 GB GDDR6 and 864 GB/s bandwidth. It's not an HBM chip, which is why it gets dismissed in GPU spec sheets. But for inference on 7B–13B parameter models, bandwidth is sufficient at moderate concurrency and the price advantage is real.
FP8 support via Transformer Engine (4th-gen Tensor Cores) gives it 733 dense FP8 TFLOPS. Benchmark data from real L40S deployments: Llama 3.1 8B runs at roughly 46 tok/s at batch 1 FP16, and approximately 336 tok/s at batch 8 FP16 (see the Spheron Llama 3 guide for step-by-step setup). At $0.72/hr on-demand and batch 8, that's $0.72 ÷ (336 × 3,600) × 1,000,000 = approximately $0.60/M tokens.
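A quick sanity check on the batch-1 figure: when decoding is memory-bandwidth-bound, every generated token requires reading roughly all model weights once, so bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below assumes a round 8B parameters at 2 bytes each and ignores KV cache reads, so treat it as an approximation rather than a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads dominate."""
    weights_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weights_gb


# L40S: 864 GB/s. Assumed model: 8B parameters in FP16 (2 bytes per parameter).
ceiling = decode_ceiling_tok_s(bandwidth_gb_s=864, params_billions=8, bytes_per_param=2)
measured = 46  # batch-1 figure quoted above
print(f"ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s ({measured / ceiling:.0%} of ceiling)")
```

If a measured batch-1 number sits far below this ceiling, the bottleneck is likely in the serving stack rather than the GPU.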
No MIG, no NVLink. Single-card only for practical inference serving. No spot pricing is available for the L40S, only on-demand at $0.72/hr, making it the most cost-accessible data center option for experimental and batch workloads. Where it wins: endpoints under 30 concurrent requests, 7B–34B INT4 models, and any deployment where the fixed hourly cost matters more than peak throughput. See our L40S inference guide for detailed vLLM benchmarks, or rent an L40S on Spheron, and see how L40 compares to L40S on FP8 if you're evaluating both. For LLM serving setup guides including vLLM, Ollama, and SGLang on Spheron GPUs, see the Spheron LLM inference docs or follow the vLLM setup guide to get vLLM running on Spheron in minutes.
H100 SXM: The Production Inference Standard
The H100 SXM is the baseline for production inference in 2026. Hopper architecture, 80 GB HBM3, 3.35 TB/s bandwidth. The 3.9x bandwidth advantage over the L40S becomes decisive at batch 16+ concurrent requests.
MIG support enables up to 7 isolated GPU instances from a single H100, each with dedicated VRAM and compute. This is the right architecture for multi-tenant inference platforms where you need guaranteed isolation between customers. NVLink 4 at 900 GB/s makes 2-GPU tensor parallel serving for 70B models efficient.
Estimated cost-per-token at high concurrency FP8: approximately $0.227/M on-demand at $2.50/hr. At $1.03/hr spot pricing, that drops to around $0.093/M for batch workloads where interruptions are acceptable.
Where it wins: 13B–70B models at production concurrency, multi-tenant deployments using MIG, and latency-sensitive APIs where TTFT is constrained. See our H100 vs H200 comparison or rent an H100 on Spheron. For step-by-step vLLM deployment on H100, see the Spheron vLLM server guide.
H200: Large-Model and Long-Context Inference
The H200 uses the same Hopper compute die as the H100 but pairs it with 141 GB HBM3e at 4.8 TB/s. That's 76% more VRAM and 43% more memory bandwidth than the H100. The extra bandwidth produces approximately 40-43% faster LLM inference throughput than the H100 on Llama 2 70B, based on comparing the best published 8-GPU MLPerf Inference results (H200 at ~34,988 tok/s offline in MLPerf v5.0; H100 best result at ~24,525 tok/s offline in MLPerf v4.1).
The KV cache advantage is substantial for large models. When running a 70B model in FP8, model weights consume roughly 70 GB, leaving only ~10 GB for KV cache on the H100 versus ~71 GB on the H200, a 7x difference in KV headroom for that configuration. For long-context workloads at 128K+ token windows with large models, this is often the deciding factor.
Cost-per-token at FP8 for 70B-class models: approximately $0.288/M on-demand at $4.54/hr. Spot pricing at $1.78/hr is available for interruptible workloads. At current on-demand pricing, the H100 on-demand delivers lower cost-per-token ($0.227/M) for standard 70B FP8 serving. The H200's advantage is its 141 GB VRAM, which enables 70B FP16 serving, 128K+ context windows without KV cache evictions, and higher sustained concurrency on large models.
Where it wins: 70B models in FP16 (which need 140 GB and don't fit on H100 at all), long-context workloads requiring 128K+ token windows, and deployments where KV cache headroom and TTFT matter more than lowest hourly cost. See our H200 rental guide or rent an H200 on Spheron. For serving Llama 3.1 70B and other large dense models, see the Spheron Llama 3 deployment guide.
B200: FP4 Throughput and Maximum VRAM
The B200 is NVIDIA's Blackwell GPU: 192 GB HBM3e at 8.0 TB/s, 4,500 dense FP8 TFLOPS, and exclusive FP4 tensor core support. FP4 reaches 9,000 dense TFLOPS, or up to approximately 18,000 TFLOPS with structured sparsity enabled.
MLPerf Inference v5.1 data (September 2025) for an 8-GPU B200 system in offline mode shows approximately 102,725 tok/s for Llama 2 70B using FP4, roughly 2.9x the throughput of comparable 8-GPU H200 systems from MLPerf v5.0 (34,988 tok/s). On a per-GPU basis, that's approximately 12,841 tok/s at scale. Cost-per-token with FP4 on 70B-class models: estimated approximately $0.130/M on-demand, versus the H100 on-demand rate of around $0.227/M.
The 192 GB VRAM is the other advantage. It enables single-card serving of large dense models that need more than H200's 141 GB: models in the 160B–190B range at FP8, or dense 70B models with substantial KV cache headroom for long-context workloads. For teams running 90B+ dense models from a single GPU, the B200 provides headroom that H100 and H200 cannot match. Very large MoE architectures like Llama 4 Maverick (400B total parameters, requiring 8xH200 for production deployment on Spheron per the Llama 4 Scout and Maverick deployment guide and our Llama 4 GPU cloud deployment guide) still require multi-GPU tensor parallelism even on B200, since all expert weights must reside in VRAM simultaneously.
Where it wins: maximum single-GPU throughput for 70B+ models, FP4-validated workloads, and deployments needing more than 141 GB VRAM for large models. FP4 requires quality validation per task before production use. See our complete B200 guide or rent a B200 on Spheron. For workloads requiring more than 192 GB VRAM, the B300 (288 GB HBM3e, 8.0 TB/s, approximately 1.67x the FP4 TFLOPS of B200 at 15 PFLOPS vs 9 PFLOPS) is now available in limited early access on Spheron. See current GPU pricing for B300 rates or rent a B300. For a dedicated cost-per-token comparison between B300 and B200 with break-even math, see our B300 vs B200 guide. For the rack-scale Blackwell Ultra build, Spheron also has GB300 NVL72 capacity available to reserve now: share your GPU count, timeline, and workload on the form and the team confirms availability within a business day.
B200 FP4 per-GPU figure (~12,841 tok/s) derived from the MLPerf Inference v5.1 (September 2025) 8-GPU result (102,725 tok/s) divided by 8. B200 FP8 estimate uses the H100 SXM MLPerf v4.1 baseline scaled by the B200/H100 FP8 TFLOPS ratio. Run your own benchmarks before production decisions.
FP8 and FP4 Quantization: What They Do to Your Inference Cost
| Config | $/hr | Precision | Est. tok/s (70B) | Cost/M tokens |
|---|---|---|---|---|
| H100 SXM, FP8 (on-demand) | $2.50 | FP8 | ~3,066 | ~$0.227 |
| H100 SXM, FP8 (spot) | $1.03 | FP8 | ~3,066 | ~$0.093 |
| H200 SXM, FP8 (on-demand) | $4.54 | FP8 | ~4,374 | ~$0.288 |
| B200, FP8 (on-demand) | $6.02 | FP8 | ~6,972 | ~$0.240 |
| B200, FP8 (spot) | $2.12 | FP8 | ~6,972 | ~$0.084 |
| B200, FP4 (spot) | $2.12 | FP4 | ~12,841 | ~$0.046 |
Spheron on-demand pricing as of 15 Apr 2026. Prices can fluctuate over time based on GPU availability. H100 SXM throughput figure is a per-GPU estimate from the 8-GPU MLPerf Inference v4.1 result (24,525 tok/s) divided by 8. H200 SXM throughput figure is a per-GPU estimate from the 8-GPU MLPerf Inference v5.0 (April 2025) result (34,988 tok/s) divided by 8. B200 FP8 throughput (~6,972 tok/s) is derived from the H100 SXM per-GPU baseline (~3,066 tok/s) scaled by the B200/H100 FP8 TFLOPS ratio (4,500/1,979 ≈ 2.274), consistent with the methodology used in the cross-referenced FP4 quantization and Blackwell cost guide. B200 FP4 throughput (~12,841 tok/s) is from the MLPerf Inference v5.1 (September 2025) 8-GPU result (102,725 tok/s for Llama 2 70B offline, FP4 precision) divided by 8. Run your own benchmarks before production decisions. Check current pricing for live rates.
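The derivations in the note above can be reproduced in a few lines. This sketch uses only the MLPerf results, TFLOPS figures, and prices quoted in this guide; the variable names are illustrative, and the FP8 scaling by TFLOPS ratio is the guide's own estimate, not a measured result. Computed values may differ from the printed table in the last digit because of rounding.

```python
# 8-GPU MLPerf Inference results quoted in this guide (Llama 2 70B, offline), tokens/sec.
H100_8GPU = 24_525       # v4.1
H200_8GPU = 34_988       # v5.0
B200_8GPU_FP4 = 102_725  # v5.1, FP4
FP8_TFLOPS = {"H100": 1_979, "B200": 4_500}  # dense


def cost_per_m(price_per_hour: float, tok_s: float) -> float:
    """Dollars per 1M tokens at a sustained per-GPU throughput."""
    return price_per_hour / (tok_s * 3600) * 1_000_000


h100 = H100_8GPU / 8
h200 = H200_8GPU / 8
b200_fp4 = B200_8GPU_FP4 / 8
# Estimate carried over from the guide: FP8 throughput scales with the FP8 TFLOPS ratio.
b200_fp8 = h100 * FP8_TFLOPS["B200"] / FP8_TFLOPS["H100"]

rows = [
    ("H100 SXM FP8 on-demand", 2.50, h100),
    ("H100 SXM FP8 spot", 1.03, h100),
    ("H200 SXM FP8 on-demand", 4.54, h200),
    ("B200 FP8 on-demand", 6.02, b200_fp8),
    ("B200 FP8 spot", 2.12, b200_fp8),
    ("B200 FP4 spot", 2.12, b200_fp4),
]
for label, price, tok_s in rows:
    print(f"{label:<24} {tok_s:>7,.0f} tok/s  ${cost_per_m(price, tok_s):.3f}/M")
```

Swapping in current prices or your own measured throughput keeps the comparison honest as rates change.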
FP8 is the safe default. It's supported on H100, H200, and Blackwell GPUs, delivers roughly 1.5–2x throughput over FP16, and introduces less than 1–2% quality degradation on standard benchmarks for Llama, Mistral, and DeepSeek model families (see the Spheron DeepSeek R1 deployment guide or the Mistral and Mixtral deployment guide for working FP8 setups on Spheron). For virtually any production conversational AI, summarization, or code generation task, you'll never notice the difference. For CUDA driver versions by GPU architecture and ready-to-use Spheron environment templates, see the CUDA and NVIDIA drivers guide.
FP4 is Blackwell-only and delivers approximately 1.5-2x FP8 throughput on the same hardware depending on workload and model configuration. Memory-bandwidth-bound inference (small batch sizes) sees gains closer to 2x; compute-saturated batch workloads typically see 1.5-1.7x. But it requires task-specific quality validation. Get pre-calibrated FP4 weights from the nvidia/ namespace on Hugging Face, compare FP16 and FP4 outputs on your actual task, and only move to FP4 in production if the quality delta is acceptable. Fine for conversational AI and creative writing; needs more scrutiny for math, reasoning, or scientific tasks. See the FP4 quantization and Blackwell cost guide for the full framework support and quality breakdown.
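One way to turn "compare FP16 and FP4 outputs on your actual task" into a go/no-go check is sketched below. It assumes a task with a single correct answer per prompt so that exact-match accuracy is meaningful; the function names and the 2-point threshold are hypothetical, and open-ended tasks would need a different metric such as human or model-graded scoring.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after trimming whitespace."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def fp4_is_acceptable(fp16_preds, fp4_preds, references, max_drop: float = 0.02) -> bool:
    """True if FP4 accuracy is within max_drop (absolute) of the FP16 baseline."""
    drop = accuracy(fp16_preds, references) - accuracy(fp4_preds, references)
    return drop <= max_drop


# Toy data for illustration only; use a sample of your production prompts instead.
references = ["4", "9", "16", "25"]
fp16_outputs = ["4", "9", "16", "25"]
fp4_outputs = ["4", "9", "16", "26"]
print(fp4_is_acceptable(fp16_outputs, fp4_outputs, references))
```

Running the same check separately on math or reasoning prompts is worthwhile, since those are the categories flagged above as needing more scrutiny.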
INT4 (AWQ) works on all CUDA GPUs and is the right quantization choice when you're running on A100, L40S, or RTX 4090 hardware. A 70B model that needs two A100 80Gs at BF16 fits on a single A100 at AWQ INT4 with ~1-3% quality loss on most tasks. The RTX 4090 on Spheron is the cheapest of these options for sub-13B INT4 inference. For a complete INT4 quantization walkthrough including memory benchmarks, throughput comparisons, and deployment steps, see our AWQ quantization guide.
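The A100 example follows from simple weight-size arithmetic, sketched here. It counts weights only: KV cache, activations, and the small overhead of quantization scales are left out, so treat the result as a lower bound on the GPUs required.

```python
import math

BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "INT4": 4}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1B parameters at 8 bits is about 1 GB)."""
    return params_billions * BITS_PER_PARAM[precision] / 8


def gpus_needed(params_billions: float, precision: str, vram_gb_per_gpu: float) -> int:
    """Minimum GPU count to hold the weights alone."""
    return math.ceil(weights_gb(params_billions, precision) / vram_gb_per_gpu)


for precision in ("BF16", "FP8", "INT4"):
    size = weights_gb(70, precision)
    count = gpus_needed(70, precision, vram_gb_per_gpu=80)
    print(f"70B at {precision}: {size:.0f} GB of weights, at least {count} x 80 GB GPU")
```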
The key insights from this table: On-demand, H100 at $0.227/M is the cost baseline for production 70B serving. B200 on-demand FP8 at $0.240/M costs slightly more than H100 but delivers roughly 2.3x the throughput per GPU. H200 on-demand at $0.288/M carries a premium reflecting its 141 GB VRAM and lower TTFT. For batch workloads where occasional interruptions are acceptable, spot pricing delivers the best economics: H100 spot at $0.093/M and B200 FP8 spot at $0.084/M both beat H100 on-demand significantly, and B200 FP4 spot at $0.046/M is the lowest cost-per-token option across all configurations.
Batch Inference vs Real-Time Serving: Different GPU Needs
Real-Time Serving (Chatbots, Copilots, APIs)
Real-time serving requires low TTFT (under 500ms) and consistent p99 latency. Runtime batch size is typically 1–8 concurrent requests. At these batch sizes, inference is memory-bandwidth-bound.
HBM GPUs win. The H100 and H200 have the bandwidth to load weights quickly even at batch 1. The L40S can serve 7B models at interactive speeds at batch 1–4, but at batch 8+ for 13B+ models, the H100's bandwidth advantage compounds.
- 7B–13B, cost-sensitive endpoints: L40S. You get approximately 46 tok/s at batch 1 for Llama 8B at $0.72/hr on-demand.
- 13B–70B, production API: H100 SXM. Better TTFT, MIG for multi-tenancy.
- 70B+, long-context: H200 SXM. The extra VRAM holds more KV cache; the bandwidth edge holds TTFT down.
For real-time serving specifically, speculative decoding can add a 2-5x improvement in decode speed (per-token latency after the first token) on top of your GPU selection with no model changes or GPU tier upgrade. Block diffusion approaches via dFlash and block-diffusion speculative decoding push that further on Blackwell hardware.
Batch Inference (Offline Processing, Evals, Embeddings)
Batch workloads are different. Latency is flexible; you want maximum throughput and minimum cost-per-token. Run at batch 64–512 and the GPU becomes compute-saturated rather than bandwidth-limited.
At these batch sizes, the B200's 4,500 dense FP8 TFLOPS (versus the H100's 1,979) produces proportional throughput gains. And H100 spot at $1.03/hr makes it highly competitive for batch jobs where availability interruptions are acceptable.
- 7B–13B, high-volume batch jobs: L40S FP8. Lowest cost-per-token at this model size.
- 70B, batch processing: H100 spot FP8 at $1.03/hr. Best combination of cost and availability.
- 70B+ max throughput: B200 FP4. Only worthwhile if FP4 quality validates for your task.
Decision table:
| Workload | Latency goal | Model size | Recommended GPU |
|---|---|---|---|
| Interactive chatbot | < 500ms TTFT | 7B–13B | L40S |
| Interactive chatbot | < 500ms TTFT | 13B–70B | H100 SXM |
| Interactive chatbot | < 500ms TTFT | 70B+ | H200 SXM |
| Batch processing | Flexible | 7B–13B | L40S (FP8) |
| Batch processing | Flexible | 70B | H100 spot (FP8) |
| Batch processing | Flexible | 70B+ max throughput | B200 (FP4) |
| Multi-tenant API | SLA per tenant | Any | H100 SXM + MIG |
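The decision table can be expressed as a small lookup, which is handy if you route deployments programmatically. This is a sketch, not a definitive policy: the workload keys are hypothetical names, boundary sizes (13B and 70B) are assigned to the cheaper tier, and batch models between 13B and 70B are assumed to follow the 70B row because the table does not cover them.

```python
def recommend_gpu(workload: str, model_params_b: float, max_throughput: bool = False) -> str:
    """Return the GPU the decision table suggests for a workload and model size."""
    if workload == "multi_tenant":
        return "H100 SXM + MIG"
    if workload == "interactive":
        if model_params_b <= 13:
            return "L40S"
        if model_params_b <= 70:
            return "H100 SXM"
        return "H200 SXM"
    if workload == "batch":
        if model_params_b <= 13:
            return "L40S (FP8)"
        if model_params_b >= 70 and max_throughput:
            return "B200 (FP4)"
        return "H100 spot (FP8)"
    raise ValueError(f"unknown workload: {workload!r}")


print(recommend_gpu("interactive", 8))
print(recommend_gpu("batch", 70, max_throughput=True))
```

The B200 (FP4) branch still depends on FP4 passing quality validation for your task, which the function cannot check.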
Multi-Model Serving Strategies
MIG: Multi-Instance GPU on H100, H200, and B200
H100, H200, and B200 support MIG, which partitions one GPU into up to 7 fully isolated instances, each with dedicated VRAM and compute. The use case: host multiple small models simultaneously on one GPU. A single H100 can run seven separate 7B endpoints in INT4, each isolated from the others. The B200 also supports MIG with FP4, FP8, and FP16 precision. On-demand, B200 at $6.02/hr is premium compared to H100 SXM5 at $2.50/hr for multi-tenant inference; however, for spot or mixed workloads where B200 can leverage FP4, the higher throughput may justify the cost on a per-instance basis.
This is the right architecture for multi-tenant inference platforms where different customers or models need guaranteed resource isolation. The L40S does not support MIG. For a step-by-step MIG setup guide covering profile selection, cost math, and working deployment commands, see Run multiple LLMs on one GPU: MIG, time-slicing, and MPS. MIG and MPS are also a right-sizing tool, not just a multi-model deployment strategy. See fractional GPU inference: vGPU, MPS, and right-sizing for cost benchmarks across configurations.
vLLM Continuous Batching
vLLM's continuous batching, paired with PagedAttention for KV cache memory management, interleaves prefill and decode work across requests to maximize GPU utilization. A single L40S running vLLM can handle far more concurrent users than naive batch limits suggest, because the engine interleaves requests that are in different stages.

See our vLLM production deployment guide for setup details, the Spheron vLLM setup guide for step-by-step deployment on Spheron GPUs, or the full LLM inference docs for other frameworks. Additional framework guides: Ollama + Open WebUI, SGLang, TensorRT-LLM + Triton, llama.cpp Server for GGUF model serving with CPU/GPU offload, LMDeploy for TurboMind-based inference on A100/H100 with AWQ quantization, and LocalAI for a self-hosted OpenAI-compatible server supporting LLMs and Stable Diffusion in a single container. For model-specific deployment guides, see Qwen3 Dense and MoE for the Qwen3 family (8B through 235B MoE variants) on Spheron GPUs, Phi-4 for the 14B SLM with strong reasoning performance at low VRAM cost, and Gemma 3 for the 4B through 27B Gemma family from Google.
NVIDIA Dynamo 1.0 takes this further by routing prefill and decode work to separate GPU pools entirely. See our Dynamo disaggregated inference guide for deployment details.
Tensor Parallelism for Large Models
Models that don't fit on one GPU require tensor parallelism: splitting each transformer layer across multiple cards. The quality of inter-GPU communication matters here. H100 and H200 use NVLink 4 at 900 GB/s total bidirectional bandwidth; the L40S is limited to PCIe Gen4 at roughly 64 GB/s bidirectional (32 GB/s per direction). That's approximately a 14x bandwidth advantage for all-reduce operations in tensor-parallel inference, which becomes a real bottleneck at scale. For large multi-GPU deployments, use NVLink-connected SXM boards.
The B200 includes NVLink 5 at 1.8 TB/s GPU-to-GPU interconnect (2x the 900 GB/s of NVLink 4 on H100/H200) and is the right choice for tensor-parallel serving of 400B+ MoE models.
Model Colocation Across GPU Types
A practical strategy for larger inference platforms: run B200 or H200 for large 70B+ models that need high throughput, and route 7B–13B requests to L40S clusters where the per-token cost is lower. A single inference platform with tiered GPU pools routes by model size and latency SLA, minimizing total infrastructure cost without sacrificing performance on large-model requests.
Decision Framework: Which GPU for Your Workload
- Model fits in 48 GB GDDR6? (7B–34B at INT4) Start with the L40S at $0.72/hr on-demand.
- Model needs 70B+ at FP8, or you need MIG? H100 SXM. Use spot pricing ($1.03/hr) for batch workloads, on-demand ($2.50/hr) for production.
- Running 70B+ at FP16, or long context (128K+) with large concurrency? H200 SXM. Its 141 GB VRAM enables 70B FP16 serving and provides 7x more KV cache headroom than the H100 for large model configurations.
- Need maximum single-GPU throughput or more than 141 GB VRAM? B200 on-demand at $6.02/hr is premium, but B200 spot at $2.12/hr with FP4 at approximately $0.046/M is highly competitive on cost-per-token for fault-tolerant workloads. Only worthwhile if FP4 quality validates for your task and spot interruptions are tolerable. For requirements exceeding 192 GB VRAM or on-demand reliability, consider the B300 (288 GB HBM3e, approximately 1.67x B200 FP4 TFLOPS at 15 PFLOPS vs 9 PFLOPS), now available in limited early access on Spheron.
- Multi-GPU tensor parallelism for 400B+ models? H200 or B200 with NVLink. The B300 has the same 1.8 TB/s NVLink bandwidth as the B200 and also works for distributed workloads requiring more than 192 GB per GPU.
One hardware development worth tracking: NVIDIA announced the Groq 3 LPU at GTC 2026, a purpose-built inference chip using on-chip SRAM instead of HBM. It targets the autoregressive decode phase specifically and is not a GPU replacement, but it introduces a new efficiency tier for high-volume dense LLM serving. Broad cloud availability is expected in late 2026. See NVIDIA Groq 3 LPU Explained for architecture details, an LPU vs GPU comparison table, and guidance on how to prepare your serving stack. For teams evaluating non-GPU inference chips alongside H100, see the Cerebras WSE-3 vs H100 comparison for a detailed cost-per-token breakdown at different batch sizes. Transformer ASICs like Sohu are an emerging category but are not production-available in 2026 - the Etched Sohu transformer ASIC comparison explains the architectural constraints and when GPU cloud is still the right call. For a currently available non-GPU inference option, the SambaNova SN40L vs H200 and B200 comparison covers the RDU 3-tier memory architecture and cost-per-token tradeoffs in detail. For a detailed comparison of AMD's upcoming MI400 series against NVIDIA's B300, see our AMD MI400 vs NVIDIA B300 guide.
Before committing to a GPU at scale, rent a single instance on Spheron at per-minute billing and run your own benchmark with vLLM and your production model. The cost-per-token numbers in this guide are a starting point. Real throughput depends on your model architecture, context length, and batch size. For strategies on reducing GPU spend, including spot instances and reserved pricing, see the Spheron cost optimization guide. For the difference between spot, dedicated, and cluster instances, see the Spheron instance types guide. Reserved GPUs offer 30-50% savings over on-demand rates for longer commitments; see the reserved GPU guide for details. For GPU availability by provider and data center region on Spheron, see the regions and providers guide. For programmatic GPU provisioning and management, see the Spheron API reference.
Best GPU for AI Training in 2026
Training picks a different GPU than inference does. Where inference is memory-bandwidth-bound at low batch sizes, training is compute-bound across the entire forward and backward pass, so raw FLOPS at the precision you train in becomes the deciding factor. Three workload sizes cover most teams:
LoRA and QLoRA fine-tuning of 7B-13B models. A single RTX 5090 (32GB GDDR7) at $0.99/hr or an A100 80GB at $1.07/hr is the right call. Both fit a 7B model in BF16 with optimizer state and gradient buffers and finish a typical 2-4 hour QLoRA run for $2-5 in compute. The RTX 5090 has 5th-gen Tensor Cores with FP8 support, which matters if your stack uses Unsloth or transformer_engine. For sub-13B work, this tier wins on dollars per fine-tune.
Full fine-tuning and 30B-70B training. Move to H100 SXM5 at $2.50/hr on-demand or $1.03/hr spot. The H100's Transformer Engine with FP8 gives roughly 2x the training throughput of BF16 on Hopper, and the 80GB HBM3 with 3.35 TB/s of bandwidth comfortably fits a 70B model with FSDP or DeepSpeed ZeRO-3. A full Llama 3 70B fine-tune on a 50K-sample dataset runs in 8-12 hours on a single H100. For multi-node training, every H100 SXM5 node on Spheron ships with 400 Gb/s InfiniBand and GPUDirect RDMA, so distributed runs scale close to linear.
Frontier-scale and trillion-parameter pre-training. B200 (192GB HBM3e, 8 TB/s, native FP4) or B300 (288GB HBM3e, 8 TB/s) are the only sensible options. B200 on-demand at $6.02/hr or $2.12/hr spot delivers roughly 2x H100 training throughput per MLPerf Training v4.1. NVLink 5 at 1.8 TB/s GPU-to-GPU bandwidth inside a node and InfiniBand NDR between nodes make 405B+ parameter pre-training tractable on cluster sizes that would have been infeasible on Hopper. For pre-training the next generation of frontier models, this is the tier.
Skip the L40S for training. It has no NVLink, no FP8 Transformer Engine in the form Hopper has, and the 864 GB/s bandwidth makes the backward pass slow at any non-trivial batch size. L40S is an inference card.
For step-by-step fine-tuning workflows, see How to Fine-Tune LLMs in 2026: Costs, GPUs, and Code. For multi-node training without InfiniBand on a budget, see Multi-Node GPU Training Without InfiniBand.
Best GPU for Generative AI in 2026
Generative AI (image, video, 3D, audio) has different bottlenecks again. Diffusion models are compute-heavy on the U-Net or DiT backbone but VRAM-light compared to LLMs. Video generation flips that: model weights are smaller than 70B LLMs but the latent tensor per frame can balloon into tens of gigabytes once you batch frames for temporal coherence.
Stable Diffusion XL, Flux.1, and Stable Diffusion 3.5. RTX 5090 at $0.99/hr is the price-performance leader. 32GB GDDR7 fits SDXL base and refiner with room for ControlNet, IP-Adapter, and a half dozen LoRAs. Roughly 16 images per minute at 1024x1024 with the standard pipeline. For higher throughput, RTX PRO 6000 Blackwell (96GB GDDR7 ECC) handles batch sizes that the consumer card can't, useful for serving multiple users from one GPU.
AI video generation (Wan 2.2, HunyuanVideo, LTX-Video). H100 or H200 minimum. Video diffusion drives VRAM hard because the latent tensor grows with frame count and resolution. A 5-second 720p HunyuanVideo generation on H100 takes about 60-90 seconds; on H200 the larger VRAM lets you batch two clips per request, halving cost per clip. For long-form or 4K video, B200 with NVLink-connected pairs handles the memory pressure that single-GPU setups cannot. See Best GPU for AI Video Generation 2026 for full per-model recommendations.
3D and Gaussian Splatting. RTX 5090 or RTX PRO 6000. The workload is dominated by rasterization and per-Gaussian gradient updates, not transformer-style attention. Consumer Blackwell silicon hits this sweet spot. For 50M+ Gaussian scenes or production avatar pipelines, the 96GB ECC of the PRO 6000 matters.
Real-time voice cloning and TTS (NeuTTS Air, Kokoro, F5-TTS). RTX 5090 or L40S. Both run sub-second latency for a 4-second generated clip. L40S wins on cost-per-stream at high concurrency because its 48GB GDDR6 ECC pairs well with FP8 inference of small audio models.
Music and full-track generation (YuE, ACE-Step, Stable Audio). H100 SXM5. The diffusion sampler with text conditioning needs the bandwidth, and music tracks at 44kHz drive sample count high enough that L40S falls behind on throughput.
The headline trade-off across generative workloads: VRAM size determines what you can run at all, bandwidth determines how fast each request finishes. Image-only workloads are bandwidth-tolerant; video and audio are bandwidth-hungry; 3D is compute-hungry.
For full pipeline guides: Deploy FLUX.2 on GPU Cloud, ComfyUI on GPU Cloud 2026, Deploy Wan 2.5 on GPU Cloud.
Spheron offers all GPUs covered in this guide at per-minute billing with no long-term commitment.
Quick Setup Guide
Start with your model parameter count and expected concurrent request volume. 7B–13B models fit on a single L40S (48 GB) or RTX 5090 (32 GB). 70B models need 80 GB HBM (H100) in FP8 or 141 GB HBM3e (H200) in FP16. 400B+ MoE models require multi-GPU tensor parallelism, even on B200 (192 GB), because all expert weights must reside in VRAM.
Use the formula: Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1,000,000. Run this for each GPU at your expected batch size using benchmark data from this guide. Compare against your latency requirements: the cheapest cost-per-token often comes at higher batch sizes that add latency.
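The formula translates directly into a small helper. This is a minimal sketch; the function name is illustrative, and the example inputs are the H100 PCIe figures quoted in the FAQ of this guide.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1,000,000."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Example: $2.01/hr at 3,066 tokens/sec.
print(f"${cost_per_million_tokens(2.01, 3066):.3f} per 1M tokens")
```

Run it once per GPU and batch size you are considering, using throughput measured at that batch size rather than a peak figure.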
If you're serving a chatbot or API with sub-2-second latency requirements, optimize for time-to-first-token (TTFT). If you're running offline batch jobs (embedding pipelines, document processing, evals), optimize for throughput at large batch sizes. Different GPU choices follow from these requirements.
FP8 is safe for almost all tasks. FP4 (Blackwell only) requires task-specific validation: download a pre-calibrated FP4 model from the nvidia/ namespace on Hugging Face, run your production task sample set at FP16, FP8, and FP4, and compare outputs. Only move to FP4 in production if the quality delta is below your acceptable threshold.
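One way to make "compare outputs" concrete is to score each lower-precision run against the FP16 baseline on the same prompts. The sketch below uses normalized exact match as the metric, which is a deliberately crude stand-in; the function names and the threshold are hypothetical, and most tasks will want a task-specific metric instead.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def disagreement_rate(baseline: list[str], candidate: list[str]) -> float:
    """Fraction of prompts where the candidate output differs from the baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("output lists must be the same length")
    if not baseline:
        raise ValueError("need at least one sample")
    mismatches = sum(
        normalize(b) != normalize(c) for b, c in zip(baseline, candidate)
    )
    return mismatches / len(baseline)


def passes_threshold(baseline: list[str], candidate: list[str],
                     max_delta: float) -> bool:
    """True if the disagreement rate is within the acceptable delta."""
    return disagreement_rate(baseline, candidate) <= max_delta


# Toy outputs standing in for real model responses at each precision.
fp16_out = ["Paris", "4", "blue"]
fp8_out = ["paris", "4", "blue"]
fp4_out = ["Paris", "5", "blue"]
print(disagreement_rate(fp16_out, fp8_out), disagreement_rate(fp16_out, fp4_out))
```

Exact match suits short, deterministic answers such as classification labels or extraction fields. For free-form generation, swap in the quality measure your task already uses and keep the same baseline-versus-candidate structure.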
Spin up a single GPU instance on Spheron at per-minute billing (no commitment), run vLLM with your model and target batch size, and measure actual tokens/sec and TTFT. Divide the hourly rate by your measured tokens per hour to get your real cost-per-token. Then scale based on measured, not estimated, performance. See the Spheron docs quick-start guide at docs.spheron.ai/quick-start for step-by-step setup.
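Measuring TTFT and tokens/sec comes down to timing a token stream. The sketch below times any Python iterable that yields tokens as they arrive; it does not call vLLM, and `fake_stream` is a hypothetical stand-in for whatever streaming client you use.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_sec, token_count) for one stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    tokens_per_sec = count / total if total > 0 else float("inf")
    return first_token_at - start, tokens_per_sec, count


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming response."""
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulate per-token generation time
        yield f"tok{i}"


ttft, tps, n = measure_stream(fake_stream(20, 0.005))
print(f"TTFT {ttft * 1000:.1f} ms, {tps:.0f} tokens/sec over {n} tokens")
```

This times a single stream. For a cost-per-token figure, sum tokens across all concurrent streams at your target batch size and divide by wall-clock time.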
Frequently Asked Questions
It depends on model size and traffic. For 7B–13B models at moderate concurrency, the L40S at $0.72/hr on-demand is the low-cost option. For 70B models, H100 on-demand at $2.50/hr achieves approximately $0.227/M tokens. For interruptible batch workloads, H100 spot at $1.03/hr drops that to approximately $0.095/M tokens. The B200 with FP4 reaches approximately $0.044/M tokens on-demand, offering dramatically better cost-per-token than H100 on-demand while fitting larger models in its 192 GB VRAM. The H200 at $4.54/hr on-demand is the choice for long-context workloads and 70B FP16 serving that requires its 141 GB VRAM, not for cost-per-token efficiency at current pricing. Pricing is as of 15 Apr 2026 and can fluctuate with GPU availability.
Training is primarily compute-bound: you want maximum TFLOPS. Inference is primarily memory-bandwidth-bound at low batch sizes: you want high memory bandwidth to load model weights quickly for each request. At high batch sizes, inference becomes more compute-bound. This means the H100's 3.35 TB/s HBM3 bandwidth matters far more for inference than its raw CUDA core count, and the L40S can match H100 output at low concurrency despite having 3.9x less memory bandwidth.
No, in practice. FP8 introduces roughly 1-2% degradation or less on standard benchmarks for major LLM families (Llama, Mistral, DeepSeek). For virtually all production conversational AI, summarization, and code generation tasks, the quality difference is imperceptible. FP8 is considered the safe default for inference on H100, H200, and Blackwell GPUs.
FP4 is only available on Blackwell GPUs (B200, B300, RTX 5090, RTX PRO 6000). It delivers approximately 1.5-2x the throughput of FP8 on the same hardware depending on workload (memory-bandwidth-bound workloads see closer to 2x; compute-saturated batch jobs typically see 1.5-1.7x), but introduces more quantization error than FP8. Use FP4 when: your model has pre-calibrated FP4 weights (check the nvidia/ namespace on Hugging Face), your task tolerates the quality reduction (conversational AI, creative writing, summarization), and your cost-per-token calculation shows a benefit over H100/H200 FP8 for your specific batch size and model. Always benchmark before committing.
Real-time serving (chatbots, copilots) prioritizes latency: time to first token and tokens per second at batch size 1-4. The H100 or H200 wins here due to HBM bandwidth. Batch inference (offline processing, embedding pipelines) maximizes throughput and runs at large batch sizes where the GPU is more compute-bound. At high batch sizes, the B200 FP4 delivers the most tokens per dollar. For small-model batch inference, the L40S FP8 is often the most cost-efficient option.
B200 on-demand at $6.02/hr carries a steep premium over H100 PCIe at $2.01/hr. However, B200 spot at $2.12/hr delivers 2.4x memory bandwidth and native FP4 support at only $0.11/hr more than H100 PCIe on-demand. For new inference deployments tolerant of interruption, B200 spot is the clear pick. For on-demand, H100 remains the cheaper option per hour. If you are already running H100 clusters, migration to B200 spot is worthwhile for fault-tolerant workloads or where FP4 quality is validated.
RTX 5090 (32 GB) at $0.76/hr on-demand or L40S (48 GB) at $0.72/hr, for models that fit within that VRAM. Both are sufficient for 7B-13B inference at FP16 or INT4 quantization with room for batching. For ultra-budget batch workloads (embeddings, classification), the L4 at roughly $0.50/hr handles 7B models at INT4. Across all options, the L40S offers the best balance of VRAM (48 GB), throughput at realistic batch sizes, and price.
H100 PCIe FP8 at $2.01/hr on-demand serving 3,066 tokens/sec costs roughly $0.182 per million tokens. B200 FP8 on-demand at $6.02/hr serving ~6,972 tokens/sec costs roughly $0.239/M tokens. However, B200 spot at $2.12/hr serving ~6,972 tokens/sec costs $0.084/M tokens. B200 FP4 on spot at $2.12/hr reaching ~12,841 tokens/sec drops to $0.047/M tokens, roughly a quarter of the H100 on-demand figure. At FP8 on-demand, H100 is more efficient; on spot, B200 dominates. See the /blog/gpu-cloud-pricing-comparison-2026/ cost-per-token table for workload-specific breakdowns.
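A compact way to read these comparisons is to ask what throughput a pricier GPU needs to match a cheaper one on cost per token. The sketch below computes that break-even from the prices and H100 throughput quoted above; the function name is illustrative, and the result is only as good as the throughput you plug in.

```python
def break_even_tokens_per_sec(baseline_price_per_hour: float,
                              baseline_tokens_per_sec: float,
                              candidate_price_per_hour: float) -> float:
    """Throughput at which the candidate matches the baseline's cost per token.

    Cost per token is proportional to price / throughput, so the candidate
    must scale throughput by the same ratio as its price.
    """
    if baseline_price_per_hour <= 0:
        raise ValueError("baseline_price_per_hour must be positive")
    price_ratio = candidate_price_per_hour / baseline_price_per_hour
    return baseline_tokens_per_sec * price_ratio


# Baseline: H100 PCIe on-demand at $2.01/hr and 3,066 tokens/sec.
on_demand = break_even_tokens_per_sec(2.01, 3066, 6.02)  # B200 on-demand
spot = break_even_tokens_per_sec(2.01, 3066, 2.12)       # B200 spot
print(f"B200 on-demand break-even: {on_demand:.0f} tokens/sec")
print(f"B200 spot break-even: {spot:.0f} tokens/sec")
```

With the figures quoted above, the B200 FP8 throughput of ~6,972 tokens/sec falls below the on-demand break-even and the FP4 throughput of ~12,841 tokens/sec sits above it. Both clear the spot break-even comfortably.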
