Running large language models in production requires choosing the right GPU. In 2026, the options range from NVIDIA's flagship B300 with 288 GB of HBM3e down to the RTX 4090 with 24 GB of GDDR6X. Choosing well versus poorly can mean a 10x difference in cost, the difference between serving a model on one GPU or four, and whether you hit the latency targets that make your application viable.
This guide ranks the best NVIDIA GPUs for LLM inference and training, with concrete specifications, real-world benchmark numbers, VRAM requirements for popular models, cloud pricing, and clear recommendations for which GPU fits which workload.
How to Choose a GPU for LLMs
Before diving into specific GPUs, it helps to understand the four factors that matter most for LLM workloads:
VRAM capacity is the single most important specification. LLMs must fit entirely in GPU memory (or be split across multiple GPUs) to run efficiently. A 70B parameter model at FP16 precision requires approximately 140 GB of VRAM, far more than any single consumer GPU offers. Quantization (reducing precision to INT8 or INT4) can cut this by 2-4x, but VRAM is still the primary constraint.
Memory bandwidth determines how fast the GPU can read model weights during inference. LLM inference is memory-bandwidth-bound for most batch sizes; the GPU spends more time loading weights from memory than computing. Higher bandwidth (measured in TB/s) translates directly to higher tokens-per-second throughput. For a deeper dive into how memory architecture affects performance, see our dedicated vs shared GPU memory guide.
Tensor Core performance matters for training and high-batch-size inference. Tensor Cores accelerate the matrix multiplications at the heart of transformer architectures. Fourth-generation Tensor Cores (Hopper/Ada) support FP8 precision, which doubles throughput over FP16 with minimal accuracy loss, and fifth-generation Tensor Cores (Blackwell) add FP4 for roughly another doubling.
Total cost includes both the GPU rental or purchase price and the number of GPUs needed. A cheaper GPU that requires four cards to serve a model may cost more than a single expensive GPU that handles it alone.
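To make the first two factors concrete, here is a back-of-envelope sketch (a rough model, not a benchmark): weight memory is parameters times bytes per parameter plus overhead, and batch-1 decode speed is capped by how fast the GPU can stream those weights from memory.

```python
# Back-of-envelope sizing for the two dominant factors: VRAM and bandwidth.
# Illustrative only; real throughput depends on batching, kernels,
# KV-cache reads, and framework overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_b: float, precision: str, overhead: float = 1.25) -> float:
    """Model weights in GB, plus ~20-30% runtime/activation overhead."""
    return params_b * 1e9 * BYTES_PER_PARAM[precision] * overhead / 1e9

def decode_tok_s_ceiling(params_b: float, precision: str, bandwidth_tb_s: float) -> float:
    """Memory-bound upper bound at batch size 1: each generated token has to
    stream the full weight set from memory once."""
    weight_bytes = params_b * 1e9 * BYTES_PER_PARAM[precision]
    return bandwidth_tb_s * 1e12 / weight_bytes

# Llama 3.3 70B at FP16 on an H200 (4.8 TB/s):
print(round(weight_vram_gb(70, "fp16")), "GB incl. overhead")            # ~175 GB
print(round(decode_tok_s_ceiling(70, "fp16", 4.8)), "tok/s per stream")  # ~34
```

The batch-1 ceiling is why production servers batch many requests per forward pass: the same weight read is amortized across every sequence in the batch.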
VRAM Requirements for Popular LLMs
Understanding how much memory your target model needs is the first step in GPU selection. The table below shows approximate VRAM requirements at different quantization levels, excluding KV-cache overhead (which grows with context length and batch size).
| Model | Parameters | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~218 GB | ~109 GB | ~55 GB |
| Llama 4 Maverick | 400B (17B active) | ~800 GB | ~400 GB | ~200 GB |
| Llama 3.3 70B | 70B | ~140 GB | ~70 GB | ~35 GB |
| Qwen 3 72B | 72B | ~144 GB | ~72 GB | ~36 GB |
| Qwen 3 32B | 32B | ~64 GB | ~32 GB | ~16 GB |
| DeepSeek V3.2 | 671B (37B active) | ~1.34 TB | ~671 GB | ~336 GB |
| Mistral Large 2 | 123B | ~246 GB | ~123 GB | ~62 GB |
| Nemotron Ultra 253B | 253B | ~506 GB | ~253 GB | ~127 GB |
| Llama 3.1 8B | 8B | ~16 GB | ~8 GB | ~5 GB |
These figures represent model weights only. In production, you need additional VRAM for the KV-cache (which stores attention state for each token in the context window), activation memory, and framework overhead. A good rule of thumb is to add 20-30% overhead on top of the base model size. For Nemotron models with hybrid Mamba-Transformer architecture, see the Nemotron 3 Super GPU deployment guide for how SSM layers change the VRAM and KV cache calculus.
Context length compounds this: each 1,000 tokens of context adds roughly 0.5-1 GB of KV-cache memory for a 7B model, scaling linearly with model size. A 70B model with a 128K context window can consume 30-60 GB of KV-cache alone at higher batch sizes.
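A minimal sketch of that KV-cache arithmetic, assuming a standard attention layout (the layer and KV-head counts below follow common Llama-style configurations; check your model's config.json before relying on them):

```python
# KV-cache sizing: what the context window costs on top of the weights.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_elem: int = 2) -> float:
    """FP16 K and V tensors cached for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens * batch_size / 1e9

# 7B-class MHA model (32 layers, 32 KV heads): ~0.5 GB per 1,000 tokens
print(round(kv_cache_gb(32, 32, 128, 1_000), 2))       # ~0.52

# 70B-class GQA model (80 layers, 8 KV heads) at a 128K context window
print(round(kv_cache_gb(80, 8, 128, 131_072), 1))      # ~43 GB per sequence
```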
Tier 1: Data Center Flagships
These are the GPUs built for production LLM serving at scale. They feature HBM memory with massive bandwidth, optimized Tensor Cores, and multi-GPU interconnects designed for distributed inference.
NVIDIA B300: The New Flagship (Blackwell Ultra)
The B300 shipped in January 2026 as NVIDIA's most powerful single GPU. With 288 GB of HBM3e, it fits a full 70B model in FP16 on a single chip with 100+ GB to spare for KV cache.
| Spec | Value |
|---|---|
| Architecture | Blackwell Ultra |
| VRAM | 288 GB HBM3e |
| Memory Bandwidth | 8 TB/s |
| FP4 Performance | 14,000 TFLOPS |
| FP8 Performance | 7,000 TFLOPS |
| FP16 / BF16 | 3,500 TFLOPS |
| Tensor Cores | 5th generation |
| TDP | 1,400 W |
| Interconnect | NVLink 5 (1.8 TB/s) |
| Cloud Pricing | $6.80/hr |
The B300 delivers 55% more FP4 compute than the B200 and treats FP4 inference as a first-class datapath. Its 288 GB capacity means an 8-GPU DGX B300 system provides 2.3 TB of total GPU memory, enough to hold 400B+ parameter models entirely in GPU memory. The 1,400 W TDP requires liquid cooling and purpose-built infrastructure.
Best for: Maximum single-GPU capacity for 70B+ models, FP4 inference at scale, organizations that need the highest possible throughput per GPU. See our complete B300 guide for full specs, pricing, and infrastructure requirements.
NVIDIA B200: The Blackwell Standard
The B200 is NVIDIA's standard Blackwell flagship and, alongside the B300, the current state of the art for LLM workloads.
| Spec | Value |
|---|---|
| Architecture | Blackwell |
| VRAM | 192 GB HBM3e |
| Memory Bandwidth | 8 TB/s |
| FP4 Performance | 9,000 TFLOPS |
| FP8 Performance | 4,500 TFLOPS |
| FP16 / BF16 | 2,250 TFLOPS |
| Tensor Cores | 5th generation |
| TDP | 1,000 W |
| Interconnect | NVLink 5 (1.8 TB/s) |
| Cloud Pricing | $6.02/hr on-demand; $2.12/hr spot |
FP4, FP8, and FP16/BF16 Tensor Core values are dense (non-sparse). With NVIDIA 2:4 structured sparsity: FP4 ~18,000, FP8 ~9,000, FP16/BF16 ~4,500 TFLOPS.
The B200 delivers roughly 4-5x the inference throughput of the H100 in typical serving, and up to 15x on heavily optimized LLM workloads. Its 192 GB of HBM3e at 8 TB/s of bandwidth means it can serve Llama 3.1 70B at FP16 on a single GPU with room to spare for large KV caches.
Best for: Production inference at any scale, training frontier models, organizations that need maximum throughput per GPU. If budget allows, the B200 reduces total GPU count and system complexity. See our complete B200 guide for full specs, benchmarks, and pricing.
See our H200 vs B200 vs GB200 guide for a direct three-way comparison with benchmark data and cost-per-token analysis.
NVIDIA H200: The Memory Leader (Hopper)
The H200 upgrades the H100's memory subsystem while keeping the same proven Hopper compute architecture.
| Spec | Value |
|---|---|
| Architecture | Hopper |
| VRAM | 141 GB HBM3e |
| Memory Bandwidth | 4.8 TB/s |
| FP8 Performance | 3,958 TFLOPS |
| FP16 / BF16 | 1,979 TFLOPS |
| Tensor Cores | 4th generation (528) |
| TDP | 700 W (SXM) |
| Interconnect | NVLink 4 (900 GB/s) |
| Cloud Pricing | $4.54/hr |
FP8 and FP16/BF16 Tensor Core values reflect NVIDIA 2:4 structured sparsity. Dense (non-sparse) FP16/BF16 is approximately 989 TFLOPS; dense FP8 approximately 1,979 TFLOPS.
MLPerf v4.0 benchmarks show the H200 reaching 31,712 tokens/second on Llama 2 70B offline, a ~42% improvement over the H100's 22,290 tokens/second. The 141 GB HBM3e capacity means Llama 3.1 70B fits comfortably at INT8 with ample room for KV cache, and even FP16 serving is possible with careful memory management.
Best for: Production 70B+ model serving, long-context inference, organizations already invested in the Hopper ecosystem. The H200 offers the best balance of performance, memory, and software maturity for Hopper-based deployments.
NVIDIA H100: The Proven Workhorse
The H100 remains the most widely deployed data center GPU for AI workloads, with the broadest cloud availability and the most optimized software stack. You can rent NVIDIA H100 GPUs on demand with per-minute billing, or read our H100 vs H200 benchmarks for a detailed comparison with the newer H200.
| Spec | Value |
|---|---|
| Architecture | Hopper |
| VRAM | 80 GB HBM3 |
| Memory Bandwidth | 3.35 TB/s |
| FP8 Performance | 3,958 TFLOPS |
| FP16 / BF16 | 1,979 TFLOPS |
| Tensor Cores | 4th generation (528) |
| TDP | 700 W (SXM) |
| Interconnect | NVLink 4 (900 GB/s) |
| Cloud Pricing | $2.50/hr |
FP8 and FP16/BF16 Tensor Core values reflect NVIDIA 2:4 structured sparsity. Dense (non-sparse) FP16/BF16 is approximately 989 TFLOPS; dense FP8 approximately 1,979 TFLOPS.
The H100 delivers over 10,000 tokens/second on optimized LLM inference with vLLM or TensorRT-LLM. Its 80 GB HBM3 comfortably serves models up to 34B parameters at FP16, or 70B models at INT4 quantization. For Qwen 3 deployments, the H100 is the recommended single-GPU configuration for Qwen3-32B at FP8; see the Qwen 3 deployment guide for step-by-step setup. Google's Gemma 3 27B is another example of a production-ready model that fits on a single H100 in BF16 without quantization. Multi-GPU H100 clusters with NVLink are the standard infrastructure for large-scale training.
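As a sketch of that single-GPU Qwen3-32B configuration, FP8 serving with vLLM's offline API might look like the following; the model id and flags are assumptions to check against your vLLM version and the deployment guide linked above.

```python
# Single-H100 FP8 serving sketch for Qwen3-32B with vLLM (offline API).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",        # assumed Hugging Face id; verify
    quantization="fp8",            # FP8 weights/activations on Hopper
    max_model_len=8192,
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

outputs = llm.generate(
    ["Explain KV-cache growth in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```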
Best for: Cost-efficient serving of 7B-34B models, training runs, any workload where the H100's massive software ecosystem and broad availability provide operational advantages. The H100 still offers one of the best price-to-performance ratios for most production LLM workloads. Robotics teams also use H100 instances for synthetic training data generation with Cosmos, where its 80GB VRAM fits the Cosmos-Predict 7B model without additional scaling.
Tier 2: High-Performance Data Center GPUs
These GPUs offer strong LLM performance at lower price points, making them ideal for cost-sensitive deployments, smaller models, and inference-heavy workloads.
NVIDIA A100: The Budget Data Center Option
The A100 is the previous generation's flagship, now available at significantly reduced prices while still delivering competitive inference performance. For a comparison with the older V100, see our A100 vs V100 guide.
| Spec | Value |
|---|---|
| Architecture | Ampere |
| VRAM | 40 GB HBM2 or 80 GB HBM2e |
| Memory Bandwidth | 2 TB/s (80 GB variant) |
| FP16 / BF16 | 312 TFLOPS |
| Tensor Cores | 3rd generation (432) |
| TDP | 400 W (SXM) |
| Interconnect | NVLink 3 (600 GB/s) |
| Cloud Pricing | $1.07/hr on-demand; $0.60/hr spot |
The A100 80 GB remains viable for serving 7B-13B models at FP16 and 70B models at INT4 with careful optimization. For teams migrating from older infrastructure, the A100 offers a familiar software environment with broad framework support.
Best for: Budget-conscious inference deployments, serving smaller models (7B-13B) in production, research and experimentation, organizations with existing A100 infrastructure. Spot pricing at $0.60/hr is excellent for fault-tolerant batch workloads.
NVIDIA L40S: The Inference Specialist
The L40S is NVIDIA's Ada Lovelace-based data center GPU, optimized for inference and multimodal AI workloads.
| Spec | Value |
|---|---|
| Architecture | Ada Lovelace |
| VRAM | 48 GB GDDR6 with ECC |
| Memory Bandwidth | 864 GB/s |
| FP8 Performance | 733 TFLOPS |
| FP16 / BF16 | 366 TFLOPS |
| Tensor Cores | 4th generation (568) |
| TDP | 350 W |
| Interconnect | PCIe Gen4 |
| Cloud Pricing | $0.72/hr |
Benchmarks show the L40S achieving 43.8 tokens/second on Llama 3.1 8B at batch size 1 and 325 tokens/second at batch size 8. It delivers up to 1.5x the inference performance of the A100 80 GB on popular MLPerf benchmarks while consuming less power.
The 48 GB GDDR6 memory is sufficient for most 7B-13B models at FP16 and can handle Mixtral 8x7B at INT4 quantization. However, GDDR6 bandwidth (864 GB/s) is significantly lower than HBM-based GPUs, which limits throughput at larger batch sizes.
Best for: Cost-efficient inference serving for 7B-13B models, multimodal workloads (vision + language), organizations that need a balance of inference performance and price. Excellent value at $0.72-$2/hr across providers.
For a dedicated L40S deep dive including inference benchmarks and cloud pricing comparison, see our NVIDIA L40S inference guide.
NVIDIA L4: The Efficiency Champion
The L4 is designed for high-density, low-power inference at scale.
| Spec | Value |
|---|---|
| Architecture | Ada Lovelace |
| VRAM | 24 GB GDDR6 |
| Memory Bandwidth | 300 GB/s |
| FP8 Performance | 242 TFLOPS |
| FP16 / BF16 | 121 TFLOPS |
| Tensor Cores | 4th generation |
| TDP | 72 W |
| Interconnect | PCIe Gen4 |
| Cloud Pricing | ~$0.50/hr |
At just 72 W TDP, the L4 fits in standard server slots without special cooling, allowing dense deployments with many GPUs per rack. Its 24 GB VRAM handles 7B models at INT4/INT8 comfortably, making it ideal for chatbot backends, recommendation engines, and classification tasks.
Best for: High-volume, latency-sensitive inference on smaller models (7B and under), edge deployments, any workload where power efficiency and density matter more than raw throughput.
Tier 3: Consumer and Prosumer GPUs
Consumer GPUs can be a cost-effective choice for development, prototyping, and small-scale inference, but they come with important limitations for production use.
NVIDIA RTX 4090: The Developer's GPU
The RTX 4090 is the most powerful consumer GPU and a popular choice for local LLM development and small-scale inference. For a detailed analysis of this GPU's capabilities, see our RTX 4090 for AI and machine learning guide.
| Spec | Value |
|---|---|
| Architecture | Ada Lovelace |
| VRAM | 24 GB GDDR6X |
| Memory Bandwidth | 1,008 GB/s |
| FP16 / BF16 | 330 TFLOPS |
| Tensor Cores | 4th generation (512) |
| TDP | 450 W |
| Purchase Price | ~$1,600-$2,000 |
The RTX 4090 achieves approximately 6,900 tokens/second on Llama 3 8B (Q4_K_M quantization) and 9,056 tokens/second at FP16 (impressive numbers for a consumer card). Its 24 GB VRAM handles 7B-13B models at INT4/INT8 and even Mixtral 8x7B at aggressive quantization. Qwen 3 8B is a strong RTX 4090 workload at FP8; see the full Qwen 3 deployment guide for all variants.
However, NVIDIA's GeForce EULA technically prohibits data center deployment of consumer GPUs. For production use, the L40S or A100 are the compliant alternatives.
Teams that need 32GB VRAM, Blackwell's FP4 support, or lower cost-per-token on small model inference should consider the RTX 5090, the next-generation Blackwell consumer GPU available on Spheron from $0.76/hr. See our RTX 5090 rental and benchmark guide for a full cost-per-token analysis.
Best for: Local development and prototyping, fine-tuning smaller models, researchers who need strong single-GPU performance, hobbyists running models at home.
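To make the 24 GB budget concrete, here is a hedged sketch of loading an 8B model in 4-bit with transformers and bitsandbytes (NF4 here rather than the llama.cpp Q4_K_M quant quoted above; the model id is an assumption and is gated on Hugging Face):

```python
# Fitting an 8B model into the RTX 4090's 24 GB with 4-bit (NF4) weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # assumed HF id
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = tok("List three uses of a local 8B model:", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**prompt, max_new_tokens=64)[0], skip_special_tokens=True))
```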
NVIDIA RTX 3090: The Budget Development Card
The RTX 3090 remains available on the used market at significant discounts and offers 24 GB GDDR6X, the same VRAM capacity as the RTX 4090 at roughly half the price.
| Spec | Value |
|---|---|
| Architecture | Ampere |
| VRAM | 24 GB GDDR6X |
| Memory Bandwidth | 936 GB/s |
| FP16 / BF16 | 142 TFLOPS |
| Tensor Cores | 3rd generation (328) |
| TDP | 350 W |
| Used Price | ~$700-$1,000 |
The RTX 3090's 24 GB VRAM handles the same model sizes as the RTX 4090, just at lower throughput. For development work where iteration speed matters more than peak inference performance, it's an excellent value.
Best for: Budget development setups, academic research, hobbyist LLM experimentation.
GPU-to-Model Matching Guide
Choosing the right GPU comes down to matching your model's memory requirements to available VRAM, then optimizing for throughput and cost. Here's a practical mapping:
| Model Size | Quantization | Min VRAM Needed | Recommended GPUs |
|---|---|---|---|
| 7B-8B | FP16 | ~16 GB | L4, RTX 4090, RTX 5090, L40S, A100 |
| 7B-8B | INT4 | ~5 GB | L4, RTX 4090, any 8+ GB GPU |
| 13B | FP16 | ~26 GB | RTX 5090 (32 GB), L40S (48 GB), A100 40 GB |
| 13B | INT4 | ~8 GB | L4, RTX 4090 |
| 32B | FP16 | ~64 GB | H100 80 GB, A100 80 GB |
| 32B | INT4 | ~16 GB | RTX 4090, RTX 5090, L40S |
| 70B | FP16 | ~140 GB | B300 (288 GB), H200, 2x H100 |
| 70B | INT8 | ~70 GB | H100 80 GB, A100 80 GB |
| 70B | INT4 | ~35 GB | RTX 5090 (32 GB tight), L40S, H100 |
| 109B (Llama 4 Scout) | INT4 | ~55 GB | H100 80 GB, L40S + offload |
| 405B | INT4 | ~203 GB | B300, 2x H200, 2x B200 |
For production deployments, always benchmark your specific model and serving framework (vLLM, TensorRT-LLM, SGLang) on candidate GPUs before committing. Theoretical VRAM calculations don't account for framework overhead, KV-cache growth at high concurrency, or optimization opportunities.
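A minimal throughput probe along those lines, assuming vLLM's offline API, might look like this; swap in your own model id, prompt mix, batch size, and sampling settings:

```python
# Quick throughput probe: generated tokens per second on a candidate GPU.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")          # assumed HF id
prompts = ["Summarize the benefits of FP8 inference."] * 64  # toy batch
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s at batch {len(prompts)}")
```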
Cloud Pricing Comparison
GPU cloud pricing varies significantly by provider, commitment length, and availability. These are approximate on-demand hourly rates as of 15 Apr 2026:
| GPU | VRAM | Typical Cloud Price | Best Value For |
|---|---|---|---|
| L40S | 48 GB | $0.72/hr | Mid-size inference, multimodal |
| A100 80 GB | 80 GB | $1.07/hr on-demand; $0.60 spot | Cost-efficient inference |
| H100 SXM | 80 GB | $2.50/hr on-demand; $1.03 spot | High-throughput 7B-34B serving |
| H200 SXM | 141 GB | $4.54/hr | 70B+ models, long context |
| B200 | 192 GB | $6.02/hr on-demand; $2.12 spot | High throughput, large models |
| B300 | 288 GB | $6.80/hr | Maximum throughput, frontier models |
Reserved instances and long-term commitments can reduce these prices by 30-60%. For cost optimization, consider whether fewer expensive GPUs (such as one H200 for a 70B model) cost less than multiple cheaper GPUs (such as two A100s with tensor parallelism overhead).
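One way to frame that decision is cost per million generated tokens. The sketch below uses the hourly prices from the table above with placeholder throughput figures; substitute your own benchmark numbers before drawing conclusions.

```python
# Cost per million generated tokens: price per hour divided by tokens per hour.

def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    return hourly_usd / (tokens_per_second * 3600) * 1e6

# Hypothetical 70B serving comparison; throughputs are placeholders.
h200_single   = usd_per_million_tokens(4.54, 3_000)       # 1x H200 at $4.54/hr
a100_pair_tp2 = usd_per_million_tokens(2 * 1.07, 1_200)   # 2x A100 with TP=2

print(f"H200: ${h200_single:.2f}/M tok, 2x A100: ${a100_pair_tp2:.2f}/M tok")
```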
Key Recommendations by Use Case
Startups Serving 7B-13B Models
Start with L40S or H100. The L40S offers the best cost-per-token for smaller models at $0.72-$2/hr, while the H100 provides headroom to scale up to larger models as your product evolves. Both have excellent vLLM and TensorRT-LLM support.
Enterprise 70B+ Production Inference
Deploy on H200, B200, or B300. The H200's 141 GB HBM3e handles 70B models on a single GPU, eliminating the complexity of multi-GPU tensor parallelism. The B300's 288 GB fits 70B models in FP16 with over 100 GB to spare for KV cache and batch processing, making it the top choice when throughput per GPU matters most.
Research and Experimentation
The A100 80 GB offers the best value for research: broad software compatibility, sufficient VRAM for most experiments, and the lowest data center GPU pricing. For local development, the RTX 4090 provides excellent single-GPU performance.
Fine-Tuning and Training
For fine-tuning 7B-13B models, a single H100 or A100 is typically sufficient with LoRA or QLoRA techniques. Full fine-tuning of 70B+ models requires multi-GPU setups. The H100 with NVLink provides the best multi-GPU scaling and training library support.
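As a rough sketch of that single-GPU LoRA path using the Hugging Face peft and transformers stack (the model id, rank, and target modules are illustrative starting points, not a recipe):

```python
# LoRA fine-tuning setup for an 8B model on a single H100 or A100.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # assumed HF id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()         # typically well under 1% of weights
# Hand `model` to a transformers Trainer or TRL SFTTrainer as usual.
```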
High-Volume, Low-Latency Serving
The L4 at 72 W TDP enables the densest rack deployments for serving smaller models at massive scale. For applications like real-time chatbots, classification, or recommendation serving, the L4's efficiency and low cost make it the optimal choice.
Whether you need B300s for maximum capacity, H100s for production serving, or A100s for experimentation, Spheron provides bare-metal GPU access with per-minute billing and no contracts. Scale from a single GPU to multi-GPU clusters.
