The Quick Reference Table
This table covers the most popular open-source models in production today. VRAM numbers assume inference (not training) with the most common quantization level for each model size.
| Model | Parameters | Quantization | VRAM Needed | Min GPU Config | Est. Cloud Cost/hr |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | INT4 | ~55 GB | 1x H100 80GB | ~$2.50 |
| Llama 4 Scout | 109B (17B active) | FP16 | ~218 GB | 4x H100 80GB | ~$10.00 |
| Llama 4 Maverick | 400B (17B active) | INT4 | ~200 GB | 4x H100 80GB | ~$10.00 |
| Llama 4 Maverick | 400B (17B active) | FP16 | ~800 GB | 8x H200 141GB | ~$36.32 |
| DeepSeek V3.2 | 671B (37B active) | FP8 | ~700 GB | 8x H200 141GB (8x H100 80GB is insufficient at FP8) | ~$36.32 |
| DeepSeek V4 | 1T (37B active) | FP8 | ~1,000 GB | 8x H200 141GB (tight, limit KV cache) | ~$36.32 |
| GLM 5.1 | ~9B (est.) | FP8 | ~9 GB | 1x RTX 4090 24GB | ~$0.55 |
| Qwen 3.6 Plus | ~72B (est.) | INT4 | ~36 GB | 1x H100 80GB | ~$2.50 |
| Qwen 3.5 397B | 397B MoE (varies active) | FP8 | ~397 GB | 8x H100 80GB | ~$20.00 |
| Qwen 3.5 397B | 397B MoE (varies active) | INT4 | ~199 GB | 4x H100 80GB | ~$10.00 |
| Qwen 3.5 27B | 27B | FP8 | ~27 GB | 1x H100 80GB | ~$2.50 |
| Qwen 3.5 9B | 9B | FP8 | ~9 GB | 1x RTX 4090 24GB | ~$0.55 |
| Qwen 3 72B | 72B | INT4 | ~36 GB | 1x H100 80GB | ~$2.50 |
| Qwen 3 72B | 72B | FP16 | ~144 GB | 2x H100 80GB | ~$5.00 |
| Qwen 3 32B | 32B | INT4 | ~16 GB | 1x RTX 4090 24GB | ~$0.55 |
| Qwen 3 32B | 32B | FP16 | ~64 GB | 1x H100 80GB | ~$2.50 |
| Mistral Large 2 | 123B | INT4 | ~62 GB | 1x H100 80GB | ~$2.50 |
| Mistral Large 2 | 123B | FP16 | ~246 GB | 4x H100 80GB | ~$10.00 |
| Kimi K2.5 | 1T (32B active) | INT4 | ~630 GB | 8x H100 80GB | ~$20.00 |
| Nemotron Ultra 253B | 253B | INT4 | ~127 GB | 2x H100 80GB | ~$5.00 |
| Nemotron Ultra 253B | 253B | FP16 | ~506 GB | 8x H100 80GB | ~$20.00 |
| Nemotron 3 Super (120B MoE) | 120B (12B active) | NVFP4 | ~60 GB | 1x H100 80GB | ~$2.50 |
| Llama 3.3 70B | 70B | INT4 | ~35 GB | 1x H100 80GB | ~$2.50 |
| Llama 3.3 70B | 70B | FP16 | ~140 GB | 2x H100 80GB | ~$5.00 |
| Phi-4 14B | 14B | INT4 | ~7 GB | 1x RTX 4090 24GB | ~$0.55 |
| Phi-4 14B | 14B | FP16 | ~28 GB | 1x A100 40GB | ~$1.07 |
Cloud costs are based on market rates as of 14 May 2026: H100 SXM5 $2.50/hr, H200 SXM5 $4.54/hr, A100 80GB $1.07/hr on-demand ($0.60/hr spot), RTX 4090 $0.55/hr. Multi-GPU costs are the per-GPU rate multiplied by the GPU count. For full vLLM setup steps for Qwen 3.5, see the Qwen 3.5 deployment guide.
For a full deployment walkthrough of Nemotron 3 Super including VRAM breakdowns across all precision tiers and vLLM configuration, see the Nemotron 3 Super GPU deployment guide.
Note on DeepSeek V3.2 VRAM constraint: At FP8, the 671B parameters require roughly 671 GB for weights alone, plus ~30-60 GB for KV cache, activations, and framework overhead. That puts minimum practical VRAM at ~700 GB, which exceeds 8× H100 80GB (640 GB total). Production deployment requires 8× H200 141GB (1,128 GB total, ~$36.32/hr) or a split across multiple H100 nodes with CPU offload. See our DeepSeek V3.2 deployment guide for setup details.
Note on DeepSeek V4: At 1T total parameters, FP8 weights require ~1,000 GB. 8× H200 141GB provides 1,128 GB total, which fits with limited KV cache headroom. For full production batch sizes, plan for multi-node H200 configurations. See the DeepSeek V4 deployment guide for expert parallelism configuration.
Note on GLM 5.1 and Qwen 3.6 Plus: Parameter counts marked (est.) are estimated based on model family trajectory and have not been confirmed from official model cards. Verify against the official release before capacity planning.
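The fit-and-cost arithmetic in the notes above can be reproduced with a few lines of Python. This is a minimal sketch, not a sizing tool: the `GPUS` dictionary and the `check_config` helper are illustrative names, the rates and VRAM sizes are the ones quoted in this post, and pooling VRAM across GPUs assumes the serving framework can shard the model evenly.

```python
# Per-GPU VRAM and hourly rates as quoted in this post (14 May 2026).
GPUS = {
    "H100 80GB": {"vram_gb": 80, "usd_per_hr": 2.50},
    "H200 141GB": {"vram_gb": 141, "usd_per_hr": 4.54},
}


def check_config(gpu: str, count: int, needed_gb: float) -> dict:
    """Return pooled VRAM, whether the model fits, and the hourly cost."""
    spec = GPUS[gpu]
    total_gb = spec["vram_gb"] * count
    return {
        "total_gb": total_gb,
        "fits": total_gb >= needed_gb,
        "headroom_gb": total_gb - needed_gb,
        "usd_per_hr": round(spec["usd_per_hr"] * count, 2),
    }


# DeepSeek V3.2 at FP8: ~700 GB minimum practical VRAM.
print(check_config("H100 80GB", 8, 700))   # 640 GB pooled: does not fit
print(check_config("H200 141GB", 8, 700))  # 1,128 GB pooled: fits
```

The same check shows why DeepSeek V4 is listed as tight: at ~1,000 GB of weights, 8× H200 leaves about 128 GB for everything else.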
Every time a new model releases, the first question is always the same: "What GPU do I need to run this?"
The answer is scattered across Hugging Face model cards, Reddit threads, and provider-specific docs, none of which agree with each other. This post puts it all in one place: exact VRAM requirements, recommended GPU configurations, and estimated cloud costs for every major open-source model available in May 2026. For deeper technical understanding, see our comprehensive GPU memory requirements guide.
Bookmark this page. We update it as new models release.
For teams planning infrastructure beyond 2026, see the NVIDIA Rubin R100 guide for next-generation GPU specs and cloud availability projections.
Llama 4 Scout
Llama 4 Scout (109B total, 17B active, MoE) is the most cost-effective entry point in the Llama 4 family. At INT4 quantization, the ~55 GB weight footprint fits on a single H100 80GB at $2.50/hr on Spheron. The 10M token context window is the standout spec: no other model in this comparison comes close at this price. Suitable for conversational AI, RAG, and long-document tasks where context length matters. See the Llama 4 deployment guide for vLLM configuration.
DeepSeek V3.2
DeepSeek V3.2 (671B total, 37B active, MoE) leads on math and multi-step reasoning. At FP8, minimum practical VRAM is ~700 GB, requiring 8× H200 141GB (~$36.32/hr on Spheron). The high hardware bar is the cost of entry for its benchmark leadership. For coding and general tasks, Qwen 3 32B on a single H100 at $2.50/hr is a better dollar-per-quality choice. See our DeepSeek V3.2 deployment guide for multi-node setup and expert offloading strategies.
Qwen 3.5 27B
Qwen 3.5 27B is a dense 27B model that fits on a single H100 80GB at FP8 (~27 GB VRAM, ~$2.50/hr). It sits between the 9B and 72B variants in the quality-cost curve, making it a solid choice for teams that need more capability than the 9B but can't justify multi-GPU costs. For full vLLM setup steps, see the Qwen 3.5 deployment guide.
Qwen 3 72B
Qwen 3 72B at INT4 fits on a single H100 80GB with ~36 GB VRAM at $2.50/hr. At FP16, it needs 2× H100s (~$5.00/hr). The 72B size gives substantially better quality than 32B for reasoning and instruction following while remaining single-GPU at INT4. For a comparison of how it performs vs Llama 4 Scout on coding and reasoning tasks, see the GPU requirements cheat sheet comparison.
Mistral Large 2
Mistral Large 2 (123B) at INT4 (~62 GB) fits on a single H100 80GB at $2.50/hr. At FP16 (~246 GB), it needs 4× H100s at $10/hr. It has strong instruction following and function calling. Note that the weights are published under the Mistral Research License, not Apache 2.0, so commercial use requires a separate commercial license from Mistral. For teams needing a 100B+ class model on a single GPU without MoE complexity, it's a straightforward option. See the Mistral deployment guide for setup details.
Kimi K2.5
Kimi K2.5 (1T total, 32B active, MoE) at INT4 needs ~630 GB VRAM, fitting on 8× H100 80GB (~$20/hr on Spheron). Despite 1T total parameters, the 32B active parameter count keeps per-token compute manageable. Its strength is long-context reasoning and agentic tasks. Moonshot's newer release follows a similar hardware profile: see the Kimi K2.6 deployment guide for updated VRAM math and agentic-swarm configuration (300-agent swarms, 4,000 coordinated steps).
How to Read This Table
Parameters vs Active Parameters. Mixture-of-Experts models (Llama 4, DeepSeek, Kimi K2.5) have a total parameter count much higher than their active parameter count. The total count determines VRAM needs (all weights must be loaded). The active count determines inference speed: fewer active parameters means faster per-token generation.
VRAM Needed. This is the approximate memory required for the model weights at the listed quantization level. It doesn't include KV cache, which grows with context length and batch size. For production use with reasonable context windows (32K-128K tokens), add 20-40% headroom above the listed VRAM.
Min GPU Config. The minimum hardware configuration that fits the model. For production inference with concurrent users, you'll often want more VRAM than the minimum: the extra headroom enables higher throughput through larger batch sizes.
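To see how the 20-40% headroom changes the GPU count, here is a small sketch. The helper names `vram_with_headroom` and `min_gpus` are illustrative, and the GPU count is a pure capacity calculation that ignores how a serving framework actually shards a model.

```python
import math


def vram_with_headroom(weights_gb: float, low: float = 0.20, high: float = 0.40) -> tuple:
    """Apply the 20-40% headroom range to a weights-only VRAM figure."""
    return weights_gb * (1 + low), weights_gb * (1 + high)


def min_gpus(required_gb: float, gpu_vram_gb: float) -> int:
    """Smallest number of identical GPUs whose pooled VRAM covers the requirement."""
    return math.ceil(required_gb / gpu_vram_gb)


# Qwen 3 72B at FP16: ~144 GB of weights, listed as 2x H100 80GB minimum.
low_gb, high_gb = vram_with_headroom(144)
print(f"weights only:  {min_gpus(144, 80)} x H100")
print(f"with headroom: {min_gpus(low_gb, 80)}-{min_gpus(high_gb, 80)} x H100 "
      f"({low_gb:.0f}-{high_gb:.0f} GB)")
```

In this example the weights alone fit on two H100s, but the headroom range pushes the requirement past 160 GB, which is why the minimum config is rarely the production config.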
VRAM Calculation Rules of Thumb
If a model is not in the table above, you can estimate VRAM requirements using these formulas:
FP16 (half precision): Parameters in billions × 2 = VRAM in GB. A 70B model needs ~140 GB in FP16.
FP8 (8-bit weights): Parameters in billions × 1 = VRAM in GB. A 70B model needs ~70 GB in FP8.
INT4 (4-bit quantization): Parameters in billions × 0.5 = VRAM in GB. A 70B model needs ~35 GB in INT4.
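KV cache overhead: For each 1K tokens of context per concurrent request, add roughly 50-100 MB of VRAM for a compact grouped-query-attention model; larger models and FP16 caches need more (the exact figure depends on layer count, KV head count, head dimension, and cache precision). At 128K context with 8 concurrent requests (about 1M cached tokens), KV cache can consume 50-100 GB of additional VRAM.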
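The rules of thumb above translate directly into code. The sketch below is an estimate under stated assumptions, not a measurement: `weights_gb` applies the bytes-per-parameter rule, and `kv_cache_gb` uses the standard per-token cache size for transformer attention (keys plus values, for every layer and KV head). The architecture in the example (32 layers, 8 KV heads, head dimension 128, 8-bit cache) is hypothetical and does not describe any model in the table.

```python
# Bytes per parameter for each precision, as in the rules of thumb above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weights-only VRAM in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_requests: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache in GB: keys and values for every layer, token, and request."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrent_requests / 1e9


for precision in ("fp16", "fp8", "int4"):
    print(f"70B at {precision}: {weights_gb(70, precision):.0f} GB")

# Hypothetical architecture with an 8-bit KV cache, 128K context, 8 requests.
cache = kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                    context_tokens=128 * 1024, concurrent_requests=8,
                    bytes_per_value=1.0)
print(f"KV cache: {cache:.1f} GB")
```

Because the cache scales linearly with context length and concurrency, halving either one halves the KV cache footprint, which is often the cheapest way to fit a deployment onto a smaller GPU.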
GPU Selection Guide by Workload
Development and Experimentation
Best option: 1x RTX 4090 ($0.55/hr) or 1x A100 ($1.07/hr)
For prompt engineering, testing model quality, and running small-batch inference, you don't need datacenter GPUs. Most models up to 32B parameters fit on a single RTX 4090 with INT4 quantization. Larger models (70B) fit on a single A100 80GB at INT4.
Use Spot instances for experimentation: you save 30-50%, and interruptions don't matter when you're iterating.
Single-Model Production Inference
Best option: 1x H100 ($2.50/hr) or 1x H200 ($4.54/hr)
For serving one model to production traffic, an H100 handles most 70B-class models comfortably when quantized. The H200's extra VRAM (141 GB vs 80 GB) gives you headroom for longer contexts and higher concurrency without sharding across multiple GPUs. See our best NVIDIA GPUs for LLMs guide for how these options compare for your specific use case.
Large Model Serving (100B+)
Best option: Multi-GPU H100 or B300 cluster
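Most models above 100B parameters need multi-GPU setups, and above roughly 150B that holds even at INT4 on 80 GB cards. Llama 4 Scout and Mistral Large 2 are the exceptions in the table: both fit on one H100 at INT4. For the largest models (DeepSeek V3.2, Kimi K2.5), 8-GPU configurations are the minimum.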
On Spheron, you can provision multi-GPU baremetal servers with H100, H200, and B300 GPUs. B300 Spot instances starting at $2.90/hr per GPU give you 288 GB VRAM per GPU, meaning a single B300 can run models that would require 2-4 H100s.
Training and Fine-Tuning
Best option: H100 or B300 with NVLink
Fine-tuning requires more VRAM than inference because you need to store the model weights, gradients, and optimizer states simultaneously. A general rule: fine-tuning requires 3-4x the VRAM of inference for the same model.
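One way to see where a 3-4x multiplier can come from is to add up the per-parameter bytes. The sketch below is an illustration under explicit assumptions, not a description of any particular training stack: 16-bit weights and gradients (2 bytes each), and an Adam-style optimizer that keeps two moment values per parameter in either 16-bit (4 bytes total) or 8-bit (2 bytes total) form. Activations are excluded, and setups that keep FP32 optimizer states or master weights will land well above this range.

```python
def full_finetune_gb(params_billion: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optimizer_bytes: float = 4.0) -> float:
    """VRAM in GB for weights + gradients + optimizer states (no activations)."""
    return params_billion * (weight_bytes + grad_bytes + optimizer_bytes)


params = 8                       # an 8B model
inference_gb = params * 2.0      # FP16 inference, weights only

# Assumption: two 16-bit optimizer moments per parameter.
sixteen_bit = full_finetune_gb(params)
# Assumption: two 8-bit optimizer moments per parameter.
eight_bit = full_finetune_gb(params, optimizer_bytes=2.0)

print(f"inference: {inference_gb:.0f} GB")
print(f"fine-tune: {eight_bit:.0f}-{sixteen_bit:.0f} GB "
      f"({eight_bit / inference_gb:.0f}-{sixteen_bit / inference_gb:.0f}x)")
```

Treat the result as a floor for full fine-tuning: batch size and sequence length add activation memory on top.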
QLoRA dramatically reduces fine-tuning memory requirements. With QLoRA, you can fine-tune a 70B model on a single H100 80GB, or Llama 4 Scout on a single H100 using ~71 GB of VRAM. Learn how to deploy Llama 4 on GPU cloud for production fine-tuning.
The Cost Efficiency Tiers
Not every model needs the most expensive GPU. Here's how to think about cost efficiency:
Tier 1: Consumer GPUs ($0.40-0.70/hr)
RTX 4090, L40S
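Run models up to about 32B parameters quantized. Excellent for: Qwen 3 32B, Phi-4 14B, Llama 3.1 8B, Mistral 7B. Not suitable for: anything above 40B parameters on a 24 GB RTX 4090, even with aggressive quantization. The 48 GB L40S can hold a 70B model at INT4, but with little room left for KV cache.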
See our L40S inference benchmark guide for a detailed performance and pricing analysis.
Tier 2: Single Datacenter GPU ($0.60-4.54/hr)
A100 80GB, H100 80GB, H200 141GB
Run models up to roughly 120B parameters quantized, or up to about 40B in FP16 on an 80 GB card. The sweet spot for most production deployments. Covers: Qwen 3 72B, Llama 3.3 70B, Llama 4 Scout (quantized), Mistral Large 2 (quantized).
Tier 3: Multi-GPU Clusters ($10-36/hr)
4-8x H100, 4-8x H200, 4-8x B300
Required for 100B+ models and the largest MoE architectures. Covers: Llama 4 Maverick, DeepSeek V3.2, Kimi K2.5, Nemotron Ultra 253B (FP16).
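The tiers can also be read as a simple search: given a VRAM requirement, find the cheapest configuration that covers it. The sketch below is a rough illustration using the on-demand rates quoted in this post. The `CATALOG` list and `cheapest_config` helper are hypothetical names, GPU counts are limited to 1, 2, 4, or 8 per node as an assumption, and the search looks only at pooled VRAM and price, ignoring interconnect bandwidth and throughput.

```python
from typing import Optional

# (name, VRAM in GB, on-demand USD/hr) as quoted in this post.
CATALOG = [
    ("RTX 4090", 24, 0.55),
    ("L40S", 48, 0.72),
    ("A100 80GB", 80, 1.07),
    ("H100 80GB", 80, 2.50),
    ("H200 141GB", 141, 4.54),
]

NODE_SIZES = (1, 2, 4, 8)  # assumption: common GPU counts per node


def cheapest_config(required_gb: float) -> Optional[tuple]:
    """Return (usd_per_hr, gpu_count, gpu_name) for the cheapest fit, or None."""
    options = []
    for name, vram_gb, rate in CATALOG:
        for count in NODE_SIZES:
            if count * vram_gb >= required_gb:
                # Smallest node size that fits for this GPU type.
                options.append((round(count * rate, 2), count, name))
                break
    return min(options) if options else None


print(cheapest_config(35))    # 70B at INT4, weights only
print(cheapest_config(700))   # DeepSeek V3.2 at FP8
print(cheapest_config(2000))  # nothing in a single 8-GPU node fits
```

The cheapest fit is not always the right choice: a multi-GPU consumer configuration may win on price while losing badly on inter-GPU bandwidth, so treat the output as a starting point for the tiers above rather than a recommendation.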
Video Generation GPU Requirements
Video AI models have different requirements from LLMs. Where a 70B LLM at INT4 needs ~35GB VRAM, a 5-second 720p Wan 2.1 clip needs 65–80GB on an H100. The top open-source video models (Wan 2.1/2.2, HunyuanVideo) require datacenter hardware and cannot run on consumer GPUs.
| Model | Min VRAM | Min GPU | Notes |
|---|---|---|---|
| LTX-2.3 (720p) | 24–32GB | RTX 4090 (fp8) or RTX 5090 | fp8 quant required below 32GB |
| CogVideoX-1.5-5B (720p) | 24–32GB | RTX 4090 | 8-bit reduces to ~16GB |
| Wan 2.1/2.2 (480p) | 40–48GB | H100 PCIe | With fp8 quantization |
| Wan 2.1/2.2 (720p) | 65–80GB | H100 SXM5 | Tight on 80GB |
| HunyuanVideo (720p) | 60–80GB | H200 SXM | H100 carries OOM risk |
For model-by-model quality and cost comparisons, see the AI video generation GPU guide. For robotics workloads, see the Cosmos world foundation model GPU requirements guide, which covers VRAM and instance sizing for synthetic training data pipelines.
Time series foundation models (Chronos-T5-Large at ~2.5 GB, Lag-Llama at <0.1 GB) fit comfortably in the L40S tier without H100-class hardware. See the full deployment guide for complete VRAM tables by model variant.
What's Coming Next
The model landscape moves fast. By Q3 2026, expect DeepSeek R2 (the reasoning-focused successor to R1), Llama 4 Behemoth (the rumored 2T+ parameter model), and Qwen 4.0. Each will push VRAM requirements higher, but GPU improvements (B300's 288 GB, Rubin on the roadmap) and better quantization techniques (FP4, 1.58-bit) will keep the hardware accessible.
We'll update this cheat sheet as new models ship. If you're planning GPU infrastructure for the next 6-12 months, size for the largest model you expect to run and add 30% headroom for KV cache and future model growth.
Need GPUs matched to your model? Spheron has H100, H200, B300, A100, and RTX 4090 instances with per-minute billing and no commitments. Pick the GPU that fits your VRAM requirements and deploy in minutes.
Frequently Asked Questions
How much VRAM does Llama 4 need?
Llama 4 Scout (109B total, 17B active) requires ~55 GB at INT4 quantization, fitting on a single H100 80GB. At FP16 it needs ~218 GB, requiring 4x H100s or a single B300. Llama 4 Maverick (400B total, 17B active) requires ~200 GB at INT4, needing 4x H100s. At FP16 it needs ~800 GB, requiring 8x H200s.
Can DeepSeek V3.2 run on 8x H100 80GB?
No. DeepSeek V3.2 has 671B total parameters (37B active). At FP8, weights alone require ~671 GB, and including KV cache and activations the minimum practical VRAM lands around 700 GB. That exceeds 8x H100 80GB (640 GB total), so production deployment requires 8x H200 141GB (1,128 GB total) or a multi-node split with CPU offload.
What is the cheapest GPU for running a 70B model?
At INT4 quantization, a 70B model needs ~35-46 GB of VRAM. The cheapest option is an L40S (48 GB) at $0.72/hr on Spheron. An A100 80GB at $1.07/hr on-demand (or $0.60/hr spot) on Spheron also works, with more headroom for KV cache. For INT8, a single H100 80GB at $2.50/hr on Spheron fits the ~70 GB of weights but leaves little room for KV cache, so keep batch sizes and context lengths modest.
How do I calculate VRAM requirements for a model?
Multiply the parameter count (in billions) by bytes per parameter: FP16 = 2 bytes, INT8 or FP8 = 1 byte, INT4 = 0.5 bytes. A 70B model at INT4 needs 70 x 0.5 = 35 GB for weights. Add 20-40% for KV cache, activation memory, and framework overhead. For MoE models, use the total parameter count (not active), since all expert weights must be loaded.
