Every time a new model drops, the first question is the same: "What GPU do I need to run this?"
The answer is scattered across Hugging Face model cards, Reddit threads, and provider-specific docs — none of which agree with each other. This post puts it all in one place: exact VRAM requirements, recommended GPU configurations, and estimated cloud costs for every major open-source model available in February 2026.
Bookmark this page. We update it as new models release.
The Quick Reference Table
This table covers the most popular open-source models in production today. VRAM numbers assume inference (not training) with the most common quantization level for each model size.
| Model | Parameters | Quantization | VRAM Needed | Min GPU Config | Est. Cloud Cost/hr |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | INT4 | ~55 GB | 1x H100 80GB | ~$2.00 |
| Llama 4 Scout | 109B (17B active) | FP16 | ~218 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 402B (17B active) | INT4 | ~200 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 402B (17B active) | FP16 | ~804 GB | 8x H200 141GB | ~$28.00 |
| DeepSeek V3.2 Speciale | 671B (37B active) | FP8 | ~600 GB | 8x H100 80GB | ~$16.00 |
| DeepSeek V3.2 | 671B (37B active) | FP8 | ~600 GB | 8x H100 80GB | ~$16.00 |
| Qwen 3.5 72B | 72B | INT4 | ~36 GB | 1x H100 80GB | ~$2.00 |
| Qwen 3.5 72B | 72B | FP16 | ~144 GB | 2x H100 80GB | ~$4.00 |
| Qwen 3.5 32B | 32B | INT4 | ~16 GB | 1x RTX 4090 24GB | ~$0.55 |
| Qwen 3.5 32B | 32B | FP16 | ~64 GB | 1x H100 80GB | ~$2.00 |
| Mistral Large 2 | 123B | INT4 | ~62 GB | 1x H100 80GB | ~$2.00 |
| Mistral Large 2 | 123B | FP16 | ~246 GB | 4x H100 80GB | ~$8.00 |
| GLM-5 | ~200B (est.) | FP8 | ~200 GB | 4x H100 80GB | ~$8.00 |
| Kimi K2.5 | 1T (32B active) | INT4 | ~630 GB | 8x H100 80GB | ~$16.00 |
| Nemotron Ultra 253B | 253B | INT4 | ~127 GB | 2x H100 80GB | ~$4.00 |
| Nemotron Ultra 253B | 253B | FP16 | ~506 GB | 8x H100 80GB | ~$16.00 |
| Llama 3.3 70B | 70B | INT4 | ~35 GB | 1x H100 80GB | ~$2.00 |
| Llama 3.3 70B | 70B | FP16 | ~140 GB | 2x H100 80GB | ~$4.00 |
| Phi-4 14B | 14B | INT4 | ~7 GB | 1x RTX 4090 24GB | ~$0.55 |
| Phi-4 14B | 14B | FP16 | ~28 GB | 1x A100 40GB | ~$0.78 |
*Cloud costs based on current market rates: H100 ~$2.00/hr, H200 ~$3.50/hr, A100 ~$0.78/hr, RTX 4090 ~$0.55/hr.*
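If you want to recompute the cost column for a different GPU mix, it is just the per-GPU rate times the GPU count. A minimal sketch using the illustrative rates above (not live prices):

```python
# Reproduce the "Est. Cloud Cost/hr" column from the per-GPU rates quoted in this post.
# These are illustrative market rates, not live pricing.
HOURLY_RATE = {
    "H100 80GB": 2.00,
    "H200 141GB": 3.50,
    "A100 40GB": 0.78,
    "RTX 4090 24GB": 0.55,
}

def config_cost_per_hour(gpu: str, count: int) -> float:
    """Estimated hourly cost of a multi-GPU configuration."""
    return count * HOURLY_RATE[gpu]

print(config_cost_per_hour("H100 80GB", 4))   # 8.0  -> matches the 4x H100 rows
print(config_cost_per_hour("H200 141GB", 8))  # 28.0 -> matches the 8x H200 row
```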
How to Read This Table
Parameters vs Active Parameters. Mixture-of-Experts models (Llama 4, DeepSeek, Kimi K2.5) have a total parameter count much higher than their active parameter count. The total count determines VRAM needs (all weights must be loaded). The active count determines inference speed (fewer active parameters = faster per-token generation).
VRAM Needed. This is the approximate memory required for the model weights at the listed quantization level. It doesn't include KV cache, which grows with context length and batch size. For production use with reasonable context windows (32K-128K tokens), add 20-40% headroom above the listed VRAM.
Min GPU Config. The minimum hardware configuration that fits the model. For production inference with concurrent users, you'll often want more VRAM than the minimum — the extra headroom enables higher throughput through larger batch sizes.
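Putting those two notes together, here is a quick illustrative fit check; the 30% headroom and the example figures come straight from the table and guidance above:

```python
def fits(weight_vram_gb: float, gpu_vram_gb: float, gpu_count: int,
         headroom: float = 0.3) -> bool:
    """True if model weights plus a KV-cache/activation headroom fit the GPU config."""
    return weight_vram_gb * (1 + headroom) <= gpu_vram_gb * gpu_count

# Llama 4 Scout at INT4 (~55 GB of weights) on 1x H100 80GB, with 30% headroom:
print(fits(55, 80, 1))                 # True  -> 71.5 GB needed vs 80 GB available
# Llama 3.3 70B at FP16 (~140 GB of weights) on the 2x H100 minimum config:
print(fits(140, 80, 2, headroom=0.0))  # True  -> bare weights fit in 160 GB
print(fits(140, 80, 2, headroom=0.3))  # False -> 182 GB needed; plan for H200s or a third GPU
```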
VRAM Calculation Rules of Thumb
If a model isn't in the table above, you can estimate VRAM requirements using these formulas:
FP16/BF16 (standard 16-bit weights): Parameters in billions × 2 = VRAM in GB. A 70B model needs ~140 GB in FP16.
FP8 (8-bit floating-point weights): Parameters in billions × 1 = VRAM in GB. A 70B model needs ~70 GB in FP8.
INT4 (4-bit quantization): Parameters in billions × 0.5 = VRAM in GB. A 70B model needs ~35 GB in INT4.
KV cache overhead: For each 1K tokens of context per concurrent request, budget roughly 70-350 MB of VRAM at FP16 (it scales with layer count and KV-head count; MLA-style attention, as in DeepSeek, sits at the low end, while 70B-class GQA models sit near the top). At 128K context with 8 concurrent requests, the KV cache alone can range from roughly 70 GB to over 300 GB of additional VRAM unless you quantize the cache or cap per-request context.
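Here are the same rules of thumb as a small Python helper, plus an architecture-based KV-cache estimate for standard or grouped-query attention. The layer/head numbers in the example are Llama-3-70B-style and purely illustrative:

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Weight memory only, per the rules of thumb above (no KV cache, no activations)."""
    return params_billions * BYTES_PER_PARAM[precision]

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_requests: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache for standard or grouped-query attention at FP16 (2 bytes per element).

    Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
    MLA-style attention (DeepSeek) compresses this substantially.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens * concurrent_requests / 1e9

print(weight_vram_gb(70, "FP16"))  # 140.0 GB
print(weight_vram_gb(70, "INT4"))  # 35.0 GB
# Llama-3-style 70B (80 layers, 8 KV heads, head_dim 128), 128K context, 8 requests:
print(round(kv_cache_gb(80, 8, 128, 128_000, 8), 1))  # ~335 GB; an FP8 KV cache halves this
```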
GPU Selection Guide by Workload
Development and Experimentation
Best option: 1x RTX 4090 ($0.55/hr) or 1x A100 ($0.78/hr)
For prompt engineering, testing model quality, and running small-batch inference, you don't need datacenter GPUs. Most models up to 32B parameters fit on a single RTX 4090 with INT4 quantization. Larger models (70B, quantized) fit on an A100 80GB.
Use Spot instances for experimentation — you save 30-50% and interruptions don't matter when you're iterating.
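As a concrete starting point, here is a minimal single-GPU sketch using vLLM with an AWQ (4-bit) checkpoint. The model ID is a placeholder; keep max_model_len modest so the KV cache stays inside 24 GB:

```python
# Single-GPU INT4 (AWQ) inference sketch with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-32b-model-awq",  # placeholder: any AWQ-quantized ~32B checkpoint
    quantization="awq",                   # 4-bit weights, ~16 GB for a 32B model
    max_model_len=8192,                   # modest context so KV cache fits in 24 GB
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarize the tradeoffs between INT4 and FP16 inference."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```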
Single-Model Production Inference
Best option: 1x H100 ($2.00/hr) or 1x H200 ($3.50/hr)
For serving one model to production traffic, an H100 handles most 70B-class models comfortably. The H200's extra VRAM (141 GB vs 80 GB) gives you headroom for longer contexts and higher concurrency without sharding across multiple GPUs.
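To see what that headroom means in practice, here is a rough back-of-the-envelope sketch for a 70B model at INT4 serving full-length 32K requests; the 0.32 GB-per-1K-tokens figure is the GQA estimate from the rules of thumb above, and real servers with paged attention will squeeze in more requests when most of them don't use full context:

```python
# How much concurrency does the extra VRAM buy? 70B model, INT4 weights (~35 GB),
# FP16 KV cache, every request using the full 32K-token context.
def max_concurrent_requests(gpu_vram_gb: float, weight_gb: float,
                            kv_gb_per_request: float, reserve_frac: float = 0.1) -> int:
    """Requests whose full KV cache fits after weights and a safety reserve."""
    usable = gpu_vram_gb * (1 - reserve_frac) - weight_gb
    return int(usable // kv_gb_per_request)

kv_per_32k_request = 0.32 * 32  # ~0.32 GB per 1K tokens for a 70B GQA model -> ~10.2 GB
print(max_concurrent_requests(80, 35, kv_per_32k_request))   # 3 on an H100 80GB
print(max_concurrent_requests(141, 35, kv_per_32k_request))  # 8 on an H200 141GB
```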
Large Model Serving (100B+)
Best option: Multi-GPU H100 or B300 cluster
Models above 100B parameters require multi-GPU setups regardless of quantization. For the largest models (DeepSeek V3.2, Kimi K2.5), 8-GPU configurations are the minimum.
On Spheron AI, you can provision multi-GPU baremetal servers with H100, H200, and B300 GPUs. B300 Spot instances starting at $2.90/hr per GPU give you 288 GB VRAM per GPU — meaning a single B300 can run models that would require 2-4 H100s.
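Sharding itself is usually one flag away. A minimal tensor-parallel sketch with vLLM, assuming a 4-GPU node and a placeholder model ID:

```python
# Tensor-parallel inference sketch for a 100B+ model across 4 GPUs with vLLM.
from vllm import LLM

llm = LLM(
    model="your-org/your-large-moe-int4",  # placeholder for a large quantized checkpoint
    tensor_parallel_size=4,                # shards weights and KV cache across 4 GPUs
    gpu_memory_utilization=0.90,
    max_model_len=32768,
)
```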
Training and Fine-Tuning
Best option: H100 or B300 with NVLink
Fine-tuning requires more VRAM than inference because you need to store the model weights, gradients, and optimizer states simultaneously. A general rule: fine-tuning requires 3-4x the VRAM of inference for the same model.
QLoRA dramatically reduces fine-tuning memory requirements. With QLoRA, you can fine-tune a 70B model on a single H100 80GB, or Llama 4 Scout on a single H100 using ~71 GB of VRAM.
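A minimal QLoRA setup with Hugging Face transformers, bitsandbytes, and peft looks like this; the model ID is a placeholder and target_modules depends on the architecture (the names below are the standard Llama-family attention projections):

```python
# QLoRA sketch: 4-bit base weights + LoRA adapters, the setup that lets a 70B model
# fine-tune on a single 80 GB GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-70b-model",  # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```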
The Cost Efficiency Tiers
Not every model needs the most expensive GPU. Here's how to think about cost efficiency:
Tier 1: Consumer GPUs ($0.40-0.70/hr)
RTX 4090, L40S
Run models up to 30B parameters quantized. Excellent for: Qwen 3.5 32B, Phi-4 14B, Llama 3.1 8B, Mistral 7B. Not suitable for: anything above 40B parameters, even with aggressive quantization.
Tier 2: Single Datacenter GPU ($0.78-3.50/hr)
A100 80GB, H100 80GB, H200 141GB
Run models up to 70-100B parameters quantized, or up to ~40B unquantized (FP16). The sweet spot for most production deployments. Covers: Qwen 3.5 72B, Llama 3.3 70B, Llama 4 Scout (quantized), Mistral Large 2 (quantized).
Tier 3: Multi-GPU Clusters ($8-28/hr)
4-8x H100, 4-8x H200, 4-8x B300
Required for 100B+ models and the largest MoE architectures. Covers: Llama 4 Maverick, DeepSeek V3.2 Speciale, Kimi K2.5, Nemotron Ultra 253B (FP16).
What's Coming Next
The model landscape moves fast. By Q2 2026, expect DeepSeek R2 (the reasoning-focused successor to R1), Llama 4 Behemoth (the rumored 2T+ parameter model), and Qwen 4.0. Each will push VRAM requirements higher — but GPU improvements (B300's 288 GB, Rubin on the roadmap) and better quantization techniques (FP4, 1.58-bit) will keep the hardware accessible.
We'll update this cheat sheet as new models ship. If you're planning GPU infrastructure for the next 6-12 months, size for the largest model you expect to run and add 30% headroom for KV cache and future model growth.