Engineering

GPU Requirements Cheat Sheet 2026: Every Major AI Model

Written by Mitrasish, Co-founder · Feb 22, 2026
Tags: GPU Requirements · VRAM · Open Source AI · Llama 4 · DeepSeek · Qwen · Mistral · LLM Deployment · AI Infrastructure

Every time a new model releases, the first question is always the same: "What GPU do I need to run this?"

The answer is scattered across Hugging Face model cards, Reddit threads, and provider-specific docs, none of which agree with each other. This post puts it all in one place: exact VRAM requirements, recommended GPU configurations, and estimated cloud costs for every major open-source model available in February 2026. For deeper technical understanding, see our comprehensive GPU memory requirements guide.

Bookmark this page. We update it as new models release.

For teams planning infrastructure beyond 2026, see the NVIDIA Rubin R100 guide for next-generation GPU specs and cloud availability projections.

The Quick Reference Table

This table covers the most popular open-source models in production today. VRAM numbers assume inference (not training) with the most common quantization level for each model size.

| Model | Parameters | Quantization | VRAM Needed | Min GPU Config | Est. Cloud Cost/hr |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | INT4 | ~55 GB | 1x H100 80GB | ~$2.00 |
| Llama 4 Scout | 109B (17B active) | FP16 | ~218 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 400B (17B active) | INT4 | ~200 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 400B (17B active) | FP16 | ~800 GB | 8x H200 141GB | ~$15.76 |
| DeepSeek V3.2 | 685B (37B active) | FP8 | ~690 GB | 8x H100 80GB (64K ctx only) | ~$16.00 |
| Qwen 3.5 397B | 397B MoE (varies active) | FP8 | ~397 GB | 8x H100 80GB | ~$16.08 |
| Qwen 3.5 397B | 397B MoE (varies active) | INT4 | ~199 GB | 4x H100 80GB | ~$8.04 |
| Qwen 3.5 27B | 27B | FP8 | ~27 GB | 1x H100 80GB | ~$2.01 |
| Qwen 3.5 9B | 9B | FP8 | ~9 GB | 1x RTX 4090 24GB | ~$0.50 |
| Qwen 3 72B | 72B | INT4 | ~36 GB | 1x H100 80GB | ~$2.01 |
| Qwen 3 72B | 72B | FP16 | ~144 GB | 2x H100 80GB | ~$4.00 |
| Qwen 3 32B | 32B | INT4 | ~16 GB | 1x RTX 4090 24GB | ~$0.50 |
| Qwen 3 32B | 32B | FP16 | ~64 GB | 1x H100 80GB | ~$2.01 |
| Mistral Large 2 | 123B | INT4 | ~62 GB | 1x H100 80GB | ~$2.01 |
| Mistral Large 2 | 123B | FP16 | ~246 GB | 4x H100 80GB | ~$8.00 |
| Kimi K2.5 | 1T (32B active) | INT4 | ~630 GB | 8x H100 80GB | ~$16.00 |
| Nemotron Ultra 253B | 253B | INT4 | ~127 GB | 2x H100 80GB | ~$4.00 |
| Nemotron Ultra 253B | 253B | FP16 | ~506 GB | 8x H100 80GB | ~$16.00 |
| Nemotron 3 Super (120B MoE) | 120B (12B active) | NVFP4 | ~60 GB | 1x H100 80GB | ~$2.00 |
| Llama 3.3 70B | 70B | INT4 | ~35 GB | 1x H100 80GB | ~$2.01 |
| Llama 3.3 70B | 70B | FP16 | ~140 GB | 2x H100 80GB | ~$4.00 |
| Phi-4 14B | 14B | INT4 | ~7 GB | 1x RTX 4090 24GB | ~$0.50 |
| Phi-4 14B | 14B | FP16 | ~28 GB | 1x A100 40GB | ~$1.07 |

Cloud costs based on current market rates: H100 ~$2.01/hr, H200 ~$1.97/hr, A100 ~$1.07/hr, RTX 4090 ~$0.50/hr. For full vLLM setup steps for Qwen 3.5, see the Qwen 3.5 deployment guide.
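Cluster costs in the table are simply the per-GPU rate times the GPU count. A minimal sketch, assuming the market rates quoted above (the helper name and rate dictionary are illustrative, not a real API):

```python
# Hypothetical helper: estimate cluster cost from the per-GPU market
# rates quoted above (H100 ~$2.01/hr, H200 ~$1.97/hr, etc.).
HOURLY_RATE = {"H100": 2.01, "H200": 1.97, "A100": 1.07, "RTX 4090": 0.50}

def cluster_cost_per_hr(gpu: str, count: int) -> float:
    """Estimated hourly cost for `count` GPUs of the given type."""
    return round(count * HOURLY_RATE[gpu], 2)

print(cluster_cost_per_hr("H100", 8))   # 16.08 -> Qwen 3.5 397B FP8 row
print(cluster_cost_per_hr("H200", 8))   # 15.76 -> Llama 4 Maverick FP16 row
```

Some table entries round these figures (e.g. ~$16.00 for 8x H100); the arithmetic above reproduces the unrounded rows.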

For a full deployment walkthrough of Nemotron 3 Super including VRAM breakdowns across all precision tiers and vLLM configuration, see the Nemotron 3 Super GPU deployment guide.

Note on DeepSeek V3.2 64K ctx constraint: The ~690 GB FP8 model exceeds 8× H100 80GB total VRAM (640 GB). Running on 8× H100 requires expert parallelism and limits context to 64K tokens, a viable minimum but not full-context deployment. For full 128K context, upgrade to 8× H200 141GB (1,128 GB total, ~$15.76/hr). See our DeepSeek V3.2 deployment guide for setup details.
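The arithmetic behind that constraint is worth making explicit. An illustrative sanity check (numbers taken from the table above; this is not a deployment script):

```python
# Sanity check on the DeepSeek V3.2 numbers above (illustrative only).
model_vram_gb = 690            # FP8 weights, approximate
h100_total = 8 * 80            # 8x H100 80GB = 640 GB
h200_total = 8 * 141           # 8x H200 141GB = 1,128 GB

assert model_vram_gb > h100_total   # weights alone exceed the 8x H100 pool
assert model_vram_gb < h200_total   # fits on 8x H200 with room for KV cache
print(h100_total, h200_total)
```

Because the weights do not fit on 8x H100 at all, the 64K-context deployment relies on expert parallelism and offloading tricks rather than simply loading everything into GPU memory.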

How to Read This Table

Parameters vs Active Parameters. Mixture-of-Experts models (Llama 4, DeepSeek, Kimi K2.5) have a total parameter count much higher than their active parameter count. The total count determines VRAM needs (all weights must be loaded). The active count determines inference speed: fewer active parameters means faster per-token generation.

VRAM Needed. This is the approximate memory required for the model weights at the listed quantization level. It doesn't include KV cache, which grows with context length and batch size. For production use with reasonable context windows (32K-128K tokens), add 20-40% headroom above the listed VRAM.

Min GPU Config. The minimum hardware configuration that fits the model. For production inference with concurrent users, you'll often want more VRAM than the minimum; the extra headroom enables higher throughput through larger batch sizes.

VRAM Calculation Rules of Thumb

If a model is not in the table above, you can estimate VRAM requirements using these formulas:

FP16 (half precision): Parameters in billions × 2 = VRAM in GB. A 70B model needs ~140 GB in FP16.

FP8 (8-bit weights): Parameters in billions × 1 = VRAM in GB. A 70B model needs ~70 GB in FP8.

INT4 (4-bit quantization): Parameters in billions × 0.5 = VRAM in GB. A 70B model needs ~35 GB in INT4.

KV cache overhead: For each 1K tokens of context per concurrent request, add approximately 50-100 MB of VRAM (varies by model architecture, attention head count, and KV cache precision). At 128K context with 8 concurrent requests, KV cache can consume 50-100 GB of additional VRAM.
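The rules of thumb above can be packed into a small estimator. This is a rough sketch (function names are illustrative); real memory usage varies by architecture and inference runtime:

```python
# Rough VRAM estimator following the rules of thumb above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_vram_gb(params_b: float, precision: str) -> float:
    """Weight memory in GB for a model with `params_b` billion parameters."""
    return params_b * BYTES_PER_PARAM[precision]

def kv_cache_gb(context_tokens: int, concurrent: int,
                mb_per_k_tokens: float = 75.0) -> float:
    """KV cache estimate, assuming ~50-100 MB per 1K tokens per request
    (75 MB midpoint by default)."""
    return context_tokens / 1000 * concurrent * mb_per_k_tokens / 1000

print(weight_vram_gb(70, "FP16"))       # 140.0 -> matches the 70B FP16 rule
print(weight_vram_gb(70, "INT4"))       # 35.0
print(kv_cache_gb(128_000, 8))          # ~77 GB, inside the 50-100 GB range
```

Add the two numbers together (plus 20-40% headroom, per the note above) to size a deployment for a model not listed in the table.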

GPU Selection Guide by Workload

Development and Experimentation

Best option: 1x RTX 4090 ($0.50/hr) or 1x A100 ($1.07/hr)

For prompt engineering, testing model quality, and running small-batch inference, you don't need datacenter GPUs. Most models up to 32B parameters fit on a single RTX 4090 with INT4 quantization. Larger models (70B) fit on a single A100 80GB.

Use Spot instances for experimentation: you save 30-50%, and interruptions don't matter when you're iterating.

Single-Model Production Inference

Best option: 1x H100 ($2.01/hr) or 1x H200 ($1.97/hr)

For serving one model to production traffic, an H100 handles most 70B-class models comfortably. The H200's extra VRAM (141 GB vs 80 GB) gives you headroom for longer contexts and higher concurrency without sharding across multiple GPUs. See our guide to the best NVIDIA GPUs for LLMs for a comparison of these options for your specific use case.

Large Model Serving (100B+)

Best option: Multi-GPU H100 or B300 cluster

Models above 100B parameters require multi-GPU setups regardless of quantization. For the largest models (DeepSeek V3.2, Kimi K2.5), 8-GPU configurations are the minimum.

On Spheron, you can provision multi-GPU baremetal servers with H100, H200, and B300 GPUs. B300 Spot instances start at $2.90/hr and provide 288 GB of VRAM per GPU, meaning a single B300 can run models that would otherwise require 2-4 H100s.

Training and Fine-Tuning

Best option: H100 or B300 with NVLink

Fine-tuning requires more VRAM than inference because you need to store the model weights, gradients, and optimizer states simultaneously. A general rule: fine-tuning requires 3-4x the VRAM of inference for the same model.

QLoRA dramatically reduces fine-tuning memory requirements. With QLoRA, you can fine-tune a 70B model on a single H100 80GB, or Llama 4 Scout on a single H100 using ~71 GB of VRAM. Learn how to deploy Llama 4 on GPU cloud for production fine-tuning.
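A back-of-envelope sketch shows why QLoRA fits a 70B model on one H100. The assumptions here are illustrative (4-bit base weights, FP16 LoRA adapters of roughly 200M parameters, AdamW states kept only for the adapters); actual usage also includes activations and framework overhead:

```python
# Back-of-envelope QLoRA memory sketch (illustrative assumptions:
# 4-bit base weights, FP16 LoRA adapters, AdamW on adapter params only).
def qlora_vram_gb(params_b: float, adapter_params_m: float = 200.0) -> float:
    base = params_b * 0.5                    # 4-bit quantized base weights
    adapters = adapter_params_m / 1000 * 2   # FP16 LoRA adapter weights
    grads = adapter_params_m / 1000 * 2      # gradients for adapters only
    optimizer = adapter_params_m / 1000 * 8  # AdamW moments in FP32
    return base + adapters + grads + optimizer

print(round(qlora_vram_gb(70), 1))   # ~37.4 GB before activations
```

Even with generous activation overhead on top, that lands comfortably under an H100's 80 GB, whereas full fine-tuning of the same model (weights, gradients, and optimizer states all in memory) would require a multi-GPU cluster.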

The Cost Efficiency Tiers

Not every model needs the most expensive GPU. Here's how to think about cost efficiency:

Tier 1: Consumer GPUs ($0.40-0.70/hr)

RTX 4090, L40S

Run models up to 30B parameters quantized. Excellent for: Qwen 3 32B, Phi-4 14B, Llama 3.1 8B, Mistral 7B. Not suitable for: anything above 40B parameters, even with aggressive quantization.

See our L40S inference benchmark guide for a detailed performance and pricing analysis.

Tier 2: Single Datacenter GPU ($0.78-3.50/hr)

A100 80GB, H100 80GB, H200 141GB

Run models up to 70-100B parameters quantized, or up to 40B in full precision. The sweet spot for most production deployments. Covers: Qwen 3 72B, Llama 3.3 70B, Llama 4 Scout (quantized), Mistral Large 2 (quantized).

Tier 3: Multi-GPU Clusters ($8-28/hr)

4-8x H100, 4-8x H200, 4-8x B300

Required for 100B+ models and the largest MoE architectures. Covers: Llama 4 Maverick, DeepSeek V3.2, Kimi K2.5, Nemotron Ultra 253B (FP16).
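The three tiers map cleanly onto the estimated weight VRAM. A hypothetical tier picker (the function and boundaries are illustrative, mirroring the tier descriptions above):

```python
# Hypothetical tier picker keyed on estimated weight VRAM in GB,
# mirroring the tier boundaries described above.
def cost_tier(vram_gb: float) -> str:
    if vram_gb <= 24:
        return "Tier 1: Consumer GPU (RTX 4090 / L40S)"
    if vram_gb <= 141:
        return "Tier 2: Single datacenter GPU (A100 / H100 / H200)"
    return "Tier 3: Multi-GPU cluster (4-8x H100 / H200 / B300)"

print(cost_tier(16))    # Qwen 3 32B INT4 -> Tier 1
print(cost_tier(35))    # Llama 3.3 70B INT4 -> Tier 2
print(cost_tier(200))   # Llama 4 Maverick INT4 -> Tier 3
```

Remember that these boundaries ignore KV cache; a model sitting right at a tier boundary (e.g. ~140 GB on a 141 GB H200) will usually need the next tier up in practice.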

Video Generation GPU Requirements

Video AI models have different requirements from LLMs. Where a 70B LLM at INT4 needs ~35GB VRAM, a 5-second 720p Wan 2.1 clip needs 65–80GB on an H100. The top open-source video models (Wan 2.1/2.2, HunyuanVideo) require datacenter hardware and cannot run on consumer GPUs.

| Model | Min VRAM | Min GPU | Notes |
|---|---|---|---|
| LTX-2.3 (720p) | 24-32 GB | RTX 4090 (fp8) or RTX 5090 | fp8 quant required below 32GB |
| CogVideoX-1.5-5B (720p) | 24-32 GB | RTX 4090 | 8-bit reduces to ~16GB |
| Wan 2.1/2.2 (480p) | 40-48 GB | H100 PCIe | With fp8 quantization |
| Wan 2.1/2.2 (720p) | 65-80 GB | H100 SXM5 | Tight on 80GB |
| HunyuanVideo (720p) | 60-80 GB | H200 SXM | H100 carries OOM risk |

For model-by-model quality and cost comparisons, see our AI video generation GPU guide.

What's Coming Next

The model landscape moves fast. By Q2 2026, expect DeepSeek R2 (the reasoning-focused successor to R1), Llama 4 Behemoth (the rumored 2T+ parameter model), and Qwen 4.0. Each will push VRAM requirements higher, but GPU improvements (B300's 288 GB, Rubin on the roadmap) and better quantization techniques (FP4, 1.58-bit) will keep the hardware accessible.

We'll update this cheat sheet as new models ship. If you're planning GPU infrastructure for the next 6-12 months, size for the largest model you expect to run and add 30% headroom for KV cache and future model growth.

Need GPUs matched to your model? Spheron has H100, H200, B300, A100, and RTX 4090 instances with per-minute billing and no commitments. Pick the GPU that fits your VRAM requirements and deploy in minutes.

Explore GPU rental options →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.