Engineering

GPU Requirements Cheat Sheet 2026: Every Major AI Model

Written by Mitrasish, Co-founder · Feb 22, 2026
Tags: GPU Requirements · VRAM · Open Source AI · Llama 4 · DeepSeek · Qwen · Mistral · LLM Deployment · AI Infrastructure

Every time a new model releases, the first question is always the same: "What GPU do I need to run this?"

The answer is scattered across Hugging Face model cards, Reddit threads, and provider-specific docs, none of which agree with each other. This post puts it all in one place: exact VRAM requirements, recommended GPU configurations, and estimated cloud costs for every major open-source model available in February 2026. For deeper technical understanding, see our comprehensive GPU memory requirements guide.

Bookmark this page. We update it as new models release.

For teams planning infrastructure beyond 2026, see the NVIDIA Rubin R100 guide for next-generation GPU specs and cloud availability projections.

The Quick Reference Table

This table covers the most popular open-source models in production today. VRAM numbers assume inference (not training) with the most common quantization level for each model size.

| Model | Parameters | Quantization | VRAM Needed | Min GPU Config | Est. Cloud Cost/hr |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | INT4 | ~55 GB | 1x H100 80GB | ~$2.00 |
| Llama 4 Scout | 109B (17B active) | FP16 | ~218 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 400B (17B active) | INT4 | ~200 GB | 4x H100 80GB | ~$8.00 |
| Llama 4 Maverick | 400B (17B active) | FP16 | ~800 GB | 8x H200 141GB | ~$15.76 |
| DeepSeek V3.2 | 685B (37B active) | FP8 | ~690 GB | 8x H100 80GB (64K ctx only) | ~$16.00 |
| Qwen 3.5 397B | 397B MoE (varies active) | FP8 | ~397 GB | 8x H100 80GB | ~$16.08 |
| Qwen 3.5 397B | 397B MoE (varies active) | INT4 | ~199 GB | 4x H100 80GB | ~$8.04 |
| Qwen 3.5 27B | 27B | FP8 | ~27 GB | 1x H100 80GB | ~$2.01 |
| Qwen 3.5 9B | 9B | FP8 | ~9 GB | 1x RTX 4090 24GB | ~$0.50 |
| Qwen 3 72B | 72B | INT4 | ~36 GB | 1x H100 80GB | ~$2.01 |
| Qwen 3 72B | 72B | FP16 | ~144 GB | 2x H100 80GB | ~$4.00 |
| Qwen 3 32B | 32B | INT4 | ~16 GB | 1x RTX 4090 24GB | ~$0.50 |
| Qwen 3 32B | 32B | FP16 | ~64 GB | 1x H100 80GB | ~$2.01 |
| Mistral Large 2 | 123B | INT4 | ~62 GB | 1x H100 80GB | ~$2.01 |
| Mistral Large 2 | 123B | FP16 | ~246 GB | 4x H100 80GB | ~$8.00 |
| Kimi K2.5 | 1T (32B active) | INT4 | ~630 GB | 8x H100 80GB | ~$16.00 |
| Nemotron Ultra 253B | 253B | INT4 | ~127 GB | 2x H100 80GB | ~$4.00 |
| Nemotron Ultra 253B | 253B | FP16 | ~506 GB | 8x H100 80GB | ~$16.00 |
| Nemotron 3 Super (120B MoE) | 120B (12B active) | NVFP4 | ~60 GB | 1x H100 80GB | ~$2.00 |
| Llama 3.3 70B | 70B | INT4 | ~35 GB | 1x H100 80GB | ~$2.01 |
| Llama 3.3 70B | 70B | FP16 | ~140 GB | 2x H100 80GB | ~$4.00 |
| Phi-4 14B | 14B | INT4 | ~7 GB | 1x RTX 4090 24GB | ~$0.50 |
| Phi-4 14B | 14B | FP16 | ~28 GB | 1x A100 40GB | ~$1.07 |

Cloud costs based on current market rates: H100 ~$2.01/hr, H200 ~$1.97/hr, A100 ~$1.07/hr, RTX 4090 ~$0.50/hr. For full vLLM setup steps for Qwen 3.5, see the Qwen 3.5 deployment guide.
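Cluster costs in the table are simply the per-GPU rate times the GPU count. A minimal sketch, assuming the market rates quoted above (the helper name and rate dictionary are illustrative, not a real API):

```python
# Hypothetical helper: estimate cluster cost from the per-GPU market
# rates quoted above (H100 ~$2.01/hr, H200 ~$1.97/hr, etc.).
HOURLY_RATE = {"H100": 2.01, "H200": 1.97, "A100": 1.07, "RTX 4090": 0.50}

def cluster_cost_per_hr(gpu: str, count: int) -> float:
    """Estimated hourly cost for `count` GPUs of the given type."""
    return round(count * HOURLY_RATE[gpu], 2)

print(cluster_cost_per_hr("H100", 8))   # 16.08 -> Qwen 3.5 397B FP8 row
print(cluster_cost_per_hr("H200", 8))   # 15.76 -> Llama 4 Maverick FP16 row
```

Some table entries round these figures (e.g. ~$16.00 for 8x H100); the arithmetic above reproduces the unrounded rows.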

For a full deployment walkthrough of Nemotron 3 Super including VRAM breakdowns across all precision tiers and vLLM configuration, see the Nemotron 3 Super GPU deployment guide.

Note on DeepSeek V3.2 64K ctx constraint: The ~690 GB FP8 model exceeds 8× H100 80GB total VRAM (640 GB). Running on 8× H100 requires expert parallelism and limits context to 64K tokens, a viable minimum but not full-context deployment. For full 128K context, upgrade to 8× H200 141GB (1,128 GB total, ~$15.76/hr). See our DeepSeek V3.2 deployment guide for setup details.
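The arithmetic behind that constraint is worth making explicit. An illustrative sanity check (numbers taken from the table above; this is not a deployment script):

```python
# Sanity check on the DeepSeek V3.2 numbers above (illustrative only).
model_vram_gb = 690            # FP8 weights, approximate
h100_total = 8 * 80            # 8x H100 80GB = 640 GB
h200_total = 8 * 141           # 8x H200 141GB = 1,128 GB

assert model_vram_gb > h100_total   # weights alone exceed the 8x H100 pool
assert model_vram_gb < h200_total   # fits on 8x H200 with room for KV cache
print(h100_total, h200_total)
```

Because the weights do not fit on 8x H100 at all, the 64K-context deployment relies on expert parallelism and offloading tricks rather than simply loading everything into GPU memory.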

How to Read This Table

Parameters vs Active Parameters. Mixture-of-Experts models (Llama 4, DeepSeek, Kimi K2.5) have a total parameter count much higher than their active parameter count. The total count determines VRAM needs (all weights must be loaded). The active count determines inference speed: fewer active parameters means faster per-token generation.

VRAM Needed. This is the approximate memory required for the model weights at the listed quantization level. It doesn't include KV cache, which grows with context length and batch size. For production use with reasonable context windows (32K-128K tokens), add 20-40% headroom above the listed VRAM.

Min GPU Config. The minimum hardware configuration that fits the model. For production inference with concurrent users, you'll often want more VRAM than the minimum; the extra headroom enables higher throughput through larger batch sizes.

VRAM Calculation Rules of Thumb

If a model is not in the table above, you can estimate VRAM requirements using these formulas:

FP16 (half precision): Parameters in billions × 2 = VRAM in GB. A 70B model needs ~140 GB in FP16.

FP8 (8-bit weights): Parameters in billions × 1 = VRAM in GB. A 70B model needs ~70 GB in FP8.

INT4 (4-bit quantization): Parameters in billions × 0.5 = VRAM in GB. A 70B model needs ~35 GB in INT4.

KV cache overhead: For each 1K tokens of context per concurrent request, add approximately 50-100 MB of VRAM (varies by model architecture, attention head count, and KV cache precision). At 128K context with 8 concurrent requests, KV cache can consume 50-100 GB of additional VRAM.
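The rules of thumb above can be packed into a small estimator. This is a rough sketch (function names are illustrative); real memory usage varies by architecture and inference runtime:

```python
# Rough VRAM estimator following the rules of thumb above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_vram_gb(params_b: float, precision: str) -> float:
    """Weight memory in GB for a model with `params_b` billion parameters."""
    return params_b * BYTES_PER_PARAM[precision]

def kv_cache_gb(context_tokens: int, concurrent: int,
                mb_per_k_tokens: float = 75.0) -> float:
    """KV cache estimate, assuming ~50-100 MB per 1K tokens per request
    (75 MB midpoint by default)."""
    return context_tokens / 1000 * concurrent * mb_per_k_tokens / 1000

print(weight_vram_gb(70, "FP16"))       # 140.0 -> matches the 70B FP16 rule
print(weight_vram_gb(70, "INT4"))       # 35.0
print(kv_cache_gb(128_000, 8))          # ~77 GB, inside the 50-100 GB range
```

Add the two numbers together (plus 20-40% headroom, per the note above) to size a deployment for a model not listed in the table.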

GPU Selection Guide by Workload

Development and Experimentation

Best option: 1x RTX 4090 ($0.50/hr) or 1x A100 ($1.07/hr)

For prompt engineering, testing model quality, and running small-batch inference, you don't need datacenter GPUs. Most models up to 32B parameters fit on a single RTX 4090 with INT4 quantization. Larger models (70B) fit on a single A100 80GB.

Use Spot instances for experimentation: you save 30-50%, and interruptions don't matter when you're iterating.

Single-Model Production Inference

Best option: 1x H100 ($2.01/hr) or 1x H200 ($1.97/hr)

For serving one model to production traffic, an H100 handles most 70B-class models comfortably. The H200's extra VRAM (141 GB vs 80 GB) gives you headroom for longer contexts and higher concurrency without sharding across multiple GPUs. See our guide to the best NVIDIA GPUs for LLMs for a comparison of these options for your specific use case.

Large Model Serving (100B+)

Best option: Multi-GPU H100 or B300 cluster

Models above 100B parameters require multi-GPU setups regardless of quantization. For the largest models (DeepSeek V3.2, Kimi K2.5), 8-GPU configurations are the minimum.

On Spheron, you can provision multi-GPU baremetal servers with H100, H200, and B300 GPUs. B300 Spot instances start at $2.90/hr and provide 288 GB of VRAM per GPU, meaning a single B300 can run models that would otherwise require 2-4 H100s.

Training and Fine-Tuning

Best option: H100 or B300 with NVLink

Fine-tuning requires more VRAM than inference because you need to store the model weights, gradients, and optimizer states simultaneously. A general rule: fine-tuning requires 3-4x the VRAM of inference for the same model.

QLoRA dramatically reduces fine-tuning memory requirements. With QLoRA, you can fine-tune a 70B model on a single H100 80GB, or Llama 4 Scout on a single H100 using ~71 GB of VRAM. Learn how to deploy Llama 4 on GPU cloud for production fine-tuning.
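A back-of-envelope sketch shows why QLoRA fits a 70B model on one H100. The assumptions here are illustrative (4-bit base weights, FP16 LoRA adapters of roughly 200M parameters, AdamW states kept only for the adapters); actual usage also includes activations and framework overhead:

```python
# Back-of-envelope QLoRA memory sketch (illustrative assumptions:
# 4-bit base weights, FP16 LoRA adapters, AdamW on adapter params only).
def qlora_vram_gb(params_b: float, adapter_params_m: float = 200.0) -> float:
    base = params_b * 0.5                    # 4-bit quantized base weights
    adapters = adapter_params_m / 1000 * 2   # FP16 LoRA adapter weights
    grads = adapter_params_m / 1000 * 2      # gradients for adapters only
    optimizer = adapter_params_m / 1000 * 8  # AdamW moments in FP32
    return base + adapters + grads + optimizer

print(round(qlora_vram_gb(70), 1))   # ~37.4 GB before activations
```

Even with generous activation overhead on top, that lands comfortably under an H100's 80 GB, whereas full fine-tuning of the same model (weights, gradients, and optimizer states all in memory) would require a multi-GPU cluster.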

The Cost Efficiency Tiers

Not every model needs the most expensive GPU. Here's how to think about cost efficiency:

Tier 1: Consumer GPUs ($0.40-0.70/hr)

RTX 4090, L40S

Run models up to 30B parameters quantized. Excellent for: Qwen 3 32B, Phi-4 14B, Llama 3.1 8B, Mistral 7B. Not suitable for: anything above 40B parameters, even with aggressive quantization.

See our L40S inference benchmark guide for a detailed performance and pricing analysis.

Tier 2: Single Datacenter GPU ($0.78-3.50/hr)

A100 80GB, H100 80GB, H200 141GB

Run models up to 70-100B parameters quantized, or up to 40B in full precision. The sweet spot for most production deployments. Covers: Qwen 3 72B, Llama 3.3 70B, Llama 4 Scout (quantized), Mistral Large 2 (quantized).

Tier 3: Multi-GPU Clusters ($8-28/hr)

4-8x H100, 4-8x H200, 4-8x B300

Required for 100B+ models and the largest MoE architectures. Covers: Llama 4 Maverick, DeepSeek V3.2, Kimi K2.5, Nemotron Ultra 253B (FP16).
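The three tiers map cleanly onto the estimated weight VRAM. A hypothetical tier picker (the function and boundaries are illustrative, mirroring the tier descriptions above):

```python
# Hypothetical tier picker keyed on estimated weight VRAM in GB,
# mirroring the tier boundaries described above.
def cost_tier(vram_gb: float) -> str:
    if vram_gb <= 24:
        return "Tier 1: Consumer GPU (RTX 4090 / L40S)"
    if vram_gb <= 141:
        return "Tier 2: Single datacenter GPU (A100 / H100 / H200)"
    return "Tier 3: Multi-GPU cluster (4-8x H100 / H200 / B300)"

print(cost_tier(16))    # Qwen 3 32B INT4 -> Tier 1
print(cost_tier(35))    # Llama 3.3 70B INT4 -> Tier 2
print(cost_tier(200))   # Llama 4 Maverick INT4 -> Tier 3
```

Remember that these boundaries ignore KV cache; a model sitting right at a tier boundary (e.g. ~140 GB on a 141 GB H200) will usually need the next tier up in practice.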

Video Generation GPU Requirements

Video AI models have different requirements from LLMs. Where a 70B LLM at INT4 needs ~35GB VRAM, a 5-second 720p Wan 2.1 clip needs 65–80GB on an H100. The top open-source video models (Wan 2.1/2.2, HunyuanVideo) require datacenter hardware and cannot run on consumer GPUs.

| Model | Min VRAM | Min GPU | Notes |
|---|---|---|---|
| LTX-2.3 (720p) | 24-32 GB | RTX 4090 (fp8) or RTX 5090 | fp8 quant required below 32GB |
| CogVideoX-1.5-5B (720p) | 24-32 GB | RTX 4090 | 8-bit reduces to ~16GB |
| Wan 2.1/2.2 (480p) | 40-48 GB | H100 PCIe | With fp8 quantization |
| Wan 2.1/2.2 (720p) | 65-80 GB | H100 SXM5 | Tight on 80GB |
| HunyuanVideo (720p) | 60-80 GB | H200 SXM | H100 carries OOM risk |

For model-by-model quality and cost comparisons, see our AI video generation GPU guide.

What's Coming Next

The model landscape moves fast. By Q2 2026, expect DeepSeek R2 (the reasoning-focused successor to R1), Llama 4 Behemoth (the rumored 2T+ parameter model), and Qwen 4.0. Each will push VRAM requirements higher, but GPU improvements (B300's 288 GB, Rubin on the roadmap) and better quantization techniques (FP4, 1.58-bit) will keep the hardware accessible.

We'll update this cheat sheet as new models ship. If you're planning GPU infrastructure for the next 6-12 months, size for the largest model you expect to run and add 30% headroom for KV cache and future model growth.

Need GPUs matched to your model? Spheron has H100, H200, B300, A100, and RTX 4090 instances with per-minute billing and no commitments. Pick the GPU that fits your VRAM requirements and deploy in minutes.

Explore GPU rental options →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.