
GPU Memory Requirements for LLMs: VRAM Calculator

Written by Mitrasish, Co-founder · Mar 20, 2026

GPU memory is the single biggest constraint when deploying large language models. A model that doesn't fit in VRAM either won't run at all or spills into system RAM and drops to a fraction of its potential throughput. Understanding exactly how much GPU memory your model requires and what drives that number is the difference between a smooth production deployment and a 3 AM outage. For quick model-to-GPU matching, see our GPU requirements cheat sheet. For diagnosis and fixes when slow inference is the symptom rather than an OOM, see Why Your LLM Inference Is Slow.

The challenge is that VRAM requirements are not just about model size. A Llama 3.1 70B model's weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB once you account for the KV cache, activation memory, and framework overhead. Context length, batch size, quantization strategy, and even your choice of serving framework all change the equation.

This guide breaks down every component of GPU memory consumption for LLM inference and training, provides exact VRAM calculations for popular 2026 models, and covers the optimization techniques that let you fit larger models on fewer GPUs.

The Four Components of GPU Memory Usage

GPU memory consumption during LLM inference breaks down into four distinct components, each with different scaling behavior:

1. Model Weights

Model weights are the learned parameters of the neural network, the core data that defines what the model knows. Weight memory scales linearly with parameter count and the precision format used to store each parameter.

The formula is straightforward: Memory = Parameters × Bytes per Parameter

| Precision Format | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 (full precision) | 4 bytes | 280 GB |
| FP16 / BF16 (half precision) | 2 bytes | 140 GB |
| INT8 (8-bit quantization) | 1 byte | 70 GB |
| INT4 (4-bit quantization) | 0.5 bytes | 35 GB |

Most production inference uses FP16/BF16 or quantized formats. FP32 is used only for specific training scenarios where maximum numerical precision matters.
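
The formula is simple enough to turn into a quick calculator. A minimal sketch (the byte widths are the standard ones from the table above; `weight_memory_gb` is an illustrative helper, not any framework's API):

```python
# Standard storage widths per parameter for common precision formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory = parameters x bytes per parameter, in decimal GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# Reproduce the 70B column of the table above:
for fmt in ("fp32", "fp16", "int8", "int4"):
    print(f"{fmt}: {weight_memory_gb(70, fmt):.0f} GB")  # 280, 140, 70, 35
```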

2. KV Cache

The KV (Key-Value) cache is the hidden memory monster in LLM inference. During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't have to recompute them. This cache grows with every token in the context window and with every concurrent request being served.

KV cache memory per token depends on three model-specific parameters: the number of layers, the number of KV heads, and the head dimension. The formula is:

KV Cache per Token: 2 × Layers × KV Heads × Head Dimension × Bytes per Element

The factor of 2 accounts for both the key and value vectors stored for each layer.

For a model using Grouped Query Attention (GQA), which most modern LLMs do, the KV head count is much smaller than the query head count, dramatically reducing cache size. Llama 3.1 70B, for example, uses 80 layers with 8 KV heads (versus 64 query heads) and a head dimension of 128, giving it approximately 0.31 MB per token at BF16. Standard multi-head attention (MHA) would require 2.5 MB per token for a similar-sized model (an 8x difference).

The critical insight: KV cache scales linearly with both context length and batch size. A single Llama 3.1 70B request at 128K context consumes approximately 40 GB of KV cache alone. Serve 4 concurrent requests at that context length and you need 160 GB just for the cache, more than the model weights themselves. This is why H200 GPUs with 141GB VRAM become cost-effective despite higher hourly rates.
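
The per-token formula and its scaling can be checked directly in code. A minimal sketch using Llama 3.1 70B's published geometry (80 layers, 8 KV heads, head dimension 128, 2-byte BF16 elements); the helper names are illustrative:

```python
def kv_cache_per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """2 (key + value) x layers x KV heads x head dimension x bytes per element."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int) -> float:
    """Total KV cache (decimal GB) for `batch` concurrent requests."""
    per_token = kv_cache_per_token_bytes(layers, kv_heads, head_dim)
    return per_token * context_len * batch / 1e9

# Llama 3.1 70B (GQA): 327,680 bytes/token = 0.3125 MiB, the ~0.31 MB quoted above.
print(kv_cache_per_token_bytes(80, 8, 128) / 2**20)  # 0.3125
print(kv_cache_gb(80, 8, 128, 128_000, 1))           # ≈41.9 — the ~40 GB figure
print(kv_cache_gb(80, 8, 128, 128_000, 4))           # ≈168 — cache alone, 4 requests
```

Passing `kv_heads=64` instead of 8 shows the MHA cost: the same call returns exactly eight times as much cache.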

3. Activation Memory

Activations are the intermediate outputs computed during the forward pass through the network. During inference, only the activations for the current layer need to be held in memory, unlike training, where activations for all layers must be stored for backpropagation.

Inference activation memory is relatively modest, typically 5-10% of the total memory footprint for standard batch sizes. However, it grows with batch size, so high-throughput serving configurations with large batches may see activations consume a more significant share.

4. Framework and System Overhead

The serving framework (vLLM, TensorRT-LLM, SGLang), CUDA context, memory allocator, and driver all consume GPU memory before your model loads a single weight. Typical overhead ranges from 500 MB to 2 GB, depending on the framework and configuration.

Memory fragmentation is another factor. GPU memory allocators don't always pack data perfectly, leaving small unusable gaps. Without optimization (like vLLM's PagedAttention), fragmentation can waste 20-30% of available memory. Modern serving frameworks have largely solved this, but it's still a factor when estimating tight memory budgets.

VRAM Requirements for Popular 2026 Models

The following table shows approximate total VRAM requirements for popular models at different quantization levels. These figures include model weights plus a baseline overhead of approximately 15-20% for KV cache (short context, low batch size), activation memory, and framework buffers. This means the totals are higher than raw weights-only calculations: for example, a 72B model at FP16 is 72B x 2 bytes = ~144 GB for weights alone, but the total including overhead is ~172 GB. Guides that show weights-only figures will show lower numbers for the same model. Production deployments with long contexts or high concurrency will need more.

| Model | Parameters | FP16 Total | INT8 Total | INT4 Total |
|---|---|---|---|---|
| Mistral 7B | 7B | ~18 GB | ~10 GB | ~6 GB |
| Llama 3.1 8B | 8B | ~20 GB | ~11 GB | ~7 GB |
| Phi-4 14B | 14B | ~34 GB | ~18 GB | ~10 GB |
| Qwen 3 32B | 32B | ~76 GB | ~40 GB | ~22 GB |
| Llama 3.3 70B | 70B | ~168 GB | ~84 GB | ~46 GB |
| Qwen 3 72B | 72B | ~172 GB | ~86 GB | ~47 GB |
| Llama 4 Scout (MoE) | 109B total / 17B active | ~262 GB | ~131 GB | ~66 GB |
| Mistral Large 2 | 123B | ~295 GB | ~148 GB | ~75 GB |
| Nemotron Ultra 253B (MoE) | 253B | ~607 GB | ~304 GB | ~152 GB |
| Llama 4 Maverick (MoE) | 400B total / 17B active | ~960 GB | ~480 GB | ~240 GB |
| DeepSeek V3.2 (MoE) | 685B total / 37B active | ~1.6 TB | ~822 GB | ~411 GB |

These are estimates. Actual requirements vary by serving framework, context length, batch size, and the specific quantization method used (GPTQ, AWQ, GGUF, etc.). For a hands-on deployment guide covering Qwen 3 from 8B to 235B, see Deploy Qwen 3 on GPU Cloud. For Nemotron models specifically, see our Nemotron 3 Super deployment guide for the full VRAM breakdown and quantization tiers. For GPT-OSS 20B and 120B MoE VRAM breakdowns including MXFP4 quantization estimates and per-variant GPU recommendations, see the GPT-OSS GPU requirements guide.

How Context Length Multiplies Memory

Context length is the most overlooked driver of GPU memory consumption. As models support 32K, 128K, and even 1M token context windows, the KV cache can easily exceed the model weights in total memory usage.

Here's how KV cache memory scales with context length for Llama 3.1 70B (BF16, single request, GQA with 8 KV heads):

| Context Length | KV Cache per Request | Total with Weights (FP16) |
|---|---|---|
| 2,048 tokens | ~0.6 GB | ~141 GB |
| 8,192 tokens | ~2.5 GB | ~143 GB |
| 32,768 tokens | ~10 GB | ~150 GB |
| 65,536 tokens | ~20 GB | ~160 GB |
| 128,000 tokens | ~40 GB | ~180 GB |

At 128K context with a single request, the KV cache adds 40 GB, roughly 29% of the total. With 4 concurrent requests at 128K context, the KV cache alone would need 160 GB, pushing the total far beyond what a single H200 (141 GB) can handle.

This is why production inference at long context lengths requires either fewer concurrent requests per GPU, KV cache quantization, or multi-GPU setups with careful memory planning.
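
The scaling in the table above can be reproduced from the per-token formula. A sketch in decimal GB (so values land slightly above the rounded table figures); the constants are Llama 3.1 70B's geometry:

```python
PER_TOKEN_BYTES = 2 * 80 * 8 * 128 * 2  # Llama 3.1 70B at BF16: 327,680 B/token
WEIGHTS_GB = 140                         # FP16 weights

for ctx in (2_048, 8_192, 32_768, 65_536, 128_000):
    cache = PER_TOKEN_BYTES * ctx / 1e9
    print(f"{ctx:>7} tokens: {cache:5.1f} GB cache, {WEIGHTS_GB + cache:5.1f} GB total")
```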

Quantization: Trading Precision for Memory

Quantization is the most effective technique for reducing VRAM requirements. By storing model weights at lower numerical precision, you can cut memory usage by 2-4x with surprisingly little impact on output quality. See our best NVIDIA GPUs for LLMs comparison for how different GPU configurations handle quantization trade-offs.

Distillation goes further than quantization by creating a structurally smaller model. See the 70B-to-7B distillation guide for GPU requirements and cost math for the training phase.

Common Quantization Formats

INT8 (8-bit) reduces each parameter from 2 bytes (FP16) to 1 byte, a 50% memory saving. Quality degradation is negligible for most applications. Frameworks like TensorRT-LLM, vLLM, and bitsandbytes support INT8 inference natively.

INT4 (4-bit) further reduces each parameter to 0.5 bytes, a 75% saving over FP16. Modern 4-bit quantization methods like GPTQ, AWQ, and GGUF maintain excellent output quality, with perplexity increases of only 1-3% on most benchmarks. For production chatbots and Q&A systems, the quality difference is typically imperceptible to end users. If KV cache memory is your bottleneck at long context lengths, see our guide on TurboQuant KV cache compression.

FP8 (8-bit floating point) is supported natively on Hopper and Blackwell GPUs (H100, H200, B200). Unlike INT8, FP8 preserves the floating-point format, offering slightly better quality for the same memory footprint. FP8 also enables the Transformer Engine on Hopper GPUs, which dynamically switches between FP8 and FP16 per layer.

Quantization Impact on Quality

The quality-memory tradeoff varies by model and task. General guidelines:

  • INT8: Safe for virtually all inference use cases. Negligible quality loss on standard benchmarks.
  • INT4 (GPTQ/AWQ): Excellent for most production inference. Slight degradation on complex reasoning tasks, imperceptible for conversational AI.
  • INT4 (GGUF Q4_K_M): The most popular format for local LLM deployment. Slightly better quality than naive INT4 due to mixed precision within quantization groups. See the full GGUF quantization deployment guide for step-by-step instructions and cost comparisons.
  • INT2/INT3: Experimental. Noticeable quality degradation. Not recommended for production use cases.

GPU-to-Model Matching Guide

Choosing the right GPU means matching your model's total memory requirement (weights + KV cache + overhead) to the GPU's available dedicated VRAM. Here's a practical mapping:

| Model | Quantization | Min VRAM | Recommended GPUs |
|---|---|---|---|
| Mistral 7B / Llama 3.1 8B | FP16 | ~18-20 GB | RTX 4090 (24 GB), RTX 5090 (32 GB), L40S (48 GB) |
| Mistral 7B / Llama 3.1 8B | INT4 | ~6-7 GB | L4 (24 GB), any 8+ GB GPU |
| Phi-4 14B | FP16 | ~34 GB | RTX 5090 (32 GB), L40S (48 GB), A100 40 GB |
| Phi-4 14B | INT4 | ~10 GB | RTX 4090 (24 GB), L4 (24 GB) |
| Qwen 3 32B | INT4 | ~22 GB | RTX 4090 (24 GB), RTX 5090 (32 GB), L40S (48 GB) |
| Llama 3.3 70B / Qwen 3 72B | INT8 | ~84-86 GB | H100 80 GB (tight), A100 80 GB (tight) |
| Llama 3.3 70B / Qwen 3 72B | INT4 | ~46-47 GB | L40S (48 GB), A100 80 GB |
| Llama 3.3 70B / Qwen 3 72B | FP16 | ~168-172 GB | B300 (288 GB), 2× H100, H200 + offload |
| Llama 4 Scout (109B MoE) | INT4 | ~66 GB | H100 80 GB, A100 80 GB |
| Llama 4 Maverick (400B MoE) | INT4 | ~240 GB | B300 (288 GB), 2× H200, 4× H100 |
| DeepSeek V3.2 (685B MoE) | FP8 | ~822 GB | 8× H100 (64K ctx), 8× H200 (full ctx) |

For production inference with long context windows or high concurrency, add 30-50% additional VRAM headroom beyond the base model size to accommodate KV cache growth.
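
That headroom rule can be expressed as a quick check. A sketch only (`fits` is a hypothetical helper; real serving stacks differ in how much usable VRAM they expose):

```python
def fits(total_model_gb: float, gpu_vram_gb: float, headroom: float = 0.3) -> bool:
    """Does the model fit this GPU with fractional headroom reserved for
    KV cache growth and overhead? 0.3-0.5 is the range suggested above."""
    return total_model_gb * (1 + headroom) <= gpu_vram_gb

# Llama 3.3 70B at INT4 (~46 GB) on an L40S (48 GB):
print(fits(46, 48, headroom=0.0))  # True — loads, but no room for long contexts
print(fits(46, 48, headroom=0.3))  # False — too tight for production concurrency
print(fits(46, 80, headroom=0.3))  # True on an A100/H100 80 GB
```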

KV Cache Optimization Techniques

The KV cache is the most dynamic component of GPU memory usage, and several techniques exist to control its growth.

PagedAttention (vLLM)

Traditional KV cache implementations pre-allocate memory for the maximum sequence length, wasting 60-80% of allocated cache memory for requests that don't use the full context window. PagedAttention, introduced by vLLM, borrows the concept of virtual memory paging from operating systems.

Instead of allocating one contiguous block per request, PagedAttention divides the KV cache into fixed-size blocks that are allocated on-demand as tokens are generated. This eliminates pre-allocation waste and reduces internal fragmentation to near zero, enabling 2-4x more concurrent requests on the same GPU.

For a full deployment guide on all KV cache optimization techniques, see KV Cache Optimization: Serve 10x More Users on the Same GPU.
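
The on-demand block idea can be sketched as a toy allocator. This is illustrative only, not vLLM's implementation (which manages GPU block tables, copy-on-write for beam search, and much more); all names here are hypothetical:

```python
class PagedKVAllocator:
    """Toy sketch of PagedAttention-style allocation: KV memory is carved into
    fixed-size blocks handed out on demand, so a request holds blocks only for
    tokens it has actually generated, not for the maximum sequence length."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size           # tokens per KV block
        self.free = list(range(total_blocks))  # pool of free block IDs
        self.tables = {}                       # request -> list of block IDs
        self.lengths = {}                      # request -> tokens generated

    def append_token(self, req: str) -> None:
        """Allocate a new block only when the current block is full."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:           # crossed a block boundary
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(total_blocks=8, block_size=16)
for _ in range(20):                 # 20 tokens occupy 2 blocks,
    alloc.append_token("req-1")     # not a max-context-length slab
print(len(alloc.tables["req-1"]))   # 2
```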

KV Cache Quantization

The KV cache itself can be quantized separately from the model weights. Quantizing the KV cache from FP16 to FP8 or INT8 shrinks cache memory by 2x with minimal quality impact. Combined with GQA (which already provides an 8x reduction), this creates a compound effect that makes long-context inference feasible on single GPUs.

vLLM supports KV cache quantization natively, and the technique is particularly valuable for production deployments serving 32K-128K context windows.
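
The compound effect is easy to quantify. A sketch using the Llama 3.1 70B geometry from earlier, with the FP8 cache assumed at 1 byte per element:

```python
def kv_gb(layers, heads, head_dim, ctx, bytes_per_elem):
    # One request: 2 (K and V) x layers x heads x head dim x context x element size
    return 2 * layers * heads * head_dim * ctx * bytes_per_elem / 1e9

CTX = 128_000
mha_fp16 = kv_gb(80, 64, 128, CTX, 2)  # standard MHA, FP16 cache
gqa_fp16 = kv_gb(80, 8, 128, CTX, 2)   # GQA alone: 8x smaller
gqa_fp8 = kv_gb(80, 8, 128, CTX, 1)    # GQA + FP8 cache: 16x smaller overall
print(round(mha_fp16), round(gqa_fp16), round(gqa_fp8))  # 336 42 21
```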

Grouped Query Attention (GQA)

GQA is an architectural feature (not a deployment optimization) where the model uses fewer KV heads than query heads. Most modern LLMs, including Llama 3.1, Mistral, Qwen 2.5, and DeepSeek, use GQA by default.

The impact is dramatic: Llama 3.1 70B uses 8 KV heads instead of 64, reducing KV cache memory by 8x compared to standard multi-head attention. This is why modern 70B models are practical to serve on single GPUs despite their size. Without GQA, the KV cache alone would consume 320 GB at 128K context instead of 40 GB.

Training Memory: Why It's 3-5x More Than Inference

Training requires significantly more GPU memory than inference because the GPU must simultaneously hold multiple copies of the model data:

| Component | Size (70B FP16) | Purpose |
|---|---|---|
| Model weights | ~140 GB | The parameters being trained |
| Gradients | ~140 GB | Computed during backpropagation |
| Optimizer states (Adam) | ~280 GB | Adam's two moment buffers (m and v), 2 bytes per parameter each here |
| Activations | 10-100+ GB | Stored for backpropagation (varies with batch size) |
| Total | 570-660+ GB | Requires multi-GPU distribution |

This is why training a 70B model requires 8x H100 GPUs (640 GB total HBM) even before accounting for activation memory. Techniques like ZeRO (Zero Redundancy Optimizer), gradient checkpointing, and mixed-precision training help distribute and reduce this memory burden, but the fundamental scale remains enormous.
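
The breakdown above can be sketched as a calculator. The byte budgets mirror the table (FP16 weights and gradients, 4 bytes per parameter of optimizer state); `activations_gb` is a placeholder, since the real number depends on batch size and sequence length:

```python
def training_memory_gb(params_billion: float,
                       weight_bytes: float = 2,      # FP16 weights
                       grad_bytes: float = 2,        # FP16 gradients
                       optimizer_bytes: float = 4,   # two Adam moment buffers
                       activations_gb: float = 50):  # batch-size dependent guess
    """Per-component training memory in decimal GB, mirroring the table above."""
    p = params_billion * 1e9
    parts = {
        "weights": p * weight_bytes / 1e9,
        "gradients": p * grad_bytes / 1e9,
        "optimizer": p * optimizer_bytes / 1e9,
        "activations": activations_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

print(training_memory_gb(70))
# 140 + 140 + 280 + 50 of activations -> 610 GB, inside the 570-660+ GB range
```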

Fine-Tuning with LoRA

Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA dramatically reduce fine-tuning memory requirements by training only a small number of adapter parameters while freezing the base model weights. QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of a 70B model on a single A100 80 GB or even an RTX 4090 with careful configuration.
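
To see why the savings are so large, count the trainable parameters. A sketch only: the 80 layers, 8192-wide square projections, rank 16, and two adapted modules per layer are illustrative assumptions, not any model's exact geometry:

```python
def lora_trainable_params(layers: int, d_in: int, d_out: int,
                          rank: int, modules_per_layer: int) -> int:
    """LoRA adds two low-rank factors, A (d_in x r) and B (r x d_out), per
    adapted weight matrix; only these are trained, the base stays frozen."""
    return layers * modules_per_layer * rank * (d_in + d_out)

adapter = lora_trainable_params(80, 8192, 8192, rank=16, modules_per_layer=2)
print(adapter / 1e6, "M trainable params")  # ≈42M — millions, not billions
print(adapter * 2 / 1e9, "GB at FP16")      # under 0.1 GB of adapter weights
```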

Memory Overflow Strategies

When your model's memory requirements exceed available VRAM, several strategies can help, each with meaningful tradeoffs.

Tensor Parallelism

Tensor parallelism splits individual layers across multiple GPUs, distributing both weights and computation. With NVLink-connected GPUs (900 GB/s on H100), the communication overhead is minimal. This is the standard approach for serving models that do not fit on a single GPU, such as splitting Llama 3.1 70B at FP16 across two H100 GPUs.

The tradeoff is cost (two GPUs instead of one) and slightly increased latency from inter-GPU communication.

CPU Offloading

When GPU memory is insufficient, model layers or KV cache pages can be offloaded to CPU RAM and swapped back when needed. This works but incurs significant latency. PCIe Gen5 x16 delivers ~64 GB/s (unidirectional) versus the 3,350 GB/s of H100 HBM3, meaning every offloaded access is roughly 52x slower.

CPU offloading is acceptable for development and low-throughput experimentation but is not viable for production inference with latency SLAs.

Gradient Checkpointing (Training)

During training, gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation. This trades compute time for memory, typically reducing activation memory by 60-70% at the cost of approximately 30% more computation time. For training large models where memory is the binding constraint, this tradeoff is usually worthwhile.

Practical Memory Planning Checklist

When planning GPU memory for an LLM deployment, account for all of the following:

  1. Model weights: Parameters × bytes per parameter (at your chosen precision)
  2. KV cache: Per-token cache size × max context length × max concurrent requests
  3. Activation memory: Approximately 5-10% of total for inference, much more for training
  4. Framework overhead: 500 MB - 2 GB for CUDA context, driver, and serving framework
  5. Fragmentation buffer: Add 10-15% headroom for memory allocator inefficiency
  6. Growth margin: Plan for peak load, not average. Memory usage spikes during burst traffic.

A safe rule of thumb: take the model weight size at your chosen precision, multiply by 1.3-1.5x for inference with moderate concurrency and context, and ensure this fits within your GPU's dedicated VRAM. For high-concurrency production serving or long-context use cases, multiply by 1.5-2x.
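
The checklist can be folded into a single estimator. A sketch, not a guarantee: the 8% activation share, 2 GB overhead, and 10% fragmentation buffer are mid-range picks from points 3-5 above, and step 6 (burst margin) is left to you:

```python
def plan_vram_gb(params_b: float, weight_bytes: float,
                 kv_per_token_bytes: int, ctx: int, concurrency: int,
                 overhead_gb: float = 2.0, frag: float = 0.10) -> float:
    """Checklist steps 1-5 as arithmetic, in decimal GB."""
    weights = params_b * 1e9 * weight_bytes / 1e9        # 1. model weights
    kv = kv_per_token_bytes * ctx * concurrency / 1e9    # 2. KV cache
    activations = 0.08 * (weights + kv)                  # 3. ~5-10% for inference
    subtotal = weights + kv + activations + overhead_gb  # 4. framework overhead
    return subtotal * (1 + frag)                         # 5. fragmentation buffer

# Llama 3.1 70B at FP16, 8K context, 4 concurrent requests (327,680 B/token, GQA):
print(round(plan_vram_gb(70, 2, 327_680, 8_192, 4), 1))
# ≈1.3x the 140 GB of weights — matching the rule of thumb above
```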

Vision language models carry additional VRAM overhead from the visual encoder, and each image processed at inference time injects hundreds of visual tokens into the KV cache. For VLM-specific sizing, see our VLM GPU requirements and deployment guide.

Need GPU infrastructure sized for your model? Spheron provides bare-metal access with full dedicated VRAM: B300 (288 GB), H200 (141 GB), H100 (80 GB), A100 (80 GB), and RTX 4090 (24 GB). Per-minute billing, no contracts.

Explore GPU options on Spheron →
