
GPU Memory Requirements for LLMs: VRAM Calculator

Written by Mitrasish, Co-founder · Mar 20, 2026

GPU memory is the single biggest constraint when deploying large language models. A model that doesn't fit in VRAM either won't run at all or spills into system RAM and drops to a fraction of its potential throughput. Understanding exactly how much GPU memory your model requires and what drives that number is the difference between a smooth production deployment and a 3 AM outage. For quick model-to-GPU matching, see our GPU requirements cheat sheet. For diagnosis and fixes when slow inference is the symptom rather than an OOM, see Why Your LLM Inference Is Slow.

The challenge is that VRAM requirements are not just about model size. A Llama 3.1 70B model's weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB once you account for the KV cache, activation memory, and framework overhead. Context length, batch size, quantization strategy, and even your choice of serving framework all change the equation.

This guide breaks down every component of GPU memory consumption for LLM inference and training, provides exact VRAM calculations for popular 2026 models, and covers the optimization techniques that let you fit larger models on fewer GPUs.

The Four Components of GPU Memory Usage

GPU memory consumption during LLM inference breaks down into four distinct components, each with different scaling behavior:

1. Model Weights

Model weights are the learned parameters of the neural network, the core data that defines what the model knows. Weight memory scales linearly with parameter count and the precision format used to store each parameter.

The formula is straightforward: Memory = Parameters × Bytes per Parameter

| Precision Format | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 (full precision) | 4 bytes | 280 GB |
| FP16 / BF16 (half precision) | 2 bytes | 140 GB |
| INT8 (8-bit quantization) | 1 byte | 70 GB |
| INT4 (4-bit quantization) | 0.5 bytes | 35 GB |

Most production inference uses FP16/BF16 or quantized formats. FP32 is used only for specific training scenarios where maximum numerical precision matters.
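
The formula is simple enough to turn into a quick calculator. A minimal sketch (the byte widths are the standard ones from the table above; `weight_memory_gb` is an illustrative helper, not any framework's API):

```python
# Standard storage widths per parameter for common precision formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory = parameters x bytes per parameter, in decimal GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# Reproduce the 70B column of the table above:
for fmt in ("fp32", "fp16", "int8", "int4"):
    print(f"{fmt}: {weight_memory_gb(70, fmt):.0f} GB")  # 280, 140, 70, 35
```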

2. KV Cache

The KV (Key-Value) cache is the hidden memory monster in LLM inference. During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't have to recompute them. This cache grows with every token in the context window and with every concurrent request being served.

KV cache memory per token depends on three model-specific parameters: the number of layers, the number of KV heads, and the head dimension. The formula is:

KV Cache per Token: 2 × Layers × KV Heads × Head Dimension × Bytes per Element

The factor of 2 accounts for both the key and value vectors stored for each layer.

For a model using Grouped Query Attention (GQA), which most modern LLMs do, the KV head count is much smaller than the query head count, dramatically reducing cache size. Llama 3.1 70B, for example, uses 80 layers with 8 KV heads (versus 64 query heads) and a head dimension of 128, giving it approximately 0.31 MB per token at BF16. Standard multi-head attention (MHA) would require 2.5 MB per token for a similar-sized model (an 8x difference).

The critical insight: KV cache scales linearly with both context length and batch size. A single Llama 3.1 70B request at 128K context consumes approximately 40 GB of KV cache alone. Serve 4 concurrent requests at that context length and you need 160 GB just for the cache, more than the model weights themselves. This is why H200 GPUs with 141GB VRAM become cost-effective despite higher hourly rates.
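
The per-token formula and its scaling can be checked directly in code. A minimal sketch using Llama 3.1 70B's published geometry (80 layers, 8 KV heads, head dimension 128, 2-byte BF16 elements); the helper names are illustrative:

```python
def kv_cache_per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """2 (key + value) x layers x KV heads x head dimension x bytes per element."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int) -> float:
    """Total KV cache (decimal GB) for `batch` concurrent requests."""
    per_token = kv_cache_per_token_bytes(layers, kv_heads, head_dim)
    return per_token * context_len * batch / 1e9

# Llama 3.1 70B (GQA): 327,680 bytes/token = 0.3125 MiB, the ~0.31 MB quoted above.
print(kv_cache_per_token_bytes(80, 8, 128) / 2**20)  # 0.3125
print(kv_cache_gb(80, 8, 128, 128_000, 1))           # ≈41.9 — the ~40 GB figure
print(kv_cache_gb(80, 8, 128, 128_000, 4))           # ≈168 — cache alone, 4 requests
```

Passing `kv_heads=64` instead of 8 shows the MHA cost: the same call returns exactly eight times as much cache.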

3. Activation Memory

Activations are the intermediate outputs computed during the forward pass through the network. During inference, only the activations for the current layer need to be held in memory, unlike training, where activations for all layers must be stored for backpropagation.

Inference activation memory is relatively modest, typically 5-10% of the total memory footprint for standard batch sizes. However, it grows with batch size, so high-throughput serving configurations with large batches may see activations consume a more significant share.

4. Framework and System Overhead

The serving framework (vLLM, TensorRT-LLM, SGLang), CUDA context, memory allocator, and driver all consume GPU memory before your model loads a single weight. Typical overhead ranges from 500 MB to 2 GB, depending on the framework and configuration.

Memory fragmentation is another factor. GPU memory allocators don't always pack data perfectly, leaving small unusable gaps. Without optimization (like vLLM's PagedAttention), fragmentation can waste 20-30% of available memory. Modern serving frameworks have largely solved this, but it's still a factor when estimating tight memory budgets.

VRAM Requirements for Popular 2026 Models

The following table shows approximate total VRAM requirements for popular models at different quantization levels. These figures include model weights plus a baseline overhead of approximately 15-20% for KV cache (short context, low batch size), activation memory, and framework buffers. This means the totals are higher than raw weights-only calculations: for example, a 72B model at FP16 is 72B x 2 bytes = ~144 GB for weights alone, but the total including overhead is ~172 GB. Guides that show weights-only figures will show lower numbers for the same model. Production deployments with long contexts or high concurrency will need more.

| Model | Parameters | FP16 Total | INT8 Total | INT4 Total |
|---|---|---|---|---|
| Mistral 7B | 7B | ~18 GB | ~10 GB | ~6 GB |
| Llama 3.1 8B | 8B | ~20 GB | ~11 GB | ~7 GB |
| Phi-4 14B | 14B | ~34 GB | ~18 GB | ~10 GB |
| Qwen 3 32B | 32B | ~76 GB | ~40 GB | ~22 GB |
| Llama 3.3 70B | 70B | ~168 GB | ~84 GB | ~46 GB |
| Qwen 3 72B | 72B | ~172 GB | ~86 GB | ~47 GB |
| Llama 4 Scout (MoE) | 109B total / 17B active | ~262 GB | ~131 GB | ~66 GB |
| Mistral Large 2 | 123B | ~295 GB | ~148 GB | ~75 GB |
| Nemotron Ultra 253B (MoE) | 253B | ~607 GB | ~304 GB | ~152 GB |
| Llama 4 Maverick (MoE) | 400B total / 17B active | ~960 GB | ~480 GB | ~240 GB |
| DeepSeek V3.2 (MoE) | 685B total / 37B active | ~1.6 TB | ~822 GB | ~411 GB |

These are estimates. Actual requirements vary by serving framework, context length, batch size, and the specific quantization method used (GPTQ, AWQ, GGUF, etc.). For a hands-on deployment guide covering Qwen 3 from 8B to 235B, see Deploy Qwen 3 on GPU Cloud. For Nemotron models specifically, see our Nemotron 3 Super deployment guide for the full VRAM breakdown and quantization tiers. For GPT-OSS 20B and 120B MoE VRAM breakdowns including MXFP4 quantization estimates and per-variant GPU recommendations, see the GPT-OSS GPU requirements guide.

How Context Length Multiplies Memory

Context length is the most overlooked driver of GPU memory consumption. As models support 32K, 128K, and even 1M token context windows, the KV cache can easily exceed the model weights in total memory usage.

Here's how KV cache memory scales with context length for Llama 3.1 70B (BF16, single request, GQA with 8 KV heads):

| Context Length | KV Cache per Request | Total with Weights (FP16) |
|---|---|---|
| 2,048 tokens | ~0.6 GB | ~141 GB |
| 8,192 tokens | ~2.5 GB | ~143 GB |
| 32,768 tokens | ~10 GB | ~150 GB |
| 65,536 tokens | ~20 GB | ~160 GB |
| 128,000 tokens | ~40 GB | ~180 GB |

At 128K context with a single request, the KV cache adds 40 GB, roughly 29% of the total. With 4 concurrent requests at 128K context, the KV cache alone would need 160 GB, pushing the total far beyond what a single H200 (141 GB) can handle.

This is why production inference at long context lengths requires either fewer concurrent requests per GPU, KV cache quantization, or multi-GPU setups with careful memory planning.
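
The scaling in the table above can be reproduced from the per-token formula. A sketch in decimal GB (so values land slightly above the rounded table figures); the constants are Llama 3.1 70B's geometry:

```python
PER_TOKEN_BYTES = 2 * 80 * 8 * 128 * 2  # Llama 3.1 70B at BF16: 327,680 B/token
WEIGHTS_GB = 140                         # FP16 weights

for ctx in (2_048, 8_192, 32_768, 65_536, 128_000):
    cache = PER_TOKEN_BYTES * ctx / 1e9
    print(f"{ctx:>7} tokens: {cache:5.1f} GB cache, {WEIGHTS_GB + cache:5.1f} GB total")
```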

Quantization: Trading Precision for Memory

Quantization is the most effective technique for reducing VRAM requirements. By storing model weights at lower numerical precision, you can cut memory usage by 2-4x with surprisingly little impact on output quality. See our best NVIDIA GPUs for LLMs comparison for how different GPU configurations handle quantization trade-offs.

Distillation goes further than quantization by creating a structurally smaller model. See the 70B-to-7B distillation guide for GPU requirements and cost math for the training phase.

Common Quantization Formats

INT8 (8-bit) reduces each parameter from 2 bytes (FP16) to 1 byte, a 50% memory saving. Quality degradation is negligible for most applications. Frameworks like TensorRT-LLM, vLLM, and bitsandbytes support INT8 inference natively.

INT4 (4-bit) further reduces each parameter to 0.5 bytes, a 75% saving over FP16. Modern 4-bit quantization methods like GPTQ, AWQ, and GGUF maintain excellent output quality, with perplexity increases of only 1-3% on most benchmarks. For production chatbots and Q&A systems, the quality difference is typically imperceptible to end users. If KV cache memory is your bottleneck at long context lengths, see our guide on TurboQuant KV cache compression.

FP8 (8-bit floating point) is supported natively on Hopper and Blackwell GPUs (H100, H200, B200). Unlike INT8, FP8 preserves the floating-point format, offering slightly better quality for the same memory footprint. FP8 also enables the Transformer Engine on Hopper GPUs, which dynamically switches between FP8 and FP16 per layer.

Quantization Impact on Quality

The quality-memory tradeoff varies by model and task. General guidelines:

  • INT8: Safe for virtually all inference use cases. Negligible quality loss on standard benchmarks.
  • INT4 (GPTQ/AWQ): Excellent for most production inference. Slight degradation on complex reasoning tasks, imperceptible for conversational AI.
  • INT4 (GGUF Q4_K_M): The most popular format for local LLM deployment. Slightly better quality than naive INT4 due to mixed precision within quantization groups. See the full GGUF quantization deployment guide for step-by-step instructions and cost comparisons.
  • INT2/INT3: Experimental. Noticeable quality degradation. Not recommended for production use cases.

GPU-to-Model Matching Guide

Choosing the right GPU means matching your model's total memory requirement (weights + KV cache + overhead) to the GPU's available dedicated VRAM. Here's a practical mapping:

| Model | Quantization | Min VRAM | Recommended GPUs |
|---|---|---|---|
| Mistral 7B / Llama 3.1 8B | FP16 | ~18-20 GB | RTX 4090 (24 GB), RTX 5090 (32 GB), L40S (48 GB) |
| Mistral 7B / Llama 3.1 8B | INT4 | ~6-7 GB | L4 (24 GB), any 8+ GB GPU |
| Phi-4 14B | FP16 | ~34 GB | RTX 5090 (32 GB), L40S (48 GB), A100 40 GB |
| Phi-4 14B | INT4 | ~10 GB | RTX 4090 (24 GB), L4 (24 GB) |
| Qwen 3 32B | INT4 | ~22 GB | RTX 4090 (24 GB), RTX 5090 (32 GB), L40S (48 GB) |
| Llama 3.3 70B / Qwen 3 72B | INT8 | ~84-86 GB | H100 80 GB (tight), A100 80 GB (tight) |
| Llama 3.3 70B / Qwen 3 72B | INT4 | ~46-47 GB | L40S (48 GB), A100 80 GB |
| Llama 3.3 70B / Qwen 3 72B | FP16 | ~168-172 GB | B300 (288 GB), 2× H100, H200 + offload |
| Llama 4 Scout (109B MoE) | INT4 | ~66 GB | H100 80 GB, A100 80 GB |
| Llama 4 Maverick (400B MoE) | INT4 | ~240 GB | B300 (288 GB), 2× H200, 4× H100 |
| DeepSeek V3.2 (685B MoE) | FP8 | ~822 GB | 8× H100 (64K ctx), 8× H200 (full ctx) |

For production inference with long context windows or high concurrency, add 30-50% additional VRAM headroom beyond the base model size to accommodate KV cache growth.
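
That headroom rule can be expressed as a quick check. A sketch only (`fits` is a hypothetical helper; real serving stacks differ in how much usable VRAM they expose):

```python
def fits(total_model_gb: float, gpu_vram_gb: float, headroom: float = 0.3) -> bool:
    """Does the model fit this GPU with fractional headroom reserved for
    KV cache growth and overhead? 0.3-0.5 is the range suggested above."""
    return total_model_gb * (1 + headroom) <= gpu_vram_gb

# Llama 3.3 70B at INT4 (~46 GB) on an L40S (48 GB):
print(fits(46, 48, headroom=0.0))  # True — loads, but no room for long contexts
print(fits(46, 48, headroom=0.3))  # False — too tight for production concurrency
print(fits(46, 80, headroom=0.3))  # True on an A100/H100 80 GB
```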

KV Cache Optimization Techniques

The KV cache is the most dynamic component of GPU memory usage, and several techniques exist to control its growth.

PagedAttention (vLLM)

Traditional KV cache implementations pre-allocate memory for the maximum sequence length, wasting 60-80% of allocated cache memory for requests that don't use the full context window. PagedAttention, introduced by vLLM, borrows the concept of virtual memory paging from operating systems.

Instead of allocating one contiguous block per request, PagedAttention divides the KV cache into fixed-size blocks that are allocated on-demand as tokens are generated. This eliminates pre-allocation waste and reduces internal fragmentation to near zero, enabling 2-4x more concurrent requests on the same GPU.

For a full deployment guide on all KV cache optimization techniques, see KV Cache Optimization: Serve 10x More Users on the Same GPU.
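
The on-demand block idea can be sketched as a toy allocator. This is illustrative only, not vLLM's implementation (which manages GPU block tables, copy-on-write for beam search, and much more); all names here are hypothetical:

```python
class PagedKVAllocator:
    """Toy sketch of PagedAttention-style allocation: KV memory is carved into
    fixed-size blocks handed out on demand, so a request holds blocks only for
    tokens it has actually generated, not for the maximum sequence length."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size           # tokens per KV block
        self.free = list(range(total_blocks))  # pool of free block IDs
        self.tables = {}                       # request -> list of block IDs
        self.lengths = {}                      # request -> tokens generated

    def append_token(self, req: str) -> None:
        """Allocate a new block only when the current block is full."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:           # crossed a block boundary
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(total_blocks=8, block_size=16)
for _ in range(20):                 # 20 tokens occupy 2 blocks,
    alloc.append_token("req-1")     # not a max-context-length slab
print(len(alloc.tables["req-1"]))   # 2
```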

KV Cache Quantization

The KV cache itself can be quantized separately from the model weights. Quantizing the KV cache from FP16 to FP8 or INT8 shrinks cache memory by 2x with minimal quality impact. Combined with GQA (which already provides an 8x reduction), this creates a compound effect that makes long-context inference feasible on single GPUs.

vLLM supports KV cache quantization natively, and the technique is particularly valuable for production deployments serving 32K-128K context windows.
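
The compound effect is easy to quantify. A sketch using the Llama 3.1 70B geometry from earlier, with the FP8 cache assumed at 1 byte per element:

```python
def kv_gb(layers, heads, head_dim, ctx, bytes_per_elem):
    # One request: 2 (K and V) x layers x heads x head dim x context x element size
    return 2 * layers * heads * head_dim * ctx * bytes_per_elem / 1e9

CTX = 128_000
mha_fp16 = kv_gb(80, 64, 128, CTX, 2)  # standard MHA, FP16 cache
gqa_fp16 = kv_gb(80, 8, 128, CTX, 2)   # GQA alone: 8x smaller
gqa_fp8 = kv_gb(80, 8, 128, CTX, 1)    # GQA + FP8 cache: 16x smaller overall
print(round(mha_fp16), round(gqa_fp16), round(gqa_fp8))  # 336 42 21
```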

Grouped Query Attention (GQA)

GQA is an architectural feature (not a deployment optimization) where the model uses fewer KV heads than query heads. Most modern LLMs, including Llama 3.1, Mistral, Qwen 2.5, and DeepSeek, use GQA by default.

The impact is dramatic: Llama 3.1 70B uses 8 KV heads instead of 64, reducing KV cache memory by 8x compared to standard multi-head attention. This is why modern 70B models are practical to serve on single GPUs despite their size. Without GQA, the KV cache alone would consume 320 GB at 128K context instead of 40 GB.

Training Memory: Why It's 3-5x More Than Inference

Training requires significantly more GPU memory than inference because the GPU must simultaneously hold multiple copies of the model data:

| Component | Size (70B FP16) | Purpose |
|---|---|---|
| Model weights | ~140 GB | The parameters being trained |
| Gradients | ~140 GB | Computed during backpropagation |
| Optimizer states (Adam) | ~280 GB | Adam's two moment buffers (m and v), 2 bytes per parameter each here |
| Activations | 10-100+ GB | Stored for backpropagation (varies with batch size) |
| Total | 570-660+ GB | Requires multi-GPU distribution |

This is why training a 70B model requires 8x H100 GPUs (640 GB total HBM) even before accounting for activation memory. Techniques like ZeRO (Zero Redundancy Optimizer), gradient checkpointing, and mixed-precision training help distribute and reduce this memory burden, but the fundamental scale remains enormous.
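
The breakdown above can be sketched as a calculator. The byte budgets mirror the table (FP16 weights and gradients, 4 bytes per parameter of optimizer state); `activations_gb` is a placeholder, since the real number depends on batch size and sequence length:

```python
def training_memory_gb(params_billion: float,
                       weight_bytes: float = 2,      # FP16 weights
                       grad_bytes: float = 2,        # FP16 gradients
                       optimizer_bytes: float = 4,   # two Adam moment buffers
                       activations_gb: float = 50):  # batch-size dependent guess
    """Per-component training memory in decimal GB, mirroring the table above."""
    p = params_billion * 1e9
    parts = {
        "weights": p * weight_bytes / 1e9,
        "gradients": p * grad_bytes / 1e9,
        "optimizer": p * optimizer_bytes / 1e9,
        "activations": activations_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

print(training_memory_gb(70))
# 140 + 140 + 280 + 50 of activations -> 610 GB, inside the 570-660+ GB range
```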

Fine-Tuning with LoRA

Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA dramatically reduce fine-tuning memory requirements by training only a small number of adapter parameters while freezing the base model weights. QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of a 70B model on a single A100 80 GB or even an RTX 4090 with careful configuration.
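
To see why the savings are so large, count the trainable parameters. A sketch only: the 80 layers, 8192-wide square projections, rank 16, and two adapted modules per layer are illustrative assumptions, not any model's exact geometry:

```python
def lora_trainable_params(layers: int, d_in: int, d_out: int,
                          rank: int, modules_per_layer: int) -> int:
    """LoRA adds two low-rank factors, A (d_in x r) and B (r x d_out), per
    adapted weight matrix; only these are trained, the base stays frozen."""
    return layers * modules_per_layer * rank * (d_in + d_out)

adapter = lora_trainable_params(80, 8192, 8192, rank=16, modules_per_layer=2)
print(adapter / 1e6, "M trainable params")  # ≈42M — millions, not billions
print(adapter * 2 / 1e9, "GB at FP16")      # under 0.1 GB of adapter weights
```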

Memory Overflow Strategies

When your model's memory requirements exceed available VRAM, several strategies can help, each with meaningful tradeoffs.

Tensor Parallelism

Tensor parallelism splits individual layers across multiple GPUs, distributing both weights and computation. With NVLink-connected GPUs (900 GB/s on H100), the communication overhead is minimal. This is the standard approach for serving models that do not fit on a single GPU, such as splitting Llama 3.1 70B at FP16 across two H100 GPUs.

The tradeoff is cost (two GPUs instead of one) and slightly increased latency from inter-GPU communication.

CPU Offloading

When GPU memory is insufficient, model layers or KV cache pages can be offloaded to CPU RAM and swapped back when needed. This works but incurs significant latency. PCIe Gen5 x16 delivers ~64 GB/s (unidirectional) versus the 3,350 GB/s of H100 HBM3, meaning every offloaded access is roughly 52x slower.

CPU offloading is acceptable for development and low-throughput experimentation but is not viable for production inference with latency SLAs.

Gradient Checkpointing (Training)

During training, gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation. This trades compute time for memory, typically reducing activation memory by 60-70% at the cost of approximately 30% more computation time. For training large models where memory is the binding constraint, this tradeoff is usually worthwhile.

Practical Memory Planning Checklist

When planning GPU memory for an LLM deployment, account for all of the following:

  1. Model weights: Parameters × bytes per parameter (at your chosen precision)
  2. KV cache: Per-token cache size × max context length × max concurrent requests
  3. Activation memory: Approximately 5-10% of total for inference, much more for training
  4. Framework overhead: 500 MB - 2 GB for CUDA context, driver, and serving framework
  5. Fragmentation buffer: Add 10-15% headroom for memory allocator inefficiency
  6. Growth margin: Plan for peak load, not average. Memory usage spikes during burst traffic.

A safe rule of thumb: take the model weight size at your chosen precision, multiply by 1.3-1.5x for inference with moderate concurrency and context, and ensure this fits within your GPU's dedicated VRAM. For high-concurrency production serving or long-context use cases, multiply by 1.5-2x.
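
The checklist can be folded into a single estimator. A sketch, not a guarantee: the 8% activation share, 2 GB overhead, and 10% fragmentation buffer are mid-range picks from points 3-5 above, and step 6 (burst margin) is left to you:

```python
def plan_vram_gb(params_b: float, weight_bytes: float,
                 kv_per_token_bytes: int, ctx: int, concurrency: int,
                 overhead_gb: float = 2.0, frag: float = 0.10) -> float:
    """Checklist steps 1-5 as arithmetic, in decimal GB."""
    weights = params_b * 1e9 * weight_bytes / 1e9        # 1. model weights
    kv = kv_per_token_bytes * ctx * concurrency / 1e9    # 2. KV cache
    activations = 0.08 * (weights + kv)                  # 3. ~5-10% for inference
    subtotal = weights + kv + activations + overhead_gb  # 4. framework overhead
    return subtotal * (1 + frag)                         # 5. fragmentation buffer

# Llama 3.1 70B at FP16, 8K context, 4 concurrent requests (327,680 B/token, GQA):
print(round(plan_vram_gb(70, 2, 327_680, 8_192, 4), 1))
# ≈1.3x the 140 GB of weights — matching the rule of thumb above
```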

Vision language models carry additional VRAM overhead from the visual encoder, and each image processed at inference time injects hundreds of visual tokens into the KV cache. For VLM-specific sizing, see our VLM GPU requirements and deployment guide.

Need GPU infrastructure sized for your model? Spheron provides bare-metal access with full dedicated VRAM: B300 (288 GB), H200 (141 GB), H100 (80 GB), A100 (80 GB), and RTX 4090 (24 GB). Per-minute billing, no contracts.

Explore GPU options on Spheron →
