GPU memory is the single biggest constraint when deploying large language models. A model that doesn't fit in VRAM either won't run at all or spills into system RAM and drops to a fraction of its potential throughput. Understanding exactly how much GPU memory your model requires and what drives that number is the difference between a smooth production deployment and a 3 AM outage.
The challenge is that VRAM requirements are not just about model size. A Llama 3.1 70B model's weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB once you account for the KV cache, activation memory, and framework overhead. Context length, batch size, quantization strategy, and even your choice of serving framework all change the equation.
This guide breaks down every component of GPU memory consumption for LLM inference and training, provides exact VRAM calculations for popular 2025 models, and covers the optimization techniques that let you fit larger models on fewer GPUs.
The Four Components of GPU Memory Usage
GPU memory consumption during LLM inference breaks down into four distinct components, each with different scaling behavior:
1. Model Weights
Model weights are the learned parameters of the neural network, the core data that defines what the model knows. Weight memory scales linearly with parameter count and the precision format used to store each parameter.
The formula is straightforward: Memory = Parameters × Bytes per Parameter
| Precision Format | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 (full precision) | 4 bytes | 280 GB |
| FP16 / BF16 (half precision) | 2 bytes | 140 GB |
| INT8 (8-bit quantization) | 1 byte | 70 GB |
| INT4 (4-bit quantization) | 0.5 bytes | 35 GB |
Most production inference uses FP16/BF16 or quantized formats. FP32 is used only for specific training scenarios where maximum numerical precision matters.
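The weight formula is simple enough to script. Here's a minimal Python sketch (the function name and precision table are illustrative, not from any library), using decimal GB (10^9 bytes) to match the table above:

```python
# Bytes per parameter for common precision formats (BF16 matches FP16).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB: parameters x bytes per parameter."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(70, "fp16"))  # 140.0 GB for a 70B model
print(weight_memory_gb(70, "int4"))  # 35.0 GB
```

This reproduces the table: a 70B model needs 140 GB at FP16 and 35 GB at INT4 before any cache or overhead.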
2. KV Cache
The KV (Key-Value) cache is the hidden memory monster in LLM inference. During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't have to recompute them. This cache grows with every token in the context window and with every concurrent request being served.
KV cache memory per token depends on three model-specific parameters: the number of layers, the number of KV heads, and the head dimension. The formula is:
KV Cache per Token: 2 × Layers × KV Heads × Head Dimension × Bytes per Element
The factor of 2 accounts for both the key and value vectors stored for each layer.
For a model using Grouped Query Attention (GQA), which most modern LLMs do, the KV head count is much smaller than the query head count, dramatically reducing cache size. Llama 3.1 70B, for example, uses 80 layers with 8 KV heads (versus 64 query heads) and a head dimension of 128, giving it approximately 0.31 MB per token at BF16. Standard multi-head attention (MHA) would require 2.5 MB per token for a similar-sized model (an 8x difference).
The critical insight: KV cache scales linearly with both context length and batch size. A single Llama 3.1 70B request at 128K context consumes approximately 40 GB of KV cache alone. Serve 4 concurrent requests at that context length and you need 160 GB just for the cache, more than the model weights themselves.
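The per-token formula and its scaling behavior can be checked with a short Python sketch (function names are mine; the Llama 3.1 70B shape parameters are from the text above). Note the small gap between the ~40 GB figure in the text and the decimal-GB result here, which comes down to GB vs GiB accounting:

```python
def kv_cache_per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    # Factor of 2: one key vector and one value vector stored per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_elem: int = 2) -> float:
    # Cache scales linearly with both context length and batch size.
    per_token = kv_cache_per_token_bytes(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * context_len * batch_size / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head dimension 128, BF16.
print(kv_cache_per_token_bytes(80, 8, 128))            # 327680 bytes ~ 0.31 MB/token
print(kv_cache_gb(80, 8, 128, context_len=128_000))    # ~41.9 GB per request
print(kv_cache_gb(80, 8, 128, 128_000, batch_size=4))  # ~167.8 GB for 4 requests
```

Swapping `kv_heads=8` for `kv_heads=64` shows what the same model would cost with standard multi-head attention.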
3. Activation Memory
Activations are the intermediate outputs computed during the forward pass through the network. During inference, only the activations for the current layer need to be held in memory, unlike training, where activations for all layers must be stored for backpropagation.
Inference activation memory is relatively modest, typically 5–10% of the total memory footprint for standard batch sizes. However, it grows with batch size, so high-throughput serving configurations with large batches may see activations consume a more significant share.
4. Framework and System Overhead
The serving framework (vLLM, TensorRT-LLM, SGLang), CUDA context, memory allocator, and driver all consume GPU memory before your model loads a single weight. Typical overhead ranges from 500 MB to 2 GB, depending on the framework and configuration.
Memory fragmentation is another factor. GPU memory allocators don't always pack data perfectly, leaving small unusable gaps. Without optimization (like vLLM's PagedAttention), fragmentation can waste 20–30% of available memory. Modern serving frameworks have largely solved this, but it's still a factor when estimating tight memory budgets.
VRAM Requirements for Popular 2025 Models
The following table shows approximate total VRAM requirements for popular models at different quantization levels. These figures include model weights plus a baseline overhead of approximately 15–20% for KV cache (short context, low batch size), activation memory, and framework buffers. Production deployments with long contexts or high concurrency will need more.
| Model | Parameters | FP16 Total | INT8 Total | INT4 Total |
|---|---|---|---|---|
| Mistral 7B | 7B | ~18 GB | ~10 GB | ~6 GB |
| Llama 3.1 8B | 8B | ~20 GB | ~11 GB | ~7 GB |
| Qwen 2.5 14B | 14B | ~34 GB | ~18 GB | ~10 GB |
| Mistral Small 3 (24B) | 24B | ~56 GB | ~30 GB | ~16 GB |
| Qwen 2.5 32B | 32B | ~76 GB | ~40 GB | ~22 GB |
| Llama 3.1 70B | 70B | ~168 GB | ~84 GB | ~46 GB |
| Qwen 2.5 72B | 72B | ~172 GB | ~86 GB | ~47 GB |
| Mixtral 8x7B (MoE) | 47B total | ~112 GB | ~58 GB | ~30 GB |
| Llama 3.1 405B | 405B | ~970 GB | ~486 GB | ~245 GB |
| DeepSeek V3 (MoE) | 671B total | ~1.6 TB | ~800 GB | ~400 GB |
These are estimates. Actual requirements vary by serving framework, context length, batch size, and the specific quantization method used (GPTQ, AWQ, GGUF, etc.).
How Context Length Multiplies Memory
Context length is the most overlooked driver of GPU memory consumption. As models support 32K, 128K, and even 1M token context windows, the KV cache can easily exceed the model weights in total memory usage.
Here's how KV cache memory scales with context length for Llama 3.1 70B (BF16, single request, GQA with 8 KV heads):
| Context Length | KV Cache per Request | Total with Weights (FP16) |
|---|---|---|
| 2,048 tokens | ~0.6 GB | ~141 GB |
| 8,192 tokens | ~2.5 GB | ~143 GB |
| 32,768 tokens | ~10 GB | ~150 GB |
| 65,536 tokens | ~20 GB | ~160 GB |
| 128,000 tokens | ~40 GB | ~180 GB |
At 128K context with a single request, the KV cache adds 40 GB, roughly 22% of the ~180 GB total. With 4 concurrent requests at that context length, the KV cache alone would need 160 GB, pushing the total far beyond what a single H200 (141 GB) can handle.
This is why production inference at long context lengths requires either fewer concurrent requests per GPU, KV cache quantization, or multi-GPU setups with careful memory planning.
Quantization: Trading Precision for Memory
Quantization is the most effective technique for reducing VRAM requirements. By storing model weights at lower numerical precision, you can cut memory usage by 2–4x with surprisingly little impact on output quality.
Common Quantization Formats
INT8 (8-bit) reduces each parameter from 2 bytes (FP16) to 1 byte, a 50% memory saving. Quality degradation is negligible for most applications. Frameworks like TensorRT-LLM, vLLM, and bitsandbytes support INT8 inference natively.
INT4 (4-bit) further reduces each parameter to 0.5 bytes, a 75% saving over FP16. Modern 4-bit quantization methods like GPTQ, AWQ, and GGUF maintain excellent output quality, with perplexity increases of only 1–3% on most benchmarks. For production chatbots and Q&A systems, the quality difference is typically imperceptible to end users.
FP8 (8-bit floating point) is supported natively on Hopper and Blackwell GPUs (H100, H200, B200). Unlike INT8, FP8 preserves the floating-point format, offering slightly better quality for the same memory footprint. FP8 also enables the Transformer Engine on Hopper GPUs, which dynamically switches between FP8 and FP16 per layer.
Quantization Impact on Quality
The quality-memory tradeoff varies by model and task. General guidelines:
- INT8: Safe for virtually all inference use cases. Negligible quality loss on standard benchmarks.
- INT4 (GPTQ/AWQ): Excellent for most production inference. Slight degradation on complex reasoning tasks, imperceptible for conversational AI.
- INT4 (GGUF Q4_K_M): The most popular format for local LLM deployment. Slightly better quality than naive INT4 due to mixed precision within quantization groups.
- INT2/INT3: Experimental. Noticeable quality degradation. Not recommended for production use cases.
GPU-to-Model Matching Guide
Choosing the right GPU means matching your model's total memory requirement (weights + KV cache + overhead) to the GPU's available dedicated VRAM. Here's a practical mapping:
| Model | Quantization | Min VRAM | Recommended GPUs |
|---|---|---|---|
| Mistral 7B / Llama 3.1 8B | FP16 | ~18–20 GB | RTX 4090 (24 GB), L40S (48 GB) |
| Mistral 7B / Llama 3.1 8B | INT4 | ~6–7 GB | L4 (24 GB), any 8+ GB GPU |
| Qwen 2.5 14B | FP16 | ~34 GB | L40S (48 GB), A100 40 GB |
| Qwen 2.5 14B | INT4 | ~10 GB | RTX 4090 (24 GB), L4 (24 GB) |
| Qwen 2.5 32B | INT4 | ~22 GB | RTX 4090 (24 GB), L40S (48 GB) |
| Llama 3.1 70B | INT8 | ~84 GB | H100 80 GB, A100 80 GB |
| Llama 3.1 70B | INT4 | ~46 GB | L40S (48 GB), A100 80 GB |
| Llama 3.1 70B | FP16 | ~168 GB | H200 (141 GB) + offload, 2× H100 |
| Mixtral 8x7B | INT4 | ~30 GB | L40S (48 GB), RTX 4090 (24 GB) + offload |
| Llama 3.1 405B | INT4 | ~245 GB | 2× H200, 4× H100, 2× B200 (192 GB) |
For production inference with long context windows or high concurrency, add 30–50% additional VRAM headroom beyond the base model size to accommodate KV cache growth.
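That headroom rule can be expressed as a quick check. The sketch below is a hypothetical helper (not part of any serving framework) that applies the 30–50% margin before matching a model to a GPU:

```python
def fits_on_gpu(model_vram_gb: float, gpu_vram_gb: float,
                headroom: float = 0.3) -> tuple[bool, float]:
    """Check whether a model fits a GPU with production headroom.

    headroom: extra fraction on top of the base footprint to absorb
    KV cache growth (0.3-0.5 for long contexts / high concurrency).
    Returns (fits, required_gb).
    """
    required = model_vram_gb * (1 + headroom)
    return required <= gpu_vram_gb, required

# Llama 3.1 70B at INT4 (~46 GB base):
print(fits_on_gpu(46, 48))  # L40S 48 GB: too tight once headroom is added
print(fits_on_gpu(46, 80))  # A100 80 GB: fits comfortably
```

This also illustrates why the table lists *minimum* VRAM: a 46 GB model on a 48 GB card works for light workloads but leaves almost no room for KV cache under load.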
KV Cache Optimization Techniques
The KV cache is the most dynamic component of GPU memory usage, and several techniques exist to control its growth.
PagedAttention (vLLM)
Traditional KV cache implementations pre-allocate memory for the maximum sequence length, wasting 60–80% of allocated cache memory for requests that don't use the full context window. PagedAttention, introduced by vLLM, borrows the concept of virtual memory paging from operating systems.
Instead of allocating one contiguous block per request, PagedAttention divides the KV cache into fixed-size blocks that are allocated on-demand as tokens are generated. This eliminates pre-allocation waste and reduces internal fragmentation to near zero, enabling 2–4x more concurrent requests on the same GPU.
KV Cache Quantization
The KV cache itself can be quantized separately from the model weights. Quantizing the KV cache from FP16 to FP8 or INT8 shrinks cache memory by 2x with minimal quality impact. Combined with GQA (which already provides an 8x reduction), this creates a compound effect that makes long-context inference feasible on single GPUs.
vLLM supports KV cache quantization natively, and the technique is particularly valuable for production deployments serving 32K–128K context windows.
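The compound effect is easy to quantify. A rough Python sketch (my own helper, mirroring the Llama 3.1 70B shape; the MHA baseline is hypothetical) comparing an MHA cache at FP16 against GQA plus an 8-bit cache:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int) -> float:
    # 2x for key + value vectors per layer; result in decimal GB.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

CTX = 128_000  # 128K-token context, single request

mha_fp16 = kv_cache_gb(80, 64, 128, CTX, 2)  # no GQA, FP16 cache: ~335 GB
gqa_fp16 = kv_cache_gb(80, 8, 128, CTX, 2)   # GQA (8 KV heads):   ~42 GB
gqa_fp8  = kv_cache_gb(80, 8, 128, CTX, 1)   # GQA + 8-bit cache:  ~21 GB

print(mha_fp16, gqa_fp16, gqa_fp8)
```

GQA alone gives the 8x reduction, and quantizing the cache halves it again: a 16x combined saving that turns an unservable 128K-context request into one that fits beside the weights.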
Grouped Query Attention (GQA)
GQA is an architectural feature (not a deployment optimization) where the model uses fewer KV heads than query heads. Most modern LLMs, including Llama 3.1, Mistral, Qwen 2.5, and DeepSeek, use GQA by default.
The impact is dramatic: Llama 3.1 70B uses 8 KV heads instead of 64, reducing KV cache memory by 8x compared to standard multi-head attention. This is why modern 70B models are practical to serve on single GPUs despite their size. Without GQA, the KV cache alone would consume 320 GB at 128K context instead of 40 GB.
Training Memory: Why It's 3–5x More Than Inference
Training requires significantly more GPU memory than inference because the GPU must simultaneously hold multiple copies of the model data:
| Component | Size (70B FP16) | Purpose |
|---|---|---|
| Model weights | ~140 GB | The parameters being trained |
| Gradients | ~140 GB | Computed during backpropagation |
| Optimizer states (Adam) | ~280 GB | Two extra buffers (momentum and variance) per parameter, shown at 16-bit; double this if kept in full FP32 |
| Activations | 10–100+ GB | Stored for backpropagation (varies with batch size) |
| Total | 570–660+ GB | Requires multi-GPU distribution |
This is why training a 70B model requires 8x H100 GPUs (640 GB total HBM) even before accounting for activation memory. Techniques like ZeRO (Zero Redundancy Optimizer), gradient checkpointing, and mixed-precision training help distribute and reduce this memory burden, but the fundamental scale remains enormous.
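The table's arithmetic can be captured in a small estimator (a sketch with assumed defaults, not a framework utility; the byte counts per component follow the table above):

```python
def training_memory_gb(params_billions: float,
                       weight_bytes: float = 2,   # FP16/BF16 weights
                       grad_bytes: float = 2,     # FP16/BF16 gradients
                       optim_bytes: float = 4,    # two Adam buffers at 2 bytes each
                       activations_gb: float = 50) -> float:
    """Rough training footprint: weights + gradients + optimizer states + activations.

    Use optim_bytes=8 for Adam momentum/variance kept in full FP32.
    """
    params = params_billions * 1e9
    return params * (weight_bytes + grad_bytes + optim_bytes) / 1e9 + activations_gb

print(training_memory_gb(70))  # ~610 GB for a 70B model at mid-range activation memory
```

Changing `activations_gb` or `optim_bytes` shows how quickly the total swings, which is why gradient checkpointing and ZeRO-style sharding matter so much at this scale.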
Fine-Tuning with LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA dramatically reduce fine-tuning memory requirements by training only a small number of adapter parameters while freezing the base model weights. QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of a 70B model on a single A100 80 GB or even an RTX 4090 with careful configuration.
Memory Overflow Strategies
When your model's memory requirements exceed available VRAM, several strategies can help, each with meaningful tradeoffs.
Tensor Parallelism
Tensor parallelism splits individual layers across multiple GPUs, distributing both weights and computation. With NVLink-connected GPUs (900 GB/s on H100), the communication overhead is minimal. This is the standard approach for serving models that do not fit on a single GPU, such as splitting Llama 3.1 70B at FP16 across two H100 GPUs.
The tradeoff is cost (two GPUs instead of one) and slightly increased latency from inter-GPU communication.
CPU Offloading
When GPU memory is insufficient, model layers or KV cache pages can be offloaded to CPU RAM and swapped back when needed. This works but incurs significant latency. PCIe Gen5 delivers ~64 GB/s versus the 3,350 GB/s of H100 HBM3, meaning every offloaded access is roughly 50x slower.
CPU offloading is acceptable for development and low-throughput experimentation but is not viable for production inference with latency SLAs.
Gradient Checkpointing (Training)
During training, gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation. This trades compute time for memory, typically reducing activation memory by 60–70% at the cost of approximately 30% more computation time. For training large models where memory is the binding constraint, this tradeoff is usually worthwhile.
Practical Memory Planning Checklist
When planning GPU memory for an LLM deployment, account for all of the following:
- Model weights: Parameters × bytes per parameter (at your chosen precision)
- KV cache: Per-token cache size × max context length × max concurrent requests
- Activation memory: Approximately 5–10% of total for inference, much more for training
- Framework overhead: 500 MB – 2 GB for CUDA context, driver, and serving framework
- Fragmentation buffer: Add 10–15% headroom for memory allocator inefficiency
- Growth margin: Plan for peak load, not average. Memory usage spikes during burst traffic.
A safe rule of thumb: take the model weight size at your chosen precision, multiply by 1.3–1.5x for inference with moderate concurrency and context, and ensure this fits within your GPU's dedicated VRAM. For high-concurrency production serving or long-context use cases, multiply by 1.5–2x.
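The rule of thumb above reduces to a one-liner. This sketch (helper name and defaults are mine) is a back-of-the-envelope planner, not a substitute for measuring your actual workload:

```python
def plan_vram_gb(params_billions: float, bytes_per_param: float,
                 multiplier: float = 1.4) -> float:
    """Rule-of-thumb VRAM estimate: weight size x headroom multiplier.

    multiplier: 1.3-1.5 for moderate concurrency and context,
    1.5-2.0 for high-concurrency or long-context serving.
    """
    return params_billions * bytes_per_param * multiplier

print(plan_vram_gb(8, 2))           # Llama 3.1 8B at FP16: ~22.4 GB
print(plan_vram_gb(70, 0.5, 1.5))   # 70B at INT4, busy server: ~52.5 GB
```

The second example suggests a 48 GB card is marginal for a heavily loaded 70B INT4 deployment, matching the guidance in the GPU-matching table.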
Deploy on Spheron
Need GPU infrastructure sized for your model? Spheron provides bare-metal GPU access with full dedicated VRAM: H100 (80 GB), H200 (141 GB), A100 (80 GB), and RTX 4090 (24 GB). This comes with transparent pricing, instant provisioning, and no long-term contracts.
Explore GPU options on Spheron →
Frequently Asked Questions
How much VRAM do I need for Llama 3.1 70B?
At FP16 precision, Llama 3.1 70B requires approximately 140 GB for model weights plus 20–40 GB for KV cache and overhead, totaling 160–180 GB. This requires either an H200 (141 GB with careful optimization) or two H100 GPUs. At INT8, the weights shrink to ~70 GB, fitting on a single H100 80 GB. At INT4 (GPTQ or AWQ), weights drop to ~35 GB, fitting on a single L40S 48 GB or A100 80 GB with ample room for KV cache.
How does context length affect GPU memory?
KV cache memory scales linearly with context length. For Llama 3.1 70B at BF16, each token in the context adds approximately 0.31 MB of KV cache. At 2K context, the cache is negligible (~0.6 GB). At 128K context, it consumes approximately 40 GB per request, roughly 22% of the total memory footprint. Serving multiple concurrent requests at long context multiplies this further.
What's the difference between FP16, INT8, and INT4 for inference?
FP16 uses 2 bytes per parameter, INT8 uses 1 byte (50% saving), and INT4 uses 0.5 bytes (75% saving). INT8 quantization produces negligible quality loss for virtually all applications. INT4 introduces slight degradation, typically 1–3% perplexity increase, but is imperceptible for most production chatbot and Q&A use cases. INT4 is the most popular format for local LLM deployment because it enables running 70B models on consumer-grade hardware.
Can I run a 70B model on an RTX 4090?
Not at full precision. A 70B model at FP16 requires ~140 GB, far exceeding the RTX 4090's 24 GB. At INT4 quantization with aggressive optimization, the model weights fit in ~35 GB, which still exceeds 24 GB. However, with CPU offloading (splitting some layers to system RAM), it is possible to run at very reduced throughput. For practical 70B inference, you need at minimum an A100 80 GB (INT8) or L40S 48 GB (INT4).
Why does training need so much more memory than inference?
Training simultaneously stores model weights (~140 GB for 70B at FP16), gradients (~140 GB), optimizer states (~280 GB for Adam's momentum and variance buffers), and activations (10–100+ GB depending on batch size). The total for a 70B model exceeds 570 GB, requiring 8x H100 GPUs with tensor and pipeline parallelism. Techniques like QLoRA reduce fine-tuning requirements dramatically by training only small adapter layers while keeping the base model frozen and quantized.
What is PagedAttention and why does it matter?
PagedAttention, introduced by vLLM, applies OS-style virtual memory paging to the KV cache. Traditional implementations pre-allocate maximum context-length memory per request, wasting 60–80% of allocated memory. PagedAttention allocates cache blocks on-demand as tokens are generated, eliminating this waste. The practical impact is 2–4x more concurrent requests on the same GPU, a direct cost saving for production inference deployments.