GPU memory is the single biggest constraint when deploying large language models. A model that doesn't fit in VRAM either won't run at all or spills into system RAM and drops to a fraction of its potential throughput. Understanding exactly how much GPU memory your model requires and what drives that number is the difference between a smooth production deployment and a 3 AM outage.
The challenge is that VRAM requirements are not just about model size. A Llama 3.1 70B model's weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB once you account for the KV cache, activation memory, and framework overhead. Context length, batch size, quantization strategy, and even your choice of serving framework all change the equation.
This guide breaks down every component of GPU memory consumption for LLM inference and training, provides exact VRAM calculations for popular 2025 models, and covers the optimization techniques that let you fit larger models on fewer GPUs.
The Four Components of GPU Memory Usage
GPU memory consumption during LLM inference breaks down into four distinct components, each with different scaling behavior:
1. Model Weights
Model weights are the learned parameters of the neural network, the core data that defines what the model knows. Weight memory scales linearly with parameter count and the precision format used to store each parameter.
The formula is straightforward: Memory = Parameters × Bytes per Parameter
| Precision Format | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 (full precision) | 4 bytes | 280 GB |
| FP16 / BF16 (half precision) | 2 bytes | 140 GB |
| INT8 (8-bit quantization) | 1 byte | 70 GB |
| INT4 (4-bit quantization) | 0.5 bytes | 35 GB |
Most production inference uses FP16/BF16 or quantized formats. FP32 is used only for specific training scenarios where maximum numerical precision matters.
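The weight formula is simple enough to script. Here's a minimal Python sketch (the function name and precision table are illustrative, not from any library), using decimal GB (10^9 bytes) to match the table above:

```python
# Bytes per parameter for common precision formats (BF16 matches FP16).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB: parameters x bytes per parameter."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(70, "fp16"))  # 140.0 GB for a 70B model
print(weight_memory_gb(70, "int4"))  # 35.0 GB
```

This reproduces the table: a 70B model needs 140 GB at FP16 and 35 GB at INT4 before any cache or overhead.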
2. KV Cache
The KV (Key-Value) cache is the hidden memory monster in LLM inference. During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't have to recompute them. This cache grows with every token in the context window and with every concurrent request being served.
KV cache memory per token depends on three model-specific parameters: the number of layers, the number of KV heads, and the head dimension. The formula is:
KV Cache per Token: 2 × Layers × KV Heads × Head Dimension × Bytes per Element
The factor of 2 accounts for both the key and value vectors stored for each layer.
For a model using Grouped Query Attention (GQA), which most modern LLMs do, the KV head count is much smaller than the query head count, dramatically reducing cache size. Llama 3.1 70B, for example, uses 80 layers with 8 KV heads (versus 64 query heads) and a head dimension of 128, giving it approximately 0.31 MB per token at BF16. Standard multi-head attention (MHA) would require 2.5 MB per token for a similar-sized model (an 8x difference).
The critical insight: KV cache scales linearly with both context length and batch size. A single Llama 3.1 70B request at 128K context consumes approximately 40 GB of KV cache alone. Serve 4 concurrent requests at that context length and you need 160 GB just for the cache, more than the model weights themselves.
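The per-token formula and its scaling behavior can be checked with a short Python sketch (function names are mine; the Llama 3.1 70B shape parameters are from the text above). Note the small gap between the ~40 GB figure in the text and the decimal-GB result here, which comes down to GB vs GiB accounting:

```python
def kv_cache_per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    # Factor of 2: one key vector and one value vector stored per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_elem: int = 2) -> float:
    # Cache scales linearly with both context length and batch size.
    per_token = kv_cache_per_token_bytes(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * context_len * batch_size / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head dimension 128, BF16.
print(kv_cache_per_token_bytes(80, 8, 128))            # 327680 bytes ~ 0.31 MB/token
print(kv_cache_gb(80, 8, 128, context_len=128_000))    # ~41.9 GB per request
print(kv_cache_gb(80, 8, 128, 128_000, batch_size=4))  # ~167.8 GB for 4 requests
```

Swapping `kv_heads=8` for `kv_heads=64` shows what the same model would cost with standard multi-head attention.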
3. Activation Memory
Activations are the intermediate outputs computed during the forward pass through the network. During inference, only the activations for the current layer need to be held in memory, unlike training, where activations for all layers must be stored for backpropagation.
Inference activation memory is relatively modest, typically 5–10% of the total memory footprint for standard batch sizes. However, it grows with batch size, so high-throughput serving configurations with large batches may see activations consume a more significant share.
4. Framework and System Overhead
The serving framework (vLLM, TensorRT-LLM, SGLang), CUDA context, memory allocator, and driver all consume GPU memory before your model loads a single weight. Typical overhead ranges from 500 MB to 2 GB, depending on the framework and configuration.
Memory fragmentation is another factor. GPU memory allocators don't always pack data perfectly, leaving small unusable gaps. Without optimization (like vLLM's PagedAttention), fragmentation can waste 20–30% of available memory. Modern serving frameworks have largely solved this, but it's still a factor when estimating tight memory budgets.
VRAM Requirements for Popular 2025 Models
The following table shows approximate total VRAM requirements for popular models at different quantization levels. These figures include model weights plus a baseline overhead of approximately 15–20% for KV cache (short context, low batch size), activation memory, and framework buffers. Production deployments with long contexts or high concurrency will need more.
| Model | Parameters | FP16 Total | INT8 Total | INT4 Total |
|---|---|---|---|---|
| Mistral 7B | 7B | ~18 GB | ~10 GB | ~6 GB |
| Llama 3.1 8B | 8B | ~20 GB | ~11 GB | ~7 GB |
| Qwen 2.5 14B | 14B | ~34 GB | ~18 GB | ~10 GB |
| Mistral Small 3 (24B) | 24B | ~56 GB | ~30 GB | ~16 GB |
| Qwen 2.5 32B | 32B | ~76 GB | ~40 GB | ~22 GB |
| Llama 3.1 70B | 70B | ~168 GB | ~84 GB | ~46 GB |
| Qwen 2.5 72B | 72B | ~172 GB | ~86 GB | ~47 GB |
| Mixtral 8x7B (MoE) | 47B total | ~112 GB | ~58 GB | ~30 GB |
| Llama 3.1 405B | 405B | ~970 GB | ~486 GB | ~245 GB |
| DeepSeek V3 (MoE) | 671B total | ~1.6 TB | ~800 GB | ~400 GB |
These are estimates. Actual requirements vary by serving framework, context length, batch size, and the specific quantization method used (GPTQ, AWQ, GGUF, etc.).
How Context Length Multiplies Memory
Context length is the most overlooked driver of GPU memory consumption. As models support 32K, 128K, and even 1M token context windows, the KV cache can easily exceed the model weights in total memory usage.
Here's how KV cache memory scales with context length for Llama 3.1 70B (BF16, single request, GQA with 8 KV heads):
| Context Length | KV Cache per Request | Total with Weights (FP16) |
|---|---|---|
| 2,048 tokens | ~0.6 GB | ~141 GB |
| 8,192 tokens | ~2.5 GB | ~143 GB |
| 32,768 tokens | ~10 GB | ~150 GB |
| 65,536 tokens | ~20 GB | ~160 GB |
| 128,000 tokens | ~40 GB | ~180 GB |
At 128K context with a single request, the KV cache adds 40 GB, roughly 22% of the ~180 GB total. With 4 concurrent requests at that context length, the KV cache alone would need 160 GB, pushing the total far beyond what a single H200 (141 GB) can handle.
This is why production inference at long context lengths requires either fewer concurrent requests per GPU, KV cache quantization, or multi-GPU setups with careful memory planning.
Quantization: Trading Precision for Memory
Quantization is the most effective technique for reducing VRAM requirements. By storing model weights at lower numerical precision, you can cut memory usage by 2–4x with surprisingly little impact on output quality.
Common Quantization Formats
INT8 (8-bit) reduces each parameter from 2 bytes (FP16) to 1 byte, a 50% memory saving. Quality degradation is negligible for most applications. Frameworks like TensorRT-LLM, vLLM, and bitsandbytes support INT8 inference natively.
INT4 (4-bit) further reduces each parameter to 0.5 bytes, a 75% saving over FP16. Modern 4-bit quantization methods like GPTQ, AWQ, and GGUF maintain excellent output quality, with perplexity increases of only 1–3% on most benchmarks. For production chatbots and Q&A systems, the quality difference is typically imperceptible to end users.
FP8 (8-bit floating point) is supported natively on Hopper and Blackwell GPUs (H100, H200, B200). Unlike INT8, FP8 preserves the floating-point format, offering slightly better quality for the same memory footprint. FP8 also enables the Transformer Engine on Hopper GPUs, which dynamically switches between FP8 and FP16 per layer.
Quantization Impact on Quality
The quality-memory tradeoff varies by model and task. General guidelines:
- INT8: Safe for virtually all inference use cases. Negligible quality loss on standard benchmarks.
- INT4 (GPTQ/AWQ): Excellent for most production inference. Slight degradation on complex reasoning tasks, imperceptible for conversational AI.
- INT4 (GGUF Q4_K_M): The most popular format for local LLM deployment. Slightly better quality than naive INT4 due to mixed precision within quantization groups.
- INT2/INT3: Experimental. Noticeable quality degradation. Not recommended for production use cases.
GPU-to-Model Matching Guide
Choosing the right GPU means matching your model's total memory requirement (weights + KV cache + overhead) to the GPU's available dedicated VRAM. Here's a practical mapping:
| Model | Quantization | Min VRAM | Recommended GPUs |
|---|---|---|---|
| Mistral 7B / Llama 3.1 8B | FP16 | ~18–20 GB | RTX 4090 (24 GB), L40S (48 GB) |
| Mistral 7B / Llama 3.1 8B | INT4 | ~6–7 GB | L4 (24 GB), any 8+ GB GPU |
| Qwen 2.5 14B | FP16 | ~34 GB | L40S (48 GB), A100 40 GB |
| Qwen 2.5 14B | INT4 | ~10 GB | RTX 4090 (24 GB), L4 (24 GB) |
| Qwen 2.5 32B | INT4 | ~22 GB | RTX 4090 (24 GB), L40S (48 GB) |
| Llama 3.1 70B | INT8 | ~84 GB | H100 80 GB, A100 80 GB |
| Llama 3.1 70B | INT4 | ~46 GB | L40S (48 GB), A100 80 GB |
| Llama 3.1 70B | FP16 | ~168 GB | H200 (141 GB) + offload, 2× H100 |
| Mixtral 8x7B | INT4 | ~30 GB | L40S (48 GB), RTX 4090 (24 GB) + offload |
| Llama 3.1 405B | INT4 | ~245 GB | 2× H200, 4× H100, 2× B200 (192 GB) |
For production inference with long context windows or high concurrency, add 30–50% additional VRAM headroom beyond the base model size to accommodate KV cache growth.
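That headroom rule can be expressed as a quick check. The sketch below is a hypothetical helper (not part of any serving framework) that applies the 30–50% margin before matching a model to a GPU:

```python
def fits_on_gpu(model_vram_gb: float, gpu_vram_gb: float,
                headroom: float = 0.3) -> tuple[bool, float]:
    """Check whether a model fits a GPU with production headroom.

    headroom: extra fraction on top of the base footprint to absorb
    KV cache growth (0.3-0.5 for long contexts / high concurrency).
    Returns (fits, required_gb).
    """
    required = model_vram_gb * (1 + headroom)
    return required <= gpu_vram_gb, required

# Llama 3.1 70B at INT4 (~46 GB base):
print(fits_on_gpu(46, 48))  # L40S 48 GB: too tight once headroom is added
print(fits_on_gpu(46, 80))  # A100 80 GB: fits comfortably
```

This also illustrates why the table lists *minimum* VRAM: a 46 GB model on a 48 GB card works for light workloads but leaves almost no room for KV cache under load.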
KV Cache Optimization Techniques
The KV cache is the most dynamic component of GPU memory usage, and several techniques exist to control its growth.
PagedAttention (vLLM)
Traditional KV cache implementations pre-allocate memory for the maximum sequence length, wasting 60–80% of allocated cache memory for requests that don't use the full context window. PagedAttention, introduced by vLLM, borrows the concept of virtual memory paging from operating systems.
Instead of allocating one contiguous block per request, PagedAttention divides the KV cache into fixed-size blocks that are allocated on-demand as tokens are generated. This eliminates pre-allocation waste and reduces internal fragmentation to near zero, enabling 2–4x more concurrent requests on the same GPU.
KV Cache Quantization
The KV cache itself can be quantized separately from the model weights. Quantizing the KV cache from FP16 to FP8 or INT8 shrinks cache memory by 2x with minimal quality impact. Combined with GQA (which already provides an 8x reduction), this creates a compound effect that makes long-context inference feasible on single GPUs.
vLLM supports KV cache quantization natively, and the technique is particularly valuable for production deployments serving 32K–128K context windows.
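The compound effect is easy to quantify. A rough Python sketch (my own helper, mirroring the Llama 3.1 70B shape; the MHA baseline is hypothetical) comparing an MHA cache at FP16 against GQA plus an 8-bit cache:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int) -> float:
    # 2x for key + value vectors per layer; result in decimal GB.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

CTX = 128_000  # 128K-token context, single request

mha_fp16 = kv_cache_gb(80, 64, 128, CTX, 2)  # no GQA, FP16 cache: ~335 GB
gqa_fp16 = kv_cache_gb(80, 8, 128, CTX, 2)   # GQA (8 KV heads):   ~42 GB
gqa_fp8  = kv_cache_gb(80, 8, 128, CTX, 1)   # GQA + 8-bit cache:  ~21 GB

print(mha_fp16, gqa_fp16, gqa_fp8)
```

GQA alone gives the 8x reduction, and quantizing the cache halves it again: a 16x combined saving that turns an unservable 128K-context request into one that fits beside the weights.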
Grouped Query Attention (GQA)
GQA is an architectural feature (not a deployment optimization) where the model uses fewer KV heads than query heads. Most modern LLMs, including Llama 3.1, Mistral, Qwen 2.5, and DeepSeek, use GQA by default.
The impact is dramatic: Llama 3.1 70B uses 8 KV heads instead of 64, reducing KV cache memory by 8x compared to standard multi-head attention. This is why modern 70B models are practical to serve on single GPUs despite their size. Without GQA, the KV cache alone would consume 320 GB at 128K context instead of 40 GB.
Training Memory: Why It's 3–5x More Than Inference
Training requires significantly more GPU memory than inference because the GPU must simultaneously hold multiple copies of the model data:
| Component | Size (70B FP16) | Purpose |
|---|---|---|
| Model weights | ~140 GB | The parameters being trained |
| Gradients | ~140 GB | Computed during backpropagation |
| Optimizer states (Adam) | ~280 GB | Two extra buffers (momentum and variance) per parameter, shown at 16-bit; double this if kept in full FP32 |
| Activations | 10–100+ GB | Stored for backpropagation (varies with batch size) |
| Total | 570–660+ GB | Requires multi-GPU distribution |
This is why training a 70B model requires 8x H100 GPUs (640 GB total HBM) even before accounting for activation memory. Techniques like ZeRO (Zero Redundancy Optimizer), gradient checkpointing, and mixed-precision training help distribute and reduce this memory burden, but the fundamental scale remains enormous.
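The table's arithmetic can be captured in a small estimator (a sketch with assumed defaults, not a framework utility; the byte counts per component follow the table above):

```python
def training_memory_gb(params_billions: float,
                       weight_bytes: float = 2,   # FP16/BF16 weights
                       grad_bytes: float = 2,     # FP16/BF16 gradients
                       optim_bytes: float = 4,    # two Adam buffers at 2 bytes each
                       activations_gb: float = 50) -> float:
    """Rough training footprint: weights + gradients + optimizer states + activations.

    Use optim_bytes=8 for Adam momentum/variance kept in full FP32.
    """
    params = params_billions * 1e9
    return params * (weight_bytes + grad_bytes + optim_bytes) / 1e9 + activations_gb

print(training_memory_gb(70))  # ~610 GB for a 70B model at mid-range activation memory
```

Changing `activations_gb` or `optim_bytes` shows how quickly the total swings, which is why gradient checkpointing and ZeRO-style sharding matter so much at this scale.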
Fine-Tuning with LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA dramatically reduce fine-tuning memory requirements by training only a small number of adapter parameters while freezing the base model weights. QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of a 70B model on a single A100 80 GB or even an RTX 4090 with careful configuration.
Memory Overflow Strategies
When your model's memory requirements exceed available VRAM, several strategies can help, each with meaningful tradeoffs.
Tensor Parallelism
Tensor parallelism splits individual layers across multiple GPUs, distributing both weights and computation. With NVLink-connected GPUs (900 GB/s on H100), the communication overhead is minimal. This is the standard approach for serving models that do not fit on a single GPU, such as splitting Llama 3.1 70B at FP16 across two H100 GPUs.
The tradeoff is cost (two GPUs instead of one) and slightly increased latency from inter-GPU communication.
CPU Offloading
When GPU memory is insufficient, model layers or KV cache pages can be offloaded to CPU RAM and swapped back when needed. This works but incurs significant latency. PCIe Gen5 delivers ~64 GB/s versus the 3,350 GB/s of H100 HBM3, meaning every offloaded access is roughly 50x slower.
CPU offloading is acceptable for development and low-throughput experimentation but is not viable for production inference with latency SLAs.
Gradient Checkpointing (Training)
During training, gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation. This trades compute time for memory, typically reducing activation memory by 60–70% at the cost of approximately 30% more computation time. For training large models where memory is the binding constraint, this tradeoff is usually worthwhile.
Practical Memory Planning Checklist
When planning GPU memory for an LLM deployment, account for all of the following:
- Model weights: Parameters × bytes per parameter (at your chosen precision)
- KV cache: Per-token cache size × max context length × max concurrent requests
- Activation memory: Approximately 5–10% of total for inference, much more for training
- Framework overhead: 500 MB – 2 GB for CUDA context, driver, and serving framework
- Fragmentation buffer: Add 10–15% headroom for memory allocator inefficiency
- Growth margin: Plan for peak load, not average. Memory usage spikes during burst traffic.
A safe rule of thumb: take the model weight size at your chosen precision, multiply by 1.3–1.5x for inference with moderate concurrency and context, and ensure this fits within your GPU's dedicated VRAM. For high-concurrency production serving or long-context use cases, multiply by 1.5–2x.
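The rule of thumb above reduces to a one-liner. This sketch (helper name and defaults are mine) is a back-of-the-envelope planner, not a substitute for measuring your actual workload:

```python
def plan_vram_gb(params_billions: float, bytes_per_param: float,
                 multiplier: float = 1.4) -> float:
    """Rule-of-thumb VRAM estimate: weight size x headroom multiplier.

    multiplier: 1.3-1.5 for moderate concurrency and context,
    1.5-2.0 for high-concurrency or long-context serving.
    """
    return params_billions * bytes_per_param * multiplier

print(plan_vram_gb(8, 2))           # Llama 3.1 8B at FP16: ~22.4 GB
print(plan_vram_gb(70, 0.5, 1.5))   # 70B at INT4, busy server: ~52.5 GB
```

The second example suggests a 48 GB card is marginal for a heavily loaded 70B INT4 deployment, matching the guidance in the GPU-matching table.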
Deploy on Spheron
Need GPU infrastructure sized for your model? Spheron provides bare-metal GPU access with full dedicated VRAM: H100 (80 GB), H200 (141 GB), A100 (80 GB), and RTX 4090 (24 GB). This comes with transparent pricing, instant provisioning, and no long-term contracts.
Explore GPU options on Spheron →
Frequently Asked Questions
How much VRAM do I need for Llama 3.1 70B?
At FP16 precision, Llama 3.1 70B requires approximately 140 GB for model weights plus 20–40 GB for KV cache and overhead, totaling 160–180 GB. This requires either an H200 (141 GB with careful optimization) or two H100 GPUs. At INT8, the weights shrink to ~70 GB, fitting on a single H100 80 GB. At INT4 (GPTQ or AWQ), weights drop to ~35 GB, fitting on a single L40S 48 GB or A100 80 GB with ample room for KV cache.
How does context length affect GPU memory?
KV cache memory scales linearly with context length. For Llama 3.1 70B at BF16, each token in the context adds approximately 0.31 MB of KV cache. At 2K context, the cache is negligible (~0.6 GB). At 128K context, it consumes approximately 40 GB per request, roughly 22% of the total memory footprint. Serving multiple concurrent requests at long context multiplies this further.
What's the difference between FP16, INT8, and INT4 for inference?
FP16 uses 2 bytes per parameter, INT8 uses 1 byte (50% saving), and INT4 uses 0.5 bytes (75% saving). INT8 quantization produces negligible quality loss for virtually all applications. INT4 introduces slight degradation, typically 1–3% perplexity increase, but is imperceptible for most production chatbot and Q&A use cases. INT4 is the most popular format for local LLM deployment because it enables running 70B models on consumer-grade hardware.
Can I run a 70B model on an RTX 4090?
Not at full precision. A 70B model at FP16 requires ~140 GB, far exceeding the RTX 4090's 24 GB. At INT4 quantization with aggressive optimization, the model weights fit in ~35 GB, which still exceeds 24 GB. However, with CPU offloading (splitting some layers to system RAM), it is possible to run at very reduced throughput. For practical 70B inference, you need at minimum an A100 80 GB (INT8) or L40S 48 GB (INT4).
Why does training need so much more memory than inference?
Training simultaneously stores model weights (~140 GB for 70B at FP16), gradients (~140 GB), optimizer states (~280 GB for Adam's momentum and variance buffers), and activations (10–100+ GB depending on batch size). The total for a 70B model exceeds 570 GB, requiring 8x H100 GPUs with tensor and pipeline parallelism. Techniques like QLoRA reduce fine-tuning requirements dramatically by training only small adapter layers while keeping the base model frozen and quantized.
What is PagedAttention and why does it matter?
PagedAttention, introduced by vLLM, applies OS-style virtual memory paging to the KV cache. Traditional implementations pre-allocate maximum context-length memory per request, wasting 60–80% of allocated memory. PagedAttention allocates cache blocks on-demand as tokens are generated, eliminating this waste. The practical impact is 2–4x more concurrent requests on the same GPU, a direct cost saving for production inference deployments.