A single Llama 3 70B request at 128K context needs 42 GB of GPU memory just for the KV cache. That leaves almost nothing for model weights on an 80 GB card, and zero room for concurrent users. Before you provision more GPUs, there are five optimization techniques that compound on each other to cut that 42 GB to under 6 GB on the right hardware. For the full VRAM calculation context including model weights and activation memory, see GPU Memory Requirements for LLMs.
Why the KV Cache Dominates Your VRAM Budget
During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't recompute them on each new token. These tensors accumulate in the KV cache throughout the lifetime of a request and are only freed when the request completes.
The KV cache grows in five dimensions simultaneously: number of transformer layers, number of KV heads, head dimension, sequence length, and concurrent batch size. At short contexts (under 4K tokens), it's negligible. At long contexts (32K-128K tokens) with real concurrent load, it becomes the dominant memory consumer.
Here's how KV cache memory scales for Llama 3.1 70B at BF16:
| Context Length | 1 Concurrent User | 8 Concurrent Users |
|---|---|---|
| 4,096 tokens | ~1.3 GB | ~10.7 GB |
| 16,384 tokens | ~5.4 GB | ~43 GB |
| 32,768 tokens | ~10.7 GB | ~86 GB |
| 131,072 tokens | ~42.9 GB | ~343 GB |
The model weights for Llama 3.1 70B at FP16 are ~140 GB. At 32K context with 8 concurrent users, the KV cache alone exceeds the weights. At 128K context with 8 users, the KV cache is 2.4x the model size.
The KV Cache Memory Formula
The exact formula for KV cache memory:
KV_bytes = 2 × L × H_kv × D × S × B × bytes_per_element

Where:

- `L` = number of transformer layers
- `H_kv` = number of key-value heads (after GQA)
- `D` = head dimension
- `S` = sequence length (tokens)
- `B` = concurrent batch size
- `bytes_per_element` = 2 for BF16/FP16, 1 for FP8, 0.5 for FP4
Worked examples for three common models:
| Model | Layers | KV Heads | Head Dim | KV/token at BF16 |
|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | ~0.131 MB |
| Llama 3.1 70B | 80 | 8 | 128 | ~0.327 MB |
| Llama 3.1 405B | 126 | 8 | 128 | ~0.516 MB |
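The formula and the per-token figures above can be sanity-checked with a few lines of Python (the helper name is ours):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 covers keys and values.
    bytes_per_elem: 2 for BF16/FP16, 1 for FP8, 0.5 for FP4."""
    return int(2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem)

# Llama 3.1 70B per-token KV at BF16, matching the table
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)
print(per_token / 1e6)  # 0.32768 MB

# 128K context, 8 concurrent users: ~343 GB
print(kv_cache_bytes(80, 8, 128, seq_len=131072, batch=8) / 1e9)
```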
Llama 3 uses Grouped Query Attention (GQA): the 70B model has 8 KV heads serving 64 query heads. That 8x reduction in KV heads cuts KV cache size by 8x vs standard multi-head attention (MHA). A comparable MHA model at 70B scale would need ~2.6 MB per token at BF16.
One caveat: the formula applies to full-context attention. Models with sliding window attention (Mistral 7B's sliding window layers, for example) bound the per-layer KV cache by the window size rather than the full context length.
For the full weight memory formula including quantization formats, see GPU Memory Requirements for LLMs.
PagedAttention in vLLM: Virtual Memory for the KV Cache
The naive approach to KV cache allocation reserves the maximum context-length memory for each request at the moment it arrives. A request with a 32K max context gets 32K worth of KV blocks reserved upfront, even if it only generates 500 tokens. In practice, 60-80% of that reserved memory goes unused.
PagedAttention solves this by treating KV cache like OS virtual memory. The cache is split into fixed-size blocks (16 tokens per block by default). Blocks are allocated on-demand as tokens are generated and freed immediately when requests complete. No wasted pre-allocation, no fragmentation from variable-length requests.
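The mechanics can be illustrated with a toy allocator — a simplified sketch of the idea, not vLLM's actual implementation, and all names here are ours:

```python
class ToyBlockPool:
    """Fixed-size KV blocks handed out on demand, PagedAttention-style."""

    def __init__(self, num_blocks, block_tokens=16):
        self.free = list(range(num_blocks))   # physical block ids
        self.block_tokens = block_tokens
        self.tables = {}                      # request id -> block table
        self.lens = {}                        # request id -> tokens stored

    def append_token(self, req):
        n = self.lens.get(req, 0)
        if n % self.block_tokens == 0:        # last block full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lens[req] = n + 1

    def release(self, req):
        # Blocks return to the pool the moment the request completes.
        self.free.extend(self.tables.pop(req, []))
        self.lens.pop(req, None)

pool = ToyBlockPool(num_blocks=4)
for _ in range(17):                  # 17 tokens -> 2 blocks, not a 32K reservation
    pool.append_token("req-1")
print(len(pool.tables["req-1"]))     # 2
pool.release("req-1")                # all blocks immediately reusable
```

A request that stops at 500 tokens holds ceil(500/16) = 32 blocks, never the thousands a 32K pre-allocation would pin.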
The throughput gain is not marginal. Going from naive allocation to PagedAttention typically doubles or quadruples the number of concurrent requests you can serve on the same GPU, because you stop reserving memory that never gets used.
PagedAttention is enabled by default in vLLM. The launch command:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --kv-cache-dtype auto
```

`--gpu-memory-utilization 0.90` caps vLLM's total memory footprint at 90% of VRAM; after the weights load, the remainder of that budget becomes the KV cache pool. `--max-model-len` caps the context window, which controls the maximum KV cache block pool size vLLM allocates. Set it to your actual maximum context need, not the model's theoretical maximum.
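As a rough way to reason about what that pool buys you, the sketch below estimates how many tokens of KV cache fit once weights are loaded. The function name and simplifications are ours; it ignores activation scratch space, so treat the result as an upper bound:

```python
def kv_pool_tokens(total_vram_gb, weights_gb, util=0.90, per_token_kv_mb=0.328):
    """Approximate token capacity of the KV cache pool (upper bound).
    per_token_kv_mb defaults to Llama 3.1 70B at BF16 (~0.328 MB/token)."""
    pool_gb = total_vram_gb * util - weights_gb
    return int(pool_gb * 1000 / per_token_kv_mb)

# 2x H100 (160 GB) with FP8 weights (~70 GB) and FP8 KV (~0.164 MB/token):
print(kv_pool_tokens(160, 70, per_token_kv_mb=0.164))
```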
For a complete multi-GPU vLLM deployment setup, see vLLM Multi-GPU Production Deployment 2026.
KV Cache Quantization: FP8 and NVFP4
KV tensors can be stored at lower precision than the model weights. The attention computation is relatively tolerant of precision reduction because the softmax normalization in attention averages out small rounding errors.
FP8 on H100/A100: Switch from BF16 to FP8 KV storage with one flag:
```bash
--kv-cache-dtype fp8
```

This halves KV cache memory vs BF16. Quality loss is negligible for most production workloads. FP8 KV cache works on both H100 and A100. Note: on A100, this is a software-level memory optimization. vLLM stores KV tensors in FP8 and dequantizes to BF16 before the attention computation. You get the memory reduction, but the A100 has no hardware FP8 Tensor Cores, so there is no throughput gain from the quantization itself.
NVFP4 on Blackwell (B200, RTX 5090): For an additional 50% reduction vs FP8:
```bash
--kv-cache-dtype nvfp4
```

This requires Blackwell hardware. On H100 or A100, this flag will either error or behave incorrectly depending on vLLM version. Do not use it on Hopper or Ampere GPUs.
KV cache memory for Llama 3.1 70B at 32K context, 8 concurrent users:
| Precision | KV Cache Size | Savings vs BF16 |
|---|---|---|
| BF16 | ~85.9 GB | baseline |
| FP8 | ~42.9 GB | 50% |
| NVFP4 | ~21.5 GB | 75% |
The quality tradeoff: FP8 KV is safe for production across virtually all tasks. NVFP4 KV should be validated on your specific task before deploying to production, particularly for reasoning-heavy workloads. For the full quality analysis of NVFP4, see FP4 Quantization on Blackwell GPUs.
Note: hardware FP4 tensor core acceleration requires Blackwell. On H100, vLLM can load NVFP4-quantized model weights via Marlin software fallback, but --kv-cache-dtype nvfp4 for KV storage acceleration is a Blackwell-only feature.
CPU KV Cache Offloading
When GPU KV cache fills during a traffic spike, vLLM's default behavior is to queue or reject new requests. Adding a CPU swap space prevents this by using system RAM as a secondary KV cache tier.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --swap-space 32
```

`--swap-space 32` reserves 32 GB of system RAM. When GPU KV cache pressure is high, low-priority request blocks swap to CPU RAM. When those requests need their blocks again, they swap back.
The cost: each eviction/reload adds 2-10ms of latency depending on PCIe bandwidth. For a latency-sensitive application serving strict SLAs, this is not acceptable for the hot path. For a batch processing workload or an application where occasional latency spikes are tolerable, it's the right call over request rejection.
Starting configuration: --swap-space 16 for moderate workloads, --swap-space 32 for bursty traffic. You need a host with at least 64 GB of DRAM to leave enough room for the OS and other processes after the swap reservation.
Full CPU offloading (loading model layers into CPU RAM) is different from swap space and not recommended for production serving given the 50x bandwidth penalty between CPU RAM and GPU HBM.
LMCache and Prefix Caching: Cutting TTFT for Long-Context Workloads
Many production LLM workloads share a common prefix across requests: a long system prompt, a shared reference document in a RAG pipeline, or a conversation history. Without prefix caching, every request recomputes the KV cache for that prefix from scratch during prefill, which is the expensive step for long contexts.
vLLM's built-in `--enable-prefix-caching` flag detects when a new request's prefix matches a previously computed prefix and serves it from the in-memory cache. This is zero-configuration and applies within a single server session.
LMCache extends this to persist cached KV blocks across server restarts and multiple server instances using Redis or disk backends. The practical gain is dramatic for long-context workloads: on a 128K-token system prompt on H100, TTFT drops from ~11s to ~1.5s because the full prefill computation is replaced by a cache hit.
Install and launch:
```bash
pip install lmcache
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8
```

When prefix caching helps most:
- RAG pipelines where the same document or knowledge base is prepended to many queries
- Chatbots with long, fixed system prompts (multi-paragraph instructions)
- Batch processing where many items share an identical preamble
- Multi-turn conversations where the growing history is re-submitted each turn
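The core idea — keying KV blocks by a hash that chains over every preceding token, so a cache hit guarantees the whole prefix matches and its prefill can be skipped — can be sketched in a toy model (this is an illustration of the technique, not LMCache's or vLLM's actual code):

```python
from hashlib import sha256

class ToyPrefixCache:
    def __init__(self, block_tokens=16):
        self.block_tokens = block_tokens
        self.store = {}   # chained block hash -> cached KV block (stand-in)

    def _block_keys(self, token_ids):
        keys, h = [], sha256()
        full = len(token_ids) - len(token_ids) % self.block_tokens
        for i in range(0, full, self.block_tokens):
            # Chaining makes each key depend on every token before it,
            # so a hit implies the entire prefix up to here matches.
            h.update(repr(token_ids[i:i + self.block_tokens]).encode())
            keys.append(h.copy().hexdigest())
        return keys

    def reusable_prefix(self, token_ids):
        """Tokens whose KV is already cached; only the rest needs prefill."""
        hits, missed = 0, False
        for key in self._block_keys(token_ids):
            if not missed and key in self.store:
                hits += 1
            else:
                missed = True
                self.store[key] = object()   # stand-in for real KV tensors
        return hits * self.block_tokens

cache = ToyPrefixCache(block_tokens=4)
cache.reusable_prefix([1, 2, 3, 4, 5, 6, 7, 8])         # cold: nothing reusable
print(cache.reusable_prefix([1, 2, 3, 4, 9, 9, 9, 9]))  # shared prefix: 4
```

With a long shared system prompt, the hit covers nearly the whole input, which is why TTFT collapses from a full prefill to a lookup.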
LMCache compatibility depends on the vLLM version you're running. Check the LMCache documentation for version-specific setup instructions before deploying.
GQA and Flash Attention: Architecture-Level Optimizations
Both of these are baked into modern models and serving frameworks. You don't configure them, but understanding what they do explains why the KV cache numbers above are already much smaller than they could be.
Grouped Query Attention (GQA): Llama 3 70B uses 8 KV heads for its 64 query heads. Each group of 8 query heads shares a single KV head during attention computation, so you only need to store 8 KV heads' worth of tensors per layer instead of 64. This is an 8x KV cache reduction vs standard Multi-Head Attention (MHA) at that scale.
Multi-Query Attention (MQA) takes this further: 1 KV head shared by all query heads. Used in some smaller models for maximum memory efficiency at some quality cost.
KV head counts for popular 2026 models:
| Model | Query Heads | KV Heads | KV Cache Reduction vs MHA |
|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 4x |
| Llama 3.1 70B | 64 | 8 | 8x |
| Llama 3.1 405B | 128 | 8 | 16x |
| Mistral 7B | 32 | 8 | 4x |
| Qwen 2.5 72B | 64 | 8 | 8x |
Flash Attention: Rewrites the attention kernel to avoid materializing the full NxN attention matrix in GPU HBM. Instead of writing the full matrix to memory and reading it back, Flash Attention computes attention in tiles that stay in SRAM. For long sequences, this reduces HBM reads/writes by 5-20x. Flash Attention is the default in vLLM and TensorRT-LLM and requires no configuration.
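The tiling trick rests on the online-softmax identity: a running max and normalizer let earlier tiles' contributions be rescaled as later tiles arrive, so the full attention row never needs to exist at once. A single-query NumPy sketch of that core idea (educational only, not the fused CUDA kernel):

```python
import numpy as np

def attention_tiled(q, K, V, tile=64):
    """Attention for one query vector, processing K/V in tiles so the
    full row of attention weights is never materialized."""
    d = q.shape[-1]
    m, l = -np.inf, 0.0                 # running max and softmax normalizer
    acc = np.zeros(V.shape[-1])
    for s in range(0, K.shape[0], tile):
        logits = K[s:s + tile] @ q / np.sqrt(d)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)       # rescale what was accumulated so far
        p = np.exp(logits - m_new)
        acc = acc * scale + p @ V[s:s + tile]
        l = l * scale + p.sum()
        m = m_new
    return acc / l
```

The result matches the naive softmax(qKᵀ/√d)V for any tile size; the GPU win comes from tiles living in SRAM instead of round-tripping through HBM.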
For end-to-end throughput comparisons across inference engines that include all of these optimizations, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.
GPU Selection Guide: Match VRAM to Your Optimized KV Cache Budget
The right GPU is the one where: model weights + optimized KV cache + 10% activation headroom fits in VRAM. Optimizing the KV cache first, then sizing the GPU, is cheaper than throwing more VRAM at an unoptimized stack. For a broader cost reduction framework that goes beyond KV cache, see the GPU cost optimization playbook.
Three workload profiles showing the calculation:
Light workload: Llama 3.1 8B, 8K context, 16 concurrent users, FP8 KV cache
- Weights (FP8): ~8 GB
- KV cache (FP8): 2 × 32 × 8 × 128 × 8192 × 16 × 1 = ~8.6 GB
- Total with headroom: ~18 GB
- Fits on: A100 80GB with 62 GB to spare for larger batches
Medium workload: Llama 3.1 70B, 32K context, 8 users, FP8 KV cache
- Weights (FP8): ~70 GB
- KV cache (FP8): 2 × 80 × 8 × 128 × 32768 × 8 × 1 = ~42.9 GB
- Total: ~113 GB, needs 2x H100 at FP8 weights + FP8 KV
- With NVFP4 KV on Blackwell: ~70 GB + ~21.5 GB = ~91.5 GB, fits B200 (192 GB) with room
Heavy workload: Llama 3.1 70B, 128K context, 8 users, FP8 KV cache
- Weights (FP8): ~70 GB
- KV cache (FP8): 2 × 80 × 8 × 128 × 131072 × 8 × 1 = ~171 GB
- Total: ~241 GB, requires 4x H100 or 2x H100 with further prefix caching to reduce active context
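The sizing rule behind these three calculations can be written down directly (the helper name and treating headroom as a flat 10% multiplier are our simplifications):

```python
def vram_budget_gb(weights_gb, kv_cache_gb, headroom=0.10):
    """Weights + optimized KV cache + ~10% activation headroom."""
    return (weights_gb + kv_cache_gb) * (1 + headroom)

# Light workload: 8B FP8 weights + 8.6 GB FP8 KV -> ~18 GB, as above
print(round(vram_budget_gb(8, 8.6)))  # 18
```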
GPU selection with live Spheron pricing (fetched 25 Mar 2026):
| Workload Profile | VRAM Needed | Recommended GPU | Spheron Price |
|---|---|---|---|
| 8B model, 8K context, 16 users, FP8 KV | ~18 GB | A100 80GB SXM4 | ~$1.05/hr |
| 70B model, 32K context, 8 users, FP8 KV | ~113 GB | 2x H100 SXM5 | ~$4.80/hr |
| 70B model, 128K context, 4 users, FP8 KV | ~156 GB | 3x H100 SXM5 | ~$7.20/hr |
| 70B model, 32K context, 8 users, NVFP4 KV | ~92 GB | B200 SXM6 | ~$2.07/hr (spot) |
Pricing fluctuates based on GPU availability. The prices above were fetched on 25 Mar 2026 and may have changed. Check current GPU pricing for live rates.
L40S is not currently available in the Spheron GPU catalog. For L40S comparisons, see Best GPU for AI Inference in 2026 which covers a broader GPU-to-workload matching guide.
Benchmark Results: Throughput Gains from KV Cache Optimization
Stacked optimization gains for Llama 3.1 70B FP8, starting from a naive baseline (rows 1-4 on H100 PCIe; final row on Blackwell):
| Configuration | Concurrent Users | Throughput (tok/s) | KV Cache VRAM | Notes |
|---|---|---|---|---|
| Naive pre-alloc, BF16 KV | 4 | ~400 | ~40 GB | No PagedAttention |
| + PagedAttention (vLLM default) | 8 | ~900 | ~22 GB | 2x users, same GPU |
| + FP8 KV cache | 16 | ~1,800 | ~11 GB | 4x users vs baseline |
| + Prefix caching (LMCache) | 16 | ~1,800 (TTFT: 1.5s vs 11s) | ~11 GB | Latency win, not throughput |
| + NVFP4 KV (Blackwell only) | 32 | ~2,600 | ~5.5 GB | 8x users vs baseline |
These are estimated figures based on published vLLM and LMCache benchmark data, not independently reproduced results on this hardware configuration. They are intended to illustrate relative ordering and magnitude of gains, not precise throughput you should expect. Actual results depend on model, hardware, batch composition, and request length distribution. Run your own benchmarks on your target hardware before making infrastructure decisions.
Quick Reference: KV Cache Optimization Checklist
- Enable PagedAttention (on by default in vLLM)
- Set `--gpu-memory-utilization 0.90` (raise to 0.92-0.95 if the model fits tightly)
- Use FP8 KV cache on H100/A100: `--kv-cache-dtype fp8`
- Use NVFP4 KV cache on Blackwell only: `--kv-cache-dtype nvfp4`
- Enable prefix caching: `--enable-prefix-caching`
- Set `--swap-space 16` for tolerance of bursty traffic
- Choose GQA-native models (Llama 3 family) for long-context serving
- Verify Flash Attention is active (on by default in vLLM)
- Size your GPU to the optimized budget, not the theoretical maximum
KV cache optimization is what makes production LLM inference financially viable. Once you have your optimized VRAM number, pick the right-sized Spheron GPU and avoid paying for capacity you don't need.
