Engineering

KV Cache Optimization: Serve 10x More Users on the Same GPU (2026)

Written by Mitrasish, Co-founder | Mar 28, 2026

Tags: KV Cache, LLM Inference, GPU Memory, vLLM, PagedAttention, NVFP4, AI Infrastructure, GPU Cloud

A single Llama 3 70B request at 128K context needs 42 GB of GPU memory just for the KV cache. That leaves almost nothing for model weights on an 80 GB card, and zero room for concurrent users. Before you provision more GPUs, there are five optimization techniques that compound on each other to cut that 42 GB to under 6 GB on the right hardware. For the full VRAM calculation context including model weights and activation memory, see GPU Memory Requirements for LLMs.

Why the KV Cache Dominates Your VRAM Budget

During autoregressive generation, the model stores key and value vectors for every previously generated token so it doesn't recompute them on each new token. These tensors accumulate in the KV cache throughout the lifetime of a request and are only freed when the request completes.

The KV cache grows in five dimensions simultaneously: number of transformer layers, number of KV heads, head dimension, sequence length, and concurrent batch size. At short contexts (under 4K tokens), it's negligible. At long contexts (32K-128K tokens) with real concurrent load, it becomes the dominant memory consumer.

Here's how KV cache memory scales for Llama 3.1 70B at BF16:

| Context Length | 1 Concurrent User | 8 Concurrent Users |
|---|---|---|
| 4,096 tokens | ~1.3 GB | ~10.7 GB |
| 16,384 tokens | ~5.4 GB | ~43 GB |
| 32,768 tokens | ~10.7 GB | ~86 GB |
| 131,072 tokens | ~42.9 GB | ~343 GB |

The model weights for Llama 3.1 70B at FP16 are ~140 GB. At 32K context with 8 concurrent users, the ~86 GB KV cache is already more than 60% of the weight footprint. At 128K context with 8 users, the KV cache is ~2.4x the model size.

The KV Cache Memory Formula

The exact formula for KV cache memory:

KV_bytes = 2 × L × H_kv × D × S × B × bytes_per_element

Where:

  • L = number of transformer layers
  • H_kv = number of key-value heads (after GQA)
  • D = head dimension
  • S = sequence length (tokens)
  • B = concurrent batch size
  • bytes_per_element = 2 for BF16/FP16, 1 for FP8, 0.5 for FP4
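The formula drops straight into code. A minimal sketch, using the Llama 3.1 70B parameters discussed in this section (80 layers, 8 KV heads, head dim 128):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_element):
    # Factor of 2 covers the separate K and V tensors
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element

# Llama 3.1 70B at BF16
per_token = kv_cache_bytes(80, 8, 128, 1, 1, 2)
total = kv_cache_bytes(80, 8, 128, 32768, 8, 2)
print(f"{per_token:,} bytes/token, {total / 1e9:.1f} GB at 32K x 8 users")
# 327,680 bytes/token, 85.9 GB at 32K x 8 users
```

Plugging in any model's layer count, KV head count, and head dimension reproduces the per-token figures in the table below.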

Worked examples for three common models:

| Model | Layers | KV Heads | Head Dim | KV/token at BF16 |
|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | ~0.131 MB |
| Llama 3.1 70B | 80 | 8 | 128 | ~0.327 MB |
| Llama 3.1 405B | 126 | 8 | 128 | ~0.516 MB |

Llama 3 70B uses Grouped Query Attention (GQA): 8 KV heads serve its 64 query heads, so each group of 8 query heads shares one KV head. That 8x reduction in KV heads cuts KV cache size by 8x vs standard multi-head attention (MHA). A comparable MHA model at 70B scale would need ~2.6 MB per token at BF16.

One caveat: the formula applies to full-context attention. Models with sliding window attention (Mistral 7B, for example) bound the per-layer KV cache by the window size, not the full context length.

For the full weight memory formula including quantization formats, see GPU Memory Requirements for LLMs.

PagedAttention in vLLM: Virtual Memory for the KV Cache

The naive approach to KV cache allocation reserves the maximum context-length memory for each request at the moment it arrives. A request with a 32K max context gets 32K worth of KV blocks reserved upfront, even if it only generates 500 tokens. In practice, 60-80% of that reserved memory goes unused.

PagedAttention solves this by treating KV cache like OS virtual memory. The cache is split into fixed-size blocks (16 tokens per block by default). Blocks are allocated on-demand as tokens are generated and freed immediately when requests complete. No wasted pre-allocation, no fragmentation from variable-length requests.

The throughput gain is not marginal. Going from naive allocation to PagedAttention typically doubles or quadruples the number of concurrent requests you can serve on the same GPU, because you stop reserving memory that never gets used.
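A toy allocator makes the scheme concrete. This is an illustrative sketch, not vLLM's actual implementation; only the 16-token block size matches vLLM's default:

```python
class PagedKVAllocator:
    """Toy version of PagedAttention's block allocator: blocks are
    handed out on demand and returned to the free pool on completion."""

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))  # free block IDs
        self.tables = {}                     # request -> list of block IDs
        self.tokens = {}                     # request -> tokens generated

    def append_token(self, req_id):
        n = self.tokens.get(req_id, 0)
        if n % self.block_tokens == 0:       # current block is full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted; preempt or queue")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.tokens[req_id] = n + 1

    def release(self, req_id):
        # Freed blocks are immediately reusable by other requests
        self.free.extend(self.tables.pop(req_id, []))
        self.tokens.pop(req_id, None)

alloc = PagedKVAllocator(num_blocks=4)
for _ in range(20):                # 20 tokens -> ceil(20/16) = 2 blocks
    alloc.append_token("req-A")
print(len(alloc.tables["req-A"]), len(alloc.free))  # 2 2
```

A naive allocator would have reserved the request's full max-context worth of blocks up front; here only two blocks are held for 20 generated tokens, leaving the rest of the pool for other requests.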

PagedAttention is enabled by default in vLLM. The launch command:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --kv-cache-dtype auto
```

--gpu-memory-utilization 0.90 tells vLLM to use 90% of available VRAM (after weights load) for the KV cache pool. --max-model-len caps the context window, which controls the maximum KV cache block pool size vLLM allocates. Set it to your actual maximum context need, not the model's theoretical maximum.

For a complete multi-GPU vLLM deployment setup, see vLLM Multi-GPU Production Deployment 2026.

KV Cache Quantization: FP8 and NVFP4

KV tensors can be stored at lower precision than the model weights. The attention computation is relatively tolerant of precision reduction because the softmax normalization in attention averages out small rounding errors.

FP8 on H100/A100: Switch from BF16 to FP8 KV storage with one flag:

```bash
--kv-cache-dtype fp8
```

This halves KV cache memory vs BF16. Quality loss is negligible for most production workloads. FP8 KV cache works on both H100 and A100. Note: on A100, this is a software-level memory optimization. vLLM stores KV tensors in FP8 and dequantizes to BF16 before the attention computation. You get the memory reduction, but the A100 has no hardware FP8 Tensor Cores, so there is no throughput gain from the quantization itself.

NVFP4 on Blackwell (B200, RTX 5090): For an additional 50% reduction vs FP8:

```bash
--kv-cache-dtype nvfp4
```

This requires Blackwell hardware. On H100 or A100, this flag will either error or behave incorrectly depending on vLLM version. Do not use it on Hopper or Ampere GPUs.

KV cache memory for Llama 3.1 70B at 32K context, 8 concurrent users:

| Precision | KV Cache Size | Savings vs BF16 |
|---|---|---|
| BF16 | ~85.9 GB | baseline |
| FP8 | ~42.9 GB | 50% |
| NVFP4 | ~21.5 GB | 75% |
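The table values follow directly from the memory formula by changing only the bytes per element:

```python
BYTES_PER_ELEMENT = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

def kv_cache_gb(dtype, layers=80, kv_heads=8, head_dim=128, seq=32768, batch=8):
    # Defaults: Llama 3.1 70B at 32K context, 8 concurrent users
    return (2 * layers * kv_heads * head_dim * seq * batch
            * BYTES_PER_ELEMENT[dtype] / 1e9)

for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype}: {kv_cache_gb(dtype):.1f} GB")
# bf16: 85.9 GB / fp8: 42.9 GB / nvfp4: 21.5 GB
```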

The quality tradeoff: FP8 KV is safe for production across virtually all tasks. NVFP4 KV should be validated on your specific task before deploying to production, particularly for reasoning-heavy workloads. For the full quality analysis of NVFP4, see FP4 Quantization on Blackwell GPUs.

Note: hardware FP4 tensor core acceleration requires Blackwell. On H100, vLLM can load NVFP4-quantized model weights via Marlin software fallback, but --kv-cache-dtype nvfp4 for KV storage acceleration is a Blackwell-only feature.

CPU KV Cache Offloading

When GPU KV cache fills during a traffic spike, vLLM's default behavior is to queue or reject new requests. Adding a CPU swap space prevents this by using system RAM as a secondary KV cache tier.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --swap-space 32
```

--swap-space 32 reserves 32 GB of system RAM. When GPU KV cache pressure is high, low-priority request blocks swap to CPU RAM. When those requests need their blocks again, they swap back.

The cost: each eviction/reload adds 2-10ms of latency depending on PCIe bandwidth. For a latency-sensitive application serving strict SLAs, this is not acceptable for the hot path. For a batch processing workload or an application where occasional latency spikes are tolerable, it's the right call over request rejection.

Starting configuration: --swap-space 16 for moderate workloads, --swap-space 32 for bursty traffic. You need a host with at least 64 GB of DRAM to leave enough room for the OS and other processes after the swap reservation.
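As a rough sanity check on those latency numbers, the per-block transfer cost over PCIe is easy to estimate. A back-of-envelope sketch; the 25 GB/s effective PCIe Gen4 x16 bandwidth is an assumed figure, not a measured one:

```python
block_tokens = 16                              # vLLM default block size
per_token_bytes = 2 * 80 * 8 * 128 * 1         # Llama 3.1 70B, FP8 KV
block_bytes = block_tokens * per_token_bytes   # ~2.6 MB per block
pcie_bytes_per_s = 25e9                        # assumed effective Gen4 x16 bandwidth
ms_per_block = block_bytes / pcie_bytes_per_s * 1e3
print(f"{block_bytes / 1e6:.1f} MB/block, {ms_per_block:.2f} ms per block")
# 2.6 MB/block, 0.10 ms per block
```

Swapping a few dozen blocks at once lands in the single-digit-millisecond range quoted above.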

Full CPU offloading (loading model layers into CPU RAM) is different from swap space and not recommended for production serving given the 50x bandwidth penalty between CPU RAM and GPU HBM.

LMCache and Prefix Caching: Cutting TTFT for Long-Context Workloads

Many production LLM workloads share a common prefix across requests: a long system prompt, a shared reference document in a RAG pipeline, or a conversation history. Without prefix caching, every request recomputes the KV cache for that prefix from scratch during prefill, which is the expensive step for long contexts.

vLLM's built-in --enable-prefix-caching flag detects when a new request's prefix matches a previously-computed prefix and serves it from the in-memory cache. This is zero-configuration and applies within a single server session.

LMCache extends this to persist cached KV blocks across server restarts and multiple server instances using Redis or disk backends. The practical gain is dramatic for long-context workloads: on a 128K-token system prompt on H100, TTFT drops from ~11s to ~1.5s because the full prefill computation is replaced by a cache hit.

Install and launch:

```bash
pip install lmcache
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8
```

When prefix caching helps most:

  • RAG pipelines where the same document or knowledge base is prepended to many queries
  • Chatbots with long, fixed system prompts (multi-paragraph instructions)
  • Batch processing where many items share an identical preamble
  • Multi-turn conversations where the growing history is re-submitted each turn
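Prefix caching operates at block granularity: a cached block is reusable only when its contents and everything before it match. A toy sketch of that chained-hash idea (the 16-token block size matches vLLM's default; the hashing details here are illustrative, not vLLM's actual scheme):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(token_ids):
    """Hash each full block chained with its prefix, so a hash match
    implies the entire prefix up to that block is identical."""
    hashes, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        running.update(repr(token_ids[i:i + BLOCK]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes

shared_prefix = list(range(64))  # stand-in for a 64-token system prompt
req_a = block_hashes(shared_prefix + [101, 102])
req_b = block_hashes(shared_prefix + [201, 202])
reused = sum(a == b for a, b in zip(req_a, req_b))
print(reused)  # 4 blocks (all 64 prefix tokens) are cache hits
```

The two requests differ only after the shared prefix, so all four prefix blocks hash identically and the second request skips their prefill entirely.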

LMCache compatibility depends on the vLLM version you're running. Check the LMCache documentation for version-specific setup instructions before deploying.

GQA and Flash Attention: Architecture-Level Optimizations

Both of these are baked into modern models and serving frameworks. You don't configure them, but understanding what they do explains why the KV cache numbers above are already much smaller than they could be.

Grouped Query Attention (GQA): Llama 3 70B uses 8 KV heads for its 64 query heads. The query heads are divided into 8 groups, and each group attends against a single shared KV head, so you store only 8 KV heads' worth of tensors per layer instead of 64. This is an 8x KV cache reduction vs standard Multi-Head Attention (MHA) for Llama-scale models.

Multi-Query Attention (MQA) takes this further: 1 KV head shared by all query heads. Used in some smaller models for maximum memory efficiency at some quality cost.

KV head counts for popular 2026 models:

| Model | Query Heads | KV Heads | KV Cache Reduction vs MHA |
|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 4x |
| Llama 3.1 70B | 64 | 8 | 8x |
| Llama 3.1 405B | 128 | 8 | 16x |
| Mistral 7B | 32 | 8 | 4x |
| Qwen 2.5 72B | 64 | 8 | 8x |

Flash Attention: Rewrites the attention kernel to avoid materializing the full NxN attention matrix in GPU HBM. Instead of writing the full matrix to memory and reading it back, Flash Attention computes attention in tiles that stay in SRAM. For long sequences, this reduces HBM reads/writes by 5-20x. Flash Attention is the default in vLLM and TensorRT-LLM and requires no configuration.
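The quadratic blow-up is easy to quantify. A small sketch of what naive attention would have to materialize in HBM per head, per layer, at BF16 (Flash Attention never writes this matrix out):

```python
def score_matrix_gb(seq_len, dtype_bytes=2):
    # Full NxN attention score matrix for one head in one layer
    return seq_len * seq_len * dtype_bytes / 1e9

for n in (4096, 32768, 131072):
    print(f"{n:>7} tokens: {score_matrix_gb(n):.2f} GB per head")
# 4096 -> 0.03 GB, 32768 -> 2.15 GB, 131072 -> 34.36 GB per head
```

At 128K context, a single head's score matrix alone would exceed 34 GB, which is why tiling in SRAM, rather than round-tripping through HBM, is mandatory at long context.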

For end-to-end throughput comparisons across inference engines that include all of these optimizations, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

GPU Selection Guide: Match VRAM to Your Optimized KV Cache Budget

The right GPU is the smallest one where model weights + optimized KV cache + ~10% activation headroom fit in VRAM. Optimizing the KV cache first, then sizing the GPU, is cheaper than throwing more VRAM at an unoptimized stack. For a broader cost reduction framework that goes beyond KV cache, see the GPU cost optimization playbook.

Three workload profiles showing the calculation:

Light workload: Llama 3.1 8B, 8K context, 16 concurrent users, FP8 KV cache

  • Weights (FP8): ~8 GB
  • KV cache (FP8): 2 × 32 × 8 × 128 × 8192 × 16 × 1 = ~8.6 GB
  • Total with headroom: ~18 GB
  • Fits on: A100 80GB with 62 GB to spare for larger batches

Medium workload: Llama 3.1 70B, 32K context, 8 users, FP8 KV cache

  • Weights (FP8): ~70 GB
  • KV cache (FP8): 2 × 80 × 8 × 128 × 32768 × 8 × 1 = ~42.9 GB
  • Total: ~113 GB, needs 2x H100 at FP8 weights + FP8 KV
  • With NVFP4 KV on Blackwell: ~70 GB + ~21.5 GB = ~91.5 GB, fits B200 (192 GB) with room

Heavy workload: Llama 3.1 70B, 128K context, 8 users, FP8 KV cache

  • Weights (FP8): ~70 GB
  • KV cache (FP8): 2 × 80 × 8 × 128 × 131072 × 8 × 1 = ~171 GB
  • Total: ~241 GB, requires 4x H100 or 2x H100 with further prefix caching to reduce active context
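The same arithmetic underlies all three profiles. A sketch of the sizing check (the ~70 GB figure is the approximate FP8 weight footprint used above; add ~10% activation headroom on top of the printed totals):

```python
def total_vram_gb(weights_gb, layers, kv_heads, head_dim, seq, batch, kv_bytes):
    # Weights plus KV cache; activation headroom is added separately
    kv_gb = 2 * layers * kv_heads * head_dim * seq * batch * kv_bytes / 1e9
    return weights_gb + kv_gb

# Llama 3.1 70B, FP8 weights + FP8 KV
medium = total_vram_gb(70, 80, 8, 128, 32768, 8, 1)    # 32K ctx, 8 users
heavy = total_vram_gb(70, 80, 8, 128, 131072, 8, 1)    # 128K ctx, 8 users
print(f"medium: {medium:.0f} GB, heavy: {heavy:.0f} GB")
# medium: 113 GB, heavy: 242 GB
```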

GPU selection with live Spheron pricing (fetched 25 Mar 2026):

| Workload Profile | VRAM Needed | Recommended GPU | Spheron Price |
|---|---|---|---|
| 8B model, 8K context, 16 users, FP8 KV | ~18 GB | A100 80GB SXM4 | ~$1.05/hr |
| 70B model, 32K context, 8 users, FP8 KV | ~113 GB | 2x H100 SXM5 | ~$4.80/hr |
| 70B model, 128K context, 4 users, FP8 KV | ~156 GB | 3x H100 SXM5 | ~$7.20/hr |
| 70B model, 32K context, 8 users, NVFP4 KV | ~92 GB | B200 SXM6 | ~$2.07/hr (spot) |

Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

L40S is not currently available in the Spheron GPU catalog. For L40S comparisons, see Best GPU for AI Inference in 2026 which covers a broader GPU-to-workload matching guide.

Benchmark Results: Throughput Gains from KV Cache Optimization

Stacked optimization gains for Llama 3.1 70B FP8 on H100 PCIe, starting from a naive baseline (rows 1-4 on H100 PCIe; final row on Blackwell):

| Configuration | Concurrent Users | Throughput (tok/s) | KV Cache VRAM | Notes |
|---|---|---|---|---|
| Naive pre-alloc, BF16 KV | 4 | ~400 | ~40 GB | No PagedAttention |
| + PagedAttention (vLLM default) | 8 | ~900 | ~22 GB | 2x users, same GPU |
| + FP8 KV cache | 16 | ~1,800 | ~11 GB | 4x users vs baseline |
| + Prefix caching (LMCache) | 16 | ~1,800 (TTFT: 1.5s vs 11s) | ~11 GB | Latency win, not throughput |
| + NVFP4 KV (Blackwell only) | 32 | ~2,600 | ~5.5 GB | 8x users vs baseline |

These are estimated figures based on published vLLM and LMCache benchmark data, not independently reproduced results on this hardware configuration. They are intended to illustrate relative ordering and magnitude of gains, not precise throughput you should expect. Actual results depend on model, hardware, batch composition, and request length distribution. Run your own benchmarks on your target hardware before making infrastructure decisions.

Quick Reference: KV Cache Optimization Checklist

  1. Enable PagedAttention (on by default in vLLM)
  2. Set --gpu-memory-utilization 0.90 (raise to 0.92-0.95 if model fits tightly)
  3. Use FP8 KV cache on H100/A100: --kv-cache-dtype fp8
  4. Use NVFP4 KV cache on Blackwell only: --kv-cache-dtype nvfp4
  5. Enable prefix caching: --enable-prefix-caching
  6. Set --swap-space 16 for burst tolerance on bursty workloads
  7. Choose GQA-native models (Llama 3 family) for long-context serving
  8. Verify Flash Attention is active (on by default in vLLM)
  9. Size your GPU to the optimized budget, not the theoretical maximum

KV cache optimization is what makes production LLM inference financially viable. Once you have your optimized VRAM number, pick the right-sized Spheron GPU and avoid paying for capacity you don't need.

Rent H100 → | Rent A100 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.