Tutorial

NVMe KV Cache Offloading for LLM Inference: Serve 10x More Users on the Same GPU (2026)

Written by Mitrasish, Co-founder · Apr 1, 2026
Tags: KV Cache, NVMe, LLM Inference, vLLM, LMCache, GPU Memory, H100, GPU Cloud

On an H100 serving Llama 3.1 70B at 128K context, the KV cache for a single user takes ~40 GB. Eight concurrent users need ~320 GB total, which is 4x the H100's entire HBM. The only way to serve 8 users on one H100 is to move cold KV blocks off the GPU. That is what NVMe KV cache offloading does.

This post covers the three-tier storage hierarchy (GPU HBM, CPU DRAM, NVMe SSD), NVIDIA's ICMSP hardware architecture announced at CES 2026, and a step-by-step deployment of LMCache with a disk backend on vLLM. If you have not yet read the foundational in-GPU techniques (PagedAttention, FP8/NVFP4, CPU swap space), start with the KV Cache Optimization Guide first.

The GPU Memory Wall: KV Cache as the Inference Bottleneck

KV cache size grows with the number of layers (L), attention heads (H_kv), head dimension (D), sequence length (S), and batch size (B). For the derivation of that formula and the in-GPU optimizations (FP8 quantization, PagedAttention, CPU swap), see the KV Cache Optimization Guide. This post picks up where that one leaves off: the NVMe tier that extends capacity beyond what CPU DRAM can hold.
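As a quick sanity check on the ~40 GB figure above, here is that size formula in code. The model dimensions (80 layers, 8 KV heads under GQA, head dim 128) are taken from the published Llama 3.1 70B config; verify them against the model card you deploy.

```python
# Sketch: KV cache size from the formula in the text.
# Dimensions assumed from the Llama 3.1 70B config (80 layers, 8 KV heads, head dim 128).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    # 2x for the separate K and V tensors stored at every layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# One user, 128K context, BF16 (2 bytes per element)
size = kv_cache_bytes(80, 8, 128, 131_072, 1, 2)
print(f"{size / 2**30:.0f} GiB")  # 40 GiB, matching the figure above
```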

The 2026 inflection point is context length. 128K-token and 1M-token windows are production workloads now, not research benchmarks. At those lengths, in-GPU KV storage for multiple concurrent users is not physically possible on a single GPU.

| GPU | HBM | KV cache (128K, BF16) per user | Users at 100% fill |
|---|---|---|---|
| H100 SXM5 | 80 GB | ~40 GB | ~1-2 after weight compression |
| B200 SXM6 | 192 GB | ~40 GB | ~3 (FP8 weights ~70 GB + 3 × 40 GB BF16 KV = 190 GB) |

Note: these figures use BF16 KV (the uncompressed baseline) to show the worst case. With FP8 KV quantization (~20 GB per user at 128K), B200 fits ~6 concurrent users at no offload (see the detailed table in the Cost Analysis section below).

PagedAttention, FP8/NVFP4 KV quantization, and --swap-space (covered in the prerequisite post) extend capacity within the GPU and CPU DRAM. NVMe offloading extends it further to the SSD tier.

KV Cache Storage Tiers Explained

The three-tier hierarchy assigns KV blocks to storage based on how recently they were accessed and how likely they are to be needed again:

| Tier | Storage | Bandwidth | Latency per block | Cost/TB |
|---|---|---|---|---|
| Hot | GPU HBM | ~3.35 TB/s (H100 SXM5) | < 1 µs | highest |
| Warm | CPU DRAM | ~63 GB/s (PCIe 5.0 x16) | ~10-100 µs | mid |
| Cold | NVMe SSD | ~7 GB/s (PCIe 4.0 NVMe) | ~100 µs - 1 ms | lowest |

The offloading heuristic is straightforward: hot KV blocks (recently accessed, active generation) stay on GPU. Warm blocks (completed but likely needed for multi-turn conversations) go to CPU DRAM. Cold blocks (historical context, prefix cache for future requests) go to NVMe.
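The heuristic can be sketched as a small placement function. The tier names and the 60-second warm window below are illustrative assumptions, not LMCache's actual eviction policy:

```python
# Illustrative sketch of the hot/warm/cold placement heuristic described above.
# The 60s warm window is an assumption, not a real system's threshold.
import time

def place_block(last_access_ts, is_active_generation, now=None):
    """Return the storage tier for a KV block."""
    now = now if now is not None else time.monotonic()
    if is_active_generation:
        return "GPU_HBM"   # hot: needed by the current forward pass
    if now - last_access_ts < 60:
        return "CPU_DRAM"  # warm: recent multi-turn context
    return "NVME"          # cold: historical context / prefix cache

print(place_block(time.monotonic(), True))  # GPU_HBM
```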

NVMe offloading is a net win only when GPU KV cache is the actual bottleneck. For short-context single-user workloads, it adds I/O latency for no benefit. The "When to Use" section at the end covers this in detail.

NVIDIA ICMSP: Hardware Architecture for Tiered KV Cache

ICMSP (Inference Context Memory Storage Platform) was announced at CES 2026. The architecture uses BlueField-4 DPUs to manage KV cache movement off the GPU compute path. The GPU's streaming multiprocessors stay focused on the forward pass instead of stalling on PCIe I/O.

NIXL (NVIDIA Inference Xfer Library) is the KV block transport protocol within ICMSP. It supports:

  • NVLink for single-node GPU-to-GPU KV transfer
  • InfiniBand RDMA for multi-node disaggregated serving
  • PCIe for GPU-to-CPU-to-NVMe local tier movement
  • TCP fallback for environments without high-speed interconnect

VAST Data reported roughly 10x faster prefill times using ICMSP with VAST DataStore (all-NVMe storage), enabling proportionally higher concurrency. The architecture this post deploys in software is the same pattern: GPU HBM, CPU DRAM, and NVMe as ordered tiers with a protocol layer managing block movement between them.

Dynamo 1.0 uses NIXL as its KV transport layer between prefill and decode workers. LMCache and Dynamo share the same NIXL primitives for moving KV blocks. For a full walkthrough of Dynamo's disaggregated serving approach, see the NVIDIA Dynamo Disaggregated Inference Guide.

One important caveat: ICMSP hardware acceleration requires BlueField-4 DPUs in the server. Spheron bare-metal instances do not include BlueField-4 DPUs today. LMCache implements the same tiered storage pattern in software via PCIe, and NIXL-accelerated paths become available as BlueField-4 infrastructure rolls out across GPU cloud providers.

LMCache: NVMe KV Offloading Available Today

LMCache is a KV cache engine that adds persistent storage backends to vLLM's in-process prefix cache. It supports vLLM, SGLang, and NVIDIA Dynamo as inference engines. If you have not yet set up vLLM in production, see the vLLM production deployment guide first.

What it does differently from vLLM's built-in --swap-space:

| Feature | vLLM --swap-space | LMCache disk backend |
|---|---|---|
| Storage tier | CPU RAM only | NVMe, Redis, remote storage |
| Survives server restart | No | Yes |
| Shared across replicas | No | Yes (with shared storage) |
| Block granularity | vLLM internal | Configurable chunk_size |

The practical difference matters most for multi-turn chatbots and RAG workloads. With --swap-space, every server restart clears the warm cache and the next user request recomputes the full prefix. With LMCache, the blocks persist on disk, so a server restart does not wipe the cache.
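Persistence works because KV chunks are keyed by a hash of the token prefix, so identical prefixes resolve to the same on-disk blocks across restarts. A minimal sketch of that keying pattern follows; the hash scheme is illustrative, and LMCache's real on-disk format differs in detail:

```python
# Sketch of prefix-chunk keying: each chunk's key covers the full prefix up
# to that chunk, so shared prefixes map to the same disk blocks.
# Illustrative only -- not LMCache's actual key format.
import hashlib

def chunk_keys(token_ids, chunk_size=256):
    keys = []
    h = hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % chunk_size  # drop partial tail chunk
    for start in range(0, usable, chunk_size):
        h.update(bytes(str(token_ids[start:start + chunk_size]), "utf-8"))
        keys.append(h.copy().hexdigest())  # running prefix hash -> stable key
    return keys

a = chunk_keys(list(range(512)))
b = chunk_keys(list(range(512)) + [999] * 256)
print(a == b[:2])  # shared 512-token prefix -> same first two keys: True
```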

Production adoption: Google Cloud uses LMCache in GKE Inference, and CoreWeave and Cohere have benchmarked LMCache on CoreWeave infrastructure. VAST Data benchmarks document TTFT dropping from ~11s to ~1.5s for a 128K-token system prompt on H100, because prefill computation is replaced by a cache hit from disk.

Version pinning note: LMCache's configuration format and compatible vLLM versions change with releases. Check docs.lmcache.ai for current version compatibility before deploying.

Deploy LMCache with vLLM on a Spheron H100 (Step-by-Step)

Provision Your Instance

Go to app.spheron.ai and select an H100 SXM5 80GB or H100 PCIe 80GB bare-metal instance. Current on-demand pricing: H100 PCIe from $2.63/hr (as of 01 Apr 2026). See H100 GPU rental for current availability and pricing.

Bare-metal instances on Spheron include NVMe SSDs at no additional cost; on managed cloud instances, NVMe block storage is typically billed separately per GB/month.

Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Verify NVMe Storage

SSH into your instance and confirm the NVMe device is accessible. The device path and mount point vary across instances, so check before hardcoding any path in your config:

```bash
lsblk
nvme list
df -h
```

Expected output will show an NVMe device (typically /dev/nvme0n1) and its mount point. Create the cache directory:

```bash
# Adjust the mount path based on your lsblk output
mkdir -p /nvme/lmcache
df -h /nvme
```

Install Dependencies

```bash
pip install vllm lmcache

# Verify compatible versions
python -c "import vllm; print(vllm.__version__)"
python -c "import lmcache; print(lmcache.__version__)"
```

Check docs.lmcache.ai for which LMCache version is compatible with your vLLM version before proceeding. The config format has changed across releases.

Configure LMCache with Disk Backend

Create lmcache_config.yaml:

```yaml
chunk_size: 256
local_cpu: false
max_local_cpu_size: 0
local_disk: true
local_disk_path: /nvme/lmcache
max_local_disk_size: 200  # GB - set to ~80% of available NVMe space
remote_url: null
```

Setting local_cpu: false routes cold blocks directly to NVMe, skipping CPU DRAM. If you want a warm tier in CPU DRAM between GPU and NVMe, set local_cpu: true and configure max_local_cpu_size accordingly.

Check available NVMe space before setting max_local_disk_size:

```bash
df -h /nvme
```
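If you prefer to derive the ~80% figure programmatically rather than eyeballing df output, a short sketch (the /nvme mount point is an assumption carried over from the earlier steps; use whatever your lsblk showed):

```python
# Sketch: suggest max_local_disk_size as ~80% of free space on the NVMe mount.
# Mount point "/nvme" is an assumption from the steps above.
import shutil

def suggested_cache_gb(mount_point="/nvme", fraction=0.8):
    free_bytes = shutil.disk_usage(mount_point).free
    return int(free_bytes * fraction / 10**9)  # LMCache sizes are specified in GB

print(suggested_cache_gb("/"))  # prints the suggested GB value for "/"
```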

Launch vLLM with LMCache

```bash
# --max-model-len 32768: conservative default for a single H100; see note below
LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml \
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --swap-space 16 \
  --port 8000
```

Context length note: This command uses --max-model-len 32768 (32K) as a conservative default that fits comfortably on a single H100 80 GB with FP8 weights and leaves room for multiple concurrent users. The rest of this post is motivated by 128K context workloads. To serve 128K context, set --max-model-len 131072 instead. Before doing so, verify you have enough GPU HBM + NVMe capacity: at 128K, each user's FP8 KV cache is ~20 GB, so a single H100 can hold at most one user in GPU memory with no offload. LMCache's NVMe backend is what makes 128K serving practical at any real concurrency.

--enable-prefix-caching activates vLLM's in-process prefix cache. LMCache extends this with the NVMe tier. --swap-space 16 adds a CPU RAM buffer for burst handling. --kv-cache-dtype fp8 halves KV memory on GPU, leaving more room for hot blocks before any eviction happens.

Validate Offloading

Send a request with a long shared prefix, then send the same prefix again:

```bash
# Check the cache directory is growing
du -sh /nvme/lmcache

# Monitor KV cache metrics
curl http://localhost:8000/metrics | grep kv_cache
```

TTFT on the second request (same prefix) should be substantially lower than the first. If du shows no growth after the first request, check that LMCACHE_CONFIG_FILE is set correctly and that the local_disk_path directory is writable.
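To put numbers on "substantially lower", a minimal stdlib-only client can time TTFT for the same prefix twice. The endpoint URL and model name below are assumptions taken from the launch command above:

```python
# Sketch: time TTFT against the OpenAI-compatible vLLM endpoint launched above.
# URL and model name are assumptions from the launch command in this post.
import json
import time
import urllib.request

def build_payload(prompt, model="meta-llama/Llama-3.1-70B-Instruct"):
    return {"model": model, "prompt": prompt, "max_tokens": 16, "stream": True}

def time_to_first_token(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.readline()  # first streamed chunk ~= first token
    return time.monotonic() - start

# Against the running server (not executed here):
#   url = "http://localhost:8000/v1/completions"
#   prefix = "shared system prompt " * 2000
#   cold = time_to_first_token(url, build_payload(prefix))
#   warm = time_to_first_token(url, build_payload(prefix))
#   warm should come in well below cold if disk cache hits are working
```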

NVFP4 + NVMe Offloading on Blackwell GPUs

On B200 instances, combine --kv-cache-dtype nvfp4 with LMCache's NVMe backend for higher concurrency. NVFP4 stores KV tensors in 4-bit format, which is 75% smaller than BF16 and 50% smaller than FP8. This means the GPU retains more hot blocks before any eviction to NVMe.

Do not use --kv-cache-dtype nvfp4 on H100 or A100. NVFP4 KV cache requires Blackwell hardware acceleration. On H100, use --kv-cache-dtype fp8.

Memory estimates for Llama 3.1 70B at 128K context on a B200 192 GB:

| Configuration | Hot GPU KV | Cold NVMe KV | Estimated concurrent users |
|---|---|---|---|
| BF16, no offload | 192 GB cap | 0 | ~1 (weights ~140 GB, ~40 GB BF16 KV each; 140 + 2 × 40 = 220 GB exceeds 192 GB) |
| FP8, no offload | 192 GB cap | 0 | ~6 (weights ~70 GB, ~20 GB FP8 KV each; 70 + 6 × 20 = 190 GB) |
| FP8, NVMe offload | ~40 GB hot | 200 GB cold | ~12-15 (assumes high NVMe prefix cache hit rate; unique-prefix workloads will see lower concurrency and higher latency) |
| NVFP4, NVMe offload | ~40 GB hot | 200 GB cold | ~20-25 (estimated) |

These are estimates. Actual concurrency depends on working set size relative to GPU HBM: if most requests share prefixes that fit in the hot tier, you get the high end of the range. If every request has a unique 128K prefix, cold NVMe accesses dominate and the latency penalty compounds. See FP4 Quantization on Blackwell GPUs for the quality analysis behind NVFP4.
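The no-offload rows reduce to simple arithmetic; this sketch reproduces them using the weight and KV sizes (in GB) quoted above:

```python
# Reproduce the no-offload concurrency estimates from the table above.
# Weight and per-user KV sizes (GB) are the estimates quoted in this post.
def max_users(hbm_gb, weights_gb, kv_per_user_gb):
    return max(int((hbm_gb - weights_gb) // kv_per_user_gb), 0)

KV_128K_GB = {"bf16": 40, "fp8": 20, "nvfp4": 10}  # per-user KV at 128K context

print(max_users(192, 140, KV_128K_GB["bf16"]))  # BF16 weights + KV on B200: 1
print(max_users(192, 70, KV_128K_GB["fp8"]))    # FP8 weights + KV on B200: 6
```

The offload rows do not reduce to this formula: there, concurrency depends on how much of the working set stays hot, which is why the table gives ranges instead of a single number.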

Check GPU pricing for current B200 availability.

Benchmarks: Throughput and Latency Across Storage Tiers

These are illustrative figures drawn from published LMCache research and NVIDIA documentation, not independently measured on the exact configuration below. Actual numbers vary by model, batch size, and request pattern.

Configuration: Llama 3.1 70B, 8 concurrent users, 32K context, H100 PCIe:

| KV Storage Tier | Throughput (tok/s) | TTFT (warm prefix) | TTFT (cold) | Notes |
|---|---|---|---|---|
| GPU HBM only | ~900 | ~0.8s | ~11s | vLLM default, no prefix cache |
| + CPU swap (--swap-space 32) | ~950 | ~0.9s | ~11s | burst tolerance, no persistence |
| + LMCache disk (cold prefix) | ~1,100 | ~1.5s | ~1.5s | NVMe cache hits skip prefill |
| + FP8 KV + LMCache disk | ~1,800 | ~1.4s | ~1.4s | fewer evictions, more hot blocks |

"Warm prefix" TTFT means the shared prefix was previously cached (from a prior request in this session or loaded from NVMe). "Cold" TTFT means the first request with no cached prefix available.

The throughput improvement from LMCache comes from avoiding prefill recomputation on cache hits, not from NVMe I/O being fast. NVMe at 7 GB/s is slow relative to GPU HBM; the gain is that loading precomputed KV blocks from disk is still much faster than running the full attention computation from scratch for a 128K-token prompt.
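A back-of-envelope check of that claim, using the bandwidth and TTFT figures quoted in this post:

```python
# Back-of-envelope: loading precomputed KV from NVMe vs recomputing prefill.
# Bandwidth (7 GB/s) and cold prefill TTFT (~11s) are the figures quoted above.
def nvme_load_seconds(kv_gb, nvme_gbps=7.0):
    return kv_gb / nvme_gbps

fp8_128k_kv_gb = 20          # per-user FP8 KV at 128K context
load = nvme_load_seconds(fp8_128k_kv_gb)
prefill = 11.0               # cold 128K prefill TTFT from the benchmark table
print(f"load ~{load:.1f}s vs prefill ~{prefill:.0f}s")  # load ~2.9s vs prefill ~11s
```

The disk hit wins even at raw PCIe 4.0 NVMe speed; observed warm TTFT is lower still because the first tokens can stream before the entire cache finishes loading.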

When to Use NVMe KV Cache Offloading

Use it when:

  • Context lengths are 32K tokens or more per request
  • You are running multi-turn chatbots where conversation history grows across sessions
  • RAG pipelines share document prefixes across many users (high prefix cache hit rate)
  • Batch inference has many requests sharing a preamble (system prompt, instructions)
  • GPU KV cache fills and you are seeing OOM errors, request queuing, or high TTFT on repeated prefixes

Skip it when:

  • Requests are under 4K tokens (KV cache per request is small, NVMe tier adds overhead for minimal gain)
  • Every request has a unique prefix (no cache hits, only latency penalty)
  • Your p99 SLA is under 100ms (NVMe access latency is hundreds of microseconds per miss)
  • You already have GPU KV cache headroom to spare (NVMe tier only helps when GPU is the bottleneck)
  • You are using speculative decoding (which relies on low-latency draft token generation that conflicts with cold NVMe access patterns; see the speculative decoding production guide for details)
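The two checklists condense into a rough decision helper. The thresholds mirror the bullets above; treat them as rules of thumb, not hard limits:

```python
# Sketch encoding the use/skip checklist above. Thresholds are rules of thumb
# taken from the bullets, not hard limits.
def should_offload(avg_context_tokens, prefix_hit_rate, p99_slo_ms, gpu_kv_headroom):
    if avg_context_tokens < 4_096:   # small KV per request: overhead for no gain
        return False
    if prefix_hit_rate < 0.1:        # unique prefixes: only the latency penalty
        return False
    if p99_slo_ms < 100:             # NVMe misses blow a sub-100ms SLO
        return False
    if gpu_kv_headroom:              # GPU KV cache is not the bottleneck
        return False
    return True

print(should_offload(131_072, 0.8, 2_000, False))  # 128K RAG workload: True
```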

If you are still deciding on the broader LLM deployment stack before adding KV cache offloading, the LLM deployment guide covers infrastructure setup from scratch.

Cost Analysis: Effective Cost Per Concurrent User

The GPU cost per hour does not change with NVMe offloading. What changes is how many concurrent users you serve from that GPU. More users per GPU means lower effective cost per user.
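The arithmetic behind "effective cost per user" is just division; for the FP8 + LMCache configuration in the table below, assuming ~9 concurrent users at the quoted rate:

```python
# Effective cost per user: GPU hourly rate divided by concurrent users served.
# $2.63/hr and ~9 users are the figures quoted in this post.
def cost_per_user(gpu_hourly_usd, concurrent_users):
    return gpu_hourly_usd / concurrent_users

print(f"${cost_per_user(2.63, 1):.2f}/user/hr")  # no offload: $2.63/user/hr
print(f"${cost_per_user(2.63, 9):.2f}/user/hr")  # FP8 + LMCache disk: $0.29/user/hr
```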

H100 PCIe on Spheron (pricing as of 01 Apr 2026), Llama 3.1 70B, 128K context:

| Configuration | GPU $/hr | Max concurrent users | Effective $/hr per user |
|---|---|---|---|
| No optimization (BF16 KV) | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + PagedAttention | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + CPU swap + LMCache disk | $2.63 | ~8-10 | ~$0.29/user/hr |
| NVFP4 KV + LMCache disk (B200) | check pricing | ~20+ | varies |

Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Spheron bare-metal instances include NVMe SSDs at no additional cost. On AWS, GCP, and Azure, NVMe block storage is typically billed separately per GB/month. The NVMe tier on Spheron bare metal is part of the instance, so the cost-per-user figures above do not carry an NVMe storage surcharge.


NVMe KV cache offloading converts a $2.63/hr H100 PCIe GPU into roughly $0.29/hr per concurrent user at 128K context, without adding hardware. Spheron's bare-metal H100 and B200 instances include NVMe storage, so this three-tier stack works out of the box.

Rent H100 → | View all GPU pricing →

Get started on Spheron →
