On an H100 serving Llama 3.1 70B at 128K context, the KV cache for a single user takes ~40 GB. Eight concurrent users need ~320 GB total, which is 4x the H100's entire HBM. The only way to serve 8 users on one H100 is to move cold KV blocks off the GPU. That is what NVMe KV cache offloading does.
This post covers the three-tier storage hierarchy (GPU HBM, CPU DRAM, NVMe SSD), NVIDIA's ICMSP hardware architecture announced at CES 2026, and a step-by-step deployment of LMCache with a disk backend on vLLM. If you are not yet familiar with the foundational in-GPU techniques (PagedAttention, FP8/NVFP4, CPU swap space), start with the KV Cache Optimization Guide first.
The GPU Memory Wall: KV Cache as the Inference Bottleneck
KV cache size grows with the number of layers (L), attention heads (H_kv), head dimension (D), sequence length (S), and batch size (B). For the derivation of that formula and the in-GPU optimizations (FP8 quantization, PagedAttention, CPU swap), see the KV Cache Optimization Guide. This post picks up where that one leaves off: the NVMe tier that extends capacity beyond what CPU DRAM can hold.
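The ~40 GB figure from the intro can be sanity-checked by plugging Llama 3.1 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128) into that formula:

```python
# KV cache bytes = 2 (K and V) * layers * KV heads * head dim * seq len * bytes/elem
# Llama 3.1 70B architecture: 80 layers, 8 KV heads (GQA), head dim 128
layers, kv_heads, head_dim = 80, 8, 128
seq_len = 131_072          # 128K context
bf16_bytes, fp8_bytes = 2, 1

kv_bf16 = 2 * layers * kv_heads * head_dim * seq_len * bf16_bytes
kv_fp8 = 2 * layers * kv_heads * head_dim * seq_len * fp8_bytes
print(f"BF16 KV per user at 128K: {kv_bf16 / 2**30:.0f} GiB")  # 40 GiB
print(f"FP8  KV per user at 128K: {kv_fp8 / 2**30:.0f} GiB")   # 20 GiB
```

Multiply the BF16 figure by eight users and you get the ~320 GB total from the intro, which is what forces the tiered approach.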
The 2026 inflection point is context length. 128K-token and 1M-token windows are production workloads now, not research benchmarks. At those lengths, in-GPU KV storage for multiple concurrent users is not physically possible on a single GPU.
| GPU | HBM | KV cache (128K, BF16) per user | Users at 100% fill |
|---|---|---|---|
| H100 SXM5 | 80 GB | ~40 GB | ~1-2 after weight compression |
| B200 SXM6 | 192 GB | ~40 GB | ~3 (FP8 weights ~70 GB + 3 × 40 GB BF16 KV = 190 GB) |
Note: these figures use BF16 KV (the uncompressed baseline) to show the worst case. With FP8 KV quantization (~20 GB per user at 128K), B200 fits ~6 concurrent users at no offload (see the detailed table in the Cost Analysis section below).
PagedAttention, FP8/NVFP4 KV quantization, and --swap-space (covered in the prerequisite post) extend capacity within the GPU and CPU DRAM. NVMe offloading extends it further to the SSD tier.
KV Cache Storage Tiers Explained
The three-tier hierarchy assigns KV blocks to storage based on how recently they were accessed and how likely they are to be needed again:
| Tier | Storage | Bandwidth | Latency per block | Cost/TB |
|---|---|---|---|---|
| Hot | GPU HBM | ~3.35 TB/s (H100 SXM5) | <1 µs | highest |
| Warm | CPU DRAM | ~63 GB/s (PCIe 5.0 x16) | ~10-100 µs | mid |
| Cold | NVMe SSD | ~7 GB/s (PCIe 4.0 NVMe) | ~100 µs - 1 ms | lowest |
The offloading heuristic is straightforward: hot KV blocks (recently accessed, active generation) stay on GPU. Warm blocks (completed but likely needed for multi-turn conversations) go to CPU DRAM. Cold blocks (historical context, prefix cache for future requests) go to NVMe.
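The heuristic above can be sketched as a small policy function. This is illustrative only: the thresholds are made up, and it is not LMCache's actual eviction code.

```python
def assign_tier(idle_seconds: float, likely_reuse: bool) -> str:
    """Toy tiering policy: route a KV block by access recency and reuse odds."""
    if idle_seconds < 1.0:                      # active generation -> hot
        return "gpu_hbm"
    if likely_reuse and idle_seconds < 300.0:   # recent multi-turn context -> warm
        return "cpu_dram"
    return "nvme_ssd"                           # historical / prefix cache -> cold

print(assign_tier(0.2, True))     # gpu_hbm
print(assign_tier(45.0, True))    # cpu_dram
print(assign_tier(3600.0, True))  # nvme_ssd
```

Real implementations add block pinning, batched eviction, and hit-rate feedback, but the tier boundaries follow this same recency-plus-reuse shape.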
NVMe offloading is a net win only when GPU KV cache is the actual bottleneck. For short-context single-user workloads, it adds I/O latency for no benefit. The "When to Use" section at the end covers this in detail.
NVIDIA ICMSP: Hardware Architecture for Tiered KV Cache
ICMSP (Inference Context Memory Storage Platform) was announced at CES 2026. The architecture uses BlueField-4 DPUs to manage KV cache movement off the GPU compute path. The GPU's streaming multiprocessors stay focused on the forward pass instead of stalling on PCIe I/O.
NIXL (NVIDIA Inference Xfer Library) is the KV block transport protocol within ICMSP. It supports:
- NVLink for single-node GPU-to-GPU KV transfer
- InfiniBand RDMA for multi-node disaggregated serving
- PCIe for GPU-to-CPU-to-NVMe local tier movement
- TCP fallback for environments without high-speed interconnect
VAST Data reported roughly 10x faster prefill times using ICMSP with VAST DataStore (all-NVMe storage), enabling proportionally higher concurrency. The architecture this post deploys in software is the same pattern: GPU HBM, CPU DRAM, and NVMe as ordered tiers with a protocol layer managing block movement between them.
Dynamo 1.0 uses NIXL as its KV transport layer between prefill and decode workers. LMCache and Dynamo share the same NIXL primitives for moving KV blocks. For a full walkthrough of Dynamo's disaggregated serving approach, see the NVIDIA Dynamo Disaggregated Inference Guide.
One important caveat: ICMSP hardware acceleration requires BlueField-4 DPUs in the server. Spheron bare-metal instances do not include BlueField-4 DPUs today. LMCache implements the same tiered storage pattern in software via PCIe, and NIXL-accelerated paths become available as BlueField-4 infrastructure rolls out across GPU cloud providers.
LMCache: NVMe KV Offloading Available Today
LMCache is a KV cache engine that adds persistent storage backends to vLLM's in-process prefix cache. It supports vLLM, SGLang, and NVIDIA Dynamo as inference engines. If you have not yet set up vLLM in production, see the vLLM production deployment guide first.
What it does differently from vLLM's built-in --swap-space:
| Feature | vLLM --swap-space | LMCache disk backend |
|---|---|---|
| Storage tier | CPU RAM only | NVMe, Redis, remote storage |
| Survives server restart | No | Yes |
| Shared across replicas | No | Yes (with shared storage) |
| Block granularity | vLLM internal | Configurable chunk_size |
The practical difference matters most for multi-turn chatbots and RAG workloads. With --swap-space, every server restart clears the warm cache and the next user request recomputes the full prefix. With LMCache, the blocks persist on disk, so a server restart does not wipe the cache.
Production adoption: Google Cloud uses LMCache in GKE Inference, and CoreWeave and Cohere have benchmarked LMCache on CoreWeave infrastructure. VAST Data benchmarks document TTFT dropping from ~11s to ~1.5s for a 128K-token system prompt on H100, because prefill computation is replaced by a cache hit from disk.
Version pinning note: LMCache's configuration format and compatible vLLM versions change with releases. Check docs.lmcache.ai for current version compatibility before deploying.
Deploy LMCache with vLLM on a Spheron H100 (Step-by-Step)
Provision Your Instance
Go to app.spheron.ai and select an H100 SXM5 80GB or H100 PCIe 80GB bare-metal instance. Current on-demand pricing: H100 PCIe from $2.63/hr (as of 01 Apr 2026). See H100 GPU rental for current availability and pricing.
Bare-metal instances on Spheron include NVMe SSDs at no additional cost; on managed cloud instances, NVMe block storage is typically billed separately per GB/month.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Verify NVMe Storage
SSH into your instance and confirm the NVMe device is accessible. The device path and mount point vary across instances, so check before hardcoding any path in your config:
```shell
lsblk
nvme list
df -h
```

Expected output will show an NVMe device (typically /dev/nvme0n1) and its mount point. Create the cache directory:
```shell
# Adjust the mount path based on your lsblk output
mkdir -p /nvme/lmcache
df -h /nvme
```

Install Dependencies
```shell
pip install vllm lmcache
# Verify compatible versions
python -c "import vllm; print(vllm.__version__)"
python -c "import lmcache; print(lmcache.__version__)"
```

Check docs.lmcache.ai for which LMCache version is compatible with your vLLM version before proceeding. The config format has changed across releases.
Configure LMCache with Disk Backend
Create lmcache_config.yaml:
```yaml
chunk_size: 256
local_cpu: false
max_local_cpu_size: 0
local_disk: true
local_disk_path: /nvme/lmcache
max_local_disk_size: 200  # GB - set to ~80% of available NVMe space
remote_url: null
```

Setting local_cpu: false routes cold blocks directly to NVMe, skipping CPU DRAM. If you want a warm tier in CPU DRAM between GPU and NVMe, set local_cpu: true and configure max_local_cpu_size accordingly.
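With the warm tier enabled, the config might look like the following sketch. The sizes are illustrative, and field names should be verified against docs.lmcache.ai for your LMCache version:

```yaml
# Variant with a CPU DRAM warm tier between GPU and NVMe (sizes illustrative)
chunk_size: 256
local_cpu: true
max_local_cpu_size: 32   # GB of CPU DRAM reserved for warm KV blocks
local_disk: true
local_disk_path: /nvme/lmcache
max_local_disk_size: 200
remote_url: null
```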
Check available NVMe space before setting max_local_disk_size:
```shell
df -h /nvme
```

Launch vLLM with LMCache
```shell
# --max-model-len 32768: conservative default for a single H100; see note below
LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml \
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --swap-space 16 \
  --port 8000
```

Context length note: This command uses --max-model-len 32768 (32K) as a conservative default that fits comfortably on a single H100 80 GB with FP8 weights and leaves room for multiple concurrent users. The rest of this post is motivated by 128K context workloads. To serve 128K context, set --max-model-len 131072 instead. Before doing so, verify you have enough GPU HBM + NVMe capacity: at 128K, each user's FP8 KV cache is ~20 GB, so a single H100 can hold at most one user in GPU memory with no offload. LMCache's NVMe backend is what makes 128K serving practical at any real concurrency.
--enable-prefix-caching activates vLLM's in-process prefix cache. LMCache extends this with the NVMe tier. --swap-space 16 adds a CPU RAM buffer for burst handling. --kv-cache-dtype fp8 halves KV memory on GPU, leaving more room for hot blocks before any eviction happens.
Validate Offloading
Send a request with a long shared prefix, then send the same prefix again:
```shell
# Check the cache directory is growing
du -sh /nvme/lmcache
# Monitor KV cache metrics
curl http://localhost:8000/metrics | grep kv_cache
```

TTFT on the second request (same prefix) should be substantially lower than the first. If du shows no growth after the first request, check that LMCACHE_CONFIG_FILE is set correctly and that the local_disk_path directory is writable.
NVFP4 + NVMe Offloading on Blackwell GPUs
On B200 instances, combine --kv-cache-dtype nvfp4 with LMCache's NVMe backend for higher concurrency. NVFP4 stores KV tensors in 4-bit format, which is 75% smaller than BF16 and 50% smaller than FP8. This means the GPU retains more hot blocks before any eviction to NVMe.
Do not use --kv-cache-dtype nvfp4 on H100 or A100. NVFP4 KV cache requires Blackwell hardware acceleration. On H100, use --kv-cache-dtype fp8.
Memory estimates for Llama 3.1 70B at 128K context on a B200 192 GB:
| Configuration | Hot GPU KV | Cold NVMe KV | Estimated concurrent users |
|---|---|---|---|
| BF16, no offload | 192 GB cap | 0 | ~1 (weights ~140 GB, ~40 GB BF16 KV each; 140 + 2 × 40 = 220 GB exceeds 192 GB) |
| FP8, no offload | 192 GB cap | 0 | ~6 (weights ~70 GB, ~20 GB FP8 KV each; 70 + 6 × 20 = 190 GB) |
| FP8, NVMe offload | ~40 GB hot | 200 GB cold | ~12-15 (assumes high NVMe prefix cache hit rate; unique-prefix workloads will see lower concurrency and higher latency) |
| NVFP4, NVMe offload | ~40 GB hot | 200 GB cold | ~20-25 (estimated) |
These are estimates. Actual concurrency depends on working set size relative to GPU HBM: if most requests share prefixes that fit in the hot tier, you get the high end of the range. If every request has a unique 128K prefix, cold NVMe accesses dominate and the latency penalty compounds. See FP4 Quantization on Blackwell GPUs for the quality analysis behind NVFP4.
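The no-offload rows in the table reduce to a one-line ceiling: whatever HBM is left after weights, divided by KV per user. A quick check of the FP8 row (figures taken from the table; real serving needs headroom for activations, so treat this as an upper bound):

```python
# No-offload concurrency ceiling: floor((HBM - weights) / KV per user)
hbm_gb = 192            # B200
weights_fp8_gb = 70     # Llama 3.1 70B, FP8 weights
kv_fp8_gb = 20          # per user at 128K context, FP8 KV

max_users = (hbm_gb - weights_fp8_gb) // kv_fp8_gb
print(max_users)  # 6
```

The offload rows break this ceiling because cold users' KV lives on NVMe, so only the hot working set has to fit in the 192 GB.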
Check GPU pricing for current B200 availability.
Benchmarks: Throughput and Latency Across Storage Tiers
These are illustrative figures drawn from published LMCache research and NVIDIA documentation, not independently measured on the exact configuration below. Actual numbers vary by model, batch size, and request pattern.
Configuration: Llama 3.1 70B, 8 concurrent users, 32K context, H100 PCIe:
| KV Storage Tier | Throughput (tok/s) | TTFT (warm prefix) | TTFT (cold) | Notes |
|---|---|---|---|---|
| GPU HBM only | ~900 | ~0.8s | ~11s | vLLM default, no prefix cache |
| + CPU swap (--swap-space 32) | ~950 | ~0.9s | ~11s | burst tolerance, no persistence |
| + LMCache disk (cold prefix) | ~1,100 | ~1.5s | ~1.5s | NVMe cache hits skip prefill |
| + FP8 KV + LMCache disk | ~1,800 | ~1.4s | ~1.4s | fewer evictions, more hot blocks |
"Warm prefix" TTFT means the shared prefix was previously cached (from a prior request in this session or loaded from NVMe). "Cold" TTFT means the first request with no cached prefix available.
The throughput improvement from LMCache comes from avoiding prefill recomputation on cache hits, not from NVMe I/O being fast. NVMe at 7 GB/s is slow relative to GPU HBM; the gain is that loading precomputed KV blocks from disk is still much faster than running the full attention computation from scratch for a 128K-token prompt. This is the same effect behind the VAST Data figures cited earlier: ~11s of cold prefill drops to ~1.5s when the computation is replaced by a cache hit from disk.
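A back-of-envelope comparison makes the point, using the figures from the tables above. The sequential-read bound is pessimistic; deployed systems overlap I/O with compute, which is how sub-2s warm TTFT is reached:

```python
# Lower bound for reloading one user's KV from NVMe versus recomputing prefill.
kv_fp8_gb = 20          # FP8 KV per user at 128K context
nvme_gb_per_s = 7       # PCIe 4.0 NVMe sequential read
cold_prefill_s = 11     # cold TTFT for a 128K prompt, from the benchmark table

reload_s = kv_fp8_gb / nvme_gb_per_s
print(f"reload ~{reload_s:.1f}s vs recompute ~{cold_prefill_s}s")  # ~2.9s vs ~11s
```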
When to Use NVMe KV Cache Offloading
Use it when:
- Context lengths are 32K tokens or more per request
- You are running multi-turn chatbots where conversation history grows across sessions
- RAG pipelines share document prefixes across many users (high prefix cache hit rate)
- Batch inference has many requests sharing a preamble (system prompt, instructions)
- GPU KV cache fills and you are seeing OOM errors, request queuing, or high TTFT on repeated prefixes
Skip it when:
- Requests are under 4K tokens (KV cache per request is small, NVMe tier adds overhead for minimal gain)
- Every request has a unique prefix (no cache hits, only latency penalty)
- Your p99 SLA is under 100ms (NVMe access latency is hundreds of microseconds per miss)
- You already have GPU KV cache headroom to spare (NVMe tier only helps when GPU is the bottleneck)
- You are using speculative decoding (which relies on low-latency draft token generation that conflicts with cold NVMe access patterns; see the speculative decoding production guide for details)
If you are still deciding on the broader LLM deployment stack before adding KV cache offloading, the LLM deployment guide covers infrastructure setup from scratch.
Cost Analysis: Effective Cost Per Concurrent User
The GPU cost per hour does not change with NVMe offloading. What changes is how many concurrent users you serve from that GPU. More users per GPU means lower effective cost per user.
Pricing assumptions: H100 PCIe on Spheron (on-demand, as of 01 Apr 2026), Llama 3.1 70B, 128K context:
| Configuration | GPU $/hr | Max concurrent users | Effective $/hr per user |
|---|---|---|---|
| No optimization (BF16 KV) | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + PagedAttention | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + CPU swap + LMCache disk | $2.63 | ~8-10 | ~$0.29/user/hr |
| NVFP4 KV + LMCache disk (B200) | check pricing | ~20+ | varies |
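The per-user figures in the table are simple division. A check of the FP8 + LMCache row, taking the midpoint of the ~8-10 user estimate:

```python
# Effective cost per user = GPU $/hr divided by concurrent users served
gpu_usd_per_hr = 2.63   # H100 PCIe on-demand rate used in the table
concurrent_users = 9    # midpoint of the ~8-10 estimate

print(f"~${gpu_usd_per_hr / concurrent_users:.2f}/user/hr")  # ~$0.29/user/hr
```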
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron bare-metal instances include NVMe SSDs at no additional cost. On AWS, GCP, and Azure, NVMe block storage is typically billed separately per GB/month. The NVMe tier on Spheron bare metal is part of the instance, so the cost-per-user figures above do not carry an NVMe storage surcharge.
NVMe KV cache offloading converts a $2.63/hr H100 PCIe GPU into roughly $0.29/hr per concurrent user at 128K context, without adding hardware. Spheron's bare-metal H100 and B200 instances include NVMe storage, so this three-tier stack works out of the box.
