On an H100 serving Llama 3.1 70B at 128K context, the KV cache for a single user takes ~40 GB. Eight concurrent users need ~320 GB total, which is 4x the H100's entire HBM. The only way to serve 8 users on one H100 is to move cold KV blocks off the GPU. That is what NVMe KV cache offloading does.
This post covers the three-tier storage hierarchy (GPU HBM, CPU DRAM, NVMe SSD), NVIDIA's ICMSP hardware architecture announced at CES 2026, and a step-by-step deployment of LMCache with a disk backend on vLLM. If you are not yet familiar with the foundational in-GPU techniques (PagedAttention, FP8/NVFP4, CPU swap space), start with the KV Cache Optimization Guide first.
The GPU Memory Wall: KV Cache as the Inference Bottleneck
KV cache size grows with the number of layers (L), attention heads (H_kv), head dimension (D), sequence length (S), and batch size (B). For the derivation of that formula and the in-GPU optimizations (FP8 quantization, PagedAttention, CPU swap), see the KV Cache Optimization Guide. This post picks up where that one leaves off: the NVMe tier that extends capacity beyond what CPU DRAM can hold.
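The ~40 GB figure from the intro can be sanity-checked by plugging Llama 3.1 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128) into that formula:

```python
# KV cache bytes = 2 (K and V) * layers * KV heads * head dim * seq len * bytes/elem
# Llama 3.1 70B architecture: 80 layers, 8 KV heads (GQA), head dim 128
layers, kv_heads, head_dim = 80, 8, 128
seq_len = 131_072          # 128K context
bf16_bytes, fp8_bytes = 2, 1

kv_bf16 = 2 * layers * kv_heads * head_dim * seq_len * bf16_bytes
kv_fp8 = 2 * layers * kv_heads * head_dim * seq_len * fp8_bytes
print(f"BF16 KV per user at 128K: {kv_bf16 / 2**30:.0f} GiB")  # 40 GiB
print(f"FP8  KV per user at 128K: {kv_fp8 / 2**30:.0f} GiB")   # 20 GiB
```

Multiply the BF16 figure by eight users and you get the ~320 GB total from the intro, which is what forces the tiered approach.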
The 2026 inflection point is context length. 128K-token and 1M-token windows are production workloads now, not research benchmarks. At those lengths, in-GPU KV storage for multiple concurrent users is not physically possible on a single GPU.
| GPU | HBM | KV cache (128K, BF16) per user | Users at 100% fill |
|---|---|---|---|
| H100 SXM5 | 80 GB | ~40 GB | ~1-2 after weight compression |
| B200 SXM6 | 192 GB | ~40 GB | ~3 (FP8 weights ~70 GB + 3 × 40 GB BF16 KV = 190 GB) |
Note: these figures use BF16 KV (the uncompressed baseline) to show the worst case. With FP8 KV quantization (~20 GB per user at 128K), B200 fits ~6 concurrent users at no offload (see the detailed table in the Cost Analysis section below).
PagedAttention, FP8/NVFP4 KV quantization, and --swap-space (covered in the prerequisite post) extend capacity within the GPU and CPU DRAM. NVMe offloading extends it further to the SSD tier.
KV Cache Storage Tiers Explained
The three-tier hierarchy assigns KV blocks to storage based on how recently they were accessed and how likely they are to be needed again:
| Tier | Storage | Bandwidth | Latency per block | Cost/TB |
|---|---|---|---|---|
| Hot | GPU HBM | ~3.35 TB/s (H100 SXM5) | <1 µs | highest |
| Warm | CPU DRAM | ~63 GB/s (PCIe 5.0 x16) | ~10-100 µs | mid |
| Cold | NVMe SSD | ~7 GB/s (PCIe 4.0 NVMe) | ~100 µs - 1 ms | lowest |
The offloading heuristic is straightforward: hot KV blocks (recently accessed, active generation) stay on GPU. Warm blocks (completed but likely needed for multi-turn conversations) go to CPU DRAM. Cold blocks (historical context, prefix cache for future requests) go to NVMe.
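The heuristic above can be sketched as a small policy function. This is illustrative only: the thresholds are made up, and it is not LMCache's actual eviction code.

```python
def assign_tier(idle_seconds: float, likely_reuse: bool) -> str:
    """Toy tiering policy: route a KV block by access recency and reuse odds."""
    if idle_seconds < 1.0:                      # active generation -> hot
        return "gpu_hbm"
    if likely_reuse and idle_seconds < 300.0:   # recent multi-turn context -> warm
        return "cpu_dram"
    return "nvme_ssd"                           # historical / prefix cache -> cold

print(assign_tier(0.2, True))     # gpu_hbm
print(assign_tier(45.0, True))    # cpu_dram
print(assign_tier(3600.0, True))  # nvme_ssd
```

Real implementations add block pinning, batched eviction, and hit-rate feedback, but the tier boundaries follow this same recency-plus-reuse shape.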
NVMe offloading is a net win only when GPU KV cache is the actual bottleneck. For short-context single-user workloads, it adds I/O latency for no benefit. The "When to Use" section at the end covers this in detail.
NVIDIA ICMSP: Hardware Architecture for Tiered KV Cache
ICMSP (Inference Context Memory Storage Platform) was announced at CES 2026. The architecture uses BlueField-4 DPUs to manage KV cache movement off the GPU compute path. The GPU's streaming multiprocessors stay focused on the forward pass instead of stalling on PCIe I/O.
NIXL (NVIDIA Inference Xfer Library) is the KV block transport protocol within ICMSP. It supports:
- NVLink for single-node GPU-to-GPU KV transfer
- InfiniBand RDMA for multi-node disaggregated serving
- PCIe for GPU-to-CPU-to-NVMe local tier movement
- TCP fallback for environments without high-speed interconnect
VAST Data reported roughly 10x faster prefill times using ICMSP with VAST DataStore (all-NVMe storage), enabling proportionally higher concurrency. The architecture this post deploys in software is the same pattern: GPU HBM, CPU DRAM, and NVMe as ordered tiers with a protocol layer managing block movement between them.
Dynamo 1.0 uses NIXL as its KV transport layer between prefill and decode workers. LMCache and Dynamo share the same NIXL primitives for moving KV blocks. For a full walkthrough of Dynamo's disaggregated serving approach, see the NVIDIA Dynamo Disaggregated Inference Guide.
One important caveat: ICMSP hardware acceleration requires BlueField-4 DPUs in the server. Spheron bare-metal instances do not include BlueField-4 DPUs today. LMCache implements the same tiered storage pattern in software via PCIe, and NIXL-accelerated paths become available as BlueField-4 infrastructure rolls out across GPU cloud providers.
LMCache: NVMe KV Offloading Available Today
LMCache is a KV cache engine that adds persistent storage backends to vLLM's in-process prefix cache. It supports vLLM, SGLang, and NVIDIA Dynamo as inference engines. If you have not yet set up vLLM in production, see the vLLM production deployment guide first.
What it does differently from vLLM's built-in --swap-space:
| Feature | vLLM --swap-space | LMCache disk backend |
|---|---|---|
| Storage tier | CPU RAM only | NVMe, Redis, remote storage |
| Survives server restart | No | Yes |
| Shared across replicas | No | Yes (with shared storage) |
| Block granularity | vLLM internal | Configurable chunk_size |
The practical difference matters most for multi-turn chatbots and RAG workloads. With --swap-space, every server restart clears the warm cache and the next user request recomputes the full prefix. With LMCache, the blocks persist on disk, so a server restart does not wipe the cache.
Production adoption: Google Cloud uses LMCache in GKE Inference, and CoreWeave and Cohere have benchmarked LMCache on CoreWeave infrastructure. VAST Data benchmarks document TTFT dropping from ~11s to ~1.5s for a 128K-token system prompt on H100, because prefill computation is replaced by a cache hit from disk.
Version pinning note: LMCache's configuration format and compatible vLLM versions change with releases. Check docs.lmcache.ai for current version compatibility before deploying.
Deploy LMCache with vLLM on a Spheron H100 (Step-by-Step)
Provision Your Instance
Go to app.spheron.ai and select an H100 SXM5 80GB or H100 PCIe 80GB bare-metal instance. Current on-demand pricing: H100 PCIe from $2.63/hr (as of 01 Apr 2026). See H100 GPU rental for current availability and pricing.
Bare-metal instances on Spheron include NVMe SSDs at no additional cost; on managed cloud instances, NVMe block storage is typically billed separately per GB/month.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Verify NVMe Storage
SSH into your instance and confirm the NVMe device is accessible. The device path and mount point vary across instances, so check before hardcoding any path in your config:
```shell
lsblk
nvme list
df -h
```

Expected output will show an NVMe device (typically /dev/nvme0n1) and its mount point. Create the cache directory:
```shell
# Adjust the mount path based on your lsblk output
mkdir -p /nvme/lmcache
df -h /nvme
```

Install Dependencies
```shell
pip install vllm lmcache
# Verify compatible versions
python -c "import vllm; print(vllm.__version__)"
python -c "import lmcache; print(lmcache.__version__)"
```

Check docs.lmcache.ai for which LMCache version is compatible with your vLLM version before proceeding. The config format has changed across releases.
Configure LMCache with Disk Backend
Create lmcache_config.yaml:
```yaml
chunk_size: 256
local_cpu: false
max_local_cpu_size: 0
local_disk: true
local_disk_path: /nvme/lmcache
max_local_disk_size: 200  # GB - set to ~80% of available NVMe space
remote_url: null
```

Setting local_cpu: false routes cold blocks directly to NVMe, skipping CPU DRAM. If you want a warm tier in CPU DRAM between GPU and NVMe, set local_cpu: true and configure max_local_cpu_size accordingly.
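With the warm tier enabled, the config might look like the following sketch. The sizes are illustrative, and field names should be verified against docs.lmcache.ai for your LMCache version:

```yaml
# Variant with a CPU DRAM warm tier between GPU and NVMe (sizes illustrative)
chunk_size: 256
local_cpu: true
max_local_cpu_size: 32   # GB of CPU DRAM reserved for warm KV blocks
local_disk: true
local_disk_path: /nvme/lmcache
max_local_disk_size: 200
remote_url: null
```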
Check available NVMe space before setting max_local_disk_size:
```shell
df -h /nvme
```

Launch vLLM with LMCache
```shell
# --max-model-len 32768: conservative default for a single H100; see note below
LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml \
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --swap-space 16 \
  --port 8000
```

Context length note: This command uses --max-model-len 32768 (32K) as a conservative default that fits comfortably on a single H100 80 GB with FP8 weights and leaves room for multiple concurrent users. The rest of this post is motivated by 128K context workloads. To serve 128K context, set --max-model-len 131072 instead. Before doing so, verify you have enough GPU HBM + NVMe capacity: at 128K, each user's FP8 KV cache is ~20 GB, so a single H100 can hold at most one user in GPU memory with no offload. LMCache's NVMe backend is what makes 128K serving practical at any real concurrency.
--enable-prefix-caching activates vLLM's in-process prefix cache. LMCache extends this with the NVMe tier. --swap-space 16 adds a CPU RAM buffer for burst handling. --kv-cache-dtype fp8 halves KV memory on GPU, leaving more room for hot blocks before any eviction happens.
Validate Offloading
Send a request with a long shared prefix, then send the same prefix again:
```shell
# Check the cache directory is growing
du -sh /nvme/lmcache
# Monitor KV cache metrics
curl http://localhost:8000/metrics | grep kv_cache
```

TTFT on the second request (same prefix) should be substantially lower than the first. If du shows no growth after the first request, check that LMCACHE_CONFIG_FILE is set correctly and that the local_disk_path directory is writable.
NVFP4 + NVMe Offloading on Blackwell GPUs
On B200 instances, combine --kv-cache-dtype nvfp4 with LMCache's NVMe backend for higher concurrency. NVFP4 stores KV tensors in 4-bit format, which is 75% smaller than BF16 and 50% smaller than FP8. This means the GPU retains more hot blocks before any eviction to NVMe.
Do not use --kv-cache-dtype nvfp4 on H100 or A100. NVFP4 KV cache requires Blackwell hardware acceleration. On H100, use --kv-cache-dtype fp8.
Memory estimates for Llama 3.1 70B at 128K context on a B200 192 GB:
| Configuration | Hot GPU KV | Cold NVMe KV | Estimated concurrent users |
|---|---|---|---|
| BF16, no offload | 192 GB cap | 0 | ~1 (weights ~140 GB, ~40 GB BF16 KV each; 140 + 2 × 40 = 220 GB exceeds 192 GB) |
| FP8, no offload | 192 GB cap | 0 | ~6 (weights ~70 GB, ~20 GB FP8 KV each; 70 + 6 × 20 = 190 GB) |
| FP8, NVMe offload | ~40 GB hot | 200 GB cold | ~12-15 (assumes high NVMe prefix cache hit rate; unique-prefix workloads will see lower concurrency and higher latency) |
| NVFP4, NVMe offload | ~40 GB hot | 200 GB cold | ~20-25 (estimated) |
These are estimates. Actual concurrency depends on working set size relative to GPU HBM: if most requests share prefixes that fit in the hot tier, you get the high end of the range. If every request has a unique 128K prefix, cold NVMe accesses dominate and the latency penalty compounds. See FP4 Quantization on Blackwell GPUs for the quality analysis behind NVFP4.
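The no-offload rows in the table reduce to a one-line ceiling: whatever HBM is left after weights, divided by KV per user. A quick check of the FP8 row (figures taken from the table; real serving needs headroom for activations, so treat this as an upper bound):

```python
# No-offload concurrency ceiling: floor((HBM - weights) / KV per user)
hbm_gb = 192            # B200
weights_fp8_gb = 70     # Llama 3.1 70B, FP8 weights
kv_fp8_gb = 20          # per user at 128K context, FP8 KV

max_users = (hbm_gb - weights_fp8_gb) // kv_fp8_gb
print(max_users)  # 6
```

The offload rows break this ceiling because cold users' KV lives on NVMe, so only the hot working set has to fit in the 192 GB.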
Check GPU pricing for current B200 availability.
Benchmarks: Throughput and Latency Across Storage Tiers
These are illustrative figures drawn from published LMCache research and NVIDIA documentation, not independently measured on the exact configuration below. Actual numbers vary by model, batch size, and request pattern.
Configuration: Llama 3.1 70B, 8 concurrent users, 32K context, H100 PCIe:
| KV Storage Tier | Throughput (tok/s) | TTFT (warm prefix) | TTFT (cold) | Notes |
|---|---|---|---|---|
| GPU HBM only | ~900 | ~0.8s | ~11s | vLLM default, no prefix cache |
| + CPU swap (--swap-space 32) | ~950 | ~0.9s | ~11s | burst tolerance, no persistence |
| + LMCache disk (cold prefix) | ~1,100 | ~1.5s | ~1.5s | NVMe cache hits skip prefill |
| + FP8 KV + LMCache disk | ~1,800 | ~1.4s | ~1.4s | fewer evictions, more hot blocks |
"Warm prefix" TTFT means the shared prefix was previously cached (from a prior request in this session or loaded from NVMe). "Cold" TTFT means the first request with no cached prefix available.
The throughput improvement from LMCache comes from avoiding prefill recomputation on cache hits, not from NVMe I/O being fast. NVMe at 7 GB/s is slow relative to GPU HBM; the gain is that loading precomputed KV blocks from disk is still much faster than running the full attention computation from scratch for a 128K-token prompt. This is the same effect behind the VAST Data figures cited earlier: ~11s of cold prefill drops to ~1.5s when the computation is replaced by a cache hit from disk.
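A back-of-envelope comparison makes the point, using the figures from the tables above. The sequential-read bound is pessimistic; deployed systems overlap I/O with compute, which is how sub-2s warm TTFT is reached:

```python
# Lower bound for reloading one user's KV from NVMe versus recomputing prefill.
kv_fp8_gb = 20          # FP8 KV per user at 128K context
nvme_gb_per_s = 7       # PCIe 4.0 NVMe sequential read
cold_prefill_s = 11     # cold TTFT for a 128K prompt, from the benchmark table

reload_s = kv_fp8_gb / nvme_gb_per_s
print(f"reload ~{reload_s:.1f}s vs recompute ~{cold_prefill_s}s")  # ~2.9s vs ~11s
```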
When to Use NVMe KV Cache Offloading
Use it when:
- Context lengths are 32K tokens or more per request
- You are running multi-turn chatbots where conversation history grows across sessions
- RAG pipelines share document prefixes across many users (high prefix cache hit rate)
- Batch inference has many requests sharing a preamble (system prompt, instructions)
- GPU KV cache fills and you are seeing OOM errors, request queuing, or high TTFT on repeated prefixes
Skip it when:
- Requests are under 4K tokens (KV cache per request is small, NVMe tier adds overhead for minimal gain)
- Every request has a unique prefix (no cache hits, only latency penalty)
- Your p99 SLA is under 100ms (NVMe access latency is hundreds of microseconds per miss)
- You already have GPU KV cache headroom to spare (NVMe tier only helps when GPU is the bottleneck)
- You are using speculative decoding (which relies on low-latency draft token generation that conflicts with cold NVMe access patterns; see the speculative decoding production guide for details)
If you are still deciding on the broader LLM deployment stack before adding KV cache offloading, the LLM deployment guide covers infrastructure setup from scratch.
Cost Analysis: Effective Cost Per Concurrent User
The GPU cost per hour does not change with NVMe offloading. What changes is how many concurrent users you serve from that GPU. More users per GPU means lower effective cost per user.
Pricing assumptions: H100 PCIe on Spheron (on-demand, as of 01 Apr 2026), Llama 3.1 70B, 128K context:
| Configuration | GPU $/hr | Max concurrent users | Effective $/hr per user |
|---|---|---|---|
| No optimization (BF16 KV) | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + PagedAttention | $2.63 | ~1 | ~$2.63/user/hr |
| FP8 KV + CPU swap + LMCache disk | $2.63 | ~8-10 | ~$0.29/user/hr |
| NVFP4 KV + LMCache disk (B200) | check pricing | ~20+ | varies |
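The per-user figures in the table are simple division. A check of the FP8 + LMCache row, taking the midpoint of the ~8-10 user estimate:

```python
# Effective cost per user = GPU $/hr divided by concurrent users served
gpu_usd_per_hr = 2.63   # H100 PCIe on-demand rate used in the table
concurrent_users = 9    # midpoint of the ~8-10 estimate

print(f"~${gpu_usd_per_hr / concurrent_users:.2f}/user/hr")  # ~$0.29/user/hr
```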
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron bare-metal instances include NVMe SSDs at no additional cost. On AWS, GCP, and Azure, NVMe block storage is typically billed separately per GB/month. The NVMe tier on Spheron bare metal is part of the instance, so the cost-per-user figures above do not carry an NVMe storage surcharge.
NVMe KV cache offloading converts a $2.63/hr H100 PCIe GPU into roughly $0.29/hr per concurrent user at 128K context, without adding hardware. Spheron's bare-metal H100 and B200 instances include NVMe storage, so this three-tier stack works out of the box.
