NVIDIA ICMSP Explained: KV Cache NVMe Offload, 5x Inference Gains, and Setup Guide (2026)

At 1M-token context windows, the KV cache for a single user running Llama 3.1 70B exceeds 320 GB, which is 4x the entire HBM of an H100. NVIDIA's Inference Context Memory Storage Platform (ICMSP), announced at CES 2026, addresses this directly: it standardizes three-tier KV cache storage from GPU HBM through CPU DRAM down to NVMe SSD, with BlueField-4 DPUs handling the I/O scheduling so the GPU never stalls waiting on storage. For the software-only predecessor approach using LMCache that you can deploy today, see the NVMe KV Cache Offloading guide first.

What Is NVIDIA ICMSP?

NVIDIA's official branding for this platform is CMX (CMX Context Memory Storage Platform). The longer "Inference Context Memory Storage Platform" (ICMSP) name appears in NVIDIA's announcement URL and remains a common reference for the same product.

ICMSP (Inference Context Memory Storage Platform) is NVIDIA's standardized framework for KV cache tiering. It moves KV blocks from GPU VRAM through CPU DRAM down to NVMe SSD based on access recency. When a block hasn't been accessed in a while, it migrates to a cheaper, slower tier. When needed again, it loads back up the hierarchy.

Three components make this work together:

  • cuFile: GPU-direct NVMe transfer API that bypasses the CPU for NVMe reads and writes. Without cuFile, NVMe I/O requires the CPU to coordinate the transfer, adding latency and CPU interrupt overhead.
  • BlueField-4 DPU: NVIDIA's data processing unit manages KV block scheduling off the host CPU entirely. The DPU handles I/O completion, eviction decisions, and tier movement so the host CPU stays available for inference logic.
  • NIXL protocol: The NVIDIA Inference Transfer Library handles KV block movement within a cluster, supporting NVLink for single-node GPU-to-GPU transfers, InfiniBand RDMA for multi-node, PCIe for local tier movement, and TCP as a fallback. For the full NIXL architecture and multi-node disaggregated serving setup, see the NVIDIA NIXL Disaggregated Inference guide.

Enterprise NVMe drives commonly used with cuFile and GPUDirect Storage (from Samsung, Micron, Seagate, and Kioxia) carry the endurance ratings needed for the write-heavy access patterns of KV cache workloads. ICMSP's announced storage platform partners include AIC, Cloudian, DDN, Dell, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA.

H200 instances with 141 GB HBM3e are particularly well-suited to ICMSP deployments because the larger VRAM keeps more hot blocks on-GPU, reducing the frequency of NVMe fetches.
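The recency-based tier movement described above can be illustrated with a toy model. This is a minimal sketch, not ICMSP's actual scheduler: the `TieredKVCache` class, its method names, and the block-count capacities are all hypothetical, and the NVMe tier is treated as unbounded for simplicity.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier LRU: blocks demote HBM -> DRAM -> NVMe as they go cold."""

    TIERS = ("hbm", "dram", "nvme")

    def __init__(self, hbm_blocks, dram_blocks):
        # Capacities are in blocks; the NVMe tier is unbounded in this sketch.
        self.capacity = {"hbm": hbm_blocks, "dram": dram_blocks}
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _demote_overflow(self):
        # Push least-recently-used blocks down one tier until each tier fits.
        for upper, lower in zip(self.TIERS, self.TIERS[1:]):
            while len(self.tiers[upper]) > self.capacity[upper]:
                block_id, data = self.tiers[upper].popitem(last=False)
                self.tiers[lower][block_id] = data

    def put(self, block_id, data):
        # New or re-used blocks always land in the hottest tier.
        for tier in self.tiers.values():
            tier.pop(block_id, None)
        self.tiers["hbm"][block_id] = data
        self._demote_overflow()

    def get(self, block_id):
        # Return (data, tier_it_was_found_in) and promote the block to HBM.
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self.put(block_id, data)
                return data, name
        return None, None

    def tier_of(self, block_id):
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                return name
        return None
```

With two HBM slots and two DRAM slots, inserting five blocks leaves the oldest one on the NVMe tier; reading it back promotes it to HBM and pushes the coldest DRAM block down in its place.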

The KV Cache Memory Wall

KV cache size scales with layers, KV heads, head dimension, sequence length, and batch size simultaneously. The KV Cache Optimization Guide covers the full derivation and the in-GPU optimizations (PagedAttention, FP8/NVFP4 quantization, CPU swap space). This post focuses on what happens when in-GPU and CPU-DRAM capacity runs out.

| GPU | HBM | KV cache per user (128K, BF16) | Max concurrent users (KV only) |
| --- | --- | --- | --- |
| H100 SXM5 | 80 GB | ~40 GB | ~1-2 |
| H200 SXM5 | 141 GB | ~40 GB | ~2-3 |
| B200 SXM6 | 192 GB | ~40 GB | ~4 |

At 1M-token windows, the numbers get worse: KV cache per user at BF16 hits ~320 GB. No current GPU fits even one user's full KV cache in HBM at 1M tokens. At that scale, NVMe tiering stops being an optimization and becomes a requirement.
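The ~40 GB and ~320 GB figures can be reproduced from the scaling factors listed above. The sketch below assumes Llama 3.1 70B's published attention shape (80 layers, 8 KV heads, head dimension 128) and 2 bytes per element for BF16; the function name `kv_cache_bytes` is just an illustrative helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed model shape: Llama 3.1 70B with grouped-query attention.
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for label, tokens in (("128K", 128 * 1024), ("1M", 1024 * 1024)):
    size = kv_cache_bytes(seq_len=tokens, **LLAMA_70B)
    print(f"{label} tokens: {size / GIB:.0f} GiB per user at BF16")
```

Passing `bytes_per_elem=1` models an FP8 KV cache, which halves both figures.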

For a deeper analysis of why adding GPUs doesn't solve the latency problem when you're KV-cache-bound, see the AI Memory Wall Inference Latency Guide.

ICMSP Architecture: Three-Tier Storage Hierarchy

ICMSP maps KV blocks to three storage tiers based on temperature. Hot blocks (active generation, recently accessed) stay on GPU HBM. Warm blocks (completed requests, likely multi-turn reuse) go to CPU DRAM. Cold blocks (historical context, evicted prefix cache entries) go to NVMe SSD.

GPU HBM (hot, ~3.35 TB/s)
     |  cuFile / NVLink
CPU DRAM (warm, ~63 GB/s via PCIe 5.0)
     |  DPU-managed I/O
NVMe SSD (cold, ~7 GB/s, ~100µs latency)

The bandwidth numbers define what each tier is good for. At 3.35 TB/s, GPU HBM can serve KV blocks at token generation speed. CPU DRAM at 63 GB/s is ~50x slower but still fast enough for warm blocks that get promoted when a multi-turn conversation resumes. NVMe at ~7 GB/s with ~100 microsecond access latency is where cold blocks live. Loading 40 GB of KV blocks from NVMe takes about 6 seconds, which sounds slow until you compare it to recomputing a 128K-token prefix from scratch (~11 seconds on H100).
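The load-versus-recompute comparison above is simple division, sketched here with the bandwidth figures from the tier diagram. The helper name `transfer_seconds` is hypothetical, and the model ignores per-request access latency and any parallelism across drives.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    # Pure bandwidth model: ignores access latency and queueing.
    return size_gb / bandwidth_gb_per_s


# Bandwidth figures quoted in the tier diagram, in GB/s.
HBM_BW, DRAM_BW, NVME_BW = 3350.0, 63.0, 7.0

KV_SIZE_GB = 40        # one user's 128K-token BF16 cache
RECOMPUTE_S = 11.0     # quoted 128K prefill time on H100

for tier, bandwidth in (("CPU DRAM", DRAM_BW), ("NVMe", NVME_BW)):
    load_s = transfer_seconds(KV_SIZE_GB, bandwidth)
    print(f"{tier}: load {load_s:.1f}s vs recompute {RECOMPUTE_S:.0f}s "
          f"({RECOMPUTE_S / load_s:.1f}x faster)")

print(f"HBM is {HBM_BW / DRAM_BW:.0f}x faster than the DRAM link")
```

Striping the cache across several drives raises the effective NVMe bandwidth and shortens the load time proportionally.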

The DPU layer is what makes ICMSP different from software-only approaches. When LMCache manages NVMe tiering in software, every I/O completion triggers a CPU interrupt. At high concurrency, those interrupts accumulate and add tail latency. ICMSP's BlueField-4 DPU absorbs the I/O management entirely, so the host CPU never sees the NVMe traffic at all. BlueField-4 DPU nodes are on NVIDIA's H2 2026 hardware roadmap; the setup guide in this post demonstrates the software-only LMCache tiering pattern that ICMSP will hardware-accelerate when DPU nodes ship.

Storage Platform Partners and NVMe Drive Requirements

NVIDIA's announced ICMSP ecosystem partners are storage-system builders rather than drive vendors. The initial partner list includes AIC, Cloudian, DDN, Dell, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA. These are the vendors building ICMSP-integrated storage appliances and validated clusters.

For the underlying NVMe drives, ICMSP requires enterprise-class SSDs that can sustain the write amplification patterns KV cache workloads create. Consumer drives are excluded because they lack the endurance ratings for sustained inference. The drives below are commonly used in cuFile and GPUDirect Storage deployments and meet the interface and endurance requirements:

| Vendor | Drive | Interface | Seq Read | Notes |
| --- | --- | --- | --- | --- |
| Samsung | PM9A3 | PCIe 4.0 x4 | ~6.5 GB/s | Enterprise endurance |
| Micron | 9400 Pro | PCIe 4.0 x4 | ~7.0 GB/s | High IOPS for random reads |
| Seagate | Nytro XD6510 | PCIe 4.0 x4 | ~7.4 GB/s | Data center SLC cache |
| Kioxia | CD8 Series | PCIe 4.0 x4 | ~7.0 GB/s | CM7 series enterprise |

The BlueField-4 DPU handles I/O scheduling and acts as the ICMSP I/O controller. It connects to the host via PCIe and exposes NVMe devices to the storage fabric. From the GPU's perspective, all NVMe access goes through the DPU rather than directly over the PCIe bus to the host CPU.

Performance Claims Dissected: Where Do the 5x and 20x Numbers Come From?

NVIDIA's CES 2026 announcement cited two headline numbers: 5x tokens-per-second improvement for long-context workloads and 5x power efficiency. NVIDIA partner benchmarks have also cited up to 20x improvement in time-to-first-token. Each covers a different scenario.

Tokens/sec gain: Without tiering, every new request at 1M-token context requires recomputing the entire prefix. On H100, that prefill takes roughly 90 seconds. With NVMe tiering, the same KV blocks load from SSD in ~8 seconds. The speedup in multi-turn or prefix-reuse scenarios is larger: at 128K with 90% prefix overlap, recomputation costs ~11 seconds while loading from NVMe costs ~0.8 seconds. Both load times imply aggregate read bandwidth well above the ~7 GB/s of a single drive (for example, several striped NVMe drives); on one drive, a 40 GB cache takes about 6 seconds to load. The 5x number comes from specific high-prefix-reuse production workloads. For single-turn, short-context inference with no prefix reuse, ICMSP adds latency rather than reducing it, because the DPU scheduling overhead costs more than it saves.

TTFT gain: The up-to-20x time-to-first-token improvement cited in partner benchmarks comes from prefix-reuse scenarios where KV blocks are already cached on NVMe. For a 128K context with 90% prefix overlap, loading cached KV blocks from NVMe (~0.8s) instead of recomputing the prefix from scratch (~11s) gives a ~13x improvement. At longer contexts the gain can grow further, because prefill cost rises faster than linearly with sequence length (attention compute is quadratic in tokens) while NVMe load time rises only linearly with the number of KV blocks. The 20x headline represents the most favorable prefix-heavy production workloads.

Power efficiency: KV cache I/O via DPU draws far less power than the GPU recomputing prefill at full SM utilization. A 90-second full-HBM prefill run uses significantly more watt-hours than an 8-second NVMe load where the GPU idles.

| Scenario | Without ICMSP | With ICMSP | Gain |
| --- | --- | --- | --- |
| 128K single-turn, no prefix reuse | Baseline | ~Same or slower | None (~1x or lower, DPU overhead) |
| 128K multi-turn, 90% prefix match | Recompute ~11s prefill | Load from NVMe ~0.8s | ~13x TTFT |
| 1M-token single-turn RAG | Recompute ~90s | Load from NVMe ~8s | ~11x TTFT |
| High-concurrency (8+ users, 128K each) | OOM or massive swap | Serve from NVMe tier | Throughput-limited only by NVMe bandwidth |

ICMSP vs LMCache vs Mooncake vs DeepSeek NVMe

| System | Type | Transport | DPU acceleration | Frameworks | Open source |
| --- | --- | --- | --- | --- | --- |
| NVIDIA ICMSP | Hardware + software | cuFile, NIXL (NVLink/IB/PCIe) | BlueField-4 | vLLM, SGLang, Dynamo | Partial (spec open) |
| LMCache | Software only | CPU memcpy, POSIX I/O | No | vLLM, SGLang, Dynamo | Yes (Apache 2.0) |
| Mooncake | Software only | RDMA for P/D disaggregation | No | vLLM (fork) | Yes (Apache 2.0) |
| DeepSeek NVMe | Software, proprietary | Custom RDMA | No | Internal only | No |

The practical takeaway: LMCache is the production-ready option today for anyone on standard hardware. You install it, configure the disk backend, and get the same three-tier storage hierarchy that ICMSP provides, managed by the CPU rather than a DPU. ICMSP's BlueField-4 path adds hardware acceleration when DPU-equipped nodes become available, reducing CPU interrupt load and improving tail latency at high concurrency.

For Dynamo's role in KV cache orchestration across disaggregated clusters, see the NVIDIA Dynamo Disaggregated Inference Guide. Dynamo uses NIXL underneath for the same block transfer primitives that ICMSP builds on.

Setup Guide: Enabling ICMSP-Compatible KV Offloading with vLLM and SGLang

This setup deploys LMCache with a local NVMe disk backend on vLLM, which implements the same storage tiering pattern ICMSP standardizes at the hardware level. You can run this today on any Spheron bare-metal node.

Step 1: Provision a Spheron bare-metal node

bash
# From app.spheron.ai, select an H100 SXM5, H200, or B200 bare-metal instance
# All Spheron bare-metal instances include local NVMe at no added cost
ssh user@<spheron-instance-ip>

Step 2: Verify NVMe

bash
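# List block devices and NVMe controllers (nvme list needs the nvme-cli package)
lsblk
nvme list
# Find where the NVMe volume is mounted; the rest of this guide assumes /nvme
df -h | grep nvme
mkdir -p /nvme/kvcache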

Step 3: Install vLLM + LMCache

bash
pip install vllm lmcache
python -c "import vllm; print(vllm.__version__)"
python -c "import lmcache; print(lmcache.__version__)"

Check docs.lmcache.ai for version compatibility notes before deploying.

Step 4: Create LMCache config

yaml
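# lmcache_config.yaml
# Key names can differ between LMCache releases; confirm them against
# docs.lmcache.ai for the version you installed.
local_disk: true
local_disk_path: /nvme/kvcache
max_local_disk_size: 1000  # GB
# CPU DRAM tier disabled here, so evicted blocks go straight from HBM to NVMe
local_cpu: false
max_local_cpu_size: 0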
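Before committing to a `max_local_disk_size` value, it can help to confirm that the cache directory has that much free space. The check below is a minimal sketch using only the standard library; the helper name `nvme_headroom_gb` and the `/nvme/kvcache` path are assumptions carried over from this guide's examples.

```python
import shutil


def nvme_headroom_gb(cache_dir, max_local_disk_size_gb):
    """Free space on the cache volume minus the configured cache limit, in GB."""
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    return free_gb - max_local_disk_size_gb


def check_cache_dir(cache_dir, max_local_disk_size_gb):
    # A negative headroom means the configured limit exceeds what the disk can hold.
    headroom = nvme_headroom_gb(cache_dir, max_local_disk_size_gb)
    status = "OK" if headroom >= 0 else "TOO LARGE"
    return f"{status}: {headroom:.0f} GB headroom for a {max_local_disk_size_gb} GB cache"
```

For example, `print(check_cache_dir("/nvme/kvcache", 1000))` reports whether the 1000 GB limit from the config fits on the mounted volume.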

Step 5: Launch vLLM with prefix caching

bash
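export LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml

# 8B fits on a single H100/H200/B200 GPU at BF16.
# For 70B: use --quantization fp8 on H200/B200, or --tensor-parallel-size 2 across two GPUs.
# --kv-transfer-config attaches LMCache to vLLM; without it the config file above is not used.
# The connector name matches recent LMCache releases; check docs.lmcache.ai for your version.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --swap-space 16 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
  --port 8000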

SGLang variant:

bash
# Same model-size guidance applies: 8B for single-GPU; add --tp 2 for 70B across two GPUs.
# RadixAttention prefix caching is on by default in SGLang, so no enable flag is needed.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --port 8000

SGLang's RadixAttention manages KV cache sharing automatically across requests that share common prefixes.

Step 6: Benchmark TTFT before/after

bash
# First request (cold - populates NVMe cache)
# Llama 3.1 caps context at 128K tokens, so keep the prefix under that limit
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "<your-long-prefix> Continue:",
       "max_tokens": 100}'

# Second request (warm - loads cached KV blocks, skips prefill recompute)
# Resend the same prompt and measure TTFT with time or a benchmarking tool

# Monitor cache
du -sh /nvme/kvcache
curl -s http://localhost:8000/metrics | grep kv_cache
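One way to measure TTFT for the cold and warm requests is to stream the response and time the first chunk. The script below is a rough sketch using only the standard library; it assumes the server exposes the OpenAI-compatible `/v1/completions` endpoint with `"stream": true` server-sent events, and the helper names are illustrative.

```python
import json
import time
import urllib.request


def build_request(base_url, model, prompt, max_tokens=100):
    """Build a streaming POST to the OpenAI-compatible completions endpoint."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_token_event(line):
    """True for an SSE data line that carries a chunk, not the [DONE] marker."""
    text = line.strip()
    return text.startswith(b"data:") and not text.endswith(b"[DONE]")


def measure_ttft(base_url, model, prompt):
    """Seconds from sending the request until the first streamed chunk arrives."""
    request = build_request(base_url, model, prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        for line in response:
            if is_token_event(line):
                return time.perf_counter() - start
    return None
```

Calling `measure_ttft("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", prompt)` twice with the same long prompt gives a cold and a warm reading; the ratio between them is the TTFT gain on your hardware.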

For the complete vLLM flag reference and multi-GPU tensor parallel setup, see the vLLM Production Deployment guide. For SGLang's RadixAttention configuration and production tuning, see the SGLang Production Deployment guide.

Workloads Where ICMSP Matters Most

ICMSP and LMCache NVMe tiering show the largest gains in workloads with high prefix reuse or very long context windows. Short-context, single-turn inference without prefix overlap gets no benefit and may see slight latency increases from DPU scheduling overhead.

The workloads where tiering matters most:

Reasoning models (DeepSeek R1, QwQ-32B): Long chain-of-thought generation creates large intermediate KV context that gets reused across sampling iterations. Multi-step reasoning with the same context prefix is exactly the pattern ICMSP's NVMe tier handles well.

AI agents with memory: Agents that maintain multi-turn context across tool calls and sub-tasks accumulate KV blocks over many interactions. The earlier turns' context sits cold in NVMe while the recent turns stay hot in HBM. See the Agentic RAG GPU Infrastructure Guide for how this fits into a broader agent infrastructure design.

1M-token RAG systems: Document-scale retrieval pipelines where the retrieved text is the entire context. The document KV blocks get computed once and cached. Subsequent queries over the same document hit the NVMe tier instead of recomputing the full prefix.

Multi-session chat: Shared system prompts reused across thousands of concurrent users. The system prompt's KV blocks stay cached across sessions, cutting TTFT for every user.

Ultra-long context serving: B200 SXM6 nodes with 192 GB HBM and higher HBM3e bandwidth are the best hardware for serving contexts above 512K tokens even without NVMe offloading, but the NVMe tier extends their effective capacity by another order of magnitude for multi-user scenarios.

For 10M-token context windows using Ring Attention and Tree Attention, see the Ring Attention and Tree Attention guide.

Cost-Per-Million-Tokens: Vanilla vs ICMSP-Enabled vs AWS

Live pricing from the Spheron API at time of writing: H100 SXM5 at $4.06/hr on-demand ($1.43/hr spot), B200 SXM6 at $8.61/hr on-demand ($5.34/hr spot). Spot pricing is available where capacity permits and is the strongest cost lever for ICMSP evaluation budgets.

| Configuration | GPU | Price | Tokens/sec (Llama 70B, 128K context) | Cost per 1M tokens |
| --- | --- | --- | --- | --- |
| Vanilla vLLM, Spheron (on-demand) | H100 SXM5 | $4.06/hr | ~80 (single user) | ~$14.10 |
| Vanilla vLLM, Spheron (spot) | H100 SXM5 | $1.43/hr | ~80 (single user) | ~$4.97 |
| ICMSP-enabled vLLM, Spheron (on-demand) | H100 SXM5 | $4.06/hr | ~400 (8 concurrent users, NVMe tiering) | ~$2.82 |
| ICMSP-enabled vLLM, Spheron (spot) | H100 SXM5 | $1.43/hr | ~400 (8 concurrent users, NVMe tiering) | ~$0.99 |
| Vanilla vLLM, AWS p5.48xlarge | H100 | ~$98/hr (8x H100 bundle) | ~80/GPU | ~$42.53/GPU |
| ICMSP-enabled vLLM, AWS p5 | H100 | ~$98/hr | ~400/GPU (if local NVMe provisioned) | ~$8.51/GPU |
| Vanilla vLLM, Spheron (on-demand) | B200 SXM6 | $8.61/hr | ~200 (higher HBM) | ~$11.96 |
| ICMSP-enabled vLLM, Spheron (on-demand) | B200 SXM6 | $8.61/hr | ~800 (8 concurrent users) | ~$2.99 |
| ICMSP-enabled vLLM, Spheron (spot) | B200 SXM6 | $5.34/hr | ~800 (8 concurrent users) | ~$1.85 |
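The cost column follows directly from hourly price and sustained throughput. A short sketch of that arithmetic, using prices and throughput figures from the table (the function name is illustrative):

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    # Tokens generated in one billed hour at the given sustained rate.
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


rows = [
    ("H100 on-demand, vanilla", 4.06, 80),
    ("H100 on-demand, tiered", 4.06, 400),
    ("H100 spot, tiered", 1.43, 400),
    ("AWS p5 per GPU, vanilla", 98 / 8, 80),
]
for label, price, tps in rows:
    print(f"{label}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Substituting your own measured tokens/sec gives a cost figure that reflects your workload rather than the table's estimates.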

AWS p5 instances include local NVMe SSDs (8x 3.84 TB), so ICMSP-style tiering is technically possible there. The gap is the pricing model: p5.48xlarge on-demand runs around $98/hr for the full 8-GPU node, which means you pay for all 8 GPUs even if your workload uses fewer. Spheron's per-GPU bare-metal pricing means NVMe is always included without paying for a fixed multi-GPU bundle.

At $1.43/hr, a Spheron H100 SXM5 spot instance is over 8x cheaper per GPU-hour than the $12.25/hr effective per-GPU cost of AWS p5 ($98/hr across 8 GPUs), and NVMe tiering is included at no extra charge. At on-demand pricing ($4.06/hr), the gap is still about 3x in Spheron's favor. B200 spot capacity, where available, runs $5.34/hr.

Pricing fluctuates based on GPU availability. The prices above are as of 06 Jun 2026 and may have changed. Check current GPU pricing for live rates.

For a broader provider comparison including AWS, Azure, and Lambda Labs, see GPU Cloud Pricing Comparison 2026. For FinOps context on total cost including egress and storage, see AI Inference Cost Economics 2026.

Production Rollout Checklist for Spheron-Hosted Clusters

  1. Confirm NVMe mount point and available capacity before setting max_local_disk_size in LMCache config. Running out of NVMe space mid-serving causes eviction loops and TTFT spikes.
  2. Use --kv-cache-dtype fp8 in vLLM to halve KV block size before writing to NVMe. This doubles effective NVMe capacity for KV storage and cuts I/O bandwidth requirements by half.
  3. Set --gpu-memory-utilization 0.90 to leave headroom for KV block I/O buffers in GPU memory.
  4. Monitor kv_cache_usage_perc at /metrics. vLLM reports this gauge as a fraction from 0 to 1. If it stays consistently below 0.70 (70%), you have room to increase concurrent users before triggering NVMe eviction.
  5. For multi-node clusters, configure NIXL's RDMA transport to move KV blocks between nodes without going through NVMe. See the NVIDIA NIXL guide for setup details.
  6. For prefill-decode disaggregation, ICMSP's NVMe tier sits on the decode node. The prefill node does not need NVMe. See the Prefill-Decode Disaggregation guide for the full topology.
  7. With multiple NVMe drives, distribute the LMCache path across drives using a RAID-0 or symlink rotation to avoid single-drive bandwidth bottleneck.
  8. Add a readiness probe that checks /metrics for kv_cache_usage_perc < 0.95 (the gauge is a 0-1 fraction, so 0.95 means 95%) before routing traffic. A saturated NVMe tier causes latency spikes that look like model slowdowns.
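The readiness check in the last item can be sketched as a small parser over the Prometheus text served at /metrics. This is an illustrative sketch, not a shipped probe: it assumes the gauge name ends in `kv_cache_usage_perc` (older vLLM releases expose `gpu_cache_usage_perc` instead, so the suffix is a parameter) and that values are fractions from 0 to 1.

```python
def kv_cache_usage(metrics_text, metric_suffix="kv_cache_usage_perc"):
    """Return the highest matching gauge sample, or None if the metric is absent."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        metric_name = name_and_labels.split("{", 1)[0]
        if metric_name.endswith(metric_suffix):
            values.append(float(value))
    return max(values) if values else None


def is_ready(metrics_text, threshold=0.95):
    # Not ready when the metric is missing or usage is at or above the threshold.
    usage = kv_cache_usage(metrics_text)
    return usage is not None and usage < threshold
```

A probe wrapper would fetch `http://localhost:8000/metrics`, pass the body to `is_ready`, and exit non-zero when it returns False.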

For measuring your actual throughput gains and cost-per-token before and after enabling tiering, see the GPU Cost Per Token Benchmark guide.


Spheron H100, H200, and B200 bare-metal instances include local NVMe storage, the exact substrate ICMSP and LMCache need for KV cache tiering. No extra provisioning, no separate EBS volume, no per-GB NVMe surcharge. Spot pricing available where capacity permits.



Quick Setup Guide

  1. Provision a Spheron bare-metal GPU node with NVMe

    Log into app.spheron.ai and select an H100 SXM5, H200, or B200 bare-metal instance. All Spheron bare-metal nodes include local NVMe storage at no added cost. SSH into the node after provisioning.

  2. Verify NVMe device and mount point

    Run lsblk and nvme list to confirm the NVMe device. Check its mount point with df -h or mount | grep nvme. Common paths are /nvme, /mnt/nvme, or /data. Create the cache directory: mkdir -p <nvme-mount>/kvcache

  3. Install vLLM and LMCache

    pip install vllm lmcache. Verify: python -c 'import vllm; print(vllm.__version__)' and python -c 'import lmcache; print(lmcache.__version__)'. Check docs.lmcache.ai for version compatibility.

  4. Configure LMCache disk backend

    Create lmcache_config.yaml with: local_disk: true, local_disk_path: <nvme-mount>/kvcache, max_local_disk_size: <available_GB>, local_cpu: false, max_local_cpu_size: 0

  5. Launch vLLM with ICMSP-compatible flags

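    Export LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml. Add to vLLM launch: --kv-cache-dtype fp8 --enable-prefix-caching --gpu-memory-utilization 0.90 --swap-space 16, plus --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' so vLLM actually attaches LMCache (confirm the connector name at docs.lmcache.ai for your version). Monitor kv_cache_usage_perc at the /metrics endpoint.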

  6. Validate offloading and measure TTFT improvement

    Send a long-prefix request, then resend the same prefix. Confirm du -sh <nvme-mount>/kvcache grows after the first request and that TTFT drops on the second. Compare tokens/second before and after to measure your actual throughput gain.

  7. Optional: Enable SGLang with prefix caching

    For SGLang, launch with --mem-fraction-static 0.85. RadixAttention prefix caching is on by default, so no enable flag is needed, and it manages KV cache sharing automatically. See the SGLang Production Deployment Guide for the full flag reference.


Frequently Asked Questions

What is NVIDIA ICMSP?

ICMSP (Inference Context Memory Storage Platform) is NVIDIA's standardized framework for offloading KV cache blocks from GPU HBM to CPU DRAM and NVMe SSD. Announced at CES 2026, it uses cuFile for GPU-direct NVMe transfers and BlueField-4 DPUs to move data off the GPU compute path, freeing SMs for matrix math rather than memory management.

How is ICMSP different from LMCache?

LMCache is a software-only library that bolts onto vLLM or SGLang to serialize KV blocks to configurable backends (CPU DRAM, local disk, Redis). ICMSP is a hardware-software co-design: it uses BlueField-4 DPUs and GPU-direct NVMe via cuFile so the KV block movement runs on the DPU fabric rather than on the CPU. In practice, both implement the same three-tier storage hierarchy (GPU HBM, CPU DRAM, NVMe), but ICMSP's DPU path moves I/O scheduling entirely off the CPU, which reduces host-side latency spikes.

Which storage partners and NVMe drives work with ICMSP?

NVIDIA's announced ICMSP ecosystem partners are storage-system builders: AIC, Cloudian, DDN, Dell, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA. For the underlying drives, Samsung PM9A3, Micron 9400 Pro, Seagate Nytro XD6510, and Kioxia CD8 are enterprise NVMe SSDs commonly used in cuFile and GPUDirect Storage deployments and meet the interface requirements (PCIe 4.0 or 5.0 x4, NVMe 1.4+). Consumer drives are excluded because they lack the endurance ratings for sustained inference write amplification.

Where does the 5x performance claim come from?

NVIDIA's CES 2026 announcement cited 5x tokens-per-second improvement for long-context inference workloads (1M-token windows) and 5x power efficiency. The tokens-per-second gain comes primarily from avoiding KV cache recomputation: without offloading, a 1M-token prefix must be recomputed from scratch on every new request, consuming the GPU's entire compute budget. With NVMe tiering, the KV blocks are loaded from SSD in ~8 seconds rather than recomputed over ~90 seconds. The power efficiency gain reflects that I/O-bound KV loads consume far less power than compute-bound prefill runs.

Can I run ICMSP-style KV offloading on Spheron today?

Spheron bare-metal H100, H200, and B200 instances include local NVMe SSDs as part of the instance at no additional cost. The software components - vLLM with prefix caching and LMCache with a local disk backend - are available today. Full BlueField-4 DPU hardware acceleration requires DPU-equipped nodes, which are on NVIDIA's H2 2026 hardware roadmap. Until then, the storage tiering layer (GPU HBM to CPU DRAM to NVMe) can be validated immediately using LMCache on any Spheron bare-metal node. This guide demonstrates exactly that software-only tiering pattern, which ICMSP will hardware-accelerate when DPU nodes ship.

How does ICMSP compare to DeepSeek's NVMe system, LMCache, and Mooncake?

DeepSeek's internal system (used for their R1 and V3 deployments) uses a similar three-tier approach but is entirely proprietary software with no public API. ICMSP is NVIDIA's attempt to standardize and hardware-accelerate the same pattern across multiple inference frameworks. LMCache and Mooncake are open-source software alternatives that implement the tiering logic without DPU hardware acceleration.
