Deploy LMCache on GPU Cloud: Share KV Cache Across vLLM Nodes for 15x Higher Throughput (2026 Guide)

A Llama 3.1 70B request with a 128K-token system prompt spends 11 seconds on prefill before generating a single output token. If 1,000 users send the same system prompt, the cluster recomputes it 1,000 times, burning GPU compute that could be spent on decode. LMCache solves this by externalizing the KV cache so vLLM workers share computed KV state across nodes.

For KV cache fundamentals like PagedAttention, FP8/NVFP4, and CPU swap space, start with the KV Cache Optimization Guide. For single-node NVMe tiering, see NVMe KV cache offloading with LMCache. This post covers the multi-node deployment path.

LMCache graduated to production in January 2026 and is now used by Google Cloud GKE Inference, CoreWeave, and Cohere. The vLLM Production Stack integration ships in LMCache 0.4+.

What LMCache Does (and Where vLLM Prefix Caching Falls Short)

vLLM's automatic prefix caching (the --enable-prefix-caching flag) works well for a single worker. It finds the longest cached prefix for an incoming request and skips prefill for that portion. The catch: each replica in a multi-replica deployment maintains its own local cache. A request routed to worker B gets no benefit from what worker A already computed, even if worker A ran the same 128K system prompt an hour ago.

LMCache works as a KV connector and tiered storage layer that operates alongside vLLM's memory manager. It serializes KV tensors into a tiered store (GPU HBM, CPU DRAM, NVMe, Redis) and exposes them to any worker that receives the same token prefix. Worker B reads the cached KV blocks from the shared store rather than recomputing them.
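The shared lookup described above can be pictured with a toy model. This is a sketch for intuition only, not LMCache's actual implementation: the function names, the SHA-256 prefix hashing, and the in-memory dict standing in for the remote tier are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 256  # tokens per chunk, mirroring chunk_size in lmcache.yaml


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """One key per full chunk; each key hashes the entire prefix up to that chunk."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        keys.append(running.copy().hexdigest())
    return keys


def cached_prefix_len(token_ids: List[int], store: Dict[str, bytes],
                      chunk_size: int = CHUNK_SIZE) -> int:
    """Number of leading tokens whose KV blocks are already in the store."""
    hit = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in store:
            break
        hit += chunk_size
    return hit


shared_store: Dict[str, bytes] = {}  # hypothetical stand-in for the remote tier

system_prompt = list(range(1000))           # 1000 shared prompt tokens
request_a = system_prompt + [7, 7, 7]       # arrives at worker A
request_b = system_prompt + [9, 9, 9, 9]    # arrives at worker B

# Worker A misses, runs prefill, and publishes every full chunk.
for key in chunk_keys(request_a):
    shared_store[key] = b"<serialized KV block>"

# Worker B only has to prefill the tokens past the cached prefix.
reusable = cached_prefix_len(request_b, shared_store)
print(f"worker B reuses {reusable} of {len(request_b)} tokens")
```

Because each key covers the whole prefix rather than a single chunk in isolation, two requests only share a key when every earlier token matches as well, which is the property a prefix cache needs.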

| Feature | vLLM Prefix Cache | LMCache |
|---|---|---|
| Scope | Single node | Multi-node cluster |
| Storage | GPU HBM only | GPU, CPU, NVMe, Redis |
| TTFT benefit | ~0% (same request re-sent to different worker) | 11s to 1.5s at 128K context |
| Throughput gain | 1x | Up to 15x on chatbot workloads |
| Survives spot restarts | No (cache lost on shutdown) | Yes (remote tier persists) |

LMCache Architecture

LMCache organizes cache storage into four tiers, ordered by latency:

Tier 0: GPU HBM. Hot cache, sub-millisecond access. Limited by VRAM (80 GB on H100 SXM5). LMCache manages this automatically alongside the active KV pages.

Tier 1: CPU DRAM. Warm cache, roughly 5 microseconds access latency, 256-512 GB typical per node. Bandwidth is constrained to PCIe (around 32 GB/s peak).

Tier 2: Local NVMe. Cold cache, 100-500 microsecond access, 2-4 TB capacity. At 6-12 GB/s sequential throughput on fast NVMe, this tier can feed a 70B model's prefill miss rate at reasonable concurrency.

Tier 3: Remote shared store (Redis or Infinispan). Shared across all nodes. On 25 GbE, you get roughly 3 GB/s effective bandwidth. On RDMA/RoCEv2, up to 20 GB/s. For RDMA networking options, see the GPU Networking: InfiniBand, RoCE, and Spectrum-X guide.

 vLLM Worker A          vLLM Worker B
 [GPU HBM]             [GPU HBM]
    |                     |
 [CPU DRAM]            [CPU DRAM]
    |                     |
 [Local NVMe]          [Local NVMe]
    |                     |
    +------- Redis --------+
         (shared tier)

| Tier | Storage | Latency | Bandwidth | Cost/GB/mo | Best For |
|---|---|---|---|---|---|
| GPU HBM | 80 GB H100 | < 1 µs | 3.35 TB/s | ~$30 | Hot prefixes for active requests |
| CPU DRAM | 256-512 GB | 5-50 µs | ~32 GB/s | ~$0.50 | Recent session prefixes |
| NVMe SSD | 2-4 TB | 100-500 µs | 6-12 GB/s | ~$0.10 | Long-tail and batch workloads |
| Redis (25 GbE) | Unlimited | 500-2000 µs | ~3 GB/s | ~$0.05 | Shared cross-node cache |
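To judge whether a tier can load a prefix faster than the GPU can recompute it, a back-of-the-envelope transfer estimate helps. The sketch below assumes the public Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and the bandwidth figures from the table above; the helper names are hypothetical. It ignores compression, pipelined loading, and the fact that tensor-parallel ranks can load their shards in parallel, so treat the result as a rough upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int


# Assumed from the public Llama 3.1 70B model config.
LLAMA_31_70B = ModelShape(layers=80, kv_heads=8, head_dim=128)

# Bandwidths in GB/s, taken from the tier table and the RDMA figure above.
TIER_BANDWIDTH_GB_S = {
    "CPU DRAM": 32.0,
    "NVMe SSD": 6.0,
    "Redis (25 GbE)": 3.0,
    "Redis (RDMA)": 20.0,
}


def kv_bytes_per_token(shape: ModelShape, dtype_bytes: int) -> int:
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * dtype_bytes


def load_seconds(tokens: int, bandwidth_gb_s: float,
                 shape: ModelShape = LLAMA_31_70B, dtype_bytes: int = 1) -> float:
    """Time to move the KV cache for `tokens` tokens at the given bandwidth."""
    total_bytes = tokens * kv_bytes_per_token(shape, dtype_bytes)
    return total_bytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    tokens = 128_000  # prefix length to evaluate
    for tier, bandwidth in TIER_BANDWIDTH_GB_S.items():
        # dtype_bytes=1 corresponds to --kv-cache-dtype fp8
        print(f"{tier}: {load_seconds(tokens, bandwidth):.2f} s")
```

Comparing these numbers with the measured cold-prefill time for the same prefix length shows which tiers are worth enabling for a given workload.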

Prerequisites and Hardware Sizing

Spheron H100 and H200 instances include NVMe SSD storage at no extra cost. Current pricing from the Spheron API:

| GPU | VRAM | Best For | On-Demand | Spot |
|---|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | Llama 3.1 70B (TP4), Mistral 22B | $5.01/hr | $1.43/hr |
| H200 SXM5 | 141 GB HBM3e | Llama 3.1 405B (TP4), long-context | $5.86/hr | $3.31/hr |
| A100 80GB SXM4 | 80 GB HBM2e | Budget option, FP16 only | $1.69/hr | $0.82/hr |

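NVMe requirement: minimum 2 TB with 6+ GB/s sequential read. If NVMe bandwidth falls below that, the disk tier becomes the bottleneck before the GPU does. Once the drive is mounted (Step 1 mounts it at /mnt/nvme), check with fio --name=read-test --directory=/mnt/nvme --rw=read --bs=1M --size=10G --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting. The libaio engine and direct I/O are needed for the queue depth to take effect and to keep the page cache from inflating the result.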

On Spheron H100 SXM5 and H200 instances, bare-metal access means no hypervisor overhead on NVMe or network I/O. For a baseline vLLM setup on Spheron before adding LMCache, see the Spheron vLLM deployment guide.

Pricing fluctuates based on GPU availability. The prices above are as of 11 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Step 1: Deploy LMCache with vLLM on a Single Node

Install dependencies:

bash
pip install lmcache vllm

Mount NVMe:

bash
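# Create the mount point, mount the drive, then create the cache directory on it.
# Assumes /dev/nvme0n1 already holds a filesystem.
sudo mkdir -p /mnt/nvme
sudo mount /dev/nvme0n1 /mnt/nvme
sudo mkdir -p /mnt/nvme/kvcache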

Create lmcache.yaml:

yaml
chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000  # GB
max_local_cpu_size: 128    # GB
save_decode_cache: true

Launch vLLM with LMCache:

bash
LMCACHE_CONFIG_FILE=lmcache.yaml \
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

The --kv-transfer-config flag wires LMCache as the KV connector. All existing vLLM config flags work normally alongside it.

Test with a long system prompt:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "system", "content": "<128K system prompt here>"},
      {"role": "user", "content": "What are the key points?"}
    ]
  }'

Check metrics:

bash
curl -s http://localhost:8000/metrics | grep lmcache
# lmcache_cache_hit_rate 0.0 (first request, cold)
# lmcache_cache_hit_rate 0.95 (subsequent requests)
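For automated checks, the same metrics can be scraped and filtered in a few lines. This is a minimal sketch using only the standard library; parse_metrics and fetch_metrics are hypothetical helpers, the metric name is the one shown above, and the parser assumes plain Prometheus text lines without trailing timestamps.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str, prefix: str = "lmcache") -> Dict[str, float]:
    """Extract samples whose metric name starts with `prefix` from Prometheus text."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        if not name_and_labels.startswith(prefix):
            continue
        try:
            samples[name_and_labels] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> Dict[str, float]:
    """Scrape the vLLM metrics endpoint and keep only the LMCache samples."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


sample = """# HELP lmcache_cache_hit_rate Fraction of tokens served from cache
lmcache_cache_hit_rate 0.95
vllm:num_requests_running 2.0
"""
print(parse_metrics(sample))
```

Calling fetch_metrics() before and after a repeated request gives a quick pass/fail signal that the second request was served from cache.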

Step 2: Multi-Node KV Cache Sharing with vLLM Production Stack on Kubernetes

Option A: P2P Sharing Over Redis

Start a Redis container on a dedicated node (or co-located on the first worker):

bash
docker run -d -p 6379:6379 --name lmcache-redis redis:7

Add the remote_url to lmcache.yaml on each worker node:

yaml
chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000
max_local_cpu_size: 128
remote_url: redis://REDIS_NODE_IP:6379
save_decode_cache: true

All workers read from and write to the same Redis instance. The first worker to compute a prefix writes it to Redis; all subsequent workers on any node read it from there.
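Before launching workers, it can help to confirm that each node can actually reach the endpoint named in remote_url. The following is a minimal sketch using only the standard library; redis_ping is a hypothetical helper, and it assumes an unauthenticated Redis that accepts the inline PING command, as in the container started above.

```python
import socket
from urllib.parse import urlparse


def redis_ping(remote_url: str, timeout: float = 2.0) -> bool:
    """Return True if a Redis server at remote_url answers PING with PONG."""
    parsed = urlparse(remote_url)
    if parsed.scheme != "redis" or not parsed.hostname:
        raise ValueError(f"expected redis://host:port, got {remote_url!r}")
    port = parsed.port or 6379  # Redis default port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")                   # inline command form
            return sock.recv(64).startswith(b"+PONG")   # simple-string reply
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Example (run on each worker node, with the real address substituted):
# redis_ping("redis://REDIS_NODE_IP:6379")
```

A False result on any worker usually points to a firewall rule or a Redis instance bound only to localhost, either of which would silently reduce that worker to its local tiers.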

Option B: vLLM Production Stack on Kubernetes

For Kubernetes setup, use the same kubectl and Helm workflow as the llm-d Kubernetes disaggregated inference guide. Provision 2+ H100 nodes on Spheron, then:

1. Install GPU Operator and Redis:

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install nvidia-gpu-operator nvidia/gpu-operator

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --set auth.enabled=false

2. Create values.yaml for the vLLM Production Stack:

yaml
model:
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
  tensorParallelSize: 4
  kvCacheDtype: fp8
  gpuMemoryUtilization: 0.92

lmcache:
  enabled: true
  config:
    chunk_size: 256
    local_cpu: true
    local_disk: /mnt/nvme/kvcache
    max_local_disk_size: 2000
    max_local_cpu_size: 128
    remote_url: redis://redis-master:6379
    save_decode_cache: true

replicas: 2

resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4

volumeMounts:
  - name: nvme-storage
    mountPath: /mnt/nvme/kvcache

volumes:
  - name: nvme-storage
    hostPath:
      path: /mnt/nvme/kvcache
      type: DirectoryOrCreate

3. Deploy:

bash
helm repo add lmcache https://charts.lmcache.ai
helm install vllm-stack lmcache/vllm-stack -f values.yaml

4. Verify cache sharing:

Send the same prompt to both workers back-to-back:

bash
# Send to worker 1 (cold cache - expect ~11s TTFT)
curl http://worker1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [...]}'

# Send to worker 2 (warm cache via Redis - expect ~1.5s TTFT)
curl http://worker2:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [...]}'
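Plain curl reports total request time rather than time to first token. One way to measure TTFT directly is to stream the response and time the first non-empty delta. The sketch below uses only the standard library and assumes the OpenAI-compatible streaming format that vLLM exposes; build_payload, delta_text, and measure_ttft are hypothetical helper names, and the worker hostnames are the placeholders used above.

```python
import json
import time
import urllib.request
from typing import Optional


def build_payload(model: str, system_prompt: str, question: str) -> dict:
    """Chat request that streams, so the first token can be timed."""
    return {
        "model": model,
        "stream": True,
        "max_tokens": 16,  # a few tokens are enough for a TTFT measurement
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    }


def delta_text(sse_line: bytes) -> Optional[str]:
    """Return the generated text carried by one server-sent-event line, if any."""
    line = sse_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content") or None


def measure_ttft(base_url: str, payload: dict, timeout: float = 300.0) -> float:
    """Seconds from sending the request to receiving the first non-empty token."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            if delta_text(raw_line):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")


# Example usage against the two workers:
# payload = build_payload("meta-llama/Meta-Llama-3.1-70B-Instruct", prompt, "What are the key points?")
# cold = measure_ttft("http://worker1:8000", payload)
# warm = measure_ttft("http://worker2:8000", payload)
```

If the warm measurement on the second worker is close to the cold one, the workers are most likely not sharing a remote tier.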

Benchmarks: TTFT and Throughput Before/After LMCache

Table 1: Multi-turn QA TTFT (Llama 3.1 70B, H100 SXM5, 10-turn conversation, average 2K tokens per turn)

| Round | Without LMCache | With LMCache | Speedup |
|---|---|---|---|
| Turn 1 (cold) | 4.2s | 4.2s | 1x |
| Turn 5 | 8.1s | 1.8s | 4.5x |
| Turn 10 | 11.3s | 1.5s | 7.5x |

Turn 1 is identical because both paths compute the full prefill. By Turn 10, the growing conversation history dominates TTFT, and LMCache serves the accumulated prefix from cache.

Table 2: 128K system prompt TTFT (Llama 3.1 70B, single H100 SXM5)

| Scenario | TTFT | Notes |
|---|---|---|
| No cache (every request cold) | 11.2s | Full prefill every request |
| vLLM prefix cache (same worker) | 1.6s | Hit rate ~60% in practice (routing-dependent) |
| LMCache (multi-node) | 1.5s | Hit rate 95%+ with sticky routing disabled |

The vLLM-only prefix cache achieves 1.6s but only for requests that land on the same worker that computed the prefix. With multiple replicas and round-robin routing, that happens about 60% of the time. LMCache raises that to 95%+ because all workers share the same Redis-backed cache.
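The hit rate matters because the TTFT a user sees on average is a blend of the hit and miss cases. A short worked calculation, using the figures from Table 2 and a hypothetical helper named expected_ttft:

```python
def expected_ttft(hit_rate: float, hit_ttft_s: float, miss_ttft_s: float) -> float:
    """Average TTFT when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ttft_s + (1.0 - hit_rate) * miss_ttft_s


# Table 2 figures: 11.2 s cold, 1.6 s on a local hit, 1.5 s on a shared hit.
local_only = expected_ttft(0.60, hit_ttft_s=1.6, miss_ttft_s=11.2)
shared = expected_ttft(0.95, hit_ttft_s=1.5, miss_ttft_s=11.2)

print(f"vLLM prefix cache only: {local_only:.2f} s average TTFT")
print(f"LMCache shared tier:    {shared:.2f} s average TTFT")
```

Under these assumptions the per-hit latency is nearly identical in both setups, so almost all of the difference in average TTFT comes from the higher hit rate.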

Table 3: Throughput at 80% system-prompt reuse (chatbot workload, 4x H100 SXM5 cluster)

| Concurrency | Baseline req/s | LMCache req/s | Gain |
|---|---|---|---|
| 10 | 0.8 | 6.2 | 7.8x |
| 50 | 1.1 | 11.4 | 10.4x |
| 200 | 1.4 | 21.3 | 15.2x |

The 15x gain applies to high-concurrency, high-reuse workloads. At low concurrency, the prefill bottleneck is less severe and the gain is 7-8x. Single-user sequential workloads also see 7-8x.

Figures above are representative estimates based on LMCache benchmarking methodology. Actual results depend on model, hardware, concurrency, and workload characteristics. The 15x figure is the high-concurrency (200 req), high-reuse (80%) upper bound; LMCache's production benchmarks on smaller workloads report 4-5x TTFT improvement.

Storage Tiering in Practice: When to Use NVMe vs CPU vs Remote

GPU HBM hot tier: Always active. LMCache manages it automatically alongside the active KV pages.

CPU DRAM warm tier: Enable when CPU RAM is 256 GB or more. Set local_cpu: true in lmcache.yaml. This tier captures recently-evicted HBM blocks without the NVMe round-trip.

NVMe cold tier: Enable when disk throughput exceeds 4 GB/s sequential. Set local_disk and max_local_disk_size. Run fio to verify bandwidth before relying on this tier in production. SATA SSDs (500 MB/s) are too slow to serve as an effective cold tier.

Remote Redis: Enable when running two or more workers. This is the tier that makes cross-node KV cache sharing work. Without it, each node only benefits its own local cache.

For a detailed breakdown of NVMe tiering requirements, bandwidth math, and ICMSP comparison, see NVMe KV cache offloading with LMCache.
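The tier rules in this section can be collapsed into a small helper that emits the matching lmcache.yaml keys. This is a sketch under stated assumptions: build_lmcache_config is a hypothetical function, the thresholds are the ones given in this section (256 GB of CPU RAM, more than 4 GB/s of sequential NVMe throughput, two or more workers), and the keys and default sizes mirror the example config from Step 1.

```python
from typing import Dict, Optional, Union

ConfigValue = Union[bool, int, str]


def build_lmcache_config(
    cpu_ram_gb: int,
    nvme_seq_read_gb_s: float,
    num_workers: int,
    redis_url: Optional[str] = None,
    disk_path: str = "/mnt/nvme/kvcache",
    max_disk_gb: int = 2000,
    max_cpu_gb: int = 128,
) -> Dict[str, ConfigValue]:
    """Pick cache tiers from node specs, following the rules in this section."""
    config: Dict[str, ConfigValue] = {"chunk_size": 256, "save_decode_cache": True}

    # CPU DRAM warm tier: enable when the node has 256 GB of RAM or more.
    config["local_cpu"] = cpu_ram_gb >= 256
    if config["local_cpu"]:
        config["max_local_cpu_size"] = max_cpu_gb

    # NVMe cold tier: enable only when sequential throughput exceeds 4 GB/s.
    if nvme_seq_read_gb_s > 4.0:
        config["local_disk"] = disk_path
        config["max_local_disk_size"] = max_disk_gb

    # Remote tier: required for cross-node sharing with two or more workers.
    if num_workers >= 2:
        if redis_url is None:
            raise ValueError("redis_url is required when num_workers >= 2")
        config["remote_url"] = redis_url

    return config


for key, value in build_lmcache_config(512, 7.0, 2, "redis://REDIS_NODE_IP:6379").items():
    print(f"{key}: {value}")
```

Keeping the thresholds in one place makes it harder for nodes with slow disks to end up with a cold tier that hurts more than it helps.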

Cost math for a 4x H100 SXM5 cluster at $1.43/hr spot per GPU ($5.72/hr total, or about $0.001589 per cluster-second):

Without LMCache, 1,000 requests with a 128K-token system prompt each cost roughly:

  • 1,000 × 11s × $0.001589/cluster-second = $17.48 in prefill compute

With LMCache at 80% cache hit rate (800 hits at 1.5s, 200 cold at 11s):

  • 200 × 11s + 800 × 1.5s = 2,200 + 1,200 = 3,400 cluster-seconds
  • 3,400 × $0.001589 = $5.40

That is a 69% reduction in prefill cost for the same 1,000 requests. At scale, the savings compound.
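The same arithmetic as a short script, which makes it easy to plug in a different hit rate or price. The helper name prefill_cost is hypothetical; the inputs are the figures used above. The script divides the hourly price exactly, so its output can differ by a cent from figures computed with a rounded per-second rate.

```python
GPU_SPOT_USD_PER_HOUR = 1.43   # H100 SXM5 spot price used in this guide
GPUS_IN_CLUSTER = 4

cluster_usd_per_second = GPU_SPOT_USD_PER_HOUR * GPUS_IN_CLUSTER / 3600


def prefill_cost(requests: int, hit_rate: float,
                 cold_s: float = 11.0, warm_s: float = 1.5) -> float:
    """Prefill cost in USD for `requests` requests at the given cache hit rate."""
    hits = round(requests * hit_rate)
    cold = requests - hits
    cluster_seconds = cold * cold_s + hits * warm_s
    return cluster_seconds * cluster_usd_per_second


baseline = prefill_cost(1000, hit_rate=0.0)
cached = prefill_cost(1000, hit_rate=0.8)
saving = 1 - cached / baseline

print(f"without LMCache: ${baseline:.2f}")
print(f"with LMCache:    ${cached:.2f}")
print(f"reduction:       {saving:.0%}")
```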

Running LMCache on Spheron: Sizing, Pricing, and Pitfalls

A reference sizing for the Llama 3.1 70B deployment used in this guide:

  • 4x H100 SXM5 (primary inference): 4x GPU × $1.43/hr = $5.72/hr spot
  • 512 GB CPU RAM per node for the warm tier
  • 4 TB NVMe per node for the cold tier
  • 1 dedicated instance for Redis (any CPU-only node works)

Spot vs on-demand

Use on-demand for primary vLLM inference nodes where SLA guarantees matter. Use spot for cache worker nodes (CPU-heavy Redis, NVMe offload workers) where a restart just means a brief cache cold-start. The shared Redis tier survives spot interruption when backed by persistent storage.

Common pitfalls

NVMe IOPS vs bandwidth: Many cloud NVMe configs advertise high IOPS but have limited sequential throughput. KV cache loads are large sequential reads, not random 4K IOPS. Run fio to verify. SATA-backed SSDs are a common trap.

Redis network: If Redis is on a 1 GbE link, it becomes the bottleneck at ~125 MB/s. Plan for 25 GbE or RDMA between the cache workers and Redis. For RDMA setup, see the GPU networking guide.

Cache fragmentation: Very long sequences (over 128K tokens) can fragment the chunk index. Use chunk_size: 512 for long-context workloads instead of the default 256.

KV dtype consistency: Set --kv-cache-dtype consistently across all vLLM workers in a cluster. LMCache serializes KV blocks using the dtype vLLM passes down, so mixing fp8 and fp16 workers will cause cache misses because the serialized tensors won't match.
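A pre-flight check can catch a dtype mismatch before it shows up as an unexplained drop in hit rate. The sketch below compares the --kv-cache-dtype value across each worker's launch arguments. The helper names and the worker names are hypothetical, and it assumes that a worker launched without the flag uses vLLM's default value of auto.

```python
from typing import Dict, List


def kv_dtype_of(args: List[str], default: str = "auto") -> str:
    """Read --kv-cache-dtype from a vLLM argument list (both flag forms)."""
    for i, arg in enumerate(args):
        if arg == "--kv-cache-dtype" and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--kv-cache-dtype="):
            return arg.split("=", 1)[1]
    return default


def mismatched_workers(launch_args: Dict[str, List[str]]) -> Dict[str, str]:
    """Return each worker's dtype if they disagree, or an empty dict if consistent."""
    dtypes = {name: kv_dtype_of(args) for name, args in launch_args.items()}
    if len(set(dtypes.values())) <= 1:
        return {}
    return dtypes


cluster = {
    "worker-a": ["--tensor-parallel-size", "4", "--kv-cache-dtype", "fp8"],
    "worker-b": ["--tensor-parallel-size", "4"],  # flag forgotten on this node
}
print(mismatched_workers(cluster))
```

The same pattern extends to chunk_size, which also has to match across workers for their cache keys to line up.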

For the broader architectural picture of the prefill bottleneck that LMCache addresses, see prefill-decode disaggregation on GPU cloud. For a Kubernetes-native approach to KV cache-aware routing, see the llm-d guide.


LMCache's disk-tier KV cache offload is most effective on instances with high-bandwidth NVMe and large CPU DRAM, which is exactly the config available on Spheron H100 SXM5 nodes. Multi-node P2P sharing works across Spheron instances in the same region over 25 GbE.

Quick Setup Guide

  1. Provision GPU instances on Spheron with NVMe storage

    Log in to app.spheron.ai and deploy at least one H100 SXM5 instance. Select a configuration with 2+ TB NVMe and 256+ GB CPU RAM. For multi-node P2P sharing, provision two or more H100 nodes in the same availability region.

  2. Install LMCache and vLLM

    Run pip install lmcache vllm. LMCache 0.4+ (January 2026 release) ships with the vLLM Production Stack integration. Verify the installed version with: python -c 'import lmcache; print(lmcache.__version__)'

  3. Write the lmcache.yaml config file

    Create a lmcache.yaml specifying chunk_size (default 256 tokens), local_cpu (True/False), local_disk (e.g. /mnt/nvme/kvcache), max_local_disk_size (in GB), remote_url (Redis endpoint for multi-node), and save_decode_cache (True for chatbot workloads).

  4. Start vLLM with LMCache backend

    Run: LMCACHE_CONFIG_FILE=lmcache.yaml python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4 --kv-cache-dtype fp8 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'. The --kv-transfer-config flag wires LMCache as the KV connector, routing cache reads and writes through LMCache's tiered storage.

  5. Deploy multi-node sharing with Redis backend

    Start a Redis container: docker run -d -p 6379:6379 redis:7. Update lmcache.yaml remote_url to redis://YOUR_REDIS_IP:6379. On each vLLM node, set the same remote_url so all workers read from and write to the shared cache.

  6. Validate cache hit rate with a benchmark request

    Send the same system prompt once to each of two different workers and compare TTFT. The first request should take about 11s, and the second about 1.5s (served from cache). Check LMCache metrics: curl http://localhost:8000/metrics | grep lmcache_cache_hit_rate

Frequently Asked Questions

How is LMCache different from vLLM's built-in prefix caching?

vLLM prefix caching reuses KV state within a single node. LMCache adds a distributed cache layer so multiple vLLM workers on different machines share the same KV cache over Ethernet, RDMA, or NVLink, which eliminates redundant prefill across nodes in a multi-replica cluster.

What hardware does LMCache need?

For the hot tier: H100 or H200 GPUs with 80-141 GB HBM. For the warm tier: 256-512 GB CPU DRAM per node. For the disk tier: 2-4 TB NVMe SSD with at least 6 GB/s sequential read bandwidth. For the shared tier: a Redis or Infinispan backend on a low-latency network.

Where does the 15x throughput gain come from?

On workloads with repeated system prompts (chatbots, RAG, agents), prefill is the bottleneck. LMCache stores the KV tensors after the first prefill and serves them from cache on subsequent requests. On a Llama 3.1 70B chatbot with a 128K system prompt, this cuts TTFT from 11s to 1.5s and lets the GPUs handle up to 15x more requests per second instead of recomputing the same prefill.

Can LMCache run on spot instances?

Cache worker nodes (which store KV on CPU/NVMe) can run on spot if the LMCache backend is backed by a persistent Redis. The shared tier survives spot interruption because Redis persists the serialized KV tensors. Prefill and primary decode GPUs should be on-demand for SLA guarantees.

How does LMCache handle eviction, and how should the tiers be sized?

LMCache uses a configurable eviction policy (LRU by default). When a tier fills up, the least-recently-used KV blocks are evicted to the next tier or dropped. You configure max_local_disk_size, max_local_cpu_size, and remote_url in the lmcache.yaml config file. For long-context workloads, size the disk tier to hold at least 10x the KV footprint of the average number of active sessions.
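The eviction behavior described above can be illustrated with a toy model. This is a sketch of LRU eviction with demotion to the next tier, not LMCache's implementation: the LruTier class is hypothetical, and capacity is counted in blocks rather than gigabytes to keep the example short.

```python
from collections import OrderedDict
from typing import Optional


class LruTier:
    """A cache tier that demotes its least-recently-used block when full."""

    def __init__(self, name: str, capacity: int, next_tier: Optional["LruTier"] = None):
        self.name = name
        self.capacity = capacity      # maximum number of blocks held
        self.next_tier = next_tier    # slower tier that receives evicted blocks
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, block: bytes) -> None:
        self.blocks[key] = block
        self.blocks.move_to_end(key)  # mark as most recently used
        while len(self.blocks) > self.capacity:
            old_key, old_block = self.blocks.popitem(last=False)  # oldest entry
            if self.next_tier is not None:
                self.next_tier.put(old_key, old_block)  # demote
            # With no next tier, the block is simply dropped.

    def get(self, key: str) -> Optional[bytes]:
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return self.blocks[key]
        if self.next_tier is not None:
            return self.next_tier.get(key)  # fall through to the slower tier
        return None


disk = LruTier("disk", capacity=2)
cpu = LruTier("cpu", capacity=2, next_tier=disk)

for key, block in [("a", b"A"), ("b", b"B"), ("c", b"C")]:
    cpu.put(key, block)

print(list(cpu.blocks), list(disk.blocks))
```

After the third insert, the oldest block has moved from the CPU tier to the disk tier, and a lookup for it still succeeds by falling through.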
