Deploy LMCache on GPU Cloud: Share KV Cache Across vLLM Nodes for 15x Higher Throughput (2026 Guide)

Q: What is LMCache and how does it differ from vLLM's built-in prefix caching?

vLLM prefix caching reuses KV state within a single node. LMCache adds a distributed cache layer so multiple vLLM workers on different machines share the same KV cache over Ethernet, RDMA, or NVLink, which eliminates redundant prefill across nodes in a multi-replica cluster.

Q: What hardware is needed to run LMCache in production?

For the hot tier: H100 or H200 GPUs with 80-141 GB HBM. For the warm tier: 256-512 GB CPU DRAM per node. For the disk tier: 2-4 TB NVMe SSD with at least 6 GB/s sequential read bandwidth. For the shared tier: a Redis or Infinispan backend on a low-latency network.

Q: How does LMCache achieve 15x throughput improvement?

On workloads with repeated system prompts (chatbots, RAG, agents), prefill is the bottleneck. LMCache stores the KV tensors after the first prefill and serves them from cache on subsequent requests. On a Llama 3.1 70B chatbot with a 128K-token system prompt, this cuts TTFT from 11s to 1.5s and frees the GPUs to serve up to 15x more requests per second instead of recomputing the same prefill.

Q: Can I run LMCache workers on spot instances?

Cache worker nodes (which store KV on CPU/NVMe) can run on spot if the LMCache backend is backed by a persistent Redis. The shared tier survives spot interruption because Redis persists the serialized KV tensors. Prefill and primary decode GPUs should be on-demand for SLA guarantees.

Q: How does LMCache handle cache eviction?

LMCache uses a configurable eviction policy (LRU by default). When a tier fills up, the least-recently-used KV blocks are evicted to the next tier, or dropped if there is no lower tier. You configure max_local_disk_size, max_local_cpu_size, and remote_url in the lmcache.yaml config file. For long-context workloads, size the disk tier to hold at least 10x the KV footprint of your average number of active sessions.
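
To make the evict-to-next-tier behavior concrete, here is a minimal Python sketch of a tiered LRU cache. It is an illustration of the idea only, not LMCache's implementation: the class name TieredLRU is hypothetical, and capacities are counted in entries rather than gigabytes to keep the example short.

```python
from collections import OrderedDict


class TieredLRU:
    """Toy tiered LRU: a full tier demotes its oldest entry to the next tier."""

    def __init__(self, capacities):
        # capacities[i] is the max number of entries tier i may hold (assumption:
        # entry counts stand in for the GB limits a real config would use).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value, tier=0):
        if tier >= len(self.tiers):
            return  # fell off the last tier: the entry is dropped
        store = self.tiers[tier]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        while len(store) > self.capacities[tier]:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self.put(old_key, old_value, tier + 1)

    def get(self, key):
        for store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value, 0)  # promote a hit back to the hottest tier
                return value
        return None


cache = TieredLRU([2, 2])  # e.g. a small "CPU" tier in front of a small "disk" tier
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
print(list(cache.tiers[0]), list(cache.tiers[1]))  # ['b', 'c'] ['a']
```

A hit in a lower tier promotes the entry back to tier 0, which may in turn demote something else. Real systems also have to weigh entry size and transfer cost, which this sketch ignores.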

A Llama 3.1 70B request with a 128K-token system prompt spends 11 seconds on prefill before generating a single output token. If 1,000 users send the same system prompt, the cluster recomputes it 1,000 times, burning GPU compute that could be spent on decode. LMCache solves this by externalizing the KV cache so vLLM workers share computed KV state across nodes.

For KV cache fundamentals like PagedAttention, FP8/NVFP4, and CPU swap space, start with the KV Cache Optimization Guide. For single-node NVMe tiering, see NVMe KV cache offloading with LMCache. This post covers the multi-node deployment path.

LMCache graduated to production in January 2026 and is now used by Google Cloud GKE Inference, CoreWeave, and Cohere. The vLLM Production Stack integration ships in LMCache 0.4+. For an in-depth look at how GKE Inference Gateway's routing layer works and how to replicate it outside GCP, see the GKE Inference Gateway explainer.

What LMCache Does (and Where vLLM Prefix Caching Falls Short)

vLLM's automatic prefix caching (the --enable-prefix-caching flag) works well for a single worker. It finds the longest cached prefix for an incoming request and skips prefill for that prefix. The catch: each replica in a multi-replica deployment maintains its own local cache. A request routed to worker B gets no benefit from what worker A already computed, even if they ran the same 128K system prompt an hour ago.

For workloads that are agentic rather than shared-prefix, Mooncake's KV pool approach routes cache differently: it transfers KV blocks from prefill to decode via a distributed store rather than broadcasting from one producer to many readers. The two tools solve different access patterns and can be combined.

LMCache works as a KV connector and tiered storage layer that operates alongside vLLM's memory manager. It serializes KV tensors into a tiered store (GPU HBM, CPU DRAM, NVMe, Redis) and exposes them to any worker that receives the same token prefix. Worker B reads the cached KV blocks from the shared store rather than recomputing them.
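
One way to picture how two workers recognize "the same token prefix" is chunked, chained hashing: split the token IDs into fixed-size chunks and derive each chunk's key from the previous key plus the chunk's tokens. The sketch below assumes that scheme for illustration; the function names, the namespace string, and the use of SHA-256 are hypothetical and not taken from LMCache's source.

```python
import hashlib


def chunk_keys(token_ids, chunk_size=256, namespace="llama-3.1-70b/fp8"):
    """Return one key per full chunk; each key depends on all tokens before it."""
    keys = []
    prev = namespace.encode()  # model and KV dtype are part of the key (assumption)
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        digest = hashlib.sha256(prev)
        digest.update(b"|" + ",".join(map(str, chunk)).encode())
        prev = digest.digest()
        keys.append(prev.hex())
    return keys


def shared_chunks(keys_a, keys_b):
    """Count how many leading chunk keys two requests have in common."""
    count = 0
    for a, b in zip(keys_a, keys_b):
        if a != b:
            break
        count += 1
    return count


system_prompt = list(range(1024))        # stand-in token IDs for a shared system prompt
request_a = system_prompt + [7, 7, 7]    # seen by worker A
request_b = system_prompt + [9] * 300    # seen by worker B
print(shared_chunks(chunk_keys(request_a), chunk_keys(request_b)))  # 4
```

Because the keys are deterministic, worker B can look up the four shared chunks in the common store without talking to worker A. Putting the dtype in the namespace also means an fp8 worker and an fp16 worker never collide on a key.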

Feature	vLLM Prefix Cache	LMCache
Scope	Single node	Multi-node cluster
Storage	GPU HBM only	GPU, CPU, NVMe, Redis
TTFT benefit	~0% (same request re-sent to different worker)	11s to 1.5s at 128K context
Throughput gain	1x	Up to 15x on chatbot workloads
Survives spot restarts	No (cache lost on shutdown)	Yes (remote tier persists)

LMCache Architecture

LMCache organizes cache storage into four tiers, ordered by latency:

Tier 0: GPU HBM. Hot cache, sub-millisecond access. Limited by VRAM (80 GB on H100 SXM5). LMCache manages this automatically alongside the active KV pages.

Tier 1: CPU DRAM. Warm cache, roughly 5 microseconds access latency, 256-512 GB typical per node. Bandwidth to the GPU is limited by PCIe (around 32 GB/s peak).

Tier 2: Local NVMe. Cold cache, 100-500 microsecond access, 2-4 TB capacity. At 6-12 GB/s sequential throughput on fast NVMe, this tier can feed a 70B model's prefill miss rate at reasonable concurrency.

Tier 3: Remote shared store (Redis or Infinispan). Shared across all nodes. On 25 GbE, you get roughly 3 GB/s effective bandwidth. On RDMA/RoCEv2, up to 20 GB/s. For RDMA networking options, see the GPU Networking: InfiniBand, RoCE, and Spectrum-X guide.

 vLLM Worker A          vLLM Worker B
 [GPU HBM]             [GPU HBM]
    |                     |
 [CPU DRAM]            [CPU DRAM]
    |                     |
 [Local NVMe]          [Local NVMe]
    |                     |
    +------- Redis --------+
         (shared tier)

Tier	Storage	Latency	Bandwidth	Cost/GB/mo	Best For
GPU HBM	80 GB H100	< 1 µs	3.35 TB/s	~$30	Hot prefixes for active requests
CPU DRAM	256-512 GB	5-50 µs	~32 GB/s	~$0.50	Recent session prefixes
NVMe SSD	2-4 TB	100-500 µs	6-12 GB/s	~$0.10	Long-tail and batch workloads
Redis (25 GbE)	Unlimited	500-2000 µs	~3 GB/s	~$0.05	Shared cross-node cache
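
To relate the bandwidth column to a real prefix, a back-of-envelope estimate helps. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and treats load time as size divided by bandwidth. It ignores tensor-parallel sharding, compression, and any overlap of transfer with compute, so read the results as rough bounds rather than measurements.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    """KV cache bytes for one token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def load_seconds(tokens, gb_per_s, dtype_bytes=1):
    """Time to move a prefix of `tokens` tokens at `gb_per_s` (decimal GB/s)."""
    return tokens * kv_bytes_per_token(dtype_bytes=dtype_bytes) / (gb_per_s * 1e9)


prefix_tokens = 128_000
size_gb = prefix_tokens * kv_bytes_per_token() / 1e9  # FP8 KV cache, 1 byte per value
print(f"FP8 KV size: {size_gb:.1f} GB")
for tier, bandwidth in [("CPU DRAM", 32), ("NVMe", 6), ("Redis 25 GbE", 3), ("RDMA", 20)]:
    print(f"{tier}: {load_seconds(prefix_tokens, bandwidth):.2f} s")
```

By this estimate a 128,000-token FP8 prefix is about 21 GB (twice that in FP16), which is why the link speed to each tier matters as much as its capacity.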

Prerequisites and Hardware Sizing

Spheron H100 and H200 instances include NVMe SSD storage at no extra cost. Current pricing from the Spheron API:

GPU	VRAM	Best For	On-Demand	Spot
H100 SXM5	80 GB HBM3	Llama 3.1 70B (TP4), Mistral 22B	$5.01/hr	$1.43/hr
H200 SXM5	141 GB HBM3e	Llama 3.1 405B (TP4), long-context	$5.86/hr	$3.31/hr
A100 80GB SXM4	80 GB HBM2e	Budget option, FP16 only	$1.69/hr	$0.82/hr

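NVMe requirement: minimum 2 TB with 6+ GB/s sequential read. If the NVMe bandwidth falls below that, the disk tier becomes a bottleneck before the GPU does. Check with fio --name=read-test --directory=/mnt/nvme --rw=read --bs=1M --size=10G --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting. Without --ioengine=libaio and --direct=1, fio's default synchronous engine ignores --iodepth and the page cache can inflate the result.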

On Spheron H100 SXM5 and H200 instances, bare-metal access means no hypervisor overhead on NVMe or network I/O. For a baseline vLLM setup on Spheron before adding LMCache, see the Spheron vLLM deployment guide.

Pricing fluctuates based on GPU availability. The prices above are as of 11 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Step 1: Deploy LMCache with vLLM on a Single Node

Install dependencies:

bash

pip install lmcache vllm

Mount NVMe:

bash

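# Create the mount point first; a directory created before mounting would be hidden by the mount.
sudo mkdir -p /mnt/nvme
# Assumes /dev/nvme0n1 already has a filesystem; adjust the device name for your host.
sudo mount /dev/nvme0n1 /mnt/nvme
sudo mkdir -p /mnt/nvme/kvcache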

Create lmcache.yaml:

yaml

chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000  # GB
max_local_cpu_size: 128    # GB
save_decode_cache: true

Launch vLLM with LMCache:

bash

LMCACHE_CONFIG_FILE=lmcache.yaml \
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

The --kv-transfer-config flag wires LMCache as the KV connector. All existing vLLM config flags work normally alongside it.

Test with a long system prompt:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "system", "content": "<128K system prompt here>"},
      {"role": "user", "content": "What are the key points?"}
    ]
  }'

Check metrics:

bash

curl -s http://localhost:8000/metrics | grep lmcache
# lmcache_cache_hit_rate 0.0 (first request, cold)
# lmcache_cache_hit_rate 0.95 (subsequent requests)
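
If you would rather track these values from a script than from grep, the Prometheus text format is simple enough to parse by hand. The sketch below is a minimal parser under two assumptions: metric lines carry no trailing timestamp, and the lmcache metric names match what your build actually exports (the lmcache_cache_hit_rate name is taken from the example above). The function name is hypothetical.

```python
def lmcache_metrics(metrics_text, prefix="lmcache"):
    """Extract {metric_name_with_labels: value} for metrics starting with `prefix`."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, raw_value = line.rpartition(" ")
        if not name.startswith(prefix):
            continue
        try:
            values[name] = float(raw_value)
        except ValueError:
            continue  # ignore lines whose last field is not a number
    return values


sample = """\
# HELP lmcache_cache_hit_rate Fraction of lookups served from cache
lmcache_cache_hit_rate 0.95
vllm:num_requests_running 3.0
"""
print(lmcache_metrics(sample))  # {'lmcache_cache_hit_rate': 0.95}
```

In practice you would feed it the body of the /metrics response, for example from urllib.request.urlopen("http://localhost:8000/metrics").read().decode().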

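Step 2: Multi-Node KV Cache Sharing with vLLM Production Stack on Kubernetes

Option A: P2P Sharing Over Redis

Start a Redis container on a dedicated node (or co-located on the first worker):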

bash

docker run -d -p 6379:6379 --name lmcache-redis redis:7

Add the remote_url to lmcache.yaml on each worker node:

yaml

chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000
max_local_cpu_size: 128
remote_url: redis://REDIS_NODE_IP:6379
save_decode_cache: true

All workers read from and write to the same Redis instance. The first worker to compute a prefix writes it to Redis; all subsequent workers on any node read it from there.
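
The write-once, read-everywhere pattern described above can be sketched in a few lines. In this toy version a plain dict stands in for Redis and a callback stands in for prefill; the Worker class and its method names are hypothetical, and the sketch ignores the race where two workers miss on the same prefix at the same moment.

```python
class Worker:
    """Toy vLLM worker that consults a shared store before running prefill."""

    def __init__(self, shared_store, prefill_fn):
        self.shared_store = shared_store  # stand-in for the Redis tier
        self.prefill_fn = prefill_fn      # stand-in for the expensive prefill pass

    def kv_for(self, prefix_key):
        cached = self.shared_store.get(prefix_key)
        if cached is not None:
            return cached                  # another worker already computed it
        kv = self.prefill_fn(prefix_key)   # cold path: compute once...
        self.shared_store[prefix_key] = kv  # ...and publish for every other worker
        return kv


prefill_calls = []


def fake_prefill(prefix_key):
    prefill_calls.append(prefix_key)
    return f"kv-blocks-for-{prefix_key}"


store = {}
workers = [Worker(store, fake_prefill), Worker(store, fake_prefill)]
for i in range(1000):
    workers[i % 2].kv_for("system-prompt-v1")  # round-robin across both workers
print(len(prefill_calls))  # 1
```

A thousand requests for the same prefix trigger a single prefill, regardless of which worker each request lands on.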

Option B: vLLM Production Stack on Kubernetes

For Kubernetes setup, use the same kubectl and Helm workflow as the llm-d Kubernetes disaggregated inference guide. Provision 2+ H100 nodes on Spheron, then:

1. Install GPU Operator and Redis:

bash

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install nvidia-gpu-operator nvidia/gpu-operator

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --set auth.enabled=false

2. Create values.yaml for the vLLM Production Stack:

yaml

model:
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
  tensorParallelSize: 4
  kvCacheDtype: fp8
  gpuMemoryUtilization: 0.92

lmcache:
  enabled: true
  config:
    chunk_size: 256
    local_cpu: true
    local_disk: /mnt/nvme/kvcache
    max_local_disk_size: 2000
    max_local_cpu_size: 128
    remote_url: redis://redis-master:6379
    save_decode_cache: true

replicas: 2

resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4

volumeMounts:
  - name: nvme-storage
    mountPath: /mnt/nvme/kvcache

volumes:
  - name: nvme-storage
    hostPath:
      path: /mnt/nvme/kvcache
      type: DirectoryOrCreate

3. Deploy:

bash

helm repo add lmcache https://charts.lmcache.ai
helm install vllm-stack lmcache/vllm-stack -f values.yaml

4. Verify cache sharing:

Send the same prompt to both workers back-to-back:

bash

# Send to worker 1 (cold cache - expect ~11s TTFT)
curl http://worker1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [...]}'

# Send to worker 2 (warm cache via Redis - expect ~1.5s TTFT)
curl http://worker2:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [...]}'
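
Plain curl reports total request time, not time to first token. A small streaming client can measure TTFT directly. The sketch below uses only the Python standard library and assumes the server streams OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]") when "stream" is true; the function names are hypothetical, and worker1/worker2 are the placeholder hostnames from the commands above.

```python
import json
import time
import urllib.request


def first_token_latency(sse_lines, start, clock=time.perf_counter):
    """Seconds from `start` until the first streamed chunk that carries text."""
    for raw in sse_lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if delta.get("content"):
            return clock() - start
    return None  # stream ended without any generated text


def measure_ttft(base_url, body):
    """POST a chat completion with streaming enabled and return TTFT in seconds."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({**body, "stream": True}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        return first_token_latency(response, start)


# Usage (requires running workers; fill in the same model and messages for both):
# body = {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "messages": messages}
# print(measure_ttft("http://worker1:8000", body))  # cold
# print(measure_ttft("http://worker2:8000", body))  # should be served from the shared tier
```

Measuring from just before the request is sent includes connection setup, so compare the two workers against each other rather than against the benchmark tables.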

Benchmarks: TTFT and Throughput Before/After LMCache

Table 1: Multi-turn QA TTFT (Llama 3.1 70B, H100 SXM5, 10-turn conversation, average 2K tokens per turn)

Round	Without LMCache	With LMCache	Speedup
Turn 1 (cold)	4.2s	4.2s	1x
Turn 5	8.1s	1.8s	4.5x
Turn 10	11.3s	1.5s	7.5x

Turn 1 is identical because both paths compute the full prefill. By Turn 10, the growing conversation history dominates TTFT, and LMCache serves the accumulated prefix from cache.

Table 2: 128K system prompt TTFT (Llama 3.1 70B, single H100 SXM5)

Scenario	TTFT	Notes
No cache (every request cold)	11.2s	Full prefill every request
vLLM prefix cache (same worker)	1.6s	Hit rate ~60% in practice (routing-dependent)
LMCache (multi-node)	1.5s	Hit rate 95%+ with sticky routing disabled

The vLLM-only prefix cache achieves 1.6s but only for requests that land on the same worker that computed the prefix. With multiple replicas and round-robin routing, that happens about 60% of the time. LMCache raises that to 95%+ because all workers share the same Redis-backed cache.
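
The hit rate matters more than the small gap between 1.6s and 1.5s, because every miss pays the full cold prefill. A simple expected-value calculation with the numbers from Table 2 shows this; the helper name is hypothetical, and the model assumes each request is either a full hit or a full miss.

```python
def expected_ttft(hit_rate, hit_seconds, miss_seconds):
    """Average TTFT when a fraction `hit_rate` of requests hit the cache."""
    return hit_rate * hit_seconds + (1 - hit_rate) * miss_seconds


COLD_TTFT = 11.2  # seconds, "No cache" row of Table 2

local_only = expected_ttft(0.60, 1.6, COLD_TTFT)  # vLLM prefix cache, ~60% hit rate
shared = expected_ttft(0.95, 1.5, COLD_TTFT)      # LMCache, ~95% hit rate

print(f"vLLM prefix cache: {local_only:.2f} s average TTFT")
print(f"LMCache:           {shared:.2f} s average TTFT")
```

With these inputs the average works out to about 5.4s for the local-only cache and about 2.0s with the shared cache.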

Table 3: Throughput at 80% system-prompt reuse (chatbot workload, 4x H100 SXM5 cluster)

Concurrency	Baseline req/s	LMCache req/s	Gain
10	0.8	6.2	7.8x
50	1.1	11.4	10.4x
200	1.4	21.3	15.2x

The 15x gain applies to high-concurrency, high-reuse workloads. At low concurrency, the prefill bottleneck is less severe and the gain is 7-8x. Single-user sequential workloads also see 7-8x.

Figures above are representative estimates based on LMCache benchmarking methodology. Actual results depend on model, hardware, concurrency, and workload characteristics. The 15x figure is the upper bound at high concurrency (200 concurrent requests) and high reuse (80%); LMCache's production benchmarks on smaller workloads report 4-5x TTFT improvement.

Storage Tiering in Practice: When to Use NVMe vs CPU vs Remote

GPU HBM hot tier: Always active. LMCache manages it automatically alongside the active KV pages.

CPU DRAM warm tier: Enable when CPU RAM is 256 GB or more. Set local_cpu: true in lmcache.yaml. This tier captures recently-evicted HBM blocks without the NVMe round-trip.

NVMe cold tier: Enable when disk throughput exceeds 4 GB/s sequential; 6 GB/s or more is the target from the hardware sizing section. Set local_disk and max_local_disk_size. Run fio to verify bandwidth before relying on this tier in production. SATA SSDs (around 500 MB/s) are too slow to serve as an effective cold tier.

Remote Redis: Enable when running two or more workers. This is the tier that makes cross-node KV cache sharing work. Without it, each node only benefits its own local cache.
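
The rules of thumb above can be written down as a small helper that decides which tiers to switch on for a given host. This is a sketch of the decision logic only: the function name is hypothetical, the thresholds are the ones stated in this section, and the returned keys mirror the lmcache.yaml fields used earlier in this guide.

```python
def lmcache_tier_config(cpu_ram_gb, disk_read_gb_per_s, workers,
                        redis_url=None, disk_path="/mnt/nvme/kvcache"):
    """Return lmcache.yaml-style settings based on this section's thresholds."""
    config = {"chunk_size": 256}
    # CPU DRAM warm tier: enable when the host has 256 GB of RAM or more.
    config["local_cpu"] = cpu_ram_gb >= 256
    # NVMe cold tier: enable only when sequential read throughput exceeds 4 GB/s.
    if disk_read_gb_per_s > 4:
        config["local_disk"] = disk_path
    # Remote tier: only useful with two or more workers and a reachable Redis.
    if workers >= 2 and redis_url:
        config["remote_url"] = redis_url
    return config


print(lmcache_tier_config(512, 7.0, 2, redis_url="redis://REDIS_NODE_IP:6379"))
print(lmcache_tier_config(128, 0.5, 1))  # small host with a SATA SSD: no extra tiers
```

Feed it the bandwidth you measured with fio rather than the number on the spec sheet.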

For a detailed breakdown of NVMe tiering requirements, bandwidth math, and ICMSP comparison, see NVMe KV cache offloading with LMCache.

Cost math for a 4x H100 SXM5 cluster at $1.43/hr spot per GPU ($5.72/hr total):

Without LMCache, 1,000 requests with a 128K-token system prompt together cost roughly ($5.72/hr ÷ 3,600 ≈ $0.001589 per cluster-second):

1,000 × 11s × $0.001589/cluster-second ≈ $17.48 in prefill compute

With LMCache at 80% cache hit rate (800 hits at 1.5s, 200 cold at 11s):

200 × 11s + 800 × 1.5s = 2,200 + 1,200 = 3,400 cluster-seconds
3,400 × $0.001589 ≈ $5.40

That is a 69% reduction in prefill cost for the same 1,000 requests. At scale, the savings compound.
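
The same arithmetic as a reusable function, so you can plug in your own hit rate and pricing. The helper name is hypothetical, and like the calculation above it assumes requests occupy the whole cluster one at a time, so it is an upper-level estimate rather than a billing model.

```python
def prefill_cost_usd(requests, hit_rate, hit_seconds, miss_seconds, cluster_usd_per_hour):
    """Prefill cost when hits take `hit_seconds` and misses take `miss_seconds`."""
    hits = round(requests * hit_rate)
    misses = requests - hits
    cluster_seconds = hits * hit_seconds + misses * miss_seconds
    return cluster_seconds * cluster_usd_per_hour / 3600


CLUSTER_RATE = 4 * 1.43  # 4x H100 SXM5 at the spot price quoted above, in $/hr

baseline = prefill_cost_usd(1000, 0.0, 1.5, 11, CLUSTER_RATE)
with_cache = prefill_cost_usd(1000, 0.8, 1.5, 11, CLUSTER_RATE)

print(f"baseline:   ${baseline:.2f}")
print(f"with cache: ${with_cache:.2f}")
print(f"reduction:  {1 - with_cache / baseline:.0%}")
```

This reproduces the figures above to within a cent of rounding. The reduction depends only on the hit rate and the hit/miss times, not on the hourly price.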

Running LMCache on Spheron: Sizing, Pricing, and Pitfalls

Recommended config for Llama 3.1 70B

4x H100 SXM5 (primary inference): 4x GPU × $1.43/hr = $5.72/hr spot, or 4x GPU × $5.01/hr = $20.04/hr on-demand
512 GB CPU RAM per node for the warm tier
4 TB NVMe per node for the cold tier
1 dedicated instance for Redis (any CPU-only node works)

Spot vs on-demand

Use on-demand for primary vLLM inference nodes where SLA guarantees matter. Use spot for cache worker nodes (CPU-heavy Redis, NVMe offload workers) where a restart just means a brief cache cold-start. The shared Redis tier survives spot interruption when backed by persistent storage.

Common pitfalls

NVMe IOPS vs bandwidth: Many cloud NVMe configs advertise high IOPS but have limited sequential throughput. KV cache loads are large sequential reads, not random 4K IOPS. Run fio to verify. SATA-backed SSDs are a common trap.

Redis network: If Redis is on a 1 GbE link, it becomes the bottleneck at ~125 MB/s. Plan for 25 GbE or RDMA between the cache workers and Redis. For RDMA setup, see the GPU networking guide.

Cache fragmentation: Very long sequences (over 128K tokens) can fragment the chunk index. Use chunk_size: 512 for long-context workloads instead of the default 256.

KV dtype consistency: Set --kv-cache-dtype consistently across all vLLM workers in a cluster. LMCache serializes KV blocks using the dtype vLLM passes down, so mixing fp8 and fp16 workers will cause cache misses because the serialized tensors won't match.

For the broader architectural picture of the prefill bottleneck that LMCache addresses, see prefill-decode disaggregation on GPU cloud. For a Kubernetes-native approach to KV cache-aware routing, see the llm-d guide.

LMCache's disk-tier KV cache offload is most effective on instances with high-bandwidth NVMe and large CPU DRAM, which is exactly the config available on Spheron H100 SXM5 nodes. Multi-node P2P sharing works across Spheron instances in the same region over 25 GbE.

Quick Setup Guide

1. Provision GPU instances on Spheron with NVMe storage
Log in to app.spheron.ai and deploy at least one H100 SXM5 instance. Select a configuration with 2+ TB NVMe and 256+ GB CPU RAM. For multi-node P2P sharing, provision two or more H100 nodes in the same region.

2. Install LMCache and vLLM
Run pip install lmcache vllm. LMCache 0.4+ (January 2026 release) ships with the vLLM Production Stack integration. Verify with: python -c 'import lmcache; print(lmcache.__version__)'

3. Write the lmcache.yaml config file
Create an lmcache.yaml specifying chunk_size (default 256 tokens), local_cpu (true/false), local_disk (e.g. /mnt/nvme/kvcache), max_local_disk_size and max_local_cpu_size (in GB), remote_url (Redis endpoint for multi-node), and save_decode_cache (true for chatbot workloads).

4. Start vLLM with LMCache backend
Run: LMCACHE_CONFIG_FILE=lmcache.yaml python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4 --kv-cache-dtype fp8 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'. The --kv-transfer-config flag wires LMCache as the KV connector, routing cache reads and writes through LMCache's tiered storage.

5. Deploy multi-node sharing with Redis backend
Start a Redis container: docker run -d -p 6379:6379 redis:7. Update lmcache.yaml remote_url to redis://YOUR_REDIS_IP:6379. On each vLLM node, set the same remote_url so all workers read from and write to the shared cache.

6. Validate cache hit rate with a benchmark request
Send the same system prompt to two different workers, one after the other, and compare TTFT. The first request should take about 11s and the second about 1.5s (served from cache). Check LMCache metrics: curl http://localhost:8000/metrics | grep lmcache_cache_hit_rate
