A Llama 3.1 70B request with a 128K-token system prompt spends 11 seconds on prefill before generating a single output token. If 1,000 users send the same system prompt, the cluster recomputes it 1,000 times, burning GPU compute that could be spent on decode. LMCache solves this by externalizing the KV cache so vLLM workers share computed KV state across nodes.
For KV cache fundamentals like PagedAttention, FP8/NVFP4, and CPU swap space, start with the KV Cache Optimization Guide. For single-node NVMe tiering, see NVMe KV cache offloading with LMCache. This post covers the multi-node deployment path.
LMCache graduated to production in January 2026 and is now used by Google Cloud GKE Inference, CoreWeave, and Cohere. The vLLM Production Stack integration ships in LMCache 0.4+.
What LMCache Does (and Where vLLM Prefix Caching Falls Short)
vLLM's automatic prefix caching (the --enable-prefix-caching flag) works well for a single worker. It finds the longest cached prefix for an incoming request and skips prefill for that portion. The catch: each replica in a multi-replica deployment maintains its own local cache. A request routed to worker B gets no benefit from what worker A already computed, even if they ran the same 128K system prompt an hour ago.
LMCache works as a KV connector and tiered storage layer that operates alongside vLLM's memory manager. It serializes KV tensors into a tiered store (GPU HBM, CPU DRAM, NVMe, Redis) and exposes them to any worker that receives the same token prefix. Worker B reads the cached KV blocks from the shared store rather than recomputing them.
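To make "the same token prefix" concrete, here is a minimal sketch of chunked prefix keying. It illustrates the general idea only and is not LMCache's actual key format: the function names and the SHA-256 hash chain are assumptions, and `CHUNK_SIZE` simply mirrors the `chunk_size: 256` setting used in this post.

```python
import hashlib
from typing import List

CHUNK_SIZE = 256  # tokens per chunk; mirrors chunk_size in lmcache.yaml


def prefix_chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key depends on every token before it."""
    keys = []
    running = hashlib.sha256()
    for i in range(len(token_ids) // chunk_size):
        chunk = token_ids[i * chunk_size:(i + 1) * chunk_size]
        running.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        keys.append(running.copy().hexdigest())
    return keys


def shared_prefix_chunks(a: List[int], b: List[int]) -> int:
    """Count how many leading chunks two token sequences have in common."""
    count = 0
    for key_a, key_b in zip(prefix_chunk_keys(a), prefix_chunk_keys(b)):
        if key_a != key_b:
            break
        count += 1
    return count
```

Two requests that share a 1,024-token system prompt but differ afterwards produce the same first four keys, so any worker that can look those keys up in the shared store can skip prefill for those four chunks.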
| Feature | vLLM Prefix Cache | LMCache |
|---|---|---|
| Scope | Single node | Multi-node cluster |
| Storage | GPU HBM only | GPU, CPU, NVMe, Redis |
| TTFT benefit | ~0% (same request re-sent to different worker) | 11s to 1.5s at 128K context |
| Throughput gain | 1x | Up to 15x on chatbot workloads |
| Survives spot restarts | No (cache lost on shutdown) | Yes (remote tier persists) |
LMCache Architecture
LMCache organizes cache storage into four tiers, ordered by latency:
Tier 0: GPU HBM. Hot cache, sub-microsecond access. Limited by VRAM (80 GB on H100 SXM5). LMCache manages this automatically alongside the active KV pages.
Tier 1: CPU DRAM. Warm cache, roughly 5-50 microseconds access latency, 256-512 GB typical per node. Bandwidth is constrained to PCIe (around 32 GB/s peak).
Tier 2: Local NVMe. Cold cache, 100-500 microsecond access, 2-4 TB capacity. At 6-12 GB/s sequential throughput on fast NVMe, this tier can feed a 70B model's prefill miss rate at reasonable concurrency.
Tier 3: Remote shared store (Redis or Infinispan). Shared across all nodes. On 25 GbE, you get roughly 3 GB/s effective bandwidth. On RDMA/RoCEv2, up to 20 GB/s. For RDMA networking options, see the GPU Networking: InfiniBand, RoCE, and Spectrum-X guide.
vLLM Worker A          vLLM Worker B
  [GPU HBM]              [GPU HBM]
      |                      |
  [CPU DRAM]             [CPU DRAM]
      |                      |
 [Local NVMe]           [Local NVMe]
      |                      |
      +------- Redis --------+
            (shared tier)

| Tier | Storage | Latency | Bandwidth | Cost/GB/mo | Best For |
|---|---|---|---|---|---|
| GPU HBM | 80 GB H100 | < 1 µs | 3.35 TB/s | ~$30 | Hot prefixes for active requests |
| CPU DRAM | 256-512 GB | 5-50 µs | ~32 GB/s | ~$0.50 | Recent session prefixes |
| NVMe SSD | 2-4 TB | 100-500 µs | 6-12 GB/s | ~$0.10 | Long-tail and batch workloads |
| Redis (25 GbE) | Unlimited | 500-2000 µs | ~3 GB/s | ~$0.05 | Shared cross-node cache |
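The bandwidth column matters more than the latency column at long context lengths. As a rough back-of-envelope sketch, the script below estimates how long it takes to stream a cached prefix from each tier. The layer count, KV head count, and head dimension are Llama 3.1 70B's published attention geometry; the tier bandwidths are the figures from the table above. It deliberately ignores compression, protocol overhead, and any overlap between loading and compute, so treat the output as an upper-level estimate rather than a benchmark.

```python
# Llama 3.1 70B attention geometry (from the published model config).
NUM_LAYERS = 80
NUM_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # The factor of 2 covers the separate K and V tensors.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_elem


def load_seconds(num_tokens: int, bytes_per_elem: int, gb_per_s: float) -> float:
    """Time to stream the KV cache for num_tokens at a sustained bandwidth."""
    return num_tokens * kv_bytes_per_token(bytes_per_elem) / (gb_per_s * 1e9)


if __name__ == "__main__":
    tokens = 128 * 1024
    tiers = {
        "CPU DRAM over PCIe": 32.0,
        "NVMe (6 GB/s)": 6.0,
        "NVMe (12 GB/s)": 12.0,
        "Redis over 25 GbE": 3.0,
        "Redis over RDMA": 20.0,
    }
    size_gb = tokens * kv_bytes_per_token(1) / 1e9
    print(f"fp8 KV cache for {tokens} tokens: {size_gb:.1f} GB")
    for name, bandwidth in tiers.items():
        print(f"{name:>20}: {load_seconds(tokens, 1, bandwidth):.2f} s")
```

With fp8 KV, each token occupies 160 KiB, so a 128K-token prefix is about 21.5 GB. At that size, transfer time is measured in seconds while tier access latency is measured in microseconds, which is why sustained bandwidth is the number to size each tier around.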
Prerequisites and Hardware Sizing
Spheron H100 and H200 instances include NVMe SSD storage at no extra cost. Current pricing from the Spheron API:
| GPU | VRAM | Best For | On-Demand | Spot |
|---|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | Llama 3.1 70B (TP4), Mistral 22B | $5.01/hr | $1.43/hr |
| H200 SXM5 | 141 GB HBM3e | Llama 3.1 405B (TP4), long-context | $5.86/hr | $3.31/hr |
| A100 80GB SXM4 | 80 GB HBM2e | Budget option, FP16 only | $1.69/hr | $0.82/hr |
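NVMe requirement: minimum 2 TB with 6+ GB/s sequential read. If the NVMe bandwidth falls below that, the disk tier becomes a bottleneck before the GPU does. Check with fio --name=read-test --directory=/mnt/nvme --rw=read --bs=1M --size=10G --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting. Here --direct=1 bypasses the page cache so the result reflects the device, and --ioengine=libaio is needed for --iodepth to take effect; point --directory at your NVMe mount.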
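If you want to gate a deployment on this check, fio can emit machine-readable results when you add --output-format=json to the command. The helper below is a small sketch that sums read bandwidth across the reported jobs and compares it with the 6 GB/s requirement. The function names and the `fio.json` file name in the usage note are illustrative assumptions; the `jobs[].read.bw` field (in KiB/s) is part of fio's JSON output.

```python
import json

MIN_READ_GBPS = 6.0  # sequential-read requirement stated above


def read_bandwidth_gbps(fio_json: str) -> float:
    """Sum read bandwidth over all reported jobs, in GB/s (fio reports "bw" in KiB/s)."""
    report = json.loads(fio_json)
    kib_per_s = sum(job["read"]["bw"] for job in report["jobs"])
    return kib_per_s * 1024 / 1e9


def nvme_is_fast_enough(fio_json: str, threshold: float = MIN_READ_GBPS) -> bool:
    return read_bandwidth_gbps(fio_json) >= threshold
```

For example, after redirecting fio's JSON output to a file, `nvme_is_fast_enough(open("fio.json").read())` returns whether the disk clears the threshold.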
On Spheron H100 SXM5 and H200 instances, bare-metal access means no hypervisor overhead on NVMe or network I/O. For a baseline vLLM setup on Spheron before adding LMCache, see the Spheron vLLM deployment guide.
Pricing fluctuates based on GPU availability. The prices above are as of 11 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Step 1: Deploy LMCache with vLLM on a Single Node
Install dependencies:
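pip install lmcache vllm
Mount NVMe:
sudo mkdir -p /mnt/nvme
sudo mount /dev/nvme0n1 /mnt/nvme
sudo mkdir -p /mnt/nvme/kvcache  # create the cache directory after mounting so it lives on the NVMe device
Create lmcache.yaml: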
chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000 # GB
max_local_cpu_size: 128 # GB
save_decode_cache: true
Launch vLLM with LMCache:
LMCACHE_CONFIG_FILE=lmcache.yaml \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92 \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
The --kv-transfer-config flag wires LMCache as the KV connector. All existing vLLM config flags work normally alongside it.
Test with a long system prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [
{"role": "system", "content": "<128K system prompt here>"},
{"role": "user", "content": "What are the key points?"}
]
}'
Check metrics:
curl -s http://localhost:8000/metrics | grep lmcache
# lmcache_cache_hit_rate 0.0 (first request, cold)
# lmcache_cache_hit_rate 0.95 (subsequent requests)
Step 2: Multi-Node KV Cache Sharing with vLLM Production Stack on Kubernetes
Option A: P2P Sharing Over Redis
Start a Redis container on a dedicated node (or co-located on the first worker):
docker run -d -p 6379:6379 --name lmcache-redis redis:7
Add the remote_url to lmcache.yaml on each worker node:
chunk_size: 256
local_cpu: true
local_disk: /mnt/nvme/kvcache
max_local_disk_size: 2000
max_local_cpu_size: 128
remote_url: redis://REDIS_NODE_IP:6379
save_decode_cache: true
All workers read from and write to the same Redis instance. The first worker to compute a prefix writes it to Redis; all subsequent workers on any node read it from there.
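The compute-once, read-everywhere behaviour described here can be sketched as a toy model. The `Worker` class below is purely illustrative: a plain dict stands in for the Redis tier, strings stand in for serialized KV tensors, and none of the names correspond to LMCache's API.

```python
from typing import Dict, Tuple

Prefix = Tuple[int, ...]


class Worker:
    """Toy worker: checks its local cache, then the shared store, then computes."""

    def __init__(self, name: str, shared: Dict[Prefix, str]):
        self.name = name
        self.local: Dict[Prefix, str] = {}
        self.shared = shared   # stands in for the Redis tier
        self.prefills = 0      # how many times this worker paid for prefill

    def get_kv(self, prefix: Prefix) -> str:
        if prefix in self.local:
            return self.local[prefix]
        if prefix in self.shared:
            kv = self.shared[prefix]          # remote hit: no prefill needed
        else:
            self.prefills += 1                # cold: compute once
            kv = f"kv-computed-by-{self.name}"
            self.shared[prefix] = kv          # publish for every other worker
        self.local[prefix] = kv
        return kv
```

If worker A serves a prefix first and worker B serves it second, A's `prefills` counter ends at 1 and B's stays at 0, which is the effect the shared `remote_url` is meant to produce across nodes.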
Option B: vLLM Production Stack on Kubernetes
For Kubernetes setup, use the same kubectl and Helm workflow as the llm-d Kubernetes disaggregated inference guide. Provision 2+ H100 nodes on Spheron, then:
1. Install GPU Operator and Redis:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install nvidia-gpu-operator nvidia/gpu-operator
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --set auth.enabled=false
2. Create values.yaml for the vLLM Production Stack:
model:
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
  tensorParallelSize: 4
  kvCacheDtype: fp8
  gpuMemoryUtilization: 0.92
lmcache:
  enabled: true
  config:
    chunk_size: 256
    local_cpu: true
    local_disk: /mnt/nvme/kvcache
    max_local_disk_size: 2000
    max_local_cpu_size: 128
    remote_url: redis://redis-master:6379
    save_decode_cache: true
replicas: 2
resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4
volumeMounts:
  - name: nvme-storage
    mountPath: /mnt/nvme/kvcache
volumes:
  - name: nvme-storage
    hostPath:
      path: /mnt/nvme/kvcache
      type: DirectoryOrCreate
3. Deploy:
helm repo add lmcache https://charts.lmcache.ai
helm install vllm-stack lmcache/vllm-stack -f values.yaml
4. Verify cache sharing:
Send the same prompt to both workers back-to-back:
# Send to worker 1 (cold cache - expect ~11s TTFT)
curl http://worker1:8000/v1/chat/completions -d '{"messages": [...]}'
# Send to worker 2 (warm cache via Redis - expect ~1.5s TTFT)
curl http://worker2:8000/v1/chat/completions -d '{"messages": [...]}'
Benchmarks: TTFT and Throughput Before/After LMCache
Table 1: Multi-turn QA TTFT (Llama 3.1 70B, H100 SXM5, 10-turn conversation, average 2K tokens per turn)
| Round | Without LMCache | With LMCache | Speedup |
|---|---|---|---|
| Turn 1 (cold) | 4.2s | 4.2s | 1x |
| Turn 5 | 8.1s | 1.8s | 4.5x |
| Turn 10 | 11.3s | 1.5s | 7.5x |
Turn 1 is identical because both paths compute the full prefill. By Turn 10, the growing conversation history dominates TTFT, and LMCache serves the accumulated prefix from cache.
Table 2: 128K system prompt TTFT (Llama 3.1 70B, single H100 SXM5)
| Scenario | TTFT | Notes |
|---|---|---|
| No cache (every request cold) | 11.2s | Full prefill every request |
| vLLM prefix cache (same worker) | 1.6s | Hit rate ~60% in practice (routing-dependent) |
| LMCache (multi-node) | 1.5s | Hit rate 95%+ with sticky routing disabled |
The vLLM-only prefix cache achieves 1.6s but only for requests that land on the same worker that computed the prefix. With multiple replicas and round-robin routing, that happens about 60% of the time. LMCache raises that to 95%+ because all workers share the same Redis-backed cache.
Table 3: Throughput at 80% system-prompt reuse (chatbot workload, 4x H100 SXM5 cluster)
| Concurrency | Baseline req/s | LMCache req/s | Gain |
|---|---|---|---|
| 10 | 0.8 | 6.2 | 7.8x |
| 50 | 1.1 | 11.4 | 10.4x |
| 200 | 1.4 | 21.3 | 15.2x |
The 15x gain applies to high-concurrency, high-reuse workloads. At low concurrency, the prefill bottleneck is less severe and the gain is 7-8x. Single-user sequential workloads also see 7-8x.
Figures above are representative estimates based on LMCache benchmarking methodology. Actual results depend on model, hardware, concurrency, and workload characteristics. The 15x figure is the high-concurrency (200 req), high-reuse (80%) upper bound; LMCache's production benchmarks on smaller workloads report 4-5x TTFT improvement.
Storage Tiering in Practice: When to Use NVMe vs CPU vs Remote
GPU HBM hot tier: Always active. LMCache manages it automatically alongside the active KV pages.
CPU DRAM warm tier: Enable when CPU RAM is 256 GB or more. Set local_cpu: true in lmcache.yaml. This tier captures recently-evicted HBM blocks without the NVMe round-trip.
NVMe cold tier: Enable when disk throughput exceeds 4 GB/s sequential. Set local_disk and max_local_disk_size. Run fio to verify bandwidth before relying on this tier in production. SATA SSDs (500 MB/s) are too slow to serve as an effective cold tier.
Remote Redis: Enable when running two or more workers. This is the tier that makes cross-node KV cache sharing work. Without it, each node only benefits its own local cache.
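The rules of thumb above can be written down as a small decision helper. This is a sketch only: the `NodeSpec` fields and tier labels are made-up names, and the thresholds are the ones quoted in this section (256 GB of CPU RAM, more than 4 GB/s sequential NVMe throughput, two or more workers).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSpec:
    cpu_ram_gb: int
    nvme_seq_read_gbps: float
    num_workers: int


def recommended_tiers(node: NodeSpec) -> List[str]:
    tiers = ["gpu_hbm"]                  # always active
    if node.cpu_ram_gb >= 256:
        tiers.append("cpu_dram")         # local_cpu: true
    if node.nvme_seq_read_gbps > 4.0:
        tiers.append("local_nvme")       # local_disk + max_local_disk_size
    if node.num_workers >= 2:
        tiers.append("remote_redis")     # remote_url
    return tiers
```

A node with 512 GB of RAM, a 7 GB/s NVMe drive, and two workers gets all four tiers, while a single worker with 128 GB of RAM and a SATA SSD falls back to the GPU tier alone.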
For a detailed breakdown of NVMe tiering requirements, bandwidth math, and ICMSP comparison, see NVMe KV cache offloading with LMCache.
Cost math for a 4x H100 SXM5 cluster at $1.43/hr spot per GPU ($5.72/hr total):
Without LMCache, 1,000 requests with a 128K-token system prompt each cost roughly:
- 1,000 × 11s × $0.001589/cluster-second = $17.48 in prefill compute
With LMCache at 80% cache hit rate (800 hits at 1.5s, 200 cold at 11s):
- 200 × 11s + 800 × 1.5s = 2,200 + 1,200 = 3,400 cluster-seconds
- 3,400 × $0.001589 = $5.40
That is a 69% reduction in prefill cost for the same 1,000 requests. At scale, the savings compound.
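The same arithmetic as a reusable function, so you can plug in your own hit rate, prices, and TTFT figures. The constants are the ones used in the calculation above; the function name is illustrative.

```python
GPU_SPOT_PER_HOUR = 1.43   # $/GPU-hour, H100 SXM5 spot
GPUS = 4
COLD_TTFT_S = 11.0         # full prefill of the 128K-token system prompt
WARM_TTFT_S = 1.5          # prefix served from cache


def prefill_cost(requests: int, hit_rate: float) -> float:
    """Dollar cost of prefill time for a batch of requests at a given cache hit rate."""
    cluster_per_second = GPU_SPOT_PER_HOUR * GPUS / 3600
    hits = round(requests * hit_rate)
    cold = requests - hits
    cluster_seconds = cold * COLD_TTFT_S + hits * WARM_TTFT_S
    return cluster_seconds * cluster_per_second


if __name__ == "__main__":
    baseline = prefill_cost(1000, 0.0)
    cached = prefill_cost(1000, 0.8)
    print(f"baseline ${baseline:.2f}, with cache ${cached:.2f}, "
          f"saving {1 - cached / baseline:.0%}")
```

Because the function keeps the per-second rate unrounded, its totals can differ from a hand calculation by about a cent; the percentage saving is unaffected since it depends only on the ratio of cluster-seconds.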
Running LMCache on Spheron: Sizing, Pricing, and Pitfalls
Recommended config for Llama 3.1 70B
- 4x H100 SXM5 (primary inference): 4x GPU × $1.43/hr = $5.72/hr spot
- 512 GB CPU RAM per node for the warm tier
- 4 TB NVMe per node for the cold tier
- 1 dedicated instance for Redis (any CPU-only node works)
Spot vs on-demand
Use on-demand for primary vLLM inference nodes where SLA guarantees matter. Use spot for cache worker nodes (CPU-heavy Redis, NVMe offload workers) where a restart just means a brief cache cold-start. The shared Redis tier survives spot interruption when backed by persistent storage.
Common pitfalls
NVMe IOPS vs bandwidth: Many cloud NVMe configs advertise high IOPS but have limited sequential throughput. KV cache loads are large sequential reads, not random 4K IOPS. Run fio to verify. SATA-backed SSDs are a common trap.
Redis network: If Redis is on a 1 GbE link, it becomes the bottleneck at ~125 MB/s. Plan for 25 GbE or RDMA between the cache workers and Redis. For RDMA setup, see the GPU networking guide.
Cache fragmentation: Very long sequences (over 128K tokens) can fragment the chunk index. Use chunk_size: 512 for long-context workloads instead of the default 256.
KV dtype consistency: Set --kv-cache-dtype consistently across all vLLM workers in a cluster. LMCache serializes KV blocks using the dtype vLLM passes down, so mixing fp8 and fp16 workers will cause cache misses because the serialized tensors won't match.
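The dtype pitfall is easy to catch before rollout. The sketch below scans worker launch commands for --kv-cache-dtype and reports any worker that disagrees with the first one. How you collect the command strings (from a deployment manifest, a process list, and so on) is left open, and the helper names are hypothetical.

```python
import shlex
from typing import Dict, List, Optional


def kv_cache_dtype(command: str) -> Optional[str]:
    """Extract the value of --kv-cache-dtype from a launch command, if present."""
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg == "--kv-cache-dtype" and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--kv-cache-dtype="):
            return arg.split("=", 1)[1]
    return None  # flag not set; vLLM will fall back to its default


def mismatched_workers(commands: Dict[str, str]) -> List[str]:
    """Return the workers whose KV dtype differs from the first worker's."""
    if not commands:
        return []
    dtypes = {name: kv_cache_dtype(cmd) for name, cmd in commands.items()}
    reference = next(iter(dtypes.values()))
    return [name for name, dtype in dtypes.items() if dtype != reference]
```

A worker that omits the flag is reported as a mismatch against workers that set it explicitly, which is usually what you want: relying on the default on one node and fp8 on another is exactly the mixed configuration to avoid.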
For the broader architectural picture of the prefill bottleneck that LMCache addresses, see prefill-decode disaggregation on GPU cloud. For a Kubernetes-native approach to KV cache-aware routing, see the llm-d guide.
LMCache's disk-tier KV cache offload is most effective on instances with high-bandwidth NVMe and large CPU DRAM, which is exactly the config available on Spheron H100 SXM5 nodes. Multi-node P2P sharing works across Spheron instances in the same region over 25 GbE.
Quick Setup Guide
1. Log in to app.spheron.ai and deploy at least one H100 SXM5 instance. Select a configuration with 2+ TB NVMe and 256+ GB CPU RAM. For multi-node P2P sharing, provision two or more H100 nodes in the same availability region.
2. Install LMCache and vLLM with pip install lmcache vllm. LMCache 0.4+ (January 2026 release) ships with the vLLM Production Stack integration. Verify with: python -c 'import lmcache; print(lmcache.__version__)'
3. Create a lmcache.yaml specifying chunk_size (default 256 tokens), local_cpu (true/false), local_disk (e.g. /mnt/nvme/kvcache), max_local_disk_size (in GB), remote_url (Redis endpoint for multi-node), and save_decode_cache (true for chatbot workloads).
4. Launch vLLM with LMCache: LMCACHE_CONFIG_FILE=lmcache.yaml python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4 --kv-cache-dtype fp8 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'. The --kv-transfer-config flag wires LMCache as the KV connector, routing cache reads and writes through LMCache's tiered storage.
5. Start a Redis container: docker run -d -p 6379:6379 redis:7. Update lmcache.yaml remote_url to redis://YOUR_REDIS_IP:6379. On each vLLM node, set the same remote_url so all workers read from and write to the shared cache.
6. Send the same system prompt to two different workers, one after the other, and compare TTFT. The first request should show ~11s, the second ~1.5s (from cache). Check LMCache metrics: curl http://localhost:8000/metrics | grep lmcache_cache_hit_rate
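The hit-rate check in the last step can also be scripted. The parser below is a minimal sketch that reads the Prometheus text returned by the /metrics endpoint and keeps the samples whose name starts with lmcache. It assumes the metric names shown in this post (such as lmcache_cache_hit_rate); if a metric appears with several label sets, the last one wins, which is a simplification.

```python
from typing import Dict


def lmcache_metrics(metrics_text: str) -> Dict[str, float]:
    """Collect lmcache* samples from Prometheus text exposition output."""
    values: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[:line.index("{")]
            tail = line[line.rindex("}") + 1:]
        else:
            name, _, tail = line.partition(" ")
        if not name.startswith("lmcache"):
            continue
        fields = tail.split()
        if fields:
            values[name] = float(fields[0])  # first field after the name is the value
    return values
```

Fetch the text however you prefer (for example with curl, or `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`), pass it to `lmcache_metrics`, and compare the hit rate before and after the second request.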
Frequently Asked Questions
vLLM prefix caching reuses KV state within a single node. LMCache adds a distributed cache layer so multiple vLLM workers on different machines share the same KV cache over Ethernet, RDMA, or NVLink, which eliminates redundant prefill across nodes in a multi-replica cluster.
For the hot tier: H100 or H200 GPUs with 80-141 GB HBM. For the warm tier: 256-512 GB CPU DRAM per node. For the disk tier: 2-4 TB NVMe SSD with at least 6 GB/s sequential read bandwidth. For the shared tier: a Redis or Infinispan backend on a low-latency network.
On workloads with repeated system prompts (chatbots, RAG, agents), prefill is the bottleneck. LMCache stores the KV tensors after first prefill and serves them from cache on subsequent requests. On a Llama 3.1 70B chatbot at 128K system prompt, this cuts TTFT from 11s to 1.5s and allows the GPU to handle 15x more decode requests per second instead of re-computing the same prefill.
Cache worker nodes (which store KV on CPU/NVMe) can run on spot if the LMCache backend is backed by a persistent Redis. The shared tier survives spot interruption because Redis persists the serialized KV tensors. Prefill and primary decode GPUs should be on-demand for SLA guarantees.
LMCache uses a configurable eviction policy (LRU by default). When a tier fills up, the least-recently-used KV blocks are evicted to the next tier or dropped. You configure max_local_disk_size, max_local_cpu_size, and remote_url in the lmcache.yaml config file. For long-context workloads, size the disk tier to hold at least 10x the average active sessions.
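The evict-to-next-tier behaviour can be illustrated with a small LRU chain. This is a teaching sketch under simplifying assumptions, not LMCache's implementation: capacity is counted in blocks rather than bytes, lookups fall through to lower tiers without promoting the block back up, and the `LruTier` class name is made up.

```python
from collections import OrderedDict
from typing import Optional


class LruTier:
    """Fixed-capacity LRU tier that demotes evicted blocks to the next tier."""

    def __init__(self, capacity: int, next_tier: Optional["LruTier"] = None):
        self.capacity = capacity
        self.next_tier = next_tier
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, block: bytes) -> None:
        self.blocks[key] = block
        self.blocks.move_to_end(key)  # mark as most recently used
        while len(self.blocks) > self.capacity:
            old_key, old_block = self.blocks.popitem(last=False)  # least recently used
            if self.next_tier is not None:
                self.next_tier.put(old_key, old_block)  # demote to the next tier
            # with no next tier, the block is dropped

    def get(self, key: str) -> Optional[bytes]:
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return self.blocks[key]
        if self.next_tier is not None:
            return self.next_tier.get(key)  # fall through without promotion
        return None
```

Chaining `LruTier(2, LruTier(2))` models a CPU tier backed by a disk tier: the third block written pushes the oldest one down to disk, and once the disk tier is also full its least-recently-used block is dropped entirely.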
