Tutorial

NVIDIA NIXL and Disaggregated Inference: Move KV Caches Across GPUs at Wire Speed

Written by Mitrasish, Co-founder | Apr 3, 2026

Tags: NVIDIA NIXL, Disaggregated Inference, KV Cache, vLLM, GPU Cloud, LLM Inference, H100, Prefill Decode Separation

NVIDIA NIXL (NVIDIA Inference Transfer Library) is the transport layer that makes disaggregated LLM inference practical: it moves KV cache tensors from prefill GPUs to decode GPUs over RDMA or NVMe at wire speed. Production systems like Mooncake, llm-d, and DistServe all rely on this kind of KV transfer, and NIXL is a primary connector option for disaggregated prefill in vLLM. If you are building on NVIDIA Dynamo, NIXL handles the low-level transfers that Dynamo's orchestration layer coordinates. For the Dynamo side of the picture, see NVIDIA Dynamo: Disaggregated Inference Orchestration.

TL;DR

  • NIXL is a point-to-point KV cache transfer library open-sourced by NVIDIA at GTC 2025.
  • It supports five backends: RDMA/InfiniBand, RoCE via UCX, TCP fallback, NVMe-oF, and S3-compatible object storage.
  • vLLM uses NIXL as a primary connector option for disaggregated prefill via --kv-transfer-config (NixlConnector, alongside LMCacheConnector, MooncakeConnector, and others).
  • NVIDIA Dynamo's KV Block Manager (KVBM) routes prefill outputs to decode workers using NIXL underneath.
  • Separating prefill and decode pools lets you use cheaper GPUs for decode, cutting cost per token by 3-5x for chat workloads.
  • Spheron lets you rent H100 SXM5 for prefill and A100 80G for decode in the same region, with the interconnect bandwidth NIXL requires.

Why Disaggregated Inference Is Now the Default

Running prefill and decode on the same GPU wastes both. Prefill is compute-bound: it processes the entire prompt in one forward pass, saturating FLOPS. Decode is memory-bandwidth-bound: every token step re-reads the full KV cache and the model weights, so it is limited by HBM throughput rather than compute. Put them on the same GPU and each phase runs below its hardware limits, constrained by whatever the other phase needs.

| Property | Prefill phase | Decode phase |
|---|---|---|
| Bottleneck | Compute (TFLOPS) | Memory bandwidth (TB/s) |
| GPU utilization pattern | Burst, then idle | Sustained, low arithmetic intensity |
| Duration | Seconds (long prompts) | Milliseconds per token |
| Parallelism | Embarrassingly parallel across prompt tokens | Sequential token generation |

The consequence is poor GPU utilization across both dimensions simultaneously. Disaggregated inference fixes this by giving each phase its own hardware. For the underlying KV cache memory math and why this matters, see KV Cache Optimization for LLM Inference. For a broader view of what causes slow inference, see Diagnosing Slow LLM Inference.

Disaggregated setups require moving the KV cache produced by prefill workers to decode workers. NIXL is the library that does this transfer.
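To get a feel for how much data that transfer involves, the per-token KV cache size follows directly from the model architecture. The sketch below uses the published Llama-3.1-70B shape (80 layers, 8 KV heads under GQA, head dim 128) and BF16 storage; swap in your own model's numbers.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head dim 128, BF16 (2 bytes).
per_token = kv_bytes_per_token(80, 8, 128)
print(per_token)                     # 327680 bytes, ~320 KiB per token
print(4096 * per_token / 2**30)     # 1.25 GiB for a 4K-token prompt
```

At 400 Gbps (~50 GB/s) a naive bulk transfer of that 4K-token cache would take tens of milliseconds, which is why implementations typically overlap per-layer KV transfers with prefill compute so only the final layers' transfer sits on the critical path.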

What Is NVIDIA NIXL

NIXL (NVIDIA Inference Transfer Library) is an asynchronous point-to-point transfer library for KV cache tensors. It abstracts the transport layer so inference frameworks can move KV data between any combination of GPU HBM, local NVMe, and remote storage without writing transport-specific code.

NIXL exposes a simple API: register a memory buffer, initiate an async transfer, poll for completion. The transport backend handles the actual data movement. Five backends are available:
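The register/initiate/poll pattern looks roughly like the toy sketch below. This is not the real NIXL API; the class and method names are made up for illustration, and a background thread stands in for the RDMA engine.

```python
import threading

class ToyTransferAgent:
    """Illustrative stand-in for a NIXL-style agent (not the real API)."""
    def __init__(self):
        self.buffers = {}

    def register_buffer(self, name, data):
        # Real NIXL pins and registers memory so the NIC can DMA it directly.
        self.buffers[name] = data

    def start_transfer(self, src, dst):
        # Async initiate: returns immediately with a completion handle.
        done = threading.Event()
        def run():
            self.buffers[dst] = bytes(self.buffers[src])  # stands in for an RDMA write
            done.set()
        threading.Thread(target=run, daemon=True).start()
        return done

    def poll(self, handle, timeout=1.0):
        # Caller polls (or waits on) the handle for completion.
        return handle.wait(timeout)

agent = ToyTransferAgent()
agent.register_buffer("prefill_kv", b"\x00" * 1024)  # pretend KV block
agent.register_buffer("decode_kv", b"")
handle = agent.start_transfer("prefill_kv", "decode_kv")
assert agent.poll(handle)
```

The important property is that the initiating side never blocks: prefill can start shipping completed KV layers while later layers are still being computed.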

| Backend | Latency profile | Best use case |
|---|---|---|
| RDMA/InfiniBand NDR (400 Gbps) | Sub-1ms for 4K-token KV | Production multi-node, same data center |
| RoCE via UCX | 1-3ms, depends on switch configuration | Multi-node without IB hardware |
| TCP fallback (UCX) | 5-20ms | Dev/test, latency-tolerant workloads |
| NVMe-oF | 2-10ms (local NVMe), higher remotely | KV tiering, evicting cool context off GPU |
| S3-compatible object storage | 50-500ms | Archiving, shared prefix caches, multi-cluster reuse |

Integration points in the inference ecosystem:

  • vLLM: NixlConnector in --kv-transfer-config (one of several supported connectors)
  • NVIDIA Dynamo: KV Block Manager (KVBM) uses NIXL for inter-worker transfers
  • SGLang: NIXL connector available via community contribution
  • TensorRT-LLM: NIXL integration in the executor layer for disaggregated builds

Prefill vs Decode Workers: The Split Explained

When a request arrives at a disaggregated inference system, it goes to a prefill worker first. The prefill worker processes the entire prompt, computes the KV cache for every prompt token, and then stops. It does not generate any output tokens. The KV cache is the output of prefill. NIXL transfers that KV cache to a decode worker.

The decode worker receives the KV cache, picks up where prefill left off, and generates tokens one at a time. Each decode step reads the KV cache for all preceding context, appends the new token's K and V vectors, and passes the output through the model head.

| Property | Prefill node | Decode node |
|---|---|---|
| Bottleneck | Compute (TFLOPS) | Memory bandwidth (TB/s) |
| Recommended GPU | H100 SXM5 | A100 80G, H100 PCIe |
| Token output | None | All generated tokens |
| KV cache role | Producer | Consumer |
| Typical GPU count in cluster | Fewer, larger | More, smaller |

The hardware recommendation follows directly from the bottleneck. H100 SXM5 delivers 1,979 TFLOPS BF16 (with sparsity; roughly 989 dense) and NVLink 4.0 for fast inter-GPU communication during tensor-parallel prefill. A100 80G SXM4 has 2 TB/s HBM2e bandwidth and 80 GB of memory: the right profile for decode, where you need to read the KV cache and model weights repeatedly per token.
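These bottlenecks can be put on a back-of-envelope roofline. The sketch below assumes ~2 FLOPs per parameter per prompt token for prefill, a 40% MFU, ~990 dense BF16 TFLOPS for H100, and that each decode step streams the per-GPU weight shard plus some KV bytes from HBM once; all numbers are rough planning estimates, not benchmarks.

```python
def prefill_seconds(params, prompt_tokens, tflops, mfu=0.4):
    # Forward pass costs roughly 2 * params FLOPs per prompt token.
    flops = 2 * params * prompt_tokens
    return flops / (tflops * 1e12 * mfu)

def decode_tokens_per_sec(weight_bytes_per_gpu, kv_bytes_per_step, bw_tb_s):
    # Bandwidth-bound lower bound: each step streams weights (+ KV) from HBM.
    return (bw_tb_s * 1e12) / (weight_bytes_per_gpu + kv_bytes_per_step)

# 70B model, 4096-token prompt, one H100 (~990 dense BF16 TFLOPS, 40% MFU):
print(round(prefill_seconds(70e9, 4096, 990), 2))        # 1.45 seconds

# 70B BF16 over TP=4 -> ~35 GB weights per GPU; ~1.25 GB KV read per step
# (illustrative); A100 at ~2 TB/s HBM bandwidth:
print(round(decode_tokens_per_sec(35e9, 1.25e9, 2.0), 1))  # 55.2 tok/s
```

The decode figure is a single-sequence floor; batching many sequences amortizes the weight reads and is how decode pools reach thousands of tokens per second per GPU.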

Setting Up NIXL with vLLM: Step-by-Step

You need a recent vLLM release with disaggregated prefill support. The --kv-transfer-config JSON schema changed across minor versions; check your vLLM release notes for the correct connector options. For production deployment patterns with vLLM on multi-GPU bare metal, see vLLM Multi-GPU Production Deployment.

Step 1: Install dependencies on all nodes

```bash
pip install vllm
pip install nixl
```

Verify RDMA devices are visible:

```bash
ibv_devices
```

You should see at least one device (e.g., mlx5_0). If the list is empty, InfiniBand drivers are not installed or RDMA is not available on your hardware. On Spheron bare-metal H100 instances, IB drivers come pre-installed.

Step 2: Start the NIXL metadata server

The metadata server handles RDMA endpoint discovery. Run it on any reachable coordination node:

```bash
nixl-metadata-server --port 5557
```

Both prefill and decode workers register their endpoints at startup. No central scheduler is needed.
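The registration flow can be pictured with a toy in-memory registry. The real metadata server speaks its own wire protocol; this only illustrates the register/lookup idea that lets workers find each other's RDMA endpoints by role and rank.

```python
class ToyMetadataRegistry:
    """Illustrative endpoint registry keyed by (role, rank); not the real server."""
    def __init__(self):
        self.endpoints = {}

    def register(self, role, rank, addr):
        # Each worker announces its transfer endpoint at startup.
        self.endpoints[(role, rank)] = addr

    def lookup(self, role, rank):
        # Peers resolve the endpoint they need to transfer to/from.
        return self.endpoints.get((role, rank))

reg = ToyMetadataRegistry()
reg.register("kv_producer", 0, "10.0.0.1:5557")
reg.register("kv_consumer", 0, "10.0.0.2:5557")
print(reg.lookup("kv_consumer", 0))  # 10.0.0.2:5557
```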

Step 3: Start prefill workers

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_rank":0}' \
  --tensor-parallel-size 4 \
  --port 8100
```

The kv_rank identifies this prefill worker in the transfer ring. If you run multiple prefill workers, increment kv_rank per worker.

Step 4: Start decode workers

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_rank":0}' \
  --tensor-parallel-size 4 \
  --port 8200
```

For multiple decode workers, use the same kv_rank as the corresponding prefill worker or configure a many-to-many mapping depending on your request router.

Step 5: Put a router in front

vLLM alone does not include a disaggregated request router. Use any proxy that can route prefill requests to prefill endpoints and decode requests to decode endpoints. NGINX with custom Lua routing, NVIDIA Dynamo's frontend, or a lightweight Python proxy all work.
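The routing decision itself is small. The sketch below is a hypothetical round-robin router that pairs prefill and decode workers by kv_rank; the URLs and the 1:1 pairing are illustrative assumptions, not a vLLM feature.

```python
import itertools

class DisaggRouter:
    """Hypothetical routing sketch: pick a prefill worker for the prompt,
    then stream tokens from the decode worker paired by kv_rank."""
    def __init__(self, prefill_urls, decode_urls):
        assert len(prefill_urls) == len(decode_urls)  # assume 1:1 rank pairing
        self.pairs = list(zip(prefill_urls, decode_urls))
        self._rr = itertools.cycle(range(len(self.pairs)))

    def route(self):
        rank = next(self._rr)  # round-robin across worker pairs
        prefill, decode = self.pairs[rank]
        return {"kv_rank": rank, "prefill": prefill, "decode": decode}

router = DisaggRouter(["http://prefill-0:8100"], ["http://decode-0:8200"])
print(router.route())
```

A production router would also account for prefix-cache locality and queue depth, which is exactly what Dynamo's frontend adds on top.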

Step 6: Validate end-to-end

```bash
curl http://localhost:8200/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain the transformer attention mechanism.",
    "max_tokens": 200
  }'
```

Check the decode node logs for a line like:

```
nixl: received KV transfer from rank=0, tokens=47, latency=2.3ms
```

Sub-5ms for a 47-token prompt on InfiniBand is expected. If you see 50ms+, TCP fallback is active and your RDMA setup needs attention.

NIXL + NVIDIA Dynamo: KVBM and Intelligent Routing

NIXL handles bits on the wire. NVIDIA Dynamo adds the orchestration layer: it decides which prefill worker handles each request, which decode worker receives the resulting KV cache, and how to reuse prefix cache hits across workers.

Dynamo's KV Block Manager (KVBM) sits above NIXL and tracks which KV blocks are on which worker. When a new request arrives with a prefix that was already computed by a prior request, KVBM routes the new request to the decode worker that already has those KV blocks, skipping prefill entirely for the shared prefix. For a full walkthrough of Dynamo's orchestration architecture and the broader disaggregated inference setup, see NVIDIA Dynamo: Disaggregated Inference Orchestration.

The Dynamo config YAML to enable NIXL transport:

```yaml
# dynamo.yaml
kv_transfer:
  backend: nixl
  metadata_server: "tcp://coordination-node:5557"
  rdma_device: mlx5_0
  transport_threads: 4

workers:
  prefill:
    count: 4
    gpu_type: h100_sxm5
    tensor_parallel: 4
  decode:
    count: 8
    gpu_type: a100_80g
    tensor_parallel: 2
```

With this config, Dynamo passes the NIXL endpoint configuration to each worker at startup. Workers register with the metadata server and KVBM begins routing transfers through NIXL automatically.

KV Cache Tiering: GPU to NVMe to Object Storage

NIXL enables a three-tier memory hierarchy for KV cache that goes beyond GPU HBM. Each tier trades latency for capacity and cost:

| Tier | Storage | Latency | Capacity | Best for |
|---|---|---|---|---|
| Tier 1 | GPU HBM | ~1 microsecond | 80-192 GB per GPU | Active context, hot KV blocks |
| Tier 2 | Local NVMe SSD | 50-500 microseconds | 1-8 TB per node | Recently evicted KV blocks, warm context |
| Tier 3 | S3-compatible object storage | 10-500 milliseconds | Unlimited | Archived sessions, shared prefix caches across clusters |

Tier 1 to Tier 2 eviction happens when GPU HBM fills and new requests need space. NIXL serializes the evicted KV blocks to NVMe asynchronously. If the evicted context is requested again before it is overwritten, NIXL loads it back to GPU HBM on demand, adding NVMe read latency to TTFT.

Tier 3 (S3) use cases are different: shared system prompt caches, RAG document caches used across multiple serving nodes, or warm-start context for long-running agent sessions. The latency penalty from S3 reads is too high for interactive workloads, but for batch or asynchronous jobs it is viable.
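A tier-placement policy boils down to a few thresholds. The sketch below is a toy decision function; the 30-second and 10-minute cutoffs are illustrative assumptions, not NIXL defaults.

```python
def choose_tier(idle_seconds, interactive):
    """Toy KV tiering policy; thresholds are illustrative, not NIXL defaults."""
    if idle_seconds < 30:
        return "gpu_hbm"         # Tier 1: hot blocks stay in HBM
    if idle_seconds < 600 or interactive:
        return "local_nvme"      # Tier 2: warm, cheap to reload on demand
    return "object_storage"      # Tier 3: archive; too slow for interactive reads

assert choose_tier(5, True) == "gpu_hbm"
assert choose_tier(120, True) == "local_nvme"
assert choose_tier(3600, False) == "object_storage"
```

Keeping interactive sessions out of Tier 3 reflects the latency table above: an S3 read in the TTFT path would dominate the response time.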

For workloads with repetitive long context (RAG, document QA, chat with long history), configuring NVMe tiering can reduce your per-node GPU memory requirements by 30-50% with acceptable TTFT degradation. For background on the broader KV cache memory challenge, see KV Cache Optimization for LLM Inference.

Benchmarks: Disaggregated vs Monolithic on H100 Clusters

The headline number from NVIDIA's GTC 2026 Dynamo 1.0 announcement is 7x throughput improvement. That figure was measured for disaggregated serving combined with wide expert parallelism on Blackwell (GB200 NVL72) hardware, not a pure disaggregated-vs-monolithic comparison on Hopper. Do not expect 7x on H100 clusters alone. The per-prompt-length picture on H100 is more nuanced:

| Setup | Prompt length | Throughput (req/s) | TTFT p50 | Decode throughput (tok/s/GPU) |
|---|---|---|---|---|
| 8x H100 SXM5 monolithic | 512 tokens | 12.4 | 220ms | 3,100 |
| 8x H100 SXM5 monolithic | 2048 tokens | 4.1 | 890ms | 2,800 |
| 8x H100 SXM5 monolithic | 4096 tokens | 1.8 | 1,740ms | 2,400 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 512 tokens | 28.3 (estimated) | 180ms | 4,200 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 2048 tokens | 19.7 (estimated) | 410ms | 4,900 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 4096 tokens | 11.2 (estimated) | 820ms | 5,100 |

The monolithic numbers come from published vLLM benchmarks on H100 SXM5. The disaggregated figures are estimated based on the 7x headline result scaled per prompt length. Actual results depend on model size, batch composition, and interconnect latency. Run your own benchmarks on your target hardware before making infrastructure decisions. For comparison against other inference engines, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

The throughput gap grows with prompt length because long prefill is where monolithic setups suffer most. At 4K tokens, the decode pool sits nearly idle while prefill chews through the prompt. Disaggregated setups keep both pools busy.

Cost Analysis: Right-Sizing Prefill and Decode on Spheron

The throughput improvement translates directly to cost per token when you can use cheaper GPUs for decode. Here is what the numbers look like on Spheron with live pricing:

| Configuration | GPU setup | Hourly cost | Relative throughput | Cost per 1M tokens (index) |
|---|---|---|---|---|
| Monolithic | 8x H100 SXM5 on-demand | $19.20/hr | 1.0x | 1.00 |
| Disaggregated | 4x H100 SXM5 + 4x A100 80G SXM4 | $13.92/hr | ~4x (estimated) | ~0.18 |
| Disaggregated (spot) | 4x H100 SXM5 spot + 4x A100 80G spot | $5.00/hr | ~4.8x (estimated) | ~0.05 |
Hourly costs: H100 SXM5 on-demand $2.40/GPU, spot $0.80/GPU. A100 80G SXM4 on-demand $1.08/GPU, spot $0.45/GPU.

Pricing fluctuates based on GPU availability. The prices above are based on 03 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The "cost per 1M tokens" index is derived by dividing hardware cost by relative throughput. Even with conservative throughput assumptions (4x, not 7x), the disaggregated A100 decode pool cuts token cost to roughly 18% of the monolithic baseline. On spot pricing the savings are steeper.
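The index calculation is straightforward to reproduce from the per-GPU rates above (cost per unit of throughput, normalized to the monolithic baseline):

```python
def cost_index(hourly_cost, rel_throughput, base_cost=19.20, base_throughput=1.0):
    # Cost per unit throughput, relative to the monolithic baseline.
    return (hourly_cost / rel_throughput) / (base_cost / base_throughput)

print(round(cost_index(19.20, 1.0), 2))  # 1.0  (monolithic baseline)
print(round(cost_index(13.92, 4.0), 2))  # 0.18 (disaggregated on-demand)
print(round(cost_index(5.00, 4.8), 2))   # 0.05 (disaggregated spot)
```

Plug in your own measured relative throughput; the index is only as good as that estimate.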

Two caveats: first, the A100 decode pool adds latency for very long contexts because A100 has lower TFLOPS than H100 for the residual compute in decode. Second, the NIXL metadata server and routing layer add a small ops overhead. For batch workloads where per-request latency is not critical, A100 decode is a clear win. For real-time chat with strict TTFT SLAs, evaluate whether A100 decode latency meets your p95 targets.

You can also mix in spot instances for the decode pool, since decode nodes can be replaced mid-session by migrating KV state to a new node via NIXL's NVMe tier. This is not yet production-ready in all setups but is on the NIXL roadmap.

Explore GPU options: Rent H100 for prefill → | Rent A100 for decode →

Production Checklist

Before deploying a NIXL disaggregated cluster in production:

  1. RDMA driver verification - Run ibv_devices and ibv_devinfo on every node. Confirm Active MTU: 4096 and Active width: 4X. Mismatched MTU causes silent throughput degradation.
  2. NCCL version pinning - Pin NCCL to the version tested with your vLLM release. NCCL updates frequently break tensor-parallel all-reduce patterns. Use NCCL_VERSION=2.21.5 or whatever your vLLM release notes specify.
  3. NIXL metadata server HA - The metadata server is a single point of failure at startup. For production, run two instances behind a simple TCP load balancer (HAProxy or NGINX stream). Workers reconnect automatically on metadata server restart.
  4. Prometheus metrics to watch:
  • nixl_transfer_bytes_total - total bytes moved per direction
  • nixl_transfer_latency_p99 - p99 KV transfer time; alert above 10ms on IB, 25ms on RoCE
  • vllm:kv_cache_usage_perc - if the decode pool hits 95%+, add decode workers
  • vllm:time_to_first_token_seconds - tracks end-to-end prefill + transfer latency
  5. Decode pool autoscaling - Scale decode workers based on vllm:kv_cache_usage_perc and vllm:request_queue_length. Prefill workers scale less frequently since prefill is bursty. Keep a 2:1 or 3:1 decode-to-prefill ratio as a starting point for chat workloads.
  6. NVMe tier flush policy - If using NVMe tiering, set a flush interval (nixl_nvme_flush_interval_ms) that matches your session timeout. Flushing too aggressively wears NVMe SSDs; flushing too infrequently causes OOM on eviction.
  7. Prefill node failure recovery - If a prefill node dies mid-transfer, the decode worker waiting for the KV cache will time out. Recent vLLM releases re-queue the request to a healthy prefill worker after kv_transfer_timeout_ms (default 5000ms). Set this to 2000ms for interactive workloads.
  8. Spheron multi-node networking - On Spheron, place prefill and decode instances in the same cluster region. Bare-metal instances get dedicated public IPs without NAT overhead. Open UDP port 4791 (RoCE) and TCP port 5557 (NIXL metadata) between nodes. Check the Spheron network configuration docs for firewall rules.
  9. NIXL transport thread count - The default is 2 transport threads per process. For 400 Gbps IB NDR links transferring 4096-token KV caches, 4-8 threads improve link saturation. Set via NIXL_NUM_TRANSPORT_THREADS=4.
  10. KV cache dtype alignment - Prefill and decode workers must use the same KV cache dtype (FP8, BF16, or FP16). Mismatches cause NIXL to fall back to CPU-side conversion, adding milliseconds to every transfer. Verify that --kv-cache-dtype matches on both sides.
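The autoscaling rule from the checklist can be sketched as a simple threshold function. The 95% usage and queue-length-8 cutoffs are illustrative assumptions, not vLLM or Dynamo defaults; wire the real values from your Prometheus metrics.

```python
def decode_scale_decision(kv_usage_pct, queue_len, current_workers,
                          usage_high=95.0, queue_high=8):
    """Toy decode-pool autoscaler; thresholds are illustrative assumptions."""
    if kv_usage_pct >= usage_high or queue_len >= queue_high:
        return current_workers + 1   # KV pressure or queueing: scale out
    if kv_usage_pct < 50.0 and queue_len == 0 and current_workers > 1:
        return current_workers - 1   # sustained headroom: scale in
    return current_workers

# Example: decode pool at 96% KV usage with 2 queued requests -> add a worker.
print(decode_scale_decision(96.0, 2, 8))  # 9
```

In practice you would also debounce these decisions (scale only after several consecutive samples) to avoid thrashing on bursty prefill traffic.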

Disaggregated inference with NIXL cuts GPU costs by 3-5x for high-throughput chat workloads. Spheron lets you provision separate prefill and decode GPU pools in the same region with high-bandwidth interconnect.

Rent H100 for prefill → | Rent A100 for decode → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.