Tutorial

NVIDIA NIXL and Disaggregated Inference: Move KV Caches Across GPUs at Wire Speed

Written by Mitrasish, Co-founder | Apr 3, 2026

Tags: NVIDIA NIXL, Disaggregated Inference, KV Cache, vLLM, GPU Cloud, LLM Inference, H100, Prefill Decode Separation

NVIDIA NIXL (NVIDIA Inference Transfer Library) is the transport layer that makes disaggregated LLM inference practical: it moves KV cache tensors from prefill GPUs to decode GPUs over RDMA or NVMe at wire speed. Production systems like Mooncake, llm-d, and DistServe all rely on this kind of KV transfer, and NIXL is a primary connector option for disaggregated prefill in vLLM. If you are building on NVIDIA Dynamo, NIXL handles the low-level transfers that Dynamo's orchestration layer coordinates. For the Dynamo side of the picture, see NVIDIA Dynamo: Disaggregated Inference Orchestration.

TL;DR

  • NIXL is a point-to-point KV cache transfer library open-sourced by NVIDIA at GTC 2025.
  • It supports five backends: RDMA/InfiniBand, RoCE via UCX, TCP fallback, NVMe-oF, and S3-compatible object storage.
  • vLLM uses NIXL as a primary connector option for disaggregated prefill via --kv-transfer-config (NixlConnector, alongside LMCacheConnector, MooncakeConnector, and others).
  • NVIDIA Dynamo's KV Block Manager (KVBM) routes prefill outputs to decode workers using NIXL underneath.
  • Separating prefill and decode pools lets you use cheaper GPUs for decode, cutting cost per token by 3-5x for chat workloads.
  • Spheron lets you rent H100 SXM5 for prefill and A100 80G for decode in the same region, with the interconnect bandwidth NIXL requires.

Why Disaggregated Inference Is Now the Default

Running prefill and decode on the same GPU wastes both. Prefill is compute-bound: it processes the entire prompt in one forward pass, saturating FLOPS. Decode is memory-bandwidth-bound: every token step re-reads the full KV cache and the model weights, so it is limited by HBM throughput rather than compute. Put them on the same GPU and each phase runs below its hardware limits, constrained by whatever the other phase needs.

| Property | Prefill phase | Decode phase |
|---|---|---|
| Bottleneck | Compute (TFLOPS) | Memory bandwidth (TB/s) |
| GPU utilization pattern | Burst, then idle | Sustained, low arithmetic intensity |
| Duration | Seconds (long prompts) | Milliseconds per token |
| Parallelism | Embarrassingly parallel across prompt tokens | Sequential token generation |

The consequence is poor GPU utilization across both dimensions simultaneously. Disaggregated inference fixes this by giving each phase its own hardware. For the underlying KV cache memory math and why this matters, see KV Cache Optimization for LLM Inference. For a broader view of what causes slow inference, see Diagnosing Slow LLM Inference.

Disaggregated setups require moving the KV cache produced by prefill workers to decode workers. NIXL is the library that does this transfer.
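To get a feel for how much data that transfer involves, the per-token KV cache size follows directly from the model architecture. The sketch below uses the published Llama-3.1-70B shape (80 layers, 8 KV heads under GQA, head dim 128) and BF16 storage; swap in your own model's numbers.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head dim 128, BF16 (2 bytes).
per_token = kv_bytes_per_token(80, 8, 128)
print(per_token)                     # 327680 bytes, ~320 KiB per token
print(4096 * per_token / 2**30)     # 1.25 GiB for a 4K-token prompt
```

At 400 Gbps (~50 GB/s) a naive bulk transfer of that 4K-token cache would take tens of milliseconds, which is why implementations typically overlap per-layer KV transfers with prefill compute so only the final layers' transfer sits on the critical path.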

What Is NVIDIA NIXL

NIXL (NVIDIA Inference Transfer Library) is an asynchronous point-to-point transfer library for KV cache tensors. It abstracts the transport layer so inference frameworks can move KV data between any combination of GPU HBM, local NVMe, and remote storage without writing transport-specific code.

NIXL exposes a simple API: register a memory buffer, initiate an async transfer, poll for completion. The transport backend handles the actual data movement. Five backends are available:
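The register/initiate/poll pattern looks roughly like the toy sketch below. This is not the real NIXL API; the class and method names are made up for illustration, and a background thread stands in for the RDMA engine.

```python
import threading

class ToyTransferAgent:
    """Illustrative stand-in for a NIXL-style agent (not the real API)."""
    def __init__(self):
        self.buffers = {}

    def register_buffer(self, name, data):
        # Real NIXL pins and registers memory so the NIC can DMA it directly.
        self.buffers[name] = data

    def start_transfer(self, src, dst):
        # Async initiate: returns immediately with a completion handle.
        done = threading.Event()
        def run():
            self.buffers[dst] = bytes(self.buffers[src])  # stands in for an RDMA write
            done.set()
        threading.Thread(target=run, daemon=True).start()
        return done

    def poll(self, handle, timeout=1.0):
        # Caller polls (or waits on) the handle for completion.
        return handle.wait(timeout)

agent = ToyTransferAgent()
agent.register_buffer("prefill_kv", b"\x00" * 1024)  # pretend KV block
agent.register_buffer("decode_kv", b"")
handle = agent.start_transfer("prefill_kv", "decode_kv")
assert agent.poll(handle)
```

The important property is that the initiating side never blocks: prefill can start shipping completed KV layers while later layers are still being computed.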

| Backend | Latency profile | Best use case |
|---|---|---|
| RDMA/InfiniBand NDR (400 Gbps) | Sub-1ms for 4K-token KV | Production multi-node, same data center |
| RoCE via UCX | 1-3ms, depends on switch configuration | Multi-node without IB hardware |
| TCP fallback (UCX) | 5-20ms | Dev/test, latency-tolerant workloads |
| NVMe-oF | 2-10ms (local NVMe), higher remotely | KV tiering, evicting cool context off GPU |
| S3-compatible object storage | 50-500ms | Archiving, shared prefix caches, multi-cluster reuse |

Integration points in the inference ecosystem:

  • vLLM: NixlConnector in --kv-transfer-config (one of several supported connectors)
  • NVIDIA Dynamo: KV Block Manager (KVBM) uses NIXL for inter-worker transfers
  • SGLang: NIXL connector available via community contribution
  • TensorRT-LLM: NIXL integration in the executor layer for disaggregated builds

Prefill vs Decode Workers: The Split Explained

When a request arrives at a disaggregated inference system, it goes to a prefill worker first. The prefill worker processes the entire prompt, computes the KV cache for every prompt token, and then stops. It does not generate any output tokens. The KV cache is the output of prefill. NIXL transfers that KV cache to a decode worker.

The decode worker receives the KV cache, picks up where prefill left off, and generates tokens one at a time. Each decode step reads the KV cache for all preceding context, appends the new token's K and V vectors, and passes the output through the model head.

| Property | Prefill node | Decode node |
|---|---|---|
| Bottleneck | Compute (TFLOPS) | Memory bandwidth (TB/s) |
| Recommended GPU | H100 SXM5 | A100 80G, H100 PCIe |
| Token output | None | All generated tokens |
| KV cache role | Producer | Consumer |
| Typical GPU count in cluster | Fewer, larger | More, smaller |

The hardware recommendation follows directly from the bottleneck. H100 SXM5 delivers 1,979 TFLOPS BF16 (with sparsity; roughly 989 dense) and NVLink 4.0 for fast inter-GPU communication during tensor-parallel prefill. A100 80G SXM4 has 2 TB/s HBM2e bandwidth and 80 GB of memory: the right profile for decode, where you need to read the KV cache and model weights repeatedly per token.
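These bottlenecks can be put on a back-of-envelope roofline. The sketch below assumes ~2 FLOPs per parameter per prompt token for prefill, a 40% MFU, ~990 dense BF16 TFLOPS for H100, and that each decode step streams the per-GPU weight shard plus some KV bytes from HBM once; all numbers are rough planning estimates, not benchmarks.

```python
def prefill_seconds(params, prompt_tokens, tflops, mfu=0.4):
    # Forward pass costs roughly 2 * params FLOPs per prompt token.
    flops = 2 * params * prompt_tokens
    return flops / (tflops * 1e12 * mfu)

def decode_tokens_per_sec(weight_bytes_per_gpu, kv_bytes_per_step, bw_tb_s):
    # Bandwidth-bound lower bound: each step streams weights (+ KV) from HBM.
    return (bw_tb_s * 1e12) / (weight_bytes_per_gpu + kv_bytes_per_step)

# 70B model, 4096-token prompt, one H100 (~990 dense BF16 TFLOPS, 40% MFU):
print(round(prefill_seconds(70e9, 4096, 990), 2))        # 1.45 seconds

# 70B BF16 over TP=4 -> ~35 GB weights per GPU; ~1.25 GB KV read per step
# (illustrative); A100 at ~2 TB/s HBM bandwidth:
print(round(decode_tokens_per_sec(35e9, 1.25e9, 2.0), 1))  # 55.2 tok/s
```

The decode figure is a single-sequence floor; batching many sequences amortizes the weight reads and is how decode pools reach thousands of tokens per second per GPU.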

Setting Up NIXL with vLLM: Step-by-Step

You need a recent vLLM release with disaggregated prefill support. The --kv-transfer-config JSON schema changed across minor versions; check your vLLM release notes for the correct connector options. For production deployment patterns with vLLM on multi-GPU bare metal, see vLLM Multi-GPU Production Deployment.

Step 1: Install dependencies on all nodes

```bash
pip install vllm
pip install nixl
```

Verify RDMA devices are visible:

```bash
ibv_devices
```

You should see at least one device (e.g., mlx5_0). If the list is empty, InfiniBand drivers are not installed or RDMA is not available on your hardware. On Spheron bare-metal H100 instances, IB drivers come pre-installed.

Step 2: Start the NIXL metadata server

The metadata server handles RDMA endpoint discovery. Run it on any reachable coordination node:

```bash
nixl-metadata-server --port 5557
```

Both prefill and decode workers register their endpoints at startup. No central scheduler is needed.
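The registration flow can be pictured with a toy in-memory registry. The real metadata server speaks its own wire protocol; this only illustrates the register/lookup idea that lets workers find each other's RDMA endpoints by role and rank.

```python
class ToyMetadataRegistry:
    """Illustrative endpoint registry keyed by (role, rank); not the real server."""
    def __init__(self):
        self.endpoints = {}

    def register(self, role, rank, addr):
        # Each worker announces its transfer endpoint at startup.
        self.endpoints[(role, rank)] = addr

    def lookup(self, role, rank):
        # Peers resolve the endpoint they need to transfer to/from.
        return self.endpoints.get((role, rank))

reg = ToyMetadataRegistry()
reg.register("kv_producer", 0, "10.0.0.1:5557")
reg.register("kv_consumer", 0, "10.0.0.2:5557")
print(reg.lookup("kv_consumer", 0))  # 10.0.0.2:5557
```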

Step 3: Start prefill workers

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_rank":0}' \
  --tensor-parallel-size 4 \
  --port 8100
```

The kv_rank identifies this prefill worker in the transfer ring. If you run multiple prefill workers, increment kv_rank per worker.

Step 4: Start decode workers

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_rank":0}' \
  --tensor-parallel-size 4 \
  --port 8200
```

For multiple decode workers, use the same kv_rank as the corresponding prefill worker or configure a many-to-many mapping depending on your request router.

Step 5: Put a router in front

vLLM alone does not include a disaggregated request router. Use any proxy that can route prefill requests to prefill endpoints and decode requests to decode endpoints. NGINX with custom Lua routing, NVIDIA Dynamo's frontend, or a lightweight Python proxy all work.
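The routing decision itself is small. The sketch below is a hypothetical round-robin router that pairs prefill and decode workers by kv_rank; the URLs and the 1:1 pairing are illustrative assumptions, not a vLLM feature.

```python
import itertools

class DisaggRouter:
    """Hypothetical routing sketch: pick a prefill worker for the prompt,
    then stream tokens from the decode worker paired by kv_rank."""
    def __init__(self, prefill_urls, decode_urls):
        assert len(prefill_urls) == len(decode_urls)  # assume 1:1 rank pairing
        self.pairs = list(zip(prefill_urls, decode_urls))
        self._rr = itertools.cycle(range(len(self.pairs)))

    def route(self):
        rank = next(self._rr)  # round-robin across worker pairs
        prefill, decode = self.pairs[rank]
        return {"kv_rank": rank, "prefill": prefill, "decode": decode}

router = DisaggRouter(["http://prefill-0:8100"], ["http://decode-0:8200"])
print(router.route())
```

A production router would also account for prefix-cache locality and queue depth, which is exactly what Dynamo's frontend adds on top.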

Step 6: Validate end-to-end

```bash
curl http://localhost:8200/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain the transformer attention mechanism.",
    "max_tokens": 200
  }'
```

Check the decode node logs for a line like:

```
nixl: received KV transfer from rank=0, tokens=47, latency=2.3ms
```

Sub-5ms for a 47-token prompt on InfiniBand is expected. If you see 50ms+, TCP fallback is active and your RDMA setup needs attention.

NIXL + NVIDIA Dynamo: KVBM and Intelligent Routing

NIXL handles bits on the wire. NVIDIA Dynamo adds the orchestration layer: it decides which prefill worker handles each request, which decode worker receives the resulting KV cache, and how to reuse prefix cache hits across workers.

Dynamo's KV Block Manager (KVBM) sits above NIXL and tracks which KV blocks are on which worker. When a new request arrives with a prefix that was already computed by a prior request, KVBM routes the new request to the decode worker that already has those KV blocks, skipping prefill entirely for the shared prefix. For a full walkthrough of Dynamo's orchestration architecture and the broader disaggregated inference setup, see NVIDIA Dynamo: Disaggregated Inference Orchestration.

The Dynamo config YAML to enable NIXL transport:

```yaml
# dynamo.yaml
kv_transfer:
  backend: nixl
  metadata_server: "tcp://coordination-node:5557"
  rdma_device: mlx5_0
  transport_threads: 4

workers:
  prefill:
    count: 4
    gpu_type: h100_sxm5
    tensor_parallel: 4
  decode:
    count: 8
    gpu_type: a100_80g
    tensor_parallel: 2
```

With this config, Dynamo passes the NIXL endpoint configuration to each worker at startup. Workers register with the metadata server and KVBM begins routing transfers through NIXL automatically.

KV Cache Tiering: GPU to NVMe to Object Storage

NIXL enables a three-tier memory hierarchy for KV cache that goes beyond GPU HBM. Each tier trades latency for capacity and cost:

| Tier | Storage | Latency | Capacity | Best for |
|---|---|---|---|---|
| Tier 1 | GPU HBM | ~1 microsecond | 80-192 GB per GPU | Active context, hot KV blocks |
| Tier 2 | Local NVMe SSD | 50-500 microseconds | 1-8 TB per node | Recently evicted KV blocks, warm context |
| Tier 3 | S3-compatible object storage | 10-500 milliseconds | Unlimited | Archived sessions, shared prefix caches across clusters |

Tier 1 to Tier 2 eviction happens when GPU HBM fills and new requests need space. NIXL serializes the evicted KV blocks to NVMe asynchronously. If the evicted context is requested again before it is overwritten, NIXL loads it back to GPU HBM on demand, adding NVMe read latency to TTFT.

Tier 3 (S3) use cases are different: shared system prompt caches, RAG document caches used across multiple serving nodes, or warm-start context for long-running agent sessions. The latency penalty from S3 reads is too high for interactive workloads, but for batch or asynchronous jobs it is viable.
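A tier-placement policy boils down to a few thresholds. The sketch below is a toy decision function; the 30-second and 10-minute cutoffs are illustrative assumptions, not NIXL defaults.

```python
def choose_tier(idle_seconds, interactive):
    """Toy KV tiering policy; thresholds are illustrative, not NIXL defaults."""
    if idle_seconds < 30:
        return "gpu_hbm"         # Tier 1: hot blocks stay in HBM
    if idle_seconds < 600 or interactive:
        return "local_nvme"      # Tier 2: warm, cheap to reload on demand
    return "object_storage"      # Tier 3: archive; too slow for interactive reads

assert choose_tier(5, True) == "gpu_hbm"
assert choose_tier(120, True) == "local_nvme"
assert choose_tier(3600, False) == "object_storage"
```

Keeping interactive sessions out of Tier 3 reflects the latency table above: an S3 read in the TTFT path would dominate the response time.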

For workloads with repetitive long context (RAG, document QA, chat with long history), configuring NVMe tiering can reduce your per-node GPU memory requirements by 30-50% with acceptable TTFT degradation. For background on the broader KV cache memory challenge, see KV Cache Optimization for LLM Inference.

Benchmarks: Disaggregated vs Monolithic on H100 Clusters

The headline number from NVIDIA's GTC 2026 Dynamo 1.0 announcement is 7x throughput improvement. That figure was measured for disaggregated serving combined with wide expert parallelism on Blackwell (GB200 NVL72) hardware, not a pure disaggregated-vs-monolithic comparison on Hopper. Do not expect 7x on H100 clusters alone. The per-prompt-length picture on H100 is more nuanced:

| Setup | Prompt length | Throughput (req/s) | TTFT p50 | Decode throughput (tok/s/GPU) |
|---|---|---|---|---|
| 8x H100 SXM5 monolithic | 512 tokens | 12.4 | 220ms | 3,100 |
| 8x H100 SXM5 monolithic | 2048 tokens | 4.1 | 890ms | 2,800 |
| 8x H100 SXM5 monolithic | 4096 tokens | 1.8 | 1,740ms | 2,400 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 512 tokens | 28.3 (estimated) | 180ms | 4,200 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 2048 tokens | 19.7 (estimated) | 410ms | 4,900 |
| 4x H100 SXM5 prefill + 4x A100 80G decode | 4096 tokens | 11.2 (estimated) | 820ms | 5,100 |

The monolithic numbers come from published vLLM benchmarks on H100 SXM5. The disaggregated figures are estimated based on the 7x headline result scaled per prompt length. Actual results depend on model size, batch composition, and interconnect latency. Run your own benchmarks on your target hardware before making infrastructure decisions. For comparison against other inference engines, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

The throughput gap grows with prompt length because long prefill is where monolithic setups suffer most. At 4K tokens, the decode pool sits nearly idle while prefill chews through the prompt. Disaggregated setups keep both pools busy.

Cost Analysis: Right-Sizing Prefill and Decode on Spheron

The throughput improvement translates directly to cost per token when you can use cheaper GPUs for decode. Here is what the numbers look like on Spheron with live pricing:

| Configuration | GPU setup | Hourly cost | Relative throughput | Cost per 1M tokens (index) |
|---|---|---|---|---|
| Monolithic | 8x H100 SXM5 on-demand | $19.20/hr | 1.0x | 1.00 |
| Disaggregated | 4x H100 SXM5 + 4x A100 80G SXM4 | $13.92/hr | ~4x (estimated) | ~0.18 |
| Disaggregated (spot) | 4x H100 SXM5 spot + 4x A100 80G spot | $5.00/hr | ~4.8x (estimated) | ~0.05 |
Hourly costs: H100 SXM5 on-demand $2.40/GPU, spot $0.80/GPU. A100 80G SXM4 on-demand $1.08/GPU, spot $0.45/GPU.

Pricing fluctuates based on GPU availability. The prices above are based on 03 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The "cost per 1M tokens" index is derived by dividing hardware cost by relative throughput. Even with conservative throughput assumptions (4x, not 7x), the disaggregated A100 decode pool cuts token cost to roughly 18% of the monolithic baseline. On spot pricing the savings are steeper.
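The index calculation is straightforward to reproduce from the per-GPU rates above (cost per unit of throughput, normalized to the monolithic baseline):

```python
def cost_index(hourly_cost, rel_throughput, base_cost=19.20, base_throughput=1.0):
    # Cost per unit throughput, relative to the monolithic baseline.
    return (hourly_cost / rel_throughput) / (base_cost / base_throughput)

print(round(cost_index(19.20, 1.0), 2))  # 1.0  (monolithic baseline)
print(round(cost_index(13.92, 4.0), 2))  # 0.18 (disaggregated on-demand)
print(round(cost_index(5.00, 4.8), 2))   # 0.05 (disaggregated spot)
```

Plug in your own measured relative throughput; the index is only as good as that estimate.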

Two caveats: first, the A100 decode pool adds latency for very long contexts because A100 has lower TFLOPS than H100 for the residual compute in decode. Second, the NIXL metadata server and routing layer add a small ops overhead. For batch workloads where per-request latency is not critical, A100 decode is a clear win. For real-time chat with strict TTFT SLAs, evaluate whether A100 decode latency meets your p95 targets.

You can also mix in spot instances for the decode pool, since decode nodes can be replaced mid-session by migrating KV state to a new node via NIXL's NVMe tier. This is not yet production-ready in all setups but is on the NIXL roadmap.

Explore GPU options: Rent H100 for prefill → | Rent A100 for decode →

Production Checklist

Before deploying a NIXL disaggregated cluster in production:

  1. RDMA driver verification - Run ibv_devices and ibv_devinfo on every node. Confirm Active MTU: 4096 and Active width: 4X. Mismatched MTU causes silent throughput degradation.
  2. NCCL version pinning - Pin NCCL to the version tested with your vLLM release. NCCL updates frequently break tensor-parallel all-reduce patterns. Use NCCL_VERSION=2.21.5 or whatever your vLLM release notes specify.
  3. NIXL metadata server HA - The metadata server is a single point of failure at startup. For production, run two instances behind a simple TCP load balancer (HAProxy or NGINX stream). Workers reconnect automatically on metadata server restart.
  4. Prometheus metrics to watch:
  • nixl_transfer_bytes_total - total bytes moved per direction
  • nixl_transfer_latency_p99 - p99 KV transfer time; alert above 10ms on IB, 25ms on RoCE
  • vllm:kv_cache_usage_perc - if the decode pool hits 95%+, add decode workers
  • vllm:time_to_first_token_seconds - tracks end-to-end prefill + transfer latency
  5. Decode pool autoscaling - Scale decode workers based on vllm:kv_cache_usage_perc and vllm:request_queue_length. Prefill workers scale less frequently since prefill is bursty. Keep a 2:1 or 3:1 decode-to-prefill ratio as a starting point for chat workloads.
  6. NVMe tier flush policy - If using NVMe tiering, set a flush interval (nixl_nvme_flush_interval_ms) that matches your session timeout. Flushing too aggressively wears NVMe SSDs; flushing too infrequently causes OOM on eviction.
  7. Prefill node failure recovery - If a prefill node dies mid-transfer, the decode worker waiting for the KV cache will time out. Recent vLLM releases re-queue the request to a healthy prefill worker after kv_transfer_timeout_ms (default 5000ms). Set this to 2000ms for interactive workloads.
  8. Spheron multi-node networking - On Spheron, place prefill and decode instances in the same cluster region. Bare-metal instances get dedicated public IPs without NAT overhead. Open UDP port 4791 (RoCE) and TCP port 5557 (NIXL metadata) between nodes. Check the Spheron network configuration docs for firewall rules.
  9. NIXL transport thread count - The default is 2 transport threads per process. For 400 Gbps IB NDR links transferring 4096-token KV caches, 4-8 threads improve link saturation. Set via NIXL_NUM_TRANSPORT_THREADS=4.
  10. KV cache dtype alignment - Prefill and decode workers must use the same KV cache dtype (FP8, BF16, or FP16). Mismatches cause NIXL to fall back to CPU-side conversion, adding milliseconds to every transfer. Verify that --kv-cache-dtype matches on both sides.
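The autoscaling rule from the checklist can be sketched as a simple threshold function. The 95% usage and queue-length-8 cutoffs are illustrative assumptions, not vLLM or Dynamo defaults; wire the real values from your Prometheus metrics.

```python
def decode_scale_decision(kv_usage_pct, queue_len, current_workers,
                          usage_high=95.0, queue_high=8):
    """Toy decode-pool autoscaler; thresholds are illustrative assumptions."""
    if kv_usage_pct >= usage_high or queue_len >= queue_high:
        return current_workers + 1   # KV pressure or queueing: scale out
    if kv_usage_pct < 50.0 and queue_len == 0 and current_workers > 1:
        return current_workers - 1   # sustained headroom: scale in
    return current_workers

# Example: decode pool at 96% KV usage with 2 queued requests -> add a worker.
print(decode_scale_decision(96.0, 2, 8))  # 9
```

In practice you would also debounce these decisions (scale only after several consecutive samples) to avoid thrashing on bursty prefill traffic.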

Disaggregated inference with NIXL cuts GPU costs by 3-5x for high-throughput chat workloads. Spheron lets you provision separate prefill and decode GPU pools in the same region with high-bandwidth interconnect.

Rent H100 for prefill → | Rent A100 for decode → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.