Engineering

Prefill-Decode Disaggregation on GPU Cloud: Split LLM Inference for 2x Throughput (2026 Guide)

Written by Mitrasish, Co-founder · Apr 3, 2026

Most LLM serving guides treat inference as a single workload. It is not. Inference has two phases with opposite hardware requirements, and running them on the same GPU means both run slower than they have to.

Disaggregating prefill and decode onto separate GPUs fixes that. Here is how it works, when it is worth doing, and how to set it up with vLLM or SGLang on Spheron's GPU cloud.

Why Prefill and Decode Pull in Opposite Directions

When you send a prompt to an LLM, two things happen in sequence:

Prefill: the model processes your entire input prompt in one forward pass. Every token in the prompt attends to every other token. This is a dense matrix multiply operation. It is compute-bound: the limiting factor is raw FLOPS (floating-point operations per second), not memory bandwidth.

Decode: the model generates output tokens one at a time. Each new token needs to read the key-value vectors for every previously generated token from memory. This is memory-bound: the GPU spends most of its time loading KV cache tensors, not computing.

| Phase | Bottleneck | Ideal GPU | What wastes capacity |
| --- | --- | --- | --- |
| Prefill | FP8 TFLOPS | H100, B200, B300 | Sitting idle waiting for decode |
| Decode | HBM bandwidth (TB/s) | H200 (4.8 TB/s) | Stalled by prefill monopolizing the GPU |

When both phases share the same GPU, prefill jobs block decode batches from continuing. A long 32K-token prompt arriving mid-generation stalls every other user in the queue. This head-of-line blocking is the core problem disaggregation solves.
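The compute-vs-memory split falls out of a back-of-the-envelope arithmetic-intensity calculation. The sketch below assumes a 70B-parameter model at FP8 and ~2 FLOPs per parameter per token; the numbers are illustrative, not measured:

```python
# Back-of-the-envelope arithmetic intensity: FLOPs per byte of weights read
# per forward pass. Illustrative numbers for a 70B-parameter model at FP8.
def arithmetic_intensity(tokens_per_pass: int) -> float:
    params = 70e9                         # model parameters (assumed)
    bytes_per_param = 1                   # FP8 storage
    flops = 2 * params * tokens_per_pass  # ~2 FLOPs per parameter per token
    bytes_read = params * bytes_per_param # weights stream from HBM once per pass
    return flops / bytes_read

print(arithmetic_intensity(8192))  # prefill over an 8k prompt -> 16384.0
print(arithmetic_intensity(1))     # decode, one token per pass -> 2.0
```

Prefill amortizes each weight load over thousands of tokens, so raw FLOPS is the limit; decode reloads everything to produce a single token, so bandwidth is.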

For a deeper look at how KV cache memory math constrains decode, see our KV Cache Optimization guide.

How Disaggregation Works: Data Flow and Architecture

In a disaggregated setup, you run two separate pools of GPU nodes:

  • Prefill nodes: receive incoming requests, process the full prompt through the model, and produce a KV cache block for each request
  • Decode nodes: receive the KV cache from the prefill node and run autoregressive generation until the response is complete

The request flow looks like this:

```
User request
     |
     v
  Router
     |
     v
Prefill node (H100/B200)
  - Runs forward pass over prompt
  - Generates KV cache
     |
     | (KV cache transfer via NIXL)
     v
Decode node (H200)
  - Receives KV cache
  - Generates output tokens
     |
     v
  Response
```

The key enabling technology is NIXL (NVIDIA Inference Xfer Library), which transfers KV cache tensors between nodes using RDMA or TCP. In 2026, NIXL is the standard mechanism for this transfer in both vLLM and NVIDIA Dynamo.

NVIDIA Dynamo 1.0 uses this same disaggregated architecture at scale, with its own routing layer on top of vLLM. See our NVIDIA Dynamo deployment guide for the Dynamo-specific setup.

Chunked Prefill vs Full Disaggregation: Choosing the Right Approach

Full disaggregation is not always the right choice. Chunked prefill solves a similar problem with less infrastructure overhead.

Chunked prefill runs on a single node. Instead of processing a long prompt in one blocking pass, it breaks the prompt into fixed-size chunks and interleaves them with ongoing decode steps. A 32K-token prompt gets processed in, say, 4 chunks of 8K tokens each, with decode steps running between chunks. Decode is never fully blocked.

Full disaggregation uses separate nodes. The prefill node runs completely independently from the decode node. KV cache transfers happen over the network. There is no sharing of GPU resources between phases.

| Approach | When to use | Overhead | Throughput gain |
| --- | --- | --- | --- |
| Chunked prefill | Prompts under 8k tokens, single node | Low | 20-40% |
| Full disaggregation | Prompts 8k+, high concurrency | Medium (network) | 1.5-2.5x |

For chunked prefill, add these flags to your vLLM launch:

```bash
--enable-chunked-prefill \
--max-num-batched-tokens 512
```

Start with chunked prefill. If you are consistently hitting 8k+ prompt lengths with high concurrency and chunked prefill is not moving the needle enough, then add the infrastructure for full disaggregation.

Cache-Aware Disaggregation for Long-Context Models

If your workload has a shared system prompt (a long preamble that every request shares), prefix caching can eliminate most of your prefill cost.

With prefix caching enabled, the prefill node computes the KV cache for the shared prefix once and stores it. Subsequent requests that share that prefix skip the prefill entirely for the cached portion and only compute the KV cache for the novel parts of the prompt.

SGLang's RadixAttention handles this efficiently on the prefill side: it uses a radix tree to find the longest cached prefix for any incoming request and routes only the uncached suffix to the prefill compute path.
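To make the lookup concrete, here is a toy longest-cached-prefix match. It is a plain token trie, far simpler than SGLang's compressed radix tree, and the token IDs are made up:

```python
# Toy longest-cached-prefix lookup (illustrative; SGLang's RadixAttention
# uses a compressed radix tree, not this naive trie).
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that the KV cache for this token sequence is stored."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Number of leading tokens whose KV cache is already computed."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # shared prefix, already prefilled
print(cache.longest_prefix([1, 2, 3, 9, 9]))  # 3 tokens cached, 2 need prefill
```

Only the uncached suffix (two tokens here) goes to the prefill compute path.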

In a disaggregated setup, route requests by novelty:

  • Novel prompts (not in cache): send to prefill nodes, which compute the full KV cache and write it to the cache
  • Continuation requests (same session, generating more tokens): route directly to decode nodes, which already have the KV cache for the prior context
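A router implementing this split can be as small as a session lookup. A minimal sketch, where the pool URLs and the `session_id` field are placeholders rather than vLLM or SGLang API:

```python
# Toy novelty-aware router: continuations go straight to decode, novel
# prompts go through prefill first. URLs and fields are placeholders.
PREFILL_URL = "http://prefill:8200"
DECODE_URL = "http://decode:8201"

def route(request: dict, active_sessions: set) -> str:
    """Pick the pool that should handle this request."""
    if request.get("session_id") in active_sessions:
        return DECODE_URL  # KV cache for the prior context already lives there
    return PREFILL_URL     # novel prompt: compute the KV cache first

sessions = {"abc"}
print(route({"session_id": "abc"}, sessions))  # http://decode:8201
print(route({"session_id": "new"}, sessions))  # http://prefill:8200
```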

Prefix caching is one of the highest-leverage KV cache optimizations available. The details are in KV Cache Optimization: Serve 10x More Users.

Hardware Planning: What to Buy (or Rent) for Each Role

Prefill nodes need maximum FP8 TFLOPS and fast NVLink for tensor parallelism within the node. The computation is dense matrix multiply, so raw FLOPS is the bottleneck. H100 SXM5 and B200 SXM6 are the natural choices.

Decode nodes need maximum HBM bandwidth and capacity. Each generated token requires loading KV vectors for every prior token from HBM. The H200 SXM5 with 141GB HBM3e at 4.8 TB/s bandwidth is purpose-built for this. For smaller models where the KV cache fits in less memory, A100 80GB is a cost-effective decode option.
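A rough way to see why bandwidth is the decode bottleneck: each decode forward pass must stream the model weights (plus in-flight KV cache) out of HBM, so bandwidth puts a hard ceiling on passes per second. The sizes below are ballpark figures for Llama 3.1 70B at FP8, not measurements:

```python
# Rough ceiling on decode forward passes per second: every pass streams the
# weights and in-flight KV cache from HBM.
def decode_passes_per_sec(hbm_bw_tbs: float, model_gb: float, kv_gb: float) -> float:
    bytes_per_pass = (model_gb + kv_gb) * 1e9
    return hbm_bw_tbs * 1e12 / bytes_per_pass

print(round(decode_passes_per_sec(4.8, 70, 10)))   # H200: ~60 passes/s
print(round(decode_passes_per_sec(3.35, 70, 10)))  # H100: ~42 passes/s
```

Each pass emits one token per request in the batch, which is why batching raises decode throughput without changing this per-pass ceiling.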

Here are the current Spheron prices for these GPUs:

| GPU | Role | On-demand | Spot | HBM | Bandwidth |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 | Prefill | $2.40/hr | $0.80/hr | 80GB | 3.35 TB/s |
| B200 SXM6 | Prefill | $7.43/hr | N/A | 180GB | 7.7 TB/s |
| H200 SXM5 | Decode | $4.54/hr | N/A | 141GB | 4.8 TB/s |
| A100 80GB SXM4 | Decode | $1.05/hr | N/A | 80GB | 2.0 TB/s |

Pricing fluctuates based on GPU availability. The prices above are as of 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, choosing GPU types per instance without fixed cluster configurations.

Compare full GPU specs at Best GPU for AI Inference in 2026.

Step-by-Step Setup with vLLM on Spheron

Step 1: Provision Your Nodes

Log in to app.spheron.ai and create two GPU instances:

  • One H100 SXM5 for the prefill worker
  • One H200 SXM5 for the decode worker

Place both in the same region to minimize inter-node transfer latency. Note both instances' private IP addresses.

Step 2: Install vLLM

On both nodes, install vLLM v0.8+ (required for NixlConnector and disaggregated prefilling support):

```bash
pip install vllm
# Verify version supports P/D disaggregation
python -c "import vllm; print(vllm.__version__)"
```

Ensure CUDA 12.4+ is installed. On the prefill node, enable FP8 by passing --quantization fp8 at launch.

Step 3: Start the Prefill Instance

On the H100 prefill node:

```bash
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_buffer_device":"cuda","kv_connector_extra_config":{"decode_addr":"<DECODE_NODE_IP>:8201"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --kv-transfer-config "$VLLM_KV_TRANSFER_CONFIG" \
  --port 8200
```

The prefill node generates KV cache for each request and transfers it to the decode node address.

Step 4: Start the Decode Instance

On the H200 decode node:

```bash
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_buffer_device":"cuda","kv_connector_extra_config":{"prefill_addr":"<PREFILL_NODE_IP>:8200"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-transfer-config "$VLLM_KV_TRANSFER_CONFIG" \
  --port 8201
```

Step 5: Route Traffic Through the P/D Router

vLLM ships a minimal proxy server for routing between prefill and decode instances. Start it with both node addresses:

```bash
python vllm/tests/disagg_prefill/toy_proxy_server.py \
  --prefill http://<PREFILL_NODE_IP>:8200 \
  --decode http://<DECODE_NODE_IP>:8201 \
  --port 8000
```

Send requests to port 8000 on the router node. The proxy forwards incoming requests to the prefill node and handles the KV cache handoff to decode automatically. For production traffic, replace the toy proxy with a third-party load balancer or your own routing layer that implements the same prefill-then-decode handoff logic.
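Once the proxy is up, clients talk to it through the standard OpenAI-compatible chat completions schema. A minimal stdlib sketch, where `<ROUTER_IP>` is a placeholder for the proxy node's address:

```python
# Minimal client for the router's OpenAI-compatible endpoint.
# <ROUTER_IP> is a placeholder for the proxy node's address.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://<ROUTER_IP>:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment against a live router
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The client never sees the prefill/decode split; the handoff happens entirely behind the router.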

SGLang alternative: SGLang supports disaggregated serving via its router with --disaggregation-mode. The configuration is similar: one prefill worker, one decode worker, and a router process pointing to both. SGLang's RadixAttention gives it an edge for prefix-heavy workloads.

For a full guide to vLLM multi-GPU configuration without disaggregation, see vLLM Multi-GPU Production Deployment 2026.

For quick-start deployment templates on Spheron, see Spheron LLM quick-guides.

KV Cache Transfer: Network Requirements and NIXL

NIXL uses RDMA (InfiniBand or RoCE) for sub-millisecond KV cache transfer when available. It falls back to TCP, which adds more latency but still works.

To size your network requirements, estimate the KV cache per batch:

KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For Llama 3.1 70B at FP8 with 8k context and a batch of 8 requests:

2 × 80 layers × 8 KV heads × 128 head_dim × 8192 tokens × 1 byte × 8 requests = ~10 GB per batch
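The same formula as a small helper, reproducing the worked example:

```python
# KV cache sizing helper matching the formula above (Llama 3.1 70B at FP8).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_el: int, batch: int) -> int:
    # factor of 2: one K tensor and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el * batch

size = kv_cache_bytes(80, 8, 128, 8192, 1, 8)
print(f"{size / 1e9:.1f} GB")  # 10.7 GB per batch of 8
```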

To transfer 10 GB before the decode node can start generating, you need high inter-node bandwidth. Practical minimums:

  • 10GbE: adequate for short contexts under 2k tokens
  • 100GbE: required for 8k+ context disaggregation
  • InfiniBand / RoCE (RDMA): recommended for 32k+ contexts or latency-sensitive workloads
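These minimums fall out of simple transfer-time arithmetic. For the ~1.34 GB per-request KV cache from the worked example above (ignoring protocol overhead and assuming an otherwise idle link):

```python
# Naive transfer-time estimate for one request's KV cache (~1.34 GB at 8k
# context) over common link speeds. Ignores protocol overhead and contention.
def transfer_ms(gigabytes: float, link_gbps: float) -> float:
    return gigabytes * 8 / link_gbps * 1000  # GB -> Gbit, then s -> ms

for name, gbps in [("10GbE", 10), ("100GbE", 100), ("400G InfiniBand", 400)]:
    print(name, round(transfer_ms(1.34, gbps)), "ms")
```

Over 10GbE the transfer alone adds more than a second to TTFT, which is why it is only viable for short contexts.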

See NVIDIA Dynamo disaggregated inference guide for Dynamo-native NIXL configuration details.

Benchmarks: Throughput and Cost vs Colocated Inference

These figures are based on published vLLM and SGLang benchmark data for Llama 3.1 70B, adjusted for Spheron pricing at 03 Apr 2026:

| Setup | Model | GPUs | TTFT (p90) | Throughput (tokens/s) | Cost |
| --- | --- | --- | --- | --- | --- |
| Colocated (H100 x2) | Llama 3.1 70B | 2x H100 | ~800ms | ~1,200 | $4.80/hr |
| Disaggregated (H100 prefill + H200 decode) | Llama 3.1 70B | 1x H100 + 1x H200 | ~350ms | ~2,100 | $6.94/hr |
| Disaggregated (B200 prefill + H200 decode) | Llama 3.1 70B | 1x B200 + 1x H200 | ~180ms | ~3,000 | $11.97/hr |

The H100+H200 disaggregated setup costs about 45% more per hour but delivers 75% more throughput. At scale, cost per million tokens is actually lower with disaggregation because you serve far more requests per GPU-hour.

Using spot pricing on H100 prefill nodes ($0.80/hr) brings the H100+H200 config to $5.34/hr, delivering the same 2,100 tokens/s for roughly what two on-demand H100s cost colocated.

Pricing fluctuates based on GPU availability. The prices above are as of 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For framework-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang throughput on single-node setups, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

When NOT to Disaggregate

Disaggregation adds real complexity. It is not the right choice for every workload:

Short prompts (under 512 tokens): prefill is cheap at this length. The overhead of KV transfer to a remote decode node outweighs any benefit.

Low concurrency (under 10 simultaneous users): batching gains do not materialize with small user counts. The infrastructure complexity is not worth it.

Tight latency budgets with cold requests: KV transfer adds 50-200ms of network latency before the decode node can start generating. If TTFT is your primary constraint, chunked prefill on a single node is often better.

Models under 7B parameters: prefill is fast enough colocated, and the KV cache for small models fits comfortably in HBM without memory pressure.

Dev and test environments: operational overhead is not justified for one-off inference or rapid iteration workflows.

If you are hitting slow inference for other reasons, start with Why Your LLM Inference Is Slow to diagnose the root cause before adding infrastructure complexity.

Production Monitoring and Auto-Scaling

Disaggregated serving requires monitoring two separate queues, not one.

Key metrics to track:

  • Prefill queue depth: requests waiting for a prefill node
  • Decode queue depth: requests waiting for a decode node
  • TTFT (time-to-first-token): end-to-end latency from request arrival to first generated token
  • Inter-node transfer latency: time spent moving KV cache between prefill and decode nodes

vLLM metrics endpoint: GET /metrics returns Prometheus-format metrics. Key fields:

  • vllm:num_requests_waiting: queue depth per node
  • vllm:kv_cache_usage_perc: KV cache fill percentage
  • vllm:time_to_first_token_seconds: TTFT distribution

Auto-scaling heuristic: scale decode replicas when the decode queue exceeds N requests; scale prefill replicas when the prefill queue exceeds M requests. These pools scale independently. A traffic spike of requests with long prompts fills the prefill queue first; a spike of continuation requests fills decode first. Track both separately.
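The heuristic above can be sketched in a few lines; the thresholds here are arbitrary placeholders to tune against your own queue-depth metrics:

```python
# Sketch of the independent-pool scaling heuristic. Thresholds are
# placeholders, not recommended values.
def scale_decision(prefill_queue: int, decode_queue: int,
                   prefill_max: int = 16, decode_max: int = 32) -> dict:
    """Per-pool scaling action derived from the two queue depths."""
    return {
        "prefill": "scale_up" if prefill_queue > prefill_max else "hold",
        "decode": "scale_up" if decode_queue > decode_max else "hold",
    }

# A spike of long-prompt requests fills the prefill queue first:
print(scale_decision(prefill_queue=40, decode_queue=5))
# {'prefill': 'scale_up', 'decode': 'hold'}
```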

For Prometheus and Grafana GPU monitoring setup, see GPU Monitoring for ML.

For broader GPU cost optimization strategies across your inference stack, see GPU Cloud Cost Optimization Playbook.


Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, with spot pricing to cut costs on prefill nodes. Mix GPU types in a single deployment without fixed cluster constraints.

Rent H100 → | Rent H200 → | Rent B200 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.