Most LLM serving guides treat inference as one thing. It is not. There are two phases with opposite hardware requirements, and running them on the same GPU means both run slower than they have to.
Disaggregating prefill and decode onto separate GPUs fixes that. Here is how it works, when it is worth doing, and how to set it up with vLLM or SGLang on Spheron's GPU cloud.
Why Prefill and Decode Pull in Opposite Directions
When you send a prompt to an LLM, two things happen in sequence:
Prefill: the model processes your entire input prompt in one forward pass. Every token in the prompt attends to every other token. This is a dense matrix multiply operation. It is compute-bound: the limiting factor is raw FLOPS (floating-point operations per second), not memory bandwidth.
Decode: the model generates output tokens one at a time. Each new token needs to read the key-value vectors for every previously generated token from memory. This is memory-bound: the GPU spends most of its time loading KV cache tensors, not computing.
| Phase | Bottleneck | Ideal GPU | What wastes capacity |
|---|---|---|---|
| Prefill | FP8 TFLOPS | H100, B200, B300 | Sitting idle waiting for decode |
| Decode | HBM bandwidth (TB/s) | H200 (4.8 TB/s) | Stalled by prefill monopolizing GPU |
When both phases share the same GPU, prefill jobs block decode batches from continuing. A long 32K-token prompt arriving mid-generation stalls every other user in the queue. This head-of-line blocking is the core problem disaggregation solves.
For a deeper look at how KV cache memory math constrains decode, see our KV Cache Optimization guide.
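The asymmetry above can be sanity-checked with a back-of-envelope roofline estimate. The peak FLOPS and bandwidth figures below are rough public spec numbers chosen for illustration, not measured values; the point is the order-of-magnitude difference in what each phase demands:

```python
# Back-of-envelope roofline check (illustrative; parameter count and
# hardware peaks below are assumptions, not measured figures).
PARAMS = 70e9                 # Llama 3.1 70B parameters
H100_FP8_FLOPS = 989e12       # assumed H100 dense FP8 peak, FLOP/s
H200_HBM_BW = 4.8e12          # H200 HBM3e bandwidth, bytes/s

# Prefill: roughly 2 FLOPs per parameter per prompt token, all at once.
prompt_tokens = 8192
prefill_s = (2 * PARAMS * prompt_tokens) / H100_FP8_FLOPS

# Decode: each new token must stream the weights (FP8 = 1 byte/param) from HBM.
decode_s_per_token = (PARAMS * 1) / H200_HBM_BW

print(f"prefill: ~{prefill_s:.2f} s of pure compute for an 8k prompt")
print(f"decode:  ~{decode_s_per_token * 1e3:.1f} ms of pure memory traffic per token")
```

Prefill saturates the FLOPS ceiling; decode barely touches it but hammers memory bandwidth, which is why the two phases want different GPUs.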
How Disaggregation Works: Data Flow and Architecture
In a disaggregated setup, you run two separate pools of GPU nodes:
- Prefill nodes: receive incoming requests, process the full prompt through the model, and produce a KV cache block for each request
- Decode nodes: receive the KV cache from the prefill node and run autoregressive generation until the response is complete
The request flow looks like this:
```
User request
      |
      v
    Router
      |
      v
Prefill node (H100/B200)
  - Runs forward pass over prompt
  - Generates KV cache
      |
      |  (KV cache transfer via NIXL)
      v
Decode node (H200)
  - Receives KV cache
  - Generates output tokens
      |
      v
Response
```

The key enabling technology is NIXL (NVIDIA Inference Xfer Library), which transfers KV cache tensors between nodes using RDMA or TCP. In 2026, NIXL is the standard mechanism for this transfer in both vLLM and NVIDIA Dynamo.
NVIDIA Dynamo 1.0 uses this same disaggregated architecture at scale, with its own routing layer on top of vLLM. See our NVIDIA Dynamo deployment guide for the Dynamo-specific setup.
Chunked Prefill vs Full Disaggregation: Choosing the Right Approach
Full disaggregation is not always the right choice. Chunked prefill solves a similar problem with less infrastructure overhead.
Chunked prefill runs on a single node. Instead of processing a long prompt in one blocking pass, it breaks the prompt into fixed-size chunks and interleaves them with ongoing decode steps. A 32K-token prompt gets processed in, say, 4 chunks of 8K tokens each, with decode steps running between chunks. Decode is never fully blocked.
Full disaggregation uses separate nodes. The prefill node runs completely independently from the decode node. KV cache transfers happen over the network. There is no sharing of GPU resources between phases.
| Approach | When to use | Overhead | Throughput gain |
|---|---|---|---|
| Chunked prefill | Prompts under 8k tokens, single-node | Low | 20-40% |
| Full disaggregation | Prompts 8k+, high concurrency | Medium (network) | 1.5-2.5x |
For chunked prefill, add these flags to your vLLM launch:
```
--enable-chunked-prefill \
--max-num-batched-tokens 512
```

Start with chunked prefill. If you are consistently hitting 8k+ prompt lengths with high concurrency and chunked prefill is not moving the needle enough, then add the infrastructure for full disaggregation.
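The interleaving behavior described above can be sketched as a toy scheduler. This is not vLLM's actual scheduling code, just an illustration of how prefill chunks and decode steps alternate (chunk size here mirrors the 8K-per-chunk example):

```python
# Toy sketch of chunked-prefill scheduling: a long prompt is split into
# fixed-size chunks and a decode step runs after each chunk, so ongoing
# generations are never blocked for the full prompt.
def schedule(prompt_len: int, chunk: int = 8192):
    steps = []
    for start in range(0, prompt_len, chunk):
        steps.append(("prefill", start, min(start + chunk, prompt_len)))
        steps.append(("decode", None, None))  # ongoing batches advance here
    return steps

for step in schedule(32_768):
    print(step)  # 4 prefill chunks, each followed by a decode step
```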
Cache-Aware Disaggregation for Long-Context Models
If your workload has a shared system prompt (a long preamble that every request shares), prefix caching can eliminate most of your prefill cost.
With prefix caching enabled, the prefill node computes the KV cache for the shared prefix once and stores it. Subsequent requests that share that prefix skip the prefill entirely for the cached portion and only compute the KV cache for the novel parts of the prompt.
SGLang's RadixAttention handles this efficiently on the prefill side: it uses a radix tree to find the longest cached prefix for any incoming request and routes only the uncached suffix to the prefill compute path.
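The lookup idea can be illustrated with a toy cache. SGLang's real implementation uses a radix tree over token IDs for efficiency; the linear scan below is only a sketch of the longest-cached-prefix behavior, with hypothetical names:

```python
# Toy prefix-cache lookup in the spirit of RadixAttention (not the real
# data structure): find the longest cached token prefix, so only the
# uncached suffix needs a prefill pass.
cache = {}  # token-tuple prefix -> cached KV handle (hypothetical)

def insert(tokens):
    cache[tuple(tokens)] = f"kv_block_{len(cache)}"

def longest_cached_prefix(tokens):
    for end in range(len(tokens), 0, -1):
        if tuple(tokens[:end]) in cache:
            return end  # tokens[:end] reuse cached KV; tokens[end:] get prefilled
    return 0

insert([1, 2, 3, 4])  # e.g. a shared system prompt
print(longest_cached_prefix([1, 2, 3, 4, 9, 9]))  # 4: only 2 tokens need prefill
```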
In a disaggregated setup, route requests by novelty:
- Novel prompts (not in cache): send to prefill nodes, which compute the full KV cache and write it to the cache
- Continuation requests (same session, generating more tokens): route directly to decode nodes, which already have the KV cache for the prior context
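The routing rule above reduces to a simple decision function. The sketch below uses hypothetical names; a production router would track session-to-node affinity and share cache state with the serving engines:

```python
# Novelty-based routing sketch: continuations go straight to decode
# (their KV cache is already there); novel prompts go to prefill.
def route(request: dict, kv_cache_sessions: set) -> str:
    if request["session_id"] in kv_cache_sessions:
        return "decode"   # continuation: KV cache already on a decode node
    return "prefill"      # novel prompt: needs a full prefill pass

sessions = {"abc"}
print(route({"session_id": "abc"}, sessions))  # decode
print(route({"session_id": "new"}, sessions))  # prefill
```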
Prefix caching is one of the highest-leverage KV cache optimizations available. The details are in KV Cache Optimization: Serve 10x More Users.
Hardware Planning: What to Buy (or Rent) for Each Role
Prefill nodes need maximum FP8 TFLOPS and fast NVLink for tensor parallelism within the node. The computation is dense matrix multiply, so raw FLOPS is the bottleneck. H100 SXM5 and B200 SXM6 are the natural choices.
Decode nodes need maximum HBM bandwidth and capacity. Each generated token requires loading KV vectors for every prior token from HBM. The H200 SXM5 with 141GB HBM3e at 4.8 TB/s bandwidth is purpose-built for this. For smaller models where the KV cache fits in less memory, A100 80GB is a cost-effective decode option.
Here are the current Spheron prices for these GPUs:
| GPU | Role | On-demand | Spot | HBM | BW |
|---|---|---|---|---|---|
| H100 SXM5 | Prefill | $2.40/hr | $0.80/hr | 80GB | 3.35 TB/s |
| B200 SXM6 | Prefill | $7.43/hr | N/A | 180GB | 7.7 TB/s |
| H200 SXM5 | Decode | $4.54/hr | N/A | 141GB | 4.8 TB/s |
| A100 80G SXM4 | Decode | $1.05/hr | N/A | 80GB | 2.0 TB/s |
Pricing fluctuates based on GPU availability. The prices above are based on 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, choosing GPU types per instance without fixed cluster configurations.
Compare full GPU specs at Best GPU for AI Inference in 2026.
Step-by-Step Setup with vLLM on Spheron
Step 1: Provision Your Nodes
Log in to app.spheron.ai, create two GPU instances:
- One H100 SXM5 for the prefill worker
- One H200 SXM5 for the decode worker
Place both in the same region to minimize inter-node transfer latency. Note both instances' private IP addresses.
Step 2: Install vLLM
On both nodes, install vLLM v0.8+ (required for NixlConnector and disaggregated prefilling support):
```
pip install vllm
# Verify version supports P/D disaggregation
python -c "import vllm; print(vllm.__version__)"
```

Ensure CUDA 12.4+ is installed. On the prefill node, enable FP8 by passing --dtype fp8 at launch.
Step 3: Start the Prefill Instance
On the H100 prefill node:
```
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_buffer_device":"cuda","kv_connector_extra_config":{"decode_addr":"<DECODE_NODE_IP>:8201"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype fp8 \
  --tensor-parallel-size 1 \
  --kv-transfer-config $VLLM_KV_TRANSFER_CONFIG \
  --port 8200
```

The prefill node generates KV cache for each request and transfers it to the decode node address.
Step 4: Start the Decode Instance
On the H200 decode node:
```
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_buffer_device":"cuda","kv_connector_extra_config":{"prefill_addr":"<PREFILL_NODE_IP>:8200"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype fp8 \
  --kv-transfer-config $VLLM_KV_TRANSFER_CONFIG \
  --port 8201
```

Step 5: Route Traffic Through the P/D Router
vLLM ships a minimal proxy server for routing between prefill and decode instances. Start it with both node addresses:
```
python vllm/tests/disagg_prefill/toy_proxy_server.py \
  --prefill http://<PREFILL_NODE_IP>:8200 \
  --decode http://<DECODE_NODE_IP>:8201 \
  --port 8000
```

Send requests to port 8000 on the router node. The proxy forwards incoming requests to the prefill node and handles the KV cache handoff to decode automatically. For production traffic, replace the toy proxy with a third-party load balancer or your own routing layer that implements the same prefill-then-decode handoff logic.
SGLang alternative: SGLang supports disaggregated serving via its router with --disaggregation-mode. The configuration is similar: one prefill worker, one decode worker, and a router process pointing to both. SGLang's RadixAttention gives it an edge for prefix-heavy workloads.
For a full guide to vLLM multi-GPU configuration without disaggregation, see vLLM Multi-GPU Production Deployment 2026.
For quick-start deployment templates on Spheron, see Spheron LLM quick-guides.
KV Cache Transfer: Network Requirements and NIXL
NIXL uses RDMA (InfiniBand or RoCE) for sub-millisecond KV cache transfer when available. It falls back to TCP, which adds more latency but still works.
To size your network requirements, estimate the KV cache per batch:
```
KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
```

For Llama 3.1 70B at FP8 with 8k context and a batch of 8 requests:

```
2 × 80 layers × 8 KV heads × 128 head_dim × 8192 tokens × 1 byte × 8 requests ≈ 10 GB per batch
```

To transfer 10 GB before the decode node can start generating, you need high inter-node bandwidth. Practical minimums:
- 10GbE: adequate for short contexts under 2k tokens
- 100GbE: required for 8k+ context disaggregation
- InfiniBand / RoCE (RDMA): recommended for 32k+ contexts or latency-sensitive workloads
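The sizing formula and the bandwidth tiers above can be checked together in a few lines. The transfer times are payload-only lower bounds; real transfers add protocol and serialization overhead on top:

```python
# KV cache size for the Llama 3.1 70B worked example, then the time
# to move that payload at each link speed (payload only, no overhead).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=1, batch=1):
    # Factor of 2 = one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim * seq_len
            * bytes_per_element * batch)

payload = kv_cache_bytes(80, 8, 128, 8192, bytes_per_element=1, batch=8)
print(f"KV payload: {payload / 1024**3:.1f} GB")  # 10.0 GB

links_bits_per_s = {"10GbE": 10e9, "100GbE": 100e9, "400G InfiniBand": 400e9}
for name, bps in links_bits_per_s.items():
    print(f"{name}: {payload / (bps / 8):.2f} s to transfer")
```

At 10GbE the batch takes several seconds to move, which is why that tier is only viable for short contexts.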
See NVIDIA Dynamo disaggregated inference guide for Dynamo-native NIXL configuration details.
Benchmarks: Throughput and Cost vs Colocated Inference
These figures are based on published vLLM and SGLang benchmark data for Llama 3.1 70B, adjusted for Spheron pricing at 03 Apr 2026:
| Setup | Model | GPUs | TTFT (p90) | Throughput (tokens/s) | Cost |
|---|---|---|---|---|---|
| Colocated (H100 x2) | Llama 3.1 70B | 2x H100 | ~800ms | ~1,200 | $4.80/hr |
| Disaggregated (H100 prefill + H200 decode) | Llama 3.1 70B | 1x H100 + 1x H200 | ~350ms | ~2,100 | $6.94/hr |
| Disaggregated (B200 prefill + H200 decode) | Llama 3.1 70B | 1x B200 + 1x H200 | ~180ms | ~3,000 | $11.97/hr |
The H100+H200 disaggregated setup costs about 45% more per hour but delivers 75% more throughput. At scale, cost per million tokens is actually lower with disaggregation because you serve far more requests per GPU-hour.
Using spot pricing on H100 prefill nodes ($0.80/hr) brings the H100+H200 config to $5.34/hr, delivering the same 2,100 tokens/s for roughly what two on-demand H100s cost colocated.
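The cost-per-million-tokens claim falls directly out of the table's numbers. A quick check, using the benchmark throughputs and 03 Apr 2026 prices above:

```python
# Cost per million tokens from the benchmark table ($/hr, tokens/s).
configs = {
    "colocated 2x H100":       (4.80, 1200),
    "disagg H100 + H200":      (6.94, 2100),
    "disagg spot H100 + H200": (5.34, 2100),
}

def usd_per_million_tokens(usd_per_hr, tokens_per_s):
    tokens_per_hr = tokens_per_s * 3600
    return usd_per_hr / (tokens_per_hr / 1e6)

for name, (price, tps) in configs.items():
    print(f"{name}: ${usd_per_million_tokens(price, tps):.2f}/M tokens")
```

The disaggregated setups come out cheaper per token than the colocated pair despite the higher hourly rate.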
Pricing fluctuates based on GPU availability. The prices above are based on 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For framework-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang throughput on single-node setups, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.
When NOT to Disaggregate
Disaggregation adds real complexity. It is not the right choice for every workload:
Short prompts (under 512 tokens): prefill is cheap at this length. The overhead of KV transfer to a remote decode node outweighs any benefit.
Low concurrency (under 10 simultaneous users): batching gains do not materialize with small user counts. The infrastructure complexity is not worth it.
Tight latency budgets with cold requests: KV transfer adds 50-200ms of network latency before the decode node can start generating. If TTFT is your primary constraint, chunked prefill on a single node is often better.
Models under 7B parameters: prefill is fast enough colocated, and the KV cache for small models fits comfortably in HBM without memory pressure.
Dev and test environments: operational overhead is not justified for one-off inference or rapid iteration workflows.
If you are hitting slow inference for other reasons, start with Why Your LLM Inference Is Slow to diagnose the root cause before adding infrastructure complexity.
Production Monitoring and Auto-Scaling
Disaggregated serving requires monitoring two separate queues, not one.
Key metrics to track:
- Prefill queue depth: requests waiting for a prefill node
- Decode queue depth: requests waiting for a decode node
- TTFT (time-to-first-token): end-to-end latency from request arrival to first generated token
- Inter-node transfer latency: time spent moving KV cache between prefill and decode nodes
vLLM metrics endpoint: GET /metrics returns Prometheus-format metrics. Key fields:
- vllm:num_requests_waiting: queue depth per node
- vllm:kv_cache_usage_perc: KV cache fill percentage
- vllm:time_to_first_token_seconds: TTFT distribution
Auto-scaling heuristic: scale decode replicas when the decode queue exceeds N requests; scale prefill replicas when the prefill queue exceeds M requests. These pools scale independently. A traffic spike of requests with long prompts fills the prefill queue first; a spike of continuation requests fills decode first. Track both separately.
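The independent-pool heuristic above reduces to a small decision function. Threshold values here are placeholders to tune per workload; queue depths would come from each pool's vllm:num_requests_waiting metric:

```python
# Sketch of the independent-pool auto-scaling heuristic: each pool
# scales on its own queue depth, never on the other's.
def scale_decision(prefill_queue: int, decode_queue: int,
                   prefill_threshold: int = 16, decode_threshold: int = 32):
    actions = []
    if prefill_queue > prefill_threshold:
        actions.append("scale_up_prefill")
    if decode_queue > decode_threshold:
        actions.append("scale_up_decode")
    return actions

print(scale_decision(prefill_queue=40, decode_queue=10))  # ['scale_up_prefill']
```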
For Prometheus and Grafana GPU monitoring setup, see GPU Monitoring for ML.
For broader GPU cost optimization strategies across your inference stack, see GPU Cloud Cost Optimization Playbook.
Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, with spot pricing to cut costs on prefill nodes. Mix GPU types in a single deployment without fixed cluster constraints.
Rent H100 → | Rent H200 → | Rent B200 → | View all pricing →
