Engineering

Prefill-Decode Disaggregation on GPU Cloud: Split LLM Inference for 2x Throughput (2026 Guide)

Written by Mitrasish, Co-founder · Apr 3, 2026

Most LLM serving guides treat inference as a single workload. It is not. Inference has two phases with opposite hardware requirements, and running them on the same GPU means both run slower than they have to.

Disaggregating prefill and decode onto separate GPUs fixes that. Here is how it works, when it is worth doing, and how to set it up with vLLM or SGLang on Spheron's GPU cloud.

Why Prefill and Decode Pull in Opposite Directions

When you send a prompt to an LLM, two things happen in sequence:

Prefill: the model processes your entire input prompt in one forward pass. Every token in the prompt attends to every other token. This is a dense matrix multiply operation. It is compute-bound: the limiting factor is raw FLOPS (floating-point operations per second), not memory bandwidth.

Decode: the model generates output tokens one at a time. Each new token needs to read the key-value vectors for every previously generated token from memory. This is memory-bound: the GPU spends most of its time loading KV cache tensors, not computing.

| Phase | Bottleneck | Ideal GPU | What wastes capacity |
| --- | --- | --- | --- |
| Prefill | FP8 TFLOPS | H100, B200, B300 | Sitting idle waiting for decode |
| Decode | HBM bandwidth (TB/s) | H200 (4.8 TB/s) | Stalled by prefill monopolizing the GPU |

When both phases share the same GPU, prefill jobs block decode batches from continuing. A long 32K-token prompt arriving mid-generation stalls every other user in the queue. This head-of-line blocking is the core problem disaggregation solves.
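The compute-vs-memory split falls out of a back-of-the-envelope arithmetic-intensity calculation. The sketch below assumes a 70B-parameter model at FP8 and ~2 FLOPs per parameter per token; the numbers are illustrative, not measured:

```python
# Back-of-the-envelope arithmetic intensity: FLOPs per byte of weights read
# per forward pass. Illustrative numbers for a 70B-parameter model at FP8.
def arithmetic_intensity(tokens_per_pass: int) -> float:
    params = 70e9                         # model parameters (assumed)
    bytes_per_param = 1                   # FP8 storage
    flops = 2 * params * tokens_per_pass  # ~2 FLOPs per parameter per token
    bytes_read = params * bytes_per_param # weights stream from HBM once per pass
    return flops / bytes_read

print(arithmetic_intensity(8192))  # prefill over an 8k prompt -> 16384.0
print(arithmetic_intensity(1))     # decode, one token per pass -> 2.0
```

Prefill amortizes each weight load over thousands of tokens, so raw FLOPS is the limit; decode reloads everything to produce a single token, so bandwidth is.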

For a deeper look at how KV cache memory math constrains decode, see our KV Cache Optimization guide.

How Disaggregation Works: Data Flow and Architecture

In a disaggregated setup, you run two separate pools of GPU nodes:

  • Prefill nodes: receive incoming requests, process the full prompt through the model, and produce a KV cache block for each request
  • Decode nodes: receive the KV cache from the prefill node and run autoregressive generation until the response is complete

The request flow looks like this:

```
User request
     |
     v
  Router
     |
     v
Prefill node (H100/B200)
  - Runs forward pass over prompt
  - Generates KV cache
     |
     | (KV cache transfer via NIXL)
     v
Decode node (H200)
  - Receives KV cache
  - Generates output tokens
     |
     v
  Response
```

The key enabling technology is NIXL (NVIDIA Inference Xfer Library), which transfers KV cache tensors between nodes using RDMA or TCP. In 2026, NIXL is the standard mechanism for this transfer in both vLLM and NVIDIA Dynamo.

NVIDIA Dynamo 1.0 uses this same disaggregated architecture at scale, with its own routing layer on top of vLLM. See our NVIDIA Dynamo deployment guide for the Dynamo-specific setup.

Chunked Prefill vs Full Disaggregation: Choosing the Right Approach

Full disaggregation is not always the right choice. Chunked prefill solves a similar problem with less infrastructure overhead.

Chunked prefill runs on a single node. Instead of processing a long prompt in one blocking pass, it breaks the prompt into fixed-size chunks and interleaves them with ongoing decode steps. A 32K-token prompt gets processed in, say, 4 chunks of 8K tokens each, with decode steps running between chunks. Decode is never fully blocked.

Full disaggregation uses separate nodes. The prefill node runs completely independently from the decode node. KV cache transfers happen over the network. There is no sharing of GPU resources between phases.

| Approach | When to use | Overhead | Throughput gain |
| --- | --- | --- | --- |
| Chunked prefill | Prompts under 8k tokens, single node | Low | 20-40% |
| Full disaggregation | Prompts 8k+, high concurrency | Medium (network) | 1.5-2.5x |

For chunked prefill, add these flags to your vLLM launch:

```bash
--enable-chunked-prefill \
--max-num-batched-tokens 512
```

Start with chunked prefill. If you are consistently hitting 8k+ prompt lengths with high concurrency and chunked prefill is not moving the needle enough, then add the infrastructure for full disaggregation.

Cache-Aware Disaggregation for Long-Context Models

If your workload has a shared system prompt (a long preamble that every request shares), prefix caching can eliminate most of your prefill cost.

With prefix caching enabled, the prefill node computes the KV cache for the shared prefix once and stores it. Subsequent requests that share that prefix skip the prefill entirely for the cached portion and only compute the KV cache for the novel parts of the prompt.

SGLang's RadixAttention handles this efficiently on the prefill side: it uses a radix tree to find the longest cached prefix for any incoming request and routes only the uncached suffix to the prefill compute path.
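To make the lookup concrete, here is a toy longest-cached-prefix match. It is a plain token trie, far simpler than SGLang's compressed radix tree, and the token IDs are made up:

```python
# Toy longest-cached-prefix lookup (illustrative; SGLang's RadixAttention
# uses a compressed radix tree, not this naive trie).
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that the KV cache for this token sequence is stored."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Number of leading tokens whose KV cache is already computed."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # shared prefix, already prefilled
print(cache.longest_prefix([1, 2, 3, 9, 9]))  # 3 tokens cached, 2 need prefill
```

Only the uncached suffix (two tokens here) goes to the prefill compute path.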

In a disaggregated setup, route requests by novelty:

  • Novel prompts (not in cache): send to prefill nodes, which compute the full KV cache and write it to the cache
  • Continuation requests (same session, generating more tokens): route directly to decode nodes, which already have the KV cache for the prior context
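A router implementing this split can be as small as a session lookup. A minimal sketch, where the pool URLs and the `session_id` field are placeholders rather than vLLM or SGLang API:

```python
# Toy novelty-aware router: continuations go straight to decode, novel
# prompts go through prefill first. URLs and fields are placeholders.
PREFILL_URL = "http://prefill:8200"
DECODE_URL = "http://decode:8201"

def route(request: dict, active_sessions: set) -> str:
    """Pick the pool that should handle this request."""
    if request.get("session_id") in active_sessions:
        return DECODE_URL  # KV cache for the prior context already lives there
    return PREFILL_URL     # novel prompt: compute the KV cache first

sessions = {"abc"}
print(route({"session_id": "abc"}, sessions))  # http://decode:8201
print(route({"session_id": "new"}, sessions))  # http://prefill:8200
```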

Prefix caching is one of the highest-leverage KV cache optimizations available. The details are in KV Cache Optimization: Serve 10x More Users.

Hardware Planning: What to Buy (or Rent) for Each Role

Prefill nodes need maximum FP8 TFLOPS and fast NVLink for tensor parallelism within the node. The computation is dense matrix multiply, so raw FLOPS is the bottleneck. H100 SXM5 and B200 SXM6 are the natural choices.

Decode nodes need maximum HBM bandwidth and capacity. Each generated token requires loading KV vectors for every prior token from HBM. The H200 SXM5 with 141GB HBM3e at 4.8 TB/s bandwidth is purpose-built for this. For smaller models where the KV cache fits in less memory, A100 80GB is a cost-effective decode option.
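A rough way to see why bandwidth is the decode bottleneck: each decode forward pass must stream the model weights (plus in-flight KV cache) out of HBM, so bandwidth puts a hard ceiling on passes per second. The sizes below are ballpark figures for Llama 3.1 70B at FP8, not measurements:

```python
# Rough ceiling on decode forward passes per second: every pass streams the
# weights and in-flight KV cache from HBM.
def decode_passes_per_sec(hbm_bw_tbs: float, model_gb: float, kv_gb: float) -> float:
    bytes_per_pass = (model_gb + kv_gb) * 1e9
    return hbm_bw_tbs * 1e12 / bytes_per_pass

print(round(decode_passes_per_sec(4.8, 70, 10)))   # H200: ~60 passes/s
print(round(decode_passes_per_sec(3.35, 70, 10)))  # H100: ~42 passes/s
```

Each pass emits one token per request in the batch, which is why batching raises decode throughput without changing this per-pass ceiling.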

Here are the current Spheron prices for these GPUs:

| GPU | Role | On-demand | Spot | HBM | Bandwidth |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 | Prefill | $2.40/hr | $0.80/hr | 80GB | 3.35 TB/s |
| B200 SXM6 | Prefill | $7.43/hr | N/A | 180GB | 7.7 TB/s |
| H200 SXM5 | Decode | $4.54/hr | N/A | 141GB | 4.8 TB/s |
| A100 80GB SXM4 | Decode | $1.05/hr | N/A | 80GB | 2.0 TB/s |

Pricing fluctuates based on GPU availability. The prices above are as of 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, choosing GPU types per instance without fixed cluster configurations.

Compare full GPU specs at Best GPU for AI Inference in 2026.

Step-by-Step Setup with vLLM on Spheron

Step 1: Provision Your Nodes

Log in to app.spheron.ai and create two GPU instances:

  • One H100 SXM5 for the prefill worker
  • One H200 SXM5 for the decode worker

Place both in the same region to minimize inter-node transfer latency. Note both instances' private IP addresses.

Step 2: Install vLLM

On both nodes, install vLLM v0.8+ (required for NixlConnector and disaggregated prefilling support):

```bash
pip install vllm
# Verify version supports P/D disaggregation
python -c "import vllm; print(vllm.__version__)"
```

Ensure CUDA 12.4+ is installed. On the prefill node, enable FP8 by passing --quantization fp8 at launch.

Step 3: Start the Prefill Instance

On the H100 prefill node:

```bash
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_buffer_device":"cuda","kv_connector_extra_config":{"decode_addr":"<DECODE_NODE_IP>:8201"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --kv-transfer-config "$VLLM_KV_TRANSFER_CONFIG" \
  --port 8200
```

The prefill node generates KV cache for each request and transfers it to the decode node address.

Step 4: Start the Decode Instance

On the H200 decode node:

```bash
export VLLM_KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_buffer_device":"cuda","kv_connector_extra_config":{"prefill_addr":"<PREFILL_NODE_IP>:8200"}}'

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-transfer-config "$VLLM_KV_TRANSFER_CONFIG" \
  --port 8201
```

Step 5: Route Traffic Through the P/D Router

vLLM ships a minimal proxy server for routing between prefill and decode instances. Start it with both node addresses:

```bash
python vllm/tests/disagg_prefill/toy_proxy_server.py \
  --prefill http://<PREFILL_NODE_IP>:8200 \
  --decode http://<DECODE_NODE_IP>:8201 \
  --port 8000
```

Send requests to port 8000 on the router node. The proxy forwards incoming requests to the prefill node and handles the KV cache handoff to decode automatically. For production traffic, replace the toy proxy with a third-party load balancer or your own routing layer that implements the same prefill-then-decode handoff logic.
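Once the proxy is up, clients talk to it through the standard OpenAI-compatible chat completions schema. A minimal stdlib sketch, where `<ROUTER_IP>` is a placeholder for the proxy node's address:

```python
# Minimal client for the router's OpenAI-compatible endpoint.
# <ROUTER_IP> is a placeholder for the proxy node's address.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://<ROUTER_IP>:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment against a live router
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The client never sees the prefill/decode split; the handoff happens entirely behind the router.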

SGLang alternative: SGLang supports disaggregated serving via its router with --disaggregation-mode. The configuration is similar: one prefill worker, one decode worker, and a router process pointing to both. SGLang's RadixAttention gives it an edge for prefix-heavy workloads.

For a full guide to vLLM multi-GPU configuration without disaggregation, see vLLM Multi-GPU Production Deployment 2026.

For quick-start deployment templates on Spheron, see Spheron LLM quick-guides.

KV Cache Transfer: Network Requirements and NIXL

NIXL uses RDMA (InfiniBand or RoCE) for sub-millisecond KV cache transfer when available. It falls back to TCP, which adds more latency but still works.

To size your network requirements, estimate the KV cache per batch:

KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For Llama 3.1 70B at FP8 with 8k context and a batch of 8 requests:

2 × 80 layers × 8 KV heads × 128 head_dim × 8192 tokens × 1 byte × 8 requests = ~10 GB per batch
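The same formula as a small helper, reproducing the worked example:

```python
# KV cache sizing helper matching the formula above (Llama 3.1 70B at FP8).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_el: int, batch: int) -> int:
    # factor of 2: one K tensor and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el * batch

size = kv_cache_bytes(80, 8, 128, 8192, 1, 8)
print(f"{size / 1e9:.1f} GB")  # 10.7 GB per batch of 8
```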

To transfer 10 GB before the decode node can start generating, you need high inter-node bandwidth. Practical minimums:

  • 10GbE: adequate for short contexts under 2k tokens
  • 100GbE: required for 8k+ context disaggregation
  • InfiniBand / RoCE (RDMA): recommended for 32k+ contexts or latency-sensitive workloads
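These minimums fall out of simple transfer-time arithmetic. For the ~1.34 GB per-request KV cache from the worked example above (ignoring protocol overhead and assuming an otherwise idle link):

```python
# Naive transfer-time estimate for one request's KV cache (~1.34 GB at 8k
# context) over common link speeds. Ignores protocol overhead and contention.
def transfer_ms(gigabytes: float, link_gbps: float) -> float:
    return gigabytes * 8 / link_gbps * 1000  # GB -> Gbit, then s -> ms

for name, gbps in [("10GbE", 10), ("100GbE", 100), ("400G InfiniBand", 400)]:
    print(name, round(transfer_ms(1.34, gbps)), "ms")
```

Over 10GbE the transfer alone adds more than a second to TTFT, which is why it is only viable for short contexts.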

See NVIDIA Dynamo disaggregated inference guide for Dynamo-native NIXL configuration details.

Benchmarks: Throughput and Cost vs Colocated Inference

These figures are based on published vLLM and SGLang benchmark data for Llama 3.1 70B, adjusted for Spheron pricing at 03 Apr 2026:

| Setup | Model | GPUs | TTFT (p90) | Throughput (tokens/s) | Cost |
| --- | --- | --- | --- | --- | --- |
| Colocated (H100 x2) | Llama 3.1 70B | 2x H100 | ~800ms | ~1,200 | $4.80/hr |
| Disaggregated (H100 prefill + H200 decode) | Llama 3.1 70B | 1x H100 + 1x H200 | ~350ms | ~2,100 | $6.94/hr |
| Disaggregated (B200 prefill + H200 decode) | Llama 3.1 70B | 1x B200 + 1x H200 | ~180ms | ~3,000 | $11.97/hr |

The H100+H200 disaggregated setup costs about 45% more per hour but delivers 75% more throughput. At scale, cost per million tokens is actually lower with disaggregation because you serve far more requests per GPU-hour.

Using spot pricing on H100 prefill nodes ($0.80/hr) brings the H100+H200 config to $5.34/hr, delivering the same 2,100 tokens/s for roughly what two on-demand H100s cost colocated.

Pricing fluctuates based on GPU availability. The prices above are as of 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For framework-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang throughput on single-node setups, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

When NOT to Disaggregate

Disaggregation adds real complexity. It is not the right choice for every workload:

Short prompts (under 512 tokens): prefill is cheap at this length. The overhead of KV transfer to a remote decode node outweighs any benefit.

Low concurrency (under 10 simultaneous users): batching gains do not materialize with small user counts. The infrastructure complexity is not worth it.

Tight latency budgets with cold requests: KV transfer adds 50-200ms of network latency before the decode node can start generating. If TTFT is your primary constraint, chunked prefill on a single node is often better.

Models under 7B parameters: prefill is fast enough colocated, and the KV cache for small models fits comfortably in HBM without memory pressure.

Dev and test environments: operational overhead is not justified for one-off inference or rapid iteration workflows.

If you are hitting slow inference for other reasons, start with Why Your LLM Inference Is Slow to diagnose the root cause before adding infrastructure complexity.

Production Monitoring and Auto-Scaling

Disaggregated serving requires monitoring two separate queues, not one.

Key metrics to track:

  • Prefill queue depth: requests waiting for a prefill node
  • Decode queue depth: requests waiting for a decode node
  • TTFT (time-to-first-token): end-to-end latency from request arrival to first generated token
  • Inter-node transfer latency: time spent moving KV cache between prefill and decode nodes

vLLM metrics endpoint: GET /metrics returns Prometheus-format metrics. Key fields:

  • vllm:num_requests_waiting: queue depth per node
  • vllm:kv_cache_usage_perc: KV cache fill percentage
  • vllm:time_to_first_token_seconds: TTFT distribution

Auto-scaling heuristic: scale decode replicas when the decode queue exceeds N requests; scale prefill replicas when the prefill queue exceeds M requests. These pools scale independently. A traffic spike of requests with long prompts fills the prefill queue first; a spike of continuation requests fills decode first. Track both separately.
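The heuristic above can be sketched in a few lines; the thresholds here are arbitrary placeholders to tune against your own queue-depth metrics:

```python
# Sketch of the independent-pool scaling heuristic. Thresholds are
# placeholders, not recommended values.
def scale_decision(prefill_queue: int, decode_queue: int,
                   prefill_max: int = 16, decode_max: int = 32) -> dict:
    """Per-pool scaling action derived from the two queue depths."""
    return {
        "prefill": "scale_up" if prefill_queue > prefill_max else "hold",
        "decode": "scale_up" if decode_queue > decode_max else "hold",
    }

# A spike of long-prompt requests fills the prefill queue first:
print(scale_decision(prefill_queue=40, decode_queue=5))
# {'prefill': 'scale_up', 'decode': 'hold'}
```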

For Prometheus and Grafana GPU monitoring setup, see GPU Monitoring for ML.

For broader GPU cost optimization strategies across your inference stack, see GPU Cloud Cost Optimization Playbook.


Spheron lets you provision H100 or B200 prefill nodes alongside H200 decode nodes independently, with spot pricing to cut costs on prefill nodes. Mix GPU types in a single deployment without fixed cluster constraints.

Rent H100 → | Rent H200 → | Rent B200 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.