On March 24, 2026, llm-d was accepted into the CNCF Sandbox. The project solves a specific problem: in monolithic vLLM, compute-bound prefill and bandwidth-bound decode contend for the same GPUs under high concurrency, and disaggregating those two phases onto separate GPU pools fixes it. Unlike NVIDIA Dynamo's disaggregated inference approach, which is an orchestration layer running above vLLM outside Kubernetes, llm-d is a first-class Kubernetes citizen using standard CRDs and the Gateway API Inference Extension. If you haven't set up vLLM on bare metal yet, the vLLM production deployment guide covers that baseline first.
What Is llm-d
llm-d is a Kubernetes-native framework for disaggregated LLM inference. Backed by IBM, Red Hat, Google, CoreWeave, and NVIDIA, among others, the project was donated to CNCF on March 24, 2026 and immediately accepted into the Sandbox program.
Four core components:
- llm-d scheduler: the cache-aware routing process that tracks KV cache state per decode worker and routes requests to maximize prefix cache hits
- InferencePool CRD: a custom resource that groups decode GPU pods into a named pool the scheduler can target
- InferenceObjective CRD: maps model names (and LoRA adapter names) to backend deployments with traffic weights and request criticality labels
- Gateway API Inference Extension integration: plugs into the standard Kubernetes Gateway API to replace round-robin with inference-aware routing
The key difference from NVIDIA Dynamo: Dynamo is an orchestration layer that sits above vLLM and runs outside Kubernetes. It uses NIXL for KV cache transfer and is tightly coupled to NVIDIA's DGX/HGX hardware stack. llm-d uses standard Kubernetes networking, CRDs, and the Gateway API. If your team already manages Kubernetes clusters and wants disaggregation without adopting a separate proprietary control plane, llm-d is the natural path.
Prefill/Decode Disaggregation on Kubernetes
The Two Phases and Why They Need Different Hardware
Every LLM generation request has two phases. Prefill processes all input tokens in a single forward pass. Attention is computed over the full prompt length, making prefill compute-bound: FLOPs scale with prompt token count.
Decode generates one output token per step, reading the full KV cache at every step. Generating 256 output tokens against a 4K-token context means 256 separate forward passes, each reading a KV cache that spans 4K+ tokens. Decode is dominated by memory bandwidth, not compute.
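The arithmetic behind that claim is worth making concrete. A back-of-envelope sketch of the bytes a single decode step reads, using illustrative Llama-3.1-70B-style architecture numbers (grouped-query attention, FP8 KV cache; check your model's config for the real values):

```python
# Back-of-envelope: KV cache bytes read per decode step.
# Architecture numbers are illustrative (Llama-3.1-70B-style GQA in FP8).
layers = 80          # transformer layers
kv_heads = 8         # grouped-query attention KV heads
head_dim = 128       # dimension per head
bytes_per_val = 1    # FP8 KV cache
context_tokens = 4096

# K and V each store kv_heads * head_dim values per token per layer
kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_per_val * layers
step_read_bytes = kv_bytes_per_token * context_tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")   # 160 KiB
print(f"Bytes read per decode step: {step_read_bytes / 1e9:.2f} GB")  # 0.67 GB
# Every one of the 256 output tokens repeats this read (and the cache
# grows by one token per step), which is why decode is bandwidth-bound.
```

Roughly 0.67 GB per step, repeated hundreds of times per request, puts the bottleneck squarely on memory bandwidth rather than FLOPs.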
In monolithic vLLM, both phases share the same GPU. A long prefill request blocks decode batches from continuing. This head-of-line blocking is what disaggregation solves: send prefill work to compute-optimized nodes, send decode work to memory-bandwidth-optimized nodes.
How llm-d Routes Requests
```
Client
  |
  v
Kubernetes Gateway (Gateway API)
  |
  v
llm-d Scheduler (cache-aware routing)
  |
  v
Prefill Pool (H100 SXM/B200)
  |
  |  KV transfer
  v
Decode Pool (H100 PCIe / A100 80GB)
  |
  v
Response to Client
```

The scheduler has a key advantage over naive load balancing: it knows which decode workers hold which KV cache blocks. When an incoming request shares a prefix with a recently cached request, the scheduler routes it to the decode worker that already has that prefix cached. This reduces redundant prefill work and cuts time-to-first-token for repeated-prefix workloads.
For deeper background on how KV cache memory works, how PagedAttention allocates blocks, and what prefix caching looks like at the memory level, see the KV Cache Optimization Guide.
Hardware Requirements and GPU Selection
The right GPU depends on which pool it's in.
| Role | Recommended GPU | Why | Price |
|---|---|---|---|
| Prefill | H100 SXM5 | Compute-bound, 3,958 TFLOPS FP8 (with sparsity) | $2.40/hr on-demand |
| Prefill (high-end) | B200 SXM6 | ~2.3x dense FP8 FLOP increase over H100 | $1.67/hr (spot) |
| Decode | H100 PCIe | Memory bandwidth, lower cost | $2.01/hr on-demand |
| Decode | A100 SXM4 80GB | Cost-efficient memory bandwidth | $1.08/hr on-demand |
| Mixed (dev/test) | H100 SXM5 | Single pool, no disaggregation | $2.40/hr on-demand |
The v0.5 benchmark validates meaningful throughput at scale: 3,100 tokens/second per B200 decode GPU, and 50,000 output tokens/second on a 16x16 B200 topology (16 prefill nodes, 16 decode nodes). At single-node or small-cluster scale, results will differ. For per-GPU benchmark data including memory bandwidth and inference throughput comparisons across H100, A100, and B200, see Best GPU for AI Inference 2026.
Pricing fluctuates based on GPU availability. The prices above were captured on April 1, 2026 and may have changed. Check current GPU pricing for live rates.
Step-by-Step Deployment on Spheron with Kubernetes
Provision GPU Instances on Spheron
Rent at least two GPU instances: one for prefill, one or more for decode. For prefill, use H100 SXM5 or B200 for compute throughput. For decode, H100 PCIe or A100 80GB at lower hourly rates are a good fit.
After SSH access:

```bash
nvidia-smi
# Verify: GPU model, VRAM, driver version (575.xx+ for H100/B200)
```

Both nodes need Kubernetes 1.29+ running. If you're starting from scratch, `kubeadm init` on the control plane node and `kubeadm join` on workers is the standard path. Multi-node clusters need a CNI (Flannel, Calico, and Cilium all work).
Install Kubernetes Prerequisites
Install the NVIDIA GPU Operator first. This handles the NVIDIA device plugin, container toolkit, and driver management in one Helm chart:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
```

Install the Gateway API CRDs (v1.3.0+ required for llm-d):

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
```

Install cert-manager (required by llm-d's webhook):
```bash
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true
```

Install llm-d via Helm
```bash
helm repo add llm-d-infra https://llm-d-incubation.github.io/llm-d-infra/
helm repo update
helm install llm-d llm-d-infra/llm-d-infra \
  --namespace llm-d \
  --create-namespace \
  -f values.yaml
```

A minimal values.yaml for a 1-prefill, 1-decode deployment (verify field names against the actual chart values before applying; consult the llm-d scaling docs before increasing replicas beyond 1+1):
```yaml
prefillReplicas: 1
decodeReplicas: 1
model:
  name: "meta-llama/Llama-3.1-70B-Instruct"
  source: "huggingface"
  dtype: "fp8"
scheduler:
  cacheAwareRouting: true
gateway:
  enabled: true
  className: "llm-d-gateway"
```

Define InferencePool and InferenceObjective CRDs
The InferencePool groups decode pods and is the target for the scheduler. The examples below use inference.networking.x-k8s.io/v1alpha2, the experimental API group. A stable inference.networking.k8s.io/v1 API is also available; check the Gateway API Inference Extension docs for the version shipped with your llm-d release.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-70b-decode-pool
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker
```

The InferenceObjective (renamed from InferenceModel in the v1 GA release; v1alpha2 still uses InferenceModel) maps a model name to the pool and sets request criticality:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 100
```

Deploy Prefill and Decode Workers
Prefill worker Deployment:
Important: llm-d uses NIXL (NVIDIA Inference Xfer Library) for KV cache transfer between prefill and decode nodes. NIXL handles peer discovery at the transport layer, so you do not need to specify fixed NCCL ranks or `kv_parallel_size` in the connector config. The manifests below use `replicas: 1` for both deployments as a starting point. Scaling to multiple replicas with NIXL still requires coordination so each prefill pod connects to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-prefill
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_producer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.90"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```

Decode worker Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-decode
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: decode
    spec:
      containers:
        - name: vllm-decode
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_consumer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.92"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```

vLLM version and KV connector: Both manifests use `v0.18.0` (current as of March 2026). llm-d's minimum requirement is v0.10.0, so any version at or above that works. Check Docker Hub for the latest vllm-openai tag before deploying. The `kv_connector` field is pluggable; `NixlConnector` is the connector llm-d uses for KV cache transfer via NIXL (NVIDIA Inference Xfer Library). Verify the connector name against your vLLM version's disaggregated prefill documentation if you change versions.
Configure the Gateway HTTPRoute
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-70b-route
  namespace: llm-d
spec:
  parentRefs:
    - name: llm-d-gateway
      namespace: llm-d
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-70b-decode-pool
          port: 8000
```

Send a Test Request
Get the Gateway's external IP:

```bash
kubectl get gateway llm-d-gateway -n llm-d -o jsonpath='{.status.addresses[0].value}'
```

Send a completion request:
```bash
curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain KV cache disaggregation in two sentences."}],
    "max_tokens": 100
  }'
```

A successful response means prefill ran on the prefill pool, KV tensors transferred to a decode worker, and decode completed the generation.
Configuring the Gateway API Inference Extension
The Gateway API Inference Extension adds three capabilities that standard round-robin load balancing doesn't have:
- Cache-aware load balancing: routes requests to decode workers that already hold matching KV prefix blocks, reducing redundant prefill work
- Model-aware routing: dispatches based on the `model` field in the request body, supporting multiple models behind one Gateway endpoint
- Request prioritization: uses the `criticality` field on `InferenceObjective` to shed low-priority requests under load before dropping high-priority ones
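Model-aware routing amounts to a lookup from the request body's `model` field to a backend pool. A minimal sketch of the dispatch logic (the routing table and the second pool name are hypothetical; the real extension runs this decision inside the Gateway's endpoint picker):

```python
import json

# Hypothetical routing table: model name -> InferencePool backend.
ROUTES = {
    "meta-llama/Llama-3.1-70B-Instruct": "llama-70b-decode-pool",
    "mistralai/Mistral-7B-Instruct-v0.3": "mistral-7b-decode-pool",
}

def pick_pool(request_body: bytes) -> str:
    """Dispatch on the OpenAI-style 'model' field in the request body."""
    model = json.loads(request_body).get("model")
    if model not in ROUTES:
        raise ValueError(f"no InferenceObjective maps model {model!r}")
    return ROUTES[model]

body = json.dumps({"model": "meta-llama/Llama-3.1-70B-Instruct",
                   "messages": []}).encode()
print(pick_pool(body))  # llama-70b-decode-pool
```

Because the lookup key is the model name rather than a URL path, multiple models can sit behind a single `/v1/chat/completions` endpoint.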
InferencePool in Detail
The targetPortNumber must match the port your vLLM decode pods expose. The selector uses standard Kubernetes label selectors, so you can target any pod label combination:
```yaml
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker
```

InferenceObjective Criticality
The criticality field on InferenceObjective has three values:
- Critical: always served; shed last under load
- Standard: normal priority; shed before Critical requests
- Sheddable: background workloads; shed first when capacity is tight
This lets you run batch inference jobs (Sheddable) alongside interactive API traffic (Critical) on the same cluster without batch jobs crowding out user-facing requests.
Multi-Version Traffic Splitting
The targetModels list supports weighted routing across model versions, useful for A/B testing a fine-tuned checkpoint against the base model:
```yaml
targetModels:
  - name: "meta-llama/Llama-3.1-70B-Instruct"
    weight: 90
  - name: "llama-70b-finetuned-v2"
    weight: 10
```

For more on Spheron bare metal instances, see the Spheron documentation.
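The 90/10 split behaves like weighted random selection over target models. A quick simulation of how the weights play out over many requests (illustrative only; the actual split is enforced by the scheduler, not client-side):

```python
import random

# Weighted split matching the 90/10 A/B manifest above.
targets = [("meta-llama/Llama-3.1-70B-Instruct", 90),
           ("llama-70b-finetuned-v2", 10)]

def pick_target(rng: random.Random) -> str:
    names, weights = zip(*targets)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
counts = {name: 0 for name, _ in targets}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
print(counts)  # ~9,000 base-model picks, ~1,000 fine-tune picks
```

Over 10,000 simulated requests the fine-tune receives close to 10% of traffic, enough volume to compare quality metrics before shifting more weight to it.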
Benchmarks: llm-d vs Monolithic vLLM
The v0.5 published numbers use a 16x16 B200 topology (16 prefill nodes, 16 decode nodes) running Llama 3.1 70B FP8. That is a large cluster; single-node and small-cluster comparisons will show different magnitudes.
The pattern holds across cluster sizes: disaggregation gains are most pronounced at high concurrency. At low concurrency (under 20 simultaneous requests), disaggregation adds KV transfer latency overhead that may outweigh the throughput benefit.
| Configuration | Throughput | TTFT (p50) | Notes |
|---|---|---|---|
| vLLM monolithic (2x H100 SXM5) | baseline | baseline | No KV transfer overhead |
| llm-d disaggregated (2P + 4D H100) | ~2-3x higher at >50 concurrent | lower (dedicated prefill) | KV transfer adds ~5-15ms |
| llm-d (16x16 B200, v0.5) | 50,000 tok/s total | not published | Published benchmark |
The 2-3x throughput figure at high concurrency comes from prefill and decode no longer competing for the same GPU cycles. When a batch of 100 requests arrives simultaneously, monolithic vLLM serializes prefill for each request before decode can continue; disaggregated llm-d runs prefill in parallel across the prefill pool while the decode pool stays active.
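A toy latency model shows why the gap widens with concurrency. Assume each request needs a fixed prefill time, and that disaggregation spreads prefill across a dedicated pool while decode proceeds uninterrupted (the numbers are illustrative, not measurements):

```python
import math

def monolithic_prefill_wait(n_requests: int, prefill_s: float) -> float:
    """Worst-case wait before the last request can decode when a single
    GPU serializes every prefill ahead of its decode batches."""
    return n_requests * prefill_s

def disaggregated_prefill_wait(n_requests: int, prefill_s: float,
                               prefill_gpus: int) -> float:
    """Prefills run in parallel waves across a dedicated prefill pool."""
    return math.ceil(n_requests / prefill_gpus) * prefill_s

n, t = 100, 0.5  # 100 concurrent requests, 0.5 s of prefill each
print(monolithic_prefill_wait(n, t))         # 50.0 s on one shared GPU
print(disaggregated_prefill_wait(n, t, 16))  # 3.5 s on a 16-GPU prefill pool
```

The model ignores KV transfer latency (~5-15 ms per request, per the table), which is why disaggregation loses its edge at low concurrency: with a handful of requests, the transfer overhead dominates and the parallel-prefill win shrinks toward zero.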
For broader inference framework comparisons including TensorRT-LLM and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks. For GPU-level throughput and cost-per-token data on H100 and B200, see GPU cloud benchmarks 2026.
Scaling: Autoscaling, LoRA Routing, and Multi-Model Serving
Autoscaling with KEDA or HPA
Prefill and decode pools scale independently because their bottlenecks differ. KEDA with a custom metric works well here: scale decode workers on queue depth, scale prefill workers on compute utilization.
Note on scaling with NixlConnector: The deployments in this guide use NixlConnector, which handles peer discovery at the transport layer and does not require fixed NCCL ranks. Scaling NixlConnector-based deployments requires that each prefill pod can discover and connect to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1+1. If you switch to P2pNcclConnector (a generic vLLM connector), note that it uses a fixed 2-member NCCL communicator with static ranks, and you would need dynamic rank assignment before autoscaling. The KEDA example below shows the pattern for NixlConnector-based deployments.
A KEDA ScaledObject for the decode Deployment:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-decode-scaler
  namespace: llm-d
spec:
  scaleTargetRef:
    name: llama-70b-decode
  minReplicaCount: 1
  maxReplicaCount: 16
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"
        query: sum(vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"})
```

When vllm:num_requests_waiting exceeds 10 pending requests per decode pod, KEDA adds replicas. When it drops below, it scales down. The decode pool responds to load without touching prefill capacity.
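KEDA hands the metric to the HPA, which targets an average of `threshold` per replica, so the resulting replica count is roughly the metric divided by the threshold, rounded up and clamped to the min/max bounds. A simplified sketch of that arithmetic (the real HPA also applies stabilization windows and a tolerance band):

```python
import math

def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate HPA target: average `threshold` waiting requests per pod."""
    want = math.ceil(total_waiting / threshold) if total_waiting > 0 else 0
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(85, 10, 1, 16))   # 9 decode pods for 85 queued requests
print(desired_replicas(400, 10, 1, 16))  # 16: clamped at maxReplicaCount
```

Tuning the threshold is the main lever: a lower value scales out earlier at the cost of more idle GPU-hours, a higher value tolerates deeper queues before paying for new replicas.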
LoRA Adapter Routing
The InferenceModel CRD (called InferenceObjective in the v1 GA API) natively supports LoRA adapter routing. Multiple adapters can share a decode pool, with traffic distributed by weight:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b-with-adapters
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 70
    - name: "llama-70b-customer-support-lora"
      weight: 20
    - name: "llama-70b-code-lora"
      weight: 10
```

The scheduler routes requests to decode workers that have the requested adapter loaded, minimizing adapter swap overhead. For the full setup including adapter loading, hot-swap, and memory management, see the LoRA multi-adapter serving guide.
Multi-Model Serving
Multiple InferenceObjective CRDs can share a single InferencePool. The scheduler routes by model name from the request body. The constraint is GPU memory: each model you load adds to the memory footprint of each decode worker. On an 80 GB H100, you might load a 7B and a 13B simultaneously (roughly 14 GB + 26 GB in FP8), but a 70B alone nearly fills the card.
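The memory budget in that example works out as follows. A rough fit-check assuming FP8 weights at ~1 byte per parameter, with a flat 10% reserve standing in for KV cache, activations, and CUDA overhead (in practice the KV cache alone can claim far more than 10%, so treat this as an optimistic bound):

```python
def fits(models_b_params: list[float], vram_gb: float,
         fp8_bytes: float = 1.0, reserve_frac: float = 0.10) -> bool:
    """Rough check: do FP8 weights fit with a 10% reserve for runtime overhead?"""
    weights_gb = sum(p * fp8_bytes for p in models_b_params)  # ~1 GB per 1B params
    return weights_gb <= vram_gb * (1 - reserve_frac)

print(fits([7, 13], 80))   # True: ~20 GB of weights on an 80 GB H100
print(fits([70], 80))      # True, but with no room left for a second model
print(fits([70, 13], 80))  # False: 83 GB of weights overflows the card
```

The takeaway matches the prose above: co-locating small models on one pool is practical, but a 70B-class model effectively claims the whole card.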
Cost Analysis: llm-d on Spheron vs Managed Inference APIs
Self-hosted cost depends on actual GPU utilization. The figures below assume steady-state utilization at roughly 60-70% for a production workload generating 1M tokens per day.
| Option | Setup | $/1M tokens | Control |
|---|---|---|---|
| OpenAI GPT-4o | None | ~$5 (input) + $15 (output) | None |
| Fireworks / Together AI | None | ~$0.90-$3 (model-dependent) | Limited |
| llm-d on Spheron H100 SXM5 | ~1hr | ~$0.15-$0.40 (at utilization) | Full |
| llm-d on Spheron B200 SXM6 (spot) | ~1hr | ~$0.02-$0.05 (at utilization) | Full |
The self-hosted estimates derive from current Spheron prices ($2.40/hr on-demand for H100 SXM5; $1.67/hr spot for B200 SXM6; on-demand B200 pricing is not currently available) divided by estimated throughput from the v0.5 benchmark data at realistic utilization. At higher utilization, cost per token drops further.
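The underlying arithmetic is simple enough to sketch. The inputs below are illustrative: the $2.40/hr H100 rate comes from the pricing table, while the 3,100 tok/s figure is borrowed from the v0.5 B200 decode benchmark as an optimistic stand-in (real per-GPU throughput depends on model, batch size, and topology):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float,
                            utilization: float) -> float:
    """USD per 1M output tokens at a given steady-state utilization."""
    tokens_per_hour = tokens_per_sec * utilization * 3600
    return hourly_usd / (tokens_per_hour / 1e6)

# Illustrative: $2.40/hr GPU at 3,100 tok/s and 65% utilization
print(f"${cost_per_million_tokens(2.40, 3100, 0.65):.2f} per 1M tokens")  # $0.33
```

Utilization is the dominant variable: the same hardware at 30% utilization more than doubles the per-token cost, which is why the table's estimates assume a steady production workload.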
Bare metal matters here. KVM hypervisors add 5-15% CUDA overhead; shared cloud VMs add more. Spheron's bare metal instances give you direct PCIe and NVLink access to the GPU. For disaggregated inference specifically, KV cache transfer latency over the network fabric is a critical variable, and hypervisor-induced jitter in network I/O degrades it.
For a full comparison across cloud providers including per-GPU pricing history and spot availability, see GPU cloud pricing comparison 2026.
A note on CNCF Sandbox status: accepted does not mean production-ready. Sandbox signals that the CNCF Technical Oversight Committee found the project technically sound and the governance model appropriate for community development. The v0.5 API surface will change before a stable release. Run llm-d in staging, validate your specific workload, and pin to a specific Helm chart version before deploying to production traffic.
llm-d's disaggregated inference delivers its best results on bare metal where you control the full CUDA stack and network. Spheron's H100 and B200 bare metal instances give you Kubernetes-ready GPU nodes without virtualization overhead.
