On March 24, 2026, llm-d was accepted into the CNCF Sandbox. The project solves a specific problem: in monolithic vLLM, compute-bound prefill and bandwidth-bound decode contend for the same GPUs under high concurrency, and disaggregating those two phases onto separate GPU pools fixes it. Unlike NVIDIA Dynamo's disaggregated inference approach, which is an orchestration layer running above vLLM outside Kubernetes, llm-d is a first-class Kubernetes citizen using standard CRDs and the Gateway API Inference Extension. If you haven't set up vLLM on bare metal yet, the vLLM production deployment guide covers that baseline first.
What Is llm-d
llm-d is a Kubernetes-native framework for disaggregated LLM inference. Backed by IBM, Red Hat, Google, CoreWeave, and NVIDIA, among others, the project was donated to CNCF on March 24, 2026 and immediately accepted into the Sandbox program.
Four core components:
- llm-d scheduler: the cache-aware routing process that tracks KV cache state per decode worker and routes requests to maximize prefix cache hits
- InferencePool CRD: a custom resource that groups decode GPU pods into a named pool the scheduler can target
- InferenceObjective CRD: maps model names (and LoRA adapter names) to backend deployments with traffic weights and request criticality labels
- Gateway API Inference Extension integration: plugs into the standard Kubernetes Gateway API to replace round-robin with inference-aware routing
The key difference from NVIDIA Dynamo: Dynamo is an orchestration layer that sits above vLLM and runs outside Kubernetes. It uses NIXL for KV cache transfer and is tightly coupled to NVIDIA's DGX/HGX hardware stack. llm-d uses standard Kubernetes networking, CRDs, and the Gateway API. If your team already manages Kubernetes clusters and wants disaggregation without adopting a separate proprietary control plane, llm-d is the natural path.
Prefill/Decode Disaggregation on Kubernetes
The Two Phases and Why They Need Different Hardware
Every LLM generation request has two phases. Prefill processes all input tokens in a single forward pass. Attention is computed over the full prompt length, making prefill compute-bound: FLOPs scale with prompt token count.
Decode generates one output token per step, reading the full KV cache at every step. Generating 256 output tokens against a 4K-token context means 256 separate forward passes, each reading a KV cache that spans 4K+ tokens. Decode is dominated by memory bandwidth, not compute.
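The arithmetic behind that claim is worth making concrete. A back-of-envelope sketch of the bytes a single decode step reads, using illustrative Llama-3.1-70B-style architecture numbers (grouped-query attention, FP8 KV cache; check your model's config for the real values):

```python
# Back-of-envelope: KV cache bytes read per decode step.
# Architecture numbers are illustrative (Llama-3.1-70B-style GQA in FP8).
layers = 80          # transformer layers
kv_heads = 8         # grouped-query attention KV heads
head_dim = 128       # dimension per head
bytes_per_val = 1    # FP8 KV cache
context_tokens = 4096

# K and V each store kv_heads * head_dim values per token per layer
kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_per_val * layers
step_read_bytes = kv_bytes_per_token * context_tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")   # 160 KiB
print(f"Bytes read per decode step: {step_read_bytes / 1e9:.2f} GB")  # 0.67 GB
# Every one of the 256 output tokens repeats this read (and the cache
# grows by one token per step), which is why decode is bandwidth-bound.
```

Roughly 0.67 GB per step, repeated hundreds of times per request, puts the bottleneck squarely on memory bandwidth rather than FLOPs.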
In monolithic vLLM, both phases share the same GPU. A long prefill request blocks decode batches from continuing. This head-of-line blocking is what disaggregation solves: send prefill work to compute-optimized nodes, send decode work to memory-bandwidth-optimized nodes.
How llm-d Routes Requests
```
Client
  |
  v
Kubernetes Gateway (Gateway API)
  |
  v
llm-d Scheduler (cache-aware routing)
  |
  v
Prefill Pool (H100 SXM/B200)
  |
  |  KV transfer
  v
Decode Pool (H100 PCIe / A100 80GB)
  |
  v
Response to Client
```

The scheduler has a key advantage over naive load balancing: it knows which decode workers hold which KV cache blocks. When an incoming request shares a prefix with a recently cached request, the scheduler routes it to the decode worker that already has that prefix cached. This reduces redundant prefill work and cuts time-to-first-token for repeated-prefix workloads.
For deeper background on how KV cache memory works, how PagedAttention allocates blocks, and what prefix caching looks like at the memory level, see the KV Cache Optimization Guide.
Hardware Requirements and GPU Selection
The right GPU depends on which pool it's in.
| Role | Recommended GPU | Why | Price |
|---|---|---|---|
| Prefill | H100 SXM5 | Compute-bound, 3,958 TFLOPS FP8 (with sparsity) | $2.40/hr on-demand |
| Prefill (high-end) | B200 SXM6 | ~2.3x dense FP8 FLOP increase over H100 | $1.67/hr (spot) |
| Decode | H100 PCIe | Memory bandwidth, lower cost | $2.01/hr on-demand |
| Decode | A100 SXM4 80GB | Cost-efficient memory bandwidth | $1.08/hr on-demand |
| Mixed (dev/test) | H100 SXM5 | Single pool, no disaggregation | $2.40/hr on-demand |
The v0.5 benchmark validates meaningful throughput at scale: 3,100 tokens/second per B200 decode GPU, and 50,000 output tokens/second on a 16x16 B200 topology (16 prefill nodes, 16 decode nodes). At single-node or small-cluster scale, results will differ. For per-GPU benchmark data including memory bandwidth and inference throughput comparisons across H100, A100, and B200, see Best GPU for AI Inference 2026.
Pricing fluctuates based on GPU availability. The prices above were captured on April 1, 2026 and may have changed. Check current GPU pricing for live rates.
Step-by-Step Deployment on Spheron with Kubernetes
Provision GPU Instances on Spheron
Rent at least two GPU instances: one for prefill, one or more for decode. For prefill, use H100 SXM5 or B200 for compute throughput. For decode, H100 PCIe or A100 80GB at lower hourly rates are a good fit.
After SSH access:

```bash
nvidia-smi
# Verify: GPU model, VRAM, driver version (575.xx+ for H100/B200)
```

Both nodes need Kubernetes 1.29+ running. If you're starting from scratch, `kubeadm init` on the control plane node and `kubeadm join` on workers is the standard path. Multi-node clusters need a CNI (Flannel, Calico, and Cilium all work).
Install Kubernetes Prerequisites
Install the NVIDIA GPU Operator first. This handles the NVIDIA device plugin, container toolkit, and driver management in one Helm chart:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
```

Install the Gateway API CRDs (v1.3.0+ required for llm-d):

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
```

Install cert-manager (required by llm-d's webhook):
```bash
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true
```

Install llm-d via Helm
```bash
helm repo add llm-d-infra https://llm-d-incubation.github.io/llm-d-infra/
helm repo update
helm install llm-d llm-d-infra/llm-d-infra \
  --namespace llm-d \
  --create-namespace \
  -f values.yaml
```

A minimal values.yaml for a 1-prefill, 1-decode deployment (verify field names against the actual chart values before applying; consult the llm-d scaling docs before increasing replicas beyond 1+1):
```yaml
prefillReplicas: 1
decodeReplicas: 1
model:
  name: "meta-llama/Llama-3.1-70B-Instruct"
  source: "huggingface"
  dtype: "fp8"
scheduler:
  cacheAwareRouting: true
gateway:
  enabled: true
  className: "llm-d-gateway"
```

Define InferencePool and InferenceObjective CRDs
The InferencePool groups decode pods and is the target for the scheduler. The examples below use inference.networking.x-k8s.io/v1alpha2, the experimental API group. A stable inference.networking.k8s.io/v1 API is also available; check the Gateway API Inference Extension docs for the version shipped with your llm-d release.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-70b-decode-pool
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker
```

The InferenceObjective (renamed from InferenceModel in the v1 GA release; v1alpha2 still uses InferenceModel) maps a model name to the pool and sets request criticality:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 100
```

Deploy Prefill and Decode Workers
Prefill worker Deployment:
Important: llm-d uses NIXL (NVIDIA Inference Xfer Library) for KV cache transfer between prefill and decode nodes. NIXL handles peer discovery at the transport layer, so you do not need to specify fixed NCCL ranks or `kv_parallel_size` in the connector config. The manifests below use `replicas: 1` for both deployments as a starting point. Scaling to multiple replicas with NIXL still requires coordination so each prefill pod connects to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-prefill
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_producer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.90"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```

Decode worker Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-decode
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: decode
    spec:
      containers:
        - name: vllm-decode
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_consumer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.92"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```

vLLM version and KV connector: Both manifests use `v0.18.0` (current as of March 2026). llm-d's minimum requirement is v0.10.0, so any version at or above that works. Check Docker Hub for the latest vllm-openai tag before deploying. The `kv_connector` field is pluggable; `NixlConnector` is the connector llm-d uses for KV cache transfer via NIXL (NVIDIA Inference Xfer Library). Verify the connector name against your vLLM version's disaggregated prefill documentation if you change versions.
Configure the Gateway HTTPRoute
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-70b-route
  namespace: llm-d
spec:
  parentRefs:
    - name: llm-d-gateway
      namespace: llm-d
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-70b-decode-pool
          port: 8000
```

Send a Test Request
Get the Gateway's external IP:

```bash
kubectl get gateway llm-d-gateway -n llm-d -o jsonpath='{.status.addresses[0].value}'
```

Send a completion request:
```bash
curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain KV cache disaggregation in two sentences."}],
    "max_tokens": 100
  }'
```

A successful response means prefill ran on the prefill pool, KV tensors transferred to a decode worker, and decode completed the generation.
Configuring the Gateway API Inference Extension
The Gateway API Inference Extension adds three capabilities that standard round-robin load balancing doesn't have:
- Cache-aware load balancing: routes requests to decode workers that already hold matching KV prefix blocks, reducing redundant prefill work
- Model-aware routing: dispatches based on the `model` field in the request body, supporting multiple models behind one Gateway endpoint
- Request prioritization: uses the `criticality` field on `InferenceObjective` to shed low-priority requests under load before dropping high-priority ones
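Model-aware routing amounts to a lookup from the request body's `model` field to a backend pool. A minimal sketch of the dispatch logic (the routing table and the second pool name are hypothetical; the real extension runs this decision inside the Gateway's endpoint picker):

```python
import json

# Hypothetical routing table: model name -> InferencePool backend.
ROUTES = {
    "meta-llama/Llama-3.1-70B-Instruct": "llama-70b-decode-pool",
    "mistralai/Mistral-7B-Instruct-v0.3": "mistral-7b-decode-pool",
}

def pick_pool(request_body: bytes) -> str:
    """Dispatch on the OpenAI-style 'model' field in the request body."""
    model = json.loads(request_body).get("model")
    if model not in ROUTES:
        raise ValueError(f"no InferenceObjective maps model {model!r}")
    return ROUTES[model]

body = json.dumps({"model": "meta-llama/Llama-3.1-70B-Instruct",
                   "messages": []}).encode()
print(pick_pool(body))  # llama-70b-decode-pool
```

Because the lookup key is the model name rather than a URL path, multiple models can sit behind a single `/v1/chat/completions` endpoint.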
InferencePool in Detail
The targetPortNumber must match the port your vLLM decode pods expose. The selector uses standard Kubernetes label selectors, so you can target any pod label combination:
```yaml
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker
```

InferenceObjective Criticality
The criticality field on InferenceObjective has three values:
- Critical: always served; shed last under load
- Standard: normal priority; shed before Critical requests
- Sheddable: background workloads; shed first when capacity is tight
This lets you run batch inference jobs (Sheddable) alongside interactive API traffic (Critical) on the same cluster without batch jobs crowding out user-facing requests.
Multi-Version Traffic Splitting
The targetModels list supports weighted routing across model versions, useful for A/B testing a fine-tuned checkpoint against the base model:
```yaml
targetModels:
  - name: "meta-llama/Llama-3.1-70B-Instruct"
    weight: 90
  - name: "llama-70b-finetuned-v2"
    weight: 10
```

For more on Spheron bare metal instances, see the Spheron documentation.
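The 90/10 split behaves like weighted random selection over target models. A quick simulation of how the weights play out over many requests (illustrative only; the actual split is enforced by the scheduler, not client-side):

```python
import random

# Weighted split matching the 90/10 A/B manifest above.
targets = [("meta-llama/Llama-3.1-70B-Instruct", 90),
           ("llama-70b-finetuned-v2", 10)]

def pick_target(rng: random.Random) -> str:
    names, weights = zip(*targets)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
counts = {name: 0 for name, _ in targets}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
print(counts)  # ~9,000 base-model picks, ~1,000 fine-tune picks
```

Over 10,000 simulated requests the fine-tune receives close to 10% of traffic, enough volume to compare quality metrics before shifting more weight to it.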
Benchmarks: llm-d vs Monolithic vLLM
The v0.5 published numbers use a 16x16 B200 topology (16 prefill nodes, 16 decode nodes) running Llama 3.1 70B FP8. That is a large cluster; single-node and small-cluster comparisons will show different magnitudes.
The pattern holds across cluster sizes: disaggregation gains are most pronounced at high concurrency. At low concurrency (under 20 simultaneous requests), disaggregation adds KV transfer latency overhead that may outweigh the throughput benefit.
| Configuration | Throughput | TTFT (p50) | Notes |
|---|---|---|---|
| vLLM monolithic (2x H100 SXM5) | baseline | baseline | No KV transfer overhead |
| llm-d disaggregated (2P + 4D H100) | ~2-3x higher at >50 concurrent | lower (dedicated prefill) | KV transfer adds ~5-15ms |
| llm-d (16x16 B200, v0.5) | 50,000 tok/s total | not published | Published benchmark |
The 2-3x throughput figure at high concurrency comes from prefill and decode no longer competing for the same GPU cycles. When a batch of 100 requests arrives simultaneously, monolithic vLLM serializes prefill for each request before decode can continue; disaggregated llm-d runs prefill in parallel across the prefill pool while the decode pool stays active.
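A toy latency model shows why the gap widens with concurrency. Assume each request needs a fixed prefill time, and that disaggregation spreads prefill across a dedicated pool while decode proceeds uninterrupted (the numbers are illustrative, not measurements):

```python
import math

def monolithic_prefill_wait(n_requests: int, prefill_s: float) -> float:
    """Worst-case wait before the last request can decode when a single
    GPU serializes every prefill ahead of its decode batches."""
    return n_requests * prefill_s

def disaggregated_prefill_wait(n_requests: int, prefill_s: float,
                               prefill_gpus: int) -> float:
    """Prefills run in parallel waves across a dedicated prefill pool."""
    return math.ceil(n_requests / prefill_gpus) * prefill_s

n, t = 100, 0.5  # 100 concurrent requests, 0.5 s of prefill each
print(monolithic_prefill_wait(n, t))         # 50.0 s on one shared GPU
print(disaggregated_prefill_wait(n, t, 16))  # 3.5 s on a 16-GPU prefill pool
```

The model ignores KV transfer latency (~5-15 ms per request, per the table), which is why disaggregation loses its edge at low concurrency: with a handful of requests, the transfer overhead dominates and the parallel-prefill win shrinks toward zero.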
For broader inference framework comparisons including TensorRT-LLM and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks. For GPU-level throughput and cost-per-token data on H100 and B200, see GPU cloud benchmarks 2026.
Scaling: Autoscaling, LoRA Routing, and Multi-Model Serving
Autoscaling with KEDA or HPA
Prefill and decode pools scale independently because their bottlenecks differ. KEDA with a custom metric works well here: scale decode workers on queue depth, scale prefill workers on compute utilization.
Note on scaling with NixlConnector: The deployments in this guide use NixlConnector, which handles peer discovery at the transport layer and does not require fixed NCCL ranks. Scaling NixlConnector-based deployments requires that each prefill pod can discover and connect to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1+1. If you switch to P2pNcclConnector (a generic vLLM connector), note that it uses a fixed 2-member NCCL communicator with static ranks, and you would need dynamic rank assignment before autoscaling. The KEDA example below shows the pattern for NixlConnector-based deployments.
A KEDA ScaledObject for the decode Deployment:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-decode-scaler
  namespace: llm-d
spec:
  scaleTargetRef:
    name: llama-70b-decode
  minReplicaCount: 1
  maxReplicaCount: 16
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"
        query: sum(vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"})
```

When vllm:num_requests_waiting exceeds 10 pending requests per decode pod, KEDA adds replicas. When it drops below, it scales down. The decode pool responds to load without touching prefill capacity.
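KEDA hands the metric to the HPA, which targets an average of `threshold` per replica, so the resulting replica count is roughly the metric divided by the threshold, rounded up and clamped to the min/max bounds. A simplified sketch of that arithmetic (the real HPA also applies stabilization windows and a tolerance band):

```python
import math

def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate HPA target: average `threshold` waiting requests per pod."""
    want = math.ceil(total_waiting / threshold) if total_waiting > 0 else 0
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(85, 10, 1, 16))   # 9 decode pods for 85 queued requests
print(desired_replicas(400, 10, 1, 16))  # 16: clamped at maxReplicaCount
```

Tuning the threshold is the main lever: a lower value scales out earlier at the cost of more idle GPU-hours, a higher value tolerates deeper queues before paying for new replicas.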
LoRA Adapter Routing
The InferenceModel CRD (called InferenceObjective in the v1 GA API) natively supports LoRA adapter routing. Multiple adapters can share a decode pool, with traffic distributed by weight:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b-with-adapters
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 70
    - name: "llama-70b-customer-support-lora"
      weight: 20
    - name: "llama-70b-code-lora"
      weight: 10
```

The scheduler routes requests to decode workers that have the requested adapter loaded, minimizing adapter swap overhead. For the full setup including adapter loading, hot-swap, and memory management, see the LoRA multi-adapter serving guide.
Multi-Model Serving
Multiple InferenceObjective CRDs can share a single InferencePool. The scheduler routes by model name from the request body. The constraint is GPU memory: each model you load adds to the memory footprint of each decode worker. On an 80 GB H100, you might load a 7B and a 13B simultaneously (roughly 14 GB + 26 GB in FP8), but a 70B alone nearly fills the card.
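The memory budget in that example works out as follows. A rough fit-check assuming FP8 weights at ~1 byte per parameter, with a flat 10% reserve standing in for KV cache, activations, and CUDA overhead (in practice the KV cache alone can claim far more than 10%, so treat this as an optimistic bound):

```python
def fits(models_b_params: list[float], vram_gb: float,
         fp8_bytes: float = 1.0, reserve_frac: float = 0.10) -> bool:
    """Rough check: do FP8 weights fit with a 10% reserve for runtime overhead?"""
    weights_gb = sum(p * fp8_bytes for p in models_b_params)  # ~1 GB per 1B params
    return weights_gb <= vram_gb * (1 - reserve_frac)

print(fits([7, 13], 80))   # True: ~20 GB of weights on an 80 GB H100
print(fits([70], 80))      # True, but with no room left for a second model
print(fits([70, 13], 80))  # False: 83 GB of weights overflows the card
```

The takeaway matches the prose above: co-locating small models on one pool is practical, but a 70B-class model effectively claims the whole card.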
Cost Analysis: llm-d on Spheron vs Managed Inference APIs
Self-hosted cost depends on actual GPU utilization. The figures below assume steady-state utilization at roughly 60-70% for a production workload generating 1M tokens per day.
| Option | Setup | $/1M tokens | Control |
|---|---|---|---|
| OpenAI GPT-4o | None | ~$5 (input) + $15 (output) | None |
| Fireworks / Together AI | None | ~$0.90-$3 (model-dependent) | Limited |
| llm-d on Spheron H100 SXM5 | ~1hr | ~$0.15-$0.40 (at utilization) | Full |
| llm-d on Spheron B200 SXM6 (spot) | ~1hr | ~$0.02-$0.05 (at utilization) | Full |
The self-hosted estimates derive from current Spheron prices ($2.40/hr on-demand for H100 SXM5; $1.67/hr spot for B200 SXM6; on-demand B200 pricing is not currently available) divided by estimated throughput from the v0.5 benchmark data at realistic utilization. At higher utilization, cost per token drops further.
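The underlying arithmetic is simple enough to sketch. The inputs below are illustrative: the $2.40/hr H100 rate comes from the pricing table, while the 3,100 tok/s figure is borrowed from the v0.5 B200 decode benchmark as an optimistic stand-in (real per-GPU throughput depends on model, batch size, and topology):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float,
                            utilization: float) -> float:
    """USD per 1M output tokens at a given steady-state utilization."""
    tokens_per_hour = tokens_per_sec * utilization * 3600
    return hourly_usd / (tokens_per_hour / 1e6)

# Illustrative: $2.40/hr GPU at 3,100 tok/s and 65% utilization
print(f"${cost_per_million_tokens(2.40, 3100, 0.65):.2f} per 1M tokens")  # $0.33
```

Utilization is the dominant variable: the same hardware at 30% utilization more than doubles the per-token cost, which is why the table's estimates assume a steady production workload.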
Bare metal matters here. KVM hypervisors add 5-15% CUDA overhead; shared cloud VMs add more. Spheron's bare metal instances give you direct PCIe and NVLink access to the GPU. For disaggregated inference specifically, KV cache transfer latency over the network fabric is a critical variable, and hypervisor-induced jitter in network I/O degrades it.
For a full comparison across cloud providers including per-GPU pricing history and spot availability, see GPU cloud pricing comparison 2026.
A note on CNCF Sandbox status: accepted does not mean production-ready. Sandbox signals that the CNCF Technical Oversight Committee found the project technically sound and the governance model appropriate for community development. The v0.5 API surface will change before a stable release. Run llm-d in staging, validate your specific workload, and pin to a specific Helm chart version before deploying to production traffic.
llm-d's disaggregated inference delivers its best results on bare metal where you control the full CUDA stack and network. Spheron's H100 and B200 bare metal instances give you Kubernetes-ready GPU nodes without virtualization overhead.
