Tutorial

llm-d on Kubernetes: Disaggregated LLM Inference Deployment Guide (2026)

Written by Mitrasish, Co-founder · Apr 1, 2026
Tags: llm-d · Kubernetes · LLM Inference · Disaggregated Inference · CNCF · vLLM · GPU Cloud · H100 · B200 · Inference Optimization

On March 24, 2026, llm-d was accepted into the CNCF Sandbox. The project solves a specific problem: monolithic vLLM saturates prefill GPUs during high-concurrency decode, and disaggregating those two phases onto separate GPU pools fixes it. Unlike NVIDIA Dynamo's disaggregated inference approach, which is an orchestration layer running above vLLM outside Kubernetes, llm-d is a first-class Kubernetes citizen using standard CRDs and the Gateway API Inference Extension. If you haven't set up vLLM on bare metal yet, the vLLM production deployment guide covers that baseline first.

What Is llm-d

llm-d is a Kubernetes-native framework for disaggregated LLM inference. Backed by IBM, Red Hat, Google, CoreWeave, and NVIDIA, among others, the project was donated to CNCF on March 24, 2026 and immediately accepted into the Sandbox program.

Four core components:

  • llm-d scheduler: the cache-aware routing process that tracks KV cache state per decode worker and routes requests to maximize prefix cache hits
  • InferencePool CRD: a custom resource that groups decode GPU pods into a named pool the scheduler can target
  • InferenceObjective CRD: maps model names (and LoRA adapter names) to backend deployments with traffic weights and request criticality labels
  • Gateway API Inference Extension integration: plugs into the standard Kubernetes Gateway API to replace round-robin with inference-aware routing

The key difference from NVIDIA Dynamo: Dynamo is an orchestration layer that sits above vLLM and runs outside Kubernetes. It uses NIXL for KV cache transfer and is tightly coupled to NVIDIA's DGX/HGX hardware stack. llm-d uses standard Kubernetes networking, CRDs, and the Gateway API. If your team already manages Kubernetes clusters and wants disaggregation without adopting a separate proprietary control plane, llm-d is the natural path.

Prefill/Decode Disaggregation on Kubernetes

The Two Phases and Why They Need Different Hardware

Every LLM generation request has two phases. Prefill processes all input tokens in a single forward pass. Attention is computed over the full prompt length, making prefill compute-bound: FLOPs scale with prompt token count.

Decode generates one output token per step and reads the full KV cache at every step. Generating 256 output tokens from a 4K-token context means 256 sequential decode steps, each scanning a KV cache spanning 4K+ tokens. Decode is therefore dominated by memory bandwidth, not compute.
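To make the bandwidth pressure concrete, here is a back-of-envelope calculation using the Llama 3.1 70B shapes deployed later in this guide. The architecture values (80 layers, 8 KV heads via GQA, head_dim 128) and an FP16 cache are assumptions for illustration:

```shell
# Assumed Llama 3.1 70B shapes: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 = 2 bytes
LAYERS=80; KV_HEADS=8; HEAD_DIM=128; BYTES=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))   # factor 2: one K and one V per token
echo "KV cache per token: ${PER_TOKEN} bytes"              # 327680 bytes ~= 320 KiB

CONTEXT=4096
CACHE_MIB=$((PER_TOKEN * CONTEXT / 1024 / 1024))
echo "4K-token cache: ${CACHE_MIB} MiB"                    # re-read on every decode step
echo "256 decode steps re-read ~$((CACHE_MIB * 256 / 1024)) GiB total"
```

Hundreds of gigabytes moved to produce a few hundred tokens is why decode wants high memory bandwidth rather than raw FLOPs.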

In monolithic vLLM, both phases share the same GPU. A long prefill request blocks decode batches from continuing. This head-of-line blocking is what disaggregation solves: send prefill work to compute-optimized nodes, send decode work to memory-bandwidth-optimized nodes.

How llm-d Routes Requests

Client
  |
  v
Kubernetes Gateway (Gateway API)
  |
  v
llm-d Scheduler (cache-aware routing)
  |
  v
Prefill Pool (H100 SXM / B200)
  |
  |  KV cache transfer
  v
Decode Pool (H100 PCIe / A100 80GB)
  |
  v
Response to Client

The scheduler has a key advantage over naive load balancing: it knows which decode workers hold which KV cache blocks. When an incoming request shares a prefix with a recently cached request, the scheduler routes it to the decode worker that already has that prefix cached. This reduces redundant prefill work and cuts time-to-first-token for repeated-prefix workloads.

For deeper background on how KV cache memory works, how PagedAttention allocates blocks, and what prefix caching looks like at the memory level, see the KV Cache Optimization Guide.

Hardware Requirements and GPU Selection

The right GPU depends on which pool it's in.

| Role | Recommended GPU | Why | Price |
| --- | --- | --- | --- |
| Prefill | H100 SXM5 | Compute-bound; 3,958 TFLOPS FP8 (with sparsity) | $2.40/hr on-demand |
| Prefill (high-end) | B200 SXM6 | ~2.3x dense FP8 FLOPs over H100 | $1.67/hr (spot) |
| Decode | H100 PCIe | Memory bandwidth at lower cost | $2.01/hr on-demand |
| Decode | A100 SXM4 80GB | Cost-efficient memory bandwidth | $1.08/hr on-demand |
| Mixed (dev/test) | H100 SXM5 | Single pool, no disaggregation | $2.40/hr on-demand |

The v0.5 benchmark validates meaningful throughput at scale: 3,100 tokens/second per B200 decode GPU, and 50,000 output tokens/second on a 16x16 B200 topology (16 prefill nodes, 16 decode nodes). At single-node or small-cluster scale, results will differ. For per-GPU benchmark data including memory bandwidth and inference throughput comparisons across H100, A100, and B200, see Best GPU for AI Inference 2026.

Pricing fluctuates with GPU availability. The prices above are as of April 1, 2026 and may have changed. Check current GPU pricing for live rates.

Step-by-Step Deployment on Spheron with Kubernetes

Provision GPU Instances on Spheron

Rent at least two GPU instances: one for prefill, one or more for decode. For prefill, use H100 SXM5 or B200 for compute throughput. For decode, H100 PCIe or A100 80GB at lower hourly rates are a good fit.

After SSH access:

bash
nvidia-smi
# Verify: GPU model, VRAM, driver version (575.xx+ for H100/B200)

Both nodes need Kubernetes 1.29+ running. If you're starting from scratch, kubeadm init on the control plane node and kubeadm join on workers is the standard path. Multi-node clusters need a CNI (Flannel, Calico, Cilium all work).
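If you are bootstrapping from scratch, a minimal sketch of that path looks like the following. It assumes containerd is already configured on both nodes, and uses Flannel as the CNI; the pod CIDR and node names are placeholders:

```shell
# On the control plane node: initialize the cluster (CIDR matches Flannel's default)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Install the Flannel CNI so nodes become Ready
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# On each worker node: run the exact `kubeadm join ...` command printed by kubeadm init

# Optional: label nodes by intended pool role, useful if you later pin
# prefill/decode Deployments to specific nodes with a nodeSelector
kubectl label node <decode-node> llm-d.ai/role=decode
```

Confirm with `kubectl get nodes` that every node reports Ready before moving on.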

Install Kubernetes Prerequisites

Install the NVIDIA GPU Operator first. This handles the NVIDIA device plugin, container toolkit, and driver management in one Helm chart:

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
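Before continuing, verify the operator rolled out and that each GPU node actually advertises allocatable GPUs (the custom-columns query below escapes the dots in the `nvidia.com/gpu` resource key):

```shell
# All gpu-operator pods should reach Running/Completed
kubectl get pods -n gpu-operator

# Each GPU node should report a nonzero allocatable GPU count
kubectl get nodes -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```

If a node shows `<none>` for GPUs, the device plugin has not registered yet; check the operator pod logs on that node.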

Install Gateway API CRDs (v1.3.0+ required for llm-d):

bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

Install cert-manager (required by llm-d's webhook):

bash
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true
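A quick sanity check that both prerequisites are actually in place before installing llm-d:

```shell
# Gateway API CRDs (gateways, httproutes, etc.) should be listed
kubectl get crds | grep gateway.networking.k8s.io

# cert-manager's deployments should all become Available
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=120s
```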

Install llm-d via Helm

bash
helm repo add llm-d-infra https://llm-d-incubation.github.io/llm-d-infra/
helm repo update
helm install llm-d llm-d-infra/llm-d-infra \
  --namespace llm-d \
  --create-namespace \
  -f values.yaml

A minimal values.yaml for a 1-prefill, 1-decode deployment (verify field names against the actual chart values before applying; consult llm-d scaling docs before increasing replicas beyond 1+1):

yaml
prefillReplicas: 1
decodeReplicas: 1

model:
  name: "meta-llama/Llama-3.1-70B-Instruct"
  source: "huggingface"
  dtype: "fp8"

scheduler:
  cacheAwareRouting: true

gateway:
  enabled: true
  className: "llm-d-gateway"

Define InferencePool and InferenceObjective CRDs

The InferencePool groups decode pods and is the target for the scheduler. The examples below use inference.networking.x-k8s.io/v1alpha2, the experimental API group. A stable inference.networking.k8s.io/v1 API is also available; check the Gateway API Inference Extension docs for the version shipped with your llm-d release.

yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-70b-decode-pool
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker

The InferenceObjective (renamed from InferenceModel in the v1 GA release; v1alpha2 still uses InferenceModel) maps a model name to the pool and sets request criticality:

yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 100
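Save the two manifests (the filenames below are placeholders), apply them, and confirm the pool resolved its decode endpoints:

```shell
kubectl apply -f inferencepool.yaml -f inferencemodel.yaml

# Both resources should appear; describe the pool to see which pods it selected
kubectl get inferencepools,inferencemodels -n llm-d
kubectl describe inferencepool llama-70b-decode-pool -n llm-d
```

The pool will show no endpoints until the decode Deployment in the next step is running with matching labels.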

Deploy Prefill and Decode Workers

Prefill worker Deployment:

Important: llm-d uses NIXL (NVIDIA Inference Xfer Library) for KV cache transfer between prefill and decode nodes. NIXL handles peer discovery at the transport layer, so you do not need to specify fixed NCCL ranks or kv_parallel_size in the connector config. The manifests below use replicas: 1 for both deployments as a starting point. Scaling to multiple replicas with NIXL still requires coordination so each prefill pod connects to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-prefill
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_producer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.90"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"

Decode worker Deployment:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-decode
  namespace: llm-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-worker
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        app: llama-70b-worker
        llm-d.ai/role: decode
    spec:
      containers:
        - name: vllm-decode
          image: vllm/vllm-openai:v0.18.0
          env:
            - name: VLLM_MODEL
              value: "meta-llama/Llama-3.1-70B-Instruct"
          args:
            - "--kv-transfer-config={\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_consumer\"}"
            - "--dtype=fp8"
            - "--gpu-memory-utilization=0.92"
            - "--max-model-len=32768"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"

vLLM version and KV connector: Both manifests use v0.18.0 (current as of March 2026). llm-d's minimum requirement is v0.10.0, so any version at or above that works. Check Docker Hub for the latest vllm-openai tag before deploying. The kv_connector field is pluggable. NixlConnector is the connector llm-d uses for KV cache transfer via NIXL (NVIDIA Inference Xfer Library). Verify the connector name against your vLLM version's disaggregated prefill documentation if you change versions.
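With both manifests saved (filenames here are placeholders), apply them and wait for the rollouts. Expect the first rollout to take a while: each pod downloads roughly 70 GB of FP8 weights before it reports Ready.

```shell
kubectl apply -f prefill-deployment.yaml -f decode-deployment.yaml

# Block until each Deployment's pod is Ready (model download dominates this time)
kubectl rollout status deployment/llama-70b-prefill -n llm-d
kubectl rollout status deployment/llama-70b-decode -n llm-d

# Confirm both workers were scheduled and landed on GPU nodes
kubectl get pods -n llm-d -l app=llama-70b-worker -o wide
```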

Configure the Gateway HTTPRoute

yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-70b-route
  namespace: llm-d
spec:
  parentRefs:
    - name: llm-d-gateway
      namespace: llm-d
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-70b-decode-pool
          port: 8000

Send a Test Request

Get the Gateway's external IP:

bash
kubectl get gateway llm-d-gateway -n llm-d -o jsonpath='{.status.addresses[0].value}'

Send a completion request:

bash
curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain KV cache disaggregation in two sentences."}],
    "max_tokens": 100
  }'

A successful response means prefill ran on the prefill pool, KV tensors transferred to a decode worker, and decode completed the generation.
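To get a rough feel for time-to-first-token through the disaggregated path, you can probe with a streaming request; with `"stream": true`, curl's `time_starttransfer` approximates TTFT (this is a coarse client-side measurement, not a substitute for server-side metrics):

```shell
GATEWAY_IP=<GATEWAY_IP>   # from the kubectl get gateway command above

curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s (total %{time_total}s)\n' \
  http://$GATEWAY_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"hi"}],"max_tokens":64,"stream":true}'
```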

Configuring the Gateway API Inference Extension

The Gateway API Inference Extension adds three capabilities that standard round-robin load balancing doesn't have:

  1. Cache-aware load balancing: routes requests to decode workers that already hold matching KV prefix blocks, reducing redundant prefill work
  2. Model-aware routing: dispatches based on the model field in the request body, supporting multiple models behind one Gateway endpoint
  3. Request prioritization: uses the criticality field on InferenceObjective to shed low-priority requests under load before dropping high-priority ones

InferencePool in Detail

The targetPortNumber must match the port your vLLM decode pods expose. The selector uses standard Kubernetes label selectors, so you can target any pod label combination:

yaml
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      llm-d.ai/role: decode
      app: llama-70b-worker

InferenceObjective Criticality

The criticality field on InferenceObjective has three values:

  • Critical: always served; shed last under load
  • Standard: normal priority; shed before Critical requests
  • Sheddable: background workloads; shed first when capacity is tight

This lets you run batch inference jobs (Sheddable) alongside interactive API traffic (Critical) on the same cluster without batch jobs crowding out user-facing requests.
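One hypothetical way to wire that up is a second objective with its own client-facing model name: batch clients request `llama-70b-batch`, hit the same backend via `targetModels`, and are shed first under load. The names below are illustrative, not from the llm-d docs:

```shell
# Hypothetical Sheddable objective for batch traffic sharing the decode pool
cat <<'EOF' | kubectl apply -f -
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b-batch
  namespace: llm-d
spec:
  modelName: "llama-70b-batch"
  criticality: Sheddable
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 100
EOF
```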

Multi-Version Traffic Splitting

The targetModels list supports weighted routing across model versions, useful for A/B testing a fine-tuned checkpoint against the base model:

yaml
targetModels:
  - name: "meta-llama/Llama-3.1-70B-Instruct"
    weight: 90
  - name: "llama-70b-finetuned-v2"
    weight: 10
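A rough client-side sanity check of the split is to send a batch of short requests and tally which target served each one. This assumes the `model` field in the response reflects the resolved target (verify against your llm-d version) and requires `jq`:

```shell
# Send 50 one-token requests and count responses per served model
for i in $(seq 1 50); do
  curl -s http://$GATEWAY_IP/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
    | jq -r '.model'
done | sort | uniq -c
```

With a 90/10 weighting you should see roughly a 45/5 split, with normal sampling noise at this small a sample.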

For more on Spheron bare metal instances, see the Spheron documentation.

Benchmarks: llm-d vs Monolithic vLLM

The v0.5 published numbers use a 16x16 B200 topology (16 prefill nodes, 16 decode nodes) running Llama 3.1 70B FP8. That is a large cluster; single-node and small-cluster comparisons will show different magnitudes.

The pattern holds across cluster sizes: disaggregation gains are most pronounced at high concurrency. At low concurrency (under 20 simultaneous requests), disaggregation adds KV transfer latency overhead that may outweigh the throughput benefit.

| Configuration | Throughput | TTFT (p50) | Notes |
| --- | --- | --- | --- |
| vLLM monolithic (2x H100 SXM5) | baseline | baseline | No KV transfer overhead |
| llm-d disaggregated (2P + 4D H100) | ~2-3x higher at >50 concurrent | Lower (dedicated prefill) | KV transfer adds ~5-15 ms |
| llm-d (16x16 B200, v0.5) | 50,000 tok/s total | Not published | Published benchmark |

The 2-3x throughput figure at high concurrency comes from prefill and decode no longer competing for the same GPU cycles. When a batch of 100 requests arrives simultaneously, monolithic vLLM serializes prefill for each request before decode can continue; disaggregated llm-d runs prefill in parallel across the prefill pool while the decode pool stays active.

For broader inference framework comparisons including TensorRT-LLM and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks. For GPU-level throughput and cost-per-token data on H100 and B200, see GPU cloud benchmarks 2026.

Scaling: Autoscaling, LoRA Routing, and Multi-Model Serving

Autoscaling with KEDA or HPA

Prefill and decode pools scale independently because their bottlenecks differ. KEDA with a custom metric works well here: scale decode workers on queue depth, scale prefill workers on compute utilization.

Note on scaling with NixlConnector: The deployments in this guide use NixlConnector, which handles peer discovery at the transport layer and does not require fixed NCCL ranks. Scaling NixlConnector-based deployments requires that each prefill pod can discover and connect to the correct decode peer. Consult the llm-d scaling documentation before increasing replicas beyond 1+1. If you switch to P2pNcclConnector (a generic vLLM connector), note that it uses a fixed 2-member NCCL communicator with static ranks, and you would need dynamic rank assignment before autoscaling. The KEDA example below shows the pattern for NixlConnector-based deployments.

A KEDA ScaledObject for the decode Deployment:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-decode-scaler
  namespace: llm-d
spec:
  scaleTargetRef:
    name: llama-70b-decode
  minReplicaCount: 1
  maxReplicaCount: 16
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"
        query: sum(vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"})

When vllm:num_requests_waiting exceeds 10 pending requests per decode pod, KEDA adds replicas. When it drops below, it scales down. The decode pool responds to load without touching prefill capacity.
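KEDA materializes the ScaledObject as an HPA, so you can watch both while driving load to confirm scaling behaves as expected:

```shell
# ScaledObject should show READY=True; KEDA creates a matching HPA
kubectl get scaledobject llama-decode-scaler -n llm-d
kubectl get hpa -n llm-d

# Watch decode replicas track queue depth during a load test
kubectl get deployment llama-70b-decode -n llm-d -w
```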

LoRA Adapter Routing

The InferenceModel CRD (called InferenceObjective in the v1 GA API) natively supports LoRA adapter routing. Multiple adapters can share a decode pool, with traffic distributed by weight:

yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-70b-with-adapters
  namespace: llm-d
spec:
  modelName: "meta-llama/Llama-3.1-70B-Instruct"
  criticality: Standard
  poolRef:
    name: llama-70b-decode-pool
  targetModels:
    - name: "meta-llama/Llama-3.1-70B-Instruct"
      weight: 70
    - name: "llama-70b-customer-support-lora"
      weight: 20
    - name: "llama-70b-code-lora"
      weight: 10

The scheduler routes requests to decode workers that have the requested adapter loaded, minimizing adapter swap overhead. For the full setup including adapter loading, hot-swap, and memory management, see the LoRA multi-adapter serving guide.
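From the client's perspective, selecting an adapter is just a different `model` value in the request. The adapter name below matches the weighted example above and must already be registered with the vLLM workers:

```shell
curl -s http://$GATEWAY_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-70b-customer-support-lora","messages":[{"role":"user","content":"Where is my order?"}],"max_tokens":50}'
```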

Multi-Model Serving

Multiple InferenceObjective CRDs can share a single InferencePool. The scheduler routes by model name from the request body. The constraint is GPU memory: each model you load adds to the memory footprint of each decode worker. On an 80 GB H100, you might load a 7B and a 13B simultaneously (roughly 7 GB + 13 GB of weights in FP8, at about one byte per parameter), but a 70B alone (~70 GB in FP8) nearly fills the card before KV cache is allocated.

Cost Analysis: llm-d on Spheron vs Managed Inference APIs

Self-hosted cost depends on actual GPU utilization. The figures below assume steady-state utilization at roughly 60-70% for a production workload generating 1M tokens per day.

| Option | Setup | $/1M tokens | Control |
| --- | --- | --- | --- |
| OpenAI GPT-4o | None | ~$5 (input) + $15 (output) | None |
| Fireworks / Together AI | None | ~$0.90-$3 (model-dependent) | Limited |
| llm-d on Spheron H100 SXM5 | ~1 hr | ~$0.15-$0.40 (at utilization) | Full |
| llm-d on Spheron B200 SXM6 (spot) | ~1 hr | ~$0.02-$0.05 (at utilization) | Full |

The self-hosted estimates derive from current Spheron prices ($2.40/hr on-demand for H100 SXM5; $1.67/hr spot for B200 SXM6; on-demand B200 pricing is not currently available) divided by estimated throughput from the v0.5 benchmark data at realistic utilization. At higher utilization, cost per token drops further.

Bare metal matters here. KVM hypervisors add 5-15% CUDA overhead; shared cloud VMs add more. Spheron's bare metal instances give you direct PCIe and NVLink access to the GPU. For disaggregated inference specifically, KV cache transfer latency over the network fabric is a critical variable, and hypervisor-induced jitter in network I/O degrades it.

For a full comparison across cloud providers including per-GPU pricing history and spot availability, see GPU cloud pricing comparison 2026.


A note on CNCF Sandbox status: accepted does not mean production-ready. Sandbox signals that the CNCF Technical Oversight Committee found the project technically sound and the governance model appropriate for community development. The v0.5 API surface will change before a stable release. Run llm-d in staging, validate your specific workload, and pin to a specific Helm chart version before deploying to production traffic.

llm-d's disaggregated inference delivers its best results on bare metal where you control the full CUDA stack and network. Spheron's H100 and B200 bare metal instances give you Kubernetes-ready GPU nodes without virtualization overhead.

Rent H100 → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.