NVIDIA Grove on GPU Cloud: Kubernetes-Native Orchestration for Disaggregated LLM Inference (2026)

Grove is NVIDIA's Kubernetes API for managing disaggregated inference workloads. Where raw Kubernetes Deployments treat prefill, decode, and router pods as independent units, Grove wraps them into a single declarative resource with startup ordering, gang scheduling, and coordinated scaling built in. This guide zooms in on Grove specifically, covering its CRDs in depth, a complete prefill-decode vLLM deployment on Spheron B200/H200 nodes, and how Grove fits into the DRA/KAI Scheduler stack. For background on DRA and KAI Scheduler themselves, see the Kubernetes GPU Orchestration 2026 guide.

What Grove Is

A disaggregated inference stack has three to five separate workloads: a prefill pool, a decode pool, a router, and optionally KV cache proxy nodes and topology-aware placement controllers. Each component has a different GPU requirement, a different startup dependency, and a different scaling profile.

Raw Kubernetes Deployments handle each component independently. You create a Deployment for prefill pods, a separate Deployment for decode pods, a Deployment for the router. There is no native way to say "start the router before the prefill workers, start prefill before decode, and if decode pods can't schedule, hold the whole set pending rather than starting a partial topology." You end up wiring these constraints together with init containers, readiness probes, and custom controllers.

Grove wraps all of this into a single CRD: the PodCliqueSet. One manifest describes the full topology. Grove's operator handles the startup ordering, the gang scheduling constraint, and the scaling coordination. The deep dive on prefill-decode disaggregation fundamentals covers the why and the hardware pairing logic in detail.

The Grove CRDs Explained

Grove introduces five custom resources. Three are user-facing; two are internal.

PodClique

A PodClique defines one role in the workload: all prefill workers, or all decode workers, or the router. Fields:

  • containers: the pod spec for this role (image, args, resources)
  • replicas: how many pods in this clique
  • resourceClaims: one or more DRA ResourceClaimTemplate references that bind each pod to a specific GPU type
  • startsAfter: optional name of another PodClique that must reach Ready before this one starts. This is how Grove expresses startup ordering: each dependent clique declares which clique it depends on.

A prefill PodClique and a decode PodClique will have different resourceClaims pointing to different GPU selectors. The scheduler evaluates CEL expressions such as "productName starts with B200" against the devices that the DRA driver advertises.

yaml
# Minimal PodClique within a PodCliqueSet for prefill on B200
- name: prefill-clique
  replicas: 1
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: b200-prefill-claim
  containers:
  - name: vllm-prefill
    image: nvcr.io/nvidia/vllm:latest
    args:
    - "--model"
    - "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    - "--disaggregation-mode"
    - "prefill"
    - "--kv-transfer-config"
    - '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
    resources:
      claims:
      - name: gpu

PodCliqueScalingGroup

A PodCliqueScalingGroup bundles multiple PodCliques that must scale together at a fixed ratio. If you want one prefill pod for every two decode pods, you set:

yaml
scalingGroups:
- name: prefill-decode-ratio
  cliques: [prefill-clique, decode-clique]
  replicas: [1, 2]

When the PodCliqueSet scales from 1 to 2, the prefill pool goes from 1 to 2 and the decode pool goes from 2 to 4, maintaining the ratio. This prevents the decode pool from scaling while prefill stays flat, which would create a throughput bottleneck at the prefill stage.
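
To make the ratio arithmetic concrete, here is a small shell sketch. The function name `clique_pods` is hypothetical and not part of Grove; it only multiplies the set's replica count by each clique's per-unit count from the scaling group, which is the behavior described above.

```shell
# Hypothetical helper (not a Grove command): pods per clique for a given
# PodCliqueSet replica count and a scaling group's per-unit counts.
# Usage: clique_pods SET_REPLICAS PREFILL_PER_UNIT DECODE_PER_UNIT
clique_pods() {
  echo "prefill=$(( $1 * $2 )) decode=$(( $1 * $3 ))"
}

clique_pods 1 1 2   # set at 1 replica
clique_pods 2 1 2   # set scaled to 2 replicas
```

With replicas: [1, 2], the second call reports 2 prefill pods and 4 decode pods, matching the scale-up described above.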

PodCliqueSet

The PodCliqueSet is the top-level workload definition. Key fields:

  • scaling: global min/max replicas for the set
  • scalingGroups: ratio constraints across cliques
  • podCliques: the list of PodClique definitions (each clique can declare a startsAfter dependency on another clique to express startup ordering)

The gang-scheduling constraint is implicit: all PodCliques in a PodCliqueSet must be schedulable together, or the whole set remains pending. If one clique's GPU requirements can't be met, Grove doesn't partially start the others.

ClusterTopologyBinding

ClusterTopologyBinding describes the physical layout of the cluster: which GPUs are connected via NVLink, which nodes share the same NVSwitch fabric, and rack-level placement. Grove reads this to make topology-aware PodClique placement decisions.

This CRD matters most for multi-node B200/H200 deployments. When prefill workers transfer KV cache to decode workers over NIXL, NVLink bandwidth is 10 to 50 times higher than InfiniBand for intra-node or intra-NVSwitch transfers. If Grove can see the ClusterTopologyBinding, it places prefill and decode cliques on nodes in the same NVSwitch fabric, keeping KV transfer latency low.

Bare metal is critical here. Hypervisors frequently filter NVLink peer topology data from the guest OS. Grove's ClusterTopologyBinding CRD needs the raw hardware topology that the NVIDIA DRA driver exposes, which requires direct hardware access.
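
One rough way to check whether a node exposes NVLink peer topology is to look at the matrix printed by `nvidia-smi topo -m`, where NVLink connections appear as `NV` followed by a link count. The sketch below assumes that what the guest OS sees through `nvidia-smi` is a reasonable proxy for what the DRA driver can read; the helper name `has_nvlink` is made up for this example.

```shell
# Hypothetical helper: succeeds if a topology matrix read from stdin contains
# at least one NVLink entry (nvidia-smi prints these as NV plus a link count).
has_nvlink() {
  grep -Eq 'NV[0-9]+'
}

# On a GPU node (requires the NVIDIA driver to be installed):
#   nvidia-smi topo -m | has_nvlink && echo "NVLink visible" || echo "no NVLink entries"
```

If the matrix shows only PCIe-style entries on hardware that should have NVLink, the hypervisor is probably filtering the topology.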

PodGang

PodGang is the internal primitive Grove passes to KAI Scheduler to express the gang constraint. It is not user-facing, but it is worth understanding because it is how Grove integrates with the scheduler. Grove creates a PodGang resource for each PodCliqueSet; KAI Scheduler reads the PodGang and enforces the "schedule all or none" constraint via its gang scheduling logic. Note that PodGang lives in the scheduler.grove.io API group, separate from the user-facing CRDs in the grove.io group.

Grove vs Raw Deployments vs Dynamo's Control Plane

| Dimension | Raw Deployments | NVIDIA Dynamo | Grove |
| --- | --- | --- | --- |
| Orchestration level | Kubernetes pod groups | Above vLLM, outside Kubernetes | Kubernetes CRDs |
| Startup ordering | Manual (init containers, probes) | Process-level (Dynamo supervisor) | Declarative (startsAfter dependency per PodClique) |
| Gang scheduling | None native | Not applicable (not Kubernetes) | Native via PodGang + KAI Scheduler |
| Topology awareness | Manual node labels | NVLink-aware via NIXL | ClusterTopologyBinding CRD, DRA structured parameters |
| Failure atomicity | Per-Deployment independently | Dynamo supervisor restarts workers | PodCliqueSet stays degraded; won't partial-start |
| Kubernetes nativeness | Full | None | Full |

Dynamo and Grove are not competing tools. Dynamo manages the inference control plane at runtime: routing incoming requests, tracking KV cache locations, and dispatching prefill/decode work to the right workers. Grove manages the Kubernetes workload lifecycle: starting pods in the right order, placing them on the right hardware, scaling them as a coordinated unit.

A production setup can run both. Grove manages the PodCliqueSet. Inside each prefill and decode pod, vLLM runs with Dynamo's routing layer on top. For the Dynamo-specific setup, see the NVIDIA Dynamo disaggregated inference guide.

Step-by-Step: Prefill-Decode-Disaggregated vLLM with Grove

Step 1: Provision Bare-Metal GPU Nodes on Spheron

For a minimal prefill-decode setup, you need at least two nodes. B200 SXM6 instances on Spheron are compute-dense and ideal for the prefill role. H200 SXM5 instances on Spheron provide 4.8 TB/s of HBM3e bandwidth, which decode needs more than raw TFLOPS.

Select bare metal (not containerized VMs) when provisioning. The NVIDIA DRA driver needs direct access to the GPU hardware topology for ClusterTopologyBinding CRD population.

For this guide: one B200 node for prefill, two H200 nodes for decode (maintaining the 1:2 ratio).

Step 2: Install Prerequisites

Install Kubernetes 1.33+, then the NVIDIA DRA driver, KAI Scheduler, and Grove in order:

bash
# DRA driver
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia/charts
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
  --namespace nvidia-dra \
  --create-namespace \
  --version 0.2.0

# KAI Scheduler
helm upgrade -i kai-scheduler \
  oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler \
  --namespace kai-system \
  --create-namespace \
  --version 0.2.0

# Grove CRDs + operator
kubectl apply -f https://github.com/NVIDIA/grove/releases/download/v0.1.0-alpha.1/grove-crds.yaml
helm repo add grove https://nvidia.github.io/grove
helm install grove grove/grove-operator \
  --namespace grove-system \
  --create-namespace \
  --version v0.1.0-alpha.1  # verify the latest alpha tag at https://github.com/NVIDIA/grove/releases

# Verify
kubectl get pods -n grove-system
kubectl get deviceclass
kubectl get podcliquesets -A

Step 3: Create ResourceClaimTemplates

Create one ResourceClaimTemplate for each GPU role. DRA uses CEL expressions to match devices against hardware attributes exposed by the NVIDIA DRA driver:

yaml
# b200-prefill-claim.yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: b200-prefill-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # resource.k8s.io/v1beta2 nests the device class and selectors under "exactly"
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].productName.startsWith("B200")
                && device.attributes["gpu.nvidia.com"].memory >= 193273528320
---
# h200-decode-claim.yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: h200-decode-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # Same v1beta2 request shape as the prefill template
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].productName.startsWith("H200")
                && device.attributes["gpu.nvidia.com"].memory >= 141733920768

bash
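# Create the namespace used throughout this guide (skip if it already exists)
kubectl create namespace inference
kubectl apply -f b200-prefill-claim.yaml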
kubectl apply -f h200-decode-claim.yaml
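
The memory thresholds in the two CEL expressions are whole GiB values written out in bytes. A quick shell check of that arithmetic follows; the helper name `gib_to_bytes` is made up for this sketch.

```shell
# Hypothetical helper: convert whole GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo "$(( $1 * 1024 * 1024 * 1024 ))"
}

gib_to_bytes 180   # 193273528320, the B200 threshold above
gib_to_bytes 132   # 141733920768, the H200 threshold above
```

If your nodes report slightly less usable memory than these thresholds, lower the GiB value and regenerate the byte count rather than editing the number by hand.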

Step 4: Write the PodCliqueSet Manifest

This manifest defines the full disaggregated topology with startup ordering and a 1:2 prefill:decode scaling ratio:

yaml
# grove-disaggregated-vllm.yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: vllm-disaggregated
  namespace: inference
spec:
  scaling:
    minReplicas: 1
    maxReplicas: 4
  scalingGroups:
  - name: prefill-decode-ratio
    cliques: [prefill-clique, decode-clique]
    replicas: [1, 2]
  podCliques:
  - name: router-clique
    replicas: 1
    containers:
    - name: dynamo-router
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
      args:
      - "--role"
      - "router"
      ports:
      - containerPort: 8000
  - name: prefill-clique
    startsAfter: router-clique
    replicas: 1
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: b200-prefill-claim
    containers:
    - name: vllm-prefill
      image: nvcr.io/nvidia/vllm:latest
      args:
      - "--model"
      - "meta-llama/Llama-4-Scout-17B-16E-Instruct"
      - "--disaggregation-mode"
      - "prefill"
      - "--kv-transfer-config"
      - '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
      resources:
        claims:
        - name: gpu
  - name: decode-clique
    startsAfter: prefill-clique
    replicas: 2
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h200-decode-claim
    containers:
    - name: vllm-decode
      image: nvcr.io/nvidia/vllm:latest
      args:
      - "--model"
      - "meta-llama/Llama-4-Scout-17B-16E-Instruct"
      - "--disaggregation-mode"
      - "decode"
      - "--kv-transfer-config"
      - '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
      resources:
        claims:
        - name: gpu

Key points in this manifest:

  • startsAfter on each PodClique: prefill-clique starts after router-clique, and decode-clique starts after prefill-clique. The router is reachable before workers initialize; prefill workers register with the router before decode workers come up.
  • scalingGroups: if you scale the set to replicas: 2, prefill goes from 1 to 2 pods and decode goes from 2 to 4 pods. The ratio is enforced automatically.
  • resourceClaims in each PodClique: DRA binds each pod to a physical GPU matching the CEL selector in the referenced claim template. Prefill pods get B200 GPUs; decode pods get H200 GPUs.
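
The startup order implied by the startsAfter fields can be sanity-checked with `tsort`, which prints a topological ordering of "A before B" pairs. This only illustrates the dependency chain in the manifest; it is not a claim about how Grove's operator computes the order internally, and the function name `startup_order` is hypothetical.

```shell
# Each input line to tsort is "A B", meaning A must come before B.
# The pairs below restate the startsAfter fields from the manifest.
startup_order() {
  printf '%s\n' \
    'router-clique prefill-clique' \
    'prefill-clique decode-clique' | tsort
}

startup_order
```

For this linear chain there is only one valid ordering: router-clique, then prefill-clique, then decode-clique. If you add cliques, `tsort` also reports a loop when the dependencies are cyclic.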

Router Service

Grove does not automatically create a Kubernetes Service for the router-clique pods. You need to create a ClusterIP Service that selects them by the labels Grove applies at runtime. Create this before applying the PodCliqueSet:

yaml
# vllm-router-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-disaggregated
  namespace: inference
spec:
  selector:
    grove.io/podcliquesets: vllm-disaggregated
    grove.io/podcliquesets-clique: router-clique
  ports:
  - port: 8000
    targetPort: 8000

bash
kubectl apply -f vllm-router-service.yaml

Step 5: Apply and Verify

bash
kubectl apply -f grove-disaggregated-vllm.yaml

# Watch PodCliqueSet transitions
kubectl get podcliquesets -n inference -w

# Confirm DRA allocated the claims
kubectl get resourceclaims -n inference

# Check each clique's pods
kubectl get pods -n inference -l grove.io/podcliquesets=vllm-disaggregated

# Logs from prefill workers
kubectl logs -n inference -l grove.io/podcliquesets=vllm-disaggregated,grove.io/podcliquesets-clique=prefill-clique
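
Rather than watching by eye, you can block until every pod in the set is Ready with `kubectl wait`. This is a sketch under assumptions: it reuses the label selector from the commands above, the 600-second timeout is an arbitrary choice, and the `KUBECTL` override and the `wait_for_cliques` name are conveniences invented for this example.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to print the command
# instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every pod carrying the PodCliqueSet label reports Ready.
# The 600s timeout is an arbitrary choice; large model loads can take longer.
wait_for_cliques() {
  "$KUBECTL" wait --for=condition=Ready pod \
    -l grove.io/podcliquesets=vllm-disaggregated \
    -n inference --timeout=600s
}
```

Call it as `wait_for_cliques && echo "all cliques Ready"`. A non-zero exit means at least one pod did not become Ready within the timeout.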

When all cliques reach Ready, test through the router:

bash
kubectl port-forward svc/vllm-disaggregated 8000:8000 -n inference &
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "prompt": "Explain disaggregated inference in one paragraph.",
    "max_tokens": 200
  }'

Pairing Grove with DRA and KAI Scheduler

Grove, DRA, and KAI Scheduler form a three-layer stack. Each layer handles a distinct concern:

DRA (Dynamic Resource Allocation): allocates GPU resources with structured constraints. Instead of counting whole GPUs via the legacy device plugin, DRA exposes GPU attributes (productName, memory, topology) as CEL-queryable device parameters. Grove's PodCliques reference ResourceClaimTemplates that use these attributes to select the right GPU tier for each role.

KAI Scheduler: enforces gang scheduling at the Kubernetes level. When Grove creates a PodGang for a PodCliqueSet, KAI Scheduler ensures all PodGang members schedule together or none do. KAI also manages fair-share queues across tenants, so a large-scale prefill-decode deployment from one team doesn't starve other workloads on the cluster.

Grove: declares the multi-component workload lifecycle above the scheduler. It creates PodGangs for KAI, watches PodClique readiness, enforces startup ordering, and coordinates scaling via PodCliqueScalingGroups. Grove is what turns three independent Kubernetes Deployments into a single managed resource.

| Layer | What it does | What it knows |
| --- | --- | --- |
| DRA driver | Allocates GPUs with structured constraints | Hardware attributes (productName, memory, topology) |
| KAI Scheduler | Gang scheduling, fair-share queues | PodGang membership, queue priorities |
| Grove operator | Workload lifecycle, startup order, scaling | PodClique roles, startup dependencies, scaling ratios |

The full DRA and KAI setup walkthrough, including node labels, ClusterTopologyBinding configuration, and MIG support, is in the Kubernetes GPU Orchestration guide and is not repeated here.

Scaling, Failure Recovery, and Cost Tuning

Scaling

PodCliqueScalingGroup maintains the prefill:decode ratio as load increases. When your serving metrics show decode pods at high request queue depth, you scale the PodCliqueSet's replicas. Grove applies the scaling group ratio: for a 1:2 group, each additional "replica unit" adds 1 prefill pod and 2 decode pods together.

Set minReplicas conservatively for baseline cost and maxReplicas based on your peak traffic estimate. KAI Scheduler's fair-share queue prevents a scaling burst from one PodCliqueSet from starving other workloads on the same cluster.

Failure Recovery

If a decode PodClique pod fails, Grove detects the divergence between desired and actual state and schedules a replacement. DRA ensures the replacement is placed on a node with a matching GPU (H200 with the right memory). The gang constraint applies to initial scheduling, not to replacement of a failed pod in an already-running set.

If no matching GPU is available for the replacement, the PodCliqueSet enters a degraded state. The still-running pods continue serving, but the set reports not-Ready until the replacement pod can be placed. Grove does not tear down the running pods to maintain "all or nothing" once the set is already up.
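
When a set looks degraded, two standard kubectl queries usually show why: the Pending pods in the set, and recent namespace events (unschedulable pods and unallocated claims surface there). The sketch below assumes the label and namespace used earlier in this guide; the `KUBECTL` override and the `show_degraded` name are hypothetical.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# List replacement pods that could not be placed, then recent events
# sorted oldest to newest.
show_degraded() {
  "$KUBECTL" get pods -n inference \
    -l grove.io/podcliquesets=vllm-disaggregated \
    --field-selector=status.phase=Pending
  "$KUBECTL" get events -n inference --sort-by=.lastTimestamp
}
```

An empty Pending list together with a not-Ready set suggests the problem lies in a running pod rather than in placement.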

Cost Tuning with Spheron Pricing

Right-sizing each pool independently is the main cost lever in a Grove deployment. Prefill needs raw compute (FP8 TFLOPS). Decode needs memory bandwidth (HBM TB/s). Matching GPU to bottleneck cuts cost per token significantly compared to a homogeneous cluster.

| GPU | Role | Spheron On-Demand | Why this role |
| --- | --- | --- | --- |
| B200 SXM6 | Prefill | $3.70/hr | High FP4/FP8 TFLOPS for compute-bound prefill |
| H200 SXM5 | Decode | $4.54/hr | 4.8 TB/s HBM3e for memory-bandwidth-bound decode |
| H100 SXM5 | Prefill (budget) | $2.54/hr | Strong prefill at lower cost than B200 |
| A100 80G SXM4 | Decode (budget) | $1.69/hr | Cost-effective decode for models under 70B |

A 1:2 prefill:decode setup with one B200 and two H200s runs at $3.70 + 2x$4.54 = $12.78/hr. For long-context workloads at high concurrency, this configuration typically outperforms three H100 SXM5 nodes at $7.62/hr by a wide margin on throughput per dollar, because prefill and decode stop competing for the same GPU.
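
The hourly totals can be reproduced with a few lines of shell. The prices are the on-demand figures quoted in the table above, hard-coded as of the article date, and `pool_cost` is a hypothetical helper name.

```shell
# Hourly cost of a two-tier pool. Prices are the on-demand figures quoted
# above; substitute current rates before relying on the result.
# Usage: pool_cost COUNT_A PRICE_A COUNT_B PRICE_B
pool_cost() {
  awk -v na="$1" -v pa="$2" -v nb="$3" -v pb="$4" \
    'BEGIN { printf "%.2f\n", na * pa + nb * pb }'
}

pool_cost 1 3.70 2 4.54   # one B200 + two H200
pool_cost 3 2.54 0 0      # three H100
```

The two calls print 12.78 and 7.62. Since 12.78 / 7.62 is about 1.68, the disaggregated setup has to deliver roughly 1.68 times the throughput of the three-H100 baseline to break even on throughput per dollar.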

Pricing fluctuates based on GPU availability. The prices above are as of 22 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Spot instances can cut prefill costs when your SLA tolerates occasional preemption. When spot pricing is below on-demand rates for your target GPU tier (availability-dependent and subject to change), running prefill on spot while decode stays on reserved or on-demand H200 lowers total cluster cost. See Spheron spot instances for current availability and preemption behavior before committing to a spot-based prefill pool.

What to Watch Out For

Version drift: Grove is alpha (v0.1.0-alpha.x at time of writing). The CRD API is grove.io/v1alpha1. Check https://github.com/NVIDIA/grove/releases before deploying to confirm the latest published alpha tag and update the install commands accordingly.

Bare metal, not VMs: ClusterTopologyBinding CRD population depends on the NVIDIA DRA driver reading full NVLink peer topology from the hardware. Most hypervisors filter this data. If you run Grove on VM-backed nodes, topology-aware placement degrades to best-effort based on node labels, which misses the NVLink locality that matters for NIXL KV transfer performance.

One claim template per role: reference exactly one ResourceClaimTemplate from each PodClique. Multiple claims from the same pod to different templates create ambiguous binding behavior in DRA beta.

Gang scheduling and resource saturation: gang scheduling guarantees atomicity but can also cause priority inversion. If your cluster is near capacity, a large PodCliqueSet may hold its gang pending while smaller, already-schedulable workloads wait. Set realistic maxReplicas and use KAI Scheduler's fair-share queues to reserve capacity for interactive workloads during burst scaling.


Grove's topology-aware orchestration runs on any bare-metal NVIDIA cluster - no hyperscaler required. Spheron's B200 and H200 bare-metal nodes give you the NVLink topology Grove needs for production-grade disaggregated serving.

Quick Setup Guide

  1. Provision bare-metal GPU nodes on Spheron

    Rent bare-metal B200 or H200 instances on Spheron for your prefill and decode pools. Select bare metal (not VMs) to ensure the NVIDIA DRA driver can read full NVLink topology. Plan for at least two nodes: one for the prefill PodClique and one for the decode PodClique.

  2. Install prerequisites: DRA driver, KAI Scheduler, and Grove

    Install Kubernetes 1.33+, the NVIDIA DRA driver (via Helm from nvidia/nvidia-dra-driver), KAI Scheduler (from oci://ghcr.io/kai-scheduler), and Grove (apply grove-crds.yaml then install grove/grove-operator). Verify with kubectl get deviceclass and kubectl get podcliquesets -A.

  3. Define ResourceClaimTemplates for each GPU role

    Create one ResourceClaimTemplate for prefill nodes (selecting B200 by productName and memory via CEL expression) and one for decode nodes. Each PodClique in your PodCliqueSet references the appropriate claim template so DRA binds pods to the correct GPU tier.

  4. Write the PodCliqueSet manifest for disaggregated vLLM

    Define a PodCliqueSet with three PodCliques: router-clique (dynamo-router image, no GPU resource claim, exposes port 8000 as the OpenAI-compatible endpoint that accepts requests and routes them to the worker pools), prefill-clique (vllm with --disaggregation-mode prefill, referencing the B200 claim), and decode-clique (vllm with --disaggregation-mode decode, referencing the H200 claim). Add startsAfter dependencies so that prefill-clique starts after router-clique and decode-clique starts after prefill-clique. Set a PodCliqueScalingGroup to maintain a 1:2 prefill:decode ratio.

  5. Apply the manifest and verify gang scheduling

    Run kubectl apply -f grove-disaggregated-vllm.yaml. Watch kubectl get podcliquesets -n inference -w to confirm all PodCliques transition to Ready together. Verify the ResourceClaims were allocated with kubectl get resourceclaims -n inference.

  6. Test end-to-end and tune scaling

    Port-forward the router service and send a test inference request via the OpenAI-compatible API. Monitor pod CPU and memory with kubectl top pods, and token throughput with Grove and vLLM metrics. Adjust the PodCliqueSet's minReplicas and maxReplicas to match your traffic profile.

Frequently Asked Questions

What is NVIDIA Grove?

Grove is NVIDIA's open-source Kubernetes API for managing multi-component inference workloads. It introduces CRDs like PodClique and PodCliqueSet that let you declare a full disaggregated inference topology - prefill pools, decode pools, and a router - as a single Kubernetes resource. Grove handles startup ordering, gang scheduling, and scaling across all components together.

What is a PodClique?

A PodClique is a Grove CRD that defines a group of pods sharing a specific role in a disaggregated inference workload - for example, all prefill workers or all decode workers. It specifies the container spec, GPU resource claims, and replica count for that role. Multiple PodCliques are composed into a PodCliqueSet, which orchestrates them together as a unit.

How is Grove different from NVIDIA Dynamo?

Dynamo is a distributed inference orchestration layer that runs above vLLM outside Kubernetes, using its own routing and worker management. Grove is a Kubernetes-native API: it works through standard CRDs, the Kubernetes scheduler (via KAI Scheduler for gang scheduling), and the NVIDIA DRA driver for GPU allocation. Grove declares what the workload should look like; the Kubernetes control plane enforces it. Dynamo manages the serving control plane at runtime. They are complementary: Grove handles Kubernetes-level lifecycle, Dynamo handles inference-level request routing.

Does Grove require DRA and KAI Scheduler?

Grove can work with the legacy device plugin for basic deployments, but its topology-aware placement and gang-scheduling capabilities require DRA (for structured GPU attributes) and KAI Scheduler (for gang scheduling and fair-share queues). The three components are designed as a stack: DRA exposes GPU topology, KAI Scheduler places pods with gang constraints, and Grove manages the multi-component workload lifecycle above that.

Can I run Grove on Spheron?

Yes. Grove runs on any Kubernetes cluster with NVIDIA GPUs. Spheron bare-metal B200 and H200 nodes expose full NVLink topology via the NVIDIA DRA driver, which Grove uses for topology-aware PodClique placement. Bare metal matters because hypervisors often filter NVLink peer topology data, which Grove's ClusterTopologyBinding CRD needs to make optimal placement decisions.