At KubeCon Europe 2026, NVIDIA donated its Dynamic Resource Allocation (DRA) driver to CNCF. That single event changes what Kubernetes GPU scheduling looks like for platform engineers building production AI inference clusters. If you have been running GPU workloads on Kubernetes using the NVIDIA device plugin, you are working with tooling that is nearly a decade old. DRA, combined with KAI Scheduler and Grove, replaces it with an API-driven resource model that actually reflects how modern GPUs work.
GPU scheduling on Kubernetes has been stuck at integer resource counts since the device plugin model shipped in 2017. You request nvidia.com/gpu: 1 and get a GPU. Which GPU, with what memory, on what topology? The scheduler does not know and cannot help. Teams compensate with node labels, tolerations, and hand-crafted affinity rules that turn cluster configuration into a pile of bespoke YAML. DRA replaces that with structured resource parameters the scheduler can actually reason about.
This guide covers the full stack: DRA for resource allocation, KAI Scheduler for multi-tenant priority scheduling, and Grove for declarative inference workload management. The step-by-step section walks through deploying all three on bare-metal GPU nodes.
The Old Way: GPU Device Plugins and Their Limits
The device plugin API works by advertising GPU resources as opaque integers on each node. A pod spec requesting a GPU looks like this:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

That is the entire interface. The scheduler sees a node has one GPU available and schedules the pod there. It has no visibility into GPU memory size, compute capability, NVLink connectivity, or MIG configuration. It cannot place a pod requiring 80GB VRAM onto an H100 80GB and away from an A100 40GB unless you add manual node labels and affinity rules yourself.
The practical problems this causes in production:
- No topology awareness. Multi-GPU training jobs that need NVLink between GPUs are placed via affinity heuristics, not scheduler logic.
- No MIG granularity. Serving a 7B model on a full H100 wastes 70GB of VRAM. MIG partitioning helps, but the device plugin treats each MIG instance as a separate opaque device, requiring separate node configurations and manual partition setup.
- No preemption. A low-priority batch job holding GPUs blocks high-priority inference workloads until the batch job finishes.
- No gang scheduling. A distributed training job with 8 pods can have 7 pods running and 1 pending, leaving 7 GPUs idle waiting for the 8th to schedule.
What teams actually need: fractional GPU allocation, topology-aware placement, priority queues with preemption, and gang scheduling for batch jobs. None of that is in the device plugin.
Dynamic Resource Allocation: How DRA Works
DRA is a Kubernetes API introduced in 1.26 (alpha), redesigned in 1.31, promoted to beta in 1.32 with the v1beta1 API, and updated to v1beta2 in 1.33. The core idea is that hardware drivers publish structured resource attributes, and pods express resource requirements as constraints against those attributes. The scheduler matches claims to nodes at allocation time rather than relying on opaque integer counts.
Three new resource types define the DRA model:
- DeviceClass - defines a category of hardware (e.g., "NVIDIA GPU") and which driver handles allocation.
- ResourceClaimTemplate - a pod-level template that generates a ResourceClaim for each pod.
- ResourceClaim - the actual allocation request, bound to a specific hardware instance by the driver.
A minimal ResourceClaimTemplate for an H100 GPU with DRA looks like this:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: h100-gpu-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: >
              device.attributes["gpu.nvidia.com"].productName.startsWith("H100")
              && device.attributes["gpu.nvidia.com"].memory >= 80737418240
```

The cel expression is evaluated against the structured parameters the NVIDIA DRA driver publishes for each GPU on the node. Memory is expressed in bytes. Compute capability, MIG profile availability, and NVLink peer topology are all queryable the same way.
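The scheduler evaluates that selector with a CEL engine against driver-published attributes. A rough Python equivalent makes the matching logic concrete; the attribute values here are hypothetical examples, not real driver output:

```python
# Rough Python equivalent of the CEL selector above, to show the matching
# logic. Attribute values below are hypothetical, not real driver output.

def matches_h100_80gb(attributes: dict) -> bool:
    """Mimics: productName.startsWith("H100") && memory >= 80737418240."""
    attrs = attributes.get("gpu.nvidia.com", {})
    return (
        attrs.get("productName", "").startswith("H100")
        and attrs.get("memory", 0) >= 80_737_418_240
    )

devices = [
    {"gpu.nvidia.com": {"productName": "H100 SXM5", "memory": 85_899_345_920}},
    {"gpu.nvidia.com": {"productName": "A100-SXM4-40GB", "memory": 42_949_672_960}},
]

eligible = [d for d in devices if matches_h100_80gb(d)]
print(len(eligible))  # only the H100 matches
```

The key difference from the device plugin model: the scheduler runs this comparison per device at allocation time, instead of seeing one opaque integer per node.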
For context on MIG-level scheduling in particular, the MIG and GPU sharing guide covers how MIG partitions work and why allocating at the partition level is more efficient than whole-GPU assignment for smaller models.
NVIDIA DRA Driver: From Proprietary to CNCF
At KubeCon Europe 2026, NVIDIA contributed the DRA driver to CNCF. The driver was already available as an open-source NVIDIA project before the donation, but CNCF governance changes the trajectory: neutral ownership, a proper vendor-neutral review process, and a clearer path for other GPU vendors to implement the same interface.
The driver does three things: it discovers GPU hardware on each node and publishes structured parameters to the Kubernetes API, it handles ResourceClaim allocation by binding claims to specific GPU instances, and it configures MIG partitions and NVLink fabric manager when a claim requires them.
Prerequisites before installing:
- Kubernetes 1.33 or later (v1beta2 DRA API required)
- NVIDIA drivers 565 or later on all GPU nodes
- Helm 3
- NVIDIA Container Toolkit installed on GPU nodes
Installation:
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the DRA driver into its own namespace
# Pin to a specific release tag; check https://github.com/kubernetes-sigs/nvidia-dra-driver-gpu for the latest
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
  --namespace nvidia-dra \
  --create-namespace \
  --version 0.2.0
```

After installation, verify the driver is running and the DeviceClass is registered:

```shell
kubectl get deviceclass
# NAME             DRIVER           AGE
# gpu.nvidia.com   gpu.nvidia.com   2m
```

A DeviceClass definition created by the driver looks like this:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```

The driver auto-creates this class on startup. You do not need to manage it manually.
KAI Scheduler: Priority-Based GPU Scheduling
The default Kubernetes scheduler handles general workloads well but was not designed for GPU batch jobs. KAI (Kubernetes AI Infrastructure) Scheduler fills that gap. It is an NVIDIA open-source project that runs as a secondary scheduler alongside kube-scheduler, handling GPU AI workloads specifically.
What KAI adds that kube-scheduler does not have:
- Gang scheduling. All pods in a job start together or none start. A distributed training job with 8 pods will not partially schedule and leave GPUs idle.
- Fair-share queuing. Each team gets a GPU quota. When one team is under quota, it gets priority over teams that are over quota. No team can starve another.
- Priority-based preemption. A high-priority inference workload can preempt lower-priority batch jobs, freeing GPUs without manual intervention.
- Bin-packing. KAI packs workloads onto the fewest nodes possible, leaving contiguous GPU blocks available for large multi-GPU jobs.
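The fair-share behavior can be sketched in a few lines. This is an illustrative model of the queueing rule, not KAI's actual implementation: the queue furthest below its deserved share gets the next free GPU, and no queue may exceed its burst limit.

```python
# Illustrative fair-share queue selection (not KAI's actual code):
# the queue with the lowest used/deserved ratio gets the next free GPU,
# and queues at their burst limit are skipped entirely.

def next_queue(queues):
    """queues maps name -> {"deserved": int, "max": int, "used": int}."""
    eligible = {n: q for n, q in queues.items() if q["used"] < q["max"]}
    if not eligible:
        return None
    # Lowest used/deserved ratio wins: under-quota teams go first.
    return min(eligible, key=lambda n: eligible[n]["used"] / eligible[n]["deserved"])

queues = {
    "inference-team-a": {"deserved": 8, "max": 16, "used": 10},  # over quota
    "research-team-b": {"deserved": 8, "max": 16, "used": 3},    # under quota
}
print(next_queue(queues))  # research-team-b, the team further below its share
```

This is why no team can starve another: a team burning past its deserved share keeps losing contention to under-quota teams until the ratios balance out.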
Queue definition for a team with a GPU quota:
```yaml
apiVersion: scheduling.kai.io/v1alpha1
kind: Queue
metadata:
  name: inference-team-a
spec:
  deservedGPUs: 8   # fair-share allocation for this team
  maxGPUs: 16       # burst limit
  priority: 100
```

Job referencing a queue:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-training-run
  namespace: team-a
  annotations:
    scheduling.kai.io/queue-name: inference-team-a
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      schedulerName: kai-scheduler  # direct to KAI instead of default scheduler
      restartPolicy: OnFailure
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: h100-gpu-claim
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.12-py3
        resources:
          claims:
          - name: gpu
```

In a multi-tenant H100 cluster shared by several teams, KAI prevents the scenario where one large training job monopolizes all GPUs for hours. The fair-share queue ensures other teams continue to get GPU access proportional to their quota. For more on managing GPU costs in shared cluster environments, the GPU cost optimization playbook covers the full range of cost reduction strategies.
Grove: Declarative Inference Workload API
Grove is NVIDIA's Kubernetes API for inference workloads. Where KAI handles scheduling, Grove manages the lifecycle of multi-pod GPU serving deployments: startup ordering, scaling, and gang-scheduling constraints for groups of pods with differentiated roles (prefill, decode, router).
Grove introduces five custom resources:
- PodClique - a group of pods sharing a specific role in the workload (e.g., prefill leader, decode workers). Each PodClique defines the container spec, resource claims, and role for a set of replicas.
- PodCliqueScalingGroup - a bundle of PodCliques that scale together, maintaining a fixed ratio between roles as replicas increase or decrease.
- PodCliqueSet - the top-level workload definition. Specifies startup ordering (so prefill pods initialize before decode pods), scaling policies, and gang-scheduling constraints that ensure the full set starts together or not at all.
- ClusterTopology - describes the physical topology of the cluster (NVLink connectivity, rack layout) so the Grove operator can make topology-aware placement decisions.
- PodGang - a set of pods that must be scheduled together as a group, enforcing the gang-scheduling constraint at the Kubernetes scheduler level.
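The fixed-ratio scaling of a PodCliqueScalingGroup is worth making concrete. A minimal sketch, not Grove's actual implementation: given the per-role replica counts at scale 1, every scale step multiplies each role uniformly, so the prefill-to-decode ratio never drifts.

```python
# Illustrative sketch of fixed-ratio scaling for a PodCliqueScalingGroup
# (not Grove's actual code): each role's replica count at scale 1 is
# multiplied uniformly, preserving the ratio between roles.

def scaled_replicas(base_ratio, scale):
    """base_ratio maps clique name -> replicas at scale factor 1."""
    return {name: count * scale for name, count in base_ratio.items()}

# A 1 prefill : 2 decode ratio, scaled to 3 units of capacity.
ratio = {"prefill": 1, "decode": 2}
print(scaled_replicas(ratio, 3))  # {'prefill': 3, 'decode': 6}
```

Scaling whole role-groups rather than individual Deployments is the point: an autoscaler that only grew the decode pool would eventually starve it of prefill output.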
A minimal PodCliqueSet for a disaggregated prefill-decode deployment:
```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: llama-4-scout-pcs
  namespace: inference
spec:
  startupOrder:
  - prefill-clique   # prefill pods must be ready before decode pods start
  - decode-clique
  scaling:
    minReplicas: 1
    maxReplicas: 8
  podCliques:
  - name: prefill-clique
    replicas: 1
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-gpu-claim
    containers:
    - name: prefill
      image: nvcr.io/nvidia/vllm:latest
      args: ["--role", "prefill"]
  - name: decode-clique
    replicas: 2
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-gpu-claim
    containers:
    - name: decode
      image: nvcr.io/nvidia/vllm:latest
      args: ["--role", "decode"]
```

The Kubernetes Gateway API Inference Extension can be used alongside Grove in the same stack: it dispatches requests based on KV cache affinity, routing repeated prompts to replicas with warmed caches. Grove manages the pod lifecycle and scaling; the extension handles the request routing layer on top. These are complementary technologies, not the same thing.
For disaggregated inference deployments that split prefill and decode across separate pools, see the llm-d Kubernetes disaggregated inference guide and the vLLM production deployment guide for the baseline serving configuration.
AI Cluster Runtime: Validated Recipes
Before deploying DRA, KAI, and Grove, the NVIDIA AI Cluster Runtime provides a validated baseline for the GPU node stack. It is a collection of Helm charts that install and configure the components the NVIDIA DRA driver depends on:
- Container runtime: NVIDIA Container Toolkit with CDI (Container Device Interface) support
- NVIDIA drivers: Operator-managed driver installation and lifecycle
- Fabric Manager: NVLink and NVSwitch initialization for multi-GPU nodes
- NCCL configuration: Network topology discovery and NCCL tuning for collective operations
Install the AI Cluster Runtime before the DRA driver. The DRA driver's fabric topology discovery depends on Fabric Manager being initialized, and the structured parameters for NVLink connectivity will be empty if Fabric Manager is not running.
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (includes driver, toolkit, device plugin for migration compatibility)
helm install nvidia-gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set fabricmanager.enabled=true
```

Step-by-Step: Setting Up DRA + Grove on a Bare-Metal GPU Cluster
This section walks through a full installation from provisioned GPU nodes to a running inference workload. Each step includes the exact commands.
Step 1: Provision a GPU Cluster
Start with bare-metal H100 nodes on Spheron. Bare metal matters here for a specific reason: the NVIDIA DRA driver reads GPU topology data from the hardware directly via the NVML library and the Fabric Manager API. On virtualized instances, the hypervisor may not pass through NVLink topology information, which means DRA's topology-aware scheduling operates on incomplete data.
On bare-metal nodes, the DRA driver sees the full NVLink peer graph, real MIG partition availability, and accurate compute capability attributes. The scheduler's placement decisions are based on actual hardware state, not what the hypervisor chooses to expose.
Step 2: Install Prerequisites
On each GPU node, before initializing Kubernetes:
```shell
# Install NVIDIA driver 565+
apt-get install -y nvidia-driver-565

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update && apt-get install -y nvidia-container-toolkit

# Configure containerd to use the NVIDIA runtime
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
```

Initialize the Kubernetes cluster with Kubernetes 1.33 or later:

```shell
kubeadm init --kubernetes-version=1.33.0
```

Step 3: Deploy the NVIDIA DRA Driver
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Pin to a specific release; check the repo for the latest
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
  --namespace nvidia-dra \
  --create-namespace \
  --version 0.2.0

# Verify the driver pods are running
kubectl get pods -n nvidia-dra

# Verify the DeviceClass is registered
kubectl get deviceclass
```

Step 4: Deploy KAI Scheduler
```shell
# Check https://github.com/NVIDIA/KAI-Scheduler for the latest release
helm upgrade -i kai-scheduler oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler \
  --namespace kai-system \
  --create-namespace \
  --version 0.2.0

# Verify KAI scheduler is running
kubectl get pods -n kai-system
```

Step 5: Deploy Grove
```shell
# Install Grove CRDs first
kubectl apply -f https://github.com/NVIDIA/grove/releases/download/v0.1.0/grove-crds.yaml

# Install the Grove operator; check https://github.com/NVIDIA/grove for the latest release
helm repo add grove https://nvidia.github.io/grove
helm repo update
helm install grove grove/grove-operator \
  --namespace grove-system \
  --create-namespace \
  --version 0.1.0

# Verify the operator is running
kubectl get pods -n grove-system
```

Step 6: Create a ResourceClaim
Define a ResourceClaimTemplate for an H100 GPU with DRA:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: h100-80gb-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: >
              device.attributes["gpu.nvidia.com"].productName.startsWith("H100")
              && device.attributes["gpu.nvidia.com"].memory >= 80737418240
```

Apply it:

```shell
kubectl apply -f h100-claim.yaml
```

Step 7: Deploy a PodCliqueSet with Grove
```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: test-inference-pcs
  namespace: inference
spec:
  startupOrder:
  - inference-clique
  scaling:
    minReplicas: 1
    maxReplicas: 4
  podCliques:
  - name: inference-clique
    replicas: 1
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-80gb-claim
    containers:
    - name: vllm-server
      image: nvcr.io/nvidia/vllm:latest
      args:
      - "--model"
      - "meta-llama/Llama-4-Scout-17B-16E-Instruct"
      - "--port"
      - "8000"
```

Apply it:

```shell
kubectl apply -f inference-pool.yaml
```

Step 8: Verify End-to-End
```shell
# Check the PodCliqueSet status
kubectl get podcliquesets -n inference

# Check the ResourceClaim was allocated
kubectl get resourceclaim -n inference

# Check pod logs
kubectl logs -n inference -l grove.io/podcliquesets=test-inference-pcs

# Send a test request through the service
kubectl port-forward svc/test-inference-pcs 8000:8000 -n inference &
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "prompt": "Hello", "max_tokens": 10}'
```

GPU Utilization Before and After DRA Migration
The device plugin model creates predictable fragmentation patterns. When the scheduler can only see integer GPU counts, it cannot pack workloads efficiently and cannot account for topology. The result is clusters with significant idle GPU capacity that the scheduler considers "in use."
| Metric | Device Plugin | DRA + KAI Scheduler |
|---|---|---|
| Scheduling granularity | Whole GPU only | GPU, MIG slice, or fraction |
| Topology awareness | None (manual labels required) | Native, via structured parameters |
| MIG support | Manual partition setup, opaque allocation | First-class: request a profile, driver configures it |
| Preemption | Not supported | Priority-based, KAI handles eviction |
| Multi-tenant fairness | None (first-come first-served) | Fair-share queues with guaranteed quotas |
Clusters running the device plugin often see 20-30% GPU idle time from fragmentation. A workload requesting 1 GPU on a node with 7 of 8 GPUs in use blocks that eighth GPU from being used for smaller workloads that do not need a full GPU. DRA-enabled bin-packing and MIG scheduling reduce this by allowing the scheduler to allocate at finer granularity, filling gaps that would otherwise sit empty.
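A back-of-the-envelope calculation with assumed numbers, not measured data, shows why fragmentation compounds: whole-GPU allocation strands capacity that MIG-granular allocation can fill.

```python
# Back-of-the-envelope fragmentation illustration with assumed numbers,
# not measured data: whole-GPU allocation strands capacity that
# MIG-granular allocation can fill.

H100_MIG_1G_SLICES = 7  # an H100 splits into up to seven 1g.10gb instances

def idle_slices_whole_gpu(gpus_total, gpus_used, small_jobs):
    """Whole-GPU model: each small job consumes a full GPU (or waits)."""
    free_gpus = gpus_total - gpus_used
    placed = min(small_jobs, free_gpus)  # one small job per free GPU
    # Every placed small job strands the rest of its GPU's capacity.
    return placed * (H100_MIG_1G_SLICES - 1)

def idle_slices_mig(gpus_total, gpus_used, small_jobs):
    """MIG model: small jobs pack into 1g.10gb slices on free GPUs."""
    free_slices = (gpus_total - gpus_used) * H100_MIG_1G_SLICES
    return max(free_slices - small_jobs, 0)

# 8-GPU node, 7 GPUs busy, 4 small serving jobs arrive.
print(idle_slices_whole_gpu(8, 7, 4))  # 6 slices stranded, only 1 job placed
print(idle_slices_mig(8, 7, 4))        # all 4 jobs placed, 3 slices spare
```

Under whole-GPU allocation the single free GPU serves one job and strands the rest of its capacity; with MIG slices all four jobs fit on the same GPU with room left over.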
Cost Impact: How Better Orchestration Reduces Your GPU Cloud Bill
Orchestration quality directly affects GPU rental spend. Better scheduling means fewer idle nodes, which means fewer billable GPU-hours for the same workload output. Three mechanisms drive the savings:
Bin-packing. KAI packs jobs onto the fewest nodes possible, leaving contiguous blocks free rather than spreading load across the cluster. Fewer nodes in active use means you can scale down idle nodes entirely.
Spot-aware scheduling. KAI supports preemption for spot instance reclamation. When a spot node is reclaimed by the provider, KAI reschedules the affected workloads to available on-demand nodes without manual intervention. This lets you run batch workloads on spot and automatically fail over, rather than provisioning all on-demand to avoid interruption.
MIG-level allocation. Serving a 7B model on a full H100 wastes most of the GPU's capacity. With DRA, you can request a 1g.10gb MIG profile and serve multiple models on a single H100, reducing per-model GPU cost significantly.
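The per-model arithmetic, using the on-demand H100 rate quoted in the pricing table below and assuming seven 1g.10gb slices per GPU:

```python
# Rough per-model cost comparison, assuming one model per 1g.10gb MIG
# slice (seven per H100) at the on-demand H100 SXM5 rate of $2.98/hr.

H100_HOURLY = 2.98
MODELS_PER_GPU_MIG = 7  # one model per 1g.10gb slice

whole_gpu_cost = H100_HOURLY                 # $/model-hour, whole GPU
mig_cost = H100_HOURLY / MODELS_PER_GPU_MIG  # $/model-hour, MIG slice

print(f"${whole_gpu_cost:.2f} vs ${mig_cost:.2f} per model-hour")
```

The sketch ignores the throughput difference between a slice and a full GPU, but for models that fit comfortably in 10GB, the per-model rate drops to roughly a seventh of the whole-GPU price.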
Current Spheron pricing for the GPUs most commonly used in Kubernetes inference clusters:
| GPU | On-Demand ($/hr) | Spot ($/hr) |
|---|---|---|
| H100 SXM5 | $2.98 | $0.80 |
| H200 SXM5 | $4.50 | $1.19 |
| A100 80G SXM4 | $1.64 | $0.45 |
| B200 SXM6 | N/A | $2.06 |
Pricing fluctuates with GPU availability. The rates above are current as of 06 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
A cluster running inference on spot H100s at $0.80/hr versus on-demand at $2.98/hr cuts the GPU line item by over 70% for workloads that tolerate interruption. KAI's preemption handling makes that interruption tolerance practical in production.
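The savings figure follows directly from the table's rates:

```python
# Spot-versus-on-demand savings, using the H100 rates from the table above.
on_demand, spot = 2.98, 0.80
savings = 1 - spot / on_demand
print(f"{savings:.0%}")  # roughly 73% off the GPU line item
```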
For broader GPU cloud cost strategies, see the serverless vs on-demand vs reserved GPU comparison for guidance on which pricing tier to use for each workload type. For provider pricing comparisons, the GPU cloud pricing comparison for 2026 covers the major options.
Getting Started with GPU Kubernetes on Spheron
Spheron provides bare-metal multi-GPU nodes designed for production AI workloads. The key properties that matter for DRA and KAI:
- No hypervisor layer. Full NVLink visibility means the DRA driver can publish accurate topology parameters. The scheduler sees the real hardware graph.
- RDMA-capable networking. InfiniBand or RoCE for inter-node GPU communication, which matters for multi-node training and disaggregated inference.
- Direct hardware access. Fabric Manager runs on the bare-metal OS, so NVSwitch initialization and peer-to-peer GPU memory mappings work as designed.
- Data center partners across multiple regions. Nodes available across multiple locations for latency-sensitive inference deployments.
Available GPU options for Kubernetes inference clusters:
- H100 SXM5 - highest compute density for large model inference
- H200 SXM5 - 141GB HBM3e for memory-intensive models
- A100 80G SXM4 - cost-effective for production inference at scale
See all GPU options and pricing →
DRA and Grove are most effective on bare-metal GPU nodes where the scheduler can see the full hardware topology. Spheron's H100, H200, and A100 instances give you direct hardware access with no hypervisor layer - deploy your GPU Kubernetes cluster today.
Rent H100 → | Rent H200 → | Rent A100 → | View all pricing →
