At KubeCon Europe 2026, NVIDIA donated its Dynamic Resource Allocation (DRA) driver to CNCF. That single event changes what Kubernetes GPU scheduling looks like for platform engineers building production AI inference clusters. If you have been running GPU workloads on Kubernetes using the NVIDIA device plugin, you are working with tooling that is nearly a decade old. DRA, combined with KAI Scheduler and Grove, replaces it with an API-driven resource model that actually reflects how modern GPUs work.
GPU scheduling on Kubernetes has been stuck at integer resource counts since the device plugin model shipped in 2017. You request nvidia.com/gpu: 1 and get a GPU. Which GPU, with what memory, on what topology? The scheduler does not know and cannot help. Teams compensate with node labels, tolerations, and hand-crafted affinity rules that turn cluster configuration into a pile of bespoke YAML. DRA replaces that with structured resource parameters the scheduler can actually reason about.
This guide covers the full stack: DRA for resource allocation, KAI Scheduler for multi-tenant priority scheduling, and Grove for declarative inference workload management. The step-by-step section walks through deploying all three on bare-metal GPU nodes.
The Old Way: GPU Device Plugins and Their Limits
The device plugin API works by advertising GPU resources as opaque integers on each node. A pod spec requesting a GPU looks like this:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

That is the entire interface. The scheduler sees a node has one GPU available and schedules the pod there. It has no visibility into GPU memory size, compute capability, NVLink connectivity, or MIG configuration. It cannot place a pod requiring 80GB VRAM onto an H100 80GB and away from an A100 40GB unless you add manual node labels and affinity rules yourself.
The practical problems this causes in production:
- No topology awareness. Multi-GPU training jobs that need NVLink between GPUs are placed via affinity heuristics, not scheduler logic.
- No MIG granularity. Serving a 7B model on a full H100 wastes 70GB of VRAM. MIG partitioning helps, but the device plugin treats each MIG instance as a separate opaque device, requiring separate node configurations and manual partition setup.
- No preemption. A low-priority batch job holding GPUs blocks high-priority inference workloads until the batch job finishes.
- No gang scheduling. A distributed training job with 8 pods can have 7 pods running and 1 pending, leaving 7 GPUs idle waiting for the 8th to schedule.
What teams actually need: fractional GPU allocation, topology-aware placement, priority queues with preemption, and gang scheduling for batch jobs. None of that is in the device plugin.
Dynamic Resource Allocation: How DRA Works
DRA is a Kubernetes API introduced in 1.26 (alpha), redesigned in 1.31, promoted to beta in 1.32 with the v1beta1 API, and updated to v1beta2 in 1.33. The core idea is that hardware drivers publish structured resource attributes, and pods express resource requirements as constraints against those attributes. The scheduler matches claims to nodes at allocation time rather than relying on opaque integer counts.
Three new resource types define the DRA model:
- DeviceClass - defines a category of hardware (e.g., "NVIDIA GPU") and which driver handles allocation.
- ResourceClaimTemplate - a pod-level template that generates a ResourceClaim for each pod.
- ResourceClaim - the actual allocation request, bound to a specific hardware instance by the driver.
A minimal ResourceClaimTemplate for an H100 GPU with DRA looks like this:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: h100-gpu-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: >
              device.attributes["gpu.nvidia.com"].productName.startsWith("H100")
              && device.attributes["gpu.nvidia.com"].memory >= 80737418240
```

The cel expression is evaluated against the structured parameters the NVIDIA DRA driver publishes for each GPU on the node. Memory is expressed in bytes. Compute capability, MIG profile availability, and NVLink peer topology are all queryable the same way.
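The scheduler evaluates that selector with a CEL engine against driver-published attributes. A rough Python equivalent makes the matching logic concrete; the attribute values here are hypothetical examples, not real driver output:

```python
# Rough Python equivalent of the CEL selector above, to show the matching
# logic. Attribute values below are hypothetical, not real driver output.

def matches_h100_80gb(attributes: dict) -> bool:
    """Mimics: productName.startsWith("H100") && memory >= 80737418240."""
    attrs = attributes.get("gpu.nvidia.com", {})
    return (
        attrs.get("productName", "").startswith("H100")
        and attrs.get("memory", 0) >= 80_737_418_240
    )

devices = [
    {"gpu.nvidia.com": {"productName": "H100 SXM5", "memory": 85_899_345_920}},
    {"gpu.nvidia.com": {"productName": "A100-SXM4-40GB", "memory": 42_949_672_960}},
]

eligible = [d for d in devices if matches_h100_80gb(d)]
print(len(eligible))  # only the H100 matches
```

The key difference from the device plugin model: the scheduler runs this comparison per device at allocation time, instead of seeing one opaque integer per node.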
For context on MIG-level scheduling in particular, the MIG and GPU sharing guide covers how MIG partitions work and why allocating at the partition level is more efficient than whole-GPU assignment for smaller models.
NVIDIA DRA Driver: From Proprietary to CNCF
At KubeCon Europe 2026, NVIDIA contributed the DRA driver to CNCF. The driver was already available as an open-source NVIDIA project before the donation, but CNCF governance changes the trajectory: neutral ownership, a proper vendor-neutral review process, and a clearer path for other GPU vendors to implement the same interface.
The driver does three things: it discovers GPU hardware on each node and publishes structured parameters to the Kubernetes API, it handles ResourceClaim allocation by binding claims to specific GPU instances, and it configures MIG partitions and NVLink fabric manager when a claim requires them.
Prerequisites before installing:
- Kubernetes 1.33 or later (v1beta2 DRA API required)
- NVIDIA drivers 565 or later on all GPU nodes
- Helm 3
- NVIDIA Container Toolkit installed on GPU nodes
Installation:
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the DRA driver into its own namespace
# Pin to a specific release tag; check https://github.com/kubernetes-sigs/nvidia-dra-driver-gpu for the latest
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
  --namespace nvidia-dra \
  --create-namespace \
  --version 0.2.0
```

After installation, verify the driver is running and the DeviceClass is registered:

```shell
kubectl get deviceclass
# NAME             DRIVER           AGE
# gpu.nvidia.com   gpu.nvidia.com   2m
```

A DeviceClass definition created by the driver looks like this:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```

The driver auto-creates this class on startup. You do not need to manage it manually.
KAI Scheduler: Priority-Based GPU Scheduling
The default Kubernetes scheduler handles general workloads well but was not designed for GPU batch jobs. KAI (Kubernetes AI Infrastructure) Scheduler fills that gap. It is an NVIDIA open-source project that runs as a secondary scheduler alongside kube-scheduler, handling GPU AI workloads specifically.
What KAI adds that kube-scheduler does not have:
- Gang scheduling. All pods in a job start together or none start. A distributed training job with 8 pods will not partially schedule and leave GPUs idle.
- Fair-share queuing. Each team gets a GPU quota. When one team is under quota, it gets priority over teams that are over quota. No team can starve another.
- Priority-based preemption. A high-priority inference workload can preempt lower-priority batch jobs, freeing GPUs without manual intervention.
- Bin-packing. KAI packs workloads onto the fewest nodes possible, leaving contiguous GPU blocks available for large multi-GPU jobs.
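The fair-share behavior can be sketched in a few lines. This is an illustrative model of the queueing rule, not KAI's actual implementation: the queue furthest below its deserved share gets the next free GPU, and no queue may exceed its burst limit.

```python
# Illustrative fair-share queue selection (not KAI's actual code):
# the queue with the lowest used/deserved ratio gets the next free GPU,
# and queues at their burst limit are skipped entirely.

def next_queue(queues):
    """queues maps name -> {"deserved": int, "max": int, "used": int}."""
    eligible = {n: q for n, q in queues.items() if q["used"] < q["max"]}
    if not eligible:
        return None
    # Lowest used/deserved ratio wins: under-quota teams go first.
    return min(eligible, key=lambda n: eligible[n]["used"] / eligible[n]["deserved"])

queues = {
    "inference-team-a": {"deserved": 8, "max": 16, "used": 10},  # over quota
    "research-team-b": {"deserved": 8, "max": 16, "used": 3},    # under quota
}
print(next_queue(queues))  # research-team-b, the team further below its share
```

This is why no team can starve another: a team burning past its deserved share keeps losing contention to under-quota teams until the ratios balance out.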
Queue definition for a team with a GPU quota:
```yaml
apiVersion: scheduling.kai.io/v1alpha1
kind: Queue
metadata:
  name: inference-team-a
spec:
  deservedGPUs: 8   # fair-share allocation for this team
  maxGPUs: 16       # burst limit
  priority: 100
```

Job referencing a queue:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-training-run
  namespace: team-a
  annotations:
    scheduling.kai.io/queue-name: inference-team-a
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      schedulerName: kai-scheduler  # direct to KAI instead of default scheduler
      restartPolicy: OnFailure
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: h100-gpu-claim
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.12-py3
        resources:
          claims:
          - name: gpu
```

In a multi-tenant H100 cluster shared by several teams, KAI prevents the scenario where one large training job monopolizes all GPUs for hours. The fair-share queue ensures other teams continue to get GPU access proportional to their quota. For more on managing GPU costs in shared cluster environments, the GPU cost optimization playbook covers the full range of cost reduction strategies.
Grove: Declarative Inference Workload API
Grove is NVIDIA's Kubernetes API for inference workloads. Where KAI handles scheduling, Grove manages the lifecycle of multi-pod GPU serving deployments: startup ordering, scaling, and gang-scheduling constraints for groups of pods with differentiated roles (prefill, decode, router).
Grove introduces five custom resources:
- PodClique - a group of pods sharing a specific role in the workload (e.g., prefill leader, decode workers). Each PodClique defines the container spec, resource claims, and role for a set of replicas.
- PodCliqueScalingGroup - a bundle of PodCliques that scale together, maintaining a fixed ratio between roles as replicas increase or decrease.
- PodCliqueSet - the top-level workload definition. Specifies startup ordering (so prefill pods initialize before decode pods), scaling policies, and gang-scheduling constraints that ensure the full set starts together or not at all.
- ClusterTopology - describes the physical topology of the cluster (NVLink connectivity, rack layout) so the Grove operator can make topology-aware placement decisions.
- PodGang - a set of pods that must be scheduled together as a group, enforcing the gang-scheduling constraint at the Kubernetes scheduler level.
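The fixed-ratio scaling of a PodCliqueScalingGroup is worth making concrete. A minimal sketch, not Grove's actual implementation: given the per-role replica counts at scale 1, every scale step multiplies each role uniformly, so the prefill-to-decode ratio never drifts.

```python
# Illustrative sketch of fixed-ratio scaling for a PodCliqueScalingGroup
# (not Grove's actual code): each role's replica count at scale 1 is
# multiplied uniformly, preserving the ratio between roles.

def scaled_replicas(base_ratio, scale):
    """base_ratio maps clique name -> replicas at scale factor 1."""
    return {name: count * scale for name, count in base_ratio.items()}

# A 1 prefill : 2 decode ratio, scaled to 3 units of capacity.
ratio = {"prefill": 1, "decode": 2}
print(scaled_replicas(ratio, 3))  # {'prefill': 3, 'decode': 6}
```

Scaling whole role-groups rather than individual Deployments is the point: an autoscaler that only grew the decode pool would eventually starve it of prefill output.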
A minimal PodCliqueSet for a disaggregated prefill-decode deployment:
```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: llama-4-scout-pcs
  namespace: inference
spec:
  startupOrder:
  - prefill-clique   # prefill pods must be ready before decode pods start
  - decode-clique
  scaling:
    minReplicas: 1
    maxReplicas: 8
  podCliques:
  - name: prefill-clique
    replicas: 1
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-gpu-claim
    containers:
    - name: prefill
      image: nvcr.io/nvidia/vllm:latest
      args: ["--role", "prefill"]
  - name: decode-clique
    replicas: 2
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-gpu-claim
    containers:
    - name: decode
      image: nvcr.io/nvidia/vllm:latest
      args: ["--role", "decode"]
```

The Kubernetes Gateway API Inference Extension can be used alongside Grove in the same stack: it dispatches requests based on KV cache affinity, routing repeated prompts to replicas with warmed caches. Grove manages the pod lifecycle and scaling; the extension handles the request routing layer on top. These are complementary technologies, not the same thing.
For disaggregated inference deployments that split prefill and decode across separate pools, see the llm-d Kubernetes disaggregated inference guide and the vLLM production deployment guide for the baseline serving configuration.
AI Cluster Runtime: Validated Recipes
Before deploying DRA, KAI, and Grove, the NVIDIA AI Cluster Runtime provides a validated baseline for the GPU node stack. It is a collection of Helm charts that install and configure the components the NVIDIA DRA driver depends on:
- Container runtime: NVIDIA Container Toolkit with CDI (Container Device Interface) support
- NVIDIA drivers: Operator-managed driver installation and lifecycle
- Fabric Manager: NVLink and NVSwitch initialization for multi-GPU nodes
- NCCL configuration: Network topology discovery and NCCL tuning for collective operations
Install the AI Cluster Runtime before the DRA driver. The DRA driver's fabric topology discovery depends on Fabric Manager being initialized, and the structured parameters for NVLink connectivity will be empty if Fabric Manager is not running.
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (includes driver, toolkit, device plugin for migration compatibility)
helm install nvidia-gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set fabricmanager.enabled=true
```

Step-by-Step: Setting Up DRA + Grove on a Bare-Metal GPU Cluster
This section walks through a full installation from provisioned GPU nodes to a running inference workload. Each step includes the exact commands.
Step 1: Provision a GPU Cluster
Start with bare-metal H100 nodes on Spheron. Bare metal matters here for a specific reason: the NVIDIA DRA driver reads GPU topology data from the hardware directly via the NVML library and the Fabric Manager API. On virtualized instances, the hypervisor may not pass through NVLink topology information, which means DRA's topology-aware scheduling operates on incomplete data.
On bare-metal nodes, the DRA driver sees the full NVLink peer graph, real MIG partition availability, and accurate compute capability attributes. The scheduler's placement decisions are based on actual hardware state, not what the hypervisor chooses to expose.
Step 2: Install Prerequisites
On each GPU node, before initializing Kubernetes:
```shell
# Install NVIDIA driver 565+
apt-get install -y nvidia-driver-565

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update && apt-get install -y nvidia-container-toolkit

# Configure containerd to use the NVIDIA runtime
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
```

Initialize the Kubernetes cluster with Kubernetes 1.33 or later:

```shell
kubeadm init --kubernetes-version=1.33.0
```

Step 3: Deploy the NVIDIA DRA Driver
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Pin to a specific release; check the repo for the latest
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
  --namespace nvidia-dra \
  --create-namespace \
  --version 0.2.0

# Verify the driver pods are running
kubectl get pods -n nvidia-dra

# Verify the DeviceClass is registered
kubectl get deviceclass
```

Step 4: Deploy KAI Scheduler
```shell
# Check https://github.com/NVIDIA/KAI-Scheduler for the latest release
helm upgrade -i kai-scheduler oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler \
  --namespace kai-system \
  --create-namespace \
  --version 0.2.0

# Verify KAI scheduler is running
kubectl get pods -n kai-system
```

Step 5: Deploy Grove
```shell
# Install Grove CRDs first
kubectl apply -f https://github.com/NVIDIA/grove/releases/download/v0.1.0/grove-crds.yaml

# Install the Grove operator; check https://github.com/NVIDIA/grove for the latest release
helm repo add grove https://nvidia.github.io/grove
helm repo update
helm install grove grove/grove-operator \
  --namespace grove-system \
  --create-namespace \
  --version 0.1.0

# Verify the operator is running
kubectl get pods -n grove-system
```

Step 6: Create a ResourceClaim
Define a ResourceClaimTemplate for an H100 GPU with DRA:
```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: h100-80gb-claim
  namespace: inference
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: >
              device.attributes["gpu.nvidia.com"].productName.startsWith("H100")
              && device.attributes["gpu.nvidia.com"].memory >= 80737418240
```

Apply it:

```shell
kubectl apply -f h100-claim.yaml
```

Step 7: Deploy a PodCliqueSet with Grove
```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: test-inference-pcs
  namespace: inference
spec:
  startupOrder:
  - inference-clique
  scaling:
    minReplicas: 1
    maxReplicas: 4
  podCliques:
  - name: inference-clique
    replicas: 1
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: h100-80gb-claim
    containers:
    - name: vllm-server
      image: nvcr.io/nvidia/vllm:latest
      args:
      - "--model"
      - "meta-llama/Llama-4-Scout-17B-16E-Instruct"
      - "--port"
      - "8000"
```

Apply it:

```shell
kubectl apply -f inference-pool.yaml
```

Step 8: Verify End-to-End
```shell
# Check the PodCliqueSet status
kubectl get podcliquesets -n inference

# Check the ResourceClaim was allocated
kubectl get resourceclaim -n inference

# Check pod logs
kubectl logs -n inference -l grove.io/podcliquesets=test-inference-pcs

# Send a test request through the service
kubectl port-forward svc/test-inference-pcs 8000:8000 -n inference &
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "prompt": "Hello", "max_tokens": 10}'
```

GPU Utilization Before and After DRA Migration
The device plugin model creates predictable fragmentation patterns. When the scheduler can only see integer GPU counts, it cannot pack workloads efficiently and cannot account for topology. The result is clusters with significant idle GPU capacity that the scheduler considers "in use."
| Metric | Device Plugin | DRA + KAI Scheduler |
|---|---|---|
| Scheduling granularity | Whole GPU only | GPU, MIG slice, or fraction |
| Topology awareness | None (manual labels required) | Native, via structured parameters |
| MIG support | Manual partition setup, opaque allocation | First-class: request a profile, driver configures it |
| Preemption | Not supported | Priority-based, KAI handles eviction |
| Multi-tenant fairness | None (first-come first-served) | Fair-share queues with guaranteed quotas |
Clusters running the device plugin often see 20-30% GPU idle time from fragmentation. A workload requesting 1 GPU on a node with 7 of 8 GPUs in use blocks that eighth GPU from being used for smaller workloads that do not need a full GPU. DRA-enabled bin-packing and MIG scheduling reduce this by allowing the scheduler to allocate at finer granularity, filling gaps that would otherwise sit empty.
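A back-of-the-envelope calculation with assumed numbers, not measured data, shows why fragmentation compounds: whole-GPU allocation strands capacity that MIG-granular allocation can fill.

```python
# Back-of-the-envelope fragmentation illustration with assumed numbers,
# not measured data: whole-GPU allocation strands capacity that
# MIG-granular allocation can fill.

H100_MIG_1G_SLICES = 7  # an H100 splits into up to seven 1g.10gb instances

def idle_slices_whole_gpu(gpus_total, gpus_used, small_jobs):
    """Whole-GPU model: each small job consumes a full GPU (or waits)."""
    free_gpus = gpus_total - gpus_used
    placed = min(small_jobs, free_gpus)  # one small job per free GPU
    # Every placed small job strands the rest of its GPU's capacity.
    return placed * (H100_MIG_1G_SLICES - 1)

def idle_slices_mig(gpus_total, gpus_used, small_jobs):
    """MIG model: small jobs pack into 1g.10gb slices on free GPUs."""
    free_slices = (gpus_total - gpus_used) * H100_MIG_1G_SLICES
    return max(free_slices - small_jobs, 0)

# 8-GPU node, 7 GPUs busy, 4 small serving jobs arrive.
print(idle_slices_whole_gpu(8, 7, 4))  # 6 slices stranded, only 1 job placed
print(idle_slices_mig(8, 7, 4))        # all 4 jobs placed, 3 slices spare
```

Under whole-GPU allocation the single free GPU serves one job and strands the rest of its capacity; with MIG slices all four jobs fit on the same GPU with room left over.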
Cost Impact: How Better Orchestration Reduces Your GPU Cloud Bill
Orchestration quality directly affects GPU rental spend. Better scheduling means fewer idle nodes, which means fewer billable GPU-hours for the same workload output. Three mechanisms drive the savings:
Bin-packing. KAI packs jobs onto the fewest nodes possible, leaving contiguous blocks free rather than spreading load across the cluster. Fewer nodes in active use means you can scale down idle nodes entirely.
Spot-aware scheduling. KAI supports preemption for spot instance reclamation. When a spot node is reclaimed by the provider, KAI reschedules the affected workloads to available on-demand nodes without manual intervention. This lets you run batch workloads on spot and automatically fail over, rather than provisioning all on-demand to avoid interruption.
MIG-level allocation. Serving a 7B model on a full H100 wastes most of the GPU's capacity. With DRA, you can request a 1g.10gb MIG profile and serve multiple models on a single H100, reducing per-model GPU cost significantly.
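The per-model arithmetic, using the on-demand H100 rate quoted in the pricing table below and assuming seven 1g.10gb slices per GPU:

```python
# Rough per-model cost comparison, assuming one model per 1g.10gb MIG
# slice (seven per H100) at the on-demand H100 SXM5 rate of $2.98/hr.

H100_HOURLY = 2.98
MODELS_PER_GPU_MIG = 7  # one model per 1g.10gb slice

whole_gpu_cost = H100_HOURLY                 # $/model-hour, whole GPU
mig_cost = H100_HOURLY / MODELS_PER_GPU_MIG  # $/model-hour, MIG slice

print(f"${whole_gpu_cost:.2f} vs ${mig_cost:.2f} per model-hour")
```

The sketch ignores the throughput difference between a slice and a full GPU, but for models that fit comfortably in 10GB, the per-model rate drops to roughly a seventh of the whole-GPU price.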
Current Spheron pricing for the GPUs most commonly used in Kubernetes inference clusters:
| GPU | On-Demand ($/hr) | Spot ($/hr) |
|---|---|---|
| H100 SXM5 | $2.98 | $0.80 |
| H200 SXM5 | $4.50 | $1.19 |
| A100 80G SXM4 | $1.64 | $0.45 |
| B200 SXM6 | N/A | $2.06 |
Pricing fluctuates with GPU availability. The rates above are current as of 06 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
A cluster running inference on spot H100s at $0.80/hr versus on-demand at $2.98/hr cuts the GPU line item by over 70% for workloads that tolerate interruption. KAI's preemption handling makes that interruption tolerance practical in production.
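The savings figure follows directly from the table's rates:

```python
# Spot-versus-on-demand savings, using the H100 rates from the table above.
on_demand, spot = 2.98, 0.80
savings = 1 - spot / on_demand
print(f"{savings:.0%}")  # roughly 73% off the GPU line item
```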
For broader GPU cloud cost strategies, see the serverless vs on-demand vs reserved GPU comparison for guidance on which pricing tier to use for each workload type. For provider pricing comparisons, the GPU cloud pricing comparison for 2026 covers the major options.
Getting Started with GPU Kubernetes on Spheron
Spheron provides bare-metal multi-GPU nodes designed for production AI workloads. The key properties that matter for DRA and KAI:
- No hypervisor layer. Full NVLink visibility means the DRA driver can publish accurate topology parameters. The scheduler sees the real hardware graph.
- RDMA-capable networking. InfiniBand or RoCE for inter-node GPU communication, which matters for multi-node training and disaggregated inference.
- Direct hardware access. Fabric Manager runs on the bare-metal OS, so NVSwitch initialization and peer-to-peer GPU memory mappings work as designed.
- Data center partners across multiple regions. Nodes available across multiple locations for latency-sensitive inference deployments.
Available GPU options for Kubernetes inference clusters:
- H100 SXM5 - highest compute density for large model inference
- H200 SXM5 - 141GB HBM3e for memory-intensive models
- A100 80G SXM4 - cost-effective for production inference at scale
See all GPU options and pricing →
DRA and Grove are most effective on bare-metal GPU nodes where the scheduler can see the full hardware topology. Spheron's H100, H200, and A100 instances give you direct hardware access with no hypervisor layer - deploy your GPU Kubernetes cluster today.
Rent H100 → | Rent H200 → | Rent A100 → | View all pricing →
