Since NVIDIA acquired Run:ai in 2024, the external implementation documentation has thinned considerably. Engineers searching for deployment guides and scheduler comparisons land on vendor marketing pages instead. This guide covers what you actually need: how to install Run:ai on a Kubernetes GPU cluster, configure multi-tenant quota projects, use fractional GPU sharing, run gang-scheduled distributed training jobs, and evaluate whether the licensing cost makes sense versus free alternatives.
For the broader Kubernetes GPU scheduling landscape, including DRA and KAI Scheduler, see the guide on Kubernetes GPU scheduling with DRA and KAI Scheduler. For HPC-style batch workloads, the Slurm for batch training workloads guide covers when Slurm wins over container orchestration. For inference-specific GPU sharing decisions, see the fractional GPU inference options guide.
What Run:ai Actually Does
Run:ai adds three distinct primitives on top of Kubernetes that vanilla Kubernetes lacks:
- The GPU-aware scheduler replaces the default Kubernetes scheduler for workloads routed to it. It understands GPU topology, fractional GPU units, quota allocation per project, and priority-based preemption. Standard kube-scheduler treats nvidia.com/gpu as an integer count and nothing more.
- The fractional GPU runtime is a software-level time-slicing mechanism. Workloads declare fractional GPU units (e.g., 0.5 GPU) and the runtime shares the physical GPU device context between them. This is distinct from NVIDIA MIG, which partitions silicon at the hardware level.
- The quota and project system tracks deservedGpus (guaranteed allocation per team), over-quota weight (priority for borrowing idle capacity), and a department hierarchy for organizational grouping. Standard Kubernetes ResourceQuota objects have no concept of borrowing or returning capacity.
| Primitive | What it replaces | Kubernetes equivalent |
|---|---|---|
| GPU scheduler | kube-scheduler | kube-scheduler (no GPU awareness) |
| Fractional GPU | CUDA time-slicing | None - requires device plugin tricks |
| Quota system | ResourceQuota | ResourceQuota (no borrow/return) |
Understanding these three things separately matters. Teams sometimes deploy Run:ai just for fractional GPU sharing or just for quota management and get confused when the other primitives behave unexpectedly.
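For contrast, here is the entire GPU vocabulary that vanilla kube-scheduler understands: a minimal pod spec (name and image placement are illustrative) requesting whole GPUs as an integer count, with no fractions, no quota borrowing, and no preemption ordering.

```yaml
# Minimal vanilla Kubernetes pod (illustrative name). kube-scheduler sees
# nvidia.com/gpu only as a whole-number count exposed by the device plugin;
# fractional requests like "0.5" are rejected at admission.
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-example   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.12-py3
    resources:
      limits:
        nvidia.com/gpu: "1"   # integers only without a GPU-aware scheduler
```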
Run:ai Architecture Deep-Dive
Run:ai has three deployable components:
Run:ai control plane can run as SaaS (hosted at app.run.ai) or self-hosted. It manages users, projects, departments, and quota definitions. It stores queue state and job history. The control plane does not touch your GPU nodes directly; it communicates with the cluster engine over a secure channel.
Run:ai cluster engine is deployed per cluster via Helm. It registers as a custom scheduler named runai-scheduler alongside the default kube-scheduler. It runs the allocation engine, reports GPU utilization metrics back to the control plane, and handles the local scheduling decisions. One cluster engine per Kubernetes cluster.
Run:ai scheduler plugin is the actual scheduling decision loop inside the cluster engine. It implements bin-packing for GPU fragmentation reduction, over-quota borrowing logic, and preemption ordering. When a workload is submitted with schedulerName: runai-scheduler, this plugin intercepts the pod placement request.
The request flow:

```text
researcher submits RunaiJob
-> cluster engine intercepts via webhook
-> quota check: does the project have deservedGpus available?
-> if yes: scheduler plugin places pod onto node with matching GPU
-> if no: check over-quota borrowing weight vs idle projects
-> if borrowable: place with preemptible flag
-> if not: queue the workload
-> DCGM exporter reports GPU utilization back to control plane
```

This routing works by setting schedulerName: runai-scheduler on every workload Run:ai manages. Vanilla Kubernetes pods that do not specify schedulerName continue to use kube-scheduler. Both schedulers coexist; Run:ai only manages workloads explicitly routed to it.
Installing Run:ai on a GPU Cloud Kubernetes Cluster
Before installing Run:ai, your cluster needs:
- Kubernetes 1.26+ (1.32+ recommended for DRA compatibility)
- NVIDIA GPU Operator with DCGM exporter enabled
- Helm 3.x
- A Run:ai account at app.run.ai and an NGC API key (the Helm registry at helm.ngc.nvidia.com/nvidia/runai is gated behind NGC authentication)
Install the Run:ai control plane (self-hosted option):
```bash
# Requires Run:ai account and NGC API key (app.run.ai)
helm repo add runai https://helm.ngc.nvidia.com/nvidia/runai --force-update \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update

# Install control plane into the runai-backend namespace
helm upgrade -i runai-backend -n runai-backend runai/control-plane \
  --set global.domain=<DOMAIN> \
  --set global.ingress.ingressClass=haproxy \
  --set tenantsManager.config.adminUsername=<ADMIN_EMAIL> \
  --set tenantsManager.config.adminPassword="<ADMIN_PASSWORD>" \
  --create-namespace

# Verify
kubectl get pods -n runai-backend
```

Install the Run:ai cluster engine on each GPU cluster:
```bash
# Use the same repo added above (helm.ngc.nvidia.com/nvidia/runai)
# Generate cluster UUID and client secret from app.run.ai -> Clusters -> New Cluster
helm upgrade -i runai-cluster runai/runai-cluster -n runai \
  --set controlPlane.url=https://your-cp.run.ai \
  --set controlPlane.clientSecret=<CLIENT_SECRET> \
  --set cluster.uid=<CLUSTER_UUID> \
  --set cluster.url=https://your-cluster.run.ai \
  --version="<VERSION>" \
  --create-namespace
```

The domain and admin credentials are set via --set flags or a values file generated from the Run:ai SaaS portal under Clusters -> Installation. The portal also provides the cluster UUID and client secret needed for the cluster engine installation step.
Common install errors:
| Error | Likely cause | Fix |
|---|---|---|
| CrashLoopBackOff on cluster engine | Control plane URL unreachable | Check network egress from cluster to app.run.ai or your self-hosted CP URL |
| Certificate validation failure | Clock skew between CP and cluster | Sync NTP on control plane and worker nodes |
| GPU Operator conflict | GPU Operator version mismatch | Check Run:ai release notes for supported GPU Operator versions |
| Pending pods after install | Node selector not matching | Verify nvidia.com/gpu resource is visible on GPU nodes: kubectl describe node <node> |
Configuring Projects, Departments, and Over-Quota Borrowing
Run:ai's multi-tenancy model is built on projects and departments, with two quota knobs per project:
- Projects map 1:1 to Kubernetes namespaces. Each project has a deservedGpus (guaranteed) quota and an overQuotaWeight.
- Departments group projects into an organizational hierarchy. A department sets a ceiling on total GPU usage across its projects.
- deservedGpus is the allocation that Run:ai will never preempt. If a project's workload is within its deservedGpus, no other project can displace it.
- overQuotaWeight determines how idle capacity is distributed. A project with weight 3 gets three times as much over-quota capacity as a project with weight 1 when both are competing for free GPUs.
Worked example with three teams on a 16-GPU cluster:
| Project | deservedGpus | over-quota weight | Proportional over-quota share (all competing) |
|---|---|---|---|
| team-training | 8 | 3 | 60% of idle capacity (~9.6 GPUs) |
| team-inference | 4 | 1 | 20% of idle capacity (~3.2 GPUs) |
| team-research | 2 | 1 | 20% of idle capacity (~3.2 GPUs) |
The proportional share column shows how idle capacity is split when all three projects are simultaneously competing for over-quota GPUs. With weights 3:1:1 (total 5), team-training gets 3/5 = 60%, each of the others gets 1/5 = 20%. When only one project is active and the other two are fully idle, that project can borrow up to all 16 GPUs. When team-inference submits work, Run:ai preempts team-training's over-quota workloads and returns 4 GPUs.
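The split follows directly from the weights; as a one-line formula restating the table's numbers:

```latex
% Over-quota share for project i with weight w_i (all projects competing)
\text{share}_i = \frac{w_i}{\sum_j w_j}
\qquad
\text{team-training: } \tfrac{3}{3+1+1} = 0.6, \qquad
\text{team-inference, team-research: } \tfrac{1}{5} = 0.2
```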
Project CRD (kubectl apply):

```yaml
apiVersion: run.ai/v1
kind: Project
metadata:
  name: team-training
  namespace: runai
spec:
  deservedGpus: 8
  overQuotaWeight: 3
```

Preemption order: Run:ai preempts over-quota workloads first, ordered by reverse submission time (the newest submitted workload is preempted first). Workloads within the guaranteed deservedGpus are never preempted unless the entire cluster runs out of capacity. This means stateless development workloads are safe to mark as low-priority over-quota consumers, while production inference endpoints should stay within their guaranteed quota to avoid mid-request eviction.
Fractional GPU Sharing: How Run:ai Differs from MIG, MPS, and vGPU
Run:ai fractional GPU sharing is a software-level mechanism. The GPU device is shared via time-slicing at the CUDA scheduler level. A workload requesting 0.5 GPU gets scheduled on a physical GPU alongside another 0.5 GPU workload. The GPU executes their kernels in alternating time slices. Memory is not partitioned; both workloads share the same GPU memory address space without enforcement of per-workload limits. This means an out-of-memory error in one fractional workload can terminate all co-located workloads sharing that physical GPU. For production inference with strict SLAs, prefer MIG or dedicated GPUs over fractional sharing.
For the MIG and time-slicing comparison in more detail, see the MIG and time-slicing guide. For production inference decisions involving vGPU, see the vGPU and MPS right-sizing guide.
| Method | Isolation | GPU model requirement | Memory limits enforced | Performance overhead |
|---|---|---|---|---|
| Run:ai fractional | Soft (time-sharing) | Any CUDA GPU | No (best-effort) | Low - software scheduling |
| NVIDIA MIG | Hard (silicon partition) | A30, A100, H100, H200, H20, B200, GB200, RTX PRO 5000/6000 (datacenter only) | Yes (per-slice) | None - hardware isolation |
| NVIDIA MPS | Soft (context sharing) | Any CUDA GPU | No | Near-zero for trusted workloads |
| NVIDIA vGPU | Hard (virtualization) | Licensed hardware | Yes | 5-15% overhead |
Note on MIG availability: MIG is supported on Ampere (A30, A100), Hopper (H100, H200, H20), and Blackwell (B200, GB200, RTX PRO 5000/6000) datacenter GPUs. It is not supported on consumer cards (RTX 4090, RTX 5090) or pre-Ampere generations. For GPUs without MIG support, hardware isolation requires vGPU, and software isolation options are MPS or Run:ai fractional.
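If a workload needs the hard-isolation row above, MIG is configured outside Run:ai, through the NVIDIA GPU Operator. Below is a minimal sketch of the relevant Helm values, assuming a recent gpu-operator chart; verify the value names against the chart version you deploy, and treat the profile named in the comment as an example only.

```yaml
# Sketch of NVIDIA GPU Operator Helm values enabling MIG (assumed chart layout).
# The per-node MIG layout is then selected by labeling nodes with
# nvidia.com/mig.config, e.g. all-1g.10gb on 80GB-class GPUs (example profile).
mig:
  strategy: single     # expose each MIG slice as its own nvidia.com/gpu resource
migManager:
  enabled: true        # reconfigures MIG geometry from the node label above
```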
When to choose each approach:
- Use MIG when workloads need guaranteed memory isolation (inference serving with strict SLAs, compliance requirements)
- Use Run:ai fractional when running many small development or research jobs that share a cluster loosely and can tolerate variable performance
- Use MPS for MPI/HPC workloads running a single trusted application context where near-zero overhead matters
RunaiJob fractional GPU request:

```yaml
apiVersion: run.ai/v1
kind: RunaiJob
metadata:
  name: small-inference
  namespace: team-inference
spec:
  template:
    spec:
      schedulerName: runai-scheduler
      containers:
      - name: inference
        image: nvcr.io/nvidia/pytorch:24.12-py3
        resources:
          limits:
            nvidia.com/gpu: "0.5"
```

One important caveat: over-quota workloads on fractional GPUs are preemptible. If a fractional inference workload is running on borrowed quota and another project submits a deserved-quota job, Run:ai will preempt the fractional workload mid-execution. For inference endpoints, mark them as interactive workloads (highest priority, non-preemptible) to prevent this.
Gang Scheduling for Distributed Training (FSDP, DeepSpeed, Megatron)
Distributed training requires all ranks to start simultaneously. A PyTorch dist.barrier() call blocks indefinitely if one rank never arrives because its pod got stuck pending. This is the gang scheduling problem.
For the distributed training setup itself, see the guide on distributed LLM training with FSDP and DeepSpeed.
Kubernetes default behavior schedules pods independently. In a 4-node training job, 3 pods may start on available nodes while the fourth waits for a free node. The result: 3 pods block at dist.barrier(), the fourth never starts, and you pay for 3 nodes worth of GPU time doing nothing.
Run:ai's gang scheduling holds all pod placement decisions for a job until GPUs are available on all required nodes simultaneously, then places them atomically. No partial starts.
Full RunaiJob manifest for a 4-node FSDP training job:
```yaml
apiVersion: run.ai/v1
kind: RunaiJob
metadata:
  name: fsdp-training
  namespace: team-training
spec:
  template:
    spec:
      schedulerName: runai-scheduler
      distributedTraining:
        workers:
          replicas: 4
          template:
            spec:
              containers:
              - name: trainer
                image: nvcr.io/nvidia/pytorch:24.12-py3
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                command: [torchrun]
                args:
                - --nnodes=4
                - --nproc_per_node=8
                - train.py
```

Run:ai sets MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables automatically. No manual scontrol or MPI hostfile generation is needed, unlike Slurm. For the NCCL tuning that applies regardless of which scheduler you use, see NCCL environment variables for multi-GPU training.
Inference Workload Patterns: Queue Priority, Preemption, and Node Pools
Inference services need to run continuously, not as batch jobs. But they compete with training jobs for GPUs on shared clusters. Run:ai handles this through workload types with different preemption policies:
- Interactive workloads (highest priority, not preemptible): for long-running inference endpoints. An interactive workload within its guaranteed quota will not be evicted.
- Train workloads (preemptible): for batch training runs. Can be preempted when higher-priority workloads need GPUs.
- Build workloads (lowest priority, preemptible): for interactive development sessions and notebooks.
Default preemption order (highest to lowest protection):
| Workload type | Within guaranteed quota | Over quota |
|---|---|---|
| Interactive | Never preempted | Preemptible |
| Train | Never preempted | Preemptible |
| Build | Never preempted | Preemptible (first) |
Node pools let you label specific nodes for specific workload types. Training jobs never land on inference-designated nodes, even during a GPU shortage, so a training surge cannot evict live inference endpoints.
Label nodes for a pool:

```bash
kubectl label node <gpu-node-1> runai/node-pool=inference-pool
kubectl label node <gpu-node-2> runai/node-pool=inference-pool
```

Bind a project to a node pool:
```yaml
apiVersion: run.ai/v1
kind: Project
metadata:
  name: team-inference
  namespace: runai
spec:
  deservedGpus: 4
  overQuotaWeight: 1
  nodePools:
  - name: inference-pool
    deservedGpus: 4
```

With this configuration, team-inference workloads will only be placed on nodes in inference-pool. Training jobs from team-training cannot consume those nodes.
Licensing Math: When Run:ai Pays Off vs Kueue, Volcano, and KAI Scheduler
This is the question most engineers actually want answered. For the full comparison of Kueue and KAI Scheduler, see the KAI Scheduler and Kueue on Kubernetes guide.
Feature comparison:
| Feature | Run:ai | Kueue | KAI Scheduler | Volcano |
|---|---|---|---|---|
| License | Commercial (NVIDIA AI Enterprise) | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Fractional GPU sharing | Yes (software time-slicing) | No | No | No |
| Over-quota borrowing | Yes | Yes (cohort borrowing) | Yes | Partial |
| Gang scheduling | Yes | Yes | Yes | Yes |
| Multi-tenant quota hierarchy | Yes (Projects + Departments) | Yes (ClusterQueues + LocalQueues) | Yes | Yes |
| DCGM metric integration | Native | Manual | Native | Manual |
| Run:ai UI dashboard | Yes | No (Grafana only) | No | No |
| Support SLA | Commercial | Community | Community + NVIDIA | Community |
KAI Scheduler and Run:ai share the same scheduling core. NVIDIA open-sourced the KAI Scheduler under Apache 2.0 in April 2025. KAI originated inside the Run:ai platform and was released as a standalone project. When comparing Run:ai against KAI Scheduler, you are not evaluating two unrelated tools. The commercial Run:ai platform adds a management UI, multi-cluster management, commercial support SLA, and the fractional GPU runtime on top of the same scheduling core that KAI now exposes as open source. The practical decision is whether you need fractional GPU sharing (no open-source equivalent in KAI or Kueue), the Run:ai management UI, or a commercial support contract. The scheduler core itself is now open source either way.
Break-even analysis with live H100 pricing:
Run:ai licensing runs approximately $2,500-$5,000 per GPU per year (NVIDIA AI Enterprise pricing, varies by contract). H100 SXM5 on Spheron is available at $1.69/hr spot and $4.00/hr on-demand. At continuous use, the annual cost per GPU is roughly $14,804 at spot or $35,040 at on-demand. On a 16-GPU cluster, that is approximately $236,864/year at spot pricing or $560,640/year at on-demand pricing.
If Run:ai's fractional scheduling and over-quota borrowing recover 20% GPU utilization on a 16-GPU cluster, that recovers 3.2 GPU-years of capacity. At spot rates that is worth roughly $47,373 annually; at on-demand rates, roughly $112,128. The licensing cost for 16 GPUs at $2,500-$5,000/GPU/year runs $40,000-$80,000. At spot pricing and the lower licensing tier the math is close; at on-demand rates it pays off more clearly. At the upper licensing end with spot-only workloads, Kueue's cohort borrowing covers most of the same territory for free.
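As a compact restatement of that arithmetic (16 GPUs, 20% utilization recovered, 8,760 hours per year, and the spot and on-demand rates quoted above):

```latex
% Value of recovered capacity on a 16-GPU cluster, to compare against licensing
\text{recovered value} = N_{\text{GPU}} \times \Delta u \times p_{\text{hr}} \times 8760
\quad\Rightarrow\quad
16 \times 0.20 \times \$1.69 \times 8760 \approx \$47{,}373 \text{ (spot)},\qquad
16 \times 0.20 \times \$4.00 \times 8760 = \$112{,}128 \text{ (on-demand)}
```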
To state it plainly: for clusters under 16 GPUs or utilization below 60%, Kueue or KAI Scheduler is sufficient and free. Run:ai's additional value is fractional GPU sharing (no open-source equivalent) and the commercial support SLA.
Pricing fluctuates based on GPU availability. The prices above are based on 13 May 2026 and may have changed. Check current GPU pricing → for live rates.
Run:ai on Bare-Metal GPU Cloud (Spheron) vs Hyperscaler-Managed Kubernetes
Run:ai's design assumptions fit bare-metal clusters better than hyperscaler-managed Kubernetes. The differences are architectural, not cosmetic.
Why Run:ai works better on bare metal:
Run:ai's quota system assumes you own the GPU capacity. On GKE, EKS, or AKS, the cloud autoscaler also makes placement decisions. When the cloud autoscaler tries to scale down a node that Run:ai has reserved for gang scheduling, you get placement conflicts. Run:ai's node pool definitions fight the cloud autoscaler's bin-packing logic. On bare metal with reserved instances, there is no competing autoscaler.
Run:ai's topology-aware placement is most effective when it can see actual hardware topology. On cloud VMs, NVLink connectivity and InfiniBand fabric details are sometimes hidden behind the hypervisor. On Spheron bare-metal nodes, the Run:ai scheduler sees the full PCIe bus topology, actual NVLink peer graphs, and real IB port assignments. Placement decisions reflect actual hardware state.
Spot eviction on hyperscaler clouds is unpredictable. Run:ai's queue state lives in etcd on the control plane. A mid-job spot eviction on AWS or GCP drops a node that Run:ai expects to be reserved, causing gang scheduling holds to deadlock. Spheron's reserved instances do not get arbitrarily evicted.
Teams running large-scale training on Run:ai typically start with a dedicated H100 cluster. See H100 on Spheron for current bare-metal H100 availability and pricing.
To compare GPU pricing across instance types before committing to a cluster size, the pricing page shows live per-hour rates for H100, A100, H200, and B200 with on-demand and spot options.
Hyperscaler comparison:
| Factor | Spheron bare metal | GKE/EKS/AKS |
|---|---|---|
| Run:ai autoscaler conflict | None (reserved nodes, no cloud autoscaler) | High risk on spot/preemptible nodes |
| GPU topology visibility | Full (PCIe, NVLink, IB) | Partial (hypervisor-filtered) |
| Node eviction risk | Low (reserved commitments) | High on spot |
| Run:ai control plane latency | Low (same DC) | Variable (cross-region possible) |
| Cost for persistent queue state | Low (etcd on same nodes) | Cloud etcd surcharges |
Migration Path: Moving from Plain Kubernetes + KEDA to Run:ai
For teams coming from reactive autoscaling setups, see the KEDA and Knative autoscaling guide as the baseline. Run:ai and KEDA serve different purposes and can coexist during migration.
Step-by-step approach:
1. Install Run:ai alongside the existing kube-scheduler. They coexist. Do not remove KEDA yet. Run:ai only manages workloads that explicitly set schedulerName: runai-scheduler.
2. Label a test namespace to route through Run:ai. Set runai-job: "true" on a non-production namespace to route only those workloads through the Run:ai scheduler. Existing workloads are unaffected.
3. Migrate batch training jobs first. Change schedulerName: default-scheduler to schedulerName: runai-scheduler in training job specs (see the sketch below). Create a matching Run:ai project for the namespace.
4. Create Run:ai projects matching existing namespaces. Set deservedGpus equal to current ResourceQuota limits. This makes the initial state identical to before, and you can tune quota from there.
5. Remove KEDA ScaledJobs for workloads now managed by Run:ai. Keep KEDA for HTTP-triggered inference autoscaling where it still applies. Run:ai handles queue-based training scheduling; KEDA handles HTTP-driven scale-to-zero. They cover different dimensions.
6. Migrate inference workloads to Run:ai interactive workloads. This gives better preemption control for production endpoints than KEDA-managed Deployments.

Existing workloads (vanilla Deployments, Jobs) that do not set schedulerName continue to use kube-scheduler with no changes. No mass rewrite is required.
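A minimal sketch of the step-3 change, assuming a plain batch/v1 training Job (the name, namespace, and GPU count are illustrative); the only Run:ai-specific line is schedulerName:

```yaml
# Hypothetical existing training Job; switching schedulerName routes it through
# the Run:ai scheduler. Everything else stays as it was under kube-scheduler.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-finetune              # illustrative name
  namespace: team-training
spec:
  template:
    spec:
      schedulerName: runai-scheduler  # was: default-scheduler (or unset)
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.12-py3
        resources:
          limits:
            nvidia.com/gpu: "2"
```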
Monitoring and Cost Attribution with Run:ai Metrics + DCGM Exporter
Run:ai exposes Prometheus metrics at the cluster engine's /metrics endpoint. Key metrics for operational dashboards:
- runai_gpu_utilization_per_project - GPU utilization broken down by project, useful for chargebacks
- runai_allocated_gpus - currently allocated GPUs per project
- runai_pending_other_in_queue - queue backlog, a proxy for scheduling latency
- runai_job_wait_time_seconds - time a job spends waiting before scheduling begins
For the GPU monitoring stack itself, see the GPU monitoring and observability guide.
PromQL query for cost attribution by project (H100 at $1.69/hr):
```promql
# Estimated GPU cost per project (H100 SXM5 as of 13 May 2026)
# Spot price: $1.69/hr | On-demand price: $4.00/hr
# Replace the multiplier with the rate matching your instance type
runai_allocated_gpus{project="team-training"} * 1.69
```

DCGM exporter (nvidia/dcgm-exporter) provides the hardware-level metrics that Run:ai's dashboard does not surface: temperature, memory bandwidth, SM utilization, NVLink error counters. Run:ai gives scheduling and queue visibility; DCGM gives hardware health visibility. Both are needed in production.
Import the Run:ai Grafana dashboard from the Run:ai UI at app.run.ai -> Dashboards, and layer DCGM panels on the same dashboard for the full picture. Set alerts on runai_pending_other_in_queue spiking above your SLA threshold and on DCGM thermal warnings when GPU temperature approaches throttle limits.
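A minimal sketch of those two alerts as Prometheus rules, assuming the Run:ai and DCGM metrics above are scraped; the thresholds are placeholders to tune against your own SLA and GPU model.

```yaml
# Illustrative Prometheus alerting rules (thresholds are assumptions, not defaults)
groups:
- name: runai-gpu-alerts
  rules:
  - alert: RunaiQueueBacklog
    expr: runai_pending_other_in_queue > 10   # tune to your queueing SLA
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Run:ai pending-workload queue exceeds SLA threshold"
  - alert: GpuThermalThrottleRisk
    expr: DCGM_FI_DEV_GPU_TEMP > 83           # set just below your GPU's throttle limit
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU temperature approaching throttle limit"
```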
Run:ai delivers the most value when the underlying GPU cluster is yours to control, not a rented serverless slot. Spheron's bare-metal GPU cloud gives you the reserved, NVLink-aware nodes that Run:ai's scheduler and fractional GPU runtime depend on, without the hyperscaler autoscaler fighting your quota assignments.
Rent H100 on Spheron → | Rent A100 on Spheron → | View all GPU pricing →
