Tutorial

NVIDIA Run:ai on GPU Cloud: AI Workload Scheduling, Fractional GPU Sharing, and Multi-Tenant Quota Management Guide (2026)

Written by Mitrasish, Co-founder · May 13, 2026
Tags: NVIDIA Run:ai · Run:ai GPU Scheduler · AI Workload Orchestration · GPU Cloud · Run:ai Kubernetes · GPU Quota Management · Fractional GPU Sharing · Gang Scheduling GPU · Multi-Tenant GPU Cluster
Since NVIDIA acquired Run:ai in 2024, the external implementation documentation has thinned considerably. Engineers searching for deployment guides and scheduler comparisons land on vendor marketing pages instead. This guide covers what you actually need: how to install Run:ai on a Kubernetes GPU cluster, configure multi-tenant quota projects, use fractional GPU sharing, run gang-scheduled distributed training jobs, and evaluate whether the licensing cost makes sense versus free alternatives.

For the broader Kubernetes GPU scheduling landscape, including DRA and KAI Scheduler, see the guide on Kubernetes GPU scheduling with DRA and KAI Scheduler. For HPC-style batch workloads, the Slurm for batch training workloads guide covers when Slurm wins over container orchestration. For inference-specific GPU sharing decisions, see the fractional GPU inference options guide.

What Run:ai Actually Does

Run:ai adds three distinct primitives on top of Kubernetes that vanilla Kubernetes lacks:

  1. The GPU-aware scheduler replaces the default Kubernetes scheduler. It understands GPU topology, fractional GPU units, quota allocation per project, and priority-based preemption. Standard kube-scheduler treats nvidia.com/gpu as an integer count and nothing more.
  2. The fractional GPU runtime is a software-level time-slicing mechanism. Workloads declare fractional GPU units (e.g., 0.5 GPU) and the runtime shares the physical GPU device context between them. This is distinct from NVIDIA MIG, which partitions silicon at the hardware level.
  3. The quota and project system tracks deservedGpus (guaranteed allocation per team), over-quota weight (priority for borrowing idle capacity), and a department hierarchy for organizational grouping. Standard Kubernetes ResourceQuota objects have no concept of borrowing or returning capacity.
| Primitive | What it replaces | Kubernetes equivalent |
|---|---|---|
| GPU scheduler | kube-scheduler | kube-scheduler (no GPU awareness) |
| Fractional GPU | CUDA time-slicing | None - requires device plugin tricks |
| Quota system | ResourceQuota | ResourceQuota (no borrow/return) |

Understanding these three primitives separately matters. Teams sometimes deploy Run:ai just for fractional GPU sharing or just for quota management, and are then surprised when the other primitives behave in ways they never configured.

Run:ai Architecture Deep-Dive

Run:ai has three deployable components:

Run:ai control plane can run as SaaS (hosted at app.run.ai) or self-hosted. It manages users, projects, departments, and quota definitions. It stores queue state and job history. The control plane does not touch your GPU nodes directly; it communicates with the cluster engine over a secure channel.

Run:ai cluster engine is deployed per cluster via Helm. It overrides kube-scheduler by registering as a custom scheduler named runai-scheduler. It runs the allocation engine, reports GPU utilization metrics back to the control plane, and handles the local scheduling decisions. One cluster engine per Kubernetes cluster.

Run:ai scheduler plugin is the actual scheduling decision loop inside the cluster engine. It implements bin-packing for GPU fragmentation reduction, over-quota borrowing logic, and preemption ordering. When a workload is submitted with schedulerName: runai-scheduler, this plugin intercepts the pod placement request.

The request flow:

```
researcher submits RunaiJob
  -> cluster engine intercepts via webhook
  -> quota check: does the project have deservedGpus available?
  -> if yes: scheduler plugin places pod onto node with matching GPU
  -> if no: check over-quota borrowing weight vs idle projects
  -> if borrowable: place with preemptible flag
  -> if not: queue the workload
  -> DCGM exporter reports GPU utilization back to control plane
```

The scheduler override works by setting schedulerName: runai-scheduler on every workload Run:ai manages. Vanilla Kubernetes pods that do not specify schedulerName continue to use kube-scheduler. Both schedulers coexist; Run:ai only manages workloads explicitly routed to it.
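The decision flow above can be modeled as a small function. The sketch below is illustrative only, not Run:ai source code; the `Project` field names mirror the CRD spec (`deservedGpus`, `overQuotaWeight`), but the logic is a simplified assumption of how the quota check, over-quota borrowing, and queueing fit together.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    deserved_gpus: int       # guaranteed quota (deservedGpus in the CRD)
    allocated_gpus: int      # GPUs currently in use by this project
    over_quota_weight: int   # priority when borrowing idle capacity

def place(project: Project, requested: int, idle_gpus: int) -> str:
    """Toy model of the Run:ai scheduler's quota decision flow."""
    # Within guaranteed quota: place normally, never preemptible.
    if project.allocated_gpus + requested <= project.deserved_gpus:
        return "place"
    # Over quota: borrow idle capacity if any exists, but mark preemptible.
    if requested <= idle_gpus:
        return "place-preemptible"
    # No capacity anywhere: hold the workload in the queue.
    return "queue"

team = Project("team-training", deserved_gpus=8, allocated_gpus=8, over_quota_weight=3)
print(place(team, requested=2, idle_gpus=4))  # place-preemptible
print(place(team, requested=2, idle_gpus=0))  # queue
```

The key behavioral difference from kube-scheduler is the middle branch: vanilla Kubernetes has no notion of "placed but preemptible," which is exactly what makes over-quota borrowing safe to offer.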

Installing Run:ai on a GPU Cloud Kubernetes Cluster

Before installing Run:ai, your cluster needs:

  • Kubernetes 1.26+ (1.32+ recommended for DRA compatibility)
  • NVIDIA GPU Operator with DCGM exporter enabled
  • Helm 3.x
  • A Run:ai account at app.run.ai and an NGC API key (the Helm registry is at helm.ngc.nvidia.com/nvidia/runai, gated behind NGC authentication)

Install the Run:ai control plane (self-hosted option):

```bash
# Requires Run:ai account and NGC API key (app.run.ai)
helm repo add runai https://helm.ngc.nvidia.com/nvidia/runai --force-update \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update

# Install control plane into the runai-backend namespace
helm upgrade -i runai-backend -n runai-backend runai/control-plane \
  --set global.domain=<DOMAIN> \
  --set global.ingress.ingressClass=haproxy \
  --set tenantsManager.config.adminUsername=<ADMIN_EMAIL> \
  --set tenantsManager.config.adminPassword="<ADMIN_PASSWORD>" \
  --create-namespace

# Verify
kubectl get pods -n runai-backend
```

Install the Run:ai cluster engine on each GPU cluster:

```bash
# Use the same repo added above (helm.ngc.nvidia.com/nvidia/runai)
# Generate cluster UUID and client secret from app.run.ai -> Clusters -> New Cluster
helm upgrade -i runai-cluster runai/runai-cluster -n runai \
  --set controlPlane.url=https://your-cp.run.ai \
  --set controlPlane.clientSecret=<CLIENT_SECRET> \
  --set cluster.uid=<CLUSTER_UUID> \
  --set cluster.url=https://your-cluster.run.ai \
  --version="<VERSION>" \
  --create-namespace
```

The domain and admin credentials are set via --set flags or a values file generated from the Run:ai SaaS portal under Clusters -> Installation. The portal also provides the cluster UUID and client secret needed for the cluster engine installation step.

Common install errors:

| Error | Likely cause | Fix |
|---|---|---|
| CrashLoopBackOff on cluster engine | Control plane URL unreachable | Check network egress from cluster to app.run.ai or your self-hosted CP URL |
| Certificate validation failure | Clock skew between CP and cluster | Sync NTP on control plane and worker nodes |
| GPU Operator conflict | GPU Operator version mismatch | Check Run:ai release notes for supported GPU Operator versions |
| Pending pods after install | Node selector not matching | Verify nvidia.com/gpu resource is visible on GPU nodes: `kubectl describe node <node>` |

Configuring Projects, Departments, and Over-Quota Borrowing

Run:ai's multi-tenancy model has three layers:

  • Projects map 1:1 to Kubernetes namespaces. Each project has a deservedGpus (guaranteed) and an overQuotaWeight.
  • Departments group projects for organizational hierarchy. A department sets a ceiling on total GPU usage across its projects.
  • deservedGpus is the allocation that Run:ai will never preempt. If a project's workload is within its deservedGpus, no other project can displace it.
  • overQuotaWeight determines how idle capacity is distributed. A project with weight 3 gets three times as much over-quota capacity as a project with weight 1 when both are competing for free GPUs.

Worked example with three teams on a 16-GPU cluster:

| Project | deservedGpus | Over-quota weight | Proportional over-quota share (all competing) |
|---|---|---|---|
| team-training | 8 | 3 | 60% of idle capacity (~9.6 GPUs) |
| team-inference | 4 | 1 | 20% of idle capacity (~3.2 GPUs) |
| team-research | 2 | 1 | 20% of idle capacity (~3.2 GPUs) |

The proportional share column shows how idle capacity is split when all three projects are simultaneously competing for over-quota GPUs. With weights 3:1:1 (total 5), team-training gets 3/5 = 60%, each of the others gets 1/5 = 20%. When only one project is active and the other two are fully idle, that project can borrow up to all 16 GPUs. When team-inference submits work, Run:ai preempts team-training's over-quota workloads and returns 4 GPUs.
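The weighted split follows directly from normalizing the weights against the pool of borrowable GPUs. A quick sketch of the arithmetic (illustrative only, not Run:ai's allocation code):

```python
def over_quota_shares(weights: dict[str, int], idle_gpus: float) -> dict[str, float]:
    """Split idle capacity proportionally to over-quota weights."""
    total = sum(weights.values())
    return {name: idle_gpus * w / total for name, w in weights.items()}

weights = {"team-training": 3, "team-inference": 1, "team-research": 1}
shares = over_quota_shares(weights, idle_gpus=16)
print(shares)  # {'team-training': 9.6, 'team-inference': 3.2, 'team-research': 3.2}
```

Note that the real scheduler recomputes this continuously as projects go idle or submit work; the weights only matter at moments of contention.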

Project CRD (kubectl apply):

```yaml
apiVersion: run.ai/v1
kind: Project
metadata:
  name: team-training
  namespace: runai
spec:
  deservedGpus: 8
  overQuotaWeight: 3
```

Preemption order: Run:ai preempts over-quota workloads first, ordered by reverse submission time (newest submitted workload gets preempted first). Workloads within the guaranteed deservedGpus are never preempted unless the entire cluster runs out of capacity. This means stateless development workloads are safe to mark as low-priority over-quota consumers, while production inference endpoints should be within their guaranteed quota to avoid mid-request eviction.
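That victim-selection rule (over-quota workloads only, newest submission first) can be sketched as a sort. This is an illustrative model of the ordering described above, not Run:ai's implementation:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    over_quota: bool    # running on borrowed capacity?
    submitted_at: int   # submission time (unix seconds)

def preemption_order(workloads: list[Workload]) -> list[str]:
    """Candidate victims in eviction order: over-quota only, newest first.
    Workloads within their guaranteed deservedGpus are excluded entirely."""
    victims = [w for w in workloads if w.over_quota]
    victims.sort(key=lambda w: -w.submitted_at)  # reverse submission time
    return [w.name for w in victims]

jobs = [
    Workload("prod-inference", over_quota=False, submitted_at=100),
    Workload("dev-notebook", over_quota=True, submitted_at=300),
    Workload("batch-train", over_quota=True, submitted_at=200),
]
print(preemption_order(jobs))  # ['dev-notebook', 'batch-train']
```

`prod-inference` never appears in the list because it sits inside guaranteed quota, which is exactly why production endpoints belong there.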

Fractional GPU Sharing: How Run:ai Differs from MIG, MPS, and vGPU

Run:ai fractional GPU sharing is a software-level mechanism. The GPU device is shared via time-slicing at the CUDA scheduler level. A workload requesting 0.5 GPU gets scheduled on a physical GPU alongside another 0.5 GPU workload. The GPU executes their kernels in alternating time slices. Memory is not partitioned; both workloads share the same GPU memory address space without enforcement of per-workload limits. This means an out-of-memory error in one fractional workload can terminate all co-located workloads sharing that physical GPU. For production inference with strict SLAs, prefer MIG or dedicated GPUs over fractional sharing.

For the MIG and time-slicing comparison in more detail, see the MIG and time-slicing guide. For production inference decisions involving vGPU, see the vGPU and MPS right-sizing guide.

| Method | Isolation | GPU model requirement | Memory limits enforced | Performance overhead |
|---|---|---|---|---|
| Run:ai fractional | Soft (time-sharing) | Any CUDA GPU | No (best-effort) | Low - software scheduling |
| NVIDIA MIG | Hard (silicon partition) | A30, A100, H100, H200, H20, B200, GB200, RTX PRO 5000/6000 (datacenter only) | Yes (per-slice) | None - hardware isolation |
| NVIDIA MPS | Soft (context sharing) | Any CUDA GPU | No | Near-zero for trusted workloads |
| NVIDIA vGPU | Hard (virtualization) | Licensed hardware | Yes | 5-15% overhead |

Note on MIG availability: MIG is supported on Ampere (A30, A100), Hopper (H100, H200, H20), and Blackwell (B200, GB200, RTX PRO 5000/6000) datacenter GPUs. It is not supported on consumer cards (RTX 4090, RTX 5090) or pre-Ampere generations. For GPUs without MIG support, hardware isolation requires vGPU, and software isolation options are MPS or Run:ai fractional.

When to choose each approach:

  • Use MIG when workloads need guaranteed memory isolation (inference serving with strict SLAs, compliance requirements)
  • Use Run:ai fractional when running many small development or research jobs that share a cluster loosely and can tolerate variable performance
  • Use MPS for MPI/HPC workloads running a single trusted application context where near-zero overhead matters

RunaiJob fractional GPU request:

```yaml
apiVersion: run.ai/v1
kind: RunaiJob
metadata:
  name: small-inference
  namespace: team-inference
spec:
  template:
    spec:
      schedulerName: runai-scheduler
      containers:
        - name: inference
          image: nvcr.io/nvidia/pytorch:24.12-py3
          resources:
            limits:
              # Fractional request handled by the Run:ai runtime. Vanilla
              # kube-scheduler rejects non-integer device counts, so this
              # only works on workloads routed to runai-scheduler; some
              # Run:ai versions express fractions via annotations instead.
              nvidia.com/gpu: "0.5"
```

One important caveat: over-quota workloads on fractional GPUs are preemptible. If a fractional inference workload is running on borrowed quota and another project submits a deserved-quota job, Run:ai will preempt the fractional workload mid-execution. For inference endpoints, mark them as interactive workloads (highest priority, non-preemptible) to prevent this.

Gang Scheduling for Distributed Training (FSDP, DeepSpeed, Megatron)

Distributed training requires all ranks to start simultaneously. A PyTorch dist.barrier() call blocks indefinitely if one rank never arrives because its pod got stuck pending. This is the gang scheduling problem.

For the distributed training setup itself, see the guide on distributed LLM training with FSDP and DeepSpeed.

Kubernetes default behavior schedules pods independently. In a 4-node training job, 3 pods may start on available nodes while the fourth waits for a free node. The result: 3 pods block at dist.barrier(), the fourth never starts, and you pay for 3 nodes worth of GPU time doing nothing.

Run:ai's gang scheduling holds all pod placement decisions for a job until GPUs are available on all required nodes simultaneously, then places them atomically. No partial starts.

Full RunaiJob manifest for a 4-node FSDP training job:

```yaml
apiVersion: run.ai/v1
kind: RunaiJob
metadata:
  name: fsdp-training
  namespace: team-training
spec:
  template:
    spec:
      schedulerName: runai-scheduler
  distributedTraining:
    workers:
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.12-py3
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command: [torchrun]
              args:
                - --nnodes=4
                - --nproc_per_node=8
                - train.py
```

Run:ai sets the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables automatically; unlike Slurm, no manual scontrol commands or MPI hostfile generation are needed. For the NCCL tuning that applies regardless of which scheduler you use, see NCCL environment variables for multi-GPU training.
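A training script (or a quick sanity check inside the container) can read those injected variables directly. The sketch below uses only the standard library and the standard torchrun rendezvous variable names; the example values are illustrative, filled in as the launcher would set them for rank 5 of a 4-node x 8-GPU job:

```python
import os

def rendezvous_config() -> dict:
    """Read the rendezvous variables injected by the scheduler/launcher.
    torch.distributed's env:// init method reads exactly these names."""
    return {
        "master_addr": os.environ["MASTER_ADDR"],
        "master_port": int(os.environ["MASTER_PORT"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
        "rank": int(os.environ["RANK"]),
    }

# Illustrative values, as set for one worker of the fsdp-training job above
os.environ.update({"MASTER_ADDR": "fsdp-training-worker-0",
                   "MASTER_PORT": "29500",
                   "WORLD_SIZE": "32", "RANK": "5"})
cfg = rendezvous_config()
print(cfg["world_size"], cfg["rank"])  # 32 5
# In the real training script, these env vars feed:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```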

Inference Workload Patterns: Queue Priority, Preemption, and Node Pools

Inference services need to run continuously, not as batch jobs. But they compete with training jobs for GPUs on shared clusters. Run:ai handles this through workload types with different preemption policies:

  • Interactive workloads (highest priority, not preemptible): for long-running inference endpoints. An interactive workload within its guaranteed quota will not be evicted.
  • Train workloads (preemptible): for batch training runs. Can be preempted when higher-priority workloads need GPUs.
  • Build workloads (lowest priority, preemptible): for interactive development sessions and notebooks.

Default preemption order (highest to lowest protection):

| Workload type | Within guaranteed quota | Over quota |
|---|---|---|
| Interactive | Never preempted | Preemptible |
| Train | Never preempted | Preemptible |
| Build | Never preempted | Preemptible (first) |

Node pools let you label specific nodes for specific workload types. Training jobs never land on inference-designated nodes, even during a GPU shortage, so a training surge cannot evict live inference endpoints.

Label nodes for a pool:

```bash
kubectl label node <gpu-node-1> runai/node-pool=inference-pool
kubectl label node <gpu-node-2> runai/node-pool=inference-pool
```

Bind a project to a node pool:

```yaml
apiVersion: run.ai/v1
kind: Project
metadata:
  name: team-inference
  namespace: runai
spec:
  deservedGpus: 4
  overQuotaWeight: 1
  nodePools:
    - name: inference-pool
      deservedGpus: 4
```

With this configuration, team-inference workloads will only be placed on nodes in inference-pool. Training jobs from team-training cannot consume those nodes.

Licensing Math: When Run:ai Pays Off vs Kueue, Volcano, and KAI Scheduler

This is the question most engineers actually want answered. For the full comparison of Kueue and KAI Scheduler, see the KAI Scheduler and Kueue on Kubernetes guide.

Feature comparison:

| Feature | Run:ai | Kueue | KAI Scheduler | Volcano |
|---|---|---|---|---|
| License | Commercial (NVIDIA AI Enterprise) | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Fractional GPU sharing | Yes (software time-slicing) | No | No | No |
| Over-quota borrowing | Yes | Yes (cohort borrowing) | Yes | Partial |
| Gang scheduling | Yes | Yes | Yes | Yes |
| Multi-tenant quota hierarchy | Yes (Projects + Departments) | Yes (ClusterQueues + LocalQueues) | Yes | Yes |
| DCGM metric integration | Native | Manual | Native | Manual |
| Run:ai UI dashboard | Yes | No (Grafana only) | No | No |
| Support SLA | Commercial | Community | Community + NVIDIA | Community |

KAI Scheduler and Run:ai share the same scheduling core. NVIDIA open-sourced the KAI Scheduler under Apache 2.0 in April 2025. KAI originated inside the Run:ai platform and was released as a standalone project. When comparing Run:ai against KAI Scheduler, you are not evaluating two unrelated tools. The commercial Run:ai platform adds a management UI, multi-cluster management, commercial support SLA, and the fractional GPU runtime on top of the same scheduling core that KAI now exposes as open source. The practical decision is whether you need fractional GPU sharing (no open-source equivalent in KAI or Kueue), the Run:ai management UI, or a commercial support contract. The scheduler core itself is now open source either way.

Break-even analysis with live H100 pricing:

Run:ai licensing runs approximately $2,500-$5,000 per GPU per year (NVIDIA AI Enterprise pricing, varies by contract). H100 SXM5 on Spheron is available at $1.69/hr spot and $4.00/hr on-demand. At continuous use, the annual cost per GPU is roughly $14,804 at spot or $35,040 at on-demand. On a 16-GPU cluster, that is approximately $236,864/year at spot pricing or $560,640/year at on-demand pricing.

If Run:ai's fractional scheduling and over-quota borrowing recover 20% GPU utilization on a 16-GPU cluster, that recovers 3.2 GPU-years of capacity. At spot rates that is worth roughly $47,373 annually; at on-demand rates, roughly $112,128. The licensing cost for 16 GPUs at $2,500-$5,000/GPU/year runs $40,000-$80,000. At spot pricing and the lower licensing tier the math is close; at on-demand rates it pays off more clearly. At the upper licensing end with spot-only workloads, Kueue's cohort borrowing covers most of the same territory for free.
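The arithmetic is easy to rerun with your own rates. The sketch below reproduces the figures above (the totals differ from the prose by a few dollars because the prose rounds the per-GPU annual cost before multiplying; the rates and the 20% recovery assumption are the article's, not universal constants):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_gpu_cost(rate_per_hr: float, gpus: int = 1) -> float:
    """Annual cost of running `gpus` GPUs continuously at an hourly rate."""
    return rate_per_hr * HOURS_PER_YEAR * gpus

spot, on_demand = 1.69, 4.00          # H100 SXM5 $/hr (13 May 2026 snapshot)
cluster = 16
recovered_fraction = 0.20             # utilization reclaimed by the scheduler

for label, rate in [("spot", spot), ("on-demand", on_demand)]:
    cluster_cost = annual_gpu_cost(rate, cluster)
    recovered = cluster_cost * recovered_fraction
    print(f"{label}: cluster ${cluster_cost:,.0f}/yr, recovered ${recovered:,.0f}/yr")

license_low, license_high = 2_500 * cluster, 5_000 * cluster
print(f"license: ${license_low:,}-${license_high:,}/yr")
```

Swapping in your actual contract rate and a measured utilization-recovery figure (from DCGM data, not vendor claims) turns this from a blog estimate into a real procurement input.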

The bottom line: for clusters under 16 GPUs or with utilization below 60%, Kueue or KAI Scheduler is sufficient and free. Run:ai's additional value is fractional GPU sharing (which has no open-source equivalent) and the commercial support SLA.

Pricing fluctuates with GPU availability. The prices above reflect rates as of 13 May 2026 and may have changed. Check current GPU pricing → for live rates.

Run:ai on Bare-Metal GPU Cloud (Spheron) vs Hyperscaler-Managed Kubernetes

Run:ai's design assumptions fit bare-metal clusters better than hyperscaler-managed Kubernetes. The differences are architectural, not cosmetic.

Why Run:ai works better on bare metal:

Run:ai's quota system assumes you own the GPU capacity. On GKE, EKS, or AKS, the cloud autoscaler also makes placement decisions. When the cloud autoscaler tries to scale down a node that Run:ai has reserved for gang scheduling, you get placement conflicts. Run:ai's node pool definitions fight the cloud autoscaler's bin-packing logic. On bare metal with reserved instances, there is no competing autoscaler.

Run:ai's topology-aware placement is most effective when it can see actual hardware topology. On cloud VMs, NVLink connectivity and InfiniBand fabric details are sometimes hidden behind the hypervisor. On Spheron bare-metal nodes, the Run:ai scheduler sees the full PCIe bus topology, actual NVLink peer graphs, and real IB port assignments. Placement decisions reflect actual hardware state.

Spot eviction on hyperscaler clouds is unpredictable. Run:ai's queue state lives in etcd on the control plane. A mid-job spot eviction on AWS or GCP drops a node that Run:ai expects to be reserved, causing gang scheduling holds to deadlock. Spheron's reserved instances do not get arbitrarily evicted.

Teams running large-scale training on Run:ai typically start with a dedicated H100 cluster. See H100 on Spheron for current bare-metal H100 availability and pricing.

To compare GPU pricing across instance types before committing to a cluster size, the pricing page shows live per-hour rates for H100, A100, H200, and B200 with on-demand and spot options.

Hyperscaler comparison:

| Factor | Spheron bare metal | GKE/EKS/AKS |
|---|---|---|
| Run:ai autoscaler conflict | None (reserved nodes, no cloud autoscaler) | High risk on spot/preemptible nodes |
| GPU topology visibility | Full (PCIe, NVLink, IB) | Partial (hypervisor-filtered) |
| Node eviction risk | Low (reserved commitments) | High on spot |
| Run:ai control plane latency | Low (same DC) | Variable (cross-region possible) |
| Cost for persistent queue state | Low (etcd on same nodes) | Cloud etcd surcharges |

Migration Path: Moving from Plain Kubernetes + KEDA to Run:ai

For teams coming from reactive autoscaling setups, see the KEDA and Knative autoscaling guide as the baseline. Run:ai and KEDA serve different purposes and can coexist during migration.

Step-by-step approach:

  1. Install Run:ai alongside existing kube-scheduler. They coexist. Do not remove KEDA yet. Run:ai only manages workloads that explicitly set schedulerName: runai-scheduler.
  2. Label a test namespace to route through Run:ai. Set runai-job: "true" on a non-production namespace to route only those workloads through the Run:ai scheduler. Existing workloads are unaffected.
  3. Migrate batch training jobs first. Change schedulerName: default-scheduler to schedulerName: runai-scheduler in training job specs. Create a matching Run:ai project for the namespace.
  4. Create Run:ai projects matching existing namespaces. Set deservedGpus equal to current ResourceQuota limits. This makes the initial state identical to before, and you can tune quota from there.
  5. Remove KEDA ScaledJobs for workloads now managed by Run:ai. Keep KEDA for HTTP-triggered inference autoscaling where it still applies. Run:ai handles queue-based training scheduling; KEDA handles HTTP-driven scale-to-zero. They cover different dimensions.
  6. Migrate inference workloads to Run:ai interactive workloads. This gives better preemption control for production endpoints than KEDA-managed Deployments.

Existing workloads (vanilla Deployments, Jobs) that do not set schedulerName continue to use kube-scheduler with no changes. No mass rewrite is required.

Monitoring and Cost Attribution with Run:ai Metrics + DCGM Exporter

Run:ai exposes Prometheus metrics at the cluster engine's /metrics endpoint. Key metrics for operational dashboards:

  • runai_gpu_utilization_per_project - GPU utilization broken down by project, useful for chargebacks
  • runai_allocated_gpus - currently allocated GPUs per project
  • runai_pending_other_in_queue - queue backlog, proxy for scheduling latency
  • runai_job_wait_time_seconds - time a job spends waiting before scheduling begins

For the GPU monitoring stack itself, see the GPU monitoring and observability guide.

PromQL query for cost attribution by project (H100 at $1.69/hr):

```promql
# Estimated GPU cost per project (H100 SXM5 as of 13 May 2026)
# Spot price: $1.69/hr  |  On-demand price: $4.00/hr
# Replace the multiplier with the rate matching your instance type
runai_allocated_gpus{project="team-training"} * 1.69
```

DCGM exporter (nvidia/dcgm-exporter) provides the hardware-level metrics that Run:ai's dashboard does not surface: temperature, memory bandwidth, SM utilization, NVLink error counters. Run:ai gives scheduling and queue visibility; DCGM gives hardware health visibility. Both are needed in production.

Import the Run:ai Grafana dashboard from the Run:ai UI at app.run.ai -> Dashboards, and layer DCGM panels on the same dashboard for the full picture. Set alerts on runai_pending_other_in_queue spiking above your SLA threshold and on DCGM thermal warnings when GPU temperature approaches throttle limits.


Run:ai delivers the most value when the underlying GPU cluster is yours to control, not a rented serverless slot. Spheron's bare-metal GPU cloud gives you the reserved, NVLink-aware nodes that Run:ai's scheduler and fractional GPU runtime depend on, without the hyperscaler autoscaler fighting your quota assignments.

Rent H100 on Spheron → | Rent A100 on Spheron → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.