NVIDIA announced Mission Control at GTC 2026 as the unified control plane across BCM, Run:ai, NeMo, DCGM, and DSX: the full NVIDIA AI factory software stack under one lifecycle manager. Most coverage so far is marketing copy and press recaps. This guide covers how to actually deploy it: provisioning on bare-metal GPU cloud, multi-tenant quota setup, fault-tolerant training configuration, SLO-aware inference scheduling, and migration from existing schedulers.
For context on the scheduling layer inside Mission Control, the NVIDIA Run:ai on GPU Cloud guide covers Run:ai architecture, fractional GPU sharing, and Helm installation in depth.
What NVIDIA Mission Control Is
Mission Control is a lifecycle management layer, not a replacement for the individual components underneath it. Each component still does what it does. Mission Control adds the control loop above them: unified provisioning, policy federation, and cross-stack telemetry aggregation.
| Component | What it does | Mission Control's role |
|---|---|---|
| BCM (Base Command Manager) | Cluster provisioning, firmware, OS imaging | Lifecycle layer: BCM is the provisioning source of truth |
| Run:ai | GPU quota, fractional sharing, gang scheduling | Scheduling layer: Mission Control federates project and quota policy to Run:ai |
| NeMo microservices | Model serving, inference endpoints, fine-tuning pipelines | Serving layer: Mission Control routes traffic and manages NeMo deployment lifecycle |
| DCGM | GPU telemetry, health monitoring, ECC error tracking | Observability layer: DCGM metrics aggregate into Mission Control's health dashboard |
| DSX MaxLPS | Parallel storage for training data and checkpoints | Storage layer: Mission Control policies govern data locality and access controls |
Before Mission Control, operating this stack meant five separate dashboards with no shared state. A node failure visible in DCGM had no automatic feedback path to Run:ai's scheduler or BCM's provisioning layer. You correlated events manually. Mission Control closes that loop.
Architecture: Control Plane, Scheduler, Telemetry, Policy Engine
Mission Control has four internal sub-planes:
+----------------------------------------------------------------+
|                     NVIDIA Mission Control                     |
|  +--------------+  +--------------+  +----------------------+  |
|  | Control Sub- |  |    Policy    |  | Telemetry Aggregator |  |
|  | Plane (BCM)  |  |    Engine    |  |    (DCGM + logs)     |  |
|  +------+-------+  +------+-------+  +----------+-----------+  |
|         |                 |                     |              |
+---------+-----------------+---------------------+--------------+
          |                 |                     |
          v                 v                     v
   +-------------+   +-------------+       +-------------+
   |     BCM     |   |   Run:ai    |       |    DCGM     |
   | (provision) |   | (schedule)  |       | (telemetry) |
   +-------------+   +------+------+       +-------------+
                            |
                            v
                     +-------------+
                     |    NeMo     |
                     | (inference) |
                     +-------------+
Control sub-plane: BCM handles node inventory, firmware updates, OS imaging, and health checks. Mission Control queries BCM's REST API continuously and surfaces node state in a unified dashboard.
Policy engine: Quota definitions, chargeback labels, and namespace rules are authored once in Mission Control and pushed to Run:ai's scheduler via the federation API. You do not manually sync policy between tools.
Telemetry aggregator: DCGM metrics, Run:ai queue metrics, and NeMo request traces feed into a shared telemetry store. The Mission Control UI provides cross-stack correlation: a GPU utilization spike from DCGM aligns with a queue depth increase from Run:ai in the same timeline.
Scheduling sub-plane: The SLO-aware placement engine inside Mission Control reads inference SLO manifests (TTFT targets, minimum throughput) and routes workloads to GPU nodes that can meet those targets based on current DCGM utilization data.
The request flow for a new inference deployment:
User submits InferenceDeployment manifest
→ Mission Control policy engine validates quotas
→ Telemetry aggregator queries DCGM for current GPU utilization
→ SLO placement engine selects target nodes (H200 vs B200 based on TTFT target)
→ Run:ai schedules pods on selected nodes
→ NeMo service starts and registers endpoint
→ Mission Control workload view shows active status
Mission Control vs Standalone Run:ai vs Kubernetes-Only
| Capability | Run:ai standalone | Kubernetes + Kueue | Mission Control |
|---|---|---|---|
| Cluster provisioning | None | None | BCM integration |
| GPU quota per team | Yes | Yes (via ResourceQuota) | Yes, federated |
| Fractional GPU sharing | Yes | No | Yes (via Run:ai) |
| Gang scheduling | Yes | Via KAI Scheduler | Yes (via Run:ai) |
| Inference serving | None | Manual deployment | NeMo integration |
| Cross-stack telemetry | Partial (Run:ai metrics) | Partial (DCGM separate) | Unified (DCGM + Run:ai + NeMo) |
| Chargeback reporting | Per-project GPU-hours | Manual | Built-in with cost labels |
| Firmware/OS lifecycle | None | None | BCM |
| Migration tooling | Limited | None | BCM bridge for Slurm |
When to use each:
Run:ai standalone makes sense when you already have BCM and NeMo managed separately and want GPU scheduling without consolidating the stack. You get fractional GPU and quota management, nothing more.
Kubernetes with Kueue (or KAI Scheduler) is the right call for smaller clusters (under 16 GPUs), single teams, or training-only workloads. No licensing cost, broad community support. See Kubernetes GPU orchestration with DRA and KAI Scheduler for the full setup.
Mission Control pays off when you have multiple teams sharing the cluster, mixed training and inference workloads, chargeback requirements, and a mix of BCM-provisioned nodes with NeMo endpoints. The consolidation benefit compounds at 32+ GPUs with 3+ teams.
For HPC-style batch training that does not need the Kubernetes stack at all, the Slurm for AI training guide covers when Slurm wins on simplicity.
Provisioning a Mission Control Cluster on Spheron
Spheron provides bare-metal H200 SXM5 and B200 SXM6 nodes without the DGX Cloud or Azure NC-series markup. The full NVIDIA AI Enterprise software stack runs on any Kubernetes cluster with NVIDIA GPUs. You do not need a hyperscaler to run Mission Control.
Cluster footprint for a mid-size mixed workload (training + inference):
| Node role | Instance type | Count | On-demand price | Monthly cost estimate |
|---|---|---|---|---|
| Control plane (BCM + Mission Control) | CPU instance | 1 | ~$0.50/hr | ~$360/mo |
| GPU worker (training) | H200 SXM5 | 4 nodes × 8 GPU | $5.92/hr per GPU | ~$34,099/mo per node |
| GPU worker (inference) | B200 SXM6 | 2 nodes × 8 GPU | $8.61/hr per GPU | ~$49,594/mo per node |
| Spot alternative (training) | H200 SXM5 spot | 4 nodes × 8 GPU | ~$1.78/hr per GPU | ~$10,246/mo per node |
Pricing fluctuates based on GPU availability. The prices above are as of 05 Jun 2026 and may have changed. Check current GPU pricing for live rates.
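The monthly figures in the table follow from a simple per-GPU-hour calculation. The sketch below reproduces it, assuming 8 GPUs per node and a 720-hour month (30 days × 24 hours); the function name is illustrative, not part of any Spheron or NVIDIA tooling.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the convention the table appears to use


def monthly_node_cost(price_per_gpu_hour: float, gpus_per_node: int = 8) -> float:
    """Monthly cost of one node billed per GPU-hour and running 24/7."""
    return price_per_gpu_hour * gpus_per_node * HOURS_PER_MONTH


# Rates copied from the table above (USD per GPU-hour)
rates = {
    "H200 SXM5 on-demand": 5.92,
    "B200 SXM6 on-demand": 8.61,
    "H200 SXM5 spot": 1.78,
}

for name, rate in rates.items():
    print(f"{name}: ${monthly_node_cost(rate):,.0f}/mo per node")
```

The on-demand rows match the table. The spot row comes out slightly higher (about $10,253) because ~$1.78/hr is itself a rounded rate.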
Teams running fault-tolerant training jobs should put training workers on spot H200 SXM5 nodes and keep inference on on-demand B200 SXM6 to prevent preemption from affecting latency SLOs.
To get started, rent H200 SXM5 on Spheron for training workers and B200 SXM6 instances for the inference tier. Spheron aggregates GPU supply from 5+ providers so you get competitive pricing without sourcing from each data center partner directly.
Control plane setup:
# 1. Deploy Kubernetes on a CPU instance (k3s recommended for simplicity)
curl -sfL https://get.k3s.io | sh -
# 2. Join GPU worker nodes (run on each GPU worker)
k3s agent --server https://<control-plane-ip>:6443 --token <node-token>
# 3. Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false # if drivers pre-installed
# 4. Verify GPU resources appear
kubectl describe nodes | grep nvidia.com/gpu
BCM installation:
# Add NGC Helm registry
helm repo add nvbc https://helm.ngc.nvidia.com/nvidia/bcm
helm repo update
# Install BCM control plane
helm upgrade -i bcm-control-plane nvbc/bcm-control-plane \
  --namespace bcm-system \
  --create-namespace \
  -f bcm-values.yaml
# bcm-values.yaml (minimal config)
# cluster:
#   endpoint: https://<k8s-api-server>:6443
#   gpuInventory:
#     - model: H200-SXM5
#       count: 32
#   storageClass: local-path
Mission Control installation follows the NVIDIA AI Enterprise deployment documentation from the NGC catalog. The Mission Control operator wraps BCM, Run:ai, and NeMo into a single Helm umbrella chart. Configure it to point at your existing BCM endpoint and Run:ai control plane URL:
# Add NGC Helm registry for Mission Control
helm repo add nvai https://helm.ngc.nvidia.com/nvidia/mission-control
helm repo update
# Install Mission Control operator (from NGC, requires NVAIE subscription)
helm upgrade -i nvidia-mission-control nvai/mission-control \
  --namespace nvidia-mission-control \
  --create-namespace \
  --set bcm.endpoint=https://<bcm-api>:8443 \
  --set runai.controlPlane.url=https://<runai-cp>:443 \
  --set dcgm.prometheusEndpoint=http://<dcgm-exporter>:9400/metrics \
  --set nemo.registry=nvcr.io/nvidia
Multi-Tenant Quota and Chargeback
Mission Control uses a three-level hierarchy: Department (business unit), Project (team), Workload (individual job).
apiVersion: mission-control.nvidia.com/v1
kind: MissionControlDepartment
metadata:
  name: ai-research
spec:
  displayName: "AI Research"
  projects:
    - name: llm-pretraining
      deservedGpus: 32
      overQuotaWeight: 2
      chargebackLabel: "cost-center/research-llm"
    - name: inference-prod
      deservedGpus: 16
      overQuotaWeight: 1
      chargebackLabel: "cost-center/product-inference"
---
apiVersion: mission-control.nvidia.com/v1
kind: MissionControlProject
metadata:
  name: llm-pretraining
  namespace: ai-research
spec:
  deservedGpus: 32
  overQuotaWeight: 2
  chargebackLabel: "cost-center/research-llm"
  maxOverQuotaGpus: 16 # can borrow up to 16 extra when cluster is idle
GPU-hours per project accumulate in Mission Control's cost attribution system. The chargeback webhook pushes hourly summaries to your cost management tool (Datadog, internal billing system, or a Prometheus counter scraped by Grafana).
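To make the attribution step concrete, here is a minimal sketch of how GPU-hours might be rolled up per chargeback label before being pushed to a billing tool. The `UsageSample` record and both function names are hypothetical; the actual webhook payload format is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One usage interval for a project (hypothetical record shape)."""
    chargeback_label: str  # e.g. "cost-center/research-llm"
    gpus: float            # GPUs held during the interval (fractions allowed)
    hours: float           # interval length in hours


def gpu_hours_by_label(samples):
    """Sum GPU-hours (gpus x hours) per chargeback label."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.chargeback_label] += s.gpus * s.hours
    return dict(totals)


def cost_by_label(samples, price_per_gpu_hour):
    """Convert GPU-hours per label into a cost at a flat per-GPU-hour rate."""
    return {
        label: round(hours * price_per_gpu_hour, 2)
        for label, hours in gpu_hours_by_label(samples).items()
    }


samples = [
    UsageSample("cost-center/research-llm", gpus=32, hours=1.0),
    UsageSample("cost-center/research-llm", gpus=16, hours=0.5),
    UsageSample("cost-center/product-inference", gpus=16, hours=1.0),
]
print(gpu_hours_by_label(samples))
print(cost_by_label(samples, price_per_gpu_hour=5.92))
```

A flat rate is the simplest assumption; a real report would likely price on-demand and spot hours separately.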
For the reporting layer on top of this attribution data, the GPU FinOps and cost allocation guide covers how to build team-level GPU spend dashboards and budget alerts.
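The overQuotaWeight and maxOverQuotaGpus fields in the manifests above describe how idle capacity is lent out. The sketch below shows one plausible reading: idle GPUs are split in proportion to weight and capped per project. This is a simplified illustration under that assumption, not the actual Run:ai scheduling algorithm, and the cap used for inference-prod is hypothetical because the manifest does not set one.

```python
def share_idle_gpus(idle_gpus, projects):
    """Split idle GPUs across projects in proportion to their over-quota weight.

    projects maps a project name to {"weight": int, "max_over": int}.
    Shares are rounded down to whole GPUs and capped at max_over;
    any remainder simply stays idle in this simplified model.
    """
    total_weight = sum(p["weight"] for p in projects.values())
    if total_weight == 0:
        return {name: 0 for name in projects}
    return {
        name: min(p["max_over"], idle_gpus * p["weight"] // total_weight)
        for name, p in projects.items()
    }


projects = {
    "llm-pretraining": {"weight": 2, "max_over": 16},
    "inference-prod": {"weight": 1, "max_over": 16},  # cap is an assumed value
}
print(share_idle_gpus(12, projects))
```

With 12 idle GPUs and weights of 2 and 1, the training project can borrow twice as many GPUs as the inference project.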
Fault-Tolerant LLM Training with Checkpoint Recovery
Mission Control's policy engine listens to DCGM health signals and triggers automatic workload rescheduling when a GPU node enters an unhealthy state. The recovery flow:
DCGM detects GPU error (ECC multi-bit error, XID fault, or utilization drop to 0)
→ Mission Control health monitor fires FaultTolerancePolicy
→ RunaiJob marked for preemption
→ checkpoint saved to DSX storage (or NFS mount)
→ BCM marks node unhealthy, removes from schedulable pool
→ Run:ai reschedules job on healthy nodes
→ torchrun resumes from last checkpoint file
The key manifest:
apiVersion: mission-control.nvidia.com/v1
kind: FaultTolerancePolicy
metadata:
  name: training-ft-policy
spec:
  trigger:
    dcgmEventCodes: ["XID 79", "XID 94", "XID 95"] # GPU fell off the bus, contained ECC error, uncontained ECC error
    utilizationDropThreshold: 0.05 # GPU util drops to <5% unexpectedly
  action:
    type: CheckpointAndReschedule
    checkpointPath: /mnt/dsx/checkpoints/
    maxRescheduleAttempts: 3
    rescheduleDelay: 60s
The NeMo checkpoint config to pair with it:
# In your NeMo trainer config (trainer.yaml)
trainer:
  checkpoint_callback_params:
    save_top_k: 3
    every_n_train_steps: 500 # checkpoint every 500 steps
    dirpath: /mnt/dsx/checkpoints/
    filename: "step={step}-loss={val_loss:.2f}"
  enable_progress_bar: true
With torchrun, resuming is automatic as long as your training script's --resume_from_checkpoint argument points at the latest checkpoint directory (the flag belongs to the training script, not to torchrun itself) and Mission Control ensures the path is mounted on the rescheduled nodes.
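Picking "the latest checkpoint" is worth doing explicitly rather than by file listing order. The helper below is a minimal sketch that assumes checkpoints are written as .ckpt files using the step={step}-loss=... filename pattern from the config above; the function name is illustrative.

```python
import re
from pathlib import Path
from typing import Optional

STEP_RE = re.compile(r"step=(\d+)")


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the .ckpt file with the highest step number, or None if none exist."""
    best, best_step = None, -1
    for path in Path(ckpt_dir).glob("*.ckpt"):
        match = STEP_RE.search(path.name)
        if match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best
```

The step is compared numerically on purpose: a plain alphabetical sort would rank step=500 after step=1500, so the rescheduled job would resume from a stale checkpoint.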
For the detailed engineering of checkpoint formats, storage tiers, and spot-instance resilience patterns, the spot GPU training resilience and checkpointing guide covers the full stack.
SLO-Aware Inference Scheduling on H200 and B200
The Mission Control placement engine reads inference SLO manifests and selects GPU hardware based on current cluster state. TTFT targets drive hardware selection: H200's 141GB HBM3e handles memory-intensive large context windows, while B200's higher HBM3e bandwidth and FP4 throughput make it the better choice for high-throughput short-context serving.
| Workload profile | TTFT target | Recommended hardware | Mission Control placement rule |
|---|---|---|---|
| 70B model, long context (32K+) | <500ms | H200 SXM5 | Place on nodes with DCGM util <60% and 141GB+ VRAM |
| 70B model, high throughput batch | <2000ms | B200 SXM6 | Place on nodes with highest memory bandwidth |
| 7-13B fast interactive | <150ms | B200 SXM6 | Place on lowest-latency nodes by queue depth |
| 405B multi-GPU | <1000ms | H200 SXM5 (multi-node NVLink) | Place on NVLink-connected node groups via BCM topology |
The InferenceDeployment manifest:
apiVersion: mission-control.nvidia.com/v1
kind: InferenceDeployment
metadata:
  name: llama4-maverick-inference
spec:
  model:
    registry: nvcr.io/nvidia
    name: llama-4-maverick
    precision: fp8
  replicas: 2
  slo:
    targetTTFT: 150ms # time-to-first-token target
    minThroughput: 500 # tokens per second minimum
    p99Latency: 800ms # per-token generation latency p99
  placement:
    preferHardware: [B200-SXM6, H200-SXM5] # ranked preference
    avoidIfDCGMUtil: 0.80 # skip nodes above 80% GPU utilization
  project: inference-prod # Run:ai project for quota accounting
Mission Control's placement engine queries DCGM utilization every 10 seconds and rebalances inference replicas across nodes if a node's utilization climbs past the avoidIfDCGMUtil threshold.
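The placement fields in the manifest can be read as a filter followed by a ranking. The sketch below illustrates that reading only; the `Node` type and `pick_node` function are hypothetical and do not reflect Mission Control's internal implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    hardware: str    # e.g. "B200-SXM6"
    gpu_util: float  # 0.0-1.0, as reported by DCGM


def pick_node(nodes: List[Node], prefer_hardware: List[str], avoid_util: float) -> Optional[Node]:
    """Filter out busy or unlisted nodes, then rank by hardware preference and utilization."""
    rank = {hw: i for i, hw in enumerate(prefer_hardware)}
    eligible = [n for n in nodes if n.hardware in rank and n.gpu_util <= avoid_util]
    if not eligible:
        return None  # nothing satisfies the placement constraints
    # Lower preference index wins first; lower utilization breaks ties.
    return min(eligible, key=lambda n: (rank[n.hardware], n.gpu_util))


nodes = [
    Node("b200-1", "B200-SXM6", 0.85),
    Node("b200-2", "B200-SXM6", 0.70),
    Node("h200-1", "H200-SXM5", 0.30),
]
choice = pick_node(nodes, ["B200-SXM6", "H200-SXM5"], avoid_util=0.80)
print(choice.name if choice else "no eligible node")
```

Here b200-1 is skipped for exceeding the 80% cutoff, and b200-2 is chosen over the less busy h200-1 because hardware preference is ranked ahead of utilization.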
For the methodology behind TTFT budgets, ITL targets, and SLO decomposition across model serving tiers, the LLM inference SLO, TTFT, and latency budget guide covers the full framework.
Observability: Integrating DCGM Telemetry with Grafana and Langfuse
DCGM metrics flow from the GPU nodes to a Prometheus scrape target, which Mission Control's telemetry aggregator picks up and combines with Run:ai queue metrics and NeMo request traces.
Prometheus scrape config for Mission Control metrics:
# prometheus.yml
scrape_configs:
  - job_name: 'mission-control'
    static_configs:
      - targets: ['nvidia-mission-control.nvidia-mission-control.svc:9400']
    metrics_path: /metrics
    scrape_interval: 15s
  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter.gpu-operator.svc:9400']
    scrape_interval: 10s
  - job_name: 'runai'
    static_configs:
      - targets: ['runai-backend.runai-backend.svc:9999']
    scrape_interval: 30s
Grafana dashboard panel for GPU utilization per Run:ai project:
{
  "title": "GPU Utilization by Project",
  "type": "timeseries",
  "targets": [
    {
      "expr": "avg by (project_name) (runai_project_gpu_utilization{cluster='spheron-cluster-1'})",
      "legendFormat": "{{project_name}}"
    },
    {
      "expr": "avg by (modelName) (DCGM_FI_DEV_GPU_UTIL{namespace='inference-prod'})",
      "legendFormat": "GPU: {{modelName}}"
    }
  ],
  "fieldConfig": {
    "defaults": { "unit": "percent", "max": 100 }
  }
}
Langfuse sits on top of this for LLM request tracing. Configure NeMo to emit span data to Langfuse, then correlate Langfuse TTFT measurements against DCGM GPU utilization to identify whether latency is GPU-bound or queue-bound:
# Configure NeMo inference server with Langfuse tracing
# (written against the Langfuse Python SDK v2 trace/generation API)
import os
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com",
)

# Wrap NeMo inference client.
# nemo_client and prompt are placeholders for your own client object and request text.
trace = langfuse.trace(name="llm-inference", metadata={"model": "llama-4-maverick"})
# Token usage is recorded on generations rather than plain spans
generation = trace.generation(name="generate", model="llama-4-maverick", input=prompt)
response = nemo_client.generate(prompt=prompt, max_tokens=512)
generation.end(output=response.text, usage={"totalTokens": response.usage.total_tokens})
langfuse.flush()  # send buffered events before the process exits
For the full observability stack covering Langfuse, Arize Phoenix, and Helicone alongside GPU telemetry, the LLM observability guide covers the end-to-end setup.
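One rough way to act on the GPU-bound versus queue-bound question is to look only at requests that missed the TTFT target and check how busy the GPUs were at the time. The heuristic below is a sketch under that assumption; the 0.80 cutoff and the function name are illustrative, and the input is assumed to be TTFT and DCGM utilization already joined on timestamp.

```python
from statistics import mean


def classify_slow_requests(samples, ttft_target_ms, util_cutoff=0.80):
    """Label the likely bottleneck for requests that missed the TTFT target.

    samples is a list of (ttft_ms, gpu_util) pairs, with gpu_util in 0.0-1.0.
    """
    slow_utils = [util for ttft, util in samples if ttft > ttft_target_ms]
    if not slow_utils:
        return "within-slo"
    # Saturated GPUs during slow requests point at compute; idle GPUs point
    # at time spent waiting elsewhere, such as the scheduling queue.
    return "gpu-bound" if mean(slow_utils) >= util_cutoff else "queue-bound"


print(classify_slow_requests([(120, 0.55), (900, 0.95), (850, 0.91)], ttft_target_ms=500))
```

Treat the label as a starting point for investigation: low GPU utilization during slow requests can also come from network or storage stalls, which this heuristic does not distinguish.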
Mission Control Cost on GPU Cloud: License, Ops Overhead, Break-Even
Mission Control is part of the NVIDIA AI Enterprise (NVAIE) subscription. NVAIE is priced per GPU per year through NVIDIA resellers; the exact tier depends on your commitment term and volume. For break-even analysis, you need to estimate the utilization improvement that Run:ai's fractional GPU sharing and quota borrowing provide.
Break-even math for a 32-GPU H200 cluster:
Without Mission Control (vanilla Kubernetes + manual ops):
- Typical GPU utilization: 55-65% (idle time from job queuing, resource fragmentation)
- At $5.92/hr × 32 GPUs × 720 hrs/mo = ~$136,396/mo compute
- Effective compute at 60% util: ~$81,838/mo in useful work
With Mission Control (Run:ai fractional GPU + quota borrowing):
- Typical GPU utilization: 75-85% (improved by fractional sharing and over-quota borrowing)
- Same 32 GPUs × $5.92/hr × 720 hrs = ~$136,396/mo compute
- Effective compute at 80% util: ~$109,117/mo in useful work
The utilization improvement adds ~$27,279/mo in useful compute from the same hardware. If NVAIE licensing runs under that figure for your GPU count, Mission Control pays for itself on utilization alone, before counting the operational savings from unified management.
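The break-even figure can be restated per GPU per year, which is how NVAIE is priced. The sketch below reuses the numbers from this section (32 GPUs, $5.92/hr, 720 hours per month, 60% versus 80% utilization); the function names are illustrative, and valuing "useful work" at the rental rate is this section's framing rather than an accounting standard.

```python
HOURS_PER_MONTH = 720


def useful_compute(gpus, rate_per_gpu_hour, utilization):
    """Monthly spend that lands on useful work at a given utilization."""
    return gpus * rate_per_gpu_hour * HOURS_PER_MONTH * utilization


def breakeven_license_per_gpu_year(gpus, rate_per_gpu_hour, util_before, util_after):
    """Highest per-GPU annual license cost that the utilization gain covers."""
    monthly_gain = (
        useful_compute(gpus, rate_per_gpu_hour, util_after)
        - useful_compute(gpus, rate_per_gpu_hour, util_before)
    )
    return monthly_gain * 12 / gpus


monthly_gain = useful_compute(32, 5.92, 0.80) - useful_compute(32, 5.92, 0.60)
ceiling = breakeven_license_per_gpu_year(32, 5.92, 0.60, 0.80)
print(f"Monthly gain: ${monthly_gain:,.0f}")
print(f"Break-even license ceiling: ${ceiling:,.0f} per GPU per year")
```

With these inputs the gain is about $27,279 per month, which works out to roughly $10,230 per GPU per year. Under these assumptions, licensing below that figure pays for itself on utilization alone.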
Mission Control + Spheron vs hyperscaler alternatives:
| Option | GPU | On-demand rate | Control plane overhead | Lock-in |
|---|---|---|---|---|
| Spheron + Mission Control | H200 SXM5 | $5.92/hr per GPU | Self-managed (Mission Control) | None |
| DGX Cloud (Azure) | H100 | ~$8-10/hr per GPU | Fully managed | Azure + NVIDIA |
| Azure NC H100 v3 | H100 | ~$6-8/hr per GPU | Azure AKS | Azure |
| Spheron + Mission Control | B200 SXM6 | $8.61/hr per GPU | Self-managed | None |
Pricing fluctuates based on GPU availability. The prices above are as of 05 Jun 2026 and may have changed. Check current GPU pricing for live rates.
The ops overhead of self-managed Mission Control is real. You own BCM configuration, Helm upgrades, and GPU Operator updates. For teams without a dedicated MLOps engineer, DGX Cloud's managed offering may be worth the price premium. For teams with infra capability, the Spheron + Mission Control combination cuts GPU compute costs by roughly 30-50% versus hyperscaler-managed offerings.
When Mission Control Is Overkill
Mission Control is the right tool for large, multi-tenant, mixed-workload clusters. It is not the right tool for every GPU cluster.
Skip Mission Control if:
- Your cluster is under 16 GPUs. The BCM provisioning layer and Run:ai control plane add 2-3 hours of setup and ongoing maintenance. Below 16 GPUs, vanilla Kubernetes with the GPU Operator handles scheduling without the overhead.
- You have a single team. Multi-tenant quota management exists to prevent teams from starving each other. With one team, there is nothing to isolate.
- You only run training jobs. If you have no inference workloads, the NeMo integration layer adds no value. Run Slurm with Pyxis for pure training clusters.
- You are on a tight timeline. BCM setup, Mission Control federation, and NeMo configuration take a week minimum. If you need GPUs in 24 hours, start with a Kubernetes cluster and the GPU Operator.
Decision matrix:
| Cluster size | # of teams | Workloads | Right tool |
|---|---|---|---|
| <16 GPUs | 1 | Training only | k3s + GPU Operator |
| <16 GPUs | 1 | Training + inference | Kubernetes + Kueue |
| 16-64 GPUs | 1-2 | Training only | Slurm or Kubernetes + KAI Scheduler |
| 16-64 GPUs | 3+ | Mixed | Mission Control |
| 64+ GPUs | Any | Mixed | Mission Control |
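The matrix can also be written as a small lookup, which makes its gaps explicit. This is a literal encoding of the rows above, with two assumptions: exactly 64 GPUs is treated as the "64+" tier, and combinations the matrix does not list return None rather than a guess.

```python
from typing import Optional


def recommend_stack(gpus: int, teams: int, workloads: str) -> Optional[str]:
    """Map cluster size, team count, and workload mix to a row of the matrix.

    workloads is "training" (training only) or "mixed" (training + inference).
    """
    if workloads not in ("training", "mixed"):
        raise ValueError("workloads must be 'training' or 'mixed'")
    if gpus < 16 and teams == 1:
        return "k3s + GPU Operator" if workloads == "training" else "Kubernetes + Kueue"
    if 16 <= gpus < 64:
        if teams <= 2 and workloads == "training":
            return "Slurm or Kubernetes + KAI Scheduler"
        if teams >= 3 and workloads == "mixed":
            return "Mission Control"
    if gpus >= 64 and workloads == "mixed":
        return "Mission Control"
    return None  # combination not covered by the matrix


print(recommend_stack(32, 4, "mixed"))
```

Cases such as a small cluster shared by several teams fall outside the matrix and need a judgment call.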
Migration from KubeFlow, Slurm, or Standalone Run:ai
From KubeFlow:
KubeFlow Pipelines and Mission Control overlap on training pipeline orchestration. Mission Control replaces KFP's pipeline DAG runner for training jobs through Run:ai's TrainingWorkload CRD. Keep KubeFlow's model registry if it integrates with your MLflow tracking. Replace the KFP execution backend with Run:ai. The NeMo model serving component replaces KFServing for NVIDIA models specifically.
Components to keep: model registry, experiment tracking (MLflow), data versioning (DVC).
Components Mission Control replaces: pipeline DAG executor, resource scheduler, model serving layer.
From Slurm:
BCM provides a Slurm bridge that translates sbatch job submissions into Run:ai workloads. This lets you migrate incrementally: submit existing Slurm scripts through BCM, which wraps them in RunaiJob manifests and schedules them through Run:ai. Teams continue using sbatch scripts while the infrastructure layer migrates underneath.
# BCM Slurm bridge: submit existing sbatch scripts
bcm submit --scheduler runai --wrap "sbatch train.sh" \
  --gpus 8 \
  --project llm-pretraining \
  --checkpoint-path /mnt/dsx/checkpoints/
From standalone Run:ai:
This is the simplest migration. The Run:ai cluster engine continues running unchanged on your GPU nodes. Mission Control wraps it by registering the existing cluster engine as a managed sub-component. You do not reinstall GPU worker components. The upgrade path:
- Install BCM and point it at your existing Kubernetes cluster.
- Install the Mission Control operator and configure runai.controlPlane.url to point at your existing Run:ai control plane.
- Mission Control discovers existing Run:ai projects and departments automatically.
- Add DCGM integration and NeMo deployments as you're ready.
For the broader infrastructure design context, the production GPU cloud architecture guide covers the full reliability and storage stack that Mission Control sits on top of.
Frequently Asked Questions
What is NVIDIA Mission Control and what does it unify?
NVIDIA Mission Control is the unified AI factory control plane announced at GTC 2026. It provides a single pane of glass over BCM (Base Command Manager), Run:ai workload scheduler, NeMo microservices, DCGM telemetry, and DSX MaxLPS storage. Instead of managing each product through a separate dashboard, operators get one lifecycle management layer that tracks cluster health, workload quotas, model deployments, and chargeback across the full stack.
Does NVIDIA Mission Control require DGX Cloud or Azure?
No. Mission Control is infrastructure-agnostic at the software layer. It runs on any Kubernetes cluster with NVIDIA GPUs. DGX Cloud and Azure NC-series are managed hosting options for NVIDIA enterprise stacks, but you can deploy Mission Control on bare-metal GPU cloud providers like Spheron without the hyperscaler markup.
How does Mission Control differ from standalone Run:ai?
Run:ai is a workload scheduler that sits inside Mission Control as one component. Standalone Run:ai manages GPU quota, fractional sharing, and gang scheduling for a Kubernetes cluster but has no knowledge of the BCM provisioning layer, NeMo model services, or storage policies. Mission Control adds the orchestration layer above Run:ai: it provisions the cluster via BCM, pushes policy to Run:ai, routes inference traffic to NeMo endpoints, and aggregates telemetry from DCGM into a single control loop.
What GPU hardware does NVIDIA Mission Control support?
Mission Control supports any NVIDIA datacenter GPU that runs the CUDA 12.x stack and is compatible with the GPU Operator. This includes Hopper (H100, H200), Blackwell (B200, GB200, B300), and Ampere (A100) generations. The SLO-aware inference scheduler specifically targets multi-GPU configurations, with the Blackwell B200 and H200 SXM5 as the recommended hardware for mixed training and inference clusters.
When is Mission Control overkill for a GPU cluster?
Mission Control adds meaningful value at scale: 8+ GPU nodes, multiple teams sharing the cluster, mixed training and inference workloads, and chargeback requirements. For a single team running training-only workloads on a small cluster (under 16 GPUs), the operational overhead of BCM, the Run:ai control plane, and the NeMo service layer exceeds what simpler schedulers provide. Below that scale, the licensing cost and ops complexity do not break even.
Mission Control runs on any NVIDIA datacenter GPU cluster. You do not need DGX Cloud to use the full NVIDIA AI factory stack. Spheron offers bare-metal H200 SXM5 and B200 SXM6 nodes with per-minute billing and no long-term lock-in.
Quick Setup Guide
Rent H200 SXM5 or B200 SXM6 instances on Spheron with reserved commitments for the control plane nodes and on-demand or spot allocation for worker nodes. Provision a CPU-only control plane instance for Kubernetes (k3s or kubeadm), then join the GPU worker nodes. Verify the GPU Operator and DCGM exporter are running before installing Mission Control components.
Add the NGC Helm registry and install Base Command Manager with helm upgrade -i bcm-control-plane. Configure the bcm-values.yaml with your cluster endpoint, GPU node inventory, and storage class. BCM manages provisioning lifecycle and exposes a REST API that Mission Control uses to track node health and run firmware updates.
Install the Run:ai cluster engine Helm chart and point controlPlane.url at the Mission Control endpoint rather than standalone app.run.ai. Mission Control federates the Run:ai project and quota configuration, so you define teams and GPU quotas in the Mission Control UI and they propagate to Run:ai's scheduling layer automatically.
In the Mission Control dashboard, create a Department per business unit and Projects per team. Set deservedGpus and over-quota weights per project. Enable the cost attribution webhook and configure chargeback labels that map GPU-hours to cost centers. The GPU FinOps dashboard covers the reporting layer.
Create a NeMo deployment manifest referencing your model registry. Mission Control's SLO-aware placement engine reads target TTFT and throughput SLOs from the manifest and selects H200 or B200 nodes based on current queue depth and GPU utilization from DCGM. The placement decision is visible in the Mission Control workload view alongside training queue status.
