NVIDIA Mission Control on GPU Cloud: AI Factory Lifecycle Management, Multi-Tenant LLM Inference and Training (2026 Guide)

Q: What is NVIDIA Mission Control and what does it unify?

NVIDIA Mission Control is the unified AI factory control plane announced at GTC 2026. It provides a single pane of glass over BCM (Base Command Manager), Run:ai workload scheduler, NeMo microservices, DCGM telemetry, and DSX MaxLPS storage. Instead of managing each product through a separate dashboard, operators get one lifecycle management layer that tracks cluster health, workload quotas, model deployments, and chargeback across the full stack.

Q: Does NVIDIA Mission Control require DGX Cloud or Azure?

No. Mission Control is infrastructure-agnostic at the software layer. It runs on any Kubernetes cluster with NVIDIA GPUs. DGX Cloud and Azure NC-series are managed hosting options for NVIDIA enterprise stacks (DGX Cloud is NVIDIA's own managed service; NC-series is Microsoft's), but you can deploy Mission Control on bare-metal GPU cloud providers like Spheron without the hyperscaler markup.

Q: How does Mission Control differ from standalone Run:ai?

Run:ai is a workload scheduler that sits inside Mission Control as one component. Standalone Run:ai manages GPU quota, fractional sharing, and gang scheduling for a Kubernetes cluster but has no knowledge of the BCM provisioning layer, NeMo model services, or storage policies. Mission Control adds the orchestration layer above Run:ai: it provisions the cluster via BCM, pushes policy to Run:ai, routes inference traffic to NeMo endpoints, and aggregates telemetry from DCGM into a single control loop.

Q: What GPU hardware does NVIDIA Mission Control support?

Mission Control supports any NVIDIA datacenter GPU that runs the CUDA 12.x stack and is compatible with the GPU Operator. This includes Hopper (H100, H200), Blackwell (B200, GB200, B300), and Ampere (A100) generations. The SLO-aware inference scheduler specifically targets multi-GPU configurations, with the Blackwell B200 and H200 SXM5 as the recommended hardware for mixed training and inference clusters.

Q: When is Mission Control overkill for a GPU cluster?

Mission Control adds meaningful value at scale: 8+ GPU nodes, multiple teams sharing the cluster, mixed training and inference workloads, and chargeback requirements. For a single team running training-only workloads on a small cluster (under 16 GPUs), the operational overhead of BCM, the Run:ai control plane, and the NeMo service layer exceeds what simpler schedulers (vanilla Kubernetes with Kueue, or Slurm) provide. Below that scale, the licensing cost and ops complexity do not break even.

NVIDIA announced Mission Control at GTC 2026 as the unified control plane across BCM, Run:ai, NeMo, DCGM, and DSX: the full NVIDIA AI factory software stack under one lifecycle manager. Most coverage so far is marketing copy and press recaps. This guide covers how to actually deploy it: provisioning on bare-metal GPU cloud, multi-tenant quota setup, fault-tolerant training configuration, SLO-aware inference scheduling, and migration from existing schedulers.

For context on the scheduling layer inside Mission Control, the NVIDIA Run:ai on GPU Cloud guide covers Run:ai architecture, fractional GPU sharing, and Helm installation in depth.

What NVIDIA Mission Control Is

Mission Control is a lifecycle management layer, not a replacement for the individual components underneath it. Each component still does what it does. Mission Control adds the control loop above them: unified provisioning, policy federation, and cross-stack telemetry aggregation.

Component	What it does	Mission Control's role
BCM (Base Command Manager)	Cluster provisioning, firmware, OS imaging	Lifecycle layer: BCM is the provisioning source of truth
Run:ai	GPU quota, fractional sharing, gang scheduling	Scheduling layer: Mission Control federates project and quota policy to Run:ai
NeMo microservices	Model serving, inference endpoints, fine-tuning pipelines	Serving layer: Mission Control routes traffic and manages NeMo deployment lifecycle
DCGM	GPU telemetry, health monitoring, ECC error tracking	Observability layer: DCGM metrics aggregate into Mission Control's health dashboard
DSX MaxLPS	Parallel storage for training data and checkpoints	Storage layer: Mission Control policies govern data locality and access controls

Before Mission Control, operating this stack meant five separate dashboards with no shared state. A node failure visible in DCGM had no automatic feedback path to Run:ai's scheduler or BCM's provisioning layer. You correlated events manually. Mission Control closes that loop.

Architecture: Control Plane, Scheduler, Telemetry, Policy Engine

Mission Control has four internal sub-planes:

+----------------------------------------------------------------+
|                     NVIDIA Mission Control                     |
|  +--------------+  +--------------+  +----------------------+  |
|  | Control Sub- |  | Policy       |  | Telemetry Aggregator |  |
|  | Plane (BCM)  |  | Engine       |  | (DCGM + logs)        |  |
|  +------+-------+  +------+-------+  +----------+-----------+  |
|         |                 |                     |              |
+---------+-----------------+---------------------+--------------+
          |                 |                     |
          v                 v                     v
   +-------------+   +-------------+       +-------------+
   |     BCM     |   |   Run:ai    |       |    DCGM     |
   | (provision) |   | (schedule)  |       | (telemetry) |
   +-------------+   +------+------+       +-------------+
                            |
                            v
                     +-------------+
                     |    NeMo     |
                     | (inference) |
                     +-------------+

Control sub-plane: BCM handles node inventory, firmware updates, OS imaging, and health checks. Mission Control queries BCM's REST API continuously and surfaces node state in a unified dashboard.

Policy engine: Quota definitions, chargeback labels, and namespace rules are authored once in Mission Control and pushed to Run:ai's scheduler via the federation API. You do not manually sync policy between tools.

Telemetry aggregator: DCGM metrics, Run:ai queue metrics, and NeMo request traces feed into a shared telemetry store. The Mission Control UI provides cross-stack correlation: a GPU utilization spike from DCGM aligns with a queue depth increase from Run:ai in the same timeline.

Scheduling sub-plane: The SLO-aware placement engine inside Mission Control reads inference SLO manifests (TTFT targets, minimum throughput) and routes workloads to GPU nodes that can meet those targets based on current DCGM utilization data.

The request flow for a new inference deployment:

User submits InferenceDeployment manifest
  → Mission Control policy engine validates quotas
    → Telemetry aggregator queries DCGM for current GPU utilization
      → SLO placement engine selects target nodes (H200 vs B200 based on TTFT target)
        → Run:ai schedules pods on selected nodes
          → NeMo service starts and registers endpoint
            → Mission Control workload view shows active status

Mission Control vs Standalone Run:ai vs Kubernetes-Only

Capability	Run:ai standalone	Kubernetes + Kueue	Mission Control
Cluster provisioning	None	None	BCM integration
GPU quota per team	Yes	Yes (via ResourceQuota)	Yes, federated
Fractional GPU sharing	Yes	No	Yes (via Run:ai)
Gang scheduling	Yes	Via KAI Scheduler	Yes (via Run:ai)
Inference serving	None	Manual deployment	NeMo integration
Cross-stack telemetry	Partial (Run:ai metrics)	Partial (DCGM separate)	Unified (DCGM + Run:ai + NeMo)
Chargeback reporting	Per-project GPU-hours	Manual	Built-in with cost labels
Firmware/OS lifecycle	None	None	BCM
Migration tooling	Limited	None	BCM bridge for Slurm

When to use each:

Run:ai standalone makes sense when you already have BCM and NeMo managed separately and want GPU scheduling without consolidating the stack. You get fractional GPU and quota management, nothing more.

Kubernetes with Kueue (or KAI Scheduler) is the right call for smaller clusters (under 16 GPUs), single teams, or training-only workloads. No licensing cost, broad community support. See Kubernetes GPU orchestration with DRA and KAI Scheduler for the full setup.

Mission Control pays off when you have multiple teams sharing the cluster, mixed training and inference workloads, chargeback requirements, and a mix of BCM-provisioned nodes with NeMo endpoints. The consolidation benefit compounds at 32+ GPUs with 3+ teams.

For HPC-style batch training that does not need the Kubernetes stack at all, the Slurm for AI training guide covers when Slurm wins on simplicity.

Provisioning a Mission Control Cluster on Spheron

Spheron provides bare-metal H200 SXM5 and B200 SXM6 nodes without the DGX Cloud or Azure NC-series markup. The full NVIDIA AI enterprise software stack runs on any Kubernetes cluster with NVIDIA GPUs. You do not need a hyperscaler to run Mission Control. For a rack-scale AI factory where the whole NVLink domain is one unit, Spheron has GB200 NVL72 availability to reserve today: list your GPU count, timeline, and workload on the form and the team confirms capacity within a business day.

Cluster footprint for a mid-size mixed workload (training + inference):

Node role	Instance type	Count	On-demand price	Monthly cost estimate
Control plane (BCM + Mission Control)	CPU instance	1	~$0.50/hr	~$360/mo
GPU worker (training)	H200 SXM5	4 nodes × 8 GPU	$5.92/hr per GPU	~$34,099/mo per node
GPU worker (inference)	B200 SXM6	2 nodes × 8 GPU	$8.61/hr per GPU	~$49,594/mo per node
Spot alternative (training)	H200 SXM5 spot	4 nodes × 8 GPU	~$1.78/hr per GPU	~$10,246/mo per node

Pricing fluctuates based on GPU availability. The prices above are as of 05 Jun 2026 and may have changed. Check current GPU pricing for live rates.
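The per-node monthly figures in the table follow from the per-GPU hourly rate, assuming 8 GPUs per node and a 720-hour month. A minimal sketch of that arithmetic (the function name is illustrative, not part of any NVIDIA or Spheron tooling):

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the month length assumed in this guide


def monthly_cost_per_node(rate_per_gpu_hour: float, gpus_per_node: int = 8) -> float:
    """Return the monthly cost in dollars of one node billed around the clock."""
    return rate_per_gpu_hour * gpus_per_node * HOURS_PER_MONTH


h200_node = monthly_cost_per_node(5.92)  # on-demand H200 SXM5 rate from the table
b200_node = monthly_cost_per_node(8.61)  # on-demand B200 SXM6 rate from the table
print(f"H200 node: ${h200_node:,.0f}/mo, B200 node: ${b200_node:,.0f}/mo")
```

Multiply the per-node figure by the node count to size the whole tier. The spot row uses a rounded hourly rate, so recomputing it from ~$1.78 gives a slightly different total.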

Teams running fault-tolerant training jobs should put training workers on spot H200 SXM5 nodes and keep inference on on-demand B200 SXM6 to prevent preemption from affecting latency SLOs.

To get started, rent H200 SXM5 on Spheron for training workers and B200 SXM6 instances for the inference tier. Spheron aggregates GPU supply from 5+ providers so you get competitive pricing without sourcing from each data center partner directly.

Control plane setup:

bash

# 1. Deploy Kubernetes on a CPU instance (k3s recommended for simplicity)
curl -sfL https://get.k3s.io | sh -

# Read the join token on the control plane node
sudo cat /var/lib/rancher/k3s/server/node-token

# 2. Join GPU worker nodes (run on each GPU worker; installs k3s in agent mode)
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<control-plane-ip>:6443 K3S_TOKEN=<node-token> sh -

# 3. Install NVIDIA GPU Operator (run where kubectl and helm can reach the cluster)
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml  # k3s writes its kubeconfig here
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false  # if drivers pre-installed

# 4. Verify GPU resources appear
kubectl describe nodes | grep nvidia.com/gpu

BCM installation:

bash

# Add NGC Helm registry
helm repo add nvbc https://helm.ngc.nvidia.com/nvidia/bcm
helm repo update

# Install BCM control plane
helm upgrade -i bcm-control-plane nvbc/bcm-control-plane \
  --namespace bcm-system \
  --create-namespace \
  -f bcm-values.yaml

# bcm-values.yaml (minimal config)
# cluster:
#   endpoint: https://<k8s-api-server>:6443
#   gpuInventory:
#     - model: H200-SXM5
#       count: 32
#   storageClass: local-path

Mission Control installation follows the NVIDIA AI Enterprise deployment documentation from the NGC catalog. The Mission Control operator wraps BCM, Run:ai, and NeMo into a single Helm umbrella chart. Configure it to point at your existing BCM endpoint and Run:ai control plane URL:

bash

# Add NGC Helm registry for Mission Control
helm repo add nvai https://helm.ngc.nvidia.com/nvidia/mission-control
helm repo update

# Install Mission Control operator (from NGC, requires NVAIE subscription)
helm upgrade -i nvidia-mission-control nvai/mission-control \
  --namespace nvidia-mission-control \
  --create-namespace \
  --set bcm.endpoint=https://<bcm-api>:8443 \
  --set runai.controlPlane.url=https://<runai-cp>:443 \
  --set dcgm.prometheusEndpoint=http://<dcgm-exporter>:9400/metrics \
  --set nemo.registry=nvcr.io/nvidia

Multi-Tenant Quota and Chargeback

Mission Control uses a three-level hierarchy: Department (business unit), Project (team), Workload (individual job).

yaml

apiVersion: mission-control.nvidia.com/v1
kind: MissionControlDepartment
metadata:
  name: ai-research
spec:
  displayName: "AI Research"
  projects:
    - name: llm-pretraining
      deservedGpus: 32
      overQuotaWeight: 2
      chargebackLabel: "cost-center/research-llm"
    - name: inference-prod
      deservedGpus: 16
      overQuotaWeight: 1
      chargebackLabel: "cost-center/product-inference"
---
apiVersion: mission-control.nvidia.com/v1
kind: MissionControlProject
metadata:
  name: llm-pretraining
  namespace: ai-research
spec:
  deservedGpus: 32
  overQuotaWeight: 2
  chargebackLabel: "cost-center/research-llm"
  maxOverQuotaGpus: 16  # can borrow up to 16 extra when cluster is idle
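
One way to read deservedGpus, overQuotaWeight, and maxOverQuotaGpus together: each project first receives its deserved share, then idle GPUs are split in proportion to over-quota weight, capped per project. The sketch below illustrates that interpretation only. The actual allocation is performed by the Run:ai scheduler, and the `Project` class, the `allocate` function, the demand values, and the borrow cap for inference-prod are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    deserved_gpus: int
    over_quota_weight: int
    max_over_quota_gpus: int
    demand: int  # GPUs the project is currently asking for (assumed input)


def allocate(projects, cluster_gpus):
    """Grant deserved quota first, then share idle GPUs by over-quota weight."""
    grant = {p.name: min(p.demand, p.deserved_gpus) for p in projects}
    idle = cluster_gpus - sum(grant.values())

    def borrowed(p):
        return max(0, grant[p.name] - p.deserved_gpus)

    while idle > 0:
        # Projects that still want GPUs and have not reached their borrow cap.
        hungry = [
            p for p in projects
            if grant[p.name] < p.demand and borrowed(p) < p.max_over_quota_gpus
        ]
        if not hungry:
            break
        # Hand out one GPU at a time; higher weight wins until it has borrowed more.
        best = max(hungry, key=lambda p: p.over_quota_weight / (1 + borrowed(p)))
        grant[best.name] += 1
        idle -= 1
    return grant


projects = [
    Project("llm-pretraining", 32, 2, 16, demand=48),
    Project("inference-prod", 16, 1, 8, demand=10),
]
result = allocate(projects, cluster_gpus=48)
print(result)
```

Here inference-prod only asks for 10 of its 16 deserved GPUs, so llm-pretraining can borrow the 6 idle ones while staying under its cap of 16.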

GPU-hours per project accumulate in Mission Control's cost attribution system. The chargeback webhook pushes hourly summaries to your cost management tool (Datadog, internal billing system, or a Prometheus counter scraped by Grafana).
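
To make the attribution step concrete, here is a minimal sketch of rolling usage samples up into GPU-hours and cost per chargeback label. The sample tuple layout and the summary shape are assumptions made for illustration; the real webhook payload is defined by Mission Control and is not documented in this guide.

```python
from collections import defaultdict


def summarize_gpu_hours(samples, rate_per_gpu_hour):
    """Aggregate (chargeback_label, gpus_in_use, duration_seconds) samples per label."""
    gpu_hours = defaultdict(float)
    for label, gpus, seconds in samples:
        gpu_hours[label] += gpus * seconds / 3600
    return {
        label: {
            "gpu_hours": round(hours, 2),
            "cost_usd": round(hours * rate_per_gpu_hour, 2),
        }
        for label, hours in gpu_hours.items()
    }


samples = [
    ("cost-center/research-llm", 32, 3600),       # 32 GPUs for a full hour
    ("cost-center/product-inference", 16, 1800),  # 16 GPUs for half an hour
    ("cost-center/product-inference", 8, 1800),   # 8 GPUs for half an hour
]
summary = summarize_gpu_hours(samples, rate_per_gpu_hour=5.92)
print(summary)
```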

For the reporting layer on top of this attribution data, the GPU FinOps and cost allocation guide covers how to build team-level GPU spend dashboards and budget alerts.

Fault-Tolerant LLM Training with Checkpoint Recovery

Mission Control's policy engine listens to DCGM health signals and triggers automatic workload rescheduling when a GPU node enters an unhealthy state. The recovery flow:

DCGM detects GPU error (ECC multi-bit error, XID fault, or utilization drop to 0)
  → Mission Control health monitor fires FaultTolerancePolicy
    → RunaiJob marked for preemption
      → checkpoint saved to DSX storage (or NFS mount)
        → BCM marks node unhealthy, removes from schedulable pool
          → Run:ai reschedules job on healthy nodes
            → torchrun resumes from last checkpoint file

The key manifest:

yaml

apiVersion: mission-control.nvidia.com/v1
kind: FaultTolerancePolicy
metadata:
  name: training-ft-policy
spec:
  trigger:
    dcgmEventCodes: ["XID 79", "XID 94", "XID 95"]  # GPU fell off the bus, contained ECC error, uncontained ECC error
    utilizationDropThreshold: 0.05  # GPU util drops to <5% unexpectedly
  action:
    type: CheckpointAndReschedule
    checkpointPath: /mnt/dsx/checkpoints/
    maxRescheduleAttempts: 3
    rescheduleDelay: 60s
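
The trigger logic this policy describes can be sketched as a small decision function. This is an illustration of the manifest's semantics as described above, not Mission Control's implementation; the `FaultPolicy` class, the `decide` function, and the "fail" outcome once attempts are exhausted are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FaultPolicy:
    xid_codes: set = field(default_factory=lambda: {79, 94, 95})
    utilization_drop_threshold: float = 0.05
    max_reschedule_attempts: int = 3


def decide(policy, xid_events, gpu_utilization, attempts_so_far):
    """Return 'continue', 'checkpoint_and_reschedule', or 'fail'."""
    xid_hit = bool(policy.xid_codes & set(xid_events))
    util_hit = gpu_utilization < policy.utilization_drop_threshold
    if not (xid_hit or util_hit):
        return "continue"
    if attempts_so_far >= policy.max_reschedule_attempts:
        return "fail"  # assumed behavior once maxRescheduleAttempts is used up
    return "checkpoint_and_reschedule"


policy = FaultPolicy()
print(decide(policy, xid_events=[79], gpu_utilization=0.92, attempts_so_far=0))
```

A utilization-only trigger is worth testing carefully in practice, because legitimate phases such as data loading or evaluation can also drop GPU utilization below 5%.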

The NeMo checkpoint config to pair with it:

yaml

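# In your NeMo training config (NeMo 1.x YAML layout: checkpoint settings live under exp_manager)
trainer:
  enable_progress_bar: true
exp_manager:
  checkpoint_callback_params:
    save_top_k: 3
    every_n_train_steps: 500  # checkpoint every 500 steps
    dirpath: /mnt/dsx/checkpoints/
    # Lightning inserts the metric name itself, so this yields e.g. step=500-val_loss=1.23.ckpt
    filename: "{step}-{val_loss:.2f}"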

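With torchrun, the resume path can be made automatic: the relaunched job passes the latest checkpoint to your training script (for example through a script argument such as --resume_from_checkpoint, which is defined by the script rather than by torchrun itself), and Mission Control ensures the checkpoint path is mounted on the rescheduled nodes.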
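
Finding the checkpoint to resume from can be sketched as a small helper that picks the file with the highest step number. It assumes checkpoint filenames contain `step=<n>` and end in `.ckpt`, which matches the filename pattern configured above; the helper name `latest_checkpoint` is illustrative.

```python
import re
from pathlib import Path
from typing import Optional

STEP_PATTERN = re.compile(r"step=(\d+)")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the .ckpt file with the highest step number, or None if none exist."""
    candidates = []
    for path in Path(checkpoint_dir).glob("*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare on the parsed step so step=1500 sorts after step=500.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_checkpoint("/mnt/dsx/checkpoints/"))
```

Parsing the step as an integer matters: a plain lexical sort would rank `step=500` above `step=1500`.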

For the detailed engineering of checkpoint formats, storage tiers, and spot-instance resilience patterns, the spot GPU training resilience and checkpointing guide covers the full stack.

SLO-Aware Inference Scheduling on H200 and B200

The Mission Control placement engine reads inference SLO manifests and selects GPU hardware based on current cluster state. TTFT targets drive hardware selection: H200's 141GB HBM3e handles memory-intensive large context windows, while B200's higher HBM3e bandwidth and FP4 throughput make it the better choice for high-throughput short-context serving.

Workload profile	TTFT target	Recommended hardware	Mission Control placement rule
70B model, long context (32K+)	<500ms	H200 SXM5	Place on nodes with DCGM util <60% and 141GB+ VRAM
70B model, high throughput batch	<2000ms	B200 SXM6	Place on nodes with highest memory bandwidth
7-13B fast interactive	<150ms	B200 SXM6	Place on lowest-latency nodes by queue depth
405B multi-GPU	<1000ms	H200 SXM5 (multi-node NVLink)	Place on NVLink-connected node groups via BCM topology

The InferenceDeployment manifest:

yaml

apiVersion: mission-control.nvidia.com/v1
kind: InferenceDeployment
metadata:
  name: llama4-maverick-inference
spec:
  model:
    registry: nvcr.io/nvidia
    name: llama-4-maverick
    precision: fp8
  replicas: 2
  slo:
    targetTTFT: 150ms          # time-to-first-token target
    minThroughput: 500          # tokens per second minimum
    p99Latency: 800ms           # per-token generation latency p99
  placement:
    preferHardware: [B200-SXM6, H200-SXM5]  # ranked preference
    avoidIfDCGMUtil: 0.80       # skip nodes above 80% GPU utilization
  project: inference-prod       # Run:ai project for quota accounting

Mission Control's placement engine queries DCGM utilization every 10 seconds and rebalances inference replicas across nodes if a node's utilization climbs past the avoidIfDCGMUtil threshold.
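
The placement rule expressed by preferHardware and avoidIfDCGMUtil can be sketched as a filter followed by a ranking. This is an illustration of the manifest's stated semantics under simple assumptions (one utilization number per node, ties broken by lower load); the `Node` class and `select_nodes` function are hypothetical and do not reflect the placement engine's internals.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    hardware: str           # e.g. "B200-SXM6"
    gpu_utilization: float  # 0.0-1.0, as reported by DCGM


def select_nodes(nodes, prefer_hardware, avoid_if_util, replicas):
    """Drop nodes above the utilization cap, then rank by hardware preference and load."""
    eligible = [
        n for n in nodes
        if n.hardware in prefer_hardware and n.gpu_utilization <= avoid_if_util
    ]
    ranked = sorted(
        eligible,
        key=lambda n: (prefer_hardware.index(n.hardware), n.gpu_utilization),
    )
    return [n.name for n in ranked[:replicas]]


nodes = [
    Node("b200-a", "B200-SXM6", 0.85),  # above the 80% cap, skipped
    Node("b200-b", "B200-SXM6", 0.40),
    Node("h200-a", "H200-SXM5", 0.20),
    Node("h200-b", "H200-SXM5", 0.70),
]
chosen = select_nodes(nodes, ["B200-SXM6", "H200-SXM5"], avoid_if_util=0.80, replicas=2)
print(chosen)
```

With only one B200 node under the cap, the second replica falls back to the least-loaded H200 node, which is the behavior the ranked preference list is meant to express.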

For the methodology behind TTFT budgets, ITL targets, and SLO decomposition across model serving tiers, the LLM inference SLO, TTFT, and latency budget guide covers the full framework.

Observability: Integrating DCGM Telemetry with Grafana and Langfuse

DCGM metrics flow from the GPU nodes to a Prometheus scrape target, which Mission Control's telemetry aggregator picks up and combines with Run:ai queue metrics and NeMo request traces.

Prometheus scrape config for Mission Control metrics:

yaml

# prometheus.yml
scrape_configs:
  - job_name: 'mission-control'
    static_configs:
      - targets: ['nvidia-mission-control.nvidia-mission-control.svc:9400']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter.gpu-operator.svc:9400']
    scrape_interval: 10s

  - job_name: 'runai'
    static_configs:
      - targets: ['runai-backend.runai-backend.svc:9999']
    scrape_interval: 30s

Grafana dashboard panel for GPU utilization per Run:ai project:

json

{
  "title": "GPU Utilization by Project",
  "type": "timeseries",
  "targets": [
    {
      "expr": "avg by (project_name) (runai_project_gpu_utilization{cluster='spheron-cluster-1'})",
      "legendFormat": "{{project_name}}"
    },
    {
      "expr": "avg by (modelName) (DCGM_FI_DEV_GPU_UTIL{namespace='inference-prod'})",
      "legendFormat": "GPU: {{modelName}}"
    }
  ],
  "fieldConfig": {
    "defaults": { "unit": "percent", "max": 100 }
  }
}

Langfuse sits on top of this for LLM request tracing. Configure NeMo to emit span data to Langfuse, then correlate Langfuse TTFT measurements against DCGM GPU utilization to identify whether latency is GPU-bound or queue-bound:

python

# Trace NeMo inference calls with Langfuse (Python SDK v2-style API)
import os
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com"
)

# nemo_client and prompt are placeholders for your own inference client and input
trace = langfuse.trace(name="llm-inference", metadata={"model": "llama-4-maverick"})
# Token usage is recorded on a generation observation rather than a plain span
generation = trace.generation(name="generate", model="llama-4-maverick", input=prompt)
response = nemo_client.generate(prompt=prompt, max_tokens=512)
generation.end(output=response.text, usage={"total": response.usage.total_tokens})
langfuse.flush()  # send buffered events before the process exits
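
Once TTFT from Langfuse and utilization from DCGM sit on the same timeline, the GPU-bound versus queue-bound distinction can be sketched as a simple rule. The 80% utilization threshold, the function name, and the category labels are assumptions chosen for illustration; tune them against your own traces.

```python
def classify_latency(ttft_ms, ttft_target_ms, gpu_utilization, queue_depth,
                     util_threshold=0.80):
    """Label one request sample by the most likely cause of a missed TTFT target."""
    if ttft_ms <= ttft_target_ms:
        return "within_slo"
    if gpu_utilization >= util_threshold:
        return "gpu_bound"    # GPUs saturated: add replicas or move to faster hardware
    if queue_depth > 0:
        return "queue_bound"  # GPUs have headroom but requests wait in the scheduler queue
    return "other"            # e.g. network, tokenization, or cold start


print(classify_latency(ttft_ms=420, ttft_target_ms=150, gpu_utilization=0.93, queue_depth=0))
print(classify_latency(ttft_ms=420, ttft_target_ms=150, gpu_utilization=0.35, queue_depth=12))
```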

For the full observability stack covering Langfuse, Arize Phoenix, and Helicone alongside GPU telemetry, the LLM observability guide covers the end-to-end setup.

Mission Control Cost on GPU Cloud: License, Ops Overhead, Break-Even

Mission Control is part of the NVIDIA AI Enterprise (NVAIE) subscription. NVAIE is priced per GPU per year through NVIDIA resellers; the exact tier depends on your commitment term and volume. For break-even analysis, you need to estimate the utilization improvement that Run:ai's fractional GPU sharing and quota borrowing provides.

Break-even math for a 32-GPU H200 cluster:

Without Mission Control (vanilla Kubernetes + manual ops):

Typical GPU utilization: 55-65% (idle time from job queuing, resource fragmentation)
At $5.92/hr × 32 GPUs × 720 hrs/mo = ~$136,397/mo compute
Effective compute at 60% util: ~$81,838/mo in useful work

With Mission Control (Run:ai fractional GPU + quota borrowing):

Typical GPU utilization: 75-85% (improved by fractional sharing and over-quota borrowing)
Same 32 GPUs × $5.92/hr × 720 hrs = ~$136,397/mo compute
Effective compute at 80% util: ~$109,117/mo in useful work

The utilization improvement adds ~$27,279/mo in useful compute from the same hardware. If NVAIE licensing runs under that figure for your GPU count, Mission Control pays for itself on utilization alone, before counting the operational savings from unified management.
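
The break-even arithmetic above can be reproduced in a few lines, which makes it easy to substitute your own rate, GPU count, and utilization estimates. The 60% and 80% values are the midpoints of the ranges quoted above; the per-GPU annual ceiling is simply the monthly gain annualized and divided by GPU count, not a quoted NVAIE price.

```python
HOURS_PER_MONTH = 720
GPUS = 32
RATE_PER_GPU_HOUR = 5.92  # on-demand H200 SXM5 rate used in this guide

spend = RATE_PER_GPU_HOUR * GPUS * HOURS_PER_MONTH
baseline_useful = spend * 0.60  # vanilla Kubernetes + manual ops
improved_useful = spend * 0.80  # fractional sharing + over-quota borrowing
monthly_gain = improved_useful - baseline_useful

# Licensing pays for itself while it costs less than the gain. Expressed per GPU
# per year because that is how NVAIE is priced.
break_even_per_gpu_year = monthly_gain * 12 / GPUS

print(f"Useful-compute gain: ${monthly_gain:,.0f}/mo")
print(f"License break-even ceiling: ${break_even_per_gpu_year:,.0f} per GPU per year")
```

The result is sensitive to the utilization assumption: if the real improvement is 10 points instead of 20, both figures halve.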

Mission Control + Spheron vs hyperscaler alternatives:

Option	GPU	On-demand rate	Control plane overhead	Lock-in
Spheron + Mission Control	H200 SXM5	$5.92/hr per GPU	Self-managed (Mission Control)	None
DGX Cloud (Azure)	H100	~$8-10/hr per GPU	Fully managed	Azure + NVIDIA
Azure NC H100 v3	H100	~$6-8/hr per GPU	Azure AKS	Azure
Spheron + Mission Control	B200 SXM6	$8.61/hr per GPU	Self-managed	None

Pricing fluctuates based on GPU availability. The prices above are as of 05 Jun 2026 and may have changed. Check current GPU pricing for live rates.

The ops overhead of self-managed Mission Control is real. You own BCM configuration, Helm upgrades, and GPU Operator updates. For teams without a dedicated MLOps engineer, DGX Cloud's managed offering may be worth the price premium. For teams with infra capability, the Spheron + Mission Control combination cuts GPU compute costs by roughly 30-50% versus hyperscaler-managed offerings.

When Mission Control Is Overkill

Mission Control is the right tool for large, multi-tenant, mixed-workload clusters. It is not the right tool for every GPU cluster.

Skip Mission Control if:

Your cluster is under 16 GPUs. The BCM provisioning layer and Run:ai control plane add 2-3 hours of setup and ongoing maintenance. Below 16 GPUs, vanilla Kubernetes with the GPU Operator handles scheduling without the overhead.
You have a single team. Multi-tenant quota management exists to prevent teams from starving each other. With one team, there is nothing to isolate.
You only run training jobs. If you have no inference workloads, the NeMo integration layer adds no value. Run Slurm with Pyxis for pure training clusters.
You are on a tight timeline. BCM setup, Mission Control federation, and NeMo configuration take a week minimum. If you need GPUs in 24 hours, start with a Kubernetes cluster and the GPU Operator.

Decision matrix:

Cluster size	# of teams	Workloads	Right tool
<16 GPUs	1	Training only	k3s + GPU Operator
<16 GPUs	1	Training + inference	Kubernetes + Kueue
16-64 GPUs	1-2	Training only	Slurm or Kubernetes + KAI Scheduler
16-64 GPUs	3+	Mixed	Mission Control
64+ GPUs	Any	Mixed	Mission Control
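
The decision matrix can be encoded as a lookup function, which is handy for capacity-planning scripts. The sketch covers only the rows listed above and reports combinations the matrix does not address instead of guessing; the function name and the "training"/"mixed" labels are illustrative.

```python
def recommend(gpus: int, teams: int, workloads: str) -> str:
    """Map cluster size, team count, and workload type ('training' or 'mixed') to a tool."""
    if gpus < 16 and teams == 1:
        return "k3s + GPU Operator" if workloads == "training" else "Kubernetes + Kueue"
    if gpus >= 64 and workloads == "mixed":
        return "Mission Control"
    if 16 <= gpus <= 64:
        if teams <= 2 and workloads == "training":
            return "Slurm or Kubernetes + KAI Scheduler"
        if teams >= 3 and workloads == "mixed":
            return "Mission Control"
    return "not covered by the matrix"


print(recommend(gpus=8, teams=1, workloads="training"))
print(recommend(gpus=32, teams=4, workloads="mixed"))
print(recommend(gpus=32, teams=2, workloads="mixed"))
```

The last call lands in a gap: 16-64 GPUs with one or two teams running mixed workloads is a judgment call the matrix leaves open.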

Migration from Kubeflow, Slurm, or Standalone Run:ai

From Kubeflow:

Kubeflow Pipelines (KFP) and Mission Control overlap on training pipeline orchestration. Mission Control replaces KFP's pipeline DAG runner for training jobs through Run:ai's TrainingWorkload CRD. Keep Kubeflow's model registry if it integrates with your MLflow tracking. Replace the KFP execution backend with Run:ai. The NeMo model serving component replaces KServe (formerly KFServing) for NVIDIA models specifically.

Components to keep: model registry, experiment tracking (MLflow), data versioning (DVC).

Components Mission Control replaces: pipeline DAG executor, resource scheduler, model serving layer.

From Slurm:

BCM provides a Slurm bridge that translates sbatch job submissions into Run:ai workloads. This lets you migrate incrementally: submit existing Slurm scripts through BCM, which wraps them in RunaiJob manifests and schedules them through Run:ai. Teams continue using sbatch scripts while the infrastructure layer migrates underneath.

bash

# BCM Slurm bridge: submit existing sbatch scripts
bcm submit --scheduler runai --wrap "sbatch train.sh" \
  --gpus 8 \
  --project llm-pretraining \
  --checkpoint-path /mnt/dsx/checkpoints/

From standalone Run:ai:

This is the simplest migration. The Run:ai cluster engine continues running unchanged on your GPU nodes. Mission Control wraps it by registering the existing cluster engine as a managed sub-component. You do not reinstall GPU worker components. The upgrade path:

Install BCM and point it at your existing Kubernetes cluster.
Install the Mission Control operator and configure runai.controlPlane.url to point at your existing Run:ai control plane.
Mission Control discovers existing Run:ai projects and departments automatically.
Add DCGM integration and NeMo deployments as you're ready.

For the broader infrastructure design context, the production GPU cloud architecture guide covers the full reliability and storage stack that Mission Control sits on top of.

Mission Control runs on any NVIDIA datacenter GPU cluster. You do not need DGX Cloud to use the full NVIDIA AI factory stack. Spheron offers bare-metal H200 SXM5 and B200 SXM6 nodes with per-minute billing and no long-term lock-in.

Quick Setup Guide

Provision bare-metal H200 or B200 GPU nodes on Spheron
Rent H200 SXM5 or B200 SXM6 instances on Spheron with reserved commitments for the control plane nodes and on-demand or spot allocation for worker nodes. Provision a CPU-only control plane instance for Kubernetes (k3s or kubeadm), then join the GPU worker nodes. Verify the GPU Operator and DCGM exporter are running before installing Mission Control components.
Install BCM and the Mission Control control plane via Helm
Add the NGC Helm registry and install Base Command Manager with helm upgrade -i bcm-control-plane. Configure the bcm-values.yaml with your cluster endpoint, GPU node inventory, and storage class. BCM manages provisioning lifecycle and exposes a REST API that Mission Control uses to track node health and run firmware updates.
Deploy the Run:ai cluster engine and connect it to Mission Control
Install the Run:ai cluster engine Helm chart and point controlPlane.url at the Mission Control endpoint rather than standalone app.run.ai. Mission Control federates the Run:ai project and quota configuration, so you define teams and GPU quotas in the Mission Control UI and they propagate to Run:ai's scheduling layer automatically.
Configure multi-tenant quota policies and chargeback labels
In the Mission Control dashboard, create a Department per business unit and Projects per team. Set deservedGpus and over-quota weights per project. Enable the cost attribution webhook and configure chargeback labels that map GPU-hours to cost centers. The GPU FinOps dashboard covers the reporting layer.
Deploy NeMo microservices under Mission Control for inference
Create a NeMo deployment manifest referencing your model registry. Mission Control's SLO-aware placement engine reads target TTFT and throughput SLOs from the manifest and selects H200 or B200 nodes based on current queue depth and GPU utilization from DCGM. The placement decision is visible in the Mission Control workload view alongside training queue status.
