Deploy NVIDIA Nemotron 3 Ultra on GPU Cloud: Self-Host the 550B Reasoning Model (2026)

NVIDIA released Nemotron 3 Ultra on June 4, 2026, and the defining number is 550B total parameters with 55B active per token. That active-to-total ratio is the whole point: you get frontier reasoning quality at a compute cost much closer to a 55B dense model than a 550B one. What sets this apart from earlier Nemotron releases is the post-training recipe. NVIDIA's headline technique is Multi-Teacher On-Policy Distillation (MOPD): more than 10 specialized teacher models guide training on the student's own generated outputs across coding, reasoning, tool use, and agent action sequences. RLVR (RL with Verifiable Rewards) is an earlier stage in the pipeline. The result is a model that generalizes better on agentic workflows rather than overperforming on one benchmark category.

Before getting into deployment specifics, if you are already running smaller NVIDIA models, the Nemotron 3 Super deployment guide covers the 120B/12B hybrid Mamba-Transformer tier for teams who need single-GPU deployment. For baseline vLLM multi-node server setup, see vLLM production deployment.

Nemotron 3 Ultra Architecture: LatentMoE Hybrid with MOPD Post-Training

Nemotron 3 Ultra is not a standard transformer MoE. The architecture combines three components in a single hybrid stack: Mamba-2 SSM layers, MoE feed-forward layers, and standard attention layers. NVIDIA calls this LatentMoE. It also includes Multi-Token Prediction (MTP) for native speculative decoding without a separate draft model.

The MoE component handles the parameter-to-compute ratio. The model has 550B total parameters across experts, but only 55B activate on any given forward pass. This is what makes the GPU math interesting: you need 550 GB of VRAM to hold all expert weights (at FP8), but the compute per token is equivalent to a 55B dense model. You pay for storage, not for compute.
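
The storage-versus-compute split can be sketched with a few lines of arithmetic. This is a rough illustration only: it assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, which is an approximation introduced here and not a figure from NVIDIA.

```python
# Rough storage-vs-compute comparison for a sparse MoE model.
# Assumption: a forward pass costs about 2 FLOPs per active parameter per token.

TOTAL_PARAMS = 550e9   # every expert must be resident in VRAM
ACTIVE_PARAMS = 55e9   # parameters actually used for one token


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass GFLOPs needed to produce one token."""
    return 2 * active_params / 1e9


print(f"FP8 weights to store:       {weight_gb(TOTAL_PARAMS, 1):.0f} GB")
print(f"Compute per token (MoE):    {gflops_per_token(ACTIVE_PARAMS):.0f} GFLOPs")
print(f"Compute per token if dense: {gflops_per_token(TOTAL_PARAMS):.0f} GFLOPs")
```

Storage follows the total parameter count while per-token compute follows the active count, which is why the two numbers differ by a factor of ten.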

The Mamba-2 layers bring a different benefit. An SSM layer carries a fixed-size recurrent state, so its compute grows linearly with sequence length instead of quadratically, and it adds nothing to the KV cache, which only the attention layers need. For long reasoning chains, this matters: a multi-step agent trace that runs to 32K tokens consumes far less KV cache than it would in a purely attention-based model. Combined with a 1M token context window, the architecture handles extended agentic workflows without the memory wall that hits pure-transformer models.
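
To make the memory argument concrete, here is a minimal sketch of how KV cache size scales with the number of attention layers. Every dimension below (layer counts, KV heads, head size) is a hypothetical placeholder, not Nemotron 3 Ultra's published configuration.

```python
# KV cache grows with sequence length and with the number of ATTENTION layers.
# SSM layers keep a fixed-size state, so they add nothing per token.
# All layer counts and dimensions here are hypothetical, for illustration only.

def kv_cache_gb(seq_len: int, attn_layers: int, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: keys plus values for every attention layer."""
    bytes_per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * bytes_per_token / 1e9


SEQ_LEN = 32_768
pure_attention = kv_cache_gb(SEQ_LEN, attn_layers=96)  # hypothetical all-attention stack
hybrid = kv_cache_gb(SEQ_LEN, attn_layers=12)          # hypothetical hybrid, mostly SSM layers

print(f"Pure attention: {pure_attention:.2f} GB per 32K-token sequence")
print(f"Hybrid stack:   {hybrid:.2f} GB per 32K-token sequence")
```

Under these assumptions the saving is simply the ratio of attention layer counts; the real figure depends on the actual layer mix in the checkpoint's config.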

Multi-Teacher On-Policy Distillation (MOPD) is the headline claim NVIDIA makes for Nemotron 3 Ultra's agentic accuracy. MOPD uses more than 10 specialized teacher models to guide training on the student's own generated outputs (on-policy rollouts) across coding, reasoning, tool use, and multi-step agent workflows simultaneously. RLVR (RL with Verifiable Rewards) runs earlier in the pipeline; MOPD then preserves and extends those RL gains. The practical consequence is a model that generalizes better across agentic workflows rather than overfitting to one task domain.

Nemotron 3 Family: Choosing the Right Tier

Model	Total Params	Active Params	Architecture	Min GPUs
Nemotron 3 Nano	~8B	~8B (dense)	Hybrid Mamba-MoE Transformer	1x A100
Nemotron 3 Super	120B	12B	Hybrid Mamba-MoE	1x H100 (NVFP4)
Nemotron Ultra 253B	253B	253B (dense)	Dense Transformer	8x H100
Nemotron 3 Ultra	550B	55B	LatentMoE (Mamba-2 + MoE + Attention)	8x H200 (FP8)

When to choose each tier:

Nemotron 3 Nano: Single-GPU serving, edge deployment, or cost-constrained inference pipelines where 8B quality is acceptable
Nemotron 3 Super: Production serving on a single H100, long-context workloads where SSM memory savings matter, SWE-Bench class coding tasks at minimal cost
Nemotron Ultra 253B: Dense architecture where MoE routing overhead is undesirable, 8x H100 budget already committed, tasks requiring consistent active compute across all tokens
Nemotron 3 Ultra: Highest quality agentic and reasoning workloads, budget for 8x H200 or 16x H100, workflows that benefit from MOPD post-training generalization across coding, reasoning, and tool use

GPU Hardware Planning: VRAM and Interconnect for a 550B MoE

The fundamental rule for MoE deployment: you load all parameters into VRAM, not just the active ones. Every expert's weights must be resident so the router can dispatch tokens to any of them on demand. For Nemotron 3 Ultra, that means 550B parameters regardless of the 55B active count.

The VRAM formula:

total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget

The 1.15 multiplier covers activation memory, framework overhead, and routing dispatch buffers. KV cache depends on your context length and batch size targets.

Precision	Weight Size	+15% Overhead	Min GPU Config	Notes
BF16	1,100 GB	~1,265 GB	2x (8x H100 SXM5) = 1,280 GB	Requires multi-node; leaves only ~15 GB for KV cache
FP8	550 GB	~633 GB	8x H200 SXM5 (1,128 GB)	Single node, comfortable headroom
NVFP4/FP4	275 GB	~316 GB	8x H100 (640 GB) or 4x H200	Best on Blackwell; H100 via runtime dequant
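
The table's figures follow directly from the formula. The sketch below reproduces them; the function names are illustrative, and treating NVFP4 as exactly 0.5 bytes per parameter ignores the small per-block scale factors, which is a simplification.

```python
# Reproduce the sizing table from:
#   total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget

OVERHEAD = 1.15          # activations, framework overhead, dispatch buffers
TOTAL_PARAMS_B = 550     # billions of parameters, all experts included

# Assumption: NVFP4 counted as 0.5 bytes/param, block scale factors ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weights_plus_overhead_gb(params_b: float, bytes_per_param: float,
                             overhead: float = OVERHEAD) -> float:
    """Weights plus the 15% overhead, in GB, before any KV cache."""
    return params_b * bytes_per_param * overhead


def total_vram_gb(params_b: float, bytes_per_param: float,
                  kv_cache_budget_gb: float = 0.0) -> float:
    """Full formula: weights, overhead, and the KV cache budget you choose."""
    return weights_plus_overhead_gb(params_b, bytes_per_param) + kv_cache_budget_gb


for name, width in BYTES_PER_PARAM.items():
    weights = TOTAL_PARAMS_B * width
    padded = weights_plus_overhead_gb(TOTAL_PARAMS_B, width)
    print(f"{name:6s} weights {weights:7.1f} GB   with overhead {padded:7.1f} GB")
```

Pass your own `kv_cache_budget_gb` to `total_vram_gb` once you know the context length and concurrency you are targeting.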

Interconnect Requirements

Within a single node, NVLink 4.0 (SXM variants) runs at 900 GB/s bidirectional. This bandwidth is what makes tensor parallelism at TP=8 viable. PCIe interconnects top out around 64 GB/s and create a severe all-reduce bottleneck when each forward pass requires synchronizing tensors across 8 GPUs. For Nemotron 3 Ultra, SXM variants are not optional.

For multi-node BF16 deployments (16x H100 across two nodes), InfiniBand handles cross-node communication for pipeline parallel stages. 400 Gb/s or 800 Gb/s InfiniBand is the right call here; 100 Gb/s Ethernet between nodes will produce visible throughput degradation. The distributed LLM training guide covers the multi-node networking setup in more detail.
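
A back-of-the-envelope estimate shows why the interconnect gap matters. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce and ignores latency; the 64 MB message size is a hypothetical value, and plugging the quoted link figures in as effective bandwidth is a simplification.

```python
# Bandwidth-only estimate of one ring all-reduce across a tensor-parallel group.
# In a ring all-reduce each GPU transfers 2 * (N - 1) / N times the message size.

NVLINK_GB_S = 900   # figure quoted above for NVLink 4.0
PCIE_GB_S = 64      # figure quoted above for PCIe
MESSAGE_MB = 64     # hypothetical activation tensor size


def ring_allreduce_ms(message_mb: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Milliseconds for one all-reduce, counting transfer time only."""
    traffic_factor = 2 * (num_gpus - 1) / num_gpus
    seconds = traffic_factor * (message_mb / 1000) / link_gb_per_s
    return seconds * 1000


nvlink_ms = ring_allreduce_ms(MESSAGE_MB, 8, NVLINK_GB_S)
pcie_ms = ring_allreduce_ms(MESSAGE_MB, 8, PCIE_GB_S)
print(f"NVLink: {nvlink_ms:.3f} ms   PCIe: {pcie_ms:.3f} ms   ratio: {pcie_ms / nvlink_ms:.1f}x")
```

Because this cost is paid repeatedly in every layer of every forward pass, the per-operation gap compounds into the bottleneck described above.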

GPU Tier Recommendations

Pricing fetched from the Spheron API on 12 Jun 2026:

Config	Precision	VRAM	$/hr (Spheron)	Best For
8x H200 SXM5	FP8	1,128 GB	8 × $4.84 = $38.72/hr on-demand	Single-node production (recommended)
16x H100 SXM5	FP8	1,280 GB	16 × ~$3.92 = $62.72/hr on-demand	Multi-node; lower per-GPU rate but higher total cost than 8x H200
8x H100 SXM5	NVFP4	640 GB	8 × $3.92 = $31.36/hr on-demand	Budget single-node (H100 dequant overhead)
8x B200 SXM6	FP8	1,536 GB	8 × $2.71 = $21.68/hr spot	Blackwell native, best throughput/$

Pricing fluctuates based on GPU availability. The prices above are as of 12 Jun 2026 and may have changed; check current GPU pricing for live rates.

Deploying Nemotron 3 Ultra with vLLM

Prerequisites

Python 3.10+
CUDA 12.4+ for H100/H200 (Hopper architecture); CUDA 12.8+ for B200/B300 (Blackwell)
vLLM (use the version NVIDIA recommends in its serving guide; NVIDIA's model card references vllm/vllm-openai:v0.22.0 at launch)
Ray (required for multi-node setups only)
550 GB free storage per node for the FP8 checkpoint (1,100 GB for BF16)

Installation

```bash
# All nodes - use the version NVIDIA recommends in its serving guide
pip install vllm ray

# Verify GPU visibility
nvidia-smi
```

Ray Setup (Multi-Node Only)

```bash
# Head node
ray start --head --port=6379

# Worker nodes (replace with actual head node IP)
ray start --address=<HEAD_NODE_IP>:6379

# Verify Ray cluster health
python -c "import ray; ray.init(address='auto'); print(ray.available_resources())"
```
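
For a scripted version of the health check, a minimal sketch might compare the GPU count Ray reports against what you expect. The helper names are illustrative; `ray.init` and `ray.cluster_resources` are the only Ray calls used.

```python
# Fail fast if the Ray cluster does not expose the expected number of GPUs.

def missing_gpus(resources: dict, expected_gpus: int) -> int:
    """Number of GPUs missing from a Ray resource dict (0 means healthy)."""
    found = int(resources.get("GPU", 0))
    return max(expected_gpus - found, 0)


def check_live_cluster(expected_gpus: int = 16) -> None:
    """Connect to the running cluster and exit with an error if GPUs are missing."""
    import ray  # imported here so missing_gpus() is usable without Ray installed

    ray.init(address="auto")
    short = missing_gpus(ray.cluster_resources(), expected_gpus)
    if short:
        raise SystemExit(f"Ray cluster is missing {short} of {expected_gpus} GPUs")
    print(f"Ray cluster healthy: {expected_gpus} GPUs visible")
```

Run `check_live_cluster(16)` on the head node once both nodes have joined, and launch vLLM only if it passes.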

Model Download

NVIDIA's primary official checkpoints are BF16 and NVFP4. A separate pre-quantized FP8 checkpoint may or may not be published. If nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 does not exist on HuggingFace, download BF16 and use vLLM's --quantization fp8 flag for runtime FP8 quantization.

```bash
# BF16 checkpoint (~1,100 GB) - confirmed available
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 \
  --local-dir /models/nemotron-3-ultra

# NVFP4 checkpoint (~275 GB) - confirmed available; best for Blackwell
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --local-dir /models/nemotron-3-ultra-nvfp4

# FP8 checkpoint (~550 GB) - verify availability on NVIDIA's release page
# If not available, use BF16 with --quantization fp8 in vLLM
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 \
  --local-dir /models/nemotron-3-ultra-fp8
```

Verify repository names against NVIDIA's release page before downloading. HuggingFace repo names can shift between announcement and public availability.
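
One way to automate that fallback is to probe for the FP8 repository first and drop back to BF16 if it is absent. This is a sketch, not an official workflow: the helper names are illustrative, and the repository names are the ones quoted in this guide, which still need checking against NVIDIA's release page. It relies on `HfApi.repo_exists` and `snapshot_download` from the `huggingface_hub` package.

```python
from typing import Callable, Iterable, Optional

# Repository names as quoted in this guide; verify before relying on them.
PREFERRED_REPOS = [
    "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8",
    "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
]


def pick_repo(candidates: Iterable[str],
              exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate for which exists() is true, else None."""
    for repo_id in candidates:
        if exists(repo_id):
            return repo_id
    return None


def download_first_available(local_dir: str) -> str:
    """Download the first preferred repo that exists on the Hub."""
    from huggingface_hub import HfApi, snapshot_download

    repo_id = pick_repo(PREFERRED_REPOS, HfApi().repo_exists)
    if repo_id is None:
        raise SystemExit("No expected repository found; check NVIDIA's release page")
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return repo_id
```

If the BF16 repository is the one that gets downloaded, remember to keep `--quantization fp8` on the `vllm serve` command so the weights are quantized at load time.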

Single-Node 8x H200 FP8 (Recommended)

This is the primary path. A single 8x H200 SXM5 node with 141 GB per GPU gives 1,128 GB total, roughly 495 GB above the ~633 GB FP8 weight-plus-overhead figure. With --gpu-memory-utilization 0.90, vLLM claims about 1,015 GB of that, so the practical KV cache budget is closer to 380 GB. That budget serves concurrent requests at 32K context length.

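```bash
# If you downloaded the BF16 checkpoint instead, point this at /models/nemotron-3-ultra;
# --quantization fp8 then quantizes the weights at load time.
vllm serve /models/nemotron-3-ultra-fp8 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --trust-remote-code \
  --port 8000
```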

Multi-Node 16x H100 FP8 with Pipeline Parallelism

Two 8x H100 SXM5 nodes connected via InfiniBand. Each node runs TP=8 across its NVLink domain; pipeline parallelism handles the cross-node boundary.

```bash
# Run on head node only after Ray cluster is healthy
vllm serve /models/nemotron-3-ultra-fp8 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --trust-remote-code \
  --port 8000
```

The --gpu-memory-utilization 0.88 (vs 0.90 on single-node) leaves headroom for pipeline boundary activation buffers that would not exist in a single-node setup.

Single-Node 8x H100 NVFP4 (Budget Path)

The NVFP4 checkpoint is 275 GB, comfortably under 640 GB total VRAM on 8x H100 80GB. The tradeoff: H100 does not have native FP4 tensor cores. vLLM dequantizes NVFP4 to FP8 at runtime, so you get the memory savings but not Blackwell's native FP4 throughput gain.

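```bash
# The accepted --quantization value for NVFP4 depends on the vLLM version
# (some releases name it modelopt_fp4); confirm with `vllm serve --help`.
vllm serve /models/nemotron-3-ultra-nvfp4 \
  --quantization nvfp4 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --trust-remote-code \
  --port 8000
```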
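
Before launching any of the three configurations, it can help to sanity-check the per-GPU memory budget. The sketch below assumes weights and overhead split evenly across all GPUs, which real tensor, expert, and pipeline parallel layouts only approximate; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ServingConfig:
    name: str
    gpus: int
    vram_per_gpu_gb: float
    gpu_memory_utilization: float
    weights_plus_overhead_gb: float

    def kv_budget_per_gpu_gb(self) -> float:
        """Memory left per GPU for KV cache, assuming an even weight split."""
        usable = self.vram_per_gpu_gb * self.gpu_memory_utilization
        weight_share = self.weights_plus_overhead_gb / self.gpus
        return usable - weight_share


CONFIGS = [
    ServingConfig("8x H200 FP8", 8, 141, 0.90, 633),
    ServingConfig("16x H100 FP8", 16, 80, 0.88, 633),
    ServingConfig("8x H100 NVFP4", 8, 80, 0.88, 316),
]

for cfg in CONFIGS:
    print(f"{cfg.name:14s} ~{cfg.kv_budget_per_gpu_gb():.1f} GB per GPU left for KV cache")
```

A negative result means the configuration cannot hold the weights at that utilization setting; a small positive one means little room for concurrent long-context requests.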

Key vLLM Flag Reference

Flag	Value	Why
`--enable-expert-parallel`	enabled	Places whole experts on individual GPUs instead of sharding every expert across all of them; MoE layers then exchange tokens via all-to-all rather than all-reducing sharded expert outputs
`--tensor-parallel-size`	8 (per node)	Splits attention and SSM layers across 8 GPUs via NVLink
`--pipeline-parallel-size`	2 (multi-node only)	Splits model layers across 2 nodes
`--distributed-executor-backend`	ray (multi-node only)	Ray handles inter-node coordination
`--gpu-memory-utilization`	0.88-0.90	Headroom for expert dispatch buffers and pipeline boundary activations
`--max-model-len`	32768	Safe default for reasoning workloads; increase to 131072 or higher if VRAM headroom permits
`--trust-remote-code`	required	NVIDIA uses custom model code in the checkpoint

Note: verify these flags against your installed vLLM version's vllm serve --help. The --distributed-executor-backend and --pipeline-parallel-size flags may behave differently between vLLM minor versions.
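
If you script deployments, one option is to generate the command line from a single function so the single-node and multi-node variants cannot drift apart. This sketch uses only the flags listed in the table above; the function name is illustrative.

```python
import shlex
from typing import List


def vllm_serve_args(model_path: str, quantization: str = "fp8",
                    tensor_parallel: int = 8, pipeline_parallel: int = 1,
                    gpu_memory_utilization: float = 0.90,
                    max_model_len: int = 32768, port: int = 8000) -> List[str]:
    """Build the `vllm serve` argument list used in this guide."""
    args = [
        "vllm", "serve", model_path,
        "--quantization", quantization,
        "--tensor-parallel-size", str(tensor_parallel),
        "--enable-expert-parallel",
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
        "--port", str(port),
    ]
    if pipeline_parallel > 1:
        # The multi-node path adds pipeline parallelism and the Ray backend.
        args += ["--pipeline-parallel-size", str(pipeline_parallel),
                 "--distributed-executor-backend", "ray"]
    return args


single = vllm_serve_args("/models/nemotron-3-ultra-fp8")
multi = vllm_serve_args("/models/nemotron-3-ultra-fp8",
                        pipeline_parallel=2, gpu_memory_utilization=0.88)
print(shlex.join(single))
print(shlex.join(multi))
```

The resulting lists can be passed straight to `subprocess.run`, or printed as above and pasted into a shell.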

Expert-Parallel Serving: How Routing Affects Throughput

Standard tensor parallelism splits every weight matrix across GPUs. Each GPU computes a shard of the result, then an all-reduce synchronizes the partial outputs. With Megatron-style sharding, that typically means two all-reduces per transformer layer per forward pass (one after the attention block, one after the feed-forward block), and at TP=8 each of them spans all 8 GPUs.

Expert parallelism takes a different approach. Each GPU holds a complete subset of expert networks. Routing dispatches each token to whichever GPU owns the relevant experts via an all-to-all communication pattern. GPUs then compute locally on their assigned experts and gather results.

The advantage for large MoE models: the all-to-all communication volume is proportional to the number of routed tokens, not the full model state. For batches where tokens spread across many experts, the GPU utilization stays high because different GPUs are computing on different token-expert pairs simultaneously. The more diverse the batch (different tokens selecting different experts), the better the GPU utilization.

This is why larger batches help with MoE expert parallelism in a way that does not apply to dense models. A batch of 1 token routes to a small set of experts, leaving most GPU expert capacity idle. A batch of 64 tokens routes to a much wider set of experts, spreading the compute load. For production serving of a model like Nemotron 3 Ultra, serving with a minimum batch size of 8-16 requests will significantly improve GPU utilization versus single-request serving.
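
The batch-size effect can be approximated with a simple probability model. The sketch below assumes each token picks its top-k experts uniformly at random and independently of other tokens, which real routers do not do exactly; the expert count and top-k value are hypothetical, not Nemotron 3 Ultra's actual configuration.

```python
# Expected share of experts that receive at least one token in a batch,
# under uniform, independent routing (a simplifying assumption).

NUM_EXPERTS = 128   # hypothetical
TOP_K = 8           # hypothetical


def expected_active_fraction(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected fraction of experts with at least one routed token."""
    miss_one_token = 1 - top_k / num_experts   # chance one token skips a given expert
    return 1 - miss_one_token ** batch_tokens


for batch in (1, 8, 16, 64):
    share = expected_active_fraction(batch, NUM_EXPERTS, TOP_K)
    print(f"batch of {batch:3d} tokens -> {share:.1%} of experts busy")
```

With these placeholder numbers a single token touches only top_k / num_experts of the experts, while a batch of 64 tokens reaches nearly all of them, which is the utilization effect described above.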

For deeper analysis of expert parallelism tuning, including when to prefer EP over TP and how all-to-all communication patterns change with different routing configurations, see the MoE inference optimization guide.

Reasoning Workload Tuning: Throughput, Latency, and Cost Per Token

Nemotron 3 Ultra generates reasoning chains before producing final answers when configured with enable_thinking=True in the chat template. A chain-of-thought trace for a complex math or coding problem can run 2,000 to 8,000 tokens before the final response. This changes the throughput math substantially compared to a chat model that gives short direct answers.

Time-To-First-Token vs Inter-Token Latency: For reasoning workloads, TTFT includes the full prefill of the system prompt and user message. The thinking chain runs during decode. If your SLO is sensitive to total response time rather than time-to-first-token, the relevant metric is the full decode latency for the reasoning chain plus the answer. Setting a shorter max_tokens budget for the thinking chain reduces total latency at the cost of potentially truncated reasoning.
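
A request that turns thinking on and caps total generation might look like the sketch below. It assumes the server accepts a `chat_template_kwargs` field for passing `enable_thinking` to the chat template, which you should confirm against your vLLM version; the function names are illustrative. Note that `max_tokens` limits the reasoning chain and the final answer together.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "/models/nemotron-3-ultra-fp8",
                       thinking: bool = True, max_tokens: int = 4096) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Caps all generated tokens: reasoning chain plus final answer.
        "max_tokens": max_tokens,
        # Assumption: forwarded to the chat template by the serving layer.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)
```

Timing `send(build_chat_request(...))` end to end gives the total response time that matters for reasoning SLOs, as opposed to time-to-first-token alone.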

Chunked prefill: Enable --enable-chunked-prefill for long system prompts or long-context inputs. This pipelines the prefill across multiple forward passes and can improve throughput by overlapping prefill computation with decode of other active requests. The Mamba-2 SSM layers in Nemotron 3 Ultra use a different memory access pattern than pure attention, so test chunked prefill correctness on your specific checkpoint before enabling in production.

KV cache pressure for reasoning chains: Each token in a reasoning chain extends the KV cache for attention layers. At 32K context, a 4,096-token reasoning chain plus 512-token input uses 14% of the context window. At 131K context, the same trace is trivial. For serving many concurrent requests with extended reasoning, the KV cache budget per request increases. See the KV cache optimization guide for sizing the cache budget against your concurrency targets.

Throughput estimates: At FP8 on 8x H200 SXM5, expect roughly 500-900 tokens/sec total throughput depending on batch size and reasoning chain length. Multi-node 16x H100 FP8 is broadly comparable but with inter-node communication overhead at each pipeline stage boundary. These are estimates based on the model architecture; your actual figures will depend on your vLLM version, batch composition, and reasoning chain lengths.

Cost Analysis: Spheron vs Hyperscaler and Managed Inference

Monthly cost at continuous 24/7 use (730 hours), using live Spheron pricing from 12 Jun 2026:

Config	$/hr	$/month (continuous)	Use Case
8x H200 SXM5 on-demand	$38.72 ($4.84/GPU)	~$28,266	Committed production
8x H200 SXM5 spot	$14.56 ($1.82/GPU)	~$10,629	Batch / non-latency-critical
16x H100 SXM5 on-demand	$62.72 (~$3.92/GPU)	~$45,786	Multi-node BF16 or FP8
8x B200 SXM6 spot	$21.68 ($2.71/GPU)	~$15,826	Blackwell, best throughput/$

Break-even against managed inference APIs: Managed API pricing for frontier reasoning models runs roughly $4-8 per million tokens. At $38.72/hr on-demand for 8x H200, and estimating 700 tokens/sec throughput, the cost per million tokens self-hosted is roughly $15.37/M, which is above both API price points. In monthly terms, the ~$28,266 node cost equals approximately 7.1 billion tokens at $4/M or 3.5 billion tokens at $8/M, but a node sustaining 700 tokens/sec produces at most about 1.8 billion tokens in 730 hours. On-demand self-hosting therefore breaks even only if batching pushes sustained throughput to roughly 2,700 tokens/sec (against $4/M) or 1,350 tokens/sec (against $8/M).

If your workload is episodic rather than continuous, spot instances change the calculation substantially. At $14.56/hr spot for 8x H200, the equivalent cost-per-token drops to roughly $5.78/M at the same 700 tokens/sec throughput. That already undercuts an $8/M API; against a $4/M API, break-even needs about 1,010 tokens/sec sustained, or around 2.7 billion tokens per month.

Cost-per-token calculation method: hourly rate / (throughput_tokens_per_sec x 3600) = cost per token. At $38.72/hr and 700 tokens/sec: $38.72 / (700 x 3600) ≈ $0.00001537 per token ≈ $15.37 per million tokens.
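
The same method is easy to script so you can plug in your own measured throughput. This sketch uses the hourly rates quoted above; the function names are illustrative, and the 700 tokens/sec figure remains an estimate.

```python
HOURS_PER_MONTH = 730


def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Self-hosted cost per million tokens at a sustained throughput."""
    return hourly_usd / (tokens_per_sec * 3600) * 1e6


def break_even_tokens_per_sec(hourly_usd: float, api_usd_per_million: float) -> float:
    """Sustained throughput at which self-hosting matches the API price."""
    return hourly_usd / api_usd_per_million * 1e6 / 3600


def monthly_token_capacity(tokens_per_sec: float) -> float:
    """Tokens one node can produce in a month of continuous serving."""
    return tokens_per_sec * 3600 * HOURS_PER_MONTH


for label, rate in (("on-demand", 38.72), ("spot", 14.56)):
    print(f"8x H200 {label}: ${cost_per_million_tokens(rate, 700):.2f}/M at 700 tok/s, "
          f"break-even vs $4/M at {break_even_tokens_per_sec(rate, 4):.0f} tok/s")
print(f"Monthly capacity at 700 tok/s: {monthly_token_capacity(700) / 1e9:.2f}B tokens")
```

Comparing the break-even throughput with what your benchmark actually sustains is a more direct test than comparing monthly token volumes, because it accounts for the node's capacity ceiling.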


Quantization Options: FP8, NVFP4, and BF16

Format	VRAM (550B)	Throughput vs BF16	Quality vs BF16	Best Hardware
BF16	~1,100 GB	Baseline	Reference	2x (8x H100) nodes
FP8	~550 GB	~1.5-1.8x	~98%	8x H200 or 16x H100
NVFP4	~275 GB	~2.5-3x on Blackwell	~95-96%	8x B200 (native) / 8x H100 (via dequant)

A few practical notes:

FP8 is the recommended precision for H200. The quality drop from BF16 to FP8 is negligible on reasoning tasks, and fitting the entire model on a single 8x H200 node eliminates the multi-node networking complexity. NVIDIA's primary pre-quantized releases are BF16 and NVFP4, so if a dedicated FP8 checkpoint is not available, download BF16 and use vLLM's --quantization fp8 flag for runtime quantization. Choose BF16 only if you have a specific accuracy requirement that FP8 does not meet.

NVFP4 on H100 is memory-efficient, not compute-efficient. H100 lacks native FP4 tensor cores. vLLM dequantizes NVFP4 weights to FP8 before computation, so you get the VRAM benefit (275 GB vs 550 GB) but you lose the Blackwell throughput advantage. Use NVFP4 on H100 when VRAM is the constraint, not when chasing throughput.

INT4 is not recommended for reasoning workloads. Quality degradation at INT4 compounds across multi-step chain-of-thought traces. A small error in one reasoning step can propagate forward and produce incorrect conclusions several steps later. The bit savings are not worth the accuracy risk for agentic workflows.

Benchmark Comparison: Nemotron 3 Ultra vs Other Frontier Models

Published benchmark results from the NVIDIA Nemotron 3 Ultra technical report and model card (June 2026):

Benchmark	Nemotron 3 Ultra 550B	Notes
GPQA (no tools)	87.0	NVIDIA-published; measures graduate-level science reasoning
LiveCodeBench v6	89.0	NVIDIA-published; functional code generation
SWE-Bench Verified	65.0-70.4 (five agent harnesses)	NVIDIA-published range across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent; repository-level code editing
IMOAnswerBench	88.6	NVIDIA-published; competition math reasoning
RULER @ 1M tokens	94.7	NVIDIA-published; long-context retrieval

Comparative figures for DeepSeek V4 and Qwen 3.6 Plus are not included here, as independent benchmark reproductions take time to surface after a model release. Verify head-to-head comparisons against each model's official technical report or a neutral benchmark harness before using them in production model selection decisions.

For teams considering DeepSeek V4 as an alternative, the Deploy DeepSeek V4 on GPU Cloud guide covers the 685B total / 37B active MoE and its vLLM configuration. For Qwen 3.6 Plus's hybrid 1M context MoE, see the Deploy Qwen 3.6 Plus guide.

When to choose Nemotron 3 Ultra over alternatives:

NVIDIA ecosystem integration: If your inference stack already uses NIM microservices, NVIDIA's tooling integrates directly with Nemotron checkpoints, including optimized serving configurations and NCCL tuning for NVLink fabrics.
Agentic multi-step accuracy: MOPD post-training gives Nemotron 3 Ultra a specific advantage on workflows that combine tool calls, code execution, and multi-step reasoning in the same trace.
Long-context at scale: The LatentMoE hybrid with SSM layers reduces KV cache pressure for sequences above 32K tokens, which matters for repository-level analysis or document processing pipelines.
Model trust and support: NVIDIA provides NIM container support, security patching, and a commercial support path that open-source community models typically lack.

Production Reliability and Monitoring

Before going to production, run through this checklist:

GPU provisioned with NVLink SXM variant; NVLink verified with nvidia-smi topo -m (look for NVLink entries in the topology matrix)
CUDA version verified: nvcc --version shows 12.4+ for H100/H200, 12.8+ for B200/B300
vLLM installed at NVIDIA's recommended version; Ray cluster healthy: ray status shows all expected GPUs
Model checkpoint downloaded and file sizes verified (FP8 checkpoint should be roughly 550 GB)
vLLM server started with --trust-remote-code and the correct parallelism flags
Health check: curl http://localhost:8000/health returns HTTP 200
GPU utilization spread across all cards: nvidia-smi dmon -s u should show >80% utilization across all 8 GPUs during inference. If some GPUs sit idle under load, expert parallelism is not spreading tokens evenly or the batch is too small to reach every GPU's experts
KV cache utilization monitored via vLLM metrics: curl http://localhost:8000/metrics | grep cache
Load balancer in front of vLLM for production traffic (nginx or Envoy with health check on /health)
Spot preemption plan in place if using spot pricing: vLLM does not checkpoint in-flight inference state, so budget for model reload time and client-side retries after a restart
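
The health and metrics checks from the list can be wrapped in a small script for use in a readiness probe. This is a sketch using only the Python standard library; the helper names are illustrative, and the metric filter mirrors the `grep cache` step without assuming any particular vLLM metric names.

```python
import urllib.request
from typing import List


def cache_metric_lines(metrics_text: str) -> List[str]:
    """Keep Prometheus sample lines whose metric name mentions 'cache'."""
    kept = []
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric_name = line.split("{")[0].split(" ")[0]
        if "cache" in metric_name:
            kept.append(line)
    return kept


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP error statuses
        return False


def fetch_cache_metrics(base_url: str = "http://localhost:8000") -> List[str]:
    """Download /metrics and return only the cache-related samples."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as response:
        return cache_metric_lines(response.read().decode("utf-8"))
```

Call `is_healthy()` from the load balancer's probe and log `fetch_cache_metrics()` on a timer to watch KV cache pressure over time.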

For monitoring tooling and GPU metrics pipelines, see the GPU monitoring for ML guide.

Nemotron 3 Ultra gives you a self-hosted 550B reasoner that fits on a single 8x H200 node at FP8, with no hyperscaler contract required.


Quick Setup Guide

Size your GPU cluster using total parameter count
For Nemotron 3 Ultra 550B: at FP8, weights are 550 GB. Apply 15% overhead for activations, framework buffers, and routing dispatch, giving roughly 633 GB for weights plus overhead before any KV cache. A single 8x H200 SXM5 node (1,128 GB total) gives comfortable headroom for FP8 serving with a 32K context window and batch size 8. For BF16, 550B x 2 bytes = 1,100 GB, which exceeds a single 8x H100 node (640 GB) and requires two 8x H100 SXM5 nodes or a 16x H100 cluster.
Provision the GPU node on Spheron
Log into app.spheron.ai and select an 8x H200 SXM5 or 16x H100 SXM5 configuration. SXM variants are required: they include NVLink 4.0 (900 GB/s bidirectional) which is essential for tensor parallelism at TP=8 or higher. PCIe variants (~64 GB/s) will cause severe all-reduce bottlenecks at this scale. For multi-node setups, provision both instances in the same availability zone to minimize InfiniBand latency.
Install vLLM with Ray for multi-node support
On all nodes: pip install vllm ray (use the version NVIDIA recommends in its serving guide). On the head node, start Ray: ray start --head --port=6379. On worker nodes: ray start --address=<HEAD_NODE_IP>:6379. Verify all GPUs are visible in the Ray cluster: python -c 'import ray; ray.init(address="auto"); print(ray.available_resources())'. CUDA 12.4+ is required for H100 and H200 (Hopper); CUDA 12.8+ for B200 (Blackwell).
Download the Nemotron 3 Ultra model checkpoint
On the head node: huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 --local-dir /models/nemotron-3-ultra-fp8. The FP8 checkpoint is approximately 550 GB. If that repository does not exist, download BF16 instead and let vLLM quantize at load time with --quantization fp8: huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 --local-dir /models/nemotron-3-ultra. Make the checkpoint available on every node, via a shared NFS mount or identical local downloads, before launching the vLLM server. Verify the exact HuggingFace repository names against NVIDIA's release page.
Launch vLLM with expert parallelism
Single-node 8x H200 FP8: vllm serve /models/nemotron-3-ultra-fp8 --quantization fp8 --tensor-parallel-size 8 --enable-expert-parallel --gpu-memory-utilization 0.90 --max-model-len 32768 --trust-remote-code --port 8000. Multi-node 16x H100: vllm serve /models/nemotron-3-ultra-fp8 --quantization fp8 --tensor-parallel-size 8 --enable-expert-parallel --pipeline-parallel-size 2 --distributed-executor-backend ray --gpu-memory-utilization 0.88 --max-model-len 32768 --trust-remote-code --port 8000. Run this command only on the head node; vLLM uses Ray to coordinate the worker.
Validate the deployment and benchmark throughput
Send a test request: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"/models/nemotron-3-ultra-fp8","messages":[{"role":"user","content":"Solve: what is 17 multiplied by 23 step by step."}]}'. Benchmark with vLLM's serving script: python benchmarks/benchmark_serving.py --backend vllm --model /models/nemotron-3-ultra-fp8 --dataset-name random --random-output-len 2048 --num-prompts 100 --request-rate 4 (flag names vary between vLLM releases, and newer ones expose the same benchmark as vllm bench serve, so check --help first). For reasoning workloads, expect extended generation times per sequence due to chain-of-thought output length.


Frequently Asked Questions

How many GPUs do you need to run Nemotron 3 Ultra 550B?

At FP8 precision (1 byte/param), the 550B weights occupy roughly 550 GB. With 15% framework and activation overhead, you need around 633 GB minimum. A single 8x H200 SXM5 node (1,128 GB total VRAM) fits FP8 comfortably. For BF16 (1,100 GB weights), you need two 8x H100 SXM5 nodes (1,280 GB combined) or a 16x H100 cluster. Nemotron 3 Ultra's MoE design means all expert weights must be in VRAM even though only 55B activate per token.

Does vLLM support Nemotron 3 Ultra 550B?

Yes. vLLM supports MoE models with expert parallelism via --enable-expert-parallel (use the version NVIDIA recommends in its serving guide; NVIDIA's model card references the vllm/vllm-openai:v0.22.0 container at launch). For a multi-node deployment spanning two 8-GPU nodes, use vLLM with a Ray backend: run ray start --head on the first node and ray start --address=<HEAD_NODE_IP>:6379 on the second, then launch vllm serve with --tensor-parallel-size 8 --enable-expert-parallel --pipeline-parallel-size 2 --distributed-executor-backend ray. The model checkpoint is nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 on Hugging Face (verify the exact repository name against NVIDIA's release page).

What makes Nemotron 3 Ultra different from Nemotron 3 Super and Nemotron Ultra 253B?

Nemotron 3 Ultra is the largest tier in the Nemotron 3 family at 550B total parameters with 55B active per token. It uses a LatentMoE architecture combining Mamba-2, MoE, and Attention layers in one hybrid stack, plus Multi-Token Prediction for speculative decoding. Nemotron 3 Super is 120B total / 12B active with a Mamba-Transformer hybrid. Nemotron Ultra 253B is a dense Transformer with no MoE. Nemotron 3 Ultra uses a staged post-training pipeline including RLVR and Multi-Teacher On-Policy Distillation (MOPD), targeting agentic task accuracy across coding, reasoning, tool use, and agent action sequences.

What post-training method did NVIDIA use for Nemotron 3 Ultra?

NVIDIA used a staged post-training pipeline. Multi-Teacher On-Policy Distillation (MOPD) is the headline technique: more than 10 specialized teacher models guide training on the student's own generated outputs (on-policy rollouts), covering coding, math, reasoning, tool use, and agentic workflows. RLVR (RL with Verifiable Rewards) runs as an earlier stage in the pipeline. MOPD then preserves and extends those RL gains across multiple domains simultaneously, improving multi-step task completion rather than overfitting to one benchmark category.

What does it cost to run Nemotron 3 Ultra on Spheron vs a managed API?

On Spheron, an 8x H200 SXM5 node costs $38.72/hr on-demand ($4.84/GPU). At continuous 24/7 use that is roughly $28,266 per month. Managed inference APIs for comparable frontier reasoning models typically charge $4-8 per million tokens. The break-even against a self-hosted 8x H200 on-demand configuration is approximately 3.5-7 billion tokens per month depending on which API tier you are comparing against, which a single node reaches only at a sustained throughput of roughly 1,350-2,700 tokens/sec.