NVIDIA released Nemotron 3 Ultra on June 4, 2026, and the defining number is 550B total parameters with 55B active per token. That active-to-total ratio is the whole point: you get frontier reasoning quality at a compute cost much closer to a 55B dense model than a 550B one. What sets this apart from earlier Nemotron releases is the post-training recipe. NVIDIA's headline technique is Multi-Teacher On-Policy Distillation (MOPD): more than 10 specialized teacher models guide training on the student's own generated outputs across coding, reasoning, tool use, and agent action sequences. RLVR (RL with Verifiable Rewards) is an earlier stage in the pipeline. The result is a model that generalizes better on agentic workflows rather than overfitting to one benchmark category.
Before getting into deployment specifics, if you are already running smaller NVIDIA models, the Nemotron 3 Super deployment guide covers the 120B/12B hybrid Mamba-Transformer tier for teams who need single-GPU deployment. For baseline vLLM multi-node server setup, see vLLM production deployment.
Nemotron 3 Ultra Architecture: LatentMoE Hybrid with MOPD Post-Training
Nemotron 3 Ultra is not a standard transformer MoE. The architecture combines three components in a single hybrid stack: Mamba-2 SSM layers, MoE feed-forward layers, and standard attention layers. NVIDIA calls this LatentMoE. It also includes Multi-Token Prediction (MTP) for native speculative decoding without a separate draft model.
The MoE component handles the parameter-to-compute ratio. The model has 550B total parameters across experts, but only 55B activate on any given forward pass. This is what makes the GPU math interesting: you need 550 GB of VRAM to hold all expert weights (at FP8), but the compute per token is equivalent to a 55B dense model. You pay for storage, not for compute.
The Mamba-2 layers bring a different benefit. SSM layers carry a fixed-size recurrent state, so their compute scales linearly with sequence length rather than quadratically like attention, and they add no per-token cache. For long reasoning chains, this matters. A multi-step agent reasoning trace that runs to 32K tokens needs far less KV cache headroom than it would in a purely attention-based model, because only the attention layers store keys and values. Combined with a 1M token context window, the architecture handles extended agentic workflows without the memory wall that hits pure-transformer models.
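To make the memory argument concrete, here is a minimal sketch comparing how an attention KV cache and an SSM recurrent state grow with sequence length. Every dimension in it (layer counts, head counts, state size) is a hypothetical placeholder, not a published Nemotron 3 Ultra figure; only the shape of the growth is the point.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Attention KV cache for one sequence: grows with every token."""
    # Factor 2 = one key tensor plus one value tensor per attention layer.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


def ssm_state_bytes(ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent SSM state for one sequence: independent of token count."""
    return ssm_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
HYPOTHETICAL_ATTN = dict(attn_layers=12, kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768):
    kv_mib = kv_cache_bytes(tokens, **HYPOTHETICAL_ATTN) / 2**20
    ssm_mib = ssm_state_bytes(ssm_layers=48, state_elems_per_layer=1_000_000) / 2**20
    print(f"{tokens:>6} tokens: KV cache {kv_mib:8.1f} MiB, SSM state {ssm_mib:6.1f} MiB")
```

The KV cache figure grows eightfold between the two rows while the SSM figure does not move, which is why replacing some attention layers with SSM layers lowers per-request memory at long context.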
Multi-Teacher On-Policy Distillation (MOPD) is the headline claim NVIDIA makes for Nemotron 3 Ultra's agentic accuracy. MOPD uses more than 10 specialized teacher models to guide training on the student's own generated outputs (on-policy rollouts) across coding, reasoning, tool use, and multi-step agent workflows simultaneously. RLVR (RL with Verifiable Rewards) runs earlier in the pipeline; MOPD then preserves and extends those RL gains. The practical consequence is a model that generalizes better across agentic workflows rather than overfitting to one task domain.
Nemotron 3 Family: Choosing the Right Tier
| Model | Total Params | Active Params | Architecture | Min GPUs (FP8) |
|---|---|---|---|---|
| Nemotron 3 Nano | ~8B | ~8B (dense) | Hybrid Mamba-MoE Transformer | 1x A100 |
| Nemotron 3 Super | 120B | 12B | Hybrid Mamba-MoE | 1x H100 (NVFP4) |
| Nemotron Ultra 253B | 253B | 253B (dense) | Dense Transformer | 8x H100 |
| Nemotron 3 Ultra | 550B | 55B | LatentMoE (Mamba-2 + MoE + Attention) | 8x H200 (FP8) |
When to choose each tier:
- Nemotron 3 Nano: Single-GPU serving, edge deployment, or cost-constrained inference pipelines where 8B quality is acceptable
- Nemotron 3 Super: Production serving on a single H100, long-context workloads where SSM memory savings matter, SWE-Bench class coding tasks at minimal cost
- Nemotron Ultra 253B: Dense architecture where MoE routing overhead is undesirable, 8x H100 budget already committed, tasks requiring consistent active compute across all tokens
- Nemotron 3 Ultra: Highest quality agentic and reasoning workloads, budget for 8x H200 or 16x H100, workflows that benefit from MOPD post-training generalization across coding, reasoning, and tool use
GPU Hardware Planning: VRAM and Interconnect for a 550B MoE
The fundamental rule for MoE deployment: you load all parameters into VRAM, not just the active ones. Every expert's weights must be resident so the router can dispatch tokens to any of them on demand. For Nemotron 3 Ultra, that means 550B parameters regardless of the 55B active count.
The VRAM formula:
total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget
The 1.15 multiplier covers activation memory, framework overhead, and routing dispatch buffers. KV cache depends on your context length and batch size targets.
| Precision | Weight Size | +15% Overhead | Min GPU Config | Notes |
|---|---|---|---|---|
| BF16 | 1,100 GB | ~1,265 GB | 2x (8x H100 SXM5) = 1,280 GB | Requires multi-node |
| FP8 | 550 GB | ~633 GB | 8x H200 SXM5 (1,128 GB) | Single node, comfortable headroom |
| NVFP4/FP4 | 275 GB | ~316 GB | 8x H100 (640 GB) or 4x H200 | Best on Blackwell; H100 via runtime dequant |
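The sizing rule and the table above can be reproduced with a few lines of Python. This is a sketch under the article's own assumptions: 1 GB per billion parameters per byte, a flat 15% overhead, and a nominal 0.5 bytes per parameter for NVFP4 (ignoring scale-factor storage). The function and constant names are illustrative.

```python
# Bytes per parameter for each precision, as used in the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}
OVERHEAD = 1.15  # activations, framework overhead, routing dispatch buffers


def weights_with_overhead_gb(total_params_b: float, dtype: str) -> float:
    """Weights plus the 15% overhead, before any KV cache budget."""
    return total_params_b * BYTES_PER_PARAM[dtype] * OVERHEAD


def total_vram_gb(total_params_b: float, dtype: str, kv_cache_budget_gb: float) -> float:
    """total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget"""
    return weights_with_overhead_gb(total_params_b, dtype) + kv_cache_budget_gb


for dtype in ("bf16", "fp8", "nvfp4"):
    print(f"{dtype}: {weights_with_overhead_gb(550, dtype):.1f} GB before KV cache")
```

Passing the total parameter count (550) rather than the active count (55) is the important part: every expert must be resident, so the active count never enters the memory calculation.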
Interconnect Requirements
Within a single node, NVLink 4.0 (SXM variants) runs at 900 GB/s bidirectional. This bandwidth is what makes tensor parallelism at TP=8 viable. PCIe interconnects top out around 64 GB/s and create a severe all-reduce bottleneck when each forward pass requires synchronizing tensors across 8 GPUs. For Nemotron 3 Ultra, SXM variants are not optional.
For multi-node BF16 deployments (16x H100 across two nodes), InfiniBand handles cross-node communication for pipeline parallel stages. 400 Gb/s or 800 Gb/s InfiniBand is the right call here; 100 Gb/s Ethernet between nodes will produce visible throughput degradation. The distributed LLM training guide covers the multi-node networking setup in more detail.
GPU Tier Recommendations
Pricing fetched from the Spheron API on 12 Jun 2026:
| Config | Precision | VRAM | $/hr (Spheron) | Best For |
|---|---|---|---|---|
| 8x H200 SXM5 | FP8 | 1,128 GB | 8 × $4.84 = $38.72/hr on-demand | Single-node production (recommended) |
| 16x H100 SXM5 | FP8 | 1,280 GB | 16 × ~$3.92 = $62.72/hr on-demand | Multi-node, lower cost-per-GPU |
| 8x H100 SXM5 | NVFP4 | 640 GB | 8 × $3.92 = $31.36/hr on-demand | Budget single-node (H100 dequant overhead) |
| 8x B200 SXM6 | FP8 | 1,536 GB | 8 × $2.71 = $21.68/hr spot | Blackwell native, best throughput/$ |
Pricing fluctuates based on GPU availability. The prices above are based on 12 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
Deploying Nemotron 3 Ultra with vLLM
Prerequisites
- Python 3.10+
- CUDA 12.4+ for H100/H200 (Hopper architecture); CUDA 12.8+ for B200/B300 (Blackwell)
- vLLM (use the version NVIDIA recommends in its serving guide; NVIDIA's model card references vllm/vllm-openai:v0.22.0 at launch)
- Ray (required for multi-node setups only)
- 550 GB free storage per node for the FP8 checkpoint (1,100 GB for BF16)
Installation
# All nodes - use the version NVIDIA recommends in its serving guide
pip install vllm ray
# Verify GPU visibility
nvidia-smi
Ray Setup (Multi-Node Only)
# Head node
ray start --head --port=6379
# Worker nodes (replace with actual head node IP)
ray start --address=<HEAD_NODE_IP>:6379
# Verify Ray cluster health
python -c "import ray; ray.init(address='auto'); print(ray.available_resources())"
Model Download
NVIDIA's primary official checkpoints are BF16 and NVFP4. A separate pre-quantized FP8 checkpoint may or may not be published. If nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 does not exist on HuggingFace, download BF16 and use vLLM's --quantization fp8 flag for runtime FP8 quantization.
# BF16 checkpoint (~1,100 GB) - confirmed available
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 \
--local-dir /models/nemotron-3-ultra
# NVFP4 checkpoint (~275 GB) - confirmed available; best for Blackwell
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--local-dir /models/nemotron-3-ultra-nvfp4
# FP8 checkpoint (~550 GB) - verify availability on NVIDIA's release page
# If not available, use BF16 with --quantization fp8 in vLLM
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 \
--local-dir /models/nemotron-3-ultra-fp8
Verify repository names against NVIDIA's release page before downloading. HuggingFace repo names can shift between announcement and public availability.
Single-Node 8x H200 FP8 (Recommended)
This is the primary path. A single 8x H200 SXM5 node with 141 GB per GPU gives 1,128 GB total, leaving roughly 495 GB headroom above the ~633 GB FP8 weight-plus-overhead figure. That headroom goes to KV cache for serving concurrent requests at 32K context length.
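A quick way to sanity-check that headroom figure is to compute it directly, and to see how the --gpu-memory-utilization cap changes it. The sketch below is an approximation: it treats the cap as a simple fraction of total VRAM available to vLLM and reuses the ~633 GB weight-plus-overhead estimate. The helper name is illustrative.

```python
def kv_cache_budget_gb(gpus: int, gb_per_gpu: float,
                       gpu_memory_utilization: float,
                       weights_plus_overhead_gb: float) -> float:
    """VRAM left for KV cache once weights and overhead are placed."""
    usable_gb = gpus * gb_per_gpu * gpu_memory_utilization
    return usable_gb - weights_plus_overhead_gb


# Raw headroom on 8x H200 (141 GB each), ignoring the utilization cap.
raw = kv_cache_budget_gb(8, 141, 1.00, 633)
# Headroom when vLLM is limited to 90% of each GPU.
capped = kv_cache_budget_gb(8, 141, 0.90, 633)

print(f"raw headroom:    {raw:.0f} GB")
print(f"at 0.90 utilization: {capped:.0f} GB")
```

Under these assumptions the budget that actually reaches the KV cache with a 0.90 cap is noticeably smaller than the raw difference, so it is worth sizing concurrency against the capped number.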
vllm serve /models/nemotron-3-ultra-fp8 \
--quantization fp8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--trust-remote-code \
--port 8000
Multi-Node 16x H100 FP8 with Pipeline Parallelism
Two 8x H100 SXM5 nodes connected via InfiniBand. Each node runs TP=8 across its NVLink domain; pipeline parallelism handles the cross-node boundary.
# Run on head node only after Ray cluster is healthy
vllm serve /models/nemotron-3-ultra-fp8 \
--quantization fp8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.88 \
--max-model-len 32768 \
--trust-remote-code \
--port 8000
The --gpu-memory-utilization 0.88 (vs 0.90 on single-node) leaves headroom for pipeline boundary activation buffers that would not exist in a single-node setup.
Single-Node 8x H100 NVFP4 (Budget Path)
The NVFP4 checkpoint is 275 GB, comfortably under 640 GB total VRAM on 8x H100 80GB. The tradeoff: H100 does not have native FP4 tensor cores. vLLM dequantizes NVFP4 to FP8 at runtime, so you get the memory savings but not Blackwell's native FP4 throughput gain.
vllm serve /models/nemotron-3-ultra-nvfp4 \
--quantization nvfp4 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--gpu-memory-utilization 0.88 \
--max-model-len 32768 \
--trust-remote-code \
--port 8000
Key vLLM Flag Reference
| Flag | Value | Why |
|---|---|---|
| --enable-expert-parallel | enabled | Places whole experts on separate GPUs; expert layers use all-to-all token dispatch instead of a tensor-parallel all-reduce |
--tensor-parallel-size | 8 (per node) | Splits attention and SSM layers across 8 GPUs via NVLink |
--pipeline-parallel-size | 2 (multi-node only) | Splits model layers across 2 nodes |
--distributed-executor-backend | ray (multi-node only) | Ray handles inter-node coordination |
--gpu-memory-utilization | 0.88-0.90 | Headroom for expert dispatch buffers and pipeline boundary activations |
--max-model-len | 32768 | Safe default for reasoning workloads; increase to 131072 or higher if VRAM headroom permits |
--trust-remote-code | required | NVIDIA uses custom model code in the checkpoint |
Note: verify these flags against your installed vLLM version's vllm serve --help. The --distributed-executor-backend and --pipeline-parallel-size flags may behave differently between vLLM minor versions.
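If you launch from a script rather than by hand, the single-node and multi-node commands differ only in a few flags, which a small helper can make explicit. This is a sketch that assembles the same flags shown in this guide; the function name and its defaults are illustrative, and it builds the argument list without starting anything.

```python
import shlex


def build_vllm_serve_args(model_path: str, quantization: str, nodes: int = 1,
                          tp_size: int = 8, gpu_mem_util: float = 0.90,
                          max_model_len: int = 32768, port: int = 8000) -> list[str]:
    """Assemble the vllm serve argument list used in this guide."""
    args = ["vllm", "serve", model_path,
            "--quantization", quantization,
            "--tensor-parallel-size", str(tp_size),
            "--enable-expert-parallel"]
    if nodes > 1:
        # Pipeline parallelism and the Ray backend apply to multi-node only.
        args += ["--pipeline-parallel-size", str(nodes),
                 "--distributed-executor-backend", "ray"]
    args += ["--gpu-memory-utilization", f"{gpu_mem_util:.2f}",
             "--max-model-len", str(max_model_len),
             "--trust-remote-code",
             "--port", str(port)]
    return args


single_node = build_vllm_serve_args("/models/nemotron-3-ultra-fp8", "fp8")
multi_node = build_vllm_serve_args("/models/nemotron-3-ultra-fp8", "fp8",
                                   nodes=2, gpu_mem_util=0.88)
print(shlex.join(single_node))
print(shlex.join(multi_node))
```

The resulting list can be passed to subprocess.run on the head node once the Ray cluster is healthy.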
Expert-Parallel Serving: How Routing Affects Throughput
Standard tensor parallelism splits every weight matrix across GPUs. Each GPU computes a shard of the result, then an all-reduce synchronizes the partial outputs. With the usual Megatron-style sharding, that is typically two all-reduces per transformer layer per forward pass (one after attention, one after the feed-forward block), and at TP=8 each of them spans all 8 GPUs.
Expert parallelism takes a different approach. Each GPU holds a complete subset of expert networks. Routing dispatches each token to whichever GPU owns the relevant experts via an all-to-all communication pattern. GPUs then compute locally on their assigned experts and gather results.
The advantage for large MoE models: the all-to-all communication volume is proportional to the number of routed tokens, not the full model state. For batches where tokens spread across many experts, the GPU utilization stays high because different GPUs are computing on different token-expert pairs simultaneously. The more diverse the batch (different tokens selecting different experts), the better the GPU utilization.
This is why larger batches help with MoE expert parallelism in a way that does not apply to dense models. A batch of 1 token routes to a small set of experts, leaving most GPU expert capacity idle. A batch of 64 tokens routes to a much wider set of experts, spreading the compute load. For production serving of a model like Nemotron 3 Ultra, serving with a minimum batch size of 8-16 requests will significantly improve GPU utilization versus single-request serving.
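The batch-size effect can be estimated with a simple probability model. The sketch below assumes each token picks its top-k experts uniformly and independently, which real routers do not do exactly, and it uses a hypothetical 128 experts with top-8 routing because this guide does not state Nemotron 3 Ultra's actual expert count.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit in one MoE layer for one batch."""
    # Probability that a given expert is skipped by every token in the batch.
    p_expert_idle = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_expert_idle)


NUM_EXPERTS, TOP_K = 128, 8  # hypothetical values, for illustration only

for batch in (1, 8, 16, 64):
    active = expected_active_experts(NUM_EXPERTS, TOP_K, batch)
    print(f"batch={batch:>2}: ~{active:.0f} of {NUM_EXPERTS} experts busy")
```

Under these assumptions a single token touches only its top-k experts, while a few dozen tokens already reach most of them, which matches the utilization argument above.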
For deeper analysis of expert parallelism tuning, including when to prefer EP over TP and how all-to-all communication patterns change with different routing configurations, see the MoE inference optimization guide.
Reasoning Workload Tuning: Throughput, Latency, and Cost Per Token
Nemotron 3 Ultra generates reasoning chains before producing final answers when configured with enable_thinking=True in the chat template. A chain-of-thought trace for a complex math or coding problem can run 2,000 to 8,000 tokens before the final response. This changes the throughput math substantially compared to a chat model that gives short direct answers.
Time-To-First-Token vs Inter-Token Latency: For reasoning workloads, TTFT includes the full prefill of the system prompt and user message. The thinking chain runs during decode. If your SLO is sensitive to total response time rather than time-to-first-token, the relevant metric is the full decode latency for the reasoning chain plus the answer. Setting a shorter max_tokens budget for the thinking chain reduces total latency at the cost of potentially truncated reasoning.
Chunked prefill: Enable --enable-chunked-prefill for long system prompts or long-context inputs. This pipelines the prefill across multiple forward passes and can improve throughput by overlapping prefill computation with decode of other active requests. The Mamba-2 SSM layers in Nemotron 3 Ultra use a different memory access pattern than pure attention, so test chunked prefill correctness on your specific checkpoint before enabling in production.
KV cache pressure for reasoning chains: Each token in a reasoning chain extends the KV cache for attention layers. At 32K context, a 4,096-token reasoning chain plus 512-token input uses 14% of the context window. At 131K context, the same trace is trivial. For serving many concurrent requests with extended reasoning, the KV cache budget per request increases. See the KV cache optimization guide for sizing the cache budget against your concurrency targets.
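The context-window share quoted above is plain arithmetic, and it is easy to rerun for your own prompt and reasoning lengths. A minimal sketch, with an illustrative function name:

```python
def context_fraction(input_tokens: int, reasoning_tokens: int, max_model_len: int) -> float:
    """Share of the context window used by the prompt plus the reasoning chain."""
    return (input_tokens + reasoning_tokens) / max_model_len


for max_len in (32_768, 131_072):
    share = context_fraction(512, 4_096, max_len)
    print(f"max_model_len={max_len}: {share:.1%} of the window")
```

This only measures how much of the window one request consumes. The KV cache bytes behind it depend on the model's attention configuration and cache dtype.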
Throughput estimates: At FP8 on 8x H200 SXM5, expect roughly 500-900 tokens/sec total throughput depending on batch size and reasoning chain length. Multi-node 16x H100 FP8 is broadly comparable but with inter-node communication overhead at each pipeline stage boundary. These are estimates based on the model architecture; your actual figures will depend on your vLLM version, batch composition, and reasoning chain lengths.
Cost Analysis: Spheron vs Hyperscaler and Managed Inference
Monthly cost at continuous 24/7 use (730 hours), using live Spheron pricing from 12 Jun 2026:
| Config | $/hr | $/month (continuous) | Use Case |
|---|---|---|---|
| 8x H200 SXM5 on-demand | $38.72 ($4.84/GPU) | ~$28,266 | Committed production |
| 8x H200 SXM5 spot | $14.56 ($1.82/GPU) | ~$10,629 | Batch / non-latency-critical |
| 16x H100 SXM5 on-demand | $62.72 (~$3.92/GPU) | ~$45,786 | Multi-node BF16 or FP8 |
| 8x B200 SXM6 spot | $21.68 ($2.71/GPU) | ~$15,826 | Blackwell, best throughput/$ |
Break-even against managed inference APIs: Managed API pricing for frontier reasoning models runs roughly $4-8 per million tokens. At $38.72/hr on-demand for 8x H200, and estimating 700 tokens/sec throughput, the cost per million tokens self-hosted is roughly $15.36/M. Dividing the ~$28,266 monthly cost by the API rate gives the break-even volume: approximately 7.1 billion tokens per month at $4/M, dropping to 3.5 billion at $8/M. A node sustaining 700 tokens/sec produces only about 1.84 billion tokens per month, so reaching break-even on a single on-demand node requires sustained throughput of roughly 2,700 tokens/sec against a $4/M API, or roughly 1,350 tokens/sec against $8/M.
If your workload is episodic rather than continuous, spot instances change the calculation substantially. At $14.56/hr spot for 8x H200, the equivalent cost-per-token drops to roughly $5.78/M at the same 700 tokens/sec throughput, which already undercuts an $8/M API. Against a $4/M API the break-even is around 2.7 billion tokens per month, or roughly 1,010 tokens/sec sustained.
Cost-per-token calculation method: hourly rate / (throughput_tokens_per_sec x 3600) = cost per token. At $38.72/hr and 700 tokens/sec: $38.72 / (700 x 3600) = $0.00001536 per token = $15.36 per million tokens.
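The same method is easy to script so you can plug in your own measured throughput and current prices. This is a sketch using the 730-hour month from the table above; the function names are illustrative. The last helper reports how many tokens a node actually produces in a month at a given throughput, which is a useful cross-check against any break-even volume.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 730  # continuous 24/7 use, as in the table above


def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """hourly rate / (throughput x 3600), scaled to one million tokens."""
    return hourly_rate / (tokens_per_sec * SECONDS_PER_HOUR) * 1_000_000


def break_even_tokens_per_month(hourly_rate: float, api_price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the node's monthly cost."""
    return hourly_rate * HOURS_PER_MONTH / api_price_per_million * 1_000_000


def monthly_capacity_tokens(tokens_per_sec: float) -> float:
    """Tokens one deployment produces in a month at a sustained throughput."""
    return tokens_per_sec * SECONDS_PER_HOUR * HOURS_PER_MONTH


on_demand, spot = 38.72, 14.56  # 8x H200 $/hr from the table above
print(f"on-demand: ${cost_per_million_tokens(on_demand, 700):.2f}/M tokens")
print(f"spot:      ${cost_per_million_tokens(spot, 700):.2f}/M tokens")
print(f"break-even vs $4/M: {break_even_tokens_per_month(on_demand, 4) / 1e9:.1f}B tokens/month")
print(f"capacity at 700 tok/s: {monthly_capacity_tokens(700) / 1e9:.2f}B tokens/month")
```

Comparing the break-even volume with the capacity line shows immediately whether a given throughput can reach break-even on one node.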
Quantization Options: FP8, NVFP4, and BF16
| Format | VRAM (550B) | Throughput vs BF16 | Quality vs BF16 | Best Hardware |
|---|---|---|---|---|
| BF16 | ~1,100 GB | Baseline | Reference | 2x (8x H100) nodes |
| FP8 | ~550 GB | ~1.5-1.8x | ~98% | 8x H200 or 16x H100 |
| NVFP4 | ~275 GB | ~2.5-3x on Blackwell | ~95-96% | 8x B200 (native) / 8x H100 (via dequant) |
A few practical notes:
FP8 is the recommended precision for H200. The quality drop from BF16 to FP8 is negligible on reasoning tasks, and fitting the entire model on a single 8x H200 node eliminates the multi-node networking complexity. Note that NVIDIA's primary pre-quantized releases are BF16 and NVFP4. If a dedicated FP8 checkpoint is not available, download BF16 and use vLLM's --quantization fp8 flag for runtime quantization. Unless you have a specific BF16 accuracy requirement, FP8 is the right precision choice for H200 serving.
NVFP4 on H100 is memory-efficient, not compute-efficient. H100 lacks native FP4 tensor cores. vLLM dequantizes NVFP4 weights to FP8 before computation, so you get the VRAM benefit (275 GB vs 550 GB) but you lose the Blackwell throughput advantage. Use NVFP4 on H100 when VRAM is the constraint, not when chasing throughput.
INT4 is not recommended for reasoning workloads. Quality degradation at INT4 compounds across multi-step chain-of-thought traces. A small error in one reasoning step can propagate forward and produce incorrect conclusions several steps later. The bit savings are not worth the accuracy risk for agentic workflows.
Benchmark Comparison: Nemotron 3 Ultra vs Other Frontier Models
Published benchmark results from the NVIDIA Nemotron 3 Ultra technical report and model card (June 2026):
| Benchmark | Nemotron 3 Ultra 550B | Notes |
|---|---|---|
| GPQA (no tools) | 87.0 | NVIDIA-published; measures graduate-level science reasoning |
| LiveCodeBench v6 | 89.0 | NVIDIA-published; functional code generation |
| SWE-Bench Verified | 65.0-70.4 (five agent harnesses) | NVIDIA-published range across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent; repository-level code editing |
| IMOAnswerBench | 88.6 | NVIDIA-published; competition math reasoning |
| RULER @ 1M tokens | 94.7 | NVIDIA-published; long-context retrieval |
Comparative figures for DeepSeek V4 and Qwen 3.6 Plus are not included here, as independent benchmark reproductions take time to surface after a model release. Verify head-to-head comparisons against each model's official technical report or a neutral benchmark harness before using them in production model selection decisions.
For teams considering DeepSeek V4 as an alternative, the Deploy DeepSeek V4 on GPU Cloud guide covers the 685B total / 37B active MoE and its vLLM configuration. For Qwen 3.6 Plus's hybrid 1M context MoE, see the Deploy Qwen 3.6 Plus guide.
When to choose Nemotron 3 Ultra over alternatives:
- NVIDIA ecosystem integration: If your inference stack already uses NIM microservices, NVIDIA's tooling integrates directly with Nemotron checkpoints, including optimized serving configurations and NCCL tuning for NVLink fabrics.
- Agentic multi-step accuracy: MOPD post-training gives Nemotron 3 Ultra a specific advantage on workflows that combine tool calls, code execution, and multi-step reasoning in the same trace.
- Long-context at scale: The LatentMoE hybrid with SSM layers reduces KV cache pressure for sequences above 32K tokens, which matters for repository-level analysis or document processing pipelines.
- Model trust and support: NVIDIA provides NIM container support, security patching, and a commercial support path that open-source community models typically lack.
Production Reliability and Monitoring
Before going to production, run through this checklist:
- GPU provisioned with NVLink SXM variant; NVLink verified with nvidia-smi topo -m (look for NVLink entries in the topology matrix)
- CUDA version verified: nvcc --version shows 12.4+ for H100/H200, 12.8+ for B200/B300
- vLLM installed at NVIDIA's recommended version; Ray cluster healthy: ray status shows all expected GPUs
- Model checkpoint downloaded and file sizes verified (FP8 checkpoint should be roughly 550 GB)
- vLLM server started with --trust-remote-code and the correct parallelism flags
- Health check: curl http://localhost:8000/health returns HTTP 200
- GPU utilization spread across all cards: nvidia-smi dmon -s u should show >80% utilization across all 8 GPUs during inference. If some GPUs are idle, expert parallelism is not routing correctly
- KV cache utilization monitored via vLLM metrics: curl http://localhost:8000/metrics | grep cache
- Load balancer in front of vLLM for production traffic (nginx or Envoy with health check on /health)
- Spot interruption handling planned if using spot pricing: vLLM does not automatically checkpoint inference state, so plan for restart latency
For monitoring tooling and GPU metrics pipelines, see the GPU monitoring for ML guide.
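The two curl checks from the checklist can also be scripted with the Python standard library, for example inside a readiness probe. This is a sketch: the helper names are illustrative, and the sample metric line in the comment is a made-up placeholder rather than a real vLLM metric name.

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, HTTP errors, and timeouts
        return False


def cache_metric_lines(metrics_text: str) -> list[str]:
    """Keep metric samples mentioning 'cache', dropping '#' comment lines."""
    return [line for line in metrics_text.splitlines()
            if "cache" in line and not line.startswith("#")]


def fetch_cache_metrics(base_url: str, timeout: float = 5.0) -> list[str]:
    """Download /metrics and filter it, similar to `curl ... | grep cache`."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return cache_metric_lines(resp.read().decode("utf-8"))


# Example (requires a running server):
#   is_healthy("http://localhost:8000")
#   fetch_cache_metrics("http://localhost:8000")
# A filtered line looks like "example_kv_cache_usage 0.42" (placeholder name).
```

Unlike the plain grep, the filter drops the HELP and TYPE comment lines so only metric samples remain.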
Nemotron 3 Ultra gives you a self-hosted 550B reasoner that fits on a single 8x H200 node at FP8, with no hyperscaler contract required.
Quick Setup Guide
For Nemotron 3 Ultra 550B: at FP8, weights are 550 GB. Apply 15% overhead for activations, framework buffers, and routing dispatch, giving roughly 633 GB before KV cache. A single 8x H200 SXM5 node (1,128 GB total) gives comfortable headroom for FP8 serving with a 32K context window and batch size 8. For BF16, 550B x 2 bytes = 1,100 GB, which exceeds a single 8x H100 node (640 GB) and requires two 8x H100 SXM5 nodes (16 GPUs, 1,280 GB combined).
Log into app.spheron.ai and select an 8x H200 SXM5 or 16x H100 SXM5 configuration. SXM variants are required: they include NVLink 4.0 (900 GB/s bidirectional) which is essential for tensor parallelism at TP=8 or higher. PCIe variants (~64 GB/s) will cause severe all-reduce bottlenecks at this scale. For multi-node setups, provision both instances in the same availability zone to minimize InfiniBand latency.
On all nodes: pip install vllm ray (use the version NVIDIA recommends in its serving guide). On the head node, start Ray: ray start --head --port=6379. On worker nodes: ray start --address=<HEAD_NODE_IP>:6379. Verify all GPUs are visible in the Ray cluster: python -c 'import ray; ray.init(address="auto"); print(ray.available_resources())'. CUDA 12.4+ is required for H100 and H200 (Hopper); CUDA 12.8+ for B200 and B300 (Blackwell).
On the head node: huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8 --local-dir /models/nemotron-3-ultra-fp8. The FP8 checkpoint is approximately 550 GB. Use a shared NFS mount or identical local downloads on all nodes before launching the vLLM server. For BF16: huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 --local-dir /models/nemotron-3-ultra. Verify the exact HuggingFace repository names against NVIDIA's release page.
Single-node 8x H200 FP8: vllm serve /models/nemotron-3-ultra-fp8 --quantization fp8 --tensor-parallel-size 8 --enable-expert-parallel --gpu-memory-utilization 0.90 --max-model-len 32768 --trust-remote-code --port 8000. Multi-node 16x H100: vllm serve /models/nemotron-3-ultra-fp8 --quantization fp8 --tensor-parallel-size 8 --enable-expert-parallel --pipeline-parallel-size 2 --distributed-executor-backend ray --gpu-memory-utilization 0.88 --max-model-len 32768 --trust-remote-code --port 8000. Run this command only on the head node; vLLM uses Ray to coordinate the worker.
Send a test request: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"/models/nemotron-3-ultra-fp8","messages":[{"role":"user","content":"Solve: what is 17 multiplied by 23 step by step."}]}'. Benchmark with vLLM's serving script: python benchmarks/benchmark_serving.py --model /models/nemotron-3-ultra-fp8 --num-prompts 100 --request-rate 4 --max-tokens 2048. For reasoning workloads, expect extended generation times per sequence due to chain-of-thought output length.
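The same test request can be sent from Python using only the standard library. The sketch below builds the POST request for the OpenAI-compatible chat completions route shown in the curl command. The helper name is illustrative, and the max_tokens value is an assumption you should size to your reasoning budget.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 2048) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps reasoning chain plus answer
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "/models/nemotron-3-ultra-fp8",
    "Solve: what is 17 multiplied by 23 step by step.",
)
# To send it against a running server:
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

The model field must match the path or name the server was launched with, which is why it repeats the /models/nemotron-3-ultra-fp8 path here.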
Frequently Asked Questions
At FP8 precision (1 byte/param), the 550B weights occupy roughly 550 GB. With 15% framework and activation overhead, you need around 633 GB minimum. A single 8x H200 SXM5 node (1,128 GB total VRAM) fits FP8 comfortably. For BF16 (1,100 GB weights), you need two 8x H100 SXM5 nodes (16 GPUs, 1,280 GB combined). Nemotron 3 Ultra's MoE design means all expert weights must be in VRAM even though only 55B activate per token.
Yes. vLLM supports MoE models with expert parallelism via --enable-expert-parallel (use the version NVIDIA recommends in its serving guide; NVIDIA's model card references the vllm/vllm-openai:v0.22.0 container at launch). For multi-node spanning two 8-GPU nodes, use vLLM with a Ray backend: ray start --head on the first node, then launch vllm serve with --tensor-parallel-size 8 --enable-expert-parallel --pipeline-parallel-size 2 --distributed-executor-backend ray. The model checkpoint is nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 on Hugging Face (verify the exact repository name against NVIDIA's release page).
Nemotron 3 Ultra is the largest tier in the Nemotron 3 family at 550B total parameters with 55B active per token. It uses a LatentMoE architecture combining Mamba-2, MoE, and Attention layers in one hybrid stack, plus Multi-Token Prediction for speculative decoding. Nemotron 3 Super is 120B total / 12B active with a Mamba-Transformer hybrid. Nemotron Ultra 253B is a dense Transformer with no MoE. Nemotron 3 Ultra uses a staged post-training pipeline including RLVR and Multi-Teacher On-Policy Distillation (MOPD), targeting agentic task accuracy across coding, reasoning, tool use, and agent action sequences.
NVIDIA used a staged post-training pipeline. Multi-Teacher On-Policy Distillation (MOPD) is the headline technique: more than 10 specialized teacher models guide training on the student's own generated outputs (on-policy rollouts), covering coding, math, reasoning, tool use, and agentic workflows. RLVR (RL with Verifiable Rewards) runs as an earlier stage in the pipeline. MOPD then preserves and extends those RL gains across multiple domains simultaneously, improving multi-step task completion rather than overfitting to one benchmark category.
On Spheron, an 8x H200 SXM5 node costs $38.72/hr on-demand ($4.84/GPU). At continuous 24/7 use that is roughly $28,266 per month. Managed inference APIs for comparable frontier reasoning models typically charge $4-8 per million tokens. The break-even against a self-hosted 8x H200 on-demand configuration is approximately 3.5-7 billion tokens per month depending on which API tier you are comparing against.
