Tutorial

Self-Host Nemotron 3 Super on GPU Cloud: Deployment Guide (2026)

Written by Mitrasish, Co-founder · Apr 2, 2026
Tags: LLM Inference · GPU Cloud · NVIDIA · vLLM · H100 · Quantization · Enterprise AI

Nemotron 3 Super hit 60.47% on SWE-Bench Verified when NVIDIA released it on March 11, 2026, ahead of GTC (March 16-19, 2026). The number that makes it interesting for production deployment is 12B: that's the active parameter count per forward pass, despite a 120B total parameter count. This guide covers the GPU math, vLLM configuration, and cost breakdown for running it yourself on cloud infrastructure.

Before diving into deployment, if you're already running vLLM for other models, the vLLM production deployment guide covers the baseline setup you'll want in place first.

Nemotron 3 Super Architecture: What the Hybrid Mamba-Transformer MoE Means for Inference

Nemotron 3 Super is not a standard transformer. Understanding the architecture directly affects how you configure vLLM, what hardware you need, and where performance bottlenecks will appear.

Standard transformer attention layers compute query-key-value products across the entire sequence at every layer. Compute grows quadratically with sequence length because every token attends to every other token, and the KV cache grows linearly at every attention layer. For long context inference, this is expensive.

SSM (Structured State Space Model) layers, as implemented in Mamba, replace the attention computation with a recurrent state that processes tokens sequentially. Memory consumption is constant with respect to sequence length: each SSM layer keeps a fixed-size state instead of a KV cache that grows with every token. The trade-off: SSM layers maintain a running state that is fundamentally sequential during decode, which limits certain forms of batching.

Nemotron 3 Super interleaves SSM layers with standard attention layers through the same model stack. You get the KV cache savings of SSM for long sequences while keeping the parallel attention layers that handle complex reasoning.

The MoE (Mixture of Experts) component is separate from the SSM/attention distinction. The model has 120B total parameters split across experts, but only roughly 12B activate per forward pass. Because the router can select any expert for any token, all 120B parameters must sit in VRAM even though only about 10% compute on any given token.

LatentMoE and Multi-Token Prediction

Two architectural features distinguish Nemotron 3 Super from standard MoE models.

LatentMoE: Rather than routing input tokens directly to experts, the model first projects tokens into a compressed latent space before expert selection. This lets the routing mechanism activate 4x more experts at the same compute cost compared to standard MoE. More experts contributing to each token improves quality without a proportional VRAM or compute increase.

Multi-Token Prediction (MTP): The model is trained to predict multiple future tokens per forward pass, enabling native speculative decoding. Unlike most speculative decoding setups that require a separate smaller draft model, Nemotron 3 Super handles draft generation internally. This reduces inference latency for medium-to-long responses without extra GPU allocation for a draft model.

Memory Access Patterns

SSM layers scale linearly in compute with sequence length, versus quadratically for attention. The model supports context windows up to 1M tokens. For a 128K context window, the KV cache from attention layers grows at the usual rate, but SSM layers add only a fixed-size recurrent state buffer per layer regardless of context length. At 1M context, this difference becomes substantial: a pure transformer KV cache would be enormous, while Nemotron 3 Super's SSM layers add no additional cache pressure.

For production deployments, the single-GPU NVFP4 vLLM command in this guide uses --max-model-len 32768 as a practical default. The 4x H100 BF16 evaluation configuration uses --max-model-len 16384 due to limited headroom after loading weights. The full 1M context is available on configurations with sufficient VRAM headroom. For repository-level code analysis or document processing with very long inputs, you can increase --max-model-len as long as your GPU has headroom.

The practical implication: Nemotron 3 Super's effective VRAM for long context is lower than a comparable parameter-count pure transformer. A 120B dense model at BF16 with 128K context would require substantially more total memory than Nemotron 3 Super at the same precision.
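To make the KV cache savings concrete, here is a back-of-envelope sketch. The layer counts, KV head count, and head dimension below are hypothetical placeholders, not the model's published layout; the point is the ratio between a pure-attention stack and a hybrid stack where most layers are SSM.

```python
def kv_cache_gb(attn_layers: int, seq_len: int, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache: K and V tensors for every attention layer."""
    return 2 * attn_layers * seq_len * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical 60-layer model at 128K context, BF16 cache:
dense = kv_cache_gb(attn_layers=60, seq_len=131_072)   # every layer is attention
hybrid = kv_cache_gb(attn_layers=10, seq_len=131_072)  # 1 in 6 layers is attention
print(f"pure transformer: {dense:.1f} GB/seq, hybrid: {hybrid:.1f} GB/seq")
```

At 1M tokens the gap widens proportionally, which is why the fixed-size SSM state matters for long-context budgets.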

Prefill vs Decode Differences

During decode, SSM layers compute token by token with a recurrent state. This is sequential by design. Standard attention layers can process in parallel with chunked prefill, but SSM layers cannot correctly initialize their recurrent state across chunk boundaries without special handling.

This is a correctness issue, not just a performance issue. vLLM enables chunked prefill by default in recent versions for long-context efficiency. For Nemotron 3 Super, you must pass --no-enable-chunked-prefill until you have validated that your specific vLLM version handles SSM chunk boundaries correctly. Incorrect SSM state initialization produces wrong outputs, not just slower outputs. Note: vLLM 0.17.1 includes SSM-aware chunked prefill improvements that may handle chunk boundaries correctly; test explicitly before relying on this in production.

| Architecture Component | Standard Transformer | Nemotron 3 Super |
|---|---|---|
| Attention layers | All layers | Alternating with SSM |
| KV cache growth | O(layers x seq_len) | Reduced (fewer attention layers) |
| Decode state | Stateless | SSM layers maintain recurrent state |
| Long-context VRAM | High | Lower (fewer KV cache layers) |
| Active params per token | 100% | ~10% (12B of 120B) |

GPU Requirements: VRAM Budgets by Precision Tier

The key insight with MoE models: you load all 120B parameters into VRAM even though only 12B activate per token. This is fundamentally different from a 12B dense model. You need enough VRAM for the full parameter count, not just the active parameters.

For the general formula and a broader VRAM reference across model families, see GPU memory requirements for LLMs.

VRAM Formula for MoE Models

VRAM = (total_params x bytes_per_param) + KV_cache + activation_overhead

Worked examples for Nemotron 3 Super (120B total parameters):

  • BF16: 120B x 2 bytes = 240 GB minimum. Requires 8x H100 80GB for comfortable production headroom. 4x H100 (320 GB total) technically holds the weights but leaves only 80 GB across 4 cards for KV cache and activations — treat 4x H100 as an evaluation floor, not a production configuration. A single B200 (192 GB HBM3e) cannot hold the BF16 model (240 GB > 192 GB).
  • FP8: 120B x 1 byte = 120 GB minimum. Fits on 2x H100 80GB with reasonable headroom for KV cache.
  • NVFP4: 120B x 0.5 bytes = 60 GB minimum. Fits on a single H100 80GB with ~20 GB remaining for KV cache and activations.
  • GGUF Q4_K_M: Approximately 65-70 GB, similar to NVFP4 but CPU-loadable via llama.cpp for environments without 80 GB GPU access.
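The bullet math above reduces to one line: at 1 byte per parameter, 1B parameters take 1 GB, so weight VRAM is total parameters times bytes per parameter. KV cache and activation overhead come on top.

```python
PRECISION_BYTES = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_vram_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weight memory only; add KV cache and activation overhead on top."""
    return total_params_b * bytes_per_param  # 1B params at 1 byte/param = 1 GB

for name, bpp in PRECISION_BYTES.items():
    print(f"{name}: {weight_vram_gb(120, bpp):.0f} GB minimum")
```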

GPU Tier Recommendations

| Use Case | Precision | GPUs Needed | Spheron Option |
|---|---|---|---|
| Single dev/evaluation | NVFP4 | 1x H100 PCIe | $2.01/hr |
| Staging / small production | FP8 | 2x H100 PCIe | $4.03/hr |
| Evaluation only (short context) | BF16 | 4x H100 PCIe | $8.05/hr |
| Production (BF16 recommended) | BF16 | 8x H100 PCIe | $16.11/hr |
| Best price-perf (spot) | NVFP4 | 1x B200 SXM6 | $1.67/hr spot |

The B200 row is worth highlighting: a single B200 (192 GB HBM3e) fits the NVFP4 quantized 120B model (~60 GB) with over 130 GB headroom for KV cache. At spot pricing of $1.67/hr, it's the most cost-efficient path for quantized serving. Note that the full BF16 model requires 240 GB, which exceeds B200 VRAM — use 8x H100 for BF16 production workloads. See the B200 GPU rental page for availability.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Deploying with vLLM: Configuration for Hybrid Architectures

vLLM is the fastest path to a working Nemotron 3 Super endpoint. TensorRT-LLM gives higher throughput for sustained production but requires NVIDIA's TRT-LLM repo with the nemotron branch and significantly more setup time. See vLLM vs TensorRT-LLM vs SGLang benchmarks for a framework comparison.

Prerequisites

  • CUDA 12.4+ for Hopper (H100); CUDA 12.8+ for Blackwell (B200, B300)
  • vLLM 0.17.1+ (includes native Mamba kernel support; no separate causal-conv1d or mamba-ssm packages needed)
  • Python 3.10+

Earlier vLLM versions do not include the SSM hybrid kernel required for Nemotron 3 Super. If your existing vLLM installation is older, --trust-remote-code alone will not fix the missing kernel; you need to upgrade to 0.17.1 or later.

Installation

```bash
pip install "vllm>=0.17.1"
nvcc --version  # Verify CUDA 12.4+ (Hopper/H100) or 12.8+ (Blackwell/B200, B300)
```

Download the Model

```bash
pip install huggingface-hub
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --local-dir /models/nemotron-super
```

For the NVFP4 quantized checkpoint:

```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir /models/nemotron-super-nvfp4
```

For FP8:

```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --local-dir /models/nemotron-super-fp8
```

Single H100: NVFP4 Launch Command

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super-nvfp4 \
  --quantization nvfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```
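Once the server is up, it speaks the OpenAI-compatible chat API on port 8000. A standard-library sketch of a request follows; passing the local path as the model name is an assumption based on the launch command (check /v1/models for the exact served name):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str, max_tokens: int = 512) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

payload = build_chat_request("Write a Python function that reverses a string.",
                             model="/models/nemotron-super-nvfp4")

# Uncomment once the vLLM server is running:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])
```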

Multi-GPU BF16: 4x H100 with Tensor Parallelism

This configuration is for evaluation only (short context). With 4x H100 (320 GB total), the BF16 weights consume ~240 GB, leaving only ~80 GB across 4 cards for KV cache and activations. Using 128K context at this tier will OOM. Cap --max-model-len at 16384 for evaluation workloads, or use 8x H100 for production with larger context windows.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```

Mamba-Specific vLLM Flags

| Flag | Value | Why |
|---|---|---|
| --trust-remote-code | required | Nemotron uses custom model code |
| --no-enable-chunked-prefill | required initially | SSM state initialization across chunk boundaries is a correctness issue; test without chunked prefill first (vLLM 0.17.1 may handle this correctly; validate before enabling) |
| --max-num-seqs | 32-64 | SSM decode state limits effective batch parallelism vs pure attention |
| --gpu-memory-utilization | 0.90-0.92 | Leave headroom for SSM activation buffers |

Quantization Options: NVFP4 vs FP8 vs GGUF Q4_K_M

NVFP4 (NVIDIA-Native, Blackwell-First)

NVFP4 is the recommended starting point for single-H100 or B200 deployment:

  • Native FP4 tensor core support on Blackwell (B200, B300). H100 (Hopper) lacks native FP4 tensor cores. vLLM can still load NVFP4 checkpoints on H100 by dequantizing to FP8 at runtime, giving you the memory savings of 4-bit weights with FP8-equivalent compute throughput.
  • 0.5 bytes per parameter weight (4-bit)
  • Requires CUDA toolkit 12.4+ and vLLM's built-in NVFP4 kernel
  • Best throughput on Blackwell hardware; still usable on H100 with some throughput penalty vs native
  • Quality degradation is minor for coding tasks. The SWE-Bench gap vs BF16 is typically less than 2%.

Enable in vLLM: --quantization nvfp4

FP8 (Good Middle Ground)

FP8 gives better quality than NVFP4 at the cost of 2x the VRAM:

  • 1 byte per parameter weight
  • Supported natively on H100 (E4M3 format)
  • Enable in vLLM: --quantization fp8
  • Recommended for staging environments or quality-sensitive production where a single H100 is insufficient but full BF16 is overkill

GGUF Q4_K_M (CPU-Loadable, llama.cpp Path)

If you don't have access to an 80 GB GPU and need to run locally or on smaller hardware:

  • Works with llama.cpp if a GGUF conversion exists for the checkpoint
  • Can offload layers to CPU for low-QPS workloads
  • Higher latency vs vLLM GPU path
  • Approximately 65-70 GB for the full model, similar footprint to NVFP4 but CPU-accessible

| Format | VRAM (120B model) | Throughput | Quality vs BF16 | Hardware |
|---|---|---|---|---|
| BF16 | ~240 GB | Baseline | 100% | 8x H100 (4x minimum for eval) |
| FP8 | ~120 GB | 1.5-1.8x BF16 | ~98% | 2x H100 |
| NVFP4 | ~60 GB | 2.5-3x BF16 | ~96% | 1x H100 |
| GGUF Q4_K_M | ~65 GB | Slower (CPU path) | ~95% | 1x H100 or CPU |

One clarification on NVIDIA's advertised 5x throughput claim: that figure compares Nemotron 3 Super against the previous Nemotron Super model, reflecting both architectural improvements and efficiency gains across the model family. NVIDIA also publishes a separate 4x throughput claim comparing NVFP4 on Blackwell (B200) against FP8 on Hopper (H100). The 2.5-3x figure in the table above is the more accurate estimate for NVFP4 vs BF16 on the same H100 hardware.

For deeper analysis of FP4 quantization economics, see FP4 quantization on Blackwell GPUs and KV cache optimization guide.

Benchmarks: SWE-Bench, Throughput, and What the 5x Claim Actually Means

SWE-Bench Verified Results

| Model | SWE-Bench Verified | Active Params | Notes |
|---|---|---|---|
| Nemotron 3 Super | 60.47% | 12B | Hybrid Mamba-MoE, released March 2026 |
| DeepSeek V4 | Not publicly benchmarked as of Mar 2026 | N/A | N/A |
| GPT-5.4 | Not publicly benchmarked as of Mar 2026 | N/A | N/A |

SWE-Bench scores vary depending on test harness and scaffold. The 60.47% figure is from NVIDIA's announcement using their evaluation setup. Your production results with the same model weights may differ based on prompt engineering and tool scaffolding.

For throughput comparison methodology across frameworks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

The 5x Throughput Claim

NVIDIA's 5x claim compares Nemotron 3 Super against the previous Nemotron Super model, reflecting architectural and efficiency gains across the model family. NVIDIA separately claims 4x throughput for NVFP4 on Blackwell vs FP8 on Hopper. Neither figure is a direct NVFP4 vs BF16 comparison on the same hardware.

A concrete estimate: at NVFP4 on a single H100, expect roughly 2,000-4,000 tokens/sec for decode depending on batch size. BF16 on 4x H100 (roughly the same hardware cost) runs approximately 800-1,500 tokens/sec. The NVFP4 single-card path wins on throughput per dollar at moderate batch sizes.

These are derived estimates. Actual throughput depends heavily on sequence lengths, batch sizes, and deployment configuration. Run benchmarks on your specific workload before committing to a hardware tier.

Agentic Coding Task Throughput

Coding agents produce longer outputs than chat: multi-file edits, test generation, and code review outputs often run 4,000-12,000 tokens per task. At 3,000 tokens/sec decode on a single H100 NVFP4, a 9,000-token agent response takes about 3 seconds. At 1,000 tokens/sec on BF16 4x H100, the same output takes 9 seconds.
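The latency figures above are straight division (output tokens over decode speed), assuming decode dominates and ignoring prefill and queueing:

```python
def task_latency_s(output_tokens: int, decode_tps: float) -> float:
    """Wall-clock decode time for one agent response, ignoring prefill."""
    return output_tokens / decode_tps

print(f"NVFP4 1x H100: {task_latency_s(9_000, 3_000):.0f} s")
print(f"BF16 4x H100:  {task_latency_s(9_000, 1_000):.0f} s")
```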

For cost-per-task analysis, see the cost section below.

Cost to Run Nemotron 3 Super: Monthly Estimates for Enterprise Coding Workloads

| Configuration | $/hr (Spheron) | $/month (continuous) | Best For |
|---|---|---|---|
| 1x H100 PCIe NVFP4 | $2.01 | ~$1,447 | Dev / low-QPS staging |
| 2x H100 PCIe FP8 | $4.03 | ~$2,902 | Small team production |
| 4x H100 PCIe BF16 | $8.05 | ~$5,796 | Evaluation only (short context; use 8x H100 for production) |
| 8x H100 PCIe BF16 | $16.11 | ~$11,599 | High-throughput production |
| 1x B200 SXM6 spot | $1.67 | ~$1,202 | Best price-perf for NVFP4 |
| 1x B200 SXM6 on-demand | $7.43 | ~$5,350 | Committed production on B200 |
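The monthly column is simply the hourly rate times a 720-hour month (30 days, running continuously); you can reproduce it directly:

```python
HOURS_PER_MONTH = 720  # 30-day month, running continuously

def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH

for name, rate in [("1x H100 PCIe NVFP4", 2.01),
                   ("8x H100 PCIe BF16", 16.11),
                   ("1x B200 SXM6 spot", 1.67)]:
    print(f"{name}: ${monthly_cost(rate):,.0f}/month")
```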

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Relevant hardware pages: H100 GPU rental, B200 GPU rental, A100 GPU rental for teams currently on A100 planning a migration path.

For strategies to reduce GPU spend, see GPU cost optimization playbook.

Cost Per Coding Task

Assume an agentic coding task produces 8,000 output tokens at 3,000 tokens/sec on 1x H100 NVFP4:

  • Time per task: ~2.7 seconds
  • Cost per task: $2.01/hr / 3,600 sec x 2.7 sec = $0.0015
  • At 10,000 tasks/day: ~$15/day, or ~$450/month
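The per-task arithmetic from the bullets above, as a function you can rerun with your own rates and token counts:

```python
def cost_per_task(hourly_rate: float, output_tokens: int, decode_tps: float) -> float:
    """Dollar cost of one agent response: time per task times the per-second rate."""
    seconds = output_tokens / decode_tps
    return hourly_rate / 3600 * seconds

per_task = cost_per_task(2.01, 8_000, 3_000)
print(f"${per_task:.4f}/task, ~${per_task * 10_000:.0f}/day at 10,000 tasks")
```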

Compare this to per-token API pricing for similarly capable proprietary models. At scale (10,000+ tasks/day), self-hosting on Spheron at $2.01/hr consistently beats per-token API pricing. The breakeven point depends on the specific API you're comparing against, but for coding agents running at high volume, the math favors self-hosting well before you hit 5,000 tasks/day.

Nemotron 3 Super vs DeepSeek V4 vs GPT-5.4: Choosing the Right Coding Model

| Model | SWE-Bench | Total Params | Active Params | Context | Self-Hostable | Approx Cost per 1M tokens |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 60.47% | 120B | 12B | 1M | Yes | ~$0.19 (NVFP4, 1x H100) |
| DeepSeek V4 | Not published | TBD | TBD | TBD | Yes | TBD |
| GPT-5.4 | Not published | Closed | Closed | TBD | No | API only |

When to Choose Nemotron 3 Super

  • Your coding agents process long context windows (file-level or repo-level code review): SSM layers reduce KV cache pressure at long sequences, giving you more context per dollar
  • You need NVIDIA ecosystem tooling: TensorRT-LLM, NIM microservices, NeMo framework
  • You want to avoid third-party API dependencies for enterprise compliance or data residency

When DeepSeek V4 Makes Sense

Production Decision Framework

| SWE-Bench threshold needed | VRAM budget | Compliance requirement | Recommendation |
|---|---|---|---|
| 60%+ | 80 GB single card | Data residency required | Nemotron 3 Super NVFP4 on H100 |
| 60%+ | 192 GB single card | Data residency required | Nemotron 3 Super NVFP4 on B200 |
| 60%+ | 4x 80GB | API acceptable | Nemotron 3 Super BF16 on 4x H100 (evaluation only, short context; OOM risk at longer contexts, use 8x H100 for production) |
| Any | Minimal | API acceptable | Proprietary API (no infra overhead) |
| 55%+ | Existing DeepSeek cluster | Flexible | DeepSeek V3.2 (minimize migration) |

Production Checklist

  1. GPU provisioned with correct VRAM tier for chosen precision
  2. CUDA version verified (nvcc --version): 12.4+ for Hopper (H100); 12.8+ for Blackwell (B200, B300)
  3. vLLM 0.17.1+ installed (pip install "vllm>=0.17.1")
  4. Model checkpoint downloaded and checksum verified
  5. vLLM server launched with --trust-remote-code and correct --tensor-parallel-size
  6. Chunked prefill disabled (--no-enable-chunked-prefill); test and re-enable only after baseline benchmarks validate correctness
  7. Health endpoint verified: curl http://localhost:8000/health
  8. GPU utilization monitored via nvidia-smi dmon -s u
  9. Load balancer or reverse proxy in front of vLLM for production traffic
  10. Pricing alert set up via Spheron dashboard to avoid unexpected cost overruns

For GPU monitoring tooling, see GPU monitoring for ML and production GPU cloud architecture.


Nemotron 3 Super's hybrid architecture makes it one of the most VRAM-efficient 120B-class coding models available. At NVFP4 on a single H100 PCIe, you can run full inference for roughly $2/hr on Spheron with no enterprise contracts.

Rent H100 → | Rent B200 → | View all GPU pricing →

Get started on Spheron →
