Nemotron 3 Super hit 60.47% on SWE-Bench Verified when NVIDIA released it on March 11, 2026, ahead of GTC (March 16-19, 2026). The number that makes it interesting for production deployment is 12B: that's the active parameter count per forward pass, despite a 120B total parameter count. This guide covers the GPU math, vLLM configuration, and cost breakdown for running it yourself on cloud infrastructure.
One note before diving into deployment: if you're already running vLLM for other models, the vLLM production deployment guide covers the baseline setup you'll want in place first.
Nemotron 3 Super Architecture: What the Hybrid Mamba-Transformer MoE Means for Inference
Nemotron 3 Super is not a standard transformer. Understanding the architecture directly affects how you configure vLLM, what hardware you need, and where performance bottlenecks will appear.
Standard transformer attention layers compute query-key-value products across the entire sequence at every layer. Compute grows quadratically with sequence length because every token attends to every other token, and the KV cache grows linearly with sequence length at every attention layer. For long-context inference, both get expensive.
SSM (Structured State Space Model) layers, as implemented in Mamba, replace the attention computation with a recurrent state that processes tokens sequentially. Compute scales linearly with sequence length, and the recurrent state is fixed-size, so memory stays constant no matter how long the sequence grows. The trade-off: that running state is fundamentally sequential during decode, which limits certain forms of batching.
Nemotron 3 Super interleaves SSM layers with standard attention layers in the same model stack. Because only a fraction of the layers are attention, the KV cache stays small for long sequences, while the remaining attention layers handle the precise token-to-token interactions that complex reasoning requires.
The MoE (Mixture of Experts) component is separate from the SSM/attention distinction. The model has 120B total parameters split across experts, but only roughly 12B activate per forward pass. This is why you need to load all 120B into VRAM even though only 10% compute on any given token.
LatentMoE and Multi-Token Prediction
Two architectural features distinguish Nemotron 3 Super from standard MoE models.
LatentMoE: Rather than routing input tokens directly to experts, the model first projects tokens into a compressed latent space before expert selection. This lets the routing mechanism activate 4x more experts at the same compute cost compared to standard MoE. More experts contributing to each token improves quality without a proportional VRAM or compute increase.
Multi-Token Prediction (MTP): The model is trained to predict multiple future tokens per forward pass, enabling native speculative decoding. Unlike most speculative decoding setups that require a separate smaller draft model, Nemotron 3 Super handles draft generation internally. This reduces inference latency for medium-to-long responses without extra GPU allocation for a draft model.
Memory Access Patterns
SSM layers scale linearly with sequence length in compute, versus quadratically for attention. The model supports context windows up to 1M tokens. For a 128K context window, the KV cache from attention layers grows at the usual rate, but SSM layers add only a fixed-size recurrent state buffer per layer regardless of context length. At 1M context, this difference becomes substantial: a pure transformer's KV cache would be enormous, while Nemotron 3 Super's SSM layers add no additional cache pressure.
For production deployments, the single-GPU NVFP4 vLLM command in this guide uses --max-model-len 32768 as a practical default. The 4x H100 BF16 evaluation configuration uses --max-model-len 16384 due to limited headroom after loading weights. The full 1M context is available on configurations with sufficient VRAM headroom. For repository-level code analysis or document processing with very long inputs, you can increase --max-model-len as long as your GPU has headroom.
The practical implication: Nemotron 3 Super's effective VRAM for long context is lower than a comparable parameter-count pure transformer. A 120B dense model at BF16 with 128K context would require substantially more total memory than Nemotron 3 Super at the same precision.
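To see the scale of this difference, here's a back-of-envelope estimator. The layer count, KV head config, attention fraction, and SSM state size below are illustrative assumptions for the sketch, not published Nemotron 3 Super config values:

```python
# Rough cache-memory comparison: pure transformer vs. hybrid Mamba-transformer.
# All architecture numbers here are assumptions for illustration only.

def kv_cache_gb(attn_layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    # K and V tensors: 2 x layers x kv_heads x head_dim x seq_len x bytes
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

TOTAL_LAYERS = 60                # assumed layer count
KV_HEADS, HEAD_DIM = 8, 128      # assumed GQA-style KV config
ATTN_FRACTION = 1 / 6            # assumed share of layers that are attention
SSM_STATE_GB_PER_LAYER = 0.05    # assumed fixed recurrent state per SSM layer

for seq_len in (32_768, 131_072, 1_048_576):
    pure = kv_cache_gb(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, seq_len)
    attn_layers = int(TOTAL_LAYERS * ATTN_FRACTION)
    hybrid = (kv_cache_gb(attn_layers, KV_HEADS, HEAD_DIM, seq_len)
              + (TOTAL_LAYERS - attn_layers) * SSM_STATE_GB_PER_LAYER)
    print(f"{seq_len:>9,} tokens: pure ~{pure:6.1f} GB | hybrid ~{hybrid:6.1f} GB")
```

The exact numbers shift with the real layer mix, but the shape of the curve is the point: the hybrid's cache grows several times more slowly with context length.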
Prefill vs Decode Differences
During decode, SSM layers compute token by token with a recurrent state. This is sequential by design. Standard attention layers can process in parallel with chunked prefill, but SSM layers cannot correctly initialize their recurrent state across chunk boundaries without special handling.
This is a correctness issue, not just a performance issue. vLLM enables chunked prefill by default in recent versions for long-context efficiency. For Nemotron 3 Super, you must pass --no-enable-chunked-prefill until you have validated that your specific vLLM version handles SSM chunk boundaries correctly. Incorrect SSM state initialization produces wrong outputs, not just slower outputs. Note: vLLM 0.17.1 includes SSM-aware chunked prefill improvements that may handle chunk boundaries correctly; test explicitly before relying on this in production.
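One way to run that validation: launch two servers from the same checkpoint, one with chunked prefill disabled and one with it enabled, then diff greedy-decoded outputs for a prompt long enough to span multiple prefill chunks. A minimal sketch, assuming OpenAI-compatible servers on ports 8000 (chunked prefill off) and 8001 (on) and the NVFP4 model path from the launch commands later in this guide:

```python
# Compare greedy outputs across two vLLM servers to catch SSM chunk-boundary
# bugs: with temperature 0, the two completions should match.
import requests

# Long prompt so prefill spans multiple chunks on the chunked-prefill server.
PROMPT = "def quicksort(arr):" + "\n# filler context line" * 2000

def complete(port: int) -> str:
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={
            "model": "/models/nemotron-super-nvfp4",
            "prompt": PROMPT,
            "max_tokens": 128,
            "temperature": 0,  # greedy decoding for a deterministic diff
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

baseline = complete(8000)  # server launched with --no-enable-chunked-prefill
chunked = complete(8001)   # server launched with chunked prefill enabled
print("MATCH" if baseline == chunked else "MISMATCH: keep chunked prefill off")
```

Small numeric differences from batching can occasionally flip a token even in a correct setup, so treat a mismatch as a flag for closer inspection rather than definitive proof of broken SSM state handling.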
| Architecture Component | Standard Transformer | Nemotron 3 Super |
|---|---|---|
| Attention layers | All layers | Alternating with SSM |
| KV cache growth | O(layers x seq_len) | Reduced (fewer attention layers) |
| Decode state | Stateless | SSM layers maintain recurrent state |
| Long-context VRAM | High | Lower (fewer KV cache layers) |
| Active params per token | 100% | ~10% (12B of 120B) |
GPU Requirements: VRAM Budgets by Precision Tier
The key insight with MoE models: you load all 120B parameters into VRAM even though only 12B activate per token. This is fundamentally different from a 12B dense model. You need enough VRAM for the full parameter count, not just the active parameters.
For the general formula and a broader VRAM reference across model families, see GPU memory requirements for LLMs.
VRAM Formula for MoE Models
```
VRAM = (total_params x bytes_per_param) + KV_cache + activation_overhead
```

Worked examples for Nemotron 3 Super (120B total parameters), with a short calculator sketch after the list:
- BF16: 120B x 2 bytes = 240 GB minimum. Requires 8x H100 80GB for comfortable production headroom. 4x H100 (320 GB total) technically holds the weights but leaves only 80 GB across 4 cards for KV cache and activations — treat 4x H100 as an evaluation floor, not a production configuration. A single B200 (192 GB HBM3e) cannot hold the BF16 model (240 GB > 192 GB).
- FP8: 120B x 1 byte = 120 GB minimum. Fits on 2x H100 80GB with reasonable headroom for KV cache.
- NVFP4: 120B x 0.5 bytes = 60 GB minimum. Fits on a single H100 80GB with ~20 GB remaining for KV cache and activations.
- GGUF Q4_K_M: Approximately 65-70 GB, similar to NVFP4 but CPU-loadable via llama.cpp for environments without 80 GB GPU access.
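The formula's weights term in code — a minimal calculator covering the three GPU precisions above (KV cache and activations come on top and vary with config):

```python
# Weights-only VRAM floor for a MoE model: total (not active) parameter
# count times bytes per parameter. KV cache and activations come on top.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

def weight_vram_gb(total_params_billions: float, precision: str) -> float:
    return total_params_billions * BYTES_PER_PARAM[precision]

for precision in ("bf16", "fp8", "nvfp4"):
    print(f"{precision:>6}: {weight_vram_gb(120, precision):6.1f} GB minimum")
# bf16: 240.0 GB | fp8: 120.0 GB | nvfp4: 60.0 GB
```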
GPU Tier Recommendations
| Use Case | Precision | GPUs Needed | Spheron Option |
|---|---|---|---|
| Single dev/evaluation | NVFP4 | 1x H100 PCIe | $2.01/hr |
| Staging / small production | FP8 | 2x H100 PCIe | $4.03/hr |
| Evaluation only (short context) | BF16 | 4x H100 PCIe | $8.05/hr |
| Production (BF16 recommended) | BF16 | 8x H100 PCIe | $16.11/hr |
| Best price-perf (spot) | NVFP4 | 1x B200 SXM6 | $1.67/hr spot |
The B200 row is worth highlighting: a single B200 (192 GB HBM3e) fits the NVFP4 quantized 120B model (~60 GB) with over 130 GB headroom for KV cache. At spot pricing of $1.67/hr, it's the most cost-efficient path for quantized serving. Note that the full BF16 model requires 240 GB, which exceeds B200 VRAM — use 8x H100 for BF16 production workloads. See the B200 GPU rental page for availability.
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Deploying with vLLM: Configuration for Hybrid Architectures
vLLM is the fastest path to a working Nemotron 3 Super endpoint. TensorRT-LLM gives higher throughput for sustained production but requires NVIDIA's TRT-LLM repo with the nemotron branch and significantly more setup time. See vLLM vs TensorRT-LLM vs SGLang benchmarks for a framework comparison.
Prerequisites
- CUDA 12.4+ for Hopper (H100); CUDA 12.8+ for Blackwell (B200, B300)
- vLLM 0.17.1+ (includes native Mamba kernel support; no separate causal-conv1d or mamba-ssm packages needed)
- Python 3.10+
Earlier vLLM versions do not include the SSM hybrid kernel required for Nemotron 3 Super. If your existing vLLM installation is older, --trust-remote-code alone will not fix the missing kernel; you need to upgrade to 0.17.1 or later.
Installation
pip install "vllm>=0.17.1"
nvcc --version # Verify CUDA 12.4+ (Hopper/H100) or 12.8+ (Blackwell/B200, B300)Download the Model
```bash
pip install huggingface-hub
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --local-dir /models/nemotron-super
```

For the NVFP4 quantized checkpoint:
```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir /models/nemotron-super-nvfp4
```

For FP8:
```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --local-dir /models/nemotron-super-fp8
```

Single H100: NVFP4 Launch Command
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super-nvfp4 \
  --quantization nvfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```

Multi-GPU BF16: 4x H100 with Tensor Parallelism
This configuration is for evaluation only (short context). With 4x H100 (320 GB total), the BF16 weights consume ~240 GB, leaving only ~80 GB across 4 cards for KV cache and activations. Using 128K context at this tier will OOM. Cap --max-model-len at 16384 for evaluation workloads, or use 8x H100 for production with larger context windows.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```

Mamba-Specific vLLM Flags
| Flag | Value | Why |
|---|---|---|
| --trust-remote-code | required | Nemotron uses custom model code |
| --no-enable-chunked-prefill | required initially | SSM state initialization across chunk boundaries is a correctness issue; test without chunked prefill first (vLLM 0.17.1 may handle this correctly; validate before enabling) |
| --max-num-seqs | 32-64 | SSM decode state limits effective batch parallelism vs pure attention |
| --gpu-memory-utilization | 0.90-0.92 | Leave headroom for SSM activation buffers |
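Once a server from either launch command is up, a quick smoke test through the OpenAI-compatible API confirms the endpoint end to end. A minimal sketch, assuming the single-GPU NVFP4 server on port 8000 (the model name must match whatever you passed to --model):

```python
# Smoke-test the vLLM OpenAI-compatible endpoint with the official client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="/models/nemotron-super-nvfp4",  # must match the --model path
    messages=[{"role": "user",
               "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=256,
    temperature=0.2,
)
print(response.choices[0].message.content)
```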
Quantization Options: NVFP4 vs FP8 vs GGUF Q4_K_M
NVFP4 (NVIDIA-Native, Blackwell-First)
NVFP4 is the recommended starting point for single-H100 or B200 deployment:
- Native FP4 tensor core support on Blackwell (B200, B300). H100 (Hopper) lacks native FP4 tensor cores. vLLM can still load NVFP4 checkpoints on H100 by dequantizing to FP8 at runtime, giving you the memory savings of 4-bit weights with FP8-equivalent compute throughput.
- 0.5 bytes per parameter weight (4-bit)
- Requires CUDA toolkit 12.4+ and vLLM's built-in NVFP4 kernel
- Best throughput on Blackwell hardware; still usable on H100 with some throughput penalty vs native
- Quality degradation is minor for coding tasks. The SWE-Bench gap vs BF16 is typically less than 2%.
Enable in vLLM: --quantization nvfp4
FP8 (Good Middle Ground)
FP8 gives better quality than NVFP4 at the cost of 2x the VRAM:
- 1 byte per parameter weight
- Supported natively on H100 (E4M3 format)
- Enable in vLLM: --quantization fp8
- Recommended for staging environments or quality-sensitive production where a single H100 is insufficient but full BF16 is overkill
GGUF Q4_K_M (CPU-Loadable, llama.cpp Path)
If you don't have access to an 80 GB GPU and need to run locally or on smaller hardware:
- Works with llama.cpp if a GGUF conversion exists for the checkpoint (a loading sketch follows this list)
- Can offload layers to CPU for low-QPS workloads
- Higher latency vs vLLM GPU path
- Approximately 65-70 GB for the full model, similar footprint to NVFP4 but CPU-accessible
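If a conversion exists, llama-cpp-python is the simplest loading path. A hedged sketch — the GGUF file name below is a placeholder, so confirm an actual community conversion before building on this:

```python
# Load a hypothetical GGUF Q4_K_M conversion with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/nemotron-super-Q4_K_M.gguf",  # placeholder file name
    n_gpu_layers=-1,  # offload everything to GPU; lower to spill layers to CPU
    n_ctx=32768,      # context window; keep within available RAM/VRAM
)

out = llm("Write a unit test for a binary search function.", max_tokens=512)
print(out["choices"][0]["text"])
```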
| Format | VRAM (120B model) | Throughput | Quality vs BF16 | Hardware |
|---|---|---|---|---|
| BF16 | ~240 GB | Baseline | 100% | 8x H100 (4x minimum for eval) |
| FP8 | ~120 GB | 1.5-1.8x BF16 | ~98% | 2x H100 |
| NVFP4 | ~60 GB | 2.5-3x BF16 | ~96% | 1x H100 |
| GGUF Q4_K_M | ~65 GB | Slower (CPU path) | ~95% | 1x H100 or CPU |
One clarification on NVIDIA's advertised 5x throughput claim: that figure compares Nemotron 3 Super against the previous Nemotron Super model, reflecting both architectural improvements and efficiency gains across the model family. NVIDIA also publishes a separate 4x throughput claim comparing NVFP4 on Blackwell (B200) against FP8 on Hopper (H100). The 2.5-3x figure in the table above is the more accurate estimate for NVFP4 vs BF16 on the same H100 hardware.
For deeper analysis of FP4 quantization economics, see FP4 quantization on Blackwell GPUs and KV cache optimization guide.
Benchmarks: SWE-Bench, Throughput, and What the 5x Claim Actually Means
SWE-Bench Verified Results
| Model | SWE-Bench Verified | Active Params | Notes |
|---|---|---|---|
| Nemotron 3 Super | 60.47% | 12B | Hybrid Mamba-MoE, released March 2026 |
| DeepSeek V4 | Not published | — | Not publicly benchmarked on SWE-Bench Verified as of Mar 2026 |
| GPT-5.4 | Not published | — | Not publicly benchmarked on SWE-Bench Verified as of Mar 2026 |
SWE-Bench scores vary depending on test harness and scaffold. The 60.47% figure is from NVIDIA's announcement using their evaluation setup. Your production results with the same model weights may differ based on prompt engineering and tool scaffolding.
For throughput comparison methodology across frameworks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
The 5x Throughput Claim
NVIDIA's 5x claim compares Nemotron 3 Super against the previous Nemotron Super model, reflecting architectural and efficiency gains across the model family. NVIDIA separately claims 4x throughput for NVFP4 on Blackwell vs FP8 on Hopper. Neither figure is a direct NVFP4 vs BF16 comparison on the same hardware.
A concrete estimate: at NVFP4 on a single H100, expect roughly 2,000-4,000 tokens/sec for decode depending on batch size. BF16 on 4x H100 (roughly the same hardware cost) runs approximately 800-1,500 tokens/sec. The NVFP4 single-card path wins on throughput per dollar at moderate batch sizes.
These are derived estimates. Actual throughput depends heavily on sequence lengths, batch sizes, and deployment configuration. Run benchmarks on your specific workload before committing to a hardware tier.
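As a starting point for that benchmarking, time a single non-streamed completion and read the token counts vLLM reports in the response's usage field. A minimal sketch, assuming the server from this guide on localhost:8000:

```python
# Rough single-request decode-throughput check against a local vLLM server.
import time
import requests

payload = {
    "model": "/models/nemotron-super-nvfp4",  # match your --model path
    "prompt": "Refactor this function for readability:\n\ndef f(x):\n    return x",
    "max_tokens": 1024,
    "temperature": 0.7,
}

start = time.perf_counter()
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.perf_counter() - start

tokens = resp.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} tok/s")
# Single-request numbers understate batched throughput; sweep concurrency too.
```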
Agentic Coding Task Throughput
Coding agents produce longer outputs than chat: multi-file edits, test generation, and code review outputs often run 4,000-12,000 tokens per task. At 3,000 tokens/sec decode on a single H100 NVFP4, a 9,000-token agent response takes about 3 seconds. At 1,000 tokens/sec on BF16 4x H100, the same output takes 9 seconds.
For cost-per-task analysis, see the cost section below.
Cost to Run Nemotron 3 Super: Monthly Estimates for Enterprise Coding Workloads
| Configuration | $/hr (Spheron) | $/month (continuous) | Best For |
|---|---|---|---|
| 1x H100 PCIe NVFP4 | $2.01 | ~$1,447 | Dev / low-QPS staging |
| 2x H100 PCIe FP8 | $4.03 | ~$2,902 | Small team production |
| 4x H100 PCIe BF16 | $8.05 | ~$5,796 | Evaluation only (short context; use 8x H100 for production) |
| 8x H100 PCIe BF16 | $16.11 | ~$11,599 | High-throughput production |
| 1x B200 SXM6 spot | $1.67 | ~$1,202 | Best price-perf for NVFP4 |
| 1x B200 SXM6 on-demand | $7.43 | ~$5,350 | Committed production on B200 |
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Relevant hardware pages: H100 GPU rental, B200 GPU rental, A100 GPU rental for teams currently on A100 planning a migration path.
For strategies to reduce GPU spend, see GPU cost optimization playbook.
Cost Per Coding Task
Assume an agentic coding task produces 8,000 output tokens at 3,000 tokens/sec on 1x H100 NVFP4:
- Time per task: ~2.7 seconds
- Cost per task: $2.01/hr / 3,600 sec x 2.7 sec = $0.0015
- At 10,000 tasks/day: ~$15/day, or ~$450/month
Compare this to per-token API pricing for similarly capable proprietary models. At scale (10,000+ tasks/day), self-hosting on Spheron at $2.01/hr consistently beats per-token API pricing. The breakeven point depends on the specific API you're comparing against, but for coding agents running at high volume, the math favors self-hosting well before you hit 5,000 tasks/day.
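The same arithmetic as a reusable sketch, so you can substitute your own measured throughput (see the benchmark snippet above) and hourly rate:

```python
# Cost-per-task math from the bullets above, parameterized.
def cost_per_task(hourly_rate: float, tokens_per_task: int,
                  tokens_per_sec: float) -> float:
    seconds = tokens_per_task / tokens_per_sec
    return hourly_rate / 3600 * seconds

rate, tokens, tps = 2.01, 8_000, 3_000  # 1x H100 NVFP4 assumptions from above
per_task = cost_per_task(rate, tokens, tps)
print(f"${per_task:.4f}/task; ~${per_task * 10_000:.0f}/day at 10,000 tasks/day")
# -> $0.0015/task; ~$15/day, or ~$450/month
```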
Nemotron 3 Super vs DeepSeek V4 vs GPT-5.4: Choosing the Right Coding Model
| Model | SWE-Bench | Total Params | Active Params | Context | Self-Hostable | Approx Cost per 1M tokens |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 60.47% | 120B | 12B | 1M | Yes | ~$0.19 (NVFP4, 1x H100) |
| DeepSeek V4 | Not published | TBD | TBD | TBD | Yes | TBD |
| GPT-5.4 | Not published | Closed | Closed | TBD | No | API only |
When to Choose Nemotron 3 Super
- Your coding agents process long context windows (file-level or repo-level code review): SSM layers reduce KV cache pressure at long sequences, giving you more context per dollar
- You need NVIDIA ecosystem tooling: TensorRT-LLM, NIM microservices, NeMo framework
- You want to avoid third-party API dependencies for enterprise compliance or data residency
When DeepSeek V4 Makes Sense
- You already have a DeepSeek V3/V3.2 deployment and want to minimize migration cost
- See the DeepSeek V3.2 deployment guide for setup details
Production Decision Framework
| SWE-Bench threshold needed | VRAM budget | Compliance requirement | Recommendation |
|---|---|---|---|
| 60%+ | 80 GB single card | Data residency required | Nemotron 3 Super NVFP4 on H100 |
| 60%+ | 192 GB single card | Data residency required | Nemotron 3 Super NVFP4 on B200 |
| 60%+ | 4x 80GB | API acceptable | Nemotron 3 Super BF16 on 4x H100 (evaluation only, short context; OOM risk at longer contexts, use 8x H100 for production) |
| Any | Minimal | API acceptable | Proprietary API (no infra overhead) |
| 55%+ | Existing DeepSeek cluster | Flexible | DeepSeek V3.2 (minimize migration) |
Production Checklist
- GPU provisioned with correct VRAM tier for chosen precision
- CUDA version verified (nvcc --version): 12.4+ for Hopper (H100); 12.8+ for Blackwell (B200, B300)
- vLLM 0.17.1+ installed (pip install "vllm>=0.17.1")
- Model checkpoint downloaded and checksum verified
- vLLM server launched with --trust-remote-code and correct --tensor-parallel-size
- Chunked prefill disabled (--no-enable-chunked-prefill); test and re-enable only after baseline benchmarks validate correctness
- Health endpoint verified: curl http://localhost:8000/health
- GPU utilization monitored via nvidia-smi dmon -s u
- Load balancer or reverse proxy in front of vLLM for production traffic
- Pricing alert set up via Spheron dashboard to avoid unexpected cost overruns
For GPU monitoring tooling, see GPU monitoring for ML and production GPU cloud architecture.
Nemotron 3 Super's hybrid architecture makes it one of the most VRAM-efficient 120B-class coding models available. At NVFP4 on a single H100 PCIe, you can run full inference for roughly $2/hr on Spheron with no enterprise contracts.
