Tutorial

Self-Host Nemotron 3 Super on GPU Cloud: Deployment Guide (2026)

Written by Mitrasish, Co-founder · Apr 2, 2026
Tags: LLM Inference · GPU Cloud · NVIDIA · vLLM · H100 · Quantization · Enterprise AI

Nemotron 3 Super hit 60.47% on SWE-Bench Verified when NVIDIA released it on March 11, 2026, ahead of GTC (March 16-19, 2026). The number that makes it interesting for production deployment is 12B: that's the active parameter count per forward pass, despite a 120B total parameter count. This guide covers the GPU math, vLLM configuration, and cost breakdown for running it yourself on cloud infrastructure.

Before diving into deployment, if you're already running vLLM for other models, the vLLM production deployment guide covers the baseline setup you'll want in place first.

Nemotron 3 Super Architecture: What the Hybrid Mamba-Transformer MoE Means for Inference

Nemotron 3 Super is not a standard transformer. Understanding the architecture directly affects how you configure vLLM, what hardware you need, and where performance bottlenecks will appear.

Standard transformer attention layers compute query-key-value products across the entire sequence at every layer. Compute grows quadratically with sequence length because every token attends to every other token, and the KV cache grows linearly at every attention layer. For long context inference, this is expensive.

SSM (Structured State Space Model) layers, as implemented in Mamba, replace the attention computation with a recurrent state that processes tokens sequentially. Memory consumption is constant with respect to sequence length: each SSM layer keeps a fixed-size state instead of a KV cache that grows with every token. The trade-off: SSM layers maintain a running state that is fundamentally sequential during decode, which limits certain forms of batching.

Nemotron 3 Super interleaves SSM layers with standard attention layers through the same model stack. You get the KV cache savings of SSM for long sequences while keeping the parallel attention layers that handle complex reasoning.

The MoE (Mixture of Experts) component is separate from the SSM/attention distinction. The model has 120B total parameters split across experts, but only roughly 12B activate per forward pass. Because the router can select any expert for any token, all 120B parameters must sit in VRAM even though only about 10% compute on any given token.

LatentMoE and Multi-Token Prediction

Two architectural features distinguish Nemotron 3 Super from standard MoE models.

LatentMoE: Rather than routing input tokens directly to experts, the model first projects tokens into a compressed latent space before expert selection. This lets the routing mechanism activate 4x more experts at the same compute cost compared to standard MoE. More experts contributing to each token improves quality without a proportional VRAM or compute increase.

Multi-Token Prediction (MTP): The model is trained to predict multiple future tokens per forward pass, enabling native speculative decoding. Unlike most speculative decoding setups that require a separate smaller draft model, Nemotron 3 Super handles draft generation internally. This reduces inference latency for medium-to-long responses without extra GPU allocation for a draft model.

Memory Access Patterns

SSM layers scale linearly in compute with sequence length, versus quadratically for attention. The model supports context windows up to 1M tokens. For a 128K context window, the KV cache from attention layers grows at the usual rate, but SSM layers add only a fixed-size recurrent state buffer per layer regardless of context length. At 1M context, this difference becomes substantial: a pure transformer KV cache would be enormous, while Nemotron 3 Super's SSM layers add no additional cache pressure.

For production deployments, the single-GPU NVFP4 vLLM command in this guide uses --max-model-len 32768 as a practical default. The 4x H100 BF16 evaluation configuration uses --max-model-len 16384 due to limited headroom after loading weights. The full 1M context is available on configurations with sufficient VRAM headroom. For repository-level code analysis or document processing with very long inputs, you can increase --max-model-len as long as your GPU has headroom.

The practical implication: Nemotron 3 Super's effective VRAM for long context is lower than a comparable parameter-count pure transformer. A 120B dense model at BF16 with 128K context would require substantially more total memory than Nemotron 3 Super at the same precision.
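To make the KV cache savings concrete, here is a back-of-envelope sketch. The layer counts, KV head count, and head dimension below are hypothetical placeholders, not the model's published layout; the point is the ratio between a pure-attention stack and a hybrid stack where most layers are SSM.

```python
def kv_cache_gb(attn_layers: int, seq_len: int, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache: K and V tensors for every attention layer."""
    return 2 * attn_layers * seq_len * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical 60-layer model at 128K context, BF16 cache:
dense = kv_cache_gb(attn_layers=60, seq_len=131_072)   # every layer is attention
hybrid = kv_cache_gb(attn_layers=10, seq_len=131_072)  # 1 in 6 layers is attention
print(f"pure transformer: {dense:.1f} GB/seq, hybrid: {hybrid:.1f} GB/seq")
```

At 1M tokens the gap widens proportionally, which is why the fixed-size SSM state matters for long-context budgets.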

Prefill vs Decode Differences

During decode, SSM layers compute token by token with a recurrent state. This is sequential by design. Standard attention layers can process in parallel with chunked prefill, but SSM layers cannot correctly initialize their recurrent state across chunk boundaries without special handling.

This is a correctness issue, not just a performance issue. vLLM enables chunked prefill by default in recent versions for long-context efficiency. For Nemotron 3 Super, you must pass --no-enable-chunked-prefill until you have validated that your specific vLLM version handles SSM chunk boundaries correctly. Incorrect SSM state initialization produces wrong outputs, not just slower outputs. Note: vLLM 0.17.1 includes SSM-aware chunked prefill improvements that may handle chunk boundaries correctly; test explicitly before relying on this in production.

| Architecture Component | Standard Transformer | Nemotron 3 Super |
|---|---|---|
| Attention layers | All layers | Alternating with SSM |
| KV cache growth | O(layers x seq_len) | Reduced (fewer attention layers) |
| Decode state | Stateless | SSM layers maintain recurrent state |
| Long-context VRAM | High | Lower (fewer KV cache layers) |
| Active params per token | 100% | ~10% (12B of 120B) |

GPU Requirements: VRAM Budgets by Precision Tier

The key insight with MoE models: you load all 120B parameters into VRAM even though only 12B activate per token. This is fundamentally different from a 12B dense model. You need enough VRAM for the full parameter count, not just the active parameters.

For the general formula and a broader VRAM reference across model families, see GPU memory requirements for LLMs.

VRAM Formula for MoE Models

VRAM = (total_params x bytes_per_param) + KV_cache + activation_overhead

Worked examples for Nemotron 3 Super (120B total parameters):

  • BF16: 120B x 2 bytes = 240 GB minimum. Requires 8x H100 80GB for comfortable production headroom. 4x H100 (320 GB total) technically holds the weights but leaves only 80 GB across 4 cards for KV cache and activations — treat 4x H100 as an evaluation floor, not a production configuration. A single B200 (192 GB HBM3e) cannot hold the BF16 model (240 GB > 192 GB).
  • FP8: 120B x 1 byte = 120 GB minimum. Fits on 2x H100 80GB with reasonable headroom for KV cache.
  • NVFP4: 120B x 0.5 bytes = 60 GB minimum. Fits on a single H100 80GB with ~20 GB remaining for KV cache and activations.
  • GGUF Q4_K_M: Approximately 65-70 GB, similar to NVFP4 but CPU-loadable via llama.cpp for environments without 80 GB GPU access.
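The bullet math above reduces to one line: at 1 byte per parameter, 1B parameters take 1 GB, so weight VRAM is total parameters times bytes per parameter. KV cache and activation overhead come on top.

```python
PRECISION_BYTES = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_vram_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weight memory only; add KV cache and activation overhead on top."""
    return total_params_b * bytes_per_param  # 1B params at 1 byte/param = 1 GB

for name, bpp in PRECISION_BYTES.items():
    print(f"{name}: {weight_vram_gb(120, bpp):.0f} GB minimum")
```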

GPU Tier Recommendations

| Use Case | Precision | GPUs Needed | Spheron Option |
|---|---|---|---|
| Single dev/evaluation | NVFP4 | 1x H100 PCIe | $2.01/hr |
| Staging / small production | FP8 | 2x H100 PCIe | $4.03/hr |
| Evaluation only (short context) | BF16 | 4x H100 PCIe | $8.05/hr |
| Production (BF16 recommended) | BF16 | 8x H100 PCIe | $16.11/hr |
| Best price-perf (spot) | NVFP4 | 1x B200 SXM6 | $1.67/hr spot |

The B200 row is worth highlighting: a single B200 (192 GB HBM3e) fits the NVFP4 quantized 120B model (~60 GB) with over 130 GB headroom for KV cache. At spot pricing of $1.67/hr, it's the most cost-efficient path for quantized serving. Note that the full BF16 model requires 240 GB, which exceeds B200 VRAM — use 8x H100 for BF16 production workloads. See the B200 GPU rental page for availability.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Deploying with vLLM: Configuration for Hybrid Architectures

vLLM is the fastest path to a working Nemotron 3 Super endpoint. TensorRT-LLM gives higher throughput for sustained production but requires NVIDIA's TRT-LLM repo with the nemotron branch and significantly more setup time. See vLLM vs TensorRT-LLM vs SGLang benchmarks for a framework comparison.

Prerequisites

  • CUDA 12.4+ for Hopper (H100); CUDA 12.8+ for Blackwell (B200, B300)
  • vLLM 0.17.1+ (includes native Mamba kernel support; no separate causal-conv1d or mamba-ssm packages needed)
  • Python 3.10+

Earlier vLLM versions do not include the SSM hybrid kernel required for Nemotron 3 Super. If your existing vLLM installation is older, --trust-remote-code alone will not fix the missing kernel; you need to upgrade to 0.17.1 or later.

Installation

```bash
pip install "vllm>=0.17.1"
nvcc --version  # Verify CUDA 12.4+ (Hopper/H100) or 12.8+ (Blackwell/B200, B300)
```

Download the Model

```bash
pip install huggingface-hub
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --local-dir /models/nemotron-super
```

For the NVFP4 quantized checkpoint:

```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir /models/nemotron-super-nvfp4
```

For FP8:

```bash
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --local-dir /models/nemotron-super-fp8
```

Single H100: NVFP4 Launch Command

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super-nvfp4 \
  --quantization nvfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```
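Once the server is up, it speaks the OpenAI-compatible chat API on port 8000. A standard-library sketch of a request follows; passing the local path as the model name is an assumption based on the launch command (check /v1/models for the exact served name):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str, max_tokens: int = 512) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

payload = build_chat_request("Write a Python function that reverses a string.",
                             model="/models/nemotron-super-nvfp4")

# Uncomment once the vLLM server is running:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])
```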

Multi-GPU BF16: 4x H100 with Tensor Parallelism

This configuration is for evaluation only (short context). With 4x H100 (320 GB total), the BF16 weights consume ~240 GB, leaving only ~80 GB across 4 cards for KV cache and activations. Using 128K context at this tier will OOM. Cap --max-model-len at 16384 for evaluation workloads, or use 8x H100 for production with larger context windows.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-super \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --trust-remote-code \
  --no-enable-chunked-prefill \
  --port 8000
```

Mamba-Specific vLLM Flags

| Flag | Value | Why |
|---|---|---|
| --trust-remote-code | required | Nemotron uses custom model code |
| --no-enable-chunked-prefill | required initially | SSM state initialization across chunk boundaries is a correctness issue; test without chunked prefill first (vLLM 0.17.1 may handle this correctly; validate before enabling) |
| --max-num-seqs | 32-64 | SSM decode state limits effective batch parallelism vs pure attention |
| --gpu-memory-utilization | 0.90-0.92 | Leave headroom for SSM activation buffers |

Quantization Options: NVFP4 vs FP8 vs GGUF Q4_K_M

NVFP4 (NVIDIA-Native, Blackwell-First)

NVFP4 is the recommended starting point for single-H100 or B200 deployment:

  • Native FP4 tensor core support on Blackwell (B200, B300). H100 (Hopper) lacks native FP4 tensor cores. vLLM can still load NVFP4 checkpoints on H100 by dequantizing to FP8 at runtime, giving you the memory savings of 4-bit weights with FP8-equivalent compute throughput.
  • 0.5 bytes per parameter weight (4-bit)
  • Requires CUDA toolkit 12.4+ and vLLM's built-in NVFP4 kernel
  • Best throughput on Blackwell hardware; still usable on H100 with some throughput penalty vs native
  • Quality degradation is minor for coding tasks. The SWE-Bench gap vs BF16 is typically less than 2%.

Enable in vLLM: --quantization nvfp4

FP8 (Good Middle Ground)

FP8 gives better quality than NVFP4 at the cost of 2x the VRAM:

  • 1 byte per parameter weight
  • Supported natively on H100 (E4M3 format)
  • Enable in vLLM: --quantization fp8
  • Recommended for staging environments or quality-sensitive production where a single H100 is insufficient but full BF16 is overkill

GGUF Q4_K_M (CPU-Loadable, llama.cpp Path)

If you don't have access to an 80 GB GPU and need to run locally or on smaller hardware:

  • Works with llama.cpp if a GGUF conversion exists for the checkpoint
  • Can offload layers to CPU for low-QPS workloads
  • Higher latency vs vLLM GPU path
  • Approximately 65-70 GB for the full model, similar footprint to NVFP4 but CPU-accessible

| Format | VRAM (120B model) | Throughput | Quality vs BF16 | Hardware |
|---|---|---|---|---|
| BF16 | ~240 GB | Baseline | 100% | 8x H100 (4x minimum for eval) |
| FP8 | ~120 GB | 1.5-1.8x BF16 | ~98% | 2x H100 |
| NVFP4 | ~60 GB | 2.5-3x BF16 | ~96% | 1x H100 |
| GGUF Q4_K_M | ~65 GB | Slower (CPU path) | ~95% | 1x H100 or CPU |

One clarification on NVIDIA's advertised 5x throughput claim: that figure compares Nemotron 3 Super against the previous Nemotron Super model, reflecting both architectural improvements and efficiency gains across the model family. NVIDIA also publishes a separate 4x throughput claim comparing NVFP4 on Blackwell (B200) against FP8 on Hopper (H100). The 2.5-3x figure in the table above is the more accurate estimate for NVFP4 vs BF16 on the same H100 hardware.

For deeper analysis of FP4 quantization economics, see FP4 quantization on Blackwell GPUs and KV cache optimization guide.

Benchmarks: SWE-Bench, Throughput, and What the 5x Claim Actually Means

SWE-Bench Verified Results

| Model | SWE-Bench Verified | Active Params | Notes |
|---|---|---|---|
| Nemotron 3 Super | 60.47% | 12B | Hybrid Mamba-MoE, released March 2026 |
| DeepSeek V4 | Not publicly benchmarked as of Mar 2026 | N/A | N/A |
| GPT-5.4 | Not publicly benchmarked as of Mar 2026 | N/A | N/A |

SWE-Bench scores vary depending on test harness and scaffold. The 60.47% figure is from NVIDIA's announcement using their evaluation setup. Your production results with the same model weights may differ based on prompt engineering and tool scaffolding.

For throughput comparison methodology across frameworks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

The 5x Throughput Claim

NVIDIA's 5x claim compares Nemotron 3 Super against the previous Nemotron Super model, reflecting architectural and efficiency gains across the model family. NVIDIA separately claims 4x throughput for NVFP4 on Blackwell vs FP8 on Hopper. Neither figure is a direct NVFP4 vs BF16 comparison on the same hardware.

A concrete estimate: at NVFP4 on a single H100, expect roughly 2,000-4,000 tokens/sec for decode depending on batch size. BF16 on 4x H100 (roughly the same hardware cost) runs approximately 800-1,500 tokens/sec. The NVFP4 single-card path wins on throughput per dollar at moderate batch sizes.

These are derived estimates. Actual throughput depends heavily on sequence lengths, batch sizes, and deployment configuration. Run benchmarks on your specific workload before committing to a hardware tier.

Agentic Coding Task Throughput

Coding agents produce longer outputs than chat: multi-file edits, test generation, and code review outputs often run 4,000-12,000 tokens per task. At 3,000 tokens/sec decode on a single H100 NVFP4, a 9,000-token agent response takes about 3 seconds. At 1,000 tokens/sec on BF16 4x H100, the same output takes 9 seconds.
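The latency figures above are straight division (output tokens over decode speed), assuming decode dominates and ignoring prefill and queueing:

```python
def task_latency_s(output_tokens: int, decode_tps: float) -> float:
    """Wall-clock decode time for one agent response, ignoring prefill."""
    return output_tokens / decode_tps

print(f"NVFP4 1x H100: {task_latency_s(9_000, 3_000):.0f} s")
print(f"BF16 4x H100:  {task_latency_s(9_000, 1_000):.0f} s")
```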

For cost-per-task analysis, see the cost section below.

Cost to Run Nemotron 3 Super: Monthly Estimates for Enterprise Coding Workloads

| Configuration | $/hr (Spheron) | $/month (continuous) | Best For |
|---|---|---|---|
| 1x H100 PCIe NVFP4 | $2.01 | ~$1,447 | Dev / low-QPS staging |
| 2x H100 PCIe FP8 | $4.03 | ~$2,902 | Small team production |
| 4x H100 PCIe BF16 | $8.05 | ~$5,796 | Evaluation only (short context; use 8x H100 for production) |
| 8x H100 PCIe BF16 | $16.11 | ~$11,599 | High-throughput production |
| 1x B200 SXM6 spot | $1.67 | ~$1,202 | Best price-perf for NVFP4 |
| 1x B200 SXM6 on-demand | $7.43 | ~$5,350 | Committed production on B200 |
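The monthly column is simply the hourly rate times a 720-hour month (30 days, running continuously); you can reproduce it directly:

```python
HOURS_PER_MONTH = 720  # 30-day month, running continuously

def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH

for name, rate in [("1x H100 PCIe NVFP4", 2.01),
                   ("8x H100 PCIe BF16", 16.11),
                   ("1x B200 SXM6 spot", 1.67)]:
    print(f"{name}: ${monthly_cost(rate):,.0f}/month")
```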

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Relevant hardware pages: H100 GPU rental, B200 GPU rental, A100 GPU rental for teams currently on A100 planning a migration path.

For strategies to reduce GPU spend, see GPU cost optimization playbook.

Cost Per Coding Task

Assume an agentic coding task produces 8,000 output tokens at 3,000 tokens/sec on 1x H100 NVFP4:

  • Time per task: ~2.7 seconds
  • Cost per task: $2.01/hr / 3,600 sec x 2.7 sec = $0.0015
  • At 10,000 tasks/day: ~$15/day, or ~$450/month
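The per-task arithmetic from the bullets above, as a function you can rerun with your own rates and token counts:

```python
def cost_per_task(hourly_rate: float, output_tokens: int, decode_tps: float) -> float:
    """Dollar cost of one agent response: time per task times the per-second rate."""
    seconds = output_tokens / decode_tps
    return hourly_rate / 3600 * seconds

per_task = cost_per_task(2.01, 8_000, 3_000)
print(f"${per_task:.4f}/task, ~${per_task * 10_000:.0f}/day at 10,000 tasks")
```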

Compare this to per-token API pricing for similarly capable proprietary models. At scale (10,000+ tasks/day), self-hosting on Spheron at $2.01/hr consistently beats per-token API pricing. The breakeven point depends on the specific API you're comparing against, but for coding agents running at high volume, the math favors self-hosting well before you hit 5,000 tasks/day.

Nemotron 3 Super vs DeepSeek V4 vs GPT-5.4: Choosing the Right Coding Model

| Model | SWE-Bench | Total Params | Active Params | Context | Self-Hostable | Approx Cost per 1M tokens |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 60.47% | 120B | 12B | 1M | Yes | ~$0.19 (NVFP4, 1x H100) |
| DeepSeek V4 | Not published | TBD | TBD | TBD | Yes | TBD |
| GPT-5.4 | Not published | Closed | Closed | TBD | No | API only |

When to Choose Nemotron 3 Super

  • Your coding agents process long context windows (file-level or repo-level code review): SSM layers reduce KV cache pressure at long sequences, giving you more context per dollar
  • You need NVIDIA ecosystem tooling: TensorRT-LLM, NIM microservices, NeMo framework
  • You want to avoid third-party API dependencies for enterprise compliance or data residency

When DeepSeek V4 Makes Sense

Production Decision Framework

| SWE-Bench threshold needed | VRAM budget | Compliance requirement | Recommendation |
|---|---|---|---|
| 60%+ | 80 GB single card | Data residency required | Nemotron 3 Super NVFP4 on H100 |
| 60%+ | 192 GB single card | Data residency required | Nemotron 3 Super NVFP4 on B200 |
| 60%+ | 4x 80GB | API acceptable | Nemotron 3 Super BF16 on 4x H100 (evaluation only, short context; OOM risk at longer contexts, use 8x H100 for production) |
| Any | Minimal | API acceptable | Proprietary API (no infra overhead) |
| 55%+ | Existing DeepSeek cluster | Flexible | DeepSeek V3.2 (minimize migration) |

Production Checklist

  1. GPU provisioned with correct VRAM tier for chosen precision
  2. CUDA version verified (nvcc --version): 12.4+ for Hopper (H100); 12.8+ for Blackwell (B200, B300)
  3. vLLM 0.17.1+ installed (pip install "vllm>=0.17.1")
  4. Model checkpoint downloaded and checksum verified
  5. vLLM server launched with --trust-remote-code and correct --tensor-parallel-size
  6. Chunked prefill disabled (--no-enable-chunked-prefill); test and re-enable only after baseline benchmarks validate correctness
  7. Health endpoint verified: curl http://localhost:8000/health
  8. GPU utilization monitored via nvidia-smi dmon -s u
  9. Load balancer or reverse proxy in front of vLLM for production traffic
  10. Pricing alert set up via Spheron dashboard to avoid unexpected cost overruns

For GPU monitoring tooling, see GPU monitoring for ML and production GPU cloud architecture.


Nemotron 3 Super's hybrid architecture makes it one of the most VRAM-efficient 120B-class coding models available. At NVFP4 on a single H100 PCIe, you can run full inference for roughly $2/hr on Spheron with no enterprise contracts.

Rent H100 → | Rent B200 → | View all GPU pricing →

Get started on Spheron →
