Tutorial

Deploy Nemotron Ultra 253B on GPU Cloud: Self-Host NVIDIA's Best Open-Weight Reasoning Model (2026)

Written by Mitrasish, Co-founder | Apr 20, 2026
Tags: Nemotron Ultra 253B, Nemotron Ultra, NVIDIA Nemotron, vLLM, Reasoning Models, H100 SXM5, FP8 Quantization, MXFP4, Self-Hosted AI, GPU Cloud

Nemotron Ultra 253B is the first open-weight model to beat DeepSeek R1 on GPQA Diamond and LiveCodeBench while fitting on a single 8xH100 node. That combination matters for teams who need frontier reasoning accuracy without the 16+ GPU cluster DeepSeek R1 demands. If your workloads involve long reasoning chains, also check out the KV cache optimization guide: reasoning models generate long outputs that put significant pressure on the KV cache and directly limit throughput.

What Is Nemotron Ultra 253B

Nemotron Ultra 253B is a dense Transformer with Grouped Query Attention (GQA), trained by NVIDIA using Llama 3.1 as the base model and extended with reinforcement learning post-training. It is not a Mixture-of-Experts architecture: all 253B parameters are active on every forward pass.

The model supports a 128K context window and dual-mode operation via system prompt. Set the system message to detailed thinking on for extended chain-of-thought generation; use detailed thinking off for direct responses with standard latency. This toggle replaces the budget token parameter approach used by some other reasoning models.

Compared to DeepSeek R1 (671B total, 37B active per token), Nemotron Ultra is about 38% of the total parameter count. DeepSeek R1's MoE design keeps active parameters low per token, but the 671B total weight still requires 16+ H100s in BF16. Nemotron Ultra at 253B dense fits on 8 H100 SXM5 GPUs at FP8, which is a single commercially available node configuration.

| Spec | Value |
| --- | --- |
| Architecture | Dense Transformer + GQA |
| Total Parameters | 253B |
| Context Window | 128K tokens |
| Reasoning Mode | System prompt toggle |
| Release Date | April 2025 |
| License | NVIDIA Open Model License + Llama 3.1 Community License |
| HuggingFace ID | nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 |

Nemotron Ultra 253B vs DeepSeek R1 vs Llama 4: Benchmark Comparison

Nemotron Ultra 253B leads DeepSeek R1 on GPQA Diamond and LiveCodeBench. R1 holds a narrow edge on MATH-500 (97.3% vs 97.0%) and scores higher on AIME.

| Benchmark | Nemotron Ultra 253B | DeepSeek R1 671B | Llama 4 Maverick |
| --- | --- | --- | --- |
| MATH-500 | 97.0% | 97.3% | ~85% |
| GPQA Diamond | 76.01% | 71.5% | ~69.8% |
| AIME 2025 | 72.50% | 79.8%† | ~50% |
| LiveCodeBench | 66.31% | 65.9% | ~43.4% |

(Source: NVIDIA HuggingFace model card, reasoning-ON scores. Llama 4 Maverick GPQA Diamond and LiveCodeBench scores from Meta model card; MATH-500 and AIME 2025 are estimates from third-party evaluations and vary across leaderboards. †DeepSeek R1's 79.8% is from AIME 2024; official AIME 2025 scores were not included in the original DeepSeek-R1 paper.)

The hardware efficiency story is the more important comparison. DeepSeek R1 in BF16 needs approximately 1.34 TB of VRAM for weights alone, which means 16-18 H100s minimum. Nemotron Ultra at FP8 fits on 8 H100s. For a team running their own inference infrastructure, that's half the hardware and half the monthly cost for comparable or better benchmark scores on most tasks.

For a broader comparison of these models across more benchmarks, see DeepSeek vs Llama 4 vs Qwen 3. The cost implications of running large reasoning models are covered in the reasoning model inference cost guide.

GPU Hardware Requirements: VRAM, Memory Bandwidth, and Node Sizing

At FP8 precision (1 byte per parameter), Nemotron Ultra 253B weights occupy approximately 253 GB. Add 15% for framework state and activations (~38 GB), and you need 291 GB just to load the model. The remaining capacity on an 8xH100 SXM5 node (640 GB total) goes to KV cache.
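The sizing arithmetic above can be expressed as a quick back-of-envelope helper. This is a sketch under the 15% overhead assumption stated in the text; `vram_estimate_gb` is a hypothetical name, not part of any toolchain:

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Weights plus ~15% framework state and activations, in GB."""
    weights_gb = params_b * bytes_per_param  # 1 GB per billion params per byte
    return weights_gb * (1 + overhead)

# Nemotron Ultra 253B at FP8 (1 byte per parameter):
print(round(vram_estimate_gb(253, 1.0)))        # -> 291 GB to load the model
# Remaining KV cache budget on an 8x H100 SXM5 node (8 x 80 GB = 640 GB):
print(round(640 - vram_estimate_gb(253, 1.0)))  # -> 349 GB
```

The same helper reproduces the BF16 row below (2 bytes per parameter gives ~582 GB before KV cache), which is why BF16 doesn't fit a single 8-GPU node.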

| Precision | Weight VRAM | Overhead (15%) | KV Cache (32K ctx, batch 8) | Total | Min GPUs |
| --- | --- | --- | --- | --- | --- |
| BF16 | ~506 GB | ~76 GB | ~120 GB | ~702 GB | 10x H100 (not viable single-node) |
| FP8 | ~253 GB | ~38 GB | ~120 GB | ~411 GB | 8x H100 SXM5 (640 GB, comfortable) |
| MXFP4* | ~127 GB | ~19 GB | ~120 GB | ~266 GB | 4x B200 (768 GB, with large KV budget) |

*MXFP4 runs only on Blackwell B200/B300 hardware.

The connection topology between GPUs matters as much as total VRAM. Tensor parallelism at TP=8 performs an all-reduce operation across all 8 GPUs every transformer layer. On NVLink-connected SXM5 nodes, NVLink 4.0 provides 900 GB/s bidirectional bandwidth. PCIe 5.0 across 8 GPUs provides roughly 64 GB/s. That's a 14x bandwidth difference per all-reduce, which translates directly to per-layer latency. Always use SXM5 nodes for 8-GPU tensor-parallel inference.

For detailed VRAM estimation methodology, the GPU memory requirements for LLMs guide covers the full calculation including KV cache scaling at different sequence lengths.

An 8xH100 SXM5 node on Spheron runs at $23.20/hr on-demand.

Step-by-Step Deployment with vLLM on 8xH100 (Single Node Setup)

Prerequisites:

  • CUDA 12.4+ (Hopper Transformer Engine support required for FP8)
  • Python 3.10+
  • 500 GB+ persistent storage volume (for the FP8 checkpoint)
  • An 8xH100 SXM5 node with NVLink connectivity

Step 1: Provision the node

Log into app.spheron.ai, navigate to the GPU catalog, and select H100 SXM5 with the 8-GPU bundle configuration.

Step 2: Install vLLM

```bash
pip install "vllm>=0.8.3"
```

Verify the install with vllm --version. CUDA 12.4 or later is required for Hopper FP8 acceleration via the Transformer Engine. Note: this configuration follows the HuggingFace model card example and is community-tested. Check vLLM release notes for the latest official compatibility status.

Step 3: Download the FP8 checkpoint

```bash
huggingface-cli download nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 \
  --local-dir /models/nemotron-ultra-fp8
```

The FP8 checkpoint is approximately 253 GB. Store it on a persistent volume mounted to the instance so you don't re-download on restart.

Step 4: Launch the vLLM server

```bash
vllm serve /models/nemotron-ultra-fp8 \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --served-model-name nemotron-ultra \
  --port 8000
```

Flag notes:

  • --quantization modelopt: NVIDIA-recommended quantization backend for the official FP8 checkpoint. An alternative is --dtype fp8, which activates Transformer Engine FP8 compute directly, but NVIDIA's model card specifies --quantization modelopt for this checkpoint.
  • --tensor-parallel-size 8: splits model weights across all 8 GPUs via NVLink
  • --trust-remote-code: required by the Nemotron Ultra model card
  • --gpu-memory-utilization 0.90: reserves 10% VRAM headroom to avoid OOM on large batches
  • --max-model-len 32768: sets 32K context; see the production section for guidance on 128K
  • --enable-chunked-prefill: pipelines prefill across iterations, reduces TTFT on long reasoning prompts

Step 5: Send a test request with reasoning enabled

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-ultra",
    "messages": [
      {"role": "system", "content": "detailed thinking on"},
      {"role": "user", "content": "Prove that the sum of the first n odd numbers equals n^2."}
    ],
    "max_tokens": 4096
  }'
```
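The same request can be made from application code. This is a stdlib-only sketch of the curl call above; `chat_payload` and `ask` are hypothetical helper names, and the server URL assumes the default port from Step 4:

```python
import json
import urllib.request

def chat_payload(question: str, thinking: bool, max_tokens: int = 4096) -> dict:
    """OpenAI-compatible chat payload using Nemotron's system-prompt toggle."""
    mode = "detailed thinking on" if thinking else "detailed thinking off"
    return {
        "model": "nemotron-ultra",
        "messages": [
            {"role": "system", "content": mode},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
    }

def ask(question: str, thinking: bool = True,
        url: str = "http://localhost:8000/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(chat_payload(question, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires the vLLM server running
        return json.load(resp)["choices"][0]["message"]["content"]
```

Flipping `thinking=False` switches the system message to `detailed thinking off` for low-latency direct answers.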

Docker alternative:

If you prefer a containerized deployment, use the official vLLM Docker image:

```bash
docker run --gpus all --shm-size=10g \
  -v /models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/nemotron-ultra-fp8 \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --served-model-name nemotron-ultra
```

The --shm-size=10g flag is required for multi-GPU shared memory during tensor parallel communication.

For the broader vLLM configuration context, the vLLM production deployment guide covers load balancing, health checks, and rolling restarts. The batching concepts behind --enable-chunked-prefill are explained in the LLM serving optimization guide.

Quantization Options: FP8 and MXFP4 for Smaller GPU Configurations

FP8 on H100 (recommended for Hopper clusters):

FP8 is hardware-accelerated via NVIDIA's Transformer Engine on Hopper (H100). NVIDIA provides an official FP8 checkpoint at nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 on HuggingFace, so you don't need to run offline quantization yourself. The accuracy loss vs BF16 is less than 1% on reasoning benchmarks, and the memory reduction is a flat 2x.

Launch flag: --quantization modelopt (NVIDIA-recommended for the official FP8 checkpoint). Alternatively, --dtype fp8 activates Transformer Engine FP8 compute directly, but NVIDIA's model card specifies --quantization modelopt for this specific checkpoint.

MXFP4 on Blackwell (B200/B300 only):

MXFP4 uses finer-grained scaling factors per 32-element block, which reduces quantization error compared to standard INT4 at the same bit width. On a 4xB200 node (768 GB total VRAM), MXFP4 reduces weight storage to roughly 127 GB, leaving approximately 622 GB for KV cache and batching.
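The per-block scaling idea can be illustrated with a toy round-trip. This is a simplified sketch of MXFP4-style quantization, not NVIDIA's actual kernels: each 32-element block shares one power-of-two scale, and scaled magnitudes snap to the FP4 (E2M1) grid:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize then dequantize one 32-element block, MXFP4-style."""
    assert block.size == 32, "MXFP4 shares one scale per 32-element block"
    maxabs = np.abs(block).max()
    if maxabs == 0:
        return np.zeros_like(block)
    # Power-of-two scale chosen so the largest magnitude fits the grid.
    scale = 2.0 ** np.ceil(np.log2(maxabs / FP4_GRID[-1]))
    scaled = np.abs(block) / scale
    # Snap each scaled magnitude to the nearest FP4 grid point.
    nearest = FP4_GRID[np.abs(FP4_GRID[None, :] - scaled[:, None]).argmin(axis=1)]
    return np.sign(block) * nearest * scale  # dequantized approximation
```

Because the scale is shared per 32 elements rather than per tensor, outliers in one block don't inflate the quantization error of every other block, which is the source of MXFP4's accuracy advantage over plain INT4.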

MXFP4 is Blackwell-exclusive. It will not work on H100, A100, or any pre-Blackwell hardware. Attempting to run it on H100 will fail at model load time.

For the full MXFP4 methodology, the MXFP4 quantization guide covers block scaling mechanics and accuracy tradeoffs. The FP4 quantization on Blackwell GPU cloud guide has cost breakdowns per configuration.

INT4 / GGUF (not recommended for production reasoning):

Quantization errors compound through long thinking chains. INT4 accuracy degradation on MATH-500 and AIME tasks typically exceeds 5-8 percentage points at this model size. GGUF Q5_K_M on a single RTX 5090 (32 GB) cannot fit Nemotron Ultra 253B regardless. If you need a small local deployment, use distilled variants such as Nemotron 4B or 8B.

Hardware summary:

| Configuration | VRAM | Cost/hr (Spheron) | Notes |
| --- | --- | --- | --- |
| 8x H100 SXM5, FP8 | 640 GB | ~$23.20 | Recommended production config |
| 4x B200 SXM6, MXFP4 | 768 GB | ~$8.24 (spot) | Best price/performance on Blackwell; on-demand B200 pricing not currently available |
| 8x H200 SXM5, FP8 | 1128 GB | varies | Useful for 128K+ context with large KV budget |

Production Configuration: Throughput Tuning, Context Length, and Batching

Context length selection:

Set --max-model-len to the 95th-percentile of your actual query length distribution, not the model's 128K theoretical maximum. A 128K context window at batch size 8 in FP16 KV cache requires approximately 960 GB, which doesn't fit on an 8xH100 node. FP8 KV cache (--kv-cache-dtype fp8_e5m2) halves that to ~480 GB, but you're still constrained to small batch concurrency.
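The scaling behind those numbers follows from the standard GQA KV-cache formula. The shape below is an illustrative uniform configuration, not Nemotron Ultra's exact NAS-derived layout (which varies per layer), so the absolute figures are indicative only:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, batch: int, bytes_per_elem: int) -> float:
    """Generic GQA KV-cache size: one K and one V tensor per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * ctx_len * batch / 1e9

# Illustrative uniform shape at 128K context, batch 8:
fp16 = kv_cache_gb(126, 8, 128, 131072, 8, 2)  # FP16 KV cache
fp8 = kv_cache_gb(126, 8, 128, 131072, 8, 1)   # fp8_e5m2 halves the footprint
print(round(fp16), round(fp8))
```

Whatever the exact per-layer shapes, the cache grows linearly in both context length and batch size, which is why trimming `--max-model-len` to your real query distribution buys concurrency directly.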

Practical recommendations:

  • Interactive use: --max-model-len 32768
  • Extended reasoning traces: --max-model-len 65536 with --kv-cache-dtype fp8_e5m2
  • Full 128K (single-stream evaluation only): --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --max-num-seqs 1

Warning: Setting --max-model-len 131072 without enabling FP8 KV cache will cause an OOM error at model load time on an 8xH100 node.

Thinking mode routing:

Use the detailed thinking on/off system prompt toggle to route queries. Simple factual queries with off generate 200-400 tokens. Complex reasoning with on generates 5,000-20,000 tokens. A lightweight classifier routing to the appropriate mode reduces average GPU-seconds per query by 3-5x and meaningfully improves throughput under mixed workload traffic.
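A minimal routing sketch, assuming keyword heuristics stand in for the lightweight classifier mentioned above (in production you would train a small model; the hint list here is illustrative):

```python
# Hypothetical reasoning cues; a real deployment would use a trained classifier.
REASONING_HINTS = {"prove", "derive", "debug", "optimize", "plan", "why"}

def thinking_mode(query: str) -> str:
    """Pick the Nemotron system prompt for a query: long or reasoning-flavored
    queries get extended thinking, simple factual ones get direct answers."""
    words = set(query.lower().split())
    needs_reasoning = len(words) > 40 or bool(words & REASONING_HINTS)
    return "detailed thinking on" if needs_reasoning else "detailed thinking off"
```

The router's output plugs straight into the system message of each request, so the 3-5x GPU-seconds saving costs one string comparison per query.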

Chunked prefill:

Nemotron Ultra's reasoning workloads produce long output sequences that in turn become long prefill sequences on follow-up turns. --enable-chunked-prefill pipelines prefill work across multiple scheduler iterations so the GPU doesn't stall while processing large input contexts. Enable this flag for any production deployment with interactive users.

Tensor parallelism vs pipeline parallelism:

At 8 GPUs, TP=8 (pure tensor parallel) minimizes time-to-first-token for interactive serving. For offline batch workloads where throughput matters more than latency, TP=4 + PP=2 reduces per-layer all-reduce size at the cost of pipeline bubble overhead. For serving mixed interactive and batch workloads, keep TP=8 and separate batch jobs to off-peak hours.

KV cache monitoring:

vLLM exposes a Prometheus endpoint at /metrics. Track vllm:gpu_cache_usage_perc and alert at 85% to prevent cache eviction cascades during traffic spikes. Cache eviction causes visible latency increases as the model re-processes evicted context.
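A minimal polling check against that endpoint might look like the following. The metric name comes from vLLM's Prometheus exposition; the helper names and alerting logic are illustrative:

```python
import urllib.request

def parse_cache_usage(metrics_text: str) -> float:
    """Extract vllm:gpu_cache_usage_perc from Prometheus text exposition."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm:gpu_cache_usage_perc"):
            return float(line.rsplit(" ", 1)[-1])  # value is the last field
    raise RuntimeError("vllm:gpu_cache_usage_perc not found")

def kv_pressure_high(url: str = "http://localhost:8000/metrics",
                     threshold: float = 0.85) -> bool:
    """True when KV cache usage exceeds the alert threshold from the text."""
    body = urllib.request.urlopen(url).read().decode()
    return parse_cache_usage(body) > threshold
```

In practice you would scrape `/metrics` with Prometheus and alert via your existing rules rather than poll by hand, but the 0.85 threshold logic is the same.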

Cost Comparison: Nemotron Ultra on Spheron vs NIM API vs Hyperscalers

Using live Spheron pricing as of 20 Apr 2026:

| Platform | Configuration | Cost/hr | Cost/month (730 hrs) | Notes |
| --- | --- | --- | --- | --- |
| Spheron | 8x H100 SXM5 on-demand | $23.20 | ~$16,936 | Bare metal, full root access |
| AWS | p4de.24xlarge (8x A100 80GB) | ~$40.97 | ~$29,908 | A100, not H100 SXM5 |
| GCP | a3-highgpu-8g (8x H100 80GB) | ~$32.77 | ~$23,922 | H100, includes managed cloud overhead |
| Azure | ND96isr H100 v5 (8x H100 80GB) | ~$37.04 | ~$27,039 | H100, on-demand; reserved pricing is lower |
| NVIDIA NIM API | Pay per token | - | ~$600-$1,200 | At 5M tokens/day, $4-8/M rate |

AWS, GCP, and Azure prices are approximate on-demand list prices and change frequently. Spheron runs roughly 30-40% below the major hyperscalers for equivalent H100 hardware.

Break-even analysis:

At 5M tokens/day (reasonable for a team-scale deployment), NIM API at a conservative $4/M tokens costs $20/day = $600/month. Spheron self-hosting at $23.20/hr costs ~$557/day = ~$16,936/month. Self-hosting breaks even at roughly 4.2 billion tokens/month of sustained throughput (~140M tokens/day). Above that volume, each additional billion tokens per month saves approximately $4,000 compared to NIM API at the $4/M rate.

At $8/M tokens (mid-range NIM API pricing), the break-even drops to ~2.1 billion tokens/month.
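The break-even arithmetic above reduces to a one-liner worth keeping in a spreadsheet or script (prices as of the date in the table; the function name is illustrative):

```python
HOURS_PER_MONTH = 730
NODE_HOURLY_USD = 23.20  # 8x H100 SXM5 on-demand

def breakeven_btokens_per_month(api_usd_per_m_tokens: float) -> float:
    """Monthly volume (billions of tokens) at which self-hosting
    matches the per-token API cost."""
    node_monthly = NODE_HOURLY_USD * HOURS_PER_MONTH    # ~$16,936/month
    return node_monthly / api_usd_per_m_tokens / 1000   # M tokens -> B tokens

print(round(breakeven_btokens_per_month(4.0), 1))  # -> 4.2 B tokens at $4/M
print(round(breakeven_btokens_per_month(8.0), 1))  # -> 2.1 B tokens at $8/M
```

Re-run it with your current node rate and API quote; the crossover moves linearly with both.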

Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For a broader multi-provider analysis, see the GPU cloud pricing comparison 2026.

Use Cases: RAG, Tool Calling, and Agentic Reasoning Workflows

Retrieval-Augmented Generation (RAG)

Nemotron Ultra's 128K context window and 76.01% GPQA Diamond score make it well-suited for multi-document RAG with complex synthesis requirements. At 32K context, a single 8xH100 node sustains 8-12 concurrent RAG queries with detailed thinking on. For high-volume document processing where speed matters more than depth, use detailed thinking off and reduce the context window to 8K to maximize concurrency. See Agentic RAG on GPU Cloud for infrastructure patterns around retrieval pipelines.

Tool Calling and Function Calling

Nemotron Ultra 253B has native function calling support. Use structured output mode with --guided-decoding-backend outlines to enforce JSON schema conformance for tool call outputs. For high-throughput tool calling pipelines, disable thinking mode to minimize token generation and latency. At detailed thinking off, average response length for structured tool calls drops to 100-300 tokens, which translates to 3-5x throughput improvement over reasoning-enabled queries. See the structured output and function calling inference guide for schema configuration examples.
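A request sketch for schema-constrained tool calls follows. The `guided_json` field is vLLM's OpenAI-compatible extension for guided decoding; the weather-lookup schema and helper name are hypothetical examples, not part of the model card:

```python
import json

# Hypothetical single-tool schema for illustration.
TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["get_weather"]},
        "city": {"type": "string"},
    },
    "required": ["tool", "city"],
}

def tool_call_payload(user_msg: str) -> dict:
    """Chat payload that forces schema-conformant JSON output."""
    return {
        "model": "nemotron-ultra",
        "messages": [
            {"role": "system", "content": "detailed thinking off"},  # low latency
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 128,           # structured tool calls stay short
        "guided_json": TOOL_SCHEMA,  # vLLM extension: constrain decoding
    }

print(json.dumps(tool_call_payload("What's the weather in Tokyo?"), indent=2))
```

Pairing `detailed thinking off` with a tight `max_tokens` is what delivers the 100-300 token responses and the resulting throughput gain.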

Agentic Reasoning Workflows

For multi-step agentic tasks where reasoning depth matters (code generation, multi-hop research, plan execution), detailed thinking on provides the reasoning chains needed for reliable task completion. Use adaptive token budgets: short budget (2K max tokens) for simple subtasks, full budget (16K+) for complex planning steps. This prevents token waste on low-complexity steps while preserving quality on hard ones.
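The adaptive-budget idea can be as simple as a lookup table keyed by subtask type. The step kinds and token values below are illustrative assumptions, not from NVIDIA's documentation:

```python
# Hypothetical per-subtask budgets for an agent loop.
TOKEN_BUDGETS = {
    "classify": 512,   # trivial routing decisions
    "extract": 1024,   # structured field extraction
    "plan": 16384,     # complex multi-step planning
    "code": 16384,     # code generation with reasoning
}

def max_tokens_for(step_kind: str) -> int:
    """Budget for a subtask; unknown kinds get a short default."""
    return TOKEN_BUDGETS.get(step_kind, 2048)
```

Each agent step passes `max_tokens_for(kind)` as its `max_tokens`, so cheap subtasks can't burn a full reasoning budget.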

For NIM-based deployment alternatives, the NVIDIA NIM self-host deployment guide covers containerized NVIDIA-managed inference if you prefer that over raw vLLM.


Nemotron Ultra 253B fits on a single 8xH100 node at FP8, the same class of hardware many teams already use for DeepSeek R1's distilled 70B variants, at a fraction of the cost of frontier API providers. Spheron provides on-demand 8xH100 SXM5 nodes at $23.20/hr with no contracts and per-minute billing.

Rent 8xH100 SXM5 on Spheron | View all GPU pricing

Deploy Nemotron Ultra now
