Tutorial

Deploy Nemotron Ultra 253B on GPU Cloud: Self-Host NVIDIA's Best Open-Weight Reasoning Model (2026)

Written by Mitrasish, Co-founder | Apr 20, 2026
Tags: Nemotron Ultra 253B, Nemotron Ultra, NVIDIA Nemotron, vLLM, Reasoning Models, H100 SXM5, FP8 Quantization, MXFP4, Self-Hosted AI, GPU Cloud

Nemotron Ultra 253B is the first open-weight model to beat DeepSeek R1 on GPQA Diamond and LiveCodeBench while fitting on a single 8xH100 node. That combination matters for teams who need frontier reasoning accuracy without the 16+ GPU cluster DeepSeek R1 demands. If your workloads involve long reasoning chains, also check out the KV cache optimization guide: reasoning models generate long outputs that put significant pressure on the KV cache and directly limit throughput.

What Is Nemotron Ultra 253B

Nemotron Ultra 253B is a dense Transformer with Grouped Query Attention (GQA), trained by NVIDIA using Llama 3.1 as the base model and extended with reinforcement learning post-training. It is not a Mixture-of-Experts architecture: all 253B parameters are active on every forward pass.

The model supports a 128K context window and dual-mode operation via system prompt. Set the system message to detailed thinking on for extended chain-of-thought generation; use detailed thinking off for direct responses with standard latency. This toggle replaces the budget token parameter approach used by some other reasoning models.

Compared to DeepSeek R1 (671B total, 37B active per token), Nemotron Ultra is about 38% of the total parameter count. DeepSeek R1's MoE design keeps active parameters low per token, but the 671B total weight still requires 16+ H100s in BF16. Nemotron Ultra at 253B dense fits on 8 H100 SXM5 GPUs at FP8, which is a single commercially available node configuration.

| Spec | Value |
| --- | --- |
| Architecture | Dense Transformer + GQA |
| Total Parameters | 253B |
| Context Window | 128K tokens |
| Reasoning Mode | System prompt toggle |
| Release Date | April 2025 |
| License | NVIDIA Open Model License + Llama 3.1 Community License |
| HuggingFace ID | nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 |

Nemotron Ultra 253B vs DeepSeek R1 vs Llama 4: Benchmark Comparison

Nemotron Ultra 253B leads DeepSeek R1 on GPQA Diamond and LiveCodeBench. R1 holds a narrow edge on MATH-500 (97.3% vs 97.0%) and scores higher on AIME.

| Benchmark | Nemotron Ultra 253B | DeepSeek R1 671B | Llama 4 Maverick |
| --- | --- | --- | --- |
| MATH-500 | 97.0% | 97.3% | ~85% |
| GPQA Diamond | 76.01% | 71.5% | ~69.8% |
| AIME 2025 | 72.50% | 79.8%† | ~50% |
| LiveCodeBench | 66.31% | 65.9% | ~43.4% |

(Source: NVIDIA HuggingFace model card, reasoning-ON scores. Llama 4 Maverick GPQA Diamond and LiveCodeBench scores from Meta model card; MATH-500 and AIME 2025 are estimates from third-party evaluations and vary across leaderboards. †DeepSeek R1's 79.8% is from AIME 2024; official AIME 2025 scores were not included in the original DeepSeek-R1 paper.)

The hardware efficiency story is the more important comparison. DeepSeek R1 in BF16 needs approximately 1.34 TB of VRAM for weights alone, which means 16-18 H100s minimum. Nemotron Ultra at FP8 fits on 8 H100s. For a team running their own inference infrastructure, that's half the hardware and half the monthly cost for comparable or better benchmark scores on most tasks.

For a broader comparison of these models across more benchmarks, see DeepSeek vs Llama 4 vs Qwen 3. The cost implications of running large reasoning models are covered in the reasoning model inference cost guide.

GPU Hardware Requirements: VRAM, Memory Bandwidth, and Node Sizing

At FP8 precision (1 byte per parameter), Nemotron Ultra 253B weights occupy approximately 253 GB. Add 15% for framework state and activations (~38 GB), and you need 291 GB just to load the model. The remaining capacity on an 8xH100 SXM5 node (640 GB total) goes to KV cache.
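The sizing arithmetic above can be expressed as a quick back-of-envelope helper. This is a sketch under the 15% overhead assumption stated in the text; `vram_estimate_gb` is a hypothetical name, not part of any toolchain:

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Weights plus ~15% framework state and activations, in GB."""
    weights_gb = params_b * bytes_per_param  # 1 GB per billion params per byte
    return weights_gb * (1 + overhead)

# Nemotron Ultra 253B at FP8 (1 byte per parameter):
print(round(vram_estimate_gb(253, 1.0)))        # -> 291 GB to load the model
# Remaining KV cache budget on an 8x H100 SXM5 node (8 x 80 GB = 640 GB):
print(round(640 - vram_estimate_gb(253, 1.0)))  # -> 349 GB
```

The same helper reproduces the BF16 row below (2 bytes per parameter gives ~582 GB before KV cache), which is why BF16 doesn't fit a single 8-GPU node.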

| Precision | Weight VRAM | Overhead (15%) | KV Cache (32K ctx, batch 8) | Total | Min GPUs |
| --- | --- | --- | --- | --- | --- |
| BF16 | ~506 GB | ~76 GB | ~120 GB | ~702 GB | 10x H100 (not viable single-node) |
| FP8 | ~253 GB | ~38 GB | ~120 GB | ~411 GB | 8x H100 SXM5 (640 GB, comfortable) |
| MXFP4* | ~127 GB | ~19 GB | ~120 GB | ~266 GB | 4x B200 (768 GB, with large KV budget) |

*MXFP4 runs only on Blackwell B200/B300 hardware.

The connection topology between GPUs matters as much as total VRAM. Tensor parallelism at TP=8 performs an all-reduce operation across all 8 GPUs every transformer layer. On NVLink-connected SXM5 nodes, NVLink 4.0 provides 900 GB/s bidirectional bandwidth. PCIe 5.0 across 8 GPUs provides roughly 64 GB/s. That's a 14x bandwidth difference per all-reduce, which translates directly to per-layer latency. Always use SXM5 nodes for 8-GPU tensor-parallel inference.

For detailed VRAM estimation methodology, the GPU memory requirements for LLMs guide covers the full calculation including KV cache scaling at different sequence lengths.

An 8xH100 SXM5 node on Spheron runs at $23.20/hr on-demand.

Step-by-Step Deployment with vLLM on 8xH100 (Single Node Setup)

Prerequisites:

  • CUDA 12.4+ (Hopper Transformer Engine support required for FP8)
  • Python 3.10+
  • 500 GB+ persistent storage volume (for the FP8 checkpoint)
  • An 8xH100 SXM5 node with NVLink connectivity

Step 1: Provision the node

Log into app.spheron.ai, navigate to the GPU catalog, and select H100 SXM5 with the 8-GPU bundle configuration.

Step 2: Install vLLM

```bash
pip install "vllm>=0.8.3"
```

Verify the install with vllm --version. CUDA 12.4 or later is required for Hopper FP8 acceleration via the Transformer Engine. Note: this configuration follows the HuggingFace model card example and is community-tested. Check vLLM release notes for the latest official compatibility status.

Step 3: Download the FP8 checkpoint

```bash
huggingface-cli download nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 \
  --local-dir /models/nemotron-ultra-fp8
```

The FP8 checkpoint is approximately 253 GB. Store it on a persistent volume mounted to the instance so you don't re-download on restart.

Step 4: Launch the vLLM server

```bash
vllm serve /models/nemotron-ultra-fp8 \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --served-model-name nemotron-ultra \
  --port 8000
```

Flag notes:

  • --quantization modelopt: NVIDIA-recommended quantization backend for the official FP8 checkpoint. An alternative is --dtype fp8, which activates Transformer Engine FP8 compute directly, but NVIDIA's model card specifies --quantization modelopt for this checkpoint.
  • --tensor-parallel-size 8: splits model weights across all 8 GPUs via NVLink
  • --trust-remote-code: required by the Nemotron Ultra model card
  • --gpu-memory-utilization 0.90: reserves 10% VRAM headroom to avoid OOM on large batches
  • --max-model-len 32768: sets 32K context; see the production section for guidance on 128K
  • --enable-chunked-prefill: pipelines prefill across iterations, reduces TTFT on long reasoning prompts

Step 5: Send a test request with reasoning enabled

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-ultra",
    "messages": [
      {"role": "system", "content": "detailed thinking on"},
      {"role": "user", "content": "Prove that the sum of the first n odd numbers equals n^2."}
    ],
    "max_tokens": 4096
  }'
```
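The same request can be made from application code. This is a stdlib-only sketch of the curl call above; `chat_payload` and `ask` are hypothetical helper names, and the server URL assumes the default port from Step 4:

```python
import json
import urllib.request

def chat_payload(question: str, thinking: bool, max_tokens: int = 4096) -> dict:
    """OpenAI-compatible chat payload using Nemotron's system-prompt toggle."""
    mode = "detailed thinking on" if thinking else "detailed thinking off"
    return {
        "model": "nemotron-ultra",
        "messages": [
            {"role": "system", "content": mode},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
    }

def ask(question: str, thinking: bool = True,
        url: str = "http://localhost:8000/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(chat_payload(question, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires the vLLM server running
        return json.load(resp)["choices"][0]["message"]["content"]
```

Flipping `thinking=False` switches the system message to `detailed thinking off` for low-latency direct answers.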

Docker alternative:

If you prefer a containerized deployment, use the official vLLM Docker image:

```bash
docker run --gpus all --shm-size=10g \
  -v /models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/nemotron-ultra-fp8 \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --served-model-name nemotron-ultra
```

The --shm-size=10g flag is required for multi-GPU shared memory during tensor parallel communication.

For the broader vLLM configuration context, the vLLM production deployment guide covers load balancing, health checks, and rolling restarts. The batching concepts behind --enable-chunked-prefill are explained in the LLM serving optimization guide.

Quantization Options: FP8 and MXFP4 for Smaller GPU Configurations

FP8 on H100 (recommended for Hopper clusters):

FP8 is hardware-accelerated via NVIDIA's Transformer Engine on Hopper (H100). NVIDIA provides an official FP8 checkpoint at nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 on HuggingFace, so you don't need to run offline quantization yourself. The accuracy loss vs BF16 is less than 1% on reasoning benchmarks, and the memory reduction is a flat 2x.

Launch flag: --quantization modelopt (NVIDIA-recommended for the official FP8 checkpoint). Alternatively, --dtype fp8 activates Transformer Engine FP8 compute directly, but NVIDIA's model card specifies --quantization modelopt for this specific checkpoint.

MXFP4 on Blackwell (B200/B300 only):

MXFP4 uses finer-grained scaling factors per 32-element block, which reduces quantization error compared to standard INT4 at the same bit width. On a 4xB200 node (768 GB total VRAM), MXFP4 reduces weight storage to roughly 127 GB, leaving approximately 622 GB for KV cache and batching.
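The per-block scaling idea can be illustrated with a toy round-trip. This is a simplified sketch of MXFP4-style quantization, not NVIDIA's actual kernels: each 32-element block shares one power-of-two scale, and scaled magnitudes snap to the FP4 (E2M1) grid:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize then dequantize one 32-element block, MXFP4-style."""
    assert block.size == 32, "MXFP4 shares one scale per 32-element block"
    maxabs = np.abs(block).max()
    if maxabs == 0:
        return np.zeros_like(block)
    # Power-of-two scale chosen so the largest magnitude fits the grid.
    scale = 2.0 ** np.ceil(np.log2(maxabs / FP4_GRID[-1]))
    scaled = np.abs(block) / scale
    # Snap each scaled magnitude to the nearest FP4 grid point.
    nearest = FP4_GRID[np.abs(FP4_GRID[None, :] - scaled[:, None]).argmin(axis=1)]
    return np.sign(block) * nearest * scale  # dequantized approximation
```

Because the scale is shared per 32 elements rather than per tensor, outliers in one block don't inflate the quantization error of every other block, which is the source of MXFP4's accuracy advantage over plain INT4.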

MXFP4 is Blackwell-exclusive. It will not work on H100, A100, or any pre-Blackwell hardware. Attempting to run it on H100 will fail at model load time.

For the full MXFP4 methodology, the MXFP4 quantization guide covers block scaling mechanics and accuracy tradeoffs. The FP4 quantization on Blackwell GPU cloud guide has cost breakdowns per configuration.

INT4 / GGUF (not recommended for production reasoning):

Quantization errors compound through long thinking chains. INT4 accuracy degradation on MATH-500 and AIME tasks typically exceeds 5-8 percentage points at this model size. GGUF Q5_K_M on a single RTX 5090 (32 GB) cannot fit Nemotron Ultra 253B regardless. If you need a small local deployment, use distilled variants such as Nemotron 4B or 8B.

Hardware summary:

| Configuration | VRAM | Cost/hr (Spheron) | Notes |
| --- | --- | --- | --- |
| 8x H100 SXM5, FP8 | 640 GB | ~$23.20 | Recommended production config |
| 4x B200 SXM6, MXFP4 | 768 GB | ~$8.24 (spot) | Best price/performance on Blackwell; on-demand B200 pricing not currently available |
| 8x H200 SXM5, FP8 | 1128 GB | varies | Useful for 128K+ context with large KV budget |

Production Configuration: Throughput Tuning, Context Length, and Batching

Context length selection:

Set --max-model-len to the 95th-percentile of your actual query length distribution, not the model's 128K theoretical maximum. A 128K context window at batch size 8 in FP16 KV cache requires approximately 960 GB, which doesn't fit on an 8xH100 node. FP8 KV cache (--kv-cache-dtype fp8_e5m2) halves that to ~480 GB, but you're still constrained to small batch concurrency.
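The scaling behind those numbers follows from the standard GQA KV-cache formula. The shape below is an illustrative uniform configuration, not Nemotron Ultra's exact NAS-derived layout (which varies per layer), so the absolute figures are indicative only:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, batch: int, bytes_per_elem: int) -> float:
    """Generic GQA KV-cache size: one K and one V tensor per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * ctx_len * batch / 1e9

# Illustrative uniform shape at 128K context, batch 8:
fp16 = kv_cache_gb(126, 8, 128, 131072, 8, 2)  # FP16 KV cache
fp8 = kv_cache_gb(126, 8, 128, 131072, 8, 1)   # fp8_e5m2 halves the footprint
print(round(fp16), round(fp8))
```

Whatever the exact per-layer shapes, the cache grows linearly in both context length and batch size, which is why trimming `--max-model-len` to your real query distribution buys concurrency directly.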

Practical recommendations:

  • Interactive use: --max-model-len 32768
  • Extended reasoning traces: --max-model-len 65536 with --kv-cache-dtype fp8_e5m2
  • Full 128K (single-stream evaluation only): --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --max-num-seqs 1

Warning: Setting --max-model-len 131072 without enabling FP8 KV cache will cause an OOM error at model load time on an 8xH100 node.

Thinking mode routing:

Use the detailed thinking on/off system prompt toggle to route queries. Simple factual queries with off generate 200-400 tokens. Complex reasoning with on generates 5,000-20,000 tokens. A lightweight classifier routing to the appropriate mode reduces average GPU-seconds per query by 3-5x and meaningfully improves throughput under mixed workload traffic.
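A minimal routing sketch, assuming keyword heuristics stand in for the lightweight classifier mentioned above (in production you would train a small model; the hint list here is illustrative):

```python
# Hypothetical reasoning cues; a real deployment would use a trained classifier.
REASONING_HINTS = {"prove", "derive", "debug", "optimize", "plan", "why"}

def thinking_mode(query: str) -> str:
    """Pick the Nemotron system prompt for a query: long or reasoning-flavored
    queries get extended thinking, simple factual ones get direct answers."""
    words = set(query.lower().split())
    needs_reasoning = len(words) > 40 or bool(words & REASONING_HINTS)
    return "detailed thinking on" if needs_reasoning else "detailed thinking off"
```

The router's output plugs straight into the system message of each request, so the 3-5x GPU-seconds saving costs one string comparison per query.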

Chunked prefill:

Nemotron Ultra's reasoning workloads produce long output sequences that in turn become long prefill sequences on follow-up turns. --enable-chunked-prefill pipelines prefill work across multiple scheduler iterations so the GPU doesn't stall while processing large input contexts. Enable this flag for any production deployment with interactive users.

Tensor parallelism vs pipeline parallelism:

At 8 GPUs, TP=8 (pure tensor parallel) minimizes time-to-first-token for interactive serving. For offline batch workloads where throughput matters more than latency, TP=4 + PP=2 reduces per-layer all-reduce size at the cost of pipeline bubble overhead. For serving mixed interactive and batch workloads, keep TP=8 and separate batch jobs to off-peak hours.

KV cache monitoring:

vLLM exposes a Prometheus endpoint at /metrics. Track vllm:gpu_cache_usage_perc and alert at 85% to prevent cache eviction cascades during traffic spikes. Cache eviction causes visible latency increases as the model re-processes evicted context.
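A minimal polling check against that endpoint might look like the following. The metric name comes from vLLM's Prometheus exposition; the helper names and alerting logic are illustrative:

```python
import urllib.request

def parse_cache_usage(metrics_text: str) -> float:
    """Extract vllm:gpu_cache_usage_perc from Prometheus text exposition."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm:gpu_cache_usage_perc"):
            return float(line.rsplit(" ", 1)[-1])  # value is the last field
    raise RuntimeError("vllm:gpu_cache_usage_perc not found")

def kv_pressure_high(url: str = "http://localhost:8000/metrics",
                     threshold: float = 0.85) -> bool:
    """True when KV cache usage exceeds the alert threshold from the text."""
    body = urllib.request.urlopen(url).read().decode()
    return parse_cache_usage(body) > threshold
```

In practice you would scrape `/metrics` with Prometheus and alert via your existing rules rather than poll by hand, but the 0.85 threshold logic is the same.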

Cost Comparison: Nemotron Ultra on Spheron vs NIM API vs Hyperscalers

Using live Spheron pricing as of 20 Apr 2026:

| Platform | Configuration | Cost/hr | Cost/month (730 hrs) | Notes |
| --- | --- | --- | --- | --- |
| Spheron | 8x H100 SXM5 on-demand | $23.20 | ~$16,936 | Bare metal, full root access |
| AWS | p4de.24xlarge (8x A100 80GB) | ~$40.97 | ~$29,908 | A100, not H100 SXM5 |
| GCP | a3-highgpu-8g (8x H100 80GB) | ~$32.77 | ~$23,922 | H100, includes managed cloud overhead |
| Azure | ND96isr H100 v5 (8x H100 80GB) | ~$37.04 | ~$27,039 | H100, on-demand; reserved pricing is lower |
| NVIDIA NIM API | Pay per token | - | ~$600-$1,200 | At 5M tokens/day, $4-8/M rate |

AWS, GCP, and Azure prices are approximate on-demand list prices and change frequently. Spheron runs roughly 30-40% below the major hyperscalers for equivalent H100 hardware.

Break-even analysis:

At 5M tokens/day (reasonable for a team-scale deployment), NIM API at a conservative $4/M tokens costs $20/day = $600/month. Spheron self-hosting at $23.20/hr costs ~$557/day = ~$16,936/month. Self-hosting breaks even at roughly 4.2 billion tokens/month of sustained throughput (~140M tokens/day). Above that volume, each additional billion tokens per month saves approximately $4,000 compared to NIM API at the $4/M rate.

At $8/M tokens (mid-range NIM API pricing), the break-even drops to ~2.1 billion tokens/month.
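The break-even arithmetic above reduces to a one-liner worth keeping in a spreadsheet or script (prices as of the date in the table; the function name is illustrative):

```python
HOURS_PER_MONTH = 730
NODE_HOURLY_USD = 23.20  # 8x H100 SXM5 on-demand

def breakeven_btokens_per_month(api_usd_per_m_tokens: float) -> float:
    """Monthly volume (billions of tokens) at which self-hosting
    matches the per-token API cost."""
    node_monthly = NODE_HOURLY_USD * HOURS_PER_MONTH    # ~$16,936/month
    return node_monthly / api_usd_per_m_tokens / 1000   # M tokens -> B tokens

print(round(breakeven_btokens_per_month(4.0), 1))  # -> 4.2 B tokens at $4/M
print(round(breakeven_btokens_per_month(8.0), 1))  # -> 2.1 B tokens at $8/M
```

Re-run it with your current node rate and API quote; the crossover moves linearly with both.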

Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For a broader multi-provider analysis, see the GPU cloud pricing comparison 2026.

Use Cases: RAG, Tool Calling, and Agentic Reasoning Workflows

Retrieval-Augmented Generation (RAG)

Nemotron Ultra's 128K context window and 76.01% GPQA Diamond score make it well-suited for multi-document RAG with complex synthesis requirements. At 32K context, a single 8xH100 node sustains 8-12 concurrent RAG queries with detailed thinking on. For high-volume document processing where speed matters more than depth, use detailed thinking off and reduce the context window to 8K to maximize concurrency. See Agentic RAG on GPU Cloud for infrastructure patterns around retrieval pipelines.

Tool Calling and Function Calling

Nemotron Ultra 253B has native function calling support. Use structured output mode with --guided-decoding-backend outlines to enforce JSON schema conformance for tool call outputs. For high-throughput tool calling pipelines, disable thinking mode to minimize token generation and latency. At detailed thinking off, average response length for structured tool calls drops to 100-300 tokens, which translates to 3-5x throughput improvement over reasoning-enabled queries. See the structured output and function calling inference guide for schema configuration examples.
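A request sketch for schema-constrained tool calls follows. The `guided_json` field is vLLM's OpenAI-compatible extension for guided decoding; the weather-lookup schema and helper name are hypothetical examples, not part of the model card:

```python
import json

# Hypothetical single-tool schema for illustration.
TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["get_weather"]},
        "city": {"type": "string"},
    },
    "required": ["tool", "city"],
}

def tool_call_payload(user_msg: str) -> dict:
    """Chat payload that forces schema-conformant JSON output."""
    return {
        "model": "nemotron-ultra",
        "messages": [
            {"role": "system", "content": "detailed thinking off"},  # low latency
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 128,           # structured tool calls stay short
        "guided_json": TOOL_SCHEMA,  # vLLM extension: constrain decoding
    }

print(json.dumps(tool_call_payload("What's the weather in Tokyo?"), indent=2))
```

Pairing `detailed thinking off` with a tight `max_tokens` is what delivers the 100-300 token responses and the resulting throughput gain.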

Agentic Reasoning Workflows

For multi-step agentic tasks where reasoning depth matters (code generation, multi-hop research, plan execution), detailed thinking on provides the reasoning chains needed for reliable task completion. Use adaptive token budgets: short budget (2K max tokens) for simple subtasks, full budget (16K+) for complex planning steps. This prevents token waste on low-complexity steps while preserving quality on hard ones.
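The adaptive-budget idea can be as simple as a lookup table keyed by subtask type. The step kinds and token values below are illustrative assumptions, not from NVIDIA's documentation:

```python
# Hypothetical per-subtask budgets for an agent loop.
TOKEN_BUDGETS = {
    "classify": 512,   # trivial routing decisions
    "extract": 1024,   # structured field extraction
    "plan": 16384,     # complex multi-step planning
    "code": 16384,     # code generation with reasoning
}

def max_tokens_for(step_kind: str) -> int:
    """Budget for a subtask; unknown kinds get a short default."""
    return TOKEN_BUDGETS.get(step_kind, 2048)
```

Each agent step passes `max_tokens_for(kind)` as its `max_tokens`, so cheap subtasks can't burn a full reasoning budget.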

For NIM-based deployment alternatives, the NVIDIA NIM self-host deployment guide covers containerized NVIDIA-managed inference if you prefer that over raw vLLM.


Nemotron Ultra 253B fits on a single 8xH100 node at FP8, the same class of hardware many teams already use for DeepSeek R1's distilled 70B variants, at a fraction of the cost of frontier API providers. Spheron provides on-demand 8xH100 SXM5 nodes at $23.20/hr with no contracts and per-minute billing.

Rent 8xH100 SXM5 on Spheron | View all GPU pricing

Deploy Nemotron Ultra now
