Engineering

NVIDIA Vera Rubin NVL72: H300 GPU Specs, Cloud Pricing, and Blackwell Upgrade Guide (2026)

Written by Mitrasish, Co-founder · Apr 18, 2026
NVIDIA Vera Rubin, H300 GPU, NVL72, HBM4, Blackwell, GPU Cloud, AI Infrastructure, Rubin Architecture

The Vera Rubin NVL72 is NVIDIA's rack-scale successor to the GB200 NVL72, announced at CES 2026 and detailed further at GTC 2026, with H2 2026 delivery. It puts 72 R100 GPUs (also branded H300 by some cloud providers) and 36 Vera CPUs into a single liquid-cooled rack, connected by NVLink 6. For the per-GPU architecture behind this system, see the R100 GPU architecture guide. This post covers the NVL72 system specifically: what it delivers at rack scale, how it compares to Blackwell, when it arrives, and whether upgrading makes sense for your workloads.

What Is the NVIDIA Vera Rubin NVL72

The Vera Rubin NVL72 is a direct successor to the GB200 NVL72 architecture. Same form factor, same 72-GPU count, same 36-CPU configuration, but a full generational jump in every key spec. Where GB200 NVL72 uses NVLink 5 at 130 TB/s total fabric bandwidth, Vera Rubin NVL72 doubles that to 260 TB/s with NVLink 6. The per-GPU chip-to-chip bandwidth goes from 1.8 TB/s to 3.6 TB/s.

Spec | Value
GPU count | 72 R100 (H300-class)
CPU count | 36 Vera
Per-GPU memory | 288 GB HBM4
Per-GPU memory bandwidth | up to 22 TB/s
Per-GPU NVLink bandwidth | 3.6 TB/s
Total rack NVLink bandwidth | 260 TB/s
Per-GPU FP4 throughput | 50 PFLOPS
Rack-level FP8 throughput | ~1.4 ExaFLOPS
GPU power (rack) | ~166 kW
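
The rack-level rows follow directly from the per-GPU rows. A quick back-of-the-envelope check, using only the figures in the table above:

```python
# Back-of-the-envelope check: the rack-level figures are the per-GPU specs
# multiplied across 72 GPUs, rounded the way NVIDIA quotes them.

GPUS_PER_RACK = 72

per_gpu_nvlink_tb_s = 3.6    # NVLink 6, chip-to-chip
per_gpu_fp8_pflops = 19.4    # estimated dense FP8
per_gpu_power_kw = 2.3       # ~2,300 W TDP

rack_nvlink_tb_s = GPUS_PER_RACK * per_gpu_nvlink_tb_s       # ~259, quoted as 260 TB/s
rack_fp8_eflops = GPUS_PER_RACK * per_gpu_fp8_pflops / 1000  # ~1.4 ExaFLOPS
rack_gpu_power_kw = GPUS_PER_RACK * per_gpu_power_kw         # ~166 kW (GPUs only)

print(f"NVLink fabric: {rack_nvlink_tb_s:.0f} TB/s")
print(f"FP8 compute:   {rack_fp8_eflops:.1f} EFLOPS")
print(f"GPU power:     {rack_gpu_power_kw:.0f} kW")
```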

The naming situation is worth addressing upfront. NVIDIA internally calls the chip the "Rubin GPU" within the Vera Rubin platform. Industry sources have used both "R100" and "R200" (reflecting a dual-die package design). Cloud providers, particularly those with existing H100 and H200 naming conventions, may brand it "H300". All three refer to the same hardware. This post uses "R100" consistent with the broader Rubin platform documentation.

H300 GPU Specs: HBM4, Transistors, and Compute

The R100 is a substantial step from the B200 and B300. Here's the full spec table:

Spec | R100 (H300)
VRAM | 288 GB HBM4
Memory bandwidth | up to 22 TB/s
FP4 inference throughput (with TE) | 50 PFLOPS
FP8 dense throughput | ~19,400 TFLOPS (est.)
FP16 dense throughput | ~8,000 TFLOPS (est.)
Transistors | 336 billion
Process node | TSMC N3 (est.)
NVLink version | NVLink 6
NVLink bandwidth (per GPU) | 3.6 TB/s
TDP | ~2,300 W
Networking | ConnectX-9 (1.6T)

HBM4 vs HBM3e

The B200 and B300 both use HBM3e at 8 TB/s per GPU. The R100 moves to HBM4 with up to 22 TB/s per GPU, confirmed by NVIDIA at GTC 2026. That is a 2.75x bandwidth increase from Blackwell to Rubin on a single GPU.

For inference, this matters most in two places. KV cache reads and writes are bandwidth-bound operations (see the AI memory wall and inference latency guide for a detailed breakdown of how bandwidth bottlenecks affect serving). At 22 TB/s, the R100 can sustain significantly larger active KV caches during long-context inference without stalling the compute pipeline. A model that currently needs 4xH100 or 2xB200 to handle 128K context windows may run on a single R100 with headroom to spare. Fewer GPUs means less NVLink communication overhead per token, which reduces latency at every context length.

The second impact is on decode throughput at high concurrency. When the batch size is large enough that memory reads dominate compute, bandwidth is the bottleneck. At 2.75x the B200's bandwidth, the R100 decodes tokens faster at equivalent concurrency levels.
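
To make the bandwidth argument concrete, here is a rough sketch of KV cache size and the memory traffic behind each decoded token. The model shape (80 layers, 8 KV heads, head dimension 128, FP8 KV cache) is an illustrative assumption for a 70B-class dense model, not a benchmark; swap in your own model's config.

```python
# Rough, illustrative estimate of KV-cache pressure during long-context decode.
# Model shape is an assumption (70B-class dense model); FP8 KV cache = 1 byte/element.

def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # 2x for keys + values

context_len = 128_000
cache_gb = kv_bytes_per_token() * context_len / 1e9
print(f"KV cache at {context_len // 1000}K tokens: {cache_gb:.1f} GB per sequence")

# Bandwidth-bound decode: each generated token re-reads the sequence's full KV cache.
for name, bw_gb_s in [("B200/B300 (HBM3e, 8 TB/s)", 8_000), ("R100 (HBM4, 22 TB/s)", 22_000)]:
    ms_per_token = cache_gb / bw_gb_s * 1000
    print(f"{name}: ~{ms_per_token:.2f} ms of memory traffic per decoded token")
```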

Third-Generation Transformer Engine

The R100 includes a third-generation Transformer Engine. The first generation, introduced in Hopper, handled FP8 mixed-precision automatically during training. The Blackwell generation added native FP4 support (second generation). The Rubin third-generation Transformer Engine extends FP4 to both training and inference paths, and adds hardware support for finer-grained quantization strategies that reduce accuracy loss at FP4 precision.

In practical terms, this means FP4 inference on R100 is more reliable out of the box than on B200. Models that required FP8 fallback on Blackwell to preserve quality may run at FP4 on Rubin with comparable accuracy.
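
The accuracy benefit of finer-grained scaling is easy to see in a toy example. The sketch below uses a simple symmetric 4-bit integer grid as a stand-in for hardware FP4; the actual Transformer Engine formats and scaling recipes differ, so treat this purely as an illustration of why smaller quantization blocks lose less accuracy.

```python
# Toy illustration: smaller scaling blocks reduce 4-bit quantization error.
# A plain symmetric int4 grid stands in for hardware FP4; real TE formats differ.
import numpy as np

def quantize_4bit_blockwise(x, block_size):
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        block = x[i:i + block_size]
        scale = np.abs(block).max() / 7.0          # signed 4-bit range: [-7, 7]
        out[i:i + block_size] = np.clip(np.round(block / scale), -7, 7) * scale
    return out

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, 4096).astype(np.float32)
weights[42] = 12.0   # one outlier weight wrecks a coarse per-tensor scale

for block_size in (4096, 128, 16):   # per-tensor scaling vs. progressively finer blocks
    err = np.abs(weights - quantize_4bit_blockwise(weights, block_size)).mean()
    print(f"block size {block_size:>4}: mean abs error {err:.4f}")
```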

The "H300" Alias

Cloud providers with H100 and H200 offerings often brand the R100 as H300 to maintain naming continuity for their customers. This creates no hardware difference. If you see "H300" in a cloud provider's catalog, it refers to the R100. The NVL72 system built on R100 may appear as "H300 NVL72" or "Vera Rubin NVL72" depending on the provider.

Vera Rubin vs Blackwell vs Hopper: Specs Side by Side

For the full three-generation architecture comparison, see the Rubin vs Blackwell vs Hopper comparison. For a deep dive on the B200 specifically, see the NVIDIA B200 complete guide. The summary table:

Spec | H100 SXM5 | H200 SXM5 | B200 | B300 | R100 (Vera Rubin)
Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Rubin
Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e | 288 GB HBM4
Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | up to 22 TB/s
NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s
NVLink version | NVLink 4 | NVLink 4 | NVLink 5 | NVLink 5 | NVLink 6
FP8 TFLOPS | 3,958 | 3,958 | 4,500 | 5,000 | ~19,400 (est.)
FP4 TFLOPS | N/A | N/A | 9,000 | 15,000 | 50,000
Transistors | 80 B | 80 B | 208 B | 208 B | 336 B
TDP | ~700 W | ~700 W | ~1,000 W | ~1,400 W | ~2,300 W
Process node | TSMC 4N | TSMC 4N | TSMC 4NP | TSMC 4NP | TSMC N3 (est.)
Release (cloud) | 2023 | 2024 | 2025 | 2026 | H2 2026

A few things stand out in this comparison. Memory bandwidth is where Rubin makes the biggest jump: 22 TB/s versus the 8 TB/s shared by the B200 and B300 is nearly 3x. The gap opened because HBM4 allows significantly higher bandwidth per stack than HBM3e, while B200 and B300 sit on the same HBM3e generation.

FP4 compute also shows a step that does not follow a linear trajectory. B200 to B300 is 1.67x (9 to 15 PFLOPS). B300 to R100 is 3.33x (15 to 50 PFLOPS). NVIDIA appears to have reserved a disproportionate compute jump for the HBM4 generation, which has the bandwidth to feed it.

Performance per Watt

Normalizing by TDP gives a rough energy efficiency comparison:

GPU | FP4 PFLOPS | TDP (W) | FP4 PFLOPS/kW
B200 | 9 | ~1,000 | 9.0
B300 | 15 | ~1,400 | 10.7
R100 | 50 | ~2,300 | 21.7

The R100 roughly doubles the PFLOPS-per-watt of the B300 at FP4. For hyperscale inference operators who pay significant costs for power, this matters at the fleet level.
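
The normalization is simple arithmetic, and it extends naturally to an energy-cost view. The sketch below uses the table's figures plus an assumed electricity price of $0.10/kWh, which is purely illustrative; plug in your own rate.

```python
# Energy-efficiency normalization from the table above, plus a rough cost-of-energy
# illustration. Electricity price is an assumption ($0.10/kWh); substitute your own.

gpus = {                      # name: (FP4 PFLOPS, TDP in W)
    "B200": (9, 1_000),
    "B300": (15, 1_400),
    "R100": (50, 2_300),
}

PRICE_PER_KWH = 0.10          # assumed; varies widely by region and contract

for name, (pflops, tdp_w) in gpus.items():
    pflops_per_kw = pflops / (tdp_w / 1000)
    flop_per_joule = pflops * 1e15 / tdp_w          # FLOPs delivered per joule of GPU power
    kwh_per_exaflop = 1e18 / flop_per_joule / 3.6e6  # joules -> kWh
    print(f"{name}: {pflops_per_kw:5.1f} PFLOPS/kW, "
          f"~${kwh_per_exaflop * PRICE_PER_KWH:.4f} of electricity per FP4 ExaFLOP")
```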

Cloud Availability Timeline

H2 2026 (announced): The first confirmed NVL72 cohort includes AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale, per the announcements at CES 2026 and GTC 2026. Crusoe and Together AI are confirmed for Vera CPU deployments but have not been publicly confirmed for full NVL72 system access.

2027 (expected): Broader availability across additional GPU cloud marketplaces, including Spheron. This is the window where Rubin pricing is expected to compress from launch premiums as supply increases, and when most teams will find it practically accessible.

Treat these dates as forward-looking when you plan around them. NVIDIA's generation timelines have been reliable at the year level, but specific quarter commitments have slipped before. Plan Rubin-dependent workloads around Q1 2027 availability as a conservative baseline.

Spheron is actively evaluating Vera Rubin hardware access alongside its existing Hopper and Blackwell inventory. For currently available GPUs, see the GPU rental catalog.

Projected Cloud Pricing: H300 vs B200 vs H200 Cost-Per-Token

Live pricing from Spheron as of 18 Apr 2026 for the current generation:

GPU | On-demand ($/hr) | Spot ($/hr) | VRAM
H100 SXM5 | $4.41 | N/A | 80 GB
H100 PCIe | from $2.51 | N/A | 80 GB
H200 SXM5 | $4.50 | $1.19 | 141 GB
B200 SXM6 | N/A | $2.06 | 192 GB
R100 (Vera Rubin) | projected $15-25 | projected $8-14 | 288 GB

Pricing fluctuates with GPU availability. The prices above were captured on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

R100 pricing is projected using the historical 30-50% premium that each generation has commanded over its predecessor at launch. The B200 launched at roughly $5-7/hr at first-cohort providers in early 2025; the H100 launched at $8-10/hr in 2023 and compressed to below $2/hr within 18 months. R100 is likely to follow a similar trajectory. At hyperscalers, expect launch pricing in the $18-25/hr range; specialist GPU clouds that receive hardware later tend to price 40-50% below hyperscaler rates.

Cost-Per-Token: How to Think About It

At 50 PFLOPS FP4 per GPU, the R100 delivers roughly 5x the throughput of a B200 at matched precision. At R100's projected launch spot pricing of ~$8-14/hr versus B200 spot at $2.06/hr, the cost-per-token advantage is modest initially (close to parity at $10/hr R100 vs $2.06/hr B200). The 5x tokens-per-dollar advantage becomes achievable as Rubin spot pricing compresses toward $2-3/hr once supply normalizes in 2027, following the same trajectory H100 pricing took after launch.
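
A quick sketch of the tokens-per-dollar math under the assumptions above (roughly 5x throughput at FP4, B200 spot at $2.06/hr, projected R100 spot prices). The absolute token rate is a placeholder, not a benchmark; only the ratios matter.

```python
# Tokens-per-dollar comparison under this post's assumptions. The baseline token
# rate is a placeholder; the prices and 5x speedup come from the text above.

B200_TOKENS_PER_SEC = 10_000         # placeholder baseline, not a measured number
R100_SPEEDUP = 5.0                   # from the FP4 compute gap discussed above

scenarios = {
    "B200 spot today":         (2.06, B200_TOKENS_PER_SEC),
    "R100 spot at launch":     (10.00, B200_TOKENS_PER_SEC * R100_SPEEDUP),
    "R100 spot in 2027 (est)": (2.50, B200_TOKENS_PER_SEC * R100_SPEEDUP),
}

for name, (usd_per_hr, tok_per_sec) in scenarios.items():
    tokens_per_dollar = tok_per_sec * 3600 / usd_per_hr
    print(f"{name:<24} {tokens_per_dollar / 1e6:6.1f}M tokens per dollar")
```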

For a deeper cost comparison across the inference-vs-ownership decision, see the on-premise vs GPU cloud cost breakdown.

When to Upgrade from Blackwell to Rubin: Workload Decision Matrix

Workload | Stay on Blackwell | Move to Rubin | Notes
Inference, models under 70B parameters | Yes | No | B200 or B300 is adequate; Rubin adds cost without proportional benefit
Inference, 70B-400B parameters | Likely yes (2026) | Consider for 2027 | At 70B FP8, B300 is already cost-effective; R100 becomes attractive at 200B+
MoE at scale (e.g., 671B parameter models) | Depends on rack availability | Yes | Full-rack inference benefits most from 260 TB/s fabric
Long-context inference (128K+ tokens) | Acceptable | Preferred | 22 TB/s HBM4 reduces the KV cache bottleneck significantly
FP4 disaggregated inference at scale | Possible on B200/B300 | Yes | R100's third-gen Transformer Engine improves FP4 accuracy
Pre-training large models | Yes (if capacity available) | Not in 2026 | R100 won't be widely available at scale for training in 2026
Cost-sensitive batch jobs | Yes | No | B200/B300 spot pricing is already highly competitive

Upgrade signals: move toward Rubin when:

  • Your model does not fit on two B300s at FP8 (above ~560B parameters)
  • Long-context decode latency is a primary product constraint
  • You're planning capacity for 2027 and beyond
  • FP4 precision is viable for your model and quality requirements

Stay on Blackwell when:

  • Your workload fits comfortably in 192-288 GB HBM3e
  • You're currently running at under 60% GPU utilization (buy more GPUs before buying newer ones)
  • You're on spot pricing and cost-per-token already meets your targets
  • Your inference stack is not yet FP4-ready

How to Prepare Your Inference Stack for Vera Rubin

vLLM and Inference Framework Readiness

vLLM support for Rubin hardware is in development. Upcoming releases will add NVLink 6 awareness and updated tensor parallelism defaults. The main changes to watch for (a minimal configuration sketch follows this list):

  • TP degree: with 72 GPUs in a single NVLink fabric, tensor parallelism up to TP=72 is theoretically possible within a rack. Today most configurations use TP=8 within a node; full-rack TP changes scheduling assumptions significantly.
  • FP4 kernels: vLLM's FP4 quantization support was added for Blackwell. The R100 extends FP4 to more precision paths, which upcoming vLLM releases will expose.
  • NVLink 6 bandwidth: the 3.6 TB/s per-GPU interconnect changes all-reduce timing assumptions in distributed inference. If you have hardcoded communication budgets in custom CUDA kernels, review them before deploying on Rubin.
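
For reference, a minimal vLLM configuration for today's TP=8 node layout, with comments on what changes in a Rubin-era deployment. The model name, context length, and quantization flag are illustrative; FP4 support and full-rack TP defaults will depend on the vLLM version you run.

```python
# A minimal vLLM serving sketch for today's 8-GPU (TP=8) node layout; treat the
# Rubin-era settings as placeholders until vLLM ships NVLink 6 / full-rack support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint, not a recommendation
    tensor_parallel_size=8,        # today's standard: TP within one 8-GPU node
    quantization="fp8",            # FP8 path available today; the FP4 flag/checkpoint
                                   # format depends on your vLLM version and toolchain
    max_model_len=131072,          # 128K context, sized against KV-cache headroom
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarize the trade-offs of FP4 inference in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```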

For an overview of recent vLLM deployment changes, see the vLLM Model Runner V2 deployment guide.

FP4 Quantization Readiness

FP4 inference is the primary way to extract Rubin's compute advantage. Before deploying on R100:

  1. Verify your model has been tested at FP4 precision. Not all architectures quantize to FP4 without quality regression.
  2. Check your quantization toolchain for FP4 support. NVIDIA's ModelOpt and common frameworks like AutoGPTQ have varying levels of FP4 maturity.
  3. Benchmark FP4 vs FP8 quality on your production evaluation set before committing to FP4 in production (a minimal comparison harness follows this list).
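
A minimal sketch of step 3, assuming both precision variants are served behind OpenAI-compatible endpoints (for example via `vllm serve`). The endpoint URLs, model id, eval file format, and score() metric are placeholders for your own harness.

```python
# Compare FP8 vs FP4 quality on your own eval set before switching precision.
# URLs, model id, eval file, and score() below are placeholders, not real services.
import json
from openai import OpenAI

ENDPOINTS = {
    "fp8": OpenAI(base_url="http://fp8-host:8000/v1", api_key="EMPTY"),
    "fp4": OpenAI(base_url="http://fp4-host:8000/v1", api_key="EMPTY"),
}
MODEL = "my-org/my-model"   # placeholder model id

def score(reference: str, candidate: str) -> float:
    # Placeholder metric: swap in exact-match, rubric scoring, or an LLM judge.
    return float(reference.strip() == candidate.strip())

results = {name: [] for name in ENDPOINTS}
for line in open("eval_set.jsonl"):              # each line: {"prompt": ..., "reference": ...}
    example = json.loads(line)
    for name, client in ENDPOINTS.items():
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": example["prompt"]}],
            temperature=0.0,
        )
        results[name].append(score(example["reference"], reply.choices[0].message.content))

for name, scores in results.items():
    print(f"{name}: mean score {sum(scores) / len(scores):.3f} over {len(scores)} examples")
```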

NVLink 6 Tensor Parallelism

Today's standard inference configurations use 8-GPU nodes with TP=8. The Vera Rubin NVL72 rack exposes 72 GPUs in a single all-to-all NVLink fabric. This changes the optimal tensor parallelism configuration for large models:

  • For a 400B parameter FP4 model, TP=1 fits on a single R100 (200 GB weights vs 288 GB VRAM). No tensor parallelism needed.
  • For a 1T parameter model, TP=8 or TP=16 across R100 GPUs is needed, but the communication overhead per TP step is lower at 3.6 TB/s NVLink than at 1.8 TB/s (Blackwell); see the sizing sketch after this list.
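
As a starting point, the minimum TP degree from weight memory alone can be sketched as below. The 70% weight budget is an assumption, and real deployments typically pick a higher TP than this floor for KV-cache headroom and latency targets, which is why the 1T example above lands at TP=8 or TP=16.

```python
# Back-of-the-envelope TP sizing from weight memory alone. KV cache, activations,
# and framework overhead need headroom on top of this; the 70% budget is an assumption.
import math

def min_tp_degree(params_b: float, bytes_per_param: float,
                  gpu_mem_gb: float = 288, weight_fraction: float = 0.7) -> int:
    """Smallest TP degree whose per-GPU share of the weights fits in the budget."""
    weights_gb = params_b * bytes_per_param          # params in billions -> GB
    budget_gb = gpu_mem_gb * weight_fraction
    return max(1, math.ceil(weights_gb / budget_gb))

for params_b, label in [(400, "400B"), (1000, "1T")]:
    for bytes_per_param, prec in [(0.5, "FP4"), (1.0, "FP8")]:
        tp = min_tp_degree(params_b, bytes_per_param)
        print(f"{label} @ {prec}: weights {params_b * bytes_per_param:.0f} GB -> TP >= {tp}")
```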

NVIDIA's deployment tooling and the Spheron documentation at https://docs.spheron.ai will be updated as Rubin hardware ships.

Deploy on Spheron: Next-Gen NVIDIA Hardware Access

Spheron currently offers H100, H200, and A100 instances, with Blackwell GPUs also available:

  • H100 GPU rental from $2.51/hr on-demand (PCIe). 80 GB VRAM, mature software stack, widest framework support.
  • A100 GPU rental for teams on SXM4 or PCIe configurations.
  • H200 GPU rental at $4.50/hr on-demand. 141 GB VRAM, HBM3e, the most cost-effective option for 70B models that exceed H100 memory.
  • Blackwell B200 and B300 spot instances. Check current pricing for live rates.

Spheron works with vetted data center partners worldwide to offer multi-generation GPU selection without long-term contracts. The pricing model is pay-as-you-go, which means teams can access Hopper-generation hardware at competitive rates now while evaluating when Blackwell and eventually Rubin make sense for their specific workloads.

As a marketplace aggregator, Spheron's pricing tends to track significantly below hyperscaler rates for equivalent hardware. Teams migrating from AWS, GCP, or Azure GPU instances typically find meaningful savings on per-hour rates for the same GPU models.

Vera Rubin NVL72 access on Spheron is under evaluation for 2027 availability alongside existing inventory.

Summary

  • The Vera Rubin NVL72 houses 72 R100 (H300) GPUs and 36 Vera CPUs in one rack, with 260 TB/s all-to-all NVLink 6 fabric.
  • Per-GPU memory bandwidth increases from 8 TB/s (B200/B300 HBM3e) to up to 22 TB/s (R100 HBM4), a 2.75x jump.
  • FP4 compute per GPU reaches 50 PFLOPS, 3.33x higher than the B300 and 5.6x higher than the B200.
  • Cloud availability starts H2 2026 at first-cohort providers. Broader access, including Spheron, is expected in 2027.
  • Projected launch pricing is $15-25/hr on-demand at hyperscalers; specialist GPU clouds typically come in 40-50% lower once they receive allocations.
  • For most inference workloads below 70B parameters, Blackwell remains the right choice in 2026. Rubin makes clear economic sense at 200B+ parameter scale and for workloads where per-token cost is the primary constraint.

Vera Rubin NVL72 raises the bar for rack-scale inference, but H100s, H200s, and A100s on Spheron cover most inference workloads today at significantly lower cost than hyperscalers.

Rent H100 → | Rent H200 → | View all pricing → | Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.