Engineering

NVIDIA Vera Rubin NVL72: H300 GPU Specs, Cloud Pricing, and Blackwell Upgrade Guide (2026)

Written by Mitrasish, Co-founder · Apr 18, 2026
NVIDIA Vera Rubin, H300 GPU, NVL72, HBM4, Blackwell, GPU Cloud, AI Infrastructure, Rubin Architecture

The Vera Rubin NVL72 is NVIDIA's rack-scale successor to the GB200 NVL72, announced at CES 2026 and detailed further at GTC 2026, with H2 2026 delivery. It puts 72 R100 GPUs (also branded H300 by some cloud providers) and 36 Vera CPUs into a single liquid-cooled rack, connected by NVLink 6. For the per-GPU architecture behind this system, see the R100 GPU architecture guide. This post covers the NVL72 system specifically: what it delivers at rack scale, how it compares to Blackwell, when it arrives, and whether upgrading makes sense for your workloads.

What Is the NVIDIA Vera Rubin NVL72

The Vera Rubin NVL72 is a direct successor to the GB200 NVL72 architecture. Same form factor, same 72-GPU count, same 36-CPU configuration, but a full generational jump in every key spec. Where GB200 NVL72 uses NVLink 5 at 130 TB/s total fabric bandwidth, Vera Rubin NVL72 doubles that to 260 TB/s with NVLink 6. The per-GPU chip-to-chip bandwidth goes from 1.8 TB/s to 3.6 TB/s.

Spec | Value
GPU count | 72 R100 (H300-class)
CPU count | 36 Vera
Per-GPU memory | 288 GB HBM4
Per-GPU memory bandwidth | up to 22 TB/s
Per-GPU NVLink bandwidth | 3.6 TB/s
Total rack NVLink bandwidth | 260 TB/s
Per-GPU FP4 throughput | 50 PFLOPS
Rack-level FP8 throughput | ~1.4 ExaFLOPS
GPU power (rack) | ~166 kW
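
The rack-level rows follow directly from the per-GPU rows. A quick back-of-the-envelope check, using only the figures in the table above:

```python
# Back-of-the-envelope check: the rack-level figures are the per-GPU specs
# multiplied across 72 GPUs, rounded the way NVIDIA quotes them.

GPUS_PER_RACK = 72

per_gpu_nvlink_tb_s = 3.6    # NVLink 6, chip-to-chip
per_gpu_fp8_pflops = 19.4    # estimated dense FP8
per_gpu_power_kw = 2.3       # ~2,300 W TDP

rack_nvlink_tb_s = GPUS_PER_RACK * per_gpu_nvlink_tb_s       # ~259, quoted as 260 TB/s
rack_fp8_eflops = GPUS_PER_RACK * per_gpu_fp8_pflops / 1000  # ~1.4 ExaFLOPS
rack_gpu_power_kw = GPUS_PER_RACK * per_gpu_power_kw         # ~166 kW (GPUs only)

print(f"NVLink fabric: {rack_nvlink_tb_s:.0f} TB/s")
print(f"FP8 compute:   {rack_fp8_eflops:.1f} EFLOPS")
print(f"GPU power:     {rack_gpu_power_kw:.0f} kW")
```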

The naming situation is worth addressing upfront. NVIDIA internally calls the chip the "Rubin GPU" within the Vera Rubin platform. Industry sources have used both "R100" and "R200" (reflecting a dual-die package design). Cloud providers, particularly those with existing H100 and H200 naming conventions, may brand it "H300". All three refer to the same hardware. This post uses "R100" consistent with the broader Rubin platform documentation.

H300 GPU Specs: HBM4, Transistors, and Compute

The R100 is a substantial step from the B200 and B300. Here's the full spec table:

Spec | R100 (H300)
VRAM | 288 GB HBM4
Memory bandwidth | up to 22 TB/s
FP4 inference throughput (with TE) | 50 PFLOPS
FP8 dense throughput | ~19,400 TFLOPS (est.)
FP16 dense throughput | ~8,000 TFLOPS (est.)
Transistors | 336 billion
Process node | TSMC N3 (est.)
NVLink version | NVLink 6
NVLink bandwidth (per GPU) | 3.6 TB/s
TDP | ~2,300 W
Networking | ConnectX-9 (1.6T)

HBM4 vs HBM3e

The B200 and B300 both use HBM3e at 8 TB/s per GPU. The R100 moves to HBM4 with up to 22 TB/s per GPU, confirmed by NVIDIA at GTC 2026. That is a 2.75x bandwidth increase from Blackwell to Rubin on a single GPU.

For inference, this matters most in two places. KV cache reads and writes are bandwidth-bound operations (see the AI memory wall and inference latency guide for a detailed breakdown of how bandwidth bottlenecks affect serving). At 22 TB/s, the R100 can sustain significantly larger active KV caches during long-context inference without stalling the compute pipeline. A model that currently needs 4xH100 or 2xB200 to handle 128K context windows may run on a single R100 with headroom to spare. Fewer GPUs means less NVLink communication overhead per token, which reduces latency at every context length.

The second impact is on decode throughput at high concurrency. When the batch size is large enough that memory reads dominate compute, bandwidth is the bottleneck. At 2.75x the B200's bandwidth, the R100 decodes tokens faster at equivalent concurrency levels.
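
To make the bandwidth argument concrete, here is a rough sketch of KV cache size and the memory traffic behind each decoded token. The model shape (80 layers, 8 KV heads, head dimension 128, FP8 KV cache) is an illustrative assumption for a 70B-class dense model, not a benchmark; swap in your own model's config.

```python
# Rough, illustrative estimate of KV-cache pressure during long-context decode.
# Model shape is an assumption (70B-class dense model); FP8 KV cache = 1 byte/element.

def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # 2x for keys + values

context_len = 128_000
cache_gb = kv_bytes_per_token() * context_len / 1e9
print(f"KV cache at {context_len // 1000}K tokens: {cache_gb:.1f} GB per sequence")

# Bandwidth-bound decode: each generated token re-reads the sequence's full KV cache.
for name, bw_gb_s in [("B200/B300 (HBM3e, 8 TB/s)", 8_000), ("R100 (HBM4, 22 TB/s)", 22_000)]:
    ms_per_token = cache_gb / bw_gb_s * 1000
    print(f"{name}: ~{ms_per_token:.2f} ms of memory traffic per decoded token")
```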

Third-Generation Transformer Engine

The R100 includes a third-generation Transformer Engine. The first generation, introduced in Hopper, handled FP8 mixed-precision automatically during training. The Blackwell generation added native FP4 support (second generation). The Rubin third-generation Transformer Engine extends FP4 to both training and inference paths, and adds hardware support for finer-grained quantization strategies that reduce accuracy loss at FP4 precision.

In practical terms, this means FP4 inference on R100 is more reliable out of the box than on B200. Models that required FP8 fallback on Blackwell to preserve quality may run at FP4 on Rubin with comparable accuracy.
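
The accuracy benefit of finer-grained scaling is easy to see in a toy example. The sketch below uses a simple symmetric 4-bit integer grid as a stand-in for hardware FP4; the actual Transformer Engine formats and scaling recipes differ, so treat this purely as an illustration of why smaller quantization blocks lose less accuracy.

```python
# Toy illustration: smaller scaling blocks reduce 4-bit quantization error.
# A plain symmetric int4 grid stands in for hardware FP4; real TE formats differ.
import numpy as np

def quantize_4bit_blockwise(x, block_size):
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        block = x[i:i + block_size]
        scale = np.abs(block).max() / 7.0          # signed 4-bit range: [-7, 7]
        out[i:i + block_size] = np.clip(np.round(block / scale), -7, 7) * scale
    return out

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, 4096).astype(np.float32)
weights[42] = 12.0   # one outlier weight wrecks a coarse per-tensor scale

for block_size in (4096, 128, 16):   # per-tensor scaling vs. progressively finer blocks
    err = np.abs(weights - quantize_4bit_blockwise(weights, block_size)).mean()
    print(f"block size {block_size:>4}: mean abs error {err:.4f}")
```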

The "H300" Alias

Cloud providers with H100 and H200 offerings often brand the R100 as H300 to maintain naming continuity for their customers. This creates no hardware difference. If you see "H300" in a cloud provider's catalog, it refers to the R100. The NVL72 system built on R100 may appear as "H300 NVL72" or "Vera Rubin NVL72" depending on the provider.

Vera Rubin vs Blackwell vs Hopper: Specs Side by Side

For the full three-generation architecture comparison, see the Rubin vs Blackwell vs Hopper comparison. For a deep dive on the B200 specifically, see the NVIDIA B200 complete guide. The summary table:

Spec | H100 SXM5 | H200 SXM5 | B200 | B300 | R100 (Vera Rubin)
Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Rubin
Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e | 288 GB HBM4
Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | up to 22 TB/s
NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s
NVLink version | NVLink 4 | NVLink 4 | NVLink 5 | NVLink 5 | NVLink 6
FP8 TFLOPS | 3,958 | 3,958 | 4,500 | 5,000 | ~19,400 (est.)
FP4 TFLOPS | N/A | N/A | 9,000 | 15,000 | 50,000
Transistors | 80 B | 80 B | 208 B | 208 B | 336 B
TDP | ~700 W | ~700 W | ~1,000 W | ~1,400 W | ~2,300 W
Process node | TSMC 4N | TSMC 4N | TSMC 4NP | TSMC 4NP | TSMC N3 (est.)
Release (cloud) | 2023 | 2024 | 2025 | 2026 | H2 2026

A few things stand out in this comparison. Memory bandwidth is where Rubin makes the biggest jump: 22 TB/s versus the 8 TB/s shared by the B200 and B300 is nearly 3x. The gap opened because HBM4 allows significantly higher bandwidth per stack than HBM3e, while B200 and B300 sit on the same HBM3e generation.

FP4 compute also shows a step that does not follow a linear trajectory. B200 to B300 is 1.67x (9 to 15 PFLOPS). B300 to R100 is 3.33x (15 to 50 PFLOPS). NVIDIA appears to have reserved a disproportionate compute jump for the HBM4 generation, which has the bandwidth to feed it.

Performance per Watt

Normalizing by TDP gives a rough energy efficiency comparison:

GPU | FP4 PFLOPS | TDP (W) | FP4 PFLOPS/kW
B200 | 9 | ~1,000 | 9.0
B300 | 15 | ~1,400 | 10.7
R100 | 50 | ~2,300 | 21.7

The R100 roughly doubles the PFLOPS-per-watt of the B300 at FP4. For hyperscale inference operators who pay significant costs for power, this matters at the fleet level.
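
The normalization is simple arithmetic, and it extends naturally to an energy-cost view. The sketch below uses the table's figures plus an assumed electricity price of $0.10/kWh, which is purely illustrative; plug in your own rate.

```python
# Energy-efficiency normalization from the table above, plus a rough cost-of-energy
# illustration. Electricity price is an assumption ($0.10/kWh); substitute your own.

gpus = {                      # name: (FP4 PFLOPS, TDP in W)
    "B200": (9, 1_000),
    "B300": (15, 1_400),
    "R100": (50, 2_300),
}

PRICE_PER_KWH = 0.10          # assumed; varies widely by region and contract

for name, (pflops, tdp_w) in gpus.items():
    pflops_per_kw = pflops / (tdp_w / 1000)
    flop_per_joule = pflops * 1e15 / tdp_w          # FLOPs delivered per joule of GPU power
    kwh_per_exaflop = 1e18 / flop_per_joule / 3.6e6  # joules -> kWh
    print(f"{name}: {pflops_per_kw:5.1f} PFLOPS/kW, "
          f"~${kwh_per_exaflop * PRICE_PER_KWH:.4f} of electricity per FP4 ExaFLOP")
```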

Cloud Availability Timeline

H2 2026 (announced): The first confirmed NVL72 cohort includes AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale, per the announcements at CES 2026 and GTC 2026. Crusoe and Together AI are confirmed for Vera CPU deployments but have not been publicly confirmed for full NVL72 system access.

2027 (expected): Broader availability across additional GPU cloud marketplaces, including Spheron. This is the window where Rubin pricing is expected to compress from launch premiums as supply increases, and when most teams will find it practically accessible.

Treat these dates as forward-looking when you plan around them. NVIDIA's generation timelines have been reliable at the year level, but specific quarter commitments have slipped before. Plan Rubin-dependent workloads around Q1 2027 availability as a conservative baseline.

Spheron is actively evaluating Vera Rubin hardware access alongside its existing Hopper and Blackwell inventory. For currently available GPUs, see the GPU rental catalog.

Projected Cloud Pricing: H300 vs B200 vs H200 Cost-Per-Token

Live pricing from Spheron as of 18 Apr 2026 for the current generation:

GPU | On-demand ($/hr) | Spot ($/hr) | VRAM
H100 SXM5 | $4.41 | N/A | 80 GB
H100 PCIe | from $2.51 | N/A | 80 GB
H200 SXM5 | $4.50 | $1.19 | 141 GB
B200 SXM6 | N/A | $2.06 | 192 GB
R100 (Vera Rubin) | projected $15-25 | projected $8-14 | 288 GB

Pricing fluctuates with GPU availability. The prices above were captured on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

R100 pricing is projected using the historical 30-50% premium that each generation has commanded over its predecessor at launch. The B200 launched at roughly $5-7/hr at first-cohort providers in early 2025; the H100 launched at $8-10/hr in 2023 and compressed to below $2/hr within 18 months. R100 is likely to follow a similar trajectory. At hyperscalers, expect launch pricing in the $18-25/hr range; specialist GPU clouds that receive hardware later tend to price 40-50% below hyperscaler rates.

Cost-Per-Token: How to Think About It

At 50 PFLOPS FP4 per GPU, the R100 delivers roughly 5x the throughput of a B200 at matched precision. At R100's projected launch spot pricing of ~$8-14/hr versus B200 spot at $2.06/hr, the cost-per-token advantage is modest initially (close to parity at $10/hr R100 vs $2.06/hr B200). The 5x tokens-per-dollar advantage becomes achievable as Rubin spot pricing compresses toward $2-3/hr once supply normalizes in 2027, following the same trajectory H100 pricing took after launch.
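
A quick sketch of the tokens-per-dollar math under the assumptions above (roughly 5x throughput at FP4, B200 spot at $2.06/hr, projected R100 spot prices). The absolute token rate is a placeholder, not a benchmark; only the ratios matter.

```python
# Tokens-per-dollar comparison under this post's assumptions. The baseline token
# rate is a placeholder; the prices and 5x speedup come from the text above.

B200_TOKENS_PER_SEC = 10_000         # placeholder baseline, not a measured number
R100_SPEEDUP = 5.0                   # from the FP4 compute gap discussed above

scenarios = {
    "B200 spot today":         (2.06, B200_TOKENS_PER_SEC),
    "R100 spot at launch":     (10.00, B200_TOKENS_PER_SEC * R100_SPEEDUP),
    "R100 spot in 2027 (est)": (2.50, B200_TOKENS_PER_SEC * R100_SPEEDUP),
}

for name, (usd_per_hr, tok_per_sec) in scenarios.items():
    tokens_per_dollar = tok_per_sec * 3600 / usd_per_hr
    print(f"{name:<24} {tokens_per_dollar / 1e6:6.1f}M tokens per dollar")
```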

For a deeper cost comparison across the inference-vs-ownership decision, see the on-premise vs GPU cloud cost breakdown.

When to Upgrade from Blackwell to Rubin: Workload Decision Matrix

Workload | Stay on Blackwell | Move to Rubin | Notes
Inference, models under 70B parameters | Yes | No | B200 or B300 is adequate; Rubin adds cost without proportional benefit
Inference, 70B-400B parameters | Likely yes (2026) | Consider for 2027 | At 70B FP8, B300 is already cost-effective; R100 becomes attractive at 200B+
MoE at scale (e.g., 671B parameter models) | Depends on rack availability | Yes | Full-rack inference benefits most from 260 TB/s fabric
Long-context inference (128K+ tokens) | Acceptable | Preferred | 22 TB/s HBM4 reduces the KV cache bottleneck significantly
FP4 disaggregated inference at scale | Possible on B200/B300 | Yes | R100's third-gen Transformer Engine improves FP4 accuracy
Pre-training large models | Yes (if capacity available) | Not in 2026 | R100 won't be widely available at scale for training in 2026
Cost-sensitive batch jobs | Yes | No | B200/B300 spot pricing is already highly competitive

Upgrade signals: move toward Rubin when:

  • Your model does not fit on two B300s at FP8 (above ~560B parameters)
  • Long-context decode latency is a primary product constraint
  • You're planning capacity for 2027 and beyond
  • FP4 precision is viable for your model and quality requirements

Stay on Blackwell when:

  • Your workload fits comfortably in 192-288 GB HBM3e
  • You're currently running at under 60% GPU utilization (buy more GPUs before buying newer ones)
  • You're on spot pricing and cost-per-token already meets your targets
  • Your inference stack is not yet FP4-ready

How to Prepare Your Inference Stack for Vera Rubin

vLLM and Inference Framework Readiness

vLLM support for Rubin hardware is in development. Upcoming releases will add NVLink 6 awareness and updated tensor parallelism defaults. The main changes to watch for (a minimal configuration sketch follows this list):

  • TP degree: with 72 GPUs in a single NVLink fabric, tensor parallelism up to TP=72 is theoretically possible within a rack. Today most configurations use TP=8 within a node; full-rack TP changes scheduling assumptions significantly.
  • FP4 kernels: vLLM's FP4 quantization support was added for Blackwell. The R100 extends FP4 to more precision paths, which upcoming vLLM releases will expose.
  • NVLink 6 bandwidth: the 3.6 TB/s per-GPU interconnect changes all-reduce timing assumptions in distributed inference. If you have hardcoded communication budgets in custom CUDA kernels, review them before deploying on Rubin.
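
For reference, a minimal vLLM configuration for today's TP=8 node layout, with comments on what changes in a Rubin-era deployment. The model name, context length, and quantization flag are illustrative; FP4 support and full-rack TP defaults will depend on the vLLM version you run.

```python
# A minimal vLLM serving sketch for today's 8-GPU (TP=8) node layout; treat the
# Rubin-era settings as placeholders until vLLM ships NVLink 6 / full-rack support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint, not a recommendation
    tensor_parallel_size=8,        # today's standard: TP within one 8-GPU node
    quantization="fp8",            # FP8 path available today; the FP4 flag/checkpoint
                                   # format depends on your vLLM version and toolchain
    max_model_len=131072,          # 128K context, sized against KV-cache headroom
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarize the trade-offs of FP4 inference in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```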

For an overview of recent vLLM deployment changes, see the vLLM Model Runner V2 deployment guide.

FP4 Quantization Readiness

FP4 inference is the primary way to extract Rubin's compute advantage. Before deploying on R100:

  1. Verify your model has been tested at FP4 precision. Not all architectures quantize to FP4 without quality regression.
  2. Check your quantization toolchain for FP4 support. NVIDIA's ModelOpt and common frameworks like AutoGPTQ have varying levels of FP4 maturity.
  3. Benchmark FP4 vs FP8 quality on your production evaluation set before committing to FP4 in production (a minimal comparison harness follows this list).
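
A minimal sketch of step 3, assuming both precision variants are served behind OpenAI-compatible endpoints (for example via `vllm serve`). The endpoint URLs, model id, eval file format, and score() metric are placeholders for your own harness.

```python
# Compare FP8 vs FP4 quality on your own eval set before switching precision.
# URLs, model id, eval file, and score() below are placeholders, not real services.
import json
from openai import OpenAI

ENDPOINTS = {
    "fp8": OpenAI(base_url="http://fp8-host:8000/v1", api_key="EMPTY"),
    "fp4": OpenAI(base_url="http://fp4-host:8000/v1", api_key="EMPTY"),
}
MODEL = "my-org/my-model"   # placeholder model id

def score(reference: str, candidate: str) -> float:
    # Placeholder metric: swap in exact-match, rubric scoring, or an LLM judge.
    return float(reference.strip() == candidate.strip())

results = {name: [] for name in ENDPOINTS}
for line in open("eval_set.jsonl"):              # each line: {"prompt": ..., "reference": ...}
    example = json.loads(line)
    for name, client in ENDPOINTS.items():
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": example["prompt"]}],
            temperature=0.0,
        )
        results[name].append(score(example["reference"], reply.choices[0].message.content))

for name, scores in results.items():
    print(f"{name}: mean score {sum(scores) / len(scores):.3f} over {len(scores)} examples")
```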

NVLink 6 Tensor Parallelism

Today's standard inference configurations use 8-GPU nodes with TP=8. The Vera Rubin NVL72 rack exposes 72 GPUs in a single all-to-all NVLink fabric. This changes the optimal tensor parallelism configuration for large models:

  • For a 400B parameter FP4 model, TP=1 fits on a single R100 (200 GB weights vs 288 GB VRAM). No tensor parallelism needed.
  • For a 1T parameter model, TP=8 or TP=16 across R100 GPUs is needed, but the communication overhead per TP step is lower at 3.6 TB/s NVLink than at 1.8 TB/s (Blackwell); see the sizing sketch after this list.
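
As a starting point, the minimum TP degree from weight memory alone can be sketched as below. The 70% weight budget is an assumption, and real deployments typically pick a higher TP than this floor for KV-cache headroom and latency targets, which is why the 1T example above lands at TP=8 or TP=16.

```python
# Back-of-the-envelope TP sizing from weight memory alone. KV cache, activations,
# and framework overhead need headroom on top of this; the 70% budget is an assumption.
import math

def min_tp_degree(params_b: float, bytes_per_param: float,
                  gpu_mem_gb: float = 288, weight_fraction: float = 0.7) -> int:
    """Smallest TP degree whose per-GPU share of the weights fits in the budget."""
    weights_gb = params_b * bytes_per_param          # params in billions -> GB
    budget_gb = gpu_mem_gb * weight_fraction
    return max(1, math.ceil(weights_gb / budget_gb))

for params_b, label in [(400, "400B"), (1000, "1T")]:
    for bytes_per_param, prec in [(0.5, "FP4"), (1.0, "FP8")]:
        tp = min_tp_degree(params_b, bytes_per_param)
        print(f"{label} @ {prec}: weights {params_b * bytes_per_param:.0f} GB -> TP >= {tp}")
```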

NVIDIA's deployment tooling and the Spheron documentation at https://docs.spheron.ai will be updated as Rubin hardware ships.

Deploy on Spheron: Next-Gen NVIDIA Hardware Access

Spheron currently offers H100, H200, and A100 instances, with Blackwell GPUs also available:

  • H100 GPU rental from $2.51/hr on-demand (PCIe). 80 GB VRAM, mature software stack, widest framework support.
  • A100 GPU rental for teams on SXM4 or PCIe configurations.
  • H200 GPU rental at $4.50/hr on-demand. 141 GB VRAM, HBM3e, the most cost-effective option for 70B models that exceed H100 memory.
  • Blackwell B200 and B300 spot instances. Check current pricing for live rates.

Spheron works with vetted data center partners worldwide to offer multi-generation GPU selection without long-term contracts. The pricing model is pay-as-you-go, which means teams can access Hopper-generation hardware at competitive rates now while evaluating when Blackwell and eventually Rubin make sense for their specific workloads.

As a marketplace aggregator, Spheron's pricing tends to track significantly below hyperscaler rates for equivalent hardware. Teams migrating from AWS, GCP, or Azure GPU instances typically find meaningful savings on per-hour rates for the same GPU models.

Vera Rubin NVL72 access on Spheron is under evaluation for 2027 availability alongside existing inventory.

Summary

  • The Vera Rubin NVL72 houses 72 R100 (H300) GPUs and 36 Vera CPUs in one rack, with 260 TB/s all-to-all NVLink 6 fabric.
  • Per-GPU memory bandwidth increases from 8 TB/s (B200/B300 HBM3e) to up to 22 TB/s (R100 HBM4), a 2.75x jump.
  • FP4 compute per GPU reaches 50 PFLOPS, 3.33x higher than the B300 and 5.6x higher than the B200.
  • Cloud availability starts H2 2026 at first-cohort providers. Broader access, including Spheron, is expected in 2027.
  • Projected launch pricing is $15-25/hr on-demand at hyperscalers; specialist GPU clouds typically come in 40-50% lower once they receive allocations.
  • For most inference workloads below 70B parameters, Blackwell remains the right choice in 2026. Rubin makes clear economic sense at 200B+ parameter scale and for workloads where per-token cost is the primary constraint.

Vera Rubin NVL72 raises the bar for rack-scale inference, but H100s, H200s, and A100s on Spheron cover most inference workloads today at significantly lower cost than hyperscalers.

Rent H100 → | Rent H200 → | View all pricing → | Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.