NVIDIA Vera Rubin NVL4 vs NVL72: Which Form Factor to Rent for AI Inference, Training, and HPC (2026)

Both the NVL4 and NVL72 ship in H2 2026, and they target fundamentally different buyers. The NVL4 is a 2U server with 4 R100 GPUs and 2 Vera CPUs, built for teams that need high-bandwidth GPU-CPU coherent compute without committing to a full rack. The NVL72 is a 72-GPU full rack, justified only for trillion-parameter workloads and rack-scale disaggregated inference. Most teams will find that NVL4-scale compute covers their actual needs, and that the NVL72's rack commitment and power requirements introduce unnecessary infrastructure overhead. For the R100 chip specs that underpin both form factors, see the R100 chip specs guide.

This guide is a workload-driven decision framework: given what you are running, which form factor makes sense, when each becomes worth the wait, and what to run on Spheron today while Rubin availability is limited to hyperscalers.

TL;DR Decision Matrix

Workload	Right Form Factor	Reason
7B-70B fine-tuning, single node	NVL4	200 PFLOPS FP4, 1.15 TB HBM4 sufficient; no rack commitment
HPC (molecular dynamics, genomics, climate modeling)	NVL4	NVLink-C2C CPU-GPU coherence, coherent 1 TB+ memory
200B-400B inference (TP=4)	NVL4	Fits in 4 × 288 GB = 1.15 TB HBM4 at FP4
400B+ pre-training, multi-node all-reduce	NVL72	260 TB/s all-to-all NVLink 6 fabric required
Trillion-parameter frontier inference (large batch)	NVL72	3.6 EFLOPS FP4, 20.7 TB HBM4 single-rack
Disaggregated prefill-decode at scale	NVL72	Full-rack prefill throughput required

The Vera Rubin Platform: NVL4 and NVL72 in Context

Both form factors are built on the same chip: the R100, also branded H300 by some cloud providers. NVIDIA's naming reflects a dual-die package that some sources call "R200", but the site's existing Rubin content uses "R100" consistently and this post follows the same convention. For the generational context and how Rubin compares to Blackwell and Hopper across memory bandwidth, compute, and NVLink generations, see the Rubin vs Blackwell vs Hopper comparison.

The architecture unit is the VR200 Superchip: 2 R100 GPUs connected to 1 Vera ARM CPU via NVLink-C2C, a chip-to-chip interconnect that provides coherent CPU-GPU memory access without the latency and bandwidth limits of PCIe. This is the same architectural philosophy as the GH200 (one Grace CPU + one H100 GPU per superchip), extended to a 2-GPU-per-CPU ratio.

The NVL4 puts 2 VR200 Superchips together: 4 R100 GPUs and 2 Vera CPUs in a 2U server. The NVL72 scales to 36 VR200 Superchips: 72 R100 GPUs and 36 Vera CPUs in a full liquid-cooled rack connected by the 260 TB/s NVLink 6 all-to-all fabric.

The pattern is consistent with prior generations. The GB200 NVL72 guide covers how NVIDIA built rack-scale interconnected systems from the B200 superchip. Vera Rubin follows the same blueprint, replacing B200 with R100 and NVLink 5 with NVLink 6. The H100 NVL form factor guide shows how the same per-GPU chip gets packaged into different server configurations for different use cases.

(NVL4 node-level specs are derived from confirmed per-GPU R100 specs: 4 × R100 across a pair of VR200 Superchips. NVIDIA has not published an official NVL4 spec sheet as of June 2026.)

Vera Rubin NVL4: Specs and Workload Fit

Spec	NVL4
GPUs	4 R100 (H300-class)
Superchips	2 VR200 (each: 2 × R100 + 1 Vera CPU)
Total GPU memory	4 × 288 GB = 1.152 TB HBM4
Memory bandwidth	4 × 22 TB/s = 88 TB/s combined
FP4 compute	4 × 50 PFLOPS = 200 PFLOPS
CPU-GPU interconnect	NVLink-C2C (within each superchip)
NVLink generation	NVLink 6 (between superchips)
Form factor	2U server (standard rack)
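
The aggregate rows in the table are the per-GPU R100 figures multiplied by the GPU count. A minimal Python sketch of that arithmetic follows; the `aggregate_specs` helper and its field names are illustrative only, not part of any NVIDIA or Spheron API.

```python
from dataclasses import dataclass

# Per-GPU R100 figures quoted in this article.
R100_HBM_GB = 288          # HBM4 capacity per GPU
R100_BW_TBPS = 22          # HBM4 bandwidth per GPU, TB/s
R100_FP4_PFLOPS = 50       # dense FP4 inference throughput per GPU
GPUS_PER_SUPERCHIP = 2     # one VR200 Superchip = 2 R100 + 1 Vera CPU


@dataclass(frozen=True)
class AggregateSpecs:
    gpus: int
    cpus: int
    hbm_tb: float
    bandwidth_tbps: int
    fp4_pflops: int


def aggregate_specs(superchips: int) -> AggregateSpecs:
    """Scale per-GPU R100 numbers to a system built from `superchips` VR200 units."""
    gpus = superchips * GPUS_PER_SUPERCHIP
    return AggregateSpecs(
        gpus=gpus,
        cpus=superchips,                      # one Vera CPU per superchip
        hbm_tb=gpus * R100_HBM_GB / 1000,     # decimal TB, as in the tables
        bandwidth_tbps=gpus * R100_BW_TBPS,
        fp4_pflops=gpus * R100_FP4_PFLOPS,
    )


nvl4 = aggregate_specs(superchips=2)
nvl72 = aggregate_specs(superchips=36)
print(nvl4)
print(nvl72)
```

Calling the same helper with 36 superchips gives the NVL72 totals (20.7 TB HBM4, 3,600 PFLOPS = 3.6 EFLOPS), which is why the two form factors can be compared line by line.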

NVLink-C2C and CPU-GPU Memory Coherence

The distinguishing feature of the NVL4 over standard GPU server configurations is NVLink-C2C for CPU-GPU interconnect. In a conventional 8×H100 or 8×B200 server, the CPU and GPUs communicate over PCIe, which tops out at roughly 64 GB/s per direction on a Gen5 x16 link. NVLink-C2C provides substantially more bandwidth and, unlike PCIe, coherent shared memory addressing between the CPU and its paired R100 GPUs.

This matters most for HPC workloads where the CPU performs significant preprocessing or postprocessing between GPU kernel launches. Molecular dynamics simulations, climate modeling, and genomics pipelines typically involve frequent CPU-GPU data exchange. In a PCIe setup, this exchange is a bottleneck. In an NVLink-C2C setup, CPU memory and GPU memory are addressable in a coherent space, and the transfer latency drops substantially.

Compared to the GH200 (96 GB HBM3 per GPU at ~4 TB/s, NVLink-C2C), the NVL4 offers 288 GB HBM4 per GPU, roughly 5.5x the per-GPU memory bandwidth (22 TB/s vs ~4 TB/s), and 2 GPUs per Vera CPU instead of 1. For HPC workloads that fit within the GH200 memory envelope, GH200 is available now. For large-scale workloads that need more than 96 GB per GPU, the NVL4 is the natural successor.

NVL4 for Fine-Tuning and Training

The aggregate 1.152 TB HBM4 across 4 GPUs makes the NVL4 a strong single-node training server for models up to 400B parameters. A 70B FP16 model takes roughly 140 GB of GPU memory for weights alone. At FP4, that compresses to 35 GB, leaving substantial headroom for optimizer states, activations, and KV cache.

Note: the headline 50 PFLOPS per GPU (200 PFLOPS for NVL4) is the dense FP4 inference figure. FP4 training throughput is lower, roughly 35 PFLOPS per GPU, because training uses a mix of FP4 forward passes and higher-precision backward passes.

For full parameter fine-tuning of 70B models with FSDP across 4 GPUs, the NVL4's memory is more than sufficient. At FP4, 4 × 288 GB = 1.15 TB means you could shard a 400B model across the 4 GPUs with weights alone consuming about 200 GB, leaving 950 GB for optimizer states and activations. That's a workload that would need 8×H100 (640 GB total HBM3) or 4×B300 (1.15 TB HBM3e) today. The NVL4 provides comparable memory capacity to 4×B300 but with the added CPU coherence and HBM4 bandwidth.
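
The memory figures in this section are weights-only arithmetic: parameter count times bytes per parameter. The sketch below reproduces them under that assumption; it ignores quantization scale factors, optimizer states, activations, and KV cache, and the helper names are made up for illustration.

```python
# Bytes per parameter at each weight precision (scale-factor overhead ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}
NVL4_HBM_GB = 4 * 288  # 1,152 GB aggregate HBM4


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB: billions of params x bytes per param."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str,
                total_gb: float = NVL4_HBM_GB) -> float:
    """Memory left after weights for optimizer states, activations, and KV cache."""
    return total_gb - weight_gb(params_billion, precision)


for size, prec in [(70, "fp16"), (70, "fp4"), (400, "fp4")]:
    print(f"{size}B @ {prec}: {weight_gb(size, prec):.0f} GB weights, "
          f"{headroom_gb(size, prec):.0f} GB headroom on NVL4")
```

Whether the remaining headroom is enough depends on the optimizer, sequence length, and activation checkpointing, none of which this sketch models.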

NVLink 6 between the two VR200 Superchips provides the inter-GPU bandwidth needed for FSDP's all-gather and reduce-scatter collectives across 4 GPUs. At 3.6 TB/s of NVLink bandwidth per GPU, gradient synchronization is unlikely to be the bottleneck.

NVL4 for Inference

At TP=4, the NVL4 serves models up to 400B parameters at FP4. A 200B FP4 model needs roughly 100 GB of weights, well within one GPU's 288 GB. At TP=4, the per-token memory bandwidth available to the decode step is 88 TB/s combined, compared to 13.4 TB/s on a 4×H100 SXM5 configuration (4 × 3.35 TB/s HBM3). The bandwidth advantage translates directly to lower decode latency at equivalent batch sizes.
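
To see why bandwidth maps to decode latency, a rough roofline-style bound helps: at batch size 1, each generated token requires reading every weight once, so the decode rate cannot exceed aggregate bandwidth divided by weight bytes. The sketch below is a back-of-the-envelope ceiling, not a benchmark. It applies the same 100 GB footprint to both systems purely to isolate the bandwidth effect, and it ignores KV-cache reads, tensor-parallel communication, and compute time.

```python
def decode_tokens_per_s_upper_bound(weight_gb: float, aggregate_bw_tbps: float) -> float:
    """Bandwidth-bound ceiling on single-sequence decode rate.

    Assumes every weight byte is read once per generated token and ignores
    KV-cache traffic, tensor-parallel communication, and compute time.
    """
    bytes_per_token = weight_gb * 1e9
    bytes_per_second = aggregate_bw_tbps * 1e12
    return bytes_per_second / bytes_per_token


weights_gb = 100.0  # 200B parameters at FP4 (0.5 bytes per parameter)
nvl4_ceiling = decode_tokens_per_s_upper_bound(weights_gb, 4 * 22)      # 88 TB/s
h100x4_ceiling = decode_tokens_per_s_upper_bound(weights_gb, 4 * 3.35)  # 13.4 TB/s

print(f"NVL4 ceiling:   {nvl4_ceiling:.0f} tokens/s")
print(f"4xH100 ceiling: {h100x4_ceiling:.0f} tokens/s")
print(f"ratio:          {nvl4_ceiling / h100x4_ceiling:.1f}x")
```

Real throughput will land below these ceilings, but the ratio between the two systems tracks the bandwidth ratio as long as decode stays memory-bound.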

For vLLM Model Runner V2 deployments using TP=4, the NVL4 is a natural fit for models in the 70B-200B range. The single-node NVLink 6 interconnect avoids the InfiniBand overhead of multi-node setups, simplifying the deployment topology.

NVL4 for HPC

Scientific computing workloads that need both CPU-GPU coherence and large GPU memory will benefit from the NVL4's architecture. Genomics pipelines like those run through GPU-accelerated tools (see the NVIDIA Parabricks GPU cloud genomics guide for a concrete example of HPC workload requirements) involve alternating CPU preprocessing and GPU-accelerated computation. The NVLink-C2C interconnect reduces the transfer overhead for these alternating compute phases.

Climate modeling and molecular dynamics simulations similarly involve large datasets that need to move between CPU memory (for I/O and preprocessing) and GPU memory (for simulation). At 288 GB HBM4 per GPU and coherent CPU-GPU addressing, the NVL4 can hold substantially larger simulation states in GPU memory than a GH200 configuration.

Vera Rubin NVL72: Specs and Workload Fit

Spec	NVL72
GPUs	72 R100 (H300-class)
Superchips	36 VR200
Total GPU memory	20.7 TB HBM4
Memory bandwidth per GPU	22 TB/s
NVLink fabric bandwidth	260 TB/s all-to-all
FP4 compute (NVFP4)	72 × 50 = 3.6 EFLOPS
Rack power	~166 kW

The NVL72 covers the same ground as the NVL4 but at 18x the scale. For the full architecture breakdown, specs comparison against the GB200 NVL72, cloud timeline, and projected pricing analysis, see the Vera Rubin NVL72 architecture guide.

260 TB/s NVLink 6 at Rack Scale

The defining feature of the NVL72 over any multi-node configuration is the all-to-all fabric bandwidth. The 260 TB/s figure is the rack aggregate: each of the 72 GPUs has roughly 3.6 TB/s of NVLink 6 bandwidth and can reach every other GPU in the rack directly. For training trillion-parameter models with tensor parallelism distributed across all 72 GPUs, all-reduce operations run entirely over this fabric without touching any external network.

In a multi-node setup built from multiple NVL4 servers, inter-node traffic runs over InfiniBand or Ethernet at 400 Gb/s to 800 Gb/s per link (50 to 100 GB/s), roughly 36x to 72x below the 3.6 TB/s of NVLink 6 bandwidth each GPU has inside the NVL72. At a tensor-parallel degree of 72, that gap determines whether collective communication becomes the training bottleneck.

The NVL4's 2-superchip NVLink path provides high bandwidth within the 4-GPU node, but it cannot match the NVL72's all-to-all reach. If your workload requires TP=72 for a trillion-parameter model, you need the NVL72 fabric.
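
The bandwidth gap can be made concrete with the standard bandwidth term of a ring all-reduce, in which each GPU moves 2(N-1)/N of the payload. The sketch below is a simplified estimate under stated assumptions: it treats the quoted figures as usable per-GPU bandwidth, assumes one network link per GPU in the multi-node case, ignores latency and protocol overhead, and uses a hypothetical 10 GB payload.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate in Gb/s to GB/s."""
    return gbit_per_s / 8


def ring_allreduce_seconds(payload_gb: float, n_gpus: int,
                           per_gpu_gb_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each GPU moves 2*(N-1)/N of the payload."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / per_gpu_gb_per_s


NVLINK6_GB_PER_S = 3600.0        # 3.6 TB/s per GPU, from this article
LINK_800 = gbit_to_gbyte(800)    # 100 GB/s
LINK_400 = gbit_to_gbyte(400)    # 50 GB/s

payload_gb = 10.0  # hypothetical tensor size to reduce
links = [("NVLink 6", NVLINK6_GB_PER_S),
         ("800 Gb/s link", LINK_800),
         ("400 Gb/s link", LINK_400)]
for name, bw in links:
    t = ring_allreduce_seconds(payload_gb, n_gpus=72, per_gpu_gb_per_s=bw)
    print(f"{name:>14}: {t * 1000:8.1f} ms "
          f"(NVLink 6 is {NVLINK6_GB_PER_S / bw:.0f}x this bandwidth)")
```

Production collectives use hierarchical or tree algorithms rather than a flat ring, so absolute times will differ, but the relative gap between fabric and network links carries over.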

Rack-Scale Economics

With 3.6 EFLOPS of FP4 compute in a single rack, the NVL72's cost per token on trillion-parameter inference improves as batch size grows. At large batch sizes and full rack utilization, per-token compute cost drops relative to smaller nodes delivering the same throughput across multiple servers.

CoreWeave completed the first NVL72 bring-up on June 1, 2026. The broader cloud availability timeline and cost-per-token economics at scale are covered in the Rubin cloud availability and cost-per-token guide.

NVL4 vs NVL72: Direct Comparison

Spec	NVL4	NVL72
GPUs	4 R100	72 R100
Total HBM4	1.15 TB	20.7 TB
FP4 throughput	200 PFLOPS	3.6 EFLOPS
NVLink fabric	NVLink 6 (inter-superchip, within node)	260 TB/s all-to-all
Form factor	2U server	Full rack (~166 kW)
Infrastructure need	Standard rack	Purpose-built liquid cooling
Cloud SKU type	Node reservation	Rack-level contract
Largest model (FP4)	~400B parameters	~2T+ parameters

Cloud Rental Reality: Which SKU Maps to Which Form Factor

The NVL4 maps to per-node reservations: a team can rent a single NVL4 server the same way they currently rent a single 8×H100 server. The access model is familiar and accessible to teams without dedicated data center contracts.

The NVL72 maps to rack-level enterprise contracts. Rack power at ~166 kW requires purpose-built liquid-cooled infrastructure. Few data centers support this density, and cloud providers are offering NVL72 access through reserved capacity arrangements rather than standard on-demand SKUs.

For most teams, the NVL4 is the form factor that actually lands in a cloud console. The NVL72 is a custom infrastructure deal.

Current Alternatives on Spheron

Neither NVL4 nor NVL72 is available on Spheron yet (pre-order is open at the Rubin R100 pre-order page). While Rubin capacity is limited to hyperscaler first access, these on-demand options on Spheron are the closest current analogues:

GPU	On-Demand ($/hr)	Spot ($/hr)	Memory
H100 SXM5	$4.06	$1.49	80 GB HBM3
H100 PCIe	$2.01	N/A	80 GB HBM2e
H200 SXM5	$3.70	$3.31	141 GB HBM3e
B200 SXM6	N/A (spot only)	$5.34	192 GB HBM3e
B300 SXM6	$9.16	N/A	288 GB HBM3e
R100 (Vera Rubin)	projected $15-25	projected $8-14	288 GB HBM4

B300 at 288 GB HBM3e is the closest current analogue to the NVL4 in per-GPU memory capacity. Four B300 GPUs aggregate 1.15 TB HBM3e, matching the NVL4's aggregate HBM4 capacity. Bandwidth is lower (B300 runs at 8 TB/s vs R100's 22 TB/s), but workloads that are primarily capacity-constrained rather than bandwidth-constrained will see comparable model fit.

R100 projected pricing reflects the historical 30-50% premium at new-generation launch versus prior generation, consistent with the pattern at B200 and H200 launches. Actual pricing will be confirmed when Spheron NVL4 capacity opens.

Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

Right-Sizing: When NVL4 Is Enough and When You Need NVL72

You need NVL4 if:

Your model is under 400B parameters at FP4.
You are fine-tuning single-node with LoRA or full FSDP across 4 GPUs.
Your HPC pipeline needs CPU-GPU memory coherence without a rack footprint.
Your inference workload runs 70B-200B models at TP=4 and doesn't require disaggregated prefill.
You want to avoid a rack-level infrastructure commitment.

You need NVL72 if:

Your model exceeds 400B parameters and requires TP > 4.
You are running disaggregated prefill-decode at scale where the prefill step saturates a multi-node configuration.
You are pre-training a trillion-parameter model and need the 260 TB/s all-to-all fabric for all-reduce without InfiniBand as the bottleneck.
Your inference load requires 20.7 TB of aggregated GPU memory to avoid tensor parallel overhead at production batch sizes.
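
The criteria above can be sketched as a small decision helper. This is an illustrative simplification, not a sizing tool: the `Workload` fields and `recommend_form_factor` function are hypothetical names, and the sketch treats exceeding either the ~400B parameter limit or TP=4 as enough to rule out a single NVL4 node.

```python
from dataclasses import dataclass

NVL4_MAX_PARAMS_B = 400   # ~400B parameters at FP4, per the comparison table
NVL4_MAX_TP = 4           # four GPUs per NVL4 node


@dataclass
class Workload:
    params_billion: float
    tensor_parallel: int = 1
    disaggregated_prefill_at_scale: bool = False
    trillion_param_pretraining: bool = False


def recommend_form_factor(w: Workload) -> str:
    """Return 'NVL72' if any rack-scale criterion applies, else 'NVL4'."""
    needs_rack = (
        w.params_billion > NVL4_MAX_PARAMS_B
        or w.tensor_parallel > NVL4_MAX_TP
        or w.disaggregated_prefill_at_scale
        or w.trillion_param_pretraining
    )
    return "NVL72" if needs_rack else "NVL4"


small = recommend_form_factor(Workload(params_billion=70, tensor_parallel=4))
frontier = recommend_form_factor(Workload(params_billion=1000, tensor_parallel=72))
print(small, frontier)
```

Encoding the thresholds as named constants keeps the rule easy to revise once official NVL4 specs are published.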

Spheron's NVLink-Aware GPU Filtering

Spheron's April 2026 platform update added an nvlink: true flag to the GPU API response, identifying NVLink-bridged multi-GPU offers programmatically. This lets teams filter specifically for NVLink-interconnected configurations when submitting job specs. The April 2026 product update covers the feature in detail. API filtering reference is at the Spheron API reference.
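
As a rough illustration of client-side filtering on that flag, the sketch below works on an in-memory list of offers. Only the `nvlink` field comes from the update described here; the other field names (`gpu`, `gpu_count`) and the sample data are hypothetical, so check the Spheron API reference for the real response shape.

```python
from typing import Any

# Hypothetical offers list; only the `nvlink` flag is described in this article.
sample_offers: list[dict[str, Any]] = [
    {"gpu": "H100 SXM5", "gpu_count": 8, "nvlink": True},
    {"gpu": "H100 PCIe", "gpu_count": 4, "nvlink": False},
    {"gpu": "B300 SXM6", "gpu_count": 4, "nvlink": True},
    {"gpu": "H200 SXM5", "gpu_count": 1},  # flag absent: treated as not NVLink-bridged
]


def nvlink_offers(offers: list[dict[str, Any]], min_gpus: int = 2) -> list[dict[str, Any]]:
    """Keep multi-GPU offers whose `nvlink` flag is explicitly true."""
    return [
        o for o in offers
        if o.get("nvlink") is True and o.get("gpu_count", 0) >= min_gpus
    ]


selected = nvlink_offers(sample_offers)
for offer in selected:
    print(offer["gpu"], offer["gpu_count"])
```

Treating a missing flag as false is a conservative choice: an offer is only selected when the API explicitly marks it as NVLink-bridged.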

The April update also introduced volume discounts that apply automatically at scale, so teams scaling to multi-node B300 configurations in advance of Rubin availability receive the same pricing treatment without manual negotiation.

Per-minute billing with no long-term contract means teams can match GPU allocation to actual workload demand. A LoRA fine-tuning job that runs 4 hours per week does not justify a reserved rack contract. On Spheron, you pay for what you use at per-minute granularity and switch to a larger configuration when your workload grows.
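
For a sense of scale, the LoRA example can be priced with simple arithmetic. The sketch below assumes the table's on-demand rate is per GPU per hour and that billing rounds up to the next started minute; both are assumptions, so confirm the actual billing rules before budgeting.

```python
import math


def per_minute_cost(hourly_rate_usd: float, runtime_minutes: float, gpus: int = 1) -> float:
    """Cost of a run billed per started minute at a per-GPU hourly rate."""
    billed_minutes = math.ceil(runtime_minutes)
    return round(hourly_rate_usd / 60 * billed_minutes * gpus, 2)


B300_ON_DEMAND = 9.16  # $/hr from the pricing table; assumed to be per GPU

weekly = per_minute_cost(B300_ON_DEMAND, runtime_minutes=4 * 60, gpus=4)
print(f"4 h/week on 4x B300: ${weekly:.2f} per week, ${weekly * 52:.2f} per year")
```

The same function works for any row of the pricing table, which makes it easy to compare a short weekly job against the cost of holding reserved capacity.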

Migration Path

Start on B300 (288 GB HBM3e per GPU) now via Spheron's B300 GPU rental. The per-GPU memory capacity matches NVL4's R100 per-GPU spec, giving you the model fit to verify your workload runs before committing to Rubin capacity. When NVL4 availability opens through Spheron in 2027, migrating is a configuration change, not a contract renegotiation. Scale to NVL72 only when your workload crosses the 400B threshold or your inference load requires rack-scale economics. The per-minute billing structure means there is no sunk cost pushing you to over-provision before you are ready.

Conclusion

The NVL4 is the right Vera Rubin form factor for the vast majority of teams running production AI workloads in 2026 and 2027. Four R100 GPUs with 1.15 TB HBM4, 200 PFLOPS FP4, and NVLink-C2C CPU coherence cover fine-tuning, HPC, and inference of models up to 400B parameters without a rack-level infrastructure commitment. The NVL72 is a specialized system for teams building at the trillion-parameter frontier, running disaggregated prefill at rack scale, or operating at the point where per-token cost at full rack utilization justifies a dedicated deployment. For teams deciding now, the practical path is B300 on Spheron today, NVL4 reservation when availability opens, and NVL72 only when the workload scale actually demands it.

Most teams working with the Vera Rubin generation will find NVL4-scale compute covers their needs. Spheron offers per-minute billing with no rack commitment and NVLink-aware GPU filtering, so you can match allocations to your actual workload. Pre-order Rubin capacity and run on available Blackwell hardware in the meantime.
Spheron R100 first-access → | B300 on Spheron → | B200 pricing → | View all GPU pricing →


Quick Setup Guide

Assess your workload scale
Identify your largest model size, required context length, and whether your job is inference-serving, fine-tuning, or pre-training. Models up to 400B parameters at FP4 fit within NVL4's 1.15 TB aggregate HBM4.
Check if NVL4 covers your compute
Estimate your required FP4 throughput in PFLOPS. NVL4 delivers 200 PFLOPS FP4 combined (4 × 50 PFLOPS). If your peak workload exceeds 200 PFLOPS sustained, evaluate NVL72.
Evaluate NVL72 only for rack-scale needs
NVL72 is justified when: your model exceeds 400B parameters at FP4 and needs TP > 4, you are running disaggregated prefill at scale requiring the 260 TB/s NVLink fabric, or you are training trillion-parameter models.
Register for Spheron Rubin first-access
Visit the R100 pre-order page at spheron.network/gpu-rental/r100/ and submit your workload requirements. Spheron will notify you when NVL4 or NVL72 capacity becomes available in its marketplace.
Run on Blackwell while waiting
B200 and B300 are available on Spheron now with per-minute billing. Use B300 (288 GB HBM3e) for the closest current analogue to NVL4 memory capacity. Switch to Rubin when it arrives without a long-term contract penalty.


Frequently Asked Questions

What is the NVIDIA Vera Rubin NVL4?

The Vera Rubin NVL4 is a 2-superchip server module housing 4 R100 GPUs and 2 Vera ARM CPUs. Each VR200 Superchip pairs 2 R100 GPUs with 1 Vera CPU via NVLink-C2C, a high-bandwidth coherent interconnect that eliminates the PCIe bottleneck between CPU and GPU. The NVL4 targets single-node fine-tuning, HPC workloads requiring CPU-GPU memory coherence, and inference serving of 200B-400B parameter models without committing to a full NVL72 rack.

NVL4 vs NVL72: which is better for LLM inference?

For models up to 400B parameters, the NVL4 provides enough compute (4 × 50 PFLOPS FP4 = 200 PFLOPS combined) and memory (1.15 TB HBM4 across 4 GPUs) to serve most production inference workloads. The NVL72 is justified for trillion-parameter models, disaggregated prefill-decode at rack scale, or when you need the full 3.6 EFLOPS throughput and 260 TB/s NVLink 6 fabric for all-to-all communication.

Can I rent the Vera Rubin NVL4 on GPU cloud today?

No. Both NVL4 and NVL72 ship in H2 2026, with first access going to AWS, Google Cloud, Azure, CoreWeave, Lambda, Nebius, and Nscale. Broader availability, including GPU marketplace platforms, is expected in 2027. You can register for first access on Spheron via the R100 pre-order page.

What workloads actually need the NVL72?

Three workloads justify NVL72: (1) pre-training or continual pre-training of trillion-parameter models that need the 260 TB/s all-to-all NVLink 6 fabric for all-reduce synchronization; (2) disaggregated prefill-decode inference pipelines at scale where prefill throughput requires full-rack FP4 compute; (3) large-batch production inference of 1T+ parameter models where per-token cost at rack utilization beats any single-node configuration.

How does the NVL4 compare to the GH200 for HPC?

Both use a GPU-CPU superchip architecture with NVLink-C2C for coherent memory. The GH200 pairs one H100 GPU with one Grace CPU at 96 GB HBM3. The Vera Rubin NVL4 pairs two R100 GPUs with one Vera CPU per superchip, with 288 GB HBM4 per GPU (576 GB per superchip). For HPC workloads that fit within the GH200's memory envelope, GH200 is available now at lower cost. For large-scale molecular dynamics, genomics pipelines, or climate models that saturate GH200 VRAM, NVL4 provides roughly 3x the per-GPU memory and ~5.5x the memory bandwidth (R100 at 22 TB/s vs GH200's H100 at ~4 TB/s HBM3).