All cloud prices in this article are indicative as of 22 Mar 2026 and can fluctuate over time based on GPU availability. Check current GPU pricing for live rates.
The GB200 NVL72 packs 72 B200 GPUs, 36 Grace ARM CPUs, 13.4 TB of unified GPU memory, and 1.44 exaflops of FP4 compute into a single liquid-cooled rack. Cloud providers sell access at the Superchip or rack-node level, not as individual GPU slots. This post answers the question that actually matters: when does this architecture justify the cost, and when does a cluster of 8×B200 nodes do the job cheaper?
Quick Answer: GB200 NVL72 vs 8×H100 vs 8×B200
| System | Memory | NVLink BW (per GPU) | Best For | Cloud Price/GPU-hr (indicative) |
|---|---|---|---|---|
| 8×H100 SXM5 | 640 GB (8×80 GB) | 900 GB/s | Sub-70B models, cost-sensitive | $2.01/hr on-demand (Spheron) |
| 8×B200 SXM | 1.44 TB (8×180 GB) | 1.8 TB/s | 70B–100B models, FP4 workloads | $6.03/hr on-demand (Spheron) |
| GB200 NVL72 rack | 13.4 TB (72 B200 GPUs) | ~1.8 TB/s (130 TB/s total rack, all-to-all fabric across all 72 GPUs) | 200B+ training, 671B inference | $10.50–$27/GPU-hr equiv |
No single-GPU GB200 rental exists. For Blackwell access without a rack commitment, see B200 or B300.
What Is the GB200 NVL72?
The GB200 is not a GPU you rent individually. It's a Superchip: two B200 GPUs and one Grace ARM CPU on a single package, connected by NVLink-C2C at 900 GB/s. That 900 GB/s is roughly 7x PCIe Gen 5's 128 GB/s, which means the CPU and GPU share memory coherently without the usual PCIe transfer overhead.
The NVL72 is what you get when you stack 36 of those Superchips in one rack. The result: 72 B200 GPUs, 36 Grace CPUs, and NVSwitch fabric connecting every GPU to every other GPU at 130 TB/s all-to-all bandwidth.
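The 130 TB/s figure follows directly from the per-GPU NVLink number. A quick sanity check, assuming each B200 exposes its full 1.8 TB/s NVLink 5 bandwidth and the NVSwitch fabric lets all 72 GPUs drive their links at once:

```python
# Back-of-envelope check of the NVL72 fabric bandwidth quoted above.
# Assumption: 1.8 TB/s of NVLink 5 per GPU, all 72 GPUs active simultaneously.
GPUS = 72
NVLINK_PER_GPU_TBS = 1.8  # TB/s per B200 (NVLink 5)

total_fabric_tbs = GPUS * NVLINK_PER_GPU_TBS
print(f"Aggregate all-to-all bandwidth: {total_fabric_tbs:.1f} TB/s")
# -> 129.6 TB/s, which NVIDIA rounds to the headline 130 TB/s
```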
This is fundamentally different from a standard HGX B200 server. An HGX B200 node puts 8 B200 GPUs on a server baseboard with a separate x86 CPU connected via PCIe. NVLink in that configuration operates at 1.8 TB/s per GPU for intra-node communication. The GB200 NVL72 replaces the x86 CPU with a Grace ARM CPU and connects CPU to GPU via NVLink-C2C instead of PCIe, then extends the NVLink fabric across all 36 Superchips in the rack.
NVIDIA also released the GB300 NVL72 as the next-generation successor. CoreWeave was the first cloud provider to deploy GB300 NVL72, with the first systems announced in July 2025 and cloud instances generally available from August 19, 2025. Azure followed with the first large-scale GB300 NVL72 cluster for OpenAI workloads in October 2025, and AWS launched EC2 P6e-GB300 UltraServers with general availability on December 2, 2025. The GB200 NVL72 remains widely available across CoreWeave, Oracle Cloud, Azure, and Google Cloud as of early 2026.
For the single-GPU Blackwell deep-dive, see our B200 complete guide.
Full GB200 NVL72 Rack Specifications
| Spec | Value |
|---|---|
| GPUs | 72 B200 |
| Superchips | 36 (each: 2×B200 + 1×Grace CPU) |
| Total GPU Memory | 13.4 TB HBM3e |
| HBM3e Memory Bandwidth (aggregate, 72 GPUs) | 576 TB/s |
| NVLink Switch Bandwidth (all-to-all) | 130 TB/s |
| FP4 Performance (with sparsity) | 1.44 exaflops |
| FP4 Performance (dense) | 720 petaflops |
| FP8 Performance (with sparsity) | 720 petaflops |
| FP8 Performance (dense) | 360 petaflops |
| NVLink Generation | NVLink 5 |
| NVLink-C2C (CPU-GPU) | 900 GB/s per Superchip |
| Rack Power Draw | ~120 kW |
| Rack Weight | ~1.36 metric tons |
| GPUs per Superchip | 2 B200 |
| CPU per Superchip | 1 NVIDIA Grace (72-core ARM) |
What 13.4 TB of unified GPU memory enables is running a 671B-parameter model entirely within one rack. DeepSeek R1 in FP8 requires roughly 700–750 GB for weights and runtime buffers. At FP4 the weight-only footprint is around 335 GB. The NVL72's all-to-all NVLink fabric at 130 TB/s means KV cache access, attention computation, and expert routing in MoE models can all happen without crossing slow InfiniBand links to other nodes.
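The weight-footprint numbers above come straight from bytes-per-parameter arithmetic. A minimal sketch (weights only; the gap up to the ~700–750 GB FP8 figure is runtime buffers and KV cache):

```python
# Weight-only memory footprint for a 671B-parameter model at various precisions.
# Runtime buffers and KV cache add on top of these figures.
PARAMS_B = 671  # billions of parameters

def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Weight memory in GB: params * bytes/param."""
    return params_b * 1e9 * bytes_per_param / 1e9

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{name}: ~{weight_gb(PARAMS_B, bytes_per_param):,.1f} GB weights-only")
# FP8 -> 671 GB weights-only; FP4 -> 335.5 GB, matching the ~335 GB above
```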
GB200 NVL72 vs 8×H100 vs 8×B200: When Rack-Scale Wins
| Specification | 8×H100 SXM5 | 8×B200 SXM | GB200 NVL72 Rack |
|---|---|---|---|
| VRAM per node | 640 GB | 1.44 TB | 13.4 TB |
| NVLink all-reduce BW (per GPU) | 900 GB/s | 1.8 TB/s | ~1.8 TB/s (130 TB/s total rack, all-to-all across 72 GPUs) |
| FP4 compute (with sparsity) | N/A | 144 PFLOPS | 1,440 PFLOPS |
| Max model size (FP16) | ~300B (1 node) | ~650B (1 node) | ~671B (1 rack) |
| Indicative price range | $2.01/GPU-hr on-demand | ~$6.03/GPU-hr on-demand | ~$10.50–$27/GPU-hr equiv |
| Power draw | ~5.6 kW (GPUs only, 8×700 W) | ~8 kW (GPUs only, 8×1,000 W) | ~120 kW (full rack) |
When 8×H100 Still Makes Sense
For models up to 70B parameters, the H100 cluster remains the most cost-effective option in 2026. The Hopper software stack is mature, quantization tools are well-tested, and at $2.01/hr on-demand or $0.99/hr spot on Spheron, the price is hard to beat. If your team has tuned inference pipelines for H100 and your models don't overflow 80 GB VRAM, migration to newer hardware carries engineering cost without guaranteed payoff. For a quick start on H100 nodes, see the vLLM inference server guide, the Llama 3 deployment guide, or the Mistral and Mixtral deployment guide for MoE inference at lower memory budgets. Explore H100 rental for current availability.
When 8×B200 Is the Sweet Spot
The B200's 180 GB VRAM and 9,000 TFLOPS FP4 dense make it the right call for 70B–100B parameter models. A single B200 can hold Llama 3.1 70B at FP16 with room left for KV cache, eliminating the tensor parallelism overhead that H100 deployments require. At $6.03/hr on-demand or $2.25/hr spot on Spheron, it also delivers significantly better cost-per-token than H100 for large model inference. For benchmark data, see the H200 vs B200 vs GB200 comparison. For deploying Llama models on Spheron, see the Llama 3 deployment guide or the Llama 4 Scout and Maverick guide. For larger MoE models like Mixtral-8x22B that fit comfortably in B200 VRAM, see the Mistral and Mixtral deployment guide or the Gemma 3 deployment guide. Explore B200 rental.
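The "Llama 3.1 70B at FP16 with room left for KV cache" claim can be sanity-checked with the published Llama 3.1 70B architecture (80 layers, 8 KV heads via GQA, head dimension 128); treat the headroom math as approximate, since runtime overhead eats into it:

```python
# Single-B200 memory budget sketch for Llama 3.1 70B at FP16.
# Architecture constants are the published Llama 3.1 70B config.
VRAM_GB = 180
weights_gb = 70e9 * 2 / 1e9                # FP16 weights: ~140 GB
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2  # (K+V) * layers * kv_heads * head_dim * fp16 bytes

headroom_gb = VRAM_GB - weights_gb
max_cached_tokens = headroom_gb * 1e9 / kv_bytes_per_token
print(f"Weights: ~{weights_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")
print(f"KV cache: ~{kv_bytes_per_token / 1024:.0f} KiB/token -> ~{max_cached_tokens / 1e3:.0f}K cacheable tokens")
```

With ~40 GB of headroom at roughly 320 KiB per cached token, a single B200 can hold on the order of 120K tokens of KV cache alongside the full FP16 weights, which is why no tensor parallelism is needed.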
When GB200 NVL72 Is the Right Call
Three conditions where GB200 NVL72 wins clearly: training 200B+ parameter models where 130 TB/s all-reduce bandwidth cuts communication overhead below the compute time; production inference on 671B-scale reasoning models that need to fit entirely in one rack's memory; or multi-modal pipelines where the 900 GB/s NVLink-C2C CPU-GPU bandwidth eliminates the PCIe bottleneck in data preprocessing and tokenization.
For everything else, a cluster of 8×B200 nodes is cheaper and more flexible.
Performance: What 1.44 Exaflops Actually Means
The exaflop number is easy to misread. 1.44 exaflops FP4 is the with-sparsity figure (NVIDIA's headline number). The dense FP4 is 720 petaflops. Per GPU across 72 B200 GPUs, that works out to roughly 20,000 TFLOPS FP4 with sparsity (or ~10,000 TFLOPS FP4 dense), versus a standalone HGX B200's 9,000 TFLOPS FP4 dense. The modest per-GPU advantage comes from the NVL72's higher system power envelope: each B200 GPU within the NVL72 Superchip draws up to 1,200W, allowing it to run at higher power than the standalone HGX B200 configuration rated at 1,000W per GPU. The full Superchip (two B200 GPUs plus one Grace CPU) draws approximately 2,700W total.
In real terms for inference on DeepSeek R1 671B: a multi-node B200 HGX cluster distributing the model across nodes can process roughly 300–500 tokens per second total, with cross-node InfiniBand at 400 Gb/s (50 GB/s) becoming the bottleneck for all-reduce. The NVL72's 130 TB/s NVLink fabric eliminates that cross-node overhead entirely. The bottleneck shifts back to pure compute, and at 1.44 exaflops FP4, that's a lot of compute.
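The scale of that shift is easiest to see as a pure bandwidth model. The sketch below ignores latency, topology, and overlap, and uses a hypothetical 100 GB all-reduce payload; it only illustrates the raw bandwidth gap, not measured performance:

```python
# Idealized transfer time for an all-reduce payload over cross-node
# InfiniBand vs the NVL72's NVLink fabric. Pure bandwidth model:
# no latency terms, no overlap, hypothetical payload size.
PAYLOAD_GB = 100           # hypothetical all-reduce payload
IB_GBS = 50                # 400 Gb/s InfiniBand = 50 GB/s
NVLINK_RACK_GBS = 130_000  # 130 TB/s aggregate NVLink fabric

ib_time_s = PAYLOAD_GB / IB_GBS
nvlink_time_s = PAYLOAD_GB / NVLINK_RACK_GBS
print(f"InfiniBand: {ib_time_s:.2f} s, NVLink fabric: {nvlink_time_s * 1e3:.2f} ms")
print(f"Ratio: {round(ib_time_s / nvlink_time_s):,}x")  # 2,600x, the raw bandwidth gap
```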
For context on why interconnect bandwidth dominates at this scale, see our multi-node GPU training without InfiniBand breakdown.
Use Cases: Where the GB200 NVL72 Earns Its Cost
200B+ Parameter Model Training
At 200B+ parameters, training across multiple 8×B200 nodes becomes communication-bound: every gradient sync crosses InfiniBand. A 200B parameter model at FP16 weighs roughly 400 GB, which fits within a single 8×B200 node (1,440 GB VRAM). The problem is not memory capacity but all-reduce bandwidth. At 400 Gb/s (50 GB/s) InfiniBand, gradient synchronization across nine 8×B200 nodes for a 200B model takes roughly 400–600 ms per step. The NVL72's 130 TB/s NVLink drops that into the low-millisecond range. For training runs measured in millions of steps, the time savings are real. For setting up multi-node distributed training on Spheron, see the distributed training guide.
Large-Scale Inference Serving
DeepSeek R1 671B in FP8 requires roughly 700–750 GB for weights plus runtime buffers. A single 8×H100 node has 640 GB total (8×80 GB), which cannot hold even the raw weights, so the minimum is 2 H100 nodes at 1,280 GB. A single 8×B200 node (1,440 GB) holds the entire weight tensor with approximately 690 GB to spare. For production inference at scale, though, KV cache for long context windows, activation buffers, and multiple concurrent requests push total memory well beyond the raw weight footprint. That is where the NVL72's 13.4 TB provides a clear edge: the full model plus large KV caches for high-throughput serving all fit in one rack, with no cross-rack memory spill. Single-rack latency is measurably better than multi-rack because attention and expert routing never cross a network hop. For a step-by-step guide to deploying DeepSeek R1 and DeepSeek V3 671B on Spheron, see the DeepSeek R1 deployment guide. For maximum token throughput on NVIDIA hardware, see the TensorRT-LLM + Triton inference guide. For large MoE models like Qwen3-235B-A22B that also push memory requirements well beyond a single 8-GPU node, see the Qwen3 deployment guide.
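The "well beyond the raw weight footprint" point can be made concrete with a simple memory-budget model. The KV-per-token and concurrency figures below are hypothetical placeholders (DeepSeek R1's MLA attention compresses its KV cache, so its real numbers differ); the sketch only shows how serving memory scales past the weights:

```python
# Serving memory sketch: total = weights + KV cache for concurrent
# long-context requests. kv_kib_per_token and concurrency are hypothetical
# placeholders, not DeepSeek R1's actual (MLA-compressed) figures.
weights_gb = 720        # mid-range of the ~700-750 GB FP8 estimate above
kv_kib_per_token = 70   # hypothetical per-token KV footprint
concurrent = 512        # simultaneous requests
context = 32_768        # tokens of context per request

kv_total_gb = concurrent * context * kv_kib_per_token * 1024 / 1e9
total_gb = weights_gb + kv_total_gb
print(f"KV cache: ~{kv_total_gb:,.0f} GB, total: ~{total_gb:,.0f} GB")
print(f"Fits one 8xB200 node (1,440 GB)? {total_gb <= 1440}")
print(f"Fits one NVL72 rack (13,400 GB)? {total_gb <= 13400}")
```

Under these assumptions the weights alone fit in one 8×B200 node, but weights plus a high-concurrency KV cache do not, while the NVL72's 13.4 TB absorbs both with room to spare.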
Multi-Modal Foundation Model Work
Vision-language models with large embedding spaces and tight CPU-GPU memory access patterns benefit from NVLink-C2C's 900 GB/s CPU-GPU bandwidth. When preprocessing pipelines (image encoding, tokenization, batch construction) run on the Grace CPUs and feed directly into GPU attention layers, eliminating the PCIe bottleneck can cut pipeline latency by 30–50% compared to an x86 + HGX B200 configuration. For deploying multimodal models on Spheron, see the multimodal LLM guides.
Enterprise Private AI Infrastructure
Teams with compliance requirements that prevent shared cloud infrastructure need on-prem or dedicated capacity. For organizations that must match hyperscaler inference throughput in a private deployment, GB200 NVL72 is the right system. The alternative is a large cluster of H100 or B200 nodes, which requires InfiniBand networking and more floor space.
Cloud Pricing: Who Offers GB200 NVL72 Access
Pricing for GB200 NVL72 capacity varies widely and is listed here as indicative estimates based on public announcements and market reports as of 22 Mar 2026. These are not official quotes.
| Provider | Access Type | Price Range | Notes |
|---|---|---|---|
| CoreWeave | On-demand | ~$10.50/GPU-hr | 4-GPU instances (gb200-4x), generally available; minimum 72 GPUs per full rack |
| Oracle Cloud | On-demand | ~$16/GPU-hr | BM.GPU.GB200.4 bare-metal (4 B200 GPUs per instance), public on-demand pricing |
| Azure (ND GB200 v6) | On-demand / Reserved | ~$27/GPU-hr | Generally available since March 2025 |
| Google Cloud (A4X) | On-demand | Contact for pricing | Generally available since May 2025, no standard public rate listed |
| Lambda Labs | Not available | N/A | No public GB200 NVL72 offering as of Mar 2026; B200 available from $4.62/hr (HGX B200 1-Click Cluster) or $6.08/hr (single B200 instance) |
All figures are indicative and approximate. Prices fluctuate and these are not official quotes from any provider. Check each provider directly for current pricing.
Spheron's Practical Alternative
Spheron does not currently offer GB200 NVL72 rack instances. For teams needing Blackwell performance without the rack commitment, the practical options are:
- H100 at $2.01/hr on-demand or $0.99/hr spot. Covers 80 GB VRAM with a proven inference stack for sub-70B models. Rent H100 on Spheron.
- B200 at $6.03/hr on-demand or $2.25/hr spot. Covers 180 GB VRAM, 9,000 TFLOPS FP4 dense, 1.8 TB/s NVLink per GPU. Rent B200 on Spheron.
- B300 at $8.55/hr on-demand or $3.67/hr spot. Covers 288 GB VRAM, 15,000 TFLOPS FP4 dense. Rent B300 on Spheron.
Spheron prices as of 22 Mar 2026 and can fluctuate over time based on GPU availability.
For the full B200 breakdown, see our B200 complete guide. For B300 Blackwell Ultra specs and pricing, see the B300 Blackwell Ultra guide. For deployment guides, see the Spheron LLM inference quick guide.
Cost Analysis: GB200 NVL72 vs Smaller Clusters
Here's the math that determines whether rack-scale is actually worth it.
A full GB200 NVL72 rack at $10.50–$27/GPU-hr across 72 GPUs runs $756–$1,944/hr. The same 72 GPUs in nine 8×B200 nodes at $6.03/GPU-hr on-demand on Spheron totals roughly $434/hr. The NVL72 costs 1.7–4.5x more per hour.
For that premium to pay off, the NVL72's 130 TB/s NVLink must cut actual training or inference time enough to offset the higher rate. If you're training a 200B parameter model and all-reduce currently consumes 40% of your step time across 9 InfiniBand-connected B200 nodes, replacing that with NVLink reduces effective cost-per-step by around 40%. At 1.7x rack premium, you break even if NVLink saves more than 42% of your step time. For tensor-parallel all-reduce at 200B scale, that's plausible. For 70B models where all-reduce is a smaller fraction of step time, it rarely pencils out.
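The break-even condition in the paragraph above reduces to one line of algebra: at a price premium P, the NVL72 matches cost-per-step when its step-time reduction f satisfies P·(1 − f) = 1, i.e. f = 1 − 1/P. A quick calculator:

```python
# Break-even step-time savings for a given NVL72 price premium.
# Cost per step on NVL72 = premium * (1 - f) * baseline cost,
# so break-even is at f = 1 - 1/premium.
def breakeven_fraction(premium: float) -> float:
    """Fraction of step time NVLink must save to match cost-per-step."""
    return 1 - 1 / premium

for premium in (1.7, 2.5, 4.5):
    print(f"{premium:.1f}x premium -> break even at "
          f"{breakeven_fraction(premium):.0%} of step time saved")
# 1.7x -> just over 41%, in line with the ~42% figure above;
# 4.5x (the top of the price range) -> ~78%, which few workloads reach
```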
| Config | GPUs | Total VRAM | NVLink BW | Est. Cost/hr | Best For |
|---|---|---|---|---|---|
| 1× GB200 NVL72 | 72 | 13.4 TB | 130 TB/s | ~$756–$1,944 | 200B+ training, 671B inference |
| 9× 8×B200 nodes | 72 | ~13 TB | 1.8 TB/s per node | ~$434 | 70B–100B training, multi-node B200 |
| 9× 8×H100 nodes | 72 | 5.76 TB | 900 GB/s per node | ~$145 | Mature stack, sub-70B models |
The 9×8×B200 cluster has slightly less total VRAM than the NVL72 rack (approximately 12.96 TB vs 13.4 TB respectively, since HGX B200 nodes run at 180 GB per GPU while NVL72-configured B200 GPUs run at 186 GB per GPU), but NVLink only operates at full bandwidth within each 8-GPU node. Cross-node traffic goes over InfiniBand at 400 Gb/s (50 GB/s), about 2,600x slower than NVL72's all-to-all 130 TB/s. For workloads where cross-node bandwidth is the bottleneck, the NVL72 wins decisively. For workloads that fit within a single 8×B200 node, the 9-node B200 cluster is cheaper and more flexible.
Infrastructure Requirements
The GB200 NVL72 is data-center-class infrastructure. These are not optional requirements.
Power: NVIDIA's nominal spec is 120 kW for the full rack. Deployed NVL72 racks have been reported to draw 130–132 kW under full load (HPE's spec sheet lists 115 kW liquid-cooled plus 17 kW air-cooled for supporting systems), so plan capacity accordingly. A typical data center rack budget is 10–20 kW. Deploying a GB200 NVL72 requires dedicated 3-phase power circuits rated for the full load, typically a dedicated PDU with 200A+ capacity.
Cooling: Air cooling is not viable at 120 kW rack density. Direct liquid cooling (DLC) is required. The coolant manifold handles heat rejection for the GPU and CPU components. Standard rear-door heat exchangers designed for 30–40 kW racks cannot handle this load.
Floor load: The rack weighs approximately 1.36 metric tons. Standard raised-floor tiles are rated for 250 kg per tile. A fully loaded GB200 NVL72 rack requires reinforced flooring or a dedicated slab with verified load capacity before installation.
Space: The NVL72 occupies a 48U rack in a non-standard OCP Open Rack V3 form factor (600 mm wide × 1,068 mm deep), with additional space needed for cabling and coolant manifold connections. This is not a standard 19-inch data center rack.
Networking: When scaling beyond a single GB200 NVL72 rack, cross-rack traffic runs over InfiniBand (NDR 400 Gb/s or XDR 800 Gb/s) or RoCE. Initial GB200 NVL72 deployments ship with ConnectX-7 NICs supporting NDR 400 Gb/s InfiniBand. Racks deployed from mid-2025 onward use ConnectX-8 NICs supporting XDR 800 Gb/s per port.
This is not standard colo cage territory. Deploying a GB200 NVL72 on-premises requires working with a data center operator experienced with high-density liquid-cooled hardware.
Getting Started
For rack-scale GB200 NVL72 capacity: Contact cloud providers directly for reserved instances. CoreWeave currently has the clearest on-demand availability at ~$10.50/GPU-hr. Oracle Cloud and Azure (ND GB200 v6) also offer GB200 instances at higher price points. Google Cloud A4X VMs (generally available since May 2025) require contacting Google Cloud sales for pricing. Lambda Labs does not have a public GB200 NVL72 offering as of March 2026. Expect negotiated pricing for multi-week or multi-month commitments from any provider.
For GPU access without the rack commitment:
- H100 on-demand at $2.01/hr or spot at $0.99/hr on Spheron. Proven inference stack for sub-70B models. Start with H100.
- B200 on-demand at $6.03/hr or spot at $2.25/hr on Spheron. Handles 180 GB VRAM and FP4 workloads. Start with B200.
- B300 on-demand at $8.55/hr with 288 GB VRAM for workloads that need more room. Explore B300.
For step-by-step deployment, see the Spheron LLM inference quick guide, the vLLM inference server guide, the SGLang inference guide, or the distributed training guide.
For the full GPU comparison: See the H200 vs B200 vs GB200 side-by-side for benchmark data and a clear decision framework across all three generations.
GB200 NVL72 sets the ceiling for what a single rack can do, but most teams don't need it. For Blackwell performance without the rack commitment, Spheron offers B200 and B300 on-demand with no long-term contract.
