
Rubin vs Blackwell vs Hopper: NVIDIA GPU Architecture Comparison

Written by Mitrasish, Co-founder · Mar 27, 2026

Three GPU generations are active in production clouds right now. H100 and H200 (Hopper) are widely available and well understood. B200 and B300 (Blackwell) are ramping, with prices falling. Rubin (R100, also referred to as R200 in some sources) ships in H2 2026. An AI team picking infrastructure today needs to know where each generation sits, what changed between them, and which one to actually rent. Check current GPU pricing for live rates before making any capacity decision.

Quick Answer: Which Generation for Which Workload

| Generation | Best for | Memory | Key advantage | Availability | Spheron pricing |
|---|---|---|---|---|---|
| Hopper (H100/H200) | 7B-70B fine-tuning, low-cost inference | 80 GB / 141 GB HBM3/HBM3e | Mature software stack, lowest on-demand cost | Widely available | H100 from $2.01/hr on-demand; H200 from $1.43/hr spot |
| Blackwell (B200/B300) | Production inference, 70B-200B models | 192 GB / 288 GB HBM3e | Native FP4, 4-5.5x H100 throughput | Available now | B200 spot from $2.07/hr; B300 spot from $3.63/hr |
| Rubin (R100) | 400B+ models, next-gen inference | 288 GB HBM4 | 22 TB/s bandwidth, 50 PFLOPS FP4 | H2 2026 | Not yet available; est. $8-12/hr spot at launch |

Pricing fluctuates based on GPU availability. The prices above are based on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Architecture Timeline: Hopper to Blackwell to Rubin

Hopper launched in 2022 with the H100, built on TSMC 4N. The defining contribution was the Transformer Engine with FP8 mixed-precision and fourth-generation Tensor Cores. H100 set the performance baseline that every subsequent generation references.

Blackwell arrived in 2024 with the B200, manufactured on TSMC 4NP. NVIDIA nearly doubled the transistor count (80B to 208B) and introduced fifth-generation Tensor Cores with native FP4 compute. NVLink bandwidth doubled from 900 GB/s (NVLink 4) to 1.8 TB/s (NVLink 5). The B300 (Blackwell Ultra) followed with 288 GB VRAM while keeping the same compute profile as B200. For detailed Blackwell specs, see our B200 complete guide.

Rubin was announced at Computex 2024 and confirmed at GTC 2026 with H2 2026 delivery to the first cloud cohort. The architecture switches memory from HBM3e to HBM4, delivering up to 22 TB/s per GPU. FP4 compute reaches 50 PFLOPS, versus 9 PFLOPS on B200. NVLink 6 doubles the interconnect bandwidth again to 3.6 TB/s. Transistor count grows to 336 billion. For a full Rubin spec breakdown, see our NVIDIA Rubin R100 guide.

Full Specs Comparison: H100, H200, B200, B300, R100

| Spec | H100 SXM | H200 SXM | B200 SXM | B300 SXM | R100 |
|---|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Rubin |
| VRAM | 80 GB | 141 GB | 192 GB | 288 GB | 288 GB |
| Memory type | HBM3 | HBM3e | HBM3e | HBM3e | HBM4 |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | up to 22 TB/s |
| FP4 Dense | N/A | N/A | 9 PFLOPS | 15 PFLOPS | 50 PFLOPS |
| FP8 Dense | ~1,979 TFLOPS | ~1,979 TFLOPS | 4,500 TFLOPS | ~4,500 TFLOPS | ~17,500 TFLOPS |
| FP16 Dense | ~990 TFLOPS | ~990 TFLOPS | 2,250 TFLOPS | 2,250 TFLOPS | ~8,000 TFLOPS |
| Transistors | 80 B | 80 B | 208 B | 208 B | 336 B |
| TDP | 700 W | 700 W | ~1,200 W | 1,400 W | ~2,300 W |
| NVLink generation | 4 | 4 | 5 | 5 | 6 |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s |
| Networking | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-8 (800G) | ConnectX-9 (2x 800G, up to 1.6T) |

H100/H200 FP8 Dense (~1,979 TFLOPS) is the without-sparsity figure. NVIDIA's datasheet also lists 3,958 TFLOPS for H100/H200 FP8 with sparsity enabled; the dense column above uses the without-sparsity values for consistency with B200 (4,500 TFLOPS dense). R100 FP8 (~17,500 TFLOPS) is derived from NVIDIA's GTC 2026 NVL72 system-level spec (1,260 PFLOPS FP8/FP6 across 72 GPUs); this is an approximate per-GPU figure pending final NVIDIA-published per-GPU specs. R100 FP16 (~8,000 TFLOPS) is projected at approximately half the FP8 figure and is not NVIDIA-confirmed. B200 TDP reflects the SXM5 liquid-cooled variant; PCIe B200 has lower TDP and bandwidth. R100 FP4 training is 35 PFLOPS vs 50 PFLOPS inference. All B200/B300 specs refer to SXM form factor. B300 FP4 Dense (15 PFLOPS) is sourced from NVIDIA's developer blog; the DGX B300 page math yields approximately 13-13.5 PFLOPS per GPU dense FP4, a minor variance from the per-GPU published spec.
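
The per-GPU R100 FP8 estimate in the table can be reproduced from the system-level figure cited above. A quick sanity check:

```python
# Derive the ~17,500 TFLOPS per-GPU FP8 estimate from NVIDIA's
# GTC 2026 NVL72 system-level spec (1,260 PFLOPS FP8/FP6, 72 GPUs).
system_pflops = 1260
gpus = 72
per_gpu_tflops = system_pflops / gpus * 1000  # PFLOPS -> TFLOPS
print(per_gpu_tflops)  # 17500.0
```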

Memory Hierarchy: HBM3 to HBM3e to HBM4

Hopper: HBM3 (H100) and HBM3e (H200)

The H100 ships with HBM3 delivering 3.35 TB/s. The H200 upgrades to HBM3e, pushing bandwidth to 4.8 TB/s on the same GH100 die. No additional CUDA cores, no new Tensor Core generation: the H200 is an H100 with better memory. That single change makes a meaningful difference for memory-bandwidth-bound workloads like 70B model serving. The H200 fits Llama 70B in FP16 on one GPU; the H100, at 80 GB, requires 2-way tensor parallelism.
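
The single-GPU fit claim is simple arithmetic. A sketch, counting weights only (activations and KV cache would come out of the remaining headroom, which is thin on the H200):

```python
def weight_gb(params_billions, bytes_per_param):
    """VRAM needed for model weights alone, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

llama70b_fp16 = weight_gb(70, 2)  # FP16 = 2 bytes per parameter
print(llama70b_fp16)  # 140.0 -> over H100's 80 GB, within H200's 141 GB
```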

Blackwell: HBM3e Maxed Out

B200 doubles H100's bandwidth to 8 TB/s with the same HBM3e type. The B300 (Blackwell Ultra) reaches 288 GB of capacity but keeps bandwidth at 8 TB/s. More memory stacks don't automatically mean more bandwidth once the memory bus is saturated. The practical implication: B300 lets you fit larger models (200B in FP8) without multi-GPU sharding, but the bandwidth profile per token is the same as B200. If your bottleneck is memory capacity rather than bandwidth, B300 is the right choice.

Rubin: HBM4 at 22 TB/s

The jump from 8 to 22 TB/s is the single biggest memory bandwidth increase across any NVIDIA generation transition. For long-context inference, KV cache reads are the dominant bandwidth consumer at high token counts: a request with 128K context on a 70B model carries a KV cache of tens of gigabytes that must be streamed through the memory system during attention at every decode step. At 22 TB/s, that read takes roughly 36% of the time it takes at 8 TB/s. Models that currently require 4xH100 or 2xB200 for throughput at long context may fit and serve efficiently on a single R100. For further analysis of R100's workload implications, see our NVIDIA Rubin R100 guide.
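
To make the 36% figure concrete, here is a back-of-envelope KV-cache read estimate. The layer count, KV-head count, and head dimension below are assumptions matching a Llama-3-70B-style GQA configuration, not universal constants:

```python
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache size in GB: K and V, per layer, per token, at FP16."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

cache = kv_cache_gb(128_000)       # ~41.9 GB at 128K context
t_hbm3e = cache / 8_000            # seconds to read once at 8 TB/s
t_hbm4 = cache / 22_000            # seconds to read once at 22 TB/s
print(round(t_hbm4 / t_hbm3e, 2))  # 0.36 -> ~36% of the read time
```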

NVLink Generations: What Doubling Bandwidth Means

| Generation | GPUs | Per-GPU bandwidth | Typical use |
|---|---|---|---|
| NVLink 4 | H100, H200 | 900 GB/s | 8-GPU HGX nodes |
| NVLink 5 | B200, B300 | 1.8 TB/s | HGX B200 nodes, NVL36/72 |
| NVLink 6 | R100 | 3.6 TB/s | Vera Rubin NVL72 |

NVLink 4 (Hopper)

At 900 GB/s bidirectional, NVLink 4 enables 8-GPU all-reduce operations at speeds that keep gradient synchronization from dominating training time for models up to roughly 70B parameters. For inference with tensor parallelism, 900 GB/s is enough for 2-4 way TP on models up to 130B at reasonable latency.

NVLink 5 (Blackwell)

Doubling to 1.8 TB/s changes the economics for large model inference. Gradient sync in training across 8xB200 takes roughly half the time it takes across 8xH100 at equivalent model sizes. For tensor-parallel inference, this is the difference between TP being a 20% overhead versus a 40% overhead per token. The NVL72 configuration (72 B200s connected via NVLink 5 at rack scale) is designed for serving models at 400B-1T parameter range where inter-GPU communication dominates.
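
A first-order ring all-reduce model illustrates the halving. It ignores link latency, protocol overhead, and compute overlap, so treat it as directional only:

```python
def allreduce_seconds(params_billions, n_gpus, link_tb_s, bytes_per_grad=2):
    """Ring all-reduce: each GPU moves ~2*(n-1)/n of the gradient bytes."""
    grad_gb = params_billions * 1e9 * bytes_per_grad / 1e9
    volume_gb = 2 * (n_gpus - 1) / n_gpus * grad_gb
    return volume_gb / (link_tb_s * 1000)  # TB/s -> GB/s

t_nvlink4 = allreduce_seconds(70, 8, 0.9)  # 8x H100/H200 at 900 GB/s
t_nvlink5 = allreduce_seconds(70, 8, 1.8)  # 8x B200 at 1.8 TB/s
print(round(t_nvlink4 / t_nvlink5, 1))     # 2.0 -> sync time halves
```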

NVLink 6 (Rubin)

At 3.6 TB/s, NVLink 6 on the Vera Rubin NVL72 targets trillion-parameter models where current NVLink 5 systems still bottleneck on the interconnect. For multi-GPU inference at 400B+ model sizes, the per-generation NVLink doubling reduces the hidden cost of sharding. Each generation roughly halves the communication-to-compute ratio for fixed model sizes, which directly improves tokens per second at scale.

Precision and Compute: FP8 to FP4

Hopper: FP8 Max

Hopper's Transformer Engine introduced FP8 mixed-precision with hardware acceleration, a significant step beyond the BF16 ceiling of previous generations. H100 and H200 both deliver approximately 1,979 TFLOPS dense FP8 (without sparsity); with sparsity enabled, this doubles to 3,958 TFLOPS. FP16 and BF16 are fully supported. FP4 is not available on Hopper hardware and has no planned software path: the minimum precision supported by the Tensor Cores is FP8. Any inference optimization targeting FP4 on Hopper will not find hardware backing and will fall back to FP8 or worse.

Blackwell: Native FP4 Tensor Cores

B200 and B300 introduced fifth-generation Tensor Cores with native FP4 support. This is not a software emulation layer; FP4 operations execute directly on Tensor Core hardware. B200 reaches 9 PFLOPS FP4 dense, roughly 2x the effective throughput over FP8 for quantization-friendly models. B300 extends this to 15 PFLOPS FP4. FP8 throughput is ~4,500 TFLOPS on both B200 and B300 (the B300 gains are in FP4 and memory capacity, not FP8 compute); FP16 dense is 2,250 TFLOPS on both. For practical guidance on which models benefit from FP4 and how to enable it in production, see our FP4 quantization guide.

Rubin: 50 PFLOPS FP4

R100's FP4 throughput is 5.6x the B200 and 3.33x the B300. Combined with HBM4's 22 TB/s bandwidth, this enables single-GPU inference for 400B-parameter models in FP4 that currently require multi-GPU sharding at FP8. A 400B model at FP4 needs roughly 200 GB of VRAM for weights. The R100's 288 GB HBM4 holds that with room for KV cache, and the bandwidth means the attention computation does not become the bottleneck even at 128K context lengths.
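
The capacity arithmetic behind the single-GPU claim, counting weights only (KV cache and activations come out of the remaining headroom):

```python
def weights_gb(params_billions, bits_per_param):
    """Weight memory in GB for a given parameter count and precision."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp4_400b = weights_gb(400, 4)  # 200.0 GB -> fits R100's 288 GB
fp8_400b = weights_gb(400, 8)  # 400.0 GB -> still multi-GPU, even on R100
headroom = 288 - fp4_400b      # 88.0 GB left for KV cache at FP4
print(fp4_400b, fp8_400b, headroom)
```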

Workload Decision Guide

Fine-tuning 7B-13B models: H100 at $2.01/hr on-demand is the cost leader. B200 spot (check current pricing for live rates) offers roughly 2x the training throughput, so the right choice depends on whether wall-clock time or cost per experiment matters more to your team.
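
One way to frame the wall-clock vs cost tradeoff is cost per experiment. The 10-hour run length below is illustrative, and spot preemption risk is not modeled; the rates are the ones quoted above and subject to change:

```python
h100_rate = 2.01                   # $/hr on-demand
b200_spot_rate = 2.07              # $/hr spot
hours_on_h100 = 10                 # illustrative fine-tune duration
hours_on_b200 = hours_on_h100 / 2  # ~2x training throughput on B200

cost_h100 = h100_rate * hours_on_h100       # $20.10
cost_b200 = b200_spot_rate * hours_on_b200  # $10.35
```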

Fine-tuning 70B models: H200 (141 GB, fits Llama 70B in FP16 plus QLoRA adapters on one GPU) or B200. Single-GPU fine-tuning avoids tensor parallelism overhead and simplifies checkpointing.

Inference, 7B-30B models, low concurrency: H100 remains viable. At low batch sizes, these workloads are memory-bandwidth-bound, and the gap between H100 and B200 narrows compared to the price gap. H100 from $2.01/hr on-demand on Spheron is hard to beat for development and low-traffic serving.

Inference, 70B models, production serving: B200 or B300, delivering approximately 4x H100 throughput per SemiAnalysis InferenceX benchmarks, with native FP4 optimization available on Blackwell for models that support quantization.

Inference, 200B+ models or long context: B300 (288 GB) handles 200B models in FP8 on one GPU. For 400B+ at FP8, multi-GPU B300 is required until R100 ships. If your roadmap includes 400B+ models and your timeline extends into 2027, factor R100 into your planning.

Planning a new deployment for 2027: Rubin R100 is the target architecture, but Blackwell is the bridge. For production workloads you need running now, B200 or B300 beat waiting.

For detailed inference cost-per-token benchmarks across GPU types, see our best GPU for AI inference guide. For VRAM sizing by model size and precision, see our LLM memory requirements guide.

Availability and Pricing

As of March 2026:

H100: Widely available across all major GPU clouds. On-demand prices have compressed to $2.01/hr on Spheron. The H100 is roughly 2 years past its cloud launch, and prices reflect that maturity. Software frameworks are battle-tested on Hopper.

H200: Available at major providers including Spheron from $1.43/hr spot. The memory upgrade over H100 justifies the price premium for workloads that were multi-GPU on H100 but fit on a single H200.

B200: Available now on Spheron (see pricing for current rates), RunPod, Nebius, Lambda, and others. Supply is still ramping, but spot availability has improved significantly since early 2026. No on-demand pricing is currently listed; spot is the primary access path.

B300: Available on Spheron, CoreWeave, and select providers. On Spheron, B300 starts at $3.63/hr. The 288 GB capacity makes it the practical choice for large-model workloads where the full memory is needed.

R100: H2 2026 delivery to the first confirmed cloud cohort: AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale (per the NVIDIA CES 2026 announcement), with Crusoe and Together AI confirmed as later additions to the cohort. Broader availability including Spheron is expected in 2027. Projected launch pricing runs $15-25/hr at hyperscalers and $8-12/hr spot at specialist GPU clouds.

One useful data point on price compression: H100 launched at $8-10/hr on-demand in 2023 and is now around $2/hr on-demand. Blackwell pricing will follow a similar curve as supply builds through 2026 and into 2027.

Pricing fluctuates based on GPU availability. The prices above are based on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.


If you're on Hopper and the numbers above show a cost-per-token improvement, B200 and B300 are available now on Spheron with per-minute billing and no commitment.

Rent B200 → | Rent B300 → | View pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.