Three GPU generations are active in production clouds right now. H100 and H200 (Hopper) are widely available and well understood. B200 and B300 (Blackwell) are ramping, with prices falling as supply builds. Rubin (R100, also referred to as R200 in some sources) ships in H2 2026. An AI team picking infrastructure today needs to know where each generation sits, what changed between them, and which one to actually rent. Check current GPU pricing for live rates before making any capacity decision.
Quick Answer: Which Generation for Which Workload
| Generation | Best for | Memory | Key advantage | Availability | Spheron pricing |
|---|---|---|---|---|---|
| Hopper (H100/H200) | 7B-70B fine-tuning, low-cost inference | 80 GB / 141 GB HBM3/HBM3e | Mature software stack, lowest on-demand cost | Widely available | H100 from $2.01/hr on-demand; H200 from $1.43/hr spot |
| Blackwell (B200/B300) | Production inference, 70B-200B models | 192 GB / 288 GB HBM3e | Native FP4, 4-5.5x H100 throughput | Available now | B200 spot from $2.07/hr; B300 spot from $3.63/hr; see pricing |
| Rubin (R100) | 400B+ models, next-gen inference | 288 GB HBM4 | 22 TB/s bandwidth, 50 PFLOPS FP4 | H2 2026, est. $8-12/hr spot | Not yet available |
Pricing fluctuates based on GPU availability. The prices above were captured on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Architecture Timeline: Hopper to Blackwell to Rubin
Hopper launched in 2022 with the H100, built on TSMC 4N. The defining contribution was the Transformer Engine with FP8 mixed-precision and fourth-generation Tensor Cores. H100 set the performance baseline that every subsequent generation references.
Blackwell arrived in 2024 with the B200, manufactured on TSMC 4NP. NVIDIA raised the transistor count from 80 billion to 208 billion and introduced fifth-generation Tensor Cores with native FP4 compute. NVLink bandwidth doubled from 900 GB/s (NVLink 4) to 1.8 TB/s (NVLink 5). The B300 (Blackwell Ultra) followed with 288 GB VRAM and higher FP4 throughput, while keeping the same FP8/FP16 compute and memory bandwidth as B200. For detailed Blackwell specs, see our B200 complete guide.
Rubin was announced at Computex 2024 and confirmed at GTC 2026 with H2 2026 delivery to the first cloud cohort. The architecture switches memory from HBM3e to HBM4, delivering up to 22 TB/s per GPU. FP4 compute reaches 50 PFLOPS, versus 9 PFLOPS on B200. NVLink 6 doubles the interconnect bandwidth again to 3.6 TB/s. Transistor count grows to 336 billion. For a full Rubin spec breakdown, see our NVIDIA Rubin R100 guide.
Full Specs Comparison: H100, H200, B200, B300, R100
| Spec | H100 SXM | H200 SXM | B200 SXM | B300 SXM | R100 |
|---|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Rubin |
| VRAM | 80 GB | 141 GB | 192 GB | 288 GB | 288 GB |
| Memory type | HBM3 | HBM3e | HBM3e | HBM3e | HBM4 |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | up to 22 TB/s |
| FP4 Dense | N/A | N/A | 9 PFLOPS | 15 PFLOPS | 50 PFLOPS |
| FP8 Dense | ~1,979 TFLOPS | ~1,979 TFLOPS | 4,500 TFLOPS | ~4,500 TFLOPS | ~17,500 TFLOPS |
| FP16 Dense | 1,979 TFLOPS | 1,979 TFLOPS | 2,250 TFLOPS | 2,250 TFLOPS | ~8,000 TFLOPS |
| Transistors | 80 B | 80 B | 208 B | 208 B | 336 B |
| TDP (W) | 700 | 700 | ~1,200 | 1,400 | ~2,300 |
| NVLink generation | 4 | 4 | 5 | 5 | 6 |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s |
| Networking | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-8 (800G) | ConnectX-9 (2x 800G, up to 1.6T) |
H100/H200 FP8 Dense (~1,979 TFLOPS) is the without-sparsity figure. NVIDIA's datasheet also lists 3,958 TFLOPS for H100/H200 FP8 with sparsity enabled; the dense column above uses the without-sparsity values for consistency with B200 (4,500 TFLOPS dense). R100 FP8 (~17,500 TFLOPS) is derived from NVIDIA's GTC 2026 NVL72 system-level spec (1,260 PFLOPS FP8/FP6 across 72 GPUs); this is an approximate per-GPU figure pending final NVIDIA-published per-GPU specs. R100 FP16 (~8,000 TFLOPS) is projected at approximately half the FP8 figure and is not NVIDIA-confirmed. B200 TDP reflects the SXM5 liquid-cooled variant; PCIe B200 has lower TDP and bandwidth. R100 FP4 training is 35 PFLOPS vs 50 PFLOPS inference. All B200/B300 specs refer to SXM form factor. B300 FP4 Dense (15 PFLOPS) is sourced from NVIDIA's developer blog; the DGX B300 page math yields approximately 13-13.5 PFLOPS per GPU dense FP4, a minor variance from the per-GPU published spec.
Memory Hierarchy: HBM3 to HBM3e to HBM4
Hopper: HBM3 (H100) and HBM3e (H200)
The H100 ships with HBM3 delivering 3.35 TB/s. The H200 upgrades to HBM3e, pushing bandwidth to 4.8 TB/s on the same GH100 die. No additional CUDA cores, no new Tensor Core generation: the H200 is an H100 with better memory. That single change makes a meaningful difference for memory-bandwidth-bound workloads like 70B model serving. The H200 fits Llama 70B in FP16 on one GPU; the H100 requires 2-way tensor parallelism.
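A quick weights-only sizing check makes that boundary visible (a back-of-the-envelope sketch; real deployments also need headroom for KV cache, activations, and runtime overhead):

```python
# Rough weight-memory sizing: does a model's weights fit in a single GPU's VRAM?
# Weights only; treat the result as a lower bound on required memory.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params * N bytes/param = N GB

llama_70b_fp16 = weight_gb(70, 2.0)  # ~140 GB of FP16/BF16 weights
print(f"Llama 70B @ FP16 weights: ~{llama_70b_fp16:.0f} GB")
print(f"H100 (80 GB):  {'fits' if llama_70b_fp16 <= 80 else 'needs 2-way tensor parallelism'}")
print(f"H200 (141 GB): {'fits' if llama_70b_fp16 <= 141 else 'needs tensor parallelism'}")
```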
Blackwell: HBM3e Maxed Out
B200 pushes bandwidth to 8 TB/s, more than double the H100's 3.35 TB/s, using the same HBM3e memory type. The B300 (Blackwell Ultra) reaches 288 GB of capacity but keeps bandwidth at 8 TB/s: taller memory stacks add capacity, not bandwidth, when the memory interface stays the same. The practical implication: B300 lets you fit larger models (200B in FP8) without multi-GPU sharding, but the bandwidth profile per token is the same as B200. If your bottleneck is memory capacity rather than bandwidth, B300 is the right choice.
Rubin: HBM4 at 22 TB/s
The jump from 8 to 22 TB/s is the single biggest memory bandwidth increase across any NVIDIA generation transition. For long-context inference, KV cache access is the dominant bandwidth consumer at high token counts. A request with 128K context on a 70B model builds a KV cache large enough that, at 8 TB/s, re-reading it every decode step limits how fast the GPU can generate tokens. At 22 TB/s, streaming that same cache takes roughly 36% of the time. Models that currently require 4xH100 or 2xB200 for throughput at long context may fit and serve efficiently on a single R100. For further analysis of R100's workload implications, see our NVIDIA Rubin R100 guide.
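To put rough numbers on that, here is an estimate of the per-decode-step KV cache read time at 128K context. The layer count, GQA head count, and head dimension are assumptions matching a Llama-70B-style architecture; other models will differ:

```python
# Rough per-decode-step KV cache read time at 128K context.
# Model shape assumed to be Llama-70B-like: 80 layers, 8 KV heads (GQA), head dim 128.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2   # FP16 KV cache
context_tokens = 128 * 1024

bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * layers  # K and V per token
kv_cache_gb = context_tokens * bytes_per_token / 1e9
print(f"KV cache at 128K context: ~{kv_cache_gb:.0f} GB")         # ~43 GB

for gpu, tb_per_s in [("B200 (HBM3e)", 8), ("R100 (HBM4)", 22)]:
    ms = kv_cache_gb / (tb_per_s * 1000) * 1000   # GB / (GB/s) -> seconds -> ms
    print(f"{gpu}: ~{ms:.1f} ms just to stream the cache each decode step")
# The 8 -> 22 TB/s jump cuts that read time to roughly 36% of the Blackwell figure.
```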
NVLink Generations: What Doubling Bandwidth Means
| Generation | GPU | Per-GPU bandwidth | Typical use |
|---|---|---|---|
| NVLink 4 | H100, H200 | 900 GB/s | 8-GPU HGX nodes |
| NVLink 5 | B200, B300 | 1.8 TB/s | HGX B200 nodes, NVL36/72 |
| NVLink 6 | R100 | 3.6 TB/s | Vera Rubin NVL72 |
NVLink 4 (Hopper)
At 900 GB/s bidirectional, NVLink 4 enables 8-GPU all-reduce operations at speeds that keep gradient synchronization from dominating training time for models up to roughly 70B parameters. For inference with tensor parallelism, 900 GB/s is enough for 2-4 way TP on models up to 130B at reasonable latency.
NVLink 5 (Blackwell)
Doubling to 1.8 TB/s changes the economics of large-model inference. Gradient sync in training across 8xB200 takes roughly half the time it takes across 8xH100 at equivalent model sizes. For tensor-parallel inference, this is the difference between TP being a 20% overhead versus a 40% overhead per token. The NVL72 configuration (72 B200s connected via NVLink 5 at rack scale) is designed for serving models in the 400B-1T parameter range, where inter-GPU communication dominates.
NVLink 6 (Rubin)
At 3.6 TB/s, NVLink 6 on the Vera Rubin NVL72 targets trillion-parameter models where current NVLink 5 systems still bottleneck on the interconnect. For multi-GPU inference at 400B+ model sizes, the per-generation NVLink doubling reduces the hidden cost of sharding. Each generation roughly halves the communication-to-compute ratio for fixed model sizes, which directly improves tokens per second at scale.
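A simplified ring all-reduce estimate shows what each doubling buys for 8-GPU gradient sync. It ignores latency, overlap with compute, and the bidirectional-traffic details of real NCCL collectives, and treats the headline per-GPU bandwidth as fully usable:

```python
# Simplified ring all-reduce estimate for one gradient sync across an 8-GPU node.
# Shows how per-GPU link bandwidth scales the communication term; not a benchmark.
def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_gb   # classic ring formula
    return traffic_per_gpu / link_gb_per_s

grad_gb = 70 * 2.0   # 70B parameters of BF16 gradients, ~140 GB
for gen, bw in [("NVLink 4 (H100/H200)", 900), ("NVLink 5 (B200/B300)", 1800),
                ("NVLink 6 (R100)", 3600)]:
    print(f"{gen}: ~{allreduce_seconds(grad_gb, 8, bw) * 1000:.0f} ms per gradient sync")
# ~272 ms on NVLink 4, ~136 ms on NVLink 5, ~68 ms on NVLink 6:
# each generation halves the communication term for the same model size.
```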
Precision and Compute: FP8 to FP4
Hopper: FP8 Max
Hopper's Transformer Engine introduced FP8 mixed-precision with hardware acceleration, a significant step beyond the BF16 ceiling of previous generations. H100 and H200 both deliver approximately 1,979 TFLOPS dense FP8 (without sparsity); with sparsity enabled, this doubles to 3,958 TFLOPS. FP16 and BF16 are fully supported. FP4 is not available on Hopper hardware and has no planned software path: the lowest precision the Tensor Cores accelerate is FP8. Any inference optimization targeting FP4 on Hopper will not find hardware backing and falls back to FP8 or higher-precision compute.
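Reaching that FP8 throughput requires code that targets the Transformer Engine. A minimal sketch using NVIDIA's Transformer Engine PyTorch API; the layer dimensions are arbitrary and the recipe parameters are just one common configuration, so check the Transformer Engine docs for the options your version supports:

```python
# Minimal FP8 forward pass on Hopper using NVIDIA Transformer Engine (PyTorch).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# HYBRID keeps E4M3 for forward activations/weights and E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16,
                            amax_compute_algo="max")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the GEMM executes in FP8 on Hopper Tensor Cores
print(y.shape)
```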
Blackwell: Native FP4 Tensor Cores
B200 and B300 introduced fifth-generation Tensor Cores with native FP4 support. This is not a software emulation layer; FP4 operations execute directly on Tensor Core hardware. B200 reaches 9 PFLOPS FP4 dense, roughly 2x the effective throughput of FP8 for quantization-friendly models. B300 extends this to 15 PFLOPS FP4. FP8 throughput is ~4,500 TFLOPS on both B200 and B300 (the B300 gains are in FP4 and memory capacity, not FP8 compute); FP16 dense is 2,250 TFLOPS on both. For practical guidance on which models benefit from FP4 and how to enable it in production, see our FP4 quantization guide.
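A first-order sketch of where that ~2x comes from, using the headline B200 numbers and a 70B model as an example; real gains depend on kernel efficiency, batch size, and how much of the model is actually quantized:

```python
# Why FP4 roughly doubles effective throughput over FP8 on B200, to first order.
# Compute-bound phases scale with peak PFLOPS; bandwidth-bound decode scales with
# bytes of weights read per token. Both roughly halve the work going FP8 -> FP4.
b200 = {"fp8_pflops": 4.5, "fp4_pflops": 9.0, "bandwidth_tb_s": 8.0}
params_b = 70   # 70B-parameter model as an example

for prec, bytes_per_w, pflops in [("FP8", 1.0, b200["fp8_pflops"]),
                                  ("FP4", 0.5, b200["fp4_pflops"])]:
    weight_gb = params_b * bytes_per_w
    decode_ms = weight_gb / (b200["bandwidth_tb_s"] * 1000) * 1000  # weight read per token
    print(f"{prec}: {pflops} PFLOPS peak, ~{weight_gb:.0f} GB weights, "
          f"~{decode_ms:.2f} ms weight read per decode step")
# FP4 doubles peak compute and halves weight traffic, hence the ~2x effective gain.
```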
Rubin: 50 PFLOPS FP4
R100's FP4 throughput is 5.6x the B200 and 3.33x the B300. Combined with HBM4's 22 TB/s bandwidth, this enables single-GPU inference for 400B-parameter models in FP4 that currently require multi-GPU sharding at FP8. A 400B model at FP4 needs roughly 200 GB of VRAM for weights. The R100's 288 GB HBM4 holds that with room for KV cache, and the bandwidth means the attention computation does not become the bottleneck even at 128K context lengths.
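That claim is easy to sanity-check from the memory numbers alone (a weights-only sketch; the leftover headroom has to cover KV cache, activations, and runtime overhead):

```python
# Sanity check: a 400B-parameter model in FP4 on a single 288 GB R100.
params_b, fp4_bytes = 400, 0.5
weights_gb = params_b * fp4_bytes            # ~200 GB of weights
vram_gb = 288
print(f"Weights: ~{weights_gb:.0f} GB, leaving ~{vram_gb - weights_gb:.0f} GB "
      f"for KV cache, activations, and runtime overhead")
# At HBM4's 22 TB/s, re-reading those FP4 weights takes about 9 ms per decode step.
print(f"Weight read per token at 22 TB/s: ~{weights_gb / 22000 * 1000:.0f} ms")
```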
Workload Decision Guide
Fine-tuning 7B-13B models: H100 at $2.01/hr on-demand is the cost leader. B200 spot (check current pricing for live rates) offers roughly 2x the training throughput, so the right choice depends on whether wall-clock time or cost per experiment matters more to your team (a worked cost comparison follows this guide).
Fine-tuning 70B models: H200 (141 GB, fits Llama 70B in FP16 plus QLoRA adapters on one GPU) or B200. Single-GPU fine-tuning avoids tensor parallelism overhead and simplifies checkpointing.
Inference, 7B-30B models, low concurrency: H100 remains viable. At low batch sizes, these workloads are memory-bandwidth-bound, and the gap between H100 and B200 narrows compared to the price gap. H100 from $2.01/hr on-demand on Spheron is hard to beat for development and low-traffic serving.
Inference, 70B models, production serving: B200 or B300, which deliver approximately 4x the H100 throughput per SemiAnalysis InferenceX benchmarks. Native FP4 optimization is available on Blackwell for models that support quantization.
Inference, 200B+ models or long context: B300 (288 GB) handles 200B models in FP8 on one GPU. For 400B+ at FP8, multi-GPU B300 is required until R100 ships. If your roadmap includes 400B+ models and your timeline extends into 2027, factor R100 into your planning.
Planning a new deployment for 2027: Rubin R100 is the target architecture, but Blackwell is the bridge. For production workloads you need running now, B200 or B300 beat waiting.
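To make the cost-versus-time tradeoff from the fine-tuning rows concrete, here is a simple comparison using the Spheron rates quoted in this article and the rough 2x Blackwell training speedup. The 10-hour H100 job length is an arbitrary example, and spot capacity carries preemption risk that the raw numbers don't show:

```python
# Cost per experiment vs wall-clock time: H100 on-demand vs B200 spot.
# Rates are the Spheron figures quoted in this article; the 2x speedup is the
# rough Blackwell-over-Hopper training factor cited above. Job length is illustrative.
jobs = {
    "H100 on-demand": {"rate": 2.01, "hours": 10.0},        # baseline fine-tune
    "B200 spot":      {"rate": 2.07, "hours": 10.0 / 2.0},  # ~2x training throughput
}
for name, j in jobs.items():
    print(f"{name}: {j['hours']:.1f} h wall-clock, ${j['rate'] * j['hours']:.2f} per run")
# H100 on-demand: 10.0 h wall-clock, $20.10 per run
# B200 spot: 5.0 h wall-clock, $10.35 per run
# Spot can be preempted mid-run, which is why on-demand H100 remains a safe baseline.
```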
For detailed inference cost-per-token benchmarks across GPU types, see our best GPU for AI inference guide. For VRAM sizing by model size and precision, see our LLM memory requirements guide.
Availability and Pricing
As of March 2026:
H100: Widely available across all major GPU clouds. On-demand prices have compressed to $2.01/hr on Spheron. The H100 is roughly three years past its cloud launch, and prices reflect that maturity. Software frameworks are battle-tested on Hopper.
H200: Available at major providers including Spheron from $1.43/hr spot. The memory upgrade over H100 justifies the price premium for workloads that were multi-GPU on H100 but fit on a single H200.
B200: Available now on Spheron (see pricing for current rates), RunPod, Nebius, Lambda, and others. Supply is still ramping, but spot availability has improved significantly since early 2026. No on-demand pricing is currently listed; spot is the primary access path.
B300: Available on Spheron, CoreWeave, and select providers. On Spheron, B300 starts at $3.63/hr. The 288 GB capacity makes it the practical choice for large-model workloads where the full memory is needed.
R100: H2 2026 delivery to the first confirmed cloud cohort: AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale (per the NVIDIA CES 2026 announcement), with Crusoe and Together AI confirmed as later additions to the cohort. Broader availability including Spheron is expected in 2027. Projected launch pricing runs $15-25/hr at hyperscalers and $8-12/hr spot at specialist GPU clouds.
One useful data point on price compression: H100 launched at $8-10/hr on-demand in 2023 and is now around $2/hr on-demand. Blackwell pricing will follow a similar curve as supply builds through 2026 and into 2027.
Pricing fluctuates based on GPU availability. The prices above were captured on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
If you're on Hopper and the numbers above show a cost-per-token improvement, B200 and B300 are available now on Spheron with per-minute billing and no commitment.
