Three GPU generations are active in production clouds right now. H100 and H200 (Hopper) are widely available and well understood. B200 and B300 (Blackwell) are ramping, with prices falling as supply builds. Rubin (R100, also referred to as R200 in some sources) ships in H2 2026. An AI team picking infrastructure today needs to know where each generation sits, what changed between them, and which one to actually rent. Check current GPU pricing for live rates before making any capacity decision.
Quick Answer: Which Generation for Which Workload
| Generation | Best for | Memory | Key advantage | Availability | Spheron pricing |
|---|---|---|---|---|---|
| Hopper (H100/H200) | 7B-70B fine-tuning, low-cost inference | 80 GB / 141 GB HBM3/HBM3e | Mature software stack, lowest on-demand cost | Widely available | H100 from $2.01/hr on-demand; H200 from $1.43/hr spot |
| Blackwell (B200/B300) | Production inference, 70B-200B models | 192 GB / 288 GB HBM3e | Native FP4, 4-5.5x H100 throughput | Available now | B200 spot from $2.07/hr; B300 spot from $3.63/hr; see pricing |
| Rubin (R100) | 400B+ models, next-gen inference | 288 GB HBM4 | 22 TB/s bandwidth, 50 PFLOPS FP4 | H2 2026, est. $8-12/hr spot | Not yet available |
Pricing fluctuates based on GPU availability. The prices above were captured on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Architecture Timeline: Hopper to Blackwell to Rubin
Hopper launched in 2022 with the H100, built on TSMC 4N. The defining contribution was the Transformer Engine with FP8 mixed-precision and fourth-generation Tensor Cores. H100 set the performance baseline that every subsequent generation references.
Blackwell arrived in 2024 with the B200, manufactured on TSMC 4NP. NVIDIA raised the transistor count from 80 billion to 208 billion and introduced fifth-generation Tensor Cores with native FP4 compute. NVLink bandwidth doubled from 900 GB/s (NVLink 4) to 1.8 TB/s (NVLink 5). The B300 (Blackwell Ultra) followed with 288 GB VRAM and higher FP4 throughput, while keeping the same FP8/FP16 compute and memory bandwidth as B200. For detailed Blackwell specs, see our B200 complete guide.
Rubin was announced at Computex 2024 and confirmed at GTC 2026 with H2 2026 delivery to the first cloud cohort. The architecture switches memory from HBM3e to HBM4, delivering up to 22 TB/s per GPU. FP4 compute reaches 50 PFLOPS, versus 9 PFLOPS on B200. NVLink 6 doubles the interconnect bandwidth again to 3.6 TB/s. Transistor count grows to 336 billion. For a full Rubin spec breakdown, see our NVIDIA Rubin R100 guide.
Full Specs Comparison: H100, H200, B200, B300, R100
| Spec | H100 SXM | H200 SXM | B200 SXM | B300 SXM | R100 |
|---|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Rubin |
| VRAM | 80 GB | 141 GB | 192 GB | 288 GB | 288 GB |
| Memory type | HBM3 | HBM3e | HBM3e | HBM3e | HBM4 |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | up to 22 TB/s |
| FP4 Dense | N/A | N/A | 9 PFLOPS | 15 PFLOPS | 50 PFLOPS |
| FP8 Dense | ~1,979 TFLOPS | ~1,979 TFLOPS | 4,500 TFLOPS | ~4,500 TFLOPS | ~17,500 TFLOPS |
| FP16 Dense | 1,979 TFLOPS | 1,979 TFLOPS | 2,250 TFLOPS | 2,250 TFLOPS | ~8,000 TFLOPS |
| Transistors | 80 B | 80 B | 208 B | 208 B | 336 B |
| TDP (W) | 700 | 700 | ~1,200 | 1,400 | ~2,300 |
| NVLink generation | 4 | 4 | 5 | 5 | 6 |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s |
| Networking | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-7 (400G) | ConnectX-8 (800G) | ConnectX-9 (2x 800G, up to 1.6T) |
H100/H200 FP8 Dense (~1,979 TFLOPS) is the without-sparsity figure. NVIDIA's datasheet also lists 3,958 TFLOPS for H100/H200 FP8 with sparsity enabled; the dense column above uses the without-sparsity values for consistency with B200 (4,500 TFLOPS dense). R100 FP8 (~17,500 TFLOPS) is derived from NVIDIA's GTC 2026 NVL72 system-level spec (1,260 PFLOPS FP8/FP6 across 72 GPUs); this is an approximate per-GPU figure pending final NVIDIA-published per-GPU specs. R100 FP16 (~8,000 TFLOPS) is projected at approximately half the FP8 figure and is not NVIDIA-confirmed. B200 TDP reflects the SXM5 liquid-cooled variant; PCIe B200 has lower TDP and bandwidth. R100 FP4 training is 35 PFLOPS vs 50 PFLOPS inference. All B200/B300 specs refer to SXM form factor. B300 FP4 Dense (15 PFLOPS) is sourced from NVIDIA's developer blog; the DGX B300 page math yields approximately 13-13.5 PFLOPS per GPU dense FP4, a minor variance from the per-GPU published spec.
Memory Hierarchy: HBM3 to HBM3e to HBM4
Hopper: HBM3 (H100) and HBM3e (H200)
The H100 ships with HBM3 delivering 3.35 TB/s. The H200 upgrades to HBM3e, pushing bandwidth to 4.8 TB/s on the same GH100 die. No additional CUDA cores, no new Tensor Core generation: the H200 is an H100 with better memory. That single change makes a meaningful difference for memory-bandwidth-bound workloads like 70B model serving. The H200 fits Llama 70B in FP16 on one GPU; the H100 requires 2-way tensor parallelism.
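A quick weights-only sizing check makes that boundary visible (a back-of-the-envelope sketch; real deployments also need headroom for KV cache, activations, and runtime overhead):

```python
# Rough weight-memory sizing: does a model's weights fit in a single GPU's VRAM?
# Weights only; treat the result as a lower bound on required memory.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params * N bytes/param = N GB

llama_70b_fp16 = weight_gb(70, 2.0)  # ~140 GB of FP16/BF16 weights
print(f"Llama 70B @ FP16 weights: ~{llama_70b_fp16:.0f} GB")
print(f"H100 (80 GB):  {'fits' if llama_70b_fp16 <= 80 else 'needs 2-way tensor parallelism'}")
print(f"H200 (141 GB): {'fits' if llama_70b_fp16 <= 141 else 'needs tensor parallelism'}")
```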
Blackwell: HBM3e Maxed Out
B200 pushes bandwidth to 8 TB/s, more than double the H100's 3.35 TB/s, using the same HBM3e memory type. The B300 (Blackwell Ultra) reaches 288 GB of capacity but keeps bandwidth at 8 TB/s: taller memory stacks add capacity, not bandwidth, when the memory interface stays the same. The practical implication: B300 lets you fit larger models (200B in FP8) without multi-GPU sharding, but the bandwidth profile per token is the same as B200. If your bottleneck is memory capacity rather than bandwidth, B300 is the right choice.
Rubin: HBM4 at 22 TB/s
The jump from 8 to 22 TB/s is the single biggest memory bandwidth increase across any NVIDIA generation transition. For long-context inference, KV cache access is the dominant bandwidth consumer at high token counts. A request with 128K context on a 70B model builds a KV cache large enough that, at 8 TB/s, re-reading it every decode step limits how fast the GPU can generate tokens. At 22 TB/s, streaming that same cache takes roughly 36% of the time. Models that currently require 4xH100 or 2xB200 for throughput at long context may fit and serve efficiently on a single R100. For further analysis of R100's workload implications, see our NVIDIA Rubin R100 guide.
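To put rough numbers on that, here is an estimate of the per-decode-step KV cache read time at 128K context. The layer count, GQA head count, and head dimension are assumptions matching a Llama-70B-style architecture; other models will differ:

```python
# Rough per-decode-step KV cache read time at 128K context.
# Model shape assumed to be Llama-70B-like: 80 layers, 8 KV heads (GQA), head dim 128.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2   # FP16 KV cache
context_tokens = 128 * 1024

bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * layers  # K and V per token
kv_cache_gb = context_tokens * bytes_per_token / 1e9
print(f"KV cache at 128K context: ~{kv_cache_gb:.0f} GB")         # ~43 GB

for gpu, tb_per_s in [("B200 (HBM3e)", 8), ("R100 (HBM4)", 22)]:
    ms = kv_cache_gb / (tb_per_s * 1000) * 1000   # GB / (GB/s) -> seconds -> ms
    print(f"{gpu}: ~{ms:.1f} ms just to stream the cache each decode step")
# The 8 -> 22 TB/s jump cuts that read time to roughly 36% of the Blackwell figure.
```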
NVLink Generations: What Doubling Bandwidth Means
| Generation | GPU | Per-GPU bandwidth | Typical use |
|---|---|---|---|
| NVLink 4 | H100, H200 | 900 GB/s | 8-GPU HGX nodes |
| NVLink 5 | B200, B300 | 1.8 TB/s | HGX B200 nodes, NVL36/72 |
| NVLink 6 | R100 | 3.6 TB/s | Vera Rubin NVL72 |
NVLink 4 (Hopper)
At 900 GB/s bidirectional, NVLink 4 enables 8-GPU all-reduce operations at speeds that keep gradient synchronization from dominating training time for models up to roughly 70B parameters. For inference with tensor parallelism, 900 GB/s is enough for 2-4 way TP on models up to 130B at reasonable latency.
NVLink 5 (Blackwell)
Doubling to 1.8 TB/s changes the economics of large-model inference. Gradient sync in training across 8xB200 takes roughly half the time it takes across 8xH100 at equivalent model sizes. For tensor-parallel inference, this is the difference between TP being a 20% overhead versus a 40% overhead per token. The NVL72 configuration (72 B200s connected via NVLink 5 at rack scale) is designed for serving models in the 400B-1T parameter range, where inter-GPU communication dominates.
NVLink 6 (Rubin)
At 3.6 TB/s, NVLink 6 on the Vera Rubin NVL72 targets trillion-parameter models where current NVLink 5 systems still bottleneck on the interconnect. For multi-GPU inference at 400B+ model sizes, the per-generation NVLink doubling reduces the hidden cost of sharding. Each generation roughly halves the communication-to-compute ratio for fixed model sizes, which directly improves tokens per second at scale.
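A simplified ring all-reduce estimate shows what each doubling buys for 8-GPU gradient sync. It ignores latency, overlap with compute, and the bidirectional-traffic details of real NCCL collectives, and treats the headline per-GPU bandwidth as fully usable:

```python
# Simplified ring all-reduce estimate for one gradient sync across an 8-GPU node.
# Shows how per-GPU link bandwidth scales the communication term; not a benchmark.
def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_gb   # classic ring formula
    return traffic_per_gpu / link_gb_per_s

grad_gb = 70 * 2.0   # 70B parameters of BF16 gradients, ~140 GB
for gen, bw in [("NVLink 4 (H100/H200)", 900), ("NVLink 5 (B200/B300)", 1800),
                ("NVLink 6 (R100)", 3600)]:
    print(f"{gen}: ~{allreduce_seconds(grad_gb, 8, bw) * 1000:.0f} ms per gradient sync")
# ~272 ms on NVLink 4, ~136 ms on NVLink 5, ~68 ms on NVLink 6:
# each generation halves the communication term for the same model size.
```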
Precision and Compute: FP8 to FP4
Hopper: FP8 Max
Hopper's Transformer Engine introduced FP8 mixed-precision with hardware acceleration, a significant step beyond the BF16 ceiling of previous generations. H100 and H200 both deliver approximately 1,979 TFLOPS dense FP8 (without sparsity); with sparsity enabled, this doubles to 3,958 TFLOPS. FP16 and BF16 are fully supported. FP4 is not available on Hopper hardware and has no planned software path: the lowest precision the Tensor Cores accelerate is FP8. Any inference optimization targeting FP4 on Hopper will not find hardware backing and falls back to FP8 or higher-precision compute.
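Reaching that FP8 throughput requires code that targets the Transformer Engine. A minimal sketch using NVIDIA's Transformer Engine PyTorch API; the layer dimensions are arbitrary and the recipe parameters are just one common configuration, so check the Transformer Engine docs for the options your version supports:

```python
# Minimal FP8 forward pass on Hopper using NVIDIA Transformer Engine (PyTorch).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# HYBRID keeps E4M3 for forward activations/weights and E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16,
                            amax_compute_algo="max")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the GEMM executes in FP8 on Hopper Tensor Cores
print(y.shape)
```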
Blackwell: Native FP4 Tensor Cores
B200 and B300 introduced fifth-generation Tensor Cores with native FP4 support. This is not a software emulation layer; FP4 operations execute directly on Tensor Core hardware. B200 reaches 9 PFLOPS FP4 dense, roughly 2x the effective throughput of FP8 for quantization-friendly models. B300 extends this to 15 PFLOPS FP4. FP8 throughput is ~4,500 TFLOPS on both B200 and B300 (the B300 gains are in FP4 and memory capacity, not FP8 compute); FP16 dense is 2,250 TFLOPS on both. For practical guidance on which models benefit from FP4 and how to enable it in production, see our FP4 quantization guide.
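A first-order sketch of where that ~2x comes from, using the headline B200 numbers and a 70B model as an example; real gains depend on kernel efficiency, batch size, and how much of the model is actually quantized:

```python
# Why FP4 roughly doubles effective throughput over FP8 on B200, to first order.
# Compute-bound phases scale with peak PFLOPS; bandwidth-bound decode scales with
# bytes of weights read per token. Both roughly halve the work going FP8 -> FP4.
b200 = {"fp8_pflops": 4.5, "fp4_pflops": 9.0, "bandwidth_tb_s": 8.0}
params_b = 70   # 70B-parameter model as an example

for prec, bytes_per_w, pflops in [("FP8", 1.0, b200["fp8_pflops"]),
                                  ("FP4", 0.5, b200["fp4_pflops"])]:
    weight_gb = params_b * bytes_per_w
    decode_ms = weight_gb / (b200["bandwidth_tb_s"] * 1000) * 1000  # weight read per token
    print(f"{prec}: {pflops} PFLOPS peak, ~{weight_gb:.0f} GB weights, "
          f"~{decode_ms:.2f} ms weight read per decode step")
# FP4 doubles peak compute and halves weight traffic, hence the ~2x effective gain.
```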
Rubin: 50 PFLOPS FP4
R100's FP4 throughput is 5.6x the B200 and 3.33x the B300. Combined with HBM4's 22 TB/s bandwidth, this enables single-GPU inference for 400B-parameter models in FP4 that currently require multi-GPU sharding at FP8. A 400B model at FP4 needs roughly 200 GB of VRAM for weights. The R100's 288 GB HBM4 holds that with room for KV cache, and the bandwidth means the attention computation does not become the bottleneck even at 128K context lengths.
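That claim is easy to sanity-check from the memory numbers alone (a weights-only sketch; the leftover headroom has to cover KV cache, activations, and runtime overhead):

```python
# Sanity check: a 400B-parameter model in FP4 on a single 288 GB R100.
params_b, fp4_bytes = 400, 0.5
weights_gb = params_b * fp4_bytes            # ~200 GB of weights
vram_gb = 288
print(f"Weights: ~{weights_gb:.0f} GB, leaving ~{vram_gb - weights_gb:.0f} GB "
      f"for KV cache, activations, and runtime overhead")
# At HBM4's 22 TB/s, re-reading those FP4 weights takes about 9 ms per decode step.
print(f"Weight read per token at 22 TB/s: ~{weights_gb / 22000 * 1000:.0f} ms")
```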
Workload Decision Guide
Fine-tuning 7B-13B models: H100 at $2.01/hr on-demand is the cost leader. B200 spot (check current pricing for live rates) offers roughly 2x the training throughput, so the right choice depends on whether wall-clock time or cost per experiment matters more to your team (a worked cost comparison follows this guide).
Fine-tuning 70B models: H200 (141 GB, fits Llama 70B in FP16 plus QLoRA adapters on one GPU) or B200. Single-GPU fine-tuning avoids tensor parallelism overhead and simplifies checkpointing.
Inference, 7B-30B models, low concurrency: H100 remains viable. At low batch sizes, these workloads are memory-bandwidth-bound, and the gap between H100 and B200 narrows compared to the price gap. H100 from $2.01/hr on-demand on Spheron is hard to beat for development and low-traffic serving.
Inference, 70B models, production serving: B200 or B300, which deliver approximately 4x the H100 throughput per SemiAnalysis InferenceX benchmarks. Native FP4 optimization is available on Blackwell for models that support quantization.
Inference, 200B+ models or long context: B300 (288 GB) handles 200B models in FP8 on one GPU. For 400B+ at FP8, multi-GPU B300 is required until R100 ships. If your roadmap includes 400B+ models and your timeline extends into 2027, factor R100 into your planning.
Planning a new deployment for 2027: Rubin R100 is the target architecture, but Blackwell is the bridge. For production workloads you need running now, B200 or B300 beat waiting.
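To make the cost-versus-time tradeoff from the fine-tuning rows concrete, here is a simple comparison using the Spheron rates quoted in this article and the rough 2x Blackwell training speedup. The 10-hour H100 job length is an arbitrary example, and spot capacity carries preemption risk that the raw numbers don't show:

```python
# Cost per experiment vs wall-clock time: H100 on-demand vs B200 spot.
# Rates are the Spheron figures quoted in this article; the 2x speedup is the
# rough Blackwell-over-Hopper training factor cited above. Job length is illustrative.
jobs = {
    "H100 on-demand": {"rate": 2.01, "hours": 10.0},        # baseline fine-tune
    "B200 spot":      {"rate": 2.07, "hours": 10.0 / 2.0},  # ~2x training throughput
}
for name, j in jobs.items():
    print(f"{name}: {j['hours']:.1f} h wall-clock, ${j['rate'] * j['hours']:.2f} per run")
# H100 on-demand: 10.0 h wall-clock, $20.10 per run
# B200 spot: 5.0 h wall-clock, $10.35 per run
# Spot can be preempted mid-run, which is why on-demand H100 remains a safe baseline.
```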
For detailed inference cost-per-token benchmarks across GPU types, see our best GPU for AI inference guide. For VRAM sizing by model size and precision, see our LLM memory requirements guide.
Availability and Pricing
As of March 2026:
H100: Widely available across all major GPU clouds. On-demand prices have compressed to $2.01/hr on Spheron. The H100 is roughly three years past its cloud launch, and prices reflect that maturity. Software frameworks are battle-tested on Hopper.
H200: Available at major providers including Spheron from $1.43/hr spot. The memory upgrade over H100 justifies the price premium for workloads that were multi-GPU on H100 but fit on a single H200.
B200: Available now on Spheron (see pricing for current rates), RunPod, Nebius, Lambda, and others. Supply is still ramping, but spot availability has improved significantly since early 2026. No on-demand pricing is currently listed; spot is the primary access path.
B300: Available on Spheron, CoreWeave, and select providers. On Spheron, B300 starts at $3.63/hr. The 288 GB capacity makes it the practical choice for large-model workloads where the full memory is needed.
R100: H2 2026 delivery to the first confirmed cloud cohort: AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale (per the NVIDIA CES 2026 announcement), with Crusoe and Together AI confirmed as later additions to the cohort. Broader availability including Spheron is expected in 2027. Projected launch pricing runs $15-25/hr at hyperscalers and $8-12/hr spot at specialist GPU clouds.
One useful data point on price compression: H100 launched at $8-10/hr on-demand in 2023 and is now around $2/hr on-demand. Blackwell pricing will follow a similar curve as supply builds through 2026 and into 2027.
Pricing fluctuates based on GPU availability. The prices above were captured on 27 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
If you're on Hopper and the numbers above show a cost-per-token improvement, B200 and B300 are available now on Spheron with per-minute billing and no commitment.
