HBM4 ships on the NVIDIA Rubin R100 in H2 2026, creating a real decision point for teams planning inference infrastructure. Most coverage of this topic is vendor spec sheets. This guide is deployment math: bandwidth ceilings, TPS limits per GPU, and a cost-per-bandwidth comparison across HBM generations so you can pick the right tier without paying for headroom you won't use for 18 months. Check current GPU pricing before making any capacity commitment since spot rates shift weekly.
Quick Answer: Which HBM Generation for Which Workload
| HBM gen | GPU | Bandwidth | VRAM | Best for | Spheron availability |
|---|---|---|---|---|---|
| HBM3 | H100 SXM5 | 3.35 TB/s | 80 GB | 7B-13B inference, cost-sensitive | Available |
| HBM3 | GH200 | 4.0 TB/s | 96 GB | 7B-34B inference, slightly more headroom | Available |
| HBM3e | H200 SXM5 | 4.8 TB/s | 141 GB | 70B single-GPU inference, best spot BW/dollar | Available |
| HBM3e | B200 | 8.0 TB/s | 192 GB | 70B-200B production inference, FP4 workloads | Available |
| HBM3e | B300 | 10.0 TB/s | 288 GB | 200B+ models, maximum single-GPU VRAM | Check pricing |
| HBM4 | Rubin R100 | up to 22 TB/s | 288 GB | 400B+ models, next-gen throughput | H2 2026, not yet on Spheron |
| HBM4e | Rubin R200/next | TBD | TBD | Future generation, unannounced | Not announced |
Why Decode Is Memory-Bandwidth-Bound (Not Compute-Bound)
The roofline model describes two performance ceilings for any GPU workload:
Arithmetic Intensity = FLOPs executed / Bytes moved from memory
If your workload's arithmetic intensity falls below the GPU's ridge point (peak FLOPS / peak bandwidth), you are memory-bound. If it falls above, you are compute-bound.
For autoregressive decode at batch size 1, the arithmetic intensity is approximately 1 FLOP per byte. The GPU reads all model weights from VRAM once per token step, performs a small number of operations on each value, then moves on. With a 70B FP16 model, that means reading approximately 140 GB of weights per token. The time this takes is fixed by bandwidth, not FLOPS:
| GPU | HBM gen | Bandwidth | 70B FP16 weight read time | TPS ceiling (BW-bound, batch 1) |
|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | ~41.8 ms | ~24 TPS |
| H200 SXM5 | HBM3e | 4.8 TB/s | ~29.2 ms | ~34 TPS |
| B200 | HBM3e | 8.0 TB/s | ~17.5 ms | ~57 TPS |
| B300 | HBM3e | 10.0 TB/s | ~14.0 ms | ~71 TPS |
| R100 | HBM4 | 22 TB/s | ~6.4 ms | ~157 TPS |
These weight-read times are hard lower bounds on per-token latency, and the TPS figures are hard upper bounds on throughput. No amount of additional FLOPS can push TPS above the bandwidth ceiling at batch size 1: double the H100's FLOPS budget and the ~42 ms lower bound stays.
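The ridge-point comparison can be checked in a few lines. This sketch uses the H100 SXM5 figures from the tables above (FP16 Tensor FLOPS with sparsity, HBM3 bandwidth):

```python
# Roofline check: decode at batch 1 sits far below the GPU's ridge point,
# so bandwidth, not FLOPS, sets the per-token time.
peak_flops = 1979e12   # FP16 Tensor FLOPS, H100 SXM5 (with sparsity)
peak_bw = 3.35e12      # bytes/s, HBM3

ridge_point = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound
decode_intensity = 1.0               # ~1 FLOP per weight byte at batch 1

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")        # ~591
print("memory-bound:", decode_intensity < ridge_point)     # True

# Per-token lower bound for a 70B FP16 model (140 GB of weights):
weights_bytes = 70e9 * 2
print(f"weight read time: {weights_bytes / peak_bw * 1e3:.1f} ms")  # ~41.8 ms
```

Decode at batch 1 sits roughly 600x below the H100's ridge point, which is why extra FLOPS are wasted on this phase.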
The prefill phase is the opposite: prefill is compute-bound because all input tokens are processed as a batch, creating high arithmetic intensity. Adding FLOPS improves prefill speed. Adding bandwidth improves decode throughput. They are different problems.
For the full KV cache analysis and how context length compounds the VRAM pressure problem, see AI's Memory Wall Problem.
HBM Generations: Full Specification Comparison
Each HBM generation is defined by stack height (more layers = more DRAM capacity), per-pin bandwidth (die speed), and I/O width (how many pins per stack). HBM4 breaks the pattern by doubling the I/O width:
| HBM gen | Stack height | Per-pin BW | Die BW (per stack) | I/O width | Key GPUs |
|---|---|---|---|---|---|
| HBM2e | 8 Hi | ~3.6 Gbps/pin | ~460 GB/s | 1,024-bit | A100 |
| HBM3 | 12 Hi | ~6.4 Gbps/pin | ~819 GB/s | 1,024-bit | H100, GH200 |
| HBM3e | 12-16 Hi | ~9.6 Gbps/pin | ~1.2 TB/s | 1,024-bit | H200, B200, B300 |
| HBM4 | 16 Hi | ~6.4 Gbps/pin | ~1.5+ TB/s | 2,048-bit | Rubin R100 (H2 2026) |
| HBM4e | TBD | TBD | TBD | TBD | Not yet announced |
The key change from HBM3e to HBM4 is the I/O width doubling from 1,024 to 2,048 bits per stack. Per-stack bandwidth increases from ~1.2 TB/s to ~1.5+ TB/s entirely because of the doubled I/O width. HBM4's per-pin speed (6.4 Gbps/pin) is the same as HBM3, not faster than HBM3e. The bigger driver of the 22 TB/s total on the R100 is more stacks combined with that doubled bus width. This is a structural, not incremental, change, which is why HBM4 represents a generation boundary rather than a speed bump.
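The per-stack numbers in the table fall directly out of width times pin rate. A quick sanity check (the HBM4 figures are pre-launch estimates, per the table above):

```python
# Per-stack bandwidth = I/O width (pins) x per-pin rate (Gbps) / 8 bits per byte.
def stack_bw_gbs(io_width_bits: int, gbps_per_pin: float) -> float:
    """Per-stack bandwidth in GB/s."""
    return io_width_bits * gbps_per_pin / 8

print(stack_bw_gbs(1024, 6.4))  # HBM3:  819.2 GB/s
print(stack_bw_gbs(1024, 9.6))  # HBM3e: 1228.8 GB/s (~1.2 TB/s)
print(stack_bw_gbs(2048, 6.4))  # HBM4:  1638.4 GB/s -- same pin speed as HBM3,
                                # the gain comes entirely from the doubled width
```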
On HBM4e: JEDEC published the HBM4e standard as the next iteration after HBM4. As of May 2026, no GPU vendor has announced a product shipping HBM4e. It does not factor into any real deployment decision today, and bandwidth specs are not confirmed by any manufacturer.
One important nuance: HBM3e bandwidth varies significantly across GPU models because B200 and B300 use different configurations. B200 achieves 8 TB/s; B300 achieves 10 TB/s. Do not treat "HBM3e = 8 TB/s" as a fixed fact. Always tie bandwidth numbers to the specific GPU model.
Per-GPU Bandwidth Comparison: H100 to Rubin R100
| GPU | HBM gen | Memory BW | VRAM | FP16 Tensor TFLOPS (with sparsity) | BW-bound TPS ceiling† (70B FP16, batch 1) |
|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | 80 GB | 1,979 | ~24 |
| GH200 | HBM3 | 4.0 TB/s | 96 GB | 1,979 | ~29 |
| H200 SXM5 | HBM3e | 4.8 TB/s | 141 GB | 1,979 | ~34 |
| AMD MI300X | HBM3 | 5.3 TB/s | 192 GB | 1,307 | ~38 |
| B200 | HBM3e | 8.0 TB/s | 192 GB | 2,250 | ~57 |
| B300 | HBM3e | 10.0 TB/s | 288 GB | 2,250 | ~71 |
| AMD MI355X | HBM3e | ~8.0 TB/s | 288 GB | ~2,600 | ~57 |
| Rubin R100 | HBM4 | up to 22 TB/s | 288 GB | ~8,000 (proj.) | ~157 |
†Approximate tokens-per-second ceiling for a 70B FP16 model (140 GB weights) at batch size 1, calculated as GPU_bandwidth / model_weight_bytes. This is the hard upper bound for single-request decode throughput and does not account for KV cache reads, attention overhead, or framework buffers.
Note: AMD MI355X and Rubin R100 are included for completeness but are not available on Spheron as of this writing. See GPU pricing and GPU rental for Spheron-available options.
Bandwidth-Bound TPS Ceiling: Worked Examples
The formula is straightforward:
Max TPS (memory-bound) = GPU_bandwidth_bytes_per_sec / model_weight_bytes_per_token
model_weight_bytes = parameter_count × bytes_per_param
At batch size 1, the GPU reads all weights once per token step. The TPS ceiling is the bandwidth divided by the weight footprint.
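The formula above is a one-liner; this sketch reproduces the 70B FP16 column of the tables in this section:

```python
def tps_ceiling(bandwidth_tbs: float, params_b: float, bytes_per_param: float) -> float:
    """Bandwidth-bound TPS ceiling at batch 1: bandwidth / weight footprint."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

# 70B at FP16 (2 bytes/param -> 140 GB of weights):
for gpu, bw in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0),
                ("B300", 10.0), ("R100", 22.0)]:
    print(f"{gpu}: ~{tps_ceiling(bw, 70, 2.0):.0f} TPS")
# H100 ~24, H200 ~34, B200 ~57, B300 ~71, R100 ~157
```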
Llama 3.1 70B at FP16 (140 GB weights)
| GPU | BW | Theoretical max TPS | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~24 TPS | Does NOT fit; 80 GB < 140 GB, requires 2-way tensor parallelism or FP8 quantization |
| GH200 | 4.0 TB/s | ~29 TPS | Does NOT fit; 96 GB < 140 GB, requires 2-way tensor parallelism or FP8 quantization |
| H200 SXM5 | 4.8 TB/s | ~34 TPS | 141 GB fits model weights at FP16, but ~1 GB remains for KV cache; FP8 strongly recommended in practice |
| AMD MI300X | 5.3 TB/s | ~38 TPS | Not on Spheron |
| B200 | 8.0 TB/s | ~57 TPS | 2.4x H100 throughput at FP16 |
| B300 | 10.0 TB/s | ~71 TPS | 3x H100 throughput, 288 GB |
| AMD MI355X | ~8.0 TB/s | ~57 TPS | Not on Spheron |
| Rubin R100 | 22 TB/s | ~157 TPS | 6.5x H100 bandwidth, HBM4 |
H100 at 80 GB and GH200 at 96 GB cannot hold the 140 GB model at FP16. Both require 2-way tensor parallelism or FP8 quantization, which halves model weights to ~70 GB and fits on a single GPU with room for the KV cache.

H200 at 141 GB fits the model weights at FP16, but 140 GB of those 141 GB are occupied by the weights themselves, leaving only ~1 GB for KV cache. At batch size 1 with a 4K context, Llama 3.1 70B's KV cache requires approximately 1.25 GB on its own. Any non-trivial batch size or context length will overflow the available headroom and cause OOM errors.

In practice, FP8 quantization is strongly recommended on H200: it halves model weights to ~70 GB, leaving ~71 GB for KV cache and enabling meaningful batch sizes and context lengths. For techniques to reduce KV cache memory pressure and maximize single-GPU utilization, see KV Cache Optimization Guide.
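The H200 headroom arithmetic can be sketched in a few lines. The KV-cache constants below (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache values) are Llama 3.1 70B's published architecture, used here as illustrative assumptions:

```python
def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache size in GB; the leading 2x covers keys and values."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_tokens * per_token_bytes / 1e9

vram_gb = 141
weights_fp16_gb = 140
print(f"FP16 headroom: {vram_gb - weights_fp16_gb} GB")
print(f"4K-context KV cache: {kv_cache_gb(4096):.2f} GB")  # ~1.3 GB > headroom

weights_fp8_gb = 70
print(f"FP8 headroom: {vram_gb - weights_fp8_gb} GB")      # ~71 GB for KV cache
```

The ~1.3 GB this yields for a single 4K-context request already exceeds the ~1 GB of FP16 headroom, which is the OOM scenario described above.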
DeepSeek V3 at FP8 (671B MoE, ~37B active parameters per step)
For MoE models, only the active expert weights are read per decode step. DeepSeek V3 activates approximately 37B parameters per token step at FP8 precision, meaning ~37 GB of weights per step.
Note: the full model at FP8 is ~671 GB, requiring at least 9xH100 (9 × 80 GB = 720 GB), 5xH200 (5 × 141 GB = 705 GB), or 4xB200 (4 × 192 GB = 768 GB) for single-instance serving. The TPS ceiling below is calculated per individual GPU (per-GPU bandwidth / active weight bytes). In a multi-GPU setup, aggregate system TPS scales roughly with GPU count, minus interconnect overhead.
| GPU | BW | Per-GPU BW-bound TPS ceiling (active weights only) | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~91 TPS | 9-GPU minimum (9 × 80 GB = 720 GB ≥ 671 GB) |
| H200 SXM5 | 4.8 TB/s | ~130 TPS | 5-GPU minimum (5 × 141 GB = 705 GB ≥ 671 GB) |
| B200 | 8.0 TB/s | ~216 TPS | 4xB200 fits the full model (4 × 192 GB) |
| B300 | 10.0 TB/s | ~270 TPS | 3xB300 fits the full model (3 × 288 GB) |
| R100 | 22 TB/s | ~595 TPS | 3xR100 fits the full model (3 × 288 GB) |
This is an approximation for the bandwidth-ceiling calculation only. The ~37B active parameter count is DeepSeek V3's nominal per-step activation, but actual performance depends on routing patterns, sparsity, and all-to-all communication latency in multi-GPU setups.
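Under the same approximation, both the per-GPU ceilings and the capacity minimums in the table reduce to simple arithmetic:

```python
import math

# MoE ceiling: only the ~37 GB of active expert weights (37B params at FP8)
# are read per decode step.
active_gb = 37.0
for gpu, bw_tbs in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0),
                    ("B300", 10.0), ("R100", 22.0)]:
    print(f"{gpu}: ~{bw_tbs * 1e12 / (active_gb * 1e9):.0f} TPS per GPU")
# ~91, ~130, ~216, ~270, ~595

# Capacity check against the full ~671 GB FP8 model:
for gpu, vram_gb in [("H100", 80), ("H200", 141), ("B200", 192), ("B300", 288)]:
    print(f"{gpu}: minimum {math.ceil(671 / vram_gb)} GPUs")
# 9, 5, 4, 3
```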
Large dense model proxy: 405B at FP8 (~405 GB weights)
At FP8, a 405B model weighs approximately 405 GB. No single GPU currently has sufficient VRAM to hold it; even the B300 and R100 at 288 GB each fall short. The minimum viable configurations are 2xB300 or 2xR100 (2 × 288 GB = 576 GB ≥ 405 GB). With 2-way tensor parallelism, the effective per-step bandwidth is the sum of both GPUs' bandwidth:
| Configuration | Total BW | Theoretical max TPS | Notes |
|---|---|---|---|
| 2x B300 | 20.0 TB/s | ~49 TPS | 2 × 10.0 TB/s, 576 GB total VRAM |
| 2x R100 | 44 TB/s | ~109 TPS | 2 × 22 TB/s, 576 GB total VRAM |
Actual throughput is below these ceilings due to KV cache reads, attention, NVLink communication overhead, and framework overhead. For the full profiling methodology, see AI's Memory Wall Problem.
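The 2-GPU ceilings follow from summing per-GPU bandwidth: each GPU holds half the weights and reads its half every step, so before NVLink and framework overhead the effective bandwidth is the per-GPU sum.

```python
def tp_ceiling_tps(per_gpu_bw_tbs: float, n_gpus: int, weight_gb: float) -> float:
    """Aggregate bandwidth-bound ceiling under n-way tensor parallelism."""
    total_bw = per_gpu_bw_tbs * n_gpus * 1e12
    return total_bw / (weight_gb * 1e9)

print(f"2x B300: ~{tp_ceiling_tps(10.0, 2, 405):.0f} TPS")  # ~49
print(f"2x R100: ~{tp_ceiling_tps(22.0, 2, 405):.0f} TPS")  # ~109
```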
When FLOPs Matter: Prefill vs Decode
Bandwidth is the only lever for decode at small batch sizes. But as batch size grows, the workload shifts:
Prefill is compute-bound. During prefill, the model processes all input tokens at once, giving high arithmetic intensity. Adding FLOPS directly reduces time-to-first-token. The B200's 2,250 FP16 TFLOPS (vs 1,979 on H100) and 4,500 FP8 TFLOPS translate directly into faster prefill throughput.
Decode at batch 1-4 is bandwidth-bound. Only bandwidth matters here. H200's 43% bandwidth advantage over H100 translates to 43% more tokens per second on memory-bound decode.
Decode at batch 64+ starts to shift. As more requests are batched together, the same weight bytes are shared across more tokens, increasing arithmetic intensity. Once batch size pushes arithmetic intensity past the GPU's ridge point, FLOPS and bandwidth both matter.
For a real-time chatbot serving batch 1-8, bandwidth is the only hardware decision that matters. For a high-throughput API endpoint batching 64+ requests, you need to think about both. For teams running separate prefill and decode fleets, see Prefill-Decode Disaggregation on GPU Cloud for the infrastructure patterns.
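A first-order estimate of the crossover batch size: at batch B, each weight byte read supports roughly B FLOPs of decode work, so decode goes compute-bound around the ridge point. The dense FP16 figures below are assumed to be half the with-sparsity numbers from the table above, and the real crossover lands lower once KV-cache reads (which grow with batch size) are counted:

```python
def crossover_batch(peak_flops: float, bw_bytes_s: float) -> float:
    """Batch size at which decode arithmetic intensity reaches the ridge point.
    First-order sketch: ignores KV-cache reads and attention overhead."""
    return peak_flops / bw_bytes_s

print(f"H100 (dense FP16 ~989 TFLOPS): ~{crossover_batch(989e12, 3.35e12):.0f}")
print(f"B200 (dense FP16 ~1125 TFLOPS): ~{crossover_batch(1125e12, 8.0e12):.0f}")
```

In practice KV-cache traffic drags the effective crossover well below these numbers, which is why the shift is already visible at batch 64+.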
Quantization as a Bandwidth Multiplier
Quantization reduces bytes per parameter, which directly scales the TPS ceiling. On a fixed GPU, lower precision is a free bandwidth upgrade:
| Precision | Bytes/param | Effective BW multiplier vs FP16 | 70B Llama ceiling TPS on H200 (4.8 TB/s) |
|---|---|---|---|
| FP16 | 2.0 | 1.0x | ~34 TPS |
| BF16 | 2.0 | 1.0x | ~34 TPS |
| FP8 | 1.0 | 2.0x | ~69 TPS |
| INT8 | 1.0 | 2.0x | ~69 TPS |
| FP4/INT4 | 0.5 | 4.0x (Blackwell native FP4 only) | ~137 TPS |
A few clarifications:
FP4 tensor core hardware support is only available on Blackwell (B200/B300). On H100/H200, FP4 has no hardware backing and falls back to FP8 or slower emulation. INT4 via AWQ or GPTQ works on H100/H200 but uses software dequantization, so the effective bandwidth benefit is smaller than the theoretical 4x.
AMD's MXFP4 format is available on MI355X and uses a different numerical standard than NVIDIA's NV-FP4. The byte width is the same (0.5 bytes/param), but model accuracy and kernel availability differ by framework.
For framework setup and accuracy tradeoffs on FP4, see FP4 Quantization on Blackwell GPU Cloud. For INT4/INT8 workflows with AWQ, see AWQ Quantization Guide for LLM Deployment.
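The multiplier table reduces to one division: the ceiling scales inversely with bytes per parameter. Reproducing the H200 column above:

```python
# Quantization as a bandwidth multiplier: 70B Llama on H200 (4.8 TB/s).
bw_bytes_s = 4.8e12
params = 70e9
for precision, bytes_pp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    tps = bw_bytes_s / (params * bytes_pp)
    mult = 2.0 / bytes_pp
    print(f"{precision}: {mult:.0f}x effective bandwidth, ~{tps:.0f} TPS ceiling")
# FP16: ~34, FP8: ~69, FP4: ~137 (FP4 hardware path on Blackwell only)
```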
HBM Pricing and Bandwidth Per Dollar
Based on live Spheron pricing as of 15 May 2026:
| GPU | HBM gen | BW (TB/s) | On-demand $/hr | Spot $/hr | BW/on-demand dollar (TB/s per $/hr) | BW/spot dollar |
|---|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 | $4.00 | $1.69 | 0.84 | 1.98 |
| H200 SXM5 | HBM3e | 4.8 | $4.72 | $1.89 | 1.02 | 2.54 |
| B200 | HBM3e | 8.0 | $7.32 | $3.78 | 1.09 | 2.12 |
| B300 | HBM3e | 10.0 | Check pricing | Check pricing | Check pricing | Check pricing |
| Rubin R100 | HBM4 | 22 | est. $8-12/hr (not on Spheron) | est. $8-12/hr (not on Spheron) | est. 1.8-2.75 | est. 1.8-2.75 |
Pricing fluctuates based on GPU availability. The prices above are based on 15 May 2026 and may have changed. Check current GPU pricing → for live rates.
On spot pricing, H200 delivers the best bandwidth-per-dollar at 2.54 TB/s per $/hr, ahead of B200 (2.12) and H100 (1.98).
On on-demand pricing, B200 leads at 1.09 TB/s per $/hr, with H200 close at 1.02 and H100 trailing at 0.84. The gap between H200 and B200 on on-demand is small, but B200 also brings native FP4 Tensor Core support and 2.4x the raw bandwidth, which matters for high-concurrency workloads.
Rubin R100 at estimated $8-12/hr would deliver 22 TB/s, giving 1.8-2.75 TB/s per $/hr depending on final pricing. At the high end of pricing estimates, H200 still matches or beats R100 on bandwidth-per-dollar. HBM4's advantage over HBM3e on this metric is not guaranteed until cloud spot pricing normalizes, which historically takes 6-12 months after a new generation launches.
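The bandwidth-per-dollar columns above are straight ratios of the pricing table, easy to recompute when spot rates move:

```python
# Bandwidth-per-dollar from the pricing table above (Spheron rates, 15 May 2026).
gpus = {
    "H100": (3.35, 4.00, 1.69),  # (BW TB/s, on-demand $/hr, spot $/hr)
    "H200": (4.8, 4.72, 1.89),
    "B200": (8.0, 7.32, 3.78),
}
for name, (bw, od, spot) in gpus.items():
    print(f"{name}: {bw / od:.2f} TB/s per on-demand $, {bw / spot:.2f} per spot $")

# R100 at the estimated $8-12/hr range:
print(f"R100 est.: {22 / 12:.2f}-{22 / 8:.2f} TB/s per $")
```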
Decision Matrix: Which HBM Generation to Rent on Spheron
| Workload | Recommended GPU | HBM gen | Reason |
|---|---|---|---|
| Llama 3.1 8B chatbot, batch 1-8 | H100 SXM5 | HBM3 | Fits on 80 GB, cheapest viable option (spot from $1.69/hr, on-demand from $4.00/hr) |
| Llama 3.1 70B chatbot, batch 1-8 | H200 SXM5 rental | HBM3e | 4.8 TB/s, fits 70B weights in FP16 but FP8 required for any real batch/context; lowest cost at this scale |
| Llama 3.1 70B high-concurrency, batch 32+ | B200 on Spheron | HBM3e | 8 TB/s, FP4 support, 2.4x H100 bandwidth for higher batch TPS |
| DeepSeek V3 (671B MoE) | B200 or B300 | HBM3e | 4xB200 (768 GB) or 3xB300 (864 GB) fits the full model |
| 405B dense model, FP8 | B300 SXM5 | HBM3e | 288 GB VRAM for 405B at FP8 in 2-GPU configuration |
| Research, future-proofing | Rubin R100 (H2 2026) | HBM4 | 22 TB/s, HBM4, not yet on Spheron |
Migration Path: HBM3 to HBM4
HBM3 (H100) to HBM3e (H200): Drop-in swap. Same Hopper architecture, same CUDA code, same TensorRT-LLM and vLLM configs. Only change: your batch size can increase ~43% for the same latency target because bandwidth increased 43%. No code changes, no new dependencies, no new precision formats to configure.
HBM3e (H200) to HBM3e (B200): Architecture change from Hopper to Blackwell. For FP8 and FP16 workloads, existing vLLM configs run without modification. To unlock native FP4 Tensor Cores, you need TensorRT-LLM 0.17 or later and an FP4-quantized checkpoint. The B200's 8 TB/s bandwidth means the same 70B model now has ~57 TPS ceiling vs ~34 TPS on H200 at FP16, and ~137+ TPS at FP4. For the H100-to-B200 migration math and quantization setup, see FP4 Quantization on Blackwell GPU Cloud.
HBM3e (B200) to HBM4 (R100): H2 2026 cloud availability from the first confirmed cohort (AWS, Google Cloud, Azure, CoreWeave). Architecture change from Blackwell to Rubin. New NVLink 6 at 3.6 TB/s per GPU for multi-GPU configurations. Software stack compatibility with TensorRT-LLM and vLLM is expected to follow within 2-4 months of GA availability, based on the pattern from Blackwell. Broader cloud availability including Spheron is expected in 2027.
Spheron aggregates H100, H200, B200, and B300 instances from data center partners globally - pick the HBM generation that matches your workload's bandwidth requirements without paying for R100-tier headroom you don't need today.
Frequently Asked Questions
What is the difference between HBM3, HBM3e, HBM4, and HBM4e?
HBM3 (H100, GH200) delivers 3.35-4.0 TB/s per GPU using 12-high stacks. HBM3e is an optimized HBM3 variant with faster per-pin speeds and wider stacks: H200 reaches 4.8 TB/s, B200 reaches 8 TB/s, and B300 reaches 10 TB/s. HBM4 doubles the I/O width from 1,024 to 2,048 bits per stack and ships on the NVIDIA Rubin R100 at up to 22 TB/s in H2 2026. HBM4e is the JEDEC-standardized next iteration after HBM4, with no confirmed GPU product as of May 2026.
Why is LLM decode memory-bandwidth-bound?
Decode is memory-bandwidth-bound at small batch sizes (1-8). Each token generation requires reading all model weights from VRAM once. For a 70B FP16 model, that's 140 GB per step. On an H100 at 3.35 TB/s, the lower bound is ~42 ms per token regardless of how many FLOPS the GPU has. Prefill is the opposite: compute-bound, because all input tokens are processed simultaneously with high arithmetic intensity.
How do I calculate the bandwidth-bound TPS ceiling for a model?
The formula is: Max TPS = GPU_bandwidth_bytes_per_sec / model_weight_bytes. For FP16, model_weight_bytes = parameters × 2. For FP8, parameters × 1. Example: 70B FP16 on H200 (4.8 TB/s) = 4.8e12 / 140e9 = ~34 tokens/sec ceiling. Actual throughput is lower due to KV cache reads, attention overhead, and framework overhead, but the bandwidth ceiling sets the hard upper bound.
Does quantization increase effective memory bandwidth?
Yes, quantization reduces bytes-per-parameter, which directly increases the effective bandwidth multiplier. FP8 halves the weights to 1 byte/param, doubling effective bandwidth. FP4 reduces to 0.5 bytes/param, quadrupling effective bandwidth. On H200 at 4.8 TB/s with 70B Llama, FP16 gives ~34 TPS ceiling, FP8 gives ~69 TPS ceiling, and FP4 gives ~137 TPS ceiling. Native FP4 Tensor Core support requires Blackwell hardware (B200/B300).
Which GPU should I rent for 70B inference?
For 70B FP16 inference at batch 1-8, H200 SXM5 (HBM3e, 4.8 TB/s) fits the full model on a single GPU and delivers ~34 TPS ceiling at FP16. On spot pricing, H200 delivers the best bandwidth-per-dollar on Spheron (2.54 TB/s per $/hr vs B200's 2.12). On on-demand pricing, B200 edges ahead at 1.09 TB/s per $/hr versus H200's 1.02, while also adding native FP4 Tensor Core support and 2.4x the raw bandwidth for higher concurrency. HBM4 (Rubin R100) is not yet available on Spheron and is not needed for 70B workloads at current traffic levels.
