HBM4 ships on the NVIDIA Rubin R100 in H2 2026, creating a real decision point for teams planning inference infrastructure. Most coverage of this topic is vendor spec sheets. This guide is deployment math: bandwidth ceilings, TPS limits per GPU, and a cost-per-bandwidth comparison across HBM generations so you can pick the right tier without paying for headroom you won't use for 18 months. Check current GPU pricing before making any capacity commitment since spot rates shift weekly.
Quick Answer: Which HBM Generation for Which Workload
| HBM gen | GPU | Bandwidth | VRAM | Best for | Spheron availability |
|---|---|---|---|---|---|
| HBM3 | H100 SXM5 | 3.35 TB/s | 80 GB | 7B-13B inference, cost-sensitive | Available |
| HBM3 | GH200 | 4.0 TB/s | 96 GB | 7B-34B inference, slightly more headroom | Available |
| HBM3e | H200 SXM5 | 4.8 TB/s | 141 GB | 70B single-GPU inference, best spot BW/dollar | Available |
| HBM3e | B200 | 8.0 TB/s | 192 GB | 70B-200B production inference, FP4 workloads | Available |
| HBM3e | B300 | 10.0 TB/s | 288 GB | 200B+ models, maximum single-GPU VRAM | Check pricing |
| HBM4 | Rubin R100 | up to 22 TB/s | 288 GB | 400B+ models, next-gen throughput | H2 2026, not yet on Spheron |
| HBM4e | Rubin R200/next | TBD | TBD | Future generation, unannounced | Not announced |
Why Decode Is Memory-Bandwidth-Bound (Not Compute-Bound)
The roofline model describes two performance ceilings for any GPU workload:
Arithmetic Intensity = FLOPs executed / Bytes moved from memory
If your workload's arithmetic intensity falls below the GPU's ridge point (peak FLOPS / peak bandwidth), you are memory-bound. If it falls above, you are compute-bound.
For autoregressive decode at batch size 1, the arithmetic intensity is approximately 1 FLOP per byte. The GPU reads all model weights from VRAM once per token step, performs a small number of operations on each value, then moves on. With a 70B FP16 model, that means reading approximately 140 GB of weights per token. The time this takes is fixed by bandwidth, not FLOPS:
| GPU | HBM gen | Bandwidth | 70B FP16 weight read time | TPS ceiling (BW-bound, batch 1) |
|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | ~41.8 ms | ~24 TPS |
| H200 SXM5 | HBM3e | 4.8 TB/s | ~29.2 ms | ~34 TPS |
| B200 | HBM3e | 8.0 TB/s | ~17.5 ms | ~57 TPS |
| B300 | HBM3e | 10.0 TB/s | ~14.0 ms | ~71 TPS |
| R100 | HBM4 | 22 TB/s | ~6.4 ms | ~157 TPS |
These weight-read times are hard lower bounds on per-token latency, and the TPS figures are hard upper bounds on throughput. No amount of additional FLOPS can push TPS above the bandwidth ceiling at batch size 1: double the H100's FLOPS budget and the ~42 ms lower bound stays.
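The ridge-point comparison can be checked in a few lines. This sketch uses the H100 SXM5 figures from the tables above (FP16 Tensor FLOPS with sparsity, HBM3 bandwidth):

```python
# Roofline check: decode at batch 1 sits far below the GPU's ridge point,
# so bandwidth, not FLOPS, sets the per-token time.
peak_flops = 1979e12   # FP16 Tensor FLOPS, H100 SXM5 (with sparsity)
peak_bw = 3.35e12      # bytes/s, HBM3

ridge_point = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound
decode_intensity = 1.0               # ~1 FLOP per weight byte at batch 1

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")        # ~591
print("memory-bound:", decode_intensity < ridge_point)     # True

# Per-token lower bound for a 70B FP16 model (140 GB of weights):
weights_bytes = 70e9 * 2
print(f"weight read time: {weights_bytes / peak_bw * 1e3:.1f} ms")  # ~41.8 ms
```

Decode at batch 1 sits roughly 600x below the H100's ridge point, which is why extra FLOPS are wasted on this phase.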
The prefill phase is the opposite: prefill is compute-bound because all input tokens are processed as a batch, creating high arithmetic intensity. Adding FLOPS improves prefill speed. Adding bandwidth improves decode throughput. They are different problems.
For the full KV cache analysis and how context length compounds the VRAM pressure problem, see AI's Memory Wall Problem.
HBM Generations: Full Specification Comparison
Each HBM generation is defined by stack height (more layers = more DRAM capacity), per-pin bandwidth (die speed), and I/O width (how many pins per stack). HBM4 breaks the pattern by doubling the I/O width:
| HBM gen | Stack height | Per-pin BW | Die BW (per stack) | I/O width | Key GPUs |
|---|---|---|---|---|---|
| HBM2e | 8 Hi | ~3.6 Gbps/pin | ~460 GB/s | 1,024-bit | A100 |
| HBM3 | 12 Hi | ~6.4 Gbps/pin | ~819 GB/s | 1,024-bit | H100, GH200 |
| HBM3e | 12-16 Hi | ~9.6 Gbps/pin | ~1.2 TB/s | 1,024-bit | H200, B200, B300 |
| HBM4 | 16 Hi | ~6.4 Gbps/pin | ~1.5+ TB/s | 2,048-bit | Rubin R100 (H2 2026) |
| HBM4e | TBD | TBD | TBD | TBD | Not yet announced |
The key change from HBM3e to HBM4 is the I/O width doubling from 1,024 to 2,048 bits per stack. Per-stack bandwidth increases from ~1.2 TB/s to ~1.5+ TB/s entirely because of the doubled I/O width. HBM4's per-pin speed (6.4 Gbps/pin) is the same as HBM3, not faster than HBM3e. The bigger driver of the 22 TB/s total on the R100 is more stacks combined with that doubled bus width. This is a structural, not incremental, change, which is why HBM4 represents a generation boundary rather than a speed bump.
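The per-stack numbers in the table fall directly out of width times pin rate. A quick sanity check (the HBM4 figures are pre-launch estimates, per the table above):

```python
# Per-stack bandwidth = I/O width (pins) x per-pin rate (Gbps) / 8 bits per byte.
def stack_bw_gbs(io_width_bits: int, gbps_per_pin: float) -> float:
    """Per-stack bandwidth in GB/s."""
    return io_width_bits * gbps_per_pin / 8

print(stack_bw_gbs(1024, 6.4))  # HBM3:  819.2 GB/s
print(stack_bw_gbs(1024, 9.6))  # HBM3e: 1228.8 GB/s (~1.2 TB/s)
print(stack_bw_gbs(2048, 6.4))  # HBM4:  1638.4 GB/s -- same pin speed as HBM3,
                                # the gain comes entirely from the doubled width
```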
On HBM4e: JEDEC published the HBM4e standard as the next iteration after HBM4. As of May 2026, no GPU vendor has announced a product shipping HBM4e. It does not factor into any real deployment decision today, and bandwidth specs are not confirmed by any manufacturer.
One important nuance: HBM3e bandwidth varies significantly across GPU models because B200 and B300 use different configurations. B200 achieves 8 TB/s; B300 achieves 10 TB/s. Do not treat "HBM3e = 8 TB/s" as a fixed fact. Always tie bandwidth numbers to the specific GPU model.
Per-GPU Bandwidth Comparison: H100 to Rubin R100
| GPU | HBM gen | Memory BW | VRAM | FP16 Tensor TFLOPS (with sparsity) | BW-bound TPS ceiling† (70B FP16, batch 1) |
|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | 80 GB | 1,979 | ~24 |
| GH200 | HBM3 | 4.0 TB/s | 96 GB | 1,979 | ~29 |
| H200 SXM5 | HBM3e | 4.8 TB/s | 141 GB | 1,979 | ~34 |
| AMD MI300X | HBM3 | 5.3 TB/s | 192 GB | 1,307 | ~38 |
| B200 | HBM3e | 8.0 TB/s | 192 GB | 2,250 | ~57 |
| B300 | HBM3e | 10.0 TB/s | 288 GB | 2,250 | ~71 |
| AMD MI355X | HBM3e | ~8.0 TB/s | 288 GB | ~2,600 | ~57 |
| Rubin R100 | HBM4 | up to 22 TB/s | 288 GB | ~8,000 (proj.) | ~157 |
†Approximate tokens-per-second ceiling for a 70B FP16 model (140 GB weights) at batch size 1, calculated as GPU_bandwidth / model_weight_bytes. This is the hard upper bound for single-request decode throughput and does not account for KV cache reads, attention overhead, or framework buffers.
Note: AMD MI355X and Rubin R100 are included for completeness but are not available on Spheron as of this writing. See GPU pricing and GPU rental for Spheron-available options.
Bandwidth-Bound TPS Ceiling: Worked Examples
The formula is straightforward:
Max TPS (memory-bound) = GPU_bandwidth_bytes_per_sec / model_weight_bytes_per_token
model_weight_bytes = parameter_count × bytes_per_param
At batch size 1, the GPU reads all weights once per token step. The TPS ceiling is the bandwidth divided by the weight footprint.
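The formula above is a one-liner; this sketch reproduces the 70B FP16 column of the tables in this section:

```python
def tps_ceiling(bandwidth_tbs: float, params_b: float, bytes_per_param: float) -> float:
    """Bandwidth-bound TPS ceiling at batch 1: bandwidth / weight footprint."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

# 70B at FP16 (2 bytes/param -> 140 GB of weights):
for gpu, bw in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0),
                ("B300", 10.0), ("R100", 22.0)]:
    print(f"{gpu}: ~{tps_ceiling(bw, 70, 2.0):.0f} TPS")
# H100 ~24, H200 ~34, B200 ~57, B300 ~71, R100 ~157
```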
Llama 3.1 70B at FP16 (140 GB weights)
| GPU | BW | Theoretical max TPS | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~24 TPS | Does NOT fit; 80 GB < 140 GB, requires 2-way tensor parallelism or FP8 quantization |
| GH200 | 4.0 TB/s | ~29 TPS | Does NOT fit; 96 GB < 140 GB, requires 2-way tensor parallelism or FP8 quantization |
| H200 SXM5 | 4.8 TB/s | ~34 TPS | 141 GB fits model weights at FP16, but ~1 GB remains for KV cache; FP8 strongly recommended in practice |
| AMD MI300X | 5.3 TB/s | ~38 TPS | Not on Spheron |
| B200 | 8.0 TB/s | ~57 TPS | 2.4x H100 throughput at FP16 |
| B300 | 10.0 TB/s | ~71 TPS | 3x H100 throughput, 288 GB |
| AMD MI355X | ~8.0 TB/s | ~57 TPS | Not on Spheron |
| Rubin R100 | 22 TB/s | ~157 TPS | 6.5x H100 bandwidth, HBM4 |
H100 at 80 GB and GH200 at 96 GB cannot hold the 140 GB model at FP16. Both require 2-way tensor parallelism or FP8 quantization, which halves model weights to ~70 GB and fits on a single GPU with room for the KV cache.

H200 at 141 GB fits the model weights at FP16, but 140 GB of those 141 GB are occupied by the weights themselves, leaving only ~1 GB for KV cache. At batch size 1 with a 4K context, Llama 3.1 70B's KV cache requires approximately 1.25 GB on its own. Any non-trivial batch size or context length will overflow the available headroom and cause OOM errors.

In practice, FP8 quantization is strongly recommended on H200: it halves model weights to ~70 GB, leaving ~71 GB for KV cache and enabling meaningful batch sizes and context lengths. For techniques to reduce KV cache memory pressure and maximize single-GPU utilization, see KV Cache Optimization Guide.
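The H200 headroom arithmetic can be sketched in a few lines. The KV-cache constants below (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache values) are Llama 3.1 70B's published architecture, used here as illustrative assumptions:

```python
def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache size in GB; the leading 2x covers keys and values."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_tokens * per_token_bytes / 1e9

vram_gb = 141
weights_fp16_gb = 140
print(f"FP16 headroom: {vram_gb - weights_fp16_gb} GB")
print(f"4K-context KV cache: {kv_cache_gb(4096):.2f} GB")  # ~1.3 GB > headroom

weights_fp8_gb = 70
print(f"FP8 headroom: {vram_gb - weights_fp8_gb} GB")      # ~71 GB for KV cache
```

The ~1.3 GB this yields for a single 4K-context request already exceeds the ~1 GB of FP16 headroom, which is the OOM scenario described above.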
DeepSeek V3 at FP8 (671B MoE, ~37B active parameters per step)
For MoE models, only the active expert weights are read per decode step. DeepSeek V3 activates approximately 37B parameters per token step at FP8 precision, meaning ~37 GB of weights per step.
Note: the full model at FP8 is ~671 GB, requiring at least 9xH100 (9 × 80 GB = 720 GB), 5xH200 (5 × 141 GB = 705 GB), or 4xB200 (4 × 192 GB = 768 GB) for single-instance serving. The TPS ceiling below is calculated per individual GPU (per-GPU bandwidth / active weight bytes). In a multi-GPU setup, aggregate system TPS scales roughly with GPU count, minus interconnect overhead.
| GPU | BW | Per-GPU BW-bound TPS ceiling (active weights only) | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~91 TPS | 9-GPU minimum (9 × 80 GB = 720 GB ≥ 671 GB) |
| H200 SXM5 | 4.8 TB/s | ~130 TPS | 5-GPU minimum (5 × 141 GB = 705 GB ≥ 671 GB) |
| B200 | 8.0 TB/s | ~216 TPS | 4xB200 fits the full model (4 × 192 GB) |
| B300 | 10.0 TB/s | ~270 TPS | 3xB300 fits the full model (3 × 288 GB) |
| R100 | 22 TB/s | ~595 TPS | 3xR100 fits the full model (3 × 288 GB) |
This is an approximation for the bandwidth-ceiling calculation only. The ~37B active parameter count is DeepSeek V3's nominal per-step activation, but actual performance depends on routing patterns, sparsity, and all-to-all communication latency in multi-GPU setups.
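Under the same approximation, both the per-GPU ceilings and the capacity minimums in the table reduce to simple arithmetic:

```python
import math

# MoE ceiling: only the ~37 GB of active expert weights (37B params at FP8)
# are read per decode step.
active_gb = 37.0
for gpu, bw_tbs in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0),
                    ("B300", 10.0), ("R100", 22.0)]:
    print(f"{gpu}: ~{bw_tbs * 1e12 / (active_gb * 1e9):.0f} TPS per GPU")
# ~91, ~130, ~216, ~270, ~595

# Capacity check against the full ~671 GB FP8 model:
for gpu, vram_gb in [("H100", 80), ("H200", 141), ("B200", 192), ("B300", 288)]:
    print(f"{gpu}: minimum {math.ceil(671 / vram_gb)} GPUs")
# 9, 5, 4, 3
```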
Large dense model proxy: 405B at FP8 (~405 GB weights)
At FP8, a 405B model weighs approximately 405 GB. No single GPU currently has sufficient VRAM to hold it; even the B300 and R100 at 288 GB each fall short. The minimum viable configurations are 2xB300 or 2xR100 (2 × 288 GB = 576 GB ≥ 405 GB). With 2-way tensor parallelism, the effective per-step bandwidth is the sum of both GPUs' bandwidth:
| Configuration | Total BW | Theoretical max TPS | Notes |
|---|---|---|---|
| 2x B300 | 20.0 TB/s | ~49 TPS | 2 × 10.0 TB/s, 576 GB total VRAM |
| 2x R100 | 44 TB/s | ~109 TPS | 2 × 22 TB/s, 576 GB total VRAM |
Actual throughput is below these ceilings due to KV cache reads, attention, NVLink communication overhead, and framework overhead. For the full profiling methodology, see AI's Memory Wall Problem.
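The 2-GPU ceilings follow from summing per-GPU bandwidth: each GPU holds half the weights and reads its half every step, so before NVLink and framework overhead the effective bandwidth is the per-GPU sum.

```python
def tp_ceiling_tps(per_gpu_bw_tbs: float, n_gpus: int, weight_gb: float) -> float:
    """Aggregate bandwidth-bound ceiling under n-way tensor parallelism."""
    total_bw = per_gpu_bw_tbs * n_gpus * 1e12
    return total_bw / (weight_gb * 1e9)

print(f"2x B300: ~{tp_ceiling_tps(10.0, 2, 405):.0f} TPS")  # ~49
print(f"2x R100: ~{tp_ceiling_tps(22.0, 2, 405):.0f} TPS")  # ~109
```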
When FLOPs Matter: Prefill vs Decode
Bandwidth is the only lever for decode at small batch sizes. But as batch size grows, the workload shifts:
Prefill is compute-bound. During prefill, the model processes all input tokens at once, giving high arithmetic intensity. Adding FLOPS directly reduces time-to-first-token. The B200's 2,250 FP16 TFLOPS (vs 1,979 on H100) and 4,500 FP8 TFLOPS translate directly into faster prefill throughput.
Decode at batch 1-4 is bandwidth-bound. Only bandwidth matters here. H200's 43% bandwidth advantage over H100 translates to 43% more tokens per second on memory-bound decode.
Decode at batch 64+ starts to shift. As more requests are batched together, the same weight bytes are shared across more tokens, increasing arithmetic intensity. Once batch size pushes arithmetic intensity past the GPU's ridge point, FLOPS and bandwidth both matter.
For a real-time chatbot serving batch 1-8, bandwidth is the only hardware decision that matters. For a high-throughput API endpoint batching 64+ requests, you need to think about both. For teams running separate prefill and decode fleets, see Prefill-Decode Disaggregation on GPU Cloud for the infrastructure patterns.
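A first-order estimate of the crossover batch size: at batch B, each weight byte read supports roughly B FLOPs of decode work, so decode goes compute-bound around the ridge point. The dense FP16 figures below are assumed to be half the with-sparsity numbers from the table above, and the real crossover lands lower once KV-cache reads (which grow with batch size) are counted:

```python
def crossover_batch(peak_flops: float, bw_bytes_s: float) -> float:
    """Batch size at which decode arithmetic intensity reaches the ridge point.
    First-order sketch: ignores KV-cache reads and attention overhead."""
    return peak_flops / bw_bytes_s

print(f"H100 (dense FP16 ~989 TFLOPS): ~{crossover_batch(989e12, 3.35e12):.0f}")
print(f"B200 (dense FP16 ~1125 TFLOPS): ~{crossover_batch(1125e12, 8.0e12):.0f}")
```

In practice KV-cache traffic drags the effective crossover well below these numbers, which is why the shift is already visible at batch 64+.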
Quantization as a Bandwidth Multiplier
Quantization reduces bytes per parameter, which directly scales the TPS ceiling. On a fixed GPU, lower precision is a free bandwidth upgrade:
| Precision | Bytes/param | Effective BW multiplier vs FP16 | 70B Llama ceiling TPS on H200 (4.8 TB/s) |
|---|---|---|---|
| FP16 | 2.0 | 1.0x | ~34 TPS |
| BF16 | 2.0 | 1.0x | ~34 TPS |
| FP8 | 1.0 | 2.0x | ~69 TPS |
| INT8 | 1.0 | 2.0x | ~69 TPS |
| FP4/INT4 | 0.5 | 4.0x (Blackwell native FP4 only) | ~137 TPS |
A few clarifications:
FP4 tensor core hardware support is only available on Blackwell (B200/B300). On H100/H200, FP4 has no hardware backing and falls back to FP8 or slower emulation. INT4 via AWQ or GPTQ works on H100/H200 but uses software dequantization, so the effective bandwidth benefit is smaller than the theoretical 4x.
AMD's MXFP4 format is available on MI355X and uses a different numerical standard than NVIDIA's NV-FP4. The byte width is the same (0.5 bytes/param), but model accuracy and kernel availability differ by framework.
For framework setup and accuracy tradeoffs on FP4, see FP4 Quantization on Blackwell GPU Cloud. For INT4/INT8 workflows with AWQ, see AWQ Quantization Guide for LLM Deployment.
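The multiplier table reduces to one division: the ceiling scales inversely with bytes per parameter. Reproducing the H200 column above:

```python
# Quantization as a bandwidth multiplier: 70B Llama on H200 (4.8 TB/s).
bw_bytes_s = 4.8e12
params = 70e9
for precision, bytes_pp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    tps = bw_bytes_s / (params * bytes_pp)
    mult = 2.0 / bytes_pp
    print(f"{precision}: {mult:.0f}x effective bandwidth, ~{tps:.0f} TPS ceiling")
# FP16: ~34, FP8: ~69, FP4: ~137 (FP4 hardware path on Blackwell only)
```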
HBM Pricing and Bandwidth Per Dollar
Based on live Spheron pricing as of 15 May 2026:
| GPU | HBM gen | BW (TB/s) | On-demand $/hr | Spot $/hr | BW/on-demand dollar (TB/s per $/hr) | BW/spot dollar |
|---|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 | $4.00 | $1.69 | 0.84 | 1.98 |
| H200 SXM5 | HBM3e | 4.8 | $4.72 | $1.89 | 1.02 | 2.54 |
| B200 | HBM3e | 8.0 | $7.32 | $3.78 | 1.09 | 2.12 |
| B300 | HBM3e | 10.0 | Check pricing | Check pricing | Check pricing | Check pricing |
| Rubin R100 | HBM4 | 22 | est. $8-12/hr (not on Spheron) | est. $8-12/hr (not on Spheron) | est. 1.8-2.75 | est. 1.8-2.75 |
Pricing fluctuates based on GPU availability. The prices above are based on 15 May 2026 and may have changed. Check current GPU pricing → for live rates.
On spot pricing, H200 delivers the best bandwidth-per-dollar at 2.54 TB/s per $/hr, ahead of B200 (2.12) and H100 (1.98).
On on-demand pricing, B200 leads at 1.09 TB/s per $/hr, with H200 close at 1.02 and H100 trailing at 0.84. The gap between H200 and B200 on on-demand is small, but B200 also brings native FP4 Tensor Core support and 2.4x the raw bandwidth, which matters for high-concurrency workloads.
Rubin R100 at estimated $8-12/hr would deliver 22 TB/s, giving 1.8-2.75 TB/s per $/hr depending on final pricing. At the high end of pricing estimates, H200 still matches or beats R100 on bandwidth-per-dollar. HBM4's advantage over HBM3e on this metric is not guaranteed until cloud spot pricing normalizes, which historically takes 6-12 months after a new generation launches.
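The bandwidth-per-dollar columns above are straight ratios of the pricing table, easy to recompute when spot rates move:

```python
# Bandwidth-per-dollar from the pricing table above (Spheron rates, 15 May 2026).
gpus = {
    "H100": (3.35, 4.00, 1.69),  # (BW TB/s, on-demand $/hr, spot $/hr)
    "H200": (4.8, 4.72, 1.89),
    "B200": (8.0, 7.32, 3.78),
}
for name, (bw, od, spot) in gpus.items():
    print(f"{name}: {bw / od:.2f} TB/s per on-demand $, {bw / spot:.2f} per spot $")

# R100 at the estimated $8-12/hr range:
print(f"R100 est.: {22 / 12:.2f}-{22 / 8:.2f} TB/s per $")
```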
Decision Matrix: Which HBM Generation to Rent on Spheron
| Workload | Recommended GPU | HBM gen | Reason |
|---|---|---|---|
| Llama 3.1 8B chatbot, batch 1-8 | H100 SXM5 | HBM3 | Fits on 80 GB, cheapest viable option (spot from $1.69/hr, on-demand from $4.00/hr) |
| Llama 3.1 70B chatbot, batch 1-8 | H200 SXM5 rental | HBM3e | 4.8 TB/s, fits 70B weights in FP16 but FP8 required for any real batch/context; lowest cost at this scale |
| Llama 3.1 70B high-concurrency, batch 32+ | B200 on Spheron | HBM3e | 8 TB/s, FP4 support, 2.4x H100 bandwidth for higher batch TPS |
| DeepSeek V3 (671B MoE) | B200 or B300 | HBM3e | 4xB200 (768 GB) or 3xB300 (864 GB) fits the full model |
| 405B dense model, FP8 | B300 SXM5 | HBM3e | 288 GB VRAM for 405B at FP8 in 2-GPU configuration |
| Research, future-proofing | Rubin R100 (H2 2026) | HBM4 | 22 TB/s, HBM4, not yet on Spheron |
Migration Path: HBM3 to HBM4
HBM3 (H100) to HBM3e (H200): Drop-in swap. Same Hopper architecture, same CUDA code, same TensorRT-LLM and vLLM configs. Only change: your batch size can increase ~43% for the same latency target because bandwidth increased 43%. No code changes, no new dependencies, no new precision formats to configure.
HBM3e (H200) to HBM3e (B200): Architecture change from Hopper to Blackwell. For FP8 and FP16 workloads, existing vLLM configs run without modification. To unlock native FP4 Tensor Cores, you need TensorRT-LLM 0.17 or later and an FP4-quantized checkpoint. The B200's 8 TB/s bandwidth means the same 70B model now has ~57 TPS ceiling vs ~34 TPS on H200 at FP16, and ~137+ TPS at FP4. For the H100-to-B200 migration math and quantization setup, see FP4 Quantization on Blackwell GPU Cloud.
HBM3e (B200) to HBM4 (R100): H2 2026 cloud availability from the first confirmed cohort (AWS, Google Cloud, Azure, CoreWeave). Architecture change from Blackwell to Rubin. New NVLink 6 at 3.6 TB/s per GPU for multi-GPU configurations. Software stack compatibility with TensorRT-LLM and vLLM is expected to follow within 2-4 months of GA availability, based on the pattern from Blackwell. Broader cloud availability including Spheron is expected in 2027.
Spheron aggregates H100, H200, B200, and B300 instances from data center partners globally - pick the HBM generation that matches your workload's bandwidth requirements without paying for R100-tier headroom you don't need today.
Frequently Asked Questions
What is the difference between HBM3, HBM3e, HBM4, and HBM4e?
HBM3 (H100, GH200) delivers 3.35-4.0 TB/s per GPU using 12-high stacks. HBM3e is an optimized HBM3 variant with faster per-pin speeds and wider stacks: H200 reaches 4.8 TB/s, B200 reaches 8 TB/s, and B300 reaches 10 TB/s. HBM4 doubles the I/O width from 1,024 to 2,048 bits per stack and ships on the NVIDIA Rubin R100 at up to 22 TB/s in H2 2026. HBM4e is the JEDEC-standardized next iteration after HBM4, with no confirmed GPU product as of May 2026.
Why is LLM decode memory-bandwidth-bound?
Decode is memory-bandwidth-bound at small batch sizes (1-8). Each token generation requires reading all model weights from VRAM once. For a 70B FP16 model, that's 140 GB per step. On an H100 at 3.35 TB/s, the lower bound is ~42 ms per token regardless of how many FLOPS the GPU has. Prefill is the opposite: compute-bound, because all input tokens are processed simultaneously with high arithmetic intensity.
How do I calculate the bandwidth-bound TPS ceiling for a model?
The formula is: Max TPS = GPU_bandwidth_bytes_per_sec / model_weight_bytes. For FP16, model_weight_bytes = parameters × 2. For FP8, parameters × 1. Example: 70B FP16 on H200 (4.8 TB/s) = 4.8e12 / 140e9 = ~34 tokens/sec ceiling. Actual throughput is lower due to KV cache reads, attention overhead, and framework overhead, but the bandwidth ceiling sets the hard upper bound.
Does quantization increase effective memory bandwidth?
Yes, quantization reduces bytes-per-parameter, which directly increases the effective bandwidth multiplier. FP8 halves the weights to 1 byte/param, doubling effective bandwidth. FP4 reduces to 0.5 bytes/param, quadrupling effective bandwidth. On H200 at 4.8 TB/s with 70B Llama, FP16 gives ~34 TPS ceiling, FP8 gives ~69 TPS ceiling, and FP4 gives ~137 TPS ceiling. Native FP4 Tensor Core support requires Blackwell hardware (B200/B300).
Which GPU should I rent for 70B inference?
For 70B FP16 inference at batch 1-8, H200 SXM5 (HBM3e, 4.8 TB/s) fits the full model on a single GPU and delivers ~34 TPS ceiling at FP16. On spot pricing, H200 delivers the best bandwidth-per-dollar on Spheron (2.54 TB/s per $/hr vs B200's 2.12). On on-demand pricing, B200 edges ahead at 1.09 TB/s per $/hr versus H200's 1.02, while also adding native FP4 Tensor Core support and 2.4x the raw bandwidth for higher concurrency. HBM4 (Rubin R100) is not yet available on Spheron and is not needed for 70B workloads at current traffic levels.
