
HBM3e vs HBM4 vs HBM4e for LLM Inference: GPU Memory Bandwidth Decision Guide (2026)

Written by Mitrasish, Co-founder · May 15, 2026

HBM4 ships on the NVIDIA Rubin R100 in H2 2026, creating a real decision point for teams planning inference infrastructure. Most coverage of this topic is vendor spec sheets. This guide is deployment math: bandwidth ceilings, TPS limits per GPU, and a cost-per-bandwidth comparison across HBM generations so you can pick the right tier without paying for headroom you won't use for 18 months. Check current GPU pricing before making any capacity commitment since spot rates shift weekly.

Quick Answer: Which HBM Generation for Which Workload

| HBM gen | GPU | Bandwidth | VRAM | Best for | Spheron availability |
|---|---|---|---|---|---|
| HBM3 | H100 SXM5 | 3.35 TB/s | 80 GB | 7B-13B inference, cost-sensitive | Available |
| HBM3 | GH200 | 4.0 TB/s | 96 GB | 7B-34B inference, slightly more headroom | Available |
| HBM3e | H200 SXM5 | 4.8 TB/s | 141 GB | 70B single-GPU inference, best spot BW/dollar | Available |
| HBM3e | B200 | 8.0 TB/s | 192 GB | 70B-200B production inference, FP4 workloads | Available |
| HBM3e | B300 | 10.0 TB/s | 288 GB | 200B+ models, maximum single-GPU VRAM | Check pricing |
| HBM4 | Rubin R100 | up to 22 TB/s | 288 GB | 400B+ models, next-gen throughput | H2 2026, not yet on Spheron |
| HBM4e | Rubin R200/next | TBD | TBD | Future generation, unannounced | Not announced |

Why Decode Is Memory-Bandwidth-Bound (Not Compute-Bound)

The roofline model describes two performance ceilings for any GPU workload:

Arithmetic Intensity = FLOPs executed / Bytes moved from memory

If your workload's arithmetic intensity falls below the GPU's ridge point (peak FLOPS / peak bandwidth), you are memory-bound. If it falls above, you are compute-bound.
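
To make the ridge-point test concrete, here is a minimal Python sketch using the H100 SXM5 figures cited in this guide (989 TFLOPS is the dense FP16 rate, half the with-sparsity number quoted in the tables below):

```python
# Roofline check: is a workload memory-bound or compute-bound on a given GPU?

def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOPs per byte above which the GPU becomes compute-bound."""
    return peak_flops / peak_bandwidth

h100_ridge = ridge_point(peak_flops=989e12,       # dense FP16 TFLOPS (H100 SXM5)
                         peak_bandwidth=3.35e12)  # bytes/s (3.35 TB/s)

decode_intensity = 1.0  # ~1 FLOP per byte for batch-1 decode

print(f"H100 ridge point: {h100_ridge:.0f} FLOPs/byte")  # ~295
print("memory-bound" if decode_intensity < h100_ridge else "compute-bound")
```

Batch-1 decode sits two orders of magnitude below the ridge point, which is why the tables below are driven entirely by bandwidth.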

For autoregressive decode at batch size 1, the arithmetic intensity is approximately 1 FLOP per byte. The GPU reads all model weights from VRAM once per token step, performs a small number of operations on each value, then moves on. With a 70B FP16 model, that means reading approximately 140 GB of weights per token. The time this takes is fixed by bandwidth, not FLOPS:

| GPU | HBM gen | Bandwidth | 70B FP16 weight read time | TPS ceiling (BW-bound, batch 1) |
|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | ~41.8 ms | ~24 TPS |
| H200 SXM5 | HBM3e | 4.8 TB/s | ~29.2 ms | ~34 TPS |
| B200 | HBM3e | 8.0 TB/s | ~17.5 ms | ~57 TPS |
| B300 | HBM3e | 10.0 TB/s | ~14.0 ms | ~71 TPS |
| R100 | HBM4 | 22 TB/s | ~6.4 ms | ~157 TPS |

The weight-read times are hard lower bounds per token, which makes the TPS figures hard upper bounds. No amount of additional FLOPS can push TPS above the bandwidth ceiling at batch size 1: double the H100's FLOPS budget and the ~42 ms floor stays put.

The prefill phase is the opposite: prefill is compute-bound because all input tokens are processed as a batch, creating high arithmetic intensity. Adding FLOPS improves prefill speed. Adding bandwidth improves decode throughput. They are different problems.

For the full KV cache analysis and how context length compounds the VRAM pressure problem, see AI's Memory Wall Problem.

HBM Generations: Full Specification Comparison

Each HBM generation is defined by stack height (more layers = more DRAM capacity), per-pin bandwidth (die speed), and I/O width (how many pins per stack). HBM4 breaks the pattern by doubling the I/O width:

| HBM gen | Stack height | Per-pin BW | Die BW (per stack) | I/O width | Key GPUs |
|---|---|---|---|---|---|
| HBM2e | 8-Hi | ~3.6 Gbps/pin | ~460 GB/s | 1,024-bit | A100 |
| HBM3 | 12-Hi | ~6.4 Gbps/pin | ~819 GB/s | 1,024-bit | H100, GH200 |
| HBM3e | 12-16-Hi | ~9.6 Gbps/pin | ~1.2 TB/s | 1,024-bit | H200, B200, B300 |
| HBM4 | 16-Hi | ~6.4 Gbps/pin | ~1.5+ TB/s | 2,048-bit | Rubin R100 (H2 2026) |
| HBM4e | TBD | TBD | TBD | TBD | Not yet announced |

The key change from HBM3e to HBM4 is the I/O width doubling from 1,024 to 2,048 bits per stack. Per-stack bandwidth increases from ~1.2 TB/s to ~1.5+ TB/s entirely because of the doubled I/O width. HBM4's per-pin speed (6.4 Gbps/pin) is the same as HBM3, not faster than HBM3e. The bigger driver of the 22 TB/s total on the R100 is more stacks combined with that doubled bus width. This is a structural, not incremental, change, which is why HBM4 represents a generation boundary rather than a speed bump.
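
The per-stack figures in the table follow directly from per-pin speed times bus width; a quick sketch of the arithmetic:

```python
# Per-stack HBM bandwidth = per-pin data rate x I/O width (bits -> bytes).

def stack_bandwidth_gbs(pin_gbps: float, io_width_bits: int) -> float:
    """Per-stack bandwidth in GB/s."""
    return pin_gbps * io_width_bits / 8

print(stack_bandwidth_gbs(6.4, 1024))  # HBM3:  ~819 GB/s
print(stack_bandwidth_gbs(9.6, 1024))  # HBM3e: ~1229 GB/s (~1.2 TB/s)
print(stack_bandwidth_gbs(6.4, 2048))  # HBM4:  ~1638 GB/s (~1.6 TB/s)
```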

On HBM4e: JEDEC published the HBM4e standard as the next iteration after HBM4. As of May 2026, no GPU vendor has announced a product shipping HBM4e. It does not factor into any real deployment decision today, and bandwidth specs are not confirmed by any manufacturer.

One important nuance: HBM3e bandwidth varies significantly across GPU models because B200 and B300 use different configurations. B200 achieves 8 TB/s; B300 achieves 10 TB/s. Do not treat "HBM3e = 8 TB/s" as a fixed fact. Always tie bandwidth numbers to the specific GPU model.

Per-GPU Bandwidth Comparison: H100 to Rubin R100

| GPU | HBM gen | Memory BW | VRAM | FP16 Tensor TFLOPS (with sparsity) | BW-bound TPS ceiling† (70B FP16, batch 1) |
|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | 80 GB | 1,979 | ~24 |
| GH200 | HBM3 | 4.0 TB/s | 96 GB | 1,979 | ~29 |
| H200 SXM5 | HBM3e | 4.8 TB/s | 141 GB | 1,979 | ~34 |
| AMD MI300X | HBM3 | 5.3 TB/s | 192 GB | 1,307 | ~38 |
| B200 | HBM3e | 8.0 TB/s | 192 GB | 2,250 | ~57 |
| B300 | HBM3e | 10.0 TB/s | 288 GB | 2,250 | ~71 |
| AMD MI355X | HBM3e | ~8.0 TB/s | 288 GB | ~2,600 | ~57 |
| Rubin R100 | HBM4 | up to 22 TB/s | 288 GB | ~8,000 (proj.) | ~157 |

†Approximate tokens-per-second ceiling for a 70B FP16 model (140 GB weights) at batch size 1, calculated as GPU_bandwidth / model_weight_bytes. This is the hard upper bound for single-request decode throughput and does not account for KV cache reads, attention overhead, or framework buffers.

Note: AMD MI355X and Rubin R100 are included for completeness but are not available on Spheron as of this writing. See GPU pricing and GPU rental for Spheron-available options.

Bandwidth-Bound TPS Ceiling: Worked Examples

The formula is straightforward:

Max TPS (memory-bound) = GPU_bandwidth_bytes_per_sec / model_weight_bytes
model_weight_bytes = parameter_count × bytes_per_param

At batch size 1, the GPU reads all weights once per token step. The TPS ceiling is the bandwidth divided by the weight footprint.
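
In code, the ceiling calculation is a one-liner; the values below match the tables that follow:

```python
# Bandwidth-bound decode TPS ceiling at batch size 1.

def tps_ceiling(bandwidth_tbs: float, params_b: float, bytes_per_param: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

# 70B at FP16 (140 GB of weights):
print(f"H100: ~{tps_ceiling(3.35, 70, 2):.0f} TPS")  # ~24
print(f"H200: ~{tps_ceiling(4.8, 70, 2):.0f} TPS")   # ~34
print(f"B200: ~{tps_ceiling(8.0, 70, 2):.0f} TPS")   # ~57
```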

Llama 3.1 70B at FP16 (140 GB weights)

| GPU | BW | Theoretical max TPS | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~24 TPS | Does NOT fit: 80 GB < 140 GB; requires 2-way tensor parallelism or FP8 quantization |
| GH200 | 4.0 TB/s | ~29 TPS | Does NOT fit: 96 GB < 140 GB; requires 2-way tensor parallelism or FP8 quantization |
| H200 SXM5 | 4.8 TB/s | ~34 TPS | 141 GB fits model weights at FP16, but only ~1 GB remains for KV cache; FP8 strongly recommended in practice |
| AMD MI300X | 5.3 TB/s | ~38 TPS | Not on Spheron |
| B200 | 8.0 TB/s | ~57 TPS | 2.4x H100 throughput at FP16 |
| B300 | 10.0 TB/s | ~71 TPS | 3x H100 throughput, 288 GB |
| AMD MI355X | ~8.0 TB/s | ~57 TPS | Not on Spheron |
| Rubin R100 | 22 TB/s | ~157 TPS | 6.5x H100 bandwidth, HBM4 |

H100 at 80 GB and GH200 at 96 GB cannot hold the 140 GB model at FP16. Both require 2-way tensor parallelism or FP8 quantization, which halves model weights to ~70 GB and fits on a single GPU with room for the KV cache. H200 at 141 GB fits the model weights at FP16, but 140 GB of those 141 GB are occupied by the weights themselves, leaving only ~1 GB for KV cache. At batch size 1 with a 4K context, Llama 3.1 70B's KV cache requires approximately 1.25 GB on its own. Any non-trivial batch size or context length will overflow the available headroom and cause OOM errors. In practice, FP8 quantization is strongly recommended on H200: it halves model weights to ~70 GB, leaving ~71 GB for KV cache and enabling meaningful batch sizes and context lengths. For techniques to reduce KV cache memory pressure and maximize single-GPU utilization, see KV Cache Optimization Guide.
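
The ~1.25 GB figure comes from standard KV cache arithmetic; here is a sketch assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
# KV cache footprint: K and V tensors per layer, per token.

def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, elem_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * elem_bytes * tokens  # 2 = K and V

print(f"{kv_cache_bytes(4096) / 2**30:.2f} GiB")  # ~1.25 GiB at 4K context, batch 1
```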

DeepSeek V3 at FP8 (671B MoE, ~37B active parameters per step)

For MoE models, only the active expert weights are read per decode step. DeepSeek V3 activates approximately 37B parameters per token step at FP8 precision, meaning ~37 GB of weights per step.

Note: the full model at FP8 is ~671 GB, requiring at least 9xH100 (9 × 80 GB = 720 GB), 5xH200 (5 × 141 GB = 705 GB), or 4xB200 (4 × 192 GB = 768 GB) for single-instance serving. The TPS ceiling below is calculated per individual GPU (per-GPU bandwidth / active weight bytes). In a multi-GPU setup, aggregate system TPS scales roughly with GPU count, minus interconnect overhead.

| GPU | BW | Per-GPU BW-bound TPS ceiling (active weights only) | Notes |
|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~91 TPS | 9-GPU minimum (9 × 80 GB = 720 GB ≥ 671 GB) |
| H200 SXM5 | 4.8 TB/s | ~130 TPS | 5-GPU minimum (5 × 141 GB = 705 GB ≥ 671 GB) |
| B200 | 8.0 TB/s | ~216 TPS | 4xB200 fits the full model (4 × 192 GB) |
| B300 | 10.0 TB/s | ~270 TPS | 3xB300 fits the full model (3 × 288 GB) |
| R100 | 22 TB/s | ~595 TPS | 3xR100 fits the full model (3 × 288 GB) |

This is an approximation for the bandwidth-ceiling calculation only. The ~37B active parameter count is DeepSeek V3's nominal per-step activation, but actual performance depends on routing patterns, sparsity, and all-to-all communication latency in multi-GPU setups.
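
A sketch of the capacity and per-GPU ceiling math behind the table, pure arithmetic on the figures above:

```python
import math

def min_gpus(model_gb: float, vram_gb: float) -> int:
    """Minimum GPU count to hold the full model in VRAM."""
    return math.ceil(model_gb / vram_gb)

def moe_tps_ceiling(bw_tbs: float, active_params_b: float, bytes_per_param: float) -> float:
    """Per-GPU ceiling over *active* weights only (the MoE approximation)."""
    return bw_tbs * 1e12 / (active_params_b * 1e9 * bytes_per_param)

# DeepSeek V3: ~671 GB at FP8, ~37B active parameters per step
print(min_gpus(671, 80), min_gpus(671, 141), min_gpus(671, 192))  # 9 5 4
print(f"~{moe_tps_ceiling(8.0, 37, 1):.0f} TPS per B200")         # ~216
```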

Large dense model proxy: 405B at FP8 (~405 GB weights)

At FP8, a 405B model weighs approximately 405 GB. No single GPU currently has enough VRAM to hold it; even the B300 and R100 at 288 GB each fall short. The minimum viable configuration is two GPUs, and only B300 and R100 clear that bar (2 × 288 GB = 576 GB ≥ 405 GB). With 2-way tensor parallelism, the effective per-step bandwidth is the sum of both GPUs' bandwidth:

| Configuration | Total BW | Theoretical max TPS | Notes |
|---|---|---|---|
| 2x B300 | 20.0 TB/s | ~49 TPS | 2 × 10.0 TB/s, 576 GB total VRAM |
| 2x R100 | 44 TB/s | ~109 TPS | 2 × 22 TB/s, 576 GB total VRAM |

Actual throughput is below these ceilings due to KV cache reads, attention, NVLink communication overhead, and framework overhead. For the full profiling methodology, see AI's Memory Wall Problem.
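
The idealized tensor-parallel ceiling is also a one-liner; real deployments lose some of this to the interconnect overhead noted above:

```python
# 2-way tensor parallelism: each GPU reads its shard of the weights per step,
# so effective per-step bandwidth is (idealized) the sum across GPUs.

def tp_tps_ceiling(per_gpu_bw_tbs: float, n_gpus: int, weight_gb: float) -> float:
    return per_gpu_bw_tbs * n_gpus * 1e12 / (weight_gb * 1e9)

print(f"2x B300: ~{tp_tps_ceiling(10.0, 2, 405):.0f} TPS")  # ~49
print(f"2x R100: ~{tp_tps_ceiling(22.0, 2, 405):.0f} TPS")  # ~109
```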

When FLOPs Matter: Prefill vs Decode

Bandwidth is the only lever for decode at small batch sizes. But as batch size grows, the workload shifts:

Prefill is compute-bound. During prefill, the model processes all input tokens at once, giving high arithmetic intensity. Adding FLOPS directly reduces time-to-first-token. The B200's 2,250 FP16 TFLOPS (vs 1,979 on H100) and 4,500 FP8 TFLOPS translate directly into faster prefill throughput.

Decode at batch 1-4 is bandwidth-bound. Only bandwidth matters here. H200's 43% bandwidth advantage over H100 translates to 43% more tokens per second on memory-bound decode.

Decode at batch 64+ starts to shift. As more requests are batched together, the same weight bytes are shared across more tokens, increasing arithmetic intensity. Past the roofline crossover point (where arithmetic intensity reaches the GPU's ridge point), FLOPS and bandwidth both matter; a rough estimate of where that happens is sketched below.

For a real-time chatbot serving batch 1-8, bandwidth is the only hardware decision that matters. For a high-throughput API endpoint batching 64+ requests, you need to think about both. For teams running separate prefill and decode fleets, see Prefill-Decode Disaggregation on GPU Cloud for the infrastructure patterns.
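
For a rough feel of where the crossover lands, treat decode arithmetic intensity as growing linearly with batch size (each weight byte is reused once per batched request). This is an idealized model: KV cache reads also scale with batch, which keeps real workloads more memory-bound than this estimate suggests.

```python
# Idealized decode roofline crossover: batch size at which arithmetic
# intensity reaches the GPU's ridge point. A coarse estimate only.

def crossover_batch(peak_flops: float, bandwidth: float,
                    flops_per_param: float = 2.0, bytes_per_param: float = 2.0) -> float:
    intensity_per_request = flops_per_param / bytes_per_param  # ~1 FLOP/byte at FP16
    return (peak_flops / bandwidth) / intensity_per_request

print(f"H100 (dense FP16): ~{crossover_batch(989e12, 3.35e12):.0f}")  # ~295, idealized
```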

Quantization as a Bandwidth Multiplier

Quantization reduces bytes per parameter, which directly scales the TPS ceiling. On a fixed GPU, lower precision is a free bandwidth upgrade:

| Precision | Bytes/param | Effective BW multiplier vs FP16 | 70B Llama ceiling TPS on H200 (4.8 TB/s) |
|---|---|---|---|
| FP16 | 2.0 | 1.0x | ~34 TPS |
| BF16 | 2.0 | 1.0x | ~34 TPS |
| FP8 | 1.0 | 2.0x | ~69 TPS |
| INT8 | 1.0 | 2.0x | ~69 TPS |
| FP4/INT4 | 0.5 | 4.0x (Blackwell native FP4 only) | ~137 TPS |

A few clarifications:

FP4 tensor core hardware support is only available on Blackwell (B200/B300). On H100/H200, FP4 has no hardware backing and falls back to FP8 or slower emulation. INT4 via AWQ or GPTQ works on H100/H200 but uses software dequantization, so the effective bandwidth benefit is smaller than the theoretical 4x.

AMD's MXFP4 format is available on MI355X and uses a different numerical standard than NVIDIA's NV-FP4. The byte width is the same (0.5 bytes/param), but model accuracy and kernel availability differ by framework.
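
A quick sketch reproducing the ceiling column in the table above; precision only changes the bytes-per-parameter divisor:

```python
# Quantization as a bandwidth multiplier: 70B model on H200 (4.8 TB/s).

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    tps = 4.8e12 / (70e9 * bytes_per_param)
    print(f"{name}: ~{tps:.0f} TPS")  # ~34, ~69, ~137
```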

For framework setup and accuracy tradeoffs on FP4, see FP4 Quantization on Blackwell GPU Cloud. For INT4/INT8 workflows with AWQ, see AWQ Quantization Guide for LLM Deployment.

HBM Pricing and Bandwidth Per Dollar

Based on live Spheron pricing as of 15 May 2026:

| GPU | HBM gen | BW (TB/s) | On-demand $/hr | Spot $/hr | BW per on-demand $ (TB/s per $/hr) | BW per spot $ |
|---|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 | $4.00 | $1.69 | 0.84 | 1.98 |
| H200 SXM5 | HBM3e | 4.8 | $4.72 | $1.89 | 1.02 | 2.54 |
| B200 | HBM3e | 8.0 | $7.32 | $3.78 | 1.09 | 2.12 |
| B300 | HBM3e | 10.0 | Check pricing | Check pricing | Check pricing | Check pricing |
| Rubin R100 | HBM4 | 22 | est. $8-12 (not on Spheron) | est. $8-12 (not on Spheron) | est. 1.8-2.75 | est. 1.8-2.75 |

Pricing fluctuates based on GPU availability. The prices above are based on 15 May 2026 and may have changed. Check current GPU pricing → for live rates.

On spot pricing, H200 delivers the best bandwidth-per-dollar at 2.54 TB/s per $/hr, ahead of B200 (2.12) and H100 (1.98).

On on-demand pricing, B200 leads at 1.09 TB/s per $/hr, with H200 close at 1.02 and H100 trailing at 0.84. The gap between H200 and B200 on on-demand is small, but B200 also brings native FP4 Tensor Core support and 2.4x the raw bandwidth, which matters for high-concurrency workloads.

Rubin R100 at estimated $8-12/hr would deliver 22 TB/s, giving 1.8-2.75 TB/s per $/hr depending on final pricing. At the high end of pricing estimates, H200 still matches or beats R100 on bandwidth-per-dollar. HBM4's advantage over HBM3e on this metric is not guaranteed until cloud spot pricing normalizes, which historically takes 6-12 months after a new generation launches.
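
The bandwidth-per-dollar columns are simple division; a sketch using the 15 May 2026 rates (swap in live prices before committing capacity):

```python
# Bandwidth per dollar: TB/s per $/hr, on-demand and spot.

gpus = {  # name: (bandwidth TB/s, on-demand $/hr, spot $/hr)
    "H100 SXM5": (3.35, 4.00, 1.69),
    "H200 SXM5": (4.8, 4.72, 1.89),
    "B200": (8.0, 7.32, 3.78),
}

for name, (bw, od, spot) in gpus.items():
    print(f"{name}: {bw / od:.2f} on-demand, {bw / spot:.2f} spot")
```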

Decision Matrix: Which HBM Generation to Rent on Spheron

| Workload | Recommended GPU | HBM gen | Reason |
|---|---|---|---|
| Llama 3.1 8B chatbot, batch 1-8 | H100 SXM5 | HBM3 | Fits on 80 GB, cheapest viable option (spot from $1.69/hr, on-demand from $4.00/hr) |
| Llama 3.1 70B chatbot, batch 1-8 | H200 SXM5 rental | HBM3e | 4.8 TB/s; fits 70B weights in FP16, but FP8 required for any real batch/context; lowest cost at this scale |
| Llama 3.1 70B high-concurrency, batch 32+ | B200 on Spheron | HBM3e | 8 TB/s, FP4 support, 2.4x H100 bandwidth for higher batch TPS |
| DeepSeek V3 (671B MoE) | B200 or B300 | HBM3e | 4xB200 (768 GB) or 3xB300 (864 GB) fits the full model |
| 405B dense model, FP8 | B300 SXM5 | HBM3e | 288 GB VRAM serves 405B at FP8 in a 2-GPU configuration |
| Research, future-proofing | Rubin R100 (H2 2026) | HBM4 | 22 TB/s, HBM4, not yet on Spheron |

Migration Path: HBM3 to HBM4

HBM3 (H100) to HBM3e (H200): Drop-in swap. Same Hopper architecture, same CUDA code, same TensorRT-LLM and vLLM configs. Only change: your batch size can increase ~43% for the same latency target because bandwidth increased 43%. No code changes, no new dependencies, no new precision formats to configure.

HBM3e (H200) to HBM3e (B200): Architecture change from Hopper to Blackwell. For FP8 and FP16 workloads, existing vLLM configs run without modification. To unlock native FP4 Tensor Cores, you need TensorRT-LLM 0.17 or later and an FP4-quantized checkpoint. The B200's 8 TB/s bandwidth means the same 70B model now has ~57 TPS ceiling vs ~34 TPS on H200 at FP16, and ~137+ TPS at FP4. For the H100-to-B200 migration math and quantization setup, see FP4 Quantization on Blackwell GPU Cloud.

HBM3e (B200) to HBM4 (R100): H2 2026 cloud availability from the first confirmed cohort (AWS, Google Cloud, Azure, CoreWeave). Architecture change from Blackwell to Rubin. New NVLink 6 at 3.6 TB/s per GPU for multi-GPU configurations. Software stack compatibility with TensorRT-LLM and vLLM is expected to follow within 2-4 months of GA availability, based on the pattern from Blackwell. Broader cloud availability including Spheron is expected in 2027.

Spheron aggregates H100, H200, B200, and B300 instances from data center partners globally: pick the HBM generation that matches your workload's bandwidth requirements without paying for R100-tier headroom you don't need today.

Rent H200 → | Rent B200 → | View all GPU pricing →


Frequently Asked Questions

What is the difference between HBM3, HBM3e, HBM4, and HBM4e?

HBM3 (H100, GH200) delivers 3.35-4.0 TB/s per GPU using 12-high stacks. HBM3e is an optimized HBM3 variant with faster per-pin speeds and wider stacks: H200 reaches 4.8 TB/s, B200 reaches 8 TB/s, and B300 reaches 10 TB/s. HBM4 doubles the I/O width from 1,024 to 2,048 bits per stack and ships on the NVIDIA Rubin R100 at up to 22 TB/s in H2 2026. HBM4e is the JEDEC-standardized next iteration after HBM4, with no confirmed GPU product as of May 2026.

Why is LLM decode memory-bandwidth-bound rather than compute-bound?

Decode is memory-bandwidth-bound at small batch sizes (1-8). Each token generation requires reading all model weights from VRAM once. For a 70B FP16 model, that's 140 GB per step. On an H100 at 3.35 TB/s, the lower bound is ~42 ms per token regardless of how many FLOPS the GPU has. Prefill is the opposite: compute-bound, because all input tokens are processed simultaneously with high arithmetic intensity.

How do I calculate a GPU's bandwidth-bound TPS ceiling?

The formula is: Max TPS = GPU_bandwidth_bytes_per_sec / model_weight_bytes. For FP16, model_weight_bytes = parameters × 2. For FP8, parameters × 1. Example: 70B FP16 on H200 (4.8 TB/s) = 4.8e12 / 140e9 = ~34 tokens/sec ceiling. Actual throughput is lower due to KV cache reads, attention overhead, and framework overhead, but the bandwidth ceiling sets the hard upper bound.

Does quantization increase effective memory bandwidth?

Yes, quantization reduces bytes-per-parameter, which directly increases the effective bandwidth multiplier. FP8 halves the weights to 1 byte/param, doubling effective bandwidth. FP4 reduces to 0.5 bytes/param, quadrupling effective bandwidth. On H200 at 4.8 TB/s with 70B Llama, FP16 gives ~34 TPS ceiling, FP8 gives ~69 TPS ceiling, and FP4 gives ~137 TPS ceiling. Native FP4 Tensor Core support requires Blackwell hardware (B200/B300).

Which GPU is the best value for 70B inference right now?

For 70B FP16 inference at batch 1-8, H200 SXM5 (HBM3e, 4.8 TB/s) fits the full model on a single GPU and delivers ~34 TPS ceiling at FP16. On spot pricing, H200 delivers the best bandwidth-per-dollar on Spheron (2.54 TB/s per $/hr vs B200's 2.12). On on-demand pricing, B200 edges ahead at 1.09 TB/s per $/hr versus H200's 1.02, while also adding native FP4 Tensor Core support and 2.4x the raw bandwidth for higher concurrency. HBM4 (Rubin R100) is not yet available on Spheron and is not needed for 70B workloads at current traffic levels.
