You add a second H100 and your per-token latency barely moves. The GPU is busy, the model is the same, and nothing you do to the compute side changes the output rate. This is the memory wall: the GPU can compute far faster than it can feed data to its compute units, so raw FLOPS sit idle while the memory bus is the actual constraint. A separate guide on diagnosing slow inference covers the seven most common causes as a practical starting point; this post focuses specifically on the memory wall case, which is where most production LLM deployments hit their ceiling.
The Memory Wall: Why FLOPS Aren't the Bottleneck
Every GPU has two performance ceilings: how fast it can compute (FLOPS) and how fast it can move data from memory to the compute units (memory bandwidth). For a given workload, whichever ceiling you hit first is what limits throughput.
The key concept is arithmetic intensity: the ratio of compute operations to bytes of memory accessed.
Arithmetic Intensity = FLOPs executed / Bytes moved from memory

GPUs have a hardware-determined bandwidth ceiling measured in TB/s and a compute ceiling measured in TFLOPS. Plotting both ceilings against arithmetic intensity gives the roofline; the point where the two ceilings intersect is called the ridge point. If your workload's arithmetic intensity falls below the ridge point, you are memory-bound. If it falls above, you are compute-bound.
LLM inference in the decode phase sits well below the roofline for almost every GPU and model size combination. Here is why.
During autoregressive decoding, the model generates one token per forward pass. Each pass requires reading all model weights from VRAM. For a 70B parameter model at FP16, that is approximately 140 GB of data transferred per token step. On an H100 SXM5 with 3.35 TB/s bandwidth, that transfer alone takes roughly 42 milliseconds per token at maximum theoretical bandwidth. The compute for a single token at batch size 1 takes a fraction of that time because batch size 1 gives almost no arithmetic parallelism.
That 42 ms lower bound on TPOT (Time Per Output Token) cannot be improved by adding FLOPS. You can double the FLOPS budget and the bound stays at 42 ms because the bottleneck is the 3.35 TB/s bandwidth limit, not the compute rate.
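The arithmetic above generalizes: the decode TPOT floor is just weight bytes divided by memory bandwidth. A minimal sketch of that calculation (the function name and structure are illustrative, not from any library):

```python
def tpot_floor_ms(param_count: float, bytes_per_param: float,
                  bandwidth_tbs: float) -> float:
    """Lower bound on time-per-output-token (ms) set by weight reads alone.

    Assumes every decode step streams all model weights from HBM exactly
    once, and ignores KV cache reads, activations, and launch overhead.
    """
    weight_bytes = param_count * bytes_per_param
    seconds = weight_bytes / (bandwidth_tbs * 1e12)
    return seconds * 1e3

# 70B parameters at FP16 on an H100 SXM5 (3.35 TB/s)
print(round(tpot_floor_ms(70e9, 2, 3.35), 1))  # ~41.8 ms
# Same model on an H200 (4.8 TB/s): the floor drops to ~29.2 ms
print(round(tpot_floor_ms(70e9, 2, 4.8), 1))
```

Doubling FLOPS leaves both numbers unchanged; only the bandwidth argument moves them.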
The decode phase has low arithmetic intensity by nature. With a batch of 1, there is almost no data reuse across the matrix multiplications: you read the weights once, do a small number of operations on them, and move to the next layer. Increasing batch size raises arithmetic intensity because the same weights are shared across multiple requests, eventually pushing the workload toward compute-bound territory. But at small batch sizes (1-8), decode is almost always memory-bound.
The prefill phase is different. During prefill, the model processes the full input prompt in one shot, with all input tokens as a batch. This gives high arithmetic intensity and makes prefill compute-bound. Adding FLOPS improves prefill latency. It does not improve decode latency.
Anatomy of Inference Memory: What's Competing for VRAM
Three data structures compete for GPU VRAM during inference:
- Model weights (static): all the trained parameters. For a 70B model at FP16, this is ~140 GB. At FP8, ~70 GB. This is fixed for the life of the server process.
- KV cache (dynamic, grows with context): key and value tensors accumulated for every token in every active request's context. This is what dominates at long context lengths and large batch sizes.
- Activations (transient): intermediate values computed during the forward pass. These are proportional to batch size and layer width but are freed after each layer, so they add only a few GB at typical batch sizes.
The KV cache is the variable that makes VRAM planning difficult. Its size follows this formula:
KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

For Llama 3 70B at FP16 with 8K context and 8 concurrent requests:
- 2 (K and V) × 80 layers × 8 KV heads × 128 head_dim × 8192 tokens × 8 requests × 2 bytes = ~21.5 GB
At 32K context with the same 8 concurrent requests, that grows to ~85 GB, which exceeds the 80 GB available on an H100 SXM5 before you even account for model weights.
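The formula is easy to wrap in a sanity-check helper before committing to hardware; a sketch using the Llama 3 70B geometry quoted above (the helper name is illustrative):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: float = 2) -> float:
    """KV cache size in GB: factor of 2 covers the K and V tensors."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# Llama 3 70B at FP16: 80 layers, 8 KV heads (GQA), head_dim 128
print(round(kv_cache_gb(80, 8, 128, 8192, 8), 1))   # 8K context: ~21.5 GB
print(round(kv_cache_gb(80, 8, 128, 32768, 8), 1))  # 32K context: ~85.9 GB
```

The 32K figure alone exceeds an H100's 80 GB, before any weights are loaded.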
This is the core VRAM pressure problem. The KV cache grows linearly with both context length and batch size. Model weights are fixed. So as context length or concurrency increases, the KV cache eventually crowds out everything else. For a deeper treatment of the compression side of the problem, see the separate guide on KV cache optimization techniques, including PagedAttention, FP8 quantization, and CPU offloading.
Memory Bandwidth vs Compute: The Roofline Model for LLM Inference
The roofline model gives a concrete way to reason about whether adding compute or adding bandwidth will help your workload.
For a GPU, the roofline is set by the ratio of peak FLOPS to peak bandwidth. The H100 SXM5 has:
- Peak FP16 FLOPS: 1,979 TFLOPS
- Peak HBM bandwidth: 3.35 TB/s
- Roofline ridge point: 1,979 / 3.35 = ~591 FLOPs per byte
Any workload with arithmetic intensity below 591 FLOPs/byte is memory-bound on an H100 SXM5.
The H200 SXM5 has the same compute as the H100 SXM5 but 43% more bandwidth:
- Peak FP16 FLOPS: 1,979 TFLOPS (identical)
- Peak HBM3e bandwidth: 4.8 TB/s
- Roofline ridge point: 1,979 / 4.8 = ~412 FLOPs per byte
The H200 costs more per hour than the H100 for a strictly equivalent FLOPS budget. But for memory-bound decode workloads, it delivers 43% more token throughput at the same batch size. This is why the H200 exists: it is a memory bandwidth upgrade wrapped in H100 compute.
At batch size 1, decode arithmetic intensity is roughly 1-2 FLOPs per byte for most transformer architectures. That is roughly 300-600x below the H100 ridge point. Batch size shifts this meaningfully. At batch size 64, arithmetic intensity climbs toward 50-100 FLOPs/byte for a 70B model. At batch size 512+, some workloads begin to approach compute-bound territory on H100. The inflection point depends on model architecture and precision, but for most production deployments serving real-time requests, batch sizes rarely exceed 16-32, keeping decode firmly memory-bound.
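A back-of-envelope way to see where batch size crosses the ridge point, under the simplifying assumption that decode does ~2 FLOPs per parameter per request against a single 2-byte weight read (this ignores KV cache traffic and attention, so treat it as a first-order estimate):

```python
H100_RIDGE = 1979 / 3.35   # ~591 FLOPs/byte
H200_RIDGE = 1979 / 4.8    # ~412 FLOPs/byte

def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Rough decode arithmetic intensity: ~2 FLOPs per parameter per request,
    amortized over one read of each parameter. Ignores KV cache and attention."""
    return 2 * batch_size / bytes_per_param

for b in (1, 8, 64, 512):
    ai = decode_intensity(b)
    bound = "memory-bound" if ai < H100_RIDGE else "compute-bound"
    print(f"batch {b:>3}: ~{ai:.0f} FLOPs/byte -> {bound} on H100")
```

Even batch 512 sits just under the H100 ridge in this simplified model, which matches the observation that decode only approaches compute-bound territory at very large batches.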
The practical implication: for latency-sensitive inference (low TPOT, small batch sizes), GPU selection should prioritize bandwidth per dollar, not FLOPS per dollar.
Practical Solutions: HBM3e, KV Cache Tiering, NVMe Offloading, and NVIDIA ICMS
HBM3e: The Bandwidth Upgrade Path
The most direct fix for a memory-bound workload is a GPU with more bandwidth. The current memory hierarchy for data center inference GPUs:
- H100 SXM5: 80 GB HBM3 at 3.35 TB/s
- H200 SXM5: 141 GB HBM3e at 4.8 TB/s (43% more bandwidth than H100, 76% more capacity)
- B200 SXM: 192 GB HBM3e at ~8 TB/s (138% more bandwidth than H100)
- MI300X: 192 GB HBM3 at 5.3 TB/s (58% more bandwidth than H100)
When choosing between these for inference, the question is whether the bandwidth improvement justifies the cost difference for your workload. If TPOT is your constraint and your batch size is under 32, the H200's 43% bandwidth increase translates directly to 43% more tokens per second. Whether that throughput gain reduces cost per token depends on your actual batch size and the hourly price premium: a 43% throughput gain only reduces cost per token if the hourly premium is also below 43%.
KV Cache Quantization
Before upgrading hardware, reducing the KV cache footprint per token is the cheapest intervention. FP8 KV cache quantization cuts bandwidth consumption by 2x vs FP16 because you move half as many bytes per token step.
- Setting `--kv-cache-dtype fp8` in vLLM on H100/H200: ~2x reduction in KV cache bandwidth pressure
- Blackwell FP4 KV cache support is in development for vLLM (tracked in issue #32220) and expected to deliver ~4x reduction vs FP16 once merged
Accuracy degradation with an FP8 KV cache is typically under 1-2% on standard benchmarks for most tasks. FP4 KV introduces more rounding error but remains acceptable for chat and summarization workloads. For Blackwell-specific quantization options, see the FP4 quantization guide.
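One way to see why KV quantization helps bandwidth as well as capacity: the per-token KV footprint shrinks linearly with element width. A sketch using the Llama 3 70B geometry from earlier (the helper name is illustrative, and the FP4 row is prospective since vLLM support is still pending):

```python
def kv_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_el: float = 2) -> float:
    """Bytes of KV cache written (and later re-read each decode step)
    per token of context, for one request."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el

for label, nbytes in (("FP16", 2), ("FP8", 1), ("FP4", 0.5)):
    kib = kv_bytes_per_token(bytes_per_el=nbytes) / 1024
    print(f"{label}: {kib:.0f} KiB per token")  # 320 / 160 / 80 KiB
```

Every decode step re-reads the KV entries for all attended tokens, so halving bytes per element halves that portion of the bandwidth bill, not just the VRAM footprint.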
NVMe KV Cache Offloading
When VRAM is exhausted by the KV cache, NVMe offloading extends effective context length by spilling older KV entries to fast NVMe storage while keeping the active context window in HBM.
LMCache is the main open-source implementation. It supports offloading to CPU memory, NVMe, Redis, S3/Ceph, and remote servers, maintaining a fast HBM tier for actively attended tokens and a slower secondary tier for tokens earlier in the context that are accessed less frequently. NVMe read latency adds roughly 50-200 microseconds per cache miss, compared to sub-microsecond HBM reads, so this is viable only when the NVMe-resident tokens are cold enough that they are rarely needed in a single decode step.
For long-context workloads (100K+ tokens per request) with many idle tokens from earlier in the conversation, NVMe offloading can double or triple the effective context per GPU with acceptable latency overhead. For more details on the implementation and benchmarks, see the NVMe KV cache offloading guide.
NVIDIA ICMS (Inference Context Memory Storage)
NVIDIA ICMS (Inference Context Memory Storage) is a hardware infrastructure platform based on the NVIDIA STX reference architecture. It uses BlueField-4 DPUs to attach NVMe flash storage over Ethernet, creating a "G3.5" memory tier that sits between host DRAM and shared storage. This flash tier is designed to hold KV cache at the pod level, across multiple inference nodes sharing the same storage fabric.
The software stack coordinating ICMS is NVIDIA Dynamo (the inference scheduler), NIXL (the cross-node memory transfer library), DOCA Memos, and Grove. These components handle KV cache routing, block promotion/demotion between HBM and the flash tier, and cross-node cache reuse. TensorRT-LLM and Triton are inference frameworks that run on top of this infrastructure; ICMS itself is not a feature of those frameworks.
The target use case is multi-node inference deployments serving long-context workloads where KV cache cannot fit entirely in HBM across the cluster. By pooling flash storage across nodes via ICMS, a pod can maintain a much larger effective KV cache without requiring every token's KV entries to reside in HBM on any single node. For setup details, refer to NVIDIA's Dynamo and ICMS documentation at developer.nvidia.com.
How to Diagnose Memory-Bound Inference: Profiling Tools and Metrics
Before tuning anything, confirm you are actually memory-bound. The two primary metrics:
MBU% (Memory Bandwidth Utilization): The fraction of peak HBM bandwidth being consumed. You want this as high as possible during inference. If MBU% is above 80-90% consistently, you are memory-bound. If MBU% is 40-60% while SM utilization is also 40-60%, you may have a different bottleneck (CPU preprocessing, Python overhead, or network latency).
SM Utilization: The fraction of GPU streaming multiprocessors doing work. If SM utilization is low (under 50%) while MBU% is high, the compute is starved of data. This confirms memory-bound status.
Decision tree:
- MBU% > 80% and SM% < 60%: memory-bound. Prioritize bandwidth (better GPU, KV quantization, NVMe tiering).
- SM% > 80% and MBU% < 60%: compute-bound. Adding GPUs or tensor parallelism will help.
- Both low: check CPU bottlenecks, tokenizer, Python overhead, or network I/O.
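The decision tree above translates directly into a triage helper; a sketch with the same thresholds (the function name and exact cutoffs are illustrative, not from any monitoring tool):

```python
def diagnose(mbu_pct: float, sm_pct: float) -> str:
    """Map MBU% and SM utilization readings to the decision tree."""
    if mbu_pct > 80 and sm_pct < 60:
        return "memory-bound: prioritize bandwidth (better GPU, KV quantization, NVMe tiering)"
    if sm_pct > 80 and mbu_pct < 60:
        return "compute-bound: add GPUs or tensor parallelism"
    if mbu_pct < 60 and sm_pct < 60:
        return "check CPU bottlenecks, tokenizer, Python overhead, or network I/O"
    return "ambiguous: profile per-kernel with Nsight Systems"

print(diagnose(88, 45))  # memory-bound case
```

In production you would feed this from the dcgm-exporter metrics mentioned below rather than eyeballing nvidia-smi.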
Quick MBU snapshot with nvidia-smi:
```shell
# Quick MBU snapshot: sample utilization once per second, ten times
nvidia-smi dmon -s u -d 1 -c 10
```

The `-s u` flag reports utilization percentages (memory controller activity), not memory bandwidth directly. Run this while a single-request inference is in progress and observe the `mem` column as a proxy for memory pressure.
Production monitoring with dcgm-exporter:
For Prometheus/Grafana-based monitoring, dcgm-exporter exports DCGM_FI_PROF_DRAM_ACTIVE (memory bandwidth utilization) and DCGM_FI_DEV_GPU_UTIL (SM utilization) as metrics. Alert when the ratio of memory bandwidth utilization to SM utilization consistently exceeds 2:1 during active inference.
Deep per-kernel profiling with NVIDIA Nsight Systems:
Nsight Systems shows the timeline of every CUDA kernel launch, including the memory operations and their duration. For a memory-bound workload, you will see the memory copy or HBM read operations occupying most of the timeline rather than compute kernels. This is the definitive diagnosis tool when the aggregate metrics are ambiguous.
For autoscaling strategies once you have identified the bottleneck, see the guide on inference-time compute scaling on GPU cloud.
Right-Sizing Your GPU: H200 vs B200 vs MI300X Memory Comparison
| GPU | VRAM | HBM Gen | Bandwidth | FP16 TFLOPS | On-Demand (Spheron) | Spot (Spheron) |
|---|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | HBM3 | 3.35 TB/s | 1,979 | ~$2.98/hr | ~$0.80/hr |
| H200 SXM5 | 141 GB | HBM3e | 4.8 TB/s | 1,979 | ~$4.50/hr | ~$1.19/hr |
| B200 SXM | 192 GB | HBM3e | ~8 TB/s | ~4,500 (with sparsity) | Check pricing | Check pricing |
| MI300X | 192 GB | HBM3 | 5.3 TB/s | ~1,307 | Check pricing | Check pricing |
Pricing fluctuates based on GPU availability. The prices above were captured on 11 Apr 2026 and may have changed since. Check current GPU pricing for live rates.
Which GPU for which workload:
H200 or MI300X over H100 (batch size 1-8, latency-sensitive): For real-time API serving at small batch sizes, the bandwidth advantage of H200 (4.8 TB/s vs 3.35 TB/s) maps directly to lower TPOT. If your inference is memory-bound, the H200 delivers proportionally better latency per token at the same model size. The MI300X's 192 GB VRAM at 5.3 TB/s makes it competitive for models in the 100-140 GB range that don't fit on a single H200.
MI300X for large-context or multi-model serving: 192 GB HBM3 means you can hold a 70B FP16 model (140 GB) with meaningful KV cache headroom on a single GPU, or run two smaller models simultaneously without tensor parallelism. The 5.3 TB/s bandwidth is higher than H200's 4.8 TB/s. For serving 70B models at long contexts without multi-GPU coordination overhead, the MI300X is a strong option. Since there is no dedicated MI300X rental page yet, check the GPU pricing page for current availability.
B200 for maximum throughput at large batch: The B200's ~8 TB/s bandwidth means it moves data at 2.4x the rate of an H100. For high-throughput offline inference at larger batch sizes, or for any workload that needs to serve many concurrent requests from a single GPU, the B200 is the current peak bandwidth option. It also has 192 GB VRAM. For Blackwell-specific serving with FP4 quantization, the B200 can serve roughly 4x the tokens per second of an H100 at comparable batch sizes.
Cost per token framing: A higher hourly rate GPU costs less per generated token only when its throughput advantage exceeds its price premium. At these prices, an H200 at $4.50/hr must serve more than ~181 tokens/second to undercut an H100 at $2.98/hr serving 120 tokens/second ($4.50 / $2.98 × 120 ≈ 181 tok/s crossover). If the H200 reaches 240 tokens/second at a favorable batch size (2x the H100), it generates 864,000 tokens per hour for $4.50, about $5.21 per million tokens, versus the H100's 432,000 tokens per hour for $2.98, about $6.90 per million. But on the bandwidth gain alone (43% more than the H100, roughly 172 tok/s), the H200 lands at about $7.27 per million tokens, which is more expensive. Whether you clear the crossover depends on your actual batch size and whether your workload is saturating the bandwidth ceiling.
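Normalizing to dollars per million generated tokens makes the comparison mechanical; a sketch using the illustrative prices and throughputs from the text (the function name is an assumption, not from any billing API):

```python
def cost_per_million_tokens(price_per_hr: float, tok_per_s: float) -> float:
    """Dollars per 1M generated tokens at a sustained decode rate."""
    tokens_per_hr = tok_per_s * 3600
    return price_per_hr / tokens_per_hr * 1e6

h100    = cost_per_million_tokens(2.98, 120)  # baseline H100
h200_bw = cost_per_million_tokens(4.50, 172)  # bandwidth-only gain (+43%)
h200_2x = cost_per_million_tokens(4.50, 240)  # 2x throughput at better batch
print(round(h100, 2), round(h200_bw, 2), round(h200_2x, 2))  # 6.9 7.27 5.21
```

The H200 only wins the cost-per-token comparison in the third scenario, which is why measuring throughput at your real batch size matters more than the spec sheet.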
For detailed comparisons of AMD vs NVIDIA for inference workloads, see AMD MI300X vs NVIDIA H200 and NVIDIA H100 vs H200.
Memory-Aware Infrastructure Planning on GPU Cloud
Sizing a GPU configuration for a memory-bound workload requires calculating VRAM demand before selecting hardware. The process:
Step 1: Calculate model weight VRAM
- FP16: parameters × 2 bytes (70B = 140 GB)
- FP8: parameters × 1 byte (70B = 70 GB)
- FP4: parameters × 0.5 bytes (70B = 35 GB, Blackwell only)
Step 2: Calculate KV cache VRAM at target load
Use the formula from earlier: 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes_per_element.
For Llama 3 70B at FP8 KV, 16K context, 16 concurrent requests:
- 2 × 80 × 8 × 128 × 16384 × 16 × 1 byte = ~43 GB
Step 3: Add headroom and select GPU
Total VRAM needed: 70 GB (FP8 weights) + 43 GB (KV cache) + ~5 GB headroom = ~118 GB. This points to H200 SXM5 (141 GB). A single H100 (80 GB) would not fit.
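Steps 1-3 combine into a single sizing estimate; a sketch reproducing the Llama 3 70B walkthrough above (the helper name and the 5 GB headroom default are assumptions from this example, not a vendor formula):

```python
def vram_needed_gb(params_b: float, weight_bytes: float,
                   layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, kv_bytes: float,
                   headroom_gb: float = 5.0) -> float:
    """Total VRAM estimate: weights + KV cache at target load + headroom."""
    weights = params_b * weight_bytes            # params in billions -> GB
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * kv_bytes / 1e9
    return weights + kv + headroom_gb

# Llama 3 70B: FP8 weights, FP8 KV cache, 16K context, 16 concurrent requests
need = vram_needed_gb(70, 1, 80, 8, 128, 16384, 16, 1)
print(round(need))  # ~118 GB -> fits an H200 (141 GB), not an H100 (80 GB)
```

Re-running the same call with FP16 weights and FP16 KV (weight_bytes=2, kv_bytes=2) pushes the total past even a single H200, which is where the multi-GPU considerations below come in.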
Multi-GPU considerations:
Tensor parallelism across 2 GPUs roughly doubles available VRAM for both weights and KV cache, but introduces inter-GPU communication on every forward pass. NVLink-connected GPUs (SXM5 variants) have 900 GB/s bidirectional NVLink bandwidth, which makes tensor parallelism across 2-4 GPUs viable for low-latency serving. PCIe-connected multi-GPU configurations have 128 GB/s bidirectional bandwidth (PCIe 5.0 x16), which can become the new bottleneck for large tensor parallel degrees.
For models exceeding a single GPU's VRAM, prefer 2-GPU SXM configurations over 4-GPU configurations to minimize communication overhead, unless throughput at large batch is the primary goal.
Spot vs on-demand for inference:
Spot instances are appropriate for batch or offline inference jobs where request queuing is acceptable and latency requirements are flexible. On-demand instances are appropriate for latency-sensitive real-time APIs where you cannot tolerate preemption. For mixed workloads, run on-demand instances for the baseline traffic and overflow to spot for batch jobs.
Rent an H200 or B200 on Spheron and measure TPOT at your actual batch size before committing to a configuration. The theoretical bandwidth advantage only materializes if your serving stack is actually utilizing the HBM bandwidth, which requires proper attention kernel selection and KV cache sizing.
Memory-bound inference is ultimately a GPU selection problem: if you are running H100s and still hitting latency ceilings, the fix is bandwidth, not more chips.
