The H200 is a memory upgrade, not a new architecture. Three numbers tell the whole story: 141 GB HBM3e, 4.8 TB/s memory bandwidth, and 1,979 TFLOPS FP8 dense. The first two are where the H200 actually earns its keep over H100. The third is identical on both GPUs because they share the same Hopper GH100 die. If you're evaluating whether bare-metal H200 SXM5 instances are worth it for your workload, everything you need to make that call is in this datasheet.
H200 Specs: Full Datasheet
Here's the complete H200 spec table from NVIDIA's datasheet:
| Spec | Value |
|---|---|
| Architecture | NVIDIA Hopper |
| VRAM | 141 GB HBM3e |
| Memory Bandwidth | 4.8 TB/s |
| Tensor Cores | 4th Generation |
| CUDA Cores | 16,896 |
| FP64 Performance | 34 TFLOPS |
| FP32 Performance | 67 TFLOPS |
| BF16/FP16 Performance | ~989 TFLOPS (dense) |
| FP8 Performance | ~1,979 TFLOPS (dense) |
| INT8 Performance | ~1,979 TOPS (dense) |
| TDP (SXM5) | 700W |
| TDP (NVL) | 600W |
| NVLink Bandwidth | 900 GB/s bidirectional per GPU |
| Form Factors | SXM5, NVL |
H200 SXM5 rents at $4.62/hr on-demand and $1.92/hr spot on Spheron as of 21 May 2026. For live pricing across configurations, see current GPU pricing.
141 GB HBM3e Memory: What the Capacity Gain Unlocks
The H100 ships with 80 GB of HBM3. The H200 ships with 141 GB of HBM3e. That's a 76% capacity increase on the same Hopper die.
Why it matters in practice:
70B models at FP16 fit on a single GPU. Llama 3.1 70B at FP16 precision requires approximately 140 GB of VRAM for model weights alone. On an H100, that model requires 2-way tensor parallelism split across two GPUs. On an H200, the weights fit on one card, though with only about 1 GB to spare, so the remaining KV cache space limits this setup to short contexts and small batches. Where it applies, single-card serving eliminates inter-GPU communication overhead on the critical decode path, reduces infrastructure complexity, and halves the GPU count for single-stream 70B serving.
KV cache headroom for long-context inference. For long-context requests, KV cache can consume a significant portion of available VRAM. At 128K context length on a 70B FP8 model (weights ~70 GB), KV cache for a single request consumes roughly 30-50 GB depending on KV cache precision and attention configuration. An H100 running 70B at FP8 has only about 10 GB left after weights, while an H200 has about 71 GB. That extra 61 GB of headroom changes the math entirely.
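The KV cache figure above can be sanity-checked with a few lines of arithmetic. The sketch below is a rough estimate, not a measurement: the layer count, KV head count, and head dimension are assumed values for a Llama 3.1 70B style model with grouped-query attention and are not stated in this article, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache held by one request at the given context length."""
    # Each token stores one key vector and one value vector
    # per layer and per KV head, hence the leading factor of 2.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


# Assumed model shape (Llama 3.1 70B style, grouped-query attention).
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

context = 128 * 1024  # 128K tokens
fp16_kv = kv_cache_bytes(context, bytes_per_value=2, **LLAMA_70B)
fp8_kv = kv_cache_bytes(context, bytes_per_value=1, **LLAMA_70B)

print(f"FP16 KV cache: {fp16_kv / 1e9:.1f} GB")
print(f"FP8 KV cache:  {fp8_kv / 1e9:.1f} GB")
```

Under these assumptions a single 128K request holds about 43 GB of KV cache at FP16 and about 21 GB at FP8, so the FP16 case does not fit next to 70 GB of weights on an 80 GB card but does on a 141 GB card.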
The table below shows minimum GPU count needed by model size:
| Model Size | Precision | H100 SXM5 (80 GB) | H200 SXM5 (141 GB) |
|---|---|---|---|
| 7B | FP16 | 1 GPU | 1 GPU |
| 7B | FP8 | 1 GPU | 1 GPU |
| 13B | FP16 | 1 GPU | 1 GPU |
| 13B | FP8 | 1 GPU | 1 GPU |
| 34B | FP16 | 1 GPU | 1 GPU |
| 34B | FP8 | 1 GPU | 1 GPU |
| 70B | FP16 | 2 GPUs (tensor parallel) | 1 GPU |
| 70B | FP8 | 1 GPU (tight, ~70 GB) | 1 GPU (with KV headroom) |
| 100B+ | FP16 | 4+ GPUs | 2 GPUs |
| 100B+ | FP8 | 2 GPUs | 1-2 GPUs |
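The GPU counts in the table follow from simple weight-size arithmetic. The sketch below reproduces them under two assumptions that are mine rather than the article's: it counts model weights only (no KV cache, activations, or framework overhead), and it rounds the GPU count up to a power of two, which is a common constraint for tensor-parallel degrees. `weight_gb` and `min_gpus` are hypothetical helper names.

```python
import math


def weight_gb(params_billion, bytes_per_param):
    """Weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def min_gpus(params_billion, bytes_per_param, vram_gb):
    """Smallest power-of-two GPU count whose combined VRAM holds the weights."""
    needed = math.ceil(weight_gb(params_billion, bytes_per_param) / vram_gb)
    # Round up to the next power of two (assumed tensor-parallel constraint).
    return 1 << (needed - 1).bit_length()


for params, precision, nbytes in [(34, "FP16", 2), (70, "FP16", 2),
                                  (70, "FP8", 1), (100, "FP16", 2), (100, "FP8", 1)]:
    h100 = min_gpus(params, nbytes, vram_gb=80)
    h200 = min_gpus(params, nbytes, vram_gb=141)
    print(f"{params}B {precision}: H100 x{h100}, H200 x{h200}")
```

Because this ignores KV cache and runtime overhead, treat the result as a lower bound; a configuration that only just fits, such as 70B FP8 on one H100, leaves little room for actual serving.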
For a detailed walkthrough of KV cache sizing and how it affects GPU selection, see the rent NVIDIA H200 GPUs guide.
4.8 TB/s Memory Bandwidth: Inference Throughput Impact
H200: 4.8 TB/s. H100 SXM5: 3.35 TB/s. That's a 43% bandwidth increase.
In autoregressive LLM decoding, each decode step reads every model weight from VRAM to generate one token per sequence. At small batch sizes, the GPU is mostly waiting on memory transfers rather than computing, so memory bandwidth determines tokens per second directly. Throughput scales roughly linearly with bandwidth until larger batches push the GPU into compute saturation.
Estimated throughput comparison on Llama 2 70B at FP8 (based on bandwidth ratios and MLPerf v4.0 data):
| GPU | Memory Bandwidth | Batch=1 (tok/s) | Batch=8 (tok/s) | Batch=32 (tok/s) |
|---|---|---|---|---|
| H100 SXM5 | 3.35 TB/s | ~300 | ~900 | ~2,200 |
| H200 SXM5 | 4.8 TB/s | ~430 | ~1,290 | ~3,100 |
Estimates based on memory bandwidth ratios and MLPerf Inference v4.0 offline results showing H200 at ~42% higher throughput than H100 on Llama 2 70B. Actual figures depend on batch configuration, KV cache settings, and the inference engine used.
MLPerf Inference v4.0 shows the H200 delivering approximately 42% higher throughput than the H100 SXM5 on Llama 2 70B offline mode. That number comes almost entirely from bandwidth: the Tensor Cores are identical. For a deeper benchmark comparison between the two, see NVIDIA H100 vs H200.
At larger batch sizes, the workload shifts toward compute-bound territory and the bandwidth advantage narrows. At batch=1, the H200's 43% bandwidth lift translates directly into 43% more tokens per second. At very large batches where the GPU becomes compute-bound, both GPUs perform similarly because their Tensor Core throughput is identical.
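The shift from bandwidth-bound to compute-bound can be illustrated with a simple roofline model. This is an idealized sketch, not a benchmark: it assumes a step costs the larger of weight-read time and matmul time, uses 2 FLOPs per parameter per token, and ignores KV cache reads, attention, and kernel overhead, so real crossover points arrive at smaller batches than this model suggests. The `Gpu` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    bandwidth_tb_s: float  # memory bandwidth in TB/s
    fp8_tflops: float      # dense FP8 Tensor Core throughput


H100 = Gpu("H100 SXM5", 3.35, 1979.0)
H200 = Gpu("H200 SXM5", 4.8, 1979.0)


def decode_step_seconds(gpu, params_billion, batch, bytes_per_param=1.0):
    """Idealized time for one decode step: max of memory time and compute time."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    flops = 2.0 * params_billion * 1e9 * batch  # ~2 FLOPs per parameter per token
    memory_time = weight_bytes / (gpu.bandwidth_tb_s * 1e12)
    compute_time = flops / (gpu.fp8_tflops * 1e12)
    return max(memory_time, compute_time)


def h200_speedup(params_billion, batch):
    """H200 throughput relative to H100 at the same batch size."""
    return (decode_step_seconds(H100, params_billion, batch)
            / decode_step_seconds(H200, params_billion, batch))


for batch in (1, 32, 256, 512):
    print(f"batch={batch:>3}: H200 is {h200_speedup(70, batch):.2f}x H100")
```

In this model the speedup equals the bandwidth ratio (about 1.43x) while both GPUs are memory-bound, shrinks once the H200 becomes compute-bound, and reaches 1.00x when both are.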
Tensor Core Performance: FP8, FP16, BF16, INT8
The H200 and H100 share the same Hopper die and the same 4th-generation Tensor Cores, so peak compute TFLOPS are identical.
| Precision | H200 Dense | H100 Dense | H200 Sparse (2:4) | H100 Sparse (2:4) |
|---|---|---|---|---|
| FP64 | 34 TFLOPS | 34 TFLOPS | N/A | N/A |
| FP32 | 67 TFLOPS | 67 TFLOPS | N/A | N/A |
| BF16 | ~989 TFLOPS | ~989 TFLOPS | ~1,979 TFLOPS | ~1,979 TFLOPS |
| FP16 | ~989 TFLOPS | ~989 TFLOPS | ~1,979 TFLOPS | ~1,979 TFLOPS |
| FP8 | ~1,979 TFLOPS | ~1,979 TFLOPS | ~3,958 TFLOPS | ~3,958 TFLOPS |
| INT8 | ~1,979 TOPS | ~1,979 TOPS | ~3,958 TOPS | ~3,958 TOPS |
Dense figures reflect standard Tensor Core throughput. Sparse figures reflect 2:4 structured sparsity, which requires sparse weight patterns to activate.
The H200's throughput advantage is memory-driven. For the prefill phase (the compute-bound forward pass over the full prompt), both GPUs deliver the same effective TFLOPS at the same batch size. The gap opens in the decode phase, where each step is memory-bound and bandwidth dictates speed.
When compute TFLOPS are the bottleneck: large-batch prefill operations, mixture-of-experts (MoE) routing at high batch sizes, and distributed training where the compute pipeline is fully saturated. In these cases H200 and H100 perform on par.
Note that neither H200 nor H100 supports FP4. FP4 requires Blackwell 5th-generation Tensor Cores. For workloads where FP4 throughput matters, see the NVIDIA B200 complete guide for a full breakdown of FP4 capabilities and the throughput jump it enables.
SXM5 vs NVL: Form Factors and TDP Envelopes
The H200 ships in two form factors:
| Spec | H200 SXM5 | H200 NVL |
|---|---|---|
| VRAM | 141 GB HBM3e | 141 GB HBM3e |
| Memory Bandwidth | 4.8 TB/s | 4.8 TB/s |
| TDP | 700W | 600W |
| Slot Type | NVLink SXM module | PCIe add-in card |
| Use Case | 8-GPU HGX/DGX nodes, NVSwitch fabric | 1-4 GPU deployments, standard server slots |
Both variants share the same Hopper GH100 die and identical HBM3e memory subsystem. The difference is packaging and power delivery.
SXM5 mounts on the NVLink Switch System board used in HGX H200 and DGX H200 nodes. Each GPU connects to the NVSwitch fabric at 900 GB/s bidirectional, enabling full-mesh all-to-all communication between all 8 GPUs in a node. This is the standard for large-scale distributed training and multi-GPU inference. SXM5 draws 700W per GPU and requires liquid cooling for sustained peak workloads in most configurations.
NVL is the PCIe add-in card variant. It drops TDP to 600W, fits into standard PCIe 5.0 slots, and works in commodity server hardware without a custom NVLink board. Trade-off: no NVSwitch fabric. NVIDIA's datasheet lists 2- or 4-way NVLink bridges for NVL cards at 900 GB/s per GPU, but any GPU-to-GPU traffic outside a bridged group flows over PCIe 5.0 (~128 GB/s), roughly 7x lower bandwidth than NVLink 4. NVL works well for single-GPU inference, bridged 2- or 4-GPU groups, or loosely-coupled multi-GPU setups where inter-GPU bandwidth isn't the bottleneck.
Air cooling is feasible for H200 NVL in well-ventilated racks. H200 SXM5 typically requires liquid cooling for sustained maximum throughput.
NVLink 4 and NVSwitch Fabric: 8-GPU Node Topology
Both H200 SXM5 and H100 SXM5 implement NVLink 4, providing 900 GB/s bidirectional bandwidth per GPU. In a standard 8-GPU HGX H200 node, NVSwitch creates a full-mesh all-to-all fabric. Every GPU can read from and write to every other GPU at 900 GB/s simultaneously, with no CPU or PCIe involvement.
What this enables in practice:
Tensor parallelism with minimal overhead. Running a 70B model across 8 H200s with TP=8 splits the model into 8 shards. Each all-reduce operation during the attention and feedforward layers transfers data across NVLink. At 900 GB/s, these transfers take microseconds and typically represent under 10% of total compute time for well-tuned configurations.
MoE expert routing. Mixture-of-experts models like Mixtral and DeepSeek V3 route each token to a subset of experts that may reside on different GPUs. NVLink allows this routing to happen at memory bandwidth speeds rather than being bottlenecked by PCIe.
Pipeline parallelism for 100B+ models. When a single node isn't enough, 8-GPU H200 nodes connected via InfiniBand (ConnectX-7, 400 Gb/s NDR) handle pipeline stages across nodes. NVLink handles intra-node communication; InfiniBand handles inter-node.
For H200 NVL cards that communicate over PCIe rather than an NVLink bridge, bandwidth tops out at PCIe 5.0 x16 (~128 GB/s bidirectional per GPU). This is adequate for loosely-coupled inference where each GPU runs a different model, but inadequate for tight tensor-parallel inference where all-reduce operations dominate.
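To put the 900 GB/s and ~128 GB/s figures in context, the sketch below estimates the time for one tensor-parallel all-reduce over each link. It assumes a bandwidth-only ring all-reduce, in which each GPU sends 2(N-1)/N times the message size, and it ignores latency and the NVSwitch-specific algorithms a library such as NCCL may choose instead. The message size (a 4,096-token batch with hidden size 8,192 in FP16) and the function name are illustrative assumptions, not figures from this article.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, bidirectional_gb_s):
    """Bandwidth-only estimate of a ring all-reduce (no latency term)."""
    # A bidirectional figure counts both directions; a ring sends one way.
    one_way_bytes_per_s = bidirectional_gb_s / 2 * 1e9
    # Reduce-scatter plus all-gather: each GPU sends 2 * (N - 1) / N of the message.
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / one_way_bytes_per_s


# Assumed activation tensor: 4,096 tokens x hidden size 8,192 x 2 bytes (FP16).
message = 4096 * 8192 * 2

nvlink = ring_allreduce_seconds(message, n_gpus=8, bidirectional_gb_s=900)
pcie = ring_allreduce_seconds(message, n_gpus=8, bidirectional_gb_s=128)

print(f"NVLink 4: {nvlink * 1e6:.0f} us per all-reduce")
print(f"PCIe 5.0: {pcie * 1e6:.0f} us per all-reduce")
print(f"PCIe is {pcie / nvlink:.1f}x slower")
```

A transformer layer typically issues two such all-reduces under tensor parallelism, so the per-operation gap compounds across every layer of the model.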
H200 vs H100 vs B200: Decision Framework
Three-way comparison across the specs that determine GPU selection:
| Dimension | H100 SXM5 | H200 SXM5 | B200 SXM |
|---|---|---|---|
| VRAM | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP8 Dense (TFLOPS) | ~1,979 | ~1,979 | 4,500 |
| FP4 Dense (TFLOPS) | N/A | N/A | 9,000 |
| TDP | 700W | 700W | 1,000W |
| Spheron On-Demand | $2.64/hr | $4.62/hr | $7.21/hr |
| Spheron Spot | $1.66/hr | $1.92/hr | $3.77/hr |
Decision tree:
Does your model require more than 80 GB VRAM?
├─ NO → H100 is sufficient. More economical for 7B-34B models.
└─ YES
   ├─ Does your model fit in 141 GB?
   │  ├─ YES (70B at FP16 = ~140 GB, or 30B-70B with large KV cache) → H200.
   │  └─ NO (100B+ at FP16, or 70B+ with very large batch) → B200 (192 GB).
   │
   ├─ Do you need FP4 inference?
   │  ├─ YES → B200. H200 has no FP4 Tensor Cores.
   │  └─ NO → H200 or B200 based on VRAM requirement above.
   │
   └─ Is software stack maturity important?
      ├─ YES → H200. Hopper software stack is more mature than Blackwell.
      └─ NO → B200 if throughput or VRAM is the bottleneck.
Use H200 when:
- Your model is 70B at FP16 and you want single-GPU serving without tensor parallelism
- KV cache headroom matters more than raw TFLOPS (long contexts, multi-model cohosting)
- You want Hopper software compatibility with all existing CUDA, vLLM, SGLang, and TensorRT-LLM code
- H100 spot is unavailable or 80 GB VRAM is the constraint
- You're running multiple smaller models on one GPU (e.g., a 30B + 7B + embedding model stack)
Consider B200 when:
- Models are 100B+ and won't fit in 141 GB at your required precision
- FP4 inference is viable and throughput is the primary metric (B200's 9,000 TFLOPS FP4 is double its own FP8 peak)
- You're running high-traffic inference APIs where cost-per-token justifies the premium
- 128K+ context windows require more than 141 GB of total VRAM budget
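The decision tree and the two checklists above can be condensed into a small function. This is one possible reading, not a definitive rule: the ordering of the checks and the `throughput_bound` flag are my assumptions, `recommend_gpu` is a hypothetical name, and footprints above 192 GB need multiple GPUs, which this sketch does not model.

```python
def recommend_gpu(required_vram_gb, needs_fp4=False,
                  prefers_mature_stack=True, throughput_bound=False):
    """Pick a GPU from the per-GPU VRAM footprint and workload flags."""
    # 80 GB or less: the H100 is sufficient and cheaper.
    if required_vram_gb <= 80:
        return "H100"
    # FP4 needs Blackwell; more than 141 GB needs the B200's 192 GB.
    if needs_fp4 or required_vram_gb > 141:
        return "B200"
    # Fits in 141 GB without FP4: default to the mature Hopper stack
    # unless raw throughput is the bottleneck and maturity is not a concern.
    if prefers_mature_stack or not throughput_bound:
        return "H200"
    return "B200"


print(recommend_gpu(40))                   # 13B-class model at FP16
print(recommend_gpu(140))                  # 70B at FP16
print(recommend_gpu(160))                  # exceeds 141 GB
print(recommend_gpu(120, needs_fp4=True))  # FP4 requirement
```

The four calls return H100, H200, B200, and B200 respectively.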
H200 Cloud Pricing
Current H200 pricing across providers as of 21 May 2026:
| Provider | On-Demand $/hr | Spot $/hr | Notes |
|---|---|---|---|
| Spheron | $4.62 | $1.92 | Per-minute billing, full root access |
| GMI Cloud | $2.60 | N/A | On-demand |
| Nebius | $3.50 | N/A | On-demand |
| RunPod | $3.59 | N/A | Secure Cloud |
| Jarvislabs | $3.80 | N/A | On-demand |
| AWS (p5e) | ~$4.98 | N/A | Estimated; spot not widely available |
| Azure | ~$13.78 | N/A | Estimated |
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 21 May 2026 and may have changed. Check current GPU pricing for live rates.
Spheron offers H200 at spot and on-demand pricing with per-minute billing and no minimum commitment. This makes it practical for short experimentation runs at 70B scale without paying for hours of idle time. On-demand availability means no interruption risk for production serving workloads that can't tolerate spot eviction.
For teams still on H100 who are evaluating the upgrade economics, the break-even calculation depends on whether your specific workload is memory-bandwidth-bound. A 70B FP8 model at batch=1 on H100 runs around 300 tok/s. The same model at batch=1 on H200 runs around 430 tok/s, a 43% throughput increase for a ~16% spot price premium at current Spheron rates ($1.92/hr H200 spot vs $1.66/hr H100 spot). For memory-bound workloads, H200 wins on cost-per-token.
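The cost-per-token comparison above is straightforward to reproduce. The sketch below uses only the spot prices and the estimated batch=1 throughput figures quoted in this article; the function name is hypothetical, and the result holds only as far as those throughput estimates do.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollar cost to generate one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Spot prices and estimated 70B FP8 batch=1 throughput from this article.
h100 = cost_per_million_tokens(price_per_hour=1.66, tokens_per_second=300)
h200 = cost_per_million_tokens(price_per_hour=1.92, tokens_per_second=430)

print(f"H100 spot: ${h100:.2f} per 1M tokens")
print(f"H200 spot: ${h200:.2f} per 1M tokens")
print(f"H200 saving: {(1 - h200 / h100) * 100:.0f}%")
```

With these inputs the H200 comes out at roughly $1.24 per million tokens against $1.54 on the H100, about 19% cheaper. Swapping in on-demand prices or measured throughput changes the outcome, so rerun the numbers for your own workload.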
H200 GPUs are available on Spheron from $1.92/hr spot and $4.62/hr on-demand, with per-minute billing and no minimum commitments. For teams choosing between H200 and B200, see the full H200 vs B200 comparison.
Frequently Asked Questions
The NVIDIA H200 has 141 GB of HBM3e memory, up from 80 GB on the H100. This 76% increase in VRAM allows 70B parameter models to run at FP16 on a single GPU and enables larger KV caches for long-context inference workloads up to 128K tokens.
The H200 delivers 4.8 TB/s memory bandwidth, up from 3.35 TB/s on the H100 SXM5. This 43% bandwidth improvement is the primary driver of H200's higher inference throughput on memory-bandwidth-bound workloads like autoregressive LLM decoding.
The H200 delivers approximately 1,979 TFLOPS FP8 dense, the same as the H100 because both use the same Hopper die. The H200's performance advantage comes from memory rather than compute: 141 GB vs 80 GB of capacity and 4.8 TB/s vs 3.35 TB/s of bandwidth. These allow larger batch sizes and reduce memory-bandwidth stalls, translating to higher effective throughput than TFLOPS alone suggest.
Both variants share the same Hopper GPU die and HBM3e memory specs. The SXM5 is a high-density module on the NVLink Switch System board, used in 8-GPU HGX H200 nodes with full NVSwitch fabric for GPU-to-GPU bandwidth of 900 GB/s per GPU. The NVL is a PCIe add-in card variant designed for standard PCIe server slots with lower power delivery requirements. SXM5 is preferred for distributed inference and training; NVL suits single- or dual-GPU deployments in standard server hardware.
As of 21 May 2026, Spheron offers H200 GPU rental at $1.92/hr spot and $4.62/hr on-demand. Spot pricing suits interruptible batch inference and experimentation. On-demand is available for production workloads with no interruption risk. Check current GPU pricing at spheron.network/pricing/ for live rates.
Choose H200 when your model requires 80-141 GB VRAM (e.g., 70B at FP16 requires ~140 GB), when you need mature Hopper software compatibility with maximum memory, or when H100 spot pricing is unavailable. Choose H100 for models under 30B parameters where 80 GB is sufficient and lower cost is the priority. Choose B200 when you need FP4 inference, 192 GB+ VRAM for 100B+ models, or higher raw throughput at scale.
