Three H100 variants exist and they are not interchangeable. The NVL module ships with 94 GB HBM3 per GPU and an NVLink bridge for two-GPU pairing. The SXM5 ships with 80 GB HBM3 in a full NVSwitch server with up to 8 GPUs. The PCIe variant ships with 80 GB HBM2e and connects over PCIe Gen5. The right one depends entirely on whether your bottleneck is memory capacity, inter-GPU bandwidth, or per-GPU cost.
If you need to rent H100 and are not sure which variant to pick, this guide covers the full spec delta and workload fit for each form factor.
For generation comparisons, see our A100 vs H100 and H100 vs H200 breakdowns.
TL;DR Decision Matrix
| Workload | Best Form Factor | Reason |
|---|---|---|
| 70B inference, long context, large KV cache | H100 NVL | 94 GB HBM3 fits more context without tensor parallelism |
| Multi-GPU FSDP or tensor-parallel training | H100 SXM5 | Full NVSwitch 900 GB/s all-to-all bandwidth |
| Single-node fine-tuning, cost-sensitive | H100 PCIe | Lower per-GPU rate, PCIe Gen5 sufficient for small jobs |
| MIG partitioned multi-tenant inference | H100 SXM5 or PCIe | Both support 7-way MIG; NVL's slices are larger (~12 GB) but its two-GPU module yields fewer tenants per server |
H100 NVL Specifications
The H100 NVL is a dual-slot PCIe Gen5 card (full-height, full-length) containing two PCIe-form H100 GPUs joined by three NVLink 4 bridges. Unlike SXM5, it slots into standard PCIe data center infrastructure without requiring a specialized SXM board, which is one of NVL's practical advantages for operators who want NVLink-paired inference without a full SXM server build-out. NVIDIA designed it for large inference workloads where 80 GB is not enough and 2-way NVLink bandwidth between the pair is sufficient. It is not a server interconnect the way NVSwitch is.
Key specs per GPU:
- 94 GB HBM3 (versus 80 GB on SXM5 and PCIe)
- ~3,938 GB/s memory bandwidth (higher than SXM5's 3,350 GB/s)
- NVLink bridge connecting the two paired GPUs at 600 GB/s bidirectional total
- 3,341 TFLOPS FP8 with sparsity (lower than SXM5's 3,958 TFLOPS)
- 835 TFLOPS TF32 with sparsity
- 16,896 CUDA cores
- 350-400W per GPU (configurable), 700-800W total for the two-GPU pair
- 7-way MIG support: approximately 12 GB per slice (versus 10 GB on SXM5/PCIe)
One important clarification: "H100 NVL" is the product name for this two-GPU bridge module. "NVLink" is also the interconnect technology inside full SXM5 NVSwitch servers. Both the NVL module and NVSwitch servers use NVLink, but they are different architectures. The NVL bridges two GPUs with a dedicated chip. NVSwitch creates an all-to-all fabric across up to eight GPUs. These are not the same thing.
Full Specifications Comparison
| Specification | H100 NVL | H100 SXM5 | H100 PCIe |
|---|---|---|---|
| Architecture | Hopper (GH100) | Hopper (GH100) | Hopper (GH100) |
| VRAM | 94 GB HBM3 | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | ~3,938 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 (TFLOPS) | 3,341 | 3,958 | 3,026 |
| TF32 (TFLOPS) | 835 | 989 | 756 |
| CUDA Cores | 16,896 | 16,896 | 14,592 |
| Interconnect | NVLink bridge (2-GPU pair) | NVSwitch (up to 8 GPUs) | PCIe Gen5 |
| Interconnect BW | 600 GB/s pair total | 900 GB/s per GPU | 128 GB/s |
| TDP | 700-800W per 2-GPU module (350-400W/GPU) | 700W per GPU | 350W |
| Max GPUs per server | 2 per NVLink-bridged pair | 8 (NVSwitch) | 8 (PCIe) |
| MIG slices | 7 x ~12 GB | 7 x ~10 GB | 7 x ~10 GB |
The PCIe variant uses HBM2e instead of HBM3. That is not just a naming difference. HBM2e delivers 2,000 GB/s versus 3,350 GB/s on SXM5's HBM3. Memory-bound inference scales roughly linearly with bandwidth, so PCIe runs about 40% slower on bandwidth-limited decode for large models compared to SXM5.
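A back-of-envelope way to see why decode speed tracks bandwidth: in the memory-bound regime, each generated token streams roughly the full weight set from HBM, so tokens per second per sequence is bounded by bandwidth divided by bytes read per token. The sketch below uses that simplification plus a rough 70 GB figure for FP8 weights of a 70B model; the numbers are illustrative ceilings, not benchmarks, and ignore KV cache reads and batching.

```python
# Rough memory-bound decode model. Assumption: every generated token re-reads the
# full weight set from HBM, so the per-sequence decode rate is capped by
# bandwidth / bytes_read_per_token.

def decode_tokens_per_sec(bandwidth_gbps: float, weight_gb: float) -> float:
    """Upper-bound decode rate for a single sequence on one GPU."""
    return bandwidth_gbps / weight_gb

weights_fp8_gb = 70  # ~1 byte per parameter for a 70B model at FP8 (rough estimate)

sxm5 = decode_tokens_per_sec(3350, weights_fp8_gb)  # ~48 tok/s ceiling
pcie = decode_tokens_per_sec(2000, weights_fp8_gb)  # ~29 tok/s ceiling

print(f"SXM5 ceiling: {sxm5:.0f} tok/s, PCIe ceiling: {pcie:.0f} tok/s")
print(f"PCIe is ~{(1 - pcie / sxm5) * 100:.0f}% slower")  # tracks the bandwidth ratio
```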
The NVL trades a small amount of FP8 compute (3,341 versus SXM5's 3,958 TFLOPS) for ~17% higher memory bandwidth (~3,938 versus 3,350 GB/s) and 14 GB more VRAM per GPU. This tradeoff makes NVL favored for memory-bound long-context inference where KV cache size matters more than raw compute throughput. The key structural difference remains the interconnect: a PCIe-slot NVLink bridge for two GPUs versus NVSwitch for up to eight.
Inference Workload Fit
NVL for 70B and Long-Context Inference
A single H100 NVL GPU holds 94 GB of HBM3. Llama 3 70B at FP16 requires approximately 140 GB for weights alone, so both NVL and SXM5 need 2-way tensor parallelism to serve it. Two NVL GPUs provide 188 GB total, leaving 48 GB for KV cache, and the NVLink bridge on the NVL pair carries the tensor-parallel all-reduce at 600 GB/s, versus 128 GB/s over PCIe Gen5 on the PCIe variant.
Long-context serving amplifies this advantage. At a 128K context length with FP16 KV cache, Llama 70B can generate 20-30 GB of KV cache per active sequence. The extra 14 GB per NVL GPU compared to SXM5 extends how many concurrent long-context sessions you can hold before eviction starts. If KV cache eviction is already happening in production, NVL's additional VRAM may resolve it without a full hardware upgrade. For vLLM settings that reduce eviction pressure, see the KV cache optimization guide.
For FP8 inference, a 70B model fits in approximately 70 GB, which means a single SXM5 or NVL GPU can serve it without tensor parallelism. In this case, the roughly 24 GB of headroom on an NVL GPU (versus about 10 GB on SXM5) becomes runway for larger KV caches at higher concurrency rather than a requirement for model fitting.
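A quick capacity check of that single-GPU FP8 scenario: how much VRAM is left after weights, and roughly how many tokens of FP16 KV cache that headroom holds. The Llama-3-70B-style config (80 layers, 8 KV heads, head dim 128) and the 70 GB weight figure are assumptions for illustration; real serving frameworks also reserve memory for activations and workspace, so treat the results as upper bounds.

```python
# Assumed values for illustration only; exact KV cache size depends on the model
# config and the KV dtype, and serving frameworks reserve additional workspace.
WEIGHTS_FP8_GB = 70                          # ~1 byte/param for a 70B model
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2    # K and V * layers * kv_heads * head_dim * fp16 bytes

for name, vram_gb in [("H100 NVL", 94), ("H100 SXM5", 80)]:
    headroom_gb = vram_gb - WEIGHTS_FP8_GB
    kv_tokens = headroom_gb * 1e9 / KV_BYTES_PER_TOKEN
    print(f"{name}: {headroom_gb} GB free -> ~{kv_tokens / 1e3:.0f}K tokens of FP16 KV cache")
```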
When the NVL Bridge Helps (and When It Does Not)
The 600 GB/s NVLink bridge only connects the two GPUs within the NVL module. It does not reach any other GPUs in the server or cluster.
For 2-GPU tensor-parallel inference on 70B FP16 models, the bridge handles all-reduce operations between the two GPUs at much lower latency than PCIe. That is the scenario NVL was designed for. The bridge does nothing for single-GPU serving (you are not using it at all), and it cannot scale beyond 2 GPUs. If your inference job needs 4-way or 8-way tensor parallelism for models above 300B parameters, NVL is not the right form factor.
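For reference, this is the 2-GPU tensor-parallel setup in code. The sketch below assumes vLLM is installed and both GPUs of the NVL pair are visible to the process; the model ID and memory setting are illustrative placeholders, not a recommended configuration.

```python
# Minimal 2-way tensor-parallel serving sketch (assumes vLLM is installed and two
# NVLink-bridged GPUs are visible). The model ID below is an illustrative example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumption: swap in your checkpoint
    tensor_parallel_size=2,        # shards weights across the pair; all-reduce runs over the bridge
    gpu_memory_utilization=0.90,   # leave headroom for workspace and CUDA graphs
)

outputs = llm.generate(["Summarize the H100 form factors."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```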
PCIe for Smaller Models and Budget Serving
H100 PCIe's lower FP8 TFLOPS (3,026 versus 3,958 on SXM5) and lower memory bandwidth (2,000 versus 3,350 GB/s on SXM5) make it slower for large-model inference. For models under 30B parameters, the gap narrows because smaller models are more compute-bound. For 7B-13B serving where the model fits in a fraction of the 80 GB VRAM, PCIe delivers solid throughput at lower cost.
Multi-tenant inference using MIG is where PCIe can surprise you. Both SXM5 and PCIe support 7-way MIG with approximately 10 GB per slice. Each slice can serve a 7B INT8 model independently. With 8 PCIe GPUs in a server, you get up to 56 MIG instances. NVL gives you larger slices (~12 GB each) but only 14 total from the two-GPU module, so tenant density per server is lower on NVL.
Training Workload Fit
SXM5 for Multi-GPU Training
SXM5's NVSwitch fabric gives every GPU in an 8-GPU server full 900 GB/s all-to-all bandwidth. FSDP and tensor-parallel training require frequent collective communication (parameter all-gathers, gradient reduce-scatters, activation all-reduces) across all GPUs simultaneously. NVSwitch at 900 GB/s per GPU handles this efficiently. The NVL bridge at 600 GB/s total for two GPUs cannot replicate this at scale. For how to configure distributed training jobs across multiple H100 nodes, see the guide on distributed training with FSDP.
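A minimal sketch of that FSDP pattern, assuming a torchrun launch with one process per GPU on an 8-GPU NVSwitch node (for example `torchrun --nproc_per_node=8 train.py`). The stand-in model and hyperparameters are illustrative; the point is that every training step triggers the cross-GPU collectives that NVSwitch accelerates.

```python
# FSDP skeleton for a single 8-GPU node. Assumes launch via torchrun, which sets
# LOCAL_RANK for each process. The model is a stand-in, not a real workload.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")             # NCCL rides NVLink/NVSwitch when available
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Stand-in model: a stack of large linear layers so the sharding is visible.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
model = FSDP(model)                         # shards parameters across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 4096, device="cuda")

loss = model(x).square().mean()             # forward all-gathers sharded params
loss.backward()                             # gradients reduce-scattered across GPUs
optimizer.step()
dist.destroy_process_group()
```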
Why NVL Is Not Suited for Multi-GPU Training
The NVL module can only coordinate two GPUs via its bridge. There is no path to 4-GPU or 8-GPU all-to-all communication within a single NVL module. A training job that shards across more than 2 GPUs must either run its collectives over PCIe between NVL modules or move to an NVSwitch system. Scaling NVL modules beyond the pair means PCIe connectivity between modules, which gives you 128 GB/s between pairs instead of the 600 GB/s within a pair.
If you are training large models that require more than 2 GPUs, SXM5 with NVSwitch is the correct form factor.
PCIe for Data-Parallel Training
PCIe Gen5 at 128 GB/s is a real bottleneck for tensor-parallel training above 2 GPUs. For data-parallel fine-tuning, each GPU holds a full copy of the model and only needs to synchronize gradients periodically. Gradient synchronization at 128 GB/s is adequate for many data-parallel workloads, especially on smaller models.
Single-node fine-tuning of 7B-13B models with LoRA or QLoRA does not require NVSwitch. PCIe delivers equivalent training speed at lower on-demand cost for these workloads.
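For completeness, here is what that single-GPU LoRA path typically looks like, assuming the transformers and peft libraries are installed; the base model ID and adapter hyperparameters are illustrative placeholders, not a tuned recipe.

```python
# Single-GPU LoRA fine-tuning setup sketch (assumes transformers + peft installed).
# Assumption: a 7B-13B-class checkpoint that fits comfortably in one 80 GB PCIe H100.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",          # illustrative model ID
    torch_dtype=torch.bfloat16,
).to("cuda")

lora = LoraConfig(
    r=16,                                  # adapter rank; illustrative value
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05,
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # adapters are a small fraction of total params
# ... hand `model` to a standard training loop; no multi-GPU interconnect involved
```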
Cost Per Token and Cost Per Training Hour
Spheron's GPU API returned the following on-demand rates on 04 May 2026. No spot pricing was available for H100 variants at this time. The H100 NVL does not appear as a separate SKU in the current API response.
| Form Factor | On-Demand ($/GPU/hr) | Spot ($/GPU/hr) | Est. tok/s (Llama 70B FP8) | Cost per 1M tokens |
|---|---|---|---|---|
| H100 NVL | see /pricing/ | N/A | ~3,200 | N/A |
| H100 SXM5 | $3.10 | N/A | ~3,200 | ~$0.27 |
| H100 PCIe | $2.01 | N/A | ~2,400 | ~$0.23 |
H100 NVL availability: The Spheron marketplace currently lists H100 SXM5 and H100 PCIe. H100 NVL availability varies by data center partner and is not surfaced as a standalone SKU. Check current GPU pricing for live availability and rates.
H100 SXM5 is sold as an 8-GPU bundle at $24.82/hr total. The per-GPU figure of $3.10/hr is the bundle rate divided by 8. PCIe single-GPU instances start at $2.01/hr.
Cost per 1M tokens assumes FP8 serving of Llama 3 70B on a single GPU with moderate batching. Actual throughput depends on batch size, context length, and serving framework. For how continuous batching and PagedAttention affect these throughput estimates in practice, see the LLM serving optimization guide. NVL has slightly lower FP8 compute than SXM5 (3,341 versus 3,958 TFLOPS), but its higher memory bandwidth means decode throughput on memory-bound workloads can be comparable. Compute-heavy prefill stages will be modestly faster on SXM5.
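The cost-per-token column is derived directly from those two inputs: hourly rate divided by tokens generated per hour at the estimated throughput. The throughputs are the same rough estimates used in the table, not benchmark results.

```python
# How the cost-per-1M-token figures above are computed from rate and throughput.
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

print(f"SXM5: ${cost_per_million_tokens(3.10, 3200):.2f} per 1M tokens")  # ~$0.27
print(f"PCIe: ${cost_per_million_tokens(2.01, 2400):.2f} per 1M tokens")  # ~$0.23
```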
Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing → for live rates.
For how H100 ranks against B200 and other current options on a cost-per-token basis, see our GPU rankings for LLMs.
Migration Scenarios
PCIe to NVL
Two clear signals indicate PCIe is no longer the right fit:
Hitting the 80 GB VRAM ceiling on 70B FP16 models. The model does not fit on one GPU, forcing 2-way tensor parallelism over PCIe Gen5 at 128 GB/s. NVL's 600 GB/s NVLink bridge handles the same 2-GPU split with significantly lower communication overhead and gives each GPU 14 GB more headroom.
KV cache eviction in production. Long-context requests are getting evicted because insufficient VRAM remains after model weights are loaded. NVL's extra 14 GB per GPU extends the serving window before eviction starts.
NVL to SXM5
SXM5 is the right step when scale requires it:
Scaling beyond 2 GPUs. The NVL module cannot connect more than 2 GPUs. Any job requiring 4-way or 8-way tensor parallelism needs NVSwitch.
Starting FSDP fine-tuning of 70B+ models. Gradient synchronization across more than 2 GPUs requires NVSwitch for acceptable throughput. PCIe Gen5 becomes the bottleneck at this scale.
Adopting Megatron-LM pipeline parallelism. Megatron-LM typically layers pipeline parallelism on top of tensor parallelism, so each pipeline stage still runs tensor-parallel all-reduces while activations pass between stages. NVSwitch at 900 GB/s per GPU is the correct infrastructure for this.
For how H100 fits in the broader picture of current GPU options including Blackwell, see the B200 complete guide.
Deploy H100 on Spheron
Spheron offers both H100 SXM5 and H100 PCIe on-demand with per-second billing and no minimum commitment. SXM5 nodes include full NVSwitch for multi-GPU training jobs. PCIe single-GPU instances are available for inference and fine-tuning where NVSwitch is not needed. You can switch between form factors without a long-term contract.
If you are choosing between H100 NVL, SXM5, and PCIe, Spheron lets you rent H100 in the configuration that matches your workload - no long-term contract, per-second billing, full root access.
