Three H100 variants exist and they are not interchangeable. The NVL module ships with 94 GB HBM3 per GPU and an NVLink bridge for two-GPU pairing. The SXM5 ships with 80 GB HBM3 in a full NVSwitch server with up to 8 GPUs. The PCIe variant ships with 80 GB HBM2e and connects over PCIe Gen5. The right one depends entirely on whether your bottleneck is memory capacity, inter-GPU bandwidth, or per-GPU cost.
If you need to rent H100 and are not sure which variant to pick, this guide covers the full spec delta and workload fit for each form factor.
For generation comparisons, see our A100 vs H100 and H100 vs H200 breakdowns.
TL;DR Decision Matrix
| Workload | Best Form Factor | Reason |
|---|---|---|
| 70B inference, long context, large KV cache | H100 NVL | 94 GB HBM3 fits more context without tensor parallelism |
| Multi-GPU FSDP or tensor-parallel training | H100 SXM5 | Full NVSwitch 900 GB/s all-to-all bandwidth |
| Single-node fine-tuning, cost-sensitive | H100 PCIe | Lower per-GPU rate, PCIe Gen5 sufficient for small jobs |
| MIG partitioned multi-tenant inference | H100 SXM5 or PCIe | Both support 7-way MIG; NVL's slices are larger (~12 GB) but its two-GPU module yields fewer tenants per server |
H100 NVL Specifications
The H100 NVL is a dual-slot PCIe Gen5 card (full-height, full-length) containing two PCIe-form H100 GPUs joined by three NVLink 4 bridges. Unlike SXM5, it slots into standard PCIe data center infrastructure without requiring a specialized SXM board, which is one of NVL's practical advantages for operators who want NVLink-paired inference without a full SXM server build-out. NVIDIA designed it for large inference workloads where 80 GB is not enough and 2-way NVLink bandwidth between the pair is sufficient. It is not a server interconnect the way NVSwitch is.
Key specs per GPU:
- 94 GB HBM3 (versus 80 GB on SXM5 and PCIe)
- ~3,938 GB/s memory bandwidth (higher than SXM5's 3,350 GB/s)
- NVLink bridge connecting the two paired GPUs at 600 GB/s bidirectional total
- 3,341 TFLOPS FP8 with sparsity (lower than SXM5's 3,958 TFLOPS)
- 835 TFLOPS TF32 with sparsity
- 16,896 CUDA cores
- 350-400W per GPU (configurable), 700-800W total for the two-GPU pair
- 7-way MIG support: approximately 12 GB per slice (versus 10 GB on SXM5/PCIe)
One important clarification: "H100 NVL" is the product name for this two-GPU bridge module. "NVLink" is also the interconnect technology inside full SXM5 NVSwitch servers. Both the NVL module and NVSwitch servers use NVLink, but they are different architectures. The NVL bridges two GPUs with a dedicated chip. NVSwitch creates an all-to-all fabric across up to eight GPUs. These are not the same thing.
Full Specifications Comparison
| Specification | H100 NVL | H100 SXM5 | H100 PCIe |
|---|---|---|---|
| Architecture | Hopper (GH100) | Hopper (GH100) | Hopper (GH100) |
| VRAM | 94 GB HBM3 | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | ~3,938 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 (TFLOPS) | 3,341 | 3,958 | 3,026 |
| TF32 (TFLOPS) | 835 | 989 | 756 |
| CUDA Cores | 16,896 | 16,896 | 14,592 |
| Interconnect | NVLink bridge (2-GPU pair) | NVSwitch (up to 8 GPUs) | PCIe Gen5 |
| Interconnect BW | 600 GB/s pair total | 900 GB/s per GPU | 128 GB/s |
| TDP | 700-800W per 2-GPU module (350-400W/GPU) | 700W per GPU | 350W |
| Max GPUs per server | 2 per NVLink-bridged pair | 8 (NVSwitch) | 8 (PCIe) |
| MIG slices | 7 x ~12 GB | 7 x ~10 GB | 7 x ~10 GB |
The PCIe variant uses HBM2e instead of HBM3. That is not just a naming difference. HBM2e delivers 2,000 GB/s versus 3,350 GB/s on SXM5's HBM3. Memory-bound inference scales roughly linearly with bandwidth, so PCIe runs about 40% slower on bandwidth-limited decode for large models compared to SXM5.
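A back-of-envelope way to see why decode speed tracks bandwidth: in the memory-bound regime, each generated token streams roughly the full weight set from HBM, so tokens per second per sequence is bounded by bandwidth divided by bytes read per token. The sketch below uses that simplification plus a rough 70 GB figure for FP8 weights of a 70B model; the numbers are illustrative ceilings, not benchmarks, and ignore KV cache reads and batching.

```python
# Rough memory-bound decode model. Assumption: every generated token re-reads the
# full weight set from HBM, so the per-sequence decode rate is capped by
# bandwidth / bytes_read_per_token.

def decode_tokens_per_sec(bandwidth_gbps: float, weight_gb: float) -> float:
    """Upper-bound decode rate for a single sequence on one GPU."""
    return bandwidth_gbps / weight_gb

weights_fp8_gb = 70  # ~1 byte per parameter for a 70B model at FP8 (rough estimate)

sxm5 = decode_tokens_per_sec(3350, weights_fp8_gb)  # ~48 tok/s ceiling
pcie = decode_tokens_per_sec(2000, weights_fp8_gb)  # ~29 tok/s ceiling

print(f"SXM5 ceiling: {sxm5:.0f} tok/s, PCIe ceiling: {pcie:.0f} tok/s")
print(f"PCIe is ~{(1 - pcie / sxm5) * 100:.0f}% slower")  # tracks the bandwidth ratio
```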
The NVL trades a small amount of FP8 compute (3,341 versus SXM5's 3,958 TFLOPS) for ~17% higher memory bandwidth (~3,938 versus 3,350 GB/s) and 14 GB more VRAM per GPU. This tradeoff makes NVL favored for memory-bound long-context inference where KV cache size matters more than raw compute throughput. The key structural difference remains the interconnect: a PCIe-slot NVLink bridge for two GPUs versus NVSwitch for up to eight.
Inference Workload Fit
NVL for 70B and Long-Context Inference
A single H100 NVL GPU holds 94 GB of HBM3. Llama 3 70B at FP16 requires approximately 140 GB for weights alone, so both NVL and SXM5 need 2-way tensor parallelism to serve it. Two NVL GPUs provide 188 GB total, leaving 48 GB for KV cache, and the NVLink bridge on the NVL pair carries the tensor-parallel all-reduce at 600 GB/s, versus 128 GB/s over PCIe Gen5 on the PCIe variant.
Long-context serving amplifies this advantage. At a 128K context length with FP16 KV cache, Llama 70B can generate 20-30 GB of KV cache per active sequence. The extra 14 GB per NVL GPU compared to SXM5 extends how many concurrent long-context sessions you can hold before eviction starts. If KV cache eviction is already happening in production, NVL's additional VRAM may resolve it without a full hardware upgrade. For vLLM settings that reduce eviction pressure, see the KV cache optimization guide.
For FP8 inference, a 70B model fits in approximately 70 GB, which means a single SXM5 or NVL GPU can serve it without tensor parallelism. In this case, the roughly 24 GB of headroom on an NVL GPU (versus about 10 GB on SXM5) becomes runway for larger KV caches at higher concurrency rather than a requirement for model fitting.
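A quick capacity check of that single-GPU FP8 scenario: how much VRAM is left after weights, and roughly how many tokens of FP16 KV cache that headroom holds. The Llama-3-70B-style config (80 layers, 8 KV heads, head dim 128) and the 70 GB weight figure are assumptions for illustration; real serving frameworks also reserve memory for activations and workspace, so treat the results as upper bounds.

```python
# Assumed values for illustration only; exact KV cache size depends on the model
# config and the KV dtype, and serving frameworks reserve additional workspace.
WEIGHTS_FP8_GB = 70                          # ~1 byte/param for a 70B model
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2    # K and V * layers * kv_heads * head_dim * fp16 bytes

for name, vram_gb in [("H100 NVL", 94), ("H100 SXM5", 80)]:
    headroom_gb = vram_gb - WEIGHTS_FP8_GB
    kv_tokens = headroom_gb * 1e9 / KV_BYTES_PER_TOKEN
    print(f"{name}: {headroom_gb} GB free -> ~{kv_tokens / 1e3:.0f}K tokens of FP16 KV cache")
```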
When the NVL Bridge Helps (and When It Does Not)
The 600 GB/s NVLink bridge only connects the two GPUs within the NVL module. It does not reach any other GPUs in the server or cluster.
For 2-GPU tensor-parallel inference on 70B FP16 models, the bridge handles all-reduce operations between the two GPUs at much lower latency than PCIe. That is the scenario NVL was designed for. The bridge does nothing for single-GPU serving (you are not using it at all), and it cannot scale beyond 2 GPUs. If your inference job needs 4-way or 8-way tensor parallelism for models above 300B parameters, NVL is not the right form factor.
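For reference, this is the 2-GPU tensor-parallel setup in code. The sketch below assumes vLLM is installed and both GPUs of the NVL pair are visible to the process; the model ID and memory setting are illustrative placeholders, not a recommended configuration.

```python
# Minimal 2-way tensor-parallel serving sketch (assumes vLLM is installed and two
# NVLink-bridged GPUs are visible). The model ID below is an illustrative example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumption: swap in your checkpoint
    tensor_parallel_size=2,        # shards weights across the pair; all-reduce runs over the bridge
    gpu_memory_utilization=0.90,   # leave headroom for workspace and CUDA graphs
)

outputs = llm.generate(["Summarize the H100 form factors."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```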
PCIe for Smaller Models and Budget Serving
H100 PCIe's lower FP8 TFLOPS (3,026 versus 3,958 on SXM5) and lower memory bandwidth (2,000 versus 3,350 GB/s on SXM5) make it slower for large-model inference. For models under 30B parameters, the gap narrows because smaller models are more compute-bound. For 7B-13B serving where the model fits in a fraction of the 80 GB VRAM, PCIe delivers solid throughput at lower cost.
Multi-tenant inference using MIG is where PCIe can surprise you. Both SXM5 and PCIe support 7-way MIG with approximately 10 GB per slice. Each slice can serve a 7B INT8 model independently. With 8 PCIe GPUs in a server, you get up to 56 MIG instances. NVL gives you larger slices (~12 GB each) but only 14 total from the two-GPU module, so tenant density per server is lower on NVL.
Training Workload Fit
SXM5 for Multi-GPU Training
SXM5's NVSwitch fabric gives every GPU in an 8-GPU server full 900 GB/s all-to-all bandwidth. FSDP and tensor-parallel training require frequent collective communication (parameter all-gathers, gradient reduce-scatters, activation all-reduces) across all GPUs simultaneously. NVSwitch at 900 GB/s per GPU handles this efficiently. The NVL bridge at 600 GB/s total for two GPUs cannot replicate this at scale. For how to configure distributed training jobs across multiple H100 nodes, see the guide on distributed training with FSDP.
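A minimal sketch of that FSDP pattern, assuming a torchrun launch with one process per GPU on an 8-GPU NVSwitch node (for example `torchrun --nproc_per_node=8 train.py`). The stand-in model and hyperparameters are illustrative; the point is that every training step triggers the cross-GPU collectives that NVSwitch accelerates.

```python
# FSDP skeleton for a single 8-GPU node. Assumes launch via torchrun, which sets
# LOCAL_RANK for each process. The model is a stand-in, not a real workload.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")             # NCCL rides NVLink/NVSwitch when available
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Stand-in model: a stack of large linear layers so the sharding is visible.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
model = FSDP(model)                         # shards parameters across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 4096, device="cuda")

loss = model(x).square().mean()             # forward all-gathers sharded params
loss.backward()                             # gradients reduce-scattered across GPUs
optimizer.step()
dist.destroy_process_group()
```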
Why NVL Is Not Suited for Multi-GPU Training
The NVL module can only coordinate two GPUs via its bridge. There is no path to 4-GPU or 8-GPU all-to-all communication within a single NVL module. A training job that shards across more than 2 GPUs must either run its collectives over PCIe between NVL modules or move to an NVSwitch system. Scaling NVL modules beyond the pair means PCIe connectivity between modules, which gives you 128 GB/s between pairs instead of the 600 GB/s within a pair.
If you are training large models that require more than 2 GPUs, SXM5 with NVSwitch is the correct form factor.
PCIe for Data-Parallel Training
PCIe Gen5 at 128 GB/s is a real bottleneck for tensor-parallel training above 2 GPUs. For data-parallel fine-tuning, each GPU holds a full copy of the model and only needs to synchronize gradients periodically. Gradient synchronization at 128 GB/s is adequate for many data-parallel workloads, especially on smaller models.
Single-node fine-tuning of 7B-13B models with LoRA or QLoRA does not require NVSwitch. PCIe delivers equivalent training speed at lower on-demand cost for these workloads.
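For completeness, here is what that single-GPU LoRA path typically looks like, assuming the transformers and peft libraries are installed; the base model ID and adapter hyperparameters are illustrative placeholders, not a tuned recipe.

```python
# Single-GPU LoRA fine-tuning setup sketch (assumes transformers + peft installed).
# Assumption: a 7B-13B-class checkpoint that fits comfortably in one 80 GB PCIe H100.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",          # illustrative model ID
    torch_dtype=torch.bfloat16,
).to("cuda")

lora = LoraConfig(
    r=16,                                  # adapter rank; illustrative value
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05,
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # adapters are a small fraction of total params
# ... hand `model` to a standard training loop; no multi-GPU interconnect involved
```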
Cost Per Token and Cost Per Training Hour
Spheron's GPU API returned the following on-demand rates on 04 May 2026. No spot pricing was available for H100 variants at this time. The H100 NVL does not appear as a separate SKU in the current API response.
| Form Factor | On-Demand ($/GPU/hr) | Spot ($/GPU/hr) | Est. tok/s (Llama 70B FP8) | Cost per 1M tokens |
|---|---|---|---|---|
| H100 NVL | see /pricing/ | N/A | ~3,200 | N/A |
| H100 SXM5 | $3.10 | N/A | ~3,200 | ~$0.27 |
| H100 PCIe | $2.01 | N/A | ~2,400 | ~$0.23 |
H100 NVL availability: The Spheron marketplace currently lists H100 SXM5 and H100 PCIe. H100 NVL availability varies by data center partner and is not surfaced as a standalone SKU. Check current GPU pricing for live availability and rates.
H100 SXM5 is sold as an 8-GPU bundle at $24.82/hr total. The per-GPU figure of $3.10/hr is the bundle rate divided by 8. PCIe single-GPU instances start at $2.01/hr.
Cost per 1M tokens assumes FP8 serving of Llama 3 70B on a single GPU with moderate batching. Actual throughput depends on batch size, context length, and serving framework. For how continuous batching and PagedAttention affect these throughput estimates in practice, see the LLM serving optimization guide. NVL has slightly lower FP8 compute than SXM5 (3,341 versus 3,958 TFLOPS), but its higher memory bandwidth means decode throughput on memory-bound workloads can be comparable. Compute-heavy prefill stages will be modestly faster on SXM5.
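The cost-per-token column is derived directly from those two inputs: hourly rate divided by tokens generated per hour at the estimated throughput. The throughputs are the same rough estimates used in the table, not benchmark results.

```python
# How the cost-per-1M-token figures above are computed from rate and throughput.
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

print(f"SXM5: ${cost_per_million_tokens(3.10, 3200):.2f} per 1M tokens")  # ~$0.27
print(f"PCIe: ${cost_per_million_tokens(2.01, 2400):.2f} per 1M tokens")  # ~$0.23
```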
Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing → for live rates.
For how H100 ranks against B200 and other current options on a cost-per-token basis, see our GPU rankings for LLMs.
Migration Scenarios
PCIe to NVL
Two clear signals indicate PCIe is no longer the right fit:
Hitting the 80 GB VRAM ceiling on 70B FP16 models. The model does not fit on one GPU, forcing 2-way tensor parallelism over PCIe Gen5 at 128 GB/s. NVL's 600 GB/s NVLink bridge handles the same 2-GPU split with significantly lower communication overhead and gives each GPU 14 GB more headroom.
KV cache eviction in production. Long-context requests are getting evicted because insufficient VRAM remains after model weights are loaded. NVL's extra 14 GB per GPU extends the serving window before eviction starts.
NVL to SXM5
SXM5 is the right step when scale requires it:
Scaling beyond 2 GPUs. The NVL module cannot connect more than 2 GPUs. Any job requiring 4-way or 8-way tensor parallelism needs NVSwitch.
Starting FSDP fine-tuning of 70B+ models. Gradient synchronization across more than 2 GPUs requires NVSwitch for acceptable throughput. PCIe Gen5 becomes the bottleneck at this scale.
Adopting Megatron-LM pipeline parallelism. Megatron-LM typically layers pipeline parallelism on top of tensor parallelism, so each pipeline stage still runs tensor-parallel all-reduces while activations pass between stages. NVSwitch at 900 GB/s per GPU is the correct infrastructure for this.
For how H100 fits in the broader picture of current GPU options including Blackwell, see the B200 complete guide.
Deploy H100 on Spheron
Spheron offers both H100 SXM5 and H100 PCIe on-demand with per-second billing and no minimum commitment. SXM5 nodes include full NVSwitch for multi-GPU training jobs. PCIe single-GPU instances are available for inference and fine-tuning where NVSwitch is not needed. You can switch between form factors without a long-term contract.
If you are choosing between H100 NVL, SXM5, and PCIe, Spheron lets you rent H100 in the configuration that matches your workload - no long-term contract, per-second billing, full root access.
