NVLink is the reason an 8-GPU H100 server can move data between GPUs at 900 GB/s while a PCIe machine doing the same thing is stuck at 128 GB/s. That 7x gap is not a minor spec difference. It determines whether you can run FSDP, tensor parallelism, or large KV cache sharing at all.
This post covers what NVLink actually is, how bandwidth has evolved across generations from Pascal to Blackwell, a direct comparison with PCIe, and how to verify NVLink status on a rented instance. For the H100-specific form factor breakdown, the H100 NVL vs SXM5 vs PCIe guide covers the physical differences in detail.
What is NVLink
NVLink is NVIDIA's proprietary GPU-to-GPU interconnect. Instead of routing GPU communication through the PCIe bus (which the CPU and storage also share), NVLink creates a direct, dedicated high-speed path between GPUs within the same server or rack.
An analogy: PCIe is a city arterial road shared by traffic from the CPU, storage, and GPUs. NVLink is a private expressway built exclusively for GPU-to-GPU data. No merge lanes, no shared capacity, no CPU arbitration.
One important constraint: NVLink is confined to a single NVLink domain, meaning one server or, in rack-scale systems such as the GB200 NVL72, one rack. Beyond that domain, communication still travels over InfiniBand or Ethernet. For a full breakdown of inter-node fabric choices, see the GPU networking guide.
NVLink Bandwidth Across Generations
Each NVLink generation has raised total per-GPU bandwidth by 1.5x to 2x, though the mechanism differs. Generations up to NVLink 4.0 mostly added more links; NVLink 5.0 kept 18 links but doubled per-link speed from 25 GB/s to 50 GB/s per direction.
| Generation | GPU Architecture | Release | Links per GPU | Per-link BW | Total BW per GPU (bidirectional) |
|---|---|---|---|---|---|
| NVLink 1.0 | Pascal (P100) | 2016 | 4 | 20 GB/s | 160 GB/s |
| NVLink 2.0 | Volta (V100) | 2017 | 6 | 25 GB/s | 300 GB/s |
| NVLink 3.0 | Ampere (A100) | 2020 | 12 | 25 GB/s | 600 GB/s |
| NVLink 4.0 | Hopper (H100, H200) | 2022 | 18 | 25 GB/s | 900 GB/s |
| NVLink 5.0 | Blackwell (B200, B300, GB200) | 2024 | 18 | 50 GB/s | 1.8 TB/s |
| NVLink 6.0 | Rubin (R100, upcoming) | 2026+ | 18 | ~66 GB/s | ~2.4 TB/s |
Per-link BW values above are unidirectional (per direction). Total bidirectional BW = Links × Per-link BW × 2.
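As a quick sanity check, the formula above can be applied to the shipping generations in the table. This is a minimal sketch; the dictionary and function names are illustrative, and the figures are copied from the table rather than measured.

```python
# (links per GPU, per-link bandwidth in GB/s for one direction), from the table above.
NVLINK_GENERATIONS = {
    "NVLink 1.0": (4, 20),
    "NVLink 2.0": (6, 25),
    "NVLink 3.0": (12, 25),
    "NVLink 4.0": (18, 25),
    "NVLink 5.0": (18, 50),
}


def total_bidirectional_gb_per_s(links: int, per_link_gb_per_s: float) -> float:
    """Total BW per GPU = links x per-link BW (one direction) x 2 directions."""
    return links * per_link_gb_per_s * 2


for name, (links, per_link) in NVLINK_GENERATIONS.items():
    print(f"{name}: {total_bidirectional_gb_per_s(links, per_link):.0f} GB/s")
```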
The NVLink 5.0 jump is notable. The link count stayed the same at 18, but per-link speed doubled from 25 GB/s to 50 GB/s per direction, which is why total bandwidth went from 900 GB/s to 1.8 TB/s. For the full Blackwell architecture context, see the NVIDIA B200 complete guide.
NVLink 6.0 is expected with the Rubin generation (R100). It is not yet shipping, so the ~2.4 TB/s figure is based on NVIDIA roadmap data, not measured performance.
Why NVLink Matters for AI Training and Inference
Model parallelism
When a model is too large to fit on a single GPU, its layers are split across multiple GPUs. Activations must transfer between GPUs on every forward pass. At 900 GB/s, the transfer time for a layer's activation tensors is microseconds. At PCIe Gen5's 128 GB/s, the same transfer takes 7x longer and becomes a visible bottleneck in step time.
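To put rough numbers on that gap, here is a small sketch that estimates the ideal transfer time for one activation tensor. The tensor shape is a hypothetical example, and the calculation ignores latency and protocol overhead, so treat the result as a lower bound rather than a benchmark.

```python
def transfer_time_us(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in microseconds, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


# Hypothetical activation tensor: batch 1, sequence 4096, hidden size 8192, BF16 (2 bytes).
activation_bytes = 1 * 4096 * 8192 * 2

nvlink_us = transfer_time_us(activation_bytes, 900)  # NVLink 4.0
pcie_us = transfer_time_us(activation_bytes, 128)    # PCIe Gen5 x16

print(f"NVLink 4.0: {nvlink_us:.1f} us")
print(f"PCIe Gen5:  {pcie_us:.1f} us")
print(f"Ratio:      {pcie_us / nvlink_us:.1f}x")
```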
Tensor parallelism
Tensor parallelism shards weight matrices across GPUs. Specifically, linear layers are split along rows or columns so each GPU holds a partial weight matrix. After each matrix multiply, GPUs run an all-reduce to sum partial results. Megatron-LM's tensor parallelism design assumes NVLink-class bandwidth. On PCIe, the all-reduce overhead often negates the compute savings from adding more GPUs.
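The row-split case can be illustrated without any GPU at all. The sketch below is plain Python on tiny matrices, not Megatron-LM code: each simulated "GPU" holds a slice of the weight rows, computes a partial product, and the final elementwise sum stands in for the all-reduce that would travel over NVLink.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_parallel_matmul(x, w, num_gpus):
    """Split W along rows (and X along columns), then sum the partial products."""
    shard = len(w) // num_gpus  # assumes len(w) is divisible by num_gpus
    partials = []
    for g in range(num_gpus):
        lo, hi = g * shard, (g + 1) * shard
        x_shard = [row[lo:hi] for row in x]  # columns of X held by simulated GPU g
        w_shard = w[lo:hi]                   # rows of W held by simulated GPU g
        partials.append(matmul(x_shard, w_shard))
    rows, cols = len(partials[0]), len(partials[0][0])
    # Stand-in for the all-reduce: elementwise sum of every GPU's partial result.
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]

# The sharded result matches the unsharded product.
print(row_parallel_matmul(x, w, num_gpus=2) == matmul(x, w))
```

Every partial result has the full output shape, which is why the all-reduce volume, and therefore the interconnect bandwidth, matters after each sharded layer.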
KV cache sharing
In multi-GPU inference serving, the KV cache for long-context requests can be distributed across GPUs on the same node. NVLink allows coherent access to another GPU's KV cache without round-tripping through system memory. This matters most at context lengths above 32K tokens where a single GPU's HBM can't hold the full cache.
All-reduce throughput
Data-parallel training sums gradients across GPUs after every step. With N GPUs, the all-reduce volume is `2 * (N-1)/N * model_params * bytes_per_param`. For a 70B parameter model in BF16 (2 bytes per param) across 8 GPUs, that is roughly 245 GB per step. NVLink 4.0 at 900 GB/s completes this in under 300 ms. PCIe Gen5 at 128 GB/s would need about 1.9 seconds per step, so communication rather than compute would likely set the pace of training.
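The arithmetic behind those figures is easy to reproduce. This sketch applies the volume formula from the paragraph above and divides by the headline bandwidths; it assumes ideal, fully utilized links, so real step times will be longer.

```python
def allreduce_bytes(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """All-reduce volume: 2 * (N-1)/N * model_params * bytes_per_param."""
    return 2 * (num_gpus - 1) / num_gpus * num_params * bytes_per_param


volume = allreduce_bytes(70e9, 2, 8)  # 70B params, BF16, 8 GPUs
nvlink_s = volume / 900e9             # NVLink 4.0 at 900 GB/s
pcie_s = volume / 128e9               # PCIe Gen5 at 128 GB/s

print(f"Volume: {volume / 1e9:.0f} GB per step")
print(f"NVLink 4.0: {nvlink_s * 1e3:.0f} ms")
print(f"PCIe Gen5:  {pcie_s:.2f} s")
```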
NVLink vs PCIe: Latency, Bandwidth, and When Each Matters
PCIe and NVLink solve the same problem at different price points and bandwidth tiers.
| Dimension | NVLink 4.0 (H100) | NVLink 5.0 (B200) | PCIe Gen5 x16 |
|---|---|---|---|
| Bidirectional BW | 900 GB/s | 1.8 TB/s | 128 GB/s |
| Topology | All-to-all via NVSwitch | All-to-all via NVSwitch | Shared PCIe root complex |
| Latency (GPU-to-GPU) | ~1 µs | ~1 µs | ~4-8 µs |
| CPU involvement | None | None | DMA through PCIe |
| Max GPUs (intra-node) | 8 (SXM5) | 8 (HGX B200) or 72 (NVL72) | 8 |
| Cost premium | Requires SXM baseboard | Requires SXM baseboard | Commodity PCIe slots |
When each matters:
- Use NVLink when running FSDP, tensor parallelism, or pipeline parallelism on models larger than 7B parameters; when KV cache is distributed across GPUs; when all-reduce time exceeds 5% of step time (profile with nccl-tests).
- PCIe is sufficient for single-GPU fine-tuning, inference serving on models that fit in one GPU, batch embedding jobs, and any workload where GPUs operate independently.
- Mixed environments: some cloud providers rent PCIe form-factor H100s at lower rates. For workloads that genuinely need NVLink bandwidth, the SXM form factor is the right choice.
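The 5% rule of thumb from the list above can be turned into a simple check once you have profiler timings. The function names and the sample timings here are hypothetical; substitute the all-reduce and step times you actually measure.

```python
NVLINK_THRESHOLD = 0.05  # the 5% rule of thumb from the list above


def allreduce_share(allreduce_ms: float, step_ms: float) -> float:
    """Fraction of one training step spent in all-reduce."""
    return allreduce_ms / step_ms


def nvlink_worthwhile(allreduce_ms: float, step_ms: float) -> bool:
    """True when all-reduce time exceeds the 5% threshold."""
    return allreduce_share(allreduce_ms, step_ms) > NVLINK_THRESHOLD


# Hypothetical profiler readings, in milliseconds.
print(nvlink_worthwhile(allreduce_ms=120, step_ms=800))  # 15% of the step
print(nvlink_worthwhile(allreduce_ms=10, step_ms=800))   # 1.25% of the step
```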
NVSwitch: Scaling NVLink Beyond 8 GPUs
NVLink describes the link protocol. NVSwitch is the dedicated silicon that makes full all-to-all NVLink fabric possible at scale.
On an 8-GPU H100 SXM5 server, four NVSwitch chips sit on the baseboard and create a non-blocking crossbar. Every GPU can talk to any other GPU at its full 900 GB/s NVLink bandwidth, and all eight GPUs can do so at the same time. Without NVSwitch, NVLink would form point-to-point or ring topologies between GPU pairs, reducing effective bandwidth for multi-GPU collectives.
An all-reduce across 8 H100 SXM5 GPUs via NVSwitch is dominated by bandwidth, not latency: the per-collective overhead is negligible compared to PCIe. The same operation across 8 PCIe-connected GPUs can consume 30-40% of a training step on large models.
The GB200 NVL72 takes this further. NVLink Switch chips (the rack-scale variant of NVSwitch) connect 72 B200 GPUs across the entire rack, delivering 130 TB/s of all-to-all bandwidth. This is not 72 GPUs connected via slower links. Each GPU still has 1.8 TB/s NVLink 5.0 bandwidth, and NVSwitch provides the non-blocking fabric to use it simultaneously with every other GPU. For the full rack architecture, see the GB200 NVL72 guide.
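The 130 TB/s figure is simply the per-GPU bandwidth summed over the rack, which a two-line check confirms. The constant names are illustrative.

```python
GPUS_PER_RACK = 72
NVLINK5_TB_PER_S = 1.8  # per-GPU bidirectional NVLink 5.0 bandwidth

aggregate_tb_per_s = GPUS_PER_RACK * NVLINK5_TB_PER_S
print(f"{aggregate_tb_per_s:.1f} TB/s")  # rounds to the quoted 130 TB/s
```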
NVLink in the Data Center: SXM vs PCIe Form Factors
SXM is NVIDIA's proprietary socket and baseboard design. The GPU module is mounted directly on a baseboard that routes NVLink signals between all GPUs in the server. That baseboard is where H100 SXM5, H200 SXM, and B200 SXM get their NVLink fabric.
PCIe form-factor GPUs (H100 PCIe, A100 PCIe) insert into standard PCIe slots. They do not connect via NVSwitch. GPU-to-GPU communication goes through the PCIe root complex. One exception: the H100 NVL is a specific dual-GPU bridge module that uses a 2-GPU NVLink bridge between two cards. It is not the same as NVSwitch. It gives those two GPUs a direct NVLink connection to each other but does not scale to 8 GPUs.
On multi-tenant clouds, the form factor is not always labeled clearly. A listing that says "H100 80GB" could be PCIe or SXM5. The bandwidth difference is 7x. H100 SXM5 on Spheron labels the form factor explicitly for each SKU so you can verify before provisioning.
When NVLink is Required vs Optional
| Workload | NVLink Required? | Reasoning |
|---|---|---|
| Pre-training frontier models (>70B params, multi-GPU) | Yes | All-reduce at each step; PCIe bandwidth becomes the bottleneck at scale |
| Fine-tuning (7B-70B, single or multi-GPU) | Depends on VRAM fit | Single-GPU fine-tuning: no. Multi-GPU FSDP on >13B: yes |
| Inference (serving, single-GPU) | No | GPU operates standalone; no inter-GPU communication needed |
| Inference (tensor-parallel multi-GPU, large KV cache) | Yes | Activations and KV cache must transfer between GPUs every token |
| Batch embedding / classification | No | Independent per-sample; GPUs don't communicate |
For most inference deployments, NVLink is not the deciding factor. If your model fits on one GPU, skip the SXM premium. If you are running 70B+ in tensor-parallel mode, NVLink becomes mandatory. The bandwidth gap is too large to close with software tricks.
How to Verify NVLink on a Rented GPU Instance
Before assuming you have NVLink-enabled hardware, confirm it directly on the instance.
# Check NVLink link status and speed per link
nvidia-smi nvlink -s

If NVLink is active, output shows each link's speed and status. For NVLink 4.0 (H100), you'll see about 25 GB/s per link (commonly reported as 26.562 GB/s). For NVLink 5.0 (B200), about 50 GB/s. An all-Inactive result on a multi-GPU instance means PCIe-only topology.
# Check GPU-to-GPU topology
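nvidia-smi topo -m

Look for NV# entries such as NV4 or NV18 in the cross-GPU cells. The number is the count of bonded NVLink links between that GPU pair, not the NVLink generation, so NV4 means connected via 4 NVLink links. PIX, PXB, PHB, NODE, and SYS all indicate PCIe paths with no NVLink; SYS additionally crosses the interconnect between CPU sockets. A full SXM5 8-GPU node shows NV18 (18 NVLink links) in all GPU pair cells.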
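If you want to check the matrix programmatically, each cell can be classified by the codes in the legend that `nvidia-smi topo -m` prints. The sketch below handles only individual cell values; the function name is hypothetical, and parsing the full table layout is left out because column formatting varies between driver versions.

```python
import re

# PCIe-path codes from the legend printed by `nvidia-smi topo -m`.
PCIE_CODES = {"PIX", "PXB", "PHB", "NODE", "SYS"}


def classify_topo_cell(cell: str):
    """Map one topology-matrix cell to (kind, nvlink_link_count)."""
    cell = cell.strip()
    match = re.fullmatch(r"NV(\d+)", cell)
    if match:
        return ("nvlink", int(match.group(1)))  # NV# = # bonded NVLink links
    if cell in PCIE_CODES:
        return ("pcie", 0)
    if cell == "X":
        return ("self", 0)  # a GPU's entry for itself
    return ("unknown", 0)


for cell in ["NV18", "NV4", "SYS", "PHB", "X"]:
    print(cell, classify_topo_cell(cell))
```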
# Measure actual bandwidth (if nccl-tests is installed)
/usr/local/bin/all_reduce_perf -b 1G -e 8G -f 2 -g 8

On an NVLink-connected 8-GPU node, all-reduce bus bandwidth at 8GB message size should approach 800+ GB/s. On a PCIe node, you will see 80-120 GB/s. The ratio is diagnostic. For SSH setup to run these commands on a rented instance, see the Spheron deployment guide.
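When reading the output, it helps to know how nccl-tests derives the bus bandwidth column for all-reduce: algorithm bandwidth (message size divided by time) is scaled by 2(n-1)/n, where n is the number of ranks. The sketch below reproduces that conversion; the function name and the sample timing are hypothetical and do not represent a measured result.

```python
def allreduce_bus_bandwidth(size_bytes: float, time_s: float, num_ranks: int) -> float:
    """Bus bandwidth in GB/s, following the nccl-tests definition for all-reduce."""
    alg_bw = size_bytes / time_s / 1e9  # algorithm bandwidth in GB/s
    return alg_bw * 2 * (num_ranks - 1) / num_ranks


# Hypothetical run: a 1 GB message reduced across 8 GPUs in 5 ms.
print(f"{allreduce_bus_bandwidth(1e9, 0.005, 8):.0f} GB/s")
```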
Spheron NVLink-Enabled GPU Instances
All SXM-form-factor instances on Spheron include full NVSwitch fabric. The NVLink generation depends on the GPU architecture.
| GPU | NVLink Generation | BW per GPU | On-Demand (Spheron) | Spot (Spheron) |
|---|---|---|---|---|
| H100 SXM5 80GB | NVLink 4.0 | 900 GB/s | $3.84/hr | $1.63/hr |
| H200 SXM 141GB | NVLink 4.0 | 900 GB/s | $4.56/hr | $1.89/hr |
| B200 SXM 192GB | NVLink 5.0 | 1.8 TB/s | $7.16/hr | $1.71/hr |
| GB200 NVL72 (per GPU) | NVLink 5.0 | 1.8 TB/s | Contact for pricing | - |
H100 PCIe and RTX variants do not include NVSwitch fabric and are not listed here. L40S also lacks NVLink.
For on-demand B200 SXM instances on Spheron, NVLink 5.0 is standard on all SKUs. For H200 GPU rentals at the Hopper tier, NVLink 4.0 applies.
For GB200 NVL72 on Spheron, pricing is configured per cluster given the rack-scale deployment model.
Pricing fluctuates based on GPU availability. The prices above are as of 23 May 2026 and may have changed. Check current GPU pricing for live rates.
Training large models or running tensor-parallel inference? NVLink bandwidth is not a spec to overlook. Spheron surfaces the SXM vs PCIe distinction for every GPU SKU so you rent what your workload actually needs.
Frequently Asked Questions
NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. It replaces PCIe as the communication path between GPUs within a server, delivering up to 1.8 TB/s bidirectional bandwidth on Blackwell (NVLink 5.0) versus PCIe Gen5's 128 GB/s. NVLink enables tensor parallelism, model parallelism, and KV cache sharing across GPUs without CPU involvement.
NVLink 5.0, used in Blackwell GPUs (B200, B300, GB200), delivers 1.8 TB/s of bidirectional bandwidth per GPU across 18 links at 50 GB/s per direction (100 GB/s bidirectional) each. The NVL72 rack extends this to 130 TB/s of all-to-all bandwidth across 72 B200 GPUs via NVSwitch fabric.
NVLink is required for multi-GPU training on models larger than roughly 7B parameters, where inter-GPU all-reduce bandwidth is the bottleneck. PCIe Gen5 (128 GB/s) is sufficient for single-GPU inference, fine-tuning on models under 7B, and batch jobs that don't need tight GPU-to-GPU communication. If you are running FSDP or tensor parallelism across GPUs, you need NVLink.
Inference usually does not require NVLink. Single-node inference of models up to 70B either runs on a single GPU or uses NVLink for tensor parallelism, but the bandwidth requirements are far lower than for training. Multi-node inference (disaggregated prefill/decode across nodes) uses the inter-node network fabric, not NVLink. NVLink matters for inference when KV cache sharing between GPUs on the same node is a bottleneck.
NVSwitch is a dedicated switch chip that creates a full all-to-all NVLink fabric across all GPUs in a server or rack. On an 8-GPU H100 SXM5 node, four NVSwitch chips provide 900 GB/s bidirectional bandwidth to every GPU simultaneously. On the GB200 NVL72 rack, NVLink Switch chips scale this to 130 TB/s across 72 GPUs. Without NVSwitch, NVLink only connects GPU pairs directly.
Run `nvidia-smi nvlink -s` on your instance. If NVLink is active, you will see per-link speed and status for each link. For topology, run `nvidia-smi topo -m` and check for NV# entries such as NV4 or NV18 in the topology matrix (NV# means NVLink with # bonded links; PIX, PXB, PHB, NODE, and SYS mean PCIe). A PCIe-only instance will show only those PCIe codes for all GPU-to-GPU paths.
