What is NVLink? GPU Interconnect Bandwidth Explained for AI Training and Inference (2026)

NVLink is the reason an 8-GPU H100 server can move data between GPUs at 900 GB/s while a PCIe machine doing the same thing is stuck at 128 GB/s. That 7x gap is not a minor spec difference. It determines whether you can run FSDP, tensor parallelism, or large KV cache sharing at all.

This post covers what NVLink actually is, how bandwidth has evolved across generations from Pascal to Blackwell, a direct comparison with PCIe, and how to verify NVLink status on a rented instance. For the H100-specific form factor breakdown, the H100 NVL vs SXM5 vs PCIe guide covers the physical differences in detail.

What is NVLink

NVLink is NVIDIA's proprietary GPU-to-GPU interconnect. Instead of routing GPU communication through the PCIe bus (which the CPU and storage also share), NVLink creates a direct, dedicated high-speed path between GPUs within the same server or rack.

An analogy: PCIe is a city arterial road shared by traffic from the CPU, storage, and GPUs. NVLink is a private expressway built exclusively for GPU-to-GPU data. No merge lanes, no shared capacity, no CPU arbitration.

One important constraint: NVLink is confined to a single NVLink domain, meaning one server or, with rack-scale systems such as the GB200 NVL72, one rack. Across nodes, communication still travels over InfiniBand or Ethernet. For a full breakdown of inter-node fabric choices, see the GPU networking guide. For a comparison of NVLink with UALink, the open-standard alternative backed by AMD and the Ultra Accelerator Link Promoter Group, see the UALink vs NVLink interconnect guide.

NVLink Bandwidth Across Generations

Each NVLink generation has raised total bandwidth by 1.5x to 2x, though the mechanism differs. Generations 2.0 through 4.0 mostly added links; NVLink 5.0 kept 18 links but doubled per-link speed from 25 GB/s to 50 GB/s per direction.

Generation	GPU Architecture	Release	Links per GPU	Per-link BW	Total BW per GPU (bidirectional)
NVLink 1.0	Pascal (P100)	2016	4	20 GB/s	160 GB/s
NVLink 2.0	Volta (V100)	2017	6	25 GB/s	300 GB/s
NVLink 3.0	Ampere (A100)	2020	12	25 GB/s	600 GB/s
NVLink 4.0	Hopper (H100, H200)	2022	18	25 GB/s	900 GB/s
NVLink 5.0	Blackwell (B200, B300, GB200)	2024	18	50 GB/s	1.8 TB/s
NVLink 6.0	Rubin (R100, upcoming)	2026+	18	~66 GB/s	~2.4 TB/s

Per-link BW values above are unidirectional (per direction). Total bidirectional BW = Links × Per-link BW × 2.
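The formula above can be checked against the table with a few lines of Python. This is a minimal sketch: `GENERATIONS` and `total_bidirectional_gbps` are illustrative names, and the values are copied from the table rather than measured.

```python
# Links per GPU and per-link bandwidth (GB/s, per direction), copied from the table above.
GENERATIONS = {
    "NVLink 1.0": (4, 20),
    "NVLink 2.0": (6, 25),
    "NVLink 3.0": (12, 25),
    "NVLink 4.0": (18, 25),
    "NVLink 5.0": (18, 50),
}


def total_bidirectional_gbps(links: int, per_link_gbps: float) -> float:
    """Total BW per GPU = links x per-link BW x 2 directions."""
    return links * per_link_gbps * 2


for name, (links, per_link) in GENERATIONS.items():
    print(f"{name}: {total_bidirectional_gbps(links, per_link):.0f} GB/s")
```

Each printed total matches the last column of the table, which confirms the per-link figures are per direction.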

The NVLink 5.0 jump is notable. The link count stayed the same at 18, but per-link speed doubled from 25 GB/s to 50 GB/s per direction, which is why total bandwidth went from 900 GB/s to 1.8 TB/s. For the full Blackwell architecture context, see the NVIDIA B200 complete guide.

NVLink 6.0 is expected with the Rubin generation (R100). It is not yet shipping, so the ~2.4 TB/s figure is based on NVIDIA roadmap data, not measured performance. R100 lands in H2 2026, and Spheron's R100 pre-order is open now: reserve your place in line by filling the form with your GPU count, timeline, and workload, and the team reaches out as allocation opens.

Why NVLink Matters for AI Training and Inference

Model parallelism

When a model is too large to fit on a single GPU, its layers are split across multiple GPUs. Activations must transfer between GPUs on every forward pass. At 900 GB/s, the transfer time for a layer's activation tensors is microseconds. At PCIe Gen5's 128 GB/s, the same transfer takes 7x longer and becomes a visible bottleneck in step time.
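To put rough numbers on that gap, the sketch below times one activation transfer at both headline bandwidths. The tensor shape is a hypothetical example, not a specific model, and the headline figures are peak bidirectional rates, so real transfers will be slower on both links.

```python
def transfer_time_us(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time in microseconds to move num_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


# Hypothetical activation tensor: 1 sequence x 4096 tokens x 8192 hidden dims in BF16 (2 bytes).
activation_bytes = 1 * 4096 * 8192 * 2

nvlink_us = transfer_time_us(activation_bytes, 900)  # NVLink 4.0 headline figure
pcie_us = transfer_time_us(activation_bytes, 128)    # PCIe Gen5 x16 headline figure

print(f"NVLink 4.0: {nvlink_us:.1f} us")
print(f"PCIe Gen5:  {pcie_us:.1f} us")
print(f"Ratio:      {pcie_us / nvlink_us:.2f}x")
```

The ratio depends only on the two bandwidths, so it stays near 7x whatever tensor size is assumed.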

Tensor parallelism

Tensor parallelism shards weight matrices across GPUs. Specifically, linear layers are split along rows or columns so each GPU holds a partial weight matrix. After each matrix multiply, GPUs run an all-reduce to sum partial results. Megatron-LM's tensor parallelism design assumes NVLink-class bandwidth. On PCIe, the all-reduce overhead often negates the compute savings from adding more GPUs.
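The row-split case can be simulated on a CPU to show why the all-reduce is needed. This is a toy sketch in plain Python, not Megatron-LM's implementation: `matmul` and `row_parallel_linear` are hypothetical helpers, and the final element-wise sum stands in for the all-reduce that NVLink would carry.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_parallel_linear(x, w, num_gpus):
    """Simulate a row-split linear layer: each 'GPU' holds a slice of W's rows
    and the matching slice of X's columns, then partial outputs are summed."""
    shard = len(w) // num_gpus  # assumes len(w) is divisible by num_gpus
    partials = []
    for g in range(num_gpus):
        lo, hi = g * shard, (g + 1) * shard
        x_shard = [row[lo:hi] for row in x]  # columns of X owned by this GPU
        w_shard = w[lo:hi]                   # rows of W owned by this GPU
        partials.append(matmul(x_shard, w_shard))
    # The element-wise sum below stands in for the all-reduce.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]      # 2 x 4 activations
w = [[1, 0], [0, 1], [2, 1], [1, 2]]  # 4 x 2 weights
assert row_parallel_linear(x, w, num_gpus=2) == matmul(x, w)
print(row_parallel_linear(x, w, num_gpus=2))  # [[11, 13], [27, 29]]
```

Every partial result has the full output shape, so the data each GPU must exchange scales with the activation size on every layer, which is why the interconnect sits on the critical path.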

KV cache sharing

In multi-GPU inference serving, the KV cache for long-context requests can be distributed across GPUs on the same node. NVLink allows coherent access to another GPU's KV cache without round-tripping through system memory. This matters most at context lengths above 32K tokens where a single GPU's HBM can't hold the full cache.
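A quick estimate shows how fast the cache grows with context length. The sketch below uses the standard per-sequence sizing (keys plus values, for every layer, KV head, and token); the model configuration is an assumed 70B-class setup with grouped-query attention, chosen for illustration only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV cache size: keys and values (x2) for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 70B-class configuration with grouped-query attention (assumed values).
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32 * 1024)
print(f"{size / 2**30:.1f} GiB per 32K-token sequence")  # 10.0 GiB
```

Under these assumptions a handful of concurrent 32K-token requests already competes with the model weights for HBM, which is the point at which spreading the cache across GPUs starts to pay off.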

All-reduce throughput

Data-parallel training sums gradients across GPUs after every step. The all-reduce volume is 2 * (N-1)/N * model_params * bytes_per_param. For a 70B parameter model in BF16 (2 bytes per param) across 8 GPUs, that is roughly 245 GB per step. At NVLink 4.0's 900 GB/s this completes in under 300 ms. At PCIe Gen5's 128 GB/s it takes about 1.9 seconds per step, long enough for communication to dominate step time.
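The arithmetic for the 70B example can be reproduced directly. This is a back-of-envelope sketch using the formula above and the headline bandwidth figures. It ignores latency, overlap with compute, and protocol overhead, so treat the times as lower bounds.

```python
def allreduce_bytes(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Volume from the formula above: 2 * (N-1)/N * params * bytes_per_param."""
    return 2 * (num_gpus - 1) / num_gpus * num_params * bytes_per_param


volume = allreduce_bytes(num_params=70e9, bytes_per_param=2, num_gpus=8)

for name, gb_per_s in [("NVLink 4.0", 900), ("PCIe Gen5 x16", 128)]:
    seconds = volume / (gb_per_s * 1e9)
    print(f"{name}: {volume / 1e9:.0f} GB in {seconds:.2f} s")
```

This works out to 245 GB, or about 0.27 s on NVLink 4.0 and about 1.91 s on PCIe Gen5.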

NVLink vs PCIe: Latency, Bandwidth, and When Each Matters

PCIe and NVLink solve the same problem at different price points and bandwidth tiers.

Dimension	NVLink 4.0 (H100)	NVLink 5.0 (B200)	PCIe Gen5 x16
Bidirectional BW	900 GB/s	1.8 TB/s	128 GB/s
Topology	All-to-all via NVSwitch	All-to-all via NVSwitch	Shared PCIe root complex
Latency (GPU-to-GPU)	~1 µs	~1 µs	~4-8 µs
CPU involvement	None	None	DMA through PCIe
Max GPUs (intra-node)	8 (SXM5)	8 (HGX B200) or 72 (NVL72)	8
Cost premium	Requires SXM baseboard	Requires SXM baseboard	Commodity PCIe slots

When each matters:

Use NVLink when running FSDP, tensor parallelism, or pipeline parallelism on models larger than 7B parameters; when KV cache is distributed across GPUs; or when all-reduce time exceeds 5% of step time (measure the share with a profiler, and the achievable bandwidth with nccl-tests).
PCIe is sufficient for single-GPU fine-tuning, inference serving on models that fit in one GPU, batch embedding jobs, and any workload where GPUs operate independently.
Mixed environments: some cloud providers rent PCIe form-factor H100s at lower rates. For workloads that genuinely need NVLink bandwidth, the SXM form factor is the right choice.
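The 5% rule of thumb in the list above is easy to apply once you have timings from a profiler. The sketch below is one way to express it. The function names and the sample timings are hypothetical, and the threshold is the article's guideline, not a hard limit.

```python
def allreduce_share(comm_seconds: float, step_seconds: float) -> float:
    """Fraction of one training step spent in all-reduce."""
    return comm_seconds / step_seconds


def nvlink_recommended(comm_seconds: float, step_seconds: float, threshold: float = 0.05) -> bool:
    """Apply the 5% rule of thumb: True when communication exceeds the threshold share."""
    return allreduce_share(comm_seconds, step_seconds) > threshold


# Hypothetical profiler readings, in seconds.
print(nvlink_recommended(comm_seconds=0.30, step_seconds=2.0))  # True  (15% of the step)
print(nvlink_recommended(comm_seconds=0.02, step_seconds=2.0))  # False (1% of the step)
```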

NVSwitch: Scaling NVLink Beyond 8 GPUs

NVLink describes the link protocol. NVSwitch is the dedicated silicon that makes full all-to-all NVLink fabric possible at scale.

On an 8-GPU H100 SXM5 server, four NVSwitch chips sit on the baseboard and create a non-blocking crossbar. Every GPU can drive its full 900 GB/s of NVLink bandwidth at the same time, to any mix of peers. Without NVSwitch, NVLink would form point-to-point or ring topologies between GPU pairs, reducing effective bandwidth for multi-GPU collectives.

An all-reduce across 8 H100 SXM5 GPUs via NVSwitch is bandwidth-bound rather than latency-bound, and its per-collective overhead is small next to the PCIe equivalent. The same operation across 8 PCIe-connected GPUs can consume 30-40% of a training step on large models.

The GB200 NVL72 takes this further. NVLink Switch chips (the rack-scale variant of NVSwitch) connect 72 B200 GPUs across the entire rack, delivering 130 TB/s of all-to-all bandwidth. This is not 72 GPUs connected via slower links. Each GPU still has 1.8 TB/s NVLink 5.0 bandwidth, and NVSwitch provides the non-blocking fabric to use it simultaneously with every other GPU. For the full rack architecture, see the GB200 NVL72 guide.

NVLink in the Data Center: SXM vs PCIe Form Factors

SXM is NVIDIA's proprietary socket and baseboard design. The GPU module is mounted directly on a baseboard that routes NVLink signals between all GPUs in the server. This baseboard is where H100 SXM5, H200 SXM, and B200 SXM get their NVLink fabric.

PCIe form-factor GPUs (H100 PCIe, A100 PCIe) insert into standard PCIe slots. They do not connect via NVSwitch, so GPU-to-GPU communication goes through the PCIe root complex. One exception: the H100 NVL pairs two PCIe cards with a 2-GPU NVLink bridge. That gives those two GPUs a direct NVLink connection to each other, but it is not NVSwitch and does not scale to 8 GPUs.

On multi-tenant clouds, the form factor is not always labeled clearly. A listing that says "H100 80GB" could be PCIe or SXM5, and the bandwidth difference between the two is 7x. Spheron labels the form factor explicitly for each SKU, including H100 SXM5, so you can verify before provisioning.

When NVLink is Required vs Optional

Workload	NVLink Required?	Reasoning
Pre-training frontier models (>70B params, multi-GPU)	Yes	All-reduce at each step; PCIe bandwidth becomes the bottleneck at scale
Fine-tuning (7B-70B, single or multi-GPU)	Depends on VRAM fit	Single-GPU fine-tuning: no. Multi-GPU FSDP on >13B: yes
Inference (serving, single-GPU)	No	GPU operates standalone; no inter-GPU communication needed
Inference (tensor-parallel multi-GPU, large KV cache)	Yes	Activations and KV cache must transfer between GPUs every token
Batch embedding / classification	No	Independent per-sample; GPUs don't communicate

For most inference deployments, NVLink is not the deciding factor. If your model fits on one GPU, skip the SXM premium. If you are running 70B+ in tensor-parallel mode, NVLink becomes mandatory. The bandwidth gap is too large to close with software tricks.

How to Verify NVLink on a Rented GPU Instance

Before assuming you have NVLink-enabled hardware, confirm it directly on the instance.

```bash
# Check NVLink link status and speed per link
nvidia-smi nvlink -s
```

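If NVLink is active, the output lists each link with its speed. For NVLink 4.0 (H100), expect roughly 25 GB/s per link; for NVLink 5.0 (B200), roughly 50 GB/s. The exact figure reported may differ slightly from these nominal values. If every link on a multi-GPU instance is reported as inactive, or no links are listed at all, the topology is PCIe-only.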

```bash
# Check GPU-to-GPU topology
nvidia-smi topo -m
```

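Look for NV# entries such as NV4, NV12, or NV18 in the cross-GPU cells. The number is the count of bonded NVLink links, not the NVLink generation, so NV4 means connected via 4 links. PIX, PXB, PHB, NODE, and SYS all indicate a PCIe path with no NVLink; SYS additionally crosses the interconnect between CPU sockets. A full SXM5 8-GPU node shows NV18 (18 NVLink links) in all GPU pair cells.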
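If you want to script this check, the matrix can be parsed as plain text. The sketch below is an assumption-heavy illustration: `SAMPLE` is a hand-written stand-in for the command's output, not a capture from a real machine, and `gpu_pair_links` and `all_nvlink` are hypothetical helpers. Real output carries extra columns, a legend, and possibly terminal formatting codes that would need stripping first.

```python
import re

# Illustrative text only: a hand-written stand-in for `nvidia-smi topo -m` output,
# not captured from a real machine. Real output has more columns and a legend.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3
GPU0     X      NV18    NV18    NV18
GPU1    NV18     X      NV18    NV18
GPU2    NV18    NV18     X      NV18
GPU3    NV18    NV18    NV18     X
"""


def gpu_pair_links(topo_text: str) -> dict:
    """Map each (row GPU, column GPU) pair to its topology code, skipping self (X)."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    columns = [c for c in lines[0].split() if re.fullmatch(r"GPU\d+", c)]
    pairs = {}
    for line in lines[1:]:
        cells = line.split()
        if not re.fullmatch(r"GPU\d+", cells[0]):
            continue  # ignore NIC rows, legend lines, and anything else
        for col, code in zip(columns, cells[1:]):
            if code != "X":
                pairs[(cells[0], col)] = code
    return pairs


def all_nvlink(topo_text: str) -> bool:
    """True when every GPU-to-GPU cell is an NV# entry."""
    pairs = gpu_pair_links(topo_text)
    return bool(pairs) and all(re.fullmatch(r"NV\d+", c) for c in pairs.values())


print(all_nvlink(SAMPLE))
```

In practice the text would come from running the command via `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)` on the instance itself.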

```bash
# Measure actual bandwidth (if nccl-tests is installed)
/usr/local/bin/all_reduce_perf -b 1G -e 8G -f 2 -g 8
```

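nccl-tests reports bus bandwidth per direction, so compare it against half of the bidirectional headline figure. On an NVLink-connected 8-GPU H100 node, all-reduce bus bandwidth at 8 GB message size should land in the hundreds of GB/s, in the neighborhood of the 450 GB/s per-direction NVLink figure. On a PCIe Gen5 node it is capped near the 64 GB/s per-direction limit of an x16 link and is usually well below it. The ratio is diagnostic. For SSH setup to run these commands on a rented instance, see the Spheron deployment guide.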
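It helps to know how the bus bandwidth column is derived when reading the results. For all-reduce, nccl-tests computes algorithm bandwidth as message size over time and scales it by 2(n-1)/n to get bus bandwidth. The sketch below reproduces that calculation; the function name and the sample timing are hypothetical.

```python
def allreduce_bus_bandwidth(message_bytes: float, seconds: float, num_ranks: int) -> float:
    """Bus bandwidth in GB/s as nccl-tests defines it for all-reduce:
    algbw = size / time, busbw = algbw * 2 * (n - 1) / n."""
    algbw = message_bytes / seconds / 1e9
    return algbw * 2 * (num_ranks - 1) / num_ranks


# Hypothetical timing: an 8 GiB all-reduce across 8 GPUs finishing in 40 ms.
print(f"{allreduce_bus_bandwidth(8 * 2**30, 0.040, 8):.0f} GB/s")
```

Because of the 2(n-1)/n factor, bus bandwidth is comparable across GPU counts, which makes it the right column to compare between an NVLink node and a PCIe node.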

Spheron NVLink-Enabled GPU Instances

All SXM-form-factor instances on Spheron include full NVSwitch fabric. The NVLink generation depends on the GPU architecture.

GPU	NVLink Generation	BW per GPU	On-Demand (Spheron)	Spot (Spheron)
H100 SXM5 80GB	NVLink 4.0	900 GB/s	$3.84/hr	$1.63/hr
H200 SXM 141GB	NVLink 4.0	900 GB/s	$4.56/hr	$1.89/hr
B200 SXM 192GB	NVLink 5.0	1.8 TB/s	$7.16/hr	$1.71/hr
GB200 NVL72 (per GPU)	NVLink 5.0	1.8 TB/s	Contact for pricing	-

H100 PCIe and RTX variants do not include NVSwitch fabric and are not listed here. L40S also lacks NVLink.

For on-demand B200 SXM instances on Spheron, NVLink 5.0 is standard on all SKUs. For H200 GPU rentals at the Hopper tier, NVLink 4.0 applies.

For GB200 NVL72 on Spheron, pricing is configured per cluster given the rack-scale deployment model.

Pricing fluctuates based on GPU availability. The prices above are as of 23 May 2026 and may have changed. Check current GPU pricing for live rates.

Training large models or running tensor-parallel inference? NVLink bandwidth is not a spec to overlook. Spheron surfaces the SXM vs PCIe distinction for every GPU SKU so you rent what your workload actually needs.
H100 SXM5 on Spheron → | Spheron B200 → | View all GPU pricing →

Frequently Asked Questions

What is NVLink?

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. It replaces PCIe as the communication path between GPUs within a server, delivering up to 1.8 TB/s bidirectional bandwidth on Blackwell (NVLink 5.0) versus PCIe Gen5's 128 GB/s. NVLink enables tensor parallelism, model parallelism, and KV cache sharing across GPUs without CPU involvement.

What is NVLink 5.0 bandwidth?

NVLink 5.0, used in Blackwell GPUs (B200, B300, GB200), delivers 1.8 TB/s of bidirectional bandwidth per GPU across 18 links at 50 GB/s per direction (100 GB/s bidirectional) each. The NVL72 rack extends this to 130 TB/s of all-to-all bandwidth across 72 B200 GPUs via NVSwitch fabric.

NVLink vs PCIe: which do I need?

NVLink is required for multi-GPU training on models larger than roughly 7B parameters, where inter-GPU all-reduce bandwidth is the bottleneck. PCIe Gen5 (128 GB/s) is sufficient for single-GPU inference, fine-tuning on models under 7B, and batch jobs that don't need tight GPU-to-GPU communication. If you are running FSDP or tensor parallelism across GPUs, you need NVLink.

Does LLM inference require NVLink?

Usually not. Models up to 70B on a single node either run on one GPU or use NVLink for tensor parallelism, and in the tensor-parallel case the bandwidth requirements are far lower than in training. Multi-node inference (disaggregated prefill/decode across nodes) uses the inter-node network fabric, not NVLink. NVLink matters for inference when KV cache sharing between GPUs on the same node is a bottleneck.

What is NVSwitch and how does it relate to NVLink?

NVSwitch is a dedicated switch chip that creates a full all-to-all NVLink fabric across all GPUs in a server or rack. On an 8-GPU H100 SXM5 node, four NVSwitch chips provide 900 GB/s bidirectional bandwidth to every GPU simultaneously. On the GB200 NVL72 rack, NVLink Switch chips scale this to 130 TB/s across 72 GPUs. Without NVSwitch, NVLink only connects GPU pairs directly.

How do I check if my rented GPU has NVLink enabled?

Run `nvidia-smi nvlink -s` on your instance. If NVLink is active, you will see per-link speed and status for each link. For topology, run `nvidia-smi topo -m` and check for NV# entries such as NV4 or NV18 in the topology matrix, where the number is the count of NVLink links between the pair. A PCIe-only instance will show PIX, PXB, PHB, NODE, or SYS for all GPU-to-GPU paths.
