
GPU Networking for AI Clusters: InfiniBand vs RoCE vs Spectrum-X Decision Guide (2026)

Written by Mitrasish, Co-founder · Apr 23, 2026

A large-scale H100 cluster can spend 15-30% of its cycles waiting on the network during large all-reduce operations. Whether you are training a 70B frontier model or running disaggregated inference, the interconnect fabric is not an afterthought. The wrong choice here can turn a $500K training run into a $600K one without touching a single line of model code.

For training-specific tradeoffs on existing hardware, see multi-node GPU training without InfiniBand.

Why GPU Networking Determines Training Speed

Data-parallel training splits a model across nodes. Each node computes gradients on its local batch, then all nodes synchronize gradients before the next step. That synchronization is an all-reduce operation. For a 405B Llama 3.1 training run with BF16 weights, each all-reduce across 8 nodes moves ~1.4 TB of gradient data per step.

The volume formula: 2 * (N-1)/N * model_params * bytes_per_param. At N=8 nodes, that coefficient is 1.75. A 405B BF16 model has 810 GB of parameters, so each step moves ~1.4 TB across the fabric. At 400 Gb/s InfiniBand NDR (50 GB/s), that takes roughly 28 seconds. At 100GbE Ethernet (12.5 GB/s), it takes ~112 seconds. If your compute step takes 30 seconds, InfiniBand is mandatory. If it takes 600 seconds, Ethernet is fine.
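The arithmetic above can be checked directly. A minimal sketch, using the parameter count, dtype width, node count, and link speeds from the text:

```python
# Sanity-check of the ring all-reduce volume formula and transfer times.
# 405B params, BF16 (2 bytes), 8 nodes, IB NDR ~50 GB/s, 100GbE ~12.5 GB/s.

def allreduce_volume_gb(params, bytes_per_param, n_nodes):
    """Per-step ring all-reduce traffic: 2 * (N-1)/N * params * bytes."""
    return 2 * (n_nodes - 1) / n_nodes * params * bytes_per_param / 1e9

vol = allreduce_volume_gb(405e9, 2, 8)   # ~1418 GB (~1.4 TB) per step
t_ib = vol / 50.0     # InfiniBand NDR 400G ~= 50 GB/s   -> ~28 s
t_eth = vol / 12.5    # 100GbE              ~= 12.5 GB/s -> ~113 s
print(f"{vol:.0f} GB/step | IB NDR: {t_ib:.0f} s | 100GbE: {t_eth:.0f} s")
```

Plug in your own model size and measured link bandwidth to see which side of the compute/communication boundary you land on.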

The all-reduce breaks into ring-all-reduce phases: reduce-scatter then all-gather, each moving (N-1)/N * data_size. The ring topology means each node only communicates with two neighbors, distributing load evenly. Collective operations beyond all-reduce matter too: all-gather (used in ZeRO-3 weight reconstruction), broadcast (initial parameter scatter), and reduce-scatter (gradient sharding in ZeRO/FSDP). In mixture-of-experts models, expert parallelism adds all-to-all operations where a single slow link stalls the entire step.

The ratio of communication time to compute time is the deciding metric. Run nccl-tests on your actual model and cluster before committing to any fabric.

Intra-Node vs Inter-Node: Two Different Problems

NVLink 5 and NVSwitch (Intra-Node)

Within a single GPU node, NVIDIA's NVLink and NVSwitch provide all-to-all connectivity between GPUs. On an H100 SXM5 8-GPU node, NVLink 4 delivers 900 GB/s bidirectional bandwidth per GPU through NVSwitch. Every GPU sees every other GPU at full bandwidth simultaneously.

| Interconnect | Bandwidth (bidirectional) | Scope | Generation |
|---|---|---|---|
| NVLink 4 | 900 GB/s | Intra-node | Hopper (H100) |
| NVLink 5 | 1.8 TB/s | Intra-node | Blackwell (B200) |
| NVLink 6 | 2.4 TB/s | Intra-node | Rubin (upcoming) |
| PCIe Gen5 | 128 GB/s | Intra-node | Current gen |

NVSwitch handles all-to-all routing without a separate switch chip. The full H100 SXM5 node has four NVSwitch chips providing a fully non-blocking fabric. This means tensor parallelism within a node is essentially free from a networking perspective. Single-node inference of models up to roughly 640B parameters (depending on quantization) can run entirely within NVLink bandwidth without any inter-node communication. The Blackwell generation doubles this to 1.8 TB/s with NVLink 5. The NVIDIA B200 complete guide covers the full architecture breakdown.
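The single-node capacity figure is straightforward to reproduce. A back-of-envelope sketch, where the 8×80 GB node and per-parameter byte widths follow the text and the headroom fraction for KV cache and activations is an illustrative assumption:

```python
# Rough check: largest model that fits entirely inside one node's HBM,
# i.e. runs over NVLink with zero inter-node traffic.

def max_params_b(gpus, gb_per_gpu, bytes_per_param, headroom=0.2):
    """Max parameter count (in billions) that fits in node HBM.
    headroom reserves a fraction for KV cache/activations (assumed value)."""
    usable_gb = gpus * gb_per_gpu * (1 - headroom)
    return usable_gb * 1e9 / bytes_per_param / 1e9

# 8x H100 80GB at 1 byte/param (FP8/INT8): 640B with no headroom,
# matching the "roughly 640B depending on quantization" figure above.
print(round(max_params_b(8, 80, 1, headroom=0.0)))
```

With a realistic headroom of 20%, the practical ceiling drops to roughly 512B at 1 byte per parameter, which is why "depending on quantization" is doing real work in that sentence.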

Inter-Node Fabric: Where the Decision Lives

Once communication crosses the node boundary, NVLink and PCIe are irrelevant. The inter-node fabric is InfiniBand, RoCEv2, or Spectrum-X. This is the layer that determines multi-node training performance, disaggregated inference latency, and cluster cost.

The Three Inter-Node Fabrics

InfiniBand NDR and XDR

InfiniBand is a purpose-built RDMA fabric designed from the ground up for low-latency, high-bandwidth cluster communication. The kernel is bypassed entirely: the CPU writes a message descriptor to a queue pair, and the ConnectX HCA handles all data movement without OS involvement.

NDR (Next Data Rate) delivers 400 Gb/s per port. A dual-port ConnectX-7 NDR HCA gives 800 Gb/s host bandwidth per server. XDR (Extended Data Rate, expected 2026-2027) doubles that to 800 Gb/s per port. The switch silicon is NVIDIA Quantum-2, which does credit-based flow control natively so the fabric is lossless by design. No PFC configuration required, no packet drops at full load.

The key performance advantage beyond raw bandwidth is SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). SHARP moves all-reduce computation into the switch fabric itself: instead of every node receiving all gradients and then summing them, the switches perform the reduction in-flight as data passes through. For large clusters (16+ nodes), this cuts all-reduce round-trips from O(log N) to O(1). SHARP requires NVIDIA Quantum-2 switches and ConnectX-7 HCAs, and needs explicit enablement in NCCL via NCCL_COLLNET_ENABLE=1.

MPI point-to-point latency on InfiniBand NDR: sub-1 microsecond at 8-byte messages.

RoCEv2 on Ethernet

RoCEv2 (RDMA over Converged Ethernet version 2) provides RDMA semantics over standard Ethernet infrastructure. The same ConnectX-7 HCA supports both InfiniBand and RoCE modes; the choice is made at driver configuration time. RoCEv2 handles IP routing and runs over standard 400GbE or 800GbE Ethernet switches from Broadcom, Arista, or any other vendor.

The catch: Ethernet is lossy by default. To achieve RDMA performance, the network must be made lossless using Priority Flow Control (PFC) pause frames, and congestion must be managed via DCQCN (Data Center Quantized Congestion Notification) combining ECN marking and a rate-based congestion control algorithm at the HCA.

Getting PFC and DCQCN right is genuinely hard. Misconfiguration causes head-of-line blocking (HoL): one congested flow blocks unrelated flows sharing the same priority class, causing queue buildup across the fabric. The symptoms are unpredictable latency spikes and low effective bandwidth. Meta's public engineering posts on their production RoCE infrastructure at thousands of GPUs describe extensive tuning work over multiple years.

Well-tuned RoCEv2 on 400GbE delivers MPI latency of 2-5 microseconds at 8-byte messages. On 800GbE with modern HCAs and good tuning, you can push closer to 1.5-2 microseconds. But achieving and maintaining that requires operational discipline on every switch in the path.

NVIDIA Spectrum-X

Spectrum-X is NVIDIA's end-to-end Ethernet AI fabric, launched in 2023. It combines Spectrum-4 switches (51.2 Tb/s per switch) with ConnectX-7 HCAs running a proprietary RoCE extension that NVIDIA calls "Adaptive Routing." Unlike standard RoCEv2, Spectrum-X's adaptive routing makes per-packet load balancing decisions in hardware, spreading traffic across all available paths dynamically. This eliminates the HoL blocking problem that makes vanilla RoCEv2 hard to tune.

The Spectrum-4 switch also supports NVIDIA's version of hardware-offloaded collectives for Ethernet. The result: Spectrum-X closes roughly 80-90% of the performance gap with InfiniBand NDR on NCCL all-reduce workloads, per NVIDIA's published benchmarks. Spectrum-X switches are standard 800GbE hardware and interoperate with non-NVIDIA Ethernet gear. The adaptive routing and congestion control benefits only apply when both ends are Spectrum-X (switch + ConnectX-7 HCA), but the switches themselves are standard Ethernet.

The product line: the Spectrum-4 ASIC delivers 51.2 Tb/s of switching capacity; the SN5600, built on Spectrum-4, exposes it as 64 ports of 800GbE (or 128 ports of 400GbE). The SN5600 is the current top-of-rack option for 800G Spectrum-X deployments.

Comparison Table

| Property | InfiniBand NDR | RoCEv2 on Ethernet | Spectrum-X |
|---|---|---|---|
| Port speed | 400 Gb/s (NDR) | Up to 800 GbE | 400/800 GbE |
| Effective all-reduce BW | Highest | 70-80% of IB | ~85-90% of IB |
| MPI latency (8-byte) | <1 µs | 2-5 µs | 1.5-2.5 µs |
| Congestion control | Credit-based (lossless) | PFC+DCQCN | Adaptive routing |
| Switch cost per port | High (Quantum-2) | Low-medium | Medium (Spectrum-4) |
| HCA cost | High (NDR HCA) | Lower (RoCE mode) | Medium (CX-7) |
| AI-specific offloads | SHARP | None | NVIDIA Adaptive Routing |
| Ecosystem maturity | Very high | High | Growing (2023+) |
| Interoperability | IB-only fabric | Any Ethernet | Any Ethernet |

Bandwidth and Latency Benchmarks

Concrete numbers from published sources, not internal Spheron measurements:

NCCL all-reduce on 8xH100 SXM5: InfiniBand NDR 400G achieves roughly 350 GB/s effective bus bandwidth; the same cluster on 400GbE RoCEv2 achieves 270-290 GB/s. The 20-25% gap is significant for communication-bound workloads.

Llama 3 405B training throughput: MLCommons MLPerf Training 2024 results show clusters with InfiniBand NDR achieving approximately 15-20% higher tokens per second than equivalent RoCEv2 clusters at the same GPU count. This figure is workload-specific and shrinks at smaller model sizes.

Spectrum-X vs InfiniBand at scale: NVIDIA's published Spectrum-X data shows Spectrum-X 800GbE matching InfiniBand NDR within 5% on NCCL all-reduce at 8 nodes. The gap grows to 10-15% at 64 nodes, where InfiniBand's SHARP offloads provide increasing benefit. For most teams running 8-16 node clusters, Spectrum-X is effectively equivalent.

Latency comparison at 8-byte message size: InfiniBand NDR ~0.9 µs, Spectrum-X ~1.7 µs, RoCEv2 400G (well-tuned) ~2.4 µs. For MoE expert routing with many small all-to-all operations, this latency difference compounds across every step.

Real-world results vary by cluster size, NCCL version, switch topology, and tuning depth. Always benchmark on your actual configuration before making fabric decisions.

Cost Math: What You Actually Pay per GPU

Fabric cost is real capital expenditure. Here is approximate cost math for an 8-node, 64-GPU H100 cluster.

InfiniBand NDR cost components:

NVIDIA Quantum-2 switches run $50,000-$80,000 per 64-port unit. ConnectX-7 NDR HCAs cost $5,000-$8,000 each. 400G optical cables add $200-$400 per link.

For 8 nodes: 1 Quantum-2 switch (~$65K) + 64 HCAs (~$400K) + cables (~$25K) = roughly $490K fabric cost. Amortized over 3 years (36 months × 64 GPUs): ~$213/GPU/month or ~$0.29/GPU/hour added to compute cost.

Spectrum-X cost components:

SN5600 switches cost $40,000-$55,000 per unit. ConnectX-7 in RoCE mode is $3,500-$5,000 each (same physical HCA as NDR mode at lower cost tier).

For 8 nodes: 1 SN5600 (~$47K) + 64 HCAs (~$250K) + cables (~$20K) = roughly $317K fabric cost. Amortized over 3 years: ~$138/GPU/month or ~$0.19/GPU/hour.

RoCEv2 on commodity Ethernet:

Broadcom Tomahawk 4 or similar 400G switch runs $15,000-$25,000 per unit. ConnectX-6 Dx HCAs in RoCE mode cost $1,800-$2,500 each.

For 8 nodes: 1 switch (~$20K) + 64 HCAs (~$140K) + cables (~$15K) = roughly $175K fabric cost. Amortized over 3 years: ~$76/GPU/month or ~$0.10/GPU/hour.

The gap between InfiniBand NDR and commodity RoCEv2 is about $0.19/GPU/hour in fabric amortization. Whether that premium is worth it depends entirely on how communication-bound your workload is.
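The amortization arithmetic above reduces to one division. A sketch, using the midpoint fabric totals from the text and assuming 730 hours per month (24 × 365 / 12):

```python
# Fabric capex amortized to a per-GPU-hour adder on compute cost.

def fabric_cost_per_gpu_hour(total_usd, gpus, months=36, hours_per_month=730):
    """Fabric capex spread over the fleet's GPU-hours across the term."""
    return total_usd / (gpus * months * hours_per_month)

ib = fabric_cost_per_gpu_hour(490_000, 64)    # InfiniBand NDR, ~$0.29
spx = fabric_cost_per_gpu_hour(317_000, 64)   # Spectrum-X,     ~$0.19
roce = fabric_cost_per_gpu_hour(175_000, 64)  # commodity RoCE, ~$0.10
print(f"IB {ib:.2f} | SpX {spx:.2f} | RoCE {roce:.2f} | IB premium {ib - roce:.2f}")
```

Swap in your own quotes, node count, and depreciation term; the structure of the comparison stays the same.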

For on-demand access without capital outlay, H100 on Spheron runs at $2.90/hr per GPU on-demand (spot from $0.80/hr). If you need H200 scale, H200 SXM5 is available from $1.19/hr per GPU on spot via Spheron. The fabric choice is abstracted in on-demand pricing.

Pricing fluctuates based on GPU availability. The prices above are based on 23 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

When to Use Each Fabric

| Workload | Recommended Fabric | Reason |
|---|---|---|
| Frontier model training (>100B params, 64+ GPUs) | InfiniBand NDR | Communication-bound; SHARP offload pays off |
| Large-scale fine-tuning (>32 GPUs, FSDP/ZeRO-3) | InfiniBand NDR or Spectrum-X | Gradient sync dominates; 15-20% slowdown on RoCE is real |
| Mid-scale training (8-32 GPUs, <70B) | Spectrum-X or tuned RoCEv2 | Cost savings outweigh ~5-10% throughput delta |
| Small-scale training (2-8 GPUs, pipeline parallelism) | Any (RoCEv2 fine) | Pipeline stages hide network latency |
| Production LLM inference (tensor parallel, 8 GPUs) | RoCEv2 or Spectrum-X | KV transfer latency matters less than throughput; IB premium rarely justified |
| Disaggregated inference (prefill/decode split nodes) | Spectrum-X or IB | Cross-node KV transfer is latency-sensitive; IB preferred at scale |
| Batch inference / async workloads | Standard Ethernet (25/100G fine) | Not latency-sensitive |
| Spot/preemptible training (ZeRO-3 with checkpointing) | RoCEv2 | Cheaper total cost; spot recovery on Ethernet is fast enough |

For LLM inference workloads, continuous batching and paged attention techniques reduce per-request KV cache memory, which directly determines whether you need cross-node KV transfer at all.

The spot row deserves emphasis. If your training job can tolerate interruption with good checkpointing, using RoCEv2 on spot instances often beats InfiniBand on reserved instances by a large margin on total cost. See current GPU pricing to compare spot vs on-demand rates. For a broader analysis of GPU infrastructure tradeoffs and total cost of ownership, the AI GPU buyers guide covers the full decision framework.

Emerging Standards: UALink and the Ultra Ethernet Consortium

Two open standards are trying to reduce NVIDIA's fabric lock-in. Neither is broadly shipping today, but both will matter over the next 2-3 years.

Ultra Ethernet Consortium (UEC): Formed in 2023, with AMD, Intel, HPE, Broadcom, Cisco, Meta, and Microsoft among the founding members. The goal is an open Ethernet RDMA fabric for AI that matches InfiniBand latency without proprietary components. The UEC 1.0 specification, published in 2025, defines a transport protocol (Ultra Ethernet Transport) that avoids PFC entirely, using a credit-based reliability layer implemented in NIC firmware rather than relying on switch-level pause frames. This eliminates the HoL blocking problem that makes RoCEv2 hard to operate at scale.

UEC 1.0 also specifies hardware-offloaded collectives at the NIC level, not the switch level. Broadcom's Tomahawk-UEC silicon targeting UEC 1.0 is expected in 2026-2027. Until products ship, Spectrum-X is the best Ethernet option if you need better-than-vanilla-RoCE performance today.

UALink (Ultra Accelerator Link): A separate standard targeting intra-rack, scale-up GPU-to-GPU communication. UALink 1.0 supports 200 Gbps per lane and positions as an open alternative to NVLink for AMD, Intel, and other non-NVIDIA accelerators. The May 2024 Promoter Group includes AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft. UALink is explicitly for intra-node and intra-rack connectivity, not inter-rack or cross-datacenter. UEC is the inter-node standard; UALink is the intra-node standard.

As of 2026, limited shipping hardware uses UALink, with broader silicon availability expected in 2026-2027. AMD's Helios rack uses UALink-over-Ethernet (UALoE), which tunnels Infinity Fabric protocol over an Ethernet physical layer. Watch for additional UALink silicon from AMD's partners in 2026-2027.

Cross-Datacenter AI Fabric: DiLoCo and Low-Bandwidth Training

Multi-region GPU clusters face a different problem: wide-area network bandwidth between datacenters is orders of magnitude lower than intra-cluster fabric. A 10 Gb/s WAN link between US and EU datacenters cannot sustain standard data-parallel all-reduce for a 70B model.

DiLoCo (Distributed Low-Communication) is Google DeepMind's solution, published in a 2023 paper. DiLoCo runs standard local SGD independently on each cluster for H inner steps (typically H=500-1000), then synchronizes only model weights (not gradients) across clusters using a slow outer optimizer. This reduces inter-cluster communication by roughly 500x compared to standard data parallelism. A 1 Gb/s WAN link is sufficient for weight synchronization at H=500.
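The synchronization pattern is easier to see in code than in prose. A toy sketch, with loud assumptions: the "model" is a single scalar, both inner and outer optimizers are plain SGD (the DiLoCo paper uses AdamW inner and Nesterov-momentum outer), and data shards are tiny synthetic lists:

```python
# Toy DiLoCo loop: what crosses the inter-cluster link, and when.

import random

def inner_steps(w, data, h, lr):
    # H local SGD steps on this cluster's own shard; loss = (w - x)^2.
    for _ in range(h):
        x = random.choice(data)
        w -= lr * 2 * (w - x)
    return w

def diloco_round(w_global, shards, h=500, inner_lr=0.05, outer_lr=0.7):
    # Each cluster trains independently for H steps; only the resulting
    # weights (one scalar here) cross the WAN, once per round.
    locals_ = [inner_steps(w_global, s, h, inner_lr) for s in shards]
    outer_grad = sum(w_global - wl for wl in locals_) / len(locals_)
    return w_global - outer_lr * outer_grad   # outer optimizer step

random.seed(0)
shards = [[1.0, 1.2], [0.8, 1.0]]   # two clusters, overall data mean ~1.0
w = 0.0
for _ in range(5):
    w = diloco_round(w, shards)
# w drifts toward ~1.0 after 5 weight syncs instead of 2500 gradient syncs
```

The point is the communication count: five rounds of H=500 inner steps means five weight exchanges where standard data parallelism would have needed 2,500 gradient all-reduces.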

The tradeoff is convergence speed. DiLoCo achieves roughly 95% of the convergence rate of standard data parallelism at equivalent compute budgets, but requires careful tuning of the inner learning rate, outer learning rate, and H. It also does not work well for RLHF phases where the reward model and policy must be tightly synchronized.

Related low-bandwidth approaches: FedAvg (federated averaging, similar synchronization pattern), local SGD with periodic averaging, and asynchronous SGD with bounded staleness. All share the same tradeoff: slower convergence in exchange for tolerance of high-latency, low-bandwidth inter-cluster links.

For teams training across regions (EU data residency requirements, multi-cloud redundancy), DiLoCo-style synchronization removes the requirement for high-speed cross-region fabric. Within each region, high-speed InfiniBand or Spectrum-X still matters for intra-cluster all-reduce.

Choosing a GPU Cloud Provider by Fabric Type

| Provider | Fabric Available | Notes |
|---|---|---|
| Spheron | NVLink within node; RoCEv2 or IB across nodes (cluster-dependent) | Marketplace model; match fabric to workload; transparent per-GPU pricing |
| CoreWeave | InfiniBand NDR on H100/H200 clusters | Strong IB fabric; higher price premium |
| Lambda Labs | InfiniBand on multi-node H100 clusters | Good for sustained training |
| AWS | EFA (Elastic Fabric Adapter, proprietary RDMA) on p5.48xlarge | Not IB-compatible; AWS-specific SRD protocol |
| GCP | GPUDirect-TCPX over Ethernet on A3 (H100) | No InfiniBand; Google's own RDMA-like stack; limited regions |
| Azure | InfiniBand on ND H100 v5 | Good IB availability; reservation required |
| Hyperscalers (general) | Proprietary fabrics + IB | Less flexibility; higher floor price |

Spheron's marketplace model gives you the choice: on-demand RoCEv2-connected GPUs for cost-efficient fine-tuning and inference, or reserved InfiniBand clusters for sustained frontier training. This means you do not pay an IB premium for every workload by default.

The practical guidance: start with on-demand RoCEv2 for any new training run. Run NCCL benchmarks on the first day. If your communication-to-compute ratio is above 25%, evaluate a reserved IB cluster. If it is below 10%, stay on Ethernet. See Spheron's GPU catalog for on-demand options and current GPU pricing for rate comparisons.
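That decision rule can be captured in a few lines. The 25% and 10% thresholds are from the guidance above; the ratio is communication time over compute time per step, as defined earlier in the article:

```python
# Day-one fabric decision helper from measured step timings.

def fabric_recommendation(comm_s, compute_s):
    """Map the communication-to-compute ratio to a fabric tier."""
    ratio = comm_s / compute_s
    if ratio > 0.25:
        return ratio, "evaluate a reserved InfiniBand cluster"
    if ratio < 0.10:
        return ratio, "stay on Ethernet (RoCEv2)"
    return ratio, "gray zone: Spectrum-X or well-tuned RoCEv2"

print(fabric_recommendation(28, 30))    # 405B example: ratio ~0.93 -> IB
print(fabric_recommendation(28, 600))   # slow compute: ratio ~0.05 -> Ethernet
```

Feed it the comm and compute times you measure with nccl-tests and a profiled training step, not estimates.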

Debugging Collective Communication Bottlenecks

Running nccl-tests

nccl-tests is the standard tool for measuring collective bandwidth on any fabric.

```bash
# Install and build nccl-tests with MPI support
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda

# Single-node all-reduce (baseline): 8 GPUs, 1 GB to 8 GB message sizes
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8

# Multi-node all-reduce (2 nodes via mpirun, 1 process per node × 8 GPUs each)
mpirun -np 2 --hostfile hosts.txt \
  -mca btl_tcp_if_include eth0 \
  ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8

# Enable NCCL debug output
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8
```

Interpreting Results

The all_reduce_perf output shows "bus bandwidth" in the rightmost column. Bus bandwidth accounts for the ring-all-reduce algorithm overhead and represents effective utilization of the underlying link. It is typically 75-85% of the raw link bandwidth for large message sizes.
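The conversion from raw timing to bus bandwidth follows the definitions documented in the nccl-tests repo: algorithm bandwidth is bytes over time, and for all-reduce the bus figure scales that by the ring traffic factor. A sketch (the 8 GB / 16-rank / 40 ms example is illustrative, not a published benchmark):

```python
# How the nccl-tests "bus bandwidth" column is derived for all-reduce:
# algBW = size / time, busBW = algBW * 2*(N-1)/N.

def allreduce_bus_bw(size_gb, time_s, n_ranks):
    alg_bw = size_gb / time_s                     # GB/s delivered to the user
    return alg_bw * 2 * (n_ranks - 1) / n_ranks   # effective link utilization

# e.g. an 8 GB all-reduce across 16 ranks completing in 40 ms:
print(round(allreduce_bus_bw(8, 0.040, 16), 1))   # 375.0 GB/s
```

Comparing busBW to the link's line rate gives the 75-85% utilization figure mentioned above.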

Expected values by fabric:

  • InfiniBand NDR 400G, 8xH100 SXM5: ~350 GB/s effective bus bandwidth
  • RoCEv2 400GbE, well-tuned: ~270-290 GB/s effective
  • RoCEv2 400GbE, poorly tuned: <200 GB/s (indicates PFC/DCQCN misconfiguration)
  • Spectrum-X 800GbE, 8 nodes: ~340-360 GB/s effective

If you see high variance across ranks (standard deviation >10% of mean), this usually indicates HoL blocking on Ethernet or a SHARP configuration issue on InfiniBand. Look at per-rank timing in NCCL_DEBUG=TRACE output.

Common Issues and Fixes

| Symptom | Likely Cause | Fix |
|---|---|---|
| Low effective bandwidth (<50% of link speed) | PFC not enabled on Ethernet | Enable PFC/ECN on switch and NIC |
| High rank latency variance | Head-of-line blocking | Check DCQCN parameters; verify adaptive routing |
| NCCL hangs at init | NCCL_SOCKET_IFNAME mismatch | Set NCCL_SOCKET_IFNAME=eth0 (or correct interface) |
| IB not detected | Driver/RDMA stack issue | Check ibv_devices; ensure MLNX_OFED installed |
| Slow IB performance | SHARP not enabled | Check sharp_manager status; enable SHARP in NCCL |

Key NCCL environment variables for each fabric:

```bash
# InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_0
export NCCL_NET_GDR_LEVEL=5
export NCCL_COLLNET_ENABLE=1   # enables SHARP in-network reduction

# RoCEv2 (RDMA over Ethernet - keeps RDMA active)
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3     # RoCEv2 GID (value may vary by NIC/OS)
export NCCL_NET_GDR_LEVEL=5
export NCCL_SOCKET_IFNAME=eth0

# TCP fallback (no RDMA - use only when RDMA is unavailable)
export NCCL_IB_DISABLE=1
export NCCL_NET=Socket

# Tuning (both fabrics)
export NCCL_BUFFSIZE=4194304   # 4 MB
export NCCL_NTHREADS=512
```

For a deeper walkthrough of NCCL configuration and multi-node setup, see the multi-node GPU training NCCL configuration guide.


Running multi-node training? Spheron offers on-demand GPU clusters with RoCEv2 for cost-efficient fine-tuning and reserved InfiniBand clusters for sustained frontier training. Match the fabric to the workload.

Rent H100 → | Rent H200 → | View all GPU pricing →

Launch a cluster on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.