
UALink vs NVLink: Open GPU Interconnect for AI Inference and Training (2026)


The scale-up interconnect market got a lot more interesting when the UALink Promoter Group formed in 2024. AMD's MI400 Helios is the first Tier 1 rack-scale system designed around UALink, the open interconnect standard backed by a consortium of eight major tech companies. Most of the content online about UALink is consortium press releases and spec sheet summaries. This post is a buyer-facing guide explaining what UALink 1.0 actually is, how it differs from NVLink across bandwidth, topology, and vendor support, and when these differences matter for real workloads.

For background on the inter-node networking side (InfiniBand, RoCEv2, Spectrum-X), see the GPU networking guide. For a deep dive on NVLink generations and NVSwitch architecture, read the "What is NVLink" guide first, since this post treats that material as known background.

UALink and NVLink solve the same physical problem: moving data between GPUs inside a rack at speeds that PCIe cannot reach. They solve it with very different philosophies. NVLink is NVIDIA-owned and NVIDIA-only. UALink is an open spec that any vendor can implement. That gap has significant implications for multi-vendor GPU clouds.

TL;DR Comparison Table

| Spec | UALink 1.0 | NVLink 5.0 | NVLink 6.0 |
| --- | --- | --- | --- |
| Standard type | Open | Proprietary | Proprietary |
| Bandwidth per lane (UALink) or per link (NVLink) | 200 Gbps (~25 GB/s) | ~100 GB/s (50 GB/s per direction) | ~133 GB/s (projected) |
| Total BW per GPU | Depends on lane count (spec only) | 1.8 TB/s (measured) | ~2.4 TB/s (projected) |
| GPU support | AMD MI400+, other UALink Consortium members | NVIDIA Blackwell only | NVIDIA Rubin only |
| Topology | Point-to-point and switch | All-to-all mesh via NVSwitch | All-to-all mesh via NVSwitch |
| Switch requirement | Optional, open vendor | NVSwitch (NVIDIA-only ASIC) | NVSwitch (next-gen, NVIDIA-only) |
| Cross-vendor | Yes (spec is open) | No | No |
| Shipping silicon | Limited (H2 2026 expected) | Yes, widely available | No (Rubin, 2026+) |

UALink (Ultra Accelerator Link) is an open interconnect standard for direct accelerator-to-accelerator communication within a rack. It is maintained by the UALink Consortium and driven by the UALink Promoter Group, which includes AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft.

The key word is "open." The UALink 1.0 specification was ratified in April 2025 (the UALink Promoter Group formed in May 2024). Any company can implement it without licensing fees or proprietary controller chips. That is the explicit contrast with NVLink: the spec is publicly available, the switch silicon is implementable by multiple vendors, and the physical layer uses standard signaling.

UALink 1.0 specifies:

  • 200 Gbps (~25 GB/s) throughput per lane (per direction)
  • Point-to-point and switched topologies: two GPUs can connect directly, or a switch can connect an entire rack
  • A transport layer based on credit-based flow control, designed for low-latency GPU memory access
  • UALoE (UALink over Ethernet), a variant that tunnels the Infinity Fabric protocol over an Ethernet physical layer, which is how AMD's Helios system implements it today

UALink 1.0's per-lane rate of 200 Gbps (~25 GB/s) is lower than NVLink 5.0's per-link rate (~100 GB/s bidirectional). UALink's bandwidth advantage comes from lane count and topology, not per-lane speed. AMD disclosed at CES January 2026 that the MI455X delivers ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack (~260 TB/s rack-level), which exceeds NVLink 5.0's 1.8 TB/s per GPU. AMD achieves this through Helios's dense lane topology, not a higher per-lane figure.
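The per-lane arithmetic can be sketched in a few lines of Python. This is a back-of-envelope illustration only: the function names are hypothetical, and the lane counts it prints are inferences, since AMD has not published how many lanes sit behind the ~3.6 TB/s figure or whether that figure is counted per direction or bidirectionally.

```python
import math

# UALink 1.0 per-lane rate quoted in this post, in gigabits per second.
GBPS_PER_LANE = 200


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def aggregate_gb_per_s(lanes: int, gbps_per_lane: float = GBPS_PER_LANE) -> float:
    """Aggregate bandwidth in GB/s for a given lane count."""
    return lanes * gbps_to_gigabytes_per_s(gbps_per_lane)


def lanes_needed(target_gb_per_s: float, gbps_per_lane: float = GBPS_PER_LANE) -> int:
    """Smallest lane count whose aggregate bandwidth meets the target."""
    return math.ceil(target_gb_per_s / gbps_to_gigabytes_per_s(gbps_per_lane))


print("GB/s per lane:", gbps_to_gigabytes_per_s(GBPS_PER_LANE))

# Hypothetical readings of the ~3.6 TB/s (3600 GB/s) disclosure.
# Neither lane count is confirmed by AMD.
print("Lanes if 3.6 TB/s is per direction:", lanes_needed(3600))
print("Lanes if 3.6 TB/s is bidirectional:", lanes_needed(3600 / 2))
```

Either reading implies far more lanes per accelerator than NVLink 5.0 has links, which is the sense in which the aggregate comes from lane count rather than per-lane speed.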

UALink 2.0 is planned and targets 400 Gbps (~50 GB/s) per lane. No confirmed release date or shipping silicon has been announced. Treat any UALink 2.0 figures as roadmap projections, not production specs.

AMD's MI400 Helios is the first rack-scale system designed around UALink-over-Ethernet (UALoE). A Helios rack connects 72 MI400-series GPUs in a tightly coupled scale-up domain using UALoE, and AMD's CES January 2026 disclosure puts the MI455X at ~3.6 TB/s scale-up bandwidth per accelerator (~260 TB/s rack-level) in that configuration. Initial Helios systems are in qualification at select hyperscalers as of mid-2026, with broader availability expected in H2 2026.

NVLink is NVIDIA's high-speed interconnect for direct GPU-to-GPU communication. It replaces PCIe as the GPU communication fabric within a server, delivering up to 1.8 TB/s bidirectional bandwidth per GPU on Blackwell (NVLink 5.0). For a full breakdown of NVLink generations, NVSwitch architecture, and when NVLink is required for specific workloads, see the NVLink bandwidth guide linked in the introduction.

The reason NVLink stays proprietary comes down to system architecture. NVLink is not just a cable spec: it requires NVSwitch, NVIDIA's dedicated switch ASIC, to enable all-to-all GPU mesh connectivity within a node. The NVSwitch chip is NVIDIA-designed, NVIDIA-manufactured, and not licensable. This creates a full-stack lock-in: to use NVLink 5.0 at its full 1.8 TB/s, you need NVSwitch; to get NVSwitch, you need an HGX baseboard; to get an HGX baseboard, you need to buy from NVIDIA's supply chain.

For cloud operators, this means that scaling H100 or B200 capacity forces them to accept NVIDIA's entire stack, including the GPU, the switch ASIC, the HGX form factor, and the SXM connector. It also means NVIDIA can price the switch infrastructure without competitive pressure. UALink is designed to break exactly that dynamic.

| Generation | GPU | Release | Links per GPU | Per-link BW (per direction) | Total BW per GPU (bidirectional) |
| --- | --- | --- | --- | --- | --- |
| NVLink 1.0 | Pascal (P100) | 2016 | 4 | 20 GB/s | 160 GB/s |
| NVLink 2.0 | Volta (V100) | 2017 | 6 | 25 GB/s | 300 GB/s |
| NVLink 3.0 | Ampere (A100) | 2020 | 12 | 25 GB/s | 600 GB/s |
| NVLink 4.0 | Hopper (H100, H200) | 2022 | 18 | 25 GB/s | 900 GB/s |
| NVLink 5.0 | Blackwell (B200, B300, GB200) | 2024 | 18 | 50 GB/s | 1.8 TB/s |
| NVLink 6.0 | Rubin (R100, upcoming) | 2026+ | 18 | ~66 GB/s | ~2.4 TB/s (projected) |
| UALink 1.0 | AMD MI400, UALink Consortium members | 2026 | Vendor-defined | 200 Gbps (~25 GB/s) per lane | ~3.6 TB/s per GPU (MI455X, AMD CES 2026 disclosure) |

NVLink 6.0 figures are based on NVIDIA roadmap data, not measured production performance. UALink 1.0 per-lane spec is 200 Gbps (= 25 GB/s); total per-GPU bandwidth depends on vendor implementation. AMD's MI455X ~3.6 TB/s figure is from AMD's CES January 2026 disclosure, not independently validated silicon results.
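As a quick consistency check on the NVLink rows, the totals follow from link count times per-link rate if the per-link figure is read as a per-direction rate and then doubled. That reading is an interpretation inferred from the totals, and the names below are illustrative rather than taken from any NVIDIA tool.

```python
# (links per GPU, per-link GB/s per direction) for the shipping generations
# listed in the table. NVLink 6.0 is omitted because its figures are projections.
NVLINK_GENERATIONS = {
    "NVLink 1.0": (4, 20),
    "NVLink 2.0": (6, 25),
    "NVLink 3.0": (12, 25),
    "NVLink 4.0": (18, 25),
    "NVLink 5.0": (18, 50),
}


def total_bidirectional_gb_per_s(links: int, per_direction_gb_per_s: int) -> int:
    """Total per-GPU bandwidth: links x per-direction rate x 2 directions."""
    return links * per_direction_gb_per_s * 2


for name, (links, rate) in NVLINK_GENERATIONS.items():
    total = total_bidirectional_gb_per_s(links, rate)
    print(f"{name}: {links} links x {rate} GB/s x 2 = {total} GB/s")
```

The same formula does not apply directly to the UALink row, where the lane count is vendor-defined and the per-GPU figure comes from AMD's disclosure rather than from a fixed link count.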

Bandwidth per GPU

NVLink 5.0 delivers 1.8 TB/s bidirectional per GPU. This is a measured, production figure. B200 SXM6 nodes ship with 18 NVLink 5.0 links per GPU, and NCCL all-reduce benchmarks on HGX B200 reflect near-theoretical bandwidth at scale.

UALink 1.0 is specified at 200 Gbps (~25 GB/s) per lane. Per-lane, UALink 1.0 is lower than NVLink 5.0's per-link rate (~100 GB/s bidirectional). UALink's advantage is topology-level scale: AMD disclosed at CES January 2026 that the MI455X delivers ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack (~260 TB/s rack-level), exceeding NVLink 5.0's 1.8 TB/s per GPU through Helios's dense lane topology. These are AMD-disclosed figures; independently validated all-reduce throughput data on shipping MI400 hardware is not yet available as of mid-2026.

Topology and Switch Architecture

NVLink topology on HGX nodes is all-to-all mesh, enabled by NVSwitch. An H100 SXM5 node uses four NVSwitch 3.0 chips. Each GPU connects to all four switches; each switch connects to all eight GPUs. The result is that any GPU can communicate with any other GPU in the node at full NVLink bandwidth without contention. This matters for tensor parallelism and all-reduce: no GPU has to wait for another's communication to complete before using the shared switch path.
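The connectivity described here can be modeled as a small bipartite graph. The sketch below is a logical model only, assuming 8 GPUs and 4 switches with every GPU attached to every switch; it does not represent how many physical links run to each switch, and the helper names are hypothetical.

```python
from itertools import combinations

NUM_GPUS = 8      # GPUs in one HGX H100 node, per the text
NUM_SWITCHES = 4  # NVSwitch chips in the node, per the text


def build_links(num_gpus: int, num_switches: int) -> set[tuple[int, int]]:
    """Every GPU connects to every switch: a complete bipartite graph."""
    return {(gpu, sw) for gpu in range(num_gpus) for sw in range(num_switches)}


def shared_switches(links: set[tuple[int, int]], gpu_a: int, gpu_b: int,
                    num_switches: int) -> list[int]:
    """Switches that both GPUs attach to, i.e. the available 2-hop paths."""
    return [sw for sw in range(num_switches)
            if (gpu_a, sw) in links and (gpu_b, sw) in links]


links = build_links(NUM_GPUS, NUM_SWITCHES)
paths = {pair: shared_switches(links, pair[0], pair[1], NUM_SWITCHES)
         for pair in combinations(range(NUM_GPUS), 2)}

print("GPU-to-switch attachments:", len(links))
print("GPU pairs:", len(paths))
print("Every pair reachable via every switch:",
      all(len(p) == NUM_SWITCHES for p in paths.values()))
```

Because every pair of GPUs shares all four switches, traffic between any two GPUs can be spread across all of them, which is the property behind the "no contention" claim in this model.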

The UALink spec supports both point-to-point and switched configurations. A UALink switch can be built by any vendor that meets the spec (Broadcom is the most likely candidate), and the switch topology details depend on the vendor's implementation. AMD's UALoE approach in Helios uses an Ethernet physical layer, which gives more topology flexibility but adds some latency compared to dedicated NVLink-style switching.

Vendor and Hardware Support

| Interconnect | GPU Vendor | GPU Models | Switch Source | Availability |
| --- | --- | --- | --- | --- |
| NVLink 5.0 | NVIDIA | B200, B300, GB200 | NVSwitch (NVIDIA) | Available now |
| NVLink 4.0 | NVIDIA | H100, H200 | NVSwitch 3.0 (NVIDIA) | Available now |
| UALink 1.0 | AMD | MI400 series | Open spec (Broadcom, others) | H2 2026 (expected) |
| UALink 1.0 | Intel, others | TBD | Open spec | TBD |

NVIDIA has not joined the UALink Promoter Group and has not indicated plans to support UALink on any GPU. UALink is designed for AMD, Intel, and other UALink Consortium members.

Latency

NVLink 4.0 on H100 has measured latency around 1-2 microseconds for small messages in back-to-back GPU communication. This is meaningfully lower than any Ethernet-based interconnect, which typically bottoms out around 2-5 microseconds for RDMA even on optimized setups.

UALink 1.0 latency targets are in the sub-microsecond range per the specification, which would beat NVLink's measured latency. The UALoE variant (Ethernet physical layer) adds some overhead compared to native UALink, so AMD Helios latency will likely be higher than the raw UALink spec suggests. Published silicon latency data for MI400 does not exist as of June 2026.
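To see when latency matters more than bandwidth, a simple latency-plus-bandwidth cost model helps: transfer time is a fixed startup latency plus message size divided by bandwidth. The sketch below uses that model with illustrative numbers. The two link profiles are assumptions chosen from within the ranges discussed in this post, not measurements of any specific hardware.

```python
def transfer_time_s(size_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Latency-plus-bandwidth model: fixed startup cost plus serialization time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_share(size_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Fraction of total transfer time spent on the fixed latency term."""
    total = transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link profiles: (latency in seconds, bandwidth in bytes/second).
LINK_PROFILES = {
    "scale-up link (assumed 1 us, 450 GB/s)": (1e-6, 450e9),
    "Ethernet RDMA (assumed 3 us, 50 GB/s)": (3e-6, 50e9),
}

for size in (4 * 1024, 1024 ** 2, 1024 ** 3):  # 4 KiB, 1 MiB, 1 GiB
    for name, (latency, bandwidth) in LINK_PROFILES.items():
        share = latency_share(size, latency, bandwidth)
        print(f"{size:>10} bytes over {name}: {share:.1%} of time is latency")
```

Under these assumptions, small messages are dominated almost entirely by latency, while large gradient or activation transfers are dominated by bandwidth. That is why sub-microsecond latency targets matter most for fine-grained, latency-sensitive traffic.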

Software Stack

NVLink works through NCCL (NVIDIA Collective Communications Library). NCCL is mature, widely tested, and deeply integrated with PyTorch, JAX, and TensorFlow. NCCL knows about NVSwitch topology and optimizes all-reduce operation graphs automatically. The software stack is a major part of why NVLink performs well in practice: it is not just hardware bandwidth.

AMD's UALink path uses ROCm and RCCL (the ROCm equivalent of NCCL). RCCL has caught up significantly on single-node all-reduce performance for common model sizes, but the multi-node story on ROCm is still maturing. For a broader analysis of the ROCm software gap relative to CUDA, see the AMD MI400 vs NVIDIA B300 comparison.

The software maturity gap is the most significant practical difference today. Even if UALink silicon matches NVLink bandwidth when MI400 ships, the ROCm/RCCL stack needs production validation at scale before it can be treated as equivalent to NCCL on NVLink.

The interconnect choice is rarely a single component decision. It is a full-stack choice that includes scale-up fabric (intra-node), scale-out fabric (inter-node), and software.

| Layer | NVIDIA Stack | AMD/Open Stack |
| --- | --- | --- |
| Scale-up fabric | NVLink 5.0 | UALink 1.0 (UALoE on Helios) |
| Scale-out fabric | InfiniBand NDR or Spectrum-X | Ultra Ethernet (UEC) or standard RoCEv2 |
| GPU compatibility | NVIDIA only | AMD, Intel, UALink Consortium members |
| Switch vendor (scale-up) | NVSwitch (NVIDIA-only ASIC) | Open spec; Broadcom most likely |
| Switch vendor (scale-out) | Quantum-2 (InfiniBand) or Spectrum-4 | Tomahawk/UEC or standard ToR |
| Software stack | CUDA + NCCL | ROCm + RCCL |
| Lock-in level | High (end-to-end NVIDIA) | Lower (open spec components) |

The full NVIDIA stack is high-performance and production-proven. The tradeoff is that every component is a single-vendor decision. You cannot swap out NVSwitch for a Broadcom alternative. You cannot mix AMD GPUs into an NVLink domain.

The open stack (UALink + Ultra Ethernet) offers more flexibility in sourcing, pricing negotiation, and multi-vendor environments. The tradeoff is that it is earlier in the production maturity curve. UEC silicon from Broadcom is expected in 2026-2027. UALink silicon (AMD MI400) is expected to be more broadly available in H2 2026.

For GPU networking decisions across InfiniBand, Spectrum-X, and RoCEv2 on existing hardware, see the GPU networking guide linked in the introduction.

Why Open Interconnect Matters for Multi-Vendor GPU Clouds

The interconnect standard a cloud provider supports determines which GPU vendors they can mix in their infrastructure. A platform locked to NVLink can only offer NVIDIA GPUs with NVLink-grade scale-up bandwidth. A platform that supports UALink can add AMD capacity at competitive price points without building a separate, isolated cluster.

For Spheron, which aggregates compute from 5+ providers across multiple GPU vendor ecosystems, open interconnect is aligned with the fundamental model. When AMD MI400 capacity becomes available, UALink-compatible infrastructure means that AMD GPUs can be added to the same marketplace without a separate networking silo.

There is also a pricing dynamic here. When scale-up interconnect is tied to a single vendor's proprietary ASIC, that vendor has pricing power over the entire cluster configuration. Open standards like UALink enable competitive switch sourcing, which reduces the per-switch cost component of cluster build-out and ultimately affects the GPU rental rates customers see.

For the AMD hardware details and how MI400 compares to NVIDIA B300 across memory bandwidth, FLOPs, and software ecosystem, see the AMD MI400 vs NVIDIA B300 comparison linked earlier.

Does Interconnect Type Affect Your Rented GPU Workload?

The short answer: it matters only for specific multi-GPU configurations.

| Workload | GPUs used | Interconnect matters? | Why |
| --- | --- | --- | --- |
| Single-GPU inference | 1 GPU | No | No GPU-to-GPU communication |
| Small fine-tuning (7B-13B) | 1-2 GPUs | No | PCIe or NVLink, negligible difference at this scale |
| Multi-GPU inference (tensor parallelism) | 2-8 GPUs | Yes | KV cache and activation transfers are bandwidth-bound |
| 70B training, single node | 8 GPUs | Yes | All-reduce on gradients is the bottleneck |
| 100B+ training, single node | 8 GPUs | Yes | Tensor parallelism requires full NVLink bandwidth to avoid being communication-bound |
| Multi-node training (8+ nodes) | 64+ GPUs | Partially | Scale-up fabric matters within each node; scale-out fabric (InfiniBand/Ethernet) dominates across nodes |
| Batch inference (no tensor parallelism) | Multiple GPUs | No | Each GPU handles independent requests |

If you are renting a single H100 or B200, the NVLink spec is irrelevant to your workload. You are not doing GPU-to-GPU communication at all. If you are renting 8x H100 SXM5 and training a 70B model with FSDP, NVLink 4.0 at 900 GB/s is a critical part of why that training run completes in the expected time.
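A rough estimate shows why the fabric matters for the 70B case. In a bandwidth-optimal ring all-reduce, each GPU sends about 2 x (N - 1) / N times the payload, so the communication time scales inversely with per-direction link bandwidth. The sketch below applies that formula under loudly stated assumptions: bf16 gradients (2 bytes per parameter), a full-gradient all-reduce with no overlap with compute, 450 GB/s per direction for NVLink 4.0 (half of the 900 GB/s bidirectional figure), and roughly 64 GB/s per direction as a theoretical PCIe 5.0 x16 rate. Real FSDP runs use reduce-scatter and all-gather and overlap communication with compute, so treat the output as an upper-bound illustration, not a benchmark.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a bandwidth-optimal ring all-reduce."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_time_s(payload_bytes: float, num_gpus: int,
                     per_direction_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate (ignores latency and compute overlap)."""
    sent = ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus)
    return sent / per_direction_bytes_per_s


PARAMS = 70e9                # 70B-parameter model
BYTES_PER_GRADIENT = 2       # assumption: bf16 gradients
NUM_GPUS = 8

# Assumed per-direction bandwidths in bytes per second.
NVLINK4_PER_DIRECTION = 450e9   # half of the 900 GB/s bidirectional figure
PCIE5_X16_PER_DIRECTION = 64e9  # approximate theoretical rate, an assumption

payload = PARAMS * BYTES_PER_GRADIENT
for name, bandwidth in (("NVLink 4.0", NVLINK4_PER_DIRECTION),
                        ("PCIe 5.0 x16", PCIE5_X16_PER_DIRECTION)):
    seconds = allreduce_time_s(payload, NUM_GPUS, bandwidth)
    print(f"{name}: ~{seconds:.2f} s per full-gradient all-reduce")
```

The ratio between the two results is simply the ratio of the assumed bandwidths, which is the point: for communication-bound steps, the fabric sets the floor on step time.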

UALink becomes relevant when you are specifically planning AMD MI400 capacity for large-scale training. If your current roadmap runs on NVIDIA hardware, UALink is background information, not an active purchase decision.

GPU Pricing on Spheron (Fetched Live)

These prices are from the Spheron API as of 24 Jun 2026, filtered to DEDICATED (on-demand) offers only.

| GPU | On-Demand Price | Notes |
| --- | --- | --- |
| H100 SXM5 | $4.06/hr | NVLink 4.0, 80 GB HBM3 |
| H100 PCIe | $2.01/hr | No NVLink, PCIe only |
| H200 SXM5 | $5.82/hr | NVLink 4.0, 141 GB HBM3e |
| B200 SXM6 | $9.36/hr | NVLink 5.0, 192 GB HBM3e |

Pricing fluctuates based on GPU availability. The prices above are as of 24 Jun 2026 and may have changed. Check the current GPU pricing page for live rates.

The H100 PCIe vs SXM5 price difference ($2.01 vs $4.06) illustrates the NVLink premium directly. The SXM5 form factor includes NVLink 4.0; PCIe does not. For workloads that need multi-GPU tensor parallelism or FSDP, the SXM5's NVLink bandwidth is worth the cost. For single-GPU workloads, the PCIe H100 offers the same Hopper architecture at lower cost, albeit with somewhat lower peak compute and memory bandwidth than the SXM5 part.
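The premium is easy to quantify from the two listed prices. The sketch below works in integer cents to avoid floating-point rounding; the 8-GPU node size is an illustrative assumption, and the prices are the 24 Jun 2026 snapshot quoted above, not live data.

```python
# Snapshot prices from the table above, in cents per GPU-hour.
H100_SXM5_CENTS = 406
H100_PCIE_CENTS = 201
GPUS_PER_NODE = 8  # assumption: a typical 8-GPU SXM5 node

premium_cents = H100_SXM5_CENTS - H100_PCIE_CENTS
premium_ratio = H100_SXM5_CENTS / H100_PCIE_CENTS
node_premium_cents = premium_cents * GPUS_PER_NODE

print(f"Per-GPU premium: ${premium_cents / 100:.2f}/hr")
print(f"SXM5 costs {premium_ratio:.2f}x the PCIe price")
print(f"Premium for an {GPUS_PER_NODE}-GPU node: ${node_premium_cents / 100:.2f}/hr")
```

Not all of that difference is attributable to NVLink alone, since the SXM5 part also differs in power envelope and memory, so read it as the price of the whole form factor.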

| Phase | Target date | What ships | Key milestone |
| --- | --- | --- | --- |
| Consortium formed | May 2024 | UALink Promoter Group announced | AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft align on open spec |
| Spec published | April 2025 | UALink 1.0 specification | Open standard now available to implementers |
| First systems (AMD) | H2 2026 | AMD MI400 Helios racks | First production UALink silicon at hyperscalers |
| Cloud availability | Early 2027 (estimate) | MI400 GPU rental | Typically 3-6 months after initial hardware shipments |
| UALink 2.0 | 2027+ (unconfirmed) | 400 Gbps (~50 GB/s) per lane | Roadmap only, no confirmed ship date |
| Broader ecosystem | 2027+ | Intel and others | Dependent on individual vendor roadmaps |

The honest summary: most cloud GPU workloads will run on NVLink-based NVIDIA hardware for at least the next 12-18 months. H100, H200, and B200 are available now, the software stack is mature, and NVLink 5.0's 1.8 TB/s is proven in production. UALink is the right thing to track if you are planning AMD-native ROCm workloads or evaluating where the open-stack cost story lands once Helios systems are in the market.

Which Should You Care About?

| Scenario | Best interconnect | Why |
| --- | --- | --- |
| Training 70B+ on 8 GPUs today | NVLink 4.0 (H100 SXM5) or NVLink 5.0 (B200) | Available now, production-proven, mature NCCL |
| Training 200B+ planning H2 2026 infra | Evaluate NVLink 5.0 vs UALink 1.0 (AMD MI400) | UALink may be available by then; compare total cost including software migration |
| Inference on rented single GPU | Irrelevant | No GPU-to-GPU communication in single-GPU inference |
| Multi-node training (8+ nodes) | NVLink within node + InfiniBand/Spectrum-X across nodes | Scale-out fabric dominates in multi-node; within-node NVLink still matters |
| AMD-native ROCm workload | UALink 1.0 (AMD MI400) when available | UALink is the scale-up fabric for AMD; ROCm compatibility required |
| Budget-constrained fine-tuning (up to 30B) | PCIe H100 or H200 | NVLink bandwidth is not the bottleneck at this model size; PCIe pricing is lower |

The practical takeaway: if you are starting a new project today, use NVLink-based NVIDIA GPUs. The hardware is available, the software stack is mature, and the benchmark data is public. Watch UALink for the second half of 2026. If AMD ships Helios systems on schedule and RCCL all-reduce data matches NCCL at scale, the cost case for AMD MI400 on large training workloads will be worth a serious evaluation.


GPU scale-up interconnect choices matter for training, but most inference and fine-tuning workloads run just fine on a single H100 or B200. Spheron aggregates capacity from 5+ providers so you pick the right GPU for the job, not the one forced by a vendor fabric.



Frequently Asked Questions

What is UALink?

UALink (Ultra Accelerator Link) is an open scale-up interconnect standard for direct GPU-to-GPU communication within a rack. Version 1.0 supports 200 Gbps (~25 GB/s) per lane. The UALink Promoter Group (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft) formed in May 2024 and ratified the 1.0 spec in April 2025 as an open alternative to NVIDIA's proprietary NVLink.

Is UALink faster than NVLink?

Not on a per-lane basis, but competitively on aggregate per-GPU bandwidth. NVLink 5.0 on B200 delivers 1.8 TB/s bidirectional per GPU across 18 links at 100 GB/s each. UALink 1.0 is specified at 200 Gbps (~25 GB/s) per lane, which is lower per-lane than NVLink 5.0. UALink's advantage is open topology and scalable lane count: AMD disclosed ~3.6 TB/s scale-up bandwidth per MI455X accelerator in the Helios rack at CES January 2026, which exceeds NVLink 5.0's 1.8 TB/s through higher lane counts. UALink 2.0 (planned, no confirmed date) will target 400 Gbps per lane. As of mid-2026, shipping silicon with UALink is limited, so direct measured comparisons are not yet available.

Does UALink work with NVIDIA GPUs?

No. UALink is designed as an open standard for non-NVIDIA accelerators, primarily AMD MI-series GPUs and other consortium members' silicon. NVIDIA uses proprietary NVLink for its intra-node scale-up fabric and has not joined the UALink Promoter Group.

What is the difference between UALink and Ultra Ethernet?

UALink handles the scale-up problem: direct GPU-to-GPU connectivity within a rack (intra-node or intra-rack). The Ultra Ethernet Consortium (UEC) addresses scale-out: multi-node communication across the data center network. They solve different problems: UALink is the open counterpart to NVLink, while Ultra Ethernet competes with InfiniBand and RoCEv2.

When will UALink hardware be available?

Limited production silicon using UALink is expected with AMD MI400-series GPUs in H2 2026. Cloud availability typically follows initial hardware shipments by 3-6 months. As of June 2026, no major cloud provider lists UALink instances, but AMD Helios rack-scale systems are in qualification at select hyperscalers.

Does the interconnect matter when renting cloud GPUs?

Yes, but only for multi-GPU jobs that use tensor parallelism or FSDP. If you rent a single GPU or run inference on a single instance, the interconnect spec is irrelevant. If you rent 8+ GPUs and run training or very large inference (100B+ parameters with tensor parallelism), NVLink 5.0 or UALink determines how fast GPUs exchange activations and gradients within the node.

What is UALink 200G?

UALink 200G refers to the 200 Gbps (200 GT/s) per-lane throughput target in the UALink 1.0 specification. Each physical lane carries 200 Gbps, which equals ~25 GB/s. The '200G' name follows standard networking conventions where G denotes gigabits. AMD disclosed at CES January 2026 that the MI455X achieves ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack, which exceeds NVLink 5.0's 1.8 TB/s. AMD achieves that aggregate through Helios's dense lane topology, not by exceeding NVLink 5.0 on a per-lane basis.
