The scale-up interconnect market got a lot more interesting when the UALink Promoter Group formed in 2024. AMD's MI400 Helios is the first Tier 1 rack-scale system designed around UALink, the open interconnect standard backed by a consortium of eight major tech companies. Most of the content online about UALink is consortium press releases and spec sheet summaries. This post is a buyer-facing guide explaining what UALink 1.0 actually is, how it differs from NVLink across bandwidth, topology, and vendor support, and when these differences matter for real workloads.
For background on the inter-node networking side (InfiniBand, RoCEv2, Spectrum-X), see the GPU networking guide. For a deep dive on NVLink generations and NVSwitch architecture, read the "What is NVLink" guide first; this post treats that material as known background.
UALink and NVLink solve the same physical problem: moving data between GPUs inside a rack at speeds that PCIe cannot reach. They solve it with very different philosophies. NVLink is NVIDIA-owned and NVIDIA-only. UALink is an open spec that any vendor can implement. That gap has significant implications for multi-vendor GPU clouds.
TL;DR Comparison Table
| Spec | UALink 1.0 | NVLink 5.0 | NVLink 6.0 |
|---|---|---|---|
| Standard type | Open | Proprietary | Proprietary |
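| Bandwidth per lane (UALink) / per link (NVLink) | 200 Gbps (~25 GB/s) per direction | ~100 GB/s bidirectional (50 GB/s per direction) | ~133 GB/s bidirectional (projected) |
| Total BW per GPU | Vendor-defined by lane count; ~3.6 TB/s on MI455X (AMD-disclosed) | 1.8 TB/s bidirectional (shipping) | ~2.4 TB/s bidirectional (projected) |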
| GPU support | AMD MI400+, other UALink Consortium members | NVIDIA Blackwell only | NVIDIA Rubin only |
| Topology | Point-to-point and switch | All-to-all mesh via NVSwitch | All-to-all mesh via NVSwitch |
| Switch requirement | Optional, open vendor | NVSwitch (NVIDIA-only ASIC) | NVSwitch (next-gen, NVIDIA-only) |
| Cross-vendor | Yes (spec is open) | No | No |
| Shipping silicon | Limited (H2 2026 expected) | Yes, widely available | No (Rubin, 2026+) |
What is UALink
UALink (Ultra Accelerator Link) is an open interconnect standard for direct accelerator-to-accelerator communication within a rack. It is maintained by the UALink Consortium and driven by the UALink Promoter Group, which includes AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft.
The key word is "open." The UALink Promoter Group formed in May 2024, and the UALink 1.0 specification was ratified in April 2025. Any company can implement it without licensing fees or proprietary controller chips. That is the explicit contrast with NVLink: the spec is publicly available, the switch silicon is implementable by multiple vendors, and the physical layer uses standard signaling.
UALink 1.0 Specification
UALink 1.0 specifies:
- 200 Gbps (~25 GB/s) throughput per lane, per direction
- Point-to-point and switch topologies are both supported: two GPUs can connect directly, or a switch can connect an entire rack
- Transport layer based on credit-based flow control, designed for low-latency GPU memory access
- UALoE (UALink over Ethernet) is a variant that tunnels the Infinity Fabric protocol over an Ethernet physical layer, which is how AMD's Helios system implements it today
UALink 1.0's per-lane rate of 200 Gbps (~25 GB/s per direction) is lower than NVLink 5.0's per-link rate (50 GB/s per direction, ~100 GB/s bidirectional), though the units are not equivalent: an NVLink link bundles multiple lanes, so the meaningful comparison is aggregate bandwidth per GPU. UALink's bandwidth advantage comes from lane count and topology, not per-lane speed. AMD disclosed at CES January 2026 that the MI455X delivers ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack (~260 TB/s rack-level), which exceeds NVLink 5.0's 1.8 TB/s per GPU. AMD achieves this through Helios's dense lane topology, not a higher per-lane figure.
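A quick way to sanity-check these figures is to convert per-lane rates into aggregate bandwidth. The sketch below is plain arithmetic under stated assumptions: 8 bits per byte, the 200 Gbps figure treated as the usable data rate, and every lane running at full rate at once. The function names are illustrative. Because the AMD figure quoted above does not say whether it counts both directions, both readings are shown.

```python
import math

UALINK_1_LANE_GBPS = 200  # UALink 1.0 per-lane data rate, per direction


def gbps_to_gbytes(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def lanes_needed(target_gbytes_per_direction: float,
                 lane_gbps: float = UALINK_1_LANE_GBPS) -> int:
    """Whole lanes (per direction) needed to reach an aggregate per-direction target."""
    return math.ceil(target_gbytes_per_direction / gbps_to_gbytes(lane_gbps))


lane = gbps_to_gbytes(UALINK_1_LANE_GBPS)  # 25.0 GB/s per lane, per direction

# NVLink 5.0: 1.8 TB/s bidirectional = 900 GB/s per direction.
print("Lanes to match NVLink 5.0:", lanes_needed(900))

# MI455X ~3.6 TB/s: the per-direction lane count depends on how the figure is counted.
print("If 3.6 TB/s is bidirectional:", lanes_needed(3600 / 2))
print("If 3.6 TB/s is per direction:", lanes_needed(3600))
```

Under these assumptions, matching NVLink 5.0 takes 36 UALink 1.0 lanes per direction, and the MI455X figure implies 72 or 144 lanes. That is the sense in which UALink's aggregate advantage comes from lane count rather than per-lane speed.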
UALink 2.0 Roadmap
UALink 2.0 is planned and targets 400 Gbps (~50 GB/s) per lane. No confirmed release date or shipping silicon has been announced. Treat any UALink 2.0 figures as roadmap projections, not production specs.
AMD MI400 and UALink
AMD's MI400 Helios is the first rack-scale system designed around UALink-over-Ethernet (UALoE). A Helios rack connects 72 MI400-series GPUs in a tightly coupled scale-up domain using UALoE. AMD disclosed at CES January 2026 that the MI455X delivers ~3.6 TB/s scale-up bandwidth per accelerator (~260 TB/s rack-level), exceeding NVLink 5.0's 1.8 TB/s per GPU. Initial Helios systems are in qualification at select hyperscalers as of mid-2026, with broader availability expected in H2 2026.
What is NVLink (and Why it is Proprietary)
NVLink is NVIDIA's high-speed interconnect for direct GPU-to-GPU communication. It replaces PCIe as the GPU communication fabric within a server, delivering up to 1.8 TB/s bidirectional bandwidth per GPU on Blackwell (NVLink 5.0). For a full breakdown of NVLink generations, NVSwitch architecture, and when NVLink is required for specific workloads, see the NVLink bandwidth guide linked in the introduction.
The reason NVLink stays proprietary comes down to system architecture. NVLink is not just a cable spec: it requires NVSwitch, NVIDIA's dedicated switch ASIC, to enable all-to-all GPU mesh connectivity within a node. The NVSwitch chip is NVIDIA-designed, NVIDIA-manufactured, and not licensable. This creates a full-stack lock-in: to use NVLink 5.0 at its full 1.8 TB/s, you need NVSwitch; to get NVSwitch, you need an HGX baseboard; to get an HGX baseboard, you need to buy from NVIDIA's supply chain.
For cloud operators, this means that scaling H100 or B200 capacity forces them to accept NVIDIA's entire stack, including the GPU, the switch ASIC, the HGX form factor, and the SXM connector. It also means NVIDIA can price the switch infrastructure without competitive pressure. UALink is designed to break exactly that dynamic.
NVLink Generations Reference Table
| Generation | GPU | Release | Links per GPU | Per-link BW (per direction) | Total BW per GPU (bidirectional) |
|---|---|---|---|---|---|
| NVLink 1.0 | Pascal (P100) | 2016 | 4 | 20 GB/s | 160 GB/s |
| NVLink 2.0 | Volta (V100) | 2017 | 6 | 25 GB/s | 300 GB/s |
| NVLink 3.0 | Ampere (A100) | 2020 | 12 | 25 GB/s | 600 GB/s |
| NVLink 4.0 | Hopper (H100, H200) | 2022 | 18 | 25 GB/s | 900 GB/s |
| NVLink 5.0 | Blackwell (B200, B300, GB200) | 2024 | 18 | 50 GB/s | 1.8 TB/s |
| NVLink 6.0 | Rubin (R100, upcoming) | 2026+ | 18 | ~66 GB/s | ~2.4 TB/s (projected) |
| UALink 1.0 (for comparison) | AMD MI400, UALink Consortium members | 2026 | Vendor-defined | 200 Gbps (~25 GB/s) per lane | ~3.6 TB/s per GPU (MI455X, AMD CES 2026 disclosure) |
NVLink 6.0 figures are based on NVIDIA roadmap data, not measured production performance. UALink 1.0 per-lane spec is 200 Gbps (= 25 GB/s); total per-GPU bandwidth depends on vendor implementation. AMD's MI455X ~3.6 TB/s figure is from AMD's CES January 2026 disclosure, not independently validated silicon results.
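The total column follows directly from the other two: links per GPU × per-link rate per direction × 2 directions. A small check of the shipping NVLink rows, with values copied from the table above (the dictionary and function names are illustrative):

```python
# (links per GPU, per-link GB/s per direction), copied from the table above.
NVLINK_GENERATIONS = {
    "NVLink 1.0": (4, 20),
    "NVLink 2.0": (6, 25),
    "NVLink 3.0": (12, 25),
    "NVLink 4.0": (18, 25),
    "NVLink 5.0": (18, 50),
}


def total_bidirectional_gbytes(links: int, per_link_per_direction: float) -> float:
    """Aggregate per-GPU bandwidth counting both directions of every link."""
    return links * per_link_per_direction * 2


for gen, (links, per_link) in NVLINK_GENERATIONS.items():
    print(f"{gen}: {total_bidirectional_gbytes(links, per_link):,.0f} GB/s")
```

The same formula applied to NVLink 6.0's projected ~66 GB/s per link gives 18 × 66 × 2 ≈ 2.4 TB/s, matching the roadmap row. It does not apply to UALink, where the per-GPU total depends on a vendor-defined lane count.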
UALink vs NVLink: Head-to-Head
Bandwidth per GPU
NVLink 5.0 delivers 1.8 TB/s bidirectional per GPU, and that figure applies to shipping production hardware rather than a roadmap projection. B200 SXM6 nodes ship with 18 NVLink 5.0 links per GPU, and NCCL all-reduce benchmarks on HGX B200 reflect near-theoretical bandwidth at scale.
UALink 1.0 is specified at 200 Gbps (~25 GB/s per direction) per lane. That is lower than NVLink 5.0's per-link rate (50 GB/s per direction, ~100 GB/s bidirectional), but a lane and a link are different units, so aggregate per-GPU bandwidth is the fairer comparison. UALink's advantage is topology-level scale: AMD disclosed at CES January 2026 that the MI455X delivers ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack (~260 TB/s rack-level), exceeding NVLink 5.0's 1.8 TB/s per GPU through Helios's dense lane topology. These are AMD-disclosed figures; independently validated all-reduce throughput data on shipping MI400 hardware is not yet available as of mid-2026.
Topology and Switch Architecture
NVLink topology on HGX nodes is all-to-all mesh, enabled by NVSwitch. An H100 SXM5 node uses four NVSwitch 3.0 chips. Each GPU connects to all four switches; each switch connects to all eight GPUs. The result is that any GPU can communicate with any other GPU in the node at full NVLink bandwidth without contention. This matters for tensor parallelism and all-reduce: no GPU has to wait for another's communication to complete before using the shared switch path.
UALink topology spec supports both point-to-point and switched configurations. A UALink switch is implementable by any vendor that meets the spec (Broadcom is the most likely candidate). The switch topology details depend on vendor implementation. AMD's UALoE approach in Helios uses an Ethernet physical layer, which gives more topology flexibility but adds some latency compared to dedicated NVLink-style switching.
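The practical difference between point-to-point and switched topologies shows up in how much bandwidth one GPU can direct at a single peer. The idealized sketch below assumes a hypothetical GPU with 72 lanes at 25 GB/s, a direct full mesh that splits lanes evenly across peers, and a non-blocking switch; none of these numbers describe a specific product.

```python
def per_peer_gbytes_direct_mesh(lanes_per_gpu: int, lane_gbytes: float, num_gpus: int) -> float:
    """Direct full mesh: each GPU divides its lanes evenly among num_gpus - 1 peers."""
    if num_gpus < 2:
        raise ValueError("need at least two GPUs")
    lanes_per_peer = lanes_per_gpu // (num_gpus - 1)  # leftover lanes go unused in this model
    return lanes_per_peer * lane_gbytes


def per_peer_gbytes_switched(lanes_per_gpu: int, lane_gbytes: float) -> float:
    """Non-blocking switch: any single peer can be reached over all of a GPU's lanes."""
    return lanes_per_gpu * lane_gbytes


LANES, LANE_GBYTES = 72, 25.0  # hypothetical values, not a product spec

for n in (2, 4, 8):
    direct = per_peer_gbytes_direct_mesh(LANES, LANE_GBYTES, n)
    switched = per_peer_gbytes_switched(LANES, LANE_GBYTES)
    print(f"{n} GPUs: direct mesh {direct:.0f} GB/s per peer, switched {switched:.0f} GB/s")
```

The gap matters most for traffic aimed at one peer at a time. In collectives where every GPU talks to every other GPU simultaneously, a direct mesh keeps nearly all its lanes busy and closes much of the gap, which is why point-to-point links remain viable for small groups.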
Vendor and Hardware Support
| Interconnect | GPU Vendor | GPU Models | Switch Source | Availability |
|---|---|---|---|---|
| NVLink 5.0 | NVIDIA | B200, B300, GB200 | NVSwitch (NVIDIA) | Available now |
| NVLink 4.0 | NVIDIA | H100, H200 | NVSwitch 3.0 (NVIDIA) | Available now |
| UALink 1.0 | AMD | MI400 series | Open spec (Broadcom, others) | H2 2026 (expected) |
| UALink 1.0 | Intel, others | TBD | Open spec | TBD |
NVIDIA has not joined the UALink Promoter Group and has not indicated plans to support UALink on any GPU. UALink is designed for AMD, Intel, and other UALink Consortium members.
Latency
NVLink 4.0 on H100 has measured latency around 1-2 microseconds for small messages in back-to-back GPU communication. This is meaningfully lower than any Ethernet-based interconnect, which typically bottoms out around 2-5 microseconds for RDMA even on optimized setups.
UALink 1.0 latency targets are in the sub-microsecond range per the specification. If that target held end to end, it would beat NVLink's measured latency, but a spec-level target and a measured GPU-to-GPU latency are not directly comparable. The UALoE variant (Ethernet physical layer) adds some overhead compared to native UALink, so AMD Helios latency will likely be higher than the raw UALink spec suggests. No published silicon latency data for MI400 is available as of June 2026.
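To see when these latency differences matter, a standard alpha-beta model (time = fixed latency + bytes / bandwidth) is enough. The sketch plugs in the NVLink 4.0 numbers quoted above; it is an idealization that ignores protocol overhead and software launch costs, and the function names are illustrative.

```python
def transfer_time_us(message_bytes: float, latency_us: float, gbytes_per_s: float) -> float:
    """Alpha-beta model: fixed startup latency plus serialization time.

    1 GB/s moves 1,000 bytes per microsecond.
    """
    return latency_us + message_bytes / (gbytes_per_s * 1e3)


def crossover_bytes(latency_us: float, gbytes_per_s: float) -> float:
    """Message size at which serialization time equals the fixed latency."""
    return latency_us * gbytes_per_s * 1e3


LATENCY_US = 1.5   # midpoint of the 1-2 us NVLink 4.0 small-message figure above
BW_GBYTES = 450.0  # NVLink 4.0 on H100: 900 GB/s bidirectional = 450 GB/s per direction

for size in (8 * 1024, 1024 * 1024, 1024 ** 3):
    print(f"{size:>13,d} bytes: {transfer_time_us(size, LATENCY_US, BW_GBYTES):,.2f} us")
print(f"latency and bandwidth contribute equally at {crossover_bytes(LATENCY_US, BW_GBYTES):,.0f} bytes")
```

Under this model, messages below roughly 675 KB are dominated by fixed latency, so a sub-microsecond interconnect would help small-message workloads such as tensor-parallel decoding far more than it would help large gradient all-reduces.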
Software Stack
NVLink works through NCCL (NVIDIA Collective Communications Library). NCCL is mature, widely tested, and deeply integrated with PyTorch, JAX, and TensorFlow. NCCL detects the NVSwitch topology and automatically selects communication algorithms and channels for collectives such as all-reduce. The software stack is a major part of why NVLink performs well in practice: it is not just hardware bandwidth.
AMD's UALink path uses ROCm and RCCL (the ROCm equivalent of NCCL). RCCL has caught up significantly on single-node all-reduce performance for common model sizes, but the multi-node story on ROCm is still maturing. For a broader analysis of the ROCm software gap relative to CUDA, see the AMD MI400 vs NVIDIA B300 comparison.
The software maturity gap is the most significant practical difference today. Even if UALink silicon matches NVLink bandwidth when MI400 ships, the ROCm/RCCL stack needs production validation at scale before it can be treated as equivalent to NCCL on NVLink.
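Published RCCL and NCCL comparisons usually come from nccl-tests and its ROCm fork, rccl-tests, which report both algorithm bandwidth (bytes / time) and bus bandwidth. For all-reduce, nccl-tests defines bus bandwidth as algorithm bandwidth × 2(n-1)/n, which makes results comparable to link speed regardless of rank count. A minimal sketch of that conversion, using a hypothetical size and timing:

```python
def allreduce_bandwidths(size_bytes: float, time_s: float, num_ranks: int) -> tuple[float, float]:
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one all-reduce.

    Bus bandwidth applies the all-reduce correction factor 2 * (n - 1) / n used by nccl-tests.
    """
    alg_bw = size_bytes / time_s / 1e9
    bus_bw = alg_bw * 2 * (num_ranks - 1) / num_ranks
    return alg_bw, bus_bw


# Hypothetical measurement: a 4 GB all-reduce across 8 GPUs finishing in 20 ms.
alg, bus = allreduce_bandwidths(4e9, 0.020, 8)
print(f"algbw {alg:.0f} GB/s, busbw {bus:.0f} GB/s")
```

Comparing bus bandwidth at matching message sizes and rank counts, on NCCL over NVLink versus RCCL over UALink hardware, is one concrete form the production validation described above could take.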
Full-Stack Fabric Comparison: UALink + Ultra Ethernet vs NVLink + NVSwitch + Spectrum-X
The interconnect choice is rarely a single component decision. It is a full-stack choice that includes scale-up fabric (intra-node), scale-out fabric (inter-node), and software.
| Layer | NVIDIA Stack | AMD/Open Stack |
|---|---|---|
| Scale-up fabric | NVLink 5.0 | UALink 1.0 (UALoE on Helios) |
| Scale-out fabric | InfiniBand NDR or Spectrum-X | Ultra Ethernet (UEC) or standard RoCEv2 |
| GPU compatibility | NVIDIA only | AMD, Intel, UALink Consortium members |
| Switch vendor (scale-up) | NVSwitch (NVIDIA-only ASIC) | Open spec; Broadcom most likely |
| Switch vendor (scale-out) | Quantum-2 (InfiniBand) or Spectrum-4 | Tomahawk/UEC or standard ToR |
| Software stack | CUDA + NCCL | ROCm + RCCL |
| Lock-in level | High (end-to-end NVIDIA) | Lower (open spec components) |
The full NVIDIA stack is high-performance and production-proven. The tradeoff is that every component is a single-vendor decision. You cannot swap out NVSwitch for a Broadcom alternative. You cannot mix AMD GPUs into an NVLink domain.
The open stack (UALink + Ultra Ethernet) offers more flexibility in sourcing, pricing negotiation, and multi-vendor environments. The tradeoff is that it is earlier in the production maturity curve. UEC silicon from Broadcom is expected in 2026-2027. UALink silicon (AMD MI400) is expected to be more broadly available in H2 2026.
For GPU networking decisions across InfiniBand, Spectrum-X, and RoCEv2 on existing hardware, see the GPU networking guide linked in the introduction.
Why Open Interconnect Matters for Multi-Vendor GPU Clouds
The interconnect standard a cloud provider supports determines which GPU vendors they can mix in their infrastructure. A platform locked to NVLink can only offer NVIDIA GPUs with NVLink-grade scale-up bandwidth. A platform that supports UALink can add AMD capacity at competitive price points without building a separate, isolated cluster.
For Spheron, which aggregates compute from 5+ providers across multiple GPU vendor ecosystems, open interconnect is aligned with the fundamental model. When AMD MI400 capacity becomes available, UALink-compatible infrastructure means that AMD GPUs can be added to the same marketplace without a separate networking silo.
There is also a pricing dynamic here. When scale-up interconnect is tied to a single vendor's proprietary ASIC, that vendor has pricing power over the entire cluster configuration. Open standards like UALink enable competitive switch sourcing, which reduces the per-switch cost component of cluster build-out and ultimately affects the GPU rental rates customers see.
For the AMD hardware details and how MI400 compares to NVIDIA B300 across memory bandwidth, FLOPs, and software ecosystem, see the AMD MI400 vs NVIDIA B300 comparison linked earlier.
Does Interconnect Type Affect Your Rented GPU Workload?
The short answer: it matters only for specific multi-GPU configurations.
| Workload | GPUs used | Interconnect matters? | Why |
|---|---|---|---|
| Single-GPU inference | 1 GPU | No | No GPU-to-GPU communication |
| Small fine-tuning (7B-13B) | 1-2 GPUs | No | PCIe or NVLink, negligible difference at this scale |
| Multi-GPU inference (tensor parallelism) | 2-8 GPUs | Yes | Per-layer activation all-reduces sit on the critical path of every generated token |
| 70B training, single node | 8 GPUs | Yes | All-reduce on gradients is the bottleneck |
| 100B+ training, single node | 8 GPUs | Yes | Tensor parallelism requires full NVLink bandwidth to avoid being communication-bound |
| Multi-node training (8+ nodes) | 64+ GPUs | Partially | Scale-up fabric matters within each node; scale-out fabric (InfiniBand/Ethernet) dominates across nodes |
| Batch inference (no tensor parallelism) | Multiple GPUs | No | Each GPU handles independent requests |
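Counting messages shows why tensor-parallel inference is sensitive to the interconnect. In Megatron-style tensor parallelism, each transformer layer performs two activation all-reduces in the forward pass (one after attention, one after the MLP). The sketch assumes a Llama-style 70B model with 80 layers and hidden size 8192, bf16 activations, and one decoded token per sequence; the function name is illustrative.

```python
def tp_decode_allreduces(num_layers: int, hidden_size: int, batch_size: int,
                         bytes_per_elem: int = 2, allreduces_per_layer: int = 2) -> tuple[int, int]:
    """Return (all-reduce calls per decoded token, bytes per all-reduce message)."""
    calls = num_layers * allreduces_per_layer
    message_bytes = batch_size * hidden_size * bytes_per_elem  # one token's activations per sequence
    return calls, message_bytes


calls, msg = tp_decode_allreduces(num_layers=80, hidden_size=8192, batch_size=1)
print(f"{calls} all-reduces per token, {msg:,} bytes each")
```

At 16 KB each, these messages are far below the size where bandwidth dominates, so latency sets the floor. If each all-reduce cost even the 1-2 microsecond small-message latency quoted in the Latency section, communication alone would add roughly 160-320 microseconds to every decoded token.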
If you are renting a single H100 or B200, the NVLink spec is irrelevant to your workload. You are not doing GPU-to-GPU communication at all. If you are renting 8x H100 SXM5 and training a 70B model with FSDP, NVLink 4.0 at 900 GB/s is a critical part of why that training run completes in the expected time.
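A back-of-envelope estimate shows why. In a ring all-reduce, each GPU sends 2(n-1)/n times the payload. The sketch assumes bf16 gradients (2 bytes per parameter), an ideal ring with no compute overlap, 450 GB/s per direction for NVLink 4.0 (half of 900 GB/s bidirectional), and roughly 64 GB/s per direction for a PCIe Gen5 x16 path. FSDP's additional parameter all-gathers are ignored, so real traffic is higher.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           gbytes_per_s_per_direction: float) -> float:
    """Ideal ring all-reduce time: each GPU sends 2 * (n - 1) / n * payload bytes."""
    bytes_sent_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return bytes_sent_per_gpu / (gbytes_per_s_per_direction * 1e9)


PARAMS = 70e9
GRAD_BYTES = PARAMS * 2  # bf16 gradients (assumption)

nvlink4 = ring_allreduce_seconds(GRAD_BYTES, 8, 450)  # NVLink 4.0, per direction
pcie5 = ring_allreduce_seconds(GRAD_BYTES, 8, 64)     # PCIe Gen5 x16, approximate

print(f"NVLink 4.0: {nvlink4:.2f} s per gradient sync")
print(f"PCIe Gen5:  {pcie5:.2f} s per gradient sync")
```

Repeated over thousands of optimizer steps, a roughly 7× gap per sync is the difference between communication hiding behind compute and communication dominating the run.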
UALink becomes relevant when you are specifically planning AMD MI400 capacity for large-scale training. If your current roadmap runs on NVIDIA hardware, UALink is background information, not an active purchase decision.
GPU Pricing on Spheron (Fetched Live)
These prices are from the Spheron API as of 24 Jun 2026, filtered to DEDICATED (on-demand) offers only.
| GPU | On-Demand Price | Notes |
|---|---|---|
| H100 SXM5 | $4.06/hr | NVLink 4.0, 80 GB HBM3 |
| H100 PCIe | $2.01/hr | PCIe only (no NVSwitch), 80 GB HBM2e |
| H200 SXM5 | $5.82/hr | NVLink 4.0, 141 GB HBM3e |
| B200 SXM6 | $9.36/hr | NVLink 5.0, 192 GB HBM3e |
Pricing fluctuates based on GPU availability. The prices above are based on 24 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
The H100 PCIe vs SXM5 price difference ($2.01 vs $4.06) largely reflects the SXM premium. The SXM5 form factor includes NVSwitch-connected NVLink 4.0 along with a higher power limit and faster HBM3; the PCIe card has no NVSwitch connectivity and lower memory bandwidth. For workloads that need multi-GPU tensor parallelism or FSDP, the SXM5's NVLink bandwidth is worth the cost. For single-GPU workloads, the PCIe H100 delivers somewhat less compute at roughly half the price.
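Because these rates are per GPU-hour, multi-GPU jobs scale linearly. A minimal sketch converting the snapshot prices above into per-node and per-run costs; the prices are the 24 Jun 2026 figures from the table and will drift, and the 72-hour run length is hypothetical.

```python
PRICE_PER_GPU_HOUR = {  # USD, on-demand snapshot from the table above
    "H100 SXM5": 4.06,
    "H100 PCIe": 2.01,
    "H200 SXM5": 5.82,
    "B200 SXM6": 9.36,
}


def job_cost(gpu: str, num_gpus: int = 8, hours: float = 1.0) -> float:
    """Total on-demand cost in USD, rounded to cents."""
    return round(PRICE_PER_GPU_HOUR[gpu] * num_gpus * hours, 2)


for gpu in PRICE_PER_GPU_HOUR:
    print(f"8x {gpu}: ${job_cost(gpu):.2f}/hr, 72 h run ${job_cost(gpu, hours=72):,.2f}")
```

The comparison that matters is cost per completed job rather than per hour: if NVLink shortens a multi-GPU training run enough, the SXM5 node can be cheaper overall despite the higher hourly rate.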
UALink Adoption Timeline
| Phase | Target date | What ships | Key milestone |
|---|---|---|---|
| Consortium formed | May 2024 | UALink Promoter Group announced | AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft align on open spec |
| Spec published | April 2025 | UALink 1.0 specification | Open standard now available to implementers |
| First systems (AMD) | H2 2026 | AMD MI400 Helios racks | First production UALink silicon at hyperscalers |
| Cloud availability | Early 2027 (estimate) | MI400 GPU rental | Typically 3-6 months after initial hardware shipments |
| UALink 2.0 | 2027+ (unconfirmed) | 400 Gbps (~50 GB/s) per lane | Roadmap only, no confirmed ship date |
| Broader ecosystem | 2027+ | Intel and others | Dependent on individual vendor roadmaps |
The honest summary: most cloud GPU workloads will run on NVLink-based NVIDIA hardware for at least the next 12-18 months. H100, H200, and B200 are available now, the software stack is mature, and NVLink 5.0's 1.8 TB/s is proven in production. UALink is the right thing to track if you are planning AMD-native ROCm workloads or evaluating where the open-stack cost story lands once Helios systems are in the market.
Which Should You Care About?
| Scenario | Best interconnect | Why |
|---|---|---|
| Training 70B+ on 8 GPUs today | NVLink 4.0 (H100 SXM5) or NVLink 5.0 (B200) | Available now, production-proven, mature NCCL |
| Training 200B+ planning H2 2026 infra | Evaluate NVLink 5.0 vs UALink 1.0 (AMD MI400) | UALink may be available by then; compare total cost including software migration |
| Inference on rented single GPU | Irrelevant | No GPU-to-GPU communication in single-GPU inference |
| Multi-node training (8+ nodes) | NVLink within node + InfiniBand/Spectrum-X across nodes | Scale-out fabric dominates in multi-node; within-node NVLink still matters |
| AMD-native ROCm workload | UALink 1.0 (AMD MI400) when available | UALink is the scale-up fabric for AMD; ROCm compatibility required |
| Budget-constrained fine-tuning (up to 30B) | PCIe H100 or H200 | NVLink bandwidth is not the bottleneck at this model size; PCIe pricing is lower |
The practical takeaway: if you are starting a new project today, use NVLink-based NVIDIA GPUs. The hardware is available, the software stack is mature, and the benchmark data is public. Watch UALink for the second half of 2026. If AMD ships Helios systems on schedule and RCCL all-reduce data matches NCCL at scale, the cost case for AMD MI400 on large training workloads will be worth a serious evaluation.
GPU scale-up interconnect choices matter for training, but most inference and fine-tuning workloads run just fine on a single H100 or B200. Spheron aggregates capacity from 5+ providers so you pick the right GPU for the job, not the one forced by a vendor fabric.
H100 GPU pricing → | B200 GPU pricing → | View all GPU pricing →
Frequently Asked Questions
What is UALink?
UALink (Ultra Accelerator Link) is an open scale-up interconnect standard for direct GPU-to-GPU communication within a rack. Version 1.0 supports 200 Gbps (~25 GB/s) per lane. The UALink Promoter Group (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft) formed in May 2024 and ratified the 1.0 spec in April 2025 as an open alternative to NVIDIA's proprietary NVLink.
Is UALink faster than NVLink?
Not on a per-link basis, but competitively on aggregate per-GPU bandwidth. NVLink 5.0 on B200 delivers 1.8 TB/s bidirectional per GPU across 18 links at 100 GB/s each. UALink 1.0 is specified at 200 Gbps (~25 GB/s) per lane, which is lower than a single NVLink 5.0 link, since an NVLink link bundles multiple lanes. UALink's advantage is open topology and scalable lane count: AMD disclosed ~3.6 TB/s scale-up bandwidth per MI455X accelerator in the Helios rack at CES January 2026, which exceeds NVLink 5.0's 1.8 TB/s through higher lane counts. UALink 2.0 (planned, no confirmed date) will target 400 Gbps per lane. As of mid-2026, shipping silicon with UALink is limited, so direct measured comparisons are not yet available.
Can NVIDIA GPUs use UALink?
No. UALink is designed as an open standard for non-NVIDIA accelerators, primarily AMD MI-series GPUs and other consortium members' silicon. NVIDIA uses proprietary NVLink for its intra-node scale-up fabric and has not joined the UALink Promoter Group.
How is UALink different from Ultra Ethernet?
UALink handles the scale-up problem: direct GPU-to-GPU connectivity within a rack (intra-node or intra-rack). The Ultra Ethernet Consortium (UEC) addresses scale-out: multi-node communication across the data center network. They solve different problems: UALink is the open alternative to NVLink; Ultra Ethernet competes with InfiniBand and RoCEv2.
When will UALink be available?
Limited production silicon using UALink is expected with AMD MI400-series GPUs in H2 2026. Cloud availability typically follows initial hardware shipments by 3-6 months. As of June 2026, no major cloud provider lists UALink instances, but AMD Helios rack-scale systems are in qualification at select hyperscalers.
Does the interconnect type matter when renting GPUs?
Yes, but only for multi-GPU jobs that use tensor parallelism or FSDP. If you rent a single GPU or run inference on a single instance, the interconnect spec is irrelevant. If you rent 8+ GPUs and run training or very large inference (100B+ parameters with tensor parallelism), NVLink 5.0 or UALink determines how fast GPUs exchange activations and gradients within the node.
What does UALink 200G mean?
UALink 200G refers to the 200 Gbps (200 GT/s) per-lane throughput target in the UALink 1.0 specification. Each physical lane carries 200 Gbps, which equals ~25 GB/s. The '200G' name follows standard networking conventions where G denotes gigabits. AMD disclosed at CES January 2026 that the MI455X achieves ~3.6 TB/s scale-up bandwidth per accelerator in the 72-GPU Helios rack, which exceeds NVLink 5.0's 1.8 TB/s. AMD achieves that aggregate through Helios's dense lane topology, not by exceeding NVLink 5.0 on a per-lane basis.
