AMD Helios Rack-Scale AI on GPU Cloud: Deploy MI455X Inference with UALink and Ultra Ethernet (2026 Guide)

AMD Helios is 72 MI455X accelerators, 31 TB of HBM4, 1.4 PB/s of aggregate memory bandwidth, and approximately 2.9 FP4 exaFLOPS of inference compute in a single rack. That makes it AMD's direct answer to NVIDIA's NVLink72/GB200 NVL72, built on open interconnect standards instead of proprietary silicon. MI455X hardware ships H2 2026, with cloud availability following as providers onboard capacity.

The case for Helios is not raw FLOPS. It is memory bandwidth per dollar at scale, combined with an open interconnect stack that lets you use standard 800GbE switches instead of expensive NVLink switch hardware. For teams running 400B-class models or building inference clusters that need to scale beyond single-node, the architecture choices AMD made here matter more than the headline FP4 number.

For a per-GPU spec comparison between the MI455X and NVIDIA's Blackwell Ultra, the AMD MI400 vs NVIDIA B300 comparison has the full side-by-side. This guide focuses on the rack-scale system: how Helios works, what UALink and Ultra Ethernet mean for your fabric, and when fractional MI-series cloud capacity makes more sense than a full-rack commitment.

What Is AMD Helios?

Helios is not just a GPU. It is a complete rack-scale AI infrastructure platform: 72 MI455X accelerators, EPYC Venice CPUs, Pensando networking, and a UALink-over-Ethernet fabric, all integrated into a single rack unit. The scale-up interconnect delivers 260 TB/s of aggregate intra-rack bandwidth.

Spec	AMD Helios Rack
Accelerators per rack	72x MI455X
Total HBM4 memory	~31 TB
Aggregate bandwidth	~1.4 PB/s
Inference throughput	~2.9 FP4 exaFLOPS
Scale-up interconnect	UALink (260 TB/s)
Scale-out fabric	Ultra Ethernet
CPU	AMD EPYC Venice
Networking	AMD Pensando

These are AMD's announced specifications. Independent rack-level benchmarks are not yet available as of June 2026.

The design philosophy differs from NVIDIA's NVL72 in one important way: Helios treats the interconnect as infrastructure rather than silicon lock-in. UALink runs over an Ethernet physical layer, which means the switch fabric can be sourced from standard 800GbE vendors. NVIDIA's NVLink switch is proprietary and sold only through NVIDIA. For large-scale deployments, that switch cost is a material line item.

MI455X: One GPU vs the Full Rack

A single MI455X delivers 432 GB HBM4 at 19.6 TB/s bandwidth. That is more memory than any currently shipping NVIDIA GPU. A B300 SXM6 carries 288 GB HBM3e at 8 TB/s. A single MI455X can hold a 216B-parameter model at FP16 without partitioning, a capacity class that currently requires two B300s or ~6 H100s.

The full Helios rack scales that to 31 TB across 72 GPUs. At that scale, you are talking about models in the 15T-parameter class at FP4, or ~1T parameters at BF16 with room for KV cache. That is frontier model training territory, not production inference of Llama 4 405B.

For most teams, the interesting operating point is 8 to 16 MI455X GPUs on demand: enough to run 70B-class models at FP16 on a single GPU (no tensor parallelism needed), or 405B-class models across four GPUs instead of eight. The memory bandwidth advantage is present at every scale, not just at full-rack allocation. For the model-to-VRAM sizing math, see GPU memory requirements for LLMs.

The 72-GPU full rack makes sense for:

Frontier model pre-training (1T+ parameter runs)
Large inference fleets where you need consistent low-latency across many concurrent users
Organizations that already operate rack-scale infrastructure and have the ops capacity for it

The 8-16 GPU fractional cloud path makes sense for:

Production inference of 70B to 400B models
Teams without rack-level hardware ops capacity
Workloads that need burst capacity without a long procurement timeline

UALink and Ultra Ethernet: The Open-Fabric Stack

The interconnect choices in Helios are the most technically interesting part of the design.

UALink (Ultra Accelerator Link) is an open-standard scale-up interconnect. The UALink Consortium includes AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft. UALink runs over an Ethernet physical layer (UALoE, UALink over Ethernet), which means the intra-rack GPU-to-GPU fabric can use standard 800GbE switch hardware. NVLink uses a proprietary physical and link layer that requires NVIDIA's NVLink switch, priced accordingly.

UALink 1.0 supports 200 Gbps per lane. The Helios rack uses UALoE to deliver 260 TB/s of aggregate scale-up bandwidth across the 72-GPU domain. For comparison, NVLink 5 in NVIDIA's NVL72 system delivers 1.8 TB/s per GPU, or roughly 130 TB/s aggregate across 72 GPUs. AMD's figure is higher, though cross-vendor comparisons of announced specs should be treated with caution until third-party benchmarks validate both systems.

Ultra Ethernet (UEC 1.0) handles inter-rack, inter-node scale-out. Where UALink provides the intra-rack scale-up fabric, Ultra Ethernet provides the inter-rack fabric for multi-rack deployments. UEC defines a reliability layer that avoids PFC head-of-line blocking by using credit-based flow control at the NIC level rather than switch-level pause frames. Broadcom's Tomahawk-UEC silicon targeting UEC 1.0 is expected in 2026-2027.

The combination creates a fully open-stack alternative to NVIDIA's proprietary InfiniBand and NVLink fabric. For a deep dive into InfiniBand, RoCE, Spectrum-X, and now UALink/UEC as a fourth option, see the GPU networking guide for AI clusters.

The practical implication for cloud deployments: as MI455X capacity appears on GPU clouds, you should not need proprietary switch hardware to get full interconnect performance. That keeps per-GPU rental costs lower than NVL72 equivalents, where the NVLink switch cost is amortized into the rack rental rate.

ROCm Inference Stack on MI455X: vLLM and SGLang

MI455X runs ROCm, not CUDA. The inference stack is familiar but the install path differs.

Docker setup:

bash

docker pull rocm/pytorch:latest
docker run --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --shm-size 16G \
  -it rocm/pytorch:latest bash

The --device=/dev/kfd --device=/dev/dri flags expose AMD GPUs to the container. For MI455X (CDNA 5), use ROCm 8.x+ images when available. Verify your GPU is visible:

bash

rocm-smi

vLLM with ROCm:

bash

pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2

Do not use pip install vllm without the extra index. The default wheel links against CUDA and will fail on an AMD GPU.

bash

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 1

ROCm is auto-detected via the HIP backend. The API is identical to the CUDA version.

SGLang with ROCm:

bash

docker run --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  lmsysorg/sglang:rocm \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --context-length 8192 \
    --host 0.0.0.0 \
    --port 8000

RadixAttention is active by default. For shared context workloads (RAG pipelines, multi-turn chat), prefix cache reuse will materially improve throughput.

What works vs what does not:

Framework	ROCm support	Notes
PyTorch	Full	Native, AMD-maintained wheel
vLLM	Full (ROCm wheel)	Separate install path from CUDA version
SGLang	Full (ROCm image)	Official AMD GPU support
TensorRT-LLM	Not available	NVIDIA-only
FlashAttention 3	Not available	Hopper-specific; AMD has FA2 via MIOpen
Triton	Partial (AMD backend)	Most kernels compile; PTX not supported

For a deeper comparison of the ROCm and CUDA ecosystems, see ROCm vs CUDA: AMD vs NVIDIA AI Stack Compared. For vLLM vs SGLang serving architecture tradeoffs (PagedAttention vs RadixAttention), see the vLLM vs SGLang 2026 guide.

Helios vs NVIDIA GB200 NVL72: When AMD Wins

At rack scale, the comparison is Helios (72x MI455X) vs GB200 NVL72 (72x B200).

Spec	AMD Helios Rack	NVIDIA GB200 NVL72
GPUs per rack	72x MI455X	72x B200
Total memory	~31 TB HBM4	~13.4 TB HBM3e
Aggregate bandwidth	~1.4 PB/s	~576 TB/s (~0.58 PB/s)
Compute (FP4)	~2.9 exaFLOPS	~1.44 exaFLOPS
Scale-up interconnect	UALink (260 TB/s, open)	NVLink 5 (~130 TB/s, proprietary)
Scale-out fabric	Ultra Ethernet (open)	InfiniBand NDR / proprietary
Interconnect switch	800GbE (commodity)	NVLink switch (proprietary)
CPU	AMD EPYC Venice	ARM Grace CPU
Software maturity	ROCm 8.x (projected)	CUDA 13 / TensorRT-LLM

AMD wins when:

The workload is memory-bandwidth-bound (large model prefill, long-context generation at high concurrency).
You need more than 13.4 TB of memory per rack for frontier-scale model serving.
The organization wants open interconnect vendor freedom and to avoid proprietary NVLink switch hardware.
Cost sensitivity at the memory boundary: MI455X can fit models requiring fewer GPUs than equivalent NVIDIA configs.

NVIDIA wins when:

Your production stack relies on TensorRT-LLM or FlashAttention 3.
You need the full CUDA ecosystem with minimal porting risk.
Software maturity and production support are top priorities. ROCm has closed the gap significantly but CUDA has a larger ecosystem and longer track record at scale.

For the single-GPU side of this comparison, see the AMD MI400 vs NVIDIA B300 guide. For the full NVL72 architecture breakdown, see the NVIDIA GB200 NVL72 guide.

Right-Sizing: Do You Need a Full Helios Rack?

The Helios rack is designed for two workloads: frontier model training and the largest inference fleets. Most teams do not need either.

When you need the full 72-GPU rack:

Pre-training or continued pre-training of 1T+ parameter models.
Inference serving for frontier-scale models (Llama 4 Maverick/Scout equivalents at BF16, or next-generation 1T+ models at FP4).
Organizations with rack-level hardware ops capacity already in place (networking team, HW oncall, datacenter agreements).

When 8-16 MI455X on-demand makes more sense:

Production inference of 70B to 405B models. A single MI455X holds a 70B model at FP16; two cover 405B at FP16.
Teams without dedicated hardware ops. On a GPU cloud, the provider handles the rack, fabric, and driver updates.
Workloads that need capacity flexibility. A rack commitment is a capital or long-term contract decision. On-demand cloud instances scale up and down in minutes.
Teams evaluating AMD before committing to an on-premises AMD deployment.

Fractional MI400 on a cloud marketplace vs capex rack procurement:

Cloud availability of MI455X will follow hardware shipment by 3-6 months. When fractional MI400 capacity appears on GPU cloud marketplaces, you can test your ROCm stack against real MI455X hardware at on-demand hourly rates before committing to a rack procurement cycle. That is a significant risk reduction for teams evaluating AMD as a primary inference platform.

For MIG, vGPU, and multi-process sharing approaches that apply regardless of GPU vendor, see fractional GPU inference: vGPU, MPS, and right-sizing.

Live MI-Series Pricing on Spheron

AMD MI-series GPUs are not yet in cloud catalogs as of June 2026. MI455X hardware ships H2 2026, and cloud availability typically follows 3-6 months after initial hardware shipments as providers qualify silicon, integrate ROCm drivers, and onboard capacity. The earliest realistic window for fractional MI455X cloud instances is early 2027.

In the meantime, here are the current NVIDIA GPU rates on Spheron as the available-now alternative:

GPU	On-demand (per GPU/hr)	Spot (per GPU/hr)
H100 SXM5	$4.06	$2.91
H100 PCIe	$2.01	N/A
B200 SXM6	$7.50	$5.34

AMD MI-series pricing will appear on Spheron's pricing page when supply is onboarded. Check the pricing page for MI-series availability and live rates.

Pricing fluctuates based on GPU availability. The prices above are based on 24 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

Deploying with Spheron: No Capex, On-Demand MI-Series

Spheron aggregates GPU capacity from 5+ providers globally, including MI-series hardware as it becomes available. For teams evaluating the MI455X inference stack before cloud availability opens, Spheron's current H100 and B200 instances give you a ROCm-compatible path today: AMD's HIP layer handles CUDA-to-ROCm translation at the PyTorch level, and vLLM with the ROCm wheel runs the same API on both.

When MI400 capacity comes online, accessing it on Spheron works the same way as renting H100 or B200 today: per-minute billing, no minimum commitment, no vendor contracts, and no rack procurement timeline. For ROCm setup guides and MI-series instance configuration, see the Spheron docs.

AMD Helios brings rack-scale HBM4 bandwidth to inference workloads that have outgrown single-node NVIDIA configs. As MI455X capacity comes online in GPU clouds, Spheron's marketplace model means you can access it on-demand the same way you rent H100 or B200 today - no minimum commitment, per-minute billing.
Check H100 availability → | B200 GPU pricing → | View all GPU pricing →
Get started on Spheron →

STEPS / 05

Quick Setup Guide

Pull the ROCm base Docker image for MI455X
Start from AMD's official ROCm PyTorch image: docker pull rocm/pytorch:latest. For MI455X (CDNA 5), use a ROCm 8.x+ image when available. The --device=/dev/kfd --device=/dev/dri flags expose AMD GPUs to the container. Verify with rocm-smi.
Install vLLM with ROCm support
pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2. Do not use the default pip install vllm, which links against CUDA. Verify the ROCm wheel was installed: python -c 'import vllm; print(vllm.__version__)'
Launch vLLM on MI455X
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --dtype float16 --tensor-parallel-size 1. ROCm is auto-detected via the HIP backend when using the ROCm-linked wheel. For 432B+ models needing multiple GPUs, set --tensor-parallel-size to match your GPU count.
Launch SGLang on MI455X
docker run --device=/dev/kfd --device=/dev/dri -p 8000:8000 lmsysorg/sglang:rocm python -m sglang.launch_server --model-path meta-llama/Llama-3.3-70B-Instruct --context-length 8192 --host 0.0.0.0 --port 8000. RadixAttention is active by default, giving prefix-cache reuse across requests with shared context.
Benchmark throughput to validate ROCm performance
python benchmarks/benchmark_throughput.py --backend vllm --model meta-llama/Llama-3.1-70B-Instruct --num-prompts 1000. Use rocm-smi --showmeminfo vram to monitor VRAM utilization. Compare against the same benchmark on a CUDA GPU to validate the cost-per-token trade-off before committing to AMD-first infrastructure.

FAQ / 05

Frequently Asked Questions

AMD Helios is a rack-scale AI infrastructure platform built around 72 MI455X accelerators per rack. A single Helios rack delivers 31 TB of HBM4 memory, 1.4 PB/s aggregate memory bandwidth, and approximately 2.9 FP4 exaFLOPS of inference compute. It uses UALink (Ultra Accelerator Link) for intra-rack scale-up and Ultra Ethernet for inter-rack scale-out, positioning it as AMD's direct answer to NVIDIA's NVLink72/GB200 NVL72 rack-scale architecture.

NVLink is a proprietary NVIDIA interconnect that ties GPUs within a rack using a closed specification. UALink is an open-standard scale-up interconnect backed by the UALink Consortium (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft). Both serve the same function - high-bandwidth, low-latency GPU-to-GPU communication within a scale-up domain - but UALink runs over an Ethernet physical layer (UALoE, UALink over Ethernet) which means standard 800GbE switches can carry the fabric, reducing switch cost versus proprietary NVLink switch hardware.

vLLM has official ROCm support through a ROCm-specific wheel. Install with: pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2. The API is identical to the CUDA version. ROCm 8.x, aligned with MI400 series, is expected to extend this support. SGLang also has official ROCm support. TensorRT-LLM remains NVIDIA-only and will not run on MI455X.

AMD confirmed MI400 series (including MI455X) for H2 2026 hardware shipments. Cloud availability typically lags hardware shipment by 3-6 months as providers qualify silicon, integrate drivers, and onboard capacity. Realistically, MI455X cloud instances should appear on leading providers in early-to-mid 2027. Fractional MI400-series capacity may appear on neo-cloud marketplaces sooner once hardware is in data centers.

A full Helios rack (72x MI455X) is designed for frontier model training and the largest inference deployments. For most teams - running 70B to 400B parameter models in production - fractional MI400 capacity (8-16 MI455X GPUs on-demand) delivers the HBM4 bandwidth advantage without the capex and rack minimum of a full Helios system. GPU cloud marketplaces like Spheron will surface fractional MI-series capacity as supply becomes available, at on-demand pricing with no long procurement cycle.

What Is AMD Helios?

MI455X: One GPU vs the Full Rack

UALink and Ultra Ethernet: The Open-Fabric Stack

ROCm Inference Stack on MI455X: vLLM and SGLang

Helios vs NVIDIA GB200 NVL72: When AMD Wins

Right-Sizing: Do You Need a Full Helios Rack?

Live MI-Series Pricing on Spheron

Deploying with Spheron: No Capex, On-Demand MI-Series

Quick Setup Guide

Pull the ROCm base Docker image for MI455X

Install vLLM with ROCm support

Launch vLLM on MI455X

Launch SGLang on MI455X

Benchmark throughput to validate ROCm performance

Frequently Asked Questions

01What is AMD Helios?

02How does UALink differ from NVLink for scale-up interconnect?

03Does vLLM support AMD MI455X via ROCm?

04When will AMD Helios and MI455X be available on GPU clouds?

05Do I need a full Helios rack or can I rent fractional MI400 capacity?

Build what's next.