AMD's MI400 series ships H2 2026 with 432 GB HBM4 and 19.6 TB/s memory bandwidth: the most memory-dense GPU announced to date. The B300 is available now with 288 GB HBM3e and ~15 petaFLOPS of dense FP4 compute. This post gives you the concrete comparison so you can decide which path fits your infrastructure timeline.
MI400 vs B300 at a Glance
| Spec | AMD MI455X | AMD MI430X | AMD MI440X | NVIDIA B300 SXM6 |
|---|---|---|---|---|
| Architecture | CDNA 5 | CDNA 5 | CDNA 5 | Blackwell Ultra (dual-die) |
| Memory | 432 GB HBM4 | 432 GB HBM4 (est., unconfirmed) | 432 GB HBM4 (est., unconfirmed) | 288 GB HBM3e |
| Memory Bandwidth | 19.6 TB/s | 19.6 TB/s (est., unconfirmed) | 19.6 TB/s (est., unconfirmed) | 8 TB/s |
| Peak FP8 (dense) | TBD (CDNA 5 pending) | TBD | TBD | ~5,000 TFLOPS |
| Peak FP4 | Supported (figures pending) | Supported | Supported | ~15,000 TFLOPS (NVFP4) |
| TDP | ~2,000-2,500W (liquid-cooled, est., based on available leaks) | TDP not disclosed | TDP not disclosed | 1,400W |
| Interconnect | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | NVLink 5 (1.8 TB/s) |
| Software Stack | ROCm 8.x (projected) | ROCm 8.x (projected) | ROCm 8.x (projected) | CUDA 13.x / TensorRT-LLM |
| Cloud Availability | H2 2026 | H2 2026 | H2 2026 | Available now |
| Spheron Pricing | Not listed | Not listed | Not listed | From $2.97/hr spot |
MI400 specs sourced from AMD's confirmed announcements. AMD has not published differentiated memory or bandwidth specs for MI430X and MI440X; the figures above are unconfirmed estimates based on SKU positioning. Performance projections will be updated when third-party benchmarks are available.
AMD MI400 Lineup: MI455X, MI430X, MI440X
MI455X: The Flagship
The MI455X is AMD's top-tier MI400 SKU: 432 GB HBM4, 19.6 TB/s bandwidth, and an estimated ~2,000-2,500W liquid-cooled thermal envelope (based on available leaks; AMD has not confirmed TDP). That memory figure is the headline. A single MI455X can hold a 216B-parameter model at FP16 without partitioning across GPUs, a capacity class that currently requires multiple B300s.
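As a rough sanity check on that capacity claim, here is a back-of-envelope sketch. It assumes weights are the only thing resident in HBM (no KV cache, activations, or framework overhead) and treats 1 GB as 10^9 bytes:

```python
def max_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Upper bound on model size (billions of parameters) that fits in HBM,
    counting weights only (no KV cache, activations, or framework overhead)."""
    # GB (1e9 bytes) divided by bytes per parameter gives billions of parameters
    return memory_gb / bytes_per_param

print(max_params_billions(432, 2))  # MI455X at FP16 -> 216.0 B params
print(max_params_billions(432, 1))  # MI455X at FP8  -> 432.0 B params
print(max_params_billions(288, 2))  # B300 at FP16   -> 144.0 B params
```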
HBM4 is the architectural foundation here. Where HBM3e delivers 8 TB/s total on the B300, HBM4 pushes bandwidth higher through tighter die-to-die interconnect and improved signaling. AMD's CDNA 5 implementation achieves 19.6 TB/s total, roughly 2.45x the B300's bandwidth.
CDNA 5 carries forward native FP4 and FP6 support from CDNA 4, maintaining AMD's expanded precision range. Exact TFLOPS figures for MI455X FP8 and FP4 compute have not been published by AMD as of April 2026.
MI430X: Mid-Tier, Air-Cooled
The MI430X is the mid-range SKU targeting deployments that cannot support liquid cooling. AMD has not published final specs for MI430X. Based on AMD's confirmed announcements, all MI400 SKUs may share the same 432 GB HBM4 / 19.6 TB/s configuration, with differentiation in workload focus (FP64/HPC vs FP4/AI compute) rather than memory capacity. The air-cooled thermal envelope constrains peak power, which in turn limits clock headroom relative to the MI455X.
For teams that found MI350X capacity sufficient but want HBM4 bandwidth improvements, MI430X is the likely target SKU.
MI440X: Entry SKU
The MI440X occupies the lower end of the MI400 family. AMD positions this for deployment flexibility and cost efficiency. AMD has not published final MI440X memory specs as of April 2026; available sources suggest all MI400 SKUs may share the same 432 GB HBM4 / 19.6 TB/s configuration, with SKU differentiation based on workload focus rather than memory capacity.
Helios: AMD's Rack-Scale Infrastructure
AMD's "Helios" system is the rack-scale infrastructure built around the MI400 series. The standard configuration scales to 72 GPUs in a tightly coupled scale-up domain, connected via UALink (running Infinity Fabric protocol over Ethernet), with aggregate interconnect bandwidth competitive with NVIDIA's NVL72 at the system level. This is AMD's direct answer to NVIDIA's NVLink-based scale-up approach.
For context on how MI400 builds on AMD's previous generation, see our AMD MI350X vs NVIDIA B200 comparison.
NVIDIA B300 Blackwell Ultra: What You Get Today
The B300 is the current-generation Blackwell Ultra GPU. Key numbers: 288 GB HBM3e at 8 TB/s, ~15,000 TFLOPS NVFP4, ~5,000 TFLOPS FP8 dense, 1,400W liquid-cooled TDP, and NVLink 5 at 1.8 TB/s per GPU.
It is available now. That single fact shapes most of the decision matrix in this post.
For the full B300 specs and deployment guide, see our NVIDIA B300 Blackwell Ultra guide.
B300 SXM6 spot pricing on Spheron is $2.97/hr per GPU as of April 2026. No on-demand pricing is currently listed. Check current GPU pricing for live rates.
Architecture Deep Dive: CDNA 5 vs Blackwell Ultra
AMD CDNA 5 Architecture
CDNA 5 is AMD's successor to the CDNA 4 architecture used in MI350X and MI355X. The core advancement is the shift from HBM3e to HBM4, which more than doubles memory bandwidth on the MI455X flagship.
AMD continues its chiplet packaging approach, combining GPU compute dies with HBM4 stacks and I/O dies on a single package. CDNA 5 retains the FP4 and FP6 precision formats that CDNA 4 added over CDNA 3. Infinity Fabric (next-gen, specific generation unconfirmed by AMD) handles multi-GPU connectivity and scales to rack level with the Helios infrastructure.
CDNA 5 FP8 and FP4 TFLOPS figures have not been confirmed by AMD as of April 2026. For how CDNA 4 compares to CDNA 3, see the architecture section in our MI350X vs B200 post.
NVIDIA Blackwell Ultra Architecture
The B300 uses a dual-die Blackwell Ultra design, expanded to 288 GB HBM3e. The key differentiator is the 5th-generation Tensor Cores with native FP4 support: ~15,000 TFLOPS NVFP4, roughly 1.5x the B200's FP4 figure.
NVLink 5 at 1.8 TB/s per GPU is the interconnect. For 8-GPU nodes and multi-node clusters, NVLink 5 is the proven high-bandwidth option, with mature NCCL support and production-tested all-reduce performance.
For architectural lineage context, see our NVIDIA Rubin vs Blackwell vs Hopper guide.
HBM4 vs HBM3e: Memory Bandwidth and Capacity Comparison
This is AMD's clearest advantage on paper.
| Metric | AMD MI455X (HBM4) | NVIDIA B300 (HBM3e) |
|---|---|---|
| Capacity | 432 GB | 288 GB |
| Bandwidth | 19.6 TB/s | 8 TB/s |
| Per-GPU model capacity (FP16) | ~216B params | ~144B params |
| Per-GPU model capacity (FP8) | ~432B params | ~288B params |
Why bandwidth matters for LLM inference: token generation in autoregressive decoding is memory-bandwidth-bound at small batch sizes. Each token requires loading all model weights from HBM to compute units, so throughput scales closely with bandwidth. A 2.5x bandwidth advantage translates to approximately 2.5x token generation rate for single-request inference, assuming the workload stays bandwidth-bound.
That assumption breaks down at large batch sizes, where the GPU becomes compute-bound. FP4 quantization pushes B300 toward compute saturation, where its ~15,000 TFLOPS NVFP4 figure becomes more relevant than bandwidth. For memory-bound workloads (small batch, long context, large models), HBM4 is a real advantage. For compute-bound workloads (large batch FP4 inference), B300's compute density matters more.
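A minimal sketch of that bandwidth-bound ceiling, using the spec-table bandwidth figures and a hypothetical 70 GB (FP8, 70B-class) model; treat the output as a roofline bound, not a benchmark prediction:

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, weight_bytes_gb: float) -> float:
    """Bandwidth-bound ceiling on single-request decode throughput: every generated
    token streams all model weights from HBM once (KV-cache traffic ignored)."""
    return (bandwidth_tb_s * 1e12) / (weight_bytes_gb * 1e9)

# Hypothetical 70B-parameter model quantized to FP8 (~70 GB of weights)
for name, bw_tb_s in [("MI455X (19.6 TB/s)", 19.6), ("B300 (8 TB/s)", 8.0)]:
    print(f"{name}: ~{decode_ceiling_tok_s(bw_tb_s, 70):.0f} tok/s per request")
# At large batch sizes the bottleneck shifts to compute, and the B300's NVFP4
# TFLOPS figure becomes the relevant ceiling instead.
```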
For a deeper explanation of how memory bandwidth affects inference throughput, see our GPU memory requirements for LLMs guide.
Inference Benchmarks: LLM Serving Performance Projections
MI400 hardware is not publicly benchmarked as of April 2026. The numbers below are projections derived from bandwidth-scaling applied to publicly available benchmarks. They are clearly labeled as estimates.
Projection methodology: MI455X bandwidth (19.6 TB/s) is approximately 2.45x the MI350X's ~8 TB/s. For memory-bandwidth-bound workloads (batch size 1 and other small batches), we scale MI350X projected figures by ~2.45x. For compute-bound workloads, the multiplier depends on CDNA 5 TFLOPS figures, which AMD has not confirmed. B300 figures are from public inference benchmarks.
| Model | Batch | MI455X est. tok/s | B300 est. tok/s | Notes |
|---|---|---|---|---|
| Llama 4 Scout (109B) FP8 | 1 | ~5,800 (est.) | ~3,200 (est.) | Bandwidth-bound projection; MI455X based on 2.45x MI350X projection |
| Llama 4 Maverick (400B) FP8 | 1 | ~680 (est., 1-GPU) | ~430 (est., 2-GPU) | MI455X fits single GPU at FP8 (432 GB); B300 requires 2 GPUs (576 GB) |
| DeepSeek R1 671B FP8 | 1 | ~700 (est., 2-GPU) | ~430 (est., 3-GPU) | MI455X needs 2 GPUs; B300 needs 3; MI455X 2×19.6 TB/s = 39.2 TB/s vs B300 3×8 TB/s = 24 TB/s |
These are estimates derived from bandwidth-scaling projections applied to publicly available benchmarks. Real performance depends on ROCm kernel optimization, CDNA 5 driver maturity, and framework implementation quality. Update these when AMD publishes official CDNA 5 benchmark data or third-party results become available.
Reference MI350X projections from our AMD MI350X vs NVIDIA B200 comparison.
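To make that scaling methodology concrete, here is the arithmetic as a short sketch; the 1,000 tok/s input is a placeholder, not a published MI350X number:

```python
MI350X_BW_TB_S = 8.0    # MI350X HBM3e bandwidth used in the methodology note above
MI455X_BW_TB_S = 19.6   # MI455X HBM4 bandwidth

def scale_to_mi455x(mi350x_tok_s: float) -> float:
    """Scale an MI350X projection to MI455X for bandwidth-bound decode.
    Only valid while the workload stays memory-bandwidth-bound; compute-bound
    cases would need CDNA 5 TFLOPS figures, which AMD has not published."""
    return mi350x_tok_s * (MI455X_BW_TB_S / MI350X_BW_TB_S)

print(scale_to_mi455x(1000))  # placeholder 1,000 tok/s baseline -> ~2,450 tok/s
```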
Training Performance: Multi-Node Scaling
NVLink 5 at 1.8 TB/s per GPU (B300) is the established leader for dense 8-GPU configurations and multi-node gradient synchronization. NCCL is mature, the all-reduce topology is proven, and the B300 integrates directly into existing NVLink infrastructure.
AMD's Helios rack-scale system uses UALink (running Infinity Fabric protocol over Ethernet) to connect MI400 GPUs at scale. AMD claims competitive all-reduce bandwidth at the system level, but independent multi-node benchmarks for CDNA 5 do not exist as of April 2026. The MI350X track record on multi-node ROCm training is improving, but reaching NVIDIA-equivalent throughput has required AMD engineering involvement in several cases.
Expect the same dynamic with MI400 initially: strong theoretical bandwidth, improving software support, but a lag behind CUDA on complex distributed training configurations. For practical guidance on multi-node setups, see our multi-node GPU training without InfiniBand guide.
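For orientation, here is a minimal multi-GPU all-reduce sketch in PyTorch, launched with torchrun. It runs unchanged on both stacks because ROCm builds expose RCCL through the same "nccl" backend name; the tensor size is arbitrary:

```python
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK; "nccl" maps to RCCL on ROCm builds
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # ROCm PyTorch reuses the torch.cuda namespace

    # Simulate a gradient bucket and all-reduce it across the job
    grad = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of FP32 "gradients"
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    if dist.get_rank() == 0:
        print("all-reduce complete across", dist.get_world_size(), "GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```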
Software Ecosystem: ROCm vs CUDA Maturity in 2026
MI400 hardware ships H2 2026, with broad cloud availability likely stretching into early 2027. The software projections below assume ROCm 8.x aligned with MI400's release.
| Framework | MI400 (ROCm 8.x projected) | B300 (CUDA 13.x) | Notes |
|---|---|---|---|
| vLLM | Expected support | Full support | ROCm backend improving; CUDA has more optimizations |
| SGLang | Expected support | Full support | ROCm support available in recent releases |
| TensorRT-LLM | Not supported | Full support | NVIDIA-only |
| FlashAttention 4 | Partial (projected) | Full | AMD Triton backend maturing; FlashAttention 4 (CuTeDSL) is optimized for Blackwell |
| DeepSpeed | Expected support | Full support | Both platforms have active development |
| Megatron-LM | Partial (projected) | Full | NVIDIA-tuned for Blackwell |
| Triton | Partial (projected) | Full | AMD Triton backend maturing; CUDA is more stable |
"Projected" entries are based on AMD's track record with ROCm and announced support commitments. Verify actual status when MI400 cloud instances become available. For the current state of ROCm compatibility, see the ROCm vs CUDA section from our MI350X vs B200 post.
TensorRT-LLM remains NVIDIA-only and will not support ROCm. Teams using TensorRT-LLM in production should stay on NVIDIA hardware.
Pricing Projections: What MI400 Cloud Instances Will Cost
B300 on Spheron: $2.97/hr spot per GPU (B300 SXM6). No on-demand pricing is listed as of April 2026. View current pricing.
MI400 pricing projections: No cloud provider lists MI400 instances as of April 2026. Hardware has not yet shipped to data centers. Historically, AMD GPU cloud pricing debuts 15-25% below equivalent NVIDIA hardware. Based on this pattern and projected MI455X capacity:
| GPU | Type | $/hr | Notes |
|---|---|---|---|
| NVIDIA B300 SXM6 | Spot | $2.97 (live) | Available now on Spheron |
| NVIDIA B300 SXM6 | On-demand | Not currently listed | Check pricing page for updates |
| AMD MI455X | On-demand | $10-14 (est.) | H2 2026 availability projected |
| AMD MI430X | On-demand | $7-10 (est.) | Lower-tier SKU |
| AMD MI440X | On-demand | $5-8 (est.) | Entry SKU |
MI400 pricing estimates are based on MI350X market precedent and capacity premium for HBM4. Spot pricing will typically be 30-50% below on-demand when available.
Pricing fluctuates based on GPU availability. The prices above are based on 06 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For a broader view across providers, see our GPU cloud pricing comparison for 2026.
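One practical way to compare these rates once you have measured throughput for your own workload: fold $/hr and sustained tokens/sec into a cost per million generated tokens. A minimal sketch, where the 3,000 tok/s figure is a placeholder rather than a benchmark:

```python
def dollars_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M generated tokens at a given hourly rate and sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# B300 spot at $2.97/hr (from the table above) with a hypothetical 3,000 tok/s sustained
print(round(dollars_per_million_tokens(2.97, 3000), 3))  # ~$0.275 per 1M tokens
```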
Migration Considerations: Moving Workloads from NVIDIA to AMD
HIP for CUDA Porting
AMD's HIP compatibility layer handles the bulk of CUDA-to-ROCm porting for standard operations. PyTorch, vLLM, and most inference frameworks that use standard CUDA calls run on ROCm without code changes. Custom CUDA kernels with device-specific intrinsics require manual HIP translation.
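In PyTorch terms, "runs without code changes" means the same torch.cuda calls work on ROCm builds; the only AMD-specific touch point is usually a build check like this minimal sketch:

```python
import torch

def describe_backend() -> str:
    """Report whether this PyTorch build targets ROCm (HIP) or CUDA."""
    if torch.version.hip is not None:
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda is not None:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

# Device-agnostic placement: the "cuda" device string is reused on ROCm builds
device = "cuda" if torch.cuda.is_available() else "cpu"
print(describe_backend(), "| using device:", device)
model = torch.nn.Linear(4096, 4096).to(device)
```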
Docker Container Differences
NVIDIA containers from NGC (NVIDIA GPU Cloud) do not run on ROCm. AMD provides ROCm-compatible base images on Docker Hub and AMD's container registry. If your deployment relies on NGC-provided inference containers, plan a container migration as part of the AMD move.
What to Benchmark Before Committing
Memory-bandwidth-sensitive workloads, such as long-context prefill and low-batch decoding of large models, benefit most from the MI455X's HBM4. Compute-bound workloads with FP4 quantization at large batch sizes are the B300's strength, given its ~15,000 TFLOPS NVFP4 figure and the maturity of TensorRT-LLM and FlashAttention 3 on CUDA.
Run your production workload on a single GPU before committing to fleet migration. Measure actual tokens-per-second and time-to-first-token, not projected throughput.
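A minimal single-GPU throughput check using vLLM's offline API is one way to run that measurement; the model name and prompts below are placeholders, and time-to-first-token needs the streaming server path rather than this batch run:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model; substitute the checkpoint you actually serve
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=1)
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarize the benefits of HBM4 for LLM inference."] * 8  # stand-in batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s sustained over {elapsed:.1f}s")
# Time-to-first-token requires per-token streaming (e.g. the OpenAI-compatible
# server); this offline run only captures aggregate throughput.
```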
Team Investment Estimate
A standard PyTorch inference migration to ROCm typically takes 1-2 weeks. Custom kernel optimization, especially for non-standard attention implementations, can take 4-8 weeks. Training workload migrations with custom kernels sit at the higher end of that range.
Which GPU to Choose for Your Workload Type
When to Choose AMD MI455X (when available)
- Models in the 144B-216B parameter range at FP16 (or 288B-432B at FP8): a single MI455X fits where the B300 requires 2 GPUs. Fewer GPUs means less inter-GPU communication overhead and simpler deployment.
- Extreme memory-bandwidth-dependent workloads: long-context prefill and streaming inference at low batch sizes scale most directly with bandwidth.
- Teams with existing ROCm or MI350X investment who want a natural upgrade path with minimal software changes.
- Cost-sensitive large-model inference once MI400 cloud pricing stabilizes and the historical AMD discount pattern holds.
When to Choose NVIDIA B300 (available now)
- Available today, with no wait for H2 2026 hardware shipments.
- Native NVFP4 at ~15,000 TFLOPS for quantized inference stacks. For models with calibrated FP4 weights, B300 delivers the highest throughput per GPU available.
- Models under 288 GB (where B300 memory is sufficient and the FP4 compute advantage is decisive).
- Teams on CUDA, TensorRT-LLM, or heavily optimized CUDA kernel stacks that cannot absorb a migration.
- Production workloads that need proven uptime, GPU availability, and mature software support without the uncertainty of a new architecture.
For guidance on selecting GPUs across the full spectrum of workload types, see our best GPU for AI inference 2026 guide.
GPU Cloud Availability
B300: Available now on Spheron. Spot pricing at $2.97/hr per GPU. View pricing →
MI400: Not yet available on any cloud provider as of April 2026. Spheron will add MI400 capacity when hardware becomes available through data center partners globally. Check the pricing page for updates.
Spheron operates as a multi-vendor GPU cloud, aggregating capacity from data center partners across multiple regions. Customers access both AMD and NVIDIA hardware through a single platform, with competitive pricing driven by vendor diversity. As MI400 capacity comes online, it will be listed alongside NVIDIA B300, H200, and H100 instances.
For a broader comparison of GPU cloud providers, see our top GPU cloud providers guide.
Full Specs Comparison
| Specification | AMD MI455X | AMD MI430X | AMD MI440X | NVIDIA B300 SXM6 |
|---|---|---|---|---|
| Architecture | CDNA 5 | CDNA 5 | CDNA 5 | Blackwell Ultra (dual-die) |
| Memory Type | HBM4 | HBM4 | HBM4 | HBM3e |
| Memory Capacity | 432 GB | 432 GB (est., unconfirmed) | 432 GB (est., unconfirmed) | 288 GB |
| Memory Bandwidth | 19.6 TB/s | 19.6 TB/s (est., unconfirmed) | 19.6 TB/s (est., unconfirmed) | 8 TB/s |
| Peak FP8 (dense) | TBD | TBD | TBD | ~5,000 TFLOPS |
| Peak FP4 (dense) | Supported (figures pending) | Supported | Supported | ~15,000 TFLOPS (NVFP4) |
| TDP | ~2,000-2,500W (est., based on available leaks) | Not disclosed | Not disclosed | 1,400W |
| Cooling | Liquid (est.) | Air (est.) | Air (est.) | Liquid |
| Interconnect | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | NVLink 5 (1.8 TB/s) |
| Software Stack | ROCm 8.x (projected) | ROCm 8.x (projected) | ROCm 8.x (projected) | CUDA 13.x |
| Cloud Availability | H2 2026 | H2 2026 | H2 2026 | Now |
| Spheron Pricing | Not listed | Not listed | Not listed | $2.97/hr spot |
The B300 is available on Spheron today. When MI400 instances come online, you will find them here too: same platform, no lock-in. Compare current GPU pricing → or start with B300 now.
