AMD's MI400 series ships H2 2026 with 432 GB HBM4 and 19.6 TB/s memory bandwidth: the most memory-dense GPU announced to date. The B300 is available now with 288 GB HBM3e and ~15 petaFLOPS of dense FP4 compute. This post gives you the concrete comparison so you can decide which path fits your infrastructure timeline.
MI400 vs B300 at a Glance
| Spec | AMD MI455X | AMD MI430X | AMD MI440X | NVIDIA B300 SXM6 |
|---|---|---|---|---|
| Architecture | CDNA 5 | CDNA 5 | CDNA 5 | Blackwell Ultra (dual-die) |
| Memory | 432 GB HBM4 | 432 GB HBM4 (est., unconfirmed) | 432 GB HBM4 (est., unconfirmed) | 288 GB HBM3e |
| Memory Bandwidth | 19.6 TB/s | 19.6 TB/s (est., unconfirmed) | 19.6 TB/s (est., unconfirmed) | 8 TB/s |
| Peak FP8 (dense) | TBD (CDNA 5 pending) | TBD | TBD | ~5,000 TFLOPS |
| Peak FP4 | Supported (figures pending) | Supported | Supported | ~15,000 TFLOPS (NVFP4) |
| TDP | ~2,000-2,500W (liquid-cooled, est., based on available leaks) | TDP not disclosed | TDP not disclosed | 1,400W |
| Interconnect | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | NVLink 5 (1.8 TB/s) |
| Software Stack | ROCm 8.x (projected) | ROCm 8.x (projected) | ROCm 8.x (projected) | CUDA 13.x / TensorRT-LLM |
| Cloud Availability | H2 2026 | H2 2026 | H2 2026 | Available now |
| Spheron Pricing | Not listed | Not listed | Not listed | From $2.97/hr spot |
MI400 specs sourced from AMD's confirmed announcements. AMD has not published differentiated memory or bandwidth specs for MI430X and MI440X; the figures above are unconfirmed estimates based on SKU positioning. Performance projections will be updated when third-party benchmarks are available.
AMD MI400 Lineup: MI455X, MI430X, MI440X
MI455X: The Flagship
The MI455X is AMD's top-tier MI400 SKU: 432 GB HBM4, 19.6 TB/s bandwidth, and an estimated ~2,000-2,500W liquid-cooled thermal envelope (based on available leaks; AMD has not confirmed TDP). That memory figure is the headline. A single MI455X can hold a 216B-parameter model at FP16 without partitioning across GPUs, a capacity class that currently requires multiple B300s.
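As a rough sanity check on that capacity claim, here is a back-of-envelope sketch. It assumes weights are the only thing resident in HBM (no KV cache, activations, or framework overhead) and treats 1 GB as 10^9 bytes:

```python
def max_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Upper bound on model size (billions of parameters) that fits in HBM,
    counting weights only (no KV cache, activations, or framework overhead)."""
    # GB (1e9 bytes) divided by bytes per parameter gives billions of parameters
    return memory_gb / bytes_per_param

print(max_params_billions(432, 2))  # MI455X at FP16 -> 216.0 B params
print(max_params_billions(432, 1))  # MI455X at FP8  -> 432.0 B params
print(max_params_billions(288, 2))  # B300 at FP16   -> 144.0 B params
```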
HBM4 is the architectural foundation here. Where HBM3e delivers 8 TB/s total on the B300, HBM4 pushes bandwidth higher through tighter die-to-die interconnect and improved signaling. AMD's CDNA 5 implementation achieves 19.6 TB/s total, roughly 2.45x the B300's bandwidth.
CDNA 5 carries forward native FP4 and FP6 support from CDNA 4, maintaining AMD's expanded precision range. Exact TFLOPS figures for MI455X FP8 and FP4 compute have not been published by AMD as of April 2026.
MI430X: Mid-Tier, Air-Cooled
The MI430X is the mid-range SKU targeting deployments that cannot support liquid cooling. AMD has not published final specs for MI430X. Based on AMD's confirmed announcements, all MI400 SKUs may share the same 432 GB HBM4 / 19.6 TB/s configuration, with differentiation in workload focus (FP64/HPC vs FP4/AI compute) rather than memory capacity. The air-cooled thermal envelope constrains peak power, which in turn limits clock headroom relative to the MI455X.
For teams that found MI350X capacity sufficient but want HBM4 bandwidth improvements, MI430X is the likely target SKU.
MI440X: Entry SKU
The MI440X occupies the lower end of the MI400 family. AMD positions this for deployment flexibility and cost efficiency. AMD has not published final MI440X memory specs as of April 2026; available sources suggest all MI400 SKUs may share the same 432 GB HBM4 / 19.6 TB/s configuration, with SKU differentiation based on workload focus rather than memory capacity.
Helios: AMD's Rack-Scale Infrastructure
AMD's "Helios" system is the rack-scale infrastructure built around the MI400 series. The standard configuration scales to 72 GPUs in a tightly coupled scale-up domain, connected via UALink (running Infinity Fabric protocol over Ethernet), with aggregate interconnect bandwidth competitive with NVIDIA's NVL72 at the system level. This is AMD's direct answer to NVIDIA's NVLink-based scale-up approach.
For context on how MI400 builds on AMD's previous generation, see our AMD MI350X vs NVIDIA B200 comparison.
NVIDIA B300 Blackwell Ultra: What You Get Today
The B300 is the current-generation Blackwell Ultra GPU. Key numbers: 288 GB HBM3e at 8 TB/s, ~15,000 TFLOPS NVFP4, ~5,000 TFLOPS FP8 dense, 1,400W liquid-cooled TDP, and NVLink 5 at 1.8 TB/s per GPU.
It is available now. That single fact shapes most of the decision matrix in this post.
For the full B300 specs and deployment guide, see our NVIDIA B300 Blackwell Ultra guide.
B300 SXM6 spot pricing on Spheron is $2.97/hr per GPU as of April 2026. No on-demand pricing is currently listed. Check current GPU pricing for live rates.
Architecture Deep Dive: CDNA 5 vs Blackwell Ultra
AMD CDNA 5 Architecture
CDNA 5 is AMD's successor to the CDNA 4 architecture used in MI350X and MI355X. The core advancement is the shift from HBM3e to HBM4, which more than doubles memory bandwidth on the MI455X flagship.
AMD continues its chiplet packaging approach, combining GPU compute dies with HBM4 stacks and I/O dies on a single package. CDNA 5 retains the FP4 and FP6 precision formats that CDNA 4 added over CDNA 3. Infinity Fabric (next-gen, specific generation unconfirmed by AMD) handles multi-GPU connectivity and scales to rack level with the Helios infrastructure.
CDNA 5 FP8 and FP4 TFLOPS figures have not been confirmed by AMD as of April 2026. For how CDNA 4 compares to CDNA 3, see the architecture section in our MI350X vs B200 post.
NVIDIA Blackwell Ultra Architecture
The B300 uses a dual-die Blackwell Ultra design, expanded to 288 GB HBM3e. The key differentiator is the 5th-generation Tensor Cores with native FP4 support: ~15,000 TFLOPS NVFP4, roughly 1.5x the B200's FP4 figure.
NVLink 5 at 1.8 TB/s per GPU is the interconnect. For 8-GPU nodes and multi-node clusters, NVLink 5 is the proven high-bandwidth option, with mature NCCL support and production-tested all-reduce performance.
For architectural lineage context, see our NVIDIA Rubin vs Blackwell vs Hopper guide.
HBM4 vs HBM3e: Memory Bandwidth and Capacity Comparison
This is AMD's clearest advantage on paper.
| Metric | AMD MI455X (HBM4) | NVIDIA B300 (HBM3e) |
|---|---|---|
| Capacity | 432 GB | 288 GB |
| Bandwidth | 19.6 TB/s | 8 TB/s |
| Per-GPU model capacity (FP16) | ~216B params | ~144B params |
| Per-GPU model capacity (FP8) | ~432B params | ~288B params |
Why bandwidth matters for LLM inference: token generation in autoregressive decoding is memory-bandwidth-bound at small batch sizes. Each token requires loading all model weights from HBM to compute units, so throughput scales closely with bandwidth. A 2.5x bandwidth advantage translates to approximately 2.5x token generation rate for single-request inference, assuming the workload stays bandwidth-bound.
That assumption breaks down at large batch sizes, where the GPU becomes compute-bound. FP4 quantization pushes B300 toward compute saturation, where its ~15,000 TFLOPS NVFP4 figure becomes more relevant than bandwidth. For memory-bound workloads (small batch, long context, large models), HBM4 is a real advantage. For compute-bound workloads (large batch FP4 inference), B300's compute density matters more.
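A minimal sketch of that bandwidth-bound ceiling, using the spec-table bandwidth figures and a hypothetical 70 GB (FP8, 70B-class) model; treat the output as a roofline bound, not a benchmark prediction:

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, weight_bytes_gb: float) -> float:
    """Bandwidth-bound ceiling on single-request decode throughput: every generated
    token streams all model weights from HBM once (KV-cache traffic ignored)."""
    return (bandwidth_tb_s * 1e12) / (weight_bytes_gb * 1e9)

# Hypothetical 70B-parameter model quantized to FP8 (~70 GB of weights)
for name, bw_tb_s in [("MI455X (19.6 TB/s)", 19.6), ("B300 (8 TB/s)", 8.0)]:
    print(f"{name}: ~{decode_ceiling_tok_s(bw_tb_s, 70):.0f} tok/s per request")
# At large batch sizes the bottleneck shifts to compute, and the B300's NVFP4
# TFLOPS figure becomes the relevant ceiling instead.
```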
For a deeper explanation of how memory bandwidth affects inference throughput, see our GPU memory requirements for LLMs guide.
Inference Benchmarks: LLM Serving Performance Projections
MI400 hardware is not publicly benchmarked as of April 2026. The numbers below are projections derived from bandwidth-scaling applied to publicly available benchmarks. They are clearly labeled as estimates.
Projection methodology: MI455X bandwidth (19.6 TB/s) is approximately 2.45x the MI350X's ~8 TB/s. For memory-bandwidth-bound workloads (batch size 1 and other small batches), we scale MI350X projected figures by ~2.45x. For compute-bound workloads, the multiplier depends on CDNA 5 TFLOPS figures, which AMD has not confirmed. B300 figures are from public inference benchmarks.
| Model | Batch | MI455X est. tok/s | B300 est. tok/s | Notes |
|---|---|---|---|---|
| Llama 4 Scout (109B) FP8 | 1 | ~5,800 (est.) | ~3,200 (est.) | Bandwidth-bound projection; MI455X based on 2.45x MI350X projection |
| Llama 4 Maverick (400B) FP8 | 1 | ~680 (est., 1-GPU) | ~430 (est., 2-GPU) | MI455X fits single GPU at FP8 (432 GB); B300 requires 2 GPUs (576 GB) |
| DeepSeek R1 671B FP8 | 1 | ~700 (est., 2-GPU) | ~430 (est., 3-GPU) | MI455X needs 2 GPUs; B300 needs 3; MI455X 2×19.6 TB/s = 39.2 TB/s vs B300 3×8 TB/s = 24 TB/s |
These are estimates derived from bandwidth-scaling projections applied to publicly available benchmarks. Real performance depends on ROCm kernel optimization, CDNA 5 driver maturity, and framework implementation quality. Update these when AMD publishes official CDNA 5 benchmark data or third-party results become available.
Reference MI350X projections from our AMD MI350X vs NVIDIA B200 comparison.
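To make that scaling methodology concrete, here is the arithmetic as a short sketch; the 1,000 tok/s input is a placeholder, not a published MI350X number:

```python
MI350X_BW_TB_S = 8.0    # MI350X HBM3e bandwidth used in the methodology note above
MI455X_BW_TB_S = 19.6   # MI455X HBM4 bandwidth

def scale_to_mi455x(mi350x_tok_s: float) -> float:
    """Scale an MI350X projection to MI455X for bandwidth-bound decode.
    Only valid while the workload stays memory-bandwidth-bound; compute-bound
    cases would need CDNA 5 TFLOPS figures, which AMD has not published."""
    return mi350x_tok_s * (MI455X_BW_TB_S / MI350X_BW_TB_S)

print(scale_to_mi455x(1000))  # placeholder 1,000 tok/s baseline -> ~2,450 tok/s
```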
Training Performance: Multi-Node Scaling
NVLink 5 at 1.8 TB/s per GPU (B300) is the established leader for dense 8-GPU configurations and multi-node gradient synchronization. NCCL is mature, the all-reduce topology is proven, and the B300 integrates directly into existing NVLink infrastructure.
AMD's Helios rack-scale system uses UALink (running Infinity Fabric protocol over Ethernet) to connect MI400 GPUs at scale. AMD claims competitive all-reduce bandwidth at the system level, but independent multi-node benchmarks for CDNA 5 do not exist as of April 2026. The MI350X track record on multi-node ROCm training is improving, but reaching NVIDIA-equivalent throughput has required AMD engineering involvement in several cases.
Expect the same dynamic with MI400 initially: strong theoretical bandwidth, improving software support, but a lag behind CUDA on complex distributed training configurations. For practical guidance on multi-node setups, see our multi-node GPU training without InfiniBand guide.
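For orientation, here is a minimal multi-GPU all-reduce sketch in PyTorch, launched with torchrun. It runs unchanged on both stacks because ROCm builds expose RCCL through the same "nccl" backend name; the tensor size is arbitrary:

```python
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK; "nccl" maps to RCCL on ROCm builds
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # ROCm PyTorch reuses the torch.cuda namespace

    # Simulate a gradient bucket and all-reduce it across the job
    grad = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of FP32 "gradients"
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    if dist.get_rank() == 0:
        print("all-reduce complete across", dist.get_world_size(), "GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```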
Software Ecosystem: ROCm vs CUDA Maturity in 2026
MI400 hardware ships H2 2026, with broad cloud availability likely stretching into early 2027. The software projections below assume ROCm 8.x aligned with MI400's release.
| Framework | MI400 (ROCm 8.x projected) | B300 (CUDA 13.x) | Notes |
|---|---|---|---|
| vLLM | Expected support | Full support | ROCm backend improving; CUDA has more optimizations |
| SGLang | Expected support | Full support | ROCm support available in recent releases |
| TensorRT-LLM | Not supported | Full support | NVIDIA-only |
| FlashAttention 4 | Partial (projected) | Full | AMD Triton backend maturing; FlashAttention 4 (CuTeDSL) is optimized for Blackwell |
| DeepSpeed | Expected support | Full support | Both platforms have active development |
| Megatron-LM | Partial (projected) | Full | NVIDIA-tuned for Blackwell |
| Triton | Partial (projected) | Full | AMD Triton backend maturing; CUDA is more stable |
"Projected" entries are based on AMD's track record with ROCm and announced support commitments. Verify actual status when MI400 cloud instances become available. For the current state of ROCm compatibility, see the ROCm vs CUDA section from our MI350X vs B200 post.
TensorRT-LLM remains NVIDIA-only and will not support ROCm. Teams using TensorRT-LLM in production should stay on NVIDIA hardware.
Pricing Projections: What MI400 Cloud Instances Will Cost
B300 on Spheron: $2.97/hr spot per GPU (B300 SXM6). No on-demand pricing is listed as of April 2026. View current pricing.
MI400 pricing projections: No cloud provider lists MI400 instances as of April 2026. Hardware has not yet shipped to data centers. Historically, AMD GPU cloud pricing debuts 15-25% below equivalent NVIDIA hardware. Based on this pattern and projected MI455X capacity:
| GPU | Type | $/hr | Notes |
|---|---|---|---|
| NVIDIA B300 SXM6 | Spot | $2.97 (live) | Available now on Spheron |
| NVIDIA B300 SXM6 | On-demand | Not currently listed | Check pricing page for updates |
| AMD MI455X | On-demand | $10-14 (est.) | H2 2026 availability projected |
| AMD MI430X | On-demand | $7-10 (est.) | Lower-tier SKU |
| AMD MI440X | On-demand | $5-8 (est.) | Entry SKU |
MI400 pricing estimates are based on MI350X market precedent and capacity premium for HBM4. Spot pricing will typically be 30-50% below on-demand when available.
Pricing fluctuates based on GPU availability. The prices above are based on 06 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For a broader view across providers, see our GPU cloud pricing comparison for 2026.
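One practical way to compare these rates once you have measured throughput for your own workload: fold $/hr and sustained tokens/sec into a cost per million generated tokens. A minimal sketch, where the 3,000 tok/s figure is a placeholder rather than a benchmark:

```python
def dollars_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M generated tokens at a given hourly rate and sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# B300 spot at $2.97/hr (from the table above) with a hypothetical 3,000 tok/s sustained
print(round(dollars_per_million_tokens(2.97, 3000), 3))  # ~$0.275 per 1M tokens
```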
Migration Considerations: Moving Workloads from NVIDIA to AMD
HIP for CUDA Porting
AMD's HIP compatibility layer handles the bulk of CUDA-to-ROCm porting for standard operations. PyTorch, vLLM, and most inference frameworks that use standard CUDA calls run on ROCm without code changes. Custom CUDA kernels with device-specific intrinsics require manual HIP translation.
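In PyTorch terms, "runs without code changes" means the same torch.cuda calls work on ROCm builds; the only AMD-specific touch point is usually a build check like this minimal sketch:

```python
import torch

def describe_backend() -> str:
    """Report whether this PyTorch build targets ROCm (HIP) or CUDA."""
    if torch.version.hip is not None:
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda is not None:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

# Device-agnostic placement: the "cuda" device string is reused on ROCm builds
device = "cuda" if torch.cuda.is_available() else "cpu"
print(describe_backend(), "| using device:", device)
model = torch.nn.Linear(4096, 4096).to(device)
```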
Docker Container Differences
NVIDIA containers from NGC (NVIDIA GPU Cloud) do not run on ROCm. AMD provides ROCm-compatible base images on Docker Hub and AMD's container registry. If your deployment relies on NGC-provided inference containers, plan a container migration as part of the AMD move.
What to Benchmark Before Committing
Memory-bandwidth-sensitive workloads, such as long-context prefill and low-batch decoding of large models, benefit most from the MI455X's HBM4. Compute-bound workloads with FP4 quantization at large batch sizes are the B300's strength, given its ~15,000 TFLOPS NVFP4 figure and the maturity of TensorRT-LLM and FlashAttention 3 on CUDA.
Run your production workload on a single GPU before committing to fleet migration. Measure actual tokens-per-second and time-to-first-token, not projected throughput.
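A minimal single-GPU throughput check using vLLM's offline API is one way to run that measurement; the model name and prompts below are placeholders, and time-to-first-token needs the streaming server path rather than this batch run:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model; substitute the checkpoint you actually serve
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=1)
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarize the benefits of HBM4 for LLM inference."] * 8  # stand-in batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s sustained over {elapsed:.1f}s")
# Time-to-first-token requires per-token streaming (e.g. the OpenAI-compatible
# server); this offline run only captures aggregate throughput.
```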
Team Investment Estimate
A standard PyTorch inference migration to ROCm typically takes 1-2 weeks. Custom kernel optimization, especially for non-standard attention implementations, can take 4-8 weeks. Training workload migrations with custom kernels sit at the higher end of that range.
Which GPU to Choose for Your Workload Type
When to Choose AMD MI455X (when available)
- Models in the 144B-216B parameter range at FP16 (or 288B-432B at FP8): a single MI455X fits where the B300 requires 2 GPUs. Fewer GPUs means less inter-GPU communication overhead and simpler deployment.
- Extreme memory-bandwidth-dependent workloads: long-context prefill and streaming inference at low batch sizes scale most directly with bandwidth.
- Teams with existing ROCm or MI350X investment who want a natural upgrade path with minimal software changes.
- Cost-sensitive large-model inference once MI400 cloud pricing stabilizes and the historical AMD discount pattern holds.
When to Choose NVIDIA B300 (available now)
- Available today, with no wait for H2 2026 hardware shipments.
- Native NVFP4 at ~15,000 TFLOPS for quantized inference stacks. For models with calibrated FP4 weights, B300 delivers the highest throughput per GPU available.
- Models under 288 GB (where B300 memory is sufficient and the FP4 compute advantage is decisive).
- Teams on CUDA, TensorRT-LLM, or heavily optimized CUDA kernel stacks that cannot absorb a migration.
- Production workloads that need proven uptime, GPU availability, and mature software support without the uncertainty of a new architecture.
For guidance on selecting GPUs across the full spectrum of workload types, see our best GPU for AI inference 2026 guide.
GPU Cloud Availability
B300: Available now on Spheron. Spot pricing at $2.97/hr per GPU. View pricing →
MI400: Not yet available on any cloud provider as of April 2026. Spheron will add MI400 capacity when hardware becomes available through data center partners globally. Check the pricing page for updates.
Spheron operates as a multi-vendor GPU cloud, aggregating capacity from data center partners across multiple regions. Customers access both AMD and NVIDIA hardware through a single platform, with competitive pricing driven by vendor diversity. As MI400 capacity comes online, it will be listed alongside NVIDIA B300, H200, and H100 instances.
For a broader comparison of GPU cloud providers, see our top GPU cloud providers guide.
Full Specs Comparison
| Specification | AMD MI455X | AMD MI430X | AMD MI440X | NVIDIA B300 SXM6 |
|---|---|---|---|---|
| Architecture | CDNA 5 | CDNA 5 | CDNA 5 | Blackwell Ultra (dual-die) |
| Memory Type | HBM4 | HBM4 | HBM4 | HBM3e |
| Memory Capacity | 432 GB | 432 GB (est., unconfirmed) | 432 GB (est., unconfirmed) | 288 GB |
| Memory Bandwidth | 19.6 TB/s | 19.6 TB/s (est., unconfirmed) | 19.6 TB/s (est., unconfirmed) | 8 TB/s |
| Peak FP8 (dense) | TBD | TBD | TBD | ~5,000 TFLOPS |
| Peak FP4 (dense) | Supported (figures pending) | Supported | Supported | ~15,000 TFLOPS (NVFP4) |
| TDP | ~2,000-2,500W (est., based on available leaks) | Not disclosed | Not disclosed | 1,400W |
| Cooling | Liquid (est.) | Air (est.) | Air (est.) | Liquid |
| Interconnect | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | Infinity Fabric (next-gen) | NVLink 5 (1.8 TB/s) |
| Software Stack | ROCm 8.x (projected) | ROCm 8.x (projected) | ROCm 8.x (projected) | CUDA 13.x |
| Cloud Availability | H2 2026 | H2 2026 | H2 2026 | Now |
| Spheron Pricing | Not listed | Not listed | Not listed | $2.97/hr spot |
The B300 is available on Spheron today. When MI400 instances come online, you will find them here too: same platform, no lock-in. Compare current GPU pricing → or start with B300 now.
