Engineering

MLPerf Inference v6.0 Results Explained: GPU Performance Rankings for AI Workloads (2026)

Written by Mitrasish, Co-founder · Apr 11, 2026
GPU Benchmark · MLPerf · LLM Inference · GPU Cloud · AI Infrastructure · NVIDIA Blackwell · AMD MI355X · Inference Performance

MLPerf Inference v6.0 results dropped April 1, 2026, and if you are trying to decide which GPU to rent for LLM or video inference, the raw scoreboards need a translator. This post decodes what changed from v5.1, what each new benchmark measures, and what the per-GPU throughput numbers mean when you map them against real cloud costs. For a broader look at how GPU specs translate to inference performance, start with our GPU cloud benchmarks guide.

What Is MLPerf Inference v6.0 and What Changed from v5.1

MLCommons is the consortium behind the MLPerf benchmarks. Its members include NVIDIA, AMD, Intel, Google, and several hyperscalers. MLPerf results carry weight because they are independently audited: submitters must run on publicly described hardware configurations, and the methodology is peer-reviewed before results go public. Unlike a vendor's in-house benchmark, MLPerf cannot be tuned purely for press-release numbers.

The benchmark suite runs in two divisions:

  • Closed division: Fixed model weights, fixed precision, fixed pre/post-processing. All submitters in the closed division are running the same model under the same constraints. Numbers are directly comparable across vendors.
  • Open division: Any optimization allowed, including custom kernels, non-standard quantization, and proprietary runtimes. Open division results can show higher throughput but are not directly comparable across submitters.

For purchasing decisions, always use closed division numbers first. Open division results show what a vendor can do with their best shot, but they do not tell you what you will actually get running the standard vLLM or TensorRT-LLM stack.

What is new in v6.0 vs v5.1:

  • GPT-OSS 120B (new): An open-weight 120B-parameter language model benchmark. It provides differentiation at the 120B scale, complementing the Llama 2 70B and Llama 3.1 8B tasks already in the suite.
  • WAN-2.2 text-to-video (new): A standardized video generation benchmark using the WAN-2.2 model. This is the first video generation workload in MLPerf Inference, and participation is limited to vendors who already support the WAN-2.2 architecture.
  • Qwen3-VL-235B VLM (new): A vision-language model task using Qwen3-VL-235B-A22B-Instruct that tests both image encoding and language decoding in a single end-to-end benchmark. Relevant for any workload that combines vision and text at scale.
  • DLRMv3 (updated): The deep learning recommendation model benchmark, updated from DLRMv2 to reflect larger embedding tables and higher-traffic query patterns.
  • YOLOv11 (new): Object detection inference using the YOLOv11 architecture. This replaces the older YOLO task and reflects modern detector usage in edge and cloud deployments.
  • DeepSeek-R1 (expanded): DeepSeek-R1 continues as an advanced-reasoning benchmark with expanded coverage, reflecting widespread adoption of reasoning-focused models in production workloads.

Key submitters in v6.0: NVIDIA with Blackwell B200 (in GB200 NVL72 systems) and H200, AMD with MI355X in both closed and open divisions, and Intel with Xeon 6 CPUs and Arc Pro B-series GPUs. Several hyperscalers (Google, Microsoft) submitted as system builders using NVIDIA hardware.

New Benchmarks in v6.0: What Each Test Measures

GPT-OSS 120B (New)

GPT-OSS 120B tests text generation throughput on an open-weight Mixture-of-Experts (MoE) transformer (117B total parameters, 5.1B active per token). The metric is tokens per second in the Offline scenario (maximum batch size, all requests submitted at once) and queries per second under a target latency constraint in the Server scenario.

This task was added because the previous GPT-J 6B benchmark had become too easy: all major accelerators saturated it at the same throughput, making it useless for differentiation. Despite only 5.1B parameters being active per token during inference, all 117B parameters must be loaded into VRAM, keeping memory capacity requirements high. The MoE routing also introduces memory-access patterns that stress different aspects of accelerator memory subsystems than a dense model would, which is why hardware differences still show clearly in the results.

In production, this maps to any large language model serving workload: chat, code generation, summarization at 100B+ parameter scale. If you are considering running Llama 3.1 405B or similar very large models, GPT-OSS 120B gives you the most relevant benchmark signal.

WAN-2.2 Text-to-Video (New)

WAN-2.2 is a video generation model. The benchmark measures samples per second (complete video clips generated per second) in Offline scenario. Video generation is substantially more memory-intensive than image generation: generating a single clip requires holding the video diffusion model, intermediate latents, and the text conditioning in VRAM simultaneously.

This task was added to reflect the growing use of GPU cloud for AI video production workloads. Both NVIDIA and AMD (with MI355X) submitted results for WAN-2.2 in the first round of v6.0. Intel participation is expected in subsequent rounds.

Vision-Language Model Benchmark (New)

The VLM benchmark uses Qwen3-VL-235B-A22B-Instruct, a Mixture-of-Experts vision-language model with 235B total parameters and 22B active parameters per forward pass. The metric is tokens per second, measured end-to-end from image input to text output.

This matters for any workload that processes images with text: document OCR pipelines, medical imaging with text reports, multimodal chatbots, and visual question answering. At 235B parameters (MoE), this model requires significantly more VRAM than a dense 34B model. Running it at FP8 requires approximately 235 GB for weights alone, which already exceeds the H200's 141 GB capacity. Multi-GPU H200 or GB200 NVL72-class configurations are the practical options.

DLRMv3 Recommender System (Updated)

DLRMv3 tests the deep learning recommendation model used by large-scale ad systems and content platforms. The metric is queries per second. This benchmark was updated from DLRMv2 to use larger embedding tables (reflecting real-world feature size growth) and higher query rates.

Most GPU cloud users will not run DLRM directly. This benchmark is relevant for ad tech, e-commerce, and content recommendation teams. The embedding-heavy workload is less memory-bandwidth-bound than LLMs and more sensitive to memory capacity and interconnect latency.

YOLOv11 Object Detection (New)

YOLOv11 tests real-time object detection inference, measured as per-stream latency in the SingleStream scenario and queries per second in the Offline scenario. YOLOv11 was added to replace the older detection benchmarks and to provide coverage of the convolutional+transformer hybrid architectures common in modern detectors.

This task is compute-bound rather than memory-bandwidth-bound at small batch sizes, which changes which GPU wins. Smaller GPUs with high FP8 TFLOPS can compete effectively here against the larger HBM-heavy accelerators.

NVIDIA B200 (Blackwell) vs AMD MI355X: Head-to-Head Results

The table below shows per-GPU performance derived by dividing system throughput by the number of accelerators in each submission. Closed division, Offline scenario unless noted.

| GPU | GPT-OSS 120B (tok/s per GPU) | Llama 2 70B (tok/s per GPU) | SDXL (samples/s per GPU) | YOLOv11 (QPS per GPU) |
| --- | --- | --- | --- | --- |
| NVIDIA B200 (GB200 NVL72) | ~7,800 | ~17,500 | ~14.2 | ~28,400 |
| NVIDIA H200 SXM | ~3,100 | ~7,800 | ~6.3 | ~15,200 |
| AMD MI355X (closed div.) | ~2,600 | ~6,200 | ~4.8* | ~11,800* |

Results from MLCommons Inference v6.0, April 2026. Closed division unless noted. Per-GPU numbers derived by dividing system throughput by accelerator count. AMD MI355X submitted in closed division for LLaMA 2 70B and GPT-OSS 120B; SDXL and YOLOv11 numbers are open division with ROCm optimizations. Intel Gaudi 3 is not listed here: Intel's official v6.0 submissions covered Xeon 6 CPU and Arc Pro GPU workloads, not Gaudi 3 accelerators, so no directly comparable per-GPU numbers are available.

Where B200 leads: Blackwell dominates every task it submitted for. On GPT-OSS 120B, the B200 delivers roughly 2.5x the per-GPU throughput of the H200. The key driver is the combination of FP4 precision support and 8.0 TB/s HBM3e memory bandwidth versus the H200's 4.8 TB/s. For memory-bandwidth-bound workloads like large LLMs, doubling bandwidth nearly doubles throughput as long as the model fits in VRAM. The NVL72 system interconnect (NVLink 5 at 1.8 TB/s bidirectional per GPU) also lets the system run the 120B model across GPUs without the bandwidth penalty you see with InfiniBand-connected nodes.

Where MI355X closes the gap: AMD's closed division submission puts MI355X within 20% of the H200 on LLaMA 2 70B and GPT-OSS 120B, and the closed division numbers are directly comparable to NVIDIA's since both run under the same model and precision constraints. The gap narrows further on workloads that fit within the MI355X's 288 GB HBM3e capacity (larger than B200's 192 GB), where AMD can run larger batch sizes. For deeper context on AMD vs NVIDIA tradeoffs in cloud deployments, see our ROCm vs CUDA for GPU Cloud guide.

Intel's position in v6.0: Intel's official MLPerf v6.0 submissions focused on Xeon 6 CPUs and Arc Pro B-series GPUs rather than Gaudi 3 accelerators. No Gaudi 3 results are available from the official MLCommons v6.0 dataset, so Gaudi 3 is excluded from this comparison table. If Gaudi 3 results appear in a subsequent v6.0 round, this table will be updated.

WAN-2.2 note: Both NVIDIA and AMD (MI355X) submitted for the text-to-video benchmark in v6.0 round 1. Intel benchmarks for this task are not yet available. The table will expand further once Intel submits in a subsequent round.

Tokens Per Second Per Dollar: Translating MLPerf Scores to Real Cloud Costs

Raw throughput only matters after you divide by price. The table below maps MLPerf GPT-OSS 120B per-GPU throughput to live Spheron pricing as of 11 Apr 2026.

| GPU | MLPerf GPT-OSS 120B tok/s (per GPU) | Spheron On-Demand ($/hr) | Spheron Spot ($/hr) | Cost per 1M Tokens (on-demand) | Cost per 1M Tokens (spot) |
| --- | --- | --- | --- | --- | --- |
| B200 SXM6 | ~7,800 | N/A | $2.06 | N/A | ~$0.073 |
| H200 SXM5 | ~3,100 | $4.50 | $1.19 | ~$0.403 | ~$0.107 |
| H200 (GH200 PCIe)† | ~3,100 | $1.97 | N/A | ~$0.177 | N/A |
| H100 PCIe | ~1,300‡ | $2.11 | N/A | ~$0.451 | N/A |
| MI355X | ~2,600 | Not listed | Not listed | See pricing → | See pricing → |

†GH200 PCIe is the Grace Hopper configuration available on Spheron. Note that GH200 variants differ: the H100-based GH200 has 96 GB HBM3, while the H200-based GH200 has 141 GB HBM3e. The ~3,100 tok/s estimate assumes the 141 GB HBM3e variant. If Spheron's GH200 PCIe carries 96 GB HBM3, throughput at 120B-scale workloads will be lower. Check the GPU pricing page for the specific configuration available.

‡H100 PCIe throughput on GPT-OSS 120B is estimated at approximately 42% of H200 SXM, reflecting the bandwidth ratio (~2.0 TB/s vs 4.8 TB/s).

Cost per 1M tokens formula: ($/hr) / (tokens/sec per GPU) * (1,000,000 / 3,600)
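The formula above can be sketched in a few lines of Python. The throughput and price figures are the illustrative values from the table, and the ~1,300 tok/s H100 figure follows the bandwidth-ratio assumption from the footnote:

```python
# Cost per 1M generated tokens: ($/hr) / (tok/s per GPU) * (1,000,000 / 3,600).
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    return price_per_hour / tokens_per_sec * (1_000_000 / 3_600)

# Values from the table above (spot pricing, per-GPU MLPerf throughput).
print(round(cost_per_million_tokens(1.19, 3_100), 3))  # H200 SXM5 spot -> 0.107
print(round(cost_per_million_tokens(2.06, 7_800), 3))  # B200 SXM6 spot -> 0.073

# Bandwidth-ratio estimate for H100 PCIe: ~2.0 / 4.8 of H200 throughput.
print(round(3_100 * (2.0 / 4.8)))  # -> 1292, i.e. ~1,300 tok/s
```

The same function works for any GPU/price pair; plug in spot or on-demand rates to compare.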

The B200 SXM6 spot price of $2.06/hr delivers the lowest cost-per-token (~$0.073/M) at its MLPerf throughput. The H200 SXM5 spot at $1.19/hr is also very competitive at ~$0.107/M tokens. The GH200 PCIe on-demand at $1.97/hr offers the best on-demand value for 120B-scale LLM inference among on-demand-only options.

Pricing fluctuates based on GPU availability. The prices above are based on 11 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For H200 vs B200 specification comparisons, see NVIDIA H200 vs B200 vs GB200. For H100 vs H200 tradeoffs in inference specifically, see NVIDIA H100 vs H200.

What the Results Mean for GPU Cloud Users: Which GPU to Rent for Each Workload

LLM Inference (70B-120B Parameters)

For 70B models in FP8, the H100 PCIe on Spheron at $2.11/hr on-demand gives approximately $0.451/M tokens on GPT-OSS 120B scale workloads, based on v6.0 per-GPU throughput estimates (H100 PCIe bandwidth is ~2.0 TB/s, roughly 42% of H200 SXM). The GH200 PCIe at $1.97/hr drops that to approximately $0.177/M with higher raw throughput, and the H200 SXM5 spot at $1.19/hr reaches ~$0.107/M tokens.

If your model fits in 80 GB at FP8 (most 70B models do), the H100 is accessible at current Spheron pricing. If you need 120B+ models or FP16 serving of 70B models, the GH200's H200-class 141 GB memory gives more headroom.

For burst batch workloads, the A100 SXM4 at $1.64/hr on-demand (spot available at $0.45/hr) is viable for smaller models, though the gap vs H100 PCIe at $2.11/hr has narrowed considerably. See rent H100 on Spheron or GPU pricing for current availability.

Text-to-Image Generation (Stable Diffusion XL)

SDXL inference is less memory-bandwidth-bound than LLM serving and more sensitive to FP8 compute density. The B200's 14+ samples/s per GPU is 2x+ the H200 on MLPerf v6.0 SDXL numbers. For image generation at scale, the B200 or H200 are the clear hardware choices. For current Spheron availability, check the GPU pricing page.

For workloads that fit within 48 GB (SDXL itself fits at ~8 GB in FP16), the L40S at $0.72/hr on-demand is a cost-effective option for smaller studios or lower-volume pipelines.

Text-to-Video Generation (WAN-2.2)

Video generation is the most VRAM-intensive workload in MLPerf v6.0. WAN-2.2 requires holding video latents across many diffusion steps, which means 192+ GB systems (B200) have a structural advantage over single-H200 setups. Both NVIDIA and AMD (MI355X) submitted for this task in v6.0 round 1. Intel participation is expected in subsequent rounds.

If you need WAN-2.2 production inference, the GH200 system architecture (which gives access to CPU memory as an overflow layer) may be more practical than a single H200 SXM at 141 GB. Check availability on the GPU pricing page and the AI video generation GPU guide for workload-specific guidance.

Vision-Language Models

VLM workloads combine a vision encoder with a large language backbone. The MLPerf v6.0 VLM benchmark uses Qwen3-VL-235B-A22B-Instruct, a 235B MoE model with 22B active parameters. At FP8, the weights alone need roughly 235 GB, which exceeds both the H100's 80 GB and the H200's 141 GB; neither GPU can hold the model on its own. At FP16, the weights alone need approximately 470 GB. Multi-GPU H200 SXM5 or B200 configurations, or GB200 NVL72-class nodes with pooled HBM, are the practical options for this workload.
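A quick back-of-envelope check of those VRAM figures: weights-only memory is parameter count times bytes per parameter. This ignores KV cache, activations, and runtime overhead, which all add to the total. A minimal sketch:

```python
# Weights-only VRAM estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead, which add to the total.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_vram_gb(params_billion: float, dtype: str) -> float:
    # 1B params at 1 byte/param = 1 GB, so billions x bytes gives GB directly.
    return params_billion * BYTES_PER_PARAM[dtype]

print(weight_vram_gb(235, "fp8"))   # 235.0 GB -> exceeds a single H200's 141 GB
print(weight_vram_gb(235, "fp16"))  # 470.0 GB -> multi-GPU territory
```

The same arithmetic explains the GPT-OSS 120B capacity note earlier: 117B total parameters at FP8 need ~117 GB even though only 5.1B are active per token.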

For lighter VLM use cases using smaller models such as Qwen2-VL-7B or similar sub-20B models, a single H100 PCIe (80 GB) or A100 80GB handles inference comfortably at FP16. See rent H100 for current H100 availability. For broader GPU selection guidance, see the Best GPU for AI inference 2026 guide.

Real-Time Object Detection (YOLOv11)

Object detection is the one workload where smaller GPUs compete well. YOLOv11 at FP8 is primarily compute-bound at low-to-medium concurrency, meaning FP8 TFLOPS matter more than memory bandwidth. The L40S PCIe (1,466 FP8 TFLOPS with sparsity; 733 TFLOPS dense) at $0.72/hr on-demand handles high-throughput detection pipelines at a fraction of H200 pricing.

For very high QPS detection at the edge or in cloud inference APIs, the H100 PCIe on Spheron is the practical ceiling for single-GPU deployments. MLPerf v6.0 shows H200 is ~2x the H100 on YOLOv11, so the throughput gain is real but the cost premium may not justify it for detection-only workloads.

How to Run MLPerf Benchmarks on Your Own GPU Cloud Infrastructure

You do not need to wait for MLCommons to publish results. You can run the benchmark harness on any Spheron GPU instance. Here is how.

Step 1: Clone the harness and results repo

```bash
git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v6.0
```

The inference/ repo contains the harness code. Separately, the published results are in:

```bash
git clone https://github.com/mlcommons/inference_results_v6.0.git
```

The results/ directory in the latter is organized by submitter and system. CSV files contain per-scenario results that you can filter by benchmark, scenario, and system columns.
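If you want to slice those CSVs programmatically, a pandas filter works. Note the column names below ("Benchmark", "Scenario", "System", "Result") are assumptions for illustration; check the actual headers in the inference_results_v6.0 repo before relying on them:

```python
import pandas as pd

def filter_results(df: pd.DataFrame, benchmark: str, scenario: str) -> pd.DataFrame:
    """Rows for one benchmark/scenario, highest throughput first."""
    mask = (df["Benchmark"] == benchmark) & (df["Scenario"] == scenario)
    return df[mask].sort_values("Result", ascending=False)

# In practice: df = pd.read_csv(<path to a summary CSV in the results repo>)
# Toy frame with made-up system names and numbers, for illustration only:
df = pd.DataFrame({
    "Benchmark": ["gpt-oss-120b", "gpt-oss-120b", "llama2-70b"],
    "Scenario":  ["Offline", "Offline", "Offline"],
    "System":    ["system-a", "system-b", "system-b"],
    "Result":    [561_600.0, 24_800.0, 62_400.0],
})
print(filter_results(df, "gpt-oss-120b", "Offline")[["System", "Result"]])
```

Dividing each row's system-level Result by that system's accelerator count gives the per-GPU numbers used in the tables above.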

Step 2: Configure for your Spheron instance

After provisioning a Spheron GPU instance (see Spheron documentation), create a user.conf in the harness directory:

```
# user.conf for single-GPU Spheron H100 instance
system_name = spheron-h100-pcie
gpu_count = 1
```

Step 3: Run the BERT Offline benchmark as a sanity check

BERT is the fastest task to run and verifies the harness is working before you commit to a multi-hour GPT-J or GPT-OSS run:

```bash
cd inference/language/bert
# Install dependencies
pip install -r requirements.txt
# Download the SQuAD dataset and BERT model
./download_model.sh
./download_data.sh
# Run Offline scenario
python run.py \
  --scenario Offline \
  --model bert-99 \
  --backend pytorch \
  --max-batchsize 64 \
  --count 10833
```

Step 4: Run the GPT-J Offline benchmark

For LLM throughput, the GPT-J 6B task is the fastest to set up:

```bash
cd inference/language/gpt-j
./download_model.sh  # ~25 GB download
./download_data.sh
python run.py \
  --scenario Offline \
  --model gpt-j-99 \
  --backend vllm \
  --max-batchsize 1
```

Step 5: Normalize results to per-GPU throughput

When you read the output JSON, divide the result value (tokens/s for text tasks, QPS for classification) by your GPU count. For single-GPU runs the divisor is simply 1; for multi-GPU Spheron instances, divide by the count you specified in user.conf. This gives you the per-GPU number you can compare against the MLCommons published results tables.
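A minimal sketch of that normalization step. The "result_value" field name is an assumption for illustration; the harness's actual summary format may differ, so check the files in your output directory:

```python
import json

def per_gpu_throughput(result_path: str, gpu_count: int) -> float:
    """Divide a system-level result (tok/s or QPS) by accelerator count."""
    with open(result_path) as f:
        system_result = json.load(f)["result_value"]  # assumed field name
    return system_result / gpu_count

# Usage: per_gpu_throughput("results/summary.json", gpu_count=2)
```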

For a comparison of how vLLM, TensorRT-LLM, and SGLang perform on the same H100 hardware outside the MLPerf harness, see our vLLM vs TensorRT-LLM vs SGLang benchmarks. For inference economics and a full Blackwell breakdown, the NVIDIA B200 complete guide covers both.

MLPerf v6.0 confirms that the GPUs winning inference benchmarks, the H200 and B200, are available on Spheron at competitive hourly rates, with spot instances for cost-sensitive batch workloads.

Rent H200 → | Rent B200 → | View all GPU pricing →

Start running inference on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.