GPU Cloud Benchmarks 2026: AI GPU Throughput, Specs, Pricing

Q: Which GPU cloud provider is cheapest for H100s?

As of April 2026, Spheron offers H100 PCIe from $2.01/hr and H100 SXM5 from $2.50/hr on-demand ($1.03/hr on spot) with no commitment. Marketplace providers like Vast.ai still list H100s from $1.49-1.87/hr with variable availability. AWS p5 is around $6.88/hr per GPU after its mid-2025 cut, Google Cloud A3 around $3.00/hr, Azure ND H100 v5 around $12.29/hr per GPU. Spot pricing on Spheron brings H100 even lower. The spread between cheapest and most expensive is still over 3x for the same GPU.

Q: Is the H200 worth the extra cost over H100?

For 70B+ model inference, yes. The H200 delivers roughly 42% more inference throughput than the H100 thanks to 4.8 TB/s memory bandwidth (vs 3.35 TB/s) and 141 GB HBM3e (vs 80 GB HBM3). A single H200 can serve Llama 3.3 70B at FP8 that would require two H100s, making the H200 cheaper overall despite the higher hourly rate. Spheron lists H200 SXM5 from $4.54/hr as of April 2026.

Q: What's the cheapest GPU for LLM inference in April 2026?

For 7B models, the NVIDIA L4 at $0.30-0.80/hr offers the lowest cost per token. For 7B-30B models, the L40S at $0.80-2.00/hr provides 48 GB VRAM and strong FP8 throughput. For larger models on a budget, A100 80G SXM4 is now $1.07/hr on-demand and $0.60/hr on spot. For best throughput-per-dollar on fault-tolerant workloads, B200 SXM6 spot at $2.12/hr delivers 2.4x the memory bandwidth of H100 PCIe at only $0.11/hr premium, with the ability to run 100B+ models natively.

Q: How much does GPU cloud cost per month for LLM serving?

A single H100 PCIe running 24/7 on Spheron is roughly $1,450/month at on-demand rates, dropping to sub-$1,000 on spot. H100 SXM5 is around $1,800/month on-demand or ~$740/month on spot. A100 80G SXM4 is around $770/month on-demand or ~$432/month on spot. L40S at $0.80/hr is around $576/month. Per-minute billing on Spheron means you only pay for what you actually use, which is often 30-50% cheaper than monthly commitments on hyperscalers for intermittent workloads.

Q: How does B200 pricing compare to H100 in April 2026?

B200 availability improved dramatically in Q1 2026, but on-demand pricing is higher than H100 PCIe. On Spheron, B200 SXM6 is $6.02/hr on-demand and $2.12/hr on spot. B200 on-demand is above H100 PCIe ($2.01/hr), but B200 spot at $2.12/hr is only $0.11/hr more than H100 PCIe on-demand while delivering 2.4x the memory bandwidth, 2.4x the VRAM (192 GB vs 80 GB), and native FP4. For interruptible workloads, B200 spot is the default pick.

Q: What are the cheapest spot prices for H100, B200, and B300 right now?

On Spheron as of April 2026: B200 SXM6 is $2.12/hr spot and $6.02/hr on-demand, A100 80G SXM4 spot is $0.60/hr, and B300 SXM6 spot is $2.45/hr compared to $6.80/hr on-demand — the biggest spot discount in the current market at roughly 64% off. H100 SXM5 spot is $1.03/hr. Spot is interruptible but ideal for training runs with checkpointing.

_Updated April 2026 with live Spheron pricing, B200 and B300 Blackwell data, and refreshed throughput benchmarks from MLPerf Inference v6.0._

Choosing a GPU for your AI workload shouldn't require a spreadsheet and three hours of tab-hopping across provider websites. But that's exactly what it takes today, because no provider publishes honest side-by-side comparisons. Each one highlights the metric where they win and buries the rest. This guide helps teams navigate the landscape; for specific selection guidance, see our top 10 cloud GPU providers analysis.

This post fixes that. It's a working set of AI GPU benchmarks plus current pricing across the providers that matter most in 2026. Real specs, published inference throughput data, no marketing spin, just the numbers you need to make an informed decision. For a workload-based decision guide that maps these GPUs to specific inference scenarios, see best GPU for AI inference in 2026.

For a provider-by-provider pricing breakdown across all GPU models, see our GPU cloud pricing comparison for 2026.

The Hardware Landscape: What You're Actually Choosing Between

Before comparing providers, you need to understand what separates the GPUs themselves. The differences aren't just about raw compute power; memory capacity, memory bandwidth, and interconnect speed all determine whether a GPU is right for your workload.

GPU Specifications at a Glance

Spec	A100 80GB	H100 SXM	H200 SXM	B200	L40S	RTX 4090
Architecture	Ampere	Hopper	Hopper	Blackwell	Ada Lovelace	Ada Lovelace
VRAM	80 GB HBM2e	80 GB HBM3	141 GB HBM3e	192 GB HBM3e	48 GB GDDR6	24 GB GDDR6X
Memory Bandwidth	2.0 TB/s	3.35 TB/s	4.8 TB/s	8.0 TB/s	864 GB/s	1.0 TB/s
FP16 Tensor TFLOPS (dense)	312	~989	~989	~2,250	~362	~165
FP8 TFLOPS (dense)	N/A¹	~1,979	~1,979	4,500	~733	~660
Interconnect	NVLink 3	NVLink 4	NVLink 4	NVLink 5	PCIe 4.0	PCIe 4.0

FP16 and FP8 Tensor Core values are dense (non-sparse). With NVIDIA 2:4 structured sparsity: FP16: A100 ~624, H100/H200 ~1,979, L40S ~733, RTX 4090 ~330 TFLOPS; FP8: H100/H200 ~3,958, L40S ~1,466, B200 ~9,000, RTX 4090 ~1,320 TFLOPS.

¹ A100 (Ampere architecture) has no native FP8 hardware support. FP8 Tensor Cores were introduced with Hopper (H100/H200) and Ada Lovelace (L40S, RTX 4090) architectures.

A few things jump out from this table.

Memory bandwidth is the inference bottleneck. For LLM inference, the speed at which you can read model weights from memory matters more than raw FLOPS. The H200's 4.8 TB/s bandwidth is why it outperforms the H100 on inference despite having identical compute; it can feed the tensor cores faster. The B200 doubles that again to 8 TB/s. Teams making this choice should read our GPU cost optimization playbook for detailed TCO analysis.

VRAM determines what you can run. The H200's 141 GB lets you run 70B parameter models in FP16 on a single GPU. The A100 at 80 GB requires quantization or multi-GPU setups for the same model. The B200's 192 GB opens the door to running 100B+ models without sharding.

The L40S is the sleeper pick for inference. Its 1,466 FP8 TFLOPS (Ada Lovelace FP8 Tensor Cores) makes it surprisingly competitive for inference workloads that can use FP8 precision, at a fraction of the cost of an H100.

Pricing Across Providers: The Real Numbers

GPU pricing varies enormously depending on which provider you use, whether you commit to a reservation, and whether you're willing to use spot instances. Here's what the market looks like right now.

H100 Pricing (per GPU, per hour)

Provider	Tier	On-Demand	Notes
Spheron	H100 PCIe	$2.01	Lowest H100 on-demand on the platform
Spheron	H100 NVL	$2.06	NVLink-connected PCIe
Spheron	H100 PCIe bare metal	$2.63	Full node access
Spheron	H100 SXM5	$2.50 ($1.03 spot)	8-way HGX, 59% off on spot
Vast.ai (marketplace)	Variable	$1.49-$1.87	Availability depends on hosts
DataCrunch	SXM	$1.99
GMI Cloud	SXM	$2.10
Lambda Labs	SXM	$2.99
Thunder Compute	SXM	$2.85-$3.50
Google Cloud (A3)	SXM	~$3.00-$9.80	Region-dependent; verify for your zone
AWS (p5)	SXM	~$6.88	p5.48xlarge per-GPU, post mid-2025 cut
CoreWeave	PCIe to HGX	$4.76-$6.15	GPU-only to full HGX bundled with CPU/RAM
Azure (ND H100 v5)	SXM	~$12.29	Per-GPU on ND96isr H100 v5 ($98.32/hr, 8 GPUs)

Pricing fluctuates based on GPU availability. The prices above are based on 15 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The spread is significant: H100 PCIe on Spheron at $2.01/hr is roughly half the price of AWS and a third of Azure. That's the same GPU, the same CUDA cores, the same 80 GB of HBM3; just a different billing address.

Two years ago, H100s commanded $8/hr on-demand. The market commoditized rapidly through 2025 and into 2026 as supply caught up. On-demand pricing has now fully bifurcated: specialist clouds and marketplaces sit at $2-3/hr, hyperscalers at $3-7/hr for the same silicon.

H200 SXM Pricing

Provider	On-Demand	Notes
Spheron	$4.54	H200 SXM5, per GPU, per-minute billing
GMI Cloud	$2.50
Jarvislabs	$3.80
Lambda Labs / RunPod	$3.79-$3.99	Specialist clouds
AWS (p5e)	~$4.98	After Jan 2026 15% increase
CoreWeave	~$6.31
Azure (ND H200 v5) / GCP (a3-ultragpu)	~$10.60-$10.87	Hyperscaler on-demand

Pricing fluctuates based on GPU availability. The prices above are based on 15 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The H200 carries a 10-20% premium over the H100 SXM at most specialist clouds, justified by its 42% inference throughput improvement. For latency-sensitive inference, cost-per-token math often favors H200 despite the higher hourly rate.

One catch: hyperscalers still sell H200s only in 8-GPU bundles. If you need a single H200, specialist clouds like Spheron are your only real option for per-GPU billing.

A100 80GB Pricing

Provider	Tier	On-Demand	Spot	Notes
Spheron	A100 80G SXM4	$1.07	$0.60	A100 80G pricing on Spheron
Spheron	A100 80G PCIe	$1.07	—	PCIe form factor, same rate
Thunder Compute	SXM	$0.78	—
Lambda Labs	SXM	~$2.49	—
AWS (p4de)	SXM	~$3.43	—
Azure (ND A100 v4)	SXM	~$4.10	—
Google Cloud (a2-ultragpu)	SXM	~$5.00	—

Pricing fluctuates based on GPU availability. The prices above are based on 15 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The A100 is in its sunset phase for premium workloads, but still relevant for mid-tier cost-sensitive training and inference. A100 80G SXM4 on Spheron at $1.07/hr on-demand (or $0.60/hr spot) is now a solid choice for any workload that fits in 80 GB. Fine-tuning models under 30B parameters, batch inference, and lighter training jobs don't need Hopper or Blackwell class hardware. See our A100 vs V100 comparison for deeper context on when A100 is still the right call.

L40S Pricing

Provider	On-Demand	Notes
Marketplace low	$0.40	Variable availability
RunPod	$0.86
AWS (3-year reserved)	~$0.80	Requires commitment
Modal (serverless)	$1.95	Pay-per-second

The L40S sits in a compelling price-performance pocket for inference. At $0.40-$0.86/hr with 48 GB VRAM and strong FP8 throughput from its Ada Lovelace Tensor Cores, it handles most production inference workloads at a fraction of H100 pricing.

For full L40S benchmark data and workload guidance, see our L40S for AI inference guide.

B200 and B300 Pricing

The most significant pricing shift in Q1-Q2 2026 is Blackwell. B200 was in short supply through late 2025 with providers quoting $6-8/hr when they had inventory. By April 2026, supply caught up and rates collapsed.

Provider	GPU	On-Demand	Spot	Notes
Spheron	B200 SXM6	$6.02	$2.12	On-demand premium; spot near H100 range
Spheron	B300 SXM6	$6.80	$2.45	64% spot discount, biggest in the market

Pricing fluctuates based on GPU availability. The prices above are based on 15 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

B200 SXM6 on-demand at $6.02/hr is premium, but B200 spot at $2.12/hr delivers 2.4x the memory bandwidth and 2.4x the VRAM (192 GB vs 80 GB) with native FP4 support at only $0.11/hr above H100 PCIe on-demand. For fault-tolerant inference workloads, B200 spot is now the default pick. See our H200 vs B200 vs GB200 guide for detailed benchmarks.

B300 (Blackwell Ultra) sits at $6.80/hr on-demand but drops to $2.45/hr on spot — the biggest spot discount in the current GPU cloud market. If your training run has checkpointing, B300 spot gives you 288 GB HBM3e and 15 PFLOPS dense FP4 at a price most providers can't match on any Hopper hardware. Full breakdown in our B300 Blackwell Ultra guide.

What Changed Between Q1 and Q2 2026

If you read the earlier version of this post in March, a few things have shifted enough to call out explicitly:

B200 on-demand is premium, but spot changed the game. B200 SXM6 on-demand at $6.02/hr is above H100 PCIe, but spot at $2.12/hr is only $0.11/hr above H100 on-demand while delivering 2.4x bandwidth and VRAM. For checkpoint-capable workloads, B200 spot is now the default pick.
B300 spot remains the budget flagship for interruptible training. $2.45/hr for 288 GB HBM3e is unprecedented. If your training stack handles preemption cleanly, you're paying less than H100 on-demand for 3.6x the memory.
AWS p5e bumped H200 up 15% in January 2026 while specialist clouds held flat. The hyperscaler gap widened, not narrowed.
A100 80G SXM4 pricing increased to $1.07/hr on-demand, $0.60/hr spot. Still affordable for sub-30B fine-tuning and mid-tier training.
L40S pricing stayed flat in the $0.80-$0.86/hr range at specialist clouds. Ada-gen FP8 throughput has kept it competitive for 7B-30B inference even as Blackwell ramped.

AI GPU Benchmarks: Inference Throughput Where It Actually Matters

Raw specs and pricing only tell half the story. What matters is how many tokens per second you get for your dollar. Here's published benchmark data from MLPerf inference benchmarks on Llama 70B.

Llama 70B Inference Throughput

GPU	Config	Tokens/sec	Relative to A100	Typical $/hr (node)	Cost per Million Tokens
A100 80GB	1× GPU (INT8)	~130	1.0x	$0.78	~$1.67
H100 SXM	8× GPU (FP8)	~22,290	172x (8-GPU node)	$22.80	~$0.28
H200 SXM	8× GPU (FP8)	~31,700	244x (8-GPU node)	$29.76	~$0.26

Methodology note: The A100 figure is a single-GPU result running Llama 2 70B with INT8 quantization (the model barely fits in 80 GB quantized). The H100/H200 figures are from MLPerf Inference v4.0 8-GPU server submissions using TensorRT-LLM with FP8. These are not equivalent configurations; comparing 1× A100 to 8× H100 inflates the ratio. On a per-GPU basis with the same precision, H100 SXM outperforms A100 by roughly 5-8× for Llama 2 70B throughput. The multi-GPU aggregate numbers are shown here because they reflect real deployed configurations. The $/hr column shows the full 8-GPU node rate (8 × per-GPU price); cost-per-million-tokens divides that node rate by the aggregate throughput in tokens per hour.

For a breakdown of the latest MLPerf Inference v6.0 scores and what they mean for cloud GPU selection, see our MLPerf v6.0 results guide.

For smaller models (7B-13B parameters), the gap narrows. An A100 runs Llama 3.3 8B at roughly 400-500 tok/s, while an H100 pushes 10,000+ tok/s with TensorRT-LLM. The cost-per-token advantage of newer hardware remains, but the absolute throughput is less dramatic.

Practical Inference Scenarios

Chatbot / real-time API (latency-sensitive): H200 is the best choice. The ~42% throughput improvement over H100 directly translates to lower latency at high concurrency. The 141 GB VRAM also lets you run larger models without quantization, preserving output quality. Understand the tradeoffs in our detailed nvidia-h100-vs-h200 comparison.

Batch inference / offline processing: H100 at marketplace pricing ($1.49-$2.10/hr) offers the best cost-per-token. Latency doesn't matter for batch jobs, so you're purely optimizing for throughput per dollar.

Budget inference / low-traffic endpoints: L40S or A100. If your model fits in 48 GB (L40S) or 80 GB (A100), you can serve inference at $0.40-$0.86/hr. For startups running a single model endpoint with moderate traffic, this is 3-5x cheaper than an H100 and more than sufficient.

Maximum throughput (large-scale serving): B200 when pricing normalizes. The 8 TB/s bandwidth and native FP4 support promise 11-15x throughput improvements over Hopper-generation GPUs. If you're serving millions of requests per day, the B200's higher hourly cost is offset by dramatically fewer GPUs needed.

Training Performance: How Much Faster Is Your Model Ready?

Training benchmarks are harder to standardize because they depend heavily on model architecture, batch size, precision, and optimization framework. But the relative performance between GPU generations is consistent.

Training Speedup by Generation

GPU	vs A100 (Mixed Precision)	Sustained TFLOPS	Best For
A100 80GB	1.0x (baseline)	~400	Fine-tuning <13B, budget training
H100 SXM	2.4x	~1,000	General training, 7B-70B models
H200 SXM	2.5x	~1,050	Large models needing 141 GB VRAM
B200	~5x (estimated)	~2,000+	Large-scale training, 100B+ models

The H100's 2.4x training speedup over A100 translates directly to cost savings: a training run that takes 10 days on A100s finishes in roughly 4 days on H100s. Even though H100s cost 3x more per hour, the total training cost is lower because you're paying for fewer hours.

For distributed training across multiple nodes, interconnect matters enormously. NVLink bandwidth determines how fast gradients synchronize between GPUs. The B200's NVLink 5 at 1.8 TB/s per GPU is 2x the H100's NVLink 4; this means less time waiting for communication and more time computing.

The Decision Framework

With all this data, here's the practical framework for choosing a GPU and provider.

Step 1: What's your workload?

Inference (real-time) → Optimize for latency. H200 or H100.
Inference (batch) → Optimize for cost-per-token. H100 marketplace or L40S.
Training (fine-tuning) → Optimize for VRAM + cost. A100 or H100 depending on model size.
Training (pre-training) → Optimize for sustained throughput. H100/H200 clusters with InfiniBand.

Step 2: What's your budget tolerance?

Minimum cost, flexible on availability → GPU marketplaces and spot instances.
Predictable pricing, guaranteed availability → Reserved instances from mid-tier providers.
Maximum reliability, budget secondary → Hyperscalers (AWS, GCP, Azure).

Step 3: Compare cost-per-output, not cost-per-hour.

A GPU that costs 2x more per hour but delivers 3x the throughput is 33% cheaper on a per-output basis. Always calculate cost per million tokens (inference) or cost per training epoch (training), not just the hourly rate. For a detailed token factory benchmarks table with live Spheron pricing across A100, H100, H200, and B200, see the token factory benchmarks guide.

For teams evaluating GPU cloud options in April 2026, the market has shifted hard in your favor. H100 pricing has dropped 75% from its 2023 peak, A100s are at $1.07/hr on-demand or $0.60/hr spot on Spheron, B200 spot at $2.12/hr offers 2.4x bandwidth, and B300 spot is cheaper than H100 on-demand. Learn which GPU suits your specific model by consulting our best NVIDIA GPUs for LLMs guide and checking current pricing.

The biggest remaining challenge isn't finding cheap GPUs; it's comparing across providers without spending a full day on it. Spheron aggregates pricing across 5+ providers into a single interface, making it practical to consistently deploy on the cheapest available option for your workload. Whether you need spot H100s for a training run or reserved L40S instances for a production endpoint, you're choosing from the full market instead of one provider's pricing.

We'll update this benchmark data quarterly as new hardware ships and pricing evolves. Bookmark this page if you want the latest numbers without the research.

Spheron aggregates GPU pricing from vetted data center partners into a single interface, so you can deploy on the cheapest available option for your workload. As of April 2026: A100 80G SXM4 from $1.07/hr on-demand ($0.60 spot), H100 PCIe from $2.01/hr, B200 spot from $2.12/hr, B300 spot from $2.45/hr. No contracts, per-minute billing.
Spheron H100 → | B200 GPU pricing → | View all pricing →

FAQ / 06

Frequently Asked Questions

As of April 2026, Spheron offers H100 PCIe from $2.01/hr and H100 SXM5 from $2.50/hr on-demand ($1.03/hr on spot) with no commitment. Marketplace providers like Vast.ai still list H100s from $1.49-1.87/hr with variable availability. AWS p5 is around $6.88/hr per GPU after its mid-2025 cut, Google Cloud A3 around $3.00/hr, Azure ND H100 v5 around $12.29/hr per GPU. Spot pricing on Spheron brings H100 even lower. The spread between cheapest and most expensive is still over 3x for the same GPU.

For 70B+ model inference, yes. The H200 delivers roughly 42% more inference throughput than the H100 thanks to 4.8 TB/s memory bandwidth (vs 3.35 TB/s) and 141 GB HBM3e (vs 80 GB HBM3). A single H200 can serve Llama 3.3 70B at FP8 that would require two H100s, making the H200 cheaper overall despite the higher hourly rate. Spheron lists H200 SXM5 from $4.54/hr as of April 2026.

For 7B models, the NVIDIA L4 at $0.30-0.80/hr offers the lowest cost per token. For 7B-30B models, the L40S at $0.80-2.00/hr provides 48 GB VRAM and strong FP8 throughput. For larger models on a budget, A100 80G SXM4 is now $1.07/hr on-demand and $0.60/hr on spot. For best throughput-per-dollar on fault-tolerant workloads, B200 SXM6 spot at $2.12/hr delivers 2.4x the memory bandwidth of H100 PCIe at only $0.11/hr premium, with the ability to run 100B+ models natively.

A single H100 PCIe running 24/7 on Spheron is roughly $1,450/month at on-demand rates, dropping to sub-$1,000 on spot. H100 SXM5 is around $1,800/month on-demand or ~$740/month on spot. A100 80G SXM4 is around $770/month on-demand or ~$432/month on spot. L40S at $0.80/hr is around $576/month. Per-minute billing on Spheron means you only pay for what you actually use, which is often 30-50% cheaper than monthly commitments on hyperscalers for intermittent workloads.

B200 availability improved dramatically in Q1 2026, but on-demand pricing is higher than H100 PCIe. On Spheron, B200 SXM6 is $6.02/hr on-demand and $2.12/hr on spot. B200 on-demand is above H100 PCIe ($2.01/hr), but B200 spot at $2.12/hr is only $0.11/hr more than H100 PCIe on-demand while delivering 2.4x the memory bandwidth, 2.4x the VRAM (192 GB vs 80 GB), and native FP4. For interruptible workloads, B200 spot is the default pick.

On Spheron as of April 2026: B200 SXM6 is $2.12/hr spot and $6.02/hr on-demand, A100 80G SXM4 spot is $0.60/hr, and B300 SXM6 spot is $2.45/hr compared to $6.80/hr on-demand — the biggest spot discount in the current market at roughly 64% off. H100 SXM5 spot is $1.03/hr. Spot is interruptible but ideal for training runs with checkpointing.

The Hardware Landscape: What You're Actually Choosing Between

GPU Specifications at a Glance

Pricing Across Providers: The Real Numbers

H100 Pricing (per GPU, per hour)

H200 SXM Pricing

A100 80GB Pricing

L40S Pricing

B200 and B300 Pricing

What Changed Between Q1 and Q2 2026

AI GPU Benchmarks: Inference Throughput Where It Actually Matters

Llama 70B Inference Throughput

Practical Inference Scenarios

Training Performance: How Much Faster Is Your Model Ready?

Training Speedup by Generation

The Decision Framework

What We'd Recommend

Frequently Asked Questions

01Which GPU cloud provider is cheapest for H100s?

02Is the H200 worth the extra cost over H100?

03What's the cheapest GPU for LLM inference in April 2026?

04How much does GPU cloud cost per month for LLM serving?

05How does B200 pricing compare to H100 in April 2026?

06What are the cheapest spot prices for H100, B200, and B300 right now?

Build what's next.