Choosing a GPU for your AI workload shouldn't require a spreadsheet and three hours of tab-hopping across provider websites. But that's exactly what it takes today, because no provider publishes honest side-by-side comparisons. Each one highlights the metric where they win and buries the rest.
This post fixes that. We compiled real specs, current pricing, and published inference throughput data across the GPUs and providers that matter most in 2026. No marketing spin — just the numbers you need to make an informed decision.
The Hardware Landscape: What You're Actually Choosing Between
Before comparing providers, you need to understand what separates the GPUs themselves. The differences aren't just about raw compute power — memory capacity, memory bandwidth, and interconnect speed all determine whether a GPU is right for your workload.
GPU Specifications at a Glance
| Spec | A100 80GB | H100 SXM | H200 SXM | B200 | L40S | RTX 4090 |
|---|---|---|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper | Blackwell | Ada Lovelace | Ada Lovelace |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 48 GB GDDR6 | 24 GB GDDR6X |
| Memory Bandwidth | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s | 864 GB/s | 1.0 TB/s |
| FP16 TFLOPS (dense tensor) | 312 | ~989 | ~989 | 2,250 | ~181 | ~165 |
| FP8 TFLOPS (dense tensor) | N/A | ~1,979 | ~1,979 | 4,500 | ~733 | ~330 |
| Interconnect | NVLink 3 (600 GB/s) | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) | NVLink 5 (1.8 TB/s) | PCIe 4.0 | PCIe 4.0 |

(Tensor throughput is quoted dense; NVIDIA's structured-sparsity figures are roughly double. Ampere predates FP8 support, hence the A100's N/A.)
A few things jump out from this table.
Memory bandwidth is the inference bottleneck. For LLM inference, the speed at which you can read model weights from memory matters more than raw FLOPS. The H200's 4.8 TB/s bandwidth is why it outperforms the H100 on inference despite having identical compute — it can feed the tensor cores faster. The B200 doubles that again to 8 TB/s.
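If you want to see why, a back-of-envelope roofline makes it concrete. The sketch below is ours and purely illustrative: it assumes every weight is read from memory once per generated token, which bounds single-stream decode speed.

```python
# Upper bound on single-stream decode: every weight read once per token.
# Illustrative only -- ignores KV-cache reads and assumes perfect bandwidth use.

def max_tokens_per_sec(params_b: float, bytes_per_param: float, bandwidth_tbs: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param   # weight traffic per generated token
    return (bandwidth_tbs * 1e12) / weight_bytes

# Llama 2 70B in FP16 (2 bytes per parameter) on each GPU's quoted bandwidth
for gpu, bw in [("A100 80GB", 2.0), ("H100 SXM", 3.35), ("H200 SXM", 4.8), ("B200", 8.0)]:
    print(f"{gpu}: ~{max_tokens_per_sec(70, 2, bw):.0f} tok/s ceiling")
```

Batched serving amortizes each weight read across many concurrent requests, which is how real deployments blow far past these single-stream ceilings.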
VRAM determines what you can run. The H200's 141 GB lets you run 70B parameter models in FP16 on a single GPU. The A100 at 80 GB requires quantization or multi-GPU setups for the same model. The B200's 192 GB opens the door to running 100B+ models without sharding.
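To gauge whether a model fits, two terms dominate: weight bytes and KV cache. Here's a minimal sketch (the helper and configurations are ours; the Llama 2 70B shape of 80 layers and a 1,024-wide grouped-query KV projection comes from the published config):

```python
# Floor on serving VRAM: weights plus KV cache. Activations and framework
# overhead add a few GB on top, so treat results as optimistic.

def vram_needed_gb(params_b: float, bytes_per_param: float, n_layers: int,
                   kv_dim: int, context_len: int, batch: int, kv_bytes: int = 2) -> float:
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * n_layers * kv_dim * context_len * batch * kv_bytes  # K and V tensors
    return (weights + kv_cache) / 1e9

# Llama 2 70B: 80 layers, grouped-query attention with kv_dim = 8 heads x 128 = 1024
fp16 = vram_needed_gb(70, 2.0, 80, 1024, context_len=4096, batch=1)
fp8  = vram_needed_gb(70, 1.0, 80, 1024, context_len=4096, batch=8, kv_bytes=1)
print(f"FP16, 1 seq @ 4k: ~{fp16:.0f} GB | FP8 weights+KV, 8 seqs @ 4k: ~{fp8:.0f} GB")
```

On these assumptions, a 70B FP16 checkpoint plus one 4k-token sequence lands almost exactly at the H200's 141 GB, which is why every smaller card needs quantization or sharding for the same model.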
The L40S is the sleeper pick for inference. Its FP8 path through the Transformer Engine (733 dense TFLOPS, or 1,466 with sparsity) makes it surprisingly competitive for inference workloads that can use FP8 precision, at a fraction of the cost of an H100.
Pricing Across Providers: The Real Numbers
GPU pricing varies enormously depending on which provider you use, whether you commit to a reservation, and whether you're willing to use spot instances. Here's what the market looks like right now.
H100 SXM Pricing (per GPU, per hour)
| Provider | On-Demand | Notes |
|---|---|---|
| Vast.ai (marketplace) | $1.49–$1.87 | Variable by availability |
| DataCrunch | $1.99 | |
| GMI Cloud | $2.10 | |
| Lambda Labs | $2.99 | |
| Thunder Compute | $2.85–$3.50 | |
| Google Cloud | ~$3.00 | |
| AWS | ~$3.90 | |
| CoreWeave | $4.25–$6.16 | Range spans PCIe cards to full HGX nodes |
The spread is significant: the cheapest H100 on the marketplace is less than half the price of AWS. That's the same GPU, the same CUDA cores, the same 80 GB of HBM3 — just a different billing address.
Two years ago, H100s commanded $8/hr on-demand. The market has commoditized rapidly as supply caught up with demand. Current pricing has stabilized around $2.85–$3.50/hr at most providers, with marketplace rates dipping below $2/hr.
H200 SXM Pricing
| Provider | On-Demand | Notes |
|---|---|---|
| GMI Cloud | $2.50 | |
| Google Cloud (spot) | $3.72 | Preemptible pricing |
| Jarvislabs | $3.80 | |
| General market range | $3.72–$10.60 | Wide variance |
The H200 carries a 15-20% premium over the H100, justified by a roughly 45% inference throughput advantage. Paying 20% more for 45% more tokens works out to about 17% lower cost per token, so for latency-sensitive inference the math often favors the H200 despite the higher hourly rate.
One catch: hyperscalers currently sell H200s only in 8-GPU bundles. If you need a single H200, you're limited to specialized providers.
A100 80GB Pricing
| Provider | On-Demand | Notes |
|---|---|---|
| Thunder Compute | $0.78 | |
| Market range | $0.66–$1.29 | Low end already under $1/hr |
The A100 is in its sunset phase for premium workloads, and pricing reflects that. At under $1/hr, it's become the budget option — and for many workloads, it's still plenty of GPU. Fine-tuning models under 13B parameters, batch inference, and lighter training jobs don't need H100-class hardware.
L40S Pricing
| Provider | On-Demand | Notes |
|---|---|---|
| Marketplace low | $0.40 | Variable availability |
| RunPod | $0.86 | |
| AWS (3-year reserved) | ~$0.80 | Requires commitment |
| Modal (serverless) | $1.95 | Pay-per-second |
The L40S sits in a compelling price-performance pocket for inference. At $0.40-$0.86/hr with 48 GB VRAM and strong FP8 throughput via the Transformer Engine, it handles most production inference workloads at a fraction of H100 pricing.
B200 Pricing (Early Market)
B200 availability is still limited as of February 2026. Early adopters are paying premium rates, but pricing is expected to settle at roughly 20-30% above H200 levels as supply ramps through the year. If you can wait, prices will drop. If you need the 8 TB/s bandwidth and 192 GB VRAM now, expect to pay for the privilege.
Inference Throughput: Where It Actually Matters
Raw specs and pricing only tell half the story. What matters is how many tokens per second you get for your dollar. Here's published throughput data for one of the most common inference benchmarks: Llama 2 70B.
Llama 2 70B Inference Throughput
| GPU | Tokens/sec | Relative to A100 | Typical $/hr | Cost per Million Tokens |
|---|---|---|---|---|
| A100 80GB | ~130 | 1.0x | $0.78 | ~$1.67 |
| H100 SXM | ~21,800 | 168x | $2.85 | ~$0.04 |
| H200 SXM | ~31,700 | 244x | $3.72 | ~$0.03 |
The numbers above come from optimized TensorRT-LLM deployments, and they are not a like-for-like hardware comparison: the A100 figure reflects a modest-batch setup, while the H100 and H200 figures are aggregate throughput at high batch sizes with FP8. The raw hardware gap between A100 and H100 is closer to 2-4x; the rest of the multiplier comes from Hopper-only capabilities (FP8 tensor cores, higher memory bandwidth, kernels tuned for them) that the A100 cannot use. In production, that combined software-and-hardware gap is what you actually get, which is why the cost-per-token column swings so hard toward Hopper.
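The cost column is arithmetic you can reapply to any provider quote; a minimal check using the table's own figures:

```python
# Cost per million tokens = hourly price / tokens generated per hour.
# Throughput inputs are the table's published aggregate figures.

def cost_per_million(price_per_hr: float, tokens_per_sec: float) -> float:
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

for gpu, price, tps in [("A100 80GB", 0.78, 130),
                        ("H100 SXM", 2.85, 21_800),
                        ("H200 SXM", 3.72, 31_700)]:
    print(f"{gpu}: ${cost_per_million(price, tps):.2f} per million tokens")
```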
For smaller models (7B-13B parameters), the multiplier narrows. An A100 runs Llama 2 13B at roughly 400-500 tok/s, while an H100 pushes 10,000+ tok/s with TensorRT-LLM. The cost-per-token advantage of newer hardware remains, but the gap is far less extreme than for 70B-class models.
Practical Inference Scenarios
Chatbot / real-time API (latency-sensitive): H200 is the best choice. The 45% throughput improvement over H100 directly translates to lower latency at high concurrency. The 141 GB VRAM also lets you run larger models without quantization, preserving output quality.
Batch inference / offline processing: H100 at marketplace pricing ($1.49-$2.10/hr) offers the best cost-per-token. Latency doesn't matter for batch jobs, so you're purely optimizing for throughput per dollar.
Budget inference / low-traffic endpoints: L40S or A100. If your model fits in 48 GB (L40S) or 80 GB (A100), you can serve inference at $0.40-$0.86/hr. For startups running a single model endpoint with moderate traffic, this is 3-5x cheaper than an H100 and more than sufficient.
Maximum throughput (large-scale serving): B200 when pricing normalizes. The 8 TB/s bandwidth and native FP4 support promise 11-15x throughput improvements over Hopper-generation GPUs. If you're serving millions of requests per day, the B200's higher hourly cost is offset by dramatically fewer GPUs needed.
Training Performance: How Much Faster Is Your Model Ready?
Training benchmarks are harder to standardize because they depend heavily on model architecture, batch size, precision, and optimization framework. But the relative performance between GPU generations is consistent.
Training Speedup by Generation
| GPU | vs A100 (Mixed Precision) | Peak Dense FP16 TFLOPS | Best For |
|---|---|---|---|
| A100 80GB | 1.0x (baseline) | 312 | Fine-tuning <13B, budget training |
| H100 SXM | 2.4x | ~989 | General training, 7B-70B models |
| H200 SXM | 2.5x | ~989 | Large models needing 141 GB VRAM |
| B200 | ~5x (estimated) | 2,250 | Large-scale training, 100B+ models |

(Realized training throughput is typically a third to a half of peak, depending on model and parallelism strategy, but the relative ordering holds.)
The H100's 2.4x training speedup over A100 changes the calendar more than the invoice: a training run that takes 10 days on A100s finishes in roughly 4 days on H100s. At the on-demand rates above ($0.78/hr vs $2.85/hr), the H100 run still costs about 50% more in total, but at the marketplace low end ($1.49/hr) it is outright cheaper, and either way your model ships six days sooner.
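Here's that arithmetic spelled out as a quick per-GPU check (a sketch using prices quoted earlier in this post; it ignores storage, networking, and multi-node scaling losses):

```python
# Total cost = hours x hourly rate, per GPU.

def run_cost(days: float, price_per_hr: float) -> float:
    return days * 24 * price_per_hr

a100     = run_cost(10, 0.78)          # 10 days on A100 at $0.78/hr
h100     = run_cost(10 / 2.4, 2.85)    # 2.4x faster, on-demand pricing
h100_low = run_cost(10 / 2.4, 1.49)    # same run at the marketplace low end
print(f"A100: ${a100:.0f} | H100 on-demand: ${h100:.0f} | H100 marketplace: ${h100_low:.0f}")
```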
For distributed training across multiple nodes, interconnect matters enormously. NVLink bandwidth determines how fast gradients synchronize between GPUs. The B200's NVLink 5 at 1.8 TB/s per GPU is 2x the H100's NVLink 4 — which means less time waiting for communication and more time computing.
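A rough model shows the scale involved. The sketch below is our simplification: it assumes a ring all-reduce, BF16 gradients, and the full quoted NVLink bandwidth (real-world efficiency is lower, but the 2x relationship holds).

```python
# Per-step gradient sync under a ring all-reduce: each GPU moves roughly
# 2 x gradient_bytes x (N-1)/N. Assumes full quoted NVLink bandwidth.

def allreduce_seconds(params_b: float, bytes_per_grad: float,
                      link_tbs: float, n_gpus: int = 8) -> float:
    grad_bytes = params_b * 1e9 * bytes_per_grad
    traffic = 2 * grad_bytes * (n_gpus - 1) / n_gpus   # ring all-reduce volume per GPU
    return traffic / (link_tbs * 1e12)

# 70B parameters, BF16 gradients (2 bytes), 8 GPUs per node
for gpu, bw in [("H100, NVLink 4 (0.9 TB/s)", 0.9), ("B200, NVLink 5 (1.8 TB/s)", 1.8)]:
    print(f"{gpu}: ~{allreduce_seconds(70, 2, bw) * 1000:.0f} ms per sync")
```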
The Decision Framework
With all this data, here's the practical framework for choosing a GPU and provider.
Step 1: What's your workload?
- Inference (real-time) → Optimize for latency. H200 or H100.
- Inference (batch) → Optimize for cost-per-token. H100 marketplace or L40S.
- Training (fine-tuning) → Optimize for VRAM + cost. A100 or H100 depending on model size.
- Training (pre-training) → Optimize for sustained throughput. H100/H200 clusters with InfiniBand.
Step 2: What's your budget tolerance?
- Minimum cost, flexible on availability → GPU marketplaces and spot instances.
- Predictable pricing, guaranteed availability → Reserved instances from mid-tier providers.
- Maximum reliability, budget secondary → Hyperscalers (AWS, GCP, Azure).
Step 3: Compare cost-per-output, not cost-per-hour.
A GPU that costs 2x more per hour but delivers 3x the throughput is 33% cheaper on a per-output basis. Always calculate cost per million tokens (inference) or cost per training epoch (training) — not just the hourly rate.
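That rule is one line of arithmetic; here is the 2x-price, 3x-throughput example from above as a quick check:

```python
# Step 3 as arithmetic: dollars per unit of output, not per hour.
baseline = 1.0 / 1.0    # reference GPU: 1x price, 1x throughput
faster   = 2.0 / 3.0    # 2x the price, 3x the throughput
print(f"{1 - faster / baseline:.0%} cheaper per output")   # -> 33% cheaper
```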
What We'd Recommend
For teams evaluating GPU cloud options in 2026, the market has shifted in your favor. H100 pricing has dropped 60% from its peak, A100s are available under $1/hr, and the L40S offers a genuine budget alternative for inference.
The biggest remaining challenge isn't finding cheap GPUs — it's comparing across providers without spending a full day on it. Spheron AI aggregates pricing across 5+ providers into a single interface, making it practical to consistently deploy on the cheapest available option for your workload. Whether you need spot H100s for a training run or reserved L40S instances for a production endpoint, you're choosing from the full market instead of one provider's pricing.
We'll update this benchmark data quarterly as new hardware ships and pricing evolves. Bookmark this page if you want the latest numbers without the research.