AI GPU Buyers Guide 2026: How to Evaluate Cloud GPU Providers

_Updated April 2026 with current Spheron pricing and the framework we use with teams evaluating GPU cloud providers._

Every GPU cloud ad promises the same three things: fastest hardware, lowest price, infinite scale. Most are wrong about two of the three. Specs on a datasheet are the starting point, not the answer. What separates a provider you can ship on from one that breaks you six months in is how the platform behaves under real workloads: whether performance holds up across long runs, whether pricing stays predictable as you scale, and whether you actually control the machine.

This is a working AI buyers guide for teams shopping GPU cloud in 2026. It walks through what to evaluate, what to measure, and what the current market looks like, including the GPU, pricing, and workload decisions we see teams get wrong most often. For side-by-side provider comparisons, see our top 10 cloud GPU providers analysis and the GPU cloud pricing comparison 2026.

The Four Questions That Actually Matter

Most buyer conversations start from the wrong question. Teams ask "which provider has H100s available?" when they should be asking:

  1. Control. How much of the machine do I own? Can I install my own CUDA, patch the kernel, run profilers?
  2. Consistency. Does throughput stay stable across a 48-hour training run? What happens when the data center fills up?
  3. Pricing truth. What does this cost per month including egress, storage, minimums, and idle time?
  4. Right-sizing. Am I buying the GPU this workload actually needs, or the one the provider wants me to buy?

If a provider can't answer these clearly, the rest of the pitch doesn't matter. For accurate workload sizing, see our GPU memory requirements for LLMs guide and the GPU requirements cheat sheet.

Control: Do You Actually Own the Machine?

The fastest GPU on paper means nothing if you can't configure the environment around it. Many cloud platforms lock you into container sandboxes, restrict driver installation, or hide the hardware behind layers of virtualization that look clean in benchmarks and fail in production.

This matters for more workloads than people realize. LLM fine-tuning, RLHF pipelines, multi-node training with NCCL tuning, custom CUDA kernels, video AI pipelines with GStreamer bindings, and any research workload that touches low-level profiling tools all need real control. Without it, you hit mystery slowdowns at the worst possible moment.

Spheron gives full VM access with root on every deployment. You configure the OS, install your own CUDA version, patch the kernel, run Nsight or DCGM, and do the things that keep training jobs on schedule. Every instance runs on bare metal, which means no hypervisor tax, no shared-memory contention, and no noisy neighbors eating your PCIe bandwidth. Teams consistently measure 15-20% faster compute and cleaner multi-node networking compared to virtualized alternatives.

Before you pick a provider, run a simple test: deploy a small training job, install a custom kernel or driver version, and run nvidia-smi with full permissions. If any of that is blocked, you're running in someone else's sandbox.
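The nvidia-smi part of that test can be scripted so you run the identical check on every provider you trial. The sketch below is one possible shape, not a Spheron tool: the function names and the choice of query fields are assumptions made for illustration. It only reports whether you are root and what nvidia-smi is willing to tell you; installing a custom driver or kernel still has to be tried by hand.

```python
import csv
import io
import os
import shutil
import subprocess

# Fields requested from nvidia-smi; the selection is an illustrative choice.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, row)) for row in rows if row]


def control_report():
    """Collect basic evidence of how much of the machine you control."""
    report = {
        # geteuid is POSIX-only, so guard it for portability.
        "is_root": hasattr(os, "geteuid") and os.geteuid() == 0,
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "gpus": [],
    }
    if report["nvidia_smi_found"]:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=" + ",".join(QUERY_FIELDS),
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            report["gpus"] = parse_gpu_query(result.stdout)
    return report
```

Calling `print(control_report())` on a fresh instance gives a quick record to compare across providers. An empty `gpus` list or `is_root` set to False is a hint that you are inside a restricted environment.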

Consistency: The Thing Benchmarks Hide

Performance consistency is where most GPU clouds quietly fail. A GPU that hits peak throughput on a morning benchmark but slows down in the afternoon when the data center fills up is not useful for a 48-hour training run. An inference endpoint that swings from 80 ms to 400 ms without explanation is a production liability.

Two design choices cause most of this pain: (1) virtualized GPUs sharing PCIe lanes and HBM bandwidth with other tenants, and (2) single-region deployments that collapse when the region gets oversubscribed. Bare metal fixes the first. A multi-provider, multi-region footprint fixes the second.

Spheron aggregates supply from data center partners across multiple regions globally, which means a workload isn't pinned to a single geography or a single failure zone. If one partner slows down, jobs continue elsewhere. If a region goes offline, it doesn't take your inference endpoint with it. Combined with bare-metal single-tenancy, this is why teams building production agents, real-time inference, and 24/7 batch pipelines report better stability than on larger clouds that advertise more raw capacity.

Measure this yourself during evaluation. Run the same workload at different times of day for a week and compare throughput distributions. If tail throughput (the slowest 1% of samples, the throughput analogue of p99 latency) is more than 20% below the median (p50), you have a consistency problem that will show up in production.
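A minimal sketch of that comparison, assuming you have logged one throughput number per run (tokens per second, samples per second, or similar). It reads the tail as the slowest 1% of samples, since for throughput the bad tail is the low end of the distribution. The function name and the return format are illustrative assumptions; the 20% threshold is the one from the text.

```python
import statistics


def consistency_check(samples, max_gap=0.20):
    """Compare median throughput with the slow tail (1st percentile).

    samples: throughput measurements collected across the week.
    max_gap: largest tolerated shortfall of the tail relative to the median.
    """
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    p1, p50 = cuts[0], cuts[49]
    gap = (p50 - p1) / p50
    return {"p50": p50, "p1": p1, "gap": gap, "consistent": gap <= max_gap}


# Hypothetical data: mostly 1000 tokens/s, with 10% of runs dropping to 600.
noisy = [1000.0] * 90 + [600.0] * 10
print(consistency_check(noisy))
```

Collect enough samples for the tail to mean something; with only a handful of runs the 1st percentile is just the single worst run.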

Pricing Truth: What the Hourly Rate Hides

The hourly GPU rate is the piece of the pricing story providers advertise. It's also the least reliable predictor of the monthly bill. On hyperscalers, the hourly rate is typically 40-60% of the total; the rest comes from egress bandwidth, persistent storage, NAT gateways, cross-region replication, snapshot fees, and minimum rental commitments.

A realistic comparison looks like this. For a team serving 70B inference at roughly 10k requests per hour with model checkpoint syncs twice a day:

  • Hourly GPU: $2.50-12.29 per GPU-hour depending on provider (H100 SXM5; AWS P5 at ~$6.88/GPU after the 2025 price cut, Azure ND H100 v5 at ~$12.29/GPU)
  • Egress: $0-150/month (free on neo-clouds, $100-150 on hyperscalers for typical traffic)
  • Storage: $0-60/month (flat or included on neo-clouds, $50-60 on hyperscalers)
  • NAT gateway / IP: $0-45/month (included on neo-clouds, $30-45 on hyperscalers)

The delta between a specialist cloud and a hyperscaler on the same silicon commonly lands at 2-3x across the full bill. Egress alone can exceed the GPU cost when you're moving large checkpoints. The GPU cost optimization playbook walks through the patterns that save the most money in practice.
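To make the comparison concrete, here is a rough per-GPU monthly simulation built from the line items above. It assumes a 730-hour month with the GPU running around the clock and takes the upper end of each hyperscaler range; the function name and those two choices are assumptions for illustration, not a billing formula from any provider.

```python
HOURS_PER_MONTH = 730  # assumption: average month, GPU running 24/7


def monthly_bill(gpu_hourly, egress=0.0, storage=0.0, nat=0.0, hours=HOURS_PER_MONTH):
    """Total monthly cost for one GPU: compute plus flat monthly extras."""
    return gpu_hourly * hours + egress + storage + nat


# Figures from the list above (H100 SXM5, per GPU).
neo_cloud = monthly_bill(2.50)  # egress, storage, and NAT included
aws = monthly_bill(6.88, egress=150, storage=60, nat=45)
azure = monthly_bill(12.29, egress=150, storage=60, nat=45)

for name, total in [("neo-cloud", neo_cloud), ("AWS", aws), ("Azure", azure)]:
    print(f"{name}: ${total:,.2f}/month ({total / neo_cloud:.2f}x)")
```

Swap in your own quotes and measured egress volume before drawing conclusions; the point is to compare full bills rather than hourly rates.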

Spheron's billing is per-minute GPU time with no hidden warm-up charges, no egress surprises, and no idle penalties. If the GPU is running, you pay. If it's off, you don't. That simplicity matters more than people expect, especially for iterative development where instances cycle on and off all day.

Right-Sizing: The Most Expensive Mistake

The single most common mistake we see: a team with a 7B model running on H100 SXM5 because "that's what everyone recommends." An H100 SXM5 at $2.50/hr for a workload an L40S at $0.72/hr could handle is a 3.5x cost multiplier on the same latency budget. Over a year of 24/7 inference, that's roughly $15,600 wasted per GPU, and tens of thousands of dollars across even a small fleet. For teams weighing L40 vs L40S specifically, our L40 vs L40S inference comparison covers the tokens-per-second and cost-per-million math.
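The arithmetic behind that multiplier is easy to rerun for your own fleet size. This is a small sketch using the two hourly rates quoted above and assuming 24/7 utilization for a 365-day year; the helper name and the `gpus` parameter are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: always-on inference


def annual_cost(hourly_rate, gpus=1):
    """Yearly cost of running `gpus` GPUs around the clock."""
    return hourly_rate * HOURS_PER_YEAR * gpus


h100 = annual_cost(2.50)
l40s = annual_cost(0.72)

print(f"cost multiplier: {h100 / l40s:.2f}x")
print(f"overspend per GPU per year: ${h100 - l40s:,.0f}")
print(f"overspend for 4 GPUs per year: ${annual_cost(2.50, 4) - annual_cost(0.72, 4):,.0f}")
```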

Match the GPU to the model and the concurrency target. Here's the working rule of thumb:

  • <7B model, small batch, dev work: RTX 4090 ($0.55/hr) or RTX 5090 ($0.76/hr)
  • 7B-13B production inference: L40S ($0.72/hr) or A100 80GB ($1.07/hr)
  • 30B-70B training or inference: A100 80GB or H100 SXM5 ($2.50/hr on-demand, $1.03/hr spot)
  • 70B+ long-context inference: H200 ($4.54/hr) or B200 spot ($2.12/hr)
  • 100B+ or frontier training: H200, B200, B300 spot ($2.45/hr), or GB200 clusters on Spheron

For detailed workload-to-GPU matching, see best GPU for AI inference in 2026 and best NVIDIA GPUs for LLMs.

GPU Comparison Matrix

Quick reference for every GPU on the Spheron GPU rental catalog, with current per-GPU per-hour pricing and the workload where each shines. Each row links to a dedicated rental page with live inventory.

| GPU | VRAM | Best Use Case | On-Demand | Spot |
|---|---|---|---|---|
| RTX 4090 | 24 GB | Dev, fine-tuning, diffusion | $0.55/hr | N/A |
| RTX 5090 | 32 GB | Dev, small-model inference | $0.76/hr | N/A |
| RTX PRO 6000 | 96 GB | Workstation-class, 70B QLoRA | $0.93/hr | $0.72/hr |
| L40S | 48 GB | 7B-30B inference, video AI | $0.72/hr | N/A |
| A100 80G | 80 GB | Mid-size training & inference | $1.07/hr | $0.60/hr |
| GH200 | 96 GB | Grace Hopper hybrid compute | $1.97/hr | N/A |
| H100 SXM5 | 80 GB | 70B training, multi-GPU HGX | $2.50/hr | $1.03/hr |
| H200 SXM5 | 141 GB | 70B+ inference, long context | $4.54/hr | N/A |
| B200 SXM6 | 192 GB | FP4 inference, frontier training | $6.02/hr | $2.12/hr |
| B300 SXM6 | 288 GB | Frontier training, long context | $6.80/hr | $2.45/hr |

Pricing fluctuates with GPU availability. The prices above are as of 15 Apr 2026 and may have changed. Check current GPU pricing for live rates.
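One quick way to read the matrix is to compute the spot discount for each GPU that lists both prices. The sketch below hard-codes the 15 Apr 2026 figures from the table; the dictionary layout and function name are illustrative assumptions, and live rates will differ.

```python
# (on-demand $/hr, spot $/hr) for the GPUs that list a spot price above.
CATALOG = {
    "RTX PRO 6000": (0.93, 0.72),
    "A100 80G": (1.07, 0.60),
    "H100 SXM5": (2.50, 1.03),
    "B200 SXM6": (6.02, 2.12),
    "B300 SXM6": (6.80, 2.45),
}


def spot_discount(on_demand, spot):
    """Fraction saved by running on spot instead of on-demand."""
    return 1 - spot / on_demand


# Print GPUs from the largest spot discount to the smallest.
for gpu, (on_demand, spot) in sorted(
    CATALOG.items(), key=lambda item: spot_discount(*item[1]), reverse=True
):
    print(f"{gpu:<14} {spot_discount(on_demand, spot):.0%} off on spot")
```

The discount only matters for workloads that tolerate interruption, so treat it as an upper bound on savings rather than a guaranteed number.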

Where Spheron Fits in the Market

Most GPU clouds fall into one of three buckets. Hyperscalers (AWS, GCP, Azure) sell scale but charge aggressively and lock you into enterprise procurement cycles. Specialist clouds (Lambda, CoreWeave, Nebius) sell performance but limit regions and hardware variety. Marketplaces (Vast.ai, SFCompute) sell variety but with uneven reliability.

Spheron blends the three: bare-metal performance, marketplace-style pricing, and a multi-provider regional footprint behind a single console. You aren't locked into any one data center operator, which means supply stays available during shortages and pricing stays competitive because providers underneath compete for your workload. For how this compares head-to-head with specific competitors, see our analyses of Spheron vs RunPod, Spheron vs CoreWeave, and Spheron vs Vast.ai.

What Changed in 2026

A few shifts worth calling out if you last evaluated the market more than six months ago:

  • B200 and B300 entered mainstream availability. B200 SXM6 is now $6.02/hr on-demand on Spheron, at the low end of the $6-8/hr quotes common in late 2025, with spot at $2.12/hr. B200 spot is genuinely competitive with H100 on-demand pricing ($2.50/hr for H100 SXM5) and offers roughly 2.4x the memory bandwidth of an H100 SXM5 plus native FP4.
  • A100 repriced. A100 80GB SXM4 is $1.07/hr on-demand and $0.60/hr spot. That's roughly one-third the price of GCP's A2 instances ($3.30/hr) for the same silicon, and it remains one of the best training and fine-tuning options for models up to 70B.
  • Hyperscaler gap narrowed but is still wide. AWS cut P5 instance pricing by 44% in June 2025, bringing H100 SXM from ~$12.29/GPU down to ~$6.88/GPU, and GCP trimmed A3 rates, but Azure's ND H100 v5 is still roughly $12.29/GPU. The gap between hyperscalers and specialist clouds is still 2-3x on identical silicon.
  • Spot became a real production option. With proper checkpointing, spot instances on H100 ($1.03/hr) and B300 ($2.45/hr) now handle long training runs reliably. The spot GPU training case study walks through the specifics of a 70B training run on spot that saved 73%.
  • Inference cost-per-token compressed. B200 and H200 brought per-token costs for 70B inference into the $0.10-0.20/M range on spot. See cost-per-token math in the GPU cloud pricing comparison.
  • Vera Rubin NVL72 is real now. CoreWeave validated the first Rubin system on June 1, 2026. H2 2026 access goes to hyperscalers and large neo-clouds; broader access including Spheron is expected in 2027. For a breakdown of who gets Rubin access and when, plus cost-per-token economics vs Blackwell, see the Rubin NVL72 cloud availability and rental planning guide.
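The "proper checkpointing" that makes spot viable comes down to two habits: write checkpoints atomically, and always start by trying to resume. The sketch below shows that pattern in a framework-agnostic way. It is not the setup from the case study: the JSON state, the function names, and the `stop_after` parameter (which simulates a preemption) are all assumptions for illustration, and a real job would save model and optimizer state with its training framework instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, stop_after=None):
    """Run from the last checkpoint; stop_after simulates a spot preemption."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        # Placeholder for one real training step (forward, backward, optimizer).
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state["step"]  # simulated preemption, nothing saved here
    save_checkpoint(path, state)
    return state["step"]
```

With `checkpoint_every=10`, a run interrupted at step 37 resumes from step 30, so the checkpoint interval sets the most work a single preemption can cost. Keep the checkpoint on storage that survives the instance being reclaimed.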

A Buying Framework You Can Use Today

If you take one thing from this guide, use this decision flow when evaluating any provider:

  1. Prove control. Deploy, install a custom driver, run a profiler. If blocked, move on.
  2. Prove consistency. Run the same workload across different times of day and across regions if the provider offers multi-region. Measure tail throughput (the slowest 1% of samples), not peak.
  3. Prove pricing. Build a realistic monthly cost simulation with egress, storage, and minimums included. Don't trust hourly rate comparisons in isolation.
  4. Prove right-sizing. Match GPU to model. If your workload fits in 48 GB, don't pay for 80.
  5. Start small, measure, then scale. Use on-demand until you have 30+ days of stable baseline load. Move to spot for fault-tolerant workloads. Negotiate reserved only against a known pattern.

Teams that follow this framework typically save 50-75% versus hyperscalers and get better stability than their previous cloud, because they buy based on measured behavior, not marketing claims.


GPU buying is about matching workload to hardware and pricing model, not chasing the newest GPU. Spheron gives you bare-metal control, transparent billing, and the hardware palette to match every stage of your AI stack across data center partners globally.


Frequently Asked Questions

Four things in order: hardware control (can you run custom kernels, drivers, and CUDA versions), performance consistency (no noisy-neighbor throttling across long training runs), transparent pricing (no egress or warm-up surprises), and right-sized hardware (don't rent an H100 for a 7B inference job). Specs on a datasheet don't guarantee any of these. Check our GPU cloud benchmarks for measured throughput.

For training and any production inference workload, yes. Bare metal removes the hypervisor, so you get full PCIe bandwidth, no shared memory contention, and no noisy-neighbor slowdowns. Teams that move from virtualized clouds to bare metal commonly measure 15-20% faster compute and noticeably better multi-node networking. For short experiments and dev loops, virtualized is fine.

Hourly GPU rate is only 40-60% of the real bill. Add egress ($0.08-0.12/GB on AWS/GCP/Azure, free or flat on most neo-clouds), persistent storage ($0.08-0.15/GB/month), NAT gateway and IP fees, and minimum rental commitments. Run a realistic monthly simulation including model checkpoint egress and data movement. Hyperscalers typically land 2-3x higher than specialist clouds once you add everything in.

Match the GPU to your biggest model and your concurrency target. For 7B-13B inference, RTX 5090 ($0.76/hr) or L40S ($0.72/hr) are the cheapest competent options. For 30B-70B training or inference, A100 80GB ($1.07/hr on-demand, $0.60/hr spot) or H100 SXM5 ($2.50/hr on-demand, $1.03/hr spot) is the sweet spot. For 100B+ or long-context serving, H200 ($4.54/hr) or B200 spot ($2.12/hr) are the best value.

Only after you have 30+ days of stable baseline load you can measure. Committing to reserved capacity before you understand your actual utilization pattern is how teams end up paying for idle GPUs. Use on-demand until the pattern is stable, then negotiate reserved pricing against a known workload. Spot capacity handles bursts; reserved handles the floor.
