GPU procurement lead times in 2026 run 2-6 weeks for H100 SXM5 servers, which means the buy-vs-rent decision has to be made before you know what your actual inference load will look like. That is a problem. The conclusion from the numbers: at under 70% GPU utilization, cloud wins on total cost of ownership. At 80%+ sustained utilization, on-prem can win over a 3-year horizon when priced against hyperscalers. At competitive GPU cloud prices, like those on Spheron, on-demand cloud beats on-prem on pure cost even at near-full utilization. Here is the full breakdown.
The 2026 GPU Procurement Reality
Getting H100 SXM5 servers today means a 2-6 week wait if you are buying new. H200 lead times are 4-8 weeks. B200 hardware is largely spoken for through pre-orders. The few suppliers with available inventory charge a premium.
US tariff policy in 2025 and 2026 added another layer of cost to imported server hardware. NVIDIA's supply allocations favor hyperscalers and large enterprise customers first. Smaller teams either wait, pay spot market prices at 30-50% premium, or rent from cloud providers.
The procurement timeline has a second-order effect: by the time your hardware actually ships, your inference workload pattern may have changed completely. A model that needed 8x H100 SXM5 at training time might run fine on A100s for inference at a fraction of the cost. Locking in weeks before you know that is expensive.
For a current read on cloud pricing, see our GPU cloud pricing comparison.
Total Cost of Ownership: On-Premise H100 Server
Most buyers focus on the hardware purchase price. The actual TCO is substantially higher once you add power, cooling, networking, staff, and space.
Below is the TCO breakdown for a single 8x H100 SXM5 DGX-H100 server over 3 years, based on publicly available hardware costs and US commercial electricity rates.
| Cost Item | Annual Cost | 3-Year Total |
|---|---|---|
| Hardware depreciation (amortized over 3 yr) | ~$116,000-150,000 | $350,000-450,000 |
| Power (~10-10.2 kW @ $0.12/kWh, 24/7) | ~$10,500-10,700 | ~$31,500-32,100 |
| Cooling overhead (30% of power) | ~$3,150-3,210 | ~$9,450-9,630 |
| Colocation or data center rack fee | ~$12,000-24,000 | ~$36,000-72,000 |
| Networking (InfiniBand, switches, amortized) | ~$10,000 | ~$30,000 |
| Storage (NVMe, object storage) | ~$5,000-8,000 | ~$15,000-24,000 |
| Staff (0.5 FTE infrastructure engineer) | ~$75,000-100,000 | ~$225,000-300,000 |
| Maintenance and spares | ~$5,000-10,000 | ~$15,000-30,000 |
| Total | ~$236,650-315,910 | ~$711,950-947,730 |
A few things stand out here.
Staff cost rivals the hardware itself over 3 years. A 0.5 FTE infrastructure engineer at fully loaded cost runs $75,000-100,000/year. That is $225,000-300,000 over 3 years, against $350,000-450,000 for the hardware. Most teams leave this line out of the comparison entirely.
Power costs vary significantly by region. At $0.06/kWh (US Pacific Northwest industrial rates), annual power cost drops to ~$5,250/yr (10 kW × 8,760 hr × $0.06). At European rates ($0.20+/kWh), it rises to ~$17,900/yr (10.2 kW × 8,760 hr × $0.20). The table uses $0.12/kWh as a US commercial average.
Cooling adds 25-40% on top of power. The 30% figure above is conservative for a standard air-cooled row deployment. Liquid-cooled deployments reduce cooling overhead but require infrastructure investment.
Hardware prices assume direct purchase, not financing. Financing at current rates adds 8-12% to the hardware total cost over 3 years.
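The line items above fold into a small calculator. Here is a minimal sketch; the default values are midpoints assumed from the table's ranges, not authoritative quotes, so swap in your own rates:

```python
def onprem_annual_tco(
    hw_cost=400_000,        # 8x H100 SXM5 server price (assumed mid-range)
    amort_years=3,          # straight-line depreciation horizon
    power_kw=10.0,          # sustained draw for the node
    kwh_rate=0.12,          # US commercial average, $/kWh
    cooling_overhead=0.30,  # cooling as a fraction of power cost
    colo=18_000,            # annual rack fee (assumed midpoint)
    networking=10_000,      # amortized InfiniBand/switches
    storage=6_500,          # NVMe + object storage (assumed midpoint)
    staff=87_500,           # 0.5 FTE fully loaded (assumed midpoint)
    maintenance=7_500,      # spares and service (assumed midpoint)
):
    """Annual TCO for one 8-GPU node, mirroring the table's line items."""
    power = power_kw * 8_760 * kwh_rate  # 24/7 operation
    cooling = power * cooling_overhead
    return (hw_cost / amort_years + power + cooling + colo
            + networking + storage + staff + maintenance)

print(f"${onprem_annual_tco():,.0f}/yr")  # midpoint assumptions land near ~$276,500/yr
```

With midpoint inputs the result sits inside the table's ~$236,650-315,910 annual range; the electricity rate and staff fraction are the two knobs that move it most.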
For performance context, see GPU cloud benchmarks.
GPU Cloud Cost Model: Spot, On-Demand, and Reserved
Using live pricing from Spheron as of April 13, 2026:
| GPU Model | On-Demand ($/hr/GPU) | Spot ($/hr/GPU) | 8-GPU Node (OD, $/hr) |
|---|---|---|---|
| H100 SXM5 | $2.90 | $0.80 | $23.20 |
| H100 PCIe | $2.01 | Not available | $16.08 |
| H200 SXM5 | $4.50 | $1.19 | $36.00 |
| A100 80GB SXM4 | $1.64 | $0.45 | $13.12 |
| A100 80GB PCIe | $1.04 | Not available | $8.32 |
Pricing fluctuates based on GPU availability. The prices above reflect April 13, 2026 and may have changed; check current GPU pricing for live rates.
Spot pricing is 70-75% cheaper than on-demand on Spheron's H100 SXM5 ($0.80 vs $2.90). That gap is real and usable for any workload that implements checkpoint-based recovery. Batch inference jobs, nightly evaluation runs, and fine-tuning all fit this profile. Interactive inference serving does not.
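Using spot safely hinges on that checkpoint-based recovery. A minimal sketch of the pattern, independent of any specific framework (the checkpoint file name and job structure are illustrative):

```python
import json
import os

CKPT = "batch_job.ckpt.json"  # illustrative checkpoint path

def load_checkpoint():
    """Resume from the last completed item, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, CKPT)  # atomic rename: a preemption never corrupts state

def run_batch(items, process, ckpt_every=100):
    """Process items so a spot preemption loses at most `ckpt_every` items of work."""
    start = load_checkpoint()
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % ckpt_every == 0:
            save_checkpoint(i + 1)
    save_checkpoint(len(items))
```

On providers that deliver a termination notice before reclaiming a spot instance, registering a signal handler that checkpoints immediately tightens the loss window further.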
For a full treatment of when spot makes sense, see our serverless vs on-demand vs reserved comparison.
Break-Even Analysis: Utilization Thresholds
This is the core question: at what utilization level does owning beat renting?
The formula is straightforward:
cloud_annual_cost = price_per_gpu_hr × 8 GPUs × 8,760 hr/yr × utilization_rate
Using Spheron's H100 SXM5 on-demand price ($2.90/hr):
| GPU Utilization | Annual Cloud Cost (on-demand) | Annual Cloud Cost (spot) | Annual On-Prem TCO | On-Prem Wins? |
|---|---|---|---|---|
| 30% | ~$60,970 | ~$16,819 | ~$237,000+ | No |
| 50% | ~$101,616 | ~$28,032 | ~$237,000+ | No |
| 70% | ~$142,262 | ~$39,245 | ~$237,000+ | No |
| 80% | ~$162,586 | ~$44,851 | ~$237,000+ | No |
| 90% | ~$182,909 | ~$50,458 | ~$237,000+ | No |
| 100% | ~$203,232 | ~$56,064 | ~$237,000+ | No |
At Spheron's current prices, on-demand cloud costs less than on-prem even at 100% utilization. The gap narrows as utilization rises, but it does not flip. This reflects a structural shift in the market: GPU cloud prices in 2026 are competitive enough that the traditional break-even argument for on-prem has weakened significantly.
The math changes at hyperscaler prices. At AWS H100 pricing ($4.10-6.88/hr per GPU on the p5.48xlarge), an 8-GPU on-demand node costs $287,000-482,000/year at 100% utilization. The on-prem floor with current hardware prices is ~$237,000/year. The break-even with AWS on-demand pricing lands at roughly 50-83% utilization depending on region and pricing tier. GCP A3 High at $11-16/hr per GPU is even more expensive, making on-prem competitive at even lower utilization thresholds against those rates.
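The break-even arithmetic is easy to reproduce. A quick sketch using the rates quoted in this section (the $237,000 on-prem floor comes from the TCO table above; the per-GPU rates are the published figures cited here):

```python
def cloud_annual(price_per_gpu_hr, utilization, gpus=8, hours=8_760):
    """Annual cost of an 8-GPU node at a given average utilization."""
    return price_per_gpu_hr * gpus * hours * utilization

def breakeven_utilization(onprem_annual_tco, price_per_gpu_hr, gpus=8, hours=8_760):
    """Utilization above which owning beats renting; >1.0 means it never flips."""
    return onprem_annual_tco / (price_per_gpu_hr * gpus * hours)

ONPREM_FLOOR = 237_000  # approximate annual on-prem TCO floor

print(breakeven_utilization(ONPREM_FLOOR, 2.90))  # Spheron OD: ~1.17, never flips
print(breakeven_utilization(ONPREM_FLOOR, 4.10))  # AWS low end: ~0.82
print(breakeven_utilization(ONPREM_FLOOR, 6.88))  # AWS high end: ~0.49
```

A break-even above 1.0, as with the $2.90/hr rate, means no achievable utilization makes on-prem cheaper, which is exactly what the table above shows row by row.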
Most production LLM inference teams operate at 40-65% GPU utilization due to traffic variability and request batching limits. The assumption of 80-90% utilization that makes on-prem look attractive is rarely achieved in practice outside of batch-only pipelines. For techniques that improve GPU utilization through smarter batching, see our guide on continuous batching and PagedAttention.
The Hidden Costs Nobody Talks About
The TCO table above captures the obvious costs. Several others are harder to quantify but real.
Idle GPU power draw. An H100 SXM5 draws under 100W at idle, roughly 14% of its 700W peak. On-prem, you pay for that draw whether or not the GPU is serving requests. In the cloud, an idle instance can simply be deprovisioned, so you pay nothing.
Networking egress. AWS and Azure charge $0.087-0.09/GB for outbound data; GCP charges $0.11-0.12/GB. At 1 TB/day of inference output (text, embeddings, completions), that is ~$2,600-3,600/month in egress alone depending on provider. Spheron does not charge egress fees. On a high-volume inference deployment, this difference compounds fast.
Redundancy overhead. On-prem requires N+1 or N+2 power and cooling redundancy. In practice, that means 15-25% of your data center cost goes to capacity you never use under normal conditions. Cloud handles redundancy transparently.
GPU failure rate. Enterprise GPU failure rates run 5-10% annually, with large-cluster data (Meta's 16,384-GPU H100 deployment) showing ~9% annualized failure rates. One failed H100 SXM5 mid-contract costs $25,000-35,000 to replace and typically takes weeks due to lead times. When an on-prem GPU fails, inference capacity drops immediately. Cloud failures result in instance replacement, usually resolved in minutes.
Team bandwidth. Every hour spent on GPU driver updates, firmware patches, or hardware troubleshooting is an hour not spent on model development. For a small team, this cost is not negligible.
For strategies to reduce spend on whichever infrastructure path you choose, see the GPU cost optimization playbook.
Hyperscaler vs Spheron: Inference Pricing
For teams evaluating cloud providers, the on-demand rate spread across H100 providers is substantial.
| Provider | H100 On-Demand ($/hr/GPU) | Egress Fee | Availability |
|---|---|---|---|
| AWS (p5.48xlarge) | ~$4.10-6.88 | $0.09/GB | Multi-region |
| Google Cloud (A3 High) | ~$11.00-16.00 | $0.11-0.12/GB | Multi-region |
| Azure (ND H100 v5) | ~$6.98-12.29 | $0.087/GB | Multi-region |
| CoreWeave | ~$4.76-6.16 | None | Multi-region |
| Spheron | $2.90 | None | Multiple regions |
Hyperscaler prices from public pricing pages as of April 2026. Spheron price from live API.
Two things matter beyond the hourly rate. Egress fees at 1 TB/day add ~$2,600-3,600/month depending on provider. That is roughly ~$31,000-43,000/year, a non-trivial factor in the total cost comparison. Hyperscalers also require commitment periods (1-3 year reserved instances) to access their most competitive rates; the rack rates above are on-demand.
Spheron's instant availability is a separate advantage. If you need 8x H100 today, you can deploy in under 2 minutes. There is no weeks-long procurement process.
Hybrid Strategy: Baseline On-Prem with Cloud Burst
For teams that already own on-prem GPU infrastructure, the choice is not binary. A hybrid approach often produces better economics than either pure path.
The framework:
- Keep baseline inference capacity on-prem, sized for p50 traffic (median load)
- Burst to cloud for peak load (p90 to p99 traffic spikes)
- Use cloud for all non-production workloads: development, staging, experimentation, and evaluation runs
- Target split: on-prem handles 60-70% of traffic volume, cloud handles 30-40%
This approach gets on-prem utilization to 80%+ on the baseline load (where it pencils out) while avoiding the capital requirement to size for peak demand. The cloud burst is typically spot-eligible, bringing the cost down further.
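A rough sketch of the hybrid cost model under those assumptions (the 65% baseline share, the $237,000 on-prem TCO, and the Spheron rates are the illustrative figures from this article; your split will differ):

```python
ONPREM_TCO = 237_000          # annual TCO for one 8-GPU node (from the table above)
OD_RATE, SPOT_RATE = 2.90, 0.80  # Spheron H100 SXM5 $/GPU-hr

def hybrid_annual(total_gpu_hours, onprem_share=0.65):
    """On-prem node serves the baseline share; spot-eligible burst covers the rest."""
    burst_hours = total_gpu_hours * (1 - onprem_share)
    return ONPREM_TCO + burst_hours * SPOT_RATE

def pure_cloud_annual(total_gpu_hours, rate=OD_RATE):
    return total_gpu_hours * rate

# e.g. 90,000 GPU-hours/yr of demand (one 8-GPU node supplies at most 70,080)
demand = 90_000
print(f"hybrid: ${hybrid_annual(demand):,.0f}")
print(f"pure cloud: ${pure_cloud_annual(demand):,.0f}")
```

At these rates the two paths land within a few thousand dollars of each other, consistent with the break-even section: hybrid earns its keep mainly when the on-prem hardware is already a sunk cost, or when comparing against hyperscaler on-demand rates rather than competitive ones.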
See our hybrid cloud and edge AI inference guide for implementation details.
5 Questions to Determine Your GPU Strategy
Before committing capital or signing a long-term contract, answer these questions.
- What is your actual GPU utilization today? Run `nvidia-smi dmon` or check your cloud monitoring. If average utilization is under 70%, you almost certainly do not have the workload profile to justify on-prem. Low utilization means you are paying for idle capacity.
- Do you have data sovereignty or air-gapped requirements? If regulations require your data to stay within a specific jurisdiction, or if your model weights cannot leave your network, on-prem or private cloud may be required regardless of cost. This is a hard constraint, not an economics question.
- How predictable is your inference load? Variable or seasonal demand favors cloud. If your traffic spikes 5x during product launches and drops 60% overnight, sizing on-prem for peak is wasteful. Flat, predictable 24/7 load starts to favor on-prem at scale.
- Can you wait 2-6 weeks for hardware procurement? H100 SXM5 lead times in 2026 are 2-6 weeks; H200 is 4-8 weeks. If you need capacity in the next few days, cloud is the only practical option. For teams with any urgency, the procurement window is a real constraint even with improved supply.
- Do you have the engineering team to operate GPU infrastructure? Budget at least 0.5-1 FTE per cluster for driver management, hardware failures, firmware updates, and cluster operations. If you do not have that headcount or do not want to hire for it, cloud is the right path.
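The five questions collapse into a simple heuristic. A sketch encoding this article's thresholds (a deliberate simplification for triage, not a substitute for working through the questions):

```python
def gpu_strategy(avg_utilization, data_sovereignty_required,
                 load_is_predictable, can_wait_weeks, has_infra_headcount):
    """Rough buy-vs-rent recommendation following the five questions above."""
    if data_sovereignty_required:
        return "on-prem/private cloud"  # hard constraint, overrides economics
    if avg_utilization < 0.70:
        return "cloud"                  # idle capacity makes ownership pencil out badly
    if not (can_wait_weeks and has_infra_headcount):
        return "cloud"                  # procurement lead time or ops gap
    if load_is_predictable:
        return "on-prem or hybrid"      # sustained high utilization can justify owning
    return "hybrid"                     # high but spiky load: own baseline, burst to cloud
```

Note how few input combinations reach the on-prem branch; that asymmetry is the section's point.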
Most teams that honestly answer these questions find that cloud is the better fit for at least the next 1-2 years. On-prem makes economic sense for organizations with predictable high utilization, long investment horizons, compliance requirements, and the engineering staff to operate the infrastructure. For a broader framework on evaluating GPU cloud options, see our GPU cloud buyer's guide.
If your GPU utilization is under 70% or you need capacity this week, Spheron's on-demand and spot H100, H200, and A100 instances are the faster, lower-risk path. No egress fees, no procurement lead times.
