Research

What Is a GPU Cloud? Definition, How It Works, and When to Use One (2026)

Back to BlogWritten by Mitrasish, Co-founderMay 24, 2026
What Is GPU CloudGPU CloudCloud GPUGPU Cloud DefinitionGPU Cloud MeaningHow GPU Cloud WorksGPU RentalAI InfrastructureGPU Pricing
What Is a GPU Cloud? Definition, How It Works, and When to Use One (2026)

A GPU cloud is a rental marketplace that lets you access NVIDIA GPU servers by the hour, with no hardware to buy and no long-term contract. You pay for compute time, get SSH root access to a dedicated instance, and shut it down when the job finishes. For teams running LLM training, inference, or batch processing workloads, it's the fastest way to get GPU capacity without a six-figure capital expenditure.

How a GPU Cloud Works

The physical foundation is enterprise-grade datacenter hardware: NVIDIA GPUs in either PCIe form factor (single GPU slots on standard servers) or SXM with NVLink (multi-GPU nodes connected via high-bandwidth fabric, 900 GB/s per GPU on H100). These run in Tier 2/3/4 compliant facilities with redundant power, cooling, and network uplinks.

When you submit a request, the platform's scheduler looks across its available inventory for a node that matches your GPU model, count, and configuration requirements. Once found, it provisions the instance, installs your chosen OS image, and returns SSH credentials. Billing starts at instance launch and stops when you terminate, typically metered per second or per minute with no minimum commitment on on-demand plans.

The hardware can be bare metal (you get the full physical server) or virtualized (your instance shares the host OS but gets dedicated GPU resources). Most AI workloads prefer bare metal because GPU virtualization adds latency and constrains driver access.

GPU Cloud vs Traditional Cloud vs On-Prem: Quick Comparison

GPU Cloud (e.g. Spheron)Hyperscaler (AWS/GCP/Azure)On-Prem
Hardware ownershipNoNoYes
Billing modelPer-second/per-minutePer-hour or per-secondCapEx + OpEx
Startup timeUnder 2 minutes2-10 minutesDays to weeks
GPU availabilityBroad inventory, live marketplaceLimited SKUs, often waitlistedFixed capacity
Egress feesNone or flat$0.08-$0.12/GBNone
Minimum commitmentNone (on-demand)None (on-demand)Full hardware cost
Control levelRoot access, bare metalVM with limited driver controlFull control

The hyperscaler comparison deserves a closer look. According to our GPU cloud providers compared analysis, AWS p5.48xlarge runs about $98.32/hr for 8x H100 SXM5, or roughly $12.29/hr per GPU. Azure ND H100 v5 instances are priced similarly at ~$12.29/hr per GPU. Against that, Spheron H100 SXM5 on-demand is $3.90/hr and spot is $0.80/hr. The spot-vs-AWS gap is roughly 15x. That price difference is structural, not a promotional discount. Hyperscalers carry higher overhead, more managed-service abstraction, and margins that reflect their enterprise sales model.

Common GPU SKUs You Can Rent

GPUVRAMArchitecturePrimary Use CaseOn-Demand $/hr
H100 SXM580 GB HBM3HopperLLM training, large-scale inference$3.90
H100 PCIe80 GB HBM3HopperFine-tuning, cost-effective inference$2.09
H200 SXM instance141 GB HBM3eHopperLong-context inference, 70B+ models$4.56
B200 SXM6192 GB HBM3eBlackwellTrillion-parameter models, frontier inference$7.01
L40S rental48 GB GDDR6Ada LovelaceInference serving, rendering, mixed workloads$0.75
RTX PRO 600096 GB GDDR7BlackwellProfessional AI inference, workstation workloads$1.77
A100 80GB80 GB HBM2eAmpereTraining up to 20B params, cost-effective inference$1.09

Pricing fluctuates based on GPU availability. The prices above are based on 24 May 2026 and may have changed. Check current GPU pricing for live rates.

For a deeper dive into pricing across providers, see the GPU cloud pricing comparison.

Pricing Models Explained

On-Demand

On-demand is the default: you rent a dedicated instance, it runs until you stop it, and you pay per second or per minute. No upfront commitment, no reservation required.

A concrete example: H100 PCIe at $2.09/hr on Spheron. If you're running a 4-hour Llama 3 70B fine-tuning job, that's $8.36 total. Stop the instance and billing stops. Want to rent an H100 for an afternoon and shut it down before dinner? That's exactly what on-demand is for.

Spot

Spot instances are preemptible capacity sold at a discount, typically 50-80% below on-demand. The provider can reclaim the hardware with short notice (usually 30-90 seconds). That makes spot unsuitable for production inference APIs or jobs without checkpointing, but excellent for batch workloads that save state periodically.

As of May 2026 on Spheron: H100 SXM5 spot is $0.80/hr vs $3.90/hr on-demand. B200 SXM6 spot is $1.71/hr vs $7.01/hr on-demand. For a training job that checkpoints every 500 steps, spot gives you the same GPU at less than half the price. See the billing model comparison for a detailed breakdown.

Reserved

Reserved pricing (also called committed-use) involves a monthly or quarterly commitment in exchange for a guaranteed rate and guaranteed availability. It's the right model when you know your GPU utilization will be consistently above 60-70% over weeks or months. Below that threshold, on-demand is cheaper because you're not paying for idle hours.

The breakeven math is simple: if your on-demand monthly spend is $3,000 and a reserved commitment costs $2,000/month, you save $1,000 per month. But if you only use the GPU 50% of the time, your actual on-demand spend is $1,500, so the reserved commitment costs more. Model your actual utilization before committing.

Serverless

Serverless GPU billing charges per inference call rather than per hour. You pay only when requests arrive, and the platform handles scaling up and down automatically. This works well for infrequent or bursty workloads where a full instance would sit idle most of the time.

The tradeoff is cold start latency: when no instance is warm, the first request in a quiet period waits for provisioning, which can take 30-120 seconds for large models. Production inference APIs with consistent traffic almost always perform better on reserved on-demand instances than serverless.

When to Use a GPU Cloud

LLM training and fine-tuning are the clearest use cases. A 7B parameter QLoRA fine-tune takes 2-4 hours on an H100 and costs $6-12 on-demand. Training a custom 70B model requires 8xH100 nodes for days. Renting that capacity for the duration of the run, then releasing it, costs a fraction of buying the hardware.

Inference serving at variable scale suits GPU cloud well. If your API traffic spikes 10x during business hours but drops to near zero overnight, on-demand lets you scale to match demand while keeping inference latency SLOs in check. Reserved on-prem hardware sized for peak traffic sits idle 70% of the time.

Batch processing jobs (dataset preprocessing, embedding generation, offline evaluation) are the ideal spot workload. Checkpointing every few minutes means a preemption loses at most minutes of work, and spot pricing cuts the compute cost by more than half.

AI agent workloads that spin up many short-lived inference tasks fit GPU cloud's per-minute billing. An agentic workflow that runs for 10 minutes and costs $0.35 per run is hard to match with on-prem hardware that costs money whether it's running or not. See our LLM deployment guide for end-to-end architecture patterns.

Research and experimentation is where GPU cloud pays off most visibly. Running 20 hyperparameter sweep experiments over a weekend costs $80-200 depending on GPU model. The alternative is waiting weeks for a shared cluster queue or spending $30,000+ on a workstation GPU.

When NOT to Use a GPU Cloud

Sustained 24/7 production at hyperscale eventually favors reserved clusters or on-prem hardware. Once you're running hundreds of GPUs continuously for a year or more, the economics shift toward ownership or long-term reservations. Most teams don't hit this threshold.

Air-gapped compliance requirements may prohibit cloud infrastructure entirely. Financial institutions and government agencies sometimes must run on hardware they own and physically control, where no external network connectivity is allowed. GPU cloud can't serve those use cases regardless of its security posture.

Workloads tightly coupled to managed services can make more sense on hyperscalers. If your pipeline depends heavily on SageMaker Autopilot's managed training workflows, Vertex AI Pipelines, or a specific cloud's proprietary data services, the migration cost to a GPU cloud may not be worth the compute savings.

Sub-10ms latency to co-located data is difficult on cloud infrastructure. If your model needs to read from a database or storage system and round-trip network latency matters at the millisecond level, co-location on-prem (or in the same datacenter) may be the only viable option.

How to Get Started on Spheron

Spheron is a GPU cloud marketplace that aggregates NVIDIA GPU supply from 5+ providers into a single platform with live pricing and one-click deployment. It gives you bare-metal H100, H200, B200, A100, L40S, RTX PRO 6000, and RTX 4090 instances with per-minute billing, no egress fees, and SSH root access. The entire process from signup to a running inference workload takes under 10 minutes for most users.

Here's how to get from zero to a running workload:

  1. Create a Spheron account. Go to app.spheron.ai, sign up with your email or GitHub account, and complete verification.
  1. Browse available GPUs. Navigate to the GPU marketplace. Filter by GPU model, VRAM, and price. H100 PCIe is available from $2.09/hr on-demand.
  1. Select instance type and billing model. Choose on-demand for full control or spot for up to 80% savings on fault-tolerant workloads. On-demand runs until you stop it.
  1. Deploy the instance. Click Rent Now, choose your OS (Ubuntu 22.04 is recommended), add your SSH public key, and launch. Instances are ready in under 2 minutes.
  1. Run your first workload. SSH in and run your inference or training job. For vLLM: pip install vllm && python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct.

Full documentation, including API reference, framework-specific tutorials, and multi-node setup guides, is at Spheron documentation.

FAQ

What is a GPU cloud?

A GPU cloud is a rental marketplace that gives you access to NVIDIA GPU servers by the hour with no hardware to buy. You get SSH root access to a dedicated instance, pay only for the compute time you use, and shut it down when the job finishes. Pricing starts under $1/hr for spot instances on platforms like Spheron.

How does a GPU cloud work?

You submit a request specifying the GPU model, count, and OS. The platform's scheduler finds available capacity from its pool of datacenter hardware, provisions a virtual or bare-metal instance, and hands you SSH credentials. Billing starts when the instance is running and stops when you terminate it, typically metered per second or per minute.

What is the difference between a GPU cloud and a regular cloud?

A regular cloud (AWS, Azure, GCP) offers general-purpose compute with GPU options as a secondary service. GPU clouds are built around GPU hardware as the primary product, which means better GPU availability, lower per-hour prices, and simpler pricing. Hyperscalers charge 3-6x more per GPU hour and add egress fees on top.

How much does it cost to rent a GPU in the cloud?

As of May 2026 on Spheron: A100 80GB PCIe starts at $1.09/hr on-demand, H100 SXM5 at $3.90/hr ($0.80/hr spot), H100 PCIe at $2.09/hr, H200 SXM5 at $4.56/hr ($2.00/hr spot), and B200 SXM6 at $7.01/hr ($1.71/hr spot). Spot instances cut costs 50-80% for fault-tolerant workloads.

Is a GPU cloud secure?

Yes. Enterprise GPU clouds run hardware in Tier 2, 3, and 4 certified datacenters with physical security, redundant power, and network isolation. You get dedicated GPU instances, not shared VMs, with SSH root access and no shared tenancy on the GPU itself.

What regions are GPU clouds available in?

Most GPU cloud platforms have datacenter partners across North America, Europe, and Asia-Pacific. Spheron aggregates supply from datacenter partners across multiple regions, so availability depends on live inventory rather than fixed availability zones. Check the GPU marketplace for current regional availability by model.

Are GPU clouds more sustainable than on-prem servers?

For most teams, yes. GPU clouds let you use hardware only when needed, avoiding 24/7 power draw for idle servers. Multi-tenant infrastructure achieves higher average utilization per physical GPU than most on-prem setups. Hyperscale and colocation facilities typically run PUE of 1.1-1.3, versus 1.5+ for most on-prem server rooms. Energy source and PUE still vary by provider, so check the specific datacenter before assuming.

What is the cheapest GPU cloud option?

For production-grade NVIDIA GPUs, spot pricing gives the lowest rates. As of May 2026, H100 SXM5 spot starts at $0.80/hr and B200 SXM6 spot at $1.71/hr on Spheron. For cost-effective on-demand access, A100 80GB PCIe at $1.09/hr and L40S at $0.75/hr are the most affordable data-center-class options. Always check live rates since spot prices fluctuate.


GPU cloud access starts at under $1/hr for H100 spot instances on Spheron. No contracts, no egress fees, and instances ready in under 2 minutes.

Browse GPU pricing → | Rent H100 → | Get started →

STEPS / 05

Quick Setup Guide

  1. Create a Spheron account

    Go to app.spheron.ai, sign up with your email or GitHub account, and complete verification.

  2. Browse available GPUs

    Navigate to the GPU marketplace. Filter by GPU model, VRAM, and price. H100 PCIe is available from $2.09/hr on-demand.

  3. Select instance type and billing model

    Choose on-demand for full control or spot for up to 80% savings on fault-tolerant workloads. On-demand runs until you stop it.

  4. Deploy the instance

    Click Rent Now, choose your OS (Ubuntu 22.04 recommended), add your SSH public key, and launch. Instances are ready in under 2 minutes.

  5. Run your first workload

    SSH in and run your inference or training job. For vLLM: `pip install vllm && python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct`.

FAQ / 08

Frequently Asked Questions

A GPU cloud is a rental marketplace that gives you access to NVIDIA GPU servers by the hour with no hardware to buy. You get SSH root access to a dedicated instance, pay only for the compute time you use, and shut it down when the job finishes. Pricing starts under $1/hr for spot instances on platforms like Spheron.

You submit a request specifying the GPU model, count, and OS. The platform's scheduler finds available capacity from its pool of datacenter hardware, provisions a virtual or bare-metal instance, and hands you SSH credentials. Billing starts when the instance is running and stops when you terminate it, typically metered per second or per minute.

A regular cloud (AWS, Azure, GCP) offers general-purpose compute with GPU options as a secondary service. GPU clouds are built around GPU hardware as the primary product, which means better GPU availability, lower per-hour prices, and simpler pricing structures. Hyperscalers charge 3-6x more per GPU hour than GPU-specialized platforms and add egress fees on top.

Prices vary by GPU model and billing type. As of May 2026 on Spheron: A100 80GB PCIe starts at $1.09/hr on-demand, H100 SXM5 at $3.90/hr ($0.80/hr spot), H100 PCIe at $2.09/hr, H200 SXM5 at $4.56/hr ($2.00/hr spot), and B200 SXM6 at $7.01/hr ($1.71/hr spot). Spot instances cut costs 50-80% for fault-tolerant workloads.

Yes. Enterprise GPU clouds run hardware in Tier 2, 3, and 4 certified datacenters with physical security, redundant power, and network isolation. You get dedicated GPU instances, not shared VMs. Traffic is isolated per tenant. You control what software runs on the machine via SSH root access.

Most GPU cloud platforms have datacenter partners across North America, Europe, and Asia-Pacific. Spheron aggregates supply from datacenter partners across multiple regions, so availability depends on live inventory rather than fixed zones. Check the GPU marketplace for current regional availability by model.

For most teams, yes. GPU clouds let you use hardware only when you need it, avoiding 24/7 power draw for idle servers. Multi-tenant infrastructure means higher average utilization per physical GPU than most on-prem setups. That said, energy source and PUE vary by datacenter, so it depends on the specific provider.

For production-grade NVIDIA GPUs, spot pricing on platforms like Spheron gives the lowest rates. As of May 2026, H100 SXM5 spot starts at $0.80/hr and B200 SXM6 spot at $1.71/hr. For cost-effective on-demand access, A100 80GB PCIe starts at $1.09/hr and L40S at $0.75/hr. Always check live rates since spot prices fluctuate with available inventory.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.