GPU clouds offer three billing models. Pick the wrong one and you'll overpay by 2-5x. Serverless charges per call with no idle cost. On-demand bills by the second, minute, or hour for a dedicated instance, depending on the provider. Reserved locks you in for months at a steep discount. The right answer depends entirely on your workload pattern.
For provider-specific GPU pricing comparisons, see our GPU cloud pricing comparison. For inference-specific GPU selection guidance, see best GPU for AI inference in 2026.
The Three Billing Models
When Should You Use Serverless GPU?
Serverless GPU platforms abstract the hardware entirely. You submit a request or function call, the platform provisions a GPU, runs your code, and bills per inference call or per compute-second. You pay nothing when there is no traffic.
Pros: Zero idle cost, no instance management, scales to zero automatically.
Cons: Cold starts ranging from 200ms-4s for small models on optimized platforms (Modal, RunPod FlashBoot) to 6-60s for large LLM deployments, depending on container size and caching. No hardware control. Limited to what the platform supports. Not available at multi-GPU scale (you cannot run an 8xH100 job serverlessly on most platforms).
Best for: Async batch jobs where cold starts are acceptable, prototyping, demos, and situations where zero instance management is worth paying a premium for. Note that for low-traffic APIs (100-500 requests/day), per-second on-demand billing (Spheron, RunPod) is typically far cheaper than serverless and has no cold starts; the low-traffic scenario in the cost modeling section below runs the exact numbers.
Providers: Modal, Replicate, RunPod Serverless, Baseten, Fal AI.
When Should You Use On-Demand GPU?
On-demand rents a dedicated GPU instance that stays running until you stop it. You get full hardware control, consistent throughput, and instance startup in under 60 seconds. You are billed whether the GPU is busy or idle. Billing granularity varies by provider: Spheron bills per second, Lambda Labs per hour, Vast.ai per minute.
Pros: Full hardware control, predictable throughput, no cold starts, works at any scale including multi-GPU clusters.
Cons: You pay for idle time. If you provision an H100 for 24 hours but only use it 3 hours, you pay for 24.
Best for: Training runs, sustained inference serving, interactive development, any workload requiring consistent throughput.
Providers: Spheron, RunPod, Lambda Labs, CoreWeave, Vast.ai.
When Should You Use Reserved GPU?
Reserved pricing on hyperscalers requires committing to a 1- or 3-year contract in exchange for a discount over on-demand. GPU reserved discounts typically range from 30-75%; AWS H100 1-year reserved sits at ~49% (reflecting the June 2025 on-demand price reduction to ~$3.90/hr). Some neo-clouds (GPU-specialized providers such as Spheron, RunPod, and Lambda Labs) offer reserved pricing with shorter commitment windows, and CoreWeave offers negotiated reserved pricing with minimum commitment periods (exact terms vary by contract). Either way, you pay the reserved rate every month regardless of actual usage.
Quick rule: Reserved makes sense when your utilization exceeds (1 - discount percentage). At a ~49% AWS discount, you need 51% utilization to break even. See full calculations below.
Pros: Large discounts for 24/7 workloads. Cost predictability for budget planning.
Cons: Contract commitment. You pay even if you do not use the GPU. Reserved rates on hyperscalers still exceed neo-cloud on-demand pricing for the same GPU.
Best for: Production workloads running 24/7 on hyperscalers where you are already invested in the AWS/GCP/Azure ecosystem.
Providers: AWS (EC2 reserved), GCP (committed-use contracts), Azure (reserved VMs), CoreWeave (minimum commitment required for discounted rates; terms vary by contract).
Spot Pricing: A Hybrid Option
Spot GPUs are excess capacity sold below the on-demand rate. Providers including Spheron, RunPod, and Vast.ai offer spot instances at variable discounts depending on availability. Spot instances can be interrupted with short notice (2 minutes on AWS, 30 seconds on GCP/Azure), so they require checkpointing or fault-tolerant job design.
Spot pricing is not a separate billing model, but it sits between on-demand and reserved. You get below-on-demand rates without a contract, at the cost of interruption risk. Best for training jobs and batch inference that checkpoint state regularly.
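Fault-tolerant job design for spot boils down to saving state often enough that an interruption costs you at most one checkpoint interval. A minimal sketch, assuming a local checkpoint file and a stand-in training loop (the path, step counts, and state contents are illustrative, not from any provider's SDK):

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"  # illustrative path; use durable storage in practice
CHECKPOINT_EVERY = 100              # steps between saves

def load_checkpoint():
    # Resume from the last saved step, or start fresh after an interruption.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": None}

def save_checkpoint(ckpt):
    # Write to a temp file and rename atomically, so a mid-write
    # interruption cannot corrupt the previous checkpoint.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(ckpt, f)
    os.replace(tmp, CHECKPOINT_PATH)

def train(total_steps):
    ckpt = load_checkpoint()
    for step in range(ckpt["step"], total_steps):
        ckpt["state"] = f"weights-after-step-{step}"  # stand-in for real training work
        ckpt["step"] = step + 1
        if ckpt["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt)  # lose at most CHECKPOINT_EVERY steps on preemption
    save_checkpoint(ckpt)
    return ckpt

result = train(total_steps=250)
print(result["step"])  # 250
```

If the instance is reclaimed mid-run, relaunching the same script resumes from the last saved step rather than from zero, which is what makes spot's interruption risk acceptable for training and batch inference.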
Comparison Table
| Compared | Serverless | On-Demand | Reserved |
|---|---|---|---|
| Billing unit | Per call / second | Per second / minute / hour (varies by provider) | Monthly flat (committed) |
| Cold start | 200ms-4s (small/optimized); 6-60s+ (large LLMs) | <60 seconds | None |
| Idle cost | Zero | Full rate | Full rate (committed) |
| Contract | None | None | Varies (months to years) |
| GPU control | None (abstracted) | Full | Full |
| Best for | Intermittent / async | Training / sustained serving | Predictable 24/7 load |
Cost Modeling for Four Workload Types
These scenarios use real pricing to show which billing model wins in practice. Serverless rates are approximate since providers change them frequently. Modal's published per-second rate is $0.002778/GPU-second ($9.99/hr). Under sustained load with warm containers and keep-alive optimization, effective rates can be lower ($3.95-$4.76/hr). The examples in this post use the published per-second rate to show worst-case serverless costs. Check Modal's pricing page for current rates.
Low-Traffic Inference API
Setup: 100 requests/day, each needing 2 seconds of H100 compute. Total compute: 200 seconds/day = 3.33 minutes.
- Serverless (Modal, ~$0.002778/GPU-second): 200s x $0.002778 = $0.5556/day = $16.67/month
- On-demand H100 PCIe with per-minute minimum billing (e.g., Vast.ai, ~$1.50/hr = $0.025/min): 100 requests x 1-min minimum = $2.50/day = $75/month
- On-demand H100 PCIe per-second billing (Spheron, $2.01/hr): 3.33 min x $0.0335/min = $0.1116/day = $3.35/month
Winner: Per-second on-demand (Spheron, $3.35/month). It is 80% cheaper than serverless ($16.67/month) and avoids cold starts entirely. Serverless makes sense here only if you want zero instance management and can tolerate cold starts of 200ms-4s for small models or up to 6-60s for large LLMs, and are willing to pay 5x more per month for that convenience. Per-request billing with 1-minute minimums costs ~23x more than per-second on-demand and should be avoided for this workload type.
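The arithmetic above can be reproduced with a few lines of Python. This is a sketch using the approximate rates quoted in this scenario, not live prices:

```python
# Monthly cost of a low-traffic API (100 requests/day, 2 s of H100 compute each)
# under three billing schemes. Rates are the approximate figures from the
# scenario above, not live prices.

DAYS = 30
REQUESTS_PER_DAY = 100
SECONDS_PER_REQUEST = 2

def serverless_monthly(rate_per_gpu_second):
    # Billed per compute-second; zero idle cost.
    return REQUESTS_PER_DAY * SECONDS_PER_REQUEST * rate_per_gpu_second * DAYS

def per_minute_min_monthly(rate_per_hour):
    # Each request is rounded up to a 1-minute billing minimum.
    return REQUESTS_PER_DAY * (rate_per_hour / 60) * DAYS

def per_second_monthly(rate_per_hour):
    # True per-second billing: pay only for actual compute time.
    daily_seconds = REQUESTS_PER_DAY * SECONDS_PER_REQUEST
    return daily_seconds * (rate_per_hour / 3600) * DAYS

print(f"Serverless:           ${serverless_monthly(0.002778):.2f}/month")   # ~$16.67
print(f"Per-minute minimum:   ${per_minute_min_monthly(1.50):.2f}/month")   # ~$75.00
print(f"Per-second on-demand: ${per_second_monthly(2.01):.2f}/month")       # ~$3.35
```

Swapping in your own request volume and compute time per request shows where the crossover sits for your workload: serverless only wins once per-request compute is short and traffic is sparse enough that even per-second idle time dominates.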
24/7 Inference Serving
Setup: Sustained production traffic requiring one H100 full-time, 720 hours/month.
- On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 720 = $1,447/month
- On-demand H100 PCIe (AWS): ~$3.90/hr x 720 = $2,808/month (reflects June 2025 AWS price reduction, down from ~$6.88/hr)
- Reserved H100 (AWS, 1-year effective rate): ~$2.00/hr x 720 = $1,440/month
Winner: Spheron on-demand at $1,447/month matches AWS 1-year reserved pricing ($1,440/month) with no contract required. Spheron is 48% cheaper than AWS on-demand at $2,808/month. For fault-tolerant workloads, spot pricing on available GPUs can reduce costs further.
Short Training Run (7 Days)
Setup: Full H100 PCIe for 168 hours. For GPU selection guidance on training, see best Nvidia GPUs for LLMs.
- On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 168 = $337.68
- On-demand H100 (AWS): ~$3.90/hr x 168 = $655.20 (reflects June 2025 AWS price reduction)
- AWS 1-year reserved (effective rate): ~$2.00/hr x 168 = $336.00
Winner: Neo-cloud on-demand (Spheron). Spheron on-demand ($337.68) is 48% cheaper than AWS on-demand ($655.20) and matches the AWS 1-year reserved effective rate, without any contract. For short training jobs, there is no reason to sign a reserved contract.
Monthly Burst Workload
Setup: Need 8x H100 for 4 hours, once a month.
- Serverless: Not available at 8-GPU scale on most platforms.
- On-demand (Spheron): 8 x $2.01 x 4 = $64.32/month
- Reserved (AWS 1-yr, 8x H100): 8 x $2.00/hr x 720 hr = $11,520/month, committed whether used or not.
Winner: On-demand by a large margin. Reserved pricing makes no sense for burst workloads.
Provider Examples by Billing Model
Serverless GPU Providers
| Provider | Pricing | Notes |
|---|---|---|
| Modal | Varies by GPU (see Modal pricing page) | Wide GPU selection, fast cold starts (2-4s) |
| Replicate | Per prediction | Model-specific pricing, 16-60s+ cold starts on custom models |
| RunPod Serverless | Per second | FlashBoot achieves 200ms-2s cold starts for optimized containers; large model deployments still see 6-12s+ |
| Baseten | Per call | Enterprise inference platform with private VPCs and SLAs; pricing by contract |
| Fal AI | Per second | Optimized for image/video generation (Flux, SDXL); sub-second inference on popular models |
Serverless GPU prices change frequently. Baseten and Fal AI do not publish standard pricing. Treat these as approximate and check their pricing pages or contact sales for current rates.
On-Demand GPU Providers
| Provider | H100 On-Demand $/hr | Spot $/hr | Billing unit |
|---|---|---|---|
| Spheron | $2.01 (PCIe) | Variable (select GPUs) | Per second |
| RunPod | $2.69 | Available | Per second |
| Lambda Labs | $2.49 (PCIe) / $2.99 (SXM, per-GPU rate in 8xH100 config) | N/A | Per hour |
| CoreWeave | ~$4.76 (GPU component only; ~$6.15/GPU bundled with CPU/RAM) | N/A | Per hour |
| Vast.ai | $1.35-$1.53 | Available | Per minute |
| AWS | ~$3.90 (post-June 2025 reduction) | Variable (check AWS console for current rates) | Per second (1-min min) |
| GCP | ~$3.00-$9.80 (wide regional variance and 2025-2026 price fluctuations; verify current rates) | Variable (check GCP console for current rates) | Per second (1-min min) |
Pricing fluctuates based on GPU availability. The prices above are based on 23 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Provider availability notes: Lambda Labs H100 availability can be limited during peak demand; check current availability before budgeting. Vast.ai is a marketplace, so pricing is volatile and reliability varies by host. GCP H100 on-demand pricing has seen significant fluctuations in 2025-2026; verify current rates before making cost comparisons.
Reserved GPU Providers
| Provider | H100 On-Demand $/hr | H100 Reserved (1yr effective) | Discount |
|---|---|---|---|
| AWS | ~$3.90 (post-June 2025 reduction) | ~$2.00 | ~49% |
| GCP | ~$3.00-$9.80 (regional variance; verify current rates) | ~$4.00 (estimated) | ~59% (vs US standard rate; verify for your region) |
| Azure | ~$6.98 (single-GPU NC40ads_H100_v5, East US) / ~$12.29 per GPU (8-GPU ND96isr H100 v5) | ~$5.50 (estimated; not publicly listed) | ~55% (est.) |
| CoreWeave | ~$4.76 (GPU component only) | Negotiated (contact sales) | Not published; varies by contract |
Azure H100 pricing varies by VM type: $6.98/hr for single-GPU VMs (NC40ads_H100_v5, East US), $12.29/hr per GPU for 8-GPU configurations (ND96isr H100 v5). Pricing is highly region-dependent; verify rates for your target region before budgeting.
For Spheron, volume and reserved pricing is available - contact sales via app.spheron.ai or email. Spheron does not publish fixed reserved rates, but its on-demand rate ($2.01/hr for H100 PCIe) matches AWS 1-year reserved pricing (~$2.00/hr), with no contract commitment required.
When Serverless GPU, On-Demand, or Reserved Saves You Money
| Workload type | Daily GPU hours | Cheapest model | Notes |
|---|---|---|---|
| Low-traffic API | <0.5 hr equivalent | Per-second on-demand | 80% cheaper than serverless at 100 req/day ($3.35 vs $16.67/month); serverless only if zero management outweighs cost |
| Dev / test | 0.5-3 hr | Per-second on-demand | Stop when idle |
| Batch jobs | 2-8 hr | Spot on-demand | Use checkpointing |
| Production inference | 12-24 hr | Spot or on-demand neo-cloud | Beats hyperscaler reserved |
| Long-term 24/7 production | 24 hr for 6+ months | Reserved (if on hyperscaler) | Only if already in AWS/GCP/Azure |
What Is the Breakeven Point for Reserved vs On-Demand?
The simplified breakeven rule: breakeven utilization = 1 - (discount percentage). At a ~49% AWS discount, you break even at 51% utilization. Below that, on-demand is cheaper.
Reserved makes sense when:
(Reserved monthly cost) < (On-demand rate x actual hours used per month)
Example 1 (AWS H100, 720-hour month, ~$3.90/hr on-demand post-June 2025 reduction):
$2.00/hr x 720 = $1,440 (AWS 1-yr reserved effective)
vs. $3.90/hr x X hours = $1,440
X = 369 hours -> You need to use it more than 369 hours/month
(51% utilization) to break even on AWS reserved vs AWS on-demand
But compare to Spheron on-demand:
$2.01/hr x 720 = $1,447/month (no contract required)
Spheron on-demand at $1,447/month is within $7 of AWS 1-year reserved, with no contract.
Example 2 (GCP H100, 720-hour month, ~59% reserved discount vs $9.80/hr US standard rate):
$4.00/hr x 720 = $2,880 (GCP 1-yr reserved effective, estimated)
vs. $9.80/hr x X hours = $2,880
X = 294 hours -> You need 41% utilization to break even on GCP reserved vs GCP on-demand
But GCP reserved at ~$4.00/hr still costs $2,880/month vs Spheron on-demand at $1,447/month.
Switching to a neo-cloud provider beats signing a GCP reserved contract.
The key insight: AWS H100 on-demand pricing dropped to ~$3.90/hr in June 2025 (down from ~$6.88/hr), putting AWS 1-year reserved at ~$2.00/hr effective (~49% discount). Spheron on-demand at $2.01/hr matches that reserved rate without any contract commitment, and for workloads running less than 24/7 the comparison favors Spheron even more strongly. To track actual GPU utilization and avoid paying for idle time, see GPU monitoring best practices.
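The breakeven rule reduces to a few lines of Python. A sketch using the approximate rates quoted in this section:

```python
def breakeven_hours(reserved_rate, on_demand_rate, hours_in_month=720):
    # Hours of actual use per month at which reserved and on-demand cost the same.
    return reserved_rate * hours_in_month / on_demand_rate

def breakeven_utilization(reserved_rate, on_demand_rate):
    # Equivalent to 1 - discount: reserved only wins above this utilization.
    return reserved_rate / on_demand_rate

# AWS H100: ~$2.00/hr 1-yr reserved effective vs ~$3.90/hr on-demand
print(round(breakeven_hours(2.00, 3.90)))          # ~369 hours
print(f"{breakeven_utilization(2.00, 3.90):.0%}")  # ~51%

# GCP H100: ~$4.00/hr reserved (estimated) vs ~$9.80/hr US standard on-demand
print(round(breakeven_hours(4.00, 9.80)))          # ~294 hours
print(f"{breakeven_utilization(4.00, 9.80):.0%}")  # ~41%
```

Plug in the reserved and on-demand rates you are actually quoted; if your expected monthly hours fall below the breakeven figure, skip the contract.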
Spheron's Billing Model
Spheron bills on-demand GPU rentals by the second with no hourly minimum. Spot pricing is available on select GPUs for fault-tolerant workloads at variable rates depending on current GPU availability. There are no contracts, no egress fees, and no reserved commitment required.
| GPU | Spheron On-Demand | AWS On-Demand | AWS 1-yr Reserved |
|---|---|---|---|
| H100 PCIe | $2.01/hr | ~$3.90/hr | ~$2.00/hr |
| A100 80G PCIe | $1.07/hr | ~$3.43/hr | ~$2.00/hr (est.) |
| A100 80G SXM4 | $1.14/hr | ~$3.43/hr | ~$2.00/hr (est.) |
Spheron pricing as of March 23, 2026. Prices fluctuate based on GPU availability. Check current Spheron pricing for live rates.
Spheron H100 PCIe on-demand ($2.01/hr) matches AWS H100 1-year reserved pricing (~$2.00/hr). You get equivalent pricing with no contract or commitment. For billing details and instance types, see docs.spheron.ai/billing.
For H100 rental, A100 rental, H200 rental, and other GPU options, Spheron offers per-second billing with no upfront commitment.
Decision Framework
Work through these questions in order:
- Low-traffic API (fewer than 200 requests/day)? Per-second on-demand is cheaper than serverless at this traffic level, provided the provider bills by the second with no hourly minimum. At 100 requests/day with 2 seconds of compute each, per-second on-demand costs $3.35/month vs $16.67/month for serverless (80% cheaper). Choose serverless only if zero instance management outweighs cost and you can tolerate the cold starts.
- Fault-tolerant workload (checkpointed training, batch inference)? Use spot GPU when available. Save on compute costs with no contract. See our GPU cost optimization playbook for checkpoint strategies.
- Running 24/7 on AWS or GCP already? Reserved may be worth calculating. But compare the reserved rate to neo-cloud on-demand first. You may save more by switching providers than by signing a contract.
- Everything else (training, sustained serving, dev/test, burst workloads)? On-demand with per-second billing. Stop the instance when you are done. No idle waste, no contract risk.
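The four questions above can be encoded as a simple decision function. This is a sketch of this post's framework, not a universal rule; the 200 requests/day threshold is the figure used in this section:

```python
def pick_billing_model(requests_per_day=None, fault_tolerant=False,
                       on_hyperscaler_24_7=False):
    # Encodes the decision framework from this post, evaluated in order.
    if requests_per_day is not None and requests_per_day < 200:
        # Low-traffic API: per-second billing beats serverless on cost.
        return "per-second on-demand (serverless only if zero management matters most)"
    if fault_tolerant:
        # Checkpointed training or batch inference tolerates interruption.
        return "spot"
    if on_hyperscaler_24_7:
        # Worth calculating, but compare against neo-cloud on-demand first.
        return "reserved (compare neo-cloud on-demand before signing)"
    # Training, sustained serving, dev/test, burst workloads.
    return "per-second on-demand"

print(pick_billing_model(requests_per_day=100))      # low-traffic API
print(pick_billing_model(fault_tolerant=True))       # checkpointed training
print(pick_billing_model(on_hyperscaler_24_7=True))  # 24/7 on AWS/GCP/Azure
print(pick_billing_model())                          # everything else
```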
The most common mistake is defaulting to hyperscaler reserved pricing without comparing to neo-cloud on-demand. AWS H100 reserved at ~$2.00/hr effective requires a 1-year commitment but is now comparable to Spheron H100 PCIe on-demand at $2.01/hr. The difference: Spheron requires no contract. For workloads running less than 24/7, on-demand is always the better choice.
Whether you need serverless flexibility for intermittent workloads or 24/7 on-demand throughput for production inference (with checkpoint strategies for fault-tolerant workloads), Spheron offers transparent per-second billing with no contracts or egress fees. H100 PCIe starts at $2.01/hr with no commitment required, matching AWS 1-year reserved pricing without the lock-in.
