GPU clouds offer three billing models. Pick the wrong one and you'll overpay by 2-5x. Serverless charges per call with no idle cost. On-demand bills by the second, minute, or hour for a dedicated instance, depending on the provider. Reserved locks you in for months at a steep discount. The right answer depends entirely on your workload pattern.
For provider-specific GPU pricing comparisons, see our GPU cloud pricing comparison. For inference-specific GPU selection guidance, see best GPU for AI inference in 2026.
The Three Billing Models
When Should You Use Serverless GPU?
Serverless GPU platforms abstract the hardware entirely. You submit a request or function call, the platform provisions a GPU, runs your code, and bills per inference call or per compute-second. You pay nothing when there is no traffic.
Pros: Zero idle cost, no instance management, scales to zero automatically.
Cons: Cold starts ranging from 200ms-4s for small models on optimized platforms (Modal, RunPod FlashBoot) to 6-60s for large LLM deployments, depending on container size and caching. No hardware control. Limited to what the platform supports. Not available at multi-GPU scale (you cannot run an 8xH100 job serverlessly on most platforms).
Best for: Async batch jobs where cold starts are acceptable, prototyping, demos, and situations where zero instance management is worth paying a premium for. Note that for low-traffic APIs (100-500 requests/day), per-second on-demand billing (Spheron, RunPod) is typically far cheaper than serverless and has no cold starts; the low-traffic scenario in the cost modeling section below runs the exact numbers.
Providers: Modal, Replicate, RunPod Serverless, Baseten, Fal AI.
When Should You Use On-Demand GPU?
On-demand rents a dedicated GPU instance that stays running until you stop it. You get full hardware control, consistent throughput, and instance startup in under 60 seconds. You are billed whether the GPU is busy or idle. Billing granularity varies by provider: Spheron bills per second, Lambda Labs per hour, Vast.ai per minute.
Pros: Full hardware control, predictable throughput, no cold starts, works at any scale including multi-GPU clusters.
Cons: You pay for idle time. If you provision an H100 for 24 hours but only use it 3 hours, you pay for 24.
Best for: Training runs, sustained inference serving, interactive development, any workload requiring consistent throughput.
Providers: Spheron, RunPod, Lambda Labs, CoreWeave, Vast.ai.
When Should You Use Reserved GPU?
Reserved pricing on hyperscalers requires committing to a 1- or 3-year contract in exchange for a discount over on-demand. GPU reserved discounts typically range from 30-75%; AWS H100 1-year reserved sits at ~49% (reflecting the June 2025 on-demand price reduction to ~$3.90/hr). Some neo-clouds (GPU-specialized providers such as Spheron, RunPod, and Lambda Labs) offer reserved pricing with shorter commitment windows, and CoreWeave offers negotiated reserved pricing with minimum commitment periods (exact terms vary by contract). Either way, you pay the reserved rate every month regardless of actual usage.
Quick rule: Reserved makes sense when your utilization exceeds (1 - discount percentage). At a ~49% AWS discount, you need 51% utilization to break even. See full calculations below.
Pros: Large discounts for 24/7 workloads. Cost predictability for budget planning.
Cons: Contract commitment. You pay even if you do not use the GPU. Reserved rates on hyperscalers still exceed neo-cloud on-demand pricing for the same GPU.
Best for: Production workloads running 24/7 on hyperscalers where you are already invested in the AWS/GCP/Azure ecosystem.
Providers: AWS (EC2 reserved), GCP (committed-use contracts), Azure (reserved VMs), CoreWeave (minimum commitment required for discounted rates; terms vary by contract).
Spot Pricing: A Hybrid Option
Spot GPUs are excess capacity sold below the on-demand rate. Providers including Spheron, RunPod, and Vast.ai offer spot instances at variable discounts depending on availability. Spot instances can be interrupted with short notice (2 minutes on AWS, 30 seconds on GCP/Azure), so they require checkpointing or fault-tolerant job design.
Spot pricing is not a separate billing model, but it sits between on-demand and reserved. You get below-on-demand rates without a contract, at the cost of interruption risk. Best for training jobs and batch inference that checkpoint state regularly.
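Fault-tolerant job design for spot boils down to saving state often enough that an interruption costs you at most one checkpoint interval. A minimal sketch, assuming a local checkpoint file and a stand-in training loop (the path, step counts, and state contents are illustrative, not from any provider's SDK):

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"  # illustrative path; use durable storage in practice
CHECKPOINT_EVERY = 100              # steps between saves

def load_checkpoint():
    # Resume from the last saved step, or start fresh after an interruption.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": None}

def save_checkpoint(ckpt):
    # Write to a temp file and rename atomically, so a mid-write
    # interruption cannot corrupt the previous checkpoint.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(ckpt, f)
    os.replace(tmp, CHECKPOINT_PATH)

def train(total_steps):
    ckpt = load_checkpoint()
    for step in range(ckpt["step"], total_steps):
        ckpt["state"] = f"weights-after-step-{step}"  # stand-in for real training work
        ckpt["step"] = step + 1
        if ckpt["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt)  # lose at most CHECKPOINT_EVERY steps on preemption
    save_checkpoint(ckpt)
    return ckpt

result = train(total_steps=250)
print(result["step"])  # 250
```

If the instance is reclaimed mid-run, relaunching the same script resumes from the last saved step rather than from zero, which is what makes spot's interruption risk acceptable for training and batch inference.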
Comparison Table
| Compared | Serverless | On-Demand | Reserved |
|---|---|---|---|
| Billing unit | Per call / second | Per second / minute / hour (varies by provider) | Monthly flat (committed) |
| Cold start | 200ms-4s (small/optimized); 6-60s+ (large LLMs) | <60 seconds | None |
| Idle cost | Zero | Full rate | Full rate (committed) |
| Contract | None | None | Varies (months to years) |
| GPU control | None (abstracted) | Full | Full |
| Best for | Intermittent / async | Training / sustained serving | Predictable 24/7 load |
Cost Modeling for Four Workload Types
These scenarios use real pricing to show which billing model wins in practice. Serverless rates are approximate since providers change them frequently. Modal's published per-second rate is $0.002778/GPU-second ($9.99/hr). Under sustained load with warm containers and keep-alive optimization, effective rates can be lower ($3.95-$4.76/hr). The examples in this post use the published per-second rate to show worst-case serverless costs. Check Modal's pricing page for current rates.
Low-Traffic Inference API
Setup: 100 requests/day, each needing 2 seconds of H100 compute. Total compute: 200 seconds/day = 3.33 minutes.
- Serverless (Modal, ~$0.002778/GPU-second): 200s x $0.002778 = $0.5556/day = $16.67/month
- On-demand H100 PCIe with per-minute minimum billing (e.g., Vast.ai, ~$1.50/hr = $0.025/min): 100 requests x 1-min minimum = $2.50/day = $75/month
- On-demand H100 PCIe per-second billing (Spheron, $2.01/hr): 3.33 min x $0.0335/min = $0.1116/day = $3.35/month
Winner: Per-second on-demand (Spheron, $3.35/month). It is 80% cheaper than serverless ($16.67/month) and avoids cold starts entirely. Serverless makes sense here only if you want zero instance management and can tolerate cold starts of 200ms-4s for small models or up to 6-60s for large LLMs, and are willing to pay 5x more per month for that convenience. Per-request billing with 1-minute minimums costs ~23x more than per-second on-demand and should be avoided for this workload type.
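The arithmetic above can be reproduced with a few lines of Python. This is a sketch using the approximate rates quoted in this scenario, not live prices:

```python
# Monthly cost of a low-traffic API (100 requests/day, 2 s of H100 compute each)
# under three billing schemes. Rates are the approximate figures from the
# scenario above, not live prices.

DAYS = 30
REQUESTS_PER_DAY = 100
SECONDS_PER_REQUEST = 2

def serverless_monthly(rate_per_gpu_second):
    # Billed per compute-second; zero idle cost.
    return REQUESTS_PER_DAY * SECONDS_PER_REQUEST * rate_per_gpu_second * DAYS

def per_minute_min_monthly(rate_per_hour):
    # Each request is rounded up to a 1-minute billing minimum.
    return REQUESTS_PER_DAY * (rate_per_hour / 60) * DAYS

def per_second_monthly(rate_per_hour):
    # True per-second billing: pay only for actual compute time.
    daily_seconds = REQUESTS_PER_DAY * SECONDS_PER_REQUEST
    return daily_seconds * (rate_per_hour / 3600) * DAYS

print(f"Serverless:           ${serverless_monthly(0.002778):.2f}/month")   # ~$16.67
print(f"Per-minute minimum:   ${per_minute_min_monthly(1.50):.2f}/month")   # ~$75.00
print(f"Per-second on-demand: ${per_second_monthly(2.01):.2f}/month")       # ~$3.35
```

Swapping in your own request volume and compute time per request shows where the crossover sits for your workload: serverless only wins once per-request compute is short and traffic is sparse enough that even per-second idle time dominates.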
24/7 Inference Serving
Setup: Sustained production traffic requiring one H100 full-time, 720 hours/month.
- On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 720 = $1,447/month
- On-demand H100 PCIe (AWS): ~$3.90/hr x 720 = $2,808/month (reflects June 2025 AWS price reduction, down from ~$6.88/hr)
- Reserved H100 (AWS, 1-year effective rate): ~$2.00/hr x 720 = $1,440/month
Winner: Spheron on-demand at $1,447/month matches AWS 1-year reserved pricing ($1,440/month) with no contract required. Spheron is 48% cheaper than AWS on-demand at $2,808/month. For fault-tolerant workloads, spot pricing on available GPUs can reduce costs further.
Short Training Run (7 Days)
Setup: Full H100 PCIe for 168 hours. For GPU selection guidance on training, see best Nvidia GPUs for LLMs.
- On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 168 = $337.68
- On-demand H100 (AWS): ~$3.90/hr x 168 = $655.20 (reflects June 2025 AWS price reduction)
- AWS 1-year reserved (effective rate): ~$2.00/hr x 168 = $336.00
Winner: Neo-cloud on-demand (Spheron). Spheron on-demand ($337.68) is 48% cheaper than AWS on-demand ($655.20) and matches the AWS 1-year reserved effective rate, without any contract. For short training jobs, there is no reason to sign a reserved contract.
Monthly Burst Workload
Setup: Need 8x H100 for 4 hours, once a month.
- Serverless: Not available at 8-GPU scale on most platforms.
- On-demand (Spheron): 8 x $2.01 x 4 = $64.32/month
- Reserved (AWS 1-yr, 8x H100): 8 x $2.00/hr x 720 hr = $11,520/month, committed whether used or not.
Winner: On-demand by a large margin. Reserved pricing makes no sense for burst workloads.
Provider Examples by Billing Model
Serverless GPU Providers
| Provider | Pricing | Notes |
|---|---|---|
| Modal | Varies by GPU (see Modal pricing page) | Wide GPU selection, fast cold starts (2-4s) |
| Replicate | Per prediction | Model-specific pricing, 16-60s+ cold starts on custom models |
| RunPod Serverless | Per second | FlashBoot achieves 200ms-2s cold starts for optimized containers; large model deployments still see 6-12s+ |
| Baseten | Per call | Enterprise inference platform with private VPCs and SLAs; pricing by contract |
| Fal AI | Per second | Optimized for image/video generation (Flux, SDXL); sub-second inference on popular models |
Serverless GPU prices change frequently. Baseten and Fal AI do not publish standard pricing. Treat these as approximate and check their pricing pages or contact sales for current rates.
On-Demand GPU Providers
| Provider | H100 On-Demand $/hr | Spot $/hr | Billing unit |
|---|---|---|---|
| Spheron | $2.01 (PCIe) | Variable (select GPUs) | Per second |
| RunPod | $2.69 | Available | Per second |
| Lambda Labs | $2.49 (PCIe) / $2.99 (SXM, per-GPU rate in 8xH100 config) | N/A | Per hour |
| CoreWeave | ~$4.76 (GPU component only; ~$6.15/GPU bundled with CPU/RAM) | N/A | Per hour |
| Vast.ai | $1.35-$1.53 | Available | Per minute |
| AWS | ~$3.90 (post-June 2025 reduction) | Variable (check AWS console for current rates) | Per second (1-min min) |
| GCP | ~$3.00-$9.80 (wide regional variance and 2025-2026 price fluctuations; verify current rates) | Variable (check GCP console for current rates) | Per second (1-min min) |
Pricing fluctuates based on GPU availability. The prices above are based on 23 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Provider availability notes: Lambda Labs H100 availability can be limited during peak demand; check current availability before budgeting. Vast.ai is a marketplace, so pricing is volatile and reliability varies by host. GCP H100 on-demand pricing has seen significant fluctuations in 2025-2026; verify current rates before making cost comparisons.
Reserved GPU Providers
| Provider | H100 On-Demand $/hr | H100 Reserved (1yr effective) | Discount |
|---|---|---|---|
| AWS | ~$3.90 (post-June 2025 reduction) | ~$2.00 | ~49% |
| GCP | ~$3.00-$9.80 (regional variance; verify current rates) | ~$4.00 (estimated) | ~59% (vs US standard rate; verify for your region) |
| Azure | ~$6.98 (single-GPU NC40ads_H100_v5, East US) / ~$12.29 per GPU (8-GPU ND96isr H100 v5) | ~$5.50 (estimated; not publicly listed) | ~55% (est.) |
| CoreWeave | ~$4.76 (GPU component only) | Negotiated (contact sales) | Not published; varies by contract |
Azure H100 pricing varies by VM type: $6.98/hr for single-GPU VMs (NC40ads_H100_v5, East US), $12.29/hr per GPU for 8-GPU configurations (ND96isr H100 v5). Pricing is highly region-dependent; verify rates for your target region before budgeting.
For Spheron, volume and reserved pricing is available - contact sales via app.spheron.ai or email. Spheron does not publish fixed reserved rates, but its on-demand rate ($2.01/hr for H100 PCIe) matches AWS 1-year reserved pricing (~$2.00/hr), with no contract commitment required.
When Serverless GPU, On-Demand, or Reserved Saves You Money
| Workload type | Daily GPU hours | Cheapest model | Notes |
|---|---|---|---|
| Low-traffic API | <0.5 hr equivalent | Per-second on-demand | 80% cheaper than serverless at 100 req/day ($3.35 vs $16.67/month); serverless only if zero management outweighs cost |
| Dev / test | 0.5-3 hr | Per-second on-demand | Stop when idle |
| Batch jobs | 2-8 hr | Spot on-demand | Use checkpointing |
| Production inference | 12-24 hr | Spot or on-demand neo-cloud | Beats hyperscaler reserved |
| Long-term 24/7 production | 24 hr for 6+ months | Reserved (if on hyperscaler) | Only if already in AWS/GCP/Azure |
What Is the Breakeven Point for Reserved vs On-Demand?
The simplified breakeven rule: breakeven utilization = 1 - (discount percentage). At a ~49% AWS discount, you break even at 51% utilization. Below that, on-demand is cheaper.
Reserved makes sense when:
(Reserved monthly cost) < (On-demand rate x actual hours used per month)
Example 1 (AWS H100, 720-hour month, ~$3.90/hr on-demand post-June 2025 reduction):
$2.00/hr x 720 = $1,440 (AWS 1-yr reserved effective)
vs. $3.90/hr x X hours = $1,440
X = 369 hours -> You need to use it more than 369 hours/month
(51% utilization) to break even on AWS reserved vs AWS on-demand
But compare to Spheron on-demand:
$2.01/hr x 720 = $1,447/month (no contract required)
Spheron on-demand at $1,447/month is within $7 of AWS 1-year reserved, with no contract.
Example 2 (GCP H100, 720-hour month, ~59% reserved discount vs $9.80/hr US standard rate):
$4.00/hr x 720 = $2,880 (GCP 1-yr reserved effective, estimated)
vs. $9.80/hr x X hours = $2,880
X = 294 hours -> You need 41% utilization to break even on GCP reserved vs GCP on-demand
But GCP reserved at ~$4.00/hr still costs $2,880/month vs Spheron on-demand at $1,447/month.
Switching to a neo-cloud provider beats signing a GCP reserved contract.
The key insight: AWS H100 on-demand pricing dropped to ~$3.90/hr in June 2025 (down from ~$6.88/hr), putting AWS 1-year reserved at ~$2.00/hr effective (~49% discount). Spheron on-demand at $2.01/hr matches that reserved rate without any contract commitment, and for workloads running less than 24/7 the comparison favors Spheron even more strongly. To track actual GPU utilization and avoid paying for idle time, see GPU monitoring best practices.
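The breakeven rule reduces to a few lines of Python. A sketch using the approximate rates quoted in this section:

```python
def breakeven_hours(reserved_rate, on_demand_rate, hours_in_month=720):
    # Hours of actual use per month at which reserved and on-demand cost the same.
    return reserved_rate * hours_in_month / on_demand_rate

def breakeven_utilization(reserved_rate, on_demand_rate):
    # Equivalent to 1 - discount: reserved only wins above this utilization.
    return reserved_rate / on_demand_rate

# AWS H100: ~$2.00/hr 1-yr reserved effective vs ~$3.90/hr on-demand
print(round(breakeven_hours(2.00, 3.90)))          # ~369 hours
print(f"{breakeven_utilization(2.00, 3.90):.0%}")  # ~51%

# GCP H100: ~$4.00/hr reserved (estimated) vs ~$9.80/hr US standard on-demand
print(round(breakeven_hours(4.00, 9.80)))          # ~294 hours
print(f"{breakeven_utilization(4.00, 9.80):.0%}")  # ~41%
```

Plug in the reserved and on-demand rates you are actually quoted; if your expected monthly hours fall below the breakeven figure, skip the contract.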
Spheron's Billing Model
Spheron bills on-demand GPU rentals by the second with no hourly minimum. Spot pricing is available on select GPUs for fault-tolerant workloads at variable rates depending on current GPU availability. There are no contracts, no egress fees, and no reserved commitment required.
| GPU | Spheron On-Demand | AWS On-Demand | AWS 1-yr Reserved |
|---|---|---|---|
| H100 PCIe | $2.01/hr | ~$3.90/hr | ~$2.00/hr |
| A100 80G PCIe | $1.07/hr | ~$3.43/hr | ~$2.00/hr (est.) |
| A100 80G SXM4 | $1.14/hr | ~$3.43/hr | ~$2.00/hr (est.) |
Spheron pricing as of March 23, 2026. Prices fluctuate based on GPU availability. Check current Spheron pricing for live rates.
Spheron H100 PCIe on-demand ($2.01/hr) matches AWS H100 1-year reserved pricing (~$2.00/hr). You get equivalent pricing with no contract or commitment. For billing details and instance types, see docs.spheron.ai/billing.
For H100 rental, A100 rental, H200 rental, and other GPU options, Spheron offers per-second billing with no upfront commitment.
Decision Framework
Work through these questions in order:
- Low-traffic API (fewer than 200 requests/day)? Per-second on-demand is cheaper than serverless at this traffic level, provided the provider bills by the second with no hourly minimum. At 100 requests/day with 2 seconds of compute each, per-second on-demand costs $3.35/month vs $16.67/month for serverless (80% cheaper). Choose serverless only if zero instance management outweighs cost and you can tolerate the cold starts.
- Fault-tolerant workload (checkpointed training, batch inference)? Use spot GPU when available. Save on compute costs with no contract. See our GPU cost optimization playbook for checkpoint strategies.
- Running 24/7 on AWS or GCP already? Reserved may be worth calculating. But compare the reserved rate to neo-cloud on-demand first. You may save more by switching providers than by signing a contract.
- Everything else (training, sustained serving, dev/test, burst workloads)? On-demand with per-second billing. Stop the instance when you are done. No idle waste, no contract risk.
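The four questions above can be encoded as a simple decision function. This is a sketch of this post's framework, not a universal rule; the 200 requests/day threshold is the figure used in this section:

```python
def pick_billing_model(requests_per_day=None, fault_tolerant=False,
                       on_hyperscaler_24_7=False):
    # Encodes the decision framework from this post, evaluated in order.
    if requests_per_day is not None and requests_per_day < 200:
        # Low-traffic API: per-second billing beats serverless on cost.
        return "per-second on-demand (serverless only if zero management matters most)"
    if fault_tolerant:
        # Checkpointed training or batch inference tolerates interruption.
        return "spot"
    if on_hyperscaler_24_7:
        # Worth calculating, but compare against neo-cloud on-demand first.
        return "reserved (compare neo-cloud on-demand before signing)"
    # Training, sustained serving, dev/test, burst workloads.
    return "per-second on-demand"

print(pick_billing_model(requests_per_day=100))      # low-traffic API
print(pick_billing_model(fault_tolerant=True))       # checkpointed training
print(pick_billing_model(on_hyperscaler_24_7=True))  # 24/7 on AWS/GCP/Azure
print(pick_billing_model())                          # everything else
```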
The most common mistake is defaulting to hyperscaler reserved pricing without comparing to neo-cloud on-demand. AWS H100 reserved at ~$2.00/hr effective requires a 1-year commitment but is now comparable to Spheron H100 PCIe on-demand at $2.01/hr. The difference: Spheron requires no contract. For workloads running less than 24/7, on-demand is always the better choice.
Whether you need serverless flexibility for intermittent workloads or 24/7 on-demand throughput for production inference (with checkpoint strategies for fault-tolerant workloads), Spheron offers transparent per-second billing with no contracts or egress fees. H100 PCIe starts at $2.01/hr with no commitment required, matching AWS 1-year reserved pricing without the lock-in.
