Modal solved a real problem. Developers hated managing GPU infrastructure, so Modal abstracted it away with Python decorators and per-second billing. Write a function, add @app.function(gpu="H100"), and Modal handles the container, the scaling, and the teardown. That is a genuinely compelling pitch.
The abstraction has costs, though. Cold starts can add anywhere from a few seconds to over a minute on the first request to a new container, with large model deployments at the high end. The effective H100 rate runs ~$3.95/hr under sustained load, which is roughly 2x what bare-metal providers charge. Modal's SDK is the only way to run Modal-decorated functions, so migrating means rewriting your application code. And billing opacity makes it hard to predict monthly costs when pipeline duration varies. For teams running burst inference workloads with long idle periods, Modal is often the right call. For training runs, always-on inference, or cost-sensitive deployments, the math usually points elsewhere. This post covers 10 alternatives across the spectrum, from serverless-first platforms to bare-metal GPU rentals, with enough pricing detail to make an informed choice. If you are also comparing RunPod, see our RunPod alternatives guide for a parallel breakdown.
Why Teams Look Beyond Modal
Cold start latency
Modal's first-request latency for a cold container ranges from a few seconds for simple workloads to over a minute for large model deployments, without optimization. Modal introduced GPU memory snapshots in 2025 (currently in alpha) that capture the full GPU state, including model weights in VRAM, CUDA kernels, and execution contexts. With explicit opt-in, some workloads achieve up to 10x faster cold starts. For very large models where weights exceed a single GPU's VRAM, some loading overhead remains. The feature is still in alpha and requires extra setup work. For batch jobs that tolerate async processing, cold starts are irrelevant. For synchronous user-facing APIs where a 200ms response time is expected, even a few extra seconds of latency is a dealbreaker. The workaround is keeping warm replicas running at all times, which eliminates the cost benefit of serverless. You end up paying for idle GPU time anyway.
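The warm-replica workaround is easy to put in numbers. A quick sketch using the ~$3.95/hr effective H100 rate cited above (illustrative; actual rates and replica counts vary):

```python
# Illustrative cost of the keep-warm workaround: one H100 replica left
# running around the clock at Modal's ~$3.95/hr effective rate.
EFFECTIVE_H100_RATE = 3.95  # USD/hr, subject to change

hours_per_month = 24 * 30
warm_replica_monthly = EFFECTIVE_H100_RATE * hours_per_month

print(f"${warm_replica_monthly:,.0f}/month")  # ≈ $2,844/month per warm replica
```

At that point you are paying dedicated-instance money for serverless infrastructure, which is the crux of the utilization math later in this post.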
Per-second billing opacity
Estimating cost for a single inference call is straightforward. Estimating cost for a multi-step pipeline with variable GPU memory pressure and unpredictable function duration is not. Teams regularly see invoices that diverge significantly from back-of-the-envelope estimates because a single slow call in a high-traffic pipeline multiplies across the bill.
SDK lock-in
Modal-decorated functions are Modal-specific. The @app.function decorator, the volume mounts, the secret management, the web endpoint syntax: all of it requires Modal's runtime to execute. Moving to a different platform means rewriting workload code, not just changing environment variables. That is a meaningful switching cost that grows with every function you add.
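One way to cap that switching cost is to keep model logic in plain Python and confine Modal to a thin wrapper, so only the wrapper is rewritten on migration. A minimal sketch (the `run_inference` body is a hypothetical placeholder for your model call, and the commented wrapper mirrors the decorator syntax described above):

```python
# Portability pattern: core logic has no Modal imports, so it runs anywhere.
def run_inference(prompt: str) -> str:
    # Placeholder for the actual model call (load weights, tokenize, generate).
    return f"completion for: {prompt}"

# The Modal-specific surface area stays in a thin wrapper like this:
#
#   import modal
#   app = modal.App("inference")
#
#   @app.function(gpu="H100")
#   def modal_entrypoint(prompt: str) -> str:
#       return run_inference(prompt)

if __name__ == "__main__":
    print(run_inference("hello"))
```

This does not remove the lock-in for volumes, secrets, and web endpoints, but it keeps the bulk of your code platform-agnostic.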
Pricing ceiling for sustained workloads
At ~$3.95/hr effective for H100 under sustained load, Modal is more expensive than every major dedicated GPU provider. Spheron is $2.01/hr. Lambda starts at $2.86/hr (H100 PCIe, 1x) or $3.78/hr (H100 SXM, 1x). RunPod is $2.39/hr (H100 PCIe, Secure Cloud) or $2.69/hr (H100 SXM). For training jobs or high-throughput inference running at over 51% GPU utilization, the per-second billing model stops being an advantage and becomes a significant cost multiplier compared to hourly or per-minute billing on dedicated instances.
Quick Comparison Table
| Provider | H100 Price | Billing | Cold Starts | Best For |
|---|---|---|---|---|
| Modal | ~$3.95/hr (effective) | Per-second | Yes (5s-2min, model-dependent) | Bursty serverless inference |
| Spheron | $2.01/hr | Per-minute | None | Training, sustained inference |
| RunPod | $2.39/hr (PCIe) / $2.69/hr (SXM) | Per-second | Serverless only | Mixed workloads |
| Lambda | $2.86/hr (PCIe) / $3.78/hr (SXM, 1x) | Per-hour | None | Research, reserved clusters |
| CoreWeave | $4.25/hr+ (on-demand) | Per-hour | None | Enterprise scale |
| Replicate | $5.49/hr | Per-second | Yes | Inference API, prototyping |
| Baseten | ~$6.50/hr (H100) | Per-minute | Yes | Model serving APIs |
| Cerebrium | Usage-based | Per-second | Yes | Serverless ML pipelines |
| Beam Cloud | ~$3.50/hr (H100) | Per-second | Yes | Python-native serverless |
| Hugging Face | $0.50/hr (T4), $10/hr (H100, GCP only), $5/hr (H200, AWS) | Per-minute | None | Managed model hosting |
| NVIDIA DGX Cloud Lepton | Usage-based | Per-second | Yes | LLM inference APIs |
All third-party pricing is based on publicly listed on-demand rates as of March 22, 2026, and may fluctuate. Check each provider's pricing page for current rates.
1. Spheron: Bare-Metal GPU at Lower Cost Than Modal
H200 SXM: $4.54/hr | H100 SXM: $2.01/hr | A100 80GB: $1.07/hr | L40S: $0.91/hr | RTX 4090: $0.58/hr | Per-minute billing | No contracts
Pricing as of March 22, 2026. Rates can fluctuate based on GPU availability.
Spheron is the most direct cost alternative to Modal for teams who can manage their own instances. H100 SXM at $2.01/hr is one of the lowest published rates in the managed GPU market. There are no cold starts because you have a dedicated instance. There is no SDK to learn because you get SSH access and full root privileges. You bring your own container or run whatever software stack you need.
The tradeoff is exactly what Modal sells: you have to manage the infrastructure. There is no auto-scaling to zero, no function decorator syntax, no automatic container builds. For training runs, long-running inference endpoints, or any workload where you are running at over 51% GPU utilization, Spheron is the economical choice. For true serverless burst workloads with high idle ratios, pairing Spheron for training with Modal or Cerebrium for the serverless inference layer often makes sense.
What Spheron does well
- Transparent per-minute billing with no minimum usage period
- H100, A100, H200, B200, L40S, and RTX-series GPUs available on demand, with multi-GPU cluster instances (up to 8x H100 with InfiniBand) for distributed training
- GPU provider network spanning North America and Europe (Voltage Park, DataCrunch, TensorDock, Sesterce, Spheron AI, Massed Compute). See the regions and providers guide for current availability.
- Full bare-metal access: custom CUDA and NVIDIA drivers, root access, no hypervisor overhead
- No vendor lock-in: instances run standard Linux, any software stack works
- VS Code Remote and Jupyter Notebook access for interactive ML development
- Spot instances available for cost-sensitive experiments (H200 spot: $1.78/hr, H100 spot: $0.99/hr, A100 spot: $0.61/hr, L40S spot: $0.41/hr). Use the volume mounting guide to set up a persistent volume for checkpoints so spot interruptions do not lose training progress.
- Programmatic deployment via REST API for CI/CD integration, no proprietary SDK required
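The spot-plus-checkpoint pattern from the list above can be sketched in a few lines. The checkpoint path and state contents are hypothetical placeholders; the detail that matters is the atomic rename, so a spot interruption mid-write never corrupts the last good checkpoint:

```python
import json
import os

CKPT_PATH = "/mnt/volume/checkpoint.json"  # assumed persistent-volume mount

def load_checkpoint(path=CKPT_PATH):
    # Resume from the last saved state, or start fresh on the first run.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}

def save_checkpoint(state, path=CKPT_PATH):
    # Write to a temp file, then atomically rename into place: an
    # interruption mid-write leaves the previous checkpoint intact.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
```

Real training jobs would checkpoint model and optimizer state (e.g. via `torch.save`) rather than JSON, but the resume-or-start-fresh structure is the same.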
Where it falls short
- No serverless or scale-to-zero offering
- Instance management is your responsibility (no auto-scaling)
- Not optimized for sub-minute workloads where per-second billing matters
Best for: Teams running training jobs, sustained inference workloads, or anyone who has done the utilization math and found that dedicated GPU time is cheaper than per-second billing. See GPU pricing for current rates, browse the H100 rental page for specs and availability, review the instance types guide to pick the right GPU tier, or follow the Spheron getting started guide to deploy your first instance.
2. RunPod: On-Demand and Serverless in One Platform
H100 PCIe: $2.39/hr (Secure Cloud) | H100 SXM: $2.69/hr on-demand | Serverless endpoints available | Per-second billing on serverless
RunPod covers both halves of the Modal use case in one platform. RunPod Serverless is a direct Modal competitor: you deploy functions that scale to zero and pay per execution second. RunPod On-Demand is a traditional GPU rental marketplace for training and long-running inference.
The two-in-one model is convenient for teams whose workloads span both patterns. You can train a model on an on-demand GPU and deploy inference on RunPod Serverless without changing platforms. The developer experience on RunPod Serverless is solid, though not quite as polished as Modal's.
What RunPod does well
- RunPod Serverless as a direct Modal competitor with per-second billing
- Active community and template library reduces deployment friction
- On-demand instances available for training at lower effective rates than Modal's per-second billing
- Simple UI and API for both serverless and dedicated workloads
Where it falls short
- Serverless cold starts still exist on RunPod (5-20s depending on container size)
- On-demand pricing is higher than Spheron for pure training workloads
- Marketplace-sourced GPUs can have variable uptime guarantees
Best for: Teams who want serverless inference and dedicated training under one account, without maintaining two separate platforms. See our RunPod alternatives guide for a deeper breakdown of how RunPod stacks up against other providers.
3. Lambda Labs: Best for Reserved Training Clusters
H100 PCIe: $2.86/hr (1x) | H100 SXM: $3.78/hr (1x) / $3.44/hr (8x) | Per-hour billing | On-demand and reserved options
Lambda has established itself as the research-tier GPU cloud. The hardware is well-maintained, the NVIDIA relationship means early access to new GPU generations, and the platform is designed around the assumption that you are running multi-day training jobs, not 30-second inference calls.
On-demand H100 availability fluctuates significantly. For sustained training workloads, Lambda's reserved instances (available at volume discounts) are often the practical path.
What Lambda does well
- Hardware quality and reliability reputation among ML researchers
- Large multi-node cluster options for distributed training
- Clean, minimal interface without overwhelming feature complexity
- Strong support for reserved instance pricing on A100 and H100 clusters
Where it falls short
- On-demand H100 availability can be constrained during peak periods
- Per-hour minimum billing is wasteful for sub-hour jobs
- No serverless or function-based deployment option
Best for: Research teams and ML engineers running multi-hour training jobs who need a reliable, well-maintained GPU environment without managing cloud infrastructure at the hyperscaler level.
4. CoreWeave: Kubernetes-Native Enterprise GPU
H100 PCIe: $4.25/hr+ (on-demand, GPU component) | Contract pricing available | Kubernetes-native
CoreWeave operates at a different scale than most GPU cloud providers. The platform is built on Kubernetes, offers InfiniBand-connected clusters for distributed training, and targets enterprise workloads where reliability SLAs and compliance requirements matter.
The H100 PCIe GPU component starts at $4.25/hr on-demand, with CPU, RAM, and storage billed separately. H100 HGX configurations start at $4.76/hr per GPU. CoreWeave is not competing on per-unit price. The value proposition is cluster scale, network fabric, and enterprise contracting. Volume customers see rates well below the on-demand list price.
What CoreWeave does well
- Kubernetes-native: if your team already uses K8s, CoreWeave fits naturally
- InfiniBand interconnect for low-latency distributed training at scale
- Enterprise SLAs, compliance documentation, dedicated account teams
- Massive cluster size available (hundreds of H100s in a single job)
Where it falls short
- On-demand pricing is not competitive for single-instance workloads
- Contract-based discounts require volume commitments
- Overkill for teams running individual training jobs or inference endpoints
Best for: Enterprise teams scaling distributed training to 100+ GPUs who need contract pricing, compliance documentation, and a Kubernetes-native environment.
5. Replicate: Serverless Model Inference via API
H100: $5.49/hr ($0.001525/sec) | Public model registry | No deployment required for hosted models
Replicate takes a different angle than Modal. Instead of deploying your own code, you either use models from Replicate's public registry or push your own model in Replicate's format. The API is clean and the hosted models (Stable Diffusion, LLaMA variants, Flux) are available with a single API call.
For prototyping or building products on top of existing open-source models, Replicate eliminates the deployment work entirely. You pay per inference call without thinking about GPU allocation.
What Replicate does well
- Public registry of popular models available immediately via API
- Clean, consistent inference API with no deployment work for hosted models
- Per-second billing means you only pay for actual execution time
- Simple Python and JavaScript clients
Where it falls short
- H100 at $5.49/hr is more expensive than dedicated bare-metal alternatives
- Limited to inference; no training support
- Vendor lock-in to Replicate's model format for custom deployments
- Cold starts exist, especially for less popular models with low request frequency
Best for: Prototyping and product development on top of popular open-source models where deployment work is a bottleneck, not cost optimization. For teams who want to self-host image generation models like Stable Diffusion or Flux on dedicated hardware instead, see the Spheron image generation guide.
6. Baseten: Model Serving with Truss Framework
H100: ~$6.50/hr ($0.10833/min) | Custom model deployment | Dedicated infra options
Baseten is focused on production model serving. The Truss framework is their deployment abstraction: you define your model, its dependencies, and Baseten handles the rest. They offer both serverless endpoints and dedicated GPU instances for latency-sensitive deployments.
The platform has more depth than Replicate for teams deploying custom models. You get control over model configuration, batching settings, and hardware selection. The dedicated infrastructure option eliminates cold starts for production inference.
What Baseten does well
- Performance-focused: Truss is optimized for fast inference with configurable batching
- Dedicated infrastructure option removes cold starts for production workloads
- Supports custom models with fine-grained deployment configuration
- Good observability tools for monitoring inference latency and throughput
Where it falls short
- Truss framework lock-in adds migration friction
- Not a natural fit for training workloads
- Smaller community and ecosystem compared to Modal
Best for: ML teams with custom models who need production inference serving with more control than Replicate offers as of early 2026.
7. Cerebrium: Python-Native Serverless GPU
Pricing: Usage-based per-second | Keep-warm options | No Kubernetes required
Cerebrium's developer experience is the closest to Modal's in this list. You write Python functions, decorate them, and Cerebrium handles deployment and scaling. Cold start mitigation via keep-warm instances is built in. The container runtime is optimized for ML workloads.
The practical difference from Modal is ecosystem maturity. Cerebrium has fewer integrations, less community tooling, and less documentation. For teams who want Modal's model without Modal's pricing, it is worth evaluating.
What Cerebrium does well
- Deployment model similar to Modal with Python-native functions
- Built-in keep-warm to mitigate cold starts on latency-sensitive endpoints
- Supports both inference and training workloads in a serverless format
- No Kubernetes knowledge required
Where it falls short
- Smaller ecosystem and community than Modal
- Less mature documentation and tooling
- Regional availability is more limited than established providers
Best for: Teams who prefer Modal's developer experience but want to explore alternatives, particularly for inference-heavy workloads. Pricing and features are evolving quickly as of early 2026.
8. Beam Cloud: Serverless GPU for Python Workloads
H100: ~$3.50/hr | RTX 4090: $0.69/hr | Per-second billing | Scheduled jobs | Persistent volume mounts
Beam Cloud targets Python-first ML teams with a function deployment model similar to Modal. The distinguishing features are scheduled job support (cron-based triggers) and persistent volume mounts for datasets and checkpoints.
For pipelines that combine batch processing, scheduled retraining, and inference serving, Beam's combination of features covers more ground than pure inference platforms like Replicate.
What Beam Cloud does well
- Function-based deployment with Python-native syntax
- Scheduled job support for batch processing and periodic retraining
- Persistent volume mounts for large dataset access
- Fast startup: sub-second on warm containers, under 10 seconds on cold starts
Where it falls short
- Earlier-stage product; documentation and community are smaller than Modal
- Feature roadmap is less predictable than established platforms
- Fewer GPU options than larger providers; H100 and RTX 4090 are the main high-performance tiers
Best for: Teams with mixed workloads (batch processing, scheduled jobs, inference) who want a unified serverless platform and are comfortable with an early-stage product.
9. Hugging Face Inference Endpoints: Managed Model Hosting
Pricing: $0.50/hr (T4), $10/hr (H100, GCP only), $5/hr (H200, AWS) | Per-minute billing | SLA-backed uptime
Hugging Face Inference Endpoints is the cleanest path to hosting models from the HF Hub in production. You pick a model from the Hub, choose a hardware tier, and Hugging Face handles the serving infrastructure. No custom code required for Hub-compatible models.
Single-GPU pricing ranges from $0.50/hr for an NVIDIA T4 to $10/hr for an H100 on GCP. H100 instances are currently only available on GCP. H200 runs $5/hr on AWS. A100 runs $2.50/hr on AWS or $3.60/hr on GCP. Multi-GPU configurations scale linearly. The per-hour displayed rate is actually billed per minute, so you do not pay for a full hour if you only use 10 minutes.
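The per-minute billing detail is easy to quantify. A sketch of a 10-minute job on the listed $10/hr H100 (GCP) rate, comparing minute-granular billing against a full-hour round-up:

```python
H100_GCP_RATE = 10.0  # USD/hr displayed rate; actually billed per minute

job_minutes = 10
per_minute_billed = H100_GCP_RATE * job_minutes / 60  # ≈ $1.67
full_hour_billed = H100_GCP_RATE * 1                  # $10.00 under hourly rounding

print(f"${per_minute_billed:.2f} vs ${full_hour_billed:.2f}")
```

For short validation runs or intermittent endpoints, that granularity is a 6x difference on this rate.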
What Hugging Face does well
- Zero deployment work for any model on the HF Hub
- SLA-backed uptime with dedicated infrastructure per endpoint
- Native integration with HF Hub model versioning and revision tracking
- Supports private models from private Hub repositories
Where it falls short
- H100 at $10/hr (GCP only) and A100 at $2.50/hr (AWS) or $3.60/hr (GCP) are more expensive than bare-metal GPU providers for high-throughput workloads
- Limited flexibility for custom code outside of standard inference pipelines
- GPU selection is fixed to HF's hardware tiers, not arbitrary instance types
Best for: Teams using HF Hub models who want managed production endpoints without writing serving code, and where per-minute billing fits consistent traffic patterns.
10. NVIDIA DGX Cloud Lepton: Developer-Focused GPU Compute Marketplace
Pricing: Usage-based per-second | LLM-optimized serving | Multi-region | Formerly Lepton AI
NVIDIA acquired Lepton AI in April 2025 and rebranded the platform as NVIDIA DGX Cloud Lepton, announced at COMPUTEX in May 2025. It is now a GPU compute marketplace that connects developers to capacity from a global network of NVIDIA Cloud Partners including CoreWeave, Crusoe, Firmus, Foxconn, GMI Cloud, Lambda, Nebius, Nscale, SoftBank Corp., and Yotta Data Services. Rather than running its own data centers, it aggregates GPU supply from that network of partners.
The platform supports popular open-source LLMs and provides a unified API for accessing compute across providers. Serving is optimized for token throughput with features like continuous batching and tensor parallelism.
What NVIDIA DGX Cloud Lepton does well
- LLM-optimized inference with continuous batching and tensor parallelism
- Access to compute from multiple NVIDIA Cloud Partners through one interface
- Multi-region availability for latency-sensitive global serving
- Backed by NVIDIA's ecosystem and partnerships
Where it falls short
- Platform is still maturing after the acquisition and rebrand
- Pricing and availability vary by underlying provider, making cost comparison harder
- Less suitable for non-LLM workloads like image generation or training at the individual developer tier
Best for: Teams building LLM-powered products who want managed inference infrastructure optimized for token throughput, backed by NVIDIA's ecosystem.
Serverless vs. On-Demand GPU: When Each Makes Sense
The choice between serverless GPU and dedicated instances comes down to utilization math.
Use serverless (Modal, Replicate, Cerebrium, Beam) when:
- Traffic is bursty and unpredictable with significant idle periods between requests
- Workloads are stateless and tolerate cold start latency
- You want zero infrastructure management
- Jobs are short enough that per-second billing is an advantage (sub-30 minutes)
Use on-demand or bare-metal (Spheron, Lambda, RunPod on-demand) when:
- Training jobs run longer than 4 hours
- Inference endpoints require consistent sub-100ms latency
- Custom CUDA drivers or kernel configurations are needed
- You have done the utilization math and found dedicated is cheaper
The Spheron instance types guide explains when to use spot vs. dedicated vs. bare-metal, with a decision matrix covering each use case.
The breakeven calculation is simple: at what utilization does dedicated become cheaper than serverless?
- Modal H100 effective rate: ~$3.95/hr
- Spheron H100 on-demand: $2.01/hr
- Breakeven utilization: $2.01 / $3.95 = ~51%
If your GPU is busy more than 51% of the time, Spheron's dedicated instance is cheaper than Modal's per-second billing. For training jobs, that threshold is almost always exceeded. For inference with sparse traffic, it may not be. The Spheron training quick guides cover environment setup for PyTorch, DeepSpeed, and distributed multi-GPU workloads.
You can also use spot instances to push costs even lower for fault-tolerant workloads. Spheron's H100 spot pricing runs $0.99/hr, bringing the breakeven utilization against Modal down further. The Spheron cost optimization guide covers when spot makes sense and how to set up checkpoint recovery. For teams running recurring workloads, the reserved GPU program offers further discounts on committed usage.
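The breakeven arithmetic above generalizes to any pair of rates. A small helper, using the rates quoted in this post (subject to change):

```python
def breakeven_utilization(dedicated_hourly: float, serverless_hourly: float) -> float:
    """Fraction of the time a GPU must be busy before a dedicated
    instance is cheaper than per-second serverless billing."""
    return dedicated_hourly / serverless_hourly

MODAL_H100 = 3.95        # effective USD/hr under sustained load
SPHERON_H100 = 2.01      # on-demand USD/hr
SPHERON_H100_SPOT = 0.99 # spot USD/hr

print(f"on-demand breakeven: {breakeven_utilization(SPHERON_H100, MODAL_H100):.0%}")
print(f"spot breakeven:      {breakeven_utilization(SPHERON_H100_SPOT, MODAL_H100):.0%}")
```

On-demand lands at ~51%, matching the calculation above; spot pushes the breakeven down to roughly 25% utilization.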
Pricing Model Comparison
Per-second billing (Modal, RunPod Serverless, Replicate, Cerebrium, Beam Cloud)
Per-second billing is ideal for sub-minute workloads with high idle ratios. A 3-second inference call costs almost nothing. The downside is cost unpredictability for pipelines with variable duration. A slow batch call can cost 10x what a fast one does, and at scale, that variance compounds into difficult-to-forecast monthly bills.
Per-minute billing (Spheron, Hugging Face)
Per-minute billing is practical for both short and long workloads. No minimum usage period means you do not pay for an hour when you only needed 10 minutes. For training runs, the granularity is irrelevant since jobs run for hours anyway. For inference endpoints that run continuously, per-minute billing is equivalent to per-hour in cost terms but more intuitive. Hugging Face Inference Endpoints displays rates as hourly figures but charges by the minute, so you only pay for actual usage.
Per-hour billing (Lambda)
Per-hour billing is predictable and easy to budget. It is wasteful for workloads shorter than 30 minutes but excellent for training runs and always-on inference endpoints. Most research teams prefer per-hour billing because it maps cleanly to compute budgets.
Concrete comparison: running 1,000 inference calls at 30 seconds each on H100
1,000 calls at 30 seconds each equals 30,000 seconds of GPU compute, or 8.33 hours of sequential processing.
- Modal (per-second at ~$3.95/hr effective): 1,000 x 30s x ($3.95 / 3,600) = ~$32.90
- Spheron (dedicated, sequential throughput over 8.33 hours): 8.33 x $2.01 = ~$16.74
Dedicated Spheron costs roughly half as much for the same sequential throughput. The gap narrows, and eventually reverses, as traffic gets sparser. If those 1,000 calls arrive over 24 hours with long idle windows between them, Spheron costs 24 x $2.01 = ~$48.24 while Modal stays at ~$32.90. The crossover is around 16.4 hours: once a dedicated instance needs to stay on longer than that to process 1,000 calls at this rate, Modal's pay-per-second model comes out ahead. For an always-on inference endpoint, the crossover works out to roughly 61 calls per hour; below that request rate, serverless billing wins on cost.
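The worked example above reduces to a few lines of arithmetic, using the rates quoted in this post:

```python
CALLS = 1_000
SECS_PER_CALL = 30
MODAL_RATE = 3.95    # effective USD/hr, billed per second of execution
SPHERON_RATE = 2.01  # on-demand USD/hr, billed while the instance is up

gpu_hours = CALLS * SECS_PER_CALL / 3600      # 8.33 hours of pure compute
modal_cost = gpu_hours * MODAL_RATE           # ≈ $32.92, regardless of idle time
spheron_busy_cost = gpu_hours * SPHERON_RATE  # ≈ $16.75 if calls are back-to-back

# Dedicated loses once the instance must stay on past this many hours:
crossover_hours = modal_cost / SPHERON_RATE            # ≈ 16.4 hours
crossover_calls_per_hour = CALLS / crossover_hours     # ≈ 61 calls/hr
```

Swap in your own call duration and request rate to see which side of the crossover your workload sits on.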
For LLM serving specifically, the Spheron LLM inference guide covers framework selection (vLLM, Ollama, TensorRT-LLM) and matching GPU tier to model size, which directly affects whether dedicated or serverless is the better fit.
Which Modal Alternative Should You Use?
If you need pure serverless burst inference with no infrastructure management, RunPod Serverless and Cerebrium are the strongest alternatives to Modal. Both offer similar developer experience with potentially better per-second pricing for specific workloads.
For large-scale training, Spheron and Lambda are the natural choices. Spheron for teams optimizing for cost and flexibility, Lambda for teams who prioritize hardware reliability and cluster size. Both offer dedicated instances without the overhead of Modal's container model. The Spheron distributed training guide covers PyTorch DDP and DeepSpeed ZeRO-3 setup on multi-GPU clusters.
Enterprise teams running distributed training at 100+ GPU scale who need compliance documentation and SLA guarantees should evaluate CoreWeave. The on-demand pricing is comparable to Modal, but the contract rates and cluster scale are in a different category.
For open-source model serving without writing deployment code, Replicate and Hugging Face Inference Endpoints remove most of the friction. Replicate suits smaller deployments and prototyping. Hugging Face suits production traffic with consistent request volumes.
For teams running both training and inference workloads under one budget, a split architecture often makes sense: Spheron or Lambda for training jobs, Modal or Cerebrium for the serverless inference layer that handles user traffic. This separates cost optimization (dedicated instances for predictable workloads) from developer convenience (serverless for bursty, unpredictable inference).
If Modal's cold starts, per-second billing overhead, or SDK lock-in have become real costs, Spheron's dedicated GPU instances start at $2.01/hr for H100 with no cold starts and no vendor lock-in. Pricing as of March 22, 2026 and subject to change based on GPU availability.
