Modal is a Python-native serverless GPU platform. You write a function, decorate it with `@app.function(gpu="H100")`, and Modal handles the rest: container provisioning, GPU attachment, and billing per second of execution. It is not a GPU cloud in the traditional sense. You do not get a persistent VM or a dedicated IP address, and remote access works through TCP tunnels rather than direct SSH.
That distinction matters a lot depending on what you are building.
## What Modal Is (and What It Is Not)
Modal's core value proposition is zero infrastructure management. Deploy GPU functions from your laptop with `modal run`. Pay only when code runs. Scale to zero between requests. The Pythonic decorator API means you can go from a local script to a GPU-backed function in a few lines.
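The decorator workflow looks roughly like this. This is a minimal sketch of a Modal app, not a complete deployment: the app name, image contents, and function body are illustrative placeholders.

```python
import modal

app = modal.App("example-inference")

# Image with the function's dependencies; Modal builds and caches it.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def generate(prompt: str) -> str:
    # Runs inside a Modal-managed container with an H100 attached.
    # Model loading and inference would go here.
    return f"generated text for: {prompt}"

@app.local_entrypoint()
def main():
    # `modal run this_file.py` runs this locally and dispatches
    # generate() to Modal's cloud.
    print(generate.remote("hello"))
```

There is no instance to provision or SSH into: the function definition is the entire deployment surface.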
Modal's key strengths:
- Zero-to-GPU in seconds for pre-warmed containers
- Pay-per-second billing with no idle cost
- No infrastructure configuration required
- Built-in parallelism for batch jobs
What Modal is not: a general-purpose VM host. Modal does support SSH via its Tunnels feature (TCP port forwarding), but it works differently from direct root SSH on a bare-metal VM. You cannot bind to arbitrary ports or run a persistent daemon. You cannot maintain stateful in-memory caches across invocations. The execution model is functions, not servers.
## How Spheron Works: Full VM, Always On
Spheron gives you a dedicated bare-metal VM. You get root SSH access, a persistent disk, a fixed IP, and a GPU that stays attached to your instance until you explicitly stop it. There is no hypervisor overhead on the GPU path. VRAM is yours.
From a workflow standpoint: you provision an instance on the Spheron GPU rental page, SSH in, configure your environment once, and it stays configured. Your model weights load into GPU memory on startup and stay there. Requests hit a live process, not a container that needs to spin up.
Billing is flat hourly. Check Spheron's GPU pricing for current rates. You pay whether the GPU is running inference or sitting idle.
## Architecture Comparison
### Modal: Serverless Functions
| Property | Modal |
|---|---|
| Execution model | Serverless functions |
| Access method | Python decorator API (@app.function) |
| State persistence | Ephemeral by default (Modal Volumes available) |
| Port binding | Limited (web endpoints via fixed ASGI interface) |
| Background daemons | Not supported |
| GPU attachment | Allocated per function invocation |
| Networking | Managed, no arbitrary port exposure |
| Storage | Modal Volumes (persistent), or ephemeral local filesystem |
### Spheron: Dedicated Bare-Metal VMs
| Property | Spheron |
|---|---|
| Execution model | Full dedicated VM |
| Access method | SSH, API provisioning |
| State persistence | Persistent disk, in-memory state survives between requests |
| Port binding | Arbitrary (open any port) |
| Background daemons | Supported (systemd, screen, tmux, etc.) |
| GPU attachment | Dedicated, attached for the instance lifetime |
| Networking | Full control, dedicated IP, inbound/outbound unrestricted |
| Storage | Persistent NVMe, NFS mounts, external object storage |
Multi-GPU configurations on Spheron include NVLink within a node and InfiniBand across nodes on HGX systems. Modal supports multi-GPU within a single function call and has a beta multi-node feature (up to 64 devices with InfiniBand via the `@clustered` decorator), though it does not expose the full low-level interconnect configuration that bare-metal provides.
## Pricing Model Comparison
### Modal Pricing (Per-Second Billing)
Modal bills per second of GPU compute (as of March 2026):
| GPU | Approx. per-second rate | Effective hourly rate |
|---|---|---|
| H100 SXM | $0.001097/sec | ~$3.95/hr |
| A100 80GB | ~$0.000694/sec | ~$2.50/hr |
| T4 | ~$0.000164/sec | ~$0.59/hr |
The critical variable is utilization. At 100% utilization (function always running), you pay the full effective hourly rate. At 10% utilization, you pay 10% of it. For bursty workloads, per-second billing is a genuine advantage.
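The utilization argument can be made concrete with a small cost model, using the H100 SXM rates quoted in this comparison (~$3.95/hr effective on Modal, ~$2.40/hr on-demand on Spheron):

```python
def serverless_monthly(hourly_rate: float, utilization: float,
                       hours_per_month: int = 720) -> float:
    """Per-second billing: you pay only for the fraction of time code runs."""
    return hourly_rate * hours_per_month * utilization

def flat_monthly(hourly_rate: float, hours_per_month: int = 720) -> float:
    """Flat hourly billing: you pay for every hour the instance exists."""
    return hourly_rate * hours_per_month

MODAL_H100 = 3.95    # effective hourly rate, per the table above
SPHERON_H100 = 2.40  # on-demand hourly rate, per Spheron's pricing

print(f"Modal at 100%:  ${serverless_monthly(MODAL_H100, 1.0):,.0f}")  # $2,844
print(f"Modal at 10%:   ${serverless_monthly(MODAL_H100, 0.1):,.0f}")  # $284
print(f"Spheron (flat): ${flat_monthly(SPHERON_H100):,.0f}")           # $1,728
```

The crossover between the two billing models is worked through in the examples below.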
Prices as of 28 Mar 2026. Modal pricing is subject to change. Check Modal's pricing page for current rates.
### Spheron Pricing (Flat Hourly Rate)
Spheron charges per hour regardless of whether requests are actively processing. Current rates (as of March 2026):
| GPU | On-demand price/hr | Spot price/hr |
|---|---|---|
| H100 SXM | ~$2.40/hr | ~$0.80/hr |
| A100 SXM4 80GB | ~$1.06/hr | — |
| A100 PCIe 80GB | ~$1.09/hr | — |
Spot instances are interruptible but can offer significant savings on supported GPUs. Spot availability and pricing vary by model, and spot is not always cheaper than on-demand. For production inference or training runs with checkpointing, spot can cut costs substantially where available. Check individual GPU pages like H100 rental for live rates including spot availability.
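Surviving a spot interruption comes down to checkpointing discipline: persist progress often, and write checkpoints atomically so an interruption mid-write cannot corrupt the last good state. A minimal stdlib sketch of the pattern (the checkpoint contents and step counts are illustrative):

```python
import os
import pickle
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write atomically: a spot interruption mid-write must not
    corrupt the previous checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path: str):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Resume from the last completed step after an interruption.
ckpt_path = os.path.join(tempfile.gettempdir(), "spot_demo_checkpoint.pkl")
state = load_checkpoint(ckpt_path) or {"step": 0}
for step in range(state["step"], 100):
    # ... one training step would run here ...
    if step % 10 == 0:
        save_checkpoint({"step": step}, ckpt_path)
```

Real training frameworks ship their own checkpoint utilities; the atomic-replace discipline is the transferable part.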
Pricing fluctuates based on GPU availability. The prices above are based on 28 Mar 2026 and may have changed. Check current GPU pricing for live rates.
For a broader look at how these rates compare across all major providers, see our GPU cloud pricing comparison for 2026.
## Worked Example: Continuous Inference Server
Scenario: one H100 SXM running a production inference API, 24 hours/day, 30 days/month. Fully utilized (requests arriving continuously).
| Provider | Calculation | Monthly cost |
|---|---|---|
| Modal | $3.95/hr x 720 hrs | ~$2,844 |
| Spheron on-demand | $2.40/hr x 720 hrs | ~$1,728 |
| Spheron spot | $0.80/hr x 720 hrs | ~$576 |
At 100% utilization, Spheron on-demand saves about $1,116/month over Modal. Spheron spot saves about $2,268/month if your inference API can handle occasional interruptions with failover logic.
## Worked Example: Bursty Batch Inference
Scenario: one H100 SXM handling async batch jobs, running 2 hours/day, 30 days/month. GPU is idle the rest of the time.
| Provider | Calculation | Monthly cost |
|---|---|---|
| Modal | $3.95/hr x 60 hrs | ~$237 |
| Spheron on-demand | $2.40/hr x 720 hrs | ~$1,728 |
Modal wins this one by a large margin. The break-even point for an H100 SXM sits at roughly 61% daily utilization (about 15 hours/day): Spheron on-demand costs $2.40 x 24 = $57.60/day, so the two platforms cost the same at $57.60 / $3.95 ≈ 14.6 hours of actual GPU use per day. Below that threshold, Modal is cheaper because you only pay for active compute time.
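The break-even arithmetic generalizes to any GPU: divide the flat daily cost by the serverless hourly rate.

```python
def break_even_hours(flat_hourly: float, serverless_hourly: float) -> float:
    """Daily hours of actual GPU use at which flat hourly billing
    and per-second billing cost the same."""
    return flat_hourly * 24 / serverless_hourly

hours = break_even_hours(2.40, 3.95)    # H100 SXM rates from above
print(f"{hours:.1f} hours/day")         # 14.6 hours/day
print(f"{hours / 24:.0%} utilization")  # 61% utilization
```

Run the same function with your own GPU's rates to find your threshold; above it, flat hourly wins, below it, per-second wins.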
All prices based on 28 Mar 2026 rates. Check Modal's pricing page for current Modal figures.
## Cold Start Latency vs Always-On Performance
Modal's cold start time depends on container state:
- Pre-warmed container (`min_containers=1`): 1-10 seconds
- Cold container boot: ~1 second for the container itself, but end-to-end initialization including Python imports and model loading can take 20-60+ seconds depending on model size
The latency breakdown for a cold start: container orchestration, image pull (if not cached), Python interpreter start, import of dependencies (torch, transformers), and model load from disk into GPU memory. On a large model like Llama 3 70B, the model load alone can take 20-30 seconds.
Spheron has no cold start in this sense. The VM is always on. Your inference server process loads the model once when the instance starts, and the model weights stay in GPU VRAM across all requests. Latency from request receipt to first token is purely model execution time, typically under 2 seconds for a pre-loaded model serving a streaming request.
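The always-on pattern in miniature: load once at process start, serve every request from the resident copy. A stdlib sketch where the "model" is a placeholder dict standing in for weights in GPU VRAM; a real deployment would use an inference server, not `http.server`:

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_model() -> dict:
    """Stands in for loading weights into VRAM (20-30s for a large model).
    On a persistent VM this cost is paid once, at process start."""
    time.sleep(0.1)  # placeholder for the real load time
    return {"name": "demo-model"}

MODEL = load_model()  # resident for the life of the process

class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No per-request load: the model is already in memory.
        body = json.dumps({"model": MODEL["name"], "output": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start serving on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Every request after startup hits the resident `MODEL`; there is no cold path, which is exactly what a sub-200ms SLA requires.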
For real-time inference with SLAs under 200ms end-to-end, any serverless platform with cold starts is a hard constraint. Serverless works for async inference (queue a job, wait for result), but live synchronous APIs with strict latency requirements need a persistent server.
See our best GPU for AI inference guide for GPU selection based on throughput and latency requirements.
## Multi-GPU Training: Why Bare Metal Wins
Modal supports multi-GPU within a single function invocation, up to 8x H100s on a single node. For model fine-tuning or inference that fits within one node, this can work.
The limitations appear at scale:
- No persistent shared filesystem across function runs by default. Checkpoint directories need explicit Modal Volume mounts.
- Modal's multi-node support (via the `@clustered` decorator) is currently in beta, with up to 64 devices per cluster and RDMA/InfiniBand networking. Production workloads requiring guaranteed stability should factor in the beta status.
- No job schedulers like SLURM or PBS for multi-job cluster workflows.
Spheron bare-metal gives you the full environment for distributed training. NVLink within an HGX node. InfiniBand across nodes for multi-node runs. MPI and NCCL tuned for bare metal. Persistent NFS storage for checkpoints accessible from all nodes.
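On bare metal you control the interconnect configuration directly. A hedged sketch of NCCL environment settings commonly used for InfiniBand multi-node runs; the variable names are real NCCL settings, but the values shown are examples and depend on your node's adapters and interfaces:

```python
import os

# NCCL tuning for multi-node training over InfiniBand.
# Values are illustrative, not universal defaults.
nccl_env = {
    "NCCL_IB_DISABLE": "0",        # keep InfiniBand enabled (no TCP fallback)
    "NCCL_IB_HCA": "mlx5",         # prefix of the host channel adapters to use
    "NCCL_SOCKET_IFNAME": "eth0",  # interface for NCCL's bootstrap traffic
    "NCCL_DEBUG": "INFO",          # log transport selection to verify IB is used
}
os.environ.update(nccl_env)
```

Set before launching the training processes, these let you verify in the NCCL logs that traffic is actually going over InfiniBand rather than silently falling back to TCP.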
A practical example: fine-tuning a 70B model with FSDP across 2 nodes requires RDMA-capable inter-node communication and persistent checkpoint storage accessible from both nodes. Modal's beta `@clustered` feature can attempt this workload, but Spheron's production bare-metal environment offers more mature tooling for it, with stable InfiniBand, MPI, and persistent NFS storage.
See our multi-node GPU training guide for more on distributed training infrastructure. For a similar bare-metal vs managed comparison, see Spheron vs RunPod.
## Use Case Mapping
### When Serverless (Modal) Makes Sense
- Batch data processing with infrequent runs (a few times per day)
- Event-driven inference triggered by webhooks or async job queues
- Overnight eval runs on a fixed dataset
- Prototyping with intermittent GPU usage
- Workloads running less than ~15 hours of GPU time per day
- Teams that want zero infrastructure management and accept the trade-offs
If Modal doesn't fit your needs, see alternatives to Modal for other GPU cloud options.
### When Bare Metal (Spheron) Makes Sense
- Production inference APIs with latency SLAs (synchronous, real-time requests)
- Multi-GPU or multi-node training that requires InfiniBand or NVLink
- Teams that need persistent environments: installed conda envs, cached model weights, active processes
- GPU utilization above ~15 hours/day (~61% utilization), where flat hourly billing beats per-second billing
- Workloads requiring SSH access, custom daemons, arbitrary port binding
- Compliance environments needing dedicated IPs and audit-grade isolation
## Full Side-by-Side Comparison
| Feature | Modal | Spheron |
|---|---|---|
| Execution model | Serverless functions | Dedicated VMs |
| GPU access | Per-invocation | Persistent, always attached |
| Cold start | ~1 second (warm) to 60+ seconds (cold model load) | None |
| Persistent storage | Modal Volumes (manual setup) | NVMe + NFS, always-on |
| SSH access | Via Tunnels (TCP port forwarding) | Yes (root) |
| Custom ports and daemons | No | Yes |
| Multi-node training | Beta (up to 64 devices, @clustered) | Yes (production-ready) |
| InfiniBand / NVLink | Beta (InfiniBand via @clustered), NVLink single-node | Yes |
| Billing model | Per second of execution | Flat hourly |
| H100 SXM price | ~$3.95/hr | ~$2.40/hr on-demand |
| Free egress | Yes | Yes |
| Data persistence between runs | Requires Modal Volumes | Default (persistent disk) |
Spheron gives you dedicated bare-metal GPUs with no cold starts, persistent storage, and full root access. If you are running production inference or training workloads, compare GPU pricing or rent an H100 and get running in minutes.
