Modal is a Python-native serverless GPU platform. You write a function, decorate it with `@app.function(gpu="H100")`, and Modal handles the rest: container provisioning, GPU attachment, and billing per second of execution. It is not a GPU cloud in the traditional sense. You do not get a persistent VM or a dedicated IP address, and remote access works through TCP tunnels rather than direct SSH.
That distinction matters a lot depending on what you are building.
## What Modal Is (and What It Is Not)
Modal's core value proposition is zero infrastructure management. Deploy GPU functions from your laptop with `modal run`. Pay only when code runs. Scale to zero between requests. The Pythonic decorator API means you can go from a local script to a GPU-backed function in a few lines.
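The decorator workflow looks roughly like this. This is a minimal sketch of a Modal app, not a complete deployment: the app name, image contents, and function body are illustrative placeholders.

```python
import modal

app = modal.App("example-inference")

# Image with the function's dependencies; Modal builds and caches it.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def generate(prompt: str) -> str:
    # Runs inside a Modal-managed container with an H100 attached.
    # Model loading and inference would go here.
    return f"generated text for: {prompt}"

@app.local_entrypoint()
def main():
    # `modal run this_file.py` runs this locally and dispatches
    # generate() to Modal's cloud.
    print(generate.remote("hello"))
```

There is no instance to provision or SSH into: the function definition is the entire deployment surface.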
Modal's key strengths:
- Zero-to-GPU in seconds for pre-warmed containers
- Pay-per-second billing with no idle cost
- No infrastructure configuration required
- Built-in parallelism for batch jobs
What Modal is not: a general-purpose VM host. Modal does support SSH via its Tunnels feature (TCP port forwarding), but it works differently from direct root SSH on a bare-metal VM. You cannot bind to arbitrary ports or run a persistent daemon. You cannot maintain stateful in-memory caches across invocations. The execution model is functions, not servers.
## How Spheron Works: Full VM, Always On
Spheron gives you a dedicated bare-metal VM. You get root SSH access, a persistent disk, a fixed IP, and a GPU that stays attached to your instance until you explicitly stop it. There is no hypervisor overhead on the GPU path. VRAM is yours.
From a workflow standpoint: you provision an instance on the Spheron GPU rental page, SSH in, configure your environment once, and it stays configured. Your model weights load into GPU memory on startup and stay there. Requests hit a live process, not a container that needs to spin up.
Billing is flat hourly. Check Spheron's GPU pricing for current rates. You pay whether the GPU is running inference or sitting idle.
## Architecture Comparison
### Modal: Serverless Functions
| Property | Modal |
|---|---|
| Execution model | Serverless functions |
| Access method | Python decorator API (@app.function) |
| State persistence | Ephemeral by default (Modal Volumes available) |
| Port binding | Limited (web endpoints via fixed ASGI interface) |
| Background daemons | Not supported |
| GPU attachment | Allocated per function invocation |
| Networking | Managed, no arbitrary port exposure |
| Storage | Modal Volumes (persistent), or ephemeral local filesystem |
### Spheron: Dedicated Bare-Metal VMs
| Property | Spheron |
|---|---|
| Execution model | Full dedicated VM |
| Access method | SSH, API provisioning |
| State persistence | Persistent disk, in-memory state survives between requests |
| Port binding | Arbitrary (open any port) |
| Background daemons | Supported (systemd, screen, tmux, etc.) |
| GPU attachment | Dedicated, attached for the instance lifetime |
| Networking | Full control, dedicated IP, inbound/outbound unrestricted |
| Storage | Persistent NVMe, NFS mounts, external object storage |
Multi-GPU configurations on Spheron include NVLink within a node and InfiniBand across nodes on HGX systems. Modal supports multi-GPU within a single function call and has a beta multi-node feature (up to 64 devices with InfiniBand via the `@clustered` decorator), though it does not expose the full low-level interconnect configuration that bare-metal provides.
## Pricing Model Comparison
### Modal Pricing (Per-Second Billing)
Modal bills per second of GPU compute (as of March 2026):
| GPU | Approx. per-second rate | Effective hourly rate |
|---|---|---|
| H100 SXM | $0.001097/sec | ~$3.95/hr |
| A100 80GB | ~$0.000694/sec | ~$2.50/hr |
| T4 | ~$0.000164/sec | ~$0.59/hr |
The critical variable is utilization. At 100% utilization (function always running), you pay the full effective hourly rate. At 10% utilization, you pay 10% of it. For bursty workloads, per-second billing is a genuine advantage.
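The utilization argument can be made concrete with a small cost model, using the H100 SXM rates quoted in this comparison (~$3.95/hr effective on Modal, ~$2.40/hr on-demand on Spheron):

```python
def serverless_monthly(hourly_rate: float, utilization: float,
                       hours_per_month: int = 720) -> float:
    """Per-second billing: you pay only for the fraction of time code runs."""
    return hourly_rate * hours_per_month * utilization

def flat_monthly(hourly_rate: float, hours_per_month: int = 720) -> float:
    """Flat hourly billing: you pay for every hour the instance exists."""
    return hourly_rate * hours_per_month

MODAL_H100 = 3.95    # effective hourly rate, per the table above
SPHERON_H100 = 2.40  # on-demand hourly rate, per Spheron's pricing

print(f"Modal at 100%:  ${serverless_monthly(MODAL_H100, 1.0):,.0f}")  # $2,844
print(f"Modal at 10%:   ${serverless_monthly(MODAL_H100, 0.1):,.0f}")  # $284
print(f"Spheron (flat): ${flat_monthly(SPHERON_H100):,.0f}")           # $1,728
```

The crossover between the two billing models is worked through in the examples below.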
Prices as of 28 Mar 2026. Modal pricing is subject to change. Check Modal's pricing page for current rates.
### Spheron Pricing (Flat Hourly Rate)
Spheron charges per hour regardless of whether requests are actively processing. Current rates (as of March 2026):
| GPU | On-demand price/hr | Spot price/hr |
|---|---|---|
| H100 SXM | ~$2.40/hr | ~$0.80/hr |
| A100 SXM4 80GB | ~$1.06/hr | — |
| A100 PCIe 80GB | ~$1.09/hr | — |
Spot instances are interruptible but can offer significant savings on supported GPUs. Spot availability and pricing vary by model, and spot is not always cheaper than on-demand. For production inference or training runs with checkpointing, spot can cut costs substantially where available. Check individual GPU pages like H100 rental for live rates including spot availability.
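Surviving a spot interruption comes down to checkpointing discipline: persist progress often, and write checkpoints atomically so an interruption mid-write cannot corrupt the last good state. A minimal stdlib sketch of the pattern (the checkpoint contents and step counts are illustrative):

```python
import os
import pickle
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write atomically: a spot interruption mid-write must not
    corrupt the previous checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path: str):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Resume from the last completed step after an interruption.
ckpt_path = os.path.join(tempfile.gettempdir(), "spot_demo_checkpoint.pkl")
state = load_checkpoint(ckpt_path) or {"step": 0}
for step in range(state["step"], 100):
    # ... one training step would run here ...
    if step % 10 == 0:
        save_checkpoint({"step": step}, ckpt_path)
```

Real training frameworks ship their own checkpoint utilities; the atomic-replace discipline is the transferable part.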
Pricing fluctuates based on GPU availability. The prices above are based on 28 Mar 2026 and may have changed. Check current GPU pricing for live rates.
For a broader look at how these rates compare across all major providers, see our GPU cloud pricing comparison for 2026.
## Worked Example: Continuous Inference Server
Scenario: one H100 SXM running a production inference API, 24 hours/day, 30 days/month. Fully utilized (requests arriving continuously).
| Provider | Calculation | Monthly cost |
|---|---|---|
| Modal | $3.95/hr x 720 hrs | ~$2,844 |
| Spheron on-demand | $2.40/hr x 720 hrs | ~$1,728 |
| Spheron spot | $0.80/hr x 720 hrs | ~$576 |
At 100% utilization, Spheron on-demand saves about $1,116/month over Modal. Spheron spot saves about $2,268/month if your inference API can handle occasional interruptions with failover logic.
## Worked Example: Bursty Batch Inference
Scenario: one H100 SXM handling async batch jobs, running 2 hours/day, 30 days/month. GPU is idle the rest of the time.
| Provider | Calculation | Monthly cost |
|---|---|---|
| Modal | $3.95/hr x 60 hrs | ~$237 |
| Spheron on-demand | $2.40/hr x 720 hrs | ~$1,728 |
Modal wins this one by a large margin. The break-even point for an H100 SXM sits at roughly 61% daily utilization (about 15 hours/day): Spheron on-demand costs $2.40 x 24 = $57.60/day, so the two platforms cost the same at $57.60 / $3.95 ≈ 14.6 hours of actual GPU use per day. Below that threshold, Modal is cheaper because you only pay for active compute time.
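The break-even arithmetic generalizes to any GPU: divide the flat daily cost by the serverless hourly rate.

```python
def break_even_hours(flat_hourly: float, serverless_hourly: float) -> float:
    """Daily hours of actual GPU use at which flat hourly billing
    and per-second billing cost the same."""
    return flat_hourly * 24 / serverless_hourly

hours = break_even_hours(2.40, 3.95)    # H100 SXM rates from above
print(f"{hours:.1f} hours/day")         # 14.6 hours/day
print(f"{hours / 24:.0%} utilization")  # 61% utilization
```

Run the same function with your own GPU's rates to find your threshold; above it, flat hourly wins, below it, per-second wins.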
All prices based on 28 Mar 2026 rates. Check Modal's pricing page for current Modal figures.
## Cold Start Latency vs Always-On Performance
Modal's cold start time depends on container state:
- Pre-warmed container (`min_containers=1`): 1-10 seconds
- Cold container boot: ~1 second for the container itself, but end-to-end initialization including Python imports and model loading can take 20-60+ seconds depending on model size
The latency breakdown for a cold start: container orchestration, image pull (if not cached), Python interpreter start, import of dependencies (torch, transformers), and model load from disk into GPU memory. On a large model like Llama 3 70B, the model load alone can take 20-30 seconds.
Spheron has no cold start in this sense. The VM is always on. Your inference server process loads the model once when the instance starts, and the model weights stay in GPU VRAM across all requests. Latency from request receipt to first token is purely model execution time, typically under 2 seconds for a pre-loaded model serving a streaming request.
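The always-on pattern in miniature: load once at process start, serve every request from the resident copy. A stdlib sketch where the "model" is a placeholder dict standing in for weights in GPU VRAM; a real deployment would use an inference server, not `http.server`:

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_model() -> dict:
    """Stands in for loading weights into VRAM (20-30s for a large model).
    On a persistent VM this cost is paid once, at process start."""
    time.sleep(0.1)  # placeholder for the real load time
    return {"name": "demo-model"}

MODEL = load_model()  # resident for the life of the process

class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No per-request load: the model is already in memory.
        body = json.dumps({"model": MODEL["name"], "output": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start serving on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Every request after startup hits the resident `MODEL`; there is no cold path, which is exactly what a sub-200ms SLA requires.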
For real-time inference with SLAs under 200ms end-to-end, any serverless platform with cold starts is a hard constraint. Serverless works for async inference (queue a job, wait for result), but live synchronous APIs with strict latency requirements need a persistent server.
See our best GPU for AI inference guide for GPU selection based on throughput and latency requirements.
## Multi-GPU Training: Why Bare Metal Wins
Modal supports multi-GPU within a single function invocation, up to 8x H100s on a single node. For model fine-tuning or inference that fits within one node, this can work.
The limitations appear at scale:
- No persistent shared filesystem across function runs by default. Checkpoint directories need explicit Modal Volume mounts.
- Modal's multi-node support (via the `@clustered` decorator) is currently in beta, with up to 64 devices per cluster and RDMA/InfiniBand networking. Production workloads requiring guaranteed stability should factor in the beta status.
- No job schedulers like SLURM or PBS for multi-job cluster workflows.
Spheron bare-metal gives you the full environment for distributed training. NVLink within an HGX node. InfiniBand across nodes for multi-node runs. MPI and NCCL tuned for bare metal. Persistent NFS storage for checkpoints accessible from all nodes.
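On bare metal you control the interconnect configuration directly. A hedged sketch of NCCL environment settings commonly used for InfiniBand multi-node runs; the variable names are real NCCL settings, but the values shown are examples and depend on your node's adapters and interfaces:

```python
import os

# NCCL tuning for multi-node training over InfiniBand.
# Values are illustrative, not universal defaults.
nccl_env = {
    "NCCL_IB_DISABLE": "0",        # keep InfiniBand enabled (no TCP fallback)
    "NCCL_IB_HCA": "mlx5",         # prefix of the host channel adapters to use
    "NCCL_SOCKET_IFNAME": "eth0",  # interface for NCCL's bootstrap traffic
    "NCCL_DEBUG": "INFO",          # log transport selection to verify IB is used
}
os.environ.update(nccl_env)
```

Set before launching the training processes, these let you verify in the NCCL logs that traffic is actually going over InfiniBand rather than silently falling back to TCP.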
A practical example: fine-tuning a 70B model with FSDP across 2 nodes requires RDMA-capable inter-node communication and persistent checkpoint storage accessible from both nodes. Modal's beta `@clustered` feature can attempt this workload, but Spheron's production bare-metal environment offers more mature tooling for it, with stable InfiniBand, MPI, and persistent NFS storage.
See our multi-node GPU training guide for more on distributed training infrastructure. For a similar bare-metal vs managed comparison, see Spheron vs RunPod.
## Use Case Mapping
### When Serverless (Modal) Makes Sense
- Batch data processing with infrequent runs (a few times per day)
- Event-driven inference triggered by webhooks or async job queues
- Overnight eval runs on a fixed dataset
- Prototyping with intermittent GPU usage
- Workloads running less than ~15 hours of GPU time per day
- Teams that want zero infrastructure management and accept the trade-offs
If Modal doesn't fit your needs, see alternatives to Modal for other GPU cloud options.
### When Bare Metal (Spheron) Makes Sense
- Production inference APIs with latency SLAs (synchronous, real-time requests)
- Multi-GPU or multi-node training that requires InfiniBand or NVLink
- Teams that need persistent environments: installed conda envs, cached model weights, active processes
- GPU utilization above ~15 hours/day (~61% utilization), where flat hourly billing beats per-second billing
- Workloads requiring SSH access, custom daemons, arbitrary port binding
- Compliance environments needing dedicated IPs and audit-grade isolation
## Full Side-by-Side Comparison
| Feature | Modal | Spheron |
|---|---|---|
| Execution model | Serverless functions | Dedicated VMs |
| GPU access | Per-invocation | Persistent, always attached |
| Cold start | ~1 second (warm) to 60+ seconds (cold model load) | None |
| Persistent storage | Modal Volumes (manual setup) | NVMe + NFS, always-on |
| SSH access | Via Tunnels (TCP port forwarding) | Yes (root) |
| Custom ports and daemons | No | Yes |
| Multi-node training | Beta (up to 64 devices, @clustered) | Yes (production-ready) |
| InfiniBand / NVLink | Beta (InfiniBand via @clustered), NVLink single-node | Yes |
| Billing model | Per second of execution | Flat hourly |
| H100 SXM price | ~$3.95/hr | ~$2.40/hr on-demand |
| Free egress | Yes | Yes |
| Data persistence between runs | Requires Modal Volumes | Default (persistent disk) |
Spheron gives you dedicated bare-metal GPUs with no cold starts, persistent storage, and full root access. If you are running production inference or training workloads, compare GPU pricing or rent an H100 and get running in minutes.
