Baseten built a genuinely useful product. The Truss framework reduces the friction of turning a HuggingFace model into a production API, their observability tooling is solid, and enterprise teams get SLA contracts that matter when reliability has a budget. For teams that do not want to own any GPU infrastructure, Baseten covers a real need.
The friction starts when you look at the billing model. Baseten charges per replica-hour: every running model replica costs money continuously, regardless of how many requests it handles. A two-replica setup for redundancy doubles that cost immediately, on top of any per-token charges. Add cold-start charges on serverless endpoints and the Truss-specific deployment abstraction that creates real switching costs, and the case for evaluating alternatives becomes concrete. This guide covers 10 alternatives with specific pricing and tradeoff breakdowns. For a parallel breakdown of serverless GPU platforms, see the Modal alternatives guide and the Fireworks AI alternatives guide.
Why Teams Look Beyond Baseten
Replica-hour pricing model
Baseten charges per running replica per hour. Each active model replica bills continuously at the underlying GPU rate, regardless of request volume. If you run two replicas for redundancy (which most production deployments do), you pay twice the GPU rate even during off-peak hours when one replica handles zero traffic. Contrast this with per-token serverless providers (you pay only for actual inference compute) or per-minute bare metal (you pay for GPU time at cost, without a replica markup). For a model deployed with 3 replicas running at $6.50/hr each, your floor is $19.50/hr before a single request arrives.
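A back-of-envelope sketch of that floor, using the ~$6.50/hr replica rate referenced throughout this guide (an assumed figure for illustration, not a quoted price):

```python
# Illustrative replica-hour floor, assuming a ~$6.50/hr H100 replica rate (not a quoted price).
replica_rate_per_hour = 6.50
replicas = 3

hourly_floor = replicas * replica_rate_per_hour   # $19.50/hr before the first request
monthly_floor = hourly_floor * 24 * 30            # ~$14,040/month at zero traffic

print(f"Hourly floor:  ${hourly_floor:,.2f}")
print(f"Monthly floor: ${monthly_floor:,.2f}")
```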
Cold-start premiums on serverless endpoints
Baseten's serverless endpoints scale to zero when idle, which keeps costs low at low traffic. The catch is cold starts: large model containers can take 30-90 seconds to become ready on the first request after a scale-down. Teams running latency-sensitive APIs either accept the cold-start variance or keep warm replicas running, which brings the per-replica-hour cost right back. There is no middle path that gives you both zero idle cost and instant response times.
Truss framework lock-in
Truss is Baseten's deployment abstraction: a Python class that defines load() and predict() methods, plus a config.yaml describing the model's hardware and dependencies. It is specific to Baseten's runtime. Migrating off Baseten means rewriting the model wrapper into a standard vLLM or SGLang configuration. That rewrite is not technically hard for most models, but teams underestimate it as a switching cost, especially when they have a library of Truss-packaged models in production.
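For readers who have not worked with Truss, the wrapper being rewritten looks roughly like the sketch below. This is an illustrative shape only; the model ID, the transformers pipeline call, and the input/output keys are placeholders, not a specific production wrapper.

```python
# model/model.py -- rough shape of a Truss model wrapper (illustrative sketch)
from transformers import pipeline

class Model:
    def __init__(self, **kwargs):
        self._pipe = None

    def load(self):
        # Called once when the replica starts: load weights onto the GPU.
        self._pipe = pipeline("text-generation", model="<your-hf-model-id>")

    def predict(self, model_input: dict) -> dict:
        # Called per request by the serving runtime.
        out = self._pipe(model_input["prompt"], max_new_tokens=256)
        return {"completion": out[0]["generated_text"]}
```

The accompanying config.yaml pins the hardware and Python dependencies; migrating off Baseten means replacing both files with a vLLM or SGLang serve command and a standard OpenAI-compatible client.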
Quick Comparison: Baseten vs 10 Alternatives
| Provider | Pricing Model | H100 Rate | Deployment Abstraction | Cold Starts | Best For |
|---|---|---|---|---|---|
| Baseten | Per replica-hour | ~$6.50/hr | Truss | Yes (serverless), No (dedicated) | Managed production serving with SLAs |
| Spheron | Per minute | $2.01/hr PCIe / $4.41/hr SXM5 | None (bring your own) | None | Sustained inference, bare-metal control |
| Replicate | Per GPU-second | $5.49/hr | Cog | Yes | API-first prototyping on public models |
| Modal | Per second | ~$3.95/hr (effective) | Python decorator | Yes (5s-2min) | Python-native burst serverless |
| Fireworks AI | Per token | Shared infra | None | N/A | Low-volume serverless open-weight inference |
| Together AI | Per token | Shared infra | None | N/A | Serverless with broad open-weight catalog |
| RunPod | Per hour/second | ~$2.69/hr SXM (on-demand) | Template-based | Serverless only | Mixed dedicated and serverless |
| Anyscale | Per token/hour | Custom pricing | Ray Serve | No (dedicated) | Distributed inference via Ray ecosystem |
| BentoCloud | Per hour | ~$4.00/hr (H100, estimated) | BentoML | No (dedicated) | Pythonic serving with packaging control |
| NVIDIA DGX Cloud Lepton | Per token | Shared infra | None | N/A | LLM-optimized serverless inference |
| Beam | Per second | ~$3.50/hr (H100) | Python decorator | Yes | Scheduled jobs and inference in one platform |
All third-party pricing is based on publicly listed on-demand rates as of 05 May 2026 and may fluctuate.
Pricing fluctuates based on GPU availability. The Spheron prices above are as of 05 May 2026 and may have changed. Check current GPU pricing → for live rates.
1. Spheron: Bare-Metal H100 and B300 with No Replica Markup
H100 PCIe: $2.01/hr | H100 SXM5: $4.41/hr | B300 SXM6: $9.77/hr | Per-minute billing | No contracts
Pricing fluctuates based on GPU availability. The prices above are as of 05 May 2026 and may have changed. Check current GPU pricing → for live rates.
The fundamental difference between Spheron and Baseten is what you are paying for. On Baseten, you pay per replica-hour, which means the GPU rate plus a management overhead, multiplied by however many replicas you keep running. On Spheron, you pay for a dedicated GPU at cost, billed per minute. No replica abstraction, no overhead layer, no minimum replica count.
Spheron H100 instances run vLLM or SGLang directly on bare metal with root SSH access. You get the full GPU: no hypervisor overhead, no shared tenancy. For teams migrating a Truss model, the vLLM launch command is five lines. The OpenAI-compatible endpoint vLLM exposes is drop-in compatible with Baseten's API, so existing client code requires only a base URL change. For the exact steps, see the self-hosted OpenAI-compatible endpoint guide.
Spot pricing is available on select GPU types. A100 80G SXM4 spot starts at $0.45/hr, and B300 SXM6 spot is available at $2.45/hr for cost-sensitive experiments. For production inference requiring consistent latency, on-demand instances avoid spot preemption risk.
If you want to move up the performance curve, Spheron B300 bare-metal instances provide the highest single-GPU throughput currently available on the platform for teams running large model inference at scale.
What Spheron does well
- Per-minute billing, no replica-hour model
- H100 PCIe, H100 SXM5, H200 SXM5, B300 SXM6, A100 80G, L40S, and RTX-series on demand
- Full bare-metal access, root SSH, no hypervisor overhead
- Multi-GPU clusters up to 8x H100 with InfiniBand for tensor-parallel inference
- Spot instances available on select GPU types (A100 spot: $0.45/hr, B300 spot: $2.45/hr)
- No SDK lock-in, standard Linux environment
Where it falls short
- No managed serving layer: you deploy and operate vLLM or SGLang yourself
- No built-in observability dashboards (bring Prometheus, Grafana, or Langfuse)
- No serverless or scale-to-zero
2. Replicate: Per-Second Inference on Public Models
H100: $5.49/hr ($0.001525/sec) | Per-second billing | Cog deployment format
Replicate charges per GPU-second, which works out to $5.49/hr effective H100 cost at sustained use. Their main value is the public model registry: you can call Stable Diffusion, Flux, LLaMA variants, and hundreds of community models with a single API call and no deployment work.
Compared to Baseten, Replicate is simpler but less flexible. Baseten's Truss framework lets you bring arbitrary Python code and custom preprocessing logic. Replicate's Cog format is more opinionated: you define a predict() function, and Cog wraps it. For teams with custom model logic beyond standard inference, Cog is more limiting than Truss. And at $5.49/hr effective, Replicate is expensive compared to bare metal alternatives. The value proposition is prototyping speed on public models, not production inference at scale.
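For comparison, a Cog predictor has roughly the shape below — an illustrative sketch, where load_model() is a hypothetical stand-in for your own loading code:

```python
# predict.py -- rough shape of a Cog predictor (illustrative sketch)
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self):
        # Runs once per container start.
        self.model = load_model()  # hypothetical helper; bring your own loading code

    def predict(self, prompt: str = Input(description="Input prompt")) -> str:
        # Cog derives the HTTP API and input validation from this signature.
        return self.model.generate(prompt)
```

The narrower surface is the point: less to write, but less room for custom preprocessing than a Truss wrapper.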
Cold starts exist on Replicate for low-traffic models that have been scaled down, similar to Baseten's serverless behavior. For a full breakdown of Replicate's tradeoffs across pricing, Cog migration, and alternatives, see the Replicate alternatives guide.
3. Modal: Python-Native Serverless with Per-Second Billing
H100 effective rate: ~$3.95/hr | Scale-to-zero | Python decorator deployment
Modal replaces Baseten's Truss Python class with Python decorators. You write @app.function(gpu="H100") above your inference function, and Modal handles container builds, GPU scheduling, and scaling. If you are evaluating Baseten and want a managed serverless layer without Truss, Modal is the most architecturally similar alternative.
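A minimal sketch of that shape, assuming current Modal APIs; the image contents, model ID, and generation parameters are placeholders:

```python
# Rough shape of a Modal GPU function (illustrative sketch)
import modal

app = modal.App("llm-inference")
image = modal.Image.debian_slim().pip_install("vllm")

@app.function(gpu="H100", image=image, timeout=600)
def generate(prompt: str) -> str:
    # Modal builds the container, schedules the GPU, and scales replicas.
    from vllm import LLM, SamplingParams
    llm = LLM(model="<your-hf-model-id>")  # in production, cache the engine across calls
    out = llm.generate([prompt], SamplingParams(max_tokens=256))
    return out[0].outputs[0].text

# After `modal deploy`, clients call generate.remote("...") instead of hitting a Truss endpoint.
```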
The tradeoffs versus Baseten are cost and cold starts. Modal's effective H100 rate under sustained load is around $3.95/hr, cheaper than Baseten's ~$6.50/hr, but still roughly 2x Spheron's $2.01/hr H100 PCIe on-demand rate. Cold starts range from a few seconds for small optimized containers to over a minute for large models. Keeping warm replicas removes cold starts but eliminates the cost benefit of serverless, similar to Baseten's pricing dynamic. For a deeper breakdown of Modal's billing behavior and cold start numbers, see the Modal alternatives guide.
4. Fireworks AI: Token-Priced Serverless for Public Models
Llama 3.1 70B: $0.90/1M tokens | No GPU management | OpenAI-compatible API
Fireworks charges per token on a shared GPU cluster. For teams whose Baseten usage is dominated by public open-weight models (Llama, Qwen, Mistral families), Fireworks can be significantly cheaper at low-to-moderate volumes. At 10M tokens per day on a 70B model, Fireworks costs about $9 per day, versus Baseten's replica-hour charge that runs regardless of traffic.
The tradeoff is control. Fireworks gives you no GPU access, no batching control, no custom checkpoint support. If your Baseten deployment uses Truss to serve a fine-tuned model or a custom inference pipeline, Fireworks cannot replace it. For public catalog models at under 100M tokens per day, Fireworks makes the economics look very different from Baseten. For a full comparison of Fireworks' pricing across model sizes and volume tiers, see the Fireworks AI alternatives guide.
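Because the API is OpenAI-compatible, pointing an existing client at Fireworks is mostly a base URL and model name change. A sketch, assuming Fireworks' standard inference endpoint; the model ID shown is illustrative:

```python
import os
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # Fireworks' OpenAI-compatible endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative catalog model ID
    messages=[{"role": "user", "content": "Summarize replica-hour pricing in one sentence."}],
)
print(resp.choices[0].message.content)
```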
5. Together AI: Broadest Serverless Open-Weight Catalog
Llama 3.1 8B: $0.18/1M tokens | Fine-tune hosting | Dedicated Endpoints
Together AI covers similar ground to Fireworks with a slightly broader model catalog and a fine-tune hosting product. You can upload a custom checkpoint and Together serves it through their API with the same per-token billing. That removes one of Baseten's Truss-specific advantages for teams who only need fine-tune serving, not arbitrary Python inference logic.
Together's Dedicated Endpoints product gives you reserved capacity at a fixed hourly rate, closer to Baseten's dedicated replica model without the Truss abstraction overhead. If your Baseten workload is primarily serving fine-tuned LLaMA or Mistral checkpoints without heavy custom preprocessing, Together is worth evaluating directly. For a deeper comparison across pricing tiers and fine-tune workflow, see the Together AI alternatives guide.
6. RunPod: Dedicated and Serverless Under One Account
H100 SXM: ~$2.69/hr on-demand | Serverless endpoints | Per-second serverless billing
RunPod no longer shows per-hour rates on public pages. Rate above is from the RunPod deploy console, May 2026.
RunPod covers both patterns Baseten offers (dedicated replicas and serverless cold-standby) under one account, at a meaningfully lower GPU rate. H100 SXM on-demand runs around $2.69/hr through the RunPod deploy console, versus Baseten's ~$6.50/hr. RunPod Serverless uses per-second billing with auto-scaling to zero, comparable to Baseten's serverless endpoint behavior with similar cold-start characteristics (5-20 seconds for most containers).
The platform has a community template library that reduces time to first deployment for popular models, and the switch between serverless and dedicated under one account is operationally convenient. For teams that want Baseten's dual-mode (serverless for burst, dedicated for baseline) at a lower GPU rate, RunPod covers that pattern without Truss. For a full comparison across RunPod's tiers and alternatives, see the RunPod alternatives guide.
7. Anyscale: Distributed Inference via Ray Serve
Per-token pricing on hosted endpoints | Ray Serve-based | Multi-GPU distributed inference
Anyscale builds on Ray, the distributed compute framework. Their hosted inference product uses Ray Serve under the hood, giving you distributed inference across multi-GPU clusters and fine-grained autoscaling based on request queue depth. Pricing is consumption-based and requires a sales conversation for most configurations.
Compared to Baseten, Anyscale targets teams already invested in the Ray ecosystem who need to go beyond single-GPU inference. Where Baseten's Truss handles single-model deployments well, Anyscale's Ray Serve integration handles multi-node tensor-parallel deployments for 70B models that do not fit in a single GPU's VRAM. The operational complexity is higher than Baseten's managed layer, so Anyscale makes sense only if Ray is already part of your infrastructure.
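The Ray Serve layer underneath looks roughly like the generic sketch below — not Anyscale's managed configuration, and load_model() is a hypothetical stand-in for your own loading code:

```python
# Generic Ray Serve deployment sketch (illustrative; not Anyscale-specific configuration)
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        self.model = load_model()  # hypothetical helper; bring your own loading code

    async def __call__(self, request):
        body = await request.json()
        return {"completion": self.model.generate(body["prompt"])}

serve.run(Generator.bind())  # Ray Serve handles routing and replica scaling
```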
8. BentoCloud: Pythonic Model Packaging with Dedicated Compute
H100: ~$4.00/hr (estimated, not publicly listed) | BentoML packaging | Autoscaling endpoints
BentoCloud pricing is not publicly listed. The $4.00/hr H100 figure is estimated from industry benchmarks. Check bentocloud.bentoml.com for current rates.
BentoCloud is the closest architectural parallel to Baseten in this list. BentoML is their model packaging abstraction (a Python class-based framework like Truss), and BentoCloud is the managed hosting layer on top. You get autoscaling endpoints, built-in observability, and a managed serving experience without owning GPU infrastructure.
Teams evaluating both platforms typically find BentoML and Truss comparable in capability and learning curve. The choice often comes down to community size (Truss has more Baseten-specific documentation) and pricing (BentoCloud's estimated rate is below Baseten's at scale). If you are already frustrated with Truss but want a similar managed packaging approach rather than raw bare metal, BentoCloud is worth evaluating.
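For orientation, a BentoML service (1.2+ API) has roughly this shape — an illustrative sketch; the resource spec and load_model() helper are placeholders:

```python
# Rough shape of a BentoML 1.2+ service (illustrative sketch)
import bentoml

@bentoml.service(resources={"gpu": 1})
class LLMService:
    def __init__(self):
        self.model = load_model()  # hypothetical helper; bring your own loading code

    @bentoml.api
    def generate(self, prompt: str) -> str:
        return self.model.generate(prompt)
```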
9. NVIDIA DGX Cloud Lepton: LLM-Optimized Serverless
Per-token pricing | Multi-region | Backed by NVIDIA Cloud Partners network
NVIDIA DGX Cloud Lepton (formerly Lepton AI, rebranded after NVIDIA acquired the company and announced DGX Cloud Lepton at COMPUTEX in May 2025) provides LLM-optimized serverless inference on popular model families. The platform covers Llama, Mistral, and other major open-weight models with optimized serving and competitive per-token pricing.
The NVIDIA backing gives DGX Cloud Lepton early access to new GPU hardware and tight integration with NVIDIA's software stack (TensorRT-LLM, NIM microservices). For teams that want NVIDIA-ecosystem-backed managed inference without deploying their own stack, this is a strong option. The tradeoff versus Baseten is control: DGX Cloud Lepton does not support custom Truss-style model wrappers. You get the public model catalog, not arbitrary Python inference logic.
10. Beam: Serverless GPU with Scheduled Jobs
H100: ~$3.50/hr | Per-second billing | Python-native | Scheduled job support
Beam's key differentiator over Baseten is scheduled job support alongside inference endpoints. You can run cron-triggered batch inference, periodic model evaluation, or retraining jobs in the same platform as your serving endpoints. Baseten focuses on inference serving and does not cover this pattern.
The H100 effective rate around $3.50/hr is below Baseten's ~$6.50/hr and below Modal's ~$3.95/hr, though above Spheron's bare-metal rates. The Python-native deployment model (similar to Modal's decorator approach) means lower switching cost from Baseten's Truss than migrating to bare metal. For teams whose Baseten workloads include periodic batch jobs alongside serving, Beam avoids running two separate platforms.
Pricing Comparison: Cost per 1M Tokens (Llama 3.1 70B FP8 on H100)
The table below estimates per-token cost for Llama 3.1 70B FP8 across platforms. For dedicated providers, the methodology is: hourly_rate / tokens_per_second / 3600 * 1,000,000. The baseline throughput assumption is 800 tokens/second on a single H100 SXM5 with vLLM continuous batching. Serverless providers use their published per-token rates directly.
| Provider | Pricing Model | Est. $/1M output tokens (70B FP8) | Notes |
|---|---|---|---|
| Spheron H100 SXM5 | Per minute, dedicated | $1.53 | At 800 tok/s with vLLM continuous batching |
| RunPod H100 SXM | Per hour, dedicated | $0.93 | At 800 tok/s, $2.69/hr |
| Together AI | Per token | $0.88 | Published rate |
| Fireworks AI | Per token | $0.90 | Published rate |
| Modal | Per second | $1.37 | At 800 tok/s, $3.95/hr effective |
| Baseten | Per replica-hour | $2.26 | At 800 tok/s, $6.50/hr; excludes replica markup overhead |
| Replicate | Per second | $1.91 | At 800 tok/s, $5.49/hr |
Pricing fluctuates based on GPU availability. The Spheron prices above are as of 05 May 2026 and may have changed. Check current GPU pricing → for live rates.
Note that the Baseten figure excludes any replica markup overhead. Running two replicas for production redundancy doubles the effective per-token cost to $4.52/1M tokens at the same throughput, which is significantly above every other option in this table.
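The dedicated-provider rows can be reproduced directly from the formula above; a quick sketch at the same 800 tok/s assumption:

```python
# Reproduces the dedicated rows above: hourly_rate / (tokens_per_second * 3600) * 1,000,000
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float = 800.0) -> float:
    return hourly_rate / (tokens_per_second * 3600) * 1_000_000

rates = {"Spheron H100 SXM5": 4.41, "RunPod H100 SXM": 2.69, "Modal": 3.95,
         "Baseten": 6.50, "Replicate": 5.49}
for name, rate in rates.items():
    print(f"{name}: ${cost_per_million_tokens(rate):.2f}/1M output tokens")
# Spheron ~$1.53, RunPod ~$0.93, Modal ~$1.37, Baseten ~$2.26, Replicate ~$1.91
```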
For the full cost-per-token methodology including batch size impact, quantization effects, and how throughput changes the break-even, see GPU cost per token benchmarks.
Migration Guide: Porting a Baseten Truss Model to vLLM on Spheron
Step 1: Identify your base model
In Baseten's config.yaml, the model_name or hf_model_name field names the HuggingFace model. Note the model ID. Check model.py's load() method for any custom preprocessing or tokenization logic that is not handled by the standard HuggingFace API. Standard models with no custom load() logic migrate in minutes.
Step 2: Provision a Spheron H100 instance
Via the Spheron dashboard, rent an H100 SXM5 or H100 PCIe instance depending on your model size and throughput requirements. SSH into the instance once it is running.
Step 3: Install and launch vLLM
pip install vllm
vllm serve <your-hf-model-id> \
--served-model-name <model-alias> \
--tensor-parallel-size 1 \
--quantization fp8 \
--port 8000

For 70B models that exceed a single H100's 80GB VRAM, use --tensor-parallel-size 2 across two H100s, or use --quantization fp8 to fit the weights on a single card. The vLLM server starts an OpenAI-compatible endpoint on port 8000.
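Once the server logs that it is ready, a quick sanity check from the instance confirms the endpoint is live (vLLM exposes the standard OpenAI-style model listing):

```python
import requests

# Should list the alias passed via --served-model-name
print(requests.get("http://localhost:8000/v1/models").json())
```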
Step 4: Update your client
Change only the base_url and api_key in your existing client code. The /v1/chat/completions path and request body format are identical to Baseten's OpenAI-compatible API.
# Before (Baseten)
import os
import openai
client = openai.OpenAI(
base_url="https://model-<id>.api.baseten.co/environments/production/sync/v1",
api_key=os.environ["BASETEN_API_KEY"],
)
# After (Spheron vLLM)
client = openai.OpenAI(
base_url="http://<spheron-instance-ip>:8000/v1",
api_key="not-needed",
)

If your Truss model.py had custom preprocessing (tokenization overrides, prompt templating, pre/post-processing), replicate that logic as a vLLM chat template or a thin proxy layer in front of the vLLM server. Most standard HuggingFace models do not need this step.
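Continuing from the client defined above, a quick end-to-end check sends the same request shape the Baseten endpoint was already receiving:

```python
resp = client.chat.completions.create(
    model="<model-alias>",  # matches --served-model-name from Step 3
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```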
For full production vLLM configuration including tensor parallelism, quantization, and health-check setup, see the vLLM production deployment guide. For adding authentication and HTTPS to the OpenAI-compatible endpoint, see the self-hosted OpenAI-compatible API guide.
Decision Matrix: When Baseten Still Wins vs When Alternatives Are Cheaper
Stay on Baseten if:
- You need SLA contracts with uptime guarantees and financial penalties for downtime
- Your team requires a managed serving layer with no ops ownership of GPU infrastructure
- Compliance or private VPC requirements make self-managed infra impractical
- Your team ships endpoints via Truss and has no bandwidth to rewrite deployment code
- Cold-start behavior on serverless replicas is acceptable for your traffic pattern
- Your organization already has a Baseten enterprise agreement with dedicated account support
Switch to bare metal (Spheron H100 or B300) if:
- Replica-hour billing is costing more than dedicated GPU time at your current throughput
- You need to serve custom fine-tuned checkpoints or LoRA adapters without Truss packaging
- P99 latency SLOs require dedicated hardware with no cold-start variance
- You want to use SGLang, TensorRT-LLM, or a custom CUDA-optimized serving stack
- Your data cannot leave a self-controlled environment
Switch to serverless (Together AI, Fireworks AI, NVIDIA DGX Cloud Lepton) if:
- Traffic is genuinely bursty with long idle periods (under 10M tokens per day)
- You need catalog access to many models without deploying each one individually
- Per-token pricing beats your replica-hour cost at current volume (a quick break-even sketch follows below)
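A rough break-even sketch using the illustrative rates from this guide (one replica at ~$6.50/hr versus a ~$0.90 per 1M token serverless rate):

```python
# Break-even between replica-hour and per-token billing (illustrative rates, not quotes).
replica_rate_per_hour = 6.50       # assumed single-replica H100 rate
serverless_per_million = 0.90      # e.g. a 70B-class per-token rate
replicas = 1

daily_replica_cost = replica_rate_per_hour * 24 * replicas               # $156/day per replica
breakeven_tokens_per_day = daily_replica_cost / serverless_per_million * 1_000_000

print(f"Break-even: ~{breakeven_tokens_per_day / 1e6:.0f}M tokens/day")  # ~173M tokens/day
```

Below that daily volume per replica, per-token billing wins; a two-replica redundant setup doubles the break-even point.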
The Bottom Line
Baseten earns its premium for teams that need managed production serving with SLAs and are willing to pay for not owning the GPU stack. The Truss framework, observability tooling, and enterprise contracts are worth real money for teams where the alternative is hiring DevOps to manage inference infrastructure.
The cases where it stops making sense are predictable. Replica-hour billing scales painfully at sustained throughput: running two replicas at $6.50/hr each costs $9,360/month in GPU charges alone, before a single request is made. Self-hosted vLLM on a dedicated H100 SXM5 at $4.41/hr covers 90% of what Baseten provides for production inference, at under half the cost. Factor in the Truss lock-in, which most teams could migrate away from in a day, and the reasons to stay narrow to the specific cases where SLAs, VPCs, and compliance documentation have direct budget value.
Baseten's Truss model is polished, but replica-hour pricing compounds fast at production volumes. Spheron H100 and B300 bare-metal instances give you full vLLM/SGLang control with per-minute billing and no replica markup.
Rent H100 on Spheron → | Rent B300 on Spheron → | View all GPU pricing →
