Rent NVIDIA H100 GPUs on Demand from $0.80/hr
80GB HBM3, 400 Gb/s InfiniBand, per-minute billing. Live in under 2 minutes.
Renting an NVIDIA H100 on Spheron starts at $0.80 per GPU per hour on demand, with spot instances cheaper still. There is no minimum commit, billing is per minute, and an instance is usually live in under 2 minutes. The H100 has 80GB of HBM3 and 3.35 TB/s of memory bandwidth, which is enough headroom to fine-tune Llama 3 70B in 4-bit on a single card or serve a 70B model at production latency. For multi-node training, every H100 node ships with 400 Gb/s InfiniBand and GPUDirect RDMA. On-demand H100 pricing on AWS, GCP, and Azure currently sits between about $3 and $7 per GPU per hour.
Technical specifications
Pricing comparison
| Provider | Price/hr (per GPU) | Cost vs. Spheron |
|---|---|---|
| Spheron (your price) | $0.80/hr | - |
| RunPod | $1.99/hr | 2.5x more expensive |
| Lambda Labs | $2.99/hr | 3.7x more expensive |
| Google Cloud | $3.00/hr | 3.8x more expensive |
| Nebius | $3.08/hr | 3.9x more expensive |
| AWS | $3.90/hr | 4.9x more expensive |
| CoreWeave | $6.16/hr | 7.7x more expensive |
| Azure | $6.98/hr | 8.7x more expensive |
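Each multiplier in the table is the provider's hourly price divided by the Spheron price, rounded to one decimal place. A minimal sketch to reproduce the column, assuming the prices listed above; the `PRICES` dictionary and `price_multiple` function are illustrative names, not part of any Spheron tooling:

```python
from decimal import Decimal, ROUND_HALF_UP

# Hourly on-demand H100 prices in USD per GPU, copied from the table above.
PRICES = {
    "Spheron": Decimal("0.80"),
    "RunPod": Decimal("1.99"),
    "Lambda Labs": Decimal("2.99"),
    "Google Cloud": Decimal("3.00"),
    "Nebius": Decimal("3.08"),
    "AWS": Decimal("3.90"),
    "CoreWeave": Decimal("6.16"),
    "Azure": Decimal("6.98"),
}


def price_multiple(provider: str, baseline: str = "Spheron") -> Decimal:
    """Return provider price / baseline price, rounded half-up to 0.1."""
    ratio = PRICES[provider] / PRICES[baseline]
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for name in PRICES:
        if name != "Spheron":
            print(f"{name}: {price_multiple(name)}x more expensive")
```

Decimal arithmetic is used so that a ratio such as 3.00 / 0.80 = 3.75 rounds the same way the table does.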
Need More H100 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more H100 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the H100
Pick the H100 if
You are training or fine-tuning a 30B+ parameter model, serving a 70B model in production, or running multi-node distributed training that depends on InfiniBand and GPUDirect RDMA. The H100 is also the right call for FP8 inference workloads where the Transformer Engine pays back the price difference vs older silicon.
Pick the A100 instead if
Your model fits in 80GB and you do not need FP8. The A100 runs roughly half the price of the H100 and handles BERT-class training, sub-30B fine-tunes, and most production inference without breaking a sweat. Spend the savings on more GPUs.
Pick the H200 instead if
You are bottlenecked by memory, not compute. The H200 keeps the H100 architecture but raises memory from 80GB of HBM3 to 141GB of HBM3e (about 1.76x) and bandwidth to 4.8 TB/s. Long-context inference, KV-cache-heavy serving, and fitting 70B models without sharding are where the H200 wins.
Pick the B200 instead if
You are training trillion-parameter models or pushing the absolute frontier of throughput per node. The B200 has roughly 2.3x the FP8 dense TFLOPS of H100, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups vary by model and stack. The catch is availability and price. Most teams do not need it yet.
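A quick way to sanity-check these choices is to estimate how much memory the model weights alone need at a given precision. The sketch below uses the memory capacities quoted on this page and assumes weights take parameters x bits / 8 bytes. It ignores KV cache, activations, and framework overhead, so treat the result as a floor and use the hypothetical `reserve_gb` argument to leave headroom:

```python
# GPU memory capacities in GB, as quoted on this page.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192}


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def gpus_with_room(params_billions: float, bits_per_param: int,
                   reserve_gb: float = 0.0) -> list[str]:
    """List GPUs whose memory covers the weights plus a reserve.

    reserve_gb is a rough allowance for KV cache and activations;
    the right value depends on batch size and context length.
    """
    needed = weight_memory_gb(params_billions, bits_per_param) + reserve_gb
    return [name for name, cap in GPU_MEMORY_GB.items() if cap >= needed]


if __name__ == "__main__":
    for bits in (4, 8, 16):
        gb = weight_memory_gb(70, bits)
        print(f"70B at {bits}-bit: {gb:.0f} GB -> {gpus_with_room(70, bits)}")
```

For a 70B model this gives 35 GB of weights at 4-bit, 70 GB at FP8, and 140 GB at 16-bit, which is why 16-bit weights do not fit on a single 80GB card.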
Ideal use cases
LLM training and fine-tuning
Train transformer architectures from scratch or fine-tune frontier-class models with FP8 mixed precision. The H100's Transformer Engine cuts memory and time per step versus FP16.
Production LLM inference
Serve large models at low latency with vLLM, TensorRT-LLM, or SGLang. The H100 handles 70B-class models at production traffic without sharding overhead.
Diffusion and video generation
Run image and video diffusion pipelines that need high VRAM and bandwidth. The H100 chews through batch sizes that the RTX 4090 cannot touch.
HPC and scientific computing
FP64 throughput on the H100 is 34 TFLOPS, roughly 3.5 times the A100's 9.7 TFLOPS. That matters for simulation work where double precision is non-negotiable.
Performance benchmarks
Launch vLLM on an H100 in under 2 minutes
Spin up a Spheron H100 instance, pull the vLLM image, and serve Llama 3 70B with an OpenAI-compatible API. Drop the same client into your existing OpenAI SDK call and point the base URL at the new endpoint.
    # 1. Provision an H100 from the Spheron CLI (or use the dashboard)
    spheron deploy --gpu h100 --image vllm/vllm-openai:latest

    # 2. Inside the instance, serve Llama 3 70B with FP8
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Meta-Llama-3-70B-Instruct \
      --quantization fp8 \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.92 \
      --port 8000

    # 3. Hit the endpoint from any OpenAI-compatible client
    curl http://<instance-ip>:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Explain FP8 quantization."}]
      }'

Need multi-GPU or multi-node? Add --tensor-parallel-size 2 for 2x H100, or contact us for InfiniBand-connected clusters.
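The same request can be sent from Python. This is a minimal sketch that uses only the standard library rather than the OpenAI SDK, so it runs without extra dependencies. It assumes the server started above is reachable at a base URL you supply (the instance IP is a placeholder you must fill in) and that the response follows the OpenAI chat completions shape. The function names are illustrative:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str,
         timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Call it as `chat("http://<instance-ip>:8000", "meta-llama/Meta-Llama-3-70B-Instruct", "Explain FP8 quantization.")` once the instance is up. If you already use the OpenAI SDK, pointing its base URL at the same address achieves the same thing.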
InfiniBand for multi-node H100 training
Every H100 node on Spheron ships with 400 Gb/s InfiniBand. Multi-node training jobs synchronize gradients over RDMA at microsecond-scale latency, so distributed training scales close to linear with node count instead of stalling on the network.
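To get a feel for what 400 Gb/s means for gradient synchronization, a bandwidth-only estimate for a ring all-reduce is a useful starting point. In a ring all-reduce across N nodes, each node sends about 2 x (N - 1) / N times the gradient payload. The sketch below assumes one 400 Gb/s link per node running at full line rate, and it ignores latency, protocol overhead, and the overlap of communication with compute that real training stacks rely on, so it is a rough bound rather than a benchmark:

```python
def ring_allreduce_seconds(payload_bytes: float, nodes: int,
                           link_gbps: float = 400.0) -> float:
    """Bandwidth-only time estimate for one ring all-reduce.

    payload_bytes: size of the gradients being reduced.
    nodes: number of nodes in the ring.
    link_gbps: per-node link speed in gigabits per second (assumed).
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if nodes == 1:
        return 0.0  # nothing to synchronize on a single node
    bytes_per_second = link_gbps * 1e9 / 8
    bytes_on_wire = 2 * (nodes - 1) / nodes * payload_bytes
    return bytes_on_wire / bytes_per_second


if __name__ == "__main__":
    # Example: 70B parameters with 2-byte gradients is 140 GB of payload.
    payload = 70e9 * 2
    for n in (2, 4, 10):
        print(f"{n} nodes: ~{ring_allreduce_seconds(payload, n):.2f} s")
```

The per-node traffic approaches twice the payload as N grows and then stops increasing, which is the property that lets ring-style collectives scale with node count.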
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
H100 vs alternatives
H200: the same compute as the H100, with 1.76x the memory and 1.4x the bandwidth. If you keep hitting OOM on long context or large KV caches, the H200 fixes that.
B200 is the next generation: ~2.3x H100's FP8 dense TFLOPS, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups depend on model and stack maturity. Worth the jump if you can secure capacity.
The 4090 has 24GB and no NVLink. Fine for sub-13B inference and hobbyist training, not for serving 70B models or any serious distributed work.
Related resources
NVIDIA H100 vs H200: benchmarks and when to upgrade
Side-by-side specs, inference throughput, and the workloads where the H200's extra memory pays for itself.
Running 10 concurrent fine-tuning jobs on bare-metal H100s
Architecture, scheduling, and cost breakdown for parallelizing fine-tunes across a bare-metal H100 cluster.
Building a sub-200ms RAG pipeline on bare-metal H100s
How a production RAG stack hits 2M queries per day with H100 inference and aggressive KV-cache reuse.
vLLM production deployment in 2026
Serving large models with vLLM on Spheron H100s: configuration, tuning, and the gotchas to avoid.
Best NVIDIA GPUs for LLMs
Decision framework for picking H100, H200, B200, A100, or RTX-class GPUs based on model size and budget.
GPU cost optimization playbook
Practical tactics to cut H100 spend: spot scheduling, FP8, batching, and right-sizing.
Frequently asked questions
How much does it cost to rent an H100 GPU?
On Spheron the H100 starts at $0.80 per GPU per hour on dedicated instances (99.99% SLA, non-interruptible), with interruptible spot instances cheaper still. There is no minimum commitment and billing is per minute, so a one-hour test costs you one hour. For comparison, AWS p5 H100 instances run around $3.90/hr per GPU on demand after the mid-2025 price cut, Google Cloud A3-high is about $3.00/hr, and Azure ND H100 v5 sits near $7/hr.
What is the cheapest way to rent an H100?
Spot instances on Spheron are the cheapest path, often 50 to 70 percent below the dedicated rate. Spot and dedicated are both on-demand tiers with per-minute billing. The trade-off is that spot instances can be reclaimed when demand spikes, so checkpoint your training state every 15 to 30 minutes and reserve spot for fault-tolerant workloads. For long uninterrupted runs, stay on dedicated (99.99% SLA, non-interruptible).
Can I rent an H100 by the hour?
Yes. Spheron bills per minute with no minimum. You can rent an H100 for a single hour to benchmark a workload, or keep it running for months. There are no contracts, reserved-instance lock-ins, or commit fees on dedicated or spot.
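Per-minute billing makes short runs easy to cost out. The sketch below is an illustration, not Spheron's billing logic: it assumes usage is rounded up to the next whole minute and the total is rounded to the nearest cent, and it uses the $0.80/hr starting rate as the default.

```python
from decimal import Decimal, ROUND_HALF_UP

HOURLY_RATE = Decimal("0.80")  # USD per GPU per hour, starting rate


def billed_cost(seconds: int, hourly_rate: Decimal = HOURLY_RATE) -> Decimal:
    """Estimate cost for one GPU under per-minute billing.

    Assumptions (not confirmed by the provider): partial minutes round
    up, and the total rounds half-up to cents.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    minutes = -(-seconds // 60)  # ceiling division on integers
    total = hourly_rate * minutes / 60
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(billed_cost(25 * 60))     # a 25-minute benchmark
    print(billed_cost(12 * 3600))   # a 12-hour run
```

Under these assumptions a 12-hour run on one GPU comes to $9.60 and a one-hour test to $0.80.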
How fast can I deploy an H100 instance?
Most H100 instances are live in under 2 minutes. The hardware is pre-warmed, so provisioning is closer to a container start than a hyperscaler VM boot. If you have a Docker image ready, you can be running training within two minutes of clicking deploy.
Is the H100 worth it over the A100?
If you are doing FP8 inference, training models above 30B parameters, or running anything that benefits from the Transformer Engine, the H100 is the right choice. It is roughly 3 to 4x faster on those workloads. For sub-30B training and most inference at scale, the A100 80GB delivers better dollars per token.
Do you support multi-node H100 clusters with InfiniBand?
Yes. Spheron supports up to 8x H100 per node with NVLink, and bare-metal clusters up to 80x H100 across 10 nodes connected by 400 Gb/s InfiniBand. Every cluster is tested with PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM. For larger configurations, contact us.
What deep learning frameworks come pre-installed?
PyTorch, TensorFlow, JAX, and the major serving stacks (vLLM, TensorRT-LLM, SGLang, Triton). Containers ship with CUDA 12.4+, cuDNN, NCCL, and the standard NVIDIA AI Enterprise libraries. You can also bring your own Docker image.
What regions are H100s available in?
H100 capacity is online across North America, Europe, and Asia, sourced from data center partners. Specific availability shifts with demand. The dashboard shows current capacity per region in real time.
What is the difference between H100 SXM and H100 PCIe?
SXM5 is the higher-power variant (700W) with NVLink connectivity between GPUs in a node, which matters for multi-GPU training. PCIe is air-cooled, lower power (350W), and easier to mix with existing servers. Spheron offers both. Pick SXM for distributed training, PCIe for single-GPU inference.
How long does it take to fine-tune Llama 3 70B on an H100?
A QLoRA fine-tune of Llama 3 70B on a 50k-sample dataset takes roughly 8 to 12 hours on a single H100. Full-parameter fine-tunes need 4 to 8 H100s and run in the 24 to 48 hour range depending on dataset size and sequence length. We have a fine-tuning case study with the exact numbers.
Can I run H100 on spot instances safely?
Yes, with checkpointing. Spot instances on Spheron are reclaimed when demand rises, so unsaved state is at risk. The safe pattern is: checkpoint every 15 to 30 minutes to a persistent volume, restart from the latest checkpoint on preemption, and reserve spot for training, batch jobs, and fault-tolerant inference. Use dedicated instances (99.99% SLA) for production serving.
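The checkpoint-and-resume pattern can be sketched in a few lines. This example is deliberately framework-free: the state is a small JSON dictionary and the training step is a placeholder, whereas a real job would save model and optimizer state with its framework's own tools. The path, function names, and 15-minute default are assumptions for illustration. The write goes to a temporary file first and is then renamed, so a preemption in the middle of a save cannot corrupt the latest good checkpoint.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if none exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as handle:
        return json.load(handle)


def train(path: str, total_steps: int,
          checkpoint_every_s: float = 15 * 60) -> dict:
    """Resume from the latest checkpoint and save on a time interval."""
    state = load_checkpoint(path)
    last_save = time.monotonic()
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one real training step
        if time.monotonic() - last_save >= checkpoint_every_s:
            save_checkpoint(path, state)
            last_save = time.monotonic()
    save_checkpoint(path, state)  # final save when the run completes
    return state
```

Point `path` at a persistent volume so the file survives the instance being reclaimed. On restart, calling `train` again picks up from the last saved step.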
Does the H100 support FP8 training and inference?
Yes. The Transformer Engine on Hopper supports FP8 mixed precision out of the box. In practice this is a roughly 1.7x speedup over FP16 with no measurable accuracy loss for most LLM workloads, plus halved memory pressure on activations. vLLM, TensorRT-LLM, and PyTorch all support FP8 paths.
Do you offer enterprise SLAs and dedicated support for H100 deployments?
For 100+ GPU deployments and production-critical workloads, Spheron offers dedicated Slack or Discord support, sourcing assistance for capacity, and SLA-backed instances. Smaller deployments are self-serve through the dashboard.
How does pricing on Spheron compare to AWS, GCP, and Azure?
For the same H100 hardware, Spheron is meaningfully cheaper than AWS p5, Azure ND H100, and GCP A3 on-demand. As of April 2026 hyperscaler on-demand H100 pricing runs roughly $3.00 per GPU per hour on GCP A3-high, ~$3.90/hr on AWS p5, and ~$7/hr on Azure ND H100 v5. Spheron starts at $0.80/hr. Same chip, different pricing model.