Spheron GPU Catalog

Rent NVIDIA H100 GPUs on Demand from $0.80/hr

80GB HBM3, 400 Gb/s InfiniBand, per-minute billing. Live in under 2 minutes.

At a glance

Renting an NVIDIA H100 on Spheron starts at $0.80 per GPU per hour on demand, with spot instances cheaper still. There is no minimum commit, billing is per minute, and an instance is usually live in under 2 minutes. The H100 has 80GB of HBM3 and 3.35 TB/s of memory bandwidth, which is enough headroom to fine-tune Llama 3 70B in 4-bit on a single card or serve an FP8-quantized 70B model at production latency. For multi-node training, every H100 node ships with 400 Gb/s InfiniBand and GPUDirect RDMA. For comparison, on-demand H100 pricing on AWS, GCP, and Azure currently sits between about $3 and $7 per GPU per hour.

GPU Architecture: NVIDIA Hopper
VRAM: 80 GB HBM3
Memory Bandwidth: 3.35 TB/s

Technical specifications

GPU Architecture
NVIDIA Hopper
VRAM
80 GB HBM3
Memory Bandwidth
3.35 TB/s
Tensor Cores
4th Generation
CUDA Cores
16,896
FP64 Performance
34 TFLOPS
FP32 Performance
67 TFLOPS
TF32 Performance
989 TFLOPS (with sparsity)
FP16 Performance
1,979 TFLOPS (with sparsity)
INT8 Performance
3,958 TOPS (with sparsity)
System RAM
116 GB DDR4
vCPUs
26 vCPUs
Storage
2.4 TB NVMe SSD
Network
400 Gb/s InfiniBand
TDP
700W

Pricing comparison

Provider                Price/hr    Savings
Spheron (your price)    $0.80/hr    -
RunPod                  $1.99/hr    2.5x more expensive
Lambda Labs             $2.99/hr    3.7x more expensive
Google Cloud            $3.00/hr    3.8x more expensive
Nebius                  $3.08/hr    3.9x more expensive
AWS                     $3.90/hr    4.9x more expensive
CoreWeave               $6.16/hr    7.7x more expensive
Azure                   $6.98/hr    8.7x more expensive
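
To see what the hourly rates above mean for a concrete job, here is a minimal sketch that applies per-minute billing to each listed price. The prices are copied from the table. The function name `job_cost`, the example job (8 GPUs for 90 minutes), the rounding of partial minutes up to a full minute, and the assumption that every provider bills per minute are all illustrative and not confirmed by any provider.

```python
import math

# On-demand H100 prices per GPU-hour, copied from the comparison above.
PRICES_PER_GPU_HOUR = {
    "Spheron": 0.80,
    "RunPod": 1.99,
    "Lambda Labs": 2.99,
    "Google Cloud": 3.00,
    "Nebius": 3.08,
    "AWS": 3.90,
    "CoreWeave": 6.16,
    "Azure": 6.98,
}


def job_cost(price_per_gpu_hour: float, gpus: int, minutes: float) -> float:
    """Dollar cost of a job under per-minute billing.

    Assumption: a partial minute is billed as a full minute.
    """
    billed_minutes = math.ceil(minutes)
    return price_per_gpu_hour / 60 * billed_minutes * gpus


# Hypothetical job: 8 GPUs for 90 minutes.
for provider, price in PRICES_PER_GPU_HOUR.items():
    print(f"{provider:<14} ${job_cost(price, gpus=8, minutes=90):8.2f}")
```

Swap in your own GPU count and runtime to compare providers for a specific workload.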
Custom & Reserved

Need More H100 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more H100 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the H100

Scenario 01

Pick the H100 if

You are training or fine-tuning a 30B+ parameter model, serving a 70B model in production, or running multi-node distributed training that depends on InfiniBand and GPUDirect RDMA. The H100 is also the right call for FP8 inference workloads where the Transformer Engine pays back the price difference vs older silicon.

Recommended fit
Scenario 02

Pick the A100 instead if

Your model fits in 80GB and you do not need FP8. The A100 runs roughly half the price of the H100 and handles BERT-class training, sub-30B fine-tunes, and most production inference without breaking a sweat. Spend the savings on more GPUs.

Recommended fit
Scenario 03

Pick the H200 instead if

You are bottlenecked by memory, not compute. The H200 keeps the H100 architecture but raises memory from 80GB to 141GB of HBM3e at 4.8 TB/s. Long-context inference, KV-cache-heavy serving, and fitting 70B models without sharding are where the H200 wins.

Recommended fit
Scenario 04

Pick the B200 instead if

You are training trillion-parameter models or pushing the absolute frontier of throughput per node. The B200 has roughly 2.3x the FP8 dense TFLOPS of H100, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups vary by model and stack. The catch is availability and price. Most teams do not need it yet.

Recommended fit
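
Whether a model "fits in 80GB" or is "bottlenecked by memory" comes down to simple arithmetic on weights and KV cache. The sketch below is a rough estimate only: it ignores activations, framework overhead, and fragmentation. The function names are hypothetical, and the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) is taken from the published model config and should be treated as an assumption to verify against the checkpoint you actually load.

```python
GB = 1e9  # decimal gigabytes, the unit GPU memory is usually quoted in


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed for the model weights alone."""
    return params_billion * 1e9 * bits_per_param / 8 / GB


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache for `tokens` cached tokens (factor 2 = keys plus values)."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / GB


# 70B parameters at different precisions.
for bits in (16, 8, 4):
    print(f"70B weights at {bits:>2}-bit: {weight_memory_gb(70, bits):6.1f} GB")

# One 8192-token sequence with 16-bit KV values (assumed Llama 3 70B shape).
print(f"KV cache, 8192 tokens: "
      f"{kv_cache_gb(8192, layers=80, kv_heads=8, head_dim=128):.2f} GB")
```

Under these assumptions, 16-bit weights for a 70B model need about 140 GB, so they do not fit on one 80GB card, while 8-bit weights (about 70 GB) fit with little room left for KV cache. That remaining headroom is what the H200's extra memory buys.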

Ideal use cases

Use case / 01
🤖

LLM training and fine-tuning

Train transformer architectures from scratch or fine-tune frontier-class models with FP8 mixed precision. The H100's Transformer Engine cuts memory and time per step versus FP16.

Pre-training 7B to 70B parameter models
Full fine-tunes of Llama 3, Qwen, Mistral, and DeepSeek
LoRA / QLoRA at scale across multiple adapters
Multi-modal models combining text, vision, and audio
Use case / 02

Production LLM inference

Serve large models at low latency with vLLM, TensorRT-LLM, or SGLang. Quantized to FP8, a 70B-class model fits on a single H100, so it can handle production traffic without sharding overhead.

Llama 3 70B serving at sub-second TTFT
DeepSeek V3 and R1 deployments
Real-time chat and agent backends
Speculative decoding with paired draft models
Use case / 03
🎨

Diffusion and video generation

Run image and video diffusion pipelines that need high VRAM and bandwidth. The H100 chews through batch sizes that the RTX 4090 cannot touch.

Stable Diffusion XL and Flux at high batch
Wan 2.1 and other video diffusion models
ControlNet and IP-Adapter pipelines at scale
Voice cloning and TTS production workloads
Use case / 04
🔬

HPC and scientific computing

FP64 throughput on the H100 is 34 TFLOPS, roughly 3.5 times the A100's 9.7 TFLOPS. That matters for simulation work where double precision is non-negotiable.

Molecular dynamics and protein folding
Computational fluid dynamics
Climate and weather modeling
Quantum chemistry and DFT codes

Performance benchmarks

Llama 3.1 8B inference (vLLM, BF16): ~12,500 tok/s (single H100 SXM)
Llama 3.1 70B inference (FP8, BS=64): ~460 tok/s (single H100 SXM)
GPT-3 175B training (MLPerf v3.0): up to 3.1x faster (vs A100 80GB)
LLM training (NVIDIA headline): up to 4x faster (vs A100 80GB)
Mixed-precision training throughput: ~2.4x faster (vs A100 80GB)
FP8 vs FP16 throughput (TE): up to 1.6x faster (Transformer Engine on)
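
Throughput figures like the ones above become easier to compare once converted to cost per million tokens. This is a minimal sketch under a strong assumption: the GPU sustains the benchmark throughput for the whole billed hour, which real traffic rarely achieves. The function name is hypothetical; the price and tok/s values are the ones quoted on this page.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float,
                            gpus: int = 1) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour * gpus / tokens_per_hour * 1_000_000


# Figures quoted above, at the $0.80/hr starting price.
print(f"Llama 3.1 8B  (BF16): ${cost_per_million_tokens(0.80, 12_500):.4f} per 1M tokens")
print(f"Llama 3.1 70B (FP8):  ${cost_per_million_tokens(0.80, 460):.4f} per 1M tokens")
```

Treat the result as a lower bound on cost, since idle time and lower batch sizes both reduce effective throughput.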

Launch vLLM on an H100 in under 2 minutes

Spin up a Spheron H100 instance, pull the vLLM image, and serve Llama 3 70B with an OpenAI-compatible API. Keep your existing OpenAI SDK client code and point its base URL at the new endpoint.

bash
# 1. Provision an H100 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h100 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3 70B with FP8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain FP8 quantization."}]
  }'

Need multi-GPU or multi-node? Add --tensor-parallel-size 2 for 2x H100, or contact us for InfiniBand-connected clusters.
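
The curl call in step 3 can also be issued from Python with only the standard library. This is a sketch, not an official client: `build_chat_request` and `chat` are hypothetical helper names, the base URL is a placeholder you must replace with your instance address, and the response parsing assumes the server returns the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI chat-completions response layout.
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server; replace the placeholder address):
# print(chat("http://<instance-ip>:8000",
#            "meta-llama/Meta-Llama-3-70B-Instruct",
#            "Explain FP8 quantization."))
```

If you already use the OpenAI SDK, changing its base URL achieves the same thing with less code.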

Interconnect fabric

InfiniBand for multi-node H100 training

Every H100 node on Spheron ships with 400 Gb/s InfiniBand. Multi-node training jobs synchronize gradients over RDMA at sub-microsecond latency, so distributed training scales close to linear with node count instead of stalling on the network.

01 400 Gb/s InfiniBand connectivity per GPU
02 NVIDIA ConnectX-7 network adapters
03 GPUDirect RDMA for zero-copy GPU-to-GPU transfers
04 Optimized for NCCL collective operations
05 Sub-microsecond GPU-to-GPU latency
06 Bare-metal clusters up to 80x H100 (10 nodes)
07 Tested with PyTorch DDP, DeepSpeed ZeRO, and Megatron-LM
08 NVLink and NVSwitch within each node
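
To get a feel for why link speed matters for gradient synchronization, the sketch below estimates the bandwidth-bound time of a ring all-reduce, in which each worker sends and receives 2(N-1)/N times the payload. It is a back-of-the-envelope model only: it ignores latency, overlap with compute, the intra-node NVLink hierarchy, and whichever algorithm NCCL actually selects. The function name and the example workload (16-bit gradients for a 7B-parameter model across 10 workers) are assumptions for illustration.

```python
def ring_allreduce_seconds(payload_bytes: float, workers: int,
                           link_gbit_per_s: float) -> float:
    """Bandwidth-only lower bound for one ring all-reduce."""
    if workers < 2:
        return 0.0  # nothing to synchronize with a single worker
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    # Reduce-scatter plus all-gather: each worker moves 2*(N-1)/N of the payload.
    bytes_on_wire = 2 * (workers - 1) / workers * payload_bytes
    return bytes_on_wire / bytes_per_second


# Hypothetical workload: 7B parameters, 2 bytes per gradient value.
gradient_bytes = 7e9 * 2
for gbit in (100, 400):
    seconds = ring_allreduce_seconds(gradient_bytes, workers=10, link_gbit_per_s=gbit)
    print(f"{gbit:>3} Gb/s link: {seconds:.3f} s per full gradient all-reduce")
```

In practice frameworks overlap this communication with the backward pass, so the measured cost per step is usually lower than the raw estimate.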
Scale

Need a custom multi-node cluster or reserved capacity?

Frequently asked questions

How much does it cost to rent an H100 GPU?

On Spheron the H100 starts at $0.80 per GPU per hour on dedicated instances (99.99% SLA, non-interruptible), with interruptible spot instances cheaper still. There is no minimum commitment and billing is per minute, so a one-hour test costs exactly one hour of usage. For comparison, AWS p5 H100 instances run around $3.90/hr per GPU on demand after the mid-2025 price cut, Google Cloud A3-high is about $3.00/hr, and Azure ND H100 v5 sits near $7/hr.

What is the cheapest way to rent an H100?

Spot instances on Spheron are the cheapest path, often 50 to 70 percent below the dedicated rate. Both are on-demand tiers with per-minute billing. The trade-off is that spot instances can be reclaimed when demand spikes, so checkpoint your training state every 15 to 30 minutes and treat spot as a fit for fault-tolerant workloads. For long uninterrupted runs, stay on dedicated (99.99% SLA, non-interruptible).

Can I rent an H100 by the hour?

Yes. Spheron bills per minute with no minimum. You can rent an H100 for a single hour to benchmark a workload, or keep it running for months. There are no contracts, reserved-instance lock-ins, or commit fees on dedicated or spot.

How fast can I deploy an H100 instance?

Most H100 instances are live in under 2 minutes. The hardware is pre-warmed, so provisioning is closer to a container start than a hyperscaler VM boot. If you have a Docker image ready, you can be running training within two minutes of clicking deploy.

Is the H100 worth it over the A100?

If you are doing FP8 inference, training models above 30B parameters, or running anything that benefits from the Transformer Engine, the H100 is the right choice. It can be up to 3 to 4x faster than the A100 on those workloads. For sub-30B training and most inference at scale, the A100 80GB delivers better dollars per token.

Do you support multi-node H100 clusters with InfiniBand?

Yes. Spheron supports up to 8x H100 per node with NVLink, and bare-metal clusters up to 80x H100 across 10 nodes connected by 400 Gb/s InfiniBand. Every cluster is tested with PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM. For larger configurations, contact us.

What deep learning frameworks come pre-installed?

PyTorch, TensorFlow, JAX, and the major serving stacks (vLLM, TensorRT-LLM, SGLang, Triton). Containers ship with CUDA 12.4+, cuDNN, NCCL, and the standard NVIDIA AI Enterprise libraries. You can also bring your own Docker image.

What regions are H100s available in?

H100 capacity is online across North America, Europe, and Asia, sourced from data center partners. Specific availability shifts with demand. The dashboard shows current capacity per region in real time.

What is the difference between H100 SXM and H100 PCIe?

SXM5 is the higher-power variant (700W) with NVLink connectivity between GPUs in a node, which matters for multi-GPU training. PCIe is air-cooled, lower power (350W), and easier to mix with existing servers. Spheron offers both. Pick SXM for distributed training, PCIe for single-GPU inference.

How long does it take to fine-tune Llama 3 70B on an H100?

A QLoRA fine-tune of Llama 3 70B on a 50k-sample dataset takes roughly 8 to 12 hours on a single H100. Full-parameter fine-tunes need 4 to 8 H100s and run in the 24 to 48 hour range depending on dataset size and sequence length. We have a fine-tuning case study with the exact numbers.

Can I run H100 workloads on spot instances safely?

Yes, with checkpointing. Spot instances on Spheron are reclaimed when demand rises, so unsaved state is at risk. The safe pattern is: checkpoint every 15 to 30 minutes to a persistent volume, restart from the latest checkpoint on preemption, and reserve spot for training, batch jobs, and fault-tolerant inference. Use dedicated instances (99.99% SLA) for production serving.
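
The checkpoint-and-resume pattern described above can be sketched as follows. This is an illustration under stated assumptions, not Spheron tooling: the helper names are hypothetical, the state is a small JSON dictionary standing in for real model and optimizer state (which you would normally save with your framework's own serializer), and the trigger is a step count rather than the 15 to 30 minute wall-clock interval, purely to keep the example deterministic. The directory should live on a persistent volume.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a preemption never leaves a partial file."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, final_path)  # atomic rename on the same filesystem
    return final_path


def load_latest_checkpoint(directory: Path):
    """Return the newest checkpoint, or None if there is nothing to resume from."""
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("step_*.json"))
    if not candidates:
        return None
    with candidates[-1].open() as handle:
        return json.load(handle)


def train(directory: Path, total_steps: int, checkpoint_every: int) -> dict:
    """Resume from the latest checkpoint, then continue to total_steps."""
    latest = load_latest_checkpoint(directory)
    start = latest["step"] + 1 if latest else 0
    state = latest["state"] if latest else {"loss": None}
    for step in range(start, total_steps):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(directory, step, state)
    return state
```

After a preemption, calling `train` again with the same directory picks up from the last saved step, so at most one checkpoint interval of work is lost.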

Does the H100 support FP8 training and inference?

Yes. The Transformer Engine on Hopper supports FP8 mixed precision out of the box. In practice this is a speedup of up to roughly 1.6x over FP16 with no measurable accuracy loss for most LLM workloads, plus halved memory pressure on activations. vLLM, TensorRT-LLM, and PyTorch all support FP8 paths.

Do you offer enterprise SLAs and dedicated support for H100 deployments?

For 100+ GPU deployments and production-critical workloads, Spheron offers dedicated Slack or Discord support, sourcing assistance for capacity, and SLA-backed instances. Smaller deployments are self-serve through the dashboard.

How does pricing on Spheron compare to AWS, GCP, and Azure?

For the same H100 hardware, Spheron is meaningfully cheaper than AWS p5, Azure ND H100, and GCP A3 on-demand. As of April 2026, hyperscaler on-demand H100 pricing runs roughly $3.00 per GPU per hour on GCP A3-high, about $3.90/hr on AWS p5, and about $7/hr on Azure ND H100 v5. Spheron starts at $0.80/hr. Same chip, different pricing model.
