Spheron GPU Catalog

Rent NVIDIA H200 GPUs on Demand from $1.19/hr

141GB HBM3e, 4.8 TB/s bandwidth, NVLink, per-minute billing. Live in under 2 minutes.

At a glance

Renting an NVIDIA H200 on Spheron starts at $1.19 per GPU per hour on dedicated instances (99.99% SLA), with interruptible spot instances cheaper still. Billing is per minute, there is no minimum commit, and most instances are live inside two minutes. The H200 shares the H100's Hopper compute (4th-gen Tensor Cores, FP8 via Transformer Engine, 989 TFLOPS TF32 and 3,958 TFLOPS FP8, both with sparsity) and raises memory to 141GB HBM3e at 4.8 TB/s. That makes it the better pick when a workload is memory-bound on H100: long-context inference, 70B to 100B serving at large batch sizes, multi-model colocation, and RAG. Specialist clouds price H200 around $3.79 to $3.99 per GPU per hour (Lambda, Jarvislabs, RunPod), while hyperscalers and larger GPU clouds run ~$4.98/hr on AWS p5e, ~$6.31/hr on CoreWeave, and ~$10.60 to $10.87/hr on Azure ND H200 v5 and GCP a3-ultragpu on-demand.

GPU Architecture: NVIDIA Hopper
VRAM: 141 GB HBM3e
Memory Bandwidth: 4.8 TB/s

Technical specifications

GPU Architecture: NVIDIA Hopper
VRAM: 141 GB HBM3e
Memory Bandwidth: 4.8 TB/s
Tensor Cores: 528 (4th Gen)
CUDA Cores: 16,896
FP64 Performance: 34 TFLOPS
FP32 Performance: 67 TFLOPS
TF32 Tensor: 989 TFLOPS (sparsity)
FP16 Tensor: 1,979 TFLOPS (sparsity)
FP8 Tensor: 3,958 TFLOPS (sparsity)
INT8 Tensor: 3,958 TOPS (sparsity)
NVLink Bandwidth: 900 GB/s
System RAM: 200 GB DDR5
vCPUs: 16 vCPUs
Storage: 465 GB NVMe Gen4
Form Factor: SXM5
TDP: 700W

Pricing comparison

Provider | Price/hr | Savings
Spheron (your price) | $1.19/hr | -
Lambda | $3.79/hr | 3.2x more expensive
Jarvislabs | $3.80/hr | 3.2x more expensive
RunPod | $3.99/hr | 3.4x more expensive
AWS p5e | $4.98/hr | 4.2x more expensive
CoreWeave | $6.31/hr | 5.3x more expensive
Azure ND H200 v5 | $10.60/hr | 8.9x more expensive
Google Cloud a3-ultragpu | $10.87/hr | 9.1x more expensive

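The multiples in the comparison are just each provider's per-GPU hourly price divided by the Spheron rate. A minimal sketch that reproduces them from the listed prices follows; the dictionary and function names are illustrative, and the prices are the ones quoted on this page, which change over time.

```python
# Per-GPU on-demand prices quoted in the comparison above (USD per hour).
PRICES = {
    "Spheron": 1.19,
    "Lambda": 3.79,
    "Jarvislabs": 3.80,
    "RunPod": 3.99,
    "AWS p5e": 4.98,
    "CoreWeave": 6.31,
    "Azure ND H200 v5": 10.60,
    "Google Cloud a3-ultragpu": 10.87,
}


def price_multiple(provider: str, baseline: str = "Spheron") -> float:
    """Return how many times more expensive `provider` is than `baseline`."""
    return round(PRICES[provider] / PRICES[baseline], 1)


for name in PRICES:
    if name != "Spheron":
        print(f"{name}: {price_multiple(name)}x more expensive")
```

Swapping in a current price list is enough to refresh the comparison.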
Custom & Reserved

Need More H200 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more H200 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the H200

Scenario 01

Pick the H200 if

Your workload is memory-bound on H100. That means long-context LLM inference (32K+ tokens), 70B to 100B serving at production batch sizes, multi-model colocation on a single GPU, or RAG stacks where embedding stores and the LLM need to live in VRAM together. H200 gives you 1.76x the memory and 1.43x the bandwidth of H100, same Hopper compute.

Scenario 02

Pick the H100 instead if

You are training models up to 70B or running inference that fits comfortably in 80GB. H100 has identical Tensor Core math and Transformer Engine, at a lower hourly rate. Move to H200 only when memory capacity or KV-cache headroom is the bottleneck.

Scenario 03

Pick the B200 instead if

You need maximum throughput on the largest models. B200 delivers ~2.3x H100's FP8 dense TFLOPS and ships 192GB HBM3e at 8 TB/s. For trillion-parameter training or FP4 inference, B200 is the right call. For H100-class compute with bigger memory, stay on H200.

Scenario 04

Pick the A100 instead if

You are doing classic training up to 30B parameters, quantized inference, or cost-sensitive fine-tuning. A100 80GB costs roughly a third of H200 and the mature stack still delivers. Move up to H200 when FP8 or 141GB matter.


Ideal use cases

Use case / 01
💬

Long-context LLM inference

141GB HBM3e lets you serve 70B to 100B models at large batch sizes with room left for KV cache on 32K+ context windows. Transformer Engine and FP8 keep latency low; the extra memory keeps throughput high.

Llama 3.1 70B FP8 at batch 128+ with 32K context
Mixtral 8x22B and DeepSeek V3 serving on fewer GPUs
Long-document chat with 100K+ token windows
High-concurrency agent and copilot backends

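How much KV-cache headroom the extra memory buys can be estimated with simple arithmetic. The sketch below is a rough worst-case estimate, not a benchmark. It assumes Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dimension 128), about 70 GB of FP8 weights, an FP16 KV cache, and the same 0.92 memory utilization used in the vLLM example on this page. It ignores activations and runtime overhead, and the function names are illustrative.

```python
GB = 10**9  # decimal gigabytes, matching the 141 GB and 80 GB figures


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_full_context_sequences(vram_gb: float, weights_gb: float, context_tokens: int,
                               per_token_bytes: int, utilization: float = 0.92) -> int:
    """Sequences that fit if every one of them fills the whole context window."""
    budget = vram_gb * GB * utilization - weights_gb * GB
    per_sequence = context_tokens * per_token_bytes
    return max(0, int(budget // per_sequence))


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache (2 bytes).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

h200 = max_full_context_sequences(141, 70, 32_768, per_token)
h100 = max_full_context_sequences(80, 70, 32_768, per_token)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Full 32K sequences: H200={h200}, H100={h100}")
```

This is the pessimistic case where every request uses the full 32K window. Serving engines with paged KV caches, such as vLLM, allocate cache as tokens arrive, so real batches with shorter requests fit many more concurrent sequences. An FP8 KV cache (bytes_per_value=1) roughly doubles the count.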
Use case / 02
📚

Multi-model and RAG serving

Colocate a 30B chat model, a 7B code model, and an embedding model on one card. Keep vector indices and reranker weights resident in VRAM alongside the LLM for sub-10 ms retrieval.

Enterprise RAG with embeddings + LLM in-memory
Legal / medical AI stacks with specialized models
A/B serving of multiple model versions on one GPU
Reranker + LLM pipelines without cross-GPU hops

Use case / 03
🎯

LLM fine-tuning and RLHF

Fine-tune 70B models with larger per-GPU batches, or run full SFT on 30B models without sharding. LoRA and QLoRA on 100B-class models become single-node jobs.

Llama 3.1 70B supervised fine-tuning
RLHF / DPO on 30B to 70B models
LoRA / QLoRA on 100B+ checkpoints
Domain tuning for code, legal, medical

Use case / 04

High-throughput inference at scale

For production serving where tokens per dollar matters, H200 widens the batch without running out of memory. Pair with TensorRT-LLM or vLLM for best throughput.

Chat backends at millions of tokens per minute
Code-generation services with large KV cache
Multi-tenant inference with dynamic batching
Speculative decoding with draft + target models resident

Performance benchmarks

Llama 2 70B inference (MLPerf v5.0): ~33,000 tok/s (~1.4x H100 offline)
Peak single-GPU throughput: up to 1.9x vs H100 on memory-bound decode
Memory bandwidth: 4.8 TB/s vs 3.35 TB/s on H100 (1.43x)
VRAM capacity: 141 GB HBM3e, 1.76x H100 (80 GB HBM3)
FP8 Tensor (sparsity): 3,958 TFLOPS, same Hopper compute as H100
Concurrent model serving: 3 to 5 models, 20B to 70B resident on one card
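The bandwidth figure matters because single-stream decode is usually limited by how fast weights can be read from memory. A back-of-the-envelope ceiling can be sketched by assuming every generated token reads all model weights once. That is a simplification that ignores KV-cache reads, batching, and kernel efficiency, and the 70 GB weight size (a 70B model in FP8) is an assumption for illustration.

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s for one sequence if each token reads all weights once."""
    return (bandwidth_tb_s * 1e12) / (weights_gb * 1e9)


WEIGHTS_GB = 70  # assumed: 70B parameters at 1 byte each (FP8)

h200 = decode_ceiling_tok_s(4.8, WEIGHTS_GB)
h100 = decode_ceiling_tok_s(3.35, WEIGHTS_GB)

print(f"H200 ceiling: {h200:.1f} tok/s per sequence")
print(f"H100 ceiling: {h100:.1f} tok/s per sequence")
print(f"Ratio: {h200 / h100:.2f}x")
```

The ratio matches the 1.43x bandwidth gap. Gains beyond that on real workloads come from the larger memory allowing bigger batches, not from faster per-token reads.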

Serve Llama 3.1 70B FP8 on one H200 in under 3 minutes

H200's 141GB fits Llama 3.1 70B in FP8 with plenty of KV-cache headroom for long contexts. This snippet pulls the vLLM image, serves the model with an OpenAI-compatible API, and enables FP8 for best throughput.

```bash
# 1. Provision an H200 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h200 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3.1 70B Instruct in FP8
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize why the H200 suits memory-bound workloads."}]
  }'
```

For models above 141GB or extreme concurrency, add --tensor-parallel-size N and rent a multi-GPU H200 cluster with NVLink. For multi-node InfiniBand clusters, contact us.
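The curl call in step 3 can also be issued from Python. The sketch below uses only the standard library and assumes the server returns the usual OpenAI-style chat completion shape (`choices[0].message.content`); the helper names and the placeholder address are illustrative, not part of any Spheron or vLLM SDK.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]


# Example call (replace the placeholder with your instance address):
# print(chat("http://<instance-ip>:8000", "meta-llama/Llama-3.1-70B-Instruct", "Hello"))
```

Splitting request construction from the network call keeps the payload easy to inspect or log before anything is sent.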

Interconnect fabric

Multi-GPU H200 with NVLink and InfiniBand

H200 SXM5 nodes on Spheron connect 8 GPUs with 900 GB/s NVLink inside a node and 400 Gb/s NDR InfiniBand across nodes. That fabric matches NVIDIA's HGX H200 reference design, so tensor-parallel inference with vLLM or TensorRT-LLM, and pipeline-parallel training with Megatron-LM, scale close to linearly.

01. 900 GB/s NVLink between GPUs inside a node
02. 400 Gb/s NDR InfiniBand across nodes
03. GPUDirect RDMA for zero-copy GPU-to-GPU transfers
04. 1.1 TB unified GPU memory in an 8x H200 node
05. NCCL pre-tuned for H200 topology
06. Tested with vLLM, TensorRT-LLM, SGLang, Megatron-LM, DeepSpeed ZeRO-3
07. Tensor parallel and pipeline parallel serving for 200B+ models
08. Sub-microsecond latency for GPU-to-GPU communication
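A quick way to decide how many H200s a model needs inside one node is to compare its weight footprint against usable memory per GPU. The sketch below is a sizing heuristic under stated assumptions: 0.92 usable fraction per GPU, power-of-two tensor-parallel sizes up to 8, and FP8 weights at roughly 1 GB per billion parameters. The function name is hypothetical, and the parameter counts used for Mixtral 8x22B (about 141B) and DeepSeek V3 (about 671B) come from their public model cards.

```python
from typing import Optional, Sequence


def min_tensor_parallel_size(weights_gb: float, kv_reserve_gb: float = 0.0,
                             per_gpu_gb: float = 141.0, utilization: float = 0.92,
                             candidates: Sequence[int] = (1, 2, 4, 8)) -> Optional[int]:
    """Smallest GPU count in `candidates` whose usable memory holds weights plus KV reserve."""
    for n in candidates:
        usable_gb = n * per_gpu_gb * utilization
        if weights_gb + kv_reserve_gb <= usable_gb:
            return n
    return None  # does not fit in a single 8-GPU node; needs multi-node


# FP8 weights, approximately 1 GB per billion parameters (assumption).
print("70B  :", min_tensor_parallel_size(70))
print("141B :", min_tensor_parallel_size(141))
print("671B :", min_tensor_parallel_size(671))
print("Node memory (GB):", 8 * 141)
```

The returned value is a candidate for vLLM's `--tensor-parallel-size`. Pass a nonzero `kv_reserve_gb` to leave room for the KV cache you expect at your target batch size and context length.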
Scale

Need a custom multi-node cluster or reserved capacity?

Frequently asked questions

How much does it cost to rent an H200 GPU?

On Spheron the H200 starts at $1.19 per GPU per hour on demand. Billing is per minute with no minimum commit. For reference, specialist clouds price H200 around $3.79 to $3.99/hr per GPU (Lambda, Jarvislabs, RunPod), AWS p5e runs ~$4.98/hr per GPU after the January 2026 15% increase, CoreWeave is ~$6.31/hr, and Azure ND H200 v5 and GCP a3-ultragpu on-demand land around $10.60 to $10.87/hr per GPU.

What is the cheapest way to rent an H200?

Spot instances on Spheron are the cheapest route when they are available, typically 40 to 60 percent below the dedicated rate. Spot can be reclaimed when demand spikes, so checkpoint every 15 to 30 minutes and use spot for fault-tolerant jobs (fine-tuning, batch inference, experimentation). For production serving with SLAs, stay on dedicated (99.99% SLA, non-interruptible). Both are on-demand tiers with per-minute billing.
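Periodic checkpointing for a spot job can be sketched as a loop that saves state on a timer and resumes from the last saved step after a reclaim. This is a minimal illustration, not a Spheron API: the JSON state, the `step_fn` callback, and the file path are placeholders, and a real training job would save model and optimizer state with its framework's own checkpoint functions.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so a reclaim mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run(path: str, total_steps: int, interval_s: float = 15 * 60,
        step_fn=lambda step: None, clock=time.monotonic) -> dict:
    """Run steps from the last checkpoint, saving every `interval_s` seconds."""
    state = load_checkpoint(path)
    last_save = clock()
    while state["step"] < total_steps:
        step_fn(state["step"])
        state["step"] += 1
        if clock() - last_save >= interval_s:
            save_checkpoint(path, state)
            last_save = clock()
    save_checkpoint(path, state)  # final save when the job completes
    return state
```

Store the checkpoint on storage that survives the instance, otherwise a reclaim takes the checkpoint with it.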

Can I rent an H200 by the hour?

Yes. Spheron bills per minute with no minimum. A one-hour benchmark costs one hour. No reserved-instance contracts on dedicated or spot, and no commit fees.
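Per-minute billing makes short jobs easy to cost out. The sketch below assumes partial minutes round up to the next whole minute and uses the $1.19 dedicated starting rate; both are assumptions for illustration, so check the dashboard for the exact rate and rounding that apply to your instance.

```python
import math


def cost_usd(runtime_seconds: float, hourly_rate: float = 1.19, gpus: int = 1) -> float:
    """Estimated charge for a run billed per minute."""
    minutes = math.ceil(runtime_seconds / 60)  # assumption: partial minutes round up
    return round(minutes * gpus * hourly_rate / 60, 2)


print(cost_usd(60 * 60))           # one-hour benchmark on one GPU
print(cost_usd(17 * 60, gpus=8))   # 17-minute smoke test on an 8x node
```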

How fast can I deploy an H200 instance?

Most H200 instances are live in under 2 minutes. Hardware is pre-warmed and provisioning behaves like a container start rather than a VM boot. If your Docker image is ready, you can be serving tokens inside three minutes of hitting deploy.

What is the main difference between H200 and H100?

H200 shares H100's Hopper compute (same 4th-gen Tensor Cores, same FP8 via Transformer Engine, same 989 TFLOPS TF32 and 3,958 TFLOPS FP8 with sparsity). What changes is memory: 141GB HBM3e at 4.8 TB/s on H200 versus 80GB HBM3 at 3.35 TB/s on H100. That is 1.76x the capacity and 1.43x the bandwidth, so H200 wins on anything memory-bound, especially long-context LLM serving.

Is H200 better for inference or training?

Both, but it is especially strong for inference. The extra memory lets you run bigger batches, longer contexts, and multiple models per GPU, which directly improves tokens per dollar on serving workloads. For training, H200 helps when activation, gradient, or optimizer-state memory is the bottleneck. If memory is not the limit, H100 often has better price-performance.

What LLM sizes can a single H200 handle?

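A single H200 comfortably handles 70B models in FP8 (roughly 70GB of weights) with headroom for KV cache on long contexts, and 100B-class models in FP8 with smaller batches. A 70B model in FP16 needs about 140GB for weights alone, which leaves essentially no KV-cache room on one card. For 200B+ parameter models, tensor parallel across 2 to 8 H200s with NVLink is the right pattern. Llama 3.1 70B in FP8 runs well on a single H200; Mixtral 8x22B (about 141B parameters) and DeepSeek V3 (about 671B parameters) need multiple H200s, though fewer than they would on H100.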

Can I serve multiple models on one H200?

Yes. With 141GB of VRAM you can colocate two to three models in the 30B range, or three to five smaller 7B to 13B models. That is useful for stacks that need chat, code, and embedding models together, or for A/B serving of multiple checkpoints on the same card without cross-GPU hops.
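Whether a given mix of models fits on one card comes down to summing weight footprints and checking what is left for KV cache. The sketch below is a rough planning aid with assumed bytes per parameter (FP16 = 2, FP8 = 1, INT4 = 0.5) and a 0.92 usable-memory fraction. It ignores activations and framework overhead, and the model mix shown is hypothetical.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB for a model of the given size and precision."""
    return params_billion * BYTES_PER_PARAM[dtype]


def colocation_plan(models: dict, vram_gb: float = 141.0, utilization: float = 0.92) -> dict:
    """Sum weight footprints and report what remains for KV cache."""
    used = sum(weights_gb(params, dtype) for params, dtype in models.values())
    budget = vram_gb * utilization
    return {
        "weights_gb": used,
        "free_for_kv_gb": round(budget - used, 2),
        "fits": used <= budget,
    }


# Hypothetical stack: 30B chat model, 7B code model, 0.5B embedding model.
stack = {"chat": (30, "fp8"), "code": (7, "fp16"), "embed": (0.5, "fp16")}
print(colocation_plan(stack))
```

The `free_for_kv_gb` figure is the number to watch: a mix that technically fits but leaves little free memory will throttle batch size on every model it hosts.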

Do you support multi-GPU H200 clusters with NVLink and InfiniBand?

Yes. Spheron offers 8x H200 per node with 900 GB/s NVLink and multi-node clusters connected by 400 Gb/s NDR InfiniBand with GPUDirect RDMA. Tested with vLLM, TensorRT-LLM, SGLang, Megatron-LM, and DeepSpeed ZeRO-3. Larger configurations are available on request.

What regions are H200s available in?

H200 capacity is online across North America, Europe, and Asia, sourced from data center partners. Availability shifts with demand; the dashboard shows live capacity per region.

What frameworks are optimized for H200?

All major serving stacks are tuned for H200: TensorRT-LLM (highest peak throughput, NVIDIA official), vLLM (OpenAI-compatible, easy deployment), SGLang (structured decoding), and Hugging Face TGI. Training stacks (PyTorch FSDP, DeepSpeed ZeRO-3, Megatron-LM) also have H200 kernels. CUDA 12.6+, cuDNN, and NCCL ship pre-configured in Spheron images.

Is the H200 worth it over the H100?

If your inference is hitting H100's 80GB ceiling (OOM at large batches, KV-cache pressure on long contexts, or a need to colocate multiple models), H200 pays for itself through higher tokens per dollar. If you are comfortably under 80GB, H100 is cheaper. "Train on H100, serve on H200" is a common split.

Do you offer enterprise SLAs and dedicated support for H200?

For 100+ GPU deployments and production-critical workloads, Spheron offers dedicated Slack or Discord support, sourcing assistance, and SLA-backed instances. Smaller deployments are self-serve through the dashboard.

How does H200 pricing on Spheron compare to AWS, GCP, and Azure?

For the same H200 hardware, Spheron is meaningfully cheaper than the hyperscalers on-demand. As of April 2026, AWS p5e runs ~$4.98/hr per GPU (post the January 2026 15% price increase), CoreWeave is ~$6.31/hr, Azure ND H200 v5 lands around $10.60/hr per GPU, and GCP a3-ultragpu-8g on-demand works out to roughly $10.87/hr per GPU. Spheron starts at $1.19/hr. Same silicon, different pricing model.
