Rent NVIDIA H200 GPUs on Demand from $4.54/hr
141GB HBM3e, 4.8 TB/s bandwidth, NVLink, per-minute billing. Live in under 2 minutes.
Renting an NVIDIA H200 on Spheron starts at $4.54 per GPU per hour on dedicated instances (99.99% SLA), with interruptible spot instances cheaper still. Billing is per minute, there is no minimum commit, and most instances are live inside two minutes. The H200 shares the H100's Hopper compute (4th-gen Tensor Cores, FP8 via Transformer Engine, 989 TFLOPS TF32 and 3,958 TFLOPS FP8, both with sparsity) and bumps memory to 141GB HBM3e at 4.8 TB/s. That makes it the better pick when H100 is memory-bound: long-context inference, 70B to 100B serving at large batch sizes, multi-model colocation, and RAG. Specialist clouds price H200 at $3.79 to $3.99 per GPU per hour (Lambda, Jarvislabs, RunPod), while larger clouds run $4.98/hr on AWS p5e, ~$6.31/hr on CoreWeave, and ~$10.60 to $10.87/hr on Azure ND H200 v5 and GCP a3-ultragpu on-demand.
Technical specifications
Pricing comparison
| Provider | Price/hr | Price vs Spheron |
|---|---|---|
| Spheron (your price) | $4.54/hr | - |
| Lambda | $3.79/hr | - |
| Jarvislabs | $3.80/hr | - |
| RunPod | $3.99/hr | - |
| AWS p5e | $4.98/hr | 1.1x more expensive |
| CoreWeave | $6.31/hr | 1.4x more expensive |
| Azure ND H200 v5 | $10.60/hr | 2.3x more expensive |
| Google Cloud a3-ultragpu | $10.87/hr | 2.4x more expensive |
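
The multipliers in the comparison can be reproduced from the listed hourly prices. The sketch below is illustrative only: the `PRICES` dictionary and `ratio_vs` helper are names introduced here, and the figures are simply copied from the table.

```python
# Hourly on-demand H200 prices per GPU in USD, copied from the table above.
PRICES = {
    "Spheron": 4.54,
    "Lambda": 3.79,
    "Jarvislabs": 3.80,
    "RunPod": 3.99,
    "AWS p5e": 4.98,
    "CoreWeave": 6.31,
    "Azure ND H200 v5": 10.60,
    "Google Cloud a3-ultragpu": 10.87,
}


def ratio_vs(provider: str, baseline: str = "Spheron") -> float:
    """Return the provider's price divided by the baseline price, to one decimal."""
    return round(PRICES[provider] / PRICES[baseline], 1)


if __name__ == "__main__":
    for name, price in PRICES.items():
        print(f"{name:26s} ${price:.2f}/hr  {ratio_vs(name):.1f}x")
```

A ratio below 1.0 means the provider's list price is lower than the baseline.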
Need More H200 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more H200 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the H200
Pick the H200 if
Your workload is memory-bound on H100. That means long-context LLM inference (32K+ tokens), 70B to 100B serving at production batch sizes, multi-model colocation on a single GPU, or RAG stacks where embedding stores and the LLM need to live in VRAM together. H200 gives you 1.76x the memory and 1.43x the bandwidth of H100, same Hopper compute.
Pick the H100 instead if
You are training models up to 70B or running inference that comfortably fits in 80GB. H100 has identical Tensor Core math and Transformer Engine, at a lower hourly rate. Move to H200 only when memory capacity or KV-cache headroom is the bottleneck.
Pick the B200 instead if
You need maximum throughput on the largest models. B200 delivers ~2.3x H100's FP8 dense TFLOPS and ships 192GB HBM3e at 8 TB/s. For trillion-parameter training or FP4 inference, B200 is the right call. For H100-class compute with bigger memory, stay on H200.
Pick the A100 instead if
You are doing classic training up to 30B parameters, quantized inference, or cost-sensitive fine-tuning. A100 80GB costs roughly a third of H200 and the mature stack still delivers. Skip to H200 when FP8 or 141GB matter.
Ideal use cases
Long-context LLM inference
141GB HBM3e lets you serve 70B to 100B models at large batch sizes with room left for KV cache on 32K+ context windows. Transformer Engine and FP8 keep latency low; the extra memory keeps throughput high.
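To get a feel for how much KV cache a long context consumes, here is a rough sizing sketch. It assumes the publicly documented Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 16-bit cache; the function names are introduced here for illustration, and real servers add paging and fragmentation overhead on top.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """Bytes of KV cache stored per token: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, concurrent_sequences: int, **shape: int) -> float:
    """Total KV cache in GiB for a number of sequences at full context length."""
    total_bytes = kv_cache_bytes_per_token(**shape) * context_tokens * concurrent_sequences
    return total_bytes / 2**30


# Assumed model shape (Llama 3.1 70B, 16-bit KV cache).
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

if __name__ == "__main__":
    print(kv_cache_bytes_per_token(**LLAMA_70B))   # bytes per token
    print(kv_cache_gib(32768, 1, **LLAMA_70B))     # one 32K-token sequence
    print(kv_cache_gib(32768, 4, **LLAMA_70B))     # four concurrent 32K sequences
```

Under these assumptions a single 32K-token sequence needs about 10 GiB of cache, so the memory left after the weights directly caps how many long sequences can be batched.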
Multi-model and RAG serving
Colocate a 30B chat model, a 7B code model, and an embedding model on one card. Keep vector indices and reranker weights resident in VRAM alongside the LLM for sub-10 ms retrieval.
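A quick budget check shows why this colocation fits. The sketch below assumes 16-bit weights (2 bytes per parameter) and a hypothetical 0.5B-parameter embedding model, since the page does not name one; it counts weights only and leaves KV cache, vector indices, and runtime overhead to the remaining headroom.

```python
H200_GB = 141  # HBM3e capacity per H200


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def colocation_headroom_gb(models: dict, capacity_gb: float = H200_GB) -> float:
    """GB left on the card after loading every model's weights."""
    used = sum(weights_gb(params, width) for params, width in models.values())
    return capacity_gb - used


# (parameters in billions, bytes per parameter); the embedding size is an assumption.
STACK = {
    "chat-30b": (30, 2),
    "code-7b": (7, 2),
    "embedding": (0.5, 2),
}

if __name__ == "__main__":
    print(colocation_headroom_gb(STACK))  # GB left for KV cache, indices, rerankers
```

With these numbers the three models take 75GB, leaving 66GB for caches and indices.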
LLM fine-tuning and RLHF
Fine-tune 70B models with larger per-GPU batches, or run full SFT on 30B models without sharding. LoRA and QLoRA on 100B-class models become single-node jobs.
High-throughput inference at scale
For production serving where tokens per dollar matters, H200 widens the batch without running out of memory. Pair with TensorRT-LLM or vLLM for best throughput.
Performance benchmarks
Serve Llama 3.1 70B FP8 on one H200 in under 3 minutes
H200's 141GB fits Llama 3.1 70B in FP8 with plenty of KV-cache headroom for long contexts. This snippet pulls the vLLM image, serves the model with an OpenAI-compatible API, and enables FP8 for best throughput.
```bash
# 1. Provision an H200 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h200 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3.1 70B Instruct in FP8
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize why H200 is memory-bound workloads first."}]
  }'
```

For models above 141GB or extreme concurrency, add --tensor-parallel-size N and rent a multi-GPU H200 cluster with NVLink. For multi-node InfiniBand clusters, contact us.
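The same request as the curl call in step 3 can be sent from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `chat` are helper names introduced here, and the base URL is whatever address your instance exposes.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example call (replace the placeholder with your instance address):
# chat("http://<instance-ip>:8000", "meta-llama/Llama-3.1-70B-Instruct", "Hello")
```

Splitting request construction from the network call keeps the payload easy to inspect before anything is sent.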
Multi-GPU H200 with NVLink and InfiniBand
H200 SXM5 nodes on Spheron connect 8 GPUs with 900 GB/s NVLink inside a node and 400 Gb/s NDR InfiniBand across nodes. That fabric matches NVIDIA's HGX H200 reference design, so tensor-parallel inference with vLLM or TensorRT-LLM, and pipeline-parallel training with Megatron-LM, scale close to linearly.
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
H200 vs alternatives
Versus H100: same Hopper compute, 1.76x the memory and 1.43x the bandwidth. Pick H200 when H100 is memory-bound: long context, 70B+ inference at scale, or multi-model serving.
B200 Blackwell delivers ~2.3x H100's FP8 dense TFLOPS and 192GB HBM3e at 8 TB/s. Jump to B200 for trillion-parameter training or FP4 inference. H200 is Hopper compute with bigger memory.
A100 is ~60 to 70 percent cheaper per hour but lacks FP8 and Transformer Engine. Use A100 for classic training up to 30B. H200 is the right call when FP8 throughput or 141GB matter.
MI300X has more VRAM (192GB) but the software stack trails NVIDIA's CUDA / TensorRT-LLM / vLLM ecosystem. For production inference today, H200 is the faster-to-deploy path.
Related resources
NVIDIA H100 vs H200: Benchmarks, Specs, and When to Upgrade
Side-by-side comparison for LLM inference, memory bandwidth, batch sizing, and when the H200 premium pays off.
H200 Deployment Guide: Long Context, Multi-Model, and NVLink Clusters
Practitioner deep dive on H200 configurations, KV-cache sizing, multi-model colocation, and cluster patterns on Spheron.
H200 vs B200 vs GB200: Which Blackwell-Class GPU Fits Your Workload?
How H200 compares to Blackwell B200 and GB200 for training, inference, and memory-bound workloads.
AMD MI300X vs NVIDIA H200: Memory, Performance, and Cost
How the H200 stacks up against AMD's MI300X across memory capacity, software stack maturity, and total cost.
Best NVIDIA GPUs for LLMs
Framework for matching GPU choice to model size and workload, from 7B on A100 to 671B on B200.
GPU Memory Requirements for Large Language Models
Calculate VRAM needs across precision levels and KV-cache pressure for every major model class.
Frequently asked questions
How much does it cost to rent an H200 GPU?
On Spheron the H200 starts at $4.54 per GPU per hour on demand. Billing is per minute with no minimum commit. For reference, specialist clouds price H200 around $3.79 to $3.99/hr per GPU (Lambda, Jarvislabs, RunPod), AWS p5e runs ~$4.98/hr per GPU after the January 2026 15% increase, CoreWeave is ~$6.31/hr, and Azure ND H200 v5 and GCP a3-ultragpu on-demand land around $10.60 to $10.87/hr per GPU.
What is the cheapest way to rent an H200?
Spot instances on Spheron are the cheapest route when they are available, typically 40 to 60 percent below the dedicated rate. Spot can be reclaimed when demand spikes, so checkpoint every 15 to 30 minutes and use spot for fault-tolerant jobs (fine-tuning, batch inference, experimentation). For production serving with SLAs, stay on dedicated (99.99% SLA, non-interruptible). Both are on-demand tiers with per-minute billing.
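One way to keep to a 15 to 30 minute checkpoint cadence on spot is a small wall-clock timer inside the training loop. The sketch below is an illustration, not a Spheron API: `CheckpointTimer` is a name introduced here, and the save call mentioned in the usage note is a hypothetical placeholder for your framework's own checkpoint routine.

```python
import time
from typing import Callable


class CheckpointTimer:
    """Tracks elapsed time since the last checkpoint on a monotonic clock."""

    def __init__(self, interval_s: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval_s = interval_s
        self._clock = clock
        self._last_saved = clock()

    def due(self) -> bool:
        """Return True once at least interval_s seconds have passed since the last save."""
        return self._clock() - self._last_saved >= self._interval_s

    def mark_saved(self) -> None:
        """Reset the timer after a checkpoint has been written."""
        self._last_saved = self._clock()
```

In a training loop, call `timer.due()` after each step, and when it returns True write a checkpoint with your framework's save function and then call `timer.mark_saved()`. The injectable `clock` argument exists so the cadence can be tested without waiting.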
Can I rent an H200 by the hour?
Yes. Spheron bills per minute with no minimum. A one-hour benchmark costs one hour. No reserved-instance contracts on dedicated or spot, and no commit fees.
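Per-minute billing is easy to estimate ahead of a run. The sketch below assumes that a partial minute is billed as a full minute, which the page does not state, so treat the rounding rule as an assumption; `rental_cost_usd` is a name introduced here.

```python
import math


def rental_cost_usd(seconds_used: float, hourly_rate_usd: float = 4.54) -> float:
    """Estimate cost when usage is billed in whole minutes at hourly_rate / 60."""
    billed_minutes = math.ceil(seconds_used / 60)  # assumption: partial minutes round up
    return round(billed_minutes * hourly_rate_usd / 60, 2)


if __name__ == "__main__":
    print(rental_cost_usd(3600))   # a one-hour benchmark
    print(rental_cost_usd(5400))   # a 90-minute job
```

At the $4.54 dedicated rate, a one-hour benchmark comes to $4.54 and a 90-minute job to $6.81.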
How fast can I deploy an H200 instance?
Most H200 instances are live in under 2 minutes. Hardware is pre-warmed and provisioning behaves like a container start rather than a VM boot. If your Docker image is ready, you can be serving tokens inside three minutes of hitting deploy.
What is the main difference between H200 and H100?
H200 shares H100's Hopper compute (same 4th-gen Tensor Cores, same FP8 via Transformer Engine, same 989 TFLOPS TF32 and 3,958 TFLOPS FP8 with sparsity). What changes is memory: 141GB HBM3e at 4.8 TB/s on H200 versus 80GB HBM3 at 3.35 TB/s on H100. That is 1.76x the capacity and 1.43x the bandwidth, so H200 wins on anything memory-bound, especially long-context LLM serving.
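The ratios quoted above follow directly from the spec numbers, and the bandwidth figure also gives a rough ceiling on single-stream decode speed. The sketch below uses the common approximation that each generated token reads every weight once, ignoring KV-cache reads and kernel overhead, so the results are upper bounds rather than benchmarks. The function name is introduced here.

```python
H100 = {"memory_gb": 80, "bandwidth_tb_s": 3.35}
H200 = {"memory_gb": 141, "bandwidth_tb_s": 4.8}


def decode_tokens_per_second_upper_bound(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling: bytes per second divided by bytes read per token."""
    return bandwidth_tb_s * 1000 / weights_gb


if __name__ == "__main__":
    print(round(H200["memory_gb"] / H100["memory_gb"], 2))            # capacity ratio
    print(round(H200["bandwidth_tb_s"] / H100["bandwidth_tb_s"], 2))  # bandwidth ratio
    # A 70B model in FP8 has roughly 70GB of weights.
    print(round(decode_tokens_per_second_upper_bound(H100["bandwidth_tb_s"], 70), 1))
    print(round(decode_tokens_per_second_upper_bound(H200["bandwidth_tb_s"], 70), 1))
```

Because compute is identical, the gap between the two ceilings is exactly the bandwidth ratio.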
Is H200 better for inference or training?
Both, but it is especially strong for inference. The extra memory lets you run bigger batches, longer contexts, and multiple models per GPU, which directly improves tokens per dollar on serving workloads. For training, H200 helps when activation or optimizer-state memory is the bottleneck. If memory is not the limit, H100 often has better price-performance.
What LLM sizes can a single H200 handle?
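A single H200 comfortably handles 70B models in FP8 (about 70GB of weights) with headroom for KV cache on long contexts, and 100B-class models in FP8 with smaller batches. A 70B model in FP16 needs about 140GB for weights alone, which leaves almost no room for KV cache on one card, so quantize it or split it across two GPUs. For 200B+ parameter models, tensor parallel across 2 to 8 H200s with NVLink is the right pattern. Llama 3.1 70B runs well on a single H200; Mixtral 8x22B (141B total parameters) fits on one card only when quantized to around 4 bits, and DeepSeek V3 (671B parameters) needs a full multi-GPU node.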
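The sizing guidance above can be turned into a rough rule of thumb. The sketch below assumes that about 80% of each card's 141GB is available for weights, with the rest reserved for KV cache and activations, and that tensor-parallel sizes are powers of two; both are assumptions for illustration, and `min_h200s` is a name introduced here.

```python
H200_GB = 141
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB for a model of the given size and precision."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_h200s(params_billion: float, precision: str, usable_fraction: float = 0.8) -> int:
    """Smallest power-of-two GPU count whose usable memory holds the weights."""
    needed = weights_gb(params_billion, precision)
    per_gpu = H200_GB * usable_fraction  # assumed share of memory available for weights
    gpus = 1
    while gpus * per_gpu < needed:
        gpus *= 2
    return gpus


if __name__ == "__main__":
    for size, precision in [(70, "fp8"), (70, "fp16"), (100, "fp8"), (200, "fp8"), (400, "fp8")]:
        print(f"{size}B {precision}: {min_h200s(size, precision)} x H200")
```

Lowering `usable_fraction` models workloads that need more KV-cache headroom, such as very long contexts or high concurrency.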
Can I serve multiple models on one H200?
Yes. With 141GB of VRAM you can colocate two to three models in the 30B range, or three to five smaller 7B to 13B models. That is useful for stacks that need chat, code, and embedding models together, or for A/B serving of multiple checkpoints on the same card without cross-GPU hops.
Do you support multi-GPU H200 clusters with NVLink and InfiniBand?
Yes. Spheron offers 8x H200 per node with 900 GB/s NVLink and multi-node clusters connected by 400 Gb/s NDR InfiniBand with GPUDirect RDMA. Tested with vLLM, TensorRT-LLM, SGLang, Megatron-LM, and DeepSpeed ZeRO-3. Larger configurations are available on request.
What regions are H200s available in?
H200 capacity is online across North America, Europe, and Asia, sourced from data center partners. Availability shifts with demand; the dashboard shows live capacity per region.
What frameworks are optimized for H200?
All major serving stacks are tuned for H200: TensorRT-LLM (highest peak throughput, NVIDIA official), vLLM (OpenAI-compatible, easy deployment), SGLang (structured decoding), and Hugging Face TGI. Training stacks (PyTorch FSDP, DeepSpeed ZeRO-3, Megatron-LM) also have H200 kernels. CUDA 12.6+, cuDNN, and NCCL ship pre-configured in Spheron images.
Is the H200 worth it over the H100?
If your inference is hitting H100's 80GB ceiling (OOM at large batches, KV-cache pressure on long contexts, or you want to colocate multiple models), H200 pays for itself through higher tokens per dollar. If you are comfortably under 80GB, H100 is cheaper. Train on H100, serve on H200 is a common split.
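"Pays for itself" reduces to a simple comparison: H200 wins on tokens per dollar when its throughput gain over H100 exceeds its price premium. The sketch below makes that explicit; the function names are introduced here, and any H100 price or throughput you plug in is your own measurement, since the page lists neither.

```python
def tokens_per_dollar(tokens_per_second: float, hourly_price_usd: float) -> float:
    """Tokens generated per dollar at a sustained throughput and hourly price."""
    return tokens_per_second * 3600 / hourly_price_usd


def breakeven_speedup(h200_price_usd: float, h100_price_usd: float) -> float:
    """Throughput multiple H200 must reach over H100 to match its tokens per dollar."""
    return h200_price_usd / h100_price_usd


def h200_is_cheaper_per_token(h200_tps: float, h200_price_usd: float,
                              h100_tps: float, h100_price_usd: float) -> bool:
    """Compare measured throughputs and prices for the same workload."""
    return tokens_per_dollar(h200_tps, h200_price_usd) > tokens_per_dollar(h100_tps, h100_price_usd)
```

Measure both throughputs on your own model and batch size; the larger batches that 141GB allows are what typically move the H200 number.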
Do you offer enterprise SLAs and dedicated support for H200?
For 100+ GPU deployments and production-critical workloads, Spheron offers dedicated Slack or Discord support, sourcing assistance, and SLA-backed instances. Smaller deployments are self-serve through the dashboard.
Talk to our team →
How does H200 pricing on Spheron compare to AWS, GCP, and Azure?
For the same H200 hardware, Spheron is meaningfully cheaper than the hyperscalers on-demand. As of April 2026, AWS p5e runs ~$4.98/hr per GPU (post the January 2026 15% price increase), CoreWeave is ~$6.31/hr, Azure ND H200 v5 lands around $10.60/hr per GPU, and GCP a3-ultragpu-8g on-demand works out to roughly $10.87/hr per GPU. Spheron starts at $4.54/hr. Same silicon, different pricing model.