NVIDIA H200 GPU: 141GB HBM3e Specs, Pricing & Rental. Rent H200 GPU from $3.31/hr
141GB HBM3e, 4.8 TB/s bandwidth, FP8 Transformer Engine, NVLink. Per-minute billing on H200 GPU rentals, live in under 2 minutes.
Renting an NVIDIA H200 on Spheron starts at $3.31 per GPU per hour, the lowest live marketplace rate. Billing is per minute, there is no minimum commit, and most instances are live inside two minutes. The H200 shares H100's Hopper compute (4th-gen Tensor Cores, FP8 via Transformer Engine, 989 TFLOPS TF32, 3,958 TFLOPS FP8 with sparsity) and bumps memory to 141GB HBM3e at 4.8 TB/s. That makes it the better pick when H100 is memory-bound: long-context inference, 70B to 100B serving at large batch sizes, multi-model colocation, and RAG. Specialist clouds price H200 around $3.80 to $4.40 per GPU per hour (Lambda, Jarvislabs, RunPod), while hyperscalers run $4.98/hr on AWS p5e, ~$6.31/hr on CoreWeave, and ~$10.60 to $10.87/hr on Azure ND H200 v5 and GCP a3-ultragpu on-demand.
NVIDIA H200 specifications
NVIDIA H200 pricing
| Provider | Price/hr | Price vs. Spheron |
|---|---|---|
| Spheron (your price) | $3.31/hr | - |
| Lambda | $3.79/hr | 1.1x more expensive |
| Jarvislabs | $3.80/hr | 1.1x more expensive |
| RunPod | $4.39/hr | 1.3x more expensive |
| AWS p5e | $4.98/hr | 1.5x more expensive |
| CoreWeave | $6.31/hr | 1.9x more expensive |
| Azure ND H200 v5 | $10.60/hr | 3.2x more expensive |
| Google Cloud a3-ultragpu | $10.87/hr | 3.3x more expensive |
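The multipliers in the table are each provider's hourly rate divided by the Spheron rate, rounded to one decimal. A minimal Python sketch of that arithmetic, using only the prices listed above; the 730-hour month and the function names are assumptions for illustration:

```python
# Hourly on-demand H200 prices per GPU, copied from the pricing table.
PRICES_PER_HOUR = {
    "Spheron": 3.31,
    "Lambda": 3.79,
    "Jarvislabs": 3.80,
    "RunPod": 4.39,
    "AWS p5e": 4.98,
    "CoreWeave": 6.31,
    "Azure ND H200 v5": 10.60,
    "Google Cloud a3-ultragpu": 10.87,
}

HOURS_PER_MONTH = 730  # assumption: average month with the GPU running 24/7


def price_multiple(provider: str, baseline: str = "Spheron") -> float:
    """Return how many times the baseline rate a provider charges."""
    return PRICES_PER_HOUR[provider] / PRICES_PER_HOUR[baseline]


def monthly_cost(provider: str, gpus: int = 1) -> float:
    """Return the cost of running `gpus` GPUs for one full month."""
    return PRICES_PER_HOUR[provider] * HOURS_PER_MONTH * gpus


for name in PRICES_PER_HOUR:
    print(f"{name:26s} {price_multiple(name):.1f}x  ${monthly_cost(name):,.0f}/month")
```

Per-minute billing means real spend depends on actual runtime, so treat the monthly figure as an upper bound for an always-on instance.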
Need More H200 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more H200 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the H200
Pick the H200 if
Your workload is memory-bound on H100. That means long-context LLM inference (32K+ tokens), 70B to 100B serving at production batch sizes, multi-model colocation on a single GPU, or RAG stacks where embedding stores and the LLM need to live in VRAM together. H200 gives you 1.76x the memory and 1.43x the bandwidth of H100, same Hopper compute.
Pick the H100 instead if
You are training models up to 70B or running inference that comfortably fits in 80GB. H100 has identical Tensor Core math and Transformer Engine, at a lower hourly rate. Move to H200 only when memory capacity or KV-cache headroom is the bottleneck.
Pick the B200 instead if
You need maximum throughput on the largest models. B200 delivers ~2.3x H100's FP8 dense TFLOPS and ships 192GB HBM3e at 8 TB/s. For trillion-parameter training or FP4 inference, B200 is the right call. For H100-class compute with bigger memory, stay on H200.
Pick the A100 instead if
You are doing classic training up to 30B parameters, quantized inference, or cost-sensitive fine-tuning. A100 80GB costs roughly a third of H200 and its mature software stack still delivers. Move up to H200 when FP8 or 141GB matter.
NVIDIA H200 use cases
Long-context LLM inference
141GB HBM3e lets you serve 70B to 100B models at large batch sizes with room left for KV cache on 32K+ context windows. Transformer Engine and FP8 keep latency low; the extra memory keeps throughput high.
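To see how far that headroom goes, KV-cache size can be estimated from the model's attention shape. The sketch below assumes the published Llama 3.1 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache. The function names and the 60 GB budget are illustrative, and serving engines such as vLLM add paging and allocator overhead on top of this estimate.

```python
GB = 1e9  # decimal gigabytes, matching the 141GB figure used on this page


def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` cached tokens.

    The leading factor of 2 covers keys and values. Defaults assume
    Llama 3.1 70B with an FP16 (2-byte) cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def max_sequences(budget_gb: float, context_tokens: int) -> int:
    """Number of full-length sequences that fit in a KV-cache budget."""
    return int(budget_gb * GB // kv_cache_bytes(context_tokens))


print(f"per token:   {kv_cache_bytes(1) / 1024:.0f} KiB")
print(f"32K context: {kv_cache_bytes(32_768) / GB:.2f} GB")
print(f"32K sequences in a 60 GB budget: {max_sequences(60, 32_768)}")
```

An FP8 KV cache (`bytes_per_value=1`) halves these numbers, which is one reason batch size scales so directly with spare VRAM.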
Multi-model and RAG serving
Colocate a 30B chat model, a 7B code model, and an embedding model on one card. Keep vector indices and reranker weights resident in VRAM alongside the LLM for sub-10 ms retrieval.
LLM fine-tuning and RLHF
Fine-tune 70B models with larger per-GPU batches, or run full SFT on 30B models without sharding. LoRA and QLoRA on 100B-class models become single-node jobs.
High-throughput inference at scale
For production serving where tokens per dollar matters, H200 widens the batch without running out of memory. Pair with TensorRT-LLM or vLLM for best throughput.
NVIDIA H200 benchmarks
Serve Llama 3.1 70B FP8 on one H200 in under 3 minutes
H200's 141GB fits Llama 3.1 70B in FP8 with plenty of KV-cache headroom for long contexts. This snippet pulls the vLLM image, serves the model with an OpenAI-compatible API, and enables FP8 for best throughput.
    # 1. Provision an H200 from the Spheron CLI (or use the dashboard)
    spheron deploy --gpu h200 --image vllm/vllm-openai:latest

    # 2. Inside the instance, serve Llama 3.1 70B Instruct in FP8
    vllm serve meta-llama/Llama-3.1-70B-Instruct \
      --quantization fp8 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.92 \
      --port 8000

    # 3. Hit the endpoint from any OpenAI-compatible client
    curl http://<instance-ip>:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Summarize why H200 is memory-bound workloads first."}]
      }'

For models above 141GB or extreme concurrency, add --tensor-parallel-size N and rent a multi-GPU H200 cluster with NVLink. For multi-node InfiniBand clusters, contact us.
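The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library: it assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on port 8000 as in the snippet above, and the helper names `build_chat_request` and `send_chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout_s: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (replace the placeholder with your instance's address before running):
# req = build_chat_request("http://<instance-ip>:8000",
#                          "meta-llama/Llama-3.1-70B-Instruct",
#                          "Hello from an H200")
# print(send_chat(req))
```

Splitting request construction from sending keeps the payload easy to inspect or log before any network call is made.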
Multi-GPU H200 with NVLink and InfiniBand
H200 SXM5 nodes on Spheron connect 8 GPUs with 900 GB/s NVLink inside a node and 400 Gb/s NDR InfiniBand across nodes. That fabric matches NVIDIA's HGX H200 reference design, so tensor-parallel inference with vLLM or TensorRT-LLM, and pipeline-parallel training with Megatron-LM, scale close to linearly.
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
H200 vs alternatives
H200 vs H100: Same Hopper compute, 1.76x the memory and 1.43x the bandwidth. Pick H200 when H100 is memory-bound: long context, 70B+ inference at scale, or multi-model serving.
H200 vs B200: B200 Blackwell delivers ~2.3x H100's FP8 dense TFLOPS and 192GB HBM3e at 8 TB/s. Jump to B200 for trillion-parameter training or FP4 inference. H200 is Hopper compute with bigger memory.
H200 vs A100: A100 is ~60 to 70 percent cheaper per hour but lacks FP8 and Transformer Engine. Use A100 for classic training up to 30B. H200 is the right call when FP8 throughput or 141GB matter.
H200 vs MI300X: MI300X has more VRAM (192GB) but the software stack trails NVIDIA's CUDA / TensorRT-LLM / vLLM ecosystem. For production inference today, H200 is the faster-to-deploy path.
NVIDIA H200 guides and resources
NVIDIA H100 vs H200: Benchmarks, Specs, and When to Upgrade
Side-by-side comparison for LLM inference, memory bandwidth, batch sizing, and when the H200 premium pays off.
H200 Deployment Guide: Long Context, Multi-Model, and NVLink Clusters
Practitioner deep dive on H200 configurations, KV-cache sizing, multi-model colocation, and cluster patterns on Spheron.
H200 vs B200 vs GB200: Which Blackwell-Class GPU Fits Your Workload?
How H200 compares to Blackwell B200 and GB200 for training, inference, and memory-bound workloads.
AMD MI300X vs NVIDIA H200: Memory, Performance, and Cost
How the H200 stacks up against AMD's MI300X across memory capacity, software stack maturity, and total cost.
Best NVIDIA GPUs for LLMs
Framework for matching GPU choice to model size and workload, from 7B on A100 to 671B on B200.
GPU Memory Requirements for Large Language Models
Calculate VRAM needs across precision levels and KV-cache pressure for every major model class.
NVIDIA H200 Release Date and Cloud Availability
The NVIDIA H200 Tensor Core GPU was announced in November 2023 at the Supercomputing conference (SC23) as the memory-upgraded successor to the H100. Volume shipments began in Q2 2024 inside HGX H200 systems, with broad cloud availability through Q3 and Q4 2024. AWS launched the p5e (H200) instance in late 2024; Azure ND H200 v5 became generally available in November 2024; CoreWeave, Lambda, and Nebius had H200 capacity by mid-2024.
On Spheron the H200 SXM5 is available with per-minute billing and no contract. Capacity is sourced from data center partners across North America, Europe, and Asia. Live availability and current pricing are on the pricing page. The next step up from H200 is the Blackwell B200 (192GB HBM3e at 8 TB/s with native FP4) or B300 (288GB HBM3e), both of which exceed the H200 on memory capacity and bandwidth but trade off Hopper's proven software maturity for newer silicon.
H200 VRAM and Memory Bandwidth: 141GB HBM3e at 4.8 TB/s
The H200 ships with 141GB of HBM3e memory at 4.8 TB/s of bandwidth. That is 1.76x the VRAM and 1.43x the bandwidth of the H100 (80GB HBM3 at 3.35 TB/s). Same Hopper compute, same Transformer Engine with FP8, but the memory upgrade is what changes the workload calculus. For autoregressive LLM decode at batch size 1, the H200's bandwidth ceiling for a 70B FP16 model is roughly 34 tokens per second versus 24 tokens per second on the H100.
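Those ceilings follow from a simple roofline estimate: at batch size 1, every generated token has to stream all the weights from HBM once, so tokens per second cannot exceed bandwidth divided by weight bytes. A rough sketch of that bound, ignoring KV-cache reads and kernel overheads; the function name is illustrative, and the result is a theoretical limit (140GB of FP16 weights would not fit on one 80GB H100 without sharding):

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                params_billions: float,
                                bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads dominate."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


H200_BW, H100_BW = 4.8, 3.35      # TB/s, from the specs quoted above
H200_VRAM, H100_VRAM = 141, 80    # GB

h200 = decode_ceiling_tokens_per_s(H200_BW, 70, 2)  # 70B parameters at FP16
h100 = decode_ceiling_tokens_per_s(H100_BW, 70, 2)

print(f"H200 ceiling: {h200:.0f} tok/s, H100 ceiling: {h100:.0f} tok/s")
print(f"memory ratio: {H200_VRAM / H100_VRAM:.2f}x, "
      f"bandwidth ratio: {H200_BW / H100_BW:.2f}x")
```

Passing `bytes_per_param=1` for FP8 doubles both ceilings, since half as many bytes are read per token.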
Where the 141GB VRAM matters: Llama 3.1 70B in FP8 (about 70GB of weights) fits on a single H200 with roughly 60GB of KV cache headroom, enough for long-context production serving at moderate batch sizes. In FP16 the weights alone are about 140GB, which fills the card and leaves almost nothing for KV cache. On an 80GB H100 the same FP8 weights leave under 10GB, so usable KV cache means more aggressive quantization or a second GPU. For 128K+ token context windows, multi-model colocation, and RAG pipelines that keep both the embedding model and the LLM resident on the same card, the H200's extra memory pays for itself quickly.
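Whether a model fits comes down to weight bytes versus usable VRAM. The sketch below is a back-of-the-envelope estimate: it assumes weights take parameters times bytes per parameter, borrows the 0.92 utilization cap from the `--gpu-memory-utilization` flag in the vLLM example, and ignores activations and framework overhead. The names are illustrative.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def kv_headroom_gb(vram_gb: float, params_billions: float, precision: str,
                   utilization: float = 0.92) -> float:
    """VRAM left for KV cache after weights; negative means it does not fit."""
    return vram_gb * utilization - weight_gb(params_billions, precision)


for gpu, vram in (("H200", 141), ("H100", 80)):
    for precision in ("fp16", "fp8"):
        left = kv_headroom_gb(vram, 70, precision)
        status = "fits" if left > 0 else "does not fit"
        print(f"{gpu} 70B {precision}: {left:6.1f} GB headroom ({status})")
```

A negative result means the weights alone exceed the usable memory, which is when tensor parallelism across multiple GPUs or lower precision becomes necessary.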