Engineering

NVIDIA H200 Deployment Guide: Long Context, Multi-Model, and NVLink Clusters

Written by Mitrasish, Co-founder · Feb 8, 2026
NVIDIA H200 · GPU Deployment · LLM Inference · KV Cache · NVLink · Hopper Architecture

The H200 exists to solve one specific problem. Modern AI inference is memory-bound, not compute-bound, and H100's 80 GB ceiling is the first wall you hit. Everything else (FP8 throughput, Tensor Core peak, NVLink bandwidth) stayed the same when NVIDIA built the H200. What changed is 141 GB of HBM3e at 4.8 TB/s, which is exactly the lever you want when KV cache, long contexts, or multi-model colocation is the bottleneck.

This guide walks through the deployment decisions that actually matter on H200: sizing KV cache for long-context serving, colocating multiple models on one GPU, picking tensor-parallel sizes on NVLink clusters, and knowing when H200 beats H100 on tokens per dollar. For live H200 pricing, current SKUs, and instant deployment, go to the H200 rental page. For architecture comparisons, see our H100 vs H200 deep dive and the H200 vs B200 vs GB200 framework.

Why H200 Exists

From the outside H200 looks like a minor bump over H100. Same Hopper GPU. Same 528 4th-gen Tensor Cores. Same 989 TFLOPS TF32 and 3,958 TFLOPS FP8 with sparsity. Same Transformer Engine. Same 700W TDP. Same NVLink at 900 GB/s.

The difference is memory. H100 ships 80 GB of HBM3 at 3.35 TB/s. H200 ships 141 GB of HBM3e at 4.8 TB/s. That is 1.76x the capacity and 1.43x the bandwidth, and in memory-bound workloads it translates directly into throughput. MLPerf Llama 2 70B offline runs about 42% faster on H200 than H100, and maximum single-GPU throughput reaches ~1.9x in practice. No code changes, no recompilation. Just more room.

Technical Specifications

| Specification | H200 SXM5 | H100 SXM5 (reference) |
| --- | --- | --- |
| Architecture | Hopper | Hopper |
| CUDA Cores | 16,896 | 16,896 |
| Tensor Cores | 528 (4th Gen) | 528 (4th Gen) |
| VRAM | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 4,800 GB/s | 3,350 GB/s |
| FP64 | 34 TFLOPS | 34 TFLOPS |
| FP32 | 67 TFLOPS | 67 TFLOPS |
| TF32 Tensor (with sparsity) | 989 TFLOPS | 989 TFLOPS |
| FP16 Tensor (with sparsity) | 1,979 TFLOPS | 1,979 TFLOPS |
| FP8 Tensor (with sparsity) | 3,958 TFLOPS | 3,958 TFLOPS |
| NVLink Bandwidth | 900 GB/s | 900 GB/s |
| PCIe | Gen 5 | Gen 5 |
| TDP | 700 W | 700 W |

Compute is identical. Memory is the whole story.

KV Cache Sizing: The Math That Drives H200 Decisions

The moment you start serving long contexts, KV cache eats VRAM faster than weights do. For a transformer with L layers, H KV heads (equal to the attention-head count without GQA), head dimension D, sequence length T, and P bytes per element, the KV cache per request is:

KV = 2 * P * L * H * D * T

For Llama 3.1 70B (80 layers, 8 KV heads with GQA, head dim 128, FP16):

  • Per token: 2 * 2 * 80 * 8 * 128 = 327,680 bytes ≈ 320 KB
  • 32K context per request: ~10 GB
  • 100 concurrent requests at 32K: ~1 TB of KV cache needed

That last number is why H200 matters. On H200 you serve 70B in FP8 (weights ~70 GB), leaving ~65 GB for KV cache and activations. At ~10 GB per 32K request, that is roughly six concurrent long-context requests before paged attention and prefix sharing stretch the budget further: a real production batch. On H100 with 80 GB total, the same math leaves ~10 GB for KV cache, which caps you at one or two long-context requests.
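To make the sizing concrete, here is a minimal Python sketch of the formula above, using the Llama 3.1 70B constants from the list and the ~65 GB of H200 headroom just mentioned. The raw division understates real capacity, since paged attention and prefix sharing stretch the budget further.

```python
# Minimal KV cache sizing sketch for the formula KV = 2 * P * L * H * D * T.
# Constants are Llama 3.1 70B (80 layers, 8 KV heads via GQA, head dim 128);
# the 65 GB headroom figure assumes FP8 weights on a 141 GB H200.

def kv_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * dtype_bytes * layers * kv_heads * head_dim * seq_len

per_token = kv_bytes(80, 8, 128, 1)      # 327,680 B ~= 320 KB
per_32k = kv_bytes(80, 8, 128, 32_768)   # ~= 10 GiB per request

headroom_gb = 65                         # H200 after ~70 GB of FP8 weights
max_requests = (headroom_gb * 1024**3) // per_32k

print(f"per token:   {per_token / 1024:.0f} KB")
print(f"per 32K req: {per_32k / 1024**3:.1f} GiB")
print(f"raw 32K concurrency in {headroom_gb} GB: {max_requests}")  # ~6
```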

For shorter contexts (4K to 8K), the KV cache picture is far gentler and H100 holds its own. H200 pulls away specifically when T grows.

Colocating Multiple Models on One H200

141 GB unlocks a pattern that is painful on H100: running a multi-model stack on a single card.

A realistic RAG or agent backend often needs:

  • A 30B chat model in FP16 (~60 GB) or FP8 (~30 GB)
  • A 7B code model in FP8 (~7 GB)
  • A 7B embedding model in FP16 (~14 GB)
  • A reranker (~2 GB) and some KV cache budget

With an FP16/FP8 mix you land around 55 to 70 GB of weights, leaving plenty of room for activations, KV cache, and paged attention buffers: all four models colocated on one H200. The same stack needs at least two H100s, with network hops between them. vLLM 0.6+ supports multi-LoRA serving and multi-model routing, and SGLang exposes structured decoding across colocated endpoints.
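As a rough planning aid, the sketch below turns a stack like this into per-server memory fractions for vLLM's --gpu-memory-utilization flag, assuming each model runs as its own server process on the shared card. The per-model weight and KV budgets are illustrative assumptions, not measured numbers.

```python
# Hedged colocation budget for one 141 GB H200. The weight + KV budget per
# model is an ASSUMPTION for illustration; measure your own stack. The 10%
# reserve for CUDA context and fragmentation is also an assumption.

H200_GB = 141
stack_gb = {
    "chat-30b-fp8": 30 + 20,   # weights + KV/activation budget
    "code-7b-fp8": 7 + 5,
    "embed-7b-fp16": 14 + 2,
    "reranker": 2 + 1,
}

usable = H200_GB * 0.90        # keep ~10% in reserve
assert sum(stack_gb.values()) <= usable, "stack does not fit"

for name, gb in stack_gb.items():
    # vLLM's --gpu-memory-utilization is a fraction of total GPU memory
    print(f"{name}: ~{gb} GB -> --gpu-memory-utilization {gb / H200_GB:.2f}")
```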

Colocation also removes cross-GPU latency from the critical path, which typically saves 2 to 5 ms per retrieval step in production.

Tensor-Parallel Sizing on NVLink Clusters

For models beyond what a single H200 handles, NVLink clusters are the default tool. Spheron's HGX H200 nodes expose 8 GPUs with 900 GB/s NVLink inside the node and 400 Gb/s NDR InfiniBand with GPUDirect RDMA across nodes.

Practical tensor-parallel recommendations:

| Model | Precision | TP size | Notes |
| --- | --- | --- | --- |
| Llama 3.1 70B | FP8 | 1 | Comfortably fits one H200 with KV headroom |
| Llama 3.1 70B | FP16 | 1 (tight) or 2 | 140 GB of weights; TP=2 if you want KV cache |
| Mixtral 8x22B | FP8 | 1 or 2 | MoE routing makes TP trickier than dense; use expert parallelism |
| DeepSeek V3 671B | FP8 | 8 | Single HGX H200 node, NVLink only |
| Llama 3.1 405B | FP8 | 6 to 8 | TP=8 on one node keeps traffic on NVLink |

Rule of thumb: stay on one node for as long as you can. NVLink at 900 GB/s is roughly 2x faster than the fastest multi-node InfiniBand fabric, and tensor parallelism is chatty. Go multi-node only when a single HGX node is not enough.
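As one concrete shape of this advice, here is a minimal vLLM sketch for the DeepSeek V3 row above: TP=8 on a single HGX H200 node, so every shard-to-shard hop stays on NVLink. Exact flags (quantization, max_model_len, trust_remote_code) depend on your vLLM version and checkpoint, so treat this as a sketch rather than a turnkey config.

```python
# TP=8 on one HGX H200 node with vLLM's offline API: all tensor-parallel
# traffic stays on 900 GB/s NVLink instead of crossing InfiniBand.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # FP8 checkpoint, ~671 GB of weights
    tensor_parallel_size=8,           # one full 8-GPU node
)

out = llm.generate(
    ["Explain why tensor parallelism prefers NVLink."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```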

When H200 Wins on Tokens per Dollar

A higher hourly rate does not automatically mean higher cost per token. Three common cases where H200 wins the economics:

70B inference in production. On H100 you need TP=2 to run 70B FP16. Two H100s cost more per hour than one H200 and add cross-GPU latency. H200 serves 70B FP8 on one card with ~2x the per-GPU throughput. Typical saving on 70B serving: 40 to 50 percent fewer GPUs for the same QPS.

Long-context workloads. Any serving use case above 16K context starts to pay H200's memory premium back in KV cache room. Document chat, code review, long-form summarization all fit here.

Multi-model backends. One H200 replaces two H100s for a colocated stack of chat + code + embedding + reranker, with lower latency because traffic never leaves the card.

Cases where H100 still wins: pure sub-30B FP8 inference, training jobs under 70B where KV cache and optimizer state fit comfortably, or any batch-inference workload that is throughput-bound on compute, not memory.
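To run the economics for your own workload, a back-of-envelope comparison is enough. The rates and throughputs below are placeholders, not quotes; plug in live pricing from the rental page and your own measured tokens per second.

```python
# Back-of-envelope tokens-per-dollar for 70B serving. ALL NUMBERS BELOW
# ARE PLACEHOLDERS: substitute live $/hr rates and measured throughput.

h100_rate_hr = 2.50      # $/GPU-hour, hypothetical
h200_rate_hr = 3.50      # $/GPU-hour, hypothetical
h100_tp2_tok_s = 2_800   # tokens/s across 2x H100 (TP=2), assumed
h200_tok_s = 2_600       # tokens/s on 1x H200 (FP8), assumed

def tokens_per_dollar(tok_s, rate_hr):
    return tok_s * 3600 / rate_hr

print(f"2x H100: {tokens_per_dollar(h100_tp2_tok_s, 2 * h100_rate_hr):,.0f} tok/$")
print(f"1x H200: {tokens_per_dollar(h200_tok_s, h200_rate_hr):,.0f} tok/$")
```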

Spot vs Dedicated for H200

On Spheron, on-demand H200 comes in two tiers: dedicated (99.99% SLA, non-interruptible) and spot (cheaper, interruptible). H200 spot is typically 40 to 60 percent below the dedicated rate, with reclaim on minutes of notice. Good fits:

  • Batch inference on fault-tolerant pipelines
  • Fine-tuning runs with 15 to 30 minute checkpoint cadence
  • Hyperparameter sweeps and research experiments
  • Offline data generation and synthetic data pipelines

Bad fits:

  • Production serving with SLAs
  • Short jobs where a mid-run reclaim costs more than the spot savings
  • Workloads without robust checkpointing

A common pattern that works: serve traffic on dedicated H200, and run nightly fine-tunes and data processing on spot. Checkpoint every 20 minutes to a persistent volume, and a mid-run reclaim costs at most 20 minutes of repeated work.
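A minimal sketch of that cadence in plain PyTorch, assuming a persistent volume mounted at a path like /mnt/persistent (illustrative); any training loop can call maybe_checkpoint once per step.

```python
# Spot-friendly checkpointing: save model + optimizer state at a fixed
# wall-clock cadence so a reclaim costs at most one interval of work.
import time
import torch

CKPT_PATH = "/mnt/persistent/ckpt.pt"  # persistent volume (illustrative)
INTERVAL_S = 20 * 60                   # 20-minute cadence from the text

_last_save = time.monotonic()

def maybe_checkpoint(model, optimizer, step):
    global _last_save
    if time.monotonic() - _last_save >= INTERVAL_S:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optim": optimizer.state_dict()},
            CKPT_PATH,
        )
        _last_save = time.monotonic()

# On restart: state = torch.load(CKPT_PATH); resume from state["step"].
```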

Software and Drivers

H200 shares H100's software stack end to end. On Spheron's default H200 images you get CUDA 12.6+, cuDNN, NCCL tuned for H200 topology, and pre-installed serving frameworks: vLLM, TensorRT-LLM, SGLang, and Hugging Face TGI. Training stacks (PyTorch FSDP, DeepSpeed ZeRO-3, Megatron-LM) also have H200 kernels. If you already have an H100 container that works, it will run unchanged on H200 and your metrics will go up on memory-bound workloads.
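A quick way to confirm a container is actually seeing H200-class memory, using nothing beyond PyTorch:

```python
# Sanity check: the same code runs on H100 images, but on H200 the
# reported capacity should be ~141 GB instead of ~80 GB.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.0f} GB, compute capability {props.major}.{props.minor}")
assert vram_gb > 130, "expected H200-class memory (~141 GB)"
```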

Monitoring That Matters on H200

With 141 GB of VRAM it is easy to think you have more headroom than you really do. The metrics worth watching in production (a polling sketch follows the list):

  • KV cache utilization (vLLM / TensorRT-LLM expose this): staying above 85 percent means you are about to see request queueing.
  • Prefix cache hit rate for chat workloads: below 30 percent and you are paying to recompute prefixes that another request already computed.
  • Paged attention fragmentation: grows slowly over hours of uptime, can force a restart if not monitored.
  • NVLink utilization on TP>1 configurations: sustained above 60 percent means you are close to the interconnect wall and should consider larger TP or fewer stages.
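For the vLLM case, a minimal poller against the server's Prometheus endpoint covers the first two signals. Metric names vary across vLLM versions; the two below exist in recent releases, but verify against your own /metrics output before alerting on them.

```python
# Poll a vLLM OpenAI-compatible server's /metrics endpoint and print the
# KV cache and queueing lines. Verify metric names against your version.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # default vLLM server port
WATCH = (
    "vllm:gpu_cache_usage_perc",   # KV cache utilization, 0.0 to 1.0
    "vllm:num_requests_waiting",   # queued requests
)

body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
for line in body.splitlines():
    if line.startswith(WATCH):
        print(line)   # alert when cache usage sits above ~0.85
```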

These are the numbers that correlate with tail latency in practice. Peak FLOPS and bandwidth matter far less than how well you are using the memory you have.

When to Reach for H200

Reach for H200 when H100 is memory-bound. Long contexts, 70B-class production serving, multi-model backends, or RAG stacks where embedding stores and LLMs need to coexist in VRAM. Stay on H100 for training up to 70B or sub-30B inference where you are well under the 80 GB ceiling. Consider B200 for trillion-parameter training or FP4 inference on the largest workloads. See our best NVIDIA GPUs for LLMs guide for a full decision framework and GPU memory requirements for LLMs for precision-by-precision sizing.


Ready to deploy? Check live per-hour H200 pricing and available SKUs on the H200 rental page or see all GPU pricing.

Rent H200 on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.