Engineering

NVIDIA H200 Deployment Guide: Long Context, Multi-Model, and NVLink Clusters

Written by Mitrasish, Co-founder · Feb 8, 2026
NVIDIA H200 · GPU Deployment · LLM Inference · KV Cache · NVLink · Hopper Architecture

The H200 exists to solve one specific problem. Modern AI inference is memory-bound, not compute-bound, and H100's 80 GB ceiling is the first wall you hit. Everything else (FP8 throughput, Tensor Core peak, NVLink bandwidth) stayed the same when NVIDIA built the H200. What changed is 141 GB of HBM3e at 4.8 TB/s, which is exactly the lever you want when KV cache, long contexts, or multi-model colocation is the bottleneck.

This guide walks through the deployment decisions that actually matter on H200: sizing KV cache for long-context serving, colocating multiple models on one GPU, picking tensor-parallel sizes on NVLink clusters, and knowing when H200 beats H100 on tokens per dollar. For live H200 pricing, current SKUs, and instant deployment, go to the H200 rental page. For architecture comparisons, see our H100 vs H200 deep dive and the H200 vs B200 vs GB200 framework.

Why H200 Exists

From the outside H200 looks like a minor bump over H100. Same Hopper GPU. Same 528 4th-gen Tensor Cores. Same 989 TFLOPS TF32 and 3,958 TFLOPS FP8 with sparsity. Same Transformer Engine. Same 700W TDP. Same NVLink at 900 GB/s.

The difference is memory. H100 ships 80 GB of HBM3 at 3.35 TB/s. H200 ships 141 GB of HBM3e at 4.8 TB/s. That is 1.76x the capacity and 1.43x the bandwidth, and in memory-bound workloads it translates directly into throughput. MLPerf Llama 2 70B offline runs about 42% faster on H200 than H100, and maximum single-GPU throughput reaches ~1.9x in practice. No code changes, no recompilation. Just more room.

Technical Specifications

| Specification | H200 SXM5 | H100 SXM5 (reference) |
| --- | --- | --- |
| Architecture | Hopper | Hopper |
| CUDA Cores | 16,896 | 16,896 |
| Tensor Cores | 528 (4th Gen) | 528 (4th Gen) |
| VRAM | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 4,800 GB/s | 3,350 GB/s |
| FP64 | 34 TFLOPS | 34 TFLOPS |
| FP32 | 67 TFLOPS | 67 TFLOPS |
| TF32 Tensor (with sparsity) | 989 TFLOPS | 989 TFLOPS |
| FP16 Tensor (with sparsity) | 1,979 TFLOPS | 1,979 TFLOPS |
| FP8 Tensor (with sparsity) | 3,958 TFLOPS | 3,958 TFLOPS |
| NVLink Bandwidth | 900 GB/s | 900 GB/s |
| PCIe | Gen 5 | Gen 5 |
| TDP | 700 W | 700 W |

Compute is identical. Memory is the whole story.

KV Cache Sizing: The Math That Drives H200 Decisions

The moment you start serving long contexts, KV cache eats VRAM faster than weights do. For a transformer with L layers, H KV heads (equal to the attention-head count without GQA), head dimension D, sequence length T, and P bytes per element, the KV cache per request is:

KV = 2 * P * L * H * D * T

For Llama 3.1 70B (80 layers, 8 KV heads with GQA, head dim 128, FP16):

  • Per token: 2 * 2 * 80 * 8 * 128 = 327,680 bytes ≈ 320 KB
  • 32K context per request: ~10 GB
  • 100 concurrent requests at 32K: ~1 TB of KV cache needed

That last number is why H200 matters. On H200 you serve 70B in FP8 (weights ~70 GB), leaving ~65 GB for KV cache and activations. At ~10 GB per 32K request, that is roughly six concurrent long-context requests before paged attention and prefix sharing stretch the budget further: a real production batch. On H100 with 80 GB total, the same math leaves ~10 GB for KV cache, which caps you at one or two long-context requests.
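To make the sizing concrete, here is a minimal Python sketch of the formula above, using the Llama 3.1 70B constants from the list and the ~65 GB of H200 headroom just mentioned. The raw division understates real capacity, since paged attention and prefix sharing stretch the budget further.

```python
# Minimal KV cache sizing sketch for the formula KV = 2 * P * L * H * D * T.
# Constants are Llama 3.1 70B (80 layers, 8 KV heads via GQA, head dim 128);
# the 65 GB headroom figure assumes FP8 weights on a 141 GB H200.

def kv_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * dtype_bytes * layers * kv_heads * head_dim * seq_len

per_token = kv_bytes(80, 8, 128, 1)      # 327,680 B ~= 320 KB
per_32k = kv_bytes(80, 8, 128, 32_768)   # ~= 10 GiB per request

headroom_gb = 65                         # H200 after ~70 GB of FP8 weights
max_requests = (headroom_gb * 1024**3) // per_32k

print(f"per token:   {per_token / 1024:.0f} KB")
print(f"per 32K req: {per_32k / 1024**3:.1f} GiB")
print(f"raw 32K concurrency in {headroom_gb} GB: {max_requests}")  # ~6
```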

For shorter contexts (4K to 8K), the KV cache picture is far gentler and H100 holds its own. H200 pulls away specifically when T grows.

Colocating Multiple Models on One H200

141 GB unlocks a pattern that is painful on H100: running a multi-model stack on a single card.

A realistic RAG or agent backend often needs:

  • A 30B chat model in FP16 (~60 GB) or FP8 (~30 GB)
  • A 7B code model in FP8 (~7 GB)
  • A 7B embedding model in FP16 (~14 GB)
  • A reranker (~2 GB) and some KV cache budget

With an FP16/FP8 mix you land around 55 to 70 GB of weights, leaving plenty of room for activations, KV cache, and paged attention buffers: all four models colocated on one H200. The same stack needs at least two H100s, with network hops between them. vLLM 0.6+ supports multi-LoRA serving and multi-model routing, and SGLang exposes structured decoding across colocated endpoints.
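As a rough planning aid, the sketch below turns a stack like this into per-server memory fractions for vLLM's --gpu-memory-utilization flag, assuming each model runs as its own server process on the shared card. The per-model weight and KV budgets are illustrative assumptions, not measured numbers.

```python
# Hedged colocation budget for one 141 GB H200. The weight + KV budget per
# model is an ASSUMPTION for illustration; measure your own stack. The 10%
# reserve for CUDA context and fragmentation is also an assumption.

H200_GB = 141
stack_gb = {
    "chat-30b-fp8": 30 + 20,   # weights + KV/activation budget
    "code-7b-fp8": 7 + 5,
    "embed-7b-fp16": 14 + 2,
    "reranker": 2 + 1,
}

usable = H200_GB * 0.90        # keep ~10% in reserve
assert sum(stack_gb.values()) <= usable, "stack does not fit"

for name, gb in stack_gb.items():
    # vLLM's --gpu-memory-utilization is a fraction of total GPU memory
    print(f"{name}: ~{gb} GB -> --gpu-memory-utilization {gb / H200_GB:.2f}")
```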

Colocation also removes cross-GPU latency from the critical path, which typically saves 2 to 5 ms per retrieval step in production.

Tensor-Parallel Sizing on NVLink Clusters

For models beyond what a single H200 handles, NVLink clusters are the default tool. Spheron's HGX H200 nodes expose 8 GPUs with 900 GB/s NVLink inside the node and 400 Gb/s NDR InfiniBand with GPUDirect RDMA across nodes.

Practical tensor-parallel recommendations:

| Model | Precision | TP size | Notes |
| --- | --- | --- | --- |
| Llama 3.1 70B | FP8 | 1 | Comfortably fits one H200 with KV headroom |
| Llama 3.1 70B | FP16 | 1 (tight) or 2 | 140 GB of weights; TP=2 if you want KV cache |
| Mixtral 8x22B | FP8 | 1 or 2 | MoE routing makes TP trickier than dense; use expert parallelism |
| DeepSeek V3 671B | FP8 | 8 | Single HGX H200 node, NVLink only |
| Llama 3.1 405B | FP8 | 6 to 8 | TP=8 on one node keeps traffic on NVLink |

Rule of thumb: stay on one node for as long as you can. NVLink at 900 GB/s is roughly 2x faster than the fastest multi-node InfiniBand fabric, and tensor parallelism is chatty. Go multi-node only when a single HGX node is not enough.
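As one concrete shape of this advice, here is a minimal vLLM sketch for the DeepSeek V3 row above: TP=8 on a single HGX H200 node, so every shard-to-shard hop stays on NVLink. Exact flags (quantization, max_model_len, trust_remote_code) depend on your vLLM version and checkpoint, so treat this as a sketch rather than a turnkey config.

```python
# TP=8 on one HGX H200 node with vLLM's offline API: all tensor-parallel
# traffic stays on 900 GB/s NVLink instead of crossing InfiniBand.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # FP8 checkpoint, ~671 GB of weights
    tensor_parallel_size=8,           # one full 8-GPU node
)

out = llm.generate(
    ["Explain why tensor parallelism prefers NVLink."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```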

When H200 Wins on Tokens per Dollar

A higher hourly rate does not automatically mean higher cost per token. Three common cases where H200 wins the economics:

70B inference in production. On H100 you need TP=2 to run 70B FP16. Two H100s cost more per hour than one H200 and add cross-GPU latency. H200 serves 70B FP8 on one card with ~2x the per-GPU throughput. Typical saving on 70B serving: 40 to 50 percent fewer GPUs for the same QPS.

Long-context workloads. Any serving use case above 16K context starts to pay H200's memory premium back in KV cache room. Document chat, code review, long-form summarization all fit here.

Multi-model backends. One H200 replaces two H100s for a colocated stack of chat + code + embedding + reranker, with lower latency because traffic never leaves the card.

Cases where H100 still wins: pure sub-30B FP8 inference, training jobs under 70B where KV cache and optimizer state fit comfortably, or any batch-inference workload that is throughput-bound on compute, not memory.
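To run the economics for your own workload, a back-of-envelope comparison is enough. The rates and throughputs below are placeholders, not quotes; plug in live pricing from the rental page and your own measured tokens per second.

```python
# Back-of-envelope tokens-per-dollar for 70B serving. ALL NUMBERS BELOW
# ARE PLACEHOLDERS: substitute live $/hr rates and measured throughput.

h100_rate_hr = 2.50      # $/GPU-hour, hypothetical
h200_rate_hr = 3.50      # $/GPU-hour, hypothetical
h100_tp2_tok_s = 2_800   # tokens/s across 2x H100 (TP=2), assumed
h200_tok_s = 2_600       # tokens/s on 1x H200 (FP8), assumed

def tokens_per_dollar(tok_s, rate_hr):
    return tok_s * 3600 / rate_hr

print(f"2x H100: {tokens_per_dollar(h100_tp2_tok_s, 2 * h100_rate_hr):,.0f} tok/$")
print(f"1x H200: {tokens_per_dollar(h200_tok_s, h200_rate_hr):,.0f} tok/$")
```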

Spot vs Dedicated for H200

On Spheron, on-demand H200 comes in two tiers: dedicated (99.99% SLA, non-interruptible) and spot (cheaper, interruptible). H200 spot is typically 40 to 60 percent below the dedicated rate, with reclaim on minutes of notice. Good fits:

  • Batch inference on fault-tolerant pipelines
  • Fine-tuning runs with 15 to 30 minute checkpoint cadence
  • Hyperparameter sweeps and research experiments
  • Offline data generation and synthetic data pipelines

Bad fits:

  • Production serving with SLAs
  • Short jobs where a mid-run reclaim costs more than the spot savings
  • Workloads without robust checkpointing

A common pattern that works: serve traffic on dedicated H200, and run nightly fine-tunes and data processing on spot. Checkpoint every 20 minutes to a persistent volume, and a mid-run reclaim costs at most 20 minutes of repeated work.
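A minimal sketch of that cadence in plain PyTorch, assuming a persistent volume mounted at a path like /mnt/persistent (illustrative); any training loop can call maybe_checkpoint once per step.

```python
# Spot-friendly checkpointing: save model + optimizer state at a fixed
# wall-clock cadence so a reclaim costs at most one interval of work.
import time
import torch

CKPT_PATH = "/mnt/persistent/ckpt.pt"  # persistent volume (illustrative)
INTERVAL_S = 20 * 60                   # 20-minute cadence from the text

_last_save = time.monotonic()

def maybe_checkpoint(model, optimizer, step):
    global _last_save
    if time.monotonic() - _last_save >= INTERVAL_S:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optim": optimizer.state_dict()},
            CKPT_PATH,
        )
        _last_save = time.monotonic()

# On restart: state = torch.load(CKPT_PATH); resume from state["step"].
```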

Software and Drivers

H200 shares H100's software stack end to end. On Spheron's default H200 images you get CUDA 12.6+, cuDNN, NCCL tuned for H200 topology, and pre-installed serving frameworks: vLLM, TensorRT-LLM, SGLang, and Hugging Face TGI. Training stacks (PyTorch FSDP, DeepSpeed ZeRO-3, Megatron-LM) also have H200 kernels. If you already have an H100 container that works, it will run unchanged on H200 and your metrics will go up on memory-bound workloads.
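A quick way to confirm a container is actually seeing H200-class memory, using nothing beyond PyTorch:

```python
# Sanity check: the same code runs on H100 images, but on H200 the
# reported capacity should be ~141 GB instead of ~80 GB.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.0f} GB, compute capability {props.major}.{props.minor}")
assert vram_gb > 130, "expected H200-class memory (~141 GB)"
```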

Monitoring That Matters on H200

With 141 GB of VRAM it is easy to think you have more headroom than you really do. The metrics worth watching in production (a polling sketch follows the list):

  • KV cache utilization (vLLM / TensorRT-LLM expose this): staying above 85 percent means you are about to see request queueing.
  • Prefix cache hit rate for chat workloads: below 30 percent and you are paying to recompute prefixes that another request already computed.
  • Paged attention fragmentation: grows slowly over hours of uptime, can force a restart if not monitored.
  • NVLink utilization on TP>1 configurations: sustained above 60 percent means you are close to the interconnect wall and should consider larger TP or fewer stages.
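For the vLLM case, a minimal poller against the server's Prometheus endpoint covers the first two signals. Metric names vary across vLLM versions; the two below exist in recent releases, but verify against your own /metrics output before alerting on them.

```python
# Poll a vLLM OpenAI-compatible server's /metrics endpoint and print the
# KV cache and queueing lines. Verify metric names against your version.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # default vLLM server port
WATCH = (
    "vllm:gpu_cache_usage_perc",   # KV cache utilization, 0.0 to 1.0
    "vllm:num_requests_waiting",   # queued requests
)

body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
for line in body.splitlines():
    if line.startswith(WATCH):
        print(line)   # alert when cache usage sits above ~0.85
```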

These are the numbers that correlate with tail latency in practice. Peak FLOPS and bandwidth matter far less than how well you are using the memory you have.

When to Reach for H200

Reach for H200 when H100 is memory-bound. Long contexts, 70B-class production serving, multi-model backends, or RAG stacks where embedding stores and LLMs need to coexist in VRAM. Stay on H100 for training up to 70B or sub-30B inference where you are well under the 80 GB ceiling. Consider B200 for trillion-parameter training or FP4 inference on the largest workloads. See our best NVIDIA GPUs for LLMs guide for a full decision framework and GPU memory requirements for LLMs for precision-by-precision sizing.


Ready to deploy? Check live per-hour H200 pricing and available SKUs on the H200 rental page or see all GPU pricing.

Rent H200 on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.