The A100 and H100 both ship with 80 GB of HBM and run the same CUDA workloads, but on Spheron an H100 rents for roughly 1.8x the hourly rate of a comparable A100. Whether the H100's architecture improvements pay off depends on your model size, throughput target, and whether your stack can actually use FP8. This guide works through the specs, training and inference benchmarks, and cost-per-token math so you can make the call.
TL;DR: A100 vs H100 at a Glance
| Metric | A100 SXM4 | H100 SXM5 |
|---|---|---|
| Architecture | Ampere | Hopper |
| Process Node | TSMC 7nm | TSMC 4N |
| Transistors | 54.2B | 80B |
| TDP | 400W | 700W |
| FP16 Tensor TFLOPS | 312 | 1,979 |
| BF16 Tensor TFLOPS | 312 | 1,979 |
| FP8 Tensor TFLOPS | N/A | 3,958 |
| FP64 TFLOPS | 9.7 | 34 |
| VRAM | 80 GB | 80 GB |
| Memory Type | HBM2e | HBM3 |
| Memory Bandwidth | 2,039 GB/s | 3,350 GB/s |
| NVLink Version | 3.0 | 4.0 |
| NVLink Bandwidth (bidir.) | 600 GB/s | 900 GB/s |
| Transformer Engine | No | Yes |
| MIG Support | Up to 7 instances | Up to 7 instances |
| On-demand price/hr | ~$1.64 | ~$2.90 |
| Spot price/hr | ~$0.45 | ~$0.80 |
Pricing fluctuates with GPU availability. The prices above are as of 1 May 2026 and may have changed. Check current GPU pricing for live rates.
Both GPUs have identical VRAM capacity. The H100 wins on raw compute throughput and memory bandwidth. The A100 currently has a lower on-demand floor on Spheron's marketplace, with spot pricing significantly below the H100's spot rate.
Architecture: Ampere vs Hopper
What stayed the same
Both GPUs run the same CUDA 11.8+ workloads. A100 containers run on H100 without modification: code built for the A100's compute capability 8.0 runs forward-compatibly on the H100's 9.0. Both support MIG with the same maximum of 7 isolated instances per GPU. Both are built on TSMC process nodes and expose the same driver and NVML APIs.
Transformer Engine
The Transformer Engine is the most consequential H100-only addition. It is a hardware and software pipeline that executes Transformer attention and feed-forward operations in FP8, specifically the E4M3 and E5M2 floating-point formats. FP8 uses half the memory of BF16 per operand, which means the Tensor Cores can process roughly twice as many values per cycle. On H100, this translates to 3,958 TFLOPS in FP8 compared to 1,979 TFLOPS in BF16.
To use it, you need framework support: PyTorch 2.1+ with transformer_engine.pytorch, vLLM 0.4+, or TensorRT-LLM 0.10+. The A100 has no FP8 hardware support. BF16 is the practical precision ceiling on A100 workloads.
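If you want to see the FP8 path outside a serving framework, NVIDIA's transformer_engine package exposes it directly in PyTorch. The snippet below is a minimal sketch, assuming transformer_engine is installed (it ships in NVIDIA's NGC PyTorch containers); the layer size and recipe settings are illustrative rather than tuned values:

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID recipe: E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# te.Linear is an FP8-capable replacement for torch.nn.Linear (dims should be multiples of 16).
layer = te.Linear(4096, 4096, bias=True)
x = torch.randn(32, 4096, device="cuda")

# Inside this context, the matmul runs on H100's FP8 Tensor Cores. The A100 has no
# FP8 hardware path, so the same layer would have to run in BF16/FP16 there.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)

print(out.shape, out.dtype)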
HBM3 vs HBM2e
Both GPUs have 80 GB of HBM. The H100 uses HBM3 at 3,350 GB/s; the A100 uses HBM2e at 2,039 GB/s. That 1.64x bandwidth difference is the primary driver of H100's inference speedup on memory-bound attention operations, where the GPU is waiting on VRAM reads rather than doing compute.
For large language model inference, the decode phase requires reading the entire model weight tensor for every generated token. A 70B parameter model in FP16 is ~140 GB. On a single GPU with INT8 quantization (~70 GB), the H100's higher bandwidth means faster token generation even before FP8 comes into play.
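A back-of-envelope ceiling makes the bandwidth argument concrete: in single-stream decode, every generated token has to stream the full set of resident weights from HBM, so tokens/sec cannot exceed bandwidth divided by weight bytes. The sketch below ignores KV cache and activation traffic, so real throughput lands well below these ceilings:

# Decode ceiling: tokens/sec <= memory bandwidth / weight bytes read per token.
# Ignores KV cache traffic, activations, and kernel overhead, so real numbers are lower.
WEIGHT_BYTES_70B_INT8 = 70e9                                   # ~70 GB at 1 byte/param
BANDWIDTH = {"A100 (HBM2e)": 2039e9, "H100 (HBM3)": 3350e9}    # bytes/s, from the spec table

for gpu, bw in BANDWIDTH.items():
    ceiling = bw / WEIGHT_BYTES_70B_INT8
    print(f"{gpu}: <= {ceiling:.0f} tok/s single-stream decode ceiling")
# A100: ~29 tok/s, H100: ~48 tok/s -- the same 1.64x ratio as the bandwidth gap.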
NVLink 4.0 vs 3.0
The A100 uses NVLink 3.0: 12 links at 50 GB/s each for 600 GB/s total bidirectional bandwidth. The H100 uses NVLink 4.0: 18 links at 50 GB/s each for 900 GB/s total. For 8-GPU training, the 50% higher inter-GPU bandwidth reduces all-reduce communication stalls and improves scaling efficiency on 70B+ models where gradient synchronization is a bottleneck.
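To put the 600 vs 900 GB/s figures in context, a ring all-reduce moves roughly 2 × (N−1)/N × S bytes per GPU for a payload of S bytes. The estimate below is a rough sketch under idealized assumptions (full-parameter BF16 gradients, unidirectional bandwidth taken as half the bidirectional spec, no overlap with compute), so treat the absolute times as illustrative:

# Idealized ring all-reduce time per sync: t = 2 * (N - 1) / N * S / B
# S = gradient bytes per GPU, B = usable unidirectional NVLink bandwidth.
# Assumptions (illustrative): 70B params of BF16 gradients, B = half the bidirectional spec.
N = 8
S = 70e9 * 2                       # 70B params * 2 bytes (BF16) = 140 GB of gradients
for gpu, bidir in {"A100 NVLink 3.0": 600e9, "H100 NVLink 4.0": 900e9}.items():
    B = bidir / 2                  # ~300 or ~450 GB/s unidirectional
    t = 2 * (N - 1) / N * S / B
    print(f"{gpu}: ~{t:.2f} s per full gradient all-reduce")
# The H100 completes the same sync ~1.5x faster, which is where the scaling-efficiency edge comes from.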
PCIe Gen 5 vs Gen 4
On PCIe form factors (not SXM), the H100 PCIe offers 128 GB/s host-to-device bandwidth versus the A100 PCIe's 64 GB/s. This only matters for data loading pipelines where host memory transfers are frequent. SXM variants connect directly via NVLink and NVSwitch, so this distinction doesn't apply to the SXM configurations most commonly deployed on GPU clouds.
Full Specifications Comparison
| Specification | A100 SXM4 | A100 PCIe | H100 SXM5 | H100 PCIe |
|---|---|---|---|---|
| Architecture | Ampere | Ampere | Hopper | Hopper |
| Process Node | TSMC 7nm | TSMC 7nm | TSMC 4N | TSMC 4N |
| Transistors | 54.2B | 54.2B | 80B | 80B |
| CUDA Cores | 6,912 | 6,912 | 16,896 | 14,592 |
| Tensor Core Gen | 3rd Gen | 3rd Gen | 4th Gen | 4th Gen |
| VRAM | 80 GB HBM2e | 80 GB HBM2e | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | 2,039 GB/s | 1,935 GB/s | 3,350 GB/s | 2,000 GB/s |
| Memory Bus | 5,120-bit | 5,120-bit | 5,120-bit | 5,120-bit |
| L2 Cache | 40 MB | 40 MB | 50 MB | 50 MB |
| FP64 (TFLOPS) | 9.7 | 9.7 | 34 | 26 |
| FP32 (TFLOPS) | 19.5 | 19.5 | 67 | 51 |
| TF32 Tensor (TFLOPS) | 156 | 156 | 989 | 756 |
| BF16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP8 Tensor (TFLOPS) | N/A | N/A | 3,958 | 3,026 |
| INT8 (TOPS) | 624 | 624 | 3,958 | 3,026 |
| NVLink Version | 3.0 | N/A | 4.0 | N/A |
| NVLink Bandwidth | 600 GB/s | N/A | 900 GB/s | N/A |
| PCIe | Gen 4 (64 GB/s) | Gen 4 (64 GB/s) | Gen 5 (128 GB/s) | Gen 5 (128 GB/s) |
| MIG Instances | Up to 7 | Up to 7 | Up to 7 | Up to 7 |
| Transformer Engine | No | No | Yes | Yes |
| TDP | 400W | 300W | 700W | 350W |
The H100 PCIe ships with HBM2e instead of HBM3, giving it 2,000 GB/s bandwidth rather than 3,350 GB/s. If you're comparing PCIe variants specifically, the inference bandwidth gap between A100 and H100 narrows to about 1.03x rather than 1.64x.
Training Benchmarks: Llama 3 70B Fine-Tuning
Llama 3 70B in FP16 requires roughly 140 GB of VRAM for model weights alone, which doesn't fit on a single 80 GB GPU. For fine-tuning, both A100 and H100 are typically used in 8-GPU configurations with NVLink and NVSwitch. The benchmarks below reflect LoRA fine-tuning at rank 64, 4K sequence length, gradient checkpointing enabled.
| GPU | Config | Precision | Batch Size | Tokens/sec | Relative to A100 BF16 |
|---|---|---|---|---|---|
| A100 SXM4 | 8x NVLink | BF16 | 4 | ~18,000 | 1.0x (baseline) |
| A100 SXM4 | 8x NVLink | INT8 | 4 | ~14,000 | 0.78x |
| H100 SXM5 | 8x NVLink | BF16 | 4 | ~38,000 | 2.1x |
| H100 SXM5 | 8x NVLink | FP8 | 4 | ~62,000 | 3.4x |
Reference throughput estimates based on publicly reported scaling ratios from NVIDIA's MLPerf training v4.0 results and community benchmarks. Actual numbers vary by LoRA config, sequence length, and framework version.
FP8 on H100 cuts training wall time by 3-4x compared to A100 BF16. For a team running weekly fine-tuning jobs that take 36 hours on A100, H100 FP8 finishes the same job in under 12 hours. The practical impact depends on whether your pipeline actually uses FP8. A plain BF16 H100 still delivers about 2x the A100's throughput.
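As a sanity check on the wall-clock claim, the arithmetic below converts the throughput table into hours for a fixed token budget, assuming the job is throughput-bound end to end (no data-loading or evaluation stalls):

# Convert the throughput table into wall-clock time for a fixed token budget.
baseline_hours = 36
baseline_tps = 18_000                                   # A100 8x BF16 from the table
token_budget = baseline_hours * 3600 * baseline_tps     # ~2.3B tokens

for config, tps in {"A100 BF16": 18_000, "H100 BF16": 38_000, "H100 FP8": 62_000}.items():
    hours = token_budget / tps / 3600
    print(f"{config}: ~{hours:.1f} h")
# A100 BF16: 36.0 h, H100 BF16: ~17.1 h, H100 FP8: ~10.5 h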
8-GPU Scaling Efficiency
| GPU | 1 GPU (relative) | 4 GPU (relative) | 8 GPU (relative) | Scaling Efficiency (8x) |
|---|---|---|---|---|
| A100 SXM4 | 1.0x | 3.7x | 6.8x | 85% |
| H100 SXM5 | 1.0x | 3.8x | 7.2x | 90% |
The H100's NVLink 4.0 reduces all-reduce communication overhead, giving it slightly better multi-GPU scaling efficiency. The gap matters most at 70B+ parameter scales where gradient synchronization is frequent.
Inference Benchmarks: Tokens per Second
Both GPUs have 80 GB of VRAM. For 70B models, INT4 quantization (A100) and W4A8 quantization (H100) keep model weights within the memory budget. BF16 70B requires tensor parallelism across two GPUs and is excluded from single-GPU rows.
vLLM Inference (Single GPU)
| Model | GPU | Precision | Concurrency | Tokens/sec (output) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 1 | ~150 | ~1,200 |
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 16 | ~1,100 | ~2,800 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 1 | ~340 | ~520 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 16 | ~2,900 | ~1,100 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 1 | ~600 | ~180 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 32 | ~4,200 | ~640 |
| Llama 3 8B | H100 SXM5 | BF16 TP=1 | 1 | ~1,100 | ~95 |
| Llama 3 8B | H100 SXM5 | FP8 TP=1 | 32 | ~9,500 | ~290 |
Reference estimates based on community vLLM benchmarks and NVIDIA published data. Numbers vary by vLLM version (0.4+), driver version, and server configuration.
The H100 W4A8 advantage at high concurrency is significant: 2.6x the output tokens/sec of A100 INT4 at batch 16 for 70B models. At batch 1 (single-request latency), the H100 W4A8 TTFT is less than half the A100's, which matters for interactive use cases. For a deeper look at how continuous batching and PagedAttention drive these concurrency gains in vLLM, see that guide.
For 8B models in BF16, the H100 advantage is about 1.8x at batch 1, close to the 1.64x memory-bandwidth ratio: single-stream decode is bound by how fast the model weights can be streamed from VRAM, so the H100's extra compute sits largely idle. As concurrency grows, weight reads are amortized across requests and the workload shifts toward compute and KV cache traffic, which is where the Tensor Core and FP8 advantages show up; the FP8 row at batch 32 widens the gap to roughly 2.3x.
TensorRT-LLM Inference
| Model | GPU | Precision | Batch Size | Tokens/sec | vs A100 INT4 |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 | 8 | ~900 | 1.0x |
| Llama 3 70B | A100 SXM4 | INT4 | 32 | ~1,400 | 1.0x |
| Llama 3 70B | H100 SXM5 | W4A8 | 8 | ~2,100 | 2.3x |
| Llama 3 70B | H100 SXM5 | W4A8 | 32 | ~4,800 | 3.4x |
Based on NVIDIA TensorRT-LLM benchmarking data. Requires TensorRT-LLM 0.10+ for FP8 on H100.
H100 W4A8 with TensorRT-LLM roughly doubles to triples tokens/sec compared to A100 INT4 for 70B models at batch sizes of 16 and above. Below batch 8, the ratio narrows toward the 1.64x memory-bandwidth gap because the workload is bandwidth-bound rather than compute-bound at low concurrency. For step-by-step FP8 engine builds and multi-GPU serving setup, see the TensorRT-LLM deployment guide.
Memory and KV Cache: 80 GB vs 80 GB (HBM3)
Both GPUs have 80 GB of VRAM, so they can hold the same model sizes in the same precisions. The meaningful difference is KV cache capacity at fixed concurrency and the throughput impact of serving those KV reads at different bandwidths.
KV cache size per request: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
For Llama 3 70B (80 layers, 8 KV heads, 128 head dim) at 4K context in FP16: approximately 1.3 GB per request. At INT8, that halves to ~0.65 GB.
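A small helper makes the formula easy to rerun for other models, context lengths, and KV cache precisions; the layer and head counts below are the published Llama 3 configurations:

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    # K and V tensors per layer, per token: 2 * num_kv_heads * head_dim elements.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Llama 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128
print(kv_cache_bytes(80, 8, 128, 4096, 2) / 1e9)   # ~1.34 GB per 4K request in FP16
print(kv_cache_bytes(80, 8, 128, 4096, 1) / 1e9)   # ~0.67 GB with a 1-byte (INT8/FP8) KV cache

# Llama 3 8B: 32 layers, 8 KV heads, head_dim 128
print(kv_cache_bytes(32, 8, 128, 4096, 2) / 1e9)   # ~0.54 GB per 4K request in BF16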
| GPU | Model | Precision | Model Weights | KV Cache (4K ctx, native precision) | Max concurrent reqs |
|---|---|---|---|---|---|
| A100 SXM4 | Llama 3 70B | INT4 | ~35 GB | ~0.65 GB/req | ~69 |
| H100 SXM5 | Llama 3 70B | W4A8 | ~35 GB | ~0.65 GB/req | ~69 |
| A100 SXM4 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB/req | ~128 |
| H100 SXM5 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB/req | ~128 |
When model weights are quantized to INT4 or W4A8, both GPUs have similar theoretical KV cache capacity. The H100 wins on the throughput side: with 3,350 GB/s vs 2,039 GB/s bandwidth, it serves those KV reads 1.64x faster, which sustains higher concurrency before hitting latency SLOs.
At long contexts (16K+), KV cache pressure becomes the bottleneck faster. A 16K-context request on Llama 3 70B FP16 requires ~5 GB of KV cache per request, limiting an INT4-loaded GPU to ~9 concurrent requests regardless of GPU model. The H100's higher bandwidth means it processes those 9 requests more quickly. For strategies to maximize KV cache capacity at long contexts, see KV cache optimization techniques.
Cost Per Token Analysis
Using live Spheron pricing (1 May 2026) and reference benchmark throughput for Llama 3 70B:
Cost per million output tokens = (price_per_hour / tokens_per_second) / 3600 × 1,000,000
| GPU | Precision | Throughput (tok/s) | On-demand $/hr | Spot $/hr | On-demand CPM | Spot CPM |
|---|---|---|---|---|---|---|
| A100 SXM4 | INT4, batch 1 | ~150 | $1.64 | $0.45 | $3.04 | $0.83 |
| A100 SXM4 | INT4, batch 16 | ~1,100 | $1.64 | $0.45 | $0.41 | $0.11 |
| H100 SXM5 | W4A8, batch 1 | ~340 | $2.90 | $0.80 | $2.37 | $0.65 |
| H100 SXM5 | W4A8, batch 16 | ~2,900 | $2.90 | $0.80 | $0.28 | $0.08 |
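The CPM columns follow directly from the formula above. The snippet below reproduces the table so you can substitute your own measured throughput and whatever the marketplace rates are when you run it:

def cost_per_million_tokens(price_per_hour, tokens_per_second):
    # $/hr divided by tokens generated per hour, scaled to one million output tokens.
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

rows = [
    ("A100 INT4, batch 1",   150,  1.64, 0.45),
    ("A100 INT4, batch 16", 1100,  1.64, 0.45),
    ("H100 W4A8, batch 1",   340,  2.90, 0.80),
    ("H100 W4A8, batch 16", 2900,  2.90, 0.80),
]
for name, tps, on_demand, spot in rows:
    print(f"{name}: on-demand ${cost_per_million_tokens(on_demand, tps):.2f}/M, "
          f"spot ${cost_per_million_tokens(spot, tps):.2f}/M")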
At batch 1 (single-stream latency), H100 W4A8 on-demand ($2.37/M tokens) is modestly cheaper than A100 on-demand ($3.04/M tokens): the H100's 2.3x throughput advantage is just enough to offset its higher hourly rate. A100 spot at batch 1 ($0.83/M) is substantially cheaper than H100 on-demand, so if your latency SLOs allow preemption risk, A100 spot is the cost leader at this concurrency.
At batch 16, H100 W4A8 on-demand ($0.28/M) is about 1.5x cheaper than A100 on-demand ($0.41/M), but A100 spot ($0.11/M) is still 2.5x cheaper than H100 on-demand. H100 spot ($0.08/M) is the cheapest option at scale when availability allows.
With $2.90/hr on-demand, H100 on-demand does not undercut A100 spot pricing at any typical batch size for Llama 3 70B. The practical split is: use H100 on-demand when guaranteed availability and modest cost savings over A100 on-demand matter, and A100 spot when cost-per-token is the top priority and preemption is acceptable.
Pricing fluctuates with GPU availability. The prices above are as of 1 May 2026 and may have changed. Check current GPU pricing for live rates.
When to Pick A100
- Budget inference on models under 30B where INT8 quantization is available and throughput target is below 1,000 tok/s
- 70B-class LoRA fine-tuning where you want the lowest hourly GPU spend and can tolerate longer run times (2-3x wall clock vs H100 FP8)
- Legacy CUDA stacks (CUDA 11.x, TensorFlow 2.x, older PyTorch) that don't yet have Hopper kernel support
- Multi-tenant inference with MIG where you need 7-partition isolation at the lowest cost per partition
- Spot workloads at batch 8-16+ where the A100 spot rate ($0.45/hr) produces cost-per-token that is 2.5x cheaper than H100 on-demand ($2.90/hr), making it the clear choice when preemption risk is acceptable
When to Pick H100
- FP8 production inference with vLLM 0.4+ or TensorRT-LLM 0.10+ where the Transformer Engine is active
- Pre-training or continued pre-training on frontier models at 70B+ parameter scale
- High-throughput serving with consistent batch sizes of 16+ concurrent requests
- Long-context serving at 32K+ tokens where HBM3's bandwidth prevents latency spikes under load
- Teams using FlashAttention-3 or custom Triton kernels that target Hopper's async memory pipeline
- Any workload where time-to-result matters more than per-hour cost
A100 to H100 Migration Guide
Step 1: Verify driver and CUDA version
H100 requires driver 525+ and CUDA 11.8+. Check your current setup:
nvidia-smi
# Should show Driver Version >= 525.xx.xx
# CUDA Version >= 11.8
nvcc --version
# Should show release 11.8, V11.8 or higher

If you provision a fresh H100 instance on Spheron, the driver is pre-installed. No manual update needed.
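If your stack goes through PyTorch, a quick check from inside the container confirms the framework actually sees a Hopper device; this uses standard torch.cuda calls:

import torch

# Compute capability (9, 0) identifies Hopper (H100); (8, 0) is Ampere (A100).
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{name}: compute capability {major}.{minor}")
assert (major, minor) >= (9, 0), "FP8 paths need a Hopper (sm_90) or newer GPU"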
Step 2: Update vLLM
pip install "vllm>=0.4.0"vLLM 0.4.0 added H100 FP8 support via the --dtype fp8 flag. Older versions fall back to BF16 and you lose the Transformer Engine uplift.
Step 3: Enable FP8 at serving time
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--kv-cache-dtype fp8 \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--port 8000

The --quantization fp8 flag tells vLLM to quantize the model weights to FP8 at load time, and --kv-cache-dtype fp8 stores the KV cache in FP8 as well, freeing additional VRAM for concurrent requests. Both options require Hopper hardware; on an A100, drop them and serve in BF16.
Step 4: Update FlashAttention
pip install "flash-attn>=2.4.0"FlashAttention 2.4+ includes Hopper-specific kernels that use H100's async pipeline for attention. They activate automatically when an H100 is detected. No flag changes are needed.
Step 5: Benchmark before committing
Run a quick throughput comparison on your actual workload. The script below works against any OpenAI-compatible vLLM endpoint:
import time, requests

def measure_tps(endpoint, model, prompt, max_tokens=200):
    # Send one completion request and return output tokens per second of wall time.
    start = time.time()
    resp = requests.post(f"{endpoint}/v1/completions", json={
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }, timeout=120)
    resp.raise_for_status()
    elapsed = time.time() - start
    usage = resp.json().get("usage") or {}
    tokens_out = usage.get("completion_tokens", 0)
    if not tokens_out:
        raise ValueError(f"No completion_tokens in response from {endpoint}")
    return tokens_out / elapsed
# Run on A100 endpoint and H100 endpoint with the same prompt/config
a100_tps = measure_tps("http://a100-host:8000", "Llama-3-70B-Instruct", "Explain attention mechanisms", 500)
h100_tps = measure_tps("http://h100-host:8000", "Llama-3-70B-Instruct", "Explain attention mechanisms", 500)
if a100_tps > 0:
    print(f"A100: {a100_tps:.1f} tok/s, H100: {h100_tps:.1f} tok/s, ratio: {h100_tps/a100_tps:.2f}x")

Run this across multiple batch sizes to find the actual crossover point for your workload.
Renting A100 and H100 on Spheron
Spheron offers bare-metal access to both A100 and H100 instances with per-minute billing and no contract requirements. Both GPU models are available on-demand and as spot instances, with NVLink SXM configurations for multi-GPU training jobs.
Current pricing (1 May 2026) shows the A100 SXM4 starting at $1.64/hr on-demand and $0.45/hr on spot. The H100 SXM5 starts at $2.90/hr on-demand and $0.80/hr on spot. Check current GPU pricing for live rates since marketplace prices change as supply and demand shifts.
For teams deciding between the two, the practical approach is to run your actual workload for 30 minutes on each GPU type using on-demand pricing, then calculate your real cost-per-million-tokens at your production batch size. The spec sheet comparison is useful for narrowing options, but the final call should come from a measured number on your actual model and serving configuration.
If you're coming from V100 hardware, the A100 vs V100 comparison covers the Ampere upgrade path in detail. For context on where H100 fits into the broader GPU ladder, the H100 vs H200 guide covers the Hopper memory subsystem differences and whether the H200's larger KV cache headroom is worth the premium.
Both A100 and H100 are available on Spheron with per-minute billing and no contract. Compare current on-demand and spot rates for all GPU models on the GPU pricing page.
