The A100 and H100 both ship with 80 GB of HBM and run the same CUDA workloads, but on Spheron an H100 rents for roughly 1.8x the hourly rate of a comparable A100. Whether the H100's architecture improvements pay off depends on your model size, throughput target, and whether your stack can actually use FP8. This guide works through the specs, training and inference benchmarks, and cost-per-token math so you can make the call.
TL;DR: A100 vs H100 at a Glance
| Metric | A100 SXM4 | H100 SXM5 |
|---|---|---|
| Architecture | Ampere | Hopper |
| Process Node | TSMC 7nm | TSMC 4N |
| Transistors | 54.2B | 80B |
| TDP | 400W | 700W |
| FP16 Tensor TFLOPS | 312 | 1,979 |
| BF16 Tensor TFLOPS | 312 | 1,979 |
| FP8 Tensor TFLOPS | N/A | 3,958 |
| FP64 TFLOPS | 9.7 | 34 |
| VRAM | 80 GB | 80 GB |
| Memory Type | HBM2e | HBM3 |
| Memory Bandwidth | 2,039 GB/s | 3,350 GB/s |
| NVLink Version | 3.0 | 4.0 |
| NVLink Bandwidth (bidir.) | 600 GB/s | 900 GB/s |
| Transformer Engine | No | Yes |
| MIG Support | Up to 7 instances | Up to 7 instances |
| On-demand price/hr | ~$1.64 | ~$2.90 |
| Spot price/hr | ~$0.45 | ~$0.80 |
Pricing fluctuates with GPU availability. The prices above are as of 1 May 2026 and may have changed. Check current GPU pricing for live rates.
Both GPUs have identical VRAM capacity. The H100 wins on raw compute throughput and memory bandwidth. The A100 currently has a lower on-demand floor on Spheron's marketplace, with spot pricing significantly below the H100's spot rate.
Architecture: Ampere vs Hopper
What stayed the same
Both GPUs run the same CUDA 11.8+ workloads. A100 containers run on H100 without modification: code built for the A100's compute capability 8.0 runs forward-compatibly on the H100's 9.0. Both support MIG with the same maximum of 7 isolated instances per GPU. Both are built on TSMC process nodes and expose the same driver and NVML APIs.
Transformer Engine
The Transformer Engine is the most consequential H100-only addition. It is a hardware and software pipeline that executes Transformer attention and feed-forward operations in FP8, specifically the E4M3 and E5M2 floating-point formats. FP8 uses half the memory of BF16 per operand, which means the Tensor Cores can process roughly twice as many values per cycle. On H100, this translates to 3,958 TFLOPS in FP8 compared to 1,979 TFLOPS in BF16.
To use it, you need framework support: PyTorch 2.1+ with transformer_engine.pytorch, vLLM 0.4+, or TensorRT-LLM 0.10+. The A100 has no FP8 hardware support. BF16 is the practical precision ceiling on A100 workloads.
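If you want to see the FP8 path outside a serving framework, NVIDIA's transformer_engine package exposes it directly in PyTorch. The snippet below is a minimal sketch, assuming transformer_engine is installed (it ships in NVIDIA's NGC PyTorch containers); the layer size and recipe settings are illustrative rather than tuned values:

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID recipe: E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# te.Linear is an FP8-capable replacement for torch.nn.Linear (dims should be multiples of 16).
layer = te.Linear(4096, 4096, bias=True)
x = torch.randn(32, 4096, device="cuda")

# Inside this context, the matmul runs on H100's FP8 Tensor Cores. The A100 has no
# FP8 hardware path, so the same layer would have to run in BF16/FP16 there.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)

print(out.shape, out.dtype)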
HBM3 vs HBM2e
Both GPUs have 80 GB of HBM. The H100 uses HBM3 at 3,350 GB/s; the A100 uses HBM2e at 2,039 GB/s. That 1.64x bandwidth difference is the primary driver of H100's inference speedup on memory-bound attention operations, where the GPU is waiting on VRAM reads rather than doing compute.
For large language model inference, the decode phase requires reading the entire model weight tensor for every generated token. A 70B parameter model in FP16 is ~140 GB. On a single GPU with INT8 quantization (~70 GB), the H100's higher bandwidth means faster token generation even before FP8 comes into play.
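A back-of-envelope ceiling makes the bandwidth argument concrete: in single-stream decode, every generated token has to stream the full set of resident weights from HBM, so tokens/sec cannot exceed bandwidth divided by weight bytes. The sketch below ignores KV cache and activation traffic, so real throughput lands well below these ceilings:

# Decode ceiling: tokens/sec <= memory bandwidth / weight bytes read per token.
# Ignores KV cache traffic, activations, and kernel overhead, so real numbers are lower.
WEIGHT_BYTES_70B_INT8 = 70e9                                   # ~70 GB at 1 byte/param
BANDWIDTH = {"A100 (HBM2e)": 2039e9, "H100 (HBM3)": 3350e9}    # bytes/s, from the spec table

for gpu, bw in BANDWIDTH.items():
    ceiling = bw / WEIGHT_BYTES_70B_INT8
    print(f"{gpu}: <= {ceiling:.0f} tok/s single-stream decode ceiling")
# A100: ~29 tok/s, H100: ~48 tok/s -- the same 1.64x ratio as the bandwidth gap.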
NVLink 4.0 vs 3.0
The A100 uses NVLink 3.0: 12 links at 50 GB/s each for 600 GB/s total bidirectional bandwidth. The H100 uses NVLink 4.0: 18 links at 50 GB/s each for 900 GB/s total. For 8-GPU training, the 50% higher inter-GPU bandwidth reduces all-reduce communication stalls and improves scaling efficiency on 70B+ models where gradient synchronization is a bottleneck.
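To put the 600 vs 900 GB/s figures in context, a ring all-reduce moves roughly 2 × (N−1)/N × S bytes per GPU for a payload of S bytes. The estimate below is a rough sketch under idealized assumptions (full-parameter BF16 gradients, unidirectional bandwidth taken as half the bidirectional spec, no overlap with compute), so treat the absolute times as illustrative:

# Idealized ring all-reduce time per sync: t = 2 * (N - 1) / N * S / B
# S = gradient bytes per GPU, B = usable unidirectional NVLink bandwidth.
# Assumptions (illustrative): 70B params of BF16 gradients, B = half the bidirectional spec.
N = 8
S = 70e9 * 2                       # 70B params * 2 bytes (BF16) = 140 GB of gradients
for gpu, bidir in {"A100 NVLink 3.0": 600e9, "H100 NVLink 4.0": 900e9}.items():
    B = bidir / 2                  # ~300 or ~450 GB/s unidirectional
    t = 2 * (N - 1) / N * S / B
    print(f"{gpu}: ~{t:.2f} s per full gradient all-reduce")
# The H100 completes the same sync ~1.5x faster, which is where the scaling-efficiency edge comes from.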
PCIe Gen 5 vs Gen 4
On PCIe form factors (not SXM), the H100 PCIe offers 128 GB/s host-to-device bandwidth versus the A100 PCIe's 64 GB/s. This only matters for data loading pipelines where host memory transfers are frequent. SXM variants connect directly via NVLink and NVSwitch, so this distinction doesn't apply to the SXM configurations most commonly deployed on GPU clouds.
Full Specifications Comparison
| Specification | A100 SXM4 | A100 PCIe | H100 SXM5 | H100 PCIe |
|---|---|---|---|---|
| Architecture | Ampere | Ampere | Hopper | Hopper |
| Process Node | TSMC 7nm | TSMC 7nm | TSMC 4N | TSMC 4N |
| Transistors | 54.2B | 54.2B | 80B | 80B |
| CUDA Cores | 6,912 | 6,912 | 16,896 | 14,592 |
| Tensor Core Gen | 3rd Gen | 3rd Gen | 4th Gen | 4th Gen |
| VRAM | 80 GB HBM2e | 80 GB HBM2e | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | 2,039 GB/s | 1,935 GB/s | 3,350 GB/s | 2,000 GB/s |
| Memory Bus | 5,120-bit | 5,120-bit | 5,120-bit | 5,120-bit |
| L2 Cache | 40 MB | 40 MB | 50 MB | 50 MB |
| FP64 (TFLOPS) | 9.7 | 9.7 | 34 | 26 |
| FP32 (TFLOPS) | 19.5 | 19.5 | 67 | 51 |
| TF32 Tensor (TFLOPS) | 156 | 156 | 989 | 756 |
| BF16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP8 Tensor (TFLOPS) | N/A | N/A | 3,958 | 3,026 |
| INT8 (TOPS) | 624 | 624 | 3,958 | 3,026 |
| NVLink Version | 3.0 | N/A | 4.0 | N/A |
| NVLink Bandwidth | 600 GB/s | N/A | 900 GB/s | N/A |
| PCIe | Gen 4 (64 GB/s) | Gen 4 (64 GB/s) | Gen 5 (128 GB/s) | Gen 5 (128 GB/s) |
| MIG Instances | Up to 7 | Up to 7 | Up to 7 | Up to 7 |
| Transformer Engine | No | No | Yes | Yes |
| TDP | 400W | 300W | 700W | 350W |
The H100 PCIe ships with HBM2e instead of HBM3, giving it 2,000 GB/s bandwidth rather than 3,350 GB/s. If you're comparing PCIe variants specifically, the inference bandwidth gap between A100 and H100 narrows to about 1.03x rather than 1.64x.
Training Benchmarks: Llama 3 70B Fine-Tuning
Llama 3 70B in FP16 requires roughly 140 GB of VRAM for model weights alone, which doesn't fit on a single 80 GB GPU. For fine-tuning, both A100 and H100 are typically used in 8-GPU configurations with NVLink and NVSwitch. The benchmarks below reflect LoRA fine-tuning at rank 64, 4K sequence length, gradient checkpointing enabled.
| GPU | Config | Precision | Batch Size | Tokens/sec | Relative to A100 BF16 |
|---|---|---|---|---|---|
| A100 SXM4 | 8x NVLink | BF16 | 4 | ~18,000 | 1.0x (baseline) |
| A100 SXM4 | 8x NVLink | INT8 | 4 | ~14,000 | 0.78x |
| H100 SXM5 | 8x NVLink | BF16 | 4 | ~38,000 | 2.1x |
| H100 SXM5 | 8x NVLink | FP8 | 4 | ~62,000 | 3.4x |
Reference throughput estimates based on publicly reported scaling ratios from NVIDIA's MLPerf training v4.0 results and community benchmarks. Actual numbers vary by LoRA config, sequence length, and framework version.
FP8 on H100 cuts training wall time by 3-4x compared to A100 BF16. For a team running weekly fine-tuning jobs that take 36 hours on A100, H100 FP8 finishes the same job in under 12 hours. The practical impact depends on whether your pipeline actually uses FP8. A plain BF16 H100 still delivers about 2x the A100's throughput.
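As a sanity check on the wall-clock claim, the arithmetic below converts the throughput table into hours for a fixed token budget, assuming the job is throughput-bound end to end (no data-loading or evaluation stalls):

# Convert the throughput table into wall-clock time for a fixed token budget.
baseline_hours = 36
baseline_tps = 18_000                                   # A100 8x BF16 from the table
token_budget = baseline_hours * 3600 * baseline_tps     # ~2.3B tokens

for config, tps in {"A100 BF16": 18_000, "H100 BF16": 38_000, "H100 FP8": 62_000}.items():
    hours = token_budget / tps / 3600
    print(f"{config}: ~{hours:.1f} h")
# A100 BF16: 36.0 h, H100 BF16: ~17.1 h, H100 FP8: ~10.5 h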
8-GPU Scaling Efficiency
| GPU | 1 GPU (relative) | 4 GPU (relative) | 8 GPU (relative) | Scaling Efficiency (8x) |
|---|---|---|---|---|
| A100 SXM4 | 1.0x | 3.7x | 6.8x | 85% |
| H100 SXM5 | 1.0x | 3.8x | 7.2x | 90% |
The H100's NVLink 4.0 reduces all-reduce communication overhead, giving it slightly better multi-GPU scaling efficiency. The gap matters most at 70B+ parameter scales where gradient synchronization is frequent.
Inference Benchmarks: Tokens per Second
Both GPUs have 80 GB of VRAM. For 70B models, INT4 quantization (A100) and W4A8 quantization (H100) keep model weights within the memory budget. BF16 70B requires tensor parallelism across two GPUs and is excluded from single-GPU rows.
vLLM Inference (Single GPU)
| Model | GPU | Precision | Concurrency | Tokens/sec (output) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 1 | ~150 | ~1,200 |
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 16 | ~1,100 | ~2,800 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 1 | ~340 | ~520 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 16 | ~2,900 | ~1,100 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 1 | ~600 | ~180 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 32 | ~4,200 | ~640 |
| Llama 3 8B | H100 SXM5 | BF16 TP=1 | 1 | ~1,100 | ~95 |
| Llama 3 8B | H100 SXM5 | FP8 TP=1 | 32 | ~9,500 | ~290 |
Reference estimates based on community vLLM benchmarks and NVIDIA published data. Numbers vary by vLLM version (0.4+), driver version, and server configuration.
The H100 W4A8 advantage at high concurrency is significant: 2.6x the output tokens/sec of A100 INT4 at batch 16 for 70B models. At batch 1 (single-request latency), the H100 W4A8 TTFT is less than half the A100's, which matters for interactive use cases. For a deeper look at how continuous batching and PagedAttention drive these concurrency gains in vLLM, see that guide.
For 8B models in BF16, the H100 advantage is about 1.8x at batch 1, close to the 1.64x memory-bandwidth ratio: single-stream decode is bound by how fast the model weights can be streamed from VRAM, so the H100's extra compute sits largely idle. As concurrency grows, weight reads are amortized across requests and the workload shifts toward compute and KV cache traffic, which is where the Tensor Core and FP8 advantages show up; the FP8 row at batch 32 widens the gap to roughly 2.3x.
TensorRT-LLM Inference
| Model | GPU | Precision | Batch Size | Tokens/sec | vs A100 INT4 |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 | 8 | ~900 | 1.0x |
| Llama 3 70B | A100 SXM4 | INT4 | 32 | ~1,400 | 1.0x |
| Llama 3 70B | H100 SXM5 | W4A8 | 8 | ~2,100 | 2.3x |
| Llama 3 70B | H100 SXM5 | W4A8 | 32 | ~4,800 | 3.4x |
Based on NVIDIA TensorRT-LLM benchmarking data. Requires TensorRT-LLM 0.10+ for FP8 on H100.
H100 W4A8 with TensorRT-LLM roughly doubles to triples tokens/sec compared to A100 INT4 for 70B models at batch sizes of 16 and above. Below batch 8, the ratio narrows toward the 1.64x memory-bandwidth gap because the workload is bandwidth-bound rather than compute-bound at low concurrency. For step-by-step FP8 engine builds and multi-GPU serving setup, see the TensorRT-LLM deployment guide.
Memory and KV Cache: 80 GB vs 80 GB (HBM3)
Both GPUs have 80 GB of VRAM, so they can hold the same model sizes in the same precisions. The meaningful difference is KV cache capacity at fixed concurrency and the throughput impact of serving those KV reads at different bandwidths.
KV cache size per request: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
For Llama 3 70B (80 layers, 8 KV heads, 128 head dim) at 4K context in FP16: approximately 1.3 GB per request. At INT8, that halves to ~0.65 GB.
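A small helper makes the formula easy to rerun for other models, context lengths, and KV cache precisions; the layer and head counts below are the published Llama 3 configurations:

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    # K and V tensors per layer, per token: 2 * num_kv_heads * head_dim elements.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Llama 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128
print(kv_cache_bytes(80, 8, 128, 4096, 2) / 1e9)   # ~1.34 GB per 4K request in FP16
print(kv_cache_bytes(80, 8, 128, 4096, 1) / 1e9)   # ~0.67 GB with a 1-byte (INT8/FP8) KV cache

# Llama 3 8B: 32 layers, 8 KV heads, head_dim 128
print(kv_cache_bytes(32, 8, 128, 4096, 2) / 1e9)   # ~0.54 GB per 4K request in BF16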
| GPU | Model | Precision | Model Weights | KV Cache (4K ctx, native precision) | Max concurrent reqs |
|---|---|---|---|---|---|
| A100 SXM4 | Llama 3 70B | INT4 | ~35 GB | ~0.65 GB/req | ~69 |
| H100 SXM5 | Llama 3 70B | W4A8 | ~35 GB | ~0.65 GB/req | ~69 |
| A100 SXM4 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB/req | ~128 |
| H100 SXM5 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB/req | ~128 |
When model weights are quantized to INT4 or W4A8, both GPUs have similar theoretical KV cache capacity. The H100 wins on the throughput side: with 3,350 GB/s vs 2,039 GB/s bandwidth, it serves those KV reads 1.64x faster, which sustains higher concurrency before hitting latency SLOs.
At long contexts (16K+), KV cache pressure becomes the bottleneck faster. A 16K-context request on Llama 3 70B FP16 requires ~5 GB of KV cache per request, limiting an INT4-loaded GPU to ~9 concurrent requests regardless of GPU model. The H100's higher bandwidth means it processes those 9 requests more quickly. For strategies to maximize KV cache capacity at long contexts, see KV cache optimization techniques.
Cost Per Token Analysis
Using live Spheron pricing (1 May 2026) and reference benchmark throughput for Llama 3 70B:
Cost per million output tokens = (price_per_hour / tokens_per_second) / 3600 × 1,000,000
| GPU | Precision | Throughput (tok/s) | On-demand $/hr | Spot $/hr | On-demand CPM | Spot CPM |
|---|---|---|---|---|---|---|
| A100 SXM4 | INT4, batch 1 | ~150 | $1.64 | $0.45 | $3.04 | $0.83 |
| A100 SXM4 | INT4, batch 16 | ~1,100 | $1.64 | $0.45 | $0.41 | $0.11 |
| H100 SXM5 | W4A8, batch 1 | ~340 | $2.90 | $0.80 | $2.37 | $0.65 |
| H100 SXM5 | W4A8, batch 16 | ~2,900 | $2.90 | $0.80 | $0.28 | $0.08 |
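The CPM columns follow directly from the formula above. The snippet below reproduces the table so you can substitute your own measured throughput and whatever the marketplace rates are when you run it:

def cost_per_million_tokens(price_per_hour, tokens_per_second):
    # $/hr divided by tokens generated per hour, scaled to one million output tokens.
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

rows = [
    ("A100 INT4, batch 1",   150,  1.64, 0.45),
    ("A100 INT4, batch 16", 1100,  1.64, 0.45),
    ("H100 W4A8, batch 1",   340,  2.90, 0.80),
    ("H100 W4A8, batch 16", 2900,  2.90, 0.80),
]
for name, tps, on_demand, spot in rows:
    print(f"{name}: on-demand ${cost_per_million_tokens(on_demand, tps):.2f}/M, "
          f"spot ${cost_per_million_tokens(spot, tps):.2f}/M")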
At batch 1 (single-stream latency), H100 W4A8 on-demand ($2.37/M tokens) is modestly cheaper than A100 on-demand ($3.04/M tokens): the H100's 2.3x throughput advantage is just enough to offset its higher hourly rate. A100 spot at batch 1 ($0.83/M) is substantially cheaper than H100 on-demand, so if your latency SLOs allow preemption risk, A100 spot is the cost leader at this concurrency.
At batch 16, H100 W4A8 on-demand ($0.28/M) is about 1.5x cheaper than A100 on-demand ($0.41/M), but A100 spot ($0.11/M) is still 2.5x cheaper than H100 on-demand. H100 spot ($0.08/M) is the cheapest option at scale when availability allows.
With $2.90/hr on-demand, H100 on-demand does not undercut A100 spot pricing at any typical batch size for Llama 3 70B. The practical split is: use H100 on-demand when guaranteed availability and modest cost savings over A100 on-demand matter, and A100 spot when cost-per-token is the top priority and preemption is acceptable.
Pricing fluctuates with GPU availability. The prices above are as of 1 May 2026 and may have changed. Check current GPU pricing for live rates.
When to Pick A100
- Budget inference on models under 30B where INT8 quantization is available and throughput target is below 1,000 tok/s
- 70B-class LoRA fine-tuning where you want the lowest hourly GPU spend and can tolerate longer run times (2-3x wall clock vs H100 FP8)
- Legacy CUDA stacks (CUDA 11.x, TensorFlow 2.x, older PyTorch) that don't yet have Hopper kernel support
- Multi-tenant inference with MIG where you need 7-partition isolation at the lowest cost per partition
- Spot workloads at batch 8-16+ where the A100 spot rate ($0.45/hr) produces cost-per-token that is 2.5x cheaper than H100 on-demand ($2.90/hr), making it the clear choice when preemption risk is acceptable
When to Pick H100
- FP8 production inference with vLLM 0.4+ or TensorRT-LLM 0.10+ where the Transformer Engine is active
- Pre-training or continued pre-training on frontier models at 70B+ parameter scale
- High-throughput serving with consistent batch sizes of 16+ concurrent requests
- Long-context serving at 32K+ tokens where HBM3's bandwidth prevents latency spikes under load
- Teams using FlashAttention-3 or custom Triton kernels that target Hopper's async memory pipeline
- Any workload where time-to-result matters more than per-hour cost
A100 to H100 Migration Guide
Step 1: Verify driver and CUDA version
H100 requires driver 525+ and CUDA 11.8+. Check your current setup:
nvidia-smi
# Should show Driver Version >= 525.xx.xx
# CUDA Version >= 11.8
nvcc --version
# Should show release 11.8, V11.8 or higher

If you provision a fresh H100 instance on Spheron, the driver is pre-installed. No manual update needed.
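If your stack goes through PyTorch, a quick check from inside the container confirms the framework actually sees a Hopper device; this uses standard torch.cuda calls:

import torch

# Compute capability (9, 0) identifies Hopper (H100); (8, 0) is Ampere (A100).
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{name}: compute capability {major}.{minor}")
assert (major, minor) >= (9, 0), "FP8 paths need a Hopper (sm_90) or newer GPU"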
Step 2: Update vLLM
pip install "vllm>=0.4.0"vLLM 0.4.0 added H100 FP8 support via the --dtype fp8 flag. Older versions fall back to BF16 and you lose the Transformer Engine uplift.
Step 3: Enable FP8 at serving time
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--kv-cache-dtype fp8 \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--port 8000

The --quantization fp8 flag tells vLLM to quantize the model weights to FP8 at load time, and --kv-cache-dtype fp8 stores the KV cache in FP8 as well, freeing additional VRAM for concurrent requests. Both options require Hopper hardware; on an A100, drop them and serve in BF16.
Step 4: Update FlashAttention
pip install "flash-attn>=2.4.0"FlashAttention 2.4+ includes Hopper-specific kernels that use H100's async pipeline for attention. They activate automatically when an H100 is detected. No flag changes are needed.
Step 5: Benchmark before committing
Run a quick throughput comparison on your actual workload. The script below works against any OpenAI-compatible vLLM endpoint:
import time, requests

def measure_tps(endpoint, model, prompt, max_tokens=200):
    # Send one completion request and return output tokens per second of wall time.
    start = time.time()
    resp = requests.post(f"{endpoint}/v1/completions", json={
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }, timeout=120)
    resp.raise_for_status()
    elapsed = time.time() - start
    usage = resp.json().get("usage") or {}
    tokens_out = usage.get("completion_tokens", 0)
    if not tokens_out:
        raise ValueError(f"No completion_tokens in response from {endpoint}")
    return tokens_out / elapsed
# Run on A100 endpoint and H100 endpoint with the same prompt/config
a100_tps = measure_tps("http://a100-host:8000", "Llama-3-70B-Instruct", "Explain attention mechanisms", 500)
h100_tps = measure_tps("http://h100-host:8000", "Llama-3-70B-Instruct", "Explain attention mechanisms", 500)
if a100_tps > 0:
    print(f"A100: {a100_tps:.1f} tok/s, H100: {h100_tps:.1f} tok/s, ratio: {h100_tps/a100_tps:.2f}x")

Run this across multiple batch sizes to find the actual crossover point for your workload.
Renting A100 and H100 on Spheron
Spheron offers bare-metal access to both A100 and H100 instances with per-minute billing and no contract requirements. Both GPU models are available on-demand and as spot instances, with NVLink SXM configurations for multi-GPU training jobs.
Current pricing (1 May 2026) shows the A100 SXM4 starting at $1.64/hr on-demand and $0.45/hr on spot. The H100 SXM5 starts at $2.90/hr on-demand and $0.80/hr on spot. Check current GPU pricing for live rates since marketplace prices change as supply and demand shifts.
For teams deciding between the two, the practical approach is to run your actual workload for 30 minutes on each GPU type using on-demand pricing, then calculate your real cost-per-million-tokens at your production batch size. The spec sheet comparison is useful for narrowing options, but the final call should come from a measured number on your actual model and serving configuration.
If you're coming from V100 hardware, the A100 vs V100 comparison covers the Ampere upgrade path in detail. For context on where H100 fits into the broader GPU ladder, the H100 vs H200 guide covers the Hopper memory subsystem differences and whether the H200's larger KV cache headroom is worth the premium.
Both A100 and H100 are available on Spheron with per-minute billing and no contract. Compare current on-demand and spot rates for all GPU models on the GPU pricing page.
