
NVIDIA A100 vs H100: Specs, Benchmarks, and Cloud Pricing Guide (2026)

Written by Mitrasish, Co-founder · May 1, 2026

The A100 and H100 both ship with 80 GB of HBM and run the same CUDA workloads, but an H100 rental runs roughly 1.8x the hourly rate of a comparable A100 on Spheron ($2.90/hr vs $1.64/hr on-demand at the time of writing). Whether the H100's architecture improvements pay off depends on your model size, throughput target, and whether your stack can actually use FP8. This guide works through the specs, training and inference benchmarks, and cost-per-token math so you can make the call.

TL;DR: A100 vs H100 at a Glance

| Metric | A100 SXM4 | H100 SXM5 |
|---|---|---|
| Architecture | Ampere | Hopper |
| Process Node | TSMC 7nm | TSMC 4N |
| Transistors | 54.2B | 80B |
| TDP | 400W | 700W |
| FP16 Tensor TFLOPS | 312 | 1,979 |
| BF16 Tensor TFLOPS | 312 | 1,979 |
| FP8 Tensor TFLOPS | N/A | 3,958 |
| FP64 TFLOPS | 9.7 | 34 |
| VRAM | 80 GB | 80 GB |
| Memory Type | HBM2e | HBM3 |
| Memory Bandwidth | 2,039 GB/s | 3,350 GB/s |
| NVLink Version | 3.0 | 4.0 |
| NVLink Bandwidth (bidir.) | 600 GB/s | 900 GB/s |
| Transformer Engine | No | Yes |
| MIG Support | Up to 7 instances | Up to 7 instances |
| On-demand price/hr | ~$1.64 | ~$2.90 |
| Spot price/hr | ~$0.45 | ~$0.80 |

Pricing fluctuates based on GPU availability. The prices above are based on 1 May 2026 and may have changed. Check current GPU pricing for live rates.

Both GPUs have identical VRAM capacity. The H100 wins on raw compute throughput and memory bandwidth. The A100 currently has a lower on-demand floor on Spheron's marketplace, with spot pricing significantly below the H100's spot rate.

Architecture: Ampere vs Hopper

What stayed the same

Both GPUs run the same CUDA 11.8+ workloads. A100 containers run on H100 without modification: code built for the A100's compute capability 8.0 runs on the H100's 9.0 via PTX forward compatibility. Both support MIG with the same maximum of 7 isolated instances per GPU, both are fabricated by TSMC, and both expose the same driver and NVML APIs.

Transformer Engine

The Transformer Engine is the most consequential H100-only addition. It is a hardware and software pipeline that executes Transformer attention and feed-forward operations in FP8, specifically the E4M3 and E5M2 floating-point formats. FP8 uses half the memory of BF16 per operand, which means the Tensor Cores can process roughly twice as many values per cycle. On H100, this translates to 3,958 TFLOPS in FP8 compared to 1,979 TFLOPS in BF16.

To use it, you need framework support: PyTorch 2.1+ with transformer_engine.pytorch, vLLM 0.4+, or TensorRT-LLM 0.10+. The A100 has no FP8 hardware support. BF16 is the practical precision ceiling on A100 workloads.
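For intuition on the two FP8 formats, here's a back-of-envelope sketch in pure Python of their dynamic range versus BF16. It assumes the OCP FP8 convention: E4M3 reserves only the all-ones bit pattern for NaN (so its top exponent stays usable), while E5M2 and BF16 follow IEEE-style Inf/NaN encoding.

```python
# Max representable normal value for a small float format.
def max_normal(exp_bits: int, man_bits: int, ieee_inf: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_inf:
        # Top exponent encodes Inf/NaN, so the max normal uses exponent
        # max-1 with a full mantissa: (2 - 2^-m) * 2^emax.
        emax = (2 ** exp_bits - 2) - bias
        mantissa = 2 - 2 ** -man_bits
    else:
        # E4M3-style: only S.1111.111 is NaN, so the top exponent is
        # usable with the mantissa one step below full.
        emax = (2 ** exp_bits - 1) - bias
        mantissa = 2 - 2 * 2 ** -man_bits
    return mantissa * 2 ** emax

print(max_normal(4, 3, ieee_inf=False))  # E4M3 -> 448.0
print(max_normal(5, 2, ieee_inf=True))   # E5M2 -> 57344.0
print(max_normal(8, 7, ieee_inf=True))   # BF16 -> ~3.39e38
```

E5M2 trades mantissa precision for range, which is why the Transformer Engine typically keeps activations and weights in E4M3 and reserves E5M2 for gradients.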

HBM3 vs HBM2e

Both GPUs have 80 GB of HBM. The H100 uses HBM3 at 3,350 GB/s; the A100 uses HBM2e at 2,039 GB/s. That 1.64x bandwidth difference is the primary driver of H100's inference speedup on memory-bound attention operations, where the GPU is waiting on VRAM reads rather than doing compute.

For large language model inference, the decode phase requires reading the entire model weight tensor for every generated token. A 70B parameter model in FP16 is ~140 GB. On a single GPU with INT8 quantization (~70 GB), the H100's higher bandwidth means faster token generation even before FP8 comes into play.
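The weight-memory arithmetic behind those numbers is just parameters times bytes per parameter. A minimal sketch (inference weights only; it ignores KV cache, activations, and runtime overhead):

```python
# Model weight memory in GB: params (billions) x bits per parameter / 8.
def weight_gb(params_b: float, bits: int) -> float:
    return params_b * 1e9 * bits / 8 / 1e9

for bits, name in [(16, "FP16/BF16"), (8, "INT8/FP8"), (4, "INT4")]:
    print(f"70B @ {name}: {weight_gb(70, bits):.0f} GB")
# FP16 ~140 GB (needs two 80 GB GPUs), INT8 ~70 GB, INT4 ~35 GB
```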

NVLink 4.0 vs 3.0

The A100 uses NVLink 3.0: 12 links at 50 GB/s each for 600 GB/s total bidirectional bandwidth. The H100 uses NVLink 4.0: 18 links at 50 GB/s each for 900 GB/s total. For 8-GPU training, the 50% higher inter-GPU bandwidth reduces all-reduce communication stalls and improves scaling efficiency on 70B+ models where gradient synchronization is a bottleneck.
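To see why that bandwidth gap shows up in training step time, a back-of-envelope ring all-reduce estimate helps: each GPU moves about 2(n-1)/n of the payload over its links. The 140 GB payload below is a hypothetical full set of BF16 gradients for a 70B model; real jobs overlap communication with compute, and LoRA fine-tuning synchronizes far less data, so treat this as a lower bound on the sync cost, not a measurement.

```python
# Lower-bound ring all-reduce time: each GPU sends and receives
# 2*(n-1)/n * size bytes over its NVLink fabric.
def allreduce_seconds(size_gb: float, n_gpus: int, link_gbps: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * size_gb / link_gbps

grads_gb = 140  # hypothetical full BF16 gradients for a 70B model
a100 = allreduce_seconds(grads_gb, 8, 600)  # NVLink 3.0
h100 = allreduce_seconds(grads_gb, 8, 900)  # NVLink 4.0
print(f"A100: {a100:.2f}s  H100: {h100:.2f}s  speedup: {a100/h100:.2f}x")
```

The 1.5x bandwidth ratio carries straight through to sync time, which is where the scaling-efficiency gap in the next section comes from.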

PCIe Gen 5 vs Gen 4

On PCIe form factors (not SXM), the H100 PCIe offers 128 GB/s host-to-device bandwidth versus the A100 PCIe's 64 GB/s. This only matters for data loading pipelines where host memory transfers are frequent. SXM variants connect directly via NVLink and NVSwitch, so this distinction doesn't apply to the SXM configurations most commonly deployed on GPU clouds.

Full Specifications Comparison

| Specification | A100 SXM4 | A100 PCIe | H100 SXM5 | H100 PCIe |
|---|---|---|---|---|
| Architecture | Ampere | Ampere | Hopper | Hopper |
| Process Node | TSMC 7nm | TSMC 7nm | TSMC 4N | TSMC 4N |
| Transistors | 54.2B | 54.2B | 80B | 80B |
| CUDA Cores | 6,912 | 6,912 | 16,896 | 14,592 |
| Tensor Core Gen | 3rd Gen | 3rd Gen | 4th Gen | 4th Gen |
| VRAM | 80 GB HBM2e | 80 GB HBM2e | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | 2,039 GB/s | 1,935 GB/s | 3,350 GB/s | 2,000 GB/s |
| Memory Bus | 5,120-bit | 5,120-bit | 5,120-bit | 5,120-bit |
| L2 Cache | 40 MB | 40 MB | 50 MB | 50 MB |
| FP64 (TFLOPS) | 9.7 | 9.7 | 34 | 26 |
| FP32 (TFLOPS) | 19.5 | 19.5 | 67 | 51 |
| TF32 Tensor (TFLOPS) | 156 | 156 | 989 | 756 |
| BF16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP16 Tensor (TFLOPS) | 312 | 312 | 1,979 | 1,513 |
| FP8 Tensor (TFLOPS) | N/A | N/A | 3,958 | 3,026 |
| INT8 (TOPS) | 624 | 624 | 3,958 | 3,026 |
| NVLink Version | 3.0 | N/A | 4.0 | N/A |
| NVLink Bandwidth | 600 GB/s | N/A | 900 GB/s | N/A |
| PCIe | Gen 4 (64 GB/s) | Gen 4 (64 GB/s) | Gen 5 (128 GB/s) | Gen 5 (128 GB/s) |
| MIG Instances | Up to 7 | Up to 7 | Up to 7 | Up to 7 |
| Transformer Engine | No | No | Yes | Yes |
| TDP | 400W | 300W | 700W | 350W |

The H100 PCIe ships with HBM2e instead of HBM3, giving it 2,000 GB/s bandwidth rather than 3,350 GB/s. If you're comparing PCIe variants specifically, the inference bandwidth gap between A100 and H100 narrows to about 1.03x rather than 1.64x.

Training Benchmarks: Llama 3 70B Fine-Tuning

Llama 3 70B in FP16 requires roughly 140 GB of VRAM for model weights alone, which doesn't fit on a single 80 GB GPU. For fine-tuning, both A100 and H100 are typically used in 8-GPU configurations with NVLink and NVSwitch. The benchmarks below reflect LoRA fine-tuning at rank 64, 4K sequence length, gradient checkpointing enabled.

| GPU | Config | Precision | Batch Size | Tokens/sec | Relative to A100 BF16 |
|---|---|---|---|---|---|
| A100 SXM4 | 8x NVLink | BF16 | 4 | ~18,000 | 1.0x (baseline) |
| A100 SXM4 | 8x NVLink | INT8 | 4 | ~14,000 | 0.78x |
| H100 SXM5 | 8x NVLink | BF16 | 4 | ~38,000 | 2.1x |
| H100 SXM5 | 8x NVLink | FP8 | 4 | ~62,000 | 3.4x |

Reference throughput estimates based on publicly reported scaling ratios from NVIDIA's MLPerf training v4.0 results and community benchmarks. Actual numbers vary by LoRA config, sequence length, and framework version.

FP8 on H100 cuts training wall time by 3-4x compared to A100 BF16. For a team running weekly fine-tuning jobs that take 36 hours on A100, H100 FP8 finishes the same job in under 12 hours. The practical impact depends on whether your pipeline actually uses FP8. A plain BF16 H100 still delivers about 2x the A100's throughput.
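That wall-time claim is straightforward to sanity-check from the throughput ratios in the table above:

```python
# A 36-hour A100 BF16 job scaled by the table's relative throughput.
a100_hours = 36
for label, speedup in [("H100 BF16", 2.1), ("H100 FP8", 3.4)]:
    print(f"{label}: {a100_hours / speedup:.1f} h")
# H100 FP8 lands around 10.6 h, i.e. "under 12 hours"
```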

8-GPU Scaling Efficiency

| GPU | 1 GPU (relative) | 4 GPU (relative) | 8 GPU (relative) | Scaling Efficiency (8x) |
|---|---|---|---|---|
| A100 SXM4 | 1.0x | 3.7x | 6.8x | 85% |
| H100 SXM5 | 1.0x | 3.8x | 7.2x | 90% |
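Scaling efficiency here is simply measured speedup divided by GPU count:

```python
# Scaling efficiency = measured speedup / number of GPUs.
def efficiency(speedup: float, n_gpus: int) -> float:
    return speedup / n_gpus

print(f"A100 8x: {efficiency(6.8, 8):.0%}")  # 85%
print(f"H100 8x: {efficiency(7.2, 8):.0%}")  # 90%
```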

The H100's NVLink 4.0 reduces all-reduce communication overhead, giving it slightly better multi-GPU scaling efficiency. The gap matters most at 70B+ parameter scales where gradient synchronization is frequent.

Inference Benchmarks: Tokens per Second

Both GPUs have 80 GB of VRAM. For 70B models, INT4 quantization (A100) and W4A8 quantization (H100) keep model weights within the memory budget. BF16 70B requires tensor parallelism across two GPUs and is excluded from single-GPU rows.

vLLM Inference (Single GPU)

| Model | GPU | Precision | Concurrency | Tokens/sec (output) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 1 | ~150 | ~1,200 |
| Llama 3 70B | A100 SXM4 | INT4 TP=1 | 16 | ~1,100 | ~2,800 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 1 | ~340 | ~520 |
| Llama 3 70B | H100 SXM5 | W4A8 TP=1 | 16 | ~2,900 | ~1,100 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 1 | ~600 | ~180 |
| Llama 3 8B | A100 SXM4 | BF16 TP=1 | 32 | ~4,200 | ~640 |
| Llama 3 8B | H100 SXM5 | BF16 TP=1 | 1 | ~1,100 | ~95 |
| Llama 3 8B | H100 SXM5 | FP8 TP=1 | 32 | ~9,500 | ~290 |

Reference estimates based on community vLLM benchmarks and NVIDIA published data. Numbers vary by vLLM version (0.4+), driver version, and server configuration.

The H100 W4A8 advantage at high concurrency is significant: 2.6x the output tokens/sec of A100 INT4 at batch 16 for 70B models. At batch 1 (single-request latency), the H100 W4A8 TTFT is less than half the A100's, which matters for interactive use cases. For a deeper look at how continuous batching and PagedAttention drive these concurrency gains in vLLM, see that guide.

For 8B models in BF16, the H100 advantage is 1.8x at batch 1 and narrows slightly at high concurrency because the decode phase becomes increasingly memory-bandwidth-bound as batch size grows. At high concurrency, both GPUs saturate their memory buses, and the absolute bandwidth gap between them becomes less decisive than at low batch sizes where the H100's raw compute advantage dominates.

TensorRT-LLM Inference

| Model | GPU | Precision | Batch Size | Tokens/sec | vs A100 INT4 |
|---|---|---|---|---|---|
| Llama 3 70B | A100 SXM4 | INT4 | 8 | ~900 | 1.0x |
| Llama 3 70B | A100 SXM4 | INT4 | 32 | ~1,400 | 1.0x |
| Llama 3 70B | H100 SXM5 | W4A8 | 8 | ~2,100 | 2.3x |
| Llama 3 70B | H100 SXM5 | W4A8 | 32 | ~4,800 | 3.4x |

Based on NVIDIA TensorRT-LLM benchmarking data. Requires TensorRT-LLM 0.10+ for FP8 on H100.

H100 W4A8 with TensorRT-LLM roughly doubles to triples tokens/sec compared to A100 INT4 for 70B models at batch sizes of 16 and above. Below batch 8, the performance ratio narrows because the workload becomes memory-bandwidth-bound rather than compute-bound, and the H100's bandwidth advantage is less decisive at low concurrency. For step-by-step FP8 engine builds and multi-GPU serving setup, see the TensorRT-LLM deployment guide.

Memory and KV Cache: 80 GB HBM2e vs 80 GB HBM3

Both GPUs have 80 GB of VRAM, so they can hold the same model sizes in the same precisions. The meaningful difference is KV cache capacity at fixed concurrency and the throughput impact of serving those KV reads at different bandwidths.

KV cache size per request: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For Llama 3 70B (80 layers, 8 KV heads, 128 head dim) at 4K context in FP16: approximately 1.3 GB per request. At INT8, that halves to ~0.65 GB.
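The formula above translates directly into code. A quick check against the Llama 3 70B figures (80 layers, 8 KV heads under GQA, 128 head dim):

```python
# KV cache bytes per request: 2 (K and V) x layers x kv_heads x head_dim
# x seq_len x bytes per element.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(80, 8, 128, 4096, 2))  # 70B, FP16 KV @ 4K -> ~1.34 GB
print(kv_cache_gb(80, 8, 128, 4096, 1))  # 70B, INT8 KV @ 4K -> ~0.67 GB
print(kv_cache_gb(32, 8, 128, 4096, 2))  # 8B, FP16 KV @ 4K -> ~0.54 GB
```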

| GPU | Model | Precision | Model Weights | KV Cache per req (4K ctx) | Max concurrent reqs |
|---|---|---|---|---|---|
| A100 SXM4 | Llama 3 70B | INT4 | ~35 GB | ~0.65 GB (INT8 KV) | ~69 |
| H100 SXM5 | Llama 3 70B | W4A8 | ~35 GB | ~0.65 GB (INT8 KV) | ~69 |
| A100 SXM4 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB (BF16 KV) | ~128 |
| H100 SXM5 | Llama 3 8B | BF16 | ~16 GB | ~0.5 GB (BF16 KV) | ~128 |

When model weights are quantized to INT4 or W4A8, both GPUs have similar theoretical KV cache capacity. The H100 wins on the throughput side: with 3,350 GB/s vs 2,039 GB/s bandwidth, it serves those KV reads 1.64x faster, which sustains higher concurrency before hitting latency SLOs.

At long contexts (16K+), KV cache pressure becomes the bottleneck faster. A 16K-context request on Llama 3 70B FP16 requires ~5 GB of KV cache per request, limiting an INT4-loaded GPU to ~9 concurrent requests regardless of GPU model. The H100's higher bandwidth means it processes those 9 requests more quickly. For strategies to maximize KV cache capacity at long contexts, see KV cache optimization techniques.
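The concurrency ceilings above follow from dividing leftover VRAM by the per-request KV footprint. A minimal sketch (it ignores activation memory and runtime overhead, so real limits land a bit lower):

```python
# Max concurrent requests once weights are resident:
# leftover VRAM // KV cache per request.
def max_concurrent(vram_gb: float, weights_gb: float, kv_per_req_gb: float) -> int:
    return int((vram_gb - weights_gb) // kv_per_req_gb)

print(max_concurrent(80, 35, 0.65))  # 4K ctx, INT8 KV  -> 69
print(max_concurrent(80, 35, 5.0))   # 16K ctx, FP16 KV -> 9
```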

Cost Per Token Analysis

Using live Spheron pricing (1 May 2026) and reference benchmark throughput for Llama 3 70B:

Cost per million output tokens = (price_per_hour / tokens_per_second) / 3600 × 1,000,000
| GPU | Precision | Throughput (tok/s) | On-demand $/hr | Spot $/hr | On-demand $/M tok | Spot $/M tok |
|---|---|---|---|---|---|---|
| A100 SXM4 | INT4, batch 1 | ~150 | $1.64 | $0.45 | $3.04 | $0.83 |
| A100 SXM4 | INT4, batch 16 | ~1,100 | $1.64 | $0.45 | $0.41 | $0.11 |
| H100 SXM5 | W4A8, batch 1 | ~340 | $2.90 | $0.80 | $2.37 | $0.65 |
| H100 SXM5 | W4A8, batch 16 | ~2,900 | $2.90 | $0.80 | $0.28 | $0.08 |
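The cost-per-million-tokens formula is worth having as a reusable function so you can plug in your own measured throughput and live rates:

```python
# Cost per million output tokens from hourly price and sustained throughput.
def cost_per_m_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

print(f"${cost_per_m_tokens(1.64, 150):.2f}")   # A100 on-demand, batch 1  -> $3.04
print(f"${cost_per_m_tokens(2.90, 2900):.2f}")  # H100 on-demand, batch 16 -> $0.28
print(f"${cost_per_m_tokens(0.80, 2900):.2f}")  # H100 spot, batch 16      -> $0.08
```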

At batch 1 (single-stream latency), H100 W4A8 on-demand ($2.37/M tokens) is modestly cheaper than A100 on-demand ($3.04/M tokens): the H100's 2.3x throughput advantage is just enough to offset its higher hourly rate. A100 spot at batch 1 ($0.83/M) is substantially cheaper than H100 on-demand, so if your latency SLOs allow preemption risk, A100 spot is the cost leader at this concurrency.

At batch 16, H100 W4A8 on-demand ($0.28/M) is about 1.5x cheaper than A100 on-demand ($0.41/M), but A100 spot ($0.11/M) is still 2.5x cheaper than H100 on-demand. H100 spot ($0.08/M) is the cheapest option at scale when availability allows.

With $2.90/hr on-demand, H100 on-demand does not undercut A100 spot pricing at any typical batch size for Llama 3 70B. The practical split is: use H100 on-demand when guaranteed availability and modest cost savings over A100 on-demand matter, and A100 spot when cost-per-token is the top priority and preemption is acceptable.


When to Pick A100

  • Budget inference on models under 30B where INT8 quantization is available and throughput target is below 1,000 tok/s
  • 70B-class LoRA fine-tuning where you want the lowest hourly GPU spend and can tolerate longer run times (2-3x wall clock vs H100 FP8)
  • Legacy CUDA stacks (CUDA 11.x, TensorFlow 2.x, older PyTorch) that don't yet have Hopper kernel support
  • Multi-tenant inference with MIG where you need 7-partition isolation at the lowest cost per partition
  • Spot workloads at batch 8-16+ where the A100 spot rate ($0.45/hr) produces cost-per-token that is 2.5x cheaper than H100 on-demand ($2.90/hr), making it the clear choice when preemption risk is acceptable

When to Pick H100

  • FP8 production inference with vLLM 0.4+ or TensorRT-LLM 0.10+ where the Transformer Engine is active
  • Pre-training or continued pre-training on frontier models at 70B+ parameter scale
  • High-throughput serving with consistent batch sizes of 16+ concurrent requests
  • Long-context serving at 32K+ tokens where HBM3's bandwidth prevents latency spikes under load
  • Teams using FlashAttention-3 or custom Triton kernels that target Hopper's async memory pipeline
  • Any workload where time-to-result matters more than per-hour cost

A100 to H100 Migration Guide

Step 1: Verify driver and CUDA version

H100 requires driver 525+ and CUDA 11.8+. Check your current setup:

```bash
nvidia-smi
# Should show Driver Version >= 525.xx.xx
# CUDA Version >= 11.8

nvcc --version
# Should show release 11.8, V11.8 or higher
```

If you provision a fresh H100 instance on Spheron, the driver is pre-installed. No manual update needed.

Step 2: Update vLLM

```bash
pip install "vllm>=0.4.0"
```

vLLM 0.4.0 added H100 FP8 support, enabled with the --quantization fp8 serving flag. Older versions fall back to BF16 and you lose the Transformer Engine uplift.

Step 3: Enable FP8 at serving time

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000
```

The --quantization fp8 flag tells vLLM to quantize weights to FP8 at load time (dynamic quantization, no pre-quantized checkpoint needed) and route matmuls through the H100's FP8 kernels. Note there is no fp8 option for --dtype, so leave that at its default. To also store the KV cache in FP8 and roughly double the cache headroom, add --kv-cache-dtype fp8.

Step 4: Update FlashAttention

```bash
pip install "flash-attn>=2.4.0"
```

FlashAttention 2.4+ runs on Hopper out of the box; the kernels that fully exploit H100's async (TMA) memory pipeline arrive with FlashAttention-3, which targets Hopper exclusively and is distributed separately. In either case, supported kernels activate automatically when an H100 is detected. No flag changes are needed.

Step 5: Benchmark before committing

Run a quick throughput comparison on your actual workload. This script works against any vLLM OpenAI-compatible endpoint:

```python
import time
import requests

def measure_tps(endpoint, model, prompt, max_tokens=200):
    # Wall-clock tokens/sec for one completion. Includes TTFT, so treat
    # it as an end-to-end number, not pure decode speed.
    start = time.time()
    resp = requests.post(f"{endpoint}/v1/completions", json={
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }, timeout=120)
    resp.raise_for_status()
    elapsed = time.time() - start
    usage = resp.json().get("usage") or {}
    tokens_out = usage.get("completion_tokens", 0)
    if not tokens_out:
        raise ValueError(f"No completion_tokens in response from {endpoint}")
    return tokens_out / elapsed

# Run against the A100 and H100 endpoints with an identical prompt/config
model = "meta-llama/Meta-Llama-3-70B-Instruct"
a100_tps = measure_tps("http://a100-host:8000", model, "Explain attention mechanisms", 500)
h100_tps = measure_tps("http://h100-host:8000", model, "Explain attention mechanisms", 500)
print(f"A100: {a100_tps:.1f} tok/s, H100: {h100_tps:.1f} tok/s, "
      f"ratio: {h100_tps/a100_tps:.2f}x")
```

Run this across multiple batch sizes to find the actual crossover point for your workload.

Renting A100 and H100 on Spheron

Spheron offers bare-metal access to both A100 and H100 instances with per-minute billing and no contract requirements. Both GPU models are available on-demand and as spot instances, with NVLink SXM configurations for multi-GPU training jobs.

Current pricing (1 May 2026) shows the A100 SXM4 starting at $1.64/hr on-demand and $0.45/hr on spot. The H100 SXM5 starts at $2.90/hr on-demand and $0.80/hr on spot. Check current GPU pricing for live rates since marketplace prices change as supply and demand shifts.

For teams deciding between the two, the practical approach is to run your actual workload for 30 minutes on each GPU type using on-demand pricing, then calculate your real cost-per-million-tokens at your production batch size. The spec sheet comparison is useful for narrowing options, but the final call should come from a measured number on your actual model and serving configuration.

If you're coming from V100 hardware, the A100 vs V100 comparison covers the Ampere upgrade path in detail. For context on where H100 fits into the broader GPU ladder, the H100 vs H200 guide covers the Hopper memory subsystem differences and whether the H200's larger KV cache headroom is worth the premium.


Both A100 and H100 are available on Spheron with per-minute billing and no contract. Compare current on-demand and spot rates for all GPU models on the GPU pricing page.

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.