NVIDIA H100 Specs: Complete Datasheet, FP8/FP16 Throughput, and Memory Bandwidth (2026)

The NVIDIA H100 is built on the GH100 die: 80 billion transistors, TSMC 4N process, three form factors (SXM5, PCIe, NVL), and the first GPU with hardware FP8 support. For teams evaluating it, the spec sheet numbers that matter most are 3,350 GB/s memory bandwidth, 989 FP16 TFLOPS (dense), 1,979 FP16 TFLOPS with sparsity, and 3,958 FP8 TFLOPS with sparsity. You can provision H100 GPU rental on-demand with per-minute billing and no contracts. This post covers every figure in the H100 datasheet: all precision tiers, all three form factors, NVLink/NVSwitch specs, MIG profiles, and real-world inference math. Operators evaluating H100 against a lower-cost alternative for inference should read the L40S vs H100 cost-per-token comparison.

H100 Specs at a Glance

Specification	H100 SXM5	H100 PCIe	H100 NVL
Architecture	Hopper GH100	Hopper GH100	Hopper GH100
Process	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	80B	80B	80B
Streaming Multiprocessors	132 SMs	114 SMs	132 SMs per GPU
CUDA Cores	16,896	14,592	16,896
Tensor Cores - 4th Gen	528	456	528
VRAM	80 GB HBM3	80 GB HBM2e	94 GB HBM3 per GPU
Memory Bandwidth	3,350 GB/s	2,000 GB/s	3,900 GB/s
L2 Cache	50 MB	50 MB	50 MB
FP8 TFLOPS w/ sparsity	3,958	3,026	3,341
FP16 TFLOPS w/ sparsity	1,979	1,513	1,671
BF16 TFLOPS w/ sparsity	1,979	1,513	1,671
TF32 TFLOPS w/ sparsity	989	756	835
FP64 TFLOPS	34	26	30
NVLink bandwidth	900 GB/s	600 GB/s (optional bridge between GPU pairs)	600 GB/s bridge
TDP	700 W	350 W	350-400 W per GPU
Form factor	SXM5 board	PCIe x16 card	Dual-GPU bridge module

Hopper Architecture: GH100 Die

The H100 is fabricated on TSMC's 4N process, a customized 4 nm-class node developed specifically for NVIDIA. The GH100 die measures 814 mm² and packs 80 billion transistors, making it one of the largest GPU dies produced in volume.

The full die contains 144 streaming multiprocessors (SMs). Commercial H100 SKUs enable 132 of them (SXM5, NVL) or 114 (PCIe), leaving 12 or 30 SMs disabled respectively. Disabling SMs lets NVIDIA ship dies with localized defects instead of discarding them, and dies with more disabled SMs can be sold in the lower-SM PCIe configuration. The enabled SMs are fully functional, which is why the PCIe variant still delivers strong throughput at a lower price point.

Each SM on GH100 includes fourth-generation Tensor Cores (covered below), an L1/shared memory block, and warp schedulers. The Hopper architecture also introduces a new asynchronous execution engine called the Tensor Memory Accelerator (TMA), which handles bulk data transfers from global memory to shared memory without consuming warp resources. This frees CUDA threads to focus on compute rather than memory management.

Other Hopper-specific additions: dedicated NVDEC video and NVJPG image decode engines for feeding media pipelines, and Confidential Computing via on-die hardware encryption (covered later in this post).

Fourth-Gen Tensor Cores and the Transformer Engine

The H100's Tensor Cores operate in FP64, TF32, BF16, FP16, FP8, and INT8. The FP8 path is the headline: two formats are supported, E4M3 (4-bit exponent, 3-bit mantissa, used for forward-pass weights and activations) and E5M2 (5-bit exponent, 2-bit mantissa, used for gradients during training). E4M3 spends its extra mantissa bit on precision, which suits weights and activations; E5M2 gives up that bit for a wider dynamic range, which suits gradients.
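The precision-versus-range trade-off can be checked directly from the bit layouts. The sketch below assumes the commonly documented FP8 encodings (E4M3 with no infinities and a single NaN mantissa pattern at the top exponent, E5M2 following IEEE conventions); the function names are illustrative and not part of any NVIDIA library.

```python
# Largest finite value and spacing near 1.0 for the two FP8 formats.
# Assumption: E4M3 reserves only the all-ones mantissa at the top exponent
# for NaN; E5M2 reserves the whole top exponent for inf/NaN (IEEE style).

def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent is inf/NaN: use the exponent below it, full mantissa.
        exp_field = top_exp_field - 1
        man_field = 2 ** man_bits - 1
    else:
        # Top exponent is usable, except the all-ones mantissa (NaN).
        exp_field = top_exp_field
        man_field = 2 ** man_bits - 2
    return 2.0 ** (exp_field - bias) * (1 + man_field / 2 ** man_bits)


def fp8_step_at_one(man_bits: int) -> float:
    # Gap between 1.0 and the next representable value.
    return 2.0 ** -man_bits


e4m3_max = fp8_max(4, 3, ieee_like=False)
e5m2_max = fp8_max(5, 2, ieee_like=True)
print("E4M3 max:", e4m3_max, "step at 1.0:", fp8_step_at_one(3))
print("E5M2 max:", e5m2_max, "step at 1.0:", fp8_step_at_one(2))
```

Under these assumptions E4M3 tops out at 448 with a step of 0.125 near 1.0, while E5M2 reaches 57,344 with a coarser step of 0.25.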

The Transformer Engine library sits on top of these Tensor Cores. It is a hardware-software co-design: the silicon handles the FP8 matrix multiply, while the driver and framework libraries handle per-layer quantization scaling dynamically. You do not tune scaling factors manually. The engine monitors activation statistics each forward pass and adjusts precision automatically, switching layers between FP16 and FP8 based on where precision loss would degrade model quality.
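To make the idea of automatic scaling concrete, here is a minimal sketch of per-tensor scaling driven by a history of absolute maxima. It is a simplified illustration, not the Transformer Engine API: the class name `DelayedScaler`, the history length, and the margin parameter are all hypothetical.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


class DelayedScaler:
    """Hypothetical per-tensor scaler; NOT the Transformer Engine API."""

    def __init__(self, history_len: int = 16, margin: float = 1.0):
        # Rolling window of the largest magnitudes seen in recent steps.
        self.amax_history = deque(maxlen=history_len)
        self.margin = margin

    def update(self, tensor_values) -> None:
        # Record the largest magnitude observed in this step's tensor.
        self.amax_history.append(max(abs(v) for v in tensor_values))

    def scale(self) -> float:
        # Choose a scale so the largest recent value maps to the FP8 maximum.
        if not self.amax_history:
            return 1.0
        amax = max(self.amax_history)
        if amax == 0.0:
            return 1.0
        return E4M3_MAX / (amax * self.margin)


scaler = DelayedScaler()
scaler.update([0.5, -2.0, 1.0])
print(scaler.scale())  # values are multiplied by this before the FP8 cast
```

Using a window of past maxima rather than the current tensor alone means the scale is known before the matmul runs, at the cost of occasionally lagging a sudden spike.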

The practical result: FP8 inference delivers 1,979 TFLOPS at dense throughput and 3,958 TFLOPS with structured sparsity, compared to 989 TFLOPS (dense) and 1,979 TFLOPS (with sparsity) for FP16. That is a 2x compute multiplier from precision alone, with no manual work beyond installing a supported framework.

Framework support as of 2026: vLLM (--quantization fp8), TensorRT-LLM (native FP8 engine), and the transformer_engine Python package for PyTorch training. Older stacks that don't support FP8 fall back to BF16 and lose the throughput gain.

Throughput by Precision: Full TFLOPS Table

The table below lists all precision tiers for H100 SXM5 and PCIe, with and without structured sparsity.

Structured sparsity (2:4 pattern) means that two of every four consecutive weights are zero, and the hardware skips those zeros. Fourth-generation Tensor Cores support this pattern natively. NVIDIA's marketing figures lead with sparsity-enabled numbers; dense numbers are what you get unless the weight matrix has been explicitly pruned to the 2:4 pattern.
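As an illustration of what the 2:4 pattern requires of a weight matrix, the sketch below prunes a flat list of weights by magnitude, keeping the two largest values in each group of four. The function name is hypothetical, and real pruning workflows usually fine-tune afterwards to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned


print(prune_2_of_4([0.1, -0.9, 0.3, 0.05, 1.0, 2.0, -3.0, 0.5]))
# -> [0.0, -0.9, 0.3, 0.0, 0.0, 2.0, -3.0, 0.0]
```

Every group of four ends up with exactly two non-zero entries, which is the layout the sparsity-enabled TFLOPS figures assume.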

Precision	H100 SXM5 (no sparsity)	H100 SXM5 (w/ sparsity)	H100 PCIe (no sparsity)	H100 PCIe (w/ sparsity)
FP64	34 TFLOPS	N/A	26 TFLOPS	N/A
TF32	494 TFLOPS	989 TFLOPS	378 TFLOPS	756 TFLOPS
FP32 (CUDA)	67 TFLOPS	N/A	51 TFLOPS	N/A
FP16	989 TFLOPS	1,979 TFLOPS	756 TFLOPS	1,513 TFLOPS
BF16	989 TFLOPS	1,979 TFLOPS	756 TFLOPS	1,513 TFLOPS
FP8	1,979 TFLOPS	3,958 TFLOPS	1,513 TFLOPS	3,026 TFLOPS
INT8	1,979 TOPS	3,958 TOPS	1,513 TOPS	3,026 TOPS

FP32 CUDA refers to the scalar floating-point path through CUDA cores, not Tensor Cores. It is relevant for non-AI workloads (simulation, rendering) but essentially unused in LLM training and inference.

Memory: HBM3 vs HBM2e, 80 GB vs 94 GB

The H100 SXM5 uses HBM3 at 3,350 GB/s; the NVL uses HBM3 at 3,900 GB/s per GPU. The H100 PCIe uses HBM2e at 2,000 GB/s. The NVL variant carries 94 GB per GPU instead of the standard 80 GB, giving a 2-GPU pair 188 GB total.

For LLM inference, memory bandwidth matters more than raw TFLOPS. Inference on large models is memory-bound: the GPU loads weight tensors from HBM for every token generated. The compute engines are fast enough that they idle, waiting for weights to arrive from HBM. This is the roofline model constraint: if arithmetic intensity (FLOP per byte loaded) is below the hardware's peak FLOP/byte ratio, you are bandwidth-limited.

For a 70B FP16 model, each decode step reads all ~140 GB of weights once. At 3,350 GB/s, that sets a floor of roughly 42 ms per step in the memory-bound regime; with the weights split across two SXM5 GPUs (6,700 GB/s aggregate), the floor drops to about 21 ms. Faster HBM directly reduces that floor.
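The roofline argument can be written out as a few lines of arithmetic. This is a sketch under simplifying assumptions: it uses the datasheet peak numbers quoted in this post, counts 2 FLOPs per parameter per token, and ignores KV cache and activation traffic. The function names are illustrative.

```python
PEAK_FP16_DENSE_FLOPS = 989e12  # H100 SXM5, dense FP16 (from the table above)
HBM_BANDWIDTH_BPS = 3350e9      # H100 SXM5 HBM3 bandwidth, bytes per second


def ridge_point(peak_flops: float, bandwidth_bps: float) -> float:
    # FLOPs per byte needed before compute, not memory, becomes the limit.
    return peak_flops / bandwidth_bps


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    # Each weight is read once per step and used for ~2 FLOPs per sequence.
    return 2.0 * batch_size / bytes_per_param


def weight_read_floor_ms(weight_bytes: float, bandwidth_bps: float) -> float:
    # Minimum time to stream every weight from HBM once.
    return weight_bytes / bandwidth_bps * 1e3


ridge = ridge_point(PEAK_FP16_DENSE_FLOPS, HBM_BANDWIDTH_BPS)
print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode intensity at batch 32: {decode_intensity(32):.0f} FLOP/byte")
print(f"floor for 140 GB of weights: {weight_read_floor_ms(140e9, HBM_BANDWIDTH_BPS):.1f} ms")
```

With a ridge point near 295 FLOP/byte and a decode intensity of about 32 FLOP/byte at batch size 32, decode sits well inside the bandwidth-limited region under these assumptions.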

The NVL's 94 GB matters for a specific case: Llama 3 70B FP16 weights alone occupy ~140 GB, which cannot fit on a single 80 GB SXM5. Two SXM5 GPUs (160 GB) hold the weights with about 20 GB left over for KV cache. An NVL pair (188 GB total, 94 GB per GPU over the NVLink bridge) still runs the model with 2-way tensor parallelism, but leaves about 48 GB for KV cache, which allows longer contexts or larger batches. For a deeper look at how the three variants handle different workload shapes, see the full H100 form factor guide.

Form Factors: SXM5, PCIe, NVL

Form Factor	TDP	VRAM	Inter-GPU	Best For
H100 SXM5	700 W	80 GB HBM3	NVSwitch 900 GB/s	Multi-GPU training, FSDP, large-scale inference
H100 PCIe	350 W	80 GB HBM2e	PCIe Gen5 128 GB/s	Single-node fine-tuning, cost-sensitive inference
H100 NVL	350-400 W per GPU	94 GB HBM3	NVLink bridge 600 GB/s	70B+ inference, long context, large KV cache

The SXM5 form factor requires a specialized HGX board and is only available in server configurations built around it. The PCIe variant slots into standard dual-slot PCIe Gen5 servers. The NVL module is PCIe form factor with an additional NVLink bridge connecting the two GPUs on the same module.

For detailed decision logic on which variant fits which workload, see the H100 form factor breakdown.

NVLink 4.0 and NVSwitch: 900 GB/s GPU-to-GPU

NVLink 4.0 on the H100 SXM5 uses 18 links at 50 GB/s bidirectional each, for a total of 900 GB/s bidirectional per GPU. Compare this to PCIe Gen5 x16, which delivers 128 GB/s bidirectional. NVLink is roughly 7x faster for GPU-to-GPU transfers.

Inside an HGX H100 8-GPU server, NVSwitch provides a non-blocking all-to-all fabric. Every GPU can communicate with every other GPU simultaneously at full NVLink bandwidth, with no contention. This is what makes large-model distributed training practical on a single node.

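The practical gap shows up in all-reduce operations during FSDP training on a 70B model. In a bandwidth-only ring all-reduce model, a 100 GB gradient sync across 8 GPUs moves about 175 GB per GPU: roughly 0.4 seconds at NVLink's 450 GB/s per direction, versus roughly 2.7 seconds at PCIe Gen5's 64 GB/s per direction. Measured times are higher on both once latency and protocol overhead are included, but the roughly 7x gap remains, and it compounds over thousands of training steps.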
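A bandwidth-only model of a ring all-reduce shows where the roughly 7x gap comes from. This is a sketch, not a benchmark: it assumes a ring algorithm, uses half of each quoted bidirectional figure as the per-direction rate, and ignores latency, protocol overhead, and compute overlap, so its absolute times are lower bounds rather than expected wall-clock numbers.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           bidirectional_bw_bps: float) -> float:
    # Each GPU sends (and receives) 2*(N-1)/N times the payload in a ring.
    per_direction_bw = bidirectional_bw_bps / 2
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / per_direction_bw


NVLINK_BIDIR_BPS = 900e9     # NVLink 4.0 total per GPU, bidirectional
PCIE_GEN5_BIDIR_BPS = 128e9  # PCIe Gen5 x16, bidirectional

nvlink_s = ring_allreduce_seconds(100e9, 8, NVLINK_BIDIR_BPS)
pcie_s = ring_allreduce_seconds(100e9, 8, PCIE_GEN5_BIDIR_BPS)
print(f"NVLink: {nvlink_s:.2f} s, PCIe Gen5: {pcie_s:.2f} s, "
      f"ratio: {pcie_s / nvlink_s:.1f}x")
```

The ratio between the two results depends only on the bandwidth ratio (900 / 128), which is why it survives even when real-world overheads inflate both absolute times.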

The NVL module uses a point-to-point NVLink bridge at 600 GB/s between the two paired GPUs on the same card. It does not use NVSwitch and cannot scale beyond the pair, which is why SXM5 is the right choice for any multi-GPU training setup beyond 2 GPUs. For a direct comparison of NVLink 4.0 vs NVLink 3.0 on A100, see the A100 vs H100 deep dive.

MIG: Up to 7 Instances Per GPU

Multi-Instance GPU (MIG) partitions the H100 into up to 7 isolated instances. Each instance gets a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM. Unlike vGPU time-slicing, MIG gives hardware-enforced isolation: one tenant's workload cannot access another's memory or affect their compute bandwidth.

H100 SXM5 MIG profiles:

MIG Profile	VRAM	CUDA Cores	Use Case
1g.10gb	10 GB	2,048	Dev/test, small model inference
2g.20gb	20 GB	4,096	7B model inference
3g.40gb	40 GB	6,144	13B-20B inference
4g.40gb	40 GB	8,192	Larger batch or 30B models
7g.80gb	80 GB	16,896	Full GPU single tenant

The most common production pattern is 7x 1g.10gb for multi-tenant inference of small models, or 2x 3g.40gb for two simultaneous 13B model servers. MIG instances appear as separate GPU devices to the OS, so each can run its own container independently. For the hands-on setup walkthrough, see the MIG setup guide.
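A quick way to sanity-check a planned partition is to add up compute slices and memory against the GPU's totals. The sketch below uses only the profiles in the table above and is a necessary-condition check under simplifying assumptions: the real driver enforces additional placement rules, so a layout that passes here may still be rejected. The names are illustrative.

```python
# profile name -> (compute slices, memory in GB), taken from the table above
MIG_PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}
MAX_COMPUTE_SLICES = 7
TOTAL_MEMORY_GB = 80


def layout_fits(profiles) -> bool:
    """Check slice and memory totals only; placement rules are not modeled."""
    compute = sum(MIG_PROFILES[p][0] for p in profiles)
    memory = sum(MIG_PROFILES[p][1] for p in profiles)
    return compute <= MAX_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB


print(layout_fits(["1g.10gb"] * 7))          # seven small tenants
print(layout_fits(["3g.40gb", "3g.40gb"]))   # two 13B model servers
print(layout_fits(["4g.40gb", "4g.40gb"]))   # needs 8 compute slices
```

The first two layouts pass and the third fails, because two 4g instances would need eight compute slices on a GPU that exposes seven.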

Confidential Computing

The H100 introduces hardware-enforced Confidential Computing via a trusted execution environment (TEE) built into the GH100 die. Encryption is AES-256, executed on-die, with no performance overhead on compute operations. Memory transfers between the GPU and CPU are encrypted over the NVLink or PCIe fabric.

The TEE provides hardware-enforced isolation between VMs sharing the same physical GPU. A hypervisor or cloud operator cannot access a tenant's data in memory or in transit. Attestation is handled through NVIDIA's Hopper Attestation SDK, which allows software to cryptographically verify it is running on a genuine, unmodified H100 TEE before submitting sensitive data.

Use cases include healthcare organizations deploying LLMs on patient records, financial institutions running inference on proprietary trading data, and any regulated workload where multi-tenant cloud infrastructure creates compliance concerns.

H100 vs H200, A100, B200: Spec Deltas

H100 SXM5 vs A100 SXM4

Metric	A100 SXM4	H100 SXM5	Delta
Process	TSMC 7nm	TSMC 4N	Newer node
Transistors	54B	80B	+48%
FP16 TFLOPS (no sparsity)	312	989	3.2x
FP8 TFLOPS (no sparsity)	N/A	1,979	New capability
Memory Bandwidth	2,039 GB/s	3,350 GB/s	1.64x
VRAM	80 GB HBM2e	80 GB HBM3	Same capacity
TDP	400 W	700 W	1.75x

For a full A100 vs H100 comparison including training benchmarks and cost-per-token math, see the full A100 vs H100 benchmarks.

H100 SXM5 vs H200 SXM

Metric	H100 SXM5	H200 SXM	Delta
VRAM	80 GB HBM3	141 GB HBM3e	+76%
Memory Bandwidth	3,350 GB/s	4,800 GB/s	+43%
FP8 TFLOPS (w/ sparsity)	3,958	3,958	Equal
CUDA Cores	16,896	16,896	Equal
TDP	700 W	700 W	Equal

The compute engines are identical. The H200's advantage is memory: 43% more bandwidth and 76% more capacity. For memory-bound inference on 70B+ models, this translates to 37-45% higher token throughput. For a full performance breakdown, see the H100 vs H200 inference comparison.

H100 SXM5 vs B200

Metric	H100 SXM5	B200	Delta
Architecture	Hopper	Blackwell	Next-gen
VRAM	80 GB HBM3	192 GB HBM3e	2.4x
Memory Bandwidth	3,350 GB/s	8,000 GB/s	2.4x
FP8 TFLOPS (w/ sparsity)	3,958	9,000+	~2.3x
TDP	700 W	1,000 W	+43%

The B200 is a full architectural generation ahead. For teams that can't wait for Blackwell availability, H100 remains the most accessible high-performance GPU for production inference. For a full Blackwell deep-dive, see the B200 complete guide.

Real-World Inference: How H100 Specs Translate to Tokens/Second

These estimates are based on roofline analysis and community benchmarks from vLLM and TensorRT-LLM. Actual throughput depends on batch size, sequence length, framework version, and quantization implementation. Numbers can shift 10-30% in practice.

Llama 3 70B (FP16, 2x H100 SXM5)

Model size in VRAM: ~140 GB (doesn't fit on a single 80 GB GPU)
Requires 2-way tensor parallelism across 2x H100 SXM5
On a 2x H100 SXM5 NVLink setup: roughly 3,500-4,500 tokens/sec at batch size 32
Memory bottleneck: 2x 3,350 GB/s = 6,700 GB/s combined for the model weights

Llama 3 70B (FP8, single H100 SXM5)

FP8 weights: ~70 GB (fits on a single H100 with KV cache room)
vLLM FP8 (--quantization fp8): roughly 1,800-2,200 tokens/sec at batch size 16
Memory bottleneck math: 3,350 GB/s bandwidth, ~70 GB FP8 weights means ~21 ms minimum per token in pure memory-bound regime
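The memory-bottleneck lines for both Llama 3 70B configurations can be turned into a throughput ceiling. This is a sketch assuming a dense model in which every decode step streams all weights once and produces one token per sequence in the batch; KV cache reads, prefill, and speculative decoding are not modeled. The function names are illustrative.

```python
def decode_step_ms(weight_bytes: float, aggregate_bw_bps: float) -> float:
    # Time to stream all weights once at the combined HBM bandwidth.
    return weight_bytes / aggregate_bw_bps * 1e3


def decode_ceiling_tokens_per_sec(weight_bytes: float, aggregate_bw_bps: float,
                                  batch_size: int) -> float:
    # One token per sequence per step, so the ceiling scales with batch size.
    return batch_size * aggregate_bw_bps / weight_bytes


SXM5_BW_BPS = 3350e9

fp16_two_gpu = decode_ceiling_tokens_per_sec(140e9, 2 * SXM5_BW_BPS, 32)
fp8_one_gpu = decode_ceiling_tokens_per_sec(70e9, SXM5_BW_BPS, 16)
print(f"FP16, 2 GPUs, batch 32: {decode_step_ms(140e9, 2 * SXM5_BW_BPS):.1f} ms/step, "
      f"ceiling {fp16_two_gpu:.0f} tokens/sec")
print(f"FP8, 1 GPU, batch 16: {decode_step_ms(70e9, SXM5_BW_BPS):.1f} ms/step, "
      f"ceiling {fp8_one_gpu:.0f} tokens/sec")
```

Under these assumptions the ceilings land near 1,530 tokens/sec (FP16, two GPUs, batch 32) and 770 tokens/sec (FP8, one GPU, batch 16). A quoted throughput above those values would imply something beyond weight-streaming decode at that batch size, such as a larger effective batch from continuous batching or prefill tokens counted in the total.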

Mixtral 8x7B (FP8, single H100 SXM5)

Active parameters per forward pass: ~12B (2 of 8 experts per token)
Total weights: ~93 GB in FP16, which requires 2 GPUs with tensor parallelism; roughly half that in FP8, which fits on a single 80 GB GPU
With FP8 quantization: expected decode at batch size 32 is roughly 4,000-6,000 tokens/sec

DeepSeek-R1 671B (multi-GPU tensor parallelism)

Needs two 8-GPU NVSwitch nodes in FP8, or a single node with 4-bit quantization
FP8 weights: ~671 GB (671B params × 1 byte per parameter). Split 8 ways that is ~84 GB per GPU, which exceeds the 80 GB of HBM3 before any KV cache is allocated, so naive 8-way tensor parallelism cannot hold the weights. Practitioners typically run DeepSeek-R1 671B on 16 H100 SXM5 GPUs (two 8-GPU nodes, ~42 GB per GPU in FP8) or apply INT4/GPTQ quantization to bring per-GPU weight size below 80 GB
On 16× H100 SXM5 in FP8 (~42 GB per GPU, weights fit cleanly): roughly 2,000-3,200 tokens/sec at batch size 32
On 8× H100 SXM5 with INT4/GPTQ (671B × 0.5 bytes ≈ 336 GB, or ~42 GB per GPU): roughly 900-1,400 tokens/sec at batch size 32. That is lower than FP8 because of dequantization overhead, but it is feasible on a single 8-GPU node
NVSwitch's 900 GB/s all-to-all bandwidth applies only within a single 8-GPU HGX node; for multi-node runs, InfiniBand HDR/NDR (~25-50 GB/s per port) carries cross-node traffic and becomes the primary tensor-parallel communication bottleneck
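The sizing arithmetic for DeepSeek-R1 671B can be reproduced with a small helper. It is a sketch that counts weights only and assumes they are split evenly across GPUs; KV cache, activations, and framework overhead all need additional headroom. The names are illustrative.

```python
HBM_PER_GPU_GB = 80  # H100 SXM5


def weights_per_gpu_gb(params_billion: float, bytes_per_param: float,
                       num_gpus: int) -> float:
    # 1 billion parameters at 1 byte each is 1 GB, so the units cancel.
    return params_billion * bytes_per_param / num_gpus


configs = {
    "FP8, 8 GPUs": weights_per_gpu_gb(671, 1.0, 8),
    "FP8, 16 GPUs": weights_per_gpu_gb(671, 1.0, 16),
    "INT4, 8 GPUs": weights_per_gpu_gb(671, 0.5, 8),
}
for name, gb in configs.items():
    verdict = "fits" if gb < HBM_PER_GPU_GB else "does not fit"
    print(f"{name}: {gb:.1f} GB per GPU ({verdict} in {HBM_PER_GPU_GB} GB)")
```

Only the FP8 8-GPU case exceeds 80 GB per GPU, which is why the two practical options are doubling the GPU count or halving the bytes per parameter.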

These figures assume the KV cache is not the bottleneck. At long context lengths the KV cache grows, takes VRAM away from batching, and adds to the bytes read on every decode step, which pushes throughput below these estimates. For a full guide on tuning inference for these models, see the LLM inference optimization guide.
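To see how quickly the KV cache grows, the sketch below estimates its size for Llama 3 70B. The architecture values (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) are assumptions taken from the published model configuration rather than from this post, and the cache is assumed to be stored in FP16 with no paging overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    # One key and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, batch_size: int,
                 per_token_bytes: int) -> float:
    return context_tokens * batch_size * per_token_bytes / 2 ** 30


# Assumed Llama 3 70B configuration, FP16 cache (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
print(per_token, "bytes per token")
print(kv_cache_gib(8192, 1, per_token), "GiB for one 8K-token sequence")
print(kv_cache_gib(8192, 16, per_token), "GiB for 16 such sequences")
```

Under these assumptions one 8K-token sequence needs 2.5 GiB and sixteen of them need 40 GiB, which is more than the headroom left on a single 80 GB GPU holding ~70 GB of FP8 weights.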

Rent H100 on Spheron: Pricing

Live pricing from the Spheron marketplace as of 20 May 2026:

SKU	On-Demand ($/hr)	Spot ($/hr)
H100 SXM5	$2.64	$1.66
H100 PCIe	$2.09	N/A
H100 NVL	$2.15	N/A

Pricing fluctuates with GPU availability and may have changed since 20 May 2026. Check current GPU pricing for live rates.
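Hourly prices become comparable across GPUs once they are converted to cost per million tokens. The sketch below uses the on-demand and spot H100 SXM5 rates from the table above; the throughput of 1,000 tokens/sec is a placeholder, not a measurement, and should be replaced with a number from your own benchmark.

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    # Tokens generated in one hour, then scaled to one million tokens.
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6


PLACEHOLDER_TOKENS_PER_SEC = 1000.0  # hypothetical; substitute a measured value

on_demand = usd_per_million_tokens(2.64, PLACEHOLDER_TOKENS_PER_SEC)
spot = usd_per_million_tokens(1.66, PLACEHOLDER_TOKENS_PER_SEC)
print(f"on-demand: ${on_demand:.2f} per 1M tokens")
print(f"spot: ${spot:.2f} per 1M tokens")
```

Because cost scales inversely with throughput, doubling sustained tokens/sec halves the cost per token at the same hourly rate.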

To deploy H100 instances now, visit the H100 rental page on Spheron or go directly to app.spheron.ai to provision in minutes. For a comparison of GCP A3 H100 pricing against Spheron across on-demand, spot, and committed-use tiers, see Google Cloud A3 H100 pricing 2026. To put an H100 to work on a real dual-utility workload, see how to set up a Pearl Research node on H100. For the latest H100 cloud pricing and market news updated monthly, see our H100 news tracker.

Running inference on Llama 70B or a fine-tuning job that needs Hopper-class memory bandwidth? Spheron provides on-demand and spot H100 SXM5 instances from 5+ providers, billed per minute.

Frequently Asked Questions

How much memory does the H100 have?

The H100 SXM5 and PCIe both ship with 80 GB of HBM memory. The SXM5 uses HBM3 at 3,350 GB/s bandwidth; the PCIe uses HBM2e at 2,000 GB/s. The H100 NVL module is an exception: it carries 94 GB HBM3 per GPU at 3,900 GB/s bandwidth, for a total of 188 GB across its two-GPU bridge pair.

What is the H100's FP8 throughput?

The H100 SXM5 delivers 3,958 TFLOPS in FP8 (with sparsity). Without sparsity the figure is 1,979 TFLOPS. The PCIe variant delivers 3,026 TFLOPS FP8 with sparsity. FP8 throughput is enabled by the fourth-generation Tensor Cores and the Hopper Transformer Engine.

Is the H100 better than the A100?

Yes, for most AI workloads. The H100 delivers about 3.2x the dense FP16 Tensor Core throughput of the A100 (989 vs 312 TFLOPS), about 6x when FP8 on the H100 is compared with FP16 on the A100 (1,979 vs 312 TFLOPS), and 1.64x the memory bandwidth. The A100's main advantage is cost: it typically rents for less per hour and still handles 70B-class inference and fine-tuning well when FP8 is not required.

What is the H100's TDP?

The H100 SXM5 has a TDP of 700 W. The H100 PCIe has a TDP of 350 W. The H100 NVL is configurable at 350-400 W per GPU. Always confirm power budgets with your data center or cloud provider before deployment.

How many MIG instances does the H100 support?

Up to 7 MIG instances per GPU. Each instance gets a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM memory. On the H100 SXM5 the full 7-way partition gives each instance approximately 10 GB of HBM3 and 2,048 CUDA cores.

How many transistors does the H100 have?

The H100 GH100 die contains approximately 80 billion transistors, fabricated on TSMC's 4N (4 nm class) custom process node. The full die is 814 mm² and contains 144 streaming multiprocessors, though commercial H100 SKUs enable 132 SMs (SXM5, NVL) or 114 SMs (PCIe).