
NVIDIA H100 Specs: Complete Datasheet, FP8/FP16 Throughput, and Memory Bandwidth (2026)

Written by Mitrasish, Co-founder · May 20, 2026

The NVIDIA H100 is built on the GH100 die: 80 billion transistors, TSMC 4N process, three form factors (SXM5, PCIe, NVL), and NVIDIA's first hardware FP8 support. For teams evaluating it, the spec sheet numbers that matter most for the SXM5 are 3,350 GB/s memory bandwidth, 989 FP16 TFLOPS (dense), 1,979 FP16 TFLOPS with sparsity, and 3,958 FP8 TFLOPS with sparsity. You can rent H100 GPUs on demand with per-minute billing and no contracts. This post covers every figure in the H100 datasheet: all precision tiers, all three form factors, NVLink/NVSwitch specs, MIG profiles, and real-world inference math.

H100 Specs at a Glance

| Specification | H100 SXM5 | H100 PCIe | H100 NVL |
| --- | --- | --- | --- |
| Architecture | Hopper GH100 | Hopper GH100 | Hopper GH100 |
| Process | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors | 80B | 80B | 80B |
| Streaming Multiprocessors | 132 | 114 | 132 per GPU |
| CUDA Cores | 16,896 | 14,592 | 16,896 |
| Tensor Cores (4th gen) | 528 | 456 | 528 |
| VRAM | 80 GB HBM3 | 80 GB HBM2e | 94 GB HBM3 per GPU |
| Memory Bandwidth | 3,350 GB/s | 2,000 GB/s | 3,900 GB/s |
| L2 Cache | 50 MB | 50 MB | 50 MB |
| FP8 TFLOPS w/ sparsity | 3,958 | 3,026 | 3,341 |
| FP16 TFLOPS w/ sparsity | 1,979 | 1,513 | 1,671 |
| BF16 TFLOPS w/ sparsity | 1,979 | 1,513 | 1,671 |
| TF32 TFLOPS w/ sparsity | 989 | 756 | 835 |
| FP64 TFLOPS | 34 | 26 | 30 |
| NVLink bandwidth | 900 GB/s | N/A | 600 GB/s bridge |
| TDP | 700 W | 350 W | 350-400 W per GPU |
| Form factor | SXM5 board | PCIe x16 card | Dual-GPU bridge module |

Hopper Architecture: GH100 Die

The H100 is fabricated on TSMC's 4N process, a customized 4 nm-class node developed specifically for NVIDIA. The GH100 die measures 814 mm² and packs 80 billion transistors, making it one of the largest GPU dies produced in volume.

The full die contains 144 streaming multiprocessors (SMs). Commercial H100 SKUs enable 132 of them (SXM5, NVL) or 114 (PCIe). NVIDIA bins the dies: an SXM5 or NVL part tolerates up to 12 disabled SMs and a PCIe part tolerates up to 30, so dies with more defects can ship as H100 PCIe rather than being discarded. The remaining SMs and memory controllers are fully functional, which is why the PCIe variant still delivers strong throughput at a lower price point.

Each SM on GH100 includes fourth-generation Tensor Cores (covered below), an L1/shared memory block, and warp schedulers. The Hopper architecture also introduces a new asynchronous execution engine called the Tensor Memory Accelerator (TMA), which handles bulk data transfers from global memory to shared memory without consuming warp resources. This frees CUDA threads to focus on compute rather than memory management.

Other Hopper-specific additions: seven NVDEC video decoders and seven NVJPG image decoders for media preprocessing (GH100 has no NVENC hardware encoder), and Confidential Computing via on-die hardware encryption (covered later in this post).

Fourth-Gen Tensor Cores and the Transformer Engine

The H100's Tensor Cores operate in FP64, TF32, BF16, FP16, FP8, and INT8. The FP8 path is the headline: two formats are supported, E4M3 (4-bit exponent, 3-bit mantissa, used for forward-pass weights and activations) and E5M2 (5-bit exponent, 2-bit mantissa, used for gradients during training). E4M3 has one more mantissa bit, so it offers finer precision for weights and activations; E5M2 has one more exponent bit, so it offers the wider dynamic range that gradients need.
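The precision-versus-range trade-off between the two FP8 formats can be worked out from the bit layout. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes the commonly published FP8 encodings, in which E5M2 reserves its top exponent code for infinity and NaN (IEEE style) while E4M3 gives up only a single NaN bit pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp8Format:
    """Hypothetical helper describing an 8-bit float layout (1 sign bit implied)."""
    name: str
    exp_bits: int
    man_bits: int
    reserves_top_exponent: bool  # assumption: True for E5M2, False for E4M3

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        top_code = 2 ** self.exp_bits - 1
        if self.reserves_top_exponent:
            # Top exponent code is inf/NaN, so the largest value uses the
            # next code down with a full mantissa.
            exponent = (top_code - 1) - self.bias
            mantissa = 2.0 - 2.0 ** -self.man_bits
        else:
            # Only the all-ones mantissa at the top exponent is NaN, so the
            # largest value uses the top exponent with one mantissa step less.
            exponent = top_code - self.bias
            mantissa = 2.0 - 2.0 ** (1 - self.man_bits)
        return mantissa * 2.0 ** exponent

    @property
    def min_normal(self) -> float:
        return 2.0 ** (1 - self.bias)

    @property
    def relative_step(self) -> float:
        # Spacing between adjacent values relative to the leading power of two.
        return 2.0 ** -self.man_bits


E4M3 = Fp8Format("E4M3", exp_bits=4, man_bits=3, reserves_top_exponent=False)
E5M2 = Fp8Format("E5M2", exp_bits=5, man_bits=2, reserves_top_exponent=True)

for fmt in (E4M3, E5M2):
    print(f"{fmt.name}: max={fmt.max_finite:g} "
          f"min_normal={fmt.min_normal:g} step={fmt.relative_step:g}")
```

Under these assumptions E4M3 tops out at 448 with a relative step of 1/8, while E5M2 reaches 57,344 with a coarser step of 1/4, which is the reason gradients with large dynamic range are usually assigned to E5M2.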

The Transformer Engine sits on top of these Tensor Cores. It is a hardware-software co-design: the silicon handles the FP8 matrix multiply, while the driver and framework libraries handle per-layer quantization scaling dynamically. You do not tune scaling factors manually. The engine monitors activation statistics each forward pass and adjusts precision automatically, switching layers between FP16 and FP8 based on where precision loss would degrade model quality.

The practical result: FP8 inference delivers 1,979 TFLOPS at dense throughput and 3,958 TFLOPS with structured sparsity, compared to 989 TFLOPS (dense) and 1,979 TFLOPS (with sparsity) for FP16. That is a 2x compute multiplier from precision alone, with no manual work beyond installing a supported framework.

Framework support as of 2026: vLLM (--quantization fp8), TensorRT-LLM (native FP8 engine), and the transformer_engine Python package for PyTorch training. Older stacks that don't support FP8 fall back to BF16 and lose the throughput gain.

Throughput by Precision: Full TFLOPS Table

The table below lists all precision tiers for H100 SXM5 and PCIe, with and without structured sparsity.

Structured sparsity (2:4 pattern) means that in every group of four consecutive weights, two are zero, and the hardware skips them. Fourth-generation Tensor Cores support this pattern natively. NVIDIA's marketing figures lead with sparsity-enabled numbers; dense numbers are what you get unless the weight matrix has been explicitly pruned to the 2:4 pattern.
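To make the 2:4 pattern concrete, here is a minimal sketch that prunes a flat list of weights so that each group of four keeps only two nonzero values. The function name is hypothetical, and keeping the two largest magnitudes is an assumed selection rule chosen for illustration; it is not NVIDIA's pruning tooling, and real workflows typically fine-tune after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four.

    Assumption: magnitude-based selection. The hardware only requires that
    two of every four elements are zero, not how they were chosen.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")

    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        order = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:2])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 1.0, 2.0, -3.0, 0.5]
print(prune_2_of_4(row))
```

Every group of four in the result holds exactly two zeros, which is the layout the sparse Tensor Core path expects.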

| Precision | H100 SXM5 (no sparsity) | H100 SXM5 (w/ sparsity) | H100 PCIe (no sparsity) | H100 PCIe (w/ sparsity) |
| --- | --- | --- | --- | --- |
| FP64 | 34 TFLOPS | 34 TFLOPS | 26 TFLOPS | 26 TFLOPS |
| TF32 | 494 TFLOPS | 989 TFLOPS | 378 TFLOPS | 756 TFLOPS |
| FP32 (CUDA) | 67 TFLOPS | N/A | 51 TFLOPS | N/A |
| FP16 | 989 TFLOPS | 1,979 TFLOPS | 756 TFLOPS | 1,513 TFLOPS |
| BF16 | 989 TFLOPS | 1,979 TFLOPS | 756 TFLOPS | 1,513 TFLOPS |
| FP8 | 1,979 TFLOPS | 3,958 TFLOPS | 1,513 TFLOPS | 3,026 TFLOPS |
| INT8 | 1,979 TOPS | 3,958 TOPS | 1,513 TOPS | 3,026 TOPS |

FP32 CUDA refers to the scalar floating-point path through CUDA cores, not Tensor Cores. It is relevant for non-AI workloads (simulation, rendering) but essentially unused in LLM training and inference.

Memory: HBM3 vs HBM2e, 80 GB vs 94 GB

The H100 SXM5 uses HBM3 at 3,350 GB/s; the NVL uses HBM3 at 3,900 GB/s per GPU. The H100 PCIe uses HBM2e at 2,000 GB/s. The NVL variant carries 94 GB per GPU instead of the standard 80 GB, giving a 2-GPU pair 188 GB total.

For LLM inference, memory bandwidth matters more than raw TFLOPS. Inference on large models is memory-bound: the GPU loads weight tensors from HBM for every token generated. The compute engines are fast enough that they idle, waiting for weights to arrive from HBM. This is the roofline model constraint: if arithmetic intensity (FLOP per byte loaded) is below the hardware's peak FLOP/byte ratio, you are bandwidth-limited.
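The roofline condition can be checked numerically. The sketch below uses the SXM5 figures from this post (989 dense FP16 TFLOPS, 3,350 GB/s); the function names are hypothetical, and the arithmetic intensity of about 1 FLOP per byte for batch-size-1 FP16 decode is a common approximation (2 FLOPs per weight, 2 bytes per weight), not a measured value.

```python
def ridge_point_flop_per_byte(peak_tflops, bandwidth_gb_s):
    """Arithmetic intensity at which compute and memory limits meet."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def attainable_tflops(intensity_flop_per_byte, peak_tflops, bandwidth_gb_s):
    """Roofline: the lower of the compute ceiling and the bandwidth ceiling."""
    bandwidth_bound_tflops = intensity_flop_per_byte * bandwidth_gb_s * 1e9 / 1e12
    return min(peak_tflops, bandwidth_bound_tflops)


H100_SXM5_FP16_DENSE_TFLOPS = 989
H100_SXM5_BANDWIDTH_GB_S = 3350

ridge = ridge_point_flop_per_byte(H100_SXM5_FP16_DENSE_TFLOPS, H100_SXM5_BANDWIDTH_GB_S)
print(f"ridge point: {ridge:.0f} FLOP/byte")

# Assumed intensity for batch-size-1 FP16 decode: ~1 FLOP per byte of weights.
decode = attainable_tflops(1.0, H100_SXM5_FP16_DENSE_TFLOPS, H100_SXM5_BANDWIDTH_GB_S)
print(f"attainable at 1 FLOP/byte: {decode:.2f} TFLOPS")
```

With these inputs the ridge point lands near 295 FLOP/byte, so a workload at roughly 1 FLOP/byte can use only a few TFLOPS of the 989 available. Larger batches raise the intensity and move the workload toward the compute ceiling.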

For a 70B FP16 model, loading one full forward pass requires reading ~140 GB of weights. At 3,350 GB/s, that takes roughly 42 ms per token in the memory-bound regime. Faster HBM directly reduces that floor.
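The floor quoted above is a single division, sketched here so other weight sizes and bandwidths can be plugged in. The function name is hypothetical, and the estimate deliberately ignores KV cache reads, activations, and kernel overheads, so real latency sits above it.

```python
def weight_read_floor_ms(weight_gb, bandwidth_gb_s):
    """Minimum time to stream all weights from HBM once, in milliseconds."""
    return weight_gb / bandwidth_gb_s * 1000.0


# 70B parameters at 2 bytes each (FP16) and at 1 byte each (FP8), on 3,350 GB/s.
for label, weight_gb in (("FP16, ~140 GB", 140), ("FP8, ~70 GB", 70)):
    print(f"{label}: {weight_read_floor_ms(weight_gb, 3350):.1f} ms")
```

Halving the bytes per parameter halves the floor, which is why FP8 helps memory-bound decode even when the Tensor Cores are far from saturated.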

The NVL's 94 GB matters for a specific case: the FP16 weights of Llama 3 70B alone occupy ~140 GB. That does not fit on a single SXM5 (80 GB), so serving it requires 2-way tensor parallelism, and two SXM5 GPUs (160 GB total) leave only ~20 GB for KV cache. An NVL pair (94 GB per GPU, 188 GB total over the NVLink bridge) holds the same model as a 2-GPU unit with ~48 GB left for KV cache. The pair is still two devices, so the serving stack shards the model across them. For a deeper look at how the three variants handle different workload shapes, see the full H100 form factor guide.

Form Factors: SXM5, PCIe, NVL

| Form Factor | TDP | VRAM | Inter-GPU | Best For |
| --- | --- | --- | --- | --- |
| H100 SXM5 | 700 W | 80 GB HBM3 | NVSwitch 900 GB/s | Multi-GPU training, FSDP, large-scale inference |
| H100 PCIe | 350 W | 80 GB HBM2e | PCIe Gen5 128 GB/s | Single-node fine-tuning, cost-sensitive inference |
| H100 NVL | 350-400 W per GPU | 94 GB HBM3 | NVLink bridge 600 GB/s | 70B+ inference, long context, large KV cache |

The SXM5 form factor requires a specialized HGX board and is only available in server configurations built around it. The PCIe variant slots into standard dual-slot PCIe Gen5 servers. The NVL module is PCIe form factor with an additional NVLink bridge connecting the two GPUs on the same module.

For detailed decision logic on which variant fits which workload, see the H100 form factor breakdown.

NVLink 4.0 on the H100 SXM5 uses 18 bidirectional links, each delivering 50 GB/s, for a total of 900 GB/s bidirectional between any two GPUs. Compare this to PCIe Gen5 x16, which delivers 128 GB/s bidirectional. NVLink is 7x faster for GPU-to-GPU transfers.
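The bandwidth figures in this paragraph follow from simple arithmetic, sketched below. The constant and function names are hypothetical, and the transfer times are idealized peak-bandwidth estimates with no latency or protocol overhead, so treat them as lower bounds rather than measurements.

```python
NVLINK_LINKS = 18
NVLINK_GB_S_PER_LINK = 50      # bidirectional, per link
PCIE_GEN5_X16_GB_S = 128       # bidirectional


def aggregate_bandwidth_gb_s(links, gb_s_per_link):
    """Total bandwidth when all links are used in parallel."""
    return links * gb_s_per_link


def ideal_transfer_time_s(size_gb, bandwidth_gb_s):
    """Time to move size_gb at peak bandwidth, ignoring all overheads."""
    return size_gb / bandwidth_gb_s


nvlink_total = aggregate_bandwidth_gb_s(NVLINK_LINKS, NVLINK_GB_S_PER_LINK)
print(f"NVLink aggregate: {nvlink_total} GB/s")
print(f"NVLink vs PCIe Gen5 x16: {nvlink_total / PCIE_GEN5_X16_GB_S:.2f}x")
print(f"10 GB over NVLink: {ideal_transfer_time_s(10, nvlink_total) * 1000:.1f} ms")
print(f"10 GB over PCIe:   {ideal_transfer_time_s(10, PCIE_GEN5_X16_GB_S) * 1000:.1f} ms")
```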

Inside an HGX H100 8-GPU server, NVSwitch provides a non-blocking all-to-all fabric. Every GPU can communicate with every other GPU simultaneously at full NVLink bandwidth, with no contention. This is what makes large-model distributed training practical on a single node.

The practical gap shows up in all-reduce operations during FSDP training on a 70B model: an all-reduce across 8 GPUs at NVLink bandwidth takes roughly 2 seconds for a 100 GB gradient sync. The same operation over PCIe Gen5 takes approximately 14 seconds. That difference compounds over thousands of training steps.

The NVL module uses a point-to-point NVLink bridge at 600 GB/s between the two paired GPUs on the same card. It does not use NVSwitch and cannot scale beyond the pair, which is why SXM5 is the right choice for any multi-GPU training setup beyond 2 GPUs. For a direct comparison of NVLink 4.0 vs NVLink 3.0 on A100, see the A100 vs H100 deep dive.

MIG: Up to 7 Instances Per GPU

Multi-Instance GPU (MIG) partitions the H100 into up to 7 isolated instances. Each instance gets a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM. Unlike vGPU time-slicing, MIG gives hardware-enforced isolation: one tenant's workload cannot access another's memory or affect their compute bandwidth.

H100 SXM5 MIG profiles:

| MIG Profile | VRAM | CUDA Cores | Use Case |
| --- | --- | --- | --- |
| 1g.10gb | 10 GB | 2,048 | Dev/test, small model inference |
| 2g.20gb | 20 GB | 4,096 | 7B model inference |
| 3g.40gb | 40 GB | 6,144 | 13B-20B inference |
| 4g.40gb | 40 GB | 8,192 | Larger batch or 30B models |
| 7g.80gb | 80 GB | 16,384 | Full GPU single tenant |

The most common production pattern is 7x 1g.10gb for multi-tenant inference of small models, or 2x 3g.40gb for two simultaneous 13B model servers. MIG instances appear as separate GPU devices to the OS, so each can run its own container independently. For the hands-on setup walkthrough, see the MIG setup guide.
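A quick way to sanity-check a planned partition is to add up the slices encoded in the profile names. The sketch below is a simplified budget check with hypothetical function names: it assumes 7 compute slices and 80 GB per GPU, read directly off the profile names above, and it does not model the driver's placement rules, so a layout that passes here can still be rejected by the real tooling.

```python
import re

COMPUTE_SLICES_PER_GPU = 7   # from the "7g" in the full-GPU profile
MEMORY_GB_PER_GPU = 80


def parse_profile(name):
    """Split a profile name like '3g.40gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def fits_on_one_gpu(profiles):
    """Necessary (not sufficient) check: totals stay within the GPU budget."""
    parsed = [parse_profile(p) for p in profiles]
    total_compute = sum(compute for compute, _ in parsed)
    total_memory = sum(memory for _, memory in parsed)
    return total_compute <= COMPUTE_SLICES_PER_GPU and total_memory <= MEMORY_GB_PER_GPU


print(fits_on_one_gpu(["1g.10gb"] * 7))          # seven small instances
print(fits_on_one_gpu(["3g.40gb", "3g.40gb"]))   # two mid-size instances
print(fits_on_one_gpu(["4g.40gb", "4g.40gb"]))   # needs 8 compute slices
```

The first two layouts match the production patterns described above and stay within budget; the third asks for more compute slices than the GPU has.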

Confidential Computing

The H100 introduces hardware-enforced Confidential Computing via a trusted execution environment (TEE) built into the GH100 die. Encryption is AES-256, executed on-die, with no performance overhead on compute operations. Memory transfers between the GPU and CPU are encrypted over the NVLink or PCIe fabric.

The TEE provides hardware-enforced isolation between VMs sharing the same physical GPU. A hypervisor or cloud operator cannot access a tenant's data in memory or in transit. Attestation is handled through NVIDIA's Hopper Attestation SDK, which allows software to cryptographically verify it is running on a genuine, unmodified H100 TEE before submitting sensitive data.

Use cases include healthcare organizations deploying LLMs on patient records, financial institutions running inference on proprietary trading data, and any regulated workload where multi-tenant cloud infrastructure creates compliance concerns.

H100 vs H200, A100, B200: Spec Deltas

H100 SXM5 vs A100 SXM4

| Metric | A100 SXM4 | H100 SXM5 | Delta |
| --- | --- | --- | --- |
| Process | TSMC 7nm | TSMC 4N | Newer node |
| Transistors | 54B | 80B | +48% |
| FP16 TFLOPS (no sparsity) | 312 | 989 | 3.2x |
| FP8 TFLOPS (no sparsity) | N/A | 1,979 | New capability |
| Memory Bandwidth | 2,039 GB/s | 3,350 GB/s | 1.64x |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | Same capacity |
| TDP | 400 W | 700 W | 1.75x |
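The deltas in this table can be recomputed from the raw figures, and the same numbers give a rough performance-per-watt comparison. This is a small sketch with hypothetical dictionary keys; it uses peak datasheet values only, so the per-watt figure is a ceiling-to-ceiling ratio rather than a measured efficiency.

```python
A100_SXM4 = {"transistors_b": 54, "fp16_dense_tflops": 312,
             "mem_bandwidth_gb_s": 2039, "tdp_w": 400}
H100_SXM5 = {"transistors_b": 80, "fp16_dense_tflops": 989,
             "mem_bandwidth_gb_s": 3350, "tdp_w": 700}


def ratios(old, new):
    """Ratio new/old for every metric the two spec dictionaries share."""
    return {key: new[key] / old[key] for key in old if key in new}


deltas = ratios(A100_SXM4, H100_SXM5)
for metric, ratio in deltas.items():
    print(f"{metric}: {ratio:.2f}x")

# Peak FP16 throughput gained per unit of additional power budget.
fp16_per_watt_gain = deltas["fp16_dense_tflops"] / deltas["tdp_w"]
print(f"peak FP16 per watt: {fp16_per_watt_gain:.2f}x")
```

On these figures the H100 draws 1.75x the power for about 3.2x the dense FP16 peak, so peak FP16 per watt improves by roughly 1.8x.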

For a full A100 vs H100 comparison including training benchmarks and cost-per-token math, see the full A100 vs H100 benchmarks.

H100 SXM5 vs H200 SXM

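| Metric | H100 SXM5 | H200 SXM | Delta |
| --- | --- | --- | --- |
| VRAM | 80 GB HBM3 | 141 GB HBM3e | +76% |
| Memory Bandwidth | 3,350 GB/s | 4,800 GB/s | +43% |
| FP8 TFLOPS (w/ sparsity) | 3,958 | 3,958 | Equal |
| CUDA Cores | 16,896 | 16,896 | Equal |
| TDP | 700 W | 700 W | Equal |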

The compute engines are identical. The H200's advantage is memory: 43% more bandwidth and 76% more capacity. For memory-bound inference on 70B+ models, this translates to 37-45% higher token throughput. For a full performance breakdown, see the H100 vs H200 inference comparison.

H100 SXM5 vs B200

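| Metric | H100 SXM5 | B200 | Delta |
| --- | --- | --- | --- |
| Architecture | Hopper | Blackwell | Next-gen |
| VRAM | 80 GB HBM3 | 192 GB HBM3e | 2.4x |
| Memory Bandwidth | 3,350 GB/s | 8,000 GB/s | 2.4x |
| FP8 TFLOPS (w/ sparsity) | 3,958 | 9,000+ | ~2.3x |
| TDP | 700 W | 1,000 W | +43% |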

The B200 is a full architectural generation ahead. For teams that can't wait for Blackwell availability, H100 remains the most accessible high-performance GPU for production inference. For a full Blackwell deep-dive, see the B200 complete guide.

Real-World Inference: How H100 Specs Translate to Tokens/Second

These estimates are based on roofline analysis and community benchmarks from vLLM and TensorRT-LLM. Actual throughput depends on batch size, sequence length, framework version, and quantization implementation. Numbers can shift 10-30% in practice.

Llama 3 70B (FP16, 2x H100 SXM5)

  • Model size in VRAM: ~140 GB (doesn't fit on a single 80 GB GPU)
  • Requires 2-way tensor parallelism across 2x H100 SXM5
  • On a 2x H100 SXM5 NVLink setup: roughly 3,500-4,500 tokens/sec at batch size 32
  • Memory bottleneck: 2x 3,350 GB/s = 6,700 GB/s combined for the model weights

Llama 3 70B (FP8, single H100 SXM5)

  • FP8 weights: ~70 GB (fits on a single H100 with KV cache room)
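  • vLLM FP8 (--quantization fp8): roughly 1,800-2,200 tokens/sec at batch size 16
  • Memory bottleneck math: 3,350 GB/s bandwidth and ~70 GB of FP8 weights mean ~21 ms minimum per decode step in the pure memory-bound regime (the weights are read once per step, however many sequences are in the batch)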

Mixtral 8x7B (FP8 on a single H100 SXM5; FP16 needs 2 GPUs)

  • Active parameters per forward pass: ~12B (2 of 8 experts per token)
  • Total weights: ~93 GB in FP16, which requires 2 GPUs with tensor parallelism; FP8 halves that to ~47 GB, which fits on a single 80 GB GPU
  • With FP8 quantization: expected decode at batch size 32 is roughly 4,000-6,000 tokens/sec

DeepSeek-R1 671B (multi-GPU tensor parallelism)

  • Requires at least a full 8-GPU NVSwitch node
  • FP8 weights: ~671 GB total (671B params × 1 byte per parameter), or ~84 GB per GPU across 8 GPUs. That exceeds the 80 GB HBM3 limit, so naive 8-way tensor parallelism cannot hold the weights alone. Practitioners typically run DeepSeek-R1 671B on 16× H100 SXM5 (two 8-GPU nodes, ~42 GB per GPU in FP8) or apply INT4/GPTQ quantization to bring per-GPU weight size below 80 GB
  • On 16× H100 SXM5 in FP8 (~42 GB per GPU, weights fit cleanly): roughly 2,000-3,200 tokens/sec at batch size 32
  • On 8× H100 SXM5 with INT4/GPTQ (671B × 0.5 bytes ≈ 336 GB, or ~42 GB per GPU): roughly 900-1,400 tokens/sec at batch size 32. This is lower than FP8 due to dequantization overhead, but feasible on a single 8-GPU node
  • NVSwitch's 900 GB/s all-to-all bandwidth applies only within a single 8-GPU HGX node; for multi-node runs, InfiniBand HDR/NDR (~25-50 GB/s per port) carries cross-node traffic and becomes the dominant bottleneck for cross-node tensor parallelism
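The per-GPU weight sizes in the bullets above come from one formula: parameters times bytes per parameter, divided by the number of GPUs. The sketch below reproduces them; the function names are hypothetical, and the check covers weights only, so KV cache, activations, and framework overhead still need headroom on top.

```python
H100_HBM_GB = 80.0


def per_gpu_weight_gb(params_billion, bytes_per_param, num_gpus):
    """Weight footprint per GPU under an even split across num_gpus."""
    return params_billion * bytes_per_param / num_gpus


def weights_fit(params_billion, bytes_per_param, num_gpus, hbm_gb=H100_HBM_GB):
    """True if the evenly split weights alone fit in each GPU's HBM."""
    return per_gpu_weight_gb(params_billion, bytes_per_param, num_gpus) <= hbm_gb


configs = [
    ("DeepSeek-R1 671B, FP8, 8 GPUs", 671, 1.0, 8),
    ("DeepSeek-R1 671B, FP8, 16 GPUs", 671, 1.0, 16),
    ("DeepSeek-R1 671B, INT4, 8 GPUs", 671, 0.5, 8),
    ("Llama 3 70B, FP16, 2 GPUs", 70, 2.0, 2),
]
for label, params_b, bytes_pp, gpus in configs:
    size = per_gpu_weight_gb(params_b, bytes_pp, gpus)
    verdict = "fits" if weights_fit(params_b, bytes_pp, gpus) else "does not fit"
    print(f"{label}: {size:.1f} GB per GPU, {verdict} in {H100_HBM_GB:.0f} GB")
```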

These figures assume no KV cache bottleneck. At long context lengths, the KV cache grows and its reads add to the bytes moved per decode step, reducing throughput below these estimates. For a full guide on tuning inference for these models, see the LLM inference optimization guide.

Rent H100 on Spheron: Pricing

Live pricing from the Spheron marketplace as of 20 May 2026:

| SKU | On-Demand ($/hr) | Spot ($/hr) |
| --- | --- | --- |
| H100 SXM5 | $2.64 | $1.66 |
| H100 PCIe | $2.09 | N/A |
| H100 NVL | $2.15 | N/A |

Pricing fluctuates based on GPU availability. The prices above are based on 20 May 2026 and may have changed. Check current GPU pricing for live rates.
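Hourly prices become easier to compare once converted to a cost per million generated tokens. The sketch below does that conversion with a hypothetical function name. It takes this post's rough single-GPU estimate for Llama 3 70B in FP8 (1,800-2,200 tokens/sec) at face value and assumes the GPU is fully utilized for the whole hour, so real costs will be higher at lower utilization.

```python
def usd_per_million_tokens(price_per_hour_usd, tokens_per_second):
    """Cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour_usd / tokens_per_hour * 1_000_000


ON_DEMAND_SXM5 = 2.64   # $/hr, from the pricing table above
SPOT_SXM5 = 1.66        # $/hr, from the pricing table above

for tokens_per_second in (1800, 2000, 2200):
    on_demand = usd_per_million_tokens(ON_DEMAND_SXM5, tokens_per_second)
    spot = usd_per_million_tokens(SPOT_SXM5, tokens_per_second)
    print(f"{tokens_per_second} tok/s: on-demand ${on_demand:.3f}, spot ${spot:.3f} per 1M tokens")
```

Because cost scales inversely with throughput, any change in batch size or quantization that shifts tokens/sec moves the per-token cost by the same factor.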

To deploy H100 instances now, visit the H100 rental page on Spheron or go directly to app.spheron.ai to provision in minutes.


Running inference on Llama 70B or a fine-tuning job that needs Hopper-class memory bandwidth? Spheron provides on-demand and spot H100 SXM5 instances from 5+ providers, billed per minute.



Frequently Asked Questions

How much memory does the H100 have?

The H100 SXM5 and PCIe both ship with 80 GB of HBM memory. The SXM5 uses HBM3 at 3,350 GB/s bandwidth; the PCIe uses HBM2e at 2,000 GB/s. The H100 NVL module is an exception: it carries 94 GB HBM3 per GPU at 3,900 GB/s bandwidth, for a total of 188 GB across its two-GPU bridge pair.

What is the H100's FP8 throughput?

The H100 SXM5 delivers 3,958 TFLOPS in FP8 (with sparsity). Without sparsity the figure is 1,979 TFLOPS. The PCIe variant delivers 3,026 TFLOPS FP8 with sparsity. FP8 throughput is enabled by the fourth-generation Tensor Cores and the Hopper Transformer Engine.

Is the H100 better than the A100?

Yes, for most AI workloads. The H100 delivers about 3.2x the dense FP16 compute of the A100 (989 vs 312 TFLOPS), roughly 6x the A100's peak FP16 throughput when FP8 is used (1,979 vs 312 TFLOPS), and 1.64x higher memory bandwidth. The A100's main advantage is cost: it typically rents for less per hour and still handles 70B-class inference and fine-tuning well when FP8 is not required.

What is the H100's TDP?

The H100 SXM5 has a TDP of 700 W. The H100 PCIe has a TDP of 350 W. The H100 NVL is configurable at 350-400 W per GPU. Always confirm power budgets with your data center or cloud provider before deployment.

How many MIG instances does the H100 support?

Up to 7 MIG instances per GPU. Each instance gets a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM memory. On the H100 SXM5 the full 7-way partition gives each instance approximately 10 GB of HBM3 and 2,048 CUDA cores.

How many transistors does the H100 have?

The H100 GH100 die contains approximately 80 billion transistors, fabricated on TSMC's 4N (4 nm class) custom process node. The full die is 814 mm² and contains 144 streaming multiprocessors, though commercial H100 SKUs enable 132 SMs (SXM5, NVL) or 114 SMs (PCIe).
