What is FP8 Quantization? AI Inference Performance, Accuracy, and Hardware Support Explained (2026)

Running Llama 3.1 70B on H100 SXM5, FP8 inference costs roughly $0.71 per million tokens on-demand, compared to $1.19/M at BF16, while delivering 1.4-1.8x more throughput at large batch sizes. The two levers: FP8 nearly doubles Tensor Core throughput and cuts KV cache memory per token in half, which opens up room for larger batch sizes. If you want the hands-on API setup rather than the conceptual breakdown, start with the Transformer Engine setup guide.

What FP8 Quantization Is

FP8 quantization stores model weights and activations in 8-bit floating-point format instead of the 16-bit or 32-bit formats used during training. The key distinction from INT8 quantization is that FP8 retains a floating-point exponent. Every FP8 value follows the same structure as larger floats: 1 sign bit, some exponent bits, and some mantissa bits totaling 8 bits. The exponent gives FP8 a dynamic range that spans many orders of magnitude, unlike INT8's fixed-range integers. That dynamic range is exactly what transformer models need, since activation values can vary by factors of 1,000 or more across layers. INT8 handles this badly without per-layer calibration and outlier clipping. FP8 handles it naturally.

The Two FP8 Formats: E4M3 and E5M2

Two FP8 formats are in active use. They allocate bits differently, giving each a different precision and dynamic range profile. Transformer Engine's DelayedScaling recipe, in its hybrid format, assigns them per pass: E4M3 for the forward pass (weights and activations), E5M2 for the backward pass (gradients, which span far wider ranges than activations).

Format	Sign bits	Exponent bits	Mantissa bits	Max value	Dynamic range (max ÷ min normal)	Typical use
E4M3	1	4	3	448	~2.9 × 10^4	Forward-pass weights and activations
E5M2	1	5	2	57,344	~9.4 × 10^8	Backward-pass gradients during training

E4M3 packs more precision into its 8 bits at the cost of a narrower range. For weights and activations, that tradeoff is correct: values cluster tightly and precision matters more than range. E5M2 flips this for gradients, where range matters far more than fine precision.
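The maximum values in the table can be reproduced by decoding the bit patterns directly. The sketch below is a minimal decoder, assuming the common OCP-style encodings (exponent bias 7 for E4M3, bias 15 for E5M2); the helper name `decode_fp8` is hypothetical, and NaN/Inf handling is deliberately left out.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 bit pattern (special NaN/Inf encodings are not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


# Largest finite patterns (assumption: OCP-style encodings).
# E4M3 reserves only exponent=1111, mantissa=111 for NaN, so mantissa 110 is the max.
e4m3_max = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7)
# E5M2 reserves exponent=11111 for Inf/NaN, so exponent 11110 with mantissa 11 is the max.
e5m2_max = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15)

# Smallest normal values: exponent field = 1, mantissa = 0.
e4m3_min_normal = decode_fp8(0b0_0001_000, exp_bits=4, man_bits=3, bias=7)
e5m2_min_normal = decode_fp8(0b0_00001_00, exp_bits=5, man_bits=2, bias=15)

# Spacing between neighbouring values just above 1.0 shows the precision gap.
e4m3_step_at_one = decode_fp8(0b0_0111_001, 4, 3, 7) - 1.0
e5m2_step_at_one = decode_fp8(0b0_01111_01, 5, 2, 15) - 1.0

print(f"E4M3 max={e4m3_max}, min normal={e4m3_min_normal}, step at 1.0={e4m3_step_at_one}")
print(f"E5M2 max={e5m2_max}, min normal={e5m2_min_normal}, step at 1.0={e5m2_step_at_one}")
```

The extra mantissa bit gives E4M3 half the step size of E5M2 near 1.0, while the extra exponent bit lets E5M2 reach much larger and much smaller magnitudes.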

FP8 vs FP16 vs BF16 vs INT8: Comparison Table

This is the decision table most practitioners need before choosing a precision format for their inference setup.

Format	Bit width	Bytes/value	Throughput (relative)	Memory footprint	Dynamic range	GPU support	Best use case
FP32	32-bit	4 bytes	1× (baseline)	4× FP8	Full	All GPUs	Master weights, optimizer states
BF16	16-bit	2 bytes	~2× FP32	2× FP8	Wide (8-bit exp)	Ampere+	Default training precision
FP16	16-bit	2 bytes	~2× FP32	2× FP8	Moderate	All modern GPUs	Legacy training, some inference
FP8 E4M3	8-bit	1 byte	~4× FP32, ~2× BF16	1×	Moderate	Hopper, Blackwell	Forward-pass inference, training
FP8 E5M2	8-bit	1 byte	~4× FP32, ~2× BF16	1×	Wide	Hopper, Blackwell	Gradient computation (training)
INT8	8-bit	1 byte	~4× FP32, ~2× BF16	1×	Fixed range	All modern GPUs	Inference only, with calibration

The throughput figures above reflect Tensor Core TFLOPS ceilings published by NVIDIA, not guaranteed end-to-end serving throughput. Real inference throughput depends on batch size, sequence length, KV cache pressure, and memory bandwidth. For large models at small batch sizes, memory bandwidth is typically the bottleneck, so the actual throughput gain from FP8 vs BF16 in serving is often in the 1.4-1.8x range rather than a clean 2x.

Why FP8 Matters for AI Inference

Throughput. The H100 SXM5 on Spheron is rated at 1,979 dense FP8 TFLOPS vs 989 dense BF16 TFLOPS (NVIDIA spec sheet). In real vLLM serving benchmarks for Llama 3.1 70B, switching from BF16 to FP8 improves tokens per second by 1.4-1.8x at large batch sizes. At small batch sizes (single-digit concurrency), the gain is smaller because memory bandwidth, not compute, is the bottleneck.

KV cache memory. FP8 KV cache cuts per-token KV memory in half compared to BF16. For a 70B model with 128K context, the KV cache alone can consume over 40 GB at BF16 on a single H100 80 GB. Switching to FP8 KV cache frees roughly 20 GB, which either allows longer context within the same VRAM budget or opens up room for larger batch sizes.
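The KV cache figures can be checked with a few lines of arithmetic. This sketch assumes the Llama 3.1 70B attention shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), which the text above does not state; swap in your own model's values. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed Llama 3.1 70B shape (not taken from the article).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)
CONTEXT = 128 * 1024  # 128K tokens

bf16 = kv_cache_bytes(CONTEXT, bytes_per_value=2, **SHAPE)
fp8 = kv_cache_bytes(CONTEXT, bytes_per_value=1, **SHAPE)

print(f"BF16 KV cache: {bf16 / 1e9:.1f} GB")
print(f"FP8 KV cache:  {fp8 / 1e9:.1f} GB")
print(f"Freed by FP8:  {(bf16 - fp8) / 1e9:.1f} GB")
```

Under these assumptions a single 128K-token sequence needs about 43 GB of KV cache at BF16 and about half that at FP8, in line with the figures quoted above.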

Batch size and cost. Larger batch sizes lower cost per token directly. The formula: cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000. Using live H100 SXM5 pricing: at $1.63/hr spot ($3.84/hr on-demand) and a 70B FP8 model running at ~1,500 tokens/sec, the cost works out to roughly $0.30 per million tokens on spot ($0.71/M on-demand). Compare that to a BF16 run at ~900 tokens/sec on the same hardware: $0.50/M spot ($1.19/M on-demand). The FP8 advantage compounds further at higher batch sizes where throughput scales better.
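The cost formula is simple enough to keep as a helper. The sketch below plugs in the prices and throughput figures quoted above; the function name is hypothetical, and the throughput numbers are the article's estimates rather than measurements you should expect to reproduce exactly.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1,000,000"""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


SPOT, ON_DEMAND = 1.63, 3.84  # H100 SXM5 $/hr, as quoted above

for label, tokens_per_sec in [("FP8", 1500), ("BF16", 900)]:
    spot = cost_per_million_tokens(SPOT, tokens_per_sec)
    on_demand = cost_per_million_tokens(ON_DEMAND, tokens_per_sec)
    print(f"{label}: ${spot:.2f}/M spot, ${on_demand:.2f}/M on-demand")
```

Because the hourly price is fixed, cost per token scales inversely with throughput: the roughly 1.67x throughput gain from 900 to 1,500 tokens/sec is exactly the ratio between the BF16 and FP8 costs.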

FP8 Training vs FP8 Inference

These are different use cases that share the "FP8" label but have different complexity profiles.

Inference (PTQ). Post-training quantization: the model is already trained, and its weights are converted to FP8 after the fact. Major serving frameworks handle the scaling automatically. For vLLM, add --quantization fp8 to your serve command; dynamic per-tensor quantization needs no separate calibration step. For TensorRT-LLM, build an engine from an FP8-quantized checkpoint with trtllm-build. Accuracy loss for most production models is negligible. This is the simple path.

Training (mixed precision). This requires NVIDIA Transformer Engine. The forward pass runs in FP8 E4M3, backward gradients in FP8 E5M2 or BF16, and master weights plus optimizer states stay in BF16 or FP32. Transformer Engine handles per-tensor dynamic scaling (delayed scaling or current scaling) automatically. The setup is more involved, but training throughput improves by 1.3-1.7x on H100 SXM5. See the full Transformer Engine setup tutorial for the installation, recipe API, and benchmark data.
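To make the difference between delayed and current scaling concrete, here is a toy pure-Python model of the two strategies. It is not Transformer Engine's implementation: the class `DelayedScaler`, the history length of 4, and the example values are all assumptions chosen for illustration.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


def amax(tensor):
    """Absolute maximum of a flat list of floats."""
    return max(abs(x) for x in tensor)


class DelayedScaler:
    """Toy delayed scaling: the scale is derived from amax values of earlier steps."""

    def __init__(self, history_len: int = 4, fp8_max: float = E4M3_MAX):
        self.history = deque(maxlen=history_len)
        self.fp8_max = fp8_max

    def scale(self) -> float:
        peak = max(self.history, default=0.0)
        # With no usable history yet, fall back to a neutral scale.
        return self.fp8_max / peak if peak > 0 else 1.0

    def update(self, tensor) -> None:
        # Record this step's amax so later steps can use it.
        self.history.append(amax(tensor))


def current_scale(tensor, fp8_max: float = E4M3_MAX) -> float:
    """Toy current scaling: the scale is derived from the tensor being quantized."""
    peak = amax(tensor)
    return fp8_max / peak if peak > 0 else 1.0


scaler = DelayedScaler(history_len=4)
steps = [[0.5, -2.0], [1.0, -4.0], [0.25, 3.0]]
for step, tensor in enumerate(steps):
    delayed = scaler.scale()         # known before the tensor is inspected
    current = current_scale(tensor)  # needs an extra pass over the tensor
    print(f"step {step}: delayed scale={delayed:.2f}, current scale={current:.2f}")
    scaler.update(tensor)
```

In this toy run the delayed scale at step 1 is still based on the smaller step 0 values, so the new peak of 4.0 would saturate. That lag is the price of not needing an extra reduction over the tensor before quantizing it.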

Hardware Support for FP8

GPU	Architecture	FP8 support	FP8 TFLOPS (dense)	Notes
H100 SXM5	Hopper	Full	1,979	First-gen Transformer Engine; dedicated FP8 Tensor Cores
H200 SXM5	Hopper	Full	1,979	Same compute as H100; 141 GB HBM3e for larger models
GH200	Hopper	Full	1,979	H100 compute + Grace CPU on NVLink-C2C
B200 SXM6	Blackwell	Full	4,500	Second-gen TE; also supports FP4
B300 (Blackwell Ultra)	Blackwell	Full	~5,000	Dense FP8; ~10,000 TFLOPS sparse; also supports FP4
RTX 5090	Blackwell	Full	~838	Consumer Blackwell; FP8 + FP4
RTX PRO 6000	Blackwell	Full	~1,457	Workstation Blackwell; 96 GB GDDR7
L40S	Ada Lovelace	Partial	~733	No per-tensor scaling HW; inference only
RTX 4090	Ada Lovelace	Partial	~660	Dense FP8; Transformer Engine falls back to BF16 silently
A100	Ampere	No	N/A	INT8 is the max hardware-accelerated format
AMD MI300X	CDNA3	Full (inference)	~2,615	hipBLASLt FP8 gemm ops; no full TE equivalent
AMD MI355X	CDNA4	Full (inference)	~5,000	aiter attention kernels; FP8 + MXFP4

Hopper (H100/H200/GH200). First-generation Transformer Engine with dedicated FP8 Tensor Cores and hardware per-tensor scaling. The transformer_engine Python library is required for training. For inference, vLLM and TensorRT-LLM use TE internally. Rent an H100 or try H200 on Spheron for the extra memory bandwidth if your workload is HBM-bound. The Grace Hopper Superchip (GH200) pairs H100 compute with a Grace CPU over NVLink-C2C; see our NVIDIA GH200 guide for the full architecture breakdown.

Ada Lovelace (RTX 4090, L40S). CUDA-level FP8 ops exist, but there is no dedicated scaling hardware. Transformer Engine falls back to BF16 silently on Ada. These GPUs are suitable for FP8 inference with manual quantization (vLLM --quantization fp8 still works), but you will not see the same throughput gains as on Hopper. For inference-only deployments where cost matters more than raw throughput, Ada can still make sense.

Blackwell (B200, B300, RTX 5090, RTX PRO 6000). Second-generation Transformer Engine with higher underlying FP8 TFLOPS and improved scaling throughput. The same Python API works on Blackwell as on Hopper. Blackwell also supports FP4, which doubles throughput again beyond FP8. B200 instances on Spheron give you both FP8 and FP4 options.

AMD MI300X/MI355X. FP8 inference works via hipBLASLt FP8 GEMM operations and the aiter attention kernel library. ROCm PyTorch supports FP8 ops. There is no direct AMD equivalent to Transformer Engine's Python API for training as of mid-2026, but inference pipelines (vLLM on ROCm, SGLang) support FP8 on MI300X. For MXFP4 microscaling on AMD hardware, see the MXFP4 microscaling on AMD hardware guide.

FP8 in Practice: vLLM, TensorRT-LLM, and SGLang

vLLM. The simplest path to FP8 inference. Dynamic per-tensor quantization, no separate calibration step required:

bash

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8

The --quantization fp8 flag uses dynamic per-tensor quantization. For higher accuracy, load a pre-quantized FP8 checkpoint from HuggingFace using --quantization compressed-tensors, which applies static scales computed during calibration.
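Dynamic per-tensor quantization can be simulated in plain Python to see where the rounding error comes from. This is a rough sketch, not vLLM's kernel: it assumes E4M3 with a maximum of 448, round-to-nearest on a 3-bit mantissa, and saturation at the maximum; all function names are hypothetical.

```python
import math

E4M3_MAX = 448.0
MANTISSA_BITS = 3
MIN_NORMAL_EXP = -6  # smallest normal exponent for E4M3 (bias 7)


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)               # saturate instead of overflowing
    _, e = math.frexp(mag)                    # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)          # subnormals share the minimum exponent
    step = 2.0 ** (exp - MANTISSA_BITS)       # spacing between neighbours in this binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """One scale for the whole tensor, chosen so its largest value maps to E4M3_MAX."""
    peak = max(abs(v) for v in values)
    scale = E4M3_MAX / peak if peak > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    return [q / scale for q in quantized]


weights = [0.02, -1.5, 3.0, 0.7]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
for original, q, r in zip(weights, quantized, restored):
    print(f"{original:+.4f} -> fp8 {q:+8.3f} -> {r:+.4f}")
```

With three mantissa bits the relative rounding error for normal values stays within about 6%, which is why a single well-chosen scale per tensor is usually enough for weights whose values cluster tightly.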

If you need a single toolkit for FP8, INT4, and FP4 across multiple export targets, see the ModelOpt unified quantization guide.

TensorRT-LLM. Builds a compiled FP8 engine from a checkpoint:

bash

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --dtype bfloat16 \
  --strongly_typed \
  --use_fp8_context_fmha enable

TensorRT-LLM also supports loading pre-quantized FP8 checkpoints from NVIDIA's NGC model catalog, which gives you statically calibrated scales without running the quantization yourself.

SGLang. Similar syntax to vLLM:

bash

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8

On standard benchmarks, accuracy delta for FP8 vs BF16 is typically under 0.5% on MMLU for 70B models. For 7B models, expect 1-2% degradation, which is worth validating against your specific task.

Accuracy and Limitations

For most production workloads, FP8 vs BF16 accuracy differences are below measurement noise. The cases where FP8 degrades measurably:

Small models (7B and under). Each weight matters more per prediction. FP8 quantization errors compound more visibly at smaller parameter counts. Test your specific 7B model before committing to FP8 in production.

Models with large activation outliers. Older architectures like OPT and GPT-2-style models have activation distributions that stress FP8's dynamic range even with per-tensor scaling. Modern models (Llama 3, Qwen3, Mistral) are more quantization-friendly.

Tasks requiring numerical precision. Multi-step arithmetic, chemistry and biology prediction, and code generation that outputs exact numeric results are the categories where FP8 accuracy loss shows up most clearly on benchmarks.

Mitigation options. Use FP8 KV cache with BF16 weights, a partial approach that reduces memory pressure while keeping weight precision. Use TRT-LLM's per-layer fallback to selectively keep sensitive layers in BF16. Mix FP8 and BF16 via Transformer Engine's per-layer recipe. If MMLU or task-specific accuracy drops more than 1% in an A/B test, fall back to BF16 or evaluate AWQ INT4 as an alternative; its activation-aware calibration can recover accuracy on non-Hopper hardware. For deployments on non-Hopper hardware where FP8 acceleration is limited, GGUF is the more practical route to VRAM reduction.
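The A/B rule above can be written as a small gate in an evaluation pipeline. This sketch assumes the 1% threshold means percentage points of absolute accuracy (the text does not say whether it is absolute or relative), and the function name and example scores are made up for illustration.

```python
def choose_precision(bf16_accuracy: float, fp8_accuracy: float,
                     max_drop_points: float = 1.0) -> str:
    """Return "fp8" if the accuracy drop stays within budget, else "bf16".

    Accuracies are percentages (e.g. 79.3), and the drop is measured in
    percentage points, which is an assumption about the 1% rule.
    """
    drop = bf16_accuracy - fp8_accuracy
    return "fp8" if drop <= max_drop_points else "bf16"


# Hypothetical evaluation scores, not measured results.
print(choose_precision(bf16_accuracy=79.3, fp8_accuracy=79.0))  # small drop
print(choose_precision(bf16_accuracy=62.0, fp8_accuracy=60.4))  # larger drop
```

Run the gate on the task-specific evaluation that matters for your product, not only on MMLU, since the degradation categories listed above do not always show up on general benchmarks.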

FP8 on Spheron: Available GPUs

The following GPUs are available on Spheron. H100, H200, and B200 have native FP8 Tensor Core support; the A100 is listed for comparison and is limited to INT8. AMD GPUs are not currently in the Spheron catalog.

GPU	FP8 support	On-demand price	Spot price	Notes
H100 SXM5	Full	$3.84/hr	$1.63/hr	80 GB HBM3; first-gen TE
H200 SXM5	Full	$4.56/hr	$1.89/hr	141 GB HBM3e; more bandwidth
B200 SXM6	Full + FP4	$7.16/hr	$1.71/hr	192 GB HBM3e; second-gen TE; spot is dramatically below on-demand, strong value for FP8/FP4 workloads
A100 80G PCIe	INT8 only	$1.20/hr	$1.19/hr	No hardware FP8; use AWQ instead

Pricing fluctuates based on GPU availability. The prices above are as of 23 May 2026 and may have changed. Check current GPU pricing for live rates.

H100, H200, and B200 GPUs with native FP8 Tensor Core support are available on Spheron with per-minute billing and no contracts. Compare on-demand and spot rates across Hopper and Blackwell in one place.

Frequently Asked Questions

What is FP8 quantization?

FP8 quantization stores model weights and activations using 8-bit floating-point numbers instead of the 16-bit or 32-bit formats used during training. Unlike INT8 quantization (which uses fixed-range integers), FP8 retains a floating-point exponent, giving it a much wider dynamic range. This makes FP8 suitable for the wildly varying activation distributions in large transformer models without the calibration and clipping workarounds that INT8 typically requires.

What is the difference between E4M3 and E5M2 in FP8?

Both are 8-bit floating-point formats but with different allocations of exponent vs mantissa bits. E4M3 (4 exponent bits, 3 mantissa bits) has higher precision and a narrower dynamic range. It is the standard choice for forward-pass weights and activations. E5M2 (5 exponent bits, 2 mantissa bits) covers a wider dynamic range at lower precision and is used for backward-pass gradients during FP8 training, where values can span many orders of magnitude.

Which GPUs support FP8 acceleration?

NVIDIA Hopper GPUs (H100, H200, GH200) have dedicated FP8 Tensor Cores, exposed through the Transformer Engine library. NVIDIA Ada Lovelace GPUs (RTX 4090, L40S) have partial FP8 support through CUDA but lack the per-tensor scaling hardware, so you do not get the same throughput gains as on Hopper. Blackwell GPUs (B200, B300, RTX 5090, RTX PRO 6000) support both FP8 and FP4 with a second-generation Transformer Engine. AMD MI300X and MI355X support FP8 inference via hipBLASLt; AMD's FP8 support in ROCm is mature for inference but less complete than NVIDIA's for training as of mid-2026.

How does FP8 compare to INT8 for LLM inference?

FP8 and INT8 achieve similar memory savings (both halve memory vs FP16) but FP8 handles large transformer activation distributions more cleanly. INT8 inference (via bitsandbytes, llm.int8(), or TensorRT INT8) requires careful calibration and often clips outlier activations, which can hurt accuracy on large models. FP8's floating-point exponent naturally handles outliers without explicit clipping, making it easier to deploy with minimal accuracy loss. For most modern LLMs on Hopper hardware, FP8 is the preferred choice over INT8.

Does FP8 quantization hurt model accuracy?

For most production workloads (instruction following, chat, summarization, code generation), the accuracy gap between FP8 and BF16 is below measurement noise on standard benchmarks (MMLU, HellaSwag, GSM8K). Accuracy does degrade more noticeably on tasks requiring high numerical precision (multi-step math, scientific reasoning) and for smaller models at 7B parameters and below, where each bit of precision matters more. Always run your target evaluation before switching a production model to FP8.