What is FP8 Quantization? AI Inference Performance, Accuracy, and Hardware Support Explained (2026)

Written by Mitrasish, Co-founder · May 23, 2026

Running Llama 3.1 70B on H100 SXM5, FP8 inference costs roughly $0.71 per million tokens on-demand, compared to $1.19/M at BF16, while delivering 1.4-1.8x more throughput at large batch sizes. The two levers: FP8 nearly doubles Tensor Core throughput and cuts KV cache memory per token in half, which opens up room for larger batch sizes. If you want the hands-on API setup rather than the conceptual breakdown, start with the Transformer Engine setup guide.

What FP8 Quantization Is

FP8 quantization stores model weights and activations in 8-bit floating-point format instead of the 16-bit or 32-bit formats used during training. The key distinction from INT8 quantization is that FP8 retains a floating-point exponent. Every FP8 value follows the same structure as larger floats: 1 sign bit, some exponent bits, and some mantissa bits totaling 8 bits. The exponent gives FP8 a dynamic range that spans many orders of magnitude, unlike INT8's fixed-range integers. That dynamic range is exactly what transformer models need, since activation values can vary by factors of 1,000 or more across layers. INT8 handles this badly without per-layer calibration and outlier clipping. FP8 handles it naturally.

The Two FP8 Formats: E4M3 and E5M2

Two FP8 formats are in active use. They allocate bits differently, giving each a different precision and dynamic range profile. Transformer Engine's DelayedScaling recipe configures them per-layer: E4M3 for the forward pass (weights and activations), E5M2 for the backward pass (gradients, which span far wider ranges than activations).

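| Format | Sign bits | Exponent bits | Mantissa bits | Max value | Dynamic range (max ÷ min normal) | Typical use |
| --- | --- | --- | --- | --- | --- | --- |
| E4M3 | 1 | 4 | 3 | 448 | ~2.9 × 10^4 | Forward-pass weights and activations |
| E5M2 | 1 | 5 | 2 | 57,344 | ~9.4 × 10^8 | Backward-pass gradients during training |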

E4M3 packs more precision into its 8 bits at the cost of a narrower range. For weights and activations, that tradeoff is correct: values cluster tightly and precision matters more than range. E5M2 flips this for gradients, where range matters far more than fine precision.
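As a rough illustration of the two bit layouts, the sketch below decodes every 8-bit pattern under each format and reports the largest and smallest positive finite values. It assumes the OCP-style encodings used on NVIDIA hardware (E4M3 with bias 7, no infinities, and a single NaN mantissa pattern; E5M2 with bias 15 and IEEE-style inf/NaN). The helper names `decode_fp8` and `positive_finite` are illustrative and not part of any library.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, ieee_specials):
    """Decode one 8-bit pattern (0..255) laid out as sign | exponent | mantissa."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones_exp = (1 << exp_bits) - 1
    all_ones_man = (1 << man_bits) - 1

    if exp == all_ones_exp:
        if ieee_specials:
            # E5M2: all-ones exponent encodes infinity (mantissa 0) or NaN
            return sign * math.inf if man == 0 else math.nan
        if man == all_ones_man:
            # E4M3: only the all-ones pattern is NaN; there is no infinity
            return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def positive_finite(exp_bits, man_bits, ieee_specials):
    """Sorted list of every positive finite value the format can represent."""
    values = (decode_fp8(b, exp_bits, man_bits, ieee_specials) for b in range(256))
    return sorted(v for v in values if math.isfinite(v) and v > 0)

for name, args in [("E4M3", (4, 3, False)), ("E5M2", (5, 2, True))]:
    vals = positive_finite(*args)
    print(f"{name}: max={vals[-1]}, min subnormal={vals[0]}, positive values={len(vals)}")
```

The maxima come out to 448 for E4M3 and 57,344 for E5M2. The extra exponent bit in E5M2 buys range at the cost of one mantissa bit, which is the tradeoff described above.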

FP8 vs FP16 vs BF16 vs INT8: Comparison Table

This is the decision table most practitioners need before choosing a precision format for their inference setup.

| Format | Bit width | Bytes/value | Throughput (relative) | Memory footprint | Dynamic range | GPU support | Best use case |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 | 32-bit | 4 bytes | 1× (baseline) | 4× FP8 | Full | All GPUs | Master weights, optimizer states |
| BF16 | 16-bit | 2 bytes | ~2× FP32 | 2× FP8 | Wide (8-bit exp) | Ampere+ | Default training precision |
| FP16 | 16-bit | 2 bytes | ~2× FP32 | 2× FP8 | Moderate | All modern GPUs | Legacy training, some inference |
| FP8 E4M3 | 8-bit | 1 byte | ~4× FP32, ~2× BF16 | 1× | Moderate | Hopper, Blackwell | Forward-pass inference, training |
| FP8 E5M2 | 8-bit | 1 byte | ~4× FP32, ~2× BF16 | 1× | Wide | Hopper, Blackwell | Gradient computation (training) |
| INT8 | 8-bit | 1 byte | ~4× FP32, ~2× BF16 | 1× FP8 | Fixed range | All modern GPUs | Inference only, with calibration |

The throughput figures above reflect Tensor Core TFLOPS ceilings published by NVIDIA, not guaranteed end-to-end serving throughput. Real inference throughput depends on batch size, sequence length, KV cache pressure, and memory bandwidth. For large models at small batch sizes, memory bandwidth is typically the bottleneck, so the actual throughput gain from FP8 vs BF16 in serving is often in the 1.4-1.8x range rather than a clean 2x.

Why FP8 Matters for AI Inference

Throughput. An H100 SXM5 on Spheron is rated at 1,979 dense FP8 TFLOPS vs 989 dense BF16 TFLOPS (NVIDIA spec sheet). In real vLLM serving benchmarks for Llama 3.1 70B, switching from BF16 to FP8 improves tokens per second by 1.4-1.8x at large batch sizes. At small batch sizes (single-digit concurrency), the gain is smaller because memory bandwidth, not compute, is the bottleneck.

KV cache memory. FP8 KV cache cuts per-token KV memory in half compared to BF16. For a 70B model with 128K context, the KV cache alone can consume over 40 GB at BF16 on a single H100 80 GB. Switching to FP8 KV cache frees roughly 20 GB, which either allows longer context within the same VRAM budget or opens up room for larger batch sizes.
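The 40 GB and 20 GB figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the Llama 3.1 70B attention layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 128K context of 131,072 tokens. These values are not stated in this article, so substitute your own model's config; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Total KV cache size for one sequence.

    The factor of 2 accounts for storing one key and one value vector
    per KV head, per layer, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama 3.1 70B attention config (grouped-query attention)
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 128 * 1024

bf16 = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_value=2, **LLAMA_70B)
fp8 = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_value=1, **LLAMA_70B)

print(f"BF16 KV cache: {bf16 / 1e9:.1f} GB")
print(f"FP8 KV cache:  {fp8 / 1e9:.1f} GB")
print(f"Freed by FP8:  {(bf16 - fp8) / 1e9:.1f} GB")
```

Under these assumptions a single full-length sequence needs about 43 GB at BF16 and about 21.5 GB at FP8. The saving scales linearly with the number of concurrent sequences and their lengths.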

Batch size and cost. Larger batch sizes lower cost per token directly. The formula: cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000. Using live H100 SXM5 pricing: at $1.63/hr spot ($3.84/hr on-demand) and a 70B FP8 model running at ~1,500 tokens/sec, the cost works out to roughly $0.30 per million tokens on spot ($0.71/M on-demand). Compare that to a BF16 run at ~900 tokens/sec on the same hardware: $0.50/M spot ($1.19/M on-demand). The FP8 advantage compounds further at higher batch sizes where throughput scales better.
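The formula is easy to rerun with your own prices and measured throughput. Here is a minimal sketch using the figures quoted above; the function name `cost_per_million_tokens` is illustrative.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """cost per 1M tokens = ($/hr) / (tokens/sec * 3,600) * 1,000,000"""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Figures quoted in the text for a 70B model on H100 SXM5
PRICES = {"spot": 1.63, "on-demand": 3.84}      # $/hr
THROUGHPUT = {"FP8": 1500, "BF16": 900}          # tokens/sec

for precision, tps in THROUGHPUT.items():
    for tier, price in PRICES.items():
        cost = cost_per_million_tokens(price, tps)
        print(f"{precision} {tier}: ${cost:.2f} per million tokens")
```

Because throughput sits in the denominator, a 1.67x throughput gain (1,500 vs 900 tokens/sec) cuts cost per token by about 40% at the same hourly price.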

FP8 Training vs FP8 Inference

These are different use cases that share the "FP8" label but have different complexity profiles.

Inference (PTQ). Post-training quantization: the model is already trained, and weights are calibrated to FP8 after the fact. Major serving frameworks handle the scaling automatically. For vLLM, add --quantization fp8 to your serve command. For TensorRT-LLM, pass --dtype bfloat16 --fp8. No separate calibration step is needed for dynamic per-tensor quantization. Accuracy loss for most production models is negligible. This is the simple path.
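To make "dynamic per-tensor quantization" concrete, the sketch below simulates it in plain Python: take the tensor's absolute maximum, derive one scale so that maximum maps to the largest E4M3 value (448), round every scaled element to the nearest representable value, then scale back. It is a numerical toy under stated assumptions, not how vLLM or TensorRT-LLM implement their kernels. All function names are hypothetical, and ties round toward the smaller magnitude here, whereas real hardware rounds to nearest even. A symmetric INT8 quantizer is included for comparison.

```python
import bisect

def e4m3_positive_grid():
    """All non-negative finite E4M3 values (bias 7; exponent 15 with mantissa 7 is NaN)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue
            if exp == 0:
                vals.add((man / 8) * 2.0 ** -6)            # zero and subnormals
            else:
                vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(vals)

GRID = e4m3_positive_grid()
E4M3_MAX = GRID[-1]  # 448.0

def round_to_e4m3(x):
    """Round to the nearest representable E4M3 value, saturating at +/-448."""
    mag = min(abs(x), E4M3_MAX)
    i = bisect.bisect_left(GRID, mag)
    candidates = GRID[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest

def fp8_per_tensor(tensor):
    """Quantize and dequantize with a single scale for the whole tensor."""
    amax = max(abs(v) for v in tensor) or 1.0
    scale = amax / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in tensor]

def int8_per_tensor(tensor):
    """Symmetric INT8 with a single scale, for comparison."""
    amax = max(abs(v) for v in tensor) or 1.0
    scale = amax / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in tensor]

activations = [0.01, 0.1, 1.0, 10.0]  # values spanning three orders of magnitude
for name, fn in [("FP8 E4M3", fp8_per_tensor), ("INT8", int8_per_tensor)]:
    errors = [abs(d - v) / abs(v) for d, v in zip(fn(activations), activations)]
    print(name, [f"{e:.1%}" for e in errors])
```

With one scale per tensor, INT8 rounds the smallest value to zero, while E4M3 keeps every element within a few percent. The relative error stays roughly constant across magnitudes because the exponent adapts to each value.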

Training (mixed precision). This requires NVIDIA Transformer Engine. The forward pass runs in FP8 E4M3, backward gradients in FP8 E5M2 or BF16, and master weights plus optimizer states stay in BF16 or FP32. Transformer Engine handles per-tensor dynamic scaling (delayed scaling or current scaling) automatically. The setup is more involved, but training throughput improves by 1.3-1.7x on H100 SXM5. See the full Transformer Engine setup tutorial for the installation, recipe API, and benchmark data.
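The difference between delayed and current scaling can be shown with a toy tracker. In this simplified sketch, current scaling derives the scale from the tensor being quantized, while delayed scaling reuses the largest absolute value seen over a short window of earlier steps. The class and function names are hypothetical and this is not the Transformer Engine API; the real recipes add options such as margins and configurable history lengths.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value

def amax(tensor):
    """Largest absolute value in the tensor."""
    return max(abs(v) for v in tensor)

def current_scale(tensor, fp8_max=E4M3_MAX):
    """Current scaling: the scale comes from the tensor being quantized now."""
    a = amax(tensor)
    return fp8_max / a if a > 0 else 1.0

class DelayedScaler:
    """Delayed scaling: the scale comes from amax values of earlier steps."""

    def __init__(self, history_len=4, fp8_max=E4M3_MAX):
        self.history = deque(maxlen=history_len)
        self.fp8_max = fp8_max

    def scale(self):
        a = max(self.history, default=0.0)
        return self.fp8_max / a if a > 0 else 1.0  # no usable history yet

    def update(self, tensor):
        self.history.append(amax(tensor))

scaler = DelayedScaler(history_len=4)
steps = [[0.5, -2.0], [1.0, -3.5], [7.0, 0.25]]  # amax grows step over step
for step, tensor in enumerate(steps):
    delayed = scaler.scale()
    saturates = amax(tensor) * delayed > E4M3_MAX
    print(f"step {step}: delayed={delayed:.1f} current={current_scale(tensor):.1f} "
          f"saturates_with_delayed={saturates}")
    scaler.update(tensor)
```

Delayed scaling avoids an extra pass over the tensor before quantizing it, but when activations grow quickly the stale scale can push values past the FP8 maximum, as the later steps in the example show. Current scaling never saturates but has to compute the maximum first.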

Hardware Support for FP8

| GPU | Architecture | FP8 support | FP8 TFLOPS (dense) | Notes |
| --- | --- | --- | --- | --- |
| H100 SXM5 | Hopper | Full | 1,979 | First-gen Transformer Engine; dedicated FP8 Tensor Cores |
| H200 SXM5 | Hopper | Full | 1,979 | Same compute as H100; 141 GB HBM3e for larger models |
| GH200 | Hopper | Full | 1,979 | H100 compute + Grace CPU on NVLink-C2C |
| B200 SXM6 | Blackwell | Full | 4,500 | Second-gen TE; also supports FP4 |
| B300 (Blackwell Ultra) | Blackwell | Full | ~5,000 | Dense FP8; ~10,000 TFLOPS sparse; also supports FP4 |
| RTX 5090 | Blackwell | Full | ~838 | Consumer Blackwell; FP8 + FP4 |
| RTX PRO 6000 | Blackwell | Full | ~1,457 | Workstation Blackwell; 96 GB GDDR7 |
| L40S | Ada Lovelace | Partial | ~733 | No per-tensor scaling HW; inference only |
| RTX 4090 | Ada Lovelace | Partial | ~660 | Dense FP8; Transformer Engine falls back to BF16 silently |
| A100 | Ampere | No | N/A | INT8 is the max hardware-accelerated format |
| AMD MI300X | CDNA3 | Full (inference) | ~2,615 | hipBLASLt FP8 GEMM ops; no full TE equivalent |
| AMD MI355X | CDNA4 | Full (inference) | ~5,000 | aiter attention kernels; FP8 + MXFP4 |

Hopper (H100/H200/GH200). First-generation Transformer Engine with dedicated FP8 Tensor Cores and hardware per-tensor scaling. The transformer_engine Python library is required for training. For inference, vLLM and TensorRT-LLM use TE internally. Rent an H100 or try H200 on Spheron for the extra memory bandwidth if your workload is HBM-bound.

Ada Lovelace (RTX 4090, L40S). CUDA-level FP8 ops exist, but there is no dedicated scaling hardware. Transformer Engine falls back to BF16 silently on Ada. These GPUs are suitable for FP8 inference with manual quantization (vLLM --quantization fp8 still works), but you will not see the same throughput gains as on Hopper. For inference-only deployments where cost matters more than raw throughput, Ada can still make sense.

Blackwell (B200, B300, RTX 5090, RTX PRO 6000). Second-generation Transformer Engine with higher underlying FP8 TFLOPS and improved scaling throughput. The same Python API works on Blackwell as on Hopper. Blackwell also supports FP4, which doubles throughput again beyond FP8. B200 instances on Spheron give you both FP8 and FP4 options.

AMD MI300X/MI355X. FP8 inference works via hipBLASLt FP8 GEMM operations and the aiter attention kernel library. ROCm PyTorch supports FP8 ops. There is no direct AMD equivalent to Transformer Engine's Python API for training as of mid-2026, but inference pipelines (vLLM on ROCm, SGLang) support FP8 on MI300X. For details, see the guide to MXFP4 microscaling on AMD hardware.
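For NVIDIA GPUs, the support tiers above can be approximated from the CUDA compute capability. The sketch below is a hypothetical helper, not a library function. It assumes capability 8.0 for A100, 8.9 for Ada Lovelace, 9.0 for Hopper, and 10.x or 12.x for Blackwell, and it treats anything at 9.0 or above as full support, which is an extrapolation for future architectures. With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple this function expects.

```python
def fp8_support(capability):
    """Map a CUDA compute capability (major, minor) to an FP8 support tier.

    Tiers follow the hardware table in this article:
    'full', 'partial', or 'none'.
    """
    major, minor = capability
    if major >= 9:
        return "full"      # Hopper (9.0) and Blackwell (10.x, 12.x)
    if (major, minor) == (8, 9):
        return "partial"   # Ada Lovelace: RTX 4090, L40S
    return "none"          # Ampere and older, e.g. A100 (8.0)

# Example capabilities; on a real machine, pass the tuple reported by your framework
for gpu, cap in [("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0)), ("B200", (10, 0))]:
    print(f"{gpu}: compute capability {cap[0]}.{cap[1]} -> FP8 support: {fp8_support(cap)}")
```

A check like this is useful as a startup guard, so a deployment that expects FP8 speedups fails loudly on unsupported hardware instead of running at lower throughput.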

FP8 in Practice: vLLM, TensorRT-LLM, and SGLang

vLLM. The simplest path to FP8 inference. Dynamic per-tensor quantization, no separate calibration step required:

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8
```

The --quantization fp8 flag uses dynamic per-tensor quantization. For higher accuracy, load a pre-quantized FP8 checkpoint from HuggingFace using --quantization compressed-tensors, which applies static scales computed during calibration.

TensorRT-LLM. Builds a compiled FP8 engine from a checkpoint:

```bash
trtllm-build \
  --checkpoint_dir ./checkpoint \
  --dtype bfloat16 \
  --strongly_typed \
  --use_fp8_context_fmha enable
```

TensorRT-LLM also supports loading pre-quantized FP8 checkpoints from NVIDIA's NGC model catalog, which gives you statically calibrated scales without running the quantization yourself.

SGLang. Similar syntax to vLLM:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8
```

On standard benchmarks, accuracy delta for FP8 vs BF16 is typically under 0.5% on MMLU for 70B models. For 7B models, expect 1-2% degradation, which is worth validating against your specific task.

Accuracy and Limitations

For most production workloads, FP8 vs BF16 accuracy differences are below measurement noise. The cases where FP8 degrades measurably:

Small models (7B and under). Each weight matters more per prediction. FP8 quantization errors compound more visibly at smaller parameter counts. Test your specific model before committing to FP8 in production.

Models with large activation outliers. Older architectures like OPT and GPT-2-style models have activation distributions that stress FP8's dynamic range even with per-tensor scaling. Modern models (Llama 3, Qwen3, Mistral) are more quantization-friendly.

Tasks requiring numerical precision. Multi-step arithmetic, chemistry and biology prediction, and code generation that outputs exact numeric results are the categories where FP8 accuracy loss shows up most clearly on benchmarks.

Mitigation options. Use FP8 KV cache with BF16 weights (a partial approach that reduces memory pressure while keeping weight precision). Use TRT-LLM's per-layer fallback to selectively keep sensitive layers in BF16. Mix FP8 and BF16 via Transformer Engine's per-layer recipe. If MMLU or task-specific accuracy drops more than 1% in an A/B test, fall back to BF16 or evaluate AWQ INT4, whose activation-aware calibration can recover accuracy on non-Hopper hardware. For deployments on non-Hopper hardware where FP8 acceleration is limited, GGUF is the more practical route to VRAM reduction.
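The 1% rule can be written as a simple gate in an evaluation pipeline. The sketch below assumes accuracies are expressed in percent and reads "more than 1%" as percentage points, which is one interpretation of the threshold. The function name and the example scores are illustrative, not measured results.

```python
def choose_precision(bf16_accuracy, fp8_accuracy, max_drop_points=1.0):
    """Return 'fp8' if the accuracy drop is within tolerance, else 'bf16'.

    Accuracies are in percent (e.g. 82.0 for 82%); the drop is measured
    in percentage points relative to the BF16 baseline.
    """
    drop = bf16_accuracy - fp8_accuracy
    return "bf16" if drop > max_drop_points else "fp8"

# Illustrative scores only, not benchmark data
print(choose_precision(bf16_accuracy=82.0, fp8_accuracy=81.7))  # small drop
print(choose_precision(bf16_accuracy=68.0, fp8_accuracy=66.4))  # large drop
```

Run the same gate on your task-specific evaluation as well as a general benchmark, since the two can disagree for the precision-sensitive workloads listed above.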

FP8 on Spheron: Available GPUs

The following GPUs are available on Spheron. All except the A100 have native FP8 Tensor Core support. AMD GPUs are not currently in the Spheron catalog.

| GPU | FP8 support | On-demand price | Spot price | Notes |
| --- | --- | --- | --- | --- |
| H100 SXM5 | Full | $3.84/hr | $1.63/hr | 80 GB HBM3; first-gen TE |
| H200 SXM5 | Full | $4.56/hr | $1.89/hr | 141 GB HBM3e; more bandwidth |
| B200 SXM6 | Full + FP4 | $7.16/hr | $1.71/hr | 192 GB HBM3e; second-gen TE; spot is dramatically below on-demand, strong value for FP8/FP4 workloads |
| A100 80G PCIe | INT8 only | $1.20/hr | $1.19/hr | No hardware FP8; use AWQ instead |

Pricing fluctuates based on GPU availability. The prices above are as of 23 May 2026 and may have changed. Check current GPU pricing for live rates.


H100, H200, and B200 GPUs with native FP8 Tensor Core support are available on Spheron with per-minute billing and no contracts. Compare on-demand and spot rates across Hopper and Blackwell in one place.

Frequently Asked Questions

What is FP8 quantization?

FP8 quantization stores model weights and activations using 8-bit floating-point numbers instead of the 16-bit or 32-bit formats used during training. Unlike INT8 quantization (which uses fixed-range integers), FP8 retains a floating-point exponent, giving it a much wider dynamic range. This makes FP8 suitable for the wildly varying activation distributions in large transformer models without the calibration and clipping workarounds that INT8 typically requires.

What is the difference between E4M3 and E5M2?

Both are 8-bit floating-point formats but with different allocations of exponent vs mantissa bits. E4M3 (4 exponent bits, 3 mantissa bits) has higher precision and a narrower dynamic range. It is the standard choice for forward-pass weights and activations. E5M2 (5 exponent bits, 2 mantissa bits) covers a wider dynamic range at lower precision and is used for backward-pass gradients during FP8 training, where values can span many orders of magnitude.

Which GPUs support FP8?

NVIDIA Hopper GPUs (H100, H200, GH200) have dedicated FP8 Tensor Cores, exposed through the Transformer Engine library. NVIDIA Ada Lovelace GPUs (RTX 4090, L40S) have partial FP8 support through CUDA but lack the per-tensor scaling hardware, so you do not get the same throughput gains as on Hopper. Blackwell GPUs (B200, B300, RTX 5090, RTX PRO 6000) support both FP8 and FP4 with a second-generation Transformer Engine. AMD MI300X and MI355X support FP8 inference via hipBLASLt; AMD's FP8 support in ROCm is mature for inference but less complete than NVIDIA's for training as of mid-2026.

How does FP8 compare to INT8?

FP8 and INT8 achieve similar memory savings (both halve memory vs FP16) but FP8 handles large transformer activation distributions more cleanly. INT8 inference (via bitsandbytes, llm.int8(), or TensorRT INT8) requires careful calibration and often clips outlier activations, which can hurt accuracy on large models. FP8's floating-point exponent naturally handles outliers without explicit clipping, making it easier to deploy with minimal accuracy loss. For most modern LLMs on Hopper hardware, FP8 is the preferred choice over INT8.

Does FP8 quantization reduce accuracy?

For most production workloads - instruction following, chat, summarization, code generation - the accuracy gap between FP8 and BF16 is below measurement noise on standard benchmarks (MMLU, HellaSwag, GSM8K). Accuracy does degrade more noticeably on tasks requiring high numerical precision (multi-step math, scientific reasoning) and for smaller models under 7B parameters where each bit of precision matters more. Always run your target evaluation before switching a production model to FP8.
