If your LLM endpoint is returning tokens slower than expected, one of seven problems is almost always responsible. Some of these are GPU configuration issues. Some are software choices. A few are infrastructure decisions that compound each other. This guide covers each one: what causes it, how to confirm it, and the specific fix. For GPU VRAM sizing before you start diagnosing, see our GPU memory requirements guide. For GPU selection by cost-per-token, see the AI inference GPU guide.
TL;DR
- VRAM spillover: model doesn't fit in GPU memory, computation offloads to CPU at 10-50x slower throughput. Fix: quantize or upgrade GPU.
- No PagedAttention: KV cache pre-allocation wastes 60-80% of VRAM, limiting concurrency. Fix: use vLLM.
- FP16 when lower precision works: unnecessary VRAM pressure and slower tensor ops. Fix: --quantization fp8 on H100/Blackwell.
- Static batching: serving one request at a time leaves the GPU 60-80% idle. Fix: enable continuous batching in vLLM.
- Slow attention kernel: standard attention is O(n²) memory; long contexts become painfully slow. Fix: vLLM v0.17+ with FlashAttention 3/4.
- Network bottleneck: 70B model downloads take 30-60 min; API round-trips add latency. Fix: cache models to persistent storage.
- Wrong inference engine: Ollama is built for local, single-user use, not production traffic. Fix: migrate to vLLM or TensorRT-LLM.
1. Wrong GPU: Your Model Doesn't Fit in VRAM
When model weights exceed VRAM, the framework (PyTorch, transformers, Ollama) silently offloads layers to CPU. A 70B model at FP16 is 140 GB; that won't fit on a single 80 GB H100. The CPU offload path (PCIe plus system RAM) is 10-50x slower than on-device HBM, so throughput collapses.
Diagnose: nvidia-smi shows VRAM usage near 100%. CPU usage is elevated during inference. htop shows system RAM being consumed. Tokens/sec is under 5 for models that should do 40+.
Fix: Quantize to FP8 or INT8 (halves weight memory), or move to a GPU with adequate VRAM.
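As a back-of-envelope check before renting hardware, weight memory is roughly parameters × bytes per parameter (1B params at 1 byte/param is ~1 GB). This sketch estimates weights alone; real deployments need extra headroom for the KV cache and activations:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone, in GB."""
    # 1B params at 1 byte/param is ~1 GB, so this reduces to a product.
    return params_billions * bytes_per_param

# 70B at FP16 (2 bytes/param) needs ~140 GB -- more than one 80 GB H100.
print(weight_vram_gb(70, 2))    # 140.0
# The same model at FP8 (1 byte/param) fits a single H100 with room to spare.
print(weight_vram_gb(70, 1))    # 70.0
```

These numbers match the FP16 and FP8/INT8 columns in the table above.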
| Model Size | FP16 VRAM | FP8/INT8 VRAM | INT4 VRAM | Recommended GPU | On-Demand Price |
|---|---|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB | RTX 4090 (24 GB) | $0.51/hr |
| 13B | 26 GB | 13 GB | 6.5 GB | L40S (48 GB) | $0.72/hr |
| 30B | 60 GB | 30 GB | 15 GB | L40S (48 GB) FP8 | $0.72/hr |
| 70B | 140 GB | 70 GB | 35 GB | H100 SXM5 80 GB (FP8) | $2.40/hr |
| 70B FP16 | 140 GB | — | — | 2x H100 SXM5 | $4.80/hr |
| 405B | 810 GB | 405 GB | 203 GB | 8x H100 SXM5 (FP8) or 3x B200 (INT4, tensor parallel) | varies |
| 685B MoE | 1.37 TB | 685 GB | 343 GB | 8x H200 or multi-node | varies |
For exact VRAM calculations by model, see our GPU memory requirements guide for LLMs.
Rent an H100 on Spheron or compare all GPU instances.
2. No KV Cache Optimization (PagedAttention)
During autoregressive generation, the model stores key/value tensors for every token in the context. At 128K context, this can consume 40+ GB per concurrent request on a 70B model.
Traditional serving frameworks pre-allocate maximum context-length KV memory per request slot, wasting 60-80% of VRAM when requests use shorter contexts.
PagedAttention (introduced by vLLM) applies OS-style virtual memory paging to the KV cache. It allocates cache blocks on-demand, not up-front. The practical result: 2-4x more concurrent requests on the same GPU.
Diagnose: The vllm:kv_cache_usage_perc Prometheus metric (from vLLM's /metrics endpoint) hits 100%; requests queue even when GPU compute isn't saturated.
Fix: Use vLLM (PagedAttention is the default). Raise --gpu-memory-utilization (e.g. to 0.92 from the 0.90 default) to give the KV cache more VRAM. Reduce --max-model-len if your context lengths are shorter than the model's default maximum.
You can also pass --kv-cache-dtype fp8 to store the KV cache in FP8 for additional VRAM savings on H100/Blackwell.
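The 40+ GB figure above is easy to verify: the per-request KV cache is 2 (keys and values) × layers × KV heads × head dim × dtype bytes × tokens. The Llama 3.1 70B shape used below (80 layers, 8 GQA KV heads, head dim 128) comes from the public model config and is illustrative:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 dtype_bytes: int, context_len: int) -> float:
    """Per-request KV cache size in GiB for a GQA transformer."""
    # Factor of 2 covers both the key and the value tensor per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len / 1024**3

# Llama 3.1 70B at FP16, full 128K context: 40 GiB per concurrent request.
print(kv_cache_gib(80, 8, 128, 2, 131072))   # 40.0
# Same request with --kv-cache-dtype fp8 (1 byte/value): half the cache.
print(kv_cache_gib(80, 8, 128, 1, 131072))   # 20.0
```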
3. Using FP16 When INT8 or INT4 Would Be Fine
Precision directly controls both VRAM consumption and compute throughput. FP16 uses 2 bytes per parameter. FP8 uses 1 byte. INT4 uses 0.5 bytes.
| Format | Bytes/param | 70B model VRAM | Throughput vs FP16 | Quality delta | GPU support |
|---|---|---|---|---|---|
| FP32 | 4 | 280 GB | 0.5x | Baseline | All |
| FP16/BF16 | 2 | 140 GB | 1x | Baseline | All modern |
| FP8 | 1 | 70 GB | 1.5-2x | <1-2% | H100, H200, Blackwell |
| INT8 | 1 | 70 GB | 1.3-1.7x | <1-2% | Most modern GPUs |
| INT4 | 0.5 | 35 GB | 1.5-2x | 1-4% | Most modern GPUs |
| FP4 | 0.5 | 35 GB | ~3-4x¹ | 1-4% | Blackwell only |
¹ FP4 throughput benchmarks compare against FP8, not FP16 directly. ~2x over FP8 translates to ~3-4x over FP16.
Diagnose: Check what dtype vLLM is using in server startup logs (look for dtype=float16 vs dtype=float8). nvidia-smi shows the model using more VRAM than expected for the parameter count.
Fix: Pass --quantization fp8 to vLLM on H100 or Blackwell. For Blackwell with FP4, download a pre-calibrated model from the nvidia/ namespace on Hugging Face. Where FP8 hardware support is not available, use bitsandbytes INT8 (BitsAndBytesConfig(load_in_8bit=True) in HuggingFace transformers).
For FP4 specifically on Blackwell GPUs, see our FP4 quantization guide.
4. Single-Request Serving Instead of Continuous Batching
Naively serving one request at a time leaves the GPU 60-80% idle between requests. Even serving fixed batches of N requests wastes time waiting for the slowest request in the batch to finish before accepting new ones.
Continuous batching (also called iteration-level scheduling) lets new requests join mid-generation as soon as a slot frees. GPU utilization climbs to 80-95% under load.
Diagnose: GPU utilization stays at 20-40% even when you're sending many concurrent requests. Each request gets fast individual latency but total throughput doesn't scale with concurrency.
Fix: Use vLLM (continuous batching is on by default; add --max-num-seqs 256 for large queues). If you're using HuggingFace generate() in a loop, that's static serving. Migrate to vLLM.
TGI (text-generation-inference from HuggingFace) also implements continuous batching and is an alternative if vLLM doesn't suit your deployment pattern. For a step-by-step vLLM production setup with multi-GPU and FP8, see the vLLM production deployment guide.
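The throughput gap is easy to see in a toy simulation. This is not vLLM code, just a sketch of iteration-level scheduling under stated assumptions: each "step" generates one token for every active sequence, and continuous batching refills a freed slot immediately instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching_steps(lengths, max_batch):
    # Iteration-level scheduling: every step generates one token per active
    # sequence; a finished slot is refilled from the queue immediately.
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [n - 1 for n in active if n > 1]  # drop finished sequences
        steps += 1
    return steps

def static_batching_steps(lengths, max_batch):
    # Fixed batches: the whole batch waits for its slowest sequence.
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

# Mixed short (10-token) and long (100-token) requests, batch size 4.
work = [10, 100, 10, 100, 10, 100, 10, 100]
print(static_batching_steps(work, 4))      # 200 steps
print(continuous_batching_steps(work, 4))  # 130 steps: short requests
                                           # no longer block the batch
```

With mixed request lengths the short requests stop paying for the long ones, which is exactly where the 80-95% utilization figure comes from.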
5. Unoptimized Attention Implementation (FlashAttention)
Standard attention materializes the full N×N attention matrix in VRAM, where N is context length, so memory scales O(N²). At 128K context this is over 30 GB per attention head at FP16.
FlashAttention computes attention in tiles that fit in GPU SRAM, avoiding materializing the full matrix. Faster and uses less VRAM at long contexts.
- FlashAttention 3 (Hopper/H100, H200): further optimized for HBM3.
- FlashAttention 4 (Blackwell/SM100+): default backend in vLLM v0.17.0+ for B200 and datacenter Blackwell GPUs.
Diagnose: TTFT climbs steeply as context length grows beyond 8K, even at batch size 1. Standard attention at 32K context can be 4-8x slower than FlashAttention.
Fix: Upgrade to vLLM v0.17+ (released March 2026). FlashAttention 3/4 is enabled automatically with no configuration flags. Verify: python -c "import vllm; print(vllm.__version__)". If on an older engine, upgrade or switch to vLLM.
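The quadratic blow-up is easy to quantify. A rough per-head estimate of the N×N score matrix that standard attention materializes (FP16 scores, ignoring softmax temporaries):

```python
def attn_scores_gib(context_len: int, dtype_bytes: int = 2) -> float:
    """Memory for one head's full N x N attention score matrix, in GiB."""
    return context_len ** 2 * dtype_bytes / 1024**3

print(attn_scores_gib(8_192))     # 0.125 GiB: manageable
print(attn_scores_gib(131_072))   # 32.0 GiB per head: why FlashAttention
                                  # computes in SRAM-sized tiles instead
```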
6. Network Bottleneck
Two separate network problems are often conflated:
1. Model download latency: A 70B FP16 model is 140 GB. On a 1 Gbps connection that is 18+ minutes. On 100 Mbps it is 3 hours. If your serving infrastructure downloads the model on every cold start, you are paying GPU time while waiting for a download.
2. API latency (for hosted inference providers): if using a third-party API, round-trip latency to the provider's endpoint adds overhead on every request. This matters most for short outputs where network RTT is a large fraction of total time.
Diagnose for model download: Time your model load with time python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-70B')". If it takes more than 5 minutes, you are downloading on each start. For API latency, measure with curl -w "%{time_total}" -o /dev/null -s <endpoint>.
Fix:
- Pre-download model weights to a persistent volume or network-attached storage. On Spheron, attach persistent storage and point HF_HOME or --model to the local path.
- Cache models in the same region as your instance.
- For hosted API latency: choose providers with low-latency endpoints, or move to self-hosted inference.
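The download-time numbers above follow from simple arithmetic (model sizes are in bytes, link speeds in bits; protocol overhead is ignored in this sketch):

```python
def download_minutes(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time in minutes; multiply GB by 8 to get gigabits."""
    return size_gb * 8 / link_gbps / 60

print(round(download_minutes(140, 1.0)))  # 19 min: 70B FP16 model at 1 Gbps
print(round(download_minutes(140, 0.1)))  # 187 min (~3 h) at 100 Mbps
```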
7. Wrong Inference Engine for Your Workload
Not every engine is built for every scenario. Using Ollama for production traffic or running vLLM when TensorRT-LLM would give 40% more throughput are both common mismatches.
| Engine | Best For | Setup | Continuous Batching | FP8/FP4 | Throughput |
|---|---|---|---|---|---|
| vLLM | Most production deployments | Medium | Yes (default) | Yes | High |
| TensorRT-LLM | Max throughput on NVIDIA | High | Yes | Yes (incl. FP4) | Highest |
| SGLang | Agent workloads, structured output | Medium | Yes | Yes | High |
| Ollama | Local single-user testing | Low | No | Limited | Low |
| LMDeploy | NVIDIA TurboMind backend alternative | Medium | Yes | Yes | High |
| llama.cpp | CPU inference, edge/local | Low | No | Yes (GGUF) | Low-Medium |
Diagnose: Compare your measured tokens/sec against published benchmarks for your engine on the same GPU. If you're 30-50% under benchmark, the engine config or version is the likely issue.
Fix: For most teams, vLLM is the right starting point. Migrate with pip install vllm and the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server). The API is compatible with OpenAI's SDK, so client code usually requires no changes.
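Migration really is mostly a base-URL change. A minimal sketch of the request body an existing OpenAI-style client sends, pointed at a local vLLM server; the port (vLLM's default 8000) and model name are illustrative assumptions:

```python
import json

# Assumed local endpoint: vLLM's OpenAI-compatible server defaults to port
# 8000, and the model name must match what the server was launched with.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain KV caching in one line."}],
    "max_tokens": 64,
}
# The identical body works against api.openai.com -- only the URL differs,
# which is why client code usually needs no changes.
body = json.dumps(payload)
print(sorted(payload))   # ['max_tokens', 'messages', 'model']
```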
GPU Sizing Guide: Which Spheron Instance for Which Model
| Model | Params | FP8/INT8 VRAM | Recommended GPU | VRAM | Approx tok/s (batch 8) | Price/hr |
|---|---|---|---|---|---|---|
| Mistral 7B, Llama 3.1 8B | 7-8B | 7-8 GB | RTX 4090 | 24 GB | 90-120 | $0.51 |
| Llama 2 13B | 13B | 13 GB | L40S | 48 GB | 60-80 | $0.72 |
| Llama 3.1 70B | 70B | 70 GB | H100 SXM5 | 80 GB | 30-45 | $2.40 |
| Llama 3.1 70B (FP16) | 70B | 140 GB | 2x H100 SXM5 | 160 GB | 50-70 | $4.80 |
| Llama 4 Scout (109B MoE) | 17B active | 55 GB (INT4) | H100 SXM5 (INT4) | 80 GB | 40-60 | $2.40 |
| DeepSeek V3.2 685B MoE | 37B active | 690 GB (FP8) | 8x H200 or multi-node | 1,128 GB | varies | varies |
| 405B+ dense | 405B+ | 405 GB (FP8) | 8x H100 SXM5 (FP8) or 3x B200 (INT4) | 640 GB / 576 GB | varies | $19.20+ |
Rent H100 on Spheron | Rent A100 | View all GPU pricing | GPU cost optimization guide
Pricing fluctuates based on GPU availability. The prices above are based on 29 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Before/After Benchmarks
| Optimization Applied | Before | After | Improvement |
|---|---|---|---|
| Move 70B from CPU offload to H100 FP8 | ~2 tok/s | ~40 tok/s | ~20x |
| Static batching to continuous batching (vLLM) | 40% GPU util | 85% GPU util | ~2x throughput |
| FP16 to FP8 on H100 (70B model) | ~25 tok/s | ~42 tok/s | ~1.7x |
| FP8 to FP4 on B200 (70B model, MLPerf est.)† | ~4,350 tok/s | ~12,840 tok/s | ~2.9x |
| Standard attention to FlashAttention 3 @ 32K ctx | ~120ms TTFT | ~55ms TTFT | ~2.2x TTFT |
| No PagedAttention to PagedAttention (KV cache) | 15 concurrent reqs | 60 concurrent reqs | 4x concurrency |
| Model download every cold start to cached | 20-min cold start | <30s cold start | ~40x startup |
† The ~4,350 tok/s FP8 baseline may reflect H200 FP8 performance from MLPerf Inference v5.0; B200 submissions primarily used FP4, so this baseline is unverified for B200 specifically. The FP4 figure (~12,840 tok/s) is an approximate per-GPU estimate derived from 8-GPU aggregate results divided by 8. Actual results vary by batch size and model architecture.
Quick Reference: If You See X, Try Y
| Symptom | Likely Cause | First Fix |
|---|---|---|
| <5 tok/s at batch 1 for a 7B model | VRAM spillover to CPU | Quantize to INT8 or upgrade GPU |
| GPU util <40% under concurrent load | Static batching | Switch to vLLM continuous batching |
| Requests queuing but GPU not saturated | KV cache full | Reduce --max-model-len or add VRAM |
| TTFT slow at 32K+ context, fast at 2K | No FlashAttention | Upgrade to vLLM v0.17+ |
| Good single-request latency, poor concurrency | No batching | Add --max-num-seqs 256 in vLLM |
| High cost per token vs benchmark | Wrong precision (FP16) | Enable FP8 with --quantization fp8 |
| 20-min cold start per instance | Re-downloading model | Cache model to persistent volume |
| Throughput 30-50% below published benchmarks | Wrong engine or version | Try vLLM latest or TensorRT-LLM |
| OOM errors mid-generation | KV cache overflow | Lower --max-model-len or --max-num-seqs |
If VRAM limits or cost-per-token are the bottleneck holding back your inference setup, Spheron's GPU instances let you test across H100, H200, and B200 configurations at per-minute billing with no commitment.
