Engineering

Why Your LLM Inference Is Slow (And How to Fix It)

Written by Mitrasish, Co-founder · Mar 29, 2026
Tags: LLM Inference, GPU Cloud, Performance Optimization, KV Cache, Quantization, Continuous Batching, FlashAttention, vLLM

If your LLM endpoint is returning tokens slower than expected, one of seven problems is almost always responsible. Some of these are GPU configuration issues. Some are software choices. A few are infrastructure decisions that compound each other. This guide covers each one: what causes it, how to confirm it, and the specific fix. For GPU VRAM sizing before you start diagnosing, see our GPU memory requirements guide. For GPU selection by cost-per-token, see the AI inference GPU guide.

TL;DR

  • VRAM spillover: model doesn't fit in GPU memory, computation offloads to CPU at 10-50x slower throughput. Fix: quantize or upgrade GPU.
  • No PagedAttention: KV cache pre-allocation wastes 60-80% of VRAM, limiting concurrency. Fix: use vLLM.
  • FP16 when lower precision works: unnecessary VRAM pressure and slower tensor ops. Fix: --quantization fp8 on H100/Blackwell.
  • Static batching: serving one request at a time leaves GPU 60-80% idle. Fix: enable continuous batching in vLLM.
  • Slow attention kernel: standard attention is O(n²) memory; long contexts become painfully slow. Fix: vLLM v0.17+ with FlashAttention 3/4.
  • Network bottleneck: 70B model downloads take 30-60 min; API round-trips add latency. Fix: cache models to persistent storage.
  • Wrong inference engine: Ollama is great for local testing but isn't built for production traffic. Fix: migrate to vLLM or TensorRT-LLM.

1. Wrong GPU: Your Model Doesn't Fit in VRAM

When model weights exceed VRAM, the framework (PyTorch, transformers, Ollama) silently offloads layers to CPU. A 70B model at FP16 is 140 GB. That won't fit on a single H100 80 GB. GPU-to-CPU memory bandwidth is 10-50x slower than HBM, so throughput collapses.

Diagnose: nvidia-smi shows VRAM usage near 100%. CPU usage is elevated during inference. htop shows system RAM being consumed. Tokens/sec is under 5 for models that should do 40+.

Fix: Quantize to FP8 or INT8 (halves weight memory), or move to a GPU with adequate VRAM.
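
A quick back-of-envelope fit check can be sketched as follows (the 90% headroom factor is an assumed rule of thumb leaving room for KV cache and activations, not a hard limit):

```python
def fits_in_vram(params_b: float, bytes_per_param: float, gpu_vram_gb: float,
                 headroom: float = 0.9) -> bool:
    """Rough check: weights must fit within ~90% of VRAM, leaving headroom
    for KV cache and activations (assumed rule of thumb, not a hard rule)."""
    return params_b * bytes_per_param <= gpu_vram_gb * headroom

print(fits_in_vram(70, 2.0, 80))  # False: 70B at FP16 = 140 GB, over one H100 80 GB
print(fits_in_vram(70, 1.0, 80))  # True: 70B at FP8 = 70 GB fits with headroom
```

If the check fails at your current precision, the table below shows which quantization level or GPU closes the gap.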

| Model Size | FP16 VRAM | FP8/INT8 VRAM | INT4 VRAM | Recommended GPU | On-Demand Price |
|---|---|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB | RTX 4090 (24 GB) | $0.51/hr |
| 13B | 26 GB | 13 GB | 6.5 GB | L40S (48 GB) | $0.72/hr |
| 30B | 60 GB | 30 GB | 15 GB | L40S (48 GB) FP8 | $0.72/hr |
| 70B | 140 GB | 70 GB | 35 GB | H100 SXM5 80 GB (FP8) | $2.40/hr |
| 70B FP16 | 140 GB | n/a | n/a | 2x H100 SXM5 | $4.80/hr |
| 405B | 810 GB | 405 GB | 203 GB | 8x H100 SXM5 (FP8) or 3x B200 (INT4, tensor parallel) | varies |
| 685B MoE | 1.37 TB | 685 GB | 343 GB | 8x H200 or multi-node | varies |

For exact VRAM calculations by model, see our GPU memory requirements guide for LLMs.

Rent an H100 on Spheron or compare all GPU instances.

2. No KV Cache Optimization (PagedAttention)

During autoregressive generation, the model stores key/value tensors for every token in the context. At 128K context, this can consume 40+ GB per concurrent request on a 70B model.
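
The 40 GB figure can be reproduced from the transformer's KV geometry. A sketch assuming a Llama-3.1-70B-style configuration (80 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache — the config values are assumptions, check your model's config.json):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_val: int = 2) -> int:
    # 2x: one tensor for keys and one for values, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * seq_len

gb = kv_cache_bytes(128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128) / 1024**3
print(f"{gb:.0f} GB per 128K-context request")  # -> 40 GB
```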

Traditional serving frameworks pre-allocate maximum context-length KV memory per request slot, wasting 60-80% of VRAM when requests use shorter contexts.

PagedAttention (introduced by vLLM) applies OS-style virtual memory paging to the KV cache. It allocates cache blocks on-demand, not up-front. The practical result: 2-4x more concurrent requests on the same GPU.

Diagnose: The vllm:kv_cache_usage_perc Prometheus metric (from vLLM's /metrics endpoint) hits 100%; requests queue even when GPU compute isn't saturated.

Fix: Use vLLM (PagedAttention is default). Tune --gpu-memory-utilization 0.92 to give more VRAM headroom to the KV cache. Reduce --max-model-len if context lengths are shorter than the default maximum.

You can also pass --kv-cache-dtype fp8 to store the KV cache in FP8 for additional VRAM savings on H100/Blackwell.
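
Putting the flags from this section together, a launch sketch (the model name and flag values are illustrative assumptions — tune them for your hardware and context lengths):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```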

3. Using FP16 When INT8 or INT4 Would Be Fine

Precision directly controls both VRAM consumption and compute throughput. FP16 uses 2 bytes per parameter. FP8 uses 1 byte. INT4 uses 0.5 bytes.

| Format | Bytes/param | 70B model VRAM | Throughput vs FP16 | Quality delta | GPU support |
|---|---|---|---|---|---|
| FP32 | 4 | 280 GB | 0.5x | Baseline | All |
| FP16/BF16 | 2 | 140 GB | 1x | Baseline | All modern |
| FP8 | 1 | 70 GB | 1.5-2x | <1-2% | H100, H200, Blackwell |
| INT8 | 1 | 70 GB | 1.3-1.7x | <1-2% | Most modern GPUs |
| INT4 | 0.5 | 35 GB | 1.5-2x | 1-4% | Most modern GPUs |
| FP4 | 0.5 | 35 GB | ~3-4x¹ | 1-4% | Blackwell only |

¹ FP4 throughput benchmarks compare against FP8, not FP16 directly. ~2x over FP8 translates to ~3-4x over FP16.
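
The VRAM column above is just parameters × bytes per parameter (weights only — KV cache and activations come on top); a minimal check:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_b: float, fmt: str) -> float:
    # Weight memory only; runtime overhead is extra
    return params_b * BYTES_PER_PARAM[fmt]

for fmt in ("fp16", "fp8", "int4"):
    print(fmt, weight_vram_gb(70, fmt), "GB")  # 140.0, 70.0, 35.0
```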

Diagnose: Check what dtype vLLM is using in server startup logs (look for dtype=float16 vs dtype=float8). nvidia-smi shows the model using more VRAM than expected for the parameter count.

Fix: Pass --quantization fp8 to vLLM on H100 or Blackwell (vLLM's --dtype flag selects float16/bfloat16; FP8 weight quantization goes through --quantization). For Blackwell with FP4, download a pre-calibrated model from the nvidia/ namespace on Hugging Face. For environments where FP8 is not available, use bitsandbytes INT8 by passing a BitsAndBytesConfig(load_in_8bit=True) quantization config to from_pretrained in HuggingFace transformers.

For FP4 specifically on Blackwell GPUs, see our FP4 quantization guide.

4. Single-Request Serving Instead of Continuous Batching

Naively serving one request at a time leaves the GPU 60-80% idle between requests. Even serving fixed batches of N requests wastes time waiting for the slowest request in the batch to finish before accepting new ones.

Continuous batching (also called iteration-level scheduling) lets new requests join mid-generation as soon as a slot frees. GPU utilization climbs to 80-95% under load.
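
The scheduling difference can be illustrated with a toy simulation that counts decode iterations as unit time (an idealized sketch: it ignores prefill cost and memory limits):

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its slowest request finishes."""
    queue, steps = deque(sorted(lengths, reverse=True)), 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)  # the whole batch waits on the longest request
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a pending request joins as soon as a slot frees."""
    pending = deque(sorted(lengths, reverse=True))
    slots, steps = [], 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.popleft())   # refill freed slots immediately
        steps += 1                            # one decode iteration, all slots
        slots = [n - 1 for n in slots if n > 1]
    return steps

# One long request plus twenty short ones, four slots:
work = [100] + [10] * 20
print(static_batch_steps(work, 4))      # 150: short requests wait on the long one
print(continuous_batch_steps(work, 4))  # 100: short requests slip in alongside it
```

The gap widens as request lengths become more varied, which is exactly the mixed-traffic pattern production endpoints see.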

Diagnose: GPU utilization stays at 20-40% even when you're sending many concurrent requests. Each request gets fast individual latency but total throughput doesn't scale with concurrency.

Fix: Use vLLM (continuous batching is on by default; add --max-num-seqs 256 for large queues). If you're using HuggingFace generate() in a loop, that's static serving. Migrate to vLLM.

TGI (text-generation-inference from HuggingFace) also implements continuous batching and is an alternative if vLLM doesn't suit your deployment pattern. For a step-by-step vLLM production setup with multi-GPU and FP8, see the vLLM production deployment guide.

5. Unoptimized Attention Implementation (FlashAttention)

Standard attention computes the full N×N attention matrix in VRAM, where N is context length. Memory scales O(N²). At 128K context, this consumes tens of GB per layer.
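
To make the quadratic term concrete: a single FP16 N×N score matrix at 128K context is already 32 GB, before multiplying by heads (a sketch; real kernels may accumulate in FP32, which doubles this):

```python
def attn_score_matrix_gb(n_ctx: int, bytes_per_val: int = 2) -> float:
    # One N x N attention score matrix for a single head
    return n_ctx * n_ctx * bytes_per_val / 1024**3

print(f"{attn_score_matrix_gb(128 * 1024):.0f} GB")  # -> 32 GB at 128K context
print(f"{attn_score_matrix_gb(8 * 1024):.2f} GB")    # manageable at 8K context
```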

FlashAttention computes attention in tiles that fit in GPU SRAM, avoiding materializing the full matrix. Faster and uses less VRAM at long contexts.

  • FlashAttention 3 (Hopper/H100, H200): further optimized for HBM3.
  • FlashAttention 4 (Blackwell/SM100+): default backend in vLLM v0.17.0+ for B200 and datacenter Blackwell GPUs.

Diagnose: TTFT climbs steeply as context length grows beyond 8K, even at batch size 1. Standard attention at 32K context can be 4-8x slower than FlashAttention.

Fix: Upgrade to vLLM v0.17+ (released March 2026). FlashAttention 3/4 is enabled automatically with no configuration flags. Verify: python -c "import vllm; print(vllm.__version__)". If on an older engine, upgrade or switch to vLLM.

6. Network Bottleneck

Two separate network problems are often conflated:

1. Model download latency: A 70B FP16 model is 140 GB. On a 1 Gbps connection that is 18+ minutes. On 100 Mbps it is 3 hours. If your serving infrastructure downloads the model on every cold start, you are paying GPU time while waiting for a download.

2. API latency (for hosted inference providers): if using a third-party API, round-trip latency to the provider's endpoint adds overhead on every request. This matters most for short outputs where network RTT is a large fraction of total time.
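
The download times above follow directly from model size and link speed; a quick calculator (it ignores protocol overhead and disk write speed, so treat results as lower bounds):

```python
def download_minutes(size_gb: float, link_gbps: float) -> float:
    # size in gigabytes, link in gigabits/sec: factor of 8 converts bytes to bits
    return size_gb * 8 / link_gbps / 60

print(round(download_minutes(140, 1.0)))  # ~19 min for a 70B FP16 model on 1 Gbps
print(round(download_minutes(140, 0.1)))  # ~187 min (~3 h) on 100 Mbps
```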

Diagnose for model download: Time your model load with:

```bash
time python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-70B')"
```

If it takes more than 5 minutes, you are downloading on each start. For API latency: measure with curl -w "%{time_total}" -o /dev/null -s <endpoint>.

Fix:

  • Pre-download model weights to a persistent volume or network-attached storage. On Spheron, attach persistent storage and set HF_HOME or --model to the local path.
  • Cache models in the same region as your instance.
  • For hosted API latency: choose providers with low-latency endpoints, or move to self-hosted inference.
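
A minimal pre-caching sketch using the Hugging Face CLI (the mount path and model name are illustrative assumptions; gated models also need an auth token):

```shell
# One-time: pull weights to persistent storage so cold starts skip the download
export HF_HOME=/mnt/persistent/hf-cache
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct
# Later starts that export the same HF_HOME load from the local cache
```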

7. Wrong Inference Engine for Your Workload

Not every engine is built for every scenario. Running production traffic through Ollama, and running vLLM when TensorRT-LLM would deliver 40% more throughput, are both common mismatches.

| Engine | Best For | Setup | Continuous Batching | FP8/FP4 | Throughput |
|---|---|---|---|---|---|
| vLLM | Most production deployments | Medium | Yes (default) | Yes | High |
| TensorRT-LLM | Max throughput on NVIDIA | High | Yes | Yes (incl. FP4) | Highest |
| SGLang | Agent workloads, structured output | Medium | Yes | Yes | High |
| Ollama | Local single-user testing | Low | No | Limited | Low |
| LMDeploy | NVIDIA TurboMind backend alternative | Medium | Yes | Yes | High |
| llama.cpp | CPU inference, edge/local | Low | No | Yes (GGUF) | Low-Medium |

Diagnose: Compare your measured tokens/sec against published benchmarks for your engine on the same GPU. If you're 30-50% under benchmark, the engine config or version is the likely issue.

Fix: For most teams, vLLM is the right starting point. Migrate with pip install vllm and the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server). The API is compatible with OpenAI's SDK, so client code usually requires no changes.
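
Because the server speaks the OpenAI protocol, the standard openai client works against it unchanged; a sketch (assumes a vLLM server already running on localhost:8000 with the named model loaded — both are illustrative):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key, but the client requires one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```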

GPU Sizing Guide: Which Spheron Instance for Which Model

| Model | Params | FP8/INT8 VRAM | Recommended GPU | VRAM | Approx tok/s (batch 8) | Price/hr |
|---|---|---|---|---|---|---|
| Mistral 7B, Llama 3.1 8B | 7-8B | 7-8 GB | RTX 4090 | 24 GB | 90-120 | $0.51 |
| Llama 2 13B | 13B | 13 GB | L40S | 48 GB | 60-80 | $0.72 |
| Llama 3.1 70B | 70B | 70 GB | H100 SXM5 | 80 GB | 30-45 | $2.40 |
| Llama 3.1 70B (FP16) | 70B | 140 GB | 2x H100 SXM5 | 160 GB | 50-70 | $4.80 |
| Llama 4 Scout (109B MoE) | 17B active | 55 GB (INT4) | H100 SXM5 (INT4) | 80 GB | 40-60 | $2.40 |
| DeepSeek V3.2 685B MoE | 37B active | 690 GB (FP8) | 8x H200 or multi-node | 1,128 GB | varies | varies |
| 405B+ dense | 405B+ | 405 GB (FP8) | 8x H100 SXM5 (FP8) or 3x B200 (INT4) | 640 GB / 576 GB | varies | $19.20+ |

Rent H100 on Spheron | Rent A100 | View all GPU pricing | GPU cost optimization guide

Pricing fluctuates based on GPU availability. The prices above reflect rates as of 29 Mar 2026 and may have changed. Check current GPU pricing for live rates.

Before/After Benchmarks

| Optimization Applied | Before | After | Improvement |
|---|---|---|---|
| Move 70B from CPU offload to H100 FP8 | ~2 tok/s | ~40 tok/s | ~20x |
| Static batching to continuous batching (vLLM) | 40% GPU util | 85% GPU util | ~2x throughput |
| FP16 to FP8 on H100 (70B model) | ~25 tok/s | ~42 tok/s | ~1.7x |
| FP8 to FP4 on B200 (70B model, MLPerf est.)† | ~4,350 tok/s | ~12,840 tok/s | ~2.9x |
| Standard attention to FlashAttention 3 @ 32K ctx | ~120 ms TTFT | ~55 ms TTFT | ~2.2x TTFT |
| No PagedAttention to PagedAttention (KV cache) | 15 concurrent reqs | 60 concurrent reqs | 4x concurrency |
| Model download every cold start to cached | 20-min cold start | <30 s cold start | ~40x startup |

† The ~4,350 tok/s FP8 baseline may reflect H200 FP8 performance from MLPerf Inference v5.0; B200 submissions primarily used FP4, so this baseline is unverified for B200 specifically. The FP4 figure (~12,840 tok/s) is an approximate per-GPU estimate derived from 8-GPU aggregate results divided by 8. Actual results vary by batch size and model architecture.

Quick Reference: If You See X, Try Y

| Symptom | Likely Cause | First Fix |
|---|---|---|
| <5 tok/s at batch 1 for a 7B model | VRAM spillover to CPU | Quantize to INT8 or upgrade GPU |
| GPU util <40% under concurrent load | Static batching | Switch to vLLM continuous batching |
| Requests queuing but GPU not saturated | KV cache full | Reduce --max-model-len or add VRAM |
| TTFT slow at 32K+ context, fast at 2K | No FlashAttention | Upgrade to vLLM v0.17+ |
| Good single-request latency, poor concurrency | No batching | Add --max-num-seqs 256 in vLLM |
| High cost per token vs benchmark | Wrong precision (FP16) | Enable FP8 with --quantization fp8 |
| 20-min cold start per instance | Re-downloading model | Cache model to persistent volume |
| Throughput 30-50% below published benchmarks | Wrong engine or version | Try vLLM latest or TensorRT-LLM |
| OOM errors mid-generation | KV cache overflow | Lower --max-model-len or --max-num-seqs |

If VRAM limits or cost-per-token are the bottleneck holding back your inference setup, Spheron's GPU instances let you test across H100, H200, and B200 configurations at per-minute billing with no commitment.

Rent H100 | Rent A100 | View all GPU pricing

Get started on Spheron

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.