If your LLM endpoint is returning tokens slower than expected, one of seven problems is almost always responsible. Some of these are GPU configuration issues. Some are software choices. A few are infrastructure decisions that compound each other. This guide covers each one: what causes it, how to confirm it, and the specific fix. For GPU VRAM sizing before you start diagnosing, see our GPU memory requirements guide. For GPU selection by cost-per-token, see the AI inference GPU guide.
TL;DR
- VRAM spillover: model doesn't fit in GPU memory, computation offloads to CPU at 10-50x slower throughput. Fix: quantize or upgrade GPU.
- No PagedAttention: KV cache pre-allocation wastes 60-80% of VRAM, limiting concurrency. Fix: use vLLM.
- FP16 when lower precision works: unnecessary VRAM pressure and slower tensor ops. Fix: --quantization fp8 on H100/Blackwell.
- Static batching: serving one request at a time leaves the GPU 60-80% idle. Fix: enable continuous batching in vLLM.
- Slow attention kernel: standard attention is O(n²) memory; long contexts become painfully slow. Fix: vLLM v0.17+ with FlashAttention 3/4.
- Network bottleneck: 70B model downloads take 30-60 min; API round-trips add latency. Fix: cache models to persistent storage.
- Wrong inference engine: Ollama is built for local, single-user use, not production traffic. Fix: migrate to vLLM or TensorRT-LLM.
1. Wrong GPU: Your Model Doesn't Fit in VRAM
When model weights exceed VRAM, the framework (PyTorch, transformers, Ollama) silently offloads layers to CPU. A 70B model at FP16 is 140 GB; that won't fit on a single 80 GB H100. The CPU offload path (PCIe plus system RAM) is 10-50x slower than on-device HBM, so throughput collapses.
Diagnose: nvidia-smi shows VRAM usage near 100%. CPU usage is elevated during inference. htop shows system RAM being consumed. Tokens/sec is under 5 for models that should do 40+.
Fix: Quantize to FP8 or INT8 (halves weight memory), or move to a GPU with adequate VRAM.
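As a back-of-envelope check before renting hardware, weight memory is roughly parameters × bytes per parameter (1B params at 1 byte/param is ~1 GB). This sketch estimates weights alone; real deployments need extra headroom for the KV cache and activations:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone, in GB."""
    # 1B params at 1 byte/param is ~1 GB, so this reduces to a product.
    return params_billions * bytes_per_param

# 70B at FP16 (2 bytes/param) needs ~140 GB -- more than one 80 GB H100.
print(weight_vram_gb(70, 2))    # 140.0
# The same model at FP8 (1 byte/param) fits a single H100 with room to spare.
print(weight_vram_gb(70, 1))    # 70.0
```

These numbers match the FP16 and FP8/INT8 columns in the table above.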
| Model Size | FP16 VRAM | FP8/INT8 VRAM | INT4 VRAM | Recommended GPU | On-Demand Price |
|---|---|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB | RTX 4090 (24 GB) | $0.51/hr |
| 13B | 26 GB | 13 GB | 6.5 GB | L40S (48 GB) | $0.72/hr |
| 30B | 60 GB | 30 GB | 15 GB | L40S (48 GB) FP8 | $0.72/hr |
| 70B | 140 GB | 70 GB | 35 GB | H100 SXM5 80 GB (FP8) | $2.40/hr |
| 70B FP16 | 140 GB | — | — | 2x H100 SXM5 | $4.80/hr |
| 405B | 810 GB | 405 GB | 203 GB | 8x H100 SXM5 (FP8) or 3x B200 (INT4, tensor parallel) | varies |
| 685B MoE | 1.37 TB | 685 GB | 343 GB | 8x H200 or multi-node | varies |
For exact VRAM calculations by model, see our GPU memory requirements guide for LLMs.
Rent an H100 on Spheron or compare all GPU instances.
2. No KV Cache Optimization (PagedAttention)
During autoregressive generation, the model stores key/value tensors for every token in the context. At 128K context, this can consume 40+ GB per concurrent request on a 70B model.
Traditional serving frameworks pre-allocate maximum context-length KV memory per request slot, wasting 60-80% of VRAM when requests use shorter contexts.
PagedAttention (introduced by vLLM) applies OS-style virtual memory paging to the KV cache. It allocates cache blocks on-demand, not up-front. The practical result: 2-4x more concurrent requests on the same GPU.
Diagnose: The vllm:kv_cache_usage_perc Prometheus metric (from vLLM's /metrics endpoint) hits 100%; requests queue even when GPU compute isn't saturated.
Fix: Use vLLM (PagedAttention is the default). Raise --gpu-memory-utilization (e.g. to 0.92 from the 0.90 default) to give the KV cache more VRAM. Reduce --max-model-len if your context lengths are shorter than the model's default maximum.
You can also pass --kv-cache-dtype fp8 to store the KV cache in FP8 for additional VRAM savings on H100/Blackwell.
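The 40+ GB figure above is easy to verify: the per-request KV cache is 2 (keys and values) × layers × KV heads × head dim × dtype bytes × tokens. The Llama 3.1 70B shape used below (80 layers, 8 GQA KV heads, head dim 128) comes from the public model config and is illustrative:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 dtype_bytes: int, context_len: int) -> float:
    """Per-request KV cache size in GiB for a GQA transformer."""
    # Factor of 2 covers both the key and the value tensor per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len / 1024**3

# Llama 3.1 70B at FP16, full 128K context: 40 GiB per concurrent request.
print(kv_cache_gib(80, 8, 128, 2, 131072))   # 40.0
# Same request with --kv-cache-dtype fp8 (1 byte/value): half the cache.
print(kv_cache_gib(80, 8, 128, 1, 131072))   # 20.0
```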
3. Using FP16 When INT8 or INT4 Would Be Fine
Precision directly controls both VRAM consumption and compute throughput. FP16 uses 2 bytes per parameter. FP8 uses 1 byte. INT4 uses 0.5 bytes.
| Format | Bytes/param | 70B model VRAM | Throughput vs FP16 | Quality delta | GPU support |
|---|---|---|---|---|---|
| FP32 | 4 | 280 GB | 0.5x | Baseline | All |
| FP16/BF16 | 2 | 140 GB | 1x | Baseline | All modern |
| FP8 | 1 | 70 GB | 1.5-2x | <1-2% | H100, H200, Blackwell |
| INT8 | 1 | 70 GB | 1.3-1.7x | <1-2% | Most modern GPUs |
| INT4 | 0.5 | 35 GB | 1.5-2x | 1-4% | Most modern GPUs |
| FP4 | 0.5 | 35 GB | ~3-4x¹ | 1-4% | Blackwell only |
¹ FP4 throughput benchmarks compare against FP8, not FP16 directly. ~2x over FP8 translates to ~3-4x over FP16.
Diagnose: Check what dtype vLLM is using in server startup logs (look for dtype=float16 vs dtype=float8). nvidia-smi shows the model using more VRAM than expected for the parameter count.
Fix: Pass --quantization fp8 to vLLM on H100 or Blackwell. For Blackwell with FP4, download a pre-calibrated model from the nvidia/ namespace on Hugging Face. Where FP8 hardware support is not available, use bitsandbytes INT8 (BitsAndBytesConfig(load_in_8bit=True) in HuggingFace transformers).
For FP4 specifically on Blackwell GPUs, see our FP4 quantization guide.
4. Single-Request Serving Instead of Continuous Batching
Naively serving one request at a time leaves the GPU 60-80% idle between requests. Even serving fixed batches of N requests wastes time waiting for the slowest request in the batch to finish before accepting new ones.
Continuous batching (also called iteration-level scheduling) lets new requests join mid-generation as soon as a slot frees. GPU utilization climbs to 80-95% under load.
Diagnose: GPU utilization stays at 20-40% even when you're sending many concurrent requests. Each request gets fast individual latency but total throughput doesn't scale with concurrency.
Fix: Use vLLM (continuous batching is on by default; add --max-num-seqs 256 for large queues). If you're using HuggingFace generate() in a loop, that's static serving. Migrate to vLLM.
TGI (text-generation-inference from HuggingFace) also implements continuous batching and is an alternative if vLLM doesn't suit your deployment pattern. For a step-by-step vLLM production setup with multi-GPU and FP8, see the vLLM production deployment guide.
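The throughput gap is easy to see in a toy simulation. This is not vLLM code, just a sketch of iteration-level scheduling under stated assumptions: each "step" generates one token for every active sequence, and continuous batching refills a freed slot immediately instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching_steps(lengths, max_batch):
    # Iteration-level scheduling: every step generates one token per active
    # sequence; a finished slot is refilled from the queue immediately.
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [n - 1 for n in active if n > 1]  # drop finished sequences
        steps += 1
    return steps

def static_batching_steps(lengths, max_batch):
    # Fixed batches: the whole batch waits for its slowest sequence.
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

# Mixed short (10-token) and long (100-token) requests, batch size 4.
work = [10, 100, 10, 100, 10, 100, 10, 100]
print(static_batching_steps(work, 4))      # 200 steps
print(continuous_batching_steps(work, 4))  # 130 steps: short requests
                                           # no longer block the batch
```

With mixed request lengths the short requests stop paying for the long ones, which is exactly where the 80-95% utilization figure comes from.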
5. Unoptimized Attention Implementation (FlashAttention)
Standard attention materializes the full N×N attention matrix in VRAM, where N is context length, so memory scales O(N²). At 128K context this is over 30 GB per attention head at FP16.
FlashAttention computes attention in tiles that fit in GPU SRAM, avoiding materializing the full matrix. Faster and uses less VRAM at long contexts.
- FlashAttention 3 (Hopper/H100, H200): further optimized for HBM3.
- FlashAttention 4 (Blackwell/SM100+): default backend in vLLM v0.17.0+ for B200 and datacenter Blackwell GPUs.
Diagnose: TTFT climbs steeply as context length grows beyond 8K, even at batch size 1. Standard attention at 32K context can be 4-8x slower than FlashAttention.
Fix: Upgrade to vLLM v0.17+ (released March 2026). FlashAttention 3/4 is enabled automatically with no configuration flags. Verify: python -c "import vllm; print(vllm.__version__)". If on an older engine, upgrade or switch to vLLM.
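The quadratic blow-up is easy to quantify. A rough per-head estimate of the N×N score matrix that standard attention materializes (FP16 scores, ignoring softmax temporaries):

```python
def attn_scores_gib(context_len: int, dtype_bytes: int = 2) -> float:
    """Memory for one head's full N x N attention score matrix, in GiB."""
    return context_len ** 2 * dtype_bytes / 1024**3

print(attn_scores_gib(8_192))     # 0.125 GiB: manageable
print(attn_scores_gib(131_072))   # 32.0 GiB per head: why FlashAttention
                                  # computes in SRAM-sized tiles instead
```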
6. Network Bottleneck
Two separate network problems are often conflated:
1. Model download latency: A 70B FP16 model is 140 GB. On a 1 Gbps connection that is 18+ minutes. On 100 Mbps it is 3 hours. If your serving infrastructure downloads the model on every cold start, you are paying GPU time while waiting for a download.
2. API latency (for hosted inference providers): if using a third-party API, round-trip latency to the provider's endpoint adds overhead on every request. This matters most for short outputs where network RTT is a large fraction of total time.
Diagnose for model download: Time your model load with time python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-70B')". If it takes more than 5 minutes, you are downloading on each start. For API latency, measure with curl -w "%{time_total}" -o /dev/null -s <endpoint>.
Fix:
- Pre-download model weights to a persistent volume or network-attached storage. On Spheron, attach persistent storage and point HF_HOME or --model to the local path.
- Cache models in the same region as your instance.
- For hosted API latency: choose providers with low-latency endpoints, or move to self-hosted inference.
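The download-time numbers above follow from simple arithmetic (model sizes are in bytes, link speeds in bits; protocol overhead is ignored in this sketch):

```python
def download_minutes(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time in minutes; multiply GB by 8 to get gigabits."""
    return size_gb * 8 / link_gbps / 60

print(round(download_minutes(140, 1.0)))  # 19 min: 70B FP16 model at 1 Gbps
print(round(download_minutes(140, 0.1)))  # 187 min (~3 h) at 100 Mbps
```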
7. Wrong Inference Engine for Your Workload
Not every engine is built for every scenario. Using Ollama for production traffic or running vLLM when TensorRT-LLM would give 40% more throughput are both common mismatches.
| Engine | Best For | Setup | Continuous Batching | FP8/FP4 | Throughput |
|---|---|---|---|---|---|
| vLLM | Most production deployments | Medium | Yes (default) | Yes | High |
| TensorRT-LLM | Max throughput on NVIDIA | High | Yes | Yes (incl. FP4) | Highest |
| SGLang | Agent workloads, structured output | Medium | Yes | Yes | High |
| Ollama | Local single-user testing | Low | No | Limited | Low |
| LMDeploy | NVIDIA TurboMind backend alternative | Medium | Yes | Yes | High |
| llama.cpp | CPU inference, edge/local | Low | No | Yes (GGUF) | Low-Medium |
Diagnose: Compare your measured tokens/sec against published benchmarks for your engine on the same GPU. If you're 30-50% under benchmark, the engine config or version is the likely issue.
Fix: For most teams, vLLM is the right starting point. Migrate with pip install vllm and the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server). The API is compatible with OpenAI's SDK, so client code usually requires no changes.
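Migration really is mostly a base-URL change. A minimal sketch of the request body an existing OpenAI-style client sends, pointed at a local vLLM server; the port (vLLM's default 8000) and model name are illustrative assumptions:

```python
import json

# Assumed local endpoint: vLLM's OpenAI-compatible server defaults to port
# 8000, and the model name must match what the server was launched with.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain KV caching in one line."}],
    "max_tokens": 64,
}
# The identical body works against api.openai.com -- only the URL differs,
# which is why client code usually needs no changes.
body = json.dumps(payload)
print(sorted(payload))   # ['max_tokens', 'messages', 'model']
```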
GPU Sizing Guide: Which Spheron Instance for Which Model
| Model | Params | FP8/INT8 VRAM | Recommended GPU | VRAM | Approx tok/s (batch 8) | Price/hr |
|---|---|---|---|---|---|---|
| Mistral 7B, Llama 3.1 8B | 7-8B | 7-8 GB | RTX 4090 | 24 GB | 90-120 | $0.51 |
| Llama 2 13B | 13B | 13 GB | L40S | 48 GB | 60-80 | $0.72 |
| Llama 3.1 70B | 70B | 70 GB | H100 SXM5 | 80 GB | 30-45 | $2.40 |
| Llama 3.1 70B (FP16) | 70B | 140 GB | 2x H100 SXM5 | 160 GB | 50-70 | $4.80 |
| Llama 4 Scout (109B MoE) | 17B active | 55 GB (INT4) | H100 SXM5 (INT4) | 80 GB | 40-60 | $2.40 |
| DeepSeek V3.2 685B MoE | 37B active | 690 GB (FP8) | 8x H200 or multi-node | 1,128 GB | varies | varies |
| 405B+ dense | 405B+ | 405 GB (FP8) | 8x H100 SXM5 (FP8) or 3x B200 (INT4) | 640 GB / 576 GB | varies | $19.20+ |
Rent H100 on Spheron | Rent A100 | View all GPU pricing | GPU cost optimization guide
Pricing fluctuates based on GPU availability. The prices above are based on 29 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Before/After Benchmarks
| Optimization Applied | Before | After | Improvement |
|---|---|---|---|
| Move 70B from CPU offload to H100 FP8 | ~2 tok/s | ~40 tok/s | ~20x |
| Static batching to continuous batching (vLLM) | 40% GPU util | 85% GPU util | ~2x throughput |
| FP16 to FP8 on H100 (70B model) | ~25 tok/s | ~42 tok/s | ~1.7x |
| FP8 to FP4 on B200 (70B model, MLPerf est.)† | ~4,350 tok/s | ~12,840 tok/s | ~2.9x |
| Standard attention to FlashAttention 3 @ 32K ctx | ~120ms TTFT | ~55ms TTFT | ~2.2x TTFT |
| No PagedAttention to PagedAttention (KV cache) | 15 concurrent reqs | 60 concurrent reqs | 4x concurrency |
| Model download every cold start to cached | 20-min cold start | <30s cold start | ~40x startup |
† The ~4,350 tok/s FP8 baseline may reflect H200 FP8 performance from MLPerf Inference v5.0; B200 submissions primarily used FP4, so this baseline is unverified for B200 specifically. The FP4 figure (~12,840 tok/s) is an approximate per-GPU estimate derived from 8-GPU aggregate results divided by 8. Actual results vary by batch size and model architecture.
Quick Reference: If You See X, Try Y
| Symptom | Likely Cause | First Fix |
|---|---|---|
| <5 tok/s at batch 1 for a 7B model | VRAM spillover to CPU | Quantize to INT8 or upgrade GPU |
| GPU util <40% under concurrent load | Static batching | Switch to vLLM continuous batching |
| Requests queuing but GPU not saturated | KV cache full | Reduce --max-model-len or add VRAM |
| TTFT slow at 32K+ context, fast at 2K | No FlashAttention | Upgrade to vLLM v0.17+ |
| Good single-request latency, poor concurrency | No batching | Add --max-num-seqs 256 in vLLM |
| High cost per token vs benchmark | Wrong precision (FP16) | Enable FP8 with --quantization fp8 |
| 20-min cold start per instance | Re-downloading model | Cache model to persistent volume |
| Throughput 30-50% below published benchmarks | Wrong engine or version | Try vLLM latest or TensorRT-LLM |
| OOM errors mid-generation | KV cache overflow | Lower --max-model-len or --max-num-seqs |
If VRAM limits or cost-per-token are the bottleneck holding back your inference setup, Spheron's GPU instances let you test across H100, H200, and B200 configurations at per-minute billing with no commitment.
