Naive static batching leaves 60% of your GPU idle on average. The three techniques in this guide (continuous batching, PagedAttention, and chunked prefill) compound on each other; together they are what makes vLLM serve 3-5x more traffic than a naive PyTorch inference loop on the same H100. If you need VRAM calculation context before diving into scheduling, start with the KV Cache Optimization guide. For multi-GPU tensor parallelism setup, the vLLM production deployment guide covers that ground first.
This post focuses on the scheduling and memory management layer: what each technique does mechanically, how they interact, and which vLLM parameters to set for maximum throughput on H100s running Llama 3.3 70B. All benchmarks below used vLLM v0.18.0 on H100 SXM5 80GB with FP8 quantization.
TL;DR
| Technique | Problem It Solves | vLLM Flag | Typical Throughput Impact |
|---|---|---|---|
| Static batching (baseline) | N/A | Default in naive frameworks | 30-40% GPU utilization |
| Continuous batching | Idle GPU slots from variable-length requests | On by default in vLLM | +2-3x throughput vs static |
| PagedAttention | KV cache fragmentation and pre-allocation waste | On by default, tune --gpu-memory-utilization | +2-4x concurrent requests |
| Chunked prefill | Head-of-line blocking from long prefills | --enable-chunked-prefill | 50-70% lower TTFT p95 on mixed workloads |
Why Naive Batching Wastes 60% of Your GPU
Static batching groups requests into a fixed batch before sending them to the GPU. The entire batch launches together and finishes together. That sounds efficient until you look at what actually happens with real request distributions.
If you batch 16 requests and the longest generates 512 tokens while the shortest generates 64, the 15 shorter requests hold a GPU slot open for 448 tokens they will never generate. The GPU sits idle in those slots, executing no useful work, waiting for the longest sequence to finish before the batch can be released.
This timeline illustrates it:
Static batching (4 requests, different output lengths):
```
Time -->   [T0]  [T1]   [T2]   [T3]   [T4]   [T5]   [T6]   [T7]
Request A: [tok] [tok]  [tok]  [tok]  [tok]  [tok]  [tok]  [DONE]
Request B: [tok] [tok]  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE]
Request C: [tok] [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]
Request D: [tok] [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]
```

Requests B, C, and D finish early but their slots stay locked until the batch completes. Measurements of real inference workloads have shown padding overhead at 60-80% for typical batch sizes and sequence length distributions (the PagedAttention paper, Kwon et al. 2023, documented this fragmentation and pre-allocation waste systematically). That number holds up in practice: if you deploy vLLM with static batching and check nvidia-smi dmon, you'll see SM utilization hovering around 30-40% even at moderate request rates.
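The padding arithmetic behind those numbers is easy to reproduce. The sketch below (illustrative only, not a benchmark) computes the idle-slot fraction for the 16-request example above; the skewed length distribution pushes waste past 80%:

```python
def static_batch_waste(output_lengths):
    """Fraction of decode slots left idle when a static batch must wait
    for its longest sequence before any slot is released."""
    batch_steps = max(output_lengths)            # batch runs this many decode steps
    total_slots = batch_steps * len(output_lengths)
    useful_slots = sum(output_lengths)           # steps that actually emit a token
    return 1 - useful_slots / total_slots

# The example from the text: 15 requests generate 64 tokens, one generates 512.
print(f"{static_batch_waste([64] * 15 + [512]):.0%} of slots idle")  # → 82% of slots idle
```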
Continuous Batching: Iteration-Level Scheduling
Continuous batching fixes the idle slot problem by operating at the iteration level rather than the batch level. At each decode step, the scheduler checks the request queue. When a request finishes generation, its KV cache blocks are freed and the next queued request is inserted for the following step.
Continuous batching (same 4 requests, same output lengths):
```
Time --> [T0] [T1]     [T2]     [T3]     [T4] [T5]     [T6]     [T7]
Slot 1:  [A]  [A]      [A]      [A]      [A]  [A]      [A]      [A-done]
Slot 2:  [B]  [B]      [B]      [B-done] [E]  [E]      [E]      [E]
Slot 3:  [C]  [C]      [C-done] [F]      [F]  [F-done] [G]      [G]
Slot 4:  [D]  [D-done] [H]      [H]      [H]  [H]      [H-done] [I]
```

No idle slots. New requests (E, F, G, H, I...) fill slots immediately as they open. GPU utilization climbs from 30-40% to 75-85%.
This behavior is the default in vLLM. You don't enable it; it's always on. The parameters that affect scheduling throughput:
- --max-num-seqs (default 1024 in vLLM V1/v0.18.0): maximum concurrent sequences in the scheduler. Raise to 2048 or higher for very high-traffic APIs.
- --max-num-batched-tokens (default: dynamic, typically 8192-32768): total tokens processed per iteration across all sequences. Raise to 16384 or 32768 for throughput-optimized workloads.
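A toy model makes the iteration-level mechanism concrete. This is a simplified sketch, not vLLM's actual scheduler, and the request lengths are arbitrary; the point is that a freed slot is backfilled on the very next decode step:

```python
from collections import deque

def continuous_batch(queue, num_slots, steps):
    """Toy iteration-level scheduler: every step decodes one token per
    occupied slot; a slot freed this step is backfilled from the queue
    on the next step. Returns total tokens produced."""
    pending = deque(queue)      # remaining output length of each queued request
    slots = [None] * num_slots
    produced = 0
    for _ in range(steps):
        for i in range(num_slots):
            if slots[i] is None and pending:
                slots[i] = pending.popleft()   # backfill the empty slot
            if slots[i] is not None:
                slots[i] -= 1                  # decode one token
                produced += 1
                if slots[i] == 0:              # finished: free the slot
                    slots[i] = None
    return produced

# 4 slots, 8 steps, enough queued work to keep every slot busy:
print(continuous_batch([8, 4, 3, 2, 4, 3, 2, 5, 3, 6, 6, 6], 4, 8))  # → 32
```

With a deep enough queue, every one of the 4 × 8 = 32 slot-steps produces a token; the static-batching version of the same workload would leave most of them idle.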
Here's what throughput looks like at different concurrency levels with continuous batching vs static batching on H100 SXM5 80GB, Llama 3.3 70B FP8:
| Concurrent Requests | Static Batching (tok/s) | Continuous Batching (tok/s) | GPU Utilization (CB) |
|---|---|---|---|
| 1 | 118 | 122 | ~45% |
| 4 | 290 | 480 | ~62% |
| 16 | 510 | 1,050 | ~75% |
| 32 | 620 | 1,420 | ~80% |
| 64 | 680 | 1,750 | ~84% |
| 128 | 700 | 1,900 | ~87% |
The gap is minimal at 1 request (nothing to batch). It compounds as concurrency grows.
PagedAttention: Virtual Memory for the KV Cache
Continuous batching solves the scheduling problem. PagedAttention solves the memory problem.
Without PagedAttention, each new request gets a contiguous VRAM reservation sized to max_model_len at arrival. A 32K-token context window means 32K worth of KV blocks reserved upfront, even if the request only generates 200 tokens. As requests arrive and complete at different rates, you end up with a fragmented VRAM landscape: some blocks allocated but unused, some blocks freed but too small to be reused for the next request. This fragmentation prevents new requests from starting even when aggregate free memory looks sufficient.
PagedAttention applies OS-style virtual memory paging to the KV cache. The cache is divided into fixed-size blocks, 16 tokens per block by default. Blocks are allocated on demand as tokens are generated. When a request completes, its blocks are freed immediately and returned to the pool. No contiguous reservation, no fragmentation.
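The allocation scheme can be sketched in a few lines. This toy allocator (not vLLM's internals) captures the two properties that matter: a block is claimed only when the current one fills, and all of a request's blocks return to the pool the moment it completes:

```python
class PagedKVCache:
    """Toy paged allocator: fixed-size blocks, per-request block tables
    mapping logical token positions to physical blocks in the pool."""

    def __init__(self, num_blocks, block_size=16):
        self.free = list(range(num_blocks))  # physical block ids in the pool
        self.block_size = block_size
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> tokens written so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Blocks return to the pool immediately on completion: no contiguous
        # reservation sized to max_model_len, no fragmentation.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

pool = PagedKVCache(num_blocks=4)
for _ in range(20):                  # 20 tokens -> ceil(20 / 16) = 2 blocks
    pool.append_token("req-A")
print(len(pool.tables["req-A"]), len(pool.free))  # → 2 2
pool.release("req-A")
print(len(pool.free))                             # → 4
```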
The block size formula for memory planning:
```
block_size_bytes = num_layers × num_kv_heads × head_dim × 16 tokens × bytes_per_element × 2 (K and V)
```

For Llama 3.3 70B (80 layers, 8 KV heads, 128 head dim), block size depends on the KV cache dtype:
BF16 KV cache (bytes_per_element = 2, non-quantized baseline):
```
80 × 8 × 128 × 16 × 2 × 2 = 5,242,880 bytes ≈ 5 MB per block
```

FP8 KV cache (bytes_per_element = 1, used with --kv-cache-dtype fp8):
```
80 × 8 × 128 × 16 × 1 × 2 = 2,621,440 bytes ≈ 2.5 MB per block
```

An H100 SXM5 80GB after loading 70B FP8 weights (~70 GB) leaves roughly 10 GB for the KV cache pool. With BF16 KV cache that is ~2,000 blocks at 5 MB each, representing ~32,000 tokens of simultaneous in-flight KV cache. With FP8 KV cache (the recommended config below uses --kv-cache-dtype fp8), the same pool holds ~4,000 blocks at 2.5 MB each, representing ~64,000 tokens. Halving the KV cache dtype doubles your concurrent capacity.
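The two calculations above fit in a small helper. The ~10 GB pool size is the rough estimate from the text, not a measured value:

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim, bytes_per_elem, block_size=16):
    """Bytes per KV cache block: layers x KV heads x head dim x tokens
    per block x dtype width x 2 (separate K and V tensors)."""
    return num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem * 2

# Llama 3.3 70B: 80 layers, 8 KV heads, head dim 128
bf16_block = kv_block_bytes(80, 8, 128, 2)  # 5,242,880 bytes, ~5 MB
fp8_block  = kv_block_bytes(80, 8, 128, 1)  # 2,621,440 bytes, ~2.5 MB

pool_bytes = 10 * 1024**3  # the ~10 GB left after FP8 weights, per the text
print(pool_bytes // bf16_block * 16)  # → 32768 tokens in flight (BF16 cache)
print(pool_bytes // fp8_block * 16)   # → 65536 tokens in flight (FP8 cache)
```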
PagedAttention is always active in vLLM. There is no --enable-paged-attention flag. The knob you tune is --gpu-memory-utilization, which controls what fraction of available VRAM vLLM reserves for the KV cache pool after weights load.
Relevant parameters:
- --gpu-memory-utilization (default 0.90): fraction of remaining VRAM given to the KV cache pool. Raise to 0.95 on bare-metal instances.
- --block-size (default 16): tokens per KV cache block. Rarely needs changing.
- --max-model-len: reduce below the model's maximum context window to increase the number of available blocks.
How pool size affects concurrent capacity at different context lengths:
| --gpu-memory-utilization | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
|---|---|---|
| 0.80 | ~12 | ~2 |
| 0.90 | ~16 | ~3 |
| 0.95 | ~20 | ~4 |
For a deeper look at KV cache memory calculations and quantization options (FP8, NVFP4), the KV Cache Optimization guide has complete VRAM tables for Llama 3.1 70B at every context length.
Chunked Prefill: Eliminating Head-of-Line Blocking
Continuous batching and PagedAttention handle throughput and memory. Chunked prefill handles latency, specifically the TTFT spikes that appear when long-context requests share the GPU with short interactive queries.
During prefill, the model processes the entire input prompt in a single forward pass. A 32K-token prompt requires 32K tokens of computation in one step. That computation takes 200-400 ms on an H100 for a 70B model. Every other request in the batch waits for that prefill to finish before they can continue generating tokens.
This is head-of-line blocking: a slow request at the front of the queue blocks all requests behind it. In practice, it produces bimodal TTFT distributions: p50 looks fine, but p95 is 5-10x worse because occasionally a long-context request lands in your batch.
Chunked prefill splits the prefill into N-token chunks. Between each chunk, the scheduler interleaves decode steps from other active sequences. The long-context request still gets processed but in slices, while other requests continue making progress.
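A minimal sketch of the chunking logic follows. It simplifies vLLM's scheduler, which prioritizes decode tokens within the per-step budget; the 40-sequence and 2,048-token figures are illustrative:

```python
def schedule_step(prefill_remaining, decode_seqs, token_budget):
    """One toy scheduler step: active decodes get one token each first,
    and whatever budget is left becomes this step's prefill chunk."""
    decode_tokens = min(decode_seqs, token_budget)
    chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, chunk

# A 32K-token prompt arrives while 40 sequences are mid-decode,
# with a 2,048-token per-step budget:
remaining, steps = 32_768, 0
while remaining > 0:
    _, chunk = schedule_step(remaining, decode_seqs=40, token_budget=2048)
    remaining -= chunk
    steps += 1
print(steps)  # → 17: the prefill is sliced across 17 steps, and the 40
              # decoding sequences keep emitting tokens at every one of them
```

Without chunking, the same prompt occupies one long blocking forward pass; with it, the worst a decode sequence ever waits is one budget-limited step.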
TTFT with and without chunked prefill, 50 concurrent requests on H100 SXM5 80GB (10% of requests are 32K-token inputs, remainder are 1K-token inputs):
| Input Length | TTFT p50 (no chunked prefill) | TTFT p50 (with chunked prefill) | TTFT p95 (no chunked prefill) | TTFT p95 (with chunked prefill) |
|---|---|---|---|---|
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |
The p50 barely changes. The p95 improvement at 32K inputs is 68%: from 2,800 ms to 890 ms. Short queries no longer spike when a long-context request lands in the scheduler.
One constraint to be aware of: --enable-chunked-prefill cannot be used with draft-model-based speculative decoding (--speculative-model) in vLLM v0.18.0. However, this limitation does not apply to all speculative decoding methods. In vLLM V1 (v0.18.0), NGram GPU speculative decoding was added with chunked prefill support, so the two can coexist for that method. Check the v0.18.0 release notes for your specific speculative decoding approach before assuming incompatibility. For speculative decoding, see the speculative decoding production guide.
Relevant parameters:
- --enable-chunked-prefill: must be set explicitly; not on by default.
- --max-num-batched-tokens: when chunked prefill is active, this limits the token budget per scheduler step. Lower values (2048) force more interleaving; higher values (16384) prioritize raw throughput over latency fairness.
Hands-On: Tuning vLLM Parameters for H100 Throughput
Default configuration (baseline)
```
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 1024 \
  --max-model-len 8192
```

Optimized configuration for H100 SXM5
```
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --max-model-len 32768
```

For first-time vLLM setup on Spheron, Spheron's LLM quick-guides walk through the instance provisioning steps.
Parameter reference
| Parameter | Default | Recommended (H100 SXM5) | What It Controls | When to Change |
|---|---|---|---|---|
| --gpu-memory-utilization | 0.90 | 0.95 | Fraction of VRAM reserved for KV cache pool | Raise on dedicated bare-metal; lower if OOM on startup |
| --max-num-seqs | 1024 (vLLM V1) | 2048 | Max concurrent sequences in scheduler | Raise for very high-concurrency APIs |
| --max-num-batched-tokens | Dynamic | 16384 | Max tokens per iteration | Raise for throughput; lower if TTFT spikes |
| --enable-chunked-prefill | Off | On for mixed workloads | Interleave long prefills with decode steps | Enable when workload mixes short and long inputs |
| --max-model-len | Model max | 32768 | Context window cap | Reduce to increase KV cache block count |
| --tensor-parallel-size | 1 | 1 (single H100) | Splits model across GPUs | Use for 70B FP16 or 405B+ models |
Benchmark Results: Throughput and Latency on H100
Throughput vs batch size (Llama 3.3 70B FP8, H100 SXM5 80GB, vLLM v0.18.0)
Default config uses --gpu-memory-utilization 0.90, --max-num-seqs 1024. Optimized config adds --gpu-memory-utilization 0.95, --max-num-seqs 2048, --max-num-batched-tokens 16384, --enable-chunked-prefill.
| Concurrent Requests | Default Config (tok/s) | Optimized Config (tok/s) | Improvement |
|---|---|---|---|
| 1 | 122 | 125 | +2% |
| 4 | 480 | 510 | +6% |
| 16 | 1,050 | 1,240 | +18% |
| 32 | 1,420 | 1,720 | +21% |
| 64 | 1,750 | 2,100 | +20% |
| 128 | 1,900 | 2,380 | +25% |
At low concurrency the gains are small: the bottleneck is the model itself, not the scheduler. At 64+ concurrent requests, the optimized configuration sustains ~25% higher throughput. The optimized 128-request number (2,380 tok/s) is directionally consistent with the vLLM vs TensorRT-LLM vs SGLang benchmarks, which showed vLLM reaching 2,400 tok/s at 100 concurrent requests with default settings.
TTFT with and without chunked prefill (50 concurrent requests, 10% long-context inputs)
| Input Length | Without Chunked Prefill p50 | With Chunked Prefill p50 | Without Chunked Prefill p95 | With Chunked Prefill p95 |
|---|---|---|---|---|
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |
KV cache capacity at different pool sizes (Llama 3.3 70B FP8, H100 SXM5 80GB)
| --gpu-memory-utilization | Approx KV Cache Pool | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
|---|---|---|---|
| 0.80 | ~8 GB | ~12 | ~2 |
| 0.90 | ~9 GB | ~16 | ~3 |
| 0.95 | ~9.5 GB | ~20 | ~4 |
When to Use Each Technique: Decision Matrix
| Workload Type | Use Continuous Batching | Tune PagedAttention Blocks | Enable Chunked Prefill |
|---|---|---|---|
| Interactive chatbot (short prompts, <1K tokens) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Optional; minimal benefit |
| RAG pipeline (medium prompts, 2K-8K context) | Yes (default) | Raise --gpu-memory-utilization, reduce --max-model-len | Yes, if mixing short and long inputs |
| Long-context summarization (32K+ inputs) | Yes (default) | Critical: maximize pool size, reduce --max-model-len to actual max | Yes, required for acceptable TTFT |
| Code generation (variable output length) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Yes for mixed-length inputs |
| Batch offline inference (throughput-only) | Yes (default) | Maximize --gpu-memory-utilization | Optional; raise --max-num-batched-tokens instead |
Cost Impact: How Proper Batching Cuts GPU Spend
Live pricing from the Spheron GPU catalog, fetched 04 Apr 2026:
- H100 PCIe 80GB on-demand: $2.01/hr
- H100 SXM5 80GB on-demand: $2.40/hr
- A100 SXM4 80GB on-demand: $1.06/hr
Concrete example: serving 1,800 tok/s sustained throughput.
Baseline (default vLLM config, static-like utilization):
- 4x H100 PCIe at 40% average GPU utilization
- Each GPU delivers ~450 tok/s
- Monthly cost at 720 hrs: 4 × $2.01 × 720 = $5,789/month
- Cost per 1M output tokens: ($8.04 / 3600) / (1,800 / 1,000,000) = $1.24/1M tokens
Optimized (continuous batching + PagedAttention tuned + chunked prefill):
- 2x H100 PCIe at 85% average GPU utilization
- Each GPU delivers ~900 tok/s
- Monthly cost at 720 hrs: 2 × $2.01 × 720 = $2,894/month
- Cost per 1M output tokens: ($4.02 / 3600) / (1,800 / 1,000,000) = $0.62/1M tokens
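The cost arithmetic above packs into a reusable helper. Rates and the 1,800 tok/s target come from the example; the utilization figures are the scenario's assumptions:

```python
def cost_per_million_tokens(num_gpus, hourly_rate, tok_per_s):
    """Cluster $/hr divided by output tokens per hour, scaled to 1M tokens."""
    tokens_per_hour = tok_per_s * 3600
    return num_gpus * hourly_rate / tokens_per_hour * 1_000_000

# Same 1,800 tok/s target served two ways (H100 PCIe at $2.01/hr):
print(f"${cost_per_million_tokens(4, 2.01, 1800):.2f}")  # → $1.24 (4 GPUs, ~40% util)
print(f"${cost_per_million_tokens(2, 2.01, 1800):.2f}")  # → $0.62 (2 GPUs, ~85% util)
```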
| Config | GPUs | Monthly Cost | Cost per 1M Output Tokens |
|---|---|---|---|
| Default vLLM, 40% utilization | 4x H100 PCIe | $5,789 | $1.24 |
| Optimized vLLM, 85% utilization | 2x H100 PCIe | $2,894 | $0.62 |
| AWS p4d.24xlarge (8x A100) | 8x A100 40GB | ~$32,000+ | ~$4.00+ |
| GCP a2-highgpu-8g (8x A100) | 8x A100 40GB | ~$26,000+ | ~$3.20+ |
On Spheron, bare-metal access means these parameters are fully exposed. Serverless inference APIs don't let you set --max-num-batched-tokens or --enable-chunked-prefill. You're tuning a black box at best.
Pricing fluctuates based on GPU availability. The prices above were fetched on 04 Apr 2026 and may have changed; check the Spheron GPU pricing page for live rates.
Putting It All Together
Deploy in this order: continuous batching first (already on by default in vLLM, no config change needed), then tune the PagedAttention pool size by raising --gpu-memory-utilization to 0.95, then add chunked prefill if TTFT p95 is the constraint.
The reason for that order: continuous batching is free; you get it without touching config. PagedAttention tuning directly unlocks more concurrent capacity: more VRAM in the pool means more requests in flight. Add chunked prefill last because it slightly increases p50 TTFT (the cost of interleaving) in exchange for dramatically better p95.
At 128+ concurrent requests on H100 SXM5, the combination of all three techniques typically delivers 2,200-2,400 tok/s for Llama 3.3 70B FP8. That's roughly 25% above the default vLLM configuration and 3-4x above a naive PyTorch inference loop.
One caveat: these numbers assume uniform output lengths. Real workloads with highly variable output lengths will see higher gains from continuous batching and lower gains from chunked prefill. Benchmark against your actual traffic distribution, not synthetic prompts.
For further optimization after applying these three techniques:
- Speculative decoding guide for low-concurrency workloads where per-request latency matters more than aggregate throughput
- LoRA multi-adapter serving guide for deployments running multiple fine-tuned adapters on a single model
- vLLM vs TensorRT-LLM vs SGLang benchmarks if you're still deciding whether to stay on vLLM or switch frameworks
These optimizations require full control over your serving stack. Spheron's bare-metal H100 instances expose every vLLM parameter, so you're not tuning a black box.
