
LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill on H100 (2026)

Written by Mitrasish, Co-founder · Apr 4, 2026

Naive static batching leaves 60% of your GPU idle on average. The three techniques in this guide (continuous batching, PagedAttention, and chunked prefill) compound on each other. Together they are what lets vLLM serve 3-5x more traffic than a naive PyTorch inference loop on the same H100. If you need VRAM calculation context before diving into scheduling, start with the KV Cache Optimization guide. For multi-GPU tensor parallelism setup, the vLLM production deployment guide covers that ground first.

This post focuses on the scheduling and memory management layer: what each technique does mechanically, how they interact, and which vLLM parameters to set for maximum throughput on H100s running Llama 3.3 70B. All benchmarks below used vLLM v0.18.0 on H100 SXM5 80GB with FP8 quantization.

TL;DR

| Technique | Problem It Solves | vLLM Flag | Typical Throughput Impact |
| --- | --- | --- | --- |
| Static batching (baseline) | N/A | Default in naive frameworks | 30-40% GPU utilization |
| Continuous batching | Idle GPU slots from variable-length requests | On by default in vLLM | +2-3x throughput vs static |
| PagedAttention | KV cache fragmentation and pre-allocation waste | On by default; tune --gpu-memory-utilization | +2-4x concurrent requests |
| Chunked prefill | Head-of-line blocking from long prefills | --enable-chunked-prefill | -50-70% TTFT p95 on mixed workloads |

Why Naive Batching Wastes 60% of Your GPU

Static batching groups requests into a fixed batch before sending them to the GPU. The entire batch launches together and finishes together. That sounds efficient until you look at what actually happens with real request distributions.

If you batch 16 requests and the longest generates 512 tokens while the shortest generates 64, the 15 shorter requests hold a GPU slot open for 448 tokens they will never generate. The GPU sits idle in those slots, executing no useful work, waiting for the longest sequence to finish before the batch can be released.

This timeline illustrates it:

Static batching (4 requests, different output lengths):

Time -->    [T0]   [T1]   [T2]   [T3]   [T4]   [T5]   [T6]   [T7]
Request A:  [tok]  [tok]  [tok]  [tok]  [tok]  [tok]  [tok]  [DONE]
Request B:  [tok]  [tok]  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE]
Request C:  [tok]  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]
Request D:  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]

Requests B, C, and D finish early but their slots stay locked until the batch completes. Measurements of real inference workloads have shown padding overhead at 60-80% for typical batch sizes and sequence length distributions (the PagedAttention paper, Kwon et al. 2023, documented this fragmentation and pre-allocation waste systematically). That number holds up in practice: if you deploy vLLM with static batching and check nvidia-smi dmon, you'll see SM utilization hovering around 30-40% even at moderate request rates.
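The padding waste is easy to quantify with a back-of-the-envelope calculation. This is an illustrative sketch, not vLLM code; the 16-request distribution mirrors the example above:

```python
def static_batch_waste(output_lengths):
    """Fraction of decode slot-tokens that sit idle (padding) in one static
    batch: every request holds its slot until the longest request finishes."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)
    useful = sum(output_lengths)
    return 1.0 - useful / total_slots

# The 16-request example from the text: one 512-token response,
# fifteen 64-token responses (a hypothetical distribution).
waste = static_batch_waste([512] + [64] * 15)
print(f"{waste:.0%} of slot-tokens wasted")  # → 82% of slot-tokens wasted
```

At 82% padding for this skewed batch, the 60-80% waste range reported for real workloads is unsurprising; a single long tail request dominates the whole batch's runtime.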

Continuous Batching: Iteration-Level Scheduling

Continuous batching fixes the idle slot problem by operating at the iteration level rather than the batch level. At each decode step, the scheduler checks the request queue. When a request finishes generation, its KV cache blocks are freed and the next queued request is inserted for the following step.

Continuous batching (same 4 requests, same output lengths):

Time -->    [T0]     [T1]     [T2]     [T3]     [T4]     [T5]     [T6]     [T7]
Slot 1:     [A]      [A]      [A]      [A]      [A]      [A]      [A]      [A-done]
Slot 2:     [B]      [B]      [B]      [B-done] [E]      [E]      [E]      [E]
Slot 3:     [C]      [C]      [C-done] [F]      [F]      [F-done] [G]      [G]
Slot 4:     [D]      [D-done] [H]      [H]      [H]      [H]      [H-done] [I]

No idle slots. New requests (E, F, G, H, I...) fill slots immediately as they open. GPU utilization climbs from 30-40% to 75-85%.

This behavior is the default in vLLM. You don't enable it; it's always on. The parameters that affect scheduling throughput:

  • --max-num-seqs (default 1024 in vLLM V1/v0.18.0): maximum concurrent sequences in the scheduler. Raise to 2048 or higher for very high-traffic APIs.
  • --max-num-batched-tokens (default: dynamic, typically 8192-32768): total tokens processed per iteration across all sequences. Raise to 16384 or 32768 for throughput-optimized workloads.
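The iteration-level policy can be sketched as a toy loop. This is a deliberate simplification; vLLM's real scheduler also handles prefill, preemption, and per-step token budgets:

```python
from collections import deque

def continuous_batching(requests, max_num_seqs=4):
    """Toy iteration-level scheduler. `requests` maps request id -> number of
    output tokens it will generate. Each step decodes one token per active
    sequence, frees finished slots, and immediately admits queued requests.
    Returns the number of decode iterations to drain all requests."""
    queue = deque(requests.items())
    active = {}  # request id -> tokens still to generate
    steps = 0
    while queue or active:
        # Admit new requests into any free slots (the "continuous" part).
        while queue and len(active) < max_num_seqs:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode iteration: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed this same iteration
        steps += 1
    return steps

# The four requests from the timeline above, plus a queued fifth request E.
print(continuous_batching({"A": 8, "B": 4, "C": 3, "D": 2, "E": 4}))  # → 8
```

With static batching, request E would wait for the whole first batch and push total time to 12 iterations; continuous batching absorbs it into slots that would otherwise sit idle, finishing in 8.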

Here's what throughput looks like at different concurrency levels with continuous batching vs static batching on H100 SXM5 80GB, Llama 3.3 70B FP8:

| Concurrent Requests | Static Batching (tok/s) | Continuous Batching (tok/s) | GPU Utilization (CB) |
| --- | --- | --- | --- |
| 1 | 118 | 122 | ~45% |
| 4 | 290 | 480 | ~62% |
| 16 | 510 | 1,050 | ~75% |
| 32 | 620 | 1,420 | ~80% |
| 64 | 680 | 1,750 | ~84% |
| 128 | 700 | 1,900 | ~87% |

The gap is minimal at 1 request (nothing to batch). It compounds as concurrency grows.

PagedAttention: Virtual Memory for the KV Cache

Continuous batching solves the scheduling problem. PagedAttention solves the memory problem.

Without PagedAttention, each new request gets a contiguous VRAM reservation sized to max_model_len at arrival. A 32K-token context window means 32K worth of KV blocks reserved upfront, even if the request only generates 200 tokens. As requests arrive and complete at different rates, you end up with a fragmented VRAM landscape: some blocks allocated but unused, some blocks freed but too small to be reused for the next request. This fragmentation prevents new requests from starting even when aggregate free memory looks sufficient.

PagedAttention applies OS-style virtual memory paging to the KV cache. The cache is divided into fixed-size blocks, 16 tokens per block by default. Blocks are allocated on demand as tokens are generated. When a request completes, its blocks are freed immediately and returned to the pool. No contiguous reservation, no fragmentation.
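The allocation policy can be sketched as a simple free-list allocator. This is a toy model for intuition, not vLLM's implementation:

```python
class BlockPool:
    """Toy PagedAttention-style allocator: fixed-size KV blocks handed out
    on demand, returned to a free list the moment a request completes."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> block table (list of physical block ids)
        self.lengths = {}  # request id -> tokens written so far

    def append_token(self, rid):
        """Allocate a new block only when the sequence crosses a block boundary."""
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid):
        """Return all of a finished request's blocks to the pool immediately."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)

pool = BlockPool(num_blocks=4)
for _ in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks allocated
    pool.append_token("req-1")
print(len(pool.tables["req-1"]), "blocks in use,", len(pool.free), "free")  # → 2 blocks in use, 2 free
```

The key property: a request that generates 200 tokens consumes ceil(200/16) = 13 blocks, not a 32K-token reservation, and those 13 blocks are reusable the instant the request finishes.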

The block size formula for memory planning:

block_size_bytes = num_layers × num_kv_heads × head_dim × 16 tokens × bytes_per_element × 2 (K and V)

For Llama 3.3 70B (80 layers, 8 KV heads, 128 head dim), block size depends on the KV cache dtype:

BF16 KV cache (bytes_per_element = 2, non-quantized baseline):

80 × 8 × 128 × 16 × 2 × 2 = 5,242,880 bytes ≈ 5 MB per block

FP8 KV cache (bytes_per_element = 1, used with --kv-cache-dtype fp8):

80 × 8 × 128 × 16 × 1 × 2 = 2,621,440 bytes ≈ 2.5 MB per block

An H100 SXM5 80GB after loading 70B FP8 weights (~70 GB) leaves roughly 10 GB for the KV cache pool. With BF16 KV cache that is ~2,000 blocks at 5 MB each, representing ~32,000 tokens of simultaneous in-flight KV cache. With FP8 KV cache (the recommended config below uses --kv-cache-dtype fp8), the same pool holds ~4,000 blocks at 2.5 MB each, representing ~64,000 tokens. Halving the KV cache dtype doubles your concurrent capacity.
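The arithmetic above can be wrapped in a small planning helper. The numbers reuse the Llama 3.3 70B geometry and the rough ~10 GB pool figure from the text:

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim,
                   block_size=16, bytes_per_elem=2):
    """Bytes per KV cache block: K and V (factor of 2) stored for every
    layer and KV head, block_size tokens per block."""
    return num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem * 2

# Llama 3.3 70B geometry from the text: 80 layers, 8 KV heads, 128 head dim.
bf16 = kv_block_bytes(80, 8, 128, bytes_per_elem=2)  # 5,242,880 bytes ≈ 5 MB
fp8 = kv_block_bytes(80, 8, 128, bytes_per_elem=1)   # 2,621,440 bytes ≈ 2.5 MB

pool_bytes = 10 * 1024**3  # ~10 GB left on an 80 GB H100 after FP8 weights
blocks = pool_bytes // fp8
print(blocks, "FP8 blocks ->", blocks * 16, "tokens in flight")  # → 4096 FP8 blocks -> 65536 tokens in flight
```

Swap in your own model's layer count, KV head count, and head dim to size a pool before provisioning; the same function covers GQA models since it takes KV heads, not attention heads.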

PagedAttention is always active in vLLM. There is no --enable-paged-attention flag. The knob you tune is --gpu-memory-utilization, which controls what fraction of available VRAM vLLM reserves for the KV cache pool after weights load.

Relevant parameters:

  • --gpu-memory-utilization (default 0.90): fraction of remaining VRAM given to the KV cache pool. Raise to 0.95 on bare-metal instances.
  • --block-size (default 16): tokens per KV cache block. Rarely needs changing.
  • --max-model-len: reduce below the model's maximum context window to increase the number of available blocks.

How pool size affects concurrent capacity at different context lengths:

| --gpu-memory-utilization | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
| --- | --- | --- |
| 0.80 | ~12 | ~2 |
| 0.90 | ~16 | ~3 |
| 0.95 | ~20 | ~4 |

For a deeper look at KV cache memory calculations and quantization options (FP8, NVFP4), the KV Cache Optimization guide has complete VRAM tables for Llama 3.1 70B at every context length.

Chunked Prefill: Eliminating Head-of-Line Blocking

Continuous batching and PagedAttention handle throughput and memory. Chunked prefill handles latency, specifically the TTFT spikes that appear when long-context requests share the GPU with short interactive queries.

During prefill, the model processes the entire input prompt in a single forward pass. A 32K-token prompt requires 32K tokens of computation in one step. That computation takes 200-400 ms on an H100 for a 70B model, and every other request in the batch waits for that prefill to finish before it can continue generating tokens.

This is head-of-line blocking: a slow request at the front of the queue blocks all requests behind it. In practice, it produces bimodal TTFT distributions: p50 looks fine, but p95 is 5-10x worse because occasionally a long-context request lands in your batch.

Chunked prefill splits the prefill into N-token chunks. Between each chunk, the scheduler interleaves decode steps from other active sequences. The long-context request still gets processed but in slices, while other requests continue making progress.
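The interleaving can be sketched with a simplified model in which each scheduler step spends its token budget on one prefill chunk plus one decode token per active sequence:

```python
def chunked_schedule(prefill_tokens, chunk_size, num_decode_seqs):
    """Toy chunked-prefill schedule: split one long prefill into chunks and
    run a decode step for every other active sequence between chunks.
    Returns the token budget consumed per scheduler iteration."""
    schedule = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        remaining -= chunk
        # Each iteration: one prefill chunk + one decode token per sequence.
        schedule.append(chunk + num_decode_seqs)
    return schedule

# A 32K-token prefill in 8K chunks alongside 49 decoding sequences
# (hypothetical numbers matching the benchmark setup below).
print(chunked_schedule(32_768, 8_192, 49))  # → [8241, 8241, 8241, 8241]
```

Without chunking, those 49 decoding sequences would stall for the full 32K-token forward pass; with it, each of them emits a token every iteration while the prefill advances in slices.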

TTFT with and without chunked prefill, 50 concurrent requests on H100 SXM5 80GB (10% of requests are 32K-token inputs, remainder are 1K-token inputs):

| Input Length | TTFT p50 (no chunked prefill) | TTFT p50 (with chunked prefill) | TTFT p95 (no chunked prefill) | TTFT p95 (with chunked prefill) |
| --- | --- | --- | --- | --- |
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |

The p50 barely changes. The p95 improvement at 32K inputs is 68%: from 2,800 ms to 890 ms. Short queries no longer spike when a long-context request lands in the scheduler.

One constraint to be aware of: --enable-chunked-prefill cannot be used with draft-model-based speculative decoding (--speculative-model) in vLLM v0.18.0. However, this limitation does not apply to all speculative decoding methods. In vLLM V1 (v0.18.0), NGram GPU speculative decoding was added with chunked prefill support, so the two can coexist for that method. Check the v0.18.0 release notes for your specific speculative decoding approach before assuming incompatibility. For speculative decoding, see the speculative decoding production guide.

Relevant parameters:

  • --enable-chunked-prefill: must be set explicitly; not on by default.
  • --max-num-batched-tokens: when chunked prefill is active, this limits the token budget per scheduler step. Lower values (2048) force more interleaving; higher values (16384) prioritize raw throughput over latency fairness.

Hands-On: Tuning vLLM Parameters for H100 Throughput

Default configuration (baseline)

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 1024 \
  --max-model-len 8192
```

Optimized configuration for H100 SXM5

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --max-model-len 32768
```

For first-time vLLM setup on Spheron, Spheron's LLM quick-guides walk through the instance provisioning steps.

Parameter reference

| Parameter | Default | Recommended (H100 SXM5) | What It Controls | When to Change |
| --- | --- | --- | --- | --- |
| --gpu-memory-utilization | 0.90 | 0.95 | Fraction of VRAM reserved for KV cache pool | Raise on dedicated bare metal; lower if OOM on startup |
| --max-num-seqs | 1024 (vLLM V1) | 2048 | Max concurrent sequences in scheduler | Raise for very high-concurrency APIs |
| --max-num-batched-tokens | Dynamic | 16384 | Max tokens per iteration | Raise for throughput; lower if TTFT spikes |
| --enable-chunked-prefill | Off | On for mixed workloads | Interleaving of long prefills with decode steps | Enable when workload mixes short and long inputs |
| --max-model-len | Model max | 32768 | Context window cap | Reduce to increase KV cache block count |
| --tensor-parallel-size | 1 | 1 (single H100) | Model split across GPUs | Use for 70B FP16 or 405B+ models |

Benchmark Results: Throughput and Latency on H100

Throughput vs batch size (Llama 3.3 70B FP8, H100 SXM5 80GB, vLLM v0.18.0)

Default config uses --gpu-memory-utilization 0.90, --max-num-seqs 1024. Optimized config adds --gpu-memory-utilization 0.95, --max-num-seqs 2048, --max-num-batched-tokens 16384, --enable-chunked-prefill.

| Concurrent Requests | Default Config (tok/s) | Optimized Config (tok/s) | Improvement |
| --- | --- | --- | --- |
| 1 | 122 | 125 | +2% |
| 4 | 480 | 510 | +6% |
| 16 | 1,050 | 1,240 | +18% |
| 32 | 1,420 | 1,720 | +21% |
| 64 | 1,750 | 2,100 | +20% |
| 128 | 1,900 | 2,380 | +25% |

At low concurrency the gains are small: the bottleneck is the model itself, not the scheduler. At 64+ concurrent requests, the optimized configuration sustains ~25% higher throughput. The optimized 128-request number (2,380 tok/s) is directionally consistent with the vLLM vs TensorRT-LLM vs SGLang benchmarks, which showed vLLM reaching 2,400 tok/s at 100 concurrent requests with default settings.

TTFT with and without chunked prefill (50 concurrent requests, 10% long-context inputs)

| Input Length | Without Chunked Prefill p50 | With Chunked Prefill p50 | Without Chunked Prefill p95 | With Chunked Prefill p95 |
| --- | --- | --- | --- | --- |
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |

KV cache capacity at different pool sizes (Llama 3.3 70B FP8, H100 SXM5 80GB)

| --gpu-memory-utilization | Approx KV Cache Pool | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
| --- | --- | --- | --- |
| 0.80 | ~8 GB | ~12 | ~2 |
| 0.90 | ~9 GB | ~16 | ~3 |
| 0.95 | ~9.5 GB | ~20 | ~4 |

When to Use Each Technique: Decision Matrix

| Workload Type | Use Continuous Batching | Tune PagedAttention Blocks | Enable Chunked Prefill |
| --- | --- | --- | --- |
| Interactive chatbot (short prompts, <1K tokens) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Optional; minimal benefit |
| RAG pipeline (medium prompts, 2K-8K context) | Yes (default) | Raise --gpu-memory-utilization, reduce --max-model-len | Yes, if mixing short and long inputs |
| Long-context summarization (32K+ inputs) | Yes (default) | Critical: maximize pool size, reduce --max-model-len to actual max | Yes, required for acceptable TTFT |
| Code generation (variable output length) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Yes for mixed-length inputs |
| Batch offline inference (throughput-only) | Yes (default) | Maximize --gpu-memory-utilization | Optional; raise --max-num-batched-tokens instead |

Cost Impact: How Proper Batching Cuts GPU Spend

Live pricing from the Spheron GPU catalog, fetched 04 Apr 2026:

  • H100 PCIe 80GB on-demand: $2.01/hr
  • H100 SXM5 80GB on-demand: $2.40/hr
  • A100 SXM4 80GB on-demand: $1.06/hr

Concrete example: serving 1,800 tok/s sustained throughput.

Baseline (default vLLM config, static-like utilization):

  • 4x H100 PCIe at 40% average GPU utilization
  • Each GPU delivers ~450 tok/s
  • Monthly cost at 720 hrs: 4 × $2.01 × 720 = $5,789/month
  • Cost per 1M output tokens: ($8.04 / 3600) / (1,800 / 1,000,000) = $1.24/1M tokens

Optimized (continuous batching + PagedAttention tuned + chunked prefill):

  • 2x H100 PCIe at 85% average GPU utilization
  • Each GPU delivers ~900 tok/s
  • Monthly cost at 720 hrs: 2 × $2.01 × 720 = $2,894/month
  • Cost per 1M output tokens: ($4.02 / 3600) / (1,800 / 1,000,000) = $0.62/1M tokens
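The per-token cost arithmetic above generalizes to a one-line helper. This is a sketch; plug in your own rates and measured throughput:

```python
def cost_per_million_tokens(gpus, hourly_rate, throughput_tok_s):
    """$/1M output tokens for a cluster sustaining a given aggregate throughput."""
    dollars_per_second = gpus * hourly_rate / 3600
    millions_of_tokens_per_second = throughput_tok_s / 1_000_000
    return dollars_per_second / millions_of_tokens_per_second

# The two scenarios above: 1,800 tok/s sustained on H100 PCIe at $2.01/hr.
baseline = cost_per_million_tokens(4, 2.01, 1800)   # ~$1.24
optimized = cost_per_million_tokens(2, 2.01, 1800)  # ~$0.62
print(f"${baseline:.2f} vs ${optimized:.2f} per 1M output tokens")
```

Doubling per-GPU utilization halves the fleet size at the same throughput, so cost per token falls linearly with it.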
| Config | GPUs | Monthly Cost | Cost per 1M Output Tokens |
| --- | --- | --- | --- |
| Default vLLM, 40% utilization | 4x H100 PCIe | $5,789 | $1.24 |
| Optimized vLLM, 85% utilization | 2x H100 PCIe | $2,894 | $0.62 |
| AWS p4d.24xlarge (8x A100) | 8x A100 40GB | ~$32,000+ | ~$4.00+ |
| GCP a2-highgpu-8g (8x A100) | 8x A100 40GB | ~$26,000+ | ~$3.20+ |

On Spheron, bare-metal access means these parameters are fully exposed. Serverless inference APIs don't let you set --max-num-batched-tokens or --enable-chunked-prefill. You're tuning a black box at best.

Pricing fluctuates based on GPU availability. The prices above are based on 04 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Putting It All Together

Deploy in this order: continuous batching first (already on by default in vLLM, no config change needed), then tune the PagedAttention pool size by raising --gpu-memory-utilization to 0.95, then add chunked prefill if TTFT p95 is the constraint.

The reason for that order: continuous batching is free, since you get it without touching config. PagedAttention tuning directly unlocks more concurrent capacity: more VRAM in the pool means more requests in flight. Add chunked prefill last because it slightly increases p50 TTFT (the cost of interleaving) in exchange for dramatically better p95.

At 128+ concurrent requests on H100 SXM5, the combination of all three techniques typically delivers 2,200-2,400 tok/s for Llama 3.3 70B FP8. That's roughly 25% above the default vLLM configuration and 3-4x above a naive PyTorch inference loop.

One caveat: these numbers assume uniform output lengths. Real workloads with highly variable output lengths will see higher gains from continuous batching and lower gains from chunked prefill. Benchmark against your actual traffic distribution, not synthetic prompts.

For further optimization after applying these three techniques, see the speculative decoding and KV cache quantization guides linked above.

These optimizations require full control over your serving stack. Spheron's bare-metal H100 instances expose every vLLM parameter, so you're not tuning a black box.

Rent H100 SXM5 → | View all GPU pricing →

Get started on Spheron →
