
LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill on H100 (2026)

Written by Mitrasish, Co-founder · Apr 4, 2026

Naive static batching leaves 60% of your GPU idle on average. The three techniques in this guide (continuous batching, PagedAttention, and chunked prefill) compound on each other. Together they are what lets vLLM serve 3-5x more traffic than a naive PyTorch inference loop on the same H100. If you need VRAM calculation context before diving into scheduling, start with the KV Cache Optimization guide. For multi-GPU tensor parallelism setup, the vLLM production deployment guide covers that ground first.

This post focuses on the scheduling and memory management layer: what each technique does mechanically, how they interact, and which vLLM parameters to set for maximum throughput on H100s running Llama 3.3 70B. All benchmarks below used vLLM v0.18.0 on H100 SXM5 80GB with FP8 quantization.

TL;DR

| Technique | Problem It Solves | vLLM Flag | Typical Throughput Impact |
| --- | --- | --- | --- |
| Static batching (baseline) | N/A | Default in naive frameworks | 30-40% GPU utilization |
| Continuous batching | Idle GPU slots from variable-length requests | On by default in vLLM | +2-3x throughput vs static |
| PagedAttention | KV cache fragmentation and pre-allocation waste | On by default; tune --gpu-memory-utilization | +2-4x concurrent requests |
| Chunked prefill | Head-of-line blocking from long prefills | --enable-chunked-prefill | -50-70% TTFT p95 on mixed workloads |

Why Naive Batching Wastes 60% of Your GPU

Static batching groups requests into a fixed batch before sending them to the GPU. The entire batch launches together and finishes together. That sounds efficient until you look at what actually happens with real request distributions.

If you batch 16 requests and the longest generates 512 tokens while the shortest generates 64, the 15 shorter requests hold a GPU slot open for 448 tokens they will never generate. The GPU sits idle in those slots, executing no useful work, waiting for the longest sequence to finish before the batch can be released.

This timeline illustrates it:

Static batching (4 requests, different output lengths):

Time -->    [T0]   [T1]   [T2]   [T3]   [T4]   [T5]   [T6]   [T7]
Request A:  [tok]  [tok]  [tok]  [tok]  [tok]  [tok]  [tok]  [DONE]
Request B:  [tok]  [tok]  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE]
Request C:  [tok]  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]
Request D:  [tok]  [DONE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE] [IDLE]

Requests B, C, and D finish early but their slots stay locked until the batch completes. Measurements of real inference workloads have shown padding overhead at 60-80% for typical batch sizes and sequence length distributions (the PagedAttention paper, Kwon et al. 2023, documented this fragmentation and pre-allocation waste systematically). That number holds up in practice: if you deploy vLLM with static batching and check nvidia-smi dmon, you'll see SM utilization hovering around 30-40% even at moderate request rates.
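The padding waste is easy to quantify with a back-of-the-envelope calculation. This is an illustrative sketch, not vLLM code; the 16-request distribution mirrors the example above:

```python
def static_batch_waste(output_lengths):
    """Fraction of decode slot-tokens that sit idle (padding) in one static
    batch: every request holds its slot until the longest request finishes."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)
    useful = sum(output_lengths)
    return 1.0 - useful / total_slots

# The 16-request example from the text: one 512-token response,
# fifteen 64-token responses (a hypothetical distribution).
waste = static_batch_waste([512] + [64] * 15)
print(f"{waste:.0%} of slot-tokens wasted")  # → 82% of slot-tokens wasted
```

At 82% padding for this skewed batch, the 60-80% waste range reported for real workloads is unsurprising; a single long tail request dominates the whole batch's runtime.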

Continuous Batching: Iteration-Level Scheduling

Continuous batching fixes the idle slot problem by operating at the iteration level rather than the batch level. At each decode step, the scheduler checks the request queue. When a request finishes generation, its KV cache blocks are freed and the next queued request is inserted for the following step.

Continuous batching (same 4 requests, same output lengths):

Time -->    [T0]     [T1]     [T2]     [T3]     [T4]     [T5]     [T6]     [T7]
Slot 1:     [A]      [A]      [A]      [A]      [A]      [A]      [A]      [A-done]
Slot 2:     [B]      [B]      [B]      [B-done] [E]      [E]      [E]      [E]
Slot 3:     [C]      [C]      [C-done] [F]      [F]      [F-done] [G]      [G]
Slot 4:     [D]      [D-done] [H]      [H]      [H]      [H]      [H-done] [I]

No idle slots. New requests (E, F, G, H, I...) fill slots immediately as they open. GPU utilization climbs from 30-40% to 75-85%.

This behavior is the default in vLLM. You don't enable it; it's always on. The parameters that affect scheduling throughput:

  • --max-num-seqs (default 1024 in vLLM V1/v0.18.0): maximum concurrent sequences in the scheduler. Raise to 2048 or higher for very high-traffic APIs.
  • --max-num-batched-tokens (default: dynamic, typically 8192-32768): total tokens processed per iteration across all sequences. Raise to 16384 or 32768 for throughput-optimized workloads.
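The iteration-level policy can be sketched as a toy loop. This is a deliberate simplification; vLLM's real scheduler also handles prefill, preemption, and per-step token budgets:

```python
from collections import deque

def continuous_batching(requests, max_num_seqs=4):
    """Toy iteration-level scheduler. `requests` maps request id -> number of
    output tokens it will generate. Each step decodes one token per active
    sequence, frees finished slots, and immediately admits queued requests.
    Returns the number of decode iterations to drain all requests."""
    queue = deque(requests.items())
    active = {}  # request id -> tokens still to generate
    steps = 0
    while queue or active:
        # Admit new requests into any free slots (the "continuous" part).
        while queue and len(active) < max_num_seqs:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode iteration: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed this same iteration
        steps += 1
    return steps

# The four requests from the timeline above, plus a queued fifth request E.
print(continuous_batching({"A": 8, "B": 4, "C": 3, "D": 2, "E": 4}))  # → 8
```

With static batching, request E would wait for the whole first batch and push total time to 12 iterations; continuous batching absorbs it into slots that would otherwise sit idle, finishing in 8.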

Here's what throughput looks like at different concurrency levels with continuous batching vs static batching on H100 SXM5 80GB, Llama 3.3 70B FP8:

| Concurrent Requests | Static Batching (tok/s) | Continuous Batching (tok/s) | GPU Utilization (CB) |
| --- | --- | --- | --- |
| 1 | 118 | 122 | ~45% |
| 4 | 290 | 480 | ~62% |
| 16 | 510 | 1,050 | ~75% |
| 32 | 620 | 1,420 | ~80% |
| 64 | 680 | 1,750 | ~84% |
| 128 | 700 | 1,900 | ~87% |

The gap is minimal at 1 request (nothing to batch). It compounds as concurrency grows.

PagedAttention: Virtual Memory for the KV Cache

Continuous batching solves the scheduling problem. PagedAttention solves the memory problem.

Without PagedAttention, each new request gets a contiguous VRAM reservation sized to max_model_len at arrival. A 32K-token context window means 32K worth of KV blocks reserved upfront, even if the request only generates 200 tokens. As requests arrive and complete at different rates, you end up with a fragmented VRAM landscape: some blocks allocated but unused, some blocks freed but too small to be reused for the next request. This fragmentation prevents new requests from starting even when aggregate free memory looks sufficient.

PagedAttention applies OS-style virtual memory paging to the KV cache. The cache is divided into fixed-size blocks, 16 tokens per block by default. Blocks are allocated on demand as tokens are generated. When a request completes, its blocks are freed immediately and returned to the pool. No contiguous reservation, no fragmentation.
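The allocation policy can be sketched as a simple free-list allocator. This is a toy model for intuition, not vLLM's implementation:

```python
class BlockPool:
    """Toy PagedAttention-style allocator: fixed-size KV blocks handed out
    on demand, returned to a free list the moment a request completes."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> block table (list of physical block ids)
        self.lengths = {}  # request id -> tokens written so far

    def append_token(self, rid):
        """Allocate a new block only when the sequence crosses a block boundary."""
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid):
        """Return all of a finished request's blocks to the pool immediately."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)

pool = BlockPool(num_blocks=4)
for _ in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks allocated
    pool.append_token("req-1")
print(len(pool.tables["req-1"]), "blocks in use,", len(pool.free), "free")  # → 2 blocks in use, 2 free
```

The key property: a request that generates 200 tokens consumes ceil(200/16) = 13 blocks, not a 32K-token reservation, and those 13 blocks are reusable the instant the request finishes.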

The block size formula for memory planning:

block_size_bytes = num_layers × num_kv_heads × head_dim × 16 tokens × bytes_per_element × 2 (K and V)

For Llama 3.3 70B (80 layers, 8 KV heads, 128 head dim), block size depends on the KV cache dtype:

BF16 KV cache (bytes_per_element = 2, non-quantized baseline):

80 × 8 × 128 × 16 × 2 × 2 = 5,242,880 bytes ≈ 5 MB per block

FP8 KV cache (bytes_per_element = 1, used with --kv-cache-dtype fp8):

80 × 8 × 128 × 16 × 1 × 2 = 2,621,440 bytes ≈ 2.5 MB per block

An H100 SXM5 80GB after loading 70B FP8 weights (~70 GB) leaves roughly 10 GB for the KV cache pool. With BF16 KV cache that is ~2,000 blocks at 5 MB each, representing ~32,000 tokens of simultaneous in-flight KV cache. With FP8 KV cache (the recommended config below uses --kv-cache-dtype fp8), the same pool holds ~4,000 blocks at 2.5 MB each, representing ~64,000 tokens. Halving the KV cache dtype doubles your concurrent capacity.
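The arithmetic above can be wrapped in a small planning helper. The numbers reuse the Llama 3.3 70B geometry and the rough ~10 GB pool figure from the text:

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim,
                   block_size=16, bytes_per_elem=2):
    """Bytes per KV cache block: K and V (factor of 2) stored for every
    layer and KV head, block_size tokens per block."""
    return num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem * 2

# Llama 3.3 70B geometry from the text: 80 layers, 8 KV heads, 128 head dim.
bf16 = kv_block_bytes(80, 8, 128, bytes_per_elem=2)  # 5,242,880 bytes ≈ 5 MB
fp8 = kv_block_bytes(80, 8, 128, bytes_per_elem=1)   # 2,621,440 bytes ≈ 2.5 MB

pool_bytes = 10 * 1024**3  # ~10 GB left on an 80 GB H100 after FP8 weights
blocks = pool_bytes // fp8
print(blocks, "FP8 blocks ->", blocks * 16, "tokens in flight")  # → 4096 FP8 blocks -> 65536 tokens in flight
```

Swap in your own model's layer count, KV head count, and head dim to size a pool before provisioning; the same function covers GQA models since it takes KV heads, not attention heads.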

PagedAttention is always active in vLLM. There is no --enable-paged-attention flag. The knob you tune is --gpu-memory-utilization, which controls what fraction of available VRAM vLLM reserves for the KV cache pool after weights load.

Relevant parameters:

  • --gpu-memory-utilization (default 0.90): fraction of remaining VRAM given to the KV cache pool. Raise to 0.95 on bare-metal instances.
  • --block-size (default 16): tokens per KV cache block. Rarely needs changing.
  • --max-model-len: reduce below the model's maximum context window to increase the number of available blocks.

How pool size affects concurrent capacity at different context lengths:

| --gpu-memory-utilization | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
| --- | --- | --- |
| 0.80 | ~12 | ~2 |
| 0.90 | ~16 | ~3 |
| 0.95 | ~20 | ~4 |

For a deeper look at KV cache memory calculations and quantization options (FP8, NVFP4), the KV Cache Optimization guide has complete VRAM tables for Llama 3.1 70B at every context length.

Chunked Prefill: Eliminating Head-of-Line Blocking

Continuous batching and PagedAttention handle throughput and memory. Chunked prefill handles latency, specifically the TTFT spikes that appear when long-context requests share the GPU with short interactive queries.

During prefill, the model processes the entire input prompt in a single forward pass. A 32K-token prompt requires 32K tokens of computation in one step. That computation takes 200-400 ms on an H100 for a 70B model, and every other request in the batch waits for that prefill to finish before it can continue generating tokens.

This is head-of-line blocking: a slow request at the front of the queue blocks all requests behind it. In practice, it produces bimodal TTFT distributions: p50 looks fine, but p95 is 5-10x worse because occasionally a long-context request lands in your batch.

Chunked prefill splits the prefill into N-token chunks. Between each chunk, the scheduler interleaves decode steps from other active sequences. The long-context request still gets processed but in slices, while other requests continue making progress.
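The interleaving can be sketched with a simplified model in which each scheduler step spends its token budget on one prefill chunk plus one decode token per active sequence:

```python
def chunked_schedule(prefill_tokens, chunk_size, num_decode_seqs):
    """Toy chunked-prefill schedule: split one long prefill into chunks and
    run a decode step for every other active sequence between chunks.
    Returns the token budget consumed per scheduler iteration."""
    schedule = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        remaining -= chunk
        # Each iteration: one prefill chunk + one decode token per sequence.
        schedule.append(chunk + num_decode_seqs)
    return schedule

# A 32K-token prefill in 8K chunks alongside 49 decoding sequences
# (hypothetical numbers matching the benchmark setup below).
print(chunked_schedule(32_768, 8_192, 49))  # → [8241, 8241, 8241, 8241]
```

Without chunking, those 49 decoding sequences would stall for the full 32K-token forward pass; with it, each of them emits a token every iteration while the prefill advances in slices.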

TTFT with and without chunked prefill, 50 concurrent requests on H100 SXM5 80GB (10% of requests are 32K-token inputs, remainder are 1K-token inputs):

| Input Length | TTFT p50 (no chunked prefill) | TTFT p50 (with chunked prefill) | TTFT p95 (no chunked prefill) | TTFT p95 (with chunked prefill) |
| --- | --- | --- | --- | --- |
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |

The p50 barely changes. The p95 improvement at 32K inputs is 68%: from 2,800 ms to 890 ms. Short queries no longer spike when a long-context request lands in the scheduler.

One constraint to be aware of: --enable-chunked-prefill cannot be used with draft-model-based speculative decoding (--speculative-model) in vLLM v0.18.0. However, this limitation does not apply to all speculative decoding methods. In vLLM V1 (v0.18.0), NGram GPU speculative decoding was added with chunked prefill support, so the two can coexist for that method. Check the v0.18.0 release notes for your specific speculative decoding approach before assuming incompatibility. For speculative decoding, see the speculative decoding production guide.

Relevant parameters:

  • --enable-chunked-prefill: must be set explicitly; not on by default.
  • --max-num-batched-tokens: when chunked prefill is active, this limits the token budget per scheduler step. Lower values (2048) force more interleaving; higher values (16384) prioritize raw throughput over latency fairness.

Hands-On: Tuning vLLM Parameters for H100 Throughput

Default configuration (baseline)

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 1024 \
  --max-model-len 8192
```

Optimized configuration for H100 SXM5

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --max-model-len 32768
```

For first-time vLLM setup on Spheron, Spheron's LLM quick-guides walk through the instance provisioning steps.

Parameter reference

| Parameter | Default | Recommended (H100 SXM5) | What It Controls | When to Change |
| --- | --- | --- | --- | --- |
| --gpu-memory-utilization | 0.90 | 0.95 | Fraction of VRAM reserved for KV cache pool | Raise on dedicated bare metal; lower if OOM on startup |
| --max-num-seqs | 1024 (vLLM V1) | 2048 | Max concurrent sequences in scheduler | Raise for very high-concurrency APIs |
| --max-num-batched-tokens | Dynamic | 16384 | Max tokens per iteration | Raise for throughput; lower if TTFT spikes |
| --enable-chunked-prefill | Off | On for mixed workloads | Interleaving of long prefills with decode steps | Enable when workload mixes short and long inputs |
| --max-model-len | Model max | 32768 | Context window cap | Reduce to increase KV cache block count |
| --tensor-parallel-size | 1 | 1 (single H100) | Model split across GPUs | Use for 70B FP16 or 405B+ models |

Benchmark Results: Throughput and Latency on H100

Throughput vs batch size (Llama 3.3 70B FP8, H100 SXM5 80GB, vLLM v0.18.0)

Default config uses --gpu-memory-utilization 0.90, --max-num-seqs 1024. Optimized config adds --gpu-memory-utilization 0.95, --max-num-seqs 2048, --max-num-batched-tokens 16384, --enable-chunked-prefill.

| Concurrent Requests | Default Config (tok/s) | Optimized Config (tok/s) | Improvement |
| --- | --- | --- | --- |
| 1 | 122 | 125 | +2% |
| 4 | 480 | 510 | +6% |
| 16 | 1,050 | 1,240 | +18% |
| 32 | 1,420 | 1,720 | +21% |
| 64 | 1,750 | 2,100 | +20% |
| 128 | 1,900 | 2,380 | +25% |

At low concurrency the gains are small: the bottleneck is the model itself, not the scheduler. At 64+ concurrent requests, the optimized configuration sustains ~25% higher throughput. The optimized 128-request number (2,380 tok/s) is directionally consistent with the vLLM vs TensorRT-LLM vs SGLang benchmarks, which showed vLLM reaching 2,400 tok/s at 100 concurrent requests with default settings.

TTFT with and without chunked prefill (50 concurrent requests, 10% long-context inputs)

| Input Length | Without Chunked Prefill p50 | With Chunked Prefill p50 | Without Chunked Prefill p95 | With Chunked Prefill p95 |
| --- | --- | --- | --- | --- |
| 1K tokens | 380 ms | 390 ms | 720 ms | 480 ms |
| 8K tokens | 420 ms | 430 ms | 1,100 ms | 620 ms |
| 32K tokens | 680 ms | 720 ms | 2,800 ms | 890 ms |

KV cache capacity at different pool sizes (Llama 3.3 70B FP8, H100 SXM5 80GB)

| --gpu-memory-utilization | Approx KV Cache Pool | Max Concurrent at 4K Context | Max Concurrent at 32K Context |
| --- | --- | --- | --- |
| 0.80 | ~8 GB | ~12 | ~2 |
| 0.90 | ~9 GB | ~16 | ~3 |
| 0.95 | ~9.5 GB | ~20 | ~4 |

When to Use Each Technique: Decision Matrix

| Workload Type | Use Continuous Batching | Tune PagedAttention Blocks | Enable Chunked Prefill |
| --- | --- | --- | --- |
| Interactive chatbot (short prompts, <1K tokens) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Optional; minimal benefit |
| RAG pipeline (medium prompts, 2K-8K context) | Yes (default) | Raise --gpu-memory-utilization, reduce --max-model-len | Yes, if mixing short and long inputs |
| Long-context summarization (32K+ inputs) | Yes (default) | Critical: maximize pool size, reduce --max-model-len to actual max | Yes, required for acceptable TTFT |
| Code generation (variable output length) | Yes (default) | Raise --gpu-memory-utilization to 0.95 | Yes for mixed-length inputs |
| Batch offline inference (throughput-only) | Yes (default) | Maximize --gpu-memory-utilization | Optional; raise --max-num-batched-tokens instead |

Cost Impact: How Proper Batching Cuts GPU Spend

Live pricing from the Spheron GPU catalog, fetched 04 Apr 2026:

  • H100 PCIe 80GB on-demand: $2.01/hr
  • H100 SXM5 80GB on-demand: $2.40/hr
  • A100 SXM4 80GB on-demand: $1.06/hr

Concrete example: serving 1,800 tok/s sustained throughput.

Baseline (default vLLM config, static-like utilization):

  • 4x H100 PCIe at 40% average GPU utilization
  • Each GPU delivers ~450 tok/s
  • Monthly cost at 720 hrs: 4 × $2.01 × 720 = $5,789/month
  • Cost per 1M output tokens: ($8.04 / 3600) / (1,800 / 1,000,000) = $1.24/1M tokens

Optimized (continuous batching + PagedAttention tuned + chunked prefill):

  • 2x H100 PCIe at 85% average GPU utilization
  • Each GPU delivers ~900 tok/s
  • Monthly cost at 720 hrs: 2 × $2.01 × 720 = $2,894/month
  • Cost per 1M output tokens: ($4.02 / 3600) / (1,800 / 1,000,000) = $0.62/1M tokens
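The per-token cost arithmetic above generalizes to a one-line helper. This is a sketch; plug in your own rates and measured throughput:

```python
def cost_per_million_tokens(gpus, hourly_rate, throughput_tok_s):
    """$/1M output tokens for a cluster sustaining a given aggregate throughput."""
    dollars_per_second = gpus * hourly_rate / 3600
    millions_of_tokens_per_second = throughput_tok_s / 1_000_000
    return dollars_per_second / millions_of_tokens_per_second

# The two scenarios above: 1,800 tok/s sustained on H100 PCIe at $2.01/hr.
baseline = cost_per_million_tokens(4, 2.01, 1800)   # ~$1.24
optimized = cost_per_million_tokens(2, 2.01, 1800)  # ~$0.62
print(f"${baseline:.2f} vs ${optimized:.2f} per 1M output tokens")
```

Doubling per-GPU utilization halves the fleet size at the same throughput, so cost per token falls linearly with it.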
| Config | GPUs | Monthly Cost | Cost per 1M Output Tokens |
| --- | --- | --- | --- |
| Default vLLM, 40% utilization | 4x H100 PCIe | $5,789 | $1.24 |
| Optimized vLLM, 85% utilization | 2x H100 PCIe | $2,894 | $0.62 |
| AWS p4d.24xlarge (8x A100) | 8x A100 40GB | ~$32,000+ | ~$4.00+ |
| GCP a2-highgpu-8g (8x A100) | 8x A100 40GB | ~$26,000+ | ~$3.20+ |

On Spheron, bare-metal access means these parameters are fully exposed. Serverless inference APIs don't let you set --max-num-batched-tokens or --enable-chunked-prefill. You're tuning a black box at best.

Pricing fluctuates based on GPU availability. The prices above are based on 04 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Putting It All Together

Deploy in this order: continuous batching first (already on by default in vLLM, no config change needed), then tune the PagedAttention pool size by raising --gpu-memory-utilization to 0.95, then add chunked prefill if TTFT p95 is the constraint.

The reason for that order: continuous batching is free, since you get it without touching config. PagedAttention tuning directly unlocks more concurrent capacity: more VRAM in the pool means more requests in flight. Add chunked prefill last because it slightly increases p50 TTFT (the cost of interleaving) in exchange for dramatically better p95.

At 128+ concurrent requests on H100 SXM5, the combination of all three techniques typically delivers 2,200-2,400 tok/s for Llama 3.3 70B FP8. That's roughly 25% above the default vLLM configuration and 3-4x above a naive PyTorch inference loop.

One caveat: these numbers assume uniform output lengths. Real workloads with highly variable output lengths will see higher gains from continuous batching and lower gains from chunked prefill. Benchmark against your actual traffic distribution, not synthetic prompts.

For further optimization after applying these three techniques, see the speculative decoding and KV cache quantization guides linked above.

These optimizations require full control over your serving stack. Spheron's bare-metal H100 instances expose every vLLM parameter, so you're not tuning a black box.

Rent H100 SXM5 → | View all GPU pricing →

Get started on Spheron →
