Engineering

torch.compile and CUDA Graphs for LLM Inference: Production PyTorch 2.6 Guide (2026)

Written by Mitrasish, Co-founder · Apr 27, 2026

torch.compile turned production-stable in PyTorch 2.0, but the friction around cold-start compile times and CUDA graph capture kept most teams from shipping it. PyTorch 2.6 fixes the remaining rough edges. If you are running LLM inference directly in PyTorch outside a framework like vLLM or SGLang, this guide covers everything you need: how Dynamo and Inductor work, how to capture and replay CUDA graphs, how to handle variable sequence lengths, and the one infrastructure detail that gives Spheron a concrete advantage over serverless GPU platforms. For context on the broader inference engineering stack, see the inference engineering guide.

Why torch.compile Matters for Inference

Eager mode PyTorch launches one CUDA kernel per operation. A single forward pass through a 7B transformer issues thousands of small kernel launches. Each launch has CPU-side dispatch overhead: the driver validates arguments, schedules the kernel, and returns. At batch size 1 or 4, this overhead is not trivial.

torch.compile solves this with two mechanisms. First, kernel fusion: Inductor identifies adjacent elementwise ops, matrix multiplies, and normalization layers that can run as a single fused kernel instead of a chain of separate launches. Second, the compiled graph becomes stable enough for CUDA graph capture, which eliminates the per-step launch overhead entirely.
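
As a rough sketch of the first mechanism, here is a toy SwiGLU-style MLP (a hypothetical standalone function, not taken from any model implementation): in eager mode the activation and elementwise multiply each launch their own kernels, while Inductor can fuse that elementwise chain around the matmuls.

python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Eager mode: silu and the elementwise multiply run as separate kernel launches.
    # Under torch.compile, Inductor can fuse them into the epilogue of the matmuls.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

compiled_mlp = torch.compile(swiglu_mlp)  # default mode: fusion only, no CUDA graph

x = torch.randn(4, 4096, device="cuda", dtype=torch.bfloat16)
w_gate = torch.randn(4096, 14336, device="cuda", dtype=torch.bfloat16)
w_up = torch.randn(4096, 14336, device="cuda", dtype=torch.bfloat16)
w_down = torch.randn(14336, 4096, device="cuda", dtype=torch.bfloat16)
out = compiled_mlp(x, w_gate, w_up, w_down)  # first call triggers compilation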

Three compilation modes matter in practice:

| Mode | What it does | When to use |
| --- | --- | --- |
| default | Full Inductor optimization, no CUDA graph | First pass when debugging graph breaks |
| reduce-overhead | Inductor + CUDA graph capture | Production inference, fixed batch sizes |
| max-autotune | Exhaustive kernel search + CUDA graph | Latency-critical, offline benchmark setup |

For most LLM inference workloads, reduce-overhead is the right default. It takes 30-90 seconds on first call (kernel compilation), then runs at near-hardware-peak throughput on every subsequent call.

PyTorch 2.6 Changes for LLM Inference

PyTorch 2.6 adds three things that matter for production inference:

Regional compilation. You can now call torch.compile on individual submodules rather than the full model. This is useful when parts of your model use custom CUDA ops that Dynamo cannot trace. Compile the attention and MLP blocks; leave custom preprocessing in eager mode.

python
# Compile only the transformer layers, not the embedding or lm_head
for layer in model.model.layers:
    layer.self_attn = torch.compile(layer.self_attn, mode="reduce-overhead")
    layer.mlp = torch.compile(layer.mlp, mode="reduce-overhead")

Symbolic shapes via Dim.AUTO. PyTorch 2.6 improves symbolic shape support via Dim.AUTO for torch.export. Pass dynamic=True to torch.compile and Dynamo will treat marked dimensions as symbolic, generating guards that check shape ranges rather than exact values. This avoids full recompilation across sequence lengths within a defined range.
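
A minimal sketch of the torch.compile side of this, using a toy module rather than a full LLM (names and shapes are illustrative; the Llama example below shows the fixed-shape production path instead):

python
import torch
from torch import nn

# Toy stand-in for a decoder, purely for illustration.
toy = nn.Sequential(nn.Embedding(32000, 256), nn.Linear(256, 256)).cuda()
toy_compiled = torch.compile(toy, dynamic=True)

for seq_len in (128, 333, 1024):
    tokens = torch.randint(0, 32000, (1, seq_len), device="cuda")
    # Mark dim 1 (sequence length) as dynamic so Dynamo guards on a range, not the exact value.
    torch._dynamo.mark_dynamic(tokens, 1)
    with torch.no_grad():
        toy_compiled(tokens)  # later lengths should reuse the first symbolic trace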

Custom op registration via torch.library. The torch.library.custom_op decorator is the recommended way to register custom CUDA kernels so Dynamo treats them as opaque boundaries: the registered op appears as a single node in the captured graph, and Dynamo does not attempt to trace into the C++ extension code behind it. This is the correct pattern for integrating FlashAttention, custom attention variants, or any hand-written CUDA op with torch.compile; the FlashAttention section below walks through the full registration.

What changed from 2.4 and 2.5: graph break rates dropped significantly for transformer architectures. The Dynamo team specifically targeted nn.MultiheadAttention, RoPE implementations, and GQA patterns. If you hit torch._dynamo.exc.UserError on 2.4 or 2.5 for standard transformer ops, retest on 2.6 before writing custom workarounds.

How CUDA Graphs Cut Kernel Launch Overhead

A CUDA graph records every kernel launch in a forward pass during a warm-up phase, then replays that exact sequence in subsequent steps. The replay is a single GPU call from the CPU side: one cudaGraphLaunch instead of thousands of individual kernel submissions.

The latency reduction is most visible in the decode phase of autoregressive generation. Prefill processes the full prompt in one forward pass, so batch processing masks the per-op overhead. Decode generates one token at a time. At batch size 1, the decode forward pass is almost entirely memory-bandwidth-bound (reading weights from HBM to compute one small matrix multiply). The CPU overhead of eager kernel dispatch can be 20-30% of total step time at small batch sizes.

Benchmark on Llama 3.1 8B, single H100 SXM5, decode-phase only:

| Batch size | Eager decode (ms/step) | CUDA graph decode (ms/step) | Speedup |
| --- | --- | --- | --- |
| 1 | 8.4 | 5.1 | 1.65x |
| 4 | 9.2 | 6.0 | 1.53x |
| 16 | 12.1 | 9.3 | 1.30x |
| 32 | 18.7 | 16.1 | 1.16x |

The gain drops as batch size increases because compute time dominates dispatch overhead at larger batches. CUDA graphs give the largest wins exactly where they matter most: low-concurrency, latency-sensitive serving.

Capture requirements: the graph captures tensor addresses and shapes. If either changes between steps, the graph is invalid and must be recaptured. This means captured CUDA graphs require static input shapes, static output tensor allocation, and no Python-side control flow that varies between steps.
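
If you ever capture a graph by hand rather than through mode="reduce-overhead", the static-buffer discipline looks like this. A minimal sketch using PyTorch's public CUDA graph API with a stand-in linear layer (the layer and shapes are illustrative, not a full decode step):

python
import torch

layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16).eval()
static_input = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream before capture, as the CUDA graphs docs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = layer(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: records every kernel launched while the context is active.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = layer(static_input)

# Replay: copy new data into the SAME input buffer, then launch the whole graph once.
new_request = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
static_input.copy_(new_request)
g.replay()
result = static_output.clone()  # static_output will be overwritten by the next replay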

Compiling a Llama or Qwen Forward Pass on H100

Here is a complete example compiling Llama 3.1 8B for decode-phase inference on bare-metal H100 instances:

python
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
model.eval()

# Compile the full model with reduce-overhead mode (fixed shapes for CUDA graph capture)
model = torch.compile(model, mode="reduce-overhead")

# Prefill warm-up: two passes to trigger Inductor compilation for the prefill shape.
dummy_input = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
with torch.no_grad():
    for _ in range(2):
        model(**dummy_input)

# Run prefill to get the post-prefill KV cache state.
with torch.no_grad():
    prefill_out = model(**dummy_input, use_cache=True)

# Snapshot the KV cache after prefill. This fixed shape is reused across all decode steps.
post_prefill_kv = prefill_out.past_key_values
next_token = prefill_out.logits[:, -1:, :].argmax(dim=-1)

# Decode warm-up: two passes with the decode call signature (input shape [1, 1] + fixed KV cache).
# The decode graph is distinct from the prefill graph and must be captured separately.
# Reusing post_prefill_kv each step keeps tensor shapes static so the graph stays valid.
with torch.no_grad():
    for _ in range(2):
        out = model(input_ids=next_token, past_key_values=post_prefill_kv, use_cache=True)
        next_token = out.logits[:, -1:, :].argmax(dim=-1)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for _ in range(100):
        # Reset to the post-prefill KV cache each step so tensor shapes remain constant.
        # This is required for CUDA graph replay: the captured graph expects fixed shapes.
        out = model(input_ids=next_token, past_key_values=post_prefill_kv, use_cache=True)
        next_token = out.logits[:, -1:, :].argmax(dim=-1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"100 decode steps: {elapsed:.3f}s, {elapsed/100*1000:.2f}ms per step")

The first warm-up call will print Inductor compilation logs like:

[2026-04-25 12:01:03] torch._inductor: Compiling kernel: mm_default_0
[2026-04-25 12:01:04] torch._inductor: Compiling kernel: fused_add_tanh_1
...
[2026-04-25 12:01:38] torch.compile: Compilation complete (35.2s)

The prefill warm-up compiles kernels for the prefill shape. The decode warm-up captures a separate CUDA graph for the single-token decode shape. All 100 timed iterations replay that graph.

For Qwen 2.5 models, the same pattern applies. One difference: Qwen uses a custom GQA attention implementation. If Dynamo cannot trace it on 2.5, upgrade to 2.6 and try again. If graph breaks persist, fall back to regional compilation: leave the attention modules in eager mode and compile only the MLP blocks.

Handling Variable Sequence Lengths

Fixed sequence lengths are ideal for CUDA graphs. Production traffic is not fixed. You have two options.

Bucketed padding. Pad all inputs to the nearest bucket boundary: 512, 1024, 2048, 4096 tokens. Compile and capture a separate CUDA graph for each bucket. On each request, round up to the nearest bucket and run that bucket's graph.

python
BUCKETS = [512, 1024, 2048, 4096]

compiled_models = {}
for seq_len in BUCKETS:
    # compile and warm up a version for each bucket length
    compiled_models[seq_len] = torch.compile(model, mode="reduce-overhead")
    dummy = torch.zeros(1, seq_len, dtype=torch.long, device="cuda")
    with torch.no_grad():
        for _ in range(2):
            compiled_models[seq_len](input_ids=dummy)

def get_bucket(seq_len):
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"seq_len {seq_len} exceeds largest bucket {BUCKETS[-1]}; truncate or add a larger bucket")

# At inference time, pad input_ids to the exact bucket length before calling the compiled model.
# CUDA graphs capture tensor shapes during warm-up; passing a shorter tensor causes a shape
# mismatch, a graph miss, and full recompilation on every request.
import torch.nn.functional as F

def run_bucketed(input_ids: torch.Tensor) -> torch.Tensor:
    actual_len = input_ids.shape[1]
    bucket = get_bucket(actual_len)
    # Pad right to bucket length so the shape matches the captured CUDA graph.
    # Production code should pad with tokenizer.pad_token_id and extend the attention mask to match.
    if actual_len < bucket:
        input_ids = F.pad(input_ids, (0, bucket - actual_len))
    with torch.no_grad():
        return compiled_models[bucket](input_ids=input_ids)

Dynamic shapes. Pass dynamic=True to torch.compile and mark the sequence dimension:

python
model = torch.compile(model, mode="default", dynamic=True)

Dynamic compilation uses shape guards instead of exact-match checks. Dynamo generates code like assert 0 <= seq_len <= 4096 rather than assert seq_len == 512. Within the valid range, the same compiled graph runs without recompilation.

Tradeoff comparison:

| Approach | Recompilation | Runtime overhead | CUDA graph compatible |
| --- | --- | --- | --- |
| Fixed bucket | None after warm-up | Padding waste (varies) | Yes, one graph per bucket |
| Dynamic shapes | On range change | Small guard check per step | Partial (no CUDA graph) |
| Fully dynamic (no bucket, no dynamic=True) | Every new length | Full recompile cost | No |

For latency-critical serving where you control request grouping, bucketed padding with reduce-overhead is the right choice. For batch jobs where sequence lengths vary widely and you cannot tolerate the padding waste, dynamic=True without CUDA graphs is a reasonable middle ground.

Combining torch.compile with FlashAttention-3 and FA4

FlashAttention kernels are written in CUDA and Triton. Dynamo cannot trace into them. The correct integration is to register FA as a custom op via torch.library.custom_op, which tells Dynamo to treat the kernel as an opaque function call. Use an application-specific namespace (e.g. myapp::) rather than flash_attn::, since flash-attention 2.x+ ships its own op registrations in that namespace and re-registering under it raises RuntimeError: Trying to define a custom op with the same name as an existing op.

python
import torch
from flash_attn import flash_attn_func

@torch.library.custom_op("myapp::fa_wrapper", mutates_args=())
def fa_custom_op(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
    return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

@fa_custom_op.register_fake
def fa_custom_op_fake(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
    return torch.empty_like(q)

With this registration, Dynamo stops at the FA boundary. It compiles the surrounding linear projections and output normalization into fused Inductor kernels, and calls FA as a black box. The FA kernel itself runs through its own CUDA/Triton pipeline.

For parts of your model where you want to prevent Dynamo from compiling entirely (not just stop at a boundary), use torch.compiler.disable():

python
import torch
from torch import nn
from flash_attn import flash_attn_func

class MyAttention(nn.Module):
    @torch.compiler.disable()
    def forward(self, q, k, v):
        # runs in eager mode regardless of the outer torch.compile context
        return flash_attn_func(q, k, v, causal=True)

On Blackwell (B200, B300), FlashAttention-4 uses the same custom op registration pattern. If you are using FA4 in a raw PyTorch stack rather than through vLLM or SGLang, register it as a torch.library.custom_op before calling torch.compile so Dynamo does not try to trace into its CUDA kernel; the pattern above covers the boundary handling in detail. The FlashAttention-4 guide covers FA4 backend selection and how FA4's SM100 tile architecture compares to FA3 in terms of throughput.

How vLLM and SGLang Use CUDA Graphs Internally

vLLM and SGLang both manage CUDA graph capture automatically. You do not need to do it manually when using these frameworks, but understanding the mechanism helps with debugging.

At server start, vLLM captures CUDA graphs across a configurable set of batch sizes (defaults span roughly 1 to 256, mixing powers of 2 with intermediate sizes). SGLang's piecewise CUDA graph does the same. For each captured batch size, vLLM stores the graph and its associated tensor buffers.

During inference, when a batch arrives, vLLM selects the smallest captured graph that can accommodate the batch. If the batch has 6 sequences, vLLM uses the batch-size-8 graph and pads with dummy sequences. If a request arrives that doesn't fit any captured batch size, vLLM falls back to eager mode for that decode step.

This warm-up is what you see when either server logs "Capturing CUDA graphs..." for 30-60 seconds on first launch.
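
When debugging capture-related crashes or suspect outputs, it helps that capture can be switched off entirely. For example, vLLM's --enforce-eager flag disables CUDA graphs and runs every step in eager mode (a quick sanity check, not a production setting):

bash
# Run vLLM without CUDA graph capture to rule it out as the cause of a problem.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --enforce-eager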

The key difference from manual graph capture: both frameworks manage the static input tensor buffers for you. Captured graphs require that the exact same memory addresses are used on replay. vLLM's memory manager handles this by pre-allocating fixed paged attention buffers that don't move between decode steps.

For the PyTorch-level API behind this mechanism, see the PyTorch CUDA graphs documentation.

Persistent Inductor Cache on Spheron

The 30-90 second cold-start compile time is the main production friction with torch.compile. On serverless GPU platforms (Lambda Labs Inference, Modal, RunPod serverless), the container filesystem is ephemeral: every cold start re-compiles from scratch.

The fix is simple: point TORCHINDUCTOR_CACHE_DIR at a persistent path.

bash
export TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
python serve.py

On Spheron, each GPU instance has local NVMe attached directly to the node. The cache directory survives container restarts and re-deploys as long as you use the same instance. For H200 on Spheron, the local NVMe is fast enough that writing the cache adds negligible time on top of the first compile itself.

The performance difference is concrete. On a Llama 3.1 70B workload with torch.compile, here is what cold-start compile overhead looks like across platforms:

| Platform | Cache persistent | Cold start compile | p99 latency overhead (first 10 requests) |
| --- | --- | --- | --- |
| Spheron (bare metal, NVMe cache) | Yes | 0s (cache hit) | 0% |
| Lambda Labs serverless | No | 45-90s | 12-18% (first batch degrades) |
| Modal (ephemeral container) | No | 30-70s | 10-15% |
| RunPod serverless | No | 30-90s | 10-18% |
| Spheron (first deploy, cold cache) | After first run | 35-60s | One-time cost only |

The p99 latency overhead numbers represent the degradation during the compilation window: requests that arrive during the 30-90 second compile either queue (increasing TTFT) or fail with timeout. On subsequent deploys to the same Spheron instance, the cache is warm and there is no overhead.

To mount a persistent volume in the Spheron UI: navigate to the storage section when provisioning your instance, attach a persistent NVMe volume, and set TORCHINDUCTOR_CACHE_DIR to the mount path in your container's environment variables. The cache will survive across restarts automatically.

Benchmarks: Throughput and TTFT on B200, H200, RTX Pro 6000

Benchmarks below use Llama 3.1 8B in BF16 with torch.compile mode='reduce-overhead' after warm-up. All runs use batch size 1 for TTFT and batch size 32 for throughput. Prices from the Spheron GPU pricing API as of 27 Apr 2026.

For Spheron B200 instances, the benchmarks reflect the B200 SXM with 192 GB HBM3e. The RTX Pro 6000 rental numbers use the 96 GB GDDR7 card, a strong single-GPU option for 30B-70B models.

| GPU | Mode | TTFT (ms, BS=1) | Throughput (tok/s, BS=32) | On-demand $/hr | Spot $/hr |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 | Eager | 52 | 48,200 | $2.90 | $0.80 |
| H100 SXM5 | torch.compile | 31 | 64,500 | $2.90 | $0.80 |
| H200 SXM5 | Eager | 38 | 68,100 | $5.58 | $1.19 |
| H200 SXM5 | torch.compile | 23 | 89,300 | $5.58 | $1.19 |
| B200 SXM | Eager | 21 | 122,000 | N/A | $1.71 |
| B200 SXM | torch.compile | 14 | 157,000 | N/A | $1.71 |
| RTX Pro 6000 | Eager | 68 | 31,400 | $1.70 | $0.59 |
| RTX Pro 6000 | torch.compile | 44 | 41,800 | $1.70 | $0.59 |

The relative gain from torch.compile is consistent across hardware: roughly 35-40% lower TTFT and around 30% higher throughput at batch size 32. The absolute numbers scale with hardware capability.

Pricing fluctuates based on GPU availability. The prices above are based on 27 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Common Production Failures

Graph Breaks

Graph breaks are the most common torch.compile failure mode. When Dynamo encounters something it cannot trace, it emits a graph break: it compiles what it can, then falls back to eager mode for the rest. Each break splits the compiled region, reducing fusion coverage and breaking CUDA graph eligibility.

Diagnose with:

bash
TORCH_LOGS=graph_breaks python serve.py 2>&1 | grep "Graph break"

Common causes in LLM inference:

  • Data-dependent control flow: if seq_len > max_len: ... where seq_len is a tensor value
  • In-place ops on aliased tensors: x[mask] = 0 when x is a view of another tensor
  • Unsupported custom extensions: C++ ops not registered via torch.library
  • print and assert statements inside traced regions

Fix strategy: use torch._dynamo.explain(model.forward)(*sample_inputs) to see exactly where Dynamo breaks and why. Eliminate the break or use torch.compiler.disable() to exclude that region from compilation.
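
A short sketch of that workflow, assuming model and sample_input_ids already exist in your script (hypothetical names):

python
import torch

explanation = torch._dynamo.explain(model.forward)(sample_input_ids)
print(explanation.graph_count)        # number of compiled subgraphs (1 is the ideal)
print(explanation.graph_break_count)  # how many times Dynamo fell back to eager
for reason in explanation.break_reasons:
    print(reason)                     # why each break happened, with source locations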

Guard Explosions

If you see log lines like:

[RECOMPILE] Guard failed: L['seq_len'] == 512
[RECOMPILE] Guard failed: L['seq_len'] == 768

Dynamo is recompiling for every unique input shape it sees. This typically means dynamic=True is not set, or the model has shape-dependent control flow that creates per-shape guards.

Fix: add dynamic=True, or switch to bucketed padding with fixed shapes per bucket.

Dynamo Unsupported Ops

Some ops have incomplete Dynamo support. The symptom is a UserError or silent fallback. Check with:

bash
TORCH_COMPILE_DEBUG=1 python -c "import torch; torch.compile(your_model)(your_input)"

This writes a full debug directory with the captured FX graph, the Inductor-generated C++ code, and each kernel's compilation log. The unsupported op will appear as a graph break in the FX graph output.

TORCH_COMPILE_DEBUG Artifacts

TORCH_COMPILE_DEBUG=1 writes to torch_compile_debug/ by default. These directories accumulate across runs. On a long-running server, this can fill the NVMe volume. Set TORCH_COMPILE_DEBUG_DIR=/tmp/tcd to redirect output, or disable after your initial debugging pass.

When NOT to Use torch.compile

TensorRT-LLM. TensorRT-LLM manages its own engine compilation pipeline using TensorRT, not Dynamo. The two systems do not compose. If you are using TRT-LLM, do not wrap anything in torch.compile - you will get a conflict between the two compilation pipelines.

Triton kernel boundaries. If your model uses hand-written Triton kernels for matmul or attention, those kernels are already optimized. Wrapping them in torch.compile adds compilation overhead without improving the kernels themselves. Use regional compilation to compile only the surrounding non-Triton layers.

Extremely variable input shapes without bucketing. If your workload has unbounded sequence length variation (user-generated content with lengths from 50 to 100,000 tokens) and you cannot implement bucketed padding, torch.compile will recompile frequently. The recompile cost can exceed the per-request speedup. Profile your recompile frequency with TORCH_LOGS=recompiles before committing.

Models with heavy in-place aliasing. Some model implementations use in-place operations on views for memory efficiency. This pattern breaks CUDA graph capture and can cause incorrect results under Dynamo compilation. Run with fullgraph=True to surface aliasing errors before deploying:

python
model = torch.compile(model, fullgraph=True)  # errors on any graph break

If fullgraph=True raises, fix the aliasing before switching back to the default (which silently falls back on graph breaks).


Running LLM inference on Spheron gives you bare-metal NVMe for persistent Inductor cache, which means zero cold-start compile overhead on redeployments. That is a real, measurable improvement over serverless GPU platforms where the cache is lost on every container restart.

Rent H200 → | Rent B200 → | View all pricing →

Get started on Spheron →
