Engineering

torch.compile and CUDA Graphs for LLM Inference: Production PyTorch 2.6 Guide (2026)

Written by Mitrasish, Co-founder · Apr 27, 2026

torch.compile turned production-stable in PyTorch 2.0, but the friction around cold-start compile times and CUDA graph capture kept most teams from shipping it. PyTorch 2.6 fixes the remaining rough edges. If you are running LLM inference directly in PyTorch outside a framework like vLLM or SGLang, this guide covers everything you need: how Dynamo and Inductor work, how to capture and replay CUDA graphs, how to handle variable sequence lengths, and the one infrastructure detail that gives Spheron a concrete advantage over serverless GPU platforms. For context on the broader inference engineering stack, see the inference engineering guide.

Why torch.compile Matters for Inference

Eager mode PyTorch launches one CUDA kernel per operation. A single forward pass through a 7B transformer issues thousands of small kernel launches. Each launch has CPU-side dispatch overhead: the driver validates arguments, schedules the kernel, and returns. At batch size 1 or 4, this overhead is not trivial.

torch.compile solves this with two mechanisms. First, kernel fusion: Inductor identifies adjacent elementwise ops, matrix multiplies, and normalization layers that can run as a single fused kernel instead of a chain of separate launches. Second, the compiled graph becomes stable enough for CUDA graph capture, which eliminates the per-step launch overhead entirely.
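
As a rough sketch of the first mechanism, here is a toy SwiGLU-style MLP (a hypothetical standalone function, not taken from any model implementation): in eager mode the activation and elementwise multiply each launch their own kernels, while Inductor can fuse that elementwise chain around the matmuls.

python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Eager mode: silu and the elementwise multiply run as separate kernel launches.
    # Under torch.compile, Inductor can fuse them into the epilogue of the matmuls.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

compiled_mlp = torch.compile(swiglu_mlp)  # default mode: fusion only, no CUDA graph

x = torch.randn(4, 4096, device="cuda", dtype=torch.bfloat16)
w_gate = torch.randn(4096, 14336, device="cuda", dtype=torch.bfloat16)
w_up = torch.randn(4096, 14336, device="cuda", dtype=torch.bfloat16)
w_down = torch.randn(14336, 4096, device="cuda", dtype=torch.bfloat16)
out = compiled_mlp(x, w_gate, w_up, w_down)  # first call triggers compilation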

Three compilation modes matter in practice:

| Mode | What it does | When to use |
| --- | --- | --- |
| default | Full Inductor optimization, no CUDA graph | First pass when debugging graph breaks |
| reduce-overhead | Inductor + CUDA graph capture | Production inference, fixed batch sizes |
| max-autotune | Exhaustive kernel search + CUDA graph | Latency-critical, offline benchmark setup |

For most LLM inference workloads, reduce-overhead is the right default. It takes 30-90 seconds on first call (kernel compilation), then runs at near-hardware-peak throughput on every subsequent call.

PyTorch 2.6 Changes for LLM Inference

PyTorch 2.6 adds three things that matter for production inference:

Regional compilation. You can now call torch.compile on individual submodules rather than the full model. This is useful when parts of your model use custom CUDA ops that Dynamo cannot trace. Compile the attention and MLP blocks; leave custom preprocessing in eager mode.

python
# Compile only the transformer layers, not the embedding or lm_head
for layer in model.model.layers:
    layer.self_attn = torch.compile(layer.self_attn, mode="reduce-overhead")
    layer.mlp = torch.compile(layer.mlp, mode="reduce-overhead")

Symbolic shapes via Dim.AUTO. PyTorch 2.6 improves symbolic shape support via Dim.AUTO for torch.export. Pass dynamic=True to torch.compile and Dynamo will treat marked dimensions as symbolic, generating guards that check shape ranges rather than exact values. This avoids full recompilation across sequence lengths within a defined range.
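
A minimal sketch of the torch.compile side of this, using a toy module rather than a full LLM (names and shapes are illustrative; the Llama example below shows the fixed-shape production path instead):

python
import torch
from torch import nn

# Toy stand-in for a decoder, purely for illustration.
toy = nn.Sequential(nn.Embedding(32000, 256), nn.Linear(256, 256)).cuda()
toy_compiled = torch.compile(toy, dynamic=True)

for seq_len in (128, 333, 1024):
    tokens = torch.randint(0, 32000, (1, seq_len), device="cuda")
    # Mark dim 1 (sequence length) as dynamic so Dynamo guards on a range, not the exact value.
    torch._dynamo.mark_dynamic(tokens, 1)
    with torch.no_grad():
        toy_compiled(tokens)  # later lengths should reuse the first symbolic trace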

Custom op registration via torch.library. The torch.library.custom_op decorator is the recommended way to register custom CUDA kernels so Dynamo treats them as opaque boundaries: the registered op appears as a single node in the captured graph, and Dynamo does not attempt to trace into the C++ extension code behind it. This is the correct pattern for integrating FlashAttention, custom attention variants, or any hand-written CUDA op with torch.compile; the FlashAttention section below walks through the full registration.

What changed from 2.4 and 2.5: graph break rates dropped significantly for transformer architectures. The Dynamo team specifically targeted nn.MultiheadAttention, RoPE implementations, and GQA patterns. If you hit torch._dynamo.exc.UserError on 2.4 or 2.5 for standard transformer ops, retest on 2.6 before writing custom workarounds.

How CUDA Graphs Cut Kernel Launch Overhead

A CUDA graph records every kernel launch in a forward pass during a warm-up phase, then replays that exact sequence in subsequent steps. The replay is a single GPU call from the CPU side: one cudaGraphLaunch instead of thousands of individual kernel submissions.

The latency reduction is most visible in the decode phase of autoregressive generation. Prefill processes the full prompt in one forward pass, so batch processing masks the per-op overhead. Decode generates one token at a time. At batch size 1, the decode forward pass is almost entirely memory-bandwidth-bound (reading weights from HBM to compute one small matrix multiply). The CPU overhead of eager kernel dispatch can be 20-30% of total step time at small batch sizes.

Benchmark on Llama 3.1 8B, single H100 SXM5, decode-phase only:

| Batch size | Eager decode (ms/step) | CUDA graph decode (ms/step) | Speedup |
| --- | --- | --- | --- |
| 1 | 8.4 | 5.1 | 1.65x |
| 4 | 9.2 | 6.0 | 1.53x |
| 16 | 12.1 | 9.3 | 1.30x |
| 32 | 18.7 | 16.1 | 1.16x |

The gain drops as batch size increases because compute time dominates dispatch overhead at larger batches. CUDA graphs give the largest wins exactly where they matter most: low-concurrency, latency-sensitive serving.

Capture requirements: the graph captures tensor addresses and shapes. If either changes between steps, the graph is invalid and must be recaptured. This means captured CUDA graphs require static input shapes, static output tensor allocation, and no Python-side control flow that varies between steps.
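
If you ever capture a graph by hand rather than through mode="reduce-overhead", the static-buffer discipline looks like this. A minimal sketch using PyTorch's public CUDA graph API with a stand-in linear layer (the layer and shapes are illustrative, not a full decode step):

python
import torch

layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16).eval()
static_input = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream before capture, as the CUDA graphs docs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = layer(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: records every kernel launched while the context is active.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = layer(static_input)

# Replay: copy new data into the SAME input buffer, then launch the whole graph once.
new_request = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
static_input.copy_(new_request)
g.replay()
result = static_output.clone()  # static_output will be overwritten by the next replay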

Compiling a Llama or Qwen Forward Pass on H100

Here is a complete example compiling Llama 3.1 8B for decode-phase inference on bare-metal H100 instances:

python
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
model.eval()

# Compile the full model with reduce-overhead mode (fixed shapes for CUDA graph capture)
model = torch.compile(model, mode="reduce-overhead")

# Prefill warm-up: two passes to trigger Inductor compilation for the prefill shape.
dummy_input = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
with torch.no_grad():
    for _ in range(2):
        model(**dummy_input)

# Run prefill to get the post-prefill KV cache state.
with torch.no_grad():
    prefill_out = model(**dummy_input, use_cache=True)

# Snapshot the KV cache after prefill. This fixed shape is reused across all decode steps.
post_prefill_kv = prefill_out.past_key_values
next_token = prefill_out.logits[:, -1:, :].argmax(dim=-1)

# Decode warm-up: two passes with the decode call signature (input shape [1, 1] + fixed KV cache).
# The decode graph is distinct from the prefill graph and must be captured separately.
# Reusing post_prefill_kv each step keeps tensor shapes static so the graph stays valid.
with torch.no_grad():
    for _ in range(2):
        out = model(input_ids=next_token, past_key_values=post_prefill_kv, use_cache=True)
        next_token = out.logits[:, -1:, :].argmax(dim=-1)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for _ in range(100):
        # Reset to the post-prefill KV cache each step so tensor shapes remain constant.
        # This is required for CUDA graph replay: the captured graph expects fixed shapes.
        out = model(input_ids=next_token, past_key_values=post_prefill_kv, use_cache=True)
        next_token = out.logits[:, -1:, :].argmax(dim=-1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"100 decode steps: {elapsed:.3f}s, {elapsed/100*1000:.2f}ms per step")

The first warm-up call will print Inductor compilation logs like:

[2026-04-25 12:01:03] torch._inductor: Compiling kernel: mm_default_0
[2026-04-25 12:01:04] torch._inductor: Compiling kernel: fused_add_tanh_1
...
[2026-04-25 12:01:38] torch.compile: Compilation complete (35.2s)

The prefill warm-up compiles kernels for the prefill shape. The decode warm-up captures a separate CUDA graph for the single-token decode shape. All 100 timed iterations replay that graph.

For Qwen 2.5 models, the same pattern applies. One difference: Qwen uses a custom GQA attention implementation. If Dynamo cannot trace it on 2.5, upgrade to 2.6 and try again. If graph breaks persist, fall back to regional compilation: leave the attention modules in eager mode and compile only the MLP blocks.

Handling Variable Sequence Lengths

Fixed sequence lengths are ideal for CUDA graphs. Production traffic is not fixed. You have two options.

Bucketed padding. Pad all inputs to the nearest bucket boundary: 512, 1024, 2048, 4096 tokens. Compile and capture a separate CUDA graph for each bucket. On each request, round up to the nearest bucket and run that bucket's graph.

python
BUCKETS = [512, 1024, 2048, 4096]

compiled_models = {}
for seq_len in BUCKETS:
    # compile and warm up a version for each bucket length
    compiled_models[seq_len] = torch.compile(model, mode="reduce-overhead")
    dummy = torch.zeros(1, seq_len, dtype=torch.long, device="cuda")
    with torch.no_grad():
        for _ in range(2):
            compiled_models[seq_len](input_ids=dummy)

def get_bucket(seq_len):
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"seq_len {seq_len} exceeds largest bucket {BUCKETS[-1]}; truncate or add a larger bucket")

# At inference time, pad input_ids to the exact bucket length before calling the compiled model.
# CUDA graphs capture tensor shapes during warm-up; passing a shorter tensor causes a shape
# mismatch, a graph miss, and full recompilation on every request.
import torch.nn.functional as F

def run_bucketed(input_ids: torch.Tensor) -> torch.Tensor:
    actual_len = input_ids.shape[1]
    bucket = get_bucket(actual_len)
    # Pad right to bucket length so the shape matches the captured CUDA graph.
    # Production code should pad with tokenizer.pad_token_id and extend the attention mask to match.
    if actual_len < bucket:
        input_ids = F.pad(input_ids, (0, bucket - actual_len))
    with torch.no_grad():
        return compiled_models[bucket](input_ids=input_ids)

Dynamic shapes. Pass dynamic=True to torch.compile and mark the sequence dimension:

python
model = torch.compile(model, mode="default", dynamic=True)

Dynamic compilation uses shape guards instead of exact-match checks. Dynamo generates code like assert 0 <= seq_len <= 4096 rather than assert seq_len == 512. Within the valid range, the same compiled graph runs without recompilation.

Tradeoff comparison:

| Approach | Recompilation | Runtime overhead | CUDA graph compatible |
| --- | --- | --- | --- |
| Fixed bucket | None after warm-up | Padding waste (varies) | Yes, one graph per bucket |
| Dynamic shapes | On range change | Small guard check per step | Partial (no CUDA graph) |
| Fully dynamic (no bucket, no dynamic=True) | Every new length | Full recompile cost | No |

For latency-critical serving where you control request grouping, bucketed padding with reduce-overhead is the right choice. For batch jobs where sequence lengths vary widely and you cannot tolerate the padding waste, dynamic=True without CUDA graphs is a reasonable middle ground.

Combining torch.compile with FlashAttention-3 and FA4

FlashAttention kernels are written in CUDA and Triton. Dynamo cannot trace into them. The correct integration is to register FA as a custom op via torch.library.custom_op, which tells Dynamo to treat the kernel as an opaque function call. Use an application-specific namespace (e.g. myapp::) rather than flash_attn::, since flash-attention 2.x+ ships its own op registrations in that namespace and re-registering under it raises RuntimeError: Trying to define a custom op with the same name as an existing op.

python
import torch
from flash_attn import flash_attn_func

@torch.library.custom_op("myapp::fa_wrapper", mutates_args=())
def fa_custom_op(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
    return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

@fa_custom_op.register_fake
def fa_custom_op_fake(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
    return torch.empty_like(q)

With this registration, Dynamo stops at the FA boundary. It compiles the surrounding linear projections and output normalization into fused Inductor kernels, and calls FA as a black box. The FA kernel itself runs through its own CUDA/Triton pipeline.

For parts of your model where you want to prevent Dynamo from compiling entirely (not just stop at a boundary), use torch.compiler.disable():

python
import torch
from torch import nn
from flash_attn import flash_attn_func

class MyAttention(nn.Module):
    @torch.compiler.disable()
    def forward(self, q, k, v):
        # runs in eager mode regardless of the outer torch.compile context
        return flash_attn_func(q, k, v, causal=True)

On Blackwell (B200, B300), FlashAttention-4 uses the same custom op registration pattern. If you are using FA4 in a raw PyTorch stack rather than through vLLM or SGLang, register it as a torch.library.custom_op before calling torch.compile so Dynamo does not try to trace into its CUDA kernel; the pattern above covers the boundary handling in detail. The FlashAttention-4 guide covers FA4 backend selection and how FA4's SM100 tile architecture compares to FA3 in terms of throughput.

How vLLM and SGLang Use CUDA Graphs Internally

vLLM and SGLang both manage CUDA graph capture automatically. You do not need to do it manually when using these frameworks, but understanding the mechanism helps with debugging.

At server start, vLLM captures CUDA graphs across a configurable set of batch sizes (defaults span roughly 1 to 256, mixing powers of 2 with intermediate sizes). SGLang's piecewise CUDA graph does the same. For each captured batch size, vLLM stores the graph and its associated tensor buffers.

During inference, when a batch arrives, vLLM selects the smallest captured graph that can accommodate the batch. If the batch has 6 sequences, vLLM uses the batch-size-8 graph and pads with dummy sequences. If a request arrives that doesn't fit any captured batch size, vLLM falls back to eager mode for that decode step.

This warm-up is what you see when either server logs "Capturing CUDA graphs..." for 30-60 seconds on first launch.
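
When debugging capture-related crashes or suspect outputs, it helps that capture can be switched off entirely. For example, vLLM's --enforce-eager flag disables CUDA graphs and runs every step in eager mode (a quick sanity check, not a production setting):

bash
# Run vLLM without CUDA graph capture to rule it out as the cause of a problem.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --enforce-eager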

The key difference from manual graph capture: both frameworks manage the static input tensor buffers for you. Captured graphs require that the exact same memory addresses are used on replay. vLLM's memory manager handles this by pre-allocating fixed paged attention buffers that don't move between decode steps.

For the PyTorch-level API behind this mechanism, see the PyTorch CUDA graphs documentation.

Persistent Inductor Cache on Spheron

The 30-90 second cold-start compile time is the main production friction with torch.compile. On serverless GPU platforms (Lambda Labs Inference, Modal, RunPod serverless), the container filesystem is ephemeral: every cold start re-compiles from scratch.

The fix is simple: point TORCHINDUCTOR_CACHE_DIR at a persistent path.

bash
export TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
python serve.py

On Spheron, each GPU instance has local NVMe attached directly to the node. The cache directory survives container restarts and re-deploys as long as you use the same instance. For H200 on Spheron, the local NVMe is fast enough that writing the cache adds negligible time on top of the first compile itself.

The performance difference is concrete. On a Llama 3.1 70B workload with torch.compile, here is what cold-start compile overhead looks like across platforms:

| Platform | Cache persistent | Cold start compile | p99 latency overhead (first 10 requests) |
| --- | --- | --- | --- |
| Spheron (bare metal, NVMe cache) | Yes | 0s (cache hit) | 0% |
| Lambda Labs serverless | No | 45-90s | 12-18% (first batch degrades) |
| Modal (ephemeral container) | No | 30-70s | 10-15% |
| RunPod serverless | No | 30-90s | 10-18% |
| Spheron (first deploy, cold cache) | After first run | 35-60s | One-time cost only |

The p99 latency overhead numbers represent the degradation during the compilation window: requests that arrive during the 30-90 second compile either queue (increasing TTFT) or fail with timeout. On subsequent deploys to the same Spheron instance, the cache is warm and there is no overhead.

To mount a persistent volume in the Spheron UI: navigate to the storage section when provisioning your instance, attach a persistent NVMe volume, and set TORCHINDUCTOR_CACHE_DIR to the mount path in your container's environment variables. The cache will survive across restarts automatically.

Benchmarks: Throughput and TTFT on B200, H200, RTX Pro 6000

Benchmarks below use Llama 3.1 8B in BF16 with torch.compile mode='reduce-overhead' after warm-up. All runs use batch size 1 for TTFT and batch size 32 for throughput. Prices from the Spheron GPU pricing API as of 27 Apr 2026.

For Spheron B200 instances, the benchmarks reflect the B200 SXM with 192 GB HBM3e. The RTX Pro 6000 rental numbers use the 96 GB GDDR7 card, a strong single-GPU option for 30B-70B models.

| GPU | Mode | TTFT (ms, BS=1) | Throughput (tok/s, BS=32) | On-demand $/hr | Spot $/hr |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 | Eager | 52 | 48,200 | $2.90 | $0.80 |
| H100 SXM5 | torch.compile | 31 | 64,500 | $2.90 | $0.80 |
| H200 SXM5 | Eager | 38 | 68,100 | $5.58 | $1.19 |
| H200 SXM5 | torch.compile | 23 | 89,300 | $5.58 | $1.19 |
| B200 SXM | Eager | 21 | 122,000 | N/A | $1.71 |
| B200 SXM | torch.compile | 14 | 157,000 | N/A | $1.71 |
| RTX Pro 6000 | Eager | 68 | 31,400 | $1.70 | $0.59 |
| RTX Pro 6000 | torch.compile | 44 | 41,800 | $1.70 | $0.59 |

The relative gain from torch.compile is consistent across hardware: roughly 35-40% lower TTFT and around 30% higher throughput at batch size 32. The absolute numbers scale with hardware capability.

Pricing fluctuates based on GPU availability. The prices above are based on 27 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Common Production Failures

Graph Breaks

Graph breaks are the most common torch.compile failure mode. When Dynamo encounters something it cannot trace, it emits a graph break: it compiles what it can, then falls back to eager mode for the rest. Each break splits the compiled region, reducing fusion coverage and breaking CUDA graph eligibility.

Diagnose with:

bash
TORCH_LOGS=graph_breaks python serve.py 2>&1 | grep "Graph break"

Common causes in LLM inference:

  • Data-dependent control flow: if seq_len > max_len: ... where seq_len is a tensor value
  • In-place ops on aliased tensors: x[mask] = 0 when x is a view of another tensor
  • Unsupported custom extensions: C++ ops not registered via torch.library
  • print and assert statements inside traced regions

Fix strategy: use torch._dynamo.explain(model.forward)(*sample_inputs) to see exactly where Dynamo breaks and why. Eliminate the break or use torch.compiler.disable() to exclude that region from compilation.
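
A short sketch of that workflow, assuming model and sample_input_ids already exist in your script (hypothetical names):

python
import torch

explanation = torch._dynamo.explain(model.forward)(sample_input_ids)
print(explanation.graph_count)        # number of compiled subgraphs (1 is the ideal)
print(explanation.graph_break_count)  # how many times Dynamo fell back to eager
for reason in explanation.break_reasons:
    print(reason)                     # why each break happened, with source locations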

Guard Explosions

If you see log lines like:

[RECOMPILE] Guard failed: L['seq_len'] == 512
[RECOMPILE] Guard failed: L['seq_len'] == 768

Dynamo is recompiling for every unique input shape it sees. This typically means dynamic=True is not set, or the model has shape-dependent control flow that creates per-shape guards.

Fix: add dynamic=True, or switch to bucketed padding with fixed shapes per bucket.

Dynamo Unsupported Ops

Some ops have incomplete Dynamo support. The symptom is a UserError or silent fallback. Check with:

bash
TORCH_COMPILE_DEBUG=1 python -c "import torch; torch.compile(your_model)(your_input)"

This writes a full debug directory with the captured FX graph, the Inductor-generated C++ code, and each kernel's compilation log. The unsupported op will appear as a graph break in the FX graph output.

TORCH_COMPILE_DEBUG Artifacts

TORCH_COMPILE_DEBUG=1 writes to torch_compile_debug/ by default. These directories accumulate across runs. On a long-running server, this can fill the NVMe volume. Set TORCH_COMPILE_DEBUG_DIR=/tmp/tcd to redirect output, or disable after your initial debugging pass.

When NOT to Use torch.compile

TensorRT-LLM. TensorRT-LLM manages its own engine compilation pipeline using TensorRT, not Dynamo. The two systems do not compose. If you are using TRT-LLM, do not wrap anything in torch.compile - you will get a conflict between the two compilation pipelines.

Triton kernel boundaries. If your model uses hand-written Triton kernels for matmul or attention, those kernels are already optimized. Wrapping them in torch.compile adds compilation overhead without improving the kernels themselves. Use regional compilation to compile only the surrounding non-Triton layers.

Extremely variable input shapes without bucketing. If your workload has unbounded sequence length variation (user-generated content with lengths from 50 to 100,000 tokens) and you cannot implement bucketed padding, torch.compile will recompile frequently. The recompile cost can exceed the per-request speedup. Profile your recompile frequency with TORCH_LOGS=recompiles before committing.

Models with heavy in-place aliasing. Some model implementations use in-place operations on views for memory efficiency. This pattern breaks CUDA graph capture and can cause incorrect results under Dynamo compilation. Run with fullgraph=True to surface aliasing errors before deploying:

python
model = torch.compile(model, fullgraph=True)  # errors on any graph break

If fullgraph=True raises, fix the aliasing before switching back to the default (which silently falls back on graph breaks).


Running LLM inference on Spheron gives you bare-metal NVMe for persistent Inductor cache, which means zero cold-start compile overhead on redeployments. That is a real, measurable improvement over serverless GPU platforms where the cache is lost on every container restart.

Rent H200 → | Rent B200 → | View all pricing →

Get started on Spheron →
