Most LLM inference slowdowns come from 2-5 kernels that a 10-minute profile would identify. Yet ML engineers rarely profile on cloud GPUs because the tooling feels painful: GUIs that need displays, counter permission errors on shared infrastructure, trace files that are hard to get off a remote host. This guide cuts through that. It covers the inference engineering stack from the tool selection decision down to reading a roofline chart, and pairs with GPU monitoring basics for teams already tracking utilization but hitting a wall where nvidia-smi stops being useful.
Which Tool for Which Job
| Tool | Granularity | When to use |
|---|---|---|
| nsys (Nsight Systems) | Application timeline | Always run first. Identifies which phases dominate wall time. |
| ncu (Nsight Compute) | Single kernel, all hardware counters | Roofline analysis after nsys identifies a slow kernel target. |
| torch.profiler | Python ops + CUDA ops, per-rank | Distributed training jobs; operator attribution back to Python code. |
| Triton Proton | Triton kernel internals | Only when writing custom Triton kernels. |
Start with nsys for orientation. It gives you the timeline: where the GPU is busy, where it stalls, which kernels eat the most time. Drill into ncu for 1-3 slow kernels once you have targets. Use torch.profiler when you need Python-level attribution or per-rank distributed analysis. Use Proton only if you are writing custom Triton.
All four tools require either bare-metal or VM-level GPU access. Most serverless GPU platforms block hardware counter collection for tenant isolation. Running these tools on bare-metal H100 instances on Spheron gives you full ncu and nsys access without the ERR_NVGPUCTRPERM error that kills profiling sessions on shared infrastructure.
Capturing a Trace on a Remote Cloud GPU
The key principle is separating capture from analysis. Collect traces headlessly on the remote host, download them, analyze locally. No X11 forwarding, no GUI on the server, no VNC.
nsys capture
nsys profile \
--trace cuda,nvtx,osrt,cudnn \
--output /workspace/sys-profile.nsys-rep \
python inference.py
scp user@host:/workspace/sys-profile.nsys-rep ./
Open sys-profile.nsys-rep in the Nsight Systems desktop GUI (free from developer.nvidia.com). The CUDA HW row shows kernel execution. Sort by total GPU time to find the top 3 offenders.
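To make the timeline readable by phase rather than by kernel name, NVTX ranges around the phases you care about show up as labelled bars on the NVTX row (already captured by the --trace cuda,nvtx flag above). A minimal sketch, where model.prefill, model.decode, prompt_ids, and max_new_tokens are placeholders for your own inference code:

```python
import torch

# NVTX ranges appear as labelled bars on the NVTX row of the nsys timeline,
# so GPU time can be attributed to prefill vs decode at a glance.
# model.prefill / model.decode are placeholders for your own inference code.
torch.cuda.nvtx.range_push("prefill")
logits, kv_cache = model.prefill(prompt_ids)
next_token = logits[:, -1].argmax(dim=-1)
torch.cuda.nvtx.range_pop()

for step in range(max_new_tokens):
    torch.cuda.nvtx.range_push(f"decode_{step}")
    logits, kv_cache = model.decode(next_token, kv_cache)
    next_token = logits[:, -1].argmax(dim=-1)
    torch.cuda.nvtx.range_pop()
```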
ncu capture with --replay-mode kernel
ncu \
--replay-mode kernel \
--set full \
--target-processes all \
-o /workspace/kernel-profile.ncu-rep \
python inference.py
--replay-mode kernel re-runs each kernel individually per counter set without restarting the process. --replay-mode application reruns the entire script per counter set, which is more accurate for stateful kernels but far slower and more expensive for LLM workloads where the full forward pass takes seconds.
Scoping to specific kernels
Profile only the kernels that matter. This keeps trace size and runtime manageable:
ncu --kernel-name flash_attn_varlen_fwd \
--launch-skip 10 --launch-count 5 \
--set full \
-o /workspace/attn-profile.ncu-rep \
python inference.py
--launch-skip 10 skips the first 10 launches (warmup). --launch-count 5 collects 5 launches. The resulting trace covers one attention kernel in detail without profiling hundreds of unrelated kernels.
Docker setup for profiling
FROM nvcr.io/nvidia/pytorch:24.10-py3
RUN pip install HolisticTraceAnalysis nvidia-pytool-report triton==3.0.0
ENV TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
WORKDIR /workspace
Run with:
docker run --rm --gpus all --cap-add SYS_ADMIN \
-v $(pwd):/workspace \
profiling-image \
ncu --replay-mode kernel --set full \
-o /workspace/profile.ncu-rep \
python /workspace/inference_script.py
--cap-add SYS_ADMIN is the minimal Linux capability needed for hardware counter access. --privileged works too but grants broader permissions than necessary. On most serverless platforms, neither flag is available and ncu fails with ERR_NVGPUCTRPERM. This is a structural constraint of their hosting model, not a configuration issue you can work around.
For GPU monitoring basics that complement this profiling workflow, see the GPU monitoring for ML guide.
Reading the Nsight Compute Roofline Chart
The roofline model has two axes: arithmetic intensity on x (FLOPs per byte of DRAM traffic) and achieved performance on y (GFLOPs/s). Two lines define the "roof": the memory bandwidth slope (bounded by GB/s) and the compute ceiling (bounded by peak TFLOP/s). Your kernel's measured point sits somewhere on this chart.
H100 SXM5 reference numbers: 3.35 TB/s HBM3 bandwidth, 989 TFLOP/s BF16. The ridge point, where a kernel transitions from memory-bound to compute-bound, sits at approximately 989,000 GFLOPs / 3,350 GB/s = 295 FLOP/byte.
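The ridge-point arithmetic is worth keeping next to your analysis notes. A minimal sketch using the H100 SXM5 datasheet numbers above; swap in your own GPU's peak FLOP/s and bandwidth:

```python
# Ridge point = peak compute / peak memory bandwidth, in FLOP per byte.
# H100 SXM5 datasheet values; substitute your GPU's numbers.
PEAK_BF16_FLOPS = 989e12        # 989 TFLOP/s
PEAK_HBM_BYTES_PER_S = 3.35e12  # 3.35 TB/s

ridge_point = PEAK_BF16_FLOPS / PEAK_HBM_BYTES_PER_S  # ~295 FLOP/byte

def bound_by(arithmetic_intensity):
    """Classify a kernel from its measured FLOP/byte (from the ncu roofline page)."""
    return "compute-bound" if arithmetic_intensity >= ridge_point else "memory-bound"

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(bound_by(6))    # decode attention at batch=1 -> memory-bound
print(bound_by(400))  # large-batch GEMM -> compute-bound
```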
Where LLM kernels typically land:
- Large batch GEMM: compute-bound, upper-right, near the compute roof
- Attention at decode batch=1: memory-bound, lower-left. Arithmetic intensity around 4-8 FLOP/byte on H100, 40x left of the ridge point
- LayerNorm, RoPE, elementwise ops: heavily memory-bound, often under 1 FLOP/byte
- FP8 GEMM at medium batch: often near the ridge point, transitioning between memory and compute bound
A 7B model attention kernel at decode batch=1 on H100 SXM5 has arithmetic intensity around 6 FLOP/byte. The H100 ridge point sits at ~295 FLOP/byte. The kernel is more than 40x to the left of the ridge. The fix is not a bigger GPU. It is reducing HBM trips via FlashAttention or increasing batch size to amortize memory reads. Once you can read this in Nsight Compute, you stop guessing.
Key secondary metrics to check alongside roofline:
- SM utilization %: low means underparallelized (too few concurrent warps)
- Memory throughput % of peak HBM: low despite being memory-bound means stall somewhere else (often L2 thrashing)
- Warp occupancy: low indicates register pressure or shared memory overuse
- L2 hit rate: low means cold KV cache reads, common at long context. Fix: prefix caching
For more on KV cache behavior and fixes, see the KV cache optimization guide and why your LLM inference is slow.
PyTorch Profiler and Holistic Trace Analysis for Distributed Jobs
For multi-GPU training, torch.profiler plus HTA gives you per-rank attribution and communication overlap analysis that nsys cannot easily provide.
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./traces'),
    record_shapes=True,
    with_stack=True,
    with_flops=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()
Parameter notes:
- schedule(wait=1, warmup=2, active=5): skip step 0 (startup overhead), warm up 2 steps (JIT compilation), then record 5 steps. Without this, your trace captures compilation time and skews everything.
- record_shapes=True: required to attribute kernels to specific matrix shapes and layer sizes.
- with_stack=True: adds the Python call stack to each op. Roughly 20% overhead; disable in production captures.
- with_flops=True: enables FLOPs counting for GEMM and convolution ops.
HTA loads traces from all ranks and surfaces the bottlenecks:
pip install HolisticTraceAnalysis
from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir='./traces')
idle = analyzer.get_idle_time_breakdown(ranks=[0, 1, 2, 3])
critical = analyzer.critical_path_analysis(rank=0, annotation='step', instance_id=0)
HTA surfaces the idle time breakdown per rank (compute vs communication vs memory-wait), all-reduce overlap efficiency (what fraction of NCCL time overlaps with compute), and the critical path across the forward + backward + optimizer step.
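For the overlap number specifically, HTA also exposes a communication-computation overlap query. The sketch below assumes the same trace directory as above; check your HTA version for the exact method signature:

```python
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir='./traces')

# Fraction of NCCL communication time that overlaps with compute, per rank.
# A low percentage means all-reduce is serialized against math it could hide behind.
overlap_df = analyzer.get_comm_comp_overlap(visualize=False)
print(overlap_df)
```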
For context on setting up distributed training environments, see the distributed LLM training guide and the NCCL tuning guide.
Common LLM Inference Bottlenecks in Profiles
1. Attention memory bandwidth
Profile signature in nsys: flash_attn_varlen_fwd in top-5 by GPU time. In ncu roofline: memory-bound, achieved bandwidth 60-80% of peak H100 HBM3.
At decode batch=1 on a 70B model, attention accounts for 30-40% of total decode time. The roofline shows this is not a compute problem. FlashAttention reduces HBM round trips by computing attention in tiles that fit in SRAM, but even FlashAttention is memory-bound at batch=1. Increasing batch size amortizes the weight reads across more tokens.
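The "memory-bound at batch=1" conclusion also falls out of a back-of-envelope count. The sketch below counts only the two attention matmuls against the KV-cache bytes read, using a hypothetical 7B-class GQA configuration; it is an estimate, not a substitute for the intensity ncu measures:

```python
# Rough arithmetic intensity of decode attention at batch=1: FLOPs from the
# QK^T and softmax(QK^T)@V matmuls vs. bytes streamed from HBM for the K/V cache.
def decode_attention_intensity(seq_len, n_heads, n_kv_heads, d_head, bytes_per_elem=2):
    flops = n_heads * (2 * seq_len * d_head    # QK^T
                       + 2 * seq_len * d_head)  # softmax(QK^T) @ V
    kv_bytes = 2 * seq_len * n_kv_heads * d_head * bytes_per_elem  # K and V reads
    return flops / kv_bytes

# Hypothetical 7B-class GQA config: 32 query heads, 8 KV heads, 128-dim heads, FP16
print(decode_attention_intensity(seq_len=4096, n_heads=32, n_kv_heads=8, d_head=128))
# ~4 FLOP/byte, far left of the ~295 FLOP/byte H100 ridge point
```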
2. KV cache reads at long context
In ncu: L2 hit rate drops below 15% at 32K+ token contexts because the KV cache does not fit in L2. Each decode step sweeps the entire KV cache from HBM. Profile shows sequential memory reads with low arithmetic intensity.
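A quick size check shows why the hit rate collapses. The sketch below uses hypothetical 70B-class parameters (80 layers, 8 KV heads, 128-dim heads, FP16) against the roughly 50 MB of L2 on an H100:

```python
# Per-request KV-cache bytes swept from HBM on every decode step.
# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA), 128-dim heads, FP16.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, d_head=128, bytes_per_elem=2):
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem  # K + V

gb = kv_cache_bytes(seq_len=32_768) / 1e9
print(f"{gb:.1f} GB of KV cache per request")  # ~10.7 GB vs ~0.05 GB of H100 L2
```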
Fix: prefix caching (radix attention in SGLang) reuses KV cache for shared prefixes, converting cold reads to cache hits. For extreme context lengths, NVMe KV offload is the fallback. See the KV cache optimization guide.
3. All-reduce stalls in tensor-parallel inference
In the nsys timeline: white gaps between compute kernels alongside ncclAllReduce events. Stall duration scales with tensor parallel (TP) degree. At TP=8 on H100 NVLink, AllReduce adds 0.5-1.5ms per transformer layer decode step.
Fix: AllReduce overlap via disaggregated inference (NVIDIA Dynamo), or reduce TP degree if memory allows. See the prefill-decode disaggregation guide.
4. Prefill blocking decode
In nsys: long prefill phases (hundreds of milliseconds) appear back-to-back while decode requests queue. Visible as a "burst then idle" pattern. This is not a kernel efficiency problem. It is an architecture problem.
Fix: separate prefill and decode instances. See prefill-decode disaggregation.
5. Python dispatch overhead in tight decode loops
In torch.profiler timeline: aten::item, aten::copy_, or Python enumerate calls appear in the critical CUDA path. This only shows up in eager-mode PyTorch inference without torch.compile or CUDA graphs.
Fix: enable torch.compile with mode='reduce-overhead'. See the torch.compile guide.
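A minimal sketch of that fix, with model, input_ids, and the decode loop as placeholders for your own eager-mode inference code; reduce-overhead mode captures the compiled region into CUDA graphs so per-step dispatch leaves the critical path:

```python
import torch

# mode="reduce-overhead" captures the compiled region into CUDA graphs,
# removing per-step Python dispatch from the decode loop's critical path.
# model and input_ids are placeholders for your own decode-step code.
compiled_model = torch.compile(model, mode="reduce-overhead")

max_new_tokens = 128
with torch.inference_mode():
    for _ in range(max_new_tokens):
        logits = compiled_model(input_ids)
        next_token = logits[:, -1:].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
```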
Profiling vLLM, SGLang, and TensorRT-LLM
vLLM
Set VLLM_WORKER_MULTIPROC_METHOD=spawn before profiling so nsys can attach to worker processes. Profile at the process level:
nsys profile --wait all \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
Key kernels to watch: paged_attention_v1_kernel (PagedAttention decode), fused_add_rms_norm_kernel (layernorm fusion). Use a controlled load generator with fixed concurrency and sequence length for reproducible traces.
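One way to build that load generator is a short script against the OpenAI-compatible endpoint the server above exposes. The sketch below is an illustration only; the port, concurrency, prompt length, and request count are assumptions to adjust for your setup:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # default vLLM API server port
CONCURRENCY = 8          # hold concurrency constant for reproducible traces
PROMPT = "word " * 512   # fixed-length prompt (assumption: ~512 tokens)

def one_request(_):
    payload = {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": PROMPT,
        "max_tokens": 128,
        "temperature": 0.0,
    }
    return requests.post(URL, json=payload, timeout=300).json()

# Keep the kernel mix stable while nsys records.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(64)))
```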
SGLang
SGLang exposes --enable-torch-profiler to wrap torch.profiler around specific requests. Radix attention cache hit shows as radix_cache_decode taking near-zero time. Cache miss shows as a full KV recompute sequence in nsys. Profile at multiple cache hit rates to see the latency cliff when cache pressure increases.
TensorRT-LLM
Build the engine with --profiling-verbosity detailed to enable kernel-level hooks. nsys is the primary tool; ncu on TRT-LLM requires a non-batched single-inference test harness because TRT manages kernel dispatch internally. The fused MHA kernel appears as fmha_v2_flash_attn in the timeline; cublas GEMM variants appear by name for each layer size.
From Profile to Fix
| Profile Observation | Root Cause | Action |
|---|---|---|
| Attention memory-bound on roofline | HBM bandwidth ceiling at small batch | Enable FlashAttention, increase batch size |
| NCCL AllReduce gaps > 2ms per layer | Tensor-parallel communication overhead | Reduce TP degree, enable AllReduce overlap |
| L2 hit rate < 20% on decode kernels | KV cache exceeds L2 capacity at long context | Enable prefix caching, larger PagedAttention block size |
| Prefill >> decode in nsys timeline | No prefill-decode separation | Disaggregated prefill-decode routing |
| Python ops in critical CUDA path | Eager mode dispatch overhead | torch.compile + CUDA graphs |
| Low SM occupancy on GEMM | Small matrix dimensions (batch=1) | Continuous batching; group decode requests |
| Memory throughput < 40% despite memory-bound kernel | PCIe bottleneck (CPU-GPU transfer in loop) | Move data to GPU before the kernel loop |
For continuous batching configuration that addresses the SM occupancy and batching rows above, see the LLM serving optimization guide.
Profiling Cost: Keeping ncu Sessions Under 30 GPU-Minutes
Live pricing fetched 2026-05-11:
- H100 SXM5 on-demand: $1.66/hr, spot: $1.52/hr
- H100 PCIe on-demand: $3.29/hr (no spot tier available)
- A100 80G PCIe spot: $1.15/hr (for smaller model profiling or initial --set default passes)
| Profiling scope | GPU-minutes | H100 SXM5 on-demand | A100 80G PCIe spot |
|---|---|---|---|
| --set default (1 replay pass) | 3-5 min | $0.08-$0.14 | $0.06-$0.10 |
| --set full, 7B model forward | 10-15 min | $0.28-$0.42 | $0.19-$0.29 |
| --set full, 70B model forward | 20-30 min | $0.55-$0.83 | $0.38-$0.57 |
Strategies to stay under budget:
- --set default first: one replay pass, captures the roofline basics. Use it to confirm a kernel is memory-bound before paying for --set full.
- --launch-skip N --launch-count M: profile only the 5-10 kernel launches that matter, not the full warmup.
- Mini-batch profiling: profile at batch=1, sequence=512. Scale observations to the production config analytically.
- Spot instances: profiling is stateless and short. Use H100 SXM5 spot at $1.52/hr for 70B model sessions, or A100 80G PCIe spot at $1.15/hr for smaller models under 40B. Spin up, run one ncu pass (~20-30 min), download the .ncu-rep, tear down.
Pricing fluctuates based on GPU availability. The prices above are based on 11 May 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron Profiling Workflow and Reference Docker Image
Step-by-step workflow:
- Provision an H100 PCIe instance from the Spheron dashboard or via the CLI.
- SSH in and pull the profiling container.
- Run with --cap-add SYS_ADMIN for counter access:
docker run --rm --gpus all --cap-add SYS_ADMIN \
-v $(pwd):/workspace \
nvcr.io/nvidia/pytorch:24.10-py3 \
ncu --replay-mode kernel --set full \
-o /workspace/profile.ncu-rep \
python /workspace/inference_script.py
- Download traces via scp for local GUI analysis.
- Tear down the instance. Total cost for a full profile session: $1.10-$1.65 on H100 PCIe on-demand ($3.29/hr, 20-30 GPU-min).
Reference Dockerfile.profiling:
FROM nvcr.io/nvidia/pytorch:24.10-py3
RUN pip install --no-cache-dir \
HolisticTraceAnalysis \
nvidia-pytool-report \
triton==3.0.0
ENV TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
ENV NCCL_DEBUG=WARN
WORKDIR /workspace
Why --cap-add SYS_ADMIN works on Spheron: bare-metal GPU allocation passes NVIDIA driver capabilities through to the container runtime. Platforms that use shared-kernel multi-tenancy or hypervisor-mediated GPU access cannot grant this capability, so ncu fails at counter collection with ERR_NVGPUCTRPERM. This is a structural property of the hosting model, not something configuration can fix.
Kernel-level profiling requires full hardware counter access, something most serverless GPU platforms block for tenant isolation. Spheron's bare-metal GPU instances give you complete ncu and nsys access, so a 15-minute profiling session on H100 SXM5 costs under $0.50 instead of an hours-long guessing game.
Rent H100 → | View all GPU pricing → | Get started on Spheron →
