Most LLM inference slowdowns come from 2-5 kernels that a 10-minute profile would identify. Yet ML engineers rarely profile on cloud GPUs because the tooling feels painful: GUIs that need displays, counter permission errors on shared infrastructure, trace files that are hard to get off a remote host. This guide cuts through that. It covers the inference engineering stack from the tool selection decision down to reading a roofline chart, and pairs with GPU monitoring basics for teams already tracking utilization but hitting a wall where nvidia-smi stops being useful.
Which Tool for Which Job
| Tool | Granularity | When to use |
|---|---|---|
| nsys (Nsight Systems) | Application timeline | Always run first. Identifies which phases dominate wall time. |
| ncu (Nsight Compute) | Single kernel, all hardware counters | Roofline analysis after nsys identifies a slow kernel target. |
| torch.profiler | Python ops + CUDA ops, per-rank | Distributed training jobs; operator attribution back to Python code. |
| Triton Proton | Triton kernel internals | Only when writing custom Triton kernels. |
Start with nsys for orientation. It gives you the timeline: where the GPU is busy, where it stalls, which kernels eat the most time. Drill into ncu for 1-3 slow kernels once you have targets. Use torch.profiler when you need Python-level attribution or per-rank distributed analysis. Use Proton only if you are writing custom Triton.
All four tools require either bare-metal or VM-level GPU access. Most serverless GPU platforms block hardware counter collection for tenant isolation. Running these tools on bare-metal H100 instances on Spheron gives you full ncu and nsys access without the ERR_NVGPUCTRPERM error that kills profiling sessions on shared infrastructure.
Capturing a Trace on a Remote Cloud GPU
The key principle is separating capture from analysis. Collect traces headlessly on the remote host, download them, analyze locally. No X11 forwarding, no GUI on the server, no VNC.
nsys capture
nsys profile \
--trace cuda,nvtx,osrt,cudnn \
--output /workspace/sys-profile.nsys-rep \
python inference.py
scp user@host:/workspace/sys-profile.nsys-rep ./
Open sys-profile.nsys-rep in the Nsight Systems desktop GUI (free from developer.nvidia.com). The CUDA HW row shows kernel execution. Sort by total GPU time to find the top 3 offenders.
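To make the timeline readable by phase rather than by kernel name, NVTX ranges around the phases you care about show up as labelled bars on the NVTX row (already captured by the --trace cuda,nvtx flag above). A minimal sketch, where model.prefill, model.decode, prompt_ids, and max_new_tokens are placeholders for your own inference code:

```python
import torch

# NVTX ranges appear as labelled bars on the NVTX row of the nsys timeline,
# so GPU time can be attributed to prefill vs decode at a glance.
# model.prefill / model.decode are placeholders for your own inference code.
torch.cuda.nvtx.range_push("prefill")
logits, kv_cache = model.prefill(prompt_ids)
next_token = logits[:, -1].argmax(dim=-1)
torch.cuda.nvtx.range_pop()

for step in range(max_new_tokens):
    torch.cuda.nvtx.range_push(f"decode_{step}")
    logits, kv_cache = model.decode(next_token, kv_cache)
    next_token = logits[:, -1].argmax(dim=-1)
    torch.cuda.nvtx.range_pop()
```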
ncu capture with --replay-mode kernel
ncu \
--replay-mode kernel \
--set full \
--target-processes all \
-o /workspace/kernel-profile.ncu-rep \
python inference.py
--replay-mode kernel re-runs each kernel individually per counter set without restarting the process. --replay-mode application reruns the entire script per counter set, which is more accurate for stateful kernels but far slower and more expensive for LLM workloads where the full forward pass takes seconds.
Scoping to specific kernels
Profile only the kernels that matter. This keeps trace size and runtime manageable:
ncu --kernel-name flash_attn_varlen_fwd \
--launch-skip 10 --launch-count 5 \
--set full \
-o /workspace/attn-profile.ncu-rep \
python inference.py
--launch-skip 10 skips the first 10 launches (warmup). --launch-count 5 collects 5 launches. The resulting trace covers one attention kernel in detail without profiling hundreds of unrelated kernels.
Docker setup for profiling
FROM nvcr.io/nvidia/pytorch:24.10-py3
RUN pip install HolisticTraceAnalysis nvidia-pytool-report triton==3.0.0
ENV TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
WORKDIR /workspace
Run with:
docker run --rm --gpus all --cap-add SYS_ADMIN \
-v $(pwd):/workspace \
profiling-image \
ncu --replay-mode kernel --set full \
-o /workspace/profile.ncu-rep \
python /workspace/inference_script.py
--cap-add SYS_ADMIN is the minimal Linux capability needed for hardware counter access. --privileged works too but grants broader permissions than necessary. On most serverless platforms, neither flag is available and ncu fails with ERR_NVGPUCTRPERM. This is a structural constraint of their hosting model, not a configuration issue you can work around.
For GPU monitoring basics that complement this profiling workflow, see the GPU monitoring for ML guide.
Reading the Nsight Compute Roofline Chart
The roofline model has two axes: arithmetic intensity on x (FLOPs per byte of DRAM traffic) and achieved performance on y (GFLOPs/s). Two lines define the "roof": the memory bandwidth slope (bounded by GB/s) and the compute ceiling (bounded by peak TFLOP/s). Your kernel's measured point sits somewhere on this chart.
H100 SXM5 reference numbers: 3.35 TB/s HBM3 bandwidth, 989 TFLOP/s BF16. The ridge point, where a kernel transitions from memory-bound to compute-bound, sits at approximately 989,000 GFLOPs / 3,350 GB/s = 295 FLOP/byte.
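The ridge-point arithmetic is worth keeping next to your analysis notes. A minimal sketch using the H100 SXM5 datasheet numbers above; swap in your own GPU's peak FLOP/s and bandwidth:

```python
# Ridge point = peak compute / peak memory bandwidth, in FLOP per byte.
# H100 SXM5 datasheet values; substitute your GPU's numbers.
PEAK_BF16_FLOPS = 989e12        # 989 TFLOP/s
PEAK_HBM_BYTES_PER_S = 3.35e12  # 3.35 TB/s

ridge_point = PEAK_BF16_FLOPS / PEAK_HBM_BYTES_PER_S  # ~295 FLOP/byte

def bound_by(arithmetic_intensity):
    """Classify a kernel from its measured FLOP/byte (from the ncu roofline page)."""
    return "compute-bound" if arithmetic_intensity >= ridge_point else "memory-bound"

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(bound_by(6))    # decode attention at batch=1 -> memory-bound
print(bound_by(400))  # large-batch GEMM -> compute-bound
```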
Where LLM kernels typically land:
- Large batch GEMM: compute-bound, upper-right, near the compute roof
- Attention at decode batch=1: memory-bound, lower-left. Arithmetic intensity around 4-8 FLOP/byte on H100, 40x left of the ridge point
- LayerNorm, RoPE, elementwise ops: heavily memory-bound, often under 1 FLOP/byte
- FP8 GEMM at medium batch: often near the ridge point, transitioning between memory and compute bound
A 7B model attention kernel at decode batch=1 on H100 SXM5 has arithmetic intensity around 6 FLOP/byte. The H100 ridge point sits at ~295 FLOP/byte. The kernel is more than 40x to the left of the ridge. The fix is not a bigger GPU. It is reducing HBM trips via FlashAttention or increasing batch size to amortize memory reads. Once you can read this in Nsight Compute, you stop guessing.
Key secondary metrics to check alongside roofline:
- SM utilization %: low means underparallelized (too few concurrent warps)
- Memory throughput % of peak HBM: low despite being memory-bound means stall somewhere else (often L2 thrashing)
- Warp occupancy: low indicates register pressure or shared memory overuse
- L2 hit rate: low means cold KV cache reads, common at long context. Fix: prefix caching
For more on KV cache behavior and fixes, see the KV cache optimization guide and why your LLM inference is slow.
PyTorch Profiler and Holistic Trace Analysis for Distributed Jobs
For multi-GPU training, torch.profiler plus HTA gives you per-rank attribution and communication overlap analysis that nsys cannot easily provide.
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./traces'),
    record_shapes=True,
    with_stack=True,
    with_flops=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()
Parameter notes:
- schedule(wait=1, warmup=2, active=5): skip step 0 (startup overhead), warm up 2 steps (JIT compilation), then record 5 steps. Without this, your trace captures compilation time and skews everything.
- record_shapes=True: required to attribute kernels to specific matrix shapes and layer sizes.
- with_stack=True: adds the Python call stack to each op. Roughly 20% overhead; disable in production captures.
- with_flops=True: enables FLOPs counting for GEMM and convolution ops.
HTA loads traces from all ranks and surfaces the bottlenecks:
pip install HolisticTraceAnalysis
from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir='./traces')
idle = analyzer.get_idle_time_breakdown(ranks=[0, 1, 2, 3])
critical = analyzer.critical_path_analysis(rank=0, annotation='step', instance_id=0)
HTA surfaces the idle time breakdown per rank (compute vs communication vs memory-wait), all-reduce overlap efficiency (what fraction of NCCL time overlaps with compute), and the critical path across the forward + backward + optimizer step.
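For the overlap number specifically, HTA also exposes a communication-computation overlap query. The sketch below assumes the same trace directory as above; check your HTA version for the exact method signature:

```python
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir='./traces')

# Fraction of NCCL communication time that overlaps with compute, per rank.
# A low percentage means all-reduce is serialized against math it could hide behind.
overlap_df = analyzer.get_comm_comp_overlap(visualize=False)
print(overlap_df)
```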
For context on setting up distributed training environments, see the distributed LLM training guide and the NCCL tuning guide.
Common LLM Inference Bottlenecks in Profiles
1. Attention memory bandwidth
Profile signature in nsys: flash_attn_varlen_fwd in top-5 by GPU time. In ncu roofline: memory-bound, achieved bandwidth 60-80% of peak H100 HBM3.
At decode batch=1 on a 70B model, attention accounts for 30-40% of total decode time. The roofline shows this is not a compute problem. FlashAttention reduces HBM round trips by computing attention in tiles that fit in SRAM, but even FlashAttention is memory-bound at batch=1. Increasing batch size amortizes the weight reads across more tokens.
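The "memory-bound at batch=1" conclusion also falls out of a back-of-envelope count. The sketch below counts only the two attention matmuls against the KV-cache bytes read, using a hypothetical 7B-class GQA configuration; it is an estimate, not a substitute for the intensity ncu measures:

```python
# Rough arithmetic intensity of decode attention at batch=1: FLOPs from the
# QK^T and softmax(QK^T)@V matmuls vs. bytes streamed from HBM for the K/V cache.
def decode_attention_intensity(seq_len, n_heads, n_kv_heads, d_head, bytes_per_elem=2):
    flops = n_heads * (2 * seq_len * d_head    # QK^T
                       + 2 * seq_len * d_head)  # softmax(QK^T) @ V
    kv_bytes = 2 * seq_len * n_kv_heads * d_head * bytes_per_elem  # K and V reads
    return flops / kv_bytes

# Hypothetical 7B-class GQA config: 32 query heads, 8 KV heads, 128-dim heads, FP16
print(decode_attention_intensity(seq_len=4096, n_heads=32, n_kv_heads=8, d_head=128))
# ~4 FLOP/byte, far left of the ~295 FLOP/byte H100 ridge point
```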
2. KV cache reads at long context
In ncu: L2 hit rate drops below 15% at 32K+ token contexts because the KV cache does not fit in L2. Each decode step sweeps the entire KV cache from HBM. Profile shows sequential memory reads with low arithmetic intensity.
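A quick size check shows why the hit rate collapses. The sketch below uses hypothetical 70B-class parameters (80 layers, 8 KV heads, 128-dim heads, FP16) against the roughly 50 MB of L2 on an H100:

```python
# Per-request KV-cache bytes swept from HBM on every decode step.
# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA), 128-dim heads, FP16.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, d_head=128, bytes_per_elem=2):
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem  # K + V

gb = kv_cache_bytes(seq_len=32_768) / 1e9
print(f"{gb:.1f} GB of KV cache per request")  # ~10.7 GB vs ~0.05 GB of H100 L2
```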
Fix: prefix caching (radix attention in SGLang) reuses KV cache for shared prefixes, converting cold reads to cache hits. For extreme context lengths, NVMe KV offload is the fallback. See the KV cache optimization guide.
3. All-reduce stalls in tensor-parallel inference
In the nsys timeline: white gaps between compute kernels alongside ncclAllReduce events. Stall duration scales with tensor parallel (TP) degree. At TP=8 on H100 NVLink, AllReduce adds 0.5-1.5ms per transformer layer decode step.
Fix: AllReduce overlap via disaggregated inference (NVIDIA Dynamo), or reduce TP degree if memory allows. See the prefill-decode disaggregation guide.
4. Prefill blocking decode
In nsys: long prefill phases (hundreds of milliseconds) appear back-to-back while decode requests queue. Visible as a "burst then idle" pattern. This is not a kernel efficiency problem. It is an architecture problem.
Fix: separate prefill and decode instances. See prefill-decode disaggregation.
5. Python dispatch overhead in tight decode loops
In torch.profiler timeline: aten::item, aten::copy_, or Python enumerate calls appear in the critical CUDA path. This only shows up in eager-mode PyTorch inference without torch.compile or CUDA graphs.
Fix: enable torch.compile with mode='reduce-overhead'. See the torch.compile guide.
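A minimal sketch of that fix, with model, input_ids, and the decode loop as placeholders for your own eager-mode inference code; reduce-overhead mode captures the compiled region into CUDA graphs so per-step dispatch leaves the critical path:

```python
import torch

# mode="reduce-overhead" captures the compiled region into CUDA graphs,
# removing per-step Python dispatch from the decode loop's critical path.
# model and input_ids are placeholders for your own decode-step code.
compiled_model = torch.compile(model, mode="reduce-overhead")

max_new_tokens = 128
with torch.inference_mode():
    for _ in range(max_new_tokens):
        logits = compiled_model(input_ids)
        next_token = logits[:, -1:].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
```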
Profiling vLLM, SGLang, and TensorRT-LLM
vLLM
Set VLLM_WORKER_MULTIPROC_METHOD=spawn before profiling so nsys can attach to worker processes. Profile at the process level:
nsys profile --wait all \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
Key kernels to watch: paged_attention_v1_kernel (PagedAttention decode), fused_add_rms_norm_kernel (layernorm fusion). Use a controlled load generator with fixed concurrency and sequence length for reproducible traces.
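One way to build that load generator is a short script against the OpenAI-compatible endpoint the server above exposes. The sketch below is an illustration only; the port, concurrency, prompt length, and request count are assumptions to adjust for your setup:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # default vLLM API server port
CONCURRENCY = 8          # hold concurrency constant for reproducible traces
PROMPT = "word " * 512   # fixed-length prompt (assumption: ~512 tokens)

def one_request(_):
    payload = {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": PROMPT,
        "max_tokens": 128,
        "temperature": 0.0,
    }
    return requests.post(URL, json=payload, timeout=300).json()

# Keep the kernel mix stable while nsys records.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(64)))
```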
SGLang
SGLang exposes --enable-torch-profiler to wrap torch.profiler around specific requests. Radix attention cache hit shows as radix_cache_decode taking near-zero time. Cache miss shows as a full KV recompute sequence in nsys. Profile at multiple cache hit rates to see the latency cliff when cache pressure increases.
TensorRT-LLM
Build the engine with --profiling-verbosity detailed to enable kernel-level hooks. nsys is the primary tool; ncu on TRT-LLM requires a non-batched single-inference test harness because TRT manages kernel dispatch internally. The fused MHA kernel appears as fmha_v2_flash_attn in the timeline; cublas GEMM variants appear by name for each layer size.
From Profile to Fix
| Profile Observation | Root Cause | Action |
|---|---|---|
| Attention memory-bound on roofline | HBM bandwidth ceiling at small batch | Enable FlashAttention, increase batch size |
| NCCL AllReduce gaps > 2ms per layer | Tensor-parallel communication overhead | Reduce TP degree, enable AllReduce overlap |
| L2 hit rate < 20% on decode kernels | KV cache exceeds L2 capacity at long context | Enable prefix caching, larger PagedAttention block size |
| Prefill >> decode in nsys timeline | No prefill-decode separation | Disaggregated prefill-decode routing |
| Python ops in critical CUDA path | Eager mode dispatch overhead | torch.compile + CUDA graphs |
| Low SM occupancy on GEMM | Small matrix dimensions (batch=1) | Continuous batching; group decode requests |
| Memory throughput < 40% despite memory-bound kernel | PCIe bottleneck (CPU-GPU transfer in loop) | Move data to GPU before the kernel loop |
For continuous batching configuration that addresses the SM occupancy and batching rows above, see the LLM serving optimization guide.
Profiling Cost: Keeping ncu Sessions Under 30 GPU-Minutes
Live pricing fetched 2026-05-11:
- H100 SXM5 on-demand: $1.66/hr, spot: $1.52/hr
- H100 PCIe on-demand: $3.29/hr (no spot tier available)
- A100 80G PCIe spot: $1.15/hr (for smaller model profiling or initial --set default passes)
| Profiling scope | GPU-minutes | H100 SXM5 on-demand | A100 80G PCIe spot |
|---|---|---|---|
| --set default (1 replay pass) | 3-5 min | $0.08-$0.14 | $0.06-$0.10 |
| --set full, 7B model forward | 10-15 min | $0.28-$0.42 | $0.19-$0.29 |
| --set full, 70B model forward | 20-30 min | $0.55-$0.83 | $0.38-$0.57 |
Strategies to stay under budget:
- --set default first: one replay pass, captures the roofline basics. Use it to confirm a kernel is memory-bound before paying for --set full.
- --launch-skip N --launch-count M: profile only the 5-10 kernel launches that matter, not the full warmup.
- Mini-batch profiling: profile at batch=1, sequence=512. Scale observations to the production config analytically.
- Spot instances: profiling is stateless and short. Use H100 SXM5 spot at $1.52/hr for 70B model sessions, or A100 80G PCIe spot at $1.15/hr for smaller models under 40B. Spin up, run one ncu pass (~20-30 min), download the .ncu-rep, tear down.
Pricing fluctuates based on GPU availability. The prices above are based on 11 May 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron Profiling Workflow and Reference Docker Image
Step-by-step workflow:
- Provision an H100 PCIe instance from the Spheron dashboard or via the CLI.
- SSH in and pull the profiling container.
- Run with --cap-add SYS_ADMIN for counter access:
docker run --rm --gpus all --cap-add SYS_ADMIN \
-v $(pwd):/workspace \
nvcr.io/nvidia/pytorch:24.10-py3 \
ncu --replay-mode kernel --set full \
-o /workspace/profile.ncu-rep \
python /workspace/inference_script.py
- Download traces via scp for local GUI analysis.
- Tear down the instance. Total cost for a full profile session: $1.10-$1.65 on H100 PCIe on-demand ($3.29/hr, 20-30 GPU-min).
Reference Dockerfile.profiling:
FROM nvcr.io/nvidia/pytorch:24.10-py3
RUN pip install --no-cache-dir \
HolisticTraceAnalysis \
nvidia-pytool-report \
triton==3.0.0
ENV TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
ENV NCCL_DEBUG=WARN
WORKDIR /workspace
Why --cap-add SYS_ADMIN works on Spheron: bare-metal GPU allocation passes NVIDIA driver capabilities through to the container runtime. Platforms that use shared-kernel multi-tenancy or hypervisor-mediated GPU access cannot grant this capability, so ncu fails at counter collection with ERR_NVGPUCTRPERM. This is a structural property of the hosting model, not something configuration can fix.
Kernel-level profiling requires full hardware counter access, something most serverless GPU platforms block for tenant isolation. Spheron's bare-metal GPU instances give you complete ncu and nsys access, so a 15-minute profiling session on H100 SXM5 costs under $0.50 instead of an hours-long guessing game.
Rent H100 → | View all GPU pricing → | Get started on Spheron →
