Tutorial

OpenAI Triton Kernel Development on GPU Cloud: Write Custom AI Kernels in Python Without CUDA C++ (2026 Guide)

Written by Mitrasish, Co-founder · May 10, 2026
Tags: OpenAI Triton, Triton GPU Programming, Custom CUDA Kernel Python, GPU Kernel Development, Triton Autotuner, Persistent Kernels, H100, B200, vLLM, GPU Cloud

OpenAI Triton is the default kernel layer in PyTorch 2.x: torch.compile lowers to Triton by default, and vLLM's core GPU kernels (PagedAttention variants, RoPE, RMSNorm) are written in Triton. It has also eliminated C++ from kernel development for most ML use cases. You can now write a production-grade fused attention kernel, a custom activation, or a memory-efficient matmul variant entirely in Python and get PTX that competes with hand-tuned CUDA.

Before going further: this post is about OpenAI Triton (also called triton-lang), the Python GPU programming language and compiler. It is not about NVIDIA Triton Inference Server, which is a model serving platform. The two share a name and nothing else. If you are looking for the inference serving layer, see the inference engineering guide for the full stack overview. For the NVIDIA serving platform specifically, the search term you want is "NVIDIA Triton Inference Server".

What Triton Is (and What It Replaced)

Triton is a tiled-loop compiler. You write compute logic at the tile level in Python: load a tile from HBM, run operations on it, store back. The Triton compiler handles thread block layout, shared memory allocation, warp synchronization, and hardware-specific instruction selection. The output is optimized PTX or CUBIN that runs on the target GPU.
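
To make that pattern concrete before the matmul example below, here is a minimal illustrative elementwise kernel (the scale_kernel name is ours) showing the load-compute-store shape every Triton program shares:

python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, y_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # 1. Compute this program instance's tile offsets
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # 2. Load a tile from HBM
    x = tl.load(x_ptr + offsets, mask=mask)
    # 3. Compute on the tile
    y = x * scale
    # 4. Store the tile back to HBM
    tl.store(y_ptr + offsets, y, mask=mask)

x = torch.randn(10_000, device='cuda')
y = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scale_kernel[grid](x, y, x.numel(), 2.0, BLOCK_SIZE=1024)
assert torch.allclose(y, x * 2.0)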

Before Triton, writing a fused softmax or a custom attention variant required CUDA C++. That meant managing thread indices manually, dealing with shared memory bank conflicts, writing warp-level primitives (WMMA, wgmma) with precise register layout requirements, and debugging silent correctness errors that only appear at specific warp occupancies. For most ML teams, that work was a detour: the kernel was not the product, the model was.

What changed: PyTorch 2.0 (2023) shipped TorchInductor with Triton as its default GPU codegen backend. Operator fusion inside torch.compile now emits Triton kernels, not CUDA C++. The FlashAttention repository ships Triton implementations alongside its CUDA kernels. Liger-Kernel, which provides production-quality fused LLM operator implementations, is entirely Triton. The FlexAttention API in PyTorch 2.5 compiles user-defined attention masks and score modifications into Triton kernels. The ecosystem converged on Triton as the practical interface between Python ML code and GPU hardware.

There is one place Triton stepped back: FlashAttention 4 on Blackwell SM100 moved to a lower-level tile DSL (CuTeDSL) because Blackwell's tensor memory (TMEM) hardware requires tile-level control that Triton's abstractions do not yet expose fully. See the FlashAttention-4 Blackwell inference guide for details on that architecture. For the CUDA 13 tile programming approach as an alternative to Triton on Blackwell, see the CUDA 13 tile programming guide.

Triton 3.x Architecture

Block Pointers and Tiled Loads

Triton 3.x introduced tl.make_block_ptr, which replaces the manual pointer arithmetic that earlier Triton versions required. Instead of computing offsets by hand, you describe the tensor layout, strides, and tile shape, and the block pointer handles address generation.

Here is a block pointer for loading a 2D matrix tile and advancing it across the K dimension in a matmul inner loop:

python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    A_ptr, B_ptr, C_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Block pointer for A: tile of shape (BLOCK_M, BLOCK_K)
    a_block_ptr = tl.make_block_ptr(
        base=A_ptr,
        shape=(M, K),
        strides=(stride_am, stride_ak),
        offsets=(pid_m * BLOCK_M, 0),
        block_shape=(BLOCK_M, BLOCK_K),
        order=(1, 0),
    )

    # Block pointer for B: tile of shape (BLOCK_K, BLOCK_N)
    b_block_ptr = tl.make_block_ptr(
        base=B_ptr,
        shape=(K, N),
        strides=(stride_bk, stride_bn),
        offsets=(0, pid_n * BLOCK_N),
        block_shape=(BLOCK_K, BLOCK_N),
        order=(1, 0),
    )

    accumulator = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k in range(0, K, BLOCK_K):
        # boundary_check pads out-of-bounds elements with zero when K % BLOCK_K != 0
        a = tl.load(a_block_ptr, boundary_check=(0, 1), padding_option='zero')
        b = tl.load(b_block_ptr, boundary_check=(0, 1), padding_option='zero')
        accumulator += tl.dot(a, b)
        # Advance both pointers along K dimension
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))

    # Write output
    c_block_ptr = tl.make_block_ptr(
        base=C_ptr,
        shape=(M, N),
        strides=(stride_cm, stride_cn),
        offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
        block_shape=(BLOCK_M, BLOCK_N),
        order=(1, 0),
    )
    # boundary_check handles M % BLOCK_M != 0 or N % BLOCK_N != 0 at output tile edges
    tl.store(c_block_ptr, accumulator.to(tl.float16), boundary_check=(0, 1))

tl.make_block_ptr matters for two reasons. First, for aligned tiles it eliminates the explicit out-of-bounds masking that was a significant source of overhead in older Triton code. Second, on Hopper and Blackwell it lets the compiler emit TMA-style asynchronous loads that overlap data movement with compute, with no additional code from the programmer.
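
For reference, a host-side launcher for the non-autotuned kernel above might look like the sketch below (the triton_matmul name is ours; once @triton.autotune is applied, the BLOCK_* arguments come from the selected config rather than the call site):

python
import torch

def triton_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 32
    # One program instance per (BLOCK_M, BLOCK_N) output tile
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c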

Autotuning

The @triton.autotune decorator wraps a kernel with a list of triton.Config candidates. On the first call for a given set of problem sizes, Triton benchmarks all configs and caches the fastest one. Subsequent calls with the same sizes use the cached config.

python
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_warps=8, num_stages=3),
        triton.Config({'BLOCK_M': 64,  'BLOCK_N': 256, 'BLOCK_K': 32}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64,  'BLOCK_K': 32}, num_warps=4, num_stages=4),
    ],
    key=['M', 'N', 'K'],  # Retune when these problem dimensions change
)
@triton.jit
def matmul_kernel(...):
    ...

The five key config knobs:

  • BLOCK_M, BLOCK_N, BLOCK_K: tile dimensions that determine shared memory footprint and compute density
  • num_warps: warps per thread block (4 or 8 is typical; higher requires more registers)
  • num_stages: software pipeline depth for overlapping memory loads with compute

Set TRITON_CACHE_DIR to a persistent directory to avoid recompilation across runs:

bash
export TRITON_CACHE_DIR=/workspace/.triton_cache

In a Docker image, pre-warm the cache by running a warmup pass with representative problem sizes at container build time, then mount the cache directory as a volume at runtime. This avoids 30-120 second cold-start compilation delays in production inference servers.
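
A build-time warmup script might look like this sketch (the my_kernels module name and the shapes are illustrative; the wrapper is the triton_fused_softmax built later in this guide):

python
# warmup.py -- run during image build (GPU-enabled builder) or at container start
import os
os.environ.setdefault('TRITON_CACHE_DIR', '/workspace/.triton_cache')

import torch
from my_kernels import triton_fused_softmax  # illustrative module name

for n_cols in (1024, 4096, 8192):  # representative production shapes
    x = torch.randn(64, n_cols, device='cuda', dtype=torch.float16)
    _ = triton_fused_softmax(x)    # triggers compile + autotune, populating the cache
torch.cuda.synchronize()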

When NOT to autotune: latency-sensitive inference paths where cold-start recompilation can add hundreds of milliseconds to p99. In those cases, fix tile sizes to values that are known-good for your GPU and problem shape, and ship the pre-compiled kernel artifacts.

Persistent Kernels (Triton 3.x)

A persistent kernel loops over multiple tiles inside a single launch instead of scheduling one program per tile. In the standard (non-persistent) pattern, the grid grows with the tile count, which adds launch and scheduling overhead at high tile counts. Persistent kernels eliminate that overhead by keeping the SMs occupied across the full work queue.

python
@triton.jit
def persistent_softmax_kernel(
    input_ptr, output_ptr, n_rows, n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Single-tile-per-row: BLOCK_SIZE must be >= n_cols. For rows wider than
    # BLOCK_SIZE, increase BLOCK_SIZE at the call site or add a column loop.
    tl.device_assert(n_cols <= BLOCK_SIZE, "n_cols must not exceed BLOCK_SIZE")
    # Each program instance processes multiple rows
    pid = tl.program_id(0)
    num_programs = tl.num_programs(0)

    for row_idx in range(pid, n_rows, num_programs):
        row_start = row_idx * n_cols
        cols = tl.arange(0, BLOCK_SIZE)
        mask = cols < n_cols

        x = tl.load(input_ptr + row_start + cols, mask=mask, other=-float('inf'))
        x_max = tl.max(x, axis=0)
        x = x - x_max
        x_exp = tl.exp(x)
        x_sum = tl.sum(x_exp, axis=0)
        out = x_exp / x_sum

        tl.store(output_ptr + row_start + cols, out, mask=mask)

On Hopper and Blackwell, persistent kernels combine well with software pipelining (num_stages > 1) because the SM stays alive long enough for the async memory pipeline to amortize latency across many tiles. For memory-bandwidth-bound operations like softmax or layer norm, the persistent pattern typically improves throughput 10-20% over the non-persistent variant on H100.

Setting Up a Triton Dev Environment on Spheron

Kernel benchmarking requires bare-metal access, for two reasons. First, noisy-neighbor CPU interference on virtualized instances can distort micro-benchmarks by 15-30%, making it impossible to compare configs reliably. Second, ncu (NVIDIA Nsight Compute) requires root or --privileged Docker access to read hardware performance counters, and on shared cloud VMs you often cannot get either. Spheron bare-metal instances have SSH root access by default.

For Triton kernel development, use Spheron's H100 instances for Hopper architecture testing, or B200 SXM6 availability on Spheron for Blackwell. H100 SXM5 is the standard target for most Triton kernel development today because the ecosystem (FA2/FA3, Liger-Kernel, vLLM internals) targets Hopper. B200 is where you go for Blackwell-specific tuning or to validate persistent kernel performance on SM100.

Environment setup:

bash
# Install Triton from PyTorch's wheel index (matches bundled torch version)
pip install triton torch --index-url https://download.pytorch.org/whl/cu124

# Confirm version alignment between torch and the standalone triton package
python -c "import torch; import triton; print(f'torch={torch.__version__}, triton={triton.__version__}')"

# Verify GPU
nvidia-smi

# Install Nsight Compute for hardware profiling
apt-get install -y nsight-compute

# Proton profiler ships with Triton 3.x (import triton.profiler); no separate install needed

One important note on version pinning: pip install triton alone may pull a different version than the one PyTorch bundles internally. Always install from --index-url https://download.pytorch.org/whl/cu124 so your standalone triton package matches the one PyTorch uses for torch.compile. Mismatches can fail in confusing ways, from import errors to kernels that compile but misbehave.

For SSH and instance provisioning details, see the Spheron quick-start guide.

Authoring a Fused Softmax Kernel: Step-by-Step

An unfused softmax decomposes into separate kernels: a max reduction (for numerical stability), exp, a sum reduction, and a divide. Each step reads its input from and writes its output to HBM. Fusing them into a single kernel means each element is loaded once and written once, with all the math done in registers in between.

Here is a complete runnable fused softmax in Triton:

python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        # BLOCK_SIZE is fixed by the wrapper below (next power of two >= n_cols),
        # so only num_warps is tuned and every candidate computes the full row.
        # Tuning BLOCK_SIZE here would let configs smaller than n_cols win the
        # benchmark while silently truncating rows.
        triton.Config({}, num_warps=4),
        triton.Config({}, num_warps=8),
        triton.Config({}, num_warps=16),
    ],
    key=['n_cols'],
)
@triton.jit
def fused_softmax_kernel(
    input_ptr, output_ptr,
    input_row_stride, output_row_stride,
    n_rows, n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    row_idx = tl.program_id(0)
    if row_idx >= n_rows:
        return
    row_start_ptr = input_ptr + row_idx * input_row_stride

    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    mask = col_offsets < n_cols

    # Load row, mask padding with -inf so it doesn't affect max/sum
    row = tl.load(input_ptrs, mask=mask, other=-float('inf'))

    # Numerically stable softmax: subtract max before exp
    row_max = tl.max(row, axis=0)
    row = row - row_max

    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator

    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, softmax_output, mask=mask)


def triton_fused_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    assert x.is_contiguous(), "Input must be contiguous"

    y = torch.empty_like(x)

    # One full row per program: BLOCK_SIZE must cover n_cols, rounded up to a
    # power of two because tl.arange requires one
    BLOCK_SIZE = triton.next_power_of_2(n_cols)

    # Grid: one program per row
    grid = (n_rows,)
    fused_softmax_kernel[grid](
        x, y,
        x.stride(0), y.stride(0),
        n_rows, n_cols,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y

Benchmark against PyTorch baseline:

python
import torch
import torch.nn.functional as F

x = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

# Warm up
for _ in range(10):
    _ = triton_fused_softmax(x)
    _ = F.softmax(x, dim=-1)

# Time
n_iters = 100
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(n_iters):
    _ = triton_fused_softmax(x)
end.record()
torch.cuda.synchronize()
triton_ms = start.elapsed_time(end) / n_iters

start.record()
for _ in range(n_iters):
    _ = F.softmax(x, dim=-1)
end.record()
torch.cuda.synchronize()
torch_ms = start.elapsed_time(end) / n_iters

print(f"Triton fused softmax: {triton_ms:.3f} ms")
print(f"PyTorch F.softmax:    {torch_ms:.3f} ms")
print(f"Speedup: {torch_ms / triton_ms:.2f}x")

Representative throughput results (based on published Triton reference benchmarks):

| Implementation       | H100 SXM5 throughput | B200 SXM6 throughput | Speedup vs PyTorch |
|----------------------|----------------------|----------------------|--------------------|
| PyTorch F.softmax    | ~350 GB/s            | ~480 GB/s            | 1.0x               |
| Triton fused softmax | ~820 GB/s            | ~1150 GB/s           | ~2.3x              |

These are representative figures based on published Triton benchmarks and will vary with row/column dimensions and dtype. Run the benchmark above on your target hardware to reproduce.

Adding Attention: Fused Softmax + Scaled Dot-Product

A simplified FlashAttention-style inner loop extends the softmax kernel with two key additions: a tiled QK^T matmul and an online softmax update. The online softmax (a running max-and-rescale update) avoids materializing the full attention matrix in HBM before normalizing, which is the core insight behind FlashAttention's memory efficiency.

python
@triton.jit
def flash_attention_inner(
    Q_ptr, K_ptr, V_ptr, O_ptr,
    stride_qm, stride_qk,
    stride_kn, stride_kk,
    stride_vm, stride_vk,
    stride_om, stride_ok,
    seq_len, head_dim: tl.constexpr,
    scale,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
):
    pid = tl.program_id(0)
    q_block_ptr = tl.make_block_ptr(
        base=Q_ptr,
        shape=(seq_len, head_dim),
        strides=(stride_qm, stride_qk),
        offsets=(pid * BLOCK_M, 0),
        block_shape=(BLOCK_M, head_dim),
        order=(1, 0),
    )
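    # NOTE: no boundary_check on the loads in this simplified kernel -- it
    # assumes seq_len is a multiple of both BLOCK_M and BLOCK_N.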
    q = tl.load(q_block_ptr)

    # Online softmax state
    m_i = tl.full((BLOCK_M,), -float('inf'), dtype=tl.float32)
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)
    acc = tl.zeros((BLOCK_M, head_dim), dtype=tl.float32)

    for j in range(0, seq_len, BLOCK_N):
        k_block_ptr = tl.make_block_ptr(
            base=K_ptr,
            shape=(seq_len, head_dim),
            strides=(stride_kn, stride_kk),
            offsets=(j, 0),
            block_shape=(BLOCK_N, head_dim),
            order=(1, 0),
        )
        k = tl.load(k_block_ptr)

        # QK^T scaled
        qk = tl.dot(q, tl.trans(k)) * scale

        # Causal mask
        row_ids = pid * BLOCK_M + tl.arange(0, BLOCK_M)
        col_ids = j + tl.arange(0, BLOCK_N)
        mask = row_ids[:, None] >= col_ids[None, :]
        qk = tl.where(mask, qk, -float('inf'))

        # Online softmax update
        m_ij = tl.max(qk, axis=1)
        m_new = tl.maximum(m_i, m_ij)
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(qk - m_new[:, None])

        l_i = alpha * l_i + tl.sum(p, axis=1)
        m_i = m_new

        v_block_ptr = tl.make_block_ptr(
            base=V_ptr,
            shape=(seq_len, head_dim),
            strides=(stride_vm, stride_vk),
            offsets=(j, 0),
            block_shape=(BLOCK_N, head_dim),
            order=(1, 0),
        )
        v = tl.load(v_block_ptr)
        acc = acc * alpha[:, None] + tl.dot(p.to(tl.float16), v)

    # Normalize
    acc = acc / l_i[:, None]

    o_block_ptr = tl.make_block_ptr(
        base=O_ptr,
        shape=(seq_len, head_dim),
        strides=(stride_om, stride_ok),
        offsets=(pid * BLOCK_M, 0),
        block_shape=(BLOCK_M, head_dim),
        order=(1, 0),
    )
    tl.store(o_block_ptr, acc.to(tl.float16))

The tl.where call handles causal masking without branching: elements above the diagonal get -inf before exp, which flushes them to zero in the softmax. This is the pattern used throughout Triton-based attention implementations.
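
A minimal host-side launcher and correctness check for the kernel above (a sketch under that kernel's assumptions: single head, fp16 inputs, seq_len divisible by the block sizes; the triton_flash_attention name is ours):

python
import math
import torch

def triton_flash_attention(q, k, v, BLOCK_M=64, BLOCK_N=64):
    seq_len, head_dim = q.shape
    o = torch.empty_like(q)
    # One program per BLOCK_M rows of queries
    grid = (triton.cdiv(seq_len, BLOCK_M),)
    flash_attention_inner[grid](
        q, k, v, o,
        q.stride(0), q.stride(1),
        k.stride(0), k.stride(1),
        v.stride(0), v.stride(1),
        o.stride(0), o.stride(1),
        seq_len, head_dim,
        1.0 / math.sqrt(head_dim),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N,
    )
    return o

q, k, v = (torch.randn(1024, 64, device='cuda', dtype=torch.float16) for _ in range(3))
out = triton_flash_attention(q, k, v)
# Compare against PyTorch's reference causal attention
ref = torch.nn.functional.scaled_dot_product_attention(
    q[None, None], k[None, None], v[None, None], is_causal=True
)[0, 0]
print("max abs err:", (out - ref).abs().max().item())  # expect ~1e-3 for fp16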

For a production attention kernel backed by FlashAttention-4's SM100 tile model on Blackwell, see the FlashAttention-4 Blackwell inference guide.

Profiling Triton Kernels

PyTorch Profiler

python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        triton_fused_softmax(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

What to look for: kernel launch count, kernel duration in microseconds, and memory transfer ops. If you see multiple CUDA kernels per logical operation, fusion is not happening. If kernel duration is orders of magnitude longer than expected, check for unmasked out-of-bounds loads.

Proton (Triton's Built-in Profiler)

python
import triton.profiler as proton

proton.start("softmax_profile")
for _ in range(100):
    triton_fused_softmax(x)
proton.finalize()  # writes the profile to softmax_profile.hatchet

Inspect the output with the proton-viewer CLI that ships with Triton. Proton gives you a per-kernel breakdown of launch counts and durations, plus any time not covered by kernels, which indicates scheduling overhead or host-device synchronization.

ncu (NVIDIA Nsight Compute)

bash
ncu --set full -o kernel_profile --target-processes all python benchmark.py

Key hardware metrics:

| Metric                                              | Good value (H100)    | What it means                |
|-----------------------------------------------------|----------------------|------------------------------|
| sm__throughput.avg.pct_of_peak_sustained_elapsed    | > 60%                | Compute utilization          |
| dram__throughput.avg.pct_of_peak_sustained_elapsed  | > 70% (BW-bound ops) | HBM utilization              |
| l1tex__throughput.avg.pct_of_peak_sustained_elapsed | > 50%                | L1/shared (SRAM) utilization |
| achieved_occupancy                                  | > 50%                | Warp occupancy               |

ncu requires root or --privileged Docker access to read hardware performance counters. This is one of the concrete reasons bare-metal GPU instances matter for kernel development. On shared VMs where you do not have root, ncu either fails silently or returns no data. Spheron bare-metal instances have root SSH access by default.

Integrating Triton Kernels into vLLM, SGLang, and torch.compile

vLLM

vLLM's internals lean heavily on Triton: PagedAttention variants, RoPE, and RMSNorm are implemented as @triton.jit kernels. Replacing one of them with your own kernel follows a consistent pattern.

Subclass AttentionBackend (a sketch of the pattern; exact method signatures vary across vLLM versions):

python
from vllm.attention.backends.abstract import AttentionBackend, AttentionMetadata
import torch

class TritonFusedSoftmaxBackend(AttentionBackend):
    @staticmethod
    def get_name() -> str:
        return "triton_fused_softmax"

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata: AttentionMetadata,
        attn_bias=None,
    ) -> torch.Tensor:
        # Call your Triton kernel here instead of FlashAttention
        return triton_fused_softmax(query)

Launch vLLM with the custom backend (backend selection varies by vLLM version; some releases use the VLLM_ATTENTION_BACKEND environment variable instead of a CLI flag):

bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton_fused_softmax

For the vLLM container setup and production deployment configuration, see the vLLM production deployment guide. vLLM's Model Runner V2 (MRV2) exposes Triton kernel hooks for custom operator injection with reduced boilerplate. See the vLLM MRV2 deployment guide for the full pattern. For a deeper look at how continuous batching and PagedAttention interact with custom attention kernels, see the LLM serving optimization guide and the KV cache optimization guide.

SGLang

SGLang uses Triton for its RadixAttention kernel. Custom kernels can be injected via custom backend hooks:

python
# In your SGLang custom backend module
import triton
import triton.language as tl

@triton.jit
def custom_radix_attention_kernel(...):
    ...

# Register via SGLang's backend interface

For the SGLang container setup and baseline configuration, see the SGLang production deployment guide.

torch.compile

torch.compile uses Triton as its default backend (via inductor). To use a hand-written Triton kernel inside a compiled function without triggering graph breaks:

python
import torch
import torch.library

@torch.library.custom_op("mylib::fused_softmax", mutates_args=())
def fused_softmax(x: torch.Tensor) -> torch.Tensor:
    return triton_fused_softmax(x)

@fused_softmax.register_fake
def _(x):
    return torch.empty_like(x)

# Use inside a compiled function
@torch.compile
def my_model(x):
    return fused_softmax(x)

The register_fake registration (the successor to torch.library.impl_abstract) provides Dynamo with shape and dtype information without running the actual kernel. This lets torch.compile trace through your custom op and include it in the compiled graph. Without register_fake, Dynamo cannot determine output shapes and will either error or fall back to eager mode.
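
A quick sanity check that the compiled path matches eager execution (illustrative, reusing my_model from above):

python
x = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
out = my_model(x)   # first call triggers compilation; the custom op stays in the graph
ref = torch.softmax(x, dim=-1)
print("match:", torch.allclose(out, ref, atol=1e-3))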

For the full torch.compile and CUDA Graphs production pipeline, see the torch.compile + CUDA Graphs LLM inference guide.

Pitfalls: Shared Memory, Register Spills, and Occupancy

Shared memory limits. H100 exposes up to 228 KB of shared memory per SM (out of 256 KB combined L1/shared SRAM). Blackwell SM100 has 256 KB. A tile of 128x128 BF16 elements needs 128 × 128 × 2 bytes = 32 KB. That sounds fine until you add a second tile for the other operand, the extra copies that software pipelining (num_stages) keeps in flight, and accumulator state. As a practical rule, keep your per-CTA tile footprint under 64 KB to allow 4+ CTAs per SM without hitting the shared memory ceiling. Use ncu's "Scheduler Statistics" view to confirm your kernel is not constrained by shared memory.
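
A back-of-envelope estimator for the pipelined tile footprint (a sketch; the tile_smem_bytes helper is ours, and the compiler's actual allocation depends on its staging and swizzling decisions):

python
def tile_smem_bytes(block_m, block_n, block_k, dtype_bytes=2, num_stages=3):
    # One A tile (BLOCK_M x BLOCK_K) plus one B tile (BLOCK_K x BLOCK_N),
    # with one copy of each held per software-pipeline stage
    per_stage = (block_m * block_k + block_k * block_n) * dtype_bytes
    return per_stage * num_stages

# 128x128 output tile with BLOCK_K=64, BF16, 3 stages: 96 KB -- over the 64 KB guideline
print(tile_smem_bytes(128, 128, 64) / 1024, "KB")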

Register spills. When num_warps is too high or BLOCK_SIZE too large, the register file fills and spills to local memory (which is DRAM, not SRAM). Register spills are invisible without ncu: the kernel runs correctly but slower. Check with:

bash
ncu --metrics l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_load.sum python benchmark.py

Or look at the "Memory Workload Analysis" view in Nsight Compute and check "Spill Stores". Fix: reduce BLOCK_SIZE or num_warps, or mark loop-invariant values with tl.constexpr so the compiler can fold them as constants and free registers.

Occupancy on Hopper vs Blackwell. H100 has 132 SMs. Full SM occupancy at 4 warps per SM and 32 threads per warp is 132 × 4 × 32 = 16,896 active threads. B200 SXM6 has roughly 160 SMs. A kernel with a grid of (n_rows,) may under-utilize B200 when n_rows is small.

For non-persistent kernels like fused_softmax_kernel, keep the grid at exactly (n_rows,). Padding the grid beyond n_rows does not improve SM utilization: each extra CTA hits the early-exit guard and returns immediately without doing useful work, consuming scheduler slots for nothing.

To actually fill the SM pool when n_rows is small, use the persistent kernel pattern. persistent_softmax_kernel above uses a fixed grid and processes multiple rows per program via an internal strided loop, so every CTA does real work:

python
n_rows, n_cols = x.shape
num_sms = torch.cuda.get_device_properties(0).multi_processor_count
# Fixed grid: ~4 CTAs per SM, independent of n_rows
grid = (num_sms * 4,)
y = torch.empty_like(x)
persistent_softmax_kernel[grid](
    x, y,
    n_rows, n_cols,
    BLOCK_SIZE=4096,  # must be >= n_cols (see the assert in the kernel)
)

The for row_idx in range(pid, n_rows, num_programs) loop inside the kernel distributes all rows across the fixed SM pool. This is the pattern that actually achieves the 4 CTAs-per-SM target on both Hopper and Blackwell.

Autotuner cold starts in production. Docker containers that do not mount a persistent TRITON_CACHE_DIR recompile every kernel on each container start, adding 30-120 seconds to startup time. Log kernel compilation events (Triton emits them to stderr with TRITON_DEBUG=1) and pre-compile by running a warmup pass with representative shapes at container startup. Mount the cache as a persistent volume across restarts.

B200 SXM6 vs consumer Blackwell. The RTX 5090 uses SM120 (compute capability 12.0), not SM100 (compute capability 10.0). SM120 lacks the TMEM subsystem that enables Triton's persistent kernel optimizations on data-center Blackwell. For all B200 references in this guide, the target is B200 SXM6 (SM100). The consumer RTX 5090 runs Triton kernels correctly but does not benefit from the SM100-specific pipeline optimizations.
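
If you ship kernels to a mixed fleet, branch on compute capability rather than device name. A minimal sketch (the target labels are illustrative):

python
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) == (10, 0):    # SM100: B200/GB200 data-center Blackwell (has TMEM)
    target = "blackwell-datacenter"
elif (major, minor) == (12, 0):  # SM120: RTX 5090 consumer Blackwell (no TMEM)
    target = "blackwell-consumer"
elif major == 9:                 # SM90: H100/H200 Hopper
    target = "hopper"
else:
    target = "other"
print(f"sm{major}{minor}: tuning target = {target}")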

Cost Analysis: Speedup per GPU-Hour

Here is a worked example to make the math concrete.

Task: 1 billion softmax operations at sequence length 4096.

Baseline: PyTorch F.softmax on H100 SXM5, achieving ~350 GB/s HBM throughput.

With Triton fused kernel: ~820 GB/s, a 2.3x throughput increase.

Cost calculation:

  • H100 SXM5 on-demand: $4.21/hr
  • PyTorch baseline time for 1B ops: T hours, costing $4.21 × T
  • Triton time: T / 2.3 hours, costing $4.21 × T / 2.3 ≈ $1.83 × T
  • Savings: $2.38 × T, roughly a 56% cost reduction for this operation

B200 SXM6 at $1.71/hr spot ($7.00/hr on-demand): the PyTorch baseline on B200 spot runs faster (~480 vs ~350 GB/s), so it finishes in T × (350/480) hours and costs about $1.25 × T, already cheaper than H100 with Triton at $1.83 × T. With Triton on B200 spot, the ~1150 GB/s throughput and the $1.71/hr rate compound: the same 1B ops cost $1.71 × T / (1150/350) ≈ $0.52 × T, roughly an 88% reduction vs the H100 PyTorch baseline.

The takeaway: kernel optimization and hardware choice compound. You do not have to pick one. A fused Triton kernel on a B200 spot instance is the current high-water mark for cost-per-operation on softmax-heavy workloads.
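
The same arithmetic as a snippet you can rerun with your own measured throughputs and rates (the cost_factor helper is ours; the GB/s figures are the representative numbers from the benchmark table above):

python
def cost_factor(rate_per_hr, throughput_gbs, baseline_gbs=350.0):
    # Cost as a multiple of T, the H100 PyTorch-baseline runtime in hours
    return rate_per_hr * baseline_gbs / throughput_gbs

print(f"H100 + PyTorch:     ${cost_factor(4.21, 350):.2f} x T")   # 4.21 T
print(f"H100 + Triton:      ${cost_factor(4.21, 820):.2f} x T")   # ~1.80 T (~1.83 with the rounded 2.3x)
print(f"B200 spot + Triton: ${cost_factor(1.71, 1150):.2f} x T")  # ~0.52 T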

Current pricing on Spheron as of 10 May 2026:

| GPU       | On-demand (lowest $/hr) | Spot ($/hr) |
|-----------|-------------------------|-------------|
| H100 PCIe | $2.11                   | N/A         |
| H100 SXM5 | $4.21                   | N/A         |
| B200 SXM6 | $7.00                   | $1.71       |

Pricing fluctuates based on GPU availability. The prices above are based on 10 May 2026 and may have changed. Check current GPU pricing → for live rates.

Spheron Deployment Recipe

Practical steps for getting a Triton kernel development environment running:

1. Choose your instance.

H100 SXM5 for Hopper-specific kernel development, testing against the ecosystem (vLLM, Liger-Kernel, FA3), or mixed-fleet validation across Hopper and Blackwell. B200 SXM6 for Blackwell-only workloads or cost-optimized production batch jobs where spot pricing applies. See rent H100 on Spheron for Hopper access and rent B200 on Spheron for Blackwell.

2. Base container.

The NVIDIA PyTorch NGC container (nvcr.io/nvidia/pytorch:24.12-py3) includes PyTorch, Triton, CUDA 12.6, and ncu. It is the fastest path to a working Triton environment without manual toolkit installation.

bash
docker run --gpus all --privileged --ipc=host \
  -v /workspace:/workspace \
  nvcr.io/nvidia/pytorch:24.12-py3 \
  /bin/bash

The --privileged flag is required for ncu hardware counter access. Spheron bare-metal instances have root by default, so this works without additional configuration.

3. Cache configuration.

bash
export TRITON_CACHE_DIR=/workspace/.triton_cache

Mount /workspace as a persistent volume in your container to keep compiled kernel artifacts across restarts. This eliminates recompilation delays in iterative development sessions.

4. Version verification.

python
import torch
import triton
print(f"PyTorch: {torch.__version__}")
print(f"Triton: {triton.__version__}")
print(f"CUDA: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0)}")

5. Serving integration.

For the full inference server + Triton kernel stack on Spheron, see the Spheron vLLM deployment guide.


Triton kernels deliver the most predictable performance on bare-metal GPU hardware, where ncu has root access, there are no noisy neighbors distorting micro-benchmarks, and on-demand billing means you can rent a B200 for a two-hour kernel validation session without a monthly commitment.

Rent H100 SXM5 → | Rent B200 SXM6 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.