Engineering

CUDA 13 Tile Programming on GPU Cloud: A 2026 Developer Guide

Written by Mitrasish, Co-founder · Apr 12, 2026
Tags: CUDA · GPU Cloud · Kernel Development · AI Infrastructure · NVIDIA · Python · A100 · B300

CUDA 13.1 shipped in late 2025 with something that hasn't happened in nearly two decades of CUDA: a complete new programming model layered on top of the existing one. CUDA Tile changes how you write GPU kernels, and the cuTile Python DSL means you no longer need to write a line of C++ to get there. If you've been avoiding custom kernel development because thread index arithmetic felt like a detour from your actual work, this changes the calculus.

If you're deciding between CUDA and ROCm for your workload, see our ROCm vs CUDA comparison for GPU cloud 2026. For bare-metal A100 access to run what's covered here, see the A100 rental page.

What Changed in CUDA 13: The Tile Programming Model

Traditional CUDA programs decompose work into grids of thread blocks, where each thread computes over a single element or small slice of data. Writing a matrix multiply means managing thread indices, shared memory staging, bank conflict avoidance, and manual __syncthreads() calls. Writing an attention kernel means dealing with warp-level primitives (WMMA or wgmma) that require precise layout alignment and register file discipline. It's not impossible, but it demands deep knowledge of the hardware memory hierarchy to do it well.

CUDA Tile introduces a different abstraction. A tile is a contiguous, multi-dimensional chunk of data that maps to a thread block's shared memory footprint. Instead of writing per-thread logic, you write tile-level operations: load a tile from HBM, multiply two tiles together, accumulate results, store back. The CUDA runtime handles the thread orchestration, synchronization, and shared memory management internally.
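The shift from per-element to tile-level thinking can be sketched in plain Python (no GPU or CUDA involved; `tile_scale_add` is an illustrative name, not part of any NVIDIA API). The caller reasons about whole 256-element chunks, the way a tile kernel reasons about whole tiles, with no per-element index bookkeeping:

```python
def tile_scale_add(A, B, alpha, tile_size=256):
    """Compute C = alpha * A + B one tile at a time.

    Mirrors the tile mental model: each iteration handles a whole
    contiguous chunk (a "tile"), not a single element. On a GPU, the
    runtime would map each tile to a thread block's shared memory and
    handle thread orchestration and synchronization internally.
    """
    C = []
    for start in range(0, len(A), tile_size):
        tile_a = A[start:start + tile_size]   # "load" a tile of A
        tile_b = B[start:start + tile_size]   # "load" a tile of B
        # Tile-level operation: the loop body never touches a global index
        C.extend(alpha * a + b for a, b in zip(tile_a, tile_b))
    return C

A = list(range(1024))
B = [1.0] * 1024
C = tile_scale_add(A, B, alpha=2.0)  # C[i] == 2*i + 1
```

The per-thread CUDA version below computes the same result, but every element's position is derived by hand from block and thread indices.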

CUDA 13 shipped in meaningful increments:

  • CUDA 13.0 added Blackwell GPU support, unified Arm platform support, and CCCL 3.0 (C++ Core Compute Libraries); it did not introduce the Tile programming model.
  • CUDA 13.1 introduced the CUDA Tile abstraction and the cuTile Python DSL, along with Green Contexts (partitioned GPU sub-contexts). Supported architecture at launch: Blackwell only (compute capabilities 10.x and 12.x). The C++ Tile API is planned for a future release.
  • CUDA 13.2 extended CUDA Tile support to Ampere (compute capability 8.0/8.6) and Ada Lovelace (8.9), added compute capability 11.x architectures, refined the Python API, and expanded Blackwell tensor core support (tcgen05.mma, Blackwell's native MMA instruction) and TMA (Tensor Memory Accelerator) support in the tile intrinsics.

Here's the contrast between a traditional CUDA approach and a CUDA Tile kernel for a simple element-wise scale-and-add operation:

Traditional CUDA (per-thread):

cpp
__global__ void scale_add_kernel(float* A, float* B, float* C,
                                  float alpha, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = alpha * A[idx] + B[idx];
    }
}

CUDA Tile C++ API (conceptual — the C++ API is planned for a future CUDA release; only the Python DSL is available as of CUDA 13.1):

cpp
// NOTE: This is a conceptual illustration of what the future C++ CUDA Tile API
// is expected to look like. The C++ API does not exist yet as of CUDA 13.1.
// Use the cuTile Python DSL for actual kernel development today.
#include <cuda/tile>

__global__ void scale_add_tile_kernel(float* A, float* B, float* C, float alpha) {
    // Define a 256-element tile mapped to this block's shared memory
    auto tile_a = cuda::tile<float, 256>(A + blockIdx.x * 256);
    auto tile_b = cuda::tile<float, 256>(B + blockIdx.x * 256);

    // Tile load: HBM -> shared memory, handled internally
    cuda::load(tile_a);
    cuda::load(tile_b);
    __syncthreads();

    // Tile operation: no per-thread index math needed
    auto tile_c = alpha * tile_a + tile_b;
    cuda::store(C + blockIdx.x * 256, tile_c);
}

For this simple operation the difference is modest. For attention kernels or large GEMM, where the traditional approach requires careful register blocking, wgmma intrinsics, and TMA descriptor setup, the tile approach cuts hundreds of lines of boilerplate to a handful of tile operations.

CUDA Tile vs Traditional CUDA Kernels: Architecture Comparison

| Aspect | Traditional CUDA | CUDA Tile |
| --- | --- | --- |
| Abstraction level | Thread/warp/block | Tile (block of elements) |
| Memory management | Manual __shared__ allocation | Automatic tile staging |
| Synchronization | Explicit __syncthreads() | Implicit within tile ops |
| Tensor Core access | Requires WMMA/wgmma intrinsics | Tile matmul API |
| Python support | Not available (C++ only) | cuTile Python DSL |
| Architecture support | All CUDA-capable GPUs | Ampere, Ada, Blackwell, Rubin (Hopper not supported in 13.1) |
| Learning curve | High (thread index math) | Moderate (tile shape configuration) |

The memory hierarchy story is worth unpacking. On Blackwell, CUDA Tile load/store operations are backed by the Tensor Memory Accelerator (TMA), which is dedicated hardware for asynchronous bulk data movement between HBM and shared memory. Traditional CUDA kernels need explicit cp.async instructions and barrier management to use TMA. CUDA Tile uses TMA transparently when available. Hopper has TMA hardware, but CUDA Tile does not currently support Hopper (sm_90) as of CUDA 13.1; future releases may add Hopper support.

On Ampere, tile loads fall back to software-managed DMA paths. The code is identical, but you lose the TMA hardware acceleration. For background on why GPU memory architecture matters for AI workloads, see our guide on GPU memory architecture for AI workloads.

Getting Started with cuTile Python DSL on GPU Cloud

1. Launch an A100 or B300 SXM6 instance on Spheron

Spheron A100 40G SXM4 and B300 SXM6 instances give you bare-metal GPU access with no hypervisor layer. That matters for kernel development: you get the full CUDA driver stack, direct PCIe/NVLink access, and no virtualization overhead between your kernel code and the hardware. CUDA Tile supports Ampere (A100) and Blackwell (B300) but not Hopper (H100) as of CUDA 13.1.

2. Install CUDA 13.2 toolkit

bash
# Add NVIDIA package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-13-2

3. Install the cuTile Python package

bash
pip install cuda-tile  # NVIDIA's cuTile Python DSL package (PyPI: cuda-tile)

4. Verify the environment

bash
nvcc --version          # Should show CUDA 13.2.x
python -c "import cutile; print(cutile.__version__)"
nvidia-smi              # Confirm GPU and driver version

For a one-click CUDA 13.2 environment, Spheron's deployment templates at docs.spheron.ai include pre-configured NGC base containers with the full CUDA 13.2 toolkit installed.

Practical Example: Writing a Custom Attention Kernel with CUDA Tile

Scaled dot-product attention implemented in plain PyTorch materializes the full QK^T matrix in GPU memory before the softmax. For large sequence lengths, that intermediate tensor alone occupies several gigabytes and forces multiple kernel launches. A fused kernel computes attention in tiles, keeping the intermediate data in shared memory and never writing it back to HBM.
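The memory claim is easy to quantify with back-of-envelope arithmetic. A small pure-Python helper (hypothetical, for illustration only) sizes the materialized QK^T logits tensor, which grows quadratically with sequence length:

```python
def attention_scores_bytes(seq_len, heads, batch=1, dtype_bytes=2):
    """Size of the materialized QK^T logits tensor: batch x heads x seq x seq.

    dtype_bytes=2 assumes FP16 logits.
    """
    return batch * heads * seq_len * seq_len * dtype_bytes

# FP16 logits for a 32-head model at growing sequence lengths
for seq in (2048, 8192, 32768):
    gib = attention_scores_bytes(seq, heads=32) / 2**30
    print(f"seq={seq:>6}: {gib:g} GiB for QK^T alone")
```

At seq_len 8192 the intermediate alone is 4 GiB; at 32768 it is 64 GiB, which is why a fused tile kernel that never writes it to HBM matters.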

The traditional WMMA approach requires about 150-300 lines of C++: WMMA fragment declarations, manual shared memory tiling, PTX-level warp group matrix multiply-accumulate setup, and explicit register file management. It's achievable, but a single misaligned memory access silently produces wrong results and takes hours to debug.

The cuTile Python DSL approach expresses the same fused kernel as tile-level operations:

python
import cutile
import torch

@cutile.kernel
def tile_attention(Q, K, V, O, scale):
    # Define tile shapes for A100/Blackwell (128 tokens x 64 head_dim)
    tile_m = cutile.Tile(128, Q.shape[-1])  # query tile
    tile_n = cutile.Tile(128, K.shape[-1])  # key tile

    # Load query tile to shared memory
    q = cutile.load(Q, tile_m)

    # Online softmax accumulators (FlashAttention-style) — one entry per query token.
    # A fixed (128, 128) buffer would only cover one key tile and produce wrong
    # attention logits for any sequence longer than 128 tokens.
    m_i = cutile.full((tile_m.shape[0],), float('-inf'))  # running row-wise max
    l_i = cutile.zeros((tile_m.shape[0],))               # running row-wise sum of exp
    o_i = cutile.zeros((tile_m.shape[0], V.shape[-1]))   # output accumulator

    for k_tile in cutile.range(K, tile_n):
        k = cutile.load(K, k_tile)
        v = cutile.load(V, k_tile)

        # Attention scores for this key tile: (128, tile_size)
        s_ij = cutile.matmul(q, k.T, scale=scale)

        # Update running max and rescale previous accumulators
        m_new = cutile.maximum(m_i, cutile.max(s_ij, dim=-1))
        alpha = cutile.exp(m_i - m_new)
        p_ij = cutile.exp(s_ij - m_new[:, None])

        l_i = alpha * l_i + cutile.sum(p_ij, dim=-1)
        o_i = alpha[:, None] * o_i + cutile.matmul(p_ij, v)
        m_i = m_new

    # Normalize and store — divide each output row by its softmax denominator
    cutile.store(O, tile_m, o_i / l_i[:, None])

Note: this code illustrates the cuTile Python DSL pattern as it existed in April 2026. The cuTile API was under active development. Refer to NVIDIA's official cuTile documentation for current API signatures before using this in production.
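The online softmax update in the kernel above is the standard FlashAttention recurrence, and its correctness is easy to sanity-check in pure Python: processing scores tile by tile with a running max and running denominator must reproduce the one-shot softmax-weighted sum. A minimal scalar sketch (hypothetical helper names, 1-D values instead of tiles):

```python
import math

def softmax_weighted_sum(scores, values):
    """Reference: full softmax over all scores, then weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def online_softmax_weighted_sum(scores, values, tile=4):
    """Tile-at-a-time version using the running-max/running-sum recurrence."""
    m_i, l_i, o_i = float('-inf'), 0.0, 0.0
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        m_new = max(m_i, max(s))
        alpha = math.exp(m_i - m_new)          # rescale previous accumulators
        p = [math.exp(x - m_new) for x in s]
        l_i = alpha * l_i + sum(p)
        o_i = alpha * o_i + sum(pi * vi for pi, vi in zip(p, v))
        m_i = m_new
    return o_i / l_i

scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 0.9, 1.8, -2.0, 0.4]
values = [float(i) for i in range(10)]
ref = softmax_weighted_sum(scores, values)
online = online_softmax_weighted_sum(scores, values)
```

The two results agree to floating-point precision, which is the property that lets the fused kernel discard each key tile's logits after updating the accumulators.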

Benchmarking the kernel against PyTorch SDPA:

python
import torch
import time

seq_len, heads, head_dim = 2048, 32, 64
Q = torch.randn(1, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
K = torch.randn_like(Q)
V = torch.randn_like(Q)

# Warm-up
for _ in range(10):
    torch.nn.functional.scaled_dot_product_attention(Q, K, V)
torch.cuda.synchronize()

# PyTorch SDPA baseline
start = time.perf_counter()
for _ in range(100):
    torch.nn.functional.scaled_dot_product_attention(Q, K, V)
torch.cuda.synchronize()
baseline_ms = (time.perf_counter() - start) * 1000 / 100  # total seconds -> ms per call over 100 iterations

print(f"PyTorch SDPA: {baseline_ms:.2f} ms")

The attention kernel efficiency directly affects how many reasoning tokens a model can generate per second. For the GPU-side implications of inference-time compute scaling, see inference-time compute scaling on GPU cloud.

CUDA Tile Compatibility: Ampere, Ada, Hopper, Blackwell, and Rubin

| Architecture | GPU Examples | CUDA Tile Support | TMA Support | Tensor Core MMA | Recommended Use |
| --- | --- | --- | --- | --- | --- |
| Ampere (sm_80) | A100 80GB | Yes (CUDA 13.2+) | No | No | Development, testing |
| Ada Lovelace (sm_89) | L40S, RTX 4090 | Yes (CUDA 13.2+) | No | No | Cost-effective prototyping |
| Hopper (sm_90) | H100, H200 | Not supported (future release may add support) | N/A | N/A | Not supported in CUDA 13.1 |
| Blackwell B200 (sm_100) | B200 | Yes (full) | Yes (enhanced TMA) | tcgen05.mma | Highest throughput |
| Blackwell B300 (sm_103) | B300 | Yes (full) | Yes (enhanced TMA) | tcgen05.mma | Highest throughput |
| Rubin (upcoming, est. 2026) | R100 (upcoming) | Expected | Expected | Expected | Next-gen |

Blackwell gets the most benefit from CUDA Tile because tile load/store operations use the enhanced Blackwell TMA hardware directly. On Ampere and Ada (both requiring CUDA 13.2+), the same tile operations compile and run correctly but fall back to software-managed async copy paths. The code is portable; the performance profile is not. Hopper (H100, H200) is not supported by CUDA Tile in CUDA 13.1 despite having TMA hardware; future CUDA releases may add Hopper support.

B200 also gains from the enhanced Blackwell TMA, which supports larger transfer granularities and higher peak bandwidth than Hopper's original TMA. For Blackwell architecture details, see our NVIDIA B200 complete guide. For a broader architecture progression including Rubin, see the NVIDIA Rubin vs Blackwell vs Hopper comparison.
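For code that must run across this mix of hardware, the compatibility table above can be encoded as a small lookup keyed on compute capability (as reported by torch.cuda.get_device_capability() or nvidia-smi). This is a hypothetical helper reflecting the table as of CUDA 13.2, not an NVIDIA API:

```python
# (major, minor) compute capability -> (arch name, tile supported, load path)
TILE_SUPPORT = {
    (8, 0):  ("Ampere",         True,  "software async copy"),
    (8, 9):  ("Ada Lovelace",   True,  "software async copy"),
    (9, 0):  ("Hopper",         False, "n/a (not supported in 13.x so far)"),
    (10, 0): ("Blackwell B200", True,  "hardware TMA"),
    (10, 3): ("Blackwell B300", True,  "hardware TMA"),
}

def check_tile_support(capability):
    """Summarize CUDA Tile availability for a device's compute capability."""
    arch, supported, path = TILE_SUPPORT.get(
        capability, ("unknown", False, "n/a"))
    return f"{arch}: tile={'yes' if supported else 'no'}, loads via {path}"

print(check_tile_support((8, 0)))
print(check_tile_support((9, 0)))
```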

Green Contexts for Multi-Tenant GPU Cloud Workloads

Green Contexts, first available in the CUDA 12.4 driver API and exposed in the CUDA 13.1 runtime API alongside cuTile, address a different problem: how to partition a GPU's streaming multiprocessors across independent workloads without the overhead of full CUDA context switching.

Traditional CUDA contexts are exclusive by default. One context occupies all SMs until it yields. Green Contexts let you carve out a subset of SMs and hand them to a sub-context that runs concurrently with the parent. Two kernel development sessions, or a kernel benchmark running alongside a background monitoring job, can coexist on the same physical GPU without interfering.

cpp
// Creating a Green Context in CUDA 13.1 (driver API)
CUdevice device;
cuDeviceGet(&device, 0);  // Get handle for GPU 0

// Step 1: Query the device's total SM resource
CUdevResource smResource;
cuDeviceGetDevResource(device, &smResource, CU_DEV_RESOURCE_TYPE_SM);

// Step 2: Split off one group of at least 40 SMs; the leftover SMs are
// returned in `remaining` and can back a second sub-context
CUdevResource partitioned;
CUdevResource remaining;
unsigned int count = 1;  // number of groups to create
cuDevSmResourceSplitByCount(&partitioned, &count, &smResource,
                            &remaining, 0 /* useFlags */, 40 /* min SMs per group */);

// Step 3: Generate an opaque resource descriptor from the partitioned
// resource (cuGreenCtxCreate takes a descriptor, not the raw CUdevResource)
CUdevResourceDesc resDesc;
cuDevResourceGenerateDesc(&resDesc, &partitioned, 1);

// Step 4: Create the Green Context from the descriptor
CUgreenCtx greenCtx;
cuGreenCtxCreate(&greenCtx, resDesc, device, CU_GREEN_CTX_DEFAULT_STREAM);

For GPU cloud workloads, Green Contexts matter when you want to run multiple kernel experiments on the same instance without spinning up separate VMs. Spheron's bare-metal instances give you full context control including Green Context creation, which managed inference APIs abstract away entirely. For multi-tenant patterns at the scheduler level, see running multiple LLMs on one GPU with MIG and time-slicing.

Performance Benchmarks: CUDA Tile vs Hand-Tuned Kernels on A100/Ampere and B200/Blackwell

NVIDIA's CUDA Tile announcement included benchmark data showing that CUDA Tile kernels on Ampere (A100) approach hand-tuned FlashAttention 3 throughput for standard attention shapes. For GEMM workloads at 4096x4096 FP16, CUDA Tile matches cuBLAS throughput on A100 to within 3%, and exceeds it on some Blackwell FP8 configurations where the tile API maps more directly to the Blackwell Tensor Core instruction set.

Rather than reproduce numbers that may shift as the cuTile API matures, here are the directional benchmarks from NVIDIA's CUDA Tile documentation as of April 2026:

| Kernel Type | A100 SXM4 (CUDA Tile) | A100 SXM4 (Hand-tuned) | B200 (CUDA Tile) | B200 (Hand-tuned) |
| --- | --- | --- | --- | --- |
| FP16 GEMM (4096x4096) | Within 3% of cuBLAS | cuBLAS reference | 2-3x A100 throughput | 2-3x A100 reference |
| Attention (seq=2048, FP16) | Approaches FA3 throughput | FlashAttention 3 ref | Higher (enhanced Blackwell TMA) | FA3 on Blackwell ref |
| FP8 GEMM (Blackwell-optimized) | N/A (Ampere FP8 not natively supported) | N/A | Highest | Reference |

Verify these against NVIDIA's current CUDA Tile benchmark documentation before making architecture decisions, as numbers change with each cuTile release.

Current GPU pricing on Spheron (fetched 12 Apr 2026):

| GPU | On-Demand | Spot |
| --- | --- | --- |
| H100 SXM5 (not CUDA Tile compatible) | not available | from $0.80/hr |
| B300 SXM6 | from $8.70/hr | not available |
| A100 40G SXM4 | not available | from $0.25/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 12 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

To keep kernel development costs in check, see the GPU cost optimization playbook for strategies like spot instance checkpointing and auto-shutdown for development notebooks.

For broader inference benchmark comparisons, see GPU cloud benchmarks 2026 and vLLM vs TensorRT-LLM vs SGLang benchmarks.

Setting Up a CUDA 13.2 Development Environment on Spheron GPU Cloud

This section covers the full setup from a fresh instance.

Step 1: Deploy an A100 or B300 SXM6 instance

Log into app.spheron.ai, select A100 40G SXM4 or B300 SXM6 from the GPU catalog, choose Ubuntu 22.04 or the CUDA 13.2 base image, and deploy. SSH root access is available within 60 seconds. CUDA Tile supports Ampere (A100) and Blackwell (B300) but not Hopper (H100) as of CUDA 13.1.

Step 2: Connect via SSH

Spheron provides an SSH command directly from the instance dashboard. Connect with full root access, no additional key management required.

Step 3: Install CUDA 13.2 toolkit

bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update && sudo apt-get install -y cuda-toolkit-13-2

Verify: nvcc --version should show CUDA 13.2.x.

Step 4: Set up the Python environment

bash
python3 -m venv cuda-tile-dev
source cuda-tile-dev/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu132
pip install cuda-tile nvtx

Step 5: Verify tensor core access

python
import torch
print(torch.cuda.get_device_capability())  # (8, 0) for A100, (10, 3) for B300 SXM6
print(torch.backends.cuda.matmul.allow_tf32)

Step 6: Run your first cuTile kernel

Use the attention kernel from the earlier section as a starting point. Adjust tile shapes based on your GPU's shared memory capacity: A100 supports up to 164 KB shared memory per SM, B200 supports 228 KB per SM. For B300, refer to the NVIDIA Blackwell Tuning Guide for current shared memory specs, as figures vary by configuration.
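Checking a tile configuration against that shared memory budget is simple arithmetic; a hypothetical pure-Python sanity check (not part of cuTile) makes the constraint explicit:

```python
def tile_fits(tile_rows, tile_cols, n_buffers=1, dtype_bytes=2,
              smem_bytes=164 * 1024):
    """Check whether n_buffers tiles of (rows x cols) fit in shared memory.

    Defaults assume FP16 tiles and the A100's 164 KB per-SM shared memory
    ceiling; pass smem_bytes=228 * 1024 for B200.
    Returns (bytes_needed, fits).
    """
    needed = n_buffers * tile_rows * tile_cols * dtype_bytes
    return needed, needed <= smem_bytes

# Q tile plus double-buffered K and V tiles of 128x64 FP16 -> 5 buffers
needed, ok = tile_fits(128, 64, n_buffers=5)
print(f"{needed // 1024} KB needed -> {'fits' if ok else 'too big'}")
```

Five 128x64 FP16 buffers need 80 KB, comfortably inside the A100 budget; quadrupling the tile area is what pushes a configuration over the edge.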


CUDA 13 Tile programming makes low-level GPU kernel work accessible to Python engineers for the first time without a C++ requirement. Whether that translates to production-grade custom attention kernels or just better-informed decisions about when to use cuBLAS versus write your own, having the tool in your stack is worth the 30-minute setup cost.

CUDA 13.2 with full Tile programming support is available on Ampere and Blackwell instances on Spheron. Bare-metal GPU access means no hypervisor between your kernel code and the hardware, so you get the full tensor core throughput the benchmarks show.

Rent A100 → | View all pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.