CUDA News Today: NVIDIA Toolkit, AMD ROCm, and AI Framework Releases (2026)


This is a weekly digest of CUDA toolkit releases, AMD ROCm updates, and AI framework compatibility changes that affect GPU cloud workloads. It updates every Monday with the latest releases, version tables, and a "What this means for Spheron users" section tying each update to real deployment decisions. For deeper background on the CUDA vs ROCm divide, see the ROCm vs CUDA comparison. For a deep dive into CUDA 13's new programming model, see the CUDA 13 Tile programming guide. For GPU cloud pricing and capacity news, the GPU cloud news 2026 feed tracks weekly moves across providers.

This Week in CUDA and GPU Compute (Week of 19 May 2026)

1. CUDA 13.2 stable released

May 15, 2026 - CUDA Toolkit 13.2 Release Notes

CUDA 13.2 ships the cuTile Python DSL as a stable feature after its experimental debut in 13.1. It also extends the Cooperative Groups API with new grid_group synchronization primitives that simplify multi-block reductions. CUDA 13.x dropped support for Maxwell, Pascal, and Volta but continues to support Turing, Ampere, Ada, Hopper (H100/H200, compute capability 9.0), and Blackwell. H100 and H200 users can now run CUDA 13.x without pinning to 12.8.

2. ROCm 7.2.3 patch ships vLLM profiling improvements for MI300X and MI350X

May 2026 - ROCm Release Notes

AMD pushed ROCm 7.2.3 with improved profiling stability for vLLM workloads traced with PyTorch torch.profiler, reducing large idle gaps between GPU kernels. The release also merges multiple gather operations from similar embedding tables into batched operations, cutting kernel launches and memory traffic for embedding-heavy inference. ROCm 7.2 is the current stable branch for MI300X and MI350X production deployments. For a comparison of AMD and NVIDIA software stacks, see the ROCm vs CUDA comparison.

3. PyTorch 2.12 stable released

May 13, 2026 - PyTorch 2.12 Release Notes

PyTorch 2.12 is the current stable release. Key additions include batched linalg.eigh on CUDA that is up to 100x faster for symmetric eigendecompositions and continued torch.compile improvements for CUDA 13.x kernels. CUDA 12.8 and 13.x wheel variants are both available. For production deployments with CUDA Graphs and persistent compilation caching, the torch.compile and CUDA Graphs guide covers the current recommendations.

4. vLLM 0.21.0 released with KV offload and Hybrid Memory Allocator

May 15, 2026 - vLLM v0.21.0 Release Notes

vLLM 0.21.0 ships KV offload integration with the Hybrid Memory Allocator (HMA), enabling larger effective context windows by tiering KV cache between GPU and CPU memory. It also adds speculative decoding support for reasoning models with thinking budget constraints and the TOKENSPEED_MLA backend on Blackwell for DeepSeek-R1 and Kimi-K25. Eagle3 speculative decoding (added in v0.20.0) continues to be available. See the vLLM production deployment guide for current flag recommendations.

5. OpenAI Triton (triton-lang) 3.7.0 released

May 7, 2026 - OpenAI Triton (triton-lang) 3.7.0 Release Notes

OpenAI Triton (triton-lang) 3.7.0 ships with continued improvements to both the CUDA and ROCm backends, improving performance for flash attention and linear attention patterns on CDNA3 and Blackwell hardware. The block pointer API is stable across recent releases. Note: this is the kernel language OpenAI Triton (triton-lang), not NVIDIA Triton Inference Server. For GPU kernel development context, see the Triton kernel development guide.

What This Means for Spheron Users (Week of 19 May 2026)

CUDA 13.2 supports Turing, Ampere, Ada, Hopper, and Blackwell architectures. H100 and H200 users can now run CUDA 13.x without pinning to CUDA 12.8. For teams exploring B300 SXM6, spot instances are currently available at $2.45/hr per GPU, making CUDA 13.x workloads low-cost to validate.
When deploying on Spheron, select an NGC base container matching your target CUDA version (e.g., nvcr.io/nvidia/pytorch:25.01-py3 for CUDA 12.8 or a newer tag for CUDA 13.x). The container provides the toolkit; the driver is pre-installed on the bare-metal host. For automating driver and toolkit setup at instance launch, see the Spheron startup scripts guide.
H100 SXM5 instances start at $1.66/hr spot or $2.64/hr on-demand, with SSH root access included. B200 SXM6 starts at $3.77/hr spot or $7.21/hr on-demand for teams that need Blackwell.

CUDA Toolkit Release Tracker

Version	Release Date	Key Features	Min Driver Version	Supported Architectures
CUDA 12.4	Mar 2024	FP8 support on Hopper, cuDNN 9.0 integration	550.54	Volta, Turing, Ampere, Ada, Hopper
CUDA 12.6	Jul 2024	cuBLAS FP8 GEMMs, NVLink 5.0 support	560.28	Volta, Turing, Ampere, Ada, Hopper
CUDA 12.8	Jan 2025	Blackwell (Compute 10.0) GA support, GB200 NVLink, NVJPEG2000	570.00	Volta, Turing, Ampere, Ada, Hopper, Blackwell
CUDA 13.0	Nov 2025	Dropped Maxwell/Pascal/Volta, Blackwell architecture focus, cuTile DSL preview	575.00	Turing, Ampere, Ada, Hopper, Blackwell
CUDA 13.1	Mar 2026	cuTile DSL (experimental), new CG primitives	575.00	Turing, Ampere, Ada, Hopper, Blackwell
CUDA 13.2	May 2026	cuTile DSL (stable), extended Cooperative Groups	575.51	Turing, Ampere, Ada, Hopper, Blackwell

CUDA 13.x dropped support for Maxwell, Pascal, and Volta (pre-Turing). Turing, Ampere, Ada, Hopper (H100/H200), and Blackwell are all supported. See the CUDA 13 Tile programming guide for details on the new programming model.
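As a rough illustration of the cutoff described above, the following Python sketch maps a compute capability to its architecture family and reports whether CUDA 13.x still targets it. The helper names (`architecture_for`, `supported_by_cuda_13`) are hypothetical and not part of any NVIDIA API, and the mapping covers only the datacenter families named in the tracker.

```python
# Architecture families that CUDA 13.x no longer supports, per the tracker above.
DROPPED_IN_CUDA_13 = {"Maxwell", "Pascal", "Volta"}


def architecture_for(major: int, minor: int) -> str:
    """Map a compute capability (major, minor) to an architecture family name."""
    if major == 5:
        return "Maxwell"
    if major == 6:
        return "Pascal"
    if major == 7:
        # 7.5 is Turing; the other 7.x capabilities are Volta.
        return "Turing" if minor == 5 else "Volta"
    if major == 8:
        # 8.9 is Ada; the other 8.x capabilities are Ampere.
        return "Ada" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major == 10:
        return "Blackwell"
    # Assumption: anything else is outside the scope of this sketch.
    raise ValueError(f"compute capability {major}.{minor} is not covered here")


def supported_by_cuda_13(major: int, minor: int) -> bool:
    """Return True if the architecture is still supported by CUDA 13.x."""
    return architecture_for(major, minor) not in DROPPED_IN_CUDA_13


if __name__ == "__main__":
    for cc in [(7, 0), (7, 5), (8, 0), (9, 0), (10, 0)]:
        print(cc, architecture_for(*cc), supported_by_cuda_13(*cc))
```

On a machine with PyTorch installed, the `(major, minor)` pair can be read from `torch.cuda.get_device_capability()` and passed straight into `supported_by_cuda_13`.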

Source: CUDA Toolkit Release Notes

AMD ROCm Release Tracker

Version	Release Date	Key Changes	MI300X	MI350X	MI400	Min Kernel
ROCm 6.2	Jun 2024	vLLM ROCm wheel support, MIGraphX 2.10	Stable	No	No	5.15
ROCm 7.0.0	Late 2025	MI350X/MI355X GA support, HIP compiler major update, MX data types (FP4/FP6/FP8) on MI350X	Stable	Stable	Preview	6.8
ROCm 7.1	Early 2026	Deeper MI350X/MI355X optimization, improved FlashAttention, HIP Python improvements	Stable	Stable	Preview	6.8
ROCm 7.2	2026	RDNA4 support, Ubuntu 24.04.4 support, RCCL improvements	Stable	Stable	Preview	6.8
ROCm 7.2.3 (current)	May 2026	vLLM profiling stability, embedding-inference optimization	Stable	Stable	Preview	6.8

ROCm 7.2 is the current stable series. MI350X and MI355X have been fully supported since ROCm 7.0.0. MI400 remains in developer preview. For a full comparison of AMD and NVIDIA software stacks and when to choose each, see the ROCm vs CUDA comparison.

Source: ROCm GitHub Releases

Framework Compatibility Matrix

Framework	Version	Min CUDA	Min ROCm	Notes
PyTorch	2.4	11.8	5.7	Last version supporting CUDA 11.x
PyTorch	2.5	12.1	6.1	Drops CUDA 11.x
PyTorch	2.6	12.4	6.2	Recommended for CUDA Graphs + torch.compile
PyTorch	2.7	12.4	6.2	Stable since Apr 2025; cu121 wheel deprecated
PyTorch	2.12	12.8	7.0	Current stable (May 2026); CUDA 13.x wheels available
TensorFlow	2.17	12.3	6.1	Last 2.x before TF 3 API migration
TensorFlow	2.18	12.5	6.2	TF 3 compatibility layer included
JAX	0.4.28	12.3	6.2	Pallas kernel language stable
JAX	0.4.30	12.4	6.2	Mosaic GPU backend GA
OpenAI Triton (triton-lang)	3.2	12.1	6.1	Stable block pointer API
OpenAI Triton (triton-lang)	3.7.0	12.3	7.0	Current stable (May 2026); improved CUDA and ROCm backends

For PyTorch 2.6 in production with CUDA Graphs and persistent compilation caching, see the torch.compile and CUDA Graphs guide.
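One way to use the matrix programmatically is a small lookup that compares an installed CUDA version against the listed minimum. The sketch below copies a subset of the rows above into a dictionary; the names `MIN_CUDA`, `parse_version`, and `cuda_is_sufficient` are illustrative assumptions rather than an existing tool.

```python
# Minimum CUDA versions copied from the Framework Compatibility Matrix above.
MIN_CUDA = {
    ("PyTorch", "2.4"): "11.8",
    ("PyTorch", "2.5"): "12.1",
    ("PyTorch", "2.6"): "12.4",
    ("PyTorch", "2.7"): "12.4",
    ("PyTorch", "2.12"): "12.8",
    ("TensorFlow", "2.17"): "12.3",
    ("TensorFlow", "2.18"): "12.5",
    ("JAX", "0.4.28"): "12.3",
    ("JAX", "0.4.30"): "12.4",
}


def parse_version(text: str) -> tuple:
    """Turn '12.8' into (12, 8) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def cuda_is_sufficient(framework: str, version: str, installed_cuda: str) -> bool:
    """Return True if installed_cuda meets the matrix minimum for the framework."""
    required = MIN_CUDA[(framework, version)]
    return parse_version(installed_cuda) >= parse_version(required)


if __name__ == "__main__":
    print(cuda_is_sufficient("PyTorch", "2.12", "12.4"))  # below the listed minimum
    print(cuda_is_sufficient("PyTorch", "2.6", "13.2"))   # above the listed minimum
```

Comparing integer tuples rather than raw strings matters once minor versions reach two digits: as strings, "12.10" sorts before "12.8".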

NVIDIA Datacenter Driver Branches

Branch	Latest Version	CUDA Support	Type
R555	555.85	Up to CUDA 12.5	End of Life
R560	560.94	Up to CUDA 12.6	End of Life
R570	570.153	Up to CUDA 12.8	Production (EOL Jun 2026)
R575	575.51	Up to CUDA 13.2	Short-Term Support (EOL)
R580	580.x	Up to CUDA 13.2	Long Term Support (recommended)

Use R580 for production LLM inference workloads on H100, H200, A100, B200, and B300. It is the current Long Term Support branch, with an EOL date of August 2028, which makes it the stable choice for fleet deployments that need a fixed driver ABI. R570 (production branch) reaches end of life in June 2026, so R570 users should plan their migration to R580 before then. R575 was a short-term support branch and is already EOL: it no longer receives security or stability fixes, so R575 users should treat the migration to R580 as urgent.
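The branch guidance above can be sketched as a filter over the table: keep branches that are not yet EOL and whose CUDA ceiling covers the target version. The `DriverBranch` type and `usable_branches` helper are hypothetical names for illustration, and the data is copied from the table rather than queried from NVIDIA.

```python
from typing import List, NamedTuple, Tuple


class DriverBranch(NamedTuple):
    name: str
    max_cuda: Tuple[int, int]  # highest CUDA version the branch supports
    eol: bool                  # True if the table marks the branch as already EOL


# Rows copied from the NVIDIA Datacenter Driver Branches table above.
BRANCHES = [
    DriverBranch("R555", (12, 5), True),
    DriverBranch("R560", (12, 6), True),
    DriverBranch("R570", (12, 8), False),  # production branch, EOL Jun 2026
    DriverBranch("R575", (13, 2), True),   # short-term support, already EOL
    DriverBranch("R580", (13, 2), False),  # long term support
]


def usable_branches(target_cuda: Tuple[int, int]) -> List[str]:
    """List non-EOL branches whose CUDA ceiling is at least target_cuda."""
    return [
        branch.name
        for branch in BRANCHES
        if not branch.eol and branch.max_cuda >= target_cuda
    ]


if __name__ == "__main__":
    print(usable_branches((12, 8)))
    print(usable_branches((13, 2)))
```

Under these table values, only R580 remains for CUDA 13.x targets, which matches the recommendation in the paragraph above.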

Source: NVIDIA Driver Documentation

AMD Instinct Driver Updates

Release	Date	ROCm Compat	Key Changes	Supported GPUs
AMDGPU 6.7.0	Jun 2025	ROCm 6.2	DMA engine fixes	MI300X
AMDGPU 6.7.2	Aug 2025	ROCm 6.2	P-state tuning for MI300X under heavy MoE load	MI300X
AMDGPU 7.0	Late 2025	ROCm 7.0	MI350X/MI355X stable support, MX data types (FP4/FP6/FP8), initial MI400 device hooks	MI300X, MI350X, MI355X
AMDGPU 7.2	May 2026	ROCm 7.2	vLLM profiling stability, embedding-inference optimization	MI300X, MI350X, MI355X

Source: AMD ROCm GitHub and AMD ROCm Documentation

Library and Runtime Updates

cuDNN

cuDNN 9.7 (released April 2026) is the current stable release. It adds FP8 convolution support on Blackwell and extends the graph API with fused attention patterns for GQA (Grouped-Query Attention) layouts. The graph API is now the primary path for custom attention implementations; the legacy C API is deprecated but not removed. Compatibility: requires CUDA 12.4 or later.

cuDNN Version	CUDA Min	CUDA Max	Key Addition
9.4	12.1	12.6	FlashAttention-2 graph fusion
9.5	12.3	12.8	Hopper FP8 wgrad
9.6	12.4	12.8	GQA fused attention graph
9.7	12.4	13.x	Blackwell FP8 convolution, GQA fused attention layouts

cuBLAS and cuSPARSE

cuBLAS 12.8.4 (shipped with CUDA 12.8) added FP8 GEMM variants tuned for Blackwell's new tensor core layouts. The API interface is stable; no breaking changes from 12.6. cuSPARSE 12.5 added structured sparsity patterns for 2:4 pruning on Hopper and Blackwell without requiring a custom kernel.

NCCL and RCCL

NCCL 2.23 (current stable) improves All-Reduce bandwidth on 400G InfiniBand fabrics and adds adaptive tree algorithms that reduce latency for small tensors in pipeline-parallel setups. RCCL 2.21 (AMD's counterpart) brings NCCL API parity for the ring and tree collectives used by PyTorch DDP and DeepSpeed.

For tuning NCCL for multi-GPU training jobs, see the NCCL tuning guide.

TensorRT

TensorRT 10.9 is current. It ships FP4 sparsity on Blackwell GB200 and extends the FP8 quantization path to cover KV cache compression without a custom plugin. The deprecated TF32 fallback for FP16 overflow is removed in 10.9; if your network had silent TF32 promotion, you will see explicit errors now. For production LLM serving setup with TensorRT, see the TensorRT-LLM deployment guide.

vLLM

vLLM 0.21.0 (current stable as of this week) ships KV offload integration with the Hybrid Memory Allocator, speculative decoding support for reasoning models with thinking budget constraints, and the TOKENSPEED_MLA backend on Blackwell. Eagle3 speculative decoding and EPLB load balancing for MoE models (added in v0.20.0) are fully available. Prefix caching (--enable-prefix-caching) is on by default, which mainly benefits multi-turn workloads. See the vLLM production deployment guide for current flag recommendations.

Compiler and Kernel-Level Releases

OpenAI Triton (triton-lang)

OpenAI Triton (triton-lang) 3.7.0 is current stable. The block pointer API is stable across recent releases. Both the CUDA and ROCm backends have received continued improvements for flash attention and linear attention patterns on Blackwell and CDNA3 hardware. For writing production custom kernels with OpenAI Triton (triton-lang), see the Triton kernel development guide.

CUTLASS

CUTLASS 3.9 (April 2026) adds new tile shapes for Blackwell's FP8 MMA instruction set and improves the Python GEMM template API. The cute layout algebra is now the primary API path; the legacy CUTLASS 2.x API is in maintenance mode.

ThunderKittens

ThunderKittens (HazyResearch) is production-ready for H100 SXM5 targets and has experimental Ada (RTX 4090) support. Blackwell targeting is in active development. If your use case involves writing custom attention kernels for H100, ThunderKittens provides a higher-level abstraction than raw CUDA or inline PTX.

FlashAttention

FlashAttention 3 is stable on Hopper (H100/H200) and is the recommended kernel for production transformer inference. FlashAttention 4 targeting Blackwell (B200/B300) is in beta. For a detailed guide on deploying FlashAttention 4 on Blackwell instances, see the FlashAttention 4 guide.

AI Agent and Inference Framework News

SGLang

SGLang 0.4.x is current stable. Notable additions include a new RadixAttention scheduler with better KV cache utilization for shared prefix scenarios and improved MoE expert routing. For production SGLang setup on Spheron instances, see the SGLang production setup guide.

vLLM

vLLM 0.21.0 is current (see Library and Runtime Updates above). The key deployment decision for new vLLM setups is whether to use the AsyncLLMEngine (best for per-request latency) or the V1 engine (best for throughput at high concurrency). See the vLLM production deployment guide for when to use each.

TensorRT-LLM

TensorRT-LLM 0.17 added FP4 inference for Blackwell GB200, which is the first production-grade sub-FP8 inference path for transformer models. It also ships improved in-flight batching for speculative decoding. For setup and production configuration, see the TensorRT-LLM deployment guide.

NVIDIA Dynamo

NVIDIA Dynamo 0.3 (May 2026) ships disaggregated prefill and decode routing as a stable feature. It targets clusters with 8+ nodes and works with both TensorRT-LLM and vLLM backends. For a full setup walkthrough, see the NVIDIA Dynamo disaggregated inference guide.

llm-d

llm-d 0.2 (April 2026) is the Kubernetes-native inference framework for disaggregated LLM serving. It ships a custom scheduler plugin and a Kubernetes operator for managing disaggregated prefill/decode pods. For a production setup on Kubernetes with Spheron instances, see the llm-d on Kubernetes guide.

GPU Pricing for CUDA Workloads (as of 20 May 2026)

The following pricing is from Spheron's live GPU marketplace. Use these figures to estimate compute costs when planning a CUDA or ROCm upgrade cycle.

GPU	Architecture	CUDA Max	On-Demand (per GPU/hr)	Spot (per GPU/hr)
H100 SXM5	Hopper (9.0)	CUDA 13.x	$2.64	$1.66
H200 SXM5	Hopper (9.0)	CUDA 13.x	$4.56	$1.19
A100 80G PCIe	Ampere (8.0)	CUDA 13.x	$1.15	N/A
B200 SXM6	Blackwell (10.0)	CUDA 13.2	$7.21	$3.77
B300 SXM6	Blackwell (10.0)	CUDA 13.2	$9.89	$2.45
L40S PCIe	Ada (8.9)	CUDA 13.2	$1.03	$1.03

B300 spot instances at $2.45/hr are currently the cheapest way to test CUDA 13.x workloads on Blackwell. H100 on Spheron starts at $1.66/hr spot (on-demand from $2.64/hr) for CUDA 13.x production deployments. A100 instances on Spheron at $1.15/hr on-demand are a solid choice for batch jobs.

Pricing fluctuates based on GPU availability. The prices above are based on 20 May 2026 and may have changed. Check current GPU pricing for live rates.
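For a quick back-of-the-envelope estimate from the table above, multiply the hourly per-GPU rate by GPU count and hours. The sketch below hard-codes the 20 May 2026 rates and uses `Decimal` to avoid floating-point rounding; `RATES` and `estimate_cost` are illustrative names, and real bills will depend on live pricing.

```python
from decimal import Decimal

# Hourly per-GPU rates in USD, copied from the pricing table above (20 May 2026).
# None means no spot price was listed.
RATES = {
    "H100 SXM5": {"on_demand": Decimal("2.64"), "spot": Decimal("1.66")},
    "H200 SXM5": {"on_demand": Decimal("4.56"), "spot": Decimal("1.19")},
    "A100 80G PCIe": {"on_demand": Decimal("1.15"), "spot": None},
    "B200 SXM6": {"on_demand": Decimal("7.21"), "spot": Decimal("3.77")},
    "B300 SXM6": {"on_demand": Decimal("9.89"), "spot": Decimal("2.45")},
    "L40S PCIe": {"on_demand": Decimal("1.03"), "spot": Decimal("1.03")},
}


def estimate_cost(gpu: str, gpu_count: int, hours: int, tier: str = "on_demand") -> Decimal:
    """Return rate * gpu_count * hours in USD for the given pricing tier."""
    rate = RATES[gpu][tier]
    if rate is None:
        raise ValueError(f"no {tier} price listed for {gpu}")
    return rate * gpu_count * hours


if __name__ == "__main__":
    # Example: an 8x H100 node for 24 hours, on-demand versus spot.
    print(estimate_cost("H100 SXM5", 8, 24))          # 506.88
    print(estimate_cost("H100 SXM5", 8, 24, "spot"))  # 318.72
```

The estimate ignores spot interruptions, storage, and egress, so treat it as a lower bound when comparing tiers.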

Archive: Past Weekly Digests

[Week of 12 May 2026] - vLLM 0.20.0 released with Eagle3 speculative decoding and EPLB load balancing for MoE models.
[Week of 05 May 2026] - ROCm 7.2.2 maintenance release shipped with vLLM profiling stability fixes; PyTorch 2.13 dev wheels landed on PyPI.
[Week of 28 Apr 2026] - NCCL 2.23 released with improved bandwidth on 400G InfiniBand; TensorRT 10.9 added FP4 for Blackwell.

CUDA and ROCm releases move fast. The simplest way to stay current is to run each new toolkit version on bare-metal instances where you control the driver stack and can verify behavior against your own kernels before pushing to production. Spheron gives you SSH root access and the ability to swap base images on every deployment, with no GPU sharing across tenants.


Quick Setup Guide

Check your current CUDA Toolkit and driver version
Run nvcc --version to see the installed toolkit. Run nvidia-smi to see the driver version and the maximum CUDA runtime the driver supports. If the toolkit version exceeds the driver ceiling, upgrade the driver first.
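The comparison in this step can be automated by parsing the two command outputs. The following is a minimal sketch that assumes the usual `release X.Y` token in `nvcc --version` output and the `CUDA Version: X.Y` field in the `nvidia-smi` header; the sample strings are illustrative, not captured from a real machine, and the function names are hypothetical.

```python
import re


def parse_toolkit_version(nvcc_output: str) -> tuple:
    """Extract (major, minor) from the 'release X.Y' token in nvcc --version output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' token found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_driver_ceiling(smi_output: str) -> tuple:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field in nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version: X.Y' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def toolkit_fits_driver(nvcc_output: str, smi_output: str) -> bool:
    """Return True if the toolkit version does not exceed the driver's CUDA ceiling."""
    return parse_toolkit_version(nvcc_output) <= parse_driver_ceiling(smi_output)


# Illustrative sample strings (assumed format, not real captured output).
SAMPLE_NVCC = "Cuda compilation tools, release 12.8, V12.8.61"
SAMPLE_SMI = "| NVIDIA-SMI 570.86.10   Driver Version: 570.86.10   CUDA Version: 12.8 |"

if __name__ == "__main__":
    print(toolkit_fits_driver(SAMPLE_NVCC, SAMPLE_SMI))
```

To run it against a live instance, the two strings could be obtained with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and the equivalent call for `nvidia-smi`.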
Install or upgrade CUDA Toolkit on Ubuntu 22.04
Add NVIDIA's package repo: wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt-get update. Install: sudo apt-get install cuda-toolkit-12-8. If nvcc is not found afterwards, add it to PATH: export PATH=/usr/local/cuda-12.8/bin:$PATH. Verify: nvcc --version.
Install PyTorch with a specific CUDA version
Use the version-specific index URL: pip install torch --index-url https://download.pytorch.org/whl/cu128 for CUDA 12.8, or cu124 for CUDA 12.4. Never mix PyTorch wheels built for different CUDA targets in one environment - this can cause import failures or hard-to-diagnose runtime errors.
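The index URL follows a simple pattern: `cu` followed by the major and minor CUDA version with the dot removed. A small sketch of that pattern is below; `cuda_wheel_tag` and `pip_install_command` are hypothetical helpers, and the sketch does not check that a given tag is actually published on the index.

```python
def cuda_wheel_tag(cuda_version: str) -> str:
    """Turn a 'major.minor' CUDA version such as '12.8' into a wheel tag such as 'cu128'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pip_install_command(cuda_version: str, package: str = "torch") -> str:
    """Build the pip command that installs the package from the CUDA-specific index."""
    url = f"https://download.pytorch.org/whl/{cuda_wheel_tag(cuda_version)}"
    return f"pip install {package} --index-url {url}"


if __name__ == "__main__":
    print(pip_install_command("12.8"))
    print(pip_install_command("12.4"))
```

Generating the command from one pinned CUDA version helps keep every PyTorch-family package in an environment on the same wheel tag.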
Install ROCm and vLLM for AMD GPUs
Pull AMD's official ROCm Docker image: docker pull rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release-2.6.0. Inside the container, install vLLM: pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm7.0. Verify ROCm sees your GPU: rocm-smi.
Pin your framework to a tested CUDA/ROCm version
In your Dockerfile or requirements.txt, pin both the base image (e.g., nvcr.io/nvidia/pytorch:24.10-py3) and framework versions. Unpinned installs pull the latest CUDA wheel, which may not have been tested against your custom kernels or CUDA extensions.
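A lightweight guard for this step is to scan requirements.txt for entries without an exact `==` pin. The following sketch is deliberately naive (it treats any `#` as the start of a comment and skips pip option lines); `unpinned_requirements` is a hypothetical helper, not part of pip.

```python
def unpinned_requirements(requirements_text: str) -> list:
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Naive comment handling: drop everything after the first '#'.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as --index-url.
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative requirements file contents.
EXAMPLE = """\
--index-url https://download.pytorch.org/whl/cu124
torch==2.6.0
vllm
# kernels
triton>=3.2
"""

if __name__ == "__main__":
    print(unpinned_requirements(EXAMPLE))  # ['vllm', 'triton>=3.2']
```

Running a check like this in CI before building the image is one way to catch an unpinned dependency before it pulls a newer CUDA wheel.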
Monitor NVIDIA and AMD release channels for breaking changes
Subscribe to NVIDIA's developer blog at developer.nvidia.com/blog and AMD's ROCm GitHub releases at github.com/ROCm/ROCm/releases. CUDA minor releases (12.4 to 12.6) are usually backward compatible. Major releases (12 to 13) may change ABI. Test new toolkit versions in staging before updating production instances.


Frequently Asked Questions

1. What is the latest NVIDIA CUDA Toolkit version in 2026?

CUDA 13.2 is the current stable release as of May 2026, introducing the cuTile Python DSL for tile-level kernel programming. CUDA 13.x supports Turing, Ampere, Ada, Hopper (H100/H200), and Blackwell architectures; it dropped support for Maxwell, Pascal, and Volta. H100 and H200 workloads can run on CUDA 13.x.

2. What ROCm version supports AMD MI300X and MI350X in 2026?

ROCm 7.2 is the current stable series and supports both GPUs. MI300X is listed as stable from ROCm 6.2 in the tracker above, and MI350X has been fully supported since ROCm 7.0.0. vLLM ROCm wheels require ROCm 7.0 or later. Install via: pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm7.0.

3. Which PyTorch version requires CUDA 12.4 or higher?

In the compatibility matrix above, PyTorch 2.6 is the first listed version with a minimum of CUDA 12.4. PyTorch 2.5 dropped support for CUDA 11.x. PyTorch 2.6 targets CUDA 12.4 as the recommended runtime (compute capability 8.0+). Check the PyTorch installation matrix at pytorch.org/get-started/locally/ for the exact wheel matching your CUDA version.

4. How do I check and update the CUDA version on a Spheron GPU instance?

Run nvcc --version to see the installed CUDA Toolkit version. Run nvidia-smi to see the driver version and maximum supported CUDA runtime. To upgrade, add NVIDIA's package repo and install: sudo apt-get install cuda-toolkit-12-8. On Spheron, the easiest path is to select an NVIDIA NGC container image with the target CUDA version pre-installed when deploying from app.spheron.ai.

5. What is the difference between CUDA Toolkit version and CUDA Driver version?

The CUDA Toolkit includes the compiler (nvcc), libraries (such as cuBLAS and cuFFT), and headers; cuDNN is a separate download. The CUDA Driver is installed at the OS level as part of the GPU driver package. The toolkit version must be less than or equal to the driver's maximum supported CUDA version. nvidia-smi shows the driver's CUDA ceiling (e.g., 'CUDA Version: 12.8'). You can install CUDA Toolkit 12.4 on a system with a driver reporting CUDA 12.8 - the ceiling just needs to be >= the toolkit.
