<!-- WEEKLY UPDATE PROTOCOL:
- Prepend a new "This Week" section (5 news items with source links) above the previous week's entry.
- Update the date: frontmatter field to the current Monday's date.
- Update any tracker tables (CUDA, ROCm, Framework, Driver) if a new version shipped.
- Append the previous week's top story as a one-line entry in the Archive section.
- Commit and push - the Vercel build updates the static page and lastmod in sitemap.
-->
This is a weekly digest of CUDA toolkit releases, AMD ROCm updates, and AI framework compatibility changes that affect GPU cloud workloads. It updates every Monday with the latest releases, version tables, and a "What this means for Spheron users" section tying each update to real deployment decisions. For deeper background on the CUDA vs ROCm divide, see the ROCm vs CUDA comparison. For a deep dive into CUDA 13's new programming model, see the CUDA 13 Tile programming guide.
This Week in CUDA and GPU Compute (Week of 19 May 2026)
1. CUDA 13.2 stable released
May 15, 2026 - CUDA Toolkit 13.2 Release Notes
CUDA 13.2 ships the cuTile Python DSL as a stable feature after its experimental debut in 13.1. It also extends the Cooperative Groups API with new grid_group synchronization primitives that simplify multi-block reductions. CUDA 13.x dropped support for Maxwell, Pascal, and Volta but continues to support Turing, Ampere, Ada, Hopper (H100/H200, compute capability 9.0), and Blackwell. H100 and H200 users can now run CUDA 13.x without pinning to 12.8.
2. ROCm 7.2.3 patch ships vLLM profiling improvements for MI300X and MI350X
May 2026 - ROCm Release Notes
AMD pushed ROCm 7.2.3 with improved profiling stability for vLLM workloads traced with PyTorch torch.profiler, reducing large idle gaps between GPU kernels. The release also merges multiple gather operations from similar embedding tables into batched operations, cutting kernel launches and memory traffic for embedding-heavy inference. ROCm 7.2 is the current stable branch for MI300X and MI350X production deployments. For a comparison of AMD and NVIDIA software stacks, see the ROCm vs CUDA comparison.
3. PyTorch 2.12 stable released
May 13, 2026 - PyTorch 2.12 Release Notes
PyTorch 2.12 is the current stable release. Key additions include batched linalg.eigh on CUDA that is up to 100x faster for symmetric eigendecompositions and continued torch.compile improvements for CUDA 13.x kernels. CUDA 12.8 and 13.x wheel variants are both available. For production deployments with CUDA Graphs and persistent compilation caching, the torch.compile and CUDA Graphs guide covers the current recommendations.
4. vLLM 0.21.0 released with KV offload and Hybrid Memory Allocator
May 15, 2026 - vLLM v0.21.0 Release Notes
vLLM 0.21.0 ships KV offload integration with the Hybrid Memory Allocator (HMA), enabling larger effective context windows by tiering KV cache between GPU and CPU memory. It also adds speculative decoding support for reasoning models with thinking budget constraints and the TOKENSPEED_MLA backend on Blackwell for DeepSeek-R1 and Kimi-K25. Eagle3 speculative decoding (added in v0.20.0) continues to be available. See the vLLM production deployment guide for current flag recommendations.
5. OpenAI Triton (triton-lang) 3.7.0 released
May 7, 2026 - OpenAI Triton (triton-lang) 3.7.0 Release Notes
OpenAI Triton (triton-lang) 3.7.0 continues to refine both the CUDA and ROCm backends, with better performance for flash attention and linear attention patterns on CDNA3 and Blackwell hardware. The block pointer API is stable across recent releases. Note: this is the kernel language OpenAI Triton (triton-lang), not NVIDIA Triton Inference Server. For GPU kernel development context, see the Triton kernel development guide.
What This Means for Spheron Users (Week of 19 May 2026)
CUDA 13.2 supports Turing, Ampere, Ada, Hopper, and Blackwell architectures. H100 and H200 users can now run CUDA 13.x without pinning to CUDA 12.8. For teams exploring B300 SXM6, spot instances are currently available at $2.45/hr per GPU, making CUDA 13.x workloads low-cost to validate.
When deploying on Spheron, select an NGC base container matching your target CUDA version (e.g., nvcr.io/nvidia/pytorch:24.08-py3 for CUDA 12.8 or a newer tag for CUDA 13.x). The container handles the toolkit; the bare-metal driver is pre-installed at the host level. For automating driver and toolkit setup at instance launch, see the Spheron startup scripts guide.
H100 SXM5 instances start at $1.66/hr spot or $2.64/hr on-demand, with SSH root access included. B200 SXM6 starts at $3.77/hr spot or $7.21/hr on-demand for teams that need Blackwell.
CUDA Toolkit Release Tracker
| Version | Release Date | Key Features | Min Driver Version | Supported Architectures |
|---|---|---|---|---|
| CUDA 12.4 | Mar 2024 | FP8 support on Hopper, cuDNN 9.0 integration | 550.54 | Volta, Turing, Ampere, Ada, Hopper |
| CUDA 12.6 | Jul 2024 | cuBLAS FP8 GEMMs, NVLink 5.0 support | 560.28 | Volta, Turing, Ampere, Ada, Hopper |
| CUDA 12.8 | Jan 2025 | Blackwell (Compute 10.0) GA support, GB200 NVLink, NVJPEG2000 | 570.00 | Volta, Turing, Ampere, Ada, Hopper, Blackwell |
| CUDA 13.0 | Nov 2025 | Dropped Maxwell/Pascal/Volta, Blackwell architecture focus, cuTile DSL preview | 575.00 | Turing, Ampere, Ada, Hopper, Blackwell |
| CUDA 13.1 | Mar 2026 | cuTile DSL (experimental), new CG primitives | 575.00 | Turing, Ampere, Ada, Hopper, Blackwell |
| CUDA 13.2 | May 2026 | cuTile DSL (stable), extended Cooperative Groups | 575.51 | Turing, Ampere, Ada, Hopper, Blackwell |
CUDA 13.x dropped support for Maxwell, Pascal, and Volta (pre-Turing). Turing, Ampere, Ada, Hopper (H100/H200), and Blackwell are all supported. See the CUDA 13 Tile programming guide for details on the new programming model.
Source: CUDA Toolkit Release Notes
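The support cutoff in the tracker can be expressed as a simple comparison on compute capability. The sketch below is illustrative only: the mapping and the `supported_by_cuda13` helper are hypothetical names, not part of any NVIDIA API, and each entry uses the lowest compute capability of the family. It assumes Turing (7.5) is the oldest architecture CUDA 13.x supports, as stated above.

```python
# Lowest compute capability per architecture family (illustrative mapping).
ARCH_COMPUTE_CAPABILITY = {
    "Maxwell": (5, 0),
    "Pascal": (6, 0),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 0),
    "Ada": (8, 9),
    "Hopper": (9, 0),
    "Blackwell": (10, 0),
}

# Assumption taken from the tracker: CUDA 13.x dropped everything older than Turing.
CUDA13_MIN_CAPABILITY = ARCH_COMPUTE_CAPABILITY["Turing"]


def supported_by_cuda13(arch: str) -> bool:
    """Return True if the architecture sits at or above the Turing cutoff."""
    return ARCH_COMPUTE_CAPABILITY[arch] >= CUDA13_MIN_CAPABILITY


if __name__ == "__main__":
    for name in ARCH_COMPUTE_CAPABILITY:
        status = "supported" if supported_by_cuda13(name) else "dropped"
        print(f"{name}: {status} in CUDA 13.x")
```

Comparing `(major, minor)` tuples rather than floats keeps the ordering exact.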
AMD ROCm Release Tracker
| Version | Release Date | Key Changes | MI300X | MI350X | MI400 | Min Linux Kernel |
|---|---|---|---|---|---|---|
| ROCm 6.2 | Jun 2024 | vLLM ROCm wheel support, MIGraphX 2.10 | Stable | No | No | 5.15 |
| ROCm 7.0.0 | Late 2025 | MI350X/MI355X GA support, HIP compiler major update, MX data types (FP4/FP6/FP8) on MI350X | Stable | Stable | Preview | 6.8 |
| ROCm 7.1 | Early 2026 | Deeper MI350X/MI355X optimization, improved FlashAttention, HIP Python improvements | Stable | Stable | Preview | 6.8 |
| ROCm 7.2 | 2026 | RDNA4 support, Ubuntu 24.04.4 support, RCCL improvements | Stable | Stable | Preview | 6.8 |
| ROCm 7.2.3 (current) | May 2026 | vLLM profiling stability, embedding-inference optimization | Stable | Stable | Preview | 6.8 |
ROCm 7.2 is the current stable series. MI350X and MI355X have been fully supported since ROCm 7.0.0. MI400 remains in developer preview. For a full comparison of AMD and NVIDIA software stacks and when to choose each, see the ROCm vs CUDA comparison.
Source: ROCm GitHub Releases
Framework Compatibility Matrix
| Framework | Version | Min CUDA | Min ROCm | Notes |
|---|---|---|---|---|
| PyTorch | 2.4 | 11.8 | 5.7 | Last version supporting CUDA 11.x |
| PyTorch | 2.5 | 12.1 | 6.1 | Drops CUDA 11.x |
| PyTorch | 2.6 | 12.4 | 6.2 | Recommended for CUDA Graphs + torch.compile |
| PyTorch | 2.7 | 12.4 | 6.2 | Stable since Apr 2025; cu121 wheel deprecated |
| PyTorch | 2.12 | 12.8 | 7.0 | Current stable (May 2026); CUDA 13.x wheels available |
| TensorFlow | 2.17 | 12.3 | 6.1 | Last 2.x before TF 3 API migration |
| TensorFlow | 2.18 | 12.5 | 6.2 | TF 3 compatibility layer included |
| JAX | 0.4.28 | 12.3 | 6.2 | Pallas kernel language stable |
| JAX | 0.4.30 | 12.4 | 6.2 | Mosaic GPU backend GA |
| OpenAI Triton (triton-lang) | 3.2 | 12.1 | 6.1 | Stable block pointer API |
| OpenAI Triton (triton-lang) | 3.7.0 | 12.3 | 7.0 | Current stable (May 2026); improved CUDA and ROCm backends |
For PyTorch 2.6 in production with CUDA Graphs and persistent compilation caching, see the torch.compile and CUDA Graphs guide.
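One way to use the matrix is to filter it by the CUDA version already installed. The following is a minimal sketch with the PyTorch rows copied from the table above; `compatible_pytorch` and `parse_version` are hypothetical helper names, and the check covers only the minimum CUDA requirement, not an upper bound.

```python
# (PyTorch version, minimum CUDA) pairs copied from the matrix above.
PYTORCH_MIN_CUDA = [
    ("2.4", "11.8"),
    ("2.5", "12.1"),
    ("2.6", "12.4"),
    ("2.7", "12.4"),
    ("2.12", "12.8"),
]


def parse_version(text: str) -> tuple:
    """Turn '12.8' into (12, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def compatible_pytorch(cuda_version: str) -> list:
    """List PyTorch versions whose minimum CUDA requirement is satisfied."""
    installed = parse_version(cuda_version)
    return [
        pytorch
        for pytorch, min_cuda in PYTORCH_MIN_CUDA
        if parse_version(min_cuda) <= installed
    ]


if __name__ == "__main__":
    print(compatible_pytorch("12.4"))
```

Parsing into integer tuples matters here: a plain string comparison would sort "2.12" before "2.4" and "11.8" after "12.1" only by accident of the first character.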
NVIDIA Datacenter Driver Branches
| Branch | Latest Version | CUDA Support | Type |
|---|---|---|---|
| R555 | 555.85 | Up to CUDA 12.5 | End of Life |
| R560 | 560.94 | Up to CUDA 12.6 | End of Life |
| R570 | 570.153 | Up to CUDA 12.8 | Production (EOL Jun 2026) |
| R575 | 575.51 | Up to CUDA 13.2 | Short-Term Support (EOL) |
| R580 | 580.x | Up to CUDA 13.2 | Long Term Support (recommended) |
Use R580 for production LLM inference workloads on H100, H200, A100, B200, and B300. It is the current Long Term Support branch with an EOL date of August 2028, making it the stable choice for fleet deployments. R570 (production branch) reaches end of life in June 2026. R575 was a short-term support branch and is already EOL. Both R570 and R575 users should migrate to R580 before the R570 EOL date, and R575 users should treat the migration as urgent since that branch is no longer receiving security or stability fixes. LTS branches are ideal when you want a fixed driver ABI across a large fleet.
Source: NVIDIA Driver Documentation
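The branch table can be turned into a small lookup for fleet planning. This sketch copies the CUDA ceilings and end-of-life flags from the table above; the `branches_for` helper is a hypothetical name, and R570 is treated as still usable because its EOL date has not yet passed.

```python
# Maximum CUDA version per driver branch, copied from the table above.
DRIVER_BRANCH_MAX_CUDA = {
    "R555": "12.5",
    "R560": "12.6",
    "R570": "12.8",
    "R575": "13.2",
    "R580": "13.2",
}

# Branches the table marks as already end of life.
END_OF_LIFE = {"R555", "R560", "R575"}


def parse_version(text: str) -> tuple:
    """Turn '13.2' into (13, 2) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def branches_for(cuda_version: str) -> list:
    """Return non-EOL driver branches whose CUDA ceiling covers the toolkit."""
    needed = parse_version(cuda_version)
    return [
        branch
        for branch, max_cuda in DRIVER_BRANCH_MAX_CUDA.items()
        if branch not in END_OF_LIFE and parse_version(max_cuda) >= needed
    ]


if __name__ == "__main__":
    for toolkit in ("12.8", "13.2"):
        print(toolkit, "->", branches_for(toolkit))
```

With the table as it stands, CUDA 12.8 resolves to R570 and R580, while CUDA 13.2 resolves to R580 only.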
AMD Instinct Driver Updates
| Release | Date | ROCm Compat | Key Changes | Supported GPUs |
|---|---|---|---|---|
| AMDGPU 6.7.0 | Jun 2025 | ROCm 6.2 | DMA engine fixes | MI300X |
| AMDGPU 6.7.2 | Aug 2025 | ROCm 6.2 | P-state tuning for MI300X under heavy MoE load | MI300X |
| AMDGPU 7.0 | Late 2025 | ROCm 7.0 | MI350X/MI355X stable support, MX data types (FP4/FP6/FP8), initial MI400 device hooks | MI300X, MI350X, MI355X |
| AMDGPU 7.2 | May 2026 | ROCm 7.2 | vLLM profiling stability, embedding-inference optimization | MI300X, MI350X, MI355X |
Source: AMD ROCm GitHub and AMD ROCm Documentation
Library and Runtime Updates
cuDNN
cuDNN 9.7 (released April 2026) is the current stable release. It adds FP8 convolution support on Blackwell and extends the graph API with fused attention patterns for GQA (Grouped-Query Attention) layouts. The graph API is now the primary path for custom attention implementations; the legacy C API is deprecated but not removed. Compatibility: requires CUDA 12.4 or later.
| cuDNN Version | CUDA Min | CUDA Max | Key Addition |
|---|---|---|---|
| 9.4 | 12.1 | 12.6 | FlashAttention-2 graph fusion |
| 9.5 | 12.3 | 12.8 | Hopper FP8 wgrad |
| 9.6 | 12.4 | 12.8 | GQA fused attention graph |
| 9.7 | 12.4 | 13.x | Blackwell FP8 convolution, GQA layouts |
cuBLAS and cuSPARSE
cuBLAS 12.8.4 (shipped with CUDA 12.8) added FP8 GEMM variants tuned for Blackwell's new tensor core layouts. The API interface is stable; no breaking changes from 12.6. cuSPARSE 12.5 added structured sparsity patterns for 2:4 pruning on Hopper and Blackwell without requiring a custom kernel.
NCCL and RCCL
NCCL 2.23 (current stable) improves All-Reduce bandwidth on 400G InfiniBand fabrics and adds adaptive tree algorithms that reduce latency for small tensors in pipeline-parallel setups. RCCL 2.21 (AMD's counterpart) brings NCCL API parity for the ring and tree collectives used by PyTorch DDP and DeepSpeed.
For tuning NCCL for multi-GPU training jobs, see the NCCL tuning guide.
TensorRT
TensorRT 10.9 is current. It ships FP4 sparsity on Blackwell GB200 and extends the FP8 quantization path to cover KV cache compression without a custom plugin. The deprecated TF32 fallback for FP16 overflow is removed in 10.9; if your network had silent TF32 promotion, you will see explicit errors now. For production LLM serving setup with TensorRT, see the TensorRT-LLM deployment guide.
vLLM
vLLM 0.21.0 (current stable as of this week) ships KV offload integration with the Hybrid Memory Allocator, speculative decoding support for reasoning models with thinking budget constraints, and the TOKENSPEED_MLA backend on Blackwell. Eagle3 speculative decoding and EPLB load balancing for MoE models (added in v0.20.0) are fully available. Prefix caching (--enable-prefix-caching) is on by default, which benefits multi-turn workloads. See the vLLM production deployment guide for current flag recommendations.
Compiler and Kernel-Level Releases
OpenAI Triton (triton-lang)
OpenAI Triton (triton-lang) 3.7.0 is current stable. The block pointer API is stable across recent releases. Both the CUDA and ROCm backends have received continued improvements for flash attention and linear attention patterns on Blackwell and CDNA3 hardware. For writing production custom kernels with OpenAI Triton (triton-lang), see the Triton kernel development guide.
CUTLASS
CUTLASS 3.9 (April 2026) adds new tile shapes for Blackwell's FP8 MMA instruction set and improves the Python GEMM template API. The cute layout algebra is now the primary API path; the legacy CUTLASS 2.x API is in maintenance mode.
ThunderKittens
ThunderKittens (HazyResearch) is production-ready for H100 SXM5 targets and has experimental Ada (RTX 4090) support. Blackwell targeting is in active development. If your use case involves writing custom attention kernels for H100, ThunderKittens provides a higher-level abstraction than raw CUDA PTX.
FlashAttention
FlashAttention 3 is stable on Hopper (H100/H200) and is the recommended kernel for production transformer inference. FlashAttention 4 targeting Blackwell (B200/B300) is in beta. For a detailed guide on deploying FlashAttention 4 on Blackwell instances, see the FlashAttention 4 guide.
AI Agent and Inference Framework News
SGLang
SGLang 0.4.x is current stable. Notable additions include a new RadixAttention scheduler with better KV cache utilization for shared prefix scenarios and improved MoE expert routing. For production SGLang setup on Spheron instances, see the SGLang production setup guide.
vLLM
vLLM 0.21.0 is current (see Library and Runtime Updates above). The key deployment decision for new vLLM setups is whether to use the AsyncLLMEngine (best for per-request latency) or the V1 engine (best for throughput at high concurrency). See the vLLM production deployment guide for when to use each.
TensorRT-LLM
TensorRT-LLM 0.17 added FP4 inference for Blackwell GB200, which is the first production-grade sub-FP8 inference path for transformer models. It also ships improved in-flight batching for speculative decoding. For setup and production configuration, see the TensorRT-LLM deployment guide.
NVIDIA Dynamo
NVIDIA Dynamo 0.3 (May 2026) ships disaggregated prefill and decode routing as a stable feature. It targets clusters with 8+ nodes and works with both TensorRT-LLM and vLLM backends. For a full setup walkthrough, see the NVIDIA Dynamo disaggregated inference guide.
llm-d
llm-d 0.2 (April 2026) is the Kubernetes-native inference framework for disaggregated LLM serving. It ships a custom scheduler plugin and a Kubernetes operator for managing disaggregated prefill/decode pods. For a production setup on Kubernetes with Spheron instances, see the llm-d on Kubernetes guide.
GPU Pricing for CUDA Workloads (as of 20 May 2026)
The following pricing is from Spheron's live GPU marketplace. Use these figures to estimate compute costs when planning a CUDA or ROCm upgrade cycle.
| GPU | Architecture | CUDA Max | On-Demand (per GPU/hr) | Spot (per GPU/hr) |
|---|---|---|---|---|
| H100 SXM5 | Hopper (9.0) | CUDA 13.x | $2.64 | $1.66 |
| H200 SXM5 | Hopper (9.0) | CUDA 13.x | $4.56 | $1.19 |
| A100 80G PCIe | Ampere (8.0) | CUDA 13.x | $1.15 | N/A |
| B200 SXM6 | Blackwell (10.0) | CUDA 13.2 | $7.21 | $3.77 |
| B300 SXM6 | Blackwell (10.0) | CUDA 13.2 | $9.89 | $2.45 |
| L40S PCIe | Ada (8.9) | CUDA 13.2 | $1.03 | $1.03 |
B300 spot instances at $2.45/hr are currently the cheapest way to test CUDA 13.x workloads on Blackwell. H100 on Spheron starts at $1.66/hr spot (on-demand from $2.64/hr) for CUDA 13.x production deployments. A100 instances on Spheron at $1.15/hr on-demand are a solid choice for batch jobs.
Pricing fluctuates based on GPU availability. The prices above are based on 20 May 2026 and may have changed. Check current GPU pricing for live rates.
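To turn the hourly rates into a budget figure, multiply rate by GPU count by hours. The sketch below uses the 20 May 2026 prices from the table above; `job_cost` is a hypothetical helper, and real spot jobs may be interrupted, which this estimate ignores.

```python
from typing import Optional

# (on-demand, spot) in USD per GPU-hour, copied from the table above.
# None means the table lists no spot price for that GPU.
PRICES: dict[str, tuple[float, Optional[float]]] = {
    "H100 SXM5": (2.64, 1.66),
    "H200 SXM5": (4.56, 1.19),
    "A100 80G PCIe": (1.15, None),
    "B200 SXM6": (7.21, 3.77),
    "B300 SXM6": (9.89, 2.45),
    "L40S PCIe": (1.03, 1.03),
}


def job_cost(gpu: str, gpu_count: int, hours: float, spot: bool = False) -> float:
    """Estimate the cost in USD of running gpu_count GPUs for the given hours."""
    on_demand, spot_rate = PRICES[gpu]
    if spot and spot_rate is None:
        raise ValueError(f"no spot price listed for {gpu}")
    rate = spot_rate if spot else on_demand
    return round(rate * gpu_count * hours, 2)


if __name__ == "__main__":
    on_demand = job_cost("H100 SXM5", gpu_count=8, hours=24)
    spot = job_cost("H100 SXM5", gpu_count=8, hours=24, spot=True)
    print(f"8x H100 for 24h: ${on_demand} on-demand, ${spot} spot")
```

For an 8x H100 node over 24 hours, this works out to $506.88 on-demand versus $318.72 on spot at the listed rates.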
Archive: Past Weekly Digests
- [Week of 12 May 2026] - vLLM 0.20.0 released with Eagle3 speculative decoding and EPLB load balancing for MoE models.
- [Week of 05 May 2026] - ROCm 7.2.2 maintenance release shipped with vLLM profiling stability fixes; PyTorch 2.13 dev wheels landed on PyPI.
- [Week of 28 Apr 2026] - NCCL 2.23 released with improved bandwidth on 400G InfiniBand; TensorRT 10.9 added FP4 for Blackwell.
<!-- Add new weekly entries above this line. Move entries below #20 to a separate archive page when this list grows beyond 20 items. -->
CUDA and ROCm releases move fast. The simplest way to stay current is to run each new toolkit version on bare-metal instances where you control the driver stack and can verify behavior against your own kernels before pushing to production. Spheron gives you SSH root access and the ability to swap base images on every deployment, with no GPU sharing across tenants.
Quick Setup Guide
Run nvcc --version to see the installed toolkit. Run nvidia-smi to see the driver version and the maximum CUDA runtime the driver supports. If the toolkit version exceeds the driver ceiling, upgrade the driver first.
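The toolkit-versus-driver check can also be scripted. The following sketch parses the text that nvcc --version and nvidia-smi print and compares the two versions. The helper names and the sample strings are illustrative assumptions; in practice the text would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.

```python
import re

# Illustrative sample output; real version numbers will differ per machine.
SAMPLE_NVCC = "Cuda compilation tools, release 12.8, V12.8.61"
SAMPLE_SMI = "| NVIDIA-SMI 570.86.10   Driver Version: 570.86.10   CUDA Version: 12.8 |"


def toolkit_version(nvcc_output: str) -> tuple:
    """Extract (major, minor) from the 'release X.Y' text nvcc prints."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def driver_cuda_ceiling(smi_output: str) -> tuple:
    """Extract (major, minor) from the 'CUDA Version: X.Y' header of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version: X.Y' found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def driver_needs_upgrade(nvcc_output: str, smi_output: str) -> bool:
    """True when the installed toolkit is newer than the driver's CUDA ceiling."""
    return toolkit_version(nvcc_output) > driver_cuda_ceiling(smi_output)


if __name__ == "__main__":
    if driver_needs_upgrade(SAMPLE_NVCC, SAMPLE_SMI):
        print("Toolkit exceeds the driver ceiling: upgrade the driver first.")
    else:
        print("Driver ceiling covers the installed toolkit.")
```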
Add NVIDIA's package repo: wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt-get update. Install: sudo apt-get install cuda-toolkit-12-8. Verify: nvcc --version. Update PATH if needed: export PATH=/usr/local/cuda-12.8/bin:$PATH.
Use the version-specific index URL: pip install torch --index-url https://download.pytorch.org/whl/cu128 for CUDA 12.8, or replace cu128 with cu124 for CUDA 12.4. Never mix PyTorch wheels built for different CUDA targets - this can cause silent compute errors.
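The index URL follows a simple pattern: "cu" followed by the CUDA major and minor digits. A minimal sketch of that mapping is below. `pytorch_index_url` is a hypothetical helper, and it only builds the string; not every CUDA version has a published wheel index, so check the PyTorch installation matrix before relying on the result.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Build the wheel index URL for a CUDA version such as '12.8'."""
    parts = cuda_version.split(".")
    if len(parts) < 2 or not all(part.isdigit() for part in parts[:2]):
        raise ValueError(f"expected a version like '12.8', got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


if __name__ == "__main__":
    for version in ("12.8", "12.4"):
        print(f"pip install torch --index-url {pytorch_index_url(version)}")
```

After installing, `torch.version.cuda` reports the CUDA version the wheel was built against, which is a quick way to confirm the right variant landed.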
Pull AMD's official ROCm Docker image: docker pull rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release-2.6.0. Inside the container, install vLLM: pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm7.0. Verify ROCm sees your GPU: rocm-smi.
In your Dockerfile or requirements.txt, pin both the base image (e.g., nvcr.io/nvidia/pytorch:24.10-py3) and framework versions. Unpinned installs pull the latest CUDA wheel, which may not have been tested against your custom kernels or CUDA extensions.
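A quick way to catch unpinned dependencies is to scan requirements.txt for lines without an exact == pin. The sketch below is a simplified check under stated assumptions: `unpinned_requirements` is a hypothetical helper, it treats anything without == as unpinned, and it does not handle URL requirements that contain a # fragment.

```python
def unpinned_requirements(lines: list) -> list:
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and pip options such as --index-url or -r other.txt.
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = [
        "--index-url https://download.pytorch.org/whl/cu128",
        "torch==2.6.0",
        "vllm",
        "numpy>=1.26  # range, not a pin",
        "# a comment line",
    ]
    for requirement in unpinned_requirements(sample):
        print(f"unpinned: {requirement}")
```

Running a check like this in CI makes an accidental unpinned install fail the build instead of silently pulling a newer CUDA wheel.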
Subscribe to NVIDIA's developer blog at developer.nvidia.com/blog and AMD's ROCm GitHub releases at github.com/ROCm/ROCm/releases. CUDA minor releases (12.4 to 12.6) are usually backward compatible. Major releases (12 to 13) may change ABI. Test new toolkit versions in staging before updating production instances.
Frequently Asked Questions
CUDA 13.2 is the current stable release as of May 2026, introducing the cuTile Python DSL for tile-level kernel programming. CUDA 13.x supports Turing, Ampere, Ada, Hopper (H100/H200), and Blackwell architectures; it dropped support for Maxwell, Pascal, and Volta. H100 and H200 workloads can run on CUDA 13.x.
ROCm 7.2 is the current stable release. MI300X and MI350X have been fully supported since ROCm 7.0.0. vLLM ROCm wheels require ROCm 7.0 or later. Install via: pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm7.0.
PyTorch 2.5 dropped support for CUDA 11.x. PyTorch 2.6 targets CUDA 12.4 as the recommended runtime (compute capability 8.0+). Check the PyTorch installation matrix at pytorch.org/get-started/locally/ for the exact wheel matching your CUDA version.
Run nvcc --version to see the installed CUDA Toolkit version. Run nvidia-smi to see the driver version and maximum supported CUDA runtime. To upgrade, add NVIDIA's package repo and install: sudo apt-get install cuda-toolkit-12-8. On Spheron, the easiest path is to select an NVIDIA NGC container image with the target CUDA version pre-installed when deploying from app.spheron.ai.
The CUDA Toolkit includes the compiler (nvcc), libraries such as cuBLAS, and headers; cuDNN is distributed separately. The CUDA driver is installed at the OS level as part of the GPU driver package. The toolkit version must be less than or equal to the driver's maximum supported CUDA version. nvidia-smi shows the driver's CUDA ceiling (e.g., 'CUDA Version: 12.8'). You can install CUDA Toolkit 12.4 on a system whose driver reports CUDA 12.8; the ceiling only needs to be greater than or equal to the toolkit version.
