This is a weekly digest of GPU cloud hardware launches, pricing moves, supply chain updates, and inference framework releases. It updates every Monday. For current Spheron GPU pricing including live spot and on-demand rates, check the pricing page. For CUDA toolkit and framework-level news, see the CUDA and ROCm news digest.
Hardware Launches and Successor Reveals
NVIDIA Vera Rubin NVL72: Q3 2026 Production Shipments Confirmed - 10 Jun 2026
10 Jun 2026 - NVIDIA Newsroom
NVIDIA confirmed that Vera Rubin NVL72 enters production ramp in Q3 2026. The platform puts 72 Rubin GPUs in a single rack with ~20.7 TB of HBM4 in total (288 GB per GPU), targeting pre-training runs and large-scale inference at scales where H100 SXM5 instances on Spheron or B300 single-node configs hit memory walls. Initial allocations go to hyperscalers (AWS, Google Cloud, Azure), with neocloud availability expected Q4 2026 to Q1 2027 at the earliest. Lead times for custom NVL72 rack configs are tracking at 40+ weeks from initial order.
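The aggregate memory figure follows directly from the per-GPU capacity. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function name is illustrative only:

```python
# Figures quoted in the item above; decimal units are an assumption.
GPUS_PER_RACK = 72
HBM4_GB_PER_GPU = 288


def rack_memory_tb(gpus: int, gb_per_gpu: float) -> float:
    """Aggregate HBM capacity of one rack in decimal terabytes."""
    return gpus * gb_per_gpu / 1000


total_tb = rack_memory_tb(GPUS_PER_RACK, HBM4_GB_PER_GPU)
print(f"{total_tb:.1f} TB")  # 20.7 TB
```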
NVIDIA B300 SXM6 Spot Pricing Now Available - 2 Jun 2026
2 Jun 2026 - Spheron GPU catalog
B300 SXM6 on Spheron is now available as spot capacity starting at $3.29/hr per GPU, making Blackwell Ultra accessible for fault-tolerant batch inference and fine-tuning without reserved commitments. The B300 carries 288 GB HBM3e, roughly 3.6x the memory of H100 SXM5, with substantially higher FP4 throughput and a larger memory pool than B200. See the B300 vs B200 inference cost breakdown for workload-level pricing comparisons.
AMD MI400 (CDNA 5) Detailed at CES 2026; MI450 Sampling Underway - 28 May 2026
28 May 2026 - AMD Instinct Product Page
AMD detailed the MI400 series (CDNA 5 architecture) at CES 2026 in January. MI450 sampling has been underway since Q1 2026, with production shipments expected in H2 2026 and cloud GA to follow. Expected specs include higher memory bandwidth and FP4/FP6 compute density that would put it in direct competition with Vera Rubin on large-batch transformer workloads. ROCm 7.x software compatibility is confirmed from launch; PyTorch and vLLM support timelines depend on the MI350X adoption curve currently underway. For the full MI400 vs B300 competitive analysis, see the AMD MI400 vs NVIDIA B300 guide.
NVIDIA R100 NVLink72 Pre-Orders Now Open - 15 May 2026
15 May 2026 - Spheron GPU catalog
The R100 pre-order page is live. The R100 is a rack-scale NVLink72 unit that pairs 72 Blackwell Ultra GPUs with a full NVLink fabric, targeting multi-tenant inference at the 1T+ parameter scale. Exact VRAM and bandwidth specs have not been officially published yet; pricing and ship date are subject to change as production ramps.
GPU Cloud Pricing Changes
Neocloud H100 Spot Rates Compress as Supply Increases - 9 Jun 2026
9 Jun 2026 - Spot market observation
H100 SXM5 spot rates on Spheron have compressed to $1.46/hr per GPU, down from $1.66/hr in May 2026, as more tier-2 and tier-3 data centers bring new capacity online. On-demand rates are holding around $2.54/hr per GPU. For teams running fault-tolerant batch inference or fine-tuning experiments, spot is now the clear default; the 43% discount vs on-demand makes checkpointing worth the implementation overhead. Check current Spheron GPU pricing for the live rates across all GPU models.
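One way to sanity-check the "spot by default" call is to compute both the discount and how much extra runtime (restarts, recomputation after interruptions) spot can absorb before it costs as much as on-demand. The sketch below assumes a simple linear cost model in which interruptions only add billable hours; the function names are illustrative.

```python
def spot_discount(spot: float, on_demand: float) -> float:
    """Fractional saving of spot relative to on-demand."""
    return 1 - spot / on_demand


def break_even_overhead(spot: float, on_demand: float) -> float:
    """Extra runtime fraction at which a spot job costs the same as on-demand.

    Assumes interruptions only lengthen the job (linear cost model).
    """
    return on_demand / spot - 1


SPOT, ON_DEMAND = 1.46, 2.54  # $/hr per GPU, from the item above

discount = spot_discount(SPOT, ON_DEMAND)
overhead = break_even_overhead(SPOT, ON_DEMAND)
print(f"discount: {discount:.1%}")             # discount: 42.5%
print(f"break-even overhead: {overhead:.0%}")  # break-even overhead: 74%
```

Under this model, a spot job stays cheaper as long as interruptions add less than about 74% to its total runtime.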
AWS P5 H100 On-Demand at ~$6.88/hr Per GPU - 1 Jun 2026
1 Jun 2026 - AWS EC2 Instance Pricing (p5.48xlarge)
AWS p5.48xlarge (8x H100 SXM5, 192 vCPUs, 2048 GiB RAM) runs at approximately $55.04/hr on-demand as of June 2026, roughly $6.88/hr per GPU. AWS spot for p5 instances can reach 60-70% off on-demand rates, with interruption risk. The p5e instance family (H200 SXM5) is an alternative with higher memory capacity. For teams running long H100 training jobs, Spheron on-demand at $2.54/hr costs roughly 2.7x less per GPU-hour than AWS on-demand, without architectural changes to your job.
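The per-GPU rate and the cost ratio can be reproduced with a few lines. This is a sketch using the rates quoted above; the 7-day, 8-GPU job is a hypothetical workload chosen only to make the gap concrete.

```python
def per_gpu_rate(instance_rate: float, gpus: int) -> float:
    """Hourly rate per GPU for a multi-GPU instance."""
    return instance_rate / gpus


def job_cost(rate_per_gpu_hr: float, gpus: int, hours: float) -> float:
    """Total cost of running `gpus` GPUs for `hours` hours."""
    return rate_per_gpu_hr * gpus * hours


AWS_P5_INSTANCE_RATE = 55.04   # $/hr for p5.48xlarge, from the item above
SPHERON_H100_ON_DEMAND = 2.54  # $/hr per GPU, from the item above

aws_rate = per_gpu_rate(AWS_P5_INSTANCE_RATE, gpus=8)
ratio = aws_rate / SPHERON_H100_ON_DEMAND
print(f"AWS per GPU: ${aws_rate:.2f}/hr")  # AWS per GPU: $6.88/hr
print(f"ratio: {ratio:.1f}x")              # ratio: 2.7x

# Hypothetical job: 8 GPUs for 7 days.
HOURS = 7 * 24
aws_total = job_cost(aws_rate, 8, HOURS)
spheron_total = job_cost(SPHERON_H100_ON_DEMAND, 8, HOURS)
print(f"AWS: ${aws_total:,.2f}  Spheron: ${spheron_total:,.2f}")
```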
Google Cloud A3 Pricing Stable; A3 Ultra (H200) Now GA - 20 May 2026
20 May 2026 - Cloud provider update
Google Cloud's A3 Mega (H100 SXM5) holds at approximately $12.36/hr per GPU on-demand (verify at cloud.google.com for current rates, as pricing may have changed). A3 Ultra (H200 SXM5) went generally available in Q2 2026 at approximately $15.20/hr per GPU on-demand. Teams deciding on cost alone can rent an A100 at $1.69/hr on-demand on Spheron for smaller validation runs before committing to H100 or H200 production capacity.
Capacity and Availability Updates
HBM3e Supply Tightens: 36-52 Week Lead Times for Custom H200 Configs - 20 May 2026
20 May 2026 - Industry supply chain reports
HBM3e supply constraints are extending H200 and GB200 NVL72 custom-config lead times to 36-52 weeks as of May 2026. Spot demand for H200 SXM rental remains elevated because teams unable to get custom builds are absorbing neocloud spot capacity instead. H100 SXM5 availability looks more stable through H2 2026, as the HBM3 supply chain faces no similar constraint. For production inference workloads that need 141 GB per GPU today, spot H200 on Spheron is currently at $1.77/hr.
HBM4 Production Ramp: SK Hynix and Samsung Targeting H2 2026 Volume - 15 May 2026
15 May 2026 - Industry supply chain reports
SK Hynix and Samsung are targeting H2 2026 for HBM4 volume production, which will feed Vera Rubin NVL72 and next-gen AMD Instinct builds. Early yields are reportedly satisfactory, though volume availability for cloud deployments will depend on how NVIDIA allocates initial HGX shipments. Cloud providers expecting Q4 2026 Vera Rubin availability are likely already in allocation queues; independent neoclouds will follow 3-6 months behind.
New Tier-3 Data Center Capacity Added in EU and Asia Pacific - 5 May 2026
5 May 2026 - Spheron capacity update
Spheron added new data center partners in the EU and Asia Pacific regions, expanding spot capacity for H100 SXM5 and making B200 rental available in additional regions. Teams with regional data residency requirements now have more options without falling back to hyperscalers. All new capacity meets Tier 3+ compliance standards. For a broader picture of GPU cloud supply across regions, see the GPU cloud providers in Asia Pacific guide.
New Models and Inference Frameworks
vLLM 0.21.0: KV Offload and Hybrid Memory Allocator - 15 May 2026
15 May 2026 - vLLM v0.21.0 Release Notes
vLLM 0.21.0 ships KV offload with the Hybrid Memory Allocator (HMA), enabling larger effective context windows by tiering KV cache between GPU and CPU memory. This makes 128K context runs on H100 SXM5 more cost-effective since you are not paying for a second GPU just to hold the KV cache. The release also adds speculative decoding for reasoning models with thinking budget constraints and the TOKENSPEED_MLA backend on Blackwell for DeepSeek-R1 throughput. See the vLLM production deployment guide for flag-level recommendations on Spheron instances.
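To see why long contexts put pressure on GPU memory, it helps to estimate the KV cache for a single sequence. The sketch below uses the standard per-token formula (keys plus values, across all layers and KV heads). The model dimensions are an assumption, a Llama-3-70B-style configuration with grouped-query attention and FP16 cache values, and are not taken from the release notes.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed dimensions (Llama-3-70B-style, grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT_TOKENS = 128 * 1024

size_bytes = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
size_gib = size_bytes / 2**30
print(f"{size_gib:.0f} GiB per 128K-token sequence")  # 40 GiB per 128K-token sequence
```

Under these assumptions a single 128K-token sequence needs about 40 GiB of KV cache, half of an 80 GB H100 before any weights are loaded, which is the pressure that tiering the cache to CPU memory is meant to relieve.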
DeepSeek-R2 (Reported): 800B MoE Architecture with Improved STEM Reasoning - 1 Jun 2026
1 Jun 2026 - Community reports and technical analysis
Community reports indicate DeepSeek is preparing a 671B to 800B MoE model building on the R1 architecture with improved STEM and multi-step reasoning benchmarks. If confirmed, inference requirements would likely match or exceed the current DeepSeek-R1 single-node H100 setup. An 800B MoE at FP8 needs roughly 800 GB for weights alone, which exceeds the 640 GB of combined HBM3 on a single 8x H100 SXM5 node, so a single-node H100 deployment would require more aggressive quantization (around 4 bits per weight, roughly 400 GB) or a move to higher-memory GPUs. For GPU requirements, KV cache tuning, and FP8 quantization details at this scale, see the DeepSeek R2 deployment guide on GPU cloud.
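A weights-only estimate is a quick way to check single-node sizing at this scale. The sketch below multiplies parameter count by bytes per parameter and compares the result with the 640 GB of an 8x H100 node; it deliberately ignores KV cache, activations, and runtime overhead, so real requirements are higher. The parameter counts are the reported range, not confirmed specs.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal GB (billions of params x bytes per param)."""
    return params_billion * bits_per_param / 8


NODE_HBM_GB = 8 * 80  # 8x H100 SXM5 at 80 GB each

fits = {}
for params in (671, 800):   # reported range, unconfirmed
    for bits in (8, 4):     # FP8 vs a 4-bit quantization
        need = weight_memory_gb(params, bits)
        fits[(params, bits)] = need <= NODE_HBM_GB
        print(f"{params}B @ {bits}-bit: {need:.1f} GB, fits in one node: {fits[(params, bits)]}")
```

On weights alone, neither size fits in 640 GB at 8 bits per parameter, while both fit at 4 bits with room left for KV cache.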
SGLang 0.7: Speculative Decoding and Multi-Node Inference Improvements - 20 May 2026
20 May 2026 - SGLang GitHub releases
SGLang 0.7 ships improved speculative decoding throughput and multi-node tensor parallel inference on H100 and H200. The radix attention cache efficiency improvements in this release reduce KV cache fragmentation by 15-30% for chatbot workloads with shared prefixes. Teams running high-concurrency inference at 70B+ should benchmark SGLang 0.7 alongside vLLM 0.21.0 before committing to an inference stack. For a full framework benchmark comparison, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
GPU Availability on Spheron (Live)
Current spot and on-demand rates per GPU as of 13 Jun 2026:
| GPU | Spot $/hr | On-Demand $/hr | VRAM | Rental Page |
|---|---|---|---|---|
| H100 SXM5 | $1.46 | $2.54 | 80 GB HBM3 | Rent H100 |
| H200 SXM5 | $1.77 | $4.84 | 141 GB HBM3e | H200 on Spheron → |
| B200 SXM6 | $2.71 | $7.37 | 192 GB HBM3e | B200 capacity → |
| B300 SXM6 | $3.29 | $9.02 | 288 GB HBM3e | Check B300 availability → |
| A100 80GB SXM4 | $0.82 | $1.69 | 80 GB HBM2e | Rent A100 |
| R100 | Pre-order | Pre-order | TBA | R100 pre-order → |
Pricing fluctuates with GPU availability. The prices above are as of 13 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
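The table can also be read in terms of spot discount and spot price per GB of VRAM, which is sometimes a more useful lens for memory-bound workloads. A minimal sketch using the 13 Jun 2026 figures above; the metric names are illustrative and the R100 row is omitted because it has no published price.

```python
# (spot $/hr, on-demand $/hr, VRAM in GB), copied from the table above.
RATES = {
    "H100 SXM5": (1.46, 2.54, 80),
    "H200 SXM5": (1.77, 4.84, 141),
    "B200 SXM6": (2.71, 7.37, 192),
    "B300 SXM6": (3.29, 9.02, 288),
    "A100 80GB SXM4": (0.82, 1.69, 80),
}


def summarize(rates: dict) -> list[dict]:
    """Derive spot discount and spot $/GB-hr for each GPU."""
    return [
        {
            "gpu": gpu,
            "discount": 1 - spot / on_demand,
            "spot_per_gb_hr": spot / vram_gb,
        }
        for gpu, (spot, on_demand, vram_gb) in rates.items()
    ]


rows = summarize(RATES)
for row in sorted(rows, key=lambda r: r["spot_per_gb_hr"]):
    print(f'{row["gpu"]:<16} discount {row["discount"]:.1%}  '
          f'${row["spot_per_gb_hr"]:.4f}/GB-hr')

cheapest_per_gb = min(rows, key=lambda r: r["spot_per_gb_hr"])["gpu"]
deepest_discount = max(rows, key=lambda r: r["discount"])["gpu"]
```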
Archive (Pre-June 2026)
Items from earlier in 2026, kept for context and crawlability.
- Apr 2026: NVIDIA B300 SXM6 spot capacity became available on neocloud providers. Spheron listed B300 on-demand at $9.02/hr.
- Mar 2026: vLLM 0.20.0 released with Eagle3 speculative decoding and initial Blackwell support.
- Mar 2026: AWS p5e (H200 SXM5) expanded to additional regions. On-demand rate approximately $14.27/hr per GPU.
- Feb 2026: AMD MI350X generally available on neocloud providers running ROCm 7.0.
- Jan 2026: NVIDIA GB200 NVL72 began shipping to hyperscalers. Cloud GA expected mid-2026.
- Jan 2026: H100 SXM5 on-demand rates on neoclouds compressed to sub-$3/hr as 2025 supply came online.
GPU cloud pricing shifts weekly. This page tracks every notable move across hyperscalers and neocloud providers so you can spot the gap before your next training run.
Frequently Asked Questions
This page is updated weekly with GPU cloud pricing changes across AWS, GCP, Azure, and neocloud providers including Spheron, Lambda Labs, RunPod, and Vast.ai. Check the Pricing Changes section for the most recent updates, and visit Spheron's live pricing page at spheron.network/pricing/ for current rates.
NVIDIA's 2026 GPU roadmap includes Vera Rubin NVL72 (B300 successor, expected H2 2026 shipments to cloud providers), the R100 (NVLink72 rack-scale unit), and Blackwell Ultra variants (B300 SXM6). AMD launched MI350X and detailed MI400 (CDNA 5 architecture) at CES 2026, with MI450 sampling already underway and production shipments planned for H2 2026. Check the Hardware Launches section for dated updates with source links.
AWS p5.48xlarge (8x H100 SXM5, 192 vCPUs, 2048 GiB RAM) lists at approximately $55.04/hr on-demand as of mid-2026, or about $6.88/hr per GPU. AWS also offers p5e instances with H200. Compare at spheron.network/pricing/.
NVIDIA's Vera Rubin NVL72 targets Q3 2026 for initial HGX shipments to major cloud providers, with neocloud availability typically 3-6 months behind hyperscaler GA dates. Watch this page for availability announcements as providers confirm capacity.
For cost-per-token efficiency, H100 SXM5 spot instances at $1.46/hr on Spheron currently offer the best price-performance for 7B-70B models. B200 SXM6 spot at $2.71/hr is the better fit for models larger than 70B because of its higher memory bandwidth and 192 GB of HBM3e per GPU. Check the live GPU pricing page at spheron.network/pricing/ for current rates.
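Cost per token depends on measured throughput as well as the hourly rate, so any comparison should start from your own benchmark numbers. The sketch below shows the conversion; the throughput value is a hypothetical placeholder, not a benchmark result from this page.

```python
def cost_per_million_tokens(rate_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hr = tokens_per_sec * 3600
    return rate_per_hr / tokens_per_hr * 1_000_000


H100_SPOT = 1.46                    # $/hr per GPU, from the pricing table
HYPOTHETICAL_TOKENS_PER_SEC = 2500  # placeholder; substitute your own measurement

cost = cost_per_million_tokens(H100_SPOT, HYPOTHETICAL_TOKENS_PER_SEC)
print(f"${cost:.3f} per 1M tokens")
```

Running the same function with each GPU's rate and its measured throughput for your model gives a like-for-like comparison.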
