NVIDIA first announced the Rubin architecture at Computex 2024 and previewed initial R100 roadmap specs at GTC 2025. At CES 2026 (January 2026), the company confirmed the 6-chip platform was in production and announced H2 2026 availability. At GTC 2026 (March 2026), it published expanded system architecture details, finalized the full cloud partner list, and completed the 7-chip Vera Rubin platform by integrating the Groq 3 LPU as the 7th chip.

The Rubin CPX concept, originally announced at AI Infra Summit in September 2025, was removed from NVIDIA's roadmap at GTC 2026 and replaced by the Groq 3 LPX Rack, a dedicated low-latency inference accelerator. Each Groq 3 LPU carries 500 MB of on-chip SRAM; the full rack houses 256 LPUs with 128GB aggregate SRAM and 640 TB/s scale-up bandwidth.

The R100 headline specs: 288GB of HBM4 memory, 50 PFLOPS of FP4 compute, and 336 billion transistors. That's 3.33x the FP4 compute of the B300 and 5.6x the B200. NVIDIA's official CES 2026 claim is an up to 10x reduction in inference token cost at system scale compared to the Blackwell platform. Cloud providers start shipping R100 instances in H2 2026, with broader availability following into 2027. If you're building AI infrastructure right now, those numbers matter for your planning horizon.
Rubin R100 Specs: Full Table
| Spec | R100 | B300 | B200 | H100 |
|---|---|---|---|---|
| Architecture | Rubin | Blackwell Ultra | Blackwell | Hopper |
| VRAM | 288 GB | 288 GB | 192 GB | 80 GB |
| Memory type | HBM4 | HBM3e | HBM3e | HBM3 |
| Memory bandwidth (TB/s) | up to 22 | 8 | 8 | 3.35 |
| FP4 Dense (PFLOPS) | 50 | 15 | 9 | N/A |
| FP8 Dense (TFLOPS) | ~16,000 (confirmed) | 7,000 | 4,500 | 3,958 |
| FP16 Dense (TFLOPS) | ~8,000 (projected) | 3,500 | 2,250 | 1,979 |
| Transistors | 336 B | 208 B | 208 B | 80 B |
| TDP (W) | ~2,300 | 1,400 | ~1,200 | 700 |
| Interconnect | NVLink 6 (3.6 TB/s) | NVLink 5 (1.8 TB/s) | NVLink 5 (1.8 TB/s) | NVLink 4 (900 GB/s) |
| Networking | ConnectX-9 (1.6T) | ConnectX-8 (800G) | ConnectX-7 (400G) | ConnectX-7 (400G) |
Notes on sourcing:
- R100 specs come from NVIDIA's GTC 2025 roadmap, confirmed at CES 2026 and GTC 2026. NVIDIA officially refers to this chip as the "Rubin GPU" within the Vera Rubin platform; no official alphanumeric chip designation has been confirmed. "R100" was the initial pre-announcement industry speculation following prior-generation naming conventions (H100, B200), while "R200" (and VR200 for the full superchip) has since emerged as the more commonly used 2026 industry designation, reflecting the dual-die package design per supply chain sources.
- Memory bandwidth, NVLink 6, and ConnectX-9 specs are officially confirmed. The B300 transistor count (208B) is confirmed per NVIDIA's Blackwell Ultra announcement.
- R100 FP8 Dense is confirmed at approximately 16 PFLOPS per GPU, derived from NVIDIA's GTC 2026 system-level spec of 1.2 FP8 ExaFLOPS across the 72-GPU NVL72 rack. FP16 Dense for R100 is projected at approximately half the FP8 figure, consistent with prior-generation ratios.
- FP4 Dense figures represent inference throughput; the R100 training FP4 figure is 35 PFLOPS.
- B200 TDP reflects the SXM5 liquid-cooled data center variant (1,000W for the air-cooled PCIe variant).
- Final benchmark numbers will be available once cloud providers ship production systems in H2 2026.
What Changed from Blackwell: Architecture Deep Dive
HBM4 Memory: The Biggest Jump
HBM3e, which powers the B200 and B300, tops out around 8 TB/s per GPU under current NVIDIA implementations. The R100 ships with HBM4 delivering up to 22 TB/s per GPU, confirmed by NVIDIA at GTC 2026. That's up to 2.75x the bandwidth of the B200 and B300.
That bandwidth increase has concrete implications for inference. KV cache reads and writes are memory-bandwidth-bound operations. At 22 TB/s, a single R100 can sustain much larger active KV caches during long-context inference without stalling the compute pipeline. Models that currently require 4xH100 or 2xB200 for 128K context windows may fit on a single R100 with room to spare. Multi-GPU sharding introduces communication overhead that grows with context length; fewer shards means lower latency per token at scale.
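To make the bandwidth argument concrete, here is a back-of-envelope sketch of KV cache size and the memory-bound decode floor at each generation's bandwidth. The model config (80 layers, 8 KV heads, head dim 128, FP16 cache) is an illustrative Llama-70B-style assumption, not a published benchmark:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values each store n_layers * n_kv_heads * head_dim elements per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative Llama-70B-style config with grouped-query attention, FP16 cache
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"KV cache at 128K context: {cache / 1e9:.1f} GB")

# Lower bound on decode latency if each generated token streams the full cache
for name, tbps in [("H100 (3.35 TB/s)", 3.35), ("B200 (8 TB/s)", 8.0), ("R100 (22 TB/s)", 22.0)]:
    print(f"{name}: {cache / (tbps * 1e12) * 1e3:.2f} ms/token memory-bound floor")
```

Under these assumptions the 128K-context cache is roughly 43 GB, and the per-token streaming floor drops from about 13 ms on H100-class bandwidth to about 2 ms at 22 TB/s, which is where the single-GPU long-context argument comes from.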
Transistor Count: 336B vs 208B
The R100's 336 billion transistors compare to the B200's 208 billion and the H100's 80 billion. The jump from H100 to B200 was 2.6x, and Rubin adds 61% more transistors over Blackwell, continuing a pattern of substantial growth each generation.
Higher transistor density enables more compute units, larger on-chip caches, and more sophisticated memory access controllers without proportionally increasing die area. The practical effect is more FLOPs per watt and more cache capacity to hide DRAM latency, both of which matter for inference workloads that alternate between compute-heavy attention operations and memory-bound KV cache lookups.
FP4 Compute: 50 PFLOPS
The R100's 50 PFLOPS FP4 figure is the headline number. To put it in context: the B300 delivers 15 PFLOPS and the B200 delivers 9 PFLOPS, making the R100 3.33x the B300 and 5.6x the B200 on this metric.
FP4 quantization is increasingly viable for production inference. Modern inference engines support FP4-quantized weights with minimal quality degradation on well-trained models, and Blackwell-optimized architectures like those used in newer Llama generations are specifically designed with FP4 inference in mind. At 50 PFLOPS FP4, the R100 can process tokens at a rate that no current GPU cluster configuration can match per chip. For a detailed explanation of how FP4 quantization works and where it trades off quality for throughput, see our FP4 quantization and Blackwell GPU cost guide.
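As a rough illustration of what FP4 weight quantization does, the sketch below rounds weights onto an E2M1-style 4-bit grid with a per-block scale. This is a simplified stand-in for production schemes like NVFP4 (which also quantize the scales themselves); the block size and grid are illustrative assumptions:

```python
import numpy as np

# E2M1-representable magnitudes used by FP4 formats (sign handled separately)
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(weights, block=32):
    """Per-block scaled round-to-nearest onto the FP4 grid (illustrative sketch)."""
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_LEVELS[-1]  # block max maps to 6.0
    scale[scale == 0] = 1.0
    idx = np.abs(np.abs(w) / scale[..., None].squeeze(-1)[:, :, None] if False else np.abs(w / scale)[..., None] - FP4_LEVELS).argmin(axis=-1)  # nearest grid point
    deq = FP4_LEVELS[idx] * scale * np.sign(w)
    return deq.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w_q = quantize_fp4(w)
rel_err = np.linalg.norm(w - w_q) / np.linalg.norm(w)
print(f"relative quantization error: {rel_err:.3f}")
```

Even this naive round-to-nearest variant keeps the relative error on Gaussian weights modest, which is why well-calibrated FP4 inference can stay close to FP8 quality while halving weight memory and doubling effective math throughput.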
NVLink and Interconnect
Rubin ships with NVLink 6, NVIDIA's confirmed next-generation interconnect. NVLink 5 in Blackwell delivers 1.8 TB/s per GPU. NVLink 6 doubles that to 3.6 TB/s per GPU, following the per-generation doubling pattern.
For multi-node training at scale, inter-GPU bandwidth determines how quickly gradients synchronize across the cluster. A doubling of NVLink bandwidth reduces gradient sync time proportionally for communication-bound training runs. For rack-scale inference deployments using tensor parallelism across many GPUs, higher NVLink bandwidth also lowers the latency penalty from splitting attention heads across chips.
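The proportional claim can be checked with a standard ring all-reduce cost model. The 70B-parameter BF16 gradient buffer and 72-GPU scale are illustrative assumptions; real syncs also pay latency and software overheads on top of this bandwidth floor:

```python
def allreduce_seconds(grad_bytes, n_gpus, link_tbps):
    # Ring all-reduce moves 2 * (N - 1) / N of the buffer through each GPU's links
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (link_tbps * 1e12)

grad_bytes = 70e9 * 2  # 70B parameters with BF16 gradients
for gen, tbps in [("NVLink 5 (1.8 TB/s)", 1.8), ("NVLink 6 (3.6 TB/s)", 3.6)]:
    t = allreduce_seconds(grad_bytes, n_gpus=72, link_tbps=tbps)
    print(f"{gen}: {t * 1e3:.0f} ms per sync (bandwidth-bound floor)")
```

Doubling the link rate halves the bandwidth-bound term exactly, so communication-bound steps see the full benefit while compute-bound steps see less.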
R100 vs B300 vs B200 vs H100: Performance Projections
| GPU | Inference throughput (tokens/sec, Llama 70B FP8, normalized) | Memory capacity (GB) | Cost-per-token vs H100 | Suitable model size range |
|---|---|---|---|---|
| H100 SXM | 1.0x (baseline) | 80 | 1.0x | 7B-30B FP16, 70B with sharding |
| B200 | ~4.5x (est.) | 192 | ~0.25x | 7B-130B FP8/FP16 |
| B300 | ~5.5x (est.) | 288 | ~0.14x (Spheron spot) | 7B-200B FP8/FP16 |
| R100 | ~18-22x (projected) | 288 | ~0.05-0.08x (projected) | 7B-400B+ FP8/FP4 |
All R100 numbers are projected based on NVIDIA's published roadmap ratios, GTC 2025 architectural specifications, and GTC 2026 confirmations. NVIDIA officially claims an up to 10x reduction in inference token cost for the Vera Rubin NVL72 system compared to the Blackwell platform at rack scale; per-GPU cost projections in the table above use Spheron spot pricing assumptions and will differ from rack-scale system comparisons. Final per-GPU benchmark numbers will not be available until H2 2026. B200 and B300 numbers are early estimates consistent with published NVIDIA improvement ratios; see the B300 guide for sourcing details.
The R100's FP4 compute advantage compounds at larger model sizes. A 400B parameter model at FP4 precision requires approximately 200GB of VRAM just for weights, before accounting for KV cache. At FP8, the same model needs roughly 400GB, which requires 2xB300 or 3xB200, introducing NVLink communication overhead on every attention layer. The R100's 288GB HBM4 fits the 400B model on a single GPU in FP4, eliminating the inter-GPU communication overhead that today makes multi-GPU FP8 inference the only viable option for this model size.
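The weights-only arithmetic above reduces to bytes-per-parameter times parameter count. A minimal sketch, ignoring KV cache and activations:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_gb(params_billion, precision):
    """Weights-only footprint in GB; KV cache and activations come on top."""
    return params_billion * BYTES_PER_PARAM[precision]

for prec in ("fp16", "fp8", "fp4"):
    gb = weight_gb(400, prec)
    verdict = "fits" if gb <= 288 else "exceeds"
    print(f"400B @ {prec}: {gb:.0f} GB of weights ({verdict} a single 288 GB R100)")
```

At FP4 the 400B model's 200 GB of weights leaves roughly 88 GB of headroom for KV cache on one R100, while FP8's 400 GB forces multi-GPU sharding on today's hardware.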
Cloud Availability Timeline
Now (Q1 2026): The B300 is available on Spheron, CoreWeave, and select GPU cloud providers. B200 is widely available across the major GPU marketplaces. Pricing is compressing as supply increases.
H1 2026: B300 and B200 supply continues to ramp. Expect pricing pressure across providers as more capacity comes online. This is the current window to lock in competitive Blackwell pricing.
H2 2026: Rubin R100 systems begin shipping to the first confirmed cohort. At CES 2026, NVIDIA named AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale as the initial eight H2 2026 recipients. At GTC 2026 (March 2026), NVIDIA expanded the confirmed list with additional cloud providers including Crusoe and Together AI, confirming that specialist GPU clouds receive Rubin alongside the hyperscalers from the start. GTC 2026 also completed the 7-chip Vera Rubin platform by integrating the Groq 3 LPU as the 7th chip. The Rubin CPX concept (announced at AI Infra Summit in September 2025 with 30 PFLOPS NVFP4 and 128GB GDDR7 specs) was removed from the roadmap and replaced by the Groq 3 LPX Rack, a fundamentally different SRAM-based decode-acceleration architecture: 256 Groq 3 LPUs (500 MB on-chip SRAM each, 128GB aggregate at rack level) with 640 TB/s scale-up bandwidth, targeting trillion-parameter models at million-token context. The Groq 3 LPX Rack is expected to ship alongside Vera Rubin NVL72 deliveries starting H2 2026.
2027: Broader R100 availability across additional GPU cloud marketplaces, including Spheron. This is also when NVIDIA's Rubin Ultra is expected to begin shipping to first recipients. Rubin Ultra delivers 100 PFLOPS FP4 (2x Rubin), 1 TB HBM4e memory, and deploys in NVL576 "Kyber" racks. By this point, R100 pricing will likely be past the launch premium phase, and Blackwell pricing will have compressed further.
NVIDIA's cadence for the past two generations was approximately 2 years: Hopper (H100) launched in 2022, Blackwell (B200) in 2024, and Rubin (R100) is projected for 2026. That consistency makes the H2 2026 estimate reasonably reliable at the generation level. Specific months and availability tiers may shift.
Pricing Predictions
GPU pricing follows a consistent pattern across generations: high launch premiums followed by rapid compression as supply increases. Here's the trajectory from past generations applied to R100 projections:
| GPU | Launch price (cloud, $/hr) | Price ~18 months later | Decline |
|---|---|---|---|
| H100 (2023) | $8-10 | $1.50-3 | 70%+ |
| B200 (early 2025) | $5-7 | ~$2-4 (projected 2027) | ~50-60% est. |
| B300 (early 2026) | $4.95-18 | ~$2-6 (projected late 2027) | ~50-60% est. |
| R100 (H2 2026, projected) | $15-25 | ~$5-10 (projected 2028) | ~50-60% est. |
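The table's projections are consistent with a simple compound-decline model. Here is a sketch assuming a flat 5%/month decline (which compounds to roughly 60% over 18 months, in line with the estimates above) applied to the projected R100 launch band; the rate is an assumption, not an observed figure:

```python
def projected_price(launch, monthly_decline, months):
    # Compound monthly decay: launch * (1 - d)^months
    return launch * (1 - monthly_decline) ** months

for launch in (15.0, 25.0):
    after_18 = projected_price(launch, 0.05, 18)
    print(f"${launch:.0f}/hr launch -> ${after_18:.2f}/hr after 18 months")
```

At that rate the $15-25 launch band lands near $6-10, matching the table's 2028 projection.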
At launch, R100 pricing at hyperscalers will likely carry a 40-60% premium over current B300 dedicated rates. Specialist GPU clouds have historically undercut hyperscaler launch pricing by 40-60% once they receive hardware allocations.
| Provider tier | Projected R100 launch price ($/hr) | Notes |
|---|---|---|
| Hyperscalers (AWS, GCP, Azure) | $18-25 | Managed stack, SLA guarantees |
| Specialist GPU clouds (CoreWeave, Lambda, Nebius) | $10-15 | Better price, similar hardware |
| Marketplaces (Spheron) | $8-12 (projected spot) | Competitive rates once available |
For current Blackwell generation pricing, check GPU pricing on Spheron.
Should You Wait for Rubin or Buy Blackwell Now?
Wait for R100 if:
- You're planning a new, multi-year GPU infrastructure investment with no active workload deadline. If you're still in the design phase and won't deploy production systems until late 2026 or 2027, waiting is defensible.
- You're serving 400B+ parameter models where FP4 inference meets your quality bar. At FP4, a 400B model needs only 200GB of VRAM, fitting within the R100's 288GB HBM4 on a single GPU. That same workload currently requires 2xB300 or 3xB200 at FP8, adding NVLink communication overhead on every attention layer that the R100 eliminates entirely.
- You're building hardware for a long-term contract or multi-year lease and need the most forward-looking specs you can get. The R100's 6-9 month wait is less painful when amortized over a 3-year deployment.
Buy Blackwell now if:
- You have active inference or training workloads generating cost today. Every month you wait is direct opportunity cost, and Blackwell already delivers 4-5x the throughput per dollar of H100 at current pricing.
- The B300 or B200 meets your VRAM and throughput requirements. For the vast majority of current model sizes (7B-130B parameters), B300 is more than sufficient. The R100's extra compute doesn't help you if you're not hitting B300 limits.
- B300 pricing is already falling and cost-per-token is lower than what you're paying today on H100 infrastructure. The cost-per-token improvement from switching to B300 now will likely exceed any cost advantage from waiting for R100 launch pricing to mature. For the detailed cost-per-token math, see the B300 guide.
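The cost-per-token comparison reduces to price per hour divided by tokens per hour. A hedged sketch, assuming an H100 baseline of 1,000 tokens/s on a 70B FP8 serving workload and a B300 at roughly 5.5x that throughput (per the projection table; both figures are illustrative assumptions, not benchmarks):

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    # $/hr divided by tokens/hr, scaled to one million tokens
    return price_per_hour / (tokens_per_second * 3600) * 1e6

h100 = cost_per_million_tokens(2.50, 1000)   # assumed baseline throughput
b300 = cost_per_million_tokens(8.43, 5500)   # ~5.5x H100 throughput (est.)
print(f"H100 @ $2.50/hr: ${h100:.3f} per 1M tokens")
print(f"B300 @ $8.43/hr: ${b300:.3f} per 1M tokens")
```

Even at more than triple the hourly rate, the B300's throughput multiple makes it cheaper per token under these assumptions, which is the core of the "buy now" argument.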
Stay on H100/H200 if:
- You're running sub-70B models with stable throughput requirements and your current per-token costs are acceptable. If the workload fits and the economics work, there's no urgent reason to migrate.
- You're mid-contract with an H100 provider and the switching cost exceeds the margin improvement from migrating to Blackwell now.
- Your team needs more time to validate Blackwell's software ecosystem for your specific stack. TensorRT-LLM and vLLM support is mature on Blackwell, but if you're running custom CUDA kernels or niche frameworks, verify compatibility before committing.
NVIDIA's roadmap has been reliable at the generation level: Hopper, then Blackwell, then Rubin, on roughly 2-year cycles. But dates and exact specs shift. Don't make infrastructure decisions based on projected specs alone, particularly for budget commitments or production SLAs.
Getting Access to Blackwell GPUs While You Wait
If you need GPU capacity now, Blackwell is available and the pricing trajectory is favorable:
B300 and B200 on Spheron: B300 GPU rental is available at $8.43/hr on-demand. B200 GPU rental is available at $6.03/hr on-demand. Spot instances are suitable for batch inference and experimentation; dedicated instances guarantee availability for production workloads.
H100 and H200 for teams not yet on Blackwell: H100 GPU rental starts from $0.99/hr spot or $2.01/hr PCIe on-demand ($2.50/hr SXM on-demand) and is the most widely available option. H200 GPU rental is available at $4.54/hr on-demand and offers 141GB VRAM, making it the right choice for 70B models without Blackwell's cost premium.
Current marketplace rates: Check GPU pricing on Spheron for live rates across all GPU types. The pricing page reflects current spot and dedicated rates from multiple providers, updated continuously.
Multi-provider comparison: Spheron's marketplace aggregates B300, B200, H200, and H100 instances across multiple providers in a single dashboard. You can compare rates, check availability, and switch providers without re-provisioning your stack.
For current Blackwell GPU access, Spheron offers B300 and B200 on spot and dedicated instances. Compare live rates and deploy without contracts. View GPU pricing
