NVIDIA put Rubin CPX on its roadmap at the AI Infra Summit in September 2025, then pulled it six months later at GTC 2026. The chip was supposed to be a GPU purpose-built for million-token context inference. What replaced it was fundamentally different: the Groq 3 LPX Rack, a 256-chip SRAM-based inference accelerator that NVIDIA acquired via a $20B licensing deal with Groq. If you have been researching "Rubin CPX" and found conflicting information, that is why. The chip is cancelled. This post covers what it was, why NVIDIA made the switch, and what the pivot means for long-context inference in 2026.
What Rubin CPX Was Supposed to Be
Rubin CPX was announced at NVIDIA's AI Infra Summit in September 2025. The specs, as originally announced, were:
- 128GB GDDR7 memory at approximately 1.5-2 TB/s bandwidth (estimated from GDDR7 spec; NVIDIA never published an official CPX bandwidth figure)
- 30 PFLOPS NVFP4 compute, matching the inference-tuned compute profile of the Rubin platform
- Integrated video encode/decode hardware, targeting multimodal inference workloads at long sequence lengths
- 3x attention speedup versus GB300, claimed at the time of announcement for specific long-context benchmarks
The design philosophy was straightforward: build a GPU variant optimized specifically for the decode phase of long-context inference, not training. Rubin R100 was the training-and-inference chip for the platform; CPX was supposed to handle the pure decode workloads where memory capacity mattered more than compute density.
GDDR7 was the deliberate memory choice. GDDR7 is cheaper than HBM per gigabyte and supports higher raw capacity per device. For KV cache reads, which are streaming access patterns rather than random access, GDDR7 was a reasonable fit. At 128GB per chip, CPX would have been capable of holding a 70B model's weights in FP8 alongside a meaningful KV cache without multi-GPU sharding.
The rack-scale vision was called the Vera Rubin NVL144 CPX: a system built around the CPX chip that promised roughly 8 exaflops and 100TB of memory across the full rack. That is the system description you will encounter in pre-GTC 2026 press coverage. It no longer exists on NVIDIA's roadmap.
All CPX specs above are from the September 2025 announcement. The chip was cancelled before reaching production silicon. Treat these figures as what NVIDIA announced, not what it shipped.
The Million-Token Problem: Why Memory Architecture Matters
To understand why NVIDIA pivoted away from CPX, you need to understand what long-context inference actually demands at the hardware level.
The KV cache is the central memory constraint. For each token in the context, the attention mechanism stores key and value tensors for every layer and every head. The memory for a single token is: 2 × num_layers × num_heads × head_dim × bytes_per_element. For a Llama 3 70B-class model (80 layers, 8 KV heads, 128 head dim) in BF16, each token costs roughly 320KB. Here is what that means at different context lengths:
| Context length | KV cache size (BF16) | KV cache size (FP8) | KV cache size (FP4) |
|---|---|---|---|
| 32K tokens | ~10 GB | ~5 GB | ~2.5 GB |
| 128K tokens | ~40 GB | ~20 GB | ~10 GB |
| 256K tokens | ~80 GB | ~40 GB | ~20 GB |
| 1M tokens | ~320 GB | ~160 GB | ~80 GB |
At 1M tokens in BF16, no single GPU today can hold the KV cache. At FP4, a B300 (288GB) could just fit it on one chip, but FP4 KV caches are not yet standard in production inference frameworks.
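The table above follows directly from the per-token formula. A minimal sketch, using the same 70B-class geometry (80 layers, 8 KV heads, 128 head dim):

```python
# Reproduce the KV-cache sizing table from the per-token formula:
# 2 x num_layers x num_kv_heads x head_dim x bytes_per_element.
def kv_cache_gb(tokens, bytes_per_elem, layers=80, kv_heads=8, head_dim=128):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 2**30  # GB (binary)

for ctx in (32 * 1024, 128 * 1024, 1024 * 1024):
    # BF16 (2 bytes), FP8 (1 byte)
    print(f"{ctx:>9,} tokens: {kv_cache_gb(ctx, 2):6.1f} GB BF16, "
          f"{kv_cache_gb(ctx, 1):6.1f} GB FP8")
```

At 1M tokens the BF16 figure lands on 320 GB exactly, matching the table.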
Raw capacity is one dimension. Memory bandwidth is the other. At each decode step, the model reads the full KV cache to compute attention. For a 70B model with a 128K-token context in FP8:
- KV cache size: ~20 GB
- H100 SXM (3.35 TB/s): ~6 ms per decode step for KV reads alone
- B200 SXM (8 TB/s): ~2.5 ms per decode step
- Rubin R100 (22 TB/s): ~0.9 ms per decode step
- Groq 3 LPU SRAM (150 TB/s): ~0.13 ms per decode step
At 1M tokens in FP8, the KV cache is ~160 GB. Reading it at 8 TB/s takes 20 ms per step. Reading it at 150 TB/s SRAM takes 1 ms. That difference, repeated for every token in a long generation, is where the Groq 3 LPX wins against any GPU approach, including what CPX was designed to do.
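The per-step read times above are just cache size divided by bandwidth. Conveniently, GB divided by TB/s comes out directly in milliseconds:

```python
# Time to stream the full KV cache at each decode step: size / bandwidth.
def decode_read_ms(kv_gb, bandwidth_tbps):
    # GB / (TB/s) = GB / (1000 GB/s) = ms numerically
    return kv_gb / bandwidth_tbps

# 70B model, 128K context, FP8 KV cache (~20 GB)
for name, bw in [("H100", 3.35), ("B200", 8), ("R100", 22), ("Groq 3 LPU", 150)]:
    print(f"{name:>11}: {decode_read_ms(20, bw):.2f} ms per decode step")

print(f"1M-token FP8 (~160 GB) on B200: {decode_read_ms(160, 8):.0f} ms")
```

This is a lower bound: it counts only KV reads, not attention compute or weight reads, which is why it isolates the bandwidth argument so cleanly.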
The million-token problem has three dimensions: capacity, bandwidth, and attention compute complexity. CPX tried to solve capacity with GDDR7. NVIDIA ultimately concluded SRAM-based decode acceleration was a more complete solution, at least for the pure decode phase. For more on KV cache optimization, see our KV cache optimization guide.
Why NVIDIA Chose SRAM Over GDDR7
The $20B Groq licensing deal, signed in December 2025 and unveiled at GTC 2026, was a strategic bet on SRAM-first architecture for inference. The tradeoffs are real on both sides:
GDDR7 (what CPX was supposed to use):
- Higher total capacity per device (128GB vs 500MB SRAM per chip)
- Lower cost per gigabyte
- Good fit for KV cache capacity
- Off-chip memory: latency is measurably higher than on-chip SRAM
- Bandwidth ceiling around 1.5-2 TB/s per 128GB module
SRAM (Groq 3 LPU):
- 500MB per chip, 128GB aggregate across a 256-chip LPX rack
- 150 TB/s bandwidth per chip, no off-chip memory accesses
- Deterministic execution: no cache misses, no DRAM controller overhead
- Hard capacity ceiling: 400B+ models cannot fit in current LPX rack SRAM
- Inflexible: decode-only, no training, no prefill at competitive speeds
For pure autoregressive decode at long context, SRAM wins decisively on latency. A 70B decode at 128K context is 45x faster in memory access time on LPU versus H100. For prefill and mixed workloads, GPUs win because prefill is compute-bound, not memory-bandwidth-bound, and LPUs have far lower raw TFLOPS.
NVIDIA's architecture position: GPUs handle training and prefill; LPUs handle decode. This is prefill-decode disaggregation at the hardware level, which is exactly what CPX was supposed to avoid by being a single chip that did decode well without requiring a separate hardware tier.
The pivot is a bet that a two-hardware-tier data center (GPUs + LPX racks) is better than a single-tier GPU setup with specialized variants. For a full breakdown of LPU architecture and economics, see our NVIDIA Groq 3 LPU explained guide.
Rubin CPX vs Rubin R100 vs Groq 3 LPX: The 2026 Inference Hierarchy
| Capability | Rubin CPX (cancelled) | Rubin R100 | Groq 3 LPX (256-LPU rack) |
|---|---|---|---|
| Memory | 128GB GDDR7 | 288GB HBM4 | 128GB aggregate SRAM |
| Memory bandwidth | ~1.5-2 TB/s (estimated) | up to 22 TB/s | 150 TB/s per LPU chip |
| Compute profile | Inference only | Training + inference | Decode only |
| Context window target | 1M+ tokens | General purpose | 1M+ tokens decode |
| Status | Cancelled at GTC 2026 | H2 2026 | Late 2026 early access |
| Use case | Long-context decode | All workloads | Long-context autoregressive generation |
The Vera Rubin NVL144 CPX, as announced, was a rack-scale system built around CPX chips targeting 8 exaflops and 100TB memory. That system no longer exists in NVIDIA's roadmap. The Groq 3 LPX Rack is the rack-scale inference alternative: 256 LPUs, 128GB aggregate SRAM, 640 TB/s scale-up bandwidth. It is architecturally different from what CPX would have been (SRAM vs GDDR7, deterministic vs GPU-style execution), but it addresses the same workload category.
R100 is the general-purpose next-generation GPU. It does training and inference and brings 22 TB/s HBM4 bandwidth and 288GB VRAM to every workload. Groq 3 LPX is the specialized decode tier that CPX was supposed to be, doing that one job with SRAM instead of GDDR7. For the full R100 spec breakdown, see our NVIDIA Rubin R100 guide.
What This Means for GPU Cloud Providers and Pricing
Short-term (now through H2 2026), the hardware available for long-context inference is H100, H200, B200, and B300. B300 is the closest production equivalent to what CPX promised for context window capacity: 288GB HBM3e handles the same capacity range that CPX was targeting, just at lower bandwidth.
| GPU | VRAM | Bandwidth | On-demand | Spot | Best for |
|---|---|---|---|---|---|
| H100 SXM5 | 80GB | 3.35 TB/s | $2.90/hr | N/A | 32K context, 7B-70B |
| H200 SXM5 | 141GB | 4.8 TB/s | $4.50/hr | $1.19/hr | 128K context, cost-sensitive |
| B200 SXM6 | 192GB | 8 TB/s | N/A | $2.06/hr | 256K context, FP4 inference |
| B300 SXM6 | 288GB | 8 TB/s | $8.70/hr | $2.45/hr | 512K+ context, maximum VRAM |
| A100 80G SXM4 | 80GB | 2 TB/s | $1.64/hr | $0.45/hr | Short context, batch jobs |
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 10 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Groq 3 LPX cloud pricing is not publicly available as of April 2026. The chip is in early-access preview. No on-demand or spot pricing has been published. Do not include LPX in your infrastructure cost projections until pricing is announced.
The practical hierarchy for long-context inference right now:
- Under 80K tokens: H100 or H200, cost-optimized. Single GPU fits the KV cache in FP8.
- 80K-256K tokens: H200 (FP8 KV) or B200 (FP4 KV). May require 2xH100 depending on model size.
- 256K-512K tokens: B200 or B300. FP8 KV cache on a single B300 handles most 70B use cases at 256K.
- 512K+ tokens: B300 with FP4 KV or multi-GPU. NVMe KV offloading extends effective capacity further.
- 1M+ tokens today: Multi-GPU B300, NVMe offloading, and KV quantization to FP4. Not single-GPU territory yet on available hardware.
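The hierarchy above can be encoded as a simple selection helper. The thresholds are this post's rules of thumb for a ~70B model, not hard hardware limits:

```python
# Illustrative helper encoding the long-context hierarchy above.
# Thresholds are rules of thumb for a ~70B model, not hard limits.
def long_context_tier(tokens: int) -> str:
    if tokens < 80_000:
        return "H100 or H200, FP8 KV cache on a single GPU"
    if tokens < 256_000:
        return "H200 (FP8 KV) or B200 (FP4 KV); possibly 2xH100"
    if tokens < 512_000:
        return "B200 or B300, FP8 KV cache"
    if tokens < 1_000_000:
        return "B300 with FP4 KV or multi-GPU; NVMe offload extends capacity"
    return "Multi-GPU B300 + NVMe offload + FP4 KV quantization"

print(long_context_tier(200_000))
```

In a real deployment the cutoffs shift with model size, batch size, and KV dtype, but the shape of the decision is the same.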
Workloads That Benefit Most from Long-Context Hardware
Coding agents. Modern coding agents maintain context over entire codebases, not just individual files. A 200K-token context covers roughly 10,000 lines of code across multiple files simultaneously. Practical minimum: 128K tokens. VRAM requirement for a 70B model: ~20GB FP8 KV cache at 128K plus ~35GB of INT4 weights = ~55GB. An H100 at 80GB fits with tight margins. H200 at 141GB gives comfortable headroom. Recommended: H200 or B200. R100 will make this straightforward on a single chip.
Video search and understanding. Video frame embeddings at 1-second intervals for a 60-minute video generate roughly 3,600 frame tokens, each representing a dense embedding. Practical minimum: 64K-128K effective tokens depending on representation. FP8 VRAM requirement: similar to text at the same sequence length. Key difference: prefill is longer relative to decode, making pure LPU unsuitable. GPU-only or GPU-dominant hybrid.
Long-document analysis. Legal, financial, and scientific documents at full book length run 100K-500K tokens. For retrieval-augmented workflows that must attend over the full document (not chunked), 256K+ context is necessary. B300 at 288GB handles a 70B model at 256K context in FP8 comfortably on a single chip.
Multi-turn agents. Agents that accumulate context across sessions see KV cache grow over hours of interaction. At 10 tokens per second and a 2-hour session, context grows to roughly 72K tokens. The problem is less about peak context and more about sustained KV cache growth. Memory offloading to NVMe becomes more valuable here than raw VRAM. With prefill-decode disaggregation and NVMe KV offloading, an H200 can sustain much longer effective contexts than its 141GB VRAM suggests.
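The sustained-growth point is easy to quantify. Assuming the 70B-class FP8 figure of ~160KB of KV per token from earlier in this post:

```python
# KV-cache growth for a long-running agent session:
# tokens/sec x session length x bytes/token (FP8, 70B-class assumed).
def session_kv(tok_per_s, hours, kv_bytes_per_token=160 * 1024):
    tokens = tok_per_s * hours * 3600
    return tokens, tokens * kv_bytes_per_token / 2**30  # (tokens, GB)

tokens, gb = session_kv(10, 2)
print(f"{tokens:,} tokens, ~{gb:.0f} GB of FP8 KV cache after 2 hours")
```

About 11 GB of KV after two hours for a single session: modest in isolation, but multiplied across concurrent sessions it is exactly the pressure that makes NVMe offloading attractive.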
Bridging to Rubin: Optimizing Your Inference Stack Today
Three techniques work well on current hardware and transfer directly to R100 and Groq 3 LPX architectures:
1. KV cache quantization. On H100/H200, FP8 KV cache halves the memory requirement versus BF16 with minimal quality impact. On B200/B300, NVFP4 KV cache halves it again. vLLM and SGLang both support this today. A B300 at 288GB with NVFP4 KV cache can handle roughly 1M tokens for a 7B model or 256K for a 70B model on one GPU. See our KV cache optimization guide for implementation details.
2. NVMe KV cache offloading. KV cache pages that are not actively needed for the current decode step can be stored on NVMe and paged back in. For workloads with long context but sparse attention patterns, NVMe offloading extends effective context capacity well beyond what VRAM alone allows. The latency penalty is real but manageable if your SSD throughput is high enough. See our NVMe KV cache offloading guide for benchmarks and configuration.
3. Prefill-decode disaggregation. Running prefill on GPUs and decode on separate instances optimizes hardware utilization for each phase. This architecture applies today (GPU-to-GPU disaggregation) and will apply to LPX deployments in 2026 (GPU prefill, LPU decode). Building disaggregated serving infrastructure now means less rearchitecting when LPX becomes available. See our prefill-decode disaggregation guide for deployment patterns.
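The capacity arithmetic behind technique 1 can be sketched directly: each halving of KV precision halves the footprint at a fixed context, or equivalently doubles the context a fixed VRAM budget holds. Figures below assume the 70B-class geometry used throughout this post:

```python
# KV-cache footprint at 256K context for a 70B-class model, by KV dtype.
# Each halving of precision halves the footprint at fixed context length.
BYTES_PER_ELEM = {"bf16": 2, "fp8": 1, "nvfp4": 0.5}

for dtype, b in BYTES_PER_ELEM.items():
    per_token = 2 * 80 * 8 * 128 * b  # layers x KV heads x head dim, K and V
    gb = 256 * 1024 * per_token / 2**30
    print(f"{dtype:>5}: {gb:5.0f} GB at 256K tokens")
```

The same halving applies at every context length, which is why KV dtype is the single highest-leverage knob for long-context serving on current hardware.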
Here is a practical vLLM launch configuration for long-context inference on a B200 with FP8 KV cache and chunked prefill enabled:
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92
```

This configuration runs a 70B model at 256K context across 2xB200 with FP8 KV cache. The --enable-chunked-prefill flag breaks long prompts into chunks to avoid OOM during prefill. Adjust --tensor-parallel-size to 1 for B300 at shorter context lengths.
Note on NVFP4 KV cache: --kv-cache-dtype nvfp4 is not yet available in stable vLLM releases as of April 2026. Support is in progress (vLLM PR #37332). Once merged, switching from fp8 to nvfp4 halves KV cache memory again on B200/B300, doubling effective context at the same VRAM budget. For continuous batching and PagedAttention tuning on top of these settings, see our LLM serving optimization guide.
If you are building for long-context inference today, B200 and B300 are the best production hardware while Rubin R100 and Groq 3 LPX ramp. Spheron offers B200, B300, H100, and H200 with bare metal and on-demand options.
