NVIDIA put Rubin CPX on its roadmap at the AI Infra Summit in September 2025, then pulled it six months later at GTC 2026. The chip was supposed to be a GPU purpose-built for million-token context inference. What replaced it was fundamentally different: the Groq 3 LPX Rack, a 256-chip SRAM-based inference accelerator that NVIDIA acquired via a $20B licensing deal with Groq. If you have been researching "Rubin CPX" and found conflicting information, that is why. The chip is cancelled. This post covers what it was, why NVIDIA made the switch, and what the pivot means for long-context inference in 2026.
What Rubin CPX Was Supposed to Be
Rubin CPX was announced at NVIDIA's AI Infra Summit in September 2025. The specs, as originally announced, were:
- 128GB GDDR7 memory at approximately 1.5-2 TB/s bandwidth (estimated from GDDR7 spec; NVIDIA never published an official CPX bandwidth figure)
- 30 PFLOPS NVFP4 compute, matching the inference-tuned compute profile of the Rubin platform
- Integrated video encode/decode hardware, targeting multimodal inference workloads at long sequence lengths
- 3x attention speedup versus GB300, claimed at the time of announcement for specific long-context benchmarks
The design philosophy was straightforward: build a GPU variant optimized specifically for the decode phase of long-context inference, not training. Rubin R100 was the training-and-inference chip for the platform; CPX was supposed to handle the pure decode workloads where memory capacity mattered more than compute density.
GDDR7 was the deliberate memory choice. GDDR7 is cheaper than HBM per gigabyte and supports higher raw capacity per device. For KV cache reads, which are streaming access patterns rather than random access, GDDR7 was a reasonable fit. At 128GB per chip, CPX would have been capable of holding a 70B model's weights in FP8 alongside a meaningful KV cache without multi-GPU sharding.
The rack-scale vision was called the Vera Rubin NVL144 CPX: a system built around the CPX chip that promised roughly 8 exaflops and 100TB of memory across the full rack. That is the system description you will encounter in pre-GTC 2026 press coverage. It no longer exists on NVIDIA's roadmap.
All CPX specs above are from the September 2025 announcement. The chip was cancelled before reaching production silicon. Treat these figures as what NVIDIA announced, not what it shipped.
The Million-Token Problem: Why Memory Architecture Matters
To understand why NVIDIA pivoted away from CPX, you need to understand what long-context inference actually demands at the hardware level.
The KV cache is the central memory constraint. For each token in the context, the attention mechanism stores key and value tensors for every layer and every head. The memory for a single token is: 2 × num_layers × num_heads × head_dim × bytes_per_element. For a Llama 3 70B-class model (80 layers, 8 KV heads, 128 head dim) in BF16, each token costs roughly 320KB. Here is what that means at different context lengths:
| Context length | KV cache size (BF16) | KV cache size (FP8) | KV cache size (FP4) |
|---|---|---|---|
| 32K tokens | ~10 GB | ~5 GB | ~2.5 GB |
| 128K tokens | ~40 GB | ~20 GB | ~10 GB |
| 256K tokens | ~80 GB | ~40 GB | ~20 GB |
| 1M tokens | ~320 GB | ~160 GB | ~80 GB |
At 1M tokens in BF16, no single GPU today can hold the KV cache. At FP4, a B300 (288GB) could just fit it on one chip, but FP4 KV caches are not yet standard in production inference frameworks.
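The table above follows directly from the per-token formula. A minimal sketch, using the same 70B-class geometry (80 layers, 8 KV heads, 128 head dim):

```python
# Reproduce the KV-cache sizing table from the per-token formula:
# 2 x num_layers x num_kv_heads x head_dim x bytes_per_element.
def kv_cache_gb(tokens, bytes_per_elem, layers=80, kv_heads=8, head_dim=128):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 2**30  # GB (binary)

for ctx in (32 * 1024, 128 * 1024, 1024 * 1024):
    # BF16 (2 bytes), FP8 (1 byte)
    print(f"{ctx:>9,} tokens: {kv_cache_gb(ctx, 2):6.1f} GB BF16, "
          f"{kv_cache_gb(ctx, 1):6.1f} GB FP8")
```

At 1M tokens the BF16 figure lands on 320 GB exactly, matching the table.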
Raw capacity is one dimension. Memory bandwidth is the other. At each decode step, the model reads the full KV cache to compute attention. For a 70B model with a 128K-token context in FP8:
- KV cache size: ~20 GB
- H100 SXM (3.35 TB/s): ~6 ms per decode step for KV reads alone
- B200 SXM (8 TB/s): ~2.5 ms per decode step
- Rubin R100 (22 TB/s): ~0.9 ms per decode step
- Groq 3 LPU SRAM (150 TB/s): ~0.13 ms per decode step
At 1M tokens in FP8, the KV cache is ~160 GB. Reading it at 8 TB/s takes 20 ms per step. Reading it at 150 TB/s SRAM takes 1 ms. That difference, repeated for every token in a long generation, is where the Groq 3 LPX wins against any GPU approach, including what CPX was designed to do.
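The per-step read times above are just cache size divided by bandwidth. Conveniently, GB divided by TB/s comes out directly in milliseconds:

```python
# Time to stream the full KV cache at each decode step: size / bandwidth.
def decode_read_ms(kv_gb, bandwidth_tbps):
    # GB / (TB/s) = GB / (1000 GB/s) = ms numerically
    return kv_gb / bandwidth_tbps

# 70B model, 128K context, FP8 KV cache (~20 GB)
for name, bw in [("H100", 3.35), ("B200", 8), ("R100", 22), ("Groq 3 LPU", 150)]:
    print(f"{name:>11}: {decode_read_ms(20, bw):.2f} ms per decode step")

print(f"1M-token FP8 (~160 GB) on B200: {decode_read_ms(160, 8):.0f} ms")
```

This is a lower bound: it counts only KV reads, not attention compute or weight reads, which is why it isolates the bandwidth argument so cleanly.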
The million-token problem has three dimensions: capacity, bandwidth, and attention compute complexity. CPX tried to solve capacity with GDDR7. NVIDIA ultimately concluded SRAM-based decode acceleration was a more complete solution, at least for the pure decode phase. For more on KV cache optimization, see our KV cache optimization guide.
Why NVIDIA Chose SRAM Over GDDR7
The $20B Groq licensing deal, signed in December 2025 and unveiled at GTC 2026, was a strategic bet on SRAM-first architecture for inference. The tradeoffs are real on both sides:
GDDR7 (what CPX was supposed to use):
- Higher total capacity per device (128GB vs 500MB SRAM per chip)
- Lower cost per gigabyte
- Good fit for KV cache capacity
- Off-chip memory: latency is measurably higher than on-chip SRAM
- Bandwidth ceiling around 1.5-2 TB/s per 128GB module
SRAM (Groq 3 LPU):
- 500MB per chip, 128GB aggregate across a 256-chip LPX rack
- 150 TB/s bandwidth per chip, no off-chip memory accesses
- Deterministic execution: no cache misses, no DRAM controller overhead
- Hard capacity ceiling: 400B+ models cannot fit in current LPX rack SRAM
- Inflexible: decode-only, no training, no prefill at competitive speeds
For pure autoregressive decode at long context, SRAM wins decisively on latency. A 70B decode at 128K context is 45x faster in memory access time on LPU versus H100. For prefill and mixed workloads, GPUs win because prefill is compute-bound, not memory-bandwidth-bound, and LPUs have far lower raw TFLOPS.
NVIDIA's architecture position: GPUs handle training and prefill; LPUs handle decode. This is prefill-decode disaggregation at the hardware level, which is exactly what CPX was supposed to avoid by being a single chip that did decode well without requiring a separate hardware tier.
The pivot is a bet that a two-hardware-tier data center (GPUs + LPX racks) is better than a single-tier GPU setup with specialized variants. For a full breakdown of LPU architecture and economics, see our NVIDIA Groq 3 LPU explained guide.
Rubin CPX vs Rubin R100 vs Groq 3 LPX: The 2026 Inference Hierarchy
| Capability | Rubin CPX (cancelled) | Rubin R100 | Groq 3 LPX (256-LPU rack) |
|---|---|---|---|
| Memory | 128GB GDDR7 | 288GB HBM4 | 128GB aggregate SRAM |
| Memory bandwidth | ~1.5-2 TB/s (estimated) | up to 22 TB/s | 150 TB/s per LPU chip |
| Compute profile | Inference only | Training + inference | Decode only |
| Context window target | 1M+ tokens | General purpose | 1M+ tokens decode |
| Status | Cancelled at GTC 2026 | H2 2026 | Late 2026 early access |
| Use case | Long-context decode | All workloads | Long-context autoregressive generation |
The Vera Rubin NVL144 CPX, as announced, was a rack-scale system built around CPX chips targeting 8 exaflops and 100TB memory. That system no longer exists in NVIDIA's roadmap. The Groq 3 LPX Rack is the rack-scale inference alternative: 256 LPUs, 128GB aggregate SRAM, 640 TB/s scale-up bandwidth. It is architecturally different from what CPX would have been (SRAM vs GDDR7, deterministic vs GPU-style execution), but it addresses the same workload category.
R100 is the general-purpose next-generation GPU. It does training and inference and brings 22 TB/s HBM4 bandwidth and 288GB VRAM to every workload. Groq 3 LPX is the specialized decode tier that CPX was supposed to be, doing that one job with SRAM instead of GDDR7. For the full R100 spec breakdown, see our NVIDIA Rubin R100 guide.
What This Means for GPU Cloud Providers and Pricing
Short-term (now through H2 2026), the hardware available for long-context inference is H100, H200, B200, and B300. B300 is the closest production equivalent to what CPX promised for context window capacity: 288GB HBM3e handles the same capacity range that CPX was targeting, just at lower bandwidth.
| GPU | VRAM | Bandwidth | On-demand | Spot | Best for |
|---|---|---|---|---|---|
| H100 SXM5 | 80GB | 3.35 TB/s | $2.90/hr | N/A | 32K context, 7B-70B |
| H200 SXM5 | 141GB | 4.8 TB/s | $4.50/hr | $1.19/hr | 128K context, cost-sensitive |
| B200 SXM6 | 192GB | 8 TB/s | N/A | $2.06/hr | 256K context, FP4 inference |
| B300 SXM6 | 288GB | 8 TB/s | $8.70/hr | $2.45/hr | 512K+ context, maximum VRAM |
| A100 80G SXM4 | 80GB | 2 TB/s | $1.64/hr | $0.45/hr | Short context, batch jobs |
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 10 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Groq 3 LPX cloud pricing is not publicly available as of April 2026. The chip is in early-access preview. No on-demand or spot pricing has been published. Do not include LPX in your infrastructure cost projections until pricing is announced.
The practical hierarchy for long-context inference right now:
- Under 80K tokens: H100 or H200, cost-optimized. Single GPU fits the KV cache in FP8.
- 80K-256K tokens: H200 (FP8 KV) or B200 (FP4 KV). May require 2xH100 depending on model size.
- 256K-512K tokens: B200 or B300. FP8 KV cache on a single B300 handles most 70B use cases at 256K.
- 512K+ tokens: B300 with FP4 KV or multi-GPU. NVMe KV offloading extends effective capacity further.
- 1M+ tokens today: Multi-GPU B300, NVMe offloading, and KV quantization to FP4. Not single-GPU territory yet on available hardware.
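The hierarchy above can be encoded as a simple selection helper. The thresholds are this post's rules of thumb for a ~70B model, not hard hardware limits:

```python
# Illustrative helper encoding the long-context hierarchy above.
# Thresholds are rules of thumb for a ~70B model, not hard limits.
def long_context_tier(tokens: int) -> str:
    if tokens < 80_000:
        return "H100 or H200, FP8 KV cache on a single GPU"
    if tokens < 256_000:
        return "H200 (FP8 KV) or B200 (FP4 KV); possibly 2xH100"
    if tokens < 512_000:
        return "B200 or B300, FP8 KV cache"
    if tokens < 1_000_000:
        return "B300 with FP4 KV or multi-GPU; NVMe offload extends capacity"
    return "Multi-GPU B300 + NVMe offload + FP4 KV quantization"

print(long_context_tier(200_000))
```

In a real deployment the cutoffs shift with model size, batch size, and KV dtype, but the shape of the decision is the same.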
Workloads That Benefit Most from Long-Context Hardware
Coding agents. Modern coding agents maintain context over entire codebases, not just individual files. A 200K-token context covers roughly 10,000 lines of code across multiple files simultaneously. Practical minimum: 128K tokens. VRAM requirement for a 70B model: ~20GB FP8 KV cache at 128K plus ~35GB of INT4 weights = ~55GB. An H100 at 80GB fits with tight margins. H200 at 141GB gives comfortable headroom. Recommended: H200 or B200. R100 will make this straightforward on a single chip.
Video search and understanding. Video frame embeddings at 1-second intervals for a 60-minute video generate roughly 3,600 frame tokens, each representing a dense embedding. Practical minimum: 64K-128K effective tokens depending on representation. FP8 VRAM requirement: similar to text at the same sequence length. Key difference: prefill is longer relative to decode, making pure LPU unsuitable. GPU-only or GPU-dominant hybrid.
Long-document analysis. Legal, financial, and scientific documents at full book length run 100K-500K tokens. For retrieval-augmented workflows that must attend over the full document (not chunked), 256K+ context is necessary. B300 at 288GB handles a 70B model at 256K context in FP8 comfortably on a single chip.
Multi-turn agents. Agents that accumulate context across sessions see KV cache grow over hours of interaction. At 10 tokens per second and a 2-hour session, context grows to roughly 72K tokens. The problem is less about peak context and more about sustained KV cache growth. Memory offloading to NVMe becomes more valuable here than raw VRAM. With prefill-decode disaggregation and NVMe KV offloading, an H200 can sustain much longer effective contexts than its 141GB VRAM suggests.
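The sustained-growth point is easy to quantify. Assuming the 70B-class FP8 figure of ~160KB of KV per token from earlier in this post:

```python
# KV-cache growth for a long-running agent session:
# tokens/sec x session length x bytes/token (FP8, 70B-class assumed).
def session_kv(tok_per_s, hours, kv_bytes_per_token=160 * 1024):
    tokens = tok_per_s * hours * 3600
    return tokens, tokens * kv_bytes_per_token / 2**30  # (tokens, GB)

tokens, gb = session_kv(10, 2)
print(f"{tokens:,} tokens, ~{gb:.0f} GB of FP8 KV cache after 2 hours")
```

About 11 GB of KV after two hours for a single session: modest in isolation, but multiplied across concurrent sessions it is exactly the pressure that makes NVMe offloading attractive.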
Bridging to Rubin: Optimizing Your Inference Stack Today
Three techniques work well on current hardware and transfer directly to R100 and Groq 3 LPX architectures:
1. KV cache quantization. On H100/H200, FP8 KV cache halves the memory requirement versus BF16 with minimal quality impact. On B200/B300, NVFP4 KV cache halves it again. vLLM and SGLang both support this today. A B300 at 288GB with NVFP4 KV cache can handle roughly 1M tokens for a 7B model or 256K for a 70B model on one GPU. See our KV cache optimization guide for implementation details.
2. NVMe KV cache offloading. KV cache pages that are not actively needed for the current decode step can be stored on NVMe and paged back in. For workloads with long context but sparse attention patterns, NVMe offloading extends effective context capacity well beyond what VRAM alone allows. The latency penalty is real but manageable if your SSD throughput is high enough. See our NVMe KV cache offloading guide for benchmarks and configuration.
3. Prefill-decode disaggregation. Running prefill on GPUs and decode on separate instances optimizes hardware utilization for each phase. This architecture applies today (GPU-to-GPU disaggregation) and will apply to LPX deployments in 2026 (GPU prefill, LPU decode). Building disaggregated serving infrastructure now means less rearchitecting when LPX becomes available. See our prefill-decode disaggregation guide for deployment patterns.
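The capacity arithmetic behind technique 1 can be sketched directly: each halving of KV precision halves the footprint at a fixed context, or equivalently doubles the context a fixed VRAM budget holds. Figures below assume the 70B-class geometry used throughout this post:

```python
# KV-cache footprint at 256K context for a 70B-class model, by KV dtype.
# Each halving of precision halves the footprint at fixed context length.
BYTES_PER_ELEM = {"bf16": 2, "fp8": 1, "nvfp4": 0.5}

for dtype, b in BYTES_PER_ELEM.items():
    per_token = 2 * 80 * 8 * 128 * b  # layers x KV heads x head dim, K and V
    gb = 256 * 1024 * per_token / 2**30
    print(f"{dtype:>5}: {gb:5.0f} GB at 256K tokens")
```

The same halving applies at every context length, which is why KV dtype is the single highest-leverage knob for long-context serving on current hardware.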
Here is a practical vLLM launch configuration for long-context inference on a B200 with FP8 KV cache and chunked prefill enabled:
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92
```

This configuration runs a 70B model at 256K context across 2xB200 with FP8 KV cache. The --enable-chunked-prefill flag breaks long prompts into chunks to avoid OOM during prefill. Adjust --tensor-parallel-size to 1 for B300 at shorter context lengths.
Note on NVFP4 KV cache: --kv-cache-dtype nvfp4 is not yet available in stable vLLM releases as of April 2026. Support is in progress (vLLM PR #37332). Once merged, switching from fp8 to nvfp4 halves KV cache memory again on B200/B300, doubling effective context at the same VRAM budget. For continuous batching and PagedAttention tuning on top of these settings, see our LLM serving optimization guide.
If you are building for long-context inference today, B200 and B300 are the best production hardware while Rubin R100 and Groq 3 LPX ramp. Spheron offers B200, B300, H100, and H200 with bare metal and on-demand options.
