Engineering

NVIDIA Groq 3 LPU Explained: How the Non-GPU Inference Chip Changes AI Cloud Economics (2026)

Written by Mitrasish, Co-founder · Apr 7, 2026
Tags: NVIDIA Groq 3 LPU · LPU vs GPU · AI Inference · GPU Cloud · Inference Optimization · Groq 3 LPX · AI Hardware 2026

At GTC 2026, NVIDIA announced a $20 billion licensing deal with Groq and introduced the Groq 3 LPU: a purpose-built inference chip with 500 MB of on-chip SRAM and 150 TB/s memory bandwidth. That bandwidth figure is not a typo. For comparison, the H100 SXM has 3.35 TB/s of HBM3 bandwidth. The LPU delivers roughly 45x the per-chip memory bandwidth, achieved by replacing HBM with a massive SRAM pool that sits directly on the die.

This is not a GPU successor. The LPU does one thing: autoregressive token generation. It cannot train models, cannot run prefill at competitive speeds, and has no use for computer vision or video generation. What it does well, it does dramatically better than any GPU on watts per token. Whether that matters for your infrastructure depends on your workload profile, but the architecture is genuinely different from anything NVIDIA has shipped before.

What Is the NVIDIA Groq 3 LPU

The Groq 3 LPU (Language Processing Unit) is a chip optimized for a single operation: the decode phase of autoregressive language model inference. Decode is the token-by-token generation step, where the model produces each output token sequentially by attending to all previous tokens in the KV cache.

This phase is fundamentally memory-bandwidth-bound. For each token generated, the model reads the entire weight matrix and the full KV cache from memory. On a GPU, those weights sit in HBM with 3-8 TB/s of bandwidth. The LPU replaces HBM with a flat SRAM mesh that sits directly on the processor die. No off-chip memory accesses, no DRAM controllers, no HBM stacks. Every weight and every KV cache token is in on-chip SRAM at all times.

The result is that the decode phase, which is typically the bottleneck in LLM serving, runs at SRAM speeds rather than HBM speeds. NVIDIA claims 35x more inference throughput per megawatt compared to HBM-based GPU inference at equivalent model sizes.

The licensing arrangement matters here. Groq has been building LPU chips independently for several years, with a focus on deterministic, high-throughput inference. NVIDIA licensed the architecture and is now integrating it into its data center product portfolio as a complementary inference tier, not a GPU replacement.

Why NVIDIA Built a Non-GPU Chip

The memory wall in GPU inference is real. A 70B parameter model at FP8 precision stores roughly 70 GB of weights. Every decode step requires reading all of those weights once, plus reading the KV cache. At 3.35 TB/s (H100 SXM), reading 70 GB takes about 21 milliseconds of memory access time. At 8 TB/s (B200), it is 8.75 milliseconds. At 150 TB/s (LPU SRAM), it is approximately 0.47 milliseconds.

That is a roughly 45x improvement in weight-loading speed. For a model generating 500 tokens/second, the decode budget is 2 milliseconds per token: a 21-millisecond weight load cannot meet that on a single chip, while a 0.47-millisecond load leaves ample headroom. At 150 TB/s, memory access is no longer the bottleneck for a 70B model, so the chip can serve tokens faster while consuming far less power, because it spends less time waiting on memory.
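
The arithmetic above is easy to sanity-check. A small sketch using the figures from this post (per-chip bandwidths; real serving adds KV cache reads and compute time on top):

```python
# Time to stream a full FP8 weight matrix once per decode step at a given
# memory bandwidth. Figures are from this post, not independent benchmarks.

def weight_load_ms(params_billions: float, bytes_per_param: float,
                   bandwidth_tb_s: float) -> float:
    """Milliseconds to read all weights once at the given bandwidth."""
    weight_gb = params_billions * bytes_per_param  # FP8 = 1 byte/param
    return weight_gb / (bandwidth_tb_s * 1000) * 1000  # GB over GB/s, in ms

for name, bw in [("H100 SXM (HBM3)", 3.35), ("B200 (HBM3e)", 8.0),
                 ("Groq 3 LPU (SRAM)", 150.0)]:
    print(f"{name}: {weight_load_ms(70, 1.0, bw):.2f} ms per 70B weight pass")
```

This is the entire memory-wall argument in one function: the same 70 GB read goes from ~21 ms to under half a millisecond purely by changing the denominator.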

GPUs are general-purpose. They need programmable compute for matrix multiplications, HBM for large model storage, and the ability to handle arbitrary input/output shapes. That generality is what lets GPUs do training, fine-tuning, vision, video, embedding generation, and inference all on the same hardware. The LPU gives up all of that flexibility to optimize the one operation that accounts for most real-time AI serving cost.

For a data center operator running millions of inference requests per day, the economics matter. An H100 drawing 700W and generating 3,000 tokens/second for 70B serving delivers roughly 4.3 tokens/watt. NVIDIA's published figures for the Groq 3 LPU put it at around 150 tokens/watt for comparable model sizes. That is a 35x improvement in inference per watt.

The tradeoff is inflexibility. If your workload suddenly shifts from inference to training, or you need to run a prefill-heavy RAG pipeline, the LPU cannot help. You still need GPUs for those tasks.

Architecture: The SRAM-First Design

On-Chip SRAM Pool

The core architectural decision is replacing all off-chip memory with a single flat SRAM pool. Each Groq 3 LPU chip contains 500 MB of on-chip SRAM. For a 7B model at FP8 (7 GB of weights), a single chip cannot hold the full model. For production 70B deployment, LPU chips are connected in a mesh topology where the aggregate SRAM across the full rack holds the entire model in on-chip memory.

The Groq 3 LPX rack contains 256 LPU chips. At 500 MB per chip, the full rack has 128 GB of on-chip SRAM. A 70B FP8 model fits inside a single LPX rack with room for KV cache.
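
As an illustrative capacity check using the rack figures above (this ignores activation memory, interconnect overhead, and any replication for throughput):

```python
# Illustrative rack-capacity check with the spec-table numbers from this post.

def rack_sram_gb(chips: int = 256, sram_mb_per_chip: int = 500) -> float:
    """Aggregate on-chip SRAM for an LPX rack, in GB."""
    return chips * sram_mb_per_chip / 1000

def fits_in_rack(params_billions: float, bytes_per_param: float,
                 kv_cache_gb: float, rack_gb: float) -> bool:
    """True if weights plus KV cache fit inside the rack's SRAM."""
    return params_billions * bytes_per_param + kv_cache_gb <= rack_gb

rack = rack_sram_gb()                    # 128 GB
print(fits_in_rack(70, 1.0, 40, rack))   # 70B FP8 + 40 GB KV cache fits
print(fits_in_rack(400, 1.0, 0, rack))   # a 400B model does not
```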

Deterministic Execution

Unlike GPU inference, which can have variable latency due to DRAM access patterns, cache misses, and memory controller queuing, LPU execution is deterministic. The chip executes a fixed data-flow graph: each clock cycle does exactly the same operations in the same order. There is no prefetching logic, no cache hierarchy, no speculative execution.

This gives the LPU a latency profile that GPUs cannot match for steady-state generation: flat decode latency regardless of batch size variation, up to the chip's capacity limits.

No HBM, No DRAM Controllers

Removing HBM simplifies the chip significantly. HBM stacks require physical integration with the GPU die, high-bandwidth memory controllers, thermal management for the memory stacks, and complex memory allocation logic. The LPU has none of this. The die is simpler, the manufacturing process is more predictable, and the thermal envelope is lower.

The cost is model size flexibility. You cannot load a 400B parameter model onto a single LPX rack. The SRAM capacity is fixed. If your model does not fit in the on-chip SRAM of the available LPU system, you are limited to GPUs.

Prefill Limitation

The LPU is not designed for the prefill phase. Prefill processes the input prompt all at once, as a batched matrix multiplication over the full sequence length. This operation is compute-bound at long sequence lengths, not memory-bandwidth-bound. GPUs are faster for prefill because they have far more raw TFLOPS.

This architectural limitation means LPU deployments always require GPUs for the prefill step. A request arrives, a GPU processes the prompt (prefill), and the KV cache is transferred to the LPU for decode. This disaggregated architecture adds complexity but maximizes the throughput-per-watt advantage of the LPU for the decode step.
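
The prefill-to-decode handoff can be sketched as follows. This is an illustrative toy, not a real serving API: `GpuPrefillWorker`, `LpuDecodeWorker`, and the dummy sampler are invented for the sketch.

```python
# Toy model of disaggregated serving: a GPU worker builds the KV cache from
# the prompt, then hands it to a decode worker for token-by-token generation.
from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list[int]  # stand-in for the real per-layer key/value tensors

class GpuPrefillWorker:
    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # GPU processes the whole prompt in one compute-bound pass
        return KVCache(tokens=list(prompt_tokens))

class LpuDecodeWorker:
    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        out = []
        for _ in range(max_new_tokens):
            # one token per step; each new token's KV entries join the cache
            next_tok = (kv.tokens[-1] + 1) % 32000  # dummy sampler
            kv.tokens.append(next_tok)
            out.append(next_tok)
        return out

prefill, decode = GpuPrefillWorker(), LpuDecodeWorker()
kv = prefill.prefill([101, 102, 103])  # 1) GPU runs prefill
completion = decode.decode(kv, 4)      # 2) KV cache handed off for decode
print(completion)                      # → [104, 105, 106, 107]
```

The expensive part in production is step 2's handoff: the KV cache transfer is real data movement, which is why serving frameworks treat disaggregation as an explicit, configured mode rather than a default.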

Full Specs Breakdown

| Spec | Groq 3 LPU (per chip) | Groq 3 LPX (full rack) |
| --- | --- | --- |
| On-chip SRAM | 500 MB | 128 GB (256 chips) |
| Memory bandwidth | 150 TB/s | 40 PB/s aggregate |
| Chip count | 1 | 256 |
| Model capacity | N/A: rack deployment required (500 MB SRAM cannot hold any production LLM) | 7B–70B models (FP8) at rack level |
| Inference throughput per watt | ~150 tokens/W | rack-level |
| Precision support | FP8 (confirmed); INT8, BF16 unverified | same |
| Training support | No | No |
| Prefill performance | Low | Low |
| Target workload | Autoregressive decode | Dense LLM serving |

Figures are based on NVIDIA GTC 2026 press release disclosures. Per-chip SRAM bandwidth is computed from total SRAM die bandwidth. Run your own benchmarks before production decisions. Aggregate LPX bandwidth is theoretical maximum.

LPU vs GPU Comparison

| Dimension | Groq 3 LPU | H100 SXM5 | B200 SXM6 |
| --- | --- | --- | --- |
| Memory type | On-chip SRAM | HBM3 | HBM3e |
| Memory bandwidth | 150 TB/s (per chip) | 3.35 TB/s | 8.0 TB/s |
| Tokens/watt (70B, FP8) | ~150 | ~4.3 | ~10 |
| Training support | No | Yes | Yes |
| Fine-tuning support | No | Yes | Yes |
| Prefill performance | Low | High | Very high |
| Vision/multimodal | No | Yes | Yes |
| Variable batch support | Limited | Full | Full |
| Deterministic latency | Yes | No | No |
| Production availability | Early access (2026) | Available now | Available now |
| Cloud pricing | Unannounced | from $2.90/hr | from $7.43/hr |

Pricing fluctuates based on GPU availability. The prices above are as of 07 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The comparison highlights why the LPU is complementary rather than competitive. For every dimension that matters for training and general-purpose compute, GPUs dominate. For the one dimension that matters most for high-volume real-time LLM serving (tokens per watt), the LPU is significantly ahead.

The practical implication is that teams running high-volume inference today should understand the LPU architecture so they are prepared to evaluate it when pricing and availability become clearer. Teams with diverse workloads (training, fine-tuning, multimodal) will continue to need GPUs as their primary infrastructure regardless of LPU deployment.

For a full benchmark comparison of H100, H200, B200, and L40S for current production inference, see Best GPU for AI Inference in 2026.

The 25% Rule for Hybrid Deployments

NVIDIA's recommended deployment model for the Groq 3 LPU is a hybrid GPU-LPU architecture: GPUs handle all non-decode work, LPUs handle steady-state token generation.

NVIDIA published a reference ratio of 3:1 GPU-to-LPU for mixed inference fleets. In practical terms, this means that for a fleet running primarily dense LLM inference (7B-70B models), roughly 25% of compute budget should be LPU capacity, with 75% remaining on GPU. The exact ratio depends on your workload:

Prefill-to-decode ratio. Workloads with short prompts and long outputs (chatbots, creative writing APIs) have low prefill-to-decode ratios. More of the work is decode, and those workloads benefit most from LPU offloading. Workloads with long prompts and short outputs (classification, RAG retrieval with long documents) are prefill-heavy and may see little benefit.

Traffic steadiness. LPUs are efficient at steady-state load. They do not handle burst traffic gracefully because the SRAM capacity is fixed and model-specific. If your traffic spikes 5x during peak hours, you need GPU capacity to absorb the burst regardless of how many LPUs you have.

Model specificity. A single LPX rack holds one model (or model-size class) at a time. If you serve 30 different models at varying traffic levels, LPUs require careful capacity planning. GPUs are flexible: you can hot-swap models, share GPU capacity across model families, and reallocate compute without hardware changes.

The disaggregated serving requirement. GPU-LPU hybrid deployments require prefill-decode disaggregation: GPUs run the prefill step, pass the KV cache to the LPU, and the LPU runs decode. This adds engineering complexity. Your serving stack must support disaggregated inference. vLLM, SGLang, and NVIDIA Dynamo all have disaggregated inference support, but you need to configure and test it. See the NVIDIA Dynamo disaggregated inference guide for setup details.
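
A back-of-the-envelope sizing helper, combining NVIDIA's published 3:1 reference ratio with a crude decode-share heuristic (the heuristic is an invented planning aid, not a capacity model):

```python
# Back-of-the-envelope sizing for a GPU-LPU hybrid fleet. The 3:1 reference
# ratio is from NVIDIA's guidance; the decode-share metric is illustrative.

def lpu_fraction(gpu_to_lpu_ratio: float = 3.0) -> float:
    """Fraction of fleet capacity allocated to LPUs for a GPU:LPU ratio."""
    return 1.0 / (gpu_to_lpu_ratio + 1.0)

def decode_share(avg_prompt_tokens: int, avg_output_tokens: int) -> float:
    """Crude share of token work that is decode; higher favors LPU offload."""
    return avg_output_tokens / (avg_prompt_tokens + avg_output_tokens)

print(f"{lpu_fraction(3.0):.0%} LPU capacity at the 3:1 reference ratio")
print(f"chatbot (200 in / 800 out): decode share {decode_share(200, 800):.0%}")
print(f"RAG (8000 in / 300 out):    decode share {decode_share(8000, 300):.0%}")
```

The two example workloads show why the ratio is workload-dependent: the chatbot spends most of its token budget in decode, while the long-document RAG pipeline barely touches it.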

For most teams today, the practical action is not to redesign your infrastructure for LPU deployment. It is to understand the architecture so you can evaluate it clearly when early access opens and pricing becomes public.

Impact on Inference Costs

LPU pricing is not yet public. NVIDIA has not announced per-hour or per-rack pricing for the Groq 3 LPX as of April 2026. The cost analysis here uses current GPU pricing and NVIDIA's published throughput claims to estimate relative economics.

The key claim to evaluate is 35x more inference throughput per megawatt. If that figure holds at production model sizes and real workload patterns, the LPU should be able to generate tokens at a fraction of the energy cost of GPU inference. For data center operators paying for power, that translates to lower operating cost per token, even if the hardware acquisition cost is higher.

For reference, here is current on-demand GPU pricing for production inference:

| GPU | On-Demand $/hr | Spot $/hr | Est. tokens/sec (70B FP8) | Est. CPM ($) |
| --- | --- | --- | --- | --- |
| H100 SXM5 | from $2.90 | from $0.80 | ~3,066 | ~$0.263 |
| H200 SXM5 | from $4.54 | N/A | ~4,374 | ~$0.289 |
| B200 SXM6 | from $7.43 | from $1.71 | ~6,972 (FP8) / ~12,841 (FP4) | ~$0.296 (FP8) / ~$0.161 (FP4) |
| A100 80G SXM4 | from $1.64 | from $0.45 | ~1,400 | ~$0.326 |

Pricing fluctuates based on GPU availability. The prices above are as of 07 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

CPM is calculated as ($/hr) / (tokens/sec × 3600) × 1,000,000. Throughput estimates are per-GPU figures based on MLPerf Inference results and standard vLLM benchmarks for Llama 3.1 70B. Actual throughput varies with sequence length, batch size, and serving framework.
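
That CPM formula is easy to turn into a helper for your own pricing comparisons:

```python
# Helper reproducing the CPM footnote: cost per million generated tokens.

def cpm(usd_per_hr: float, tokens_per_sec: float) -> float:
    """($/hr) / (tokens/sec * 3600) * 1,000,000."""
    return usd_per_hr / (tokens_per_sec * 3600) * 1_000_000

print(f"H100 at $2.90/hr, 3,066 tok/s: ${cpm(2.90, 3066):.3f} per 1M tokens")
print(f"B200 FP4 at $7.43/hr, 12,841 tok/s: ${cpm(7.43, 12841):.3f} per 1M tokens")
```

Plug in whatever LPU rack pricing eventually ships and the same one-liner gives you the apples-to-apples number.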

The LPU's value proposition is not lower cost-per-GPU-hour, but lower cost-per-token due to higher tokens-per-watt. Whether that translates to competitive cloud pricing depends on how NVIDIA structures LPX rack pricing, power overhead at the facility level, and the efficiency gains at the rack level after accounting for cooling and interconnect.

For teams evaluating inference cost reduction today, the practical playbook is to optimize GPU inference first: FP8 quantization on H100 or B200, continuous batching via vLLM or SGLang, and spot instances for batch workloads. These optimizations are available now and produce 2-4x cost reductions without requiring any new hardware. See AI Inference Cost Economics in 2026 for the full four-layer optimization framework.

Availability Timeline

As of April 2026, the Groq 3 LPU is in early-access preview. Access requires registration through NVIDIA's enterprise program. Broad cloud availability is expected in late 2026 as NVIDIA and its partners build out LPX rack deployments.

The timeline will compress if demand is high and NVIDIA can ramp production of the chip. It will extend if there are supply chain issues, software integration complexity at the framework level, or if cloud providers need time to build out the power infrastructure for LPX racks.

Current-generation GPUs (H100, H200, B200) remain the production standard for inference through at least mid-to-late 2026. The B200, which became widely available in early 2026, is still the best option for production inference deployments today. For teams that need to make infrastructure decisions now, the right move is to deploy on B200 or H100 and plan to evaluate LPU access when early pricing and availability data becomes public.

For the full B200 specification, benchmark data, and cloud pricing comparison, see the NVIDIA B200 Complete Guide.

How to Prepare Your Inference Stack

Step 1: Identify LPU-fit workloads

LPUs work best for workloads that are:

  • Purely autoregressive generation (standard chat completion, code generation, long-form writing)
  • Predictable sequence lengths (you know roughly how long outputs will be)
  • Steady, sustained request rates (not bursty)
  • Dense model architectures (transformer decoder-only models, not MoE)

Poor LPU candidates:

  • Prefill-heavy RAG pipelines (long documents in context, short outputs)
  • Multimodal models (vision, audio, video inputs)
  • Training and fine-tuning jobs
  • Irregular batch patterns with high peak-to-average ratios
  • MoE models (sparse activations do not map efficiently to the LPU's fixed data-flow)

Step 2: Audit your serving stack for disaggregation readiness

GPU-LPU hybrid deployments require the serving framework to separate the prefill and decode phases. Check your current setup:

vLLM: Disaggregated prefill-decode support is available in vLLM v0.6+. The --kv-transfer-config flag configures KV cache transfer between prefill and decode workers running in separate process groups. This works with multi-node setups where prefill runs on one node and decode on another.
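
For reference, a two-worker launch along the lines of vLLM's experimental disaggregated-prefill example might look like this. The exact JSON fields and connector names vary across vLLM versions, and the model name is just an example; treat this as a sketch to adapt, not copy.

```shell
# Prefill worker (KV producer) on the first node:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --port 8100 \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# Decode worker (KV consumer) on the second node:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --port 8200 \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```

A proxy or router in front of the two workers then sends each request to the prefill worker first and streams the completion from the decode worker.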

SGLang: SGLang supports disaggregated inference through its radix attention and chunked prefill features. The configuration requires setting separate prefill and decode worker counts.

For step-by-step guides on running vLLM or SGLang on Spheron GPUs, see the Spheron LLM deployment docs.

NVIDIA Dynamo: Designed from the ground up for disaggregated inference. Dynamo routes prefill requests to GPU workers and decode requests to the decode tier (GPU or LPU). It is the serving framework most directly aligned with the LPU deployment model. See the Dynamo guide for configuration details.

If your serving stack does not support disaggregated inference, now is a reasonable time to evaluate migration. Disaggregated inference improves GPU utilization even without LPUs, because prefill and decode can be scaled independently. See hybrid cloud-edge AI inference patterns for architectural guidance on multi-tier inference serving.

Step 3: Establish a tokens/watt baseline on current GPUs

When LPU pricing becomes public, the primary comparison metric will be cost-per-token and tokens-per-watt. Establish your current baseline now so you have concrete numbers for the comparison.

On your production setup, measure:

  • Tokens generated per second (decode throughput)
  • GPU power draw during inference (via nvidia-smi dmon or DCGM)
  • Tokens per watt (throughput / power draw)
  • Cost per million tokens (GPU $/hr / (tokens/sec × 3600) × 1M)
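
A minimal measurement sketch for the power half of that baseline, assuming `nvidia-smi` is available on the host; the throughput number must come from your own serving benchmark.

```python
# Sample GPU board power via nvidia-smi and combine it with measured decode
# throughput to get tokens/watt.
import subprocess

def gpu_power_watts(gpu_index: int = 0) -> float:
    """Instantaneous board power draw in watts for one GPU."""
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=power.draw", "--format=csv,noheader,nounits",
    ])
    return float(out.decode().strip())

def tokens_per_watt(tokens_per_sec: float, avg_power_w: float) -> float:
    return tokens_per_sec / avg_power_w

# With the H100 figures used earlier in this post (3,000 tok/s at 700 W):
print(f"{tokens_per_watt(3000, 700):.1f} tokens/W")  # ~4.3
```

For a sustained-load measurement, sample `gpu_power_watts` every few seconds during a benchmark run and average; DCGM gives the same counters with less overhead if you already run it.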

With these numbers in hand, you can evaluate any LPU pricing announcement against your actual current costs rather than against theoretical benchmarks.

Step 4: Avoid premature redesign

The LPU is not yet generally available. Its pricing, software maturity, and framework compatibility are not public. Redesigning your inference infrastructure for LPU deployment today is premature.

The right preparation is understanding the architecture, auditing your workloads for LPU fit, and ensuring your serving framework can eventually support disaggregated inference. None of that requires changing your current production setup.

For teams that are mid-deployment cycle and evaluating H100 vs B200 for a new production system, the calculus has not changed. B200 with FP4 delivers the best available tokens-per-dollar for dense 70B inference today. The LPU is relevant to infrastructure decisions you will make in late 2026 or 2027, not the ones you are making this quarter.

Where the LPU Fits in the 2026 AI Hardware Stack

The announcement creates a four-tier hardware picture for AI inference:

Tier 1: LPUs (decode specialists). High tokens-per-watt, deterministic latency, narrow workload fit. Best for steady-state decode of dense transformer models. Not available for general production use as of April 2026.

Tier 2: B200 / B300 (maximum throughput GPUs). FP4 inference, 192-288 GB VRAM, best throughput-per-dollar for 70B+ models. Right for teams with high volume and models that fit in VRAM.

Tier 3: H100 / H200 (production workhorses). Wide availability, mature software stack, strong cost-per-token on 13B-70B models. The safe default for most production deployments today.

Tier 4: L40S (cost-efficient small models). 48 GB GDDR6, lower per-hour cost, best for 7B-13B at moderate concurrency. The right choice for cost-sensitive endpoints on smaller models.

The LPU sits above the GPU tiers for one dimension only: tokens per watt on decode. For every other dimension (training, fine-tuning, multimodal, flexibility), it is below. A mature inference fleet in late 2026 will likely use all four tiers, routing workloads to the hardware best suited for each task.

For a comparison of how the Hopper and Blackwell GPU generations relate to each other and the upcoming Rubin architecture, see Rubin vs Blackwell vs Hopper.


The LPU changes the efficiency floor for high-volume LLM inference. While you wait for LPU availability and pricing, the best infrastructure decision is to optimize your current GPU inference stack. Spheron provides H100, H200, B200, and B300 GPU rental at transparent per-hour rates with no egress fees or long-term commitments.

Rent H100 → | Rent B200 → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.