
Cerebras vs NVIDIA H100: Wafer-Scale vs GPU for LLM Inference (2026 Decision Guide)

Written by Mitrasish, Co-founder · Apr 28, 2026

Cerebras published 2,100 tokens per second for Llama 3.1 70B on the WSE-3. That number is real. It is also measured at batch size 1, on a full-system wafer-scale unit that draws roughly 23 kW. Context matters here. The WSE-3 is not a GPU you add to a cluster. It is a complete inference system with a very specific performance profile, and whether it is the right choice for your workload depends almost entirely on how you submit requests.

This guide compares the WSE-3 and the NVIDIA H100 SXM5 directly: architecture, Llama 70B benchmarks at realistic batch sizes, cost per million tokens, and a decision framework for when each wins. For context on how other non-GPU inference chips approach similar architectural tradeoffs, see the Groq LPU vs NVIDIA breakdown.

Cerebras WSE-3 Architecture

The Wafer Scale Engine 3 (WSE-3) is what the name describes: a single silicon wafer that functions as one chip. It contains 900,000 AI cores and 44 GB of on-die SRAM, all connected by an on-chip fabric at 21 PB/s bandwidth. There is no HBM. There are no off-chip memory stacks. Every parameter the chip needs during inference lives in SRAM that is physically part of the die.

This architecture eliminates the HBM memory wall that limits GPU inference throughput at low batch sizes. On a GPU, each decode step requires loading all model weights and the KV cache from HBM. At 3.35 TB/s (H100 SXM5), reading the 70 GB of weights for a 70B FP8 model takes roughly 21 ms per decode step. On the WSE-3, those same weights sit in on-die SRAM accessible at 21 PB/s, reducing the memory access time by roughly 6,000x.
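As a sanity check on that arithmetic, here is a minimal sketch using the figures above. It ignores KV-cache reads and the fact that a 70B model does not actually fit in the WSE-3's 44 GB of SRAM, so treat it as back-of-envelope only.

```python
# Back-of-envelope decode-step math from the figures above (not a benchmark).
# Assumes a 70B-parameter model at FP8 (1 byte per parameter) and that every
# decode step must re-read all weights from memory -- the batch-1, memory-bound regime.

WEIGHT_BYTES = 70e9       # ~70 GB of FP8 weights
H100_HBM_BW = 3.35e12     # bytes/s, H100 SXM5 HBM3
WSE3_SRAM_BW = 21e15      # bytes/s, WSE-3 on-die fabric

h100_step_ms = WEIGHT_BYTES / H100_HBM_BW * 1e3   # ~20.9 ms per decode step
wse3_step_ms = WEIGHT_BYTES / WSE3_SRAM_BW * 1e3  # ~0.003 ms per decode step

print(f"H100 weight read per step:  {h100_step_ms:.1f} ms")
print(f"WSE-3 weight read per step: {wse3_step_ms:.4f} ms")
print(f"Raw bandwidth ratio: ~{WSE3_SRAM_BW / H100_HBM_BW:,.0f}x")
```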

That bandwidth advantage is why the batch-1 throughput numbers are so high. When batch size is 1, inference is almost entirely memory-bandwidth-bound. The WSE-3 has an enormous bandwidth edge at that operating point.

The catch: the WSE-3 holds only 44 GB of SRAM. A 70B model at FP8 needs roughly 70 GB. Cerebras manages this by quantizing to a lower precision or by partitioning across multiple WSE-3 systems. The WSE-3 is a complete infrastructure system, not a single accelerator you drop into a rack alongside other hardware.

| Spec | Cerebras WSE-3 | NVIDIA H100 SXM5 |
| --- | --- | --- |
| Architecture | Wafer-scale | Hopper |
| Compute cores | 900,000 AI cores | 16,896 CUDA + 528 Tensor Cores |
| Memory type | On-die SRAM | HBM3 |
| Memory capacity | 44 GB | 80 GB |
| Memory bandwidth | 21 PB/s (on-die) | 3.35 TB/s |
| TDP | ~23 kW (full system) | 700 W |
| FP16 throughput | ~125 PFLOPS (sparse) | 2.0 PFLOPS |
| FP8 throughput | Not natively supported | 3.96 PFLOPS |
| Form factor | Full system unit | Single GPU card |

Note that comparing WSE-3's 21 PB/s directly to H100's 3.35 TB/s is not a like-for-like comparison. WSE-3 uses on-die SRAM; H100 uses off-chip HBM. The architectures are fundamentally different. On-die SRAM provides lower latency and higher bandwidth but at much smaller capacity. The effective throughput difference at the model level is smaller than the raw bandwidth ratio suggests because the WSE-3 cannot hold a full 70B model in its 44 GB SRAM without precision reduction.

NVIDIA H100 SXM5 Architecture Refresher

The H100 SXM5 is the Hopper generation GPU with 80 GB HBM3, 3.35 TB/s memory bandwidth, and 3.96 PFLOPS of FP8 compute via the Transformer Engine. It connects to neighboring GPUs over NVLink 4 at 900 GB/s bidirectional bandwidth, supports MIG for up to 7 isolated instances, and uses FP8 Transformer Engine to double throughput over FP16 at minimal quality cost.

The H100's 80 GB of HBM3 matters for model fit. A 70B model in FP8 uses roughly 70 GB, leaving about 10 GB for KV cache. That is tight for long-context workloads but sufficient for most production chat endpoints at standard context lengths. For the full H100 vs H200 specification comparison, see our H100 vs H200 comparison.
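To make the "tight for long context" point concrete, here is a rough KV-cache budget sketch. It assumes Llama 3.1 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128) and an FP8 KV cache; the ~10 GB of free HBM is an approximation, and real deployments lose additional memory to activations and framework overhead.

```python
# Rough KV-cache budget for Llama 3.1 70B on a single 80 GB H100 (illustrative only).
# Assumes the published model shape: 80 layers, 8 KV heads (GQA), head dim 128,
# and an FP8 (1 byte per element) KV cache.

HBM_GB = 80
WEIGHTS_GB = 70       # 70B params at FP8
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_ELEM = 1 # FP8

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES_PER_ELEM  # K and V
free_bytes = (HBM_GB - WEIGHTS_GB) * 1e9

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Cacheable tokens across all concurrent requests: {free_bytes / kv_bytes_per_token:,.0f}")
# ~160 KiB/token -> roughly 60k tokens of total context shared across the batch.
```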

Unlike the WSE-3, the H100 is a single GPU card. You can run it standalone for a 70B endpoint, pair two via NVLink for tensor-parallel serving, or scale to 8-GPU nodes for larger models or higher throughput. The software ecosystem (vLLM, SGLang, TensorRT-LLM) is mature and production-ready on H100.
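As a rough illustration of the tensor-parallel option, the sketch below uses vLLM's offline Python API to serve Llama 3.1 70B in FP8 across two NVLink-connected H100s. The model ID, context cap, and sampling values are placeholders; verify the quantization options against the vLLM version you deploy.

```python
# Minimal vLLM sketch: Llama 3.1 70B split across two H100s with FP8 weights.
# Values are illustrative; check your vLLM version's docs before relying on them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumes HF access to this repo
    tensor_parallel_size=2,                      # shard weights across 2 GPUs over NVLink
    quantization="fp8",                          # FP8 weights to fit in 2x80 GB comfortably
    max_model_len=8192,                          # cap context to bound KV-cache memory
)

outputs = llm.generate(
    ["Explain the difference between SRAM and HBM in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```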

Inference Benchmarks: Llama 70B Performance and Model Support

These numbers are drawn from Cerebras's published benchmarks for WSE-3 and community vLLM benchmarks for H100 SXM5 with FP8. The Cerebras figures are system-level (one WSE-3 unit). The H100 figures are per-GPU from standard vLLM deployments. They are not perfectly comparable because the WSE-3 is a full system and the H100 is a single card, but the comparison is the relevant one for teams evaluating cost per token.

Llama 3.1 70B inference performance:

| Metric | Cerebras WSE-3 | H100 SXM5 (vLLM, FP8) | Winner |
| --- | --- | --- | --- |
| Throughput at batch 1 | ~2,100 tok/s | ~550 tok/s | Cerebras |
| Throughput at batch 8 | ~2,100 tok/s | ~3,200 tok/s | H100 |
| Throughput at batch 32 | ~2,100 tok/s | ~8,000 tok/s | H100 |
| Throughput at batch 64 | ~2,100 tok/s | ~12,000 tok/s | H100 |
| TTFT at batch 1 | ~50 ms | ~80 ms | Cerebras |
| TTFT at batch 8 | ~50 ms | ~200 ms | Cerebras |
| TTFT at batch 32 | ~50 ms | ~400 ms | Cerebras |
| TTFT at batch 64 | ~50 ms | ~700 ms | Cerebras |

Cerebras figures are from published WSE-3 benchmark disclosures, updated with the October 2024 throughput announcement (Cerebras Inference source). H100 figures are per-GPU estimates from community vLLM benchmarks on SXM5 hardware with FP8 precision and continuous batching. Run your own benchmarks before production decisions.

The throughput pattern is the key insight. Cerebras delivers a roughly constant ~2,100 tokens/sec regardless of batch size because the on-die SRAM bandwidth saturates a single inference pipeline but does not parallelize across many concurrent requests the way GPU serving does. H100 climbs from ~550 tokens/sec at batch 1 to ~12,000 tokens/sec at batch 64 because each decode step's weight read from HBM is shared by every sequence in the batch: the fixed memory cost is amortized across more tokens, so throughput keeps rising until compute becomes the bottleneck.
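A small illustration of that amortization effect, using the ~21 ms weight-read figure from earlier; it is not a throughput model, just the per-token share of the fixed weight read at each batch size.

```python
# Weight-read amortization across a batch (illustrative, not a throughput prediction).
# In the memory-bound decode regime, the weight read from HBM happens once per step
# and is shared by every sequence in the batch.
WEIGHT_READ_MS = 70e9 / 3.35e12 * 1e3  # ~20.9 ms per step for 70 GB of FP8 weights

for batch in (1, 8, 32, 64):
    per_token_ms = WEIGHT_READ_MS / batch
    print(f"batch {batch:>2}: {per_token_ms:5.2f} ms of weight-read time per generated token")
```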

The TTFT picture is different. Cerebras wins at every batch size because SRAM access latency is fundamentally lower than HBM latency. For use cases where time-to-first-token is the primary metric (real-time chat, agent loops), Cerebras has a consistent advantage that does not erode with batch scaling.

Model support matrix:

| Model / Family | Cerebras WSE-3 | H100 SXM5 (vLLM) | Winner |
| --- | --- | --- | --- |
| Llama 3.x family | Supported | Supported | Tie |
| Qwen 3-32B | Supported | Supported | Tie |
| Qwen 3-235B Instruct / Thinking | Supported | Supported | Tie |
| GPT-OSS-120B | Supported | Supported | Tie |
| ZAI GLM-4.7 | Supported | Supported | Tie |
| DeepSeek V4 | Not supported | Supported | H100 |
| Qwen 2.5 | Not supported | Supported | H100 |
| Gemma 3 | Not supported | Supported | H100 |
| Mistral | Not supported | Supported | H100 |
| Multimodal models | Not supported | Supported | H100 |

Cerebras's catalog has expanded beyond Llama to include Qwen 3, GPT-OSS-120B, and GLM-4.7. If your production stack uses DeepSeek, Gemma, Mistral, or any multimodal model, Cerebras remains off the table. Verify the current model support list at the Cerebras documentation before committing.

Where Cerebras Wins

Cerebras has a genuine performance edge in specific scenarios:

Real-time single-user chat. At batch 1, WSE-3 delivers nearly 4x the tokens/sec of a single H100 (~2,100 vs ~550). For a chatbot where each session is its own isolated request stream, that translates directly to faster response generation for the end user.

Agent loops with strict TTFT requirements. TTFT stays flat at ~50 ms across all batch sizes on the WSE-3. For multi-step agentic pipelines where the agent issues back-to-back requests and total wall-clock time matters, a 50 ms TTFT versus an 80-400 ms TTFT compounds quickly: across a 20-step loop, that is roughly 1 second of accumulated TTFT versus 1.6 to 8 seconds.

Llama-family-only workloads at low concurrency. If your production stack is exclusively Llama 3.1 or 3.3 models and your request rate is low enough that you never exceed batch 8-16, WSE-3 can deliver higher throughput at lower TTFT from a single system.

Managed API preference. Cerebras Inference API handles all infrastructure. You send requests, you pay per token. No provisioning, no CUDA driver management, no framework configuration. For teams that want to avoid GPU operations entirely, the managed API path is simpler.

Where H100 on Spheron Wins

At batch sizes above 8, H100 throughput surpasses the WSE-3 and keeps scaling. The practical scenarios where H100 wins clearly:

Batch throughput at concurrency above 8. Once you have more than 8 concurrent users, H100 with vLLM continuous batching generates more tokens per second than the WSE-3 from a single GPU. That advantage compounds as batch size grows.

Fine-tuning. The Cerebras Inference API does not support fine-tuning. If you need to adapt a base model to your domain, you need GPU access. Spheron H100 instances support full fine-tuning workflows with Axolotl, Unsloth, Hugging Face Transformers, and any other standard framework.

Model variety. Cerebras's inference catalog has grown beyond Llama and now includes Qwen 3 and a handful of other models. But DeepSeek, Gemma, Mistral, Phi, and all multimodal models remain unsupported. H100 supports every major open-weight model through standard serving frameworks. If your roadmap includes a broad set of models, GPU cloud gives you far more flexibility.

Multi-tenant serving. H100 supports MIG, allowing up to 7 isolated GPU instances from a single card. For platforms serving multiple customers or models simultaneously, MIG provides resource isolation that the WSE-3's single-pipeline architecture cannot replicate.

Full control and no lock-in. H100 instances give you root access, a standard CUDA environment, and choice of inference framework. Cerebras locks you into their API and SDK. If Cerebras changes pricing, deprecates model support, or has an outage, you have no alternative runtime. GPU cloud instances are fully portable.

Regional availability. Cerebras Inference API is US-only as of 2026. Spheron B200 and H100 instances are available through data center partners across multiple regions, which matters for latency-sensitive applications and data residency requirements.

Cost Analysis: Cerebras Inference API vs H100 on Spheron

Cerebras Inference API pricing for Llama 70B (as of early 2026): approximately $0.60 per million input tokens and $0.60 per million output tokens. This per-token rate applies regardless of your batch size.

H100 SXM5 pricing on Spheron: $2.904/hr on-demand (derived from live API as of 28 Apr 2026). Spot pricing on H100 SXM5 runs $0.80/hr per GPU when available, a roughly 72% discount to on-demand.

To compare apples to apples, the relevant metric is cost per million output tokens at your actual operating batch size. Formula:

CPM = ($/hr) / (tokens/sec × 3,600) × 1,000,000
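In code, the same formula looks like this (values taken from the tables below; swap in your own measured throughput and hourly rates):

```python
# Cost per million output tokens (CPM) for a dedicated instance at a sustained decode rate.
def cpm(dollars_per_hour: float, tokens_per_sec: float) -> float:
    return dollars_per_hour / (tokens_per_sec * 3_600) * 1_000_000

# Figures from the tables below: on-demand $2.904/hr, spot $0.80/hr.
print(f"Batch 1, on-demand:  ${cpm(2.904, 550):.2f}/M")    # ~$1.47
print(f"Batch 8, on-demand:  ${cpm(2.904, 3200):.3f}/M")   # ~$0.252
print(f"Batch 32, spot:      ${cpm(0.80, 8000):.3f}/M")    # ~$0.028
```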

H100 SXM5 cost per million output tokens (Llama 70B FP8, vLLM):

| Batch size | Throughput (tok/s) | $/hr | CPM ($) |
| --- | --- | --- | --- |
| Batch 1 (on-demand) | ~550 | $2.904 | ~$1.47 |
| Batch 8 (on-demand) | ~3,200 | $2.904 | ~$0.252 |
| Batch 32 (on-demand) | ~8,000 | $2.904 | ~$0.101 |
| Batch 1 (spot) | ~550 | $0.80 | ~$0.40 |
| Batch 8 (spot) | ~3,200 | $0.80 | ~$0.069 |
| Batch 32 (spot) | ~8,000 | $0.80 | ~$0.028 |

Cerebras Inference API (Llama 70B):

| Batch size | CPM ($) |
| --- | --- |
| Any batch | ~$0.60 |

At batch 1, Cerebras at $0.60/M is significantly cheaper than H100 on-demand at $1.47/M. At batch 8, the positions swap: H100 at $0.252/M costs less than half the Cerebras rate. At batch 32, H100 is roughly 6x cheaper per million tokens than Cerebras.

Spot pricing changes the picture substantially. At spot pricing, H100 is cheaper than Cerebras even at batch 1 (~$0.40/M vs $0.60/M). At batch 8, H100 spot drops to ~$0.069/M, roughly 8x cheaper than Cerebras. At batch 32, ~$0.028/M, over 21x cheaper.

The break-even batch size on on-demand pricing is somewhere between 2 and 8 concurrent requests. If your median batch size is consistently above that range, H100 on Spheron wins on cost per token. Below that range, Cerebras is the cheaper option for output tokens specifically. If spot instances are available for your workload, the economics favor H100 even at lower batch sizes.
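A quick way to locate that break-even point is to invert the CPM formula and solve for the sustained throughput at which H100 matches Cerebras's ~$0.60/M rate; the result falls between the batch-1 and batch-8 throughput figures in the benchmark table, which is where the 2-8 range comes from.

```python
# Break-even throughput: the sustained tok/s at which H100 matches ~$0.60 per million
# output tokens, using the CPM formula above. Inputs are the rates quoted in this section.
CEREBRAS_CPM = 0.60      # $/M output tokens
H100_ON_DEMAND = 2.904   # $/hr
H100_SPOT = 0.80         # $/hr

breakeven_on_demand = H100_ON_DEMAND * 1_000_000 / (CEREBRAS_CPM * 3_600)
breakeven_spot = H100_SPOT * 1_000_000 / (CEREBRAS_CPM * 3_600)
print(f"On-demand break-even: ~{breakeven_on_demand:,.0f} tok/s sustained")  # ~1,344 tok/s
print(f"Spot break-even:      ~{breakeven_spot:,.0f} tok/s sustained")       # ~370 tok/s
```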

For detailed cost-per-token methodology across GPU types, see the cost-per-token benchmark for a full breakdown including B200 and H200.

Pricing fluctuates based on GPU availability. The prices above reflect rates as of 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Migration Considerations

Model support matrix. Cerebras Inference API supports Llama 3.1 8B, Llama 3.1 70B, Llama 3.3 70B, Qwen 3-32B, Qwen 3-235B Instruct, Qwen 3-235B Thinking, GPT-OSS-120B, and ZAI GLM-4.7. It does not support DeepSeek, Gemma, Mistral, multimodal models, or most other open-weight models at time of writing. H100 via vLLM, SGLang, or TensorRT-LLM supports all major open-weight models. Verify the current Cerebras model list before making a migration decision, as the catalog is actively expanding.

Software stack. Cerebras uses a proprietary SDK. You cannot point a vLLM server at WSE-3 hardware. Any migration from GPU-based vLLM deployment to Cerebras requires rewriting your inference client. The reverse migration (back to GPU) requires rewriting again. H100 deployments use standard frameworks and configuration formats that transfer across GPU providers. See the vLLM deployment guide on Spheron for a step-by-step setup that runs on any H100 instance.
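For reference, this is roughly what the portable GPU-side client looks like: vLLM (and SGLang) expose an OpenAI-compatible endpoint, so the same client code moves with you across GPU providers hosting the same model. The endpoint URL and model name below are placeholders.

```python
# Minimal OpenAI-compatible client against a self-hosted vLLM endpoint.
# base_url and model are hypothetical placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-h100-instance:8000/v1",   # hypothetical vLLM server address
    api_key="not-needed-for-a-private-endpoint",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the HBM memory wall in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```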

Lock-in risk. Cerebras Inference API is a managed service. You have no access to the underlying hardware, no ability to adjust inference parameters beyond what the API exposes, and no fallback runtime. GPU cloud instances give you full root access and are portable across providers that run standard CUDA hardware.

Regional availability. Cerebras Inference API is available in US regions only. Spheron has data center partners across multiple regions globally, which matters for applications with latency requirements tied to user geography or data residency regulations.

Fine-tuning path. If you need to fine-tune at any point, Cerebras Inference API cannot do it. You would need a separate Cerebras cluster engagement, which is not available as a self-service on-demand purchase. H100 on GPU cloud supports fine-tuning with any standard framework. See our fine-tuning guide for workflows using Axolotl and Unsloth on Spheron H100 instances.

The Decision Framework

| Scenario | Choose Cerebras | Choose Spheron H100 |
| --- | --- | --- |
| Single-user chatbot, Llama 3.x only | Yes | |
| Real-time agent (TTFT under 100 ms, batch 1) | Yes | |
| Multi-tenant API, batch above 8 | | Yes |
| Fine-tuning required | | Yes |
| Qwen 3-32B or 3-235B | Yes | |
| DeepSeek / Qwen 2.5 / Gemma / Mistral | | Yes |
| Multimodal models | | Yes |
| Non-US region required | | Yes |
| Cost per million tokens at scale | Batch 1-4 | Batch 8+ |
| Full control, no vendor lock-in | | Yes |
| Framework portability (vLLM/SGLang) | | Yes |

The core tradeoff is batch size versus latency. Cerebras wins on TTFT at any batch size and on throughput at batch 1-4. H100 wins on throughput at batch 8+, cost per token at scale, model variety, and operational flexibility.

For most production inference deployments, the batch profile determines the answer. Real-time chat APIs with isolated user sessions benefit from Cerebras's TTFT advantage. Multi-user APIs, batch processing pipelines, and any workload requiring fine-tuning or non-Llama models belong on GPU cloud.

For teams evaluating all GPU options before committing, the GPU inference selection guide covers L40S, H100, H200, and B200 with cost-per-token data at realistic batch sizes.


Running LLM inference at scale? Spheron H100 and B200 instances offer per-minute billing, full root access, and support for every major open-weight model, with no lock-in to a single chip vendor.

Rent H100 on Spheron → | Rent B200 → | View all GPU pricing →
