Cerebras published 2,100 tokens per second for Llama 3.1 70B on the WSE-3. That number is real. It is also measured at batch size 1, on a full-system wafer-scale unit that draws roughly 23 kW. Context matters here. The WSE-3 is not a GPU you add to a cluster. It is a complete inference system with a very specific performance profile, and whether it is the right choice for your workload depends almost entirely on how you submit requests.
This guide compares the WSE-3 and the NVIDIA H100 SXM5 directly: architecture, Llama 70B benchmarks at realistic batch sizes, cost per million tokens, and a decision framework for when each wins. For context on how other non-GPU inference chips approach similar architectural tradeoffs, see the Groq LPU breakdown.
Cerebras WSE-3 Architecture
The Wafer Scale Engine 3 (WSE-3) is what the name describes: a single silicon wafer that functions as one chip. It contains 900,000 AI cores and 44 GB of on-die SRAM, all connected by an on-chip fabric at 21 PB/s bandwidth. There is no HBM. There are no off-chip memory stacks. Every parameter the chip needs during inference lives in SRAM that is physically part of the die.
This architecture eliminates the HBM memory wall that limits GPU inference throughput at low batch sizes. On a GPU, each decode step requires loading all model weights and the KV cache from HBM. At 3.35 TB/s (H100 SXM5), reading the 70 GB of weights for a 70B FP8 model takes roughly 21 ms per decode step. On the WSE-3, those same weights sit in on-die SRAM accessible at 21 PB/s, reducing the memory access time by roughly 6,000x.
That bandwidth advantage is why the batch-1 throughput numbers are so high. When batch size is 1, inference is almost entirely memory-bandwidth-bound. The WSE-3 has an enormous bandwidth edge at that operating point.
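The decode-step arithmetic above can be sketched directly. This is a back-of-the-envelope memory-bound floor only (real decode steps also read the KV cache and pay kernel launch overhead), using the bandwidth and weight figures from this section:

```python
# Back-of-the-envelope decode-step floor for a 70B model at FP8 (1 byte/param).
# Illustrative only: ignores KV-cache reads and kernel overhead.
WEIGHT_BYTES = 70e9  # ~70 GB of weights for Llama 70B at FP8

def decode_step_ms(bandwidth_bytes_per_s: float) -> float:
    """Time to stream all weights once: the floor for one decode step."""
    return WEIGHT_BYTES / bandwidth_bytes_per_s * 1e3

h100_hbm3 = 3.35e12  # H100 SXM5 HBM3: 3.35 TB/s
wse3_sram = 21e15    # WSE-3 on-die SRAM fabric: 21 PB/s

print(f"H100:  {decode_step_ms(h100_hbm3):.1f} ms per decode step")        # ~20.9 ms
print(f"WSE-3: {decode_step_ms(wse3_sram) * 1e3:.1f} us per decode step")  # ~3.3 us
print(f"Bandwidth ratio: ~{wse3_sram / h100_hbm3:,.0f}x")                  # ~6,269x
```

The ~21 ms figure is where the "roughly 6,000x" claim comes from: the ratio is pure bandwidth, before any system-level effects shrink it.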
The catch: the WSE-3 holds only 44 GB of SRAM. A 70B model at FP8 needs roughly 70 GB. Cerebras manages this by quantizing to a lower precision or by partitioning across multiple WSE-3 systems. The WSE-3 is a complete infrastructure system, not a single accelerator you drop into a rack alongside other hardware.
| Spec | Cerebras WSE-3 | NVIDIA H100 SXM5 |
|---|---|---|
| Architecture | Wafer-Scale | Hopper |
| Compute cores | 900,000 AI cores | 16,896 CUDA + 528 Tensor Cores |
| Memory type | On-die SRAM | HBM3 |
| Memory capacity | 44 GB | 80 GB |
| Memory bandwidth | 21 PB/s (on-die) | 3.35 TB/s |
| TDP | ~23 kW (full system) | 700 W |
| FP16 throughput | ~125 PFLOPS (sparse) | 2.0 PFLOPS |
| FP8 throughput | Not natively supported | 3.96 PFLOPS |
| Form factor | Full system unit | Single GPU card |
Note that comparing WSE-3's 21 PB/s directly to H100's 3.35 TB/s is not a like-for-like comparison. WSE-3 uses on-die SRAM; H100 uses off-chip HBM. The architectures are fundamentally different. On-die SRAM provides lower latency and higher bandwidth but at much smaller capacity. The effective throughput difference at the model level is smaller than the raw bandwidth ratio suggests because the WSE-3 cannot hold a full 70B model in its 44 GB SRAM without precision reduction.
NVIDIA H100 SXM5 Architecture Refresher
The H100 SXM5 is the Hopper generation GPU with 80 GB HBM3, 3.35 TB/s memory bandwidth, and 3.96 PFLOPS of FP8 compute via the Transformer Engine. It connects to neighboring GPUs over NVLink 4 at 900 GB/s bidirectional bandwidth, supports MIG for up to 7 isolated instances, and uses FP8 Transformer Engine to double throughput over FP16 at minimal quality cost.
The H100's 80 GB of HBM3 matters for model fit. A 70B model in FP8 uses roughly 70 GB, leaving about 10 GB for KV cache. That is tight for long-context workloads but sufficient for most production chat endpoints at standard context lengths. For the full H100 vs H200 specification comparison, see our H100 vs H200 comparison.
Unlike the WSE-3, the H100 is a single GPU card. You can run it standalone for a 70B endpoint, pair two via NVLink for tensor-parallel serving, or scale to 8-GPU nodes for larger models or higher throughput. The software ecosystem (vLLM, SGLang, TensorRT-LLM) is mature and production-ready on H100.
Inference Benchmarks: Llama 70B Performance and Model Support
These numbers are drawn from Cerebras's published benchmarks for WSE-3 and community vLLM benchmarks for H100 SXM5 with FP8. The Cerebras figures are system-level (one WSE-3 unit). The H100 figures are per-GPU from standard vLLM deployments. They are not perfectly comparable because the WSE-3 is a full system and the H100 is a single card, but the comparison is the relevant one for teams evaluating cost per token.
Llama 3.1 70B inference performance:
| Metric | Cerebras WSE-3 | H100 SXM5 (vLLM, FP8) | Winner |
|---|---|---|---|
| Throughput at batch 1 | ~2,100 tok/s | ~550 tok/s | Cerebras |
| Throughput at batch 8 | ~2,100 tok/s | ~3,200 tok/s | H100 |
| Throughput at batch 32 | ~2,100 tok/s | ~8,000 tok/s | H100 |
| Throughput at batch 64 | ~2,100 tok/s | ~12,000 tok/s | H100 |
| TTFT at batch 1 | ~50 ms | ~80 ms | Cerebras |
| TTFT at batch 8 | ~50 ms | ~200 ms | Cerebras |
| TTFT at batch 32 | ~50 ms | ~400 ms | Cerebras |
| TTFT at batch 64 | ~50 ms | ~700 ms | Cerebras |
Cerebras figures are from published WSE-3 benchmark disclosures, updated with the October 2024 throughput announcement (Cerebras Inference source). H100 figures are per-GPU estimates from community vLLM benchmarks on SXM5 hardware with FP8 precision and continuous batching. Run your own benchmarks before production decisions.
The throughput pattern is the key insight. Cerebras delivers a fixed ~2,100 tokens/sec regardless of batch size because the on-die SRAM bandwidth saturates a single inference pipeline but cannot parallelize across many concurrent requests the way GPU HBM can. H100 throughput climbs from ~550 tokens/sec at batch 1 to ~12,000 tokens/sec at batch 64 (sub-linear but substantial scaling) because HBM can serve weight reads in parallel across batched token positions.
The TTFT picture is different. Cerebras wins at every batch size because SRAM access latency is fundamentally lower than HBM latency. For use cases where time-to-first-token is the primary metric (real-time chat, agent loops), Cerebras has a consistent advantage that does not erode with batch scaling.
Model support matrix:
| Model / Family | Cerebras WSE-3 | H100 SXM5 (vLLM) | Winner |
|---|---|---|---|
| Llama 3.x family | Supported | Supported | Tie |
| Qwen 3-32B | Supported | Supported | Tie |
| Qwen 3-235B Instruct / Thinking | Supported | Supported | Tie |
| GPT-OSS-120B | Supported | Supported | Tie |
| ZAI GLM-4.7 | Supported | Supported | Tie |
| DeepSeek V4 | Not supported | Supported | H100 |
| Qwen 2.5 | Not supported | Supported | H100 |
| Gemma 3 | Not supported | Supported | H100 |
| Mistral | Not supported | Supported | H100 |
| Multimodal models | Not supported | Supported | H100 |
Cerebras's catalog has expanded beyond Llama to include Qwen 3, GPT-OSS-120B, and GLM-4.7. If your production stack uses DeepSeek, Gemma, Mistral, or any multimodal model, Cerebras remains off the table. Verify the current model support list at the Cerebras documentation before committing.
Where Cerebras Wins
Cerebras has a genuine performance edge in specific scenarios:
Real-time single-user chat. At batch 1, WSE-3 delivers roughly 3x the tokens/sec of a single H100. For a chatbot where each session is its own isolated request stream, that translates directly to faster response generation for the end user.
Agent loops with strict TTFT requirements. TTFT stays flat at ~50 ms across all batch sizes on the WSE-3. For multi-step agentic pipelines where the agent issues back-to-back requests and total wall-clock time matters, a 50 ms TTFT versus an 80-400 ms TTFT compounds significantly across many steps.
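The compounding is easy to quantify. A hypothetical 20-step agent pipeline (the depth is an assumption for illustration) accrues TTFT on every step; this sketch uses the benchmark table's figures and omits generation time, which batching does not change per-request:

```python
# Cumulative TTFT over a multi-step agent loop, using the benchmark table's figures.
def total_ttft_s(ttft_ms: float, steps: int) -> float:
    return ttft_ms * steps / 1e3

STEPS = 20  # hypothetical pipeline depth
for label, ttft_ms in [("WSE-3 (any batch)", 50),
                       ("H100 at batch 8", 200),
                       ("H100 at batch 32", 400)]:
    print(f"{label}: {total_ttft_s(ttft_ms, STEPS):.1f} s of accumulated TTFT")
```

At 20 steps the flat 50 ms TTFT costs 1 second of wall-clock versus 8 seconds at H100's batch-32 TTFT, which is the "compounds significantly" claim made concrete.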
Llama-family-only workloads at low concurrency. If your production stack is exclusively Llama 3.1 or 3.3 models and your request rate is low enough that you never exceed batch 8-16, WSE-3 can deliver higher throughput at lower TTFT from a single system.
Managed API preference. Cerebras Inference API handles all infrastructure. You send requests, you pay per token. No provisioning, no CUDA driver management, no framework configuration. For teams that want to avoid GPU operations entirely, the managed API path is simpler.
Where H100 on Spheron Wins
At batch sizes above 8, H100 throughput surpasses the WSE-3 and keeps scaling. The practical scenarios where H100 wins clearly:
Batch throughput at concurrency above 8. Once you have more than 8 concurrent users, H100 with vLLM continuous batching generates more tokens per second than the WSE-3 from a single GPU. That advantage compounds as batch size grows.
Fine-tuning. The Cerebras Inference API does not support fine-tuning. If you need to adapt a base model to your domain, you need GPU access. Spheron H100 instances support full fine-tuning workflows with Axolotl, Unsloth, Hugging Face Transformers, and any other standard framework.
Model variety. Cerebras's inference catalog has grown beyond Llama and now includes Qwen 3 and a handful of other models. But DeepSeek, Gemma, Mistral, Phi, and all multimodal models remain unsupported. H100 supports every major open-weight model through standard serving frameworks. If your roadmap includes a broad set of models, GPU cloud gives you far more flexibility.
Multi-tenant serving. H100 supports MIG, allowing up to 7 isolated GPU instances from a single card. For platforms serving multiple customers or models simultaneously, MIG provides resource isolation that the WSE-3's single-pipeline architecture cannot replicate.
Full control and no lock-in. H100 instances give you root access, a standard CUDA environment, and choice of inference framework. Cerebras locks you into their API and SDK. If Cerebras changes pricing, deprecates model support, or has an outage, you have no alternative runtime. GPU cloud instances are fully portable.
Regional availability. Cerebras Inference API is US-only as of 2026. Spheron B200 and H100 instances are available through data center partners across multiple regions, which matters for latency-sensitive applications and data residency requirements.
Cost Analysis: Cerebras Inference API vs H100 on Spheron
Cerebras Inference API pricing for Llama 70B (as of early 2026): approximately $0.60 per million input tokens and $0.60 per million output tokens. This per-token rate applies regardless of your batch size.
H100 SXM5 pricing on Spheron: $2.904/hr on-demand (derived from live API as of 28 Apr 2026). Spot pricing on H100 SXM5 runs $0.80/hr per GPU when available, a roughly 72% discount to on-demand.
To compare apples to apples, the relevant metric is cost per million output tokens at your actual operating batch size. Formula:
CPM = ($/hr) / (tokens/sec × 3,600) × 1,000,000

H100 SXM5 cost per million output tokens (Llama 70B FP8, vLLM):
| Batch size | Throughput (tok/s) | $/hr | CPM ($) |
|---|---|---|---|
| Batch 1 (on-demand) | ~550 | $2.904 | ~$1.47 |
| Batch 8 (on-demand) | ~3,200 | $2.904 | ~$0.252 |
| Batch 32 (on-demand) | ~8,000 | $2.904 | ~$0.101 |
| Batch 1 (spot) | ~550 | $0.80 | ~$0.40 |
| Batch 8 (spot) | ~3,200 | $0.80 | ~$0.069 |
| Batch 32 (spot) | ~8,000 | $0.80 | ~$0.028 |
Cerebras Inference API (Llama 70B):
| Batch size | CPM ($) |
|---|---|
| Any batch | ~$0.60 |
At batch 1, Cerebras at $0.60/M is significantly cheaper than H100 on-demand at $1.47/M. At batch 8, the positions swap: H100 at $0.252/M costs less than half the Cerebras rate. At batch 32, H100 is roughly 6x cheaper per million tokens than Cerebras.
Spot pricing changes the picture substantially. At spot pricing, H100 is cheaper than Cerebras even at batch 1 (~$0.40/M vs $0.60/M). At batch 8, H100 spot drops to ~$0.069/M, roughly 8x cheaper than Cerebras. At batch 32, ~$0.028/M, over 21x cheaper.
The break-even batch size on on-demand pricing is somewhere between 2 and 8 concurrent requests. If your median batch size is consistently above that range, H100 on Spheron wins on cost per token. Below that range, Cerebras is the cheaper option for output tokens specifically. If spot instances are available for your workload, the economics favor H100 even at lower batch sizes.
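The formula and the break-even point can be checked directly. This sketch plugs the table's throughput figures and prices into the CPM formula and solves for the throughput at which on-demand H100 matches Cerebras's per-token rate:

```python
# Cost per million output tokens: CPM = ($/hr) / (tok/s * 3600) * 1e6
def cpm(dollars_per_hr: float, tokens_per_sec: float) -> float:
    return dollars_per_hr / (tokens_per_sec * 3600) * 1e6

ON_DEMAND, SPOT, CEREBRAS_CPM = 2.904, 0.80, 0.60

for batch, tps in [(1, 550), (8, 3200), (32, 8000)]:
    print(f"batch {batch:>2}: on-demand ${cpm(ON_DEMAND, tps):.3f}/M, "
          f"spot ${cpm(SPOT, tps):.3f}/M")

# Throughput at which on-demand H100 matches Cerebras's $0.60/M:
breakeven_tps = ON_DEMAND / (CEREBRAS_CPM / 1e6) / 3600
print(f"break-even: ~{breakeven_tps:,.0f} tok/s")  # ~1,344 tok/s
```

The break-even throughput of ~1,344 tok/s falls between the batch-1 (~550 tok/s) and batch-8 (~3,200 tok/s) benchmark points, which is why the crossover lands somewhere between 2 and 8 concurrent requests.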
For detailed cost-per-token methodology across GPU types, see the cost-per-token benchmark for a full breakdown including B200 and H200.
Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Migration Considerations
Model support matrix. Cerebras Inference API supports Llama 3.1 8B, Llama 3.1 70B, Llama 3.3 70B, Qwen 3-32B, Qwen 3-235B Instruct, Qwen 3-235B Thinking, GPT-OSS-120B, and ZAI GLM-4.7. It does not support DeepSeek, Gemma, Mistral, multimodal models, or most other open-weight models at time of writing. H100 via vLLM, SGLang, or TensorRT-LLM supports all major open-weight models. Verify the current Cerebras model list before making a migration decision, as the catalog is actively expanding.
Software stack. Cerebras uses a proprietary SDK. You cannot point a vLLM server at WSE-3 hardware. Any migration from GPU-based vLLM deployment to Cerebras requires rewriting your inference client. The reverse migration (back to GPU) requires rewriting again. H100 deployments use standard frameworks and configuration formats that transfer across GPU providers. See the vLLM deployment guide on Spheron for a step-by-step setup that runs on any H100 instance.
Lock-in risk. Cerebras Inference API is a managed service. You have no access to the underlying hardware, no ability to adjust inference parameters beyond what the API exposes, and no fallback runtime. GPU cloud instances give you full root access and are portable across providers that run standard CUDA hardware.
Regional availability. Cerebras Inference API is available in US regions only. Spheron has data center partners across multiple regions globally, which matters for applications with latency requirements tied to user geography or data residency regulations.
Fine-tuning path. If you need to fine-tune at any point, Cerebras Inference API cannot do it. You would need a separate Cerebras cluster engagement, which is not available as a self-service on-demand purchase. H100 on GPU cloud supports fine-tuning with any standard framework. See our fine-tuning guide for workflows using Axolotl and Unsloth on Spheron H100 instances.
The Decision Framework
| Scenario | Choose Cerebras | Choose Spheron H100 |
|---|---|---|
| Single-user chatbot, Llama 3.x only | Yes | |
| Real-time agent (TTFT under 100ms, batch 1) | Yes | |
| Multi-tenant API, batch above 8 | Yes | |
| Fine-tuning required | Yes | |
| Qwen 3-32B or 3-235B | Yes | |
| DeepSeek / Qwen 2.5 / Gemma / Mistral | Yes | |
| Multimodal models | Yes | |
| Non-US region required | Yes | |
| Cost per million tokens at scale | Batch 1-4 | Batch 8+ |
| Full control, no vendor lock-in | Yes | |
| Framework portability (vLLM/SGLang) | Yes |
The core tradeoff is batch size versus latency. Cerebras wins on TTFT at any batch size and on throughput at batch 1-4. H100 wins on throughput at batch 8+, cost per token at scale, model variety, and operational flexibility.
For most production inference deployments, the batch profile determines the answer. Real-time chat APIs with isolated user sessions benefit from Cerebras's TTFT advantage. Multi-user APIs, batch processing pipelines, and any workload requiring fine-tuning or non-Llama models belong on GPU cloud.
For teams evaluating all GPU options before committing, the GPU inference selection guide covers L40S, H100, H200, and B200 with cost-per-token data at realistic batch sizes.
Running LLM inference at scale? Spheron H100 and B200 instances offer per-minute billing, full root access, and support for every major open-weight model, with no lock-in to a single chip vendor.
Rent H100 on Spheron → | Rent B200 → | View all GPU pricing →
