Cerebras published 2,100 tokens per second for Llama 3.1 70B on the WSE-3. That number is real. It is also measured at batch size 1, on a full-system wafer-scale unit that draws roughly 23 kW. Context matters here. The WSE-3 is not a GPU you add to a cluster. It is a complete inference system with a very specific performance profile, and whether it is the right choice for your workload depends almost entirely on how you submit requests.
This guide compares the WSE-3 and the NVIDIA H100 SXM5 directly: architecture, Llama 70B benchmarks at realistic batch sizes, cost per million tokens, and a decision framework for when each wins. For context on how other non-GPU inference chips approach similar architectural tradeoffs, see the Groq LPU breakdown.
Cerebras WSE-3 Architecture
The Wafer Scale Engine 3 (WSE-3) is what the name describes: a single silicon wafer that functions as one chip. It contains 900,000 AI cores and 44 GB of on-die SRAM, all connected by an on-chip fabric at 21 PB/s bandwidth. There is no HBM. There are no off-chip memory stacks. Every parameter the chip needs during inference lives in SRAM that is physically part of the die.
This architecture eliminates the HBM memory wall that limits GPU inference throughput at low batch sizes. On a GPU, each decode step requires loading all model weights and the KV cache from HBM. At 3.35 TB/s (H100 SXM5), reading the 70 GB of weights for a 70B FP8 model takes roughly 21 ms per decode step. On the WSE-3, those same weights sit in on-die SRAM accessible at 21 PB/s, reducing the memory access time by roughly 6,000x.
That bandwidth advantage is why the batch-1 throughput numbers are so high. When batch size is 1, inference is almost entirely memory-bandwidth-bound. The WSE-3 has an enormous bandwidth edge at that operating point.
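The decode-step arithmetic above can be sketched directly. This is a back-of-the-envelope memory-bound floor only (real decode steps also read the KV cache and pay kernel launch overhead), using the bandwidth and weight figures from this section:

```python
# Back-of-the-envelope decode-step floor for a 70B model at FP8 (1 byte/param).
# Illustrative only: ignores KV-cache reads and kernel overhead.
WEIGHT_BYTES = 70e9  # ~70 GB of weights for Llama 70B at FP8

def decode_step_ms(bandwidth_bytes_per_s: float) -> float:
    """Time to stream all weights once: the floor for one decode step."""
    return WEIGHT_BYTES / bandwidth_bytes_per_s * 1e3

h100_hbm3 = 3.35e12  # H100 SXM5 HBM3: 3.35 TB/s
wse3_sram = 21e15    # WSE-3 on-die SRAM fabric: 21 PB/s

print(f"H100:  {decode_step_ms(h100_hbm3):.1f} ms per decode step")        # ~20.9 ms
print(f"WSE-3: {decode_step_ms(wse3_sram) * 1e3:.1f} us per decode step")  # ~3.3 us
print(f"Bandwidth ratio: ~{wse3_sram / h100_hbm3:,.0f}x")                  # ~6,269x
```

The ~21 ms figure is where the "roughly 6,000x" claim comes from: the ratio is pure bandwidth, before any system-level effects shrink it.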
The catch: the WSE-3 holds only 44 GB of SRAM. A 70B model at FP8 needs roughly 70 GB. Cerebras manages this by quantizing to a lower precision or by partitioning across multiple WSE-3 systems. The WSE-3 is a complete infrastructure system, not a single accelerator you drop into a rack alongside other hardware.
| Spec | Cerebras WSE-3 | NVIDIA H100 SXM5 |
|---|---|---|
| Architecture | Wafer-Scale | Hopper |
| Compute cores | 900,000 AI cores | 16,896 CUDA + 528 Tensor Cores |
| Memory type | On-die SRAM | HBM3 |
| Memory capacity | 44 GB | 80 GB |
| Memory bandwidth | 21 PB/s (on-die) | 3.35 TB/s |
| TDP | ~23 kW (full system) | 700 W |
| FP16 throughput | ~125 PFLOPS (sparse) | 2.0 PFLOPS |
| FP8 throughput | Not natively supported | 3.96 PFLOPS |
| Form factor | Full system unit | Single GPU card |
Note that comparing WSE-3's 21 PB/s directly to H100's 3.35 TB/s is not a like-for-like comparison. WSE-3 uses on-die SRAM; H100 uses off-chip HBM. The architectures are fundamentally different. On-die SRAM provides lower latency and higher bandwidth but at much smaller capacity. The effective throughput difference at the model level is smaller than the raw bandwidth ratio suggests because the WSE-3 cannot hold a full 70B model in its 44 GB SRAM without precision reduction.
NVIDIA H100 SXM5 Architecture Refresher
The H100 SXM5 is the Hopper generation GPU with 80 GB HBM3, 3.35 TB/s memory bandwidth, and 3.96 PFLOPS of FP8 compute via the Transformer Engine. It connects to neighboring GPUs over NVLink 4 at 900 GB/s bidirectional bandwidth, supports MIG for up to 7 isolated instances, and uses FP8 Transformer Engine to double throughput over FP16 at minimal quality cost.
The H100's 80 GB of HBM3 matters for model fit. A 70B model in FP8 uses roughly 70 GB, leaving about 10 GB for KV cache. That is tight for long-context workloads but sufficient for most production chat endpoints at standard context lengths. For the full H100 vs H200 specification comparison, see our H100 vs H200 comparison.
Unlike the WSE-3, the H100 is a single GPU card. You can run it standalone for a 70B endpoint, pair two via NVLink for tensor-parallel serving, or scale to 8-GPU nodes for larger models or higher throughput. The software ecosystem (vLLM, SGLang, TensorRT-LLM) is mature and production-ready on H100.
Inference Benchmarks: Llama 70B Performance and Model Support
These numbers are drawn from Cerebras's published benchmarks for WSE-3 and community vLLM benchmarks for H100 SXM5 with FP8. The Cerebras figures are system-level (one WSE-3 unit). The H100 figures are per-GPU from standard vLLM deployments. They are not perfectly comparable because the WSE-3 is a full system and the H100 is a single card, but the comparison is the relevant one for teams evaluating cost per token.
Llama 3.1 70B inference performance:
| Metric | Cerebras WSE-3 | H100 SXM5 (vLLM, FP8) | Winner |
|---|---|---|---|
| Throughput at batch 1 | ~2,100 tok/s | ~550 tok/s | Cerebras |
| Throughput at batch 8 | ~2,100 tok/s | ~3,200 tok/s | H100 |
| Throughput at batch 32 | ~2,100 tok/s | ~8,000 tok/s | H100 |
| Throughput at batch 64 | ~2,100 tok/s | ~12,000 tok/s | H100 |
| TTFT at batch 1 | ~50 ms | ~80 ms | Cerebras |
| TTFT at batch 8 | ~50 ms | ~200 ms | Cerebras |
| TTFT at batch 32 | ~50 ms | ~400 ms | Cerebras |
| TTFT at batch 64 | ~50 ms | ~700 ms | Cerebras |
Cerebras figures are from published WSE-3 benchmark disclosures, updated with the October 2024 throughput announcement (Cerebras Inference source). H100 figures are per-GPU estimates from community vLLM benchmarks on SXM5 hardware with FP8 precision and continuous batching. Run your own benchmarks before production decisions.
The throughput pattern is the key insight. Cerebras delivers a fixed ~2,100 tokens/sec regardless of batch size because the on-die SRAM bandwidth saturates a single inference pipeline but cannot parallelize across many concurrent requests the way GPU HBM can. H100 throughput climbs from ~550 tokens/sec at batch 1 to ~12,000 tokens/sec at batch 64 (sub-linear but substantial scaling) because HBM can serve weight reads in parallel across batched token positions.
The TTFT picture is different. Cerebras wins at every batch size because SRAM access latency is fundamentally lower than HBM latency. For use cases where time-to-first-token is the primary metric (real-time chat, agent loops), Cerebras has a consistent advantage that does not erode with batch scaling.
Model support matrix:
| Model / Family | Cerebras WSE-3 | H100 SXM5 (vLLM) | Winner |
|---|---|---|---|
| Llama 3.x family | Supported | Supported | Tie |
| Qwen 3-32B | Supported | Supported | Tie |
| Qwen 3-235B Instruct / Thinking | Supported | Supported | Tie |
| GPT-OSS-120B | Supported | Supported | Tie |
| ZAI GLM-4.7 | Supported | Supported | Tie |
| DeepSeek V4 | Not supported | Supported | H100 |
| Qwen 2.5 | Not supported | Supported | H100 |
| Gemma 3 | Not supported | Supported | H100 |
| Mistral | Not supported | Supported | H100 |
| Multimodal models | Not supported | Supported | H100 |
Cerebras's catalog has expanded beyond Llama to include Qwen 3, GPT-OSS-120B, and GLM-4.7. If your production stack uses DeepSeek, Gemma, Mistral, or any multimodal model, Cerebras remains off the table. Verify the current model support list at the Cerebras documentation before committing.
Where Cerebras Wins
Cerebras has a genuine performance edge in specific scenarios:
Real-time single-user chat. At batch 1, WSE-3 delivers roughly 3x the tokens/sec of a single H100. For a chatbot where each session is its own isolated request stream, that translates directly to faster response generation for the end user.
Agent loops with strict TTFT requirements. TTFT stays flat at ~50 ms across all batch sizes on the WSE-3. For multi-step agentic pipelines where the agent issues back-to-back requests and total wall-clock time matters, a 50 ms TTFT versus an 80-400 ms TTFT compounds significantly across many steps.
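The compounding is easy to quantify. A hypothetical 20-step agent pipeline (the depth is an assumption for illustration) accrues TTFT on every step; this sketch uses the benchmark table's figures and omits generation time, which batching does not change per-request:

```python
# Cumulative TTFT over a multi-step agent loop, using the benchmark table's figures.
def total_ttft_s(ttft_ms: float, steps: int) -> float:
    return ttft_ms * steps / 1e3

STEPS = 20  # hypothetical pipeline depth
for label, ttft_ms in [("WSE-3 (any batch)", 50),
                       ("H100 at batch 8", 200),
                       ("H100 at batch 32", 400)]:
    print(f"{label}: {total_ttft_s(ttft_ms, STEPS):.1f} s of accumulated TTFT")
```

At 20 steps the flat 50 ms TTFT costs 1 second of wall-clock versus 8 seconds at H100's batch-32 TTFT, which is the "compounds significantly" claim made concrete.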
Llama-family-only workloads at low concurrency. If your production stack is exclusively Llama 3.1 or 3.3 models and your request rate is low enough that you never exceed batch 8-16, WSE-3 can deliver higher throughput at lower TTFT from a single system.
Managed API preference. Cerebras Inference API handles all infrastructure. You send requests, you pay per token. No provisioning, no CUDA driver management, no framework configuration. For teams that want to avoid GPU operations entirely, the managed API path is simpler.
Where H100 on Spheron Wins
At batch sizes above 8, H100 throughput surpasses the WSE-3 and keeps scaling. The practical scenarios where H100 wins clearly:
Batch throughput at concurrency above 8. Once you have more than 8 concurrent users, H100 with vLLM continuous batching generates more tokens per second than the WSE-3 from a single GPU. That advantage compounds as batch size grows.
Fine-tuning. The Cerebras Inference API does not support fine-tuning. If you need to adapt a base model to your domain, you need GPU access. Spheron H100 instances support full fine-tuning workflows with Axolotl, Unsloth, Hugging Face Transformers, and any other standard framework.
Model variety. Cerebras's inference catalog has grown beyond Llama and now includes Qwen 3 and a handful of other models. But DeepSeek, Gemma, Mistral, Phi, and all multimodal models remain unsupported. H100 supports every major open-weight model through standard serving frameworks. If your roadmap includes a broad set of models, GPU cloud gives you far more flexibility.
Multi-tenant serving. H100 supports MIG, allowing up to 7 isolated GPU instances from a single card. For platforms serving multiple customers or models simultaneously, MIG provides resource isolation that the WSE-3's single-pipeline architecture cannot replicate.
Full control and no lock-in. H100 instances give you root access, a standard CUDA environment, and choice of inference framework. Cerebras locks you into their API and SDK. If Cerebras changes pricing, deprecates model support, or has an outage, you have no alternative runtime. GPU cloud instances are fully portable.
Regional availability. Cerebras Inference API is US-only as of 2026. Spheron B200 and H100 instances are available through data center partners across multiple regions, which matters for latency-sensitive applications and data residency requirements.
Cost Analysis: Cerebras Inference API vs H100 on Spheron
Cerebras Inference API pricing for Llama 70B (as of early 2026): approximately $0.60 per million input tokens and $0.60 per million output tokens. This per-token rate applies regardless of your batch size.
H100 SXM5 pricing on Spheron: $2.904/hr on-demand (derived from live API as of 28 Apr 2026). Spot pricing on H100 SXM5 runs $0.80/hr per GPU when available, a roughly 72% discount to on-demand.
To compare apples to apples, the relevant metric is cost per million output tokens at your actual operating batch size. Formula:
CPM = ($/hr) / (tokens/sec × 3,600) × 1,000,000

H100 SXM5 cost per million output tokens (Llama 70B FP8, vLLM):
| Batch size | Throughput (tok/s) | $/hr | CPM ($) |
|---|---|---|---|
| Batch 1 (on-demand) | ~550 | $2.904 | ~$1.47 |
| Batch 8 (on-demand) | ~3,200 | $2.904 | ~$0.252 |
| Batch 32 (on-demand) | ~8,000 | $2.904 | ~$0.101 |
| Batch 1 (spot) | ~550 | $0.80 | ~$0.40 |
| Batch 8 (spot) | ~3,200 | $0.80 | ~$0.069 |
| Batch 32 (spot) | ~8,000 | $0.80 | ~$0.028 |
Cerebras Inference API (Llama 70B):
| Batch size | CPM ($) |
|---|---|
| Any batch | ~$0.60 |
At batch 1, Cerebras at $0.60/M is significantly cheaper than H100 on-demand at $1.47/M. At batch 8, the positions swap: H100 at $0.252/M costs less than half the Cerebras rate. At batch 32, H100 is roughly 6x cheaper per million tokens than Cerebras.
Spot pricing changes the picture substantially. At spot pricing, H100 is cheaper than Cerebras even at batch 1 (~$0.40/M vs $0.60/M). At batch 8, H100 spot drops to ~$0.069/M, roughly 8x cheaper than Cerebras. At batch 32, ~$0.028/M, over 21x cheaper.
The break-even batch size on on-demand pricing is somewhere between 2 and 8 concurrent requests. If your median batch size is consistently above that range, H100 on Spheron wins on cost per token. Below that range, Cerebras is the cheaper option for output tokens specifically. If spot instances are available for your workload, the economics favor H100 even at lower batch sizes.
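The formula and the break-even point can be checked directly. This sketch plugs the table's throughput figures and prices into the CPM formula and solves for the throughput at which on-demand H100 matches Cerebras's per-token rate:

```python
# Cost per million output tokens: CPM = ($/hr) / (tok/s * 3600) * 1e6
def cpm(dollars_per_hr: float, tokens_per_sec: float) -> float:
    return dollars_per_hr / (tokens_per_sec * 3600) * 1e6

ON_DEMAND, SPOT, CEREBRAS_CPM = 2.904, 0.80, 0.60

for batch, tps in [(1, 550), (8, 3200), (32, 8000)]:
    print(f"batch {batch:>2}: on-demand ${cpm(ON_DEMAND, tps):.3f}/M, "
          f"spot ${cpm(SPOT, tps):.3f}/M")

# Throughput at which on-demand H100 matches Cerebras's $0.60/M:
breakeven_tps = ON_DEMAND / (CEREBRAS_CPM / 1e6) / 3600
print(f"break-even: ~{breakeven_tps:,.0f} tok/s")  # ~1,344 tok/s
```

The break-even throughput of ~1,344 tok/s falls between the batch-1 (~550 tok/s) and batch-8 (~3,200 tok/s) benchmark points, which is why the crossover lands somewhere between 2 and 8 concurrent requests.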
For detailed cost-per-token methodology across GPU types, see the cost-per-token benchmark for a full breakdown including B200 and H200.
Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Migration Considerations
Model support matrix. Cerebras Inference API supports Llama 3.1 8B, Llama 3.1 70B, Llama 3.3 70B, Qwen 3-32B, Qwen 3-235B Instruct, Qwen 3-235B Thinking, GPT-OSS-120B, and ZAI GLM-4.7. It does not support DeepSeek, Gemma, Mistral, multimodal models, or most other open-weight models at time of writing. H100 via vLLM, SGLang, or TensorRT-LLM supports all major open-weight models. Verify the current Cerebras model list before making a migration decision, as the catalog is actively expanding.
Software stack. Cerebras uses a proprietary SDK. You cannot point a vLLM server at WSE-3 hardware. Any migration from GPU-based vLLM deployment to Cerebras requires rewriting your inference client. The reverse migration (back to GPU) requires rewriting again. H100 deployments use standard frameworks and configuration formats that transfer across GPU providers. See the vLLM deployment guide on Spheron for a step-by-step setup that runs on any H100 instance.
Lock-in risk. Cerebras Inference API is a managed service. You have no access to the underlying hardware, no ability to adjust inference parameters beyond what the API exposes, and no fallback runtime. GPU cloud instances give you full root access and are portable across providers that run standard CUDA hardware.
Regional availability. Cerebras Inference API is available in US regions only. Spheron has data center partners across multiple regions globally, which matters for applications with latency requirements tied to user geography or data residency regulations.
Fine-tuning path. If you need to fine-tune at any point, Cerebras Inference API cannot do it. You would need a separate Cerebras cluster engagement, which is not available as a self-service on-demand purchase. H100 on GPU cloud supports fine-tuning with any standard framework. See our fine-tuning guide for workflows using Axolotl and Unsloth on Spheron H100 instances.
The Decision Framework
| Scenario | Choose Cerebras | Choose Spheron H100 |
|---|---|---|
| Single-user chatbot, Llama 3.x only | Yes | |
| Real-time agent (TTFT under 100ms, batch 1) | Yes | |
| Multi-tenant API, batch above 8 | Yes | |
| Fine-tuning required | Yes | |
| Qwen 3-32B or 3-235B | Yes | |
| DeepSeek / Qwen 2.5 / Gemma / Mistral | Yes | |
| Multimodal models | Yes | |
| Non-US region required | Yes | |
| Cost per million tokens at scale | Batch 1-4 | Batch 8+ |
| Full control, no vendor lock-in | Yes | |
| Framework portability (vLLM/SGLang) | Yes |
The core tradeoff is batch size versus latency. Cerebras wins on TTFT at any batch size and on throughput at batch 1-4. H100 wins on throughput at batch 8+, cost per token at scale, model variety, and operational flexibility.
For most production inference deployments, the batch profile determines the answer. Real-time chat APIs with isolated user sessions benefit from Cerebras's TTFT advantage. Multi-user APIs, batch processing pipelines, and any workload requiring fine-tuning or non-Llama models belong on GPU cloud.
For teams evaluating all GPU options before committing, the GPU inference selection guide covers L40S, H100, H200, and B200 with cost-per-token data at realistic batch sizes.
Running LLM inference at scale? Spheron H100 and B200 instances offer per-minute billing, full root access, and support for every major open-weight model, with no lock-in to a single chip vendor.
Rent H100 on Spheron → | Rent B200 → | View all GPU pricing →
