Intel Gaudi 3 entered 2026 as the most credible non-NVIDIA option for dense LLM inference on GPU cloud. IBM Cloud, Dell PowerEdge XE9680 deployments, and several neocloud operators added Gaudi 3 capacity through 2025 and early 2026. The question teams keep asking is whether that 128 GB of HBM2e and a lower price tag justify trading the CUDA stack for SynapseAI. This post gives you the benchmark data, cost math, and migration steps to answer that question for your workload.
Quick Answer: Gaudi 3 vs H200 vs B200 at a Glance
| GPU | Best For | HBM | Interconnect | Spheron Price |
|---|---|---|---|---|
| Intel Gaudi 3 HPU | Large-batch dense inference, cost-sensitive serving | 128 GB HBM2e | 24x 200GbE RoCE v2 per OAM (4.8 Tbps aggregate) | Not currently listed (check /pricing/) |
| H200 SXM5 | 70B-class inference, mature CUDA stack | 141 GB HBM3e | NVLink 4.0 (900 GB/s) | From $4.22/hr on-demand ($1.76/hr spot) |
| B200 SXM6 | FP4 workloads, 100B+ models, highest throughput | 192 GB HBM3e | NVLink 5 (1.8 TB/s) | From $3.50/hr spot (no on-demand currently) |
Why Gaudi 3 Matters in 2026
Most of the GPU cloud discussion in 2025-2026 focused on AMD as the primary non-NVIDIA alternative. Intel's Gaudi 3 got less attention, but the hardware case for it is specific and real.
128 GB HBM2e per chip. That exceeds H100's 80 GB by 60% and sits in the same neighborhood as H200's 141 GB. For inference workloads where a 70B model at FP8 (~70 GB of weights) fits on a single chip with substantial KV cache headroom, Gaudi 3 provides a single-GPU fit that older NVIDIA Hopper chips could not offer. For a deeper look at how AMD's MI350X approaches the same large-memory use case with 288 GB HBM3E, see our AMD MI350X vs B200 comparison.
24x 200GbE RoCE v2 fabric. Each Gaudi 3 OAM has 24 ports of 200GbE RoCE v2 (4.8 Tbps aggregate per OAM). In an 8-OAM baseboard, most ports handle intra-baseboard all-to-all traffic, with a subset dedicated to scale-out beyond the node. The Gaudi 3 scale-out design uses standard Ethernet rather than NVSwitch: 200GbE is cheaper and more flexible for multi-rack scale-out, but it adds latency for workloads that need tight GPU-GPU communication. For dense batch inference where each GPU handles independent requests, this connectivity is adequate. For multi-GPU tensor parallelism with large KV cache sharing, NVLink wins decisively.
OAM form factor. Gaudi 3 uses the Open Accelerator Module standard, which allows multi-vendor server ecosystem support. Dell, Supermicro, and Lenovo have all produced OAM-based Gaudi 3 systems. This matters if you're building on-premise infrastructure or sourcing from non-NVIDIA-ecosystem integrators.
Neocloud capacity. IBM Cloud added Gaudi 3 instances in late 2025. Several GPU cloud providers now include Gaudi 3 in their capacity mix, which is what creates the cost-sensitive inference use case. Intel's developer cloud also provides on-demand access for evaluation.
Spec Sheet Comparison
| Specification | Intel Gaudi 3 | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Architecture | Intel Gaudi 3 (HL-325L) | NVIDIA Hopper (GH100) | NVIDIA Blackwell (GB100) |
| VRAM | 128 GB HBM2e | 141 GB HBM3e | 192 GB HBM3e |
| Memory Bandwidth | 3.7 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP8 TFLOPS | 1,835 (dense) | 3,958 (with sparsity; 1,979 dense) | 4,500 (dense) |
| FP16 TFLOPS | 459 | 1,979 (with sparsity) | 2,250 (dense) |
| FP4 Support | No | No | Yes (9,000 TFLOPS) |
| Multi-GPU Interconnect | 24x 200GbE RoCE v2 per OAM (4.8 Tbps aggregate) | NVLink 4 (900 GB/s) | NVLink 5 (1.8 TB/s) |
| TDP | 900W* | 700W | 1,000W |
| Form Factor | OAM | SXM5 | SXM6 |
*900W applies to the HL-325L OAM air-cooled SKU. The PCIe AIC variant of Gaudi 3 has a 600W TDP. Factor this in when planning PCIe-based deployments.
The memory bandwidth gap is the most important number here. At 3.7 TB/s, Gaudi 3 has 77% of H200's bandwidth and 46% of B200's. For memory-bandwidth-bound inference (batch size 1-8, decode-heavy workloads), this translates directly into throughput. For a broader comparison of H200 and B200 architecture including NVLink generations, see the H200 vs B200 vs GB200 comparison.
You can rent H200 on-demand instances directly on Spheron. The AMD MI350X vs B200 guide covers the Blackwell architecture in more depth.
LLM Inference Benchmarks
Data source note: Gaudi 3 third-party benchmark data is limited as of May 2026. Intel has not submitted Gaudi 3 results to MLPerf Inference v6.0 for data center LLM tasks (verified against the MLCommons v6.0 announcement). Intel's participation in v6 covered Xeon 6 CPU and Arc Pro GPU workloads, so no MLPerf v6 Gaudi 3 numbers are available. See our MLPerf Inference v6 results breakdown for context. The figures below combine Intel vendor-supplied data and bandwidth-scaling projections from established H200 reference points. Each row is labeled by source.
Llama 3.3 70B FP8 - Throughput by Batch Size
Throughput is measured in tokens per second per GPU. This is the canonical dense 70B benchmark and the fairest comparison point: all three GPUs support FP8, and no FP4 advantage tilts the B200 numbers.
| GPU | Batch 1 (tok/s) | Batch 8 (tok/s) | Batch 32 (tok/s) | Batch 128 (tok/s) | Source |
|---|---|---|---|---|---|
| Intel Gaudi 3 | ~520 (est.) | ~1,350 (est.) | ~1,850 (est.) | ~2,100 (est.) | Bandwidth-scaling projection from Intel vendor data (Gaudi 3 vs H100 LLaMA 2 70B, habana.ai) |
| H200 SXM5 | ~680 | ~1,800 | ~2,500 | ~3,200 | vLLM community benchmarks, SemiAnalysis InferenceX reference data |
| B200 SXM6 | ~1,150 | ~3,000 | ~3,600 | ~4,400 | Bandwidth-scaling from SemiAnalysis InferenceX; FP8 mode (not FP4) |
At batch 1, Gaudi 3's bandwidth deficit shows clearly: approximately 76% of H200's throughput and 45% of B200's. As batch size grows, the workload shifts from memory-bound to compute-bound, and the gap widens further: Gaudi 3's dense FP8 compute (1,835 TFLOPS) trails H200's (1,979 dense; 3,958 with sparsity) and B200's (4,500 dense), and the mature CUDA kernel stack tends to sustain higher utilization at large batch.
These are projections, not measured lab results. Real performance depends on HPU driver version, SynapseAI graph compilation overhead, and the specific model's attention pattern. Intel's published vendor data shows Gaudi 3 reaching approximately 2x H100 throughput for LLaMA 2 70B at large batch sizes (habana.ai, vendor claim), but H100 is a lower baseline than H200, and the claim covers specific scenarios that may not match your production configuration.
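To make the projection method concrete, here is the bandwidth-scaling arithmetic behind the estimated rows, as a minimal sketch (the 680 tok/s batch-1 H200 reference comes from the table above):

```python
# Bandwidth-scaling projection used for the estimated rows: scale a measured
# H200 reference point by the memory-bandwidth ratio. Valid only in the
# memory-bound regime (low batch, decode-heavy); at high batch the compute
# gap dominates and this method flatters Gaudi 3.
H200_BW, GAUDI3_BW, B200_BW = 4.8, 3.7, 8.0  # TB/s

def project(h200_tok_s: float, target_bw: float) -> float:
    return h200_tok_s * target_bw / H200_BW

print(project(680, GAUDI3_BW))  # ~524 tok/s, vs the ~520 batch-1 estimate
print(project(680, B200_BW))    # ~1,133 tok/s, vs ~1,150 in the table
```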
DeepSeek V4 MoE: Where Gaudi 3 Struggles
Fine-grained mixture-of-experts models like DeepSeek V4 route each token to a subset of experts. In a multi-GPU setup, expert weights are sharded across GPUs, and routing causes frequent small-volume transfers between GPUs during each forward pass.
Over NVLink 5 (B200) at 1.8 TB/s, these expert dispatch transfers are fast. Over 200GbE (Gaudi 3) at 4.8 Tbps aggregate per OAM but with higher per-hop latency than NVSwitch, they are slower, and the latency compounds with more experts and higher concurrency.
For DeepSeek V4's fine-grained MoE routing at batch sizes above 32, expect Gaudi 3 to deliver meaningfully lower throughput than H200 or B200. This is the single biggest Gaudi 3 weakness for production MoE inference. No independent benchmark data for DeepSeek V4 on Gaudi 3 is available as of May 2026; the projection is based on known 200GbE vs NVLink latency differentials.
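A toy cost model makes the latency argument concrete; every input below is an illustrative assumption, not a measured figure for either fabric:

```python
# Toy cost model for one expert-dispatch hop: time = per-hop latency +
# payload / link bandwidth. All four inputs are illustrative assumptions,
# not measured values for either fabric.
def dispatch_us(payload_bytes: int, hop_latency_us: float, link_gbps: float) -> float:
    # Gbps -> bits per microsecond, so the result is in microseconds
    return hop_latency_us + payload_bytes * 8 / (link_gbps * 1e3)

# A ~16 KB per-token expert payload is latency-dominated on Ethernet:
print(dispatch_us(16_384, 2.0, 200))    # hypothetical RoCE v2 hop: ~2.7 us
print(dispatch_us(16_384, 0.3, 7_200))  # hypothetical NVLink 5 hop: ~0.3 us
```

Because per-token expert payloads are small, the per-hop latency term sets the floor, which is why aggregate Tbps figures understate the MoE penalty.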
Qwen 3 72B Dense: A More Competitive Scenario
Dense models avoid the MoE routing penalty. For Qwen 3 72B in FP8, all expert weights are always loaded, inter-GPU communication consists of activation transfers rather than expert dispatch, and 200GbE is less of a bottleneck.
At batch 32 with Qwen 3 72B FP8 on a 2-GPU configuration:
| GPU Config | Est. Throughput (tok/s) | Note |
|---|---|---|
| 2x Gaudi 3 (tensor-parallel) | ~2,600 (est.) | Bandwidth scaling; no independent data |
| 2x H200 SXM5 | ~3,800 | vLLM community benchmarks |
| 2x B200 SXM6 | ~6,200 (est.) | FP8 mode, bandwidth scaling |
The gap narrows compared to MoE models but does not close. Gaudi 3 is more competitive for dense models, but H200 still leads on per-GPU throughput. For teams where the H200 hourly rate is the constraint (not the CUDA stack), Gaudi 3 at a lower price point could produce comparable cost-per-token for this workload.
Our GPU Cost Per Token Benchmark 2026 covers A100, H100, H200, and B200 with measured throughput; use those numbers as your H200/B200 baseline when calculating how Gaudi 3 would compare for your specific use case.
Software Stack Reality Check
| Dimension | Intel Gaudi 3 | NVIDIA H200 / B200 |
|---|---|---|
| Primary SDK | SynapseAI 1.x | CUDA 12.x |
| Inference Framework | vLLM-HPU fork (HabanaAI), OPEA | vLLM, TensorRT-LLM, SGLang |
| Container Registry | vault.habana.ai | NGC (nvcr.io) |
| Attention Implementation | Gaudi-native fused attention (not FlashAttention 3) | FlashAttention 3 (Hopper/Blackwell) |
| HuggingFace Support | optimum-habana library | Native transformers + accelerate |
| Model Coverage | Growing: LLaMA, Mistral, Qwen 2, Falcon supported; newer architectures vary | Near-universal |
| Quantization Options | FP8, BF16 | FP8, FP4 (B200), BF16, INT8 |
The practical constraint is the vLLM-HPU fork. The HabanaAI/vllm-fork repository tracks upstream vLLM with a lag, typically several weeks to months. Teams that rely on recent vLLM features will find them absent on Gaudi 3. Specific gaps as of May 2026 include:
- MRV2 (Model Runner v2): Available in upstream vLLM, not yet in the HPU fork. For context on what MRV2 enables for production deployments, see our vLLM production deployment guide.
- Eagle3 speculative decoding: CUDA-only in upstream vLLM.
- SGLang: Not supported on Gaudi 3; the RadixAttention scheduler has no HPU backend.
- TensorRT-LLM: NVIDIA-only by design. Teams using TensorRT-LLM for B200 deployments will need a complete stack swap to target Gaudi 3. For a production TensorRT-LLM setup, see our TensorRT-LLM deployment guide.
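Before debugging a missing feature, confirm which vLLM build the container actually ships and that the device is visible; a minimal sketch, assuming the standard Gaudi PyTorch bridge exposes torch.hpu as documented by Intel:

```python
# Sanity check inside the Gaudi container: which vLLM build is installed,
# and does PyTorch see the HPU? The torch.hpu namespace is registered by
# importing habana_frameworks.torch (per Intel's Gaudi PyTorch docs).
import vllm
import torch
import habana_frameworks.torch  # noqa: F401  (registers the 'hpu' device)

print("vLLM build:", vllm.__version__)
print("HPU available:", torch.hpu.is_available())
print("HPU count:", torch.hpu.device_count())
```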
OPEA (Open Platform for Enterprise AI) is Intel's multi-component pipeline framework for Gaudi. It is useful for RAG pipelines and multi-model serving on Intel hardware, but it adds orchestration complexity that single-model serving via vLLM-HPU avoids. If your use case is simple model serving, start with vLLM-HPU before evaluating OPEA.
Cost-Per-Million-Tokens Analysis
Gaudi 3 is not listed on Spheron's marketplace as of 12 May 2026. For H200 and B200, the live API shows the following rates:
| GPU | $/hr | Est. Llama 3.3 70B tok/s (FP8, batch 32) | Cost per 1M tokens |
|---|---|---|---|
| Intel Gaudi 3 | Not listed (check /pricing/) | ~1,850 (est.) | Not calculable (check /pricing/) |
| H200 SXM5 | $4.22 on-demand | ~2,500 | ~$0.47 |
| B200 SXM6 | $3.50 spot | ~3,600 (FP8) | ~$0.27 |
| B200 SXM6 | $3.50 spot | ~10,000 (FP4) | ~$0.10 |
Formula: CPM = ($/hr) / (tokens_per_sec × 3600 / 1,000,000)
At FP8, B200 has a lower cost-per-token than H200 at this batch size: $0.27 vs $0.47. B200's advantage grows further with FP4 quantization ($0.10/M), but FP4 requires Blackwell-optimized model weights and a calibrated quantization step that Gaudi 3 cannot match.
If Gaudi 3 were available at a significantly lower hourly rate than H200, the math could flip: at the same ~1,850 tok/s at batch 32, Gaudi 3 would need to be priced below $3.12/hr to deliver a lower CPM than H200. Whether that pricing exists depends on the specific provider and current supply.
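The same arithmetic in code, including the break-even rate (throughput values are this post's estimates, not measurements):

```python
# The CPM formula and the break-even arithmetic from above. Throughput
# values are this post's estimates, not measurements.
def cpm(price_per_hr: float, tok_per_sec: float) -> float:
    return price_per_hr / (tok_per_sec * 3600 / 1_000_000)

print(cpm(4.22, 2500))  # H200 on-demand: ~$0.47 per 1M tokens
print(cpm(3.50, 3600))  # B200 spot, FP8: ~$0.27 per 1M tokens

# Hourly rate at which Gaudi 3 (~1,850 tok/s est.) matches H200's CPM:
print(cpm(4.22, 2500) * 1850 * 3600 / 1_000_000)  # ~$3.12/hr
```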
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
When Gaudi 3 Wins
Large-batch dense model inference with FP8, where cost matters. At batch 64-128 with a dense 70B model, the absolute throughput gap between Gaudi 3 and H200 is larger than at low batch, but Gaudi 3's lower hourly rate can still produce a competitive cost-per-token if pricing is favorable.
128 GB single-GPU VRAM for models that H100 cannot handle. The H100's 80 GB of HBM3 cannot serve a 70B FP8 model comfortably with meaningful KV cache. Gaudi 3's 128 GB fits the 70B FP8 weight set plus room for batch caching. This matters if your provider offers Gaudi 3 but not H200.
Teams already on Intel data center infrastructure. If your on-premise or hybrid setup already uses Intel servers and your ops team has experience with the Intel ecosystem, Gaudi 3 integration is lower-friction than it would be for a team starting from scratch.
Workloads that don't depend on FP4. If your model and serving pipeline run FP8 or BF16 only and you're not targeting Blackwell-specific optimizations, the FP4 gap is irrelevant.
When Gaudi 3 Loses
Long-context reasoning with multi-GPU KV cache sharing. NVLink's 900 GB/s (H200) and 1.8 TB/s (B200) far exceed what 200GbE provides for cross-GPU attention and KV cache transfers. For context windows above 32k tokens with multi-GPU tensor parallelism, NVLink topology is a practical requirement.
Fine-grained MoE routing (DeepSeek-class models). Expert dispatch over 200GbE adds meaningful latency compared to NVSwitch, and the throughput gap widens with concurrency. For any production MoE serving at scale, H200 or B200 is the right call.
B200-exclusive workloads with native FP4. For Blackwell-optimized models where FP4 doubles throughput over FP8, there is no Gaudi 3 equivalent. The gap is not a configuration difference; it is a hardware capability that Gaudi 3 does not have.
Cutting-edge inference features. If you're deploying speculative decoding variants, MRV2, or SGLang's RadixAttention scheduler, all of these are CUDA-only today. The vLLM-HPU fork lags upstream by several release cycles.
Teams without tolerance for porting effort. A vLLM-based CUDA deployment can generally move to a new NVIDIA GPU (H200 to B200, for example) with a container swap. Moving from CUDA vLLM to vLLM-HPU requires validating behavior, adjusting dtype flags, and re-testing each model variant. The porting cost is real and should be included in any cost-benefit analysis.
Migration Guide: Porting vLLM to Intel Gaudi 3
Step 1: Verify Model Compatibility with vLLM-HPU
Check the HabanaAI/vllm-fork README for the supported model list. Intel is also transitioning users to the vllm-gaudi plugin starting with vLLM 1.24.0 as a longer-term path; check the HabanaAI GitHub org for the current recommended approach before starting a new deployment. As of May 2026, supported architectures include:
- LlamaForCausalLM (LLaMA 2, 3, 3.1, 3.2, 3.3)
- MistralForCausalLM, MixtralForCausalLM
- Qwen2ForCausalLM
- FalconForCausalLM
- GPT-NeoX and related variants
Newer architectures (Llama 4's attention variant, some custom MoE patterns) may not be listed. Confirm the fork's current release tag and compare it against the upstream vLLM version that added your model's support. Do not assume that support in upstream vLLM implies support in vLLM-HPU.
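As a quick pre-check, the sketch below reads a model's declared architecture from its config.json on the Hugging Face Hub; the SUPPORTED set mirrors this post's list and will drift, so treat the fork README as authoritative:

```python
# Sketch: read a model's config.json from the Hugging Face Hub and compare
# its declared architecture against the fork's supported list. SUPPORTED
# mirrors the list above and will drift; the fork README is the source of
# truth. Gated repos (e.g. meta-llama) need an HF token configured.
import json
from huggingface_hub import hf_hub_download

SUPPORTED = {
    "LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM",
    "Qwen2ForCausalLM", "FalconForCausalLM", "GPTNeoXForCausalLM",
}

def hpu_fork_supports(model_id: str) -> bool:
    cfg_path = hf_hub_download(model_id, "config.json")
    with open(cfg_path) as f:
        archs = json.load(f).get("architectures", [])
    return any(a in SUPPORTED for a in archs)

print(hpu_fork_supports("meta-llama/Llama-3.3-70B-Instruct"))
```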
Step 2: Install the Intel Gaudi Software Stack
Pull the official Docker image from Intel's registry:
```
docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
```

This image includes SynapseAI, the HPU runtime, and PyTorch pre-installed. No manual SynapseAI driver installation is needed inside the container. Verify the HPU is visible:
python -c "import habana_frameworks.torch.core as htcore; print('HPU ready')"If this fails, the HPU runtime is not accessible, likely a missing driver on the host or a container runtime configuration issue.
Step 3: Adapt Your vLLM Serving Command for Gaudi 3
The key changes from a CUDA vLLM command:
- Replace `--device cuda` with `--device hpu`
- Remove `--dtype fp16` (fp16 is not optimal on HPU; use bfloat16 or float8)
- Adjust `--tensor-parallel-size` to match your Gaudi 3 HPU count
Before (CUDA):
```
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --device cuda \
  --dtype fp16 \
  --tensor-parallel-size 2
```

After (Gaudi 3 HPU):
```
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --device hpu \
  --dtype bfloat16 \
  --tensor-parallel-size 2
```

For FP8 inference on Gaudi 3, substitute `--dtype float8`. Run a warmup pass before benchmarking: the SynapseAI graph compiler compiles the compute graph on first use, which adds latency to initial requests.
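A minimal warmup sketch against the OpenAI-compatible endpoint started above (port 8000 is vLLM's default; adjust if you changed it):

```python
# Warmup sketch: send a few requests and discard their latencies, so the
# SynapseAI graph compilation cost does not pollute benchmark numbers.
# Assumes vLLM's default OpenAI-compatible endpoint on port 8000.
import requests

for _ in range(3):
    requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "meta-llama/Llama-3.3-70B-Instruct",
            "prompt": "Warmup request to trigger graph compilation.",
            "max_tokens": 64,
        },
        timeout=300,
    )
```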
Step 4: Run a Side-by-Side Benchmark Before Committing
Spin up both a Gaudi 3 instance (at your target provider) and an H200 or B200 instance on Spheron on-demand. Run the same model, batch size, and sequence lengths for at least 30 minutes on each. Pull tokens-per-second from the serving logs and calculate cost-per-million-tokens using:
```
CPM = ($/hr) / (tokens_per_sec × 3600 / 1,000,000)
```

Compare actual measured CPM against the estimates in this post. If Gaudi 3 delivers within 15% of H200's CPM at your batch size, and the hourly rate is lower, the economics favor Gaudi 3 for that workload. If throughput is significantly below the estimates here (which is possible given SynapseAI compilation variance), reconsider.
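The sketch below shows the measurement loop end to end as a rough serial probe; for batched throughput at your target concurrency, use concurrent clients or vLLM's own benchmark scripts, but the CPM arithmetic is identical:

```python
# Rough serial throughput probe plus the CPM arithmetic end to end. This is
# a sketch, not a load-testing tool. Endpoint and model name assume the
# setup from Step 3; substitute your instance's actual hourly rate.
import time
import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Summarize the tradeoffs between HBM capacity and bandwidth.",
    "max_tokens": 256,
}
HOURLY_RATE = 4.22  # substitute the instance's actual $/hr

tokens, start = 0, time.time()
for _ in range(50):
    r = requests.post(URL, json=PAYLOAD, timeout=300).json()
    tokens += r["usage"]["completion_tokens"]

tok_per_sec = tokens / (time.time() - start)
cpm = HOURLY_RATE / (tok_per_sec * 3600 / 1_000_000)
print(f"{tok_per_sec:.0f} tok/s -> ${cpm:.2f} per 1M tokens")
```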
Decision Matrix
| Situation | Choose |
|---|---|
| Dense 70B inference, batch 64+, cost matters and Gaudi 3 available | Gaudi 3 (if CPM proves out in benchmarks) |
| MoE model (DeepSeek-class), any batch size | B200 or H200 |
| Need FP4 and Blackwell-native quantization | B200 |
| Mature CUDA stack, minimal porting budget | H200 |
| Long-context (128k+ tokens), multi-GPU attention | H200 or B200 |
| Fine-grained speculative decoding or SGLang | H200 or B200 (CUDA-only features) |
| Benchmarking alternatives before committing | Spheron on-demand (switch GPU types per-hour) |
| On-premise Intel server ecosystem | Gaudi 3 OAM form factor |
Spheron Gaudi 3 Availability
Gaudi 3 does not currently appear in Spheron's live GPU marketplace as of 12 May 2026. Spheron's marketplace aggregates supply from multiple data center partners globally, and availability reflects what those partners offer at a given time.
For the most current Gaudi 3 availability, check current GPU pricing. If you need Gaudi 3 capacity at scale or want to discuss custom configurations, reach out directly via the Spheron app.
Spheron's neutral marketplace model is useful for exactly this comparison scenario: you can rent B200 on Spheron and run your workload today, then check Gaudi 3 availability at a later date without changing providers or signing a contract. The on-demand, per-minute billing makes A/B GPU comparisons practical.
Intel Gaudi 3 is worth benchmarking against H200 and B200 if your workload is dense, batch-heavy inference where CUDA's ecosystem premium does not translate to lower costs. Spheron lets you spin up both and compare on actual traffic before committing.
