Intel Gaudi 3 entered 2026 as the most credible non-NVIDIA option for dense LLM inference on GPU cloud. IBM Cloud, Dell PowerEdge XE9680 deployments, and several neocloud operators added Gaudi 3 capacity through 2025 and early 2026. The question teams keep asking is whether that 128 GB of HBM2e and a lower price tag justify trading the CUDA stack for SynapseAI. This post gives you the benchmark data, cost math, and migration steps to answer that question for your workload.
Quick Answer: Gaudi 3 vs H200 vs B200 at a Glance
| GPU | Best For | HBM | Interconnect | Spheron Price |
|---|---|---|---|---|
| Intel Gaudi 3 HPU | Large-batch dense inference, cost-sensitive serving | 128 GB HBM2e | 24x 200GbE RoCE v2 per OAM (4.8 Tbps aggregate) | Not currently listed (check /pricing/) |
| H200 SXM5 | 70B-class inference, mature CUDA stack | 141 GB HBM3e | NVLink 4.0 (900 GB/s) | From $4.22/hr on-demand ($1.76/hr spot) |
| B200 SXM6 | FP4 workloads, 100B+ models, highest throughput | 192 GB HBM3e | NVLink 5 (1.8 TB/s) | From $3.50/hr spot (no on-demand currently) |
Why Gaudi 3 Matters in 2026
Most of the GPU cloud discussion in 2025-2026 focused on AMD as the primary non-NVIDIA alternative. Intel's Gaudi 3 got less attention, but the hardware case for it is specific and real.
128 GB HBM2e per chip. That exceeds H100's 80 GB by 60% and sits in the same neighborhood as H200's 141 GB. For inference workloads where a 70B model at FP8 (~70 GB of weights) fits on a single chip with substantial KV cache headroom, Gaudi 3 provides a single-GPU fit that older NVIDIA Hopper chips could not offer. For a deeper look at how AMD's MI350X approaches the same large-memory use case with 288 GB HBM3E, see our AMD MI350X vs B200 comparison.
24x 200GbE RoCE v2 fabric. Each Gaudi 3 OAM has 24 ports of 200GbE RoCE v2 (4.8 Tbps aggregate per OAM). In an 8-OAM baseboard, most ports handle intra-baseboard all-to-all traffic, with a subset dedicated to scale-out beyond the node. The Gaudi 3 scale-out design uses standard Ethernet rather than NVSwitch: 200GbE is cheaper and more flexible for multi-rack scale-out, but it adds latency for workloads that need tight GPU-GPU communication. For dense batch inference where each GPU handles independent requests, this connectivity is adequate. For multi-GPU tensor parallelism with large KV cache sharing, NVLink wins decisively.
OAM form factor. Gaudi 3 uses the Open Accelerator Module standard, which allows multi-vendor server ecosystem support. Dell, Supermicro, and Lenovo have all produced OAM-based Gaudi 3 systems. This matters if you're building on-premise infrastructure or sourcing from non-NVIDIA-ecosystem integrators.
Neocloud capacity. IBM Cloud added Gaudi 3 instances in late 2025. Several GPU cloud providers now include Gaudi 3 in their capacity mix, which is what creates the cost-sensitive inference use case. Intel's developer cloud also provides on-demand access for evaluation.
Spec Sheet Comparison
| Specification | Intel Gaudi 3 | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Architecture | Intel Gaudi 3 (HL-325L) | NVIDIA Hopper (GH100) | NVIDIA Blackwell (GB100) |
| VRAM | 128 GB HBM2e | 141 GB HBM3e | 192 GB HBM3e |
| Memory Bandwidth | 3.7 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP8 TFLOPS | 1,835 (dense) | 3,958 (with sparsity; 1,979 dense) | 4,500 (dense) |
| FP16 TFLOPS | 459 | 1,979 (with sparsity) | 2,250 (dense) |
| FP4 Support | No | No | Yes (9,000 TFLOPS) |
| Multi-GPU Interconnect | 24x 200GbE RoCE v2 per OAM (4.8 Tbps aggregate) | NVLink 4 (900 GB/s) | NVLink 5 (1.8 TB/s) |
| TDP | 900W* | 700W | 1,000W |
| Form Factor | OAM | SXM5 | SXM6 |
*900W applies to the HL-325L OAM air-cooled SKU. The PCIe AIC variant of Gaudi 3 has a 600W TDP. Factor this in when planning PCIe-based deployments.
The memory bandwidth gap is the most important number here. At 3.7 TB/s, Gaudi 3 has 77% of H200's bandwidth and 46% of B200's. For memory-bandwidth-bound inference (batch size 1-8, decode-heavy workloads), this translates directly into throughput. For a broader comparison of H200 and B200 architecture including NVLink generations, see the H200 vs B200 vs GB200 comparison.
You can rent H200 on-demand instances directly on Spheron. The AMD MI350X vs B200 guide covers the Blackwell architecture in more depth.
LLM Inference Benchmarks
Data source note: Gaudi 3 third-party benchmark data is limited as of May 2026. Intel has not submitted Gaudi 3 results to MLPerf Inference v6.0 for data center LLM tasks (verified against the MLCommons v6.0 announcement). Intel's participation in v6 covered Xeon 6 CPU and Arc Pro GPU workloads, so no MLPerf v6 Gaudi 3 numbers are available. See our MLPerf Inference v6 results breakdown for context. The figures below combine Intel vendor-supplied data and bandwidth-scaling projections from established H200 reference points. Each row is labeled by source.
Llama 3.3 70B FP8 - Throughput by Batch Size
Throughput is measured in tokens per second per GPU. This is the canonical dense 70B benchmark and the fairest comparison point: all three GPUs support FP8, and no FP4 advantage tilts the B200 numbers.
| GPU | Batch 1 (tok/s) | Batch 8 (tok/s) | Batch 32 (tok/s) | Batch 128 (tok/s) | Source |
|---|---|---|---|---|---|
| Intel Gaudi 3 | ~520 (est.) | ~1,350 (est.) | ~1,850 (est.) | ~2,100 (est.) | Bandwidth-scaling projection from Intel vendor data (Gaudi 3 vs H100 LLaMA 2 70B, habana.ai) |
| H200 SXM5 | ~680 | ~1,800 | ~2,500 | ~3,200 | vLLM community benchmarks, SemiAnalysis InferenceX reference data |
| B200 SXM6 | ~1,150 | ~3,000 | ~3,600 | ~4,400 | Bandwidth-scaling from SemiAnalysis InferenceX; FP8 mode (not FP4) |
At batch 1, Gaudi 3's bandwidth deficit shows clearly: approximately 76% of H200's throughput and 45% of B200's. As batch size grows, the workload shifts from memory-bound to compute-bound, and the gap widens further: Gaudi 3's dense FP8 compute (1,835 TFLOPS) trails H200's (1,979 dense; 3,958 with sparsity) and B200's (4,500 dense), and the mature CUDA kernel stack tends to sustain higher utilization at large batch.
These are projections, not measured lab results. Real performance depends on HPU driver version, SynapseAI graph compilation overhead, and the specific model's attention pattern. Intel's published vendor data shows Gaudi 3 reaching approximately 2x H100 throughput for LLaMA 2 70B at large batch sizes (habana.ai, vendor claim), but H100 is a lower baseline than H200, and the claim covers specific scenarios that may not match your production configuration.
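To make the projection method concrete, here is the bandwidth-scaling arithmetic behind the estimated rows, as a minimal sketch (the 680 tok/s batch-1 H200 reference comes from the table above):

```python
# Bandwidth-scaling projection used for the estimated rows: scale a measured
# H200 reference point by the memory-bandwidth ratio. Valid only in the
# memory-bound regime (low batch, decode-heavy); at high batch the compute
# gap dominates and this method flatters Gaudi 3.
H200_BW, GAUDI3_BW, B200_BW = 4.8, 3.7, 8.0  # TB/s

def project(h200_tok_s: float, target_bw: float) -> float:
    return h200_tok_s * target_bw / H200_BW

print(project(680, GAUDI3_BW))  # ~524 tok/s, vs the ~520 batch-1 estimate
print(project(680, B200_BW))    # ~1,133 tok/s, vs ~1,150 in the table
```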
DeepSeek V4 MoE: Where Gaudi 3 Struggles
Fine-grained mixture-of-experts models like DeepSeek V4 route each token to a subset of experts. In a multi-GPU setup, expert weights are sharded across GPUs, and routing causes frequent small-volume transfers between GPUs during each forward pass.
Over NVLink 5 (B200) at 1.8 TB/s, these expert dispatch transfers are fast. Over 200GbE (Gaudi 3) at 4.8 Tbps aggregate per OAM but with higher per-hop latency than NVSwitch, they are slower, and the latency compounds with more experts and higher concurrency.
For DeepSeek V4's fine-grained MoE routing at batch sizes above 32, expect Gaudi 3 to deliver meaningfully lower throughput than H200 or B200. This is the single biggest Gaudi 3 weakness for production MoE inference. No independent benchmark data for DeepSeek V4 on Gaudi 3 is available as of May 2026; the projection is based on known 200GbE vs NVLink latency differentials.
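A toy cost model makes the latency argument concrete; every input below is an illustrative assumption, not a measured figure for either fabric:

```python
# Toy cost model for one expert-dispatch hop: time = per-hop latency +
# payload / link bandwidth. All four inputs are illustrative assumptions,
# not measured values for either fabric.
def dispatch_us(payload_bytes: int, hop_latency_us: float, link_gbps: float) -> float:
    # Gbps -> bits per microsecond, so the result is in microseconds
    return hop_latency_us + payload_bytes * 8 / (link_gbps * 1e3)

# A ~16 KB per-token expert payload is latency-dominated on Ethernet:
print(dispatch_us(16_384, 2.0, 200))    # hypothetical RoCE v2 hop: ~2.7 us
print(dispatch_us(16_384, 0.3, 7_200))  # hypothetical NVLink 5 hop: ~0.3 us
```

Because per-token expert payloads are small, the per-hop latency term sets the floor, which is why aggregate Tbps figures understate the MoE penalty.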
Qwen 3 72B Dense: A More Competitive Scenario
Dense models avoid the MoE routing penalty. For Qwen 3 72B in FP8, all expert weights are always loaded, inter-GPU communication consists of activation transfers rather than expert dispatch, and 200GbE is less of a bottleneck.
At batch 32 with Qwen 3 72B FP8 on a 2-GPU configuration:
| GPU Config | Est. Throughput (tok/s) | Note |
|---|---|---|
| 2x Gaudi 3 (tensor-parallel) | ~2,600 (est.) | Bandwidth scaling; no independent data |
| 2x H200 SXM5 | ~3,800 | vLLM community benchmarks |
| 2x B200 SXM6 | ~6,200 (est.) | FP8 mode, bandwidth scaling |
The gap narrows compared to MoE models but does not close. Gaudi 3 is more competitive for dense models, but H200 still leads on per-GPU throughput. For teams where the H200 hourly rate is the constraint (not the CUDA stack), Gaudi 3 at a lower price point could produce comparable cost-per-token for this workload.
Our GPU Cost Per Token Benchmark 2026 covers A100, H100, H200, and B200 with measured throughput; use those numbers as your H200/B200 baseline when calculating how Gaudi 3 would compare for your specific use case.
Software Stack Reality Check
| Dimension | Intel Gaudi 3 | NVIDIA H200 / B200 |
|---|---|---|
| Primary SDK | SynapseAI 1.x | CUDA 12.x |
| Inference Framework | vLLM-HPU fork (HabanaAI), OPEA | vLLM, TensorRT-LLM, SGLang |
| Container Registry | vault.habana.ai | NGC (nvcr.io) |
| Attention Implementation | Gaudi-native fused attention (not FlashAttention 3) | FlashAttention 3 (Hopper/Blackwell) |
| HuggingFace Support | optimum-habana library | Native transformers + accelerate |
| Model Coverage | Growing: LLaMA, Mistral, Qwen 2, Falcon supported; newer architectures vary | Near-universal |
| Quantization Options | FP8, BF16 | FP8, FP4 (B200), BF16, INT8 |
The practical constraint is the vLLM-HPU fork. The HabanaAI/vllm-fork repository tracks upstream vLLM with a lag, typically several weeks to months. Teams that rely on recent vLLM features will find them absent on Gaudi 3. Specific gaps as of May 2026 include:
- MRV2 (Model Runner v2): Available in upstream vLLM, not yet in the HPU fork. For context on what MRV2 enables for production deployments, see our vLLM production deployment guide.
- Eagle3 speculative decoding: CUDA-only in upstream vLLM.
- SGLang: Not supported on Gaudi 3; the RadixAttention scheduler has no HPU backend.
- TensorRT-LLM: NVIDIA-only by design. Teams using TensorRT-LLM for B200 deployments will need a complete stack swap to target Gaudi 3. For a production TensorRT-LLM setup, see our TensorRT-LLM deployment guide.
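Before debugging a missing feature, confirm which vLLM build the container actually ships and that the device is visible; a minimal sketch, assuming the standard Gaudi PyTorch bridge exposes torch.hpu as documented by Intel:

```python
# Sanity check inside the Gaudi container: which vLLM build is installed,
# and does PyTorch see the HPU? The torch.hpu namespace is registered by
# importing habana_frameworks.torch (per Intel's Gaudi PyTorch docs).
import vllm
import torch
import habana_frameworks.torch  # noqa: F401  (registers the 'hpu' device)

print("vLLM build:", vllm.__version__)
print("HPU available:", torch.hpu.is_available())
print("HPU count:", torch.hpu.device_count())
```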
OPEA (Open Platform for Enterprise AI) is Intel's multi-component pipeline framework for Gaudi. It is useful for RAG pipelines and multi-model serving on Intel hardware, but it adds orchestration complexity that single-model serving via vLLM-HPU avoids. If your use case is simple model serving, start with vLLM-HPU before evaluating OPEA.
Cost-Per-Million-Tokens Analysis
Gaudi 3 is not listed on Spheron's marketplace as of 12 May 2026. For H200 and B200, the live API shows the following rates:
| GPU | $/hr | Est. Llama 3.3 70B tok/s (FP8, batch 32) | Cost per 1M tokens |
|---|---|---|---|
| Intel Gaudi 3 | Not listed (check /pricing/) | ~1,850 (est.) | Not calculable (check /pricing/) |
| H200 SXM5 | $4.22 on-demand | ~2,500 | ~$0.47 |
| B200 SXM6 | $3.50 spot | ~3,600 (FP8) | ~$0.27 |
| B200 SXM6 | $3.50 spot | ~10,000 (FP4) | ~$0.10 |
Formula: CPM = ($/hr) / (tokens_per_sec × 3600 / 1,000,000)
At FP8, B200 has a lower cost-per-token than H200 at this batch size: $0.27 vs $0.47. B200's advantage grows further with FP4 quantization ($0.10/M), but FP4 requires Blackwell-optimized model weights and a calibrated quantization step that Gaudi 3 cannot match.
If Gaudi 3 were available at a significantly lower hourly rate than H200, the math could flip: at the same ~1,850 tok/s at batch 32, Gaudi 3 would need to be priced below $3.12/hr to deliver a lower CPM than H200. Whether that pricing exists depends on the specific provider and current supply.
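The same arithmetic in code, including the break-even rate (throughput values are this post's estimates, not measurements):

```python
# The CPM formula and the break-even arithmetic from above. Throughput
# values are this post's estimates, not measurements.
def cpm(price_per_hr: float, tok_per_sec: float) -> float:
    return price_per_hr / (tok_per_sec * 3600 / 1_000_000)

print(cpm(4.22, 2500))  # H200 on-demand: ~$0.47 per 1M tokens
print(cpm(3.50, 3600))  # B200 spot, FP8: ~$0.27 per 1M tokens

# Hourly rate at which Gaudi 3 (~1,850 tok/s est.) matches H200's CPM:
print(cpm(4.22, 2500) * 1850 * 3600 / 1_000_000)  # ~$3.12/hr
```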
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
When Gaudi 3 Wins
Large-batch dense model inference with FP8, where cost matters. At batch 64-128 with a dense 70B model, the absolute throughput gap between Gaudi 3 and H200 is larger than at low batch, but Gaudi 3's lower hourly rate can still produce a competitive cost-per-token if pricing is favorable.
128 GB single-GPU VRAM for models that H100 cannot handle. The H100's 80 GB of HBM3 cannot serve a 70B FP8 model comfortably with meaningful KV cache. Gaudi 3's 128 GB fits the 70B FP8 weight set plus room for batch caching. This matters if your provider offers Gaudi 3 but not H200.
Teams already on Intel data center infrastructure. If your on-premise or hybrid setup already uses Intel servers and your ops team has experience with the Intel ecosystem, Gaudi 3 integration is lower-friction than it would be for a team starting from scratch.
Workloads that don't depend on FP4. If your model and serving pipeline run FP8 or BF16 only and you're not targeting Blackwell-specific optimizations, the FP4 gap is irrelevant.
When Gaudi 3 Loses
Long-context reasoning with multi-GPU KV cache sharing. NVLink's 900 GB/s (H200) and 1.8 TB/s (B200) far exceed what 200GbE provides for cross-GPU attention and KV cache transfers. For context windows above 32k tokens with multi-GPU tensor parallelism, NVLink topology is a practical requirement.
Fine-grained MoE routing (DeepSeek-class models). Expert dispatch over 200GbE adds meaningful latency compared to NVSwitch, and the throughput gap widens with concurrency. For any production MoE serving at scale, H200 or B200 is the right call.
B200-exclusive workloads with native FP4. For Blackwell-optimized models where FP4 doubles throughput over FP8, there is no Gaudi 3 equivalent. The gap is not a configuration difference; it is a hardware capability that Gaudi 3 does not have.
Cutting-edge inference features. If you're deploying speculative decoding variants, MRV2, or SGLang's RadixAttention scheduler, all of these are CUDA-only today. The vLLM-HPU fork lags upstream by several release cycles.
Teams without tolerance for porting effort. A vLLM-based CUDA deployment can generally move to a new NVIDIA GPU (H200 to B200, for example) with a container swap. Moving from CUDA vLLM to vLLM-HPU requires validating behavior, adjusting dtype flags, and re-testing each model variant. The porting cost is real and should be included in any cost-benefit analysis.
Migration Guide: Porting vLLM to Intel Gaudi 3
Step 1: Verify Model Compatibility with vLLM-HPU
Check the HabanaAI/vllm-fork README for the supported model list. Intel is also transitioning users to the vllm-gaudi plugin starting with vLLM 1.24.0 as a longer-term path; check the HabanaAI GitHub org for the current recommended approach before starting a new deployment. As of May 2026, supported architectures include:
- LlamaForCausalLM (LLaMA 2, 3, 3.1, 3.2, 3.3)
- MistralForCausalLM, MixtralForCausalLM
- Qwen2ForCausalLM
- FalconForCausalLM
- GPT-NeoX and related variants
Newer architectures (Llama 4's attention variant, some custom MoE patterns) may not be listed. Confirm the fork's current release tag and compare it against the upstream vLLM version that added your model's support. Do not assume that support in upstream vLLM implies support in vLLM-HPU.
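As a quick pre-check, the sketch below reads a model's declared architecture from its config.json on the Hugging Face Hub; the SUPPORTED set mirrors this post's list and will drift, so treat the fork README as authoritative:

```python
# Sketch: read a model's config.json from the Hugging Face Hub and compare
# its declared architecture against the fork's supported list. SUPPORTED
# mirrors the list above and will drift; the fork README is the source of
# truth. Gated repos (e.g. meta-llama) need an HF token configured.
import json
from huggingface_hub import hf_hub_download

SUPPORTED = {
    "LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM",
    "Qwen2ForCausalLM", "FalconForCausalLM", "GPTNeoXForCausalLM",
}

def hpu_fork_supports(model_id: str) -> bool:
    cfg_path = hf_hub_download(model_id, "config.json")
    with open(cfg_path) as f:
        archs = json.load(f).get("architectures", [])
    return any(a in SUPPORTED for a in archs)

print(hpu_fork_supports("meta-llama/Llama-3.3-70B-Instruct"))
```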
Step 2: Install the Intel Gaudi Software Stack
Pull the official Docker image from Intel's registry:
```
docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
```

This image includes SynapseAI, the HPU runtime, and PyTorch pre-installed. No manual SynapseAI driver installation is needed inside the container. Verify the HPU is visible:
python -c "import habana_frameworks.torch.core as htcore; print('HPU ready')"If this fails, the HPU runtime is not accessible, likely a missing driver on the host or a container runtime configuration issue.
Step 3: Adapt Your vLLM Serving Command for Gaudi 3
The key changes from a CUDA vLLM command:
- Replace `--device cuda` with `--device hpu`
- Remove `--dtype fp16` (fp16 is not optimal on HPU; use bfloat16 or float8)
- Adjust `--tensor-parallel-size` to match your Gaudi 3 HPU count
Before (CUDA):
```
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --device cuda \
  --dtype fp16 \
  --tensor-parallel-size 2
```

After (Gaudi 3 HPU):
```
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --device hpu \
  --dtype bfloat16 \
  --tensor-parallel-size 2
```

For FP8 inference on Gaudi 3, substitute `--dtype float8`. Run a warmup pass before benchmarking: the SynapseAI graph compiler compiles the compute graph on first use, which adds latency to initial requests.
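A minimal warmup sketch against the OpenAI-compatible endpoint started above (port 8000 is vLLM's default; adjust if you changed it):

```python
# Warmup sketch: send a few requests and discard their latencies, so the
# SynapseAI graph compilation cost does not pollute benchmark numbers.
# Assumes vLLM's default OpenAI-compatible endpoint on port 8000.
import requests

for _ in range(3):
    requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "meta-llama/Llama-3.3-70B-Instruct",
            "prompt": "Warmup request to trigger graph compilation.",
            "max_tokens": 64,
        },
        timeout=300,
    )
```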
Step 4: Run a Side-by-Side Benchmark Before Committing
Spin up both a Gaudi 3 instance (at your target provider) and an H200 or B200 instance on Spheron on-demand. Run the same model, batch size, and sequence lengths for at least 30 minutes on each. Pull tokens-per-second from the serving logs and calculate cost-per-million-tokens using:
```
CPM = ($/hr) / (tokens_per_sec × 3600 / 1,000,000)
```

Compare actual measured CPM against the estimates in this post. If Gaudi 3 delivers within 15% of H200's CPM at your batch size, and the hourly rate is lower, the economics favor Gaudi 3 for that workload. If throughput is significantly below the estimates here (which is possible given SynapseAI compilation variance), reconsider.
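The sketch below shows the measurement loop end to end as a rough serial probe; for batched throughput at your target concurrency, use concurrent clients or vLLM's own benchmark scripts, but the CPM arithmetic is identical:

```python
# Rough serial throughput probe plus the CPM arithmetic end to end. This is
# a sketch, not a load-testing tool. Endpoint and model name assume the
# setup from Step 3; substitute your instance's actual hourly rate.
import time
import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Summarize the tradeoffs between HBM capacity and bandwidth.",
    "max_tokens": 256,
}
HOURLY_RATE = 4.22  # substitute the instance's actual $/hr

tokens, start = 0, time.time()
for _ in range(50):
    r = requests.post(URL, json=PAYLOAD, timeout=300).json()
    tokens += r["usage"]["completion_tokens"]

tok_per_sec = tokens / (time.time() - start)
cpm = HOURLY_RATE / (tok_per_sec * 3600 / 1_000_000)
print(f"{tok_per_sec:.0f} tok/s -> ${cpm:.2f} per 1M tokens")
```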
Decision Matrix
| Situation | Choose |
|---|---|
| Dense 70B inference, batch 64+, cost matters and Gaudi 3 available | Gaudi 3 (if CPM proves out in benchmarks) |
| MoE model (DeepSeek-class), any batch size | B200 or H200 |
| Need FP4 and Blackwell-native quantization | B200 |
| Mature CUDA stack, minimal porting budget | H200 |
| Long-context (128k+ tokens), multi-GPU attention | H200 or B200 |
| Fine-grained speculative decoding or SGLang | H200 or B200 (CUDA-only features) |
| Benchmarking alternatives before committing | Spheron on-demand (switch GPU types per-hour) |
| On-premise Intel server ecosystem | Gaudi 3 OAM form factor |
Spheron Gaudi 3 Availability
Gaudi 3 does not currently appear in Spheron's live GPU marketplace as of 12 May 2026. Spheron's marketplace aggregates supply from multiple data center partners globally, and availability reflects what those partners offer at a given time.
For the most current Gaudi 3 availability, check current GPU pricing. If you need Gaudi 3 capacity at scale or want to discuss custom configurations, reach out directly via the Spheron app.
Spheron's neutral marketplace model is useful for exactly this comparison scenario: you can rent B200 on Spheron and run your workload today, then check Gaudi 3 availability at a later date without changing providers or signing a contract. The on-demand, per-minute billing makes A/B GPU comparisons practical.
Intel Gaudi 3 is worth benchmarking against H200 and B200 if your workload is dense, batch-heavy inference where CUDA's ecosystem premium does not translate to lower costs. Spheron lets you spin up both and compare on actual traffic before committing.
