SambaNova's headline for 2026 is top-of-class tokens per second on large MoE models using a 3-tier memory architecture that NVIDIA chips don't replicate. The claim is real in specific conditions. But the RDU is not a GPU, and the performance numbers come with architectural constraints that matter enormously for most real workloads. This post covers the architecture, benchmarks against H200 and B200 on Spheron, cost-per-million-token math using live pricing, and a clear decision framework for when each option wins.
For a structural analogy, see the Cerebras WSE-3 vs H100 comparison, which covers a similar ASIC-vs-GPU tradeoff for a different chip class.
Quick Answer: SN40L vs H200 vs B200 at a Glance
| System | Best For | Memory | Interconnect | Price |
|---|---|---|---|---|
| SambaNova SN40L (8-RDU node) | Large MoE inference at batch 1-4, zero-ops managed API | 64 GB HBM3/RDU + 1.5 TB DDR5/RDU (12 TB node total) | RDULink (proprietary, intra-node) | Per-token API billing (no hourly rental) |
| H200 SXM5 | 70B-class inference, training, mature CUDA stack | 141 GB HBM3e | NVLink 4 (900 GB/s) | From $4.62/hr on-demand ($1.40/hr spot) |
| B200 SXM6 | FP4 workloads, 100B+ models, highest throughput | 192 GB HBM3e | NVLink 5 (1.8 TB/s) | From $7.00/hr on-demand ($2.14/hr spot) |
Prices current as of 30 May 2026 and subject to change. See the cost-per-million-token section below for detailed on-demand and spot CPM math.
The core difference: SambaNova sells managed inference access at a per-token price. Spheron sells bare-metal H200 and B200 instances billed by the hour. These are different purchasing models, and the right choice depends on your batch size, model architecture, and ops investment.
What the SambaNova RDU Actually Is
The RDU (Reconfigurable Dataflow Unit) executes model graphs differently from a GPU. A GPU runs SIMT (Single Instruction, Multiple Threads): it schedules batches of independent operations across thousands of CUDA cores, using HBM as the main working memory. The RDU uses a dataflow execution model: the model is compiled by SambaNova Composer into a static graph that pipelines a single request through the hardware, reading weights from whichever memory tier holds them.
The 3-tier memory hierarchy is the architectural feature that makes SN40L relevant for large MoE models:
- On-chip SRAM (~520 MB per RDU): Fastest tier. Holds activations for active layers during execution.
- HBM3 (64 GB per RDU): Medium-speed tier. Holds weights for the layers actively in use.
- DDR5 (1.5 TB per RDU socket, up to 12 TB per 8-RDU node): Slow tier. Holds cold expert weights that are not currently being called.
For a MoE model like Llama 4 Maverick (17B active parameters, 128 experts, 400B+ total parameters), most expert weights are inactive at any given token. On a standard GPU cluster, you need to spread the full model across enough HBM to fit all experts simultaneously. On the SN40L, cold experts live in DDR5 and are fetched on demand. This eliminates the need for an 8x NVLink tensor-parallel configuration just to achieve memory fit.
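To make the memory-fit argument concrete, here is a minimal sketch that compares weight footprint against each system's capacity. The capacities are the figures quoted in this post. Everything else is a simplifying assumption for illustration: weights only (no KV cache or activations), 2 bytes per parameter for BF16, 12 TB treated as 12,000 GB, and hypothetical function names.

```python
# Capacities in GB as quoted in this post (12 TB DDR5 treated as 12,000 GB).
SYSTEMS_GB = {
    "1x H200 (HBM)": 141,
    "8x H200 (HBM)": 8 * 141,
    "SN40L node (HBM only)": 8 * 64,
    "SN40L node (HBM + DDR5)": 8 * 64 + 12_000,
}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param


def fits(params_billion: float, capacity_gb: float, bytes_per_param: float) -> bool:
    """True if the weights alone fit (ignores KV cache and activations)."""
    return weight_gb(params_billion, bytes_per_param) <= capacity_gb


TOTAL_B, ACTIVE_B = 400, 17   # Llama 4 Maverick: total vs active parameters
BF16 = 2                      # bytes per parameter (assumption)

for name, capacity in SYSTEMS_GB.items():
    verdict = "fits" if fits(TOTAL_B, capacity, BF16) else "does not fit"
    print(f"{name}: full model {verdict}")

# The active slice is small enough for a single RDU's 64 GB of HBM.
print("active params in one RDU's HBM:", fits(ACTIVE_B, 64, BF16))
```

Under these assumptions the full BF16 model overflows the node's 512 GB of HBM but sits comfortably once DDR5 is counted, while the 17B active slice fits in one RDU's HBM. That is the case the DDR5 tier is built for.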
For dense models, the DDR5 tier provides no advantage. Dense transformers activate every parameter on every token pass. What matters is HBM bandwidth, and the H200's 4.8 TB/s and B200's 8.0 TB/s substantially outrun the SN40L node's estimated ~2 TB/s effective HBM bandwidth.
SambaNova Composer is the proprietary layer that compiles your model graph for the RDU. Unlike CUDA, which accepts arbitrary kernel code, Composer only supports architectures it can map to the RDU dataflow. This means no vLLM, no SGLang, no custom attention kernels. You access the hardware through SambaNova Cloud's API, not through a bare-metal SSH session.
For the broader picture of how various inference ASIC vendors approach these tradeoffs differently, see the Etched AI Sohu vs NVIDIA comparison and the Intel Gaudi 3 vs H200 and B200 guide.
Spec Sheet: SN40L vs H200 vs B200
| Specification | SambaNova SN40L (8-RDU node) | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Architecture | RDU (Reconfigurable Dataflow Unit) | NVIDIA Hopper (GH100) | NVIDIA Blackwell (GB100) |
| On-chip SRAM | ~520 MB per RDU | N/A | N/A |
| HBM | 64 GB HBM3 per RDU (512 GB total node) | 141 GB HBM3e | 192 GB HBM3e |
| DDR | 1.5 TB DDR5 per RDU (12 TB per 8-RDU node) | N/A | N/A |
| Memory Bandwidth (HBM) | ~2 TB/s (estimated per node) | 4.8 TB/s | 8.0 TB/s |
| FP8 TFLOPS | ~800 (estimated, inferred from throughput data) | 3,958 (with sparsity; ~1,979 dense) | 4,500 (dense) |
| FP4 Support | No | No | Yes (9,000 TFLOPS) |
| Multi-chip Interconnect | RDULink (proprietary, intra-node) | NVLink 4 (900 GB/s) | NVLink 5 (1.8 TB/s) |
| Training Support | No (inference-only) | Full CUDA training | Full CUDA training |
| Custom Kernels | No (Composer-managed only) | Yes (CUDA/Triton) | Yes (CUDA/Triton) |
| Access Model | Managed API (SambaNova Cloud) | Bare-metal rental (Spheron) | Bare-metal rental (Spheron) |
Note on SN40L FLOPs: SambaNova does not publish per-chip raw TFLOPS figures. The ~800 FP8 TFLOPS estimate is inferred from published throughput claims against known model sizes and is not an official specification.
For a detailed H200 vs B200 generation-over-generation comparison, see the NVIDIA H200 vs B200 vs GB200 guide.
LLM Inference Benchmarks
Source note: SambaNova does not submit to MLPerf data center inference. SN40L throughput figures combine SambaNova-published benchmark claims (Llama 4 Maverick) and SemiAnalysis estimates (Llama 3.3 70B). H200 and B200 figures are SemiAnalysis InferenceX estimates for Llama 4 Maverick and vLLM production measurements for Llama 3.3 70B. Each row is labeled by source.
Llama 4 Maverick (17B/128E MoE, ~400B Total Parameters): Throughput (tokens/sec per system)
Llama 4 Maverick is the right model to use here. It has 128 experts and activates 17B parameters per forward pass. The DDR5 cold-storage advantage on SN40L is most visible at low batch sizes where the full expert set does not need to fit in HBM simultaneously.
| System | Batch 1 | Batch 8 | Batch 32 | Memory Fit | Source |
|---|---|---|---|---|---|
| SambaNova SN40L (8-RDU node) | ~2,800 | ~4,200 | ~5,100 | Yes (DDR5 expert spill) | SambaNova published |
| H200 SXM5 x8 | ~420 | ~2,800 | ~7,400 | Marginal (141 GB each) | SemiAnalysis InferenceX est. |
| B200 SXM6 x8 | ~580 | ~3,900 | ~10,200 | Yes (192 GB each) | SemiAnalysis InferenceX est. |
The SN40L has a real advantage at batch 1 and batch 8 on this model. A single 8-RDU node serves Llama 4 Maverick at 2,800 tokens/sec without tensor parallelism complexity. At batch 32, H200 x8 and B200 x8 overtake it because HBM bandwidth matters more than memory fit once requests are batched.
Llama 3.3 70B (Dense): Throughput (tokens/sec per system)
Dense models do not benefit from DDR5 cold storage. Every parameter activates on every token.
| System | Batch 1 | Batch 8 | Batch 32 | Source |
|---|---|---|---|---|
| SambaNova SN40L (8-RDU node) | ~380 | ~1,600 | ~2,100 | SemiAnalysis est. |
| H200 SXM5 x1 | ~340 | ~1,800 | ~4,200 | vLLM production measurement |
| B200 SXM6 x1 | ~480 | ~2,600 | ~6,100 | vLLM production measurement |
A single H200 outperforms or matches the 8-RDU SN40L node on Llama 3.3 70B at batch 8 and beyond. B200 is faster at every batch size for dense models.
Time to First Token (TTFT) and Inter-Token Latency (ITL)
SambaNova's dataflow model compiles models into a static graph executed from a pre-warmed state. There is no per-request weight loading from HBM into compute cores. This gives SN40L low TTFT on large MoE models at batch 1 because the compiled graph is already resident and warm when the request arrives.
At batch 1 for Llama 4 Maverick, SambaNova publishes TTFT under 100ms. An H200 x8 vLLM setup at batch 1 on the same model typically shows 300-700ms TTFT depending on tensor parallel configuration. This is a real advantage for latency-sensitive applications at low concurrency.
At batch 32, the ITL (inter-token latency, time per output token) advantage flips. H200 and B200's higher HBM bandwidth sustains lower per-token latency under load. SN40L's ITL degrades more at high concurrency because the DDR5 fetch bandwidth is fixed and shared across requests.
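If you want to measure both metrics yourself, a minimal sketch is below. It assumes you record a timestamp (for example with `time.perf_counter()`) when the request is sent and again as each streamed token arrives; the function name and return keys are illustrative, not part of any vendor API.

```python
from statistics import mean


def latency_stats(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and mean ITL (both in seconds) from arrival timestamps.

    request_start: time the request was sent.
    token_times:   arrival time of each streamed output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: delay between sending the request and the first token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl}


# Example with synthetic timestamps: first token at 0.5 s, then one every 0.25 s.
print(latency_stats(0.0, [0.5, 0.75, 1.0]))
```

Run it at your real concurrency level on both systems; single-request numbers say little about ITL under load.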
For a deeper breakdown of TTFT vs ITL tradeoffs and how to set latency budgets, refer to SambaNova's published latency documentation.
Cost-Per-Million-Tokens: SambaNova Cloud vs Spheron H200/B200
This is the decision-critical calculation. Use the formula for GPU cloud:
CPM = ($/hr per node) / (tokens_per_sec x 3600 / 1,000,000)
For SambaNova Cloud, CPM is the direct per-token price from their API pricing page. There is no hourly rate, no batch-scaling discount, and no spot pricing. The managed API charges per output token regardless of your concurrency.
| System | Model | Batch | Tokens/sec (system) | Cost | CPM |
|---|---|---|---|---|---|
| SambaNova Cloud | Llama 4 Maverick | N/A (managed) | N/A | ~$6/M output tokens | $6.00 |
| SambaNova Cloud | Llama 3.3 70B | N/A (managed) | N/A | ~$0.70/M output tokens | $0.70 |
| H200 SXM5 x8 on-demand | Llama 4 Maverick | 32 | ~7,400 | $36.96/hr | $1.39 |
| B200 SXM6 x8 on-demand | Llama 4 Maverick | 32 | ~10,200 | $56.00/hr | $1.52 |
| B200 SXM6 x8 spot | Llama 4 Maverick | 32 | ~10,200 | $17.11/hr | $0.47 |
| H200 SXM5 x1 spot | Llama 3.3 70B | 8 | ~1,800 | $1.40/hr | $0.22 |
SambaNova Cloud pricing sourced from cloud.sambanova.ai public pricing page, May 2026. Verify current rates before making budget decisions.
At batch 32, GPU cloud on Spheron delivers Llama 4 Maverick at $0.47/M tokens on spot and $1.39-1.52/M tokens on-demand. SambaNova Cloud charges approximately $6/M output tokens for the same model. The GPU cloud advantage at scale remains substantial.
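The CPM formula is simple enough to script, which helps when you plug in your own measured throughput. This sketch reproduces three rows of the table from their hourly cost and tokens/sec; the function name is illustrative.

```python
def cpm(hourly_usd: float, tokens_per_sec: float) -> float:
    """Cost per million tokens for hardware billed by the hour."""
    million_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return hourly_usd / million_tokens_per_hour


# (label, $/hr for the whole system, tokens/sec for the whole system)
rows = [
    ("H200 x8 on-demand, Llama 4 Maverick, batch 32", 36.96, 7_400),
    ("B200 x8 spot, Llama 4 Maverick, batch 32", 17.11, 10_200),
    ("H200 x1 spot, Llama 3.3 70B, batch 8", 1.40, 1_800),
]

for label, hourly, tps in rows:
    print(f"{label}: ${cpm(hourly, tps):.2f}/M tokens")
```

Note that the result scales inversely with throughput: halve your real tokens/sec and the CPM doubles, so use numbers measured at your production batch size.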
The SambaNova argument is not about CPM at high batch. It is about: zero infrastructure overhead, no Docker, no vLLM configuration, no CUDA driver management, and a flat per-token price regardless of whether you are serving one request or one hundred concurrently. For teams without dedicated ML infrastructure engineers, that operational value is real. For teams already running production GPU infrastructure, the CPM math is hard to ignore.
Pricing fluctuates based on GPU availability. The prices above are as of 30 May 2026 and may have changed. Check current GPU pricing for live rates.
Where SN40L Wins
Large MoE models at batch 1-4. For Llama 4 Maverick, DeepSeek V3, and Mixtral 8x22B at low concurrency, the SN40L's DDR5 expert spill gives a single 8-RDU node memory fit that eliminates tensor parallelism overhead. A single SN40L node at batch 1 delivers 2,800 tokens/sec on Llama 4 Maverick, which no single H200 can match for this model due to memory fit constraints.
Long-context decode on models that exceed H200's 141 GB. The DDR5 tier acts as an overflow for KV cache at very long contexts. Teams running models that sit between 141 GB and 512 GB would need a 4-way H200 tensor parallel config; on SN40L, the DDR5 tier provides that capacity automatically with less configuration.
Zero-ops managed inference. SambaNova Cloud handles all hardware management, software updates, and capacity planning. Teams that want GPU-class performance for a specific MoE workload without building an inference platform can get a production-grade API endpoint in hours rather than days.
Where H200 and B200 Win
Training and fine-tuning. SN40L is inference-only. The RDU architecture has no gradient computation path in SambaNova's released tooling. Spheron H200 and B200 instances run full CUDA training pipelines using Axolotl, Megatron-LM, DeepSpeed, and any other training framework built for CUDA. For teams that need training and inference on the same GPU fleet, GPU cloud is the only option.
Dense models at batch 32 or higher. H200 SXM5's 4.8 TB/s HBM bandwidth and B200 SXM6's 8.0 TB/s dominate when requests are batched. At batch 32 on Llama 3.3 70B, a single B200 delivers ~6,100 tokens/sec versus the SN40L node's ~2,100. The bandwidth advantage compounds as batch size increases.
Custom kernels and non-Composer architectures. vLLM, SGLang, TensorRT-LLM, and any other framework built on CUDA kernels run on H200 and B200 without modification. FlashAttention 3, speculative decoding, SGLang's RadixAttention prefix caching, and TensorRT-LLM's in-flight batching are all available on GPU cloud. None of these run on SN40L.
Vision and multimodal workloads. SambaNova Composer does not support vision encoders as of May 2026. For Llama 4 Scout/Maverick with the vision modality enabled, or any other multimodal model, GPU cloud is the only option.
FP4 quantization on B200. SN40L has no FP4 support. B200's 9,000 FP4 TFLOPS doubles its effective throughput for models that tolerate FP4 quantization quality. For Llama 4 Maverick at FP4, B200 delivers throughput that SN40L cannot approach on its current architecture.
Mixed training and inference fleets. Spheron instances can switch between training and inference jobs without hardware changes. SambaNova Cloud is a managed inference API: you pay per output token and cannot reallocate that capacity to a training run.
H200 SXM5 and B200 SXM6 instances on Spheron are both available at on-demand and spot pricing, so you can mix workloads and scale each one independently.
Porting a vLLM Stack Off SambaNova Composer
If you have existing workloads running on SambaNova Cloud and want to evaluate GPU cloud alternatives, here is what the migration involves.
What Composer actually does to your model. SambaNova Composer takes a standard HuggingFace model definition and compiles it into an RDU-native execution graph. It applies proprietary quantization and operator fusion. The compiled artifact is not portable, but the original model weights are in standard HuggingFace format.
Step 1: Get the original weights. Download the model in HuggingFace safetensors format from the original model hub (Llama, DeepSeek, Qwen, etc.), not from SambaNova. SambaNova Cloud does not expose compiled model artifacts for export.
Step 2: Run vLLM on Spheron. On an 8-GPU H200 or B200 instance, run:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--tensor-parallel-size 8 \
--quantization fp8
FP8 weights are selected with --quantization fp8; --dtype only accepts higher-precision types such as float16 and bfloat16. For a direct comparison of inference serving frameworks, evaluate vLLM, TensorRT-LLM, and SGLang against your latency and throughput requirements.
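Once the server from Step 2 is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds and sends a completion request using only the standard library. It assumes the server is reachable at `http://localhost:8000` (vLLM's default port unless you override it); the helper names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires a running server, so it is left commented out):
# req = build_completion_request(
#     "http://localhost:8000",
#     "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
#     "Explain mixture-of-experts routing in one sentence.",
# )
# print(complete(req))
```

The `model` field must match the name passed to `--model` when the server was started.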
Step 3: Watch for quantization drift. SambaNova Composer's quantization differs from vLLM's FP8 defaults. Run 50-100 representative prompts through both systems and compare output quality. For factual tasks, check whether accuracy metrics shift. For generation tasks, evaluate with your existing quality rubric.
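A first-pass drift check can be automated before you apply your quality rubric. The sketch below compares paired outputs from the two systems with word-level similarity from `difflib`. This is a crude lexical proxy, not a quality metric, and the 0.9 threshold and function name are arbitrary assumptions; use it to decide which prompts deserve a manual look.

```python
from difflib import SequenceMatcher


def drift_report(baseline: list[str], candidate: list[str],
                 threshold: float = 0.9) -> dict:
    """Compare paired outputs (same prompts, same order) from two systems."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need two non-empty output lists of equal length")
    # Word-level similarity in [0, 1]; 1.0 means identical word sequences.
    sims = [
        SequenceMatcher(None, a.split(), b.split()).ratio()
        for a, b in zip(baseline, candidate)
    ]
    exact = sum(a == b for a, b in zip(baseline, candidate))
    return {
        "mean_similarity": sum(sims) / len(sims),
        "exact_match_rate": exact / len(baseline),
        # Indices of prompts whose outputs diverge enough to review by hand.
        "flagged": [i for i, s in enumerate(sims) if s < threshold],
    }


report = drift_report(
    ["Paris is the capital of France.", "The answer is 42."],
    ["Paris is the capital of France.", "The answer is 41."],
)
print(report)
```

Use greedy decoding (temperature 0) on both sides; otherwise sampling noise will swamp any quantization effect.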
Step 4: Reconfigure KV cache. On GPUs there is no DDR5 overflow, so weights and KV cache share HBM. vLLM's --gpu-memory-utilization sets the fraction of each GPU's memory the engine may claim; whatever remains after weights and activations becomes KV cache. For Llama 4 Maverick on H200 x8, start with --gpu-memory-utilization 0.90 (the vLLM default) and reduce it if OOM errors appear.
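For a rough sense of how much context the KV cache budget buys, the sketch below estimates bytes per token and token capacity. It is a back-of-envelope model under stated assumptions: standard grouped-query attention, no activation or CUDA graph overhead, and illustrative function names. The example shape (80 layers, 8 KV heads, head dimension 128) is the published Llama 3.3 70B configuration; read the equivalent fields from your own model's config.json.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_token_capacity(total_hbm_gb: float, gpu_memory_utilization: float,
                      weights_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after weights (ignores activation memory)."""
    budget_bytes = (total_hbm_gb * gpu_memory_utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Llama 3.3 70B shape with an FP16 KV cache (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print("KV bytes per token:", per_token)

# One H200 (141 GB) at 0.90 utilization, assuming ~70 GB of FP8 weights.
print("approx. tokens of KV cache:", kv_token_capacity(141, 0.90, 70, per_token))
```

The capacity is shared across all concurrent sequences, so divide by your expected concurrency to get a per-request context budget.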
Common gotchas. Composer's quantization sometimes produces different token distributions than vLLM FP8 for long-context generation. If you see quality degradation beyond 4096 tokens of context, compare with --dtype float16 as a reference point before blaming the framework.
Decision Framework: Inference ASIC vs GPU Cloud
- If your primary workload is large MoE inference (Llama 4 Maverick, DeepSeek V3, Mixtral 8x22B) at batch 1-4 AND you need zero-ops managed inference with no infrastructure team: evaluate SambaNova Cloud for that specific workload.
- If you need training and inference on the same fleet, or you plan to fine-tune: GPU cloud (H200 or B200 on Spheron) is the only option.
- If you use custom attention kernels, speculative decoding, or SGLang's RadixAttention: GPU cloud required.
- If your model is dense (not MoE) and your batch size exceeds 16: B200 on Spheron wins on CPM by a significant margin.
- If FP4 quantization is viable for your quality bar: B200 on Spheron has no equivalent in the SambaNova lineup.
- If you are uncertain about long-term vendor lock-in: GPU cloud gives you portability across cloud providers, model frameworks, and workload types.
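The rules above can be written down as a small function, which is handy if you evaluate several workloads at once. This is a sketch, not a definitive policy: the field names, the order in which rules are checked, and the default of GPU cloud when no rule matches are all assumptions layered on top of the list.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    is_moe: bool
    batch_size: int
    needs_training: bool = False        # training or fine-tuning on the same fleet
    needs_custom_kernels: bool = False  # custom attention, spec decoding, RadixAttention
    fp4_ok: bool = False                # FP4 meets the quality bar
    has_infra_team: bool = True


def recommend(w: Workload) -> str:
    """Apply the decision framework; hard requirements are checked first."""
    if w.needs_training or w.needs_custom_kernels:
        return "GPU cloud (H200 or B200)"
    if w.fp4_ok:
        return "B200 on GPU cloud"
    if not w.is_moe and w.batch_size > 16:
        return "B200 on GPU cloud"
    if w.is_moe and w.batch_size <= 4 and not w.has_infra_team:
        return "Evaluate SambaNova Cloud"
    # Assumed default: portability and lock-in concerns favor GPU cloud.
    return "GPU cloud (H200 or B200)"


print(recommend(Workload(is_moe=True, batch_size=2, has_infra_team=False)))
print(recommend(Workload(is_moe=False, batch_size=32)))
```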
For a broader GPU selection guide across inference workloads, see the best GPU for AI inference in 2026 and the detailed GPU cost-per-token benchmark.
Most inference stacks don't need a proprietary dataflow accelerator. For Llama 4 Maverick and standard dense models, H200 and B200 on Spheron give you full CUDA flexibility, transparent on-demand and spot pricing, and the ability to run training and inference on the same fleet.
Quick Setup Guide
Run 'python -m vllm.entrypoints.openai.api_server --model <your-model> --tensor-parallel-size 8' on an H200 x8 or B200 x8 instance on Spheron. With the server running, send 500 requests at concurrency 1, 8, and 32 using 'vllm bench serve' and record output tokens/sec from the benchmark summary ('vllm bench throughput' measures offline throughput without a server). This baseline is your comparison point for any SambaNova Cloud API benchmark you run next.
Check whether your primary model is a Mixture of Experts architecture (Llama 4 Maverick, DeepSeek V3, Mixtral) or a dense transformer (Llama 3.3 70B, Qwen 3 32B, Gemma 4). Run 'grep -i mixture content/model_card.md' or check the model's config.json for an expert-count field; the name varies by model family (for example 'num_experts', 'num_local_experts', or 'n_routed_experts'). MoE models with many experts benefit from SambaNova's DDR tier at low batch sizes. Dense models do not.
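The config.json check is easy to script. The sketch below looks for an expert-count field in a parsed config. The list of key names is an assumption covering several common model families rather than an exhaustive standard, and the nested `text_config` lookup is there because some multimodal checkpoints keep language-model settings under that key; verify both against your own file.

```python
import json
from pathlib import Path
from typing import Optional

# Assumed field names; different model families use different keys.
EXPERT_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")


def expert_count(config: dict) -> Optional[int]:
    """Return the expert count if the config looks like MoE, else None."""
    nested = config.get("text_config") or {}
    for scope in (config, nested):
        for key in EXPERT_KEYS:
            value = scope.get(key)
            if isinstance(value, int) and value > 1:
                return value
    return None


def expert_count_from_file(path: str) -> Optional[int]:
    """Convenience wrapper: load config.json from disk and inspect it."""
    return expert_count(json.loads(Path(path).read_text()))


print(expert_count({"num_local_experts": 8}))   # MoE-style config
print(expert_count({"hidden_size": 8192}))      # dense-style config
```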
Use the formula: CPM = ($/hr per node) / (tokens_per_sec x 3600 / 1,000,000). For SambaNova Cloud, CPM is the direct per-token price from their pricing page. Compare your measured GPU cloud CPM at your actual production batch size against the SambaNova Cloud rate. If your batch size is reliably above 16 for dense models or above 8 for MoE, GPU cloud wins on CPM in most configurations.
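It can be more useful to invert the formula and ask what throughput your GPU node must sustain before it undercuts a per-token API price. A minimal sketch, with an illustrative function name:

```python
def breakeven_tokens_per_sec(hourly_usd: float, api_price_per_million: float) -> float:
    """Throughput at which an hourly-billed node matches a per-token API price.

    Derived by setting CPM = hourly / (tps * 3600 / 1e6) equal to the API price
    and solving for tps.
    """
    return hourly_usd * 1_000_000 / (api_price_per_million * 3600)


# H200 x8 on-demand ($36.96/hr) against a $6/M output-token API price.
tps = breakeven_tokens_per_sec(36.96, 6.0)
print(f"break-even throughput: {tps:.0f} tokens/sec")
```

For the example inputs the break-even is roughly 1,700 tokens/sec. In the Llama 4 Maverick throughput table, H200 x8 passes that level somewhere between batch 1 (~420) and batch 8 (~2,800), which is consistent with the batch-size guidance in this step.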
Download your model weights in HuggingFace safetensors format from the original model hub (not from SambaNova). In a test environment, run 'vllm serve <model-name> --dtype float16 --max-model-len 8192' on an H200 instance and compare output quality with 50 prompts against your SambaNova Cloud baseline. Check perplexity and sample outputs for any quantization drift before committing to migration.
Frequently Asked Questions
Is SambaNova SN40L faster than H200 or B200 for LLM inference?
It depends on the model and batch size. For large MoE models like Llama 4 Maverick (400B+ params) at batch 1-4, SambaNova's 3-tier memory architecture lets a single 8-RDU node hold the full model without tensor parallelism. At batch 32+, H200 and B200 reclaim throughput leadership through HBM bandwidth. For dense models like Llama 3.3 70B at batch 8 and above, a single H200 is faster than the 8-RDU node.
How does the RDU differ from a GPU?
The RDU (Reconfigurable Dataflow Unit) executes models as a compiled dataflow graph rather than scheduling CUDA kernels on streaming multiprocessors. The key architectural difference is the 3-tier memory hierarchy: on-chip SRAM (~520 MB per RDU), HBM3 (64 GB per RDU), and DDR5 (1.5 TB per RDU, 12 TB per 8-RDU node). That DDR5 tier lets large MoE models keep cold expert weights in slow memory while HBM holds active layers, eliminating the need to spread the model across many GPUs.
How does SambaNova Cloud pricing compare with H200 and B200 on Spheron?
SambaNova Cloud uses per-output-token billing, not hourly rentals. For large models like Llama 4 Maverick, managed API pricing runs approximately $6/M output tokens. On Spheron, H200 SXM5 starts at $4.62/hr on-demand and B200 SXM6 at $7.00/hr on-demand. At batch size 32 on Llama 4 Maverick, H200 x8 on-demand comes to roughly $1.39/M tokens and B200 x8 on-demand to roughly $1.52/M tokens. Spot pricing brings B200 x8 to about $0.47/M tokens on the same model, and a single H200 on spot serves Llama 3.3 70B at batch 8 for about $0.22/M tokens. Both are well below SambaNova Cloud's rates for high-volume inference.
Can I run vLLM, SGLang, or TensorRT-LLM on SambaNova hardware?
No. SambaNova uses a proprietary orchestration layer called Composer that compiles model graphs to the RDU dataflow architecture. vLLM, SGLang, and TensorRT-LLM have no RDU backend and do not run on SN40L hardware. You access SambaNova inference through their cloud API. Migrating off SambaNova back to vLLM on GPU cloud requires downloading the original model weights in standard HuggingFace format (compiled Composer artifacts cannot be exported) and reconfiguring your serving stack from scratch.
When should I choose H200 or B200 over SambaNova?
Choose H200 or B200 on Spheron when: you need training and inference on the same fleet; your workload is a dense model at batch 8 or higher; you use custom attention kernels, speculative decoding, or SGLang's RadixAttention; your model is multimodal and requires a vision encoder; or you want the ability to switch inference frameworks without migration cost. B200 specifically wins on FP4 quantization, where SN40L has no equivalent.
What is SambaNova Composer, and does it create vendor lock-in?
SambaNova Composer is the proprietary compilation and orchestration layer that maps model graphs to the RDU dataflow architecture. There is no open-source equivalent. Migrating off SambaNova means obtaining the original model weights in standard HuggingFace safetensors format from the model hub, which is possible for most models. The lock-in is primarily in the serving layer, not the weights. Teams with existing PyTorch training pipelines can keep those unchanged and swap only the inference API client.
