Most CUDA alternative discussions focus on AMD and ROCm. Tenstorrent takes a different approach: a fully open ISA built on RISC-V, a mesh-network chip fabric, and a compiler stack you can actually read and modify. The performance case for their Wormhole Galaxy server is real on paper. The software maturity is not. This post covers both.
Why Tenstorrent Matters in 2026
Jim Keller founded Tenstorrent after decades designing CPUs at AMD, Apple, Intel, and Tesla. The design philosophy he brought to Tenstorrent is fundamentally different from NVIDIA: start with the ISA open, make the compiler stack a first-class open-source project, and design the chip around explicit data movement rather than a hardware cache hierarchy.
The result is a chip where every detail of the compute stack is auditable. No proprietary microcode. No closed driver interface. No PTX layer you can reverse-engineer but not officially modify.
NVIDIA's moat runs deep. CUDA has been accumulating ecosystem software for nearly two decades: cuDNN, cuBLAS, NVLink, NCCL, TensorRT, and a PTX ISA that developers write to but cannot modify. TT-Metal, Tenstorrent's compiler stack, is permissively licensed open-source code on GitHub. You can read every kernel, modify every allocation decision, and submit a pull request. That is not possible with CUDA.
For most production inference teams, this distinction is academic today. The ecosystem gap matters more than the ISA openness. But for sovereign AI programs, defense and financial services compliance requirements, and hardware research groups, full-stack auditability is a first-class requirement, not a nice-to-have. The EU AI Act and several national AI programs now explicitly require auditable, open-source compute stacks for certain workload categories. Tenstorrent's architecture passes that requirement in ways NVIDIA's proprietary stack cannot.
For a broader picture of where CUDA alternatives stand today, see our ROCm vs CUDA comparison.
Tenstorrent Wormhole: Architecture Deep Dive
Tensix Core Design
A Tensix core is not a CUDA core equivalent. It is a self-contained compute tile with three components:
- RISC-V data movement processor: handles memory operations and kernel control flow
- Matrix engine: 8x8 BF16 tile multiply-accumulate unit for tensor operations
- Vector unit: element-wise operations (activations, normalizations)
Each Tensix core has approximately 1.5 MB of local L1 SRAM. The n300 card has 128 Tensix cores across both ASICs (64 per ASIC), giving 192 MB of on-chip SRAM in aggregate (96 MB per ASIC).
The key design choice: data movement is explicit. The programmer (or compiler) schedules when data moves from DRAM to SRAM and when it moves back. There is no hardware prefetcher guessing your access pattern. This makes performance predictable when you get it right, and debugging painful when you do not. TT-Metal kernel development requires you to think in terms of tile pipelines and explicit DMA transfers, which is closer to writing an FPGA design than writing a CUDA kernel.
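To make that programming model concrete, here is a minimal, purely illustrative sketch in NumPy (not TT-Metal code) of a tile pipeline with explicit data movement: every operand tile is copied into a bounded scratch buffer before compute and written back afterward, the way a TT-Metal kernel stages tiles through a Tensix core's L1. The 32x32 tile size and the scratch budget are assumptions chosen for illustration.

```python
import numpy as np

TILE = 32                      # illustrative tile edge; chosen so tiles fit the scratch budget
L1_BUDGET_BYTES = 1_500_000    # ~1.5 MB of L1 SRAM per Tensix core (from the spec above)

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matmul staged through explicit tile transfers, analogous to a kernel
    moving tiles DRAM -> L1 -> compute -> L1 -> DRAM."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    out = np.zeros((m, n), dtype=np.float32)

    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            # "Allocate" an accumulator tile in scratch space (stand-in for L1).
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for p in range(0, k, TILE):
                # Explicit "DMA": copy one tile of each operand into scratch.
                a_tile = a[i:i+TILE, p:p+TILE].astype(np.float32)
                b_tile = b[p:p+TILE, j:j+TILE].astype(np.float32)
                # Working set per step is three tiles; it must stay under the L1 budget.
                assert 3 * TILE * TILE * 4 <= L1_BUDGET_BYTES
                acc += a_tile @ b_tile          # stand-in for the matrix engine
            out[i:i+TILE, j:j+TILE] = acc       # explicit write-back to "DRAM"
    return out
```

On real hardware the inner copies become DMA transfers issued by the RISC-V data movement processor and the accumulate becomes matrix-engine work; the point is that nothing moves unless the kernel says so.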
On-Chip Ethernet Fabric
Every Wormhole chip includes 16 Ethernet cores running at 100 Gbps each. These are not for connecting to a network switch. They are the chip's multi-chip interconnect.
When you plug multiple n300 cards together in a Galaxy board, the Ethernet fabric connects the chips directly, with no PCIe switch in the data path. Adjacent chips communicate over these Ethernet links at 100 Gbps per link. The Galaxy board's 32 Wormhole processors form a 2D mesh through these links, and any node can reach any other node by hopping through the mesh.
The practical result: you do not need InfiniBand or NVLink to scale a Tenstorrent cluster. The fabric is built into the silicon.
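To put a rough number on what "reach any other node through the mesh" costs, the sketch below counts link hops under simple dimension-ordered routing. The 8x4 grid shape is an assumption for illustration; the Galaxy board's exact topology and routing policy may differ.

```python
# Hop count between chips in a 2D mesh with dimension-ordered (XY) routing.
# The 8x4 grid is an illustrative assumption, not Tenstorrent's published layout.
COLS, ROWS = 8, 4

def hops(src: int, dst: int) -> int:
    sx, sy = src % COLS, src // COLS
    dx, dy = dst % COLS, dst // COLS
    return abs(sx - dx) + abs(sy - dy)   # Manhattan distance = link traversals

worst = max(hops(s, d) for s in range(COLS * ROWS) for d in range(COLS * ROWS))
print(f"worst-case hops across the mesh: {worst}")   # 10 for an 8x4 grid
```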
Memory Hierarchy and the HBM Trade-off
The n300 card carries 24 GB of GDDR6, not HBM. This is one of the most significant spec differences versus the H100.
H100 SXM5 ships with 80 GB HBM3 at 3.35 TB/s memory bandwidth per GPU. A single n300 card has 24 GB GDDR6 at around 576 GB/s per card. The bandwidth gap is roughly 6x per physical unit.
What partially compensates for this is the aggregate on-chip SRAM. When a model's working set fits in the 192 MB on-chip SRAM, GDDR6 accesses become less frequent, and the bandwidth gap matters less. For small-batch inference where you can tile the attention computation into SRAM, this tradeoff is favorable. For large-batch inference with large KV caches that overflow on-chip memory, GDDR6 bandwidth becomes the bottleneck.
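A quick back-of-the-envelope check makes that cutoff concrete. Using the public Llama-70B shape (80 layers, grouped-query attention with 8 KV heads, head dimension 128) and BF16 KV entries, a single 512-token sequence already approaches the 192 MB SRAM ceiling, while a batch-32, 2K-context workload needs roughly 20 GB and lives in GDDR6:

```python
# Back-of-the-envelope KV-cache sizing for a Llama-70B-class model
# (80 layers, grouped-query attention with 8 KV heads, head_dim 128, BF16).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2

def kv_cache_bytes(batch: int, seq_len: int) -> int:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES   # K and V per token
    return batch * seq_len * per_token

print(kv_cache_bytes(1, 512) / 1024**2, "MB")     # ~160 MB: close to the 192 MB SRAM ceiling
print(kv_cache_bytes(32, 2048) / 1024**3, "GB")   # ~20 GB: spills into GDDR6
```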
Tenstorrent Blackhole: The Next Generation
p100 and p150b Cards
Blackhole is Tenstorrent's successor to Wormhole. The p100 card is built on a single Blackhole die with:
- 120 Tensix compute cores
- 16 big RISC-V cores, 4 clusters of 4 (new in Blackhole: eliminates the need for a separate host processor for data orchestration)
- 28 GB GDDR6
- PCIe 5.0 x16 interface
The addition of 16 dedicated big RISC-V cores means Blackhole can manage its own data movement without constant roundtrips to a host CPU. On Wormhole, host CPU overhead for small-batch workloads was a real bottleneck. Blackhole addresses this on-die.
The p150b is Tenstorrent's Blackhole-based Galaxy card. Full production specs had not been publicly released as of April 2026. Any throughput figures for p150b should be treated as preliminary until independent benchmarks are available.
What Changes vs Wormhole
Blackhole increases L1 SRAM per Tensix tile compared to Wormhole, which helps with larger tile operations and reduces GDDR6 bandwidth pressure for compute-heavy kernels.
Native PCIe 5.0 support is backward-compatible with existing server slots, so teams running heterogeneous clusters can slot Blackhole cards into the same PCIe infrastructure as their NVIDIA GPUs.
Theoretical peak compute on a single Blackhole die is 745 TFLOPS FP8 per Tenstorrent's official specifications. This figure has not been confirmed at scale in independent benchmarks.
One important caveat: most TT-Metal documentation, tutorials, and verified model support as of Q1 2026 targets Wormhole n300 and n150 cards. Blackhole software support is earlier in the development cycle. If you are building on Tenstorrent today, the documented path runs through Wormhole.
Galaxy Server vs NVIDIA H100 SXM5: Llama 70B Inference
The Tenstorrent Wormhole Galaxy puts 32 Wormhole processors on a single board, connected through the on-chip Ethernet mesh without any PCIe switch in the data path.
| Metric | Tenstorrent Wormhole Galaxy (32 processors) | NVIDIA H100 SXM5 (8x) |
|---|---|---|
| Chip count | 32 processors | 8 GPUs |
| Total VRAM | 384 GB GDDR6 | 640 GB HBM3 |
| Memory bandwidth (aggregate) | ~9.2 TB/s | ~26.8 TB/s |
| Peak FLOPS (aggregate) | ~9,322 TFLOPS FP8 (official) | ~31,664 TFLOPS FP8 (8x, with sparsity) |
| Llama 70B decode (batch 32) | ~4,000-5,000 tok/s (TT-Metal, unverified) | ~2,500-3,500 tok/s (vLLM) |
| Interconnect | 100 Gbps Ethernet mesh | NVLink 4 (900 GB/s) |
| Cloud availability | On-prem only / direct from Tenstorrent | Spheron, hyperscalers |
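For transparency on where the aggregate rows come from, the sketch below multiplies per-unit figures out to the node level. The per-Wormhole numbers (12 GB, 288 GB/s, ~292 TFLOPS FP8 per processor) are assumptions back-derived from the n300 card specs quoted earlier and land close to the official aggregate; the H100 figures are NVIDIA's published per-GPU specs (sparse FP8).

```python
# Reproducing the aggregate rows in the table above from per-unit figures.
# Per-Wormhole values are assumptions derived from the n300 specs (24 GB / 576 GB/s
# per two-ASIC card); H100 values are NVIDIA's published per-GPU specs.
wormhole = {"vram_gb": 12, "bw_tbps": 0.288, "fp8_tflops": 292}   # per processor (assumed)
h100     = {"vram_gb": 80, "bw_tbps": 3.35,  "fp8_tflops": 3958}  # per GPU (sparse FP8)

galaxy = {k: round(v * 32, 1) for k, v in wormhole.items()}
dgx    = {k: round(v * 8, 1) for k, v in h100.items()}
print(galaxy)   # {'vram_gb': 384, 'bw_tbps': 9.2, 'fp8_tflops': 9344}
print(dgx)      # {'vram_gb': 640, 'bw_tbps': 26.8, 'fp8_tflops': 31664}
```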
The Llama 70B throughput figures for Galaxy come from Tenstorrent's own benchmarking, not independent third-party audits. That matters because benchmark methodology differs from a production serving setup: the H100 vLLM numbers include routing overhead, continuous batching queue management, and PagedAttention memory pressure, while the Tenstorrent Galaxy numbers typically come from controlled single-model benchmark runs on TT-Metal without a production serving layer.
In other words: Galaxy looks faster on raw decode throughput, but you cannot directly translate that number to a serving SLA without accounting for the full serving stack gap.
The on-chip SRAM advantage is real for certain workloads. An 8xH100 SXM5 node's 640 GB HBM3 exceeds Galaxy's 384 GB GDDR6 for raw capacity, so H100 has the total memory edge for very large models. Where Galaxy can recover ground is for compute-bound workloads where the active working set fits in on-chip SRAM and GDDR6 bandwidth pressure is reduced.
Cost per million tokens (8xH100 on Spheron, Llama 70B, batch 32):
At $2.90/hr per H100 SXM5 on-demand (or $0.80/hr spot), an 8-GPU node runs $23.20/hr on-demand or $6.40/hr spot. At 3,000 tok/s midpoint throughput:
- On-demand: ($23.20 / 3,000) / 3,600 × 1,000,000 = $2.15/M tokens
- Spot: ($6.40 / 3,000) / 3,600 × 1,000,000 = $0.59/M tokens
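The same arithmetic as the bullets above, wrapped in a small helper so you can plug in your own node price and measured throughput (the B200 line anticipates the pricing table at the end of this post):

```python
def dollars_per_million_tokens(node_cost_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per 1M generated tokens for a node running at a sustained decode rate."""
    tokens_per_hr = tokens_per_sec * 3600
    return node_cost_per_hr / tokens_per_hr * 1_000_000

print(dollars_per_million_tokens(8 * 2.90, 3000))   # 8x H100 on-demand -> ~$2.15
print(dollars_per_million_tokens(8 * 0.80, 3000))   # 8x H100 spot      -> ~$0.59
print(dollars_per_million_tokens(8 * 2.06, 5500))   # 8x B200 spot      -> ~$0.83 (est.)
```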
Tenstorrent Galaxy hardware is not available on cloud marketplaces as of April 2026. The p100 card sells direct at approximately $2,400 street price, making on-prem TCO comparisons possible but cloud spot/on-demand pricing comparisons unavailable.
For a comparison with another inference-focused non-GPU chip, see our Groq 3 LPU explainer.
TT-Metal and the Software Stack Reality Check
What Works Today
The TT-LLM project, built on top of TT-Metal, has verified the following models on Wormhole as of early 2026:
- Llama 2 7B, 13B, 70B
- Llama 3 8B, 70B
- Falcon 7B
- Mistral 7B
TT-Metal's Python and C++ APIs are stable for these workloads. The project is actively maintained on GitHub and the Tenstorrent team ships updates regularly.
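For a feel of what that stable Python surface looks like, here is a minimal sketch based on TT-Metal's public ttnn tutorials. Treat it as illustrative rather than canonical: function names and signatures shift between releases, so check the documentation pinned to your installed version.

```python
# Illustrative sketch based on TT-Metal's public ttnn tutorials; API details
# change between releases, so verify against the docs for your version.
import torch
import ttnn

device = ttnn.open_device(device_id=0)          # grab the first Wormhole card

a = ttnn.from_torch(torch.randn(1024, 1024), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(1024, 1024), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

out = ttnn.matmul(a, b)                         # runs on the Tensix matrix engines
result = ttnn.to_torch(out)                     # copy back to host for verification

ttnn.close_device(device)
```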
The key word is "verified." Verified means the model outputs pass numerical accuracy tests and the pipeline runs end-to-end. It does not mean the serving stack ships the features production teams depend on.
What Doesn't Work Yet
The gaps relative to CUDA are significant:
- MoE architectures: DeepSeek V3, Qwen3 MoE, and other mixture-of-experts models have partial support. Expert routing does not fuse cleanly on the current Tensix mapping.
- FlashAttention 3: No TT-Metal port exists. Custom attention kernels must be rewritten in TT-Metal C++ from scratch.
- TensorRT-LLM: CUDA-only, no TT equivalent.
- Most HuggingFace models: Models not in the verified list require manual kernel rewrites, which is not a one-day task.
- OpenAI-compatible serving API: No production equivalent to vLLM's OpenAI-compatible endpoint with continuous batching and PagedAttention.
Compilation and Tooling
TT-Metal model compilation takes 3-10 minutes per model. On CUDA, TensorRT compilation can take a few minutes for optimized engines, but loading an already-compiled model is seconds. The 3-10 minute cold-start time on TT-Metal affects development iteration speed significantly.
There is no profiler equivalent to NVIDIA Nsight for TT-Metal. Kernel-level debugging relies on TT-Metal's own tooling, which is functional but far less mature than the CUDA profiling ecosystem.
Note for anyone reading older tutorials: TT-Buda, Tenstorrent's earlier Python API, has been deprecated; new development should use TT-Metal directly. Code targeting TT-Buda will not work on current Tenstorrent hardware without a port.
For comparison with what a mature GPU inference stack looks like in production, see our vLLM production deployment guide.
When Tenstorrent Is the Right Call
There are specific contexts where Tenstorrent's open-stack design is worth the software maturity trade-off:
Research labs building custom compute primitives. Full ISA access and an open compiler means you can add new data types, modify the memory allocation strategy, or implement novel attention patterns without waiting for NVIDIA to expose the capability in CUDA. For hardware architecture research, this is uniquely valuable.
Sovereign AI programs with open-stack requirements. EU AI Act compliance and several national AI programs now require auditable compute stacks for certain deployment categories. A fully open RISC-V ISA and a permissively licensed, open-source compiler stack satisfy these requirements in ways CUDA does not.
Defense and financial services. Regulated industries with full-stack auditability requirements increasingly need to verify every layer of the compute stack. CUDA's proprietary microcode and closed driver components create compliance friction that Tenstorrent's open stack avoids.
Budget-sensitive teams willing to invest in porting. The on-prem economics of Galaxy hardware can compare favorably to cloud H100 over a 3+ year horizon, especially for teams running sustained high-throughput inference on supported models. The trade-off is the upfront engineering cost to port and maintain a TT-Metal stack.
When NVIDIA H100 Is Still the Right Call
For most production teams in 2026, CUDA on H100 remains the right choice:
Production inference starting today. The path from "I have a model" to "I have a production serving endpoint" on H100 with vLLM is measured in days to weeks, not months. On TT-Metal, the same journey is months for a supported model and longer for anything outside the verified list.
Models outside the verified list. If your production model is not Llama 2, Llama 3, Falcon, or Mistral, you are not running on Tenstorrent without significant kernel engineering work.
FlashAttention 3, TensorRT-LLM, or NCCL. These frameworks have no TT-Metal equivalents. Teams running heavily optimized CUDA stacks cannot port these to TT-Metal without replacing them entirely.
Fine-tuning with FSDP or DeepSpeed. Both frameworks integrate deeply with NCCL for collective communication. There is no NCCL equivalent for TT-Metal. Running distributed training with gradient synchronization at scale requires CUDA today.
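That coupling is visible in even the smallest FSDP setup. The sketch below (assuming a standard torchrun launch that sets the rank environment variables) shows the NCCL backend baked into the first line of the recipe, which is exactly the line with no TT-Metal counterpart today:

```python
# Minimal FSDP setup; the NCCL backend is hard-wired into the typical recipe.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def setup_fsdp(model: torch.nn.Module) -> FSDP:
    # NCCL is the collective backend FSDP/DeepSpeed recipes assume on NVIDIA GPUs;
    # there is no drop-in equivalent for TT-Metal today.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    return FSDP(model.cuda())
```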
Teams without dedicated compiler engineers. Operating a TT-Metal stack requires people who can debug kernel pipelines and data movement schedules. This is not a typical ML engineer skill set. Without in-house compiler expertise, the gap between theoretical and achieved performance can be large.
Enterprise support requirements. NVIDIA's enterprise support contracts cover CUDA, NIM, TensorRT-LLM, and the full CUDA software stack. Tenstorrent's commercial support offering is early-stage.
For H100 SXM5 instances on Spheron, the setup path is straightforward: pull a standard Docker image, launch vLLM or SGLang, and you are serving within hours.
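Once vLLM's OpenAI-compatible server is up, the client side is a few lines. The base URL and model name below are placeholders for whatever you launched; the snippet assumes the standard openai Python SDK pointed at the local endpoint.

```python
# Client-side sketch against vLLM's OpenAI-compatible endpoint; base_url and
# model name are placeholders for whatever you deployed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the trade-offs of GDDR6 vs HBM3."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```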
How Spheron Fits the Open-Stack Story
Spheron's GPU cloud gives production teams a practical bridge between today's CUDA reality and where open-stack hardware is heading.
Bare-metal H100 and B200 instances with no hypervisor overhead. The full PCIe and NVLink topology is exposed, which matters for teams optimizing collective communication patterns or running near-metal profiling. No platform-specific orchestration layer sits between your serving stack and the hardware.
ROCm alongside CUDA, no re-provisioning required. Teams exploring CUDA alternatives can benchmark AMD ROCm workloads without changing their cloud provider. The option to move away from CUDA-only lock-in is there when the software stack is ready. See ROCm vs CUDA on Spheron for the current compatibility picture.
No proprietary scheduler dependency. No NVIDIA NIM dependency, no platform-specific orchestration. Bring your own vLLM, SGLang, or, when the model support matures, TT-LLM. The serving architecture you build on Spheron today transfers directly to whatever hardware you run next.
Open migration path. Production workloads that run on Spheron's B200 instances today are not locked to a platform-specific format. When Tenstorrent Galaxy hardware becomes available on cloud marketplaces and TT-Metal's model coverage expands, the serving layer can migrate without re-architecting from scratch.
For teams also evaluating AMD as a CUDA alternative, see our AMD MI400 vs B300 comparison.
Live Pricing (as of 28 Apr 2026)
Current Spheron GPU pricing for H100, H200, and B200:
| GPU | On-Demand ($/hr) | Spot ($/hr) | Llama 70B $/M tokens (batch 32, est.) |
|---|---|---|---|
| H100 SXM5 (per GPU) | $2.90 | $0.80 | ~$2.15 on-demand / ~$0.59 spot (8x node) |
| B200 SXM6 (per GPU) | N/A (spot only) | $2.06 | ~$0.83 spot (8x node, ~5,500 tok/s est.) |
| H200 SXM5 (per GPU) | $9.76 | N/A | Not calculated (limited availability) |
| Tenstorrent Galaxy | N/A | N/A | N/A (not on cloud marketplaces yet) |
Tenstorrent Galaxy is sold direct from Tenstorrent. The p100 Blackhole card sells for a street price of approximately $2,400. Galaxy server on-prem economics can be competitive over a 3+ year horizon; cloud spot/on-demand pricing is not available for direct comparison as of this writing.
Pricing fluctuates based on GPU availability. The prices above were captured on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For teams that want production-grade open-stack inference today, Spheron's bare-metal H100 and B200 nodes give you CUDA and ROCm access without a proprietary orchestration layer. Run reliable inference now and migrate to Tenstorrent hardware when the software stack is ready.
H100 SXM5 instances → | B200 instances → | View full pricing →
