
Etched AI Sohu vs NVIDIA: Transformer ASIC vs General-Purpose GPU for LLM Inference (2026)

Written by Mitrasish, Co-founder · May 1, 2026

Etched AI claims one 8-chip Sohu server delivers 500,000 tokens per second on Llama 70B, roughly 62,500 tokens/sec per chip. For comparison, a single H100 SXM5 achieves around 700 tokens/sec at batch 1 with vLLM. The per-chip advantage reflects a genuine architectural difference: Sohu hard-codes transformer attention directly into silicon as fixed-function logic rather than running it as software on a programmable compute unit. The implied tradeoff is the entire story here. Sohu is a bet that transformer attention will remain the dominant AI architecture for the next several years, and that the workload is stable enough to justify giving up all programmability.

For teams evaluating inference hardware right now, the practical question is not whether Sohu is fast. It is whether the architectural constraints, supply risk, and toolchain migration cost are acceptable for your specific workload. This post covers the architecture in detail, compares Sohu against H100, B200, and the Groq 3 LPU, and gives a framework for deciding when the ASIC bet pays off.

What Is Etched AI and the Sohu Chip

Etched AI is a chip startup founded in 2022 that has raised close to $1 billion in total funding, including a $500M round at a $5B valuation. Their first product is the Sohu chip, a transformer-only ASIC designed for autoregressive language model inference. The company and funding are real. The chip has been demonstrated to investors and has appeared in controlled benchmarks, but as of April 2026 it is not publicly available for purchase or rental.

The core architectural claim is that by implementing transformer attention as fixed-function silicon rather than as programmable matrix multiply instructions, Sohu can achieve throughput figures that no GPU can match for the same workload class. NVIDIA GPUs are programmable compute units that run CUDA kernels written in software. Groq's LPU is a dataflow processor with a custom compiler. Sohu takes a different position: there is no programmability layer at all. The chip does one thing, and it does it by being physically built for that one thing.

This is different from other custom inference chips in an important way. The Groq 3 LPU has a compiler that can, in principle, be extended. AMD GPUs support ROCm. Tenstorrent ships TT-Metal as open-source software. Sohu's architecture does not have a software abstraction layer in the same sense. If transformer attention changes in the next generation of models, the chip cannot adapt.

Sohu Architecture: The Transformer-Only Bet

Fixed-Function Transformer Units

Transformer attention requires three core operations at each layer: computing query, key, and value projections; running multi-head attention across the KV cache; and passing results through a feed-forward network. On a GPU, each of these is a CUDA kernel that can be swapped out for a different implementation. PagedAttention, FlashAttention-2, and FlashAttention-3 are all software optimizations that improve how attention is computed on general-purpose hardware.

On Sohu, these operations are hardwired as static circuits. The chip does not have a general matrix multiply unit that runs attention software. It has physical circuits that implement the attention computation directly. This removes all of the overhead from kernel launch latency, memory allocation, and scheduler decisions. It also means the chip cannot run any computation that does not map to transformer attention. There is no way to compile a convolution, an SSM scan, or a diffusion U-Net step onto Sohu because there are no programmable units to target.
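To make the fixed-function claim concrete, here is a minimal single-token decode step for one transformer layer in PyTorch. The dimensions are illustrative (not any real model's), and layer norms, residuals, and the output projection are omitted. On a GPU, each matmul below is a swappable kernel; on Sohu, the equivalent dataflow is frozen in silicon.

```python
import torch

# One decode step for one transformer layer, with illustrative dimensions.
d_model, n_heads, seq_len = 1024, 16, 512
d_head = d_model // n_heads

x = torch.randn(1, d_model)                      # hidden state of the new token
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

# 1. Query/key/value projections for the incoming token
q = (x @ W_q).view(n_heads, 1, d_head)
k_new = (x @ W_k).view(n_heads, 1, d_head)
v_new = (x @ W_v).view(n_heads, 1, d_head)

# 2. Append to the KV cache and attend over the full cache;
#    this cache read is the HBM bandwidth bottleneck at decode time
k_cache = torch.cat([torch.randn(n_heads, seq_len, d_head), k_new], dim=1)
v_cache = torch.cat([torch.randn(n_heads, seq_len, d_head), v_new], dim=1)
scores = (q @ k_cache.transpose(1, 2)) / d_head**0.5
attn = torch.softmax(scores, dim=-1) @ v_cache   # (n_heads, 1, d_head)

# 3. Feed-forward network on the attention output
h = attn.reshape(1, d_model)
W_up, W_down = torch.randn(d_model, 4 * d_model), torch.randn(4 * d_model, d_model)
out = torch.relu(h @ W_up) @ W_down
```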

On-Chip Memory Design

Sohu's throughput advantage comes primarily from memory bandwidth. The bottleneck for autoregressive decode on GPUs is KV cache reads: each new token requires reading the full KV cache from HBM. The H100 SXM5 has 80 GB of HBM3 with 3.35 TB/s of bandwidth. Every token generation step is bounded by how fast the model can read those KV cache values.

According to Etched's published materials and industry reporting, Sohu uses 144 GB of HBM3E per chip at roughly 1.8x the bandwidth of the H100 SXM5's 3.35 TB/s. That is nearly double the H100's 80 GB of capacity, delivered over the same HBM architecture GPUs use rather than on-chip SRAM. The Groq 3 LPU takes a fundamentally different route: 500 MB of on-chip SRAM per chip reaches 150 TB/s of bandwidth but severely limits context window capacity. Sohu's throughput advantage over GPUs therefore comes from fixed-function specialization for transformer attention patterns layered on standard HBM3E, not from an SRAM-based design like Groq's. Multi-chip configurations are still required for large models, since even 144 GB fills quickly with 70B+ parameter weights at full precision.
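Some rough arithmetic shows why bandwidth dominates decode. The sketch below assumes an illustrative GQA configuration for a 70B-class model (layer count, KV heads, and head dimension are assumptions, not published Sohu or Llama internals) and computes the ceiling imposed by KV cache reads alone on an H100.

```python
# Back-of-envelope ceiling from KV cache reads alone at batch 1.
n_layers, n_kv_heads, d_head = 80, 8, 128
bytes_per_element = 2                       # FP16 K and V entries
context_len = 4096

kv_bytes_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_element
kv_cache_bytes = kv_bytes_per_token * context_len     # ~1.34 GB at 4K context

h100_bandwidth = 3.35e12                    # bytes/sec, H100 SXM5 HBM3
ceiling = h100_bandwidth / kv_cache_bytes   # decode steps/sec if KV reads
                                            # were the only HBM traffic
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB, ceiling ~{ceiling:,.0f} tok/s")
# At batch 1 the full model weights also stream from HBM every step,
# which pushes measured single-request throughput far below this ceiling.
```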

What Sohu Cannot Do

This is the section that matters most for most teams.

  • Vision and multimodal models: any model with a visual encoder (LLaVA, Qwen-VL, Llama 3.2 Vision) cannot run on Sohu because the encoder uses convolutional or attention operations outside the pure transformer pattern
  • Diffusion models: image generation (Stable Diffusion, Flux) and video generation (Wan 2.1, CogVideoX) require U-Net convolutions that are not transformer attention
  • MoE with dynamic expert routing: DeepSeek V4, Mixtral, and Qwen3-235B-A22B use sparse expert selection at each token, which requires irregular memory access patterns that fixed-function transformer circuits cannot accommodate (see the routing sketch after this list)
  • SSM and Mamba architectures: state space models replace attention with a scan operation, which is computationally different from transformer attention
  • Training and fine-tuning: Sohu has no backward pass implementation; it is inference-only
  • Future architectures: any model architecture that does not conform to dense transformer attention requires new hardware
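For illustration, here is a toy top-k gating step of the kind MoE layers use (Mixtral-style top-2 routing; all dimensions are made up). Which expert weights get read depends on each token's hidden state at runtime, and that data-dependent memory traffic is exactly what fixed-function attention circuits cannot express.

```python
import torch

# Toy top-k expert routing for a single token (dims illustrative).
n_experts, top_k, d_model = 8, 2, 1024
x = torch.randn(1, d_model)                    # one token's hidden state
gate = torch.randn(d_model, n_experts)         # router weights

logits = x @ gate
weights, expert_ids = torch.topk(torch.softmax(logits, dim=-1), top_k)

experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
# Only the selected experts' weights are touched for this token,
# and which ones get touched is decided at runtime, per token:
out = sum(w * experts[i](x) for w, i in zip(weights[0], expert_ids[0]))
```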

DeepSeek V4 and Qwen3-235B-A22B are two of the most widely deployed open-weight models as of April 2026. Both are MoE architectures. Both are incompatible with Sohu. This is not a niche edge case. It means a significant fraction of current production inference workloads cannot run on Sohu at all.

Etched's Claimed Numbers vs Realistic NVIDIA Baseline

Etched's 500,000 tokens/sec figure for Llama 70B comes from their own published materials for an 8-chip server and has not been independently verified. The conditions matter: the figure appears to be measured at or near batch size 1, where Sohu's fixed-function attention circuits operate at peak efficiency and GPUs get no opportunity to amortize their per-token memory traffic across a batch.

At higher batch sizes, the picture changes. GPU throughput scales well with batching because the compute units stay busy across multiple requests. Fixed-function attention circuits do not benefit from batching in the same way; their advantage is decode throughput per request, not aggregate throughput across many requests.
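A toy memory-traffic model makes the scaling intuition explicit: each decode step reads the full weights once regardless of batch size, but one KV cache per in-flight request. All numbers below are assumptions (FP8 70B-class weights, 4K contexts), and the model ignores compute limits, scheduling, and prefill, so treat the output as shape, not prediction.

```python
# Why GPU decode throughput scales with batch: weight reads amortize,
# KV reads do not. Inputs are illustrative assumptions.
WEIGHT_BYTES = 70e9    # one pass over FP8 weights per decode step
KV_BYTES = 1.3e9       # KV cache read per request per decode step

def relative_throughput(batch: int) -> float:
    """Tokens per unit of HBM traffic, normalized to batch 1."""
    per_step_bytes = WEIGHT_BYTES + batch * KV_BYTES
    batch1_bytes = WEIGHT_BYTES + KV_BYTES
    return (batch / per_step_bytes) * batch1_bytes

for b in (1, 32, 256):
    print(f"batch {b:>3}: ~{relative_throughput(b):.1f}x batch-1 throughput")
# batch 1: ~1.0x, batch 32: ~20.4x, batch 256: ~45.3x
```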

| Model | Sohu per chip (claimed by Etched) | H100 SXM5 (vLLM) | B200 SXM6 (vLLM) | Notes |
| --- | --- | --- | --- | --- |
| Llama 70B at batch 1 | ~62,500 tok/s | ~700 tok/s | ~1,200 tok/s | Sohu figure derived from 8-chip server claim of 500k tok/s |
| Llama 70B at batch 32 | TBD | ~5,000 tok/s | ~9,000 tok/s | |
| Llama 70B at batch 256 | TBD | ~45,000 tok/s | ~80,000 tok/s | |

Sohu figures are per chip, derived from Etched's published 8-chip server claim of 500,000 tok/s on Llama 70B; not independently verified. NVIDIA figures reflect vLLM with FlashAttention-2 on a single chip. Run your own benchmarks before making procurement decisions.

The batch size gap is critical for any team running a serving API with concurrent requests. At batch 256, a single H100 can generate 45,000 tokens per second. Etched's 500k tok/s figure is for an 8-chip server at batch 1, which normalizes to roughly 62,500 tok/s per chip. A fair comparison requires matching conditions. Etched has not published batch 32 or batch 256 figures.

Sohu vs B200 and B300: Throughput and Cost for Pure Transformer Inference

For teams with pure transformer workloads, the cost-per-million-token comparison matters more than raw throughput. The formula is: Cost per 1M tokens = ($/hr) / (tokens/sec × 3,600) × 1,000,000
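Applied in code, using the on-demand H100 figures from the table that follows:

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Cost of generating 1M tokens on an hourly-billed instance."""
    return dollars_per_hour / (tokens_per_sec * 3600) * 1_000_000

# H100 SXM5 on-demand row from the table below: $2.90/hr at ~5,000 tok/s
print(f"${cost_per_million_tokens(2.90, 5000):.3f} per 1M tokens")   # ~$0.161
```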

| Chip | On-demand ($/hr) | Spot ($/hr) | Llama 70B tok/s (batch 32) | Cost/1M tokens (on-demand) | Cost/1M tokens (spot) |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 (Spheron) | $2.90 | $0.801 | ~5,000 | ~$0.161 | ~$0.044 |
| B200 SXM6 (Spheron) | $7.00 | $1.712 | ~9,000 | ~$0.216 | ~$0.053 |
| B300 SXM6 (Spheron) | $9.77 | $2.447 | ~16,000 (est.) | ~$0.170 | ~$0.042 |
| Sohu (Etched, est.) | Not available | Not available | ~62,500/chip at batch 1, unverified (8-chip server: ~500k) | Not calculable | Not calculable |

Pricing fluctuates based on GPU availability. The prices above were captured on May 1, 2026 and may have changed. Check current GPU pricing → for live rates.

Sohu's cost per token cannot be calculated because the chip is not available for purchase or cloud rental, and Etched has not published pricing. The only throughput figure that would make it competitive is a batch-1 number, which overstates its real-world advantage over GPUs at typical serving batch sizes.

For teams making decisions today, Spheron B200 instances deliver ~9,000 tokens/sec at batch 32 for ~$0.216 per million tokens on-demand, available now with full vLLM and TensorRT-LLM support. B300 availability raises the bar further, shrinking the cost advantage Sohu would need to demonstrate to justify migration.

Sohu vs Groq 3 LPU: Two Non-GPU Inference Chips

Both Sohu and the Groq 3 LPU are non-GPU inference chips targeting the same bottleneck: HBM memory bandwidth limits autoregressive decode throughput. They take very different approaches, and the practical differences matter. For the full Groq 3 LPU architecture breakdown, see the NVIDIA Groq 3 LPU explained post.

| Dimension | Etched Sohu | Groq 3 LPU |
| --- | --- | --- |
| Architecture | Fixed-function transformer ASIC | Dataflow SRAM processor (programmable) |
| Memory | 144 GB HBM3E per chip | 500 MB on-chip SRAM per chip |
| Memory bandwidth | ~1.8x H100 SXM5 | 150 TB/s per chip |
| Programmability | None (hardwired ops only) | Custom compiler, some flexibility |
| vLLM compatibility | No | No (disaggregated via NVIDIA Dynamo) |
| Ecosystem backing | Startup (Etched AI) | NVIDIA-licensed, data center deployments |
| Production availability (Apr 2026) | Pre-production, no public access | Early access |
| Architecture flexibility | Transformer only | Transformer-focused, limited MoE support |

The most important difference is organizational backing. Groq was acquired into NVIDIA's product portfolio with a $20 billion licensing deal. The Groq 3 LPU is being deployed in data centers as part of NVIDIA's DGX infrastructure strategy, with NVIDIA's supply chain and enterprise contracts behind it. Sohu is a startup chip, and while Etched has raised close to $1 billion in total funding, it lacks the organizational infrastructure, supply chain, and enterprise support contracts that NVIDIA brings. That difference in risk profile compounds every other comparison point.

The Transformer-ASIC Bet: What Etched Gives Up

The premise behind Sohu is that transformer architecture has converged and will remain the dominant paradigm for AI for long enough to justify giving up all programmability. This is a specific architectural bet, not just a performance optimization.

| Workload | Sohu | Groq 3 LPU | H100/B200 |
| --- | --- | --- | --- |
| Dense transformer inference | Yes | Yes | Yes |
| MoE inference (DeepSeek V4, Qwen3-235B-A22B) | No | Partial | Yes |
| Multimodal (vision encoder + language model) | No | No | Yes |
| Diffusion image/video generation | No | No | Yes |
| SSM/Mamba | No | No | Yes |
| Training and fine-tuning | No | No | Yes |
| Future architectures | High risk | Medium risk | Low risk |

The counter-evidence to this bet is already in production. DeepSeek V4, the most downloaded model on Hugging Face as of early 2026, is a 671B-parameter MoE architecture that Sohu cannot serve. Qwen3-235B-A22B is a 235B-parameter MoE model at the frontier of open-weight capability. The move toward MoE is not a future risk; it is the current reality.

Diffusion language models are an emerging category with a different compute pattern. See the diffusion language models on GPU cloud guide for a detailed breakdown of how dLLMs differ architecturally from autoregressive transformers. Any team serving or planning to serve dLLMs alongside standard transformers cannot use a transformer-only ASIC.

When Sohu Wins

For teams where Sohu's constraints are genuinely acceptable, the throughput advantage is real. The specific scenarios where Sohu's architecture makes sense:

  • Pure autoregressive dense transformer serving with a single model architecture locked in for two or more years, where you are certain you will not adopt MoE or multimodal
  • Extreme low-latency requirements where batch 1 decode speed is critical and you accept the throughput tradeoff at higher concurrency
  • Greenfield infrastructure builds with no existing CUDA toolchain investment, where the migration cost to Etched's compiler is not additive
  • Commodity transformer serving at hyperscale where the model architecture is static and the focus is on cost reduction per token at very high volume
  • Organizations with the engineering resources to build and maintain a proprietary inference stack independent of vLLM, SGLang, and TensorRT-LLM

When NVIDIA Wins

For most teams, NVIDIA H100 or B200 is the right choice right now. Specifically:

  • Any multimodal workload combining vision encoders with language models
  • MoE models, which includes DeepSeek V4, Mixtral, and Qwen3-235B-A22B
  • Diffusion model serving for image or video generation
  • Workloads where SSM or Mamba architectures are under evaluation
  • Teams running training or fine-tuning alongside inference on the same infrastructure
  • Any team that depends on vLLM, SGLang, TensorRT-LLM, or FlashAttention today
  • Organizations that need production hardware now, not in 12-18 months
  • Teams that cannot accept the software risk of a pre-production startup toolchain

Software Stack Risk: CUDA vs Etched's Toolchain

The CUDA ecosystem is 18 years old and deeply embedded in every major inference framework. vLLM's PagedAttention, TensorRT-LLM's kernel fusion, SGLang's multi-turn session management, FlashAttention-3's hardware-specific optimizations: all of these are CUDA-native and require complete rewrites to move off GPU.

Sohu requires a custom compiler developed by Etched, with no migration path from vLLM or TensorRT-LLM. Moving to Sohu means rebuilding your serving stack from scratch on a proprietary toolchain, then maintaining it on hardware from a startup. If Etched hits its schedule and the chip performs as claimed, you win on cost. If they miss a release date, pivot, or hit a supply chain issue, your serving stack is stranded on unsupported hardware.

Compare this to the Groq 3 LPU: NVIDIA-backed, integrated with NVIDIA Dynamo for disaggregated inference, with an enterprise support contract behind it. Or compare to Tenstorrent: at least TT-Metal is open-source under an MIT license (see the Tenstorrent vs NVIDIA post for how the software story plays out in practice). Sohu has neither the organizational backing of Groq LPU nor the open-source hedge of Tenstorrent.

TCO Model: When Does the ASIC Bet Pay Off

The break-even logic has three conditions that must all hold simultaneously (a rough calculator sketch follows the list):

  1. The ASIC's cost per token must be materially lower than GPU cloud for your specific workload and batch profile
  2. The migration and toolchain rewrite cost must be less than the cumulative savings over a realistic time horizon
  3. The architecture must stay stable long enough to amortize both costs before the next generation of models makes the hardware obsolete
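A minimal sketch of that logic, with every input a placeholder; Etched has published no pricing, so the ASIC-side numbers cannot be filled in with real values today.

```python
# Hypothetical break-even calculator for conditions 1-3 above.
def breakeven_months(gpu_cost_per_mtok: float,
                     asic_cost_per_mtok: float,
                     monthly_mtok: float,
                     migration_cost: float) -> float:
    """Months until cumulative token savings repay the serving-stack rewrite."""
    monthly_savings = (gpu_cost_per_mtok - asic_cost_per_mtok) * monthly_mtok
    if monthly_savings <= 0:
        return float("inf")              # condition 1 fails: no savings at all
    return migration_cost / monthly_savings

# Hypothetical inputs: $0.216 vs $0.10 per 1M tokens, 1 trillion
# tokens/month (1,000,000 Mtok), and a $2M toolchain migration:
print(f"~{breakeven_months(0.216, 0.10, 1_000_000, 2_000_000):.0f} months")  # ~17
# Condition 3 then asks: is that shorter than the architecture's lifetime?
```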

For Sohu specifically, none of these can be verified today. Etched has not published pricing or per-rack costs. The toolchain migration cost is unknown but likely significant for any team with a mature vLLM deployment. And the architecture stability assumption is already in question given the production adoption of MoE models.

Consider the alternative: rent H100 by the hour on Spheron at $2.90 per hour with no CapEx commitment and no toolchain migration. When B200 is available, switch. When B300 becomes accessible, switch again. Flexible GPU cloud rental requires no migration cost, no architecture bet, and full access to every model that ships in the next 12 months.

Practical Guidance for AI Infra Teams

Today, April 2026, GPU cloud is the right default for transformer inference. H100 and B200 are available now, vLLM runs on both out of the box, and the cost-per-token figures are competitive with any ASIC claim that has not been independently verified at production batch sizes.

Over the 12-18 month horizon, watch Sohu closely if your stack is genuinely pure transformer and your token volume is high enough that GPU cloud spend is a meaningful line item. Get on Etched's early access list, but do not commit production traffic until there are independent benchmarks at your actual batch sizes and a clear migration path from vLLM.

The hedge strategy is to build your cost and throughput baseline on Spheron GPU cloud now. Measure tokens per second at your production batch sizes. Calculate your actual cost per million tokens. That number is the benchmark every ASIC claim, including Sohu, Groq LPU, and Cerebras, must beat to justify migration. Without that baseline, ASIC throughput claims have no reference point.
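A minimal baseline measurement might look like the sketch below; the model ID, tensor parallelism, and batch size are placeholders for your own production profile.

```python
import time
from vllm import LLM, SamplingParams

# Measure decode throughput at a production-like batch size.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=8)
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarize the history of HBM memory."] * 32   # batch of 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:,.0f} tok/s at batch {len(prompts)}")
# Feed this into the cost-per-million-tokens formula above with your
# actual $/hr rate to get your migration benchmark.
```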

For teams that have not yet picked a GPU for inference, the best GPU for AI inference guide covers H100 vs H200 vs B200 vs L40S with concrete decision criteria based on model size, batch size, and budget.

Etched Sohu is a real architectural bet, but it is not available for production workloads today. H100 and B200 GPU cloud on Spheron lets you serve transformer inference now and build the benchmark baseline you need to evaluate Sohu, Groq LPU, or Cerebras when they prove out at scale.

Rent H100 on Spheron → | Rent B200 → | View all GPU pricing →
