Huawei Ascend 950 vs NVIDIA B300 and B200 for LLM Inference: TPS, Memory Bandwidth, and Cost Comparison (2026)

The Atlas 950 SuperNode was cited as part of Huawei's compute infrastructure behind some DeepSeek-family models. That connection brought the Ascend 950 (also called the 950PR) onto the radar of LLM inference teams who had never looked at non-NVIDIA hardware before. This post gives you the spec breakdown, benchmarks, and cost math to decide whether Ascend 950 is worth pursuing or whether NVIDIA Blackwell and Hopper are the practical choice. One thing to state clearly upfront: most teams outside China cannot legally procure Ascend hardware. US export control rules generally restrict acquisition of Huawei's advanced AI chips, and Ascend 950 is not offered through any globally accessible GPU cloud provider.

This is the third entry in a series covering non-NVIDIA AI accelerators for LLM inference. See Intel Gaudi 3 vs H200 and B200 for the SynapseAI stack analysis and AMD MI400 vs B300 for the CDNA 5 vs Blackwell Ultra comparison.

Quick Answer: Ascend 950 vs B200 vs B300 at a Glance

| GPU | Best For | Memory | Bandwidth | Spheron Price |
| --- | --- | --- | --- | --- |
| Ascend 950 (950PR) | Training at SuperNode scale inside Huawei's infrastructure | 112 GB HiBL | 1.4 TB/s | Huawei Cloud China only |
| B200 SXM6 | FP4 and FP8 inference, 100B+ models, highest throughput | 192 GB HBM3e | 8.0 TB/s | From $2.68/hr |
| B300 SXM6 | Highest per-GPU memory, 200B+ models at FP16 | 288 GB HBM3e | 8.0 TB/s | From $3.29/hr |

What Is the Huawei Ascend 950 (950PR)?

The Ascend 950 is Huawei's current flagship AI accelerator, built on the Da Vinci 3.0 architecture. Its key specs, per Huawei announcement, are: 1.56 PFLOPS FP4 compute, 112 GB HiBL (Huawei's proprietary HBM-class memory) at 1.4 TB/s, and a 600W TDP. The chip uses HCCS for intra-node communication and LingQu for cluster-scale interconnect. There is no Western-accessible datasheet for the Ascend 950. All specs above come from Huawei's Chinese-language announcements and third-party analysis, and should be read as Huawei-claimed figures rather than independently verified numbers.

The Atlas 950 SuperNode

The Atlas 950 SuperNode scales 8,192 Ascend 950 cards through Huawei's LingQu proprietary fabric, delivering 16 EFLOPS FP4 in aggregate (per Huawei announcement). Note: the SuperNode is built using the Ascend 950DT (decode/training) variant, not the 950PR (prefill) card profiled throughout this post; the specs and throughput estimates in this article apply to the 950PR and do not reflect the 950DT cards in the SuperNode. It was cited as part of the training compute infrastructure behind some DeepSeek-family models. To be precise: the SuperNode connection is to DeepSeek's training infrastructure, not inference serving. The Atlas 950 SuperNode is not available as a commercial cloud rental product outside Huawei Cloud in China.

Full Spec Comparison: Ascend 950 vs B200 vs B300

| Specification | Ascend 950 (950PR) | NVIDIA B200 SXM6 | NVIDIA B300 SXM6 |
| --- | --- | --- | --- |
| Architecture | Da Vinci 3.0 (Huawei) | Blackwell (GB100) | Blackwell Ultra (dual-die) |
| Memory | 112 GB HiBL | 192 GB HBM3e | 288 GB HBM3e |
| Memory Bandwidth | 1.4 TB/s | 8.0 TB/s | 8.0 TB/s |
| FP4 Compute | 1.56 PFLOPS (per Huawei) | 20 PFLOPS NVFP4 (sparse) | ~28 PFLOPS NVFP4 (est., sparse) |
| FP8 Compute | ~1 PFLOPS (per Huawei) | 9 PFLOPS | ~11 PFLOPS (est.) |
| Multi-GPU Interconnect | HCCS intra-node; LingQu scale-out | NVLink 5 (1.8 TB/s) | NVLink 5 (1.8 TB/s) |
| TDP | 600W | 1,000W | 1,400W |
| Software Stack | CANN + torch_npu | CUDA 12.x | CUDA 12.x |
| Cloud Availability | Huawei Cloud China only | Available on Spheron | Available on Spheron |

Note on FP4 figures: Huawei's FP4 implementation on Ascend and NVIDIA's NVFP4 are different formats with different numeric ranges and rounding behavior, so peak FLOPS figures are not directly comparable across vendors. The Ascend 950's FP8 figure (~1 PFLOPS) is per Huawei's published specification and is the basis for the widely cited "2.8x H20" performance claim. NVIDIA's B200 FP4 figure (20 PFLOPS sparse) is NVIDIA's published sparse peak; dense throughput is approximately half that.

Memory Bandwidth: Why 1.4 TB/s vs 8 TB/s Decides Inference Throughput

Decode (token generation) is memory-bandwidth-bound at small batch sizes. Each forward pass loads model weights from GPU memory, and token throughput scales roughly linearly with how fast that load completes. At batch size 1, nearly all time is spent loading weights and KV cache rather than doing matrix math.

Llama 3.3 70B in FP8 weighs approximately 70 billion × 1 byte = 70 GB. At 1.4 TB/s (Ascend 950), the weights move through the memory system roughly 20 times per second at batch 1. At 8 TB/s (B200), that same load completes approximately 114 times per second. That 5.7x bandwidth ratio (8 TB/s divided by 1.4 TB/s) maps almost directly to a 5.7x throughput gap at small batches, which is the figure used throughout this post.
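
The arithmetic above can be written as a rough ceiling calculation. This is a minimal sketch, assuming batch-1 decode is limited only by reading every weight once per token. The function name is illustrative, and real throughput lands below this ceiling because of KV cache reads and kernel overhead.

```python
def weight_passes_per_second(params_billion: float, bytes_per_param: float,
                             bandwidth_tb_s: float) -> float:
    """Upper bound on full weight reads per second for bandwidth-bound decode."""
    weights_gb = params_billion * bytes_per_param   # 70B params x 1 byte = 70 GB
    bandwidth_gb_s = bandwidth_tb_s * 1000.0        # TB/s -> GB/s
    return bandwidth_gb_s / weights_gb


ascend = weight_passes_per_second(70, 1.0, 1.4)
b200 = weight_passes_per_second(70, 1.0, 8.0)
print(f"Ascend 950: {ascend:.0f}/s, B200: {b200:.0f}/s, ratio: {b200 / ascend:.1f}x")
```

Because the weight size cancels out, the ratio depends only on the two bandwidth figures.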

The gap does not close at long context. KV cache grows with sequence length, and loading the accumulated KV cache per decode step also scales with bandwidth. For applications like long-document Q&A or multi-turn conversations with large context windows, the bandwidth gap remains decisive. For deeper reading on why LLM weights and KV cache dominate GPU memory requirements, see the GPU memory requirements for LLMs guide.
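
To put rough numbers on the long-context case, the sketch below estimates KV cache size and the bytes read per decode step. It assumes Llama 3.3 70B's published configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and an FP16 KV cache. The function names are illustrative, and the model ignores activations and kernel overhead.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored for `tokens` positions of one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K plus V
    return tokens * per_token


def decode_step_ms(weights_gb: float, context_tokens: int,
                   bandwidth_tb_s: float) -> float:
    """Lower bound on one batch-1 decode step: read weights and KV cache once."""
    total_bytes = weights_gb * 1e9 + kv_cache_bytes(context_tokens)
    return total_bytes / (bandwidth_tb_s * 1e12) * 1e3


for ctx in (4_096, 131_072):
    ascend_ms = decode_step_ms(70, ctx, 1.4)
    b200_ms = decode_step_ms(70, ctx, 8.0)
    print(f"{ctx} tokens: {ascend_ms:.1f} ms (1.4 TB/s) vs {b200_ms:.1f} ms (8.0 TB/s)")
```

Under this model both the weight read and the KV cache read are divided by the same bandwidth, so the ratio between the two chips stays at 8.0 / 1.4 at any context length.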

LLM Inference Benchmarks

Independent Ascend 950 inference benchmarks are scarce outside Chinese-language technical sources as of June 2026. Huawei has not submitted Ascend 950 results to MLCommons' MLPerf Inference data center benchmarks. The Ascend 950 figures below are bandwidth-scaling projections derived from H200 baselines using the 1.4 TB/s vs 4.8 TB/s ratio, and are labeled as estimates throughout. Note: at batch 32 and above, inference begins to shift from purely bandwidth-bound toward partially compute-bound behavior, so actual throughput on the Ascend 950 could exceed the pure bandwidth-ratio projection. Without measured data, the estimates here follow the bandwidth ratio conservatively.
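
The projection method described above can be sketched in a few lines. The H200 baseline values are the ones listed in this post's throughput table; the function name is illustrative, and the calculation assumes throughput scales linearly with memory bandwidth and nothing else.

```python
# H200 SXM5 baseline tok/s by batch size, taken from this post's throughput table
H200_BASELINE_TOK_S = {1: 680, 8: 1800, 32: 2500}


def project_by_bandwidth(baseline_tok_s: float, target_tb_s: float,
                         baseline_tb_s: float = 4.8) -> float:
    """Scale a baseline throughput by the ratio of memory bandwidths."""
    return baseline_tok_s * (target_tb_s / baseline_tb_s)


for batch, tok_s in H200_BASELINE_TOK_S.items():
    projected = project_by_bandwidth(tok_s, 1.4)
    print(f"batch {batch}: ~{projected:.0f} tok/s (projected)")
```

The pure ratio yields roughly 198, 525, and 729 tok/s, which is close to the Ascend 950 estimates listed in the throughput table.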

Llama 3.3 70B FP8: Throughput by Batch Size

| GPU | Batch 1 tok/s | Batch 8 tok/s | Batch 32 tok/s | Source |
| --- | --- | --- | --- | --- |
| Ascend 950 | ~200 (est.) | ~550 (est.) | ~730 (est.) | Batch 1/8: bandwidth-scaling from H200 baseline (1.4/4.8 TB/s ratio); batch 32 transitions toward compute-bound |
| H200 SXM5 | ~680 | ~1,800 | ~2,500 | vLLM community benchmarks, SemiAnalysis InferenceX |
| B200 SXM6 | ~1,150 | ~3,000 | ~3,600 (FP8) | Bandwidth-scaling from SemiAnalysis InferenceX; FP8 mode |
| B300 SXM6 | ~1,150 (est.) | ~3,000 (est.) | ~4,200 (est.) | Same bandwidth as B200; compute-bound advantage at batch 32+ |

B200 and B300 both carry 8.0 TB/s per chip, so their batch-1 and batch-8 decode throughput is similar. At batch 32 and above, B300's larger compute die provides a modest edge for compute-bound work. At small batches, bandwidth is the binding constraint and the two chips perform equivalently.

DeepSeek V4-Pro: Multi-GPU Configuration

DeepSeek V4-Pro at FP8 requires roughly 500 GB VRAM to hold model weights, which rules out single-card inference on any current hardware. The minimum configurations are: 5x Ascend 950 (560 GB combined HiBL), 3x B200 SXM6 (576 GB combined HBM3e), or 4x H200 SXM5 (564 GB combined HBM3e).
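
The minimum card counts follow from dividing the weight footprint by per-card memory and rounding up. A minimal sketch, using the roughly 500 GB figure and the capacities quoted in this post (the helper name is illustrative):

```python
import math


def min_gpus(required_gb: float, per_gpu_gb: float) -> int:
    """Smallest card count whose combined memory covers the requirement."""
    return math.ceil(required_gb / per_gpu_gb)


for name, capacity_gb in [("Ascend 950", 112), ("B200 SXM6", 192), ("H200 SXM5", 141)]:
    n = min_gpus(500, capacity_gb)
    print(f"{name}: {n} cards, {n * capacity_gb} GB combined")
```

This counts weight storage only. KV cache and activation headroom would push practical counts higher.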

There is an additional complication on Ascend: the vLLM Ascend backend has incomplete support for DeepSeek V4's fine-grained MoE routing as of June 2026. Teams running DeepSeek V4-Pro on Ascend would likely need custom CANN kernels or patches to the community backend. For the full vLLM expert parallelism setup on NVIDIA hardware, see the Deploy DeepSeek V4 on GPU cloud guide.

| Interconnect | Per-GPU BW | Architecture | Scale |
| --- | --- | --- | --- |
| HCCS + LingQu (Ascend 950) | HCCS intra-node: hundreds of GB/s (unconfirmed); LingQu chip-to-chip: ~2 TB/s (Huawei-claimed) | Ethernet-based fabric | Up to 8,192 cards (Atlas SuperNode) |
| NVLink 5 (B200/B300) | 1.8 TB/s per GPU | NVSwitch all-to-all | 72 GPUs in NVL72; multi-rack via InfiniBand |

Huawei has cited ~2 TB/s chip-to-chip bandwidth for the LingQu/UnifiedBus interconnect; this figure has not been independently confirmed. NVSwitch in the B200/B300 NVL72 delivers 1.8 TB/s per-GPU all-to-all within a 72-GPU rack. For fine-grained MoE expert dispatch like DeepSeek V4-Pro (256+ experts), this tight coupling keeps expert-routing latency low.

For teams outside China operating at sub-SuperNode scale (8 to 16 GPUs), NVLink 5 on B200 or H200 is the proven interconnect option. LingQu at that scale is not accessible as a standalone product through any globally available channel.

Software Reality: CANN vs CUDA

| Dimension | Ascend 950 (CANN) | NVIDIA B200 / B300 (CUDA) |
| --- | --- | --- |
| Primary SDK | CANN 8.x | CUDA 12.x |
| PyTorch integration | torch_npu plugin | Native CUDA backend |
| Inference framework | vLLM Ascend backend (community); MindSpore Transformers | vLLM, TensorRT-LLM, SGLang |
| Flash Attention | MindSpore Attention (not FlashAttention 3) | FlashAttention 3 (Blackwell/Hopper native) |
| HuggingFace support | Partial via torch_npu | Native transformers + accelerate |
| Model coverage | LLaMA, Qwen, DeepSeek (limited); newer architectures vary | Near-universal |
| Container registry | Huawei SWR (China-accessible) | NGC (global) |
| Quantization | FP8, BF16; FP4 hardware support with toolchain in beta | FP8, FP4 (B200/B300 native), BF16, INT8 |
| English documentation | Limited; primary docs are in Chinese | Extensive |

torch_npu patches PyTorch's device dispatch to route compute to CANN. Standard ops work. Custom CUDA kernels require full rewrites using Ascend C, Huawei's kernel language equivalent to CUDA C++.
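
A minimal sketch of what device selection can look like in practice, assuming the torch_npu package is installed alongside matching PyTorch and CANN builds. Importing torch_npu registers the "npu" device type with PyTorch. The helper name detect_backend is illustrative, and the fallbacks are there so the same script still runs on CUDA or CPU-only machines.

```python
import importlib.util


def detect_backend() -> str:
    """Return "npu", "cuda", or "cpu" depending on what this machine offers."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed at all
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        import torch_npu  # noqa: F401  (import side effect: registers "npu")
        if torch.npu.is_available():
            return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(f"Selected backend: {detect_backend()}")
```

Model code can then target the returned device string rather than hard-coding "cuda". This only covers standard ops; custom CUDA kernels still need the Ascend C rewrite described above.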

The vLLM Ascend backend is a community project, not Huawei-official. It lags upstream vLLM by several release cycles. Speculative decoding, MRV2, and SGLang are not available on Ascend. TensorRT-LLM is NVIDIA-only and will never run on Ascend hardware. Teams using TensorRT-LLM in production cannot migrate to Ascend without a complete stack swap.

Estimated porting time: 2 to 4 weeks for a standard PyTorch vLLM deployment on Ascend, assuming no prior CANN experience. Custom attention implementations or non-standard model architectures add 4 to 8 weeks. For comparison, deploying vLLM on NVIDIA hardware typically takes a few hours. See the vLLM production deployment guide for the NVIDIA baseline.

Availability and Export Controls

Ascend 950 is not sold outside China and is not available on any globally accessible GPU cloud provider. Channels where it is available: Huawei Cloud (China region), Alibaba Cloud (China region), and ByteDance Volcengine.

Export control rules issued by the US Bureau of Industry and Security (BIS) in October 2022 and updated in October 2023 restrict Huawei's access to advanced chip manufacturing equipment and restrict export of advanced AI accelerators involving Huawei's supply chain. The specific legal exposure depends on organization type, jurisdiction, and the nature of the transaction. US export control rules generally restrict export of Ascend hardware, but teams should verify their specific situation with legal counsel.

For teams inside China: Ascend 950 on Huawei Cloud is a real and competitive option, particularly for Chinese-language model inference where the community has CANN-optimized serving paths for Qwen, Baichuan, and DeepSeek variants.

Cost-Per-Token Analysis

Spot CPM formula: CPM = (spot $/hr) / (tokens_per_sec × 3600 / 1,000,000)

| GPU | Spot $/hr (Spheron) | On-demand $/hr (Spheron) | Est. Llama 3.3 70B FP8 tok/s (batch 32) | Cost per 1M tokens (spot) |
| --- | --- | --- | --- | --- |
| Ascend 950 | N/A | N/A | ~730 (est.) | Check Huawei Cloud CNY pricing |
| H200 SXM5 | $3.31 | $4.88 | ~2,500 | ~$0.37/M |
| B200 SXM6 | $2.68 | $9.23 | ~3,600 (FP8) | ~$0.21/M |
| B300 SXM6 | $3.29 | $9.16 | ~4,200 (est., FP8) | ~$0.22/M |

B200 and B300 produce similar cost-per-token at batch 32 despite B300's higher hourly rate, because B300's larger compute die delivers more tokens per second at that batch size. H200 is the most established option with the widest vLLM support. For measured baseline numbers from independent inference benchmarks, see the GPU cost per token benchmark 2026 post.
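
The cost-per-token column can be reproduced from the spot CPM formula. A minimal sketch, using the spot prices and batch-32 throughput estimates from the table above (the function name is illustrative):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """CPM = ($/hr) / (tokens_per_sec * 3600 / 1,000,000)."""
    million_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return price_per_hour / million_tokens_per_hour


# (GPU, spot $/hr, estimated tok/s at batch 32) as listed in this post
rows = [("H200 SXM5", 3.31, 2500), ("B200 SXM6", 2.68, 3600), ("B300 SXM6", 3.29, 4200)]
for gpu, price, tok_s in rows:
    print(f"{gpu}: ${cost_per_million_tokens(price, tok_s):.2f} per 1M tokens")
```

Swapping in on-demand prices or your own measured tokens per second gives the figure that matters for your workload.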

Pricing fluctuates based on GPU availability. The prices above are as of 10 Jun 2026 and may have changed. Check current GPU pricing for live rates.

When Ascend 950 Makes Sense

  • Teams inside China with existing CANN toolchain experience and direct Huawei Cloud access
  • Large-scale MoE inference within Huawei's Atlas 950 SuperNode infrastructure
  • Chinese-language model serving (Qwen, Baichuan, DeepSeek) where the community has CANN-optimized inference paths
  • Organizations where access to NVIDIA hardware is the binding constraint (rare outside specific jurisdictions)

When NVIDIA Wins

  • Any team outside China: Ascend 950 is not procurable through legal channels for most organizations
  • Fine-grained MoE inference (DeepSeek V4-class models) at multi-GPU scale where NVLink 5's 1.8 TB/s reduces expert dispatch latency
  • Deployments using TensorRT-LLM, SGLang, upstream vLLM features, or speculative decoding
  • Teams without CANN experience who cannot absorb a 2 to 8 week porting investment
  • Long-context inference at small batch sizes: B200's 8 TB/s bandwidth produces roughly 5.7x higher decode throughput than Ascend 950's 1.4 TB/s
  • Any workload requiring English-language documentation, active community support, and mature ecosystem tooling

Decision Matrix

| Situation | Choose |
| --- | --- |
| Team inside China, CANN experience, Huawei Cloud access | Ascend 950 |
| Team outside China, any LLM inference workload | NVIDIA B200 or H200 on Spheron |
| Fine-grained MoE inference (DeepSeek V4-class), global access | B200 SXM6 or H200 SXM5 |
| Highest per-GPU memory capacity (200B+ models at FP16) | B300 SXM6 (288 GB HBM3e) |
| Lowest cost-per-token at current market rates | Compare B200 spot vs H200 spot via live /pricing/ |
| Need to A/B test GPU types before committing | Spheron on-demand (per-minute billing, no contracts) |
| TensorRT-LLM or SGLang in production | NVIDIA only (CANN not supported) |

For teams outside China deciding now, the practical path comes down to memory and budget. B300 GPU rental delivers 288 GB HBM3e at 8.0 TB/s. At 2 bytes per parameter, that holds roughly 140B parameters of FP16 weights on a single card, or 200B+ parameters at FP8; larger models still need multiple GPUs. If your budget is tighter or you need lower per-GPU cost at equivalent bandwidth, B200 SXM6 on Spheron offers the same 8 TB/s with 192 GB from $2.68/hr spot. For teams running Llama 3 or DeepSeek V3-class models that fit in under 141 GB, on-demand H200 instances remain the most battle-tested option in the NVIDIA cloud stack, with the widest vLLM coverage and the most community benchmarks to compare against.


The Ascend 950 is real hardware with real performance inside Huawei's infrastructure. But export controls make it unavailable to most teams, and CANN's porting cost eats the budget advantage for anyone starting from a CUDA stack. For DeepSeek V4-Pro, Qwen 3, or any MoE model running today, NVIDIA Blackwell and Hopper on Spheron give you the same models, verified benchmarks, and per-minute billing.

Quick Setup Guide

  1. Confirm whether Ascend 950 is procurable for your team

    Before investing in Ascend evaluation, verify procurement feasibility: determine whether your organization is subject to US export control restrictions (if so, Ascend hardware is not legally procurable), identify whether Huawei Cloud China offers the specific model and capacity you need, and assess your team's existing CANN or torch_npu experience. If procurement is blocked or uncertain, move directly to NVIDIA Blackwell or Hopper alternatives.

  2. Benchmark your target GPU on Spheron before committing

    Spin up a B200 SXM6 or H200 SXM5 instance on Spheron for on-demand access. Run your production model using vLLM at the same batch size and sequence lengths you need in production. Record tokens per second and time-per-output-token. Calculate cost per million tokens: CPM = ($/hr) / (tokens_per_sec × 3600 / 1,000,000). This establishes the baseline any alternative hardware must beat.

  3. Calculate cost-per-token across hardware options

    For NVIDIA hardware, use Spheron's live pricing from /pricing/. For Ascend, use Huawei Cloud China's listed pricing in CNY converted at current exchange rates, and add an estimate of your team's CANN porting cost amortized over expected GPU-hours. Most teams outside China find the porting cost alone makes Ascend uncompetitive, before accounting for procurement barriers.
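
The amortization in step 3 can be sketched as follows. Every input in the example call is a placeholder chosen for illustration, not a quoted price or a measured cost, and the function names are hypothetical. Substitute your own converted CNY rate, porting estimate, and expected GPU-hours.

```python
def effective_hourly_cost(cloud_price_per_hour: float, porting_cost: float,
                          expected_gpu_hours: float) -> float:
    """Hourly price plus a one-time porting cost spread over expected GPU-hours."""
    return cloud_price_per_hour + porting_cost / expected_gpu_hours


def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """CPM = ($/hr) / (tokens_per_sec * 3600 / 1,000,000)."""
    return price_per_hour / (tokens_per_sec * 3600 / 1_000_000)


# Placeholder inputs: $2.00/hr after currency conversion, $40,000 of engineering
# time for the CANN port, and 20,000 GPU-hours of expected usage.
hourly = effective_hourly_cost(2.00, 40_000, 20_000)
cpm = cost_per_million_tokens(hourly, 730)  # 730 tok/s is this post's batch-32 estimate
print(f"effective ${hourly:.2f}/hr, ${cpm:.2f} per 1M tokens")
```

The shorter the expected usage, the more the porting term dominates the hourly figure, which is why low-volume deployments rarely recover the migration cost.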

Frequently Asked Questions

The Ascend 950 (also called the 950PR) is Huawei's flagship AI accelerator with 1.56 PFLOPS FP4 compute, 112 GB HiBL memory, 1.4 TB/s bandwidth, and a 600W TDP. In the Atlas 950 SuperNode configuration, 8,192 cards deliver 16 EFLOPS FP4 aggregate. By comparison, the NVIDIA B200 delivers 8 TB/s memory bandwidth and 20 PFLOPS NVFP4 sparse per card. The 5.7x bandwidth gap means B200 produces substantially higher decode throughput at small batch sizes, the regime that dominates real-time LLM serving.

Teams outside China generally cannot buy or rent Ascend 950 hardware. US export control rules restrict export of Huawei's advanced AI chips, and Ascend 950 is only sold through Huawei Cloud China, Alibaba Cloud China, and ByteDance Volcengine. Teams outside China who need GPU capacity for LLM inference use NVIDIA Hopper or Blackwell hardware through globally accessible providers like Spheron.

CANN (Compute Architecture for Neural Networks) is Huawei's AI software stack for Ascend hardware, analogous to CUDA. PyTorch runs on Ascend via the torch_npu plugin. vLLM has a community-contributed Ascend backend that lags upstream vLLM by several release cycles and covers fewer model architectures. Teams accustomed to CUDA typically underestimate the porting effort when moving to Ascend.

The Atlas 950 SuperNode is a cluster-scale AI system built from 8,192 Ascend 950 cards connected via Huawei's LingQu proprietary interconnect, delivering 16 EFLOPS FP4 aggregate (per Huawei announcement). It was cited as part of the compute infrastructure behind some DeepSeek-family model training. It is not available as a cloud rental product outside Huawei Cloud in China.

If you cannot access Ascend 950 hardware (which most teams outside China cannot), the practical alternatives for DeepSeek V4-class MoE inference are NVIDIA H200 SXM5 (141 GB HBM3e, 4.8 TB/s, mature vLLM stack) and NVIDIA B200 SXM6 (192 GB HBM3e, 8 TB/s) for higher throughput with FP4 and FP8. Both are available on Spheron with on-demand and spot pricing. Check /pricing/ for current rates.
