AWS Trainium 3 launched at re:Invent in December 2025 and picked up real enterprise adoption quickly. Uber began a Trainium 3 training pilot in April 2026. Anthropic committed to 1M+ accelerators under a long-term AWS agreement. The per-unit pricing looks compelling against H200. The software story is substantially more complicated, and the math only closes under conditions that most teams do not match.
The question worth asking is not just "which chip is faster" but "does the cost advantage survive the engineering cost of moving off CUDA." This post gives you the spec comparison, the cost-per-token math using live Spheron pricing, and a realistic accounting of what the Neuron SDK migration actually requires.
For the broader context on hyperscaler custom silicon and why these chips are captive to their builders, see Hyperscaler Custom AI Chips in 2026. This post focuses specifically on the Trainium 3 vs H200 and B200 comparison with full cost and migration detail.
Quick Answer: Trainium 3 vs H200 vs B200 at a Glance
| System | Best For | Memory / Bandwidth | Price |
|---|---|---|---|
| AWS Trainium 3 (EC2 Trn3) | Dense transformer training/inference inside AWS, Neuron-native stacks | 144 GB HBM3e / 4.9 TB/s (per chip) | ~$1.80/hr per NeuronCore (AWS estimate; see note below) |
| NVIDIA H200 SXM5 | 70B-class inference and fine-tuning, mature CUDA stack, vLLM-native | 141 GB HBM3e / 4.8 TB/s | $4.54/hr on-demand, $3.31/hr spot on Spheron |
| NVIDIA B200 SXM6 | 100B+ inference, FP4 workloads, highest CUDA throughput | 192 GB HBM3e / 8.0 TB/s | $5.34/hr spot on Spheron (no on-demand listings currently) |
Note on Trainium 3 pricing: AWS prices EC2 Trn3 instances per instance, not per chip or NeuronCore. The ~$1.80/hr per NeuronCore figure is an illustrative estimate derived from dividing instance pricing by NeuronCore count; AWS does not publish a verified per-NeuronCore rate. A NeuronCore is a sub-unit of the Trainium chip (8 NeuronCore-v4 per chip), not a direct equivalent to an NVIDIA GPU. Per-unit comparisons across the two architectures require careful interpretation.
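To see why the comparison unit matters, the illustrative per-NeuronCore rate can be scaled to a whole chip. This is a minimal sketch that uses only the two figures in the note above (8 NeuronCores per chip, ~$1.80/hr per NeuronCore). Both are estimates rather than AWS-published rates, so the result is directional only.

```python
NEURONCORES_PER_CHIP = 8          # from the note above
PRICE_PER_NEURONCORE_HR = 1.80    # illustrative estimate, not an AWS-published rate
H200_ON_DEMAND_HR = 4.54          # Spheron on-demand price from the table


def per_chip_price(per_core_price: float, cores_per_chip: int = NEURONCORES_PER_CHIP) -> float:
    """Scale an illustrative per-NeuronCore rate up to a whole-chip rate."""
    return per_core_price * cores_per_chip


trn3_chip_hr = per_chip_price(PRICE_PER_NEURONCORE_HR)
print(f"Implied Trainium 3 per-chip rate: ${trn3_chip_hr:.2f}/hr")
print(f"H200 on-demand per-GPU rate:      ${H200_ON_DEMAND_HR:.2f}/hr")
```

Whether a given throughput figure is quoted per NeuronCore or per chip changes the cost picture by the same factor of 8, so it is worth confirming the unit before comparing against a per-GPU price.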
Trainium 3 is available only as EC2 Trn3 instances. On Spheron GPU cloud, H200 is available on-demand and on spot, while B200 is currently spot-only.
Trainium 3 Architecture: What Changed from Trn2
Trainium 3 is AWS's third-generation ML accelerator, manufactured on a 3nm process. The headline specs represent a meaningful step up from Trn2.
Each Trainium 3 chip contains 8 NeuronCore-v4 processors and 144 GB HBM3e, with per-chip bandwidth of approximately 4.9 TB/s. The NeuronCore-v4 design includes a systolic array for matrix operations and vector units for element-wise ops, analogous in broad strokes to NVIDIA's SM plus Tensor Core split but not interchangeable at the software level. The right comparison unit is the chip (not the NeuronCore): at the chip level, Trainium 3 (144 GB / 4.9 TB/s) sits close to H200 (141 GB / 4.8 TB/s) in raw memory and bandwidth.
The UltraServer topology connects multiple Trn3 instances via NeuronLink, AWS's proprietary interconnect designed for scale-out training. For large multi-node jobs where compute and communication patterns align with NeuronLink's topology, this enables efficient gradient synchronization at scale.
AWS claims up to 4.4x higher performance per Trn3 UltraServer versus Trn2. On a dense FP8 basis, this works out to roughly 2x per NeuronCore, though that per-NeuronCore figure is derived from the FP8 spec jump, not an AWS-published per-NeuronCore claim. The specific gain depends on model architecture, batch size, and sequence length. AWS-internal measurements for dense transformer training show the most consistent improvement. MoE architectures and workloads with irregular compute patterns show more variable gains.
EC2 Trn3 comes in several instance sizes, from the trn3.2xlarge (smallest) to the trn3.48xlarge (largest). AWS publishes 8 NeuronCore-v4 units per Trainium 3 chip; per-instance NeuronCore counts are not separately verified by AWS in public pricing documentation, so the per-NeuronCore estimates throughout this post are illustrative. The larger instances are the practical target for 70B+ model inference or pre-training runs.
Spec Head-to-Head: Trainium 3 vs H200 vs B200
| Spec | Trainium 3 | NVIDIA H200 SXM5 | NVIDIA B200 SXM6 |
|---|---|---|---|
| Architecture | NeuronCore-v4 (3nm) | Hopper (4nm) | Blackwell (4nm) |
| VRAM | 144 GB HBM3e per chip (8 NeuronCore-v4) | 141 GB HBM3e | 192 GB HBM3e |
| Memory Bandwidth | 4.9 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP8 throughput | 2.52 PFLOPS (dense) | 1.98 PFLOPS (dense) | 4.50 PFLOPS (dense) |
| FP16 throughput | ~1.2 PFLOPS (est.) | ~0.99 PFLOPS (dense) | 2.25 PFLOPS (dense) |
| FP4 support | MXFP4 (no FP4 speedup) | No | Yes (native FP4 acceleration) |
| Interconnect | NeuronLink (proprietary) | NVLink 4 (900 GB/s) | NVLink 5 (1.8 TB/s) |
| Form factor | EC2 Trn3 (AWS only) | SXM5 module | SXM6 module |
| Availability | AWS EC2 only | Open GPU cloud + AWS | Open GPU cloud |
| Software stack | Neuron SDK (no CUDA) | CUDA 12.x, vLLM-native | CUDA 12.x, vLLM-native |
A few numbers need context:
FP8 throughput: All FP8 and FP16 figures in the table above are dense (no 2:4 sparsity). On a dense basis, Trainium 3 (2.52 PFLOPS FP8) leads H200 (1.98 PFLOPS) by roughly 27% but trails B200 (4.50 PFLOPS), which is roughly 79% higher. For workloads that exploit 2:4 sparsity, H200 sparse FP8 reaches ~3.96 PFLOPS (57% above Trainium 3) and B200 sparse FP8 reaches ~9.0 PFLOPS (roughly 3.6x Trainium 3). Actual throughput depends on whether the model and serving framework can utilize sparsity, which varies by workload.
Memory: At the chip level, Trainium 3 (144 GB HBM3e per chip, 8 NeuronCore-v4) is close to H200 (141 GB per GPU) in raw capacity. Per-NeuronCore figures are not the right comparison unit. For model-fit analysis, compare at the chip or instance level: the trn3.48xlarge (the largest single-node Trn3 instance) provides enough aggregate HBM3e to run a 70B FP8 model (~70 GB) with substantial KV cache headroom. A single H200 GPU (141 GB) handles the same on a single device. B200's 192 GB per GPU adds headroom for 140B+ models or long-context serving.
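A rough model-fit check can be sketched as weights plus KV cache against device memory. The layer count, KV-head count, and head dimension below are the Llama-3-70B-style shape (80 layers, 8 KV heads, head_dim 128). The 10% runtime reserve and the FP16 KV cache are assumptions made for illustration, not figures from this post.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(device_gb: float, weight_gb: float, per_token_bytes: int,
                      reserve_frac: float = 0.10) -> int:
    """Tokens of KV cache that fit after weights and a runtime reserve (assumed 10%)."""
    usable = device_gb * 1e9 * (1 - reserve_frac) - weight_gb * 1e9
    return max(0, int(usable // per_token_bytes))


# Assumed Llama-3-70B-style shape with an FP16 KV cache (2 bytes per element).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

for name, gb in [("Trainium 3 chip", 144), ("H200", 141), ("B200", 192)]:
    tokens = max_cached_tokens(gb, weight_gb=70, per_token_bytes=per_token)
    print(f"{name:<16} ~{tokens:,} cached tokens alongside 70 GB of FP8 weights")
```

Under these assumptions Trainium 3 and H200 land within a few percent of each other, while B200 holds roughly twice as many cached tokens.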
Bandwidth: Trainium 3 (4.9 TB/s) and H200 (4.8 TB/s) are nearly identical. For memory-bandwidth-bound decode at small batch sizes, neither chip has a bandwidth advantage. B200 at 8.0 TB/s is in a different tier.
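The bandwidth point can be made concrete with a simple ceiling estimate: in bandwidth-bound decode, every generated token has to stream the full weight set from HBM at least once. The sketch below assumes 70 GB of FP8 weights and ignores KV cache reads, kernel overheads, and batching, so it gives an upper bound per sequence, not a prediction.

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-sequence decode speed when each token
    requires one full pass over the weights in HBM."""
    return (bandwidth_tb_s * 1e12) / (weight_gb * 1e9)


for name, bw in [("Trainium 3", 4.9), ("H200", 4.8), ("B200", 8.0)]:
    ceiling = decode_ceiling_tok_s(bw, weight_gb=70)
    print(f"{name:<11} ~{ceiling:.0f} tok/s per sequence (ceiling)")
```

The Trainium 3 and H200 ceilings differ by about 2%, which is well inside the noise of any real serving stack.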
Inference Benchmarks: Llama 3 70B and Dense Transformer Workloads
There is no independent third-party benchmark comparing Trainium 3 directly to H200 or B200 as of July 2026. The available public data breaks into two separate buckets.
Trainium data (AWS-internal): AWS published throughput figures for Trn2 (the previous generation) on Llama 2 70B showing competitive throughput with H100 at batch 32+. For Trainium 3, AWS claims up to 4.4x higher performance per Trn3 UltraServer versus Trn2. Extrapolating from Trn2 inference figures gives an estimated per-chip baseline, but these are derived estimates, not measured Trn3 numbers from an independent lab.
H200 data (independent): The vLLM team and independent benchmark runs publish H200 SXM5 throughput for Llama 3 70B. At batch 8 using vLLM with FP8 quantization, H200 achieves approximately 1,500 tokens/sec per GPU. At batch 32, throughput increases to around 2,400 tokens/sec as GPU utilization improves.
The comparison challenge: Trainium 3 figures are AWS-internal at batch 32+, H200 figures are independently measured. At the same batch size (32), H200 throughput rises to close the per-chip gap. The advantage Trainium 3 claims at batch 32 in AWS benchmarks has not been reproduced by any third party on a matched workload.
For workloads with variable batch sizes, short requests, or latency constraints that cap practical batch size at 8 or below, the comparison is even less favorable to Trainium 3 based on available data.
The Real Cost Per Token: Trainium 3 vs H200 vs B200
Cost per million tokens (CPM) is computed as: CPM = ($/hr per unit) / (tokens_per_sec × 3600 / 1,000,000)
Four configurations compared using live Spheron pricing (fetched 02 Jul 2026) and the throughput data described above:
| Config | $/hr per unit | Tokens/sec | CPM | Notes |
|---|---|---|---|---|
| Trainium 3 (EC2 Trn3, batch 32) | ~$1.80/NeuronCore | ~1,800 (AWS-claimed) | ~$0.28/M | AWS-internal data; NeuronCore ≠ GPU |
| H200 SXM5 on Spheron (on-demand) | $4.54/GPU | 1,500 (vLLM, batch 8) | $0.84/M | Independently measured |
| H200 SXM5 on Spheron (spot) | $3.31/GPU | 1,500 (vLLM, batch 8) | $0.61/M | Spot can be reclaimed anytime |
| B200 SXM6 on Spheron (spot) | $5.34/GPU | ~2,200 (est., FP8) | ~$0.67/M | No on-demand available currently; spot only |
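The table rows can be reproduced with a few lines of Python, which also makes it easy to substitute your own prices and throughput. The last scenario is not a table row: it re-prices H200 spot at the ~2,400 tokens/sec batch-32 figure quoted in the benchmark section, as one way of comparing like batch sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    price_per_hr: float     # $/hr per unit, as quoted in the table
    tokens_per_sec: float   # throughput per unit

    def cpm(self) -> float:
        """Dollars per million tokens at steady-state throughput."""
        return self.price_per_hr / (self.tokens_per_sec * 3600 / 1_000_000)


scenarios = [
    Scenario("Trainium 3 (per NeuronCore, AWS-claimed, batch 32)", 1.80, 1800),
    Scenario("H200 on-demand (vLLM, batch 8)", 4.54, 1500),
    Scenario("H200 spot (vLLM, batch 8)", 3.31, 1500),
    Scenario("B200 spot (estimated, FP8)", 5.34, 2200),
    # Not in the table: H200 spot at the batch-32 throughput from the benchmark section.
    Scenario("H200 spot (vLLM, batch 32)", 3.31, 2400),
]

for s in scenarios:
    print(f"{s.name:<52} ${s.cpm():.2f}/M tokens")
```

At matched batch size, the gap between the Trainium 3 estimate and H200 spot shrinks from about $0.33/M to about $0.10/M under these inputs.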
The 30-50% cost advantage AWS claims for Trainium 3 is real in AWS's own benchmarks at high batch sizes on dense transformer models natively ported to Neuron. That advantage narrows substantially against H200 spot on Spheron and disappears when the engineering cost of the Neuron SDK migration is factored into the total cost calculation.
To make that concrete: if your team spends three weeks porting a production vLLM stack to Neuron at a blended rate of $150/hr for two engineers, that migration costs ~$36,000 before you run a single inference request. At the CPM savings between Trainium 3 ($0.28/M) and H200 spot ($0.61/M, independently verified), you would need to serve approximately 109 billion tokens before the migration breaks even. At 10M tokens/day, that is 30 years. At 1B tokens/day, it is 109 days.
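The break-even arithmetic above can be sketched as a small calculator. The inputs are the ones used in this paragraph (three weeks, two engineers, $150/hr, $0.61/M vs $0.28/M). The 40-hour week is an assumption implied by the ~$36,000 figure.

```python
def migration_cost(weeks: float, engineers: int, hourly_rate: float,
                   hours_per_week: float = 40) -> float:
    """One-time engineering cost of the port (40-hour weeks assumed)."""
    return weeks * hours_per_week * engineers * hourly_rate


def breakeven_tokens(cost_usd: float, cpm_current: float, cpm_target: float) -> float:
    """Tokens that must be served before CPM savings repay the migration."""
    saving_per_million = cpm_current - cpm_target
    if saving_per_million <= 0:
        return float("inf")  # no saving, so the migration never pays back
    return cost_usd / saving_per_million * 1_000_000


cost = migration_cost(weeks=3, engineers=2, hourly_rate=150)
tokens = breakeven_tokens(cost, cpm_current=0.61, cpm_target=0.28)
print(f"Migration cost: ${cost:,.0f}; break-even at ~{tokens / 1e9:.0f}B tokens")

for per_day in (10e6, 1e9, 100e9):
    days = tokens / per_day
    print(f"{per_day:>16,.0f} tokens/day -> {days:,.1f} days ({days / 365:.1f} years)")
```

Swapping in a longer port (for example the 4-6 week range) or a smaller CPM gap moves the break-even point proportionally.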
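The math changes significantly at scale. If you are already inside AWS and serving 100B+ tokens per day on a fully Neuron-ported stack, the CPM savings compound into real money. If you are starting from vLLM and serving a few billion tokens per month, break-even on the example above sits roughly three years out (109B tokens at 3B tokens/month), and ongoing Neuron maintenance pushes it further, so in practice the migration cost does not close.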
Pricing fluctuates based on GPU availability. The prices above are based on 02 Jul 2026 and may have changed. Check current GPU pricing for live rates.
The Neuron SDK Migration Tax: What 2-6 Weeks Actually Means
The Neuron SDK is AWS's CUDA equivalent: a compiler, runtime, and library stack that runs exclusively on Trainium and Inferentia chips. It does not run native CUDA code. vLLM has an official Neuron backend (the vllm-neuron plugin / NxD Inference) that enables basic serving with minimal code changes, but CUDA-specific components do not carry over: custom Triton kernels, AWQ/GPTQ quantization kernels, FlashAttention-3, and TensorRT-LLM engines cannot run on Neuron.
For a team currently running vLLM in production, the porting checklist includes:
Framework replacement:
- For basic serving, the vllm-neuron plugin (NxD Inference) provides a Neuron backend for vLLM with minimal code changes. For full model-parallel inference, `transformers-neuronx` remains an option (now considered legacy). AWS Neuron ModelServer handles managed production serving.
- Replace `flash-attn` with Neuron's built-in attention kernel. Neuron's persistent KV cache handles standard attention, but custom FlashAttention-3 patterns and non-standard attention variants must be dropped or rewritten.
- Replace TensorRT-LLM serialized `.engine` files with Neuron-compiled `.neff` artifacts. Compilation time for a 70B model on Neuron is typically 20-45 minutes.
Operator and kernel replacement:
- Any operator using `torch.cuda.*` must be replaced with a Neuron-compatible equivalent or removed
- Custom CUDA kernels in C++ or PTX have no Neuron equivalent. They must be rewritten entirely in Neuron's kernel API or dropped
- AWQ and GPTQ quantization kernels commonly used in production stacks are CUDA-only. Neuron has its own quantization path, which is not compatible with GPTQ/AWQ weight formats
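A quick way to size the operator and kernel work is to scan the codebase for CUDA-specific usage before planning the port. The sketch below is a hypothetical helper, not part of any Neuron tooling. Its patterns are illustrative and far from exhaustive (they will miss C++/PTX kernels and dynamic dispatch).

```python
import re
from pathlib import Path

# Illustrative patterns only; extend for your own stack.
CUDA_PATTERNS = {
    "torch.cuda API": re.compile(r"\btorch\.cuda\."),
    ".cuda() call": re.compile(r"\.cuda\("),
    "cuda device string": re.compile(r"""["']cuda(:\d+)?["']"""),
    "Triton kernel": re.compile(r"@triton\.jit\b"),
    "flash-attn import": re.compile(r"^\s*(from|import)\s+flash_attn\b"),
}


def scan_source(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) for each CUDA-specific hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CUDA_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under root and collect files with hits."""
    report = {}
    for path in Path(root).rglob("*.py"):
        hits = scan_source(path.read_text(encoding="utf-8", errors="ignore"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree("path/to/your/repo")` returns a per-file list of hits. The count is a rough proxy for porting effort, not a substitute for reading the code.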
Validation and evaluation:
- Rebuild evaluation harnesses to run inside a `torch_neuronx` context. Standard PyTorch evaluation code that calls `.cuda()` will fail
- Re-run model accuracy evals after Neuron compilation. Neuron's graph compiler applies precision choices during compilation that can subtly shift output distributions versus FP16 CUDA baseline
- Validate numerics on your actual use case, not just a synthetic benchmark
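The numerics check in the list above can be sketched as a comparison of per-position logits from the CUDA baseline and the Neuron-compiled model on the same prompts. How the logits are captured on each side is outside this sketch. The metric choices (top-1 agreement, max absolute difference, mean KL divergence) are assumptions about what is useful to track, not a Neuron-prescribed procedure.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(baseline: list[list[float]], candidate: list[list[float]]) -> dict:
    """Compare two equally shaped lists of per-position logit vectors."""
    top1_match = 0
    max_abs_diff = 0.0
    kl_sum = 0.0
    for b, c in zip(baseline, candidate):
        top1_match += int(b.index(max(b)) == c.index(max(c)))
        max_abs_diff = max(max_abs_diff, max(abs(x - y) for x, y in zip(b, c)))
        p, q = softmax(b), softmax(c)
        # KL(p || q); skip terms where either probability underflowed to zero.
        kl_sum += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)
    n = len(baseline)
    return {
        "top1_agreement": top1_match / n,
        "max_abs_diff": max_abs_diff,
        "mean_kl": kl_sum / n,
    }
```

Acceptable thresholds depend on the use case. A drop in top-1 agreement on production prompts is usually a stronger signal than a small shift in raw logit values.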
Realistic time estimates:
| Scenario | Time |
|---|---|
| Team new to Neuron, Neuron-supported model (Llama 3 70B, Mistral) | 2-3 weeks |
| Team new to Neuron, custom attention kernel or non-standard arch | 4-6 weeks |
| Team with prior Neuron SDK experience | 1-2 weeks |
What is not portable: custom CUDA kernels (Triton, CUDA C, PTX), any library that calls torch.cuda.is_available() as a dispatch condition, TensorRT-LLM-compiled engines, and vLLM's CUDA-specific extensions (AWQ/GPTQ kernels, FlashAttention-3, custom CUDA operators). The vllm-neuron plugin covers basic continuous batching serving but not these CUDA-native components, which require full rewrites or must be dropped.
The ongoing maintenance overhead: Neuron SDK releases follow AWS's internal roadmap, not the vLLM or SGLang release cycle. When vLLM or SGLang adds speculative decoding improvements, RadixAttention updates, or new quantization methods, Neuron typically lags by one to several quarters. The ported stack requires dedicated attention on each Neuron SDK update to pick up performance improvements and resolve API changes.
For context on what vLLM's production feature set looks like and what you would be giving up on the CUDA side, see the vLLM production deployment guide. For TRT-LLM specifics, see the TensorRT-LLM deployment guide.
The Intel Gaudi 3 vs H200 and B200 comparison covers a similar migration-cost analysis for SynapseAI. The pattern is structurally the same as Neuron: proprietary SDK, limited CUDA-ecosystem parity, and real compute savings that only close at sufficient scale.
The authoritative list of Neuron-supported model architectures is maintained at the AWS Neuron documentation. Check it for your specific model before planning any migration, as support changes with each SDK release.
Training vs Inference: Where Trainium 3 Wins and Where CUDA Holds
Training
Dense transformer pre-training is where Trainium 3 is most competitive. For large training runs where:
- The team is already AWS-native (S3 for data, SageMaker HyperPod for distributed training, CloudWatch for experiment tracking)
- The model architecture is natively supported in the Neuron SDK (or the team has already ported it)
- Training runs at scale and duration where the per-unit cost savings compound over months
For these conditions, Trainium 3 with NeuronLink scale-out is a legitimate alternative to H100/H200 clusters at training scale. Anthropic's commitment is a real signal: at the billions-of-tokens-per-day scale of frontier model training, the migration engineering cost is a one-time fixed cost against a long-term variable cost.
AWS's managed training environment (SageMaker HyperPod with Trn3 instances) removes some of the distributed training setup overhead relative to a raw H200 cluster. If a team does not want to manage distributed training infrastructure, this matters.
Inference
For production inference serving, the picture changes. While vLLM now has an official Neuron backend (vllm-neuron / NxD Inference) that supports basic continuous batching on Trainium, several production features remain significantly more mature on CUDA or have no Neuron equivalent:
- Continuous batching: mature in vLLM and SGLang; vllm-neuron provides a Neuron version with reduced flexibility and fewer optimization options
- Speculative decoding: mature in vLLM and SGLang with multiple variants; AWS Neuron added speculative decoding support in 2024 but with fewer variants and less community tooling
- RadixAttention (prefix caching): available in SGLang; no Neuron equivalent
- PagedAttention: vLLM's memory management system; Neuron has its own approach, with different behavior and less community tooling around it
- FP4 quantization: native FP4 acceleration on B200 via CUDA; Trainium 3 supports MXFP4 but at the same throughput rate as MXFP8 with no dedicated FP4 speedup (native FP4 acceleration is slated for Trainium 4)
Teams running mixed-traffic inference servers (variable batch sizes, diverse sequence lengths, multiple LoRA adapters, latency-sensitive requests) see these tooling advantages on top of the per-dollar throughput comparison. The H200's CUDA ecosystem is three to four years ahead of Neuron on inference serving features.
Spot pricing adds a further dimension unavailable on EC2 Trn3. As of July 2026, AWS does not offer a spot or preemptible equivalent for Trn3 instances. Spheron offers spot H200 instances at $3.31/hr per GPU, which directly cuts inference costs for batch workloads without any porting requirement.
For broader cost optimization strategies at the infrastructure layer, see the AI inference cost economics 2026 guide. For GPU selection specifically for inference workloads, see best GPU for AI inference 2026.
AWS Lock-In vs Neocloud NVIDIA: Decision Framework
| Scenario | Recommendation |
|---|---|
| AWS-native team, Neuron-supported model, batch 32+ dense inference | Trainium 3 worth evaluating. Run CPM math at your specific batch profile and factor in migration weeks. |
| Any stack using vLLM, SGLang, TensorRT-LLM, or custom CUDA kernels | H200 or B200 on Spheron. No migration required. |
| Portability requirement (switch providers, avoid single-vendor dependency) | CUDA GPU cloud only. Trainium is AWS-exclusive and non-portable. |
| Spot or preemptible budget workloads (embeddings, async batch inference) | H200 or B200 spot on Spheron. No spot equivalent exists on EC2 Trn3. |
| Large-scale pre-training inside AWS ecosystem | Trainium 3 worth pricing. Calculate: (migration weeks × engineering cost) vs (lifetime per-NeuronCore savings at your training volume). |
| Mixed training and inference fleet | CUDA. One framework runs both workloads unchanged. Trainium requires separate Neuron stacks per phase. |
The structural case for staying on CUDA for inference is strongest when the team runs vLLM or SGLang, needs spot pricing, or needs to port workloads across providers. Trainium 3 locks all of these out simultaneously.
Benchmarking Your Own Workload on GPU Cloud First
Before committing to any Trainium migration, establish a CUDA baseline so the comparison is concrete rather than theoretical. The Spheron API delivers on-demand H200 SXM5 instances by the minute with no reserved commitment.
Step 1: Provision an H200 SXM5 on Spheron on-demand.
Step 2: Run vLLM with your production model:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-model> \
  --tensor-parallel-size 1 \
  --quantization fp8
```

Step 3: Send 500 test requests at your production batch size using the vLLM benchmark tool and record tokens/sec.
Step 4: Compute your baseline CPM:
```
CPM = (on_demand_$/hr) / (tokens_per_sec × 3600 / 1,000,000)
```

Step 5: That CPM is the number Trainium 3 must beat, after you add back the engineering cost of the Neuron SDK migration. If you are evaluating B200 SXM6 instances for larger models or FP4 workloads, repeat the benchmark on B200.
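If you want a quick cross-check of Steps 3 and 4 without the vLLM benchmark tool, a minimal client can be sketched as follows. It assumes the server from Step 2 is listening on `http://localhost:8000` and returns the OpenAI-compatible `usage.completion_tokens` field. Client-side concurrency stands in for batch size here, which is only an approximation, since the server's scheduler decides the actual batches.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_completion(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> int:
    """POST one request to /v1/completions and return the generated token count."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def measure_tokens_per_sec(send, prompts, concurrency: int) -> float:
    """Run send(prompt) over all prompts in parallel; return generated tokens per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        generated = sum(pool.map(send, prompts))
    return generated / (time.perf_counter() - start)


def cpm(price_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per million tokens at the measured throughput."""
    return price_per_hr / (tokens_per_sec * 3600 / 1_000_000)


# Example (requires a running server; prompts should come from production traffic):
# send = lambda p: send_completion("http://localhost:8000", "<your-model>", p)
# tps = measure_tokens_per_sec(send, prompts, concurrency=8)
# print(f"{tps:.0f} tok/s -> ${cpm(4.54, tps):.2f}/M tokens")
```

Use prompts and output lengths that match production traffic. Synthetic short prompts tend to overstate throughput.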
For most teams, the CUDA baseline on Spheron is not just a diagnostic step. It is also a faster path to production. The H200 runs your existing vLLM stack unchanged, bills per minute, and lets you switch GPU types if you need to compare.
Trainium 3 is worth the migration cost if you are already AWS-native and running Neuron-supported models at high batch sizes. For everyone else, H200 and B200 on GPU cloud run vLLM, SGLang, and TensorRT-LLM unchanged at neocloud rates with no rewrite required and full portability.
Quick Setup Guide
Run 'pip show vllm tensorrt-llm flash-attn sglang' to identify CUDA-native packages. Any package that shows up as installed needs a Neuron replacement, a rewrite, or (in vLLM's case) a switch to the vllm-neuron backend before running on Trainium. If your stack depends on custom CUDA kernels, custom FlashAttention variants, or framework-specific operators using torch.cuda, Trainium 3 is not a drop-in replacement and requires substantial porting.
List every vLLM, TensorRT-LLM, or SGLang feature your production stack uses: continuous batching, speculative decoding, RadixAttention, PagedAttention, FP8 tensor cores, multi-adapter serving. Each feature without a Neuron SDK equivalent adds porting time. Multiply estimated weeks by your team's blended rate and compare against expected lifetime compute savings at the throughput Trainium 3 actually delivers for your batch profile.
Before evaluating Trainium, establish your CUDA baseline. Provision an H200 SXM5 or B200 SXM6 on Spheron, run vLLM with your production model at your actual batch size, and record tokens/sec. Compute CPM = ($/hr) / (tokens_per_sec x 3600 / 1,000,000). This is the number any Trainium migration must beat after you add back the engineering cost of the port.
Frequently Asked Questions
How much does AWS Trainium 3 cost per hour?
AWS Trainium 3 is available via EC2 Trn3 instances, priced per instance rather than per chip. AWS does not publish a verified per-NeuronCore price. The ~$1.80/hr per NeuronCore figure cited in comparisons is an illustrative estimate derived by dividing instance pricing by NeuronCore count and should be treated as directional only. Note that a NeuronCore is a sub-unit of the Trainium chip (8 NeuronCore-v4 units per chip), not a direct equivalent to an NVIDIA GPU, so per-unit comparisons require careful interpretation.
Can vLLM run on AWS Trainium 3?
Yes, with limitations. vLLM has an official AWS Neuron backend (the vllm-neuron plugin / NxD Inference) that supports Trainium with minimal code changes for basic serving. However, it exposes a reduced feature set compared to the CUDA build: CUDA-specific components including custom Triton kernels, AWQ/GPTQ quantization kernels, FlashAttention-3, and TensorRT-LLM engines do not port. For advanced features or custom architectures, you may still need transformers-neuronx or AWS Neuron ModelServer directly.
How long does a Neuron SDK migration take?
For a team new to the Neuron SDK working with a Neuron-supported model like Llama 3 70B, expect 2-3 weeks to reach basic inference. For custom attention kernels, non-standard architectures, or MoE models with partial Neuron support, the estimate rises to 4-6 weeks. Teams with prior Neuron experience can compress this to 1-2 weeks. Ongoing maintenance adds roughly 1-2 days per Neuron SDK release cycle.
Is Trainium 3 faster than H200 for inference?
At batch size 32+ for dense transformer models, AWS claims Trainium 3 matches or beats H200 throughput. However, these are AWS-internal figures for Neuron-ported models, not independent benchmarks. At the batch sizes typical of real serving workloads (batch 1-8 for latency-sensitive APIs), the comparison is less favorable and no public third-party data exists as of July 2026. H200's CUDA ecosystem also provides speculative decoding, RadixAttention, and continuous batching in vLLM and SGLang with no porting required.
What is the cost per million tokens on Trainium 3 versus H200?
At AWS's claimed throughput for Trainium 3 at batch 32, the estimated cost is around $0.28/million tokens (at the illustrative per-NeuronCore rate). NVIDIA H200 SXM5 on Spheron works out to approximately $0.84/million tokens on-demand or $0.61/million tokens on spot, based on vLLM throughput at batch 8. The Trainium 3 cost advantage is real at high batch sizes using Neuron-native stacks. It narrows against H200 spot and disappears when engineering migration cost is factored in for smaller teams.
