Hyperscaler Custom AI Chips in 2026: Trainium 3, Google TPU, Maia 200, and Meta MTIA vs NVIDIA GPU

Inference is now roughly two-thirds of AI compute spend. Hyperscalers are responding by building their own silicon: AWS Trainium 3, Google TPU Ironwood, Microsoft Maia 200, and Meta MTIA are all real chips with real performance numbers. The headline projection - NVIDIA's inference share dropping from 90% to somewhere around 20-30% by 2028 - sounds alarming. There is a structural catch, though: none of these four chips can be rented on the open market. Maia 200 and MTIA are internal only, and Trainium 3 and Ironwood are sold only inside AWS and Google Cloud. They are captive to their builders. If you are not building on that hyperscaler's cloud and software stack, they do not exist for you.

This post covers what each chip actually does, what specs are publicly confirmed, what the migration path looks like if you are inside the relevant hyperscaler ecosystem, and why the projection matters far less for external teams than the headlines imply.

The 2026 Custom Silicon Inflection

Inference was a rounding error in 2022. Training was the cost center, and GPU clusters were bought, leased, or rented in service of training runs. By late 2024 that had inverted. Serving production traffic runs continuously, so its cost compounds with every request and ends up more expensive in aggregate than the one-time compute jobs that produced the models.

Industry analysts put inference at 60-80% of AI GPU spend for teams shipping to production. Hyperscalers are at even higher ratios because they run Copilot, Gemini API, Alexa, and recommendation systems at billions of requests per day. That is the context for why AWS, Google, Microsoft, and Meta all started building their own chips. At their scale, a 30% efficiency gain on inference silicon pays for itself in months, not years.

The NVIDIA inference share projection is about that internal compute, not about what external teams rent. When a hyperscaler moves Copilot traffic from H100 clusters to Maia 200, that comes off NVIDIA's hyperscaler-internal inference share. It has no effect on what GPU cloud providers offer to external customers. For the overwhelming majority of AI teams - those who are not hyperscalers - the accessible market is NVIDIA GPU cloud and a small set of alternative accelerators with public APIs.

For baseline context on how hyperscaler GPU costs compare to specialized cloud, see AWS, GCP, and Azure GPU vs Spheron.

AWS Trainium 3: 3nm, 144GB HBM3E, and the Neuron SDK Trap

Trainium 3 is AWS's third-generation ML training and inference accelerator. The chip is manufactured on a 3nm process with 144 GB of HBM3E at 4.9 TB/s. The UltraServer form factor connects multiple Trn3 instances via AWS's proprietary NeuronLink interconnect, targeting large multi-node training jobs.

AWS has published throughput numbers for Llama 2 70B on Trn2 (Trainium 2 instances, the preceding generation) that are competitive with H100 for that specific workload. For certain batch-heavy inference workloads using fully dense transformer models, Trainium 3 can match or beat H100 on per-dollar throughput inside AWS.

The Neuron SDK is where the math gets complicated for most teams. Neuron is AWS's CUDA equivalent: a compiler, runtime, and set of optimized libraries specifically for Trainium and Inferentia chips. It does not run CUDA code. Migration means replacing vLLM with Neuron-compatible inference code, replacing FlashAttention with Neuron's attention kernel, and replacing TensorRT-LLM with Neuron's graph compiler.

The porting effort for a production vLLM stack typically runs two to six weeks of engineering time. Some model architectures are not supported in Neuron at all. Custom CUDA kernels have no Neuron equivalent and must be rewritten. And after migration, the stack runs exclusively on AWS infrastructure, with no portability to any other GPU provider.

Trainium 3 is only available via EC2 Trn3 instances. No third-party GPU marketplace offers Trainium hardware. If you want Trainium, you are signing up for AWS as your sole GPU provider.

For a detailed look at what AWS charges for H100 capacity and why the pricing stacks up the way it does, see the AWS H100 pricing 2026 guide.

Google TPU Ironwood: What Changed After Trillium

Ironwood is Google's seventh-generation TPU, announced at Google Cloud Next 2025 and generally available since late 2025. It follows Trillium (v6e) and is positioned primarily for large-scale inference rather than training. Google has published per-chip specs: 192 GB HBM3E at 7.37 TB/s memory bandwidth, 4,614 FP8 TFLOPS, and 9.6 Tb/s inter-chip interconnect (ICI) bandwidth.

Ironwood is deployed internally for Gemini API traffic. The serving architecture targets very large batch sizes at steady-state throughput, which is well-matched to Gemini API workloads running at Google's scale.

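The software story is the same as Trillium: JAX, MaxText, and JetStream. None of the standard GPU inference stack runs on Ironwood unchanged. TensorRT-LLM is CUDA-only, and vLLM and SGLang are built around CUDA kernels. vLLM does ship a TPU backend, but CUDA-specific kernels and tuning do not carry over, so a stack optimized on GPUs has to be ported and re-validated rather than redeployed. PyTorch/XLA provides a bridge but is materially slower than native JAX for TPU workloads in practice.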

For a detailed migration cost analysis and benchmark comparison between Trillium and B200, see Google TPU Trillium v6 vs NVIDIA B200. The analysis there extends directly to Ironwood for teams evaluating whether a JAX stack rewrite pencils out.

Ironwood is available through Google Cloud as a managed TPU resource, GCP accounts only. It is not available through any other provider.

Microsoft Maia 200: Inference-First, Azure-Only

Maia 200 is Microsoft's in-house inference ASIC, deployed across Azure data centers to handle Copilot and OpenAI API traffic. Microsoft published a full datasheet in January 2026: TSMC 3nm, 216 GB HBM3e at 7 TB/s memory bandwidth, native FP8 and FP4 tensor cores, and approximately 750W TDP. Microsoft also previewed a Maia SDK with PyTorch integration, a Triton compiler, and an optimized kernel library.

Maia 200 is not available to Azure customers as a provisionable instance type. There is no Maia 200 VM SKU. Teams using Azure OpenAI API or Copilot benefit from Maia's throughput indirectly through lower latency and higher API capacity, but they cannot provision Maia 200 hardware or deploy custom models on it.

The Maia SDK is a previewed internal toolchain, not a public API for external workloads. If you need to deploy your own models, you are still on Azure's GPU catalog: H100, A100, and the emerging H200 SKUs. Maia 200 is Microsoft's internal efficiency play, not a product.

Meta MTIA: The 300-500 Roadmap and Why It Doesn't Matter Outside Meta

Meta's MTIA (Meta Training and Inference Accelerator) roadmap runs from the original Gen 1 chip through the 300, 400, 450, and 500 generations. Meta has published high-level throughput claims for recommendation model inference showing multi-generation improvement. The chips are designed for Meta's specific workload: ranking and recommendation models with irregular sparsity patterns and transformer components for generative features.

MTIA 300 is already in production for ranking and recommendation workloads inside Meta's infrastructure. MTIA 400 is in testing. The 450 and 500 generations remain roadmap announcements without publicly available independent benchmark data. Meta deploys MTIA in its own data centers, running Facebook and Instagram infrastructure. Like Maia, MTIA is an internal efficiency chip. There is no external availability, no cloud offering, no partner program, and no migration path for teams not running Meta's specific software stack.

Published performance data reflects Meta's internal benchmarks on Meta's own recommendation models. These numbers do not translate to general-purpose LLM inference performance.

Spec and Economics Table

Chip	Maker	Memory	Bandwidth	Process Node	Rentable	Stack
Trainium 3	AWS	144 GB HBM3E	4.9 TB/s	3nm	AWS EC2 Trn3 only	Neuron SDK
TPU Ironwood (v7)	Google	192 GB HBM3E	7.37 TB/s	3nm (est.)	GCP only	JAX/XLA
Maia 200	Microsoft	216 GB HBM3e	7 TB/s	TSMC 3nm	No direct rental	Maia SDK (internal)
MTIA 500 (roadmap)	Meta	Not published	Not published	3nm (roadmap)	No	Internal only
NVIDIA B300 SXM6	NVIDIA	288 GB HBM3e	8 TB/s	TSMC 4nm	Yes, GPU cloud	CUDA, vLLM, TRT-LLM
NVIDIA H200 SXM5	NVIDIA	141 GB HBM3e	4.8 TB/s	TSMC 4nm	Yes, GPU cloud	CUDA, vLLM, TRT-LLM
NVIDIA H100 SXM5	NVIDIA	80 GB HBM3	3.35 TB/s	TSMC 4nm	Yes, GPU cloud	CUDA, vLLM, TRT-LLM

On-demand pricing from Spheron's live API as of June 2026: H100 SXM5 at approximately $2.54/hr per GPU and H200 SXM5 at approximately $4.88/hr per GPU. B300 SXM6 availability varies; check current GPU pricing for live rates.

For Trainium 2 (the generation preceding Trainium 3), AWS has published internal figures showing competitive cost-per-token for dense transformer inference at high batch sizes within AWS. These figures are AWS-internal measurements, not third-party reproductions, and apply only to Neuron-ported models at the batch sizes AWS chose for benchmarking.

Pricing fluctuates based on GPU availability. The prices above were recorded on 22 Jun 2026 and may have changed.

The CUDA Moat: Why Migration Still Costs Weeks

The CUDA moat is often described as a competitive advantage for NVIDIA. It is more precisely a migration cost for any team moving to a non-CUDA chip.

Nearly two decades of library development have produced a stack where almost every serious LLM inference optimization targets CUDA first, and often only CUDA: FlashAttention-3, PagedAttention, continuous batching in vLLM, speculative decoding in SGLang, FP8 tensor cores via TensorRT-LLM, NCCL for multi-node communication, AWQ and GPTQ quantization kernels. None of these run unchanged on Trainium (Neuron SDK) or TPU (JAX/XLA), and certainly not on Maia or MTIA.
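A quick way to see how exposed a given environment is to this list is to check which of these packages are actually installed. The sketch below is one possible audit, not a complete one: the watch list is an assumption built from the libraries named above (the pip distribution names `vllm`, `sglang`, `tensorrt-llm`, and `flash-attn`), and it will not detect custom CUDA kernels compiled inside your own code.

```python
from importlib import metadata

# Illustrative watch list (an assumption, not exhaustive): pip distribution
# names for CUDA-centric libraries mentioned in this post. Extend as needed.
CUDA_NATIVE_PACKAGES = ("vllm", "sglang", "tensorrt-llm", "flash-attn")


def audit_cuda_dependencies(candidates=CUDA_NATIVE_PACKAGES):
    """Return {package: version} for each candidate installed in this environment."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed, so nothing to port for this package
    return found


installed = audit_cuda_dependencies()
if installed:
    for name, version in sorted(installed.items()):
        print(f"{name}=={version}: needs a port or replacement off CUDA")
else:
    print("No packages from the watch list are installed.")
```

Every package this reports is a line item in the migration estimate that follows.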

The migration math for a Trainium port typically looks like:

2-6 weeks to port the serving stack from vLLM to Neuron-compatible inference code
Additional weeks for models not natively supported in Neuron
Ongoing maintenance overhead as Neuron SDK updates lag behind vLLM feature releases
Complete loss of portability: the ported stack only runs on AWS

For TPU the estimate is similar: 2-8 weeks of engineering to reach feature parity in JAX/MaxText, with ongoing maintenance of a JAX stack separate from your CUDA-based development environment.

For a comparison of what CUDA migration looks like to a more accessible non-NVIDIA stack, see the ROCm vs CUDA on GPU cloud 2026 guide. For how SynapseAI (Intel Gaudi's SDK) compares to CUDA as a migration target, see Intel Gaudi 3 vs H200 and B200.

At hyperscaler-internal scale, where a few engineering weeks buy savings that accrue for months across a very large fleet, this math can close. For the vast majority of external teams whose GPU bills are measured in thousands, not millions, of dollars per month, the migration cost exceeds the lifetime savings.
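The break-even argument can be made concrete with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: the weekly engineering cost and the monthly GPU bills are hypothetical inputs, the 30% savings figure reuses the efficiency gain mentioned earlier in this post, and ongoing maintenance cost is ignored, which makes the result optimistic.

```python
def breakeven_months(engineer_weeks, weekly_cost_usd, monthly_gpu_bill_usd, savings_fraction):
    """Months of compute savings needed to repay a one-time migration effort."""
    if not 0 < savings_fraction <= 1:
        raise ValueError("savings_fraction must be in (0, 1]")
    migration_cost = engineer_weeks * weekly_cost_usd          # one-time cost
    monthly_savings = monthly_gpu_bill_usd * savings_fraction  # recurring benefit
    return migration_cost / monthly_savings


# Hypothetical small team: 4-week port, $5,000 per engineer-week, $5,000/month GPU bill.
small_team = breakeven_months(4, 5_000, 5_000, 0.30)
print(f"small team: {small_team:.1f} months")  # prints "small team: 13.3 months"

# Hypothetical large fleet: same port, $2,000,000/month GPU bill.
large_fleet = breakeven_months(4, 5_000, 2_000_000, 0.30)
print(f"large fleet: {large_fleet:.2f} months")  # prints "large fleet: 0.03 months"
```

With these inputs the small team needs more than a year of savings to repay the port, while the large fleet repays it in about a day.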

The Catch: These Chips Are Captive

This is the structural constraint the headline projections rarely make explicit.

Maia 200 is not rentable. There is no API, no instance type, no migration path.

MTIA is not rentable. It does not exist outside Meta's infrastructure.

TPU Ironwood is available only through Google Cloud, requires JAX, and carries single-vendor lock-in to GCP. Teams running on TPU cannot move to Spheron, AWS, or Azure without rewriting their serving stack.

Trainium 3 is available only through AWS EC2, requires the Neuron SDK, and carries single-vendor lock-in to AWS. Teams running on Trainium cannot move to any other provider without rewriting their serving stack.

The lock-in is not incidental. It is the product. AWS and Google offer these chips at favorable price-per-FLOP because they are capturing your workload in their ecosystem. Once your production inference stack runs on Neuron or JAX, switching costs are high enough that short-term pricing differences stop mattering.

For teams evaluating what it costs to migrate away from AWS, GCP, or Azure to a portable GPU cloud stack, see How to Migrate from AWS, GCP, and Azure.

What the Other 99% of Teams Actually Use

Teams outside the hyperscaler ecosystem have one practical option: CUDA GPU cloud. The standard inference stack is vLLM or SGLang, both of which run unchanged across H100, H200, B200, and B300 generations. The same Docker image works on a different instance type when you upgrade hardware. No rewrite, no framework port, no single-vendor dependency.

Portability has concrete value. Teams using GPU cloud on Spheron can switch from H100 to H200 without changing a line of inference code. When B300 becomes the default, the switch is an instance type change: no migration project, no engineering weeks, no lock-in exit cost. Spheron aggregates capacity from 5+ providers, so a change in GPU availability at any one of them does not create a single point of failure.

Spot pricing adds another dimension unavailable on captive ASIC clouds. For batch inference, embeddings, and async workloads, spot GPU instances on Spheron run materially cheaper than on-demand, with no equivalent on Trainium (AWS on-demand only) or TPU (GCP on-demand only).

For teams running production LLM inference today, a single H200 SXM5 (141 GB) holds most 70B models at FP8, and 8-expert MoE models in the 8x7B class, without tensor parallelism overhead.

For larger models or higher throughput, Spheron B200 SXM6 adds 192 GB per GPU with FP8 and FP4 support via TensorRT-LLM. For the highest throughput tier, B300 Blackwell Ultra on Spheron delivers 288 GB HBM3e at 8 TB/s.
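Whether a model fits on a single GPU comes down to weight size against HBM capacity. The sketch below is a rough first-pass check, not a sizing tool: the capacities are the figures quoted in this post, the 80% headroom factor is an assumption that reserves the remainder for KV cache and activations, and real deployments should measure rather than estimate.

```python
# Per-GPU HBM capacity in GB, as quoted in this post.
HBM_GB = {"H100 SXM5": 80, "H200 SXM5": 141, "B200 SXM6": 192, "B300 SXM6": 288}

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weights_gb(params_billions, dtype):
    """Approximate weight footprint in GB (1e9 params x bytes per param / 1e9)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def fits_on_one_gpu(params_billions, dtype, gpu, headroom=0.8):
    """True if the weights fit within `headroom` of HBM (assumed 80% by default)."""
    return weights_gb(params_billions, dtype) <= HBM_GB[gpu] * headroom


for gpu in HBM_GB:
    for dtype in ("fp16", "fp8"):
        verdict = "fits" if fits_on_one_gpu(70, dtype, gpu) else "needs more than one GPU"
        print(f"70B {dtype} on {gpu}: {verdict}")
```

Under these assumptions a 70B model at FP8 fits on one H200 but not on one H100, and at FP16 it needs a B200 or larger to stay on a single GPU.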

Decision Framework: When to Track Hyperscaler ASICs vs When to Ignore Them

Your Situation	Recommendation
AWS-only infrastructure, Neuron SDK-compatible models, no vLLM dependency	Trainium 3 on EC2 Trn3 is worth evaluating for cost-per-token at high batch
Google Cloud-only, JAX-native stack, steady-state dense transformer inference	TPU Ironwood via GCP is worth benchmarking at your batch profile
Any stack using vLLM, SGLang, TensorRT-LLM, or custom CUDA kernels	CUDA GPU cloud (H200/B200/B300) on a neutral marketplace, no migration required
Multi-cloud or portability requirements	CUDA GPU cloud only. No captive ASIC option offers portability
Budget sensitivity with batch or async workloads	Spot GPU cloud on Spheron. No spot equivalent on Trainium or TPU
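
The table can also be read as a short decision function. The sketch below is only an illustration of the rows above: the `cloud` and `stack` labels are hypothetical names chosen for this example, and the budget and spot-pricing row is left out because it does not change which hardware family applies.

```python
CUDA_CLOUD = "CUDA GPU cloud (H200/B200/B300), no migration required"


def recommend(cloud, stack, needs_portability=False):
    """Map a team's situation to a recommendation row from the table.

    cloud: "aws", "gcp", or any other string (hypothetical labels)
    stack: "neuron", "jax", or "cuda" (hypothetical labels for the serving stack)
    """
    # Portability requirements or any CUDA dependency rule out captive ASICs.
    if needs_portability or stack == "cuda":
        return CUDA_CLOUD
    if cloud == "aws" and stack == "neuron":
        return "Evaluate Trainium 3 on EC2 Trn3 for cost-per-token at high batch"
    if cloud == "gcp" and stack == "jax":
        return "Benchmark TPU Ironwood on GCP at your batch profile"
    # Anything else has no captive-ASIC path.
    return CUDA_CLOUD


print(recommend("aws", "neuron"))
print(recommend("gcp", "jax", needs_portability=True))
```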

The headline projection - NVIDIA losing inference share to hyperscaler ASICs - is accurate for hyperscaler-internal compute. For external teams building on GPU cloud, the accessible chip landscape in June 2026 looks the same as it did before the projections were published: NVIDIA GPU cloud, AMD as a distant second, and a handful of alternative accelerators with niche applicability. Captive hyperscaler ASICs are not in that set.

For context on how other fixed-function and non-CUDA inference chips compare in the accessible market, see Etched AI Sohu vs NVIDIA: Transformer ASIC vs GPU.

Most teams running vLLM or SGLang today have nothing to gain from hyperscaler ASICs that require framework rewrites and lock you to one cloud. GPU cloud on H200, B200, and B300 runs your existing stack unchanged.

Quick Setup Guide

Check whether your workload is captive-ASIC eligible
Determine which of three cases you are in: an AWS customer using only Neuron SDK-compatible models, a Google Cloud customer using only JAX-based workloads, or a team that relies on any library requiring CUDA (vLLM, FlashAttention, TensorRT-LLM, SGLang). Only the first two cases can realistically use Trainium or TPU. If you are in the third, captive hyperscaler ASICs are not a practical option without a port. Provision a CUDA GPU instance on Spheron instead and skip the migration overhead.
Run a CUDA dependency audit before evaluating Trainium or TPU
List all Python packages and frameworks your inference stack depends on. Run 'pip show vllm tensorrt-llm flash-attn' to identify CUDA-native dependencies. Any package in this list must be ported or replaced before running on Trainium (Neuron SDK) or TPU (JAX/XLA). For mixed stacks, estimate the engineering weeks required, multiply by your team's blended weekly cost, and compare that against the lifetime compute cost savings on the ASIC.
Benchmark your target GPU on Spheron before committing
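Provision an H200 SXM5 or B200 SXM6 on-demand instance on Spheron. Run vLLM with your production model: 'python -m vllm.entrypoints.openai.api_server --model <your-model> --tensor-parallel-size 8 --quantization fp8'. Set --tensor-parallel-size to the number of GPUs on the instance (8 assumes an 8-GPU node); FP8 is a quantization setting in vLLM, not a --dtype value. Measure tokens/sec and compute cost per million tokens: CPM = ($/hr) / (tokens_per_sec x 3600 / 1,000,000), where $/hr is the price of the whole instance. This is the baseline any captive ASIC must beat, after accounting for migration cost.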
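
The cost-per-million-tokens calculation in the benchmark step can be sketched as below. The hourly price is the H200 SXM5 on-demand figure quoted in this post; the 2,500 tokens/sec throughput is a hypothetical placeholder, not a measured result, and should be replaced with your own benchmark number.

```python
def cost_per_million_tokens(price_per_hour_usd, tokens_per_sec):
    """CPM = ($/hr) / (tokens_per_sec * 3600 / 1_000_000)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    millions_of_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return price_per_hour_usd / millions_of_tokens_per_hour


# $4.88/hr is the per-GPU H200 SXM5 price quoted in this post.
# 2,500 tokens/sec is a made-up throughput for illustration only.
cpm = cost_per_million_tokens(4.88, 2_500)
print(f"${cpm:.2f} per million tokens")  # prints "$0.54 per million tokens"
```

For a multi-GPU instance, pass the hourly price of the whole instance (per-GPU price times GPU count) together with the aggregate tokens/sec across all GPUs.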

Frequently Asked Questions

Can you rent AWS Trainium 3 on the open market?
No. AWS Trainium 3 is only available as EC2 Trn3 instances inside AWS. You cannot rent Trainium hardware through any third-party GPU marketplace. The chip is captive to AWS and requires the Neuron SDK to run models, which is not compatible with vLLM or TensorRT-LLM.

What is Microsoft Maia 200 and is it available outside Azure?
Maia 200 is Microsoft's in-house inference ASIC deployed across Azure data centers to serve Copilot and OpenAI API traffic. It is not available for external rent. Teams using Azure get its throughput benefit indirectly through managed inference APIs, but cannot provision Maia 200 hardware directly.

How does Meta MTIA compare to NVIDIA GPU for inference?
Meta MTIA (Gen 1 through the 300-500 roadmap) is optimized for Meta's specific recommendation and transformer workloads at Meta's scale. It is an internal chip and is not available outside Meta's infrastructure. Published throughput numbers reflect Meta's controlled workloads, not general-purpose inference benchmarks.

Why will NVIDIA inference share drop from 90% to 20-30% by 2028?
Analysts project that hyperscalers (AWS, Google, Microsoft, Meta) will serve an increasing share of their own internal AI workloads on custom ASICs rather than NVIDIA GPUs. This shifts NVIDIA share among hyperscaler-internal workloads, not among external teams renting GPU cloud. For the 99% of teams who are not hyperscalers, NVIDIA GPU cloud on platforms like Spheron remains the only option.

What is the CUDA moat and why does it matter for ASIC migration?
The CUDA moat is the 20+ year accumulation of libraries (cuDNN, cuBLAS, FlashAttention, vLLM, TensorRT-LLM, NCCL) that run exclusively on NVIDIA GPUs. Migrating to Trainium requires rewriting to the Neuron SDK. Migrating to TPU requires JAX and MaxText. Migrating to Maia or MTIA is not possible because those chips are captive. For any team running vLLM, SGLang, or TensorRT-LLM today, the migration cost to any of these ASICs is weeks to months of engineering work with no portability guarantee.

What GPU should I use instead of hyperscaler ASICs for LLM inference?
For most inference workloads, H100 SXM5, H200 SXM5, B200 SXM6, or B300 SXM6 on Spheron run vLLM unchanged, support FP8 and FP4 quantization, and are available on-demand with no lock-in. Spot pricing reduces costs further for batch workloads. Check /pricing/ for current rates.