Google TPU 8i vs NVIDIA Rubin and B200 for LLM Inference: Benchmarks, Cost Per Token, and Migration Guide (2026)

Google unveiled the TPU 8i at Google Cloud Next 2026: 288GB HBM, 384MB on-chip SRAM, and 10.1 PFLOPS at FP4 for inference. Those numbers land squarely in the same territory as NVIDIA's Blackwell and upcoming Rubin chips. But raw PFLOPS comparisons miss the part of the story that matters most for teams running actual production workloads. The software ecosystem, vendor lock-in, and pricing model change the math significantly.

Two quick things to establish before we get into the comparison. First, the 8i and 8t are different chips. TPU 8i is inference-only. TPU 8t is the training companion in the same generation. This post covers 8i. Second, both TPU 8i and NVIDIA's Rubin R100 are pre-GA as of June 2026. The B200 is available now; the other two are not. That timing matters for anyone with active workloads.

TPU 8i at a Glance: Specs, Boardfly Topology, and the Inference/Training Split

Google's decision to split inference (8i) and training (8t) into separate chips reflects a real architectural tradeoff. Inference is dominated by memory-bandwidth-bound decode operations and KV cache reads. Training needs different compute patterns, especially for gradient aggregation. Separating the two lets Google optimize each chip's memory hierarchy and compute tile configuration for its specific access pattern.

A similar specialization appeared earlier in NVIDIA's roadmap with Rubin CPX, which targets the compute-bound context (prefill) phase of inference rather than decode. The approach signals a broader trend: inference and training workloads have diverged enough to justify different silicon.

| Chip | HBM | SRAM | FP4 PFLOPS | Interconnect | Availability |
| --- | --- | --- | --- | --- | --- |
| TPU 8i | 288GB | 384MB | 10.1 | Boardfly | GA late 2027 |
| TPU 8t (training partner) | Not published | Not published | Not published | Boardfly | GA late 2027 |
| Ironwood (TPU v7, prior gen) | 192GB | Not published | Not published | ICI | Available on GCP |

The Boardfly topology is Google's new pod-level interconnect design for the v8 generation. Unlike the ICI (Inter-Chip Interconnect) used in Trillium v6, Boardfly is a PCB-level fabric that routes chip-to-chip communication across the board directly, reducing hop latency for multi-chip inference workloads. Google hasn't published full Boardfly bandwidth specs, but the design goal is lower latency at large batch sizes where multi-chip coordination becomes the bottleneck.

The 384MB on-chip SRAM is the number worth watching. On-chip SRAM is accessible at orders of magnitude higher bandwidth than HBM. For KV cache hot-path reads during decode, having more SRAM means fewer HBM round-trips per token. This is the same principle that explains Cerebras's wafer-scale SRAM advantage at low batch sizes.
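To put the 384MB figure in perspective, here is a rough sizing sketch for how much KV cache a single token occupies. The layer count, KV-head count, and head dimension are assumptions for a Llama-70B-class model with grouped-query attention, not figures from Google, and the sketch optimistically assumes the whole SRAM could hold KV cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Attention dimensions that determine KV cache size (assumed values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: AttentionShape, bytes_per_value: int) -> int:
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_value


def tokens_that_fit(capacity_bytes: int, shape: AttentionShape, bytes_per_value: int) -> int:
    # Whole tokens of KV state that fit in the given capacity.
    return capacity_bytes // kv_bytes_per_token(shape, bytes_per_value)


# Assumption: Llama-70B-class model with grouped-query attention.
LLAMA_70B_CLASS = AttentionShape(num_layers=80, num_kv_heads=8, head_dim=128)
SRAM_BYTES = 384 * 1024**2  # 384MB from the spec table, read as MiB
HBM_BYTES = 288 * 1024**3   # 288GB from the spec table, read as GiB

if __name__ == "__main__":
    per_token = kv_bytes_per_token(LLAMA_70B_CLASS, bytes_per_value=2)  # 16-bit values
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"Tokens resident in SRAM: {tokens_that_fit(SRAM_BYTES, LLAMA_70B_CLASS, 2)}")
    print(f"Tokens resident in HBM (ignoring weights): {tokens_that_fit(HBM_BYTES, LLAMA_70B_CLASS, 2)}")
```

Under these assumptions the cache costs 320 KiB per token, so only about 1,200 tokens of KV state fit in SRAM at once. That is why the benefit is concentrated in hot-path reads rather than in holding entire long-context caches on chip.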

TPU 8i vs NVIDIA Rubin R100 and B200: Compute, Memory, and Interconnect

| Spec | TPU 8i (per chip) | NVIDIA Rubin R100 | NVIDIA B200 SXM6 | NVIDIA B300 SXM6 |
| --- | --- | --- | --- | --- |
| Memory (HBM) | 288GB | 288GB HBM4 | 192GB HBM3e | 288GB HBM3e |
| Memory bandwidth | ~8.6 TB/s | Up to 22 TB/s | 8 TB/s | 8 TB/s |
| FP4 (PFLOPS) | 10.1 | 50 | 9.0 | 15 |
| On-chip SRAM | 384MB | Not published | ~50MB L2 | ~50MB L2 |
| Multi-chip interconnect | Boardfly | NVLink 6 (3.6 TB/s) | NVLink 5 (1.8 TB/s) | NVLink 5 (1.8 TB/s) |
| Native software stack | JAX, MaxText, Jetstream | CUDA, vLLM, TensorRT-LLM | CUDA, vLLM, TensorRT-LLM | CUDA, vLLM, TensorRT-LLM |
| Spot pricing available | No | Not announced | Yes (Spheron) | Yes (Spheron) |

A few things stand out here. The Rubin R100's 50 PFLOPS FP4 is roughly five times TPU 8i's 10.1, which is a large gap. R100's 22 TB/s bandwidth is also a structural advantage: memory bandwidth determines how fast a chip can serve weight and KV cache reads per token at scale, and 22 TB/s is close to three times the 8 TB/s that B200 or B300 offers. For full R100 specs, see the NVIDIA Rubin R100 guide.
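The bandwidth argument can be made concrete with a simple ceiling estimate. This is a sketch under a deliberately crude model: decode at batch size 1 is treated as purely bandwidth-bound, every generated token streams all model weights from HBM once, and KV cache reads are ignored. The 70 GB weight footprint (70B parameters at one byte each) is an assumption for illustration, not a vendor figure.

```python
def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-sequence decode speed when each token
    requires reading every weight from HBM exactly once."""
    if bandwidth_tb_s <= 0 or weight_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    bytes_per_sec = bandwidth_tb_s * 1e12
    bytes_per_token = weight_gb * 1e9
    return bytes_per_sec / bytes_per_token


# HBM bandwidth figures from the comparison table above (TB/s).
HBM_BANDWIDTH_TB_S = {
    "TPU 8i": 8.6,
    "Rubin R100": 22.0,
    "B200 SXM6": 8.0,
    "B300 SXM6": 8.0,
}

WEIGHT_GB = 70.0  # assumption: 70B parameters stored at 1 byte each (FP8)

if __name__ == "__main__":
    for chip, bandwidth in HBM_BANDWIDTH_TB_S.items():
        ceiling = decode_ceiling_tokens_per_sec(bandwidth, WEIGHT_GB)
        print(f"{chip}: at most {ceiling:.0f} tokens/sec per sequence")
```

With these inputs the ceiling is about 114 tokens/sec for B200, 123 for TPU 8i, and 314 for R100. Batching amortizes the weight reads across sequences, so aggregate throughput is higher in practice, but the ratio between chips is what the bandwidth row of the table is really telling you.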

TPU 8i's 384MB SRAM advantage over GPU's smaller L2 cache is real for the hot-path reads that happen repeatedly during decode. But SRAM capacity alone doesn't close the gap with R100's raw bandwidth, and at larger KV cache sizes the 288GB HBM becomes the limiting factor for both.

R100 is pre-GA with an H2 2026 timeline. TPU 8i is pre-GA targeting late 2027. B200 is available today on Spheron. If your workload is running now, B200 is the only one of these three you can actually use. For full B200 specs, see the B200 complete guide.

Inference Benchmarks: What the Numbers Actually Show

There are no independent TPU 8i vs B200 benchmarks as of June 2026. The chip isn't in GA. Any table that shows specific tokens-per-second numbers for TPU 8i at this point is fabricating figures, and that includes numbers "extrapolated" from Google's claims.

What Google has stated: roughly 80% price-performance improvement over Ironwood (TPU v7), the generation immediately before v8. Ironwood and Trillium are separate generations: Trillium is v6e; Ironwood is v7. No public per-chip throughput benchmark exists for Ironwood, which means there is no reliable baseline to extrapolate from. Applying Google's 80% claim to any other available baseline would produce an internally inconsistent estimate, so no specific tok/s figure for TPU 8i is cited here.

For context, B200 with vLLM running Llama 70B at FP8 delivers approximately 1,000 tokens/sec per GPU at batch 8, based on MLPerf Inference v6 data and public vLLM team benchmarks.

| Metric | TPU 8i (estimated from Google claims) | B200 per GPU (vLLM, MLPerf v6) |
| --- | --- | --- |
| Throughput (Llama 70B, batch 8) | No independent benchmark (chip pre-GA, no reliable Ironwood baseline to extrapolate from) | ~1,000 tokens/sec |
| FP4 PFLOPS | 10.1 (Google-confirmed) | 9.0 (NVIDIA-confirmed) |
| Memory (HBM) | 288GB | 192GB |

These are rough figures for context, not SLA targets. Run your own benchmarks against your actual model and request distribution before making infrastructure decisions.

On RL and reasoning workloads, Google's claims for TPU 8i are stronger. The Boardfly topology and larger on-chip SRAM are better suited for long-context, compute-bound workloads like chain-of-thought reasoning at large batch sizes. For short-context, high-concurrency serving, the advantage is less clear.

For MoE architectures like DeepSeek, TPU's systolic array design is a disadvantage. Expert routing in MoE generates irregular compute patterns that systolic arrays handle less efficiently than GPU streaming multiprocessors. For more on this, see MoE inference optimization on GPU cloud.
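One way to see why irregular routing is awkward for hardware that prefers fixed-shape matrix multiplies is to count how many tokens land on each expert. The sketch below uses uniform random top-k routing as a stand-in for a learned router (an assumption; real routers are usually more skewed), and the function names are hypothetical. It then measures how much compute would be spent on padding if every expert's batch were padded to the size of the busiest one.

```python
import random
from collections import Counter


def route_tokens(num_tokens: int, num_experts: int, top_k: int, seed: int = 0) -> list[int]:
    """Return the number of tokens assigned to each expert.

    Uniform random top-k routing is a stand-in for a learned router.
    """
    rng = random.Random(seed)
    counts: Counter[int] = Counter()
    for _ in range(num_tokens):
        # Each token picks top_k distinct experts.
        for expert in rng.sample(range(num_experts), top_k):
            counts[expert] += 1
    return [counts[e] for e in range(num_experts)]


def padding_waste(per_expert_counts: list[int]) -> float:
    """Fraction of compute spent on padding when every expert batch is
    padded up to the size of the largest one (a static-shape strategy)."""
    padded_total = max(per_expert_counts) * len(per_expert_counts)
    return 1 - sum(per_expert_counts) / padded_total


if __name__ == "__main__":
    counts = route_tokens(num_tokens=256, num_experts=16, top_k=2)
    print("tokens per expert:", counts)
    print(f"padding waste at static shapes: {padding_waste(counts):.1%}")
```

Padding to the busiest expert is only one possible strategy, and production TPU stacks may use capacity factors or token dropping instead. The point of the sketch is narrower: per-expert batch sizes change on every step, which is the irregularity the paragraph above refers to.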

Cost Per Million Tokens: TPU 8i Price-Performance vs Live GPU Cloud Math

Google has not published official TPU 8i pricing as of June 2026. The chip is pre-GA.

An earlier generation (Trillium v6e) was priced at approximately $4.20-4.50/chip-hour on-demand on GCP, based on publicly available v6e pricing. TPU 8i will likely launch at a premium over that; a plausible range at GA is $5-7/chip-hour given the higher per-chip performance. These are estimates, not confirmed prices.

For B200 on Spheron, live pricing from the API as of 22 Jun 2026: $3.70/GPU/hr on-demand for B200 SXM6. Spot pricing is $5.34/GPU/hr, currently above on-demand, reflecting tight B200 spot supply at the time of this post.

| Configuration | Cost ($/hr) | Throughput estimate (tokens/sec) | Cost per million tokens |
| --- | --- | --- | --- |
| B200 on Spheron (on-demand) | $3.70 | ~1,000 (Llama 70B, batch 8) | ~$1.03 |
| B200 on Spheron (spot) | $5.34 | ~1,000 | ~$1.48 |
| TPU 8i estimated ($5/chip-hr) | $5.00 | No independent benchmark pre-GA | N/A |
| TPU 8i estimated ($7/chip-hr) | $7.00 | No independent benchmark pre-GA | N/A |

A few things to note about this table. No independent throughput benchmark exists for TPU 8i pre-GA, so cost-per-million-tokens for TPU 8i cannot be reliably computed here. The B200 throughput is from MLPerf Inference v6. B200 spot is currently above on-demand ($5.34 vs $3.70), reflecting tight B200 spot supply as of the publish date, not a structural condition.
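The B200 rows in the table follow from a one-line calculation: hourly cost divided by millions of tokens produced per hour. A minimal sketch, using the prices and the approximate 1,000 tokens/sec figure quoted above (the function name is just for illustration):

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Hourly cost divided by millions of tokens generated in that hour."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    millions_of_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return dollars_per_hour / millions_of_tokens_per_hour


if __name__ == "__main__":
    # Prices as of 22 Jun 2026; throughput is the rough batch-8 figure cited above.
    for label, price in [("B200 on-demand", 3.70), ("B200 spot", 5.34)]:
        cpm = cost_per_million_tokens(price, tokens_per_sec=1000)
        print(f"{label}: ${cpm:.2f} per million tokens")
```

This reproduces the ~$1.03 and ~$1.48 entries. Swapping in your own measured throughput is the whole point: the result is linear in price and inversely proportional to tokens/sec, so a benchmark error of 20% moves the answer by about the same amount.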

The other gap here is spot pricing availability. TPU 8i on Google Cloud does not have an equivalent interruptible tier at launch. For batch processing, offline summarization, or any workload that can tolerate interruption, GPU cloud's spot mechanism exists even when specific SKU spreads are compressed.

For cost-per-token methodology across GPU types, see LLM inference cost economics for 2026.

Pricing fluctuates based on GPU availability. The prices above are based on 22 Jun 2026 and may have changed. Check current GPU pricing for live rates.

The Hidden Tax: Google Cloud Lock-In, No Spot Market, and Framework Portability

Vendor Lock-In

TPU 8i only exists on Google Cloud. Every production deployment decision you make around TPU 8i is made inside a single vendor's platform, billing model, and region footprint. If Google Cloud has an outage in the zone where your TPUs are allocated, you have no fallback to a different provider without a complete replatform.

Compare that to GPU cloud, where Spheron aggregates 5+ providers into a single deployment interface. If one provider has capacity issues, your stack can move. That redundancy matters when you're running inference for production traffic.

No Spot-Equivalent Pricing

Google Cloud has preemptible VMs, but the availability windows and SLA characteristics are different from GPU spot markets. GPU cloud spot markets adjust dynamically with supply and demand. H200 spot on Spheron runs at $3.31/hr for interruptible workloads. B200 spot is currently $5.34/hr, above B200 on-demand ($3.70): tight spot supply has inverted the normal relationship for B200 capacity right now. Spot economics vary by GPU SKU and shift as capacity changes, so check live rates before committing to spot.

TPU 8i's pricing model at launch doesn't include an equivalent interruptible tier. For teams with batch/async workloads, the absence of a spot mechanism is a structural constraint, separate from what any individual SKU's spot spread happens to be on a given day.

Framework Portability

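vLLM, SGLang, TensorRT-LLM, FlashAttention 3, bitsandbytes, and the common AWQ and GPTQ kernels are all built for CUDA. None of them runs on TPU hardware in production-grade form (vLLM's TPU support via torch_xla is experimental). If your inference stack depends on any of them, TPU 8i is not a drop-in alternative. It requires a rewrite.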

For a detailed comparison of CUDA-native inference frameworks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

When TPU 8i Wins vs When GPU Cloud Wins

TPU 8i is the right call when:

  • You run Google's first-party models (Gemini family) and are already fully integrated into GCP's billing, identity, and networking model
  • Your workloads are large-batch, steady-state inference on dense transformers at batch 32 or higher
  • Your team is JAX-native with no load-bearing CUDA kernel dependencies
  • You don't use vLLM, SGLang, or any CUDA-only quantization library in the critical path
  • You're operating at hyperscaler scale where Google's contractual pricing makes the economics work

GPU cloud (B200/H200 on Spheron) is the right call when:

  • You use vLLM, SGLang, TensorRT-LLM, or any CUDA-based inference framework
  • You serve MoE models like DeepSeek or Llama 4 Maverick, where sparse expert routing favors GPU streaming multiprocessors over TPU systolic arrays (see MoE inference optimization on GPU cloud, linked earlier)
  • You need spot pricing for batch jobs or interruptible workloads
  • You need to burst across multiple providers without reconfiguring your stack
  • Your models come from Hugging Face and would require porting for JAX
  • CUDA custom kernels (FlashAttention 3, AWQ, paged attention, custom KV cache management) are in your critical path
  • You want to avoid single-vendor dependency on GCP

For teams currently on Trillium v6e who are weighing the upgrade path, see Google TPU Trillium v6 vs NVIDIA B200 for the predecessor comparison.

The B200 SXM6 on Spheron is available now at on-demand and spot rates, with no JAX migration required.

Migration Checklist: Porting a vLLM/SGLang Stack to JAX/TPU

What Does Not Move (At All, or Without Major Rewrite)

Before building a migration plan, start with what cannot transfer:

  • FlashAttention 3 is CUDA-only. The JAX Pallas equivalent is newer and significantly less battle-tested in production. Expect to validate attention kernel behavior carefully before trusting it with production traffic.
  • AWQ/GPTQ/bitsandbytes quantization kernels have no direct TPU equivalents. TPU quantization uses a different flow (JAX-native INT8 or BF16, or Google's proprietary quantization tooling).
  • Paged attention and vLLM's KV cache management don't map to Jetstream. Jetstream has its own batching and KV management with different semantics. You'll need to re-validate memory behavior and adjust batch limits from scratch.
  • torch.compile graphs and CUDA graph captures don't translate. XLA tracing has different constraints: shapes typically need to be fixed or handled via symbolic shape support, and not all PyTorch ops trace cleanly through torch_xla.
  • Custom CUDA kernels (attention masks, sparse patterns, custom decode logic) need to be rewritten in XLA/Pallas or dropped entirely.
  • DCGM-based GPU observability and existing Prometheus exporters need to be replaced with Google Cloud Monitoring TPU telemetry. Your Grafana dashboards won't carry over.
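A first pass at this inventory can be automated by scanning your dependency list for packages that are tied to CUDA. The sketch below is illustrative only: the package names are assumed PyPI distribution names for the tools mentioned above, the helper names are hypothetical, and a scan like this will not find in-house kernels or dependencies pulled in transitively.

```python
import re

# Assumption: PyPI distribution names for the CUDA-bound tools named above.
# Extend this map with any internal kernel packages your team maintains.
CUDA_BOUND = {
    "flash-attn": "FlashAttention kernels",
    "autoawq": "AWQ quantization kernels",
    "auto-gptq": "GPTQ quantization kernels",
    "bitsandbytes": "bitsandbytes quantization",
    "vllm": "vLLM serving (paged attention, KV cache management)",
    "sglang": "SGLang serving",
    "tensorrt-llm": "TensorRT-LLM",
}


def normalize(name: str) -> str:
    # Treat '-', '_' and '.' as equivalent and ignore case, as pip does.
    return re.sub(r"[-_.]+", "-", name).lower()


def find_cuda_bound(requirement_lines: list[str]) -> dict[str, str]:
    """Map each CUDA-bound package found in requirements-style lines to a note."""
    hits: dict[str, str] = {}
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Keep only the name, before any version specifier, extra, or marker.
        name = re.split(r"[<>=!~\[;\s@]", line, maxsplit=1)[0]
        key = normalize(name)
        if key in CUDA_BOUND:
            hits[key] = CUDA_BOUND[key]
    return hits


if __name__ == "__main__":
    # Hypothetical requirements file contents, for illustration only.
    sample = ["vllm", "numpy>=1.0", "flash_attn  # attention kernels", "bitsandbytes[extra]"]
    for package, note in find_cuda_bound(sample).items():
        print(f"{package}: {note}")
```

Anything this flags is a candidate for the rewrite list above. An empty result does not mean the stack is portable, only that the obvious packages are absent.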

What Can Move (With Effort)

  • Model weights: most Hugging Face checkpoints load via MaxText or transformers + accelerate. Validate outputs on eval benchmarks before serving production traffic. Weight format conversions are often required.
  • Batching strategy: map vLLM's --max-num-batched-tokens logic to Jetstream's equivalent config params. The concepts are similar but the API surface is different.
  • Serving API: Jetstream exposes a different API surface than vLLM's OpenAI-compatible API. Clients built against vLLM's endpoint need updates.
  • Observability: rebuild dashboards using Cloud Monitoring's TPU telemetry. The underlying metrics exist; they just live in a different system.

Migration Timeline

  • Weeks 1-2: Environment setup, model porting to MaxText/JAX, basic inference validation on evals
  • Weeks 3-5: Performance tuning (XLA compilation, batch shape optimization, SRAM-resident KV cache config), replacing CUDA-only features with JAX equivalents
  • Weeks 6-8: Serving infrastructure rebuild (Jetstream API differences from vLLM OpenAI-compatible API), observability rebuild in Cloud Monitoring
  • Ongoing: TPU firmware and XLA compiler updates follow a separate cadence from the CUDA ecosystem. Every major framework update on the CUDA side requires independent validation on TPU. Budget for this ongoing maintenance.
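Much of the "batch shape optimization" work in weeks 3-5 comes down to a standard XLA pattern: because a compiled function is specialized to its input shapes, variable-length requests are padded up to a small set of fixed bucket sizes so that only a handful of shapes ever get compiled. A minimal sketch in plain Python follows; the function names, bucket limits, and padding token are hypothetical choices for illustration and are not MaxText or Jetstream APIs.

```python
def next_bucket(length: int, min_bucket: int = 128, max_bucket: int = 8192) -> int:
    """Round a sequence length up to the next power-of-two bucket."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length > max_bucket:
        raise ValueError(f"length {length} exceeds max bucket {max_bucket}")
    bucket = min_bucket
    while bucket < length:
        bucket *= 2
    # Guard for limits that are not a power-of-two multiple of min_bucket.
    return min(bucket, max_bucket)


def pad_to_bucket(token_ids: list[int], pad_id: int = 0,
                  min_bucket: int = 128, max_bucket: int = 8192) -> tuple[list[int], list[int]]:
    """Pad token ids to the bucket size and return (padded_ids, attention_mask)."""
    target = next_bucket(len(token_ids), min_bucket, max_bucket)
    pad_len = target - len(token_ids)
    padded = token_ids + [pad_id] * pad_len
    mask = [1] * len(token_ids) + [0] * pad_len  # 1 = real token, 0 = padding
    return padded, mask


if __name__ == "__main__":
    for n in (1, 128, 129, 700, 5000):
        print(f"length {n} -> bucket {next_bucket(n)}")
```

With the default limits there are only seven distinct shapes (128 through 8192), so compilation cost is paid at most seven times. The tradeoff is wasted compute on padding, which is why bucket boundaries are usually tuned against the real request-length distribution.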

vLLM has experimental TPU support via torch_xla, but it is not production-grade and lags significantly behind the CUDA implementation. For production vLLM deployment on GPU, see vLLM production deployment 2026.


TPU 8i's PFLOPS numbers are real, but they do not exist in isolation from Google's platform, pricing model, and JAX-only software requirement. For teams already running vLLM or SGLang, Spheron delivers B200 and H200 on a bare-metal marketplace with both on-demand and spot options, no JAX rewrite required.

Quick Setup Guide

  1. Assess your current CUDA dependencies before evaluating TPU 8i

    List every custom CUDA kernel, quantization library (AWQ, GPTQ, bitsandbytes), and framework dependency in your inference stack. If any are load-bearing (FlashAttention 3, paged attention, custom KV cache kernels), estimate the rewrite cost before factoring TPU 8i into your roadmap. For most teams, this alone makes GPU cloud the default.

  2. Benchmark cost-per-million-tokens at your actual batch distribution

    Pull your P50 concurrent request count from production. Run your model through vLLM or SGLang on Spheron B200 (spot or on-demand) at that batch size and compute CPM using the formula: (cluster $/hr) / (tokens_per_sec × 3600 / 1,000,000). Compare that number against TPU 8i's projected chip-hour cost at the same batch size before committing to a migration.

  3. Prototype the MaxText/Jetstream path in isolation before full migration

    If TPU 8i still makes sense after the dependency and cost analysis, set up a parallel JAX environment on a TPU v6e node (available today) and port a single non-critical model. Measure real throughput, TTFT, and p99 latency under your actual request distribution. Do not extrapolate from Google's published benchmarks alone.
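For the measurements in step 3, the summary statistics are simple to compute once you have per-request timings. This sketch uses the nearest-rank percentile definition (one of several in common use) and hypothetical function and field names; it assumes you have already collected time-to-first-token samples and a total token count from a load test against your own request distribution.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(ttft_ms: list[float], total_tokens: int, wall_seconds: float) -> dict[str, float]:
    """Collapse one load-test run into the numbers worth comparing across platforms."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return {
        "ttft_p50_ms": percentile(ttft_ms, 50),
        "ttft_p99_ms": percentile(ttft_ms, 99),
        "tokens_per_sec": total_tokens / wall_seconds,
    }


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    ttft = [42.0, 45.5, 47.1, 51.0, 49.3, 44.8, 120.4, 46.2]
    print(summarize(ttft, total_tokens=36_000, wall_seconds=60.0))
```

Run the same harness against both the GPU and TPU prototypes with identical request traces. Comparing p99 rather than the mean matters here because a single slow outlier, like the 120.4 ms sample above, is invisible in an average but is exactly what users notice.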

Frequently Asked Questions

What is Google TPU 8i?

TPU 8i is Google's inference-specialized chip in the TPU v8 generation, unveiled at Google Cloud Next 2026. It carries 288GB HBM, 384MB of on-chip SRAM, and delivers 10.1 PFLOPS at FP4. Google claims roughly 80% price-performance gain over Ironwood (TPU v7), its immediate predecessor. Note: Ironwood (v7) is a separate generation from Trillium (v6e), which came before it. The 8i variant is inference-only; training workloads use the companion TPU 8t chip. General availability is scheduled for late 2027.

How does TPU 8i compare to NVIDIA B200 for LLM inference?

On paper, TPU 8i's 10.1 PFLOPS FP4 is higher than B200's 9.0 PFLOPS FP4, and its 288GB HBM matches the B300. However, B200 on Spheron runs vLLM natively with FP8/FP4 quantization support, no migration required. TPU 8i is locked to Google Cloud, requires a JAX/MaxText stack, and has no spot-market pricing equivalent. For teams already on CUDA, the effective cost-per-token on GPU cloud typically remains competitive or better once migration and ecosystem overhead are included.

Can I run vLLM or SGLang on TPU 8i?

Not in production-grade form. vLLM and SGLang are CUDA-native; vLLM's TPU support via torch_xla is experimental and lags the CUDA implementation. TPU 8i requires JAX, PyTorch/XLA, or Google's MaxText and Jetstream inference frameworks. Porting a production vLLM stack to TPU takes 4-12 weeks of engineering work, and not all features (FlashAttention 3, paged attention, AWQ/GPTQ kernels) have equivalent implementations on the TPU software side.

How much does TPU 8i cost?

Google has not published official TPU 8i per-chip pricing as of June 2026 (GA is late 2027). Based on TPU v6e pricing (~$4.20-4.50/chip-hour on-demand for that earlier generation), TPU 8i is likely in the $5-7/chip-hour range at launch. B200 on Spheron at on-demand rates (check live pricing at spheron.network/pricing/) delivers competitive cost-per-token for most LLM serving workloads without Google Cloud lock-in or JAX migration overhead.

When does TPU 8i make sense over GPU cloud?

TPU 8i is most advantageous for teams already deep in the Google Cloud and JAX ecosystem, running large-batch steady-state inference on dense transformer models, or serving Google's first-party models (Gemini family). For teams using vLLM, open-weight models from Hugging Face, MoE architectures like DeepSeek, or needing multi-provider flexibility, GPU cloud on platforms like Spheron is a better fit.
