OpenAI's Jalapeño chip is a custom LLM inference ASIC built with Broadcom, targeting a 10 GW infrastructure commitment through 2029. It is real, it is significant at OpenAI's scale, and it has no bearing on what AI teams outside OpenAI can actually use. The chip is planned to ship at the end of 2026, it is vertically integrated OpenAI silicon, and nobody outside OpenAI can rent it.
That gap between "this chip is impressive" and "this chip is accessible" is the whole story for external teams. If you are building on open-weight models, serving GPT-OSS, DeepSeek V4 Flash, or Qwen 3, your path is GPU cloud. H100, H200, and B200 instances run vLLM and SGLang unchanged today. This post covers what Jalapeño actually is, what the ASIC vs GPU trade-off looks like, why all the 2026 custom silicon is captive, and what the accessible inference landscape looks like for everyone who is not OpenAI.
What OpenAI's Jalapeño Chip Actually Is
Jalapeño is OpenAI's first purpose-built LLM inference ASIC. It is not a general-purpose GPU and it is not designed for training. It is optimized for the decode phase of autoregressive language model inference: token-by-token generation at the scale OpenAI runs, which means billions of API requests per day across ChatGPT, the OpenAI API, and enterprise contracts.
OpenAI designed the chip in partnership with Broadcom, using Broadcom's custom ASIC division as both the design collaborator and the manufacturer. The tape-out cycle runs roughly 9 months, faster than traditional ASIC development cycles, which OpenAI achieved through a focused architecture and close collaboration with Broadcom's custom silicon team.
The scale commitment is the clearest signal of how seriously OpenAI is treating this. OpenAI has announced a 10 GW infrastructure roadmap through 2029. That is not a pilot program. Building out infrastructure at 10 GW of power capacity over three years requires chips that deliver better performance-per-watt than GPUs, because at that scale even a 20% improvement in inference efficiency pays for custom silicon development in months.
The first Jalapeño chips are planned for end of 2026. From there, OpenAI expects a multi-year ramp as the chips are integrated into more of their infrastructure. This is a long-term infrastructure bet, not a product launch.
One key point worth making explicit: Jalapeño is OpenAI-captive silicon. It is not a product OpenAI sells. There is no API, no rental market, no instance type. The chip serves OpenAI's own API traffic. The closest analog is Microsoft Maia 200 or Meta MTIA, both of which are hyperscaler-captive ASICs that do not exist outside their builder's infrastructure.
Why an ASIC Instead of More GPUs
Inference has become the dominant AI compute cost. For teams running production AI at scale, inference typically accounts for 60-80% of total GPU spend. At OpenAI's scale, that percentage is even higher: ChatGPT and the OpenAI API run continuously, generating hundreds of billions of tokens per day. Training runs are one-time compute jobs. Inference is perpetual.
At that scale, the economics of custom silicon become compelling. A GPU like the H100 is designed for flexibility. It handles training, fine-tuning, inference, computer vision, video generation, and any other compute-bound workload you can throw at it. That programmability is valuable, but it comes with overhead: memory controllers, instruction dispatch logic, general-purpose ALUs, and HBM stacks that serve the average workload, not the specific workload.
A custom inference ASIC removes that overhead. If you know your workload is LLM inference, and nothing else, you can design silicon that does only that. You can put more FLOPS on the die, optimize the memory subsystem specifically for KV cache reads, and reduce the energy spent on instruction scheduling. The result is better performance per watt for that specific workload.
The math is straightforward. An H100 SXM5 draws approximately 700W and delivers around 3,000-5,000 tokens/sec for 70B model inference at batch 32. If an inference ASIC can deliver comparable throughput at 400W, it uses roughly 40% less power per token, and the savings compound at data center scale: serving a 10 GW GPU-equivalent workload on such chips saves about 4 GW of power, which translates directly to operating cost.
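The arithmetic above can be checked with a short sketch. The 700 W and 400 W figures come from the paragraph; the 4,000 tokens/sec throughput is an assumed midpoint of the quoted 3,000-5,000 range, and the function names are illustrative only.

```python
def power_saving_fraction(gpu_watts: float, asic_watts: float) -> float:
    """Fraction of power saved when the ASIC matches GPU throughput."""
    return 1.0 - asic_watts / gpu_watts


def tokens_per_joule(tokens_per_sec: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule (1 W = 1 J/s)."""
    return tokens_per_sec / watts


GPU_WATTS = 700.0        # H100 SXM5 figure quoted above
ASIC_WATTS = 400.0       # hypothetical ASIC power draw from the text
TOKENS_PER_SEC = 4000.0  # assumption: midpoint of the 3,000-5,000 range
FLEET_GW = 10.0          # power the workload would need on GPUs

saving = power_saving_fraction(GPU_WATTS, ASIC_WATTS)
saved_gw = FLEET_GW * saving

print(f"Power saved per chip: {saving:.1%}")
print(f"Fleet power saved:    {saved_gw:.2f} GW of {FLEET_GW:.0f} GW")
print(f"GPU tokens/joule:     {tokens_per_joule(TOKENS_PER_SEC, GPU_WATTS):.2f}")
print(f"ASIC tokens/joule:    {tokens_per_joule(TOKENS_PER_SEC, ASIC_WATTS):.2f}")
```

With these inputs the saving works out to about 43%, slightly above the rounded 40% used in the text, because 400 W is 4/7 of 700 W.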
For a broader look at how this trend is playing out across AWS, Google, Microsoft, and Meta, see Hyperscaler Custom AI Chips in 2026.
ASIC vs GPU Trade-offs: What Jalapeño Gives Up
The efficiency argument is real. But efficiency comes from specificity, and specificity means constraints. Here is what a custom inference ASIC gives up versus a programmable GPU.
Architectural flexibility. Jalapeño is presumably designed around the transformer attention pattern as it exists in 2025-2026. If OpenAI's model architecture changes in a material way, the chip may not serve the next generation of models efficiently, or at all. OpenAI's own history demonstrates this risk: the jump from dense transformer GPT-4 to o-series reasoning models involved significant architectural shifts. If Jalapeño hardcodes attention patterns that match GPT-4o but not o3's reasoning architecture, OpenAI needs another tape-out to adapt.
This is not hypothetical. The move toward MoE (Mixture of Experts) architectures is already the dominant trend in open-weight frontier models. DeepSeek V4 is 671B with 37B active parameters per token. Qwen3-235B-A22B uses 22B active parameters out of 235B. MoE requires dynamic expert routing that fixed-function silicon handles poorly. Dense transformer inference ASICs cannot serve MoE architectures at competitive throughput.
No training or fine-tuning. Jalapeño is an inference chip. It cannot train models. OpenAI still needs NVIDIA GPUs (or equivalent) for training runs, which means its GPU fleet does not go away. The inference ASICs supplement that fleet for serving workloads.
Software ecosystem lock-in. GPU inference runs on vLLM, SGLang, TensorRT-LLM, and the CUDA ecosystem. A custom ASIC requires its own compiler, runtime, and kernel libraries. Every model that runs on Jalapeño requires compilation and validation against Jalapeño's toolchain. When a new model architecture ships, OpenAI needs software updates, not just the pretrained weights.
For a detailed look at the transformer-ASIC architectural bet and what it gives up, see Etched AI Sohu vs NVIDIA. For how NVIDIA approaches the same fixed-function inference trade-off with the Groq 3 LPU, see NVIDIA Groq 3 LPU Explained.
The Custom Silicon Landscape in 2026
Every major AI infrastructure player has a custom chip program. Here is the current state:
| Chip | Builder | Who Can Use It | Rentable | Software Stack |
|---|---|---|---|---|
| Jalapeño | OpenAI + Broadcom | OpenAI only | No | Proprietary |
| Trainium 3 | AWS | AWS customers | Via EC2 Trn3 only | Neuron SDK (not CUDA) |
| TPU Ironwood | Google | Google Cloud customers | Via GCP only | JAX/MaxText (not CUDA) |
| Maia 200 | Microsoft | Microsoft internally | No | Proprietary |
| Meta MTIA | Meta | Meta internally | No | Proprietary |
| Groq 3 LPU | NVIDIA + Groq | Early access (2026) | Limited | NVIDIA Dynamo |
| Etched Sohu | Etched AI | Not yet available | No | Etched proprietary |
The pattern is clear. Every hyperscaler-built ASIC is captive. Trainium and TPU Ironwood are technically available to external customers, but only through their respective cloud and with significant toolchain migration requirements: the Neuron SDK does not run CUDA code, and JAX is not compatible with vLLM or TensorRT-LLM. For the practical purposes of most teams, those chips are as inaccessible as Maia 200 or MTIA.
Groq 3 LPU is the closest thing to an exception: NVIDIA licensed the architecture and is building out data center deployments, with early access expected in late 2026. But it is still pre-production, still not broadly available, and still specialized for autoregressive inference.
For the full analysis of each hyperscaler chip's specs and economics, see Hyperscaler Custom AI Chips in 2026.
What Jalapeño Means If You Are Not OpenAI
The direct answer: nothing, operationally. Jalapeño is OpenAI-captive silicon. The chip ships end of 2026, ramps over years, and will never appear on a GPU cloud marketplace. There is no migration path, no API, no instance type.
What Jalapeño means indirectly is that OpenAI's API pricing may drop over time as their infrastructure becomes more efficient. If Jalapeño delivers meaningful performance-per-watt gains over H100, OpenAI can serve more tokens per dollar and potentially pass some of that savings to API customers. But this is speculation about future pricing, not something you can build on today.
For teams building on open-weight models, the practical infrastructure path is unchanged. The GPT-OSS models (20B and 120B MoE) that rival GPT-quality outputs on many benchmarks run on GPU cloud today. See Deploy GPT-OSS on GPU Cloud for a step-by-step setup guide. DeepSeek V4 Flash and Qwen 3 are competitive at similar quality tiers and run on H100 via vLLM without modification. See DeepSeek vs Llama 4 vs Qwen 3 for a benchmark comparison.
Both GPT-OSS and DeepSeek V4 Flash run on H100, H200, and B200 via vLLM unchanged. Same Docker image, same inference server, same CUDA ecosystem. The vLLM server setup guide on Spheron covers the full deployment steps on H100 and A100 instances. The only difference from one GPU generation to the next is an instance type change, not a toolchain migration.
Running Open-Weight Models on Spheron Today
For teams evaluating GPU cloud options right now, here is what on-demand and spot pricing looks like for the three primary inference GPUs on Spheron:
| GPU | On-Demand $/hr | Spot $/hr | Good for |
|---|---|---|---|
| H100 SXM5 | from $3.92/hr | from $2.91/hr | GPT-OSS 20B (1 GPU), GPT-OSS 120B MoE (4-8 GPU), DeepSeek V4 Flash (1 GPU FP8), Qwen3-32B (1 GPU) |
| H200 SXM5 | from $3.70/hr | from $3.31/hr | Larger context windows, GPT-OSS 120B MoE on fewer GPUs, higher-batch DeepSeek V4 |
| B200 SXM6 | N/A (spot only) | from $5.34/hr | Maximum throughput for high-volume serving; only spot instances available currently |
Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
The key practical differences between H100 and H200 for inference workloads:
H100 SXM5 (3.35 TB/s HBM3): Strong for models up to 70B at standard context lengths. A single H100 runs GPT-OSS 20B or DeepSeek V4 Flash at FP8 with headroom for parallel requests. At $3.92/hr on-demand or $2.91/hr spot, H100 is a practical starting point for 20B-class inference. Spot pricing at $2.91/hr is cheaper than H200 spot when reclamation risk is acceptable.
H200 SXM5 (4.8 TB/s HBM3e, 141 GB): The larger memory pool and higher bandwidth handle longer context windows and larger batch sizes. For GPT-OSS 120B MoE, H200 fits the model on fewer GPUs than H100 due to the 141 GB memory. At $3.70/hr on-demand, H200 is actually cheaper per hour than H100 SXM5 at current pricing.
B200 SXM6 (8 TB/s HBM3e, 192 GB): Maximum memory bandwidth and capacity. Currently available as spot only on Spheron at $5.34/hr per GPU. Best suited for high-volume serving where you need maximum tokens/sec throughput and can tolerate spot reclamation by using queued batch workloads. Check Spheron B200 instances for current multi-GPU configurations.
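One way to reason about the bandwidth figures above: single-stream decode is usually memory-bandwidth-bound, because each generated token requires reading the active weights from HBM. The sketch below computes that rough ceiling. It is a back-of-the-envelope bound under stated assumptions (1 byte per parameter at FP8, no KV cache traffic, no batching, perfect bandwidth utilization), not a benchmark, and the 20B figure is a hypothetical dense model chosen for illustration.

```python
# Memory bandwidth per GPU in bytes/sec, from the figures quoted above.
BANDWIDTH_BYTES_PER_SEC = {
    "H100 SXM5": 3.35e12,
    "H200 SXM5": 4.8e12,
    "B200 SXM6": 8.0e12,
}


def decode_ceiling_tokens_per_sec(
    bandwidth_bytes_per_sec: float,
    active_params: float,
    bytes_per_param: float = 1.0,  # assumption: FP8 weights
) -> float:
    """Upper bound on single-stream decode speed.

    Assumes every active weight is read from HBM once per token and
    ignores KV cache reads, batching, and kernel overhead.
    """
    return bandwidth_bytes_per_sec / (active_params * bytes_per_param)


ACTIVE_PARAMS = 20e9  # assumption: hypothetical 20B dense model

for gpu, bw in BANDWIDTH_BYTES_PER_SEC.items():
    ceiling = decode_ceiling_tokens_per_sec(bw, ACTIVE_PARAMS)
    print(f"{gpu}: <= {ceiling:.1f} tokens/sec per stream")
```

Real per-stream throughput lands below this ceiling, while aggregate throughput rises well above it once requests are batched, which is why measuring at your own batch size matters more than the spec sheet.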
For H100 SXM5 availability by region and provider, see H100 SXM5 availability on Spheron.
Decision Framework: When to Care About Custom ASICs
The question of whether custom inference ASICs matter to your team is mostly determined by your scale and infrastructure ownership model.
| Your Situation | Recommendation |
|---|---|
| Running at OpenAI scale, billions of requests/day, owning your data center infrastructure | Custom ASIC economics become compelling. At 10 GW scale, even 20% efficiency gains pay off the chip development cost in months. |
| GPU cloud, vLLM-based inference stack, using open-weight models | Custom ASICs do not exist for you today. Jalapeño is captive. Trainium and TPU require toolchain migration and cloud lock-in. GPU cloud is your only accessible path. |
| Evaluating Trainium 3 or TPU Ironwood inside AWS or GCP | Add migration cost math first. Rewriting from vLLM to Neuron SDK typically takes 2-6 engineering weeks. Multiply by your team's blended hourly rate and compare against lifetime efficiency savings before committing. |
| Considering Groq 3 LPU for high-volume decode workloads | This is worth tracking. Early access is expected late 2026. Establish a tokens/watt baseline on H100 now so you have a concrete comparison point when LPU pricing becomes public. |
| Evaluating Etched Sohu or similar transformer-only ASICs | Wait for independent production benchmarks. Startup ASICs without publicly verified third-party numbers carry high execution risk on top of the architectural inflexibility trade-off. |
The pattern across the table: if you are not a hyperscaler building infrastructure at data center scale, custom ASICs are not your current problem. The accessible chip market is GPU cloud, CUDA ecosystem, and the open-weight model library that runs on it.
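The migration-cost comparison suggested in the table can be sketched as a break-even calculation. Every input below is a placeholder to replace with your own numbers: the hourly rate, monthly GPU spend, and expected saving are assumptions, and only the 2-6 week range comes from the table (treated here as person-weeks).

```python
def migration_cost(person_weeks: float, hourly_rate: float,
                   hours_per_week: float = 40.0) -> float:
    """One-time engineering cost of porting the serving stack."""
    return person_weeks * hours_per_week * hourly_rate


def breakeven_months(cost: float, monthly_spend: float,
                     saving_fraction: float) -> float:
    """Months of reduced spend needed to recover the migration cost."""
    monthly_saving = monthly_spend * saving_fraction
    if monthly_saving <= 0:
        return float("inf")  # the migration never pays back
    return cost / monthly_saving


# Hypothetical inputs: adjust to your team and your bill.
cost = migration_cost(person_weeks=6, hourly_rate=120.0)
months = breakeven_months(cost, monthly_spend=50_000.0, saving_fraction=0.20)

print(f"Migration cost: ${cost:,.0f}")
print(f"Break-even:     {months:.2f} months")
```

The calculation ignores ongoing costs such as maintaining a second toolchain, so a real comparison would likely push the break-even point later than this sketch suggests.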
Jalapeño confirms that inference economics at scale favor custom silicon. That is a useful signal for long-term infrastructure planning. But it does not change what is available to rent today, and it does not change the fact that open-weight frontier models like DeepSeek V4, GPT-OSS, and Qwen 3 run on H100 via vLLM without modification.
OpenAI's Jalapeño chip is vertically integrated silicon built for one company at one scale. For everyone building on open-weight models today, GPU cloud is the accessible path. H100, H200, and B200 instances on Spheron run vLLM and SGLang unchanged, with spot pricing for batch workloads and no multi-year infrastructure commitment.
Quick Setup Guide
List every model you currently serve and classify each by architecture: dense transformer, MoE transformer, multimodal, diffusion, or SSM. Custom inference ASICs like Jalapeño or Etched Sohu target dense transformer inference only. If your stack includes MoE models (DeepSeek V4, Qwen3-235B), multimodal models, or any non-transformer architecture, fixed-function ASIC silicon is a non-starter regardless of throughput claims.
Provision an H100 SXM5 or H200 SXM5 instance on Spheron and run your production model with vLLM. Measure tokens per second at your actual median batch size. Calculate CPM = ($/hr) divided by (tokens_per_sec multiplied by 3600, divided by 1,000,000). This is the number any ASIC claim must beat after accounting for migration cost, toolchain rewrite, and vendor lock-in risk.
Start an H100 or B200 instance at app.spheron.ai and deploy using vLLM: 'python -m vllm.entrypoints.openai.api_server --model <model-name> --tensor-parallel-size 1 --quantization fp8'. Note that FP8 is selected with vLLM's --quantization flag; --dtype only accepts values such as auto, float16, and bfloat16. GPT-OSS 20B runs on a single H100. GPT-OSS 120B MoE runs on 2-4 H100s. DeepSeek V4 Flash runs on a single H100 at FP8. No architecture changes needed, no toolchain migration, no ASIC dependency.
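The cost-per-million-tokens (CPM) formula from the second step can be written directly. The throughput helper assumes the vLLM server is exposing its OpenAI-compatible API, whose responses carry a `usage.completion_tokens` count; the 4,000 tokens/sec value in the example is a placeholder, not a measured result, and the prices are the H100 figures from the pricing table earlier in this post.

```python
def tokens_per_sec(completion_tokens: int, elapsed_sec: float) -> float:
    """Throughput from a generated-token count and wall-clock time."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return completion_tokens / elapsed_sec


def cost_per_million_tokens(dollars_per_hour: float, tps: float) -> float:
    """CPM = ($/hr) / (tokens_per_sec * 3600 / 1,000,000)."""
    million_tokens_per_hour = tps * 3600 / 1_000_000
    return dollars_per_hour / million_tokens_per_hour


# Assumption: placeholder throughput; substitute your measured value.
ASSUMED_TPS = 4000.0

for label, price in [("H100 on-demand", 3.92), ("H100 spot", 2.91)]:
    cpm = cost_per_million_tokens(price, ASSUMED_TPS)
    print(f"{label}: ${cpm:.3f} per million tokens")
```

To get a real number, sum `usage.completion_tokens` across the responses returned during a timed run at your median batch size, pass the total to `tokens_per_sec`, and feed the result into `cost_per_million_tokens`.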
Frequently Asked Questions
Jalapeño is OpenAI's first custom LLM inference ASIC, designed in partnership with Broadcom on roughly a 9-month tape-out cycle. It targets autoregressive transformer inference at scale and is part of OpenAI's plan to build 10 GW of AI infrastructure by 2029. The chip is planned to ship at the end of 2026 with a multi-year ramp, and it is vertically integrated, meaning no one outside OpenAI can rent or access it.
No, Jalapeño cannot be rented or accessed outside OpenAI. It is OpenAI-captive silicon, not a product OpenAI sells or rents. It is infrastructure OpenAI builds to serve its own API traffic. Teams building on open-weight models need GPU cloud instead: H100, H200, and B200 instances on platforms like Spheron run vLLM and SGLang unchanged and are available today.
A custom inference ASIC hard-codes specific operations into silicon, trading flexibility for efficiency. An H100 or B200 is a programmable GPU: the same chip that trained GPT-4 runs DeepSeek V4 Flash, Qwen 3, and Llama 4 today without any hardware changes. An inference ASIC optimized for one architecture cannot adapt if that architecture changes, and it cannot train models, run fine-tuning, or handle multimodal workloads. The efficiency gain is real at scale, but the constraints are significant.
Broadcom is both the design partner and the chip manufacturer for Jalapeño. OpenAI worked with Broadcom's custom ASIC division to develop the chip on a roughly 9-month tape-out schedule. This is the same model Amazon, Google, and Microsoft use: contract out chip design and manufacturing to a specialist while keeping the resulting silicon captive for internal use.
For teams running open-weight models, GPU cloud is the accessible path. H100 SXM5 instances run GPT-OSS 20B and 120B, DeepSeek V4 Flash, and Qwen 3 via vLLM today, with on-demand pricing from around $3.92/hr per GPU. B200 SXM6 spot instances start around $5.34/hr. Both run the full CUDA ecosystem with no architectural lock-in. Jalapeño does not exist outside OpenAI's infrastructure, and no timeline suggests it will become externally rentable.
