Tutorial

Deploy GPT-OSS on GPU Cloud: Self-Host OpenAI's First Open-Source Model (2026)

Written by Mitrasish, Co-founder · Apr 2, 2026
Tags: GPT-OSS, OpenAI Open Source, vLLM, Self-Hosted LLM, GPU Cloud, LLM Deployment, H100, A100

GPT-OSS is OpenAI's first Apache 2.0 licensed model, released August 5, 2025. It ships in two variants: a 20B Mixture-of-Experts model and a 120B Mixture-of-Experts model. Both run on hardware you can rent by the hour, with no usage restrictions and no per-token API fees once deployed. This guide covers exactly how to get both variants running on a GPU instance, from instance selection through production monitoring. For vLLM multi-GPU production setup beyond what's covered here, see the vLLM production deployment guide.

What Is GPT-OSS and Why OpenAI Released It

GPT-OSS ships under the Apache 2.0 license, which means no usage restrictions, no request to OpenAI before commercial deployment, and no royalty obligations. You can fine-tune it, modify it, and ship it inside a product without needing OpenAI's permission.

The two variants cover different use cases:

  • GPT-OSS 20B: Mixture-of-Experts model, 21 billion total parameters with approximately 3.6 billion active per forward pass (32 experts, Top-4 routing). Fits on a single A100 80GB. Good for applications where you want predictable latency and straightforward deployment.
  • GPT-OSS 120B MoE: Mixture-of-Experts, 120 billion total parameters with a fraction active per forward pass. Higher capability ceiling, fits on a single H100 80GB with MXFP4 quantization.

The open-model deployment landscape has expanded fast. Llama 4 and DeepSeek V3.2 established that frontier-quality models can run on rented hardware. GPT-OSS follows the same pattern with OpenAI's weights behind it.

GPT-OSS 20B vs GPT-OSS 120B: Architecture and Benchmarks

| Property | GPT-OSS 20B | GPT-OSS 120B MoE |
| --- | --- | --- |
| Architecture | Mixture-of-Experts | Mixture-of-Experts |
| Total parameters | 21B | 120B |
| Active parameters per forward pass | ~3.6B (Top-4 of 32 experts) | ~5.1B (Top-4 of 128 experts) |
| Context length | 128K tokens | 128K tokens |
| License | Apache 2.0 | Apache 2.0 |

The MoE architecture means GPT-OSS 120B computes roughly the same amount of work per token as a 20B dense model during inference, despite having 120B total parameters. The expert routing selects a subset of specialist layers for each token. This is the same pattern used in the Llama 4 deployment guide, where Scout has 109B total parameters but only 17B active per pass.

The practical result: GPT-OSS 120B MoE achieves significantly higher benchmark scores than the 20B model at similar inference cost per token, once the model fits in VRAM.
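
A quick back-of-envelope check makes this concrete, using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. This is an approximation for intuition, not an official GPT-OSS figure:

```python
# Rough decode-time compute: ~2 FLOPs per ACTIVE parameter per token.
# Rule-of-thumb estimate only; ignores attention-over-context costs.

def decode_flops_per_token(active_params: float) -> float:
    return 2 * active_params

flops_20b = decode_flops_per_token(3.6e9)   # ~7.2e9 FLOPs/token
flops_120b = decode_flops_per_token(5.1e9)  # ~1.02e10 FLOPs/token

# Despite ~6x the total parameters, the 120B MoE does only ~1.4x the
# per-token compute of the 20B model.
print(f"{flops_120b / flops_20b:.2f}x")  # 1.42x
```

That ~1.4x compute ratio is why the two variants land at similar cost per token once the 120B model's weights fit in VRAM.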

Benchmark comparisons (from OpenAI's August 2025 release notes):

| Benchmark | GPT-OSS 20B | GPT-OSS 120B MoE |
| --- | --- | --- |
| MMLU | 85.3 | 90.0 |
| GPQA Diamond | 71.5 | 80.9 |

GPU Requirements: VRAM, Memory, and Storage

VRAM requirements start with the weights formula: memory = parameters × bytes_per_element. GPT-OSS 20B is a MoE model with 21B total parameters. All expert weights must be loaded into VRAM even though only ~3.6B parameters activate per token. At BF16 (2 bytes per parameter), that is approximately 42 GB for weights alone. Add KV cache and framework overhead and the practical floor is about 50 GB for low-concurrency workloads. For the full derivation of every component in GPU memory usage, see the GPU memory requirements guide.
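
The weights formula is easy to wrap in a small helper. The bytes-per-parameter values below are the standard storage sizes for each format (MXFP4's per-block scale factors add a small overhead on top of the raw 0.5 bytes, ignored here for simplicity):

```python
# Minimal weight-memory estimator for the formula above:
#   memory = parameters x bytes_per_element
# MXFP4 per-block scale overhead is ignored; treat results as floors.

BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "mxfp4": 0.5}

def weight_memory_gb(total_params: float, precision: str) -> float:
    """VRAM for weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(21e9, "bf16"))    # 42.0 -> the ~42 GB figure above
print(weight_memory_gb(21e9, "fp8"))     # 21.0
print(weight_memory_gb(120e9, "mxfp4"))  # 60.0
```

KV cache and framework overhead come on top of these numbers, which is why the practical floor for GPT-OSS 20B at BF16 is closer to 50 GB.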

GPT-OSS 120B MoE is more complex. The full model stored in FP16 occupies about 240 GB across all expert weights. With MXFP4 (4-bit) quantization, the stored size drops to roughly 60 GB, which fits comfortably on a single H100 SXM5 80GB. Without quantization, you need 3-4 H100s with tensor parallelism.

| Variant | Precision | VRAM | Recommended GPU | Spheron Price |
| --- | --- | --- | --- | --- |
| GPT-OSS 20B | BF16 | ~42 GB | A100 80G SXM4 | from $1.08/hr |
| GPT-OSS 20B | FP8 | ~21 GB | RTX 4090 / RTX 5090 | from $0.51/hr |
| GPT-OSS 120B MoE | MXFP4 | ~60 GB | H100 SXM5 80GB | from $2.40/hr |
| GPT-OSS 120B MoE | BF16 TP4 | ~240 GB | 4x H100 SXM5 | from $9.60/hr |

Storage: model weights download from Hugging Face Hub. GPT-OSS 20B is approximately 40 GB, GPT-OSS 120B is approximately 240 GB. Provision at least 2x the model size in disk space to handle the download and unpacking overhead.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Deploy GPT-OSS 20B with vLLM on a Single A100

Step 1: Provision and verify your instance

Rent an A100 80G SXM4 on Spheron's A100 GPU rental. SSH in and confirm your GPU:

bash
nvidia-smi

You should see the A100 80GB with 81,920 MiB of VRAM. If the NVIDIA Container Toolkit is not pre-installed:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Validate GPU access inside Docker:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Step 2: Launch GPT-OSS 20B with vLLM

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 256

Flag breakdown:

  • --ipc=host: required for shared memory between GPU processes. Skipping this causes CUDA errors under load.
  • --dtype bfloat16: BF16 gives you full model quality on A100 without the numerical instability of FP16 at the extremes.
  • --gpu-memory-utilization 0.90: leaves 10% headroom. The A100 80GB has plenty of room for GPT-OSS 20B weights (~42GB) plus KV cache.
  • --max-model-len 32768: 32K context. Raise to 65536 or 131072 if your workload needs longer context and you can reduce --max-num-seqs to compensate.

FP8 variant for smaller GPUs (24 GB VRAM):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384

FP8 cuts weight size from ~42 GB to ~21 GB, allowing GPT-OSS 20B to run on a single RTX 5090 or RTX 4090.

Step 3: Test the endpoint

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
    "max_tokens": 200
  }'
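
The same request can be issued from Python using only the standard library. The URL assumes the vLLM container from Step 2 is listening on localhost:8000; the helper name is just for illustration:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build the same chat-completions request as the curl call above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000", "openai/gpt-oss-20b",
                         "Explain MXFP4 quantization in one paragraph.")
# To send it against the live server from Step 2:
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because vLLM exposes an OpenAI-compatible API, any OpenAI client SDK pointed at this base URL works the same way.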

For multi-GPU tensor parallelism and load balancing across multiple instances, see the vLLM production deployment guide. For KV cache tuning with --kv-cache-dtype fp8 and prefix caching, see the KV cache optimization guide.

Deploy GPT-OSS 120B MoE with MXFP4 on H100

MXFP4 is Microscaling FP4, a 4-bit floating-point format standardized by the Open Compute Project (OCP) consortium with contributions from AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. It compresses MoE expert weights to 4 bits per parameter, reducing the 120B model's stored size from ~240 GB (FP16) to approximately 60 GB. On Hopper GPUs (H100), vLLM uses the Triton matmul_ogs kernel for MXFP4 computation; the Marlin kernel is a fallback for non-Hopper architectures. Native MXFP4 tensor core support starts with Blackwell. For a look at native FP4 quantization on Blackwell, see the FP4 quantization guide.

Single H100 with MXFP4 (recommended):

bash
# Requires vLLM v0.17+
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768

At 0.92 GPU memory utilization on an H100 SXM5 80GB, you have about 73 GB available. MXFP4 weights occupy ~60 GB, leaving ~13 GB for KV cache. This supports moderate concurrency (20-40 simultaneous requests at 1K context). For higher concurrency, reduce --max-model-len to free more KV cache space.
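
To budget concurrency yourself, the per-token KV footprint follows from the model's attention config: 2 (K and V) x layers x KV heads x head dimension x bytes per element. The layer/head/dimension values below are hypothetical placeholders for illustration; read the real ones from the model's config.json on Hugging Face:

```python
# KV-cache budgeting sketch. The config values passed in below are
# HYPOTHETICAL placeholders, not GPT-OSS 120B's published architecture.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: 2 (K and V) x layers x heads x dim x size."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(kv_budget_gb: float, per_token_bytes: int) -> int:
    """Total tokens that fit in the KV budget (decimal GB)."""
    return int(kv_budget_gb * 1e9 // per_token_bytes)

per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=64)
print(per_token)                           # bytes per cached token
print(max_cached_tokens(13.0, per_token))  # tokens fitting in ~13 GB
```

Divide the total cached tokens by your expected context length per request to estimate concurrent capacity. This simple model ignores vLLM's paged-attention block granularity, so treat it as an upper bound.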

Multi-GPU BF16 without quantization (4x H100, full precision):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

4x H100 SXM5 gives you 320 GB total VRAM, enough for the 240 GB BF16 weights plus KV cache. NVLink between SXM GPUs keeps tensor parallelism communication overhead low.

SGLang vs vLLM vs Ollama for GPT-OSS

| Engine | Best for | GPT-OSS 20B | GPT-OSS 120B MoE | Ease of setup |
| --- | --- | --- | --- | --- |
| vLLM | Production, multi-user | Full support | MXFP4 quantization | Moderate |
| SGLang | Structured output, low TTFT | Full support | Full support | Moderate |
| Ollama | Single-user local dev | Supported | Slow (no MoE opt) | Easy |

For thorough benchmark numbers across all three engines on the same H100, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

SGLang launch command for GPT-OSS 20B:

bash
pip install sglang[all]

python -m sglang.launch_server \
  --model openai/gpt-oss-20b \
  --port 8000 \
  --dtype bfloat16 \
  --mem-fraction-static 0.88

SGLang's RadixAttention caches KV state for shared prompt prefixes across requests. For chatbot applications with a fixed system prompt, SGLang reduces time-to-first-token compared to vLLM by reusing that cached prefix instead of recomputing it per request. The OpenAI-compatible API is at /v1/chat/completions, same endpoint as vLLM.

Ollama for local development:

bash
ollama run gpt-oss:20b

Ollama works fine for a single developer testing GPT-OSS locally. It does not support continuous batching, so concurrent requests queue behind each other. Do not use Ollama for production API serving above 2-3 simultaneous users. For a detailed comparison of Ollama and vLLM across throughput, feature support, and production readiness, see the Ollama vs vLLM comparison.

Performance Benchmarks: Throughput and Latency

The numbers below are representative estimates based on comparable model architectures (20B MoE and 120B MoE with similar active parameter counts) running on the same hardware class. Official GPT-OSS benchmarks from OpenAI's release were not published with server-side throughput figures at the time of writing. Treat these as directional, not precise.

| GPU | Model | Engine | Throughput (tok/s) | TTFT p50 (ms) | p99 latency (ms) |
| --- | --- | --- | --- | --- | --- |
| A100 SXM4 80G | GPT-OSS 20B | vLLM BF16 | ~1,200 | ~95 | ~380 |
| H100 SXM5 80G | GPT-OSS 20B | vLLM FP8 | ~2,100 | ~68 | ~220 |
| H100 SXM5 80G | GPT-OSS 120B MoE | vLLM MXFP4 | ~1,600 | ~140 | ~480 |
| 4x H100 SXM5 | GPT-OSS 120B MoE | vLLM BF16 TP4 | ~2,800 | ~110 | ~310 |

The MoE architecture means GPT-OSS 120B MXFP4 on a single H100 reaches comparable throughput to GPT-OSS 20B BF16 on an A100, because only ~5.1B parameters activate per token (4 of 128 experts). For GPU performance comparisons across workloads, see the GPU cloud benchmarks.

Cost Comparison: Self-Hosting GPT-OSS vs OpenAI API

The break-even calculation is straightforward: at what monthly token volume does the per-hour GPU cost undercut the per-token API fee?

Assumptions used below:

  • GPT-OSS 20B on A100: $1.08/hr on-demand, ~1,200 tokens/sec throughput, ~70% utilization
  • GPT-OSS 120B MoE on H100: $2.40/hr on-demand, ~1,600 tokens/sec throughput, ~70% utilization
  • OpenAI API cost: $2/million tokens (input+output blended estimate, varies by model and tier)
  • Spot pricing: A100 from $0.45/hr, H100 from $0.80/hr

| Monthly tokens | OpenAI API | GPT-OSS 20B (A100) | GPT-OSS 120B (H100) |
| --- | --- | --- | --- |
| 10M | ~$20 | ~$4 | ~$6 |
| 100M | ~$200 | ~$36 | ~$60 |
| 1B | ~$2,000 | ~$360 | ~$600 |

Self-hosting pays off quickly. Even at 10M tokens per month, the A100 cost is roughly one-fifth of the API equivalent. The calculation tilts further toward self-hosting as volume grows, because the GPU cost is fixed per hour regardless of how many tokens you generate.
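
The self-hosting figures in the table can be reproduced from the stated assumptions with a short script. The throughput and utilization numbers are the estimates listed above, not measurements:

```python
# Reproduce the table's self-hosting costs from the stated assumptions.

def self_host_cost(monthly_tokens: float, tok_per_sec: float,
                   hourly_rate: float, utilization: float = 0.70) -> float:
    """Monthly GPU cost: busy hours inflated by the utilization factor."""
    busy_hours = monthly_tokens / tok_per_sec / 3600
    billed_hours = busy_hours / utilization  # pay for idle headroom too
    return billed_hours * hourly_rate

def api_cost(monthly_tokens: float, usd_per_million: float = 2.0) -> float:
    return monthly_tokens / 1e6 * usd_per_million

for tokens in (10e6, 100e6, 1e9):
    a100 = self_host_cost(tokens, 1200, 1.08)   # GPT-OSS 20B on A100
    h100 = self_host_cost(tokens, 1600, 2.40)   # GPT-OSS 120B on H100
    print(f"{tokens/1e6:>6.0f}M  API ${api_cost(tokens):>7.0f}"
          f"  A100 ${a100:>6.0f}  H100 ${h100:>6.0f}")
```

Swapping in spot rates ($0.45/hr A100, $0.80/hr H100) shows why preemptible instances are attractive for batch workloads.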

For batch inference workloads (embeddings, offline document processing, nightly jobs), use spot instances: A100 from $0.45/hr vs $1.08/hr on-demand, H100 from $0.80/hr vs $2.40/hr. Spot cuts costs significantly for workloads that can tolerate preemption.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For a deeper cost breakdown across instance types and reservation strategies, see the GPU cost optimization playbook and the serverless vs on-demand vs reserved comparison.

Production Checklist: Monitoring, Scaling, and High Availability

  1. GPU health monitoring: run nvidia-smi dmon -s pum -d 10 for real-time per-GPU metrics during load testing. For production, integrate DCGM with Prometheus and Grafana. Watch GPU utilization, memory usage, and temperature. See the GPU monitoring guide for a DCGM + Prometheus setup.
  2. vLLM metrics endpoint: vLLM exposes Prometheus-compatible metrics at /metrics. Key signals to watch:
  • vllm:num_requests_waiting: queue depth. If this stays above 0 under normal load, you need more GPU capacity.
  • vllm:kv_cache_usage_perc: KV cache fill rate. Above 90% consistently means you should reduce --max-model-len or add more instances.
  • vllm:time_to_first_token_seconds: p50 and p99 TTFT. For GPT-OSS 20B on A100, expect p50 under 100ms at low concurrency.
  3. Horizontal scaling: run multiple vLLM instances and load balance with nginx:
nginx
   upstream gptoss {
     server 127.0.0.1:8000;
     server 127.0.0.1:8001;
     keepalive 100;
   }

   server {
     listen 80;
     location / {
       proxy_pass http://gptoss;
       proxy_http_version 1.1;
       proxy_set_header Connection "";
       proxy_set_header Host $host;
       proxy_buffering off;
       proxy_read_timeout 300;
     }
   }

Directive breakdown:

  • keepalive 100: enables connection pooling in the upstream block, but requires proxy_http_version 1.1 and proxy_set_header Connection "" in the location block. Without these, nginx defaults to HTTP/1.0 for upstream connections, which does not support persistent connections and makes keepalive a no-op.
  • proxy_buffering off: required for SSE/streaming completions. Without it, nginx buffers the full upstream response before forwarding, defeating token streaming.
  • proxy_read_timeout 300: raises the default 60-second timeout to 5 minutes. LLM inference for long outputs on a 120B model can easily exceed 60 seconds, causing 504 Gateway Timeout errors without this setting.

Each vLLM instance handles one GPU. For GPT-OSS 120B MoE with MXFP4, each instance handles one H100.

  4. Spot vs on-demand: use spot instances for batch inference jobs (A100 spot from $0.45/hr vs $1.08/hr on-demand, H100 spot from $0.80/hr vs $2.40/hr on-demand). Use on-demand for latency-sensitive APIs where a preemption would break a live user request. Configure your deployment to drain in-flight requests before yielding a spot instance.
  5. Graceful shutdown: vLLM handles SIGTERM gracefully by default, draining in-flight requests before exiting. There is no --shutdown-timeout CLI flag in vLLM. The correct way to control the drain window is to set your load balancer's connection-draining timeout (30-60 seconds is typical), then stop routing new requests to the instance before sending SIGTERM. vLLM will finish active requests during that window.
  6. Health check endpoint: vLLM exposes /health that returns 200 when the server is ready. Use this in your load balancer health check rather than the model endpoint, which will return errors during model loading.

For production architecture patterns covering multi-region deployments, failover, and inference caching, see the production GPU cloud architecture guide.


GPT-OSS gives you a commercially free model that runs on infrastructure you control. Spheron provides the A100 and H100 instances to run it, with spot pricing that cuts costs further for batch workloads.

Rent A100 → | Rent H100 → | View all GPU pricing →

Get started on Spheron →
