Tutorial

Deploy GPT-OSS on GPU Cloud: Self-Host OpenAI's First Open-Source Model (2026)

Written by Mitrasish, Co-founder · Apr 2, 2026
Tags: GPT-OSS, OpenAI Open Source, vLLM, Self-Hosted LLM, GPU Cloud, LLM Deployment, H100, A100

GPT-OSS is OpenAI's first Apache 2.0 licensed model, released August 5, 2025. It ships in two variants: a 20B Mixture-of-Experts model and a 120B Mixture-of-Experts model. Both run on hardware you can rent by the hour, with no usage restrictions and no per-token API fees once deployed. This guide covers exactly how to get both variants running on a GPU instance, from instance selection through production monitoring. For vLLM multi-GPU production setup beyond what's covered here, see the vLLM production deployment guide.

What Is GPT-OSS and Why OpenAI Released It

GPT-OSS ships under the Apache 2.0 license, which means no usage restrictions, no request to OpenAI before commercial deployment, and no royalty obligations. You can fine-tune it, modify it, and ship it inside a product without needing OpenAI's permission.

The two variants cover different use cases:

  • GPT-OSS 20B: Mixture-of-Experts model, 21 billion total parameters with approximately 3.6 billion active per forward pass (32 experts, Top-4 routing). Fits on a single A100 80GB. Good for applications where you want predictable latency and straightforward deployment.
  • GPT-OSS 120B MoE: Mixture-of-Experts, 120 billion total parameters with a fraction active per forward pass. Higher capability ceiling, fits on a single H100 80GB with MXFP4 quantization.

The open-model deployment landscape has expanded fast. Llama 4 and DeepSeek V3.2 established that frontier-quality models can run on rented hardware. GPT-OSS follows the same pattern with OpenAI's weights behind it.

GPT-OSS 20B vs GPT-OSS 120B: Architecture and Benchmarks

| Property | GPT-OSS 20B | GPT-OSS 120B MoE |
| --- | --- | --- |
| Architecture | Mixture-of-Experts | Mixture-of-Experts |
| Total parameters | 21B | 120B |
| Active parameters per forward pass | ~3.6B (Top-4 of 32 experts) | ~5.1B (Top-4 of 128 experts) |
| Context length | 128K tokens | 128K tokens |
| License | Apache 2.0 | Apache 2.0 |

The MoE architecture means GPT-OSS 120B computes roughly the same amount of work per token as a 20B dense model during inference, despite having 120B total parameters. The expert routing selects a subset of specialist layers for each token. This is the same pattern used in the Llama 4 deployment guide, where Scout has 109B total parameters but only 17B active per pass.

The practical result: GPT-OSS 120B MoE achieves significantly higher benchmark scores than the 20B model at similar inference cost per token, once the model fits in VRAM.
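
A quick back-of-envelope check makes this concrete, using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. This is an approximation for intuition, not an official GPT-OSS figure:

```python
# Rough decode-time compute: ~2 FLOPs per ACTIVE parameter per token.
# Rule-of-thumb estimate only; ignores attention-over-context costs.

def decode_flops_per_token(active_params: float) -> float:
    return 2 * active_params

flops_20b = decode_flops_per_token(3.6e9)   # ~7.2e9 FLOPs/token
flops_120b = decode_flops_per_token(5.1e9)  # ~1.02e10 FLOPs/token

# Despite ~6x the total parameters, the 120B MoE does only ~1.4x the
# per-token compute of the 20B model.
print(f"{flops_120b / flops_20b:.2f}x")  # 1.42x
```

That ~1.4x compute ratio is why the two variants land at similar cost per token once the 120B model's weights fit in VRAM.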

Benchmark comparisons (from OpenAI's August 2025 release notes):

| Benchmark | GPT-OSS 20B | GPT-OSS 120B MoE |
| --- | --- | --- |
| MMLU | 85.3 | 90.0 |
| GPQA Diamond | 71.5 | 80.9 |

GPU Requirements: VRAM, Memory, and Storage

VRAM requirements start with the weights formula: memory = parameters × bytes_per_element. GPT-OSS 20B is a MoE model with 21B total parameters. All expert weights must be loaded into VRAM even though only ~3.6B parameters activate per token. At BF16 (2 bytes per parameter), that is approximately 42 GB for weights alone. Add KV cache and framework overhead and the practical floor is about 50 GB for low-concurrency workloads. For the full derivation of every component in GPU memory usage, see the GPU memory requirements guide.
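
The weights formula is easy to wrap in a small helper. The bytes-per-parameter values below are the standard storage sizes for each format (MXFP4's per-block scale factors add a small overhead on top of the raw 0.5 bytes, ignored here for simplicity):

```python
# Minimal weight-memory estimator for the formula above:
#   memory = parameters x bytes_per_element
# MXFP4 per-block scale overhead is ignored; treat results as floors.

BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "mxfp4": 0.5}

def weight_memory_gb(total_params: float, precision: str) -> float:
    """VRAM for weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(21e9, "bf16"))    # 42.0 -> the ~42 GB figure above
print(weight_memory_gb(21e9, "fp8"))     # 21.0
print(weight_memory_gb(120e9, "mxfp4"))  # 60.0
```

KV cache and framework overhead come on top of these numbers, which is why the practical floor for GPT-OSS 20B at BF16 is closer to 50 GB.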

GPT-OSS 120B MoE is more complex. The full model stored in FP16 occupies about 240 GB across all expert weights. With MXFP4 (4-bit) quantization, the stored size drops to roughly 60 GB, which fits comfortably on a single H100 SXM5 80GB. Without quantization, you need 3-4 H100s with tensor parallelism.

| Variant | Precision | VRAM | Recommended GPU | Spheron Price |
| --- | --- | --- | --- | --- |
| GPT-OSS 20B | BF16 | ~42 GB | A100 80G SXM4 | from $1.08/hr |
| GPT-OSS 20B | FP8 | ~21 GB | RTX 4090 / RTX 5090 | from $0.51/hr |
| GPT-OSS 120B MoE | MXFP4 | ~60 GB | H100 SXM5 80GB | from $2.40/hr |
| GPT-OSS 120B MoE | BF16 TP4 | ~240 GB | 4x H100 SXM5 | from $9.60/hr |

Storage: model weights download from Hugging Face Hub. GPT-OSS 20B is approximately 40 GB, GPT-OSS 120B is approximately 240 GB. Provision at least 2x the model size in disk space to handle the download and unpacking overhead.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Deploy GPT-OSS 20B with vLLM on a Single A100

Step 1: Provision and verify your instance

Rent an A100 80G SXM4 on Spheron's A100 GPU rental. SSH in and confirm your GPU:

bash
nvidia-smi

You should see the A100 80GB with 81,920 MiB of VRAM. If the NVIDIA Container Toolkit is not pre-installed:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Validate GPU access inside Docker:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Step 2: Launch GPT-OSS 20B with vLLM

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 256

Flag breakdown:

  • --ipc=host: required for shared memory between GPU processes. Skipping this causes CUDA errors under load.
  • --dtype bfloat16: BF16 gives you full model quality on A100 without the numerical instability of FP16 at the extremes.
  • --gpu-memory-utilization 0.90: leaves 10% headroom. The A100 80GB has plenty of room for GPT-OSS 20B weights (~42GB) plus KV cache.
  • --max-model-len 32768: 32K context. Raise to 65536 or 131072 if your workload needs longer context and you can reduce --max-num-seqs to compensate.

FP8 variant for smaller GPUs (24 GB VRAM):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384

FP8 cuts weight size from ~42 GB to ~21 GB, allowing GPT-OSS 20B to run on a single RTX 5090 or RTX 4090.

Step 3: Test the endpoint

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
    "max_tokens": 200
  }'
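
The same request can be issued from Python using only the standard library. The URL assumes the vLLM container from Step 2 is listening on localhost:8000; the helper name is just for illustration:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build the same chat-completions request as the curl call above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000", "openai/gpt-oss-20b",
                         "Explain MXFP4 quantization in one paragraph.")
# To send it against the live server from Step 2:
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because vLLM exposes an OpenAI-compatible API, any OpenAI client SDK pointed at this base URL works the same way.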

For multi-GPU tensor parallelism and load balancing across multiple instances, see the vLLM production deployment guide. For KV cache tuning with --kv-cache-dtype fp8 and prefix caching, see the KV cache optimization guide.

Deploy GPT-OSS 120B MoE with MXFP4 on H100

MXFP4 is Microscaling FP4, a 4-bit floating-point format standardized by the Open Compute Project (OCP) consortium with contributions from AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. It compresses MoE expert weights to 4 bits per parameter, reducing the 120B model's stored size from ~240 GB (FP16) to approximately 60 GB. On Hopper GPUs (H100), vLLM uses the Triton matmul_ogs kernel for MXFP4 computation; the Marlin kernel is a fallback for non-Hopper architectures. Native MXFP4 tensor core support starts with Blackwell. For a look at native FP4 quantization on Blackwell, see the FP4 quantization guide.

Single H100 with MXFP4 (recommended):

bash
# Requires vLLM v0.17+
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768

At 0.92 GPU memory utilization on an H100 SXM5 80GB, you have about 73 GB available. MXFP4 weights occupy ~60 GB, leaving ~13 GB for KV cache. This supports moderate concurrency (20-40 simultaneous requests at 1K context). For higher concurrency, reduce --max-model-len to free more KV cache space.
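
To budget concurrency yourself, the per-token KV footprint follows from the model's attention config: 2 (K and V) x layers x KV heads x head dimension x bytes per element. The layer/head/dimension values below are hypothetical placeholders for illustration; read the real ones from the model's config.json on Hugging Face:

```python
# KV-cache budgeting sketch. The config values passed in below are
# HYPOTHETICAL placeholders, not GPT-OSS 120B's published architecture.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: 2 (K and V) x layers x heads x dim x size."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(kv_budget_gb: float, per_token_bytes: int) -> int:
    """Total tokens that fit in the KV budget (decimal GB)."""
    return int(kv_budget_gb * 1e9 // per_token_bytes)

per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=64)
print(per_token)                           # bytes per cached token
print(max_cached_tokens(13.0, per_token))  # tokens fitting in ~13 GB
```

Divide the total cached tokens by your expected context length per request to estimate concurrent capacity. This simple model ignores vLLM's paged-attention block granularity, so treat it as an upper bound.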

Multi-GPU BF16 without quantization (4x H100, full precision):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

4x H100 SXM5 gives you 320 GB total VRAM, enough for the 240 GB BF16 weights plus KV cache. NVLink between SXM GPUs keeps tensor parallelism communication overhead low.

SGLang vs vLLM vs Ollama for GPT-OSS

| Engine | Best for | GPT-OSS 20B | GPT-OSS 120B MoE | Ease of setup |
| --- | --- | --- | --- | --- |
| vLLM | Production, multi-user | Full support | MXFP4 quantization | Moderate |
| SGLang | Structured output, low TTFT | Full support | Full support | Moderate |
| Ollama | Single-user local dev | Supported | Slow (no MoE opt) | Easy |

For thorough benchmark numbers across all three engines on the same H100, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

SGLang launch command for GPT-OSS 20B:

bash
pip install sglang[all]

python -m sglang.launch_server \
  --model openai/gpt-oss-20b \
  --port 8000 \
  --dtype bfloat16 \
  --mem-fraction-static 0.88

SGLang's RadixAttention caches KV state for shared prompt prefixes across requests. For chatbot applications with a fixed system prompt, SGLang reduces time-to-first-token compared to vLLM by reusing that cached prefix instead of recomputing it per request. The OpenAI-compatible API is at /v1/chat/completions, same endpoint as vLLM.

Ollama for local development:

bash
ollama run gpt-oss:20b

Ollama works fine for a single developer testing GPT-OSS locally. It does not support continuous batching, so concurrent requests queue behind each other. Do not use Ollama for production API serving above 2-3 simultaneous users. For a detailed comparison of Ollama and vLLM across throughput, feature support, and production readiness, see the Ollama vs vLLM comparison.

Performance Benchmarks: Throughput and Latency

The numbers below are representative estimates based on comparable model architectures (20B MoE and 120B MoE with similar active parameter counts) running on the same hardware class. Official GPT-OSS benchmarks from OpenAI's release were not published with server-side throughput figures at the time of writing. Treat these as directional, not precise.

| GPU | Model | Engine | Throughput (tok/s) | TTFT p50 (ms) | p99 latency (ms) |
| --- | --- | --- | --- | --- | --- |
| A100 SXM4 80G | GPT-OSS 20B | vLLM BF16 | ~1,200 | ~95 | ~380 |
| H100 SXM5 80G | GPT-OSS 20B | vLLM FP8 | ~2,100 | ~68 | ~220 |
| H100 SXM5 80G | GPT-OSS 120B MoE | vLLM MXFP4 | ~1,600 | ~140 | ~480 |
| 4x H100 SXM5 | GPT-OSS 120B MoE | vLLM BF16 TP4 | ~2,800 | ~110 | ~310 |

The MoE architecture means GPT-OSS 120B MXFP4 on a single H100 reaches comparable throughput to GPT-OSS 20B BF16 on an A100, because only ~5.1B parameters activate per token (4 of 128 experts). For GPU performance comparisons across workloads, see the GPU cloud benchmarks.

Cost Comparison: Self-Hosting GPT-OSS vs OpenAI API

The break-even calculation is straightforward: at what monthly token volume does the per-hour GPU cost undercut the per-token API fee?

Assumptions used below:

  • GPT-OSS 20B on A100: $1.08/hr on-demand, ~1,200 tokens/sec throughput, ~70% utilization
  • GPT-OSS 120B MoE on H100: $2.40/hr on-demand, ~1,600 tokens/sec throughput, ~70% utilization
  • OpenAI API cost: $2/million tokens (input+output blended estimate, varies by model and tier)
  • Spot pricing: A100 from $0.45/hr, H100 from $0.80/hr

| Monthly tokens | OpenAI API | GPT-OSS 20B (A100) | GPT-OSS 120B (H100) |
| --- | --- | --- | --- |
| 10M | ~$20 | ~$4 | ~$6 |
| 100M | ~$200 | ~$36 | ~$60 |
| 1B | ~$2,000 | ~$360 | ~$600 |

Self-hosting pays off quickly. Even at 10M tokens per month, the A100 cost is roughly one-fifth of the API equivalent. The calculation tilts further toward self-hosting as volume grows, because the GPU cost is fixed per hour regardless of how many tokens you generate.
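
The self-hosting figures in the table can be reproduced from the stated assumptions with a short script. The throughput and utilization numbers are the estimates listed above, not measurements:

```python
# Reproduce the table's self-hosting costs from the stated assumptions.

def self_host_cost(monthly_tokens: float, tok_per_sec: float,
                   hourly_rate: float, utilization: float = 0.70) -> float:
    """Monthly GPU cost: busy hours inflated by the utilization factor."""
    busy_hours = monthly_tokens / tok_per_sec / 3600
    billed_hours = busy_hours / utilization  # pay for idle headroom too
    return billed_hours * hourly_rate

def api_cost(monthly_tokens: float, usd_per_million: float = 2.0) -> float:
    return monthly_tokens / 1e6 * usd_per_million

for tokens in (10e6, 100e6, 1e9):
    a100 = self_host_cost(tokens, 1200, 1.08)   # GPT-OSS 20B on A100
    h100 = self_host_cost(tokens, 1600, 2.40)   # GPT-OSS 120B on H100
    print(f"{tokens/1e6:>6.0f}M  API ${api_cost(tokens):>7.0f}"
          f"  A100 ${a100:>6.0f}  H100 ${h100:>6.0f}")
```

Swapping in spot rates ($0.45/hr A100, $0.80/hr H100) shows why preemptible instances are attractive for batch workloads.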

For batch inference workloads (embeddings, offline document processing, nightly jobs), use spot instances: A100 from $0.45/hr vs $1.08/hr on-demand, H100 from $0.80/hr vs $2.40/hr. Spot cuts costs significantly for workloads that can tolerate preemption.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For a deeper cost breakdown across instance types and reservation strategies, see the GPU cost optimization playbook and the serverless vs on-demand vs reserved comparison.

Production Checklist: Monitoring, Scaling, and High Availability

  1. GPU health monitoring: run nvidia-smi dmon -s pum -d 10 for real-time per-GPU metrics during load testing. For production, integrate DCGM with Prometheus and Grafana. Watch GPU utilization, memory usage, and temperature. See the GPU monitoring guide for a DCGM + Prometheus setup.
  2. vLLM metrics endpoint: vLLM exposes Prometheus-compatible metrics at /metrics. Key signals to watch:
  • vllm:num_requests_waiting: queue depth. If this stays above 0 under normal load, you need more GPU capacity.
  • vllm:kv_cache_usage_perc: KV cache fill rate. Above 90% consistently means you should reduce --max-model-len or add more instances.
  • vllm:time_to_first_token_seconds: p50 and p99 TTFT. For GPT-OSS 20B on A100, expect p50 under 100ms at low concurrency.
  3. Horizontal scaling: run multiple vLLM instances and load balance with nginx:
nginx
   upstream gptoss {
     server 127.0.0.1:8000;
     server 127.0.0.1:8001;
     keepalive 100;
   }

   server {
     listen 80;
     location / {
       proxy_pass http://gptoss;
       proxy_http_version 1.1;
       proxy_set_header Connection "";
       proxy_set_header Host $host;
       proxy_buffering off;
       proxy_read_timeout 300;
     }
   }

Directive breakdown:

  • keepalive 100: enables connection pooling in the upstream block, but requires proxy_http_version 1.1 and proxy_set_header Connection "" in the location block. Without these, nginx defaults to HTTP/1.0 for upstream connections, which does not support persistent connections and makes keepalive a no-op.
  • proxy_buffering off: required for SSE/streaming completions. Without it, nginx buffers the full upstream response before forwarding, defeating token streaming.
  • proxy_read_timeout 300: raises the default 60-second timeout to 5 minutes. LLM inference for long outputs on a 120B model can easily exceed 60 seconds, causing 504 Gateway Timeout errors without this setting.

Each vLLM instance handles one GPU. For GPT-OSS 120B MoE with MXFP4, each instance handles one H100.

  4. Spot vs on-demand: use spot instances for batch inference jobs (A100 spot from $0.45/hr vs $1.08/hr on-demand, H100 spot from $0.80/hr vs $2.40/hr on-demand). Use on-demand for latency-sensitive APIs where a preemption would break a live user request. Configure your deployment to drain in-flight requests before yielding a spot instance.
  5. Graceful shutdown: vLLM handles SIGTERM gracefully by default, draining in-flight requests before exiting. There is no --shutdown-timeout CLI flag in vLLM. The correct way to control the drain window is to set your load balancer's connection-draining timeout (30-60 seconds is typical), then stop routing new requests to the instance before sending SIGTERM. vLLM will finish active requests during that window.
  6. Health check endpoint: vLLM exposes /health that returns 200 when the server is ready. Use this in your load balancer health check rather than the model endpoint, which will return errors during model loading.

For production architecture patterns covering multi-region deployments, failover, and inference caching, see the production GPU cloud architecture guide.


GPT-OSS gives you a commercially free model that runs on infrastructure you control. Spheron provides the A100 and H100 instances to run it, with spot pricing that cuts costs further for batch workloads.

Rent A100 → | Rent H100 → | View all GPU pricing →

Get started on Spheron →
