
Open-Weight Frontier Model Showdown 2026: GPT-OSS 120B vs GLM-5.1 vs DeepSeek V4

Written by Mitrasish, Co-founder · Apr 19, 2026
Tags: Open-Weight Frontier Models, Frontier Model Comparison 2026, GPT-OSS 120B, GLM-5.1, DeepSeek V4 Benchmark, GPU Benchmarks, LLM Cost Per Token, MoE LLM Inference, GPU Cloud

Three open-weight frontier models are deployable as of April 2026: GPT-OSS 120B, GLM-5.1, and DeepSeek V4. The hardware cost spread between them is enormous. GPT-OSS 120B fits on a single H100 with MXFP4 quantization. GLM-5.1 needs at minimum 4x H200 for INT4 or 8x H200 for FP8. DeepSeek V4 requires 4x H200 or 8x H100. Picking the wrong model for your budget means either overpaying for capacity you don't need, or under-serving your workload with hardware that can't hold the weights.

This post puts all three models side by side: architecture, VRAM requirements, inference throughput, quality benchmarks, and cost per million tokens at current Spheron pricing.

TL;DR

| Use Case | Best Model | Min GPU Config | Cost/hr (Spheron, on-demand) |
|---|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Highest quality coding, Arena leader | GLM-5.1 | 4x H200 (INT4) or 8x H200 (FP8) | ~$18.16 / ~$36.33 |
| 1M token context, agentic reasoning | DeepSeek V4 | 4x H200 or 8x H100 | ~$18.16 / ~$35.28 |
| Cost-per-token optimization | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Batch inference, fine-tunable, Apache 2.0 license (provisional) | DeepSeek V4 | 8x H100 SXM5 | ~$35.28 |

Model Architecture Comparison

| Model | Total Params | Active Params | Architecture | Context Window | Min Weight Size (Quantized) | Min GPU Config | License |
|---|---|---|---|---|---|---|---|
| GPT-OSS 120B | 120B | ~5.1B (MoE) | MoE | 128K | ~60GB (MXFP4)† | 1x H100 80GB | Apache 2.0 |
| GLM-5.1 | 754B | 40B (MoE) | MoE | 200K (128K max output) | ~377GB (INT4) / ~754GB (FP8)¶ | 4x H200 (INT4) / 8x H200 (FP8) | MIT |
| DeepSeek V4 | ~1T | ~37B (MoE) | MoE | 1M | ~500GB (FP8)‡ | 4x H200 or 8x H100 (FP8) | Apache 2.0 (provisional) |

† GPT-OSS 120B MXFP4 weight size: the ~60GB estimate covers expert weights compressed to 4-bit precision. MXFP4 is an Open Compute Project (OCP) industry standard; native hardware acceleration for it requires Blackwell GPUs (B200, GB200). On H100, the runtime falls back to software emulation of MXFP4 or MXFP8, which reduces throughput compared to native Blackwell. See FP4 quantization on Blackwell for details.

¶ GLM-5.1 INT4 estimate of ~377GB is theoretical minimum at 4-bits per parameter (754B × 0.5 bytes). AWQ/GPTQ INT4 in practice may be slightly larger due to block metadata. FP8 at ~754GB assumes 1 byte per parameter, excluding KV cache and activation overhead. At 32K context length with moderate batch sizes, expect 10-15% additional VRAM for KV cache.

‡ DeepSeek V4 FP8 at ~500GB is the weight-only minimum; weights are reported by pre-release sources as approximately 1T × 0.5 bytes with sparse MoE routing compression. At runtime, add 15-20% for activations and framework buffers. Treat all V4 figures as provisional until official release. The Apache 2.0 license is also provisional based on pre-release sources; confirm at the official DeepSeek Hugging Face repository before use. See the GPU memory requirements guide and the GPU requirements cheat sheet 2026 for VRAM planning details.

All three models are MoE architectures. The active parameter count (the column that drives per-token compute) ranges from ~5.1B (GPT-OSS 120B) to ~40B (GLM-5.1). Total parameter count sets the memory floor. For a deep dive on why you pay for storage even when only a fraction of parameters are active per token, see the MoE inference optimization guide.
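As a sanity check on the table, the weight floor is one multiplication: total parameters times bytes per parameter at the chosen precision. (DeepSeek V4's ~500GB figure relies on the provisional compression claim in footnote ‡ rather than plain 1-byte-per-parameter FP8, so it is left out of this sketch.)

```python
# Weight-only VRAM floor: total parameters x bytes per parameter.
# KV cache and activation overhead (roughly 10-20% extra at runtime,
# per the footnotes above) are NOT included.
import math

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5, "mxfp4": 0.5}

def weight_gb(total_params_b: float, precision: str) -> float:
    """Weight footprint in GB for a parameter count given in billions."""
    return total_params_b * BYTES_PER_PARAM[precision]

def min_gpus(weights_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest GPU count whose aggregate VRAM holds the weights alone."""
    return math.ceil(weights_gb / vram_per_gpu_gb)

print(weight_gb(120, "mxfp4"))  # GPT-OSS 120B -> 60.0 GB, fits 1x H100 80GB
print(weight_gb(754, "int4"))   # GLM-5.1 -> 377.0 GB
print(weight_gb(754, "fp8"))    # GLM-5.1 -> 754.0 GB
print(min_gpus(754.0, 141.0))   # 6x H200 for weights alone; 8x leaves KV headroom
```

Note that `min_gpus` counts only weight storage, which is why the recommended configs in the tables below are larger than this lower bound.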

GPU Hardware Requirements and Quantization

GPT-OSS 120B

| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~240GB | 3x H100 80GB | 4x H100 SXM5 |
| FP8 | ~120GB | 2x H100 80GB | 2x H100 SXM5 |
| MXFP4 | ~60GB | 1x H100 80GB | 1x H100 SXM5 |

MXFP4 is a 4-bit floating-point format defined by the Open Compute Project (OCP) industry standard. NVIDIA Blackwell GPUs (B200, GB200) support it natively; NVIDIA also has a proprietary variant called NVFP4. On H100, vLLM falls back to software emulation at reduced throughput; you still get the memory savings but not the hardware speedup. For production on H100, FP8 at 2x H100 may give better throughput per dollar than MXFP4 on 1x H100, depending on your batch size. Full setup guide: deploy GPT-OSS on GPU cloud.

GLM-5.1

| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~1,508GB | 19x H100 | Not recommended |
| FP8 | ~754GB | 8x H200 SXM5 (1,128GB) | 8x H200 SXM5 |
| AWQ INT4 | ~377GB | 5x A100 80GB (400GB) or 4x H200 (564GB) | 4x H200 or 5x A100 SXM4 |

GLM-5.1 FP8 is the practical production configuration. At 32K context length with 16 concurrent requests, expect roughly 100-140GB additional VRAM for KV cache on top of the ~754GB weight footprint, so 8x H200 (1,128GB total) is tight at high concurrency. AWQ INT4 on 5x A100 80GB is viable for lower-concurrency deployments where quality loss from aggressive quantization is acceptable. Full setup: deploy GLM-5.1 on GPU cloud.
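The 100-140GB KV cache estimate can be reproduced with the standard KV-size formula. The layer count, KV-head count, and head dimension below are hypothetical placeholders (GLM-5.1's architecture internals are not given in this post), chosen only to show the calculation; swap in the real values from the model config.

```python
# Rough KV cache sizing for serving at a given context length and
# concurrency. n_layers, n_kv_heads, and head_dim are HYPOTHETICAL
# placeholders, not GLM-5.1's published architecture.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch,
                bytes_per_elem=1):  # 1 byte/element for an FP8 KV cache
    # K and V are each [n_kv_heads, head_dim] per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 1e9

# 32K context, 16 concurrent requests, FP8 KV, placeholder dims:
print(round(kv_cache_gb(92, 8, 128, 32_768, 16), 1))  # -> 98.8 GB
```

With these placeholder dimensions the result lands near the low end of the 100-140GB range above, which is why 8x H200 (1,128GB) is described as tight at high concurrency.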

DeepSeek V4

| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~2,000GB | 25x H100 | Not recommended on single node |
| FP8 | ~500GB | 4x H200 (564GB) or 8x H100 (640GB) | 8x H100 SXM5 |
| INT4 | ~250GB | 4x H100 80GB | 4x H100 SXM5 |

DeepSeek V4 with expert parallelism requires that the GPU cluster has high-bandwidth interconnect. On 8x H100 SXM5, NVLink handles the expert routing traffic efficiently. A cluster without NVLink (e.g., H100 PCIe) will bottleneck on inter-GPU communication during MoE routing and may not reach useful throughput even with enough aggregate VRAM. Full setup: deploy DeepSeek V4 on GPU cloud. See also the MoE inference optimization guide for expert parallelism configuration details.

Inference Benchmarks

Throughput (Tokens/Second)

Benchmark figures below are estimates based on published vLLM benchmarks and model architecture characteristics. Batch size: 50 concurrent requests, input/output sequence of 512/256 tokens.

| Model | Hardware | Precision | Throughput (tokens/sec, aggregate) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | MXFP4 | ~600-800 |
| GPT-OSS 120B | 4x H100 SXM5 | FP8 | ~1,800-2,400 |
| GLM-5.1 | 8x H200 SXM5 | FP8 | ~800-1,200 |
| GLM-5.1 | 4x H200 SXM5 | INT4 | ~600-900 |
| DeepSeek V4 | 8x H100 SXM5 | FP8 | ~700-1,000 |
| DeepSeek V4 | 4x H200 SXM5 | FP8 | ~700-1,000 |

These throughput estimates could not be confirmed from official model cards or technical reports for the April 2026 versions. Treat them as directional and run your own vLLM benchmarking on your target hardware before capacity planning. Benchmark methodology and inference engine choice affect numbers significantly; see the vLLM vs TensorRT-LLM vs SGLang comparison for the impact on throughput.
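When you do run your own measurement, the aggregate number reduces to simple arithmetic over the completion log. The sample data below is illustrative, not a real measurement:

```python
# Aggregate throughput from a benchmark run: total generated tokens
# divided by wall-clock time for the whole batch.

def aggregate_throughput(records, wall_clock_s):
    """records: list of (prompt_tokens, completion_tokens) per request."""
    total_generated = sum(completion for _, completion in records)
    return total_generated / wall_clock_s

# 50 concurrent requests, 256 output tokens each, finishing in ~18s
# (hypothetical numbers matching the batch shape described above):
records = [(512, 256)] * 50
print(round(aggregate_throughput(records, 18.0), 1))  # -> 711.1 tok/s
```

Measure wall-clock time across the whole concurrent batch, not per request; summing per-request rates overstates throughput when requests overlap.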

Per-GPU throughput is where GPT-OSS 120B wins decisively. At ~600-800 tokens/sec on a single H100, it delivers roughly 5-8x the per-GPU throughput of GLM-5.1 and DeepSeek V4, which spread similar or lower aggregate throughput across 8 GPUs.

TTFT (Time to First Token)

| Model | Hardware | TTFT p50 (1 req) | TTFT p99 (10 req) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | ~80-120ms | ~200-400ms |
| GLM-5.1 | 8x H200 SXM5 | ~150-250ms | ~400-700ms |
| DeepSeek V4 | 8x H100 SXM5 | ~200-350ms | ~500-900ms |

TTFT figures are estimates; actual values depend heavily on prompt length, KV cache state, and batching configuration. For production latency tuning, see the vLLM vs TensorRT-LLM vs SGLang benchmarks post.

Throughput per Dollar

| Model | Config | $/hr | Est. Throughput (tok/sec) | Est. $/1M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |

Formula: ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens. You can substitute current pricing and your measured throughput to recalculate.
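The formula as a helper you can re-run with your own measured throughput and current pricing:

```python
# Cost per million generated tokens from hourly GPU price and
# sustained aggregate throughput.

def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    return (usd_per_hour / 3600) / (tokens_per_sec / 1_000_000)

# Reproducing the table rows above (throughputs are the unverified
# midpoint estimates, so the outputs are estimates too):
print(round(usd_per_million_tokens(4.41, 700), 2))    # GPT-OSS 120B -> 1.75
print(round(usd_per_million_tokens(36.33, 1000), 2))  # GLM-5.1 FP8 -> 10.09
print(round(usd_per_million_tokens(35.28, 850), 2))   # DeepSeek V4 -> 11.53
```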

The Est. throughput column uses unverified estimates from the benchmarks section above. The resulting $/M tokens figures are also estimates; run your own throughput measurement before using these for budget planning.

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Quality Benchmarks

Coding (SWE-bench, HumanEval)

| Model | SWE-bench Pro | HumanEval | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~74%¹ | ~85%² | OpenAI release notes, Aug 2025 |
| GLM-5.1 | 58.4% (94.6% of Claude Opus 4.6) | ~94.6%³ | Z.ai/Zhipu official |
| DeepSeek V4 | Unverified⁴ | Unverified⁴ | Pre-release claims only |

¹ GPT-OSS 120B SWE-bench: this figure could not be confirmed from the official OpenAI model card (OpenAI does not publish SWE-bench Pro scores directly). Treat it as an estimate and verify against your own eval before using it for production model selection.

² GPT-OSS 120B HumanEval: OpenAI release notes for GPT-OSS (August 2025) reference strong coding performance but do not cite a specific HumanEval number in the published materials. Treat as unverified.

³ GLM-5.1 HumanEval percentage: the "94.6% of Claude Opus 4.6" figure comes from Z.ai's SWE-bench Pro evaluation (via the GLM-5.1 deployment guide, sourced from Z.ai's April 2026 release notes), not from standard HumanEval; the two are different benchmarks. The number sits in the HumanEval column only because that is the common metric header.

⁴ DeepSeek V4 benchmark scores: as noted in the DeepSeek V4 deployment guide, V4 had not officially launched as of March 2026. Coding benchmark scores from pre-release materials are provisional and unverified.

General Reasoning (MMLU-Pro)

| Model | MMLU-Pro | Chatbot Arena Elo | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~79%⁵ | N/A | OpenAI internal benchmarks |
| GLM-5.1 | ~82%⁶ | 1467 | Z.ai official, Chatbot Arena |
| DeepSeek V4 | Unverified | N/A | Pre-release only |

⁵ GPT-OSS 120B MMLU-Pro: OpenAI's release materials reference competitive MMLU performance but do not publish MMLU-Pro scores directly. Treat as unverified.

⁶ GLM-5.1 Chatbot Arena Elo 1467 is from the official Chatbot Arena leaderboard as of April 2026, sourced from the GLM-5.1 deployment guide. The Elo reflects API-served GLM-5.1; self-hosted FP8 quality may differ slightly from the API endpoint quality.

For all benchmark figures: these numbers are starting points. Run your own evaluation on your specific task distribution before making hardware procurement decisions.

Cost-Per-Token Analysis

On-Demand Pricing

| Model | GPU Config | $/hr (on-demand) | Est. Throughput (tok/s) | Est. $/M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |

H200 SXM5 pricing ($4.54/hr per GPU, yielding $36.33 for 8x and $18.16 for 4x) is sourced from the H200 rental page. Confirm current rates at the pricing page before finalizing budget estimates.

The Est. $/M tokens figures use the unverified throughput estimates from the benchmarks section above. Substitute your measured throughput into the formula ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens for accurate cost modeling.

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Spot vs On-Demand

Spot instances can reduce costs 40-60% for GLM-5.1 deployments on H200 multi-GPU configurations, making them viable for batch inference and offline pipelines. The GLM-5.1 deployment guide documents spot pricing details for H200 configurations. For single-GPU GPT-OSS 120B, spot pricing makes it even more compelling for cost-sensitive workloads.

H100 SXM5 spot instances are not currently available on Spheron, so DeepSeek V4 deployments on 8x H100 SXM5 run on-demand only. Check the pricing page for up-to-date spot availability across GPU types before planning a spot-based pipeline for DeepSeek V4.

For workloads where uptime matters, on-demand is the safer choice. For nightly batch jobs, offline evaluation runs, or asynchronous inference queues, spot pricing on 8x H200 (GLM-5.1) can cut operating costs significantly.

API Pricing Comparison

Self-hosting becomes cost-effective once your token volume is high enough to justify leaving the GPU on. For GPT-OSS 120B at $4.41/hr on-demand, the crossover against a $1-5/M token API depends on utilization. If you run the GPU only when needed (spinning up per job), self-hosting beats most API tiers starting around 10M tokens/month. At continuous 24/7 operation, the break-even rises to hundreds of millions of tokens per month.
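The 24/7 break-even can be checked directly. The API prices here are illustrative points in the $1-5/M range mentioned above, not quotes from any specific provider:

```python
# Break-even monthly token volume for 24/7 self-hosting vs a
# pay-per-token API, assuming $4.41/hr on-demand GPU pricing.
HOURS_PER_MONTH = 730  # average month

def breakeven_m_tokens(gpu_usd_per_hour: float, api_usd_per_m: float) -> float:
    """Millions of tokens/month where 24/7 GPU cost equals the API bill."""
    monthly_gpu_cost = gpu_usd_per_hour * HOURS_PER_MONTH
    return monthly_gpu_cost / api_usd_per_m

print(round(breakeven_m_tokens(4.41, 5.0)))  # vs $5/M API -> ~644M tokens/month
print(round(breakeven_m_tokens(4.41, 1.0)))  # vs $1/M API -> ~3219M tokens/month
```

Both results land in the "hundreds of millions of tokens per month" range the text describes for continuous operation.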

For GLM-5.1 and DeepSeek V4 at $18-36/hr, the same logic applies but the GPU cost is higher, so higher token volumes are required to break even. See the on-premise vs GPU cloud cost breakdown and the GPU cloud pricing comparison 2026 for the full break-even analysis.

Deployment Guide

Deploy GPT-OSS 120B on Spheron

```bash
# Single H100 SXM5 with MXFP4
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```

Note: MXFP4 hardware acceleration only applies on Blackwell GPUs (B200, GB200). On H100, vLLM uses software emulation for MXFP4 weight loading, which reduces throughput. If you need maximum throughput on H100, use FP8 with 2x H100 instead. See FP4 quantization on Blackwell GPUs for hardware requirements. Full deployment guide: deploy GPT-OSS on GPU cloud.
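Once the container is up, any OpenAI-compatible client works against it. A minimal stdlib sketch (the localhost URL matches the port mapping above; the prompt is illustrative):

```python
# Smoke-test payload for the OpenAI-compatible endpoint vLLM exposes.
# The model name matches the --model flag in the deploy command above.
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}

def query() -> str:
    """POST the payload and return the assistant message text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running: print(query())
```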

Deploy GLM-5.1 on Spheron (8x H200, FP8)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-1-fp8 \
  --served-model-name glm5-1-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --port 8000
```

Download weights first: huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./glm5-1-fp8. The zai-org/GLM-5.1-FP8 model ID is from Z.ai's official Hugging Face org as of April 2026; verify the ID hasn't changed at the Hugging Face repo before running. Full deployment guide: deploy GLM-5.1 on GPU cloud.

Deploy DeepSeek V4 on Spheron (8x H100, FP8)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 65536 \
  --port 8000
```

The model ID deepseek-ai/DeepSeek-V4 is provisional based on pre-release information. Confirm the final Hugging Face model ID at the DeepSeek Hugging Face page before using this command. Full deployment guide: deploy DeepSeek V4 on GPU cloud.

For users running vLLM Model Runner V2 (MRV2), see the vLLM MRV2 deployment guide for configuration differences. Model quick-start configs for all three models are also at docs.spheron.ai/quick-guides/llms/.

Decision Matrix

| Workload | Recommended Model | Why |
|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | MXFP4 on 1x H100, competitive coding benchmarks, Apache 2.0 |
| Highest quality coding, Chatbot Arena leader | GLM-5.1 | Elo 1467, 94.6% of Claude Opus 4.6 on SWE-bench |
| 1M token context, agentic reasoning | DeepSeek V4 | Largest context window, strongest long-context MoE |
| Cost-per-token optimization | GPT-OSS 120B | Single-GPU deployment, lowest $/hr by a wide margin |
| Batch inference, fine-tuned pipelines | DeepSeek V4 | Apache 2.0 (provisional), expert parallelism, 1M token context |

If you're unsure where to start, deploy GPT-OSS 120B first. It runs on a single H100, costs $4.41/hr on-demand, and gives you a working OpenAI-compatible endpoint without the multi-GPU orchestration complexity of GLM-5.1 or DeepSeek V4. Once you have a baseline and know your token volume and quality requirements, you can decide whether GLM-5.1's coding quality or DeepSeek V4's 1M context window justifies the 4-8x increase in GPU cost.


GPT-OSS 120B, GLM-5.1, and DeepSeek V4 are all available on Spheron with on-demand and spot pricing. Pick a single H100 for GPT-OSS 120B, scale up to 8x H200 for GLM-5.1, or run DeepSeek V4 on 8x H100, all with per-minute billing.

Rent H100 → | Rent H200 → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.