Three open-weight frontier models are deployable as of April 2026: GPT-OSS 120B, GLM-5.1, and DeepSeek V4. The hardware cost spread between them is enormous. GPT-OSS 120B fits on a single H100 with MXFP4 quantization. GLM-5.1 needs at minimum 4x H200 for INT4 or 8x H200 for FP8. DeepSeek V4 requires 4x H200 or 8x H100. Picking the wrong model for your budget means either overpaying for capacity you don't need, or under-serving your workload with hardware that can't hold the weights.
This post puts all three models side by side: architecture, VRAM requirements, inference throughput, quality benchmarks, and cost per million tokens at current Spheron pricing.
TL;DR
| Use Case | Best Model | Min GPU Config | Cost/hr (Spheron, on-demand) |
|---|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Highest quality coding, Arena leader | GLM-5.1 | 4x H200 (INT4) or 8x H200 (FP8) | ~$18.16 / ~$36.33 |
| 1M token context, agentic reasoning | DeepSeek V4 | 4x H200 or 8x H100 | ~$18.16 / ~$35.28 |
| Cost-per-token optimization | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Batch inference, fine-tunable, Apache 2.0 license (provisional) | DeepSeek V4 | 8x H100 SXM5 | ~$35.28 |
Model Architecture Comparison
| Model | Total Params | Active Params | Architecture | Context Window | Min Weight Size (Quantized) | Min H100-equiv GPUs | License |
|---|---|---|---|---|---|---|---|
| GPT-OSS 120B | 120B | ~5.1B (MoE) | MoE | 128K | ~60GB (MXFP4)† | 1x H100 80GB | Apache 2.0 |
| GLM-5.1 | 754B | 40B (MoE) | MoE | 200K (128K max output) | ~377GB (INT4) / ~754GB (FP8)¶ | 4x H200 (INT4) / 8x H200 (FP8) | MIT |
| DeepSeek V4 | ~1T | ~37B (MoE) | MoE | 1M | ~500GB (FP8)‡ | 4x H200 or 8x H100 (FP8) | Apache 2.0 (provisional) |
† GPT-OSS 120B MXFP4 weight size: the ~60GB estimate covers expert weights compressed to 4-bit precision. MXFP4 is an Open Compute Project (OCP) industry standard; native hardware acceleration for it requires Blackwell GPUs (B200, GB200). On H100, the runtime falls back to software emulation of MXFP4 or MXFP8, which reduces throughput compared to native Blackwell. See FP4 quantization on Blackwell for details.
¶ GLM-5.1 INT4 estimate of ~377GB is the theoretical minimum at 4 bits per parameter (754B × 0.5 bytes). AWQ/GPTQ INT4 checkpoints in practice may be slightly larger due to block scale metadata. FP8 at ~754GB assumes 1 byte per parameter, excluding KV cache and activation overhead. At 32K context length with moderate batch sizes, expect 10-15% additional VRAM for KV cache.
‡ DeepSeek V4 FP8 at ~500GB is the weight-only minimum; weights are reported by pre-release sources as approximately 1T × 0.5 bytes with sparse MoE routing compression. At runtime, add 15-20% for activations and framework buffers. Treat all V4 figures as provisional until official release. The Apache 2.0 license is also provisional based on pre-release sources; confirm at the official DeepSeek Hugging Face repository before use. See the GPU memory requirements guide and the GPU requirements cheat sheet 2026 for VRAM planning details.
All three models are MoE architectures. The active parameter count (the column that drives per-token compute) ranges from ~5.1B (GPT-OSS 120B) to ~40B (GLM-5.1). Total parameter count sets the memory floor. For a deep dive on why you pay for storage even when only a fraction of parameters are active per token, see the MoE inference optimization guide.
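The storage-vs-compute split can be sketched with back-of-envelope arithmetic. A minimal sketch, using approximate figures from the table above (the 2× FLOPs-per-active-parameter rule of thumb is a standard dense-equivalent estimate, not a published spec):

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float) -> dict:
    """Rough MoE sizing: total params set the memory floor;
    active params drive per-token compute (~2 FLOPs per active param)."""
    return {
        "weight_gb": total_params_b * bytes_per_param,    # storage floor
        "gflops_per_token": 2 * active_params_b,          # dense-equivalent compute
    }

# Illustrative figures from the comparison table:
print(moe_footprint(120.0, 5.1, 0.5))   # GPT-OSS 120B, MXFP4 ~0.5 bytes/param
# → {'weight_gb': 60.0, 'gflops_per_token': 10.2}
print(moe_footprint(754.0, 40.0, 1.0))  # GLM-5.1, FP8 ~1 byte/param
# → {'weight_gb': 754.0, 'gflops_per_token': 80.0}
```

This is why GPT-OSS 120B and GLM-5.1 can have comparable per-request latency despite a 12x gap in weight footprint: the memory floor scales with total parameters, per-token compute with active parameters.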
GPU Hardware Requirements and Quantization
GPT-OSS 120B
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~240GB | 3x H100 80GB | 4x H100 SXM5 |
| FP8 | ~120GB | 2x H100 80GB | 2x H100 SXM5 |
| MXFP4 | ~60GB | 1x H100 80GB | 1x H100 SXM5 |
MXFP4 is a 4-bit floating-point format defined by the Open Compute Project (OCP) industry standard. NVIDIA Blackwell GPUs (B200, GB200) support it natively; NVIDIA also has a proprietary variant called NVFP4. On H100, vLLM falls back to software emulation at reduced throughput; you still get the memory savings but not the hardware speedup. For production on H100, FP8 at 2x H100 may give better throughput per dollar than MXFP4 on 1x H100, depending on your batch size. Link to the full setup guide: deploy GPT-OSS on GPU cloud.
GLM-5.1
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~1,508GB | 19x H100 | Not recommended |
| FP8 | ~754GB | 8x H200 SXM5 (1,128GB) | 8x H200 SXM5 |
| AWQ INT4 | ~377GB | 5x A100 80GB (400GB) or 4x H200 (564GB) | 4x H200 or 5x A100 SXM4 |
GLM-5.1 FP8 is the practical production configuration. At 32K context length with 16 concurrent requests, expect roughly 100-140GB additional VRAM for KV cache on top of the ~754GB weight footprint, so 8x H200 (1,128GB total) is tight at high concurrency. AWQ INT4 on 5x A100 80GB is viable for lower-concurrency deployments where quality loss from aggressive quantization is acceptable. Full setup: deploy GLM-5.1 on GPU cloud.
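The KV cache overhead above can be reproduced with the standard sizing formula (two tensors per layer, K and V, per token). The GLM-5.1 layer count, KV head count, and head dimension below are hypothetical placeholders for illustration; the model's exact dimensions are not published here:

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int,
                bytes_per_elem: float = 1.0) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token.
    bytes_per_elem=1.0 assumes an FP8 KV cache."""
    elems = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# Hypothetical GLM-5.1-scale dimensions (illustrative only):
print(round(kv_cache_gb(batch=16, seq_len=32_768, n_layers=92,
                        n_kv_heads=8, head_dim=128), 1))  # → 98.8 (GB)
```

With placeholder dimensions in this range, 16 concurrent 32K-context requests land near the low end of the ~100-140GB estimate, which is why 8x H200 (1,128GB) gets tight at higher concurrency.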
DeepSeek V4
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~2,000GB | 25x H100 | Not recommended on single node |
| FP8 | ~500GB | 4x H200 (564GB) or 8x H100 (640GB) | 8x H100 SXM5 |
| INT4 | ~250GB | 4x H100 80GB | 4x H100 SXM5 |
DeepSeek V4 with expert parallelism requires that the GPU cluster has high-bandwidth interconnect. On 8x H100 SXM5, NVLink handles the expert routing traffic efficiently. A cluster without NVLink (e.g., H100 PCIe) will bottleneck on inter-GPU communication during MoE routing and may not reach useful throughput even with enough aggregate VRAM. Full setup: deploy DeepSeek V4 on GPU cloud. See also the MoE inference optimization guide for expert parallelism configuration details.
Inference Benchmarks
Throughput (Tokens/Second)
Benchmark figures below are estimates based on published vLLM benchmarks and model architecture characteristics. Load: 50 concurrent requests with input/output sequence lengths of 512/256 tokens.
| Model | Hardware | Precision | Throughput (tokens/sec, aggregate) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | MXFP4 | ~600-800 |
| GPT-OSS 120B | 4x H100 SXM5 | FP8 | ~1,800-2,400 |
| GLM-5.1 | 8x H200 SXM5 | FP8 | ~800-1,200 |
| GLM-5.1 | 4x H200 SXM5 | INT4 | ~600-900 |
| DeepSeek V4 | 8x H100 SXM5 | FP8 | ~700-1,000 |
| DeepSeek V4 | 4x H200 SXM5 | FP8 | ~700-1,000 |
These throughput estimates could not be confirmed from official model cards or technical reports for the April 2026 versions. Treat them as directional and run your own vLLM benchmarking on your target hardware before capacity planning. Benchmark methodology and inference engine choice affect numbers significantly; see the vLLM vs TensorRT-LLM vs SGLang comparison for the impact on throughput.
Per-GPU throughput is where GPT-OSS 120B wins decisively. At ~600-800 tokens/sec on a single H100, it outperforms GLM-5.1 and DeepSeek V4 by roughly 5-8x on a normalized per-GPU basis, because those models spread computation across 4-8 GPUs to produce similar or lower aggregate throughput.
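As a starting point for your own measurement, here is a minimal stdlib-only load-test sketch against a vLLM OpenAI-compatible endpoint. The localhost URL, model ID, and prompt are assumptions to adapt to your deployment; dedicated tools like vLLM's own benchmark scripts will give more rigorous numbers:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def throughput_tok_s(total_completion_tokens: int, elapsed_s: float) -> float:
    """Aggregate decode throughput across all concurrent requests."""
    return total_completion_tokens / elapsed_s

def one_request(base_url: str, model: str) -> int:
    """Fire one completion; return completion tokens from the usage field."""
    body = json.dumps({"model": model,
                       "prompt": "Write a haiku about GPUs.",
                       "max_tokens": 256}).encode()
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def run_benchmark(base_url: str, model: str, concurrency: int = 50) -> float:
    """Send `concurrency` requests at once; return aggregate tokens/sec."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(lambda _: one_request(base_url, model),
                               range(concurrency)))
    return throughput_tok_s(sum(counts), time.time() - start)

# Example (requires a running vLLM server):
#   run_benchmark("http://localhost:8000", "openai/gpt-oss-120b")
```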
TTFT (Time to First Token)
| Model | Hardware | TTFT p50 (1 req) | TTFT p99 (10 req) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | ~80-120ms | ~200-400ms |
| GLM-5.1 | 8x H200 SXM5 | ~150-250ms | ~400-700ms |
| DeepSeek V4 | 8x H100 SXM5 | ~200-350ms | ~500-900ms |
TTFT figures are estimates; actual values depend heavily on prompt length, KV cache state, and batching configuration. For production latency tuning, see the vLLM vs TensorRT-LLM vs SGLang benchmarks post.
Throughput per Dollar
| Model | Config | $/hr | Est. Throughput (tok/sec) | Est. $/1M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |
Formula: ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens. You can substitute current pricing and your measured throughput to recalculate.
The Est. throughput column uses unverified estimates from the benchmarks section above. The resulting $/M tokens figures are also estimates; run your own throughput measurement before using these for budget planning.
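The formula as a reusable helper, checked against two rows of the table above (throughput inputs are the unverified estimates; substitute your own measurements):

```python
def dollars_per_million_tokens(price_per_hour: float,
                               tokens_per_sec: float) -> float:
    """($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens."""
    return (price_per_hour / 3600) / (tokens_per_sec / 1_000_000)

# Figures from the table above (estimated throughput):
print(round(dollars_per_million_tokens(4.41, 700), 2))    # → 1.75  (GPT-OSS, 1x H100)
print(round(dollars_per_million_tokens(36.33, 1000), 2))  # → 10.09 (GLM-5.1, 8x H200)
```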
Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Quality Benchmarks
Coding (SWE-bench, HumanEval)
| Model | SWE-bench Pro | HumanEval | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~74%¹ | ~85%² | OpenAI release notes, Aug 2025 |
| GLM-5.1 | 58.4% (94.6% of Claude Opus 4.6) | Not published³ | Z.ai/Zhipu official |
| DeepSeek V4 | Unverified⁴ | Unverified⁴ | Pre-release claims only |
¹ GPT-OSS 120B SWE-bench: this figure could not be confirmed from the official OpenAI model card (OpenAI does not publish SWE-bench Pro results directly). Treat it as an estimate and verify against your own eval before using it for production selection.
² GPT-OSS 120B HumanEval: OpenAI release notes for GPT-OSS (August 2025) reference strong coding performance but do not cite a specific HumanEval number in the published materials. Treat as unverified.
³ GLM-5.1 HumanEval: Z.ai has not published a standard HumanEval score. The widely cited "94.6% of Claude Opus 4.6" figure comes from Z.ai's SWE-bench Pro evaluation, a different benchmark; source data is Z.ai's April 2026 release notes.
⁴ DeepSeek V4 benchmark scores: as noted in the DeepSeek V4 deployment guide, V4 had not officially launched as of March 2026. Coding benchmark scores from pre-release materials are provisional and unverified.
General Reasoning (MMLU-Pro)
| Model | MMLU-Pro | Chatbot Arena Elo | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~79%⁵ | N/A | OpenAI internal benchmarks |
| GLM-5.1 | ~82%⁶ | 1467 | Z.ai official, Chatbot Arena |
| DeepSeek V4 | Unverified | N/A | Pre-release only |
⁵ GPT-OSS 120B MMLU-Pro: OpenAI's release materials reference competitive MMLU performance but do not publish MMLU-Pro scores directly. Treat as unverified.
⁶ GLM-5.1 Chatbot Arena Elo 1467 is from the official Chatbot Arena leaderboard as of April 2026, sourced from the GLM-5.1 deployment guide. The Elo reflects API-served GLM-5.1; self-hosted FP8 quality may differ slightly from the API endpoint quality.
For all benchmark figures: these numbers are starting points. Run your own evaluation on your specific task distribution before making hardware procurement decisions.
Cost-Per-Token Analysis
On-Demand Pricing
| Model | GPU Config | $/hr (on-demand) | Est. Throughput (tok/s) | Est. $/M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |
H200 SXM5 pricing ($4.54/hr per GPU, yielding $36.33 for 8x and $18.16 for 4x) is sourced from the H200 rental page. Confirm current rates at the pricing page before finalizing budget estimates.
The Est. $/M tokens figures use the unverified throughput estimates from the benchmarks section above. Substitute your measured throughput into the formula ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens for accurate cost modeling.
Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spot vs On-Demand
Spot instances can reduce costs 40-60% for GLM-5.1 deployments on H200 multi-GPU configurations, making them viable for batch inference and offline pipelines. The GLM-5.1 deployment guide documents spot pricing details for H200 configurations. For single-GPU GPT-OSS 120B, spot pricing makes an already-cheap deployment even more compelling for cost-sensitive workloads.
H100 SXM5 spot instances are not currently available on Spheron, so DeepSeek V4 deployments on 8x H100 SXM5 run on-demand only. Check the pricing page for up-to-date spot availability across GPU types before planning a spot-based pipeline for DeepSeek V4.
For workloads where uptime matters, on-demand is the safer choice. For nightly batch jobs, offline evaluation runs, or asynchronous inference queues, spot pricing on 8x H200 (GLM-5.1) can cut operating costs significantly.
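Under an assumed flat spot discount (50% here, the midpoint of the 40-60% range above), the $/M-token math works out as follows; real spot pricing varies with availability, so treat this as a planning sketch:

```python
def spot_cost_per_m(on_demand_hr: float, discount: float,
                    tokens_per_sec: float) -> float:
    """$/M tokens on spot, assuming a flat discount off the on-demand rate."""
    return (on_demand_hr * (1 - discount) / 3600) / (tokens_per_sec / 1_000_000)

# GLM-5.1 on 8x H200 FP8, assuming a 50% spot discount:
print(round(spot_cost_per_m(36.33, 0.50, 1000), 2))  # → 5.05, vs ~10.09 on-demand
```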
API Pricing Comparison
Self-hosting becomes cost-effective once your token volume is high enough to justify leaving the GPU on. For GPT-OSS 120B at $4.41/hr on-demand, the crossover against a $1-5/M token API depends on utilization. Running the GPU only when needed (spinning up per job), self-hosting beats most API tiers starting around 10M tokens/month. At continuous 24/7 operation, the break-even rises to hundreds of millions of tokens per month.
For GLM-5.1 and DeepSeek V4 at $18-36/hr, the same logic applies but the GPU cost is higher, so higher token volumes are required to break even. See the on-premise vs GPU cloud cost breakdown and the GPU cloud pricing comparison 2026 for the full break-even analysis.
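A minimal 24/7 break-even sketch, assuming an average 730-hour month and a hypothetical $3/M-token API rate (pick your actual API pricing):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def breakeven_tokens_per_month(gpu_hr: float, api_price_per_m: float) -> float:
    """Monthly token volume at which a 24/7 self-hosted GPU matches API spend."""
    monthly_gpu_cost = gpu_hr * HOURS_PER_MONTH
    return monthly_gpu_cost / api_price_per_m * 1_000_000

# GPT-OSS 120B on 1x H100 at $4.41/hr vs a hypothetical $3/M-token API:
print(f"{breakeven_tokens_per_month(4.41, 3.0) / 1e6:.0f}M tokens/month")
# → 1073M tokens/month
```

Running the GPU only when needed lowers this threshold dramatically, since you pay for hours actually used rather than the full month.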
Deployment Guide
Deploy GPT-OSS 120B on Spheron
```bash
# Single H100 SXM5 with MXFP4
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```
Note: MXFP4 hardware acceleration only applies on Blackwell GPUs (B200, GB200). On H100, vLLM uses software emulation for MXFP4 weight loading, which reduces throughput. If you need maximum throughput on H100, use FP8 with 2x H100 instead. See FP4 quantization on Blackwell GPUs for hardware requirements. Full deployment guide: deploy GPT-OSS on GPU cloud.
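Once the container is up, you can smoke-test the endpoint with a stdlib-only client. The /v1/chat/completions route is vLLM's standard OpenAI-compatible path; the localhost URL and 64-token cap are assumptions to adjust:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Request body for an OpenAI-compatible chat completion."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one chat completion and return the generated text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(f"{base_url}/v1/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (server from the command above must be running):
#   chat("http://localhost:8000", "openai/gpt-oss-120b", "Say hello in one word.")
```

The same client works against the GLM-5.1 and DeepSeek V4 servers below; only the model name changes.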
Deploy GLM-5.1 on Spheron (8x H200, FP8)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-1-fp8 \
  --served-model-name glm5-1-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --port 8000
```
Download weights first: `huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./glm5-1-fp8`. The zai-org/GLM-5.1-FP8 model ID is from Z.ai's official Hugging Face org as of April 2026; verify the ID hasn't changed at the Hugging Face repo before running. Full deployment guide: deploy GLM-5.1 on GPU cloud.
Deploy DeepSeek V4 on Spheron (8x H100, FP8)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 65536 \
  --port 8000
```
The model ID deepseek-ai/DeepSeek-V4 is provisional based on pre-release information. Confirm the final Hugging Face model ID at the DeepSeek Hugging Face page before using this command. Full deployment guide: deploy DeepSeek V4 on GPU cloud.
For users running vLLM Model Runner V2 (MRV2), see the vLLM MRV2 deployment guide for configuration differences. Model quick-start configs for all three models are also at docs.spheron.ai/quick-guides/llms/.
Decision Matrix
| Workload | Recommended Model | Why |
|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | MXFP4 on 1x H100, competitive coding benchmarks, Apache 2.0 |
| Highest quality coding, Chatbot Arena leader | GLM-5.1 | Elo 1467, 94.6% of Claude Opus 4.6 on SWE-bench |
| 1M token context, agentic reasoning | DeepSeek V4 | Largest context window, strongest long-context MoE |
| Cost-per-token optimization | GPT-OSS 120B | Single-GPU deployment, lowest $/hr by a wide margin |
| Batch inference, fine-tuned pipelines | DeepSeek V4 | Apache 2.0 (provisional), expert parallelism, 1M token context |
If you're unsure where to start, deploy GPT-OSS 120B first. It runs on a single H100, costs $4.41/hr on-demand, and gives you a working OpenAI-compatible endpoint without the multi-GPU orchestration complexity of GLM-5.1 or DeepSeek V4. Once you have a baseline and know your token volume and quality requirements, you can decide whether GLM-5.1's coding quality or DeepSeek V4's 1M context window justifies the 4-8x increase in GPU cost.
GPT-OSS 120B, GLM-5.1, and DeepSeek V4 are all available on Spheron with on-demand and spot pricing. Pick a single H100 for GPT-OSS 120B, scale up to 8x H200 for GLM-5.1, or run DeepSeek V4 on 8x H100, all with per-minute billing.
