Three open-weight frontier models are deployable as of April 2026: GPT-OSS 120B, GLM-5.1, and DeepSeek V4. The hardware cost spread between them is enormous. GPT-OSS 120B fits on a single H100 with MXFP4 quantization. GLM-5.1 needs at minimum 4x H200 for INT4 or 8x H200 for FP8. DeepSeek V4 requires 4x H200 or 8x H100. Picking the wrong model for your budget means either overpaying for capacity you don't need, or under-serving your workload with hardware that can't hold the weights.
This post puts all three models side by side: architecture, VRAM requirements, inference throughput, quality benchmarks, and cost per million tokens at current Spheron pricing.
TL;DR
| Use Case | Best Model | Min GPU Config | Cost/hr (Spheron, on-demand) |
|---|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Highest quality coding, Arena leader | GLM-5.1 | 4x H200 (INT4) or 8x H200 (FP8) | ~$18.16 / ~$36.33 |
| 1M token context, agentic reasoning | DeepSeek V4 | 4x H200 or 8x H100 | ~$18.16 / ~$35.28 |
| Cost-per-token optimization | GPT-OSS 120B | 1x H100 SXM5 | ~$4.41 |
| Batch inference, fine-tunable, Apache 2.0 license (provisional) | DeepSeek V4 | 8x H100 SXM5 | ~$35.28 |
Model Architecture Comparison
| Model | Total Params | Active Params | Architecture | Context Window | Min Weight Size (Quantized) | Min H100-equiv GPUs | License |
|---|---|---|---|---|---|---|---|
| GPT-OSS 120B | 120B | ~5.1B (MoE) | MoE | 128K | ~60GB (MXFP4)† | 1x H100 80GB | Apache 2.0 |
| GLM-5.1 | 754B | 40B (MoE) | MoE | 200K (128K max output) | ~377GB (INT4) / ~754GB (FP8)¶ | 4x H200 (INT4) / 8x H200 (FP8) | MIT |
| DeepSeek V4 | ~1T | ~37B (MoE) | MoE | 1M | ~500GB (FP8)‡ | 4x H200 or 8x H100 (FP8) | Apache 2.0 (provisional) |
† GPT-OSS 120B MXFP4 weight size: the ~60GB estimate covers expert weights compressed to 4-bit precision. MXFP4 is an Open Compute Project (OCP) industry standard; native hardware acceleration for it requires Blackwell GPUs (B200, GB200). On H100, the runtime falls back to software emulation of MXFP4 or MXFP8, which reduces throughput compared to native Blackwell. See FP4 quantization on Blackwell for details.
¶ GLM-5.1 INT4 estimate of ~377GB is the theoretical minimum at 4 bits per parameter (754B × 0.5 bytes). AWQ/GPTQ INT4 checkpoints in practice may be slightly larger due to block scale metadata. FP8 at ~754GB assumes 1 byte per parameter, excluding KV cache and activation overhead. At 32K context length with moderate batch sizes, expect 10-15% additional VRAM for KV cache.
‡ DeepSeek V4 FP8 at ~500GB is the weight-only minimum; weights are reported by pre-release sources as approximately 1T × 0.5 bytes with sparse MoE routing compression. At runtime, add 15-20% for activations and framework buffers. Treat all V4 figures as provisional until official release. The Apache 2.0 license is also provisional based on pre-release sources; confirm at the official DeepSeek Hugging Face repository before use. See the GPU memory requirements guide and the GPU requirements cheat sheet 2026 for VRAM planning details.
All three models are MoE architectures. The active parameter count (the column that drives per-token compute) ranges from ~5.1B (GPT-OSS 120B) to ~40B (GLM-5.1). Total parameter count sets the memory floor. For a deep dive on why you pay for storage even when only a fraction of parameters are active per token, see the MoE inference optimization guide.
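The storage-vs-compute split can be sketched with back-of-envelope arithmetic. A minimal sketch, using approximate figures from the table above (the 2× FLOPs-per-active-parameter rule of thumb is a standard dense-equivalent estimate, not a published spec):

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float) -> dict:
    """Rough MoE sizing: total params set the memory floor;
    active params drive per-token compute (~2 FLOPs per active param)."""
    return {
        "weight_gb": total_params_b * bytes_per_param,    # storage floor
        "gflops_per_token": 2 * active_params_b,          # dense-equivalent compute
    }

# Illustrative figures from the comparison table:
print(moe_footprint(120.0, 5.1, 0.5))   # GPT-OSS 120B, MXFP4 ~0.5 bytes/param
# → {'weight_gb': 60.0, 'gflops_per_token': 10.2}
print(moe_footprint(754.0, 40.0, 1.0))  # GLM-5.1, FP8 ~1 byte/param
# → {'weight_gb': 754.0, 'gflops_per_token': 80.0}
```

This is why GPT-OSS 120B and GLM-5.1 can have comparable per-request latency despite a 12x gap in weight footprint: the memory floor scales with total parameters, per-token compute with active parameters.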
GPU Hardware Requirements and Quantization
GPT-OSS 120B
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~240GB | 3x H100 80GB | 4x H100 SXM5 |
| FP8 | ~120GB | 2x H100 80GB | 2x H100 SXM5 |
| MXFP4 | ~60GB | 1x H100 80GB | 1x H100 SXM5 |
MXFP4 is a 4-bit floating-point format defined by the Open Compute Project (OCP) industry standard. NVIDIA Blackwell GPUs (B200, GB200) support it natively; NVIDIA also has a proprietary variant called NVFP4. On H100, vLLM falls back to software emulation at reduced throughput; you still get the memory savings but not the hardware speedup. For production on H100, FP8 at 2x H100 may give better throughput per dollar than MXFP4 on 1x H100, depending on your batch size. Link to the full setup guide: deploy GPT-OSS on GPU cloud.
GLM-5.1
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~1,508GB | 19x H100 | Not recommended |
| FP8 | ~754GB | 8x H200 SXM5 (1,128GB) | 8x H200 SXM5 |
| AWQ INT4 | ~377GB | 5x A100 80GB (400GB) or 4x H200 (564GB) | 4x H200 or 5x A100 SXM4 |
GLM-5.1 FP8 is the practical production configuration. At 32K context length with 16 concurrent requests, expect roughly 100-140GB additional VRAM for KV cache on top of the ~754GB weight footprint, so 8x H200 (1,128GB total) is tight at high concurrency. AWQ INT4 on 5x A100 80GB is viable for lower-concurrency deployments where quality loss from aggressive quantization is acceptable. Full setup: deploy GLM-5.1 on GPU cloud.
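The KV cache overhead above can be reproduced with the standard sizing formula (two tensors per layer, K and V, per token). The GLM-5.1 layer count, KV head count, and head dimension below are hypothetical placeholders for illustration; the model's exact dimensions are not published here:

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int,
                bytes_per_elem: float = 1.0) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token.
    bytes_per_elem=1.0 assumes an FP8 KV cache."""
    elems = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# Hypothetical GLM-5.1-scale dimensions (illustrative only):
print(round(kv_cache_gb(batch=16, seq_len=32_768, n_layers=92,
                        n_kv_heads=8, head_dim=128), 1))  # → 98.8 (GB)
```

With placeholder dimensions in this range, 16 concurrent 32K-context requests land near the low end of the ~100-140GB estimate, which is why 8x H200 (1,128GB) gets tight at higher concurrency.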
DeepSeek V4
| Quantization | VRAM Required | Minimum GPUs | Recommended Spheron Instance |
|---|---|---|---|
| BF16 | ~2,000GB | 25x H100 | Not recommended on single node |
| FP8 | ~500GB | 4x H200 (564GB) or 8x H100 (640GB) | 8x H100 SXM5 |
| INT4 | ~250GB | 4x H100 80GB | 4x H100 SXM5 |
DeepSeek V4 with expert parallelism requires that the GPU cluster has high-bandwidth interconnect. On 8x H100 SXM5, NVLink handles the expert routing traffic efficiently. A cluster without NVLink (e.g., H100 PCIe) will bottleneck on inter-GPU communication during MoE routing and may not reach useful throughput even with enough aggregate VRAM. Full setup: deploy DeepSeek V4 on GPU cloud. See also the MoE inference optimization guide for expert parallelism configuration details.
Inference Benchmarks
Throughput (Tokens/Second)
Benchmark figures below are estimates based on published vLLM benchmarks and model architecture characteristics. Load: 50 concurrent requests with input/output sequence lengths of 512/256 tokens.
| Model | Hardware | Precision | Throughput (tokens/sec, aggregate) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | MXFP4 | ~600-800 |
| GPT-OSS 120B | 4x H100 SXM5 | FP8 | ~1,800-2,400 |
| GLM-5.1 | 8x H200 SXM5 | FP8 | ~800-1,200 |
| GLM-5.1 | 4x H200 SXM5 | INT4 | ~600-900 |
| DeepSeek V4 | 8x H100 SXM5 | FP8 | ~700-1,000 |
| DeepSeek V4 | 4x H200 SXM5 | FP8 | ~700-1,000 |
These throughput estimates could not be confirmed from official model cards or technical reports for the April 2026 versions. Treat them as directional and run your own vLLM benchmarking on your target hardware before capacity planning. Benchmark methodology and inference engine choice affect numbers significantly; see the vLLM vs TensorRT-LLM vs SGLang comparison for the impact on throughput.
Per-GPU throughput is where GPT-OSS 120B wins decisively. At ~600-800 tokens/sec on a single H100, it outperforms GLM-5.1 and DeepSeek V4 by roughly 5-8x on a normalized per-GPU basis, because those models spread computation across 4-8 GPUs to produce similar or lower aggregate throughput.
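As a starting point for your own measurement, here is a minimal stdlib-only load-test sketch against a vLLM OpenAI-compatible endpoint. The localhost URL, model ID, and prompt are assumptions to adapt to your deployment; dedicated tools like vLLM's own benchmark scripts will give more rigorous numbers:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def throughput_tok_s(total_completion_tokens: int, elapsed_s: float) -> float:
    """Aggregate decode throughput across all concurrent requests."""
    return total_completion_tokens / elapsed_s

def one_request(base_url: str, model: str) -> int:
    """Fire one completion; return completion tokens from the usage field."""
    body = json.dumps({"model": model,
                       "prompt": "Write a haiku about GPUs.",
                       "max_tokens": 256}).encode()
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def run_benchmark(base_url: str, model: str, concurrency: int = 50) -> float:
    """Send `concurrency` requests at once; return aggregate tokens/sec."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(lambda _: one_request(base_url, model),
                               range(concurrency)))
    return throughput_tok_s(sum(counts), time.time() - start)

# Example (requires a running vLLM server):
#   run_benchmark("http://localhost:8000", "openai/gpt-oss-120b")
```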
TTFT (Time to First Token)
| Model | Hardware | TTFT p50 (1 req) | TTFT p99 (10 req) |
|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 | ~80-120ms | ~200-400ms |
| GLM-5.1 | 8x H200 SXM5 | ~150-250ms | ~400-700ms |
| DeepSeek V4 | 8x H100 SXM5 | ~200-350ms | ~500-900ms |
TTFT figures are estimates; actual values depend heavily on prompt length, KV cache state, and batching configuration. For production latency tuning, see the vLLM vs TensorRT-LLM vs SGLang benchmarks post.
Throughput per Dollar
| Model | Config | $/hr | Est. Throughput (tok/sec) | Est. $/1M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |
Formula: ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens. You can substitute current pricing and your measured throughput to recalculate.
The Est. throughput column uses unverified estimates from the benchmarks section above. The resulting $/M tokens figures are also estimates; run your own throughput measurement before using these for budget planning.
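The formula as a reusable helper, checked against two rows of the table above (throughput inputs are the unverified estimates; substitute your own measurements):

```python
def dollars_per_million_tokens(price_per_hour: float,
                               tokens_per_sec: float) -> float:
    """($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens."""
    return (price_per_hour / 3600) / (tokens_per_sec / 1_000_000)

# Figures from the table above (estimated throughput):
print(round(dollars_per_million_tokens(4.41, 700), 2))    # → 1.75  (GPT-OSS, 1x H100)
print(round(dollars_per_million_tokens(36.33, 1000), 2))  # → 10.09 (GLM-5.1, 8x H200)
```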
Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Quality Benchmarks
Coding (SWE-bench, HumanEval)
| Model | SWE-bench Pro | HumanEval | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~74%¹ | ~85%² | OpenAI release notes, Aug 2025 |
| GLM-5.1 | 58.4% (94.6% of Claude Opus 4.6) | Not published³ | Z.ai/Zhipu official |
| DeepSeek V4 | Unverified⁴ | Unverified⁴ | Pre-release claims only |
¹ GPT-OSS 120B SWE-bench: this figure could not be confirmed from the official OpenAI model card (OpenAI does not publish SWE-bench Pro results directly). Treat it as an estimate and verify against your own eval before using it for production selection.
² GPT-OSS 120B HumanEval: OpenAI release notes for GPT-OSS (August 2025) reference strong coding performance but do not cite a specific HumanEval number in the published materials. Treat as unverified.
³ GLM-5.1 HumanEval: Z.ai has not published a standard HumanEval score. The widely cited "94.6% of Claude Opus 4.6" figure comes from Z.ai's SWE-bench Pro evaluation, a different benchmark; source data is Z.ai's April 2026 release notes.
⁴ DeepSeek V4 benchmark scores: as noted in the DeepSeek V4 deployment guide, V4 had not officially launched as of March 2026. Coding benchmark scores from pre-release materials are provisional and unverified.
General Reasoning (MMLU-Pro)
| Model | MMLU-Pro | Chatbot Arena Elo | Notes |
|---|---|---|---|
| GPT-OSS 120B | ~79%⁵ | N/A | OpenAI internal benchmarks |
| GLM-5.1 | ~82%⁶ | 1467 | Z.ai official, Chatbot Arena |
| DeepSeek V4 | Unverified | N/A | Pre-release only |
⁵ GPT-OSS 120B MMLU-Pro: OpenAI's release materials reference competitive MMLU performance but do not publish MMLU-Pro scores directly. Treat as unverified.
⁶ GLM-5.1 Chatbot Arena Elo 1467 is from the official Chatbot Arena leaderboard as of April 2026, sourced from the GLM-5.1 deployment guide. The Elo reflects API-served GLM-5.1; self-hosted FP8 quality may differ slightly from the API endpoint quality.
For all benchmark figures: these numbers are starting points. Run your own evaluation on your specific task distribution before making hardware procurement decisions.
Cost-Per-Token Analysis
On-Demand Pricing
| Model | GPU Config | $/hr (on-demand) | Est. Throughput (tok/s) | Est. $/M tokens |
|---|---|---|---|---|
| GPT-OSS 120B | 1x H100 SXM5 (MXFP4) | $4.41 | ~700 | ~$1.75 |
| GLM-5.1 | 8x H200 SXM5 (FP8) | $36.33 | ~1,000 | ~$10.09 |
| GLM-5.1 | 4x H200 SXM5 (INT4) | $18.16 | ~750 | ~$6.72 |
| DeepSeek V4 | 8x H100 SXM5 (FP8) | $35.28 | ~850 | ~$11.53 |
| DeepSeek V4 | 4x H200 SXM5 (FP8) | $18.16 | ~850 | ~$5.93 |
H200 SXM5 pricing ($4.54/hr per GPU, yielding $36.33 for 8x and $18.16 for 4x) is sourced from the H200 rental page. Confirm current rates at the pricing page before finalizing budget estimates.
The Est. $/M tokens figures use the unverified throughput estimates from the benchmarks section above. Substitute your measured throughput into the formula ($/hr / 3600) / (tokens/sec / 1,000,000) = $/M tokens for accurate cost modeling.
Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spot vs On-Demand
Spot instances can reduce costs 40-60% for GLM-5.1 deployments on H200 multi-GPU configurations, making them viable for batch inference and offline pipelines. The GLM-5.1 deployment guide documents spot pricing details for H200 configurations. For single-GPU GPT-OSS 120B, spot pricing makes an already-cheap deployment even more compelling for cost-sensitive workloads.
H100 SXM5 spot instances are not currently available on Spheron, so DeepSeek V4 deployments on 8x H100 SXM5 run on-demand only. Check the pricing page for up-to-date spot availability across GPU types before planning a spot-based pipeline for DeepSeek V4.
For workloads where uptime matters, on-demand is the safer choice. For nightly batch jobs, offline evaluation runs, or asynchronous inference queues, spot pricing on 8x H200 (GLM-5.1) can cut operating costs significantly.
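Under an assumed flat spot discount (50% here, the midpoint of the 40-60% range above), the $/M-token math works out as follows; real spot pricing varies with availability, so treat this as a planning sketch:

```python
def spot_cost_per_m(on_demand_hr: float, discount: float,
                    tokens_per_sec: float) -> float:
    """$/M tokens on spot, assuming a flat discount off the on-demand rate."""
    return (on_demand_hr * (1 - discount) / 3600) / (tokens_per_sec / 1_000_000)

# GLM-5.1 on 8x H200 FP8, assuming a 50% spot discount:
print(round(spot_cost_per_m(36.33, 0.50, 1000), 2))  # → 5.05, vs ~10.09 on-demand
```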
API Pricing Comparison
Self-hosting becomes cost-effective once your token volume is high enough to justify leaving the GPU on. For GPT-OSS 120B at $4.41/hr on-demand, the crossover against a $1-5/M token API depends on utilization. Running the GPU only when needed (spinning up per job), self-hosting beats most API tiers starting around 10M tokens/month. At continuous 24/7 operation, the break-even rises to hundreds of millions of tokens per month.
For GLM-5.1 and DeepSeek V4 at $18-36/hr, the same logic applies but the GPU cost is higher, so higher token volumes are required to break even. See the on-premise vs GPU cloud cost breakdown and the GPU cloud pricing comparison 2026 for the full break-even analysis.
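A minimal 24/7 break-even sketch, assuming an average 730-hour month and a hypothetical $3/M-token API rate (pick your actual API pricing):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def breakeven_tokens_per_month(gpu_hr: float, api_price_per_m: float) -> float:
    """Monthly token volume at which a 24/7 self-hosted GPU matches API spend."""
    monthly_gpu_cost = gpu_hr * HOURS_PER_MONTH
    return monthly_gpu_cost / api_price_per_m * 1_000_000

# GPT-OSS 120B on 1x H100 at $4.41/hr vs a hypothetical $3/M-token API:
print(f"{breakeven_tokens_per_month(4.41, 3.0) / 1e6:.0f}M tokens/month")
# → 1073M tokens/month
```

Running the GPU only when needed lowers this threshold dramatically, since you pay for hours actually used rather than the full month.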
Deployment Guide
Deploy GPT-OSS 120B on Spheron
```bash
# Single H100 SXM5 with MXFP4
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```
Note: MXFP4 hardware acceleration only applies on Blackwell GPUs (B200, GB200). On H100, vLLM uses software emulation for MXFP4 weight loading, which reduces throughput. If you need maximum throughput on H100, use FP8 with 2x H100 instead. See FP4 quantization on Blackwell GPUs for hardware requirements. Full deployment guide: deploy GPT-OSS on GPU cloud.
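Once the container is up, you can smoke-test the endpoint with a stdlib-only client. The /v1/chat/completions route is vLLM's standard OpenAI-compatible path; the localhost URL and 64-token cap are assumptions to adjust:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Request body for an OpenAI-compatible chat completion."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one chat completion and return the generated text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(f"{base_url}/v1/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (server from the command above must be running):
#   chat("http://localhost:8000", "openai/gpt-oss-120b", "Say hello in one word.")
```

The same client works against the GLM-5.1 and DeepSeek V4 servers below; only the model name changes.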
Deploy GLM-5.1 on Spheron (8x H200, FP8)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-1-fp8 \
  --served-model-name glm5-1-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --port 8000
```
Download weights first: `huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./glm5-1-fp8`. The zai-org/GLM-5.1-FP8 model ID is from Z.ai's official Hugging Face org as of April 2026; verify the ID hasn't changed at the Hugging Face repo before running. Full deployment guide: deploy GLM-5.1 on GPU cloud.
Deploy DeepSeek V4 on Spheron (8x H100, FP8)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 65536 \
  --port 8000
```
The model ID deepseek-ai/DeepSeek-V4 is provisional based on pre-release information. Confirm the final Hugging Face model ID at the DeepSeek Hugging Face page before using this command. Full deployment guide: deploy DeepSeek V4 on GPU cloud.
For users running vLLM Model Runner V2 (MRV2), see the vLLM MRV2 deployment guide for configuration differences. Model quick-start configs for all three models are also at docs.spheron.ai/quick-guides/llms/.
Decision Matrix
| Workload | Recommended Model | Why |
|---|---|---|
| Code generation, single-GPU budget | GPT-OSS 120B | MXFP4 on 1x H100, competitive coding benchmarks, Apache 2.0 |
| Highest quality coding, Chatbot Arena leader | GLM-5.1 | Elo 1467, 94.6% of Claude Opus 4.6 on SWE-bench |
| 1M token context, agentic reasoning | DeepSeek V4 | Largest context window, strongest long-context MoE |
| Cost-per-token optimization | GPT-OSS 120B | Single-GPU deployment, lowest $/hr by a wide margin |
| Batch inference, fine-tuned pipelines | DeepSeek V4 | Apache 2.0 (provisional), expert parallelism, 1M token context |
If you're unsure where to start, deploy GPT-OSS 120B first. It runs on a single H100, costs $4.41/hr on-demand, and gives you a working OpenAI-compatible endpoint without the multi-GPU orchestration complexity of GLM-5.1 or DeepSeek V4. Once you have a baseline and know your token volume and quality requirements, you can decide whether GLM-5.1's coding quality or DeepSeek V4's 1M context window justifies the 4-8x increase in GPU cost.
GPT-OSS 120B, GLM-5.1, and DeepSeek V4 are all available on Spheron with on-demand and spot pricing. Pick a single H100 for GPT-OSS 120B, scale up to 8x H200 for GLM-5.1, or run DeepSeek V4 on 8x H100, all with per-minute billing.
