
DeepSeek V3.2 vs Llama 4 vs Qwen 3: Best Open-Source LLM for Production 2026

Written by Mitrasish, Co-founder · Mar 30, 2026
Open Source LLM · DeepSeek V3.2 · Llama 4 · Qwen 3 · LLM Comparison · GPU Cloud · vLLM · AI Inference

Three open-source model families are production-ready as of March 2026: DeepSeek V3.2 Speciale, Llama 4 Scout and Maverick, and Qwen 3. Each targets different hardware budgets and use cases.

Picking the wrong one costs money. DeepSeek V3.2 Speciale needs 8x H100 GPUs at minimum. Qwen3-32B runs on a single H100. Llama 4 Scout sits between them in quality but has a context window no other model touches. This post breaks down which one to pick for your workload, with benchmark numbers, cost per million tokens, and hardware requirements for each configuration.

TL;DR

| Use Case | Winner | Cost/hr (Spheron) |
| --- | --- | --- |
| Code generation | Qwen3-32B | ~$2.40 (1x H100 SXM) |
| Conversational AI | Llama 4 Scout | ~$2.40 (1x H100 SXM) |
| RAG and long-document Q&A | Llama 4 Scout | ~$2.40 (1x H100 SXM) |
| Math and multi-step reasoning | DeepSeek V3.2 Speciale | ~$19.20 (8x H100 SXM) |

Model Overview

| Model | Total Params | Active Params | Architecture | Context Window | Min Weight Size (Quantized) | Min H100s | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek V3.2 Speciale | 685B | 37B | MoE | 128K | ~640 GB (FP8, optimized)† | 8x H100 80GB | MIT |
| Llama 4 Scout | 109B | 17B | MoE | 10M | ~55-65 GB (INT4) | 1x H100 80GB | Llama 4 Community |
| Llama 4 Maverick | 400B | 17B | MoE | 1M | ~200-243 GB (INT4)¶ | 4x H100 80GB | Llama 4 Community |
| Qwen3-32B | 32.8B | 32.8B | Dense | 131K (with YaRN scaling; native 32K) | ~33 GB | 1x H100 80GB | Apache 2.0 |
| Qwen3-235B-A22B | 235B | 22B | MoE | 128K | ~235 GB | 8x H100 80GB | Apache 2.0 |

† Optimized FP8 quantization compresses the weight footprint to approximately 640 GB (below the naive 685 GB you would expect at 1 byte per parameter), which just fits within the 640 GB aggregate VRAM of 8x H100 SXM5 80GB. At high KV cache usage or large batch sizes, you may still need expert offloading strategies. See the DeepSeek V3.2 Speciale deployment guide for details.

¶ Llama 4 Maverick INT4 size: the ~200 GB figure is the theoretical minimum at 4 bits per parameter (400B × 0.5 bytes). Published GGUF Q4 quantizations on Hugging Face are larger (~243 GB) due to metadata and block-level quantization overhead. 4x H100 80GB (320 GB aggregate) fits the ~200 GB AWQ/GPTQ INT4 size but is tight for GGUF Q4. If using GGUF Q4, plan for careful memory management or use 4x A100 80GB as a fallback.

The active parameter count is what drives inference cost on a per-token basis. Both Llama 4 variants activate just 17B parameters per forward pass despite having 109B or 400B total weights in memory. DeepSeek V3.2 Speciale activates 37B. Qwen3-32B is a dense model, so all 32.8B parameters are active on every token.
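A rough back-of-the-envelope heuristic (an illustrative assumption, not a published figure) is that decode compute scales with about 2 FLOPs per active parameter per token, which is why the MoE models are cheap to run relative to their total size:

```python
# Rough decode-compute estimate: ~2 FLOPs per active parameter per token.
# Heuristic for intuition only; real throughput depends on memory bandwidth,
# batching, and kernel efficiency.
MODELS = {
    "DeepSeek V3.2 Speciale": {"total_b": 685, "active_b": 37},
    "Llama 4 Scout": {"total_b": 109, "active_b": 17},
    "Llama 4 Maverick": {"total_b": 400, "active_b": 17},
    "Qwen3-32B": {"total_b": 32.8, "active_b": 32.8},
}

for name, p in MODELS.items():
    tflops_per_token = 2 * p["active_b"] * 1e9 / 1e12
    print(f"{name}: ~{tflops_per_token:.3g} TFLOPs/token "
          f"({p['active_b']}B of {p['total_b']}B params active)")
```

Note that Scout and Maverick land on identical per-token compute despite a nearly 4x difference in total weights; the difference shows up in VRAM, not in per-token cost.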

The critical distinction for hardware planning: all model weights must reside in VRAM regardless of how many parameters are active. See our GPU memory requirements guide and the 2026 GPU requirements cheat sheet for VRAM planning across quantization formats.

Benchmark Comparison

Benchmarks tell you different things. MMLU tests general knowledge breadth. HumanEval tests code generation accuracy. MT-Bench tests instruction following and conversational quality. Each maps to a different class of production workloads.

Quality Benchmarks (MMLU, HumanEval, MT-Bench)

| Model | MMLU | HumanEval | MT-Bench |
| --- | --- | --- | --- |
| DeepSeek V3.2 Speciale | 88.5† | 82.6†° | N/A |
| Llama 4 Scout | 79.6 | N/A¹ | N/A |
| Llama 4 Maverick | 85.5 | 82.4‡ | N/A |
| Qwen3-32B | 83.6 | 88.0§ | N/A |
| Qwen3-235B-A22B | N/A² | 90.3§ | N/A |

¹ Llama 4 Scout HumanEval: some sources cite 74.1%, but the official model card does not list HumanEval directly (it reports MBPP at 67.8). Independent evaluations show ~60-61%. Treat all figures as unverified and use your own eval set for production decisions.

² Qwen3-235B-A22B MMLU: sources conflict (87.1 vs 76.6). Verify against the official Qwen3 technical report before citing.

† DeepSeek V3.2 Speciale MMLU and HumanEval: these scores are reported for the DeepSeek V3 base model that Speciale builds on. Speciale-specific benchmark scores are not separately published for general benchmarks. See the DeepSeek V3.2 Speciale deployment guide for task-specific performance data.

° DeepSeek V3.2 Speciale HumanEval 82.6: this score is from HumanEval-Mul (the multilingual variant of HumanEval), not the standard single-language HumanEval benchmark. HumanEval-Mul covers Python, Java, TypeScript, and several other languages. The two benchmarks are not directly comparable.

‡ Llama 4 Maverick HumanEval: some official Meta sources report 86.4%, not 82.4%. The discrepancy likely reflects evaluation methodology differences (pass@1 vs pass@10). Verify against the official Llama 4 model card before citing.

§ Qwen3-32B HumanEval 88.0 and Qwen3-235B-A22B HumanEval 90.3: these scores could not be confirmed from the official Qwen3 technical report or model cards, which focus on EvalPlus and LiveCodeBench instead. Treat as unverified.

MT-Bench scores for all five models could not be confirmed from official model cards or technical reports. Treat the remaining figures as starting points and verify against primary sources or your own benchmarks before capacity planning.

One result stands out on HumanEval. Qwen3-32B reports 88.0§, higher than DeepSeek V3.2 Speciale at 82.6†° (HumanEval-Mul), despite being a much smaller model. These scores are unverified against official sources (see the benchmark footnotes), but if they hold on your eval set, Qwen3-32B for coding workloads is not a budget compromise; it is the top performer per dollar. Since MT-Bench figures are unavailable, use your own instruction-following eval for production selection decisions.

Inference Speed on H100 (vLLM, tokens/sec)

The cost formula: (cost/hr) / (throughput tokens/sec x 3600) x 1,000,000 = cost per 1M tokens.

| Model | Hardware | Throughput (tokens/sec) | Cost/hr (Spheron) | Cost per 1M tokens (est.) |
| --- | --- | --- | --- | --- |
| DeepSeek V3.2 Speciale | 8x H100 SXM5 (FP8) | ~400 aggregate | ~$19.20 | ~$13.33 |
| Llama 4 Scout | 1x H100 SXM5 (INT4) | ~800 | ~$2.40 | ~$0.83 |
| Llama 4 Maverick | 4x H100 SXM5 (INT4) | ~1,200 aggregate | ~$9.60 | ~$2.22 |
| Qwen3-32B | 1x H100 SXM5 (FP8) | ~850 | ~$2.40 | ~$0.78 |
| Qwen3-235B-A22B | 8x H100 SXM5 (FP8) | ~600 aggregate | ~$19.20 | ~$8.89 |
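The cost column follows directly from the formula above; a quick sanity check in Python (throughput and price figures are the table's estimates, not measured values):

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_sec: float) -> float:
    """(cost/hr) / (throughput tokens/sec x 3600) x 1,000,000."""
    return cost_per_hour / (tokens_per_sec * 3600) * 1_000_000

configs = [
    ("DeepSeek V3.2 Speciale", 19.20, 400),
    ("Llama 4 Scout", 2.40, 800),
    ("Llama 4 Maverick", 9.60, 1200),
    ("Qwen3-32B", 2.40, 850),
    ("Qwen3-235B-A22B", 19.20, 600),
]

for name, price, tps in configs:
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Running this reproduces the ~$13.33, ~$0.83, ~$2.22, ~$0.78, and ~$8.89 figures in the table.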

H100 SXM5 spot pricing is available at $0.80/hr, which reduces costs significantly for batch workloads or non-latency-sensitive inference. The table above shows on-demand pricing. Pricing fluctuates with GPU availability; the figures reflect rates as of 30 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Llama 4 Scout and Qwen3-32B are essentially tied on cost per million tokens at $0.78-$0.83. Both run on a single H100 SXM5 and deliver competitive throughput. The choice between them comes down to use-case fit, not cost.

DeepSeek V3.2 Speciale at $13.33 per 1M tokens is the most expensive option here. That cost is justified only when you need its specific strengths: math reasoning, multi-step logical inference, or tasks where it demonstrably outperforms the alternatives on your eval set.

Memory Requirements

The VRAM numbers in the model overview above are weights-only. At runtime, add 15-20% overhead for activations and framework buffers. For KV cache planning at specific context lengths and batch sizes, see our GPU memory requirements guide and GPU requirements cheat sheet 2026.
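A sketch of that VRAM math, assuming the 15-20% runtime overhead above and ignoring KV cache (which must be sized separately per context length and batch size):

```python
# Bytes per parameter at common quantization levels.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b: float, quant: str, overhead: float = 0.18) -> float:
    """Weights-only footprint plus runtime overhead; excludes KV cache."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb * (1 + overhead)

# Qwen3-32B in FP8 on a single 80 GB H100:
print(f"Qwen3-32B FP8: ~{estimate_vram_gb(32.8, 'fp8'):.0f} GB of 80 GB")
# Llama 4 Scout in INT4:
print(f"Llama 4 Scout INT4: ~{estimate_vram_gb(109, 'int4'):.0f} GB of 80 GB")
```

Both configurations land under 80 GB with headroom left for KV cache, which matches the single-H100 cutoff below.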

The practical cutoffs:

  • Single H100 80GB: Qwen3-32B (FP8) or Llama 4 Scout (INT4).
  • 4x H100 80GB: Llama 4 Maverick (INT4) or Qwen3-235B (INT4, tight).
  • 8x H100 80GB: Qwen3-235B (FP8). DeepSeek V3.2 Speciale (FP8, optimized) fits within the 640 GB aggregate VRAM, though high KV cache usage may still require offloading strategies.

License and Commercial Use

Qwen 3 (Apache 2.0) is the most permissive option. Apache 2.0 has no usage-scale restrictions, no restrictions on using model outputs, and no prohibition on commercial use at any scale. For teams that need maximum flexibility, including using fine-tuned weights as a starting point for other projects, Qwen 3 is the clear choice.

Llama 4 (Llama 4 Community License) permits commercial use with one notable threshold: if your application exceeds 700 million monthly active users, you need explicit written permission from Meta. For the vast majority of production deployments, this is not a constraint. The license also does not restrict using model outputs.

DeepSeek V3.2 (MIT License) is fully permissive with no use-based restrictions. Like Apache 2.0, MIT imposes no limitations on commercial use, output usage, or application domain. DeepSeek switched all V3 variants (including V3.2 Speciale) to MIT licensing starting March 2025.

Use-Case Recommendations

Code Generation

Winner: Qwen3-32B (Qwen3-235B-A22B if budget allows).

Qwen3-32B reports 88.0§ on HumanEval, behind only Qwen3-235B-A22B in this comparison, though neither score could be confirmed from official sources (see benchmark footnotes). Community evaluations and independent benchmarks consistently place it among the top small models for coding. For code completion, generation, and review, Qwen3-32B at ~$0.78 per 1M tokens on a single H100 SXM5 is the default choice.

Thinking mode adds on-demand chain-of-thought reasoning for harder problems, debugging sessions, and architecture questions, without changing the deployment configuration. See the Qwen 3 deployment guide for setup details.
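One way to toggle thinking mode per request is via the chat template, without restarting the server. This is a sketch assuming vLLM's OpenAI-compatible endpoint and Qwen3's `enable_thinking` chat-template flag; verify both names against the current Qwen 3 and vLLM documentation for your versions:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# "chat_template_kwargs" / "enable_thinking" are Qwen3-specific assumptions;
# POST this payload to the running server to use it.
payload = {
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Why does this loop deadlock?"}],
    "chat_template_kwargs": {"enable_thinking": True},  # False for fast replies
    "max_tokens": 2048,
}
print(json.dumps(payload, indent=2))
```

Disabling thinking for routine completions and enabling it for debugging or architecture questions lets one deployment serve both traffic profiles.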

Conversational AI and Instruction Following

Winner: Llama 4 Scout for single-GPU cost efficiency. Use Llama 4 Maverick if you need higher quality.

Scout performs well for a single-GPU model on instruction following. Its 10M token context window means you can load entire conversation histories, large documents, or extended tool-use traces without hitting context limits. For most conversational applications, the context window advantage is the defining factor at this price point.

If your application needs Maverick-level quality and 1M token context, budget 4x H100 SXM5 at $9.60/hr. See the Llama 4 deployment guide for hardware and vLLM configuration.

RAG and Document Q&A

Winner: Llama 4 Scout.

The 10M token context window changes what RAG looks like. Instead of chunking documents, retrieving fragments, and stitching them back together, you can load entire document sets directly into the context. For long-document Q&A, contract analysis, or codebases with many interdependencies, this eliminates a significant layer of retrieval complexity.

Qwen3-32B (131K with YaRN scaling) and DeepSeek V3.2 Speciale (128K) have context windows far below Scout's 10M. A 128K window handles most RAG applications, but if your documents run longer or you need to process a large corpus in a single pass, Scout's 10M token window is not available anywhere else in this comparison at this price.

For RAG serving architecture and production configuration, see the vLLM production deployment guide. For the model deployment itself, see the Llama 4 GPU deployment guide.

Reasoning and Math

Winner: DeepSeek V3.2 Speciale.

V3.2 Speciale leads on MMLU (88.5†) and has demonstrated strong mathematical reasoning performance. Press coverage has cited IMO gold-medal level results in 2025, though the precise attribution of this claim to V3.2 Speciale specifically (vs. other DeepSeek variants) is not fully confirmed in primary sources. See the DeepSeek V3.2 Speciale deployment guide for task-specific benchmark details. For math-heavy workloads, scientific reasoning, multi-step logical inference, and tasks where chain-of-thought depth matters, it is the best option here.

The cost is 8x H100 at ~$19.20/hr. Qwen3-235B-A22B is a viable alternative at similar cost if MIT or Apache 2.0 licensing is a priority, and reported HumanEval scores§ suggest it may be the better pick if your reasoning tasks involve code.

Fine-Tuning Ecosystem

DeepSeek V3.2 Speciale supports LoRA and full fine-tuning via Hugging Face transformers. Given the model size (685B parameters), fine-tuning requires a multi-node setup and careful memory management. LoRA is the practical approach for most teams. See our LLM fine-tuning guide for setup details, and the Axolotl vs Unsloth vs torchtune comparison for framework selection.

Llama 4 has strong ecosystem support. Unsloth covers Scout efficiently given its manageable size. For Maverick, torchtune is the better fit given its multi-GPU requirements. Both are covered in our Axolotl vs Unsloth vs torchtune post.

Qwen 3's Apache 2.0 license means no fine-tuning restrictions of any kind. Unsloth supports Qwen3-8B and Qwen3-32B with LoRA out of the box. Standard PEFT/LoRA works with all dense variants. See the LLM fine-tuning guide for configuration details.
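To see why LoRA is the practical path at these scales, compare trainable-parameter counts: a rank-r adapter on a d_out x d_in weight matrix trains only r*(d_in + d_out) parameters. The layer shapes below are illustrative assumptions, not any model's actual architecture:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)

# Hypothetical: adapt 4 projection matrices (7168 x 7168) in 60 layers at rank 16.
hidden, layers, rank, matrices = 7168, 60, 16, 4
trainable = lora_trainable_params(hidden, hidden, rank) * matrices * layers
print(f"~{trainable / 1e6:.0f}M trainable params vs full fine-tuning all weights")
```

Tens of millions of trainable parameters versus hundreds of billions is the difference between a single-node job and a multi-node cluster with optimizer state sharding.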

How to Deploy on Spheron

The fastest path to running any of these models is a bare-metal GPU instance on Spheron. Here is the setup for Qwen3-32B, the recommended starting point for most teams:

```bash
# 1. Provision a 1x H100 instance at app.spheron.ai
# 2. Install vLLM
pip install vllm --upgrade

# 3. Start the inference server
vllm serve Qwen/Qwen3-32B \
  --quantization fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --port 8000
```

For DeepSeek V3.2 Speciale (8x H100 required), see the full deployment guide. For Llama 4 Scout and Maverick, see the Llama 4 deployment guide.
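Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch using only the standard library (the host and port match the `vllm serve` command above; adjust if you changed them):

```python
import json
import urllib.request
from urllib.error import URLError

BASE_URL = "http://localhost:8000/v1/chat/completions"

# Minimal chat-completion request against the local vLLM server.
payload = {
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Reverse a linked list in Python."}],
    "max_tokens": 512,
}
req = urllib.request.Request(
    BASE_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except URLError as err:
    print(f"Server not reachable: {err}")  # start vllm serve first
```

Any OpenAI-compatible SDK works the same way by pointing its base URL at the instance.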

Final Recommendation

For most teams starting in 2026, Qwen3-32B is the right default. It leads on code generation, runs on a single H100 SXM5 at $2.40/hr, carries an Apache 2.0 license with no restrictions, and delivers competitive quality on instruction following and general tasks. If you need a 10M token context window for RAG or long-document applications, Llama 4 Scout is the only option at this price point. For math, scientific reasoning, and multi-step inference where raw benchmark quality matters most, DeepSeek V3.2 Speciale is the leader, at 8x the hardware cost (~$19.20/hr for 8x H100 SXM5).


DeepSeek V3.2, Llama 4, and Qwen 3 are all available on Spheron with no waitlist. Provision a single H100 for Qwen3-32B or scale to 8x H100 for the full DeepSeek V3.2 Speciale experience.

Rent H100 → | Rent H200 → | View all pricing →

Get started on Spheron →
