Qwen3-Coder-Next is 80B parameters with only 3B active per forward pass. That ratio matters for the GPU budget: storage cost is determined by total parameters (80B must live in VRAM), but inference speed is determined by active parameters (closer to a 3B model). For teams that have been watching the coding model space, this is the clearest path to frontier-scale code quality on a single high-VRAM GPU.
The per-seat SaaS math is the other thing to weigh. A 10-person team on Cursor Team pays $400/month. A single H200 SXM5 running Qwen3-Coder-Next at FP8 can serve that whole team continuously. On raw seat price the GPU usually costs more (the cost section below works through the numbers), so the case for self-hosting rests on removing usage caps, keeping code inside your network, and controlling which models you run.
This guide covers VRAM sizing, vLLM deployment on Spheron, expert parallelism tuning, IDE integration, and the full cost breakdown.
What Qwen3-Coder-Next Is
Qwen3-Coder-Next is Alibaba's coding-specialized MoE model in the Qwen3 family. It targets the same coding task categories as Devstral and Qwen2.5-Coder but at a much larger total parameter scale, while keeping active parameters low for practical inference.
| Property | Value |
|---|---|
| Architecture | 80B MoE, 3B active parameters |
| Context window | 256K tokens |
| License | Apache 2.0 |
| SWE-Bench Verified | 70.6% |
| Tool use | OpenAI-compatible via vLLM |
| HuggingFace | Qwen/Qwen3-Coder-Next |
The key distinction from the Qwen3.6 general-purpose models is coding specialization. Both Coder-Next and Qwen3-30B-A3B activate 3B parameters per token, but Coder-Next packs 512 experts vs. 128 in Qwen3-30B-A3B, roughly 4x more expert diversity within the larger 80B parameter pool.
For context on the Qwen 3.6 Plus general-purpose model, see the Qwen 3.6 Plus deployment guide.
VRAM Requirements and GPU Sizing
The most common mistake with MoE models is confusing active parameters with storage requirements. Qwen3-Coder-Next activates 3B parameters per forward pass, but all 80B parameters must live in VRAM. The "3B active" figure tells you about inference speed, not memory footprint.
| Precision | Weight footprint | Runtime (weights + ~15% overhead) | Min VRAM |
|---|---|---|---|
| FP16 | ~160 GB | ~184 GB | 2x H200 SXM5 (minimum) |
| FP8 | ~80 GB | ~92 GB | H200 SXM5 141 GB (single GPU, fits) |
| AWQ INT4 | ~40 GB | ~46 GB | A100 80GB (fits, with context limits) |
The 15% overhead figure accounts for activations, CUDA context, and framework buffers. KV cache is on top of that and depends on max-model-len and concurrent sequences.
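The arithmetic behind the sizing table can be reproduced with a short sketch. It uses the same simplifications as the table: one billion parameters at one byte each is treated as 1 GB, and runtime overhead is a flat 15%. The function name and the bytes-per-parameter map are illustrative and not part of any library.

```python
# Rough VRAM sizing for a model whose full parameter set must stay resident.
# Assumptions (matching the table above): 1B params at 1 byte ~= 1 GB,
# and a flat 15% runtime overhead. KV cache is NOT included.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def vram_estimate_gb(total_params_b: float, precision: str, overhead: float = 0.15):
    """Return (weight_gb, runtime_gb) for a given precision."""
    weight_gb = total_params_b * BYTES_PER_PARAM[precision]
    runtime_gb = weight_gb * (1.0 + overhead)
    return weight_gb, runtime_gb


if __name__ == "__main__":
    # Use total parameters (80B), not active parameters (3B).
    for precision in ("fp16", "fp8", "int4"):
        weights, runtime = vram_estimate_gb(80, precision)
        print(f"{precision}: weights ~{weights:.0f} GB, runtime ~{runtime:.0f} GB")
```

Passing 3 instead of 80 is the mistake described above: it yields a footprint that looks like it fits a consumer GPU, but it ignores the experts that are not active for the current token yet still have to be resident.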
Here are the recommended GPU configurations on Spheron with current pricing:
| GPU | VRAM | Precision | On-Demand | Spot | Notes |
|---|---|---|---|---|---|
| H200 SXM5 | 141 GB | FP8 | $4.82/hr | $3.31/hr | Single-GPU production sweet spot |
| 2x H100 SXM5 | 160 GB total | FP8 | $4.06/hr per GPU ($8.12/hr) | $2.91/hr per GPU ($5.82/hr) | FP8 fits across two GPUs with 68 GB for KV cache |
| H100 SXM5 | 80 GB | FP8 | $4.06/hr | $2.91/hr | FP8 does NOT fit (92 GB > 80 GB) |
| H100 SXM5 | 80 GB | AWQ INT4 | $4.06/hr | $2.91/hr | INT4 fits (46 GB), limited context |
| A100 80G | 80 GB | AWQ INT4 | $1.69/hr | $0.80/hr | Budget option, ~34 GB headroom after the INT4 runtime footprint |
Important on H100: A single H100 SXM5 cannot run FP8 for Qwen3-Coder-Next. The FP8 runtime footprint is approximately 92 GB, which exceeds the 80 GB VRAM ceiling. H100 is only viable with AWQ INT4, which keeps the runtime footprint to around 46 GB.
RTX 5090 note: 32 GB VRAM is insufficient even for AWQ INT4 (which requires roughly 46 GB). The RTX 5090 works well for smaller Qwen3-Coder variants but not for the 80B model.
For the single-GPU production case, H200 SXM5 rental on Spheron at FP8 is the recommended path. The 141 GB VRAM leaves approximately 49 GB for KV cache after the ~92 GB FP8 runtime footprint, which supports 32K context with multiple concurrent sessions.
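As a quick sanity check on the fit and headroom figures in this section, the sketch below subtracts the approximate runtime footprint from each configuration's total VRAM. The footprints are the rounded values from the sizing table; the dictionary and function names are illustrative only.

```python
# KV cache headroom = total VRAM - runtime footprint (weights + ~15% overhead).
# Footprints are the approximate figures from the sizing table above.

RUNTIME_GB = {"fp8": 92, "awq_int4": 46}

# (label, total VRAM in GB, precision) for the configurations in this guide
CONFIGS = [
    ("H200 SXM5", 141, "fp8"),
    ("2x H100 SXM5", 160, "fp8"),
    ("H100 SXM5", 80, "fp8"),
    ("A100 80G", 80, "awq_int4"),
    ("RTX 5090", 32, "awq_int4"),
]


def kv_headroom_gb(vram_gb: int, precision: str) -> int:
    """A negative result means the model does not fit at all."""
    return vram_gb - RUNTIME_GB[precision]


if __name__ == "__main__":
    for label, vram, precision in CONFIGS:
        headroom = kv_headroom_gb(vram, precision)
        verdict = f"{headroom} GB for KV cache" if headroom > 0 else "does not fit"
        print(f"{label} @ {precision}: {verdict}")
```

These are upper bounds. A --gpu-memory-utilization value below 1.0 reduces the memory vLLM will claim, and on a two-GPU setup the headroom is split across both devices rather than available as one pool.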
For a general VRAM sizing reference across model families, see the GPU memory requirements for LLMs guide.
Pricing fluctuates based on GPU availability. The prices above are based on 19 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
Deploy Qwen3-Coder-Next with vLLM on Spheron
Step 1: Choose GPU and quantization
Pick based on budget and team size:
- H200 SXM5 + FP8: single-GPU production default. Best quality, highest single-instance cost.
- 2x H100 SXM5 + FP8: more KV cache headroom than a single H200 at comparable FP8 quality. Good for teams that regularly send long-context requests.
- A100 80G + AWQ INT4: budget option. Fits the AWQ INT4 checkpoint with 34 GB left for KV cache. Restrict --max-model-len to 16K or lower to prevent OOM.
- H100 SXM5 + AWQ INT4: same profile as A100 at a higher cost per hour. Pick H100 if you need the higher FP8 Tensor Core throughput for other models on the same instance.
Step 2: Provision a Spheron GPU instance
Log in at app.spheron.ai and navigate to GPU Cloud. Select the GPU tier matching your choice above and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 200 GB persistent storage for model weights. Spot instances are available on H200 and H100 for roughly 28-31% savings over on-demand rates at the prices listed above; for a coding assistant where brief interruptions are tolerable (IDE plugins retry automatically), spot is a reasonable choice.
See docs.spheron.ai for step-by-step instance provisioning.
Step 3: Install vLLM and download weights
pip install 'vllm>=0.9.0'
pip install huggingface_hub hf_transfer
export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1
# Verify the exact repo slug at huggingface.co/Qwen before running
huggingface-cli download Qwen/Qwen3-Coder-Next
If vLLM does not natively recognize the model class, add --trust-remote-code to the serve command as a fallback. For models released after the last stable vLLM build, check docs.vllm.ai for supported model status.
Step 4: Launch vLLM with expert parallelism
H200 SXM5, FP8 (single GPU, production):
vllm serve Qwen/Qwen3-Coder-Next \
--quantization fp8 \
--max-model-len 32768 \
--port 8000 \
--served-model-name qwen3-coder-next \
--enable-expert-parallel \
--gpu-memory-utilization 0.92
2x H100 SXM5, FP8 (tensor parallel, more KV headroom):
vllm serve Qwen/Qwen3-Coder-Next \
--quantization fp8 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--max-model-len 65536 \
--port 8000 \
--served-model-name qwen3-coder-next \
--gpu-memory-utilization 0.90
A100 80G, AWQ INT4 (budget):
vllm serve Qwen/Qwen3-Coder-Next-AWQ \
--quantization awq \
--max-model-len 16384 \
--port 8000 \
--served-model-name qwen3-coder-next \
--gpu-memory-utilization 0.92
Note on the AWQ path: if Qwen/Qwen3-Coder-Next-AWQ does not yet exist on HuggingFace (quantized checkpoints often appear after the base model release), check the Qwen organization page at huggingface.co/Qwen for the correct slug. Do not attempt a download that will 404.
Why --enable-expert-parallel: Qwen3-Coder-Next is an 80B MoE model with 512 experts and only 3B active per token. Expert parallelism places whole experts on different GPUs and dispatches each token to the GPUs that hold its selected experts, rather than sharding every weight matrix across all GPUs uniformly. For long code generation outputs, this can improve throughput compared to tensor parallelism alone, though the gain depends on interconnect bandwidth and batch size. The flag is a no-op on single-GPU setups but harmless to include.
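To make the idea of expert-level dispatch concrete, here is a toy sketch. It is not how vLLM implements expert parallelism. The expert count comes from this article, while the number of experts selected per token (TOP_K), the contiguous placement scheme, and the random router are hypothetical stand-ins chosen purely for illustration.

```python
# Toy illustration of expert parallelism: whole experts live on specific GPUs,
# and each token is dispatched to the GPUs that hold its selected experts.
# NUM_EXPERTS comes from the article; TOP_K and the contiguous placement
# scheme are hypothetical values chosen for illustration only.
import random
from collections import Counter

NUM_EXPERTS = 512
NUM_GPUS = 2
TOP_K = 8  # hypothetical: experts selected per token


def gpu_for_expert(expert_id: int, num_experts: int = NUM_EXPERTS, num_gpus: int = NUM_GPUS) -> int:
    """Contiguous placement: experts are split into equal blocks, one block per GPU."""
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu


def dispatch_counts(num_tokens: int, seed: int = 0) -> Counter:
    """Count (token, expert) assignments landing on each GPU under random routing."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_tokens):
        for expert_id in rng.sample(range(NUM_EXPERTS), TOP_K):
            counts[gpu_for_expert(expert_id)] += 1
    return counts


if __name__ == "__main__":
    print(dispatch_counts(1000))
```

A real router is learned and rarely this uniform. When a few experts are selected far more often than others, the GPU holding them becomes the bottleneck, which is one reason to benchmark before committing to a parallelism layout.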
For deeper MoE tuning, see the MoE inference optimization guide.
Step 5: Test the endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-coder-next",
"messages": [
{
"role": "user",
"content": "Write a Python function that validates an email address using regex and returns both the result and the matched groups."
}
],
"max_tokens": 512,
"temperature": 0.2
}'
A correct response returns a choices[0].message.content field with the code output. To verify function calling support, include a tools array in the request. Check the official model card to confirm whether function calling is supported before wiring up agentic pipelines.
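The same request can be sent from Python using only the standard library, which is handy for scripted smoke tests. The sketch below mirrors the curl payload and optionally attaches a tools array in the OpenAI format. The tool itself (run_tests) is a hypothetical example, and the base URL assumes vLLM is listening locally on port 8000.

```python
# Build the same chat completion request in Python (standard library only).
# The endpoint URL and served model name match the curl example; the tool
# definition is a hypothetical example in the OpenAI "tools" format.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM running locally on port 8000


def build_payload(prompt: str, with_tools: bool = False) -> dict:
    payload = {
        "model": "qwen3-coder-next",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.2,
    }
    if with_tools:
        payload["tools"] = [{
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical tool name
                "description": "Run the project's test suite and return the results.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }]
    return payload


def chat(prompt: str, with_tools: bool = False, api_key: str = "") -> dict:
    """POST the payload and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, with_tools)).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Calling chat("Write a Python function that reverses a string.") and reading reply["choices"][0]["message"]["content"] reproduces the curl check. For the tools variant, vLLM generally needs tool calling enabled at launch, so consult the vLLM documentation for the flags and parser that apply to this model before treating an empty tool_calls field as a model limitation.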
Step 6: Cloud-init startup script
For reproducible instance bootstrapping, use this startup script in your Spheron deployment configuration:
#!/bin/bash
set -e
echo "--- Setting Up Environment ---"
sudo apt-get update -y
sudo apt-get install -y python3-venv
sudo python3 -m venv /opt/qwen_coder_venv
source /opt/qwen_coder_venv/bin/activate
pip install --upgrade pip
pip install 'vllm>=0.9.0' huggingface_hub hf_transfer
echo "--- Launching vLLM Server ---"
export HF_TOKEN=your_hf_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1
nohup vllm serve Qwen/Qwen3-Coder-Next \
--quantization fp8 \
--max-model-len 32768 \
--port 8000 \
--served-model-name qwen3-coder-next \
--enable-expert-parallel \
--gpu-memory-utilization 0.92 \
--host 0.0.0.0 \
--api-key "your_api_key_here" > /var/log/vllm.log 2>&1 &
echo "--- Waiting for server to initialize (ETA 10-20 minutes for weight download) ---"
for i in {1..3600}; do
if curl -sf "http://localhost:8000/v1/models" > /dev/null; then
echo "vLLM server is ready!"
break
fi
if [ $((i % 15)) -eq 0 ]; then
echo "Still waiting... ($i seconds elapsed)"
fi
sleep 1
done
curl -sf http://localhost:8000/health > /dev/null || { echo 'ERROR: vLLM failed to start'; exit 1; }
Quantization Options for Smaller GPUs
For teams that cannot justify an H200 or 2x H100, AWQ INT4 is the practical fallback.
AWQ INT4 compresses Qwen3-Coder-Next's 80B weights from ~80 GB (FP8) to roughly 40 GB, with a runtime footprint around 46 GB including overhead. That fits on an A100 80G (80 GB VRAM) with about 34 GB left for KV cache. Restricting max-model-len to 16K or lower is important here because KV cache memory grows linearly with context length and with the number of concurrent sequences, so long contexts exhaust the 34 GB quickly.
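To put the context-length tradeoff in numbers, the sketch below applies the standard KV cache formula for a conventional attention stack. Every architecture value in it (layer count, KV heads, head dimension) is a hypothetical placeholder and not the published Qwen3-Coder-Next configuration, so treat the output as a shape of the curve rather than a sizing result.

```python
# Back-of-envelope KV cache sizing for a standard attention stack.
# All architecture numbers below are HYPOTHETICAL placeholders, not the
# published Qwen3-Coder-Next configuration; read the real values from the
# model's config.json before relying on the result.

def kv_cache_gb(
    tokens: int,
    num_layers: int = 48,      # hypothetical
    num_kv_heads: int = 8,     # hypothetical
    head_dim: int = 128,       # hypothetical
    bytes_per_value: int = 2,  # FP16/BF16 cache
) -> float:
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * tokens / 1e9


def max_cached_tokens(headroom_gb: float, **arch) -> int:
    """Total tokens (summed over all concurrent sequences) that fit in the headroom."""
    per_token_gb = kv_cache_gb(1, **arch)
    return int(headroom_gb / per_token_gb)


if __name__ == "__main__":
    for context in (8_192, 16_384, 32_768):
        print(f"{context} tokens -> ~{kv_cache_gb(context):.2f} GB per sequence")
    print("tokens that fit in 34 GB:", max_cached_tokens(34))
```

The token budget is shared: four concurrent sessions at 16K consume the same cache as one session at 64K. Models that mix in linear-attention or sliding-window layers cache less than this formula suggests, so the placeholder values here likely overstate the requirement for such architectures.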
For AWQ, use official Qwen-provided quantized checkpoints from huggingface.co/Qwen rather than community quantizations. If the official AWQ checkpoint is not yet available, wait rather than using an unofficial one. Community quantizations often target different calibration sets and can degrade code quality in non-obvious ways.
GGUF variants enable CPU+GPU hybrid offloading via llama.cpp. For a small team running occasional code generation at low concurrency, a large GGUF model on a workstation with 48+ GB system RAM plus a consumer GPU is technically viable, but expect higher latency. This is not a production pattern.
For a broader review of quantization tradeoffs, see the MXFP4 quantization guide and the LLM serving optimization overview.
Wiring into IDEs and Agentic Harnesses
Continue
Add to ~/.continue/config.json:
{
"models": [
{
"title": "Qwen3-Coder-Next (self-hosted)",
"provider": "openai",
"model": "qwen3-coder-next",
"apiBase": "http://<your-spheron-ip>:8000/v1",
"apiKey": "your_api_key_here"
}
]
}
Set apiKey to match the --api-key value from your startup script.
Cline
In VS Code, open Cline's Settings panel (gear icon in the sidebar). Set:
- API Provider: OpenAI Compatible
- Base URL: http://<your-spheron-ip>:8000/v1
- Model ID: qwen3-coder-next
- API Key: your_api_key_here
Cline's agentic workflows drive file reads, terminal commands, and diff application through the tool-call interface if function calling is enabled in the model.
Aider
aider \
--openai-api-base http://<your-spheron-ip>:8000/v1 \
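--model openai/qwen3-coder-next \
--openai-api-key your_api_key_here
Aider routes requests through LiteLLM, so the model name takes the openai/ prefix to select the OpenAI-compatible provider; the name sent to vLLM is still qwen3-coder-next. Aider's --architect mode pairs well with long-context models. Increase --max-model-len on your vLLM instance if Aider is sending full repo reads.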
For the full Tabby stack with user authentication, SSH tunneling, and multi-user API key management, see the self-hosted AI coding assistant guide.
Throughput, Cost Per Million Tokens, and Quality vs. Competitors
Throughput figures for Qwen3-Coder-Next are not yet independently benchmarked at time of writing. The estimates below are derived from comparable 80B MoE architectures with similar active parameter counts. Benchmark on your Spheron instance before making capacity decisions.
| GPU Config | Context | Throughput (tok/s, est.) | TTFT (ms, est.) | Notes |
|---|---|---|---|---|
| H200 SXM5, FP8 | 8K | ~600-900 | ~150-300 | MoE routing with 3B active is fast |
| H200 SXM5, FP8 | 32K | ~350-550 | ~500-900 | KV cache growth increases latency |
| 2x H100 SXM5, FP8 | 8K | ~500-750 | ~180-350 | NVLink bandwidth affects MoE dispatch |
| A100 80G, AWQ INT4 | 8K | ~300-500 | ~200-400 | Adequate for small team use |
Cost per million tokens, computed from live pricing at $4.82/hr for H200 FP8:
- At 750 tok/s throughput: ($4.82 / 3600) / (750 / 1,000,000) = approximately $1.79/M tokens
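The same calculation as a small function, so it can be rerun with measured throughput or a different hourly rate. It assumes the GPU is saturated for the full hour; idle time raises the effective cost per token. The function name is illustrative.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """USD per 1M generated tokens at sustained throughput and 100% utilization."""
    usd_per_second = hourly_usd / 3600
    return usd_per_second / tokens_per_second * 1_000_000


if __name__ == "__main__":
    # Sweep the estimated H200 FP8 throughput range at the on-demand rate.
    for tok_s in (350, 550, 750, 900):
        print(f"{tok_s} tok/s -> ${cost_per_million_tokens(4.82, tok_s):.2f}/M tokens")
```

Because the throughput figures in the table are estimates, the cost per million tokens is best read as a range across that sweep rather than a single number.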
For reference, the Anthropic Claude API and OpenAI GPT-4o run $2-10+ per million tokens depending on tier and volume. A self-hosted Qwen3-Coder-Next at H200 FP8 pricing can come in under $2/M tokens at reasonable throughput.
Comparison across coding model deployment options:
| Model | Min GPU | On-Demand $/hr | Est. $/M tokens | Context |
|---|---|---|---|---|
| Qwen3-Coder-Next 80B MoE | H200 FP8 | $4.82/hr | ~$1.79 (est.) | 256K |
| Devstral 24B | H100 FP8 | $4.06/hr | ~$0.90 (est.) | 128K |
| Kimi K2.7 Code 1T MoE | 8x H200 FP8 | $4.82/hr per GPU (~$38.56/hr for 8x) | ~$5-8 (est.) | 256K |
| Qwen2.5-Coder 32B | A100 BF16 | $1.69/hr | ~$0.60 (est.) | 128K |
All throughput-derived $/M token figures are estimated. Verify against your deployment configuration.
For the Devstral deployment and L40S cost breakdown, see deploying Devstral on GPU cloud.
For Kimi K2.7 Code, Moonshot's 1T coding-first MoE, see the Kimi K2.7 Code deployment guide.
Cost Per Developer Per Month on Spheron
The key question is at what team size self-hosting becomes cheaper than per-seat SaaS.
| GPU | Precision | On-Demand | Spot | Devs served | $/dev/month (60% util) |
|---|---|---|---|---|---|
| A100 80G | AWQ INT4 | $1.69/hr | $0.80/hr | 3-5 | ~$69 (spot) |
| H200 SXM5 | FP8 | $4.82/hr | $3.31/hr | 10-15 | ~$139 (on-demand), ~$95 (spot) |
| 2x H100 SXM5 | FP8 | $4.06/hr per GPU ($8.12/hr) | $2.91/hr per GPU ($5.82/hr) | 12-18 | ~$195 (on-demand), ~$140 (spot) |
Pricing fluctuates based on GPU availability. The prices above are based on 19 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
At 60% utilization (realistic for a daytime coding schedule), an H200 SXM5 on spot pricing at $3.31/hr costs approximately $1,430/month. Across 15 developers, that is $95/developer/month, which is higher than Cursor Team at $40/seat. The comparison shifts when you factor in that self-hosting eliminates usage caps, keeps code from leaving your network, and lets the same instance serve other models.
The 2x H100 FP8 on-demand configuration at $8.12/hr (2 × $4.06/hr per GPU) costs approximately $3,510/month at 60% utilization. For 18 developers that is about $195/developer/month. On spot ($5.82/hr for both GPUs) it drops to approximately $2,514/month total, or about $140/developer/month. That is still well above the $40/seat Cursor Team price cited above, so this configuration makes sense mainly for teams with strict data sovereignty requirements, where control over the endpoint is worth the premium.
The A100 AWQ INT4 at spot pricing works for small teams where privacy and control matter more than raw seat cost: $0.80/hr spot, 60% utilization, 4 developers, comes to roughly $346/month total, or about $86/developer/month. You own the endpoint, have no usage caps, and can swap models without changing plans.
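The per-developer figures in this section follow from a simple formula, sketched below so you can plug in your own rates and team size. It assumes a 720-hour month, which is the value that reproduces the monthly totals quoted here; the function names are illustrative.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30-day month, which reproduces the figures above


def monthly_cost(hourly_usd: float, utilization: float = 0.60) -> float:
    """GPU bill for one month at the given utilization."""
    return hourly_usd * HOURS_PER_MONTH * utilization


def cost_per_developer(hourly_usd: float, developers: int, utilization: float = 0.60) -> float:
    return monthly_cost(hourly_usd, utilization) / developers


def breakeven_team_size(hourly_usd: float, seat_price_usd: float, utilization: float = 0.60) -> int:
    """Smallest team for which the GPU bill per developer is at or below the seat price."""
    return math.ceil(monthly_cost(hourly_usd, utilization) / seat_price_usd)


if __name__ == "__main__":
    print(f"H200 spot, monthly: ${monthly_cost(3.31):,.0f}")
    print(f"H200 spot, 15 devs: ${cost_per_developer(3.31, 15):.0f}/dev")
    print(f"Breakeven vs $40/seat: {breakeven_team_size(3.31, 40)} developers")
```

Under these assumptions the H200 spot bill only reaches $40 per developer at around 36 developers, which is beyond the 10-15 developers the table estimates a single H200 can serve. That is consistent with the point above: the justification for self-hosting at these team sizes is control, not seat price.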
Production Checklist
Before routing real developer traffic:
- Enable streaming. IDE plugins expect stream: true responses. Without it, the plugin waits for the full completion before showing any output, making short autocompletes feel broken.
- Set a request cap. Use --max-num-seqs 32 or lower to cap concurrent sequences. GPU memory spikes from sudden floods can OOM-crash the server without this limit.
- Configure Prometheus scraping. vLLM exposes /metrics at launch. Point Prometheus at http://your-instance:8000/metrics, chart it in Grafana, and alert on vllm:num_requests_waiting > 10 as an early warning of capacity pressure.
- Persistent storage for weights. Mount a persistent volume at the HuggingFace cache path (~/.cache/huggingface). Re-downloading 80+ GB of weights on every restart adds significant downtime.
- Systemd for auto-restart. Create a systemd unit so vLLM restarts after OOM crashes or instance reboots:
[Unit]
Description=Qwen3-Coder-Next vLLM Server
After=network.target
[Service]
ExecStart=/opt/qwen_coder_venv/bin/vllm serve Qwen/Qwen3-Coder-Next \
--quantization fp8 \
--max-model-len 32768 \
--port 8000 \
--served-model-name qwen3-coder-next \
--enable-expert-parallel \
--gpu-memory-utilization 0.92 \
--host 0.0.0.0 \
--api-key ${VLLM_API_KEY}
Restart=always
RestartSec=15
Environment=HF_TOKEN=your_token_here
Environment=VLLM_API_KEY=your_api_key_here
[Install]
WantedBy=multi-user.target
- Horizontal scaling for large teams. For teams above 20 developers, run two vLLM instances on separate GPUs behind an NGINX upstream:
upstream qwen_coder {
server gpu1:8000;
server gpu2:8000;
}
server {
listen 80;
location /v1/ {
proxy_pass http://qwen_coder;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Authorization $http_authorization;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
- vLLM version pinning. Qwen3-Coder-Next is a new model. If native vLLM support is not yet in a stable release, pin the nightly build date in your startup script to avoid a silent breakage from an upstream parser change.
- License review. Qwen3-Coder-Next is released under Apache 2.0. Review the full license terms at huggingface.co/Qwen/Qwen3-Coder-Next before commercial production deployment.
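For the Prometheus item in the checklist, a quick way to check queue depth without a full monitoring stack is to read the /metrics endpoint directly. The sketch below parses the Prometheus text format for the waiting-requests gauge. The metric name is what vLLM exposes to my knowledge, so confirm it against your own server's /metrics output; the sample text is illustrative and not captured from a real instance.

```python
# Read the queue-depth gauge from vLLM's Prometheus text output.
# The sample text below is illustrative, not captured from a real server,
# and the threshold of 10 is the alerting value suggested in the checklist.
import urllib.request

METRIC = "vllm:num_requests_waiting"


def parse_gauge(metrics_text: str, name: str = METRIC) -> float:
    """Sum a gauge across all label sets in Prometheus exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(name):
            continue
        # Lines look like: name{label="x"} 3.0   or   name 3.0
        rest = line[len(name):]
        if not rest or rest[0] not in "{ ":
            continue  # a different metric sharing the same prefix
        total += float(line.rsplit(" ", 1)[1])
    return total


def queue_depth(base_url: str = "http://localhost:8000") -> float:
    """Fetch /metrics from a running server and return the waiting-request count."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as response:
        return parse_gauge(response.read().decode("utf-8"))


SAMPLE = '''# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen3-coder-next"} 12.0
vllm:num_requests_running{model_name="qwen3-coder-next"} 32.0
'''

if __name__ == "__main__":
    depth = parse_gauge(SAMPLE)
    print("waiting:", depth, "-> alert" if depth > 10 else "-> ok")
```

A sustained non-zero waiting count while running requests sit at the --max-num-seqs cap is the signal that the instance is saturated and a second replica is worth considering.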
Qwen3-Coder-Next's 80B MoE architecture delivers frontier-scale coding quality at a 3B-active-param inference footprint. Spheron's H200 SXM5 nodes give you the single-GPU option that fits FP8, and A100 80G instances cover the budget AWQ INT4 path.
Quick Setup Guide
For single-GPU production, use H200 SXM5 141GB with FP8 quantization. The FP8 runtime footprint is roughly 92 GB, leaving about 49 GB for KV cache. For a two-GPU setup with more KV cache headroom, use 2x H100 SXM5 (160 GB total) with FP8 and tensor-parallel-size 2. For a budget option, use an A100 80GB with AWQ INT4, which fits the roughly 46 GB runtime footprint with about 34 GB left for KV cache but limits maximum context length.
Log in at app.spheron.ai, navigate to GPU Cloud, and select your GPU tier. Attach at least 200 GB of persistent storage for model weights and vLLM cache. For H200 FP8, select a single H200 SXM5 instance. For 2x H100 FP8, select a two-GPU H100 SXM5 bundle. Enable spot pricing where available for significant hourly savings. See the Spheron docs at docs.spheron.ai for provisioning details.
Run pip install 'vllm>=0.9.0' and pip install huggingface_hub hf_transfer (keep the quotes around the version specifier so the shell does not treat > as a redirect). Export your HuggingFace token as HF_TOKEN and enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. Download weights with huggingface-cli download Qwen/Qwen3-Coder-Next. Verify the exact repo slug at huggingface.co/Qwen before running, as Alibaba naming conventions vary between releases.
For H200 FP8 single GPU: vllm serve Qwen/Qwen3-Coder-Next --quantization fp8 --max-model-len 32768 --port 8000 --served-model-name qwen3-coder-next --enable-expert-parallel --gpu-memory-utilization 0.92. The --enable-expert-parallel flag distributes whole experts across GPUs for MoE routing, which can improve throughput for long code generation outputs on multi-GPU setups; it is a no-op on a single GPU. For 2x H100 FP8, add --tensor-parallel-size 2, lower --gpu-memory-utilization to 0.90, and raise --max-model-len to 65536 for more context headroom.
Send a POST request to http://localhost:8000/v1/chat/completions with Content-Type: application/json and a JSON body containing model: qwen3-coder-next, a messages array, and a coding prompt. A correct response returns a choices array with a message containing the code output. Verify function calling by including a tools array if the model card confirms function calling support.
In Continue's ~/.continue/config.json, add an entry with provider: openai, apiBase: http://your-spheron-ip:8000/v1, model: qwen3-coder-next, and apiKey set to the --api-key value from your vLLM launch (any placeholder such as none works if the server was started without one). In Cline's VS Code settings, set the API provider to OpenAI Compatible, enter the base URL, and set the model ID to qwen3-coder-next. In Aider, pass --openai-api-base http://ip:8000/v1 and --model openai/qwen3-coder-next.
Frequently Asked Questions
Qwen3-Coder-Next has 80B total parameters, so all 80B must reside in VRAM regardless of the active parameter count. In FP16 that is 160 GB of weights plus roughly 15% runtime overhead (activations, framework buffers), bringing the minimum to around 184 GB. In FP8 the weights compress to approximately 80 GB with a total footprint of about 92 GB, which fits a single H200 SXM5 141 GB with headroom for KV cache. AWQ INT4 reduces the weight footprint to roughly 40-46 GB, fitting on an A100 80GB. A single H100 80GB cannot run FP8 because the 92 GB runtime exceeds 80 GB available VRAM; H100 is viable only with AWQ INT4.
Yes, but only on a GPU with at least 100 GB VRAM at FP8. The H200 SXM5 141GB is the practical single-GPU option for production: FP8 weights require roughly 92 GB, leaving about 49 GB for KV cache at standard context lengths. A single H100 SXM5 80GB does not work at FP8 because the runtime footprint exceeds 80 GB. For a two-GPU setup, 2x H100 SXM5 (160 GB total) runs FP8 comfortably with tensor parallelism.
Devstral 24B is a dense 24B model and runs on a single L40S at FP8. Kimi K2.7 Code is a 1T/32B-active MoE that requires 8x H200 minimum. Qwen3-Coder-Next sits between them: 80B total with 3B active, fitting on a single H200 SXM5 at FP8. In terms of inference cost per token, 3B active parameters means throughput is closer to a 3B dense model than an 80B one. On SWE-Bench Verified, Qwen3-Coder-Next scores 70.6%.
Yes. vLLM exposes an OpenAI-compatible endpoint at /v1/chat/completions and /v1/completions. Continue, Cline, and Aider all support custom OpenAI-compatible base URLs. In Continue's config.json set the provider to openai, point apiBase at your Spheron instance IP and port 8000, and set model to qwen3-coder-next. Aider uses --openai-api-base and --model flags. Cline has a base URL field in its VS Code settings panel.
Three options are practical. FP8 brings the total runtime footprint to about 92 GB, the recommended precision for H200 single-GPU or 2x H100 multi-GPU deployments. AWQ INT4 compresses weights to roughly 40-46 GB, fitting on an A100 80GB. GGUF enables CPU+GPU hybrid offloading for experimental or low-concurrency setups. For production serving, FP8 on H200 or AWQ INT4 on A100 are the two validated paths. Always use official Qwen-provided quantized checkpoints where available rather than community quantizations.
