Deploy Qwen3-Coder-Next on GPU Cloud: Self-Host Alibaba's 80B MoE Coding Model with vLLM (2026)

Qwen3-Coder-Next is 80B parameters with only 3B active per forward pass. That ratio matters for the GPU budget: storage cost is determined by total parameters (80B must live in VRAM), but inference speed is determined by active parameters (closer to a 3B model). For teams that have been watching the coding model space, this is the clearest path to frontier-scale code quality on a single high-VRAM GPU.

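The per-seat SaaS math is the other reason to look at self-hosting. A 10-person team on Cursor Team pays $400/month at $40/seat. A single H200 SXM5 running Qwen3-Coder-Next at FP8 can serve that whole team continuously. On raw seat price the GPU does not win at this team size (the per-developer breakdown later in this guide works through the numbers); the case for self-hosting rests on removing usage caps, keeping code inside your network, and reusing the same instance for other models.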

This guide covers VRAM sizing, vLLM deployment on Spheron, expert parallelism tuning, IDE integration, and the full cost breakdown.

What Qwen3-Coder-Next Is

Qwen3-Coder-Next is Alibaba's coding-specialized MoE model in the Qwen3 family. It targets the same coding task categories as Devstral and Qwen2.5-Coder but at a much larger total parameter scale, while keeping active parameters low for practical inference.

| Property | Value |
|---|---|
| Architecture | 80B MoE, 3B active parameters |
| Context window | 256K tokens |
| License | Apache 2.0 |
| SWE-Bench Verified | 70.6% |
| Tool use | OpenAI-compatible via vLLM |
| HuggingFace | Qwen/Qwen3-Coder-Next |

The key distinction from the Qwen3.6 general-purpose models is coding specialization. Both Coder-Next and Qwen3-30B-A3B activate 3B parameters per token, but Coder-Next packs 512 experts vs. 128 in Qwen3-30B-A3B, roughly 4x more expert diversity within the larger 80B parameter pool.

For context on the Qwen 3.6 Plus general-purpose model, see the Qwen 3.6 Plus deployment guide.

VRAM Requirements and GPU Sizing

The most common mistake with MoE models is confusing active parameters with storage requirements. Qwen3-Coder-Next activates 3B parameters per forward pass, but all 80B parameters must live in VRAM. The "3B active" figure tells you about inference speed, not memory footprint.

| Precision | Weight footprint | Runtime (weights + ~15% overhead) | Min VRAM |
|---|---|---|---|
| FP16 | ~160 GB | ~184 GB | 2x H200 SXM5 (minimum) |
| FP8 | ~80 GB | ~92 GB | H200 SXM5 141 GB (single GPU, fits) |
| AWQ INT4 | ~40 GB | ~46 GB | A100 80GB (fits, with context limits) |

The 15% overhead figure accounts for activations, CUDA context, and framework buffers. KV cache is on top of that and depends on max-model-len and concurrent sequences.
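The sizing table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the rule of thumb used here: it assumes 1 GB = 10^9 bytes, uses 0.5 bytes per parameter for AWQ INT4 (ignoring the small extra cost of quantization scales), and the function names are made up for this example.

```python
# Bytes per parameter for each precision in the sizing table.
# Assumption: AWQ INT4 is treated as a flat 0.5 bytes/param (scales ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "awq_int4": 0.5}


def weight_footprint_gb(total_params_billions: float, precision: str) -> float:
    """Weights only. Every expert is resident, so use TOTAL parameters, not active."""
    return total_params_billions * BYTES_PER_PARAM[precision]


def runtime_footprint_gb(total_params_billions: float, precision: str,
                         overhead: float = 0.15) -> float:
    """Weights plus the ~15% allowance for activations, CUDA context, and buffers."""
    return weight_footprint_gb(total_params_billions, precision) * (1.0 + overhead)


for precision in BYTES_PER_PARAM:
    weights = weight_footprint_gb(80, precision)
    runtime = runtime_footprint_gb(80, precision)
    print(f"{precision:>8}: weights ~{weights:.0f} GB, runtime ~{runtime:.0f} GB")
```

Passing 3 instead of 80 shows why the active-parameter count is misleading for memory planning: it would suggest a footprint of a few GB, which is not what has to fit in VRAM.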

Here are the recommended GPU configurations on Spheron with current pricing:

| GPU | VRAM | Precision | On-Demand | Spot | Notes |
|---|---|---|---|---|---|
| H200 SXM5 | 141 GB | FP8 | $4.82/hr | $3.31/hr | Single-GPU production sweet spot |
| 2x H100 SXM5 | 160 GB total | FP8 | $4.06/hr per GPU ($8.12/hr) | $2.91/hr per GPU ($5.82/hr) | FP8 fits across two GPUs with 68 GB for KV cache |
| H100 SXM5 | 80 GB | FP8 | $4.06/hr | $2.91/hr | FP8 does NOT fit (92 GB > 80 GB) |
| H100 SXM5 | 80 GB | AWQ INT4 | $4.06/hr | $2.91/hr | INT4 fits (46 GB), limited context |
| A100 80G | 80 GB | AWQ INT4 | $1.69/hr | $0.80/hr | Budget option, ~34 GB headroom after the 46 GB INT4 footprint |

Important on H100: A single H100 SXM5 cannot run FP8 for Qwen3-Coder-Next. The FP8 runtime footprint is approximately 92 GB, which exceeds the 80 GB VRAM ceiling. H100 is only viable with AWQ INT4, which keeps the runtime footprint to around 46 GB.

RTX 5090 note: 32 GB VRAM is insufficient even for AWQ INT4 (which requires roughly 46 GB). The RTX 5090 works well for smaller Qwen3-Coder variants but not for the 80B model.

For the single-GPU production case, H200 SXM5 rental on Spheron at FP8 is the recommended path. The 141 GB VRAM leaves approximately 49 GB for KV cache after the ~92 GB FP8 runtime footprint, which supports 32K context with multiple concurrent sessions.
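The fit and headroom figures quoted in this section come from one subtraction per configuration. The sketch below makes that explicit, using the runtime footprints from the sizing table. It is an idealized view: it treats two GPUs as one pooled memory space and ignores the slice reserved by --gpu-memory-utilization, so real headroom will be somewhat lower. The helper name is hypothetical.

```python
def kv_headroom_gb(total_vram_gb: float, runtime_footprint_gb: float) -> float:
    """VRAM left for KV cache. A negative value means the model does not fit."""
    return total_vram_gb - runtime_footprint_gb


# (label, total VRAM in GB, runtime footprint in GB) taken from the tables above.
CONFIGS = [
    ("H200 SXM5, FP8", 141, 92),
    ("2x H100 SXM5, FP8", 160, 92),
    ("H100 SXM5, FP8", 80, 92),
    ("A100 80G, AWQ INT4", 80, 46),
]

for label, vram, runtime in CONFIGS:
    headroom = kv_headroom_gb(vram, runtime)
    verdict = "fits" if headroom > 0 else "does NOT fit"
    print(f"{label:<20} {verdict:<13} headroom: {headroom:+.0f} GB")
```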

For a general VRAM sizing reference across model families, see the GPU memory requirements for LLMs guide.

Pricing fluctuates based on GPU availability. The prices above are based on 19 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

Deploy Qwen3-Coder-Next with vLLM on Spheron

Step 1: Choose GPU and quantization

Pick based on budget and team size:

  • H200 SXM5 + FP8: single-GPU production default. Best quality, highest single-instance cost.
  • 2x H100 SXM5 + FP8: more KV cache headroom than a single H200 at comparable FP8 quality. Good for teams that regularly send long-context requests.
  • A100 80G + AWQ INT4: budget option. Fits the AWQ INT4 checkpoint with 34 GB left for KV cache. Restrict max-model-len to 16K or lower to prevent OOM.
  • H100 SXM5 + AWQ INT4: same profile as A100 at a higher cost per hour. Pick H100 if you need the higher FP8 Tensor Core throughput for other models on the same instance.

Step 2: Provision a Spheron GPU instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select the GPU tier matching your choice above and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 200 GB persistent storage for model weights. Spot instances are available on H200 and H100 at roughly 28-31% below on-demand rates at the prices listed above ($3.31 vs. $4.82/hr on H200, $2.91 vs. $4.06/hr on H100); for a coding assistant where brief interruptions are tolerable (IDE plugins retry automatically), spot is a reasonable choice.

See docs.spheron.ai for step-by-step instance provisioning.

Step 3: Install vLLM and download weights

bash
pip install 'vllm>=0.9.0'
pip install huggingface_hub hf_transfer

export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

# Verify the exact repo slug at huggingface.co/Qwen before running
huggingface-cli download Qwen/Qwen3-Coder-Next

If vLLM does not natively recognize the model class, add --trust-remote-code to the serve command as a fallback. For models released after the last stable vLLM build, check docs.vllm.ai for supported model status.

Step 4: Launch vLLM with expert parallelism

H200 SXM5, FP8 (single GPU, production):

bash
vllm serve Qwen/Qwen3-Coder-Next \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --served-model-name qwen3-coder-next \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.92

2x H100 SXM5, FP8 (tensor parallel, more KV headroom):

bash
vllm serve Qwen/Qwen3-Coder-Next \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 65536 \
  --port 8000 \
  --served-model-name qwen3-coder-next \
  --gpu-memory-utilization 0.90

A100 80G, AWQ INT4 (budget):

bash
vllm serve Qwen/Qwen3-Coder-Next-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --port 8000 \
  --served-model-name qwen3-coder-next \
  --gpu-memory-utilization 0.92

Note on the AWQ path: if Qwen/Qwen3-Coder-Next-AWQ does not yet exist on HuggingFace (quantized checkpoints often appear after the base model release), check the Qwen organization page at huggingface.co/Qwen for the correct slug. Do not attempt a download that will 404.
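The three launch commands differ only in a handful of flags, which is easier to see when they are generated from one place. The helper below is a hypothetical convenience wrapper, not part of vLLM; it only emits flags that already appear in the commands above, and the AWQ repo slug carries the same caveat as the note on the AWQ path.

```python
import shlex


def build_serve_command(model, quantization, max_model_len,
                        tensor_parallel_size=1, expert_parallel=True,
                        gpu_memory_utilization=0.92, port=8000,
                        served_model_name="qwen3-coder-next"):
    """Return the `vllm serve` argv for one deployment profile."""
    cmd = [
        "vllm", "serve", model,
        "--quantization", quantization,
        "--max-model-len", str(max_model_len),
        "--port", str(port),
        "--served-model-name", served_model_name,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if tensor_parallel_size > 1:
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")
    return cmd


PROFILES = {
    "H200 SXM5, FP8": dict(model="Qwen/Qwen3-Coder-Next", quantization="fp8",
                           max_model_len=32768),
    "2x H100 SXM5, FP8": dict(model="Qwen/Qwen3-Coder-Next", quantization="fp8",
                              max_model_len=65536, tensor_parallel_size=2,
                              gpu_memory_utilization=0.90),
    "A100 80G, AWQ INT4": dict(model="Qwen/Qwen3-Coder-Next-AWQ", quantization="awq",
                               max_model_len=16384, expert_parallel=False),
}

for name, kwargs in PROFILES.items():
    print(f"# {name}")
    print(shlex.join(build_serve_command(**kwargs)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues when the command is assembled programmatically.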

Why --enable-expert-parallel: Qwen3-Coder-Next is an 80B MoE model with many experts and only 3B active per token. With expert parallelism, each GPU holds a subset of whole experts and tokens are dispatched to whichever GPU owns their selected experts, rather than every expert's weight matrices being sharded across all GPUs. On the MoE layers this swaps tensor parallelism's per-layer all-reduce for an all-to-all token exchange, which can improve throughput on long code generation outputs; measure both settings on your hardware before committing. The flag is a no-op on single-GPU setups but harmless to include.
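To make the idea of dispatching tokens at the expert level concrete, here is a toy illustration. It is not vLLM's implementation: the contiguous expert-to-GPU mapping, the top-2 routing, and the router output are all assumptions chosen for readability. Only the 512-expert count comes from this guide.

```python
from collections import defaultdict


def expert_to_rank(expert_id: int, num_experts: int, num_ranks: int) -> int:
    """Contiguous sharding: each GPU rank owns num_experts / num_ranks whole experts."""
    experts_per_rank = num_experts // num_ranks
    return expert_id // experts_per_rank


def dispatch(token_experts, num_experts: int, num_ranks: int):
    """Group (token, expert) pairs by the rank that owns the expert."""
    per_rank = defaultdict(list)
    for token, experts in token_experts.items():
        for expert in experts:
            rank = expert_to_rank(expert, num_experts, num_ranks)
            per_rank[rank].append((token, expert))
    return dict(per_rank)


# Hypothetical router output: token index -> selected expert ids (top-2 for brevity).
routed = {0: [3, 300], 1: [255, 256], 2: [511, 7]}

for rank, work in sorted(dispatch(routed, num_experts=512, num_ranks=2).items()):
    print(f"GPU {rank} receives: {work}")
```

Each token's hidden state travels only to the GPUs that own its selected experts, and the expert weights themselves never move. That exchange is the all-to-all step that expert parallelism introduces.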

For deeper MoE tuning, see the MoE inference optimization guide.

Step 5: Test the endpoint

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder-next",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that validates an email address using regex and returns both the result and the matched groups."
      }
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'

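A correct response returns a choices[0].message.content field with the code output. To verify function calling support, include a tools array in the request. Note that vLLM only parses tool calls when the server is launched with --enable-auto-tool-choice and a --tool-call-parser that matches the model; the launch commands above do not set these. Check the official model card to confirm whether function calling is supported, and which parser to use, before wiring up agentic pipelines.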
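The same test can be scripted, which is convenient once you start adding a tools array. The sketch below uses only the Python standard library. The run_tests tool is a made-up example in the OpenAI function-tool schema, and the base URL and API key are placeholders you would replace with your own.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3-coder-next", tools=None,
                       max_tokens=512, temperature=0.2):
    """Build the JSON body for POST /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if tools:
        body["tools"] = tools
    return body


def send_chat(base_url, api_key, body, timeout=120):
    """POST the body to an OpenAI-compatible server and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical tool definition, used only to exercise the tools array.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite under a given path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

body = build_chat_request("Run the tests in ./tests and summarize failures.",
                          tools=[RUN_TESTS_TOOL])
print(json.dumps(body, indent=2))
```

Calling send_chat("http://localhost:8000/v1", "your_api_key_here", body) against a running server returns the parsed response. If tool calling is active, choices[0].message carries a tool_calls list instead of plain content.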

Step 6: Cloud-init startup script

For reproducible instance bootstrapping, use this startup script in your Spheron deployment configuration:

bash
#!/bin/bash
set -e

echo "--- Setting Up Environment ---"

sudo apt-get update -y
sudo apt-get install -y python3-venv

sudo python3 -m venv /opt/qwen_coder_venv
source /opt/qwen_coder_venv/bin/activate

pip install --upgrade pip
pip install 'vllm>=0.9.0' huggingface_hub hf_transfer

echo "--- Launching vLLM Server ---"

export HF_TOKEN=your_hf_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

nohup vllm serve Qwen/Qwen3-Coder-Next \
    --quantization fp8 \
    --max-model-len 32768 \
    --port 8000 \
    --served-model-name qwen3-coder-next \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.92 \
    --host 0.0.0.0 \
    --api-key "your_api_key_here" > /var/log/vllm.log 2>&1 &

echo "--- Waiting for server to initialize (ETA 10-20 minutes for weight download) ---"

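# Poll /health rather than /v1/models: with --api-key set, vLLM rejects
# unauthenticated requests to /v1/* paths, so an unauthenticated probe of
# /v1/models would fail for the whole wait period.
for i in {1..3600}; do
  if curl -sf "http://localhost:8000/health" > /dev/null; then
    echo "vLLM server is ready!"
    break
  fi
  if [ $((i % 15)) -eq 0 ]; then
    echo "Still waiting... ($i seconds elapsed)"
  fi
  sleep 1
done

# Final check: fail the bootstrap if the server never came up.
curl -sf http://localhost:8000/health > /dev/null || { echo 'ERROR: vLLM failed to start'; exit 1; }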
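If you would rather wait for readiness from a client machine or a CI job than inside the instance, the same poll-until-healthy pattern is easy to express in Python. This is a sketch with hypothetical helper names. It probes /health, the same path the script's final check uses, and takes the probe as a parameter so the loop can be exercised without a live server.

```python
import time
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False


def wait_until_ready(probe, timeout_s: float = 3600.0, interval_s: float = 1.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

A typical call would be wait_until_ready(lambda: http_ok("http://<your-spheron-ip>:8000/health")), followed by a non-zero exit if it returns False.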

Quantization Options for Smaller GPUs

For teams that cannot justify an H200 or 2x H100, AWQ INT4 is the practical fallback.

AWQ INT4 compresses Qwen3-Coder-Next's 80B weights from ~80 GB (FP8) to roughly 40 GB, with a runtime footprint around 46 GB including overhead. That fits on an A100 80G (80 GB VRAM) with about 34 GB left for KV cache. Restricting max-model-len to 16K or lower is important here because KV cache grows linearly with context length and with the number of concurrent sequences, so a few long-context sessions can exhaust the remaining 34 GB.
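A rough way to budget the KV cache is to compute the bytes each cached token costs and divide the headroom by it. The sketch below uses the standard formula for a transformer with grouped-query attention. The layer count, KV head count, and head dimension are placeholder values, not Qwen3-Coder-Next's real architecture; read the actual numbers from the model's config.json before relying on the result.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache `tokens` tokens (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


def max_cached_tokens(headroom_gb: float, **arch) -> int:
    """Total tokens, summed over all concurrent sequences, that fit in the headroom."""
    per_token = kv_cache_bytes(1, **arch)
    return int(headroom_gb * 1e9 // per_token)


# Placeholder architecture values for illustration only (NOT the real model config).
ARCH = dict(num_layers=48, num_kv_heads=8, head_dim=128)

budget = max_cached_tokens(34, **ARCH)
print(f"per-token cost: {kv_cache_bytes(1, **ARCH)} bytes")
print(f"token budget in 34 GB: {budget}")
print(f"concurrent 16K sequences: {budget // 16384}")
```

The budget is shared by every active sequence, which is why capping max-model-len and the number of concurrent sequences both matter on the A100 path.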

For AWQ, use official Qwen-provided quantized checkpoints from huggingface.co/Qwen rather than community quantizations. If the official AWQ checkpoint is not yet available, wait rather than using an unofficial one. Community quantizations often target different calibration sets and can degrade code quality in non-obvious ways.

GGUF variants enable CPU+GPU hybrid offloading via llama.cpp. For a small team running occasional code generation at low concurrency, a large GGUF model on a workstation with 48+ GB system RAM plus a consumer GPU is technically viable, but expect higher latency. This is not a production pattern.

For a broader review of quantization tradeoffs, see the MXFP4 quantization guide and the LLM serving optimization overview.

Wiring into IDEs and Agentic Harnesses

Continue

Add to ~/.continue/config.json:

json
{
  "models": [
    {
      "title": "Qwen3-Coder-Next (self-hosted)",
      "provider": "openai",
      "model": "qwen3-coder-next",
      "apiBase": "http://<your-spheron-ip>:8000/v1",
      "apiKey": "your_api_key_here"
    }
  ]
}

Set apiKey to match the --api-key value from your startup script.

Cline

In VS Code, open Cline's Settings panel (gear icon in the sidebar). Set:

  • API Provider: OpenAI Compatible
  • Base URL: http://<your-spheron-ip>:8000/v1
  • Model ID: qwen3-coder-next
  • API Key: your_api_key_here

Cline's agentic workflows drive file reads, terminal commands, and diff application through the tool-call interface if function calling is enabled in the model.

Aider

bash
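aider \
  --openai-api-base http://<your-spheron-ip>:8000/v1 \
  --model openai/qwen3-coder-next \
  --openai-api-key your_api_key_here

The openai/ prefix tells Aider to treat the model as a generic OpenAI-compatible endpoint; the part after the slash must match the --served-model-name on your vLLM instance. Aider's --architect mode pairs well with long-context models. Increase --max-model-len on your vLLM instance if Aider is sending full repo reads.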

For the full Tabby stack with user authentication, SSH tunneling, and multi-user API key management, see the self-hosted AI coding assistant guide.

Throughput, Cost Per Million Tokens, and Quality vs. Competitors

Throughput figures for Qwen3-Coder-Next are not yet independently benchmarked at time of writing. The estimates below are derived from comparable 80B MoE architectures with similar active parameter counts. Benchmark on your Spheron instance before making capacity decisions.

| GPU Config | Context | Throughput (tok/s, est.) | TTFT (ms, est.) | Notes |
|---|---|---|---|---|
| H200 SXM5, FP8 | 8K | ~600-900 | ~150-300 | MoE routing with 3B active is fast |
| H200 SXM5, FP8 | 32K | ~350-550 | ~500-900 | KV cache growth increases latency |
| 2x H100 SXM5, FP8 | 8K | ~500-750 | ~180-350 | NVLink bandwidth affects MoE dispatch |
| A100 80G, AWQ INT4 | 8K | ~300-500 | ~200-400 | Adequate for small team use |

Cost per million tokens, computed from live pricing at $4.82/hr for H200 FP8:

  • At 750 tok/s throughput: ($4.82 / 3600) / (750 / 1,000,000) = approximately $1.79/M tokens
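The calculation generalizes to any price and throughput pair, which is useful because the throughput figures here are estimates. A minimal sketch follows; the function name is made up for this example.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    usd_per_second = hourly_usd / 3600
    return usd_per_second / tokens_per_second * 1_000_000


# H200 SXM5 on-demand price across the estimated 8K-context throughput range.
for tok_s in (600, 750, 900):
    print(f"{tok_s} tok/s -> ${cost_per_million_tokens(4.82, tok_s):.2f}/M tokens")
```

This assumes the GPU is saturated for the whole hour. At lower utilization the effective cost per token rises proportionally.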

For reference, the Anthropic Claude API and OpenAI GPT-4o run $2-10+ per million tokens depending on tier and volume. A self-hosted Qwen3-Coder-Next at H200 FP8 pricing can come in under $2/M tokens at reasonable throughput.

Comparison across coding model deployment options:

| Model | Min GPU | On-Demand $/hr | Est. $/M tokens | Context |
|---|---|---|---|---|
| Qwen3-Coder-Next 80B MoE | H200 FP8 | $4.82/hr | ~$1.79 (est.) | 256K |
| Devstral 24B | H100 FP8 | $4.06/hr | ~$0.90 (est.) | 128K |
| Kimi K2.7 Code 1T MoE | 8x H200 FP8 | $4.82/hr per GPU (~$38.56/hr for 8x) | ~$5-8 (est.) | 256K |
| Qwen2.5-Coder 32B | A100 BF16 | $1.69/hr | ~$0.60 (est.) | 128K |

All throughput-derived $/M token figures are estimated. Verify against your deployment configuration.

For the Devstral deployment and L40S cost breakdown, see deploying Devstral on GPU cloud.

For Kimi K2.7 Code, Moonshot's 1T coding-first MoE, see the Kimi K2.7 Code deployment guide.

Cost Per Developer Per Month on Spheron

The key question is at what team size self-hosting becomes cheaper than per-seat SaaS.

| GPU | Precision | On-Demand | Spot | Devs served | $/dev/month (60% util, at top of dev range) |
|---|---|---|---|---|---|
| A100 80G | AWQ INT4 | $1.69/hr | $0.80/hr | 3-5 | ~$69 (spot) |
| H200 SXM5 | FP8 | $4.82/hr | $3.31/hr | 10-15 | ~$139 (on-demand), ~$95 (spot) |
| 2x H100 SXM5 | FP8 | $4.06/hr per GPU ($8.12/hr) | $2.91/hr per GPU ($5.82/hr) | 12-18 | ~$195 (on-demand), ~$140 (spot) |

Pricing fluctuates based on GPU availability. The prices above are based on 19 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

At 60% utilization (realistic for a daytime coding schedule), an H200 SXM5 on spot pricing at $3.31/hr costs approximately $1,430/month. Across 15 developers, that is $95/developer/month, which is higher than Cursor Team at $40/seat. The breakeven shifts when you factor in that self-hosting eliminates usage caps, prevents code from leaving your network, and the same instance can serve other models.

The 2x H100 FP8 on-demand configuration at $8.12/hr (2 × $4.06/hr per GPU) costs approximately $3,510/month at 60% utilization. For 18 developers that is about $195/developer/month. On spot ($5.82/hr for both GPUs) it drops to approximately $2,514/month total, or about $140/developer/month. For teams with strict data sovereignty requirements, the spot configuration is competitive with enterprise Cursor or Copilot Business pricing.

The A100 AWQ INT4 at spot pricing works for small teams where privacy and control matter more than raw seat cost: $0.80/hr spot, 60% utilization, 4 developers, comes to roughly $346/month total, or about $87/developer/month. You own the endpoint, have no usage caps, and can swap models without changing plans.
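The per-developer figures in this section can be reproduced, and rerun with your own team size, using the sketch below. The 720-hour month is an assumption inferred from the totals quoted above, and the break-even helper simply divides the monthly GPU bill by the $40 Cursor Team seat price mentioned earlier.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30-day month, consistent with the totals above


def monthly_cost(hourly_usd: float, utilization: float = 0.60) -> float:
    """Monthly GPU bill when the instance runs for `utilization` of the month."""
    return hourly_usd * HOURS_PER_MONTH * utilization


def cost_per_developer(hourly_usd: float, developers: int,
                       utilization: float = 0.60) -> float:
    return monthly_cost(hourly_usd, utilization) / developers


def breakeven_seats(hourly_usd: float, seat_price_usd: float = 40.0,
                    utilization: float = 0.60) -> int:
    """Team size at which the GPU bill equals per-seat pricing."""
    return math.ceil(monthly_cost(hourly_usd, utilization) / seat_price_usd)


print(f"H200 spot, 15 devs:        ${cost_per_developer(3.31, 15):.0f}/dev/month")
print(f"2x H100 on-demand, 18 devs: ${cost_per_developer(8.12, 18):.0f}/dev/month")
print(f"H200 spot break-even vs $40/seat: {breakeven_seats(3.31)} developers")
```

The break-even team size is a pure price comparison. It says nothing about whether a single instance can actually serve that many developers, so read it alongside the "Devs served" column.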

Production Checklist

Before routing real developer traffic:

  • Enable streaming. IDE plugins expect stream: true responses. Without it, the plugin waits for the full completion before showing any output, making short autocompletes feel broken.
  • Set a request cap. Use --max-num-seqs 32 or lower to cap concurrent sequences. GPU memory spikes from sudden floods can OOM-crash the server without this limit.
  • Configure Prometheus scraping. vLLM exposes /metrics at launch. Point a Prometheus scrape job at http://your-instance:8000/metrics, graph it in Grafana, and alert on vllm:num_requests_waiting > 10 as an early warning of capacity pressure.
  • Persistent storage for weights. Mount a persistent volume at the HuggingFace cache path (~/.cache/huggingface). Re-downloading 80+ GB of weights on every restart adds significant downtime.
  • Systemd for auto-restart. Create a systemd unit so vLLM restarts after OOM crashes or instance reboots:
ini
[Unit]
Description=Qwen3-Coder-Next vLLM Server
After=network.target

[Service]
ExecStart=/opt/qwen_coder_venv/bin/vllm serve Qwen/Qwen3-Coder-Next \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --served-model-name qwen3-coder-next \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --api-key ${VLLM_API_KEY}
Restart=always
RestartSec=15
Environment=HF_TOKEN=your_token_here
Environment=VLLM_API_KEY=your_api_key_here

[Install]
WantedBy=multi-user.target
  • Horizontal scaling for large teams. For teams above 20 developers, run two vLLM instances on separate GPUs behind an NGINX upstream:
nginx
upstream qwen_coder {
    server gpu1:8000;
    server gpu2:8000;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://qwen_coder;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Authorization $http_authorization;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
  • vLLM version pinning. Qwen3-Coder-Next is a new model. If native vLLM support is not yet in a stable release, pin the nightly build date in your startup script to avoid a silent breakage from an upstream parser change.
  • License review. Qwen3-Coder-Next is released under Apache 2.0. Review the full license terms at huggingface.co/Qwen/Qwen3-Coder-Next before commercial production deployment.
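For the Prometheus item in the checklist, it can help to sanity-check the queue-depth signal by hand before wiring up alerts. The sketch below parses the plain-text output of /metrics and sums one gauge across its label sets. vLLM prefixes its metric names with "vllm:"; the sample payload is illustrative, not captured from a real server.

```python
def read_gauge(metrics_text: str, name: str) -> float:
    """Sum a gauge across all label sets in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


# Illustrative payload in the shape /metrics returns (values are made up).
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen3-coder-next"} 12.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen3-coder-next"} 8.0
"""

waiting = read_gauge(SAMPLE, "vllm:num_requests_waiting")
print(f"waiting={waiting:.0f}", "-> capacity pressure" if waiting > 10 else "-> ok")
```

In practice you would feed it the body of a GET request to http://your-instance:8000/metrics. The threshold of 10 mirrors the alert suggested in the checklist.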

Qwen3-Coder-Next's 80B MoE architecture delivers frontier-scale coding quality at a 3B-active-param inference footprint. Spheron's H200 SXM5 nodes give you the single-GPU option that fits FP8, and A100 80G instances cover the budget AWQ INT4 path.

Quick Setup Guide

  1. Choose GPU and quantization level

    For single-GPU production, use H200 SXM5 141GB with FP8 quantization. The FP8 runtime footprint is roughly 92 GB, leaving about 49 GB for KV cache. For a two-GPU setup with more KV cache headroom, use 2x H100 SXM5 (160 GB total) with FP8 and tensor-parallel-size 2. For a budget option, use an A100 80GB with AWQ INT4, which fits the roughly 46 GB runtime footprint with about 34 GB left for KV cache but limits maximum context length.

  2. Provision a Spheron GPU instance

    Log in at app.spheron.ai, navigate to GPU Cloud, and select your GPU tier. Attach at least 200 GB of persistent storage for model weights and vLLM cache. For H200 FP8, select a single H200 SXM5 instance. For 2x H100 FP8, select a two-GPU H100 SXM5 bundle. Enable spot pricing where available for significant hourly savings. See the Spheron docs at docs.spheron.ai for provisioning details.

  3. Install vLLM and download weights from HuggingFace

    Run pip install 'vllm>=0.9.0' (quote the version specifier so the shell does not treat >= as a redirect) and pip install huggingface_hub hf_transfer. Export your HuggingFace token as HF_TOKEN and enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. Download weights with huggingface-cli download Qwen/Qwen3-Coder-Next. Verify the exact repo slug at huggingface.co/Qwen before running, as Alibaba naming conventions vary between releases.

  4. Launch vLLM with expert parallelism

    For H200 FP8 single GPU: vllm serve Qwen/Qwen3-Coder-Next --quantization fp8 --max-model-len 32768 --port 8000 --served-model-name qwen3-coder-next --enable-expert-parallel --gpu-memory-utilization 0.92. The --enable-expert-parallel flag places whole experts on individual GPUs and dispatches tokens to them, which can improve MoE throughput for long code generation outputs on multi-GPU setups; it is a no-op on a single GPU. For 2x H100 FP8, add --tensor-parallel-size 2 and adjust --max-model-len to 65536 for more context headroom.

  5. Test the OpenAI-compatible endpoint

    Send a POST request to http://localhost:8000/v1/chat/completions with Content-Type: application/json and a JSON body containing model: qwen3-coder-next, a messages array, and a coding prompt. A correct response returns a choices array with a message containing the code output. Verify function calling by including a tools array if the model card confirms function calling support.

  6. Connect Continue or Cline to the vLLM endpoint

    In Continue's ~/.continue/config.json, add an entry with provider: openai, apiBase: http://your-spheron-ip:8000/v1, model: qwen3-coder-next, and apiKey set to the --api-key value you launched vLLM with (any placeholder works if you launched without one). In Cline's VS Code settings, set the API provider to OpenAI Compatible, enter the base URL and API key, and set the model ID to qwen3-coder-next. In Aider, pass --openai-api-base http://ip:8000/v1, --openai-api-key, and --model openai/qwen3-coder-next.

Frequently Asked Questions

How much VRAM does Qwen3-Coder-Next need?

Qwen3-Coder-Next has 80B total parameters, so all 80B must reside in VRAM regardless of the active parameter count. In FP16 that is 160 GB of weights plus roughly 15% runtime overhead (activations, framework buffers), bringing the minimum to around 184 GB. In FP8 the weights compress to approximately 80 GB with a total footprint of about 92 GB, which fits a single H200 SXM5 141 GB with headroom for KV cache. AWQ INT4 reduces the weight footprint to roughly 40 GB (about 46 GB at runtime), fitting on an A100 80GB. A single H100 80GB cannot run FP8 because the 92 GB runtime exceeds 80 GB available VRAM; H100 is viable only with AWQ INT4.

Can Qwen3-Coder-Next run on a single GPU?

Yes, but only on a GPU with at least 100 GB VRAM at FP8. The H200 SXM5 141GB is the practical single-GPU option for production: the FP8 runtime footprint is roughly 92 GB, leaving about 49 GB for KV cache at standard context lengths. A single H100 SXM5 80GB does not work at FP8 because the runtime footprint exceeds 80 GB. For a two-GPU setup, 2x H100 SXM5 (160 GB total) runs FP8 comfortably with tensor parallelism.

How does Qwen3-Coder-Next compare to Devstral and Kimi K2.7 Code?

Devstral 24B is a dense 24B model and runs on a single L40S at FP8. Kimi K2.7 Code is a 1T/32B-active MoE that requires 8x H200 minimum. Qwen3-Coder-Next sits between them: 80B total with 3B active, fitting on a single H200 SXM5 at FP8. In terms of inference cost per token, 3B active parameters means throughput is closer to a 3B dense model than an 80B one. On SWE-Bench Verified, Qwen3-Coder-Next scores 70.6%.

Can I use Qwen3-Coder-Next with Continue, Cline, or Aider?

Yes. vLLM exposes an OpenAI-compatible endpoint at /v1/chat/completions and /v1/completions. Continue, Cline, and Aider all support custom OpenAI-compatible base URLs. In Continue's config.json set the provider to openai, point apiBase at your Spheron instance IP and port 8000, and set model to qwen3-coder-next. Aider uses --openai-api-base and --model flags. Cline has a base URL field in its VS Code settings panel.

What quantization options are available for Qwen3-Coder-Next?

Three options are practical. FP8 brings the total runtime footprint to about 92 GB, the recommended precision for H200 single-GPU or 2x H100 multi-GPU deployments. AWQ INT4 compresses weights to roughly 40 GB (about 46 GB at runtime), fitting on an A100 80GB. GGUF enables CPU+GPU hybrid offloading for experimental or low-concurrency setups. For production serving, FP8 on H200 or AWQ INT4 on A100 are the two validated paths. Always use official Qwen-provided quantized checkpoints where available rather than community quantizations.
