Tutorial

Deploy Qwen 3.5 on GPU Cloud: Hardware Requirements and Setup Guide (2026)

Written by Mitrasish, Co-founder | Mar 31, 2026
Tags: Qwen 3.5, Qwen 3.5 GPU, MoE, vLLM, GPU Cloud, LLM Deployment, Open Source AI, Model Deployment

Qwen 3.5 is Alibaba's latest open-source model family, featuring a flagship 397B MoE architecture with support for 201 languages and an Apache 2.0 license. Released in early 2026, it introduces a hybrid architecture in which Gated DeltaNet (GDN) replaces the primary attention mechanism in 75% of layers, combined with sparse MoE. All Qwen 3.5 variants are natively multimodal, supporting text, image, and video input. The 397B-A17B is the flagship. The 9B and 27B dense variants fit on a single consumer or datacenter GPU; the 35B-A3B MoE is one of the most hardware-efficient models in its parameter class; and the 397B requires the same 8x H100 setup as the largest frontier MoE models. The full family also includes smaller dense models (0.8B, 2B, 4B); this guide covers five variants: 9B, 27B, 35B-A3B, 122B-A10B, and 397B-A17B.

For comparison with the previous generation, see our Qwen 3 GPU deployment guide. For VRAM math across model families, see the GPU memory requirements guide and the GPU requirements cheat sheet for 2026.

Qwen 3.5 Model Variants

| Model | Parameters | Architecture | Context Window | FP16 Size | FP8 Size | Q4_K_M Size |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | 9B | Dense | 262K (ext. 1M+) | ~18 GB | ~9 GB | ~5 GB |
| Qwen3.5-27B | 27B | Dense | 262K (ext. 1M+) | ~54 GB | ~27 GB | ~14 GB |
| Qwen3.5-35B-A3B | 35B total / 3B active | MoE | 262K (ext. 1M+) | ~70 GB | ~35 GB | ~18 GB |
| Qwen3.5-122B-A10B | 122B total / 10B active | MoE | 262K (ext. 1M+) | ~244 GB | ~122 GB | ~61 GB |
| Qwen3.5-397B-A17B | 397B total / 17B active | MoE | 262K (ext. 1M+) | ~794 GB | ~397 GB | ~199 GB |

Note on family coverage: The smaller dense variants (0.8B, 2B, 4B) in the full Qwen 3.5 family are not covered here. The 122B-A10B is relevant for multi-GPU deployments where the 35B fits on one GPU but the 397B is too expensive: at FP8, it requires approximately 2x H100 80GB. Hardware requirements for that variant follow the same VRAM math described in this guide.

One important note on the MoE variants: "3B active" and "17B active" refer to the number of expert parameters activated per forward pass, not the total model size. The full 35B and 397B parameter weights must reside in VRAM even though only a fraction of the parameters are active for each token. Do not plan hardware based on active parameters; plan based on the total size columns above.
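That planning rule reduces to simple arithmetic. A minimal sketch, assuming ~2 bytes per parameter at FP16, ~1 at FP8/INT8, and ~0.5 at Q4_K_M (the table sizes above are close to these ratios); the helper name and constants are ours, not an API:

```python
# Rough VRAM planning for MoE models: always size by TOTAL parameters,
# not active parameters. Bytes-per-parameter values are approximations.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "q4_k_m": 0.5}

def weight_footprint_gb(total_params_b: float, fmt: str) -> float:
    """Approximate weight size in GB for a model with total_params_b billion params."""
    return total_params_b * BYTES_PER_PARAM[fmt]

# Qwen3.5-35B-A3B: plan for 35B total, even though only 3B are active per token.
print(weight_footprint_gb(35, "fp16"))   # ~70 GB, matching the table above
print(weight_footprint_gb(397, "fp8"))   # ~397 GB
```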

GPU Hardware Requirements

Qwen3.5-9B: Single RTX 4090 or L40S

The 9B model is the most accessible Qwen 3.5 variant for budget deployments.

  • RTX 4090 (24 GB): FP8 weights (~9 GB) reach approximately 10.4-10.8 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 21.6 GB (24 x 0.9), leaving about 11 GB for KV cache. Best budget option for development and light production workloads. The RTX 4090 has native FP8 Tensor Core support.
  • L40S (48 GB): FP16 weights (~18 GB) reach about 20.7-21.6 GB at runtime. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 43.2 GB (48 x 0.9), leaving about 22-23 GB for KV cache. Better throughput than RTX 4090 for production workloads.
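The headroom numbers in these bullets all follow one formula: vLLM pre-allocates up to vram x --gpu-memory-utilization, and whatever the runtime weight footprint (weights plus 15-20% activation and framework overhead) does not consume is left for KV cache. A minimal sketch of that arithmetic; the function is ours, the 20% overhead figure is the upper end of the range quoted above:

```python
def kv_cache_headroom_gb(vram_gb: float, weights_gb: float,
                         util: float = 0.9, overhead: float = 0.20) -> float:
    """KV cache VRAM left after vLLM's allocation cap and the runtime weight footprint.

    overhead: fractional activation + framework overhead on top of raw weights.
    """
    cap = vram_gb * util                   # vLLM's total allocation cap
    runtime = weights_gb * (1 + overhead)  # weights + activations/buffers
    return cap - runtime

# RTX 4090, 9B at FP8: 24 x 0.9 = 21.6 GB cap, ~10.8 GB runtime -> ~10.8 GB for KV cache
print(round(kv_cache_headroom_gb(24, 9), 1))
# L40S, 9B at FP16: 48 x 0.9 = 43.2 GB cap, ~21.6 GB runtime -> ~21.6 GB for KV cache
print(round(kv_cache_headroom_gb(48, 18), 1))
```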

See RTX 4090 GPU rental for pricing and availability.

Qwen3.5-27B: Single H100 or A100 80GB

  • H100 80GB: FP8 weights (~27 GB) reach approximately 31-32 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving about 40 GB for KV cache. Strong throughput for concurrent requests. The recommended production configuration for the 27B model.
  • H200 141GB: FP8 and FP16 both fit with substantial KV cache headroom. Better for latency-sensitive single-stream workloads and extended context lengths up to the full 262K native window (extensible to 1M+).
  • A100 80GB: Use INT8 on A100. FP16 is not viable: the 27B FP16 weights (~54 GB) plus 15-20% activation and framework overhead reach approximately 62-65 GB at runtime, which works technically but leaves only about 7-10 GB for KV cache after the 72 GB cap from --gpu-memory-utilization 0.9. INT8 weights (~27 GB) reach approximately 31-32 GB at runtime, leaving about 40 GB for KV cache. Note that A100 lacks native FP8 Tensor Cores. Using --quantization fp8 on A100 will either error or fall back to FP16. Use INT8 via bitsandbytes instead.
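Whether --quantization fp8 uses native hardware comes down to CUDA compute capability: Hopper (9.0; H100/H200) and Ada Lovelace (8.9; RTX 4090/L40S) have FP8 Tensor Cores, while Ampere (8.0; A100) does not. A small pre-flight check you could run before picking a flag; on a live instance the (major, minor) pair comes from torch.cuda.get_device_capability(), and both helper names here are ours:

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if the GPU's compute capability includes FP8 Tensor Cores.

    Ada Lovelace is 8.9 and Hopper is 9.0+; Ampere (8.0, e.g. A100) lacks them.
    """
    return (major, minor) >= (8, 9)

def pick_quantization(major: int, minor: int) -> str:
    # Mirrors the guidance above: FP8 where the hardware supports it,
    # otherwise INT8 via bitsandbytes.
    return "fp8" if supports_native_fp8(major, minor) else "bitsandbytes"

print(pick_quantization(9, 0))  # H100 -> fp8
print(pick_quantization(8, 0))  # A100 -> bitsandbytes
```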

See H100 GPU rental and A100 GPU rental for current rates.

Qwen3.5-35B-A3B: Single H100 (Efficient MoE)

The 35B-A3B MoE model activates only 3B parameters per token. Total weights (~70 GB FP16 or ~35 GB FP8) fit on a single H100, and per-token compute cost is far lower than the 27B dense model due to sparse activation. This makes it a strong choice when you need high throughput on a single GPU without moving to a multi-GPU setup.

  • H100 80GB: FP8 weights (~35 GB) reach approximately 40-42 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving approximately 30-32 GB for KV cache. Use --enable-expert-parallel in vLLM for better multi-GPU throughput if you scale out.
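The efficiency claim above is easy to quantify: per-token decode compute scales with active parameters (roughly 2 FLOPs per active parameter per token, a standard rule of thumb we are assuming here), so the 35B-A3B does far less work per token than the 27B dense model despite holding more weights:

```python
def decode_flops_per_token(active_params_b: float) -> float:
    """Rough per-token decode FLOPs: ~2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9

dense_27b = decode_flops_per_token(27)  # all 27B params active on every token
moe_35b = decode_flops_per_token(3)     # only 3B of the 35B params active
print(f"MoE does {dense_27b / moe_35b:.0f}x less compute per token")  # 9x
```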

See H100 GPU rental for current rates.

Qwen3.5-397B: 4x to 8x H100 or H200

  • 4x H100 80GB (320 GB total): INT4/AWQ weights (~199 GB file size) consume approximately 230-240 GB at runtime once you account for 15-20% overhead from activations and framework buffers. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 288 GB (320 x 0.9), leaving approximately 48-58 GB for KV cache. Viable for development and low-concurrency inference. Use --tensor-parallel-size 4.
  • 8x H100 80GB (640 GB total): FP8 weights (~397 GB file size) consume approximately 457-476 GB at runtime with activation overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 576 GB (640 x 0.9), leaving approximately 100-119 GB for KV cache. Better throughput and more headroom for production batch serving. Use --tensor-parallel-size 8.
  • 8x H200 141GB (1,128 GB total): The most capable single-node configuration for 397B. FP8 weights (~397 GB) reach approximately 457-476 GB at runtime. With vLLM at 0.9 utilization (1,015 GB cap), this leaves approximately 539-558 GB free for KV cache.
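Tensor parallelism splits the runtime footprint roughly evenly across GPUs, so the per-GPU math behind these configurations can be sketched as follows. This is a rough model only (real shards are not perfectly even and each rank adds its own buffers), and both helpers are ours:

```python
def per_gpu_footprint_gb(weights_gb: float, tp_size: int, overhead: float = 0.2) -> float:
    """Approximate per-GPU runtime footprint under tensor parallelism."""
    return weights_gb * (1 + overhead) / tp_size

def fits(weights_gb: float, tp_size: int, gpu_vram_gb: float,
         util: float = 0.9, overhead: float = 0.2) -> bool:
    # Must fit under vLLM's per-GPU allocation cap, with room left for KV cache.
    return per_gpu_footprint_gb(weights_gb, tp_size, overhead) < gpu_vram_gb * util

print(fits(397, 8, 80))  # 397B FP8 on 8x H100: ~59.6 GB/GPU under the 72 GB cap -> True
print(fits(397, 2, 80))  # 2x H100: ~238 GB/GPU -> False, as noted in "What Won't Work"
```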

See H100 GPU rental and H200 GPU rental for multi-GPU configurations.

What Won't Work

  • Qwen3.5-27B on a single RTX 4090 at FP16: 54 GB weights, 24 GB VRAM. Not possible.
  • Qwen3.5-397B on any consumer GPU: Even at Q4_K_M, 199 GB exceeds any consumer card.
  • Qwen3.5-397B on 2x H100 at FP8: 397 GB model, 160 GB total VRAM. Not enough.
  • Qwen3.5-27B and larger on A100 with FP8: A100 lacks FP8 Tensor Core support. Use INT8 on A100 instead.

VRAM Calculator: Choosing the Right Quantization

The right quantization format depends on your hardware and quality requirements:

| Use Case | Recommended Format | Reason |
| --- | --- | --- |
| H100 / H200 production | FP8 | Native Tensor Core support, minimal quality loss |
| A100 production | INT8 (bitsandbytes) | A100 lacks FP8 hardware; INT8 is the next best option |
| Budget / development | Q4_K_M (GGUF via llama.cpp) | Fits smaller GPUs, some quality tradeoff |
| Highest quality | BF16 / FP16 | Only practical for 9B on a single GPU or 27B on H200 |

BF16 is impractical for 27B on H100. The FP16 weights (~54 GB) plus 15-20% runtime overhead reach approximately 62-65 GB, which technically fits within the 72 GB vLLM cap from --gpu-memory-utilization 0.9 (80 GB x 0.9 = 72 GB). However, this leaves only about 7-10 GB for KV cache, which is insufficient for real concurrent workloads. FP8 cuts the weight footprint in half with negligible quality loss on H100's Tensor Cores.
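The arithmetic in that paragraph, worked out explicitly (15% and 20% are the overhead range quoted throughout this guide):

```python
# Why BF16/FP16 27B on H100 is impractical: the numbers from the paragraph above.
vram, util = 80, 0.9
weights_fp16 = 54
cap = vram * util                     # 72.0 GB vLLM allocation cap
runtime_lo = weights_fp16 * 1.15      # ~62.1 GB with 15% overhead
runtime_hi = weights_fp16 * 1.20      # ~64.8 GB with 20% overhead
kv_lo, kv_hi = cap - runtime_hi, cap - runtime_lo
print(round(kv_lo, 1), round(kv_hi, 1))   # ~7.2 to ~9.9 GB for KV cache -- too little

# Versus FP8: ~27 GB weights leave ~39.6-41.0 GB of KV cache headroom.
print(round(cap - 27 * 1.20, 1), round(cap - 27 * 1.15, 1))
```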

For throughput optimization beyond quantization, the speculative decoding guide covers techniques that can add 2-5x gains on top of any quantization configuration.

Step-by-Step Deployment with vLLM 0.17 on Spheron

Prerequisites

Provision a GPU instance on Spheron matching your model size. For 27B, select a single H100 80GB from the H100 rental page or a single H200 from the H200 rental page. Ensure at least 60 GB of persistent storage for the 27B weights (more for larger variants).

SSH in and verify GPU setup:

bash
nvidia-smi
# Verify GPU count, VRAM, and driver version

Install vLLM 0.17

bash
pip install vllm==0.17.0
# Verify installation
python -c "import vllm; print(vllm.__version__)"

vLLM 0.17 added native support for Qwen 3.5's hybrid architecture, which uses Gated DeltaNet (GDN) as the primary attention mechanism in 75% of layers, with sparse MoE. This is a distinct code path from Qwen 3; vLLM 0.17 specifically added GDN kernel support for this model family. If you encounter a model class not found error, add --trust-remote-code as a fallback and check vLLM's supported models list for the latest Qwen 3.5 status.
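If you are unsure whether an existing environment is new enough for the GDN code path, comparing the installed version against 0.17.0 is enough. A stdlib-only sketch (the helper is ours; on a live box you would pass vllm.__version__):

```python
def meets_min_version(installed: str, minimum: str = "0.17.0") -> bool:
    """Compare dotted version strings numerically (ignores local build tags)."""
    def parse(v: str):
        return tuple(int(part) for part in v.split("+")[0].split(".")[:3])
    return parse(installed) >= parse(minimum)

# On a live box: import vllm; meets_min_version(vllm.__version__)
print(meets_min_version("0.17.0"))  # True
print(meets_min_version("0.16.2"))  # False -- no GDN support for Qwen 3.5
```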

Download Model Weights

Important: Alibaba naming conventions vary between model releases. Qwen 3 uses Qwen/Qwen3-32B (no dot), while earlier series used different formats. Before running these commands, verify the exact repository names at https://huggingface.co/Qwen. The commands below use the expected convention based on prior releases:

bash
# Qwen3.5-9B
huggingface-cli download Qwen/Qwen3.5-9B \
    --local-dir /data/models/qwen3.5-9b

# Qwen3.5-27B (~54 GB at FP16, ~27 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-27B \
    --local-dir /data/models/qwen3.5-27b

# Qwen3.5-35B-A3B (~70 GB at FP16, ~35 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-35B-A3B \
    --local-dir /data/models/qwen3.5-35b-a3b

# Qwen3.5-397B-A17B (~794 GB at FP16, ~397 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-397B-A17B \
    --local-dir /data/models/qwen3.5-397b-a17b

# Qwen3.5-397B-A17B INT4 -- for 4x H100 deployment
# NOTE: No official AWQ checkpoint has been released by the Qwen organization.
# Official quantized checkpoints available: FP8 (Qwen/Qwen3.5-397B-A17B-FP8),
# NVFP4 (nvidia/Qwen3.5-397B-A17B-NVFP4), and community GGUF (unsloth/Qwen3.5-397B-A17B-GGUF).
# The AWQ path below requires a community-quantized model. Verify the exact repo
# name at https://huggingface.co/Qwen before downloading, as none currently exists officially.
# Example using a community AWQ repo (verify availability before use):
huggingface-cli download YOUR_COMMUNITY_REPO/Qwen3.5-397B-A17B-AWQ \
    --local-dir /data/models/qwen3.5-397b-a17b-awq

Use persistent storage to avoid re-downloading on instance restarts. The 27B download takes 10-20 minutes; the 397B takes several hours depending on bandwidth.
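A partially completed download produces confusing model-load errors later, so it is worth confirming the on-disk size before launching. A small stdlib sketch; the helper and the 50 GB threshold are ours, and the ~54 GB expectation for 27B FP16 comes from the size table above:

```python
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under path, in GB (returns 0.0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Example: warn if the 27B FP16 download looks incomplete (expected ~54 GB).
size = dir_size_gb("/data/models/qwen3.5-27b")
if size < 50:
    print(f"Only {size:.1f} GB on disk -- download may be incomplete")
```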

Launch Inference Server

Three configurations covering the main use cases:

bash
# 9B on single GPU (RTX 4090 / L40S)
vllm serve /data/models/qwen3.5-9b \
    --served-model-name Qwen/Qwen3.5-9B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 27B on single H100 (FP8) -- recommended production config
vllm serve /data/models/qwen3.5-27b \
    --served-model-name Qwen/Qwen3.5-27B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 397B MoE on 8x H100 (FP8)
vllm serve /data/models/qwen3.5-397b-a17b \
    --served-model-name Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 397B MoE on 4x H100 (AWQ INT4 / lower cost option)
# Requires a community-quantized AWQ checkpoint -- no official AWQ checkpoint from Qwen exists.
# Replace the path below with the actual local directory from your download step.
vllm serve /data/models/qwen3.5-397b-a17b-awq \
    --served-model-name Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 4 \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --port 8000

The --quantization fp8 flag activates native FP8 Tensor Cores on H100 and H200. On A100, use --quantization bitsandbytes for INT8 instead.
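If you script deployments across variants, it can help to assemble the serve command from the same knobs the examples above vary (model path, quantization, TP size, context length). A sketch: the flag names are the real vLLM CLI flags used in this guide, but the helper itself is ours:

```python
def vllm_serve_cmd(model_path: str, served_name: str, quant: str,
                   tp: int = 1, max_len: int = 32768, port: int = 8000) -> list:
    """Build the argv list for `vllm serve` with the flags used in this guide."""
    cmd = ["vllm", "serve", model_path,
           "--served-model-name", served_name,
           "--quantization", quant,
           "--gpu-memory-utilization", "0.9",
           "--max-model-len", str(max_len),
           "--port", str(port)]
    if tp > 1:
        cmd += ["--tensor-parallel-size", str(tp)]
    return cmd

# 397B FP8 on 8x H100; launch with e.g. subprocess.run(cmd)
cmd = vllm_serve_cmd("/data/models/qwen3.5-397b-a17b", "Qwen/Qwen3.5-397B-A17B",
                     "fp8", tp=8)
print(" ".join(cmd))
```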

Test the API

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3.5-27B", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 512}'

Python OpenAI client example:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Monitor throughput during the test with nvidia-smi dmon -s pum -d 5 in a separate terminal.

Qwen 3.5 vs DeepSeek V3.2 vs Llama 4 Maverick: Performance and Cost

| Model | Params | GPU Config | Hourly Cost (Spheron) | License | Context | Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-27B | 27B | 1x H100 | ~$2.01/hr | Apache 2.0 | 262K | 201 |
| Qwen3.5-397B-A17B | 397B MoE | 8x H100 | ~$16.08/hr | Apache 2.0 | 262K | 201 |
| DeepSeek V3.2 Speciale | 685B MoE | 8x H100 | ~$16.08/hr | MIT | 160K | N/A |
| Llama 4 Maverick | 400B MoE | 8x H100 (FP8) / 4x H100 (INT4) | ~$16.08/hr / ~$8.04/hr | Llama 4 Community | 1M | ~12 |

For full deployment guides on the alternatives, see the DeepSeek V3.2 Speciale deployment guide and the Llama 4 GPU deployment guide.

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Production Optimization: Batch Size, Context Length, and Tensor Parallelism

  1. Batch size tuning: Use --max-num-seqs to control concurrent requests. Start at 32 and increase until nvidia-smi shows consistent 90%+ GPU memory utilization. Watch vllm:kv_cache_usage_perc in the Prometheus metrics endpoint at /metrics; if it stays above 80% you're at or near the KV cache limit.
  2. Context length: A shorter --max-model-len frees KV cache VRAM for more concurrent requests. Recommended starting values: 32768 for 9B and 27B, 16384 for 35B-A3B, and 8192 for 397B on 4x H100 (INT4). Only raise to 262K if your application genuinely requires it.
  3. Tensor parallelism for 397B: --tensor-parallel-size 4 (INT4) halves hardware cost versus --tensor-parallel-size 8 (FP8), but adds a quantization quality tradeoff. For the 397B model, NVLink-connected H100 SXM5 GPUs significantly reduce communication overhead compared to PCIe setups at this level of parallelism. Note that H100 SXM5 is priced at ~$2.40/hr, making an 8x SXM5 config ~$19.20/hr versus the ~$16.08/hr shown in the pricing table below for H100 PCIe. See the vLLM production deployment guide for detailed tensor parallelism tuning and the MIG/time-slicing guide for running multiple smaller models on the same hardware.
  4. Speculative decoding: The speculative decoding production guide covers techniques applicable to Qwen 3.5 that can deliver 2-5x throughput gains by using a small draft model to generate candidate tokens. Most effective for low-concurrency, latency-sensitive workloads.
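The batch-size tuning loop in step 1 can be automated against the /metrics endpoint: sample vllm:kv_cache_usage_perc and adjust --max-num-seqs when usage stays out of band. A sketch of the decision logic only (polling the endpoint and restarting the server are left out; the halving/doubling policy and the 0.5 lower bound are our assumptions layered on the 80% guidance above):

```python
def tune_max_num_seqs(current: int, kv_usage_samples: list) -> int:
    """Suggest a new --max-num-seqs from recent vllm:kv_cache_usage_perc samples.

    Above ~0.8 sustained: at the KV cache limit, back off.
    Below ~0.5 sustained: headroom to admit more concurrent requests.
    """
    avg = sum(kv_usage_samples) / len(kv_usage_samples)
    if avg > 0.8:
        return max(1, current // 2)
    if avg < 0.5:
        return current * 2
    return current

print(tune_max_num_seqs(32, [0.90, 0.85, 0.88]))  # 16 -- near the KV cache limit
print(tune_max_num_seqs(32, [0.30, 0.35, 0.20]))  # 64 -- room to grow
```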

Spheron GPU Pricing for Qwen 3.5 Workloads

| Variant | GPU Config | On-Demand / hr | Spot / hr | Monthly (24/7 on-demand) |
| --- | --- | --- | --- | --- |
| 9B | 1x RTX 4090 | $0.50/hr | N/A | ~$360 |
| 27B | 1x H100 80GB | $2.01/hr | N/A | ~$1,447 |
| 35B-A3B | 1x H100 80GB | $2.01/hr | N/A | ~$1,447 |
| 397B (INT4) | 4x H100 80GB | $8.04/hr | N/A | ~$5,789 |
| 397B (FP8) | 8x H100 80GB | $16.08/hr | N/A | ~$11,578 |
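The monthly column is just the hourly rate times 720 hours (a 30-day month of 24/7 on-demand usage); e.g. the 27B row: $2.01 x 720 ≈ $1,447.

```python
def monthly_cost(hourly_rate: float, hours: int = 720) -> float:
    """24/7 on-demand cost for a 30-day month (720 hours)."""
    return hourly_rate * hours

print(round(monthly_cost(2.01)))   # 1447 -- 1x H100 for the 27B
print(round(monthly_cost(16.08)))  # 11578 -- 8x H100 for the 397B at FP8
```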

For context on cloud alternatives: AWS p4d.24xlarge (8x A100 40GB) lists at approximately $32.77/hr on-demand; Azure ND96asr_v4 (8x A100 80GB) runs approximately $27.20/hr. The 8x H100 configuration on Spheron at ~$16.08/hr runs the full 397B FP8 model at roughly half the cost of comparable hyperscaler options, with no long-term contracts and per-minute billing.

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Troubleshooting

  • OOM on 397B even with 8x H100: The FP8 runtime footprint (~457-476 GB) can hit limits if other processes are consuming VRAM. Reduce --max-model-len first; cutting from 32768 to 8192 can free 20-40 GB of KV cache pre-allocation. If still OOM, switch to a community-quantized AWQ INT4 checkpoint (no official AWQ checkpoint from Qwen exists; check HuggingFace for community quants) with --quantization awq and --tensor-parallel-size 4 on 4x H100 instead.
  • Tensor parallel rank mismatch: Ensure --tensor-parallel-size evenly divides your total GPU count. If you have 8 GPUs and pass --tensor-parallel-size 6, vLLM will error. Common values: 2, 4, 8.
  • CUDA driver mismatch: vLLM 0.17 is built on PyTorch 2.10.0. Running pip install vllm==0.17.0 does not upgrade the CUDA driver. If your driver is outdated, upgrade to a compatible version or provision a new instance with an up-to-date driver. Verify with nvidia-smi and nvcc --version.
  • Slow expert routing on A100: A100 lacks FP8 hardware Tensor Cores. Using --quantization fp8 on A100 will either error or silently fall back to FP16. Switch to INT8 with --quantization bitsandbytes for best A100 performance on Qwen 3.5 MoE models.
  • Model class not found: If vLLM does not recognize the Qwen 3.5 model class, add --trust-remote-code to the serve command. vLLM 0.17 added dedicated GDN support for Qwen 3.5, but older vLLM versions will not have this code path. Ensure you are running vLLM 0.17.0 or later.
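The tensor-parallel rank-mismatch item above reduces to a divisibility check you can run before launching. A trivial guard, assuming only that --tensor-parallel-size must evenly divide the visible GPU count:

```python
def valid_tp_sizes(gpu_count: int) -> list:
    """Tensor-parallel sizes that evenly divide the available GPU count."""
    return [tp for tp in range(1, gpu_count + 1) if gpu_count % tp == 0]

print(valid_tp_sizes(8))  # [1, 2, 4, 8] -- passing 6 would make vLLM error
```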

Qwen 3.5 is available to deploy on Spheron now, with bare metal H100 and H200 nodes ready for the full 397B configuration. No contracts, no waitlists.

Rent H100 → | Rent H200 → | View all pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.