Tutorial

Deploy Qwen 3.5 on GPU Cloud: Hardware Requirements and Setup Guide (2026)

Written by Mitrasish, Co-founder | Mar 31, 2026
Tags: Qwen 3.5, Qwen 3.5 GPU, MoE, vLLM, GPU Cloud, LLM Deployment, Open Source AI, Model Deployment

Qwen 3.5 is Alibaba's latest open-source model family, featuring a flagship 397B MoE architecture with support for 201 languages and an Apache 2.0 license. Released in early 2026, it introduces a hybrid architecture in which Gated DeltaNet (GDN) replaces the primary attention mechanism in 75% of layers, combined with sparse MoE. All Qwen 3.5 variants are natively multimodal, supporting text, image, and video input. The 397B-A17B is the flagship. The 9B and 27B dense variants fit on a single consumer or datacenter GPU; the 35B-A3B MoE is one of the most hardware-efficient models in its parameter class; and the 397B requires the same 8x H100 setup as the largest frontier MoE models. The full family also includes smaller dense models (0.8B, 2B, 4B); this guide covers five variants: 9B, 27B, 35B-A3B, 122B-A10B, and 397B-A17B.

For comparison with the previous generation, see our Qwen 3 GPU deployment guide. For VRAM math across model families, see the GPU memory requirements guide and the GPU requirements cheat sheet for 2026.

Qwen 3.5 Model Variants

| Model | Parameters | Architecture | Context Window | FP16 Size | FP8 Size | Q4_K_M Size |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | 9B | Dense | 262K (ext. 1M+) | ~18 GB | ~9 GB | ~5 GB |
| Qwen3.5-27B | 27B | Dense | 262K (ext. 1M+) | ~54 GB | ~27 GB | ~14 GB |
| Qwen3.5-35B-A3B | 35B total / 3B active | MoE | 262K (ext. 1M+) | ~70 GB | ~35 GB | ~18 GB |
| Qwen3.5-122B-A10B | 122B total / 10B active | MoE | 262K (ext. 1M+) | ~244 GB | ~122 GB | ~61 GB |
| Qwen3.5-397B-A17B | 397B total / 17B active | MoE | 262K (ext. 1M+) | ~794 GB | ~397 GB | ~199 GB |

Note on family coverage: The smaller dense variants (0.8B, 2B, 4B) in the full Qwen 3.5 family are not covered here. The 122B-A10B is relevant for multi-GPU deployments where the 35B fits on one GPU but the 397B is too expensive: at FP8, it requires approximately 2x H100 80GB. Hardware requirements for that variant follow the same VRAM math described in this guide.

One important note on the MoE variants: "3B active" and "17B active" refer to the number of expert parameters activated per forward pass, not the total model size. The full 35B and 397B parameter weights must reside in VRAM even though only a fraction of the parameters are active for each token. Do not plan hardware based on active parameters; plan based on the total size columns above.
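That planning rule reduces to simple arithmetic. A minimal sketch, assuming ~2 bytes per parameter at FP16, ~1 at FP8/INT8, and ~0.5 at Q4_K_M (the table sizes above are close to these ratios); the helper name and constants are ours, not an API:

```python
# Rough VRAM planning for MoE models: always size by TOTAL parameters,
# not active parameters. Bytes-per-parameter values are approximations.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "q4_k_m": 0.5}

def weight_footprint_gb(total_params_b: float, fmt: str) -> float:
    """Approximate weight size in GB for a model with total_params_b billion params."""
    return total_params_b * BYTES_PER_PARAM[fmt]

# Qwen3.5-35B-A3B: plan for 35B total, even though only 3B are active per token.
print(weight_footprint_gb(35, "fp16"))   # ~70 GB, matching the table above
print(weight_footprint_gb(397, "fp8"))   # ~397 GB
```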

GPU Hardware Requirements

Qwen3.5-9B: Single RTX 4090 or L40S

The 9B model is the most accessible Qwen 3.5 variant for budget deployments.

  • RTX 4090 (24 GB): FP8 weights (~9 GB) reach approximately 10.4-10.8 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 21.6 GB (24 x 0.9), leaving about 11 GB for KV cache. Best budget option for development and light production workloads. The RTX 4090 has native FP8 Tensor Core support.
  • L40S (48 GB): FP16 weights (~18 GB) reach about 20.7-21.6 GB at runtime. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 43.2 GB (48 x 0.9), leaving about 22-23 GB for KV cache. Better throughput than RTX 4090 for production workloads.
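The headroom numbers in these bullets all follow one formula: vLLM pre-allocates up to vram x --gpu-memory-utilization, and whatever the runtime weight footprint (weights plus 15-20% activation and framework overhead) does not consume is left for KV cache. A minimal sketch of that arithmetic; the function is ours, the 20% overhead figure is the upper end of the range quoted above:

```python
def kv_cache_headroom_gb(vram_gb: float, weights_gb: float,
                         util: float = 0.9, overhead: float = 0.20) -> float:
    """KV cache VRAM left after vLLM's allocation cap and the runtime weight footprint.

    overhead: fractional activation + framework overhead on top of raw weights.
    """
    cap = vram_gb * util                   # vLLM's total allocation cap
    runtime = weights_gb * (1 + overhead)  # weights + activations/buffers
    return cap - runtime

# RTX 4090, 9B at FP8: 24 x 0.9 = 21.6 GB cap, ~10.8 GB runtime -> ~10.8 GB for KV cache
print(round(kv_cache_headroom_gb(24, 9), 1))
# L40S, 9B at FP16: 48 x 0.9 = 43.2 GB cap, ~21.6 GB runtime -> ~21.6 GB for KV cache
print(round(kv_cache_headroom_gb(48, 18), 1))
```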

See RTX 4090 GPU rental for pricing and availability.

Qwen3.5-27B: Single H100 or A100 80GB

  • H100 80GB: FP8 weights (~27 GB) reach approximately 31-32 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving about 40 GB for KV cache. Strong throughput for concurrent requests. The recommended production configuration for the 27B model.
  • H200 141GB: FP8 and FP16 both fit with substantial KV cache headroom. Better for latency-sensitive single-stream workloads and extended context lengths up to the full 262K native window (extensible to 1M+).
  • A100 80GB: Use INT8 on A100. FP16 is not viable: the 27B FP16 weights (~54 GB) plus 15-20% activation and framework overhead reach approximately 62-65 GB at runtime, which works technically but leaves only about 7-10 GB for KV cache after the 72 GB cap from --gpu-memory-utilization 0.9. INT8 weights (~27 GB) reach approximately 31-32 GB at runtime, leaving about 40 GB for KV cache. Note that A100 lacks native FP8 Tensor Cores. Using --quantization fp8 on A100 will either error or fall back to FP16. Use INT8 via bitsandbytes instead.
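Whether --quantization fp8 uses native hardware comes down to CUDA compute capability: Hopper (9.0; H100/H200) and Ada Lovelace (8.9; RTX 4090/L40S) have FP8 Tensor Cores, while Ampere (8.0; A100) does not. A small pre-flight check you could run before picking a flag; on a live instance the (major, minor) pair comes from torch.cuda.get_device_capability(), and both helper names here are ours:

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if the GPU's compute capability includes FP8 Tensor Cores.

    Ada Lovelace is 8.9 and Hopper is 9.0+; Ampere (8.0, e.g. A100) lacks them.
    """
    return (major, minor) >= (8, 9)

def pick_quantization(major: int, minor: int) -> str:
    # Mirrors the guidance above: FP8 where the hardware supports it,
    # otherwise INT8 via bitsandbytes.
    return "fp8" if supports_native_fp8(major, minor) else "bitsandbytes"

print(pick_quantization(9, 0))  # H100 -> fp8
print(pick_quantization(8, 0))  # A100 -> bitsandbytes
```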

See H100 GPU rental and A100 GPU rental for current rates.

Qwen3.5-35B-A3B: Single H100 (Efficient MoE)

The 35B-A3B MoE model activates only 3B parameters per token. Total weights (~70 GB FP16 or ~35 GB FP8) fit on a single H100, and per-token compute cost is far lower than the 27B dense model due to sparse activation. This makes it a strong choice when you need high throughput on a single GPU without moving to a multi-GPU setup.

  • H100 80GB: FP8 weights (~35 GB) reach approximately 40-42 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving approximately 30-32 GB for KV cache. Use --enable-expert-parallel in vLLM for better multi-GPU throughput if you scale out.
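The efficiency claim above is easy to quantify: per-token decode compute scales with active parameters (roughly 2 FLOPs per active parameter per token, a standard rule of thumb we are assuming here), so the 35B-A3B does far less work per token than the 27B dense model despite holding more weights:

```python
def decode_flops_per_token(active_params_b: float) -> float:
    """Rough per-token decode FLOPs: ~2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9

dense_27b = decode_flops_per_token(27)  # all 27B params active on every token
moe_35b = decode_flops_per_token(3)     # only 3B of the 35B params active
print(f"MoE does {dense_27b / moe_35b:.0f}x less compute per token")  # 9x
```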

See H100 GPU rental for current rates.

Qwen3.5-397B: 4x to 8x H100 or H200

  • 4x H100 80GB (320 GB total): INT4/AWQ weights (~199 GB file size) consume approximately 230-240 GB at runtime once you account for 15-20% overhead from activations and framework buffers. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 288 GB (320 x 0.9), leaving approximately 48-58 GB for KV cache. Viable for development and low-concurrency inference. Use --tensor-parallel-size 4.
  • 8x H100 80GB (640 GB total): FP8 weights (~397 GB file size) consume approximately 457-476 GB at runtime with activation overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 576 GB (640 x 0.9), leaving approximately 100-119 GB for KV cache. Better throughput and more headroom for production batch serving. Use --tensor-parallel-size 8.
  • 8x H200 141GB (1,128 GB total): The most capable single-node configuration for 397B. FP8 weights (~397 GB) reach approximately 457-476 GB at runtime. With vLLM at 0.9 utilization (1,015 GB cap), this leaves approximately 539-558 GB free for KV cache.
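Tensor parallelism splits the runtime footprint roughly evenly across GPUs, so the per-GPU math behind these configurations can be sketched as follows. This is a rough model only (real shards are not perfectly even and each rank adds its own buffers), and both helpers are ours:

```python
def per_gpu_footprint_gb(weights_gb: float, tp_size: int, overhead: float = 0.2) -> float:
    """Approximate per-GPU runtime footprint under tensor parallelism."""
    return weights_gb * (1 + overhead) / tp_size

def fits(weights_gb: float, tp_size: int, gpu_vram_gb: float,
         util: float = 0.9, overhead: float = 0.2) -> bool:
    # Must fit under vLLM's per-GPU allocation cap, with room left for KV cache.
    return per_gpu_footprint_gb(weights_gb, tp_size, overhead) < gpu_vram_gb * util

print(fits(397, 8, 80))  # 397B FP8 on 8x H100: ~59.6 GB/GPU under the 72 GB cap -> True
print(fits(397, 2, 80))  # 2x H100: ~238 GB/GPU -> False, as noted in "What Won't Work"
```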

See H100 GPU rental and H200 GPU rental for multi-GPU configurations.

What Won't Work

  • Qwen3.5-27B on a single RTX 4090 at FP16: 54 GB weights, 24 GB VRAM. Not possible.
  • Qwen3.5-397B on any consumer GPU: Even at Q4_K_M, 199 GB exceeds any consumer card.
  • Qwen3.5-397B on 2x H100 at FP8: 397 GB model, 160 GB total VRAM. Not enough.
  • Qwen3.5-27B and larger on A100 with FP8: A100 lacks FP8 Tensor Core support. Use INT8 on A100 instead.

VRAM Calculator: Choosing the Right Quantization

The right quantization format depends on your hardware and quality requirements:

| Use Case | Recommended Format | Reason |
| --- | --- | --- |
| H100 / H200 production | FP8 | Native Tensor Core support, minimal quality loss |
| A100 production | INT8 (bitsandbytes) | A100 lacks FP8 hardware; INT8 is the next best option |
| Budget / development | Q4_K_M (GGUF via llama.cpp) | Fits smaller GPUs, some quality tradeoff |
| Highest quality | BF16 / FP16 | Only practical for 9B on a single GPU or 27B on H200 |

BF16 is impractical for 27B on H100. The FP16 weights (~54 GB) plus 15-20% runtime overhead reach approximately 62-65 GB, which technically fits within the 72 GB vLLM cap from --gpu-memory-utilization 0.9 (80 GB x 0.9 = 72 GB). However, this leaves only about 7-10 GB for KV cache, which is insufficient for real concurrent workloads. FP8 cuts the weight footprint in half with negligible quality loss on H100's Tensor Cores.
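The arithmetic in that paragraph, worked out explicitly (15% and 20% are the overhead range quoted throughout this guide):

```python
# Why BF16/FP16 27B on H100 is impractical: the numbers from the paragraph above.
vram, util = 80, 0.9
weights_fp16 = 54
cap = vram * util                     # 72.0 GB vLLM allocation cap
runtime_lo = weights_fp16 * 1.15      # ~62.1 GB with 15% overhead
runtime_hi = weights_fp16 * 1.20      # ~64.8 GB with 20% overhead
kv_lo, kv_hi = cap - runtime_hi, cap - runtime_lo
print(round(kv_lo, 1), round(kv_hi, 1))   # ~7.2 to ~9.9 GB for KV cache -- too little

# Versus FP8: ~27 GB weights leave ~39.6-41.0 GB of KV cache headroom.
print(round(cap - 27 * 1.20, 1), round(cap - 27 * 1.15, 1))
```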

For throughput optimization beyond quantization, the speculative decoding guide covers techniques that can add 2-5x gains on top of any quantization configuration.

Step-by-Step Deployment with vLLM 0.17 on Spheron

Prerequisites

Provision a GPU instance on Spheron matching your model size. For 27B, select a single H100 80GB from the H100 rental page or a single H200 from the H200 rental page. Ensure at least 60 GB of persistent storage for the 27B weights (more for larger variants).

SSH in and verify GPU setup:

bash
nvidia-smi
# Verify GPU count, VRAM, and driver version

Install vLLM 0.17

bash
pip install vllm==0.17.0
# Verify installation
python -c "import vllm; print(vllm.__version__)"

vLLM 0.17 added native support for Qwen 3.5's hybrid architecture, which uses Gated DeltaNet (GDN) as the primary attention mechanism in 75% of layers, with sparse MoE. This is a distinct code path from Qwen 3; vLLM 0.17 specifically added GDN kernel support for this model family. If you encounter a model class not found error, add --trust-remote-code as a fallback and check vLLM's supported models list for the latest Qwen 3.5 status.
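If you are unsure whether an existing environment is new enough for the GDN code path, comparing the installed version against 0.17.0 is enough. A stdlib-only sketch (the helper is ours; on a live box you would pass vllm.__version__):

```python
def meets_min_version(installed: str, minimum: str = "0.17.0") -> bool:
    """Compare dotted version strings numerically (ignores local build tags)."""
    def parse(v: str):
        return tuple(int(part) for part in v.split("+")[0].split(".")[:3])
    return parse(installed) >= parse(minimum)

# On a live box: import vllm; meets_min_version(vllm.__version__)
print(meets_min_version("0.17.0"))  # True
print(meets_min_version("0.16.2"))  # False -- no GDN support for Qwen 3.5
```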

Download Model Weights

Important: Alibaba naming conventions vary between model releases. Qwen 3 uses Qwen/Qwen3-32B (no dot), while earlier series used different formats. Before running these commands, verify the exact repository names at https://huggingface.co/Qwen. The commands below use the expected convention based on prior releases:

bash
# Qwen3.5-9B
huggingface-cli download Qwen/Qwen3.5-9B \
    --local-dir /data/models/qwen3.5-9b

# Qwen3.5-27B (~54 GB at FP16, ~27 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-27B \
    --local-dir /data/models/qwen3.5-27b

# Qwen3.5-35B-A3B (~70 GB at FP16, ~35 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-35B-A3B \
    --local-dir /data/models/qwen3.5-35b-a3b

# Qwen3.5-397B-A17B (~794 GB at FP16, ~397 GB at FP8)
huggingface-cli download Qwen/Qwen3.5-397B-A17B \
    --local-dir /data/models/qwen3.5-397b-a17b

# Qwen3.5-397B-A17B INT4 -- for 4x H100 deployment
# NOTE: No official AWQ checkpoint has been released by the Qwen organization.
# Official quantized checkpoints available: FP8 (Qwen/Qwen3.5-397B-A17B-FP8),
# NVFP4 (nvidia/Qwen3.5-397B-A17B-NVFP4), and community GGUF (unsloth/Qwen3.5-397B-A17B-GGUF).
# The AWQ path below requires a community-quantized model. Verify the exact repo
# name at https://huggingface.co/Qwen before downloading, as none currently exists officially.
# Example using a community AWQ repo (verify availability before use):
huggingface-cli download YOUR_COMMUNITY_REPO/Qwen3.5-397B-A17B-AWQ \
    --local-dir /data/models/qwen3.5-397b-a17b-awq

Use persistent storage to avoid re-downloading on instance restarts. The 27B download takes 10-20 minutes; the 397B takes several hours depending on bandwidth.
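A partially completed download produces confusing model-load errors later, so it is worth confirming the on-disk size before launching. A small stdlib sketch; the helper and the 50 GB threshold are ours, and the ~54 GB expectation for 27B FP16 comes from the size table above:

```python
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under path, in GB (returns 0.0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Example: warn if the 27B FP16 download looks incomplete (expected ~54 GB).
size = dir_size_gb("/data/models/qwen3.5-27b")
if size < 50:
    print(f"Only {size:.1f} GB on disk -- download may be incomplete")
```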

Launch Inference Server

Three configurations covering the main use cases:

bash
# 9B on single GPU (RTX 4090 / L40S)
vllm serve /data/models/qwen3.5-9b \
    --served-model-name Qwen/Qwen3.5-9B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 27B on single H100 (FP8) -- recommended production config
vllm serve /data/models/qwen3.5-27b \
    --served-model-name Qwen/Qwen3.5-27B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 397B MoE on 8x H100 (FP8)
vllm serve /data/models/qwen3.5-397b-a17b \
    --served-model-name Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 397B MoE on 4x H100 (AWQ INT4 / lower cost option)
# Requires a community-quantized AWQ checkpoint -- no official AWQ checkpoint from Qwen exists.
# Replace the path below with the actual local directory from your download step.
vllm serve /data/models/qwen3.5-397b-a17b-awq \
    --served-model-name Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 4 \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --port 8000

The --quantization fp8 flag activates native FP8 Tensor Cores on H100 and H200. On A100, use --quantization bitsandbytes for INT8 instead.
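If you script deployments across variants, it can help to assemble the serve command from the same knobs the examples above vary (model path, quantization, TP size, context length). A sketch: the flag names are the real vLLM CLI flags used in this guide, but the helper itself is ours:

```python
def vllm_serve_cmd(model_path: str, served_name: str, quant: str,
                   tp: int = 1, max_len: int = 32768, port: int = 8000) -> list:
    """Build the argv list for `vllm serve` with the flags used in this guide."""
    cmd = ["vllm", "serve", model_path,
           "--served-model-name", served_name,
           "--quantization", quant,
           "--gpu-memory-utilization", "0.9",
           "--max-model-len", str(max_len),
           "--port", str(port)]
    if tp > 1:
        cmd += ["--tensor-parallel-size", str(tp)]
    return cmd

# 397B FP8 on 8x H100; launch with e.g. subprocess.run(cmd)
cmd = vllm_serve_cmd("/data/models/qwen3.5-397b-a17b", "Qwen/Qwen3.5-397B-A17B",
                     "fp8", tp=8)
print(" ".join(cmd))
```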

Test the API

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3.5-27B", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 512}'

Python OpenAI client example:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Monitor throughput during the test with nvidia-smi dmon -s pum -d 5 in a separate terminal.

Qwen 3.5 vs DeepSeek V3.2 vs Llama 4 Maverick: Performance and Cost

| Model | Params | GPU Config | Hourly Cost (Spheron) | License | Context | Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-27B | 27B | 1x H100 | ~$2.01/hr | Apache 2.0 | 262K | 201 |
| Qwen3.5-397B-A17B | 397B MoE | 8x H100 | ~$16.08/hr | Apache 2.0 | 262K | 201 |
| DeepSeek V3.2 Speciale | 685B MoE | 8x H100 | ~$16.08/hr | MIT | 160K | N/A |
| Llama 4 Maverick | 400B MoE | 8x H100 (FP8) / 4x H100 (INT4) | ~$16.08/hr / ~$8.04/hr | Llama 4 Community | 1M | ~12 |

For full deployment guides on the alternatives, see the DeepSeek V3.2 Speciale deployment guide and the Llama 4 GPU deployment guide.

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Production Optimization: Batch Size, Context Length, and Tensor Parallelism

  1. Batch size tuning: Use --max-num-seqs to control concurrent requests. Start at 32 and increase until nvidia-smi shows consistent 90%+ GPU memory utilization. Watch vllm:kv_cache_usage_perc in the Prometheus metrics endpoint at /metrics; if it stays above 80% you're at or near the KV cache limit.
  2. Context length: A shorter --max-model-len frees KV cache VRAM for more concurrent requests. Recommended starting values: 32768 for 9B and 27B, 16384 for 35B-A3B, and 8192 for 397B on 4x H100 (INT4). Only raise to 262K if your application genuinely requires it.
  3. Tensor parallelism for 397B: --tensor-parallel-size 4 (INT4) halves hardware cost versus --tensor-parallel-size 8 (FP8), but adds a quantization quality tradeoff. For the 397B model, NVLink-connected H100 SXM5 GPUs significantly reduce communication overhead compared to PCIe setups at this level of parallelism. Note that H100 SXM5 is priced at ~$2.40/hr, making an 8x SXM5 config ~$19.20/hr versus the ~$16.08/hr shown in the pricing table below for H100 PCIe. See the vLLM production deployment guide for detailed tensor parallelism tuning and the MIG/time-slicing guide for running multiple smaller models on the same hardware.
  4. Speculative decoding: The speculative decoding production guide covers techniques applicable to Qwen 3.5 that can deliver 2-5x throughput gains by using a small draft model to generate candidate tokens. Most effective for low-concurrency, latency-sensitive workloads.
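The batch-size tuning loop in step 1 can be automated against the /metrics endpoint: sample vllm:kv_cache_usage_perc and adjust --max-num-seqs when usage stays out of band. A sketch of the decision logic only (polling the endpoint and restarting the server are left out; the halving/doubling policy and the 0.5 lower bound are our assumptions layered on the 80% guidance above):

```python
def tune_max_num_seqs(current: int, kv_usage_samples: list) -> int:
    """Suggest a new --max-num-seqs from recent vllm:kv_cache_usage_perc samples.

    Above ~0.8 sustained: at the KV cache limit, back off.
    Below ~0.5 sustained: headroom to admit more concurrent requests.
    """
    avg = sum(kv_usage_samples) / len(kv_usage_samples)
    if avg > 0.8:
        return max(1, current // 2)
    if avg < 0.5:
        return current * 2
    return current

print(tune_max_num_seqs(32, [0.90, 0.85, 0.88]))  # 16 -- near the KV cache limit
print(tune_max_num_seqs(32, [0.30, 0.35, 0.20]))  # 64 -- room to grow
```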

Spheron GPU Pricing for Qwen 3.5 Workloads

| Variant | GPU Config | On-Demand / hr | Spot / hr | Monthly (24/7 on-demand) |
| --- | --- | --- | --- | --- |
| 9B | 1x RTX 4090 | $0.50/hr | N/A | ~$360 |
| 27B | 1x H100 80GB | $2.01/hr | N/A | ~$1,447 |
| 35B-A3B | 1x H100 80GB | $2.01/hr | N/A | ~$1,447 |
| 397B (INT4) | 4x H100 80GB | $8.04/hr | N/A | ~$5,789 |
| 397B (FP8) | 8x H100 80GB | $16.08/hr | N/A | ~$11,578 |
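The monthly column is just the hourly rate times 720 hours (a 30-day month of 24/7 on-demand usage); e.g. the 27B row: $2.01 x 720 ≈ $1,447.

```python
def monthly_cost(hourly_rate: float, hours: int = 720) -> float:
    """24/7 on-demand cost for a 30-day month (720 hours)."""
    return hourly_rate * hours

print(round(monthly_cost(2.01)))   # 1447 -- 1x H100 for the 27B
print(round(monthly_cost(16.08)))  # 11578 -- 8x H100 for the 397B at FP8
```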

For context on cloud alternatives: AWS p4d.24xlarge (8x A100 40GB) lists at approximately $32.77/hr on-demand; Azure ND96asr_v4 (8x A100 80GB) runs approximately $27.20/hr. The 8x H100 configuration on Spheron at ~$16.08/hr runs the full 397B FP8 model at roughly half the cost of comparable hyperscaler options, with no long-term contracts and per-minute billing.

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Troubleshooting

  • OOM on 397B even with 8x H100: The FP8 runtime footprint (~457-476 GB) can hit limits if other processes are consuming VRAM. Reduce --max-model-len first; cutting from 32768 to 8192 can free 20-40 GB of KV cache pre-allocation. If still OOM, switch to a community-quantized AWQ INT4 checkpoint (no official AWQ checkpoint from Qwen exists; check HuggingFace for community quants) with --quantization awq and --tensor-parallel-size 4 on 4x H100 instead.
  • Tensor parallel rank mismatch: Ensure --tensor-parallel-size evenly divides your total GPU count. If you have 8 GPUs and pass --tensor-parallel-size 6, vLLM will error. Common values: 2, 4, 8.
  • CUDA driver mismatch: vLLM 0.17 is built on PyTorch 2.10.0. Running pip install vllm==0.17.0 does not upgrade the CUDA driver. If your driver is outdated, upgrade to a compatible version or provision a new instance with an up-to-date driver. Verify with nvidia-smi and nvcc --version.
  • Slow expert routing on A100: A100 lacks FP8 hardware Tensor Cores. Using --quantization fp8 on A100 will either error or silently fall back to FP16. Switch to INT8 with --quantization bitsandbytes for best A100 performance on Qwen 3.5 MoE models.
  • Model class not found: If vLLM does not recognize the Qwen 3.5 model class, add --trust-remote-code to the serve command. vLLM 0.17 added dedicated GDN support for Qwen 3.5, but older vLLM versions will not have this code path. Ensure you are running vLLM 0.17.0 or later.
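The tensor-parallel rank-mismatch item above reduces to a divisibility check you can run before launching. A trivial guard, assuming only that --tensor-parallel-size must evenly divide the visible GPU count:

```python
def valid_tp_sizes(gpu_count: int) -> list:
    """Tensor-parallel sizes that evenly divide the available GPU count."""
    return [tp for tp in range(1, gpu_count + 1) if gpu_count % tp == 0]

print(valid_tp_sizes(8))  # [1, 2, 4, 8] -- passing 6 would make vLLM error
```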

Qwen 3.5 is available to deploy on Spheron now, with bare metal H100 and H200 nodes ready for the full 397B configuration. No contracts, no waitlists.

Rent H100 → | Rent H200 → | View all pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.