Tutorial

Deploy Qwen 3 on GPU Cloud: Hardware Requirements and Setup Guide

Written by Mitrasish, Co-founder · Mar 19, 2026

Qwen 3 is Alibaba's open-source model family released in April 2025, covering everything from a lightweight 0.6B dense model to a 235B MoE with near-frontier reasoning quality. The 32B dense variant is the production sweet spot: it runs on a single H100 80GB at FP8, delivers competitive coding and instruction-following benchmarks, and costs significantly less to operate than larger MoE architectures. For a full breakdown of VRAM math across model sizes, see our GPU memory requirements guide and the GPU requirements cheat sheet for 2026.

This guide covers the exact hardware requirements for every Qwen 3 variant, step-by-step vLLM deployment, performance benchmarks, and cost analysis.

Qwen 3 Model Variants

| Model | Parameters | Architecture | Context Window | FP16 Size | FP8 Size | INT4 Size |
|---|---|---|---|---|---|---|
| Qwen3-8B | 8B | Dense | 128K | ~16 GB | ~8 GB | ~4 GB |
| Qwen3-14B | 14B | Dense | 128K | ~28 GB | ~14 GB | ~7 GB |
| Qwen3-32B | 32B | Dense | 128K | ~64 GB | ~32 GB | ~16 GB |
| Qwen3-30B-A3B | 30B total / 3B active | MoE | 128K | ~60 GB | ~30 GB | ~15 GB |
| Qwen3-235B-A22B | 235B total / 22B active | MoE | 128K | ~470 GB | ~235 GB | ~117 GB |
| Qwen3-VL-8B | 8B | Dense + Vision | 256K | ~16 GB | ~8 GB | ~4 GB |
| Qwen3-VL-32B | 32B | Dense + Vision | 256K | ~64 GB | ~32 GB | ~16 GB |

One important note on the 235B-A22B MoE: the "22B active" refers to the number of expert parameters activated per forward pass, not the total model size. The full 235B parameter weights must reside in VRAM even though only 22B are active for each token. Do not plan hardware based on the 22B active figure; plan based on the total size column above.

Qwen 3 also ships smaller dense models (0.6B, 1.7B, 4B) with 32K context windows, and additional Qwen3-VL sizes (2B and 4B). This guide focuses on the models most relevant for cloud GPU deployment: 8B and above.
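The weight-size columns above are simple arithmetic: parameter count times bytes per parameter, with the runtime footprint adding the 15-20% activation and framework overhead cited throughout this guide. A minimal sketch of that estimate (the overhead factor is a rule of thumb, not a measured constant):

```python
def weight_size_gb(params_billions: float, bits_per_param: int) -> float:
    """File size of the weights alone: params x bytes per parameter."""
    return params_billions * bits_per_param / 8

def runtime_footprint_gb(params_billions: float, bits_per_param: int,
                         overhead: float = 0.175) -> float:
    """Weights plus ~15-20% activation/framework overhead (midpoint 17.5%)."""
    return weight_size_gb(params_billions, bits_per_param) * (1 + overhead)

# Qwen3-32B at FP16 (16 bits) and FP8 (8 bits)
print(weight_size_gb(32, 16))                  # 64.0 GB -- matches the table
print(weight_size_gb(32, 8))                   # 32.0 GB
print(round(runtime_footprint_gb(32, 8), 1))   # 37.6 GB, inside the ~37-38 GB range cited below
```

Plug in any row of the table to sanity-check a plan before renting hardware.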

GPU Hardware Requirements

Qwen3-8B: Single RTX 4090 or L40S

The 8B model is the most accessible Qwen 3 variant for budget deployments.

  • RTX 4090 (24 GB): FP8 weights (~8 GB) reach ~9.2-9.6 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 21.6 GB (24 x 0.9), leaving ~12 GB for KV cache. Best budget option for development and light production. Note: the RTX 4090 (Ada Lovelace) has native FP8 Tensor Core support and is fully supported for W8A8 FP8 quantization in vLLM 0.8 and later. Throughput is lower than on H100 due to lower memory bandwidth (~1 TB/s vs ~3.35 TB/s on H100 SXM5) and fewer SMs.
  • L40S (48 GB): FP16 weights (~16 GB) reach ~18.4-19.2 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 43.2 GB (48 x 0.9), leaving ~24-25 GB for KV cache. Better throughput than RTX 4090 for production workloads.

See RTX 4090 GPU rental for pricing and availability.

Qwen3-14B: L40S or 2x RTX 4090

  • L40S 48GB: FP8 fits (~14 GB), ideal single-GPU option. FP16 weights (~28 GB) also fit with headroom for KV cache.
  • 2x RTX 4090 (48 GB total): FP16 fits with tensor-parallel-size 2. Works but adds interconnect overhead; the L40S is simpler and usually cheaper.
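If you do take the 2x RTX 4090 route, the launch is the same vllm serve invocation shown later in this guide with tensor parallelism enabled; a sketch, assuming FP16 weights and the same flags used elsewhere in this article:

```shell
vllm serve Qwen/Qwen3-14B \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000
```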

Qwen3-32B: Single H100 or H200 (Recommended Production Config)

This is the configuration most teams should deploy. Qwen3-32B is the largest dense Qwen 3 model and delivers performance competitive with Qwen2.5-72B on most coding and instruction-following benchmarks, on a single GPU.

  • H100 80GB: FP8 weights (~32 GB) reach ~37-38 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving practical KV cache headroom of ~34-35 GB. Strong throughput for concurrent requests. At $2.01/hr on Spheron on-demand (pricing as of 18 Mar 2026), this is the best cost-quality ratio in the Qwen 3 lineup.
  • H200 141GB: FP8 and FP16 both fit on a single GPU with substantial KV cache headroom. Better for latency-sensitive single-stream workloads and extended context lengths.
  • A100 80GB: Use INT8 on A100. FP16 is not viable for production: the 32B FP16 weights (~64 GB) plus 15-20% activation and framework overhead reach ~73.6-76.8 GB at runtime, which exceeds the 72 GB cap from --gpu-memory-utilization 0.9 (80 GB x 0.9). Raising --gpu-memory-utilization to 0.97-0.98 avoids the OOM but leaves essentially no KV cache headroom, making the deployment impractical under real load. INT8 weights (~32 GB) reach ~37-38 GB at runtime, fitting with ~34-35 GB of KV cache headroom. Note that A100 also lacks native FP8 Tensor Cores, so FP8 is not an option. Throughput is lower than H100.
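The KV cache headroom figures above translate into a concrete context budget. Per token, the cache stores one key and one value vector per layer per KV head. Assuming Qwen3-32B's published shape (64 layers, 8 KV heads via GQA, head dim 128 — verify against the model's config.json before relying on these numbers), a rough sketch:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """2 (key + value) x layers x kv_heads x head_dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen3-32B shape: 64 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
per_token = kv_bytes_per_token(64, 8, 128)      # 262,144 bytes = 256 KB per token
headroom_gb = 34                                # H100 FP8 headroom from the bullet above
tokens = headroom_gb * 1024**3 // per_token
print(per_token, tokens)                        # ~139K tokens of KV cache, shared across all concurrent requests
```

That budget is shared across all in-flight requests, which is why --max-model-len matters so much for concurrency.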

See H100 GPU rental for current rates.

Qwen3-30B-A3B: Single H100 (Efficient MoE Alternative)

The 30B-A3B MoE model activates only 3B parameters per token. Total weights (~60 GB FP16 or ~30 GB FP8) fit on a single H100, and per-token compute cost is far lower than the 32B dense model due to sparse activation. This makes it a strong choice when you need high throughput on a single GPU.

  • H100 80GB: FP8 weights (~30 GB) reach ~34.5-36 GB at runtime with 15-20% activation and framework overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 72 GB, leaving practical KV cache headroom of ~36-37 GB. Use --enable-expert-parallel in vLLM for better multi-GPU throughput if you scale out.

Qwen3-235B-A22B: 4x to 8x H100

  • 4x H100 80GB (320 GB total): INT4 weights (~117 GB file size) consume ~135-140 GB at runtime once you account for the 15-20% overhead from activations and framework buffers. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 288 GB (320 × 0.9), leaving ~148-153 GB for KV cache. Viable for development and low-concurrency inference.
  • 8x H100 80GB (640 GB total): FP8 weights (~235 GB file size) consume ~270-280 GB at runtime with activation overhead. With --gpu-memory-utilization 0.9, vLLM caps total allocation at 576 GB (640 × 0.9), leaving ~296-306 GB for KV cache. Better throughput and more headroom for production batch serving.

See H100 GPU rental for multi-GPU configurations.

What Won't Work

  • Qwen3-32B on a single RTX 4090 at FP16: 64 GB weights, 24 GB VRAM. Not possible.
  • Qwen3-235B on any consumer GPU: Even at INT4, 117 GB exceeds any consumer card.
  • Qwen3-235B on 2x H100 at FP8: 235 GB model, 160 GB total VRAM. Not enough.
  • Qwen3-32B and larger on A100 with FP8: A100 lacks FP8 Tensor Core support. Use INT8 on A100. FP16 is also impractical: the runtime footprint (~73.6-76.8 GB) exceeds the 72 GB cap from --gpu-memory-utilization 0.9, leaving no KV cache headroom for real workloads.

Step-by-Step vLLM Deployment

Prerequisites

Provision a GPU instance on Spheron matching your model size. For 32B, select a single H100 80GB from the H100 rental page or a single H200 from the H200 rental page. Ensure at least 80 GB of persistent storage for the 32B weights (more for larger variants).

SSH in and verify GPU setup:

```bash
nvidia-smi
# Verify GPU count, VRAM, CUDA 12.9+, and driver 575+
```

Install Dependencies

```bash
pip install vllm --upgrade
# Verify CUDA and driver versions
nvidia-smi
python -c "import vllm; print(vllm.__version__)"
```

No model-specific kernels are required. vLLM supports all Qwen 3 text models natively in version 0.8.4 and later; Qwen3-VL (vision-language variants) requires vLLM 0.11.0 or later. Thinking mode (enable_thinking) requires vLLM 0.9.0 or later. The current stable release is 0.17.1 as of March 2026.

Download Model Weights

```bash
huggingface-cli download Qwen/Qwen3-32B \
    --local-dir /data/models/qwen3-32b
```

The 32B FP16 weights are approximately 64 GB. Download takes 10-30 minutes depending on your network bandwidth. Use persistent storage so you don't re-download on instance restarts.

For other sizes, substitute the repo name: Qwen/Qwen3-8B, Qwen/Qwen3-14B, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-235B-A22B.

Launch vLLM Server

Three deployment configurations covering the main use cases:

```bash
# 8B on single GPU (RTX 4090 / L40S)
vllm serve Qwen/Qwen3-8B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 32B on single H100 (FP8) -- recommended production config
# (to serve the weights downloaded earlier, pass the local path
#  /data/models/qwen3-32b instead of the repo name)
vllm serve Qwen/Qwen3-32B \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 235B MoE on 8x H100
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000
```

The --quantization fp8 flag is available in vLLM 0.7.x and later. On H100 and H200 GPUs, this activates native FP8 Tensor Cores.

Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-32B", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 512}'
```

Qwen 3 also works with the Python OpenAI client. Here are both standard mode and thinking mode examples:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Standard mode
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=1024
)

# Thinking mode (extended chain-of-thought)
# Note: thinking mode outputs a hidden <think>...</think> block before the final answer.
# Token count (and cost per request) is significantly higher in thinking mode.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Solve: if 3^x = 81, what is x?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    max_tokens=2048
)
```

Thinking mode activates extended chain-of-thought reasoning before the model produces its final answer. Since you are self-hosting, there is no per-token bill, but the reasoning tokens still consume GPU compute and increase latency and per-request token counts. For math, coding, and logic tasks, thinking mode often produces better results. For straightforward instruction following, standard mode is faster and cheaper.
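Depending on your vLLM version and whether the qwen3 reasoning parser is enabled, the reasoning may arrive either in a separate reasoning_content field or inline as a <think>...</think> block in the message content. A hedged client-side sketch for the inline case:

```python
import re

def split_thinking(content: str) -> tuple[str, str]:
    """Separate an inline <think>...</think> block from the final answer.

    Returns (reasoning, answer); reasoning is "" when no think block is present.
    """
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if not match:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>3^4 = 81, so x = 4.</think>\nx = 4")
print(answer)  # "x = 4"
```

If the parser is active and the reasoning lands in reasoning_content instead, this helper simply returns the content unchanged.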

Performance Benchmarks

Estimated single-stream throughput on common hardware configurations using vLLM with PagedAttention:

| Model | GPU Config | Precision | Throughput (tok/s) | TTFT (ms) |
|---|---|---|---|---|
| Qwen3-8B | 1x RTX 4090 | FP8 | ~85-100 | ~80 |
| Qwen3-8B | 1x H100 | FP16 | ~180-200 | ~40 |
| Qwen3-32B | 1x H100 | FP8 | ~65-80 | ~90 |
| Qwen3-32B | 1x H200 | FP8 | ~90-110 | ~70 |
| Qwen3-32B | 1x B200 | FP8 | ~140-170 | ~50 |
| Qwen3-235B-A22B | 8x H100 | FP8 | ~20-30 | ~200 |

These numbers reflect single-stream (one concurrent request) performance. Production batch serving with PagedAttention can multiply effective throughput by 4-10x for concurrent requests. If your application handles many simultaneous users, the batch throughput matters more than single-stream numbers.
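A quick way to see the batch effect on your own deployment is to fire N identical requests concurrently and compare aggregate tokens/sec against the single-stream number. A minimal stdlib-only harness sketch, assuming the endpoint and model name from the deployment above:

```python
import concurrent.futures
import json
import time
import urllib.request

API = "http://localhost:8000/v1/chat/completions"

def one_request(prompt: str) -> int:
    """Send one chat completion and return its completion token count."""
    body = json.dumps({
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(API, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def aggregate_tok_s(token_counts: list[int], wall_seconds: float) -> float:
    """Batch throughput: total generated tokens over total wall-clock time."""
    return sum(token_counts) / wall_seconds

def run_load_test(n: int = 16) -> float:
    """Fire n concurrent requests and return aggregate tok/s."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        counts = list(pool.map(one_request, ["Summarize PagedAttention."] * n))
    return aggregate_tok_s(counts, time.monotonic() - start)

# Requires the vLLM server above to be running:
# print(f"{run_load_test():.1f} tok/s aggregate")
```

If aggregate throughput at 16 concurrent requests is only marginally above single-stream, check that KV cache headroom (not compute) isn't the bottleneck.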

Cost Analysis: Qwen3-32B on H100 vs H200 vs B200

Spheron on-demand pricing for common Qwen3-32B configurations (pricing as of 18 Mar 2026 and subject to change based on GPU availability):

| Configuration | GPUs | Hourly Rate | Monthly (24/7) |
|---|---|---|---|
| 32B on 1x H100 (On-demand) | 1x H100 80GB | $2.01/hr | ~$1,447 |
| 32B on 1x H100 (Spot) | 1x H100 80GB | $0.99/hr | ~$713 |
| 32B on 1x H200 (On-demand) | 1x H200 141GB | $4.54/hr | ~$3,269 |
| 32B on 1x B200 (On-demand) | 1x B200 | $6.03/hr | ~$4,342 |
| 32B on 1x B200 (Spot) | 1x B200 | $2.25/hr | ~$1,620 |

See current GPU pricing for live rates before committing to a configuration.

For high-volume teams, self-hosted Qwen3-32B on H100 undercuts API pricing for comparable-quality models. At $2.01/hr on-demand (~$1,447/month) versus $3-10 per million tokens for frontier API equivalents, break-even falls between roughly 145M and 480M tokens per month; on spot pricing (~$713/month) it drops to roughly 71M-240M. Above those volumes, self-hosting wins outright, and you also gain full control over latency, quantization, and data residency.
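The break-even point is straightforward arithmetic; a sketch using the on-demand and spot rates from the table above and an assumed API rate (plug in your own numbers):

```python
def breakeven_tokens_millions(gpu_hourly: float, api_per_million: float,
                              hours: float = 720) -> float:
    """Monthly tokens (in millions) where self-hosting cost equals API cost."""
    return gpu_hourly * hours / api_per_million

# H100 on-demand ($2.01/hr) vs a $10-per-million-token API
print(round(breakeven_tokens_millions(2.01, 10.0)))   # 145 (M tokens/month)
# H100 spot ($0.99/hr) vs the same API rate
print(round(breakeven_tokens_millions(0.99, 10.0)))   # 71 (M tokens/month)
```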

When to Use Qwen 3 vs Other Models

Use Qwen3-32B when: your workload is general instruction following, coding, tool calling, or structured output generation. It's cost-competitive, widely supported in vLLM, and performs well on standard coding and reasoning benchmarks. The single H100 configuration at $2.01/hr on-demand is one of the best value propositions in open-source LLM deployment.

Use Qwen3-30B-A3B when: you need high throughput on a budget. The MoE architecture activates only 3B parameters per token, so inference is fast while the model quality sits close to the 32B dense variant. It runs on a single H100 with room to spare.

Use Qwen3-235B when: you need near-frontier quality and your budget allows the 8x GPU setup. The MoE architecture activates only 22B parameters per token, so throughput is better than a dense 235B model would suggest, but you still need the full 235B weights in VRAM.

Use DeepSeek V3.2 Speciale instead when: advanced math reasoning and multi-step logical problem solving are your primary use case. V3.2 Speciale leads on math benchmarks but requires 8x H100 at minimum (approximately $16/hr on-demand) versus a single H100 ($2.01/hr) for Qwen3-32B. See the full DeepSeek V3.2 Speciale deployment guide for hardware details.

Use Llama 4 Scout instead when: you need a 10M token context window, or you want MoE-based deployment of a strong general-purpose model. See our Llama 4 GPU deployment guide for comparison.

Use Kimi K2.5 instead when: multimodal coding with image or video input is a core requirement. See the Kimi K2.5 guide for that use case.

Troubleshooting

  • OOM on model load: Reduce --max-model-len first; it directly controls KV cache pre-allocation. If still OOM, switch to a pre-quantized AWQ or GPTQ model variant (e.g., search HuggingFace for Qwen3-8B-AWQ) and use --quantization awq or --quantization gptq to reduce model weight memory. Note: int4 is not a valid vLLM --quantization value; use awq, gptq, fp8, or bitsandbytes instead.
  • Slow TTFT on 32B: Verify FP8 quantization is active. Run nvidia-smi during a request and confirm the GPU shows high utilization (70%+). If utilization is low, the --quantization fp8 flag may not have been applied or the GPU lacks native FP8 Tensor Core support (A100 does not have hardware FP8 Tensor Cores; use INT8 on A100 instead). RTX 4090 has FP8 support but lower throughput than H100 due to memory bandwidth constraints.
  • CUDA driver mismatch: vLLM 0.17.x pre-built wheels require CUDA 12.9 (based on PyTorch 2.10.0); older CUDA versions require building from source. Running pip install vllm --upgrade only updates the vLLM Python package; it does not upgrade the CUDA driver or toolkit. If your driver is outdated, you need to upgrade at the system level first: on Debian/Ubuntu, add the NVIDIA CUDA repository and run apt install cuda-toolkit-12-9, or provision a new instance that already has the required driver. After the driver is upgraded, also update vLLM with pip install vllm --upgrade. Verify the driver and CUDA version with nvidia-smi, and the toolkit version with nvcc --version.
  • Thinking mode not working: Ensure vLLM version is 0.9.x or later (python -c "import vllm; print(vllm.__version__)"). The enable_thinking flag must be passed as extra_body={"chat_template_kwargs": {"enable_thinking": True}}, not directly in extra_body. Versions before 0.9.0 had compatibility issues with enable_thinking=False; 0.9.0 added the dedicated qwen3 reasoning parser.

Qwen 3 models are available on Spheron now: no waitlist, no contracts. Rent an H100 or RTX 4090, set up vLLM in minutes, and run your own inference server on bare metal.

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models. Ready when you are.