Qwen3.5-Omni, released March 30, 2026, is Alibaba's first model to unify text, audio, image, and video understanding with text and speech generation in a single MoE architecture (30B total parameters, 3B active per token). Unlike pipeline approaches that chain ASR, an LLM, and TTS as separate components, Qwen3.5-Omni processes all modalities in one inference pass and outputs both a text response and synthesized speech simultaneously. It is expected to follow the Qwen team's Apache 2.0 licensing pattern; verify the license on the official model card before use. At FP8, it fits on a single 80GB datacenter GPU with good KV cache headroom.
For context on related deployments: for text and image/video inference without audio support, see the Qwen 3.5 deployment guide. For vision-only models without audio support, see Deploy Vision Language Models on GPU Cloud. For real-time voice AI pipelines combining ASR, LLM, and TTS as separate components, see the Voice AI GPU Infrastructure guide.
What Is Qwen3.5-Omni
Qwen3.5-Omni is a 30B MoE model (30B total parameters, 3B active per token) with a unified multimodal encoder and a dual-output decoder. It can accept any combination of text, speech, images, and video frames as a single input sequence and generate a text response, a speech response, or both.
The key difference from VLMs like Qwen3-VL is audio. Qwen3-VL handles images and video but cannot process speech input or generate audio output. Qwen3.5-Omni adds a dedicated audio encoder that converts waveforms into audio tokens processed alongside visual and text tokens through the main transformer. On the output side, the Talker component synthesizes speech in parallel with text generation.
Key specs:
- Parameters: 30B total / 3B active per token (MoE)
- Context window: 256K tokens (262,144)
- Input modalities: text, speech, audio, images, video
- Output modalities: text, streaming speech
- License: Apache 2.0 (expected; verify on the official model card before use)
- Architecture: Thinker-Talker with unified multimodal encoder (MoE backbone)
Architecture Deep Dive: Unified Multimodal Pipeline
The model has three main components: a multimodal encoder, a Thinker decoder, and a Talker decoder.
Multimodal encoder. A single encoder handles all input modalities. For audio, it uses a custom Audio Transformer (AuT) trained from scratch, with a 128-channel mel spectrogram input and a convolutional frontend that downsamples the audio features before passing them through transformer layers. This replaced the Whisper-based encoder used in earlier Qwen-Omni generations. For images and video frames, it uses a ViT-style encoder similar to Qwen3-VL. For text, standard token embeddings. All three outputs are projected into the same embedding space and concatenated as a single input sequence to the Thinker.
Thinker decoder. An autoregressive transformer decoder that generates text tokens. This is the reasoning component of the MoE backbone. It sees the full multimodal context and produces a text response that captures both understanding and reasoning steps. Only 3B parameters are active per forward pass, so decode speed is closer to a 3B dense model despite the 30B total weight footprint.
Talker decoder. A smaller streaming decoder that reads the Thinker's output token-by-token and synthesizes audio in parallel, without waiting for the full text response to complete. The Talker adds approximately 1 GB of VRAM overhead and operates concurrently with the Thinker during generation, so audio streaming starts as soon as the first text tokens are ready.
Memory footprint breakdown:
| Component | FP16 VRAM | Notes |
|---|---|---|
| Language model backbone (30B MoE) | ~60 GB | 30B total weights; 3B active per token |
| Audio encoder (AuT) | ~1 GB | Custom Audio Transformer, 128-channel mel |
| Visual encoder (ViT) | ~2 GB | Shared with image and video inputs |
| Talker decoder | ~1 GB | Audio synthesis component |
| Total weights | ~64 GB | Before KV cache |
| KV cache (8K context, FP16) | ~4 GB | Scales with sequence length |
| Total runtime (single request) | ~68 GB | With 8K context on H100 FP16 |
GPU Requirements: VRAM, Memory Bandwidth, and Recommended Configs
The 30B total MoE weights are the primary constraint: at FP16 they require about 60 GB of VRAM, so FP16 inference fits only on 80GB datacenter GPUs. At FP8, weights drop to about 30 GB and the L40S 48GB becomes viable. The secondary constraint is KV cache: audio sequences grow at roughly 7 audio tokens per second, so a 60-second clip generates about 420 tokens before any text is added. The 256K maximum context is generous, but KV cache pressure builds quickly under concurrent load when every request carries audio.
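A back-of-envelope sketch of that growth (the ~7 tokens/sec figure is the approximate rate quoted above, not an exact tokenizer constant):

```python
AUDIO_TOKENS_PER_SEC = 7  # approximate rate cited in the text

def audio_tokens(clip_seconds):
    """Estimated audio tokens a clip contributes to the input sequence."""
    return clip_seconds * AUDIO_TOKENS_PER_SEC

print(audio_tokens(60))   # 60-second clip -> 420 tokens
print(audio_tokens(600))  # 10-minute file -> 4200 tokens
```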
GPU requirements by use case:
| Use case | Recommended GPU | Quantization | Estimated VRAM | Cost/hr (on-demand) |
|---|---|---|---|---|
| Development / testing | L40S 48GB | FP8 | ~38 GB | $0.72 |
| Single-user production | A100 80GB | INT8 (bitsandbytes) | ~38 GB | $1.04 |
| Multi-user production | H100 SXM5 80GB | None (BF16) | ~68 GB | $2.54 |
| Higher throughput | H100 SXM5 80GB | FP8 | ~38 GB | $2.54 |
| Edge / constrained | L40S 48GB | INT4 | ~20 GB | $0.72 |
Pricing fluctuates based on GPU availability. The prices above are based on 10 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
L40S 48GB note. FP16 does not fit on the L40S 48GB: the 30B MoE weights alone are approximately 60 GB. At FP8, the weight footprint drops to about 30 GB, leaving roughly 14 GB for KV cache on a 48GB card. For short audio inputs under 30 seconds and single-user development, L40S at FP8 works. For concurrent production load with longer audio sequences, the limited KV cache headroom becomes a bottleneck. Use an 80GB GPU for production.
Memory bandwidth matters. The Thinker-Talker architecture runs two decoders concurrently during generation, which doubles the memory bandwidth demand compared to a single-decoder model. H100 SXM5 (3.35 TB/s) handles this significantly better than A100 (2.0 TB/s) at the same batch sizes, which translates to measurably lower token generation latency when the Talker is active.
Step-by-Step Deployment with vLLM on Spheron GPU Cloud
Step 1: Provision a GPU Instance
Go to app.spheron.ai and create a new deployment. For FP16 production serving, select H100 SXM5 80GB or A100 80GB. For FP8 development and testing, an L40S 48GB is sufficient. See the Spheron getting started guide for step-by-step provisioning instructions.
Verify CUDA availability after SSH:
```shell
nvidia-smi
# Expected: CUDA >= 12.1, driver >= 530
```
Step 2: Install vLLM
Qwen3.5-Omni requires vLLM v0.17.0 or later, which includes important bug fixes for mixed-modality and audio cache handling. Initial Qwen-Omni support was added in earlier versions, but v0.17.0+ is required for reliable production use. If you encounter a model class not found error with standard vLLM, check whether your specific model version requires the vLLM-Omni fork instead.
```shell
pip install "vllm>=0.17.0"
python -c "import vllm; print(vllm.__version__)"
```
Step 3: Download Model Weights
```shell
pip install huggingface_hub
huggingface-cli download Qwen/Qwen3.5-Omni \
  --local-dir /data/models/qwen3.5-omni
```
Always verify the exact repository name at huggingface.co/Qwen before running. Alibaba naming conventions have changed across releases (Qwen3 used no dot separator, Qwen3.5 reintroduced it).
Step 4: Launch the Inference Server
FP16 on H100 (recommended for production):
```shell
vllm serve /data/models/qwen3.5-omni \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --max-num-seqs 32 \
  --port 8000 \
  --trust-remote-code
```
FP8 on H100 (higher throughput, minimal quality loss):
```shell
vllm serve /data/models/qwen3.5-omni \
  --dtype bfloat16 \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --port 8000 \
  --trust-remote-code
```
INT8 on A100 80GB (A100 lacks native FP8 Tensor Cores):
```shell
vllm serve /data/models/qwen3.5-omni \
  --dtype float16 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --max-model-len 8192 \
  --port 8000 \
  --trust-remote-code
```
L40S 48GB with context cap (FP8 required):
```shell
vllm serve /data/models/qwen3.5-omni \
  --dtype bfloat16 \
  --quantization fp8 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --trust-remote-code
```
Step 5: Test the Endpoint
Text input:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="/data/models/qwen3.5-omni",
    messages=[
        {"role": "user", "content": "Explain the difference between convolution and attention in neural networks."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
Audio input (base64 WAV):
```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
with open("audio_sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="/data/models/qwen3.5-omni",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}
                },
                {
                    "type": "text",
                    "text": "Transcribe this audio and summarize the main points."
                }
            ]
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
Combined image + audio input:
```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="/data/models/qwen3.5-omni",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"}
                },
                {
                    "type": "audio_url",
                    "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}
                },
                {
                    "type": "text",
                    "text": "The person in the audio is asking about this chart. Answer their question."
                }
            ]
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
Step 6: Monitor and Verify
```shell
# Watch GPU memory and utilization in real time
nvidia-smi dmon -s mu -d 2

# Check vLLM metrics endpoint
curl http://localhost:8000/metrics | grep vllm

# Check model is loaded correctly
curl http://localhost:8000/v1/models | python3 -m json.tool
```
Quantization Options: Running Qwen3.5-Omni on a Single GPU
Quantization lets you trade a small amount of quality for significantly reduced VRAM requirements.
FP8 (H100 and L40S, recommended for production):
H100 and L40S both have native FP8 Tensor Core support. The L40S uses Ada Lovelace architecture with 4th-gen Tensor Cores that support FP8 (733 TFLOPS at FP8 per NVIDIA datasheet). FP8 halves the weight memory with under 2% quality degradation on most benchmarks. Use --quantization fp8 with vLLM. The model weights drop from ~60 GB to ~30 GB, freeing up roughly 30 GB more for KV cache on an 80GB GPU.
INT8 via bitsandbytes (A100 only):
A100 (Ampere architecture) lacks native FP8 hardware support. INT8 quantization via bitsandbytes is the practical alternative for A100. Weights drop to ~30 GB with slightly more quality loss than FP8. Use --quantization bitsandbytes --load-format bitsandbytes.
GGUF / INT4 via llama.cpp:
For development or edge deployments, INT4 quantization via GGUF format compresses the 30B weights to around 15 GB. This fits on a GPU with 24+ GB VRAM. Use the Qwen3.5-Omni GGUF checkpoints if available on Hugging Face, or convert with llama.cpp/convert_hf_to_gguf.py. Note: the audio and video encoders may not be fully supported in all llama.cpp builds; verify multimodal input handling works before deploying in production.
AWQ (any GPU, pre-quantized checkpoints):
If AWQ-quantized checkpoints are published by the Qwen team or the community, these offer clean INT4 quantization with better accuracy than standard INT4. Check the Qwen organization on Hugging Face for -AWQ model variants.
VRAM comparison across quantization formats:
| Format | Weight VRAM | Total (8K ctx) | Suitable GPU |
|---|---|---|---|
| FP16 / BF16 | ~64 GB | ~68 GB | A100 80GB, H100 80GB |
| FP8 | ~30 GB | ~34 GB | L40S 48GB, H100 80GB |
| INT8 | ~30 GB | ~34 GB | A100 80GB |
| INT4 / GGUF | ~15 GB | ~19 GB | Any GPU with 24+ GB VRAM |
Real-Time Inference Latency: Voice, Video, and Text Pipelines
Qwen3.5-Omni has different latency profiles depending on what you feed it.
Text-only input. This is the fastest path. With no audio or visual tokens to process, latency reflects the MoE architecture: 3B active parameters per token means decode throughput is fast despite the 30B total weight footprint. On H100 with FP16, expect TTFT around 30-80ms for short inputs and around 80-150 tokens/sec generation throughput.
Audio input. The audio encoder adds processing time proportional to clip length. A 10-second audio clip adds roughly 20-30ms of encoder latency on H100 before the Thinker starts generating. A 60-second clip adds 80-120ms. If the Talker is enabled for audio output, the first audio chunks stream back as the Thinker generates tokens, so perceived latency stays low even for long outputs.
Video input. Video processing is the most expensive path. Each frame requires ViT encoding (about 1-3ms per frame on H100). For a 5-second video at 2fps (10 frames), expect 15-30ms of encoder preprocessing. At 10fps, that doubles. Most use cases for Qwen3.5-Omni with video send sparse keyframes (1-2fps) rather than dense frame sequences.
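A rough sketch of that frame arithmetic, assuming a flat per-frame encoder cost (2 ms is used here as a midpoint of the 1-3 ms range above; treat it as an estimate, not a measurement):

```python
def video_encoder_latency_ms(duration_s, fps, ms_per_frame=2.0):
    """Estimated ViT preprocessing cost for a video clip on H100."""
    frames = int(duration_s * fps)
    return frames, frames * ms_per_frame

frames, ms = video_encoder_latency_ms(5, 2)    # 10 frames, ~20 ms
dense_frames, dense_ms = video_encoder_latency_ms(5, 10)  # 50 frames, 5x the cost
print(frames, ms, dense_frames, dense_ms)
```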
Real-time latency benchmarks (single request, H100 SXM5 80GB, FP16, vLLM):
| Input type | Encoder latency | TTFT | Throughput |
|---|---|---|---|
| Text only (256 tokens) | 0 ms | ~35 ms | ~130 tok/s |
| Audio 10s clip | ~25 ms | ~60 ms | ~120 tok/s |
| Audio 60s clip | ~100 ms | ~135 ms | ~115 tok/s |
| Image (512px) | ~5 ms | ~40 ms | ~125 tok/s |
| Video 5s at 2fps | ~25 ms | ~60 ms | ~120 tok/s |
| Audio + image | ~30 ms | ~65 ms | ~115 tok/s |
These are single-request measurements. Under concurrent load, throughput scales with continuous batching but TTFT increases as requests queue. Use --max-num-seqs to control the tradeoff between throughput and latency.
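One way to choose a starting --max-num-seqs is to budget KV cache explicitly. The sketch below assumes ~0.5 MB of KV cache per token at FP16, derived from the ~4 GB at 8K context figure in the memory table earlier; both numbers are estimates, so validate against the vLLM metrics endpoint under real load:

```python
KV_MB_PER_TOKEN = 0.5  # estimate: ~4 GB KV cache / 8192 tokens at FP16

def max_num_seqs_estimate(kv_budget_gb, avg_seq_tokens):
    """Rough upper bound on concurrent sequences fitting in a KV cache budget."""
    return int(kv_budget_gb * 1024 // (avg_seq_tokens * KV_MB_PER_TOKEN))

# 16 GB of KV headroom, average request ~2,048 tokens (audio + text)
print(max_num_seqs_estimate(16, 2048))  # -> 16
```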
Cost Comparison: Self-Hosted vs. Multimodal API Providers
Qwen3.5-Omni competes directly with hosted multimodal APIs. Self-hosting on GPU cloud has a different cost structure: you pay for GPU time regardless of whether the GPU is busy.
Monthly cost to serve Qwen3.5-Omni (full-time single GPU):
| GPU | On-demand/hr | Spot/hr | Monthly (on-demand) | Monthly (spot) |
|---|---|---|---|---|
| H100 SXM5 80GB | $2.54 | N/A | ~$1,829 | N/A |
| A100 80GB PCIe | $1.04 | N/A | ~$749 | N/A |
| A100 80GB SXM4 | $1.64 | $0.45 | ~$1,181 | ~$325 |
| L40S 48GB | $0.72 | N/A | ~$518 | N/A |
Pricing fluctuates based on GPU availability. The prices above are based on 10 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Break-even analysis against hosted APIs:
Assume you are processing 1 million audio tokens per day (roughly 40 hours of audio input at ~7 tokens per second). At typical hosted multimodal API pricing of $0.01-0.05 per 1K audio tokens, that is $10-50/day or $300-1,500/month. At those volumes, a dedicated A100 80GB at $749/month is competitive, and an L40S at $518/month is clearly cheaper.
For bursty workloads (heavy usage for a few hours per day), spot instances cut costs significantly: an A100 spot at $0.45/hr means 4 hours of heavy daily use costs about $54/month. Spot instances are interruptible, so pair them with fallback logic that switches to on-demand when spot is unavailable.
Vs. GPT-4o audio: GPT-4o audio input pricing is roughly $0.10/1K audio tokens at current rates. At 100K audio tokens per day (roughly 4 hours of audio at ~7 tokens per second; GPT-4o's audio tokenizer rate differs), that is $10/day or $300/month. A dedicated GPU instance only wins at higher volumes, but gives you data privacy, no rate limits, and consistent latency from co-located compute.
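The break-even arithmetic can be sketched as follows (prices are the on-demand rates from the table above and the quoted hosted-API range; both fluctuate):

```python
def monthly_cost_api(tokens_per_day, usd_per_1k_tokens, days=30):
    """Monthly hosted-API bill for a steady daily token volume."""
    return tokens_per_day / 1000 * usd_per_1k_tokens * days

# 1M audio tokens/day at the quoted $0.01-0.05 per 1K hosted range
low = monthly_cost_api(1_000_000, 0.01)   # ~$300/month
high = monthly_cost_api(1_000_000, 0.05)  # ~$1,500/month

# Dedicated GPUs from the pricing table, run 24/7 for 30 days
a100_monthly = 1.04 * 24 * 30  # ~$749
l40s_monthly = 0.72 * 24 * 30  # ~$518
print(round(low), round(high), round(a100_monthly), round(l40s_monthly))
```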
Production Tips: Continuous Batching, Streaming Output, and Scaling
Continuous batching. vLLM's default continuous batching mode handles audio inputs well. The main thing to tune is --max-num-seqs. Audio inputs produce long token sequences, so each concurrent request consumes more KV cache than a text-only request of the same prompt length. Start with --max-num-seqs 16 and increase until you see GPU memory pressure under load.
```shell
vllm serve /data/models/qwen3.5-omni \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --port 8000 \
  --trust-remote-code
```
Streaming text output. Use stream=True in the OpenAI client for low-latency response streaming:
```python
stream = client.chat.completions.create(
    model="/data/models/qwen3.5-omni",
    messages=[{"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
        {"type": "text", "text": "What did the speaker say?"}
    ]}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Context length management for audio. Long audio sessions accumulate tokens fast. A 10-minute audio file is approximately 4,200 audio tokens (at ~7 tokens per second). If you are running conversations with ongoing audio context, implement a sliding window that truncates the oldest audio tokens while keeping the full text conversation history. The Thinker can reason about text summaries of prior audio content rather than raw audio tokens.
Tensor parallelism for higher throughput. A single GPU is enough for the weights, but if you need higher concurrent throughput, run two instances behind a load balancer rather than splitting one model across GPUs. Tensor parallelism on the 30B MoE model across 2x H100 adds coordination overhead that often reduces throughput rather than increasing it for this architecture.
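If you do run two instances, a client-side round-robin is the simplest stand-in for a real load balancer. This is a sketch; the instance addresses are hypothetical, and production setups should use nginx or a cloud load balancer with health checks instead:

```python
from itertools import cycle

class RoundRobinRouter:
    """Cycles through identical vLLM instance URLs, one per request."""

    def __init__(self, base_urls):
        self._urls = cycle(base_urls)

    def next_url(self):
        return next(self._urls)

router = RoundRobinRouter([
    "http://10.0.0.1:8000/v1",  # instance 1 (hypothetical address)
    "http://10.0.0.2:8000/v1",  # instance 2 (hypothetical address)
])
# Build an OpenAI client with base_url=router.next_url() for each request.
print(router.next_url())
```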
Monitor KV cache utilization. Audio sequences inflate KV cache more than text sequences. Watch the vllm:gpu_cache_usage_perc metric in Prometheus. If it consistently exceeds 80%, reduce --max-num-seqs or add another instance.
```shell
# Watch key metrics
curl -s http://localhost:8000/metrics | grep -E "cache_usage|num_running|num_waiting"
```
Persistent storage for weights. Weights are approximately 60 GB for the FP16 checkpoint downloaded in Step 3 (or ~30 GB for a pre-quantized FP8 checkpoint if available). Always load from persistent storage to avoid re-downloading on instance restarts. On Spheron, mount a persistent volume at /data/models/.
Fallback for spot interruptions. If you are using spot instances, wrap your client code to detect interruptions and fall back to an on-demand instance:
```python
import time
from openai import OpenAI, APIConnectionError

def call_with_fallback(spot_client, ondemand_client, **kwargs):
    try:
        return spot_client.chat.completions.create(**kwargs)
    except APIConnectionError:
        time.sleep(1)
        return ondemand_client.chat.completions.create(**kwargs)
```
Qwen3.5-Omni is one of the most capable open-source multimodal models that fits on a single GPU. If you want to run it without per-token API costs and with full control over your data, Spheron's GPU cloud gives you on-demand H100 and A100 instances ready in under 90 seconds.
