Standard prefill-decode disaggregation was designed for text-only LLMs: one compute-dense GPU pool handles prompt processing, one memory-bandwidth-dense pool handles token generation, and a NIXL transfer moves KV cache between them. That two-stage split works well for text. It breaks down for any-to-any multimodal models.
Models like Qwen3.5-Omni process text, audio, images, and video in a single inference pass. They carry dedicated modality encoders (a ViT for images and video, an Audio Transformer for speech) that run before prefill even begins. At low concurrency these encoders are not the bottleneck. At scale, packing encoder inference onto the same GPU pool as prefill computation creates a three-way contention problem: encoder jobs queue behind text prefills, prefill compute waits on encoder completions, and decode throughput degrades because prefill nodes are occupied with encoder work.
vLLM-Omni (a 2026 extension of the main vLLM codebase, sometimes packaged separately as vllm-omni) introduces a third disaggregated stage: an encoder pool that runs independently from prefill and decode. Each stage gets the GPU type that matches its actual compute profile, and the stages communicate via NIXL feature tensor and KV cache transfers. This guide covers the architecture, a full deployment walkthrough on Spheron GPU cloud, GPU sizing per stage, and the scale thresholds where this topology is worth the added complexity.
If you are new to multimodal VLM deployment, start with our vision-language model deployment guide, which covers serving without disaggregation.
Why Text-Only Disaggregation Breaks for Multimodal Models
Text-only disaggregation splits one bottleneck: the prefill phase blocks the decode phase. Separate the two onto different GPU types and both run faster.
Multimodal models have three distinct compute stages with different bottleneck profiles:
| Stage | What it does | Bottleneck | Ideal GPU profile |
|---|---|---|---|
| Encoder (ViT / Audio Transformer) | Converts raw images, video frames, audio waveforms into feature tensors | Throughput-bound at moderate batch sizes, VRAM is secondary | Mid-tier GPU with good FP16/BF16 TFLOPS, e.g. A100 PCIe or L40S |
| Prefill | Runs full transformer attention over prompt tokens + modality feature tokens | Compute-bound (FP8 TFLOPS), scales with prompt length | H100 SXM5, B200 |
| Decode | Autoregressive token generation, reads KV cache every step | Memory-bandwidth-bound | H200 SXM5 (4.8 TB/s HBM3e), A100 80GB arrays |
Two-stage text disaggregation only separates prefill and decode. Encoder inference still happens on the prefill pool. When image concurrency is high, the ViT runs hundreds of forward passes per second on prefill GPUs that should be spending their FP8 TFLOPS on attention computation. Audio inference queues behind image batches, and both queue behind text prefills. The result is encoder-induced head-of-line blocking that limits multimodal throughput even when prefill and decode are correctly split.
The numbers matter. At 128 concurrent requests, each sending one image at 1024px resolution (roughly 1,000 visual tokens per image), the prefill pool has to produce roughly 128,000 visual tokens per batch cycle before the first text prefill token runs. That is a large fraction of each prefill GPU's TFLOPS budget going to encoder work rather than attention.
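The arithmetic behind that estimate can be sketched in a few lines. The figure of about 1,000 visual tokens per 1024px image is an assumption implied by the numbers above; the real count depends on the model's ViT patch size and any token merging, so treat the constant as a placeholder.

```python
# Back-of-envelope estimate of visual tokens produced per batch cycle.
# TOKENS_PER_IMAGE is an assumed value (~1,000 tokens for one 1024px image);
# check the actual figure for your model's vision encoder.
TOKENS_PER_IMAGE = 1_000


def visual_tokens_per_cycle(
    concurrent_requests: int,
    images_per_request: int = 1,
    tokens_per_image: int = TOKENS_PER_IMAGE,
) -> int:
    """Total visual tokens generated across all requests in one batch cycle."""
    return concurrent_requests * images_per_request * tokens_per_image


print(visual_tokens_per_cycle(128))  # 128000
```

Doubling the images per request or the per-image token count scales the total linearly, which is why high-resolution, multi-image traffic hits the encoder bottleneck first.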
vLLM-Omni solves this by extracting encoder inference into its own pool, letting prefill GPUs focus entirely on transformer attention.
vLLM-Omni Architecture: Three Disaggregated Pools
The topology has three pools connected by NIXL data transfers:
User Request
|
v
[Router]
|
+---> [Encoder Pool] (L40S / A100 PCIe)
| ViT / Audio Transformer
| raw image / audio / video -> feature tensors
|
| feature tensors (via NIXL)
|
+---> [Prefill Pool] (H100 SXM5 / B200)
| Full transformer prefill
| prompt + feature tokens -> KV cache
|
| KV cache (via NIXL)
|
+---> [Decode Pool] (H200 SXM5 / A100 80GB)
| Autoregressive decode
| KV cache -> token stream
|
v
Response
The router is an OpenAI-compatible endpoint that accepts multimodal requests. It dispatches raw modality inputs (image bytes, audio waveforms, video frames) to the encoder pool, waits for feature tensors, then dispatches prefill to a prefill worker with those tensors attached, and finally routes decode continuations to a decode worker that holds the generated KV cache.
Each pool is independently scaled. Adding encoder nodes increases encoder throughput without touching prefill or decode. Adding prefill nodes increases prompt processing throughput. Adding decode nodes increases concurrent generation capacity. This is what makes the topology cost-effective at scale: you grow each dimension at the rate that dimension is actually saturating.
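The dispatch order described above can be illustrated with a small, self-contained sketch. This is not the vLLM-Omni router's code: the `Pool` class, the `route` function, and the round-robin worker selection are all hypothetical and exist only to show the encoder, prefill, decode sequence and how each pool can grow independently.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List, Tuple


@dataclass
class Pool:
    """A named group of workers with round-robin selection (an assumed policy)."""
    name: str
    workers: List[str]

    def __post_init__(self) -> None:
        self._rr: Iterator[str] = cycle(self.workers)

    def pick(self) -> str:
        return next(self._rr)


def route(encoder: Pool, prefill: Pool, decode: Pool) -> List[Tuple[str, str]]:
    """Return the ordered (stage, worker) hops for one multimodal request."""
    return [
        ("encode", encoder.pick()),    # raw inputs -> feature tensors
        ("prefill", prefill.pick()),   # prompt + features -> KV cache
        ("decode", decode.pick()),     # KV cache -> token stream
    ]


encoder = Pool("encoder", ["enc-1", "enc-2"])
prefill = Pool("prefill", ["pre-1"])
decode = Pool("decode", ["dec-1", "dec-2"])

for _ in range(2):
    print(route(encoder, prefill, decode))
```

Adding a worker to one `Pool` changes only that stage's fan-out, which mirrors the independent scaling property of the real topology.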
For NIXL KV transfer internals, see the NIXL KV transfer guide.
KV Cache and Feature Transfer: NIXL and Mooncake Backends
Two NIXL transfers happen per request in the three-stage topology, versus one in text-only disaggregation.
Transfer 1: Encoder to Prefill (feature tensors)
The encoder produces modality feature tensors, not a KV cache. These are the per-image or per-audio-segment embeddings that the prefill pool will attend to alongside text tokens. Feature tensors are smaller than KV caches (they don't grow with prompt length), but the transfer must complete before prefill can start. NIXL sends them over RDMA or TCP; on RDMA, this transfer is typically under 2ms for image feature payloads.
Transfer 2: Prefill to Decode (KV cache)
This is the same transfer as text-only disaggregation. The prefill pool produces the full KV cache for the prompt (including attention over the modality features), and NIXL transfers it to a decode worker. Payload size scales with prompt length and model layer count.
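To get a feel for how the KV cache payload scales, the standard per-request size formula can be evaluated directly. The model dimensions below are hypothetical placeholders, not the configuration of any specific model; substitute the layer count, KV head count, and head dimension from your model's config.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_element: int = 2,  # FP16/BF16
) -> int:
    """Size of the KV cache for one request.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# Hypothetical model: 48 layers, 8 KV heads (GQA), head_dim 128, 4,096-token prompt.
size = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, num_tokens=4096)
print(f"{size / 2**20:.0f} MiB")  # 768 MiB
```

Because the size is linear in `num_tokens`, every visual token the encoder adds to the prompt also adds to the payload of the second transfer.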
The three-stage topology therefore pays for two hops where text-only disaggregation pays for one. The compound latency is roughly 3-12ms per request on RDMA (1-4ms for the feature tensors plus 2-8ms for the KV cache) and 20-60ms on TCP (10-30ms per hop). At low concurrency, this overhead is visible in TTFT. At high concurrency, typically above 64 concurrent multimodal requests, the throughput gains from encoder separation outweigh the hop latency.
NIXL backends:
- RDMA (InfiniBand, RoCE): Lowest latency, typically single-digit milliseconds per hop. Requires NIC compatibility and network configuration. Preferred for same-datacenter deployments.
- TCP fallback: Works everywhere, adds latency. Acceptable for development and low-concurrency serving.
Mooncake as alternative: Mooncake provides a distributed KV cache store with better multi-region support and object storage backends. It is a better fit when encoder and prefill pools are in different regions, or when you want to cache encoder outputs across requests for the same image. For single-region deployments, NIXL is simpler. For cross-region or object-storage-backed topologies, consider Mooncake. See LMCache for cross-node KV sharing for a related approach using shared KV stores.
Deployment Walkthrough on Spheron GPU Cloud
Note on vLLM-Omni package status: As of June 2026, vLLM-Omni may be a separate repository or feature branch rather than a merged mainline vLLM release. The commands below use vllm-omni as the package name. Verify the current package name, install path, and flag names against the official vLLM GitHub repository before deploying. If only the package name has changed, the flags (--role, --kv-transfer-config, --nixl-metadata-server) should carry over unchanged.
Step 1: Provision three GPU pools on Spheron
Log in to app.spheron.ai and provision instances for each pool. Place all instances in the same region to minimize NIXL transfer latency.
- Encoder pool: A100 80GB PCIe or L40S 48GB (check current L40S availability at /pricing/). One instance per 30-50 concurrent image/audio streams.
- Prefill pool: H100 SXM5 for cost-effective FP8 throughput, or B200 for maximum compute density. One instance per 2,000 prompt tokens/sec target.
- Decode pool: H200 SXM5 for the highest memory bandwidth (4.8 TB/s HBM3e), or A100 80GB arrays for cost-conscious scaling. Scale out decode nodes for concurrency.
For detailed provisioning steps, see docs.spheron.ai.
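The per-pool sizing rules of thumb above translate into a quick calculation. This sketch only encodes the two ratios stated in the list (30-50 streams per encoder instance, 2,000 prompt tokens/sec per prefill instance); the function names are illustrative, and decode sizing is left out because the guide gives no fixed ratio for it.

```python
import math


def encoder_instances(concurrent_streams: int, streams_per_instance: int = 30) -> int:
    """Encoder instances needed; 30 is the conservative end of the 30-50 range."""
    return math.ceil(concurrent_streams / streams_per_instance)


def prefill_instances(
    target_prompt_tokens_per_sec: int,
    tokens_per_sec_per_instance: int = 2_000,
) -> int:
    """Prefill instances needed for a target prompt-processing rate."""
    return math.ceil(target_prompt_tokens_per_sec / tokens_per_sec_per_instance)


print(encoder_instances(128))                            # conservative estimate
print(encoder_instances(128, streams_per_instance=50))   # optimistic estimate
print(prefill_instances(8_000))
```

Running both ends of the encoder range gives a bracket to provision within, and measuring real encoder queue wait after launch tells you which end applies to your workload.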
Step 2: Install vLLM-Omni and NIXL on all nodes
Run this on every node across all three pools:
pip install vllm-omni nixl
Verify RDMA devices are available:
ibv_devices
If no RDMA devices are listed, NIXL will use TCP fallback. This works but adds latency. For production RDMA deployments, verify RoCE or InfiniBand configuration with your Spheron support contact.
Start the NIXL metadata server on a coordinator node (can be any node in the cluster):
nixl-metadata-server --port 5557
Step 3: Launch encoder pool workers
On each encoder node:
vllm-omni serve <model-name> \
--role encoder \
--kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "encoder_producer"}' \
--nixl-metadata-server <coordinator-ip>:5557 \
--tensor-parallel-size 1 \
--port 8001
The encoder pool accepts raw multimodal inputs (base64 image bytes, audio file paths, video frame sequences) and returns modality feature tensors. It does not run the full transformer; it only runs the ViT or Audio Transformer heads.
Step 4: Launch prefill pool workers
On each prefill node:
vllm-omni serve <model-name> \
--role prefill \
--kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_producer"}' \
--nixl-metadata-server <coordinator-ip>:5557 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--port 8002
Prefill nodes receive encoder feature tensors and text tokens, run the full transformer forward pass, and transfer the resulting KV cache to decode nodes via NIXL. --tensor-parallel-size should match the GPU count on that node.
Step 5: Launch decode pool workers
On each decode node:
vllm-omni serve <model-name> \
--role decode \
--kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_consumer"}' \
--nixl-metadata-server <coordinator-ip>:5557 \
--tensor-parallel-size 8 \
--max-num-seqs 256 \
--port 8003
Set --max-num-seqs to your target concurrency. Decode nodes hold KV caches received from prefill and run autoregressive generation. Each active request occupies one decode slot until generation completes.
Step 6: Deploy the router
vllm-omni-router \
--encoder-endpoints http://<encoder-node-1>:8001,http://<encoder-node-2>:8001 \
--prefill-endpoints http://<prefill-node-1>:8002 \
--decode-endpoints http://<decode-node-1>:8003,http://<decode-node-2>:8003 \
--port 8000
The router exposes an OpenAI-compatible API at port 8000. Test with:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
{"type": "text", "text": "What is in this image?"}
]
}]
}'
Pool sizing config example
pools:
  encoder:
    gpu_type: a100-80g-pcie
    instances: 2
    role: encoder
  prefill:
    gpu_type: h100-sxm5-80g
    instances: 4
    role: prefill
  decode:
    gpu_type: h200-sxm5-141g
    instances: 8
    role: decode
nixl:
  metadata_server: "10.0.0.1:5557"
  transport: rdma
Right-Sizing GPUs Per Stage
The cost efficiency of this topology comes from matching GPU type to stage bottleneck. A decode node does not benefit from H100-class FP8 TFLOPS; it needs memory bandwidth. An encoder node does not need H200's HBM3e; it needs moderate throughput at low cost.
| Stage | Bottleneck | Recommended GPU | Why | Spheron On-Demand | Spheron Spot |
|---|---|---|---|---|---|
| Encoder | FP16 throughput at moderate batch | A100 80G PCIe | Best cost/TFLOPS ratio for encoder batch sizes | $1.43/hr | $1.19/hr |
| Prefill | FP8 TFLOPS, HBM bandwidth for KV writes | H100 SXM5 | 1,979 TFLOPS FP8 (dense), 3.35 TB/s HBM3 | $3.98/hr | $2.91/hr |
| Prefill (max scale) | FP8 TFLOPS | B200 SXM6 | ~2.25x H100 FP8 TFLOPS | $9.30/hr | $5.34/hr |
| Decode | HBM read bandwidth (KV cache reads per token) | H200 SXM5 | 4.8 TB/s HBM3e, highest bandwidth on Spheron | $4.84/hr | $1.77/hr |
| Decode (budget) | HBM read bandwidth | A100 80G PCIe | ~1.9 TB/s HBM2e, cost-effective for smaller-scale decode | $1.43/hr | $1.19/hr |
For encoder nodes, L40S 48GB is also a reasonable option at lower per-hour cost than A100 80G PCIe; check current L40S availability and pricing at /pricing/ since availability fluctuates.
A dedicated H100 SXM5 prefill pool typically handles 3-4x the prefill throughput of the same H100 running a colocated multimodal workload, because the encoder work has been offloaded to the encoder pool and prefill compute is no longer interrupted by ViT inference.
H200 SXM5 decode nodes offer the highest memory bandwidth on Spheron at 4.8 TB/s, which directly determines how many tokens per second each decode node can sustain during autoregressive generation.
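The link between memory bandwidth and decode throughput can be approximated with a simple bandwidth-bound model: each decode step reads the model weights once plus the KV cache of every active sequence. This is an idealized upper bound on a single GPU that ignores compute time, tensor-parallel communication, and scheduling overhead. The weight and KV sizes below are hypothetical placeholders.

```python
def decode_tokens_per_sec(
    bandwidth_bytes_per_sec: float,
    weight_bytes: float,
    kv_bytes_per_seq: float,
    batch_size: int,
) -> float:
    """Idealized bandwidth-bound decode throughput (upper bound, one GPU)."""
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    steps_per_sec = bandwidth_bytes_per_sec / bytes_per_step
    return batch_size * steps_per_sec  # one token per sequence per step


# Hypothetical workload: 60 GB of weights, 0.5 GB of KV cache per sequence, batch of 64.
workload = dict(weight_bytes=60e9, kv_bytes_per_seq=0.5e9, batch_size=64)

h200 = decode_tokens_per_sec(4.8e12, **workload)   # 4.8 TB/s HBM3e
h100 = decode_tokens_per_sec(3.35e12, **workload)  # 3.35 TB/s HBM3

print(f"H200: {h200:,.0f} tok/s, H100: {h100:,.0f} tok/s, ratio {h200 / h100:.2f}x")
```

Under this model the throughput ratio between two GPUs equals their bandwidth ratio, which is why decode pools are chosen on TB/s rather than TFLOPS.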
For a broader guide on matching GPU tiers to inference stage bottlenecks, see the heterogeneous GPU inference guide.
Pricing fluctuates based on GPU availability. The prices above are based on 16 Jun 2026 and may have changed. Check current GPU pricing for live rates.
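To see what a topology costs per hour, the table prices can be combined with instance counts. This is a rough sketch under two labeled assumptions: that the listed prices are per GPU, and that each prefill and decode instance has 8 GPUs (matching --tensor-parallel-size 8 in the walkthrough) while encoder instances have 1. Confirm both against the live pricing page before budgeting.

```python
from typing import Dict, Tuple

# On-demand prices from the table above, assumed to be USD per GPU-hour.
ON_DEMAND_PRICE: Dict[str, float] = {
    "a100-80g-pcie": 1.43,
    "h100-sxm5-80g": 3.98,
    "h200-sxm5-141g": 4.84,
}

# pool -> (gpu_type, instances, gpus_per_instance); gpus_per_instance is an assumption.
POOLS: Dict[str, Tuple[str, int, int]] = {
    "encoder": ("a100-80g-pcie", 2, 1),
    "prefill": ("h100-sxm5-80g", 4, 8),
    "decode": ("h200-sxm5-141g", 8, 8),
}


def hourly_cost(pools: Dict[str, Tuple[str, int, int]], prices: Dict[str, float]) -> float:
    """Total USD per hour across all pools."""
    return sum(
        prices[gpu_type] * instances * gpus_per_instance
        for gpu_type, instances, gpus_per_instance in pools.values()
    )


print(f"${hourly_cost(POOLS, ON_DEMAND_PRICE):.2f}/hr")
```

With these assumptions the decode pool dominates the bill, so that is the pool where right-sizing the instance count pays off most.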
Benchmarks: Throughput and Latency vs. Colocated Serving
These numbers are expected ranges based on a reference methodology: comparing a disaggregated three-pool topology against single-pool vLLM multimodal serving on the same total GPU budget, at varying image concurrency levels.
Throughput gains at 96+ GPUs:
Encoder pool separation eliminates ViT bottlenecks at high image concurrency. Expected gains at 128 concurrent multimodal requests:
- Image-heavy workloads (1+ image per request): 1.5-2.5x multimodal tokens/sec throughput vs. colocated serving on the same GPU budget. The gain is larger when images are high-resolution (1024px+), because ViT inference time scales with resolution.
- Audio-heavy workloads: 1.3-2.0x throughput improvement. Audio Transformer inference is faster than large ViT passes, so the gain is slightly lower.
- Mixed text + image/audio: Gains sit in the 1.4-1.8x range, depending on the fraction of multimodal vs. text-only requests.
Latency overhead:
Two NIXL hops add latency per request. On RDMA:
- Feature tensor transfer (encoder to prefill): 1-4ms
- KV cache transfer (prefill to decode): 2-8ms
- Total per-request NIXL overhead: 3-12ms added to TTFT
On TCP fallback, expect roughly 10-30ms per hop instead. The resulting two-hop overhead of 20-60ms is meaningful for interactive latency SLOs under 100ms.
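The overhead figures above combine by simple addition, which makes it easy to check them against a latency budget. The sketch below uses the per-hop ranges quoted in this guide (1-4ms and 2-8ms on RDMA, 10-30ms per hop on TCP); the helper names are illustrative.

```python
from typing import Tuple

Range = Tuple[float, float]  # (best case ms, worst case ms)


def add_ranges(*hops: Range) -> Range:
    """Sum per-hop latency ranges into a total range."""
    return (sum(lo for lo, _ in hops), sum(hi for _, hi in hops))


def fraction_of_slo(overhead_ms: float, slo_ms: float) -> float:
    """Share of a TTFT budget consumed by transfer overhead."""
    return overhead_ms / slo_ms


RDMA_FEATURE_HOP: Range = (1.0, 4.0)   # encoder -> prefill
RDMA_KV_HOP: Range = (2.0, 8.0)        # prefill -> decode
TCP_HOP: Range = (10.0, 30.0)          # either hop on TCP fallback

rdma_total = add_ranges(RDMA_FEATURE_HOP, RDMA_KV_HOP)
tcp_total = add_ranges(TCP_HOP, TCP_HOP)

print(rdma_total, tcp_total)
print(f"worst-case TCP share of a 100ms SLO: {fraction_of_slo(tcp_total[1], 100):.0%}")
```

Against a 100ms TTFT target, worst-case TCP overhead alone uses more than half the budget, while worst-case RDMA overhead stays close to a tenth of it.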
Break-even point:
NIXL overhead is worth it when throughput gains from encoder separation exceed the latency cost. In practice, this happens above approximately 64 concurrent multimodal requests. Below that threshold, colocated serving with chunked prefill is simpler and has lower TTFT.
For baseline throughput numbers comparing vLLM against other inference frameworks, see the inference framework benchmark and the vLLM MRV2 guide.
When to Use vLLM-Omni Disaggregation vs. Colocated Serving
| Factor | Use colocated serving | Use vLLM-Omni disaggregation |
|---|---|---|
| GPU count | Under 96 GPUs | 96+ GPUs |
| Image/audio concurrency | Low (under 64 concurrent multimodal requests) | High (64+ concurrent multimodal requests) |
| Workload type | Latency-sensitive interactive (chat UI, real-time voice) | Throughput-optimized batch (offline processing, bulk inference) |
| Encoder bottleneck | Not measurable in production metrics | Visible in production: encoder queue wait > 10ms |
| Infrastructure complexity tolerance | Low (simpler is better) | High (team can operate distributed NIXL topology) |
| Cost-per-token target | Secondary concern | Primary optimization target |
The 96-GPU threshold is not a hard rule; it is the typical point where disaggregation overhead is amortized. For a compute-dense prefill workload with very high image concurrency (2+ images per request, 1024px+), the break-even point can be lower. For audio-only workloads with smaller encoder models, it may be higher.
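The decision table can be turned into a checklist that reports which signals point toward disaggregation. This is one illustrative encoding of the quantitative rows only; the function name and the idea of counting signals are assumptions, and the thresholds are the soft guidelines from the table rather than hard rules.

```python
from typing import Dict


def disaggregation_signals(
    gpu_count: int,
    concurrent_multimodal_requests: int,
    encoder_queue_wait_ms: float,
    latency_sensitive: bool,
) -> Dict[str, bool]:
    """Map each table factor to True when it favors three-pool disaggregation."""
    return {
        "gpu_count": gpu_count >= 96,
        "concurrency": concurrent_multimodal_requests >= 64,
        "encoder_bottleneck": encoder_queue_wait_ms > 10,
        "workload_type": not latency_sensitive,
    }


signals = disaggregation_signals(
    gpu_count=128,
    concurrent_multimodal_requests=200,
    encoder_queue_wait_ms=25.0,
    latency_sensitive=False,
)
print(signals, sum(signals.values()), "of", len(signals), "signals favor disaggregation")
```

When only one or two signals fire, the guidance in this section suggests staying colocated and revisiting the question as traffic grows.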
For deployments below this scale, start with chunked prefill in standard vLLM multimodal serving. This breaks long prompts into fixed-size chunks interleaved with decode steps, reducing the decode stalls that long multimodal prompts cause without requiring separate hardware. See chunked prefill for single-node multimodal serving for setup.
Spheron for Heterogeneous Multi-Pool Topologies
Most GPU cloud platforms require homogeneous clusters: you provision one GPU type for the entire deployment. vLLM-Omni's three-pool topology specifically benefits from heterogeneous hardware, since each pool has a different ideal GPU type.
Spheron's marketplace aggregates 5+ providers, which means you can provision A100 PCIe for encoder nodes, H100 SXM5 for prefill nodes, and H200 SXM5 for decode nodes from a single account, in the same region, billed per minute. No vendor lock-in means you can swap GPU tiers per stage as pricing or availability changes: if A100 spot prices drop, shift encoder nodes to spot and keep prefill and decode on on-demand.
This directly maps to how vLLM-Omni's pools should be sized:
- Encoder pool: run on spot. Encoder work is retriable; if a spot instance is reclaimed, the affected request re-encodes on another node. No decode state is lost.
- Prefill pool: on-demand. Prefill interruption mid-request means re-running the full attention pass; the cost of retry makes on-demand preferable for most workloads.
- Decode pool: on-demand. Losing a decode worker mid-generation drops the generation; users see a timeout. On-demand ensures continuity.
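The reason encoder work tolerates spot reclamation is that it is stateless and can simply be retried on another node. A minimal sketch of that retry pattern follows; `WorkerUnavailable`, `encode_with_retry`, and the `encode` callable are hypothetical names for illustration, not part of vLLM-Omni.

```python
from typing import Callable, List, Optional, Sequence


class WorkerUnavailable(Exception):
    """Raised when an encoder node cannot be reached (e.g. spot reclaimed)."""


def encode_with_retry(
    payload: bytes,
    nodes: Sequence[str],
    encode: Callable[[str, bytes], List[float]],
) -> List[float]:
    """Try each encoder node in order until one returns feature values."""
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            return encode(node, payload)
        except WorkerUnavailable as err:
            last_error = err  # node is gone; fall through to the next one
    raise RuntimeError("all encoder nodes failed") from last_error


def fake_encode(node: str, payload: bytes) -> List[float]:
    """Stand-in encoder: pretend enc-1 was reclaimed."""
    if node == "enc-1":
        raise WorkerUnavailable(node)
    return [float(len(payload))]


print(encode_with_retry(b"image-bytes", ["enc-1", "enc-2"], fake_encode))
```

The same pattern does not transfer cleanly to decode, because a decode worker holds per-request KV state that a replacement node would not have.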
To get started, provision the three pools at app.spheron.ai. For API documentation and multi-instance setup, see docs.spheron.ai.
For broader context on inference engineering at scale, see the inference engineering guide.
Running any-to-any multimodal inference at scale means matching GPU hardware to each stage's actual compute profile. Spheron's marketplace lets you mix GPU generations across encoder, prefill, and decode pools from 5+ providers in one account.
Quick Setup Guide
vLLM-Omni is a 2026 extension of the main vLLM codebase. Install the vllm-omni package or a vLLM build that includes multimodal disaggregation support. Verify with: python -c 'import vllm; print(vllm.__version__)'. Check that your target model (e.g. Qwen3.5-Omni or a future any-to-any model) is listed in vLLM-Omni's supported model registry before deploying.
Log in to app.spheron.ai and provision instances for each pool. Encoder pool: L40S 48GB or A100 80GB (one instance per 30-50 concurrent image/audio streams). Prefill pool: H100 SXM5 80GB or B200 (compute-dense, 1 per prefill shard). Decode pool: H200 SXM5 141GB or A100 80GB array (memory-bandwidth-dense, scale out for throughput). All pools should be in the same region for low-latency NIXL transfers.
On every node across all three pools: pip install vllm-omni nixl. Verify RDMA devices with ibv_devices (for InfiniBand or RoCE setups). For TCP-only environments, NIXL falls back to TCP automatically, but transfer latency will be higher. Start the NIXL metadata server on a coordinator node: nixl-metadata-server --port 5557.
On each encoder node, start vLLM-Omni with encoder role: vllm-omni serve <model> --role encoder --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "encoder_producer"}' --nixl-metadata-server <coordinator-ip>:5557. The encoder pool receives raw multimodal inputs (audio files, image URLs, video frames) and produces modality feature tensors that are forwarded to prefill nodes.
On each prefill node: vllm-omni serve <model> --role prefill --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_producer"}' --nixl-metadata-server <coordinator-ip>:5557. Prefill nodes receive encoder outputs and text tokens, run the full transformer forward pass, and transfer the resulting KV cache to decode nodes via NIXL.
On each decode node: vllm-omni serve <model> --role decode --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_consumer"}' --nixl-metadata-server <coordinator-ip>:5557. Set --max-num-seqs to your target concurrency. Decode nodes receive KV caches from prefill workers and run autoregressive generation until the response is complete.
Deploy the vLLM-Omni router process, which accepts OpenAI-compatible multimodal API requests, dispatches encoder work to the encoder pool, waits for feature tensors, dispatches prefill to the prefill pool, and routes decode continuation to a decode worker. The router exposes a single endpoint at port 8000. Test with a curl request that includes an image or audio attachment alongside a text prompt.
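For a scripted smoke test, the same request can be built in Python with only the standard library. The payload shape follows the OpenAI-compatible chat completions format used in the curl example earlier in this guide; the base URL, model name, and image URL are placeholders, and the function name is illustrative.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, image_url: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a multimodal chat completions request."""
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",
    "<model-name>",
    "https://example.com/image.jpg",
    "What is in this image?",
)
print(req.get_method(), req.full_url)

# To send it against a running router:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it keeps the payload easy to inspect and reuse across test cases.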
Frequently Asked Questions
vLLM-Omni is a 2026 extension of vLLM that adds fully disaggregated serving for any-to-any multimodal models - models that accept and emit text, audio, images, and video. It splits the serving pipeline into three independent GPU pools: one for modality encoders, one for prefill, and one for decode, each sized to the compute profile of that stage.
Standard prefill-decode disaggregation assumes that all input processing runs as part of prefill. Multimodal models have dedicated modality encoders (ViT for images/video, Audio Transformer for speech) that have their own GPU compute profile - they are throughput-bound at different batch sizes than text prefill. Mixing encoder and prefill on the same pool caps throughput for both. vLLM-Omni adds a third pool dedicated to modality encoding, letting each stage scale independently.
Encoder pools work well on mid-range GPUs (L40S, A100 PCIe) because encoder inference is throughput-bound at moderate batch sizes and does not need H100-class compute. Prefill pools need maximum TFLOPS - H100 SXM5 or B200 are the natural choices. Decode pools are memory-bandwidth-bound and benefit from H200 SXM5 (4.8 TB/s HBM3e) or large A100 80GB arrays. Right-sizing each pool independently is where the cost-per-token savings come from.
After encoder inference, the modality feature tensors are transferred to the prefill pool via NIXL (NVIDIA Inference Transfer Library) over RDMA or TCP. The prefill pool attends to these features alongside text tokens and produces a KV cache, which is then transferred to the decode pool via NIXL for autoregressive generation. Mooncake is an alternative transfer backend for multi-region or heterogeneous network topologies.
For fewer than 96 GPUs or low concurrent request volumes, colocated multimodal serving with a single vLLM pool is simpler and has less latency overhead from NIXL transfers. Full disaggregation with vLLM-Omni makes sense at 96+ GPUs, consistently high image/video/audio concurrency, or when encoder bottlenecks are measurably limiting throughput in your production metrics.
