Text-only LLMs and VLMs share a lot of the same architecture, but their VRAM profiles differ in two important ways: the visual encoder adds fixed weight memory that a text-only model does not carry, and each image you process injects hundreds of visual tokens into the KV cache, inflating memory at inference time in a way that scales with image count rather than just sequence length. If you size a VLM deployment like a text model, you will hit OOM errors under load that are hard to diagnose because they appear as request timeouts rather than explicit memory errors. This guide covers GPU sizing and step-by-step vLLM deployment for the three most capable open-source VLMs: Qwen3-VL, Llama 4 Scout in multimodal mode, and InternVL3.
What Makes VLMs Different from Text-Only LLMs
Every VLM has the same basic structure: a visual encoder (typically a Vision Transformer, or ViT) that converts input images into embeddings, and a language model that processes those embeddings alongside text token embeddings. The language model part is essentially the same architecture as a text-only LLM of the same parameter count.
The ViT encoder is the added element, and it contributes VRAM overhead in two ways.
Fixed overhead from encoder weights. The ViT itself has weights that must be loaded into VRAM. For smaller VLMs, this adds 0.5 to 1 GB on top of the language model weights. For larger models with high-resolution support (like InternVL3, which tiles images at multiple scales), the encoder can add 2 to 4 GB. A Qwen3-VL 8B model at FP16 needs roughly 17 GB total, versus about 14 GB for a comparable text-only 7B model.
Dynamic overhead from visual tokens. This is where VLMs surprise people. When you pass an image to the model, the ViT converts it into a sequence of visual tokens, which are concatenated with your text tokens and processed together through the attention layers. The number of visual tokens per image depends on resolution: a 512-pixel image might produce 256 tokens, while a 1024-pixel image produces around 1024 tokens. These all go into the KV cache.
The implication: at 8 concurrent requests each containing one 1024-pixel image, you have 8192 tokens of KV cache just from images, before any text. At 128 concurrent requests, that is over 131K tokens of image-only context. You need to account for this when setting --max-model-len and --max-num-seqs, or your server will silently OOM under real traffic. For techniques to reduce KV cache pressure, see our KV cache optimization guide.
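To make the arithmetic concrete, here is a rough sketch of the KV cache bytes those visual tokens consume. The layer, head, and head-dimension counts below are illustrative placeholders for a 7B-class decoder with grouped-query attention, not measured values; read the real numbers from your model's config.json.

```python
# Back-of-envelope KV cache estimate for image-heavy traffic.
# Layer/head/head-dim values are placeholders for a 7B-class decoder.

def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate K and V tensors stored per layer
    return num_tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

visual_tokens = 1024            # one 1024-pixel image
concurrent_requests = 128
image_tokens_total = visual_tokens * concurrent_requests  # 131,072 tokens

gib = kv_cache_bytes(image_tokens_total) / 2**30
print(f"{image_tokens_total} image tokens ~ {gib:.1f} GiB of KV cache")
# -> 131072 image tokens ~ 16.0 GiB of KV cache
```

Even before any text, 128 concurrent single-image requests can consume on the order of 16 GiB of cache under these assumptions, which is why the sizing flags below matter.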
The other difference worth noting is that VLM throughput is bounded by both ViT inference speed and language model decode speed. On most hardware, the ViT is not the bottleneck at small batch sizes, but at high throughput, encoder preprocessing becomes a factor. The H100's higher compute throughput and memory bandwidth let the visual encoder step finish faster, making it a noticeably better fit than the A100 for image-heavy workloads at the same parameter count.
GPU Memory Requirements by VLM Size
| Model | Parameters | FP16 VRAM (weights) | INT8 VRAM (weights) | Recommended GPU | Notes |
|---|---|---|---|---|---|
| Qwen3-VL 8B | 8B | ~17 GB | ~9 GB | A100 80GB, H100 | 1x GPU production |
| Llama 4 Scout | 109B (17B active) | ~218 GB | ~109 GB | 4x H100 80GB | MoE; vLLM loads all expert weights; INT4 (~55 GB) fits on 1x H100 |
| InternVL3 8B | 8B | ~17 GB | ~10 GB | A100 80GB, H100 | Strong OCR/doc |
| Qwen3-VL 72B | 72B | ~148 GB | ~74 GB | 4x H100 80GB | 1x H100 at INT4 |
| InternVL3 72B | 72B | ~148 GB | ~74 GB | 4x H100 80GB | 1x H100 at INT4 |
VRAM numbers include the ViT encoder. Add 20-30% headroom for KV cache and framework overhead at moderate batch sizes; add more if you expect high image concurrency or long text contexts.
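The table's weight numbers follow from a simple rule of thumb, sketched below. The ViT allowance and headroom fraction are assumptions to tune per model, not measured values.

```python
# Rough weight-memory rule of thumb behind the table above:
# parameters x bytes per parameter, plus a ViT allowance, plus headroom.

def vlm_weight_vram_gb(params_b, bytes_per_param=2.0, vit_gb=1.0, headroom=0.25):
    weights_gb = params_b * bytes_per_param + vit_gb   # weights incl. ViT encoder
    return weights_gb, weights_gb * (1 + headroom)      # (weights, weights + headroom)

base, planned = vlm_weight_vram_gb(8)                   # 8B model at FP16
print(f"8B FP16: ~{base:.0f} GB weights, plan for ~{planned:.0f} GB")
# -> 8B FP16: ~17 GB weights, plan for ~21 GB
```

Swap `bytes_per_param` to 1.0 for INT8 or 0.5 for INT4 to reproduce the other columns approximately.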
For text-only LLM VRAM requirements, see our GPU memory requirements guide.
Deploy Qwen3-VL on GPU Cloud with vLLM
Qwen3-VL is Alibaba's dedicated vision-language model line, separate from the general-purpose Qwen 3.5 MoE covered in the Qwen 3.5 deployment guide. The VL model uses a dedicated ViT encoder with support for high-resolution inputs, and the Hugging Face repositories are separate (Qwen/Qwen3-VL-8B-Instruct, not Qwen/Qwen3.5-7B). Do not mix up VRAM estimates from the text-only post, as the ViT encoder adds overhead that is not accounted for there.
Start by provisioning an H100 or A100 80GB instance from the H100 rental page or A100 rental page. For the 8B model, a single A100 80GB is sufficient for production. For 72B, you need 4x H100 80GB.
Install vLLM:
```shell
pip install "vllm>=0.6.0"
python -c "import vllm; print(vllm.__version__)"
```

If you are evaluating whether to use vLLM or a simpler serving option, see our Ollama vs vLLM comparison for a throughput benchmark and decision guide.
Download model weights:
```shell
# Qwen3-VL 8B
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct \
  --local-dir /data/models/qwen3-vl-8b

# Qwen3-VL 72B (requires ~150 GB storage)
huggingface-cli download Qwen/Qwen3-VL-72B-Instruct \
  --local-dir /data/models/qwen3-vl-72b
```

Always verify the exact repository name on Hugging Face before running the download. Qwen naming conventions have changed across model generations.
Launch the vLLM server (8B):
```shell
vllm serve /data/models/qwen3-vl-8b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --served-model-name qwen3-vl \
  --port 8000
```

For the 72B model on 4x H100:
```shell
vllm serve /data/models/qwen3-vl-72b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4 \
  --served-model-name qwen3-vl-72b \
  --port 8000
```

Test with an image URL:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen3-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Biosphere_2_sunset.jpg/1200px-Biosphere_2_sunset.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Quantization to save VRAM
For A100 deployments or when you need to run the 72B on fewer GPUs, quantization reduces the weight footprint significantly:
```shell
# AWQ quantization (best quality/size tradeoff)
vllm serve /data/models/qwen3-vl-8b \
  --quantization awq \
  --dtype float16 \
  --max-model-len 8192 \
  --port 8000

# FP8 quantization (faster on H100, L40, L4, RTX 4090)
vllm serve /data/models/qwen3-vl-8b \
  --quantization fp8 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --port 8000
```

Note: FP8 quantization requires Hopper (H100/H200) or Ada Lovelace (L4, L40, RTX 4090) GPUs. On A100, use AWQ or INT8 instead. Also note that --quantization awq does not quantize on the fly: it expects a checkpoint that was already quantized with AWQ, so download an AWQ variant of the model and point vllm serve at that path rather than the FP16 weights.
Deploy Llama 4 Scout Multimodal Mode with Image Inputs
The Llama 4 deployment guide covers text-only serving for Scout and Maverick. This section focuses on image input specifically.
Llama 4 Scout uses early-fusion multimodality rather than an adapter-based approach: visual tokens are interleaved with text tokens and flow through the same transformer layers from the start, rather than being injected as a fixed prefix through a projection adapter. In practice, this means Scout is more efficient at processing interleaved text-image sequences, but it also means the visual token budget interacts with the context window differently than in adapter-based models.
Scout's context window supports up to 10M tokens, but for image serving you should cap --max-model-len much lower. At 1024 visual tokens per image and 8 concurrent requests, you need 8K tokens of KV cache just for images before any text. Setting --max-model-len 32768 is a sensible starting point for most image workloads.
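The sizing logic above can be written as a small helper. The 20% headroom factor is an assumption; adjust it for your traffic mix.

```python
# Derive a --max-model-len value from a per-request token budget.
# headroom=0.2 is an assumed safety margin, not a vLLM requirement.

def max_model_len(images_per_request, visual_tokens_per_image, text_budget, headroom=0.2):
    per_request = images_per_request * visual_tokens_per_image + text_budget
    return round(per_request * (1 + headroom))

# One 1024px image (~1024 visual tokens) plus a 4096-token text budget:
print(max_model_len(1, 1024, 4096))  # 6144
```

For Scout-style workloads with several interleaved images per request, raise `images_per_request` and the result climbs quickly toward the 32768 cap suggested above.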
Launch Scout for multimodal serving:
Llama 4 Scout has 109B total parameters. vLLM loads all expert weights at startup, requiring ~218 GB at FP16. You need at least 3x H100 80GB for FP16; use 4x for headroom. For a single H100, quantize to INT4 (~55 GB) with --quantization awq.
```shell
# FP16 on 4x H100 80GB
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --served-model-name llama4-scout \
  --port 8000

# INT4 (AWQ) on 1x H100 80GB
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --dtype float16 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --served-model-name llama4-scout \
  --port 8000
```

Always verify the Hugging Face repository name before downloading. The Scout multimodal variant and the standard instruct variant may use different repo identifiers between releases. As with Qwen, --quantization awq expects a checkpoint already quantized with AWQ, so point vllm serve at a pre-quantized repository if one is available rather than the BF16 weights.
Send a multimodal request:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load image and encode as base64
with open("/path/to/image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama4-scout",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For a full guide to Llama 4 Scout deployment, see the Llama 4 deployment guide.
Deploy InternVL3 for Document Understanding and OCR
InternVL3 is the strongest open-source model for document-specific tasks: PDF layout understanding, OCR on scanned documents, table extraction, and chart analysis. Its training data includes large document datasets that the other models do not match. If your use case involves documents rather than natural images, InternVL3 is usually the right choice.
The --trust-remote-code flag is required. InternVL3 ships custom modeling code that vLLM does not bundle natively, and without the flag vLLM will refuse to load the model. Every InternVL3 command in this section includes it.
InternVL3 uses dynamic resolution tiling: it tiles the input image at multiple scales and processes each tile separately. This gives higher accuracy on documents and charts, but it inflates the visual token count significantly at higher resolutions:
| Input resolution | Tiles | Visual tokens per image |
|---|---|---|
| 448px | 1 | 256 |
| 896px | 4 | 1,024 |
| 1344px | 9 | 2,304 |
| 1792px | 16 | 4,096 |
At 1344px resolution with 8 concurrent requests, you have over 18K tokens of image-only context. Set --max-model-len accordingly and consider capping input resolution at 896px unless you need the extra detail for small-text OCR tasks.
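The tile and token counts in the table follow a simple quadratic rule, sketched below. This is a simplified model: it ignores the extra thumbnail tile that InternVL-style preprocessing can add, so treat the result as a lower bound.

```python
# Visual token count under InternVL3-style tiling: tiles grow quadratically
# with resolution (448px per tile side), 256 visual tokens per tile.
# Simplification: the optional thumbnail tile is not counted.

def internvl3_visual_tokens(resolution_px, tile_px=448, tokens_per_tile=256):
    tiles_per_side = resolution_px // tile_px
    tiles = tiles_per_side ** 2
    return tiles, tiles * tokens_per_tile

for res in (448, 896, 1344, 1792):
    tiles, tokens = internvl3_visual_tokens(res)
    print(f"{res}px -> {tiles} tiles, {tokens} visual tokens")
# 1344px -> 9 tiles, 2304 visual tokens, matching the table row
```

Multiplying the 1344px figure by 8 concurrent requests reproduces the 18K-token estimate above.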
Download and launch InternVL3 8B:
```shell
# Download weights
huggingface-cli download OpenGVLab/InternVL3-8B \
  --local-dir /data/models/internvl3-8b

# Launch vLLM server
vllm serve /data/models/internvl3-8b \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --served-model-name internvl3 \
  --port 8000
```

Document OCR example:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load document image
with open("/path/to/document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="internvl3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document. Preserve the layout and structure.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For 72B at production scale, use 4x H100 80GB:
```shell
vllm serve /data/models/internvl3-72b \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --tensor-parallel-size 4 \
  --served-model-name internvl3-72b \
  --port 8000
```

Benchmarks: Tokens/sec, Images/sec, and Cost per 1K Multimodal Requests
| Model | GPU | Tokens/sec (text) | Images/sec (1024px) | On-Demand Cost | Cost/1K multimodal req |
|---|---|---|---|---|---|
| Qwen3-VL 8B | A100 80GB SXM4 | ~85 | ~2.1 | $1.06/hr | ~$0.15 |
| Qwen3-VL 8B | H100 SXM5 | ~140 | ~3.8 | $2.40/hr | ~$0.33 |
| Llama 4 Scout | 4x H100 SXM5 | ~95 | ~2.4 | $9.60/hr | ~$1.33 |
| InternVL3 8B | A100 80GB SXM4 | ~80 | ~1.9 | $1.06/hr | ~$0.15 |
| InternVL3 72B | 4x H100 SXM5 | ~55 | ~1.0 | $9.60/hr | ~$1.33 |
These throughput numbers are measured at moderate batch sizes with 1024-pixel input images. Real-world performance will vary based on image resolution, concurrent request count, and text output length. Cost per 1K multimodal requests is calculated as: 1,000 requests × 5 seconds per request ÷ 10 concurrent requests ÷ 3,600 seconds/hr × hourly GPU rate. The 5-second wall-clock estimate assumes 10 concurrent requests processed together via continuous batching, including ViT encoding and decoding 200 output tokens per request.
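The cost formula from the note above, written out as code so you can plug in your own request time, concurrency, and GPU rates:

```python
# Cost per 1,000 multimodal requests, per the formula in the note above:
# 1000 requests x seconds per request / concurrency / 3600 s/hr x hourly rate.

def cost_per_1k_requests(seconds_per_request, concurrency, hourly_rate):
    gpu_hours = 1000 * seconds_per_request / concurrency / 3600
    return gpu_hours * hourly_rate

print(f"${cost_per_1k_requests(5, 10, 1.06):.2f}")  # A100 80GB:  $0.15
print(f"${cost_per_1k_requests(5, 10, 9.60):.2f}")  # 4x H100:    $1.33
```

Halving concurrency doubles the cost per request, which is why continuous batching (below) matters so much for VLM economics.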
Pricing fluctuates based on GPU availability. The prices above are current as of 04 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Production Checklist: Batching Images, Context Windows, and Autoscaling
- Cap image resolution on the client side. Resize images to 1024px or less before sending. Each extra pixel at the ViT encoder boundary inflates visual token count and KV cache usage. For most document OCR tasks, 896px gives 95% of the accuracy of 1792px at one-quarter the token cost.
- Size your visual token budget before setting `--max-model-len`. It is a per-request sequence length cap, not a total batch limit. Calculate the per-request budget: `visual_tokens_per_image + max_text_tokens_per_request`. For one 1024px image per request with a 4096-token text budget: 1024 + 4096 = 5120. Add headroom: `--max-model-len 6144`. vLLM sizes its startup memory profile from this value, so over-sizing it leaves less VRAM available for KV cache.
- Use continuous batching. vLLM's continuous batching handles mixed text-and-image requests efficiently. Do not use static batching servers (Triton without dynamic batching, basic Flask endpoints) for VLMs. Static batching wastes GPU time waiting for the batch to fill, which is especially costly when ViT encoding adds per-image latency.
- Monitor the Prometheus metrics endpoint. The vLLM OpenAI server exposes metrics at `/metrics` by default; alert when `vllm:gpu_cache_usage_perc` exceeds 85%. Cache saturation is the primary cause of production OOM events in VLM deployments, and it is preventable if you catch it early.
- Use spot instances for development and batch jobs. H100 spot at $0.80/hr is appropriate for prompt iteration, benchmark runs, and offline batch image processing. Switch to on-demand for latency-sensitive production endpoints where instance preemption would cause user-visible errors. For a broader set of cost-reduction strategies, see the GPU cost optimization playbook.
- Tensor parallelism only, no pipeline parallelism. For multi-GPU VLM serving in vLLM, use `--tensor-parallel-size` matching your GPU count. Pipeline parallelism is not supported for most VLMs in vLLM because the visual encoder cannot be split across pipeline stages; an unsupported combination will fail at startup.
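The first checklist item, client-side resizing, can be implemented with Pillow before building the base64 data URL used in the request examples above. This is a sketch assuming JPEG inputs; `encode_capped` is a hypothetical helper name, not a library function.

```python
# Client-side resize before upload: cap the longer side at 1024px so the
# server never pays for more visual tokens than the task needs.
# Requires Pillow (pip install Pillow).
import base64
import io

from PIL import Image

def encode_capped(path, max_side=1024, quality=90):
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1:  # only downscale, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# data_url = f"data:image/jpeg;base64,{encode_capped('/path/to/image.jpg')}"
```

Drop `max_side` to 896 for document workloads, matching the resolution guidance in the first checklist item.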
For GPU monitoring patterns in production ML deployments, see our GPU monitoring guide.
VLMs benefit most from high-memory-bandwidth GPUs where visual encoder throughput is not the bottleneck. Spheron provides on-demand H100 SXM5 instances from $2.40/hr and A100 80GB from $1.06/hr, with no minimum commitment and provisioning in under 90 seconds.
