Text-only LLMs and VLMs share a lot of the same architecture, but their VRAM profiles differ in two important ways: the visual encoder adds fixed weight memory that a text-only model does not carry, and each image you process injects hundreds of visual tokens into the KV cache, inflating memory at inference time in a way that scales with image count rather than just sequence length. If you size a VLM deployment like a text model, you will hit OOM errors under load that are hard to diagnose because they appear as request timeouts rather than explicit memory errors. This guide covers GPU sizing and step-by-step vLLM deployment for the three most capable open-source VLMs: Qwen3-VL, Llama 4 Scout in multimodal mode, and InternVL3.
What Makes VLMs Different from Text-Only LLMs
Every VLM has the same basic structure: a visual encoder (typically a Vision Transformer, or ViT) that converts input images into embeddings, and a language model that processes those embeddings alongside text token embeddings. The language model part is essentially the same architecture as a text-only LLM of the same parameter count.
The ViT encoder is the added element, and it contributes VRAM overhead in two ways.
Fixed overhead from encoder weights. The ViT itself has weights that must be loaded into VRAM. For smaller VLMs, this adds 0.5 to 1 GB on top of the language model weights. For larger models with high-resolution support (like InternVL3, which tiles images at multiple scales), the encoder can add 2 to 4 GB. A Qwen3-VL 8B model at FP16 needs roughly 17 GB total, versus about 14 GB for a comparable text-only 7B model.
Dynamic overhead from visual tokens. This is where VLMs surprise people. When you pass an image to the model, the ViT converts it into a sequence of visual tokens, which are concatenated with your text tokens and processed together through the attention layers. The number of visual tokens per image depends on resolution: a 512-pixel image might produce 256 tokens, while a 1024-pixel image produces around 1024 tokens. These all go into the KV cache.
The implication: at 8 concurrent requests each containing one 1024-pixel image, you have 8192 tokens of KV cache just from images, before any text. At 128 concurrent requests, that is over 131K tokens of image-only context. You need to account for this when setting --max-model-len and --max-num-seqs, or your server will silently OOM under real traffic. For techniques to reduce KV cache pressure, see our KV cache optimization guide.
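To make the arithmetic concrete, here is a rough sketch of the KV cache bytes those visual tokens consume. The layer, head, and head-dimension counts below are illustrative placeholders for a 7B-class decoder with grouped-query attention, not measured values; read the real numbers from your model's config.json.

```python
# Back-of-envelope KV cache estimate for image-heavy traffic.
# Layer/head/head-dim values are placeholders for a 7B-class decoder.

def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate K and V tensors stored per layer
    return num_tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

visual_tokens = 1024            # one 1024-pixel image
concurrent_requests = 128
image_tokens_total = visual_tokens * concurrent_requests  # 131,072 tokens

gib = kv_cache_bytes(image_tokens_total) / 2**30
print(f"{image_tokens_total} image tokens ~ {gib:.1f} GiB of KV cache")
# -> 131072 image tokens ~ 16.0 GiB of KV cache
```

Even before any text, 128 concurrent single-image requests can consume on the order of 16 GiB of cache under these assumptions, which is why the sizing flags below matter.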
The other difference worth noting is that VLM throughput is bounded by both ViT inference speed and language model decode speed. On most hardware, the ViT is not the bottleneck at small batch sizes, but at high throughput, encoder preprocessing becomes a factor. The H100's higher compute throughput and memory bandwidth let the visual encoder step finish faster, making it a noticeably better fit than the A100 for image-heavy workloads at the same parameter count.
GPU Memory Requirements by VLM Size
| Model | Parameters | FP16 VRAM (weights) | INT8 VRAM (weights) | Recommended GPU | Notes |
|---|---|---|---|---|---|
| Qwen3-VL 8B | 8B | ~17 GB | ~9 GB | A100 80GB, H100 | 1x GPU production |
| Llama 4 Scout | 109B (17B active) | ~218 GB | ~109 GB | 4x H100 80GB | MoE; vLLM loads all expert weights; INT4 (~55 GB) fits on 1x H100 |
| InternVL3 8B | 8B | ~17 GB | ~10 GB | A100 80GB, H100 | Strong OCR/doc |
| Qwen3-VL 72B | 72B | ~148 GB | ~74 GB | 4x H100 80GB | 1x H100 at INT4 |
| InternVL3 72B | 72B | ~148 GB | ~74 GB | 4x H100 80GB | 1x H100 at INT4 |
VRAM numbers include the ViT encoder. Add 20-30% headroom for KV cache and framework overhead at moderate batch sizes; add more if you expect high image concurrency or long text contexts.
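The table's weight numbers follow from a simple rule of thumb, sketched below. The ViT allowance and headroom fraction are assumptions to tune per model, not measured values.

```python
# Rough weight-memory rule of thumb behind the table above:
# parameters x bytes per parameter, plus a ViT allowance, plus headroom.

def vlm_weight_vram_gb(params_b, bytes_per_param=2.0, vit_gb=1.0, headroom=0.25):
    weights_gb = params_b * bytes_per_param + vit_gb   # weights incl. ViT encoder
    return weights_gb, weights_gb * (1 + headroom)      # (weights, weights + headroom)

base, planned = vlm_weight_vram_gb(8)                   # 8B model at FP16
print(f"8B FP16: ~{base:.0f} GB weights, plan for ~{planned:.0f} GB")
# -> 8B FP16: ~17 GB weights, plan for ~21 GB
```

Swap `bytes_per_param` to 1.0 for INT8 or 0.5 for INT4 to reproduce the other columns approximately.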
For text-only LLM VRAM requirements, see our GPU memory requirements guide.
Deploy Qwen3-VL on GPU Cloud with vLLM
Qwen3-VL is Alibaba's dedicated vision-language model line, separate from the general-purpose Qwen 3.5 MoE covered in the Qwen 3.5 deployment guide. The VL model uses a dedicated ViT encoder with support for high-resolution inputs, and the Hugging Face repositories are separate (Qwen/Qwen3-VL-8B-Instruct, not Qwen/Qwen3.5-7B). Do not mix up VRAM estimates from the text-only post, as the ViT encoder adds overhead that is not accounted for there.
Start by provisioning an H100 or A100 80GB instance from the H100 rental page or A100 rental page. For the 8B model, a single A100 80GB is sufficient for production. For 72B, you need 4x H100 80GB.
Install vLLM:
```shell
pip install "vllm>=0.6.0"
python -c "import vllm; print(vllm.__version__)"
```

If you are evaluating whether to use vLLM or a simpler serving option, see our Ollama vs vLLM comparison for a throughput benchmark and decision guide.
Download model weights:
```shell
# Qwen3-VL 8B
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct \
  --local-dir /data/models/qwen3-vl-8b

# Qwen3-VL 72B (requires ~150 GB storage)
huggingface-cli download Qwen/Qwen3-VL-72B-Instruct \
  --local-dir /data/models/qwen3-vl-72b
```

Always verify the exact repository name on Hugging Face before running the download. Qwen naming conventions have changed across model generations.
Launch the vLLM server (8B):
```shell
vllm serve /data/models/qwen3-vl-8b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --served-model-name qwen3-vl \
  --port 8000
```

For the 72B model on 4x H100:
```shell
vllm serve /data/models/qwen3-vl-72b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4 \
  --served-model-name qwen3-vl-72b \
  --port 8000
```

Test with an image URL:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen3-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Biosphere_2_sunset.jpg/1200px-Biosphere_2_sunset.jpg"},
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Quantization to save VRAM
For A100 deployments or when you need to run the 72B on fewer GPUs, quantization reduces the weight footprint significantly:
```shell
# AWQ quantization (best quality/size tradeoff)
vllm serve /data/models/qwen3-vl-8b \
  --quantization awq \
  --dtype float16 \
  --max-model-len 8192 \
  --port 8000

# FP8 quantization (faster on H100, L40, L4, RTX 4090)
vllm serve /data/models/qwen3-vl-8b \
  --quantization fp8 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --port 8000
```

Note: FP8 quantization requires Hopper (H100/H200) or Ada Lovelace (L4, L40, RTX 4090) GPUs. On A100, use AWQ or INT8 instead. Also note that --quantization awq does not quantize on the fly: it expects a checkpoint that was already quantized with AWQ, so download an AWQ variant of the model and point vllm serve at that path rather than the FP16 weights.
Deploy Llama 4 Scout Multimodal Mode with Image Inputs
The Llama 4 deployment guide covers text-only serving for Scout and Maverick. This section focuses on image input specifically.
Llama 4 Scout uses early-fusion multimodality rather than an adapter-based approach: visual tokens are interleaved with text tokens and flow through the same transformer layers from the start, rather than being injected as a fixed prefix through a projection adapter. In practice, this means Scout is more efficient at processing interleaved text-image sequences, but it also means the visual token budget interacts with the context window differently than in adapter-based models.
Scout's context window supports up to 10M tokens, but for image serving you should cap --max-model-len much lower. At 1024 visual tokens per image and 8 concurrent requests, you need 8K tokens of KV cache just for images before any text. Setting --max-model-len 32768 is a sensible starting point for most image workloads.
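The sizing logic above can be written as a small helper. The 20% headroom factor is an assumption; adjust it for your traffic mix.

```python
# Derive a --max-model-len value from a per-request token budget.
# headroom=0.2 is an assumed safety margin, not a vLLM requirement.

def max_model_len(images_per_request, visual_tokens_per_image, text_budget, headroom=0.2):
    per_request = images_per_request * visual_tokens_per_image + text_budget
    return round(per_request * (1 + headroom))

# One 1024px image (~1024 visual tokens) plus a 4096-token text budget:
print(max_model_len(1, 1024, 4096))  # 6144
```

For Scout-style workloads with several interleaved images per request, raise `images_per_request` and the result climbs quickly toward the 32768 cap suggested above.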
Launch Scout for multimodal serving:
Llama 4 Scout has 109B total parameters. vLLM loads all expert weights at startup, requiring ~218 GB at FP16. You need at least 3x H100 80GB for FP16; use 4x for headroom. For a single H100, quantize to INT4 (~55 GB) with --quantization awq.
```shell
# FP16 on 4x H100 80GB
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --served-model-name llama4-scout \
  --port 8000

# INT4 (AWQ) on 1x H100 80GB
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --dtype float16 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --served-model-name llama4-scout \
  --port 8000
```

Always verify the Hugging Face repository name before downloading. The Scout multimodal variant and the standard instruct variant may use different repo identifiers between releases. As with Qwen, --quantization awq expects a checkpoint already quantized with AWQ, so point vllm serve at a pre-quantized repository if one is available rather than the BF16 weights.
Send a multimodal request:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load image and encode as base64
with open("/path/to/image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama4-scout",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For a full guide to Llama 4 Scout deployment, see the Llama 4 deployment guide.
Deploy InternVL3 for Document Understanding and OCR
InternVL3 is the strongest open-source model for document-specific tasks: PDF layout understanding, OCR on scanned documents, table extraction, and chart analysis. Its training data includes large document datasets that the other models do not match. If your use case involves documents rather than natural images, InternVL3 is usually the right choice.
The --trust-remote-code flag is required. InternVL3 ships custom modeling code that vLLM does not bundle natively, and without the flag vLLM will refuse to load the model. Every InternVL3 command in this section includes it.
InternVL3 uses dynamic resolution tiling: it tiles the input image at multiple scales and processes each tile separately. This gives higher accuracy on documents and charts, but it inflates the visual token count significantly at higher resolutions:
| Input resolution | Tiles | Visual tokens per image |
|---|---|---|
| 448px | 1 | 256 |
| 896px | 4 | 1,024 |
| 1344px | 9 | 2,304 |
| 1792px | 16 | 4,096 |
At 1344px resolution with 8 concurrent requests, you have over 18K tokens of image-only context. Set --max-model-len accordingly and consider capping input resolution at 896px unless you need the extra detail for small-text OCR tasks.
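The tile and token counts in the table follow a simple quadratic rule, sketched below. This is a simplified model: it ignores the extra thumbnail tile that InternVL-style preprocessing can add, so treat the result as a lower bound.

```python
# Visual token count under InternVL3-style tiling: tiles grow quadratically
# with resolution (448px per tile side), 256 visual tokens per tile.
# Simplification: the optional thumbnail tile is not counted.

def internvl3_visual_tokens(resolution_px, tile_px=448, tokens_per_tile=256):
    tiles_per_side = resolution_px // tile_px
    tiles = tiles_per_side ** 2
    return tiles, tiles * tokens_per_tile

for res in (448, 896, 1344, 1792):
    tiles, tokens = internvl3_visual_tokens(res)
    print(f"{res}px -> {tiles} tiles, {tokens} visual tokens")
# 1344px -> 9 tiles, 2304 visual tokens, matching the table row
```

Multiplying the 1344px figure by 8 concurrent requests reproduces the 18K-token estimate above.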
Download and launch InternVL3 8B:
```shell
# Download weights
huggingface-cli download OpenGVLab/InternVL3-8B \
  --local-dir /data/models/internvl3-8b

# Launch vLLM server
vllm serve /data/models/internvl3-8b \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --served-model-name internvl3 \
  --port 8000
```

Document OCR example:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load document image
with open("/path/to/document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="internvl3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document. Preserve the layout and structure.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For 72B at production scale, use 4x H100 80GB:
```shell
vllm serve /data/models/internvl3-72b \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --tensor-parallel-size 4 \
  --served-model-name internvl3-72b \
  --port 8000
```

Benchmarks: Tokens/sec, Images/sec, and Cost per 1K Multimodal Requests
| Model | GPU | Tokens/sec (text) | Images/sec (1024px) | On-Demand Cost | Cost/1K multimodal req |
|---|---|---|---|---|---|
| Qwen3-VL 8B | A100 80GB SXM4 | ~85 | ~2.1 | $1.06/hr | ~$0.15 |
| Qwen3-VL 8B | H100 SXM5 | ~140 | ~3.8 | $2.40/hr | ~$0.33 |
| Llama 4 Scout | 4x H100 SXM5 | ~95 | ~2.4 | $9.60/hr | ~$1.33 |
| InternVL3 8B | A100 80GB SXM4 | ~80 | ~1.9 | $1.06/hr | ~$0.15 |
| InternVL3 72B | 4x H100 SXM5 | ~55 | ~1.0 | $9.60/hr | ~$1.33 |
These throughput numbers are measured at moderate batch sizes with 1024-pixel input images. Real-world performance will vary based on image resolution, concurrent request count, and text output length. Cost per 1K multimodal requests is calculated as: 1,000 requests × 5 seconds per request ÷ 10 concurrent requests ÷ 3,600 seconds/hr × hourly GPU rate. The 5-second wall-clock estimate assumes 10 concurrent requests processed together via continuous batching, including ViT encoding and decoding 200 output tokens per request.
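The cost formula from the note above, written out as code so you can plug in your own request time, concurrency, and GPU rates:

```python
# Cost per 1,000 multimodal requests, per the formula in the note above:
# 1000 requests x seconds per request / concurrency / 3600 s/hr x hourly rate.

def cost_per_1k_requests(seconds_per_request, concurrency, hourly_rate):
    gpu_hours = 1000 * seconds_per_request / concurrency / 3600
    return gpu_hours * hourly_rate

print(f"${cost_per_1k_requests(5, 10, 1.06):.2f}")  # A100 80GB:  $0.15
print(f"${cost_per_1k_requests(5, 10, 9.60):.2f}")  # 4x H100:    $1.33
```

Halving concurrency doubles the cost per request, which is why continuous batching (below) matters so much for VLM economics.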
Pricing fluctuates based on GPU availability. The prices above are current as of 04 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Production Checklist: Batching Images, Context Windows, and Autoscaling
- Cap image resolution on the client side. Resize images to 1024px or less before sending. Each extra pixel at the ViT encoder boundary inflates visual token count and KV cache usage. For most document OCR tasks, 896px gives 95% of the accuracy of 1792px at one-quarter the token cost.
- Size your visual token budget before setting `--max-model-len`. It is a per-request sequence length cap, not a total batch limit. Calculate the per-request budget: `visual_tokens_per_image + max_text_tokens_per_request`. For one 1024px image per request with a 4096-token text budget: 1024 + 4096 = 5120. Add headroom: `--max-model-len 6144`. vLLM sizes its startup memory profile from this value, so over-sizing it leaves less VRAM available for KV cache.
- Use continuous batching. vLLM's continuous batching handles mixed text-and-image requests efficiently. Do not use static batching servers (Triton without dynamic batching, basic Flask endpoints) for VLMs. Static batching wastes GPU time waiting for the batch to fill, which is especially costly when ViT encoding adds per-image latency.
- Monitor the Prometheus metrics endpoint. The vLLM OpenAI server exposes metrics at `/metrics` by default; alert when `vllm:gpu_cache_usage_perc` exceeds 85%. Cache saturation is the primary cause of production OOM events in VLM deployments, and it is preventable if you catch it early.
- Use spot instances for development and batch jobs. H100 spot at $0.80/hr is appropriate for prompt iteration, benchmark runs, and offline batch image processing. Switch to on-demand for latency-sensitive production endpoints where instance preemption would cause user-visible errors. For a broader set of cost-reduction strategies, see the GPU cost optimization playbook.
- Tensor parallelism only, no pipeline parallelism. For multi-GPU VLM serving in vLLM, use `--tensor-parallel-size` matching your GPU count. Pipeline parallelism is not supported for most VLMs in vLLM because the visual encoder cannot be split across pipeline stages; an unsupported combination will fail at startup.
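The first checklist item, client-side resizing, can be implemented with Pillow before building the base64 data URL used in the request examples above. This is a sketch assuming JPEG inputs; `encode_capped` is a hypothetical helper name, not a library function.

```python
# Client-side resize before upload: cap the longer side at 1024px so the
# server never pays for more visual tokens than the task needs.
# Requires Pillow (pip install Pillow).
import base64
import io

from PIL import Image

def encode_capped(path, max_side=1024, quality=90):
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1:  # only downscale, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# data_url = f"data:image/jpeg;base64,{encode_capped('/path/to/image.jpg')}"
```

Drop `max_side` to 896 for document workloads, matching the resolution guidance in the first checklist item.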
For GPU monitoring patterns in production ML deployments, see our GPU monitoring guide.
VLMs benefit most from high-memory-bandwidth GPUs where visual encoder throughput is not the bottleneck. Spheron provides on-demand H100 SXM5 instances from $2.40/hr and A100 80GB from $1.06/hr, with no minimum commitment and provisioning in under 90 seconds.
