SmolVLM 2.2B on an RTX 4090 handles OCR and screen-grounding at 3-5x the throughput and roughly 90% lower cost per 1,000 requests than a frontier 7B VLM on an H100. For most production vision pipelines, the "small" in SmolVLM is not a compromise. If your workload genuinely needs a frontier VLM like Qwen2.5-VL or InternVL3, the calculus is different, but for high-volume document processing and edge multimodal tasks, SmolVLM belongs at the top of your shortlist.
SmolVLM and SmolVLA: What Hugging Face Built
The SmolVLM family grew out of Hugging Face's SmolLM research line, which showed that aggressive architecture compression and careful data curation could close most of the quality gap between large and small language models. SmolVLM extends that approach to vision: a SigLIP-based visual encoder feeds into a SmolLM2 language backbone, keeping the weight footprint an order of magnitude smaller than frontier VLMs while retaining most of their practical capability for document-centric tasks.
SmolVLA takes the next step. It grafts a Flow Matching action head onto the SmolVLM backbone, enabling the model to output robot joint commands directly from image observations and language instructions. Instead of a 7B generalist VLA that strains embedded hardware, SmolVLA runs on a single small GPU and is designed for real manipulation pipelines.
| Model | Params | VRAM (FP16) | VRAM (INT8) | Primary Use Case |
|---|---|---|---|---|
| SmolVLM 256M | 256M | ~1.2 GB | ~0.7 GB | Real-time edge, on-device |
| SmolVLM 500M | 500M | ~1.5 GB | ~1.0 GB | Lightweight OCR, screen-grounding |
| SmolVLM 2.2B | 2.2B | ~5.5 GB | ~2.8 GB | Document QA, visual reasoning |
| SmolVLA | ~450M | ~2 GB | ~1.2 GB | Robotic manipulation control |
Values are approximate at default resolution. Actual VRAM including KV cache scales with batch size and image count. Figures exclude vision-encoder activations during inference; expect 1-2 GB additional usage at high image counts per batch.
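To see how quickly KV cache eats into those margins, you can estimate it from the backbone config. The sketch below uses placeholder layer/head values, not published SmolVLM2 numbers; read the real ones from the repo's config.json before trusting the output.

```python
# Back-of-envelope KV-cache sizing for a decoder-style backbone.
# NOTE: the config values below are ILLUSTRATIVE PLACEHOLDERS, not
# published SmolVLM2 numbers - read the real ones from config.json.
NUM_LAYERS = 24       # placeholder: decoder layers
NUM_KV_HEADS = 32     # placeholder: key/value heads (fewer with GQA)
HEAD_DIM = 64         # placeholder: per-head dimension
BYTES_PER_ELEM = 2    # FP16/BF16

def kv_cache_gb(seq_len: int, concurrent_reqs: int) -> float:
    """Factor of 2 covers the K and V tensors, per layer, per token."""
    total = (2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
             * seq_len * concurrent_reqs * BYTES_PER_ELEM)
    return total / 1024**3

# 24 concurrent requests at a full 4096-token context (text + visual tokens):
print(f"{kv_cache_gb(4096, 24):.1f} GB")  # ~18 GB with these placeholder values
```

With these placeholder values, 24 concurrent 4096-token requests land right at the ~18 GB of headroom in the sizing table below, which is where the 24-32 concurrency estimates come from.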
Hardware Sizing: VRAM, Batch Size, and Throughput
Three GPUs cover the practical deployment range for SmolVLM workloads.
| GPU | VRAM | SmolVLM 2.2B FP16 (weights) | Available for KV cache | Max concurrent requests | Est. throughput (doc QA) |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB | ~5.5 GB | ~18 GB | 24-32 | ~200 req/min |
| L4 (24 GB) | 24 GB | ~5.5 GB | ~18 GB | 24-32 | ~150 req/min |
| L40S | 48 GB | ~5.5 GB | ~42 GB | 64-96 | ~450 req/min |
Throughput figures are estimates based on published Hugging Face benchmarks and typical vLLM throughput at moderate concurrency. Your actual numbers depend on image resolution, document length, and batch configuration.
The RTX 4090 at $0.53/hr on-demand and the L40S at $0.72/hr on-demand are the two primary deployment targets for SmolVLM on Spheron. Both give you significantly better cost-per-request than renting an H100 or A100 for a workload that does not need their capacity.
Pricing fluctuates with GPU availability. The prices above were checked on 04 May 2026 and may have changed. Check current GPU pricing → for live rates.
Running Multiple SmolVLM Instances
The 256M and 500M variants are small enough to run multiple replicas on a single RTX 4090. At ~1.2 GB of weights for the 256M model, you can fit up to 8 replicas with KV-cache headroom on a 24 GB card. This is the main operational differentiator from frontier VLMs: instead of saturating one large model, you can run a small fleet of SmolVLM instances in parallel for real-time document ingestion pipelines where per-request latency matters more than peak throughput.
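A minimal sketch of that pattern with vLLM: cap each replica's memory share with --gpu-memory-utilization and give each its own port. The replica count, memory fraction, and the 256M repo name are illustrative assumptions; verify the repo on the model card and tune the fractions for your card.

```bash
# Launch 4 SmolVLM 256M replicas on one 24 GB card (illustrative values).
# --gpu-memory-utilization is a fraction of TOTAL GPU memory per process,
# so the shares must sum to well under 1.0 to leave headroom.
for i in 0 1 2 3; do
  vllm serve HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
    --port $((8000 + i)) \
    --gpu-memory-utilization 0.2 \
    --max-model-len 4096 \
    --served-model-name smolvlm-256m &
done
wait
```

A lightweight load balancer (nginx, or round-robin in the client) in front of the four ports then gives you four independent decode streams, which is what keeps p99 latency flat under bursty ingestion.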
Deploying SmolVLM with Hugging Face Transformers
For development and single-request inference, the transformers AutoProcessor / AutoModelForVision2Seq pattern is the simplest path:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceTB/SmolVLM2-2.2B-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
)
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Extract all text visible in this document."}
]
}
]
image = Image.open("document.png")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# Cast floating-point inputs to the model dtype so pixel_values aren't passed as FP32
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=500)
generated_ids = output[0][inputs['input_ids'].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

This works well for prototyping and batch scripts. For production serving with concurrent requests, you need vLLM.
Always verify the Hugging Face repo name before downloading. The SmolVLM2 family uses HuggingFaceTB/SmolVLM2-2.2B-Instruct, not the older HuggingFaceM4/SmolVLM-Instruct naming. Check the SmolVLM2 model card on Hugging Face for the current canonical repo path.
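One way to catch a stale repo name at startup rather than mid-deploy is a quick existence check with huggingface_hub (already installed alongside transformers); a small sketch:

```python
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

REPO = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

try:
    # Hits the Hub API without downloading any weights
    info = model_info(REPO)
    print(f"{REPO} OK, last modified {info.last_modified}")
except RepositoryNotFoundError:
    raise SystemExit(f"Repo {REPO} not found - check the SmolVLM2 model card")
```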
Deploying SmolVLM with vLLM for Production
vLLM supports SmolVLM2 natively as of version 0.8+. Its continuous batching folds newly arriving requests into the running batch between decode steps, which keeps GPU utilization high even at variable concurrency. Launch the server:
```bash
vllm serve HuggingFaceTB/SmolVLM2-2.2B-Instruct \
--dtype bfloat16 \
--max-model-len 4096 \
--served-model-name smolvlm2 \
--max-num-seqs 32
```

Then query it with the OpenAI client:
```python
from openai import OpenAI
import base64
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
with open("document.png", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="smolvlm2",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
{"type": "text", "text": "Extract all text from this document."}
]
}
],
max_tokens=500
)
print(response.choices[0].message.content)
```

SmolVLM natively resizes inputs to its training resolution (384x384 per tile). For high-resolution documents, enable the multi-tile processor config to preserve detail at the cost of more visual tokens per image. See the vLLM supported models documentation for advanced batching configuration.
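The resolution knob lives on the processor. For SmolVLM-class processors this is typically a size={"longest_edge": N*384} argument, but treat the parameter name and the multiplier below as assumptions to verify against the model card:

```python
from transformers import AutoProcessor

# ASSUMPTION: SmolVLM2's processor accepts a `size` dict with a
# longest_edge key, as documented for the original SmolVLM; verify
# the exact key and tile math on the SmolVLM2 model card.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    size={"longest_edge": 4 * 384},  # allow up to ~4 tiles along the long edge
)
# More tiles preserve small text in high-resolution scans, but each tile
# adds visual tokens, which consumes KV cache and lowers max concurrency.
```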
SmolVLM's weight footprint means most of the 24 GB RTX 4090 VRAM is available for KV cache, letting you push --max-num-seqs well above what is practical with a 7B model. Start at 32 and increase until you hit OOM, then back off by 20%.
SmolVLA: When You Need Robotic Action Output
SmolVLA adds a Flow Matching action head to the SmolVLM architecture. The visual encoder and language backbone handle perception and instruction understanding; the action head translates that into robot joint commands. The result is a model that fits the latency and hardware constraints of on-device robotics without the weight of a generalist VLA.
SmolVLA vs Generalist VLA Models
| Model | Params | Min VRAM | Inference latency | Manipulation tasks |
|---|---|---|---|---|
| SmolVLA | ~450M | ~2 GB | <100ms | Sim-to-real, pick-and-place, table manipulation |
| OpenVLA | 7B | 14 GB | ~500ms | General manipulation |
| pi0 (PaliGemma) | 3B | ~7 GB | ~200ms | Complex dexterous tasks |
SmolVLA's <100ms latency at ~2 GB VRAM makes it the only option in the list that can run alongside other inference workloads on a shared GPU. An L40S can host SmolVLA and a SmolVLM 2.2B serving endpoint simultaneously without VRAM contention.
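A sketch of that split on a 48 GB L40S: cap the vLLM endpoint's memory share so the remainder stays free for the SmolVLA process. The 0.6 fraction is an assumed starting point, not a tuned value.

```bash
# Reserve ~60% of the L40S (about 29 GB) for the SmolVLM serving endpoint;
# the rest stays free for the SmolVLA control process on the same card.
vllm serve HuggingFaceTB/SmolVLM2-2.2B-Instruct \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.6 \
  --served-model-name smolvlm2 &

# SmolVLA then runs as a separate process (see the lerobot snippet below)
# and only allocates its ~2 GB of weights plus activations.
```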
Deploying SmolVLA for Robotics
SmolVLA integrates with the lerobot library, Hugging Face's robotics framework:
```python
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
# At each control step - replace with real sensor data:
observation = {"image": torch.zeros(1, 3, 224, 224), "state": torch.zeros(1, 6)}  # key names vary by robot config
action = policy.select_action(observation)
# action shape: (1, action_dim) - joint positions or velocities
```

SmolVLA targets LeRobot and Gym-compatible environments. Verify the current installation path (pip install lerobot) and the SmolVLAPolicy API against the Hugging Face LeRobot documentation before writing production code, as the framework updates frequently. Spheron's GPU cloud runs all NVIDIA CUDA-based robotics simulation environments without additional configuration.
Benchmarks: SmolVLM vs Frontier VLMs
Benchmark numbers are sourced from the SmolVLM2 Hugging Face blog post and the Open VLM Leaderboard. Throughput estimates are for FP16 serving on an RTX 4090 with vLLM at moderate concurrency.
| Model | Params | DocVQA | ChartQA | ScreenSpot | Throughput on RTX 4090 (req/min) |
|---|---|---|---|---|---|
| SmolVLM 256M | 256M | ~67% | ~54% | ~56% | ~900 |
| SmolVLM 500M | 500M | ~72% | ~60% | ~63% | ~700 |
| SmolVLM 2.2B | 2.2B | ~81% | ~72% | ~75% | ~200 |
| InternVL3 2B | 2B | ~88% | ~81% | ~78% | ~180 |
| Qwen2.5-VL 7B | 7B | ~93% | ~90% | ~88% | ~60 |
| Llama 3.2 Vision 11B | 11B | ~90% | ~83% | ~78% | ~35 |
Verify specific benchmark numbers against the Open VLM Leaderboard before citing in production decisions, as model updates can shift these figures.
Cost-Per-Task Math
This is where SmolVLM's value becomes concrete. Take a document OCR pipeline as the baseline:
SmolVLM 2.2B on RTX 4090:
- GPU cost: $0.53/hr
- Throughput: ~200 doc QA requests/min = 12,000 req/hr
- Cost per 1,000 requests: $0.53 / 12 = $0.044
Qwen2.5-VL 7B on H100 PCIe:
- GPU cost: $2.01/hr
- Throughput: ~60 req/min = 3,600 req/hr
- Cost per 1,000 requests: $2.01 / 3.6 = ~$0.56
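The same arithmetic as a reusable helper, if you want to plug in your own measured throughput and current Spheron rates:

```python
def cost_per_1k_requests(gpu_usd_per_hr: float, req_per_min: float) -> float:
    """USD per 1,000 requests at steady-state throughput."""
    req_per_hr = req_per_min * 60
    return gpu_usd_per_hr / req_per_hr * 1000

print(f"${cost_per_1k_requests(0.53, 200):.3f}")  # SmolVLM 2.2B on RTX 4090 -> ~$0.044
print(f"${cost_per_1k_requests(2.01, 60):.2f}")   # Qwen2.5-VL 7B on H100   -> ~$0.56
```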
At 81% DocVQA vs 93%, SmolVLM 2.2B gives up about 12 accuracy points in exchange for a 13x reduction in cost per request. For pipelines where "good enough" OCR accuracy is acceptable - expense report extraction, invoice parsing, form digitization - that tradeoff is straightforward.
The math flips at the high end. For complex legal document analysis, multi-table financial reports, or mixed-script documents, the accuracy gap matters and Qwen2.5-VL or InternVL3 are the right choices.
For single-node deployments, rent an RTX 4090 on Spheron to start. For higher-concurrency workloads or when you need SmolVLA running alongside SmolVLM, L40S GPU rental on Spheron gives you 48 GB VRAM with enough headroom to run both simultaneously.
Production Checklist
- Use vLLM for concurrent serving; avoid calling transformers `generate()` inside a web server
- Set `--max-model-len` to match your p99 document length (4096 is usually enough for SmolVLM)
- Monitor GPU memory with `nvidia-smi dmon`; SmolVLM's small weight footprint means OOM is almost always a batch-size issue, not a model-size issue
- For robotics: pin SmolVLA inference to a dedicated GPU for deterministic latency; do not share the card with batch inference jobs
- For offline document processing at scale, batch LLM inference on GPU cloud with spot instances and checkpointing cuts per-token cost by 60-70% over online serving; use on-demand only for robotics control loops where preemption is not acceptable (a minimal checkpointing sketch follows this list)
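A minimal sketch of the checkpointing pattern for spot instances: append one JSONL line per finished document so a preempted job resumes where it stopped. The file path and query_model stub are placeholders, not part of any library API.

```python
import json
import pathlib

DONE_FILE = pathlib.Path("results.jsonl")  # placeholder output path

def query_model(doc_path: str) -> str:
    """Placeholder: swap in the OpenAI-client call from the vLLM section."""
    raise NotImplementedError

def load_done() -> set[str]:
    """IDs of documents already processed before a preemption."""
    if not DONE_FILE.exists():
        return set()
    return {json.loads(line)["doc"] for line in DONE_FILE.open()}

def run_batch(docs: list[str]) -> None:
    done = load_done()
    with DONE_FILE.open("a") as out:
        for doc in docs:
            if doc in done:
                continue  # already processed in a previous run
            text = query_model(doc)
            out.write(json.dumps({"doc": doc, "text": text}) + "\n")
            out.flush()  # persist each result so preemption loses at most one doc
```

On restart after preemption, load_done() skips everything already in results.jsonl, so the only wasted work is the single in-flight document.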
SmolVLM 2.2B runs at 3-5x the throughput of a 7B VLM on a single RTX 4090, making it the most cost-efficient choice for high-volume OCR, screen-grounding, and edge multimodal pipelines. Spheron's RTX 4090 rentals and L40S instances give you bare-metal access at GPU-cloud prices.
