SmolVLM 2.2B on an RTX 4090 handles OCR and screen-grounding at 3-5x the throughput and roughly 90% lower cost per 1,000 requests than a frontier 7B VLM on an H100. For most production vision pipelines, the "small" in SmolVLM is not a compromise. If your workload genuinely needs a frontier VLM like Qwen2.5-VL or InternVL3, the calculus is different, but for high-volume document processing and edge multimodal tasks, SmolVLM belongs at the top of your shortlist.
SmolVLM and SmolVLA: What Hugging Face Built
The SmolVLM family grew out of Hugging Face's SmolLM research line, which showed that aggressive architecture compression and careful data curation could close most of the quality gap between large and small language models. SmolVLM extends that approach to vision: a SigLIP-based visual encoder feeds into a SmolLM2 language backbone, keeping the weight footprint an order of magnitude smaller than frontier VLMs while retaining most of their practical capability for document-centric tasks.
SmolVLA takes the next step. It grafts a Flow Matching action head onto the SmolVLM backbone, enabling the model to output robot joint commands directly from image observations and language instructions. Instead of a 7B generalist VLA that strains embedded hardware, SmolVLA runs on a single small GPU and is designed for real manipulation pipelines.
| Model | Params | VRAM (FP16) | VRAM (INT8) | Primary Use Case |
|---|---|---|---|---|
| SmolVLM 256M | 256M | ~1.2 GB | ~0.7 GB | Real-time edge, on-device |
| SmolVLM 500M | 500M | ~1.5 GB | ~1.0 GB | Lightweight OCR, screen-grounding |
| SmolVLM 2.2B | 2.2B | ~5.5 GB | ~2.8 GB | Document QA, visual reasoning |
| SmolVLA | ~450M | ~2 GB | ~1.2 GB | Robotic manipulation control |
Values are approximate at default resolution. Actual VRAM including KV cache scales with batch size and image count. Figures exclude vision-encoder activations during inference; expect 1-2 GB additional usage at high image counts per batch.
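To see how quickly KV cache eats into those margins, you can estimate it from the backbone config. The sketch below uses placeholder layer/head values, not published SmolVLM2 numbers; read the real ones from the repo's config.json before trusting the output.

```python
# Back-of-envelope KV-cache sizing for a decoder-style backbone.
# NOTE: the config values below are ILLUSTRATIVE PLACEHOLDERS, not
# published SmolVLM2 numbers - read the real ones from config.json.
NUM_LAYERS = 24       # placeholder: decoder layers
NUM_KV_HEADS = 32     # placeholder: key/value heads (fewer with GQA)
HEAD_DIM = 64         # placeholder: per-head dimension
BYTES_PER_ELEM = 2    # FP16/BF16

def kv_cache_gb(seq_len: int, concurrent_reqs: int) -> float:
    """Factor of 2 covers the K and V tensors, per layer, per token."""
    total = (2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
             * seq_len * concurrent_reqs * BYTES_PER_ELEM)
    return total / 1024**3

# 24 concurrent requests at a full 4096-token context (text + visual tokens):
print(f"{kv_cache_gb(4096, 24):.1f} GB")  # ~18 GB with these placeholder values
```

With these placeholder values, 24 concurrent 4096-token requests land right at the ~18 GB of headroom in the sizing table below, which is where the 24-32 concurrency estimates come from.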
Hardware Sizing: VRAM, Batch Size, and Throughput
Three GPUs cover the practical deployment range for SmolVLM workloads.
| GPU | VRAM | SmolVLM 2.2B FP16 (weights) | Available for KV cache | Max concurrent requests | Est. throughput (doc QA) |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB | ~5.5 GB | ~18 GB | 24-32 | ~200 req/min |
| L4 (24 GB) | 24 GB | ~5.5 GB | ~18 GB | 24-32 | ~150 req/min |
| L40S | 48 GB | ~5.5 GB | ~42 GB | 64-96 | ~450 req/min |
Throughput figures are estimates based on published Hugging Face benchmarks and typical vLLM throughput at moderate concurrency. Your actual numbers depend on image resolution, document length, and batch configuration.
The RTX 4090 at $0.53/hr on-demand and the L40S at $0.72/hr on-demand are the two primary deployment targets for SmolVLM on Spheron. Both give you significantly better cost-per-request than renting an H100 or A100 for a workload that does not need their capacity.
Pricing fluctuates with GPU availability. The prices above were checked on 04 May 2026 and may have changed. Check current GPU pricing → for live rates.
Running Multiple SmolVLM Instances
The 256M and 500M variants are small enough to run multiple replicas on a single RTX 4090. At ~1.2 GB of weights for the 256M model, you can fit up to 8 replicas with KV-cache headroom on a 24 GB card. This is the main operational differentiator from frontier VLMs: instead of saturating one large model, you can run a small fleet of SmolVLM instances in parallel for real-time document ingestion pipelines where per-request latency matters more than peak throughput.
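A minimal sketch of that pattern with vLLM: cap each replica's memory share with --gpu-memory-utilization and give each its own port. The replica count, memory fraction, and the 256M repo name are illustrative assumptions; verify the repo on the model card and tune the fractions for your card.

```bash
# Launch 4 SmolVLM 256M replicas on one 24 GB card (illustrative values).
# --gpu-memory-utilization is a fraction of TOTAL GPU memory per process,
# so the shares must sum to well under 1.0 to leave headroom.
for i in 0 1 2 3; do
  vllm serve HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
    --port $((8000 + i)) \
    --gpu-memory-utilization 0.2 \
    --max-model-len 4096 \
    --served-model-name smolvlm-256m &
done
wait
```

A lightweight load balancer (nginx, or round-robin in the client) in front of the four ports then gives you four independent decode streams, which is what keeps p99 latency flat under bursty ingestion.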
Deploying SmolVLM with Hugging Face Transformers
For development and single-request inference, the transformers AutoProcessor / AutoModelForVision2Seq pattern is the simplest path:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceTB/SmolVLM2-2.2B-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
)
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Extract all text visible in this document."}
]
}
]
image = Image.open("document.png")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# Cast floating-point inputs to the model dtype so pixel_values aren't passed as FP32
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=500)
generated_ids = output[0][inputs['input_ids'].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

This works well for prototyping and batch scripts. For production serving with concurrent requests, you need vLLM.
Always verify the Hugging Face repo name before downloading. The SmolVLM2 family uses HuggingFaceTB/SmolVLM2-2.2B-Instruct, not the older HuggingFaceM4/SmolVLM-Instruct naming. Check the SmolVLM2 model card on Hugging Face for the current canonical repo path.
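One way to catch a stale repo name at startup rather than mid-deploy is a quick existence check with huggingface_hub (already installed alongside transformers); a small sketch:

```python
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

REPO = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

try:
    # Hits the Hub API without downloading any weights
    info = model_info(REPO)
    print(f"{REPO} OK, last modified {info.last_modified}")
except RepositoryNotFoundError:
    raise SystemExit(f"Repo {REPO} not found - check the SmolVLM2 model card")
```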
Deploying SmolVLM with vLLM for Production
vLLM supports SmolVLM2 natively as of version 0.8+. Its continuous batching folds newly arriving requests into the running batch between decode steps, which keeps GPU utilization high even at variable concurrency. Launch the server:
```bash
vllm serve HuggingFaceTB/SmolVLM2-2.2B-Instruct \
--dtype bfloat16 \
--max-model-len 4096 \
--served-model-name smolvlm2 \
--max-num-seqs 32
```

Then query it with the OpenAI client:
```python
from openai import OpenAI
import base64
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
with open("document.png", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="smolvlm2",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
{"type": "text", "text": "Extract all text from this document."}
]
}
],
max_tokens=500
)
print(response.choices[0].message.content)
```

SmolVLM natively resizes inputs to its training resolution (384x384 per tile). For high-resolution documents, enable the multi-tile processor config to preserve detail at the cost of more visual tokens per image. See the vLLM supported models documentation for advanced batching configuration.
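The resolution knob lives on the processor. For SmolVLM-class processors this is typically a size={"longest_edge": N*384} argument, but treat the parameter name and the multiplier below as assumptions to verify against the model card:

```python
from transformers import AutoProcessor

# ASSUMPTION: SmolVLM2's processor accepts a `size` dict with a
# longest_edge key, as documented for the original SmolVLM; verify
# the exact key and tile math on the SmolVLM2 model card.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    size={"longest_edge": 4 * 384},  # allow up to ~4 tiles along the long edge
)
# More tiles preserve small text in high-resolution scans, but each tile
# adds visual tokens, which consumes KV cache and lowers max concurrency.
```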
SmolVLM's weight footprint means most of the 24 GB RTX 4090 VRAM is available for KV cache, letting you push --max-num-seqs well above what is practical with a 7B model. Start at 32 and increase until you hit OOM, then back off by 20%.
SmolVLA: When You Need Robotic Action Output
SmolVLA adds a Flow Matching action head to the SmolVLM architecture. The visual encoder and language backbone handle perception and instruction understanding; the action head translates that into robot joint commands. The result is a model that fits the latency and hardware constraints of on-device robotics without the weight of a generalist VLA.
SmolVLA vs Generalist VLA Models
| Model | Params | Min VRAM | Inference latency | Manipulation tasks |
|---|---|---|---|---|
| SmolVLA | ~450M | ~2 GB | <100ms | Sim-to-real, pick-and-place, table manipulation |
| OpenVLA | 7B | 14 GB | ~500ms | General manipulation |
| pi0 (PaliGemma) | 3B | ~7 GB | ~200ms | Complex dexterous tasks |
SmolVLA's <100ms latency at ~2 GB VRAM makes it the only option in the list that can run alongside other inference workloads on a shared GPU. An L40S can host SmolVLA and a SmolVLM 2.2B serving endpoint simultaneously without VRAM contention.
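A sketch of that split on a 48 GB L40S: cap the vLLM endpoint's memory share so the remainder stays free for the SmolVLA process. The 0.6 fraction is an assumed starting point, not a tuned value.

```bash
# Reserve ~60% of the L40S (about 29 GB) for the SmolVLM serving endpoint;
# the rest stays free for the SmolVLA control process on the same card.
vllm serve HuggingFaceTB/SmolVLM2-2.2B-Instruct \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.6 \
  --served-model-name smolvlm2 &

# SmolVLA then runs as a separate process (see the lerobot snippet below)
# and only allocates its ~2 GB of weights plus activations.
```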
Deploying SmolVLA for Robotics
SmolVLA integrates with the lerobot library, Hugging Face's robotics framework:
```python
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
# At each control step - replace with real sensor data:
observation = {"image": torch.zeros(1, 3, 224, 224), "state": torch.zeros(1, 6)}  # key names vary by robot config
action = policy.select_action(observation)
# action shape: (1, action_dim) - joint positions or velocities
```

SmolVLA targets LeRobot and Gym-compatible environments. Verify the current installation path (pip install lerobot) and the SmolVLAPolicy API against the Hugging Face LeRobot documentation before writing production code, as the framework updates frequently. Spheron's GPU cloud runs all NVIDIA CUDA-based robotics simulation environments without additional configuration.
Benchmarks: SmolVLM vs Frontier VLMs
Benchmark numbers are sourced from the SmolVLM2 Hugging Face blog post and the Open VLM Leaderboard. Throughput estimates are for FP16 serving on an RTX 4090 with vLLM at moderate concurrency.
| Model | Params | DocVQA | ChartQA | ScreenSpot | Throughput on RTX 4090 (req/min) |
|---|---|---|---|---|---|
| SmolVLM 256M | 256M | ~67% | ~54% | ~56% | ~900 |
| SmolVLM 500M | 500M | ~72% | ~60% | ~63% | ~700 |
| SmolVLM 2.2B | 2.2B | ~81% | ~72% | ~75% | ~200 |
| InternVL3 2B | 2B | ~88% | ~81% | ~78% | ~180 |
| Qwen2.5-VL 7B | 7B | ~93% | ~90% | ~88% | ~60 |
| Llama 3.2 Vision 11B | 11B | ~90% | ~83% | ~78% | ~35 |
Verify specific benchmark numbers against the Open VLM Leaderboard before citing in production decisions, as model updates can shift these figures.
Cost-Per-Task Math
This is where SmolVLM's value becomes concrete. Take a document OCR pipeline as the baseline:
SmolVLM 2.2B on RTX 4090:
- GPU cost: $0.53/hr
- Throughput: ~200 doc QA requests/min = 12,000 req/hr
- Cost per 1,000 requests: $0.53 / 12 = $0.044
Qwen2.5-VL 7B on H100 PCIe:
- GPU cost: $2.01/hr
- Throughput: ~60 req/min = 3,600 req/hr
- Cost per 1,000 requests: $2.01 / 3.6 = ~$0.56
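The same arithmetic as a reusable helper, if you want to plug in your own measured throughput and current Spheron rates:

```python
def cost_per_1k_requests(gpu_usd_per_hr: float, req_per_min: float) -> float:
    """USD per 1,000 requests at steady-state throughput."""
    req_per_hr = req_per_min * 60
    return gpu_usd_per_hr / req_per_hr * 1000

print(f"${cost_per_1k_requests(0.53, 200):.3f}")  # SmolVLM 2.2B on RTX 4090 -> ~$0.044
print(f"${cost_per_1k_requests(2.01, 60):.2f}")   # Qwen2.5-VL 7B on H100   -> ~$0.56
```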
At 81% DocVQA vs 93%, SmolVLM 2.2B gives up about 12 accuracy points in exchange for a 13x reduction in cost per request. For pipelines where "good enough" OCR accuracy is acceptable - expense report extraction, invoice parsing, form digitization - that tradeoff is straightforward.
The math flips at the high end. For complex legal document analysis, multi-table financial reports, or mixed-script documents, the accuracy gap matters and Qwen2.5-VL or InternVL3 are the right choices.
For single-node deployments, rent an RTX 4090 on Spheron to start. For higher-concurrency workloads or when you need SmolVLA running alongside SmolVLM, L40S GPU rental on Spheron gives you 48 GB VRAM with enough headroom to run both simultaneously.
Production Checklist
- Use vLLM for concurrent serving; avoid calling transformers `generate()` inside a web server
- Set `--max-model-len` to match your p99 document length (4096 is usually enough for SmolVLM)
- Monitor GPU memory with `nvidia-smi dmon`; SmolVLM's small weight footprint means OOM is almost always a batch-size issue, not a model-size issue
- For robotics: pin SmolVLA inference to a dedicated GPU for deterministic latency; do not share the card with batch inference jobs
- For offline document processing at scale, batch LLM inference on GPU cloud with spot instances and checkpointing cuts per-token cost by 60-70% over online serving; use on-demand only for robotics control loops where preemption is not acceptable (a minimal checkpointing sketch follows this list)
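A minimal sketch of the checkpointing pattern for spot instances: append one JSONL line per finished document so a preempted job resumes where it stopped. The file path and query_model stub are placeholders, not part of any library API.

```python
import json
import pathlib

DONE_FILE = pathlib.Path("results.jsonl")  # placeholder output path

def query_model(doc_path: str) -> str:
    """Placeholder: swap in the OpenAI-client call from the vLLM section."""
    raise NotImplementedError

def load_done() -> set[str]:
    """IDs of documents already processed before a preemption."""
    if not DONE_FILE.exists():
        return set()
    return {json.loads(line)["doc"] for line in DONE_FILE.open()}

def run_batch(docs: list[str]) -> None:
    done = load_done()
    with DONE_FILE.open("a") as out:
        for doc in docs:
            if doc in done:
                continue  # already processed in a previous run
            text = query_model(doc)
            out.write(json.dumps({"doc": doc, "text": text}) + "\n")
            out.flush()  # persist each result so preemption loses at most one doc
```

On restart after preemption, load_done() skips everything already in results.jsonl, so the only wasted work is the single in-flight document.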
SmolVLM 2.2B runs at 3-5x the throughput of a 7B VLM on a single RTX 4090, making it the most cost-efficient choice for high-volume OCR, screen-grounding, and edge multimodal pipelines. Spheron's RTX 4090 rentals and L40S instances give you bare-metal access at GPU-cloud prices.
