Ollama and vLLM both let you run LLMs on your own hardware. That's where the similarity ends. Ollama is built for developers who want to get a model running in two minutes, no Docker, no config files. vLLM is built for engineers who need to serve thousands of requests per day with predictable latency.
Picking the wrong tool costs you either time (overengineering a prototype) or reliability (shipping Ollama to production and watching it fall over at 10 concurrent users). This post lays out exactly when each tool makes sense, with benchmark numbers on the same hardware and cloud-init scripts for deploying both on Spheron.
TL;DR
| Comparison | Ollama | vLLM |
|---|---|---|
| Best for | Local dev, prototyping | Production APIs, high concurrency |
| Setup time | ~2 minutes | ~5 minutes |
| Throughput (H100, 32 concurrent) | ~320 tok/s | ~1,450 tok/s |
| OpenAI-compatible API | Yes | Yes |
| Continuous batching | No | Yes |
| GPU memory efficiency | Good (static KV cache) | Excellent (PagedAttention) |
| GGUF/quantization support | Yes (native) | FP8, AWQ, GPTQ recommended; GGUF and BitsAndBytes supported |
| Multi-model serving | Yes | No (one model per server) |
Feature Comparison
| Feature | Ollama | vLLM |
|---|---|---|
| OpenAI API compatibility | Yes (/v1/chat/completions) | Yes (full drop-in) |
| Multi-model serving | Yes (load on demand) | No (one model per server process) |
| Quantization formats | GGUF (Q2–Q8, FP16) | FP8, AWQ, GPTQ, FP16 (recommended); GGUF, BitsAndBytes also supported |
| GPU memory management | Static KV cache per request | PagedAttention (non-contiguous pages) |
| Concurrent requests | Sequential queue | Continuous batching |
| Setup complexity | Install script, no Docker | Docker + HF token |
| Supported backends | llama.cpp, Metal, CUDA | CUDA, ROCm (H100/MI300X optimized) |
| Metrics endpoint | None | Prometheus /metrics |
| Streaming support | Yes | Yes |
| Model format | GGUF | Hugging Face safetensors |
| Apple Silicon support | Yes (Metal) | No |
| Multi-GPU tensor parallelism | No | Yes |
The most important differences are the last three rows: model format, Apple Silicon, and multi-GPU. If you're on a Mac, Ollama is your only option. If you need to spread a 70B model across multiple H100s, vLLM is your only option. For everything in between, the choice comes down to how many concurrent users you're serving.
When to Use Ollama
Ollama fits any situation where you're running one model, talking to it yourself (or with a handful of people), and speed of setup matters more than throughput.
Prototyping. You want to test whether a model is good enough for your use case before building anything. ollama run llama3.1:8b gets you a working chat interface in 90 seconds. No Hugging Face account, no Docker, no token setup.
Local development. You're building an app that calls an LLM API and want to iterate fast without paying for tokens. Ollama's OpenAI-compatible endpoint at localhost:11434/v1 works as a drop-in for the OpenAI SDK. Change one line in your app and your API calls hit your local GPU.
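As a concrete sketch of that drop-in behavior, here's a standard-library request against Ollama's OpenAI-compatible endpoint; the model name and prompt are placeholders, and with the OpenAI SDK you'd get the same effect by passing base_url="http://localhost:11434/v1" to the client constructor:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's default OpenAI-compatible endpoint

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request(OLLAMA_BASE, "llama3.1:8b", "Say hello in five words.")
# With Ollama running locally, urllib.request.urlopen(req) returns the completion JSON.
```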
Testing multiple models. You want to compare Llama 3.1, Mistral, and Qwen in the same session. Ollama loads models on demand and keeps recently used models in memory. ollama run mistral while llama3.1 is still loaded will work if you have the VRAM headroom.
Internal tools. You're building a tool for a handful of people inside your organization. Five simultaneous users is fine with Ollama. It queues requests and processes them one at a time, which works well when load is predictable and low.
Apple Silicon. Ollama uses Metal on M1/M2/M3/M4 Macs. vLLM does not support Metal. If you're developing on a Mac and don't have an NVIDIA GPU, Ollama is your only self-hosted option.
See our full Ollama setup guide for installation details, quantization options, and GPU acceleration configuration.
When to Use vLLM
vLLM fits any situation where you're serving a model to real users under real load, and you need consistent performance.
Production inference APIs. You're building a product that calls an LLM backend on every user request. At 20 concurrent users, Ollama queues 19 of them. vLLM's continuous batching processes all 20 in the same forward pass. The difference in p99 latency is substantial.
High concurrency. vLLM's PagedAttention manages KV cache in non-contiguous memory pages, similar to how an OS manages virtual memory. This means it can handle far more concurrent sequences than Ollama before running out of VRAM. On an H100 80GB serving Llama 3.1 8B, vLLM can sustain 180+ concurrent FP16 requests before OOM. Ollama hits OOM around 40.
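To make the PagedAttention idea concrete, here is a toy page allocator in Python. This is an illustration of the bookkeeping only, not vLLM's actual implementation (which does this on the GPU with custom kernels): each sequence owns a list of fixed-size pages that need not be contiguous, so freed pages from finished sequences are immediately reusable and fragmentation stays low.

```python
class PagedKVCache:
    """Toy illustration of paged KV-cache allocation."""

    PAGE_TOKENS = 16  # tokens stored per page (the "block size")

    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))
        self.page_tables = {}  # seq_id -> list of (non-contiguous) page ids
        self.lengths = {}      # seq_id -> tokens stored so far

    def append(self, seq_id: str) -> None:
        """Reserve KV-cache space for one more token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.PAGE_TOKENS == 0:  # last page is full (or none allocated yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_pages=4)  # 4 pages x 16 tokens = 64 tokens of capacity
for _ in range(17):
    cache.append("seq-a")            # 17 tokens spill onto a second page
```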
SLA-bound latency. If you have a p95 latency target, vLLM's continuous batching and chunked prefill give you far more control over latency distribution. Ollama's sequential queue means a single slow request delays all subsequent ones.
Cost-efficient scale. Continuous batching keeps the GPU saturated with useful work instead of idling between sequential requests, so each dollar of compute produces more tokens. vLLM's throughput per dollar is meaningfully better than Ollama's at scale, especially on H100 with FP8 enabled. For broader strategies on reducing GPU spend, see the GPU cost optimization playbook.
Multi-GPU and large models. Running Llama 3.3 70B? You need either FP8 on a single H100 (tight) or tensor parallelism across two H100s. vLLM handles both. Ollama does not support multi-GPU tensor parallelism.
For multi-GPU vLLM setups, see our vLLM production deployment guide.
Same Hardware Benchmarks: H100 80GB
These numbers were measured on Spheron H100 SXM5 80GB instances at $2.40/hr on-demand as of March 25, 2026. Ollama uses Q4_K_M GGUF; vLLM uses FP16 unless noted. Prompt: 100 tokens in, 200 tokens out.
| Metric | Ollama (llama3.1:8b Q4_K_M) | vLLM (Llama-3.1-8B-Instruct FP16) |
|---|---|---|
| Single-user throughput | ~420 tok/s | ~510 tok/s |
| 8 concurrent requests | ~310 tok/s total | ~1,100 tok/s total |
| 32 concurrent requests | ~320 tok/s total | ~1,450 tok/s total |
| Time to first token (single) | ~35ms | ~28ms |
| Time to first token (32 concurrent) | ~290ms | ~95ms |
| VRAM usage (8B model + KV cache) | ~9.2 GB | ~17.5 GB (FP16) / ~9.8 GB (FP8) |
| Max concurrent before OOM (80GB) | ~40 | ~180 (FP16) / ~350+ (FP8) |
The single-user numbers are close. At one concurrent request, both tools are fast enough for interactive use. The gap opens at 8+ concurrent requests, where Ollama's sequential queue means total throughput barely moves while vLLM scales proportionally.
Note that Ollama uses Q4_K_M GGUF (4-bit quantization), which is why its VRAM usage is lower at single-user load. vLLM in FP16 uses more VRAM but handles concurrent load far better. With vLLM FP8 enabled, VRAM drops to roughly the same level as Ollama's GGUF, while throughput under concurrent load stays 3-4x higher.
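If you want to reproduce numbers like these on your own hardware, the aggregation is simple once you have per-request timings. A minimal sketch, assuming you've collected (output_tokens, latency_seconds) pairs from your own load generator:

```python
import statistics

def summarize(results, elapsed_s):
    """Aggregate a load test.

    results: list of (output_tokens, latency_seconds) tuples, one per request.
    elapsed_s: wall-clock time for the whole run.
    """
    total_tokens = sum(tok for tok, _ in results)
    latencies = sorted(lat for _, lat in results)
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "throughput_tok_s": total_tokens / elapsed_s,
        "p50_latency_s": statistics.median(latencies),
        "p99_latency_s": latencies[p99_index],
    }

# e.g. 32 concurrent requests, 200 output tokens each, finished in 4.4 s total
stats = summarize([(200, 4.0)] * 32, elapsed_s=4.4)
```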
For a full GPU-by-GPU inference breakdown, see Best GPU for AI Inference 2026.
Pricing fluctuates with GPU availability. The figures above reflect rates as of 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
One-Click Setup on Spheron
Both scripts work as cloud-init user-data on Spheron GPU instances. Paste either script into the user-data field when launching a new instance, and the tool will be ready by the time you SSH in.
Ollama Cloud-Init
```bash
#!/bin/bash
# Ollama cloud-init for Spheron GPU instances
curl -fsSL https://ollama.com/install.sh | sh

# Bind to all interfaces so you can reach it remotely
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF
systemctl daemon-reload
systemctl enable ollama
systemctl restart ollama

# Pull a model (change as needed)
sleep 5
ollama pull llama3.1:8b
```

The OLLAMA_HOST=0.0.0.0 override is essential if you're accessing Ollama from outside the instance. Without it, Ollama binds to localhost only and remote API calls will fail.
vLLM Cloud-Init
```bash
#!/bin/bash
# vLLM cloud-init for Spheron GPU instances
apt-get update -y
apt-get install -y docker.io

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update -y
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Launch vLLM
# For gated models (like Llama 3.1), add: -e HUGGING_FACE_HUB_TOKEN=hf_your_token_here
# To avoid auth issues, you can use an openly licensed model like:
# mistralai/Mistral-7B-Instruct-v0.3
docker run -d \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256
```

Two things to know about this script. First, meta-llama/Llama-3.1-8B-Instruct is a gated model. You need to accept Meta's license on Hugging Face and pass your token via -e HUGGING_FACE_HUB_TOKEN=hf_.... If you want to skip the auth step, replace the model with mistralai/Mistral-7B-Instruct-v0.3, which is openly licensed. Second, --ipc=host is required. Removing it will cause cryptic CUDA errors under load.
For detailed Spheron instance setup, see the Ollama quick guide and vLLM server guide in our docs.
Migrating from Ollama to vLLM
Both tools expose an OpenAI-compatible API, so the migration is mostly a configuration change, not a code rewrite. For a full walkthrough of running a self-hosted OpenAI-compatible endpoint on Spheron, see Self-hosted OpenAI-compatible API with vLLM. Here's what actually changes.
1. Model Format
This is the only non-trivial part. Ollama uses GGUF, a format specific to llama.cpp with quantization baked in. vLLM uses Hugging Face safetensors format. You cannot point vLLM at a .gguf file.
Find the equivalent Hugging Face model for whatever you're running in Ollama:
| Ollama model name | Hugging Face equivalent |
|---|---|
| llama3.1:8b | meta-llama/Llama-3.1-8B-Instruct |
| llama3.1:70b | meta-llama/Llama-3.1-70B-Instruct |
| mistral | mistralai/Mistral-7B-Instruct-v0.3 |
| qwen2.5:7b | Qwen/Qwen2.5-7B-Instruct |
| gemma2:9b | google/gemma-2-9b-it |
vLLM does support FP8, AWQ, and GPTQ quantization via Hugging Face, so you can still run quantized models. You just can't use GGUF files. For most 7B-13B models on an H100, FP16 fits comfortably and gives you full quality with no quantization artifacts.
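If you're scripting the migration, the mapping table above can be captured in a small lookup helper. This is a hypothetical convenience, not part of either tool; extend the dictionary with whatever models you actually run:

```python
# Ollama short names -> Hugging Face model paths (from the table above)
OLLAMA_TO_HF = {
    "llama3.1:8b": "meta-llama/Llama-3.1-8B-Instruct",
    "llama3.1:70b": "meta-llama/Llama-3.1-70B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen2.5:7b": "Qwen/Qwen2.5-7B-Instruct",
    "gemma2:9b": "google/gemma-2-9b-it",
}

def hf_equivalent(ollama_name: str) -> str:
    """Translate an Ollama model name to its Hugging Face equivalent."""
    try:
        return OLLAMA_TO_HF[ollama_name]
    except KeyError:
        raise ValueError(f"no known Hugging Face equivalent for {ollama_name!r}")
```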
2. API Base URL
Change one line in your application:
```python
# Before (Ollama)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# After (vLLM)
client = OpenAI(base_url="http://your-instance-ip:8000/v1", api_key="token")
```

Everything else stays the same: system prompts, user messages, temperature, max_tokens. The /v1/chat/completions endpoint behaves identically.
3. Model Name in Requests
Ollama uses short names like llama3.1:8b. vLLM uses the full Hugging Face path. Update the model field in your API calls:
```python
# Before
response = client.chat.completions.create(model="llama3.1:8b", ...)

# After
response = client.chat.completions.create(model="meta-llama/Llama-3.1-8B-Instruct", ...)
```

4. Concurrency Settings
In Ollama, you don't configure concurrency. In vLLM, --max-num-seqs controls how many requests can be in-flight at once. Start at 256 for a single H100 with an 8B model and adjust based on your vllm:kv_cache_usage_perc metric. If you're consistently above 90% KV cache utilization, either lower --max-num-seqs or reduce --max-model-len.
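To watch that metric without standing up a full Prometheus stack, you can scrape vLLM's /metrics endpoint directly. A sketch, assuming the standard Prometheus text exposition format (the label set in the sample is illustrative):

```python
import urllib.request

def kv_cache_usage(metrics_text: str) -> float:
    """Extract vllm:kv_cache_usage_perc from a Prometheus /metrics dump.

    # HELP and # TYPE comment lines are skipped automatically because
    they start with '#' rather than the metric name.
    """
    for line in metrics_text.splitlines():
        if line.startswith("vllm:kv_cache_usage_perc"):
            return float(line.rsplit(None, 1)[-1])  # value is the last field
    raise ValueError("vllm:kv_cache_usage_perc not found in metrics output")

def fetch_kv_cache_usage(base_url: str = "http://localhost:8000") -> float:
    """Scrape a running vLLM server and return current KV cache utilization."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return kv_cache_usage(resp.read().decode("utf-8"))
```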
Which One Should You Use?
| Situation | Use |
|---|---|
| Building a prototype or internal tool | Ollama |
| Running local dev without Docker | Ollama |
| Testing multiple models in one session | Ollama |
| Serving 5+ concurrent users | vLLM |
| Need p99 latency SLA | vLLM |
| Cost-optimizing production inference | vLLM |
| Running on Apple Silicon | Ollama |
| Multi-GPU tensor parallelism | vLLM |
The practical rule: start with Ollama. If you're hitting concurrent request limits or need Prometheus metrics for your SLA dashboards, switch to vLLM. The migration is straightforward, and the throughput improvement at scale justifies the added complexity.
Both tools run well on Spheron GPU cloud. Spin up an H100 for vLLM production serving or any 8GB+ GPU for Ollama prototyping, all on-demand with per-minute billing.
