Phi-4 was already hard to justify ignoring for cost-sensitive inference. Phi-5 makes the case more direct: Microsoft's 14B model matches or exceeds much larger models on math, coding, and structured reasoning, and it fits on a single L40S with room for KV cache.
This guide covers how to deploy Phi-5 on GPU cloud for production inference, including VRAM sizing across all variants, vLLM and SGLang deployment commands, quantization paths, and an honest cost comparison against OpenAI gpt-4o-mini.
Note: Phi-5 specs in this guide are based on Microsoft's pre-release announcements and the known architecture trajectory of the Phi model family. Verify against the official model card at release. Deployment patterns and quantization workflows are stable regardless of exact model specs.
Phi-5 vs Phi-4: What Changed
Phi-4 (14B, December 2024) already showed what careful data curation could do for a mid-sized dense transformer. It outperformed models 4x its size on math and coding benchmarks while fitting on a single A100 or L40S. Phi-5 extends that pattern.
Architecture changes:
- Parameter count: Phi-5 maintains the 14B dense architecture for the primary model. The Mini variant targets 3-4B.
- Context window: Extended from Phi-4's 16K to an expected 32K-64K for Phi-5, enabling longer document processing without chunking.
- Training data: Heavier synthetic data pipeline with a focus on code, math, and structured reasoning. Microsoft's Phi series has consistently used synthetic data to compensate for smaller parameter counts.
- Attention: Likely grouped-query attention (GQA) for better KV cache efficiency, similar to the approach used in Phi-3.5.
Benchmark comparison (pre-release, based on announced figures):
| Model | Params | MMLU | HumanEval | GSM8K | Context |
|---|---|---|---|---|---|
| Phi-4 | 14B | 84.8 | 82.9 | 91.5 | 16K |
| Phi-5 (expected) | 14B | 87+ | 85+ | 93+ | 32-64K |
| Phi-5 Mini (expected) | 3-4B | 75+ | 70+ | 82+ | 32K |
| Gemma 3 9B | 9B | 74.9 | 71.7 | 86.6 | 128K |
| Qwen 2.5 14B | 14B | 79.9 | 72.7 | 85.6 | 128K |
| Llama 3.2 11B | 11B | 73.0 | 72.6 | 87.3 | 128K |
Phi-5's context window is shorter than some competitors like Qwen 2.5 or Gemma 3. For tasks that genuinely need 128K+ context, those alternatives are better choices. For coding, math, and structured reasoning within 32-64K context, Phi-5's benchmark-to-parameter ratio still leads its size class.
When Phi-5 Beats Larger Models
The clearest wins for Phi-5 are workloads where the task complexity does not require a 70B model but where you still want reliable output quality.
Cost-per-token reality check:
| Scenario | Phi-5 Advantage | Approximate Breakeven Volume |
|---|---|---|
| Phi-5 vs Llama 3.3 70B | ~4x more tokens/sec per GPU | Any production-scale serving |
| Phi-5 vs GPT-4o API | Self-hosted at ~$0.06/M output | ~150M tokens/day |
| Phi-5 vs gpt-4o-mini API | Self-hosted at ~$0.06/M output | ~100M tokens/day |
Three categories where Phi-5 consistently beats larger models in production:
Agent inner-loop reasoning. Agentic workflows make 15-30 sequential model calls per task. Each call on a 70B model adds 300-800ms at typical batch sizes. Phi-5 on L40S runs at 2,500+ tokens/sec, which translates to under 50ms for a 100-token reasoning step. Over 20 steps, that latency difference is the difference between a 1-second response and a 15-second one.
Structured output and function calling. Phi-5's instruction-following training makes it particularly reliable for JSON extraction, classification with defined labels, and constrained tool calls. For these tasks, a 70B model gives you the same answer structure at 5x the cost.
SLM workloads at scale. Document classification, entity extraction, summarization pipelines, and RAG over internal knowledge bases all fall into this category. If you're processing millions of documents per day with a fixed set of tasks, Phi-5 handles most of them correctly while costing far less per token than any API or large hosted model.
For a broader view of the small language model deployment landscape, see the SLM deployment guide for comparisons across Phi-3, Mistral 7B, Gemma 2 9B, and Llama 3.2.
GPU Sizing: VRAM Requirements for Phi-5
VRAM requirements for the Phi-5 family, based on expected architecture and Phi-4 baselines:
| Variant | Params | VRAM FP16 | VRAM FP8 | VRAM AWQ INT4 | Recommended GPU |
|---|---|---|---|---|---|
| Phi-5 Mini | 3-4B | ~8GB | ~4GB | ~3GB | RTX 4090, L4 (24GB) |
| Phi-5 | 14B | ~28GB | ~14GB | ~9GB | L40S, A100 80GB |
| Phi-5 Vision | ~14B+ | ~32GB | ~16GB | ~10GB | L40S 48GB, A100 80GB |
These are weight-only estimates. Add 20-30% for KV cache headroom at production batch sizes. A Phi-5 14B at FP16 needs 28GB for weights but ~35-36GB end-to-end with a reasonable KV cache budget at batch 32. An L40S with 48GB gives you enough headroom; an A100 80GB gives you substantial batch headroom.
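The sizing rule above can be sketched as a small calculator. This is a rough estimate under stated assumptions: the bytes-per-parameter values are nominal (real INT4 AWQ checkpoints carry scale and zero-point overhead, which is why the table lists ~9GB rather than 7GB), the 25% headroom is the midpoint of the 20-30% range, and the function names are illustrative.

```python
# Nominal bytes per parameter (assumption: ignores quantization metadata
# such as AWQ scales and zero points, so INT4 comes out lower than the table).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Weight-only memory in GB, treating 1 GB as 1e9 bytes."""
    return params_billion * BYTES_PER_PARAM[precision]


def serving_vram_gb(params_billion: float, precision: str,
                    kv_headroom: float = 0.25) -> float:
    """Weights plus a fractional KV cache headroom (0.20-0.30 in the text)."""
    return weight_vram_gb(params_billion, precision) * (1.0 + kv_headroom)


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        weights = weight_vram_gb(14, precision)
        total = serving_vram_gb(14, precision)
        print(f"Phi-5 14B {precision}: {weights:.1f} GB weights, "
              f"~{total:.1f} GB with KV headroom")
```

Actual KV cache size depends on context length, batch size, and the attention layout, so treat the headroom fraction as a starting point and confirm against measured usage.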
For a complete breakdown of VRAM requirements across quantization levels and model sizes, see GPU memory requirements for LLMs.
The L40S is the right primary choice for Phi-5 14B serving. It has enough VRAM for the full FP16 model plus KV cache, and its Ada Lovelace architecture supports FP8 via the Transformer Engine. The A100 80GB makes sense if you want extra KV cache headroom for long-context workloads (32K+ prompts) or if you're running a continuous batching setup with many concurrent users. The L4 (24GB) covers Phi-5 Mini at FP16 with headroom and can also run Phi-5 14B at AWQ INT4; compare hourly rates before choosing it, since the L4 is not always cheaper than the L40S.
Skip the H100 for Phi-5 in most cases: at $3.84/hr vs the L40S at $0.72/hr, that's a 5.3x premium per hour for a model that runs fine on a single GPU. The H100's NVLink bandwidth and FP8 throughput matter most for large multi-GPU workloads, which Phi-5 14B does not require.
Deploying Phi-5 with vLLM
vLLM 0.4+ supports the Phi-4 family and is expected to add Phi-5 support at or near launch. The deployment pattern is identical.
Standard FP16/BF16 deployment on L40S:
docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model microsoft/Phi-5 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--dtype bfloat16 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 65536 \
--enable-chunked-prefill \
--max-num-seqs 256
Flag notes:
- --dtype bfloat16: BF16 is preferred over FP16 for the Phi family; it gives better numerical stability on transformer operations.
- --enable-chunked-prefill: Prevents long-prompt requests from blocking the decode queue. Particularly useful for Phi-5's extended context window.
- --max-num-batched-tokens 65536: Controls how many tokens are processed per iteration step. Tune this based on your GPU's memory bandwidth and target latency.
- --max-num-seqs 256: Max concurrent sequences. Start here and adjust based on KV cache utilization from the /metrics endpoint.
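To act on the /metrics advice, a minimal sketch of reading one gauge from the Prometheus text output might look like the following. The metric name `vllm:gpu_cache_usage_perc` and the sample payload are assumptions for illustration; metric names differ between vLLM versions, so check what your server actually exposes.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw Prometheus text exposition from the server."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def read_metric(metrics_text: str, name: str) -> list[float]:
    """Return every sample value for `name`.

    Assumes samples have no trailing timestamp, i.e. each sample line
    ends with the value.
    """
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]  # drop the label set, if any
        if metric_name == name:
            values.append(float(value))
    return values


# Hypothetical sample payload, used only to show the parsing.
SAMPLE = """# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="microsoft/Phi-5"} 0.37
vllm:num_requests_running{model_name="microsoft/Phi-5"} 12.0
"""

if __name__ == "__main__":
    print(read_metric(SAMPLE, "vllm:gpu_cache_usage_perc"))
```

If KV cache usage sits near 1.0 under normal load, lowering --max-num-seqs or --max-model-len is the usual first adjustment.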
Test the endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-5",
"messages": [
{"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."}
],
"max_tokens": 512,
"temperature": 0
}'
FP8 deployment on L40S (Ada Transformer Engine):
docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model microsoft/Phi-5 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--quantization fp8 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 65536 \
--enable-chunked-prefill
FP8 cuts VRAM from ~28GB to ~14GB for Phi-5 14B. On an L40S, this frees enough headroom to run larger KV caches or serve more concurrent sequences. Throughput typically improves 30-40% over BF16 on Ada hardware.
AWQ INT4 deployment:
docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model microsoft/Phi-5-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--dtype float16 \
--gpu-memory-utilization 0.90
With AWQ, Phi-5 14B fits in ~9GB, making it viable on any GPU with 12GB+ VRAM, including the RTX 4090. The throughput advantage of running the full model on L40S usually outweighs the memory savings at high batch sizes, but for cost-constrained deployments where you already have an RTX 4090, AWQ is the right path.
Deploying Phi-5 with SGLang
SGLang is the better choice when you need structured output, constrained decoding, or RadixAttention for prefix caching across shared system prompts.
Install and launch:
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path microsoft/Phi-5 \
--port 30000 \
--dtype bfloat16 \
--mem-fraction-static 0.85 \
--max-prefill-tokens 65536 \
--chunked-prefill-size 8192
When SGLang is better than vLLM for Phi-5:
- Structured output at scale: SGLang's constrained decoding for JSON schemas is faster than vLLM's guided decoding for high-throughput structured extraction workloads.
- Long shared prefixes: If most requests share a long system prompt (common in agent frameworks), SGLang's RadixAttention reuses the KV cache for the shared prefix. For Phi-5's extended context window, this can halve effective TTFT.
- Function calling pipelines: SGLang's native tool call primitives fit naturally with Phi-5's strong instruction-following for structured function calls.
For pure throughput on mixed-length general requests, vLLM is typically faster. For constrained generation and shared-prefix workloads, SGLang wins.
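As a sketch of the structured-extraction case, the request below asks an OpenAI-compatible endpoint to constrain output to a JSON schema. The `response_format` shape follows the OpenAI chat completions convention that both servers aim to accept; the ticket schema, the `build_request` and `parse_ticket` helpers, and the field names are hypothetical, and exact support depends on your server version.

```python
import json

# Hypothetical schema for a support-ticket extraction task.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["category", "urgent"],
}


def build_request(text: str, model: str = "microsoft/Phi-5") -> dict:
    """Build a chat completions payload with schema-constrained output."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract the ticket fields as JSON."},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "ticket", "schema": TICKET_SCHEMA},
        },
    }


def parse_ticket(raw: str) -> dict:
    """Parse the model output and re-check the fields defensively."""
    data = json.loads(raw)
    allowed = TICKET_SCHEMA["properties"]["category"]["enum"]
    if data.get("category") not in allowed:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    if not isinstance(data.get("urgent"), bool):
        raise ValueError("urgent must be a boolean")
    return data


if __name__ == "__main__":
    payload = build_request("I was charged twice this month, please fix ASAP.")
    print(json.dumps(payload["response_format"], indent=2))
    print(parse_ticket('{"category": "billing", "urgent": true}'))
```

Keeping a client-side check like `parse_ticket` is cheap insurance even when constrained decoding is enabled, since it also catches misconfigured servers that silently ignore the schema.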
Quantization Paths for Phi-5
AWQ (Recommended for H100, A100, L40S)
AWQ INT4 produces the best quality-to-memory tradeoff for Phi-5. Expected accuracy delta vs FP16: roughly 1-2% degradation on MMLU, under 3% on HumanEval. For most production tasks, this is indistinguishable from full precision.
For a full walkthrough of the AWQ quantization process including calibration dataset selection, see the AWQ quantization guide for LLM deployment.
Quantize from base weights using AutoAWQ:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "microsoft/Phi-5"
quant_path = "Phi-5-AWQ"
# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Calibrate and quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Then deploy with vLLM using --model Phi-5-AWQ --quantization awq.
GPTQ
GPTQ is a viable alternative when pre-quantized GPTQ checkpoints are available on Hugging Face and you want to avoid running the AWQ calibration yourself.
docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model microsoft/Phi-5-GPTQ-4bit \
--quantization gptq \
--dtype float16
GPTQ vs AWQ in practice: if a pre-calibrated GPTQ checkpoint exists with a good calibration dataset, it is fine. If you need to quantize yourself, AWQ produces slightly better quality for the same bit-width. For Phi-5 specifically, check Hugging Face for community-quantized checkpoints before running calibration from scratch.
MXFP4 (Blackwell Only)
MXFP4 microscaling quantization requires Blackwell hardware (B200, GB200, RTX 5090). On supported hardware, it delivers roughly 2x throughput over BF16 with minimal quality loss, thanks to block-scaled FP4 with per-group exponents.
For the full MXFP4 workflow using TensorRT Model Optimizer and MR-GPTQ, see the MXFP4 microscaling guide. The process applies directly to Phi-5 since the model architecture is a standard dense transformer.
If you're running Phi-5 on Blackwell infrastructure, MXFP4 is the right quantization target. On Hopper (H100) or Ada (L40S), use FP8 instead.
Multimodal Phi-5: Vision-Language Setup
Phi-5 Vision extends the base model with a visual encoder for image understanding. Based on Microsoft's pre-release information, the vision encoder adds roughly 4-6GB of VRAM overhead, bringing total requirements to ~32-34GB FP16 on an L40S.
vLLM deployment for Phi-5 Vision:
docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model microsoft/Phi-5-Vision \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--dtype bfloat16 \
--image-input-type pixel_values \
--image-token-id 50257 \
--image-input-shape 1,3,336,336 \
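--image-feature-size 1024
Check the official Phi-5 Vision model card for the exact image token and feature size values, as these are architecture-specific and may differ from the flags above. The --image-* flags also come from older vLLM releases; newer releases read image settings from the model's own config and no longer accept them, so drop these flags if the server rejects them at startup.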
Multimodal request example:
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Encode the image as base64 so it can be sent inline as a data URL
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="microsoft/Phi-5-Vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Describe the architecture shown in this diagram and identify the bottlenecks."}
            ]
        }
    ],
    max_tokens=512
)

Production Patterns
Phi-5 as an LLM Router
Phi-5's low inference cost and reliable instruction following make it a natural router for a tiered inference system. The router classifies incoming queries by complexity and routes them to the appropriate model tier.
For a complete implementation guide covering the classifier design, NGINX proxy setup, and GPU tier selection, see the LLM inference router pattern.
A simplified Phi-5 router that classifies queries before dispatching to a larger model:
from openai import OpenAI
import json

router_client = OpenAI(base_url="http://phi5-instance:8000/v1", api_key="none")
large_client = OpenAI(base_url="http://h100-instance:8001/v1", api_key="none")

def route_request(user_message: str) -> str:
    # Ask Phi-5 to classify the query complexity
    classification = router_client.chat.completions.create(
        model="microsoft/Phi-5",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this query as SIMPLE or COMPLEX. "
                    "SIMPLE: factual lookup, short summarization, classification, extraction. "
                    "COMPLEX: multi-step reasoning, code generation, novel synthesis, long-form writing. "
                    "Respond with exactly one word: SIMPLE or COMPLEX."
                )
            },
            {"role": "user", "content": user_message}
        ],
        max_tokens=5,
        temperature=0
    )
    tier = (classification.choices[0].message.content or "").strip().upper()
    if tier == "SIMPLE":
        # Serve directly from Phi-5
        response = router_client.chat.completions.create(
            model="microsoft/Phi-5",
            messages=[{"role": "user", "content": user_message}],
            max_tokens=512
        )
    else:
        # Escalate to a larger model
        response = large_client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            messages=[{"role": "user", "content": user_message}],
            max_tokens=1024
        )
    return response.choices[0].message.content or ""

With this pattern, Phi-5 classifies each request in under 50ms. For the 60-70% of queries that are genuinely simple, Phi-5 answers directly at L40S pricing. Only the complex minority reaches the expensive H100 tier.
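One fragile spot in a router like this is the exact string match on the classifier output: a reply such as "SIMPLE." with trailing punctuation falls through to the expensive tier. A small normalizing helper, sketched here with the hypothetical name `parse_tier`, tolerates punctuation and whitespace and still escalates whenever the label is ambiguous or missing.

```python
import re
from typing import Optional


def parse_tier(raw: Optional[str]) -> str:
    """Map raw classifier text to 'SIMPLE' or 'COMPLEX'.

    Anything other than a lone SIMPLE label escalates, so a confused
    classifier costs money rather than answer quality.
    """
    words = re.findall(r"[A-Za-z]+", (raw or "").upper())
    if words == ["SIMPLE"]:
        return "SIMPLE"
    return "COMPLEX"


if __name__ == "__main__":
    for reply in ("SIMPLE", "simple.\n", "SIMPLE or COMPLEX", "", None):
        print(repr(reply), "->", parse_tier(reply))
```

Defaulting to COMPLEX is a design choice: it trades some cost for safety. If your traffic is dominated by simple queries and misroutes are cheap to detect, the opposite default may be reasonable.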
Phi-5 as Agent Inner-Loop Reasoning
Phi-5's throughput makes it well-suited for the fast reasoning steps in agentic pipelines. On an L40S, expect 2,000-3,000 tokens/sec for decode at moderate batch sizes.
Latency math per reasoning step:
- Average reasoning step: 150 tokens
- Phi-5 on L40S at 2,500 tokens/sec: 150 / 2500 = 60ms
- A 20-step agent task: 20 * 60ms = ~1.2 seconds for model inference
Compare against GPT-4o at ~100-150 tokens/sec effective throughput from the API: 150 tokens / 120 tokens/sec = 1.25s per step, 25 seconds total for 20 steps.
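For agent tasks where reasoning steps are short and the right answer does not require a 70B model, this math suggests Phi-5 can cut multi-step agent latency by 10-20x. Treat that as an upper bound: the 2,500 tokens/sec figure is aggregate throughput across a batch, and a single sequential agent stream decodes more slowly than the aggregate rate.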
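The latency arithmetic above is easy to rerun with your own numbers. In this sketch, `tokens_per_sec` should be the decode rate a single request actually observes, which you would measure on your own deployment; the values plugged in are the ones quoted in the text, and the function names are illustrative.

```python
def step_latency_s(tokens_per_step: int, tokens_per_sec: float) -> float:
    """Seconds of decode time for one reasoning step."""
    return tokens_per_step / tokens_per_sec


def task_latency_s(steps: int, tokens_per_step: int, tokens_per_sec: float) -> float:
    """Seconds of model inference for a sequential multi-step task."""
    return steps * step_latency_s(tokens_per_step, tokens_per_sec)


if __name__ == "__main__":
    phi5 = task_latency_s(steps=20, tokens_per_step=150, tokens_per_sec=2500)
    api = task_latency_s(steps=20, tokens_per_step=150, tokens_per_sec=120)
    print(f"Phi-5 on L40S: {phi5:.2f}s, API at 120 tok/s: {api:.2f}s, "
          f"ratio: {api / phi5:.1f}x")
```

The model ignores prefill time, network round trips, and tool execution, all of which add to each step in a real agent loop.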
Phi-5 for SLM Workloads
Classification, entity extraction, structured output, RAG-based Q&A, and document summarization are all tasks where Phi-5 delivers strong results at a fraction of large-model costs. The SLM deployment guide covers the broader pattern of routing these workloads to small models, including multi-LoRA serving for running multiple task-specific fine-tunes on a single GPU.
Cost Comparison: Phi-5 on Spheron vs OpenAI API
Live pricing from the Spheron GPU marketplace, fetched 26 May 2026:
| GPU | VRAM | On-Demand (per GPU/hr) | Best For |
|---|---|---|---|
| L4 PCIe | 24GB | ~$1.25 | Phi-5 Mini, AWQ INT4 Phi-5 14B |
| L40S PCIe | 48GB | ~$0.72 | Phi-5 14B FP16 and FP8 serving |
| A100 80GB PCIe | 80GB | ~$1.04 | High-throughput batched inference, multi-user |
| H100 SXM5 | 80GB | ~$3.84 | Large multi-GPU workloads (5.3x premium over L40S) |
Pricing fluctuates based on GPU availability. The prices above are as of 26 May 2026 and may have changed. Check current GPU pricing for live rates.
Throughput estimate for Phi-5 14B (expected, based on Phi-4 benchmarks):
- L40S FP16: ~2,500 tokens/sec output at batch 32
- L40S FP8: ~3,500 tokens/sec output at batch 32
- H100 SXM5 FP8: ~6,000 tokens/sec output at batch 64
Cost-per-million-output-tokens (L40S FP8 at 3,500 tokens/sec):
- $0.72/hr / (3,500 * 3,600 tokens/hr) = $0.72 / 12.6M = ~$0.057/M tokens
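The same formula as a reusable function, so you can substitute your own hourly rate and measured throughput. The inputs in the demo are the estimates from this section, not benchmarked values, and the function name is illustrative.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million output tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    scenarios = {
        "L40S FP16": (0.72, 2500),
        "L40S FP8": (0.72, 3500),
        "H100 SXM5 FP8": (3.84, 6000),
    }
    for name, (price, tps) in scenarios.items():
        print(f"{name}: ${cost_per_million_tokens(price, tps):.3f}/M tokens")
```

This assumes the GPU is fully utilized for the whole hour. At lower utilization the effective cost per token rises proportionally, which is why the breakeven analysis later in this section matters.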
OpenAI gpt-4o-mini pricing:
- Input: $0.15/1M tokens
- Output: $0.60/1M tokens
Daily cost comparison: dedicated L40S vs gpt-4o-mini API (output tokens):
| Daily Output Volume | Spheron L40S (self-hosted) | OpenAI gpt-4o-mini API |
|---|---|---|
| 1M tokens/day | $17.28 (24hr dedicated) | $0.60 |
| 10M tokens/day | $17.28 (24hr dedicated) | $6.00 |
| 100M tokens/day | $17.28 (24hr dedicated) | $60.00 |
| 1B tokens/day | ~$69.12 (4x L40S) | $600.00 |
On output tokens alone, a dedicated L40S ($17.28/day) breaks even against gpt-4o-mini at roughly 29M tokens/day ($17.28 / $0.60 per M) and is clearly ahead by 100M tokens/day, where the API costs $60/day. At 1B tokens/day you need 4 L40S instances (one handles ~302M tokens/day at 3,500 tokens/sec) but save ~$531/day vs API pricing.
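The table rows can be reproduced with a short fleet-sizing sketch. It assumes every GPU sustains the FP8 throughput estimate around the clock and that capacity is bought in whole GPUs; the default prices are the ones quoted in this section, and the function names are illustrative.

```python
import math


def gpus_needed(daily_tokens: float, tokens_per_sec: float = 3500) -> int:
    """Whole GPUs required to serve a daily output volume."""
    per_gpu_daily = tokens_per_sec * 86_400  # seconds per day
    return max(1, math.ceil(daily_tokens / per_gpu_daily))


def daily_costs(daily_tokens: float, gpu_price_hr: float = 0.72,
                tokens_per_sec: float = 3500,
                api_price_per_m: float = 0.60) -> tuple[int, float, float]:
    """Return (gpu_count, self_hosted_cost, api_cost) in dollars per day."""
    gpus = gpus_needed(daily_tokens, tokens_per_sec)
    self_hosted = gpus * gpu_price_hr * 24
    api = daily_tokens / 1_000_000 * api_price_per_m
    return gpus, self_hosted, api


if __name__ == "__main__":
    for volume in (1e6, 10e6, 100e6, 1e9):
        gpus, hosted, api = daily_costs(volume)
        print(f"{volume / 1e6:>6.0f}M tokens/day: {gpus} GPU(s), "
              f"self-hosted ${hosted:.2f}, API ${api:.2f}")
```

Real traffic is bursty, so a production fleet usually needs spare capacity beyond what this average-load calculation suggests.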
Breakeven calculation (mixed workload, 70% input / 30% output):
Effective API cost per M tokens: (0.7 * $0.15 + 0.3 * $0.60) = $0.285/M
L40S monthly cost: $0.72 * 24 * 30 = $518.40/month
Breakeven: $518.40 / $0.285 per M = 1.82B tokens/month (~60M tokens/day)
Above 60M tokens/day in mixed traffic, self-hosting Phi-5 on a single L40S is cheaper than gpt-4o-mini. That threshold drops further if you use spot pricing or run multiple workloads on the same instance.
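The breakeven calculation generalizes to any input/output mix and GPU price. A minimal sketch, using the figures from this section as defaults and illustrative function names:

```python
def blended_api_price(input_share: float, input_price: float = 0.15,
                      output_price: float = 0.60) -> float:
    """Effective API dollars per million tokens for a given input share."""
    return input_share * input_price + (1 - input_share) * output_price


def breakeven_tokens_per_month(gpu_price_hr: float, api_price_per_m: float,
                               hours: int = 24 * 30) -> float:
    """Monthly token volume at which a dedicated GPU matches API spend."""
    monthly_gpu_cost = gpu_price_hr * hours
    return monthly_gpu_cost / api_price_per_m * 1_000_000


if __name__ == "__main__":
    price = blended_api_price(input_share=0.7)
    monthly = breakeven_tokens_per_month(0.72, price)
    print(f"Blended API price: ${price:.3f}/M")
    print(f"Breakeven: {monthly / 1e9:.2f}B tokens/month "
          f"(~{monthly / 30 / 1e6:.1f}M tokens/day)")
```

An input-heavy mix raises the breakeven volume because API input tokens are cheap, while an output-heavy mix lowers it.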
Note: Spheron aggregates compute from 5+ providers. Rates fluctuate based on real-time availability across the marketplace.
Phi-5's small footprint makes hyperscaler markup highly visible: you're paying for 48+ GB of VRAM you don't need when an L40S or even an L4 handles the workload. Spheron's marketplace aggregates on-demand GPU access from 5+ providers so you only pay for what you use.
Quick Setup Guide
Pick based on your task and hardware. Phi-5 Mini (3-4B) fits on any 8GB+ GPU and handles classification, extraction, and short-form text generation. Phi-5 (14B) is the main model for coding, math, and instruction-following. Phi-5 Vision adds image understanding. Start with Phi-5 14B unless you have hard latency or cost constraints.
Go to app.spheron.ai and select an L40S 48GB for standard Phi-5 14B serving, or an A100 80GB for high-throughput batched inference. For Phi-5 Mini, an RTX 4090 or L4 handles the workload. Use on-demand for production, spot for batch processing jobs.
Run: docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model microsoft/Phi-5 --tensor-parallel-size 1 --max-model-len 32768 --dtype bfloat16. This starts an OpenAI-compatible endpoint on port 8000. For AWQ: add --quantization awq and point to the AWQ checkpoint.
For single-GPU L40S deployment: --max-num-batched-tokens 65536 --enable-chunked-prefill. For 2-GPU tensor parallelism on A100 or H100: --tensor-parallel-size 2. Chunked prefill prevents TTFT spikes from long prompts. Monitor KV cache hit rate with vLLM's /metrics endpoint.
Use vllm bench throughput or locust to simulate your target QPS. Measure TTFT (time to first token) and ITL (inter-token latency) at P50 and P99. Target TTFT under 500ms and ITL under 30ms for interactive workloads. For batch inference, focus on throughput in tokens/sec rather than latency.
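Once you have per-request latency samples from a load test, checking them against the targets is a few lines. This sketch uses the nearest-rank percentile method and assumes the 500ms and 30ms targets apply at P99, which the text does not specify; the function names and sample values are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_interactive_targets(ttft_ms: Sequence[float], itl_ms: Sequence[float],
                              ttft_target_ms: float = 500,
                              itl_target_ms: float = 30) -> bool:
    """True when P99 TTFT and P99 ITL are both under target (assumed P99)."""
    return (percentile(ttft_ms, 99) < ttft_target_ms
            and percentile(itl_ms, 99) < itl_target_ms)


if __name__ == "__main__":
    ttft = [120, 180, 240, 310]  # hypothetical measurements, in ms
    itl = [12, 15, 18, 22]
    print("P50 TTFT:", percentile(ttft, 50), "ms")
    print("P99 TTFT:", percentile(ttft, 99), "ms")
    print("Targets met:", meets_interactive_targets(ttft, itl))
```

With only a handful of samples, P99 is just the maximum, so collect at least a few hundred requests per load level before trusting tail numbers.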
Frequently Asked Questions
How much VRAM does Phi-5 need?
Phi-5 (14B) in FP16 needs about 28GB of VRAM, making a single L40S (48GB) the minimum comfortable choice for on-demand serving. With AWQ INT4 quantization, the model fits in ~9GB and runs on an RTX 4090. Phi-5 Mini (3-4B) fits on any GPU with 8GB+ VRAM in FP16. Add 20-30% headroom for KV cache on top of weight memory.
How does Phi-5 compare to Phi-4?
Phi-5 continues Microsoft's pattern of training small models to punch above their weight class using high-quality synthetic data. Based on pre-release benchmarks, Phi-5 shows improvements over Phi-4 on MMLU, HumanEval, and GSM8K, while maintaining the same 14B dense architecture that makes it practical to self-host on a single mid-tier GPU.
Can Phi-5 be quantized with AWQ?
Yes. Phi-5 is expected to support AWQ INT4 quantization through AutoAWQ. Check Hugging Face for pre-quantized checkpoints, or quantize from the base weights yourself with a calibration dataset. vLLM loads AWQ-quantized Phi-5 directly with the --quantization awq flag, reducing VRAM from ~28GB to ~9GB for the 14B variant.
Is self-hosting Phi-5 cheaper than the gpt-4o-mini API?
At high token volumes, yes. A dedicated L40S on Spheron runs at ~$0.72/hr. At steady-state throughput of roughly 3,500 output tokens/sec (FP8), that works out to around $0.06 per million output tokens vs $0.60/1M for gpt-4o-mini. The breakeven vs gpt-4o-mini typically falls around 60M tokens/day for mixed workloads running on a continuously provisioned GPU.
Does Phi-5 support vision inputs?
Based on Microsoft's pre-release announcements, Phi-5 includes a vision-language variant (Phi-5 Vision) with a dedicated visual encoder for image understanding tasks. The vision encoder adds roughly 4-6GB of VRAM overhead on top of the base model. Configuration follows the same vLLM deployment pattern with a few additional flags for image token processing.
