TL;DR: Llama Stack on H200 SXM5
GPU: H200 SXM5 141GB on Spheron, $4.62/hr on-demand, $1.40/hr spot
Step 1: Start vLLM as the inference backend
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=your_hf_token \
vllm/vllm-openai:latest \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--dtype auto --tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 --max-model-len 32768
Step 2: Build and start the Llama Stack server
pip install llama-stack
llama stack build --template vllm-gpu --image-type docker
llama stack run ./run.yaml
Test inference:
curl http://localhost:8321/v1/inference/chat-completion \
-H "Content-Type: application/json" \
-d '{
"model_id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [{"role": "user", "content": "What is Llama Stack?"}]
}'
Test agent creation:
curl http://localhost:8321/v1/agents \
-H "Content-Type: application/json" \
-d '{
"agent_config": {
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"instructions": "You are a helpful assistant.",
"tools": []
}
}'
The rest of this guide covers every component in detail.
Related Guides
- Deploy Llama 4 Scout and Maverick on GPU Cloud: hardware requirements and vLLM setup for Llama 4 models
- vLLM Production Deployment 2026: complete vLLM setup guide (Llama Stack's default inference backend)
- SGLang Production Deployment: alternative serving engine for agentic workloads
- Self-Host OpenAI-Compatible API: API compatibility patterns for LLM switching
What Is Llama Stack?
When you deploy a raw vLLM server, you get tokens in and tokens out. That's useful but incomplete for production applications. You still need to wire up:
- A safety filter so the model doesn't generate harmful outputs
- An agents loop that handles multi-step tool calling
- A vector store and retrieval pipeline for RAG
- An evaluation harness to benchmark your deployment
- A unified client API that doesn't change when you swap inference backends
Meta built Llama Stack to handle all of this. It's an opinionated framework that wraps multiple Llama components under a single, versioned REST API. Instead of assembling these pieces yourself, you pick a distribution (a pre-built configuration of providers), run one command, and get the full stack.
Llama Stack ships as part of Meta's official Llama 4 documentation and is the reference for how Meta expects Llama models to be deployed at scale.
Six API domains are exposed through a single server:
- Inference: Token generation via Llama models (backed by vLLM, TGI, or Meta's reference engine)
- Safety: Input/output content filtering via Llama Guard 3
- Agents: Multi-step tool-calling loops with registered tool providers
- Memory: Vector store abstraction for RAG (Qdrant, Milvus, or in-memory)
- Tool Runtime: Web search, code interpreter, and RAG retrieval tools
- Eval: Benchmark harness for deployed model quality assessment
If you're coming from raw vLLM, the key mental model is: Llama Stack uses vLLM for inference and adds the application layer on top. Nothing about this architecture requires special hardware or cloud dependencies.
Architecture: Providers, Distributions, and the Unified API
Every API domain in Llama Stack has a provider: a specific implementation that handles that domain's functionality. Providers are interchangeable. You can swap the vector store from in-memory to Qdrant without changing any application code.
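The provider idea can be sketched in a few lines of Python. This is a conceptual illustration only, not Llama Stack's code: `VectorStore`, `WordMatchStore`, `SubstringStore`, `build_store`, and `retrieve` are hypothetical names invented here. The point is that application code depends on an interface, so the implementation behind it can be swapped by changing a configuration value.

```python
from typing import Dict, List, Protocol


class VectorStore(Protocol):
    """Hypothetical interface playing the role of one API domain."""

    def insert(self, doc_id: str, text: str) -> None: ...

    def query(self, text: str) -> List[str]: ...


class WordMatchStore:
    """Toy provider: returns documents sharing at least one word with the query."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def insert(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def query(self, text: str) -> List[str]:
        wanted = set(text.lower().split())
        return [d for d in self._docs.values() if wanted & set(d.lower().split())]


class SubstringStore:
    """Second toy provider: same interface, different matching rule."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def insert(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def query(self, text: str) -> List[str]:
        return [d for d in self._docs.values() if text.lower() in d.lower()]


# Stand-in for the provider choice that a config file would carry.
PROVIDERS = {"word-match": WordMatchStore, "substring": SubstringStore}


def build_store(provider_name: str) -> VectorStore:
    return PROVIDERS[provider_name]()


def retrieve(store: VectorStore, question: str) -> List[str]:
    """Application code: depends only on the interface, not on a provider."""
    return store.query(question)


for name in PROVIDERS:
    store = build_store(name)
    store.insert("doc-001", "Llama Stack wraps vLLM")
    print(name, retrieve(store, "vLLM"))
```

Switching the key passed to `build_store` changes the backend while `retrieve` stays untouched, which mirrors how a provider swap is meant to leave client code alone.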
A distribution is a named, pre-tested bundle of provider choices. Meta ships several reference distributions. For GPU production deployments, the vllm-gpu distribution is the right starting point.
| API Domain | Default Provider (vllm-gpu) | What It Does |
|---|---|---|
| Inference | vLLM | Token generation via Llama models |
| Safety | Llama Guard 3 | Input/output content filtering |
| Agents | Meta Reference | Multi-step tool-calling loop |
| Memory | Qdrant / in-memory | Vector store for RAG |
| Tool Runtime | Built-in | Web search, code interpreter, RAG tool |
| Eval | Meta Reference | Benchmark harness for deployed models |
When you run llama stack build, it generates a run.yaml that wires these providers together. A trimmed version looks like this:
version: "2"
distribution_spec:
  description: Llama Stack with vLLM inference backend
  providers:
    inference: remote::vllm
    safety: inline::meta-reference
    agents: inline::meta-reference
    memory: inline::faiss
    tool_runtime: inline::meta-reference
    eval: inline::meta-reference
image_name: distribution-vllm-gpu
server:
  port: 8321
models:
  - model_id: meta-llama/Llama-4-Scout-17B-16E-Instruct
    provider_model_id: meta-llama/Llama-4-Scout-17B-16E-Instruct
    provider_id: vllm0
providers:
  inference:
    - provider_id: vllm0
      provider_type: remote::vllm
      config:
        url: http://localhost:8000
        max_tokens: 4096
The remote::vllm provider means Llama Stack talks to an already-running vLLM server. You start vLLM separately, then start Llama Stack pointing at it.
Hardware Requirements for Llama 4 Scout and Maverick
Llama Stack's hardware requirements come entirely from the inference backend (vLLM) and the safety model (Llama Guard 3). Llama Guard 3 8B needs roughly 16GB of VRAM with FP16. Plan for this if you're running both on the same node.
| Configuration | GPU | VRAM | Quantization | VRAM Used (model) | Use Case |
|---|---|---|---|---|---|
| Scout, single GPU | H200 SXM5 | 141 GB | INT4 | ~55 GB | Development + production |
| Scout, tight fit | H100 SXM5 | 80 GB | INT4 | ~55 GB | Budget production |
| Scout, full quality | 3x H100 SXM5 | 240 GB | FP16 | ~218 GB | Max quality |
| Maverick, minimum | 4x H100 SXM5 | 320 GB | INT4 | ~200 GB | Multi-GPU cluster |
| Maverick, recommended | 2x H200 SXM5 | 282 GB | INT4 | ~200 GB | Efficient multi-GPU |
| Maverick, full quality | 4x B200 SXM6 | 768 GB | FP8 | ~400 GB | Full quality inference |
For the full VRAM math and quantization breakdown, see the Llama 4 GPU deployment guide. The key point for Llama Stack: the inference provider does not add overhead. VRAM usage is identical to running vLLM directly.
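The "VRAM Used (model)" column follows from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces the table's figures from the parameter counts quoted in this guide (109B for Scout, 400B for Maverick). It covers weights only; KV cache, activations, and quantization metadata come on top, so treat the results as lower bounds. The function name `weight_footprint_gb` is made up for this example.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bits_per_weight / 8


# Parameter counts as quoted in this guide (total, not active, parameters).
SCOUT_B = 109
MAVERICK_B = 400

print("Scout INT4:   ", weight_footprint_gb(SCOUT_B, 4), "GB")      # 54.5
print("Scout FP16:   ", weight_footprint_gb(SCOUT_B, 16), "GB")     # 218.0
print("Maverick INT4:", weight_footprint_gb(MAVERICK_B, 4), "GB")   # 200.0
print("Maverick FP8: ", weight_footprint_gb(MAVERICK_B, 8), "GB")   # 400.0
```

Note that the total parameter count drives memory, even though only 17B parameters are active per token: all experts must be resident in VRAM.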
Step-by-Step Setup on Spheron
Step 1: Provision Your GPU Instance
Log into app.spheron.ai, select your GPU from the catalog, choose on-demand or spot pricing, and deploy. For Scout development, a single H200 SXM5 provides comfortable headroom. SSH into the instance and verify:
nvidia-smi
You should see your GPU(s) with the expected VRAM. For a single H200 SXM5, expect roughly 141 GB of total memory (nvidia-smi reports the value in MiB). If you provisioned multiple GPUs, confirm the count matches.
If you need on-demand H200 access with per-minute billing, Spheron provisions bare-metal instances in under 2 minutes.
Step 2: Install Docker with NVIDIA Container Toolkit
Most Spheron GPU instances include the NVIDIA Container Toolkit pre-installed. Verify:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If this fails, install the toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# The sed step tells apt which keyring signs this repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Llama Stack requires CUDA 12.1+ for H100/H200. B200 (Blackwell) needs CUDA 12.8 or newer, the first CUDA release with Blackwell support.
Step 3: Install Llama Stack CLI
Python 3.10+ is required. Install with pip:
pip install llama-stack
llama --version
This installs both the llama-stack server package and the llama CLI. The CLI is what you use for building distributions and managing model registrations.
Step 4: Start the vLLM Inference Backend
Llama Stack's vllm-gpu distribution connects to a running vLLM server. Start vLLM first:
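For Llama 4 Scout on a single H200 (INT4). Note that --dtype auto only selects the compute dtype from the checkpoint's config; it does not quantize the weights. To reach the ~55 GB INT4 footprint, point --model at an INT4-quantized checkpoint or use a quantization option supported by your vLLM version: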
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=your_hf_token \
vllm/vllm-openai:latest \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--dtype auto \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768
For Llama 4 Maverick on 4x H100 SXM5 (INT4):
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=your_hf_token \
vllm/vllm-openai:latest \
--model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--dtype auto \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.92 \
--max-model-len 65536
Both model IDs (meta-llama/Llama-4-Scout-17B-16E-Instruct and meta-llama/Llama-4-Maverick-17B-128E-Instruct) are gated on Hugging Face. You must accept the model license and set HF_TOKEN before the container can download them.
For a full vLLM setup with production tuning (FP8, multi-GPU tensor parallelism, monitoring), see our vLLM production deployment guide.
Step 5: Build a Llama Stack Distribution
With vLLM running, build the vllm-gpu distribution:
llama stack build --template vllm-gpu --image-type docker
This generates a run.yaml in your current directory and pulls all provider dependencies. Review the generated file to confirm:
- inference provider points to http://localhost:8000 (your vLLM server)
- safety provider is set to inline::meta-reference
- memory provider is inline::faiss (in-memory vector store for development)
To list all available templates: llama stack build --list-templates
Step 6: Start the Llama Stack Server
llama stack run ./run.yaml
The server starts on port 8321 and exposes the full REST API. You should see startup logs confirming each provider initialized:
Starting Llama Stack server on port 8321
Initializing inference provider: remote::vllm
Initializing safety provider: inline::meta-reference
Initializing agents provider: inline::meta-reference
Llama Stack server started
Verify the server is healthy:
curl http://localhost:8321/v1/models
You should see the registered Llama 4 model in the response.
Adding Llama Guard 3 Safety Filtering
Safety filtering is a first-class Llama Stack concept. Llama Guard 3 is a separate model that inspects inputs and outputs against a set of harm categories. It runs as the meta-reference safety provider.
To test the safety filter:
curl http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "meta-llama/Llama-Guard-3-8B",
"messages": [
{"role": "user", "content": "How do I make a dangerous device?"}
]
}'
A typical safe response:
{
  "violation": null
}
An unsafe response:
{
  "violation": {
    "user_message": "I'm sorry, I cannot assist with that request.",
    "violation_type": "S2"
  }
}
The violation type maps to Llama Guard's harm taxonomy (S1 through S14 categories). Check the Meta Llama Guard documentation for the full category list.
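On the client side, the two response shapes above reduce to one check: is `violation` null or not. A minimal sketch of that check follows, assuming the response bodies look exactly like the examples in this section; the helper name `shield_verdict` is hypothetical.

```python
import json
from typing import Optional, Tuple


def shield_verdict(response_body: str) -> Tuple[bool, Optional[str]]:
    """Return (blocked, violation_type) for a run-shield JSON response body."""
    payload = json.loads(response_body)
    violation = payload.get("violation")
    if violation is None:
        return False, None
    return True, violation.get("violation_type")


safe_body = '{"violation": null}'
unsafe_body = json.dumps(
    {
        "violation": {
            "user_message": "I'm sorry, I cannot assist with that request.",
            "violation_type": "S2",
        }
    }
)

print(shield_verdict(safe_body))    # (False, None)
print(shield_verdict(unsafe_body))  # (True, 'S2')
```

Returning the category code alongside the boolean lets the caller log which harm category fired without re-parsing the response.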
For code generation use cases, the meta-reference safety provider also ships Code Shield, which filters outputs from code-generating prompts. Enable it by registering meta-llama/Llama-Guard-3-8B-Code as a second shield in your run.yaml.
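VRAM note: Llama Guard 3 8B requires approximately 16GB of VRAM in FP16. On a single H100 80GB running Scout INT4 (55GB), you have 25GB of headroom, which fits Llama Guard 3 8B but leaves only about 9GB for KV cache, so expect to lower --max-model-len or concurrency. On H200 (141GB) the margin is much larger: about 70GB remains after both models.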
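The headroom figures in the VRAM note can be worked out explicitly. The sketch below uses this guide's numbers (55GB for Scout INT4, 16GB for Llama Guard 3 8B) and adds one assumption: that Llama Guard runs outside the vLLM process, so vLLM's --gpu-memory-utilization fraction must be low enough to leave room for it. Both function names are hypothetical.

```python
def kv_headroom_gb(gpu_vram_gb: float, model_gb: float, guard_gb: float = 16.0) -> float:
    """VRAM left for KV cache and activations after both sets of weights."""
    return gpu_vram_gb - model_gb - guard_gb


def vllm_utilization_cap(gpu_vram_gb: float, reserved_gb: float = 16.0) -> float:
    """Largest --gpu-memory-utilization that still leaves reserved_gb free.

    Assumes the reserved memory is used by a process other than vLLM.
    """
    return (gpu_vram_gb - reserved_gb) / gpu_vram_gb


SCOUT_INT4_GB = 55.0  # figure from the hardware table in this guide

for name, vram in [("H100 80GB", 80.0), ("H200 141GB", 141.0)]:
    headroom = kv_headroom_gb(vram, SCOUT_INT4_GB)
    cap = vllm_utilization_cap(vram)
    print(f"{name}: {headroom:.0f} GB left, utilization cap {cap:.2f}")
```

Under that assumption, an H100 running vLLM at 0.90 utilization would not leave 16GB free for a second model, so the fraction needs to come down to about 0.80.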
Agents API with Built-in Tool Use
The agents API implements a multi-step reasoning loop where the model calls registered tools, processes tool responses, and continues until it reaches a final answer. This is different from raw vLLM function calling: Llama Stack manages the entire loop, including tool execution and message threading.
Create an agent with web search enabled:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

agent_config = {
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "instructions": "You are a helpful assistant with access to web search.",
    "tools": [
        {
            "type": "brave_search",
            "engine": "brave",
            "api_key": "your_brave_api_key"
        }
    ],
    "safety_config": {
        "shields": [{"shield_id": "meta-llama/Llama-Guard-3-8B"}]
    },
    "max_infer_iters": 5,
}
# Pass the config as the agent_config field, matching the /v1/agents request body
agent = client.agents.create(agent_config=agent_config)
session = client.agents.session.create(agent_id=agent.agent_id, session_name="test")
Send a turn to the agent:
response = client.agents.turn.create(
    agent_id=agent.agent_id,
    session_id=session.session_id,
    messages=[{"role": "user", "content": "What is the latest news about Llama Stack?"}],
    stream=False,
)
print(response.output_message.content)
The loop runs until the model either produces a final answer or hits max_infer_iters. Each tool call is logged in the turn's steps field, so you can inspect what the agent did at each step.
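To show what "the loop runs until a final answer or max_infer_iters" means mechanically, here is a toy version in plain Python. It is a conceptual sketch, not Llama Stack's implementation: `run_agent_loop`, `fake_model`, and `fake_search` are hypothetical stand-ins, and the "model" is a hard-coded function rather than an LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_agent_loop(
    model_step: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[[str], str]],
    user_message: str,
    max_infer_iters: int = 5,
) -> Tuple[Optional[str], List[dict]]:
    """Call the model repeatedly, executing any tool it requests.

    model_step returns either {"answer": str} or {"tool": name, "input": str}.
    Returns (final_answer_or_None, steps).
    """
    messages = [{"role": "user", "content": user_message}]
    steps: List[dict] = []
    for _ in range(max_infer_iters):
        decision = model_step(messages)
        if "answer" in decision:
            return decision["answer"], steps
        output = tools[decision["tool"]](decision["input"])
        steps.append({"tool": decision["tool"], "output": output})
        # Feed the tool result back so the next model call can use it.
        messages.append({"role": "tool", "content": output})
    return None, steps  # iteration budget exhausted without a final answer


def fake_model(messages: List[dict]) -> dict:
    """Stand-in model: search once, then answer from the search result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "search", "input": messages[0]["content"]}
    return {"answer": "Based on search: " + tool_results[-1]["content"]}


def fake_search(query: str) -> str:
    return "results for '" + query + "'"


answer, steps = run_agent_loop(fake_model, {"search": fake_search}, "Llama Stack news")
print(answer)
print(steps)
```

The returned `steps` list plays the same role as the turn's steps field described above: a record of each tool call made on the way to the answer.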
Production RAG Pipeline with the Memory API
The memory API provides a vector store abstraction. For production, use Qdrant or Milvus as the backend. For development and testing, the in-memory FAISS provider works without additional infrastructure.
Create a memory bank:
client.memory.create(
    bank_id="product-docs",
    config={
        "type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size": 512,
        "overlap_size": 64,
    },
    provider_id="faiss",
)
Insert documents:
documents = [
    {
        "document_id": "doc-001",
        "content": "Llama Stack is Meta's framework for production Llama deployments...",
        "metadata": {"source": "docs", "version": "1.0"},
    }
]
client.memory.insert(bank_id="product-docs", documents=documents)
Query the memory bank:
results = client.memory.query(
    bank_id="product-docs",
    query="How do I set up Llama Stack agents?",
    params={"max_chunks": 5},
)
for chunk in results.chunks:
    print(chunk.content)
For production RAG, switch the provider from faiss to qdrant in your run.yaml and configure the Qdrant connection URL. The client code does not change. Qdrant and Milvus both support persistent storage and horizontal scaling, which the in-memory provider does not. For a deeper look at running embedding models, vector search, and LLM inference on a single GPU node to minimize latency, see the agentic RAG GPU infrastructure guide.
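The chunk_size and overlap_size settings in the memory bank config control how documents are split before embedding. The sketch below illustrates overlapping chunking in general terms. It is not the provider's actual splitter: it counts whitespace-separated words as a stand-in for tokens, and `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(words: List[str], chunk_size: int = 512, overlap_size: int = 64) -> List[List[str]]:
    """Split words into windows of chunk_size that share overlap_size items."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


words = ["w" + str(i) for i in range(1000)]
chunks = chunk_words(words)

print(len(chunks))                          # 3
print([len(c) for c in chunks])             # [512, 512, 104]
print(chunks[0][-64:] == chunks[1][:64])    # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.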
Eval API
Llama Stack ships a built-in evaluation harness. You can benchmark a deployed distribution against standard datasets without writing custom evaluation code.
A minimal eval config (eval_config.yaml):
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
dataset: mmlu
num_examples: 100
metrics:
  - accuracy
  - f1
Run the evaluation:
llama eval run --config eval_config.yaml
Supported datasets include MMLU, HumanEval, and GSM8K. Results are written to a local JSON file and include per-category accuracy for MMLU. For production quality gates, run evals before and after any model or configuration change and block deployments that drop below threshold.
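A quality gate of the kind described above can be a small comparison script. The sketch below assumes you have already loaded the before and after metrics into plain dictionaries. The layout of the results file is not specified here, so the dictionary shape, the example scores, and the `passes_quality_gate` name are all illustrative assumptions.

```python
from typing import Dict, Optional, Tuple


def passes_quality_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.01,
) -> Tuple[bool, Dict[str, Tuple[float, Optional[float]]]]:
    """Compare metrics; flag any that are missing or fell by more than max_drop."""
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None or base_value - new_value > max_drop:
            regressions[metric] = (base_value, new_value)
    return not regressions, regressions


# Hypothetical scores, for illustration only.
before = {"accuracy": 0.80, "f1": 0.78}
after = {"accuracy": 0.76, "f1": 0.78}

ok, regressions = passes_quality_gate(before, after)
print("deploy" if ok else "block", regressions)
```

In a CI pipeline, the script would exit non-zero when `ok` is false so the deployment step never runs.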
Cost Breakdown on Spheron H100, H200, and B200
Live pricing as of 29 May 2026, fetched from Spheron's GPU catalog:
Per-GPU on-demand rates: H100 SXM5 $3.90/hr, H200 SXM5 $4.62/hr, B200 SXM6 $7.21/hr. Multi-GPU rows below show cluster totals (e.g. 4x H100 = 4 × $3.90 = $15.60/hr).
| Configuration | GPU | On-Demand | Spot | Monthly (on-demand) | Monthly (spot) |
|---|---|---|---|---|---|
| Scout, INT4 (budget) | 1x H100 SXM5 | $3.90/hr | $1.73/hr | ~$2,847 | ~$1,263 |
| Scout, INT4 (recommended) | 1x H200 SXM5 | $4.62/hr | $1.40/hr | ~$3,373 | ~$1,022 |
| Maverick, INT4 (minimum) | 4x H100 SXM5 | $15.60/hr | $6.91/hr | ~$11,388 | ~$5,044 |
| Maverick, INT4 (efficient) | 2x H200 SXM5 | $9.24/hr | $2.80/hr | ~$6,745 | ~$2,044 |
| Maverick, FP8 (full quality) | 4x B200 SXM6 | $28.84/hr | $10.73/hr | ~$21,053 | ~$7,833 |
Monthly estimates assume 730 hours of continuous runtime. Spot pricing saves 56-70% where available but requires handling preemption.
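The monthly figures and the savings range can be reproduced from the hourly rates. The sketch below uses the per-GPU rates quoted in this section; the function names are made up for the example.

```python
HOURS_PER_MONTH = 730  # continuous runtime assumption used in the table


def monthly_cost(hourly_rate_per_gpu: float, gpus: int = 1) -> float:
    """Monthly cost in dollars for a cluster running around the clock."""
    return hourly_rate_per_gpu * gpus * HOURS_PER_MONTH


def spot_savings_pct(on_demand_rate: float, spot_rate: float) -> float:
    """Percentage saved by paying the spot rate instead of on-demand."""
    return (1 - spot_rate / on_demand_rate) * 100


print(round(monthly_cost(4.62)))             # 3373  (1x H200 on-demand)
print(round(monthly_cost(3.90, gpus=4)))     # 11388 (4x H100 on-demand)
print(round(spot_savings_pct(3.90, 1.73)))   # 56    (H100)
print(round(spot_savings_pct(4.62, 1.40)))   # 70    (H200)
```

Since spot instances can be preempted, the effective saving is lower once you account for restarts and any on-demand fallback capacity.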
The H100 SXM5 on Spheron is the practical entry point for Scout with INT4 quantization. If context window matters more than cost, the H200 SXM5 rental gives you 141GB for longer sequences and larger batch sizes. For Maverick at full quality, B200 instances on Spheron provide the highest memory bandwidth with NVLink interconnects for multi-GPU tensor parallelism.
Pricing fluctuates based on GPU availability. The prices above are based on 29 May 2026 and may have changed. Check current GPU pricing for live rates.
Llama Stack vs vLLM vs SGLang vs TGI
| Framework | Best For | API Surface | Safety Built-In | Agents Built-In | RAG Built-In |
|---|---|---|---|---|---|
| Llama Stack | Full production app stack | Llama Stack + OpenAI compat | Yes (Llama Guard) | Yes | Yes (Memory API) |
| vLLM | High-throughput inference only | OpenAI-compatible | No | No | No |
| SGLang | Complex multi-step prompts, structured outputs | OpenAI-compatible | No | No | No |
| TGI | Simple serving, HuggingFace ecosystem | REST + gRPC | No | No | No |
Llama Stack is the right choice when you need the full application layer (safety, agents, RAG) and want a single versioned API surface. The tradeoff is operational complexity: you're running two servers (vLLM + Llama Stack) and have more components to monitor.
vLLM wins when you need maximum inference performance and will build application logic yourself. It's faster to set up, easier to monitor, and has broader model support. See our vLLM production guide for the complete multi-GPU setup. For a deep dive into the continuous batching and PagedAttention mechanics that make vLLM fast, see the LLM serving optimization guide.
SGLang wins for workloads with repeated prefix structures (multi-agent conversations, shared system prompts) where RadixAttention's cache hit rates make a measurable latency difference. See our SGLang deployment guide for production configuration.
TGI is still viable for simple deployments in the HuggingFace ecosystem, but lacks the multi-GPU performance optimizations of vLLM and the application-layer features of Llama Stack.
For pure inference throughput benchmarks across all four, see the inference framework benchmark.
Common Issues and Fixes
Llama Stack Server Fails to Connect to vLLM
If you see connection refused errors on startup, the most likely cause is that vLLM hasn't finished loading the model yet. The model download and weight loading for Scout takes 5-10 minutes on first run. Wait until you see:
INFO: Application startup complete.
in the vLLM container logs before starting the Llama Stack server.
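Instead of watching the logs by hand, a startup script can poll vLLM until it answers. The sketch below is one way to do that with the standard library, assuming vLLM's OpenAI-compatible server is on localhost:8000 as in this guide and using its /v1/models route as the probe. The helper names, retry count, and interval are arbitrary choices for illustration.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200; False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts, and refused connections are OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call probe until it returns True. Returns the attempt number, or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    return None


# Usage against a running container (not executed here):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))
# if ready is None:
#     raise SystemExit("vLLM did not become ready in time")
```

With the defaults, the script waits up to roughly ten minutes, which matches the first-run load time mentioned above. Run `llama stack run ./run.yaml` only after it returns.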
Llama Guard Safety Model Not Found
If the safety provider errors on startup with a model-not-found message, the Llama Guard 3 model needs to be registered separately. Add this to your run.yaml under models:
- model_id: meta-llama/Llama-Guard-3-8B
  provider_model_id: meta-llama/Llama-Guard-3-8B
  provider_id: vllm0
  model_type: llm
This assumes you're serving Llama Guard 3 through the same vLLM instance. Alternatively, set the safety provider to use the inline::meta-reference provider which downloads and runs Llama Guard directly.
OOM on Agent Tool Calls
Agent loops can spike memory usage during tool execution, especially with web search returning long context. Set max_infer_iters to a low value (3-5) in development and monitor KV cache usage via vLLM's /metrics endpoint before increasing context length or concurrency.
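vLLM's /metrics endpoint returns Prometheus text, so checking KV cache usage comes down to finding one gauge line. The sketch below parses such text with the standard library. The metric name `vllm:gpu_cache_usage_perc` and the sample payload are assumptions for illustration: metric names differ between vLLM versions, so check your own /metrics output for the exact name. The parser also assumes samples carry no trailing timestamp.

```python
from typing import Optional


def read_gauge(metrics_text: str, metric_name: str) -> Optional[float]:
    """Return the first sample value for metric_name, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split from the right: label values may themselves contain spaces.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            return float(value)
    return None


# Hand-written sample in Prometheus text format (not captured from a real server).
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct"} 0.42
"""

usage = read_gauge(SAMPLE, "vllm:gpu_cache_usage_perc")
print(usage)  # 0.42
```

Sampling this value during an agent run with long tool outputs shows how close the cache gets to full before you raise context length or concurrency.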
Distribution Template Names
Meta has renamed distributions across Llama Stack releases. If vllm-gpu fails, check current templates with:
llama stack build --list-templates
Common alternatives: meta-reference-gpu, remote-vllm, vllm. Use whatever appears in the current CLI output.
Llama Stack's vLLM distribution runs on Spheron H100 and H200 instances without any proprietary dependencies. One command provisions bare-metal GPU access; another starts the full stack.
Quick Setup Guide
Select GPU based on model and quantization. Llama 4 Scout (INT4): single H200 SXM5 141GB or 2x H100 SXM5 80GB. Llama 4 Scout (FP8): 2x H100 SXM5. Llama 4 Maverick (INT4): 4x H100 SXM5 or 2x H200. Full FP16: significantly more VRAM required. Check current pricing at /pricing/.
Log in to app.spheron.ai, select your GPU configuration from the catalog, choose on-demand or spot pricing, and deploy. SSH access is available in under 2 minutes. Verify GPUs with nvidia-smi before proceeding.
On your instance: pip install llama-stack. This installs both the llama-stack server and the llama CLI. Verify with: llama --version. Ensure Python 3.10+ and CUDA drivers are present.
Run: llama stack build --template vllm-gpu --image-type docker. This pulls all provider dependencies for the vLLM GPU distribution. Review the generated run.yaml to confirm provider assignments before starting.
Run: llama stack run ./run.yaml. The server starts on port 8321 by default and exposes the full Llama Stack REST API: /inference, /agents, /memory, /safety, /eval. Confirm the server is healthy with: curl http://localhost:8321/v1/models.
Use the llama-stack-client Python SDK or plain curl. Test inference at /v1/inference/chat-completion, safety filtering via the /v1/safety/run-shield endpoint, and a simple agent loop via /v1/agents. Confirm Llama Guard responses are filtering correctly before exposing to external traffic.
Frequently Asked Questions
Llama Stack is Meta's officially maintained, opinionated framework that bundles Llama inference, safety (Llama Guard), agents, tool use, RAG memory, and eval under a single unified API. Raw vLLM is a high-performance inference engine only - it handles tokens in and tokens out. Llama Stack sits on top of vLLM (as one of its inference providers) and adds the application-layer APIs: agents, tool calling, memory, safety filtering, and benchmark evaluation. Choose Llama Stack when you want a complete, production-ready application stack. Choose raw vLLM when you only need a fast inference server and will build application logic yourself.
Llama 4 Scout (109B total parameters, 17B active per token) requires approximately 55GB of VRAM with INT4 quantization or 218GB in FP16. A single H200 SXM5 (141GB VRAM) comfortably runs Scout with INT4 quantization and some context headroom. A single H100 SXM5 (80GB) can also run Scout with INT4 quantization but with tighter memory. For Llama 4 Maverick (400B), you need at least 4x H100 SXM5 in INT4, 2x H200 SXM5 in INT4, or 4x B200 SXM6 for full FP8 quality.
Yes. Llama Stack runs cleanly on standard NVIDIA CUDA Docker images available on Spheron GPU instances. There is no proprietary kernel, driver extension, or cloud-specific dependency required. Llama Stack's vLLM distribution (the recommended production setup) uses the standard vllm/vllm-openai Docker image as the inference backend, which Spheron instances support out of the box.
A distribution is a pre-configured bundle of provider choices for each API in Llama Stack: which inference backend to use, which safety model, which memory/vector store, which tool executor, and so on. Meta ships several reference distributions (e.g., 'meta-reference', 'vllm-gpu', 'remote-vllm'). You select a distribution to get a consistent, reproducible stack without manually wiring each API provider. For production GPU deployments, the 'vllm-gpu' distribution is the recommended starting point.
Llama Stack exposes its own API surface, which is not a direct drop-in for the OpenAI SDK. However, it includes an OpenAI compatibility layer so you can point standard OpenAI client libraries at your Llama Stack inference endpoint. The agents, memory, and eval APIs are Llama Stack-specific and have no direct OpenAI equivalent.
