
Deploy Llama Stack on GPU Cloud: Meta's Production Framework for Llama Inference, Agents, and RAG (2026)

Written by Mitrasish, Co-founder · May 29, 2026

TL;DR: Llama Stack on H200 SXM5

GPU: H200 SXM5 141GB on Spheron, $4.62/hr on-demand, $1.40/hr spot

Step 1: Start vLLM as the inference backend

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=your_hf_token \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype auto --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 --max-model-len 32768

Step 2: Build and start the Llama Stack server

bash
pip install llama-stack
llama stack build --template vllm-gpu --image-type docker
llama stack run ./run.yaml

Test inference:

bash
curl http://localhost:8321/v1/inference/chat-completion \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "What is Llama Stack?"}]
  }'

Test agent creation:

bash
curl http://localhost:8321/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "agent_config": {
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "instructions": "You are a helpful assistant.",
      "tools": []
    }
  }'

The rest of this guide covers every component in detail.



What Is Llama Stack?

When you deploy a raw vLLM server, you get tokens in and tokens out. That's useful but incomplete for production applications. You still need to wire up:

  • A safety filter so the model doesn't generate harmful outputs
  • An agents loop that handles multi-step tool calling
  • A vector store and retrieval pipeline for RAG
  • An evaluation harness to benchmark your deployment
  • A unified client API that doesn't change when you swap inference backends

Meta built Llama Stack to handle all of this. It's an opinionated framework that wraps multiple Llama components under a single, versioned REST API. Instead of assembling these pieces yourself, you pick a distribution (a pre-built configuration of providers), run one command, and get the full stack.

Llama Stack is featured in Meta's official Llama 4 documentation and serves as the reference for how Meta expects Llama models to be deployed at scale.

Six API domains are exposed through a single server:

  • Inference: Token generation via Llama models (backed by vLLM, TGI, or Meta's reference engine)
  • Safety: Input/output content filtering via Llama Guard 3
  • Agents: Multi-step tool-calling loops with registered tool providers
  • Memory: Vector store abstraction for RAG (Qdrant, Milvus, or in-memory)
  • Tool Runtime: Web search, code interpreter, and RAG retrieval tools
  • Eval: Benchmark harness for deployed model quality assessment

If you're coming from raw vLLM, the key mental model is: Llama Stack uses vLLM for inference and adds the application layer on top. Nothing about this architecture requires special hardware or cloud dependencies.


Architecture: Providers, Distributions, and the Unified API

Every API domain in Llama Stack has a provider: a specific implementation that handles that domain's functionality. Providers are interchangeable. You can swap the vector store from in-memory to Qdrant without changing any application code.

A distribution is a named, pre-tested bundle of provider choices. Meta ships several reference distributions. For GPU production deployments, the vllm-gpu distribution is the right starting point.

| API Domain | Default Provider (vllm-gpu) | What It Does |
| --- | --- | --- |
| Inference | vLLM | Token generation via Llama models |
| Safety | Llama Guard 3 | Input/output content filtering |
| Agents | Meta Reference | Multi-step tool-calling loop |
| Memory | Qdrant / in-memory | Vector store for RAG |
| Tool Runtime | Built-in | Web search, code interpreter, RAG tool |
| Eval | Meta Reference | Benchmark harness for deployed models |

When you run llama stack build, it generates a run.yaml that wires these providers together. A trimmed version looks like this:

yaml
version: "2"
distribution_spec:
  description: Llama Stack with vLLM inference backend
  providers:
    inference: remote::vllm
    safety: inline::meta-reference
    agents: inline::meta-reference
    memory: inline::faiss
    tool_runtime: inline::meta-reference
    eval: inline::meta-reference

image_name: distribution-vllm-gpu

server:
  port: 8321

models:
  - model_id: meta-llama/Llama-4-Scout-17B-16E-Instruct
    provider_model_id: meta-llama/Llama-4-Scout-17B-16E-Instruct
    provider_id: vllm0

providers:
  inference:
    - provider_id: vllm0
      provider_type: remote::vllm
      config:
        url: http://localhost:8000
        max_tokens: 4096

The remote::vllm provider means Llama Stack talks to an already-running vLLM server. You start vLLM separately, then start Llama Stack pointing at it.
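One easy mistake when hand-editing run.yaml is registering a model against a provider_id that is not declared under providers. The sketch below is a hypothetical consistency check, not part of the Llama Stack CLI: it mirrors the trimmed config above as a plain Python dict (so it needs no YAML library) and lists any model whose provider_id does not resolve. The function name unresolved_models is made up for illustration.

```python
# Hypothetical consistency check; the dict mirrors the trimmed run.yaml above.
run_config = {
    "models": [
        {
            "model_id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "provider_id": "vllm0",
        },
    ],
    "providers": {
        "inference": [
            {
                "provider_id": "vllm0",
                "provider_type": "remote::vllm",
                "config": {"url": "http://localhost:8000", "max_tokens": 4096},
            },
        ],
    },
}


def unresolved_models(config: dict) -> list[str]:
    """Return model_ids whose provider_id is not declared under providers."""
    declared = {
        provider["provider_id"]
        for provider_list in config.get("providers", {}).values()
        for provider in provider_list
    }
    return [
        model["model_id"]
        for model in config.get("models", [])
        if model.get("provider_id") not in declared
    ]


# An empty list means every registered model maps to a declared provider.
print(unresolved_models(run_config))
```

If you load the real file with a YAML parser, the same function should work on the resulting dict, assuming the generated file keeps the models/providers layout shown here.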


Hardware Requirements for Llama 4 Scout and Maverick

Llama Stack's hardware requirements come entirely from the inference backend (vLLM) and the safety model (Llama Guard 3). Llama Guard 3 8B needs roughly 16GB of VRAM with FP16. Plan for this if you're running both on the same node.

| Configuration | GPU | VRAM | Quantization | VRAM Used (model) | Use Case |
| --- | --- | --- | --- | --- | --- |
| Scout, single GPU | H200 SXM5 | 141 GB | INT4 | ~55 GB | Development + production |
| Scout, tight fit | H100 SXM5 | 80 GB | INT4 | ~55 GB | Budget production |
| Scout, full quality | 3x H100 SXM5 | 240 GB | FP16 | ~218 GB | Max quality |
| Maverick, minimum | 4x H100 SXM5 | 320 GB | INT4 | ~200 GB | Multi-GPU cluster |
| Maverick, recommended | 2x H200 SXM5 | 282 GB | INT4 | ~200 GB | Efficient multi-GPU |
| Maverick, full quality | 4x B200 SXM6 | 768 GB | FP8 | ~400 GB | Full quality inference |

For the full VRAM math and quantization breakdown, see the Llama 4 GPU deployment guide. The key point for Llama Stack: the inference provider does not add overhead. VRAM usage is identical to running vLLM directly.
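The "VRAM Used (model)" column can be reproduced with a back-of-the-envelope rule: total parameters times bytes per parameter. The sketch below uses the total parameter counts quoted in the FAQ of this guide (109B for Scout, 400B for Maverick) and counts weights only, so treat the result as a floor: KV cache, activations, and runtime overhead come on top. The function name is illustrative.

```python
# Rough weight-only VRAM estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead, so treat it as a floor.

def weight_vram_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return total_params_billion * bytes_per_param


# Total (not active) parameter counts as quoted in this guide.
SCOUT_PARAMS_B = 109
MAVERICK_PARAMS_B = 400

for name, params, bits in [
    ("Scout INT4", SCOUT_PARAMS_B, 4),
    ("Scout FP16", SCOUT_PARAMS_B, 16),
    ("Maverick INT4", MAVERICK_PARAMS_B, 4),
    ("Maverick FP8", MAVERICK_PARAMS_B, 8),
]:
    print(f"{name}: ~{weight_vram_gb(params, bits):.1f} GB")
```

Note that a mixture-of-experts model still needs all experts resident in memory, which is why the total parameter count, not the 17B active count, drives the estimate.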


Step-by-Step Setup on Spheron

Step 1: Provision Your GPU Instance

Log into app.spheron.ai, select your GPU from the catalog, choose on-demand or spot pricing, and deploy. For Scout development, a single H200 SXM5 provides comfortable headroom. SSH into the instance and verify:

bash
nvidia-smi

You should see your GPU(s) with the expected VRAM. For a single H200 SXM5, expect roughly 141 GB of total memory (nvidia-smi reports the figure in MiB). If you provisioned multiple GPUs, confirm the count matches.

If you need on-demand H200 access with per-minute billing, Spheron provisions bare-metal instances in under 2 minutes.

Step 2: Install Docker with NVIDIA Container Toolkit

Most Spheron GPU instances include the NVIDIA Container Toolkit pre-installed. Verify:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If this fails, install the toolkit:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

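These CUDA requirements come from the GPU and the vLLM backend rather than from Llama Stack itself: use CUDA 12.1+ for H100/H200 and CUDA 12.8+ for B200, since Blackwell support first shipped in CUDA 12.8.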

Step 3: Install Llama Stack CLI

Python 3.10+ is required. Install with pip:

bash
pip install llama-stack
llama --version

This installs both the llama-stack server package and the llama CLI. The CLI is what you use for building distributions and managing model registrations.

Step 4: Start the vLLM Inference Backend

Llama Stack's vllm-gpu distribution connects to a running vLLM server. Start vLLM first:

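A note on quantization before the commands: --dtype auto only selects the compute dtype from the checkpoint's config; it does not quantize the weights. To get the ~55GB INT4 footprint from the hardware table, point --model at an INT4-quantized checkpoint or set vLLM's --quantization flag to match it. The same applies to the Maverick command.

For Llama 4 Scout on a single H200: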

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=your_hf_token \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype auto \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

For Llama 4 Maverick on 4x H100 SXM5 (INT4):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=your_hf_token \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --dtype auto \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536

Both model IDs (meta-llama/Llama-4-Scout-17B-16E-Instruct and meta-llama/Llama-4-Maverick-17B-128E-Instruct) are gated on Hugging Face. You must accept the model license and set HF_TOKEN before the container can download them.

For a full vLLM setup with production tuning (FP8, multi-GPU tensor parallelism, monitoring), see our vLLM production deployment guide.

Step 5: Build a Llama Stack Distribution

With vLLM running, build the vllm-gpu distribution:

bash
llama stack build --template vllm-gpu --image-type docker

This generates a run.yaml in your current directory and pulls all provider dependencies. Review the generated file to confirm:

  • inference provider points to http://localhost:8000 (your vLLM server)
  • safety provider is set to inline::meta-reference
  • memory provider is inline::faiss (in-memory vector store for development)

To list all available templates: llama stack build --list-templates

Step 6: Start the Llama Stack Server

bash
llama stack run ./run.yaml

The server starts on port 8321 and exposes the full REST API. You should see startup logs confirming each provider initialized:

Starting Llama Stack server on port 8321
Initializing inference provider: remote::vllm
Initializing safety provider: inline::meta-reference
Initializing agents provider: inline::meta-reference
Llama Stack server started

Verify the server is healthy:

bash
curl http://localhost:8321/v1/models

You should see the registered Llama 4 model in the response.


Adding Llama Guard 3 Safety Filtering

Safety filtering is a first-class Llama Stack concept. Llama Guard 3 is a separate model that inspects inputs and outputs against a set of harm categories. It runs as the meta-reference safety provider.

To test the safety filter:

bash
curl http://localhost:8321/v1/safety/run-shield \
  -H "Content-Type: application/json" \
  -d '{
    "shield_id": "meta-llama/Llama-Guard-3-8B",
    "messages": [
      {"role": "user", "content": "How do I make a dangerous device?"}
    ]
  }'

A typical safe response:

json
{
  "violation": null
}

An unsafe response:

json
{
  "violation": {
    "user_message": "I'm sorry, I cannot assist with that request.",
    "violation_type": "S2"
  }
}

The violation type maps to Llama Guard's harm taxonomy (S1 through S14 categories). Check the Meta Llama Guard documentation for the full category list.
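On the application side, the two response shapes above reduce to a single check on the violation field. The helper below is a minimal sketch that assumes the response body looks exactly like the JSON examples in this section; describe_violation is a hypothetical name, not part of the Llama Stack client.

```python
import json
from typing import Optional


def describe_violation(response_body: str) -> Optional[str]:
    """Return None for a safe response, or 'TYPE: message' for a violation.

    Assumes the body matches the JSON shapes shown in this section.
    """
    data = json.loads(response_body)
    violation = data.get("violation")
    if violation is None:
        return None
    violation_type = violation.get("violation_type", "unknown")
    user_message = violation.get("user_message", "")
    return f"{violation_type}: {user_message}"


safe_body = '{"violation": null}'
unsafe_body = json.dumps(
    {
        "violation": {
            "user_message": "I'm sorry, I cannot assist with that request.",
            "violation_type": "S2",
        }
    }
)

print(describe_violation(safe_body))
print(describe_violation(unsafe_body))
```

Returning None for the safe case keeps the calling code to a plain "if result is not None" branch.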

For code generation use cases, the meta-reference safety provider also ships Code Shield, which filters outputs from code-generating prompts. Enable it by registering meta-llama/Llama-Guard-3-8B-Code as a second shield in your run.yaml.

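VRAM note: Llama Guard 3 8B requires approximately 16GB of VRAM in FP16. On a single H100 80GB running Scout INT4 (~55GB), that leaves about 9GB for KV cache and activations, so the fit is tight rather than comfortable. If Llama Guard runs as a separate process outside vLLM, also lower --gpu-memory-utilization so that vLLM leaves at least 16GB unclaimed: at 0.90 on an 80GB card, only 8GB remains. On H200 (141GB) the margin is much larger, though 0.90 still leaves only about 14GB outside vLLM.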
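The headroom arithmetic is easy to get wrong because vLLM claims a fixed fraction of the GPU up front through --gpu-memory-utilization. The sketch below assumes Llama Guard runs as a separate process next to vLLM and uses the approximate figures from this guide (55GB for Scout INT4 weights, 16GB for Llama Guard 3 8B). The function names are illustrative, and real usage will vary with context length and batch size.

```python
def memory_split_gb(gpu_gb: float, gpu_memory_utilization: float, weights_gb: float) -> dict:
    """Split a GPU between vLLM's budget and whatever is left for other processes."""
    vllm_budget = gpu_gb * gpu_memory_utilization
    return {
        "kv_cache_and_activations": vllm_budget - weights_gb,  # inside vLLM
        "outside_vllm": gpu_gb - vllm_budget,                  # for other processes
    }


def sidecar_fits(gpu_gb: float, gpu_memory_utilization: float,
                 weights_gb: float, sidecar_gb: float) -> bool:
    """True if the weights fit inside vLLM's budget and the sidecar fits outside it."""
    split = memory_split_gb(gpu_gb, gpu_memory_utilization, weights_gb)
    return split["kv_cache_and_activations"] > 0 and split["outside_vllm"] >= sidecar_gb


SCOUT_INT4_GB = 55  # approximate, from the hardware table
GUARD_GB = 16       # approximate, Llama Guard 3 8B in FP16

for gpu_gb, util in [(80, 0.90), (80, 0.75), (141, 0.90), (141, 0.85)]:
    split = memory_split_gb(gpu_gb, util, SCOUT_INT4_GB)
    fits = sidecar_fits(gpu_gb, util, SCOUT_INT4_GB, GUARD_GB)
    print(
        f"{gpu_gb}GB @ {util:.2f}: "
        f"KV/activations {split['kv_cache_and_activations']:.1f}GB, "
        f"outside vLLM {split['outside_vllm']:.1f}GB, guard fits: {fits}"
    )
```

Serving Llama Guard through the same vLLM-style backend instead changes the picture, since the guard weights then count against vLLM's own budget.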


Agents API with Built-in Tool Use

The agents API implements a multi-step reasoning loop where the model calls registered tools, processes tool responses, and continues until it reaches a final answer. This is different from raw vLLM function calling: Llama Stack manages the entire loop, including tool execution and message threading.

Create an agent with web search enabled:

python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

agent_config = {
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "instructions": "You are a helpful assistant with access to web search.",
    "tools": [
        {
            "type": "brave_search",
            "engine": "brave",
            "api_key": "your_brave_api_key"
        }
    ],
    "safety_config": {
        "shields": [{"shield_id": "meta-llama/Llama-Guard-3-8B"}]
    },
    "max_infer_iters": 5,
}

agent = client.agents.create(**agent_config)
session = client.agents.session.create(agent_id=agent.agent_id, session_name="test")

Send a turn to the agent:

python
response = client.agents.turn.create(
    agent_id=agent.agent_id,
    session_id=session.session_id,
    messages=[{"role": "user", "content": "What is the latest news about Llama Stack?"}],
    stream=False,
)

print(response.output_message.content)

The loop runs until the model either produces a final answer or hits max_infer_iters. Each tool call is logged in the turn's steps field, so you can inspect what the agent did at each step.
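To make the control flow concrete, here is a toy version of such a loop. It is an illustration of the general pattern only, not Llama Stack's implementation: the model is a stub function, the action dictionary format is invented for this example, and run_agent_loop is a hypothetical name. The only detail taken from the text above is that the loop stops on a final answer or after max_infer_iters iterations and records each tool call as a step.

```python
from typing import Any, Callable, Dict, List


def run_agent_loop(
    model_step: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[[dict], str]],
    user_message: str,
    max_infer_iters: int = 5,
) -> Dict[str, Any]:
    """Call the model until it returns a final answer or the iteration cap is hit."""
    messages = [{"role": "user", "content": user_message}]
    steps = []
    for _ in range(max_infer_iters):
        action = model_step(messages)
        if action["type"] == "final":
            return {"answer": action["content"], "steps": steps}
        # Otherwise the model asked for a tool: run it and feed the result back.
        result = tools[action["tool"]](action["arguments"])
        steps.append({"tool": action["tool"], "arguments": action["arguments"], "result": result})
        messages.append({"role": "tool", "content": result})
    return {"answer": None, "steps": steps}  # cap reached without a final answer


def stub_model(messages: List[dict]) -> dict:
    """Stand-in for the LLM: search once, then answer from the tool output."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "tool": "search", "arguments": {"query": "Llama Stack"}}
    return {"type": "final", "content": "Answer based on: " + messages[-1]["content"]}


stub_tools = {"search": lambda args: f"results for {args['query']}"}

outcome = run_agent_loop(stub_model, stub_tools, "What is Llama Stack?")
print(outcome["answer"])
print(len(outcome["steps"]), "tool call(s)")
```

The iteration cap is what keeps a model that never stops calling tools from looping forever, which is why a low value is a sensible default during development.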


Production RAG Pipeline with the Memory API

The memory API provides a vector store abstraction. For production, use Qdrant or Milvus as the backend. For development and testing, the in-memory FAISS provider works without additional infrastructure.

Create a memory bank:

python
client.memory.create(
    bank_id="product-docs",
    config={
        "type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size": 512,
        "overlap_size": 64,
    },
    provider_id="faiss",
)

Insert documents:

python
documents = [
    {
        "document_id": "doc-001",
        "content": "Llama Stack is Meta's framework for production Llama deployments...",
        "metadata": {"source": "docs", "version": "1.0"},
    }
]
client.memory.insert(bank_id="product-docs", documents=documents)

Query the memory bank:

python
results = client.memory.query(
    bank_id="product-docs",
    query="How do I set up Llama Stack agents?",
    params={"max_chunks": 5},
)
for chunk in results.chunks:
    print(chunk.content)

For production RAG, switch the provider from faiss to qdrant in your run.yaml and configure the Qdrant connection URL. The client code does not change. Qdrant and Milvus both support persistent storage and horizontal scaling, which the in-memory provider does not. For a deeper look at running embedding models, vector search, and LLM inference on a single GPU node to minimize latency, see the agentic RAG GPU infrastructure guide.
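The chunk_size and overlap_size settings in the memory bank config describe a sliding window over each document. How the provider tokenizes and splits internally is not covered here, so the sketch below is only an illustration of the idea: it works on an already-tokenized list and uses a hypothetical chunk_tokens helper.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 512, overlap_size: int = 64) -> List[Sequence]:
    """Split tokens into windows of chunk_size that overlap by overlap_size."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Integers stand in for token ids; a real pipeline would use a tokenizer.
document = list(range(1000))
chunks = chunk_tokens(document, chunk_size=512, overlap_size=64)
print([len(c) for c in chunks])
```

With these settings each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.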


Eval API

Llama Stack ships a built-in evaluation harness. You can benchmark a deployed distribution against standard datasets without writing custom evaluation code.

A minimal eval config (eval_config.yaml):

yaml
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
dataset: mmlu
num_examples: 100
metrics:
  - accuracy
  - f1

Run the evaluation:

bash
llama eval run --config eval_config.yaml

Supported datasets include MMLU, HumanEval, and GSM8K. Results are written to a local JSON file and include per-category accuracy for MMLU. For production quality gates, run evals before and after any model or configuration change and block deployments that drop below threshold.
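A quality gate of this kind can be a few lines of Python in a CI job. The sketch below assumes you have already loaded the before and after scores into plain dictionaries keyed by metric name; the layout of the results file is not specified in this guide, so that loading step is left out. The quality_gate function and the 0.01 tolerance are illustrative choices.

```python
from typing import Dict, Optional, Tuple


def quality_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.01,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that regressed by more than max_drop (empty dict means pass)."""
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        # A metric missing from the candidate run is treated as a failure.
        if new_value is None or base_value - new_value > max_drop:
            regressions[metric] = (base_value, new_value)
    return regressions


baseline_scores = {"accuracy": 0.80, "f1": 0.75}
candidate_scores = {"accuracy": 0.78, "f1": 0.76}

failed = quality_gate(baseline_scores, candidate_scores)
if failed:
    print("Blocking deployment, regressed metrics:", failed)
else:
    print("Quality gate passed")
```

In a pipeline, a non-empty result would typically translate into a non-zero exit code so the deployment step never runs.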


Cost Breakdown on Spheron H100, H200, and B200

Live pricing as of 29 May 2026, fetched from Spheron's GPU catalog:

Per-GPU on-demand rates: H100 SXM5 $3.90/hr, H200 SXM5 $4.62/hr, B200 SXM6 $7.21/hr. Multi-GPU rows below show cluster totals (e.g. 4x H100 = 4 × $3.90 = $15.60/hr).

| Configuration | GPU | On-Demand | Spot | Monthly (on-demand) | Monthly (spot) |
| --- | --- | --- | --- | --- | --- |
| Scout, INT4 (budget) | 1x H100 SXM5 | $3.90/hr | $1.73/hr | ~$2,847 | ~$1,263 |
| Scout, INT4 (recommended) | 1x H200 SXM5 | $4.62/hr | $1.40/hr | ~$3,373 | ~$1,022 |
| Maverick, INT4 (minimum) | 4x H100 SXM5 | $15.60/hr | $6.91/hr | ~$11,388 | ~$5,044 |
| Maverick, INT4 (efficient) | 2x H200 SXM5 | $9.24/hr | $2.80/hr | ~$6,745 | ~$2,044 |
| Maverick, FP8 (full quality) | 4x B200 SXM6 | $28.84/hr | $10.73/hr | ~$21,053 | ~$7,833 |

Monthly estimates assume 730 hours of continuous runtime. Spot pricing saves 56-70% where available but requires handling preemption.
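The monthly figures and the savings range follow directly from the hourly rates. A small sketch of the arithmetic, using the rates quoted above and the same 730-hour month (function names are illustrative):

```python
HOURS_PER_MONTH = 730  # continuous runtime, as assumed in the table


def monthly_cost(hourly_rate_per_gpu: float, gpus: int = 1) -> float:
    """Monthly cost in dollars for a cluster running around the clock."""
    return hourly_rate_per_gpu * gpus * HOURS_PER_MONTH


def spot_savings_pct(on_demand_rate: float, spot_rate: float) -> float:
    """Percentage saved by paying the spot rate instead of on-demand."""
    return (1 - spot_rate / on_demand_rate) * 100


# Per-GPU rates quoted in this section (29 May 2026).
print(f"1x H100 on-demand: ${monthly_cost(3.90):,.0f}/month")
print(f"4x H100 on-demand: ${monthly_cost(3.90, gpus=4):,.0f}/month")
print(f"H100 spot savings: {spot_savings_pct(3.90, 1.73):.0f}%")
print(f"H200 spot savings: {spot_savings_pct(4.62, 1.40):.0f}%")
```

Swapping in current rates is enough to refresh the estimate when prices change.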

The H100 SXM5 on Spheron is the practical entry point for Scout with INT4 quantization. If context window matters more than cost, the H200 SXM5 rental gives you 141GB for longer sequences and larger batch sizes. For Maverick at full quality, B200 instances on Spheron provide the highest memory bandwidth with NVLink interconnects for multi-GPU tensor parallelism.

Pricing fluctuates based on GPU availability. The prices above are based on 29 May 2026 and may have changed. Check current GPU pricing for live rates.


Llama Stack vs vLLM vs SGLang vs TGI

| Framework | Best For | API Surface | Safety Built-In | Agents Built-In | RAG Built-In |
| --- | --- | --- | --- | --- | --- |
| Llama Stack | Full production app stack | Llama Stack + OpenAI compat | Yes (Llama Guard) | Yes | Yes (Memory API) |
| vLLM | High-throughput inference only | OpenAI-compatible | No | No | No |
| SGLang | Complex multi-step prompts, structured outputs | OpenAI-compatible | No | No | No |
| TGI | Simple serving, HuggingFace ecosystem | REST + gRPC | No | No | No |

Llama Stack is the right choice when you need the full application layer (safety, agents, RAG) and want a single versioned API surface. The tradeoff is operational complexity: you're running two servers (vLLM + Llama Stack) and have more components to monitor.

vLLM wins when you need maximum inference performance and will build application logic yourself. It's faster to set up, easier to monitor, and has broader model support. See our vLLM production guide for the complete multi-GPU setup. For a deep dive into the continuous batching and PagedAttention mechanics that make vLLM fast, see the LLM serving optimization guide.

SGLang wins for workloads with repeated prefix structures (multi-agent conversations, shared system prompts) where RadixAttention's cache hit rates make a measurable latency difference. See our SGLang deployment guide for production configuration.

TGI is still viable for simple deployments in the HuggingFace ecosystem, but lacks the multi-GPU performance optimizations of vLLM and the application-layer features of Llama Stack.

For pure inference throughput benchmarks across all four, see the inference framework benchmark.


Common Issues and Fixes

Llama Stack Server Fails to Connect to vLLM

If you see connection refused errors on startup, the most likely cause is that vLLM hasn't finished loading the model yet. Downloading and loading the Scout weights takes 5-10 minutes on first run. Wait until you see:

INFO:     Application startup complete.

in the vLLM container logs before starting the Llama Stack server.
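Instead of watching the logs by hand, a startup script can poll vLLM until it answers and only then launch Llama Stack. The following is a minimal sketch using only the standard library. It assumes vLLM's OpenAI-compatible server is reachable at http://localhost:8000 and answers GET /v1/models once the model is loaded; the function names, the 15-minute timeout, and the 5-second interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx response


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe repeatedly until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Assumed endpoint: adjust host and port to match your vLLM container.
    ready = wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))
    print("vLLM is ready" if ready else "Timed out waiting for vLLM")
```

Passing the probe in as a function keeps the retry logic testable without a running server.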

Llama Guard Safety Model Not Found

If the safety provider errors on startup with a model-not-found message, the Llama Guard 3 model needs to be registered separately. Add this to your run.yaml under models:

yaml
- model_id: meta-llama/Llama-Guard-3-8B
  provider_model_id: meta-llama/Llama-Guard-3-8B
  provider_id: vllm0
  model_type: llm

This assumes you're serving Llama Guard 3 through the same vLLM instance. Alternatively, set the safety provider to inline::meta-reference, which downloads and runs Llama Guard directly.

OOM on Agent Tool Calls

Agent loops can spike memory usage during tool execution, especially with web search returning long context. Set max_infer_iters to a low value (3-5) in development and monitor KV cache usage via vLLM's /metrics endpoint before increasing context length or concurrency.
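vLLM's /metrics endpoint returns Prometheus text, which is simple enough to parse without extra dependencies. The sketch below extracts the values of one gauge from such a payload. The metric name in the sample (vllm:gpu_cache_usage_perc) is an assumption and has changed between vLLM versions, so check your own /metrics output for the exact name. The parser also assumes sample lines carry no trailing timestamp.

```python
from typing import List


def read_gauge(metrics_text: str, metric_name: str) -> List[float]:
    """Return every sample value for metric_name in a Prometheus text payload."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        if name == metric_name:
            values.append(float(value))
    return values


# Sample payload for illustration; the metric name is an assumption.
sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="scout"} 0.42
vllm:num_requests_running{model_name="scout"} 3.0
"""

usage = read_gauge(sample, "vllm:gpu_cache_usage_perc")
print(usage)
if usage and max(usage) > 0.9:
    print("KV cache nearly full: hold off on raising context length or concurrency")
```

In practice you would fetch the text from the /metrics endpoint on the vLLM port and run the same function on it at regular intervals.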

Distribution Template Names

Meta has renamed distributions across Llama Stack releases. If vllm-gpu fails, check current templates with:

bash
llama stack build --list-templates

Common alternatives: meta-reference-gpu, remote-vllm, vllm. Use whatever appears in the current CLI output.


Llama Stack's vLLM distribution runs on Spheron H100 and H200 instances without any proprietary dependencies. One command provisions bare-metal GPU access; another starts the full stack.



Quick Setup Guide

  1. Choose your GPU and model combination

    Select a GPU based on model and quantization. Llama 4 Scout (INT4): a single H200 SXM5 141GB, or a single H100 SXM5 80GB as a tight fit. Llama 4 Scout (FP8): 2x H100 SXM5. Llama 4 Maverick (INT4): 4x H100 SXM5 or 2x H200 SXM5. Full FP16: significantly more VRAM required. Check current GPU pricing before provisioning.

  2. Provision a GPU instance on Spheron

    Log in to app.spheron.ai, select your GPU configuration from the catalog, choose on-demand or spot pricing, and deploy. SSH access is available in under 2 minutes. Verify GPUs with nvidia-smi before proceeding.

  3. Install Llama Stack CLI and dependencies

    On your instance: pip install llama-stack. This installs both the llama-stack server and the llama CLI. Verify with: llama --version. Ensure Python 3.10+ and CUDA drivers are present.

  4. Choose a distribution and build the stack

    Run: llama stack build --template vllm-gpu --image-type docker. This pulls all provider dependencies for the vLLM GPU distribution. Review the generated run.yaml to confirm provider assignments before starting.

  5. Start the Llama Stack server

    Run: llama stack run ./run.yaml. The server starts on port 8321 by default and exposes the full Llama Stack REST API: /inference, /agents, /memory, /safety, /eval. Confirm the server is healthy with: curl http://localhost:8321/v1/models.

  6. Test inference, safety, and agents

    Use the llama-stack-client Python SDK or plain curl. Test inference at /v1/inference/chat-completion, safety filtering via the /v1/safety/run-shield endpoint, and a simple agent loop via /v1/agents. Confirm Llama Guard responses are filtering correctly before exposing to external traffic.


Frequently Asked Questions

What is Llama Stack, and how is it different from running vLLM directly?

Llama Stack is Meta's officially maintained, opinionated framework that bundles Llama inference, safety (Llama Guard), agents, tool use, RAG memory, and eval under a single unified API. Raw vLLM is a high-performance inference engine only: it handles tokens in and tokens out. Llama Stack sits on top of vLLM (as one of its inference providers) and adds the application-layer APIs: agents, tool calling, memory, safety filtering, and benchmark evaluation. Choose Llama Stack when you want a complete, production-ready application stack. Choose raw vLLM when you only need a fast inference server and will build application logic yourself.

What GPU do I need to run Llama 4 Scout or Maverick with Llama Stack?

Llama 4 Scout (109B total parameters, 17B active per token) requires approximately 55GB of VRAM with INT4 quantization or 218GB in FP16. A single H200 SXM5 (141GB VRAM) comfortably runs Scout with INT4 quantization and some context headroom. A single H100 SXM5 (80GB) can also run Scout with INT4 quantization but with tighter memory. For Llama 4 Maverick (400B), you need at least 4x H100 SXM5 in INT4, 2x H200 SXM5 in INT4, or 4x B200 SXM6 for full FP8 quality.

Can I run Llama Stack on Spheron GPU instances?

Yes. Llama Stack runs cleanly on standard NVIDIA CUDA Docker images available on Spheron GPU instances. There is no proprietary kernel, driver extension, or cloud-specific dependency required. Llama Stack's vLLM distribution (the recommended production setup) uses the standard vllm/vllm-openai Docker image as the inference backend, which Spheron instances support out of the box.

What is a Llama Stack distribution?

A distribution is a pre-configured bundle of provider choices for each API in Llama Stack: which inference backend to use, which safety model, which memory/vector store, which tool executor, and so on. Meta ships several reference distributions (e.g., 'meta-reference', 'vllm-gpu', 'remote-vllm'). You select a distribution to get a consistent, reproducible stack without manually wiring each API provider. For production GPU deployments, the 'vllm-gpu' distribution is the recommended starting point.

Is Llama Stack compatible with the OpenAI API?

Llama Stack exposes its own API surface, which is not a direct drop-in for the OpenAI SDK. However, it includes an OpenAI compatibility layer so you can point standard OpenAI client libraries at your Llama Stack inference endpoint. The agents, memory, and eval APIs are Llama Stack-specific and have no direct OpenAI equivalent.
