Deploy CrewAI on GPU Cloud: Production Multi-Agent Workflows with Self-Hosted LLM Inference (2026 Guide)

CrewAI prototypes run fine against the OpenAI API. Production crews with 5-20 agents hit latency walls fast, and the per-token cost compounds with every agent in the pipeline. A 5-agent sequential crew makes 5 serial LLM calls per run; at ~$0.05 per execution on GPT-4o, 10,000 daily runs becomes a ~$500 daily API bill before you've added any tooling. Self-hosting the inference backend on a GPU changes the math.

CrewAI in the Multi-Agent Landscape

Three frameworks dominate production multi-agent deployments: CrewAI, LangGraph, and AutoGen. They solve different problems.

Framework	Model	State management	Best for
CrewAI	Role-based crew with defined responsibilities	Sequential or hierarchical task pipeline	Structured workflows where agents have clear, non-overlapping roles
LangGraph	Directed state graph with explicit nodes and edges	Persistent checkpointing (MemorySaver, BaseStore)	Complex control flow, branching, retries, human-in-the-loop
AutoGen	Conversational agent pairs or group chats	In-memory conversation history	Code-execution workflows, pair-programming agents

CrewAI is the fastest path to a working multi-agent pipeline. You define roles, goals, and backstories in plain text; the framework handles routing, task delegation, and result aggregation. The tradeoff is less control over execution state compared to LangGraph. If your use case needs conditional branching based on mid-run results, checkpoint-based resumability on spot instances, or explicit human interruption points, the LangGraph vs LangChain comparison walks through when the graph model pays off. For teams already using the Microsoft ecosystem, the Microsoft Agent Framework self-hosting guide covers the equivalent GPU backend setup for MAF, including the model client config and supervisor/worker graph topology.

For straightforward workflows like research-synthesis, document processing, or multi-step analysis pipelines with defined agent specializations, CrewAI wins on simplicity.

If your crew needs to expose its tasks to external orchestrators or receive delegations from non-CrewAI agents, the A2A multi-agent protocol guide covers the interoperability layer.

For teams drawn to Python code execution over JSON tool calls, the SmolAgents production deployment guide covers the CodeAgent paradigm with a self-hosted vLLM backend.

Why the LLM Backend Becomes the Bottleneck in Production

The fanout math is simple and brutal.

A 5-agent sequential crew makes 5 LLM calls per run. Each call takes 2-6 seconds depending on output length and model size, so a run takes 10-30 seconds end to end. At 100 concurrent users, that is 500 LLM calls per wave of runs, with up to 100 in flight at any moment, all queuing against a shared API endpoint with rate limits.
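The fanout arithmetic can be sketched in a few lines. The function name and its parameters are illustrative, not part of CrewAI, and the sketch assumes one LLM call per agent with no overlap inside a single sequential run, so each running crew has one call in flight at a time.

```python
def sequential_crew_load(agents, users, min_call_s=2.0, max_call_s=6.0):
    """Estimate latency and backend load for a sequential crew.

    Assumes one LLM call per agent and no overlap inside a single run.
    """
    return {
        "latency_s": (agents * min_call_s, agents * max_call_s),
        "calls_per_wave": agents * users,  # total calls if every user runs the crew once
        "peak_in_flight": users,           # one active call per running crew
    }


load = sequential_crew_load(agents=5, users=100)
print(load)
```

Retries and tool-calling loops add calls on top of this floor, so treat the result as a lower bound.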

In hierarchical process mode, manager and worker calls can overlap, so multiple requests may hit the backend at the same time. A crew with a manager plus 4 workers can generate up to 5 concurrent LLM calls in a single crew execution cycle.

KV-cache pressure compounds the problem. When multiple agents share a single GPU-resident model endpoint, each concurrent request occupies VRAM for its context window. A 4096-token context at Qwen3-32B FP8 needs roughly 1.5GB of KV-cache per slot. With 10 concurrent agents, that is 15GB consumed by cache alone, on top of the ~35GB needed for the model weights — roughly 50GB total, which exceeds the L40S (48GB). Running 10 concurrent Qwen3-32B agents requires an A100 80GB or H100 80GB; a 24GB consumer card does not even fit the model weights. vLLM's continuous batching and PagedAttention manage how this cache memory is allocated across concurrent requests, enabling 2-4x more concurrent agents than naive static batching.

The API endpoint alternative hides these mechanics behind a rate limit error. You hit 429s and add exponential backoff. Token throughput stays capped. Self-hosting puts you in control of the concurrency ceiling.

Production Architecture

The production stack has four layers. Each one can run on separate machines or co-locate depending on budget.

Layer	Component	Compute	Role
1	CrewAI Orchestrator	CPU	Task routing, agent delegation, result aggregation. Calls Layer 2 over HTTP using the OpenAI-compatible API.
2	vLLM Inference Server	GPU (required)	OpenAI-compatible endpoint on port 8000. Handles all LLM calls, KV-cache allocation, and continuous batching.
3	Memory Backend	CPU + optional GPU	Mem0 or Zep for cross-session persistence. TEI embedding server (optional) uses GPU. ChromaDB or Qdrant as the vector store.
4	Tool Servers	CPU	Web search (Tavily, Serper), code interpreter (E2B), file tools, and database connectors via HTTP or MCP.

For most teams, Layers 1 and 3-4 run on a small CPU instance ($20-50/month). Layer 2 (the GPU) is the only costly piece, and it is also the only piece that matters for throughput.

GPU Sizing Guide

VRAM math for a crew depends on three numbers: model weight size, KV-cache per concurrent agent, and buffer headroom.

For FP8 quantization (opt-in in vLLM via --quantization fp8 on supported models and GPUs), approximate model weight sizes are:

8B model: ~9GB
14B model: ~15GB
32B model: ~35GB
70B model: ~75GB

KV-cache per concurrent request at 4096-token context:

8B model: ~0.5GB per slot
32B model: ~1.5GB per slot
70B model: ~3.5GB per slot

Model (FP8)	Concurrent slots	Total VRAM (approx.)	Recommended GPU
8B	1	~10GB	L40S (48GB)
8B	4	~12GB	L40S (48GB)
32B	1	~37GB	L40S (48GB)
32B	4	~42GB	A100 80GB / H100 80GB
70B	1	~79GB	Multi-GPU (2× H100)
70B	4+	~90GB+	Multi-GPU (2× H100)

L40S instances on Spheron fit the largest category of real CrewAI deployments: 3-10 agent crews with 7-32B models in sequential mode. The 48GB at FP8 can run Qwen3-32B with headroom for 2-3 concurrent agent requests before KV-cache pressure degrades throughput.

For crews that need 32B models in hierarchical mode or anything larger, A100 on Spheron with its 80GB provides the buffer. Plan for no more than 85% VRAM utilization in production to leave room for cache growth during peak concurrency.
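The sizing rules above reduce to two small calculations. This is a rough sketch that reuses the approximate figures from this section (FP8 weights, 4096-token context, 85% utilization ceiling); the function and constant names are hypothetical, and real KV-cache use varies with context length and model architecture.

```python
# Approximate figures from this section (FP8 weights, 4096-token context).
WEIGHTS_GB = {"8B": 9.0, "14B": 15.0, "32B": 35.0, "70B": 75.0}
KV_GB_PER_SLOT = {"8B": 0.5, "32B": 1.5, "70B": 3.5}


def required_vram_gb(model, slots):
    """Model weights plus KV-cache for `slots` concurrent requests."""
    return WEIGHTS_GB[model] + KV_GB_PER_SLOT[model] * slots


def max_slots(model, gpu_gb, utilization=0.85):
    """Concurrent slots that fit under the utilization ceiling (0 if the weights do not fit)."""
    budget = gpu_gb * utilization - WEIGHTS_GB[model]
    return max(0, int(budget // KV_GB_PER_SLOT[model]))


print(required_vram_gb("32B", 10))  # 50.0, more than an L40S holds
print(max_slots("32B", 48))         # 3 slots on an L40S under the 85% rule
print(max_slots("70B", 48))         # 0, the weights alone exceed the budget
```

The 3-slot result for a 32B model on an L40S lines up with the 2-3 concurrent requests quoted above.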

Step-by-Step: Deploying vLLM on Spheron for CrewAI

1. Provision the GPU instance

Log into app.spheron.ai. For most CrewAI deployments with models up to 32B, pick the L40S (48GB). For larger models or high-concurrency hierarchical crews, pick A100 80GB. Choose on-demand for a persistent production service; use spot only for batch or offline crew runs.

Once the instance is up, SSH in and verify the GPU:

bash

nvidia-smi

2. Install Docker with the NVIDIA container toolkit

bash

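# nvidia-container-toolkit comes from NVIDIA's apt repository; add that repo first if apt cannot find the package
apt-get update
apt-get install -y docker.io nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker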

Verify that Docker can see the GPU:

bash

docker run --gpus all --rm nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

3. Deploy vLLM

For a 32B model on an L40S:

bash

docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-32B \
  --quantization fp8 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-model-len 8192

Key flags:

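--max-num-seqs 64: upper bound on the sequences the scheduler batches at once (concurrent agent requests). Actual concurrency is limited by the KV-cache that fits in VRAM; tune this down if you see OOM errors or frequent preemption.
--gpu-memory-utilization 0.85: lets vLLM use 85% of VRAM for weights and KV-cache, leaving 15% free for spikes.
--max-model-len 8192: caps context length to control KV-cache size.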

For a 70B model on two H100s:

bash

docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32

Verify the endpoint is up:

bash

curl http://localhost:8000/v1/models
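The same check can be scripted before a crew starts. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the server returns the OpenAI-style list shape (an object with a data array of entries that each carry an id). The sample payload is hand-written for the demonstration, not captured server output.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def endpoint_serves(base_url, model_id, timeout=5.0):
    """Return True if the endpoint lists `model_id`. Network errors propagate to the caller."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_id in served_model_ids(json.load(resp))


# Offline demonstration with a hand-written sample payload.
sample = {"object": "list", "data": [{"id": "Qwen/Qwen3-32B", "object": "model"}]}
print(served_model_ids(sample))
```

Against a live server, endpoint_serves("http://localhost:8000/v1", "Qwen/Qwen3-32B") should return True once the model has finished loading.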

4. Configure CrewAI to use the vLLM endpoint

Install CrewAI (pin to 0.80 or later for the stable Process enum and LLM class):

bash

pip install "crewai>=0.80" crewai-tools

Option A: environment variables (applies globally to all agents)

python

import os
os.environ["OPENAI_API_BASE"] = "http://YOUR_SPHERON_IP:8000/v1"
os.environ["OPENAI_API_KEY"] = "sk-placeholder"
os.environ["OPENAI_MODEL_NAME"] = "Qwen/Qwen3-32B"

Option B: per-agent LLM config (allows different models per agent)

python

from crewai import LLM

local_llm = LLM(
    model="openai/Qwen/Qwen3-32B",  # note the openai/ prefix for LiteLLM routing
    base_url="http://YOUR_SPHERON_IP:8000/v1",
    api_key="sk-placeholder",
)

Note the difference: the environment variable form uses the bare model name (Qwen/Qwen3-32B); the LLM constructor form requires the openai/ prefix (openai/Qwen/Qwen3-32B). This is a LiteLLM routing requirement, not a CrewAI quirk.
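One way to keep the two forms from drifting apart is to derive the prefixed name from the served model ID. The helper below is a hypothetical convenience, not a CrewAI or LiteLLM function.

```python
def litellm_model_name(served_id, provider="openai"):
    """Prefix a vLLM model ID for the LLM constructor; leave it unchanged if already prefixed."""
    prefix = f"{provider}/"
    return served_id if served_id.startswith(prefix) else prefix + served_id


print(litellm_model_name("Qwen/Qwen3-32B"))         # openai/Qwen/Qwen3-32B
print(litellm_model_name("openai/Qwen/Qwen3-32B"))  # openai/Qwen/Qwen3-32B
```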

5. A working 3-agent crew example

python

import os
from crewai import Agent, Task, Crew, Process, LLM

# Point CrewAI at the self-hosted vLLM endpoint
local_llm = LLM(
    model="openai/Qwen/Qwen3-32B",
    base_url="http://YOUR_SPHERON_IP:8000/v1",
    api_key="sk-placeholder",
    max_tokens=2048,
    temperature=0.7,
)

# Define agents with specific roles.
# max_execution_time (seconds) is set on the Agent; it limits each task that agent runs.
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, current information on the given topic",
    backstory="You specialize in finding and summarizing technical information.",
    llm=local_llm,
    verbose=True,
    max_iter=3,
    max_execution_time=120,  # seconds
)

writer = Agent(
    role="Technical Writer",
    goal="Produce clear, accurate technical content based on research findings",
    backstory="You turn complex technical information into readable content.",
    llm=local_llm,
    verbose=True,
    max_iter=3,
    max_execution_time=120,
)

reviewer = Agent(
    role="Quality Reviewer",
    goal="Ensure accuracy, clarity, and completeness of the final output",
    backstory="You catch errors, gaps, and unclear explanations before publishing.",
    llm=local_llm,
    verbose=True,
    max_iter=2,
    max_execution_time=60,
)

# Define tasks
research_task = Task(
    description="Research the topic: {topic}. Focus on recent developments and key facts.",
    expected_output="A structured summary with key findings, sources, and gaps.",
    agent=researcher,
)

writing_task = Task(
    description="Write a 500-word technical summary based on the research findings.",
    expected_output="A 500-word technical summary ready for publication.",
    agent=writer,
    context=[research_task],
)

review_task = Task(
    description="Review the summary for accuracy, clarity, and completeness.",
    expected_output="The reviewed summary with any corrections noted.",
    agent=reviewer,
    context=[writing_task],
)

# Assemble the crew
crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, writing_task, review_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"topic": "vLLM continuous batching"})
print(result)

Run this against a single agent first (Crew(agents=[researcher], tasks=[research_task])) to confirm the endpoint connection before adding the full pipeline.

Memory and Persistence: Hooking CrewAI to Mem0 and Zep

CrewAI's built-in memory=True flag uses ChromaDB in a local directory for short-term, entity, and long-term memory within a single session. It is fast to set up and works well for single-session workflows.

python

crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, writing_task, review_task],
    process=Process.sequential,
    memory=True,  # enables built-in ChromaDB-backed memory
)

For cross-session persistence, the built-in memory resets on process restart. To retain knowledge across crew runs, you need an external backend. The agent memory infrastructure guide covers Mem0, Zep, and Letta deployment on Spheron GPUs, including how to run the embedding server and vector store alongside the inference backend.

The pattern for CrewAI specifically: deploy a Mem0 server that uses your vLLM endpoint for LLM-based memory extraction and your TEI embedding server for embeddings, with ChromaDB or Qdrant as the vector store. Then create a custom CrewAI tool that calls the Mem0 API:

python

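from crewai.tools import tool  # the tool decorator ships with crewai itself in current releases
import requests

@tool("memory_search")
def memory_search(query: str) -> str:
    """Search long-term memory for relevant past information."""
    # Endpoint path and payload shape depend on how your Mem0 server is deployed; adjust to match.
    response = requests.post(
        "http://your-mem0-server:8000/v1/memories/search",
        json={"query": query, "user_id": "crew_default", "limit": 5},
        timeout=10,  # fail fast instead of stalling the agent on a slow memory server
    )
    response.raise_for_status()
    memories = response.json().get("results", [])
    return "\n".join(m.get("memory", "") for m in memories)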

Attach the tool to the agents that need historical context (typically the researcher and reviewer). The writer rarely needs past memories.

Observability for Multi-Agent Execution

Multi-agent traces are harder to debug than single-agent ones because failures can be buried in the middle of a pipeline. Langfuse is the cleanest option for CrewAI because LiteLLM ships a Langfuse callback: set three environment variables, register the callback (litellm.success_callback = ["langfuse"]), and the LLM calls that CrewAI routes through LiteLLM are traced.

bash

export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance

With tracing enabled, every crew execution generates a trace with one span per agent LLM call. For the 3-agent crew above, you see 3 spans (or more if agents retry):

crew_kickoff [total: 47.2s]
  ├─ researcher [12.3s, 1,847 tokens, $0.00 at self-hosted]
  ├─ writer     [18.9s, 2,203 tokens, $0.00 at self-hosted]
  └─ reviewer   [16.0s, 1,102 tokens, $0.00 at self-hosted]

The per-span token counts tell you which agent is generating the most output. If the writer agent is at 4x the tokens of the researcher, your writing task's expected_output instructions are too loose. Tighten them or reduce max_tokens on the writer's LLM config.
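The 4x rule of thumb is easy to automate once the per-agent token counts are exported from the trace. The function below is a hypothetical helper, not a Langfuse API, and the second dictionary is made-up data that shows a flagged case.

```python
def flag_verbose_agents(span_tokens, baseline, ratio=4.0):
    """Return agents whose token count is at least `ratio` times the baseline agent's."""
    base = span_tokens[baseline]
    return sorted(
        name
        for name, tokens in span_tokens.items()
        if name != baseline and tokens >= ratio * base
    )


# Token counts from the example trace above: nothing is flagged.
trace = {"researcher": 1847, "writer": 2203, "reviewer": 1102}
print(flag_verbose_agents(trace, baseline="researcher"))  # []

# Made-up run where the writer is far too verbose.
loose = {"researcher": 1800, "writer": 7600}
print(flag_verbose_agents(loose, baseline="researcher"))  # ['writer']
```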

Arize Phoenix is a good alternative if you prefer a fully self-hosted trace backend. It supports OpenTelemetry, and the OpenInference project provides LiteLLM and CrewAI instrumentation for it.

Reliability Patterns

Timeout handling

Set max_execution_time on each Agent to prevent runaway agents from blocking the pipeline. In CrewAI it is an Agent attribute (in seconds) and applies to every task that agent executes:

python

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information",
    backstory="...",
    llm=local_llm,
    max_execution_time=120,  # 2 minutes per task execution
)

If the limit is exceeded, the run fails with an exception (a TimeoutError in recent CrewAI releases; confirm against the version you run). Catch it around crew.kickoff() and decide whether to retry or return a partial result.
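A crew-level retry can be sketched as a thin wrapper around the kickoff call. Everything here is an assumption for illustration: run_with_retry is a hypothetical helper, and retry_on defaults to the built-in TimeoutError, so substitute whichever exception your CrewAI version raises on timeout. The flaky function stands in for crew.kickoff so the sketch runs without a GPU backend.

```python
import time


def run_with_retry(kickoff, attempts=3, base_delay_s=1.0,
                   retry_on=(TimeoutError,), sleep=time.sleep):
    """Call kickoff() and retry with exponential backoff on the listed exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return kickoff()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Stand-in for crew.kickoff that times out twice before succeeding.
calls = {"n": 0}

def flaky_kickoff():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated task timeout")
    return "final report"


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_retry(flaky_kickoff, sleep=delays.append)
print(result, delays)
```

With a real crew, the call would look like run_with_retry(lambda: crew.kickoff(inputs={"topic": "..."})). Keep attempts low: every retry repeats all of the crew's LLM calls.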

Agent retries

Set max_iter on each Agent to cap how many reasoning and tool-use iterations the agent may spend on a task before it has to return its best answer:

python

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information",
    backstory="...",
    llm=local_llm,
    max_iter=3,  # at most 3 iterations, including retries after output format failures
)

Higher max_iter costs more tokens. For agents with tool calls that are likely to succeed on first try, set max_iter=2. For agents with complex output format requirements, max_iter=3-4 is reasonable.

Partial failure recovery with hierarchical process

In hierarchical mode, the manager LLM reviews each worker agent's output before passing it downstream. If a worker fails to meet the task criteria, the manager can re-delegate to the same agent with additional context.

python

crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, writing_task, review_task],
    process=Process.hierarchical,
    manager_llm=local_llm,
    verbose=True,
)

The manager adds LLM calls on top of each worker call (up to one to delegate and one to validate output), so budget at least 1.5-2x the LLM calls of sequential mode, and closer to 3x when the manager validates every task. The benefit is better output quality and automatic handling of agent failures without explicit error handling code.

Cost Economics: Self-Hosted vs API

If you want to push cost reduction further than task delegation alone, the plan-and-execute architecture guide covers the hardware-level split between frontier planner GPUs and small-model executor GPUs, typically cutting per-task inference spend by 80-90%.

Using live Spheron pricing and GPT-4o's published rates (input: $2.50/1M tokens, output: $10.00/1M tokens as of early 2026), here is the cost breakdown for a 5-agent crew that processes 2,000 input tokens and generates 500 output tokens per agent per run:

Crew runs/day	Tokens/run (est.)	OpenAI GPT-4o cost/day	L40S on-demand ($0.72/hr)	H100 SXM5 on-demand ($3.84/hr)
500	12,500	~$25	$17.28	$92.16
2,000	12,500	~$100	$17.28	$92.16
5,000	12,500	~$250	$17.28	$92.16
10,000	12,500	~$500	$17.28	$92.16

The GPU server runs 24 hours regardless of crew run count, so the self-hosted cost is flat per day. The API cost scales linearly with usage. The L40S crossover sits near 350 runs/day ($17.28 / $0.05 per run), so at 500 runs/day the L40S on-demand at $17.28/day already beats GPT-4o at ~$25/day. H100 on Spheron at $92.16/day breaks even around 1,850 runs/day, just under the 2,000-run row where API costs hit ~$100. For 32B-class models at high throughput, the L40S is the stronger value pick; H100 80GB earns its cost for models above 40B or crews that need headroom for larger KV-cache.
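The break-even volumes follow directly from the per-run API cost and the daily GPU price. The sketch below uses the token counts and prices quoted in this section; the function names are illustrative, and it assumes the GPU is billed 24 hours a day and can actually serve the volume in question.

```python
def api_cost_per_run(agents, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """API cost of one crew run in dollars."""
    return agents * (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000


def breakeven_runs_per_day(gpu_hourly, cost_per_run):
    """Daily run count at which a GPU billed 24/7 costs the same as the API."""
    return gpu_hourly * 24 / cost_per_run


per_run = api_cost_per_run(5, 2000, 500, 2.50, 10.00)
print(round(per_run, 4))                             # 0.05
print(round(breakeven_runs_per_day(0.72, per_run)))  # 346 runs/day for the L40S
print(round(breakeven_runs_per_day(3.84, per_run)))  # 1843 runs/day for the H100 SXM5
```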

H100 SXM5 spot pricing (around $1.69/hr vs $3.84/hr on-demand) cuts the crossover point substantially for H100 workloads. Note that L40S spot can run above on-demand during peak demand periods, so check live rates before assuming spot is cheaper. A single vLLM server can also handle crews from multiple services simultaneously, spreading the fixed GPU cost across more usage.

Pricing fluctuates based on GPU availability. The prices above reflect rates on 26 May 2026 and may have changed. Check current GPU pricing for live rates.

For teams running larger agent fleets where multiple crews compete for the same GPU capacity, see the scaling AI agent fleets guide for GPU tiering strategies across orchestrator and worker layers.

Reference Architecture: CrewAI + vLLM + Mem0 on Spheron

This is the complete working configuration for a production 3-agent crew with persistent memory:

python

import os
import requests
from crewai import Agent, Task, Crew, Process, LLM
from crewai.tools import tool  # the tool decorator ships with crewai itself in current releases

# vLLM endpoint on Spheron GPU
LOCAL_LLM = LLM(
    model="openai/Qwen/Qwen3-32B",
    base_url="http://YOUR_SPHERON_IP:8000/v1",
    api_key="sk-placeholder",
    max_tokens=2048,
    temperature=0.7,
)

MEM0_URL = "http://YOUR_MEM0_SERVER:8000"

@tool("memory_search")
def memory_search(query: str) -> str:
    """Search long-term memory for context relevant to the query."""
    response = requests.post(
        f"{MEM0_URL}/v1/memories/search",
        json={"query": query, "user_id": "crew_shared", "limit": 5},
        timeout=10,  # fail fast if the memory server is slow or down
    )
    response.raise_for_status()
    results = response.json().get("results", [])
    if not results:
        return "No relevant memories found."
    return "\n".join(f"- {m.get('memory', '')}" for m in results)

@tool("memory_add")
def memory_add(content: str) -> str:
    """Store a new memory for future crew runs."""
    response = requests.post(
        f"{MEM0_URL}/v1/memories/",
        json={"messages": [{"role": "user", "content": content}], "user_id": "crew_shared"},
        timeout=10,  # fail fast if the memory server is slow or down
    )
    response.raise_for_status()
    return "Memory stored."

# max_execution_time (seconds) is set on the Agent; it limits each task that agent runs.
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, current technical information",
    backstory="You focus on facts, primary sources, and recent developments.",
    llm=LOCAL_LLM,
    tools=[memory_search, memory_add],
    max_iter=3,
    max_execution_time=120,
)

writer = Agent(
    role="Technical Writer",
    goal="Turn research into clear, accurate technical content",
    backstory="You write for engineers who need precise, concise explanations.",
    llm=LOCAL_LLM,
    max_iter=3,
    max_execution_time=120,
)

reviewer = Agent(
    role="Quality Reviewer",
    goal="Catch errors, gaps, and unclear sections before output is finalized",
    backstory="You verify technical accuracy and flag anything that needs more evidence.",
    llm=LOCAL_LLM,
    tools=[memory_search],
    max_iter=2,
    max_execution_time=60,
)

research_task = Task(
    description="Research: {topic}. Search memory first for prior findings.",
    expected_output="Structured summary: key facts, sources, gaps. Max 400 words.",
    agent=researcher,
)

writing_task = Task(
    description="Write a 500-word technical explanation of the research findings.",
    expected_output="500-word technical summary. No marketing language.",
    agent=writer,
    context=[research_task],
)

review_task = Task(
    description="Review the summary. Flag any factual errors or unclear sections.",
    expected_output="Final summary with review notes. Mark corrections inline.",
    agent=reviewer,
    context=[writing_task],
)

crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, writing_task, review_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"topic": "vLLM PagedAttention and KV-cache management"})
print(result)

The same crew pattern adapts well to a self-hosted AI coding assistant workflow - swap the researcher for a code-reading agent and the writer for a code-generation agent.

CrewAI's production ceiling is set by the LLM backend it calls. Self-hosting vLLM on a bare-metal GPU gives consistent throughput across all concurrent agents without per-token API fees stacking up with every crew run.


Quick Setup Guide

Size your GPU based on crew configuration and model choice
Count the maximum number of agents that can run concurrently in your crew. In sequential process mode, only one agent runs at a time - a single GPU handles it. In hierarchical process mode, the manager LLM and worker agents may overlap. For each concurrent agent slot, reserve VRAM for KV-cache: a 4096-token context window at 8B FP8 uses about 512MB per slot. Add model weight VRAM on top. Total must stay below 85% of the GPU's physical VRAM.
Provision a GPU instance on Spheron and install Docker
Log into app.spheron.ai. Select L40S (48GB) for crews using 7-32B models or A100/H100 (80GB) for larger models or high-concurrency hierarchical crews. Choose on-demand for production services. SSH in, verify the GPU with nvidia-smi, and install Docker with the NVIDIA container toolkit (apt-get install -y docker.io nvidia-container-toolkit && systemctl restart docker).
Deploy vLLM with an OpenAI-compatible endpoint
Run: docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3-32B --quantization fp8 --gpu-memory-utilization 0.85 --max-num-seqs 64. The --max-num-seqs flag caps the number of concurrent requests, which bounds the number of concurrent agents. For tensor parallelism across multiple GPUs, add --tensor-parallel-size 2 for 70B-class models (or 4 for larger ones). Verify with: curl http://localhost:8000/v1/models.
Configure CrewAI to route to your self-hosted endpoint
Install CrewAI: pip install crewai crewai-tools. In your crew script, set environment variables before any Agent instantiation: os.environ['OPENAI_API_BASE'] = 'http://YOUR_SPHERON_IP:8000/v1'; os.environ['OPENAI_API_KEY'] = 'sk-placeholder'; os.environ['OPENAI_MODEL_NAME'] = 'Qwen/Qwen3-32B'. Alternatively, pass llm=LLM(model='openai/Qwen/Qwen3-32B', base_url='http://YOUR_SPHERON_IP:8000/v1', api_key='sk-placeholder') to each Agent() constructor. Run a single-agent test crew before wiring up the full multi-agent pipeline.
Add observability with Langfuse
Install Langfuse SDK: pip install langfuse. Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST, and register LiteLLM's Langfuse callback (litellm.success_callback = ["langfuse"]) so that LLM calls routed through LiteLLM are traced. Each agent's calls appear as separate spans, showing token usage, latency, and cost per agent. Use Langfuse's cost tracking to compare per-execution costs between your self-hosted endpoint and OpenAI baseline.
Tune memory and run a cost comparison
Enable CrewAI's built-in memory with memory=True on the Crew for single-session context. For cross-session persistence, deploy Mem0 pointing at your vLLM endpoint for LLM-based extraction and your TEI embedding server for embeddings, with a vector store such as ChromaDB or Qdrant behind it. To compute cost: take total tokens per crew execution from Langfuse, multiply by your vLLM server's hourly cost divided by throughput tokens per hour. Compare against the same token count at GPT-4o API rates to find your crossover volume.
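The per-run cost formula from the last step can be written out directly. The throughput figure below is a placeholder assumption, not a benchmark; replace it with the sustained tokens per hour you measure on your own vLLM server. The function name is hypothetical.

```python
def self_hosted_cost_per_run(tokens_per_run, gpu_hourly_usd, tokens_per_hour):
    """GPU cost attributed to one crew run at a given sustained throughput."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return tokens_per_run * gpu_hourly_usd / tokens_per_hour


ASSUMED_TOKENS_PER_HOUR = 2_000_000  # placeholder; measure this on your own server
GPT4O_COST_PER_RUN = 0.05            # per-run API figure used in this guide

local = self_hosted_cost_per_run(12_500, 0.72, ASSUMED_TOKENS_PER_HOUR)
print(round(local, 4))                       # 0.0045 under the assumed throughput
print(round(GPT4O_COST_PER_RUN / local, 1))  # ratio of API cost to self-hosted cost
```

This is the cost at full utilization. When the GPU sits idle part of the day, the effective per-run cost is the daily GPU bill divided by the actual run count.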

Frequently Asked Questions

What is CrewAI and how does it differ from LangGraph and AutoGen?

CrewAI is a role-based multi-agent framework where each agent is assigned a specific role, goal, and backstory, and agents collaborate through a sequential or hierarchical task pipeline. LangGraph models agents as nodes in a directed state graph with explicit edges and checkpointing, which suits complex control flow with branching, retries, and human-in-the-loop interruptions. AutoGen focuses on conversational agent pairs and group chats, using code execution as a first-class primitive. CrewAI is the fastest way to orchestrate a crew of specialized agents with minimal boilerplate; LangGraph gives finer control over execution state.

What GPU do I need to run CrewAI with a self-hosted LLM backend?

GPU requirements depend on the model you choose and the number of concurrent agents in your crew. A 3-5 agent crew using a 7-8B parameter model fits comfortably on a single L40S (48GB): the model occupies about 9GB in FP8, leaving ~39GB for KV-cache across concurrent agent requests. A crew using a 32B model needs ~35GB for the model weights alone; an L40S at FP8 fits sequential 32B crews, but hierarchical or high-concurrency setups need an A100 80GB or H100 80GB. For crews larger than 10 agents or models above 70B, budget for multi-GPU setups with tensor parallelism.

How does CrewAI connect to a self-hosted vLLM endpoint?

Set three environment variables before instantiating any Agent: OPENAI_API_BASE to your vLLM endpoint (e.g. http://your-spheron-ip:8000/v1), OPENAI_API_KEY to any non-empty string (vLLM does not validate it by default), and OPENAI_MODEL_NAME to the model ID you loaded in vLLM (e.g. Qwen/Qwen3-32B). CrewAI's LiteLLM layer treats anything at an OpenAI-compatible base URL as valid. You can also pass llm=LLM(model='openai/your-model', base_url='...', api_key='...') directly to each Agent constructor for per-agent model routing.

When does self-hosting the LLM backend for CrewAI beat using OpenAI API calls?

The crossover depends on token volume. A 5-agent crew where each agent processes 2,000 input tokens and generates 500 output tokens per run costs about $0.05 per crew execution on GPT-4o (5 agents x 2,000 input x $2.50/1M + 5 agents x 500 output x $10/1M = $0.05/run). At 10,000 crew executions per day, that is ~$500/day in API fees. An L40S on Spheron at on-demand pricing runs continuously for around $17/day (check current rates at spheron.network/pricing). If your crew runs more than ~350 executions per day, self-hosting an L40S pays for itself. The threshold drops further with spot pricing on H100 instances.

How do I add persistent memory to CrewAI agents across sessions?

CrewAI has a built-in memory flag (memory=True on the Crew) that uses a local ChromaDB vector store for short-term and entity memory. For production cross-session persistence, wire an external memory backend: configure a Mem0 client to point at your self-hosted embedding server, then use a custom tool or a before_kickoff hook to inject relevant memories into each agent's context at the start of each crew run. Zep CE handles the extraction and retrieval automatically if you point it at the same vLLM endpoint used by your crew.