Tutorial

Deploy OpenClaw on GPU Cloud: Self-Host the Open-Source Agentic Assistant With Your Own LLM Backend (2026 Guide)

deploy openclawself-host openclawopenclaw gpuopenclaw llm backendOpenClaw Self-HostAgentic Assistant GPU CloudvLLM OpenAI CompatibleOpen-Source AI AgentGPU Cloud
Deploy OpenClaw on GPU Cloud: Self-Host the Open-Source Agentic Assistant With Your Own LLM Backend (2026 Guide)

OpenClaw hit 370K+ GitHub stars faster than almost any open-source AI project in 2026. It's a fully self-hosted agentic assistant with 50+ built-in tool integrations, MCP compatibility, and a design that puts the LLM backend entirely under your control. The appeal is obvious: the same kind of capability people pay for in hosted agents, running on your own infrastructure, calling your own model.

The catch most teams hit: OpenClaw defaults to the OpenAI API. At agent-loop token volumes, the per-token bill grows fast. More importantly, every prompt and tool result goes through a third-party endpoint. For teams with data sensitivity requirements or predictable cost needs, the default setup doesn't work. Teams who have already moved to self-hosted frontends like Open WebUI or LibreChat face the same question with OpenClaw: how do you point it at your own model instead of the vendor API?

This guide covers the private GPU backend path. You run a vLLM inference server on a Spheron GPU instance, point OpenClaw at it, and keep everything on your own network. If you want context on setting up an OpenAI-compatible private API endpoint first, that guide covers the vLLM server setup in isolation.

What Is OpenClaw and Why It Went Viral

OpenClaw is structured around three components: a controller process, a tool registry, and an LLM backend. The controller manages the agentic loop. The tool registry holds the 50+ integrations: browser automation, code execution, file system access, web search, email, calendar, Slack, GitHub, and more. The LLM backend is the model that reasons over tool results and decides what to do next. The backend is pluggable: point it at any OpenAI-compatible endpoint and it works.

That three-component split is what makes self-hosting viable without forking the project. The controller and tool registry are CPU processes. The LLM backend is just an HTTP endpoint. Swap the OpenAI API URL for your vLLM server address and nothing else changes in the stack.

The viral growth came from a few things stacking up. MCP compatibility means you can connect OpenClaw to any existing MCP tool server, not just the built-in 50. The UI is polished in a way most open-source agent projects aren't. And the fully self-hosted story resonated with teams that had already moved away from ChatGPT for data reasons but still wanted an assistant that could actually do things, not just answer questions.

For data privacy, the self-hosted LLM path is what closes the loop. Tool calls that connect to Slack or GitHub still send data to those services, since that's unavoidable if you want those integrations. But the LLM reasoning, the prompts, the accumulated context in the agentic loop, none of that leaves your network if the backend is a vLLM server on your own GPU node.

Architecture: Two-Node Setup

The production setup uses two nodes:

NodeHardwareRole
Inference nodeH100 SXM5 80GB (or A100 80GB)vLLM serving the LLM over HTTP on port 8000
Controller node8-16 vCPU CPU instance, 32 GB RAMOpenClaw app + MCP tool servers

The communication path is: OpenClaw UI -> OpenClaw Controller -> vLLM HTTP endpoint on GPU node. The controller sends LLM requests to http://<inference-node-ip>:8000/v1 (OpenAI-compatible). Only the inference node needs GPU access. The controller node runs OpenClaw itself plus any local MCP servers you deploy (a local Chroma instance for RAG, a code execution sandbox, etc.).

For development or small-team testing, you can run both on the same GPU node. The separation matters when you want to scale the inference node independently, or when you want the CPU/memory bill for the controller to stay flat as GPU capacity scales up. A second scenario: if you run multiple OpenClaw instances pointing at the same vLLM endpoint, the controller nodes are cheap horizontal scale while the GPU stays singular. This two-node split is the same pattern as deploying OpenHands on GPU cloud, where a separate agent controller and GPU inference backend give you the same scaling flexibility.

Picking the LLM: Model Selection for Agentic Tool-Calling

The LLM you choose determines VRAM requirements, tool-call accuracy, and latency per step. In an agentic loop, each tool call triggers an LLM request: read the tool result, decide what to do next, emit a function call or a final answer. At 10-20 steps per task, time to first token compounds. H100 prefill speed is noticeably faster than A100 at the same model size, which shows up as task duration at production volumes.

ModelVRAM neededFormatRecommended GPUTool-callingBest for
Qwen3-32B~65 GBBF16H100 SXM5Excellent (hermes)Default choice for OpenClaw
Llama 3.3 70B~70 GBFP8A100 80GBStrong (llama3_json)Budget option
Qwen3-14B~30 GBBF16A100 80GBGood (hermes)Lightweight teams
Mistral Small 3.2 24B~48 GBBF16A100 80GB / H100Good (mistral)European data residency
Qwen3-235B-A22B MoE~240 GBFP84x H100 / 2x H200ExcellentFrontier performance

Qwen3-32B is the default recommendation for OpenClaw. The Qwen3 series has native structured output and function-calling support, and with --tool-call-parser hermes in vLLM the tool calls parse reliably. OpenClaw's 50+ tool integrations mean you will hit edge-case schema patterns regularly. Models without mature function-calling support start failing on those edge cases.

The --tool-call-parser flag is model-family-specific. Qwen3 uses hermes. Llama 3.x uses llama3_json. Mistral uses mistral. Using the wrong parser for your model causes silent tool-call failures: the agent receives what looks like a valid response but the function call is malformed, and it spins without making progress. Always match the parser to the model family.

For scaling patterns when you outgrow a single GPU, scaling AI agent fleets on GPU cloud covers multi-node orchestration for agent workloads.

GPU Sizing and Pricing

GPUVRAMOn-demand ($/hr)Spot ($/hr)Models supportedBest for
H100 SXM580 GB$3.98$2.91Qwen3-32B BF16, Llama 3.3 70B FP8Primary OpenClaw inference node
A100 80GB SXM480 GB$1.69$0.85Qwen3-32B BF16, Qwen3-14B, Llama 3.3 70B FP8Budget inference node
H200 SXM5141 GB$4.84$1.77Qwen3-235B MoE FP8, DeepSeek-V3 FP8Scale-out / frontier models

Pricing fluctuates based on GPU availability. The prices above are based on 16 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

For a single team running OpenClaw all day, one H100 SXM5 at $3.98/hr runs to roughly $95/day. That covers continuous operation, not just hours you're actively using it. For intermittent use (a few hours of active sessions per day), spot pricing at $2.91/hr brings the daily cost under $35. vLLM idles gracefully between requests, so you're paying for the GPU reservation, not per-call compute.

Step-by-Step Deployment

Step 1: Provision GPU and controller nodes

Log into app.spheron.ai and provision an H100 SXM5 80GB instance as the inference node. For the model weights, attach at least 100 GB persistent storage: Qwen3-32B at BF16 is ~65 GB of model weights plus cache overhead. For the controller node, a standard CPU instance with 8-16 vCPU and 32 GB RAM is sufficient.

For H100 SXM5 on Spheron, use the SXM5 variant specifically if you plan to use MIG partitioning for concurrent sessions later. PCIe variants don't support MIG.

Step 2: Install vLLM and download model weights

On the inference node:

bash
pip install 'vllm>=0.8.0' huggingface_hub hf_transfer
export HF_TOKEN=your_hf_token
export HF_HUB_ENABLE_HF_TRANSFER=1

# For Qwen3-32B
huggingface-cli download Qwen/Qwen3-32B

# For Llama 3.3 70B (FP8 budget option)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct

The HF_HUB_ENABLE_HF_TRANSFER=1 flag switches to a faster transfer protocol that cuts download time significantly on high-bandwidth GPU nodes.

Step 3: Launch the vLLM inference server

For Qwen3-32B at BF16 on H100 SXM5:

bash
vllm serve Qwen/Qwen3-32B \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

For Llama 3.3 70B at FP8 on A100 80GB:

bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json

The --enable-auto-tool-choice flag is what makes vLLM emit proper function-call formatted responses instead of plain text. Without it, OpenClaw receives text it can't parse as a tool call, and the agentic loop breaks silently. The --tool-call-parser must match the model family: hermes for Qwen3, llama3_json for Llama 3.x, mistral for Mistral family.

vLLM binds to all interfaces by default. For production, restrict it:

bash
vllm serve Qwen/Qwen3-32B \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --port 8000 \
  --host <internal-network-ip> \
  --api-key your-strong-random-key \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Step 4: Install and configure OpenClaw

Install OpenClaw on the controller node:

bash
npm install -g openclaw@latest
openclaw onboard

The onboard command walks through backend configuration interactively. When prompted for the LLM provider, select the OpenAI-compatible option and enter your vLLM server URL (http://<gpu-node-ip>:8000/v1) and the API key you set with VLLM_API_KEY. OpenClaw stores configuration in a JSON file. The key setting is agent.model, which uses a <provider>/<model-id> format. For a vLLM backend with Qwen3-32B, that value is openai-compatible/Qwen/Qwen3-32B.

Before running a session, verify the vLLM endpoint is reachable from the controller node:

bash
curl -s -o /dev/null -w "%{http_code}" http://<gpu-node-ip>:8000/health
# Should return 200

If that returns anything other than 200, the inference node's firewall or security group is blocking port 8000 from the controller.

Step 5: Wire up MCP tools without external data leakage

OpenClaw's 50+ built-in tools vary in their data exposure. Browser automation, GitHub integration, and email tools send data to those external services, which is expected behavior. The tools to audit are any that use hosted AI backends behind the scenes: cloud-based RAG services, hosted embedding APIs, third-party summarization tools. Those send your agent's context to external endpoints.

For private RAG, run Chroma or Qdrant locally on the controller node with a local embedding model:

bash
docker run -d -p 8001:8000 chromadb/chroma

Wire it as a local MCP server in OpenClaw's MCP config. The GPU-accelerated MCP server deployment guide covers how to wrap self-hosted retrieval backends as MCP tools with proper session handling for multi-turn agent conversations.

For the code execution tool, OpenClaw supports Docker-based sandboxing. Configure it to run sandboxes with network disabled if your tasks don't need outbound access:

yaml
sandbox:
  network_disabled: true
  runtime: docker

Step 6: Secure the vLLM endpoint

The VLLM_API_KEY environment variable locks the vLLM endpoint behind token authentication:

bash
export VLLM_API_KEY=your-strong-random-key
vllm serve Qwen/Qwen3-32B --dtype bfloat16 --port 8000 ...

Match the api_key in OpenClaw's config to this value. In production, bind vLLM to the internal network interface only (--host <internal-ip>) so the inference port is not reachable from the public internet. Only the OpenClaw controller node needs HTTP access to port 8000 on the inference node.

For the security group or firewall rules: allow TCP 8000 inbound on the inference node only from the controller node's internal IP. Block it from everything else.

Cost: Self-Hosted GPU Backend vs Hosted API Per-Token Pricing

The cost comparison depends on your daily output token volume. OpenClaw in active use generates high token counts: each tool-call step sends the accumulated context plus tool results back to the model. A single 20-step agent task can easily produce 5,000-10,000 output tokens across all the intermediate reasoning calls.

Daily output tokensGPT-4o API cost/dayH100 on-demand cost/dayH100 spot cost/day
500K$5$95 (flat, 24hr)$70
2M$20$95 (flat, 24hr)$70
5M$50$95 (flat, 24hr)$70
10M$100$95 (flat, 24hr)$70

GPT-4o output pricing at $10/1M tokens (as of mid-2026). H100 costs are daily flat rate from Spheron live pricing.

The GPU cost is fixed whether you use it or not. At low token volumes (under 1-2M output tokens/day), the API is cheaper in direct compute terms. At higher volumes, the GPU wins. The break-even varies, but for a small team running OpenClaw actively, 2M+ daily output tokens is typical once it's actually deployed and used for real work.

What this table doesn't show: the GPU also handles MCP tool-result summarization calls. When OpenClaw uses a retrieval tool, it sends the retrieved chunks back to the model to reason over them. Those are invisible in per-token API billing but very visible at agent-loop volumes. The GPU absorbs all of them at the same flat daily rate.

Scaling for Teams

One vLLM instance handles concurrent OpenClaw sessions. The default --max-num-seqs in vLLM is 256, which is more than enough for small team usage. You can lower it to control maximum concurrency if you want to limit KV cache usage:

bash
vllm serve Qwen/Qwen3-32B --max-num-seqs 32 ...

For 5-10 concurrent active users on Qwen3-32B, one H100 SXM5 handles the load comfortably. Each user's agentic session generates sequential LLM calls (one step at a time), not parallel ones, so the effective concurrency at any moment is much lower than the number of simultaneous sessions.

For teams above 10 concurrent active users: add a second H100 with tensor parallelism. Launch vLLM with --tensor-parallel-size 2 across two GPUs and it spreads the model weights. Alternatively, run two separate vLLM instances and load-balance with nginx upstream:

nginx
upstream vllm_pool {
    server gpu-node-1:8000;
    server gpu-node-2:8000;
}

Each OpenClaw controller points to the nginx upstream instead of a single node. The vLLM instances don't share KV cache, so session affinity doesn't matter for stateless tool-call patterns.

What to Watch in Logs

Three things break silently in vLLM-backed OpenClaw deployments:

Tool-call parsing failures. vLLM logs include the raw model output before parsing. If you see tool_call_parser error in the vLLM log, the model output didn't match the expected function-call format. The most common cause: using the wrong --tool-call-parser for the model family. Check the vLLM log at debug level to see the raw output.

Context window exhaustion in long loops. If an agentic task runs many steps, the accumulated context grows. When it exceeds --max-model-len, vLLM truncates or rejects the request. The symptom: OpenClaw tasks that succeed on short runs fail on complex multi-step tasks. Fix: increase --max-model-len (costs more KV cache VRAM) or implement context compression in OpenClaw's settings to prune old tool results from the context.

Queue depth under concurrency. The vLLM metrics endpoint at /metrics exposes vllm:num_requests_waiting. If this stays above 5 for more than 30 seconds, your GPU is undersized for the current concurrency. Add a second node or reduce the max number of concurrent sessions.


OpenClaw needs a GPU that stays up, not a serverless endpoint that cold-starts mid-agent-loop. Spheron H100 and A100 instances run on-demand with per-minute billing, so there's no idle waste between sessions.

H100 SXM5 on Spheron → | View all GPU pricing → | Get started →

STEPS / 06

Quick Setup Guide

  1. Pick your LLM and size the GPU

    Choose Qwen3-32B (BF16, ~65 GB VRAM) on H100 SXM5 for the best tool-calling accuracy, or Llama 3.3 70B (FP8, ~70 GB VRAM) on A100 80GB for a budget option (fits on the 80 GB card with tight KV cache headroom). Qwen3 series models have strong structured output and tool-calling performance which matters for OpenClaw's 50+ tool integrations. Avoid models without native function-calling support as they require prompt engineering workarounds that hurt agentic reliability.

  2. Provision a GPU node on Spheron

    Log into app.spheron.ai. Provision an H100 SXM5 80GB instance for Qwen3-32B, or an A100 80GB for Llama 3.3 70B FP8. Attach at least 100 GB persistent storage for model weights. Provision a separate CPU instance (8-16 vCPU, 32 GB RAM) for the OpenClaw controller, or run OpenClaw on the same GPU node during testing.

  3. Deploy vLLM on the GPU node

    Install vLLM (pip install vllm>=0.8.0). For Qwen3-32B: vllm serve Qwen/Qwen3-32B --dtype bfloat16 --max-model-len 32768 --port 8000 --enable-auto-tool-choice --tool-call-parser hermes. The --enable-auto-tool-choice and --tool-call-parser hermes flags are required for Qwen3 tool calls to parse correctly with OpenClaw's function-calling format.

  4. Install and configure OpenClaw

    Install OpenClaw globally with npm install -g openclaw@latest, then run openclaw onboard to configure your LLM backend. When prompted for the provider, select the OpenAI-compatible option and enter your vLLM server URL and API key. OpenClaw stores configuration in JSON, with the model specified in provider/model-id format (e.g. openai-compatible/Qwen/Qwen3-32B). After onboarding, verify the vLLM endpoint is reachable and start a session to confirm the assistant responds.

  5. Configure MCP tool integrations without external data leakage

    In OpenClaw's MCP settings, enable only the tools your deployment needs. For private deployments, disable any tool that sends prompts or context to an external hosted service (cloud-based RAG, Anthropic-hosted tools). Self-host your retrieval tool (Chroma or Qdrant with a local embedding model) and connect it as a local MCP server. Leave web browsing tools enabled only if the policy allows outbound requests from the agent host.

  6. Set up API key authentication on the vLLM endpoint

    Set VLLM_API_KEY environment variable to a strong random string before launching vLLM. Update OpenClaw's api_key config to match. This prevents other processes on the network from using your GPU's inference endpoint. For network isolation, bind vLLM to the internal network IP (--host <internal-ip>) rather than 0.0.0.0 in production. Only the OpenClaw controller needs HTTP access to port 8000.

FAQ / 05

Frequently Asked Questions

OpenClaw's controller node is CPU-only and lightweight. The GPU requirement is entirely for the LLM inference backend (vLLM or Ollama). A tool-calling-optimized model like Qwen3-32B at BF16 needs ~65 GB VRAM, fitting on a single H100 SXM5 (80 GB). For smaller team deployments, Qwen3-14B BF16 (~30 GB) or Llama 3.3 70B FP8 (~70 GB) runs on a single A100 80GB, though Llama 3.3 70B leaves tight KV cache headroom. Multi-model setups where OpenClaw routes to different LLMs by task type need one GPU node per active model, or MIG-partitioned H100/H200 slices.

Yes. OpenClaw uses an OpenAI-compatible API client internally. Set the base_url in OpenClaw's LLM provider config to point at your vLLM endpoint (e.g., http://<gpu-ip>:8000/v1) and supply any non-empty string as the api_key. The model name must match the model loaded in vLLM. No other code changes are needed.

Yes, if you wire it to a self-hosted LLM backend. OpenClaw communicates with the LLM over HTTP. If that HTTP endpoint is a vLLM server on your own GPU node, no prompt or context leaves your network. Tool calls that connect to external services (Slack, GitHub, email) send data to those services, but the LLM reasoning itself stays private.

With Qwen3-32B at BF16 on a single H100 SXM5, vLLM can handle 15-30 concurrent request streams at typical agentic prompt lengths. Each OpenClaw tool-call loop generates 5-20 LLM requests per user task, so one H100 supports roughly 5-10 concurrent active users at comfortable latency. For higher concurrency, run two H100s with tensor parallelism or use H200 for the larger KV cache headroom.

Running a single H100 SXM5 on Spheron at $3.98/hr on-demand costs roughly $95/day for continuous 24-hour operation, or about $70/day on spot. At GPT-4o output pricing of $10/1M tokens, break-even for a continuously running H100 is approximately 9-10M output tokens/day. For intermittent use (a few active hours per day rather than a 24-hour reservation), self-hosting becomes cost-effective at much lower volumes. The GPU also absorbs all MCP tool-result summarization calls at the same flat daily rate.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.