OpenHands consistently places at the top of the SWE-bench Verified leaderboard with open-weight models, and it's free to self-host under the MIT license. Devin charges per task on a subscription tier that stops making economic sense above a few hundred tasks per month. OpenHands on GPU cloud breaks that constraint.
If you're running an autonomous agent that writes, edits, and tests real code, you want the model backend under your control and the per-task cost predictable. This guide covers exactly that: the two-node architecture, model selection for SWE-bench-class performance, vLLM deployment on Spheron, and the security setup that makes Docker-sandboxed code execution safe in production.
Before the deployment steps: if you want Devstral 24B on GPU cloud as a standalone coding assistant (not the full autonomous agent loop), that guide covers single-GPU vLLM setup for IDE integration. If you're looking for self-hosted IDE autocomplete tools rather than an agent that autonomously completes tasks, that guide covers Continue, Aider, and Tabby. OpenHands is different from both: it's an agent loop that reads repos, edits files, runs tests, and iterates until a task passes. For the AI agent code execution sandboxes that sit underneath this kind of agent, including Firecracker and E2B, see that guide. And if you're evaluating OpenHands deployments against benchmarks, the SWE-bench evaluation infrastructure guide covers running the full harness on GPU cloud.
OpenHands in 2026
OpenHands started as OpenDevin in early 2024, was renamed, and now sits at version 1.7.0 as of May 2026. The core design is an observe-think-act loop: the agent sees a task, calls a tool (read file, run shell command, apply patch), observes the output, decides the next action, and iterates until a termination condition is met.
The runtime architecture has two components. First, the controller process: a Python server that manages the agent loop, handles the LLM abstraction via LiteLLM, and coordinates sandbox lifecycle. Second, the sandbox: a Docker container that the controller spawns for each task. The agent's code runs inside that sandbox, isolated from the controller host. Shell commands, file writes, and test runs all happen inside the sandbox container. The controller communicates with the sandbox over a socket.
Recent releases added headless mode for programmatic task submission via REST API, multi-agent support where one controller can spawn subagents for parallel subtasks, and a MicroAgent system for specialized skills. Headless mode is what makes batch processing at scale viable without a human clicking through the web UI.
On SWE-bench Verified (500 real GitHub issues), OpenHands with Claude Opus 4.6 scores 68.4%. With Devstral 24B as the LLM backend, it scores 46.8%. With Qwen3-235B-A22B MoE, estimates place it above 52%. Devin 2.0's publicly reported score is 45.8%, which open-weight OpenHands with Devstral already matches.
Architecture Overview
The production setup runs two nodes:
| Node | Hardware | Role |
|---|---|---|
| Inference node | H100 SXM5 80GB (or H200) | vLLM serves the LLM over HTTP |
| Controller node | CPU instance, 8-16 vCPU / 32 GB RAM | OpenHands app + Docker socket for sandbox management |
Communication flows: the controller sends LLM requests to http://<inference-node-ip>:8000/v1 (OpenAI-compatible). The controller also mounts the host Docker socket (/var/run/docker.sock) to spawn and manage sandbox containers. The sandbox containers run on the same host as the controller.
The two-node split is not strictly required at low scale. For development or small teams, you can run vLLM and the OpenHands controller on the same GPU instance. Separating them makes sense when you want to scale inference independently from the controller, or when you want GPU cost to scale with model load rather than with the number of agent tasks in flight.
Model Selection
Your choice of LLM backend determines VRAM requirements, benchmark performance, and per-task cost. Here are the practical options for self-hosted OpenHands:
| Model | VRAM (BF16) | SWE-bench Verified | Single GPU | Notes |
|---|---|---|---|---|
| Devstral 24B | ~50 GB | 46.8% | A100 80GB / H100 | Best coding specialist per dollar |
| Qwen3-32B | ~65 GB | Est. 45-50% | H100 80GB | Strong reasoning and coding |
| Qwen3-235B-A22B (MoE) | ~235 GB FP8 / ~470 GB BF16 | Est. 52%+ | 4x H100 / 2x H200 | Near-frontier open-weight |
| DeepSeek-V3 (MoE) | ~200 GB FP8 | ~50% | 8x H100 FP8 | Max performance open-weight |
| Claude Opus 4.6 (API) | N/A | 68.4% | None | Managed API, top OpenHands benchmark result |
For most teams starting with OpenHands, Devstral 24B on a single H100 is the practical default. You get 46.8% SWE-bench Verified at a single-GPU cost, with tool-call support that works correctly with the Mistral function calling parser. Qwen3-32B is worth the additional VRAM if your task mix goes beyond pure coding into reasoning-heavy debugging or cross-language work.
For a broader comparison of open-weight frontier models, see the open-weight frontier model showdown.
For teams that need H200 GPU rental on Spheron to run Qwen3-235B-A22B MoE, you need 2x H200 (282 GB combined HBM3e) to hold the full weight set at FP8, or 4x H100 at FP8. A single H200's 141 GB is not enough, because all expert weights must reside in VRAM even though only 22B parameters are active per forward pass. At spot pricing, 2x H200 is often cheaper per hour than 4x H100 on-demand, which changes the cost math for large MoE models significantly.
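If you go the 2x H200 route, the launch is a standard tensor-parallel vLLM invocation. The sketch below assumes the FP8 checkpoint published as Qwen/Qwen3-235B-A22B-FP8; swap in whatever quantized repo you actually use.
# Sketch: Qwen3-235B-A22B FP8 split across 2x H200 (checkpoint name is an assumption)
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser hermes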
GPU Sizing and Pricing
| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) | Models supported | Best for |
|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | $4.21 | $0.80 | Devstral BF16, Qwen3-32B | Primary inference node |
| H200 SXM5 | 141 GB | $4.54 | $1.19 | Qwen3-235B MoE, DeepSeek-V3 FP8 | Large-model inference |
| A100 80GB SXM4 | 80 GB | $1.64 | $0.45 | Devstral BF16, Qwen3-32B | Budget single-GPU |
Pricing fluctuates based on GPU availability. The prices above were captured on 09 May 2026 and may have changed since. Check current GPU pricing → for live rates.
For single-team deployments running Devstral on H100, the all-in compute cost is $4.21/hr on-demand or $0.80/hr spot. At 30-minute average task duration and one agent session, that's $2.11/task on-demand and $0.40/task on spot before any MIG concurrency gains.
Step-by-Step Deployment
Step 1: Provision the inference node
Log into app.spheron.ai and provision an H100 SXM5 80GB instance. Choose spot pricing for the inference node if your tasks are interruptible. Attach at least 200 GB persistent storage for model weights and vLLM KV cache. For the controller node, a CPU instance with 8-16 vCPU and 32 GB RAM is sufficient.
For on-demand H100 access on Spheron, select the SXM5 variant if MIG partitioning for concurrent agents is part of your plan. MIG is not available on PCIe variants.
Step 2: Install vLLM and download weights
On the inference node:
pip install 'vllm>=0.8.0' huggingface_hub hf_transfer
export HF_TOKEN=your_hf_token
export HF_HUB_ENABLE_HF_TRANSFER=1
# For Devstral 24B
huggingface-cli download mistralai/Devstral-Small-2505
# For Qwen3-32B
huggingface-cli download Qwen/Qwen3-32B
Step 3: Launch vLLM
For Devstral 24B at BF16 on H100:
vllm serve mistralai/Devstral-Small-2505 \
--dtype bfloat16 \
--max-model-len 65536 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser mistral
The --tool-call-parser mistral flag is required for Devstral. Omitting it causes malformed function call output that breaks the OpenHands agent loop silently: the model's tool calls come back as unparseable text, and you'll see the agent spinning without making progress. For Qwen3 models, use --tool-call-parser hermes instead.
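Before pointing OpenHands at the endpoint, a quick way to confirm the parser is wired up is to send one chat completion with a dummy tool attached. This is a plain OpenAI-compatible request; the get_current_time tool here is made up purely for the test.
# Smoke test: the response should contain a structured "tool_calls" array, not raw text
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "mistralai/Devstral-Small-2505",
  "messages": [{"role": "user", "content": "What time is it in UTC?"}],
  "tools": [{"type": "function", "function": {"name": "get_current_time", "description": "Get the current time", "parameters": {"type": "object", "properties": {"timezone": {"type": "string"}}}}}]
}'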
For Qwen3-32B at BF16 on H100:
vllm serve Qwen/Qwen3-32B \
--dtype bfloat16 \
--max-model-len 32768 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Do not expose port 8000 to the public internet. Use the instance's internal network IP for controller-to-inference communication.
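One way to enforce that on the inference node, assuming a standard Ubuntu image with ufw available, is to allow port 8000 only from the controller node's private IP:
# Allow vLLM traffic only from the controller node (10.0.0.12 is a placeholder for its private IP)
sudo ufw allow from 10.0.0.12 to any port 8000 proto tcp
sudo ufw deny 8000/tcp
sudo ufw enable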
Step 4: Configure and launch OpenHands
On the controller node, create config.toml:
[core]
workspace_base = "/opt/workspace_base"
[llm]
model = "openai/devstral"
base_url = "http://<inference-node-ip>:8000/v1"
api_key = "none"The openai/ prefix on the model name tells LiteLLM to use the OpenAI-compatible request format. This works regardless of the actual model, as long as your vLLM server speaks the OpenAI API.
Pull and run OpenHands:
docker pull ghcr.io/all-hands-ai/openhands:1.7.0
docker run -d \
--restart unless-stopped \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /your/workspace:/opt/workspace_base \
-v /path/to/config.toml:/app/config.toml \
-p 3000:3000 \
--name openhands-app \
ghcr.io/all-hands-ai/openhands:1.7.0
Two things to note here: first, the Docker socket mount (/var/run/docker.sock) is required. OpenHands uses it to spawn sandbox containers. Without it, the controller cannot start sandbox containers and the agent loop fails immediately. Second, the SANDBOX_RUNTIME_CONTAINER_IMAGE version must match the openhands image version. Running openhands:1.7.0 with runtime:1.6.0-nikolaik causes a container start failure with a cryptic error. Always use matching version tags.
Open the UI at http://<controller-ip>:3000. Submit a simple task ("add a docstring to function X in file Y") and watch the event log. You should see the controller spawn a sandbox container, send tool call requests to the vLLM endpoint, and iterate until the task completes.
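To watch the same thing from the host, the standard Docker commands are enough:
# Confirm a runtime (sandbox) container was spawned for the task
docker ps --filter ancestor=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik
# Follow the controller's event log
docker logs -f openhands-app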
Step 5: Run headless mode for batch tasks
OpenHands 1.7.0 supports a REST API for programmatic task submission. Use the v1 endpoint (the v0 /api/conversations path was removed April 1, 2026):
curl -X POST http://<controller-ip>:3000/api/v1/app-conversations \
-H "Content-Type: application/json" \
-d '{
"initial_message": {
"content": [
{"type": "text", "text": "Fix the failing test in tests/test_api.py"}
]
},
"selected_repository": "your-org/your-repo"
}'
The response includes app_conversation_id. Poll GET /api/v1/app-conversations/<id> until status is READY, then fetch the result. This is the interface for integrating OpenHands into CI/CD pipelines or batch processing queues.
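A minimal polling loop around that endpoint, assuming jq is installed and using the field names described above (app_conversation_id and a READY status; the exact response shape may differ slightly between releases), looks like this:
# Submit a task and capture the conversation ID
CONV_ID=$(curl -s -X POST http://<controller-ip>:3000/api/v1/app-conversations \
-H "Content-Type: application/json" \
-d '{"initial_message": {"content": [{"type": "text", "text": "Fix the failing test in tests/test_api.py"}]}, "selected_repository": "your-org/your-repo"}' \
| jq -r '.app_conversation_id')
# Poll until the conversation reports READY, then fetch the full result
until [ "$(curl -s http://<controller-ip>:3000/api/v1/app-conversations/$CONV_ID | jq -r '.status')" = "READY" ]; do
sleep 30
done
curl -s http://<controller-ip>:3000/api/v1/app-conversations/$CONV_ID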
Scaling Concurrent Agents with MIG
MIG partitioning is available on A100 SXM4, H100 SXM5, and H200 SXM5 (not H100 PCIe, not A100 PCIe, not RTX-series). It splits a single GPU into isolated slices, each with dedicated VRAM and compute. For a deep dive into MIG vs. time-slicing vs. MPS for running multiple models on one GPU, see running multiple LLMs on one GPU.
On an H100 80GB, two 3g.40gb instances split the GPU into two slices, each with 40 GB VRAM. Each slice runs one vLLM instance serving one agent session:
# Enable MIG mode
nvidia-smi -i 0 -mig 1
# Create two 3g.40gb slices
nvidia-smi mig -cgi 3g.40gb,3g.40gb -i 0
nvidia-smi mig -cci -i 0
# List MIG instance UUIDs
nvidia-smi -L
# Launch two separate vLLM processes, each on one slice
CUDA_VISIBLE_DEVICES=MIG-<uuid-0> vllm serve mistralai/Devstral-Small-2505 \
--quantization fp8 --max-model-len 32768 --port 8000 \
--enable-auto-tool-choice --tool-call-parser mistral &
CUDA_VISIBLE_DEVICES=MIG-<uuid-1> vllm serve mistralai/Devstral-Small-2505 \
--quantization fp8 --max-model-len 32768 --port 8001 \
--enable-auto-tool-choice --tool-call-parser mistral &Run two OpenHands controllers, each configured to use a different vLLM port. Two fully isolated agent sessions, one H100, no interference between sessions.
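On the controller side, that means two config files that differ only in base_url (port 8000 vs. 8001) and two docker run invocations with different host ports. A sketch, with container names and config paths chosen for illustration:
# Controller A -> vLLM on port 8000, UI on port 3000
docker run -d --restart unless-stopped --name openhands-a \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /path/to/config-a.toml:/app/config.toml \
-p 3000:3000 \
ghcr.io/all-hands-ai/openhands:1.7.0
# Controller B -> vLLM on port 8001, UI on port 3001
docker run -d --restart unless-stopped --name openhands-b \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /path/to/config-b.toml:/app/config.toml \
-p 3001:3000 \
ghcr.io/all-hands-ai/openhands:1.7.0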
MIG has one important constraint: it changes the VRAM available to each vLLM instance. A 3g.40gb slice gives 40 GB, which is not enough for Devstral 24B at BF16 (~50 GB). Use FP8 quantization (~26 GB) to fit within the slice. Qwen3-32B at BF16 needs 65 GB and doesn't fit in a single MIG slice on H100. For Qwen3-32B concurrency, use NVIDIA MPS instead:
# Start MPS daemon (no MIG required)
nvidia-cuda-mps-control -d
# All vLLM processes share the GPU through MPS
vllm serve Qwen/Qwen3-32B --dtype bfloat16 --max-model-len 32768 --port 8000 ...
MPS lets vLLM share the GPU across concurrent agent request streams without partitioning VRAM. Throughput is shared, not isolated.
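To tear MPS down later (for example, before switching back to MIG), stop the daemon explicitly:
# Stop the MPS control daemon
echo quit | nvidia-cuda-mps-control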
Cost per task with concurrent agents
The formula for per-task cost at MIG concurrency:
cost per task = (avg_task_duration_hours) × (GPU $/hr) / concurrent_agents
Example: 30-minute tasks, H100 on-demand at $4.21/hr, 2 concurrent MIG sessions:
cost per task = 0.5 × $4.21 / 2 = $1.05/task
On spot at $0.80/hr:
cost per task = 0.5 × $0.80 / 2 = $0.20/task
Security
Four things to get right before running autonomous code execution in production:
Sandbox network isolation. Set SANDBOX_NETWORK_DISABLED=true in the OpenHands config. This prevents the sandbox container from making outbound network requests. Useful for tasks that shouldn't pull external packages or exfiltrate code during execution. For tasks that genuinely need network access (installing dependencies, API calls), disable this selectively per-task rather than leaving it open globally.
Secret handling. Never put API keys, database credentials, or GitHub tokens in config.toml. Mount a secrets directory as a read-only Docker volume into the controller container and reference secrets via environment variables. The sandbox container has its own filesystem isolation, but anything mounted into the controller's config is visible to the agent's Python environment.
Repo permissions. Use fine-grained GitHub PATs scoped to the specific repository the agent is working on. Do not give OpenHands an org-wide token or a token with write access to unrelated repos. The agent will use whatever permissions the token provides.
Docker socket access. The OpenHands controller requires /var/run/docker.sock mounted into its container. This is effectively root-equivalent access on the host. Run the controller container with --security-opt no-new-privileges and keep the controller host separate from production systems. The controller node should not have access to production databases or internal services. This is exactly why the two-node split from the architecture section matters: the inference node with the expensive GPU has no Docker socket access and cannot spawn containers.
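Putting the sandbox, secrets, and socket points together, a hardened version of the controller launch from Step 4 might look like the following. The /opt/openhands-secrets path is illustrative, and how SANDBOX_NETWORK_DISABLED is supplied (env var vs. config.toml) should be checked against your OpenHands version.
docker run -d \
--restart unless-stopped \
--security-opt no-new-privileges \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik \
-e SANDBOX_NETWORK_DISABLED=true \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /your/workspace:/opt/workspace_base \
-v /path/to/config.toml:/app/config.toml \
-v /opt/openhands-secrets:/secrets:ro \
-p 3000:3000 \
--name openhands-app \
ghcr.io/all-hands-ai/openhands:1.7.0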
Cost Comparison
Assuming a 30-minute average task duration and 2 concurrent MIG sessions:
| Tasks/month | Devin (Team plan, est.) | GitHub Copilot Workspace | OpenHands on H100 on-demand | OpenHands on H100 spot |
|---|---|---|---|---|
| 100 | ~$500 (plan minimum) | ~$19-39/seat | ~$105 | ~$20 |
| 1,000 | ~$2,000-5,000 (per-task overage) | ~$190-390/seat | ~$1,050 | ~$200 |
| 10,000 | ~$20,000+ | Not designed for this volume | ~$10,500 | ~$2,000 |
Devin's team plan pricing is based on publicly reported figures from early 2026 and includes a fixed task allocation. Overage pricing varies by plan tier. Copilot Workspace pricing is per-seat and not designed for high-volume autonomous task execution. OpenHands costs are calculated from the H100 SXM5 pricing above: on-demand $4.21/hr, spot $0.80/hr, 30-minute average task, 2 concurrent MIG sessions.
At 1,000 tasks/month, self-hosted OpenHands on H100 spot costs roughly 10-25x less than Devin at scale, and 2-5x less on on-demand pricing. The crossover point where self-hosting pays off depends on your ops overhead to maintain the GPU instance and the OpenHands stack. For teams already running GPU cloud infrastructure, that overhead is near zero.
OpenHands runs well on Spheron H100 instances, where flat hourly pricing keeps cost predictable as task volume grows - unlike serverless GPU billing that compounds on long agent loops.
Rent H100 on Spheron → | View all GPU pricing → | Get started →
