Most guides covering browser-use and computer-use agents focus on what they do. Almost none cover the GPU infrastructure to run a fleet of them. That gap matters because these agents have two distinct compute requirements that text-only LLM agents don't: a VLM backbone that must be resident in VRAM before the first screenshot arrives (no cold starts), and a headless browser pool that benefits from tight colocation with the VLM to avoid screenshot transfer latency. This post covers both requirements: VLM backbone selection, VRAM and concurrency math, headless browser orchestration patterns, per-action latency budgeting, sandboxing, and a cost comparison against Anthropic's Computer Use API.
For the GPU fundamentals behind AI agent workloads in general, see GPU infrastructure for AI agents: the 2026 compute playbook.
Why Browser-Use and Computer-Use Agents Need GPU Cloud
A text-only LLM agent can run on a single A100 80GB and handle 50-100 concurrent sessions with room to spare. Add screenshots to the mix and everything changes. A 1080p screenshot sent to a vision model encodes to 512-2048 visual tokens depending on resolution tiling strategy. At 50 concurrent browser agent sessions each holding one screenshot in active context, you're carrying 25,000-100,000 additional tokens in KV cache before any text history lands. That's why browser-use agents have a VRAM profile closer to an image batch inference job than to a standard conversational LLM.
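A quick sanity check of that overhead, using the token ranges just quoted (the per-screenshot counts are model- and tiling-dependent assumptions, not measurements):

```python
# Extra KV-cache tokens carried by screenshots alone, before any text history.
sessions = 50
tokens_per_screenshot_low, tokens_per_screenshot_high = 512, 2048

print(sessions * tokens_per_screenshot_low)   # 25,600 tokens
print(sessions * tokens_per_screenshot_high)  # 102,400 tokens
```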
The other constraint is latency. Text-only agents tolerate 200-500ms generation latency. Browser agents are different: a visible pause between "agent clicked the button" and "agent reads the result" makes the agent feel broken to users. The VLM needs warm VRAM and low-latency access to the screenshot pipeline. Colocating the VLM server and the headless browser pool on the same physical node cuts screenshot transmission from a network round-trip (10-30ms) to a loopback call (sub-1ms). That colocation only works if you control your own GPU instance, not if you're hitting a shared API endpoint.
Computer-use products from vendors like Anthropic (Computer Use) and OpenAI (Operator) abstract all of this away behind API pricing. Self-hosting becomes the cost play above a certain volume threshold.
The Architecture of a Vision-Driven Browser Agent
There are four distinct layers in any production browser agent stack:
- VLM backbone - a vision encoder (ViT) plus a language model, served via vLLM. Receives a base64-encoded screenshot along with the action history, and outputs a grounded action: click(x=412, y=287), type("search query"), scroll(direction="down"), or similar.
- Screenshot pipeline - the headless browser captures the current viewport, encodes it to JPEG or PNG at a configured resolution, and sends it to the VLM inference endpoint.
- Action decoder - parses the VLM text output into concrete Playwright or Puppeteer API calls and executes them against the browser.
- Sandboxed browser pool - N isolated Chromium processes, each mapped to one agent session, with separate user data directories and optional network namespace isolation.
Here is the data flow:
Browser pool (N workers)
|
|-- [screenshot JPEG/PNG, base64]
v
VLM server (vLLM, localhost:8000)
|
|-- [action text: "click(412, 287)"]
v
Action decoder
|
|-- [Playwright call: page.click('body', position={x:412, y:287})]
v
Browser pool (same worker, next action)

Colocating the browser pool on the same node as the VLM means the screenshot transfer in step 2 is a loopback call, not a network hop. For 1080p screenshots at JPEG quality 85 (typically 100-300 KB), this cuts transfer time from 5-30ms to under 1ms.
Choosing Your Vision Backbone
Qwen2.5-VL
The Qwen2.5-VL family (7B and 72B) is the current leader for GUI grounding tasks as of April 2026. Its ScreenSpot and OSWorld benchmark scores, at both the 7B and 72B sizes, put it ahead of comparable-size alternatives at clicking the correct UI element given a screenshot plus an instruction. Key properties:
- Native high-resolution support via dynamic resolution tiling (NaViT-style), handles 1080p inputs without forced downscaling.
- 7B at BF16: ~17 GB weights. Fits on a single A100 80GB, H100 80GB, H200, or B200.
- 72B at BF16: ~144 GB weights. Exceeds the H200's ~136 GB usable HBM3e, so it does not fit at BF16 on a single H200. Use FP8 or INT8 quantization to fit on a single H200 (4 concurrent sessions), or run BF16 on a single B200 (192 GB) comfortably. Alternatively, tensor-parallel across 2x H100 80GB.
Launch with vLLM:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--served-model-name qwen2-5-vl \
--port 8000

Llama 4 Vision (Scout 17B MoE)
Llama 4 Scout uses a 109B total / 17B active MoE architecture with a 10M-token context window. That context window is the standout feature: for multi-step browser sessions where you want the full history in context without truncation, Scout handles what other models cannot.
- BF16 footprint: 109B × 2 bytes ≈ 218 GB — does not fit on a single A100 or H100 at BF16.
- To fit on a single GPU: INT4 quantization brings the footprint to ~55 GB (fits on an A100 80GB with headroom), FP8 quantization to ~110 GB (fits on an H200 141 GB with ~30 GB for KV cache).
- Best for: long-session agents with extensive action history, or workflows where reading lengthy documents in full is required.
For detailed deployment steps, see Deploy Llama 4 on GPU Cloud.
InternVL3
InternVL3 is strongest on document-heavy tasks: reading financial statements, extracting table data from screenshots, navigating multi-page forms with dense text. Its multi-scale tiling strategy handles high-resolution inputs with fewer artifacts than alternatives.
- InternVL3 8B at BF16: ~17 GB.
- InternVL3 26B at INT8: ~26 GB, fits on a single H100 80GB.
- Best for: agents that read and extract from document-style pages (PDFs, data tables, long-form text content).
VLM comparison:
| Model | Params | VRAM (BF16) | ScreenSpot score | Best use case |
|---|---|---|---|---|
| Qwen2.5-VL 7B | 7B | ~17 GB | Top in class (7B) | General GUI grounding, high concurrency |
| Qwen2.5-VL 72B | 72B | ~144 GB | Top in class (72B) | Maximum click accuracy, lower concurrency |
| Llama 4 Scout 17B MoE | 109B total / 17B active | ~218 GB (INT4: ~55 GB) | Strong | Long-session history, 10M context |
| InternVL3 8B | 8B | ~17 GB | Mid-tier | Document extraction, OCR-heavy pages |
| InternVL3 26B | 26B | ~52 GB (INT8: ~26 GB) | Strong | High-accuracy document tasks |
For full vLLM deployment steps and GPU tier mapping for all three models, see Deploy Vision Language Models on GPU Cloud.
Headless Browser Orchestration on GPU Nodes
Three patterns cover most production use cases:
1. Playwright + Python
The simplest option. async_playwright gives each agent session its own Chromium instance (or BrowserContext), captures screenshots, and sends them to the local vLLM endpoint. Here is the core screenshot-to-action loop:
import asyncio, base64, httpx
from playwright.async_api import async_playwright

VLLM_URL = "http://localhost:8000/v1/chat/completions"

async def run_agent_step(page, action_history: list[str]) -> str:
    # Capture screenshot
    screenshot_bytes = await page.screenshot(type="jpeg", quality=85)
    b64 = base64.b64encode(screenshot_bytes).decode()
    # Build multimodal request
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": f"History: {action_history}\nNext action:"},
            ],
        }
    ]
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            VLLM_URL,
            json={"model": "qwen2-5-vl", "messages": messages, "max_tokens": 64},
        )
        resp.raise_for_status()
    choices = resp.json().get("choices", [])
    if not choices:
        raise ValueError(f"Empty choices in VLM response: {resp.text}")
    return choices[0]["message"]["content"]

async def run_session(start_url: str, goal: str, max_steps: int = 100):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(start_url)
            history = [f"Goal: {goal}"]
            for _ in range(max_steps):
                action = await run_agent_step(page, history)
                # Parse and execute action (omitted for brevity)
                history.append(action)
        finally:
            await browser.close()

For 50 concurrent sessions, run run_session as 50 separate asyncio tasks pointing at the same vLLM endpoint. The VLM handles batching internally.
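The parsing step the loop above leaves out can be a small whitelist dispatch. A minimal sketch, assuming the VLM is prompted to emit only the action forms shown earlier (click, type, scroll, navigate); the function name and exact regexes are illustrative, not a fixed API:

```python
import re

async def execute_action(page, action: str) -> None:
    """Parse a VLM action string and execute it with Playwright; reject anything else."""
    action = action.strip()
    if m := re.fullmatch(r"click\(x=(\d+),\s*y=(\d+)\)", action):
        await page.mouse.click(int(m.group(1)), int(m.group(2)))
    elif m := re.fullmatch(r'type\("(.*)"\)', action, re.DOTALL):
        await page.keyboard.type(m.group(1))
    elif m := re.fullmatch(r'scroll\(direction="(up|down)"\)', action):
        await page.mouse.wheel(0, -600 if m.group(1) == "up" else 600)
    elif m := re.fullmatch(r'navigate\("(.+)"\)', action):
        await page.goto(m.group(1))
    else:
        # Never pass unrecognized VLM output to eval() or a shell.
        raise ValueError(f"Unrecognized action from VLM: {action!r}")
```

Wired into run_session, a call like this replaces the "# Parse and execute action" placeholder, with the raised ValueError either ending the session or triggering a retry prompt.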
2. Steel.dev
An open-source browser sandbox manager with session isolation, automatic cleanup, and a REST API wrapping Playwright. Good for multi-tenant browser fleets where you need clean session boundaries and per-session resource limits without managing process-level isolation yourself.
3. Browserbase
Hosted browser infrastructure where you supply the VLM and they supply the Chromium pool. Tradeoff: you pay per session but skip browser infrastructure management. Only makes economic sense at low session volumes (under a few hundred sessions per month), before the per-session cost exceeds self-hosted Playwright overhead.
All three patterns work on GPU nodes. Playwright and Steel run directly on the GPU instance. Browserbase would call out over the network, adding 10-30ms of screenshot transfer latency per action, which compounds badly over a 100-step session.
VRAM and Concurrency Math
The formula is straightforward:
Total VRAM = VLM weights + (sessions × per-session KV cache)

For Qwen2.5-VL 7B on an H200:
Weights: 17 GB (BF16)
KV cache/session: 2 GB (4K context, 1 screenshot = ~1024 visual tokens)
H200 HBM3e: 141 GB (~136 GB usable after driver/OS overhead)
Max sessions = (136 - 17) / 2 = ~59
Target 40-50 sessions for 20% headroom

Per-GPU concurrency table:
| GPU | VRAM | Qwen2.5-VL 7B sessions | Qwen2.5-VL 72B sessions |
|---|---|---|---|
| A100 80GB | 80 GB | ~28 | 0 (model does not fit) |
| H100 80GB SXM | 80 GB | ~28 | 0 |
| H200 141GB | 141 GB | ~50 | ~4 (needs FP8/INT8 to fit) |
| B200 192GB | 192 GB | ~75 | ~12 (BF16, single GPU) |
AWQ INT4 quantization cuts the 7B weight footprint to roughly 9 GB; pair it with an FP8 KV cache (vLLM's --kv-cache-dtype fp8) and the H200 can push toward 80-100 concurrent sessions. Quality impact on GUI grounding is small for most web interaction tasks (clicking buttons, filling forms), more noticeable for complex document reading.
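The same arithmetic as a small helper, handy for sizing other GPU and quantization combinations (the 2 GB per-session KV figure and the headroom factor are the assumptions carried over from above):

```python
def max_sessions(usable_vram_gb: float, weights_gb: float,
                 kv_per_session_gb: float = 2.0, headroom: float = 0.2) -> int:
    """Concurrent sessions that fit after model weights, minus a safety margin."""
    raw = (usable_vram_gb - weights_gb) / kv_per_session_gb
    return int(raw * (1 - headroom))

print(max_sessions(136, 17))      # H200, 7B BF16, BF16 KV cache -> 47
print(max_sessions(136, 9, 1.0))  # H200, 7B AWQ INT4, FP8 KV cache -> 101
```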
For large fleets, H200 instances on Spheron provide 141 GB HBM3e with dedicated VRAM per node, so there is no shared-tenant memory pressure that would push you into OOM at peak load.
Teams running 72B VLM backbones for maximum GUI grounding accuracy can use Spheron's B200 bare-metal nodes to fit the full model in a single GPU and still serve 10-15 concurrent sessions.
Latency Budget for Human-Feel Agents
Each action in a browser session has four latency components:
- Screenshot capture - Playwright page.screenshot() at JPEG quality 85, 1080p viewport: 50-150ms. Most of this is browser rendering time, not I/O.
- VLM prefill - visual encoder plus prompt tokenization for Qwen2.5-VL 7B on H200: 100-200ms for a 1024 visual-token screenshot plus 512 text tokens.
- VLM decode - a 20-token action output at H200 speeds: 30-60ms.
- Action execution - Playwright click, type, or scroll: 10-100ms depending on page JavaScript.
Total per-action latency with a warm VLM: 200-510ms.
Cold-start VLM loading from NVMe adds 15-45 seconds per node restart. Keep the VLM process warm. If you autoscale to zero, build in a health check loop that keeps at least one node warm for latency-sensitive traffic.
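A minimal watchdog along those lines, assuming the vLLM server from this post sits on localhost:8000 and exposes its standard /health liveness route:

```python
import asyncio, httpx

async def keep_vlm_warm(url: str = "http://localhost:8000/health", interval_s: int = 30):
    """Poll the VLM server so failures surface before the next session hits a cold start."""
    async with httpx.AsyncClient(timeout=5) as client:
        while True:
            try:
                (await client.get(url)).raise_for_status()
            except Exception as exc:
                print(f"vLLM health check failed: {exc}")  # hook alerting / restart logic here
            await asyncio.sleep(interval_s)
```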
For human-feel pacing, target 500-1,500ms per action intentionally. Adding a small delay between actions (250-500ms "think time") makes agents less detectable to bot protection systems and is worth building into your action executor.
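One way to build that pacing into the action executor is a jittered sleep between steps (the 250-500ms window is the range suggested above):

```python
import asyncio, random

async def think_time(min_ms: int = 250, max_ms: int = 500) -> None:
    """Human-feel pause between actions; jitter avoids a fixed, detectable cadence."""
    await asyncio.sleep(random.uniform(min_ms, max_ms) / 1000)
```

Call it once per loop iteration, after the action executes and before the next screenshot is captured.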
Safety, Sandboxing, and Rate-Limiting
Browser agents touch the open internet from your GPU nodes. The main risks are cross-session data leakage, IP range blocking, and runaway sessions consuming compute indefinitely.
Network isolation - run each browser session in a separate network namespace or route egress through a proxy pool. This prevents cross-session cookie and storage leakage, and distributes egress IPs to avoid triggering domain-level rate limits.
Filesystem isolation - use a separate user-data-dir per session with Playwright, or run each session in an ephemeral Docker container with no shared volumes. Clear session data after each run.
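A sketch of that per-session isolation with Playwright, using a throwaway user-data directory that is removed when the session ends; the optional proxy entry assumes you have a proxy pool to route egress through:

```python
import shutil, tempfile
from playwright.async_api import async_playwright

async def isolated_session(proxy_server: str | None = None):
    """One Chromium profile per session, in an ephemeral directory, optionally behind a proxy."""
    user_data_dir = tempfile.mkdtemp(prefix="agent-session-")
    async with async_playwright() as p:
        context = await p.chromium.launch_persistent_context(
            user_data_dir,
            proxy={"server": proxy_server} if proxy_server else None,
        )
        try:
            page = context.pages[0] if context.pages else await context.new_page()
            ...  # run the agent loop against `page`
        finally:
            await context.close()
            shutil.rmtree(user_data_dir, ignore_errors=True)  # clear session data
```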
Action whitelisting - define an allowed-action schema before deploying. Reject any VLM output that does not parse to a valid action type (click, type, scroll, navigate). Do not pass raw VLM text to an eval() or shell interpreter.
Request rate limiting - implement per-domain rate limits in the action executor. A naive browser agent will hammer a single domain at full speed and get your node's IP range blocked within minutes.
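A minimal per-domain limiter the action executor can await before any navigation or click that triggers a request (the one-second minimum interval is a placeholder; tune it per target domain):

```python
import asyncio, time
from collections import defaultdict
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self.last_hit: dict[str, float] = defaultdict(float)
        self.locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        async with self.locks[domain]:
            elapsed = time.monotonic() - self.last_hit[domain]
            if elapsed < self.min_interval_s:
                await asyncio.sleep(self.min_interval_s - elapsed)
            self.last_hit[domain] = time.monotonic()
```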
Session timeouts - set a maximum step count (100-200 actions) and a wall-clock timeout (5-10 minutes) per session. Terminate and log any session that exceeds either limit.
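The wall-clock side of that limit can wrap the existing run_session with asyncio.wait_for; the step count is already bounded by max_steps:

```python
import asyncio

async def run_session_bounded(start_url: str, goal: str, wall_clock_s: int = 600) -> None:
    """Terminate any session that exceeds the wall-clock budget (10 minutes here)."""
    try:
        await asyncio.wait_for(run_session(start_url, goal, max_steps=150), timeout=wall_clock_s)
    except asyncio.TimeoutError:
        print(f"Session for {start_url} hit the {wall_clock_s}s wall-clock limit")
```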
Audit logging - record every screenshot, VLM input and output, and action taken. Store to object storage (S3 or equivalent), not on the GPU node's local disk. Local disk fills fast under a 50-session workload where each session generates 100 screenshots at 100-300 KB each.
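A sketch of the per-step audit write, assuming S3 via boto3; the bucket name and key layout are placeholders:

```python
import json, time
import boto3

s3 = boto3.client("s3")
BUCKET = "agent-audit-logs"  # placeholder bucket name

def log_step(session_id: str, step: int, screenshot_jpeg: bytes, vlm_output: str) -> None:
    """Persist the screenshot and the VLM action for one step to object storage."""
    prefix = f"{session_id}/{step:04d}"
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}/screen.jpg", Body=screenshot_jpeg)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{prefix}/action.json",
        Body=json.dumps({"ts": time.time(), "action": vlm_output}).encode(),
    )
```

In production you would push these writes to a background task or queue so they do not add to the per-action latency budget.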
Cost Comparison: Anthropic Computer Use vs. Self-Hosted on Spheron
The math below uses Claude 3.7 Sonnet API pricing for the Computer Use API side ($3/M input tokens for images, $15/M output tokens).
Per-session cost calculation (100-step session, 1 screenshot per step):
| Metric | Anthropic Computer Use (Claude 3.7) | Self-hosted 7B (H200 spot) | Self-hosted 72B (H200 spot) |
|---|---|---|---|
| Image tokens per session | 150,000 (1,500/screenshot x 100) | N/A | N/A |
| Output tokens per session | ~5,000 (50 tokens/step x 100) | N/A | N/A |
| GPU cost per session | None | ~$0.005 (50 sessions, $1.19/hr) | ~$0.06 (4 sessions, $1.19/hr) |
| API cost per session | ~$0.525 ($0.45 images + $0.075 output) | None | None |
| Cost per session | ~$0.53 | ~$0.005 | ~$0.06 |
| Sessions per dollar | ~1.9 | ~200 | ~17 |
| Setup complexity | None | Moderate | Moderate |
| Latency control | None | Full | Full |
| Model choice | Claude only | Any open VLM | Any open VLM |
Pricing fluctuates with GPU availability. The prices above reflect rates as of 29 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Break-even for self-hosting on H200 at $1.19/hr (spot): the monthly GPU cost is roughly $857. At $0.53 per Computer Use API session, you cover that with about 1,617 sessions per month, roughly 54 per day.
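The same break-even math as a two-line check, with the figures above as defaults (swap in current GPU and API rates):

```python
def breakeven_sessions_per_month(gpu_usd_per_hr: float = 1.19,
                                 api_usd_per_session: float = 0.53) -> float:
    """Monthly session count at which a dedicated GPU beats per-session API pricing."""
    return (gpu_usd_per_hr * 24 * 30) / api_usd_per_session

print(breakeven_sessions_per_month())  # ~1617 sessions/month, roughly 54/day
```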
For current H200 and B200 rates on Spheron, see GPU pricing.
Note that Anthropic's computer-use pricing is based on the published Claude 3.7 Sonnet rate as of April 2026. If Anthropic has updated pricing after publication, the crossover point shifts accordingly. Check the Anthropic pricing page for the current rate before using the numbers above in a business case.
Reference Deployment
A complete single-node setup for 50 concurrent browser-use agent sessions on H200:
Step 1: Provision the instance
Go to app.spheron.ai, select H200 (141 GB HBM3e), choose the Ubuntu 22.04 + CUDA 12.3 template, and deploy. SSH access is available within 60-90 seconds of provisioning.
Step 2: Install dependencies
pip install "vllm>=0.6.0" playwright httpx
playwright install chromium

Step 3: Launch the vLLM server
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--served-model-name qwen2-5-vl \
--max-num-seqs 60 \
--port 8000

Set --max-num-seqs to match your target concurrent session count plus 20% headroom. The vLLM server handles batching across all concurrent agent requests internally.
Step 4: Run the agent workers
import asyncio
from your_agent_module import run_session

async def main():
    tasks = [
        run_session(
            start_url="https://example.com",
            goal="Extract product prices from the catalog",
            max_steps=100,
        )
        for _ in range(50)  # 50 concurrent sessions
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            print(f"Session {i} failed: {result}")

asyncio.run(main())

All 50 run_session coroutines share the same vLLM endpoint at http://localhost:8000. The VLM batches their concurrent screenshot-to-action requests efficiently.
Step 5: Monitor memory
nvidia-smi dmon -s mu -d 5

Watch MEM utilization. Stay below 90% to avoid KV cache saturation. If you hit 95%+, reduce --max-num-seqs or apply AWQ INT4 quantization to the model weights.
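If you would rather alert from code than watch a terminal, a small monitoring sketch using the NVML Python bindings (install with pip install nvidia-ml-py; the 90% threshold mirrors the guidance above):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

while True:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    used_pct = 100 * mem.used / mem.total
    if used_pct > 90:
        print(f"WARNING: GPU memory at {used_pct:.1f}% - reduce --max-num-seqs or quantize")
    time.sleep(5)
```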
For scaling this pattern to 100+ concurrent agent workers across multiple GPU nodes, see Scale AI Agent Fleets on GPU Cloud with MCP Orchestration.
Browser-use and computer-use agents are GPU workloads first. The VLM backbone needs dedicated VRAM, low-latency NVMe storage for fast model loading, and zero noisy neighbors to hit sub-500ms per-action budgets. Spheron provides on-demand H200 and B200 bare-metal nodes with dedicated VRAM, spot pricing for offline scraping fleets, and per-minute billing so you only pay for active sessions.
