Most guides covering browser-use and computer-use agents focus on what they do. Almost none cover the GPU infrastructure to run a fleet of them. That gap matters because these agents have two distinct compute requirements that text-only LLM agents don't: a VLM backbone that must be resident in VRAM before the first screenshot arrives (no cold starts), and a headless browser pool that benefits from tight colocation with the VLM to avoid screenshot transfer latency. This post covers both requirements: VLM backbone selection, VRAM and concurrency math, headless browser orchestration patterns, per-action latency budgeting, sandboxing, and a cost comparison against Anthropic's Computer Use API.
For the GPU fundamentals behind AI agent workloads in general, see GPU infrastructure for AI agents: the 2026 compute playbook.
Why Browser-Use and Computer-Use Agents Need GPU Cloud
A text-only LLM agent can run on a single A100 80GB and handle 50-100 concurrent sessions with room to spare. Add screenshots to the mix and everything changes. A 1080p screenshot sent to a vision model encodes to 512-2048 visual tokens depending on resolution tiling strategy. At 50 concurrent browser agent sessions each holding one screenshot in active context, you're carrying 25,000-100,000 additional tokens in KV cache before any text history lands. That's why browser-use agents have a VRAM profile closer to an image batch inference job than to a standard conversational LLM.
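A quick sanity check of that overhead, using the token ranges just quoted (the per-screenshot counts are model- and tiling-dependent assumptions, not measurements):

```python
# Extra KV-cache tokens carried by screenshots alone, before any text history.
sessions = 50
tokens_per_screenshot_low, tokens_per_screenshot_high = 512, 2048

print(sessions * tokens_per_screenshot_low)   # 25,600 tokens
print(sessions * tokens_per_screenshot_high)  # 102,400 tokens
```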
The other constraint is latency. Text-only agents tolerate 200-500ms generation latency. Browser agents are different: a visible pause between "agent clicked the button" and "agent reads the result" makes the agent feel broken to users. The VLM needs warm VRAM and low-latency access to the screenshot pipeline. Colocating the VLM server and the headless browser pool on the same physical node cuts screenshot transmission from a network round-trip (10-30ms) to a loopback call (sub-1ms). That colocation only works if you control your own GPU instance, not if you're hitting a shared API endpoint.
Computer-use products from vendors like Anthropic (Computer Use) and OpenAI (Operator) abstract all of this away behind API pricing. Self-hosting becomes the cost play above a certain volume threshold.
The Architecture of a Vision-Driven Browser Agent
There are four distinct layers in any production browser agent stack:
- VLM backbone - a vision encoder (ViT) plus a language model, served via vLLM. Receives a base64-encoded screenshot along with the action history, and outputs a grounded action: click(x=412, y=287), type("search query"), scroll(direction="down"), or similar.
- Screenshot pipeline - the headless browser captures the current viewport, encodes it to JPEG or PNG at a configured resolution, and sends it to the VLM inference endpoint.
- Action decoder - parses the VLM text output into concrete Playwright or Puppeteer API calls and executes them against the browser.
- Sandboxed browser pool - N isolated Chromium processes, each mapped to one agent session, with separate user data directories and optional network namespace isolation.
Here is the data flow:
Browser pool (N workers)
|
|-- [screenshot JPEG/PNG, base64]
v
VLM server (vLLM, localhost:8000)
|
|-- [action text: "click(412, 287)"]
v
Action decoder
|
|-- [Playwright call: page.click('body', position={x:412, y:287})]
v
Browser pool (same worker, next action)

Colocating the browser pool on the same node as the VLM means the screenshot transfer in step 2 is a loopback call, not a network hop. For 1080p screenshots at JPEG quality 85 (typically 100-300 KB), this cuts transfer time from 5-30ms to under 1ms.
Choosing Your Vision Backbone
Qwen2.5-VL
The Qwen2.5-VL family (7B and 72B) is the current leader for GUI grounding tasks as of April 2026. Its ScreenSpot and OSWorld benchmark scores, at both the 7B and 72B sizes, put it ahead of comparable-size alternatives at clicking the correct UI element given a screenshot plus an instruction. Key properties:
- Native high-resolution support via dynamic resolution tiling (NaViT-style), handles 1080p inputs without forced downscaling.
- 7B at BF16: ~17 GB weights. Fits on a single A100 80GB, H100 80GB, H200, or B200.
- 72B at BF16: ~144 GB weights. Exceeds the H200's ~136 GB usable HBM3e, so it does not fit at BF16 on a single H200. Use FP8 or INT8 quantization to fit on a single H200 (4 concurrent sessions), or run BF16 on a single B200 (192 GB) comfortably. Alternatively, tensor-parallel across 2x H100 80GB.
Launch with vLLM:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--served-model-name qwen2-5-vl \
--port 8000

Llama 4 Vision (Scout 17B MoE)
Llama 4 Scout uses a 109B total / 17B active MoE architecture with a 10M-token context window. That context window is the standout feature: for multi-step browser sessions where you want the full history in context without truncation, Scout handles what other models cannot.
- BF16 footprint: 109B × 2 bytes ≈ 218 GB — does not fit on a single A100 or H100 at BF16.
- To fit on a single GPU: INT4 quantization brings the footprint to ~55 GB (fits on an A100 80GB with headroom), FP8 quantization to ~110 GB (fits on an H200 141 GB with ~30 GB for KV cache).
- Best for: long-session agents with extensive action history, or workflows where reading lengthy documents in full is required.
For detailed deployment steps, see Deploy Llama 4 on GPU Cloud.
InternVL3
InternVL3 is strongest on document-heavy tasks: reading financial statements, extracting table data from screenshots, navigating multi-page forms with dense text. Its multi-scale tiling strategy handles high-resolution inputs with fewer artifacts than alternatives.
- InternVL3 8B at BF16: ~17 GB.
- InternVL3 26B at INT8: ~26 GB, fits on a single H100 80GB.
- Best for: agents that read and extract from document-style pages (PDFs, data tables, long-form text content).
VLM comparison:
| Model | Params | VRAM (BF16) | ScreenSpot score | Best use case |
|---|---|---|---|---|
| Qwen2.5-VL 7B | 7B | ~17 GB | Top in class (7B) | General GUI grounding, high concurrency |
| Qwen2.5-VL 72B | 72B | ~144 GB | Top in class (72B) | Maximum click accuracy, lower concurrency |
| Llama 4 Scout 17B MoE | 109B total / 17B active | ~218 GB (INT4: ~55 GB) | Strong | Long-session history, 10M context |
| InternVL3 8B | 8B | ~17 GB | Mid-tier | Document extraction, OCR-heavy pages |
| InternVL3 26B | 26B | ~52 GB (INT8: ~26 GB) | Strong | High-accuracy document tasks |
For full vLLM deployment steps and GPU tier mapping for all three models, see Deploy Vision Language Models on GPU Cloud.
Headless Browser Orchestration on GPU Nodes
Three patterns cover most production use cases:
1. Playwright + Python
The simplest option. async_playwright gives each agent session its own Chromium instance (or BrowserContext), captures screenshots, and sends them to the local vLLM endpoint. Here is the core screenshot-to-action loop:
import asyncio, base64, httpx
from playwright.async_api import async_playwright

VLLM_URL = "http://localhost:8000/v1/chat/completions"

async def run_agent_step(page, action_history: list[str]) -> str:
    # Capture screenshot
    screenshot_bytes = await page.screenshot(type="jpeg", quality=85)
    b64 = base64.b64encode(screenshot_bytes).decode()
    # Build multimodal request
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": f"History: {action_history}\nNext action:"},
            ],
        }
    ]
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            VLLM_URL,
            json={"model": "qwen2-5-vl", "messages": messages, "max_tokens": 64},
        )
        resp.raise_for_status()
    choices = resp.json().get("choices", [])
    if not choices:
        raise ValueError(f"Empty choices in VLM response: {resp.text}")
    return choices[0]["message"]["content"]

async def run_session(start_url: str, goal: str, max_steps: int = 100):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(start_url)
            history = [f"Goal: {goal}"]
            for _ in range(max_steps):
                action = await run_agent_step(page, history)
                # Parse and execute action (omitted for brevity)
                history.append(action)
        finally:
            await browser.close()

For 50 concurrent sessions, run run_session as 50 separate asyncio tasks pointing at the same vLLM endpoint. The VLM handles batching internally.
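The parsing step the loop above leaves out can be a small whitelist dispatch. A minimal sketch, assuming the VLM is prompted to emit only the action forms shown earlier (click, type, scroll, navigate); the function name and exact regexes are illustrative, not a fixed API:

```python
import re

async def execute_action(page, action: str) -> None:
    """Parse a VLM action string and execute it with Playwright; reject anything else."""
    action = action.strip()
    if m := re.fullmatch(r"click\(x=(\d+),\s*y=(\d+)\)", action):
        await page.mouse.click(int(m.group(1)), int(m.group(2)))
    elif m := re.fullmatch(r'type\("(.*)"\)', action, re.DOTALL):
        await page.keyboard.type(m.group(1))
    elif m := re.fullmatch(r'scroll\(direction="(up|down)"\)', action):
        await page.mouse.wheel(0, -600 if m.group(1) == "up" else 600)
    elif m := re.fullmatch(r'navigate\("(.+)"\)', action):
        await page.goto(m.group(1))
    else:
        # Never pass unrecognized VLM output to eval() or a shell.
        raise ValueError(f"Unrecognized action from VLM: {action!r}")
```

Wired into run_session, a call like this replaces the "# Parse and execute action" placeholder, with the raised ValueError either ending the session or triggering a retry prompt.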
2. Steel.dev
An open-source browser sandbox manager with session isolation, automatic cleanup, and a REST API wrapping Playwright. Good for multi-tenant browser fleets where you need clean session boundaries and per-session resource limits without managing process-level isolation yourself.
3. Browserbase
Hosted browser infrastructure where you supply the VLM and they supply the Chromium pool. Tradeoff: you pay per session but skip browser infrastructure management. Only makes economic sense at low session volumes (under a few hundred sessions per month), before the per-session cost exceeds self-hosted Playwright overhead.
All three patterns work on GPU nodes. Playwright and Steel run directly on the GPU instance. Browserbase would call out over the network, adding 10-30ms of screenshot transfer latency per action, which compounds badly over a 100-step session.
VRAM and Concurrency Math
The formula is straightforward:
Total VRAM = VLM weights + (sessions × per-session KV cache)

For Qwen2.5-VL 7B on an H200:
Weights: 17 GB (BF16)
KV cache/session: 2 GB (4K context, 1 screenshot = ~1024 visual tokens)
H200 HBM3e: 141 GB (~136 GB usable after driver/OS overhead)
Max sessions = (136 - 17) / 2 = ~59
Target 40-50 sessions for 20% headroom

Per-GPU concurrency table:
| GPU | VRAM | Qwen2.5-VL 7B sessions | Qwen2.5-VL 72B sessions |
|---|---|---|---|
| A100 80GB | 80 GB | ~28 | 0 (model does not fit) |
| H100 80GB SXM | 80 GB | ~28 | 0 |
| H200 141GB | 141 GB | ~50 | ~4 (needs FP8/INT8 to fit) |
| B200 192GB | 192 GB | ~75 | ~12 (BF16, single GPU) |
AWQ INT4 quantization cuts the 7B weight footprint to roughly 9 GB; pair it with an FP8 KV cache (vLLM's --kv-cache-dtype fp8) and the H200 can push toward 80-100 concurrent sessions. Quality impact on GUI grounding is small for most web interaction tasks (clicking buttons, filling forms), more noticeable for complex document reading.
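The same arithmetic as a small helper, handy for sizing other GPU and quantization combinations (the 2 GB per-session KV figure and the headroom factor are the assumptions carried over from above):

```python
def max_sessions(usable_vram_gb: float, weights_gb: float,
                 kv_per_session_gb: float = 2.0, headroom: float = 0.2) -> int:
    """Concurrent sessions that fit after model weights, minus a safety margin."""
    raw = (usable_vram_gb - weights_gb) / kv_per_session_gb
    return int(raw * (1 - headroom))

print(max_sessions(136, 17))      # H200, 7B BF16, BF16 KV cache -> 47
print(max_sessions(136, 9, 1.0))  # H200, 7B AWQ INT4, FP8 KV cache -> 101
```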
For large fleets, H200 instances on Spheron provide 141 GB HBM3e with dedicated VRAM per node, so there is no shared-tenant memory pressure that would push you into OOM at peak load.
Teams running 72B VLM backbones for maximum GUI grounding accuracy can use Spheron's B200 bare-metal nodes to fit the full model in a single GPU and still serve 10-15 concurrent sessions.
Latency Budget for Human-Feel Agents
Each action in a browser session has four latency components:
- Screenshot capture - Playwright page.screenshot() at JPEG quality 85, 1080p viewport: 50-150ms. Most of this is browser rendering time, not I/O.
- VLM prefill - visual encoder plus prompt tokenization for Qwen2.5-VL 7B on H200: 100-200ms for a 1024 visual-token screenshot plus 512 text tokens.
- VLM decode - a 20-token action output at H200 speeds: 30-60ms.
- Action execution - Playwright click, type, or scroll: 10-100ms depending on page JavaScript.
Total per-action latency with a warm VLM: 200-510ms.
Cold-start VLM loading from NVMe adds 15-45 seconds per node restart. Keep the VLM process warm. If you autoscale to zero, build in a health check loop that keeps at least one node warm for latency-sensitive traffic.
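A minimal watchdog along those lines, assuming the vLLM server from this post sits on localhost:8000 and exposes its standard /health liveness route:

```python
import asyncio, httpx

async def keep_vlm_warm(url: str = "http://localhost:8000/health", interval_s: int = 30):
    """Poll the VLM server so failures surface before the next session hits a cold start."""
    async with httpx.AsyncClient(timeout=5) as client:
        while True:
            try:
                (await client.get(url)).raise_for_status()
            except Exception as exc:
                print(f"vLLM health check failed: {exc}")  # hook alerting / restart logic here
            await asyncio.sleep(interval_s)
```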
For human-feel pacing, target 500-1,500ms per action intentionally. Adding a small delay between actions (250-500ms "think time") makes agents less detectable to bot protection systems and is worth building into your action executor.
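One way to build that pacing into the action executor is a jittered sleep between steps (the 250-500ms window is the range suggested above):

```python
import asyncio, random

async def think_time(min_ms: int = 250, max_ms: int = 500) -> None:
    """Human-feel pause between actions; jitter avoids a fixed, detectable cadence."""
    await asyncio.sleep(random.uniform(min_ms, max_ms) / 1000)
```

Call it once per loop iteration, after the action executes and before the next screenshot is captured.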
Safety, Sandboxing, and Rate-Limiting
Browser agents touch the open internet from your GPU nodes. The main risks are cross-session data leakage, IP range blocking, and runaway sessions consuming compute indefinitely.
Network isolation - run each browser session in a separate network namespace or route egress through a proxy pool. This prevents cross-session cookie and storage leakage, and distributes egress IPs to avoid triggering domain-level rate limits.
Filesystem isolation - use a separate user-data-dir per session with Playwright, or run each session in an ephemeral Docker container with no shared volumes. Clear session data after each run.
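A sketch of that per-session isolation with Playwright, using a throwaway user-data directory that is removed when the session ends; the optional proxy entry assumes you have a proxy pool to route egress through:

```python
import shutil, tempfile
from playwright.async_api import async_playwright

async def isolated_session(proxy_server: str | None = None):
    """One Chromium profile per session, in an ephemeral directory, optionally behind a proxy."""
    user_data_dir = tempfile.mkdtemp(prefix="agent-session-")
    async with async_playwright() as p:
        context = await p.chromium.launch_persistent_context(
            user_data_dir,
            proxy={"server": proxy_server} if proxy_server else None,
        )
        try:
            page = context.pages[0] if context.pages else await context.new_page()
            ...  # run the agent loop against `page`
        finally:
            await context.close()
            shutil.rmtree(user_data_dir, ignore_errors=True)  # clear session data
```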
Action whitelisting - define an allowed-action schema before deploying. Reject any VLM output that does not parse to a valid action type (click, type, scroll, navigate). Do not pass raw VLM text to an eval() or shell interpreter.
Request rate limiting - implement per-domain rate limits in the action executor. A naive browser agent will hammer a single domain at full speed and get your node's IP range blocked within minutes.
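A minimal per-domain limiter the action executor can await before any navigation or click that triggers a request (the one-second minimum interval is a placeholder; tune it per target domain):

```python
import asyncio, time
from collections import defaultdict
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self.last_hit: dict[str, float] = defaultdict(float)
        self.locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        async with self.locks[domain]:
            elapsed = time.monotonic() - self.last_hit[domain]
            if elapsed < self.min_interval_s:
                await asyncio.sleep(self.min_interval_s - elapsed)
            self.last_hit[domain] = time.monotonic()
```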
Session timeouts - set a maximum step count (100-200 actions) and a wall-clock timeout (5-10 minutes) per session. Terminate and log any session that exceeds either limit.
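The wall-clock side of that limit can wrap the existing run_session with asyncio.wait_for; the step count is already bounded by max_steps:

```python
import asyncio

async def run_session_bounded(start_url: str, goal: str, wall_clock_s: int = 600) -> None:
    """Terminate any session that exceeds the wall-clock budget (10 minutes here)."""
    try:
        await asyncio.wait_for(run_session(start_url, goal, max_steps=150), timeout=wall_clock_s)
    except asyncio.TimeoutError:
        print(f"Session for {start_url} hit the {wall_clock_s}s wall-clock limit")
```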
Audit logging - record every screenshot, VLM input and output, and action taken. Store to object storage (S3 or equivalent), not on the GPU node's local disk. Local disk fills fast under a 50-session workload where each session generates 100 screenshots at 100-300 KB each.
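A sketch of the per-step audit write, assuming S3 via boto3; the bucket name and key layout are placeholders:

```python
import json, time
import boto3

s3 = boto3.client("s3")
BUCKET = "agent-audit-logs"  # placeholder bucket name

def log_step(session_id: str, step: int, screenshot_jpeg: bytes, vlm_output: str) -> None:
    """Persist the screenshot and the VLM action for one step to object storage."""
    prefix = f"{session_id}/{step:04d}"
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}/screen.jpg", Body=screenshot_jpeg)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{prefix}/action.json",
        Body=json.dumps({"ts": time.time(), "action": vlm_output}).encode(),
    )
```

In production you would push these writes to a background task or queue so they do not add to the per-action latency budget.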
Cost Comparison: Anthropic Computer Use vs. Self-Hosted on Spheron
The math below uses Claude 3.7 Sonnet API pricing for the Computer Use API side ($3/M input tokens for images, $15/M output tokens).
Per-session cost calculation (100-step session, 1 screenshot per step):
| Metric | Anthropic Computer Use (Claude 3.7) | Self-hosted 7B (H200 spot) | Self-hosted 72B (H200 spot) |
|---|---|---|---|
| Image tokens per session | 150,000 (1,500/screenshot x 100) | N/A | N/A |
| Output tokens per session | ~5,000 (50 tokens/step x 100) | N/A | N/A |
| GPU cost per session | None | ~$0.005 (50 sessions, $1.19/hr) | ~$0.06 (4 sessions, $1.19/hr) |
| API cost per session | ~$0.525 ($0.45 images + $0.075 output) | None | None |
| Cost per session | ~$0.53 | ~$0.005 | ~$0.06 |
| Sessions per dollar | ~1.9 | ~200 | ~17 |
| Setup complexity | None | Moderate | Moderate |
| Latency control | None | Full | Full |
| Model choice | Claude only | Any open VLM | Any open VLM |
Pricing fluctuates with GPU availability. The prices above reflect rates as of 29 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Break-even for self-hosting on H200 at $1.19/hr (spot): the monthly GPU cost is roughly $857. At $0.53 per Computer Use API session, you cover that with about 1,617 sessions per month, roughly 54 per day.
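The same break-even math as a two-line check, with the figures above as defaults (swap in current GPU and API rates):

```python
def breakeven_sessions_per_month(gpu_usd_per_hr: float = 1.19,
                                 api_usd_per_session: float = 0.53) -> float:
    """Monthly session count at which a dedicated GPU beats per-session API pricing."""
    return (gpu_usd_per_hr * 24 * 30) / api_usd_per_session

print(breakeven_sessions_per_month())  # ~1617 sessions/month, roughly 54/day
```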
For current H200 and B200 rates on Spheron, see GPU pricing.
Note that Anthropic's computer-use pricing is based on the published Claude 3.7 Sonnet rate as of April 2026. If Anthropic has updated pricing after publication, the crossover point shifts accordingly. Check the Anthropic pricing page for the current rate before using the numbers above in a business case.
Reference Deployment
A complete single-node setup for 50 concurrent browser-use agent sessions on H200:
Step 1: Provision the instance
Go to app.spheron.ai, select H200 (141 GB HBM3e), choose the Ubuntu 22.04 + CUDA 12.3 template, and deploy. SSH access is available within 60-90 seconds of provisioning.
Step 2: Install dependencies
pip install "vllm>=0.6.0" playwright httpx
playwright install chromium

Step 3: Launch the vLLM server
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--served-model-name qwen2-5-vl \
--max-num-seqs 60 \
--port 8000

Set --max-num-seqs to match your target concurrent session count plus 20% headroom. The vLLM server handles batching across all concurrent agent requests internally.
Step 4: Run the agent workers
import asyncio
from your_agent_module import run_session

async def main():
    tasks = [
        run_session(
            start_url="https://example.com",
            goal="Extract product prices from the catalog",
            max_steps=100,
        )
        for _ in range(50)  # 50 concurrent sessions
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            print(f"Session {i} failed: {result}")

asyncio.run(main())

All 50 run_session coroutines share the same vLLM endpoint at http://localhost:8000. The VLM batches their concurrent screenshot-to-action requests efficiently.
Step 5: Monitor memory
nvidia-smi dmon -s mu -d 5

Watch MEM utilization. Stay below 90% to avoid KV cache saturation. If you hit 95%+, reduce --max-num-seqs or apply AWQ INT4 quantization to the model weights.
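If you would rather alert from code than watch a terminal, a small monitoring sketch using the NVML Python bindings (install with pip install nvidia-ml-py; the 90% threshold mirrors the guidance above):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

while True:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    used_pct = 100 * mem.used / mem.total
    if used_pct > 90:
        print(f"WARNING: GPU memory at {used_pct:.1f}% - reduce --max-num-seqs or quantize")
    time.sleep(5)
```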
For scaling this pattern to 100+ concurrent agent workers across multiple GPU nodes, see Scale AI Agent Fleets on GPU Cloud with MCP Orchestration.
Browser-use and computer-use agents are GPU workloads first. The VLM backbone needs dedicated VRAM, low-latency NVMe storage for fast model loading, and zero noisy neighbors to hit sub-500ms per-action budgets. Spheron provides on-demand H200 and B200 bare-metal nodes with dedicated VRAM, spot pricing for offline scraping fleets, and per-minute billing so you only pay for active sessions.
