Every major lab published SWE-bench and GAIA numbers in 2026. Behind every number is a repeatable infrastructure: Docker sandboxes, parallel rollout workers, judge models, and structured results pipelines. Building that infrastructure once and running it reliably is the actual hard problem. This guide covers the full stack: benchmark selection, GPU sizing, Ray-based parallel dispatch, cost breakdowns with live pricing, reproducibility controls, and CI integration. For the broader compute context for agent systems, see GPU Infrastructure for AI Agents: The 2026 Compute Playbook first.
Why Agent Benchmarks Became the 2026 Evaluation Standard
The shift from task-specific evals to agent benchmarks happened for a concrete reason: BLEU, ROUGE, and even held-out test sets don't measure whether an agent actually completes real tasks. SWE-bench Verified resolve rate became the de facto code agent leaderboard metric because it's grounded in real GitHub issues with deterministic pass/fail from unit test suites. No rubric subjectivity, no cherry-picking.
GAIA filled the gap for general assistants: 466 tasks requiring multi-step web browsing, file parsing, and tool use, scored by human graders against ground-truth answers. Level 1-3 difficulty tiers let you see exactly where an agent starts to fall apart.
The reason academic eval on a laptop doesn't scale to 500+ tasks is purely infrastructure. Running 500 SWE-bench tasks serially with a 70B model takes ~250 hours (500 tasks × 30 min/task). With 20 H200s and Ray for parallel dispatch, 10 parallel workers (2 GPUs each) bring that down to roughly 25 hours. The benchmark isn't different; the compute architecture is.
The Benchmark Landscape: What Each Suite Measures
| Benchmark | Task Type | Task Count (standard) | Primary Metric | Modality |
|---|---|---|---|---|
| SWE-bench Verified | GitHub issue resolution | 500 | % resolved | Code editing |
| GAIA | Multi-step web + file tasks | 466 | Accuracy by level | Web/file/tool use |
| Terminal-Bench Core | Linux CLI task completion | ~89 | Task completion % | CLI |
| OSWorld | Desktop GUI control | 369 | Task success rate | GUI/vision |
| BrowseComp | Complex web retrieval | 1,266 | Accuracy | Web browsing |
| tau-bench | Tool-augmented reasoning | ~165 | Pass@1 | Tool use |
SWE-bench Verified (500 tasks): The agent receives a GitHub issue description and a snapshot of the repository. It runs inside a Docker container with the repo checked out and must produce a patch that passes the issue's associated test suite. The sandbox is deterministic: no internet access, no external state. Pass/fail comes from running pytest on the patch. No LLM judge required.
GAIA (466 tasks): Tasks are diverse: "find the 2023 GDP of the country whose capital is X, then convert it to Z currency at the rate from date Y." The agent gets access to web tools, file parsers, and calculators. Answers are short strings scored against human-annotated ground truth, with a fallback LLM judge for paraphrase matching.
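The two-stage scoring flow is simple enough to sketch. Below is a minimal, illustrative version (not the official GAIA grader); normalize is a hypothetical helper and llm_judge stands in for something like a Qwen2.5 72B endpoint served via vLLM:

```python
import re

def normalize(answer: str) -> str:
    # Lowercase and strip punctuation so "1,234 USD" and "1234 usd" compare equal
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def score_gaia(prediction: str, ground_truth: str, llm_judge=None) -> bool:
    # Stage 1: normalized exact match handles most short-string answers
    if normalize(prediction) == normalize(ground_truth):
        return True
    # Stage 2: fall back to an LLM judge for paraphrase variants
    if llm_judge is not None:
        verdict = llm_judge(
            f"Reference answer: {ground_truth}\n"
            f"Candidate answer: {prediction}\n"
            "Do these express the same answer? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")
    return False
```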
Terminal-Bench Core (~89 tasks, v2.0): A Linux shell agent receives a natural language task and must complete it using bash commands. Scoring is deterministic: the benchmark checks exit codes, file diffs, or output strings. No vision model needed. Fast to run, roughly 10-12 minutes for ~89 tasks at moderate parallelism.
OSWorld (369 tasks): A GUI agent receives a screenshot of a desktop OS (Ubuntu, Windows, macOS via QEMU VM) and a natural language instruction. It must control the desktop via mouse/keyboard actions, and each step produces a new screenshot. Scoring requires a VLM to interpret the final-state screenshot. This is the most infrastructure-intensive benchmark: the QEMU/KVM VMs need hardware virtualization exposed on your compute nodes, so verify KVM support before provisioning. Spheron instances typically support nested virtualization, but confirm in the dashboard before starting an OSWorld run.
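A quick KVM check before provisioning saves a wasted OSWorld run. A minimal sketch for a Linux instance (kvm_available is an illustrative helper, not part of any benchmark harness):

```python
import pathlib

def kvm_available() -> bool:
    # /dev/kvm only exists when hardware virtualization is exposed to this instance
    if not pathlib.Path("/dev/kvm").exists():
        return False
    # The CPU must also advertise Intel VT-x (vmx) or AMD-V (svm)
    cpuinfo = pathlib.Path("/proc/cpuinfo").read_text()
    return "vmx" in cpuinfo or "svm" in cpuinfo

if __name__ == "__main__":
    print("KVM available" if kvm_available() else
          "No KVM: OSWorld VMs would fall back to slow software emulation")
```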
Harness Architecture: How a Benchmark Run Works
A benchmark run has three layers: rollout workers, a judge layer, and result aggregation.
Rollout Workers
Each worker pulls a task from a queue, spins up a Docker sandbox with the benchmark harness, invokes the agent, and writes the raw output (patch file, answer string, or action trace) to a shared volume. The generic structure:
from typing import Callable

def run_task(task_id: str, benchmark: str, agent_fn: Callable, sandbox_image: str) -> dict:
    # Spin up the pinned sandbox image for this task (docker here is an illustrative wrapper)
    container = docker.run(sandbox_image, task_id=task_id)
    # Let the agent produce its raw output: a patch, answer string, or action trace
    raw_output = agent_fn(container.task_prompt, container.env)
    container.write_output(raw_output)
    # Score inside the sandbox (unit tests, diffs, exit codes)
    result = container.score()
    return {"task_id": task_id, "benchmark": benchmark, "result": result}

Workers are stateless. Each run is independent, which means a spot-instance preemption just drops that task back into the queue for retry.
The Judge Layer
SWE-bench needs no LLM judge: unit tests are the ground truth. GAIA, Terminal-Bench, and OSWorld all need some form of judge:
- GAIA: string normalization + exact match for most tasks, Qwen2.5 72B via vLLM as fallback for paraphrase variants
- Terminal-Bench: exit codes and output diffs cover most tasks, LLM judge for ~15% of tasks that require semantic output validation
- OSWorld: Claude Sonnet 4.6 or GPT-4o-mini reads the final screenshot and answers whether the task was completed
Result Aggregation
Workers write per-task results as JSON to a shared volume (or S3-compatible object store). After all tasks complete, run the official scoring scripts from each benchmark's GitHub repo. Don't reimplement scoring: the official scripts handle edge cases in normalization and partial credit that matter for comparing to published numbers.
Results JSON schema (per task):
{
"task_id": "astropy__astropy-14309",
"benchmark": "swebench_verified",
"model": "deepseek-r2-70b",
"patch_path": "results/patches/astropy__astropy-14309.patch",
"resolved": true,
"wall_time_seconds": 847,
"tokens_used": 12400,
"run_id": "run_20260424_001"
}
GPU Requirements Per Benchmark
| Benchmark | Agent GPU | Judge GPU | Notes |
|---|---|---|---|
| SWE-bench Verified (self-hosted 70B) | 8x H200 (4-way parallel) | None | Tests are unit tests, no LLM judge needed for pass/fail |
| SWE-bench Verified (API agent) | 0 (API) | None | API cost replaces GPU cost for agent; unit tests handle pass/fail |
| GAIA (70B agent) | 4x H200 | 2x H100 SXM5 GPUs (Qwen2.5 72B judge) | Level 3 tasks require multi-step tool use, 3-5 min per task |
| Terminal-Bench Core | 2x H100 | None (deterministic pass/fail) | Stateless, fast - ~89 tasks in ~10-12 min |
| OSWorld | 0-4x GPU for VLM grader | 2x GPU for vision judge | CPU-heavy for QEMU VMs; 1 GPU per 4 VMs for screenshot grading |
Agent benchmark workloads are embarrassingly parallel. Each task is independent: no shared state, no communication between workers. This makes them ideal candidates for spot instances. A spot preemption drops one task, not the run. Checkpoint the results JSON after every 50 tasks, and a restart picks up where it left off with minimal overhead.
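A minimal checkpoint/resume sketch along those lines (the checkpoint path and the helper names are illustrative, not part of any official harness):

```python
import json
import os

CHECKPOINT = "results/completed_tasks.json"  # illustrative checkpoint path

def load_completed() -> set:
    # After a spot preemption, restart by skipping everything already scored
    if os.path.exists(CHECKPOINT):
        return set(json.load(open(CHECKPOINT)))
    return set()

def maybe_checkpoint(completed: set, every: int = 50) -> None:
    # Flush the completed-task list to disk every `every` finished tasks
    if completed and len(completed) % every == 0:
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(completed), f)

def resume_pending(all_tasks: list) -> list:
    # Filter the task list down to what still needs to run
    done = load_completed()
    return [t for t in all_tasks if t["id"] not in done]
```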
Spot pricing on Spheron runs roughly 75-80% below on-demand for H200 and H100 SXM5. For a 2-hour run, that gap is meaningful. Provision rollout workers on spot; keep the judge model server on a single on-demand instance to avoid preemption during a judging batch.
Parallel Evaluation with Ray and asyncio
Ray Setup for 500-Task SWE-bench
# On head node (Spheron instance)
ray start --head --port=6379 --dashboard-host=0.0.0.0
# On each worker node
ray start --address=<HEAD_IP>:6379

import ray

@ray.remote(num_cpus=4, memory=8 * 1024**3)
def run_swebench_task(task_id: str, model_endpoint: str, sandbox_image: str) -> dict:
    # agent invocation inside Docker sandbox
    pass

ray.init(address="auto")
tasks = load_swebench_verified()  # 500 tasks
futures = [run_swebench_task.remote(t["id"], MODEL_URL, SANDBOX_IMAGE) for t in tasks]

# Collect results per-task so one failure doesn't discard all others.
results = []
for f in futures:
    try:
        results.append(ray.get(f))
    except Exception as e:
        results.append({"error": str(e), "resolved": False})

Ray distributes tasks across registered worker nodes automatically. The num_cpus=4 and memory=8 GB requested per task slot determine how many tasks each worker node runs in parallel based on its available resources. Monitor progress via the Ray dashboard on port 8265.
With 20 Spheron H200 instances and a 30-minute average task time, the 4-CPU / 8 GB resource request above yields roughly 40 parallel task slots when the agent itself isn't occupying the GPUs: 500 tasks / 40 slots ≈ 12.5 batches × 30 min ≈ 6 hours. For a self-hosted 70B model requiring 2 GPUs per worker, the same 20 H200s give 10 workers running in parallel, completing 500 tasks in roughly 25 hours at 30 min/task. For API-based agents, where each task is a sequence of API calls, parallelism is limited by rate limits rather than GPU count.
asyncio for API-Based Agents
import asyncio
import httpx

async def run_task_async(task: dict, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        # agent API call
        pass

async def main():
    semaphore = asyncio.Semaphore(50)  # 50 concurrent API requests
    tasks = load_swebench_verified()
    raw = await asyncio.gather(*[run_task_async(t, semaphore) for t in tasks], return_exceptions=True)
    # Map exceptions to error records so failed tasks don't shrink the denominator.
    results = [r if not isinstance(r, Exception) else {"error": str(r), "resolved": False} for r in raw]

if __name__ == "__main__":
    asyncio.run(main())

For Claude Sonnet 4.6 or similar API agents, the semaphore controls concurrency against rate limits. 50 concurrent requests at 30-second average task latency means roughly 100 tasks/minute throughput, completing 500 tasks in about 5 minutes of wall clock time (excluding API latency tails).
Cost Breakdown: A Single SWE-bench Verified Run
Scenario: 500 Tasks, Three Agent Options
Live pricing fetched 26 Apr 2026:
- H200 SXM5 spot: $1.19/GPU/hr, on-demand: $5.58/GPU/hr
- H100 SXM5 spot: $0.80/GPU/hr, on-demand: $4.41/GPU/hr
| Agent Model | Infrastructure | Est. GPU-Hours | GPU Cost (spot) | API Cost | Total |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 (API) | None (unit tests, no judge needed) | 0 GPU-hrs | $0 | ~$60-80 | ~$60-80 |
| GPT-4o equivalent (API) | None (unit tests, no judge needed) | 0 GPU-hrs | $0 | ~$50-70 | ~$50-70 |
| DeepSeek R2 70B (self-hosted) | 20x H200 spot, ~25 hrs | ~500 GPU-hrs | ~$595 | $0 | ~$595 |
Pricing fluctuates based on GPU availability. The prices above are based on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For API-based agents, you're paying $60-80 per run with no GPU infrastructure to manage. Three runs per week costs $180-240 in API fees. Three self-hosted runs on the same H200 spot cluster cost roughly $1,785 (3 × $595 at spot pricing), which is significantly higher at that frequency. Self-hosted starts making economic sense when you run evaluations daily against a persistent serving cluster shared across multiple workloads, or when data privacy requirements rule out sending code patches to external APIs.
Spot pricing makes the self-hosted path even more compelling. Benchmark runs are batch-shaped: they can be interrupted and resumed. A checkpoint file after every 50 tasks means a spot preemption only loses at most 50 tasks of progress. At the cost difference between on-demand and spot ($5.58 vs $1.19/hr), that's close to an 80% saving on a workload that tolerates preemption well. For a deeper look at spot economics and failure patterns in practice, see Spot GPU Training: Real Cost Savings and Failure Patterns.
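The arithmetic behind that comparison fits in a few lines; a small helper like the sketch below (rates hard-coded from the 26 Apr 2026 snapshot, run_cost is illustrative) makes it easy to re-run when pricing changes:

```python
# Spot / on-demand H200 rates from the 26 Apr 2026 snapshot above, in $/GPU/hr
H200_SPOT, H200_ONDEMAND = 1.19, 5.58

def run_cost(gpus: int = 20, hours: float = 25.0, rate: float = H200_SPOT) -> float:
    # GPU cost of one self-hosted 500-task SWE-bench Verified run
    return gpus * hours * rate

print(run_cost())                      # ~$595 on spot
print(run_cost(rate=H200_ONDEMAND))    # ~$2,790 on-demand for the identical run
print(3 * run_cost())                  # ~$1,785 for three spot runs per week
```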
Judge Model Selection and Bias Mitigation
Which Judge for Which Benchmark
| Benchmark | Ground Truth | LLM Judge Used | Notes |
|---|---|---|---|
| SWE-bench | Unit test suite | None | Pass/fail is deterministic |
| GAIA | Human annotation | Qwen2.5 72B or Llama 3.3 70B | String match + LLM fallback |
| Terminal-Bench | Shell exit code + output diff | None (mostly) | ~15% of tasks use LLM for output correctness |
| OSWorld | Screenshot grader | Claude Sonnet 4.6 or GPT-4o-mini | VLM reads final state screenshot |
Three Bias Patterns to Avoid
Self-preference bias: A model consistently scores its own outputs higher than equivalent outputs from other model families. The fix is straightforward: never use the candidate model as the judge. If you're evaluating DeepSeek R2, use Qwen2.5 72B or Llama 3.3 70B as judge. Keep the model families distinct.
Position bias: In pairwise evaluations, judges favor whichever response appears first in the prompt. Run every pairwise comparison twice with the candidate order swapped. If the judge picks whichever response sits in the first slot both times (option A in run 1, the swapped-in option B in run 2), it's following position rather than content. Discard those inconsistent pairs as ties and report the swap-corrected win rate.
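A minimal sketch of the swap-corrected win rate, assuming a hypothetical judge_fn(first, second) that returns which slot the judge preferred:

```python
from typing import Callable, List, Tuple

def swap_corrected_win_rate(pairs: List[Tuple[str, str]],
                            judge_fn: Callable[[str, str], str]) -> float:
    # judge_fn(first, second) returns "first" or "second" for the preferred response
    wins = ties = 0
    for a, b in pairs:
        run1 = judge_fn(a, b)   # candidate A in the first slot
        run2 = judge_fn(b, a)   # order swapped
        if run1 == "first" and run2 == "second":
            wins += 1           # judge preferred A in both orders
        elif run1 == "second" and run2 == "first":
            pass                # consistent loss for A
        else:
            ties += 1           # judge followed position, not content: discard
    scored = len(pairs) - ties
    return wins / scored if scored else 0.0
```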
Length bias: Judges reward longer responses even when shorter ones are more correct. Mitigate with an explicit rubric instruction: "Score for correctness and directness. Do not reward unnecessary elaboration." For GAIA and Terminal-Bench where answers are short strings, this bias rarely appears, but it matters for open-ended scoring tasks.
For a complete treatment of judge deployment, vLLM configuration, and CI integration for judge pipelines, see LLM-as-Judge Evaluation Pipelines on GPU Cloud.
Reproducibility Checklist
Reproducibility means: given the same model checkpoint, same harness version, and same tasks, you get the same score. In practice, six things break this:
- Container pinning - reference Docker images by digest (`sha256:...`), not tag. Tags are mutable; a re-push of `swebench/swe-bench:latest` can silently change your sandbox environment between runs.
- Seed control - set `PYTHONHASHSEED=42`, `temperature=0`, `top_p=1.0` in all agent calls. Non-deterministic sampling produces different patches on re-runs, which changes the resolved count.
- Model weight hash - record `md5sum` of local checkpoint files in `run_manifest.json`. A checkpoint that gets updated mid-evaluation run will produce mixed results that are impossible to attribute.
- Harness version - pin benchmark library versions in `requirements.txt` (e.g., `swebench==2.1.0`). Upstream changes to task prompts or evaluation scripts change the score without changing your model.
- Score auditing - always run official scoring scripts from benchmark repos, not custom reimplementations. Small differences in string normalization change GAIA accuracy by 1-3 percentage points.
- Run manifest - store `{model, harness_version, image_digest, seed, timestamp, gpu_type}` alongside every results file.
Example run_manifest.json:
{
"run_id": "run_20260424_001",
"benchmark": "swebench_verified",
"model": "deepseek-r2-70b",
"model_checkpoint_md5": "a3f2e1c9d8b7...",
"sandbox_image_digest": "sha256:1a2b3c4d...",
"harness_version": "swebench==2.1.0",
"ray_version": "2.10.0",
"seed": 42,
"temperature": 0.0,
"timestamp": "2026-04-24T02:00:00Z",
"gpu_type": "H200_SXM5",
"gpu_count": 20
}
Building a Continuous Benchmark Pipeline
GitHub Actions Trigger
name: Agent Benchmark
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * 1"  # Every Monday at 2am UTC
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Provision Spheron cluster
        id: provision
        run: |
          RESPONSE=$(curl -s -X POST https://api.spheron.ai/v1/instances \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}" \
            -d '{"gpu": "H200", "count": 20, "pricing": "spot"}')
          INSTANCE_ID=$(echo "$RESPONSE" | jq -r '.id')
          if [ -z "$INSTANCE_ID" ] || [ "$INSTANCE_ID" = "null" ]; then echo "Provision failed" && exit 1; fi
          echo "instance_id=$INSTANCE_ID" >> $GITHUB_OUTPUT
      - name: Run SWE-bench
        run: python run_benchmark.py --benchmark swebench --tasks 500 --cluster-id ${{ steps.provision.outputs.instance_id }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results/
      - name: Deprovision Spheron cluster
        if: always()
        run: |
          curl -s -X DELETE https://api.spheron.ai/v1/instances/${{ steps.provision.outputs.instance_id }} \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}"

The weekly trigger gives you a time series of benchmark scores across model checkpoints. Gate main-branch merges on not dropping below a minimum resolved rate (e.g., block merges that regress SWE-bench by more than 2 percentage points). For Spheron API provisioning details, see the Spheron documentation.
Post-Training Iteration Loop
The benchmark pipeline feeds directly into the next training run. The loop looks like: train a checkpoint, trigger the benchmark run automatically, score results, compare against the baseline checkpoint, decide whether to continue training or adjust the data mix.
This loop is where the observability data pays off. If your SWE-bench score drops, you need to know whether it dropped on Python tasks, Go tasks, or all tasks equally. If it dropped specifically on file-manipulation tasks, that's a signal about your tool-call training data. Observability traces make that analysis possible. For orchestration patterns when running evaluation as part of a larger agent pipeline, see Scale AI Agent Fleets on GPU Cloud: MCP Orchestration and Autoscaling Guide.
Integrating Langfuse and Arize Phoenix for Per-Task Observability
What to Trace
Three things to capture per task:
- The full agent trace: every prompt, every tool call, every intermediate step, final answer
- Token counts and latency per step (to find where tasks are slow or expensive)
- Task score annotated on the trace after scoring completes (to correlate behavior with outcome)
Langfuse Setup
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def run_agent_on_task(task: dict) -> str:
    # agent execution
    pass

Tag every trace with task_id, benchmark_name, model_name, and run_id so you can filter and compare across runs in the Langfuse UI.
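One way to attach those tags, assuming the Langfuse v2 decorator SDK's langfuse_context helper (MODEL_NAME and RUN_ID are illustrative constants):

```python
from langfuse.decorators import observe, langfuse_context

MODEL_NAME = "deepseek-r2-70b"      # illustrative values
RUN_ID = "run_20260424_001"

@observe()
def run_agent_on_task(task: dict) -> str:
    # Attach identifying metadata to the current trace so runs are filterable in the UI
    langfuse_context.update_current_trace(
        name=f"{task['benchmark']}:{task['id']}",
        metadata={"task_id": task["id"], "benchmark_name": task["benchmark"],
                  "model_name": MODEL_NAME, "run_id": RUN_ID},
        tags=[task["benchmark"], MODEL_NAME, RUN_ID],
    )
    # ... agent execution ...
    return ""
```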
Linking Traces to Scores
langfuse.score(
    trace_id=trace.id,
    name="swebench_resolved",
    value=1.0 if resolved else 0.0,
    comment=f"Task {task['id']} - patch applied successfully"
)

Once you have scores annotated on traces, the analysis becomes concrete. Sort tasks by score and look at the top 20 failures: are they long context tasks? Tasks requiring specific tools? Tasks where the agent's first patch attempt was wrong and it ran out of tokens on retry? That pattern tells you exactly what to fix in the next training run, which tool call prompts need work, or where the context window needs extending.
For tasks that fail consistently across model versions, look at the average token count and step count. SWE-bench tasks that fail on every model tend to require 15,000+ tokens of context and 10+ tool calls; those are the tasks that expose context window limits, not model capability.
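That kind of cross-run failure analysis falls straight out of the per-task results schema. A sketch, assuming each run writes its per-task JSON files under a results/run_*/ directory (an illustrative layout):

```python
import glob
import json
from collections import defaultdict

# Collect per-task results from several runs and surface tasks that never resolve
by_task = defaultdict(list)
for path in glob.glob("results/run_*/*.json"):   # illustrative layout
    record = json.load(open(path))
    by_task[record["task_id"]].append(record)

always_failing = {tid: runs for tid, runs in by_task.items()
                  if not any(r.get("resolved") for r in runs)}

for tid, runs in sorted(always_failing.items()):
    avg_tokens = sum(r.get("tokens_used", 0) for r in runs) / len(runs)
    avg_seconds = sum(r.get("wall_time_seconds", 0) for r in runs) / len(runs)
    print(f"{tid}: avg {avg_tokens:.0f} tokens, {avg_seconds:.0f}s wall time")
```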
Agent benchmarking burns GPU hours fast, but the workload is perfectly shaped for spot instances: embarrassingly parallel, preemption-tolerant, and bursty. A full 500-task SWE-bench Verified run on 20 spot H200s with a self-hosted 70B model takes roughly 25 hours at a fraction of reserved-cluster cost.
