Every major lab published SWE-bench and GAIA numbers in 2026. Behind every number is a repeatable infrastructure: Docker sandboxes, parallel rollout workers, judge models, and structured results pipelines. Building that infrastructure once and running it reliably is the actual hard problem. This guide covers the full stack: benchmark selection, GPU sizing, Ray-based parallel dispatch, cost breakdowns with live pricing, reproducibility controls, and CI integration. For the broader compute context for agent systems, see GPU Infrastructure for AI Agents: The 2026 Compute Playbook first.
Why Agent Benchmarks Became the 2026 Evaluation Standard
The shift from task-specific evals to agent benchmarks happened for a concrete reason: BLEU, ROUGE, and even held-out test sets don't measure whether an agent actually completes real tasks. SWE-bench Verified resolve rate became the de facto code agent leaderboard metric because it's grounded in real GitHub issues with deterministic pass/fail from unit test suites. No rubric subjectivity, no cherry-picking.
GAIA filled the gap for general assistants: 466 tasks requiring multi-step web browsing, file parsing, and tool use, scored by human graders against ground-truth answers. Level 1-3 difficulty tiers let you see exactly where an agent starts to fall apart.
The reason academic eval on a laptop doesn't scale to 500+ tasks is purely infrastructure. Running 500 SWE-bench tasks serially with a 70B model takes ~250 hours (500 tasks × 30 min/task). With 20 H200s and Ray for parallel dispatch, 10 parallel workers (2 GPUs each) bring that down to roughly 25 hours. The benchmark isn't different; the compute architecture is.
The Benchmark Landscape: What Each Suite Measures
| Benchmark | Task Type | Task Count (standard) | Primary Metric | Modality |
|---|---|---|---|---|
| SWE-bench Verified | GitHub issue resolution | 500 | % resolved | Code editing |
| GAIA | Multi-step web + file tasks | 466 | Accuracy by level | Web/file/tool use |
| Terminal-Bench Core | Linux CLI task completion | ~89 | Task completion % | CLI |
| OSWorld | Desktop GUI control | 369 | Task success rate | GUI/vision |
| BrowseComp | Complex web retrieval | 1,266 | Accuracy | Web browsing |
| tau-bench | Tool-augmented reasoning | ~165 | Pass@1 | Tool use |
SWE-bench Verified (500 tasks): The agent receives a GitHub issue description and a snapshot of the repository. It runs inside a Docker container with the repo checked out and must produce a patch that passes the issue's associated test suite. The sandbox is deterministic: no internet access, no external state. Pass/fail comes from running pytest on the patch. No LLM judge required.
GAIA (466 tasks): Tasks are diverse: "find the 2023 GDP of the country whose capital is X, then convert it to Z currency at the rate from date Y." The agent gets access to web tools, file parsers, and calculators. Answers are short strings scored against human-annotated ground truth, with a fallback LLM judge for paraphrase matching.
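The two-stage scoring flow is simple enough to sketch. Below is a minimal, illustrative version (not the official GAIA grader); normalize is a hypothetical helper and llm_judge stands in for something like a Qwen2.5 72B endpoint served via vLLM:

```python
import re

def normalize(answer: str) -> str:
    # Lowercase and strip punctuation so "1,234 USD" and "1234 usd" compare equal
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def score_gaia(prediction: str, ground_truth: str, llm_judge=None) -> bool:
    # Stage 1: normalized exact match handles most short-string answers
    if normalize(prediction) == normalize(ground_truth):
        return True
    # Stage 2: fall back to an LLM judge for paraphrase variants
    if llm_judge is not None:
        verdict = llm_judge(
            f"Reference answer: {ground_truth}\n"
            f"Candidate answer: {prediction}\n"
            "Do these express the same answer? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")
    return False
```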
Terminal-Bench Core (~89 tasks, v2.0): A Linux shell agent receives a natural language task and must complete it using bash commands. Scoring is deterministic: the benchmark checks exit codes, file diffs, or output strings. No vision model needed. Fast to run, roughly 10-12 minutes for ~89 tasks at moderate parallelism.
OSWorld (369 tasks): A GUI agent receives a screenshot of a desktop OS (Ubuntu, Windows, macOS via QEMU VM) and a natural language instruction. It must control the desktop via mouse/keyboard actions, and each step produces a new screenshot. Scoring requires a VLM to interpret the final-state screenshot. This is the most infrastructure-intensive benchmark: the QEMU/KVM VMs need hardware virtualization exposed on your compute nodes, so verify KVM support before provisioning. Spheron instances typically support nested virtualization, but confirm in the dashboard before starting an OSWorld run.
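A quick KVM check before provisioning saves a wasted OSWorld run. A minimal sketch for a Linux instance (kvm_available is an illustrative helper, not part of any benchmark harness):

```python
import pathlib

def kvm_available() -> bool:
    # /dev/kvm only exists when hardware virtualization is exposed to this instance
    if not pathlib.Path("/dev/kvm").exists():
        return False
    # The CPU must also advertise Intel VT-x (vmx) or AMD-V (svm)
    cpuinfo = pathlib.Path("/proc/cpuinfo").read_text()
    return "vmx" in cpuinfo or "svm" in cpuinfo

if __name__ == "__main__":
    print("KVM available" if kvm_available() else
          "No KVM: OSWorld VMs would fall back to slow software emulation")
```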
Harness Architecture: How a Benchmark Run Works
A benchmark run has three layers: rollout workers, a judge layer, and result aggregation.
Rollout Workers
Each worker pulls a task from a queue, spins up a Docker sandbox with the benchmark harness, invokes the agent, and writes the raw output (patch file, answer string, or action trace) to a shared volume. The generic structure:
from typing import Callable

def run_task(task_id: str, benchmark: str, agent_fn: Callable, sandbox_image: str) -> dict:
    # Spin up the pinned sandbox image for this task (docker here is an illustrative wrapper)
    container = docker.run(sandbox_image, task_id=task_id)
    # Let the agent produce its raw output: a patch, answer string, or action trace
    raw_output = agent_fn(container.task_prompt, container.env)
    container.write_output(raw_output)
    # Score inside the sandbox (unit tests, diffs, exit codes)
    result = container.score()
    return {"task_id": task_id, "benchmark": benchmark, "result": result}

Workers are stateless. Each run is independent, which means a spot-instance preemption just drops that task back into the queue for retry.
The Judge Layer
SWE-bench needs no LLM judge: unit tests are the ground truth. GAIA, Terminal-Bench, and OSWorld all need some form of judge:
- GAIA: string normalization + exact match for most tasks, Qwen2.5 72B via vLLM as fallback for paraphrase variants
- Terminal-Bench: exit codes and output diffs cover most tasks, LLM judge for ~15% of tasks that require semantic output validation
- OSWorld: Claude Sonnet 4.6 or GPT-4o-mini reads the final screenshot and answers whether the task was completed
Result Aggregation
Workers write per-task results as JSON to a shared volume (or S3-compatible object store). After all tasks complete, run the official scoring scripts from each benchmark's GitHub repo. Don't reimplement scoring: the official scripts handle edge cases in normalization and partial credit that matter for comparing to published numbers.
Results JSON schema (per task):
{
"task_id": "astropy__astropy-14309",
"benchmark": "swebench_verified",
"model": "deepseek-r2-70b",
"patch_path": "results/patches/astropy__astropy-14309.patch",
"resolved": true,
"wall_time_seconds": 847,
"tokens_used": 12400,
"run_id": "run_20260424_001"
}
GPU Requirements Per Benchmark
| Benchmark | Agent GPU | Judge GPU | Notes |
|---|---|---|---|
| SWE-bench Verified (self-hosted 70B) | 8x H200 (4-way parallel) | None | Tests are unit tests, no LLM judge needed for pass/fail |
| SWE-bench Verified (API agent) | 0 (API) | None | API cost replaces GPU cost for agent; unit tests handle pass/fail |
| GAIA (70B agent) | 4x H200 | 2x H100 SXM5 GPUs (Qwen2.5 72B judge) | Level 3 tasks require multi-step tool use, 3-5 min per task |
| Terminal-Bench Core | 2x H100 | None (deterministic pass/fail) | Stateless, fast - ~89 tasks in ~10-12 min |
| OSWorld | 0-4x GPU for VLM grader | 2x GPU for vision judge | CPU-heavy for QEMU VMs; 1 GPU per 4 VMs for screenshot grading |
Agent benchmark workloads are embarrassingly parallel. Each task is independent: no shared state, no communication between workers. This makes them ideal candidates for spot instances. A spot preemption drops one task, not the run. Checkpoint the results JSON after every 50 tasks, and a restart picks up where it left off with minimal overhead.
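A minimal checkpoint/resume sketch along those lines (the checkpoint path and the helper names are illustrative, not part of any official harness):

```python
import json
import os

CHECKPOINT = "results/completed_tasks.json"  # illustrative checkpoint path

def load_completed() -> set:
    # After a spot preemption, restart by skipping everything already scored
    if os.path.exists(CHECKPOINT):
        return set(json.load(open(CHECKPOINT)))
    return set()

def maybe_checkpoint(completed: set, every: int = 50) -> None:
    # Flush the completed-task list to disk every `every` finished tasks
    if completed and len(completed) % every == 0:
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(completed), f)

def resume_pending(all_tasks: list) -> list:
    # Filter the task list down to what still needs to run
    done = load_completed()
    return [t for t in all_tasks if t["id"] not in done]
```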
Spot pricing on Spheron runs roughly 75-80% below on-demand for H200 and H100 SXM5. For a 2-hour run, that gap is meaningful. Provision rollout workers on spot; keep the judge model server on a single on-demand instance to avoid preemption during a judging batch.
Parallel Evaluation with Ray and asyncio
Ray Setup for 500-Task SWE-bench
# On head node (Spheron instance)
ray start --head --port=6379 --dashboard-host=0.0.0.0
# On each worker node
ray start --address=<HEAD_IP>:6379

import ray

@ray.remote(num_cpus=4, memory=8 * 1024**3)
def run_swebench_task(task_id: str, model_endpoint: str, sandbox_image: str) -> dict:
    # agent invocation inside Docker sandbox
    pass

ray.init(address="auto")
tasks = load_swebench_verified()  # 500 tasks
futures = [run_swebench_task.remote(t["id"], MODEL_URL, SANDBOX_IMAGE) for t in tasks]

# Collect results per-task so one failure doesn't discard all others.
results = []
for f in futures:
    try:
        results.append(ray.get(f))
    except Exception as e:
        results.append({"error": str(e), "resolved": False})

Ray distributes tasks across registered worker nodes automatically. The num_cpus=4 and memory=8 GB requested per task slot determine how many tasks each worker node runs in parallel based on its available resources. Monitor progress via the Ray dashboard on port 8265.
With 20 Spheron H200 instances and a 30-minute average task time, the 4-CPU / 8 GB resource request above yields roughly 40 parallel task slots when the agent itself isn't occupying the GPUs: 500 tasks / 40 slots ≈ 12.5 batches × 30 min ≈ 6 hours. For a self-hosted 70B model requiring 2 GPUs per worker, the same 20 H200s give 10 workers running in parallel, completing 500 tasks in roughly 25 hours at 30 min/task. For API-based agents, where each task is a sequence of API calls, parallelism is limited by rate limits rather than GPU count.
asyncio for API-Based Agents
import asyncio
import httpx

async def run_task_async(task: dict, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        # agent API call
        pass

async def main():
    semaphore = asyncio.Semaphore(50)  # 50 concurrent API requests
    tasks = load_swebench_verified()
    raw = await asyncio.gather(*[run_task_async(t, semaphore) for t in tasks], return_exceptions=True)
    # Map exceptions to error records so failed tasks don't shrink the denominator.
    results = [r if not isinstance(r, Exception) else {"error": str(r), "resolved": False} for r in raw]

if __name__ == "__main__":
    asyncio.run(main())

For Claude Sonnet 4.6 or similar API agents, the semaphore controls concurrency against rate limits. 50 concurrent requests at 30-second average task latency means roughly 100 tasks/minute throughput, completing 500 tasks in about 5 minutes of wall clock time (excluding API latency tails).
Cost Breakdown: A Single SWE-bench Verified Run
Scenario: 500 Tasks, Three Agent Options
Live pricing fetched 26 Apr 2026:
- H200 SXM5 spot: $1.19/GPU/hr, on-demand: $5.58/GPU/hr
- H100 SXM5 spot: $0.80/GPU/hr, on-demand: $4.41/GPU/hr
| Agent Model | Infrastructure | Est. GPU-Hours | GPU Cost (spot) | API Cost | Total |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 (API) | None (unit tests, no judge needed) | 0 GPU-hrs | $0 | ~$60-80 | ~$60-80 |
| GPT-4o equivalent (API) | None (unit tests, no judge needed) | 0 GPU-hrs | $0 | ~$50-70 | ~$50-70 |
| DeepSeek R2 70B (self-hosted) | 20x H200 spot, ~25 hrs | ~500 GPU-hrs | ~$595 | $0 | ~$595 |
Pricing fluctuates based on GPU availability. The prices above are based on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For API-based agents, you're paying $60-80 per run with no GPU infrastructure to manage. Three runs per week costs $180-240 in API fees. Three self-hosted runs on the same H200 spot cluster cost roughly $1,785 (3 × $595 at spot pricing), which is significantly higher at that frequency. Self-hosted starts making economic sense when you run evaluations daily against a persistent serving cluster shared across multiple workloads, or when data privacy requirements rule out sending code patches to external APIs.
Spot pricing makes the self-hosted path even more compelling. Benchmark runs are batch-shaped: they can be interrupted and resumed. A checkpoint file after every 50 tasks means a spot preemption only loses at most 50 tasks of progress. At the cost difference between on-demand and spot ($5.58 vs $1.19/hr), that's close to an 80% saving on a workload that tolerates preemption well. For a deeper look at spot economics and failure patterns in practice, see Spot GPU Training: Real Cost Savings and Failure Patterns.
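The arithmetic behind that comparison fits in a few lines; a small helper like the sketch below (rates hard-coded from the 26 Apr 2026 snapshot, run_cost is illustrative) makes it easy to re-run when pricing changes:

```python
# Spot / on-demand H200 rates from the 26 Apr 2026 snapshot above, in $/GPU/hr
H200_SPOT, H200_ONDEMAND = 1.19, 5.58

def run_cost(gpus: int = 20, hours: float = 25.0, rate: float = H200_SPOT) -> float:
    # GPU cost of one self-hosted 500-task SWE-bench Verified run
    return gpus * hours * rate

print(run_cost())                      # ~$595 on spot
print(run_cost(rate=H200_ONDEMAND))    # ~$2,790 on-demand for the identical run
print(3 * run_cost())                  # ~$1,785 for three spot runs per week
```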
Judge Model Selection and Bias Mitigation
Which Judge for Which Benchmark
| Benchmark | Ground Truth | LLM Judge Used | Notes |
|---|---|---|---|
| SWE-bench | Unit test suite | None | Pass/fail is deterministic |
| GAIA | Human annotation | Qwen2.5 72B or Llama 3.3 70B | String match + LLM fallback |
| Terminal-Bench | Shell exit code + output diff | None (mostly) | ~15% of tasks use LLM for output correctness |
| OSWorld | Screenshot grader | Claude Sonnet 4.6 or GPT-4o-mini | VLM reads final state screenshot |
Three Bias Patterns to Avoid
Self-preference bias: A model consistently scores its own outputs higher than equivalent outputs from other model families. The fix is straightforward: never use the candidate model as the judge. If you're evaluating DeepSeek R2, use Qwen2.5 72B or Llama 3.3 70B as judge. Keep the model families distinct.
Position bias: In pairwise evaluations, judges favor whichever response appears first in the prompt. Run every pairwise comparison twice with the candidate order swapped. If the judge picks whichever response sits in the first slot both times (option A in run 1, the swapped-in option B in run 2), it's following position rather than content. Discard those inconsistent pairs as ties and report the swap-corrected win rate.
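A minimal sketch of the swap-corrected win rate, assuming a hypothetical judge_fn(first, second) that returns which slot the judge preferred:

```python
from typing import Callable, List, Tuple

def swap_corrected_win_rate(pairs: List[Tuple[str, str]],
                            judge_fn: Callable[[str, str], str]) -> float:
    # judge_fn(first, second) returns "first" or "second" for the preferred response
    wins = ties = 0
    for a, b in pairs:
        run1 = judge_fn(a, b)   # candidate A in the first slot
        run2 = judge_fn(b, a)   # order swapped
        if run1 == "first" and run2 == "second":
            wins += 1           # judge preferred A in both orders
        elif run1 == "second" and run2 == "first":
            pass                # consistent loss for A
        else:
            ties += 1           # judge followed position, not content: discard
    scored = len(pairs) - ties
    return wins / scored if scored else 0.0
```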
Length bias: Judges reward longer responses even when shorter ones are more correct. Mitigate with an explicit rubric instruction: "Score for correctness and directness. Do not reward unnecessary elaboration." For GAIA and Terminal-Bench where answers are short strings, this bias rarely appears, but it matters for open-ended scoring tasks.
For a complete treatment of judge deployment, vLLM configuration, and CI integration for judge pipelines, see LLM-as-Judge Evaluation Pipelines on GPU Cloud.
Reproducibility Checklist
Reproducibility means: given the same model checkpoint, same harness version, and same tasks, you get the same score. In practice, six things break this:
- Container pinning - reference Docker images by digest (`sha256:...`), not tag. Tags are mutable; a re-push of `swebench/swe-bench:latest` can silently change your sandbox environment between runs.
- Seed control - set `PYTHONHASHSEED=42`, `temperature=0`, `top_p=1.0` in all agent calls. Non-deterministic sampling produces different patches on re-runs, which changes the resolved count.
- Model weight hash - record `md5sum` of local checkpoint files in `run_manifest.json`. A checkpoint that gets updated mid-evaluation run will produce mixed results that are impossible to attribute.
- Harness version - pin benchmark library versions in `requirements.txt` (e.g., `swebench==2.1.0`). Upstream changes to task prompts or evaluation scripts change the score without changing your model.
- Score auditing - always run official scoring scripts from benchmark repos, not custom reimplementations. Small differences in string normalization change GAIA accuracy by 1-3 percentage points.
- Run manifest - store `{model, harness_version, image_digest, seed, timestamp, gpu_type}` alongside every results file.
Example run_manifest.json:
{
"run_id": "run_20260424_001",
"benchmark": "swebench_verified",
"model": "deepseek-r2-70b",
"model_checkpoint_md5": "a3f2e1c9d8b7...",
"sandbox_image_digest": "sha256:1a2b3c4d...",
"harness_version": "swebench==2.1.0",
"ray_version": "2.10.0",
"seed": 42,
"temperature": 0.0,
"timestamp": "2026-04-24T02:00:00Z",
"gpu_type": "H200_SXM5",
"gpu_count": 20
}
Building a Continuous Benchmark Pipeline
GitHub Actions Trigger
name: Agent Benchmark
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * 1"  # Every Monday at 2am UTC
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Provision Spheron cluster
        id: provision
        run: |
          RESPONSE=$(curl -s -X POST https://api.spheron.ai/v1/instances \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}" \
            -d '{"gpu": "H200", "count": 20, "pricing": "spot"}')
          INSTANCE_ID=$(echo "$RESPONSE" | jq -r '.id')
          if [ -z "$INSTANCE_ID" ] || [ "$INSTANCE_ID" = "null" ]; then echo "Provision failed" && exit 1; fi
          echo "instance_id=$INSTANCE_ID" >> $GITHUB_OUTPUT
      - name: Run SWE-bench
        run: python run_benchmark.py --benchmark swebench --tasks 500 --cluster-id ${{ steps.provision.outputs.instance_id }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results/
      - name: Deprovision Spheron cluster
        if: always()
        run: |
          curl -s -X DELETE https://api.spheron.ai/v1/instances/${{ steps.provision.outputs.instance_id }} \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}"

The weekly trigger gives you a time series of benchmark scores across model checkpoints. Gate main-branch merges on not dropping below a minimum resolved rate (e.g., block merges that regress SWE-bench by more than 2 percentage points). For Spheron API provisioning details, see the Spheron documentation.
Post-Training Iteration Loop
The benchmark pipeline feeds directly into the next training run. The loop looks like: train a checkpoint, trigger the benchmark run automatically, score results, compare against the baseline checkpoint, decide whether to continue training or adjust the data mix.
This loop is where the observability data pays off. If your SWE-bench score drops, you need to know whether it dropped on Python tasks, Go tasks, or all tasks equally. If it dropped specifically on file-manipulation tasks, that's a signal about your tool-call training data. Observability traces make that analysis possible. For orchestration patterns when running evaluation as part of a larger agent pipeline, see Scale AI Agent Fleets on GPU Cloud: MCP Orchestration and Autoscaling Guide.
Integrating Langfuse and Arize Phoenix for Per-Task Observability
What to Trace
Three things to capture per task:
- The full agent trace: every prompt, every tool call, every intermediate step, final answer
- Token counts and latency per step (to find where tasks are slow or expensive)
- Task score annotated on the trace after scoring completes (to correlate behavior with outcome)
Langfuse Setup
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def run_agent_on_task(task: dict) -> str:
    # agent execution
    pass

Tag every trace with task_id, benchmark_name, model_name, and run_id so you can filter and compare across runs in the Langfuse UI.
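One way to attach those tags, assuming the Langfuse v2 decorator SDK's langfuse_context helper (MODEL_NAME and RUN_ID are illustrative constants):

```python
from langfuse.decorators import observe, langfuse_context

MODEL_NAME = "deepseek-r2-70b"      # illustrative values
RUN_ID = "run_20260424_001"

@observe()
def run_agent_on_task(task: dict) -> str:
    # Attach identifying metadata to the current trace so runs are filterable in the UI
    langfuse_context.update_current_trace(
        name=f"{task['benchmark']}:{task['id']}",
        metadata={"task_id": task["id"], "benchmark_name": task["benchmark"],
                  "model_name": MODEL_NAME, "run_id": RUN_ID},
        tags=[task["benchmark"], MODEL_NAME, RUN_ID],
    )
    # ... agent execution ...
    return ""
```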
Linking Traces to Scores
langfuse.score(
    trace_id=trace.id,
    name="swebench_resolved",
    value=1.0 if resolved else 0.0,
    comment=f"Task {task['id']} - patch applied successfully"
)

Once you have scores annotated on traces, the analysis becomes concrete. Sort tasks by score and look at the top 20 failures: are they long context tasks? Tasks requiring specific tools? Tasks where the agent's first patch attempt was wrong and it ran out of tokens on retry? That pattern tells you exactly what to fix in the next training run, which tool call prompts need work, or where the context window needs extending.
For tasks that fail consistently across model versions, look at the average token count and step count. SWE-bench tasks that fail on every model tend to require 15,000+ tokens of context and 10+ tool calls; those are the tasks that expose context window limits, not model capability.
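That kind of cross-run failure analysis falls straight out of the per-task results schema. A sketch, assuming each run writes its per-task JSON files under a results/run_*/ directory (an illustrative layout):

```python
import glob
import json
from collections import defaultdict

# Collect per-task results from several runs and surface tasks that never resolve
by_task = defaultdict(list)
for path in glob.glob("results/run_*/*.json"):   # illustrative layout
    record = json.load(open(path))
    by_task[record["task_id"]].append(record)

always_failing = {tid: runs for tid, runs in by_task.items()
                  if not any(r.get("resolved") for r in runs)}

for tid, runs in sorted(always_failing.items()):
    avg_tokens = sum(r.get("tokens_used", 0) for r in runs) / len(runs)
    avg_seconds = sum(r.get("wall_time_seconds", 0) for r in runs) / len(runs)
    print(f"{tid}: avg {avg_tokens:.0f} tokens, {avg_seconds:.0f}s wall time")
```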
Agent benchmarking burns GPU hours fast, but the workload is perfectly shaped for spot instances: embarrassingly parallel, preemption-tolerant, and bursty. A full 500-task SWE-bench Verified run on 20 spot H200s with a self-hosted 70B model takes roughly 25 hours at a fraction of reserved-cluster cost.
