
LLM-as-Judge Evaluation Pipelines on GPU Cloud: Build Production Model Evaluation Infrastructure (2026 Guide)

Written by Mitrasish, Co-founder · Apr 22, 2026

Evaluation is the 2026 production bottleneck that most teams hit after they've solved serving. You can deploy a 70B model on 2x H100, hit 800 tokens/second, and have a clean Kubernetes setup, then realize you have no reliable way to measure whether the model is actually good. BLEU and ROUGE fail completely for code correctness, instruction following, and multi-turn coherence. The answer is a self-hosted LLM judge: a separate model that scores your candidate's outputs. This guide covers every piece of that infrastructure, from judge model selection through vLLM deployment, eval framework wiring, bias mitigation, and CI integration. For the broader inference engineering context, see inference engineering fundamentals for 2026.

Why BLEU and ROUGE Are Not Enough

BLEU and ROUGE count n-gram overlap between a generated output and a reference answer. That works fine for machine translation in 2012. It breaks for almost every task that matters in 2026.

Three concrete failures:

Code correctness. A generated function that is semantically identical to the reference but uses different variable names scores near zero on BLEU. A hallucinated function that matches the import statements in the reference scores higher. BLEU ignores execution.

Instruction following. ROUGE measures whether the same words appear. It does not check whether all instructions were followed. A response that completes 3 of 5 required steps but uses the right vocabulary scores nearly the same as one that completes all 5.

Multi-turn coherence. N-gram overlap has no concept of conversational continuity. A response that contradicts an earlier turn but uses similar phrasing scores fine.

| Task Type | BLEU/ROUGE Adequate? | Why |
|---|---|---|
| Machine translation (closed vocab) | Yes | Surface similarity correlates with quality |
| Code generation | No | Execution and logic correctness are not captured |
| Instruction following | No | Completeness is not measurable by overlap |
| Summarization (abstractive) | No | Rewording reduces score artificially |
| Multi-turn dialogue | No | No cross-turn consistency signal |
| Safety evaluation | No | Harmful content can paraphrase safe content |

LLM judges evaluate the things that actually matter: coherence, factual accuracy, helpfulness, instruction completion. The tradeoff is cost and latency: each scored output requires a full forward pass through a large model. That is why infrastructure matters.

Choosing a Judge Model

The right judge depends on what you're evaluating and what agreement rate you need with human annotators. Higher required human agreement generally means a larger judge model, which means more GPU.

| Model | Parameters | VRAM | GPU Config on Spheron | Best For |
|---|---|---|---|---|
| Qwen2.5 72B | 72B | ~72 GB at FP8 | 1x H200 | General instruction, code, math |
| GPT-OSS 120B (MoE, ~5.1B active) | 120B | ~60 GB at MXFP4 | 1x H100 or 1x H200 | Reasoning, agent task evaluation |
| Nemotron Ultra 253B | 253B | ~200 GB at FP8 | 4x H100 80GB at FP8 or 4x H200 at FP8 | RLHF labeling, frontier comparison |

A single H200 SXM5 instance covers Qwen2.5 72B at FP8 and GPT-OSS 120B at MXFP4 comfortably. The H200's 141GB HBM3e is the right starting configuration for most teams: it fits a 72B judge without tensor parallelism, which keeps the setup simple and reduces latency per judgment.

For Nemotron Ultra 253B, NVIDIA's FP8 model card (nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8) specifies 4x H100 80GB as the reference configuration. Four H200s also work and give extra headroom for larger context windows. For full-precision BF16, NVIDIA's reference config is 8x H100 80GB. Run with --tensor-parallel-size 4 in all multi-GPU configurations.

FP8 vs FP16: Qwen2.5 72B weights in FP16 require ~144GB, which is just over a single H200's 141GB, so FP16 does not fit on one GPU. FP8 cuts the weights to ~72GB with minimal quality loss on eval tasks. Use FP8 unless you have a specific reason not to.

Verify the Hugging Face model ID before deploying. The correct ID for Qwen2.5 72B is Qwen/Qwen2.5-72B-Instruct. Check the Hugging Face model page before writing it into a deployment script, as IDs can change between revisions.

Dataset Throughput Math

Before provisioning, calculate how many GPU hours your eval run needs.

The formula:

```
total_judge_tokens = num_eval_samples × avg_tokens_per_judgment
gpu_hours = total_judge_tokens / (throughput_tokens_per_sec × 3600)
```

Worked example for 10k pairwise evals:

Each judgment includes: the rubric system prompt (~500 tokens), two candidate responses (~1,000 tokens each), and the judge's scoring output (~500 tokens). Average: ~3,000 tokens per judgment.

Total tokens: 10,000 × 3,000 = 30,000,000 judge tokens.

H200 throughput with vLLM at batch size 128: ~1,200 tokens/second.

GPU hours: 30,000,000 / (1,200 × 3,600) = 6.9 GPU hours.

H100 SXM5 comparison: ~800 tokens/second = 10.4 GPU hours.

The H200 finishes the same eval run in 2/3 the time. At similar prices per GPU-hour, that is a direct cost reduction.
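
The arithmetic above can be checked with a few lines of Python. The throughput figures are the estimates from this section, not measured values:

```python
def eval_gpu_hours(num_samples: int, tokens_per_judgment: int,
                   throughput_tok_per_sec: float) -> float:
    """GPU hours needed to score an eval set with an LLM judge."""
    total_tokens = num_samples * tokens_per_judgment
    return total_tokens / (throughput_tok_per_sec * 3600)

# 10k pairwise evals at ~3,000 tokens per judgment
print(round(eval_gpu_hours(10_000, 3_000, 1_200), 1))  # H200: 6.9 GPU hours
print(round(eval_gpu_hours(10_000, 3_000, 800), 1))    # H100: 10.4 GPU hours
```

Multiply the result by your cluster's hourly rate to get a cost estimate before provisioning.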

Deployment Pattern: vLLM Judge Server

Deploy the judge as an OpenAI-compatible HTTP endpoint using vLLM's server mode. Every major eval framework can call an OpenAI-compatible endpoint, so this pattern works with Inspect AI, lm-eval-harness, and promptfoo without any custom integration.

Single-GPU deployment (Qwen2.5 72B on H200):

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 128
```

Note the flag is `--quantization fp8`, not `--dtype fp8`: vLLM's `--dtype` only accepts float precisions (auto, float16, bfloat16, float32); FP8 is enabled through the quantization option.

Four-GPU deployment (Nemotron Ultra 253B on 4x H200 at FP8):

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 64
```

This checkpoint is pre-quantized, so vLLM reads the FP8 quantization config from the model itself; no quantization flag is needed.

Health check before sending eval traffic:

```bash
curl http://localhost:8000/health
# Returns HTTP 200 with an empty body when the server is ready
```

Environment variables for eval frameworks:

```bash
export OPENAI_API_KEY="none"  # vLLM accepts any value here
export OPENAI_BASE_URL="http://<instance-ip>:8000/v1"
```

For general vLLM production configuration including monitoring setup, see the vLLM production deployment guide. For Model Runner V2 flag tuning to increase throughput on batch-heavy workloads like eval runs, see the vLLM MRV2 deployment guide.

Eval Framework Integration

Inspect AI

Inspect AI is a Python evaluation framework from UK AISI. It supports structured task definitions and runs against any OpenAI-compatible endpoint.

```python
from inspect_ai import Task, task
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=my_dataset,  # your dataset of samples, defined elsewhere
        solver=[generate()],
        scorer=model_graded_fact(
            model="openai/Qwen2.5-72B-Instruct"
        )
    )
```

Run with your vLLM endpoint:

```bash
inspect eval my_eval.py \
  --model openai/Qwen2.5-72B-Instruct \
  --model-base-url http://<instance-ip>:8000/v1
```

lm-eval-harness

EleutherAI's lm-eval-harness covers hundreds of standard benchmarks and supports custom task definitions.

```bash
lm_eval \
  --model local-completions \
  --model_args "base_url=http://<instance-ip>:8000/v1,model=Qwen/Qwen2.5-72B-Instruct" \
  --tasks my_custom_task \
  --num_fewshot 0 \
  --output_path ./results
```

For custom LLM-judge tasks, define a YAML task config that calls the judge endpoint as the metric rather than using n-gram overlap. The api_based task type in lm-eval-harness supports this directly.

promptfoo

promptfoo is well-suited for red-teaming and prompt regression testing. Configure the judge provider in promptfooconfig.yaml:

```yaml
providers:
  - id: openai:chat:Qwen/Qwen2.5-72B-Instruct
    config:
      apiBaseUrl: http://<instance-ip>:8000/v1
      apiKey: none

defaultTest:
  assert:
    - type: llm-rubric
      value: "The response is accurate, concise, and follows all instructions."
      provider: openai:chat:Qwen/Qwen2.5-72B-Instruct
```

Run the evaluation:

```bash
promptfoo eval --config promptfooconfig.yaml
```

Scoring Modes: Pairwise, Scalar, and Multi-Aspect

The scoring mode determines the rubric structure, token cost per judgment, and what bias risks apply.

Pairwise preference: The judge sees two candidate responses (A and B) and picks a winner. Use this for model comparison (which version is better?) and RLHF preference labeling. Token cost: high, because both responses go in the prompt. Bias exposure: high position bias.

Scalar scoring: The judge assigns a score from 1 to 5 per criterion. Use this for regression testing (did the new model drop below 4/5 on accuracy?) and deployment gates. Token cost: medium. Bias exposure: low position bias, some verbosity bias.

Multi-aspect rubric: The judge scores multiple dimensions separately (helpfulness, accuracy, safety). Use this when you need diagnostic output rather than a pass/fail signal. Token cost: highest, because the rubric is longer. Bias exposure: medium.

| Mode | Best Use | Token Cost | Position Bias Risk |
|---|---|---|---|
| Pairwise | Model A vs B, RLHF labeling | High | High |
| Scalar | Regression tests, deployment gates | Medium | Low |
| Multi-aspect | Diagnostic scoring, safety audits | Highest | Medium |

For scalar and multi-aspect scoring, use vLLM's guided JSON decoding to guarantee parseable output. Guided JSON is specified per request, not as a server launch flag: pass the schema in the `guided_json` field of the request's `extra_body` when calling the OpenAI-compatible endpoint.

```json
{"type": "object", "properties": {"score": {"type": "integer", "minimum": 1, "maximum": 5}, "reason": {"type": "string"}}, "required": ["score", "reason"]}
```

This eliminates output parsing failures without any post-processing.
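
Even with guided decoding, a defensive parse on the consumer side is cheap insurance. A minimal sketch, assuming the score/reason schema shown above:

```python
import json

def parse_judgment(raw: str) -> dict:
    """Parse a judge response constrained to {"score": 1-5, "reason": str}."""
    obj = json.loads(raw)
    score = obj["score"]
    if not (isinstance(score, int) and 1 <= score <= 5):
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(obj["reason"], str):
        raise ValueError("reason must be a string")
    return obj

result = parse_judgment('{"score": 4, "reason": "Accurate but verbose."}')
print(result["score"])  # 4
```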

Bias Mitigation

LLM judges have systematic biases that, if ignored, corrupt your eval results.

Position bias: Judges favor whichever response appears first in pairwise comparisons. The fix is swap-and-discard: run every pairwise comparison twice with the candidate order reversed. If the judge picks whichever response is listed first in both runs, the preference is positional rather than substantive, so record the pair as a tie. This costs 2x tokens but removes position bias from the results.
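
The swap-and-discard rule can be sketched as a pure function over two judge verdicts. `judge_fn` here is a hypothetical callable that returns "first" or "second" for a given ordering:

```python
from typing import Callable

def debiased_pairwise(resp_a: str, resp_b: str,
                      judge_fn: Callable[[str, str], str]) -> str:
    """Run the comparison twice with order swapped; positional wins become ties."""
    r1 = judge_fn(resp_a, resp_b)   # resp_a listed first
    r2 = judge_fn(resp_b, resp_a)   # resp_b listed first
    win1 = resp_a if r1 == "first" else resp_b
    win2 = resp_b if r2 == "first" else resp_a
    # Only a preference that survives the swap counts as real signal
    return win1 if win1 == win2 else "tie"

# A judge that always picks whatever is listed first yields a tie:
always_first = lambda first, second: "first"
print(debiased_pairwise("resp1", "resp2", always_first))  # tie
```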

Verbosity bias: Judges prefer longer responses, even when shorter ones are more correct. Mitigate by adding explicit rubric language: "Prefer concise, direct answers. Do not reward unnecessary elaboration." For scalar scoring, you can add length as a negative criterion when verbosity is not desired.

Self-preference: A model tends to prefer its own outputs when used as its own judge. Never use the candidate model as the judge. Use a separate model from a different training lineage. For example, do not use Llama-3-70B to evaluate Llama-3-70B outputs.

Reference-model consensus: For high-stakes evals (RLHF preference labeling, safety evaluation), run two judges from different model families and take majority vote. If both agree, record the result. If they disagree, flag the sample for human review. This catches systematic biases that affect a single model family.
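
A two-judge consensus gate can be sketched as follows; the judge callables are hypothetical stand-ins for two endpoints backed by different model families:

```python
def consensus_verdict(sample, judge_a, judge_b) -> dict:
    """Record agreement between two judges; route disagreements to human review."""
    va, vb = judge_a(sample), judge_b(sample)
    if va == vb:
        return {"verdict": va, "needs_human_review": False}
    return {"verdict": None, "needs_human_review": True}

print(consensus_verdict("x", lambda s: "pass", lambda s: "pass"))
# {'verdict': 'pass', 'needs_human_review': False}
```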

| Bias Type | Effect | Mitigation |
|---|---|---|
| Position | Prefers first response | Swap order, discard ties |
| Verbosity | Prefers longer responses | Explicit rubric penalty for length |
| Self-preference | Prefers its own outputs | Use a different model family as judge |
| Sycophancy | Agrees with authoritative tone | Blind rubric (no author attribution in prompt) |

Batch vs Online Eval

Two deployment patterns, each suited to different use cases.

Batch eval (nightly regression): Provision a spot instance, run the full eval suite, terminate the instance. The job is time-tolerant: it does not need to complete in under a second. Spot instances are the right pricing tier here: eval workloads are batch-shaped and can checkpoint state to disk if the instance is preempted. Each checkpoint saves scored samples to a JSON file, so a restart picks up where it left off.

Online eval (inline production judge): Route 1-2% of live production traffic through a judge endpoint that scores outputs in near real-time. Use this to monitor score drift on production traffic without running a separate offline eval suite. On-demand instances are required here: spot pricing is not appropriate for SLA-bound production traffic.

On billing for batch eval: spot pricing on Spheron runs 40-60% below on-demand on H200 and 50-70% below on H100 SXM5. That turns a ~$8 on-demand eval run into a ~$3-5 spot run when availability permits.

Batch eval catches regressions before deployment. Online eval catches drift after. Both are necessary for production systems.

Cost Playbook: 10k Evals on Spheron

Pricing from Spheron's public /gpu-rental/ pages, checked 22 Apr 2026. H200 and H100 SXM5 on-demand rates sourced from the respective GPU rental pages; spot pricing runs 40-60% below on-demand and varies by availability. Throughput estimates at batch size 128 with vLLM FP8.

| Judge Model | GPU Config | Cluster On-Demand Price/hr | GPU Hours for 10k Evals | Total On-Demand Cost |
|---|---|---|---|---|
| Qwen2.5 72B | 1x H200 SXM5 | $1.19/hr | ~6.9 hrs | ~$8.21 |
| GPT-OSS 120B | 1x H100 SXM5 | $0.80/hr | ~4 hrs | ~$3.20 |
| Nemotron Ultra 253B | 4x H200 at FP8 | $4.76/hr | ~12 hrs | ~$57.12 |

Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Teams running GPT-OSS 120B can start with a single H100 SXM5 thanks to MXFP4 quantization. The 120B MoE model has ~5.1B active parameters per forward pass and fits on a single 80GB H100 at MXFP4.

If spot runs 40-60% below on-demand, the same run costs roughly 1.7-2.5x more on-demand. Use on-demand only for inline production judges where spot preemption is unacceptable.

H200 spot availability note: H200 spot pricing varies. If H200 spot is unavailable when you need it, fall back to 2x H100 SXM5 at $0.80/GPU/hr on-demand (or spot where available), which handles Qwen2.5 72B at TP=2. Check the pricing page for live availability.

Reference Architecture: CI-Integrated Judge Pipeline

This GitHub Actions workflow provisions a Spheron H200 spot instance, runs the eval, asserts a score threshold, and terminates the instance. Adapt the Spheron API calls to match the current provisioning API documented at docs.spheron.ai/api-reference.

```yaml
name: LLM Eval Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # nightly at 2am UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Provision Spheron H200 spot instance
        id: provision
        run: |
          INSTANCE=$(curl -s -X POST https://api.spheron.network/v1/instances \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{"gpu": "H200", "count": 1, "pricing": "spot"}')
          echo "instance_id=$(echo "$INSTANCE" | jq -r .id)" >> $GITHUB_OUTPUT
          echo "instance_ip=$(echo "$INSTANCE" | jq -r .ip)" >> $GITHUB_OUTPUT

      - name: Wait for SSH readiness
        run: |
          READY=false
          for i in $(seq 30); do
            ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
              user@${{ steps.provision.outputs.instance_ip }} 'echo ready' && READY=true && break
            sleep 10
          done
          [ "$READY" = "true" ] || { echo "SSH readiness timeout after 5 minutes"; exit 1; }

      - name: Start vLLM judge server
        run: |
          ssh user@${{ steps.provision.outputs.instance_ip }} \
            'docker run -d --gpus all --ipc=host -p 8000:8000 \
              vllm/vllm-openai:latest \
              --model Qwen/Qwen2.5-72B-Instruct \
              --quantization fp8 \
              --gpu-memory-utilization 0.92 \
              --max-model-len 32768 \
              --max-num-seqs 128'
          # Wait for health check (5 min timeout)
          READY=false
          for i in $(seq 60); do
            curl -sf http://${{ steps.provision.outputs.instance_ip }}:8000/health && READY=true && break
            sleep 5
          done
          [ "$READY" = "true" ] || { echo "vLLM health check timeout after 5 minutes"; exit 1; }

      - name: Run eval harness
        run: |
          OPENAI_BASE_URL=http://${{ steps.provision.outputs.instance_ip }}:8000/v1 \
          lm_eval --model local-completions \
            --model_args "base_url=http://${{ steps.provision.outputs.instance_ip }}:8000/v1,model=Qwen/Qwen2.5-72B-Instruct" \
            --tasks my_eval_suite \
            --output_path ./eval-results \
            --write_out

      - name: Assert score threshold
        run: |
          SCORE=$(jq '.results.my_eval_suite.accuracy // empty' eval-results/results.json || true)
          [ -z "$SCORE" ] && { echo 'Accuracy key not found in results'; exit 1; }
          python -c 'import sys; score=float(sys.argv[1]); assert score >= 0.82, f"Eval score {score} below threshold 0.82"' "$SCORE"

      - name: Terminate instance
        if: always()
        run: |
          curl -s -X DELETE \
            https://api.spheron.network/v1/instances/${{ steps.provision.outputs.instance_id }} \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}"
```

The if: always() on the terminate step ensures the instance is shut down even if the eval fails. This prevents runaway spot charges from failed runs.

For checkpoint-based resumption (handling mid-eval spot preemption), write scored samples to an S3-compatible store every 500 samples. On restart, load the checkpoint and skip already-scored pairs.
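
A minimal checkpoint-and-resume sketch, with a local JSON file standing in for the S3-compatible store and `score_sample` as a hypothetical judge call:

```python
import json
import os

CHECKPOINT = "scored.json"

def load_checkpoint() -> dict:
    """Load previously scored samples, if a checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def run_eval(samples: dict, score_sample, flush_every: int = 500) -> dict:
    """Score samples, skipping any already present in the checkpoint."""
    scored = load_checkpoint()
    pending = {k: v for k, v in samples.items() if k not in scored}
    for i, (sid, sample) in enumerate(pending.items(), 1):
        scored[sid] = score_sample(sample)
        if i % flush_every == 0:          # periodic durable write
            with open(CHECKPOINT, "w") as f:
                json.dump(scored, f)
    with open(CHECKPOINT, "w") as f:      # final flush
        json.dump(scored, f)
    return scored
```

After a spot preemption, rerunning `run_eval` with the same sample set only pays for the judgments that were not yet checkpointed.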

Monitoring Judge Quality Over Time

A judge pipeline that you deploy and forget is not a production system. Three things to track:

Score distribution per model version. After each eval run, record the histogram of scores. A judge giving 90% of responses a score of 4/5 is not calibrated: it either has a misconfigured rubric or a model that is too agreeable. Track distribution shift between runs as a signal of judge drift or rubric drift.

Human agreement rate. Periodically pull a random sample of 50-100 judge decisions and have a human annotator score the same pairs independently. Your target agreement rate depends on the task: 75-80% agreement is typical for general instruction tasks, 85%+ for factual accuracy tasks. If agreement drops below threshold, the judge needs recalibration.
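
Agreement rate over a spot-check sample is a simple fraction; a sketch:

```python
def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of samples where judge and human annotator agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

print(agreement_rate(["A", "B", "A", "tie"], ["A", "B", "B", "tie"]))  # 0.75
```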

Raw output archiving. Save every judge decision (prompt, response A, response B, judge output, parsed score, reasoning) to durable storage. This is essential for bias audits. If you discover weeks later that a particular judge systematically favored shorter responses, you need the raw data to identify which eval results are affected.

For structured JSON output from vLLM guided decoding (which makes raw output parsing reliable), see the structured output and function calling inference guide.

Alert on:

  • Score standard deviation dropping below 0.5 (judge is not discriminating)
  • Mean score shifting more than 0.3 points between consecutive nightly runs
  • Position bias rate exceeding 15% on swap tests
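
The three alert conditions above can be checked with stdlib statistics. The thresholds are the ones from this section; the function name and inputs are illustrative:

```python
from statistics import mean, stdev

def judge_alerts(scores: list, prev_mean: float,
                 position_flip_rate: float) -> list:
    """Return the names of alert conditions triggered by this run."""
    alerts = []
    if stdev(scores) < 0.5:
        alerts.append("low_discrimination")   # judge is not separating quality
    if abs(mean(scores) - prev_mean) > 0.3:
        alerts.append("score_drift")          # mean moved between nightly runs
    if position_flip_rate > 0.15:
        alerts.append("position_bias")        # swap-test failure rate too high
    return alerts

print(judge_alerts([4, 4, 4, 4, 4], prev_mean=4.0, position_flip_rate=0.05))
# ['low_discrimination']
```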

Eval workloads are spiky and batch-shaped, the ideal fit for spot GPU pricing. Run your judge pipeline on H200 instances and pay only for the hours you use.

View H200 spot pricing → | See all GPU pricing →
