
LLM-as-Judge Evaluation Pipelines on GPU Cloud: Build Production Model Evaluation Infrastructure (2026 Guide)

Written by Mitrasish, Co-founder · Apr 22, 2026

Evaluation is the 2026 production bottleneck that most teams hit after they've solved serving. You can deploy a 70B model on 2x H100, hit 800 tokens/second, and have a clean Kubernetes setup, then realize you have no reliable way to measure whether the model is actually good. BLEU and ROUGE fail completely for code correctness, instruction following, and multi-turn coherence. The answer is a self-hosted LLM judge: a separate model that scores your candidate's outputs. This guide covers every piece of that infrastructure, from judge model selection through vLLM deployment, eval framework wiring, bias mitigation, and CI integration. For the broader inference engineering context, see inference engineering fundamentals for 2026.

Why BLEU and ROUGE Are Not Enough

BLEU and ROUGE count n-gram overlap between a generated output and a reference answer. That works fine for machine translation in 2012. It breaks for almost every task that matters in 2026.

Three concrete failures:

Code correctness. A generated function that is semantically identical to the reference but uses different variable names scores near zero on BLEU. A hallucinated function that matches the import statements in the reference scores higher. BLEU ignores execution.

Instruction following. ROUGE measures whether the same words appear. It does not check whether all instructions were followed. A response that completes 3 of 5 required steps but uses the right vocabulary scores nearly the same as one that completes all 5.

Multi-turn coherence. N-gram overlap has no concept of conversational continuity. A response that contradicts an earlier turn but uses similar phrasing scores fine.

| Task Type | BLEU/ROUGE Adequate? | Why |
|---|---|---|
| Machine translation (closed vocab) | Yes | Surface similarity correlates with quality |
| Code generation | No | Execution and logic correctness are not captured |
| Instruction following | No | Completeness is not measurable by overlap |
| Summarization (abstractive) | No | Rewording reduces score artificially |
| Multi-turn dialogue | No | No cross-turn consistency signal |
| Safety evaluation | No | Harmful content can paraphrase safe content |

LLM judges evaluate the things that actually matter: coherence, factual accuracy, helpfulness, instruction completion. The tradeoff is cost and latency: each scored output requires a full forward pass through a large model. That is why infrastructure matters.

Choosing a Judge Model

The right judge depends on what you're evaluating and what agreement rate you need with human annotators. Higher required human agreement generally means a larger judge model, which means more GPU.

| Model | Parameters | VRAM | GPU Config on Spheron | Best For |
|---|---|---|---|---|
| Qwen2.5 72B | 72B | ~72 GB at FP8 | 1x H200 | General instruction, code, math |
| GPT-OSS 120B (MoE, ~5.1B active) | 120B | ~60 GB at MXFP4 | 1x H100 or 1x H200 | Reasoning, agent task evaluation |
| Nemotron Ultra 253B | 253B | ~200 GB at FP8 | 4x H100 80GB at FP8 or 4x H200 at FP8 | RLHF labeling, frontier comparison |

A single H200 SXM5 instance covers Qwen2.5 72B at FP8 and GPT-OSS 120B at MXFP4 comfortably. The H200's 141GB HBM3e is the right starting configuration for most teams: it fits a 72B judge without tensor parallelism, which keeps the setup simple and reduces latency per judgment.

For Nemotron Ultra 253B, NVIDIA's FP8 model card (nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8) specifies 4x H100 80GB as the reference configuration. Four H200s also work and give extra headroom for larger context windows. For full-precision BF16, NVIDIA's reference config is 8x H100 80GB. Run with --tensor-parallel-size 4 in all multi-GPU configurations.

FP8 vs FP16: Qwen2.5 72B weights in FP16 require ~144GB, which is just over a single H200's 141GB, so FP16 does not fit on one GPU. FP8 cuts the weights to ~72GB with minimal quality loss on eval tasks. Use FP8 unless you have a specific reason not to.

Verify the Hugging Face model ID before deploying. The correct ID for Qwen2.5 72B is Qwen/Qwen2.5-72B-Instruct. Check the Hugging Face model page before writing it into a deployment script, as IDs can change between revisions.

Dataset Throughput Math

Before provisioning, calculate how many GPU hours your eval run needs.

The formula:

```
total_judge_tokens = num_eval_samples × avg_tokens_per_judgment
gpu_hours = total_judge_tokens / (throughput_tokens_per_sec × 3600)
```

Worked example for 10k pairwise evals:

Each judgment includes: the rubric system prompt (~500 tokens), two candidate responses (~1,000 tokens each), and the judge's scoring output (~500 tokens). Average: ~3,000 tokens per judgment.

Total tokens: 10,000 × 3,000 = 30,000,000 judge tokens.

H200 throughput with vLLM at batch size 128: ~1,200 tokens/second.

GPU hours: 30,000,000 / (1,200 × 3,600) = 6.9 GPU hours.

H100 SXM5 comparison: ~800 tokens/second = 10.4 GPU hours.

The H200 finishes the same eval run in 2/3 the time. At similar prices per GPU-hour, that is a direct cost reduction.
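
The arithmetic above can be checked with a few lines of Python. The throughput figures are the estimates from this section, not measured values:

```python
def eval_gpu_hours(num_samples: int, tokens_per_judgment: int,
                   throughput_tok_per_sec: float) -> float:
    """GPU hours needed to score an eval set with an LLM judge."""
    total_tokens = num_samples * tokens_per_judgment
    return total_tokens / (throughput_tok_per_sec * 3600)

# 10k pairwise evals at ~3,000 tokens per judgment
print(round(eval_gpu_hours(10_000, 3_000, 1_200), 1))  # H200: 6.9 GPU hours
print(round(eval_gpu_hours(10_000, 3_000, 800), 1))    # H100: 10.4 GPU hours
```

Multiply the result by your cluster's hourly rate to get a cost estimate before provisioning.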

Deployment Pattern: vLLM Judge Server

Deploy the judge as an OpenAI-compatible HTTP endpoint using vLLM's server mode. Every major eval framework can call an OpenAI-compatible endpoint, so this pattern works with Inspect AI, lm-eval-harness, and promptfoo without any custom integration.

Single-GPU deployment (Qwen2.5 72B on H200):

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 128
```

Note the flag is `--quantization fp8`, not `--dtype fp8`: vLLM's `--dtype` only accepts float precisions (auto, float16, bfloat16, float32); FP8 is enabled through the quantization option.

Four-GPU deployment (Nemotron Ultra 253B on 4x H200 at FP8):

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 64
```

This checkpoint is pre-quantized, so vLLM reads the FP8 quantization config from the model itself; no quantization flag is needed.

Health check before sending eval traffic:

```bash
curl http://localhost:8000/health
# Returns HTTP 200 with an empty body when the server is ready
```

Environment variables for eval frameworks:

```bash
export OPENAI_API_KEY="none"  # vLLM accepts any value here
export OPENAI_BASE_URL="http://<instance-ip>:8000/v1"
```

For general vLLM production configuration including monitoring setup, see the vLLM production deployment guide. For Model Runner V2 flag tuning to increase throughput on batch-heavy workloads like eval runs, see the vLLM MRV2 deployment guide.

Eval Framework Integration

Inspect AI

Inspect AI is a Python evaluation framework from UK AISI. It supports structured task definitions and runs against any OpenAI-compatible endpoint.

```python
from inspect_ai import Task, task
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=my_dataset,  # your dataset of samples, defined elsewhere
        solver=[generate()],
        scorer=model_graded_fact(
            model="openai/Qwen2.5-72B-Instruct"
        )
    )
```

Run with your vLLM endpoint:

```bash
inspect eval my_eval.py \
  --model openai/Qwen2.5-72B-Instruct \
  --model-base-url http://<instance-ip>:8000/v1
```

lm-eval-harness

EleutherAI's lm-eval-harness covers hundreds of standard benchmarks and supports custom task definitions.

```bash
lm_eval \
  --model local-completions \
  --model_args "base_url=http://<instance-ip>:8000/v1,model=Qwen/Qwen2.5-72B-Instruct" \
  --tasks my_custom_task \
  --num_fewshot 0 \
  --output_path ./results
```

For custom LLM-judge tasks, define a YAML task config that calls the judge endpoint as the metric rather than using n-gram overlap. The api_based task type in lm-eval-harness supports this directly.

promptfoo

promptfoo is well-suited for red-teaming and prompt regression testing. Configure the judge provider in promptfooconfig.yaml:

```yaml
providers:
  - id: openai:chat:Qwen/Qwen2.5-72B-Instruct
    config:
      apiBaseUrl: http://<instance-ip>:8000/v1
      apiKey: none

defaultTest:
  assert:
    - type: llm-rubric
      value: "The response is accurate, concise, and follows all instructions."
      provider: openai:chat:Qwen/Qwen2.5-72B-Instruct
```

Run the evaluation:

```bash
promptfoo eval --config promptfooconfig.yaml
```

Scoring Modes: Pairwise, Scalar, and Multi-Aspect

The scoring mode determines the rubric structure, token cost per judgment, and what bias risks apply.

Pairwise preference: The judge sees two candidate responses (A and B) and picks a winner. Use this for model comparison (which version is better?) and RLHF preference labeling. Token cost: high, because both responses go in the prompt. Bias exposure: high position bias.

Scalar scoring: The judge assigns a score from 1 to 5 per criterion. Use this for regression testing (did the new model drop below 4/5 on accuracy?) and deployment gates. Token cost: medium. Bias exposure: low position bias, some verbosity bias.

Multi-aspect rubric: The judge scores multiple dimensions separately (helpfulness, accuracy, safety). Use this when you need diagnostic output rather than a pass/fail signal. Token cost: highest, because the rubric is longer. Bias exposure: medium.

| Mode | Best Use | Token Cost | Position Bias Risk |
|---|---|---|---|
| Pairwise | Model A vs B, RLHF labeling | High | High |
| Scalar | Regression tests, deployment gates | Medium | Low |
| Multi-aspect | Diagnostic scoring, safety audits | Highest | Medium |

For scalar and multi-aspect scoring, use vLLM's guided JSON decoding to guarantee parseable output. Guided JSON is specified per request, not as a server launch flag: pass the schema in the `guided_json` field of the request's `extra_body` when calling the OpenAI-compatible endpoint.

```json
{"type": "object", "properties": {"score": {"type": "integer", "minimum": 1, "maximum": 5}, "reason": {"type": "string"}}, "required": ["score", "reason"]}
```

This eliminates output parsing failures without any post-processing.
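
Even with guided decoding, a defensive parse on the consumer side is cheap insurance. A minimal sketch, assuming the score/reason schema shown above:

```python
import json

def parse_judgment(raw: str) -> dict:
    """Parse a judge response constrained to {"score": 1-5, "reason": str}."""
    obj = json.loads(raw)
    score = obj["score"]
    if not (isinstance(score, int) and 1 <= score <= 5):
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(obj["reason"], str):
        raise ValueError("reason must be a string")
    return obj

result = parse_judgment('{"score": 4, "reason": "Accurate but verbose."}')
print(result["score"])  # 4
```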

Bias Mitigation

LLM judges have systematic biases that, if ignored, corrupt your eval results.

Position bias: Judges favor whichever response appears first in pairwise comparisons. The fix is swap-and-discard: run every pairwise comparison twice with the candidate order reversed. If the judge picks whichever response is listed first in both runs, the preference is positional rather than substantive, so record the pair as a tie. This costs 2x tokens but removes position bias from the results.
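
The swap-and-discard rule can be sketched as a pure function over two judge verdicts. `judge_fn` here is a hypothetical callable that returns "first" or "second" for a given ordering:

```python
from typing import Callable

def debiased_pairwise(resp_a: str, resp_b: str,
                      judge_fn: Callable[[str, str], str]) -> str:
    """Run the comparison twice with order swapped; positional wins become ties."""
    r1 = judge_fn(resp_a, resp_b)   # resp_a listed first
    r2 = judge_fn(resp_b, resp_a)   # resp_b listed first
    win1 = resp_a if r1 == "first" else resp_b
    win2 = resp_b if r2 == "first" else resp_a
    # Only a preference that survives the swap counts as real signal
    return win1 if win1 == win2 else "tie"

# A judge that always picks whatever is listed first yields a tie:
always_first = lambda first, second: "first"
print(debiased_pairwise("resp1", "resp2", always_first))  # tie
```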

Verbosity bias: Judges prefer longer responses, even when shorter ones are more correct. Mitigate by adding explicit rubric language: "Prefer concise, direct answers. Do not reward unnecessary elaboration." For scalar scoring, you can add length as a negative criterion when verbosity is not desired.

Self-preference: A model tends to prefer its own outputs when used as its own judge. Never use the candidate model as the judge. Use a separate model from a different training lineage. For example, do not use Llama-3-70B to evaluate Llama-3-70B outputs.

Reference-model consensus: For high-stakes evals (RLHF preference labeling, safety evaluation), run two judges from different model families and take majority vote. If both agree, record the result. If they disagree, flag the sample for human review. This catches systematic biases that affect a single model family.
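
A two-judge consensus gate can be sketched as follows; the judge callables are hypothetical stand-ins for two endpoints backed by different model families:

```python
def consensus_verdict(sample, judge_a, judge_b) -> dict:
    """Record agreement between two judges; route disagreements to human review."""
    va, vb = judge_a(sample), judge_b(sample)
    if va == vb:
        return {"verdict": va, "needs_human_review": False}
    return {"verdict": None, "needs_human_review": True}

print(consensus_verdict("x", lambda s: "pass", lambda s: "pass"))
# {'verdict': 'pass', 'needs_human_review': False}
```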

| Bias Type | Effect | Mitigation |
|---|---|---|
| Position | Prefers first response | Swap order, discard ties |
| Verbosity | Prefers longer responses | Explicit rubric penalty for length |
| Self-preference | Prefers its own outputs | Use a different model family as judge |
| Sycophancy | Agrees with authoritative tone | Blind rubric (no author attribution in prompt) |

Batch vs Online Eval

Two deployment patterns, each suited to different use cases.

Batch eval (nightly regression): Provision a spot instance, run the full eval suite, terminate the instance. The job is time-tolerant: it does not need to complete in under a second. Spot instances are the right pricing tier here: eval workloads are batch-shaped and can checkpoint state to disk if the instance is preempted. Each checkpoint saves scored samples to a JSON file, so a restart picks up where it left off.

Online eval (inline production judge): Route 1-2% of live production traffic through a judge endpoint that scores outputs in near real-time. Use this to monitor score drift on production traffic without running a separate offline eval suite. On-demand instances are required here: spot pricing is not appropriate for SLA-bound production traffic.

On billing for batch eval: spot pricing on Spheron runs 40-60% below on-demand on H200 and 50-70% below on H100 SXM5. That turns a ~$8 on-demand eval run into a ~$3-5 spot run when availability permits.

Batch eval catches regressions before deployment. Online eval catches drift after. Both are necessary for production systems.

Cost Playbook: 10k Evals on Spheron

Pricing from Spheron's public /gpu-rental/ pages, checked 22 Apr 2026. H200 and H100 SXM5 on-demand rates sourced from the respective GPU rental pages; spot pricing runs 40-60% below on-demand and varies by availability. Throughput estimates at batch size 128 with vLLM FP8.

| Judge Model | GPU Config | Cluster On-Demand Price/hr | GPU Hours for 10k Evals | Total On-Demand Cost |
|---|---|---|---|---|
| Qwen2.5 72B | 1x H200 SXM5 | $1.19/hr | ~6.9 hrs | ~$8.21 |
| GPT-OSS 120B | 1x H100 SXM5 | $0.80/hr | ~4 hrs | ~$3.20 |
| Nemotron Ultra 253B | 4x H200 at FP8 | $4.76/hr | ~12 hrs | ~$57.12 |

Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Teams running GPT-OSS 120B can start with a single H100 SXM5 thanks to MXFP4 quantization. The 120B MoE model has ~5.1B active parameters per forward pass and fits on a single 80GB H100 at MXFP4.

If spot runs 40-60% below on-demand, the same run costs roughly 1.7-2.5x more on-demand. Use on-demand only for inline production judges where spot preemption is unacceptable.

H200 spot availability note: H200 spot pricing varies. If H200 spot is unavailable when you need it, fall back to 2x H100 SXM5 at $0.80/GPU/hr on-demand (or spot where available), which handles Qwen2.5 72B at TP=2. Check the pricing page for live availability.

Reference Architecture: CI-Integrated Judge Pipeline

This GitHub Actions workflow provisions a Spheron H200 spot instance, runs the eval, asserts a score threshold, and terminates the instance. Adapt the Spheron API calls to match the current provisioning API documented at docs.spheron.ai/api-reference.

```yaml
name: LLM Eval Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # nightly at 2am UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Provision Spheron H200 spot instance
        id: provision
        run: |
          INSTANCE=$(curl -s -X POST https://api.spheron.network/v1/instances \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{"gpu": "H200", "count": 1, "pricing": "spot"}')
          echo "instance_id=$(echo "$INSTANCE" | jq -r .id)" >> $GITHUB_OUTPUT
          echo "instance_ip=$(echo "$INSTANCE" | jq -r .ip)" >> $GITHUB_OUTPUT

      - name: Wait for SSH readiness
        run: |
          READY=false
          for i in $(seq 30); do
            ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
              user@${{ steps.provision.outputs.instance_ip }} 'echo ready' && READY=true && break
            sleep 10
          done
          [ "$READY" = "true" ] || { echo "SSH readiness timeout after 5 minutes"; exit 1; }

      - name: Start vLLM judge server
        run: |
          ssh user@${{ steps.provision.outputs.instance_ip }} \
            'docker run -d --gpus all --ipc=host -p 8000:8000 \
              vllm/vllm-openai:latest \
              --model Qwen/Qwen2.5-72B-Instruct \
              --quantization fp8 \
              --gpu-memory-utilization 0.92 \
              --max-model-len 32768 \
              --max-num-seqs 128'
          # Wait for health check (5 min timeout)
          READY=false
          for i in $(seq 60); do
            curl -sf http://${{ steps.provision.outputs.instance_ip }}:8000/health && READY=true && break
            sleep 5
          done
          [ "$READY" = "true" ] || { echo "vLLM health check timeout after 5 minutes"; exit 1; }

      - name: Run eval harness
        run: |
          OPENAI_BASE_URL=http://${{ steps.provision.outputs.instance_ip }}:8000/v1 \
          lm_eval --model local-completions \
            --model_args "base_url=http://${{ steps.provision.outputs.instance_ip }}:8000/v1,model=Qwen/Qwen2.5-72B-Instruct" \
            --tasks my_eval_suite \
            --output_path ./eval-results \
            --write_out

      - name: Assert score threshold
        run: |
          SCORE=$(jq '.results.my_eval_suite.accuracy // empty' eval-results/results.json || true)
          [ -z "$SCORE" ] && { echo 'Accuracy key not found in results'; exit 1; }
          python -c 'import sys; score=float(sys.argv[1]); assert score >= 0.82, f"Eval score {score} below threshold 0.82"' "$SCORE"

      - name: Terminate instance
        if: always()
        run: |
          curl -s -X DELETE \
            https://api.spheron.network/v1/instances/${{ steps.provision.outputs.instance_id }} \
            -H "Authorization: Bearer ${{ secrets.SPHERON_API_KEY }}"
```

The if: always() on the terminate step ensures the instance is shut down even if the eval fails. This prevents runaway spot charges from failed runs.

For checkpoint-based resumption (handling mid-eval spot preemption), write scored samples to an S3-compatible store every 500 samples. On restart, load the checkpoint and skip already-scored pairs.
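
A minimal checkpoint-and-resume sketch, with a local JSON file standing in for the S3-compatible store and `score_sample` as a hypothetical judge call:

```python
import json
import os

CHECKPOINT = "scored.json"

def load_checkpoint() -> dict:
    """Load previously scored samples, if a checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def run_eval(samples: dict, score_sample, flush_every: int = 500) -> dict:
    """Score samples, skipping any already present in the checkpoint."""
    scored = load_checkpoint()
    pending = {k: v for k, v in samples.items() if k not in scored}
    for i, (sid, sample) in enumerate(pending.items(), 1):
        scored[sid] = score_sample(sample)
        if i % flush_every == 0:          # periodic durable write
            with open(CHECKPOINT, "w") as f:
                json.dump(scored, f)
    with open(CHECKPOINT, "w") as f:      # final flush
        json.dump(scored, f)
    return scored
```

After a spot preemption, rerunning `run_eval` with the same sample set only pays for the judgments that were not yet checkpointed.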

Monitoring Judge Quality Over Time

A judge pipeline that you deploy and forget is not a production system. Three things to track:

Score distribution per model version. After each eval run, record the histogram of scores. A judge giving 90% of responses a score of 4/5 is not calibrated: it either has a misconfigured rubric or a model that is too agreeable. Track distribution shift between runs as a signal of judge drift or rubric drift.

Human agreement rate. Periodically pull a random sample of 50-100 judge decisions and have a human annotator score the same pairs independently. Your target agreement rate depends on the task: 75-80% agreement is typical for general instruction tasks, 85%+ for factual accuracy tasks. If agreement drops below threshold, the judge needs recalibration.
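
Agreement rate over a spot-check sample is a simple fraction; a sketch:

```python
def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of samples where judge and human annotator agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

print(agreement_rate(["A", "B", "A", "tie"], ["A", "B", "B", "tie"]))  # 0.75
```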

Raw output archiving. Save every judge decision (prompt, response A, response B, judge output, parsed score, reasoning) to durable storage. This is essential for bias audits. If you discover weeks later that a particular judge systematically favored shorter responses, you need the raw data to identify which eval results are affected.

For structured JSON output from vLLM guided decoding (which makes raw output parsing reliable), see the structured output and function calling inference guide.

Alert on:

  • Score standard deviation dropping below 0.5 (judge is not discriminating)
  • Mean score shifting more than 0.3 points between consecutive nightly runs
  • Position bias rate exceeding 15% on swap tests
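
The three alert conditions above can be checked with stdlib statistics. The thresholds are the ones from this section; the function name and inputs are illustrative:

```python
from statistics import mean, stdev

def judge_alerts(scores: list, prev_mean: float,
                 position_flip_rate: float) -> list:
    """Return the names of alert conditions triggered by this run."""
    alerts = []
    if stdev(scores) < 0.5:
        alerts.append("low_discrimination")   # judge is not separating quality
    if abs(mean(scores) - prev_mean) > 0.3:
        alerts.append("score_drift")          # mean moved between nightly runs
    if position_flip_rate > 0.15:
        alerts.append("position_bias")        # swap-test failure rate too high
    return alerts

print(judge_alerts([4, 4, 4, 4, 4], prev_mean=4.0, position_flip_rate=0.05))
# ['low_discrimination']
```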

Eval workloads are spiky and batch-shaped, the ideal fit for spot GPU pricing. Run your judge pipeline on H200 instances and pay only for the hours you use.

View H200 spot pricing → | See all GPU pricing →
