Self-Host AI Code Review on GPU Cloud: Deploy Open-Source PR Review Agents (2026 Guide)

CodeRabbit Pro charges $24 per seat per month on the annual plan. At 50 engineers, that is $1,200/month, and every PR you send to it passes through a third-party inference endpoint. A single H100 SXM5 on Spheron at $1.49/hr spot handles the same workload for a fixed monthly rate, with your source code never leaving your infrastructure. This guide shows you how to build and run the full stack.

Why Teams Are Moving Code Review Off SaaS in 2026

Three concrete reasons: privacy, cost, and customization.

Source-code privacy. Every PR sent to CodeRabbit, Greptile, or Sweep AI goes to a third-party inference endpoint. For teams under SOC 2, HIPAA, FedRAMP, or with IP-sensitive repos (fintech algorithms, pharma compound structures, defense contractor code), that transmission is a compliance blocker, not a preference. The same reasoning applies to AI coding assistants; for a fuller treatment of the privacy argument, see Self-Host Your AI Coding Assistant on GPU Cloud, which covers the same architecture for inline completions.

Cost at scale. CodeRabbit Pro at $24/seat/month (annual) and 50 engineers equals $1,200/month. One H100 SXM5 on Spheron at $1.49/hr spot costs $1,074/month at continuous operation, plus roughly $80/month for Redis and a small CPU host for the webhook handler. Total: $1,154/month. Self-hosting saves $46/month at 50 engineers and $1,246/month at 100. For a broader look at how GPU cloud pricing stacks up against SaaS, see GPU cloud pricing comparison for 2026. The LLM inference on-premise vs cloud analysis covers the deeper cost model if you are weighing owned hardware.

Customization. SaaS review tools run one model on one prompt template. A self-hosted stack lets you fine-tune on your internal style guides, inject repo-specific anti-pattern rules, enforce your security scanning policies, and swap models when better ones are released. None of that is possible with a SaaS subscription.

Reference Architecture: From Webhook to Inline Comment

The pipeline has five stages:

GitHub/GitLab PR Event
       |
       v
FastAPI Webhook Handler   <-- validate HMAC-SHA256 signature
       |
       v
Diff Parser + Context Window Builder
  (chunk diffs to <=6K tokens, attach file context)
       |
       v
vLLM Inference Endpoint   (Qwen2.5-Coder 32B or DeepSeek V3)
       |
       v
Comment Formatter + Deduplication Store (Redis)
       |
       v
GitHub/GitLab Review API  <-- post inline review comments

Webhook handler. Every incoming webhook from GitHub carries an X-Hub-Signature-256 header. Validate it with HMAC-SHA256 against your app's webhook secret before touching the payload. Unauthenticated webhooks let anyone trigger inference against your GPU by sending a spoofed POST to your handler. The handler enqueues a job and returns HTTP 200 within GitHub's 10-second timeout; all actual processing happens asynchronously in a worker.

Diff parser and context window builder. The GitHub REST API returns per-file diffs as unified diff strings. Large PRs can exceed 100K tokens when you include surrounding context. The chunker counts tokens with AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct"), splits at 6K tokens with 512-token overlap (to preserve context at chunk boundaries), and attaches the file name and language tag to each chunk as metadata. Overlap is small relative to the chunk size but prevents the model from missing function signatures split across chunk boundaries.

vLLM inference. vLLM exposes an OpenAI-compatible chat completions endpoint. Each chunk is sent as a separate inference call with a system prompt instructing the model to return structured JSON (file path, line number, severity, comment body). Chunks are dispatched concurrently with asyncio.gather and batched by vLLM, so wall-clock time for a 50-file PR stays close to that of a 10-file PR instead of growing linearly with the number of chunks.

Deduplication store. Deduplication works at two levels. Redis holds idempotency keys, with TTL expiry so the store never grows unbounded, to prevent double-processing on webhook re-deliveries. The comment poster additionally hashes each comment body and compares it against the comments already on the PR, which prevents duplicate comments when a PR is force-pushed.

Comment poster. The GitHub review API accepts an array of per-line comments in a single request. Batch all comments for a review cycle into one POST rather than one per comment, to avoid hitting secondary rate limits.

Choosing Your Model: Qwen2.5-Coder 32B vs DeepSeek V3 vs Codestral

Code review requires identifying logic errors, not just completing tokens. Larger models do this better. A 7B model can catch obvious style issues; a 32B model catches subtle logic bugs and architectural problems. That gap matters for code review in a way it does not for autocomplete.

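| Model | HumanEval | Context window | VRAM (FP16) | VRAM (FP8) | Review quality notes |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder 32B | ~92.7% | 128K | ~65GB | ~33GB | Best single-GPU option; strong on multi-file logic and security issues |
| DeepSeek V3 | ~82.6% | 128K | ~1.3TB | ~670GB (native FP8); ~335GB at INT4 | General-purpose MoE, not a code specialist; needs a multi-GPU node even at reduced precision; strong reasoning depth |
| Codestral 22B | ~81.1% | 128K | ~44GB | ~22GB | Fits L40S 48GB at FP8; good for smaller teams on tighter budget |
| Granite Code 34B | ~82% | 128K | ~68GB | ~34GB | Fits H100 80GB at FP8; Apache 2.0 license, enterprise-friendly |

VRAM figures cover weights only, at roughly 2 bytes per parameter for FP16 and 1 byte per parameter for FP8.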

For a single H100 80GB, Qwen2.5-Coder 32B at FP8 is the right choice: it fits with room to spare for KV cache, delivers the best review quality, and handles 128K context for large files. For deeper GPU benchmark comparisons, see Best GPU for AI Inference in 2026.

For instruction following and returning structured JSON from review prompts, 32B-class models are significantly more reliable than 7B models. The quality gap is more pronounced for code review than autocomplete because review requires semantic reasoning about correctness, not next-token prediction.

Deploying vLLM as the Inference Backend

The full vLLM Production Deployment 2026 guide covers multi-GPU tensor parallelism and advanced tuning. This section covers the minimum setup for a code review agent.

1. Provision an H100 80GB instance on Spheron.

Log in to app.spheron.ai, select an H100 SXM5 80GB instance, and deploy Ubuntu 22.04 with CUDA 12.4. SSH in and verify GPU access:

bash
nvidia-smi
# Should show H100 80GB HBM3 with ~80GB VRAM

2. Install Docker with NVIDIA container support.

bash
# Install Docker
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
  -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

3. Launch vLLM with prefix caching enabled.

bash
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90

--enable-prefix-caching is important for a code review workload: the system prompt is identical across all review requests, so vLLM caches the prefill for it and only processes the diff tokens on each call. This reduces per-request latency by 20-40% once the cache is warm.

4. Verify the endpoint.

bash
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
# Expect 200; vLLM's /health returns an empty body once the server is ready

curl http://localhost:8000/v1/models
# Should list Qwen2.5-Coder-32B-Instruct

For an OpenAI-compatible proxy layer with auth and rate limiting, see Build a Self-Hosted OpenAI-Compatible API with vLLM.

Building the FastAPI GitHub App Webhook Handler

GitHub App registration.

Go to GitHub Settings > Developer settings > GitHub Apps > New GitHub App. Set:

  • Webhook URL: https://your-instance-ip/webhook
  • Permissions: Pull requests: Read & write, Contents: Read-only
  • Subscribe to events: Pull request

Download the private key PEM file and note your App ID and webhook secret.

FastAPI handler with HMAC validation.

python
import asyncio
import hashlib
import hmac
import os

import httpx
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()

WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]
GITHUB_APP_PRIVATE_KEY = os.environ["GITHUB_APP_PRIVATE_KEY"]
GITHUB_APP_ID = os.environ["GITHUB_APP_ID"]
VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")

_bg_tasks: set = set()


def verify_signature(payload: bytes, signature: str) -> bool:
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), payload, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)


@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    sig = request.headers.get("X-Hub-Signature-256", "")
    if not verify_signature(payload, sig):
        raise HTTPException(status_code=401, detail="Invalid signature")

    event = request.headers.get("X-GitHub-Event")
    if event != "pull_request":
        return {"status": "ignored"}

    body = await request.json()
    action = body.get("action")
    if action not in ("opened", "synchronize", "reopened"):
        return {"status": "ignored"}

    # Enqueue async processing and return 200 immediately
    t = asyncio.create_task(process_pr(body))
    _bg_tasks.add(t)
    t.add_done_callback(_bg_tasks.discard)
    return {"status": "queued"}
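
To exercise the signature check locally without GitHub, you can sign a test payload the same way GitHub does. This is a minimal sketch, assuming the handler was started with the same shared secret; `sign_payload` is a hypothetical helper name, not part of the handler above or of any GitHub SDK. The fake event uses the `closed` action so the handler ignores it after verifying the signature.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, payload: bytes) -> str:
    """Return the value GitHub would send in the X-Hub-Signature-256 header."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return "sha256=" + digest


# A fake pull_request event; "closed" is not in the handler's action list,
# so it is ignored once the signature has been accepted.
payload = json.dumps({"action": "closed"}).encode()
headers = {
    "X-GitHub-Event": "pull_request",
    "X-Hub-Signature-256": sign_payload("test-secret", payload),
    "Content-Type": "application/json",
}
```

Send `payload` as the raw request body with these headers (for example with curl or httpx). A request signed with a different secret should come back as HTTP 401.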

Async diff fetcher.

python
async def get_pr_files(
    owner: str,
    repo: str,
    pull_number: int,
    token: str
) -> list[dict]:
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pull_number}/files"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28"
    }
    files = []
    params: dict = {"per_page": 100, "page": 1}
    async with httpx.AsyncClient() as client:
        while url:
            resp = await client.get(url, headers=headers, params=params)
            resp.raise_for_status()
            files.extend(resp.json())
            next_url = resp.links.get("next", {}).get("url")
            url = next_url
            params = {}
    return files
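
Each file object's `patch` field is a unified diff, and GitHub generally rejects a review whose inline comment points at a line outside that diff. The sketch below derives the commentable new-file line numbers from the hunk headers so model output can be filtered before posting. `right_side_lines` is a hypothetical helper, not part of the GitHub API, and it assumes a comment's `line` refers to the new version of the file (side RIGHT).

```python
import re

# Matches hunk headers such as "@@ -10,2 +11,2 @@" and captures the new-file start line.
_HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def right_side_lines(patch: str) -> set[int]:
    """Line numbers in the new file that appear in a unified diff patch."""
    lines: set[int] = set()
    new_line = None  # current new-file line number; None until the first hunk
    for raw in patch.splitlines():
        header = _HUNK_HEADER.match(raw)
        if header:
            new_line = int(header.group(1))
            continue
        if new_line is None:
            continue  # text before the first hunk header
        if raw.startswith("-"):
            continue  # removed line: exists only on the LEFT side
        if raw.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        lines.add(new_line)  # added ("+") or unchanged context line
        new_line += 1
    return lines


example_patch = (
    "@@ -1,3 +1,4 @@\n import os\n-x = 1\n+x = 2\n+y = 3\n print(x)\n"
    "@@ -10,2 +11,2 @@\n-a\n+b\n c"
)
print(sorted(right_side_lines(example_patch)))  # [1, 2, 3, 4, 11, 12]
```

A comment whose `line` is not in the set for its file can be dropped or moved into the review body instead of failing the whole review POST.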

Diff chunker with Qwen tokenizer.

Use the model's own tokenizer for accurate token counts. Qwen2.5-Coder uses a ~150k-token vocabulary that differs from GPT-4's cl100k_base, so using tiktoken with "gpt-4" would produce inaccurate chunk sizes and risk exceeding vLLM's --max-model-len limit.

python
from transformers import AutoTokenizer

CHUNK_TOKENS = 6000
OVERLAP_TOKENS = 512

_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

def chunk_diff(files: list[dict]) -> list[str]:
    enc = _tokenizer
    chunks = []
    current_chunk = []
    current_count = 0

    for f in files:
        filename = f["filename"]
        patch = f.get("patch", "")
        if not patch:
            continue
        file_text = f"### File: {filename}\n```\n{patch}\n```\n"
        token_ids = enc.encode(file_text)

        if current_count + len(token_ids) > CHUNK_TOKENS and current_chunk:
            chunks.append("".join(current_chunk))
            # Keep last OVERLAP_TOKENS worth of content for context
            overlap_text = "".join(current_chunk)
            overlap_ids = enc.encode(overlap_text)[-OVERLAP_TOKENS:]
            current_chunk = [enc.decode(overlap_ids)]
            current_count = len(overlap_ids)

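        # If a single file exceeds CHUNK_TOKENS on its own, split it into
        # overlapping windows. Windows after the first lose the "### File:"
        # header, so re-attach the filename if the model needs it.
        if len(token_ids) > CHUNK_TOKENS:
            step = CHUNK_TOKENS - OVERLAP_TOKENS
            for start in range(0, len(token_ids), step):
                chunks.append(enc.decode(token_ids[start:start + CHUNK_TOKENS]))
                if start + CHUNK_TOKENS >= len(token_ids):
                    break  # this window already reached the end of the file
            current_chunk = []
            current_count = 0
            continue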

        current_chunk.append(file_text)
        current_count += len(token_ids)

    if current_chunk:
        chunks.append("".join(current_chunk))

    return chunks
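
The windowing used for oversized files is easier to see on plain integers than on real token ids. A minimal sketch with no tokenizer involved; `sliding_windows` is a hypothetical name, and the small sizes are chosen only to make the overlap visible.

```python
def sliding_windows(ids: list[int], size: int, overlap: int) -> list[list[int]]:
    """Split ids into windows of at most `size`, each sharing `overlap` items with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window start advances each time
    windows = []
    for start in range(0, len(ids), step):
        windows.append(ids[start:start + size])
        if start + size >= len(ids):
            break  # the window already covers the tail
    return windows


windows = sliding_windows(list(range(10)), size=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the guide's values (6,000-token windows, 512-token overlap) the stride is 5,488 tokens, so a 20K-token file yields four windows.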

vLLM review call with structured output prompt.

python
from openai import AsyncOpenAI

REVIEW_SYSTEM_PROMPT = """You are a senior software engineer performing a code review.
Analyze the provided code diff and return a JSON array of review comments.

Each comment must have this structure:
{
  "path": "<file path>",
  "line": <line number in the new version of the file (the "+" side of the diff), or null for general comments>,
  "severity": "error" | "warning" | "suggestion",
  "category": "security" | "logic" | "performance" | "style" | "documentation",
  "body": "<comment text>"
}

Focus on:
- Logic errors and off-by-one bugs
- Security issues (injection, auth bypass, unvalidated input)
- Performance problems (N+1 queries, unnecessary allocations)
- Missing error handling

Return only the JSON array, no other text."""


async def get_review_comments(diff_chunk: str) -> list[dict]:
    client = AsyncOpenAI(base_url=VLLM_BASE_URL, api_key="not-needed")
    response = await client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct",
        messages=[
            {"role": "system", "content": REVIEW_SYSTEM_PROMPT},
            {"role": "user", "content": diff_chunk}
        ],
        temperature=0.1,
        max_tokens=2048
    )
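    import json
    import re
    if not response.choices or response.choices[0].message.content is None:
        print("Empty response for chunk")
        return []
    content = response.choices[0].message.content.strip()
    # Strip a Markdown code fence if the model wrapped its JSON in one
    content = re.sub(r'^```[a-zA-Z]*\n|```$', '', content, flags=re.MULTILINE).strip()
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as e:
        print(f"Failed to parse review response as JSON: {e!r}")
        return []
    # A single object instead of an array is treated as a one-item list
    if isinstance(parsed, dict):
        parsed = [parsed]
    return parsed if isinstance(parsed, list) else []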
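
Valid JSON is not necessarily valid review data: the poster later reads `c['severity']`, `c['path']`, and `c['body']`, so one malformed item can fail a whole review. A minimal validation sketch, assuming the schema from REVIEW_SYSTEM_PROMPT; `validate_comments` and `ALLOWED_SEVERITIES` are hypothetical names.

```python
ALLOWED_SEVERITIES = {"error", "warning", "suggestion"}


def validate_comments(raw: object) -> list[dict]:
    """Keep only items that have the fields the comment poster relies on."""
    if not isinstance(raw, list):
        return []
    valid = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        path, body = item.get("path"), item.get("body")
        severity, line = item.get("severity"), item.get("line")
        if not isinstance(path, str) or not path:
            continue
        if not isinstance(body, str) or not body.strip():
            continue
        if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
            continue
        # line may be null (general comment) or a positive integer;
        # bool is a subclass of int, so reject it explicitly
        if line is not None and (
            isinstance(line, bool) or not isinstance(line, int) or line < 1
        ):
            continue
        valid.append(item)
    return valid


good = {"path": "a.py", "line": 3, "severity": "error",
        "category": "logic", "body": "Off-by-one in the loop bound."}
print(len(validate_comments([good, "junk", {"path": "a.py"}])))  # 1
```

Calling this on the parsed output before extending `all_comments` keeps malformed items out of the review payload.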

Parallel chunk processing and comment poster.

python
import redis.asyncio as aioredis
import json

redis_client = aioredis.Redis(host="localhost", port=6379)

async def process_pr(body: dict):
    repo = body["repository"]
    pr = body["pull_request"]
    owner = repo["owner"]["login"]
    repo_name = repo["name"]
    pull_number = pr["number"]
    head_sha = pr["head"]["sha"]
    installation_id = body["installation"]["id"]

    # Idempotency: two-state key ("processing" | "done") so failures are retryable.
    # "done" is written only after post_review succeeds. "processing" gets a short
    # TTL (longer than the job timeout) so that a hard kill, such as a spot-instance
    # preemption where the except block below never runs, releases the claim within
    # minutes and the next webhook re-delivery or ARQ retry runs the review cleanly.
    idem_key = f"reviewed:{owner}/{repo_name}:{pull_number}:{head_sha}"
    existing = await redis_client.get(idem_key)
    if existing == b"done":
        return  # Already reviewed successfully
    if not await redis_client.set(idem_key, "processing", nx=True, ex=10 * 60):
        return  # Another worker already claimed this job

    try:
        token = await get_installation_token(installation_id)
        files = await get_pr_files(owner, repo_name, pull_number, token)
        chunks = chunk_diff(files)

        # Process chunks in parallel
        all_comments = []
        results = await asyncio.gather(
            *[get_review_comments(chunk) for chunk in chunks],
            return_exceptions=True
        )
        for result in results:
            if isinstance(result, list):
                all_comments.extend(result)

        if not all_comments:
            await redis_client.delete(idem_key)
            return

        await post_review(owner, repo_name, pull_number, head_sha, all_comments, token)
        await redis_client.set(idem_key, "done", ex=72 * 3600)
    except Exception:
        await redis_client.delete(idem_key)
        raise


async def post_review(
    owner: str,
    repo: str,
    pull_number: int,
    commit_id: str,
    comments: list[dict],
    token: str
):
    # Fetch existing comments to deduplicate
    existing_url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pull_number}/comments"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28"
    }
    async with httpx.AsyncClient() as client:
        existing = []
        next_url: str | None = f"{existing_url}?per_page=100"
        while next_url:
            resp = await client.get(next_url, headers=headers)
            if resp.status_code == 200:
                existing.extend(resp.json())
                next_url = resp.links.get("next", {}).get("url")
            else:
                break

    existing_hashes = {
        hashlib.sha256(c["body"].encode()).hexdigest()
        for c in existing
    }

    formatted = []
    general_bodies = []
    for c in comments:
        label = f"**[{c['severity'].upper()}]** "
        body_hash = hashlib.sha256((label + c["body"]).encode()).hexdigest()
        if body_hash in existing_hashes:
            continue  # Skip duplicate
        if c.get("line") is not None:
            formatted.append({
                "path": c["path"],
                "line": c["line"],
                "side": "RIGHT",
                "body": label + c["body"]
            })
        else:
            # General file-level comment: post in the review body, not the comments array.
            general_bodies.append(f"{label}`{c['path']}`: {c['body']}")

    if not formatted and not general_bodies:
        return

    review_url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pull_number}/reviews"
    review_body = {
        "commit_id": commit_id,
        "event": "COMMENT",
        "body": "\n\n".join(general_bodies),
        "comments": formatted
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(review_url, headers=headers, json=review_body)
        resp.raise_for_status()

For GitLab, the equivalent is the merge request webhook with object_kind: merge_request. The diff fetcher uses GET /api/v4/projects/{id}/merge_requests/{mr_iid}/changes. The comment poster uses POST /api/v4/projects/{id}/merge_requests/{mr_iid}/notes. The diff chunker and vLLM call are identical.

GPU Sizing: VRAM Math and Concurrency

VRAM breakdown for Qwen2.5-Coder 32B at FP8:

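  • Model weights: ~33GB (about 1 byte per parameter at FP8)
  • KV cache for 8 concurrent 6K-token contexts: 48K tokens x 2 (K and V) x 64 layers x 8 KV heads x 128 head_dim x 1 byte (FP8 KV cache) = ~6GB, or ~13GB with a 2-byte FP16 KV cache
  • Total: ~39GB (~46GB with an FP16 KV cache), well within an H100 80GB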
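
The KV-cache figure can be checked with a few lines of arithmetic. This sketch assumes the published Qwen2.5-32B attention shape (64 layers, 8 KV heads under grouped-query attention, head_dim 128), which this guide does not state; swap in the values from your model's config.json if they differ. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache needed to hold `tokens` tokens across all sequences."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return tokens * 2 * layers * kv_heads * head_dim * bytes_per_value


# 8 concurrent review contexts of 6,000 tokens each
tokens = 8 * 6000

# Assumed model shape: 64 layers, 8 KV heads, head_dim 128
fp16_cache = kv_cache_bytes(tokens, 64, 8, 128, bytes_per_value=2)
fp8_cache = kv_cache_bytes(tokens, 64, 8, 128, bytes_per_value=1)

print(f"FP16 KV cache: {fp16_cache / 1e9:.1f} GB")  # 12.6 GB
print(f"FP8 KV cache:  {fp8_cache / 1e9:.1f} GB")   # 6.3 GB
```

Either way the cache plus FP8 weights stays well under 80GB, which is why a single H100 has headroom for longer contexts or more concurrent reviews.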

For a deeper VRAM formula and worked examples at other model sizes, see GPU Memory Requirements for LLMs.

| Team size | Avg PRs/day | Peak concurrent reviews | Recommended GPU | Spot price | Est. monthly |
| --- | --- | --- | --- | --- | --- |
| 10 engineers | ~20 | 3 | 1x L40S 48GB | $0.61/hr | ~$439/mo |
| 50 engineers | ~100 | 8 | 1x H100 SXM5 80GB | $1.49/hr | ~$1,074/mo |
| 200 engineers | ~400 | 30 | 2x H100 SXM5 80GB (TP=2) | $2.98/hr | ~$2,147/mo |

Pricing fluctuates based on GPU availability. The prices above are as of 07 Jun 2026 and may have changed. Check current GPU pricing for live rates.

For Codestral 22B at FP8 on an L40S 48GB: weights are ~22GB, and KV cache for 3 concurrent contexts adds ~2.5GB. The total of ~24.5GB fits on an L40S with room for additional concurrent requests.

For the 2x H100 tensor parallel setup, launch vLLM with --tensor-parallel-size 2. Both GPUs are visible via NVIDIA_VISIBLE_DEVICES=0,1 in the Docker environment. The model is sharded across both GPUs; inference throughput roughly doubles for large prefill operations.

Latency Budgets: Keeping PR Review Under 60 Seconds

Budget breakdown for a typical 3-file, 200-line PR:

| Stage | Latency |
| --- | --- |
| Webhook receive to diff fetch (GitHub API) | ~2s |
| Diff parse + chunking | ~50ms |
| vLLM prefill for 6K-token chunk (H100 FP8) | ~3-5s |
| vLLM decode for ~500-token review response | ~4-6s |
| Comment formatting + Redis dedup check | ~100ms |
| GitHub API review POST | ~500ms |
| Total (1-chunk PR) | ~10-14s |

For a large 50-file PR with 8 chunks: process in parallel with asyncio.gather. Each chunk takes ~10-14s; parallel execution means wall-clock time is close to a single-chunk time, not 8x. Total: ~20-25s. The 60-second budget is met by a wide margin.
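
A plain asyncio.gather launches every chunk at once. If you want to cap in-flight requests at the concurrency the GPU was sized for, a semaphore is one way to do it. This is a sketch, not part of the handler above: `gather_bounded` is a hypothetical helper, and `fake_review` stands in for get_review_comments so the example runs without a vLLM endpoint.

```python
import asyncio


async def gather_bounded(items, worker, limit: int):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await worker(item)

    # Results come back in input order; exceptions are returned, not raised.
    return await asyncio.gather(*(run(i) for i in items), return_exceptions=True)


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_review(chunk):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the inference call
        in_flight -= 1
        return [f"comment for chunk {chunk}"]

    results = await gather_bounded(range(8), fake_review, limit=3)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), peak)  # 8 3
```

In process_pr this would replace the bare asyncio.gather call, with `limit` set to the peak concurrent reviews you sized the GPU for.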

For P99 latency modeling and SLO engineering for TTFT and ITL, see LLM Inference SLO Engineering: TTFT, ITL, and P99 Latency Budgets.

Cost Breakeven: Self-Hosted vs CodeRabbit, Greptile, and Sweep AI

Worked example: 50 engineers, 2 PRs/engineer/day (the ~100 PRs/day from the sizing table), 20 working days/month = 2,000 PRs/month.

| Tier | SaaS cost (50 engineers) | Self-hosted cost | Monthly savings |
| --- | --- | --- | --- |
| CodeRabbit Pro ($24/seat, annual) | $1,200/mo | $1,154/mo | $46/mo |
| Greptile (est. $25-30/seat) | $1,250-1,500/mo | $1,154/mo | $96-346/mo |
| Sweep AI (per-PR model, est.) | variable | $1,154/mo | variable |

Self-hosted monthly: H100 spot ($1,074) + Redis instance (~$30) + FastAPI CPU host (~$50) = $1,154/month. These figures use spot pricing at $1.49/hr. Spot instances are preemptible, but for a webhook-driven workload this is acceptable: the Redis idempotency key ensures any webhook re-delivery after preemption re-runs the review cleanly, and ARQ retries with backoff handle transient GPU unavailability.

Breakeven calculation: Self-hosting breaks even against CodeRabbit Pro ($24/seat/month annual) at approximately 48 engineers ($1,154 / $24 per seat). At 100 engineers, you save $1,246/month. At 200 engineers, CodeRabbit costs $4,800/month; self-hosting on two H100s costs $2,227/month, a saving of $2,573/month.
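
The breakeven arithmetic is simple enough to rerun with your own seat price and GPU rate. A minimal sketch using the figures from this section; the function names are hypothetical.

```python
import math


def breakeven_seats(self_hosted_monthly: float, price_per_seat: float) -> int:
    """Smallest whole team size at which SaaS seats cost at least as much as self-hosting."""
    return math.ceil(self_hosted_monthly / price_per_seat)


def monthly_savings(seats: int, price_per_seat: float, self_hosted_monthly: float) -> float:
    """Positive when self-hosting is cheaper than paying per seat."""
    return seats * price_per_seat - self_hosted_monthly


# 1154 / 24 = 48.08, so 49 is the first whole headcount where self-hosting wins
print(breakeven_seats(1154, 24))       # 49
print(monthly_savings(50, 24, 1154))   # 46
print(monthly_savings(100, 24, 1154))  # 1246
print(monthly_savings(200, 24, 2227))  # 2573 (two H100s plus Redis and CPU host)
```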

The crossover point shifts earlier if you need privacy compliance, since keeping inference on infrastructure you control has a dollar value independent of headcount.

Production Hardening

Idempotency

GitHub re-delivers webhooks when your endpoint returns a non-2xx response or times out. Without idempotency, a transient failure creates duplicate review threads on the same PR. Use a single atomic SET NX EX call with the composite key {owner}/{repo}:{pull_number}:{head_sha} and TTL=72h. Using separate SETNX and EXPIRE calls is not safe: if the process crashes between them, the key persists without a TTL and permanently blocks re-review for that commit.

A single "set-and-forget" key has one flaw: if processing fails mid-way (spot-instance preemption, GitHub API error, vLLM OOM), the key is already set and subsequent retries skip the PR for the full TTL window. Use a two-state key (processing vs done) so only a successfully completed review is treated as final. Failed jobs delete the key so the next retry can proceed, and the processing state carries a short TTL (longer than the job timeout) so a worker that is killed before it can clean up, as happens on preemption, does not block retries for 72 hours.

python
idem_key = f"reviewed:{owner}/{repo_name}:{pull_number}:{head_sha}"
existing = await redis_client.get(idem_key)
if existing == b"done":
    return  # Already reviewed successfully
# Short TTL on the claim: it expires on its own if the worker dies mid-job
if not await redis_client.set(idem_key, "processing", nx=True, ex=10 * 60):
    return  # Another worker already claimed this job

try:
    # ... do all work ...
    await redis_client.set(idem_key, "done", ex=72 * 3600)
except Exception:
    await redis_client.delete(idem_key)  # Allow retries on failure
    raise

The 72-hour TTL covers redelivery windows without growing the key space indefinitely.

Retry Queues

GitHub's webhook timeout is 10 seconds. The diff fetch plus inference for a large PR takes longer than that. The pattern is: validate the signature, enqueue a job, return HTTP 200, then let a worker process the job asynchronously.

With ARQ (async Python, minimal dependencies):

python
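# worker.py
from arq import Retry
from arq.connections import RedisSettings

# Delay before each retry, indexed by the attempt that just failed (1-based).
BACKOFF_SECONDS = (30, 120, 300)


async def process_pr_job(ctx, body: dict):
    try:
        await process_pr(body)
    except Exception as exc:
        # ARQ re-runs a job only when it raises Retry (or is cancelled at
        # shutdown), so convert failures into a deferred retry.
        attempt = ctx["job_try"]
        delay = BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS)) - 1]
        raise Retry(defer=delay) from exc


class WorkerSettings:
    functions = [process_pr_job]
    redis_settings = RedisSettings()
    retry_jobs = True
    max_tries = 4  # one initial attempt plus three retries
    job_timeout = 300  # 5-minute max per job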

Exponential backoff between retries (30s, 120s, 300s) avoids hammering the GitHub API or vLLM endpoint during a transient outage.

Comment Deduplication

Force-pushes generate a new head_sha but the diff can be logically identical to the previous commit. Because head_sha is part of the idempotency key, the Redis check only catches re-deliveries for the same commit; a force-push always produces a new key and a new review run. The body-hash comparison on comment POST is what stops that run from posting the same comments again (different SHA, same review content).

Secrets Handling

The GitHub App private key and webhook secret must be isolated from the vLLM process. The vLLM container needs no external secrets at all. Use separate environment namespaces:

  • Kubernetes: mount the GitHub App credentials as a Kubernetes Secret in the FastAPI pod only; the vLLM pod has no reference to it.
  • Standalone VMs: use process-level environment isolation. Store credentials in HashiCorp Vault or AWS Secrets Manager. The FastAPI systemd unit reads credentials at startup via vault kv get; vLLM runs as a separate user with no Vault access.
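  • Rotate the webhook secret quarterly. GitHub starts signing deliveries with the new secret as soon as you save it in the GitHub App settings, so any delivery that arrives while the FastAPI deployment still holds the other secret fails verification. Update both sides back to back, then redeliver any failed events from the app's recent deliveries list to close the gap.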
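
One way to rotate without dropping deliveries is to let the handler accept either the outgoing or the incoming secret for a short window. This is a sketch under that assumption, not how the handler above is written; `verify_signature_any` is a hypothetical helper, and how you load the two secrets (for example from two environment variables) is up to your deployment.

```python
import hashlib
import hmac


def verify_signature_any(payload: bytes, signature: str, secrets: list[str]) -> bool:
    """Accept a delivery signed with any of the currently valid secrets."""
    ok = False
    for secret in secrets:
        expected = "sha256=" + hmac.new(
            secret.encode(), payload, hashlib.sha256
        ).hexdigest()
        # Check every candidate instead of returning early, so response
        # timing does not reveal which secret matched.
        ok |= hmac.compare_digest(expected.encode(), signature.encode())
    return ok


body = b'{"action":"opened"}'
old_sig = "sha256=" + hmac.new(b"old-secret", body, hashlib.sha256).hexdigest()
print(verify_signature_any(body, old_sig, ["new-secret", "old-secret"]))  # True
print(verify_signature_any(body, old_sig, ["new-secret"]))                # False
```

Once GitHub has been sending the new secret for a while, drop the old one from the list so only one secret remains valid.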

Engineering teams running self-hosted AI code review typically start with a single H100 SXM5 instance and scale to a second node once the team crosses 100 engineers. Spheron's bare-metal GPU pricing keeps monthly inference costs predictable rather than tied to PR volume.

Quick Setup Guide

  1. Provision a GPU instance on Spheron

    Log in to app.spheron.ai, select an H100 SXM5 80GB instance (on-demand or spot), and deploy Ubuntu 22.04 with CUDA 12.4 pre-installed. An SSH key is the only credential you need at this step. Per-minute billing means the instance costs nothing while you are not running it.

  2. Deploy vLLM with Qwen2.5-Coder 32B or DeepSeek V3

    Install Docker with NVIDIA container support, then launch vLLM in a container with --model Qwen/Qwen2.5-Coder-32B-Instruct, --quantization fp8, --tensor-parallel-size 1, and --max-model-len 32768. Verify with curl http://localhost:8000/health before proceeding.

  3. Build the FastAPI GitHub App webhook handler

    Register a GitHub App with pull_requests: write and contents: read permissions, set the webhook URL to your Spheron instance, and build a FastAPI handler that validates the HMAC-SHA256 signature on every incoming event. The handler enqueues the diff-processing job and returns HTTP 200 immediately within GitHub's 10-second timeout.

  4. Wire up PR diff context windowing

    Fetch changed files from GET /repos/{owner}/{repo}/pulls/{pull_number}/files, count tokens with AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-32B-Instruct'), and chunk diffs at 6K tokens with 512-token overlap. Send each chunk to the vLLM OpenAI-compatible endpoint with a system prompt instructing the model to return structured review comments as JSON.

  5. Configure inline comment posting via GitHub API

    Format vLLM output as GitHub review comments with file path, line position, and body text. POST to /repos/{owner}/{repo}/pulls/{pull_number}/reviews with event: COMMENT and the comments array. Fetch existing comments first and compare body hashes to skip duplicates when a PR is force-pushed.

  6. Harden for production: idempotency, retry queues, dedup

    Store (repo, pull_number, head_sha) tuples in Redis with a single atomic SET NX EX call (TTL=72h) to prevent duplicate processing on webhook re-deliveries. Use a Celery or ARQ worker for async processing with three retries at 30/120/300 second intervals. Store GitHub App private keys in a secrets manager, never in environment variables accessible to the vLLM process.

Frequently Asked Questions

A single H100 SXM5 on Spheron at $1.49/hr spot costs roughly $1,074/month. Add $80/month for Redis and a small CPU instance for the FastAPI handler. At $1,154/month total, self-hosting breaks even against CodeRabbit Pro ($24/seat/month annual) at around 48 engineers. Above that headcount, you save money every month while keeping all source code off third-party servers. Spot instances are preemptible, but idempotency via Redis means a webhook re-delivered after a preemption is processed cleanly without duplicate reviews.

Qwen2.5-Coder 32B at FP8 is the best single-GPU choice for code review in 2026. Its weights take ~33GB of VRAM on an H100 80GB, leaving room for 8+ concurrent KV caches, and it outperforms 7B models on logic-error identification, not just style issues. DeepSeek V3 is a viable alternative only if you already run a multi-GPU node, since even at INT4 its weights need roughly 335GB. For teams on a budget with sub-20-engineer scale, Codestral 22B on a single L40S is a workable starting point.

Qwen2.5-Coder 32B at FP8 uses about 33GB for weights. At 8 concurrent 6K-token review contexts, the KV cache adds roughly 6GB with an FP8 KV cache (about 13GB at FP16), for a total of ~39-46GB. An H100 80GB handles this comfortably. For smaller teams (under 20 engineers), Codestral 22B at FP8 needs ~22GB for weights and runs on an L40S 48GB.

A typical 3-file, 200-line PR takes one inference chunk. With Qwen2.5-Coder 32B on an H100 at FP8, the full pipeline completes in about 10-14 seconds: ~2s for diff fetch, ~7-11s for vLLM prefill and decode, and under 1s for comment formatting and the GitHub API post. Large PRs (50 files, ~800 lines) split into ~8 chunks processed with asyncio.gather in parallel, completing in about 20-25 seconds. The 60-second budget is comfortable with a single H100.

Self-hosting on a single H100 SXM5 at spot pricing ($1.49/hr) breaks even against CodeRabbit Pro ($24/seat/month annual) at approximately 48 engineers. Below that, CodeRabbit is cheaper once you factor in the fixed GPU cost. Spot instances are preemptible; for this workload that is acceptable because Redis idempotency keys ensure re-delivered webhooks do not trigger duplicate reviews. The economics shift earlier if your team has data privacy requirements, in which case the compliance value makes self-hosting worthwhile regardless of headcount.
