Tutorial

GPU Infrastructure Behind AI Coding Tools: Cursor, Claude Code, and GitHub Copilot in 2026

Written by Mitrasish, Co-founder | Mar 30, 2026

85% of developers regularly use AI coding tools. Cursor reached a $29B valuation with 40,000 NVIDIA engineers on the platform. Every autocomplete request, every agentic PR, every multi-file edit runs on GPU infrastructure somewhere. This post covers what that infrastructure actually looks like, and when it makes more sense to run your own.

For a complete guide to setting up your own coding assistant from scratch with Tabby or Continue, see Self-Host Your AI Coding Assistant on GPU Cloud.

The GPU Infrastructure Powering AI Coding Tools

AI coding tools operate across three distinct compute layers:

1. Token generation layer - the LLM inference itself. This runs on A100 or H100 clusters, handling the actual forward passes that produce completions. The bottleneck here is VRAM bandwidth and compute: how fast you can move weights and activations.

2. Context retrieval layer - embeddings and vector search for repository-aware completions. When a tool understands your codebase, it's querying a vector index of your files, functions, and docstrings in real time. This is less GPU-intensive than generation but adds latency.

3. Orchestration layer - routing, caching, and session state. This runs on CPU-based infrastructure but determines whether a request hits a warm KV cache or requires a full prefill.

The latency constraints are what make coding tools different from most LLM applications:

| Use Case | Target TTFT | Acceptable P99 | GPU Implication |
| --- | --- | --- | --- |
| Single-line autocomplete | <100ms | <200ms | Prefill-optimized, small context |
| Chat / explain | <500ms | <1,500ms | Balanced |
| Agentic PR generation | 2-30s | 60s | Decode-optimized, long context |

Autocomplete is the hardest constraint. At sub-100ms TTFT, you're fighting physics: a network round trip from a developer in US-East to a distant data center adds 20-50ms before you've even touched the GPU. Tools like Cursor mitigate this with aggressive speculative prefetching, predicting what you're about to type and pre-running inference.

Cursor, Claude Code, and GitHub Copilot: Architecture Breakdown

Each major tool has different infrastructure priorities based on its product design.

| Tool | Primary Model | Context Used | Target TTFT | GPU Tier (est.) |
| --- | --- | --- | --- | --- |
| Cursor (autocomplete) | Claude Sonnet 4.5/4.6 / custom | 2k-8k tokens | <80ms | H100 clusters |
| Cursor (agent) | Claude Opus 4.6 / GPT-5.4 | 32k-200k tokens | 1-10s | H100 NVLink |
| Claude Code | Claude Opus 4.6 / Sonnet 4.6 | Up to 1M tokens | 2-30s | H100 NVLink |
| GitHub Copilot | GPT-5.x Codex variants | 2k-4k (completions), 64k+ (chat) | <100ms | A100/H100 |

Cursor runs a hybrid architecture: a fast, fine-tuned model for line-level autocomplete, and a slower frontier model (Claude Opus 4.6, GPT-5.4) for agentic edits. The autocomplete path is heavily optimized for TTFT; the agent path optimizes for output quality over raw speed.

Claude Code is the most context-hungry of the three. It reads your entire repository state before responding, using up to 1M tokens of context per request. That requires H100 NVLink nodes for fast inter-GPU attention, since a long-context prefill across a large model won't fit on a single GPU. Latency expectations are also more relaxed, since it operates more like a collaborator than a keystroke predictor.

GitHub Copilot sits closer to the autocomplete end of the spectrum. Inline completions use tight context windows (2k-4k tokens), while Copilot Chat supports 64k+ tokens via GPT-4o. High request volume across millions of users runs on A100/H100 clusters managed by Azure. GitHub is less transparent than Cursor or Anthropic about the specific models behind completions.

Cursor's enterprise self-hosted deployment option and JetBrains Central (early access in 2026) mark a real shift: teams with data privacy requirements can now run these tools on their own infrastructure. For a deeper look at agentic workload sizing, see the GPU infrastructure guide for AI agents.

Why Engineering Teams Are Moving to Self-Hosted

Three concrete reasons, with numbers.

1. Code privacy

Every autocomplete request sends a snippet of your code to a third-party API. For most web apps this is fine. For a financial institution, a defense contractor, or any team working on IP-sensitive code, it's a hard blocker. Self-hosting keeps inference entirely on your infrastructure - nothing leaves your cloud.

2. Cost at scale

Cursor Team is $40/seat/month. 100 engineers equals $4,000/month, or $48,000/year. A self-hosted setup on Spheron covering the same team runs on a 4x H100 node. At $2.40/hr per H100 SXM5, a 4x H100 node (the minimum for DeepSeek V3 at 4-bit) costs about $7,008/month on-demand, breaking even with around 176 Cursor Team seats. On spot pricing, a 4x H100 node runs ~$2,832/month, shifting the break-even to ~71 seats. For general cost optimization strategies, the GPU Cost Optimization Playbook covers spot instances and reserved capacity patterns.
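
As a quick sanity check on that math, here's a small bash sketch using the on-demand and spot rates quoted in this post (plug in current pricing before relying on it):

bash
# Break-even vs Cursor Team at $40/seat/month, for a 4x H100 SXM5 node at 730 hrs/month.
# Rates below are the ones quoted in this post - adjust to current Spheron pricing.
echo "on-demand: \$$(echo "scale=0; 9.60*730/1" | bc)/month = $(echo "scale=1; 9.60*730/40" | bc) seats"
echo "spot:      \$$(echo "scale=0; 3.88*730/1" | bc)/month = $(echo "scale=1; 3.88*730/40" | bc) seats"
# Prints 175.2 and 70.8 - round up to 176 and 71 seats before self-hosting wins.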

3. Model control

You can fine-tune on your internal codebase. Use your own rate limits. No degradation when the provider's traffic spikes. No model deprecation announcements forcing you to retest your tooling.

Open-Source Coding Models in 2026

Three models dominate self-hosted coding deployments in 2026:

| Model | Params | VRAM (bf16) | VRAM (4-bit) | HumanEval+† | SWE-bench | License |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek V3 | 671B (37B active MoE) | ~17x H100 (~1.34 TB) | 4x H100 | 82.6% | - | DeepSeek License |
| Qwen 2.5-Coder 32B | 32B | ~64GB (1x A100) | ~20GB (1x L40S) | 87.2% | - | Apache 2.0 |
| Qwen 2.5-Coder 7B | 7B | ~14GB | ~6GB | 84.1% | - | Apache 2.0 |
| StarCoder 2 15B | 15B | ~30GB | ~10GB | 72.6% | - | BigCode OpenRAIL |

†HumanEval+ (EvalPlus) is a Python-only pass@1 benchmark and a stricter superset of the original HumanEval. All scores use the instruct/chat variant of each model (the version you actually deploy). Base model scores are lower across the board (e.g. StarCoder 2 15B base scores 46.3%; DeepSeek V3 base scores around 65%).

For most teams, Qwen 2.5-Coder 32B is the right choice. Strong benchmark performance (87.2% HumanEval+), single-GPU deployment with 4-bit quantization on an L40S or A100, and Apache 2.0 license for commercial use without restrictions.

DeepSeek V3 is the highest-quality option but needs at least 4x H100 at 4-bit quantization. It's worth it for teams with 50+ developers where per-seat SaaS costs clearly exceed infrastructure costs.

StarCoder 2 15B is included here for completeness, but at 72.6% HumanEval+ (instruct) it's not competitive with Qwen 2.5-Coder for most use cases. Its main appeal is the BigCode OpenRAIL license and smaller VRAM footprint.

The Devstral deployment guide covers the full vLLM setup for Mistral's Devstral 24B on Spheron GPU cloud, including SWE-bench performance context and team cost math.

For the full VRAM requirements breakdown across all models, see the GPU requirements cheat sheet.

Deploy Your Own AI Coding Assistant on Spheron GPU Cloud

Here's the full setup from a fresh instance to a working coding assistant your IDE plugins can connect to.

Step 1: Provision a GPU instance

bash
# On app.spheron.ai:
# - Select GPU: L40S (small teams), H100 SXM5 (20+ devs), 4x H100 (50+ devs for DeepSeek V3)
# - Image: Ubuntu 22.04 + CUDA 12.4
# - After SSH:
nvidia-smi

Pick L40S for 1-10 developers running Qwen 2.5-Coder 32B at 4-bit. Pick a single H100 SXM5 for teams up to 20 developers with Qwen 2.5-Coder 32B in bf16. Use 4x H100 for DeepSeek V3 with 4-bit quantization (671B params at 0.5 bytes/param requires ~335 GB).
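
Those VRAM figures follow from a simple weights-only estimate - parameter count times bytes per parameter - with the KV cache and activations needing extra headroom on top. A rough sketch:

bash
# Weights-only VRAM estimate: params x bytes/param (KV cache and activations are extra).
estimate() { echo "$1: $(echo "scale=0; $2 * $3 / 1000000000" | bc) GB"; }
estimate "Qwen 2.5-Coder 32B, bf16 "  32000000000  2
estimate "Qwen 2.5-Coder 32B, 4-bit"  32000000000  0.5
estimate "DeepSeek V3, 4-bit       "  671000000000 0.5
# Prints ~64 GB, ~16 GB, and ~335 GB respectively.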

Step 2: Install vLLM

bash
pip install "vllm>=0.6.0" "torch>=2.4.0"

# Verify:
python -c "import vllm; print(vllm.__version__)"

Step 3a: Deploy Qwen 2.5-Coder 32B on a single H100

Bind vLLM to 127.0.0.1 so port 8000 is only reachable from the local machine. NGINX (Step 5) handles all external traffic. Do not change this to 0.0.0.0 unless you have a firewall rule blocking port 8000 from external access.

bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --served-model-name coding-assistant \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 127.0.0.1 \
  --port 8000
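
Because the server only listens on 127.0.0.1, the quickest smoke test is a curl from the instance itself (or through the SSH tunnel set up in Step 4):

bash
# Run on the GPU instance itself; the model name matches --served-model-name above.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding-assistant",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 128
      }'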

Step 3b: Deploy DeepSeek V3 on 4x H100

The full-precision DeepSeek V3 model (bf16) requires roughly 1.34 TB of VRAM - more than even a standard 8x H100 node (640 GB) can hold - so use a pre-quantized AWQ version instead. A naive 4-bit estimate (671B params × 0.5 bytes/param) puts the ceiling at ~335 GB, which would exceed 4x H100 SXM5 (4 × 80 GB = 320 GB). The community AWQ checkpoint avoids this: cognitivecomputations/DeepSeek-V3-AWQ uses mixed-precision quantization (sub-4-bit for most layers) that brings the actual weight footprint to approximately 270-290 GB, comfortably within 320 GB. There is no official deepseek-ai/DeepSeek-V3-AWQ checkpoint on HuggingFace. For context lengths beyond 32k tokens, move up to an 8x H100 node (with --tensor-parallel-size 8) and raise or remove the --max-model-len cap:

bash
python -m vllm.entrypoints.openai.api_server \
  --model cognitivecomputations/DeepSeek-V3-AWQ \
  --quantization awq \
  --served-model-name coding-assistant \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --host 127.0.0.1 \
  --port 8000
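
Once loading completes, the sharded weights should sit roughly evenly across the four GPUs. A quick check with nvidia-smi:

bash
# Each of the four GPUs should report a similar memory.used figure once the model is loaded.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv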

Step 4: Configure Continue.dev

Install the Continue.dev extension in VS Code or JetBrains, then edit ~/.continue/config.json.

Temporary config for initial testing only (plain HTTP, no auth):

Since vLLM binds to 127.0.0.1, port 8000 is not reachable directly from your workstation. Before testing, open an SSH tunnel to forward the port locally:

bash
ssh -L 8000:localhost:8000 user@YOUR_SPHERON_IP

Then configure Continue.dev to point at localhost:8000:

json
{
  "models": [
    {
      "title": "Self-Hosted Coding Assistant",
      "provider": "openai",
      "model": "coding-assistant",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "none"
    }
  ]
}

Keep the tunnel open while testing. Once NGINX is set up in Step 5, close the tunnel and switch to the HTTPS endpoint instead.

After completing Step 5, update the config to route through the NGINX SSL proxy. The raw vLLM port should stay bound to localhost and must never be exposed directly in production:

json
{
  "models": [
    {
      "title": "Self-Hosted Coding Assistant",
      "provider": "openai",
      "model": "coding-assistant",
      "apiBase": "https://your-coding-api.internal/v1",
      "apiKey": "YOUR_SECRET_TOKEN"
    }
  ]
}

The endpoint is OpenAI-compatible, so any IDE plugin that supports a custom OpenAI base URL will work: Continue.dev, Tabby, Aider, and others.
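
Any OpenAI-style client can be tested against the proxied endpoint the same way. For example, once Step 5 is in place (hostname and token are the placeholders from the configs in this guide):

bash
# Lists the served models through the NGINX proxy; a 401 here means the bearer token is wrong.
curl -s https://your-coding-api.internal/v1/models \
  -H "Authorization: Bearer YOUR_SECRET_TOKEN"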

Step 5: Add NGINX authentication

Don't expose port 8000 directly. Put NGINX in front:

nginx
server {
    listen 443 ssl;
    server_name your-coding-api.internal;
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location /v1/ {
        # Use auth_request to keep auth and proxy directives in separate blocks.
        # Mixing `if` with proxy_pass/proxy_set_header/proxy_buffering in the
        # same location block causes undefined behavior in some NGINX versions.
        auth_request /_auth;
        proxy_pass http://127.0.0.1:8000/v1/;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_buffering off;
        # vLLM is configured with --max-model-len 32768. A 32k-token request
        # body with message history and code snippets easily exceeds NGINX's
        # default 1MB limit, causing 413 errors. 64MB gives ample headroom.
        client_max_body_size 64m;
    }

    # Internal auth subrequest - only reachable via auth_request, not directly.
    # Requires OpenResty (or NGINX with ngx_http_lua_module) for constant-time comparison.
    location = /_auth {
        internal;
        content_by_lua_block {
            local expected = "Bearer YOUR_SECRET_TOKEN"
            local got = ngx.var.http_authorization or ""
            -- Compare every byte regardless of early mismatch to avoid timing leaks.
            local match = (#got == #expected)
            for i = 1, #expected do
                if (got:byte(i) or 0) ~= expected:byte(i) then match = false end
            end
            if match then ngx.exit(ngx.HTTP_OK)
            else ngx.exit(ngx.HTTP_UNAUTHORIZED) end
        }
    }
}

Prerequisites and security notes:

- The /_auth block above requires OpenResty (or stock NGINX compiled with ngx_http_lua_module). If you're running stock NGINX without Lua, replace content_by_lua_block with if ($http_authorization = "Bearer YOUR_SECRET_TOKEN") { return 200; } return 401;. Note that this if-based comparison is not constant-time and leaks token information via response timing.

- The Lua version iterates all bytes even after a mismatch, keeping execution time constant regardless of where tokens diverge.

- For public-facing endpoints, use oauth2-proxy or a dedicated auth sidecar (e.g., Pomerium) instead of bearer tokens in NGINX config.

- Teams deploying behind a private VPC or firewall can use the simpler if-based approach. The practical timing window in that environment is negligible.
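
Whichever variant you choose, generate the bearer token with a proper random source rather than picking a string by hand, for example:

bash
# 32 random bytes, hex-encoded - use the output in place of YOUR_SECRET_TOKEN.
openssl rand -hex 32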

For a complete production vLLM setup including systemd configuration and load balancing, see our self-hosted OpenAI API guide. For multi-GPU tensor parallelism setup details, see the vLLM multi-GPU production guide.

GPU Sizing Guide: From Solo Developer to 100-Engineer Team

Pricing below is based on Spheron GPU cloud as of 30 Mar 2026. Monthly figures use 730 hours.

| Team Size | Model | GPU Config | On-Demand $/hr | Spot $/hr | $/month (on-demand) | $/month (spot) |
| --- | --- | --- | --- | --- | --- | --- |
| 1-10 devs | Qwen 2.5-Coder 32B (4-bit) | 1x L40S | $0.72 | N/A | ~$526 | N/A |
| 10-20 devs | Qwen 2.5-Coder 32B | 1x H100 SXM5 | $2.40 | $0.97 | ~$1,752 | ~$708 |
| 20-50 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 50-100 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 100+ devs | DeepSeek V3 | 8x H100 SXM5 | $19.20 | $7.76 | ~$14,016 | ~$5,665 |

"Concurrent users" means simultaneous active completion requests. Most developers are idle most of the time. A 100-person team typically has 10-20 concurrent requests at peak, not 100.


Cost Comparison: SaaS Subscriptions vs Self-Hosted on Spheron

The break-even math for common team sizes:

| Seats | Cursor Team ($40/seat) | GitHub Copilot Biz ($19/seat) | Spheron 4x H100 (on-demand) | Spheron 4x H100 (spot) |
| --- | --- | --- | --- | --- |
| 10 | $400 | $190 | $7,008 | $2,832 |
| 30 | $1,200 | $570 | $7,008 | $2,832 |
| 50 | $2,000 | $950 | $7,008 | $2,832 |
| 100 | $4,000 | $1,900 | $7,008 | $2,832 |

The crossover point for Cursor Team is at 176 seats on-demand (176 × $40 = $7,040/month vs $7,008/month self-hosted), or 71 seats on spot ($2,832/month). For GitHub Copilot Business, self-hosting becomes cheaper at around 369 seats on-demand or 150 seats on spot.

These numbers assume a single shared deployment. The self-hosted cost also buys unlimited usage: no per-seat caps, no rate limits, and the ability to run longer context requests that SaaS tiers restrict.

Pricing fluctuates based on GPU availability. The prices above are based on 30 Mar 2026 and may have changed. Check current GPU pricing for live rates.

On spot pricing, the economics flip clearly in favor of self-hosting once your team crosses 71 seats for Cursor Team or 150 seats for GitHub Copilot Business. On-demand pricing requires 176+ Cursor Team seats to break even, but data privacy requirements often make the decision before cost does.


If your team is evaluating AI coding infrastructure, Spheron GPU cloud gives you on-demand H100 and A100 instances with no long-term commitments. Run DeepSeek V3 or Qwen 2.5-Coder behind your own firewall - no code leaves your cloud.

Rent H100 → | Rent A100 → | View all GPU pricing →

Get started on Spheron →


Quick Setup Guide

  1. Provision a GPU instance on Spheron

    Log in to app.spheron.ai, select your GPU tier (L40S for small teams, A100 or H100 for larger teams), choose on-demand or spot pricing, and deploy an Ubuntu 22.04 instance with CUDA 12.4 pre-installed.

  2. Install vLLM and dependencies

    SSH into the instance and install vLLM via pip with CUDA extras. Verify GPU visibility with nvidia-smi before proceeding. Install the vllm package version 0.6.x or later for DeepSeek V3 MoE support.

  3. Deploy the coding model with an OpenAI-compatible endpoint

    Run vLLM with your chosen coding model using the serve command. Set --served-model-name to a memorable identifier, configure --max-model-len for your context needs, and bind to 127.0.0.1:8000 so the port is only accessible via the local NGINX proxy. For DeepSeek V3 on H100s, enable tensor parallelism with --tensor-parallel-size matching your GPU count.

  4. Configure your IDE plugin to point to the self-hosted endpoint

    Install Continue.dev (VS Code or JetBrains) or Tabby. Set the API base URL to your NGINX HTTPS endpoint (or to localhost:8000 through an SSH tunnel during initial testing) and the model name to whatever you set in --served-model-name. No code changes are needed - the endpoint is OpenAI-compatible.

  5. Add authentication and set up monitoring

    Place an NGINX reverse proxy in front of vLLM with bearer token authentication. Add a Prometheus + Grafana stack for token throughput and GPU utilization monitoring. Set up alerts for GPU memory pressure above 90%.
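
vLLM exposes Prometheus-format metrics at /metrics on the same port as the API server, so the monitoring stack mainly needs a scrape target pointed at it. To see what's exported (run on the instance, since the port is bound to localhost):

bash
# Token throughput, request queue depth, and KV cache usage are all exported here.
curl -s http://127.0.0.1:8000/metrics | head -n 40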


Frequently Asked Questions

What GPU infrastructure does Cursor run on?

Cursor uses cloud-hosted H100 and A100 GPU clusters from hyperscalers, routing requests through Claude Sonnet, GPT-4o, and their own fine-tuned models. Each inference node typically runs on A100 80GB or H100 GPUs with tensor parallelism across multiple cards for low-latency autocomplete.

Can you self-host an AI coding assistant like Cursor or Copilot?

Yes. Open-source coding models like DeepSeek V3, Qwen 2.5-Coder, and StarCoder 2 can be deployed on GPU cloud using vLLM with an OpenAI-compatible API. IDE plugins like Continue.dev connect to self-hosted endpoints with no code changes. For a step-by-step deployment guide using Tabby or Continue with Qwen2.5-Coder, see the self-host AI coding assistant guide on the Spheron blog.

What GPU does a small team need for a self-hosted coding assistant?

A single L40S GPU (48GB VRAM) running Qwen 2.5-Coder 32B with 4-bit quantization handles teams of up to 10 developers. At around $0.72/hr on-demand on Spheron, that is under $530/month - roughly the cost of 13 Cursor Team seats.

How good is DeepSeek V3 as a coding model?

DeepSeek V3 scores 82.6% on HumanEval+ (EvalPlus, a Python-only pass@1 benchmark, instruct model). For full-file generation and agentic tasks it performs well against closed models. For single-line autocomplete latency it requires careful vLLM tuning to match Cursor's ~80ms TTFT.

At what team size does self-hosting become cheaper than SaaS seats?

Between 71 and 176 seats for Cursor Team, depending on whether you use spot or on-demand pricing for a 4x H100 DeepSeek V3 setup. On spot, a 4x H100 node at ~$2,832/month breaks even around 71 Cursor Team seats ($2,840/month). On on-demand, the crossover is at 176 seats (176 × $40 = $7,040/month vs $7,008/month self-hosted). For smaller teams, Qwen 2.5-Coder 32B on a single L40S or H100 has a much lower break-even. Below these thresholds, SaaS is usually cheaper once you account for DevOps overhead.
