85% of developers regularly use AI coding tools. Cursor reached a $29B valuation with 40,000 NVIDIA engineers on the platform. Every autocomplete request, every agentic PR, every multi-file edit runs on GPU infrastructure somewhere. This post covers what that infrastructure actually looks like, and when it makes more sense to run your own.
The GPU Infrastructure Powering AI Coding Tools
AI coding tools operate across three distinct compute layers:
1. Token generation layer - the LLM inference itself. This runs on A100 or H100 clusters, handling the actual forward passes that produce completions. The bottleneck here is VRAM bandwidth and compute: how fast you can move weights and activations.
2. Context retrieval layer - embeddings and vector search for repository-aware completions. When a tool understands your codebase, it's querying a vector index of your files, functions, and docstrings in real time. This is less GPU-intensive than generation but adds latency.
3. Orchestration layer - routing, caching, and session state. This runs on CPU-based infrastructure but determines whether a request hits a warm KV cache or requires a full prefill.
The latency constraints are what make coding tools different from most LLM applications:
| Use Case | Target TTFT | Acceptable P99 | GPU Implication |
|---|---|---|---|
| Single-line autocomplete | <100ms | <200ms | Prefill-optimized, small context |
| Chat / explain | <500ms | <1,500ms | Balanced |
| Agentic PR generation | 2-30s | 60s | Decode-optimized, long context |
Autocomplete is the hardest constraint. At sub-100ms TTFT you're fighting physics: the network round trip between a developer's machine and the inference data center adds 20-50ms before the request even touches a GPU. Tools like Cursor mitigate this with aggressive speculative prefetching: predicting what you're about to type and running inference ahead of the keystroke.
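As a back-of-the-envelope sketch (every figure below is an illustrative assumption, not a measured value from any provider), the sub-100ms budget decomposes roughly like this:

```python
# Rough TTFT budget for a <100 ms autocomplete target. All figures are
# illustrative assumptions, not measurements from any specific provider.
BUDGET_MS = 100

budget = {
    "network_round_trip": 30,    # assumed mid-range of the 20-50 ms quoted above
    "queueing_and_routing": 10,  # load balancer + scheduler overhead (assumed)
    "prefill": 45,               # forward pass over the prompt (assumed)
    "first_decode_step": 10,     # producing the first output token (assumed)
}

spent = sum(budget.values())
headroom = BUDGET_MS - spent

for stage, ms in budget.items():
    print(f"{stage:22s} {ms:3d} ms")
print(f"{'headroom':22s} {headroom:3d} ms")
```

The point of the exercise: once the network eats 20-50ms, the GPU-side prefill gets roughly half the budget, which is why autocomplete models stay small and context windows stay tight.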
Cursor, Claude Code, and GitHub Copilot: Architecture Breakdown
Each major tool has different infrastructure priorities based on its product design.
| Tool | Primary Model | Context Used | Target TTFT | GPU Tier (est.) |
|---|---|---|---|---|
| Cursor (autocomplete) | Claude Sonnet 4.5/4.6 / custom | 2k-8k tokens | <80ms | H100 clusters |
| Cursor (agent) | Claude Opus 4.6 / GPT-5.4 | 32k-200k tokens | 1-10s | H100 NVLink |
| Claude Code | Claude Opus 4.6 / Sonnet 4.6 | Up to 1M tokens | 2-30s | H100 NVLink |
| GitHub Copilot | GPT-5.x Codex variants | 2k-4k (completions), 64k+ (chat) | <100ms | A100/H100 |
Cursor runs a hybrid architecture: a fast, fine-tuned model for line-level autocomplete, and a slower frontier model (Claude Opus 4.6, GPT-5.4) for agentic edits. The autocomplete path is heavily optimized for TTFT; the agent path optimizes for output quality over raw speed.
Claude Code is the most context-hungry of the three. It reads your entire repository state before responding, using up to 1M tokens of context per request. That requires H100 NVLink nodes for fast inter-GPU attention, since a long-context prefill across a large model won't fit on a single GPU. Latency expectations are also more relaxed, since it operates more like a collaborator than a keystroke predictor.
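To see why 1M-token contexts force multi-GPU nodes, consider the KV cache alone. The sketch below uses assumed architecture numbers (80 layers, 8 grouped-query KV heads, head dim 128, fp16) for a hypothetical large model, not the actual configuration of any model named above:

```python
# KV cache sizing for long-context inference. Architecture numbers are
# assumptions for a hypothetical 70B-class model with grouped-query
# attention, not the real configuration of any deployed model.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate key and value tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(1_000_000)

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache at 1M tokens: {full_context / 1e9:.0f} GB")
```

At these assumed dimensions the cache alone outgrows a single 80 GB GPU long before the weights are counted, which is why long-context serving leans on NVLink-connected nodes and, in practice, paged or quantized KV caches.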
GitHub Copilot sits closer to the autocomplete end of the spectrum. Inline completions use tight context windows (2k-4k tokens), while Copilot Chat supports 64k+ tokens. High request volume across millions of users runs on A100/H100 clusters managed by Azure. GitHub is less transparent than Cursor or Anthropic about the specific models behind completions.
Cursor's enterprise self-hosted deployment option and JetBrains Central (early access in 2026) mark a real shift: teams with data privacy requirements can now run these tools on their own infrastructure. For a deeper look at agentic workload sizing, see the GPU infrastructure guide for AI agents.
Why Engineering Teams Are Moving to Self-Hosted
Three concrete reasons, with numbers.
1. Code privacy
Every autocomplete request sends a snippet of your code to a third-party API. For most web apps this is fine. For a financial institution, a defense contractor, or any team working on IP-sensitive code, it's a hard blocker. Self-hosting keeps inference entirely on your infrastructure - nothing leaves your cloud.
2. Cost at scale
Cursor Team is $40/seat/month. 100 engineers equals $4,000/month, or $48,000/year. A self-hosted setup on Spheron covering the same team runs on a 4x H100 node. At $2.40/hr per H100 SXM5, a 4x H100 node (the minimum for DeepSeek V3 at 4-bit) costs about $7,008/month on-demand, breaking even with around 176 Cursor Team seats. On spot pricing, a 4x H100 node runs ~$2,832/month, shifting the break-even to ~71 seats. For general cost optimization strategies, the GPU Cost Optimization Playbook covers spot instances and reserved capacity patterns.
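The break-even arithmetic above is easy to rerun as rates change; a minimal sketch using the figures quoted in this post:

```python
# Break-even calculator for the SaaS-vs-self-hosted comparison above.
# Hourly rates mirror this post's quoted figures and will drift with
# market pricing.
import math

HOURS_PER_MONTH = 730

def breakeven_seats(gpu_hourly, num_gpus, seat_price):
    """Monthly GPU cost and the seat count where SaaS costs the same."""
    monthly = gpu_hourly * num_gpus * HOURS_PER_MONTH
    return monthly, math.ceil(monthly / seat_price)

# 4x H100 SXM5 vs Cursor Team at $40/seat/month
on_demand, seats_od = breakeven_seats(2.40, 4, 40)
spot, seats_spot = breakeven_seats(0.97, 4, 40)

print(f"On-demand: ${on_demand:,.0f}/mo -> break-even at {seats_od} seats")
print(f"Spot:      ${spot:,.0f}/mo -> break-even at {seats_spot} seats")
```

Swapping in a different seat price (e.g. $19 for Copilot Business) reproduces the crossover points in the comparison table later in this post.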
3. Model control
You can fine-tune on your internal codebase. Use your own rate limits. No degradation when the provider's traffic spikes. No model deprecation announcements forcing you to retest your tooling.
Open-Source Coding Models in 2026
Three models dominate self-hosted coding deployments in 2026:
| Model | Params | VRAM (bf16) | VRAM (4-bit) | HumanEval+† | SWE-bench | License |
|---|---|---|---|---|---|---|
| DeepSeek V3 | 671B (37B active MoE) | ~17x H100 (~1.34 TB) | 4x H100 | 82.6% | - | DeepSeek License |
| Qwen 2.5-Coder 32B | 32B | ~64GB (1x A100) | ~20GB (1x L40S) | 87.2% | - | Apache 2.0 |
| Qwen 2.5-Coder 7B | 7B | ~14GB | ~6GB | 84.1% | - | Apache 2.0 |
| StarCoder 2 15B | 15B | ~30GB | ~10GB | 72.6% | - | BigCode OpenRAIL |
†HumanEval+ (EvalPlus) is a Python-only pass@1 benchmark and a stricter superset of the original HumanEval. All scores use the instruct/chat variant of each model (the version you actually deploy). Base model scores are lower across the board (e.g. StarCoder 2 15B base scores 46.3%; DeepSeek V3 base scores around 65%).
For most teams, Qwen 2.5-Coder 32B is the right choice. Strong benchmark performance (87.2% HumanEval+), single-GPU deployment with 4-bit quantization on an L40S or A100, and Apache 2.0 license for commercial use without restrictions.
DeepSeek V3 is the highest-quality option but needs at least 4x H100 at 4-bit quantization. It's worth it for teams with 50+ developers where per-seat SaaS costs clearly exceed infrastructure costs.
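A first-order way to reproduce the VRAM columns in the table: parameter count times bytes per parameter. This ignores KV cache, activations, and framework overhead, so treat the results as floors (which is why the table's 4-bit Qwen figure is ~20 GB rather than the raw 16 GB):

```python
# First-order VRAM estimate for model weights only. Real deployments
# add KV cache, activations, and framework overhead on top, so these
# numbers are floors, not exact requirements.
def weight_vram_gb(params_billion, bytes_per_param):
    # 1B params at 1 byte/param is ~1 GB of weights
    return params_billion * bytes_per_param

deepseek_bf16 = weight_vram_gb(671, 2)    # ~1,342 GB -> ~17x H100 80GB
deepseek_4bit = weight_vram_gb(671, 0.5)  # ~335 GB naive 4-bit ceiling
qwen32_4bit = weight_vram_gb(32, 0.5)     # ~16 GB raw; ~20 GB with overhead

print(deepseek_bf16, deepseek_4bit, qwen32_4bit)
```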
StarCoder 2 15B is included here for completeness, but at 72.6% HumanEval+ (instruct) it's not competitive with Qwen 2.5-Coder for most use cases. Its main appeal is the BigCode OpenRAIL license and smaller VRAM footprint.
For the full VRAM requirements breakdown across all models, see the GPU requirements cheat sheet.
Deploy Your Own AI Coding Assistant on Spheron GPU Cloud
Here's the full setup from a fresh instance to a working coding assistant your IDE plugins can connect to.
Step 1: Provision a GPU instance
# On app.spheron.ai:
# - Select GPU: L40S (small teams), H100 SXM5 (20+ devs), 4x H100 (50+ devs for DeepSeek V3)
# - Image: Ubuntu 22.04 + CUDA 12.4
# - After SSH:
nvidia-smi
Pick L40S for 1-10 developers running Qwen 2.5-Coder 32B at 4-bit. Pick a single H100 SXM5 for teams up to 20 developers with Qwen 2.5-Coder 32B in bf16. Use 4x H100 for DeepSeek V3 with 4-bit quantization (671B params at 0.5 bytes/param requires ~335 GB).
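The sizing guidance above can be collapsed into a small helper. The thresholds are this post's rules of thumb, not hard limits, and pick_gpu is a hypothetical name for illustration, not a Spheron API:

```python
# Map team size to the GPU configuration recommended in this post.
# Thresholds are rules of thumb from the sizing guidance above.
def pick_gpu(team_size):
    if team_size <= 10:
        return "1x L40S", "Qwen 2.5-Coder 32B (4-bit)"
    if team_size <= 20:
        return "1x H100 SXM5", "Qwen 2.5-Coder 32B (bf16)"
    if team_size <= 100:
        return "4x H100 SXM5", "DeepSeek V3 (AWQ)"
    return "8x H100 SXM5", "DeepSeek V3 (AWQ)"

print(pick_gpu(8))    # small team
print(pick_gpu(60))   # mid-size team
```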
Step 2: Install vLLM
pip install "vllm>=0.6.0" "torch>=2.4.0"
# Verify:
python -c "import vllm; print(vllm.__version__)"
Step 3a: Deploy Qwen 2.5-Coder 32B on a single H100
Bind vLLM to 127.0.0.1 so port 8000 is only reachable from the local machine. NGINX (Step 5) handles all external traffic. Do not change this to 0.0.0.0 unless you have a firewall rule blocking port 8000 from external access.
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-32B-Instruct \
--served-model-name coding-assistant \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 127.0.0.1 \
--port 8000
Step 3b: Deploy DeepSeek V3 on 4x H100
The full-precision DeepSeek V3 model (bf16) requires roughly 1.34 TB of VRAM, which exceeds even a standard 8x H100 node (640 GB). Use a pre-quantized AWQ version instead. A naive 4-bit estimate (671B params × 0.5 bytes/param) puts the ceiling at ~335 GB, which would exceed 4x H100 SXM5 (4 × 80 GB = 320 GB). The community AWQ checkpoint avoids this: cognitivecomputations/DeepSeek-V3-AWQ uses mixed-precision quantization (sub-4-bit for most layers) that brings the actual weight footprint to approximately 270-290 GB, comfortably within 320 GB. There is no official deepseek-ai/DeepSeek-V3-AWQ checkpoint on HuggingFace. For context lengths beyond 32k tokens, use 5x H100 instead and remove the --max-model-len cap:
python -m vllm.entrypoints.openai.api_server \
--model cognitivecomputations/DeepSeek-V3-AWQ \
--quantization awq \
--served-model-name coding-assistant \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--host 127.0.0.1 \
--port 8000
Step 4: Configure Continue.dev
Install the Continue.dev extension in VS Code or JetBrains, then edit ~/.continue/config.json.
Temporary config for initial testing only (plain HTTP, no auth):
Since vLLM binds to 127.0.0.1, port 8000 is not reachable directly from your workstation. Before testing, open an SSH tunnel to forward the port locally:
ssh -L 8000:localhost:8000 user@YOUR_SPHERON_IP
Then configure Continue.dev to point at localhost:8000:
{
"models": [
{
"title": "Self-Hosted Coding Assistant",
"provider": "openai",
"model": "coding-assistant",
"apiBase": "http://localhost:8000/v1",
"apiKey": "none"
}
]
}
Keep the tunnel open while testing. Once NGINX is set up in Step 5, close the tunnel and switch to the HTTPS endpoint instead.
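With the tunnel open, you can sanity-check the endpoint before touching any IDE config. This sketch uses only the Python standard library and the OpenAI-compatible chat completions route that vLLM serves; the model name matches --served-model-name above:

```python
# Quick smoke test for the tunneled endpoint. Run on your workstation
# while the SSH tunnel from Step 4 is open. Uses only the standard
# library; the payload follows the OpenAI chat completions format that
# vLLM's OpenAI-compatible server exposes.
import json
import urllib.request

def build_request(base="http://localhost:8000/v1"):
    payload = {
        "model": "coding-assistant",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request(), timeout=60) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
```

If this prints a completion, the server and tunnel are healthy and any remaining issues are in the IDE plugin configuration.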
After completing Step 5, update the config to route through the NGINX SSL proxy instead. Do not leave port 8000 exposed in production:
{
"models": [
{
"title": "Self-Hosted Coding Assistant",
"provider": "openai",
"model": "coding-assistant",
"apiBase": "https://your-coding-api.internal/v1",
"apiKey": "YOUR_SECRET_TOKEN"
}
]
}
The endpoint is OpenAI-compatible, so any IDE plugin that supports a custom OpenAI base URL will work: Continue.dev, Tabby, Aider, and others.
Step 5: Add NGINX authentication
Don't expose port 8000 directly. Put NGINX in front:
server {
listen 443 ssl;
server_name your-coding-api.internal;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
location /v1/ {
# Use auth_request to keep auth and proxy directives in separate blocks.
# Mixing `if` with proxy_pass/proxy_set_header/proxy_buffering in the
# same location block causes undefined behavior in some NGINX versions.
auth_request /_auth;
proxy_pass http://127.0.0.1:8000/v1/;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_read_timeout 300s;
proxy_buffering off;
# vLLM is configured with --max-model-len 32768. A 32k-token request
# body with message history and code snippets easily exceeds NGINX's
# default 1MB limit, causing 413 errors. 64MB gives ample headroom.
client_max_body_size 64m;
}
# Internal auth subrequest — only reachable via auth_request, not directly.
# Requires OpenResty (or NGINX with ngx_http_lua_module) for constant-time comparison.
location = /_auth {
internal;
content_by_lua_block {
local expected = "Bearer YOUR_SECRET_TOKEN"
local got = ngx.var.http_authorization or ""
-- Compare every byte regardless of early mismatch to avoid timing leaks.
local match = (#got == #expected)
for i = 1, #expected do
if (got:byte(i) or 0) ~= expected:byte(i) then match = false end
end
if match then ngx.exit(ngx.HTTP_OK)
else ngx.exit(ngx.HTTP_UNAUTHORIZED) end
}
}
}
Prerequisites and security notes:
- The /_auth block above requires OpenResty (or stock NGINX compiled with ngx_http_lua_module). If you're running stock NGINX without Lua, replace content_by_lua_block with `if ($http_authorization = "Bearer YOUR_SECRET_TOKEN") { return 200; } return 401;`. Note that this if-based comparison is not constant-time and leaks token information via response timing.
- The Lua version iterates all bytes even after a mismatch, keeping execution time constant regardless of where tokens diverge.
- For public-facing endpoints, use oauth2-proxy or a dedicated auth sidecar (e.g., Pomerium) instead of bearer tokens in NGINX config.
- Teams deploying behind a private VPC or firewall can use the simpler if-based approach; the practical timing window in that environment is negligible.
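If you later move token validation out of NGINX into a Python sidecar, the standard library already provides the constant-time comparison the Lua loop hand-rolls:

```python
# Constant-time bearer token check, mirroring what the Lua block does.
# hmac.compare_digest runs in time independent of where the inputs
# differ, closing the timing side channel a naive == comparison leaves.
import hmac

EXPECTED = "Bearer YOUR_SECRET_TOKEN"

def token_ok(authorization_header):
    # Treat a missing Authorization header as an empty (failing) token.
    return hmac.compare_digest(authorization_header or "", EXPECTED)

print(token_ok("Bearer YOUR_SECRET_TOKEN"))  # True
print(token_ok("Bearer wrong-token"))        # False
```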
For a complete production vLLM setup including systemd configuration and load balancing, see our self-hosted OpenAI API guide. For multi-GPU tensor parallelism setup details, see the vLLM multi-GPU production guide.
GPU Sizing Guide: From Solo Developer to 100-Engineer Team
Pricing below is based on Spheron GPU cloud as of 30 Mar 2026. Monthly figures use 730 hours.
| Team Size | Model | GPU Config | On-Demand $/hr | Spot $/hr | $/month (on-demand) | $/month (spot) |
|---|---|---|---|---|---|---|
| 1-10 devs | Qwen 2.5-Coder 32B (4-bit) | 1x L40S | $0.72 | N/A | ~$526 | N/A |
| 10-20 devs | Qwen 2.5-Coder 32B | 1x H100 SXM5 | $2.40 | $0.97 | ~$1,752 | ~$708 |
| 20-50 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 50-100 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 100+ devs | DeepSeek V3 | 8x H100 SXM5 | $19.20 | $7.76 | ~$14,016 | ~$5,665 |
"Concurrent users" means simultaneous active completion requests. Most developers are idle most of the time. A 100-person team typically has 10-20 concurrent requests at peak, not 100.
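That idle-time observation is what drives the sizing table; a sketch with the active fraction as an explicit parameter (the 15% default is this post's rule of thumb, not a measured constant):

```python
# Peak-concurrency estimate: most seats are idle at any given instant.
# The active fraction is an assumed rule of thumb, not a measurement.
def peak_concurrent(team_size, active_fraction=0.15):
    return max(1, round(team_size * active_fraction))

for team in (10, 30, 100):
    print(team, "devs ->", peak_concurrent(team), "concurrent requests (est.)")
```

A 100-person team at a 10-20% active fraction lands at 10-20 concurrent requests, matching the figure above, which is why a single 4x H100 node can serve teams far larger than its raw throughput per user suggests.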
Pricing fluctuates based on GPU availability. The prices above are based on 30 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Cost Comparison: SaaS Subscriptions vs Self-Hosted on Spheron
The break-even math for common team sizes:
| Seats | Cursor Team ($40/seat) | GitHub Copilot Biz ($19/seat) | Spheron 4x H100 (on-demand) | Spheron 4x H100 (spot) |
|---|---|---|---|---|
| 10 | $400 | $190 | $7,008 | $2,832 |
| 30 | $1,200 | $570 | $7,008 | $2,832 |
| 50 | $2,000 | $950 | $7,008 | $2,832 |
| 100 | $4,000 | $1,900 | $7,008 | $2,832 |
The crossover point for Cursor Team is at 176 seats on-demand (176 × $40 = $7,040/month vs $7,008/month self-hosted), or 71 seats on spot ($2,832/month). For GitHub Copilot Business, self-hosting becomes cheaper at around 369 seats on-demand or 150 seats on spot.
These numbers assume a single shared deployment. The self-hosted cost also buys unlimited usage: no per-seat caps, no rate limits, and the ability to run longer context requests that SaaS tiers restrict.
The economics flip clearly in favor of self-hosting on spot pricing once your team crosses 71 seats for Cursor Team or 150 seats for GitHub Copilot Business. On-demand pricing requires 176+ seats for Cursor Team to break even, but data privacy requirements often make the decision before cost does.
If your team is evaluating AI coding infrastructure, Spheron GPU cloud gives you on-demand H100 and A100 instances with no long-term commitments. Run DeepSeek V3 or Qwen 2.5-Coder behind your own firewall - no code leaves your cloud.
