85% of developers regularly use AI coding tools. Cursor reached a $29B valuation with 40,000 NVIDIA engineers on the platform. Every autocomplete request, every agentic PR, every multi-file edit runs on GPU infrastructure somewhere. This post covers what that infrastructure actually looks like, and when it makes more sense to run your own.
The GPU Infrastructure Powering AI Coding Tools
AI coding tools operate across three distinct compute layers:
1. Token generation layer - the LLM inference itself. This runs on A100 or H100 clusters, handling the actual forward passes that produce completions. The bottleneck here is VRAM bandwidth and compute: how fast you can move weights and activations.
2. Context retrieval layer - embeddings and vector search for repository-aware completions. When a tool understands your codebase, it's querying a vector index of your files, functions, and docstrings in real time. This is less GPU-intensive than generation but adds latency.
3. Orchestration layer - routing, caching, and session state. This runs on CPU-based infrastructure but determines whether a request hits a warm KV cache or requires a full prefill.
The latency constraints are what make coding tools different from most LLM applications:
| Use Case | Target TTFT | Acceptable P99 | GPU Implication |
|---|---|---|---|
| Single-line autocomplete | <100ms | <200ms | Prefill-optimized, small context |
| Chat / explain | <500ms | <1,500ms | Balanced |
| Agentic PR generation | 2-30s | 60s | Decode-optimized, long context |
Autocomplete is the hardest constraint. At sub-100ms TTFT you're fighting physics: the network round trip between a developer's machine and the inference data center adds 20-50ms before the request even touches a GPU. Tools like Cursor mitigate this with aggressive speculative prefetching: predicting what you're about to type and running inference ahead of the keystroke.
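As a back-of-the-envelope sketch (every figure below is an illustrative assumption, not a measured value from any provider), the sub-100ms budget decomposes roughly like this:

```python
# Rough TTFT budget for a <100 ms autocomplete target. All figures are
# illustrative assumptions, not measurements from any specific provider.
BUDGET_MS = 100

budget = {
    "network_round_trip": 30,    # assumed mid-range of the 20-50 ms quoted above
    "queueing_and_routing": 10,  # load balancer + scheduler overhead (assumed)
    "prefill": 45,               # forward pass over the prompt (assumed)
    "first_decode_step": 10,     # producing the first output token (assumed)
}

spent = sum(budget.values())
headroom = BUDGET_MS - spent

for stage, ms in budget.items():
    print(f"{stage:22s} {ms:3d} ms")
print(f"{'headroom':22s} {headroom:3d} ms")
```

The point of the exercise: once the network eats 20-50ms, the GPU-side prefill gets roughly half the budget, which is why autocomplete models stay small and context windows stay tight.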
Cursor, Claude Code, and GitHub Copilot: Architecture Breakdown
Each major tool has different infrastructure priorities based on its product design.
| Tool | Primary Model | Context Used | Target TTFT | GPU Tier (est.) |
|---|---|---|---|---|
| Cursor (autocomplete) | Claude Sonnet 4.5/4.6 / custom | 2k-8k tokens | <80ms | H100 clusters |
| Cursor (agent) | Claude Opus 4.6 / GPT-5.4 | 32k-200k tokens | 1-10s | H100 NVLink |
| Claude Code | Claude Opus 4.6 / Sonnet 4.6 | Up to 1M tokens | 2-30s | H100 NVLink |
| GitHub Copilot | GPT-5.x Codex variants | 2k-4k (completions), 64k+ (chat) | <100ms | A100/H100 |
Cursor runs a hybrid architecture: a fast, fine-tuned model for line-level autocomplete, and a slower frontier model (Claude Opus 4.6, GPT-5.4) for agentic edits. The autocomplete path is heavily optimized for TTFT; the agent path optimizes for output quality over raw speed.
Claude Code is the most context-hungry of the three. It reads your entire repository state before responding, using up to 1M tokens of context per request. That requires H100 NVLink nodes for fast inter-GPU attention, since a long-context prefill across a large model won't fit on a single GPU. Latency expectations are also more relaxed, since it operates more like a collaborator than a keystroke predictor.
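To see why 1M-token contexts force multi-GPU nodes, consider the KV cache alone. The sketch below uses assumed architecture numbers (80 layers, 8 grouped-query KV heads, head dim 128, fp16) for a hypothetical large model, not the actual configuration of any model named above:

```python
# KV cache sizing for long-context inference. Architecture numbers are
# assumptions for a hypothetical 70B-class model with grouped-query
# attention, not the real configuration of any deployed model.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate key and value tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(1_000_000)

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache at 1M tokens: {full_context / 1e9:.0f} GB")
```

At these assumed dimensions the cache alone outgrows a single 80 GB GPU long before the weights are counted, which is why long-context serving leans on NVLink-connected nodes and, in practice, paged or quantized KV caches.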
GitHub Copilot sits closer to the autocomplete end of the spectrum. Inline completions use tight context windows (2k-4k tokens), while Copilot Chat supports 64k+ tokens. High request volume across millions of users runs on A100/H100 clusters managed by Azure. GitHub is less transparent than Cursor or Anthropic about the specific models behind completions.
Cursor's enterprise self-hosted deployment option and JetBrains Central (early access in 2026) mark a real shift: teams with data privacy requirements can now run these tools on their own infrastructure. For a deeper look at agentic workload sizing, see the GPU infrastructure guide for AI agents.
Why Engineering Teams Are Moving to Self-Hosted
Three concrete reasons, with numbers.
1. Code privacy
Every autocomplete request sends a snippet of your code to a third-party API. For most web apps this is fine. For a financial institution, a defense contractor, or any team working on IP-sensitive code, it's a hard blocker. Self-hosting keeps inference entirely on your infrastructure - nothing leaves your cloud.
2. Cost at scale
Cursor Team is $40/seat/month. 100 engineers equals $4,000/month, or $48,000/year. A self-hosted setup on Spheron covering the same team runs on a 4x H100 node. At $2.40/hr per H100 SXM5, a 4x H100 node (the minimum for DeepSeek V3 at 4-bit) costs about $7,008/month on-demand, breaking even with around 176 Cursor Team seats. On spot pricing, a 4x H100 node runs ~$2,832/month, shifting the break-even to ~71 seats. For general cost optimization strategies, the GPU Cost Optimization Playbook covers spot instances and reserved capacity patterns.
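The break-even arithmetic above is easy to rerun as rates change; a minimal sketch using the figures quoted in this post:

```python
# Break-even calculator for the SaaS-vs-self-hosted comparison above.
# Hourly rates mirror this post's quoted figures and will drift with
# market pricing.
import math

HOURS_PER_MONTH = 730

def breakeven_seats(gpu_hourly, num_gpus, seat_price):
    """Monthly GPU cost and the seat count where SaaS costs the same."""
    monthly = gpu_hourly * num_gpus * HOURS_PER_MONTH
    return monthly, math.ceil(monthly / seat_price)

# 4x H100 SXM5 vs Cursor Team at $40/seat/month
on_demand, seats_od = breakeven_seats(2.40, 4, 40)
spot, seats_spot = breakeven_seats(0.97, 4, 40)

print(f"On-demand: ${on_demand:,.0f}/mo -> break-even at {seats_od} seats")
print(f"Spot:      ${spot:,.0f}/mo -> break-even at {seats_spot} seats")
```

Swapping in a different seat price (e.g. $19 for Copilot Business) reproduces the crossover points in the comparison table later in this post.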
3. Model control
You can fine-tune on your internal codebase. Use your own rate limits. No degradation when the provider's traffic spikes. No model deprecation announcements forcing you to retest your tooling.
Open-Source Coding Models in 2026
Three models dominate self-hosted coding deployments in 2026:
| Model | Params | VRAM (bf16) | VRAM (4-bit) | HumanEval+† | SWE-bench | License |
|---|---|---|---|---|---|---|
| DeepSeek V3 | 671B (37B active MoE) | ~17x H100 (~1.34 TB) | 4x H100 | 82.6% | - | DeepSeek License |
| Qwen 2.5-Coder 32B | 32B | ~64GB (1x A100) | ~20GB (1x L40S) | 87.2% | - | Apache 2.0 |
| Qwen 2.5-Coder 7B | 7B | ~14GB | ~6GB | 84.1% | - | Apache 2.0 |
| StarCoder 2 15B | 15B | ~30GB | ~10GB | 72.6% | - | BigCode OpenRAIL |
†HumanEval+ (EvalPlus) is a Python-only pass@1 benchmark and a stricter superset of the original HumanEval. All scores use the instruct/chat variant of each model (the version you actually deploy). Base model scores are lower across the board (e.g. StarCoder 2 15B base scores 46.3%; DeepSeek V3 base scores around 65%).
For most teams, Qwen 2.5-Coder 32B is the right choice. Strong benchmark performance (87.2% HumanEval+), single-GPU deployment with 4-bit quantization on an L40S or A100, and Apache 2.0 license for commercial use without restrictions.
DeepSeek V3 is the highest-quality option but needs at least 4x H100 at 4-bit quantization. It's worth it for teams with 50+ developers where per-seat SaaS costs clearly exceed infrastructure costs.
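A first-order way to reproduce the VRAM columns in the table: parameter count times bytes per parameter. This ignores KV cache, activations, and framework overhead, so treat the results as floors (which is why the table's 4-bit Qwen figure is ~20 GB rather than the raw 16 GB):

```python
# First-order VRAM estimate for model weights only. Real deployments
# add KV cache, activations, and framework overhead on top, so these
# numbers are floors, not exact requirements.
def weight_vram_gb(params_billion, bytes_per_param):
    # 1B params at 1 byte/param is ~1 GB of weights
    return params_billion * bytes_per_param

deepseek_bf16 = weight_vram_gb(671, 2)    # ~1,342 GB -> ~17x H100 80GB
deepseek_4bit = weight_vram_gb(671, 0.5)  # ~335 GB naive 4-bit ceiling
qwen32_4bit = weight_vram_gb(32, 0.5)     # ~16 GB raw; ~20 GB with overhead

print(deepseek_bf16, deepseek_4bit, qwen32_4bit)
```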
StarCoder 2 15B is included here for completeness, but at 72.6% HumanEval+ (instruct) it's not competitive with Qwen 2.5-Coder for most use cases. Its main appeal is the BigCode OpenRAIL license and smaller VRAM footprint.
For the full VRAM requirements breakdown across all models, see the GPU requirements cheat sheet.
Deploy Your Own AI Coding Assistant on Spheron GPU Cloud
Here's the full setup from a fresh instance to a working coding assistant your IDE plugins can connect to.
Step 1: Provision a GPU instance
# On app.spheron.ai:
# - Select GPU: L40S (small teams), H100 SXM5 (20+ devs), 4x H100 (50+ devs for DeepSeek V3)
# - Image: Ubuntu 22.04 + CUDA 12.4
# - After SSH:
nvidia-smi
Pick L40S for 1-10 developers running Qwen 2.5-Coder 32B at 4-bit. Pick a single H100 SXM5 for teams up to 20 developers with Qwen 2.5-Coder 32B in bf16. Use 4x H100 for DeepSeek V3 with 4-bit quantization (671B params at 0.5 bytes/param requires ~335 GB).
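The sizing guidance above can be collapsed into a small helper. The thresholds are this post's rules of thumb, not hard limits, and pick_gpu is a hypothetical name for illustration, not a Spheron API:

```python
# Map team size to the GPU configuration recommended in this post.
# Thresholds are rules of thumb from the sizing guidance above.
def pick_gpu(team_size):
    if team_size <= 10:
        return "1x L40S", "Qwen 2.5-Coder 32B (4-bit)"
    if team_size <= 20:
        return "1x H100 SXM5", "Qwen 2.5-Coder 32B (bf16)"
    if team_size <= 100:
        return "4x H100 SXM5", "DeepSeek V3 (AWQ)"
    return "8x H100 SXM5", "DeepSeek V3 (AWQ)"

print(pick_gpu(8))    # small team
print(pick_gpu(60))   # mid-size team
```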
Step 2: Install vLLM
pip install "vllm>=0.6.0" "torch>=2.4.0"
# Verify:
python -c "import vllm; print(vllm.__version__)"
Step 3a: Deploy Qwen 2.5-Coder 32B on a single H100
Bind vLLM to 127.0.0.1 so port 8000 is only reachable from the local machine. NGINX (Step 5) handles all external traffic. Do not change this to 0.0.0.0 unless you have a firewall rule blocking port 8000 from external access.
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-32B-Instruct \
--served-model-name coding-assistant \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 127.0.0.1 \
--port 8000
Step 3b: Deploy DeepSeek V3 on 4x H100
The full-precision DeepSeek V3 model (bf16) requires roughly 1.34 TB of VRAM, which exceeds even a standard 8x H100 node (640 GB). Use a pre-quantized AWQ version instead. A naive 4-bit estimate (671B params × 0.5 bytes/param) puts the ceiling at ~335 GB, which would exceed 4x H100 SXM5 (4 × 80 GB = 320 GB). The community AWQ checkpoint avoids this: cognitivecomputations/DeepSeek-V3-AWQ uses mixed-precision quantization (sub-4-bit for most layers) that brings the actual weight footprint to approximately 270-290 GB, comfortably within 320 GB. There is no official deepseek-ai/DeepSeek-V3-AWQ checkpoint on HuggingFace. For context lengths beyond 32k tokens, use 5x H100 instead and remove the --max-model-len cap:
python -m vllm.entrypoints.openai.api_server \
--model cognitivecomputations/DeepSeek-V3-AWQ \
--quantization awq \
--served-model-name coding-assistant \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--host 127.0.0.1 \
--port 8000
Step 4: Configure Continue.dev
Install the Continue.dev extension in VS Code or JetBrains, then edit ~/.continue/config.json.
Temporary config for initial testing only (plain HTTP, no auth):
Since vLLM binds to 127.0.0.1, port 8000 is not reachable directly from your workstation. Before testing, open an SSH tunnel to forward the port locally:
ssh -L 8000:localhost:8000 user@YOUR_SPHERON_IP
Then configure Continue.dev to point at localhost:8000:
{
"models": [
{
"title": "Self-Hosted Coding Assistant",
"provider": "openai",
"model": "coding-assistant",
"apiBase": "http://localhost:8000/v1",
"apiKey": "none"
}
]
}
Keep the tunnel open while testing. Once NGINX is set up in Step 5, close the tunnel and switch to the HTTPS endpoint instead.
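With the tunnel open, you can sanity-check the endpoint before touching any IDE config. This sketch uses only the Python standard library and the OpenAI-compatible chat completions route that vLLM serves; the model name matches --served-model-name above:

```python
# Quick smoke test for the tunneled endpoint. Run on your workstation
# while the SSH tunnel from Step 4 is open. Uses only the standard
# library; the payload follows the OpenAI chat completions format that
# vLLM's OpenAI-compatible server exposes.
import json
import urllib.request

def build_request(base="http://localhost:8000/v1"):
    payload = {
        "model": "coding-assistant",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request(), timeout=60) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
```

If this prints a completion, the server and tunnel are healthy and any remaining issues are in the IDE plugin configuration.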
After completing Step 5, update the config to route through the NGINX SSL proxy instead. Do not leave port 8000 exposed in production:
{
"models": [
{
"title": "Self-Hosted Coding Assistant",
"provider": "openai",
"model": "coding-assistant",
"apiBase": "https://your-coding-api.internal/v1",
"apiKey": "YOUR_SECRET_TOKEN"
}
]
}
The endpoint is OpenAI-compatible, so any IDE plugin that supports a custom OpenAI base URL will work: Continue.dev, Tabby, Aider, and others.
Step 5: Add NGINX authentication
Don't expose port 8000 directly. Put NGINX in front:
server {
listen 443 ssl;
server_name your-coding-api.internal;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
location /v1/ {
# Use auth_request to keep auth and proxy directives in separate blocks.
# Mixing `if` with proxy_pass/proxy_set_header/proxy_buffering in the
# same location block causes undefined behavior in some NGINX versions.
auth_request /_auth;
proxy_pass http://127.0.0.1:8000/v1/;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_read_timeout 300s;
proxy_buffering off;
# vLLM is configured with --max-model-len 32768. A 32k-token request
# body with message history and code snippets easily exceeds NGINX's
# default 1MB limit, causing 413 errors. 64MB gives ample headroom.
client_max_body_size 64m;
}
# Internal auth subrequest — only reachable via auth_request, not directly.
# Requires OpenResty (or NGINX with ngx_http_lua_module) for constant-time comparison.
location = /_auth {
internal;
content_by_lua_block {
local expected = "Bearer YOUR_SECRET_TOKEN"
local got = ngx.var.http_authorization or ""
-- Compare every byte regardless of early mismatch to avoid timing leaks.
local match = (#got == #expected)
for i = 1, #expected do
if (got:byte(i) or 0) ~= expected:byte(i) then match = false end
end
if match then ngx.exit(ngx.HTTP_OK)
else ngx.exit(ngx.HTTP_UNAUTHORIZED) end
}
}
}
Prerequisites and security notes:
- The /_auth block above requires OpenResty (or stock NGINX compiled with ngx_http_lua_module). If you're running stock NGINX without Lua, replace content_by_lua_block with `if ($http_authorization = "Bearer YOUR_SECRET_TOKEN") { return 200; } return 401;`. Note that this if-based comparison is not constant-time and leaks token information via response timing.
- The Lua version iterates all bytes even after a mismatch, keeping execution time constant regardless of where tokens diverge.
- For public-facing endpoints, use oauth2-proxy or a dedicated auth sidecar (e.g., Pomerium) instead of bearer tokens in NGINX config.
- Teams deploying behind a private VPC or firewall can use the simpler if-based approach; the practical timing window in that environment is negligible.
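If you later move token validation out of NGINX into a Python sidecar, the standard library already provides the constant-time comparison the Lua loop hand-rolls:

```python
# Constant-time bearer token check, mirroring what the Lua block does.
# hmac.compare_digest runs in time independent of where the inputs
# differ, closing the timing side channel a naive == comparison leaves.
import hmac

EXPECTED = "Bearer YOUR_SECRET_TOKEN"

def token_ok(authorization_header):
    # Treat a missing Authorization header as an empty (failing) token.
    return hmac.compare_digest(authorization_header or "", EXPECTED)

print(token_ok("Bearer YOUR_SECRET_TOKEN"))  # True
print(token_ok("Bearer wrong-token"))        # False
```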
For a complete production vLLM setup including systemd configuration and load balancing, see our self-hosted OpenAI API guide. For multi-GPU tensor parallelism setup details, see the vLLM multi-GPU production guide.
GPU Sizing Guide: From Solo Developer to 100-Engineer Team
Pricing below is based on Spheron GPU cloud as of 30 Mar 2026. Monthly figures use 730 hours.
| Team Size | Model | GPU Config | On-Demand $/hr | Spot $/hr | $/month (on-demand) | $/month (spot) |
|---|---|---|---|---|---|---|
| 1-10 devs | Qwen 2.5-Coder 32B (4-bit) | 1x L40S | $0.72 | N/A | ~$526 | N/A |
| 10-20 devs | Qwen 2.5-Coder 32B | 1x H100 SXM5 | $2.40 | $0.97 | ~$1,752 | ~$708 |
| 20-50 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 50-100 devs | DeepSeek V3 | 4x H100 SXM5 | $9.60 | $3.88 | ~$7,008 | ~$2,832 |
| 100+ devs | DeepSeek V3 | 8x H100 SXM5 | $19.20 | $7.76 | ~$14,016 | ~$5,665 |
"Concurrent users" means simultaneous active completion requests. Most developers are idle most of the time. A 100-person team typically has 10-20 concurrent requests at peak, not 100.
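That idle-time observation is what drives the sizing table; a sketch with the active fraction as an explicit parameter (the 15% default is this post's rule of thumb, not a measured constant):

```python
# Peak-concurrency estimate: most seats are idle at any given instant.
# The active fraction is an assumed rule of thumb, not a measurement.
def peak_concurrent(team_size, active_fraction=0.15):
    return max(1, round(team_size * active_fraction))

for team in (10, 30, 100):
    print(team, "devs ->", peak_concurrent(team), "concurrent requests (est.)")
```

A 100-person team at a 10-20% active fraction lands at 10-20 concurrent requests, matching the figure above, which is why a single 4x H100 node can serve teams far larger than its raw throughput per user suggests.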
Pricing fluctuates based on GPU availability. The prices above are based on 30 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Cost Comparison: SaaS Subscriptions vs Self-Hosted on Spheron
The break-even math for common team sizes:
| Seats | Cursor Team ($40/seat) | GitHub Copilot Biz ($19/seat) | Spheron 4x H100 (on-demand) | Spheron 4x H100 (spot) |
|---|---|---|---|---|
| 10 | $400 | $190 | $7,008 | $2,832 |
| 30 | $1,200 | $570 | $7,008 | $2,832 |
| 50 | $2,000 | $950 | $7,008 | $2,832 |
| 100 | $4,000 | $1,900 | $7,008 | $2,832 |
The crossover point for Cursor Team is at 176 seats on-demand (176 × $40 = $7,040/month vs $7,008/month self-hosted), or 71 seats on spot ($2,832/month). For GitHub Copilot Business, self-hosting becomes cheaper at around 369 seats on-demand or 150 seats on spot.
These numbers assume a single shared deployment. The self-hosted cost also buys unlimited usage: no per-seat caps, no rate limits, and the ability to run longer context requests that SaaS tiers restrict.
The economics flip clearly in favor of self-hosting on spot pricing once your team crosses 71 seats for Cursor Team or 150 seats for GitHub Copilot Business. On-demand pricing requires 176+ seats for Cursor Team to break even, but data privacy requirements often make the decision before cost does.
If your team is evaluating AI coding infrastructure, Spheron GPU cloud gives you on-demand H100 and A100 instances with no long-term commitments. Run DeepSeek V3 or Qwen 2.5-Coder behind your own firewall - no code leaves your cloud.
