GitHub Copilot Business costs $19 per seat per month. At 50 developers, that's $11,400 per year - and every line of your proprietary code goes through Microsoft's servers. This guide shows you how to run a Copilot-equivalent assistant on your own GPU cloud for less.
For background on how commercial tools like Cursor and Claude Code are architected, see GPU infrastructure behind Cursor, Claude Code, and Copilot.
Why Self-Host Your AI Coding Assistant in 2026
Three reasons, each with numbers.
Cost at team scale. A single A100 80GB on Spheron runs at $1.04/hr on-demand. Over a 30-day month (720 hours), that's $749. GitHub Copilot Business is $19/seat/month. The math: $749 / $19 = ~39 seats. Once your team passes 39 developers, self-hosting costs less than Copilot. Larger teams can share a single A100 node serving 15-25 concurrent autocomplete requests with Tabby's request queuing, so the per-seat cost drops further as headcount grows.
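The break-even arithmetic above can be sketched in a few lines (the function names are illustrative, not from any library):

```python
import math

def monthly_gpu_cost(hourly_rate, hours=720):
    # 720 hours = a 30-day month of continuous serving
    return hourly_rate * hours

def first_winning_team_size(gpu_monthly, per_seat):
    # Smallest team size at which per-seat billing exceeds the GPU node cost
    return math.ceil(gpu_monthly / per_seat)

a100_month = monthly_gpu_cost(1.04)              # $748.80, ~$749/month
team = first_winning_team_size(a100_month, 19)   # 40: past 39 devs, self-hosting wins
```

At 39 seats Copilot bills $741, still under the $749 node; at 40 seats it bills $760 and the node wins.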
Data sovereignty. Every request to Copilot, Cursor, or any cloud API sends code to a third party. For healthcare, fintech, legal, and defense contractors, that is not an option. Running inference on your own GPU means credentials, proprietary algorithms, and unreleased product code never leave your network. Some compliance frameworks (SOC 2 Type II, HIPAA) explicitly require this.
Model control. You choose what runs. Swap Qwen2.5-Coder 7B for faster autocomplete on a shared node, or run Qwen2.5-Coder 32B for better multi-file understanding. Fine-tune on your own codebase to improve completions for internal libraries. Set your own context length. None of that is possible with a SaaS subscription.
Comparing Open-Source Coding Assistants: Tabby, Continue, FauxPilot, and Kilo
| Tool | Type | Model server included | IDE support | Multi-user | Best for |
|---|---|---|---|---|---|
| Tabby | Full stack | Yes (own runtime) | VS Code, JetBrains | Yes (team auth) | Teams wanting a drop-in Copilot replacement |
| Continue | IDE plugin only | No (uses vLLM/Ollama) | VS Code, JetBrains | Via backend | Devs who already run vLLM |
| FauxPilot | Full stack | Triton server | VS Code | Limited | Legacy GitHub Copilot plugin compatibility |
| Kilo | Agentic coding platform | No (uses any OpenAI API) | VS Code, JetBrains, CLI | Via backend | Agentic coding with broad IDE support |
Tabby is the closest to a managed Copilot replacement. It ships a model server, an IDE plugin, user authentication, and usage telemetry in one package. You point the plugin at your Spheron instance IP and it works. Team authentication lets admins create per-user API keys, revoke access, and track usage by developer. The tradeoff: you're tied to Tabby's supported model list, which covers all the major Qwen and DeepSeek variants.
Continue takes the opposite approach. It's an IDE plugin that connects to any OpenAI-compatible backend. You bring your own model server (vLLM, Ollama, or a hosted API) and configure Continue's config.json to point at it. This gives you full control over the inference stack, but there's no built-in auth or telemetry. Good if your team already runs vLLM for other workloads. See Build a Self-Hosted OpenAI-Compatible API with vLLM for a step-by-step vLLM deployment guide.
FauxPilot was built to mimic GitHub Copilot's API exactly, so the original Copilot VS Code extension works without modification. It has not had a significant release since 2023 and is not recommended for new deployments in 2026. The Triton inference server it uses is harder to operate than vLLM or Ollama for most teams.
Kilo is a full agentic coding platform with over 1.5M users, available as extensions for VS Code, JetBrains, and a CLI. Like Continue, it connects to any OpenAI-compatible API and brings no model server of its own. It supports agentic workflows (multi-step code changes, file creation, terminal commands) across all three environments. A solid option for teams that already run an inference backend and want strong IDE coverage beyond just VS Code.
Best Coding LLMs for Self-Hosting: Qwen2.5-Coder vs DeepSeek-Coder-V2 vs StarCoder 2
| Model | HumanEval | FIM Support | Params | VRAM (FP16) | VRAM (4-bit) |
|---|---|---|---|---|---|
| Qwen2.5-Coder 32B | 92.7% | Yes | 32B | ~65GB | ~20GB |
| Qwen2.5-Coder 14B | 89.6% | Yes | 14B | ~29GB | ~9GB |
| Qwen2.5-Coder 7B | 88.4% | Yes | 7B | ~15GB | ~5GB |
| DeepSeek-Coder-V2-Lite | 81.1% | Yes | 16B MoE | ~20GB | ~8GB |
| DeepSeek-Coder-V2 236B | ~90% | Yes | 236B MoE | 4x H100 (FP8) | 4x A100 |
| StarCoder 2 15B | 72.6% | Yes | 15B | ~30GB | ~10GB |
HumanEval scores are pass@1 on instruct variants. StarCoder 2 15B is included for reference; Qwen2.5-Coder outperforms it by a wide margin. Qwen2.5-Coder became the most-deployed self-hosted coding model in early 2026, overtaking Llama-based models (RunPod deployment data, March 2026).
A note on FIM vs HumanEval: HumanEval measures code generation from a docstring. Fill-in-the-middle (FIM) is what autocomplete tools actually use. Qwen2.5-Coder supports both modes natively. The FIM-specific benchmarks (infilling tasks from SantaCoder's benchmark suite) show similar rank ordering, but TTFT matters as much as accuracy for real autocomplete use. The 7B and 14B models hit autocomplete latency targets more reliably on a shared node.
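To make the FIM distinction concrete, here is a minimal sketch of how an autocomplete prompt is assembled for Qwen2.5-Coder, using the FIM control tokens documented in its model card (the helper function is illustrative):

```python
# Qwen2.5-Coder's fill-in-the-middle control tokens (from the model card)
FIM_TEMPLATE = "<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: the model generates the code that belongs
    between the text before the cursor (prefix) and after it (suffix)."""
    return FIM_TEMPLATE.format(prefix=prefix, suffix=suffix)

# Cursor sits after "return " in the function body
prompt = fim_prompt("def mean(xs):\n    return ", " / len(xs)\n")
```

The editor plugin builds this prompt on every keystroke, which is why TTFT, not end-to-end generation speed, dominates the autocomplete experience.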
For large enterprise teams willing to operate multi-GPU infrastructure, DeepSeek-Coder-V2 236B (4x H100 minimum for FP8 inference) delivers near-GPT-4o quality. Do not attempt this on fewer than 4x H100 80GB; it will OOM at weight load time.
See also: GPU memory requirements for LLMs for a full VRAM sizing reference.
GPU Requirements and VRAM Sizing
| Model | GPU | VRAM Used | Concurrent Users | On-Demand Price | Spot Price |
|---|---|---|---|---|---|
| Qwen2.5-Coder 7B (FP16) | L40S 48GB | ~15GB | 1-5 | $1.80/hr | - |
| Qwen2.5-Coder 14B (FP16) | A100 80GB PCIe | ~29GB | 1-8 | $1.04/hr | $1.14/hr |
| Qwen2.5-Coder 32B (4-bit GPTQ) | A100 80GB PCIe | ~22GB | 1-15 | $1.04/hr | $1.14/hr |
| Qwen2.5-Coder 32B (FP16) | A100 80GB SXM4 | ~65GB | 1-15 | $1.64/hr | - |
| DeepSeek-Coder-V2 236B (FP8) | 4x H100 PCIe | ~160GB | 10-50 | ~$8.04/hr | - |
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
The 32B model at 4-bit GPTQ fits comfortably on a single A100 80GB PCIe at $1.04/hr on-demand, and handles 1-15 concurrent autocomplete requests before latency degrades. For teams under 15 developers, this is the recommended configuration.
For context: at $1.04/hr, a 30-day A100 costs $749. At 15 concurrent users sharing the node, that's about $50/developer/month, below any commercial subscription. Smaller teams sharing the same node get even better economics.
See also: GPU requirements cheat sheet for 2026 for a broader sizing reference. For A100 GPU rental options and availability, see Rent NVIDIA A100 GPUs on Spheron.
Step-by-Step: Deploy Tabby with Qwen2.5-Coder on Spheron GPU Cloud
Prerequisites
- Spheron account at app.spheron.ai
- Hugging Face token (for gated model access, if needed)
- VS Code or JetBrains IDE
1. Provision the GPU instance
Log into app.spheron.ai, navigate to GPU deployments, and select A100 80GB PCIe for Qwen2.5-Coder 32B (4-bit), or L40S for the 7B model. Choose Ubuntu 22.04 with CUDA 12.4. The instance boots in 60-90 seconds. For details on instance types (spot vs dedicated vs bare metal), see the Spheron instance types docs.
Once booted, SSH in and verify the GPU:
```bash
# Verify GPU after instance boots
nvidia-smi
```

2. Run Tabby with Qwen2.5-Coder
```bash
# Pull and run Tabby with Qwen2.5-Coder 7B
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby:latest \
  serve \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --device cuda
```

For the 32B model, replace `Qwen/Qwen2.5-Coder-7B-Instruct` with `Qwen/Qwen2.5-Coder-32B-Instruct`. Pin the Tabby image to a specific version tag in production (e.g., `tabbyml/tabby:v0.18.0`) to avoid unexpected updates.
Tabby downloads the model weights from Hugging Face on first run. The 7B model is about 15GB; the 32B is around 65GB (FP16) or 20GB (4-bit). Wait for the log line Listening on 0.0.0.0:8080 before connecting the IDE plugin.
3. Configure the Tabby VS Code extension
- Install the `TabbyML.vscode-tabby` extension from the VS Code Marketplace.
- Open VS Code settings and search for "Tabby server".
- Set the endpoint to `http://YOUR_SPHERON_IP:8080`.
- Open any code file and start typing. Tabby should show completions in 1-3 seconds on first request (the model warms up), then drop to 80-400ms for subsequent requests.
4. Cloud-init script for automated setup
Use this cloud-init config when provisioning the Spheron instance to automate the entire setup. Spheron supports both cloud-init YAML and Bash startup scripts at launch time. See the Spheron startup scripts docs for the full format reference.
```yaml
#cloud-config
package_update: true
packages:
  - docker.io
  - nginx
runcmd:
  - systemctl enable docker
  - systemctl start docker
  - docker run -d --gpus all -p 8080:8080 --restart unless-stopped -v /data/tabby:/data tabbyml/tabby:latest serve --model Qwen/Qwen2.5-Coder-7B-Instruct --device cuda
```

This starts Tabby as a Docker container with `--restart unless-stopped`, so it comes back up after a reboot. Swap the model name to match your chosen variant.
Step-by-Step: Deploy Continue with Ollama Backend on GPU Cloud
Continue is the better choice if you already run an inference backend (vLLM or Ollama) on your Spheron instance, or if you want a single backend serving both chat and autocomplete with different model sizes.
1. Install Ollama on the instance
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

The Ollama installer creates a systemd service. By default it binds to 127.0.0.1, which is not reachable from outside the instance. Set `OLLAMA_HOST=0.0.0.0` via a systemd drop-in so the binding persists across reboots and SSH sessions:
**Security warning:** Setting `OLLAMA_HOST=0.0.0.0` exposes port 11434 on every network interface, including the instance's public IP. Ollama has no built-in authentication. Before or immediately after this step, restrict access with a firewall rule so only your IP can reach the port:

```bash
sudo ufw allow from <YOUR_IP> to any port 11434
sudo ufw deny 11434
sudo ufw allow 22  # keep SSH open to avoid locking yourself out
sudo ufw enable    # activate the firewall (disabled by default on Ubuntu)
```

Replace `<YOUR_IP>` with your developer machine's IP. Run `ufw status` to confirm the rules are active. If you skip this, anyone who discovers the instance IP can make unlimited inference requests at your expense. The NGINX bearer-token setup in the Production Tips section below is the recommended long-term solution for team access.
```bash
# Create a systemd override to bind Ollama to all interfaces
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF

# Reload systemd and start Ollama
sudo systemctl daemon-reload
sudo systemctl enable ollama && sudo systemctl restart ollama
```

```bash
# Pull the models (this downloads weights from Ollama's registry)
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b
```

Using `systemctl enable ollama && systemctl restart ollama` ensures Ollama starts automatically on boot and restarts with the new `OLLAMA_HOST` environment variable applied. The separate restart is required because the installer already started Ollama bound to 127.0.0.1. Without it, the env var change never takes effect in the running process and your IDE will hit a connection-refused error.
2. Configure the Continue extension
Install Continue from the VS Code Marketplace, then edit ~/.continue/config.json:
```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B (Spheron)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://YOUR_SPHERON_IP:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B (fast)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://YOUR_SPHERON_IP:11434"
  }
}
```

The two-model pattern here is intentional. The 32B model handles chat requests ("explain this function", "refactor this class") where quality matters more than speed. The 7B model handles real-time tab completion, where you need sub-200ms responses to avoid breaking typing flow. Running both on a single A100 80GB is viable since Ollama loads models on demand and swaps them in memory.
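Whether Ollama actually needs to swap depends on whether both quantized models fit in VRAM at once. A quick sanity check, using the approximate 4-bit footprints from the model table above and an assumed KV-cache/overhead allowance (the 8GB headroom figure is an assumption, not a measured value):

```python
def both_models_resident(vram_gb, model_footprints_gb, kv_headroom_gb=8.0):
    # kv_headroom_gb: assumed allowance for KV cache and CUDA overhead
    return sum(model_footprints_gb) + kv_headroom_gb <= vram_gb

# 4-bit footprints from the model table above: 32B ~20 GB, 7B ~5 GB
on_a100_80 = both_models_resident(80, [20, 5])   # True: both stay loaded
on_24gb_card = both_models_resident(24, [20, 5]) # False: Ollama must swap
```

On the A100 80GB both models stay resident, so tab completion never waits on a model swap triggered by a chat request.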
For more on the Ollama vs vLLM tradeoff for self-hosting, see Ollama vs vLLM: which to use for self-hosting LLMs.
Performance Benchmarks: Self-Hosted vs GitHub Copilot vs Cursor
| System | HumanEval | FIM Accuracy | TTFT (autocomplete) | Cost per dev/month |
|---|---|---|---|---|
| GitHub Copilot Business | ~75% (est.) | High | 80-150ms | $19 |
| Cursor Pro | ~80% (est.) | Very high | 60-120ms | $20 |
| Tabby + Qwen2.5-Coder 32B (A100) | 92.7% | High | 150-400ms | $1.04/hr shared |
| Tabby + Qwen2.5-Coder 7B (L40S) | 88.4% | Good | 80-200ms | $1.80/hr shared |
| Continue + DeepSeek-Coder-V2 236B (4x H100) | ~90% | Very high | 200-600ms | ~$8.04/hr shared |
Copilot and Cursor HumanEval numbers are estimates based on published model composition reports; self-hosted Qwen2.5-Coder 32B scores come from the Qwen technical report (pass@1, instruct variant). TTFT figures depend on request concurrency and network latency from your laptop to the GPU instance.
The TTFT gap is real. Copilot and Cursor operate latency-optimized data centers with pre-warmed caches and speculative prefetching. Your Spheron instance is a general-purpose GPU node. At low concurrency (1-3 devs), self-hosted TTFT is competitive. At higher concurrency with a shared node, response times increase. The Qwen2.5-Coder 7B on L40S is the best match for Copilot's latency profile.
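To see where your own deployment lands, measure TTFT from the client side of the wire. A minimal sketch; the `fake_stream` generator is a stand-in for a real streaming HTTP response (for example, iterating over chunks from Ollama's `/api/generate` with streaming enabled):

```python
import time

def measure_ttft(chunks):
    """Return (ttft_seconds, total_seconds) for an iterable of streamed chunks."""
    start = time.monotonic()
    first = None
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
    return first, time.monotonic() - start

def fake_stream(first_token_delay, n_tokens):
    # Stand-in for a streaming response: first token arrives after a delay
    time.sleep(first_token_delay)
    for i in range(n_tokens):
        yield f"token-{i}"

ttft, total = measure_ttft(fake_stream(0.05, 10))
```

Run the same measurement against your GPU node from a developer laptop to capture network latency, which the benchmark tables above fold into their TTFT ranges.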
Cost Comparison: GPU Cloud Self-Hosting vs Commercial Subscriptions
| Team size | Copilot Business | Cursor Pro | Tabby (A100 80GB, on-demand) | Tabby (A100 80GB, spot) |
|---|---|---|---|---|
| 10 devs | $190/mo | $200/mo | $749/mo (24/7) | ~$821/mo |
| 25 devs | $475/mo | $500/mo | $749/mo | ~$821/mo |
| 50 devs | $950/mo | $1,000/mo | $749/mo | ~$821/mo |
| 100 devs | $1,900/mo | $2,000/mo | $749-$1,498/mo | ~$821-$1,642/mo |
Notes on the table:
- A100 80GB PCIe at $1.04/hr on-demand x 720 hrs = ~$749/month.
- Spot pricing at $1.14/hr x 720 hrs = ~$821/month. For this GPU, spot is currently more expensive than on-demand, so on-demand is the better choice for continuous serving. Spot instances are also interruptible, which breaks live autocomplete sessions.
- Break-even for a single A100 node is around 39 developers. Below that, Copilot is cheaper.
- A single A100 80GB running Qwen2.5-Coder 32B (4-bit) handles 15-25 concurrent developers with Tabby's request queuing. Teams over 100 may need two nodes, keeping costs well below the per-seat alternatives.
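The node-count estimate in the last bullet can be sketched with one assumption made explicit: only a fraction of the team is generating completions at any instant. Both the `active_ratio` value and the function itself are illustrative, not measured:

```python
import math

def nodes_needed(team_size, active_ratio=0.25, per_node_capacity=25):
    """Rough node count. active_ratio (fraction of devs typing at once) is an
    assumption; per_node_capacity matches the 15-25 concurrent range above."""
    concurrent = math.ceil(team_size * active_ratio)
    return max(1, math.ceil(concurrent / per_node_capacity))

small_team = nodes_needed(50)    # 1 node
large_team = nodes_needed(150)   # 2 nodes
```

If your team skews toward heavy agentic workflows, raise `active_ratio` before trusting the estimate.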
For a broader pricing comparison across GPU providers, see GPU cloud pricing comparison 2026.
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Production Tips: Multi-User Serving, Context Length, and Fine-Tuning on Your Codebase
Multi-user serving
Tabby handles this natively. Admins create per-user API keys through the Tabby dashboard, track usage by developer, and revoke access without touching the model server. For Continue + vLLM or Ollama backends, put NGINX in front with bearer token auth so each developer authenticates before hitting the inference endpoint.
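A minimal NGINX sketch of that bearer-token gate, assuming Ollama is rebound to 127.0.0.1 so only the proxy is publicly reachable. The token value and listen port are placeholders; a `map` block can hold per-developer tokens instead of the single shared one shown here:

```nginx
server {
    listen 8443 ssl;
    # ssl_certificate / ssl_certificate_key lines omitted for brevity

    location / {
        # Reject any request that lacks the shared bearer token
        if ($http_authorization != "Bearer CHANGE_ME_TEAM_TOKEN") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;  # Ollama bound to localhost only
    }
}
```

Developers then set `http://YOUR_SPHERON_IP:8443` as the API base and add the token as an Authorization header in their client config.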
A single A100 80GB running Qwen2.5-Coder 32B (4-bit) handles 15-25 concurrent completion requests before latency climbs above 500ms. Beyond that, either upgrade to an A100 SXM4 (faster memory bandwidth) or add a second node.
Context length configuration
Set Tabby's --context-length to 8192 for standard autocomplete. Longer contexts improve multi-file understanding (Tabby can look across open files and recently edited code) but increase TTFT and VRAM usage. For chat and explain tasks via Continue, use 32768 if your model and VRAM budget allow. Qwen2.5-Coder supports up to 128k context, but you need proportionally more VRAM for the KV cache at that length.
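The KV-cache cost of longer contexts is easy to estimate. A sketch using assumed Qwen2.5-32B-class architecture figures (64 layers, 8 KV heads with GQA, head dim 128; verify against the model config before budgeting):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store kv_heads * head_dim elements per layer (FP16 = 2 bytes)
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen2.5-32B-class shape: 64 layers, 8 KV heads, head dim 128
per_token = kv_bytes_per_token(64, 8, 128)        # 262,144 bytes = 256 KB/token
per_32k_seq_gib = per_token * 32768 / 1024**3     # 8.0 GiB for one 32k sequence
```

At 8 GiB per full 32k sequence, a 4-bit 32B model (~22GB weights) on an 80GB card supports only a handful of simultaneous long-context chat sessions, which is why autocomplete stays at 8192.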
Fine-tuning on your codebase
Fine-tuning Qwen2.5-Coder with LoRA on internal code improves completions for proprietary libraries, internal APIs, and codebase-specific patterns. The VRAM overhead for LoRA training is modest (4-16GB on top of inference). See how to fine-tune an LLM in 2026 for a full walkthrough.
Once you have a fine-tuned adapter, serving it alongside the base model without reloading weights is straightforward with vLLM. See LoRA multi-adapter serving on GPU cloud for the serving setup.
Running your own coding assistant on Spheron means your code stays on your infrastructure, and the per-seat cost drops below any commercial subscription once your team passes 39 developers. For async tasks like PR review, documentation generation, and batch refactoring, a shared node drops the per-seat cost even further.
