GitHub Copilot Business costs $19 per seat per month. At 50 developers, that's $11,400 per year - and every line of your proprietary code goes through Microsoft's servers. This guide shows you how to run a Copilot-equivalent assistant on your own GPU cloud for less.
For background on how commercial tools like Cursor and Claude Code are architected, see GPU infrastructure behind Cursor, Claude Code, and Copilot.
Why Self-Host Your AI Coding Assistant in 2026
Three reasons, each with numbers.
Cost at team scale. A single A100 80GB on Spheron runs at $1.04/hr on-demand. Over a 30-day month (720 hours), that's $749. GitHub Copilot Business is $19/seat/month. The math: $749 / $19 = ~39 seats. Once your team passes 39 developers, self-hosting costs less than Copilot. Larger teams can share a single A100 node serving 15-25 concurrent autocomplete requests with Tabby's request queuing, so the per-seat cost drops further as headcount grows.
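The break-even arithmetic above can be sketched in a few lines (the function names are illustrative, not from any library):

```python
import math

def monthly_gpu_cost(hourly_rate, hours=720):
    # 720 hours = a 30-day month of continuous serving
    return hourly_rate * hours

def first_winning_team_size(gpu_monthly, per_seat):
    # Smallest team size at which per-seat billing exceeds the GPU node cost
    return math.ceil(gpu_monthly / per_seat)

a100_month = monthly_gpu_cost(1.04)              # $748.80, ~$749/month
team = first_winning_team_size(a100_month, 19)   # 40: past 39 devs, self-hosting wins
```

At 39 seats Copilot bills $741, still under the $749 node; at 40 seats it bills $760 and the node wins.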
Data sovereignty. Every request to Copilot, Cursor, or any cloud API sends code to a third party. For healthcare, fintech, legal, and defense contractors, that is not an option. Running inference on your own GPU means credentials, proprietary algorithms, and unreleased product code never leave your network. Some compliance frameworks (SOC 2 Type II, HIPAA) explicitly require this.
Model control. You choose what runs. Swap Qwen2.5-Coder 7B for faster autocomplete on a shared node, or run Qwen2.5-Coder 32B for better multi-file understanding. Fine-tune on your own codebase to improve completions for internal libraries. Set your own context length. None of that is possible with a SaaS subscription.
Comparing Open-Source Coding Assistants: Tabby, Continue, FauxPilot, and Kilo
| Tool | Type | Model server included | IDE support | Multi-user | Best for |
|---|---|---|---|---|---|
| Tabby | Full stack | Yes (own runtime) | VS Code, JetBrains | Yes (team auth) | Teams wanting a drop-in Copilot replacement |
| Continue | IDE plugin only | No (uses vLLM/Ollama) | VS Code, JetBrains | Via backend | Devs who already run vLLM |
| FauxPilot | Full stack | Triton server | VS Code | Limited | Legacy GitHub Copilot plugin compatibility |
| Kilo | Agentic coding platform | No (uses any OpenAI API) | VS Code, JetBrains, CLI | Via backend | Agentic coding with broad IDE support |
Tabby is the closest to a managed Copilot replacement. It ships a model server, an IDE plugin, user authentication, and usage telemetry in one package. You point the plugin at your Spheron instance IP and it works. Team authentication lets admins create per-user API keys, revoke access, and track usage by developer. The tradeoff: you're tied to Tabby's supported model list, which covers all the major Qwen and DeepSeek variants.
Continue takes the opposite approach. It's an IDE plugin that connects to any OpenAI-compatible backend. You bring your own model server (vLLM, Ollama, or a hosted API) and configure Continue's config.json to point at it. This gives you full control over the inference stack, but there's no built-in auth or telemetry. Good if your team already runs vLLM for other workloads. See Build a Self-Hosted OpenAI-Compatible API with vLLM for a step-by-step vLLM deployment guide.
FauxPilot was built to mimic GitHub Copilot's API exactly, so the original Copilot VS Code extension works without modification. It has not had a significant release since 2023 and is not recommended for new deployments in 2026. The Triton inference server it uses is harder to operate than vLLM or Ollama for most teams.
Kilo is a full agentic coding platform with over 1.5M users, available as extensions for VS Code, JetBrains, and a CLI. Like Continue, it connects to any OpenAI-compatible API and brings no model server of its own. It supports agentic workflows (multi-step code changes, file creation, terminal commands) across all three environments. A solid option for teams that already run an inference backend and want strong IDE coverage beyond just VS Code.
Best Coding LLMs for Self-Hosting: Qwen2.5-Coder vs DeepSeek-Coder-V2 vs StarCoder 2
| Model | HumanEval | FIM Support | Params | VRAM (FP16) | VRAM (4-bit) |
|---|---|---|---|---|---|
| Qwen2.5-Coder 32B | 92.7% | Yes | 32B | ~65GB | ~20GB |
| Qwen2.5-Coder 14B | 89.6% | Yes | 14B | ~29GB | ~9GB |
| Qwen2.5-Coder 7B | 88.4% | Yes | 7B | ~15GB | ~5GB |
| DeepSeek-Coder-V2-Lite | 81.1% | Yes | 16B MoE | ~20GB | ~8GB |
| DeepSeek-Coder-V2 236B | ~90% | Yes | 236B MoE | 4x H100 (FP8) | 4x A100 |
| StarCoder 2 15B | 72.6% | Yes | 15B | ~30GB | ~10GB |
HumanEval scores are pass@1 on instruct variants. StarCoder 2 15B is included for reference; Qwen2.5-Coder outperforms it by a wide margin. Qwen2.5-Coder became the most-deployed self-hosted coding model in early 2026, overtaking Llama-based models (RunPod deployment data, March 2026).
A note on FIM vs HumanEval: HumanEval measures code generation from a docstring. Fill-in-the-middle (FIM) is what autocomplete tools actually use. Qwen2.5-Coder supports both modes natively. The FIM-specific benchmarks (infilling tasks from SantaCoder's benchmark suite) show similar rank ordering, but TTFT matters as much as accuracy for real autocomplete use. The 7B and 14B models hit autocomplete latency targets more reliably on a shared node.
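To make the FIM distinction concrete, here is a minimal sketch of how an autocomplete prompt is assembled for Qwen2.5-Coder, using the FIM control tokens documented in its model card (the helper function is illustrative):

```python
# Qwen2.5-Coder's fill-in-the-middle control tokens (from the model card)
FIM_TEMPLATE = "<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: the model generates the code that belongs
    between the text before the cursor (prefix) and after it (suffix)."""
    return FIM_TEMPLATE.format(prefix=prefix, suffix=suffix)

# Cursor sits after "return " in the function body
prompt = fim_prompt("def mean(xs):\n    return ", " / len(xs)\n")
```

The editor plugin builds this prompt on every keystroke, which is why TTFT, not end-to-end generation speed, dominates the autocomplete experience.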
For large enterprise teams willing to operate multi-GPU infrastructure, DeepSeek-Coder-V2 236B (4x H100 minimum for FP8 inference) delivers near-GPT-4o quality. Do not attempt this on fewer than 4x H100 80GB; it will OOM at weight load time.
See also: GPU memory requirements for LLMs for a full VRAM sizing reference.
GPU Requirements and VRAM Sizing
| Model | GPU | VRAM Used | Concurrent Users | On-Demand Price | Spot Price |
|---|---|---|---|---|---|
| Qwen2.5-Coder 7B (FP16) | L40S 48GB | ~15GB | 1-5 | $1.80/hr | - |
| Qwen2.5-Coder 14B (FP16) | A100 80GB PCIe | ~29GB | 1-8 | $1.04/hr | $1.14/hr |
| Qwen2.5-Coder 32B (4-bit GPTQ) | A100 80GB PCIe | ~22GB | 1-15 | $1.04/hr | $1.14/hr |
| Qwen2.5-Coder 32B (FP16) | A100 80GB SXM4 | ~65GB | 1-15 | $1.64/hr | - |
| DeepSeek-Coder-V2 236B (FP8) | 4x H100 PCIe | ~160GB | 10-50 | ~$8.04/hr | - |
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
The 32B model at 4-bit GPTQ fits comfortably on a single A100 80GB PCIe at $1.04/hr on-demand, and handles 1-15 concurrent autocomplete requests before latency degrades. For teams under 15 developers, this is the recommended configuration.
For context: at $1.04/hr, a 30-day A100 costs $749. At 15 concurrent users sharing the node, that's about $50/developer/month, below any commercial subscription. Smaller teams sharing the same node get even better economics.
See also: GPU requirements cheat sheet for 2026 for a broader sizing reference. For A100 GPU rental options and availability, see Rent NVIDIA A100 GPUs on Spheron.
Step-by-Step: Deploy Tabby with Qwen2.5-Coder on Spheron GPU Cloud
Prerequisites
- Spheron account at app.spheron.ai
- Hugging Face token (for gated model access, if needed)
- VS Code or JetBrains IDE
1. Provision the GPU instance
Log into app.spheron.ai, navigate to GPU deployments, and select A100 80GB PCIe for Qwen2.5-Coder 32B (4-bit), or L40S for the 7B model. Choose Ubuntu 22.04 with CUDA 12.4. The instance boots in 60-90 seconds. For details on instance types (spot vs dedicated vs bare metal), see the Spheron instance types docs.
Once booted, SSH in and verify the GPU:
```bash
# Verify GPU after instance boots
nvidia-smi
```

2. Run Tabby with Qwen2.5-Coder
```bash
# Pull and run Tabby with Qwen2.5-Coder 7B
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby:latest \
  serve \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --device cuda
```

For the 32B model, replace `Qwen/Qwen2.5-Coder-7B-Instruct` with `Qwen/Qwen2.5-Coder-32B-Instruct`. Pin the Tabby image to a specific version tag in production (e.g., `tabbyml/tabby:v0.18.0`) to avoid unexpected updates.
Tabby downloads the model weights from Hugging Face on first run. The 7B model is about 15GB; the 32B is around 65GB (FP16) or 20GB (4-bit). Wait for the log line Listening on 0.0.0.0:8080 before connecting the IDE plugin.
3. Configure the Tabby VS Code extension
- Install the `TabbyML.vscode-tabby` extension from the VS Code Marketplace.
- Open VS Code settings and search for "Tabby server".
- Set the endpoint to `http://YOUR_SPHERON_IP:8080`.
- Open any code file and start typing. Tabby should show completions in 1-3 seconds on first request (the model warms up), then drop to 80-400ms for subsequent requests.
4. Cloud-init script for automated setup
Use this cloud-init config when provisioning the Spheron instance to automate the entire setup. Spheron supports both cloud-init YAML and Bash startup scripts at launch time. See the Spheron startup scripts docs for the full format reference.
```yaml
#cloud-config
package_update: true
packages:
  - docker.io
  - nginx
runcmd:
  - systemctl enable docker
  - systemctl start docker
  - docker run -d --gpus all -p 8080:8080 --restart unless-stopped -v /data/tabby:/data tabbyml/tabby:latest serve --model Qwen/Qwen2.5-Coder-7B-Instruct --device cuda
```

This starts Tabby as a Docker container with `--restart unless-stopped`, so it comes back up after a reboot. Swap the model name to match your chosen variant.
Step-by-Step: Deploy Continue with Ollama Backend on GPU Cloud
Continue is the better choice if you already run an inference backend (vLLM or Ollama) on your Spheron instance, or if you want a single backend serving both chat and autocomplete with different model sizes.
1. Install Ollama on the instance
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

The Ollama installer creates a systemd service. By default it binds to 127.0.0.1, which is not reachable from outside the instance. Set `OLLAMA_HOST=0.0.0.0` via a systemd drop-in so the binding persists across reboots and SSH sessions:
**Security warning:** Setting `OLLAMA_HOST=0.0.0.0` exposes port 11434 on every network interface, including the instance's public IP. Ollama has no built-in authentication. Before or immediately after this step, restrict access with a firewall rule so only your IP can reach the port:

```bash
sudo ufw allow from <YOUR_IP> to any port 11434
sudo ufw deny 11434
sudo ufw allow 22  # keep SSH open to avoid locking yourself out
sudo ufw enable    # activate the firewall (disabled by default on Ubuntu)
```

Replace `<YOUR_IP>` with your developer machine's IP. Run `ufw status` to confirm the rules are active. If you skip this, anyone who discovers the instance IP can make unlimited inference requests at your expense. The NGINX bearer-token setup in the Production Tips section below is the recommended long-term solution for team access.
```bash
# Create a systemd override to bind Ollama to all interfaces
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF

# Reload systemd and start Ollama
sudo systemctl daemon-reload
sudo systemctl enable ollama && sudo systemctl restart ollama
```

```bash
# Pull the models (this downloads weights from Ollama's registry)
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b
```

Using `systemctl enable ollama && systemctl restart ollama` ensures Ollama starts automatically on boot and restarts with the new `OLLAMA_HOST` environment variable applied. The separate restart is required because the installer already started Ollama bound to 127.0.0.1. Without it, the env var change never takes effect in the running process and your IDE will hit a connection-refused error.
2. Configure the Continue extension
Install Continue from the VS Code Marketplace, then edit ~/.continue/config.json:
```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B (Spheron)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://YOUR_SPHERON_IP:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B (fast)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://YOUR_SPHERON_IP:11434"
  }
}
```

The two-model pattern here is intentional. The 32B model handles chat requests ("explain this function", "refactor this class") where quality matters more than speed. The 7B model handles real-time tab completion, where you need sub-200ms responses to avoid breaking typing flow. Running both on a single A100 80GB is viable since Ollama loads models on demand and swaps them in memory.
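Whether Ollama actually needs to swap depends on whether both quantized models fit in VRAM at once. A quick sanity check, using the approximate 4-bit footprints from the model table above and an assumed KV-cache/overhead allowance (the 8GB headroom figure is an assumption, not a measured value):

```python
def both_models_resident(vram_gb, model_footprints_gb, kv_headroom_gb=8.0):
    # kv_headroom_gb: assumed allowance for KV cache and CUDA overhead
    return sum(model_footprints_gb) + kv_headroom_gb <= vram_gb

# 4-bit footprints from the model table above: 32B ~20 GB, 7B ~5 GB
on_a100_80 = both_models_resident(80, [20, 5])   # True: both stay loaded
on_24gb_card = both_models_resident(24, [20, 5]) # False: Ollama must swap
```

On the A100 80GB both models stay resident, so tab completion never waits on a model swap triggered by a chat request.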
For more on the Ollama vs vLLM tradeoff for self-hosting, see Ollama vs vLLM: which to use for self-hosting LLMs.
Performance Benchmarks: Self-Hosted vs GitHub Copilot vs Cursor
| System | HumanEval | FIM Accuracy | TTFT (autocomplete) | Cost per dev/month |
|---|---|---|---|---|
| GitHub Copilot Business | ~75% (est.) | High | 80-150ms | $19 |
| Cursor Pro | ~80% (est.) | Very high | 60-120ms | $20 |
| Tabby + Qwen2.5-Coder 32B (A100) | 92.7% | High | 150-400ms | $1.04/hr shared |
| Tabby + Qwen2.5-Coder 7B (L40S) | 88.4% | Good | 80-200ms | $1.80/hr shared |
| Continue + DeepSeek-Coder-V2 236B (4x H100) | ~90% | Very high | 200-600ms | ~$8.04/hr shared |
Copilot and Cursor HumanEval numbers are estimates based on published model composition reports; self-hosted Qwen2.5-Coder 32B scores come from the Qwen technical report (pass@1, instruct variant). TTFT figures depend on request concurrency and network latency from your laptop to the GPU instance.
The TTFT gap is real. Copilot and Cursor operate latency-optimized data centers with pre-warmed caches and speculative prefetching. Your Spheron instance is a general-purpose GPU node. At low concurrency (1-3 devs), self-hosted TTFT is competitive. At higher concurrency with a shared node, response times increase. The Qwen2.5-Coder 7B on L40S is the best match for Copilot's latency profile.
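To see where your own deployment lands, measure TTFT from the client side of the wire. A minimal sketch; the `fake_stream` generator is a stand-in for a real streaming HTTP response (for example, iterating over chunks from Ollama's `/api/generate` with streaming enabled):

```python
import time

def measure_ttft(chunks):
    """Return (ttft_seconds, total_seconds) for an iterable of streamed chunks."""
    start = time.monotonic()
    first = None
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
    return first, time.monotonic() - start

def fake_stream(first_token_delay, n_tokens):
    # Stand-in for a streaming response: first token arrives after a delay
    time.sleep(first_token_delay)
    for i in range(n_tokens):
        yield f"token-{i}"

ttft, total = measure_ttft(fake_stream(0.05, 10))
```

Run the same measurement against your GPU node from a developer laptop to capture network latency, which the benchmark tables above fold into their TTFT ranges.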
Cost Comparison: GPU Cloud Self-Hosting vs Commercial Subscriptions
| Team size | Copilot Business | Cursor Pro | Tabby (A100 80GB, on-demand) | Tabby (A100 80GB, spot) |
|---|---|---|---|---|
| 10 devs | $190/mo | $200/mo | $749/mo (24/7) | ~$821/mo |
| 25 devs | $475/mo | $500/mo | $749/mo | ~$821/mo |
| 50 devs | $950/mo | $1,000/mo | $749/mo | ~$821/mo |
| 100 devs | $1,900/mo | $2,000/mo | $749-$1,498/mo | ~$821-$1,642/mo |
Notes on the table:
- A100 80GB PCIe at $1.04/hr on-demand x 720 hrs = ~$749/month.
- Spot pricing at $1.14/hr x 720 hrs = ~$821/month. For this GPU, spot is currently more expensive than on-demand, so on-demand is the better choice for continuous serving. Spot instances are also interruptible, which breaks live autocomplete sessions.
- Break-even for a single A100 node is around 39 developers. Below that, Copilot is cheaper.
- A single A100 80GB running Qwen2.5-Coder 32B (4-bit) handles 15-25 concurrent developers with Tabby's request queuing. Teams over 100 may need two nodes, keeping costs well below the per-seat alternatives.
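The node-count estimate in the last bullet can be sketched with one assumption made explicit: only a fraction of the team is generating completions at any instant. Both the `active_ratio` value and the function itself are illustrative, not measured:

```python
import math

def nodes_needed(team_size, active_ratio=0.25, per_node_capacity=25):
    """Rough node count. active_ratio (fraction of devs typing at once) is an
    assumption; per_node_capacity matches the 15-25 concurrent range above."""
    concurrent = math.ceil(team_size * active_ratio)
    return max(1, math.ceil(concurrent / per_node_capacity))

small_team = nodes_needed(50)    # 1 node
large_team = nodes_needed(150)   # 2 nodes
```

If your team skews toward heavy agentic workflows, raise `active_ratio` before trusting the estimate.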
For a broader pricing comparison across GPU providers, see GPU cloud pricing comparison 2026.
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Production Tips: Multi-User Serving, Context Length, and Fine-Tuning on Your Codebase
Multi-user serving
Tabby handles this natively. Admins create per-user API keys through the Tabby dashboard, track usage by developer, and revoke access without touching the model server. For Continue + vLLM or Ollama backends, put NGINX in front with bearer token auth so each developer authenticates before hitting the inference endpoint.
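A minimal NGINX sketch of that bearer-token gate, assuming Ollama is rebound to 127.0.0.1 so only the proxy is publicly reachable. The token value and listen port are placeholders; a `map` block can hold per-developer tokens instead of the single shared one shown here:

```nginx
server {
    listen 8443 ssl;
    # ssl_certificate / ssl_certificate_key lines omitted for brevity

    location / {
        # Reject any request that lacks the shared bearer token
        if ($http_authorization != "Bearer CHANGE_ME_TEAM_TOKEN") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;  # Ollama bound to localhost only
    }
}
```

Developers then set `http://YOUR_SPHERON_IP:8443` as the API base and add the token as an Authorization header in their client config.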
A single A100 80GB running Qwen2.5-Coder 32B (4-bit) handles 15-25 concurrent completion requests before latency climbs above 500ms. Beyond that, either upgrade to an A100 SXM4 (faster memory bandwidth) or add a second node.
Context length configuration
Set Tabby's --context-length to 8192 for standard autocomplete. Longer contexts improve multi-file understanding (Tabby can look across open files and recently edited code) but increase TTFT and VRAM usage. For chat and explain tasks via Continue, use 32768 if your model and VRAM budget allow. Qwen2.5-Coder supports up to 128k context, but you need proportionally more VRAM for the KV cache at that length.
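The KV-cache cost of longer contexts is easy to estimate. A sketch using assumed Qwen2.5-32B-class architecture figures (64 layers, 8 KV heads with GQA, head dim 128; verify against the model config before budgeting):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store kv_heads * head_dim elements per layer (FP16 = 2 bytes)
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen2.5-32B-class shape: 64 layers, 8 KV heads, head dim 128
per_token = kv_bytes_per_token(64, 8, 128)        # 262,144 bytes = 256 KB/token
per_32k_seq_gib = per_token * 32768 / 1024**3     # 8.0 GiB for one 32k sequence
```

At 8 GiB per full 32k sequence, a 4-bit 32B model (~22GB weights) on an 80GB card supports only a handful of simultaneous long-context chat sessions, which is why autocomplete stays at 8192.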
Fine-tuning on your codebase
Fine-tuning Qwen2.5-Coder with LoRA on internal code improves completions for proprietary libraries, internal APIs, and codebase-specific patterns. The VRAM overhead for LoRA training is modest (4-16GB on top of inference). See how to fine-tune an LLM in 2026 for a full walkthrough.
Once you have a fine-tuned adapter, serving it alongside the base model without reloading weights is straightforward with vLLM. See LoRA multi-adapter serving on GPU cloud for the serving setup.
Running your own coding assistant on Spheron means your code stays on your infrastructure, and the per-seat cost drops below any commercial subscription once your team passes 39 developers. For async tasks like PR review, documentation generation, and batch refactoring, a shared node drops the per-seat cost even further.
