Deploy Devstral on GPU Cloud: Self-Host Mistral's Coding Model with vLLM (2026)

Devstral scores 46.8% on SWE-bench Verified; GPT-4o scores 33.2% on the same benchmark. That gap matters for repository-level coding tasks, and Devstral runs on a single L40S or A100 GPU. The cost problem is not the model itself. It's the per-seat SaaS math: Cursor Pro at $20/seat, GitHub Copilot Business at $19/seat. A 20-person engineering team pays $380-$400/month for a subscription that gives it no control over which model runs or where its code goes.

This guide covers everything you need to self-host Devstral: GPU requirements, vLLM setup on Spheron, IDE plugin configuration for Continue, Aider, and Cline, and the actual cost math at different team sizes.

What Devstral Is

Devstral is a 24B coding-specialized model from Mistral AI, trained for SWE-bench-style agentic tasks. Unlike general-purpose models that handle code as one of many capabilities, Devstral is designed specifically for repository-level programming work: reading files, running tests, applying patches, and navigating codebases through tool calls.

The architecture uses Mistral Small 3.1 24B as the base, with fine-tuning on coding and tool-use data. The model supports native function calling in Mistral's tool format, which maps directly to the OpenAI tool_calls API structure. Any framework that speaks OpenAI function calling works without modification.

Property	Value
Parameters	24B
Architecture	Mistral Small 3.1 24B base
Context window	128K tokens
SWE-bench Verified	46.8%
License	Apache 2.0
Tool use	Native (Mistral function calling format)

The 128K context window matters for coding tasks. Multi-file edits, large test suites, and full repository reads can hit 32K tokens without trying. Having 128K headroom means Devstral can read entire modules before deciding what to change, which is what makes SWE-bench agentic scores meaningful in practice. That said, for most single-GPU configs below, you'll want to cap max-model-len below 128K to keep VRAM headroom for KV cache. The full context window is feasible only on 80GB cards.

GPU Hardware Requirements

The VRAM math starts with the parameter count. Devstral is 24B parameters.

BF16: 24B × 2 bytes = 48 GB weights + ~15% activation overhead = ~55 GB minimum VRAM
FP8: 24B × 1 byte = 24 GB weights + ~15% overhead = ~28 GB minimum VRAM
AWQ 4-bit: ~12 GB weights + overhead = ~15 GB minimum VRAM

The 15% overhead accounts for activations, framework buffers, and the CUDA context. KV cache is on top of that and depends on max-model-len and concurrent sequences.
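The weight-plus-overhead arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name is illustrative, and it applies the flat 15% overhead only to the BF16 and FP8 rows, since the text does not state the exact overhead used for AWQ.

```python
def weights_plus_overhead_gb(params_billion: float, bytes_per_param: float,
                             overhead: float = 0.15) -> float:
    """Weight footprint plus a flat fractional overhead, in decimal GB.

    KV cache is NOT included; it comes on top and depends on
    max-model-len and the number of concurrent sequences.
    """
    weights_gb = params_billion * bytes_per_param  # 1B params x 1 byte = 1 GB
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    for label, bytes_per_param in (("BF16", 2.0), ("FP8", 1.0)):
        estimate = weights_plus_overhead_gb(24, bytes_per_param)
        print(f"{label}: ~{estimate:.1f} GB before KV cache")
```

Whatever remains on the card after this figure, and after the --gpu-memory-utilization cap, is what vLLM can use for KV cache.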

For full VRAM sizing across all major model families, see the GPU memory requirements guide for LLMs. The table below covers the main Devstral deployment configs:

Config	Precision	VRAM for Weights + Overhead	KV Cache Headroom (32K ctx)	Notes
H100 SXM5 80GB	BF16	~55 GB	~25 GB	High throughput, high concurrency
A100 80GB	BF16	~55 GB	~25 GB	Reliable single-GPU production option
L40S 48GB	FP8	~28 GB	~18 GB	Best cost/performance for most teams
RTX 4090 24GB	AWQ 4-bit	~15 GB	~7 GB	Dev/test use, low concurrency

L40S at FP8 is the standout single-GPU option. FP8 brings weight footprint down to ~28 GB, leaving ~18 GB for KV cache at 32K context. That is enough for several concurrent coding sessions. The L40S does not have H100-class FP8 Tensor Cores, so FP8 throughput is lower than on H100, but the VRAM fit is what matters for a team-sized deployment.

A100 80GB at BF16 gives more KV cache headroom than L40S and handles longer contexts. If your team regularly sends 50K+ token requests (whole-repo reads, large file diffs), A100 is the better choice.

Reserve your H100 GPU rental on Spheron for high-throughput workloads serving 15+ concurrent developers. For budget single-GPU deployment, L40S on Spheron is the recommended starting point.

Step-by-Step Deployment with vLLM on Spheron

Step 1: Choose your GPU and quantization level

Pick based on team size and budget:

L40S + FP8 for teams of 8-12 developers. Best cost per developer.
A100 80GB + BF16 for teams needing longer context (65K+ tokens) or higher KV cache headroom.
H100 SXM5 + BF16 for teams of 15-20 developers or high-throughput agentic pipelines.
RTX 4090 + AWQ for individual developers or dev/test environments.
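As a rough sketch, the selection rules above can be written as a small function. The team-size ranges in this guide leave gaps (4-7 and 13-14 developers), so the cut-offs marked as assumptions below are one possible reading rather than a recommendation from the source; pick_config is a hypothetical helper name.

```python
def pick_config(developers: int, max_context_tokens: int = 32_768) -> str:
    """Map team size and longest expected request to one of the four configs."""
    if developers < 1:
        raise ValueError("developers must be at least 1")
    # Requests beyond 32K tokens need an 80 GB card (the L40S is capped at 32768).
    if max_context_tokens > 32_768:
        return "A100 80GB + BF16" if developers < 15 else "H100 SXM5 + BF16"
    if developers <= 3:    # assumption: individual / dev-test means up to 3 developers
        return "RTX 4090 + AWQ"
    if developers <= 12:   # assumption: 4-7 developers also land on the L40S tier
        return "L40S + FP8"
    if developers < 15:    # assumption: 13-14 developers fall to the A100
        return "A100 80GB + BF16"
    return "H100 SXM5 + BF16"


if __name__ == "__main__":
    for team, ctx in ((2, 16_384), (10, 32_768), (10, 65_536), (18, 32_768)):
        print(team, ctx, "->", pick_config(team, ctx))
```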

Step 2: Provision a Spheron GPU instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select your GPU tier, enable spot pricing where available, and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 100 GB persistent storage for model weights and vLLM cache. See the Spheron docs for instance provisioning details.

For cloud-init automation, use this script to bootstrap the instance on first boot:

bash

#!/bin/bash
set -e

# Install system deps
apt-get update -qq && apt-get install -y -qq curl wget

# Install uv for fast Python package management
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Create venv and install vLLM
uv venv /opt/vllm-env
source /opt/vllm-env/bin/activate
uv pip install 'vllm>=0.8.0' huggingface_hub hf_transfer

echo "Bootstrap complete"

Step 3: Install vLLM and download Devstral weights

bash

pip install 'vllm>=0.8.0'
pip install huggingface_hub hf_transfer

export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download Devstral weights (~48 GB in BF16; FP8 quantization is applied at load time)
huggingface-cli download mistralai/Devstral-Small-2505

Verify the HuggingFace repo ID at huggingface.co/mistralai before running. The checkpoint name Devstral-Small-2505 reflects the May 2025 release date. If Mistral releases a revised checkpoint, the slug changes. Note: Mistral has since released mistralai/Devstral-Small-2-24B-Instruct-2512 (December 2025). The same setup steps apply to that checkpoint.

Step 4: Launch vLLM with the OpenAI-compatible server

For L40S at FP8 (recommended for most teams):

bash

vllm serve mistralai/Devstral-Small-2505 \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --gpu-memory-utilization 0.92

For A100 80GB or H100 80GB at BF16 (longer contexts):

bash

vllm serve mistralai/Devstral-Small-2505 \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --gpu-memory-utilization 0.90

Two flags are non-optional here: --enable-auto-tool-choice and --tool-call-parser mistral. Without --tool-call-parser mistral, vLLM does not parse Devstral's native function-calling output, and tool calls come back as plain text in the response content instead of a structured tool_calls array. Do not skip these flags regardless of which GPU you use.

The --max-model-len cap matters per GPU:

L40S: cap at 32768. At 128K context, FP8 weights + KV cache will OOM.
A100/H100 80GB: use 65536 comfortably, extend to 131072 if you have headroom.
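To see why the cap matters, the KV cache cost per token can be estimated from the model's attention shape. The sketch below assumes 40 layers, 8 KV heads, and a head dimension of 128 for the 24B base, stored in BF16; these values are assumptions to verify against the checkpoint's config file, and the helper names are illustrative.

```python
# Assumed architecture values for the 24B base model.
# Verify against the checkpoint's config before relying on them.
ASSUMED_LAYERS = 40
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128


def kv_bytes_per_token(layers: int = ASSUMED_LAYERS,
                       kv_heads: int = ASSUMED_KV_HEADS,
                       head_dim: int = ASSUMED_HEAD_DIM,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(total_tokens: int) -> float:
    """KV cache size in GiB for a total token count across all live sequences."""
    return total_tokens * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    for tokens in (32_768, 65_536, 131_072):
        print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens):.1f} GiB of KV cache")
```

Under these assumptions a single 32K-token sequence needs about 5 GiB of BF16 KV cache and a single 128K-token sequence about 20 GiB, which is more than the KV headroom listed for the L40S.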

For production deployments, consider adding a systemd unit to handle restarts automatically. See the vLLM production deployment guide for tensor parallelism details and systemd configuration.

Step 5: Test tool calling with a coding task

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [
      {
        "role": "user",
        "content": "Read the file main.py and add error handling to the main function."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "read_file",
          "description": "Read a file from the filesystem",
          "parameters": {
            "type": "object",
            "properties": {
              "path": {"type": "string", "description": "File path to read"}
            },
            "required": ["path"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'

A correct response includes a tool_calls array in the assistant message. If you see the tool call parameters as plain text in the response content instead of structured JSON, --tool-call-parser mistral is missing from the serve command.
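The check described above can be automated. The following sketch inspects an already-decoded chat completion (a Python dict in the OpenAI response shape) and reports whether the tool call arrived as a structured tool_calls entry or leaked into the text content. The function name and return labels are hypothetical.

```python
import json


def classify_tool_response(response: dict, tool_names: list[str]) -> str:
    """Return 'structured', 'plain_text', or 'no_tool_call' for a chat completion."""
    message = response["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    if tool_calls:
        for call in tool_calls:
            # In the OpenAI format, arguments arrive as a JSON-encoded string;
            # json.loads raises ValueError if the server returned malformed JSON.
            json.loads(call["function"]["arguments"])
        return "structured"
    content = message.get("content") or ""
    if any(name in content for name in tool_names):
        # The tool name shows up in free text: the parser flag is likely missing.
        return "plain_text"
    return "no_tool_call"


if __name__ == "__main__":
    sample = {"choices": [{"message": {"role": "assistant", "content": None,
              "tool_calls": [{"type": "function", "function": {
                  "name": "read_file", "arguments": "{\"path\": \"main.py\"}"}}]}}]}
    print(classify_tool_response(sample, ["read_file"]))
```

Feed it the decoded JSON body from the curl request; a result of plain_text points back at the serve flags.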

Step 6: Connect Continue or Aider to the vLLM endpoint

See the IDE integration section below for config snippets.

Step 7: Set up production monitoring

bash

# Scrape vLLM Prometheus metrics (vLLM metric names use a "vllm:" prefix)
curl -s http://localhost:8000/metrics | grep -E "vllm:|process_"

# Watch GPU utilization live
nvidia-smi dmon -s u -d 5

# Watch running and queued request counts
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "vllm:num_requests_(running|waiting)"'

Set --max-num-seqs 32 to cap concurrent requests at a level your GPU can sustain without queue buildup. This is the primary SLO protection knob.
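For alerting outside of a full Prometheus setup, the queue-depth check can be sketched in Python. The parser below handles the plain Prometheus text format (one sample per line, no trailing timestamps, which is an assumption about the endpoint's output) and looks metrics up by name suffix so it does not depend on the exact prefix. The sample text is illustrative, not captured from a real server.

```python
def parse_samples(metrics_text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # skip lines that do not end in a number
    return totals


def metric_by_suffix(totals: dict[str, float], suffix: str) -> float:
    """Return the summed value of all metrics whose name ends with `suffix`."""
    return sum(v for k, v in totals.items() if k.endswith(suffix))


if __name__ == "__main__":
    sample = (
        "# TYPE vllm:num_requests_running gauge\n"
        'vllm:num_requests_running{model_name="devstral"} 3.0\n'
        'vllm:num_requests_waiting{model_name="devstral"} 12.0\n'
    )
    totals = parse_samples(sample)
    waiting = metric_by_suffix(totals, "num_requests_waiting")
    print("queue pressure" if waiting > 10 else "ok", waiting)
```

In practice the text would come from an HTTP GET against the /metrics endpoint; a sustained non-zero waiting count suggests the --max-num-seqs cap is being hit.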

SGLang and TensorRT-LLM Alternatives

vLLM covers most Devstral deployments. Two alternatives are worth knowing for specific scenarios.

SGLang uses RadixAttention, which caches the KV state for shared prompt prefixes. For coding tasks where multiple developers submit requests with similar system prompts or the same codebase context, RadixAttention reduces prefill cost significantly. Setup is similar to vLLM: python -m sglang.launch_server --model-path mistralai/Devstral-Small-2505 --enable-torch-compile. The trade-off is a smaller ecosystem and less mature Mistral tool-call support.

TensorRT-LLM requires an engine build step that takes 30-60 minutes but yields consistently higher throughput on H100 at large batch sizes. For a single-GPU L40S or A100 deployment, the throughput advantage does not justify the build complexity. For multi-GPU H100 deployments serving 50+ concurrent requests, TRT-LLM is worth evaluating.

Runtime	Throughput (tokens/sec, H100)	Time to First Token	Setup complexity
vLLM	2,800-3,200	80-150ms	Low (pip install)
SGLang	2,600-3,000 (prefix cached)	70-130ms	Medium
TensorRT-LLM	3,500-4,200	60-100ms	High (engine build)

Numbers are approximate at batch size 8 on a single H100 SXM5 with Devstral 24B BF16. For benchmark methodology and reproducible numbers across all three frameworks, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

Connecting Devstral to Your IDE

Continue

Add to ~/.continue/config.json:

json

{
  "models": [
    {
      "title": "Devstral (self-hosted)",
      "provider": "openai",
      "model": "devstral",
      "apiBase": "http://<your-instance-ip>:8000/v1",
      "apiKey": "none"
    }
  ]
}

Set model to devstral, the value passed to --served-model-name in the vLLM command. If you omit --served-model-name, use the full HuggingFace path (mistralai/Devstral-Small-2505) instead. The apiKey field is required by Continue's schema but ignored by vLLM when no authentication is configured.

Aider

bash

aider \
  --openai-api-base http://<your-instance-ip>:8000/v1 \
  --model openai/devstral \
  --openai-api-key none

Aider's --architect mode works well with Devstral's tool-calling capability. The model reads your codebase, proposes changes, and applies them in one session.

Cline

In VS Code, open Cline's Settings panel (the gear icon in the Cline sidebar). Set:

API Provider: OpenAI Compatible
Base URL: http://<your-instance-ip>:8000/v1
Model ID: devstral
API Key: any non-empty string (vLLM ignores it if auth is off)

Cline's agentic workflows use Devstral's tool-call interface for file reads, terminal commands, and diff application.

Tabby with a custom model backend

Tabby 0.22+ supports custom OpenAI-compatible backends via the --model-id flag. Point it at your vLLM endpoint and set the model to devstral. Tabby handles user authentication and usage tracking on top, which is useful for teams that want per-developer API key management without building it themselves.

For a full Tabby setup with Continue as the IDE plugin, the self-hosted AI coding assistant guide walks through the complete stack including cloud-init scripts and SSH tunneling.

Devstral vs Qwen2.5-Coder vs DeepSeek-Coder V2

For teams deciding between coding-specialized models, here is a direct comparison on the metrics that matter for self-hosted deployments:

Model	Parameters	SWE-bench Verified	Single-GPU fit	On-Demand (Spheron)
Devstral 24B	24B	46.8%	L40S (FP8), A100 (BF16)	~$0.72/hr (L40S)
Qwen2.5-Coder 32B	32B	43.2%*	A100 80GB (BF16)	~$1.64/hr (A100)
DeepSeek-Coder-V2-Lite	16B MoE	~38%	A100 40GB	~$1.04/hr (A100 PCIe)
DeepSeek-Coder-V2 236B	236B MoE	~48%	4x H100 minimum	~$12/hr (4x H100)

When to pick each:

Devstral for agentic coding tasks (SWE-bench-style file editing, test running, multi-step patches) and teams that need native Mistral tool-call format. Best SWE-bench score in the single-GPU category.
Qwen2.5-Coder 32B for teams already running Qwen models, or where HumanEval scores (code completion benchmarks) matter more than agentic task performance. A100 is the minimum GPU.
DeepSeek-Coder-V2 (full 236B) for teams with multi-GPU budgets who need the highest overall code quality. 4x H100 minimum, meaningfully more expensive.

For production deployments requiring maximum Mistral capability across general tasks, the Mistral Large 3 deployment guide covers the 675B flagship MoE on 4x B200 or 8x H200.

*Qwen2.5-Coder 32B SWE-bench Verified score based on community evaluations. Verify against the official Qwen2.5-Coder technical report before using in benchmarks.

For a single-H100 option with a higher SWE-bench score, Cohere's new North-family coding model hits 67.6% SWE-Bench Verified pass@1 on the same hardware. The North Mini Code deployment guide covers vLLM setup, FP8 VRAM math, and OpenCode harness wiring on H100 SXM5.

For a 1T-parameter open MoE model that reaches ~57% on SWE-bench Pro, see the MiMo-V2.5-Pro deployment guide, though it requires 8x H200 or B200 nodes rather than a single GPU.

For context on how these models fit into commercial tooling infrastructure, see GPU infrastructure behind AI coding tools.

Cost Per Developer Per Month on Spheron

The FinOps question is straightforward: at what team size does self-hosting become cheaper than per-seat SaaS?

GPU	Precision	On-Demand	Spot	Devs served	$/dev/month (60% util)
RTX 4090	AWQ 4-bit	$0.53/hr	N/A	2-3	~$92
L40S	FP8	$0.72/hr	N/A	8-12	~$31
A100 80GB	BF16	$1.64/hr	$0.45/hr	10-15	~$16 (spot)
H100 SXM5	BF16	$3.10/hr	N/A	15-20	~$79

Pricing fluctuates based on GPU availability. The prices above are as of 02 May 2026 and may have changed. Check current GPU pricing for live rates.

The L40S row is the inflection point for most teams. At $0.72/hr on-demand and 60% utilization, the monthly cost is $311. Across 10 developers: $31/developer/month.

Cursor Team is $40/seat/month. A 10-person team using Cursor Team pays $400/month. An L40S at 60% utilization costs $311/month - cheaper, with one important difference: you own the endpoint. No usage caps, no model deprecation emails, no code leaving your network. Every developer you add beyond 10 lowers the per-seat cost: at 15 developers, you're at $21/developer/month against $40 for Cursor Team.

A100 spot pricing changes the calculation further. At $0.45/hr spot, 60% utilization across 12 developers costs $16/developer/month - cheaper than Cursor Pro ($20/seat). The spot caveat applies: spot instances can be preempted. For a coding assistant, brief interruptions are tolerable (the IDE plugin retries automatically). For a critical production API, spot needs a fallback.
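The per-developer figures in this section follow from a short calculation. The sketch below assumes a 720-hour month (30 days), which is the value that reproduces the $311 figure; the function names are illustrative.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30-day month, matches the $311 L40S figure


def monthly_cost(hourly_rate: float, utilization: float = 0.6) -> float:
    """GPU cost per month at a given hourly rate and utilization fraction."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def cost_per_developer(hourly_rate: float, developers: int,
                       utilization: float = 0.6) -> float:
    """Monthly GPU cost split evenly across the team."""
    return monthly_cost(hourly_rate, utilization) / developers


def break_even_developers(hourly_rate: float, seat_price: float,
                          utilization: float = 0.6) -> int:
    """Smallest team size at which self-hosting costs no more than per-seat SaaS."""
    return math.ceil(monthly_cost(hourly_rate, utilization) / seat_price)


if __name__ == "__main__":
    print(f"L40S monthly: ${monthly_cost(0.72):.0f}")
    print(f"L40S per dev (10 devs): ${cost_per_developer(0.72, 10):.0f}")
    print(f"Break-even vs $40/seat: {break_even_developers(0.72, 40)} developers")
```

The break-even count ignores whether one GPU can actually serve that many developers, so compare it against the "Devs served" column before drawing conclusions.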

Production Checklist

Before routing real developer traffic through your Devstral instance:

Enable streaming. IDE plugins expect stream: true responses. Without it, the plugin waits for the full response before showing output, making short completions feel slow.
Set a request cap. Use --max-num-seqs 32 (or lower) to prevent GPU memory spikes from sudden request floods. Queue excess requests rather than OOM-crashing the server.
Configure Prometheus scraping. vLLM exposes /metrics out of the box. Scrape it with Prometheus, chart it in Grafana, and alert on vllm:num_requests_waiting > 10 as an early warning of capacity pressure.
Persistent storage for weights. Mount a persistent volume at the HuggingFace cache path (~/.cache/huggingface). Re-downloading roughly 48 GB of BF16 weights on every restart adds minutes of downtime.
Systemd for auto-restart. Create a systemd service file so vLLM restarts automatically after OOM crashes or instance reboots:

ini

[Unit]
Description=Devstral vLLM Server
After=network.target

[Service]
ExecStart=/opt/vllm-env/bin/vllm serve mistralai/Devstral-Small-2505 \
  --quantization fp8 --max-model-len 32768 --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice --tool-call-parser mistral
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Structured outputs. For agentic pipelines that call Devstral programmatically, vLLM can constrain generation to a JSON schema that you supply with each request; --guided-decoding-backend outlines selects the backend that enforces it. Tool calls already produce structured arguments, but explicit schema enforcement helps batch pipelines that parse raw model output.
Horizontal scaling. For teams above 20 developers, run two vLLM instances on separate GPUs behind an NGINX upstream block. Each instance serves the same model, and NGINX load-balances round-robin. NGINX config:

nginx

upstream devstral {
    server gpu1:8000;
    server gpu2:8000;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://devstral;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Running Devstral on an L40S on Spheron costs less per month than 10 Cursor Team seats. Rent an L40S on Spheron → or compare all GPU options at Spheron GPU pricing →.

Quick Setup Guide

Choose your GPU and quantization level
Pick L40S 48GB with FP8 for budget single-GPU deployment. Pick A100 80GB for BF16 with maximum context headroom. Pick H100 80GB for teams needing high throughput at BF16. RTX 4090 with AWQ works for low-concurrency dev/test use. Consult the VRAM table in this post before provisioning.
Provision a Spheron GPU instance
Log into app.spheron.ai, go to GPU Cloud, select your GPU tier, enable spot pricing where available, and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 100 GB persistent storage for model weights and vLLM cache.
Install vLLM and download Devstral weights
Run pip install 'vllm>=0.8.0' and pip install huggingface_hub hf_transfer. Export HF_TOKEN. Enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. Run huggingface-cli download mistralai/Devstral-Small-2505 to pull the checkpoint.
Launch vLLM with the OpenAI-compatible server
For L40S FP8: vllm serve mistralai/Devstral-Small-2505 --quantization fp8 --max-model-len 32768 --port 8000 --enable-auto-tool-choice --tool-call-parser mistral. For A100 or H100 BF16: replace --quantization fp8 with --dtype bfloat16 and raise --max-model-len to 65536 to enable longer context windows on 80 GB cards.
Test tool calling with a coding task
Send a POST to /v1/chat/completions with tools defined as a JSON schema. Use a task that exercises file reads and code edits to verify the tool_calls response format is correct before wiring up IDE plugins.
Connect Continue or Aider to the vLLM endpoint
In Continue's ~/.continue/config.json, add a model entry with provider: openai, apiBase: http://<your-instance-ip>:8000/v1, and model: devstral. In Aider, pass --openai-api-base http://<ip>:8000/v1 --model openai/devstral. Both will route requests through your self-hosted endpoint.
Set up production monitoring
Scrape vLLM's built-in metrics endpoint (/metrics) with Prometheus and chart it in Grafana. Monitor tokens/sec, GPU utilization via nvidia-smi dmon, and request queue depth. Set up systemd to restart the vLLM process on crash. Use --max-num-seqs to cap concurrent requests and protect latency SLOs.

Frequently Asked Questions

What GPU do I need to run Devstral?

Devstral 24B at BF16 requires approximately 55 GB VRAM (48 GB weights plus roughly 15% overhead). A single L40S 48GB runs it at FP8 comfortably. An A100 80GB or H100 80GB handles the full BF16 checkpoint with significant KV cache headroom for longer contexts. RTX 4090 24GB works with AWQ 4-bit quantization. For production coding assistants serving 5-10 concurrent developers, L40S FP8 or A100 BF16 are the recommended single-GPU configs.

How does Devstral compare to GPT-4o for code generation?

On SWE-bench Verified, Devstral achieves 46.8%, well above GPT-4o's score of 33.2% on the same benchmark. For repository-level coding tasks involving tool use, file reads, and multi-step edits, Devstral consistently outperforms similarly-sized general-purpose models. The trade-off is that Devstral is a coding specialist: it is not optimized for reasoning, vision, or general instruction following.

Can I use Devstral with Continue, Aider, or Cline?

Yes. vLLM exposes an OpenAI-compatible API at /v1/chat/completions and /v1/completions. Continue, Aider, and Cline all support custom OpenAI-compatible endpoints. In Continue's config.json, set the provider to openai, point the apiBase at your vLLM instance IP and port, and set the model to devstral. Aider uses --openai-api-base and --model flags. Cline's Settings panel has a custom base URL field.

What is the cost per developer per month on Spheron vs Cursor?

Cursor Pro costs $20/seat/month and Cursor Team costs $40/seat/month. A single L40S on Spheron at on-demand pricing runs a Devstral 24B FP8 server that handles 8-12 concurrent developers. At $0.72/hr on-demand, full utilization comes to roughly $518/month, or $52/developer/month across a 10-person team, which is higher than either Cursor plan. At 60% utilization (a realistic daytime coding schedule), the monthly cost drops to around $311, or $31/developer/month: cheaper than Cursor Team, though still above Cursor Pro, and with full data sovereignty.

Does Devstral support function calling and tool use?

Yes. Devstral is trained specifically for agentic coding tasks that require tool calls: reading files, running tests, applying patches, and searching codebases. It uses Mistral's function calling format and is compatible with any framework that supports OpenAI tool_calls API format, including LangChain, LlamaIndex, and the OpenAI Python client.