
Deploy Devstral on GPU Cloud: Self-Host Mistral's Coding Model with vLLM (2026)

Written by Mitrasish, Co-founder · May 2, 2026

Devstral scores 46.8% on SWE-bench Verified. GPT-4o scores 33.2% on the same benchmark. That gap is significant for repository-level coding tasks, and Devstral runs on a single L40S or A100 GPU. The cost problem is not the model itself. It's the per-seat SaaS math: Cursor Pro at $20/seat, GitHub Copilot Business at $19/seat. A 20-person engineering team pays $380-$400/month and gets no control over which model runs or where its code goes.

This guide covers everything you need to self-host Devstral: GPU requirements, vLLM setup on Spheron, IDE plugin configuration for Continue, Aider, and Cline, and the actual cost math at different team sizes.

What Devstral Is

Devstral is a 24B coding-specialized model from Mistral AI, trained for the kind of agentic software engineering tasks that SWE-bench measures. Unlike general-purpose models that handle code as one of many capabilities, Devstral is designed specifically for repository-level programming work: reading files, running tests, applying patches, and navigating codebases through tool calls.

The architecture uses Mistral Small 3.1 24B as the base, with fine-tuning on coding and tool-use data. The model supports native function calling in Mistral's tool format, which maps directly to the OpenAI tool_calls API structure. Any framework that speaks OpenAI function calling works without modification.

| Property | Value |
| --- | --- |
| Parameters | 24B |
| Architecture | Mistral Small 3.1 24B base |
| Context window | 128K tokens |
| SWE-bench Verified | 46.8% |
| License | Apache 2.0 |
| Tool use | Native (Mistral function calling format) |

The 128K context window matters for coding tasks. Multi-file edits, large test suites, and full repository reads can hit 32K tokens without trying. Having 128K headroom means Devstral can read entire modules before deciding what to change, which is what makes SWE-bench agentic scores meaningful in practice. That said, for most single-GPU configs below, you'll want to cap max-model-len below 128K to keep VRAM headroom for KV cache. The full context window is feasible only on 80GB cards.

GPU Hardware Requirements

The VRAM math starts with the parameter count. Devstral is 24B parameters.

  • BF16: 24B × 2 bytes = 48 GB weights + ~15% activation overhead = ~55 GB minimum VRAM
  • FP8: 24B × 1 byte = 24 GB weights + ~15% overhead = ~28 GB minimum VRAM
  • AWQ 4-bit: ~12 GB weights + overhead = ~15 GB minimum VRAM

The 15% overhead accounts for activations, framework buffers, and the CUDA context. KV cache is on top of that and depends on max-model-len and concurrent sequences.
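
The arithmetic above is easy to re-run for other precisions or overhead assumptions. The snippet below is a minimal sketch of that estimate only. The function names and the precision labels are illustrative, the 15% overhead is the figure used in this guide, and KV cache is deliberately left out.

```python
# Rough minimum-VRAM estimate: weights plus a flat overhead for activations,
# framework buffers, and the CUDA context. KV cache is NOT included.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "awq4": 0.5}  # illustrative labels
OVERHEAD = 0.15  # the flat 15% used in this guide


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: parameter count times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_vram_gb(params_billion: float, precision: str, overhead: float = OVERHEAD) -> float:
    """Weights plus overhead, before any KV cache is allocated."""
    return weights_gb(params_billion, precision) * (1 + overhead)


for precision in ("bf16", "fp8", "awq4"):
    print(f"{precision}: {weights_gb(24, precision):.0f} GB weights, "
          f"{min_vram_gb(24, precision):.1f} GB minimum VRAM")
```

The 4-bit result comes out slightly under the ~15 GB quoted above because a flat 0.5 bytes per parameter ignores the quantization scales and zero-points stored alongside AWQ weights.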

For full VRAM sizing across all major model families, see the GPU memory requirements guide for LLMs. The table below covers the main Devstral deployment configs:

| Config | Precision | VRAM for Weights + Overhead | KV Cache Headroom (32K ctx) | Notes |
| --- | --- | --- | --- | --- |
| H100 SXM5 80GB | BF16 | ~55 GB | ~25 GB | High throughput, high concurrency |
| A100 80GB | BF16 | ~55 GB | ~25 GB | Reliable single-GPU production option |
| L40S 48GB | FP8 | ~28 GB | ~18 GB | Best cost/performance for most teams |
| RTX 4090 24GB | AWQ 4-bit | ~15 GB | ~7 GB | Dev/test use, low concurrency |

L40S at FP8 is the standout single-GPU option. FP8 brings weight footprint down to ~28 GB, leaving ~18 GB for KV cache at 32K context. That is enough for several concurrent coding sessions. The L40S does not have H100-class FP8 Tensor Cores, so FP8 throughput is lower than on H100, but the VRAM fit is what matters for a team-sized deployment.

A100 80GB at BF16 gives more KV cache headroom than L40S and handles longer contexts. If your team regularly sends 50K+ token requests (whole-repo reads, large file diffs), A100 is the better choice.

Reserve your H100 GPU rental on Spheron for high-throughput workloads serving 15+ concurrent developers. For budget single-GPU deployment, L40S on Spheron is the recommended starting point.

Step-by-Step Deployment with vLLM on Spheron

Step 1: Choose your GPU and quantization level

Pick based on team size and budget:

  • L40S + FP8 for teams of 8-12 developers. Best cost per developer.
  • A100 80GB + BF16 for teams needing longer context (65K+ tokens) or higher KV cache headroom.
  • H100 SXM5 + BF16 for teams of 15-20 developers or high-throughput agentic pipelines.
  • RTX 4090 + AWQ for individual developers or dev/test environments.

Step 2: Provision a Spheron GPU instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select your GPU tier, enable spot pricing where available, and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 100 GB persistent storage for model weights and vLLM cache. See the Spheron docs for instance provisioning details.

For cloud-init automation, use this script to bootstrap the instance on first boot:

bash
#!/bin/bash
set -e

# Install system deps
apt-get update -qq && apt-get install -y -qq curl wget

# Install uv for fast Python package management
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Create venv and install vLLM
uv venv /opt/vllm-env
source /opt/vllm-env/bin/activate
uv pip install 'vllm>=0.8.0' huggingface_hub hf_transfer

echo "Bootstrap complete"

Step 3: Install vLLM and download Devstral weights

bash
pip install 'vllm>=0.8.0'
pip install huggingface_hub hf_transfer

export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download Devstral weights (BF16 checkpoint, ~48 GB; vLLM applies FP8 quantization at load time)
huggingface-cli download mistralai/Devstral-Small-2505

Verify the HuggingFace repo ID at huggingface.co/mistralai before running. The checkpoint name Devstral-Small-2505 reflects the May 2025 release date. If Mistral releases a revised checkpoint, the slug changes. Note: Mistral has since released mistralai/Devstral-Small-2-24B-Instruct-2512 (December 2025). The same setup steps apply to that checkpoint.

Step 4: Launch vLLM with the OpenAI-compatible server

For L40S at FP8 (recommended for most teams):

bash
vllm serve mistralai/Devstral-Small-2505 \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --gpu-memory-utilization 0.92

For A100 80GB or H100 80GB at BF16 (longer contexts):

bash
vllm serve mistralai/Devstral-Small-2505 \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --gpu-memory-utilization 0.90

Two flags are non-optional here: --enable-auto-tool-choice and --tool-call-parser mistral. Without --tool-call-parser mistral, Devstral's native function calling format is misinterpreted and tool calls return malformed JSON. Do not skip these flags regardless of which GPU you use.

The --max-model-len cap matters per GPU:

  • L40S: cap at 32768. At 128K context, FP8 weights + KV cache will OOM.
  • A100/H100 80GB: use 65536 comfortably, extend to 131072 if you have headroom.
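
To see why these caps differ by GPU, it helps to estimate KV cache size per token. The sketch below assumes a 16-bit KV cache and architecture values (40 layers, 8 KV heads, head dimension 128) that I believe match the Mistral Small 24B base. Treat them as assumptions and check them against the checkpoint's config before relying on the output. The headroom figures are the approximate ones from the hardware table.

```python
# Assumed architecture values for the 24B base model; verify against the
# checkpoint's config file before trusting the numbers.
NUM_LAYERS = 40
NUM_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # assumes a 16-bit (BF16/FP16) KV cache


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gb(tokens: int) -> float:
    """KV cache in GB for one sequence holding `tokens` tokens."""
    return tokens * kv_bytes_per_token() / 1e9


def full_length_sequences(headroom_gb: float, max_model_len: int) -> int:
    """How many sequences fit if every one fills the whole context window."""
    return int(headroom_gb // kv_gb(max_model_len))


for label, headroom_gb, ctx in [
    ("L40S FP8", 18, 32768),
    ("L40S FP8", 18, 131072),
    ("A100 BF16", 25, 65536),
]:
    print(f"{label} @ {ctx} tokens: {kv_gb(ctx):.1f} GB per full sequence, "
          f"{full_length_sequences(headroom_gb, ctx)} fit in {headroom_gb} GB")
```

Under these assumptions a single full 128K sequence needs more KV cache than the L40S has left after FP8 weights, which is consistent with the OOM warning above. Real concurrency is usually higher than the full-length count, since vLLM allocates KV cache in pages and most requests stop well short of the cap.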

For production deployments, consider adding a systemd unit to handle restarts automatically. See the vLLM production deployment guide for tensor parallelism details and systemd configuration.

Step 5: Test tool calling with a coding task

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [
      {
        "role": "user",
        "content": "Read the file main.py and add error handling to the main function."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "read_file",
          "description": "Read a file from the filesystem",
          "parameters": {
            "type": "object",
            "properties": {
              "path": {"type": "string", "description": "File path to read"}
            },
            "required": ["path"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'

A correct response includes a tool_calls array in the assistant message. If you see the tool call parameters as plain text in the response content instead of structured JSON, --tool-call-parser mistral is missing from the serve command.
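
The same check can be scripted, which is handy for a smoke test after each deploy. This is a minimal standard-library sketch, not an official client: `BASE_URL`, `build_request`, `extract_tool_calls`, and `ask` are illustrative names, and it assumes the server from Step 4 is listening on localhost:8000 with the served model name devstral.

```python
import json
import urllib.request

# Assumption: the vLLM server from Step 4 is reachable at this address.
BASE_URL = "http://localhost:8000/v1"

READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the filesystem",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File path to read"}},
            "required": ["path"],
        },
    },
}


def build_request(prompt: str) -> dict:
    """Build the same chat-completions payload as the curl example."""
    return {
        "model": "devstral",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON string; json.loads raises if malformed.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


def ask(prompt: str) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Calling `extract_tool_calls(ask("Read the file main.py and add error handling to the main function."))` against a healthy server should return a non-empty list. An empty list while the arguments show up in the message content points to the parser problem described above.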

Step 6: Connect Continue or Aider to the vLLM endpoint

See the IDE integration section below for config snippets.

Step 7: Set up production monitoring

bash
# Scrape vLLM Prometheus metrics (vLLM metric names use a "vllm:" prefix)
curl -s http://localhost:8000/metrics | grep -E "^(vllm:|process_)"

# Watch GPU utilization live
nvidia-smi dmon -s u -d 5

# Watch running requests and queue depth
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "vllm:num_requests_(running|waiting)"'

Set --max-num-seqs 32 to cap concurrent requests at a level your GPU can sustain without queue buildup. This is the primary SLO protection knob.
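
If you want a quick capacity check without a full Prometheus stack, the /metrics text can be parsed directly. The sketch below assumes the `vllm:`-prefixed gauge names that recent vLLM releases expose and uses a hand-written sample payload rather than real server output. `read_gauges` and `is_overloaded` are illustrative names, and the threshold of 10 waiting requests is only a starting point.

```python
# Hand-written sample in Prometheus text format, not real server output.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="devstral"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="devstral"} 12.0
process_cpu_seconds_total 41.5
"""

WANTED = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def read_gauges(metrics_text: str, wanted=WANTED) -> dict:
    """Sum each wanted metric across its label sets."""
    totals = {name: 0.0 for name in wanted}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in totals:
            # Assumes no trailing timestamp, so the value is the last field.
            totals[name] += float(line.rsplit(" ", 1)[1])
    return totals


def is_overloaded(gauges: dict, max_waiting: float = 10) -> bool:
    """True when the waiting queue exceeds the alert threshold."""
    return gauges["vllm:num_requests_waiting"] > max_waiting


gauges = read_gauges(SAMPLE_METRICS)
print(gauges, "overloaded:", is_overloaded(gauges))
```

In practice you would feed `read_gauges` the body returned by the /metrics endpoint instead of the sample string.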

SGLang and TensorRT-LLM Alternatives

vLLM covers most Devstral deployments. Two alternatives are worth knowing for specific scenarios.

SGLang uses RadixAttention, which caches the KV state for shared prompt prefixes. For coding tasks where multiple developers submit requests with similar system prompts or the same codebase context, RadixAttention reduces prefill cost significantly. Setup is similar to vLLM: python -m sglang.launch_server --model-path mistralai/Devstral-Small-2505 --enable-torch-compile. The trade-off is a smaller ecosystem and less mature Mistral tool-call support.

TensorRT-LLM requires an engine build step that takes 30-60 minutes but yields consistently higher throughput on H100 at large batch sizes. For a single-GPU L40S or A100 deployment, the throughput advantage does not justify the build complexity. For multi-GPU H100 deployments serving 50+ concurrent requests, TRT-LLM is worth evaluating.

| Runtime | Throughput (tokens/sec, H100) | Time to First Token | Setup complexity |
| --- | --- | --- | --- |
| vLLM | 2,800-3,200 | 80-150ms | Low (pip install) |
| SGLang | 2,600-3,000 (prefix cached) | 70-130ms | Medium |
| TensorRT-LLM | 3,500-4,200 | 60-100ms | High (engine build) |

Numbers are approximate at batch size 8 on a single H100 SXM5 with Devstral 24B BF16. For benchmark methodology and reproducible numbers across all three frameworks, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

Connecting Devstral to Your IDE

Continue

Add to ~/.continue/config.json:

json
{
  "models": [
    {
      "title": "Devstral (self-hosted)",
      "provider": "openai",
      "model": "devstral",
      "apiBase": "http://<your-instance-ip>:8000/v1",
      "apiKey": "none"
    }
  ]
}

Set model to devstral (the --served-model-name you used in the vLLM command, or omit --served-model-name and use the full HuggingFace path). The apiKey field is required by Continue's schema but ignored by vLLM when no authentication is configured.

Aider

bash
# The openai/ prefix tells Aider to treat the model as an OpenAI-compatible endpoint
aider \
  --openai-api-base http://<your-instance-ip>:8000/v1 \
  --model openai/devstral \
  --openai-api-key none

Aider's --architect mode works well with Devstral. The model reads your codebase, proposes changes, and applies them in one session.

Cline

In VS Code, open Cline's Settings panel (the gear icon in the Cline sidebar). Set:

  • API Provider: OpenAI Compatible
  • Base URL: http://<your-instance-ip>:8000/v1
  • Model ID: devstral
  • API Key: any non-empty string (vLLM ignores it if auth is off)

Cline's agentic workflows use Devstral's tool-call interface for file reads, terminal commands, and diff application.

Tabby with a custom model backend

Tabby 0.22+ supports custom OpenAI-compatible backends via the --model-id flag. Point it at your vLLM endpoint and set the model to devstral. Tabby handles user authentication and usage tracking on top, which is useful for teams that want per-developer API key management without building it themselves.

For a full Tabby setup with Continue as the IDE plugin, the self-hosted AI coding assistant guide walks through the complete stack including cloud-init scripts and SSH tunneling.

Devstral vs Qwen2.5-Coder vs DeepSeek-Coder V2

For teams deciding between coding-specialized models, here is a direct comparison on the metrics that matter for self-hosted deployments:

| Model | Parameters | SWE-bench Verified | Single-GPU fit | On-Demand (Spheron) |
| --- | --- | --- | --- | --- |
| Devstral 24B | 24B | 46.8% | L40S (FP8), A100 (BF16) | ~$0.72/hr (L40S) |
| Qwen2.5-Coder 32B | 32B | 43.2%* | A100 80GB (BF16) | ~$1.64/hr (A100) |
| DeepSeek-Coder-V2-Lite | 16B MoE | ~38% | A100 40GB | ~$1.04/hr (A100 PCIe) |
| DeepSeek-Coder-V2 236B | 236B MoE | ~48% | 4x H100 minimum | ~$12/hr (4x H100) |

When to pick each:

  • Devstral for agentic coding tasks (SWE-bench-style file editing, test running, multi-step patches) and teams that need native Mistral tool-call format. Best SWE-bench score in the single-GPU category.
  • Qwen2.5-Coder 32B for teams already running Qwen models, or where HumanEval scores (code completion benchmarks) matter more than agentic task performance. A100 is the minimum GPU.
  • DeepSeek-Coder-V2 (full 236B) for teams with multi-GPU budgets who need the highest overall code quality. 4x H100 minimum, meaningfully more expensive.

*Qwen2.5-Coder 32B SWE-bench Verified score based on community evaluations. Verify against the official Qwen2.5-Coder technical report before using in benchmarks.

For context on how these models fit into commercial tooling infrastructure, see GPU infrastructure behind AI coding tools.

Cost Per Developer Per Month on Spheron

The FinOps question is straightforward: at what team size does self-hosting become cheaper than per-seat SaaS?

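| GPU | Precision | On-Demand | Spot | Devs served | $/dev/month (60% util) |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | AWQ 4-bit | $0.53/hr | N/A | 2-3 | ~$92 |
| L40S | FP8 | $0.72/hr | N/A | 8-12 | ~$31 |
| A100 80GB | BF16 | $1.64/hr | $0.45/hr | 10-15 | ~$16 (spot) |
| H100 SXM5 | BF16 | $3.10/hr | N/A | 15-20 | ~$79 |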

Pricing fluctuates with GPU availability. The prices above are as of 2 May 2026 and may have changed. Check Spheron's current GPU pricing for live rates.

The L40S row is the inflection point for most teams. At $0.72/hr on-demand and 60% utilization, the monthly cost is $311. Across 10 developers: $31/developer/month.

Cursor Team is $40/seat/month. A 10-person team using Cursor Team pays $400/month. An L40S at 60% utilization costs $311/month - cheaper, with one important difference: you own the endpoint. No usage caps, no model deprecation emails, no code leaving your network. Every developer you add beyond 10 lowers the per-seat cost: at 15 developers, you're at $21/developer/month against $40 for Cursor Team, although that is past the 8-12 developers the L40S row assumes, so expect more queueing at peak.

A100 spot pricing changes the calculation further. At $0.45/hr spot, 60% utilization across 12 developers costs $16/developer/month - cheaper than Cursor Pro ($20/seat). The spot caveat applies: spot instances can be preempted. For a coding assistant, brief interruptions are tolerable (the IDE plugin retries automatically). For a critical production API, spot needs a fallback.
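
The per-developer figures in this section can be reproduced with a few lines of arithmetic. The sketch below assumes a 720-hour month (30 days), which is what makes the L40S row come out at $311. The function names are illustrative, and the hourly rates are the ones quoted in the table above.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_cost(hourly_rate: float, utilization: float = 0.60) -> float:
    """GPU spend per month when the instance is billed for a fraction of hours."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def cost_per_dev(hourly_rate: float, developers: int, utilization: float = 0.60) -> float:
    """Monthly GPU spend split evenly across the developers sharing it."""
    return monthly_cost(hourly_rate, utilization) / developers


def break_even_devs(hourly_rate: float, seat_price: float, utilization: float = 0.60) -> int:
    """Smallest team size at which self-hosting costs no more than per-seat SaaS."""
    return math.ceil(monthly_cost(hourly_rate, utilization) / seat_price)


print(f"L40S monthly: ${monthly_cost(0.72):.2f}")
print(f"L40S, 10 devs: ${cost_per_dev(0.72, 10):.2f}/dev")
print(f"A100 spot, 12 devs: ${cost_per_dev(0.45, 12):.2f}/dev")
print(f"L40S break-even vs $40 seats: {break_even_devs(0.72, 40)} devs")
print(f"L40S break-even vs $20 seats: {break_even_devs(0.72, 20)} devs")
```

Utilization is the input with the most leverage here. Running the instance around the clock instead of at 60% raises every per-developer number by about two thirds, so it is worth plugging in your own usage pattern.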

Production Checklist

Before routing real developer traffic through your Devstral instance:

  • Enable streaming. IDE plugins expect stream: true responses. Without it, the plugin waits for the full response before showing output, making short completions feel slow.
  • Set a request cap. Use --max-num-seqs 32 (or lower) to prevent GPU memory spikes from sudden request floods. Queue excess requests rather than OOM-crashing the server.
  • Configure Prometheus scraping. vLLM exposes /metrics out of the box. Scrape it with Prometheus, chart it in Grafana, and alert on vllm:num_requests_waiting > 10 as an early warning of capacity pressure.
  • Persistent storage for weights. Mount a persistent volume at the HuggingFace cache path (~/.cache/huggingface). Re-downloading the weights (about 48 GB for the BF16 checkpoint) on every restart adds minutes of downtime.
  • Systemd for auto-restart. Create a systemd service file so vLLM restarts automatically after OOM crashes or instance reboots:
ini
[Unit]
Description=Devstral vLLM Server
After=network.target

[Service]
ExecStart=/opt/vllm-env/bin/vllm serve mistralai/Devstral-Small-2505 \
  --quantization fp8 --max-model-len 32768 --port 8000 \
  --served-model-name devstral \
  --enable-auto-tool-choice --tool-call-parser mistral
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
  • Structured outputs. For agentic pipelines that call Devstral programmatically, add --guided-decoding-backend outlines to enforce JSON schema on outputs. Tool calls already handle this, but explicit schema enforcement helps for batch processing pipelines that parse model output programmatically.
  • Horizontal scaling. For teams above 20 developers, run two vLLM instances on separate GPUs behind an NGINX upstream block. Each instance serves the same model, and NGINX load-balances round-robin. NGINX config:
nginx
upstream devstral {
    server gpu1:8000;
    server gpu2:8000;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://devstral;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Running Devstral on an L40S on Spheron costs less per month than 10 Cursor Team seats. Rent an L40S on Spheron, or compare all GPU options on the Spheron GPU pricing page.
