Deploy Cohere North Mini Code on GPU Cloud: Self-Host the 30B Apache-2.0 Agentic Coding Model on a Single H100 (2026 Guide)

North Mini Code hits 67.6% on SWE-Bench Verified. Devstral scores 46.8% on the same benchmark. Both run on a single H100 at FP8. That 20-point gap is significant for repository-level coding agents where SWE-Bench Verified is the closest proxy to real-task performance. North Mini Code is also Apache 2.0, so there are no per-seat fees, usage restrictions, or model deprecation timelines to worry about.

This guide covers everything you need to self-host North Mini Code: VRAM math, vLLM deployment on a single H100 SXM5, OpenCode and SWE-Agent harness wiring, function calling configuration, and the cost comparison against Cohere's managed API.

What North Mini Code Is

North Mini Code is Cohere's first model in the North family, designed specifically for agentic coding tasks. It uses a 30B total / 3B-active MoE architecture, meaning all 30B parameters must reside in VRAM but only 3B activate per forward pass. That matters for throughput: per-token compute scales with active parameters rather than total parameters, so North Mini Code generates tokens at a pace closer to a 3B dense model than a 30B one.

The model was trained with OpenCode harness integration natively, targeting the kind of multi-step file editing, test running, and patch application that SWE-Bench Verified measures. It uses the OpenAI function calling schema for tool use, which means any agentic coding framework that speaks the OpenAI API works without modification.

| Property | Value |
| --- | --- |
| Total parameters | 30B |
| Active parameters per forward pass | 3B |
| Architecture | MoE (Mixture of Experts) |
| Context window | 256K tokens (64K max output) |
| SWE-Bench Verified | 67.6% pass@1 |
| License | Apache 2.0 |
| HuggingFace ID (BF16) | CohereLabs/North-Mini-Code-1.0 |
| HuggingFace ID (FP8) | CohereLabs/North-Mini-Code-1.0-fp8 |

VRAM misconception to avoid: The "3B active" figure refers to inference compute, not memory. All 30B parameters must be loaded into VRAM at startup. At FP8, that is approximately 30 GB of weights. Do not assume 3 GB VRAM because only 3B parameters activate per forward pass.

The weights are released under Apache 2.0, which permits commercial use, modification, and distribution. Cohere may publish an Acceptable Use Policy alongside the license; check the model card for the current terms before commercial deployment. This is different from Command A (CC-BY-NC 4.0), which restricts commercial use without a separate agreement.

Benchmarks: North Mini Code vs Devstral vs Qwen3-Coder-Next

| Model | SWE-Bench Verified | Single-GPU fit | VRAM at FP8 | Active params |
| --- | --- | --- | --- | --- |
| Cohere North Mini Code | 67.6% | H100 SXM5 80GB | ~34.5 GB | 3B |
| Qwen3-Coder-Next | 70.6% | H200 SXM5 141GB | ~92 GB | 3B |
| Devstral Small 2505 | 46.8% | L40S 48GB | ~28 GB | 24B (dense) |

North Mini Code sits between Devstral and Qwen3-Coder-Next on SWE-Bench. Of the three models compared here, it is the top-scoring one that fits on a single H100 at FP8. Qwen3-Coder-Next edges it out by 3 points but requires an H200 (141 GB VRAM) because the 80B total parameter footprint exceeds an H100's 80 GB at FP8.

SWE-Bench Verified measures a coding agent's ability to resolve real GitHub issues inside a sandboxed Docker environment. The agent reads the issue, edits files, runs tests, and produces a patch. Pass/fail comes from running the associated test suite. For infrastructure details on how to run SWE-Bench evaluations at scale, see the SWE-Bench infrastructure guide.

Why a Single H100 Works

The VRAM math for North Mini Code at different precisions:

| Precision | Weight footprint | Runtime total | VRAM remaining (H100 SXM5 80GB) |
| --- | --- | --- | --- |
| FP8 | ~30 GB | ~34.5 GB | ~45.5 GB KV cache headroom |
| BF16 | ~60 GB | ~69 GB | ~11 GB (tight, not recommended) |
| AWQ INT4 | ~15 GB | ~17 GB | ~63 GB (budget option) |

FP8 is the right default. It leaves roughly 45 GB for KV cache, which covers 32K context windows for 15+ concurrent coding agent sessions. Accuracy loss versus BF16 is typically below 2% on coding benchmarks, though complex algorithmic reasoning tasks can show higher degradation. If you are deploying for a team where SWE-Bench-level task accuracy matters, run a quick FP8 vs BF16 comparison on a representative sample of your tasks before committing to production FP8.

BF16 technically fits on a single H100 at 69 GB total, but the 11 GB of remaining KV cache headroom is tight for concurrent sessions. A single long-context coding task can exhaust it and trigger OOM on concurrent requests.
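The figures above can be reproduced with a few lines of arithmetic. The sketch below is an estimate only: it assumes the roughly 15% runtime overhead on top of weights that this guide uses elsewhere, and the function names are illustrative. The second helper works out the per-token KV cache budget implied by "15 sessions at 32K context in about 45.5 GB"; it does not model North Mini Code's actual attention layout.

```python
from dataclasses import dataclass

GPU_VRAM_GB = 80.0        # H100 SXM5
RUNTIME_OVERHEAD = 0.15   # assumption: ~15% on top of weights, per this guide


@dataclass(frozen=True)
class VramEstimate:
    weights_gb: float
    runtime_gb: float
    kv_headroom_gb: float


def estimate_vram(total_params_b: float, bytes_per_param: float,
                  gpu_vram_gb: float = GPU_VRAM_GB,
                  overhead: float = RUNTIME_OVERHEAD) -> VramEstimate:
    """Estimate weight, runtime, and leftover VRAM in GB (1 GB = 1e9 bytes)."""
    weights = total_params_b * bytes_per_param  # billions of params * bytes each
    runtime = weights * (1 + overhead)
    return VramEstimate(weights, runtime, gpu_vram_gb - runtime)


def kv_budget_per_token_kb(headroom_gb: float, sessions: int,
                           context_tokens: int) -> float:
    """KV cache bytes available per token (in KB) if every session fills its context."""
    return headroom_gb * 1e9 / (sessions * context_tokens) / 1e3


if __name__ == "__main__":
    # Total parameters (30B) drive memory, not the 3B active parameters.
    for label, bytes_per_param in [("FP8", 1.0), ("BF16", 2.0), ("INT4", 0.5)]:
        est = estimate_vram(30, bytes_per_param)
        print(f"{label}: weights={est.weights_gb:.1f} GB, "
              f"runtime={est.runtime_gb:.1f} GB, "
              f"headroom={est.kv_headroom_gb:.1f} GB")
    print(f"budget: {kv_budget_per_token_kb(45.5, 15, 32768):.1f} KB per token")
```

If the model's real KV cache cost per token is higher than that budget, fewer than 15 full-length sessions will fit, so it is worth checking the KV cache size vLLM reports at startup.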

For teams scaling beyond a single developer, running on two H100 SXM5 GPUs with --tensor-parallel-size 2 roughly doubles throughput and more than doubles KV cache headroom, because the weights are loaded only once across the 160 GB pool. The tensor parallelism overhead for a 30B MoE at that scale is minimal.

Why FP8 Works for Coding Tasks

FP8 quantization compresses the weight representation from 2 bytes (BF16) to 1 byte per parameter, with a 1-2% accuracy cost on most benchmarks. For coding tasks that are primarily syntax-correct code generation, the accuracy loss is acceptable. The throughput benefit is concrete: FP8 typically delivers 1.5-2x more tokens per second than BF16 on the same GPU.

Where FP8 can hurt is highly complex algorithmic reasoning involving multi-step mathematical derivations or unusual control flow. If your coding agent handles problems like competitive programming or numerical methods, validate FP8 accuracy before deploying to production. For more on the tradeoffs, see the FP8 quantization performance guide.

Step-by-Step vLLM Deployment

Step 1: Provision your Spheron instance

Log in at app.spheron.ai, navigate to GPU Cloud, and select H100 SXM5 (80 GB). For persistent model storage, attach at least 150 GB to the instance. Spot pricing saves roughly 34% at current rates. For Spheron provisioning details, see docs.spheron.ai.

bash
# Verify GPU visibility after SSH into the instance
nvidia-smi
# Expected: one NVIDIA H100 with 80 GB of memory (SXM5 cards are typically listed as "NVIDIA H100 80GB HBM3"), driver version 560+

Step 2: Install vLLM

bash
pip install git+https://github.com/vllm-project/vllm.git
pip install huggingface_hub hf_transfer
uv pip install 'cohere_melody>=0.9.0'

North Mini Code requires vLLM from the main branch. The cohere_command4 tool-call and reasoning parsers are not available in the released vllm>=0.8.0 package, so installing from main is required. The cohere_melody library provides the parser implementation.

Step 3: Download North Mini Code weights

bash
export HF_TOKEN=your_huggingface_token
export HF_HUB_ENABLE_HF_TRANSFER=1

# FP8 checkpoint (recommended for single-H100 deployment)
huggingface-cli download CohereLabs/North-Mini-Code-1.0-fp8

# BF16 checkpoint (full precision, tighter VRAM headroom)
# huggingface-cli download CohereLabs/North-Mini-Code-1.0

The FP8 checkpoint is Cohere's official pre-quantized build and loads faster than online quantization from the BF16 repo. FP8 weights are approximately 30 GB; BF16 weights are approximately 60 GB. HF_HUB_ENABLE_HF_TRANSFER speeds up the download significantly on high-bandwidth nodes.

Step 4: Launch vLLM with tool-call support

bash
vllm serve CohereLabs/North-Mini-Code-1.0-fp8 \
  --max-model-len 32768 \
  --served-model-name north-mini-code \
  --enable-auto-tool-choice \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --moe-backend triton \
  --port 8000

Tool-call parser: The official North Mini Code FP8 model card specifies --tool-call-parser cohere_command4 --reasoning-parser cohere_command4 --moe-backend triton. The cohere_command4 parser requires the cohere_melody>=0.9.0 library installed in the previous step. Using the wrong parser causes the model to return plain text instead of structured tool_calls blocks, silently breaking tool use in OpenCode and SWE-Agent harnesses.

Why no --dtype fp8: The CohereLabs/North-Mini-Code-1.0-fp8 checkpoint is pre-quantized using compressed-tensors FP8. vLLM auto-detects this format from the model config and applies the correct quantization kernel automatically. In vLLM, --dtype sets the compute dtype (auto, bfloat16, float16, and so on) and does not accept fp8; on-the-fly FP8 quantization is requested with --quantization fp8 and is meant for BF16 checkpoints. Pre-quantized checkpoints need neither flag, so omit both.

--max-model-len note: The --max-model-len 32768 cap is intentional for single-H100 deployment. The model itself supports up to 256K context and 64K max output (the official model card uses --max-model-len 320000), but loading a 256K KV cache would exhaust VRAM headroom on a single 80 GB H100. 32K covers most single-session coding tasks with comfortable headroom.

For teams needing longer context (full repository reads at 65K+ tokens), run two H100 SXM5 GPUs with --tensor-parallel-size 2 --max-model-len 65536.

Step 5: Send a test request

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "north-mini-code",
    "messages": [
      {"role": "user", "content": "Write a Python function that reads a file and returns line count."}
    ],
    "max_tokens": 256
  }'

A successful response returns a choices array with message.content containing the generated code.
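Loading about 30 GB of weights takes a while, so a request sent too early simply fails to connect. The following is a minimal readiness check, assuming the server from Step 4 is listening on localhost:8000 and was started with --served-model-name north-mini-code. It polls vLLM's OpenAI-compatible /v1/models endpoint; the helper names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://localhost:8000"   # matches --port 8000
MODEL_NAME = "north-mini-code"       # matches --served-model-name


def served_model_ids(payload: dict) -> list:
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def fetch_models(base_url: str = BASE_URL, timeout: float = 5.0) -> Optional[dict]:
    """Return the parsed /v1/models payload, or None if the server is not reachable yet."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_ready(fetch: Callable[[], Optional[dict]] = fetch_models,
                     model: str = MODEL_NAME,
                     attempts: int = 60,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the expected model id is listed, or give up after `attempts` tries."""
    for _ in range(attempts):
        payload = fetch()
        if payload is not None and model in served_model_ids(payload):
            return True
        sleep(delay_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not come up in time")
```

Checking for the model id rather than just an HTTP 200 also catches a mismatched --served-model-name before any harness is connected.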

Step 6: Test function calling

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "north-mini-code",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "read_file",
          "description": "Read contents of a file",
          "parameters": {
            "type": "object",
            "properties": {
              "path": {"type": "string", "description": "File path to read"}
            },
            "required": ["path"]
          }
        }
      }
    ],
    "messages": [
      {"role": "user", "content": "Read the contents of requirements.txt"}
    ]
  }'

Confirm the response includes a tool_calls block with the correct name and arguments fields. If the response returns plain text instead of a tool call, the --tool-call-parser flag likely needs adjustment.
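That check can be automated once the JSON response has been captured. The sketch below assumes the response has already been parsed into a dict and follows the OpenAI chat completion shape, in which function.arguments is a JSON-encoded string; the function names are illustrative.

```python
import json


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from the first choice of a chat completion."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]


def check_read_file_call(response: dict) -> list:
    """Return a list of problems; an empty list means the tool call looks well formed."""
    try:
        calls = extract_tool_calls(response)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError) as exc:
        return [f"malformed response: {exc!r}"]
    if not calls:
        return ["no tool_calls: model answered in plain text, check --tool-call-parser"]
    problems = []
    name, args = calls[0]
    if name != "read_file":
        problems.append(f"unexpected function name: {name!r}")
    if not isinstance(args, dict) or not isinstance(args.get("path"), str):
        problems.append("arguments must be a JSON object with a string 'path'")
    return problems
```

The check deliberately does not compare the path value against "requirements.txt", since a model may reasonably return "./requirements.txt" or an absolute path.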

Step 7: SGLang alternative

SGLang's RadixAttention caches the KV state of shared prompt prefixes. For coding harnesses where multiple agent sessions share a system prompt or repository context, this can significantly reduce prefill compute on repeated requests.

bash
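# No --dtype flag: the checkpoint is already FP8-quantized.
# --context-length caps per-request context; --max-total-tokens would instead
# size the whole KV token pool shared by all requests.
python -m sglang.launch_server \
  --model-path CohereLabs/North-Mini-Code-1.0-fp8 \
  --port 8000 \
  --context-length 32768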

SGLang has a smaller ecosystem and less mature tool-call support than vLLM. Use SGLang when you have measured prefix-cache hit rates above 40% in your workload; otherwise vLLM is the more reliable choice. For a detailed breakdown of when each backend wins, see the vLLM vs SGLang comparison. If you are evaluating Ollama as a lower-overhead local alternative, the Ollama vs vLLM comparison covers the serving trade-offs.
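Before standing up SGLang to measure real hit rates, a rough upper bound can be computed offline from logged prompts. The sketch below assumes prompts are available as token id lists and that the cache is large enough to retain every earlier prefix, so it overestimates what a real server achieves. The function names are illustrative, and the 0.40 threshold is the rule of thumb stated above.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_prefix_hit_rate(requests: Sequence[Sequence[int]]) -> float:
    """Fraction of prompt tokens that match the prefix of some earlier request."""
    total = 0
    reused = 0
    for i, tokens in enumerate(requests):
        total += len(tokens)
        if i > 0:
            reused += max(common_prefix_len(tokens, prev) for prev in requests[:i])
    return reused / total if total else 0.0


def pick_backend(hit_rate: float, threshold: float = 0.40) -> str:
    """Apply the rule of thumb: SGLang only when the hit rate exceeds the threshold."""
    return "sglang" if hit_rate > threshold else "vllm"
```

The pairwise comparison is quadratic in the number of requests, which is fine for a sample of a few thousand prompts but not for a full production log.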

Wiring into Agentic Coding Harnesses

OpenCode

OpenCode reads its backend configuration from config.json. Point baseURL at your vLLM instance and set model to the served model name:

json
{
  "model": {
    "provider": "openai",
    "baseURL": "http://your-instance-ip:8000/v1",
    "apiKey": "placeholder",
    "model": "north-mini-code"
  }
}

The apiKey field must be set to a non-empty string; vLLM ignores it when authentication is not configured.

SWE-Agent style loops

Any SWE-Agent-style framework using the openai Python client works without modification:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-instance-ip:8000/v1",
    api_key="placeholder"
)

response = client.chat.completions.create(
    model="north-mini-code",
    messages=[
        {"role": "system", "content": "You are a coding agent. Use the provided tools to resolve the issue."},
        {"role": "user", "content": issue_description}
    ],
    tools=tool_definitions,
    tool_choice="auto"
)

The tool_choice="auto" setting lets the model decide when to call tools. For coding harnesses, this is the standard setting. Switch to tool_choice="required" only for benchmarking runs where you need the model to always emit tool calls.
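The request above is a single turn; a harness repeats it, executing each tool call and feeding the result back as a tool message until the model stops calling tools. The following is a minimal sketch of that loop, not SWE-Agent's actual implementation. The `complete` callable and the `tools` registry are hypothetical stand-ins so the loop can be exercised without a running server.

```python
import json
from typing import Any, Callable

Message = dict  # one chat message in OpenAI format


def run_agent(complete: Callable[[list], Message],
              tools: dict,
              messages: list,
              max_steps: int = 20) -> list:
    """Run a tool-calling loop until the model replies without tool calls.

    complete(messages) must return the assistant message as a dict.
    tools maps a function name to a Python callable taking keyword arguments.
    """
    for _ in range(max_steps):  # hard step cap guards against endless loops
        reply = complete(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            break
        for call in calls:
            fn = call["function"]
            try:
                result: Any = tools[fn["name"]](**json.loads(fn["arguments"]))
            except Exception as exc:  # report failures to the model instead of crashing
                result = f"tool error: {exc!r}"
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
    return messages
```

With the client shown earlier, `complete` could wrap `client.chat.completions.create(...)` and return `response.choices[0].message.model_dump(exclude_none=True)`. Returning tool errors as message content lets the model correct a bad path or command on its next step.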

Continue, Aider, and Cline

vLLM's OpenAI-compatible API works with any IDE plugin that supports a custom base URL:

  • Continue: In ~/.continue/config.json, add an entry with provider: "openai", apiBase: "http://your-ip:8000/v1", and model: "north-mini-code".
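  • Aider: Pass --openai-api-base http://your-ip:8000/v1 --openai-api-key placeholder --model openai/north-mini-code on the command line. The openai/ prefix tells Aider to treat the model as served by an OpenAI-compatible endpoint.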
  • Cline: Set API provider to "OpenAI Compatible" in VS Code settings, enter the base URL, and set the model ID to north-mini-code.

Function Calling and Tool Use Tuning

North Mini Code uses the OpenAI function calling schema. Two flags are required for tool use with vLLM:

  • --enable-auto-tool-choice: Allows the model to decide when to call tools rather than always generating a tool call.
  • --tool-call-parser: Tells vLLM which tool call format to parse from the model output.

For production coding agent deployments, define tools for the three operations that appear in virtually every coding harness:

python
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "execute_shell",
            "description": "Execute a shell command and return stdout/stderr",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write or overwrite a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                },
                "required": ["path", "content"]
            }
        }
    }
]
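Those schemas only describe the tools; the harness still has to execute them. The following is one possible set of handlers, assuming the agent works inside a single checkout directory. The `Workspace` class, its path confinement, and the 60-second timeout are illustrative choices and are not part of any framework named in this guide.

```python
import subprocess
from pathlib import Path


class Workspace:
    """Tool handlers confined to one directory tree."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, path: str) -> Path:
        # Reject absolute paths and ../ escapes that leave the workspace.
        target = (self.root / path).resolve()
        if not target.is_relative_to(self.root):  # requires Python 3.9+
            raise ValueError(f"path escapes workspace: {path}")
        return target

    def read_file(self, path: str) -> str:
        return self._resolve(path).read_text()

    def write_file(self, path: str, content: str) -> str:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        return f"wrote {len(content)} characters to {path}"

    def execute_shell(self, command: str, timeout_s: float = 60.0) -> str:
        # shell=True runs arbitrary model-generated commands: use only in a sandbox.
        proc = subprocess.run(command, shell=True, cwd=self.root,
                              capture_output=True, text=True, timeout=timeout_s)
        return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"

    def handlers(self) -> dict:
        """Map the tool names from the schema to their implementations."""
        return {
            "read_file": self.read_file,
            "write_file": self.write_file,
            "execute_shell": self.execute_shell,
        }
```

Path confinement does not constrain execute_shell, which can touch anything the process user can, so the whole harness belongs in a container or disposable VM.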

Production tip: Set a max_tokens cap on each request (e.g., 4096 for normal tasks, 8192 for multi-file edits) to prevent runaway generation loops that exhaust KV cache. Coding agents can enter loops where they generate long reasoning traces before calling a tool; a hard token cap short-circuits this.

Monitoring throughput: Run nvidia-smi dmon -s u during inference to verify GPU utilization is consistently above 80%. Low utilization typically means the --max-num-seqs limit is too low or requests are arriving in bursts with idle gaps. Tune --max-num-seqs up to allow more requests to be scheduled concurrently and keep GPU utilization high.

Cost Comparison

Live pricing fetched 30 Jun 2026:

  • H100 SXM5 on-demand: $4.41/hr
  • H100 SXM5 spot: $2.91/hr

At FP8 with 3B active parameters, North Mini Code on H100 SXM5 generates approximately 3,500+ tokens per second at batch size 8 (throughput figures are estimates based on comparable MoE architectures with 3B active parameters; benchmark on your instance before capacity planning).

At 70% GPU utilization over one hour: roughly 8.8 million output tokens.

| Setup | GPU | On-demand price | Spot price | Concurrent agents |
| --- | --- | --- | --- | --- |
| North Mini Code self-hosted | 1x H100 SXM5 | $4.41/hr | $2.91/hr | 10-15 |
| North Mini Code self-hosted | 2x H100 SXM5 | $8.82/hr | $5.82/hr | 20-30 |
| Cohere API (hosted, Command A tier proxy) | Managed | ~$10.00/M output tokens | - | Unlimited (rate-limited) |
| Cursor Team (per seat) | Managed | $40/seat/month | - | 1 per seat |

At on-demand pricing with 70% utilization, the self-hosted H100 delivers output tokens at roughly $0.50/M, versus Cohere's managed API at $10/M (Command A tier proxy; no separate North Mini Code hosted price is published). For teams running coding agents at scale (thousands of API calls per day), that difference compounds quickly.

At 24/7 operation, the monthly GPU cost is approximately $2,095 at spot ($2.91/hr × 720 hrs) and $3,175 at on-demand ($4.41/hr × 720 hrs). The break-even against Cursor Team ($40/seat/month) comes at roughly 52 developers at spot pricing and 79 developers at on-demand pricing. For smaller teams, the math favors hosted APIs unless data privacy or output customization requirements force self-hosting.
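The cost arithmetic in this section can be rerun with current prices. The sketch below uses the figures quoted above as inputs; the 3,500 tokens per second throughput is this guide's estimate rather than a measurement, and the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, as used above


def tokens_per_hour(peak_tps: float, utilization: float) -> float:
    """Output tokens generated in one hour at a given average utilization."""
    return peak_tps * utilization * 3600


def cost_per_million(hourly_usd: float, peak_tps: float, utilization: float) -> float:
    """USD per million output tokens for a GPU billed by the hour."""
    return hourly_usd / (tokens_per_hour(peak_tps, utilization) / 1e6)


def monthly_cost(hourly_usd: float) -> float:
    """USD for one GPU running 24/7 for a month."""
    return hourly_usd * HOURS_PER_MONTH


def break_even_seats(hourly_usd: float, seat_usd_per_month: float = 40.0) -> float:
    """Number of per-seat subscriptions that cost the same as one always-on GPU."""
    return monthly_cost(hourly_usd) / seat_usd_per_month


if __name__ == "__main__":
    for label, price in [("on-demand", 4.41), ("spot", 2.91)]:
        print(f"{label}: ${cost_per_million(price, 3500, 0.70):.2f}/M tokens, "
              f"${monthly_cost(price):,.0f}/month, "
              f"break-even at {break_even_seats(price):.1f} seats")
```

Because the per-token cost scales inversely with utilization, a GPU that sits at 35% utilization costs twice as much per token as these figures suggest.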

Pricing fluctuates based on GPU availability. The prices above are based on 30 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

Production Checklist

Before routing real developer traffic through your North Mini Code instance:

  • Enable streaming. IDE plugins and coding harnesses expect stream: true responses. Without it, the client waits for the full response before showing output.
  • Set --max-num-seqs. Cap concurrent requests (32 is a good starting point) to prevent memory spikes from sudden traffic bursts. Queue excess requests rather than OOM-crashing the server.
  • Persistent storage for weights. Mount a persistent volume at the HuggingFace cache path (~/.cache/huggingface). Re-downloading 30 GB on every restart adds significant downtime.
  • Systemd for auto-restart. Create a systemd service so vLLM restarts automatically after crashes or reboots:
ini
[Unit]
Description=North Mini Code vLLM Server
After=network.target

[Service]
ExecStart=/opt/vllm-env/bin/vllm serve CohereLabs/North-Mini-Code-1.0-fp8 \
  --max-model-len 32768 --port 8000 \
  --served-model-name north-mini-code \
  --enable-auto-tool-choice --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 --moe-backend triton
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
  • Validate FP8 accuracy on your tasks. Run a representative sample of your coding tasks at both FP8 and BF16 before production rollout. For typical code generation, FP8 accuracy loss is under 2%. For tasks involving complex algorithmic derivations or mathematical reasoning, measure the delta before committing.
  • Prometheus scraping. vLLM exposes /metrics out of the box. Alert on vllm:num_requests_waiting > 10 (vLLM metric names use a vllm: prefix) as an early warning of capacity pressure.
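Where a full Prometheus stack is not yet in place, the queue-depth check from the last item can be approximated by reading /metrics directly. The sketch below parses the Prometheus text exposition format and assumes the gauge is exported as vllm:num_requests_waiting, following vLLM's vllm: naming prefix. The helper names are illustrative, and the threshold of 10 is the one suggested in the checklist.

```python
from typing import Optional

WAITING_METRIC = "vllm:num_requests_waiting"


def metric_total(metrics_text: str, name: str) -> Optional[float]:
    """Sum a metric across all label sets; return None if the metric is absent."""
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            end = line.rfind("}")
            if end == -1:
                continue  # malformed sample line
            series = line[: line.index("{")]
            fields = line[end + 1:].split()
        else:
            series, *fields = line.split()
        if series == name and fields:
            total += float(fields[0])  # first field after the series is the value
            found = True
    return total if found else None


def queue_alert(metrics_text: str, threshold: float = 10.0) -> bool:
    """True when more requests are queued than the threshold allows."""
    waiting = metric_total(metrics_text, WAITING_METRIC)
    return waiting is not None and waiting > threshold
```

The text passed in would come from an HTTP GET on http://localhost:8000/metrics; a sustained alert usually means the instance needs more KV cache headroom or a second GPU.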

Cohere North Mini Code runs on a single H100 SXM5 at FP8, with no per-seat API fees, full code privacy, and an Apache 2.0 license.

Quick Setup Guide

  1. Calculate VRAM and choose GPU configuration

    At FP8, North Mini Code requires approximately 34.5 GB total VRAM (30 GB weights plus 15% overhead). A single H100 SXM5 (80 GB) leaves around 45 GB for KV cache, which is sufficient for 32K context and 15+ concurrent coding agent sessions. At BF16 the runtime footprint reaches approximately 69 GB, leaving only 11 GB for KV cache. FP8 is recommended for single-GPU deployment. For AWQ INT4, the footprint drops to roughly 16-18 GB, fitting on an A100 40GB or L40S, but expect 3-5% accuracy loss on complex algorithmic tasks.

  2. Provision a Spheron H100 instance

    Log in at app.spheron.ai, navigate to GPU Cloud, and select an H100 SXM5 instance. Choose spot pricing for 30-40% hourly savings or on-demand for guaranteed availability. Deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 150 GB of persistent storage for model weights and the vLLM cache directory.

  3. Install vLLM and download North Mini Code weights

    North Mini Code requires vLLM from the main branch for cohere_command4 parser support (not available in the released vllm>=0.8.0 package). Run pip install git+https://github.com/vllm-project/vllm.git and pip install huggingface_hub hf_transfer. Also install the Cohere melody library: uv pip install 'cohere_melody>=0.9.0'. Export your HuggingFace token as HF_TOKEN and enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. For single-H100 FP8 deployment, download the official FP8 checkpoint with huggingface-cli download CohereLabs/North-Mini-Code-1.0-fp8. For BF16, use huggingface-cli download CohereLabs/North-Mini-Code-1.0.

  4. Launch vLLM with tool-call support

    Run: vllm serve CohereLabs/North-Mini-Code-1.0-fp8 --served-model-name north-mini-code --max-model-len 32768 --enable-auto-tool-choice --tool-call-parser cohere_command4 --reasoning-parser cohere_command4 --moe-backend triton --port 8000. Do not add an FP8 flag for the pre-quantized checkpoint: vLLM auto-detects the compressed-tensors FP8 format, --dtype does not accept fp8, and --quantization fp8 is meant for on-the-fly quantization of BF16 weights. The cohere_command4 parser is specified by the official North Mini Code FP8 model card and requires cohere_melody>=0.9.0. Verify tool-call output format by sending a test request with a tools array before connecting production harnesses.

  5. Wire into OpenCode or SWE-Agent harness

    In OpenCode's config.json, set baseURL to http://your-instance-ip:8000/v1 and model to north-mini-code (or whatever you pass as --served-model-name). For SWE-Agent-style loops, use the openai Python client with base_url pointing at your vLLM endpoint. The tool-call format matches the OpenAI function calling schema, so no framework modification is needed.

  6. Test function calling and monitor throughput

    Send a test curl request with a tools array to /v1/chat/completions. Confirm the response includes a tool_calls block with the correct function name and arguments JSON. Run nvidia-smi dmon -s u to verify GPU utilization above 80% during inference. Tune --max-num-seqs to control how many requests run concurrently and to protect latency targets for your coding agent sessions.

Frequently Asked Questions

At FP8 precision, North Mini Code's 30B total parameters require approximately 30 GB for weights plus roughly 15% runtime overhead, bringing the minimum to about 34.5 GB. A single H100 SXM5 (80 GB) handles this comfortably with around 45 GB remaining for KV cache headroom. At BF16 the weight footprint is about 60 GB with total runtime reaching approximately 69 GB, which is technically possible on a single H100 but leaves only around 11 GB for KV cache. FP8 is the right default for single-GPU deployment.

North Mini Code is Cohere's first North-family model, a 30B total / 3B-active MoE architecture specialized for agentic coding workflows and natively integrated with OpenCode harnesses. It supports a 256K token context window with a 64K max output length. Command A is a 111B dense model designed for enterprise RAG, also with 256K context, but licensed under CC-BY-NC 4.0. North Mini Code achieves 67.6% SWE-Bench Verified pass@1 and ships under Apache 2.0.

North Mini Code scores 67.6% SWE-Bench Verified pass@1. Devstral Small 2505 scores 46.8%. Qwen3-Coder-Next scores 70.6%. North Mini Code achieves a substantially higher SWE-Bench score than Devstral on identical single-H100 FP8 hardware, and requires a single H100 rather than the H200 that Qwen3-Coder-Next needs at FP8.

North Mini Code works with OpenCode and SWE-Agent-style harnesses when served through vLLM, which exposes an OpenAI-compatible /v1/chat/completions endpoint. OpenCode uses a configurable baseURL and model field; point it at the vLLM server IP and port. SWE-Agent-style loops use the openai Python client with a custom base_url parameter. North Mini Code's tool-call format follows the OpenAI function calling schema, so any framework that speaks that API works without modification.

At spot H100 SXM5 pricing ($2.91/hr as of 30 Jun 2026), North Mini Code at 30B FP8 generates an estimated 3,500+ tokens per second. At 70% sustained GPU utilization, that yields roughly 8.8 million output tokens per hour and a cost of roughly $0.33 per million output tokens (about $0.50 at on-demand pricing). Cohere's managed API prices output tokens at $10/M for Command A tier models (no separate North Mini Code price is published). For teams running high-volume coding agents, self-hosting delivers a roughly 20-30x reduction in token cost at the expense of infrastructure management.
