Deploy Kimi K2.7 Code on GPU Cloud: Self-Host Moonshot's 1T-Parameter Agentic Coding Model (2026)

Coming from Kimi K2.6? See the K2.6 deployment guide for the prior release. This guide covers K2.7 Code (June 2026).

Kimi K2.7 Code is Moonshot AI's June 12 2026 release. It shares the 1T/32B active MoE architecture with K2.6 but pivots to a coding-first focus: the benchmark suite is code-heavy (HumanEval, LiveCodeBench, SWE-bench), reasoning output is roughly 30% more concise per coding task, and the MoonViT vision encoder from K2.6 is absent. For teams running agentic coding workflows, that translates to lower per-task inference cost on the same hardware. The deployment footprint, VRAM sizing, and vLLM configuration are nearly identical to K2.6, with one meaningful addition: --enable-expert-parallel alongside tensor parallelism, which reduces all-to-all communication overhead for long code generation sequences.

What's New in Kimi K2.7 Code

Feature	K2.6	K2.7 Code
Total parameters	1T MoE	1T MoE
Active parameters	32B	32B
Expert count	384	384
Context window	256K	256K
Vision encoder (MoonViT)	Yes	No
Reasoning tokens	Baseline	~30% fewer than K2.6
Focus	General + multimodal agentic	Coding-first agentic
License	MIT	Modified MIT
HuggingFace release	April 2026	June 12 2026

The architecture is unchanged from K2.6: 1T total parameters with 32B active per forward pass, routed across 384 experts with top-8 selection. What changes is the training focus. K2.7 Code was trained with a heavier weighting on coding tasks, which produces shorter reasoning chains for code generation and presumably better performance on coding-specific benchmarks. Moonshot's published numbers come from their own evaluation suite only. No independent SWE-bench Verified results are available as of June 2026, so treat the benchmark figures as self-reported rather than externally validated.

The Modified MIT license is a meaningful detail. Before deploying K2.7 Code in commercial production, read the LICENSE file on the HuggingFace repository. The modifications to standard MIT are not publicly summarized, so review the full text.

Hardware Sizing

VRAM requirements for K2.7 Code follow the same weight structure as K2.6 (1T total parameters, 32B active per token). The KV cache benefits from the same Multi-head Latent Attention (MLA) compression as K2.6.

Precision	Model Weights	KV Cache (256K, batch 1)	KV Cache (32K, batch 8)	Minimum GPUs
FP16	~2 TB	~80 GB (with MLA)	~24 GB	16x H200 or 12x B200
FP8	~1 TB	~40 GB (with MLA)	~12 GB	8x H200 or 6x B200
AWQ INT4	~630 GB	~20 GB (with MLA)	~6 GB	8x H200 or 4x B200

The FP8 row is the practical production target. On 8x H200 SXM5 (~1128 GB HBM3e total), FP8 weights consume roughly 1 TB, leaving ~128 GB for KV cache at 256K context with small batch sizes. K2.7 Code also ships native INT4 weights, with vLLM, SGLang, and KTransformers all recommended by Moonshot for INT4 inference, making it a first-class deployment target rather than relying on community quantizations.
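
The headroom arithmetic behind the sizing table can be reproduced with a few lines. This is a minimal sketch that only subtracts the weight footprint from total HBM, using the per-GPU capacities and weight sizes quoted in this guide (with 1 TB treated as 1,000 GB, as the text does). It ignores activation memory, CUDA context, and framework overhead, so real headroom will be somewhat lower. The function name `kv_headroom_gb` is illustrative.

```python
# Per-GPU HBM capacities and weight footprints, copied from the tables in this guide.
GPU_VRAM_GB = {"H200": 141, "B200": 192}
WEIGHTS_GB = {"FP16": 2000, "FP8": 1000, "AWQ-INT4": 630}


def kv_headroom_gb(gpu: str, count: int, precision: str) -> int:
    """Total VRAM minus weight footprint; a negative value means the weights do not fit."""
    return GPU_VRAM_GB[gpu] * count - WEIGHTS_GB[precision]


if __name__ == "__main__":
    configs = [
        ("H200", 8, "FP8"),
        ("B200", 8, "FP8"),
        ("B200", 4, "AWQ-INT4"),
        ("H200", 4, "AWQ-INT4"),
    ]
    for gpu, count, precision in configs:
        headroom = kv_headroom_gb(gpu, count, precision)
        status = "fits" if headroom >= 0 else "does not fit"
        print(f"{count}x {gpu} {precision}: {headroom} GB headroom ({status})")
```

A negative result, as with 4x H200 and the AWQ weights, means that configuration cannot hold the model at all.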

Expert Parallelism for Coding Workloads

K2.7 Code's 384 experts across an 8-GPU node benefit from --enable-expert-parallel in vLLM alongside --tensor-parallel-size 8. The tradeoff is specific to the workload: expert parallelism reduces all-to-all communication overhead compared to tensor parallelism alone, particularly for long generation sequences. Code output tends to be longer than general assistant responses, so the expert parallelism benefit is more pronounced for coding tasks. For the decision framework on when to use expert vs tensor parallelism in MoE deployments, see the MoE inference optimization guide.
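
To make the expert layout concrete, the sketch below spreads 384 experts over 8 GPUs and shows which GPUs a single token touches when it is routed to 8 experts. The contiguous-block placement is an assumption chosen for illustration. vLLM's actual placement may differ, and the expert IDs in the example are made up.

```python
NUM_EXPERTS = 384  # from the architecture table
NUM_GPUS = 8       # one NVLink node
TOP_K = 8          # experts activated per token

EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 48 whole experts per GPU


def expert_to_gpu(expert_id: int) -> int:
    """Map an expert to a GPU using contiguous blocks (an assumed layout)."""
    if not 0 <= expert_id < NUM_EXPERTS:
        raise ValueError(f"expert_id out of range: {expert_id}")
    return expert_id // EXPERTS_PER_GPU


def gpus_for_token(selected_experts: list[int]) -> set[int]:
    """Return the set of GPUs that hold the experts chosen for one token."""
    return {expert_to_gpu(e) for e in selected_experts}


if __name__ == "__main__":
    # Hypothetical top-8 routing decision for a single token.
    selected = [3, 47, 48, 100, 191, 192, 300, 383]
    assert len(selected) == TOP_K
    print(f"{EXPERTS_PER_GPU} experts per GPU")
    print(f"token touches GPUs: {sorted(gpus_for_token(selected))}")
```

Because each token's 8 experts usually land on several different GPUs, interconnect bandwidth within the node matters for this layout.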

Single-node vs multi-node

K2.7 Code targets 8-way tensor + expert parallelism on a single NVLink node. NVLink 4 (H200) at 900 GB/s and NVLink 5 (B200) at 1.8 TB/s handle the all-to-all communication for 384 experts efficiently. Spreading across two 4-GPU InfiniBand-connected nodes drops bandwidth to 400 Gb/s, which creates a bottleneck specifically during expert dispatch. Stay on a single 8-GPU NVLink node.
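
The gap is easier to see once the units match, since InfiniBand is quoted in gigabits and NVLink in gigabytes. The sketch below converts the nominal figures from the paragraph above. These are peak numbers measured at different points in the system, so treat the ratios as order-of-magnitude guidance rather than a benchmark.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


# Nominal figures quoted in the text above.
INFINIBAND_GBYTE_S = gbit_to_gbyte(400)
NVLINK_GBYTE_S = {"NVLink 4 (H200)": 900, "NVLink 5 (B200)": 1800}

if __name__ == "__main__":
    print(f"InfiniBand: {INFINIBAND_GBYTE_S:.0f} GB/s")
    for name, bandwidth in NVLINK_GBYTE_S.items():
        ratio = bandwidth / INFINIBAND_GBYTE_S
        print(f"{name}: {bandwidth} GB/s, about {ratio:.0f}x the InfiniBand link")
```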

At 256K context, KV cache per request can exceed 20 GB for FP16 (even with MLA). For multi-session servers handling concurrent long-context requests, budget an additional 40-80 GB beyond the base weight requirement.

Recommended GPU Configurations on Spheron

GPU	VRAM	On-Demand (per GPU)	Spot (per GPU, 8-GPU bundle)	Notes
H200 SXM5 (8x)	141 GB HBM3e	$4.84/hr	$1.82/hr	Best value for FP8 and INT4 production
B200 SXM6 (8x)	192 GB HBM3e	$7.41/hr	$2.71/hr	Higher throughput, fits FP8 with more KV headroom

Teams running FP8 inference at 256K context can use H200 SXM5 on Spheron and fit the model comfortably with KV cache headroom for small batch sizes. H200's 900 GB/s NVLink bandwidth handles 8-way tensor + expert parallel without saturation for typical inference loads.

For higher-concurrency deployments where you need more KV cache headroom, Spheron B200 SXM6 nodes offer 192 GB per GPU (1,536 GB across 8 GPUs). With FP8 weights at ~1 TB, that leaves roughly 500 GB for KV cache and larger batch sizes at 256K context. FP16 weights (~2 TB) still do not fit on a single 8-GPU node. See best NVIDIA GPUs for LLMs for a side-by-side breakdown of H200 vs B200 architecture differences.

Pricing fluctuates based on GPU availability. The prices above are as of 14 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Deploy Kimi K2.7 Code with vLLM

Step 1: Launch Your GPU Instance

Log into Spheron dashboard
Select the GPU offer with 8x GPUs and click Next
Choose your GPU:

8x H200 SXM5 for cost-effective FP8 production
8x B200 SXM6 for higher KV headroom and batch throughput

Set storage to 800 GB minimum (model weights are ~630 GB compressed; extra space covers temporary download files and the OS/venv)
Choose Ubuntu 22.04 or Ubuntu 24.04

Step 2: Add the Startup Script

Add this cloud-init script to the deployment configuration. It installs vLLM nightly, downloads moonshotai/Kimi-K2.7-Code, and starts the inference server with expert parallelism enabled.

bash

#!/bin/bash
set -e

echo "--- Setting Up Environment ---"

sudo apt-get update -y
sudo apt-get install -y python3-venv

sudo python3 -m venv /opt/kimi_venv
source /opt/kimi_venv/bin/activate

pip install --upgrade pip
pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129

echo "--- Launching vLLM Server ---"

# For AWQ quantization, replace the serve line below with:
# nohup vllm serve moonshotai/Kimi-K2.7-Code-AWQ \
#     --quantization awq \
#     ...
# Note: Confirm the AWQ checkpoint name on HuggingFace before use,
# as quantized variants typically appear after the base model release.

nohup vllm serve moonshotai/Kimi-K2.7-Code \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --host 0.0.0.0 \
    --port 8000 \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --max-model-len 262144 \
    --trust-remote-code > /var/log/vllm.log 2>&1 &

echo "--- Waiting for server to initialize (ETA 20-30 mins) ---"

for i in {1..1800}; do
  if curl -s "http://localhost:8000/v1/models" > /dev/null; then
    echo "vLLM server is ready!"
    break
  fi

  if [ $((i % 15)) -eq 0 ]; then
    echo "Still waiting for model to load... ($i/1800)"
  fi

  sleep 2
done

if ! curl -s "http://localhost:8000/v1/models" > /dev/null; then
  echo "ERROR: Server took longer than 60 minutes to load."
  echo "Check /var/log/vllm.log for details."
  exit 1
fi

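Note on model ID: Verify the exact HuggingFace repo path before running. The model may be published as moonshotai/Kimi-K2.7-Code or under a slightly different slug. A wrong model ID does not stop the script right away: vllm serve runs in the background, so the download error lands in /var/log/vllm.log and the script only reports a failure when the readiness loop times out.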

Note on vision flags: K2.7 Code is text-only. Do not add --mm-encoder-tp-mode data, which was required for K2.6's MoonViT encoder. Adding it when the model has no vision encoder will cause a startup error.

Note on reasoning parser: The --reasoning-parser kimi_k2 flag tells vLLM how to parse thinking tokens. K2.7 Code still produces reasoning output (just more concisely than K2.6), so the flag is appropriate. If you find the model emits no reasoning tokens, the flag is harmless but can be omitted.

Note on vLLM version: K2.7 Code requires a nightly build until a stable release adds explicit support for the model and the kimi_k2 parsers. Pin the exact nightly build date in your startup script to avoid silent breakage from future nightly changes.

Note on gated access: If moonshotai/Kimi-K2.7-Code is a gated repo, first confirm on the model page that your account has access, then add export HF_TOKEN=your_token_here before the vllm serve line. Older huggingface_hub releases read the legacy HUGGING_FACE_HUB_TOKEN variable instead.

Step 3: Deploy and Monitor

Once deployed, SSH into the instance and monitor startup:

bash

tail -f /var/log/vllm.log

Model download and loading take 20-30 minutes. You will see per-GPU shard loading messages followed by "Application startup complete" when the server is ready.

Step 4: Verify the Deployment

bash

curl http://localhost:8000/v1/models

You should see moonshotai/Kimi-K2.7-Code in the response. Run a quick coding inference to confirm:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.7-Code",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON from a URL with proper error handling"}],
    "max_tokens": 512
  }'
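
For scripted health checks, the same verification can be done by parsing the /v1/models response instead of reading it by eye. The sketch below assumes the standard OpenAI-style list shape (a top-level "data" array of objects with an "id" field). The sample payload is abbreviated and hypothetical.

```python
import json

MODEL_ID = "moonshotai/Kimi-K2.7-Code"


def served_model_ids(models_response: str) -> list[str]:
    """Extract model IDs from the JSON body returned by GET /v1/models."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def is_served(models_response: str, model_id: str = MODEL_ID) -> bool:
    """Return True when the expected model ID appears in the response."""
    return model_id in served_model_ids(models_response)


if __name__ == "__main__":
    # Abbreviated, hypothetical response body for illustration.
    sample = '{"object": "list", "data": [{"id": "moonshotai/Kimi-K2.7-Code", "object": "model"}]}'
    print("model served:", is_served(sample))
```

In practice the response string would come from the curl call above or from an HTTP client of your choice.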

AWQ Quantization Path

To run on 4x B200 instead of 8 GPUs, use a pre-quantized AWQ checkpoint. AWQ reduces weight VRAM from ~1 TB (FP8) down to ~630 GB (INT4). On 4x B200 SXM6 (768 GB total), this leaves ~138 GB for KV cache, enough to serve the full 256K context window. Note: 4x H200 provides only 564 GB total, which is insufficient for the ~630 GB AWQ weight footprint.

bash

vllm serve moonshotai/Kimi-K2.7-Code-AWQ \
    --quantization awq \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --max-model-len 262144 \
    --trust-remote-code

If moonshotai/Kimi-K2.7-Code-AWQ is not yet published, check the model's HuggingFace page for community quantized checkpoints. AWQ variants of large models typically appear days to weeks after the base model release. See the AWQ quantization guide for choosing between AWQ, FP8, and GPTQ based on your accuracy/speed tradeoff.

Wiring K2.7 Code into Agentic Coding Workflows

MCP Server Integration

K2.7 Code's parallel tool execution via the kimi_k2 tool call parser makes it well-suited for MCP-style agentic coding loops. The model can issue multiple tool calls in a single response turn, which cuts round-trip latency for tasks that need to read multiple files or run tests alongside code edits.

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# MCP-style tool definitions for coding tasks
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a source file",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write or edit a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                },
                "required": ["path", "content"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the test suite",
            "parameters": {
                "type": "object",
                "properties": {"target": {"type": "string"}},
                "required": ["target"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[{"role": "user", "content": "Refactor the authentication module and verify tests pass"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=8192
)

# K2.7 Code can issue multiple tool calls in one turn
for call in response.choices[0].message.tool_calls or []:
    print(f"Tool: {call.function.name}, Args: {call.function.arguments}")
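
The snippet above only prints the requested tool calls. A full agent loop also has to execute them and send the results back as `tool` messages. The sketch below shows one way to do that step under the OpenAI chat-completions message format. The handlers are hypothetical stand-ins, and the fake call objects only mimic the `id`, `function.name`, and `function.arguments` fields used above.

```python
import json
from types import SimpleNamespace


def run_tool_calls(tool_calls, handlers):
    """Execute each requested tool and build the follow-up `tool` messages."""
    messages = []
    for call in tool_calls or []:
        handler = handlers.get(call.function.name)
        if handler is None:
            result = f"error: unknown tool {call.function.name}"
        else:
            # Arguments arrive as a JSON-encoded string.
            args = json.loads(call.function.arguments)
            result = handler(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": str(result)}
        )
    return messages


# Hypothetical stand-in handlers; real ones would touch the filesystem and test runner.
HANDLERS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda target: f"ran tests for {target}",
}

if __name__ == "__main__":
    fake_calls = [
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(name="read_file", arguments='{"path": "auth.py"}'),
        ),
        SimpleNamespace(
            id="call_2",
            function=SimpleNamespace(name="run_tests", arguments='{"target": "tests/"}'),
        ),
    ]
    for message in run_tool_calls(fake_calls, HANDLERS):
        print(message)
```

To continue the conversation, append the assistant message that contained the tool calls, then these `tool` messages, and call the chat completions endpoint again. Production code would also want to catch malformed JSON in the arguments.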

For the full MCP server setup, see the MCP server GPU deployment guide.

IDE Assistant Wiring (Continue.dev)

K2.7 Code exposes an OpenAI-compatible endpoint on port 8000, so it drops into Continue.dev's config.json as an openai-type model with apiBase pointing at your Spheron instance IP. Because the endpoint is fully OpenAI-compatible, the same configuration works for Aider, Cline, and any other tool that accepts a custom API base. For the full Continue.dev setup walkthrough, see the self-host AI coding assistant guide.
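
As a rough illustration, the snippet below generates a model entry of the kind described above. The key names (title, provider, model, apiBase, apiKey) follow Continue.dev's config.json format as an assumption. Check them against the Continue.dev documentation for your installed version. `YOUR_SPHERON_IP` is a placeholder.

```python
import json


def continue_model_entry(host: str, port: int = 8000) -> dict:
    """Build a Continue.dev-style model entry for a self-hosted vLLM endpoint."""
    return {
        "title": "Kimi K2.7 Code (self-hosted)",
        "provider": "openai",
        "model": "moonshotai/Kimi-K2.7-Code",
        "apiBase": f"http://{host}:{port}/v1",
        "apiKey": "unused",  # vLLM does not check the key unless one is configured
    }


if __name__ == "__main__":
    config = {"models": [continue_model_entry("YOUR_SPHERON_IP")]}
    print(json.dumps(config, indent=2))
```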

Long-Context Coding Sessions

For multi-file refactors at 256K context, add --enable-chunked-prefill and --max-num-batched-tokens 8192. Processing a 256K-token prefill in one step can spike GPU memory; chunking splits it into batches of at most 8,192 tokens and keeps memory use stable:

bash

vllm serve moonshotai/Kimi-K2.7-Code \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --max-model-len 262144 \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
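
To get a feel for what the 8,192-token budget means, the sketch below counts how many scheduler steps a prompt needs when each step processes at most `max_num_batched_tokens` prompt tokens. It is a lower bound under a simplifying assumption: the budget is shared with decode tokens from other in-flight requests, so a busy server may need more steps.

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int = 8192) -> int:
    """Minimum number of chunked-prefill steps for a prompt of the given length."""
    if prompt_tokens <= 0:
        return 0
    return math.ceil(prompt_tokens / max_num_batched_tokens)


if __name__ == "__main__":
    for tokens in (8_192, 100_000, 262_144):
        print(f"{tokens} prompt tokens -> at least {prefill_steps(tokens)} prefill steps")
```

A larger budget means fewer steps but a higher peak memory per step, which is the tradeoff the flag controls.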

Benchmarks

These figures are estimates based on K2.6 benchmarks at equivalent precision and node configuration, since K2.7 Code-specific throughput numbers have not been published at time of writing. K2.7 Code produces fewer reasoning tokens per coding task than K2.6, so each coding request emits fewer output tokens and finishes sooner than the raw tok/s figures suggest. See the vLLM vs TensorRT-LLM vs SGLang benchmarks post for the methodology used to produce these estimates.

GPU Config	Context Length	Batch Size	Throughput (tok/s)	TTFT (ms)	TBT (ms)
8x H200 SXM5, FP8	8K	8	~1,800	~280	~8
8x H200 SXM5, FP8	64K	2	~900	~1,200	~9
8x H200 SXM5, FP8	256K	1	~420	~4,800	~11
8x B200 SXM6, FP8	8K	8	~3,200	~180	~5
8x B200 SXM6, FP8	64K	2	~1,600	~750	~6
8x B200 SXM6, FP8	256K	1	~750	~3,000	~7

Estimated based on K2.6 architecture at equivalent precision. K2.7 Code's ~30% fewer reasoning output tokens means effective throughput per coding task will be higher than raw tok/s implies.

Cost Per Million Tokens

On 8x H200 SXM5 on-demand, you pay $38.72/hr (8 x $4.84). At ~1,800 tok/s with 8K context, the node produces about 6.48M output tokens per hour, which works out to roughly $5.98/M output tokens. For comparison, closed coding APIs typically price at $10-30/M tokens depending on input/output split.

K2.7 Code's ~30% shorter reasoning chains reduce the output token count per coding task relative to K2.6, which lowers the effective per-task cost beyond what the $/M token figure implies.
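
The $/M figures in this section follow from a single formula: node price per hour divided by tokens generated per hour. The sketch below reproduces them from the per-GPU prices and estimated throughput numbers quoted in this guide. The function name is illustrative, and the results are only as good as the throughput estimates.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gpus: int, tokens_per_second: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    node_hourly_usd = gpu_hourly_usd * gpus
    tokens_per_hour = tokens_per_second * 3600
    return node_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    scenarios = [
        ("8x H200 SXM5, 8K ctx", 4.84, 8, 1800),
        ("8x B200 SXM6, 8K ctx", 7.41, 8, 3200),
        ("8x H200 SXM5, 256K ctx", 4.84, 8, 420),
    ]
    for label, price, gpus, tok_s in scenarios:
        cost = cost_per_million_tokens(price, gpus, tok_s)
        print(f"{label}: ${cost:.2f} per million output tokens")
```

Swapping in spot prices or your own measured throughput gives a quick estimate for other configurations.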

Model	Hosting	$/M tokens (est.)	Notes
Kimi K2.7 Code	Spheron 8x H200 SXM5, on-demand	~$5.98	Based on $38.72/hr, ~1,800 tok/s at 8K
Kimi K2.7 Code	Spheron 8x B200 SXM6, on-demand	~$5.15	Based on $59.28/hr, ~3,200 tok/s at 8K
Kimi K2.7 Code	Spheron 8x H200 SXM5, 256K ctx	~$25.61	Based on $38.72/hr, ~420 tok/s
Kimi K2.6	Spheron 8x H200 SXM5 (May 2026)	~$9.69	Prior release baseline (older pricing)
Closed coding API	Vendor-hosted	$10-30/M	Per public pricing; varies by input/output split


K2.7 Code vs K2.6 vs Other Open Coding Models

Model	Params (active)	Context	Coding Focus	License	Self-Host Min GPUs
Kimi K2.7 Code	1T/32B MoE	256K	Coding-first	Modified MIT	8x H200 (FP8)
Kimi K2.6	1T/32B MoE	256K	General + multimodal	MIT	8x H200 (FP8)
DeepSeek V3.2 Speciale	685B/37B MoE	128K	General	MIT	8x H200 (FP8)
Qwen2.5-Coder 32B	32B dense	128K	Coding	Apache 2.0	1x A100 80GB (FP8)

When to pick which:

K2.7 Code: coding-first agentic tasks, you want fewer output tokens per task, Modified MIT is acceptable for your use case, and you do not need vision input
K2.6: multimodal workflows that mix vision and code, need the MoonViT pipeline, or prefer standard MIT
DeepSeek V3.2 Speciale: smaller VRAM footprint (685B vs 1T), unrestricted MIT license, general-purpose workloads
Qwen2.5-Coder 32B: single-GPU deployment, small team or tight budget, latency-sensitive autocomplete

For a side-by-side deployment comparison including K2.6 setup steps, see the K2.6 deployment guide. For single-GPU coding assistant setup with Qwen-Coder and Continue.dev, see the self-host AI coding assistant guide.

Production Checklist

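Observability: vLLM's OpenAI-compatible server exposes Prometheus metrics at http://localhost:8000/metrics by default, with no extra flag required. Have Prometheus scrape that endpoint and build Grafana dashboards on top. Key signals: KV cache usage (vllm:gpu_cache_usage_perc, or vllm:kv_cache_usage_perc in newer builds), queue depth (vllm:num_requests_waiting), and token throughput (the rate of vllm:generation_tokens_total). Metric names change between vLLM versions, so confirm them against the /metrics output of your pinned build.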
Content filtering: Apply guardrails at the API gateway layer before requests reach vLLM, not inside the model. An NGINX sidecar with rate limiting is a lightweight option.
Weight caching: K2.7 Code weights are ~630 GB in INT4. Attach a per-region NVMe volume and pre-download weights to cut startup from 20-30 minutes to 2-3 minutes on subsequent launches.
Version pinning: Pin the exact nightly build in your startup script. K2.7 Code is brand-new as of June 2026, and support for the model and its kimi_k2 parsers may not yet be in any stable vLLM release. Nightlies change daily, so an unpinned install can pick up a build that breaks the parsers without any change on your side.
License review: Before deploying in commercial production, read the Modified MIT LICENSE file on the HuggingFace repository. Do not assume it is equivalent to standard MIT without checking.

Kimi K2.7 Code's 256K context and coding-first MoE architecture make it a strong self-hosted alternative to closed coding APIs. Spheron's H200 SXM5 and B200 SXM6 nodes give you the NVLink interconnect needed for 8-way tensor and expert parallelism on a single node.


Quick Setup Guide

Launch a GPU instance on Spheron
Log into the Spheron dashboard at app.spheron.ai, select a GPU offer with 8 GPUs (H200 SXM5 or B200 SXM6), set storage to 800 GB minimum, and choose Ubuntu 22.04 or 24.04 as the base image.
Configure cloud-init for vLLM with expert parallelism
Add the startup script to the deployment configuration. The script installs vLLM nightly from wheels.vllm.ai, downloads moonshotai/Kimi-K2.7-Code from HuggingFace, and starts the inference server with tensor-parallel-size 8, enable-expert-parallel, and max-model-len 262144.
Wire K2.7 Code into an MCP-compatible agentic workflow
Point your MCP tool server or OpenAI-compatible client at http://your-spheron-ip:8000/v1. Use the kimi_k2 tool call parser for parallel tool execution across file reads, code writes, and test runs in a single response turn.
Verify the deployment
Run curl http://localhost:8000/v1/models and confirm moonshotai/Kimi-K2.7-Code appears in the response. Send a test coding request to verify tool calling and reasoning output are functioning correctly.

Frequently Asked Questions

How much VRAM does Kimi K2.7 Code need?
Kimi K2.7 Code uses the same 1T/32B active MoE architecture as K2.6, so weight memory is roughly 630 GB in INT4 and around 2 TB in FP16. For practical deployment, you need 8x H200 SXM5 (141 GB each, ~1128 GB total) for FP8 production, or 6x B200 SXM6 for FP8 or 4x B200 SXM6 for AWQ INT4, with additional VRAM for KV cache at 256K context.

Can I run Kimi K2.7 Code with vLLM?
Yes. Kimi K2.7 Code works with vLLM nightly builds. Use --tensor-parallel-size 8, --enable-expert-parallel for MoE routing, --tool-call-parser kimi_k2, --reasoning-parser kimi_k2, and --max-model-len 262144. The nightly build from https://wheels.vllm.ai/nightly/cu129 is required until the model is supported in a stable vLLM release.

What GPU does Spheron recommend for Kimi K2.7 Code?
For production workloads, 8x H200 SXM5 is the most cost-effective option at $4.84/GPU/hr ($38.72/hr for the 8-GPU bundle). For maximum throughput where you need more KV cache headroom at 256K context, 8x B200 SXM6 at $7.41/GPU/hr offers higher tensor core density and HBM3e bandwidth.

How is Kimi K2.7 Code different from K2.6?
K2.7 Code is coding-first rather than general-purpose. It produces approximately 30% fewer reasoning tokens per coding task vs K2.6, which lowers inference cost per code generation request. The architecture (1T total, 32B active, 384 experts) is identical to K2.6, so the deployment footprint and VRAM sizing are unchanged. K2.7 Code also ships under a Modified MIT license rather than K2.6's standard MIT. Note that K2.7 Code is text-only: the MoonViT vision encoder present in K2.6 is absent.

Does Kimi K2.7 Code support expert parallelism?
Yes. K2.7 Code's 384-expert MoE architecture benefits from --enable-expert-parallel in vLLM alongside --tensor-parallel-size 8. Expert parallelism reduces all-to-all communication overhead for long generation sequences, which matters for code output that tends to be longer than general assistant responses.