Coming from Kimi K2.6? See the K2.6 deployment guide for the prior release. This guide covers K2.7 Code (June 2026).
Kimi K2.7 Code is Moonshot AI's June 12, 2026 release. It shares the 1T-total / 32B-active MoE architecture with K2.6 but pivots to a coding-first focus: the benchmark suite is code-heavy (HumanEval, LiveCodeBench, SWE-bench), reasoning output is roughly 30% more concise per coding task, and the MoonViT vision encoder from K2.6 is absent. For teams running agentic coding workflows, that translates to lower per-task inference cost on the same hardware. The deployment footprint, VRAM sizing, and vLLM configuration are nearly identical to K2.6, with one meaningful addition: --enable-expert-parallel alongside tensor parallelism, which places whole experts on individual GPUs and can reduce inter-GPU communication overhead for long code generation sequences.
What's New in Kimi K2.7 Code
| Feature | K2.6 | K2.7 Code |
|---|---|---|
| Total parameters | 1T MoE | 1T MoE |
| Active parameters | 32B | 32B |
| Expert count | 384 | 384 |
| Context window | 256K | 256K |
| Vision encoder (MoonViT) | Yes | No |
| Reasoning tokens | Baseline | ~30% fewer than K2.6 |
| Focus | General + multimodal agentic | Coding-first agentic |
| License | MIT | Modified MIT |
| HuggingFace release | April 2026 | June 12 2026 |
The architecture is unchanged from K2.6: 1T total parameters with 32B active per forward pass, routed across 384 experts with top-8 selection. What changes is the training focus. K2.7 Code was trained with a heavier weighting on coding tasks, which produces shorter reasoning chains for code generation and presumably better performance on coding-specific benchmarks. Moonshot's published numbers come from their own evaluation suite only. No independent SWE-bench Verified results are available as of June 2026, so treat the benchmark figures as self-reported rather than externally validated.
The Modified MIT license is a meaningful detail. Before deploying K2.7 Code in commercial production, read the LICENSE file on the HuggingFace repository. The modifications to standard MIT are not publicly summarized, so review the full text.
Hardware Sizing
VRAM requirements for K2.7 Code follow the same weight structure as K2.6 (1T total parameters, 32B active per token). The KV cache benefits from the same Multi-head Latent Attention (MLA) compression as K2.6.
| Precision | Model Weights | KV Cache (256K, batch 1) | KV Cache (32K, batch 8) | Minimum GPUs |
|---|---|---|---|---|
| FP16 | ~2 TB | ~80 GB (with MLA) | ~24 GB | 16x H200 or 12x B200 |
| FP8 | ~1 TB | ~40 GB (with MLA) | ~12 GB | 8x H200 or 6x B200 |
| AWQ INT4 | ~630 GB | ~20 GB (with MLA) | ~6 GB | 8x H200 or 4x B200 |
The FP8 row is the practical production target. On 8x H200 SXM5 (~1128 GB HBM3e total), FP8 weights consume roughly 1 TB, leaving ~128 GB for KV cache at 256K context with small batch sizes. K2.7 Code also ships native INT4 weights, with vLLM, SGLang, and KTransformers all recommended by Moonshot for INT4 inference, making it a first-class deployment target rather than relying on community quantizations.
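The headroom arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the approximate figures from the sizing table; the function and dictionary names are illustrative, and it ignores activations, CUDA context, and framework overhead, so real headroom will be lower.

```python
# Rough VRAM headroom check using the approximate figures from the sizing table.
# Ignores activations, CUDA context, and framework overhead.

GPU_VRAM_GB = {"H200": 141, "B200": 192}               # per-GPU capacity quoted in this guide
WEIGHTS_GB = {"FP16": 2000, "FP8": 1000, "INT4": 630}  # approximate weight footprints


def kv_headroom_gb(gpu: str, count: int, precision: str) -> int:
    """Return total VRAM minus weight footprint (negative means the weights do not fit)."""
    return GPU_VRAM_GB[gpu] * count - WEIGHTS_GB[precision]


if __name__ == "__main__":
    configs = [
        ("H200", 8, "FP8"),
        ("B200", 4, "INT4"),
        ("H200", 4, "INT4"),
        ("B200", 8, "FP16"),
    ]
    for gpu, count, precision in configs:
        gb = kv_headroom_gb(gpu, count, precision)
        status = "fits" if gb > 0 else "does not fit"
        print(f"{count}x {gpu} {precision}: {gb:+d} GB ({status})")
```

Under these assumptions, 8x H200 at FP8 leaves about 128 GB and 4x B200 at INT4 leaves about 138 GB, while 4x H200 at INT4 and 8x B200 at FP16 come out negative.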
Expert Parallelism for Coding Workloads
K2.7 Code's 384 experts across an 8-GPU node benefit from --enable-expert-parallel in vLLM alongside --tensor-parallel-size 8. With the flag set, each GPU holds whole experts (384 / 8 = 48 per GPU) rather than a slice of every expert. The tradeoff is workload-specific: tokens must be dispatched to the GPUs that own their selected experts, but this can lower total communication overhead compared to tensor parallelism alone, particularly for long generation sequences. Code output tends to be longer than general assistant responses, so the benefit is more pronounced for coding tasks. For the decision framework on when to use expert vs tensor parallelism in MoE deployments, see the MoE inference optimization guide.
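To make the expert-parallel layout concrete, the sketch below maps expert IDs to GPU ranks for 384 experts on 8 GPUs. The contiguous, equal-sized partitioning is an assumption chosen for illustration; vLLM's actual placement strategy may differ, and the function names are hypothetical.

```python
# Illustrative expert-to-GPU mapping under expert parallelism.
# Assumption: experts are split into contiguous, equal-sized blocks per rank.

NUM_EXPERTS = 384  # from the architecture table
EP_SIZE = 8        # one expert-parallel rank per GPU


def expert_owner(expert_id: int, num_experts: int = NUM_EXPERTS, ep_size: int = EP_SIZE) -> int:
    """Return the rank that holds a given expert under contiguous partitioning."""
    if not 0 <= expert_id < num_experts:
        raise ValueError(f"expert_id {expert_id} out of range")
    experts_per_rank = num_experts // ep_size  # 48 for 384 experts on 8 GPUs
    return expert_id // experts_per_rank


def ranks_for_token(selected_experts: list[int]) -> set[int]:
    """Return the set of ranks a token must be dispatched to for its selected experts."""
    return {expert_owner(e) for e in selected_experts}


if __name__ == "__main__":
    # A hypothetical top-8 selection for one token.
    top8 = [0, 10, 47, 48, 100, 200, 300, 383]
    print(sorted(ranks_for_token(top8)))
```

With top-8 routing, a single token can need experts on several GPUs at once, which is why interconnect bandwidth matters for expert dispatch.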
Single-node vs multi-node
K2.7 Code targets 8-way tensor + expert parallelism on a single NVLink node. NVLink 4 (H200) at 900 GB/s and NVLink 5 (B200) at 1.8 TB/s handle the all-to-all communication for 384 experts efficiently. Spreading across two 4-GPU InfiniBand-connected nodes drops inter-node bandwidth to 400 Gb/s (about 50 GB/s), which creates a bottleneck specifically during expert dispatch. Stay on a single 8-GPU NVLink node.
At 256K context, the KV cache for a single request is sizable even with MLA: roughly 20 GB to 80 GB depending on precision, per the sizing table above. For multi-session servers handling concurrent long-context requests, budget an additional 40-80 GB beyond the base weight requirement.
Recommended GPU Configurations on Spheron
| GPU | VRAM | On-Demand (per GPU) | Spot (per GPU, 8-GPU bundle) | Notes |
|---|---|---|---|---|
| H200 SXM5 (8x) | 141 GB HBM3e | $4.84/hr | $1.82/hr | Best value for FP8 and INT4 production |
| B200 SXM6 (8x) | 192 GB HBM3e | $7.41/hr | $2.71/hr | Higher throughput, fits FP8 with more KV headroom |
Teams running FP8 inference at 256K context can use H200 SXM5 on Spheron and fit the model comfortably with KV cache headroom for small batch sizes. H200's 900 GB/s NVLink bandwidth handles 8-way tensor + expert parallel without saturation for typical inference loads.
For higher-concurrency deployments where you need more KV cache headroom, Spheron B200 SXM6 nodes offer 192 GB per GPU (1,536 GB across 8 GPUs). After ~1 TB of FP8 weights, that leaves roughly 536 GB for KV cache, which allows larger batch sizes at 256K context. FP16 weights (~2 TB) still do not fit on a single 8-GPU B200 node; the sizing table puts FP16 at 12x B200. See best NVIDIA GPUs for LLMs for a side-by-side breakdown of H200 vs B200 architecture differences.
Pricing fluctuates based on GPU availability. The prices above are as of 14 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Deploy Kimi K2.7 Code with vLLM
Step 1: Launch Your GPU Instance
- Log into the Spheron dashboard
- Select the GPU offer with 8x GPUs and click Next
- Choose your GPU:
  - 8x H200 SXM5 for cost-effective FP8 production
  - 8x B200 SXM6 for higher KV headroom and batch throughput
- Set storage to 800 GB minimum (the native INT4 checkpoint is ~630 GB; extra space covers temporary download files and the OS/venv). If you pull an FP8 checkpoint instead (~1 TB per the sizing table), size the volume accordingly.
- Choose Ubuntu 22.04 or Ubuntu 24.04
Step 2: Add the Startup Script
Add this cloud-init script to the deployment configuration. It installs vLLM nightly, downloads moonshotai/Kimi-K2.7-Code, and starts the inference server with expert parallelism enabled.
#!/bin/bash
set -e

echo "--- Setting Up Environment ---"
sudo apt-get update -y
sudo apt-get install -y python3-venv
sudo python3 -m venv /opt/kimi_venv
source /opt/kimi_venv/bin/activate
pip install --upgrade pip
pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129

echo "--- Launching vLLM Server ---"
# For AWQ quantization, replace the serve line below with:
#   nohup vllm serve moonshotai/Kimi-K2.7-Code-AWQ \
#     --quantization awq \
#     ...
# Note: Confirm the AWQ checkpoint name on HuggingFace before use,
# as quantized variants typically appear after the base model release.
# --enable-auto-tool-choice is required for requests that use tool_choice="auto".
nohup vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --max-model-len 262144 \
  --trust-remote-code > /var/log/vllm.log 2>&1 &

echo "--- Waiting for server to initialize (ETA 20-30 mins) ---"
# Poll every 2 seconds, up to 1800 times (60 minutes total).
for i in {1..1800}; do
  if curl -s "http://localhost:8000/v1/models" > /dev/null; then
    echo "vLLM server is ready!"
    break
  fi
  if [ $((i % 15)) -eq 0 ]; then
    echo "Still waiting for model to load... ($i/1800)"
  fi
  sleep 2
done

if ! curl -s "http://localhost:8000/v1/models" > /dev/null; then
  echo "ERROR: Server took longer than 60 minutes to load."
  echo "Check /var/log/vllm.log for details."
  exit 1
fi

Note on model ID: Verify the exact HuggingFace repo path before running. The model may be published as moonshotai/Kimi-K2.7-Code or under a slightly different slug. A wrong model ID does not stop the script right away: vllm serve runs in the background, so the download error only shows up in /var/log/vllm.log while the wait loop keeps polling until it times out.
Note on vision flags: K2.7 Code is text-only. Do not add --mm-encoder-tp-mode data, which was required for K2.6's MoonViT encoder. Adding it when the model has no vision encoder will cause a startup error.
Note on reasoning parser: The --reasoning-parser kimi_k2 flag tells vLLM how to parse thinking tokens. K2.7 Code still produces reasoning output (just more concisely than K2.6), so the flag is appropriate. If you find the model emits no reasoning tokens, the flag is harmless but can be omitted.
Note on vLLM version: K2.7 Code requires a nightly build until a stable release adds explicit support for the model and the kimi_k2 parsers. Pin the exact nightly build date in your startup script to avoid silent breakage from future nightly changes.
Note on gated access: If moonshotai/Kimi-K2.7-Code is a gated repo, first confirm on the model page that your account has access, then add export HF_TOKEN=your_token_here before the vllm serve line. HUGGING_FACE_HUB_TOKEN is the older name for the same variable.
Step 3: Deploy and Monitor
Once deployed, SSH into the instance and monitor startup:
tail -f /var/log/vllm.log

Model download and loading takes 20-30 minutes. You will see per-GPU shard loading messages followed by "Application startup complete" when the server is ready.
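If you would rather wait for readiness from a client script than watch the log, the polling loop from the startup script can be sketched in Python. This is a minimal example using only the standard library; the function names are hypothetical, and it assumes the server returns the usual OpenAI-style model list (a "data" array of objects with an "id" field).

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_models(url: str, timeout: float = 5.0) -> Optional[dict]:
    """Return the parsed /v1/models response, or None if the server is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_served(
    model_id: str,
    url: str = "http://localhost:8000/v1/models",
    attempts: int = 1800,
    delay_s: float = 2.0,
    fetch: Callable[[str], Optional[dict]] = fetch_models,
) -> bool:
    """Poll until model_id appears in the model list; give up after `attempts` tries."""
    for _ in range(attempts):
        payload = fetch(url)
        if payload and any(m.get("id") == model_id for m in payload.get("data", [])):
            return True
        time.sleep(delay_s)
    return False
```

Calling wait_until_served("moonshotai/Kimi-K2.7-Code") with the defaults mirrors the script's 60-minute budget (1800 attempts, 2 seconds apart). The fetch parameter exists so the loop can be exercised without a live server.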
Step 4: Verify the Deployment
curl http://localhost:8000/v1/models

You should see moonshotai/Kimi-K2.7-Code in the response. Run a quick coding inference to confirm:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.7-Code",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON from a URL with proper error handling"}],
    "max_tokens": 512
  }'

AWQ Quantization Path
To run on 4x B200 instead of 8 GPUs, use a pre-quantized AWQ checkpoint. AWQ reduces weight VRAM from ~1 TB (FP8) down to ~630 GB (INT4). On 4x B200 SXM6 (768 GB total), this leaves ~138 GB for KV cache, enough to serve the full 256K context window. Note: 4x H200 provides only 564 GB total, which is insufficient for the ~630 GB AWQ weight footprint.
vllm serve moonshotai/Kimi-K2.7-Code-AWQ \
  --quantization awq \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --max-model-len 262144 \
  --trust-remote-code

If moonshotai/Kimi-K2.7-Code-AWQ is not yet published, check the model's HuggingFace page for community quantized checkpoints. AWQ variants of large models typically appear days to weeks after the base model release. See the AWQ quantization guide for choosing between AWQ, FP8, and GPTQ based on your accuracy/speed tradeoff.
Wiring K2.7 Code into Agentic Coding Workflows
MCP Server Integration
K2.7 Code's parallel tool execution via the kimi_k2 tool call parser makes it well-suited for MCP-style agentic coding loops. The model can issue multiple tool calls in a single response turn, which cuts round-trip latency for tasks that need to read multiple files or run tests alongside code edits.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# MCP-style tool definitions for coding tasks
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a source file",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write or edit a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                },
                "required": ["path", "content"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the test suite",
            "parameters": {
                "type": "object",
                "properties": {"target": {"type": "string"}},
                "required": ["target"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[{"role": "user", "content": "Refactor the authentication module and verify tests pass"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=8192
)

# K2.7 Code can issue multiple tool calls in one turn
for call in response.choices[0].message.tool_calls or []:
    print(f"Tool: {call.function.name}, Args: {call.function.arguments}")

For the full MCP server setup, see the MCP server GPU deployment guide.
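The example above stops at printing the tool calls. To close the loop, each call has to be executed and its result sent back as a tool message. The sketch below shows one way to do that, assuming OpenAI-style tool call objects (with id, function.name, and function.arguments); run_tool_calls and the stand-in handlers are hypothetical names, not part of any library.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def run_tool_calls(tool_calls: list[Any], handlers: dict[str, Callable[..., str]]) -> list[dict]:
    """Execute each tool call and return OpenAI-style `tool` messages, one per call."""
    results = []
    for call in tool_calls:
        name = call.function.name
        handler = handlers.get(name)
        if handler is None:
            output = f"error: unknown tool {name!r}"
        else:
            try:
                # Arguments arrive as a JSON string and must be decoded first.
                args = json.loads(call.function.arguments or "{}")
                output = handler(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                output = f"error: bad arguments for {name!r}: {exc}"
        results.append({"role": "tool", "tool_call_id": call.id, "content": output})
    return results


if __name__ == "__main__":
    # Stand-in handlers and fake tool calls so the sketch runs without a server.
    handlers = {
        "read_file": lambda path: f"(contents of {path})",
        "run_tests": lambda target: f"(test output for {target})",
    }
    fake_calls = [
        SimpleNamespace(id="call_1", function=SimpleNamespace(name="read_file", arguments='{"path": "auth.py"}')),
        SimpleNamespace(id="call_2", function=SimpleNamespace(name="run_tests", arguments='{"target": "tests/"}')),
    ]
    for message in run_tool_calls(fake_calls, handlers):
        print(message)
```

In a real loop you would append the assistant message and then these tool messages to messages, call client.chat.completions.create again, and repeat until the model returns no further tool calls.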
IDE Assistant Wiring (Continue.dev)
K2.7 Code exposes an OpenAI-compatible endpoint on port 8000, so it drops into Continue.dev's config.json as an openai-type model with apiBase pointing at your Spheron instance IP. Because the endpoint is fully OpenAI-compatible, the same configuration works for Aider, Cline, and any other tool that accepts a custom API base. For the full Continue.dev setup walkthrough, see the self-host AI coding assistant guide.
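As a rough illustration of that wiring, the snippet below builds a model entry and prints it as JSON. The field names (title, provider, model, apiBase, apiKey) are an assumption based on Continue.dev's config.json format, so check them against the Continue documentation for your version. The IP address is a documentation placeholder, and continue_model_entry is a hypothetical helper.

```python
import json


def continue_model_entry(host: str, port: int = 8000) -> dict:
    """Build a model entry pointing at a self-hosted OpenAI-compatible endpoint."""
    return {
        "title": "Kimi K2.7 Code (self-hosted)",
        "provider": "openai",                  # treat the endpoint as OpenAI-compatible
        "model": "moonshotai/Kimi-K2.7-Code",  # must match the ID served by vLLM
        "apiBase": f"http://{host}:{port}/v1",
        "apiKey": "unused",                    # placeholder; the examples in this guide set no API key
    }


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder; substitute your Spheron instance IP.
    config = {"models": [continue_model_entry("203.0.113.10")]}
    print(json.dumps(config, indent=2))
```

The same base URL and model ID are what Aider, Cline, and similar tools need when configured with a custom API base.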
Long-Context Coding Sessions
For multi-file refactors at 256K context, add --enable-chunked-prefill and --max-num-batched-tokens 8192. Processing a full 262,144-token prefill in one step can spike GPU memory; with an 8,192-token budget per scheduler step, the prompt is prefilled in at least 32 chunks, which keeps memory use stable:

vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code

Benchmarks
These figures are estimates based on K2.6 benchmarks at equivalent precision and node configuration, because K2.7 Code-specific throughput numbers have not been published at the time of writing. K2.7 Code produces fewer reasoning tokens per coding task than K2.6, so each request needs fewer output tokens and effective throughput per coding request is better than the raw tok/s figures suggest. In the table, TTFT is time to first token and TBT is time between tokens. See the vLLM vs TensorRT-LLM vs SGLang benchmarks post for the methodology used to produce these estimates.
| GPU Config | Context Length | Batch Size | Throughput (tok/s) | TTFT (ms) | TBT (ms) |
|---|---|---|---|---|---|
| 8x H200 SXM5, FP8 | 8K | 8 | ~1,800 | ~280 | ~8 |
| 8x H200 SXM5, FP8 | 64K | 2 | ~900 | ~1,200 | ~9 |
| 8x H200 SXM5, FP8 | 256K | 1 | ~420 | ~4,800 | ~11 |
| 8x B200 SXM6, FP8 | 8K | 8 | ~3,200 | ~180 | ~5 |
| 8x B200 SXM6, FP8 | 64K | 2 | ~1,600 | ~750 | ~6 |
| 8x B200 SXM6, FP8 | 256K | 1 | ~750 | ~3,000 | ~7 |
Estimated based on K2.6 architecture at equivalent precision. K2.7 Code's ~30% fewer reasoning output tokens means effective throughput per coding task will be higher than raw tok/s implies.
Cost Per Million Tokens
An 8x H200 SXM5 node costs $38.72/hr on-demand (8 x $4.84). At ~1,800 tok/s (8K context), that node produces about 6.48M output tokens per hour, which works out to roughly $5.98/M output tokens. For comparison, closed coding APIs typically price at $10-30/M tokens depending on input/output split.
K2.7 Code's ~30% shorter reasoning chains reduce the output token count per coding task relative to K2.6, which lowers the effective per-task cost beyond what the $/M token figure implies.
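The per-million-token figures in this section follow from a single formula: node price per hour divided by tokens generated per hour. The sketch below reproduces the arithmetic using the on-demand prices and estimated throughputs quoted in this guide; the function name is illustrative.

```python
def cost_per_million_tokens(price_per_gpu_hr: float, gpus: int, tokens_per_s: float) -> float:
    """Return dollars per one million output tokens at a sustained throughput."""
    node_price_hr = price_per_gpu_hr * gpus
    tokens_per_hr = tokens_per_s * 3600
    return node_price_hr / tokens_per_hr * 1_000_000


if __name__ == "__main__":
    scenarios = [
        ("8x H200 SXM5, 8K ctx", 4.84, 8, 1800),
        ("8x B200 SXM6, 8K ctx", 7.41, 8, 3200),
        ("8x H200 SXM5, 256K ctx", 4.84, 8, 420),
    ]
    for label, price, gpus, tps in scenarios:
        print(f"{label}: ${cost_per_million_tokens(price, gpus, tps):.2f}/M tokens")
```

Run as written, this yields $5.98, $5.15, and $25.61 per million tokens, matching the table rows. Because the throughput inputs are estimates, the outputs are estimates too; substitute your own measured tok/s and current pricing.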
| Model | Hosting | $/M tokens (est.) | Notes |
|---|---|---|---|
| Kimi K2.7 Code | Spheron 8x H200 SXM5, on-demand | ~$5.98 | Based on $38.72/hr, ~1,800 tok/s at 8K |
| Kimi K2.7 Code | Spheron 8x B200 SXM6, on-demand | ~$5.15 | Based on $59.28/hr, ~3,200 tok/s at 8K |
| Kimi K2.7 Code | Spheron 8x H200 SXM5, 256K ctx | ~$25.61 | Based on $38.72/hr, ~420 tok/s |
| Kimi K2.6 | Spheron 8x H200 SXM5 (May 2026) | ~$9.69 | Prior release baseline (older pricing) |
| Closed coding API | Vendor-hosted | $10-30/M | Per public pricing; varies by input/output split |
K2.7 Code vs K2.6 vs Other Open Coding Models
| Model | Params (active) | Context | Coding Focus | License | Self-Host Min GPUs |
|---|---|---|---|---|---|
| Kimi K2.7 Code | 1T/32B MoE | 256K | Coding-first | Modified MIT | 8x H200 (FP8) |
| Kimi K2.6 | 1T/32B MoE | 256K | General + multimodal | MIT | 8x H200 (FP8) |
| DeepSeek V3.2 Speciale | 685B/37B MoE | 128K | General | MIT | 8x H200 (FP8) |
| Qwen2.5-Coder 32B | 32B dense | 128K | Coding | Apache 2.0 | 1x A100 80GB (FP8) |
When to pick which:
- K2.7 Code: coding-first agentic tasks, you want fewer output tokens per task, Modified MIT is acceptable for your use case, and you do not need vision input
- K2.6: multimodal workflows that mix vision and code, need the MoonViT pipeline, or prefer standard MIT
- DeepSeek V3.2 Speciale: smaller VRAM footprint (685B vs 1T), unrestricted MIT license, general-purpose workloads
- Qwen2.5-Coder 32B: single-GPU deployment, small team or tight budget, latency-sensitive autocomplete
For a side-by-side deployment comparison including K2.6 setup steps, see the K2.6 deployment guide. For single-GPU coding assistant setup with Qwen-Coder and Continue.dev, see the self-host AI coding assistant guide.
Production Checklist
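- Observability: vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics by default, so no extra flag is needed. Scrape http://localhost:8000/metrics with Prometheus and chart it in Grafana. Key signals: KV cache usage (vllm:gpu_cache_usage_perc), queue depth (vllm:num_requests_waiting), and generation throughput (the rate of vllm:generation_tokens_total). Metric names change between vLLM versions, so confirm them against the /metrics output of your build.
- Content filtering: Apply guardrails at the API gateway layer before requests reach vLLM, not inside the model. An NGINX sidecar with rate limiting is a lightweight option.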
- Weight caching: K2.7 Code weights are ~630 GB in INT4. Attach a per-region NVMe volume and pre-download weights to cut startup from 20-30 minutes to 2-3 minutes on subsequent launches.
- Version pinning: Pin the exact nightly build date in your startup script. K2.7 Code is brand-new as of June 2026, and parser support may not yet be in any stable vLLM release. A nightly update could silently break the kimi_k2 parsers if the model lands mid-nightly-cycle.
- License review: Before deploying in commercial production, read the Modified MIT LICENSE file on the HuggingFace repository. Do not assume it is equivalent to standard MIT without checking.
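For the observability item, a quick way to see which vLLM metrics your build actually exposes is to parse the /metrics text yourself. The sketch below handles the Prometheus text exposition format in its simplest form (no trailing timestamps). The helper names are hypothetical, and the metric names in the sample are assumptions that may differ by vLLM version.

```python
from typing import Optional


def parse_metrics(text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Parse Prometheus text lines into {series: value}, keeping only one prefix."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        # The value is the last space-separated field (assumes no timestamp).
        series, _, raw = line.rpartition(" ")
        try:
            values[series] = float(raw)
        except ValueError:
            continue
    return values


def metric_value(values: dict[str, float], name: str) -> Optional[float]:
    """Return the first series whose metric name (ignoring labels) matches `name`."""
    for series, value in values.items():
        if series.split("{", 1)[0] == name:
            return value
    return None


if __name__ == "__main__":
    sample = (
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="moonshotai/Kimi-K2.7-Code"} 3.0\n'
        'vllm:num_requests_running{model_name="moonshotai/Kimi-K2.7-Code"} 8.0\n'
    )
    parsed = parse_metrics(sample)
    print(metric_value(parsed, "vllm:num_requests_waiting"))
```

In practice you would fetch the text from http://localhost:8000/metrics and pass it to parse_metrics; listing the keys is a fast check before building dashboards around specific names.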
Kimi K2.7 Code's 256K context and coding-first MoE architecture make it a strong self-hosted alternative to closed coding APIs. Spheron's H200 SXM5 and B200 SXM6 nodes give you the NVLink interconnect needed for 8-way tensor and expert parallelism on a single node.
Quick Setup Guide
Log into the Spheron dashboard at app.spheron.ai, select a GPU offer with 8 GPUs (H200 SXM5 or B200 SXM6), set storage to 800 GB minimum, and choose Ubuntu 22.04 or 24.04 as the base image.
Add the startup script to the deployment configuration. The script installs vLLM nightly from wheels.vllm.ai, downloads moonshotai/Kimi-K2.7-Code from HuggingFace, and starts the inference server with --tensor-parallel-size 8, --enable-expert-parallel, and --max-model-len 262144.
Point your MCP tool server or OpenAI-compatible client at http://your-spheron-ip:8000/v1. Use the kimi_k2 tool call parser for parallel tool execution across file reads, code writes, and test runs in a single response turn.
Run curl http://localhost:8000/v1/models and confirm moonshotai/Kimi-K2.7-Code appears in the response. Send a test coding request to verify tool calling and reasoning output are functioning correctly.
Frequently Asked Questions
How much VRAM does Kimi K2.7 Code need?
Kimi K2.7 Code uses the same 1T/32B active MoE architecture as K2.6, so weight memory is roughly 630 GB in INT4, ~1 TB in FP8, and around 2 TB in FP16. For practical deployment, you need 8x H200 SXM5 (141 GB each, ~1128 GB total) for FP8 production, or 6x B200 SXM6 for FP8 or 4x B200 SXM6 for AWQ INT4, with additional VRAM for KV cache at 256K context.
Can I run Kimi K2.7 Code with vLLM?
Yes. Kimi K2.7 Code works with vLLM nightly builds. Use --tensor-parallel-size 8, --enable-expert-parallel for the MoE layers, --enable-auto-tool-choice with --tool-call-parser kimi_k2, --reasoning-parser kimi_k2, and --max-model-len 262144. The nightly build from https://wheels.vllm.ai/nightly/cu129 is required until the model is supported in a stable vLLM release.
Which GPU should I use for Kimi K2.7 Code?
For production workloads, 8x H200 SXM5 is the most cost-effective option at $4.84/GPU/hr ($38.72/hr for the 8-GPU bundle). For maximum throughput where you need more KV cache headroom at 256K context, 8x B200 SXM6 at $7.41/GPU/hr offers higher tensor core density and HBM3e bandwidth.
How is K2.7 Code different from K2.6?
K2.7 Code is coding-first rather than general-purpose. It produces approximately 30% fewer reasoning tokens per coding task than K2.6, which lowers inference cost per code generation request. The architecture (1T total, 32B active, 384 experts) is identical to K2.6, so the deployment footprint and VRAM sizing are unchanged. K2.7 Code also ships under a Modified MIT license rather than K2.6's standard MIT. Note that K2.7 Code is text-only: the MoonViT vision encoder present in K2.6 is absent.
Does K2.7 Code support expert parallelism in vLLM?
Yes. K2.7 Code's 384-expert MoE architecture benefits from --enable-expert-parallel in vLLM alongside --tensor-parallel-size 8. Expert parallelism places whole experts on individual GPUs and can reduce inter-GPU communication overhead for long generation sequences, which matters for code output that tends to be longer than general assistant responses.
