Kimi K2.6 is Moonshot AI's April 2026 release, and it is a meaningfully different deployment target from its predecessor. The K2 architecture includes MoonViT (a tile-based vision encoder) and Multi-head Latent Attention (MLA), which shape the memory profile and hardware requirements for both K2.5 and K2.6. K2.6's headline changes are in agentic scale: 300-agent swarms and up to 4,000 coordinated steps for long-horizon autonomous workflows. If you are starting from the older version, the Kimi K2.5 deployment guide covers the baseline architecture. This guide focuses on the K2.6-specific changes and how to run it on Spheron's H200 SXM5 and B200 SXM6 nodes.
What's New in Kimi K2.6
| Feature | K2.5 | K2.6 |
|---|---|---|
| Context window | 256K tokens | 256K tokens |
| Agentic scale | Sequential tool execution | 300-agent swarms, up to 4,000 coordinated steps |
| Long-horizon task handling | Multi-step completions | Extended multi-file project workflows |
| Recommended hardware | 8x H200, 8x B200, 8x B300 | 8x H200, 8x B200 (H200 is now the baseline) |
MoonViT (vision encoder) and Multi-head Latent Attention (MLA) are part of the core K2 architecture and are present in both K2.5 and K2.6. They matter for hardware sizing (see below), but they are not K2.6 introductions.
MoonViT is the K2 vision encoder used across the K2 model family. It uses a tile-based encoding approach, where images are split into fixed-size tiles and each tile is encoded independently before being projected into the LLM's token space. This increases image token throughput compared to generic CLIP-based encoders, which matters when you are processing documents, screenshots, or video frames in batch.
Multi-head Latent Attention (MLA) is also part of the core K2 architecture and has the largest impact on infrastructure sizing. Standard multi-head attention caches one key and one value tensor per head per layer. MLA compresses these into a single low-rank latent vector, so the KV cache footprint at 256K context is substantially smaller. For a 1T parameter model at full context, this is the difference between needing to offload KV cache to CPU DRAM and being able to keep it in HBM. See the GPU memory requirements guide for a broader breakdown of how attention mechanism choice affects KV cache scaling.
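To see why the latent compression moves the needle, here is a toy comparison of KV cache scaling under standard multi-head attention versus MLA. The layer count, head count, and latent width below are illustrative assumptions, not K2's published configuration; the sizing table below reflects the real model:

```python
# Toy KV cache scaling: standard MHA vs MLA. All dimensions here are
# illustrative assumptions, not K2's published config; see the sizing
# table below (~80 GB at FP16/256K with MLA) for the real numbers.
def mha_kv_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    # Standard attention caches one K and one V tensor per head per layer.
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

def mla_kv_bytes(layers, latent_dim, seq_len, dtype_bytes=2):
    # MLA caches a single low-rank latent vector per token per layer.
    return layers * latent_dim * seq_len * dtype_bytes

GB = 1024**3
seq = 262_144  # 256K context
print(f"MHA: {mha_kv_bytes(60, 64, 128, seq) / GB:.0f} GB per request")  # ~480 GB
print(f"MLA: {mla_kv_bytes(60, 576, seq) / GB:.0f} GB per request")      # ~17 GB
```

The absolute numbers depend entirely on the assumed dimensions; the point is the order-of-magnitude gap between caching per-head K/V tensors and caching one latent vector per token.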
Agentic improvements are the primary K2.6 delta. K2.6 scales to 300-agent swarms with up to 4,000 coordinated steps, enabling long-horizon autonomous workflows that K2.5 could not sustain. At the individual request level, K2.6 can issue multiple tool calls in a single response turn and collect all results before generating the next step. This cuts latency significantly for agentic tasks that touch multiple systems at once (e.g., searching a codebase and querying a database simultaneously).
Hardware Sizing
VRAM requirements for Kimi K2.6 follow the same weight structure as K2.5 (1T total parameters, 32B active per token), with MLA reducing KV cache requirements at long contexts.
| Precision | Model Weights | KV Cache (256K, batch 1) | KV Cache (32K, batch 8) | Minimum GPUs |
|---|---|---|---|---|
| FP16 | ~2 TB | ~80 GB (with MLA) | ~24 GB | 16x H200 or 12x B200 |
| FP8 | ~1 TB | ~40 GB (with MLA) | ~12 GB | 8x H200 or 6x B200 |
| AWQ INT4 | ~630 GB | ~20 GB (with MLA) | ~6 GB | 8x H200 or 4x B200 |
The FP8 row is the practical production target. On 8x H200 SXM5 (~1128 GB HBM3e total), FP8 weights consume roughly 1 TB, leaving ~128 GB for KV cache at 256K context with small batch sizes. That is workable for agentic use cases where you are typically running one or a small number of long sessions rather than high-concurrency short completions.
Single-node vs multi-node tradeoffs
K2.6 is designed for 8-way tensor parallelism on a single node with NVLink interconnects. NVLink 4 (H200) and NVLink 5 (B200) provide 900 GB/s and 1.8 TB/s bidirectional bandwidth respectively, which is sufficient for 8-way TP across the active 32B parameters. If you spread across two 4-GPU nodes connected by InfiniBand, the bandwidth drops to 400 Gb/s, and all-reduce latency becomes the bottleneck for attention head synchronization. Stay on a single 8-GPU NVLink node unless you are running FP16 weights and cannot fit on one node.
At 256K context, KV cache per request can exceed 20 GB for FP16 (even with MLA compression). For multi-session servers handling several concurrent long-context requests, budget an additional 40-80 GB beyond your base weight requirement.
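A rough admission check before renting a node, using the round numbers from the tables above (the 30 GB reserve for activations and framework overhead is a guessed allowance, not a measured figure):

```python
# HBM budget check using this guide's round numbers (all values in GB).
def fits(weights_gb, kv_per_req_gb, concurrent_reqs, hbm_per_gpu_gb, num_gpus, reserve_gb=30):
    # reserve_gb is a rough allowance for activations, CUDA graphs, and framework overhead.
    total = num_gpus * hbm_per_gpu_gb
    needed = weights_gb + kv_per_req_gb * concurrent_reqs + reserve_gb
    return needed <= total, total - needed

# FP8 weights (~1 TB) on 8x H200 (141 GB each), one 256K session (~40 GB KV with MLA):
ok, headroom = fits(1000, 40, 1, 141, 8)
print(ok, f"{headroom} GB headroom")  # True, 58 GB headroom
```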
Recommended GPU Configurations on Spheron
| GPU | VRAM | On-Demand (per GPU) | Spot (per GPU, 8-GPU bundle) | Notes |
|---|---|---|---|---|
| H200 SXM5 (8x) | 141 GB HBM3e | $4.36/hr | $1.76/hr | Best value for FP8 and INT4 production |
| B200 SXM6 (8x) | 192 GB HBM3e | $6.76/hr | $3.50/hr | Higher throughput, fits FP8 with more KV headroom |
| MI350X | 288 GB HBM3e | Check /pricing/ for availability | — | Not currently listed in Spheron catalog |
Teams running FP8 inference at 256K context can rent H200 SXM5 on Spheron and fit the model comfortably with KV cache headroom for small batch sizes. The H200's 900 GB/s NVLink bandwidth handles 8-way tensor parallel without saturation for typical inference loads.
For batch vision processing or higher-concurrency deployments where you need more KV cache headroom, Spheron B200 SXM6 instances offer 192 GB per GPU, which opens up FP16 operation on a single 8-GPU node and larger batch sizes at 256K context. See best NVIDIA GPUs for LLMs for a side-by-side breakdown of architecture differences between these two generations.
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
Deploy Kimi K2.6 with vLLM
Step 1: Launch Your GPU Instance
- Log into Spheron dashboard
- Select the GPU offer with 8x GPUs and click Next
- Choose your GPU:
- 8x H200 SXM5 for cost-effective FP8 production
- 8x B200 SXM6 for higher KV headroom and batch throughput
- Set storage to 800 GB minimum for the AWQ INT4 checkpoint (~630 GB); plan for at least 1.5 TB if you are downloading the full FP8 checkpoint (~1 TB per the sizing table above). Extra space covers temporary download files and the OS/venv
- Choose Ubuntu 22.04 or Ubuntu 24.04
Step 2: Add the Startup Script
Add this cloud-init script to the deployment configuration. It installs vLLM nightly, downloads moonshotai/Kimi-K2.6, and starts the inference server.
```bash
#!/bin/bash
set -e
echo "--- Setting Up Environment ---"
sudo apt-get update -y
sudo apt-get install -y python3-venv
sudo python3 -m venv /opt/kimi_venv
source /opt/kimi_venv/bin/activate
pip install --upgrade pip
pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129
echo "--- Launching vLLM Server ---"
# For AWQ quantization, replace the serve line below with:
# nohup vllm serve moonshotai/Kimi-K2.6-AWQ \
# --quantization awq \
# ...
# This cuts VRAM from ~1 TB (FP8) to ~630 GB (AWQ INT4). Requires 4x B200 SXM6 (768 GB total). 4x H200 (564 GB) is insufficient.
nohup vllm serve moonshotai/Kimi-K2.6 \
--tensor-parallel-size 8 \
--host 0.0.0.0 \
--port 8000 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--max-model-len 262144 \
--trust-remote-code > /var/log/vllm.log 2>&1 &
echo "--- Waiting for server to initialize (ETA 20-30 mins) ---"
for i in {1..1800}; do
if curl -s "http://localhost:8000/v1/models" > /dev/null; then
echo "vLLM server is ready!"
break
fi
if [ $((i % 15)) -eq 0 ]; then
echo "Still waiting for model to load... ($i/1800)"
fi
sleep 2
done
if ! curl -s "http://localhost:8000/v1/models" > /dev/null; then
echo "ERROR: Server took longer than 60 minutes to load."
echo "Check /var/log/vllm.log for details."
exit 1
fi
```
Note on model availability: If moonshotai/Kimi-K2.6 is a gated repo on HuggingFace, you will need to set `HUGGING_FACE_HUB_TOKEN` in your environment before running the script. Add `export HUGGING_FACE_HUB_TOKEN=your_token_here` before the `vllm serve` line and confirm the model page shows your account has access.
Note on vLLM version: K2.5 required a nightly build from wheels.vllm.ai. K2.6 likely follows the same pattern until a stable vLLM release adds explicit support. Check the vLLM changelog before switching to a stable release, as missing support for --mm-encoder-tp-mode data or kimi_k2 parsers will cause a silent fallback or startup error.
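One way to guard against that drift is to log and assert the installed vLLM version before launching the server, as sketched below; the pinned string is a placeholder for whichever nightly you actually validated:

```python
# fail_fast_version.py -- run before launching vllm serve.
import sys
import vllm

# Placeholder: record the exact nightly version string you validated.
PINNED = "0.0.0.dev-placeholder"

print(f"vLLM version at startup: {vllm.__version__}")
if vllm.__version__ != PINNED:
    sys.exit(f"Refusing to start: expected vLLM {PINNED}, found {vllm.__version__}")
```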
Step 3: Deploy and Monitor
Once deployed, SSH into the instance and monitor startup:
```bash
tail -f /var/log/vllm.log
```
Model download and loading takes 20-30 minutes. You will see per-GPU shard loading messages followed by "Application startup complete" when the server is ready.
Step 4: Verify the Deployment
```bash
curl http://localhost:8000/v1/models
```
You should see moonshotai/Kimi-K2.6 in the response. Run a quick text inference to confirm:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.6",
"messages": [{"role": "user", "content": "Write a Python function to parse JSON from a URL"}],
"max_tokens": 512
}'
```
AWQ Quantization Path
If you want to run on 4x B200 instead of 8 GPUs, use a pre-quantized AWQ checkpoint. AWQ reduces weight VRAM from ~1 TB (FP8) down to ~630 GB (INT4). On 4x B200 SXM6 (768 GB total), this leaves ~138 GB for KV cache, enough to serve the full 256K context window. Note: 4x H200 provides only 564 GB total, which is insufficient for the ~630 GB AWQ weight footprint. Replace the vllm serve command:
```bash
vllm serve moonshotai/Kimi-K2.6-AWQ \
--quantization awq \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8000 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--max-model-len 262144 \
--trust-remote-code
```
Stick with 8 GPUs if you need more KV cache headroom for larger batch sizes at 256K. See the AWQ quantization guide for choosing between AWQ, FP8, and GPTQ based on your accuracy/speed tradeoff.
Deploy Kimi K2.6 with SGLang
SGLang is the better choice for K2.6 when your workload involves long agentic sessions with repeated structure. The reasons are specific to how K2.6 is used, not generic framework preference.
RadixAttention prefix caching is the primary reason. Agentic sessions typically start with a fixed system prompt plus tool definitions that can run to 2,000-4,000 tokens. With standard attention, every new session re-processes this prefix from scratch. SGLang's RadixAttention caches the KV tensors for any shared prefix and reuses them across requests. For K2.6 deployed as an agentic coding assistant where the same tool schema appears in every session, this can cut TTFT by 40-60% at scale.
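Exploiting this requires no client-side API changes, only discipline: the shared prefix must stay byte-identical across sessions. A sketch, with a stand-in tool schema:

```python
# RadixAttention reuse is automatic when the prefix is byte-identical across requests.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Keep this block identical (same whitespace, same ordering) in every session so
# the server can reuse its cached KV tensors. The schema text is a stand-in example.
SHARED_PREFIX = (
    "You are a coding agent with tools: search_files(pattern, path), "
    "run_tests(module). Always return tool calls as JSON."
)

def new_session(user_msg: str):
    return client.chat.completions.create(
        model="moonshotai/Kimi-K2.6",
        messages=[
            {"role": "system", "content": SHARED_PREFIX},  # prefix cache hit after the first request
            {"role": "user", "content": user_msg},
        ],
    )

# From the second session onward, prefill only pays for the user message, not the prefix.
print(new_session("Find TODOs in src/").choices[0].message.content)
```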
Structured output is the second reason. K2.6's parallel tool execution returns JSON payloads that need to be valid Python/JSON objects. SGLang's constrained decoding enforces output grammar at the token level, so the model cannot produce a syntactically broken tool call even under truncation or sampling pressure.
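Through the OpenAI-compatible endpoint, one way to request schema-constrained output is the `response_format` JSON-schema extension. Support and exact naming vary across SGLang versions, so treat this as a sketch rather than the canonical invocation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Schema the decoder must satisfy token-by-token; the shape here is an example.
tool_call_schema = {
    "name": "tool_call",
    "schema": {
        "type": "object",
        "properties": {
            "tool": {"type": "string"},
            "arguments": {"type": "object"},
        },
        "required": ["tool", "arguments"],
    },
}

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Call search_files for the pattern 'api/v1'"}],
    response_format={"type": "json_schema", "json_schema": tool_call_schema},
)
print(response.choices[0].message.content)  # constrained to parse as JSON
```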
MLA optimization requires SGLang 0.4+. Add --enable-flashinfer-mla to activate the FlashInfer MLA kernel, which is optimized for the latent attention pattern and reduces attention computation time versus the generic flash attention path.
```bash
docker run --gpus all --ipc=host \
-p 8000:8000 \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.6 \
--tp 8 \
--host 0.0.0.0 \
--port 8000 \
--enable-flashinfer-mla \
--trust-remote-code
```
Use vLLM for high-concurrency short-context requests where throughput per second matters more than per-session caching. Use SGLang for long agentic sessions, document pipelines, and any workload where the same tool schema or system prompt recurs across requests. The SGLang production deployment guide has a fuller comparison of when to switch.
Vision Input Pipeline (MoonViT)
MoonViT splits images into 448x448-pixel tiles and encodes each tile independently using a ViT backbone before projecting to the LLM token space. A single 1024x1024 image produces roughly 576 image tokens. A 4K screenshot can produce 2,000+ tokens, so budget accordingly when sizing context windows for vision workloads.
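If the per-tile cost is roughly constant, you can budget image tokens before sending a request. The sketch below back-derives ~64 tokens per 448px tile from the 1024x1024 ≈ 576-token figure above; that constant is an inference from this guide's numbers, not a published spec:

```python
import math

TILE = 448
TOKENS_PER_TILE = 64  # back-derived: 1024x1024 -> ceil(1024/448)^2 = 9 tiles -> ~576 tokens

def estimate_image_tokens(width: int, height: int) -> int:
    # MoonViT encodes each fixed-size tile independently, so token count
    # scales with the tile grid covering the image.
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return tiles * TOKENS_PER_TILE

print(estimate_image_tokens(1024, 1024))  # 576
print(estimate_image_tokens(3840, 2160))  # 9 x 5 = 45 tiles -> 2880 tokens (a 4K screenshot)
```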
Sending an image request:
```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Recreate this UI as a React component with Tailwind CSS"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    max_tokens=8192
)
print(response.choices[0].message.content)
```
Sending a video clip:
```python
import subprocess, base64, tempfile, os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Extract one frame per second
with tempfile.TemporaryDirectory() as tmpdir:
    subprocess.run([
        "ffmpeg", "-i", "demo.mp4", "-vf", "fps=1", f"{tmpdir}/frame_%04d.png"
    ], check=True)
    frames = sorted(os.listdir(tmpdir))[:8]  # Cap at 8 frames
    content = [{"type": "text", "text": "Describe the UI interaction shown in these frames"}]
    for f in frames:
        with open(f"{tmpdir}/{f}", "rb") as fp:
            b64 = base64.b64encode(fp.read()).decode()
        content.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}})

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": content}],
    max_tokens=4096
)
```
The `--mm-encoder-tp-mode data` flag distributes MoonViT encoding across all 8 tensor-parallel ranks in data-parallel mode, so each GPU encodes a subset of the image tiles in parallel instead of every rank duplicating the same encoding work.
Agentic Deployment Patterns
Tool Calling
K2.6 uses the same kimi_k2 tool call parser as K2.5. The parallel tool execution capability means the model can now return multiple tool calls in one response turn:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search repository files for a pattern",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string"},
                    "path": {"type": "string"}
                },
                "required": ["pattern"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the test suite for a module",
            "parameters": {
                "type": "object",
                "properties": {
                    "module": {"type": "string"}
                },
                "required": ["module"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Find all API endpoints and run the API tests in parallel"}],
    tools=tools,
    tool_choice="auto"
)

# K2.6 may return multiple tool_calls in response.choices[0].message.tool_calls
for call in response.choices[0].message.tool_calls or []:
    print(f"Tool: {call.function.name}, Args: {call.function.arguments}")
```
Long-horizon Task Configuration
For 256K context sessions, add --enable-chunked-prefill and tune --max-num-batched-tokens to control how many prefill tokens are processed per step. At 256K context, running the full prefill in one step can spike memory; chunking keeps GPU memory stable:
```bash
vllm serve moonshotai/Kimi-K2.6 \
--tensor-parallel-size 8 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--max-model-len 262144 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--trust-remote-code
```
KV Cache Offloading
For extended multi-turn sessions that push past HBM capacity, use vLLM's built-in CPU offload or LMCache for NVMe-backed KV storage. The KV cache optimization guide covers both approaches. MLA's compressed KV format reduces what needs to be offloaded per layer compared to standard MHA, so K2.6 handles long sessions better than K2.5 even before you add offloading.
Autoscaling
For production agentic workloads, use spot instances for batch document processing (where latency is not user-facing) and on-demand instances for interactive sessions where a spot eviction would abort a live session. The cost difference is meaningful at scale: spot B200 SXM6 at $3.50/GPU/hr versus on-demand at $6.76/GPU/hr. For document pipelines that can checkpoint and resume, spot is the right choice. For real-time coding assistants or interactive agents, on-demand is safer. See the agentic RAG infrastructure guide for patterns on deploying K2.6 in a retrieval-augmented agentic setup, and MCP server GPU deployment if you are building K2.6 into an MCP-compatible tool server.
Benchmarks
These figures are estimates based on K2.5 benchmarks at equivalent precision and node configuration, as K2.6-specific published numbers are not yet available at time of writing. MLA's lower KV cache footprint tends to improve throughput at long contexts compared to K2.5 at the same precision. Check the vLLM vs TensorRT-LLM vs SGLang benchmarks post for the methodology used to produce these estimates.
| GPU Config | Context Length | Batch Size | Throughput (tok/s) | TTFT (ms) | TBT (ms) |
|---|---|---|---|---|---|
| 8x H200 SXM5, FP8 | 8K | 8 | ~1,800 | ~280 | ~8 |
| 8x H200 SXM5, FP8 | 64K | 2 | ~900 | ~1,200 | ~9 |
| 8x H200 SXM5, FP8 | 256K | 1 | ~420 | ~4,800 | ~11 |
| 8x B200 SXM6, FP8 | 8K | 8 | ~3,200 | ~180 | ~5 |
| 8x B200 SXM6, FP8 | 64K | 2 | ~1,600 | ~750 | ~6 |
| 8x B200 SXM6, FP8 | 256K | 1 | ~750 | ~3,000 | ~7 |
Estimated based on K2.5 architecture at equivalent precision. Actual K2.6 numbers may differ due to MLA attention kernel implementation and MoonViT overhead.
Cost Economics
At 1,000 tok/s on 8x H200 SXM5 on-demand, you are paying $34.88/hr (8 GPUs x $4.36). At 1 million tokens, that works out to roughly $9.69/M tokens on-demand or $3.91/M tokens at spot pricing. For comparison, closed-source multimodal APIs with comparable capability typically price at $10-30/M tokens depending on input/output split and vision input charges.
The B200 SXM6 costs more per GPU but delivers higher throughput. At the estimated 3,200 tok/s (batch 8, 8K context), 8x B200 on-demand at $54.08/hr ($6.76 x 8) comes to roughly $4.69/M tokens. That is a better cost-per-token than H200 on-demand for high-throughput batch workloads. At spot prices, H200 is cheaper per million tokens in both scenarios. H200's lower hourly rate more than offsets B200's throughput advantage whether you are running 8K or 256K context.
Spot pricing comparison by context length:
| Scenario | Cost |
|---|---|
| 8x B200 SXM6 spot, 8K context, 3,200 tok/s | ~$2.43/M tokens |
| 8x H200 SXM5 spot, 8K context, 1,800 tok/s | ~$2.17/M tokens |
| 8x H200 SXM5 spot, 256K context, 420 tok/s | ~$9.31/M tokens |
| 8x B200 SXM6 spot, 256K context, 750 tok/s | ~$10.37/M tokens |
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
Production Checklist
- Observability: Enable the vLLM Prometheus metrics endpoint with `--enable-metrics` and point Grafana at `http://localhost:8000/metrics`. Key signals: `vllm:gpu_cache_usage_perc`, `vllm:request_queue_depth`, and `vllm:tokens_per_second`.
- Content filtering: Apply any guardrails at the API gateway layer before requests reach vLLM, not inside the model. vLLM does not apply content filtering by default. Cloudflare Workers or an nginx sidecar are lightweight options for rate limiting and input filtering.
- Weight caching: Kimi K2.6 weights are ~630 GB in INT4. Attach a per-region NVMe volume on Spheron and pre-download weights to that volume. Subsequent instance launches in the same region skip the HuggingFace download entirely, cutting startup from 20-30 minutes to 2-3 minutes.
- Semantic cache: For agentic workloads where users send similar queries repeatedly (e.g., "explain this code", "fix this bug"), a semantic cache layer in front of vLLM (such as a Redis-backed prompt similarity lookup) reduces GPU compute and cost substantially; a sketch follows this list.
- Version pinning: K2.6 currently requires a vLLM nightly build. Pin the exact wheel hash or nightly build date in your startup script so infrastructure updates do not silently break your deployment. Log the vLLM version at startup in your monitoring pipeline.
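A minimal sketch of the semantic cache item above, using an in-memory index and a stand-in embedder; in production you would swap in a real embedding model and a Redis-backed store:

```python
import numpy as np

# Stand-in embedder (hashed bag of words); swap in a real sentence-embedding
# model in production -- this toy version only matches near-identical wording.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        q = embed(prompt)
        for vec, response in self.entries:
            if float(q @ vec) >= self.threshold:  # cosine similarity (unit vectors)
                return response  # cache hit: skip the GPU entirely
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("explain this code", "This function parses JSON from a URL...")
print(cache.get("explain this code"))   # hit: returns the cached response
print(cache.get("rewrite it in Rust"))  # miss: None, fall through to vLLM
```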
Kimi K2.6's 256K context and MoonViT vision are a good fit for agentic document pipelines and multimodal coding workflows. Spheron's H200 SXM5 and B200 SXM6 nodes give you the NVLink bandwidth to run 8-way tensor parallelism without network bottlenecks.
Rent H200 on Spheron → | Rent B200 → | View all GPU pricing →
