Kimi K2.6 is Moonshot AI's April 2026 release, and it is a meaningfully different deployment target from its predecessor. The K2 architecture includes MoonViT (a tile-based vision encoder) and Multi-head Latent Attention (MLA), which shape the memory profile and hardware requirements for both K2.5 and K2.6. K2.6's headline changes are in agentic scale: 300-agent swarms and up to 4,000 coordinated steps for long-horizon autonomous workflows. If you are starting from the older version, the Kimi K2.5 deployment guide covers the baseline architecture. This guide focuses on the K2.6-specific changes and how to run it on Spheron's H200 SXM5 and B200 SXM6 nodes.
What's New in Kimi K2.6
| Feature | K2.5 | K2.6 |
|---|---|---|
| Context window | 256K tokens | 256K tokens |
| Agentic scale | Sequential tool execution | 300-agent swarms, up to 4,000 coordinated steps |
| Long-horizon task handling | Multi-step completions | Extended multi-file project workflows |
| Recommended hardware | 8x H200, 8x B200, 8x B300 | 8x H200, 8x B200 (H200 is now the baseline) |
MoonViT (vision encoder) and Multi-head Latent Attention (MLA) are part of the core K2 architecture and are present in both K2.5 and K2.6. They matter for hardware sizing (see below), but they are not K2.6 introductions.
MoonViT is the K2 vision encoder used across the K2 model family. It uses a tile-based encoding approach, where images are split into fixed-size tiles and each tile is encoded independently before being projected into the LLM's token space. This increases image token throughput compared to generic CLIP-based encoders, which matters when you are processing documents, screenshots, or video frames in batch.
Multi-head Latent Attention (MLA) is also part of the core K2 architecture and has the largest impact on infrastructure sizing. Standard multi-head attention caches one key and one value tensor per head per layer. MLA compresses these into a single low-rank latent vector, so the KV cache footprint at 256K context is substantially smaller. For a 1T parameter model at full context, this is the difference between needing to offload KV cache to CPU DRAM and being able to keep it in HBM. See the GPU memory requirements guide for a broader breakdown of how attention mechanism choice affects KV cache scaling.
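To see why the latent compression moves the needle, here is a toy comparison of KV cache scaling under standard multi-head attention versus MLA. The layer count, head count, and latent width below are illustrative assumptions, not K2's published configuration; the sizing table below reflects the real model:

```python
# Toy KV cache scaling: standard MHA vs MLA. All dimensions here are
# illustrative assumptions, not K2's published config; see the sizing
# table below (~80 GB at FP16/256K with MLA) for the real numbers.
def mha_kv_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    # Standard attention caches one K and one V tensor per head per layer.
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

def mla_kv_bytes(layers, latent_dim, seq_len, dtype_bytes=2):
    # MLA caches a single low-rank latent vector per token per layer.
    return layers * latent_dim * seq_len * dtype_bytes

GB = 1024**3
seq = 262_144  # 256K context
print(f"MHA: {mha_kv_bytes(60, 64, 128, seq) / GB:.0f} GB per request")  # ~480 GB
print(f"MLA: {mla_kv_bytes(60, 576, seq) / GB:.0f} GB per request")      # ~17 GB
```

The absolute numbers depend entirely on the assumed dimensions; the point is the order-of-magnitude gap between caching per-head K/V tensors and caching one latent vector per token.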
Agentic improvements are the primary K2.6 delta. K2.6 scales to 300-agent swarms with up to 4,000 coordinated steps, enabling long-horizon autonomous workflows that K2.5 could not sustain. At the individual request level, K2.6 can issue multiple tool calls in a single response turn and collect all results before generating the next step. This cuts latency significantly for agentic tasks that touch multiple systems at once (e.g., searching a codebase and querying a database simultaneously).
Hardware Sizing
VRAM requirements for Kimi K2.6 follow the same weight structure as K2.5 (1T total parameters, 32B active per token), with MLA reducing KV cache requirements at long contexts.
| Precision | Model Weights | KV Cache (256K, batch 1) | KV Cache (32K, batch 8) | Minimum GPUs |
|---|---|---|---|---|
| FP16 | ~2 TB | ~80 GB (with MLA) | ~24 GB | 16x H200 or 12x B200 |
| FP8 | ~1 TB | ~40 GB (with MLA) | ~12 GB | 8x H200 or 6x B200 |
| AWQ INT4 | ~630 GB | ~20 GB (with MLA) | ~6 GB | 8x H200 or 4x B200 |
The FP8 row is the practical production target. On 8x H200 SXM5 (~1128 GB HBM3e total), FP8 weights consume roughly 1 TB, leaving ~128 GB for KV cache at 256K context with small batch sizes. That is workable for agentic use cases where you are typically running one or a small number of long sessions rather than high-concurrency short completions.
Single-node vs multi-node tradeoffs
K2.6 is designed for 8-way tensor parallelism on a single node with NVLink interconnects. NVLink 4 (H200) and NVLink 5 (B200) provide 900 GB/s and 1.8 TB/s bidirectional bandwidth respectively, which is sufficient for 8-way TP across the active 32B parameters. If you spread across two 4-GPU nodes connected by InfiniBand, the bandwidth drops to 400 Gb/s, and all-reduce latency becomes the bottleneck for attention head synchronization. Stay on a single 8-GPU NVLink node unless you are running FP16 weights and cannot fit on one node.
At 256K context, KV cache per request can exceed 20 GB for FP16 (even with MLA compression). For multi-session servers handling several concurrent long-context requests, budget an additional 40-80 GB beyond your base weight requirement.
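A rough admission check before renting a node, using the round numbers from the tables above (the 30 GB reserve for activations and framework overhead is a guessed allowance, not a measured figure):

```python
# HBM budget check using this guide's round numbers (all values in GB).
def fits(weights_gb, kv_per_req_gb, concurrent_reqs, hbm_per_gpu_gb, num_gpus, reserve_gb=30):
    # reserve_gb is a rough allowance for activations, CUDA graphs, and framework overhead.
    total = num_gpus * hbm_per_gpu_gb
    needed = weights_gb + kv_per_req_gb * concurrent_reqs + reserve_gb
    return needed <= total, total - needed

# FP8 weights (~1 TB) on 8x H200 (141 GB each), one 256K session (~40 GB KV with MLA):
ok, headroom = fits(1000, 40, 1, 141, 8)
print(ok, f"{headroom} GB headroom")  # True, 58 GB headroom
```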
Recommended GPU Configurations on Spheron
| GPU | VRAM | On-Demand (per GPU) | Spot (per GPU, 8-GPU bundle) | Notes |
|---|---|---|---|---|
| H200 SXM5 (8x) | 141 GB HBM3e | $4.36/hr | $1.76/hr | Best value for FP8 and INT4 production |
| B200 SXM6 (8x) | 192 GB HBM3e | $6.76/hr | $3.50/hr | Higher throughput, fits FP8 with more KV headroom |
| MI350X | 288 GB HBM3e | Check /pricing/ for availability | — | Not currently listed in Spheron catalog |
Teams running FP8 inference at 256K context can rent H200 SXM5 on Spheron and fit the model comfortably with KV cache headroom for small batch sizes. The H200's 900 GB/s NVLink bandwidth handles 8-way tensor parallel without saturation for typical inference loads.
For batch vision processing or higher-concurrency deployments where you need more KV cache headroom, Spheron B200 SXM6 instances offer 192 GB per GPU, which opens up FP16 operation on a single 8-GPU node and larger batch sizes at 256K context. See best NVIDIA GPUs for LLMs for a side-by-side breakdown of architecture differences between these two generations.
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
Deploy Kimi K2.6 with vLLM
Step 1: Launch Your GPU Instance
- Log into Spheron dashboard
- Select the GPU offer with 8x GPUs and click Next
- Choose your GPU:
- 8x H200 SXM5 for cost-effective FP8 production
- 8x B200 SXM6 for higher KV headroom and batch throughput
- Set storage to 800 GB minimum for the AWQ INT4 checkpoint (~630 GB); plan for at least 1.5 TB if you are downloading the full FP8 checkpoint (~1 TB per the sizing table above). Extra space covers temporary download files and the OS/venv
- Choose Ubuntu 22.04 or Ubuntu 24.04
Step 2: Add the Startup Script
Add this cloud-init script to the deployment configuration. It installs vLLM nightly, downloads moonshotai/Kimi-K2.6, and starts the inference server.
```bash
#!/bin/bash
set -e
echo "--- Setting Up Environment ---"
sudo apt-get update -y
sudo apt-get install -y python3-venv
sudo python3 -m venv /opt/kimi_venv
source /opt/kimi_venv/bin/activate
pip install --upgrade pip
pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129
echo "--- Launching vLLM Server ---"
# For AWQ quantization, replace the serve line below with:
# nohup vllm serve moonshotai/Kimi-K2.6-AWQ \
# --quantization awq \
# ...
# This cuts VRAM from ~1 TB (FP8) to ~630 GB (AWQ INT4). Requires 4x B200 SXM6 (768 GB total). 4x H200 (564 GB) is insufficient.
nohup vllm serve moonshotai/Kimi-K2.6 \
--tensor-parallel-size 8 \
--host 0.0.0.0 \
--port 8000 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--max-model-len 262144 \
--trust-remote-code > /var/log/vllm.log 2>&1 &
echo "--- Waiting for server to initialize (ETA 20-30 mins) ---"
for i in {1..1800}; do
if curl -s "http://localhost:8000/v1/models" > /dev/null; then
echo "vLLM server is ready!"
break
fi
if [ $((i % 15)) -eq 0 ]; then
echo "Still waiting for model to load... ($i/1800)"
fi
sleep 2
done
if ! curl -s "http://localhost:8000/v1/models" > /dev/null; then
echo "ERROR: Server took longer than 60 minutes to load."
echo "Check /var/log/vllm.log for details."
exit 1
fi
```
Note on model availability: If moonshotai/Kimi-K2.6 is a gated repo on HuggingFace, you will need to set `HUGGING_FACE_HUB_TOKEN` in your environment before running the script. Add `export HUGGING_FACE_HUB_TOKEN=your_token_here` before the `vllm serve` line and confirm the model page shows your account has access.
Note on vLLM version: K2.5 required a nightly build from wheels.vllm.ai. K2.6 likely follows the same pattern until a stable vLLM release adds explicit support. Check the vLLM changelog before switching to a stable release, as missing support for --mm-encoder-tp-mode data or kimi_k2 parsers will cause a silent fallback or startup error.
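One way to guard against that drift is to log and assert the installed vLLM version before launching the server, as sketched below; the pinned string is a placeholder for whichever nightly you actually validated:

```python
# fail_fast_version.py -- run before launching vllm serve.
import sys
import vllm

# Placeholder: record the exact nightly version string you validated.
PINNED = "0.0.0.dev-placeholder"

print(f"vLLM version at startup: {vllm.__version__}")
if vllm.__version__ != PINNED:
    sys.exit(f"Refusing to start: expected vLLM {PINNED}, found {vllm.__version__}")
```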
Step 3: Deploy and Monitor
Once deployed, SSH into the instance and monitor startup:
```bash
tail -f /var/log/vllm.log
```
Model download and loading takes 20-30 minutes. You will see per-GPU shard loading messages followed by "Application startup complete" when the server is ready.
Step 4: Verify the Deployment
```bash
curl http://localhost:8000/v1/models
```
You should see moonshotai/Kimi-K2.6 in the response. Run a quick text inference to confirm:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.6",
"messages": [{"role": "user", "content": "Write a Python function to parse JSON from a URL"}],
"max_tokens": 512
}'
```
AWQ Quantization Path
If you want to run on 4x B200 instead of 8 GPUs, use a pre-quantized AWQ checkpoint. AWQ reduces weight VRAM from ~1 TB (FP8) down to ~630 GB (INT4). On 4x B200 SXM6 (768 GB total), this leaves ~138 GB for KV cache, enough to serve the full 256K context window. Note: 4x H200 provides only 564 GB total, which is insufficient for the ~630 GB AWQ weight footprint. Replace the vllm serve command:
```bash
vllm serve moonshotai/Kimi-K2.6-AWQ \
--quantization awq \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8000 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--max-model-len 262144 \
--trust-remote-code
```
Stick with 8 GPUs if you need more KV cache headroom for larger batch sizes at 256K. See the AWQ quantization guide for choosing between AWQ, FP8, and GPTQ based on your accuracy/speed tradeoff.
Deploy Kimi K2.6 with SGLang
SGLang is the better choice for K2.6 when your workload involves long agentic sessions with repeated structure. The reasons are specific to how K2.6 is used, not generic framework preference.
RadixAttention prefix caching is the primary reason. Agentic sessions typically start with a fixed system prompt plus tool definitions that can run to 2,000-4,000 tokens. With standard attention, every new session re-processes this prefix from scratch. SGLang's RadixAttention caches the KV tensors for any shared prefix and reuses them across requests. For K2.6 deployed as an agentic coding assistant where the same tool schema appears in every session, this can cut TTFT by 40-60% at scale.
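Exploiting this requires no client-side API changes, only discipline: the shared prefix must stay byte-identical across sessions. A sketch, with a stand-in tool schema:

```python
# RadixAttention reuse is automatic when the prefix is byte-identical across requests.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Keep this block identical (same whitespace, same ordering) in every session so
# the server can reuse its cached KV tensors. The schema text is a stand-in example.
SHARED_PREFIX = (
    "You are a coding agent with tools: search_files(pattern, path), "
    "run_tests(module). Always return tool calls as JSON."
)

def new_session(user_msg: str):
    return client.chat.completions.create(
        model="moonshotai/Kimi-K2.6",
        messages=[
            {"role": "system", "content": SHARED_PREFIX},  # prefix cache hit after the first request
            {"role": "user", "content": user_msg},
        ],
    )

# From the second session onward, prefill only pays for the user message, not the prefix.
print(new_session("Find TODOs in src/").choices[0].message.content)
```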
Structured output is the second reason. K2.6's parallel tool execution returns JSON payloads that need to be valid Python/JSON objects. SGLang's constrained decoding enforces output grammar at the token level, so the model cannot produce a syntactically broken tool call even under truncation or sampling pressure.
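Through the OpenAI-compatible endpoint, one way to request schema-constrained output is the `response_format` JSON-schema extension. Support and exact naming vary across SGLang versions, so treat this as a sketch rather than the canonical invocation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Schema the decoder must satisfy token-by-token; the shape here is an example.
tool_call_schema = {
    "name": "tool_call",
    "schema": {
        "type": "object",
        "properties": {
            "tool": {"type": "string"},
            "arguments": {"type": "object"},
        },
        "required": ["tool", "arguments"],
    },
}

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Call search_files for the pattern 'api/v1'"}],
    response_format={"type": "json_schema", "json_schema": tool_call_schema},
)
print(response.choices[0].message.content)  # constrained to parse as JSON
```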
MLA optimization requires SGLang 0.4+. Add --enable-flashinfer-mla to activate the FlashInfer MLA kernel, which is optimized for the latent attention pattern and reduces attention computation time versus the generic flash attention path.
```bash
docker run --gpus all --ipc=host \
-p 8000:8000 \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.6 \
--tp 8 \
--host 0.0.0.0 \
--port 8000 \
--enable-flashinfer-mla \
--trust-remote-code
```
Use vLLM for high-concurrency short-context requests where throughput per second matters more than per-session caching. Use SGLang for long agentic sessions, document pipelines, and any workload where the same tool schema or system prompt recurs across requests. The SGLang production deployment guide has a fuller comparison of when to switch.
Vision Input Pipeline (MoonViT)
MoonViT splits images into 448x448-pixel tiles and encodes each tile independently using a ViT backbone before projecting to the LLM token space. A single 1024x1024 image produces roughly 576 image tokens. A 4K screenshot can produce 2,000+ tokens, so budget accordingly when sizing context windows for vision workloads.
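If the per-tile cost is roughly constant, you can budget image tokens before sending a request. The sketch below back-derives ~64 tokens per 448px tile from the 1024x1024 ≈ 576-token figure above; that constant is an inference from this guide's numbers, not a published spec:

```python
import math

TILE = 448
TOKENS_PER_TILE = 64  # back-derived: 1024x1024 -> ceil(1024/448)^2 = 9 tiles -> ~576 tokens

def estimate_image_tokens(width: int, height: int) -> int:
    # MoonViT encodes each fixed-size tile independently, so token count
    # scales with the tile grid covering the image.
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return tiles * TOKENS_PER_TILE

print(estimate_image_tokens(1024, 1024))  # 576
print(estimate_image_tokens(3840, 2160))  # 9 x 5 = 45 tiles -> 2880 tokens (a 4K screenshot)
```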
Sending an image request:
```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Recreate this UI as a React component with Tailwind CSS"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    max_tokens=8192
)
print(response.choices[0].message.content)
```
Sending a video clip:
```python
import subprocess, base64, tempfile, os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Extract one frame per second
with tempfile.TemporaryDirectory() as tmpdir:
    subprocess.run([
        "ffmpeg", "-i", "demo.mp4", "-vf", "fps=1", f"{tmpdir}/frame_%04d.png"
    ], check=True)
    frames = sorted(os.listdir(tmpdir))[:8]  # Cap at 8 frames
    content = [{"type": "text", "text": "Describe the UI interaction shown in these frames"}]
    for f in frames:
        with open(f"{tmpdir}/{f}", "rb") as fp:
            b64 = base64.b64encode(fp.read()).decode()
        content.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}})

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": content}],
    max_tokens=4096
)
```
The `--mm-encoder-tp-mode data` flag distributes MoonViT encoding across all 8 tensor-parallel ranks in data-parallel mode, so each GPU encodes a subset of the image tiles in parallel instead of every rank duplicating the same encoding work.
Agentic Deployment Patterns
Tool Calling
K2.6 uses the same kimi_k2 tool call parser as K2.5. The parallel tool execution capability means the model can now return multiple tool calls in one response turn:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search repository files for a pattern",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string"},
                    "path": {"type": "string"}
                },
                "required": ["pattern"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the test suite for a module",
            "parameters": {
                "type": "object",
                "properties": {
                    "module": {"type": "string"}
                },
                "required": ["module"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Find all API endpoints and run the API tests in parallel"}],
    tools=tools,
    tool_choice="auto"
)

# K2.6 may return multiple tool_calls in response.choices[0].message.tool_calls
for call in response.choices[0].message.tool_calls or []:
    print(f"Tool: {call.function.name}, Args: {call.function.arguments}")
```
Long-horizon Task Configuration
For 256K context sessions, add --enable-chunked-prefill and tune --max-num-batched-tokens to control how many prefill tokens are processed per step. At 256K context, running the full prefill in one step can spike memory; chunking keeps GPU memory stable:
```bash
vllm serve moonshotai/Kimi-K2.6 \
--tensor-parallel-size 8 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--max-model-len 262144 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--trust-remote-code
```
KV Cache Offloading
For extended multi-turn sessions that push past HBM capacity, use vLLM's built-in CPU offload or LMCache for NVMe-backed KV storage. The KV cache optimization guide covers both approaches. MLA's compressed KV format reduces what needs to be offloaded per layer compared to standard MHA, so K2.6 handles long sessions better than K2.5 even before you add offloading.
Autoscaling
For production agentic workloads, use spot instances for batch document processing (where latency is not user-facing) and on-demand instances for interactive sessions where a spot eviction would abort a live session. The cost difference is meaningful at scale: spot B200 SXM6 at $3.50/GPU/hr versus on-demand at $6.76/GPU/hr. For document pipelines that can checkpoint and resume, spot is the right choice. For real-time coding assistants or interactive agents, on-demand is safer. See the agentic RAG infrastructure guide for patterns on deploying K2.6 in a retrieval-augmented agentic setup, and MCP server GPU deployment if you are building K2.6 into an MCP-compatible tool server.
Benchmarks
These figures are estimates based on K2.5 benchmarks at equivalent precision and node configuration, as K2.6-specific published numbers are not yet available at time of writing. MLA's lower KV cache footprint tends to improve throughput at long contexts compared to K2.5 at the same precision. Check the vLLM vs TensorRT-LLM vs SGLang benchmarks post for the methodology used to produce these estimates.
| GPU Config | Context Length | Batch Size | Throughput (tok/s) | TTFT (ms) | TBT (ms) |
|---|---|---|---|---|---|
| 8x H200 SXM5, FP8 | 8K | 8 | ~1,800 | ~280 | ~8 |
| 8x H200 SXM5, FP8 | 64K | 2 | ~900 | ~1,200 | ~9 |
| 8x H200 SXM5, FP8 | 256K | 1 | ~420 | ~4,800 | ~11 |
| 8x B200 SXM6, FP8 | 8K | 8 | ~3,200 | ~180 | ~5 |
| 8x B200 SXM6, FP8 | 64K | 2 | ~1,600 | ~750 | ~6 |
| 8x B200 SXM6, FP8 | 256K | 1 | ~750 | ~3,000 | ~7 |
Estimated based on K2.5 architecture at equivalent precision. Actual K2.6 numbers may differ due to MLA attention kernel implementation and MoonViT overhead.
Cost Economics
At 1,000 tok/s on 8x H200 SXM5 on-demand, you are paying $34.88/hr (8 GPUs x $4.36). At 1 million tokens, that works out to roughly $9.69/M tokens on-demand or $3.91/M tokens at spot pricing. For comparison, closed-source multimodal APIs with comparable capability typically price at $10-30/M tokens depending on input/output split and vision input charges.
The B200 SXM6 costs more per GPU but delivers higher throughput. At the estimated 3,200 tok/s (batch 8, 8K context), 8x B200 on-demand at $54.08/hr ($6.76 x 8) comes to roughly $4.69/M tokens. That is a better cost-per-token than H200 on-demand for high-throughput batch workloads. At spot prices, H200 is cheaper per million tokens in both scenarios. H200's lower hourly rate more than offsets B200's throughput advantage whether you are running 8K or 256K context.
Spot pricing comparison by context length:
| Scenario | Cost |
|---|---|
| 8x B200 SXM6 spot, 8K context, 3,200 tok/s | ~$2.43/M tokens |
| 8x H200 SXM5 spot, 8K context, 1,800 tok/s | ~$2.17/M tokens |
| 8x H200 SXM5 spot, 256K context, 420 tok/s | ~$9.31/M tokens |
| 8x B200 SXM6 spot, 256K context, 750 tok/s | ~$10.37/M tokens |
Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.
Production Checklist
- Observability: Enable the vLLM Prometheus metrics endpoint with `--enable-metrics` and point Grafana at `http://localhost:8000/metrics`. Key signals: `vllm:gpu_cache_usage_perc`, `vllm:request_queue_depth`, and `vllm:tokens_per_second`.
- Content filtering: Apply any guardrails at the API gateway layer before requests reach vLLM, not inside the model. vLLM does not apply content filtering by default. Cloudflare Workers or an nginx sidecar are lightweight options for rate limiting and input filtering.
- Weight caching: Kimi K2.6 weights are ~630 GB in INT4. Attach a per-region NVMe volume on Spheron and pre-download weights to that volume. Subsequent instance launches in the same region skip the HuggingFace download entirely, cutting startup from 20-30 minutes to 2-3 minutes.
- Semantic cache: For agentic workloads where users send similar queries repeatedly (e.g., "explain this code", "fix this bug"), a semantic cache layer in front of vLLM (such as a Redis-backed prompt similarity lookup) reduces GPU compute and cost substantially; a sketch follows this list.
- Version pinning: K2.6 currently requires a vLLM nightly build. Pin the exact wheel hash or nightly build date in your startup script so infrastructure updates do not silently break your deployment. Log the vLLM version at startup in your monitoring pipeline.
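A minimal sketch of the semantic cache item above, using an in-memory index and a stand-in embedder; in production you would swap in a real embedding model and a Redis-backed store:

```python
import numpy as np

# Stand-in embedder (hashed bag of words); swap in a real sentence-embedding
# model in production -- this toy version only matches near-identical wording.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        q = embed(prompt)
        for vec, response in self.entries:
            if float(q @ vec) >= self.threshold:  # cosine similarity (unit vectors)
                return response  # cache hit: skip the GPU entirely
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("explain this code", "This function parses JSON from a URL...")
print(cache.get("explain this code"))   # hit: returns the cached response
print(cache.get("rewrite it in Rust"))  # miss: None, fall through to vLLM
```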
Kimi K2.6's 256K context and MoonViT vision are a good fit for agentic document pipelines and multimodal coding workflows. Spheron's H200 SXM5 and B200 SXM6 nodes give you the NVLink bandwidth to run 8-way tensor parallelism without network bottlenecks.
Rent H200 on Spheron → | Rent B200 → | View all GPU pricing →
