TokenSpeed launched on May 7, 2026, built by LightSeek Foundation with vLLM as its exclusive day-0 integration partner. On NVIDIA B200 at 100 TPS/user, LightSeek's own benchmarks show approximately 9% lower minimum latency and 11% higher throughput than TensorRT-LLM. This guide covers how to deploy it on Spheron GPU Cloud, sizing the hardware, and tuning it for coding-agent workloads with 50K+ context windows. For Kimi K2.6 specifically (TokenSpeed's primary reference model), see the Kimi K2.6 deployment guide.
TL;DR: TokenSpeed vs TensorRT-LLM vs vLLM Standard
| Engine | Min latency at 100 TPS/user (B200) | Throughput (tok/s) | MLA kernel | vLLM-compatible |
|---|---|---|---|---|
| TokenSpeed | ~9% lower than TRT-LLM | ~11% higher than TRT-LLM | Yes (custom) | Yes (native backend) |
| TensorRT-LLM | Baseline | Baseline | No native MLA | No (separate API) |
| vLLM standard runner | Varies by config | Varies by config | No (generic attention) | Yes (default) |
Source: LightSeek Foundation benchmarks, May 2026. Results at Kimi K2.6 MoE workload, 100 TPS/user, NVIDIA B200 SXM6.
What TokenSpeed Is
LightSeek Foundation released TokenSpeed as an open-source inference engine with vLLM v0.18+ as the day-0 launch platform. It is not a separate serving framework. TokenSpeed integrates as a backend runner inside vLLM's existing execution layer, the same architectural slot where Model Runner V2 (MRV2) lives. See the vLLM Model Runner V2 guide for context on how runner backends slot into the vLLM stack.
The single meaningful difference between TokenSpeed and the standard vLLM runner is a custom MLA (Multi-head Latent Attention) kernel. Standard MHA caches one key tensor and one value tensor per attention head per layer. MLA collapses these into a shared low-rank latent vector. TokenSpeed's kernel is written specifically for this data layout, reducing the per-step HBM reads that dominate decode latency at long contexts.
Activation is one environment variable: VLLM_USE_TOKENSPEED_RUNNER=1. The HTTP interface stays identical: /v1/chat/completions, /v1/models, /metrics all work without client changes.
Why Agentic Workloads Break General-Purpose Inference Engines
Coding agents and multi-step autonomous workflows have a different request profile from chat applications. Two properties make them hard for general-purpose engines:
50K+ context windows. KV cache memory grows with sequence length. At 50K tokens with standard MHA, the per-sequence KV cache on a 70B model at FP8 can exceed 30 GB. That limits concurrency severely. MLA's latent-vector compression reduces this by collapsing multiple heads' key/value pairs into a single shared vector, meaning fewer bytes per token in the cache and fewer HBM reads per decode step.
70-200 TPS/user at burst. Coding agents generate code at burst speeds during active turns, then go idle. A general-purpose engine tuned for steady-state high-concurrency short chat turns will either batch too aggressively (spiking TTFT for the agent) or under-batch (wasting GPU throughput). TokenSpeed's MLA kernel is tuned for the long-context, medium-concurrency pattern typical of agent fleets.
For real-world concurrency numbers from agent deployments, see the GPU infrastructure for AI agents guide and the 100 concurrent AI agents case study.
Benchmark Snapshot
Numbers below are from LightSeek Foundation's May 2026 release benchmarks for Kimi K2.6 on coding traces. Results may differ at other batch sizes, sequence lengths, and hardware.
| Engine | GPU | TTFT p50 | Throughput (tok/s) | Min latency at 100 TPS/user |
|---|---|---|---|---|
| TokenSpeed | B200 SXM6 | ~170 ms (32K ctx) | ~6,800 | ~9% below TRT-LLM baseline |
| TensorRT-LLM | B200 SXM6 | Baseline | Baseline | Baseline |
| vLLM standard | B200 SXM6 | Comparable | Slightly below TRT-LLM | Comparable to TRT-LLM |
| TokenSpeed | H200 SXM5 | ~200 ms (32K ctx) | ~5,200 | Smaller gap vs TRT-LLM |
The B200 gap is larger than H200 because HBM3e's 8 TB/s bandwidth aligns directly with MLA's latent-vector memory access pattern. On Hopper (H200), bandwidth is 4.8 TB/s and the gap narrows.
For the broader pre-TokenSpeed baseline comparison between vLLM, TensorRT-LLM, and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
Hardware Sizing: Which Spheron SKUs Match TokenSpeed's MLA Kernel
| Configuration | GPU | Count | VRAM | Precision | Max context | Use case |
|---|---|---|---|---|---|---|
| Minimum | H100 SXM5 | 8 | 640 GB | INT4 | 64K | Dev/testing, Kimi K2.6 INT4 |
| Recommended | H200 SXM5 | 8 | 1128 GB | FP8 | 128K | Production, cost-optimized |
| Optimal throughput | B200 SXM6 | 8 | 1536 GB | FP8 | 256K | Full agentic workloads |
B200 SXM6 instances on Spheron offer 192 GB HBM3e per GPU, which holds the full FP8 Kimi K2.6 weight set with KV cache headroom for 50K-context agentic sessions without eviction. For teams that need production-grade throughput on a tighter budget, H200 SXM5 rental at 141 GB HBM3e per GPU fits K2.6 at FP8 on a single 8-GPU node with KV cache for typical multi-turn sessions. Note: with corrected live pricing, the H100-to-H200 on-demand step-up is roughly $0.72/GPU/hr ($3.90 vs $4.62), which is small relative to the significant memory and bandwidth gain H200 provides for production K2.6 workloads.
For Blackwell-specific kernel advantages including FlashAttention-4 on B200, see the FlashAttention-4 on Blackwell guide.
Step-by-Step Deployment
Prerequisites
Before starting:
- Docker with NVIDIA runtime (
nvidia-container-toolkitinstalled). If you're starting from scratch on vLLM, read the vLLM production deployment guide first. - An 8-GPU bare-metal node (B200 SXM6 or H200 SXM5)
- At least 800 GB storage (Kimi K2.6 weights are ~630 GB compressed at FP8; buffer for OS, download cache, and containers)
- HuggingFace token with access to
moonshotai/Kimi-K2.6(request access on the model card if not already granted)
Verify CUDA and driver are ready:
nvidia-smi
# Should show 8 GPUs, all visible
docker run --rm --gpus all nvidia/cuda:12.4-base-ubuntu22.04 nvidia-smi
# Should also show 8 GPUs inside DockerProvision on Spheron
Log in at app.spheron.ai and navigate to the GPU Marketplace. Filter for B200 SXM6 or H200 SXM5 and select an 8-GPU bare-metal node. Set storage to 800 GB minimum. Choose Ubuntu 22.04 or 24.04. For the full Spheron provisioning flow, see the Spheron quick guides.
Once the instance is ready, SSH in and confirm:
nvidia-smi
# Expect: 8 GPUs listedInstall the vLLM-TokenSpeed Docker Image
Pull the official vLLM image with the TokenSpeed backend included:
docker pull vllm/vllm-openai:v0.18.0-tokenspeedVerify the TokenSpeed runner is present in the image:
docker run --rm vllm/vllm-openai:v0.18.0-tokenspeed \
python -c "from vllm.tokenspeed import TokenSpeedRunner; print('ok')"
# Expected output: okNote on image tag: The tag
v0.18.0-tokenspeedreflects the vLLM-TokenSpeed day-0 integration. Verify the exact published tag against LightSeek Foundation's GitHub release notes or Docker Hub before pulling, as tag naming may differ from this guide.
Launch the Inference Server
Full Docker run command for Kimi K2.6 on an 8-GPU node:
docker run \
--gpus all \
--ipc=host \
--shm-size=10.24gb \
-e VLLM_USE_TOKENSPEED_RUNNER=1 \
-e HF_HOME=/models \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
-p 8000:8000 \
-v /data/models:/models \
vllm/vllm-openai:v0.18.0-tokenspeed \
--model moonshotai/Kimi-K2.6 \
--tensor-parallel-size 8 \
--dtype fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--enable-prefix-cachingFlag breakdown:
VLLM_USE_TOKENSPEED_RUNNER=1: activates the TokenSpeed backend instead of the default vLLM runner--shm-size=10.24gb: allocates shared memory for NCCL all-reduce buffers across 8 GPUs; without this NCCL can hang or OOM during initHF_HOME=/models: redirects HuggingFace weight downloads into the mounted volume so weights persist across container restartsHUGGING_FACE_HUB_TOKEN: required for gated repos likemoonshotai/Kimi-K2.6; set in your shell before running--tensor-parallel-size 8: shards the model across all 8 GPUs--dtype fp8: FP8 weight precision; required to fit Kimi K2.6 on 8x H200 or 8x B200--max-model-len 131072: sets 128K context window (use 262144 for full 256K on B200)--gpu-memory-utilization 0.90: reserves 10% HBM headroom for CUDA kernels and KV cache spikes--enable-prefix-caching: caches shared system-prompt prefixes across agentic sessions
Verify Launch
Check the startup logs for the TokenSpeed confirmation:
docker logs <container_id> 2>&1 | grep -i "tokenspeed\|runner"
# Expected: TokenSpeedRunner activeTest the API endpoint:
curl http://localhost:8000/v1/models
# Expected: JSON with moonshotai/Kimi-K2.6 listedWhat --tokenspeed-mla-kernel Does
TokenSpeed's key contribution is a fused prefill-decode pass for MLA layers. In standard attention, prefill and decode are separate CUDA kernel calls. TokenSpeed fuses them into a single pass that operates on the latent KV vectors directly, reducing the number of HBM round trips per decode step. On B200 with HBM3e bandwidth, this is where the ~9% latency improvement materializes.
Note: The
--tokenspeed-mla-kernelflag may activate automatically whenVLLM_USE_TOKENSPEED_RUNNER=1is set, depending on the published implementation. Verify flag behavior against the official TokenSpeed documentation or vLLM changelog before adding it explicitly.
Speculative Decoding with TokenSpeed
According to LightSeek's release notes, TokenSpeed handles draft-token verification in the same fused pass as the main decode step. Standard vLLM speculative decoding runs a separate draft model forward pass before the main model verifies the tokens, which adds a second GPU invocation per speculative step. TokenSpeed's fused MLA kernel absorbs the verification into the main decode pass, eliminating that overhead.
To enable speculative decoding, add these flags to the Docker run command:
--speculative-model <path-to-draft-model> \
--num-speculative-tokens 4 \
--speculative-draft-tensor-parallel-size 1For guidance on draft model selection and speculative decoding strategy, see the speculative decoding production guide.
Tuning for Coding Agents
Batch Size and Concurrency
| Batch size | max-num-seqs | Agent concurrency | TTFT p50 (32K ctx) | Throughput |
|---|---|---|---|---|
| 4 | 8 | 1-4 agents | ~180 ms | ~2,800 tok/s |
| 8 | 16 | 4-8 agents | ~320 ms | ~4,600 tok/s |
| 16 | 32 | 8-16 agents | ~580 ms | ~6,800 tok/s |
Start at batch size 4 with --max-num-seqs 8 if your TTFT budget is under 500 ms for 50K context. Scale up while watching the vllm:time_to_first_token_seconds metric in Prometheus.
Prefix Caching for Coding Agents
Coding agents typically start every session with the same system prompt plus tool schema, which can reach 2,000-4,000 tokens. --enable-prefix-caching caches the KV tensors for any shared prefix and reuses them across sessions. For an agentic coding assistant where the same schema appears in every turn, this cuts TTFT significantly at scale.
Monitoring Key Metrics
Check the metrics endpoint after load:
curl localhost:8000/metrics | grep -E "ttft|tpot|throughput"Key signals to watch:
vllm:time_to_first_token_seconds: your TTFT distributionvllm:time_per_output_token_seconds: inter-token latencyvllm:gpu_cache_usage_perc: KV cache fill rate (if near 100%, increase node count or reduce concurrency)
For TTFT budgets: coding agents can tolerate ~500 ms TTFT at 32K context. TPOT is tighter, typically 15-25 ms per token for interactive feel. Both should be the primary SLO signals, not raw throughput.
Spheron Pricing for TokenSpeed Deployments
Prices fetched live from the Spheron API on 19 May 2026:
| GPU | Count | On-demand/GPU/hr | Spot/GPU/hr | 8x node/hr (on-demand) | 8x node/hr (spot) |
|---|---|---|---|---|---|
| B200 SXM6 | 8 | $7.21 | $3.81 | ~$57.68 | ~$30.48 |
| H200 SXM5 | 8 | $4.62 | $1.92 | ~$36.96 | ~$15.36 |
| H100 SXM5 | 8 | $3.90 | $1.66 | ~$31.20 | ~$13.28 |
Pricing fluctuates based on GPU availability. The prices above are based on 19 May 2026 and may have changed. Check current GPU pricing → for live rates.
For teams evaluating TokenSpeed before committing reserved capacity, Spheron's marketplace lets you spin up an 8x B200 or 8x H200 node for a few hours, run benchmark_serving.py against TensorRT-LLM or standard vLLM, and compare throughput-per-dollar across GPU generations without reserved contracts.
TokenSpeed's MLA kernel closes a real latency gap for coding-agent workloads at 50K+ context. The fastest way to test it at production scale is to spin up an 8x B200 or H200 node, run the benchmark_serving.py comparison, and make the call with your actual trace data.
Rent B200 SXM6 → | Rent H200 SXM5 → | View all GPU pricing →
Quick Setup Guide
Log into app.spheron.ai, navigate to GPU Marketplace, filter for B200 SXM6 or H200 SXM5, and select an 8-GPU bare metal node. Set storage to at least 800 GB for Kimi K2.6 weights. Choose Ubuntu 22.04 or 24.04. SSH in and run nvidia-smi to confirm 8 GPUs are visible.
Pull the official vLLM image with TokenSpeed backend: docker pull vllm/vllm-openai:v0.18.0-tokenspeed. This image includes the TokenSpeed runner and all dependencies. Verify: docker run --rm vllm/vllm-openai:v0.18.0-tokenspeed python -c "from vllm.tokenspeed import TokenSpeedRunner; print('ok')".
Run: docker run --gpus all --ipc=host --shm-size=10.24gb -e VLLM_USE_TOKENSPEED_RUNNER=1 -e HF_HOME=/models -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -p 8000:8000 -v /data/models:/models vllm/vllm-openai:v0.18.0-tokenspeed --model moonshotai/Kimi-K2.6 --tensor-parallel-size 8 --dtype fp8 --max-model-len 131072 --gpu-memory-utilization 0.90. Look for 'TokenSpeedRunner active' in startup logs.
Add --speculative-model <draft-model-path> --num-speculative-tokens 4 --speculative-draft-tensor-parallel-size 1 to the launch command. TokenSpeed's MLA kernel handles draft-token verification in the same fused pass as the main decode step, avoiding the double-forward-pass overhead that standard vLLM speculative decoding incurs.
For coding agents with repeated system prompts, enable prefix caching: add --enable-prefix-caching. Set --max-num-seqs based on agent concurrency target: use 16 for 4-8 agents, 32 for 16-32 agents. For TTFT budgets under 500 ms at 50K context, start with batch size 4 and max-num-seqs 8, then scale up while monitoring vllm:time_to_first_token_seconds in Prometheus.
Confirm the server is up: curl http://localhost:8000/v1/models. Run vLLM's benchmark_serving.py with --input-len 32768 --output-len 512 --num-prompts 50 --request-rate 8 to simulate a coding agent workload. Compare tokens/sec against your baseline TensorRT-LLM or standard vLLM run to validate the throughput gain.
Frequently Asked Questions
TokenSpeed is an open-source LLM inference engine released by LightSeek Foundation on May 7, 2026, with vLLM as the exclusive day-0 launch partner. Unlike vLLM's general-purpose runner, TokenSpeed ships a custom MLA (Multi-head Latent Attention) kernel optimized for coding and agentic workloads with 50K+ context windows. On NVIDIA B200 at 100 TPS/user, TokenSpeed delivers approximately 9% lower minimum latency and 11% higher throughput than TensorRT-LLM. It integrates into the vLLM serving stack as a backend, so existing vLLM API deployments can adopt it without changing client code.
TokenSpeed's MLA kernel is optimized for NVIDIA Blackwell (B200 SXM6) and Hopper (H200 SXM5, H100 SXM5) architectures. The B200 is the sweet spot: HBM3e bandwidth (8 TB/s) aligns with MLA's latent-vector memory access pattern, and the B200's larger HBM (192 GB per GPU) lets you hold a full 50K-context KV cache in-memory at FP8 without eviction. H200 is the cost-efficient alternative for teams that need to stay on Hopper.
Kimi K2.6 (1T parameters, 32B active, MoE architecture) requires a minimum of 8x H200 SXM5 (141 GB each, ~1128 GB total) at FP8 precision or 8x B200 SXM6 (192 GB each) at FP8. TokenSpeed's MLA kernel reduces the KV cache footprint at long context versus a naive MHA implementation, meaning you can sustain 50K-context agentic sessions on 8x H200 that would otherwise require offloading.
No. TokenSpeed integrates as a backend runner inside the vLLM serving stack. The HTTP interface (/v1/chat/completions, /v1/models, /metrics) remains fully OpenAI-compatible. Client code does not change. The only difference is the VLLM_USE_TOKENSPEED_RUNNER=1 environment variable at launch time.
TokenSpeed launched with vLLM v0.18+ integration and has been publicly available since May 7, 2026. It is currently best suited for teams running Kimi K2.6, DeepSeek V3, or Qwen3-235B on Blackwell or Hopper hardware with MLA-optimized workloads. Support for non-MLA architectures (dense Llama, Mistral) is listed as a roadmap item. Test with your target model before committing to production.
