Tutorial

Deploy TokenSpeed on GPU Cloud: Self-Host the Speed-of-Light LLM Inference Engine for Agentic Workloads (2026)

Back to BlogWritten by Mitrasish, Co-founderMay 19, 2026
deploy tokenspeedTokenSpeed GPU Cloudtokenspeed vllmagentic llm inference enginetokenspeed blackwell b200LightSeek FoundationMLA kernelGPU Cloud
Deploy TokenSpeed on GPU Cloud: Self-Host the Speed-of-Light LLM Inference Engine for Agentic Workloads (2026)

TokenSpeed launched on May 7, 2026, built by LightSeek Foundation with vLLM as its exclusive day-0 integration partner. On NVIDIA B200 at 100 TPS/user, LightSeek's own benchmarks show approximately 9% lower minimum latency and 11% higher throughput than TensorRT-LLM. This guide covers how to deploy it on Spheron GPU Cloud, sizing the hardware, and tuning it for coding-agent workloads with 50K+ context windows. For Kimi K2.6 specifically (TokenSpeed's primary reference model), see the Kimi K2.6 deployment guide.

TL;DR: TokenSpeed vs TensorRT-LLM vs vLLM Standard

EngineMin latency at 100 TPS/user (B200)Throughput (tok/s)MLA kernelvLLM-compatible
TokenSpeed~9% lower than TRT-LLM~11% higher than TRT-LLMYes (custom)Yes (native backend)
TensorRT-LLMBaselineBaselineNo native MLANo (separate API)
vLLM standard runnerVaries by configVaries by configNo (generic attention)Yes (default)

Source: LightSeek Foundation benchmarks, May 2026. Results at Kimi K2.6 MoE workload, 100 TPS/user, NVIDIA B200 SXM6.

What TokenSpeed Is

LightSeek Foundation released TokenSpeed as an open-source inference engine with vLLM v0.18+ as the day-0 launch platform. It is not a separate serving framework. TokenSpeed integrates as a backend runner inside vLLM's existing execution layer, the same architectural slot where Model Runner V2 (MRV2) lives. See the vLLM Model Runner V2 guide for context on how runner backends slot into the vLLM stack.

The single meaningful difference between TokenSpeed and the standard vLLM runner is a custom MLA (Multi-head Latent Attention) kernel. Standard MHA caches one key tensor and one value tensor per attention head per layer. MLA collapses these into a shared low-rank latent vector. TokenSpeed's kernel is written specifically for this data layout, reducing the per-step HBM reads that dominate decode latency at long contexts.

Activation is one environment variable: VLLM_USE_TOKENSPEED_RUNNER=1. The HTTP interface stays identical: /v1/chat/completions, /v1/models, /metrics all work without client changes.

Why Agentic Workloads Break General-Purpose Inference Engines

Coding agents and multi-step autonomous workflows have a different request profile from chat applications. Two properties make them hard for general-purpose engines:

50K+ context windows. KV cache memory grows with sequence length. At 50K tokens with standard MHA, the per-sequence KV cache on a 70B model at FP8 can exceed 30 GB. That limits concurrency severely. MLA's latent-vector compression reduces this by collapsing multiple heads' key/value pairs into a single shared vector, meaning fewer bytes per token in the cache and fewer HBM reads per decode step.

70-200 TPS/user at burst. Coding agents generate code at burst speeds during active turns, then go idle. A general-purpose engine tuned for steady-state high-concurrency short chat turns will either batch too aggressively (spiking TTFT for the agent) or under-batch (wasting GPU throughput). TokenSpeed's MLA kernel is tuned for the long-context, medium-concurrency pattern typical of agent fleets.

For real-world concurrency numbers from agent deployments, see the GPU infrastructure for AI agents guide and the 100 concurrent AI agents case study.

Benchmark Snapshot

Numbers below are from LightSeek Foundation's May 2026 release benchmarks for Kimi K2.6 on coding traces. Results may differ at other batch sizes, sequence lengths, and hardware.

EngineGPUTTFT p50Throughput (tok/s)Min latency at 100 TPS/user
TokenSpeedB200 SXM6~170 ms (32K ctx)~6,800~9% below TRT-LLM baseline
TensorRT-LLMB200 SXM6BaselineBaselineBaseline
vLLM standardB200 SXM6ComparableSlightly below TRT-LLMComparable to TRT-LLM
TokenSpeedH200 SXM5~200 ms (32K ctx)~5,200Smaller gap vs TRT-LLM

The B200 gap is larger than H200 because HBM3e's 8 TB/s bandwidth aligns directly with MLA's latent-vector memory access pattern. On Hopper (H200), bandwidth is 4.8 TB/s and the gap narrows.

For the broader pre-TokenSpeed baseline comparison between vLLM, TensorRT-LLM, and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

Hardware Sizing: Which Spheron SKUs Match TokenSpeed's MLA Kernel

ConfigurationGPUCountVRAMPrecisionMax contextUse case
MinimumH100 SXM58640 GBINT464KDev/testing, Kimi K2.6 INT4
RecommendedH200 SXM581128 GBFP8128KProduction, cost-optimized
Optimal throughputB200 SXM681536 GBFP8256KFull agentic workloads

B200 SXM6 instances on Spheron offer 192 GB HBM3e per GPU, which holds the full FP8 Kimi K2.6 weight set with KV cache headroom for 50K-context agentic sessions without eviction. For teams that need production-grade throughput on a tighter budget, H200 SXM5 rental at 141 GB HBM3e per GPU fits K2.6 at FP8 on a single 8-GPU node with KV cache for typical multi-turn sessions. Note: with corrected live pricing, the H100-to-H200 on-demand step-up is roughly $0.72/GPU/hr ($3.90 vs $4.62), which is small relative to the significant memory and bandwidth gain H200 provides for production K2.6 workloads.

For Blackwell-specific kernel advantages including FlashAttention-4 on B200, see the FlashAttention-4 on Blackwell guide.

Step-by-Step Deployment

Prerequisites

Before starting:

  • Docker with NVIDIA runtime (nvidia-container-toolkit installed). If you're starting from scratch on vLLM, read the vLLM production deployment guide first.
  • An 8-GPU bare-metal node (B200 SXM6 or H200 SXM5)
  • At least 800 GB storage (Kimi K2.6 weights are ~630 GB compressed at FP8; buffer for OS, download cache, and containers)
  • HuggingFace token with access to moonshotai/Kimi-K2.6 (request access on the model card if not already granted)

Verify CUDA and driver are ready:

bash
nvidia-smi
# Should show 8 GPUs, all visible
docker run --rm --gpus all nvidia/cuda:12.4-base-ubuntu22.04 nvidia-smi
# Should also show 8 GPUs inside Docker

Provision on Spheron

Log in at app.spheron.ai and navigate to the GPU Marketplace. Filter for B200 SXM6 or H200 SXM5 and select an 8-GPU bare-metal node. Set storage to 800 GB minimum. Choose Ubuntu 22.04 or 24.04. For the full Spheron provisioning flow, see the Spheron quick guides.

Once the instance is ready, SSH in and confirm:

bash
nvidia-smi
# Expect: 8 GPUs listed

Install the vLLM-TokenSpeed Docker Image

Pull the official vLLM image with the TokenSpeed backend included:

bash
docker pull vllm/vllm-openai:v0.18.0-tokenspeed

Verify the TokenSpeed runner is present in the image:

bash
docker run --rm vllm/vllm-openai:v0.18.0-tokenspeed \
  python -c "from vllm.tokenspeed import TokenSpeedRunner; print('ok')"
# Expected output: ok

Note on image tag: The tag v0.18.0-tokenspeed reflects the vLLM-TokenSpeed day-0 integration. Verify the exact published tag against LightSeek Foundation's GitHub release notes or Docker Hub before pulling, as tag naming may differ from this guide.

Launch the Inference Server

Full Docker run command for Kimi K2.6 on an 8-GPU node:

bash
docker run \
  --gpus all \
  --ipc=host \
  --shm-size=10.24gb \
  -e VLLM_USE_TOKENSPEED_RUNNER=1 \
  -e HF_HOME=/models \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  -v /data/models:/models \
  vllm/vllm-openai:v0.18.0-tokenspeed \
  --model moonshotai/Kimi-K2.6 \
  --tensor-parallel-size 8 \
  --dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

Flag breakdown:

  • VLLM_USE_TOKENSPEED_RUNNER=1: activates the TokenSpeed backend instead of the default vLLM runner
  • --shm-size=10.24gb: allocates shared memory for NCCL all-reduce buffers across 8 GPUs; without this NCCL can hang or OOM during init
  • HF_HOME=/models: redirects HuggingFace weight downloads into the mounted volume so weights persist across container restarts
  • HUGGING_FACE_HUB_TOKEN: required for gated repos like moonshotai/Kimi-K2.6; set in your shell before running
  • --tensor-parallel-size 8: shards the model across all 8 GPUs
  • --dtype fp8: FP8 weight precision; required to fit Kimi K2.6 on 8x H200 or 8x B200
  • --max-model-len 131072: sets 128K context window (use 262144 for full 256K on B200)
  • --gpu-memory-utilization 0.90: reserves 10% HBM headroom for CUDA kernels and KV cache spikes
  • --enable-prefix-caching: caches shared system-prompt prefixes across agentic sessions

Verify Launch

Check the startup logs for the TokenSpeed confirmation:

bash
docker logs <container_id> 2>&1 | grep -i "tokenspeed\|runner"
# Expected: TokenSpeedRunner active

Test the API endpoint:

bash
curl http://localhost:8000/v1/models
# Expected: JSON with moonshotai/Kimi-K2.6 listed

What --tokenspeed-mla-kernel Does

TokenSpeed's key contribution is a fused prefill-decode pass for MLA layers. In standard attention, prefill and decode are separate CUDA kernel calls. TokenSpeed fuses them into a single pass that operates on the latent KV vectors directly, reducing the number of HBM round trips per decode step. On B200 with HBM3e bandwidth, this is where the ~9% latency improvement materializes.

Note: The --tokenspeed-mla-kernel flag may activate automatically when VLLM_USE_TOKENSPEED_RUNNER=1 is set, depending on the published implementation. Verify flag behavior against the official TokenSpeed documentation or vLLM changelog before adding it explicitly.

Speculative Decoding with TokenSpeed

According to LightSeek's release notes, TokenSpeed handles draft-token verification in the same fused pass as the main decode step. Standard vLLM speculative decoding runs a separate draft model forward pass before the main model verifies the tokens, which adds a second GPU invocation per speculative step. TokenSpeed's fused MLA kernel absorbs the verification into the main decode pass, eliminating that overhead.

To enable speculative decoding, add these flags to the Docker run command:

bash
  --speculative-model <path-to-draft-model> \
  --num-speculative-tokens 4 \
  --speculative-draft-tensor-parallel-size 1

For guidance on draft model selection and speculative decoding strategy, see the speculative decoding production guide.

Tuning for Coding Agents

Batch Size and Concurrency

Batch sizemax-num-seqsAgent concurrencyTTFT p50 (32K ctx)Throughput
481-4 agents~180 ms~2,800 tok/s
8164-8 agents~320 ms~4,600 tok/s
16328-16 agents~580 ms~6,800 tok/s

Start at batch size 4 with --max-num-seqs 8 if your TTFT budget is under 500 ms for 50K context. Scale up while watching the vllm:time_to_first_token_seconds metric in Prometheus.

Prefix Caching for Coding Agents

Coding agents typically start every session with the same system prompt plus tool schema, which can reach 2,000-4,000 tokens. --enable-prefix-caching caches the KV tensors for any shared prefix and reuses them across sessions. For an agentic coding assistant where the same schema appears in every turn, this cuts TTFT significantly at scale.

Monitoring Key Metrics

Check the metrics endpoint after load:

bash
curl localhost:8000/metrics | grep -E "ttft|tpot|throughput"

Key signals to watch:

  • vllm:time_to_first_token_seconds: your TTFT distribution
  • vllm:time_per_output_token_seconds: inter-token latency
  • vllm:gpu_cache_usage_perc: KV cache fill rate (if near 100%, increase node count or reduce concurrency)

For TTFT budgets: coding agents can tolerate ~500 ms TTFT at 32K context. TPOT is tighter, typically 15-25 ms per token for interactive feel. Both should be the primary SLO signals, not raw throughput.

Spheron Pricing for TokenSpeed Deployments

Prices fetched live from the Spheron API on 19 May 2026:

GPUCountOn-demand/GPU/hrSpot/GPU/hr8x node/hr (on-demand)8x node/hr (spot)
B200 SXM68$7.21$3.81~$57.68~$30.48
H200 SXM58$4.62$1.92~$36.96~$15.36
H100 SXM58$3.90$1.66~$31.20~$13.28

Pricing fluctuates based on GPU availability. The prices above are based on 19 May 2026 and may have changed. Check current GPU pricing → for live rates.

For teams evaluating TokenSpeed before committing reserved capacity, Spheron's marketplace lets you spin up an 8x B200 or 8x H200 node for a few hours, run benchmark_serving.py against TensorRT-LLM or standard vLLM, and compare throughput-per-dollar across GPU generations without reserved contracts.


TokenSpeed's MLA kernel closes a real latency gap for coding-agent workloads at 50K+ context. The fastest way to test it at production scale is to spin up an 8x B200 or H200 node, run the benchmark_serving.py comparison, and make the call with your actual trace data.

Rent B200 SXM6 → | Rent H200 SXM5 → | View all GPU pricing →

STEPS / 06

Quick Setup Guide

  1. Launch an 8x B200 or 8x H200 instance on Spheron

    Log into app.spheron.ai, navigate to GPU Marketplace, filter for B200 SXM6 or H200 SXM5, and select an 8-GPU bare metal node. Set storage to at least 800 GB for Kimi K2.6 weights. Choose Ubuntu 22.04 or 24.04. SSH in and run nvidia-smi to confirm 8 GPUs are visible.

  2. Install the vLLM-TokenSpeed Docker image

    Pull the official vLLM image with TokenSpeed backend: docker pull vllm/vllm-openai:v0.18.0-tokenspeed. This image includes the TokenSpeed runner and all dependencies. Verify: docker run --rm vllm/vllm-openai:v0.18.0-tokenspeed python -c "from vllm.tokenspeed import TokenSpeedRunner; print('ok')".

  3. Load the model and start the inference server

    Run: docker run --gpus all --ipc=host --shm-size=10.24gb -e VLLM_USE_TOKENSPEED_RUNNER=1 -e HF_HOME=/models -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -p 8000:8000 -v /data/models:/models vllm/vllm-openai:v0.18.0-tokenspeed --model moonshotai/Kimi-K2.6 --tensor-parallel-size 8 --dtype fp8 --max-model-len 131072 --gpu-memory-utilization 0.90. Look for 'TokenSpeedRunner active' in startup logs.

  4. Configure speculative decoding for coding agents

    Add --speculative-model <draft-model-path> --num-speculative-tokens 4 --speculative-draft-tensor-parallel-size 1 to the launch command. TokenSpeed's MLA kernel handles draft-token verification in the same fused pass as the main decode step, avoiding the double-forward-pass overhead that standard vLLM speculative decoding incurs.

  5. Tune batch size and KV cache for long-prefix workloads

    For coding agents with repeated system prompts, enable prefix caching: add --enable-prefix-caching. Set --max-num-seqs based on agent concurrency target: use 16 for 4-8 agents, 32 for 16-32 agents. For TTFT budgets under 500 ms at 50K context, start with batch size 4 and max-num-seqs 8, then scale up while monitoring vllm:time_to_first_token_seconds in Prometheus.

  6. Verify deployment and run a benchmark

    Confirm the server is up: curl http://localhost:8000/v1/models. Run vLLM's benchmark_serving.py with --input-len 32768 --output-len 512 --num-prompts 50 --request-rate 8 to simulate a coding agent workload. Compare tokens/sec against your baseline TensorRT-LLM or standard vLLM run to validate the throughput gain.

FAQ / 05

Frequently Asked Questions

TokenSpeed is an open-source LLM inference engine released by LightSeek Foundation on May 7, 2026, with vLLM as the exclusive day-0 launch partner. Unlike vLLM's general-purpose runner, TokenSpeed ships a custom MLA (Multi-head Latent Attention) kernel optimized for coding and agentic workloads with 50K+ context windows. On NVIDIA B200 at 100 TPS/user, TokenSpeed delivers approximately 9% lower minimum latency and 11% higher throughput than TensorRT-LLM. It integrates into the vLLM serving stack as a backend, so existing vLLM API deployments can adopt it without changing client code.

TokenSpeed's MLA kernel is optimized for NVIDIA Blackwell (B200 SXM6) and Hopper (H200 SXM5, H100 SXM5) architectures. The B200 is the sweet spot: HBM3e bandwidth (8 TB/s) aligns with MLA's latent-vector memory access pattern, and the B200's larger HBM (192 GB per GPU) lets you hold a full 50K-context KV cache in-memory at FP8 without eviction. H200 is the cost-efficient alternative for teams that need to stay on Hopper.

Kimi K2.6 (1T parameters, 32B active, MoE architecture) requires a minimum of 8x H200 SXM5 (141 GB each, ~1128 GB total) at FP8 precision or 8x B200 SXM6 (192 GB each) at FP8. TokenSpeed's MLA kernel reduces the KV cache footprint at long context versus a naive MHA implementation, meaning you can sustain 50K-context agentic sessions on 8x H200 that would otherwise require offloading.

No. TokenSpeed integrates as a backend runner inside the vLLM serving stack. The HTTP interface (/v1/chat/completions, /v1/models, /metrics) remains fully OpenAI-compatible. Client code does not change. The only difference is the VLLM_USE_TOKENSPEED_RUNNER=1 environment variable at launch time.

TokenSpeed launched with vLLM v0.18+ integration and has been publicly available since May 7, 2026. It is currently best suited for teams running Kimi K2.6, DeepSeek V3, or Qwen3-235B on Blackwell or Hopper hardware with MLA-optimized workloads. Support for non-MLA architectures (dense Llama, Mistral) is listed as a roadmap item. Test with your target model before committing to production.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.