LMDeploy is the third major open-source LLM serving framework alongside vLLM and TensorRT-LLM, and it's the one most teams overlook. Its TurboMind engine is a C++/CUDA inference backend built by the Shanghai AI Laboratory, the same team behind InternLM. If you've been benchmarking vLLM vs TensorRT-LLM vs SGLang and wondering what else is out there, LMDeploy belongs in that conversation, especially for AWQ INT4 and MXFP4 workloads where its native C++ kernel paths produce measurable throughput gains. For production TensorRT-LLM deployment context, see the TensorRT-LLM production deployment guide. Full documentation is at docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy.
TL;DR
| Engine | Architecture | Best For | MXFP4 | AWQ INT4 | v0.12.3 Feature |
|---|---|---|---|---|---|
| LMDeploy TurboMind | C++/CUDA | Mixed-precision, InternLM, DeepSeek | V100-B200 | Native | PD disaggregation via DLSlime |
| vLLM | Python/C++ | Broad model support | Blackwell only | Native | MRV2, gRPC |
| TensorRT-LLM | Compiled CUDA | Max throughput | Blackwell FP4 | Via modelopt | B300 support |
| SGLang | Python/C++ | Shared prefix, RAG | Blackwell only | Native | NSA backend |
Use LMDeploy when:
- You're running AWQ INT4 or MXFP4 on pre-Blackwell hardware such as H100 (Hopper) or V100 (Volta), not just on Blackwell
- Your workload centers on InternLM, Qwen3, or DeepSeek model families
- You need prefill-decode disaggregation without adopting NVIDIA Dynamo's full orchestration layer
- You want 1.5x vLLM throughput on GPT-OSS 120B MXFP4 without a TensorRT-LLM compilation step
What Is LMDeploy and the TurboMind Engine
TurboMind started as the inference backend for InternLM before being open-sourced as the core of the LMDeploy project. The key architectural difference from vLLM is where the heavy work lives.
vLLM's design is Python-first: the scheduling loop, batching logic, and dispatch to CUDA kernels happen in Python (or Python-managed Paged Attention extensions). This works well for broad model support but adds overhead when switching between quantization modes or dispatching to specialized INT4 and FP4 kernels.
TurboMind is C++ all the way down. Continuous batching, the paged KV cache, and the quantization kernels for AWQ INT4 and MXFP4 are all implemented in C++ with hand-tuned CUDA. When you call an AWQ model, TurboMind dispatches directly to a native INT4 decode kernel without a Python-managed dispatch step in the hot path. That is where the throughput gain comes from.
The dual-backend design gives you an escape hatch: TurboMind handles anything in its kernel catalog (InternLM, LLaMA family, Qwen, Mistral, Baichuan, DeepSeek, GPT-OSS). For models outside that list, LMDeploy falls back to a PyTorch backend automatically. Throughput is lower in fallback mode, but you get correctness without needing a different deployment tool.
The Sept 2025 v0.10.0 release extended MXFP4 support from Blackwell-only to the full CUDA GPU range from V100 through B200. On Blackwell (B200, B300, RTX PRO 6000), MXFP4 uses native FP4 tensor cores. On Hopper and Volta (H100, V100), it runs in software-emulated mode: you still get the memory bandwidth savings from 4-bit weights, but without native FP4 matrix op acceleration.
When LMDeploy Wins: Mixed-Precision Workloads
AWQ + KV-cache INT8
TurboMind's AWQ kernels avoid the dequantize-to-FP16 step that slows Python-dispatched frameworks. In vLLM's Python-dispatch path, AWQ INT4 weights are unpacked to FP16 in a preprocessing kernel before the matmul. TurboMind skips that step: the INT4 decode kernel works directly on packed weights, reducing memory reads and eliminating a separate compute pass.
The memory layout matters too. TurboMind stores AWQ weights in a blocked INT4 layout with scale factors interleaved at the granularity TurboMind's CUDA kernels expect. This avoids register pressure from layout conversion at inference time.
For KV cache, TurboMind supports INT8 KV cache quantization independently of weight quantization. You can run BF16 weights with INT8 KV cache to extend effective context length, or combine AWQ INT4 weights with INT8 KV cache for maximum VRAM efficiency. For the full AWQ deployment workflow, see the AWQ quantization guide.
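To get a feel for how much context an INT8 KV cache buys, here is a minimal sizing sketch. The layer count, KV head count, and head dimension are illustrative values for a hypothetical 7B-class GQA model, not figures from a specific checkpoint. The 40 GiB budget is likewise assumed, and the calculation ignores the small scale and zero-point overhead that a quantized cache carries.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(budget_gib: float, bytes_per_token: int) -> int:
    """Number of cached tokens that fit in a VRAM budget given in GiB."""
    return int(budget_gib * 1024**3) // bytes_per_token


# Hypothetical model shape (assumption, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
int8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

budget_gib = 40  # VRAM left for KV cache after weights (assumed)
print(f"FP16 KV: {fp16_per_token} B/token, "
      f"{tokens_that_fit(budget_gib, fp16_per_token):,} tokens")
print(f"INT8 KV: {int8_per_token} B/token, "
      f"{tokens_that_fit(budget_gib, int8_per_token):,} tokens")
```

Halving the bytes per cached value doubles the number of tokens that fit in the same budget, which is the "extended effective context length" effect described above. On the LMDeploy CLI, INT8 KV cache is selected with --quant-policy 8.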
MXFP4 on V100 through B200
The v0.10.0 MXFP4 extension (Sept 2025) is the most practically significant feature for teams running H100 clusters. Before this release, MXFP4 serving was Blackwell-only across all major frameworks. TurboMind now brings MXFP4 to the installed H100 and H200 base.
On H100, you get the memory bandwidth savings from 4-bit weights (models that needed 80GB BF16 now fit in ~20GB) without native FP4 tensor core acceleration. The throughput improvement relative to BF16 comes from memory bandwidth, not compute: TurboMind reads fewer bytes per token from VRAM. On B200, you get both the memory savings and native FP4 tensor core execution.
The 1.5x vLLM throughput figure for GPT-OSS 120B MXFP4 comes from the LMDeploy v0.10.0 release benchmarks, which were run on H800 rather than H100. The H800 matches the H100 in compute and memory bandwidth (it differs mainly in reduced NVLink bandwidth, which does not affect single-GPU serving), so the figure should carry over. For the MXFP4 quantization workflow and the distinction between MXFP4 and NVFP4, see the MXFP4 quantization guide.
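The memory arithmetic behind the "80 GB BF16 now fits in ~20 GB" claim can be sketched in a few lines. This is a back-of-the-envelope estimate only: the helper names are hypothetical, the 40B parameter count is an illustrative stand-in, and the result ignores quantization scales, unquantized layers, KV cache, and activations.

```python
import math


def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate stored weight size in GB (decimal) at a given bit width."""
    return params_billions * bits_per_param / 8


def min_gpus_for_weights(weights_gb: float, vram_gb_per_gpu: float) -> int:
    """Lower bound on GPU count from weight capacity alone."""
    return math.ceil(weights_gb / vram_gb_per_gpu)


# Illustrative 40B-parameter model: 16-bit vs 4-bit weights.
bf16_gb = weight_gb(40, 16)
fp4_gb = weight_gb(40, 4)
print(f"40B model: {bf16_gb:.0f} GB at BF16, {fp4_gb:.0f} GB at 4-bit")

# The same arithmetic reproduces the DeepSeek-V3 figure used later in this guide.
deepseek_v3_int4_gb = weight_gb(671, 4)
print(f"DeepSeek-V3 at 4-bit: {deepseek_v3_int4_gb:.1f} GB, "
      f"at least {min_gpus_for_weights(deepseek_v3_int4_gb, 80)} x 80 GB GPUs")
```

Treat the GPU count as a floor. KV cache headroom and tensor-parallel constraints usually push the practical number higher.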
InternLM-family First-class Support
TurboMind was built by the InternLM team, which means InternLM-specific architecture patterns are native to the kernel implementation: RoPE scaling, grouped-query attention (GQA), and InternLM2's specific attention pattern all have optimized kernel paths. There are no monkey-patches or adapter layers required.
For teams running InternLM2.5 7B, 20B, or InternLM3 models, this translates to lower TTFT than vLLM (108 ms vs 120 ms p50 at concurrency 10 in the benchmark section below), because vLLM serves InternLM through a generic LLaMA-family path.
Hardware Sizing on Spheron GPU Cloud
| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) | LMDeploy Use Case |
|---|---|---|---|---|
| H100 SXM5 | 80 GB | $5.07 | $2.91 | InternLM-7B to 20B BF16, DeepSeek 7B AWQ |
| H200 SXM5 | 141 GB | $5.92 | $1.40 | 70B BF16 single-GPU, large KV cache decode nodes |
| B200 SXM6 | 192 GB | $7.00 | $2.14 | MXFP4 native acceleration, GPT-OSS 120B MXFP4 |
| RTX PRO 6000 | 96 GB | $1.70 | $1.32 | InternLM 20B BF16 + AWQ, budget MXFP4 |
| A100 SXM4 | 80 GB | $1.64 | N/A | InternLM 7B, Qwen3 8B, AWQ INT4 cost-optimized |
Pricing fluctuates based on GPU availability. The prices above are as of 01 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Install and Dependency Setup
CUDA requirements: CUDA 11.8+ for TurboMind base, CUDA 12.4+ recommended for MXFP4. Driver 520+ is enough for CUDA 11.8; CUDA 12.x needs driver 525+.
# On a fresh Spheron GPU instance
nvidia-smi # verify driver 525+ for CUDA 12
# Install LMDeploy with all extras (TurboMind C++ backend included)
pip install "lmdeploy[all]"
# Verify TurboMind backend
python -c "from lmdeploy import turbomind; print('TurboMind OK')"
# Docker alternative (recommended for production)
docker pull openmmlab/lmdeploy:latest-cuda12.4
For the Spheron quick-start including GPU instance provisioning, see docs.spheron.ai/quick-start.
Deploy InternLM, Qwen3, and DeepSeek with TurboMind
InternLM2.5-7B (BF16, single H100)
Requires ~15 GB VRAM. One H100 SXM5 handles this with plenty of room for KV cache.
lmdeploy serve api_server \
internlm/internlm2_5-7b-chat \
--backend turbomind \
--server-port 23333 \
--tp 1
Test the endpoint:
curl http://localhost:23333/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm/internlm2_5-7b-chat",
"messages": [{"role": "user", "content": "Explain paged KV cache in 2 sentences."}]
}'
Qwen3-8B (AWQ INT4, single H100)
AWQ drops the 8B model to roughly 5 to 6 GB of stored weights. On an H100, this gives you substantial KV cache headroom for long contexts.
lmdeploy serve api_server \
Qwen/Qwen3-8B-AWQ \
--backend turbomind \
--model-format awq \
--server-port 23333
DeepSeek-V3 (AWQ, 8x H100 tensor parallel)
DeepSeek-V3 is a 671B-parameter MoE model. At AWQ INT4 (4 bits per parameter) the stored weights occupy roughly 335 GB (671B × 4 / 8). Five H100 SXM5s (400 GB combined VRAM) would cover the weights on capacity alone, but the tensor-parallel degree has to divide the model's attention heads evenly, and LMDeploy expects a power of two. The practical H100 configuration is therefore eight GPUs (640 GB combined) with --tp 8. Alternatively, four H200 SXM5s (564 GB combined) with --tp 4 also provide sufficient headroom. For DeepSeek-specific configuration, see docs.spheron.ai/quick-guides/llms/deepseek-r1/.
lmdeploy serve api_server \
deepseek-ai/DeepSeek-V3-AWQ \
--backend turbomind \
--model-format awq \
--tp 8 \
--server-port 23333
Deploy GPT-OSS with MXFP4: 1.5x vLLM Throughput
GPT-OSS 120B at MXFP4 fits on a single H100 SXM5 (quantized weights drop from ~240 GB BF16 to ~60 GB). TurboMind's native MXFP4 kernel path is what drives the throughput advantage over vLLM for this workload.
The openai/gpt-oss-120b checkpoint ships with MXFP4 weights, so LMDeploy should pick up the format from the checkpoint itself and no weight-quantization flag is needed. (--quant-policy controls KV cache quantization, 4 for INT4 and 8 for INT8, not the weight format.) On H100, MXFP4 uses software-emulated mixed-precision paths; on B200, it triggers native FP4 tensor core acceleration.
# GPT-OSS 120B with MXFP4 on H100/H200/B200
lmdeploy serve api_server \
openai/gpt-oss-120b \
--backend turbomind \
--tp 1 \
--server-port 23333
Throughput comparison on identical prompts (512 input / 256 output tokens):
| GPU | Precision | LMDeploy tok/s | vLLM tok/s | Ratio |
|---|---|---|---|---|
| H100 SXM5 | MXFP4 | ~1,950 | ~1,300 | 1.5x |
| H200 SXM5 | MXFP4 | ~2,600 | ~1,750 | 1.49x |
| B200 SXM6 | MXFP4 (native FP4) | ~4,200 | ~2,800 | 1.5x |
Based on LMDeploy's published benchmarks and H800/H100 test environments. These are baseline figures; actual results vary with batch size, context length, and sampling parameters. For the full GPT-OSS setup guide including SGLang and Ollama comparisons, see the GPT-OSS deployment guide.
Serving API: OpenAI-Compatible Endpoint, restful_api, and Pipeline Modes
Three serving modes cover different deployment patterns:
1. api_server (OpenAI-compatible REST, production)
The standard production path. Exposes /v1/chat/completions, /v1/completions, and /v1/embeddings.
lmdeploy serve api_server \
internlm/internlm2_5-7b-chat \
--backend turbomind \
--server-port 23333
curl http://localhost:23333/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm/internlm2_5-7b-chat",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
2. Python client (programmatic, batch requests)
from lmdeploy.serve.openai.api_client import APIClient

client = APIClient("http://localhost:23333")
for item in client.chat_completions_v1(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Hello"}],
):
    print(item)
3. Pipeline mode (batch offline)
Best for batch scoring, synthetic data generation, and evaluation runs where you don't need an HTTP server. The pipeline interface processes lists of prompts in a single call.
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(max_batch_size=32)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
responses = pipe(["What is MXFP4?", "Explain paged KV cache.", "What is AWQ?"])
for resp in responses:
    print(resp.text)
For full API reference, see docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy.
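Because the api_server endpoint follows the OpenAI chat-completions shape, it can also be called from Python with nothing but the standard library. The sketch below mirrors the curl request from mode 1. The helper names are hypothetical, and the response parsing assumes the standard OpenAI response layout (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example (requires a running server on port 23333):
# print(chat("http://localhost:23333", "internlm/internlm2_5-7b-chat", "Hello"))
```

Splitting request construction from sending keeps the payload easy to inspect or log before any network call is made.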
DeepSeek Prefill-Decode Disaggregation with DLSlime and Mooncake
LMDeploy v0.12 added prefill-decode (PD) disaggregation via two external components:
- DLSlime: manages KV cache transfer between prefill and decode nodes over RDMA or TCP
- Mooncake: the distributed memory layer providing disaggregated KV cache routing and scheduling
The architecture:
[Client] -> [LMDeploy Router] -> [Prefill nodes: H100/B200] --(KV via DLSlime)--> [Decode nodes: H200]
DLSlime and Mooncake are not bundled with the default pip install lmdeploy. Install them separately:
pip install dlslime mooncake-transfer-engine
Both require RDMA-capable network interfaces (InfiniBand or RoCE) for production deployments. A TCP fallback exists but adds latency; for latency-sensitive workloads, provision RDMA-capable Spheron instances.
Minimal 2-node setup (1 prefill H100 + 1 decode H200):
# Step 1: Start the proxy/router (handles routing between prefill and decode nodes)
lmdeploy serve proxy \
--server-name 0.0.0.0 \
--server-port 23333 \
--routing-strategy min_expected_latency \
--serving-strategy DistServe
# Step 2: On prefill node (H100) - registers with proxy on startup
lmdeploy serve api_server \
deepseek-ai/DeepSeek-V3-AWQ \
--backend pytorch \
--model-format awq \
--server-port 23334 \
--role Prefill \
--proxy-url http://<router-ip>:23333
# Step 3: On decode node (H200) - registers with proxy on startup
lmdeploy serve api_server \
deepseek-ai/DeepSeek-V3-AWQ \
--backend pytorch \
--model-format awq \
--server-port 23335 \
--role Decode \
--proxy-url http://<router-ip>:23333
PD disaggregation uses the PyTorch backend (--backend pytorch). The --role Prefill / --role Decode flags on the standard api_server subcommand replace the older prefill_worker / decode_worker subcommands, and lmdeploy serve proxy --serving-strategy DistServe replaces the old router subcommand.
For background on why prefill and decode have opposite GPU requirements, see the prefill-decode disaggregation guide. For an alternative orchestration layer that sits on top of vLLM, see the NVIDIA Dynamo disaggregated inference guide. For Kubernetes-native disaggregated serving, see the LLM-D Kubernetes disaggregated inference guide.
Benchmarks: LMDeploy vs vLLM vs TensorRT-LLM vs SGLang
Identical hardware: H100 SXM5 80GB, InternLM2.5-7B BF16, 512 input / 256 output tokens. These figures are based on LMDeploy's published benchmarks and scaling from the vLLM/SGLang benchmark data in the vLLM vs TensorRT-LLM vs SGLang benchmarks post. Treat them as directional, not independently reproduced on Spheron hardware.
| Concurrency | LMDeploy TurboMind | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| 1 req | 135 tok/s | 120 tok/s | 130 tok/s | 125 tok/s |
| 10 req | 720 tok/s | 650 tok/s | 710 tok/s | 680 tok/s |
| 50 req | 1,980 tok/s | 1,850 tok/s | 2,100 tok/s | 1,920 tok/s |
| 100 req | 2,550 tok/s | 2,400 tok/s | 2,780 tok/s | 2,460 tok/s |
LMDeploy TurboMind beats vLLM at all concurrency levels tested. TensorRT-LLM still leads at high concurrency due to compiled kernel graphs, but TurboMind closes much of the gap without a 28-minute compilation step. The TRT-LLM advantage is most pronounced at concurrency 50+ where compiled CUDA graph execution and persistent kernel state become dominant.
TTFT at concurrency 10 (p50 / p95, ms):
| Engine | p50 TTFT | p95 TTFT |
|---|---|---|
| LMDeploy TurboMind | 108 ms | 182 ms |
| vLLM | 120 ms | 195 ms |
| TensorRT-LLM | 105 ms | 170 ms |
| SGLang | 112 ms | 178 ms |
TurboMind has the second-lowest p50 TTFT after TensorRT-LLM, while at p95 SGLang edges it out (178 ms vs 182 ms). Both figures matter for interactive use cases where every millisecond of first-token latency is user-visible.
Cost per Million Tokens on Spheron GPU Cloud
Formula: cost_per_million = (price_per_hour / throughput_toks_per_sec / 3600) * 1_000_000
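Throughput inputs: the H100 rows use InternLM2.5-7B BF16 at concurrency 50 (1,980 tok/s from the benchmark table above); the H200 and A100 rows use the throughput figures listed in their own rows.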
| GPU | Mode | Price/hr | Throughput | $/M tokens |
|---|---|---|---|---|
| H100 SXM5 | On-demand | $5.07 | 1,980 tok/s | $0.71 |
| H100 SXM5 | Spot | $2.91 | 1,980 tok/s | $0.41 |
| H200 SXM5 | On-demand | $5.92 | 2,600 tok/s | $0.63 |
| A100 SXM4 | On-demand | $1.64 | 1,100 tok/s (7B AWQ) | $0.41 |
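The formula above translates directly into a small helper, shown here as a sketch that reproduces the table rows from their price and throughput columns. The function name is hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    cost_per_token = price_per_hour / tokens_per_second / 3600
    return cost_per_token * 1_000_000


# (label, $/hr, tok/s) taken from the table above.
rows = [
    ("H100 SXM5 on-demand", 5.07, 1980),
    ("H100 SXM5 spot", 2.91, 1980),
    ("H200 SXM5 on-demand", 5.92, 2600),
    ("A100 SXM4 on-demand", 1.64, 1100),
]

for label, price, throughput in rows:
    print(f"{label}: ${cost_per_million_tokens(price, throughput):.2f} per million tokens")
```

Because cost scales inversely with throughput, the figure only holds while the GPU is kept busy at the stated token rate; idle time raises the effective cost.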
For comparison, AWS g5.12xlarge (4x A10G, 96 GB total) runs roughly $5.67/hr on-demand for a comparable 4-GPU configuration, with lower per-GPU VRAM bandwidth than H100. Azure NC A100 v4 series (single A100 80GB) runs approximately $3.67/hr on-demand list price. Spheron provides bare-metal access with no hypervisor overhead, which adds another 5-15% throughput advantage on top of these price differences.
For a broader cost-per-token comparison across GPU models and quantization formats, see AI inference cost economics 2026 and GPU cloud pricing comparison 2026.
Production Checklist: Monitoring, Autoscaling, Fallback
Monitoring
LMDeploy exposes a Prometheus endpoint at /metrics. Key metrics to watch:
- lmdeploy_requests_total: total requests served (counter, rate gives req/s)
- lmdeploy_kv_cache_usage: KV cache fill percentage (alert above 90%)
- lmdeploy_requests_waiting: queue depth (alert above N to trigger scale-out)
For full Prometheus and Grafana setup for GPU inference, see the LLM observability guide.
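As a rough illustration of the alert rules above, the sketch below parses a Prometheus text payload and checks the two thresholds. The metric names are the ones listed in this section, and the sample payload is made up. It assumes lmdeploy_kv_cache_usage is reported as a 0-1 fraction and uses a placeholder queue threshold of 20. The parser is deliberately simplified: it ignores labels and does not handle optional timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {metric_name: value}."""
    metrics: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop any {label="..."} suffix
        metrics[name] = float(value)  # simplification: last series wins
    return metrics


def fired_alerts(metrics: dict[str, float],
                 kv_threshold: float = 0.90,
                 queue_threshold: float = 20) -> list[str]:
    """Return the names of the alert rules that the current values trip."""
    alerts = []
    if metrics.get("lmdeploy_kv_cache_usage", 0.0) > kv_threshold:
        alerts.append("kv_cache_high")
    if metrics.get("lmdeploy_requests_waiting", 0.0) > queue_threshold:
        alerts.append("queue_deep")
    return alerts


# Made-up sample of what GET /metrics might return.
SAMPLE = """\
# TYPE lmdeploy_requests_total counter
lmdeploy_requests_total 1520
lmdeploy_kv_cache_usage 0.93
lmdeploy_requests_waiting 7
"""

sample_metrics = parse_metrics(SAMPLE)
print(fired_alerts(sample_metrics))
```

In production these rules would normally live in Prometheus alerting rules rather than application code; the sketch is only meant to make the thresholds concrete.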
Autoscaling
LMDeploy's api_server is stateless at the HTTP layer. Scale-out means launching additional LMDeploy instances, each backed by one or more GPUs. A practical autoscaling setup with KEDA:
- Expose lmdeploy_requests_waiting via Prometheus adapter
- Configure a KEDA ScaledObject targeting queue depth (e.g., 1 replica per 20 queued requests)
- Each new replica is a new lmdeploy serve api_server container with its own GPU allocation
- Route via nginx or a service mesh with round-robin load balancing
For KEDA + Prometheus GPU autoscaling configuration, see the KEDA GPU autoscaling guide.
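The "1 replica per 20 queued requests" rule reduces to a ceiling division with bounds. KEDA and the Kubernetes autoscaler compute this for you; the sketch below only makes the arithmetic explicit so you can sanity-check a target value. The function name and the min/max replica bounds are assumptions for illustration.

```python
import math


def desired_replicas(queued_requests: int,
                     requests_per_replica: int = 20,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Replica count for a queue-depth target, clamped to [min, max]."""
    wanted = math.ceil(queued_requests / requests_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 20, 21, 45, 1000):
    print(f"{depth:>4} queued -> {desired_replicas(depth)} replica(s)")
```

The upper bound matters for GPU workloads in particular, since each extra replica reserves at least one whole GPU.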
Fallback to PyTorch Engine
TurboMind has a defined model support list (InternLM, LLaMA/LLaMA-2/LLaMA-3, Mistral, Qwen, Baichuan, DeepSeek, GPT-OSS). For model architectures outside this list, LMDeploy falls back to the PyTorch backend automatically:
lmdeploy serve api_server \
your-custom-model \
--backend pytorch \
--server-port 23333
No code change required. The PyTorch backend has lower throughput than TurboMind but maintains correctness for any HuggingFace-compatible decoder-only model. Use it as a compatibility layer while you wait for upstream TurboMind kernel support.
LMDeploy's TurboMind engine runs best on bare-metal GPU access where C++/CUDA kernel scheduling isn't competing with hypervisor overhead. Spheron provides on-demand H100 and H200 instances with no virtualization layer and per-minute billing.
Quick Setup Guide
1. Launch an H100 SXM5 or H200 SXM5 on Spheron from app.spheron.ai. For MXFP4 inference on Blackwell hardware, choose a B200 SXM6. SSH into the instance and verify GPU access with nvidia-smi.
2. Install via pip: pip install lmdeploy, or pip install "lmdeploy[all]" for all extras. Verify with: lmdeploy --version. For Docker, pull openmmlab/lmdeploy:latest-cuda12.4.
3. Run: lmdeploy serve api_server internlm/internlm2_5-7b-chat --backend turbomind --server-port 23333. For AWQ: add --model-format awq. For INT8 KV cache: add --quant-policy 8. Checkpoints that already ship MXFP4 weights, such as openai/gpt-oss-120b, need no extra flag.
4. Test the endpoint: curl http://localhost:23333/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "Hello"}]}'
5. LMDeploy exposes Prometheus metrics at /metrics. Watch lmdeploy_requests_waiting for queue depth and lmdeploy_kv_cache_usage for memory pressure. Add a health check at /health and configure an nginx reverse proxy with rate limiting.
Frequently Asked Questions
LMDeploy is an open-source LLM serving framework built on TurboMind, a C++/CUDA inference engine. Unlike vLLM which uses a Python-first architecture, TurboMind implements continuous batching, paged KV cache, and quantization kernels directly in C++ with hand-tuned CUDA kernels. The result is lower decode latency and higher throughput for mixed-precision workloads - particularly AWQ INT4 and MXFP4 - where TurboMind's native kernel paths outperform vLLM's Python-dispatched equivalents.
LMDeploy v0.10.0 added MXFP4 support for NVIDIA GPUs from V100 through B200 (and B300 Blackwell Ultra). Hardware-accelerated FP4 tensor cores are available only on Blackwell (B200, B300, RTX 5090, RTX PRO 6000). On Hopper and Volta GPUs, MXFP4 runs in software-emulated mode, still providing memory bandwidth savings but without native FP4 matrix op acceleration.
LMDeploy is OpenAI API compatible: its api_server exposes /v1/chat/completions, /v1/completions, and /v1/embeddings. Existing client code that targets the OpenAI API works once you point base_url at the LMDeploy server and set api_key; no other changes are needed.
LMDeploy v0.12 integrates DLSlime and Mooncake for KV cache transfer between prefill and decode nodes. Prefill-dense nodes (H100, B200) handle prompt processing; decode nodes (H200 for large KV cache capacity) handle token generation. DLSlime manages the inter-node KV cache transfer, while Mooncake provides the distributed memory layer for efficient cache routing.
InternLM, DeepSeek, and Qwen3 are all supported. LMDeploy was developed by the Shanghai AI Laboratory team that also maintains InternLM, so InternLM model support is native and first-class. DeepSeek V3 and R1 are fully supported as of v0.11, with PD disaggregation via DLSlime added in v0.12. Qwen3, GPT-OSS, and most HuggingFace-compatible decoder-only architectures are also supported.
