Deploy LMDeploy on GPU Cloud: TurboMind Inference for InternLM, Qwen3, and DeepSeek (2026)

LMDeploy is the third major open-source LLM serving framework alongside vLLM and TensorRT-LLM, and it's the one most teams overlook. Its TurboMind engine is a C++/CUDA inference backend built by the Shanghai AI Laboratory, the same team behind InternLM. If you've been benchmarking vLLM vs TensorRT-LLM vs SGLang and wondering what else is out there, LMDeploy belongs in that conversation, especially for AWQ INT4 and MXFP4 workloads where its native C++ kernel paths produce measurable throughput gains. For production TensorRT-LLM deployment context, see the TensorRT-LLM production deployment guide. Full documentation is at docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy.

TL;DR

Engine	Architecture	Best For	MXFP4	AWQ INT4	v0.12.3 Feature
LMDeploy TurboMind	C++/CUDA	Mixed-precision, InternLM, DeepSeek	V100-B200	Native	PD disaggregation via DLSlime
vLLM	Python/C++	Broad model support	Blackwell only	Native	MRV2, gRPC
TensorRT-LLM	Compiled CUDA	Max throughput	Blackwell FP4	Via modelopt	B300 support
SGLang	Python/C++	Shared prefix, RAG	Blackwell only	Native	NSA backend

Use LMDeploy when:

You're running AWQ INT4 or MXFP4 on pre-Blackwell hardware such as H100 (Hopper) or V100 (Volta), not just Blackwell
Your workload centers on InternLM, Qwen3, or DeepSeek model families
You need prefill-decode disaggregation without adopting NVIDIA Dynamo's full orchestration layer
You want 1.5x vLLM throughput on GPT-OSS 120B MXFP4 without a TensorRT-LLM compilation step

What Is LMDeploy and the TurboMind Engine

TurboMind started as the inference backend for InternLM before being open-sourced as the core of the LMDeploy project. The key architectural difference from vLLM is where the heavy work lives.

vLLM's design is Python-first: the scheduling loop, batching logic, and dispatch to CUDA kernels happen in Python (or Python-managed Paged Attention extensions). This works well for broad model support but adds overhead when switching between quantization modes or dispatching to specialized INT4 and FP4 kernels.

TurboMind is C++ all the way down. Continuous batching, the paged KV cache, and the quantization kernels for AWQ INT4 and MXFP4 are all implemented in C++ with hand-tuned CUDA. When you call an AWQ model, TurboMind dispatches directly to a native INT4 decode kernel without a Python-managed dispatch step in the hot path. That is where the throughput gain comes from.

The dual-backend design gives you an escape hatch: TurboMind handles anything in its kernel catalog (InternLM, LLaMA family, Qwen, Mistral, Baichuan, DeepSeek, GPT-OSS). For models outside that list, LMDeploy falls back to a PyTorch backend automatically. Throughput is lower in fallback mode, but you get correctness without needing a different deployment tool.

The Sept 2025 v0.10.0 release extended MXFP4 support from Blackwell-only to the full CUDA GPU range from V100 through B200. On Blackwell (B200, B300, RTX PRO 6000), MXFP4 uses native FP4 tensor cores. On Hopper and Volta (H100, V100), it runs in software-emulated mode: you still get the memory bandwidth savings from 4-bit weights, but without native FP4 matrix op acceleration.
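To make "software-emulated" concrete, the sketch below decodes one MXFP4 block in plain Python. It assumes the OCP Microscaling layout as commonly described: 4-bit E2M1 elements sharing one 8-bit power-of-two scale per block. The function names are illustrative, and this is not TurboMind's kernel code, only the arithmetic a GPU without FP4 tensor cores has to perform before the matmul.

```python
# Illustrative MXFP4 block decode (not TurboMind's actual kernel).
# Assumes the OCP Microscaling layout: 4-bit E2M1 elements that share
# one 8-bit power-of-two (E8M0) scale per block.

# The 8 non-negative E2M1 magnitudes, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def decode_block(scale_byte: int, codes: list[int]) -> list[float]:
    """Apply the shared scale 2**(scale_byte - 127) to every element."""
    scale = 2.0 ** (scale_byte - 127)
    return [decode_e2m1(c) * scale for c in codes]


print(decode_block(127, [1, 2, 7, 9]))  # shared scale 2**0
print(decode_block(128, [1, 2, 7, 9]))  # shared scale 2**1
```

On Blackwell the tensor cores consume the 4-bit codes and scales directly; on older GPUs a step like this runs in software, which is why the gain there comes from reading fewer bytes rather than from faster math.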

When LMDeploy Wins: Mixed-Precision Workloads

AWQ + KV-cache INT8

TurboMind's AWQ kernels avoid the dequantize-to-FP16 step that slows Python-dispatched frameworks. In vLLM's Python-dispatch path, AWQ INT4 weights are unpacked to FP16 in a preprocessing kernel before the matmul. TurboMind skips that step: the INT4 decode kernel works directly on packed weights, reducing memory reads and eliminating a separate compute pass.

The memory layout matters too. TurboMind stores AWQ weights in a blocked INT4 layout with scale factors interleaved at the granularity TurboMind's CUDA kernels expect. This avoids register pressure from layout conversion at inference time.
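As a rough illustration of what "packed INT4 weights with scale factors" means, here is a minimal sketch of two-per-byte packing and group-wise dequantization. The nibble order, the flat layout, and the function names are assumptions chosen for clarity; TurboMind's real blocked layout is more elaborate.

```python
# Illustrative 4-bit weight packing and group-wise dequantization.
# Nibble order and the flat layout are assumptions for clarity only.

def pack_int4(values: list[int]) -> bytes:
    """Pack an even-length list of 4-bit values (0..15), two per byte."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> list[int]:
    """Inverse of pack_int4: low nibble first, then high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def dequantize(packed: bytes, scale: float, zero_point: int) -> list[float]:
    """Recover approximate weights for one group: w = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]


packed = pack_int4([0, 8, 15, 3])
print(len(packed), dequantize(packed, scale=0.5, zero_point=8))
```

A fused kernel performs the unpack and rescale in registers while it computes the matmul, so the FP16 copy of the weights never has to be written back to VRAM.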

For KV cache, TurboMind supports INT8 KV cache quantization independently of weight quantization. You can run BF16 weights with INT8 KV cache to extend effective context length, or combine AWQ INT4 weights with INT8 KV cache for maximum VRAM efficiency. For the full AWQ deployment workflow, see the AWQ quantization guide.
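The effect of INT8 KV cache on context capacity can be estimated with simple arithmetic. The sketch below uses hypothetical model dimensions (read the real ones from your model's config.json) and ignores the small per-block scale overhead that quantized KV storage adds. In LMDeploy the INT8 KV cache is, to my understanding, selected with quant_policy=8 in the engine config; check the documentation for your version.

```python
# Rough KV cache sizing. The model dimensions are hypothetical
# placeholders, and quantization scale overhead is ignored.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Keys and values (the factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_gib: float, per_token_bytes: int) -> int:
    """How many cached tokens fit in a given VRAM budget."""
    return int(budget_gib * 1024**3) // per_token_bytes


LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # hypothetical GQA model

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
int8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(bf16, int8)  # bytes per cached token
print(tokens_that_fit(40, bf16), tokens_that_fit(40, int8))  # tokens in 40 GiB
```

Halving the bytes per element doubles the number of tokens that fit in the same budget, independent of how the weights themselves are quantized.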

MXFP4 on V100 through B200

The v0.10.0 MXFP4 extension (Sept 2025) is the most practically significant feature for teams running H100 clusters. Before this release, MXFP4 serving was Blackwell-only across all major frameworks. TurboMind now brings MXFP4 to the installed H100 and H200 base.

On H100, you get the memory bandwidth savings from 4-bit weights (models that needed 80GB BF16 now fit in ~20GB) without native FP4 tensor core acceleration. The throughput improvement relative to BF16 comes from memory bandwidth, not compute: TurboMind reads fewer bytes per token from VRAM. On B200, you get both the memory savings and native FP4 tensor core execution.
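The 80 GB to ~20 GB figure follows directly from bits per weight. A minimal sketch, assuming decimal gigabytes and a hypothetical 40B-parameter dense model; the 4.25 bits/weight row assumes one 8-bit shared scale per 32-element MXFP4 block:

```python
# Back-of-the-envelope weight footprint at different precisions.
# Uses decimal gigabytes (1 GB = 1e9 bytes).

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Stored weight size in GB for a given parameter count."""
    return params_billion * bits_per_weight / 8


for name, bits in [("BF16", 16), ("MXFP4 (payload only)", 4), ("MXFP4 (with scales)", 4.25)]:
    print(f"{name:22s} {weight_gb(40, bits):6.2f} GB")
```

Real checkpoints often keep some layers (embeddings, norms, attention) in higher precision, so treat the result as a lower bound.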

The 1.5x vLLM throughput figure for GPT-OSS 120B MXFP4 comes from the LMDeploy v0.10.0 release benchmarks, which were run on H800 rather than H100. The H800 offers comparable compute and HBM bandwidth to the H100 but reduced NVLink interconnect bandwidth, so single-GPU results should transfer closely while multi-GPU results may differ. For the MXFP4 quantization workflow and the distinction between MXFP4 and NVFP4, see the MXFP4 quantization guide.

InternLM-family First-class Support

TurboMind was built by the InternLM team, which means InternLM-specific architecture patterns are native to the kernel implementation: RoPE scaling, grouped-query attention (GQA), and InternLM2's specific attention pattern all have optimized kernel paths. There are no monkey-patches or adapter layers required.

For teams running InternLM2.5 7B, 20B, or InternLM3 models, this translates to lower TTFT than vLLM, where InternLM support goes through a generic LLaMA-family path: 108 ms vs 120 ms p50 at concurrency 10 in the benchmark tables below.

Hardware Sizing on Spheron GPU Cloud

GPU	VRAM	On-demand ($/hr)	Spot ($/hr)	LMDeploy Use Case
H100 SXM5	80 GB	$5.07	$2.91	InternLM-7B to 20B BF16, DeepSeek 7B AWQ
H200 SXM5	141 GB	$5.92	$1.40	70B BF16 single-GPU, large KV cache decode nodes
B200 SXM6	192 GB	$7.00	$2.14	MXFP4 native acceleration, GPT-OSS 120B MXFP4
RTX PRO 6000	96 GB	$1.70	$1.32	InternLM 20B BF16 + AWQ, budget MXFP4
A100 SXM4	80 GB	$1.64	N/A	InternLM 7B, Qwen3 8B, AWQ INT4 cost-optimized

Pricing fluctuates based on GPU availability. The prices above are as of 01 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Install and Dependency Setup

CUDA requirements: CUDA 11.8+ for TurboMind base, CUDA 12.4+ recommended for MXFP4. NVIDIA's minimum Linux driver for CUDA 12.x is 525.60.13; use 550+ to match the CUDA 12.4 toolkit.

bash

# On a fresh Spheron GPU instance
nvidia-smi  # verify driver 525+ for CUDA 12.x

# Install LMDeploy with all extras (TurboMind C++ backend included)
# Quotes keep shells such as zsh from expanding the brackets
pip install "lmdeploy[all]"

# Verify TurboMind backend
python -c "from lmdeploy import turbomind; print('TurboMind OK')"

# Docker alternative (recommended for production)
docker pull openmmlab/lmdeploy:latest-cuda12.4

For the Spheron quick-start including GPU instance provisioning, see docs.spheron.ai/quick-start.

Deploy InternLM, Qwen3, and DeepSeek with TurboMind

InternLM2.5-7B (BF16, single H100)

Requires ~15 GB VRAM. One H100 SXM5 handles this with plenty of room for KV cache.

bash

lmdeploy serve api_server \
  internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333 \
  --tp 1

Test the endpoint:

bash

curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Explain paged KV cache in 2 sentences."}]
  }'
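The same request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; BASE_URL and MODEL are the values used in this guide, and the response parsing assumes the usual OpenAI chat-completion shape (choices[0].message.content).

```python
# Standard-library client for the OpenAI-compatible chat endpoint.
# BASE_URL and MODEL mirror the curl example; adjust for your server.
import json
import urllib.request

BASE_URL = "http://localhost:23333"
MODEL = "internlm/internlm2_5-7b-chat"


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Explain paged KV cache in 2 sentences.")
print(req.get_method(), req.full_url)
```

With the server running, calling chat("Explain paged KV cache in 2 sentences.") returns the generated text; building the request separately makes the payload easy to inspect or log before anything is sent.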

Qwen3-8B (AWQ INT4, single H100)

The Qwen3 dense lineup has no 7B model; the closest size is 8B. AWQ drops the 8B model to roughly 4-5 GB of stored weights. On an H100, this gives you substantial KV cache headroom for long contexts.

bash

lmdeploy serve api_server \
  Qwen/Qwen3-8B-AWQ \
  --backend turbomind \
  --model-format awq \
  --server-port 23333

DeepSeek-V3 (AWQ, 8x H100 tensor parallel)

DeepSeek-V3 is a 671B-parameter MoE model. At AWQ INT4 (4 bits per parameter) the stored weights occupy roughly 335 GB (671B × 4 / 8), before quantization scales and zero points. Five H100 SXM5s (400 GB combined VRAM) would hold the weights, but LMDeploy documents the tensor-parallel degree as a power of two, so round up to eight H100s (640 GB combined), which also leaves more room for KV cache. Alternatively, four H200 SXM5s (564 GB combined) provide sufficient headroom with --tp 4. For DeepSeek-specific configuration, see docs.spheron.ai/quick-guides/llms/deepseek-r1/.

bash

lmdeploy serve api_server \
  deepseek-ai/DeepSeek-V3-AWQ \
  --backend turbomind \
  --model-format awq \
  --tp 8 \
  --server-port 23333
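When choosing a GPU count, it helps to look at what is left on each GPU once the weights are loaded. The sketch below is VRAM arithmetic only, under the assumption that weights shard evenly across GPUs; it does not check whether the engine accepts a given tensor-parallel degree, and the candidate configurations are examples.

```python
# VRAM left per GPU for KV cache and activations after loading the weights.
# Decimal GB throughout; assumes weights shard evenly across GPUs.

def headroom_per_gpu_gb(n_gpus: int, vram_gb: float, weights_gb: float) -> float:
    """Per-GPU VRAM remaining once its share of the weights is resident."""
    return vram_gb - weights_gb / n_gpus


WEIGHTS_GB = 671 * 4 / 8  # 671B parameters at 4 bits each -> 335.5 GB

for label, n, vram in [("5x H100", 5, 80), ("8x H100", 8, 80), ("4x H200", 4, 141)]:
    print(f"{label}: {headroom_per_gpu_gb(n, vram, WEIGHTS_GB):.1f} GB free per GPU")
```

A configuration that barely fits the weights leaves little room for KV cache, which caps batch size and context length long before compute becomes the bottleneck.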

Deploy GPT-OSS with MXFP4: 1.5x vLLM Throughput

GPT-OSS 120B at MXFP4 fits on a single H100 SXM5 (quantized weights drop from ~240 GB BF16 to ~60 GB). TurboMind's native MXFP4 kernel path is what drives the throughput advantage over vLLM for this workload.

The openai/gpt-oss-120b checkpoint already stores its MoE weights in MXFP4, so no extra quantization flag is needed to serve it. Note that --quant-policy controls KV cache quantization in LMDeploy (4 for INT4, 8 for INT8), not the weight format; leave it unset unless you also want a quantized KV cache. On H100, MXFP4 uses software-emulated mixed-precision paths; on B200, it uses native FP4 tensor core acceleration.

bash

# GPT-OSS 120B with MXFP4 on H100/H200/B200
lmdeploy serve api_server \
  openai/gpt-oss-120b \
  --backend turbomind \
  --tp 1 \
  --server-port 23333

Throughput comparison on identical prompts (512 input / 256 output tokens):

GPU	Precision	LMDeploy tok/s	vLLM tok/s	Ratio
H100 SXM5	MXFP4	~1,950	~1,300	1.5x
H200 SXM5	MXFP4	~2,600	~1,750	1.49x
B200 SXM6	MXFP4 (native FP4)	~4,200	~2,800	1.5x

Based on LMDeploy's published benchmarks and H800/H100 test environments. These are baseline figures; actual results vary with batch size, context length, and sampling parameters. For the full GPT-OSS setup guide including SGLang and Ollama comparisons, see the GPT-OSS deployment guide.

Serving API: OpenAI-Compatible Endpoint, restful_api, and Pipeline Modes

Three ways of using LMDeploy cover different deployment patterns:

1. api_server (OpenAI-compatible REST, production)

The standard production path. Exposes /v1/chat/completions, /v1/completions, and /v1/embeddings.

bash

lmdeploy serve api_server \
  internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333

bash

curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'

2. Python client (programmatic, batch requests)

python

from lmdeploy.serve.openai.api_client import APIClient

client = APIClient("http://localhost:23333")
for item in client.chat_completions_v1(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Hello"}],
):
    print(item)

3. Pipeline mode (batch offline)

Best for batch scoring, synthetic data generation, and evaluation runs where you don't need an HTTP server. The pipeline interface processes lists of prompts in a single call.

python

from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(max_batch_size=32)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)

responses = pipe(["What is MXFP4?", "Explain paged KV cache.", "What is AWQ?"])
for resp in responses:
    print(resp.text)

For full API reference, see docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy.

DeepSeek Prefill-Decode Disaggregation with DLSlime and Mooncake

LMDeploy v0.12 added prefill-decode (PD) disaggregation via two external components:

DLSlime: manages KV cache transfer between prefill and decode nodes over RDMA or TCP
Mooncake: the distributed memory layer providing disaggregated KV cache routing and scheduling

The architecture:

[Client] -> [LMDeploy Router] -> [Prefill nodes: H100/B200] --(KV via DLSlime)--> [Decode nodes: H200]

DLSlime and Mooncake are not bundled with the default pip install lmdeploy. Install them separately:

bash

pip install dlslime mooncake-transfer-engine

Both require RDMA-capable network interfaces (InfiniBand or RoCE) for production deployments. A TCP fallback exists but adds latency; for latency-sensitive workloads, provision RDMA-capable Spheron instances.
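A rough estimate shows why the fabric matters for the KV hand-off. Every number in this sketch is a hypothetical placeholder (KV size per token, prompt length, link speeds), and it ignores latency and protocol overhead, so it gives a lower bound on transfer time rather than a prediction.

```python
# Rough estimate of KV cache hand-off time between a prefill and a decode node.
# All inputs are hypothetical; measure your own model and fabric.

def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: int, link_gbit_per_s: float) -> float:
    """Time to move one request's KV cache at line rate, in milliseconds."""
    total_bits = prompt_tokens * kv_bytes_per_token * 8
    return total_bits / (link_gbit_per_s * 1e9) * 1e3


KV_BYTES_PER_TOKEN = 64 * 1024  # hypothetical: 64 KiB per cached token
PROMPT_TOKENS = 4096

for name, gbps in [("400 Gbit/s RDMA fabric", 400), ("25 Gbit/s Ethernet (TCP)", 25)]:
    ms = kv_transfer_ms(PROMPT_TOKENS, KV_BYTES_PER_TOKEN, gbps)
    print(f"{name:26s} {ms:7.2f} ms")
```

Because the transfer sits between prefill and the first decoded token, this time adds directly to TTFT for every request.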

Minimal 2-node setup (1 prefill H100 + 1 decode H200):

bash

# Step 1: Start the proxy/router (handles routing between prefill and decode nodes)
lmdeploy serve proxy \
  --server-name 0.0.0.0 \
  --server-port 23333 \
  --routing-strategy min_expected_latency \
  --serving-strategy DistServe

# Step 2: On prefill node (H100) - registers with proxy on startup
lmdeploy serve api_server \
  deepseek-ai/DeepSeek-V3-AWQ \
  --backend pytorch \
  --model-format awq \
  --server-port 23334 \
  --role Prefill \
  --proxy-url http://<router-ip>:23333

# Step 3: On decode node (H200) - registers with proxy on startup
lmdeploy serve api_server \
  deepseek-ai/DeepSeek-V3-AWQ \
  --backend pytorch \
  --model-format awq \
  --server-port 23335 \
  --role Decode \
  --proxy-url http://<router-ip>:23333

PD disaggregation uses the PyTorch backend (--backend pytorch). The --role Prefill / --role Decode flags on the standard api_server subcommand replace the older prefill_worker / decode_worker subcommands, and lmdeploy serve proxy --serving-strategy DistServe replaces the old router subcommand.

For background on why prefill and decode have opposite GPU requirements, see the prefill-decode disaggregation guide. For an alternative orchestration layer that sits on top of vLLM, see the NVIDIA Dynamo disaggregated inference guide. For Kubernetes-native disaggregated serving, see the LLM-D Kubernetes disaggregated inference guide.

Benchmarks: LMDeploy vs vLLM vs TensorRT-LLM vs SGLang

Identical hardware: H100 SXM5 80GB, InternLM2.5-7B BF16, 512 input / 256 output tokens. These figures are based on LMDeploy's published benchmarks and scaling from the vLLM/SGLang benchmark data in the vLLM vs TensorRT-LLM vs SGLang benchmarks post. Treat them as directional, not independently reproduced on Spheron hardware.

Concurrency	LMDeploy TurboMind	vLLM	TensorRT-LLM	SGLang
1 req	135 tok/s	120 tok/s	130 tok/s	125 tok/s
10 req	720 tok/s	650 tok/s	710 tok/s	680 tok/s
50 req	1,980 tok/s	1,850 tok/s	2,100 tok/s	1,920 tok/s
100 req	2,550 tok/s	2,400 tok/s	2,780 tok/s	2,460 tok/s

LMDeploy TurboMind beats vLLM at all concurrency levels tested. TensorRT-LLM still leads at high concurrency due to compiled kernel graphs, but TurboMind closes much of the gap without a 28-minute compilation step. The TRT-LLM advantage is most pronounced at concurrency 50+ where compiled CUDA graph execution and persistent kernel state become dominant.

TTFT at concurrency 10 (p50 / p95, ms):

Engine	p50 TTFT	p95 TTFT
LMDeploy TurboMind	108 ms	182 ms
vLLM	120 ms	195 ms
TensorRT-LLM	105 ms	170 ms
SGLang	112 ms	178 ms

TurboMind has the second-lowest p50 TTFT after TensorRT-LLM, although SGLang edges it out at p95 (178 ms vs 182 ms). This matters for interactive use cases where every millisecond of first-token latency is user-visible.
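If you want to reproduce p50 and p95 figures from your own measurements, a nearest-rank percentile is enough. This is a minimal sketch; the sample latencies are made up, and benchmark tools may use interpolating definitions that give slightly different values.

```python
# Nearest-rank percentiles over a list of TTFT samples (milliseconds).
import math


def percentile(samples: list[float], p: float) -> float:
    """Smallest sample such that at least p percent of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


ttft_ms = [98, 101, 104, 105, 108, 110, 113, 121, 140, 182]  # hypothetical
print(percentile(ttft_ms, 50), percentile(ttft_ms, 95))
```

With only a handful of samples, p95 is effectively the worst request, so collect a few hundred requests per concurrency level before comparing engines.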

Cost per Million Tokens on Spheron GPU Cloud

Formula: cost_per_million = (price_per_hour / throughput_toks_per_sec / 3600) * 1_000_000

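InternLM2.5-7B BF16, H100 SXM5, concurrency 50 (1,980 tok/s from the benchmark table above). The H200 and A100 rows use throughput figures from other workloads (the 2,600 tok/s H200 figure matches the GPT-OSS 120B MXFP4 table, and the A100 figure is for a 7B AWQ model), so compare rows with care: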

GPU	Mode	Price/hr	Throughput	$/M tokens
H100 SXM5	On-demand	$5.07	1,980 tok/s	$0.71
H100 SXM5	Spot	$2.91	1,980 tok/s	$0.41
H200 SXM5	On-demand	$5.92	2,600 tok/s	$0.63
A100 SXM4	On-demand	$1.64	1,100 tok/s (7B AWQ)	$0.41
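The table values can be reproduced by applying the formula directly. A minimal sketch using the prices and throughput figures from the rows above:

```python
# Cost per million generated tokens at a sustained throughput.

def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per 1,000,000 tokens: hourly price spread over tokens per hour."""
    if tokens_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return price_per_hour / tokens_per_sec / 3600 * 1_000_000


rows = [
    ("H100 SXM5 on-demand", 5.07, 1980),
    ("H100 SXM5 spot", 2.91, 1980),
    ("H200 SXM5 on-demand", 5.92, 2600),
    ("A100 SXM4 on-demand", 1.64, 1100),
]
for name, price, tps in rows:
    print(f"{name:22s} ${cost_per_million_tokens(price, tps):.2f} per M tokens")
```

The formula assumes the GPU is saturated for the whole billed hour; at lower utilization the effective cost per token rises proportionally.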

For comparison, AWS g5.12xlarge (4x A10G, 96 GB total) runs roughly $5.67/hr on-demand for a comparable 4-GPU configuration, with lower per-GPU VRAM bandwidth than H100. Azure NC A100 v4 series (single A100 80GB) runs approximately $3.67/hr on-demand list price. Spheron provides bare-metal access with no hypervisor overhead, which adds another 5-15% throughput advantage on top of these price differences.

For a broader cost-per-token comparison across GPU models and quantization formats, see AI inference cost economics 2026 and GPU cloud pricing comparison 2026.


Production Checklist: Monitoring, Autoscaling, Fallback

Monitoring

LMDeploy exposes a Prometheus endpoint at /metrics. Key metrics to watch:

lmdeploy_requests_total: total requests served (counter, rate gives req/s)
lmdeploy_kv_cache_usage: KV cache fill percentage (alert above 90%)
lmdeploy_requests_waiting: queue depth (alert above N to trigger scale-out)
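For a quick check without a full Prometheus stack, the text exposition format is easy to parse by hand. The sketch below uses the metric names listed above, which may differ in your LMDeploy version; the sample payload, the assumption that KV cache usage is reported as a 0-1 fraction, and the queue threshold of 20 are all hypothetical.

```python
# Minimal parser for the Prometheus text exposition format, enough to
# read a few gauges and counters and apply simple alert thresholds.

def parse_metrics(text: str) -> dict[str, float]:
    """Map metric name (labels stripped) to its last seen value."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        values[name] = float(fields[0])
    return values


def alerts(metrics: dict[str, float], kv_limit: float = 0.90, queue_limit: int = 20) -> list[str]:
    """Return human-readable alerts for KV cache pressure and queue depth."""
    found = []
    kv = metrics.get("lmdeploy_kv_cache_usage", 0.0)
    if kv > kv_limit:
        found.append(f"KV cache usage {kv:.0%} exceeds {kv_limit:.0%}")
    waiting = metrics.get("lmdeploy_requests_waiting", 0.0)
    if waiting > queue_limit:
        found.append(f"{waiting:.0f} requests waiting exceeds {queue_limit}")
    return found


SAMPLE = """\
# HELP lmdeploy_requests_total Total requests served
# TYPE lmdeploy_requests_total counter
lmdeploy_requests_total{model="internlm2_5-7b-chat"} 18234
lmdeploy_kv_cache_usage 0.93
lmdeploy_requests_waiting 7
"""

print(alerts(parse_metrics(SAMPLE)))
```

In production the same thresholds belong in Prometheus alerting rules; a script like this is mainly useful for smoke tests right after a deployment.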

For full Prometheus and Grafana setup for GPU inference, see the LLM observability guide.

Autoscaling

LMDeploy's api_server is stateless at the HTTP layer. Scale-out means launching additional LMDeploy instances, each backed by one or more GPUs. A practical autoscaling setup with KEDA:

Expose lmdeploy_requests_waiting via Prometheus adapter
Configure a KEDA ScaledObject targeting queue depth (e.g., 1 replica per 20 queued requests)
Each new replica is a new lmdeploy serve api_server container with its own GPU allocation
Route via nginx or a service mesh with round-robin load balancing
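The scaling rule in the steps above reduces to a ceiling division with clamping. This sketch mirrors, as an approximation, how a queue-depth target of 20 per replica translates into a replica count; the function name and the min/max bounds are illustrative, and the real autoscaler also applies stabilization windows and cooldowns that are not modeled here.

```python
# Replica count from queue depth: ceil(queued / target), clamped to bounds.
import math


def desired_replicas(queued_requests: int, target_per_replica: int = 20,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """How many api_server replicas a given queue depth calls for."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queued_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for queued in (0, 20, 21, 45, 1000):
    print(queued, "->", desired_replicas(queued))
```

Each added replica needs its own GPU allocation, so the upper bound should reflect how many GPUs you are willing to pay for, not just what the queue asks for.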

For KEDA + Prometheus GPU autoscaling configuration, see the KEDA GPU autoscaling guide.

Fallback to PyTorch Engine

TurboMind has a defined model support list (InternLM, LLaMA/LLaMA-2/LLaMA-3, Mistral, Qwen, Baichuan, DeepSeek, GPT-OSS). For model architectures outside this list, LMDeploy falls back to the PyTorch backend automatically. You can also select it explicitly:

bash

lmdeploy serve api_server \
  your-custom-model \
  --backend pytorch \
  --server-port 23333

No code change required. The PyTorch backend has lower throughput than TurboMind but covers a wider range of HuggingFace-format decoder-only architectures; check LMDeploy's supported-models list for yours. Use it as a compatibility layer while you wait for upstream TurboMind kernel support.

LMDeploy's TurboMind engine runs best on bare-metal GPU access where C++/CUDA kernel scheduling isn't competing with hypervisor overhead. Spheron provides on-demand H100 and H200 instances with no virtualization layer and per-minute billing.


Quick Setup Guide

1. Provision a GPU instance on Spheron
Launch an H100 SXM5 or H200 SXM5 on Spheron from app.spheron.ai. For MXFP4 inference on Blackwell hardware, choose a B200 SXM6. SSH into the instance and verify GPU access with nvidia-smi.

2. Install LMDeploy
Install via pip: pip install lmdeploy. To include the optional extras: pip install "lmdeploy[all]". Verify with: lmdeploy --version. For Docker, pull openmmlab/lmdeploy:latest-cuda12.4.

3. Deploy a model with TurboMind backend
Run: lmdeploy serve api_server internlm/internlm2_5-7b-chat --backend turbomind --server-port 23333. For AWQ: add --model-format awq. MXFP4 checkpoints such as openai/gpt-oss-120b need no extra flag; --quant-policy 4 or 8 enables INT4 or INT8 KV cache quantization, not MXFP4 weights.

4. Test the OpenAI-compatible endpoint
curl http://localhost:23333/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "Hello"}]}'

5. Enable production monitoring
LMDeploy exposes Prometheus metrics at /metrics. Watch lmdeploy_requests_waiting and lmdeploy_kv_cache_usage for queue depth and memory pressure. Add a health check at /health and configure an nginx reverse proxy with rate limiting.


Frequently Asked Questions

What is LMDeploy and how does it differ from vLLM?
LMDeploy is an open-source LLM serving framework built on TurboMind, a C++/CUDA inference engine. Unlike vLLM, which uses a Python-first architecture, TurboMind implements continuous batching, paged KV cache, and quantization kernels directly in C++ with hand-tuned CUDA kernels. The result is lower decode latency and higher throughput for mixed-precision workloads, particularly AWQ INT4 and MXFP4, where TurboMind's native kernel paths outperform vLLM's Python-dispatched equivalents.

Which GPUs does LMDeploy MXFP4 support?
LMDeploy v0.10.0 added MXFP4 support for NVIDIA GPUs from V100 through B200 (and B300 Blackwell Ultra). Hardware-accelerated FP4 tensor cores are available only on Blackwell (B200, B300, RTX 5090, RTX PRO 6000). On Hopper and Volta GPUs, MXFP4 runs in software-emulated mode, still providing memory bandwidth savings but without native FP4 matrix op acceleration.

Can I use LMDeploy with the OpenAI API format?
Yes. LMDeploy's api_server exposes OpenAI-compatible endpoints at /v1/chat/completions, /v1/completions, and /v1/embeddings. Existing client code that targets the OpenAI API works by changing only the base_url and api_key.

How does DeepSeek prefill-decode disaggregation work with LMDeploy?
LMDeploy v0.12 integrates DLSlime and Mooncake for KV cache transfer between prefill and decode nodes. Prefill nodes (H100, B200) handle prompt processing; decode nodes (H200 for large KV cache capacity) handle token generation. DLSlime manages the inter-node KV cache transfer, while Mooncake provides the distributed memory layer for efficient cache routing.

Is LMDeploy production-ready for serving InternLM and DeepSeek models?
Yes. LMDeploy was developed by the Shanghai AI Laboratory team that also maintains InternLM, so InternLM model support is native and first-class. DeepSeek V3 and R1 are fully supported as of v0.11, with PD disaggregation via DLSlime added in v0.12. Qwen3, GPT-OSS, and most HuggingFace-compatible decoder-only architectures are also supported.
