Engineering

Deploy DeepEP and DeepGEMM on GPU Cloud: MoE Inference Kernels Guide (2026)

Written by Mitrasish, Co-founder | May 14, 2026
Tags: DeepEP Deployment, DeepGEMM GPU Cloud, MoE Inference Kernels, FP8 Expert Parallelism, DeepSeek Open-Source Kernels, Expert Parallelism, vLLM, SGLang, GPU Cloud

Engineers deploying DeepSeek V3/V4, GLM-5.1, and Kimi K2.6 in production hit two consistent bottlenecks: all-to-all expert dispatch that blocks GPU compute, and generic GEMM kernels that leave FP8 throughput on the table. DeepEP and DeepGEMM are DeepSeek's library answers to both. For broader context on MoE architecture and memory planning, see the MoE inference optimization guide.

What DeepEP and DeepGEMM Solve

The standard MoE inference pipeline has a stall baked in. When the router assigns a token to an expert on a different GPU, all computation pauses while that token is shipped over NVLink via NCCL all-to-all. On an 8x H200 NVLink cluster, this communication block can consume 12-18% of per-step wall time depending on batch size and model depth.

DeepEP solves this by using NVSHMEM to initiate RDMA transfers in the background while the local GPU keeps computing on locally-assigned expert tokens. By the time local computation finishes, most of the remote tokens have already arrived. The dispatch overhead drops from 12-18% to roughly 3-5%.

DeepGEMM addresses a separate gap. Standard cuBLAS GEMM is tuned for large, regular matrices. MoE expert layers use grouped GEMM with variable batch dimensions: not all experts get equal token load per step, so the effective (M, N, K) shapes change constantly. DeepGEMM JIT-compiles a kernel at first call for each shape tuple and caches it. On H200 and B200 FP8 Tensor Cores, these shape-specific kernels outperform generic implementations by 1.4-2x.

The combined effect on a representative 8x B200 cluster running DeepSeek V4 at 32K context (representative estimates based on public SGLang/DeepSeek benchmark data):

Setup | Tokens/sec (approx) | TTFT p50 | Dispatch overhead
Generic GEMM + NCCL all-to-all | ~2,000 | ~240ms | ~15% of step time
DeepGEMM only | ~2,700 (+35%) | ~180ms | ~15% (unchanged)
DeepEP + DeepGEMM | ~3,100 (+55%) | ~120ms | ~3-5%

These figures are representative estimates for a throughput-optimized 8-way NVLink configuration. Actual results vary by batch size, context length, and hardware generation.

Architecture: DeepEP Communication Overlap

MoE inference has two communication steps per layer: expert dispatch (send tokens to the GPU holding the assigned expert) and expert combine (send results back). In standard NCCL, both are blocking: no GPU starts computing until all tokens have been exchanged.

DeepEP replaces this with async background transfers that overlap with local compute. In V1, this used NVSHMEM one-sided RDMA puts; in V2 (current upstream), it uses the NCCL Gin backend, which reuses existing NCCL communicators for the same effect. The key difference from standard NCCL all-to-all is that a GPU can initiate transfers to remote GPU memory without requiring the remote GPU to participate synchronously. This enables a pipelined execution pattern:

  1. GPU starts computing on locally-assigned expert tokens from the current batch.
  2. Simultaneously, the background transfer path (NVSHMEM RDMA in V1, NCCL Gin in V2) begins pushing dispatched tokens to remote GPUs.
  3. When local computation finishes, most remote tokens have already arrived via the background path.
  4. A short synchronization step handles any remaining in-flight tokens.

The benefit scales with the ratio of local compute time to dispatch communication time. On 8x H200 with NVLink (900 GB/s bidirectional), this overlap hides most of the dispatch latency for typical DeepSeek V3/V4-class expert dimensions.
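
The overlap pattern is easier to see stripped down. The sketch below is illustrative only, not DeepEP's API: a second CUDA stream and a plain device-side copy stand in for the background transfer path (NVSHMEM RDMA in V1, NCCL Gin in V2), and the tensor shapes are arbitrary placeholders.

python
import torch

# Illustrative overlap pattern (not DeepEP's API): local expert compute runs
# on the default stream while a side stream moves "remote" tokens in the
# background, standing in for the async dispatch path.
comm_stream = torch.cuda.Stream()

local_tokens  = torch.randn(4096, 7168, device='cuda', dtype=torch.bfloat16)
expert_weight = torch.randn(7168, 2048, device='cuda', dtype=torch.bfloat16)
remote_tokens = torch.randn(4096, 7168, device='cuda', dtype=torch.bfloat16)
recv_buffer   = torch.empty_like(remote_tokens)

with torch.cuda.stream(comm_stream):
    # stand-in for pushing dispatched tokens over NVLink/InfiniBand
    recv_buffer.copy_(remote_tokens, non_blocking=True)

# local expert compute proceeds without waiting for the transfer (steps 1-2)
local_out = local_tokens @ expert_weight

# short sync point for anything still in flight, then process the tokens
# that arrived during local compute (steps 3-4)
torch.cuda.current_stream().wait_stream(comm_stream)
remote_out = recv_buffer @ expert_weight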

In V1, NVSHMEM requires symmetric memory: every GPU pre-allocates a fixed region of VRAM that other GPUs can address directly, costing approximately 200-400 MB of VRAM per GPU depending on batch size and dispatch chunk configuration. V2's NCCL Gin backend reuses existing NCCL communicators and does not have this symmetric memory requirement.
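
That figure is easy to sanity-check with a back-of-envelope estimate, assuming the symmetric region is sized roughly as the maximum number of in-flight dispatch tokens times the hidden dimension at FP8, with separate dispatch and combine buffers. The real allocation depends on DeepEP's internal buffer layout and your chunk settings, so treat this as an order-of-magnitude check only.

python
# Rough order-of-magnitude estimate; DeepEP's actual buffer layout differs.
max_dispatch_tokens = 16384   # matches SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS
hidden_dim = 7168             # DeepSeek V3-class hidden size
bytes_per_elem = 1            # FP8 activations

per_direction = max_dispatch_tokens * hidden_dim * bytes_per_elem
total = 2 * per_direction     # dispatch + combine buffers
print(f"~{total / 1e6:.0f} MB of symmetric VRAM per GPU")  # ~235 MB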

Architecture: DeepGEMM Shape-Specific JIT Compilation

Expert layers in MoE models do not receive equal token loads. The router assigns tokens to experts based on learned routing weights, so some experts receive many tokens per step and others receive few or none. This means the effective M dimension (batch of tokens per expert) varies every step, producing irregular grouped GEMM shapes that generic BLAS libraries handle poorly.

DeepGEMM JIT-compiles a kernel at first call for each unique (M, N, K) shape tuple and caches the compiled artifacts. Subsequent calls for the same shape reuse the compiled kernel with no recompilation overhead. The JIT cache persists across server restarts by default.
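
A toy version of that caching behavior is sketched below, with a plain matmul standing in for the compiled FP8 kernel. DeepGEMM's real cache stores compiled GPU kernels keyed by shape; this only illustrates why the first call per shape is slow and repeats are effectively free.

python
import torch
from functools import lru_cache

@lru_cache(maxsize=None)
def get_kernel(m: int, n: int, k: int):
    # In DeepGEMM this is a 1-3 s JIT compile; here it is just a stand-in.
    print(f"compiling kernel for (M={m}, N={n}, K={k})")
    return lambda a, w: a @ w.T

def grouped_expert_gemm(token_batches, expert_weights):
    # M (tokens routed to each expert) changes every step; N and K are fixed
    # per expert layer, so the cache converges once the common M values have
    # each been seen once.
    outs = []
    for a, w in zip(token_batches, expert_weights):
        kernel = get_kernel(a.shape[0], w.shape[0], a.shape[1])
        outs.append(kernel(a, w))
    return outs

a0 = torch.randn(96, 7168)    # expert 0 received 96 tokens this step
a1 = torch.randn(512, 7168)   # expert 1 received 512 tokens
w  = torch.randn(2048, 7168)  # (N, K) expert weight
grouped_expert_gemm([a0, a1], [w, w])   # compiles two "kernels"
grouped_expert_gemm([a0, a1], [w, w])   # cache hits, no compile messages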

FP8 precision is the other piece. DeepGEMM uses E4M3 format (4 exponent bits, 3 mantissa bits) with per-tensor or per-row scaling. On B200 SXM6, MXFP8 microscaling hardware provides additional throughput on top of standard FP8 by using finer-grained scale factors that preserve precision for outlier activations.
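
Scaling granularity matters because E4M3 can only represent magnitudes up to 448. The short sketch below illustrates per-row scaling with plain PyTorch casts; DeepGEMM's own quantization layout is block-wise (see the scale shapes in the validation script later), so this is a conceptual illustration rather than its API.

python
import torch

# Per-row FP8 (E4M3) scaling sketch: each row gets its own scale, so one
# outlier row does not crush precision for every other row the way a single
# per-tensor scale would.
x = torch.randn(8, 7168, device='cuda')
x[0, 0] = 300.0                               # outlier in row 0 only

row_amax = x.abs().amax(dim=1, keepdim=True)  # (8, 1)
scale = row_amax / 448.0                      # 448 = E4M3 max magnitude
x_fp8 = (x / scale).to(torch.float8_e4m3fn)

# Dequantize and compare: only row 0 pays a precision cost for the outlier
x_deq = x_fp8.to(torch.float32) * scale
print((x - x_deq).abs().mean(dim=1))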

The tradeoff: first call for each new (M, N, K) shape triggers a 1-3 second compilation step. For a 256-expert model with variable token routing, this means the first several inference steps are slow while the kernel cache builds up. After ~10-20 steps covering the common expert activation patterns, the cache stabilizes and all subsequent calls hit the fast path. Pre-warm before serving live traffic.

Hardware Requirements

Not all GPU configurations benefit equally from these libraries. The requirements differ between DeepEP and DeepGEMM.

DeepEP requirements: NVLink connectivity between GPUs is needed for the low-latency fast path. SXM form factor GPUs have NVLink on-socket; PCIe GPUs do not. DeepEP V2 (current upstream) reuses existing NCCL communicators for its fast path, which works over NVLink or InfiniBand. DeepEP V1 used NVSHMEM RDMA, which also requires InfiniBand HDR or NDR for multi-node deployments. On PCIe-only setups, DeepEP falls back to standard NCCL, eliminating the overlap benefit. Ethernet-only multi-node does not support the high-bandwidth fast path.

DeepGEMM requirements: FP8 Tensor Core support is needed for hardware-accelerated FP8 grouped GEMM. This covers all Hopper and Blackwell SKUs: H100 SXM5, H100 PCIe, H200 SXM5, and B200 SXM6. A100 (Ampere) has no hardware FP8 and is not supported. H100 PCIe runs DeepGEMM but lacks the NVLink interconnect that DeepEP needs for its NVSHMEM fast path.
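
A quick eligibility check from Python, based on compute capability (Hopper reports 9.0 and Blackwell 10.0, both with FP8 Tensor Cores; Ampere A100 reports 8.0):

python
import torch

# Hopper (9.0) and Blackwell (10.0) have the FP8 Tensor Cores DeepGEMM needs;
# Ampere A100 (8.0) does not.
major, minor = torch.cuda.get_device_capability(0)
supported = major >= 9
print(f"compute capability {major}.{minor} ->",
      "FP8 kernels supported" if supported else "not supported")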

Use case | GPU | Min count | Interconnect | DeepEP? | DeepGEMM?
Single-node MoE serving | H200 SXM5 | 4-8 | NVLink 900 GB/s | Yes | Yes
Single-node MoE serving | B200 SXM6 | 4-8 | NVLink 1.8 TB/s | Yes | Yes (MXFP8)
Multi-node MoE cluster | H200 SXM5 | 8+ | NVLink + IB NDR | Yes (V2: NCCL Gin; V1: NVSHMEM IB) | Yes
Budget single-node | H100 SXM5 | 8 | NVLink 900 GB/s | Yes | Yes
PCIe budget | H100 PCIe | 4-8 | PCIe | No benefit | Yes (works, lower throughput than SXM)

Engineers who want NVLink-connected B200 clusters for this setup can rent B200 GPU clusters on Spheron by the hour with no minimum commitment. For a detailed breakdown of B200 Blackwell architecture, FP8/FP4 capabilities, and when the upgrade makes financial sense, see the NVIDIA B200 complete guide. For slightly lower cost with the same NVLink topology, H200 SXM5 rental provides 141 GB per GPU and supports both DeepEP and DeepGEMM.

Step-by-Step Installation

Prerequisites

Before installing, confirm your environment meets these requirements:

  • CUDA 12.3 or later (nvcc --version)
  • Python 3.10+
  • PyTorch 2.3+
  • NVLink topology verified (nvidia-smi topo -m - look for NV4, NV8, or NV18 labels between GPU pairs)
  • For multi-node: InfiniBand present (ibstat should return active ports)

Install NVSHMEM (DeepEP V1 only)

If you are using DeepEP V2 (the default upstream), NVSHMEM is optional and only needed for legacy code paths. Skip this subsection unless you are pinning to V1. For V2, install the NCCL Gin backend with: pip install "nvidia-nccl-cu13>=2.30.4" --no-deps

NVSHMEM is required for DeepEP V1's fast path. Check whether it is already present in your container first:

bash
# Check if NVSHMEM is already installed (common in NGC PyTorch containers)
ldconfig -p | grep nvshmem

If not found, install via the NVIDIA package manager:

bash
# Via NVIDIA package manager (CUDA 12.x)
apt-get install -y libnvshmem-dev

# Or download the tarball from NVIDIA NGC and install manually
# The NVSHMEM version must match your CUDA version - check the NGC catalog

Some NGC container images (nvcr.io/nvidia/pytorch:24.xx-py3) include NVSHMEM pre-installed. If you're using a bare-metal instance without an NGC base image, the package manager path is more reliable.

Build DeepEP

bash
git clone https://github.com/deepseek-ai/DeepEP
cd DeepEP
export NVSHMEM_HOME=/usr/local/nvshmem   # adjust to your NVSHMEM install path
pip install -e . --no-build-isolation
python -c "import deep_ep; print(deep_ep.__version__)"

The --no-build-isolation flag is required. It lets the build see the system PyTorch install and detect NVSHMEM. With build isolation enabled, pip creates a clean environment that cannot see your installed PyTorch, which breaks DeepEP's setup-time path detection.

Install DeepGEMM

bash
pip install git+https://github.com/deepseek-ai/DeepGEMM
# Verify installation
python -c "import deep_gemm; print('DeepGEMM installed')"

Validate the Installation

Run a quick sanity check to confirm the FP8 Tensor Core path is accessible:

bash
python -c "
import torch
import deep_gemm
# torch.randn does not support FP8 dtypes; generate in float32 then cast
a = torch.randn(128, 7168, device='cuda').to(torch.float8_e4m3fn)
b = torch.randn(4096, 7168, device='cuda').to(torch.float8_e4m3fn)
out = torch.empty(128, 4096, dtype=torch.bfloat16, device='cuda')
# DeepGEMM requires 2D block-wise scale tensors, not per-tensor scalars.
# a_scale shape: (M, ceil(K / 128)) = (128, 56) for K=7168
# b_scale shape: (ceil(N / 128), ceil(K / 128)) = (32, 56) for N=4096, K=7168
a_scale = torch.ones(128, 56, device='cuda', dtype=torch.float32)
b_scale = torch.ones(32, 56, device='cuda', dtype=torch.float32)
# Shape (128, 4096, 7168) - typical DeepSeek V3 expert shape
# This triggers JIT compilation on first call (1-3 seconds)
deep_gemm.gemm_fp8_fp8_bf16_nt(a, a_scale, b, b_scale, out)
print('DeepGEMM kernel compiled and ready, output shape:', out.shape)
"

Note: DeepGEMM's FP8 API requires 2D block-wise scale tensors. The block size along K is 128, so a_scale must have shape (M, ceil(K/128)) and b_scale must have shape (ceil(N/128), ceil(K/128)). Passing per-tensor scalars (shape (1,)) will cause a CUDA dimension-mismatch error at runtime. The exact function signatures may also differ in newer DeepGEMM versions, so check the DeepGEMM README for the current call signature before building application code on top of it.
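
When generating scale tensors programmatically, the expected shapes follow directly from that 128-wide K blocking. A small helper, under the same assumption (confirm the blocking against your installed DeepGEMM version):

python
import math

def fp8_scale_shapes(m: int, n: int, k: int, block: int = 128):
    """Expected (a_scale, b_scale) shapes for the block-wise FP8 layout
    described above; confirm against the DeepGEMM README for your version."""
    k_blocks = math.ceil(k / block)
    return (m, k_blocks), (math.ceil(n / block), k_blocks)

print(fp8_scale_shapes(128, 4096, 7168))   # ((128, 56), (32, 56))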

SGLang Integration

SGLang has native DeepEP and DeepGEMM support in its MoE backend. When both libraries are installed, SGLang auto-detects and uses them; no extra flags are required.

For DeepSeek V3/V4 on 8x B200 SXM6:

bash
# Set environment variables before launching the server (exports after a blocking command are never reached)
export SGLANG_DEEPEP_CHUNK_SIZE=128        # token dispatch chunk size
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS=16384   # max tokens in flight per dispatch

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --enable-dp-attention \
  --dp 8 \
  --mem-fraction-static 0.88 \
  --dtype fp8 \
  --port 30000

The --enable-dp-attention flag enables data-parallel attention: different GPUs handle different sequence positions in the attention layers while expert parallelism handles the MoE layers. This reduces cross-GPU all-reduce communication on long sequences, which matters most at context lengths above 16K.

For GLM-5.1 on 8x H200 SXM5:

bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 \
  --enable-dp-attention \
  --mem-fraction-static 0.85 \
  --port 30000

SGLang's handling of GLM-5.1's 40B-active expert routing also benefits from DeepGEMM's FP8 grouped GEMM kernels, giving measurable throughput improvements on H200 compared to generic GEMM. For the full SGLang production setup, see the SGLang production deployment guide.

vLLM Integration

vLLM uses DeepGEMM automatically when installed and the GPU supports FP8. No additional flag is required to activate it in vLLM 0.14+.

For DeepSeek V4 on 8x H200 SXM5:

bash
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# DeepGEMM is auto-used when installed; no extra flag needed in vLLM 0.14+
# DeepEP is not yet upstream in vLLM as of May 2026; check the vLLM issue tracker for current status

As of May 2026, DeepEP is better supported in SGLang than in vLLM. DeepEP integration is not yet upstream in vLLM; check the vLLM issue tracker for current status before trying --enable-deepep. For production DeepEP deployments, SGLang is the more reliable choice. DeepGEMM works well in both. For vLLM-specific tuning details, see the vLLM production deployment guide. For a TensorRT-LLM-based serving alternative with different performance tradeoffs, see the TensorRT-LLM production deployment guide.

Tuning for Throughput vs Latency

Three main levers control the throughput/latency tradeoff:

Parameter | Throughput-first | Latency-first | Where to set
Dispatch chunk size | 256 tokens | 64 tokens | SGLANG_DEEPEP_CHUNK_SIZE
Expert parallel size | = GPU count | GPU count / 2 | --tp + --dp in SGLang
Hidden dim padding | 128 (default) | 64 | DeepGEMM compile flag DEEPGEMM_HIDDEN_DIM_PAD
FP8 KV cache | Enabled | Enabled | --kv-cache-dtype fp8_e5m2 in vLLM
Memory fraction | 0.92 | 0.85 | --gpu-memory-utilization

Dispatch chunk size controls how many tokens are batched together before a dispatch step. Larger chunks amortize the overhead of initiating RDMA transfers, which improves throughput for high-concurrency batch jobs. Smaller chunks reduce the wait before dispatch starts, which matters for TTFT on short-prompt interactive workloads.

Expert parallel size sets how many GPUs participate in expert routing. Higher expert parallel size means more all-to-all hops and higher communication volume, but each GPU handles fewer experts so per-GPU expert computation is lighter. On 8-way NVLink H200 (900 GB/s), full 8-way expert parallelism is usually throughput-optimal. For multi-node InfiniBand deployments, reducing expert parallelism to 4 and using pipeline parallelism across nodes can reduce cross-node traffic significantly.

Hidden dim padding affects VRAM usage slightly. DeepGEMM pads hidden dimensions to a multiple of this value for aligned memory access. The default of 128 works well for most expert shapes. Use 64 only if VRAM is tight.

Measure TTFT and tokens/sec with both configurations using the SGLang bench_serving script before committing to a production configuration.
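
For a quick single-request sanity check before the full bench_serving sweep, a streaming probe against the OpenAI-compatible endpoint gives approximate TTFT and decode rate. The URL, model name, and prompt below are placeholders; adjust them to your deployment.

python
import time, requests

# Approximate TTFT / decode-rate probe against an OpenAI-compatible streaming
# endpoint (SGLang on :30000 in this guide, vLLM on :8000). Use
# python -m sglang.bench_serving for real comparisons.
URL = "http://localhost:30000/v1/completions"            # placeholder
payload = {"model": "deepseek-ai/DeepSeek-V3",           # placeholder
           "prompt": "Explain expert parallelism in MoE models.",
           "max_tokens": 256, "stream": True}

start = time.time()
first_token_at = None
chunks = 0
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:") or b"[DONE]" in line:
            continue
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1

print(f"TTFT (single request): {(first_token_at - start) * 1000:.0f} ms")
print(f"approx decode rate: {chunks / (time.time() - first_token_at):.1f} chunks/s")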

Benchmarks: DeepSeek V4, GLM-5.1, and Kimi K2.6

The figures below are representative estimates for throughput-optimized SGLang configurations on Spheron B200 SXM6 and H200 SXM5 instances with CUDA 12.4 and an SGLang build current as of May 2026. Reproduce with python -m sglang.bench_serving. Actual results vary by batch size, context distribution, and model version.

DeepSeek V4 on 8x B200 SXM6 (FP8, 32K context):

Setup | Tokens/sec (approx) | TTFT p50 | Cost/1M tokens (on-demand)
Generic GEMM + NCCL | ~2,000 | ~240ms | ~$8.13
DeepGEMM only | ~2,700 (+35%) | ~180ms | ~$6.02
DeepEP + DeepGEMM | ~3,100 (+55%) | ~120ms | ~$5.25

8x B200 SXM6 on-demand: $58.56/hr. Cost per million tokens derived from live pricing as of 14 May 2026.

Pricing fluctuates based on GPU availability. The prices above are based on 14 May 2026 and may have changed. Check current GPU pricing → for live rates.

GLM-5.1 on 8x H200 SXM5 (FP8, 32K context):

Setup | Tokens/sec (approx) | TTFT p50 | Cost/1M tokens (on-demand)
Generic GEMM + NCCL | ~1,400 | ~280ms | ~$7.49
DeepGEMM only | ~1,890 (+35%) | ~210ms | ~$5.55
DeepEP + DeepGEMM | ~2,170 (+55%) | ~140ms | ~$4.83

8x H200 SXM5 on-demand: $37.76/hr.

Pricing fluctuates based on GPU availability. The prices above are based on 14 May 2026 and may have changed. Check current GPU pricing → for live rates.

Kimi K2.6 on 8x H200 SXM5 (FP8, 64K context):

Kimi K2.6's 384-expert routing at 64K context puts more pressure on dispatch than shorter-context workloads. The relative gain from DeepEP is larger here because the all-to-all communication time grows with context length. For the full Kimi K2.6 setup, see the Kimi K2.6 deployment guide.

Setup | Tokens/sec (approx) | TTFT p50 | Cost/1M tokens (on-demand)
Generic GEMM + NCCL | ~900 | ~480ms | ~$11.65
DeepGEMM only | ~1,215 (+35%) | ~360ms | ~$8.63
DeepEP + DeepGEMM | ~1,450 (+60%) | ~200ms | ~$7.23

Pricing fluctuates based on GPU availability. The prices above are based on 14 May 2026 and may have changed. Check current GPU pricing → for live rates.

Common Failure Modes and Debugging

GPU Topology Mismatch (Single-Node)

Symptom: DeepEP initializes without error, but throughput is the same as baseline or worse. nvidia-smi topo -m shows PHB, NODE, or SYS instead of NV# between some GPU pairs.

Root cause: Some multi-GPU setups place GPUs on different PCIe root complexes, with direct NVLink available only between a subset of GPU pairs. Transfers between pairs without a direct NVLink connection fall back to PCIe paths, which adds latency compared to NVLink transfers.

Fix: Run nvidia-smi topo -m and identify which GPU pairs have direct NVLink connections (NV4, NV8, or NV18 in the topology matrix, as opposed to PHB, NODE, or SYS). Use CUDA_VISIBLE_DEVICES ordering to align expert group assignments with the highest-bandwidth GPU pairs. Reassign expert groups so frequently co-dispatched experts land on GPUs with direct NVLink.

NIC Topology Mismatch (Multi-Node InfiniBand)

Symptom: Multi-node DeepEP runs slower than single-node, or NVSHMEM throws errors about NIC affinity during initialization.

Root cause: Each GPU should use the InfiniBand NIC on the same PCIe root complex. NVSHMEM routes through the NIC, so a mismatch adds unnecessary PCIe crossing overhead.

Fix: Set NVSHMEM_IB_DEVICE_ORDER=NODES_FIRST and configure NCCL_IB_HCA to match your NIC assignment from ibstat. Run nvshmem_test to verify symmetric memory allocation succeeds before launching the model server.

FP8 Scaling Collapse

Symptom: Model output is incoherent or perplexity spikes on certain inputs. The same checkpoint runs fine in BF16.

Root cause: FP8 E4M3 has a narrow representable range (roughly -448 to +448). Per-tensor scaling sets a single scale to accommodate the largest outlier in the tensor, which crushes representable precision for everything else. This is most common at long context lengths where positional encodings introduce large values.

Fix: Switch from per-tensor to per-row or per-channel FP8 scaling. In vLLM, provide a pre-computed scaling file via --quantization-param-path. In SGLang, consult the SGLang docs for the current per-row FP8 scaling config option. Verify by running identical prompts in BF16 and FP8 and comparing output distributions.

DeepGEMM JIT Compilation Timeout

Symptom: First request to the server hangs for 10-30 seconds. Health checks fail during startup. After the warm-up period, everything is fine.

Root cause: DeepGEMM compiles a kernel on first call for each (M, N, K) shape. Large MoE models have many distinct expert activation patterns.

Fix: Pre-warm the server before opening the endpoint to live traffic. Pass a representative batch through the model covering your expected expert activation patterns. Set a longer health-check timeout in your load balancer during the warm-up window (60-120 seconds is typically sufficient for a full kernel cache build).
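
A minimal warm-up sketch, assuming the SGLang OpenAI-compatible endpoint used earlier in this guide; the URL, model name, and prompt/length mix are placeholders, so cover the context lengths and concurrency you actually expect in production.

python
import requests

# Warm-up pass: push a spread of prompt and output lengths through the server
# so the DeepGEMM JIT cache covers common expert shapes before the load
# balancer routes live traffic. Endpoint and model names are placeholders.
URL = "http://localhost:30000/v1/completions"
MODEL = "deepseek-ai/DeepSeek-V3"

warmup_jobs = [
    ("short prompt "  * 8,    64),
    ("medium prompt " * 128,  256),
    ("long prompt "   * 1024, 512),
]

for prompt, max_tokens in warmup_jobs:
    r = requests.post(URL, json={"model": MODEL, "prompt": prompt,
                                 "max_tokens": max_tokens}, timeout=600)
    r.raise_for_status()

print("warm-up complete; open the endpoint to live traffic")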

Cost Comparison: Spheron vs Hyperscaler Managed Services

Managed API providers do not expose kernel-level tuning. If you're using AWS Bedrock or Azure AI Foundry for DeepSeek-class models, DeepEP and DeepGEMM gains are not available to you. Self-hosted GPU cloud is the only way to capture them. Spheron's bare-metal B200 and H200 SXM nodes include the NVLink/NVSwitch topology required for DeepEP single-node deployments.

Provider | GPU | Config | Hourly rate | Tokens/sec (DeepSeek V4, est.) | Cost/1M tokens (est.)
Spheron (on-demand) | B200 SXM6 | 8-GPU | $58.56 | ~3,100 (with DeepEP+DeepGEMM) | ~$5.25
Spheron (spot) | B200 SXM6 | 8-GPU | $16.48 | ~3,100 (with DeepEP+DeepGEMM) | ~$1.48
Spheron (on-demand) | H200 SXM5 | 8-GPU | $37.76 | ~2,170 (with DeepEP+DeepGEMM) | ~$4.83
Spheron (spot) | H200 SXM5 | 8-GPU | $9.52 | ~2,170 (with DeepEP+DeepGEMM) | ~$1.22
AWS Bedrock (managed API) | N/A | managed | per-token | N/A | Check AWS pricing page
Azure AI Foundry (managed) | N/A | managed | per-token | N/A | Check Azure pricing page

Pricing fluctuates based on GPU availability. The prices above are based on 14 May 2026 and may have changed. Check current GPU pricing → for live rates.

Managed API providers price on a per-token basis at rates that typically do not reflect the cost savings from kernel-level optimization. Self-hosting on Spheron H200 SXM5 instances gives you the FP8 Tensor Core support needed for DeepGEMM plus the NVLink interconnect required for DeepEP, at hourly rates that make the per-token math favorable at production scale. See GPU cloud pricing for current live rates.


DeepEP and DeepGEMM give you production-grade MoE kernel performance, but only if your GPU cluster has the right NVLink topology. Spheron's bare-metal B200 and H200 SXM nodes come with NVLink/NVSwitch interconnect out of the box, no IB setup required for single-node deployments.

Rent B200 → | Rent H200 → | View all pricing →

Get started on Spheron →

STEPS / 06

Quick Setup Guide

  1. Verify hardware and topology requirements

    Run 'nvidia-smi topo -m' to confirm NVLink connectivity between all GPUs. SXM form factor GPUs (H100 SXM5, H200 SXM5, B200 SXM6) will show NVLink connections. PCIe GPUs will show PIX or PHB connections only, which limits DeepEP benefit. Confirm CUDA version is 12.3+ with 'nvcc --version'. For multi-node deployments, run 'ibstat' to verify InfiniBand adapter presence and confirm the adapter is connected to a non-blocking switch fabric.

  2. Build and install DeepEP with NVSHMEM

    Clone the repository: 'git clone https://github.com/deepseek-ai/DeepEP && cd DeepEP'. Set environment flags: 'export NVSHMEM_HOME=/usr/local/nvshmem' (or wherever NVSHMEM is installed from the NVIDIA package manager). Build: 'pip install -e . --no-build-isolation'. Verify with 'python -c "import deep_ep; print(deep_ep.__version__)"'. If NVSHMEM is not pre-installed, DeepEP's setup.py will attempt to pull it via the NVIDIA package manager (requires network access and NVIDIA NGC credentials for some versions).

  3. Install DeepGEMM

    Install directly from GitHub: 'pip install git+https://github.com/deepseek-ai/DeepGEMM'. DeepGEMM uses JIT compilation, so the first inference call for each (M, N, K) shape triggers a kernel compile step of 1-3 seconds. Subsequent calls for the same shape use the compiled kernel cache. Pre-warm by running a short benchmark pass before serving live traffic.

  4. Launch SGLang with DeepEP and DeepGEMM enabled

    SGLang detects and uses both libraries automatically when installed. Start the server: 'python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --enable-dp-attention --dp 8 --mem-fraction-static 0.88 --port 30000'. For B200 clusters, add '--dtype fp8'. For DeepEP-specific tuning, set the environment variable 'SGLANG_DEEPEP_CHUNK_SIZE=128' (adjust down to 64 for lower-latency workloads, up to 256 for throughput-bound batch jobs).

  5. Launch vLLM with DeepGEMM and expert parallelism

    vLLM uses DeepGEMM when installed and the GPU supports FP8. Enable expert parallelism: 'vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --enable-expert-parallel --dtype fp8 --max-model-len 32768 --gpu-memory-utilization 0.90 --port 8000'. DeepEP integration is not yet upstream in vLLM as of May 2026; check the vLLM issue tracker for current status before trying '--enable-deepep'. For production DeepEP deployments, SGLang is the more reliable choice.

  6. Tune dispatch chunk size and expert parallel size

    Start with expert-parallel-size equal to your GPU count. For throughput-bound batch jobs, increase the dispatch chunk size (SGLANG_DEEPEP_CHUNK_SIZE) to 256 to amortize dispatch overhead over larger token batches. For latency-sensitive serving where TTFT matters, drop chunk size to 64 and consider reducing expert-parallel-size to 4 on an 8-GPU node, assigning the remaining 4 GPUs to tensor parallelism for the attention layers. Measure TTFT and tokens/sec with both settings using the SGLang bench_serving script before committing to a configuration.

FAQ / 05

Frequently Asked Questions

How does DeepEP differ from vLLM's built-in expert parallelism?

DeepEP is a dedicated all-to-all communication library for the expert dispatch and combine steps in MoE inference. It overlaps GPU computation with inter-GPU dispatch (via NVSHMEM in V1 or NCCL Gin in V2), hiding most of the inter-GPU dispatch latency. vLLM's built-in expert parallelism uses standard NCCL all-to-all, which blocks compute until communication finishes. On a well-connected 8x B200 NVLink cluster, DeepEP reduces dispatch overhead from ~15% of step time to ~3-5%.

Which GPUs can run DeepGEMM's FP8 kernels?

DeepGEMM's FP8 grouped GEMM kernels run on any Hopper or Blackwell GPU with FP8 Tensor Core support: H100 SXM5, H100 PCIe, H200 SXM5, and B200 SXM6. They do not run on A100 (Ampere has no hardware FP8). H100 PCIe runs DeepGEMM but lacks the NVLink bandwidth that DeepEP needs for its NVSHMEM fast path. On H200, FP8 throughput is roughly 2x the BF16 rate. On B200, native FP8 with microscaling gives another 1.5-2x on top of H200.

Do I need NVSHMEM to use DeepEP?

DeepEP V1 required NVSHMEM. DeepEP V2 (current upstream as of mid-2026) uses an NCCL Gin backend by default and only depends on NVSHMEM for legacy code paths. For V2, NVSHMEM is optional; the standard install path is 'pip install "nvidia-nccl-cu13>=2.30.4" --no-deps'. If you are pinning to V1: DeepEP V1's low-latency dispatch path uses NVSHMEM for RDMA-based symmetric memory transfers between GPUs. Without NVSHMEM, V1 falls back to a slower NCCL-based path that loses most of the overlap benefit. NVSHMEM requires CUDA 12.3+ and a supported NVLink or InfiniBand topology. On NVLink-connected nodes (SXM form factors), NVSHMEM uses NVLink directly for intra-node GPU-to-GPU transfers. For multi-node deployments across separate hosts, InfiniBand is required for NVSHMEM RDMA. On PCIe-connected nodes without InfiniBand, DeepEP loses most of its overlap benefit regardless of version.

Which models benefit from DeepEP and DeepGEMM?

DeepGEMM works with any model that uses grouped GEMM for MoE expert layers: GLM-5.1, Kimi K2.6, Mistral MoE models, and Mixtral all qualify. DeepEP's all-to-all communication library is model-agnostic: it handles the dispatch and combine stages for any expert-parallel setup. SGLang's MoE backends call both libraries automatically when they are installed and the topology supports it.

How many GPUs do I need to see a benefit from DeepEP?

You need at least 2 GPUs with NVLink connectivity (SXM form factor) or InfiniBand between nodes. DeepEP's overlap is most effective at 4x or 8x GPU configurations where the all-to-all communication time is large enough to hide behind computation. On a single GPU or PCIe multi-GPU without InfiniBand, install DeepGEMM for the FP8 GEMM wins but skip DeepEP's NVSHMEM path.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.