Dense attention at 1M context costs roughly 180,000 TFLOP of attention compute per forward pass on a 70B-class model. That cost grows quadratically with sequence length: double the context, quadruple the attention FLOPs. DeepSeek Sparse Attention (DSA) breaks that scaling by attending over a fixed top-2048 tokens per query, cutting attention compute by ~98% at 128K context without sacrificing long-range retrieval accuracy. This guide covers the full DSA deployment stack: how the two-stage mechanism works, hardware requirements, vLLM and SGLang setup, benchmarks, and when to use DSA over ring attention or NVMe KV offloading. For the GLM-5.1 base deployment (weight download, multi-GPU setup, and quantization), see the full GLM-5.1 deployment guide. For DeepSeek V4 multi-GPU configuration, see the DeepSeek V4 deployment guide.
What Is DeepSeek Sparse Attention
DSA is the attention primitive that powers long-context inference in GLM-5.1 (754B MoE) and DeepSeek V4 (1T MoE). Both models are trained with it, so the sparsity pattern is learned rather than heuristic.
The mechanism works in two stages. In the first stage, a lightweight "lightning indexer" runs compressed query and key projections to produce a relevance score for every token block in the context. The projections are low-rank, so this step is fast: roughly 1-5% of the compute of a full attention pass. The output is a sorted list of block scores, one per position group.
In the second stage, the model loads only the top-K blocks from GPU HBM and runs standard attention over the sparse selection. K is fixed at 2048 tokens regardless of context length, so the active compute KV footprint is constant at ~0.6 GB. The key/value tensors for unselected tokens are never transferred from HBM to the compute units.
At 128K context this matters a lot. Dense BF16 KV for a 70B-class model at 128K takes ~40 GB per concurrent user. With DSA selecting a fixed 2048 tokens, the active compute window stays at ~0.6 GB, enabling far higher concurrency on the same hardware compared to dense attention.
DSA vs FlashAttention, Ring Attention, and Sliding-Window Attention
These techniques solve different problems. Knowing which to use requires understanding where each one actually applies.
| Technique | FLOPs complexity | KV memory per user at 128K | Multi-GPU required for 128K | Long-range retrieval |
|---|---|---|---|---|
| Dense + FlashAttention | O(S²) | ~40 GB (BF16) | No | Full, 100% |
| Sliding-Window (4K) | O(S × 4K) | ~1.2 GB | No | Window only, misses distant tokens |
| Ring Attention (8-GPU) | O(S²/8) per GPU | ~5 GB per GPU | Yes (8 GPUs) | Full, 100% |
| DSA | O(S × 2048) fixed K | ~0.6 GB (active compute) | No | Adaptive, ~98-99% on RULER |
FlashAttention is an IO-aware kernel, not a sparsity technique. It attends over every token but reads HBM in tiles that fit in SRAM, reducing the number of HBM round-trips. It reduces memory bandwidth pressure but not FLOPs. DSA reduces both FLOPs and memory bandwidth by skipping all tokens outside the top-2048 selected per query.
Sliding-window attention is a hard-coded local approximation. It drops everything outside a fixed window (often 4K-8K tokens). This works fine for tasks that only need recent context, but it catastrophically fails on retrieval tasks where the relevant token is at position 0 of a 1M-token context. DSA's learned indexer scores by content relevance, not recency, so it handles long-range retrieval correctly.
Ring Attention distributes the full context across GPUs via a communication ring, keeping full attention but spreading the memory cost. It is the right tool for contexts above 1M tokens. At 128K-512K, DSA achieves comparable memory savings without the inter-GPU communication overhead. See the Ring Attention deployment guide for the multi-GPU sequence parallelism setup. For contexts under 500K where HBM is the constraint, the NVMe KV cache offloading guide covers the disk-tier approach.
DSA and Ring Attention are not mutually exclusive. You can run DSA within each Ring Attention rank to reduce per-rank FLOPs further on long contexts.
The Two-Stage DSA Pattern
The lightning indexer (stage 1) uses low-rank projections of the query and key tensors to produce a block-level relevance score without computing the full attention matrix. In PyTorch-like pseudocode:
# Stage 1: Lightning indexer
# Q_proj, K_proj: low-rank projections (rank << d_head); K_proj has one row per block
block_scores = (Q_proj @ K_proj.T) # shape: [n_queries, n_blocks]
block_scores = block_scores.softmax(dim=-1)
# Select top-K blocks (K = 2048 tokens fixed; n_blocks_to_select = 2048 // block_size)
topk_block_indices = block_scores.topk(k=2048 // block_size, dim=-1).indices
# Stage 2: Fine-grained token selection
# View K and V as [n_blocks, block_size, d_head] so block indices address whole blocks
K_blocks = K.view(-1, block_size, d_head)
V_blocks = V.view(-1, block_size, d_head)
# Load only the KV blocks at topk_block_indices from HBM
selected_K = K_blocks[topk_block_indices].flatten(1, 2) # sparse KV: [n_queries, 2048, d_head]
selected_V = V_blocks[topk_block_indices].flatten(1, 2)
# Standard attention over sparse selection
attn_output = flash_attention(Q, selected_K, selected_V)
The indexer projection matrices are learned during training, not hand-crafted. This means the sparsity pattern is content-adaptive: a query about a topic from chapter 1 of a long document will score chapter 1 blocks highly regardless of their position in the sequence.
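The content-adaptive behaviour described above can be illustrated with a small, dependency-free sketch. It is not the DSA kernel: the learned low-rank indexer is replaced by an assumed stand-in (scoring each block by the query's dot product with the block's mean key), and the sizes (1,024 tokens, 16-token blocks, 64 selected tokens) are toy values chosen so the example runs instantly. The function name `two_stage_sparse_attention` is hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_sparse_attention(q, keys, values, block_size, top_k_tokens):
    """Score blocks cheaply, then attend only over tokens in the best blocks."""
    n_blocks = len(keys) // block_size
    # Stage 1 (stand-in indexer, an assumption): score = q . mean(keys in block).
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        summary = [sum(col) / block_size for col in zip(*block)]
        block_scores.append(dot(q, summary))
    n_select = top_k_tokens // block_size
    ranked = sorted(range(n_blocks), key=lambda b: block_scores[b], reverse=True)
    selected = sorted(ranked[:n_select])
    # Stage 2: softmax attention restricted to tokens in the selected blocks.
    token_ids = [t for b in selected for t in range(b * block_size, (b + 1) * block_size)]
    logits = [dot(q, keys[t]) / math.sqrt(len(q)) for t in token_ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[t][i] for w, t in zip(weights, token_ids)) / total
              for i in range(dim)]
    return output, selected

random.seed(0)
d, S, block_size, top_k = 8, 1024, 16, 64
q = [1.0] * d
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
# Plant a "needle": the first block's keys align strongly with the query.
for t in range(block_size):
    keys[t] = [3.0] * d

output, selected = two_stage_sparse_attention(q, keys, values, block_size, top_k)
# A sliding window with the same token budget only sees the last blocks.
window_blocks = list(range(S // block_size - top_k // block_size, S // block_size))
print("content-based selection includes block 0:", 0 in selected)
print("sliding window includes block 0:", 0 in window_blocks)
```

In this toy setup the planted block at the very start of the sequence is selected by the content-based scorer, while a sliding window with the same 64-token budget never reaches it.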
The --dsa-block-size parameter controls the granularity of block scoring. Smaller blocks give finer-grained selection (lower approximation error) but more indexer overhead. The hardware-optimal values are:
- H100 SXM5 / H200 SXM5 (HBM3/HBM3e, 3-4.8 TB/s): block-size 64
- B200 SXM6 (HBM3e, 8 TB/s): block-size 128
The B200's higher bandwidth can load larger blocks faster, so the indexer overhead per selected block is lower and you can afford coarser block granularity.
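To make the granularity trade-off concrete, the sketch below counts how many blocks the indexer has to score and how many blocks stage 2 loads for a given block size. The helper `indexer_workload` is hypothetical, and the `heads=64` and `bytes_per_score=4` defaults are assumptions borrowed from the score-tensor figures this guide uses in its pitfalls section, not values read from any kernel.

```python
TOP_K_TOKENS = 2048  # fixed selection size from the text

def indexer_workload(context_tokens, block_size, heads=64, bytes_per_score=4):
    """Return (blocks scored, blocks loaded, score-tensor bytes) for one request."""
    blocks_scored = context_tokens // block_size
    blocks_loaded = TOP_K_TOKENS // block_size
    scratch_bytes = blocks_scored * heads * bytes_per_score
    return blocks_scored, blocks_loaded, scratch_bytes

for block_size in (32, 64, 128):
    scored, loaded, scratch = indexer_workload(1_048_576, block_size)
    print(f"block-size {block_size}: score {scored:,} blocks, "
          f"load {loaded} blocks, {scratch / 2**20:.0f} MiB of scores")
```

Doubling the block size halves both the number of scores the indexer produces and the number of separate blocks fetched, at the cost of coarser selection.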
Memory and FLOPs Math: Compute Reduction at Scale
For a 70B-class frontier model (80 layers, 8 GQA KV heads, 128 head dim), the attention FLOPs formula per forward pass is:
Dense FLOPs = 2 × L × H_kv × d_head × S²
Substituting the parameters: 2 × 80 × 8 × 128 × S² = 163,840 × S²
DSA attends over a fixed K=2048 tokens for each of the S query positions, so the active compute is:
DSA FLOPs = 2 × L × H_kv × d_head × K × S = (K/S) × Dense FLOPs, where K = 2048 (fixed)
At 128K context (S=131,072): K/S = 2048/131,072 = 1.56%, giving ~98.4% FLOPs reduction. At 512K the savings reach ~99.6%. The reduction grows with context length because K is fixed while S grows.
The active compute KV footprint is also constant: 2 × L × H_kv × d_head × K × 2 bytes regardless of S. For the 70B model parameters above, that is ~0.6 GB per query pass at any context length:
| Context Length | Dense Attention FLOPs | DSA FLOPs (K=2048 fixed) | KV Memory Dense (BF16) | DSA Active Compute KV |
|---|---|---|---|---|
| 32K | ~176 TFLOP | ~11 TFLOP | ~10 GB | ~0.6 GB |
| 128K | ~2,815 TFLOP | ~44 TFLOP | ~40 GB | ~0.6 GB |
| 512K | ~45,036 TFLOP | ~176 TFLOP | ~160 GB | ~0.6 GB |
| 1M | ~180,144 TFLOP | ~352 TFLOP | ~320 GB | ~0.6 GB |
The constant active compute KV means DSA's per-query HBM bandwidth cost does not scale with context length, enabling far higher concurrency than dense attention on the same hardware.
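The table can be reproduced directly from the two formulas in this section. The sketch below is a plain calculator under the same assumptions as the text (80 layers, 8 KV heads, head dim 128, K = 2048, BF16); the function names are illustrative. Memory is printed in GiB (2^30 bytes), which is the unit the table's rounded "GB" figures correspond to.

```python
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # 70B-class parameters from the text
TOP_K = 2048                              # fixed DSA selection size
BYTES_BF16 = 2

def dense_flops(seq_len):
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * seq_len ** 2

def dsa_flops(seq_len):
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * TOP_K * seq_len

def kv_bytes(tokens):
    # The leading 2 covers keys and values.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens * BYTES_BF16

for label, s in (("32K", 32_768), ("128K", 131_072), ("512K", 524_288), ("1M", 1_048_576)):
    reduction = 1 - dsa_flops(s) / dense_flops(s)
    print(f"{label}: dense {dense_flops(s) / 1e12:,.0f} TFLOP, "
          f"DSA {dsa_flops(s) / 1e12:,.0f} TFLOP ({reduction:.1%} less), "
          f"dense KV {kv_bytes(s) / 2**30:.0f} GiB, "
          f"DSA active KV {kv_bytes(TOP_K) / 2**30:.3f} GiB")
```

Because K is fixed, the reduction is simply 1 - K/S, so it keeps improving as the context grows while the active KV term stays constant.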
Hardware Requirements: H100, H200, and B200 for DSA Workloads
The DSA lightning indexer is bandwidth-bound at 512K tokens. The indexer step reads block-level K projections sequentially, and the access pattern is irregular enough that HBM bandwidth rather than FLOP throughput determines latency. Higher HBM bandwidth directly reduces indexer latency.
| GPU | VRAM | HBM Bandwidth | Max DSA context (single user, 70B FP8) | On-demand (from) | Spot (from) |
|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | 3.35 TB/s | ~32K tokens | $5.01/hr | $1.46/hr |
| H200 SXM5 | 141 GB | 4.8 TB/s | ~220K tokens | $5.92/hr | $3.31/hr |
| B200 SXM6 | 192 GB | 8 TB/s | ~370K tokens | $8.61/hr | $5.34/hr |
"Max DSA context" here means the context length where the full stored KV cache for one user fits within one GPU's VRAM alongside 70B FP8 model weights (~70 GB). Using the BF16 KV formula above (327,680 bytes per token), H100's 80 GB leaves ~10 GB for stored KV, covering ~32K tokens. H200's 141 GB leaves ~70 GB, covering ~220K. B200's 192 GB leaves ~120 GB, reaching ~370K.
For H200 SXM5 on Spheron (4.8 TB/s HBM3e, 141 GB), the index step runs in roughly 2-4 ms at 512K context, contributing less than 5% overhead to prefill time. At 1M context the index step on H200 takes ~8 ms; because the step is bandwidth-bound, moving to B200 SXM6 (8 TB/s) should cut that overhead by roughly 40%. For workloads staying at 128K context, H100 SXM5 instances are cost-efficient and the 3.35 TB/s bandwidth is not a bottleneck.
For GLM-5.1 (754B FP8, requires 8x H200) or DeepSeek V4 (500 GB FP8 weights, requires 4x H200 minimum), DSA context limits scale with the pooled VRAM remaining after weights.
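The "Max DSA context" column in the hardware table can be approximated with a few lines of arithmetic. This is a rough sketch under stated assumptions: decimal gigabytes, 70 GB of weights, 327,680 bytes of BF16 KV per token, and no allowance for activations or framework overhead. The helper name `max_stored_context` is hypothetical.

```python
KV_BYTES_PER_TOKEN = 327_680   # BF16 KV per token for the 70B-class parameters
WEIGHTS_GB = 70                # 70B FP8 weights, as stated in the text

def max_stored_context(vram_gb):
    """Tokens of stored KV that fit next to the weights (decimal GB, no other overhead)."""
    free_bytes = (vram_gb - WEIGHTS_GB) * 1e9
    return int(free_bytes // KV_BYTES_PER_TOKEN)

for gpu, vram_gb in (("H100 SXM5", 80), ("H200 SXM5", 141), ("B200 SXM6", 192)):
    print(f"{gpu}: ~{max_stored_context(vram_gb):,} tokens")
```

The results land close to the table's rounded figures; real limits will be lower once runtime overhead and `--gpu-memory-utilization` are accounted for. For multi-GPU models, the same calculation applies to the pooled VRAM left after weights.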
Deploy GLM-5.1 with DSA in vLLM
The full GLM-5.1 deployment guide covers weight download, multi-GPU NVLink topology, and base vLLM configuration. This section focuses specifically on enabling and tuning the DSA attention backend on top of that base setup.
Step 1: Install vLLM (main branch as of June 2026) and verify CUDA
pip install "vllm>=0.19.0"
nvcc --version # Must be CUDA 12.4 or higher
python -c "import vllm; print(vllm.__version__)"
Step 2: Download GLM-5.1 FP8 weights
huggingface-cli download zai-org/GLM-5.1-FP8 \
--local-dir ./glm5-1-fp8 \
--repo-type model
This requires ~800 GB of NVMe storage. The download takes 20-60 minutes depending on your connection.
Step 3: Launch with DSA enabled
python -m vllm.entrypoints.openai.api_server \
--model ./glm5-1-fp8 \
--served-model-name glm5-1-fp8 \
--tensor-parallel-size 8 \
--quantization fp8 \
--enable-expert-parallel \
--max-model-len 131072 \
--attention-backend dsa \
--dsa-block-size 64 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.92 \
--port 8000
The --attention-backend dsa flag activates the two-stage sparse pattern. Set --dsa-block-size 64 for H200 instances. Use --dsa-block-size 128 on B200.
--enable-chunked-prefill is required for high-concurrency workloads. Without it, a single long-context prefill blocks decode steps for all other in-flight requests.
Step 4: Validate the endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm5-1-fp8",
"messages": [{"role": "user", "content": "Summarize this document: [your 128K token document here]"}],
"max_tokens": 512
}'
Step 5: Monitor GPU utilization
nvidia-smi dmon -s u -d 2
Watch for GPU utilization staying above 80% during prefill. Drops below 60% during prefill indicate the indexer is stalling on memory bandwidth: reduce --dsa-block-size by half and retest.
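If you capture the monitor output to a file, the 60% check can be automated. The sketch below is a minimal parser that assumes the usual `nvidia-smi dmon` layout: header rows starting with `#`, one of which names the columns (including `sm`), followed by whitespace-separated data rows. Column sets vary between driver versions, so the parser locates `sm` from the header rather than hard-coding a position. The function name and the sample text are illustrative, and the script cannot tell prefill from decode phases on its own.

```python
def low_utilization_samples(dmon_text, threshold=60):
    """Return (gpu_index, sm_percent) for samples whose SM utilization is below threshold."""
    sm_column = None
    flagged = []
    for line in dmon_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            # Header rows start with '#'; the one naming columns contains 'sm'.
            if "sm" in fields:
                sm_column = fields.index("sm") - 1  # data rows have no leading '#'
            continue
        if sm_column is None or len(fields) <= sm_column:
            continue
        if not fields[sm_column].isdigit():
            continue  # skip '-' placeholders or malformed rows
        sm = int(fields[sm_column])
        if sm < threshold:
            flagged.append((int(fields[0]), sm))
    return flagged

# Illustrative sample, not captured from a real run.
SAMPLE = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     91     48      0      0
    1     52     61      0      0
    0     88     47      0      0
    1     44     66      0      0
"""
print(low_utilization_samples(SAMPLE))
```

Correlate any flagged samples with request timestamps before changing --dsa-block-size, since idle periods between requests also show up as low utilization.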
Deploy DeepSeek V4 with DSA on Multi-GPU
The full DeepSeek V4 deployment guide covers FP8 weight sizing, TP vs EP configuration, and benchmark projections. This section adds the DSA attention backend on top.
DeepSeek V4 was officially released in April 2026 with day-0 vLLM support. Two variants are available: deepseek-ai/DeepSeek-V4-Pro (the full-scale model, recommended for long-context DSA inference) and deepseek-ai/DeepSeek-V4-Flash (smaller and faster). Both share the same DSA attention architecture.
Launch DeepSeek V4 Pro with DSA:
VLLM_ATTENTION_BACKEND=dsa python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V4-Pro \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--quantization fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--dsa-block-size 64 \
--enable-chunked-prefill \
--port 8000
You can pass --attention-backend dsa as a flag or set VLLM_ATTENTION_BACKEND=dsa as an environment variable. Both are equivalent.
For 8x H200, set --tensor-parallel-size 8. The extra VRAM removes the tight headroom of the 4-GPU setup and leaves more room for KV cache, extending per-user context from ~128K to ~256K with DSA.
Verify DSA is active:
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V4-Pro \
--attention-backend dsa \
--log-level debug 2>&1 | grep -i "attention backend\|dsa"
A debug log line like Attention backend: DSA (block_size=64) confirms the backend loaded correctly.
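For automated health checks, the same confirmation can be scripted. This sketch assumes the log line has exactly the shape quoted above ("Attention backend: DSA (block_size=64)"); the real wording may differ between builds, so treat the pattern as something to adjust after inspecting your own logs. The `parse_backend` helper is hypothetical.

```python
import re

# Pattern modeled on the example log line in this guide; adjust to your build's output.
LOG_PATTERN = re.compile(r"Attention backend:\s*(\w+)\s*\(block_size=(\d+)\)")

def parse_backend(log_text):
    """Return (backend, block_size) from the first matching log line, or None."""
    match = LOG_PATTERN.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_backend("DEBUG Attention backend: DSA (block_size=64)"))
print(parse_backend("INFO server started"))
```

A `None` result means no matching line was found, which is worth treating as a failed check rather than assuming the backend loaded.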
SGLang DSA Deployment
Recent SGLang builds (v0.5.9+ as of June 2026) support DSA for GLM-5.1 and DeepSeek models via the Native Sparse Attention (NSA) backend. The flag names differ from vLLM:
python -m sglang.launch_server \
--model-path ./glm5-1-fp8 \
--tp 8 \
--quantization fp8 \
--context-length 131072 \
--attention-backend dsa \
--nsa-prefill-backend trtllm \
--nsa-decode-backend trtllm \
--port 30000
The --nsa-prefill-backend trtllm and --nsa-decode-backend trtllm flags route DSA compute through TRT-LLM DSA kernels, which benchmark at 3x-5x the throughput of the native CUDA kernel on Blackwell hardware. On Hopper (H100, H200), the default CUDA path is used.
For a detailed SGLang vs vLLM throughput comparison on long-context workloads, see the vLLM vs TRT-LLM vs SGLang benchmark guide.
Benchmark Walkthrough: Latency, Throughput, and Accuracy
Run the throughput benchmark from the vLLM repository to compare DSA vs dense baselines:
# DSA run
python benchmarks/benchmark_throughput.py \
--model ./glm5-1-fp8 \
--attention-backend dsa \
--dsa-block-size 64 \
--input-len 32768 \
--output-len 512 \
--num-prompts 50
# Dense FlashAttention baseline
python benchmarks/benchmark_throughput.py \
--model ./glm5-1-fp8 \
--attention-backend flash_attn \
--input-len 32768 \
--output-len 512 \
--num-prompts 50
Projected benchmark results on 8x H200 SXM5 (based on vLLM MoE benchmarks and DSA throughput data from the GLM-5.1 paper):
| Context Length | Model | Dense TPS | DSA TPS | Speedup | TTFT Dense | TTFT DSA | RULER delta |
|---|---|---|---|---|---|---|---|
| 32K | GLM-5.1 FP8 | ~850 | ~1,100 | 1.3x | ~4.2s | ~3.1s | 0% |
| 128K | GLM-5.1 FP8 | ~420 | ~860 | 2.0x | ~18s | ~8s | -1% |
| 512K | GLM-5.1 FP8 | ~105 | ~340 | 3.2x | ~72s | ~22s | -2% |
The RULER accuracy delta is small: DSA shows a ~1% drop at 128K and ~2% at 512K on the RULER long-context benchmark. On needle-in-haystack tasks at 128K, DSA retrieval rate is near-identical to dense because the indexer reliably scores the relevant tokens highly. The accuracy gap widens at 1M tokens.
At 32K context the speedup is modest (1.3x) because the indexer overhead consumes some of the FLOPs savings. The crossover point where DSA is clearly better is around 64K tokens: above that, the ~97% FLOPs reduction dominates the indexer overhead.
Common Deployment Pitfalls
1. KV cache layout mismatch
DSA requires block-sparse KV layout, not the dense paged layout used by the default FlashAttention backend. If you switch from flash_attn to dsa without a clean restart, vLLM may read KV blocks written in the wrong format. Always restart the server when changing --attention-backend, not just reload.
Fix: kill -9 $(lsof -ti:8000) && python -m vllm.entrypoints.openai.api_server ... --attention-backend dsa
2. Index-step memory overflow at 1M context
The indexer needs O(S / block_size) scratch memory for block scores. At 1M context with block-size 32, the score tensor is 32,768 blocks × 64 heads × 4 bytes = ~8 MB per request. At high concurrency across many simultaneous requests, these scratch buffers can accumulate and overflow GPU memory.
Fix: increase --dsa-block-size from 32 to 64 or 128 to reduce score tensor size, or lower --max-num-seqs to cap active concurrent requests.
3. Prefill bottleneck with long inputs
The indexer runs only during prefill, adding ~1.2x prefill latency at 512K context compared to DSA decode-only serving. Plan for higher TTFT on first-turn long-context requests.
Fix: enable chunked prefill (--enable-chunked-prefill) so the prefill indexer step interleaves with decode steps for other requests, reducing head-of-line blocking.
4. Wrong --dsa-block-size for your GPU
B200 performs best with block-size 128; H200 and H100 perform best with block-size 64. The wrong value degrades throughput by 15-30% and in some cases causes precision warnings from the indexer kernel.
Fix: set --dsa-block-size 128 for B200 instances and --dsa-block-size 64 for H100/H200.
5. Expert parallel and DSA not both enabled on MoE models
On GLM-5.1 and DeepSeek V4, you need both --enable-expert-parallel and --attention-backend dsa. Running expert parallelism without DSA leaves significant FLOPs savings on the table for attention layers (98%+ at 128K context). Running DSA without expert parallelism limits expert routing efficiency.
Fix: include both flags together: --enable-expert-parallel --attention-backend dsa.
6. Missing --enable-chunked-prefill at high concurrency
Without chunked prefill, a 512K-token prefill blocks all other in-flight decode requests for the duration of the prefill step. DSA does not eliminate this blocking; it only reduces the prefill FLOPs. Chunked prefill interleaves prefill chunks with decode steps so the blocking window is bounded.
Fix: add --enable-chunked-prefill to the vLLM launch command for any production setup handling more than 2 concurrent users.
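The effect of chunked prefill on head-of-line blocking can be estimated with simple arithmetic. The sketch below assumes prefill time scales linearly with the number of tokens processed, which is a simplification, and takes the ~22 s DSA TTFT at 512K from the projected benchmark table. The 8,192-token chunk size is a hypothetical value for illustration, not a vLLM default, and `max_decode_stall_ms` is an illustrative helper.

```python
def max_decode_stall_ms(prompt_tokens, total_prefill_ms, chunk_tokens=None):
    """Longest time a decode step waits behind one uninterrupted piece of prefill."""
    if chunk_tokens is None or chunk_tokens >= prompt_tokens:
        return total_prefill_ms  # unchunked: the whole prompt runs as one step
    return total_prefill_ms * chunk_tokens / prompt_tokens

PROMPT_TOKENS = 524_288
PREFILL_MS = 22_000    # ~22 s DSA TTFT at 512K, from the projected benchmark table
CHUNK_TOKENS = 8_192   # hypothetical chunk size

unchunked = max_decode_stall_ms(PROMPT_TOKENS, PREFILL_MS)
chunked = max_decode_stall_ms(PROMPT_TOKENS, PREFILL_MS, CHUNK_TOKENS)
print(f"unchunked stall: {unchunked:,.0f} ms")
print(f"chunked stall:   {chunked:,.2f} ms per chunk")
```

Total prefill work is unchanged; chunking only bounds how long any single decode step can be delayed.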
Spheron GPU Cloud Pricing for DSA Long-Context Inference
| GPU | On-demand (from) | Spot (from) | Recommended for DSA |
|---|---|---|---|
| H100 SXM5 80 GB | $5.01/hr | $1.46/hr | 32K-128K context, single-user |
| H200 SXM5 141 GB | $5.92/hr | $3.31/hr | 128K-512K context, multi-user |
| B200 SXM6 192 GB | $8.61/hr | $5.34/hr | 512K-1M context, peak throughput |
For cost-per-million-tokens math at 128K context with DSA enabled:
On 8x H200 at spot ($3.31/hr each = $26.48/hr cluster), GLM-5.1 FP8 at 128K with DSA achieves roughly 860 tokens/sec output throughput. At that rate, 1 million output tokens takes ~19.4 minutes, costing $26.48/hr × (19.4/60) hr = ~$8.56 per million output tokens.
Without DSA (dense attention), throughput drops to ~420 tokens/sec at 128K. Cost per million output tokens rises to ~$17.51, more than double.
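The cost arithmetic above generalizes to any hourly rate and throughput. A minimal sketch, using the spot price and projected throughput figures from this section as inputs (the function name is illustrative):

```python
def cost_per_million_output_tokens(cluster_usd_per_hour, tokens_per_second):
    """Dollars to generate one million output tokens at a steady throughput."""
    seconds = 1_000_000 / tokens_per_second
    return cluster_usd_per_hour * seconds / 3600

cluster_rate = 8 * 3.31  # 8x H200 at the spot price quoted above
dsa_cost = cost_per_million_output_tokens(cluster_rate, 860)
dense_cost = cost_per_million_output_tokens(cluster_rate, 420)
print(f"DSA:   ${dsa_cost:.2f} per million output tokens")
print(f"Dense: ${dense_cost:.2f} per million output tokens")
```

The DSA figure may differ from the text by about a cent because the text rounds the generation time to 19.4 minutes before multiplying.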
Pricing fluctuates based on GPU availability. The prices above are based on 05 Jun 2026 and may have changed. Check current GPU pricing for live rates.
When to Use DSA vs Ring Attention or NVMe KV Offloading
| Scenario | Best technique | Why |
|---|---|---|
| 32K-128K context, high concurrency | DSA | 94-98% FLOPs reduction, no multi-GPU needed, content-adaptive retrieval |
| 128K-512K context, single user | DSA | Active KV fits on one H200, index step overhead is <5% of prefill |
| 512K-1M context, multi-user | DSA + multi-GPU pooled VRAM or ring attention | DSA active compute KV is constant ~0.6 GB per query; B200 (192 GB) covers stored KV up to ~370K tokens per user on a single GPU. Full 1M context requires multi-GPU pooled VRAM or ring attention |
| >1M context, multi-user | Ring Attention (DSA per rank optional) | DSA alone cannot fit multi-user 1M KV even on B200; ring shards across GPUs |
| High prefix cache hit rate, <500K | NVMe KV offloading first, DSA optional | If 60%+ of requests share a long prefix, NVMe prefix caching may outperform DSA's memory savings on cold requests |
For multi-million-token contexts, the Ring Attention guide covers the full multi-GPU setup. For contexts under 500K where VRAM overflow is the only problem (not compute), the NVMe KV offloading approach may require less configuration, especially if you already have LMCache deployed.
DSA is the right default for models that support it (GLM-5.1, DeepSeek V4) when your context is 128K-512K and you want maximum concurrency from a single-node deployment. Above 1M tokens or when serving many simultaneous users each needing full 1M KV, combine DSA with ring attention rather than choosing one over the other.
DSA inference at 128K context cuts attention FLOPs by ~98% using a fixed top-2048 token selection, but the index step is HBM-bandwidth-bound at 512K+ tokens. H200 SXM5 instances on Spheron (4.8 TB/s HBM3e, 141 GB) cover GLM-5.1 and DeepSeek V4 DSA deployments from 128K to 512K context with on-demand and spot options.
Quick Setup Guide
1. Calculate KV cache size with and without DSA. At 128K context, BF16 KV for a 70B-class model is roughly 40 GB per concurrent user. DSA with a fixed top-2048 token selection reduces the active compute KV footprint to about 0.6 GB per query pass, enabling far higher concurrency than dense attention on the same hardware. At 512K tokens, full KV is 160 GB per user while the DSA active compute window stays constant at 0.6 GB.
2. Log in to app.spheron.ai and select an H200 SXM5 141 GB or B200 SXM6 192 GB instance. For GLM-5.1 FP8, provision 8x H200 (1,128 GB total VRAM). For DeepSeek V4 Pro FP8, the minimum is 4x H200 (564 GB). Choose on-demand for interactive serving with guaranteed uptime, or spot for batch jobs where cost savings offset potential preemption.
3. On your GPU instance: pip install 'vllm>=0.19.0'. Confirm CUDA 12.4+ is installed (nvcc --version). Verify the installation: python -c 'import vllm; print(vllm.__version__)'. The DSA attention backend ships with vLLM for GLM-5.1 and DeepSeek V4 models, no separate package needed.
4. Run: python -m vllm.entrypoints.openai.api_server --model zai-org/GLM-5.1-FP8 --tensor-parallel-size 8 --quantization fp8 --enable-expert-parallel --max-model-len 131072 --attention-backend dsa --dsa-block-size 64 --port 8000. The --attention-backend dsa flag activates the two-stage sparse pattern. Set --dsa-block-size to 64 for H200 (optimized for 4.8 TB/s HBM) or 128 for B200 (8 TB/s HBM).
5. Run: VLLM_ATTENTION_BACKEND=dsa python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V4-Pro --tensor-parallel-size 4 --enable-expert-parallel --quantization fp8 --max-model-len 131072 --gpu-memory-utilization 0.90 --port 8000. Use deepseek-ai/DeepSeek-V4-Flash for the smaller variant; both share the same DSA attention architecture. Scale tensor-parallel-size to match your GPU count (8 for an 8x H100 or H200 cluster).
6. Run the throughput benchmark: python benchmarks/benchmark_throughput.py --model zai-org/GLM-5.1-FP8 --attention-backend dsa --input-len 32768 --output-len 512 --num-prompts 50. Repeat with --attention-backend flash_attn for the dense baseline. Try --dsa-block-size 32, 64, and 128 and pick the value that maximizes tokens/sec on your specific GPU SKU. Monitor the index-step timing in the vLLM --log-level debug output to confirm the indexer is not bottlenecking prefill.
Frequently Asked Questions
What is DeepSeek Sparse Attention (DSA)?
DeepSeek Sparse Attention (DSA) is a two-stage hardware-aware attention mechanism used in GLM-5.1 and DeepSeek V4. The first stage runs a lightweight 'lightning indexer' using compressed query/key projections to score every token block in the context. The second stage loads and attends over only the top-K blocks, specifically a fixed 2048 tokens regardless of context length. At 128K context, this reduces attention FLOPs by roughly 98% compared to dense attention, enabling practical long-context inference on a single H200 without ring attention or NVMe offloading.
How does DSA differ from FlashAttention, Ring Attention, and sliding-window attention?
FlashAttention is an IO-aware dense attention kernel that attends over every token but in a tiling pattern that reduces HBM reads. Ring Attention distributes the full sequence across multiple GPUs. Sliding-window attention hard-codes a fixed local window (e.g., 4K tokens) and drops everything outside it. DSA uses learned, content-adaptive sparsity: the indexer picks the top-2048 tokens that matter for each query regardless of position (a fixed count, not a fraction of context), so DSA can retrieve critical tokens from the start of a 1M-token context, which sliding-window attention cannot. DSA and Ring Attention are complementary: you can run DSA within each Ring Attention rank to reduce per-rank FLOPs further.
Which GPUs can run DSA?
The DSA lightning indexer runs on H100, H200, and B200 via standard CUDA kernels. H100 SXM5 (80 GB, 3.35 TB/s) is the practical entry point but the index step is HBM-bandwidth-bound, and that latency dominates at 512K+ tokens. H200 SXM5 (141 GB, 4.8 TB/s) is the production sweet spot for 128K multi-user DSA serving. B200 SXM6 (192 GB, 8 TB/s) handles 512K-1M token DSA without index-step bottleneck and is the recommended choice for GLM-5.1 or DeepSeek V4 at million-token context.
Does vLLM support GLM-5.1 with DSA?
Yes. Recent vLLM builds (main branch as of June 2026) support GLM-5.1 with the DSA attention backend. Use --attention-backend dsa when launching the server, or set the VLLM_ATTENTION_BACKEND=dsa environment variable. The DSA backend patches the attention layers to use the two-stage indexer-then-attend pattern. At 128K context, enabling DSA roughly doubles throughput on 8x H200 compared to the dense FlashAttention baseline, with a projected RULER accuracy drop of about 1%.
When should I use DSA instead of ring attention or NVMe KV offloading?
Use DSA when: context is 32K-512K tokens, you want single-GPU serving with maximum concurrency, and long-range token retrieval accuracy matters (DSA outperforms sliding-window on needle-in-haystack tasks). Use NVMe KV offloading when GPU HBM is the only bottleneck and context is under 500K. Use ring attention when context exceeds 1M tokens or when serving multiple concurrent requests each with full 1M-token KV cache, since DSA's memory savings alone cannot fit that KV on one GPU.
