Engineering

Ring Attention and Tree Attention on GPU Cloud: Sequence Parallelism for 10M-Token LLM Inference (2026 Guide)

Written by Mitrasish, Co-founder · May 14, 2026

Past 1M tokens, the standard tricks stop working. NVMe KV offloading tops out around 500K tokens before NVMe bandwidth itself becomes the bottleneck. Prefill-decode disaggregation helps throughput but doesn't reduce per-request KV memory. Serving Qwen 3.6 Plus or Kimi K2.6 at their advertised 4M-10M token contexts requires distributing the attention computation itself across GPUs. That's sequence parallelism.

For the foundations, see our KV Cache Optimization guide and NVMe KV Cache Offloading guide.

The 1M-Token Memory Wall

KV cache memory scales linearly with context length. At 128K tokens, a single Llama 3.1 70B request at BF16 uses ~43 GB just for KV. At 1M tokens, that same request needs ~328 GB. No single GPU has that much HBM, and NVMe bandwidth can't keep up once you scale to multi-million-token contexts.

Here's the math for a 70B-class model (80 layers, 8 GQA heads, 128 head dim):

| Context Length | BF16 KV (single request) | FP8 KV (single request) |
|---|---|---|
| 128K tokens | ~43 GB | ~21 GB |
| 500K tokens | ~164 GB | ~82 GB |
| 1M tokens | ~328 GB | ~164 GB |
| 4M tokens | ~1.3 TB | ~655 GB |
| 10M tokens | ~3.3 TB | ~1.6 TB |

The formula: KV_bytes = 2 × L × H_kv × D × S × bytes_per_element

At 1M tokens with FP8 quantization (~164 GB), the KV cache of a single request alone spans at least two H200s (141 GB each), leaving almost no room for model weights. At 4M tokens, FP8 KV alone (~655 GB) exceeds the combined HBM of four H200s.
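A quick way to sanity-check these numbers is to run the formula directly. A minimal sketch in Python, using the 70B-class constants above (swap in your own model's layer, KV-head, and head-dim values):

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV_bytes = 2 (K and V) x layers x KV heads x head dim x tokens x bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for tokens in (128_000, 500_000, 1_000_000, 4_000_000, 10_000_000):
    bf16_gb = kv_cache_bytes(tokens) / 1e9                   # BF16: 2 bytes per element
    fp8_gb = kv_cache_bytes(tokens, bytes_per_elem=1) / 1e9  # FP8: 1 byte per element
    print(f"{tokens:>11,} tokens: ~{bf16_gb:6.0f} GB BF16 | ~{fp8_gb:6.0f} GB FP8")
```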

The bottleneck shifts depending on scale:

  • Under 500K tokens: HBM is the binding constraint. NVMe offloading at 7 GB/s works because cold blocks load faster than recomputing them. This is exactly what NVMe KV Cache Offloading covers.
  • 500K-4M tokens: NVMe bandwidth becomes the bottleneck. Even with fast NVMe (7 GB/s), loading ~164-655 GB of FP8 KV data per request takes seconds per token generation step.
  • 4M-10M tokens: There is no single-node solution. The KV cache physically cannot fit in a reasonable number of GPUs' combined HBM unless you distribute the attention computation itself.

Sequence parallelism is the solution at the upper end of this curve. Instead of storing the full KV cache on every GPU, you shard it across a ring of GPUs so each one holds 1/N of the sequence.

Ring Attention vs Tree Attention vs Striped Attention

Three variants of sequence parallelism exist, each with different communication topology and performance tradeoffs:

| Property | Ring Attention | Tree Attention | Striped Attention |
|---|---|---|---|
| Topology | Circular ring | Reduce-tree (log N rounds) | Ring with interleaved chunks |
| Memory per GPU | O(1/N) sequence | O(1/N) sequence | O(1/N) sequence |
| Communication rounds | N steps (one full rotation) | O(log N) steps | N steps |
| Load balance (causal) | Poor (late-position ranks do several times more work) | Poor by default | Balanced |
| Numeric stability | Standard softmax | Requires log-sum-exp merge | Standard softmax |
| Best for | Non-causal models, encoder attention | Very large GPU counts (64+) | Decoder-only (GPT, Llama, Qwen) |

Ring Attention rotates KV shards around the ring one step per round, computing local attention while data is in transit. The communication-compute overlap is excellent when chunk sizes are tuned right. But for decoder-only (causal) models, the triangular attention mask means GPU 7 (holding positions 896K-1M) computes nearly dense attention over the whole prefix, while GPU 0 (holding positions 0-128K) computes only a small fraction of that work, since most of its blocks are masked out. That load imbalance leaves early-position ranks idle each round while late-position ranks become the bottleneck, wasting a large share of the ring's GPU time.

Tree Attention uses a binary reduce tree: each GPU computes local attention, then adjacent pairs merge attention outputs using log-sum-exp stabilization. The number of communication rounds drops from N to log2(N), which matters at ring sizes above 32. The tradeoff is that each merge step requires numerically stable softmax combination, adding implementation complexity and some accuracy risk at low precision.

Striped Attention (Brandon et al., 2023) fixes Ring Attention's load imbalance by assigning non-contiguous sequence chunks to each GPU. Instead of GPU 0 holding tokens 0-128K and GPU 7 holding 896K-1M, each GPU holds interleaved positions (GPU 0: 0, 8, 16, 24...; GPU 1: 1, 9, 17, 25...; etc.). Under the causal mask, every GPU now handles an approximately equal mix of heavily masked early positions and nearly dense late positions, so the unmasked work, and therefore GPU utilization, is balanced across the ring.
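A toy illustration of why striping helps (not library code; it simply counts how many unmasked key positions each rank's queries must attend to under a causal mask):

```python
# Compare causal-attention work per rank for contiguous (plain Ring Attention)
# vs striped token assignment. Work = number of unmasked keys per query, summed.
N_RANKS, SEQ = 8, 1_000_000

def work_per_rank(assign):
    work = [0] * N_RANKS
    for pos in range(SEQ):
        work[assign(pos)] += pos + 1   # query at position p attends to p + 1 keys
    return work

contiguous = work_per_rank(lambda p: p * N_RANKS // SEQ)  # rank 0: tokens 0..124999, ...
striped    = work_per_rank(lambda p: p % N_RANKS)         # rank 0: tokens 0, 8, 16, ...

for name, w in (("contiguous", contiguous), ("striped", striped)):
    print(f"{name:>10}: max/min work ratio = {max(w) / min(w):.1f}x")
# contiguous: ~15x imbalance (the last rank is the bottleneck); striped: ~1.0x
```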

The recommendation: Use Striped Attention for all causal decoder-only models. Plain Ring Attention is fine when the attention pattern is already balanced: bidirectional attention (encoders) or fixed-size sliding-window layers such as Gemma 3's local-attention layers, where every query attends to roughly the same number of keys. Tree Attention is worth exploring only when ring size exceeds 32 and latency rather than throughput is the primary constraint.

Hardware Requirements

Sequence parallelism is interconnect-limited. The ring communication volume per ring step scales with the KV shard size: moving a 1/N chunk of the sequence's KV tensors across the ring. The bandwidth you have determines the maximum ring size and context length you can serve at acceptable GPU utilization.

| Interconnect | Bandwidth | Ring Size | Max Context (70B FP8) | Notes |
|---|---|---|---|---|
| NVLink 4.0 (H200 node) | 900 GB/s bidirectional | 8 GPUs (single node) | ~1M tokens | Best intra-node option |
| NVLink 5.0 Switch (NVL72) | 1.8 TB/s | 8-72 GPUs | ~4M tokens | B200/B300 NVL72 rack |
| InfiniBand NDR 400G | ~350 GB/s effective | 16-64 GPUs | ~4M-10M tokens | Standard multi-node option |
| RoCEv2 400GbE | ~270 GB/s effective | Up to 16 GPUs | ~2M tokens | Ethernet fallback, avoid for CP>16 |

For single-node ring attention on bare-metal H200 instances, NVLink 4.0 at 900 GB/s is more than adequate. An 8-GPU ring can handle 1M-token contexts for 70B-class models with FP8 KV, with per-ring-step transfer times well under 2ms for chunk sizes up to 8K tokens.

For multi-node rings targeting 4M-10M token contexts, InfiniBand NDR is the practical requirement. RoCEv2 is technically viable at ring sizes of 8-16, but latency climbs fast above that. If you're building a 32+ GPU ring for 10M-token inference, InfiniBand NDR or NVLink 5.0 Switch is not optional.

The chunk size calculation: for any interconnect, the per-step transfer volume (KV shard size for one chunk) divided by the available bandwidth should stay under 10ms to maintain above 80% GPU compute utilization. For InfiniBand NDR at ~350 GB/s:

Max transfer per step = 10ms × 350 GB/s = 3.5 GB
Max chunk at FP8 (70B, 0.163 MB/token) = 3.5 GB / 0.163 MB = ~21,500 tokens

This means 16K-21K tokens per chunk is the practical sweet spot for multi-node InfiniBand rings with FP8. Going to 32K-token chunks at 350 GB/s gives ~15ms per step, still within acceptable range for FP8 workloads.
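The same arithmetic as a small Python helper (a sketch; the per-token figure and the ~350 GB/s effective bandwidth are the assumptions used in this guide, not measured values):

```python
KV_MB_PER_TOKEN_FP8 = 0.163   # 2 x 80 layers x 8 KV heads x 128 head dim x 1 byte
IB_NDR_GBPS = 350             # assumed effective InfiniBand NDR bandwidth, GB/s
TARGET_STEP_MS = 10           # keep each ring step's transfer under ~10 ms

max_transfer_gb = TARGET_STEP_MS / 1000 * IB_NDR_GBPS             # 3.5 GB per step
max_chunk_tokens = max_transfer_gb * 1000 / KV_MB_PER_TOKEN_FP8   # ~21,500 tokens

def step_time_ms(chunk_tokens, bandwidth_gbps=IB_NDR_GBPS):
    return chunk_tokens * KV_MB_PER_TOKEN_FP8 / 1000 / bandwidth_gbps * 1000

print(f"max chunk under the 10ms rule: ~{max_chunk_tokens:,.0f} tokens")
for chunk in (4096, 8192, 16384, 32768):
    print(f"{chunk:>6} tokens -> {step_time_ms(chunk):4.1f} ms per ring step")
```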

For the full interconnect cost analysis, see our GPU Networking guide: InfiniBand vs RoCE vs Spectrum-X. For Spheron cluster instance configurations with InfiniBand NDR fabric, see the Spheron cluster instance types docs.

Setting Up Ring Attention

Megatron-Core with transformer_engine

Megatron-Core's context parallelism (CP) implementation supports Ring Attention with configurable chunking strategies including striped layout. Install from source:

```bash
pip install git+https://github.com/NVIDIA/Megatron-LM
pip install transformer_engine[pytorch]
```

Core config for Ring Attention with striped layout on a Qwen 3.6 Plus-scale model:

```python
model_config = {
    "context_parallel_size": 8,   # ring size, must divide total GPUs
    "sequence_parallel": True,    # activates SP communication hooks
    "use_flash_attn": True,       # FA4 on Blackwell, FA3 on Hopper
    "cp_comm_type": "p2p",        # peer-to-peer ring communication
}
```

For multi-node setups, set these NCCL environment variables before launching your job. See our NCCL Tuning guide for the full context:

```bash
export NCCL_ALGO=Ring
export NCCL_NTHREADS=512
export NCCL_IB_HCA=mlx5_0,mlx5_1   # adjust to your IB HCAs
export NCCL_IB_GID_INDEX=3         # RoCEv2 GID index if applicable
```

Place context-parallel ranks on the same physical node where possible. Two CP ranks on the same node communicate over NVLink at 900 GB/s; cross-node CP ranks use InfiniBand at ~350 GB/s. The performance difference is about 2.5x. For a 16-GPU CP ring across two nodes, NVLink handles intra-node pairs, InfiniBand handles cross-node pairs. Design your parallelism topology to minimize cross-node CP communication.

transformer_engine handles kernel selection automatically: FA4 on SM100 (B200, B300), FA3 on SM90 (H100, H200), FA2 on older architectures. No manual kernel selection required within each CP rank's local compute step.

For more on Megatron-Core setup and parallelism configs, see Distributed LLM Training: FSDP, DeepSpeed, and Megatron-Core multi-node.

SGLang Sequence-Parallel Serving

SGLang v0.4+ implements CP with an all-gather + local attention pattern, equivalent to Ring Attention for inference. The interface is a single launch flag:

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-Plus \
  --tp 8 \
  --context-parallel-size 4 \
  --context-length 1000000 \
  --dtype bfloat16 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192
```

This example: 32 total GPUs (TP=8, CP=4), 4 nodes, 1M-token context. The --context-parallel-size flag activates the ring-attention communication schedule with striped layout for causal models. --chunked-prefill-size 8192 interleaves 8K-token prefill chunks with decode steps, preventing long prefills from blocking ongoing decode requests.

For a B200 multi-GPU cluster, add --enable-fp8-kv to halve KV memory per GPU. Combined with CP=4, a single 1M-token request distributes its ~164 GB FP8 KV cache across 4 CP ranks (32 GPUs total), each CP rank holding ~41 GB, well within B200's 192 GB HBM per GPU.
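Once the server is up, requests go through SGLang's OpenAI-compatible endpoint and context parallelism stays invisible to the client. A minimal client sketch, assuming the default port; the file name and prompt below are placeholders for your deployment:

```python
from openai import OpenAI

# Assumes the sglang.launch_server command above is running on this host/port.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("contract_bundle.txt") as f:       # hypothetical ~1M-token document
    long_context = f.read()

stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-Plus",
    messages=[
        {"role": "system", "content": "Answer strictly from the provided documents."},
        {"role": "user", "content": long_context + "\n\nSummarize the termination clauses."},
    ],
    max_tokens=1024,
    stream=True,                             # stream so the long prefill doesn't look like a hang
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```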

For B200 GPU rental on a multi-node cluster, check B200 GPU rental for availability and cluster configuration options.

FlashAttention-4 Integration and Chunked Prefill

FlashAttention-4 activates automatically within each CP rank on Blackwell GPUs (B200, B300) via transformer_engine and vLLM's FlashAttention dispatch. No manual kernel selection needed. Each GPU's local attention chunk (1/N of the full sequence) runs the SM100 tile-based FA4 kernel, which reduces HBM read overhead by roughly 2x compared to FA3. The FA4 gains compound with CP: you get FA4's per-token efficiency on each chunk, plus CP's memory distribution across the ring.

For the full FA4 architecture explanation and migration guide, see FlashAttention-4 on GPU Cloud: Blackwell Inference Guide.

Chunked prefill with sequence parallelism processes the input prompt in fixed-size chunks. Each chunk is processed ring-wise across all CP ranks before moving to the next chunk. This matters for two reasons:

  1. TTFT for blended workloads: Without chunked prefill, a single long-context prefill can block the decode queue for seconds. Chunked prefill lets the decode queue continue while prefill progresses chunk by chunk.
  2. Memory footprint: Processing 1M-token prompts in 8K-token chunks limits peak activation memory on each rank. Without chunking, the intermediate activation buffer scales with the full chunk size seen by each rank (1M / N tokens), which can OOM on 80 GB GPUs.

Recommended chunk sizes:

For blended workloads (concurrent prefill + decode), set --chunked-prefill-size 8192 in SGLang or --enable-chunked-prefill --max-num-batched-tokens 8192 in vLLM.
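Back-of-envelope arithmetic for what chunking buys on a 1M-token prompt (illustrative only; the variable names below are not framework parameters):

```python
PROMPT_TOKENS = 1_000_000
CHUNK_TOKENS = 8_192      # --chunked-prefill-size
CP_SIZE = 4               # context-parallel ring size

num_chunks = -(-PROMPT_TOKENS // CHUNK_TOKENS)         # ceil -> 123 slots for decode to interleave
unchunked_tokens_per_rank = PROMPT_TOKENS // CP_SIZE   # 250,000 tokens of activations at once without chunking
reduction = unchunked_tokens_per_rank / CHUNK_TOKENS   # ~30x smaller peak activation buffer

print(f"{num_chunks} prefill chunks; peak activation tokens per rank drop "
      f"{unchunked_tokens_per_rank:,} -> {CHUNK_TOKENS:,} (~{reduction:.0f}x)")
```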

Tuning: Ring Size, Chunk Size, and Overlap

Three knobs control performance of a ring-attention setup:

Ring Size (CP Degree)

Ring size determines how the sequence is sharded. More GPUs per ring means smaller KV shard per GPU, but more communication rounds and more inter-GPU coordination.

| Total GPUs | TP Degree | CP Degree | Notes |
|---|---|---|---|
| 8 (1 node) | 8 | 1 | Standard TP, no CP. Fine for up to ~500K tokens. |
| 8 (1 node) | 4 | 2 | Light CP. Handles ~1M tokens with headroom. |
| 16 (2 nodes) | 8 | 2 | Standard multi-node TP. ~1M tokens. |
| 16 (2 nodes) | 4 | 4 | CP-heavy. ~2M tokens. |
| 32 (4 nodes) | 8 | 4 | Good balance for ~4M tokens. |
| 64 (8 nodes) | 8 | 8 | 8-way CP ring across 8 nodes. ~10M tokens viable. |

The key constraint: TP degree is bounded by the number of KV heads. If your model has 8 GQA heads, max TP = 8. CP has no such bound. For very large context lengths, pushing CP higher while keeping TP at max is usually the right direction.
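To see how TP and CP together shape the per-GPU KV footprint, here is a small sketch using the 70B-class constants from this guide; it assumes TP evenly shards the 8 KV heads, which holds for TP ≤ 8:

```python
def kv_gb(context_tokens, bytes_per_elem=1):
    """Total FP8 KV for the 70B-class example: 80 layers, 8 KV heads, 128 head dim."""
    return 2 * 80 * 8 * 128 * context_tokens * bytes_per_elem / 1e9

for ctx, tp, cp in [(1_000_000, 4, 2), (4_000_000, 8, 4), (10_000_000, 8, 8)]:
    per_cp_rank = kv_gb(ctx) / cp   # each CP rank holds 1/CP of the sequence
    per_gpu = per_cp_rank / tp      # TP splits the KV heads within a CP rank
    print(f"{ctx/1e6:>4.0f}M ctx, TP={tp} x CP={cp} ({tp*cp} GPUs): "
          f"~{per_cp_rank:.0f} GB KV per CP rank, ~{per_gpu:.1f} GB per GPU")
```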

Chunk Size and the 10ms Rule

Chunk size is the number of tokens in each ring communication step. The rule: chunk transfer time must stay under 10ms to keep GPU utilization above 80%.

chunk_transfer_time = (chunk_tokens × KV_bytes_per_token) / interconnect_bandwidth_GBs

For B200 (FP8, 70B: 0.163 MB/token) + InfiniBand NDR (350 GB/s):

| Chunk Size | Transfer Volume | Transfer Time | GPU Utilization |
|---|---|---|---|
| 4K tokens | ~0.65 GB | ~1.9ms | ~95%+ |
| 8K tokens | ~1.3 GB | ~3.7ms | ~90%+ |
| 16K tokens | ~2.6 GB | ~7.5ms | ~85% |
| 32K tokens | ~5.2 GB | ~14.9ms | ~75% |

For multi-node InfiniBand rings with FP8, 16K-21K tokens per chunk is the practical sweet spot. At 4K tokens you get excellent overlap but more kernel launch overhead; at 32K tokens transfer time climbs to ~15ms, starting to cut into compute utilization.

For intra-node NVLink at 900-1800 GB/s, even 32K-token chunks have under 8ms transfer time, so intra-node rings are not chunk-size-sensitive in the same way.

Parallelism Decomposition

The full parallelism equation: total GPUs = TP × CP × PP (tensor parallel × context parallel × pipeline parallel). For inference, PP is usually 1 (pipeline stages add latency for single requests). So:

At 4M tokens, 32x B200 (InfiniBand NDR):

  • Option A: TP=8, CP=4. Four 8-GPU TP groups, each with 4-way CP. Easy to configure in SGLang and Megatron.
  • Option B: TP=4, CP=8. Smaller TP means more communication for weight sharding, but larger CP accommodates longer sequences per ring step.
  • Benchmark both. The crossover depends on the model's attention head count and the specific batch workload.

Benchmarks and Cost Analysis

These figures are estimated performance ranges based on hardware interconnect specs and published FlashAttention and SGLang CP benchmarks. They are not measured results from a specific Spheron deployment. Run your own benchmarks on your target hardware for production planning.

| Configuration | Context | GPUs | Est. Decode Throughput | Est. TTFT | GPU Mem / GPU | On-Demand Cost |
|---|---|---|---|---|---|---|
| 8x H200 SXM5, CP=8, NVLink 4.0 | 1M tokens | 8 | 100-200 tok/s | 5-15s | ~45-55 GB | $37.76/hr |
| 16x B200 SXM6, CP=16, IB NDR | 4M tokens | 16 | 40-80 tok/s | 20-50s | ~100-120 GB | $112.00/hr |
| 32x B200 SXM6, CP=32, IB NDR | 10M tokens | 32 | 15-30 tok/s | 80-150s | ~100-120 GB | $224.00/hr |

B200 pricing: $7.00/GPU/hr on-demand ($1.71/GPU/hr spot). H200 pricing: $4.72/GPU/hr on-demand.

Cost per million output tokens (on-demand, midpoint throughput estimates):

| Configuration | Context | Cost / Million Tokens |
|---|---|---|
| 8x H200, CP=8 | 1M | ~$52-105 |
| 16x B200, CP=16 | 4M | ~$390-780 |
| 32x B200, CP=32 | 10M | ~$2,100-4,200 |

Long-context inference at 10M tokens is expensive. That's not a Spheron pricing issue; it reflects the physics of attention: every decode step reads O(N) KV tokens from HBM. At 10M tokens, a single forward pass reads ~1.6 TB of FP8 KV data (distributed across 32 Spheron B200 GPUs, ~50 GB per GPU per step). The compute is real. Sequence parallelism makes it possible; it doesn't make it cheap.
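The cost table above follows directly from hourly price and estimated decode rate. A small sketch reproducing it (the throughput figures are this guide's estimates, not measurements):

```python
def cost_per_million(usd_per_hour, tokens_per_second):
    return usd_per_hour / (tokens_per_second * 3600) * 1_000_000

configs = [
    ("8x H200, CP=8, 1M ctx",    37.76, (100, 200)),   # $/hr, (low, high) tok/s
    ("16x B200, CP=16, 4M ctx", 112.00, (40, 80)),
    ("32x B200, CP=32, 10M ctx", 224.00, (15, 30)),
]
for name, rate, (lo, hi) in configs:
    print(f"{name}: ~${cost_per_million(rate, hi):,.0f}-${cost_per_million(rate, lo):,.0f} per million output tokens")
```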

Pricing fluctuates based on GPU availability. The prices above are as of 14 May 2026 and may have changed; check current GPU pricing → for live rates.

Decision Matrix: When to Use Each Approach

| Approach | Context Range | GPU Requirement | Spheron Cost Estimate | Notes |
|---|---|---|---|---|
| Sequence parallelism (Ring/Striped) | 1M-10M tokens | 8-64 GPUs, InfiniBand NDR or NVLink | $38-$224/hr | Only option above 4M tokens |
| NVMe KV Offloading (LMCache) | Up to ~500K tokens | 1-4 GPUs, bare-metal NVMe | $2-$16/hr | Cost-efficient for moderate context |
| Context truncation | Up to model max | Any GPU | Any | Quality loss for long documents |
| SRAM inference chips (Groq 3 LPX) | Up to ~1M tokens | Purpose-built hardware | Not on Spheron | Extremely fast TTFT, fixed hardware |

The SRAM chip row covers purpose-built inference accelerators with large on-chip SRAM that avoid HBM entirely. NVIDIA's Rubin CPX was announced in this category before being replaced. For the current state of that hardware category, see NVIDIA Rubin CPX and long-context inference.

The crossover point between NVMe KV offloading and sequence parallelism is around 500K-1M tokens. Below 500K, NVMe offloading covers the gap at much lower cost. Above 1M, sequence parallelism is the only practical option.

Sequence parallelism also complements prefill-decode disaggregation at extreme context lengths. Disaggregation addresses the prefill/decode resource mismatch; sequence parallelism addresses the per-request memory constraint. At 4M+ token workloads you likely need both.

For H200 SXM5 bare-metal instances for single-node ring setups, capacity is available at $4.72/hr per GPU. See H200 SXM5 instances on Spheron for the current configuration options.

Practical Checklist Before Running Ring Attention in Production

  1. Verify interconnect type before provisioning. Multi-node ring attention requires InfiniBand NDR or NVLink Switch. Standard Ethernet (even at 400GbE RoCEv2) adds enough latency at ring sizes above 16 to drop GPU utilization below 70%.
  2. Calculate KV-per-GPU before you launch. The formula: (2 × L × H_kv × D × S × bytes) / N. Plug in your model's architecture constants. If the result exceeds 80% of HBM capacity, add more nodes or reduce context length.
  3. Use Striped Attention for decoder-only models. In Megatron-Core: set context_parallel_size=N and cp_comm_type=p2p; striped sequence layout is the default for causal models in recent Megatron-Core releases. In SGLang: --context-parallel-size N uses striped layout for causal models by default.
  4. Tune chunk size to the 10ms rule. For InfiniBand NDR, start at 8K tokens per chunk and benchmark against 4K and 16K. For NVLink (intra-node), start at 8K as well; the optimal may be higher given the bandwidth headroom.
  5. Start with TP=max_kv_heads, then scale CP. For an 8-KV-head model, TP=8 exhausts the tensor-parallel dimension. Scale context length by adding CP on top.
  6. Benchmark TTFT vs throughput separately. Single-request TTFT is dominated by prefill time, which scales with context length. Throughput (tokens/sec aggregate across a request batch) is dominated by decode, which scales with batch size and KV read bandwidth. Optimize for the metric that matters for your workload; a small timing sketch follows this list.
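A small timing sketch for that last item, assuming the SGLang OpenAI-compatible endpoint from the serving section above and approximating one streamed chunk as one token:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at, n_tokens = None, 0

stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-Plus",
    messages=[{"role": "user", "content": open("long_prompt.txt").read()}],  # placeholder prompt file
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.1f} s (prefill-bound, scales with context length)")
print(f"decode: {n_tokens / (end - first_token_at):.1f} tok/s (scales with batch size and KV read bandwidth)")
```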

Sequence parallelism for 4M-10M token contexts requires the right interconnect, not just more GPUs. Spheron's multi-node B200 and H200 clusters come with InfiniBand NDR fabric and transparent per-hour pricing, so you can model exact cost-per-million-tokens before committing to a long run.

Rent H200 for long-context inference → | Rent B200 multi-node → | View all GPU pricing →


Quick Setup Guide

  1. Calculate KV memory per GPU for your target context and ring size

    Use: KV_per_GPU = (2 × L × H_kv × D × S × bytes_per_element) / N where N is the ring/CP size. For a frontier 70B-class model at 1M tokens with 80 layers, 8 GQA heads, 128 head dim, FP8: (2 × 80 × 8 × 128 × 1,000,000 × 1) / 8 = ~20 GB per GPU in an 8-way ring. Scale S to 4M or 10M and N to 16 or 32 GPUs to plan your cluster size.

  2. Choose striped vs standard ring chunking

    For causal language models (decoder-only: Qwen 3.6 Plus, Kimi K2.6, GLM-5.1), always use Striped Attention chunking. Standard Ring Attention gives late-position GPUs several times the compute of early-position GPUs due to the causal mask. Striped chunking interleaves position ranges so each GPU handles an equal mix of sparse (early) and dense (late) attention blocks. In Megatron-Core, set context_parallel_size=N and cp_comm_type=p2p; striped sequence layout is the default for causal models in recent Megatron-Core releases. In SGLang set --context-parallel-size with the default ring ordering, which uses striped layout for causal models automatically.

  3. Set up multi-node Ring Attention with Megatron-Core and transformer_engine

    Install Megatron-Core from source (pip install git+https://github.com/NVIDIA/Megatron-LM). In your model config, set context_parallel_size=N (must divide total GPUs evenly), sequence_parallel=True, use_flash_attn=True, and cp_comm_type=p2p for ring communication. transformer_engine handles the FA4 dispatch on Blackwell and FA3 on Hopper. For multi-node, set NCCL_ALGO=Ring and NCCL_NTHREADS=512 to maximize ring all-reduce throughput. Place context-parallel ranks on the same node where possible to minimize inter-node KV transfer volume.

  4. Configure SGLang sequence-parallel serving

    Launch SGLang with context parallelism: python -m sglang.launch_server --model-path <model> --tp 8 --context-parallel-size 4 --context-length 1000000 --dtype bfloat16 --mem-fraction-static 0.85. This creates a 32-GPU, 4-node setup where each node hosts an 8-way tensor-parallel group and the four nodes form a 4-way context-parallel ring. The --context-parallel-size argument activates the ring-attention communication schedule. Add --chunked-prefill-size 8192 to interleave prefill chunks with decode steps for blended workloads.

  5. Overlap communication with compute to maximize GPU utilization

    The key knob is the chunk size fed to each ring step. Smaller chunks allow more communication-compute overlap but increase the number of kernel launches and reduce arithmetic intensity. For B200 with NVLink 5.0 at 1.8 TB/s, a chunk size of 4K-8K tokens per ring step keeps comms hidden behind compute. For multi-node with InfiniBand NDR at ~350 GB/s effective, increase chunk size to roughly 16K-21K tokens to stay within the 10ms-per-step rule and maintain 80%+ GPU utilization. In Megatron-Core this is controlled by the --seq-length / context_parallel_size division.

  6. Benchmark and tune ring size vs tensor-parallel degree

    For a fixed GPU budget, ring size and tensor-parallel (TP) degree multiply: total GPUs = TP × CP × PP. At 1M tokens on 16x B200s, try (TP=8, CP=2) vs (TP=4, CP=4) and benchmark tokens/sec with vLLM's benchmark_throughput tool. TP scales with the number of attention heads (max TP <= num_kv_heads), CP scales with sequence length. At 10M tokens, CP=16 on a 128-GPU cluster (TP=8) is a reasonable starting point.


Frequently Asked Questions

What is Ring Attention, and why does it matter for 10M-token inference?

Ring Attention (Liu et al., 2023) distributes the attention computation across GPUs arranged in a logical ring. Each GPU holds a shard of the full sequence's key-value tensors and rotates them around the ring one step at a time, computing its portion of the attention output while the KV data is in transit. This overlaps communication with compute, hiding most of the all-to-all transfer latency. At 10M tokens, a 70B-class frontier model accumulates roughly 3.3 TB of BF16 KV cache for a single request. Ring Attention shards that across GPUs so each node holds only its 1/N fraction, making the otherwise impossible context length tractable with 16-32 B200 GPUs.

How do Ring Attention, Tree Attention, and Striped Attention differ?

Ring Attention passes KV shards around a ring of GPUs, overlapping sends with compute. It has O(1/N) memory per GPU and O(N) total communication volume, making it bandwidth-efficient for large rings but sensitive to load imbalance, since under the causal mask later sequence positions attend to more keys, and therefore do more computation, than earlier ones. Tree Attention uses a reduce-tree topology (like a tournament bracket) to compute attention with O(log N) communication rounds, which cuts latency at very large GPU counts but requires stable log-sum-exp numerics at each merge step. Striped Attention (Brandon et al., 2023) fixes Ring Attention's load imbalance problem by interleaving sequence chunks in a striped pattern across GPUs so each GPU sees a balanced mix of early and late positions. Striped Attention is the default recommendation for causal models today.

What interconnect do I need for multi-node Ring Attention?

The minimum viable interconnect depends on ring size and context length. For an 8-GPU ring (single node) over NVLink 4.0, you get 900 GB/s bidirectional bandwidth, which is more than adequate. For multi-node rings (16-64 GPUs), you need either InfiniBand NDR 400G (roughly 350 GB/s effective all-reduce) or NVLink 5.0 Switch (via NVL72 systems). RoCEv2 at 400GbE is the minimum viable Ethernet option but adds meaningful latency at ring sizes above 16. The key metric: per-step KV transfer volume divided by available bisection bandwidth should stay under 10ms to keep GPU compute utilization above 80%.

Does SGLang support sequence-parallel serving?

SGLang v0.4+ includes a sequence-parallel path that implements context parallelism using an all-gather + local attention pattern, which is functionally equivalent to Ring Attention for most inference workloads. Enable it with --context-parallel-size N where N is your ring/CP size. SGLang's implementation integrates with FlashAttention-3 on Hopper and FlashAttention-4 on Blackwell automatically, so you get optimal attention kernels within each CP rank's local chunk without manual kernel selection. The SGLang router handles load balancing across CP groups when combined with its disaggregated serving API.

When should I use sequence parallelism instead of NVMe KV offloading?

NVMe KV offloading works well for moderate context lengths (up to roughly 500K tokens) where the latency cost of cold NVMe reads (7 GB/s, ~100us-1ms per access) is acceptable and HBM shortage is the only constraint. Sequence parallelism is the right choice when: (1) context length exceeds ~1M tokens and NVMe bandwidth itself becomes the bottleneck, (2) you need to serve requests with TTFT under a few seconds and cannot afford the recomputation or NVMe load latency, or (3) you are running batch inference where consistent inter-GPU communication latency is more predictable than variable NVMe seek patterns. At 4M-10M tokens, sequence parallelism is the only production-grade option short of purpose-built SRAM inference chips.
