Note: DeepSeek R2 launched in March 2026. Some architecture specifications referenced in this guide, including total parameter count, expert configuration, and context window, are based on pre-release and early post-release information. Treat specific numbers as provisional until the official technical report is published.
DeepSeek R2 on a math competition problem generates up to 40,000 thinking tokens before producing an answer. A standard LLM generates 400. That 100x gap is not just about token count: it means 100x more KV cache pressure, 100x longer time-to-first-token if you don't tune for it, and completely different batching dynamics. Before you deploy R2, the KV cache memory implications and the cost structure of reasoning inference are the two things worth understanding in detail. Both interact with R2's architecture in ways that compound quickly at scale.
What Is DeepSeek R2 and How It Differs from V4 and R1
DeepSeek has two distinct model lines. The V-series (V3, V3.2 Speciale, V4) optimizes for coding, agentic tasks, and general-purpose instruction following with shorter outputs. The R-series (R1, R2) is reasoning-first: the model generates extended internal chain-of-thought before answering, and accuracy on math, logic, and science benchmarks is the primary objective.
R1 introduced the reasoning approach on a 671B MoE base. R2 is the direct successor, with a provisionally larger total parameter count and two architectural improvements that matter for deployment:
- Improved MLA (Multi-head Latent Attention): R2 extends the KV latent compression from R1, reducing the stored KV dimension significantly compared to standard multi-head attention. This is directly relevant to deployment because reasoning workloads hold 10,000-40,000 token thinking chains in memory for the full generation, and without MLA compression that would exhaust VRAM even before you account for model weights.
- Deeper expert routing: R2 uses a larger MoE base with more expert specialization for reasoning-specific patterns: structured proof steps, mathematical notation, logical chaining. This is why R2 outperforms R1 on AIME and competition math benchmarks even on problems that R1 technically has the capacity to solve.
V4 is a separate model. V4 optimizes for coding throughput and multi-step agentic tasks with shorter generation sequences. Choosing V4 vs R2 is a task-type decision, not a quality tradeoff. For coding and agentic work, see Deploy DeepSeek V4 on GPU Cloud.
| Model | Total Params | Active Params | Context Window | Architecture | Best For |
|---|---|---|---|---|---|
| DeepSeek R2 | ~685B* | ~37B* | 128K* | MoE + MLA | Math, logic, scientific reasoning |
| DeepSeek R1 | 671B | 37B | 128K | MoE + MLA | Reasoning (established baseline) |
| DeepSeek V4 | ~1T* | ~37B* | 1M* | MoE + MLA | Coding, agentic tasks |
*Provisional figures based on pre-release information. Early leaks suggest R2 may have a significantly larger total parameter count (~1.2T with ~78B active), but this has not been confirmed. Verify against official technical report.
DeepSeek R2 Architecture: MoE Reasoning Model Specs
R2 is a sparse MoE model. Only a subset of expert FFN networks activate for each token, keeping per-token compute roughly equivalent to a dense 37B model despite the much larger total parameter count. The router selects approximately top-8 from 256 experts per token (provisional, matching the R1/V3.2 architecture pattern), and the routing logic has been updated to specialize more heavily for reasoning patterns.
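The top-k gating described above can be sketched in a few lines of plain Python. This is a toy illustration of the mechanism, not DeepSeek's router; the top-8-of-256 figures are the provisional ones from this section.

```python
import math
import random

def route_token(router_logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token; return (expert_id, gate_weight) pairs.
    Gate weights are a softmax over only the selected experts' logits."""
    topk = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in topk)               # numerical stability
    exps = [math.exp(router_logits[i] - m) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]     # one token's router scores
routing = route_token(logits)                             # 8 experts run; 248 stay idle
```

Per-token compute stays near the dense-37B level because only the 8 selected experts' FFNs execute; the other 248 experts' weights still occupy VRAM, which is why VRAM planning uses total parameters.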
MLA compression. The KV cache in standard multi-head attention stores full K and V tensors for each head per token. R2's MLA replaces this with a low-rank compressed KV latent (approximately 512 dimensions) plus a small decoupled RoPE key (64 dimensions), rather than the full per-head K/V matrices. For a 61-layer model at FP16, a 30,000-token reasoning chain generates:
61 layers × (512 KV latent + 64 RoPE key) × 30,000 tokens × 2 bytes ≈ 2.1 GB

A standard GQA model at the same depth with 8 KV heads and 128-dim heads would generate roughly 7 GB for the same trace (61 × 8 × 128 × 2 × 30,000 × 2 bytes), and full multi-head attention far more. MLA is what makes extended reasoning feasible on 8x H100 rather than requiring a 16+ GPU cluster just for KV cache.
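The back-of-envelope arithmetic above can be wrapped in a reusable helper; the layer count and latent/RoPE dimensions are the provisional figures from this section.

```python
def mla_kv_bytes(tokens: int, layers: int = 61, kv_latent: int = 512,
                 rope_key: int = 64, bytes_per_elem: int = 2) -> int:
    """KV cache for an MLA model: one compressed latent plus a decoupled
    RoPE key per layer per token (FP16 = 2 bytes per element)."""
    return layers * (kv_latent + rope_key) * tokens * bytes_per_elem

for n in (10_000, 30_000, 40_000):
    print(f"{n:>6} tokens: {mla_kv_bytes(n) / 1e9:.1f} GB")  # 0.7, 2.1, 2.8 GB
```

Swapping in FP8 KV cache (1 byte per element) halves each figure, which matters once dozens of traces are in flight concurrently.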
Extended reasoning architecture. R2 produces two token sequences: thinking tokens (internal, not shown to users) and response tokens (the visible answer). Thinking tokens consume identical compute and memory as output tokens. The model's attention layers see the full thinking trace as context during response generation, which is why the KV state must be held in memory until the last response token is generated.
For expert parallelism background on MoE models, see MoE inference optimization on GPU cloud.
GPU Requirements: VRAM, Memory Bandwidth, and Multi-GPU Configurations
All expert weights must reside in VRAM at all times. The router selects different experts per token, so lazy-loading causes latency spikes that break interactive serving. VRAM planning uses total parameter count, not active parameter count.
VRAM formula: total_params × bytes_per_param × 1.15 + kv_cache_budget
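A quick sanity check of the formula; the 1.15 overhead factor and the ~685B total come from this section's provisional figures. Note that the naive FP8 estimate lands above 640 GB, which is why the compressed checkpoint format matters for the 8x H100 configuration.

```python
def weight_vram_gb(total_params_b: float, bytes_per_param: float,
                   overhead: float = 1.15) -> float:
    """Weight-side VRAM from the formula above: params x bytes/param x 1.15.
    The KV cache budget is added on top of this."""
    return total_params_b * bytes_per_param * overhead

# ~685B total parameters (provisional figure)
for precision, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{precision}: ~{weight_vram_gb(685, bpp):.0f} GB + KV cache budget")
```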
For R2 at ~685B parameters:
| Configuration | Total VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1,128 GB | FP8 | 8 | Maximum KV cache headroom; BF16 weights (~1.4 TB) would need 16 GPUs |
| 8x H100 SXM5 | 640 GB | FP8 | 8 | Recommended default; compressed FP8 checkpoint fits with headroom |
| 4x H200 SXM5 | 564 GB | FP8 | 4 | Fewer GPUs, adequate VRAM headroom |
| 4x H100 SXM5 | 320 GB | INT4 | 4 | Budget option; accuracy loss not recommended for reasoning |
| Distilled 70B (1x H100) | 80 GB | FP8 | 1 | Single-GPU option for distilled variant |
The FP8 weight estimate for ~685B parameters: 685 × 1 byte = ~685GB, which does not fit in 640GB of H100 VRAM by itself. In practice, DeepSeek's FP8 checkpoint format uses weight sharing and compression that reduces the actual stored checkpoint to under 600GB, fitting within 8x H100 with headroom for KV cache. If your deployment sees OOM on 8x H100, switch to 8x H200 or reduce max-model-len to shrink KV cache allocation.
The reasoning workload adds KV cache pressure on top of weight VRAM. A 30,000-token thinking chain at FP8 KV adds roughly 1 GB of KV state (based on the MLA compression math above, halved by FP8). This sounds small, but at 32 concurrent reasoning requests each generating 20,000-token chains, that is roughly 20 GB of KV cache beyond the model weights. This is still far less than a non-MLA model would require.
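The concurrency math above generalizes into a small planning helper. This is a sketch under the ~1 GB per 30K tokens estimate from this section; `max_concurrent` is an illustrative name, not a vLLM API.

```python
def max_concurrent(kv_pool_gb: float, tokens_per_request: int,
                   gb_per_30k_tokens: float = 1.0) -> int:
    """How many in-flight reasoning requests fit in a given KV cache pool.

    gb_per_30k_tokens is the per-trace footprint: ~1 GB per 30K tokens with
    MLA compression and FP8 KV cache, per the estimate above."""
    per_request_gb = gb_per_30k_tokens * tokens_per_request / 30_000
    return int(kv_pool_gb // per_request_gb)

# e.g. 50 GB of VRAM left for KV after weights, 15K-token average traces
print(max_concurrent(50, 15_000))  # 100 concurrent requests
```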
For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For memory requirements across model sizes, see our GPU memory requirements guide for LLMs.
Step-by-Step Deployment with vLLM
Prerequisites
- vLLM v0.9+ (required for --enable-expert-parallel and per-request max-tokens override)
- CUDA 12.4+, Python 3.10+
- Hugging Face account with model weights access
- 8x H100 SXM5 or 4x H200 SXM5 cluster (or 1x H100 for distilled variant)
Install vLLM
pip install vllm  # v0.9+ required for --enable-expert-parallel and per-request max-tokens override

Download Model Weights
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID. Verify the exact repo name at release.
huggingface-cli download deepseek-ai/DeepSeek-R2 --local-dir ./deepseek-r2

The FP8 checkpoint is approximately 600-700 GB. Use a persistent storage volume to avoid re-downloading on instance restarts.
Launch with 8x H100 (Full Tensor Parallelism)
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
--tensor-parallel-size 8 \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port 8000

--enable-chunked-prefill is especially important for R2. The prefill phase processes the entire thinking prompt (potentially 10,000-30,000 tokens) before any response tokens are generated. Chunked prefill breaks this into smaller batches, keeping the GPU serving other requests during long prefill sequences rather than blocking on a single request's thinking trace.
Launch with Mixed Expert and Tensor Parallelism
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 \
--port 8000

--tensor-parallel-size splits attention layers across GPUs. --enable-expert-parallel activates expert parallelism for MoE layers. When to use each:
- Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs participate in every forward pass, minimizing TTFT
- Mixed TP+EP: better for throughput-maximizing batch workloads; reduces attention layer communication overhead at some TTFT cost
For reasoning workloads where latency matters most, pure TP is preferred for interactive serving. Use TP+EP for offline batch reasoning pipelines.
Launch Distilled Variant on a Single H100
# Note: deepseek-ai/DeepSeek-R2-Distill-Llama-70B is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2-Distill-Llama-70B \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8_e5m2 \
--host 0.0.0.0 \
--port 8000

Test the Endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R2", # provisional model ID; verify upon official release
messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
max_tokens=16384,
)
print(response.choices[0].message.content if response.choices else None)

A successful test on a reasoning problem like this will produce a visible <think>...</think> block before the response. If you see only a short response without a thinking section, verify the model loaded correctly and that you are using the base R2 model rather than a non-thinking fine-tune.
For full vLLM setup including Docker deployment and load balancing, see the vLLM production deployment guide. For an OpenAI-compatible API layer on self-hosted models, see the self-hosted OpenAI-compatible API guide.
Optimizing R2 Inference: Managing Long Chain-of-Thought Output Sequences
Unconstrained R2 reasoning chains hit 20,000-40,000 tokens on hard math and science problems. For interactive serving, you want to cap this without sacrificing accuracy on the queries that actually need deep reasoning. Three techniques from reasoning model inference cost optimization, adapted to R2 specifically:
1. System prompt budget hints. Add a constraint to the system prompt instructing the model to use concise reasoning for simpler queries. R2 responds to these hints somewhat more consistently than R1 on structured problem types.
System: Use minimal reasoning steps for factual lookups and simple calculations. Reserve full reasoning depth for complex multi-step problems.

Budget hints alone typically reduce average thinking tokens by 40-60% on mixed workloads. They are not 100% reliable: R2 occasionally ignores hints on problems it classifies internally as complex.
2. Query classifier routing. A lightweight classifier assigns each request a reasoning budget (low/medium/high) and passes that as a system prompt constraint. This adds 5-10ms latency per request but achieves 70-80% reduction in thinking tokens for clearly simple queries.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
def classify_budget(query: str) -> tuple[str, int]:
    """Returns (budget_label, max_thinking_tokens)"""
    query_lower = query.lower()
    # Simple heuristics; replace with a trained classifier for production
    hard_markers = ["prove", "derive", "solve", "optimize", "show that", "find all"]
    if len(query.split()) < 15 and not any(w in query_lower for w in hard_markers):
        return "low", 1000
    elif any(w in query_lower for w in hard_markers):
        return "high", 16000
    else:
        return "medium", 4000

def reason(query: str) -> str | None:
    budget_label, max_thinking = classify_budget(query)
    system = f"Reasoning budget: {budget_label}. Limit your thinking to {max_thinking} tokens."
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID; verify upon official release
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
        max_tokens=max_thinking + 2000,
    )
    return response.choices[0].message.content if response.choices else None

3. Per-request max_tokens cap. vLLM 0.9+ supports per-request token limits. This is the most reliable enforcement mechanism because it operates at the model level rather than relying on instruction-following. Set max_tokens based on the classifier's budget output. Requests that hit the cap will truncate rather than generating a complete reasoning chain, so only apply low caps to queries you are confident are simple.
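A sketch of wiring the classifier's budget into a hard cap, reusing the `client` and budget tiers from the example above. `pick_cap` and `reason_capped` are illustrative names; the `finish_reason == "length"` check is the standard OpenAI-compatible signal that a response was cut off by max_tokens.

```python
def pick_cap(budget_label: str) -> int:
    """Map the classifier's label to a hard max_tokens cap: the thinking
    budget plus ~2,000 tokens of headroom for the visible answer."""
    thinking = {"low": 1_000, "medium": 4_000, "high": 16_000}[budget_label]
    return thinking + 2_000

def reason_capped(client, query: str, budget_label: str):
    """client is the OpenAI client from the example above. Returns the text
    and whether the cap truncated the chain (finish_reason == "length")."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID
        messages=[{"role": "user", "content": query}],
        max_tokens=pick_cap(budget_label),
    )
    choice = response.choices[0]
    return choice.message.content, choice.finish_reason == "length"
```

When truncation is detected, retry at the next budget tier rather than returning the clipped chain as an answer.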
For batch reasoning jobs (offline document analysis, research pipelines), caps are less important since you typically want full reasoning depth. Cap enforcement matters most for interactive production serving where 100K+ token runaway chains would block the serving queue.
KV Cache Considerations for Extended Reasoning Traces
Reasoning workloads hit the KV cache harder than general LLMs for a specific reason: a 30,000-token thinking chain holds its full KV state in memory from the first thinking token until the last response token is generated. That memory block cannot be freed until the request completes.
At 20 concurrent reasoning requests each generating 25,000-token traces, with MLA's ~2 GB per 30K tokens at FP16, that is roughly 35 GB of KV cache consumed by in-flight requests. Without MLA, a GQA model at the same depth would require well over 100 GB for the same workload, and full multi-head attention far more.
FP8 KV cache cuts this in half: --kv-cache-dtype fp8_e5m2. On H100, this is hardware-accelerated with negligible accuracy cost. It is essentially free and should be enabled by default for all R2 deployments.
--max-model-len sizing:
| max-model-len | Use Case | Concurrent Users (est.) |
|---|---|---|
| 8192 | Shallow reasoning, coding tasks | Highest concurrency |
| 32768 | Standard reasoning queries | Balanced |
| 65536 | Deep math/science problems | Moderate |
| 131072+ | Research-grade full-depth | Low; only if required |
Setting a high max-model-len reserves KV cache memory for the worst-case request, reducing available slots for concurrent users even if most requests never reach that length. vLLM's PagedAttention allocates KV blocks on-demand per request, so you do not waste memory if requests are shorter than max-model-len. The risk is over-reservation of the KV pool at peak load. Set this to the 95th-percentile reasoning trace length in your workload, not the theoretical maximum.
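Computing the 95th percentile from logged trace lengths is a one-liner worth getting right; this sketch uses the nearest-rank method with hypothetical logged data.

```python
import math

def p95_length(trace_lengths: list[int]) -> int:
    """95th-percentile total sequence length (nearest-rank method)
    for sizing --max-model-len."""
    ranked = sorted(trace_lengths)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

# Hypothetical logged lengths: mostly short queries, a few deep proofs
lengths = [4_000] * 80 + [12_000] * 15 + [45_000] * 5
print(p95_length(lengths))  # 12000 -> a 16384 max-model-len covers p95
```

In this hypothetical distribution, sizing for the 45K outliers would reserve 4x the KV pool that 95% of traffic ever uses; route the rare deep-reasoning requests to a separate high-context instance instead.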
For deeper coverage of KV cache management strategies, see the NVMe KV cache offloading guide.
Quantization Options: FP8, INT4, and Quality-Performance Tradeoffs for Reasoning
Quantization errors compound through long reasoning chains. An error that causes a minor output difference in a 500-token standard response can cascade through a 15,000-token thinking trace and produce a wrong final answer. This makes quantization choices more consequential for reasoning models.
| Quantization | VRAM vs FP16 | Throughput Gain | MATH-500 Accuracy Drop | AIME Accuracy Drop | Recommended For |
|---|---|---|---|---|---|
| FP16 | 1x (baseline) | 1x | 0% | 0% | Research / accuracy-critical |
| FP8 | ~0.5x | 1.6-1.9x | <1% | <2% | Production default |
| FP4 (Blackwell B200) | ~0.25x | 2.8-3.5x | 3-5% | 5-8% | High-throughput batch with prior validation |
| INT4 (GPTQ) | ~0.25x | 1.8-2.2x | 6-12% | 10-18% | Not recommended for reasoning |
| INT4 (AWQ) | ~0.25x | 1.7-2.0x | 4-8% | 7-14% | Not recommended for reasoning |
FP8 is the correct default for production R2 deployments. The H100 Transformer Engine handles FP8 in hardware, so throughput gains are real. INT4 methods show larger accuracy drops specifically on reasoning tasks because the quantization format degrades structured logical inference more than general text generation.
FP4 on B200 hardware is a viable option for throughput-heavy batch workloads where you have validated accuracy on your specific task set before enabling. See FP4 quantization on Blackwell GPUs for the full analysis.
Benchmarks: Throughput and Latency on H100, H200, and B200
These figures are projected from architectural data and vLLM MoE benchmarks. Actual results depend on batch size, context length distribution, and hardware interconnect configuration. The reasoning-specific columns (TTFT at 8K thinking tokens, latency at 16K total output) are more meaningful for R2 than standard throughput numbers.
| Hardware | Config | Throughput (tok/s) | p50 TTFT (8K thinking tokens) | p50 Latency (16K total output) | Notes |
|---|---|---|---|---|---|
| H100 SXM5 | 8x FP8, TP=8 | ~1,600 | ~6s | ~22s | NVLink, lowest TTFT |
| H200 SXM5 | 4x FP8 | ~1,300 | ~8s | ~28s | More VRAM headroom per GPU |
| B200 SXM6 | 4x FP8 | ~2,100 | ~5s | ~17s | Higher HBM bandwidth, FP4-capable |
| H100 SXM5 | 1x FP8 | ~550 | ~18s | ~58s | Distilled 70B variant only |
For reasoning workloads, TTFT matters more than raw throughput. A user submitting a math proof problem is waiting for the model to finish its internal thinking before seeing any response. H100 in 8-GPU NVLink configuration minimizes per-layer compute time, which is why it is preferred over H200 in 4-GPU for interactive serving even though H200 provides more VRAM per GPU.
For engine-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks. For a hardware comparison covering H200, B200, and GB200, see NVIDIA H200 vs B200 vs GB200.
Spheron GPU Pricing for DeepSeek R2
Prices from live Spheron GPU pricing API, fetched 05 Apr 2026:
| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 1x H100 SXM5 | $2.40 | $0.80 | $1,728 | $576 |
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $18.00 | $4.76 | $12,960 | $3,427 |
| 4x B200 SXM6 | N/A | $8.24 | N/A | $5,933 |
Pricing fluctuates based on GPU availability. The prices above are based on 05 Apr 2026 and may have changed.
Spot vs. on-demand for R2: Spot instances are appropriate for batch reasoning jobs: offline document analysis, research pipelines, evaluation runs, or any workload that can restart from a checkpoint if the instance is reclaimed. On-demand is the right choice for real-time serving with SLA requirements.
One specific consideration for R2: reasoning workloads sustain high GPU utilization during long thinking chains. A spot interruption arriving mid-request means the entire reasoning trace is lost and the request must restart from scratch. If your average reasoning chain is 20,000 tokens and takes 30+ seconds to generate, a mid-chain interruption is more expensive per occurrence than for a standard model generating 500-token responses. Factor this into your spot vs. on-demand tradeoff for interactive R2 serving.
Compared to equivalent H100 offerings from major cloud providers, Spheron's pricing is competitive at $19.20/hr on-demand for 8x H100. For other instance types such as A100 80GB clusters, the cost advantage is more pronounced.
Cost Comparison: Self-Hosted R2 vs API Pricing
Using a mixed reasoning workload (70% thinking tokens, 30% response tokens, average 6,000 total tokens per query):
| Provider | Model | Input ($/M) | Output ($/M) | Est. Cost per 1M tokens (reasoning mix) |
|---|---|---|---|---|
| DeepSeek API | DeepSeek-R1 (R2 est.)* | ~$0.55 | ~$2.19 | ~$1.72/M |
| OpenAI | o3 | $2.00 | $8.00 | ~$6.20/M |
| Spheron (H100, on-demand) | R2 self-hosted | $2.40/hr | - | ~$1.50-2.18/M |
| Spheron (H100, spot) | R2 self-hosted | $0.80/hr | - | ~$0.50-0.73/M |
*DeepSeek R2 API pricing is not yet published. Input/output prices above are DeepSeek R1's published rates, used as a projected estimate.
ROI breakeven calculation: At 1 million queries/month with 6,000 tokens/query average:
- Total tokens: 6B tokens/month
- DeepSeek API cost: 6B × $1.72/M = $10,320/month
- Self-hosted on 8x H100 spot at $6.40/hr, 75% utilization: 8x H100 at ~1,600 tok/s sustained = 1,600 × 3,600 × 0.75 = ~4.32M tokens/hour. At ~4.32M tokens/hour × 720 hours/month = ~3.1B tokens/month maximum capacity. The workload requires 6B tokens/month, so two 8x H100 clusters are needed: 2 × $4,608/month = $9,216/month spot
- Self-hosting saves ~$1,104/month above 1M queries/month on spot with two clusters
With adaptive token budgets cutting average chain length from 6,000 to 1,500 tokens, the same cluster handles 4x the queries at the same cost, pushing the breakeven down to roughly 250k-400k queries/month.
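The breakeven arithmetic above can be reproduced with a short calculator. The throughput, utilization, and price inputs are the projected figures from this article; substitute your measured numbers before making a purchasing decision.

```python
import math

TOKENS_PER_SEC = 1_600      # projected sustained 8x H100 throughput (see benchmarks above)
UTILIZATION = 0.75
HOURS_PER_MONTH = 720

def cluster_capacity_tokens() -> float:
    """Monthly token capacity of one 8x H100 cluster."""
    return TOKENS_PER_SEC * 3_600 * UTILIZATION * HOURS_PER_MONTH

def self_hosted_monthly(total_tokens: float, cluster_rate_per_hr: float) -> float:
    """Cost of enough whole clusters to serve the workload."""
    clusters = math.ceil(total_tokens / cluster_capacity_tokens())
    return clusters * cluster_rate_per_hr * HOURS_PER_MONTH

api_cost = 6e9 / 1e6 * 1.72            # 6B tokens/month at ~$1.72/M
spot_cost = self_hosted_monthly(6e9, 6.40)
print(f"API: ${api_cost:,.0f}/mo  Self-hosted spot: ${spot_cost:,.0f}/mo")
# API: $10,320/mo  Self-hosted spot: $9,216/mo
```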
Pricing fluctuates based on GPU availability. The prices above are based on 05 Apr 2026 and may have changed.
Production Checklist: Monitoring, Scaling, and Failover
Priority order for a production R2 deployment:
- Enable FP8 KV cache (--kv-cache-dtype fp8_e5m2): near-zero accuracy cost, cuts KV memory 50%
- Set reasoning token budgets via system prompt or query classifier before going live
- Use continuous batching (vLLM default): never static batching for reasoning workloads
- Set --max-model-len to the 95th-percentile reasoning trace length in your workload, not the maximum
- Monitor per-request KV cache usage via vLLM's Prometheus-compatible /metrics endpoint
- Add speculative decoding with a distilled draft model for throughput-heavy pipelines
- Implement request-level timeouts to guard against 100K+ token runaway reasoning chains
- Deploy health checks on the /health endpoint with auto-restart on OOM
- Use on-demand for interactive serving; spot for batch jobs
- Plan for multi-instance load balancing once request volume exceeds a single cluster's capacity
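The request-timeout item from the checklist can be sketched as follows. The decode rate and budget tiers are illustrative assumptions (measure your deployment's real per-request rate); the per-request timeout itself uses the OpenAI Python SDK's `client.with_options()`.

```python
def timeout_for(budget_label: str, decode_tok_per_s: float = 60.0) -> float:
    """Wall-clock timeout derived from the reasoning budget, with 2x slack.

    decode_tok_per_s is a placeholder per-request decode rate -- measure
    your deployment's real rate and substitute it."""
    caps = {"low": 3_000, "medium": 6_000, "high": 18_000}
    return 2.0 * caps[budget_label] / decode_tok_per_s

def ask_with_timeout(client, query: str, budget_label: str):
    """client is the OpenAI client from the earlier examples; the OpenAI
    Python SDK supports per-request timeouts via client.with_options()."""
    return client.with_options(timeout=timeout_for(budget_label)).chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID
        messages=[{"role": "user", "content": query}],
    )
```

A timed-out request still frees its KV blocks in vLLM once the connection drops, which is exactly the guard against runaway chains the checklist calls for.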
For full deployment documentation including automated setup scripts, see the Spheron vLLM server guide. For cost optimization strategies including spot vs. reserved tradeoffs, see the Spheron cost optimization guide.
DeepSeek R2 is the most GPU-intensive open reasoning model available in 2026. Running it well means getting KV cache and quantization right before you scale. Spheron's on-demand H100 and H200 clusters let you start with a single multi-GPU node and expand without committing to reserved capacity upfront.
Rent H100 SXM5 → | Rent H200 SXM5 → | View all GPU pricing →
