Note: DeepSeek R2 launched in March 2026. Some architecture specifications referenced in this guide, including total parameter count, expert configuration, and context window, are based on pre-release and early post-release information. Treat specific numbers as provisional until the official technical report is published.
DeepSeek R2 on a math competition problem generates up to 40,000 thinking tokens before producing an answer. A standard LLM generates 400. That 100x gap is not just about token count: it means 100x more KV cache pressure, 100x longer time-to-first-token if you don't tune for it, and completely different batching dynamics. Before you deploy R2, the KV cache memory implications and the cost structure of reasoning inference are the two things worth understanding in detail. Both interact with R2's architecture in ways that compound quickly at scale.
What Is DeepSeek R2 and How It Differs from V4 and R1
DeepSeek has two distinct model lines. The V-series (V3, V3.2 Speciale, V4) optimizes for coding, agentic tasks, and general-purpose instruction following with shorter outputs. The R-series (R1, R2) is reasoning-first: the model generates extended internal chain-of-thought before answering, and accuracy on math, logic, and science benchmarks is the primary objective.
R1 introduced the reasoning approach on a 671B MoE base. R2 is the direct successor, with a provisionally larger total parameter count and two architectural improvements that matter for deployment:
- Improved MLA (Multi-head Latent Attention): R2 extends the KV latent compression from R1, reducing the stored KV dimension significantly compared to standard multi-head attention. This is directly relevant to deployment because reasoning workloads hold 10,000-40,000 token thinking chains in memory for the full generation, and without MLA compression that would exhaust VRAM even before you account for model weights.
- Deeper expert routing: R2 uses a larger MoE base with more expert specialization for reasoning-specific patterns: structured proof steps, mathematical notation, logical chaining. This is why R2 outperforms R1 on AIME and competition math benchmarks even on problems that R1 technically has the capacity to solve.
V4 is a separate model. V4 optimizes for coding throughput and multi-step agentic tasks with shorter generation sequences. Choosing V4 vs R2 is a task-type decision, not a quality tradeoff. For coding and agentic work, see Deploy DeepSeek V4 on GPU Cloud.
| Model | Total Params | Active Params | Context Window | Architecture | Best For |
|---|---|---|---|---|---|
| DeepSeek R2 | ~685B* | ~37B* | 128K* | MoE + MLA | Math, logic, scientific reasoning |
| DeepSeek R1 | 671B | 37B | 128K | MoE + MLA | Reasoning (established baseline) |
| DeepSeek V4 | ~1T* | ~37B* | 1M* | MoE + MLA | Coding, agentic tasks |
*Provisional figures based on pre-release information. Early leaks suggest R2 may have a significantly larger total parameter count (~1.2T with ~78B active), but this has not been confirmed. Verify against official technical report.
For a comparison with Xiaomi's approach, see deploying MiMo-V2-Flash, a 309B MoE reasoning model with hybrid thinking mode and a 256K context window. For a dense-transformer alternative that beats R1 on GPQA Diamond and LiveCodeBench at lower hardware cost, see how Nemotron Ultra 253B compares.
DeepSeek R2 Architecture: MoE Reasoning Model Specs
R2 is a sparse MoE model. Only a subset of expert FFN networks activate for each token, keeping per-token compute roughly equivalent to a dense 37B model despite the much larger total parameter count. The router selects approximately top-8 from 256 experts per token (provisional, matching the R1/V3.2 architecture pattern), and the routing logic has been updated to specialize more heavily for reasoning patterns.
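To make the routing step concrete, here is a minimal sketch of generic top-k gating for a single token. It assumes a plain softmax over the selected router scores; R2's actual gating function has not been published, so the function name `route_token` and the normalization choice are illustrative only.

```python
import math
import random

def route_token(router_logits: list[float], top_k: int = 8) -> list[tuple[int, float]]:
    """Pick the top_k experts for one token and normalize their gate weights."""
    # Indices of the k largest router scores, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # one token, 256 experts
selected = route_token(logits)
print(len(selected), "of", len(logits), "experts active for this token")
```

Only the selected experts run their FFN for this token, which is why per-token compute tracks the active parameter count while VRAM tracks the total.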
MLA compression. The KV cache in standard multi-head attention stores full K and V tensors for each head per token. R2's MLA replaces this with a low-rank compressed KV latent (approximately 512 dimensions) plus a small decoupled RoPE key (64 dimensions), rather than the full per-head K/V matrices. For a 61-layer model at FP16, a 30,000-token reasoning chain generates:
61 layers × (512 KV latent + 64 RoPE key) × 30,000 tokens × 2 bytes ≈ 2.1 GB
A standard GQA model at the same depth with 8 KV heads and a 128-dim head stores both K and V, so the same trace takes 61 × 2 × 8 × 128 × 30,000 × 2 bytes ≈ 7.5 GB, roughly 3.5x more, and the gap widens proportionally for models with more KV heads. MLA is what makes extended reasoning feasible on 8x H100 rather than requiring a 16+ GPU cluster just for KV cache.
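The per-trace arithmetic above can be checked with a short sketch. The layer count, latent size, and RoPE key size are the provisional figures quoted in this section; the GQA baseline parameters (`kv_heads`, `head_dim`) are inputs you can change to match whatever model you are comparing against.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text

def mla_kv_bytes(tokens: int, layers: int = 61, latent_dim: int = 512,
                 rope_dim: int = 64, bytes_per_elem: int = 2) -> int:
    """KV cache size for MLA: one compressed latent plus a RoPE key per layer per token."""
    return layers * (latent_dim + rope_dim) * tokens * bytes_per_elem

def gqa_kv_bytes(tokens: int, layers: int = 61, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size for GQA: K and V are both stored, hence the factor of 2."""
    return layers * 2 * kv_heads * head_dim * tokens * bytes_per_elem

trace = 30_000
print(f"MLA: {mla_kv_bytes(trace) / GB:.1f} GB")
print(f"GQA: {gqa_kv_bytes(trace) / GB:.1f} GB")
```

Passing `bytes_per_elem=1` gives the FP8 KV cache figures.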
Extended reasoning architecture. R2 produces two token sequences: thinking tokens (the internal chain-of-thought) and response tokens (the final answer). Thinking tokens consume the same compute and memory as response tokens. The model's attention layers see the full thinking trace as context during response generation, which is why the KV state must be held in memory until the last response token is generated.
For expert parallelism background on MoE models, see MoE inference optimization on GPU cloud.
GPU Requirements: VRAM, Memory Bandwidth, and Multi-GPU Configurations
All expert weights must reside in VRAM at all times. The router selects different experts per token, so lazy-loading causes latency spikes that break interactive serving. VRAM planning uses total parameter count, not active parameter count.
VRAM formula: total_params × bytes_per_param × 1.15 + kv_cache_budget
For R2 at ~685B parameters:
| Configuration | Total VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1,128 GB | FP8 | 8 | Most headroom for KV cache; BF16 weights (~1,370 GB) would not fit |
| 8x H100 SXM5 | 640 GB | FP8 | 8 | Recommended default; tight, since nominal FP8 weights are ~685 GB |
| 4x H200 SXM5 | 564 GB | FP8 | 4 | Fewer GPUs, but less total VRAM than 8x H100 |
| 4x H100 SXM5 | 320 GB | INT4 | 4 | Budget option; nominal INT4 weights (~343 GB) are tight, and the accuracy loss is not recommended for reasoning |
| Distilled 70B (1x H100) | 80 GB | FP8 | 1 | Single-GPU option for distilled variant |
The nominal FP8 weight estimate for ~685B parameters is 685B × 1 byte ≈ 685 GB, which by itself does not fit in the 640 GB of an 8x H100 node. The 8x H100 recommendation assumes that DeepSeek's released FP8 checkpoint stores under 600 GB through weight sharing and compression, leaving headroom for KV cache; treat that figure as provisional until the checkpoint is published. If your deployment hits OOM on 8x H100, switch to 8x H200 or lower --max-model-len to shrink the KV cache allocation.
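As a rough planning aid, the VRAM formula can be written as a small sketch. The function names and the 0.90 usable fraction (borrowed from the --gpu-memory-utilization value used later in this guide) are assumptions for illustration. With the nominal 685 GB FP8 figure it reports that an 8x H100 node falls short, which matches the observation that 685 GB does not fit in 640 GB by itself.

```python
def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_budget_gb: float, overhead: float = 1.15) -> float:
    """total_params x bytes_per_param x 1.15 + kv_cache_budget, in GB.

    total_params_b is in billions, so params x bytes comes out in GB directly.
    """
    return total_params_b * bytes_per_param * overhead + kv_cache_budget_gb

def fits(required_gb: float, gpus: int, vram_per_gpu_gb: float,
         utilization: float = 0.90) -> bool:
    """True if the requirement fits in the usable share of the node's VRAM."""
    return required_gb <= gpus * vram_per_gpu_gb * utilization

need = estimate_vram_gb(685, 1.0, kv_cache_budget_gb=32)  # nominal FP8, 32 GB KV budget
print(f"required: {need:.0f} GB")
print("8x H100 (80 GB): ", fits(need, 8, 80))
print("8x H200 (141 GB):", fits(need, 8, 141))
```

Rerun it with the real checkpoint size once the weights are published; the conclusion for 8x H100 depends entirely on that number.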
The reasoning workload adds KV cache pressure on top of weight VRAM. A 30,000-token thinking chain at FP8 KV adds roughly 1 GB of KV state (half the FP16 figure from the MLA compression math above). This sounds small, but 32 concurrent reasoning requests each holding a 20,000-token chain add roughly 22 GB of KV cache beyond the model weights. This is still far less than a non-MLA model would require.
For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For memory requirements across model sizes, see our GPU memory requirements guide for LLMs.
Step-by-Step Deployment with vLLM
Prerequisites
- vLLM v0.9+ (required for --enable-expert-parallel and per-request max-tokens override)
- CUDA 12.4+, Python 3.10+
- Hugging Face account with model weights access
- 8x H100 SXM5 or 4x H200 SXM5 cluster (or 1x H100 for distilled variant)
Install vLLM
pip install vllm  # v0.9+ required for --enable-expert-parallel and per-request max-tokens override
Download Model Weights
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID. Verify the exact repo name at release.
huggingface-cli download deepseek-ai/DeepSeek-R2 --local-dir ./deepseek-r2
The FP8 checkpoint is approximately 600-700GB. Use a persistent storage volume to avoid re-downloading on instance restarts.
Launch with 8x H100 (Full Tensor Parallelism)
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port 8000
--enable-chunked-prefill is especially important for R2. Prefill processes the input prompt, which can reach 10,000-30,000 tokens in multi-turn or document-heavy requests, before the first thinking token is generated. Chunked prefill breaks this into smaller chunks that are scheduled alongside decode steps, so other requests keep generating their thinking traces instead of stalling behind one long prompt.
Launch with Mixed Expert and Tensor Parallelism
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
--tensor-parallel-size splits attention layers across GPUs. --enable-expert-parallel activates expert parallelism for MoE layers. When to use each:
- Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs participate in every forward pass, minimizing TTFT
- Mixed TP+EP: better for throughput-maximizing batch workloads; reduces attention layer communication overhead at some TTFT cost
For reasoning workloads where latency matters most, pure TP is preferred for interactive serving. Use TP+EP for offline batch reasoning pipelines.
Launch Distilled Variant on a Single H100
# Note: deepseek-ai/DeepSeek-R2-Distill-Llama-70B is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2-Distill-Llama-70B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 \
  --host 0.0.0.0 \
  --port 8000
Test the Endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R2", # provisional model ID; verify upon official release
messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
max_tokens=16384,
)
print(response.choices[0].message.content if response.choices else None)
A successful test on a reasoning problem like this will produce a visible <think>...</think> block before the response. If you see only a short response without a thinking section, verify the model loaded correctly and that you are using the base R2 model rather than a non-thinking fine-tune.
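When checking the test output programmatically, it helps to separate the thinking trace from the final answer. The sketch below assumes the thinking arrives inline in the message content wrapped in <think> tags, as described above; depending on how the server is configured, it may instead be returned in a separate field, in which case this helper is unnecessary. The name `split_thinking` is hypothetical.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty when no <think> block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    # Everything outside the matched block is treated as the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer

sample = "<think>Assume sqrt(2) = p/q in lowest terms...</think>\nsqrt(2) is irrational."
thinking, answer = split_thinking(sample)
print(f"thinking chars: {len(thinking)}, answer: {answer}")
```

An empty `thinking` string on a hard problem is a quick signal that the wrong model variant is loaded.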
For full vLLM setup including Docker deployment and load balancing, see the vLLM production deployment guide. For an OpenAI-compatible API layer on self-hosted models, see the self-hosted OpenAI-compatible API guide.
Optimizing R2 Inference: Managing Long Chain-of-Thought Output Sequences
Unconstrained R2 reasoning chains hit 20,000-40,000 tokens on hard math and science problems. For interactive serving, you want to cap this without sacrificing accuracy on the queries that actually need deep reasoning. Three techniques from reasoning model inference cost optimization, adapted to R2 specifically:
1. System prompt budget hints. Add a constraint to the system prompt instructing the model to use concise reasoning for simpler queries. R2 responds to these hints somewhat more consistently than R1 on structured problem types.
System: Use minimal reasoning steps for factual lookups and simple calculations. Reserve full reasoning depth for complex multi-step problems.
Budget hints alone typically reduce average thinking tokens by 40-60% on mixed workloads. They are not 100% reliable: R2 occasionally ignores hints on problems it classifies internally as complex.
2. Query classifier routing. A lightweight classifier assigns each request a reasoning budget (low/medium/high) and passes that as a system prompt constraint. This adds 5-10ms latency per request but achieves 70-80% reduction in thinking tokens for clearly simple queries.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Phrases that usually signal a multi-step problem needing deep reasoning
COMPLEX_KEYWORDS = ("prove", "derive", "solve", "optimize", "show that", "find all")

def classify_budget(query: str) -> tuple[str, int]:
    """Return (budget_label, max_thinking_tokens) for a query."""
    # Simple heuristics; replace with a trained classifier for production
    is_complex = any(w in query.lower() for w in COMPLEX_KEYWORDS)
    if is_complex:
        return "high", 16000
    if len(query.split()) < 15:
        return "low", 1000
    return "medium", 4000

def reason(query: str) -> str | None:
    """Send the query with a budget hint and a matching hard max_tokens cap."""
    budget_label, max_thinking = classify_budget(query)
    system = f"Reasoning budget: {budget_label}. Limit your thinking to {max_thinking} tokens."
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID; verify upon official release
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
        # Allow 2,000 tokens for the visible answer on top of the thinking budget
        max_tokens=max_thinking + 2000,
    )
    return response.choices[0].message.content if response.choices else None
3. Per-request max_tokens cap. vLLM 0.9+ supports per-request token limits. This is the most reliable enforcement mechanism because the serving engine enforces it rather than relying on instruction-following. Set max_tokens based on the classifier's budget output. Requests that hit the cap are truncated rather than finishing their reasoning chain, so only apply low caps to queries you are confident are simple.
For batch reasoning jobs (offline document analysis, research pipelines), caps are less important since you typically want full reasoning depth. Cap enforcement matters most for interactive production serving where 100K+ token runaway chains would block the serving queue.
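One way to soften the truncation risk of hard caps is to retry with a larger cap when a response is cut off. The sketch below is a hypothetical helper, not part of vLLM: it takes any `call(max_tokens)` function that returns `(text, finish_reason)` and steps through increasing caps. It relies on the OpenAI-compatible convention that a response stopped by the token limit reports `finish_reason == "length"`. The default caps mirror the classifier budgets above plus the 2,000-token answer allowance.

```python
from typing import Callable

def generate_with_escalation(
    call: Callable[[int], tuple[str, str]],
    budgets: tuple[int, ...] = (3000, 6000, 18000),
) -> tuple[str, int, bool]:
    """Try increasing max_tokens caps until the response is not truncated.

    Returns (text, cap_used, truncated). Each retry regenerates from scratch,
    so escalation trades extra tokens for fewer cut-off answers.
    """
    if not budgets:
        raise ValueError("budgets must contain at least one cap")
    text, cap = "", budgets[0]
    for cap in budgets:
        text, finish_reason = call(cap)
        if finish_reason != "length":  # finished on its own, not cut off by the cap
            return text, cap, False
    return text, cap, True  # still truncated at the largest cap

# Example wiring with the `client` defined earlier (illustrative only):
# def call(cap):
#     r = client.chat.completions.create(model=..., messages=..., max_tokens=cap)
#     return r.choices[0].message.content or "", r.choices[0].finish_reason
```

For interactive serving you may prefer to return the truncated answer with a warning instead of retrying, since each retry adds full generation latency.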
KV Cache Considerations for Extended Reasoning Traces
Reasoning workloads hit the KV cache harder than general LLMs for a specific reason: a 30,000-token thinking chain holds its full KV state in memory from the first thinking token until the last response token is generated. That memory block cannot be freed until the request completes.
At 20 concurrent reasoning requests each generating 25,000-token traces, with MLA's ~2 GB per 30K tokens at FP16, that is roughly 35 GB of KV cache consumed by in-flight requests. Without MLA, a standard GQA model at the same depth (8 KV heads, 128-dim head, about 7.5 GB per 30K tokens) would require roughly 125 GB for the same workload.
FP8 KV cache cuts this in half: --kv-cache-dtype fp8_e5m2. On H100, this is hardware-accelerated with negligible accuracy cost. It is essentially free and should be enabled by default for all R2 deployments.
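To see what halving the KV element size buys in concurrency, the sketch below estimates how many in-flight traces fit in a fixed KV pool. The layer count and MLA dimensions are the provisional figures used earlier in this guide, and the 40 GB pool is an arbitrary example value, not a measured number.

```python
def kv_bytes_per_token(layers: int = 61, latent_dim: int = 512,
                       rope_dim: int = 64, bytes_per_elem: int = 2) -> int:
    """MLA KV bytes held per token across all layers."""
    return layers * (latent_dim + rope_dim) * bytes_per_elem

def max_concurrent_traces(kv_pool_gb: float, tokens_per_trace: int,
                          bytes_per_elem: int = 2) -> int:
    """Whole traces of the given length that fit in the KV pool at once."""
    per_trace = kv_bytes_per_token(bytes_per_elem=bytes_per_elem) * tokens_per_trace
    return int(kv_pool_gb * 1e9 // per_trace)

pool_gb = 40  # example KV pool left over after weights
print("FP16 KV:", max_concurrent_traces(pool_gb, 25_000, bytes_per_elem=2))
print("FP8 KV: ", max_concurrent_traces(pool_gb, 25_000, bytes_per_elem=1))
```

This ignores block-granularity overhead in the paged allocator, so treat the result as an upper bound.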
--max-model-len sizing:
| max-model-len | Use Case | Concurrent Users (est.) |
|---|---|---|
| 8192 | Shallow reasoning, coding tasks | Highest concurrency |
| 32768 | Standard reasoning queries | Balanced |
| 65536 | Deep math/science problems | Moderate |
| 131072+ | Research-grade full-depth | Low; only if required |
--max-model-len caps the combined prompt and output length of any single request. vLLM's PagedAttention allocates KV blocks on demand, so a high cap does not waste memory on requests that stay short. The risk appears at peak load: the KV pool is a fixed size, and a few requests that grow toward the cap can occupy most of it, forcing other requests to queue or be preempted. Set this to the 95th-percentile reasoning trace length in your workload, not the theoretical maximum.
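A minimal sketch of the 95th-percentile sizing rule, assuming you have logged total token counts (prompt plus thinking plus response) for a representative sample of requests. The function name, the nearest-rank percentile method, and rounding up to a multiple of 1,024 are illustrative choices, not vLLM requirements.

```python
import math

def p95_max_model_len(trace_lengths: list[int], step: int = 1024) -> int:
    """Nearest-rank 95th percentile of observed lengths, rounded up to a multiple of step."""
    if not trace_lengths:
        raise ValueError("need at least one observed trace length")
    ordered = sorted(trace_lengths)
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return math.ceil(p95 / step) * step

# Synthetic log: 100 requests with lengths from 1,000 to 100,000 tokens.
observed = list(range(1_000, 101_000, 1_000))
print("--max-model-len", p95_max_model_len(observed))
```

Requests in the top 5% will be rejected or truncated at this setting, so pair it with a fallback path for the rare very long trace.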
For deeper coverage of KV cache management strategies, see the NVMe KV cache offloading guide.
Quantization Options: FP8, INT4, and Quality-Performance Tradeoffs for Reasoning
Quantization errors compound through long reasoning chains. An error that causes a minor output difference in a 500-token standard response can cascade through a 15,000-token thinking trace and produce a wrong final answer. This makes quantization choices more consequential for reasoning models.
| Quantization | VRAM vs FP16 | Throughput Gain | MATH-500 Accuracy Drop | AIME Accuracy Drop | Recommended For |
|---|---|---|---|---|---|
| FP16 | 1x (baseline) | 1x | 0% | 0% | Research / accuracy-critical |
| FP8 | ~0.5x | 1.6-1.9x | <1% | <2% | Production default |
| FP4 (Blackwell B200) | ~0.25x | 2.8-3.5x | 3-5% | 5-8% | High-throughput batch with prior validation |
| INT4 (GPTQ) | ~0.25x | 1.8-2.2x | 6-12% | 10-18% | Not recommended for reasoning |
| INT4 (AWQ) | ~0.25x | 1.7-2.0x | 4-8% | 7-14% | Not recommended for reasoning |
FP8 is the correct default for production R2 deployments. The H100 Transformer Engine handles FP8 in hardware, so throughput gains are real. INT4 methods show larger accuracy drops specifically on reasoning tasks because the quantization format degrades structured logical inference more than general text generation.
FP4 on B200 hardware is a viable option for throughput-heavy batch workloads where you have validated accuracy on your specific task set before enabling. See FP4 quantization on Blackwell GPUs for the full analysis.
Benchmarks: Throughput and Latency on H100, H200, and B200
These figures are projected from architectural data and vLLM MoE benchmarks. Actual results depend on batch size, context length distribution, and hardware interconnect configuration. The reasoning-specific columns (TTFT at 8K thinking tokens, latency at 16K total output) are more meaningful for R2 than standard throughput numbers.
| Hardware | Config | Throughput (tok/s) | p50 TTFT (8K thinking tokens) | p50 Latency (16K total output) | Notes |
|---|---|---|---|---|---|
| H100 SXM5 | 8x FP8, TP=8 | ~1,600 | ~6s | ~22s | NVLink, lowest TTFT |
| H200 SXM5 | 4x FP8 | ~1,300 | ~8s | ~28s | More VRAM headroom per GPU |
| B200 SXM6 | 4x FP8 | ~2,100 | ~5s | ~17s | Higher HBM bandwidth, FP4-capable |
| H100 SXM5 | 1x FP8 | ~550 | ~18s | ~58s | Distilled 70B variant only |
For reasoning workloads, TTFT matters more than raw throughput. A user submitting a math proof problem is waiting for the model to finish its internal thinking before seeing any response. H100 in 8-GPU NVLink configuration minimizes per-layer compute time, which is why it is preferred over H200 in 4-GPU for interactive serving even though H200 provides more VRAM per GPU.
For engine-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks. For a hardware comparison covering H200, B200, and GB200, see NVIDIA H200 vs B200 vs GB200.
Spheron GPU Pricing for DeepSeek R2
Prices from live Spheron GPU pricing API, fetched 05 Apr 2026:
| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 1x H100 SXM5 | $2.40 | $0.80 | $1,728 | $576 |
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $18.00 | $4.76 | $12,960 | $3,427 |
| 4x B200 SXM6 | N/A | $8.24 | N/A | $5,933 |
Pricing fluctuates based on GPU availability. The prices above are based on 05 Apr 2026 and may have changed.
Spot vs. on-demand for R2: Spot instances are appropriate for batch reasoning jobs: offline document analysis, research pipelines, evaluation runs, or any workload that can restart from a checkpoint if the instance is reclaimed. On-demand is the right choice for real-time serving with SLA requirements.
One specific consideration for R2: reasoning workloads sustain high GPU utilization during long thinking chains. A spot interruption arriving mid-request means the entire reasoning trace is lost and the request must restart from scratch. If your average reasoning chain is 20,000 tokens and takes 30+ seconds to generate, a mid-chain interruption is more expensive per occurrence than for a standard model generating 500-token responses. Factor this into your spot vs. on-demand tradeoff for interactive R2 serving.
Compared to equivalent H100 offerings from major cloud providers, Spheron's pricing is competitive at $19.20/hr on-demand for 8x H100. For other instance types such as A100 80GB clusters, the cost advantage is more pronounced.
Cost Comparison: Self-Hosted R2 vs API Pricing
Using a mixed reasoning workload (70% thinking tokens, 30% response tokens, average 6,000 total tokens per query):
| Provider | Model | Input ($/M) | Output ($/M) | Est. Cost per 1M tokens (reasoning mix) |
|---|---|---|---|---|
| DeepSeek API | DeepSeek-R1 (R2 est.)* | ~$0.55 | ~$2.19 | ~$1.72/M |
| OpenAI | o3 | $2.00 | $8.00 | ~$6.20/M |
| Spheron (H100, on-demand) | R2 self-hosted | $2.40/GPU-hr | - | ~$3.33-4.44/M |
| Spheron (H100, spot) | R2 self-hosted | $0.80/GPU-hr | - | ~$1.11-1.48/M |
*DeepSeek R2 API pricing is not yet published. Input/output prices above are DeepSeek R1's published rates, used as a projected estimate. Self-hosted figures assume an 8x H100 cluster sustaining ~1,600 tok/s at 75-100% utilization.
ROI breakeven calculation: At 1 million queries/month with 6,000 tokens/query average:
- Total tokens: 6B tokens/month
- DeepSeek API cost: 6B × $1.72/M = $10,320/month
- Self-hosted on 8x H100 spot at $6.40/hr, 75% utilization: 8x H100 at ~1,600 tok/s sustained = 1,600 × 3,600 × 0.75 = ~4.32M tokens/hour. At ~4.32M tokens/hour × 720 hours/month = ~3.1B tokens/month maximum capacity. The workload requires 6B tokens/month, so two 8x H100 clusters are needed: 2 × $4,608/month = $9,216/month spot
- Self-hosting saves ~$1,104/month at 1M queries/month on spot with two clusters
With adaptive token budgets cutting average chain length from 6,000 to 1,500 tokens, the same cluster handles 4x the queries at the same cost, pushing the breakeven down to roughly 250k-400k queries/month.
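The breakeven arithmetic above can be reproduced with a short sketch so you can substitute your own volumes and prices. All defaults are the figures quoted in this section (blended API price, projected throughput, spot rate); the function name and return shape are illustrative.

```python
import math

def monthly_comparison(queries: int, tokens_per_query: int = 6000,
                       api_usd_per_m: float = 1.72, tok_per_s: float = 1600,
                       utilization: float = 0.75, cluster_usd_per_hr: float = 6.40,
                       hours: int = 720) -> dict:
    """Compare monthly API spend against the number of 8x H100 clusters needed."""
    total_tokens = queries * tokens_per_query
    api_cost = total_tokens / 1e6 * api_usd_per_m
    # Tokens one cluster can sustain in a month at the given utilization.
    cluster_capacity = tok_per_s * 3600 * utilization * hours
    clusters = math.ceil(total_tokens / cluster_capacity)
    self_hosted = clusters * cluster_usd_per_hr * hours
    return {
        "api_usd": round(api_cost, 2),
        "clusters": clusters,
        "self_hosted_usd": round(self_hosted, 2),
        "savings_usd": round(api_cost - self_hosted, 2),
    }

print(monthly_comparison(1_000_000))
```

Because clusters are rented whole, the self-hosted cost moves in steps; the savings figure is most favorable just below a capacity boundary and least favorable just above one.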
Production Checklist: Monitoring, Scaling, and Failover
Priority order for a production R2 deployment:
- Enable FP8 KV cache (--kv-cache-dtype fp8_e5m2): near-zero accuracy cost, cuts KV memory 50%
- Set reasoning token budgets via system prompt or query classifier before going live
- Use continuous batching (vLLM default): never static batching for reasoning workloads
- Set --max-model-len to the 95th-percentile reasoning trace length in your workload, not the maximum
- Monitor per-request KV cache usage via vLLM's Prometheus-compatible /metrics endpoint
- Add speculative decoding with a distilled draft model for throughput-heavy pipelines
- Implement request-level timeouts to guard against 100K+ token runaway reasoning chains
- Deploy health checks on the /health endpoint with auto-restart on OOM
- Use on-demand for interactive serving; spot for batch jobs
- Plan for multi-instance load balancing once request volume exceeds a single cluster's capacity
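For the monitoring item, a minimal sketch of reading a gauge out of the /metrics output is shown below. It parses the Prometheus text exposition format with the standard library only. The metric name `vllm:gpu_cache_usage_perc` is an assumption that varies across vLLM versions, the sample text is made up for illustration, and the parser assumes samples carry no trailing timestamp; check your server's actual /metrics output before alerting on it.

```python
def parse_gauge(metrics_text: str, metric_name: str) -> list[float]:
    """Return every sample value for metric_name from Prometheus text exposition."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # A sample looks like: name{label="x"} value   or   name value
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            values.append(float(line.rsplit(" ", 1)[1]))
    return values

# Made-up sample; in practice fetch the text from http://<host>:8000/metrics.
SAMPLE = """# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="deepseek-r2"} 0.87
vllm:num_requests_running{model_name="deepseek-r2"} 12.0
"""

usage = parse_gauge(SAMPLE, "vllm:gpu_cache_usage_perc")
if usage and max(usage) > 0.85:
    print(f"KV cache at {max(usage):.0%}: consider shedding load or lowering max-model-len")
```

In production a Prometheus server would scrape the endpoint directly; a helper like this is mainly useful for quick health scripts and smoke tests.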
For full deployment documentation including automated setup scripts, see the Spheron vLLM server guide. For cost optimization strategies including spot vs. reserved tradeoffs, see the Spheron cost optimization guide.
DeepSeek R2 is the most GPU-intensive open reasoning model available in 2026. Running it well means getting KV cache and quantization right before you scale. Spheron's on-demand H100 and H200 clusters let you start with a single multi-GPU node and expand without committing to reserved capacity upfront.
Quick Setup Guide
Calculate VRAM needs based on total parameter count (provisionally ~685B), not active parameters. In FP8 (1 byte/param), the weights alone nominally require roughly 685GB. Add 15-20% overhead for activations and framework, plus KV cache budget for your target context length. 8x H100 SXM5 (640GB total) is the recommended minimum for FP8 with compressed MLA KV cache, provided the released FP8 checkpoint comes in under 600GB. 4x H200 SXM5 (564GB total) uses fewer GPUs but has less total VRAM, so it leaves less headroom.
Log in to app.spheron.ai, select H100 SXM5 or H200 SXM5, choose cluster size (4 or 8 GPUs), and deploy an instance with NVLink interconnect enabled. NVLink is critical for R2 because all-to-all routing communication between expert GPUs runs at 900 GB/s bidirectional on NVLink versus much lower on PCIe, directly affecting latency for every MoE layer forward pass.
Install vLLM v0.9+ with CUDA 12.4 and Python 3.10+. Use huggingface-cli to download the model weights to a persistent storage volume so you avoid re-downloading on instance restarts. The FP8 checkpoint is approximately 685GB. Verify the model ID on the official HuggingFace page at release, as provisional IDs sometimes change.
For latency-sensitive interactive serving, use pure tensor parallelism (--tensor-parallel-size 8 on 8x H100). This keeps all GPUs active on every forward pass, minimizing time-to-first-token. For throughput-heavy batch reasoning jobs, use mixed TP+EP (--tensor-parallel-size 4 --enable-expert-parallel), which reduces attention layer communication overhead at some TTFT cost. Reasoning workloads are TTFT-sensitive, so prefer pure TP for production serving.
Run vllm serve with the model path, --quantization fp8, --max-model-len tuned to your 95th-percentile reasoning trace length, --enable-chunked-prefill (which keeps long prompt prefills from stalling other requests' decoding), and --gpu-memory-utilization 0.90. The server exposes an OpenAI-compatible /v1/chat/completions endpoint. Test with a math or logic question that requires multi-step reasoning to verify the model is generating thinking tokens.
Enable FP8 KV cache (--kv-cache-dtype fp8_e5m2) to halve KV memory with near-zero accuracy cost. Set max-model-len to your 95th-percentile reasoning trace length rather than the theoretical maximum. Implement a query classifier that routes simple queries (factual lookups, short math) to a lower max_tokens cap and complex queries to full reasoning depth. Monitor per-request KV cache usage via vLLM's Prometheus /metrics endpoint and set request-level timeouts to guard against runaway 100K+ token reasoning chains.
Frequently Asked Questions
For the full DeepSeek R2 model (provisionally ~685B total parameters), you need at least 8x H100 SXM5 80GB GPUs for FP8 inference (640GB total VRAM, assuming the released FP8 checkpoint is under 600GB), or 4x H200 SXM5 141GB (564GB total) for FP8 with fewer GPUs but less total VRAM. For the distilled 70B variant, a single H100 80GB is sufficient at FP8.
R1 introduced DeepSeek's chain-of-thought reasoning approach with a 671B MoE base. R2 is the direct successor, with a larger total parameter count and improved MLA (Multi-head Latent Attention) that compresses KV cache more aggressively, which matters because reasoning workloads hold large thinking-token sequences in memory. V4 is a separate line optimized for coding and agentic tasks with shorter outputs. R2 is purpose-built for math, logic, and science reasoning where extended thinking chains are the norm, not the exception.
Two factors combine. First, like all MoE models, all expert weights must reside in VRAM even though only a fraction activate per token. Second, reasoning workloads generate 10,000-40,000 thinking tokens per request, and each token's KV state is held in memory for the full generation duration. R2's MLA architecture compresses the KV latent dimension significantly compared to standard MHA, which partially offsets this, but long thinking chains still require substantial KV cache budget on top of weight VRAM.
Yes. FP8 quantization on H100 hardware is hardware-accelerated via the Transformer Engine and causes less than 1% accuracy loss on MATH-500 and less than 2% on AIME. INT4 is not recommended for reasoning workloads because quantization errors compound through long thinking chains and can derail multi-step logical inference. FP4 on Blackwell B200 hardware is viable for throughput-heavy batch workloads after accuracy validation on your specific task set.
At moderate to high query volumes, yes. The crossover point depends on your average reasoning chain length. For workloads averaging 6,000 total tokens per query, self-hosting on H100 spot at $0.80/hr per GPU typically beats API pricing above 400k-600k queries per month. With adaptive token budgets cutting average chains from 6,000 to 1,500 tokens, the same cluster handles 4x the queries, shifting the breakeven lower still. For low query volumes under 100k/month, the API is usually cheaper than running dedicated GPUs.
