Note: DeepSeek R2 launched in March 2026. Some architecture specifications referenced in this guide, including total parameter count, expert configuration, and context window, are based on pre-release and early post-release information. Treat specific numbers as provisional until the official technical report is published.
DeepSeek R2 on a math competition problem generates up to 40,000 thinking tokens before producing an answer. A standard LLM generates 400. That 100x gap is not just about token count: it means 100x more KV cache pressure, 100x longer time-to-first-token if you don't tune for it, and completely different batching dynamics. Before you deploy R2, the KV cache memory implications and the cost structure of reasoning inference are the two things worth understanding in detail. Both interact with R2's architecture in ways that compound quickly at scale.
What Is DeepSeek R2 and How It Differs from V4 and R1
DeepSeek has two distinct model lines. The V-series (V3, V3.2 Speciale, V4) optimizes for coding, agentic tasks, and general-purpose instruction following with shorter outputs. The R-series (R1, R2) is reasoning-first: the model generates extended internal chain-of-thought before answering, and accuracy on math, logic, and science benchmarks is the primary objective.
R1 introduced the reasoning approach on a 671B MoE base. R2 is the direct successor, with a provisionally larger total parameter count and two architectural improvements that matter for deployment:
- Improved MLA (Multi-head Latent Attention): R2 extends the KV latent compression from R1, reducing the stored KV dimension significantly compared to standard multi-head attention. This is directly relevant to deployment because reasoning workloads hold 10,000-40,000 token thinking chains in memory for the full generation, and without MLA compression that would exhaust VRAM even before you account for model weights.
- Deeper expert routing: R2 uses a larger MoE base with more expert specialization for reasoning-specific patterns: structured proof steps, mathematical notation, logical chaining. This is why R2 outperforms R1 on AIME and competition math benchmarks even on problems that R1 technically has the capacity to solve.
V4 is a separate model. V4 optimizes for coding throughput and multi-step agentic tasks with shorter generation sequences. Choosing V4 vs R2 is a task-type decision, not a quality tradeoff. For coding and agentic work, see Deploy DeepSeek V4 on GPU Cloud.
| Model | Total Params | Active Params | Context Window | Architecture | Best For |
|---|---|---|---|---|---|
| DeepSeek R2 | ~685B* | ~37B* | 128K* | MoE + MLA | Math, logic, scientific reasoning |
| DeepSeek R1 | 671B | 37B | 128K | MoE + MLA | Reasoning (established baseline) |
| DeepSeek V4 | ~1T* | ~37B* | 1M* | MoE + MLA | Coding, agentic tasks |
*Provisional figures based on pre-release information. Early leaks suggest R2 may have a significantly larger total parameter count (~1.2T with ~78B active), but this has not been confirmed. Verify against official technical report.
DeepSeek R2 Architecture: MoE Reasoning Model Specs
R2 is a sparse MoE model. Only a subset of expert FFN networks activate for each token, keeping per-token compute roughly equivalent to a dense 37B model despite the much larger total parameter count. The router selects approximately top-8 from 256 experts per token (provisional, matching the R1/V3.2 architecture pattern), and the routing logic has been updated to specialize more heavily for reasoning patterns.
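The top-k gating described above can be sketched in a few lines of plain Python. This is a toy illustration of the mechanism, not DeepSeek's router; the top-8-of-256 figures are the provisional ones from this section.

```python
import math
import random

def route_token(router_logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token; return (expert_id, gate_weight) pairs.
    Gate weights are a softmax over only the selected experts' logits."""
    topk = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in topk)               # numerical stability
    exps = [math.exp(router_logits[i] - m) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]     # one token's router scores
routing = route_token(logits)                             # 8 experts run; 248 stay idle
```

Per-token compute stays near the dense-37B level because only the 8 selected experts' FFNs execute; the other 248 experts' weights still occupy VRAM, which is why VRAM planning uses total parameters.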
MLA compression. The KV cache in standard multi-head attention stores full K and V tensors for each head per token. R2's MLA replaces this with a low-rank compressed KV latent (approximately 512 dimensions) plus a small decoupled RoPE key (64 dimensions), rather than the full per-head K/V matrices. For a 61-layer model at FP16, a 30,000-token reasoning chain generates:
61 layers × (512 KV latent + 64 RoPE key) × 30,000 tokens × 2 bytes ≈ 2.1 GB

A standard GQA model at the same depth with 8 KV heads and 128-dim heads would generate roughly 7 GB for the same trace (61 × 8 × 128 × 2 × 30,000 × 2 bytes), and full multi-head attention far more. MLA is what makes extended reasoning feasible on 8x H100 rather than requiring a 16+ GPU cluster just for KV cache.
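The back-of-envelope arithmetic above can be wrapped in a reusable helper; the layer count and latent/RoPE dimensions are the provisional figures from this section.

```python
def mla_kv_bytes(tokens: int, layers: int = 61, kv_latent: int = 512,
                 rope_key: int = 64, bytes_per_elem: int = 2) -> int:
    """KV cache for an MLA model: one compressed latent plus a decoupled
    RoPE key per layer per token (FP16 = 2 bytes per element)."""
    return layers * (kv_latent + rope_key) * tokens * bytes_per_elem

for n in (10_000, 30_000, 40_000):
    print(f"{n:>6} tokens: {mla_kv_bytes(n) / 1e9:.1f} GB")  # 0.7, 2.1, 2.8 GB
```

Swapping in FP8 KV cache (1 byte per element) halves each figure, which matters once dozens of traces are in flight concurrently.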
Extended reasoning architecture. R2 produces two token sequences: thinking tokens (internal, not shown to users) and response tokens (the visible answer). Thinking tokens consume identical compute and memory as output tokens. The model's attention layers see the full thinking trace as context during response generation, which is why the KV state must be held in memory until the last response token is generated.
For expert parallelism background on MoE models, see MoE inference optimization on GPU cloud.
GPU Requirements: VRAM, Memory Bandwidth, and Multi-GPU Configurations
All expert weights must reside in VRAM at all times. The router selects different experts per token, so lazy-loading causes latency spikes that break interactive serving. VRAM planning uses total parameter count, not active parameter count.
VRAM formula: total_params × bytes_per_param × 1.15 + kv_cache_budget
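A quick sanity check of the formula; the 1.15 overhead factor and the ~685B total come from this section's provisional figures. Note that the naive FP8 estimate lands above 640 GB, which is why the compressed checkpoint format matters for the 8x H100 configuration.

```python
def weight_vram_gb(total_params_b: float, bytes_per_param: float,
                   overhead: float = 1.15) -> float:
    """Weight-side VRAM from the formula above: params x bytes/param x 1.15.
    The KV cache budget is added on top of this."""
    return total_params_b * bytes_per_param * overhead

# ~685B total parameters (provisional figure)
for precision, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{precision}: ~{weight_vram_gb(685, bpp):.0f} GB + KV cache budget")
```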
For R2 at ~685B parameters:
| Configuration | Total VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1,128 GB | FP8 | 8 | Maximum KV cache headroom; BF16 weights (~1.4 TB) would need 16 GPUs |
| 8x H100 SXM5 | 640 GB | FP8 | 8 | Recommended default; compressed FP8 checkpoint fits with headroom |
| 4x H200 SXM5 | 564 GB | FP8 | 4 | Fewer GPUs, adequate VRAM headroom |
| 4x H100 SXM5 | 320 GB | INT4 | 4 | Budget option; accuracy loss not recommended for reasoning |
| Distilled 70B (1x H100) | 80 GB | FP8 | 1 | Single-GPU option for distilled variant |
The FP8 weight estimate for ~685B parameters: 685 × 1 byte = ~685GB, which does not fit in 640GB of H100 VRAM by itself. In practice, DeepSeek's FP8 checkpoint format uses weight sharing and compression that reduces the actual stored checkpoint to under 600GB, fitting within 8x H100 with headroom for KV cache. If your deployment sees OOM on 8x H100, switch to 8x H200 or reduce max-model-len to shrink KV cache allocation.
The reasoning workload adds KV cache pressure on top of weight VRAM. A 30,000-token thinking chain at FP8 KV adds roughly 1 GB of KV state (based on the MLA compression math above, halved by FP8). This sounds small, but at 32 concurrent reasoning requests each generating 20,000-token chains, that is roughly 20 GB of KV cache beyond the model weights. This is still far less than a non-MLA model would require.
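The concurrency math above generalizes into a small planning helper. This is a sketch under the ~1 GB per 30K tokens estimate from this section; `max_concurrent` is an illustrative name, not a vLLM API.

```python
def max_concurrent(kv_pool_gb: float, tokens_per_request: int,
                   gb_per_30k_tokens: float = 1.0) -> int:
    """How many in-flight reasoning requests fit in a given KV cache pool.

    gb_per_30k_tokens is the per-trace footprint: ~1 GB per 30K tokens with
    MLA compression and FP8 KV cache, per the estimate above."""
    per_request_gb = gb_per_30k_tokens * tokens_per_request / 30_000
    return int(kv_pool_gb // per_request_gb)

# e.g. 50 GB of VRAM left for KV after weights, 15K-token average traces
print(max_concurrent(50, 15_000))  # 100 concurrent requests
```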
For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For memory requirements across model sizes, see our GPU memory requirements guide for LLMs.
Step-by-Step Deployment with vLLM
Prerequisites
- vLLM v0.9+ (required for --enable-expert-parallel and per-request max-tokens override)
- CUDA 12.4+, Python 3.10+
- Hugging Face account with model weights access
- 8x H100 SXM5 or 4x H200 SXM5 cluster (or 1x H100 for distilled variant)
Install vLLM
pip install vllm  # v0.9+ required for --enable-expert-parallel and per-request max-tokens override

Download Model Weights
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID. Verify the exact repo name at release.
huggingface-cli download deepseek-ai/DeepSeek-R2 --local-dir ./deepseek-r2

The FP8 checkpoint is approximately 600-700 GB. Use a persistent storage volume to avoid re-downloading on instance restarts.
Launch with 8x H100 (Full Tensor Parallelism)
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
--tensor-parallel-size 8 \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port 8000

--enable-chunked-prefill is especially important for R2. The prefill phase processes the entire thinking prompt (potentially 10,000-30,000 tokens) before any response tokens are generated. Chunked prefill breaks this into smaller batches, keeping the GPU serving other requests during long prefill sequences rather than blocking on a single request's thinking trace.
Launch with Mixed Expert and Tensor Parallelism
# Note: deepseek-ai/DeepSeek-R2 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 \
--port 8000

--tensor-parallel-size splits attention layers across GPUs. --enable-expert-parallel activates expert parallelism for MoE layers. When to use each:
- Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs participate in every forward pass, minimizing TTFT
- Mixed TP+EP: better for throughput-maximizing batch workloads; reduces attention layer communication overhead at some TTFT cost
For reasoning workloads where latency matters most, pure TP is preferred for interactive serving. Use TP+EP for offline batch reasoning pipelines.
Launch Distilled Variant on a Single H100
# Note: deepseek-ai/DeepSeek-R2-Distill-Llama-70B is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-R2-Distill-Llama-70B \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8_e5m2 \
--host 0.0.0.0 \
--port 8000

Test the Endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R2", # provisional model ID; verify upon official release
messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
max_tokens=16384,
)
print(response.choices[0].message.content if response.choices else None)

A successful test on a reasoning problem like this will produce a visible <think>...</think> block before the response. If you see only a short response without a thinking section, verify the model loaded correctly and that you are using the base R2 model rather than a non-thinking fine-tune.
For full vLLM setup including Docker deployment and load balancing, see the vLLM production deployment guide. For an OpenAI-compatible API layer on self-hosted models, see the self-hosted OpenAI-compatible API guide.
Optimizing R2 Inference: Managing Long Chain-of-Thought Output Sequences
Unconstrained R2 reasoning chains hit 20,000-40,000 tokens on hard math and science problems. For interactive serving, you want to cap this without sacrificing accuracy on the queries that actually need deep reasoning. Three techniques from reasoning model inference cost optimization, adapted to R2 specifically:
1. System prompt budget hints. Add a constraint to the system prompt instructing the model to use concise reasoning for simpler queries. R2 responds to these hints somewhat more consistently than R1 on structured problem types.
System: Use minimal reasoning steps for factual lookups and simple calculations. Reserve full reasoning depth for complex multi-step problems.

Budget hints alone typically reduce average thinking tokens by 40-60% on mixed workloads. They are not 100% reliable: R2 occasionally ignores hints on problems it classifies internally as complex.
2. Query classifier routing. A lightweight classifier assigns each request a reasoning budget (low/medium/high) and passes that as a system prompt constraint. This adds 5-10ms latency per request but achieves 70-80% reduction in thinking tokens for clearly simple queries.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
def classify_budget(query: str) -> tuple[str, int]:
    """Returns (budget_label, max_thinking_tokens)"""
    query_lower = query.lower()
    # Simple heuristics; replace with a trained classifier for production
    hard_markers = ["prove", "derive", "solve", "optimize", "show that", "find all"]
    if len(query.split()) < 15 and not any(w in query_lower for w in hard_markers):
        return "low", 1000
    elif any(w in query_lower for w in hard_markers):
        return "high", 16000
    else:
        return "medium", 4000

def reason(query: str) -> str | None:
    budget_label, max_thinking = classify_budget(query)
    system = f"Reasoning budget: {budget_label}. Limit your thinking to {max_thinking} tokens."
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID; verify upon official release
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
        max_tokens=max_thinking + 2000,
    )
    return response.choices[0].message.content if response.choices else None

3. Per-request max_tokens cap. vLLM 0.9+ supports per-request token limits. This is the most reliable enforcement mechanism because it operates at the model level rather than relying on instruction-following. Set max_tokens based on the classifier's budget output. Requests that hit the cap will truncate rather than generating a complete reasoning chain, so only apply low caps to queries you are confident are simple.
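A sketch of wiring the classifier's budget into a hard cap, reusing the `client` and budget tiers from the example above. `pick_cap` and `reason_capped` are illustrative names; the `finish_reason == "length"` check is the standard OpenAI-compatible signal that a response was cut off by max_tokens.

```python
def pick_cap(budget_label: str) -> int:
    """Map the classifier's label to a hard max_tokens cap: the thinking
    budget plus ~2,000 tokens of headroom for the visible answer."""
    thinking = {"low": 1_000, "medium": 4_000, "high": 16_000}[budget_label]
    return thinking + 2_000

def reason_capped(client, query: str, budget_label: str):
    """client is the OpenAI client from the example above. Returns the text
    and whether the cap truncated the chain (finish_reason == "length")."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID
        messages=[{"role": "user", "content": query}],
        max_tokens=pick_cap(budget_label),
    )
    choice = response.choices[0]
    return choice.message.content, choice.finish_reason == "length"
```

When truncation is detected, retry at the next budget tier rather than returning the clipped chain as an answer.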
For batch reasoning jobs (offline document analysis, research pipelines), caps are less important since you typically want full reasoning depth. Cap enforcement matters most for interactive production serving where 100K+ token runaway chains would block the serving queue.
KV Cache Considerations for Extended Reasoning Traces
Reasoning workloads hit the KV cache harder than general LLMs for a specific reason: a 30,000-token thinking chain holds its full KV state in memory from the first thinking token until the last response token is generated. That memory block cannot be freed until the request completes.
At 20 concurrent reasoning requests each generating 25,000-token traces, with MLA's ~2 GB per 30K tokens at FP16, that is roughly 35 GB of KV cache consumed by in-flight requests. Without MLA, a GQA model at the same depth would require well over 100 GB for the same workload, and full multi-head attention far more.
FP8 KV cache cuts this in half: --kv-cache-dtype fp8_e5m2. On H100, this is hardware-accelerated with negligible accuracy cost. It is essentially free and should be enabled by default for all R2 deployments.
--max-model-len sizing:
| max-model-len | Use Case | Concurrent Users (est.) |
|---|---|---|
| 8192 | Shallow reasoning, coding tasks | Highest concurrency |
| 32768 | Standard reasoning queries | Balanced |
| 65536 | Deep math/science problems | Moderate |
| 131072+ | Research-grade full-depth | Low; only if required |
Setting a high max-model-len reserves KV cache memory for the worst-case request, reducing available slots for concurrent users even if most requests never reach that length. vLLM's PagedAttention allocates KV blocks on-demand per request, so you do not waste memory if requests are shorter than max-model-len. The risk is over-reservation of the KV pool at peak load. Set this to the 95th-percentile reasoning trace length in your workload, not the theoretical maximum.
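Computing the 95th percentile from logged trace lengths is a one-liner worth getting right; this sketch uses the nearest-rank method with hypothetical logged data.

```python
import math

def p95_length(trace_lengths: list[int]) -> int:
    """95th-percentile total sequence length (nearest-rank method)
    for sizing --max-model-len."""
    ranked = sorted(trace_lengths)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

# Hypothetical logged lengths: mostly short queries, a few deep proofs
lengths = [4_000] * 80 + [12_000] * 15 + [45_000] * 5
print(p95_length(lengths))  # 12000 -> a 16384 max-model-len covers p95
```

In this hypothetical distribution, sizing for the 45K outliers would reserve 4x the KV pool that 95% of traffic ever uses; route the rare deep-reasoning requests to a separate high-context instance instead.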
For deeper coverage of KV cache management strategies, see the NVMe KV cache offloading guide.
Quantization Options: FP8, INT4, and Quality-Performance Tradeoffs for Reasoning
Quantization errors compound through long reasoning chains. An error that causes a minor output difference in a 500-token standard response can cascade through a 15,000-token thinking trace and produce a wrong final answer. This makes quantization choices more consequential for reasoning models.
| Quantization | VRAM vs FP16 | Throughput Gain | MATH-500 Accuracy Drop | AIME Accuracy Drop | Recommended For |
|---|---|---|---|---|---|
| FP16 | 1x (baseline) | 1x | 0% | 0% | Research / accuracy-critical |
| FP8 | ~0.5x | 1.6-1.9x | <1% | <2% | Production default |
| FP4 (Blackwell B200) | ~0.25x | 2.8-3.5x | 3-5% | 5-8% | High-throughput batch with prior validation |
| INT4 (GPTQ) | ~0.25x | 1.8-2.2x | 6-12% | 10-18% | Not recommended for reasoning |
| INT4 (AWQ) | ~0.25x | 1.7-2.0x | 4-8% | 7-14% | Not recommended for reasoning |
FP8 is the correct default for production R2 deployments. The H100 Transformer Engine handles FP8 in hardware, so throughput gains are real. INT4 methods show larger accuracy drops specifically on reasoning tasks because the quantization format degrades structured logical inference more than general text generation.
FP4 on B200 hardware is a viable option for throughput-heavy batch workloads where you have validated accuracy on your specific task set before enabling. See FP4 quantization on Blackwell GPUs for the full analysis.
Benchmarks: Throughput and Latency on H100, H200, and B200
These figures are projected from architectural data and vLLM MoE benchmarks. Actual results depend on batch size, context length distribution, and hardware interconnect configuration. The reasoning-specific columns (TTFT at 8K thinking tokens, latency at 16K total output) are more meaningful for R2 than standard throughput numbers.
| Hardware | Config | Throughput (tok/s) | p50 TTFT (8K thinking tokens) | p50 Latency (16K total output) | Notes |
|---|---|---|---|---|---|
| H100 SXM5 | 8x FP8, TP=8 | ~1,600 | ~6s | ~22s | NVLink, lowest TTFT |
| H200 SXM5 | 4x FP8 | ~1,300 | ~8s | ~28s | More VRAM headroom per GPU |
| B200 SXM6 | 4x FP8 | ~2,100 | ~5s | ~17s | Higher HBM bandwidth, FP4-capable |
| H100 SXM5 | 1x FP8 | ~550 | ~18s | ~58s | Distilled 70B variant only |
For reasoning workloads, TTFT matters more than raw throughput. A user submitting a math proof problem is waiting for the model to finish its internal thinking before seeing any response. H100 in 8-GPU NVLink configuration minimizes per-layer compute time, which is why it is preferred over H200 in 4-GPU for interactive serving even though H200 provides more VRAM per GPU.
For engine-level benchmarks comparing vLLM, TensorRT-LLM, and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks. For a hardware comparison covering H200, B200, and GB200, see NVIDIA H200 vs B200 vs GB200.
Spheron GPU Pricing for DeepSeek R2
Prices from live Spheron GPU pricing API, fetched 05 Apr 2026:
| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 1x H100 SXM5 | $2.40 | $0.80 | $1,728 | $576 |
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $18.00 | $4.76 | $12,960 | $3,427 |
| 4x B200 SXM6 | N/A | $8.24 | N/A | $5,933 |
Pricing fluctuates based on GPU availability. The prices above are based on 05 Apr 2026 and may have changed.
Spot vs. on-demand for R2: Spot instances are appropriate for batch reasoning jobs: offline document analysis, research pipelines, evaluation runs, or any workload that can restart from a checkpoint if the instance is reclaimed. On-demand is the right choice for real-time serving with SLA requirements.
One specific consideration for R2: reasoning workloads sustain high GPU utilization during long thinking chains. A spot interruption arriving mid-request means the entire reasoning trace is lost and the request must restart from scratch. If your average reasoning chain is 20,000 tokens and takes 30+ seconds to generate, a mid-chain interruption is more expensive per occurrence than for a standard model generating 500-token responses. Factor this into your spot vs. on-demand tradeoff for interactive R2 serving.
Compared to equivalent H100 offerings from major cloud providers, Spheron's pricing is competitive at $19.20/hr on-demand for 8x H100. For other instance types such as A100 80GB clusters, the cost advantage is more pronounced.
Cost Comparison: Self-Hosted R2 vs API Pricing
Using a mixed reasoning workload (70% thinking tokens, 30% response tokens, average 6,000 total tokens per query):
| Provider | Model | Input ($/M) | Output ($/M) | Est. Cost per 1M tokens (reasoning mix) |
|---|---|---|---|---|
| DeepSeek API | DeepSeek-R1 (R2 est.)* | ~$0.55 | ~$2.19 | ~$1.72/M |
| OpenAI | o3 | $2.00 | $8.00 | ~$6.20/M |
| Spheron (H100, on-demand) | R2 self-hosted | $2.40/hr | - | ~$1.50-2.18/M |
| Spheron (H100, spot) | R2 self-hosted | $0.80/hr | - | ~$0.50-0.73/M |
*DeepSeek R2 API pricing is not yet published. Input/output prices above are DeepSeek R1's published rates, used as a projected estimate.
ROI breakeven calculation: At 1 million queries/month with 6,000 tokens/query average:
- Total tokens: 6B tokens/month
- DeepSeek API cost: 6B × $1.72/M = $10,320/month
- Self-hosted on 8x H100 spot at $6.40/hr, 75% utilization: 8x H100 at ~1,600 tok/s sustained = 1,600 × 3,600 × 0.75 = ~4.32M tokens/hour. At ~4.32M tokens/hour × 720 hours/month = ~3.1B tokens/month maximum capacity. The workload requires 6B tokens/month, so two 8x H100 clusters are needed: 2 × $4,608/month = $9,216/month spot
- Self-hosting saves ~$1,104/month above 1M queries/month on spot with two clusters
With adaptive token budgets cutting average chain length from 6,000 to 1,500 tokens, the same cluster handles 4x the queries at the same cost, pushing the breakeven down to roughly 250k-400k queries/month.
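The breakeven arithmetic above can be reproduced with a short calculator. The throughput, utilization, and price inputs are the projected figures from this article; substitute your measured numbers before making a purchasing decision.

```python
import math

TOKENS_PER_SEC = 1_600      # projected sustained 8x H100 throughput (see benchmarks above)
UTILIZATION = 0.75
HOURS_PER_MONTH = 720

def cluster_capacity_tokens() -> float:
    """Monthly token capacity of one 8x H100 cluster."""
    return TOKENS_PER_SEC * 3_600 * UTILIZATION * HOURS_PER_MONTH

def self_hosted_monthly(total_tokens: float, cluster_rate_per_hr: float) -> float:
    """Cost of enough whole clusters to serve the workload."""
    clusters = math.ceil(total_tokens / cluster_capacity_tokens())
    return clusters * cluster_rate_per_hr * HOURS_PER_MONTH

api_cost = 6e9 / 1e6 * 1.72            # 6B tokens/month at ~$1.72/M
spot_cost = self_hosted_monthly(6e9, 6.40)
print(f"API: ${api_cost:,.0f}/mo  Self-hosted spot: ${spot_cost:,.0f}/mo")
# API: $10,320/mo  Self-hosted spot: $9,216/mo
```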
Pricing fluctuates based on GPU availability. The prices above are based on 05 Apr 2026 and may have changed.
Production Checklist: Monitoring, Scaling, and Failover
Priority order for a production R2 deployment:
- Enable FP8 KV cache (--kv-cache-dtype fp8_e5m2): near-zero accuracy cost, cuts KV memory 50%
- Set reasoning token budgets via system prompt or query classifier before going live
- Use continuous batching (vLLM default): never static batching for reasoning workloads
- Set --max-model-len to the 95th-percentile reasoning trace length in your workload, not the maximum
- Monitor per-request KV cache usage via vLLM's Prometheus-compatible /metrics endpoint
- Add speculative decoding with a distilled draft model for throughput-heavy pipelines
- Implement request-level timeouts to guard against 100K+ token runaway reasoning chains
- Deploy health checks on the /health endpoint with auto-restart on OOM
- Use on-demand for interactive serving; spot for batch jobs
- Plan for multi-instance load balancing once request volume exceeds a single cluster's capacity
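The request-timeout item from the checklist can be sketched as follows. The decode rate and budget tiers are illustrative assumptions (measure your deployment's real per-request rate); the per-request timeout itself uses the OpenAI Python SDK's `client.with_options()`.

```python
def timeout_for(budget_label: str, decode_tok_per_s: float = 60.0) -> float:
    """Wall-clock timeout derived from the reasoning budget, with 2x slack.

    decode_tok_per_s is a placeholder per-request decode rate -- measure
    your deployment's real rate and substitute it."""
    caps = {"low": 3_000, "medium": 6_000, "high": 18_000}
    return 2.0 * caps[budget_label] / decode_tok_per_s

def ask_with_timeout(client, query: str, budget_label: str):
    """client is the OpenAI client from the earlier examples; the OpenAI
    Python SDK supports per-request timeouts via client.with_options()."""
    return client.with_options(timeout=timeout_for(budget_label)).chat.completions.create(
        model="deepseek-ai/DeepSeek-R2",  # provisional model ID
        messages=[{"role": "user", "content": query}],
    )
```

A timed-out request still frees its KV blocks in vLLM once the connection drops, which is exactly the guard against runaway chains the checklist calls for.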
For full deployment documentation including automated setup scripts, see the Spheron vLLM server guide. For cost optimization strategies including spot vs. reserved tradeoffs, see the Spheron cost optimization guide.
DeepSeek R2 is the most GPU-intensive open reasoning model available in 2026. Running it well means getting KV cache and quantization right before you scale. Spheron's on-demand H100 and H200 clusters let you start with a single multi-GPU node and expand without committing to reserved capacity upfront.
Rent H100 SXM5 → | Rent H200 SXM5 → | View all GPU pricing →
