DeepSeek-R1 on a simple coding question generates 4,000 thinking tokens. GPT-4o on the same question generates 150. That 26x difference is why reasoning inference costs look nothing like standard LLM costs. Before you scale a reasoning model to production, the KV cache memory implications and speculative decoding techniques are worth understanding in detail, because both interact with reasoning workloads in ways that compound meaningfully at scale.
The Reasoning Token Explosion
Chain-of-thought models emit two token sequences: thinking tokens (internal monologue) and response tokens (visible output). Thinking tokens are never shown to users, but they consume exactly the same compute and memory as response tokens.
| Model | Avg Thinking Tokens | Avg Response Tokens | Total Multiplier vs GPT-4o |
|---|---|---|---|
| DeepSeek-R1 | 4,000-12,000 | 500-1,500 | 8-25x |
| o3 (high) | 10,000-30,000 | 300-800 | 20-50x |
| Qwen-QwQ-32B | 2,000-8,000 | 400-1,000 | 6-18x |
| Claude Sonnet 4.6 (extended thinking) | 3,000-10,000 | 400-1,200 | 8-22x |
| GPT-4o | 0 | 300-600 | 1x (baseline) |
Every thinking token occupies KV cache memory for the duration of the request. A 12,000-token reasoning chain on DeepSeek-R1-70B holds roughly 3.9 GB of KV cache in FP16 for the entire generation. The request also blocks that memory from other users, increases time-to-first-token, and costs the same compute as output tokens the user actually sees.
GPU Economics of Reasoning
Three pressures compound on each other with reasoning workloads.
KV cache growth. DeepSeek-R1-671B uses Multi-head Latent Attention (MLA), which compresses the KV cache to a small latent dimension rather than storing full K and V tensors per head. Using the actual architecture (61 transformer layers, 512-dim compressed KV latent plus 64-dim decoupled RoPE key per layer), a 30,000-token reasoning chain at FP16 generates approximately:
61 layers × (512 KV latent + 64 RoPE key) × 30,000 tokens × 2 bytes ≈ 2 GB

That is for a single request. MLA is what makes serving a 671B model feasible on 8x H100 80 GB; a standard multi-head attention model with the same depth and head count would require hundreds of GB of KV cache for the same context length.
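That arithmetic can be checked in a few lines, using the layer count and latent dimensions quoted above:

```python
# KV cache footprint for DeepSeek-R1-671B's MLA at FP16.
# Back-of-envelope sketch using the architecture figures cited above.
LAYERS = 61
KV_LATENT_DIM = 512   # compressed KV latent per layer
ROPE_KEY_DIM = 64     # decoupled RoPE key per layer
BYTES_FP16 = 2

def mla_kv_bytes(num_tokens: int) -> int:
    per_token = LAYERS * (KV_LATENT_DIM + ROPE_KEY_DIM) * BYTES_FP16
    return per_token * num_tokens

print(f"{mla_kv_bytes(30_000) / 1e9:.2f} GB")  # ≈ 2.11 GB for a 30k-token chain
```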
Batch starvation. A single 30,000-token reasoning request can hold GPU memory that would otherwise serve 50 standard queries at 600-token output length. The long, variable output length of reasoning workloads makes static batching nearly useless and amplifies the importance of continuous batching.
Time-to-first-token. Thinking tokens are autoregressive and cannot be parallelized. The user waits for the entire reasoning chain to complete before the response begins. At 10,000 thinking tokens on a single H100 at 2,000 tokens/second, that is 5 seconds of TTFT before the first visible word appears.
Hardware Selection: H100 vs B200 for Reasoning
The right GPU depends heavily on which DeepSeek-R1 variant you are running. The full 671B MoE model requires multi-GPU configurations; the distilled 70B version fits on a single H100 or B200.
| GPU | HBM Capacity | HBM Bandwidth | FP8 Throughput (dense) | Max Concurrent 70B Requests (FP8) | Spheron On-Demand | Est. Cost per 1M Reasoning Tokens |
|---|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | 3.35 TB/s | 3,958 TFLOPS | 1-2 | $2.40/hr | ~$1.28-1.86/M |
| H200 SXM5 | 141 GB | 4.8 TB/s | 3,958 TFLOPS | 3-4 | $3.96/hr | ~$0.90-1.36/M |
| B200 SXM6 | 180 GB | 7.7 TB/s | 4,500 TFLOPS | 6-8 | $7.43/hr | ~$1.08-1.80/M |
For the full 671B model, multi-GPU configurations apply: 8x H100 SXM5 (640 GB aggregate) at $19.20/hr vs 4x B200 (720 GB aggregate) at $29.72/hr. The B200 configuration provides more memory headroom with half the GPU count, better interconnect efficiency, and FP4 support for compatible workloads, though at a higher total cost per hour. For GPU architecture details, see the NVIDIA B200 complete guide and best GPU for AI inference 2026. To deploy DeepSeek-R1 on Spheron, the DeepSeek-R1 deployment guide covers hardware selection and automated setup scripts for each model variant.
Note on spot instances: H100 SXM5 spot pricing is $0.80/hr and B200 SXM6 is $1.67/hr on Spheron. Spot instances can be reclaimed. For short-lived batch reasoning jobs, spot is appropriate. For long multi-hour inference sessions, use on-demand instances. See the cost optimization guide for details on instance types, spot vs dedicated tradeoffs, and reserved GPU options.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Optimization Stack: Speculative Decoding and KV Cache Compression
Two techniques from the speculative decoding production guide and KV cache optimization guide are especially effective for reasoning workloads.
Speculative decoding with a distilled draft model. Reasoning model thinking tokens follow predictable patterns: structured math steps, code annotations, logical premises. A smaller distilled version of the same reasoning model captures these patterns with high accuracy. In production, using DeepSeek-R1-Distill-Llama-8B as a draft model for DeepSeek-R1-Distill-Llama-70B inference gives a 2-3x throughput improvement, higher than the typical 1.5-2x on general instruction-following workloads. Both models share the Llama 3 tokenizer, which is required for vLLM's draft-model speculative decoding to work correctly.
FP8 KV cache quantization. Because reasoning requests generate 5-30x more tokens than standard requests, the KV cache is proportionally larger. For a distilled 70B model (standard GQA attention, 8 KV heads, 80 layers) at 30,000 tokens, the FP16 KV cache runs close to 10 GB per request. FP8 KV (`--kv-cache-dtype fp8`) cuts that roughly in half with negligible quality impact, doubling effective context capacity.
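A quick sketch of the same arithmetic, assuming the 128-dim attention heads of the Llama 3 architecture (the head dimension is not stated above, so treat it as an assumption):

```python
# KV cache for a 70B GQA model: 80 layers, 8 KV heads, 128-dim heads (assumed,
# per Llama 3), at a 30,000-token reasoning chain.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
TOKENS = 30_000

def kv_bytes(bytes_per_elem: int) -> int:
    # K and V tensors per layer, per token, times sequence length
    return 2 * KV_HEADS * HEAD_DIM * LAYERS * bytes_per_elem * TOKENS

print(f"FP16: {kv_bytes(2) / 1e9:.1f} GB")  # ≈ 9.8 GB
print(f"FP8:  {kv_bytes(1) / 1e9:.1f} GB")  # ≈ 4.9 GB
```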
Both combined:
```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --kv-cache-dtype fp8 \
  --speculative-model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --num-speculative-tokens 5 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```

This configuration runs on a single H100 SXM5 GPU. The `--max-model-len 32768` cap prevents the KV cache pool from reserving memory for 128K-token reasoning chains that you are unlikely to serve. For a step-by-step walkthrough of deploying vLLM on Spheron, see the vLLM server guide.
Adaptive Reasoning Depth: Controlling Thinking Token Budgets
Not all queries benefit from deep reasoning. "What is 15% of $80?" does not need 8,000 thinking tokens. A factual lookup does not need a multi-step logical chain. The practical fix is to assign a reasoning budget proportional to estimated query complexity at inference time, rather than letting the model use its maximum thinking depth for every request.
Across mixed workloads (a realistic mix of simple math, factual questions, coding tasks, and complex multi-step problems), adaptive token budget controls reduce average thinking token generation by 75-88%. For a workload where the unconstrained average would be 8,000 thinking tokens, this brings it down to 1,000-2,000. That is the basis for the "up to 8x" claim in the title. Queries that receive full reasoning budget are unaffected in output quality.
Three practical implementations:
System prompt budget hints. Add "Use minimal reasoning steps unless the problem requires deep analysis" to the system prompt. Qwen-QwQ-32B is particularly responsive to this approach; DeepSeek-R1 responds somewhat less consistently. A conservative estimate is 40-60% reduction in average thinking tokens on mixed workloads.
Lightweight query classifier. Run a simple classifier on the incoming query to estimate required reasoning depth (low/medium/high). Pass the estimated budget as a constraint in the system prompt. This adds roughly 5-10ms per request and can achieve 70-80% reduction in thinking tokens on clearly simple queries.
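A minimal heuristic sketch of such a classifier; the keyword list, length thresholds, and token budgets are illustrative assumptions, not a tuned production model:

```python
import re

# Hypothetical depth classifier: maps a query to (depth, thinking-token budget).
# Thresholds and keywords are illustrative; a production version would be
# a small trained model or a tuned heuristic on your own traffic.
COMPLEX_HINTS = re.compile(r"\b(prove|derive|refactor|debug|analyze|multi-step)\b", re.I)

def reasoning_budget(query: str) -> tuple[str, int]:
    """Return (depth, max thinking tokens) for a query."""
    if COMPLEX_HINTS.search(query) or len(query) > 300:
        return ("high", 16_000)
    if any(ch.isdigit() for ch in query) and len(query) > 80:
        return ("medium", 4_000)
    return ("low", 1_000)

print(reasoning_budget("What is 15% of $80?"))  # ('low', 1000)
```

The returned budget can then be injected into the system prompt as a hint, or used directly as a per-request generation cap.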
Per-request max_tokens override. vLLM accepts a per-request max_tokens limit in its sampling parameters. Route classified simple queries to a lower cap. This is the most reliable enforcement mechanism since it is enforced at the engine level rather than relying on the model to follow instructions.
The effectiveness varies by model. Qwen-QwQ is more instruction-following in its reasoning behavior. DeepSeek-R1 sometimes ignores budget hints on problems it classifies as complex.
Batching Strategies for Long Reasoning Outputs
Reasoning models produce output lengths ranging from 100 tokens (trivial queries with minimal thinking) to 30,000+ tokens (complex multi-step problems), an output-length spread 100-300x wider than that of standard instruction-following models. Static batching assumes similar output lengths across a batch, which makes it extremely wasteful here.
Continuous batching is the baseline requirement. vLLM and SGLang both use continuous batching by default. Do not use any framework or configuration that falls back to static batching for reasoning workloads.
Request bucketing by estimated length. Route short queries (factual lookups, simple arithmetic, single-hop questions) to a smaller GPU instance with a lower max_model_len. Route complex reasoning queries (code review, multi-step proofs, research analysis) to a high-memory instance. This prevents long reasoning requests from blocking short requests in the same queue.
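A minimal routing sketch under these assumptions; the pool endpoints and the short-query heuristic are hypothetical, not part of any serving framework:

```python
# Hypothetical router: buckets requests by estimated output length and sends
# them to differently sized vLLM instances. Endpoints and the keyword
# heuristic are illustrative assumptions.
SHORT_HINTS = ("what is", "who is", "when", "define", "convert")

def pick_backend(query: str) -> str:
    q = query.lower()
    if len(q) < 80 and any(h in q for h in SHORT_HINTS):
        return "http://short-pool:8000"      # low max_model_len, smaller GPU
    return "http://reasoning-pool:8000"      # high-memory instance

print(pick_backend("What is 15% of $80?"))   # routes to the short pool
```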
Priority queuing. During peak load, long reasoning requests can starve short requests that are waiting. A priority queue that deprioritizes requests with estimated long outputs during peak hours prevents user-visible latency on interactive queries. SGLang's RadixAttention prefix cache is particularly useful for reasoning workloads where thinking chains share common prefixes, such as a shared system prompt with chain-of-thought instructions. For a detailed engine comparison covering throughput and latency across vLLM and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
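The deprioritization step can be sketched with a plain heap, assuming estimated output lengths are already available from a classifier; the queries and estimates below are illustrative:

```python
import heapq

# Sketch of a priority queue that drains short-estimate requests first during
# peak load. Estimated lengths would come from the same classifier used for
# budget control; values here are made up for illustration.
queue: list[tuple[int, int, str]] = []
seq = 0  # tie-breaker preserves FIFO order among equal estimates

def submit(query: str, est_tokens: int) -> None:
    global seq
    heapq.heappush(queue, (est_tokens, seq, query))  # shorter estimates first
    seq += 1

submit("Prove the Cauchy-Schwarz inequality", 12_000)
submit("What is 15% of $80?", 600)
submit("Summarize this paragraph", 900)

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)  # short interactive requests drain before the long proof
```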
Quantization Impact on Reasoning Quality
Quantization errors compound through long reasoning chains. An error that would cause a minor output difference in a 500-token standard response can cascade through a 10,000-token thinking chain and produce a wrong final answer. This makes quantization tradeoffs more consequential for reasoning models than for standard generation.
| Quantization | VRAM vs FP16 | Throughput Gain | MATH-500 Accuracy Drop | AIME Accuracy Drop | Recommended For |
|---|---|---|---|---|---|
| FP16 | 1x (baseline) | 1x | 0% | 0% | Research / accuracy-critical |
| FP8 | ~0.5x | 1.6-1.9x | <1% | <2% | Production default |
| FP4 (Blackwell) | ~0.25x | 2.8-3.5x | 3-5% | 5-8% | High-throughput, some accuracy budget |
| INT4 (GPTQ) | ~0.25x | 1.8-2.2x | 6-12% | 10-18% | Not recommended for reasoning |
| INT4 (AWQ) | ~0.25x | 1.7-2.0x | 4-8% | 7-14% | Not recommended for reasoning |
FP8 is the safe default for reasoning workloads. FP4 on Blackwell hardware is viable for throughput-heavy deployments where you have run accuracy validation on your specific task set first. INT4 methods (GPTQ, AWQ) show larger accuracy drops specifically on reasoning tasks because the quantization format degrades structured logical inference more than general text generation. For the full FP4 cost and quality analysis, see FP4 quantization on Blackwell GPUs.
Cost Comparison: Self-Hosted vs API
Using DeepSeek-R1-Distill-Llama-70B (the distilled variant that fits on a single H100) as the reference point:
| Provider | Model | Input Price | Output Price | Cost per 1M Tokens (70/30 thinking/output mix) |
|---|---|---|---|---|
| OpenAI | o3 | $2.00/M | $8.00/M | ~$3.80/M |
| Anthropic | Claude Sonnet 4.6 (extended thinking) | $3.00/M | $15.00/M | ~$6.60/M |
| DeepSeek API | DeepSeek-R1 | $0.55/M | $2.19/M | ~$1.04/M |
| Spheron (H100 SXM5, on-demand) | DeepSeek-R1-Distill-Llama-70B | $2.40/hr | - | ~$1.28-1.86/M |
| Spheron (H100 SXM5, spot) | DeepSeek-R1-Distill-Llama-70B | $0.80/hr | - | ~$0.40-0.60/M |
ROI breakeven at 1M queries/month:
- Assume 4,000 thinking tokens + 500 response tokens per query = 4,500 tokens/query average
- 1M queries = 4.5 billion tokens/month
- DeepSeek API cost: 4.5B × $1.04/M = $4,680/month
- Self-hosted on H100 spot at 75% utilization: one H100 at $0.80/hr generates roughly 700-900M tokens/month at realistic long-context throughput (450-550 tok/s for 4,000+ token reasoning chains). Scaling to 4.5B tokens requires 6-7 GPUs at approximately $3,456-4,032/month
- Breakeven favors self-hosting above roughly 500k-800k queries/month on spot, depending on actual reasoning chain length
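The same breakeven arithmetic in code form; every input is one of the article's estimates above, not a live quote:

```python
import math

# Breakeven sketch: API blended price, spot rate, and per-GPU monthly
# throughput are the estimates from this article.
TOKENS_PER_QUERY = 4_500        # 4,000 thinking + 500 response
API_PRICE_PER_M = 1.04          # DeepSeek API blended $/1M tokens
SPOT_RATE_HR = 0.80             # H100 SXM5 spot $/hr
TOKENS_PER_GPU_MONTH = 800e6    # midpoint of the 700-900M range
HOURS_PER_MONTH = 720           # 24 h x 30 days

def monthly_cost(queries: int) -> tuple[float, float]:
    """Return (API cost, self-hosted spot cost) in $/month."""
    tokens = queries * TOKENS_PER_QUERY
    api = tokens / 1e6 * API_PRICE_PER_M
    gpus = math.ceil(tokens / TOKENS_PER_GPU_MONTH)
    return api, gpus * SPOT_RATE_HR * HOURS_PER_MONTH

api, self_hosted = monthly_cost(1_000_000)
print(f"API ${api:,.0f}/mo vs self-hosted ${self_hosted:,.0f}/mo")  # $4,680 vs $3,456
```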
The breakeven is lower when you use adaptive reasoning depth. Cutting average thinking tokens from 4,000 to 1,000 per query drops token volume by 70%, which shifts the breakeven down further and makes API pricing less competitive.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Implementation Checklist
Priority order for reasoning model cost reduction:
- Enable FP8 KV cache quantization (`--kv-cache-dtype fp8`) - a free win with near-zero accuracy cost
- Set a reasoning token budget via system prompt or query classifier
- Use continuous batching only (vLLM or SGLang; no static batching)
- Add speculative decoding with a 7B distill draft model
- Apply FP8 weight quantization if not already default
- Route queries by complexity to right-sized GPU instances
- Consider FP4 on B200 only after accuracy validation on your specific task
Steps 1 and 2 together typically achieve 3-5x cost reduction with minimal operational complexity. Steps 3-6 compound on top of that. FP4 (step 7) is optional and workload-dependent.
Running reasoning models at scale on API providers gets expensive fast. Self-hosting DeepSeek-R1 on Spheron H100s, with FP8 quantization and adaptive token budgets, brings cost-per-query below most API options above 300k queries/month.
