DeepSeek-R1 on a simple coding question generates 4,000 thinking tokens. GPT-4o on the same question generates 150. That 26x difference is why reasoning inference costs look nothing like standard LLM costs. Before you scale a reasoning model to production, the KV cache memory implications and speculative decoding techniques are worth understanding in detail, because both interact with reasoning workloads in ways that compound meaningfully at scale.
The Reasoning Token Explosion
Chain-of-thought models emit two token sequences: thinking tokens (internal monologue) and response tokens (visible output). Thinking tokens are never shown to users, but they consume exactly the same compute and memory as response tokens.
| Model | Avg Thinking Tokens | Avg Response Tokens | Total Multiplier vs GPT-4o |
|---|---|---|---|
| DeepSeek-R1 | 4,000-12,000 | 500-1,500 | 8-25x |
| o3 (high) | 10,000-30,000 | 300-800 | 20-50x |
| Qwen-QwQ-32B | 2,000-8,000 | 400-1,000 | 6-18x |
| Claude Sonnet 4.6 (extended thinking) | 3,000-10,000 | 400-1,200 | 8-22x |
| GPT-4o | 0 | 300-600 | 1x (baseline) |
Every thinking token occupies KV cache memory for the duration of the request. A 12,000-token reasoning chain on DeepSeek-R1-70B holds roughly 3.9 GB of KV cache in FP16 for the entire generation. The request also blocks that memory from other users, increases time-to-first-token, and costs the same compute as output tokens the user actually sees.
GPU Economics of Reasoning
Three pressures compound on each other with reasoning workloads.
KV cache growth. DeepSeek-R1-671B uses Multi-head Latent Attention (MLA), which compresses the KV cache to a small latent dimension rather than storing full K and V tensors per head. Using the actual architecture (61 transformer layers, 512-dim compressed KV latent plus 64-dim decoupled RoPE key per layer), a 30,000-token reasoning chain at FP16 generates approximately:
61 layers × (512 KV latent + 64 RoPE key) × 30,000 tokens × 2 bytes ≈ 2 GB

That is for a single request. MLA is what makes serving a 671B model feasible on 8x H100 80 GB; a standard multi-head attention model with the same depth and head count would require hundreds of GB of KV cache for the same context length.
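That arithmetic can be checked in a few lines, using the layer count and latent dimensions quoted above:

```python
# KV cache footprint for DeepSeek-R1-671B's MLA at FP16.
# Back-of-envelope sketch using the architecture figures cited above.
LAYERS = 61
KV_LATENT_DIM = 512   # compressed KV latent per layer
ROPE_KEY_DIM = 64     # decoupled RoPE key per layer
BYTES_FP16 = 2

def mla_kv_bytes(num_tokens: int) -> int:
    per_token = LAYERS * (KV_LATENT_DIM + ROPE_KEY_DIM) * BYTES_FP16
    return per_token * num_tokens

print(f"{mla_kv_bytes(30_000) / 1e9:.2f} GB")  # ≈ 2.11 GB for a 30k-token chain
```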
Batch starvation. A single 30,000-token reasoning request can hold GPU memory that would otherwise serve 50 standard queries at 600-token output length. The long, variable output length of reasoning workloads makes static batching nearly useless and amplifies the importance of continuous batching.
Time-to-first-token. Thinking tokens are autoregressive and cannot be parallelized. The user waits for the entire reasoning chain to complete before the response begins. At 10,000 thinking tokens on a single H100 at 2,000 tokens/second, that is 5 seconds of TTFT before the first visible word appears.
Hardware Selection: H100 vs B200 for Reasoning
The right GPU depends heavily on which DeepSeek-R1 variant you are running. The full 671B MoE model requires multi-GPU configurations; the distilled 70B version fits on a single H100 or B200.
| GPU | HBM Capacity | HBM Bandwidth | FP8 Throughput (dense) | Max Concurrent 70B Requests (FP8) | Spheron On-Demand | Est. Cost per 1M Reasoning Tokens |
|---|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | 3.35 TB/s | 3,958 TFLOPS | 1-2 | $2.40/hr | ~$1.28-1.86/M |
| H200 SXM5 | 141 GB | 4.8 TB/s | 3,958 TFLOPS | 3-4 | $3.96/hr | ~$0.90-1.36/M |
| B200 SXM6 | 180 GB | 7.7 TB/s | 4,500 TFLOPS | 6-8 | $7.43/hr | ~$1.08-1.80/M |
For the full 671B model, multi-GPU configurations apply: 8x H100 SXM5 (640 GB aggregate) at $19.20/hr vs 4x B200 (720 GB aggregate) at $29.72/hr. The B200 configuration provides more memory headroom with half the GPU count, better interconnect efficiency, and FP4 support for compatible workloads, though at a higher total cost per hour. For GPU architecture details, see the NVIDIA B200 complete guide and best GPU for AI inference 2026. To deploy DeepSeek-R1 on Spheron, the DeepSeek-R1 deployment guide covers hardware selection and automated setup scripts for each model variant.
Note on spot instances: H100 SXM5 spot pricing is $0.80/hr and B200 SXM6 is $1.67/hr on Spheron. Spot instances can be reclaimed. For short-lived batch reasoning jobs, spot is appropriate. For long multi-hour inference sessions, use on-demand instances. See the cost optimization guide for details on instance types, spot vs dedicated tradeoffs, and reserved GPU options.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Optimization Stack: Speculative Decoding and KV Cache Compression
Two techniques from the speculative decoding production guide and KV cache optimization guide are especially effective for reasoning workloads.
Speculative decoding with a distilled draft model. Reasoning model thinking tokens follow predictable patterns: structured math steps, code annotations, logical premises. A smaller distilled version of the same reasoning model captures these patterns with high accuracy. In production, using DeepSeek-R1-Distill-Llama-8B as a draft model for DeepSeek-R1-Distill-Llama-70B inference gives a 2-3x throughput improvement, higher than the typical 1.5-2x on general instruction-following workloads. Both models share the Llama 3 tokenizer, which is required for vLLM's draft-model speculative decoding to work correctly.
FP8 KV cache quantization. Because reasoning requests generate 5-30x more tokens than standard requests, the KV cache is proportionally larger. For a distilled 70B model (standard GQA attention, 8 KV heads, 80 layers) at 30,000 tokens, the FP16 KV cache runs close to 10 GB per request. FP8 KV (`--kv-cache-dtype fp8`) cuts that roughly in half with negligible quality impact, doubling effective context capacity.
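A quick sketch of the same arithmetic, assuming the 128-dim attention heads of the Llama 3 architecture (the head dimension is not stated above, so treat it as an assumption):

```python
# KV cache for a 70B GQA model: 80 layers, 8 KV heads, 128-dim heads (assumed,
# per Llama 3), at a 30,000-token reasoning chain.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
TOKENS = 30_000

def kv_bytes(bytes_per_elem: int) -> int:
    # K and V tensors per layer, per token, times sequence length
    return 2 * KV_HEADS * HEAD_DIM * LAYERS * bytes_per_elem * TOKENS

print(f"FP16: {kv_bytes(2) / 1e9:.1f} GB")  # ≈ 9.8 GB
print(f"FP8:  {kv_bytes(1) / 1e9:.1f} GB")  # ≈ 4.9 GB
```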
Both combined:
```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --kv-cache-dtype fp8 \
  --speculative-model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --num-speculative-tokens 5 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```

This configuration runs on a single H100 SXM5 GPU. The `--max-model-len 32768` cap prevents the KV cache pool from reserving memory for 128K-token reasoning chains that you are unlikely to serve. For a step-by-step walkthrough of deploying vLLM on Spheron, see the vLLM server guide.
Adaptive Reasoning Depth: Controlling Thinking Token Budgets
Not all queries benefit from deep reasoning. "What is 15% of $80?" does not need 8,000 thinking tokens. A factual lookup does not need a multi-step logical chain. The practical fix is to assign a reasoning budget proportional to estimated query complexity at inference time, rather than letting the model use its maximum thinking depth for every request.
Across mixed workloads (a realistic mix of simple math, factual questions, coding tasks, and complex multi-step problems), adaptive token budget controls reduce average thinking token generation by 75-88%. For a workload where the unconstrained average would be 8,000 thinking tokens, this brings it down to 1,000-2,000. That is the basis for the "up to 8x" claim in the title. Queries that receive full reasoning budget are unaffected in output quality.
Three practical implementations:
System prompt budget hints. Add "Use minimal reasoning steps unless the problem requires deep analysis" to the system prompt. Qwen-QwQ-32B is particularly responsive to this approach; DeepSeek-R1 responds somewhat less consistently. A conservative estimate is 40-60% reduction in average thinking tokens on mixed workloads.
Lightweight query classifier. Run a simple classifier on the incoming query to estimate required reasoning depth (low/medium/high). Pass the estimated budget as a constraint in the system prompt. This adds roughly 5-10ms per request and can achieve 70-80% reduction in thinking tokens on clearly simple queries.
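A minimal heuristic sketch of such a classifier; the keyword list, length thresholds, and token budgets are illustrative assumptions, not a tuned production model:

```python
import re

# Hypothetical depth classifier: maps a query to (depth, thinking-token budget).
# Thresholds and keywords are illustrative; a production version would be
# a small trained model or a tuned heuristic on your own traffic.
COMPLEX_HINTS = re.compile(r"\b(prove|derive|refactor|debug|analyze|multi-step)\b", re.I)

def reasoning_budget(query: str) -> tuple[str, int]:
    """Return (depth, max thinking tokens) for a query."""
    if COMPLEX_HINTS.search(query) or len(query) > 300:
        return ("high", 16_000)
    if any(ch.isdigit() for ch in query) and len(query) > 80:
        return ("medium", 4_000)
    return ("low", 1_000)

print(reasoning_budget("What is 15% of $80?"))  # ('low', 1000)
```

The returned budget can then be injected into the system prompt as a hint, or used directly as a per-request generation cap.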
Per-request max_tokens override. vLLM accepts a per-request max_tokens limit in its sampling parameters. Route classified simple queries to a lower cap. This is the most reliable enforcement mechanism since it is enforced at the engine level rather than relying on the model to follow instructions.
The effectiveness varies by model. Qwen-QwQ is more instruction-following in its reasoning behavior. DeepSeek-R1 sometimes ignores budget hints on problems it classifies as complex.
Batching Strategies for Long Reasoning Outputs
Reasoning models produce output lengths ranging from 100 tokens (trivial queries with minimal thinking) to 30,000+ tokens (complex multi-step problems), an output-length spread 100-300x wider than that of standard instruction-following models. Static batching assumes similar output lengths across a batch, which makes it extremely wasteful here.
Continuous batching is the baseline requirement. vLLM and SGLang both use continuous batching by default. Do not use any framework or configuration that falls back to static batching for reasoning workloads.
Request bucketing by estimated length. Route short queries (factual lookups, simple arithmetic, single-hop questions) to a smaller GPU instance with a lower max_model_len. Route complex reasoning queries (code review, multi-step proofs, research analysis) to a high-memory instance. This prevents long reasoning requests from blocking short requests in the same queue.
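A minimal routing sketch under these assumptions; the pool endpoints and the short-query heuristic are hypothetical, not part of any serving framework:

```python
# Hypothetical router: buckets requests by estimated output length and sends
# them to differently sized vLLM instances. Endpoints and the keyword
# heuristic are illustrative assumptions.
SHORT_HINTS = ("what is", "who is", "when", "define", "convert")

def pick_backend(query: str) -> str:
    q = query.lower()
    if len(q) < 80 and any(h in q for h in SHORT_HINTS):
        return "http://short-pool:8000"      # low max_model_len, smaller GPU
    return "http://reasoning-pool:8000"      # high-memory instance

print(pick_backend("What is 15% of $80?"))   # routes to the short pool
```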
Priority queuing. During peak load, long reasoning requests can starve short requests that are waiting. A priority queue that deprioritizes requests with estimated long outputs during peak hours prevents user-visible latency on interactive queries. SGLang's RadixAttention prefix cache is particularly useful for reasoning workloads where thinking chains share common prefixes, such as a shared system prompt with chain-of-thought instructions. For a detailed engine comparison covering throughput and latency across vLLM and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
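The deprioritization step can be sketched with a plain heap, assuming estimated output lengths are already available from a classifier; the queries and estimates below are illustrative:

```python
import heapq

# Sketch of a priority queue that drains short-estimate requests first during
# peak load. Estimated lengths would come from the same classifier used for
# budget control; values here are made up for illustration.
queue: list[tuple[int, int, str]] = []
seq = 0  # tie-breaker preserves FIFO order among equal estimates

def submit(query: str, est_tokens: int) -> None:
    global seq
    heapq.heappush(queue, (est_tokens, seq, query))  # shorter estimates first
    seq += 1

submit("Prove the Cauchy-Schwarz inequality", 12_000)
submit("What is 15% of $80?", 600)
submit("Summarize this paragraph", 900)

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)  # short interactive requests drain before the long proof
```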
Quantization Impact on Reasoning Quality
Quantization errors compound through long reasoning chains. An error that would cause a minor output difference in a 500-token standard response can cascade through a 10,000-token thinking chain and produce a wrong final answer. This makes quantization tradeoffs more consequential for reasoning models than for standard generation.
| Quantization | VRAM vs FP16 | Throughput Gain | MATH-500 Accuracy Drop | AIME Accuracy Drop | Recommended For |
|---|---|---|---|---|---|
| FP16 | 1x (baseline) | 1x | 0% | 0% | Research / accuracy-critical |
| FP8 | ~0.5x | 1.6-1.9x | <1% | <2% | Production default |
| FP4 (Blackwell) | ~0.25x | 2.8-3.5x | 3-5% | 5-8% | High-throughput, some accuracy budget |
| INT4 (GPTQ) | ~0.25x | 1.8-2.2x | 6-12% | 10-18% | Not recommended for reasoning |
| INT4 (AWQ) | ~0.25x | 1.7-2.0x | 4-8% | 7-14% | Not recommended for reasoning |
FP8 is the safe default for reasoning workloads. FP4 on Blackwell hardware is viable for throughput-heavy deployments where you have run accuracy validation on your specific task set first. INT4 methods (GPTQ, AWQ) show larger accuracy drops specifically on reasoning tasks because the quantization format degrades structured logical inference more than general text generation. For the full FP4 cost and quality analysis, see FP4 quantization on Blackwell GPUs.
Cost Comparison: Self-Hosted vs API
Using DeepSeek-R1-Distill-Llama-70B (the distilled variant that fits on a single H100) as the reference point:
| Provider | Model | Input Price | Output Price | Cost per 1M Tokens (70/30 thinking/output mix) |
|---|---|---|---|---|
| OpenAI | o3 | $2.00/M | $8.00/M | ~$3.80/M |
| Anthropic | Claude Sonnet 4.6 (extended thinking) | $3.00/M | $15.00/M | ~$6.60/M |
| DeepSeek API | DeepSeek-R1 | $0.55/M | $2.19/M | ~$1.04/M |
| Spheron (H100 SXM5, on-demand) | DeepSeek-R1-Distill-Llama-70B | $2.40/hr | - | ~$1.28-1.86/M |
| Spheron (H100 SXM5, spot) | DeepSeek-R1-Distill-Llama-70B | $0.80/hr | - | ~$0.40-0.60/M |
ROI breakeven at 1M queries/month:
- Assume 4,000 thinking tokens + 500 response tokens per query = 4,500 tokens/query average
- 1M queries = 4.5 billion tokens/month
- DeepSeek API cost: 4.5B × $1.04/M = $4,680/month
- Self-hosted on H100 spot at 75% utilization: one H100 at $0.80/hr generates roughly 700-900M tokens/month at realistic long-context throughput (450-550 tok/s for 4,000+ token reasoning chains). Scaling to 4.5B tokens requires 6-7 GPUs at approximately $3,456-4,032/month
- Breakeven favors self-hosting above roughly 500k-800k queries/month on spot, depending on actual reasoning chain length
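The same breakeven arithmetic in code form; every input is one of the article's estimates above, not a live quote:

```python
import math

# Breakeven sketch: API blended price, spot rate, and per-GPU monthly
# throughput are the estimates from this article.
TOKENS_PER_QUERY = 4_500        # 4,000 thinking + 500 response
API_PRICE_PER_M = 1.04          # DeepSeek API blended $/1M tokens
SPOT_RATE_HR = 0.80             # H100 SXM5 spot $/hr
TOKENS_PER_GPU_MONTH = 800e6    # midpoint of the 700-900M range
HOURS_PER_MONTH = 720           # 24 h x 30 days

def monthly_cost(queries: int) -> tuple[float, float]:
    """Return (API cost, self-hosted spot cost) in $/month."""
    tokens = queries * TOKENS_PER_QUERY
    api = tokens / 1e6 * API_PRICE_PER_M
    gpus = math.ceil(tokens / TOKENS_PER_GPU_MONTH)
    return api, gpus * SPOT_RATE_HR * HOURS_PER_MONTH

api, self_hosted = monthly_cost(1_000_000)
print(f"API ${api:,.0f}/mo vs self-hosted ${self_hosted:,.0f}/mo")  # $4,680 vs $3,456
```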
The breakeven is lower when you use adaptive reasoning depth. Cutting average thinking tokens from 4,000 to 1,000 per query drops token volume by 70%, which shifts the breakeven down further and makes API pricing less competitive.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Implementation Checklist
Priority order for reasoning model cost reduction:
- Enable FP8 KV cache quantization (`--kv-cache-dtype fp8`) - a free win with near-zero accuracy cost
- Set a reasoning token budget via system prompt or query classifier
- Use continuous batching only (vLLM or SGLang; no static batching)
- Add speculative decoding with a 7B distill draft model
- Apply FP8 weight quantization if not already default
- Route queries by complexity to right-sized GPU instances
- Consider FP4 on B200 only after accuracy validation on your specific task
Steps 1 and 2 together typically achieve 3-5x cost reduction with minimal operational complexity. Steps 3-6 compound on top of that. FP4 (step 7) is optional and workload-dependent.
Running reasoning models at scale on API providers gets expensive fast. Self-hosting DeepSeek-R1 on Spheron H100s, with FP8 quantization and adaptive token budgets, brings cost-per-query below most API options above 300k queries/month.
