The ability to improve model output quality by spending more GPU compute at inference time is now a first-class concern for anyone running reasoning models in production. This changes how you provision GPUs, size KV cache, and think about billing. Before diving in, note that the KV cache memory implications for reasoning workloads and the cost optimization techniques for reasoning inference are directly relevant to what follows.
What Is Inference-Time Compute Scaling
Inference-time compute scaling, also called test-time compute scaling, means spending more GPU compute on a single query to get a better answer. The model doesn't change. The weights stay fixed. What changes is how many tokens the model generates while working through the problem.
This is fundamentally different from training-time scaling, where you improve model quality by using more compute during pre-training. Training-time scaling requires months, billions of dollars, and produces a fixed artifact. Inference-time scaling is a dial you turn per request.
Models like o3, DeepSeek-R1, Claude with extended thinking, and QwQ-32B achieve higher accuracy by generating internal reasoning tokens before producing the visible response. The quality of the answer correlates with how much thinking the model is allowed to do. More thinking tokens, generally better answers.
| Scaling Paradigm | Where Compute Is Spent | GPU Demand Pattern |
|---|---|---|
| Training-time scaling | Pre-training FLOPS | Predictable, long-running jobs |
| Inference-time scaling | Per-query generation | Bursty, variable, request-scoped |
The operational implications are significant. Training jobs run on fixed hardware for fixed durations. Inference-time scaling creates workloads where two consecutive requests to the same model can differ by 50x in GPU time consumed.
How Reasoning Models Use 10-100x More GPU Per Query
A standard LLM generates 200-500 output tokens for most queries. A reasoning model emits thinking tokens before the visible response. Those thinking tokens consume exactly the same compute and KV cache memory as output tokens.
| Model | Avg Thinking Tokens | Avg Response Tokens | Total Multiplier vs GPT-4o |
|---|---|---|---|
| DeepSeek-R1 | 4,000-12,000 | 500-1,500 | 8-25x |
| o3 (high) | 10,000-30,000 | 300-800 | 20-50x |
| QwQ-32B | 2,000-8,000 | 400-1,000 | 6-18x |
| Claude Sonnet 4.6 (adaptive thinking) | 3,000-10,000 | 400-1,200 | 8-22x |
| GPT-4o | - | 300-600 | 1x (baseline) |
The KV cache grows with every token generated. For DeepSeek-R1-Distill-Llama-70B (80 layers, 8 KV heads, 128 head_dim), the KV cache size at a 10,000-token reasoning chain is:
2 x 80 layers x 8 KV heads x 128 head_dim x 10,000 tokens x 2 bytes (FP16) = ~3.28 GB per concurrent request

At 30,000 tokens (o3-level reasoning), that's ~9.8 GB per request. With 4 concurrent requests, you need ~39 GB just for KV cache, before model weights. See the GPU memory requirements guide for LLMs for the full VRAM model including weights and activation memory, and KV cache optimization techniques to cut that 9.8 GB to under 5 GB with FP8 quantization.
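The same arithmetic as a small helper you can point at any GQA model's config. A minimal sketch; the constants in the example are the DeepSeek-R1-Distill-Llama-70B values from the text:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype width."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_param

# DeepSeek-R1-Distill-Llama-70B: 80 layers, 8 KV heads, 128 head_dim, FP16 KV
print(kv_cache_bytes(80, 8, 128, 10_000) / 1e9)  # ~3.28 GB per request
print(kv_cache_bytes(80, 8, 128, 30_000) / 1e9)  # ~9.83 GB per request
```

Passing `bytes_per_param=1` gives the FP8 KV figures used later in this article.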
Techniques: Chain-of-Thought, Best-of-N, Tree Search, and Verification Loops
Not all inference-time compute strategies work the same way, and they have very different GPU memory implications.
| Technique | GPU Multiplier | Accuracy Gain | Best For |
|---|---|---|---|
| Chain-of-thought (CoT) | 5-30x tokens | High on reasoning tasks | Math, code, logic |
| Best-of-N sampling | N x baseline | Moderate | Code generation, factual Q&A |
| Tree search (MCTS-style) | 20-100x | Highest | Complex multi-step planning |
| Verification loops | 2-5x | High when verifier is accurate | Proof checking, test generation |
Chain-of-thought generates a single sequential reasoning chain. One KV cache per request, growing with each thinking token. Memory-efficient relative to other approaches.
Best-of-N sampling runs N independent generation passes and picks the best result. For N=8, you need either 8x the sequential time (serial) or 8x the concurrent KV caches (parallel). Parallel Best-of-8 on a query that normally uses 500 tokens uses 8 separate KV caches simultaneously.
Tree search maintains multiple partial reasoning paths in memory at once. An MCTS-style search with branching factor 4 and depth 5 can hold up to 1,024 partial KV cache states at peak. This is why tree search can require 20-100x the baseline GPU memory.
Verification loops run a generator and a separate verifier model. The verifier scores candidate outputs and the generator reruns with feedback. Memory overhead is 2x at minimum (two models), but both models are typically smaller than a single large generator.
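The Best-of-N and verification patterns above can be sketched in a few lines. This is an illustrative skeleton only: `generate` and `score` are hypothetical stand-ins for a real sampling call and a real verifier (a reward model, unit-test runner, or majority-vote check):

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one independent sampled generation pass."""
    rng = random.Random(seed)
    return f"{prompt}-candidate-{rng.randint(0, 999)}"

def score(candidate: str) -> float:
    """Hypothetical verifier score; placeholder logic for illustration."""
    return float(len(candidate))

def best_of_n(prompt: str, n: int = 8) -> str:
    # N independent passes: run serially for 1x memory / Nx latency,
    # or in parallel for Nx concurrent KV caches / 1x latency.
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)
```

The serial-vs-parallel choice inside `best_of_n` is exactly the memory/latency tradeoff described above: parallel Best-of-8 holds 8 KV caches at once.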
GPU Memory and Bandwidth Requirements for Extended Reasoning
For reasoning workloads, the GPU spec that matters most is not FLOPS. It is HBM capacity and HBM bandwidth.
During the decode phase of a reasoning chain, the model reads the full accumulated KV cache on every single token generation step. A 10,000-token reasoning chain means the model reads ~3.28 GB of KV cache per decode step. At 10 tokens per second, that is 32.8 GB/s of sustained KV cache reads, on top of weight reads. Bandwidth saturation is where reasoning inference stalls, not arithmetic throughput.
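The bandwidth figure above is simple arithmetic worth making explicit. A sketch, using the numbers from the text:

```python
def decode_read_gbps(kv_cache_gb: float, weights_gb: float,
                     tokens_per_sec: float) -> float:
    """Sustained HBM read traffic during decode: the weights and the full
    accumulated KV cache are both read on every generated token."""
    return (kv_cache_gb + weights_gb) * tokens_per_sec

# KV-cache-only traffic: 3.28 GB cache at 10 tokens/sec
print(decode_read_gbps(3.28, 0, 10))   # ~32.8 GB/s, before weight reads
# Including 70 GB of weights, the weight reads dominate:
print(decode_read_gbps(3.28, 70, 10))  # ~733 GB/s total
```

This is why HBM bandwidth, not FLOPS, is the binding constraint for decode-heavy reasoning chains.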
| GPU | HBM Capacity | HBM Bandwidth | Max Reasoning Context (DeepSeek-R1 70B, FP16 KV) |
|---|---|---|---|
| H100 SXM5 80GB | 80 GB | 3.35 TB/s | ~24,000 tokens (with 70GB weights) |
| H200 SXM5 | 141 GB | 4.8 TB/s | ~60,000 tokens (with 70GB weights) |
| B200 SXM6 | 180 GB | 7.7 TB/s | ~90,000 tokens (with 70GB FP8 weights) |
MLA note: The full DeepSeek-R1-671B model uses Multi-head Latent Attention (MLA), which compresses KV vectors to a lower-dimensional latent space before storage. This reduces KV cache memory pressure by roughly 8-10x compared to standard MHA at the same context length. A 30,000-token reasoning chain on DeepSeek-R1-671B uses approximately 2 GB for the KV cache rather than 9 GB. Distilled variants (DeepSeek-R1-Distill-Llama-70B) use standard GQA and do not get this benefit.
For understanding how continuous batching behaves under long reasoning chains, see LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill. For cases where your workload benefits from splitting prefill and decode across different GPU types, see Prefill-Decode Disaggregation on GPU Cloud.
Right-Sizing GPU Cloud Instances for Variable Compute Budgets
Inference-time compute is a per-request budget, not a fixed server spec. A coding assistant answering a simple question needs 1,000 thinking tokens. The same assistant tackling an architecture review might need 15,000. Your GPU instances need to accommodate the worst case.
The floor is non-negotiable: the GPU must hold model weights plus enough KV cache for at least one reasoning chain at your maximum thinking depth. Beyond that, headroom determines your concurrency.
| Use Case | Model Size | Reasoning Depth | Recommended GPU | Min VRAM |
|---|---|---|---|---|
| Light reasoning (coding assistant) | 7B-14B | 1,000-3,000 tokens | RTX 5090 / L40S | 24-48 GB |
| Medium reasoning (research agent) | 32B-70B | 3,000-10,000 tokens | H100 80GB | 80 GB |
| Heavy reasoning (multi-step planning) | 70B-671B | 10,000-30,000 tokens | H200 / B200 | 141-180 GB |
Tensor parallelism as a KV cache multiplier: Splitting a model across 2 GPUs with tensor parallelism more than doubles the memory available for KV cache, because total VRAM doubles while the weights are split across both cards. A 2x H100 setup gives 160 GB total; after 70 GB of DeepSeek-R1 70B weights, ~90 GB remains for KV cache, which covers roughly 27,000 tokens at FP8 KV for each of ~20 concurrent requests.
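A back-of-envelope helper for this sizing exercise. A sketch under the assumptions in the text (weights held once across the tensor-parallel group, FP8 KV at ~163,840 bytes per token for this model):

```python
def concurrent_chains(num_gpus: int, vram_per_gpu_gb: float,
                      weights_gb: float, tokens_per_chain: int,
                      kv_bytes_per_token: int) -> int:
    """Concurrency budget under tensor parallelism: one copy of the weights
    is split across the group, so free VRAM for KV cache is total VRAM
    minus the weights."""
    free_gb = num_gpus * vram_per_gpu_gb - weights_gb
    chain_gb = tokens_per_chain * kv_bytes_per_token / 1e9
    return int(free_gb // chain_gb)

# 2x H100 (160 GB), 70 GB weights, 27K-token chains, FP8 KV
print(concurrent_chains(2, 80, 70, 27_000, 163_840))  # ~20 concurrent chains
```

The same function with `num_gpus=1` shows why a single H100 supports only a couple of deep reasoning chains once the 70B weights are loaded.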
Rent the right GPU for your reasoning depth: H100, H200, B200.
Autoscaling for Bursty Reasoning Traffic
Reasoning workloads are bursty in two dimensions. First, traffic spikes at the request level. Second, individual requests vary by 50x in GPU time consumed. Two concurrent o3-level requests can spike GPU memory usage by 10x compared to two standard queries hitting the same endpoint.
Requests-per-second is the wrong scaling trigger for reasoning workloads. A queue with 10 standard LLM requests clears in seconds. A queue with 10 o3-level requests can take minutes. Scale on token budget in queue, not request count.
Practical autoscaling approach:
- Monitor `vllm:num_requests_waiting` and `vllm:kv_cache_usage_perc` from vLLM's Prometheus endpoint.
- When `vllm:kv_cache_usage_perc` exceeds 85%, add GPU capacity. KV cache saturation is your primary signal.
- Use spot instances for async and batch reasoning jobs that tolerate latency spikes. Use on-demand for real-time reasoning paths.
- Per-second billing matters here. An o3-equivalent query taking 45 seconds of GPU time costs 80x less per job with per-second billing than with hourly minimums (assuming the job finishes well within the hour). At any realistic job mix where average reasoning time is under 10 minutes, per-second billing reduces GPU costs by 20-40% over hourly minimums.
Dynamic GPU allocation pattern: Keep a baseline of on-demand GPUs sized for your P50 reasoning load. Use spot instances for burst. When a spot instance gets preempted mid-reasoning-chain, the request returns to queue and runs on the next available GPU.
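The scaling decision itself is a few lines. A minimal sketch assuming the vLLM metric names mentioned earlier; fetching `/metrics` over HTTP and the actual capacity-add call are provider-specific and omitted here:

```python
KV_USAGE_METRIC = "vllm:kv_cache_usage_perc"
THRESHOLD = 0.85  # scale up past 85% KV cache usage

def parse_metric(metrics_text: str, name: str) -> float:
    """Extract a gauge value from Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def should_scale_up(metrics_text: str) -> bool:
    """True when KV cache saturation crosses the threshold."""
    return parse_metric(metrics_text, KV_USAGE_METRIC) > THRESHOLD
```

Note the trigger is KV cache usage, not request count: this is the "scale on token budget in queue" principle from above expressed against the metric vLLM actually exports.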
See Serverless GPU vs On-Demand vs Reserved for a detailed comparison of billing model tradeoffs. For multi-agent workloads where multiple reasoning models run in parallel, GPU Infrastructure for AI Agents 2026 covers the coordination and GPU allocation patterns.
Cost Analysis: Inference-Time Scaling vs Training a Bigger Model
The economic case for test-time compute is straightforward for accuracy targets that don't require the absolute best model quality.
Training a 7B model to match 70B quality requires roughly 10x more training compute, more data, and months of iteration. For many tasks, a 7B model with Best-of-32 sampling or heavy CoT achieves competitive quality at a fraction of the infrastructure cost.
Live Spheron GPU pricing (fetched 07 Apr 2026):
| GPU | On-Demand | Spot |
|---|---|---|
| H100 SXM5 | $2.90/hr | $0.80/hr |
| H200 SXM5 | $4.54/hr | N/A |
| B200 SXM6 | $7.43/hr | $1.71/hr |
Cost comparison for a 70B reasoning job vs 7B with extended inference:
| Approach | Model | GPU | Time Per Query | Cost Per Query |
|---|---|---|---|---|
| Standard inference | 70B | H100 on-demand ($2.90/hr) | ~8s | ~$0.0064 |
| Heavy CoT (15K tokens) | 7B | H100 on-demand ($2.90/hr) | ~40s | ~$0.032 |
| Best-of-32 sampling | 7B | H100 spot ($0.80/hr) | ~120s (32 passes) | ~$0.027 |
| Extended reasoning | 70B | H100 spot ($0.80/hr) | ~40s | ~$0.0089 |
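The per-query costs in the table reduce to one formula. A sketch reproducing two of the rows:

```python
def cost_per_query(hourly_rate_usd: float, gpu_seconds: float) -> float:
    """Per-second billing: hourly rate / 3600 x GPU-seconds, no hourly minimum."""
    return hourly_rate_usd / 3600 * gpu_seconds

# Rows from the table above
print(round(cost_per_query(2.90, 8), 4))   # 0.0064 - 70B standard, H100 on-demand
print(round(cost_per_query(0.80, 40), 4))  # 0.0089 - 70B extended reasoning, H100 spot
```

Under hourly minimums the second query would be billed a full hour of spot time ($0.80) instead of $0.0089, which is where the 80x-per-job difference for short reasoning jobs comes from.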
At low query volumes (under 10,000 queries/day), spot-based extended reasoning with a 70B model costs under $90/day at ~$0.0089 per query and matches or exceeds 70B standard quality on most reasoning tasks. Training a better base model to match that quality would cost orders of magnitude more.
The crossover point: at very high volumes (millions of queries/day), heavy CoT with a 7B model on spot becomes significantly cheaper than running 70B on-demand. At moderate volumes, 70B on spot with extended reasoning is the best value.
Hyperscaler equivalents for H100-class GPUs typically run at 2-4x the hourly rate with hourly billing minimums. Per-second billing on Spheron eliminates the minimum charge for short reasoning jobs.
Pricing fluctuates based on GPU availability. The prices above are based on 07 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For the full FinOps framework for AI inference, see AI Inference Cost Economics in 2026.
Deploying Inference-Time Scaling with vLLM and SGLang
vLLM Configuration for Reasoning Models
vLLM handles reasoning workloads well with a few key parameter changes:
--max-model-len: Set this to cover the full thinking token budget plus prompt plus response. For o3-equivalent workloads with 30,000 thinking tokens, 2,000-token prompts, and 1,000-token responses, set --max-model-len 33000. vLLM refuses to start if the KV cache pool (sized by --gpu-memory-utilization) cannot hold at least one sequence of this length, so this value directly constrains your memory planning.
--kv-cache-dtype fp8: Halves KV cache VRAM with minimal quality impact. Critical for reasoning workloads where KV cache dominates memory. On an H200 with --max-model-len 32768 and DeepSeek-R1 70B, FP8 KV lets you serve roughly 12 concurrent reasoning chains instead of 6 at FP16. See the KV cache optimization guide for the full quantization breakdown.
--max-num-seqs: Tune this down for reasoning workloads. Default is 256, which assumes short contexts. At 32K context per request, each concurrent sequence holds ~10 GB of KV cache at FP16 (or ~5 GB at FP8). On an 80 GB H100 with 70 GB occupied by weights, you have ~10 GB for KV cache, limiting you to 1-2 concurrent sequences at FP16 or 2-4 at FP8. Set --max-num-seqs to match the actual concurrency your memory budget allows.
--enable-chunked-prefill: For reasoning workloads, long prompts (2,000-5,000 tokens) combined with 30,000-token thinking chains create head-of-line blocking where a single heavy request blocks the queue. Chunked prefill breaks long prefills into smaller chunks, interleaving them with decode steps for other requests.
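A back-of-envelope helper for choosing --max-num-seqs. A sketch under the assumptions above (worst case of every sequence reaching full length; per-token KV sizes as computed earlier for this model):

```python
def max_concurrent_seqs(vram_gb: float, weights_gb: float,
                        max_model_len: int, kv_bytes_per_token: int,
                        mem_util: float = 0.95) -> int:
    """Worst-case bound on --max-num-seqs: KV cache budget divided by
    the KV footprint of one full-length sequence."""
    kv_budget_gb = vram_gb * mem_util - weights_gb
    per_seq_gb = max_model_len * kv_bytes_per_token / 1e9
    return max(int(kv_budget_gb // per_seq_gb), 0)

# H200 141 GB, 70 GB FP8 weights, 32K context, FP8 KV (163,840 B/token)
print(max_concurrent_seqs(141, 70, 32_768, 163_840))  # 11
```

In practice vLLM's PagedAttention shares the pool dynamically, so real concurrency is often higher when chains finish early; this bound is the safe floor for sizing.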
Example launch for H200 with DeepSeek-R1-Distill-Llama-70B:
```shell
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 33000 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
```

SGLang Configuration for Reasoning Models
SGLang's RadixAttention feature is particularly useful for multi-turn reasoning workloads where the model revisits earlier reasoning steps. It caches KV states for shared prefixes across requests, reducing redundant computation.
--context-length: Override to match your maximum expected reasoning chain length. Equivalent to vLLM's --max-model-len.
--mem-fraction-static: Controls the fraction of GPU memory reserved for the KV cache pool. Lower values leave more memory for model weights; higher values support more concurrent reasoning chains. Default is 0.9; for reasoning workloads, 0.88-0.92 depending on model size.
For multi-turn reasoning agents that revisit context across turns, RadixAttention can cut TTFT by 40-70% by avoiding re-prefill of shared context. This compounds significantly for agentic workloads running dozens of reasoning steps.
For the full vLLM setup including multi-GPU tensor parallelism and production monitoring, see the vLLM production deployment guide. For SGLang-specific setup and RadixAttention configuration details, see the SGLang production deployment guide. For throughput and latency benchmarks comparing vLLM, TensorRT-LLM, and SGLang on reasoning workloads, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
For cluster setup documentation on Spheron, see docs.spheron.ai.
Inference-time compute scaling puts GPU demand squarely on a per-request basis, a workload profile where per-second billing and elastic access matter more than reserved capacity. Rent H200 | Rent B200 | View all pricing
