The ability to improve model output quality by spending more GPU compute at inference time is now a first-class concern for anyone running reasoning models in production. This changes how you provision GPUs, size KV cache, and think about billing. Before diving in, note that the KV cache memory implications for reasoning workloads and the cost optimization techniques for reasoning inference are directly relevant to what follows.
What Is Inference-Time Compute Scaling
Inference-time compute scaling, also called test-time compute scaling, means spending more GPU compute on a single query to get a better answer. The model doesn't change. The weights stay fixed. What changes is how many tokens the model generates while working through the problem.
This is fundamentally different from training-time scaling, where you improve model quality by using more compute during pre-training. Training-time scaling requires months, billions of dollars, and produces a fixed artifact. Inference-time scaling is a dial you turn per request.
Models like o3, DeepSeek-R1, Claude with extended thinking, and QwQ-32B achieve higher accuracy by generating internal reasoning tokens before producing the visible response. The quality of the answer correlates with how much thinking the model is allowed to do. More thinking tokens, generally better answers.
| Scaling Paradigm | Where Compute Is Spent | GPU Demand Pattern |
|---|---|---|
| Training-time scaling | Pre-training FLOPS | Predictable, long-running jobs |
| Inference-time scaling | Per-query generation | Bursty, variable, request-scoped |
The operational implications are significant. Training jobs run on fixed hardware for fixed durations. Inference-time scaling creates workloads where two consecutive requests to the same model can differ by 50x in GPU time consumed.
How Reasoning Models Use 10-100x More GPU Per Query
A standard LLM generates 200-500 output tokens for most queries. A reasoning model emits thinking tokens before the visible response. Those thinking tokens consume exactly the same compute and KV cache memory as output tokens.
| Model | Avg Thinking Tokens | Avg Response Tokens | Total Multiplier vs GPT-4o |
|---|---|---|---|
| DeepSeek-R1 | 4,000-12,000 | 500-1,500 | 8-25x |
| o3 (high) | 10,000-30,000 | 300-800 | 20-50x |
| QwQ-32B | 2,000-8,000 | 400-1,000 | 6-18x |
| Claude Sonnet 4.6 (adaptive thinking) | 3,000-10,000 | 400-1,200 | 8-22x |
| GPT-4o | - | 300-600 | 1x (baseline) |
The KV cache grows with every token generated. For DeepSeek-R1-Distill-Llama-70B (80 layers, 8 KV heads, 128 head_dim), the KV cache size at a 10,000-token reasoning chain is:
2 x 80 layers x 8 KV heads x 128 head_dim x 10,000 tokens x 2 bytes (FP16) = ~3.28 GB per concurrent request

At 30,000 tokens (o3-level reasoning), that's ~9.8 GB per request. With 4 concurrent requests, you need ~39 GB just for KV cache, before model weights. See the GPU memory requirements guide for LLMs for the full VRAM model including weights and activation memory, and KV cache optimization techniques to cut that 9.8 GB to under 5 GB with FP8 quantization.
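The same arithmetic as a small helper you can point at any GQA model's config. A minimal sketch; the constants in the example are the DeepSeek-R1-Distill-Llama-70B values from the text:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype width."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_param

# DeepSeek-R1-Distill-Llama-70B: 80 layers, 8 KV heads, 128 head_dim, FP16 KV
print(kv_cache_bytes(80, 8, 128, 10_000) / 1e9)  # ~3.28 GB per request
print(kv_cache_bytes(80, 8, 128, 30_000) / 1e9)  # ~9.83 GB per request
```

Passing `bytes_per_param=1` gives the FP8 KV figures used later in this article.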
Techniques: Chain-of-Thought, Best-of-N, Tree Search, and Verification Loops
Not all inference-time compute strategies work the same way, and they have very different GPU memory implications.
| Technique | GPU Multiplier | Accuracy Gain | Best For |
|---|---|---|---|
| Chain-of-thought (CoT) | 5-30x tokens | High on reasoning tasks | Math, code, logic |
| Best-of-N sampling | N x baseline | Moderate | Code generation, factual Q&A |
| Tree search (MCTS-style) | 20-100x | Highest | Complex multi-step planning |
| Verification loops | 2-5x | High when verifier is accurate | Proof checking, test generation |
Chain-of-thought generates a single sequential reasoning chain. One KV cache per request, growing with each thinking token. Memory-efficient relative to other approaches.
Best-of-N sampling runs N independent generation passes and picks the best result. For N=8, you need either 8x the sequential time (serial) or 8x the concurrent KV caches (parallel). Parallel Best-of-8 on a query that normally uses 500 tokens uses 8 separate KV caches simultaneously.
Tree search maintains multiple partial reasoning paths in memory at once. An MCTS-style search with branching factor 4 and depth 5 can hold up to 1,024 partial KV cache states at peak. This is why tree search can require 20-100x the baseline GPU memory.
Verification loops run a generator and a separate verifier model. The verifier scores candidate outputs and the generator reruns with feedback. Memory overhead is 2x at minimum (two models), but both models are typically smaller than a single large generator.
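The Best-of-N and verification patterns above can be sketched in a few lines. This is an illustrative skeleton only: `generate` and `score` are hypothetical stand-ins for a real sampling call and a real verifier (a reward model, unit-test runner, or majority-vote check):

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one independent sampled generation pass."""
    rng = random.Random(seed)
    return f"{prompt}-candidate-{rng.randint(0, 999)}"

def score(candidate: str) -> float:
    """Hypothetical verifier score; placeholder logic for illustration."""
    return float(len(candidate))

def best_of_n(prompt: str, n: int = 8) -> str:
    # N independent passes: run serially for 1x memory / Nx latency,
    # or in parallel for Nx concurrent KV caches / 1x latency.
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)
```

The serial-vs-parallel choice inside `best_of_n` is exactly the memory/latency tradeoff described above: parallel Best-of-8 holds 8 KV caches at once.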
GPU Memory and Bandwidth Requirements for Extended Reasoning
For reasoning workloads, the GPU spec that matters most is not FLOPS. It is HBM capacity and HBM bandwidth.
During the decode phase of a reasoning chain, the model reads the full accumulated KV cache on every single token generation step. A 10,000-token reasoning chain means the model reads ~3.28 GB of KV cache per decode step. At 10 tokens per second, that is 32.8 GB/s of sustained KV cache reads, on top of weight reads. Bandwidth saturation is where reasoning inference stalls, not arithmetic throughput.
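The bandwidth figure above is simple arithmetic worth making explicit. A sketch, using the numbers from the text:

```python
def decode_read_gbps(kv_cache_gb: float, weights_gb: float,
                     tokens_per_sec: float) -> float:
    """Sustained HBM read traffic during decode: the weights and the full
    accumulated KV cache are both read on every generated token."""
    return (kv_cache_gb + weights_gb) * tokens_per_sec

# KV-cache-only traffic: 3.28 GB cache at 10 tokens/sec
print(decode_read_gbps(3.28, 0, 10))   # ~32.8 GB/s, before weight reads
# Including 70 GB of weights, the weight reads dominate:
print(decode_read_gbps(3.28, 70, 10))  # ~733 GB/s total
```

This is why HBM bandwidth, not FLOPS, is the binding constraint for decode-heavy reasoning chains.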
| GPU | HBM Capacity | HBM Bandwidth | Max Reasoning Context (DeepSeek-R1 70B, FP16 KV) |
|---|---|---|---|
| H100 SXM5 80GB | 80 GB | 3.35 TB/s | ~24,000 tokens (with 70GB weights) |
| H200 SXM5 | 141 GB | 4.8 TB/s | ~60,000 tokens (with 70GB weights) |
| B200 SXM6 | 180 GB | 7.7 TB/s | ~90,000 tokens (with 70GB FP8 weights) |
MLA note: The full DeepSeek-R1-671B model uses Multi-head Latent Attention (MLA), which compresses KV vectors to a lower-dimensional latent space before storage. This reduces KV cache memory pressure by roughly 8-10x compared to standard MHA at the same context length. A 30,000-token reasoning chain on DeepSeek-R1-671B uses approximately 2 GB for the KV cache rather than 9 GB. Distilled variants (DeepSeek-R1-Distill-Llama-70B) use standard GQA and do not get this benefit.
For understanding how continuous batching behaves under long reasoning chains, see LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill. For cases where your workload benefits from splitting prefill and decode across different GPU types, see Prefill-Decode Disaggregation on GPU Cloud.
Right-Sizing GPU Cloud Instances for Variable Compute Budgets
Inference-time compute is a per-request budget, not a fixed server spec. A coding assistant answering a simple question needs 1,000 thinking tokens. The same assistant tackling an architecture review might need 15,000. Your GPU instances need to accommodate the worst case.
The floor is non-negotiable: the GPU must hold model weights plus enough KV cache for at least one reasoning chain at your maximum thinking depth. Beyond that, headroom determines your concurrency.
| Use Case | Model Size | Reasoning Depth | Recommended GPU | Min VRAM |
|---|---|---|---|---|
| Light reasoning (coding assistant) | 7B-14B | 1,000-3,000 tokens | RTX 5090 / L40S | 24-48 GB |
| Medium reasoning (research agent) | 32B-70B | 3,000-10,000 tokens | H100 80GB | 80 GB |
| Heavy reasoning (multi-step planning) | 70B-671B | 10,000-30,000 tokens | H200 / B200 | 141-180 GB |
Tensor parallelism as a KV cache multiplier: Splitting a model across 2 GPUs with tensor parallelism more than doubles the memory available for KV cache, because total VRAM doubles while the weights are split across both cards. A 2x H100 setup gives 160 GB total; after 70 GB of DeepSeek-R1 70B weights, ~90 GB remains for KV cache, which covers roughly 27,000 tokens at FP8 KV for each of ~20 concurrent requests.
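A back-of-envelope helper for this sizing exercise. A sketch under the assumptions in the text (weights held once across the tensor-parallel group, FP8 KV at ~163,840 bytes per token for this model):

```python
def concurrent_chains(num_gpus: int, vram_per_gpu_gb: float,
                      weights_gb: float, tokens_per_chain: int,
                      kv_bytes_per_token: int) -> int:
    """Concurrency budget under tensor parallelism: one copy of the weights
    is split across the group, so free VRAM for KV cache is total VRAM
    minus the weights."""
    free_gb = num_gpus * vram_per_gpu_gb - weights_gb
    chain_gb = tokens_per_chain * kv_bytes_per_token / 1e9
    return int(free_gb // chain_gb)

# 2x H100 (160 GB), 70 GB weights, 27K-token chains, FP8 KV
print(concurrent_chains(2, 80, 70, 27_000, 163_840))  # ~20 concurrent chains
```

The same function with `num_gpus=1` shows why a single H100 supports only a couple of deep reasoning chains once the 70B weights are loaded.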
Rent the right GPU for your reasoning depth: H100, H200, B200.
Autoscaling for Bursty Reasoning Traffic
Reasoning workloads are bursty in two dimensions. First, traffic spikes at the request level. Second, individual requests vary by 50x in GPU time consumed. Two concurrent o3-level requests can spike GPU memory usage by 10x compared to two standard queries hitting the same endpoint.
Requests-per-second is the wrong scaling trigger for reasoning workloads. A queue with 10 standard LLM requests clears in seconds. A queue with 10 o3-level requests can take minutes. Scale on token budget in queue, not request count.
Practical autoscaling approach:
- Monitor `vllm:num_requests_waiting` and `vllm:kv_cache_usage_perc` from vLLM's Prometheus endpoint.
- When `vllm:kv_cache_usage_perc` exceeds 85%, add GPU capacity. KV cache saturation is your primary signal.
- Use spot instances for async and batch reasoning jobs that tolerate latency spikes. Use on-demand for real-time reasoning paths.
- Per-second billing matters here. An o3-equivalent query taking 45 seconds of GPU time costs 80x less per job with per-second billing than with hourly minimums (assuming the job finishes well within the hour). At any realistic job mix where average reasoning time is under 10 minutes, per-second billing reduces GPU costs by 20-40% over hourly minimums.
Dynamic GPU allocation pattern: Keep a baseline of on-demand GPUs sized for your P50 reasoning load. Use spot instances for burst. When a spot instance gets preempted mid-reasoning-chain, the request returns to queue and runs on the next available GPU.
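The scaling decision itself is a few lines. A minimal sketch assuming the vLLM metric names mentioned earlier; fetching `/metrics` over HTTP and the actual capacity-add call are provider-specific and omitted here:

```python
KV_USAGE_METRIC = "vllm:kv_cache_usage_perc"
THRESHOLD = 0.85  # scale up past 85% KV cache usage

def parse_metric(metrics_text: str, name: str) -> float:
    """Extract a gauge value from Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def should_scale_up(metrics_text: str) -> bool:
    """True when KV cache saturation crosses the threshold."""
    return parse_metric(metrics_text, KV_USAGE_METRIC) > THRESHOLD
```

Note the trigger is KV cache usage, not request count: this is the "scale on token budget in queue" principle from above expressed against the metric vLLM actually exports.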
See Serverless GPU vs On-Demand vs Reserved for a detailed comparison of billing model tradeoffs. For multi-agent workloads where multiple reasoning models run in parallel, GPU Infrastructure for AI Agents 2026 covers the coordination and GPU allocation patterns.
Cost Analysis: Inference-Time Scaling vs Training a Bigger Model
The economic case for test-time compute is straightforward for accuracy targets that don't require the absolute best model quality.
Training a 7B model to match 70B quality requires roughly 10x more training compute, more data, and months of iteration. For many tasks, a 7B model with Best-of-32 sampling or heavy CoT achieves competitive quality at a fraction of the infrastructure cost.
Live Spheron GPU pricing (fetched 07 Apr 2026):
| GPU | On-Demand | Spot |
|---|---|---|
| H100 SXM5 | $2.90/hr | $0.80/hr |
| H200 SXM5 | $4.54/hr | N/A |
| B200 SXM6 | $7.43/hr | $1.71/hr |
Cost comparison for a 70B reasoning job vs 7B with extended inference:
| Approach | Model | GPU | Time Per Query | Cost Per Query |
|---|---|---|---|---|
| Standard inference | 70B | H100 on-demand ($2.90/hr) | ~8s | ~$0.0064 |
| Heavy CoT (15K tokens) | 7B | H100 on-demand ($2.90/hr) | ~40s | ~$0.032 |
| Best-of-32 sampling | 7B | H100 spot ($0.80/hr) | ~120s (32 passes) | ~$0.027 |
| Extended reasoning | 70B | H100 spot ($0.80/hr) | ~40s | ~$0.0089 |
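The per-query costs in the table reduce to one formula. A sketch reproducing two of the rows:

```python
def cost_per_query(hourly_rate_usd: float, gpu_seconds: float) -> float:
    """Per-second billing: hourly rate / 3600 x GPU-seconds, no hourly minimum."""
    return hourly_rate_usd / 3600 * gpu_seconds

# Rows from the table above
print(round(cost_per_query(2.90, 8), 4))   # 0.0064 - 70B standard, H100 on-demand
print(round(cost_per_query(0.80, 40), 4))  # 0.0089 - 70B extended reasoning, H100 spot
```

Under hourly minimums the second query would be billed a full hour of spot time ($0.80) instead of $0.0089, which is where the 80x-per-job difference for short reasoning jobs comes from.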
At low query volumes (under 10,000 queries/day), spot-based extended reasoning with a 70B model costs under $90/day at ~$0.0089 per query and matches or exceeds 70B standard quality on most reasoning tasks. Training a better base model to match that quality would cost orders of magnitude more.
The crossover point: at very high volumes (millions of queries/day), heavy CoT with a 7B model on spot becomes significantly cheaper than running 70B on-demand. At moderate volumes, 70B on spot with extended reasoning is the best value.
Hyperscaler equivalents for H100-class GPUs typically run at 2-4x the hourly rate with hourly billing minimums. Per-second billing on Spheron eliminates the minimum charge for short reasoning jobs.
Pricing fluctuates based on GPU availability. The prices above are based on 07 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For the full FinOps framework for AI inference, see AI Inference Cost Economics in 2026.
Deploying Inference-Time Scaling with vLLM and SGLang
vLLM Configuration for Reasoning Models
vLLM handles reasoning workloads well with a few key parameter changes:
--max-model-len: Set this to cover the full thinking token budget plus prompt plus response. For o3-equivalent workloads with 30,000 thinking tokens, 2,000-token prompts, and 1,000-token responses, set --max-model-len 33000. vLLM refuses to start if the KV cache pool (sized by --gpu-memory-utilization) cannot hold at least one sequence of this length, so this value directly constrains your memory planning.
--kv-cache-dtype fp8: Halves KV cache VRAM with minimal quality impact. Critical for reasoning workloads where KV cache dominates memory. On an H200 with --max-model-len 32768 and DeepSeek-R1 70B, FP8 KV lets you serve roughly 12 concurrent reasoning chains instead of 6 at FP16. See the KV cache optimization guide for the full quantization breakdown.
--max-num-seqs: Tune this down for reasoning workloads. Default is 256, which assumes short contexts. At 32K context per request, each concurrent sequence holds ~10 GB of KV cache at FP16 (or ~5 GB at FP8). On an 80 GB H100 with 70 GB occupied by weights, you have ~10 GB for KV cache, limiting you to 1-2 concurrent sequences at FP16 or 2-4 at FP8. Set --max-num-seqs to match the actual concurrency your memory budget allows.
--enable-chunked-prefill: For reasoning workloads, long prompts (2,000-5,000 tokens) combined with 30,000-token thinking chains create head-of-line blocking where a single heavy request blocks the queue. Chunked prefill breaks long prefills into smaller chunks, interleaving them with decode steps for other requests.
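A back-of-envelope helper for choosing --max-num-seqs. A sketch under the assumptions above (worst case of every sequence reaching full length; per-token KV sizes as computed earlier for this model):

```python
def max_concurrent_seqs(vram_gb: float, weights_gb: float,
                        max_model_len: int, kv_bytes_per_token: int,
                        mem_util: float = 0.95) -> int:
    """Worst-case bound on --max-num-seqs: KV cache budget divided by
    the KV footprint of one full-length sequence."""
    kv_budget_gb = vram_gb * mem_util - weights_gb
    per_seq_gb = max_model_len * kv_bytes_per_token / 1e9
    return max(int(kv_budget_gb // per_seq_gb), 0)

# H200 141 GB, 70 GB FP8 weights, 32K context, FP8 KV (163,840 B/token)
print(max_concurrent_seqs(141, 70, 32_768, 163_840))  # 11
```

In practice vLLM's PagedAttention shares the pool dynamically, so real concurrency is often higher when chains finish early; this bound is the safe floor for sizing.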
Example launch for H200 with DeepSeek-R1-Distill-Llama-70B:
```shell
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 33000 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
```

SGLang Configuration for Reasoning Models
SGLang's RadixAttention feature is particularly useful for multi-turn reasoning workloads where the model revisits earlier reasoning steps. It caches KV states for shared prefixes across requests, reducing redundant computation.
--context-length: Override to match your maximum expected reasoning chain length. Equivalent to vLLM's --max-model-len.
--mem-fraction-static: Controls the fraction of GPU memory reserved for the KV cache pool. Lower values leave more memory for model weights; higher values support more concurrent reasoning chains. Default is 0.9; for reasoning workloads, 0.88-0.92 depending on model size.
For multi-turn reasoning agents that revisit context across turns, RadixAttention can cut TTFT by 40-70% by avoiding re-prefill of shared context. This compounds significantly for agentic workloads running dozens of reasoning steps.
For the full vLLM setup including multi-GPU tensor parallelism and production monitoring, see the vLLM production deployment guide. For SGLang-specific setup and RadixAttention configuration details, see the SGLang production deployment guide. For throughput and latency benchmarks comparing vLLM, TensorRT-LLM, and SGLang on reasoning workloads, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
For cluster setup documentation on Spheron, see docs.spheron.ai.
Inference-time compute scaling puts GPU demand squarely on a per-request basis, a workload profile where per-second billing and elastic access matter more than reserved capacity. Rent H200 | Rent B200 | View all pricing
