
What Is Inference Engineering? The 2026 GPU Cloud Guide

Written by Mitrasish, Co-founder · Apr 7, 2026

Inference now accounts for the majority of AI compute in production systems. Training ends when the model ships. Inference runs every hour, every day, as long as users are hitting your API. The discipline that makes that sustainable at scale is inference engineering.

Inference engineers own four things: hardware selection, serving framework configuration, cost-per-token optimization, and reliability SLAs. If you are deploying a model to production and care about whether it serves 10 users or 10,000 without tripling your GPU bill, inference engineering is the work.

What Is Inference Engineering?

Inference engineering is the practice of serving trained machine learning models to users at scale. It is not about building models. It is about making them fast, affordable, and reliable after they are built.

The discipline emerged as LLMs went to production. A 70B parameter model running on a single GPU naively processes one request at a time. An inference engineer gets that same model to serve hundreds of concurrent users at a fraction of the cost per token, using batching, quantization, caching, and hardware selection.

What inference engineers own:

  • Hardware selection: which GPU, how many, which memory tier
  • Serving framework configuration: vLLM, SGLang, TensorRT-LLM, or NVIDIA Dynamo
  • KV cache management: memory allocation, prefix caching, quantization
  • Cost-per-token optimization: throughput vs latency trade-offs, spot vs on-demand, right-sizing
  • Reliability SLAs: time-to-first-token targets, inter-token latency budgets, uptime guarantees

Inference Engineering vs ML Engineering vs MLOps

These three roles share territory but own different things. The confusion is common because small teams often combine them.

| Discipline | Owns | Does not own |
|---|---|---|
| ML Engineering | Model training, datasets, evaluation, fine-tuning pipelines | Serving infrastructure, hardware costs, latency SLAs |
| MLOps | CI/CD for ML, experiment tracking, model registry, pipeline orchestration | Serving optimization, GPU memory management, cost-per-token |
| Inference Engineering | Serving frameworks, GPU selection, KV cache tuning, throughput/latency trade-offs, inference FinOps | Model architecture, training code, data pipelines |

The split happened because serving LLMs at scale requires depth that generalist ML and DevOps roles do not cover. You need to understand GPU memory hierarchies, kernel-level batching behavior, quantization accuracy trade-offs, and cost-per-token math simultaneously. That depth is its own specialty.

Core Skills Every Inference Engineer Needs

Batching and Scheduling

Static batching groups requests into fixed batches and waits for all of them to finish before accepting new ones. If one request in a batch generates 2,000 tokens and another generates 20, every GPU cycle spent waiting for the long request is wasted capacity.

Continuous batching fixes this by operating at the iteration level. At each decode step, finished requests are immediately replaced by new ones from the queue. The GPU stays full. Throughput goes up 3-5x compared to static batching for typical LLM workloads.
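
To make the difference concrete, here is a toy simulation (not a real scheduler) that counts GPU decode iterations under both policies. The request lengths, batch size, and one-token-per-iteration model are simplifying assumptions for illustration only:

```python
import random
from collections import deque

def decode_steps(lengths, batch_size, continuous):
    """Count GPU decode iterations needed to serve all requests.
    Static: a batch runs until its longest request finishes.
    Continuous: finished slots are refilled every iteration."""
    queue = deque(lengths)
    if not continuous:
        steps = 0
        while queue:
            batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
            steps += max(batch)  # the whole batch waits for the longest request
        return steps
    # continuous batching: each slot holds a remaining-token count
    slots, steps = [], 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())  # refill free slots immediately
        steps += 1
        slots = [r - 1 for r in slots if r > 1]  # requests leave as they finish
    return steps

rng = random.Random(0)
lengths = [rng.randint(20, 2000) for _ in range(256)]  # mixed output lengths
static = decode_steps(lengths, 16, continuous=False)
cont = decode_steps(lengths, 16, continuous=True)
print(static / cont)  # continuous finishes in far fewer GPU iterations
```

The gap widens as output lengths get more skewed, which is exactly the 2,000-token-vs-20-token situation described above.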

For a deep technical breakdown of how continuous batching, PagedAttention, and chunked prefill work together, see our LLM serving optimization guide. If you are new to the inference engineering discipline, read that post alongside this one for the full picture before diving into framework configuration.

KV Cache Management

The KV cache stores key-value attention tensors for every token the model has processed. It is what allows the model to generate subsequent tokens without reprocessing the entire context from scratch.

The problem is size. For a 70B model with FP16 weights, the KV cache for a single 4,096-token request takes roughly 4GB of VRAM. At 50 concurrent requests, that is 200GB, more than any single GPU holds. KV cache management is how you fit more concurrent requests into available VRAM.
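
The arithmetic behind such estimates can be sketched as follows. The layer count, KV-head count, and head dimension below are illustrative; models with grouped-query attention (few KV heads) need far less cache than full multi-head-attention models, so exact per-request figures vary widely by architecture:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one request: 2 tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim] at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class config with grouped-query attention, FP16
# (bytes_per_elem=2), one 4,096-token request
per_request = kv_cache_bytes(4096, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_request / 1e9:.2f} GB per request")
print(f"{50 * per_request / 1e9:.0f} GB at 50 concurrent requests")
```

Whatever the per-request figure for a given model, it scales linearly with both context length and concurrency, which is why KV cache, not weights, is usually what caps concurrent requests.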

Key techniques:

  • PagedAttention: allocates KV cache in fixed-size blocks rather than contiguous chunks, eliminating fragmentation and wasted pre-allocation
  • Prefix caching: reuses KV cache for identical system prompts across requests, reducing compute for shared-prompt workloads by 20-40%
  • KV cache quantization: storing KV tensors in INT8 or INT4 instead of FP16 cuts memory requirements by 2-4x at modest accuracy cost

For a deeper look at each of these techniques and how to configure them in vLLM, see our KV cache optimization guide.

Quantization

Quantization reduces the numerical precision of model weights and activations, which shrinks memory footprint and increases throughput.

The practical progression:

| Precision | Memory vs FP16 | Throughput gain | Accuracy impact |
|---|---|---|---|
| FP16 | 1x (baseline) | Baseline | None |
| FP8 | 0.5x | 1.5-2x | Under 1% on most models |
| INT4 AWQ | 0.25x | 2-3x | 1-3% on sensitive tasks |

FP8 is the default for H100 and B200 hardware. It halves memory bandwidth requirements with negligible accuracy loss on Llama 3, Qwen, and DeepSeek class models. INT4 AWQ works well for A100-class GPUs where VRAM is the binding constraint.
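
For intuition, here is a minimal symmetric per-tensor INT8 round trip. This shows the core mechanism only; it is not AWQ, which additionally uses activation-aware, per-channel scaling to protect salient weights:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: the weights become one
    int8 array plus a single float scale, roughly 1 byte per value
    instead of 2 bytes in FP16."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(w)
max_err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)  # 1 byte vs 4 bytes per value (demo array is FP32)
print(max_err)             # worst-case error is about half a quantization step
```

The accuracy cost in the table comes from exactly this rounding error compounding across layers, which is why lower precisions (INT4) need smarter scaling schemes to stay usable.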

Serving Frameworks

Choosing the right serving framework has more impact on cost and throughput than any hardware upgrade. The framework controls batching behavior, memory management, kernel selection, and API compatibility. The next section covers the four main options in 2026.

The Inference Stack in 2026

vLLM

vLLM introduced PagedAttention in 2023 and became the default open-source LLM serving framework. Its V1 engine (default since 2025, now at v0.19.0) adds chunked prefill, improved scheduler logic, and better multi-modal support.

Best for: general LLM serving, teams starting out, OpenAI API drop-in replacement, any workload where ecosystem compatibility matters more than maximum raw throughput.

Strengths: widest model support, active community, OpenAI-compatible REST API, native FP8 quantization on H100.

Trade-offs: not the fastest option for throughput-critical deployments; TensorRT-LLM beats it on raw NVIDIA hardware performance.

For production deployment configuration, see our vLLM production deployment guide.

SGLang

SGLang uses RadixAttention, a variant of prefix caching that automatically identifies and reuses shared prefixes across requests in the batch. For multi-turn conversations and agentic workloads with repeated system prompts, this makes a measurable difference.

Best for: agentic workflows with tool calls and multi-turn context, structured output generation (JSON schema enforcement), workloads with high prefix reuse across requests.

Strengths: RadixAttention prefix caching, constrained generation for structured outputs, lower TTFT on repeated-prefix workloads.

Trade-offs: smaller ecosystem than vLLM; fewer model integrations and less community documentation for edge cases.

NVIDIA Dynamo

NVIDIA Dynamo is designed for disaggregated serving across multiple GPU nodes. It handles KV cache routing between prefill and decode workers, with NIXL (NVIDIA Inference Xfer Library) managing tensor transfer between nodes.

Best for: teams running 70B+ models at high load across multiple GPUs, disaggregated prefill-decode deployments, multi-node inference at scale.

Strengths: purpose-built for multi-node disaggregation, native NIXL integration, supports heterogeneous GPU pools for prefill vs decode.

Trade-offs: operational complexity; overkill for single-node deployments. For the full disaggregated serving architecture, see our prefill-decode disaggregation guide.

TensorRT-LLM

TensorRT-LLM compiles models into optimized CUDA engines using NVIDIA's TensorRT runtime. The compilation step is time-consuming (30-90 minutes per model per GPU type), but the resulting engines deliver the highest raw throughput on NVIDIA hardware.

Best for: latency-critical production deployments on NVIDIA hardware where engineering time is available for engine compilation and maintenance.

Strengths: highest throughput on NVIDIA GPUs, FP4 support on Blackwell (B200), in-flight batching with custom CUDA kernels.

Trade-offs: compile-time overhead, harder to update models quickly, NVIDIA-only, steeper learning curve than vLLM.

For a side-by-side comparison, see our vLLM vs TensorRT-LLM vs SGLang benchmarks.

Framework decision matrix:

| Framework | Best for | Strengths | Trade-offs |
|---|---|---|---|
| vLLM | General LLM serving, OpenAI API compatibility | Wide model support, easy setup | Not the fastest on raw throughput |
| SGLang | Multi-turn, agentic, structured outputs | RadixAttention, prefix reuse | Smaller ecosystem |
| NVIDIA Dynamo | Multi-node disaggregated serving | Scales across GPU pools | Complex ops, overkill for single-node |
| TensorRT-LLM | Latency-critical NVIDIA deployments | Highest raw throughput | Compilation overhead, NVIDIA-only |

GPU Selection for Inference Workloads

GPU choice is the biggest cost lever in inference. Pick the wrong tier and you either overpay for capacity you cannot use (H200 on a 7B model) or cap throughput by running out of memory bandwidth (A100 on a 70B at high concurrency). The gap between optimal and suboptimal GPU selection is typically 2-4x in cost.

The Decision Matrix

Current pricing on Spheron as of 07 Apr 2026:

| GPU | HBM | Memory BW | Best for | On-demand $/hr | Spot $/hr |
|---|---|---|---|---|---|
| A100 80GB SXM4 | 80 GB | 2 TB/s | 7B-34B models, cost-sensitive workloads | $1.64 | $0.45 |
| H100 SXM5 | 80 GB | 3.35 TB/s | 34B-70B high-throughput production | $2.90 | $0.80 |
| H200 SXM5 | 141 GB | 4.8 TB/s | 70B-405B memory-bound models | $4.54 | N/A |
| B200 SXM6 | 192 GB | 8 TB/s | 405B+, next-gen scale | $7.43 | $1.71 |

Pricing fluctuates based on GPU availability. The prices above are based on 07 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

How to read this table:

A100 80GB SXM4 at $1.64/hr is the right choice for 7B-34B models at moderate concurrency, teams prioritizing cost over raw throughput, and batch inference jobs that can run on spot at $0.45/hr. Explore A100 GPU rental → for availability.

H100 SXM5 at $2.90/hr delivers the best price-performance for production serving of 34B-70B models. The 3.35 TB/s memory bandwidth handles high-concurrency decode without becoming the bottleneck. The premium over the A100's $1.64/hr pays off when throughput matters: more memory bandwidth means more concurrent requests served per hour at 70B scale. For most production 70B deployments with FP8 quantization, H100 is the right default. See H100 GPU rental →.

H200 SXM5 at $4.54/hr is justified when model weights plus KV cache exceed 80GB. A 70B model in FP16 takes 140GB; in FP8 it takes 70GB. With FP8 quantization, most 70B models fit on H100. Move to H200 when you are running FP16, have long context windows that balloon KV cache, or are serving 405B-class models. See H200 GPU rental →.

B200 SXM6 at $7.43/hr is the right choice for 405B+ models and next-generation scale. The 192GB HBM and 8 TB/s bandwidth make it uniquely suited for the largest models. At $7.43/hr it is the most expensive option in this table, but for workloads that require 192GB of memory or maximum memory bandwidth, there is no comparable alternative. Spot pricing is available at $1.71/hr for batch workloads. See B200 GPU rental →.

Inference Cost Economics: Token-Level GPU FinOps

The metric that matters for inference budgeting is cost per million tokens (CPM). It converts GPU rental rates and throughput into a single comparable figure.

The formula:

```
CPM = (GPU $/hr) / (tokens/sec × 3,600) × 1,000,000
```

Example: H100 SXM5 at $2.90/hr serving Llama 3 70B FP8 at 1,200 tokens/sec:

```
CPM = $2.90 / (1,200 × 3,600) × 1,000,000
    = $2.90 / 4,320,000 × 1,000,000
    = ~$0.67 per million tokens
```
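
The same formula as a small helper, using the numbers from the example above:

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_sec):
    """CPM: hourly GPU cost divided by tokens generated per hour,
    scaled to one million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# H100 SXM5 at $2.90/hr serving Llama 3 70B FP8 at ~1,200 tokens/sec
cpm = cost_per_million_tokens(2.90, 1200)
print(f"${cpm:.2f} per million tokens")  # ≈ $0.67
```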

For comparison, managed inference APIs for 70B-class models typically charge $0.50-$0.90 per million tokens. Self-hosted on Spheron with an optimized configuration, you can reach $0.20-$0.35 per million tokens, with further reductions from spot instances.

For the full FinOps playbook with case studies and monthly budget optimization, see our AI inference cost economics guide.

The main cost levers, ranked by impact and implementation effort:

| Lever | Typical savings | Complexity |
|---|---|---|
| Continuous batching | 2-3x throughput at same GPU cost | Low (default in vLLM/SGLang) |
| FP8 quantization | 40-50% fewer tokens per dollar | Low (single flag in vLLM) |
| Right-sizing GPU | 30-50% vs defaulting to H100 for small models | Low (one-time benchmark) |
| Spot instances | 40-60% cost reduction | Medium (retry logic required) |
| Prefix caching | 20-40% reduction for shared-prompt workloads | Medium (vLLM config) |

Advanced Techniques

Speculative Decoding

Speculative decoding uses a small draft model to propose multiple tokens simultaneously, which the main model then verifies in a single forward pass. When the draft model's predictions match, you effectively generate 3-5 tokens per main model forward pass instead of one.

The catch: it only helps when the acceptance rate is above 80%. Below that, you are spending compute on rejected proposals and gaining nothing. Acceptance rates are highest for predictable, repetitive outputs: chat completions, code generation with clear patterns, document filling. They are lower for creative tasks with high output entropy.
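
A simplified model shows why the acceptance rate matters so much. Assuming each drafted token is accepted independently with a fixed probability (a simplification; real acceptance is correlated and position-dependent), the expected tokens emitted per main-model pass is a geometric series:

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens per main-model forward pass under an independence
    assumption: drafted tokens are accepted left to right until the first
    rejection, and the main model always contributes one token itself."""
    p, k = accept_rate, draft_len
    return sum(p ** i for i in range(k + 1))  # 1 + p + p^2 + ... + p^k

for p in (0.6, 0.8, 0.9):
    print(p, round(expected_tokens_per_pass(p, 4), 2))
    # roughly 2.3, 3.4, and 4.1 tokens per pass with a 4-token draft
```

Below ~80% acceptance, the expected speedup shrinks toward the point where it no longer covers the draft model's own compute, which is the break-even described above.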

Disaggregated Prefill-Decode Serving

Prefill (processing the input prompt) is compute-bound. Decode (generating tokens) is memory-bandwidth-bound. Running both phases on the same GPU means each phase interferes with the other.

Disaggregated serving splits them onto separate GPU pools: compute-dense GPUs handle prefill, memory-bandwidth-dense GPUs handle decode. For long-context workloads with 8K+ token prompts and high concurrency, this can deliver 60-75% throughput improvement over colocated serving.

This is an advanced inference engineering technique. The full setup guide is at prefill-decode disaggregation on GPU cloud.

Multi-Token Prediction

Models trained with multi-token prediction (MTP) heads can generate 2-4 tokens per forward pass without speculative decoding overhead. DeepSeek V3 includes MTP heads; other models trained with multi-token prediction objectives can also benefit. When running these models on compatible serving stacks, enable MTP to reduce wall-clock latency at no additional compute cost.

Tensor Parallelism and Pipeline Parallelism

Tensor parallelism (TP) splits individual model layers across multiple GPUs within a node. It is required for 70B models on H100 (70B FP16 = 140GB, does not fit in 80GB). Use TP=4 for 70B FP8, TP=8 for 70B FP16, TP=8 for 405B FP8.

Pipeline parallelism (PP) chains model stages across nodes. It adds inter-node communication latency to each forward pass but enables 405B+ models on GPU clusters with modest per-node counts. Use PP only when the model does not fit on a single node with tensor parallelism.

For H100 8-GPU nodes: start with TP=8 for 70B FP16 and TP=4 for 70B FP8. Verify near-linear throughput scaling before adding more nodes.
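
The fit math behind those TP recommendations, as a quick sketch (weights only; KV cache, activations, and framework overhead come on top of this):

```python
def weight_gb_per_gpu(params_billions, bytes_per_param, tp):
    """Model weight memory per GPU under tensor parallelism: weights
    are sharded roughly evenly across the TP group."""
    return params_billions * bytes_per_param / tp

print(weight_gb_per_gpu(70, 2, 1))  # 140.0 GB FP16: does not fit on one 80 GB GPU
print(weight_gb_per_gpu(70, 2, 8))  # 17.5 GB/GPU at TP=8, room left for KV cache
print(weight_gb_per_gpu(70, 1, 4))  # 17.5 GB/GPU for FP8 at TP=4
```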

Building an Inference Engineering Practice on GPU Cloud

Here is a practical first-week setup for inference engineering on Spheron, structured as a sequence of progressively more optimized deployments.

Step 1: Baseline measurement

Rent two H100 SXM5s (70B FP16 requires ~140 GB VRAM, which exceeds a single 80 GB GPU). Run vLLM with default settings against your target model. Measure throughput (tokens/sec) and TTFT at 10, 50, and 100 concurrent users using benchmark_serving.py. This is your baseline. Every subsequent change gets measured against it.

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```

Step 2: Apply FP8 and tune KV cache

Enable FP8 quantization and raise gpu-memory-utilization to 0.95. Measure again at the same concurrency levels.

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --dtype float16 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192
```

FP8 frees up VRAM for more KV cache, which lets vLLM hold more concurrent requests in flight. At 100 concurrent users, this step typically increases throughput 60-90% on 70B models.

Step 3: Scale with tensor parallelism

Move to a multi-GPU configuration with TP=4 or TP=8. Validate that throughput scales near-linearly. If it does not, the bottleneck has shifted to inter-GPU communication or scheduler overhead, not GPU compute.

Step 4: Optimize cost

Switch batch inference jobs (embeddings, async summarization, document processing) to spot instances. Add prefix caching for any workload where many users share the same system prompt. Calculate your actual CPM and compare against your per-user budget.
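
The "retry logic" that spot instances require can be as simple as a wrapper like this. `PreemptedError` and the callable interface are hypothetical, not a Spheron API; real batch jobs should also checkpoint progress so a retry resumes rather than restarts:

```python
import time

class PreemptedError(Exception):
    """Raised when the spot instance backing a job is reclaimed."""

def run_with_spot_retry(job, max_retries=5, base_delay=30):
    """Retry a batch job across spot preemptions with exponential backoff.
    `job` is any callable that raises PreemptedError on preemption and
    resumes from its last checkpoint when called again."""
    for attempt in range(max_retries):
        try:
            return job()
        except PreemptedError:
            # back off, then request a fresh spot instance and resume
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("job failed after repeated spot preemptions")
```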

Spheron's GPU fleet includes H100, H200, B200, and A100 across data center partners globally, with per-second billing, bare metal for custom serving stacks, and pre-configured vLLM templates. For deployment configuration details, see the Spheron documentation.


Inference engineering teams need flexible GPU access for benchmarking new models, testing quantization configs, and scaling production serving without being locked into reserved instances. Spheron provides exactly that: on-demand H100, H200, B200, and A100 GPUs, billed per second.

Rent H100 → | Rent H200 → | Rent B200 → | View all pricing →

Get started on Spheron →
