Inference now accounts for the majority of AI compute in production systems. Training ends when the model ships. Inference runs every hour, every day, as long as users are hitting your API. The discipline that makes that sustainable at scale is inference engineering.
Inference engineers own four things: hardware selection, serving framework configuration, cost-per-token optimization, and reliability SLAs. If you are deploying a model to production and care about whether it serves 10 users or 10,000 without tripling your GPU bill, inference engineering is the work.
What Is Inference Engineering
Inference engineering is the practice of serving trained machine learning models to users at scale. It is not about building models. It is about making them fast, affordable, and reliable after they are built.
The discipline emerged as LLMs went to production. Served naively, a 70B parameter model processes one request at a time. An inference engineer gets that same model to serve hundreds of concurrent users at a fraction of the cost per token, using batching, quantization, caching, and hardware selection.
What inference engineers own:
- Hardware selection: which GPU, how many, which memory tier
- Serving framework configuration: vLLM, SGLang, TensorRT-LLM, or NVIDIA Dynamo
- KV cache management: memory allocation, prefix caching, quantization
- Cost-per-token optimization: throughput vs latency trade-offs, spot vs on-demand, right-sizing
- Reliability SLAs: time-to-first-token targets, inter-token latency budgets, uptime guarantees
Inference Engineering vs ML Engineering vs MLOps
These three roles share territory but own different things. The confusion is common because small teams often combine them.
| Discipline | Owns | Does Not Own |
|---|---|---|
| ML Engineering | Model training, datasets, evaluation, fine-tuning pipelines | Serving infrastructure, hardware costs, latency SLAs |
| MLOps | CI/CD for ML, experiment tracking, model registry, pipeline orchestration | Serving optimization, GPU memory management, cost-per-token |
| Inference Engineering | Serving frameworks, GPU selection, KV cache tuning, throughput/latency trade-offs, inference FinOps | Model architecture, training code, data pipelines |
The split happened because serving LLMs at scale requires depth that generalist ML and DevOps roles do not cover. You need to understand GPU memory hierarchies, kernel-level batching behavior, quantization accuracy trade-offs, and cost-per-token math simultaneously. That depth is its own specialty.
Core Skills Every Inference Engineer Needs
Batching and Scheduling
Static batching groups requests into fixed batches and waits for all of them to finish before accepting new ones. If one request in a batch generates 2,000 tokens and another generates 20, every GPU cycle spent waiting for the long request is wasted capacity.
Continuous batching fixes this by operating at the iteration level. At each decode step, finished requests are immediately replaced by new ones from the queue. The GPU stays full. Throughput goes up 3-5x compared to static batching for typical LLM workloads.
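The gap can be illustrated with a toy scheduling model (illustrative numbers, not a benchmark): static batching pays for the longest request in every batch, while continuous batching refills a slot the moment its request finishes.

```python
import heapq

def static_batch_steps(lengths, batch_size):
    # Every batch runs in lockstep until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    # A finished slot immediately picks up the next queued request.
    slots = [0] * batch_size  # decode step at which each slot frees up
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)

# Pathological mix: one 2,000-token request per batch of mostly short ones.
lengths = [2000, 20, 20, 20] * 8
print(static_batch_steps(lengths, batch_size=4))      # 16000 decode steps
print(continuous_batch_steps(lengths, batch_size=4))  # far fewer: slots stay busy
```

The toy model ignores prefill and per-iteration scheduler overhead, but it captures why the win grows with output-length variance: static batching wastes every slot that finishes early.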
For a deep technical breakdown of how continuous batching, PagedAttention, and chunked prefill work together, see our LLM serving optimization guide. If you are new to the inference engineering discipline, read that post alongside this one for the full picture before diving into framework configuration.
KV Cache Management
The KV cache stores key-value attention tensors for every token the model has processed. It is what allows the model to generate subsequent tokens without reprocessing the entire context from scratch.
The problem is size. For a 70B model with FP16 weights, the KV cache for a single 4,096-token request takes roughly 4GB of VRAM. At 50 concurrent requests, that is 200GB, more than any single GPU holds. KV cache management is how you fit more concurrent requests into available VRAM.
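The arithmetic behind these sizes is simple, but the exact figure depends heavily on the model's attention layout. The parameters below are illustrative for a full multi-head 70B-class configuration; a GQA model with 8 KV heads needs roughly 8x less.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes):
    # Two tensors (K and V) per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Illustrative full-MHA layout at FP16 (2 bytes per value).
per_request = kv_cache_bytes(tokens=4096, layers=80, kv_heads=64,
                             head_dim=128, dtype_bytes=2)
print(per_request / 1e9)  # ~10.7 GB for this configuration
```

Plugging in your model's actual layer count, KV head count, and head dimension tells you exactly how many concurrent requests fit in the VRAM left over after weights.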
Key techniques:
- PagedAttention: allocates KV cache in fixed-size blocks rather than contiguous chunks, eliminating fragmentation and wasted pre-allocation
- Prefix caching: reuses KV cache for identical system prompts across requests, reducing compute for shared-prompt workloads by 20-40%
- KV cache quantization: storing KV tensors in INT8 or INT4 instead of FP16 cuts memory requirements by 2-4x at modest accuracy cost
For a deeper look at each of these techniques and how to configure them in vLLM, see our KV cache optimization guide.
Quantization
Quantization reduces the numerical precision of model weights and activations, which shrinks memory footprint and increases throughput.
The practical progression:
| Precision | Memory vs FP16 | Throughput Gain | Accuracy Impact |
|---|---|---|---|
| FP16 | 1x (baseline) | Baseline | None |
| FP8 | 0.5x | 1.5-2x | Under 1% on most models |
| INT4 AWQ | 0.25x | 2-3x | 1-3% on sensitive tasks |
FP8 is the default for H100 and B200 hardware. It halves memory bandwidth requirements with negligible accuracy loss on Llama 3, Qwen, and DeepSeek class models. INT4 AWQ works well for A100-class GPUs where VRAM is the binding constraint.
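The memory column in the table falls out of simple arithmetic: parameter count times bytes per parameter. A quick sketch (weights only, ignoring KV cache and activation overhead):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    # params (billions) x bits per weight / 8 bits per byte = GB of weights
    return params_billion * bits_per_weight / 8

for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(label, weight_memory_gb(70, bits), "GB")
# FP16 140.0 GB, FP8 70.0 GB, INT4 35.0 GB
```

This is why FP8 turns a two-GPU 70B deployment into a single-GPU one, and why INT4 makes 70B feasible on an A100 80GB.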
Serving Frameworks
Choosing the right serving framework has more impact on cost and throughput than any hardware upgrade. The framework controls batching behavior, memory management, kernel selection, and API compatibility. The next section covers the four main options in 2026.
The Inference Stack in 2026
vLLM
vLLM introduced PagedAttention in 2023 and became the default open-source LLM serving framework. Its V1 engine (default since 2025, now at v0.19.0) adds chunked prefill, improved scheduler logic, and better multi-modal support.
Best for: general LLM serving, teams starting out, OpenAI API drop-in replacement, any workload where ecosystem compatibility matters more than maximum raw throughput.
Strengths: widest model support, active community, OpenAI-compatible REST API, native FP8 quantization on H100.
Trade-offs: not the fastest option for throughput-critical deployments; TensorRT-LLM beats it on raw NVIDIA hardware performance.
For production deployment configuration, see our vLLM production deployment guide.
SGLang
SGLang uses RadixAttention, a variant of prefix caching that automatically identifies and reuses shared prefixes across requests in the batch. For multi-turn conversations and agentic workloads with repeated system prompts, this makes a measurable difference.
Best for: agentic workflows with tool calls and multi-turn context, structured output generation (JSON schema enforcement), workloads with high prefix reuse across requests.
Strengths: RadixAttention prefix caching, constrained generation for structured outputs, lower TTFT on repeated-prefix workloads.
Trade-offs: smaller ecosystem than vLLM; fewer model integrations and less community documentation for edge cases.
NVIDIA Dynamo
NVIDIA Dynamo is designed for disaggregated serving across multiple GPU nodes. It handles KV cache routing between prefill and decode workers, with NIXL (NVIDIA Inference Xfer Library) managing tensor transfer between nodes.
Best for: teams running 70B+ models at high load across multiple GPUs, disaggregated prefill-decode deployments, multi-node inference at scale.
Strengths: purpose-built for multi-node disaggregation, native NIXL integration, supports heterogeneous GPU pools for prefill vs decode.
Trade-offs: operational complexity; overkill for single-node deployments. For the full disaggregated serving architecture, see our prefill-decode disaggregation guide.
TensorRT-LLM
TensorRT-LLM compiles models into optimized CUDA engines using NVIDIA's TensorRT runtime. The compilation step is time-consuming (30-90 minutes per model per GPU type), but the resulting engines deliver the highest raw throughput on NVIDIA hardware.
Best for: latency-critical production deployments on NVIDIA hardware where engineering time is available for engine compilation and maintenance.
Strengths: highest throughput on NVIDIA GPUs, FP4 support on Blackwell (B200), in-flight batching with custom CUDA kernels.
Trade-offs: compile-time overhead, harder to update models quickly, NVIDIA-only, steeper learning curve than vLLM.
For a side-by-side comparison, see our vLLM vs TensorRT-LLM vs SGLang benchmarks.
Framework decision matrix:
| Framework | Best For | Strengths | Trade-offs |
|---|---|---|---|
| vLLM | General LLM serving, OpenAI API compatibility | Wide model support, easy setup | Not the fastest on raw throughput |
| SGLang | Multi-turn, agentic, structured outputs | RadixAttention, prefix reuse | Smaller ecosystem |
| NVIDIA Dynamo | Multi-node disaggregated serving | Scales across GPU pools | Complex ops, overkill for single-node |
| TensorRT-LLM | Latency-critical NVIDIA deployments | Highest raw throughput | Compilation overhead, NVIDIA-only |
GPU Selection for Inference Workloads
GPU choice is the biggest cost lever in inference. Pick the wrong tier and you either overpay for capacity you cannot use (H200 on a 7B model) or cap throughput by running out of memory bandwidth (A100 on a 70B at high concurrency). The gap between optimal and suboptimal GPU selection is typically 2-4x in cost.
The Decision Matrix
Current pricing on Spheron as of 07 Apr 2026:
| GPU | HBM | Memory BW | Best For | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|---|
| A100 80GB SXM4 | 80 GB | 2 TB/s | 7B-34B models, cost-sensitive workloads | $1.64 | $0.45 |
| H100 SXM5 | 80 GB | 3.35 TB/s | 34B-70B high-throughput production | $2.90 | $0.80 |
| H200 SXM5 | 141 GB | 4.8 TB/s | 70B-405B memory-bound models | $4.54 | N/A |
| B200 SXM6 | 192 GB | 8 TB/s | 405B+, next-gen scale | $7.43 | $1.71 |
Pricing fluctuates based on GPU availability. The prices above are based on 07 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
How to read this table:
A100 80GB SXM4 at $1.64/hr is the right choice for 7B-34B models at moderate concurrency, teams prioritizing cost over raw throughput, and batch inference jobs that can run on spot at $0.45/hr. Explore A100 GPU rental → for availability.
H100 SXM5 at $2.90/hr delivers the best price-performance for production serving of 34B-70B models. The 3.35 TB/s memory bandwidth handles high-concurrency decode without becoming the bottleneck. At $2.90/hr versus the A100's $1.64/hr, the H100 premium pays off when throughput matters: more memory bandwidth means more concurrent requests served per hour at 70B scale. For most production 70B deployments with FP8 quantization, H100 is the right default. See H100 GPU rental →.
H200 SXM5 at $4.54/hr is justified when model weights plus KV cache exceed 80GB. A 70B model in FP16 takes 140GB; in FP8 it takes 70GB. With FP8 quantization, most 70B models fit on H100. Move to H200 when you are running FP16, have long context windows that balloon KV cache, or are serving 405B-class models. See H200 GPU rental →.
B200 SXM6 at $7.43/hr is the right choice for 405B+ models and next-generation scale. The 192GB HBM and 8 TB/s bandwidth make it uniquely suited for the largest models. At $7.43/hr it is the most expensive option in this table, but for workloads that require 192GB of memory or maximum memory bandwidth, there is no comparable alternative. Spot pricing is available at $1.71/hr for batch workloads. See B200 GPU rental →.
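As a back-of-the-envelope aid, the table above can be turned into a single-GPU fit check. This is a sketch using the HBM sizes and on-demand prices listed in the table; `cheapest_fit` is a hypothetical helper, and real sizing must also consider sharding across multiple GPUs with tensor parallelism.

```python
# (name, HBM in GB, on-demand $/hr) copied from the pricing table above.
GPUS = [
    ("A100 80GB", 80, 1.64),
    ("H100", 80, 2.90),
    ("H200", 141, 4.54),
    ("B200", 192, 7.43),
]

def cheapest_fit(weights_gb, kv_cache_gb, headroom=0.9):
    # Pick the cheapest single GPU whose HBM covers weights + KV cache,
    # keeping ~10% free for activations and framework overhead.
    need = weights_gb + kv_cache_gb
    for name, hbm, price in sorted(GPUS, key=lambda g: g[2]):
        if hbm * headroom >= need:
            return name, price
    return None  # does not fit on one GPU: shard with tensor parallelism

print(cheapest_fit(35, 20))  # INT4 70B + modest KV cache -> ('A100 80GB', 1.64)
print(cheapest_fit(70, 40))  # FP8 70B at high concurrency -> ('H200', 4.54)
```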
Inference Cost Economics: Token-Level GPU FinOps
The metric that matters for inference budgeting is cost per million tokens (CPM). It converts GPU rental rates and throughput into a single comparable figure.
The formula:
```
CPM = (GPU $/hr) / (tokens/sec × 3600) × 1,000,000
```

Example: H100 SXM5 at $2.90/hr serving Llama 3 70B FP8 at 1,200 tokens/sec:

```
CPM = $2.90 / (1,200 × 3,600) × 1,000,000
    = $2.90 / 4,320,000 × 1,000,000
    ≈ $0.67 per million tokens
```

For comparison, managed inference APIs for 70B-class models typically charge $0.50-$0.90 per million tokens. Self-hosted on Spheron with an optimized configuration, you can reach $0.20-$0.35 per million tokens, with further reductions from spot instances.
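The same arithmetic as a small helper, useful for comparing GPU and throughput configurations quickly:

```python
def cost_per_million_tokens(gpu_usd_per_hr, tokens_per_sec):
    # Tokens generated in one billed hour, converted to $/1M tokens.
    tokens_per_hr = tokens_per_sec * 3600
    return gpu_usd_per_hr / tokens_per_hr * 1_000_000

# H100 at $2.90/hr sustaining 1,200 tok/s on Llama 3 70B FP8:
print(round(cost_per_million_tokens(2.90, 1200), 2))  # 0.67
```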
For the full FinOps playbook with case studies and monthly budget optimization, see our AI inference cost economics guide.
The main cost levers, ranked by impact and implementation effort:
| Lever | Typical Savings | Complexity |
|---|---|---|
| Continuous batching | 2-3x throughput at same GPU cost | Low (default in vLLM/SGLang) |
| FP8 quantization | 40-50% fewer tokens per dollar | Low (single flag in vLLM) |
| Right-sizing GPU | 30-50% vs defaulting to H100 for small models | Low (one-time benchmark) |
| Spot instances | 40-60% cost reduction | Medium (retry logic required) |
| Prefix caching | 20-40% reduction for shared-prompt workloads | Medium (vLLM config) |
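The "retry logic required" caveat for spot instances can be as simple as resubmitting a preempted batch job and falling back to on-demand after a few attempts. A minimal sketch; `submit_job` is a hypothetical callable standing in for your job launcher.

```python
import time

def run_with_spot_fallback(submit_job, max_retries=3, backoff_s=30):
    """Retry a preempted batch job on spot, then fall back to on-demand.

    `submit_job(use_spot=...)` is a hypothetical callable assumed to raise
    InterruptedError when the spot instance is reclaimed mid-run.
    """
    for attempt in range(max_retries):
        try:
            return submit_job(use_spot=True)
        except InterruptedError:
            time.sleep(backoff_s * (attempt + 1))  # linear backoff
    # Spot capacity is unstable right now; pay on-demand to finish.
    return submit_job(use_spot=False)
```

For this pattern to be safe, the job itself must be idempotent or checkpointed, so a reclaimed run can resume rather than restart from scratch.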
Advanced Techniques
Speculative Decoding
Speculative decoding uses a small draft model to cheaply propose a short run of tokens, which the main model then verifies in a single forward pass. When the draft model's predictions match, you effectively generate 3-5 tokens per main model forward pass instead of one.
The catch: it only helps when the acceptance rate is above 80%. Below that, you are spending compute on rejected proposals and gaining nothing. Acceptance rates are highest for predictable, repetitive outputs: chat completions, code generation with clear patterns, document filling. They are lower for creative tasks with high output entropy.
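The acceptance-rate threshold can be reasoned about with a simplified model: if each drafted token is accepted independently with probability `alpha`, the expected tokens emitted per main-model forward pass has a closed form. A sketch under that i.i.d. assumption (real acceptance is correlated across positions, so treat this as a rough guide):

```python
def expected_tokens_per_step(alpha, draft_len):
    # The draft runs until its first rejected token; the verify pass always
    # yields at least one token. Geometric-series closed form:
    # E = (1 - alpha^(k+1)) / (1 - alpha) for draft length k.
    if alpha == 1.0:
        return draft_len + 1
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)

print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36 tokens per step
print(round(expected_tokens_per_step(0.5, 4), 2))  # 1.94: barely worth it
```

At 80% acceptance with a 4-token draft you emit ~3.4 tokens per expensive forward pass; at 50% the gain collapses toward 1, and the draft model's own compute eats the remainder.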
Disaggregated Prefill-Decode Serving
Prefill (processing the input prompt) is compute-bound. Decode (generating tokens) is memory-bandwidth-bound. Running both phases on the same GPU means each phase interferes with the other.
Disaggregated serving splits them onto separate GPU pools: compute-dense GPUs handle prefill, memory-bandwidth-dense GPUs handle decode. For long-context workloads with 8K+ token prompts and high concurrency, this can deliver 60-75% throughput improvement over colocated serving.
This is an advanced inference engineering technique. The full setup guide is at prefill-decode disaggregation on GPU cloud.
Multi-Token Prediction
Models trained with multi-token prediction (MTP) heads can generate 2-4 tokens per forward pass without speculative decoding overhead. DeepSeek V3 includes MTP heads; other models trained with multi-token prediction objectives can also benefit. When running these models on compatible serving stacks, enable MTP to reduce wall-clock latency at no additional compute cost.
Tensor Parallelism and Pipeline Parallelism
Tensor parallelism (TP) splits individual model layers across multiple GPUs within a node. It is required for 70B models on H100 (70B FP16 = 140GB, does not fit in 80GB). Use TP=4 for 70B FP8, TP=8 for 70B FP16, TP=8 for 405B FP8.
Pipeline parallelism (PP) chains model stages across nodes. It adds inter-node communication latency to each forward pass but enables 405B+ models on GPU clusters with modest per-node counts. Use PP only when the model does not fit on a single node with tensor parallelism.
For H100 8-GPU nodes: start with TP=8 for 70B FP16 and TP=4 for 70B FP8. Verify near-linear throughput scaling before adding more nodes.
Building an Inference Engineering Practice on GPU Cloud
Here is a practical first-week setup for inference engineering on Spheron, structured as a sequence of progressively more optimized deployments.
Step 1: Baseline measurement
Rent two H100 SXM5s (70B FP16 requires ~140 GB VRAM, which exceeds a single 80 GB GPU). Run vLLM with default settings against your target model. Measure throughput (tokens/sec) and TTFT at 10, 50, and 100 concurrent users using benchmark_serving.py. This is your baseline. Every subsequent change gets measured against it.
```
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```

Step 2: Apply FP8 and tune KV cache
Enable FP8 quantization and raise gpu-memory-utilization to 0.95. Measure again at the same concurrency levels.
```
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --dtype float16 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192
```

FP8 frees up VRAM for more KV cache, which lets vLLM hold more concurrent requests in flight. At 100 concurrent users, this step typically increases throughput 60-90% on 70B models.
Step 3: Scale with tensor parallelism
Move to a multi-GPU configuration with TP=4 or TP=8. Validate that throughput scales near-linearly. If it does not, the bottleneck has shifted to inter-GPU communication or scheduler overhead, not GPU compute.
Step 4: Optimize cost
Switch batch inference jobs (embeddings, async summarization, document processing) to spot instances. Add prefix caching for any workload where many users share the same system prompt. Calculate your actual CPM and compare against your per-user budget.
Spheron's GPU fleet includes H100, H200, B200, and A100 across data center partners globally. Inference engineering teams need flexible GPU access for benchmarking new models, testing quantization configs, and scaling production serving without being locked into reserved instances. Spheron provides on-demand access with per-second billing, bare metal options for custom serving stacks, and pre-configured vLLM templates. For deployment configuration details, see the Spheron documentation.
Rent H100 → | Rent H200 → | Rent B200 → | View all pricing →
