Tutorial

vLLM Model Runner V2 on GPU Cloud: Deploy MRV2 for Faster LLM Inference (2026)

Written by Mitrasish, Co-founder · Apr 13, 2026
vLLM · LLM Inference · GPU Cloud · H100 · MoE Models · Speculative Decoding · Inference Optimization · Bare Metal

vLLM shipped a ground-up rewrite of its model execution layer. If you haven't enabled MRV2 yet, you may be leaving significant throughput on the table. If you haven't deployed vLLM before, start with the vLLM production deployment guide first - this post assumes you have a working setup. For framework selection before committing to vLLM, the inference framework benchmark has throughput and latency numbers across vLLM, TensorRT-LLM, and SGLang on the same hardware.

This guide covers what changed in MRV2, what the numbers actually mean, and how to deploy it on Spheron H100 and A100 instances with a working configuration.

What Changed in vLLM Model Runner V2

The Problem with the Legacy Model Runner

The original vLLM model runner had a structural problem: Python-bound execution. Every decode step required a round trip through the Python scheduler before GPU work could begin. The CPU had to finish scheduling the next batch, allocate KV cache blocks, and construct attention metadata - then hand off to the GPU. During that handoff, the GPU was idle.

The KV cache allocator made this worse. It reserved memory in fixed-size blocks (default 16 tokens per block). If a sequence used 14 tokens in a block, 2 token slots were wasted. At scale, with hundreds of concurrent sequences at different lengths, fragmentation ate 20-35% of your available KV cache. That translated directly to fewer concurrent sequences, lower batch sizes, and lower GPU utilization.
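To make that concrete, here is a toy calculation (the sequence lengths are made up for illustration) of how much a fixed-block allocator strands when concurrent sequences are short and mixed-length:

```python
# Illustrative sketch: fixed-block KV cache fragmentation.
# Sequence lengths below are hypothetical examples, not measurements.
BLOCK_SIZE = 16  # tokens per KV cache block (legacy vLLM default)

def wasted_slots(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Token slots reserved but unused in the sequence's last block."""
    remainder = seq_len % block_size
    return 0 if remainder == 0 else block_size - remainder

# A mix of short sequence lengths, as under bursty concurrent chat traffic
seq_lens = [14, 30, 9, 45, 22, 7]
total_reserved = sum(-(-s // BLOCK_SIZE) * BLOCK_SIZE for s in seq_lens)
total_wasted = sum(wasted_slots(s) for s in seq_lens)
print(f"waste: {total_wasted}/{total_reserved} slots "
      f"({100 * total_wasted / total_reserved:.1f}%)")
# waste: 33/160 slots (20.6%)
```

The relative waste shrinks as sequences get longer, which is why the hit is worst for workloads dominated by many short, varied-length sequences.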

GPU-Native Triton Kernels

MRV2 replaces Python-controlled kernel dispatch with Triton kernels compiled to PTX directly. Triton is an open-source Python-to-GPU compiler (created by Philippe Tillet, open-sourced by OpenAI) that generates code targeting the hardware's compute units rather than going through the full CUDA driver stack.

What this means in practice: attention computation, KV cache reads, and output projection all happen without CPU-side orchestration per decode step. The kernel receives a pre-compiled execution graph and runs through it autonomously. On bare metal, where Triton's PTX lands directly on the GPU without a hypervisor layer, this is where the throughput gains are most predictable. Virtualized GPU environments (vGPU, GPU partitioning) introduce latency between kernel submission and actual execution that partly offsets MRV2's advantage.

Async Scheduling Architecture

MRV2 decouples CPU scheduling from GPU execution. While the GPU is running the decode step for batch N, the CPU is simultaneously preparing the schedule for batch N+1: allocating KV blocks, computing attention masks, and building execution metadata. By the time the GPU finishes batch N, the inputs for batch N+1 are ready without a wait cycle.

In the legacy runner, this was sequential: GPU finishes, CPU schedules, GPU starts. MRV2 makes it a pipeline: CPU and GPU work overlaps. For latency-sensitive workloads at moderate concurrency, this removes the scheduler as a bottleneck entirely. For background on the underlying techniques, see the LLM serving optimization guide covering continuous batching and PagedAttention.
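Conceptually, the pipeline looks like this sketch (plain Python threads standing in for the scheduler and the GPU; none of these names are vLLM internals):

```python
# Conceptual sketch of MRV2-style pipelined scheduling: the CPU side
# prepares batch N+1 while the "GPU" side is still executing batch N.
import queue
import threading
import time

prepared: queue.Queue = queue.Queue(maxsize=1)  # CPU-to-GPU hand-off buffer

def schedule(num_batches: int) -> None:
    """CPU side: build KV blocks and metadata for the next batch ahead of time."""
    for n in range(num_batches):
        metadata = {"batch": n, "kv_blocks": list(range(4))}  # stand-in work
        prepared.put(metadata)  # blocks only if the GPU falls >1 batch behind

def execute(num_batches: int, results: list) -> None:
    """'GPU' side: consume pre-built batches with no scheduling stall between steps."""
    for _ in range(num_batches):
        metadata = prepared.get()  # inputs were prepared during the previous step
        time.sleep(0.001)          # stand-in for the decode kernel
        results.append(metadata["batch"])

results: list = []
cpu = threading.Thread(target=schedule, args=(8,))
gpu = threading.Thread(target=execute, args=(8, results))
cpu.start(); gpu.start(); cpu.join(); gpu.join()
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The bounded queue is the key design choice: the scheduler can run exactly one batch ahead, so metadata is always fresh but the GPU never waits.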

When Did This Ship?

MRV2 was formally announced in the official vLLM blog in March 2026, building on earlier V1 engine work. Note that vLLM v0.7.0 (released January 2025) introduced the V1 Engine Alpha, which was a separate predecessor effort - MRV2 is a distinct, later rewrite. As of the March 2026 announcement, MRV2 is available in vLLM v0.17+ but is not the default runner. vLLM plans to make it the default in a future release. On all current versions, enable MRV2 explicitly with VLLM_USE_V2_MODEL_RUNNER=1.

To confirm your version has it:

bash
python -c "import vllm; print(vllm.__version__)"

To confirm MRV2 is active at runtime, grep the startup logs:

bash
docker logs <container_id> 2>&1 | grep "ModelRunner"
# Should output: Using ModelRunnerV2

If you see Using ModelRunnerV1, you're on an older vLLM build.

MRV2 Performance Benchmarks: What the Numbers Mean

The headline numbers from the official vLLM MRV2 blog (March 2026): 56% higher output token throughput and 6.3% lower TPOT (time per output token) versus the MRV1 baseline.

The benchmark conditions from the official source: the 56% throughput gain was measured on Qwen3-0.6B with a single GB200, and the 6.3% TPOT improvement was measured on GLM-4.7-FP8 on 4x GB200. These are the numbers vLLM published; your results on H100 hardware will differ by model and workload.

| Metric | MRV1 (Legacy) | MRV2 | Delta | Hardware |
|---|---|---|---|---|
| Output throughput (tok/s) | baseline | +56% vs baseline | +56% | Qwen3-0.6B, 1x GB200 |
| TPOT improvement | baseline | -6.3% vs baseline | -6.3% | GLM-4.7-FP8, 4x GB200 |
| GPU utilization (%) | ~71% | ~89% | +18pp | (estimated, varies by model) |
| KV cache waste | 20-35% | <12% | -23pp | (allocator improvement) |

TPOT is time per output token: how long each generated token takes on average after the first token arrives. This is what determines the perceived "speed" of a streaming response from the user's perspective. The TPOT improvement is noticeable in real-time chat applications even at moderate batch sizes.

Note that bare metal is where these gains replicate most cleanly. On virtualized GPU instances, the Triton kernel gains are partially offset by the hypervisor layer. If you're on Spheron bare-metal H100 instances, expect meaningful throughput gains, with the exact percentage depending on your model size, batch size, and sequence lengths. Smaller models (7B-13B) may see larger percentage gains because they were more CPU-bound in MRV1.

Hybrid Memory Allocator: From 35% Waste to Under 12%

The legacy allocator worked like a fixed-block filesystem: all KV cache blocks were the same size (16 tokens by default, configurable via --block-size). When a sequence finished before filling a block, the remaining slots were reserved but unused until that block was freed. At high concurrency with varied sequence lengths, this fragmentation compounded.

MRV2's hybrid allocator is designed to handle models with mixed attention layer types in a single model (for example, full attention layers and sliding window attention layers coexisting, as in Gemma-2 or Ministral architectures). It groups layers by attention type and shares block IDs within each group, which eliminates the memory waste that occurs when fixed blocks span layers with different attention requirements. For a deeper look at KV cache memory management techniques, see the KV cache optimization guide.

In practice, KV cache memory waste drops from the 20-35% range (typical for mixed-length workloads under the legacy allocator) to under 12%. For an 80GB H100 with 72GB available for model + KV cache after system overhead, that's roughly 12-16GB of additional KV cache space. At typical KV cache sizes for Llama 70B FP8 (around 1MB per token per layer pair), that translates to supporting significantly more concurrent sequences at the same memory budget.
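A back-of-envelope sketch of what the waste reduction buys (the cache budget and per-sequence footprint below are assumed round numbers for illustration, not measured values):

```python
# Back-of-envelope (assumed figures): how many more concurrent sequences
# a lower KV-cache waste rate buys at a fixed memory budget.
def max_concurrent_seqs(cache_gb: float, waste_frac: float, gb_per_seq: float) -> int:
    """Sequences that fit after subtracting fragmentation waste."""
    usable_gb = cache_gb * (1 - waste_frac)
    return int(usable_gb / gb_per_seq)

CACHE_GB = 30.0   # hypothetical KV cache budget after weights + overhead
GB_PER_SEQ = 0.5  # hypothetical per-sequence KV footprint at average length

legacy = max_concurrent_seqs(CACHE_GB, 0.28, GB_PER_SEQ)  # mid of 20-35% band
mrv2 = max_concurrent_seqs(CACHE_GB, 0.12, GB_PER_SEQ)    # <12% waste
print(legacy, mrv2)  # 43 52
```

Same GPU, same budget: roughly 20% more in-flight sequences from the allocator change alone under these assumptions.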

To monitor KV cache utilization via Prometheus:

bash
# Scrape the vLLM /metrics endpoint and watch this gauge
curl http://localhost:8000/metrics | grep kv_cache_usage_perc
# vllm:kv_cache_usage_perc{...} 0.74

A value above 0.90 under normal load means you're KV-cache-limited. Either reduce --max-num-seqs or increase --gpu-memory-utilization slightly if VRAM headroom allows.

The --block-size flag is still tunable in MRV2 if you want to override the hybrid allocator's choice. For very long-context workloads (32K+ tokens), setting --block-size 32 or --block-size 64 reduces per-sequence metadata overhead.
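The trade-off is simple ceiling arithmetic: larger blocks mean fewer block-table entries per sequence. The 32K context below matches the deployments in this guide; the per-entry metadata cost itself varies by vLLM version.

```python
# Block-table entries per sequence at 32K context, for candidate
# --block-size values (pure arithmetic, nothing vLLM-specific).
def blocks_needed(context_len: int, block_size: int) -> int:
    """Ceiling division: blocks required to hold context_len tokens."""
    return -(-context_len // block_size)

for bs in (16, 32, 64):
    print(bs, blocks_needed(32768, bs))
# 16 2048
# 32 1024
# 64 512
```

Halving the metadata comes at the cost of coarser allocation granularity, which is why larger blocks only pay off when sequences are long enough to fill them.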

Step-by-Step MRV2 Deployment on Spheron H100/A100

Step 1: Check Your vLLM Version

MRV2 is available in v0.17+. Enable it explicitly with VLLM_USE_V2_MODEL_RUNNER=1 on all current versions. To get the latest build:

bash
pip install "vllm>=0.17.0"

Or use the pinned Docker image (recommended for production to avoid unexpected changes from latest):

bash
docker pull vllm/vllm-openai:v0.17.0

At startup, look for this line in the logs to confirm MRV2 is active:

INFO ... Using ModelRunnerV2

Step 2: Launch a Spheron GPU Instance

Log in to app.spheron.ai and select your GPU from the catalog. For 70B models in FP8, an H100 SXM5 80GB is the right choice. For cost-optimized dense model serving (13B-40B range), A100 80GB gives good throughput at lower cost.

Select bare metal for the instance type. MRV2's Triton kernels execute with no virtualization overhead on bare metal, which is where the benchmark numbers replicate.

SSH into the instance and verify the GPU is visible:

bash
nvidia-smi
# Confirm: GPU VRAM matches your selection, driver version shown

Step 3: Single H100 Deployment with MRV2

For Llama 3.3 70B FP8 on a single H100:

bash
docker run --gpus all --ipc=host -e VLLM_USE_V2_MODEL_RUNNER=1 -p 8000:8000 \
  vllm/vllm-openai:v0.17.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 256

Flag explanation:

  • -e VLLM_USE_V2_MODEL_RUNNER=1: passes the env var into the container to enable MRV2 explicitly. Required on all current vLLM versions.
  • --ipc=host: required for multi-process CUDA shared memory. Skip this and you'll see CUDA errors under load.
  • --dtype fp8: hardware FP8 on H100's Tensor Cores. Cuts VRAM for weights from ~140GB (FP16) to ~70GB, allowing 70B on a single 80GB card.
  • --gpu-memory-utilization 0.92: leaves 8% headroom for CUDA contexts. Push higher (0.95) if you hit KV-cache OOM at your target concurrency.
  • --max-model-len 32768: context window per sequence. Reduce to 16384 if you want more concurrent sequences within VRAM.
  • --max-num-seqs 256: maximum concurrent in-flight sequences. MRV2's allocator handles this more efficiently than MRV1.

Step 4: Verify MRV2 Is Active

bash
docker logs <container_id> 2>&1 | grep -E "ModelRunner|model_runner"

Expected output:

INFO ... Using ModelRunnerV2

If you see ModelRunnerV1 or no match, check your vLLM version. MRV2 requires v0.17+ and must be enabled with VLLM_USE_V2_MODEL_RUNNER=1 in your environment. Set that variable and restart.

Step 5: Benchmark MRV2 vs Legacy on Llama 3.3 70B FP8

vLLM includes a throughput benchmark script in its repo. Note that benchmarks/benchmark_throughput.py runs the model offline rather than sending requests to a running server (use benchmarks/benchmark_serving.py to benchmark a live endpoint). Clone the vLLM repo and run:

bash
# Inside the container or on the host with vllm installed
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128 \
  --dtype fp8

To compare against the legacy runner, restart the container with VLLM_USE_V2_MODEL_RUNNER=0 set in the environment to fall back to MRV1 behavior. The throughput gap should be meaningful on H100 for this workload configuration.

Step 6: Multi-GPU Deployment with Tensor Parallelism

For FP16 serving or larger models requiring multiple GPUs:

bash
docker run --gpus all --ipc=host -e VLLM_USE_V2_MODEL_RUNNER=1 -p 8000:8000 \
  vllm/vllm-openai:v0.17.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 512

MRV2's async scheduler handles tensor-parallel coordination more efficiently than MRV1. The CPU scheduling overhead per step does not scale with TP degree - the GPU workers execute their shards asynchronously from the scheduling loop.

Optimizing MRV2 for MoE Models: EPLB and Expert Parallelism

Mixture-of-Experts models (DeepSeek V3, Qwen MoE, Mixtral) have an additional challenge: uneven expert activation. When a model has 256 experts and each token only activates 8, the activated experts follow a long-tail distribution. Some experts are hot (frequently activated), others are cold (rarely used). In a multi-GPU setup without load balancing, some GPUs sit idle while others are saturated.
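A toy simulation (illustrative only, not vLLM code) shows why naive placement saturates some GPUs: with Zipf-like expert popularity and round-robin expert placement, the hottest GPU ends up doing well over the average share of work.

```python
# Illustrative simulation of long-tail expert activation: 256 experts,
# top-8 routing, expert popularity ~ 1/rank, naive round-robin placement.
import random

random.seed(0)
NUM_EXPERTS, TOP_K, NUM_GPUS, TOKENS = 256, 8, 8, 10_000
weights = [1 / (rank + 1) for rank in range(NUM_EXPERTS)]  # long-tail popularity

gpu_load = [0] * NUM_GPUS
for _ in range(TOKENS):
    # Each token activates TOP_K experts, drawn by popularity
    for expert in random.choices(range(NUM_EXPERTS), weights=weights, k=TOP_K):
        gpu_load[expert % NUM_GPUS] += 1  # naive round-robin expert placement

mean = sum(gpu_load) / NUM_GPUS
print(f"hottest GPU does {max(gpu_load) / mean:.2f}x the average work")
```

Duplicating the few hot experts and routing to the least-loaded copy, which is what EPLB does at runtime, is what flattens this ratio back toward 1.0.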

EPLB (Expert Parallelism Load Balancing) is a standalone vLLM feature that works well alongside MRV2 for this. EPLB works by duplicating hot experts and distributing them across GPUs, then dynamically routing tokens to the least-loaded copy. The rebalancing happens at runtime based on observed activation patterns, not at deployment time.

To enable expert parallelism for a MoE model, add --enable-expert-parallel. This flag is only valid for MoE architectures. Do not use it with dense models (Llama, Mistral, Gemma) - vLLM will error immediately if you do.

For DeepSeek V3 on 8 H100 GPUs:

bash
docker run --gpus all --ipc=host -e VLLM_USE_V2_MODEL_RUNNER=1 -p 8000:8000 \
  vllm/vllm-openai:v0.17.0 \
  --model deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768

For Qwen3 models, the same flag applies to MoE variants. The EPLB rebalancing kicks in after a warmup period (roughly the first few hundred requests). During warmup, you may see temporarily uneven GPU utilization in nvidia-smi. This is expected behavior.

GPU memory usage increases slightly with EPLB enabled because hot expert copies consume additional VRAM. Reduce --gpu-memory-utilization by 0.02-0.03 from your baseline when enabling expert parallelism for the first time, then tune upward.

MRV2 with Speculative Decoding: CUDA Graphs and Eagle3

The legacy MRV1 had a known overhead with speculative decoding: each speculation step required a CPU synchronization point to verify draft tokens and construct the next input. At high speculation depths (5-8 tokens), this added meaningful latency per accepted token.

MRV2 captures both the draft model and target model decode steps into a single CUDA Graph. CUDA Graphs pre-compile the entire decode execution sequence into a GPU-side object that replays without CPU involvement. The speculation loop, verification, and resampling all happen within the graph. The CPU only intervenes when a new sequence arrives or an existing sequence finishes.

MRV2 also runs draft model prefill and target model decode in overlapping GPU streams when both are active. This reduces speculative decoding overhead by 15-20% compared to MRV1 on the same hardware.

For Eagle3 speculative decoding under MRV2, see the full speculative decoding production guide for setup and model selection. The short version:

bash
docker run --gpus all --ipc=host -e VLLM_USE_V2_MODEL_RUNNER=1 -p 8000:8000 \
  vllm/vllm-openai:v0.17.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.94 \
  --speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --num-speculative-tokens 5

Set --gpu-memory-utilization 0.94 rather than the default 0.90 to give both models enough VRAM to coexist. Start with --num-speculative-tokens 5 and tune upward if your acceptance rate (visible in the /metrics endpoint as vllm:spec_decode_draft_acceptance_rate) stays above 0.75.
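To decide whether tuning the depth upward is worth it, the standard speculative-decoding expectation is a useful guide (this is generic arithmetic, not a vLLM API): with per-token acceptance rate a and depth k, each target-model step emits (1 - a^(k+1)) / (1 - a) tokens on average.

```python
# Rough tuning aid: expected tokens emitted per target-model step,
# given per-token draft acceptance rate `a` and speculation depth `k`.
def expected_tokens_per_step(a: float, k: int) -> float:
    # A run of i accepted drafts has probability a^i; the verified
    # target token always lands, giving a truncated geometric sum.
    return (1 - a ** (k + 1)) / (1 - a)

for k in (3, 5, 8):
    print(k, round(expected_tokens_per_step(0.75, k), 2))
# 3 2.73
# 5 3.29
# 8 3.7
```

At a 0.75 acceptance rate, going from depth 5 to depth 8 buys only ~0.4 extra tokens per step while tripling wasted draft work on rejection, which is why depth 5 is a sensible starting point.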

Before/after comparison for Llama 3.3 70B FP8 on H100, at batch size 4 (low-concurrency chat):

| Configuration | TPOT (ms) | Output tok/s |
|---|---|---|
| MRV2, no speculative | 43.3 | 2,155 |
| MRV1 + Eagle3 | ~22 | ~2,800 |
| MRV2 + Eagle3 | ~18 | ~3,200 |

MRV2 + Eagle3 reduces TPOT by another 18% versus MRV1 + Eagle3 at the same batch size, due to the CUDA Graph capture eliminating the per-step CPU sync overhead.

GPU Selection for MRV2 Workloads on Spheron

| GPU | VRAM | MRV2 Fit | Best For | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|---|
| H100 SXM5 80GB | 80 GB | Optimal | 70B FP8, MoE models, high throughput | $2.90 | $0.80 |
| A100 80GB SXM4 | 80 GB | Strong | 13B-40B FP16, cost-optimized serving | $1.64 | $0.45 |
| H200 141GB | 141 GB | Optimal | Long-context, 70B FP16 without quantization | $4.50 | $1.19 |
| B200 192GB | 192 GB | Optimal | 405B+ models, maximum throughput | $7.43 | $1.71 |
| L40S 48GB | 48 GB | Good | 13B-34B models, budget inference | $0.72 | N/A |

Pricing fluctuates with GPU availability. The rates above were captured on 13 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For dense transformer models in the 13B-70B range, H100 SXM5 is the right choice for MRV2. Its NVLink bandwidth and FP8 Tensor Cores align with what MRV2's Triton kernels target. For cost-sensitive workloads where you're serving 13B-40B models and don't need the maximum throughput, A100 80GB SXM4 handles MRV2 well even without FP8 hardware support (run FP16 instead; the async scheduling gains still apply).

H200 and B200 are the right choice if you're serving 70B+ in FP16 without quantization, or 405B models with tensor parallelism. MRV2's hybrid allocator becomes especially valuable at 141GB+ VRAM, since the larger KV cache budget means fragmentation waste has a higher absolute cost.

Monitoring MRV2 in Production

MRV2 adds new Prometheus metrics and changes the semantics of several existing ones. The key ones to watch:

vllm:num_requests_running: the number of sequences currently active in the GPU execution pipeline. If this value stays close to --max-num-seqs under load, you're saturating the batch, which is good. If it's well below --max-num-seqs but requests are waiting, there's a scheduling bottleneck to investigate.

vllm:kv_cache_usage_perc: now reflects the hybrid allocator's view of KV cache usage rather than the raw block count. A value above 0.90 under normal load means the hybrid allocator is fully committed. Reduce incoming traffic or increase VRAM allocation.

vllm:num_requests_waiting: the pre-existing queue depth metric. If this grows linearly over time rather than spiking and recovering, you're under-provisioned for your traffic volume.
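If you want to act on these gauges programmatically rather than eyeball curl output, a minimal parser over the Prometheus exposition text is enough. This is a sketch only: the metric names match the ones quoted above, and the 0.90 threshold is this guide's rule of thumb.

```python
# Minimal sketch: parse vLLM /metrics exposition text and flag a
# KV-cache-limited server (0.90 threshold per the guidance above).
import re

def gauge_value(metrics_text: str, name: str):
    """Return the first sample value for a named gauge, or None if absent."""
    pattern = rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$"
    match = re.search(pattern, metrics_text, re.MULTILINE)
    return float(match.group(1)) if match else None

# Sample exposition text, as returned by GET /metrics
sample = """\
vllm:num_requests_running 212
vllm:num_requests_waiting 3
vllm:kv_cache_usage_perc{model="llama-3.3-70b"} 0.94
"""

usage = gauge_value(sample, "vllm:kv_cache_usage_perc")
if usage is not None and usage > 0.90:
    print(f"KV-cache-limited: usage at {usage:.0%}")
# KV-cache-limited: usage at 94%
```

In production you would fetch the text from http://localhost:8000/metrics on a timer, but the alert rule itself belongs in Prometheus once the scrape config below is in place.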

Prometheus scrape config:

yaml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s

Pair this with Grafana for visualization. For GPU-level metrics (memory bandwidth, SM utilization, NVLink traffic), add a DCGM exporter alongside the vLLM metrics.

For a full monitoring stack covering GPU hardware metrics and LLM serving metrics together, see the GPU monitoring guide.


Spheron's bare-metal H100 and A100 instances give MRV2's GPU-native Triton kernels direct access to compute units - no virtualization overhead between your workload and the hardware. Rent an H100 → | Rent an A100 → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.