Engineering

Fractional GPUs for AI Inference: vGPU, MPS, and Right-Sizing Your GPU Cloud Spend (2026 Guide)

Written by Mitrasish, Co-founder · Apr 5, 2026
Tags: GPU Cloud, LLM Inference, NVIDIA vGPU, GPU Sharing, MIG, MPS, Cost Optimization, AI Infrastructure

A 7B model in FP16 needs 14 GB VRAM. An H100 PCIe has 80 GB. That leaves 66 GB sitting idle if you run one model per GPU. For most inference workloads serving moderate traffic, GPU SM utilization stays below 40%. You are paying for 100% and using a fraction. Fractional GPU technologies fix this by splitting one physical GPU into isolated or shared slices, so multiple models or tenants can share the hardware without each paying for a full GPU. The GPU cost optimization playbook identifies this idle capacity as one of the most common sources of wasted cloud spend. The run multiple LLMs on one GPU guide covers MIG and time-slicing setup step by step. This post goes wider: all four fractional GPU methods, when each fits, cost math with live Spheron pricing, and the cases where sharing backfires.

The GPU Utilization Problem

Training keeps GPU SMs busy above 90% because the forward pass, backward pass, and optimizer step fill every cycle. Inference is different. Requests arrive in bursts. Between bursts, the GPU waits. Even during a burst, the decode phase is memory-bandwidth bound, not compute bound, so SM occupancy is lower than the raw TFLOPS suggest.

The result: most inference deployments run at 20-40% average SM utilization. A 7B INT4 model on an H100 PCIe handles 2,000-4,000 tokens per second at peak, but at typical API traffic levels it spends most of its time waiting for the next request.

The VRAM story is similar:

| Model | Precision | VRAM needed | H100 PCIe (80 GB) headroom | A100 80GB headroom | L40S (48 GB) headroom |
|---|---|---|---|---|---|
| 7B | FP16 | ~14 GB | 66 GB | 66 GB | 34 GB |
| 7B | INT4 | ~4 GB | 76 GB | 76 GB | 44 GB |
| 13B | FP16 | ~26 GB | 54 GB | 54 GB | 22 GB |
| 13B | INT4 | ~7 GB | 73 GB | 73 GB | 41 GB |
| 70B | FP16 | ~140 GB | Does not fit | Does not fit | Does not fit |
| 70B | INT4 | ~35 GB | 45 GB | 45 GB | 13 GB |

Running a single 7B FP16 model on an H100 leaves 83% of VRAM unused. Fractional GPU methods let you fill that space with additional model instances. For guidance on which GPU fits which model size and workload, see best GPU for AI inference 2026.
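The headroom numbers above follow from simple arithmetic: weight memory is roughly parameters times bytes per parameter (KV cache and activations are extra). A quick sketch, with illustrative helper names that are not part of any vendor API:

```python
# Rough VRAM estimate for model weights alone. KV cache, activations, and
# framework overhead come on top, which is why the table hedges with "~".
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB for a dense transformer."""
    return params_billion * BYTES_PER_PARAM[precision]

def headroom_gb(gpu_vram_gb: float, params_billion: float, precision: str) -> float:
    """VRAM left idle after loading the weights on a single GPU."""
    return gpu_vram_gb - weights_vram_gb(params_billion, precision)

print(weights_vram_gb(7, "fp16"))   # 14.0 GB, matching the table row
print(headroom_gb(80, 7, "fp16"))   # 66.0 GB sitting idle on an H100 PCIe
```

The same function reproduces the 70B INT4 row (35 GB), which is why that model still fits on a single 80 GB card while 70B FP16 does not.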

Fractional GPU Technologies: Four Options

| Technology | Isolation | VRAM allocation | GPU compatibility | Use case |
|---|---|---|---|---|
| MIG | Hardware | Fixed slices (10/20/40 GB on H100) | A100, A30, H100, H200, B200 | Multi-tenant, hard isolation |
| MPS | Process-level | Shared pool | Volta+ (all modern data center GPUs) | Concurrent inference, controlled env |
| NVIDIA vGPU | Hypervisor | Configurable profiles | Requires vGPU-licensed GPU and hypervisor | Enterprise VM deployments |
| Time-slicing | None | Shared pool | All NVIDIA GPUs | Dev/test, low-traffic endpoints |

MIG (Multi-Instance GPU)

MIG partitions a GPU at the hardware level into up to 7 isolated instances, each with its own VRAM, cache, and compute engines. A crash or memory error in one instance does not affect the others. Each instance appears as a separate CUDA device.

H100 80GB MIG profiles (run nvidia-smi mig -lgip to list valid combinations for your driver version):

| Profile | VRAM per instance | Max instances | Best for |
|---|---|---|---|
| 1g.10gb | 10 GB | 7 | 7B INT4, 3B FP16 |
| 2g.20gb | 20 GB | 3 | 7B FP16, 13B INT4 |
| 3g.40gb | 40 GB | 2 | 13B FP16, 30B INT4 |
| 4g.40gb | 40 GB | 1 | 30B INT4 with more compute |
| 7g.80gb | 80 GB | 1 | Full GPU (no sharing) |

A100 80GB also supports up to 7 instances with the 1g.10gb profile, identical to the H100 80GB (7 × 10 GB = 70 GB, within the 80 GB physical capacity). The A100 40GB uses the smaller 1g.5gb profile to reach 7 instances (5 GB VRAM per slice).

The right-sizing angle: a 7B INT4 model fits in a 1g.10gb slice (needs 4 GB VRAM, well within 10 GB). That gives you 7 independent inference endpoints for the price of one H100. Compare that to renting 7 separate RTX 4090 instances at $0.51/hr each ($3.57/hr total) versus one H100 at $2.01/hr with 7 MIG slices (effective $0.29/hr per slice).

Important: the numeric profile IDs used in nvidia-smi mig -cgi (e.g., 14 for 2g.20gb on H100 SXM5) differ between GPU models and driver versions. Always run nvidia-smi mig -lgip first and use the profile ID from your output rather than copying IDs from documentation.
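Picking a profile is a matter of finding the smallest slice that holds the model's weights plus KV cache. A small helper sketching that lookup against the H100 profile table above (the function and table names are illustrative, not from any NVIDIA tool):

```python
# H100 80GB MIG profiles from the table above: (name, slice_vram_gb, max_instances).
H100_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("4g.40gb", 40, 1),
    ("7g.80gb", 80, 1),
]

def smallest_fitting_profile(model_vram_gb: float):
    """Return the smallest profile whose slice holds the model, or None."""
    for name, slice_vram, max_instances in H100_PROFILES:
        if slice_vram >= model_vram_gb:
            return name, max_instances
    return None  # needs multi-GPU tensor parallelism, not fractioning

print(smallest_fitting_profile(4))   # 7B INT4 -> ('1g.10gb', 7)
print(smallest_fitting_profile(14))  # 7B FP16 -> ('2g.20gb', 3)
```

Leave margin for KV cache when the model's weight footprint is close to the slice size: a 7B FP16 model at 14 GB fits a 20 GB slice, but long contexts can eat the remaining 6 GB quickly.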

MPS (Multi-Process Service)

MPS is a CUDA daemon that replaces the default GPU time-multiplexing with a single shared GPU context. Multiple client processes submit work through the MPS server, which schedules their kernels to run concurrently. The result: lower launch overhead and actual parallel kernel execution across processes, unlike time-slicing which interleaves them sequentially.

Starting the MPS daemon:

```bash
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
```

Capping SM allocation per process with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE:

```bash
# Process 1 gets 50% of SMs
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 --gpu-memory-utilization 0.45 &

# Process 2 gets 50% of SMs
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 --gpu-memory-utilization 0.45 &
```

One important caveat: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE only limits the fraction of SMs available to a client process. It provides no VRAM partitioning and no fault containment; a misbehaving client can still exhaust GPU memory or corrupt the shared context. This is different from MIG, which provides hard VRAM and compute isolation. Do not use MPS as a substitute for MIG when strict tenant isolation is required.

MPS versus MIG for L40S and RTX GPUs: The L40S does not support MIG. Neither does the RTX 4090 or RTX 5090. MPS is the only hardware-concurrent option for these GPUs, making it the primary right-sizing tool for non-A100/H100 inference deployments.

On Volta+ GPUs, MPS also provides address space isolation between client processes, reducing the blast radius of errant memory accesses compared to pure time-slicing.

NVIDIA vGPU

NVIDIA vGPU virtualizes the GPU at the hypervisor layer. A licensed NVIDIA vGPU host driver runs on VMware vSphere, Proxmox, or KVM, and presents virtual GPU devices to guest VMs. Each VM gets a fraction of the physical GPU's VRAM and compute.

Key vGPU profile types:

| Profile series | Purpose |
|---|---|
| C-series (e.g., A100-4C, A100-40C) | Compute and inference workloads |
| Q-series | Workstation / professional visualization |
| A-series | Virtual applications (vApps) |
| B-series | Virtual desktops (VDI) |

On the A100 40GB, the A100-4C profile gives a VM 4 GB of VRAM. On the A100 80GB, the equivalent A100D-4C profile provides 4 GB per VM. The A100-40C profile gives 40 GB. Profile selection determines the fraction.

What this means for cloud buyers: Fractional GPU VMs on hyperscalers (e.g., Google Cloud's fractional G4 instances) use NVIDIA vGPU under the hood. The "fractional" price reflects the GPU slice plus the NVIDIA vGPU software license plus the hypervisor management overhead, all bundled into the VM rate.

Bare-metal trade-off: Running on Spheron bare-metal gives you full physical GPU access with no hypervisor overhead and no vGPU licensing cost. You use MIG or MPS to share the GPU instead. For the typical inference use case, bare metal with MIG offers equivalent isolation to vGPU without the licensing markup.

If you want to use vGPU on Spheron bare-metal: You would need to install a hypervisor (Proxmox or KVM) on the instance and obtain an NVIDIA vGPU software license. This is an enterprise workflow suited for teams with existing VMware or Proxmox infrastructure, not for typical inference deployments.

Time-Slicing

Time-slicing uses the GPU's built-in hardware scheduling to rapidly switch between processes. No VRAM isolation, no compute isolation. If one process allocates more VRAM than expected, it can OOM-kill the others.

The compatibility advantage is the main reason to use it: time-slicing works on every NVIDIA GPU, including consumer cards, without any special mode or daemon. It is the simplest way to run multiple processes on a single GPU in a dev or testing environment.

When time-slicing fits:

  • Development notebooks and experimentation
  • Internal tools with low, sporadic traffic
  • Pre-Volta GPUs that lack MPS address space isolation
  • Situations where operational simplicity matters more than performance isolation

Decision Matrix: Which Method for Your Workload

| Scenario | Model size | GPU | Recommended method | Why |
|---|---|---|---|---|
| Single small model, low traffic | 7B INT4 | H100/A100 | MIG (1g.10gb) | Dedicated VRAM slice, up to 7x density |
| Multiple small models, isolated | 7B-13B FP16 | H100/A100 | MIG (2g.20gb or 3g.40gb) | Hardware isolation, no cross-process interference |
| Multiple models, L40S or RTX GPU | 7B-13B INT4 | L40S, RTX 4090 | MPS with thread percentage limits | Concurrent kernels, no MIG support on these GPUs |
| Multi-tenant inference as a service | Any | H100/A100 | MIG with per-tenant containers | Strongest isolation guarantee |
| Development, testing, experiments | Any | Any | Time-slicing | Simple setup, acceptable for non-production |
| Enterprise VM deployment | Any | vGPU-licensed GPU | NVIDIA vGPU | Fits existing hypervisor workflow |
| Large model, full throughput needed | 70B+ FP16 | H100, A100 | Full GPU (no fractioning) | VRAM requirement leaves no room to share |

Setting Up Fractional GPU Inference with vLLM

MIG with vLLM

```bash
# List available MIG profiles - use the profile ID from YOUR output, not hardcoded values
sudo nvidia-smi mig -lgip

# Create a 2g.20gb MIG instance on H100 (profile ID 14 on H100 SXM5 - verify with lgip)
sudo nvidia-smi mig -cgi 14 -C

# List MIG instances and their UUIDs
nvidia-smi -L | grep MIG

# Launch vLLM targeting the MIG device by UUID
docker run --gpus '"device=MIG-GPU-xxxx-yyy-zzz"' vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90
```

One MIG instance per container. Each container gets its own slice with dedicated VRAM and compute. For more detail on MIG configuration and multi-model deployment patterns, see the run multiple LLMs on one GPU guide.

MPS with vLLM

```bash
# Start MPS daemon
nvidia-cuda-mps-control -d

# Verify daemon is running
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# Process 1: first 7B model, capped at 50% SMs, 45% VRAM
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 --gpu-memory-utilization 0.45 &

# Process 2: second 7B model, capped at 50% SMs, 45% VRAM
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 --gpu-memory-utilization 0.45 &
```

Set --gpu-memory-utilization 0.45 per process (not 0.5) when running two instances. The flag applies to total physical VRAM, not a per-process quota. If both processes set 0.5, KV cache growth in either process can push total usage above 100% and cause an out-of-memory error. The 10% buffer (2 x 0.45 = 0.90) gives enough headroom for uneven KV cache allocation.
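The split generalizes to any process count: reserve a buffer, then divide what remains. A minimal sketch, with an illustrative helper name (the 10% buffer is the same heuristic used above, not a vLLM default):

```python
def per_process_utilization(n_processes: int, safety_buffer: float = 0.10) -> float:
    """Split --gpu-memory-utilization across processes sharing one GPU.

    vLLM's flag is a fraction of *total physical VRAM*, so the per-process
    values must sum to below 1.0, minus headroom for uneven KV cache growth.
    """
    return round((1.0 - safety_buffer) / n_processes, 2)

print(per_process_utilization(2))  # 0.45, the value used in the commands above
print(per_process_utilization(3))  # 0.3 for a three-way split
```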

Triton Inference Server with MPS

Triton can launch multiple model instances on the same GPU automatically. In your model config:

```json
{
  "instance_group": [
    {"kind": "KIND_GPU", "count": 2, "gpus": [0]}
  ]
}
```

Two model instances on GPU 0. When MPS is running, Triton's two instances execute kernels concurrently via the MPS server rather than time-slicing.

Throughput per Dollar: Fractional vs Full GPU

Scenario: Serving a 7B model (Llama 3.1 8B, INT4 quantized, batch=1, 512 context)

| Configuration | GPU cost | Approx. throughput | Effective cost per 1M tokens |
|---|---|---|---|
| 1x H100 PCIe (full, 1 model) | $2.01/hr | ~4,000 tok/s | ~$0.14 |
| H100 PCIe, 3x MIG 2g.20gb slices | $2.01/hr total / $0.67/hr per slice | ~1,100 tok/s per slice | ~$0.17 per slice |
| 1x L40S PCIe (full, 1 model) | $2.06/hr | ~2,200 tok/s | ~$0.26 |
| L40S PCIe, 2x processes via MPS | $2.06/hr total / $1.03/hr per process | ~1,400 tok/s per process | ~$0.20 per process |
| 1x RTX 4090 (full, 1 model) | $0.51/hr | ~1,500 tok/s | ~$0.09 |

Throughput numbers above are representative estimates for INT4 quantized 7B models at batch=1 with vLLM 0.6.x. Actual numbers vary with context length, batch size, quantization implementation, and hardware memory bandwidth. Run your own benchmark using your production model and traffic pattern before making capacity decisions.

Pricing fluctuates based on GPU availability. The prices above were captured on 05 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The cost story: a 3-slice MIG configuration on a single H100 serves 3 concurrent 7B models at $0.67/hr effective cost per model, versus renting 3 separate RTX 4090 instances at $0.51/hr each ($1.53/hr total). The 2g.20gb profile maxes out at 3 instances per H100 (3 × 2 GPC slices = 6 of 7, with 1 unused), so 3 is the ceiling for this profile. MIG wins on isolation and VRAM headroom per slice; RTX 4090 is cheaper if you only need the compute.
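The cost-per-1M-tokens column is straightforward to reproduce: hourly price divided by tokens generated per hour. A quick sketch using the table's figures (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Effective dollars per 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# Full H100 PCIe vs one of three MIG 2g.20gb slices, figures from the table
print(round(cost_per_million_tokens(2.01, 4000), 2))      # ~0.14
print(round(cost_per_million_tokens(2.01 / 3, 1100), 2))  # ~0.17 per slice
```

Note what the math shows: the full GPU is slightly cheaper per token, but the MIG configuration serves three independent models. You trade a few cents per million tokens for 3x deployment density.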

Multi-Tenant Inference: Serving Multiple Models on Shared Infrastructure

For teams running inference as a service with per-tenant isolation, the pattern is:

  • One MIG instance per tenant (H100/A100) or one MPS process per tenant (L40S/RTX)
  • Each tenant gets their own vLLM or llama.cpp server on a distinct port
  • A routing layer (nginx upstream block or Traefik with path-based routing) directs requests to the correct backend

Health checks per instance:

```bash
# Check each vLLM endpoint independently
curl -s http://localhost:8000/health
curl -s http://localhost:8001/health
```

For serving multiple tasks from a single model rather than running separate models per tenant, LoRA adapters are a complementary approach. See LoRA multi-adapter serving on GPU cloud for that pattern.

Cost Analysis: Fractional vs Full GPU vs Spot

| Strategy | Best for | Cost vs baseline |
|---|---|---|
| Full GPU on-demand | Large models (70B+ FP16), high traffic, full VRAM needed | Baseline |
| Fractional GPU (MIG/MPS) | 7B-13B models, moderate traffic, multi-tenant | 40-70% lower per-model |
| Spot GPU (A100 available on Spheron) | Fault-tolerant batch inference, checkpoint-enabled | 30-50% lower vs on-demand |
| Fractional + spot combined | Batch inference with small models | Largest savings |

A100 80GB SXM4 instances start at $1.05/hr on-demand, with spot pricing from $0.45/hr. A100 80GB PCIe starts at $1.43/hr on-demand with spot from $1.14/hr. Combining spot with MIG partitioning on an A100 gives you multiple model instances on an interruptible GPU, suitable for batch inference jobs that checkpoint their state.
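The combined savings stack multiplicatively: spot discounts the hourly rate, MIG divides it across slices. A sketch using the A100 prices quoted above (helper name is illustrative):

```python
def effective_slice_cost(hourly_usd: float, n_slices: int) -> float:
    """Per-slice hourly cost when one GPU is MIG-partitioned n ways."""
    return round(hourly_usd / n_slices, 3)

# A100 80GB SXM4, 7x 1g.10gb slices, prices from the text above
print(effective_slice_cost(1.05, 7))  # 0.15/hr per slice on-demand
print(effective_slice_cost(0.45, 7))  # 0.064/hr per slice on spot
```

At spot-plus-MIG rates, each 10 GB slice costs under seven cents an hour, which is why this combination wins for checkpointed batch jobs that can tolerate interruption.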

See serverless GPU vs on-demand vs reserved for a full breakdown of GPU billing models and when each reduces total cost.

When NOT to Use Fractional GPUs

Fractional GPU sharing is the right call for a specific subset of workloads. Here are the cases where it backfires:

  1. Model does not fit in the slice. A 70B FP16 model needs 140 GB VRAM. No MIG profile on a single H100 80GB comes close. Multi-GPU tensor parallelism is the answer, not fractioning.
  2. SM utilization is already above 80%. If a single inference process is using 80%+ of GPU compute under normal load, sharing SMs with another process adds latency without reducing cost. Profile first with nvidia-smi dmon.
  3. Speculative decoding pipelines. Draft model and target model run in lockstep. Capping SM allocation with MPS slows the draft model and can erase the latency benefit of speculative decoding entirely.
  4. Tensor parallelism across GPUs. TP workloads use NCCL all-reduce across multiple physical GPUs. MIG/MPS on a single GPU does not help multi-GPU tensor parallel inference, and MPS with multiple GPUs does not improve inter-GPU bandwidth.
  5. Long-context workloads with large batch sizes. 128K+ context windows with large batches fill VRAM with KV cache. The memory overhead leaves no room to share the GPU with another process. See NVMe KV cache offloading for LLM inference for strategies to reduce KV cache memory pressure.

Spheron for Fractional GPU Inference

Spheron provides bare-metal H100, A100, and L40S instances with full root access. This is required for MIG configuration (enabling MIG mode with nvidia-smi -i 0 -mig 1 needs root) and MPS daemon setup. Managed cloud VMs from hyperscalers often restrict MIG mode at the hypervisor level, meaning you cannot enable or reconfigure MIG partitions on AWS, GCP, or Azure GPU instances without special arrangements.

Pricing is per GPU per hour with per-minute billing. On hyperscalers, fractional GPU VMs bundle vGPU licensing fees and hypervisor management overhead into an opaque per-vCPU or per-GB rate. On Spheron bare metal, you pay the full GPU rate and use MIG or MPS to extract multiple model slots from the hardware. The math typically favors bare metal plus fractioning over paying for pre-fractioned VMs once you account for licensing markup.

Right-sizing workflow: start with a full GPU benchmark on Spheron using per-minute billing (you pay only for what you run). Measure actual SM and VRAM utilization with nvidia-smi dmon. If SM utilization is below 50% and VRAM is below 50% at your target traffic, switch to a MIG or MPS configuration and re-benchmark. If latency SLAs still hold, your cost per model drops by the sharing factor.
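The workflow's decision point can be stated as a simple rule. A sketch encoding the thresholds from the paragraph above (the function name and return strings are illustrative, not from any tool):

```python
def rightsizing_recommendation(sm_util_pct: float, vram_util_pct: float) -> str:
    """Apply the right-sizing thresholds: fraction only when both metrics are low."""
    if sm_util_pct < 50 and vram_util_pct < 50:
        return "switch to MIG or MPS, then re-benchmark against latency SLAs"
    if sm_util_pct >= 80:
        return "keep the full GPU; sharing would add latency without saving money"
    return "borderline: benchmark a shared configuration before committing"

print(rightsizing_recommendation(35, 20))  # typical moderate-traffic inference
```

Feed it the averages you collect from nvidia-smi dmon over a representative traffic window, not a single sample; burst workloads look idle between requests.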

H100, A100, and L40S are all available on Spheron. For spec comparisons and current pricing, see H100 rental, A100 rental, and L40S rental.


Most inference workloads need 20-40 GB of VRAM and 30-50% of a GPU's compute, not a full H100. Spheron bare-metal instances give you root access to configure MIG and MPS, per-minute billing so you only pay while running, and transparent pricing with no fractional GPU licensing markup.

Rent an H100 → | Rent an A100 → | View all pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.