A 7B model in FP16 needs 14 GB VRAM. An H100 PCIe has 80 GB. That leaves 66 GB sitting idle if you run one model per GPU. For most inference workloads serving moderate traffic, GPU SM utilization stays below 40%. You are paying for 100% and using a fraction. Fractional GPU technologies fix this by splitting one physical GPU into isolated or shared slices, so multiple models or tenants can share the hardware without each paying for a full GPU. The GPU cost optimization playbook identifies this idle capacity as one of the most common sources of wasted cloud spend. The run multiple LLMs on one GPU guide covers MIG and time-slicing setup step by step. This post goes wider: all four fractional GPU methods, when each fits, cost math with live Spheron pricing, and the cases where sharing backfires.
The GPU Utilization Problem
Training keeps GPU SMs busy above 90% because the forward pass, backward pass, and optimizer step fill every cycle. Inference is different. Requests arrive in bursts. Between bursts, the GPU waits. Even during a burst, the decode phase is memory-bandwidth bound, not compute bound, so SM occupancy is lower than the raw TFLOPS suggest.
The result: most inference deployments run at 20-40% average SM utilization. A 7B INT4 model on an H100 PCIe handles 2,000-4,000 tokens per second at peak, but at typical API traffic levels it spends most of its time waiting for the next request.
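The memory-bandwidth bound can be sketched with back-of-envelope arithmetic: each decoded token streams the full weight set from HBM, so single-stream decode speed is capped by bandwidth divided by model size. The bandwidth and weight figures below are rough published specs used for illustration, not measurements.

```python
# Roofline sketch for the decode phase: single-stream decode throughput
# is bounded by memory bandwidth / weight bytes, regardless of TFLOPS.

def decode_tokens_per_sec(mem_bandwidth_gbs: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/sec)."""
    return mem_bandwidth_gbs / weight_gb

# H100 PCIe HBM bandwidth: roughly 2,000 GB/s (approximate spec)
# 7B model in INT4: roughly 4 GB of weights
bound = decode_tokens_per_sec(2000, 4)
print(f"single-stream decode bound: ~{bound:.0f} tok/s")
```

Batching raises aggregate throughput well above this single-stream bound, which is why the peak numbers in the next paragraph are higher, but the SMs still sit partially idle while weights stream.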
The VRAM story is similar:
| Model | Precision | VRAM needed | H100 PCIe (80GB) headroom | A100 80GB headroom | L40S (48GB) headroom |
|---|---|---|---|---|---|
| 7B | FP16 | ~14 GB | 66 GB | 66 GB | 34 GB |
| 7B | INT4 | ~4 GB | 76 GB | 76 GB | 44 GB |
| 13B | FP16 | ~26 GB | 54 GB | 54 GB | 22 GB |
| 13B | INT4 | ~7 GB | 73 GB | 73 GB | 41 GB |
| 70B | FP16 | ~140 GB | Does not fit | Does not fit | Does not fit |
| 70B | INT4 | ~35 GB | 45 GB | 45 GB | 13 GB |
Running a single 7B FP16 model on an H100 leaves 83% of VRAM unused. Fractional GPU methods let you fill that space with additional model instances. For guidance on which GPU fits which model size and workload, see best GPU for AI inference 2026.
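The table's VRAM figures come from a simple heuristic: parameter count times bytes per parameter. A minimal sketch, treating these as sizing estimates rather than exact requirements (real deployments add KV cache and activation overhead):

```python
# Weight-only VRAM estimate: params x bytes per parameter.
# KV cache and activations add more on top; this is a sizing heuristic.

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision]

def headroom_gb(gpu_vram_gb: float, params_billions: float, precision: str) -> float:
    return gpu_vram_gb - weight_vram_gb(params_billions, precision)

print(weight_vram_gb(7, "FP16"))   # 14.0 GB, matching the table
print(headroom_gb(80, 7, "FP16"))  # 66.0 GB free on an H100 PCIe
```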
Fractional GPU Technologies: Four Options
| Technology | Isolation | VRAM allocation | GPU compatibility | Use case |
|---|---|---|---|---|
| MIG | Hardware | Fixed slices (10/20/40 GB on H100) | A100, A30, H100, H200, B200 | Multi-tenant, hard isolation |
| MPS | Process-level | Shared pool | Volta+ (all modern data center GPUs) | Concurrent inference, controlled env |
| NVIDIA vGPU | Hypervisor | Configurable profiles | Requires vGPU-licensed GPU and hypervisor | Enterprise VM deployments |
| Time-slicing | None | Shared pool | All NVIDIA GPUs | Dev/test, low-traffic endpoints |
MIG (Multi-Instance GPU)
MIG partitions a GPU at the hardware level into up to 7 isolated instances, each with its own VRAM, cache, and compute engines. A crash or memory error in one instance does not affect the others. Each instance appears as a separate CUDA device.
H100 80GB MIG profiles (run nvidia-smi mig -lgip to list valid combinations for your driver version):
| Profile | VRAM per instance | Max instances | Best for |
|---|---|---|---|
| 1g.10gb | 10 GB | 7 | 7B INT4, 3B FP16 |
| 2g.20gb | 20 GB | 3 | 7B FP16, 13B INT4 |
| 3g.40gb | 40 GB | 2 | 13B FP16, 30B INT4 |
| 4g.40gb | 40 GB | 1 | 30B INT4 with more compute |
| 7g.80gb | 80 GB | 1 | Full GPU (no sharing) |
A100 80GB also supports up to 7 instances with the 1g.10gb profile, identical to the H100 80GB (7 × 10 GB = 70 GB, within the 80 GB physical capacity). The A100 40GB uses the smaller 1g.5gb profile to reach 7 instances (5 GB VRAM per slice).
The right-sizing angle: a 7B INT4 model fits in a 1g.10gb slice (needs 4 GB VRAM, well within 10 GB). That gives you 7 independent inference endpoints for the price of one H100. Compare renting 7 separate RTX 4090 instances at $0.51/hr each ($3.57/hr total) with one H100 at $2.01/hr carved into 7 MIG slices (effective $0.29/hr per slice).
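The slice-versus-separate-GPU comparison above is just arithmetic; a quick sketch using the Spheron rates quoted in this post (which will drift over time):

```python
# Effective per-endpoint cost: one H100 split into 7 MIG slices
# versus 7 standalone RTX 4090 rentals. Prices are this post's
# quoted rates, not live figures.

h100_hr = 2.01
rtx4090_hr = 0.51
slices = 7

per_slice = h100_hr / slices      # cost per MIG endpoint
seven_4090s = rtx4090_hr * slices # cost of 7 separate GPUs

print(f"H100 MIG slice: ${per_slice:.2f}/hr")    # ~$0.29
print(f"7x RTX 4090:    ${seven_4090s:.2f}/hr")  # $3.57
```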
Important: the numeric profile IDs used in nvidia-smi mig -cgi (e.g., 14 for 2g.20gb on H100 SXM5) differ between GPU models and driver versions. Always run nvidia-smi mig -lgip first and use the profile ID from your output rather than copying IDs from documentation.
MPS (Multi-Process Service)
MPS is a CUDA daemon that replaces the default GPU time-multiplexing with a single shared GPU context. Multiple client processes submit work through the MPS server, which schedules their kernels to run concurrently. The result: lower launch overhead and actual parallel kernel execution across processes, unlike time-slicing which interleaves them sequentially.
Starting the MPS daemon:
```bash
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
```

Capping SM allocation per process with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE:
```bash
# Process 1 gets 50% of SMs
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 --gpu-memory-utilization 0.45 &

# Process 2 gets 50% of SMs
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 --gpu-memory-utilization 0.45 &
```

One important caveat: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps the maximum SM percentage a client process can use, but it does not partition VRAM. This is different from MIG, which provides hard VRAM and compute isolation. Do not use MPS as a substitute for MIG when strict tenant isolation is required.
MPS versus MIG for L40S and RTX GPUs: The L40S does not support MIG. Neither does the RTX 4090 or RTX 5090. MPS is the only hardware-concurrent option for these GPUs, making it the primary right-sizing tool for non-A100/H100 inference deployments.
On Volta+ GPUs, MPS also provides address space isolation between client processes, reducing the blast radius of errant memory accesses compared to pure time-slicing.
NVIDIA vGPU
NVIDIA vGPU virtualizes the GPU at the hypervisor layer. A licensed NVIDIA vGPU host driver runs on VMware vSphere, Proxmox, or KVM, and presents virtual GPU devices to guest VMs. Each VM gets a fraction of the physical GPU's VRAM and compute.
Key vGPU profile types:
| Profile series | Purpose |
|---|---|
| C-series (e.g., A100-4C, A100-40C) | Compute and inference workloads |
| Q-series | Workstation / professional visualization |
| A-series | Virtual applications (vApps) |
| B-series | Virtual desktops (VDI) |
On the A100 40GB, the A100-4C profile gives a VM 4 GB of VRAM. On the A100 80GB, the equivalent A100D-4C profile provides 4 GB per VM. The A100-40C profile gives 40 GB. Profile selection determines the fraction.
What this means for cloud buyers: Fractional GPU VMs on hyperscalers (e.g., Google Cloud's fractional G4 instances) use NVIDIA vGPU under the hood. The "fractional" price reflects the GPU slice plus the NVIDIA vGPU software license plus the hypervisor management overhead, all bundled into the VM rate.
Bare-metal trade-off: Running on Spheron bare-metal gives you full physical GPU access with no hypervisor overhead and no vGPU licensing cost. You use MIG or MPS to share the GPU instead. For the typical inference use case, bare metal with MIG offers equivalent isolation to vGPU without the licensing markup.
If you want to use vGPU on Spheron bare-metal: You would need to install a hypervisor (Proxmox or KVM) on the instance and obtain an NVIDIA vGPU software license. This is an enterprise workflow suited for teams with existing VMware or Proxmox infrastructure, not for typical inference deployments.
Time-Slicing
Time-slicing uses the GPU's built-in hardware scheduling to rapidly switch between processes. No VRAM isolation, no compute isolation. If one process allocates more VRAM than expected, it can OOM-kill the others.
The compatibility advantage is the main reason to use it: time-slicing works on every NVIDIA GPU, including consumer cards, without any special mode or daemon. It is the simplest way to run multiple processes on a single GPU in a dev or testing environment.
When time-slicing fits:
- Development notebooks and experimentation
- Internal tools with low, sporadic traffic
- Pre-Volta GPUs that lack MPS address space isolation
- Situations where operational simplicity matters more than performance isolation
Decision Matrix: Which Method for Your Workload
| Scenario | Model size | GPU | Recommended method | Why |
|---|---|---|---|---|
| Single small model, low traffic | 7B INT4 | H100/A100 | MIG (1g.10gb) | Dedicated VRAM slice, up to 7x density |
| Multiple small models, isolated | 7B-13B FP16 | H100/A100 | MIG (2g.20gb or 3g.40gb) | Hardware isolation, no cross-process interference |
| Multiple models, L40S or RTX GPU | 7B-13B INT4 | L40S, RTX 4090 | MPS with thread percentage limits | Concurrent kernels, no MIG support on these GPUs |
| Multi-tenant inference as a service | Any | H100/A100 | MIG with per-tenant containers | Strongest isolation guarantee |
| Development, testing, experiments | Any | Any | Time-slicing | Simple setup, acceptable for non-production |
| Enterprise VM deployment | Any | vGPU-licensed GPU | NVIDIA vGPU | Fits existing hypervisor workflow |
| Large model, full throughput needed | 70B+ FP16 | H100, A100 | Full GPU (no fractioning) | VRAM requirement leaves no room to share |
Setting Up Fractional GPU Inference with vLLM
MIG with vLLM
```bash
# List available MIG profiles - use the profile ID from YOUR output, not hardcoded values
sudo nvidia-smi mig -lgip

# Create a 2g.20gb MIG instance on H100 (profile ID 14 on H100 SXM5 - verify with -lgip)
sudo nvidia-smi mig -cgi 14 -C

# List MIG instances and their UUIDs
nvidia-smi -L | grep MIG

# Launch vLLM targeting the MIG device by UUID
docker run --gpus '"device=MIG-GPU-xxxx-yyy-zzz"' vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90
```

One MIG instance per container. Each container gets its own slice with dedicated VRAM and compute. For more detail on MIG configuration and multi-model deployment patterns, see the run multiple LLMs on one GPU guide.
MPS with vLLM
```bash
# Start MPS daemon
nvidia-cuda-mps-control -d

# Verify daemon is running
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# Process 1: first 7B model, capped at 50% SMs, 45% VRAM
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 --gpu-memory-utilization 0.45 &

# Process 2: second 7B model, capped at 50% SMs, 45% VRAM
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 --gpu-memory-utilization 0.45 &
```

Set --gpu-memory-utilization 0.45 per process (not 0.5) when running two instances. The flag is a fraction of total physical VRAM, not a per-process quota. If both processes set 0.5, KV cache growth in either process can push total usage above 100% and cause an out-of-memory error. The 10% buffer (2 x 0.45 = 0.90) leaves enough headroom for uneven KV cache allocation.
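The 0.45-per-process rule generalizes to any number of MPS clients: split the total VRAM minus a safety buffer evenly. A minimal sketch of that budget calculation:

```python
# Per-process --gpu-memory-utilization budget for N vLLM processes
# sharing one GPU via MPS: split (1 - headroom) evenly. The headroom
# absorbs uneven KV cache growth between processes.

def per_process_mem_fraction(n_processes: int, headroom: float = 0.10) -> float:
    return round((1.0 - headroom) / n_processes, 2)

print(per_process_mem_fraction(2))  # 0.45, as used above
print(per_process_mem_fraction(3))  # 0.3
```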
Triton Inference Server with MPS
Triton can launch multiple model instances on the same GPU automatically. In the model's config.pbtxt (Triton uses protobuf text format, not JSON):

```
instance_group [
  {
    kind: KIND_GPU
    count: 2
    gpus: [0]
  }
]
```

Two model instances on GPU 0. When MPS is running, Triton's two instances execute kernels concurrently via the MPS server rather than time-slicing.
Throughput per Dollar: Fractional vs Full GPU
Scenario: Serving a 7B model (Llama 3.1 8B, INT4 quantized, batch=1, 512 context)
| Configuration | GPU cost | Approx. throughput | Effective cost per 1M tokens |
|---|---|---|---|
| 1x H100 PCIe (full, 1 model) | $2.01/hr | ~4,000 tok/s | ~$0.14 |
| H100 PCIe, 3x MIG 2g.20gb slices | $2.01/hr total / $0.67/hr per slice | ~1,100 tok/s per slice | ~$0.17 per slice |
| 1x L40S PCIe (full, 1 model) | $2.06/hr | ~2,200 tok/s | ~$0.26 |
| L40S PCIe, 2x processes via MPS | $2.06/hr total / $1.03/hr per process | ~1,400 tok/s per process | ~$0.20 per process |
| 1x RTX 4090 (full, 1 model) | $0.51/hr | ~1,500 tok/s | ~$0.09 |
Throughput numbers above are representative estimates for INT4 quantized 7B models at batch=1 with vLLM 0.6.x. Actual numbers vary with context length, batch size, quantization implementation, and hardware memory bandwidth. Run your own benchmark using your production model and traffic pattern before making capacity decisions.
Pricing fluctuates based on GPU availability. The prices above were captured on 05 Apr 2026 and may have changed. Check current GPU pricing for live rates.
The cost story: a 3-slice MIG configuration on a single H100 serves 3 concurrent 7B models at $0.67/hr effective cost per model, versus renting 3 separate RTX 4090 instances at $0.51/hr each ($1.53/hr total). The 2g.20gb profile maxes out at 3 instances per H100 (3 × 2 GPC slices = 6 of 7, with 1 unused), so 3 is the ceiling for this profile. MIG wins on isolation and VRAM headroom per slice; RTX 4090 is cheaper if you only need the compute.
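The cost-per-1M-tokens column in the table above reduces to one formula: hourly rate divided by tokens generated per hour. A sketch reproducing those figures, using the table's estimated throughputs rather than measured numbers:

```python
# Cost per 1M generated tokens = hourly price / (tokens per hour / 1M).
# Throughput inputs are the table's representative estimates.

def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    tokens_per_hr = tokens_per_sec * 3600
    return price_per_hr / (tokens_per_hr / 1_000_000)

print(round(cost_per_million_tokens(2.01, 4000), 2))      # full H100: 0.14
print(round(cost_per_million_tokens(2.01 / 3, 1100), 2))  # MIG slice: 0.17
print(round(cost_per_million_tokens(0.51, 1500), 2))      # RTX 4090: 0.09
```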
Multi-Tenant Inference: Serving Multiple Models on Shared Infrastructure
For teams running inference as a service with per-tenant isolation, the pattern is:
- One MIG instance per tenant (H100/A100) or one MPS process per tenant (L40S/RTX)
- Each tenant gets their own vLLM or llama.cpp server on a distinct port
- A routing layer (nginx upstream block or Traefik with path-based routing) directs requests to the correct backend
Health checks per instance:
```bash
# Check each vLLM endpoint independently
curl -s http://localhost:8000/health
curl -s http://localhost:8001/health
```

For serving multiple tasks from a single model rather than running separate models per tenant, LoRA adapters are a complementary approach. See LoRA multi-adapter serving on GPU cloud for that pattern.
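The per-endpoint curl checks can be wrapped in a small poller for a routing layer's health logic. A sketch assuming the port layout from this post; the fetch function is injectable so the logic can be exercised without live servers:

```python
# Poll each tenant's vLLM /health endpoint and report status.
# Ports 8000/8001 match the multi-tenant layout described above.
import urllib.request

def _http_ok(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def check_tenants(ports, fetch=_http_ok):
    """Return {port: healthy?} for each tenant's vLLM endpoint."""
    return {p: fetch(f"http://localhost:{p}/health") for p in ports}

# Example: report which tenant backends are up
for port, healthy in check_tenants([8000, 8001]).items():
    print(f"tenant on :{port} -> {'up' if healthy else 'DOWN'}")
```

An unhealthy backend can then be dropped from the nginx or Traefik upstream pool until it recovers.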
Cost Analysis: Fractional vs Full GPU vs Spot
| Strategy | Best for | Cost vs baseline |
|---|---|---|
| Full GPU on-demand | Large models (70B+ FP16), high traffic, full VRAM needed | Baseline |
| Fractional GPU (MIG/MPS) | 7B-13B models, moderate traffic, multi-tenant | 40-70% lower per-model |
| Spot GPU (A100 available on Spheron) | Fault-tolerant batch inference, checkpoint-enabled | 30-50% lower vs on-demand |
| Fractional + spot combined | Batch inference with small models | Largest savings |
A100 80GB SXM4 instances start at $1.05/hr on-demand, with spot pricing from $0.45/hr. A100 80GB PCIe starts at $1.43/hr on-demand with spot from $1.14/hr. Combining spot with MIG partitioning on an A100 gives you multiple model instances on an interruptible GPU, suitable for batch inference jobs that checkpoint their state.
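"Checkpoint their state" can be as simple as persisting a cursor so a preempted spot job resumes where it stopped. A minimal sketch; `run_model` is a stand-in for your actual inference call, and the checkpoint path is an arbitrary choice:

```python
# Checkpoint-friendly batch inference for interruptible spot GPUs:
# persist the next item index to disk every N items so a restarted
# job skips completed work.
import json
import os

CKPT = "batch_ckpt.json"  # hypothetical checkpoint path

def load_cursor() -> int:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["next_index"]
    return 0

def run_batch(items, run_model, checkpoint_every: int = 100):
    start = load_cursor()  # resume after a spot interruption
    for i in range(start, len(items)):
        run_model(items[i])
        if (i + 1) % checkpoint_every == 0:
            with open(CKPT, "w") as f:
                json.dump({"next_index": i + 1}, f)
    # mark completion so a restart does not reprocess anything
    with open(CKPT, "w") as f:
        json.dump({"next_index": len(items)}, f)
```

On a MIG-partitioned spot A100, each slice runs its own copy of this loop against its own checkpoint file.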
See serverless GPU vs on-demand vs reserved for a full breakdown of GPU billing models and when each reduces total cost.
When NOT to Use Fractional GPUs
Fractional GPU sharing is the right call for a specific subset of workloads. Here are the cases where it backfires:
- Model does not fit in the slice. A 70B FP16 model needs 140 GB VRAM. No MIG profile on a single H100 80GB comes close. Multi-GPU tensor parallelism is the answer, not fractioning.
- SM utilization is already above 80%. If a single inference process is using 80%+ of GPU compute under normal load, sharing SMs with another process adds latency without reducing cost. Profile first with nvidia-smi dmon.
- Speculative decoding pipelines. Draft model and target model run in lockstep. Capping SM allocation with MPS slows the draft model and can erase the latency benefit of speculative decoding entirely.
- Tensor parallelism across GPUs. TP workloads use NCCL all-reduce across multiple physical GPUs. MIG/MPS on a single GPU does not help multi-GPU tensor parallel inference, and MPS with multiple GPUs does not improve inter-GPU bandwidth.
- Long-context workloads with large batch sizes. 128K+ context windows with large batches fill VRAM with KV cache. The memory overhead leaves no room to share the GPU with another process. See NVMe KV cache offloading for LLM inference for strategies to reduce KV cache memory pressure.
Spheron for Fractional GPU Inference
Spheron provides bare-metal H100, A100, and L40S instances with full root access. This is required for MIG configuration (enabling MIG mode with nvidia-smi -i 0 -mig 1 needs root) and MPS daemon setup. Managed cloud VMs from hyperscalers often restrict MIG mode at the hypervisor level, meaning you cannot enable or reconfigure MIG partitions on AWS, GCP, or Azure GPU instances without special arrangements.
Pricing is per GPU per hour with per-minute billing. On hyperscalers, fractional GPU VMs bundle vGPU licensing fees and hypervisor management overhead into an opaque per-vCPU or per-GB rate. On Spheron bare metal, you pay the full GPU rate and use MIG or MPS to extract multiple model slots from the hardware. The math typically favors bare metal plus fractioning over paying for pre-fractioned VMs once you account for licensing markup.
Right-sizing workflow: start with a full GPU benchmark on Spheron using per-minute billing (you pay only for what you run). Measure actual SM and VRAM utilization with nvidia-smi dmon. If SM utilization is below 50% and VRAM is below 50% at your target traffic, switch to a MIG or MPS configuration and re-benchmark. If latency SLAs still hold, your cost per model drops by the sharing factor.
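The dmon measurement step can be scripted. A sketch that averages the SM column of `nvidia-smi dmon -s u` output; the column layout (gpu, sm, mem, enc, dec) is the common one but should be verified against your driver's header line:

```python
# Average the SM utilization column from `nvidia-smi dmon -s u` output
# to decide whether a GPU is a fractioning candidate (e.g., <50% SM).

def avg_sm_utilization(dmon_output: str) -> float:
    samples = []
    for line in dmon_output.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip the header/comment lines
        fields = line.split()
        samples.append(float(fields[1]))  # second column is sm%
    return sum(samples) / len(samples)

# Example dmon capture (illustrative values)
sample = """# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    35    28     0     0
    0    41    30     0     0
    0    32    27     0     0
"""
print(avg_sm_utilization(sample))  # 36.0 -> well under 50%, a sharing candidate
```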
H100, A100, and L40S are all available on Spheron. For spec comparisons and current pricing, see H100 rental, A100 rental, and L40S rental.
Most inference workloads need 20-40 GB of VRAM and 30-50% of a GPU's compute, not a full H100. Spheron bare-metal instances give you root access to configure MIG and MPS, per-minute billing so you only pay while running, and transparent pricing with no fractional GPU licensing markup.
