Most LLM inference servers run at 30-40% GPU utilization. If you are hosting three separate 7B models each handling sporadic traffic, each GPU is active less than 15% of the time. You are paying for three GPUs but using the equivalent of half of one. The GPU cost optimization playbook documents this exact pattern as one of the top sources of wasted cloud spend. The solution is GPU sharing: running multiple models on one GPU using hardware or software partitioning. Which method you choose depends on your GPU model, isolation requirements, and traffic pattern. This guide covers all three: MIG, time-slicing, and MPS, with real commands and cost math. For a GPU selection guide covering L40S, H100, H200, and B200 for inference, see best GPU for AI inference 2026.
Three Ways to Share a GPU
| Method | Isolation | VRAM per process | Compatible GPUs | Overhead |
|---|---|---|---|---|
| MIG | Hardware (dedicated) | Fixed slice | A100, H100, H200, B200 | Near zero |
| Time-slicing | None (shared pool) | Shared | All NVIDIA GPUs | Context switch latency |
| MPS | Process-level | Shared | Kepler+ (enhanced on Volta+) | Low (concurrent kernels) |
MIG (Multi-Instance GPU)
MIG uses actual hardware partitioning at the SM (streaming multiprocessor) and memory controller level. When you create a MIG instance, the H100 assigns specific SM partitions and a dedicated VRAM range to that instance. The partitioning happens in silicon, not software, so a crash or memory fault in one instance cannot propagate to another.
Each MIG instance appears to the operating system as a completely independent CUDA device, with its own PCIe bus ID and NVML identity. You can pass a MIG instance directly to a Docker container using its UUID, and that container has no visibility into other MIG instances on the same physical GPU.
The trade-off is inflexibility. MIG profile sizes are fixed (10 GB, 20 GB, 40 GB slices on H100 80GB). You cannot dynamically resize an instance without destroying and recreating it, which interrupts any running workload. And MIG requires root-level access to the physical GPU, which rules out managed cloud VMs where the hypervisor controls the MIG configuration.
Time-Slicing
Time-slicing is purely software. The NVIDIA device plugin for Kubernetes (or the driver directly) schedules GPU time between processes using round-robin context switching. Each process gets on the order of 1-2ms of GPU compute before the driver switches to the next process.
The major limitation is VRAM. Time-sliced processes all share the full GPU VRAM pool with no boundaries between them. If process A allocates 30 GB and process B tries to allocate 30 GB on a 48 GB GPU, one of them gets an OOM error. You have to manually constrain VRAM allocation per process (using `--gpu-memory-utilization` in vLLM, for example) so the total stays under the physical capacity.
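The constraint can be stated as a one-line budget check. This is a sketch, not an NVIDIA API: `fits_on_gpu` and its `headroom` parameter are hypothetical names, and the 5% headroom for driver and CUDA context overhead is an assumption, not a measured figure.

```python
def fits_on_gpu(fractions, headroom=0.05):
    """Return True if per-process VRAM fractions (e.g. vLLM's
    gpu_memory_utilization values) fit on one time-sliced GPU,
    leaving some headroom for driver/context overhead (assumed 5%)."""
    return sum(fractions) <= 1.0 - headroom

# Three vLLM processes capped at 0.30 each fit on one GPU (0.90 total)...
print(fits_on_gpu([0.30, 0.30, 0.30]))   # True
# ...but two processes wanting 30 GB each on a 48 GB card do not (1.25 total).
print(fits_on_gpu([30 / 48, 30 / 48]))   # False
```

Run this check before deploying: unlike MIG, nothing in the driver will stop an oversubscribed configuration from starting and then OOMing under load.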
Time-slicing works on every NVIDIA GPU: L40S, RTX 4090, A100, H100. Setup is a three-line Kubernetes ConfigMap. It is the right tool when you want fast setup and isolation requirements are low, such as development environments or internal tools.
MPS (Multi-Process Service)
MPS is NVIDIA's middle-ground option. It allows multiple CUDA processes to execute kernels on the GPU at the same time (true concurrent execution), unlike time-slicing where only one process runs at a time. This can meaningfully improve throughput for workloads that do not individually saturate the GPU.
On pre-Volta GPUs, MPS used a shared GPU address space where memory faults could cross process boundaries. On Volta and later (which includes every data center GPU discussed in this post: A100, H100, H200, B200), each MPS client gets its own fully isolated GPU address space, so address space faults are contained per client. MPS still does not partition SMs the way MIG does, so a process that saturates SM capacity will starve other MPS clients. For untrusted multi-tenant workloads, MIG is the better option. MPS works well for controlled environments where you own all the processes and want concurrent kernel execution to improve SM utilization.
MPS requires starting an MPS daemon (`nvidia-cuda-mps-control -d`) before launching your processes, and tearing it down afterwards. The daemon manages the shared context.
Which GPUs Support MIG
| GPU | MIG Support | Max Instances | Max Profile | VRAM per 7-way slice |
|---|---|---|---|---|
| H100 SXM5 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 SXM4 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H200 SXM5 141GB | Yes | 7 | 1g.18gb | 18 GB |
| B200 180GB | Yes | 7 | 1g.23gb | 23 GB |
| L40S 48GB | No | N/A | N/A | N/A |
| RTX 4090 24GB | No | N/A | N/A | N/A |
The reason bare-metal access matters: AWS, GCP, and Azure provision GPUs as passthrough devices in shared hypervisors. The MIG configuration for each physical GPU is locked by the hypervisor layer and cannot be modified by tenant VMs. Enabling MIG mode and selecting profiles requires issuing commands directly to the GPU via the NVIDIA management library (NVML), which is blocked on shared VMs.
On a Spheron bare-metal H100 or A100 instance, you have root access to the physical GPU. You can enable MIG mode, create and destroy instances, and change profiles without any restrictions. See Spheron instance types for details on bare-metal options.
Tutorial: Run 3 LLMs on One H100 with MIG
This tutorial uses an H100 SXM5 80GB with three 2g.20gb MIG instances, each hosting a separate model. You will need root access on a bare-metal H100 node.
Step 1: Verify and Enable MIG Mode
First, confirm MIG support and current status:
```bash
nvidia-smi --query-gpu=name,mig.mode.current --format=csv,noheader
```

Expected output: `NVIDIA H100 80GB SXM5, Disabled`
Enable MIG mode on GPU 0:
```bash
sudo nvidia-smi -i 0 -mig 1
```

You should see: `MIG mode of GPU 0 is: Enabled`
Note: On some driver versions, enabling MIG mode requires either a GPU reset (`sudo nvidia-smi --id=0 -r`) or a full system reboot to take effect. If `nvidia-smi` still shows MIG as disabled after the command, try `sudo reboot` and re-run the enable command after restart.
Step 2: List Available MIG Profiles
```bash
sudo nvidia-smi mig -lgip
```

This prints a table like:

```
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU  Name           ID  Instances  Memory    P2P  SM    DEC  ENC            |
| (MIG 1g.10gb)       19  7/7        9.5GiB    No   14    0    0              |
| (MIG 2g.20gb)       14  3/3        19.5GiB   No   28    1    0              |
| (MIG 3g.40gb)        9  2/2        39.5GiB   No   42    2    0              |
| (MIG 4g.40gb)        5  1/1        39.5GiB   No   56    2    0              |
| (MIG 7g.80gb)        0  1/1        79.1GiB   No   132   7    0              |
+-----------------------------------------------------------------------------+
```

The profile ID in the ID column is what you use to create instances. Check this table on your specific hardware: profile IDs can differ between H100 SXM5, H100 PCIe, and A100 variants. The example below uses 14 for 2g.20gb on H100 SXM5, but your GPU may list a different ID.
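Whether a set of profiles fits on one GPU comes down to a slice budget: each profile consumes a fixed number of the seven compute slices. The sketch below encodes that budget for the H100 80GB profiles listed above; `fits` is a hypothetical helper, and it is deliberately a necessary-but-not-sufficient check, since the driver also enforces fixed placement positions that this simplification ignores.

```python
# Compute slices consumed by each H100 80GB MIG profile (7 total on the GPU).
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}

def fits(profiles, total_slices=7):
    """Necessary (not sufficient) check: do the requested profiles
    stay within the GPU's compute-slice budget?"""
    return sum(SLICES[p] for p in profiles) <= total_slices

print(fits(["2g.20gb"] * 3))                     # True: 6 of 7 slices
print(fits(["3g.40gb", "4g.40gb"]))              # True: exactly 7
print(fits(["3g.40gb", "3g.40gb", "2g.20gb"]))   # False: 8 > 7
```

The three 2g.20gb instances used in this tutorial leave one slice unused; that slack is why the `-cgi 14,14,14` creation step in Step 3 succeeds.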
Step 3: Create Three 2g.20gb Instances
```bash
sudo nvidia-smi mig -cgi 14,14,14 -C
```

The `-C` flag creates compute instances automatically inside each GPU instance.
Verify:
```bash
sudo nvidia-smi mig -lgi
```

Expected output shows three 2g.20gb instances:

```
+-------------------------------------------------------+
| Existing GPU Instances on GPU 0                       |
| GPU  Name           Profile  Instance  Placement      |
| (MIG 2g.20gb)       14       1         {0-27}         |
| (MIG 2g.20gb)       14       2         {28-55}        |
| (MIG 2g.20gb)       14       3         {56-83}        |
+-------------------------------------------------------+
```

Step 4: Get the MIG UUIDs
```bash
nvidia-smi -L
```

This prints the physical GPU and each MIG instance with its UUID:

```
GPU 0: NVIDIA H100 80GB SXM5 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb Device 1: (UUID: MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
  MIG 2g.20gb Device 2: (UUID: MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz)
```

Copy the three MIG UUIDs. You will use them in the next step.
Step 5: Launch Three vLLM Containers on Separate Instances
Each container gets one MIG instance and listens on a different host port. This example serves Phi-4 14B (4-bit via bitsandbytes) on instance 0, Mistral 7B Instruct (4-bit via bitsandbytes) on instance 1, and Qwen3-8B (4-bit via bitsandbytes) on instance 2.
Important: The `--gpus` flag with MIG UUIDs requires Docker 19.03+ and the NVIDIA Container Toolkit. The quoting syntax is non-obvious and must be exact: use single quotes around `"device=UUID"`.
```bash
# Instance 0: Phi-4 14B 4-bit on port 8000
docker run -d --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model microsoft/phi-4 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

```bash
# Instance 1: Mistral 7B Instruct 4-bit on port 8001
docker run -d --gpus '"device=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"' \
  -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

```bash
# Instance 2: Qwen3-8B 4-bit on port 8002
docker run -d --gpus '"device=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"' \
  -p 8002:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

For full vLLM server setup on Spheron, including instance selection and container configuration, see the Spheron vLLM server guide.
Step 6: Verify All Three Endpoints
```bash
curl http://localhost:8000/v1/models | python3 -m json.tool
curl http://localhost:8001/v1/models | python3 -m json.tool
curl http://localhost:8002/v1/models | python3 -m json.tool
```

Each should return a different model name. You now have three independent LLM endpoints running on a single H100, with hardware-level memory isolation between them.
Tutorial: Time-Slice an L40S for Multiple Endpoints
Time-slicing is the Kubernetes-native approach for GPUs that do not support MIG. The NVIDIA device plugin handles context scheduling at the driver level.
Step 1: Create the Device Plugin ConfigMap
```yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 3
```

Apply it:
```bash
kubectl apply -f time-slicing-config.yaml
```

Step 2: Patch the Device Plugin to Use the Config
```bash
kubectl patch clusterpolicies/cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

After patching, the node will advertise `nvidia.com/gpu: 3` instead of 1, even though there is only one physical GPU.
Step 3: Deploy Three vLLM Pods
Each pod requests one GPU slice:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-a
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: "1"
    args:
    - "--model"
    - "Qwen/Qwen3-8B"
    - "--quantization"
    - "bitsandbytes"
    - "--load-format"
    - "bitsandbytes"
    - "--gpu-memory-utilization"
    - "0.30"
    - "--max-model-len"
    - "4096"
```

Set `--gpu-memory-utilization` to 0.30 or lower when time-slicing three processes on one GPU, and use INT4 quantization for any 7B+ model. With three processes sharing 48 GB, each process can address at most 14.4 GB of VRAM (0.30 × 48 GB). vLLM allocates KV cache from whatever VRAM remains after loading model weights, so an FP16 8B model (~15 GiB of weights) would exceed that budget and fail to start. BitsAndBytes INT4 quantization reduces Qwen3-8B weights to roughly 5-6 GB, leaving ~8-9 GB for KV cache, enough to serve 4096-token contexts comfortably. vLLM defaults to 90% VRAM allocation, which causes all three pods to crash-loop when running concurrently.
For Spheron LLM deployment options including vLLM and Ollama frameworks, see the Spheron LLM docs. For detailed vLLM production configuration including multi-GPU tensor parallelism and load balancing, see our vLLM production deployment guide.
Performance Impact
Approximate latency and throughput figures for Llama 3.1 8B across GPU sharing methods:
| Method | GPU | Config | TTFT (p50) | Throughput (tok/s) | Notes |
|---|---|---|---|---|---|
| Dedicated GPU | H100 SXM5 | 1x H100 solo | ~25ms | ~800 | Baseline |
| MIG 2g.20gb | H100 SXM5 | 2/7 H100 | ~28ms | ~265 | Near-linear scaling with compute fraction |
| Time-slicing x3 | L40S | L40S shared | ~35ms | ~190 | Higher variance under concurrent load |
| MPS x3 | H100 SXM5 | H100 shared | ~27ms | ~310 | Better SM utilization than time-slicing |
MIG throughput scales near-linearly with the fraction of compute assigned. A 2g.20gb instance has roughly 2/7 of the H100's compute, so 2/7 of 800 tok/s is about 229 tok/s as a naive compute-fraction prediction. The measured ~265 tok/s slightly exceeds the 229 tok/s linear prediction because MIG instances have dedicated memory controllers, reducing memory bandwidth contention that would occur with software-based sharing methods like time-slicing.
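The compute-fraction prediction is simple arithmetic, using the baseline from the table above:

```python
# Naive compute-fraction prediction for a 2g.20gb MIG slice
# (2 of the H100's 7 compute slices; baseline from the table above).
baseline_tok_s = 800
predicted = baseline_tok_s * 2 / 7
measured = 265                     # figure reported in the table

print(round(predicted))            # 229
print(measured > predicted)        # True: dedicated memory controllers
                                   # remove bandwidth contention
```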
Time-slicing overhead grows non-linearly under concurrent load. At low traffic (1-2 requests per model), the GPU often sits idle per-model anyway, so time-slicing overhead is minimal. Under three simultaneous sustained requests, context switching consumes 5-15% of available compute.
Cost Analysis
Scenario: 3 models serving moderate traffic (Mistral 7B Instruct, Phi-4, Qwen3-8B)
| Option | Setup | Cost/hr | Notes |
|---|---|---|---|
| 1x H100 SXM5 with MIG (3 instances) | 3x 2g.20gb | $2.40 | Single GPU, 3 isolated endpoints |
| 3x A100 SXM4 80GB (dedicated) | 3 separate GPUs | $3.15 | Full isolation, 3x the GPU count |
| 3x A100 PCIe 80GB (dedicated) | 3 separate GPUs | $3.21 | PCIe variant ($1.07/GPU x 3) |
| 3x time-sliced on 1x L40S | 1 GPU shared | $2.04 | No VRAM isolation, works for low traffic |
Running three dedicated A100 SXM4 instances costs $3.15/hr vs $2.40/hr for one H100 with MIG. That is a 24% cost reduction while keeping hardware isolation. If your models fit in 20 GB each and you care about fault isolation, MIG on H100 is cheaper than three separate A100s.
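The 24% figure comes straight from the hourly rates in the table:

```python
# Cost comparison from the table above (rates as of the date in this post).
dedicated_3x_a100 = 3.15   # $/hr for 3x A100 SXM4 80GB
h100_mig = 2.40            # $/hr for 1x H100 with 3 MIG instances

savings = (dedicated_3x_a100 - h100_mig) / dedicated_3x_a100
print(f"{savings:.0%}")    # 24%
```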
The L40S time-slicing option costs $2.04/hr for one GPU shared across three endpoints. That is cheaper than three separate A100s, but carries trade-offs: no VRAM isolation, higher latency variance under load, and less predictable behavior when all three models get simultaneous traffic spikes.
Pricing fluctuates with GPU availability. The prices above were captured on 26 Mar 2026 and may have changed. Check current GPU pricing for live rates.
When to Use Each Method
| Use Case | Recommended Method | Why |
|---|---|---|
| Multi-tenant SaaS (per-customer isolation) | MIG | Hardware memory faults cannot cross instance boundaries |
| Development and testing (multiple models) | Time-slicing | No setup overhead, works on any GPU |
| Batch inference, high throughput | MPS | Concurrent kernel execution fills SMs better |
| Production API, mixed model sizes | MIG (larger slices) | Guaranteed VRAM + predictable latency |
| Low-traffic endpoints, cost optimization | Time-slicing | One GPU for 3-4 models at minimal overhead |
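The decision table above can be reduced to a few conditionals. This is a sketch of one reasonable encoding, not an official policy; `sharing_method` and its parameters are hypothetical names.

```python
def sharing_method(needs_isolation: bool, mig_capable_gpu: bool,
                   high_concurrency: bool) -> str:
    """Pick a GPU sharing method following the decision table above."""
    if needs_isolation and mig_capable_gpu:
        return "MIG"            # hardware VRAM/fault isolation
    if high_concurrency:
        return "MPS"            # concurrent kernels fill SMs better
    return "time-slicing"       # fast setup, works on any NVIDIA GPU

print(sharing_method(True, True, False))    # MIG: multi-tenant SaaS
print(sharing_method(False, False, True))   # MPS: batch inference
print(sharing_method(False, True, False))   # time-slicing: dev/test
```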
Cloud VMs from AWS, GCP, and Azure lock MIG configuration at the hypervisor layer, which means tenants cannot change MIG profiles at runtime. Spheron provides root-level GPU access on bare-metal nodes, so you can enable MIG mode, select profiles, and reconfigure as your workload changes without filing a support ticket or waiting for a new instance type. See H100 GPU rental and current pricing for on-demand rates. For details on bare-metal access, see Spheron instance types.
Running multiple LLMs on shared GPUs requires root-level hardware access that most managed clouds don't provide. Spheron bare-metal H100 and A100 instances come with full MIG access, per-minute billing, and no minimum commitment.
