
Run Multiple LLMs on One GPU: MIG, Time-Slicing, and MPS Guide

Written by Mitrasish, Co-founder · Mar 26, 2026

Most LLM inference servers run at 30-40% GPU utilization. If you are hosting three separate 7B models each handling sporadic traffic, each GPU is active less than 15% of the time. You are paying for three GPUs but using the equivalent of half of one. The GPU cost optimization playbook documents this exact pattern as one of the top sources of wasted cloud spend.

The solution is GPU sharing: running multiple models on one GPU using hardware or software partitioning. Which method you choose depends on your GPU model, isolation requirements, and traffic pattern. This guide covers all three approaches: MIG, time-slicing, and MPS, with real commands and cost math. For a GPU selection guide covering L40S, H100, H200, and B200 for inference, see best GPU for AI inference 2026.
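The arithmetic behind that claim is worth making explicit. A quick sketch using the numbers from the paragraph above (illustrative figures, not measurements):

```shell
# Three dedicated GPUs, each busy ~15% of the time.
GPUS=3
BUSY_PCT=15

# GPU-equivalents of actual work being done:
EFFECTIVE=$(awk -v g="$GPUS" -v b="$BUSY_PCT" 'BEGIN { printf "%.2f", g * b / 100 }')
echo "paying for $GPUS GPUs, using the equivalent of $EFFECTIVE"
```

Three GPUs at 15% duty cycle deliver 0.45 GPU-equivalents of work, which is the "half of one" figure above.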

Three Ways to Share a GPU

| Method | Isolation | VRAM per process | Compatible GPUs | Overhead |
| --- | --- | --- | --- | --- |
| MIG | Hardware (dedicated) | Fixed slice | A100, H100, H200, B200 | Near zero |
| Time-slicing | None (shared pool) | Shared | All NVIDIA GPUs | Context switch latency |
| MPS | Process-level | Shared | Kepler+ (enhanced on Volta+) | Low (concurrent kernels) |
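One way to read the table: a small decision helper, sketched in shell. The function name and its input categories are illustrative, not an NVIDIA tool.

```shell
# Pick a sharing method from two inputs: isolation need and GPU family.
# Mirrors the comparison table above; purely illustrative.
choose_sharing() {
  isolation=$1   # "hard" | "process" | "none"
  mig_capable=$2 # "yes" for A100/H100/H200/B200, else "no"
  if [ "$isolation" = "hard" ] && [ "$mig_capable" = "yes" ]; then
    echo "MIG"
  elif [ "$isolation" = "process" ]; then
    echo "MPS"
  else
    echo "time-slicing"
  fi
}

choose_sharing hard yes     # MIG
choose_sharing process yes  # MPS
choose_sharing none no      # time-slicing
```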

MIG (Multi-Instance GPU)

MIG uses actual hardware partitioning at the SM (streaming multiprocessor) and memory controller level. When you create a MIG instance, the H100 assigns specific SM partitions and a dedicated VRAM range to that instance. The partitioning happens in silicon, not software, so a crash or memory fault in one instance cannot propagate to another.

Each MIG instance appears to CUDA applications as an independent device, with its own UUID and NVML identity. You can pass a MIG instance directly to a Docker container using its UUID, and that container has no visibility into other MIG instances on the same physical GPU.

The trade-off is inflexibility. MIG profile sizes are fixed (10 GB, 20 GB, 40 GB slices on H100 80GB). You cannot dynamically resize an instance without destroying and recreating it, which interrupts any running workload. And MIG requires root-level access to the physical GPU, which rules out managed cloud VMs where the hypervisor controls the MIG configuration.

Time-Slicing

Time-slicing is purely software. The NVIDIA device plugin for Kubernetes (or the driver directly) schedules GPU time between processes using round-robin context switching. Each process gets on the order of 1-2ms of GPU compute before the driver switches to the next process.

The major limitation is VRAM. Time-sliced processes all share the full GPU VRAM pool with no boundaries between them. If process A allocates 30 GB and process B tries to allocate 30 GB on a 48 GB GPU, one of them gets an OOM error. You have to manually constrain VRAM allocation per process (using gpu_memory_utilization in vLLM, for example) to avoid oversubscription.
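To see how tight that budget gets, here is the oversubscription check in shell. Numbers are illustrative for a 48 GB L40S shared three ways:

```shell
TOTAL_GB=48
PROCS=3
UTIL=0.30   # vLLM gpu_memory_utilization per process

# Per-process VRAM ceiling and the combined footprint:
PER_PROC=$(awk -v t="$TOTAL_GB" -v u="$UTIL" 'BEGIN { printf "%.1f", t * u }')
COMBINED=$(awk -v p="$PER_PROC" -v n="$PROCS" 'BEGIN { printf "%.1f", p * n }')
echo "each process may address ${PER_PROC} GB; combined ${COMBINED} GB of ${TOTAL_GB} GB"
# 0.30 x 3 = 0.90 of the GPU, leaving ~10% headroom. vLLM's default of
# 0.90 per process would try to claim ~43 GB three times over and OOM.
```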

Time-slicing works on every NVIDIA GPU: L40S, RTX 4090, A100, H100. Setup is a short Kubernetes ConfigMap plus a device plugin patch. It is the right tool when you want fast setup and isolation requirements are low, such as development environments or internal tools.

MPS (Multi-Process Service)

MPS is NVIDIA's middle-ground option. It allows multiple CUDA processes to execute kernels on the GPU at the same time (true concurrent execution), unlike time-slicing where only one process runs at a time. This can meaningfully improve throughput for workloads that do not individually saturate the GPU.

On pre-Volta GPUs, MPS used a shared GPU address space where memory faults could cross process boundaries. On Volta and later (which includes every data center GPU discussed in this post: A100, H100, H200, B200), each MPS client gets its own fully isolated GPU address space, so address space faults are contained per client. MPS still does not partition SMs the way MIG does, so a process that saturates SM capacity will starve other MPS clients. For untrusted multi-tenant workloads, MIG is the better option. MPS works well for controlled environments where you own all the processes and want concurrent kernel execution to improve SM utilization.

MPS requires starting an MPS daemon (nvidia-cuda-mps-control -d) before launching your processes, and tearing it down afterwards. The daemon manages the shared context.
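A minimal setup sketch, assuming a GPU host. The `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` variable is NVIDIA's real per-client SM cap; the even three-way split is our example, not a recommendation.

```shell
# Cap each of three MPS clients to roughly a third of the SMs.
CLIENTS=3
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=$(( 100 / CLIENTS ))
echo "per-client SM cap: ${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE}%"

# On the GPU host you would then run:
#   nvidia-cuda-mps-control -d            # start the daemon
#   ... launch your CUDA processes ...
#   echo quit | nvidia-cuda-mps-control   # tear it down
```

Capping active threads per client softens the SM-starvation problem described above, though it is still advisory scheduling, not the hard partitioning MIG provides.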

Which GPUs Support MIG

| GPU | MIG support | Max instances | Smallest profile | VRAM per 7-way slice |
| --- | --- | --- | --- | --- |
| H100 SXM5 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 SXM4 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H200 SXM5 141GB | Yes | 7 | 1g.18gb | 18 GB |
| B200 180GB | Yes | 7 | 1g.23gb | 23 GB |
| L40S 48GB | No | N/A | N/A | N/A |
| RTX 4090 24GB | No | N/A | N/A | N/A |

The reason bare-metal access matters: AWS, GCP, and Azure provision GPUs as passthrough devices in shared hypervisors. The MIG configuration for each physical GPU is locked by the hypervisor layer and cannot be modified by tenant VMs. Enabling MIG mode and selecting profiles requires issuing commands directly to the GPU via the NVIDIA management library (NVML), which is blocked on shared VMs.

On a Spheron bare-metal H100 or A100 instance, you have root access to the physical GPU. You can enable MIG mode, create and destroy instances, and change profiles without any restrictions. See Spheron instance types for details on bare-metal options.

Tutorial: Run 3 LLMs on One H100 with MIG

This tutorial uses an H100 SXM5 80GB with three 2g.20gb MIG instances, each hosting a separate model. You will need root access on a bare-metal H100 node.

Step 1: Verify and Enable MIG Mode

First, confirm MIG support and current status:

bash
nvidia-smi --query-gpu=name,mig.mode.current --format=csv,noheader

Expected output: NVIDIA H100 80GB SXM5, Disabled

Enable MIG mode on GPU 0:

bash
sudo nvidia-smi -i 0 -mig 1

You should see: MIG mode of GPU 0 is: Enabled

Note: On some driver versions, enabling MIG mode requires either a GPU reset (sudo nvidia-smi --id=0 -r) or a full system reboot to take effect. If nvidia-smi still shows MIG as disabled after the command, try sudo reboot and re-run the enable command after restart.

Step 2: List Available MIG Profiles

bash
sudo nvidia-smi mig -lgip

This prints a table like:

+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                       |
| GPU   Name             ID    Instances   Memory   P2P    SM    DEC   ENC   |
|       (MIG 1g.10gb)    19        7/7      9.5GiB   No    14     0     0    |
|       (MIG 2g.20gb)    14        3/3     19.5GiB   No    28     1     0    |
|       (MIG 3g.40gb)     9        2/2     39.5GiB   No    42     2     0    |
|       (MIG 4g.40gb)     5        1/1     39.5GiB   No    56     2     0    |
|       (MIG 7g.80gb)     0        1/1     79.1GiB   No   132     7     0    |
+-----------------------------------------------------------------------------+

The profile ID in the ID column is what you use to create instances. Check this table on your specific hardware. Profile IDs can differ between H100 SXM5, H100 PCIe, and A100 variants. The example below uses 14 for 2g.20gb on H100 SXM5, but your GPU may list a different ID.
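Rather than reading the ID off the table by eye, you can grep it out. A sketch (the `profile_id` helper is ours), demonstrated against one line of the sample output above; on your node you would pipe real `nvidia-smi mig -lgip` output through it:

```shell
# Extract the profile ID for a given MIG profile name from lgip output.
profile_id() {
  grep "MIG $1" | awk '{ print $4 }'
}

# Demo against a line of the sample table from this post:
SAMPLE='|       (MIG 2g.20gb)    14        3/3     19.5GiB   No    28     1     0    |'
echo "$SAMPLE" | profile_id 2g.20gb
```

On your node: `sudo nvidia-smi mig -lgip | profile_id 2g.20gb`.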

Step 3: Create Three 2g.20gb Instances

bash
sudo nvidia-smi mig -cgi 14,14,14 -C

The -C flag creates compute instances automatically inside each GPU instance.

Verify:

bash
sudo nvidia-smi mig -lgi

Expected output shows three 2g.20gb instances and their SM placements:

+-------------------------------------------------------+
| Existing GPU Instances on GPU  0                      |
|  GPU   Name          Profile  Instance   Placement    |
|        (MIG 2g.20gb)   14       1        {0-27}       |
|        (MIG 2g.20gb)   14       2        {28-55}      |
|        (MIG 2g.20gb)   14       3        {56-83}      |
+-------------------------------------------------------+

Step 4: Get the MIG UUIDs

bash
nvidia-smi -L

This prints the physical GPU and each MIG instance with its UUID:

GPU 0: NVIDIA H100 80GB SXM5 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb     Device  0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb     Device  1: (UUID: MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
  MIG 2g.20gb     Device  2: (UUID: MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz)

Copy the three MIG UUIDs. You will use them in the next step.
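Copying UUIDs by hand is error-prone; they can be pulled out with grep instead. A sketch (the `mig_uuids` helper is ours), demonstrated here against a fabricated sample line in the same format as the output above:

```shell
# Pull every MIG instance UUID out of `nvidia-smi -L` output.
# Lines like "MIG 2g.20gb" have no hyphen after MIG, so only UUIDs match.
mig_uuids() {
  grep -o 'MIG-[0-9a-f-]*'
}

SAMPLE='  MIG 2g.20gb     Device  0: (UUID: MIG-a1b2c3d4-e5f6-7a8b-9c0d-e1f2a3b4c5d6)'
echo "$SAMPLE" | mig_uuids
```

On a real node, `nvidia-smi -L | mig_uuids` prints one UUID per line, ready to paste into the docker commands below.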

Step 5: Launch Three vLLM Containers on Separate Instances

Each container gets one MIG instance and listens on a different host port. This example serves Phi-4 14B (4-bit via bitsandbytes) on instance 0, Mistral 7B Instruct (4-bit via bitsandbytes) on instance 1, and Qwen3-8B (4-bit via bitsandbytes) on instance 2.

Important: The --gpus flag with MIG UUIDs requires Docker 19.03+ and the NVIDIA Container Toolkit. The quoting syntax is non-obvious and must be exact: use single quotes around "device=UUID".

bash
# Instance 0: Phi-4 14B 4-bit on port 8000
docker run -d --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model microsoft/phi-4 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192

# Instance 1: Mistral 7B Instruct 4-bit on port 8001
docker run -d --gpus '"device=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"' \
  -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192

# Instance 2: Qwen3-8B 4-bit on port 8002
docker run -d --gpus '"device=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"' \
  -p 8002:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192

For full vLLM server setup on Spheron, including instance selection and container configuration, see the Spheron vLLM server guide.

Step 6: Verify All Three Endpoints

bash
curl http://localhost:8000/v1/models | python3 -m json.tool
curl http://localhost:8001/v1/models | python3 -m json.tool
curl http://localhost:8002/v1/models | python3 -m json.tool

Each should return a different model name. You now have three independent LLM endpoints running on a single H100, with hardware-level memory isolation between them.

Tutorial: Time-Slice an L40S for Multiple Endpoints

Time-slicing is the Kubernetes-native approach for GPUs that do not support MIG. The NVIDIA device plugin handles context scheduling at the driver level.

Step 1: Create the Device Plugin ConfigMap

yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 3

Apply it:

bash
kubectl apply -f time-slicing-config.yaml

Step 2: Patch the Device Plugin to Use the Config

bash
kubectl patch clusterpolicies/cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

After patching, the node will advertise nvidia.com/gpu: 3 instead of 1, even though there is only one physical GPU.

Step 3: Deploy Three vLLM Pods

Each pod requests one GPU slice:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-a
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: "1"
    args:
      - "--model"
      - "Qwen/Qwen3-8B"
      - "--quantization"
      - "bitsandbytes"
      - "--load-format"
      - "bitsandbytes"
      - "--gpu-memory-utilization"
      - "0.30"
      - "--max-model-len"
      - "4096"

vLLM defaults to allocating 90% of GPU VRAM, so three pods launched with defaults will each try to claim nearly the whole GPU and crash-loop. When time-slicing three processes on one GPU, set --gpu-memory-utilization to 0.30 or lower and use INT4 quantization for any 7B+ model. With three processes sharing 48 GB, each process can address at most 14.4 GB of VRAM (0.30 × 48 GB). vLLM allocates KV cache from whatever VRAM remains after loading model weights, so an FP16 8B model (~15 GiB of weights) would exceed that budget and fail to start. BitsAndBytes INT4 quantization reduces Qwen3-8B weights to roughly 5-6 GB, leaving ~8-9 GB for KV cache, enough to serve 4096-token contexts comfortably.
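The budget arithmetic above, worked through in shell. Weight sizes are rough figures from this section, not measurements:

```shell
TOTAL_GB=48
UTIL=0.30
FP16_8B_GB=16   # 8B params x 2 bytes ~= 16 GB (~15 GiB)
INT4_8B_GB=5.5  # bitsandbytes 4-bit, midpoint of the 5-6 GB range

BUDGET=$(awk -v t="$TOTAL_GB" -v u="$UTIL" 'BEGIN { printf "%.1f", t * u }')
KV_FP16=$(awk -v b="$BUDGET" -v w="$FP16_8B_GB" 'BEGIN { printf "%.1f", b - w }')
KV_INT4=$(awk -v b="$BUDGET" -v w="$INT4_8B_GB" 'BEGIN { printf "%.1f", b - w }')
echo "budget ${BUDGET} GB: FP16 leaves ${KV_FP16} GB for KV cache, INT4 leaves ${KV_INT4} GB"
# A negative FP16 number means the weights alone blow the budget:
# the server fails before serving a single token.
```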

For Spheron LLM deployment options including vLLM and Ollama frameworks, see the Spheron LLM docs. For detailed vLLM production configuration including multi-GPU tensor parallelism and load balancing, see our vLLM production deployment guide.

Performance Impact

Approximate latency and throughput figures for Llama 3.1 8B across GPU sharing methods:

| Method | GPU | Config | TTFT (p50) | Throughput (tok/s) | Notes |
| --- | --- | --- | --- | --- | --- |
| Dedicated GPU | H100 SXM5 | 1x H100 solo | ~25ms | ~800 | Baseline |
| MIG 2g.20gb | H100 SXM5 | 2/7 H100 | ~28ms | ~265 | Near-linear scaling with compute fraction |
| Time-slicing x3 | L40S | L40S shared | ~35ms | ~190 | Higher variance under concurrent load |
| MPS x3 | H100 SXM5 | H100 shared | ~27ms | ~310 | Better SM utilization than time-slicing |

MIG throughput scales near-linearly with the fraction of compute assigned. A 2g.20gb instance has roughly 2/7 of the H100's compute, so 2/7 of 800 tok/s is about 229 tok/s as a naive compute-fraction prediction. The measured ~265 tok/s slightly exceeds the 229 tok/s linear prediction because MIG instances have dedicated memory controllers, reducing memory bandwidth contention that would occur with software-based sharing methods like time-slicing.
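The naive compute-fraction prediction in that comparison, spelled out:

```shell
# Linear scaling prediction for a 2g.20gb slice (2 of 7 compute units).
BASELINE_TPS=800
PREDICTED=$(awk -v b="$BASELINE_TPS" -v n=2 -v d=7 'BEGIN { printf "%.0f", b * n / d }')
MEASURED=265
echo "predicted ${PREDICTED} tok/s, measured ${MEASURED} tok/s"
# The measured figure beats the prediction because each MIG slice gets
# dedicated memory controllers rather than contending for bandwidth.
```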

Time-slicing overhead grows non-linearly under concurrent load. At low traffic (1-2 requests per model), the GPU often sits idle per-model anyway, so time-slicing overhead is minimal. Under three simultaneous sustained requests, context switching consumes 5-15% of available compute.

Cost Analysis

Scenario: 3 models serving moderate traffic (Mistral 7B Instruct, Phi-4, Qwen3-8B)

| Option | Setup | Cost/hr | Notes |
| --- | --- | --- | --- |
| 1x H100 SXM5 with MIG (3 instances) | 3x 2g.20gb | $2.40 | Single GPU, 3 isolated endpoints |
| 3x A100 SXM4 80GB (dedicated) | 3 separate GPUs | $3.15 | Full isolation, 3x the GPU count |
| 3x A100 PCIe 80GB (dedicated) | 3 separate GPUs | $3.21 | PCIe variant ($1.07/GPU x 3) |
| 3x time-sliced on 1x L40S | 1 GPU shared | $2.04 | No VRAM isolation, works for low traffic |

Running three dedicated A100 SXM4 instances costs $3.15/hr vs $2.40/hr for one H100 with MIG. That is a 24% cost reduction while keeping hardware isolation. If your models fit in 20 GB each and you care about fault isolation, MIG on H100 is cheaper than three separate A100s.

The L40S time-slicing option costs $2.04/hr for one GPU shared across three endpoints. That is cheaper than three separate A100s, but carries trade-offs: no VRAM isolation, higher latency variance under load, and less predictable behavior when all three models get simultaneous traffic spikes.
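The savings math, using the hourly rates from the table above and a 30-day month:

```shell
H100_MIG=2.40  # 1x H100 with 3 MIG instances, $/hr
A100_X3=3.15   # 3x dedicated A100 SXM4, $/hr

SAVINGS_PCT=$(awk -v d="$A100_X3" -v m="$H100_MIG" 'BEGIN { printf "%.0f", (d - m) / d * 100 }')
MONTHLY=$(awk -v d="$A100_X3" -v m="$H100_MIG" 'BEGIN { printf "%.0f", (d - m) * 24 * 30 }')
echo "MIG saves ${SAVINGS_PCT}% (~\$${MONTHLY}/month) vs three dedicated A100s"
```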

Pricing fluctuates based on GPU availability. The prices above are based on 26 Mar 2026 and may have changed. Check current GPU pricing for live rates.

When to Use Each Method

| Use case | Recommended method | Why |
| --- | --- | --- |
| Multi-tenant SaaS (per-customer isolation) | MIG | Hardware memory faults cannot cross instance boundaries |
| Development and testing (multiple models) | Time-slicing | No setup overhead, works on any GPU |
| Batch inference, high throughput | MPS | Concurrent kernel execution fills SMs better |
| Production API, mixed model sizes | MIG (larger slices) | Guaranteed VRAM + predictable latency |
| Low-traffic endpoints, cost optimization | Time-slicing | One GPU for 3-4 models at minimal overhead |

Cloud VMs from AWS, GCP, and Azure lock MIG configuration at the hypervisor layer, which means tenants cannot change MIG profiles at runtime. Spheron provides root-level GPU access on bare-metal nodes, so you can enable MIG mode, select profiles, and reconfigure as your workload changes without filing a support ticket or waiting for a new instance type. See H100 GPU rental and current pricing for on-demand rates. For details on bare-metal access, see Spheron instance types.


Running multiple LLMs on shared GPUs requires root-level hardware access that most managed clouds don't provide. Spheron bare-metal H100 and A100 instances come with full MIG access, per-minute billing, and no minimum commitment.

Rent an H100 → | Rent an A100 → | View all pricing →

Get started on Spheron →
