Most LLM inference servers run at 30-40% GPU utilization. If you are hosting three separate 7B models each handling sporadic traffic, each GPU is active less than 15% of the time. You are paying for three GPUs but using the equivalent of half of one. The GPU cost optimization playbook documents this exact pattern as one of the top sources of wasted cloud spend. The solution is GPU sharing: running multiple models on one GPU using hardware or software partitioning. Which method you choose depends on your GPU model, isolation requirements, and traffic pattern. This guide covers all three: MIG, time-slicing, and MPS, with real commands and cost math. For a GPU selection guide covering L40S, H100, H200, and B200 for inference, see best GPU for AI inference 2026.
Three Ways to Share a GPU
| Method | Isolation | VRAM per process | Compatible GPUs | Overhead |
|---|---|---|---|---|
| MIG | Hardware (dedicated) | Fixed slice | A100, H100, H200, B200 | Near zero |
| Time-slicing | None (shared pool) | Shared | All NVIDIA GPUs | Context switch latency |
| MPS | Process-level | Shared | Kepler+ (enhanced on Volta+) | Low (concurrent kernels) |
MIG (Multi-Instance GPU)
MIG uses actual hardware partitioning at the SM (streaming multiprocessor) and memory controller level. When you create a MIG instance, the H100 assigns specific SM partitions and a dedicated VRAM range to that instance. The partitioning happens in silicon, not software, so a crash or memory fault in one instance cannot propagate to another.
Each MIG instance appears to the operating system as a completely independent CUDA device, with its own PCIe bus ID and NVML identity. You can pass a MIG instance directly to a Docker container using its UUID, and that container has no visibility into other MIG instances on the same physical GPU.
The trade-off is inflexibility. MIG profile sizes are fixed (10 GB, 20 GB, 40 GB slices on H100 80GB). You cannot dynamically resize an instance without destroying and recreating it, which interrupts any running workload. And MIG requires root-level access to the physical GPU, which rules out managed cloud VMs where the hypervisor controls the MIG configuration.
Time-Slicing
Time-slicing is purely software. The NVIDIA device plugin for Kubernetes (or the driver directly) schedules GPU time between processes using round-robin context switching. Each process gets on the order of 1-2ms of GPU compute before the driver switches to the next process.
The major limitation is VRAM. Time-sliced processes all share the full GPU VRAM pool with no boundaries between them. If process A allocates 30 GB and process B tries to allocate 30 GB on a 48 GB GPU, one of them gets an OOM error. You have to manually constrain VRAM allocation per process (using `--gpu-memory-utilization` in vLLM, for example) so the total stays under the physical capacity.
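The constraint can be stated as a one-line budget check. This is a sketch, not an NVIDIA API: `fits_on_gpu` and its `headroom` parameter are hypothetical names, and the 5% headroom for driver and CUDA context overhead is an assumption, not a measured figure.

```python
def fits_on_gpu(fractions, headroom=0.05):
    """Return True if per-process VRAM fractions (e.g. vLLM's
    gpu_memory_utilization values) fit on one time-sliced GPU,
    leaving some headroom for driver/context overhead (assumed 5%)."""
    return sum(fractions) <= 1.0 - headroom

# Three vLLM processes capped at 0.30 each fit on one GPU (0.90 total)...
print(fits_on_gpu([0.30, 0.30, 0.30]))   # True
# ...but two processes wanting 30 GB each on a 48 GB card do not (1.25 total).
print(fits_on_gpu([30 / 48, 30 / 48]))   # False
```

Run this check before deploying: unlike MIG, nothing in the driver will stop an oversubscribed configuration from starting and then OOMing under load.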
Time-slicing works on every NVIDIA GPU: L40S, RTX 4090, A100, H100. Setup is a three-line Kubernetes ConfigMap. It is the right tool when you want fast setup and isolation requirements are low, such as development environments or internal tools.
MPS (Multi-Process Service)
MPS is NVIDIA's middle-ground option. It allows multiple CUDA processes to execute kernels on the GPU at the same time (true concurrent execution), unlike time-slicing where only one process runs at a time. This can meaningfully improve throughput for workloads that do not individually saturate the GPU.
On pre-Volta GPUs, MPS used a shared GPU address space where memory faults could cross process boundaries. On Volta and later (which includes every data center GPU discussed in this post: A100, H100, H200, B200), each MPS client gets its own fully isolated GPU address space, so address space faults are contained per client. MPS still does not partition SMs the way MIG does, so a process that saturates SM capacity will starve other MPS clients. For untrusted multi-tenant workloads, MIG is the better option. MPS works well for controlled environments where you own all the processes and want concurrent kernel execution to improve SM utilization.
MPS requires starting an MPS daemon (`nvidia-cuda-mps-control -d`) before launching your processes, and tearing it down afterwards. The daemon manages the shared context.
Which GPUs Support MIG
| GPU | MIG Support | Max Instances | Max Profile | VRAM per 7-way slice |
|---|---|---|---|---|
| H100 SXM5 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 SXM4 80GB | Yes | 7 | 1g.10gb | 10 GB |
| A100 PCIe 80GB | Yes | 7 | 1g.10gb | 10 GB |
| H200 SXM5 141GB | Yes | 7 | 1g.18gb | 18 GB |
| B200 180GB | Yes | 7 | 1g.23gb | 23 GB |
| L40S 48GB | No | N/A | N/A | N/A |
| RTX 4090 24GB | No | N/A | N/A | N/A |
The reason bare-metal access matters: AWS, GCP, and Azure provision GPUs as passthrough devices in shared hypervisors. The MIG configuration for each physical GPU is locked by the hypervisor layer and cannot be modified by tenant VMs. Enabling MIG mode and selecting profiles requires issuing commands directly to the GPU via the NVIDIA management library (NVML), which is blocked on shared VMs.
On a Spheron bare-metal H100 or A100 instance, you have root access to the physical GPU. You can enable MIG mode, create and destroy instances, and change profiles without any restrictions. See Spheron instance types for details on bare-metal options.
Tutorial: Run 3 LLMs on One H100 with MIG
This tutorial uses an H100 SXM5 80GB with three 2g.20gb MIG instances, each hosting a separate model. You will need root access on a bare-metal H100 node.
Step 1: Verify and Enable MIG Mode
First, confirm MIG support and current status:
```bash
nvidia-smi --query-gpu=name,mig.mode.current --format=csv,noheader
```

Expected output: `NVIDIA H100 80GB SXM5, Disabled`
Enable MIG mode on GPU 0:
```bash
sudo nvidia-smi -i 0 -mig 1
```

You should see: `MIG mode of GPU 0 is: Enabled`
Note: On some driver versions, enabling MIG mode requires either a GPU reset (`sudo nvidia-smi --id=0 -r`) or a full system reboot to take effect. If `nvidia-smi` still shows MIG as disabled after the command, try `sudo reboot` and re-run the enable command after restart.
Step 2: List Available MIG Profiles
```bash
sudo nvidia-smi mig -lgip
```

This prints a table like:

```
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU  Name           ID  Instances  Memory    P2P  SM    DEC  ENC            |
| (MIG 1g.10gb)       19  7/7        9.5GiB    No   14    0    0              |
| (MIG 2g.20gb)       14  3/3        19.5GiB   No   28    1    0              |
| (MIG 3g.40gb)        9  2/2        39.5GiB   No   42    2    0              |
| (MIG 4g.40gb)        5  1/1        39.5GiB   No   56    2    0              |
| (MIG 7g.80gb)        0  1/1        79.1GiB   No   132   7    0              |
+-----------------------------------------------------------------------------+
```

The profile ID in the ID column is what you use to create instances. Check this table on your specific hardware: profile IDs can differ between H100 SXM5, H100 PCIe, and A100 variants. The example below uses 14 for 2g.20gb on H100 SXM5, but your GPU may list a different ID.
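Whether a set of profiles fits on one GPU comes down to a slice budget: each profile consumes a fixed number of the seven compute slices. The sketch below encodes that budget for the H100 80GB profiles listed above; `fits` is a hypothetical helper, and it is deliberately a necessary-but-not-sufficient check, since the driver also enforces fixed placement positions that this simplification ignores.

```python
# Compute slices consumed by each H100 80GB MIG profile (7 total on the GPU).
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}

def fits(profiles, total_slices=7):
    """Necessary (not sufficient) check: do the requested profiles
    stay within the GPU's compute-slice budget?"""
    return sum(SLICES[p] for p in profiles) <= total_slices

print(fits(["2g.20gb"] * 3))                     # True: 6 of 7 slices
print(fits(["3g.40gb", "4g.40gb"]))              # True: exactly 7
print(fits(["3g.40gb", "3g.40gb", "2g.20gb"]))   # False: 8 > 7
```

The three 2g.20gb instances used in this tutorial leave one slice unused; that slack is why the `-cgi 14,14,14` creation step in Step 3 succeeds.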
Step 3: Create Three 2g.20gb Instances
```bash
sudo nvidia-smi mig -cgi 14,14,14 -C
```

The `-C` flag creates compute instances automatically inside each GPU instance.
Verify:
```bash
sudo nvidia-smi mig -lgi
```

Expected output shows three 2g.20gb instances:

```
+-------------------------------------------------------+
| Existing GPU Instances on GPU 0                       |
| GPU  Name           Profile  Instance  Placement      |
| (MIG 2g.20gb)       14       1         {0-27}         |
| (MIG 2g.20gb)       14       2         {28-55}        |
| (MIG 2g.20gb)       14       3         {56-83}        |
+-------------------------------------------------------+
```

Step 4: Get the MIG UUIDs
```bash
nvidia-smi -L
```

This prints the physical GPU and each MIG instance with its UUID:

```
GPU 0: NVIDIA H100 80GB SXM5 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 2g.20gb Device 1: (UUID: MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
  MIG 2g.20gb Device 2: (UUID: MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz)
```

Copy the three MIG UUIDs. You will use them in the next step.
Step 5: Launch Three vLLM Containers on Separate Instances
Each container gets one MIG instance and listens on a different host port. This example serves Phi-4 14B (4-bit via bitsandbytes) on instance 0, Mistral 7B Instruct (4-bit via bitsandbytes) on instance 1, and Qwen3-8B (4-bit via bitsandbytes) on instance 2.
Important: The `--gpus` flag with MIG UUIDs requires Docker 19.03+ and the NVIDIA Container Toolkit. The quoting syntax is non-obvious and must be exact: use single quotes around `"device=UUID"`.
```bash
# Instance 0: Phi-4 14B 4-bit on port 8000
docker run -d --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model microsoft/phi-4 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

```bash
# Instance 1: Mistral 7B Instruct 4-bit on port 8001
docker run -d --gpus '"device=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"' \
  -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

```bash
# Instance 2: Qwen3-8B 4-bit on port 8002
docker run -d --gpus '"device=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"' \
  -p 8002:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16 \
  --max-model-len 8192
```

For full vLLM server setup on Spheron, including instance selection and container configuration, see the Spheron vLLM server guide.
Step 6: Verify All Three Endpoints
```bash
curl http://localhost:8000/v1/models | python3 -m json.tool
curl http://localhost:8001/v1/models | python3 -m json.tool
curl http://localhost:8002/v1/models | python3 -m json.tool
```

Each should return a different model name. You now have three independent LLM endpoints running on a single H100, with hardware-level memory isolation between them.
Tutorial: Time-Slice an L40S for Multiple Endpoints
Time-slicing is the Kubernetes-native approach for GPUs that do not support MIG. The NVIDIA device plugin handles context scheduling at the driver level.
Step 1: Create the Device Plugin ConfigMap
```yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 3
```

Apply it:
```bash
kubectl apply -f time-slicing-config.yaml
```

Step 2: Patch the Device Plugin to Use the Config
```bash
kubectl patch clusterpolicies/cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

After patching, the node will advertise `nvidia.com/gpu: 3` instead of 1, even though there is only one physical GPU.
Step 3: Deploy Three vLLM Pods
Each pod requests one GPU slice:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-a
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: "1"
    args:
    - "--model"
    - "Qwen/Qwen3-8B"
    - "--quantization"
    - "bitsandbytes"
    - "--load-format"
    - "bitsandbytes"
    - "--gpu-memory-utilization"
    - "0.30"
    - "--max-model-len"
    - "4096"
```

Set `--gpu-memory-utilization` to 0.30 or lower when time-slicing three processes on one GPU, and use INT4 quantization for any 7B+ model. With three processes sharing 48 GB, each process can address at most 14.4 GB of VRAM (0.30 × 48 GB). vLLM allocates KV cache from whatever VRAM remains after loading model weights, so an FP16 8B model (~15 GiB of weights) would exceed that budget and fail to start. BitsAndBytes INT4 quantization reduces Qwen3-8B weights to roughly 5-6 GB, leaving ~8-9 GB for KV cache, enough to serve 4096-token contexts comfortably. vLLM defaults to 90% VRAM allocation, which causes all three pods to crash-loop when running concurrently.
For Spheron LLM deployment options including vLLM and Ollama frameworks, see the Spheron LLM docs. For detailed vLLM production configuration including multi-GPU tensor parallelism and load balancing, see our vLLM production deployment guide.
Performance Impact
Approximate latency and throughput figures for Llama 3.1 8B across GPU sharing methods:
| Method | GPU | Config | TTFT (p50) | Throughput (tok/s) | Notes |
|---|---|---|---|---|---|
| Dedicated GPU | H100 SXM5 | 1x H100 solo | ~25ms | ~800 | Baseline |
| MIG 2g.20gb | H100 SXM5 | 2/7 H100 | ~28ms | ~265 | Near-linear scaling with compute fraction |
| Time-slicing x3 | L40S | L40S shared | ~35ms | ~190 | Higher variance under concurrent load |
| MPS x3 | H100 SXM5 | H100 shared | ~27ms | ~310 | Better SM utilization than time-slicing |
MIG throughput scales near-linearly with the fraction of compute assigned. A 2g.20gb instance has roughly 2/7 of the H100's compute, so 2/7 of 800 tok/s is about 229 tok/s as a naive compute-fraction prediction. The measured ~265 tok/s slightly exceeds the 229 tok/s linear prediction because MIG instances have dedicated memory controllers, reducing memory bandwidth contention that would occur with software-based sharing methods like time-slicing.
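The compute-fraction prediction is simple arithmetic, using the baseline from the table above:

```python
# Naive compute-fraction prediction for a 2g.20gb MIG slice
# (2 of the H100's 7 compute slices; baseline from the table above).
baseline_tok_s = 800
predicted = baseline_tok_s * 2 / 7
measured = 265                     # figure reported in the table

print(round(predicted))            # 229
print(measured > predicted)        # True: dedicated memory controllers
                                   # remove bandwidth contention
```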
Time-slicing overhead grows non-linearly under concurrent load. At low traffic (1-2 requests per model), the GPU often sits idle per-model anyway, so time-slicing overhead is minimal. Under three simultaneous sustained requests, context switching consumes 5-15% of available compute.
Cost Analysis
Scenario: 3 models serving moderate traffic (Mistral 7B Instruct, Phi-4, Qwen3-8B)
| Option | Setup | Cost/hr | Notes |
|---|---|---|---|
| 1x H100 SXM5 with MIG (3 instances) | 3x 2g.20gb | $2.40 | Single GPU, 3 isolated endpoints |
| 3x A100 SXM4 80GB (dedicated) | 3 separate GPUs | $3.15 | Full isolation, 3x the GPU count |
| 3x A100 PCIe 80GB (dedicated) | 3 separate GPUs | $3.21 | PCIe variant ($1.07/GPU x 3) |
| 3x time-sliced on 1x L40S | 1 GPU shared | $2.04 | No VRAM isolation, works for low traffic |
Running three dedicated A100 SXM4 instances costs $3.15/hr vs $2.40/hr for one H100 with MIG. That is a 24% cost reduction while keeping hardware isolation. If your models fit in 20 GB each and you care about fault isolation, MIG on H100 is cheaper than three separate A100s.
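The 24% figure comes straight from the hourly rates in the table:

```python
# Cost comparison from the table above (rates as of the date in this post).
dedicated_3x_a100 = 3.15   # $/hr for 3x A100 SXM4 80GB
h100_mig = 2.40            # $/hr for 1x H100 with 3 MIG instances

savings = (dedicated_3x_a100 - h100_mig) / dedicated_3x_a100
print(f"{savings:.0%}")    # 24%
```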
The L40S time-slicing option costs $2.04/hr for one GPU shared across three endpoints. That is cheaper than three separate A100s, but carries trade-offs: no VRAM isolation, higher latency variance under load, and less predictable behavior when all three models get simultaneous traffic spikes.
Pricing fluctuates with GPU availability. The prices above were captured on 26 Mar 2026 and may have changed. Check current GPU pricing for live rates.
When to Use Each Method
| Use Case | Recommended Method | Why |
|---|---|---|
| Multi-tenant SaaS (per-customer isolation) | MIG | Hardware memory faults cannot cross instance boundaries |
| Development and testing (multiple models) | Time-slicing | No setup overhead, works on any GPU |
| Batch inference, high throughput | MPS | Concurrent kernel execution fills SMs better |
| Production API, mixed model sizes | MIG (larger slices) | Guaranteed VRAM + predictable latency |
| Low-traffic endpoints, cost optimization | Time-slicing | One GPU for 3-4 models at minimal overhead |
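The decision table above can be reduced to a few conditionals. This is a sketch of one reasonable encoding, not an official policy; `sharing_method` and its parameters are hypothetical names.

```python
def sharing_method(needs_isolation: bool, mig_capable_gpu: bool,
                   high_concurrency: bool) -> str:
    """Pick a GPU sharing method following the decision table above."""
    if needs_isolation and mig_capable_gpu:
        return "MIG"            # hardware VRAM/fault isolation
    if high_concurrency:
        return "MPS"            # concurrent kernels fill SMs better
    return "time-slicing"       # fast setup, works on any NVIDIA GPU

print(sharing_method(True, True, False))    # MIG: multi-tenant SaaS
print(sharing_method(False, False, True))   # MPS: batch inference
print(sharing_method(False, True, False))   # time-slicing: dev/test
```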
Cloud VMs from AWS, GCP, and Azure lock MIG configuration at the hypervisor layer, which means tenants cannot change MIG profiles at runtime. Spheron provides root-level GPU access on bare-metal nodes, so you can enable MIG mode, select profiles, and reconfigure as your workload changes without filing a support ticket or waiting for a new instance type. See H100 GPU rental and current pricing for on-demand rates. For details on bare-metal access, see Spheron instance types.
Running multiple LLMs on shared GPUs requires root-level hardware access that most managed clouds don't provide. Spheron bare-metal H100 and A100 instances come with full MIG access, per-minute billing, and no minimum commitment.
