NIM containers are now free for NVIDIA Developer Program members on up to 16 GPUs. The question is no longer "can you self-host" but "where to run it." This guide covers bare metal deployment on Spheron's H100 and A100 instances from first pull to production-ready multi-model serving.
For cost comparison: a single H100 PCIe on Spheron runs at $2.01/hr on-demand.
What Is NVIDIA NIM?
NIM stands for NVIDIA Inference Microservices. Each NIM container is a self-contained inference service: it packages model weights, an auto-selected inference backend (TensorRT-LLM, vLLM, or SGLang) optimized for the target GPU, and an OpenAI-compatible API endpoint. You pull the container, give it a GPU, and it serves inference requests.
NIM also supports agentic AI pipeline workflows via NVIDIA Blueprints, which launched in August 2024. Post-GTC 2026, the NIM catalog expanded with anticipated Rubin-optimized inference profiles and a free tier for NVIDIA Developer Program members covering up to 16 GPUs. That last change matters: evaluation and development no longer require an AI Enterprise license.
Why self-host instead of using NVIDIA Cloud Functions? Three reasons:
- Cost at scale. Cloud Functions charges per token. At sustained load, a self-hosted bare metal H100 at $2.01/hr is significantly cheaper than per-token pricing.
- Data privacy. Your prompts stay on your infrastructure. For healthcare, legal, and financial workloads, sending data to a third-party endpoint is not an option.
- Customization. Self-hosting lets you modify container startup parameters, mount custom model weights, and control GPU memory allocation. Cloud Functions is fixed.
NIM Architecture: What's Inside the Container
Understanding the container structure helps you debug startup issues and plan resource allocation.
Container layers:
- Base CUDA image with CUDA 12.x and cuDNN
- NIM's own serving runtime (wrapping TensorRT-LLM, vLLM, or SGLang depending on model and GPU)
- Compiled inference engine for one or more target GPU architectures
- Model weights fetched from NGC on first container start (cached locally after)
GPU optimization profiles: each NIM container ships with profiles for multiple GPU SKUs. On startup, NIM detects the GPU architecture and selects the matching TRT-LLM profile automatically. An H100 SXM gets a different compiled kernel than an H100 PCIe, even from the same container image.
API surface: LLM NIM containers expose /v1/completions and /v1/chat/completions. Embedding NIM containers (like nv-embedqa-e5-v5) expose /v1/embeddings. Any client that works with OpenAI's API works with NIM without code changes. For a full walkthrough of building an OpenAI-compatible self-hosted API on vLLM, see the vLLM setup guide.
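Because the endpoints are OpenAI-compatible, any OpenAI SDK or plain HTTP client works unchanged. A minimal stdlib-only sketch (the helper names are mine, and it assumes a NIM container serving on localhost:8000 as in the deployment steps later in this guide):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST to a NIM /v1/chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running NIM container):
# print(chat("http://localhost:8000", "meta/llama-3.1-8b-instruct", "What is NVIDIA NIM?"))
```

Swapping `base_url` is the only change needed to move existing OpenAI client code onto a self-hosted NIM endpoint.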
Prerequisites
GPU Requirements
Most LLM NIM containers require an A100 80GB or H100. Smaller containers like embedding models run on L40S or RTX-class GPUs. NIM selects the TRT-LLM profile at startup based on detected GPU architecture, so mismatched hardware causes a startup failure with a clear error message.
| GPU | VRAM | Suitable NIM models | Spheron on-demand price |
|---|---|---|---|
| H100 PCIe | 80GB | Llama 3.1 70B, Mistral Large, Nemotron 70B | $2.01/hr |
| A100 80GB PCIe | 80GB | Llama 3.1 70B, Mistral 7B/8x7B | $1.04/hr |
| A100 80GB SXM4 | 80GB | Same as PCIe, higher memory bandwidth | $1.14/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For a broader GPU selection guide, see the GPU requirements cheat sheet.
NGC Access and Licensing
NVIDIA Developer Program: free to join at ngc.nvidia.com. Grants access to NIM containers for evaluation and development on up to 16 GPUs. No production support, no SLA.
NVIDIA AI Enterprise: $4,500 per GPU per year. Required for production workloads beyond the Developer Program tier. Includes SLA, model versioning guarantees, and priority support.
Steps to get your NGC API key:
- Go to ngc.nvidia.com and sign in or create an account
- Navigate to Setup → API Key
- Click Generate API Key
- Copy and store it; you'll need it for `docker login` and as a container environment variable
System Requirements
- Ubuntu 22.04 or 24.04 (bare metal strongly recommended over VMs for NIM)
- Docker 24+ with NVIDIA Container Toolkit
- NVIDIA driver 535+ for H100 and A100 (or 470+ for A100 legacy support)
- At least 200GB local disk for model cache (weights are large; Llama 3.1 70B is ~140GB on disk at FP16, or ~70GB quantized)
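The disk numbers follow directly from parameter count times bytes per parameter. A back-of-the-envelope helper (my own, not part of NIM tooling) for sizing the cache volume:

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate on-disk size of model weights in GB (1e9 params * N bytes = N GB)."""
    return params_billion * bytes_per_param

# Llama 3.1 70B at FP16 (2 bytes/param) -> ~140 GB
print(weights_gb(70, 2.0))  # 140.0
# The same model quantized to 8-bit (1 byte/param) -> ~70 GB
print(weights_gb(70, 1.0))  # 70.0
```

Leave headroom beyond the raw weights: NIM also caches compiled TensorRT-LLM engines alongside them.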
Step-by-Step: Deploy NIM on Spheron Bare Metal
Step 1: Provision a Bare Metal GPU Instance
Go to app.spheron.ai and select a bare metal H100 PCIe or A100 80GB instance. Bare metal is not optional here. NIM's TensorRT-LLM engines need direct PCIe and NVLink access; virtualization adds latency and can block certain GPU performance counters that TRT-LLM uses for kernel selection.
Once the instance is up, SSH in and verify GPU access:
```bash
nvidia-smi
# Expected: your GPU(s) listed with full VRAM and driver version
```

Step 2: Install NVIDIA Container Toolkit
Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. If yours doesn't:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access inside Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Step 3: Authenticate with NGC
```bash
export NGC_API_KEY="your_ngc_api_key_here"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```

Note: `$oauthtoken` is a literal string, not a variable. Use single quotes to prevent shell expansion.
Step 4: Pull and Run the NIM Container
We'll use meta/llama-3.1-8b-instruct as the worked example. It fits in 80GB and is available to all NGC users.
```bash
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d \
  --name nim-llama3 \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$LOCAL_NIM_CACHE:/root/.cache/nim" \
  -p 8000:8000 \
  --ipc=host \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```

Check NGC for current image tags; `latest` may point to a different revision over time.
Step 5: Verify the Deployment
The first start takes 10 to 20 minutes. NIM downloads model weights from NGC, compiles TensorRT-LLM kernels for your GPU, and caches both. Subsequent starts use the cached artifacts and are much faster.
```bash
# Watch startup logs
docker logs -f nim-llama3

# Test inference once the server is ready
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is NVIDIA NIM?"}],
    "max_tokens": 256
  }'
```

A JSON response with a `choices` array confirms the container is serving requests.
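You can script the readiness wait instead of watching logs. The sketch below separates the retry loop from the HTTP probe so the logic is testable; NIM containers expose a `/v1/health/ready` endpoint, though confirm the exact path against your container's documentation:

```python
import time
import urllib.request
from typing import Callable

def http_probe(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 1800,
                     interval_s: float = 10, sleep=time.sleep) -> bool:
    """Poll `probe` every `interval_s` seconds until it succeeds or `timeout_s` elapses."""
    waited = 0.0
    while waited <= timeout_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False

# Example (first start can take 10-20 minutes):
# wait_until_ready(lambda: http_probe("http://localhost:8000/v1/health/ready"))
```

The generous default timeout accounts for the first-start weight download and engine compilation described above.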
Multi-Model NIM Serving: Running Multiple Containers on One Node
Each NIM container claims GPU resources via Docker's `--gpus` flag. On a multi-GPU node, you assign specific devices per container with `device=` syntax; when listing multiple devices, the value must be quoted so Docker does not split on the comma (e.g. `--gpus '"device=0,1"'`).
Example: an 8x H100 node running four models, two GPUs each:
```bash
# Model 1 on GPUs 0,1
docker run -d --name nim-llama70b --gpus '"device=0,1"' --ipc=host -p 8000:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

# Model 2 on GPUs 2,3
docker run -d --name nim-mistral --gpus '"device=2,3"' --ipc=host -p 8001:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest

# Model 3 on GPUs 4,5
docker run -d --name nim-nemotron70b --gpus '"device=4,5"' --ipc=host -p 8002:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct:latest

# Model 4 on GPUs 6,7
docker run -d --name nim-embed --gpus '"device=6,7"' --ipc=host -p 8003:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest
```

Route to each container via different ports, or use nginx with path-based routing to expose a unified endpoint:
```nginx
upstream llama70b { server 127.0.0.1:8000; }
upstream mistral  { server 127.0.0.1:8001; }

server {
    listen 443 ssl;
    location /llama/   { proxy_pass http://llama70b/; }
    location /mistral/ { proxy_pass http://mistral/; }
}
```

If you want to go further with GPU isolation, MIG partitioning lets you split a single H100 into multiple isolated instances. See running multiple LLMs on one GPU for MIG and time-slicing as an alternative to separate containers.
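If you route in application code instead of nginx, a small model-to-port map keeps clients on one logical endpoint. A sketch matching the port layout above (the map and helper are my own illustration):

```python
# Map each served model to the host port its NIM container listens on,
# matching the docker run commands above.
MODEL_PORTS = {
    "meta/llama-3.1-70b-instruct": 8000,
    "mistralai/mistral-7b-instruct-v0.3": 8001,
    "nvidia/llama-3.1-nemotron-70b-instruct": 8002,
    "nvidia/nv-embedqa-e5-v5": 8003,
}

def endpoint_for(model: str, host: str = "127.0.0.1") -> str:
    """Return the base URL of the container serving `model`."""
    try:
        return f"http://{host}:{MODEL_PORTS[model]}"
    except KeyError:
        raise ValueError(f"no NIM container configured for model {model!r}")

print(endpoint_for("mistralai/mistral-7b-instruct-v0.3"))  # http://127.0.0.1:8001
```

Because every container speaks the same OpenAI-compatible API, the client code beyond this lookup is identical for all four models.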
NIM vs. vLLM vs. TensorRT-LLM: When to Use Each
| Criterion | NIM | vLLM | TensorRT-LLM |
|---|---|---|---|
| Setup time | Low (single docker run) | Medium (install and configure) | High (compile engines manually) |
| Model support | NVIDIA-curated catalog | Any HF model | NVIDIA-supported models |
| Customization | Low | High | Medium |
| License | NVIDIA AI Enterprise (production) | Apache 2.0 | Apache 2.0 |
| Agentic pipelines | Native (NIM Agent Blueprints) | Manual integration | Manual |
| Performance ceiling | High (auto-selected TRT-LLM, vLLM, or SGLang) | Very high (PagedAttention + FP8) | Highest (manual tuning) |
Use NIM when: you want a production deployment that is ready in one command, need NVIDIA enterprise support, or are building agentic pipelines that integrate with NIM Agent Blueprints. NIM's pre-compiled engines are also the right choice for organizations where the ops team shouldn't need to understand vLLM internals to keep inference running.
Use vLLM when: you need to serve models outside the NIM catalog, want maximum flexibility in quantization and memory management, or are running a cost-sensitive operation that can't absorb AI Enterprise licensing. The vLLM production deployment guide covers that path in detail.
Use TensorRT-LLM directly when: you need the absolute maximum throughput from a fixed model and are willing to manage engine compilation yourself. For benchmarks comparing all three, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
If you are building disaggregated prefill/decode pipelines, NVIDIA Dynamo is a related inference framework worth knowing about. See the NVIDIA Dynamo disaggregated inference guide for how it fits alongside NIM and TRT-LLM.
Scaling NIM with Kubernetes on GPU Cloud
For multi-replica deployments, Kubernetes with the NVIDIA GPU Operator is the standard pattern. The GPU Operator handles driver installation and the NVIDIA container runtime configuration across your cluster.
Minimal NIM deployment manifest:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
stringData:
  api-key: "your_ngc_api_key_here"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs  # use a storage class that supports RWX (e.g. NFS, CephFS, or cloud-provider equivalents)
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nim-llama3
  template:
    metadata:
      labels:
        app: nim-llama3
    spec:
      hostIPC: true
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: api-key
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: nim-cache
              mountPath: /root/.cache/nim
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-model-cache
```

A few practical notes:
- Shared PVC for model cache: multiple replicas sharing the same PVC avoid re-downloading and re-compiling TRT-LLM engines on each pod start. The PVC must use `accessModes: [ReadWriteMany]` (RWX) so replicas scheduled on different nodes can all mount it. Use a storage class that supports RWX, such as NFS, CephFS, or a cloud-provider equivalent (e.g. AWS EFS, Azure Files, GCP Filestore). The default `ReadWriteOnce` access mode will cause the second pod to remain in `Pending` state when the two replicas land on different nodes. Size the PVC generously; 70B model caches run 140GB or more at FP16.
- NIM containers are stateless after engine compilation. Horizontal pod autoscaling on request-per-second metrics works well once the cache is warm.
- GPU Operator installation: follow the NVIDIA GPU Operator quick-start guide. The operator handles everything from driver management to runtime configuration.
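Sizing the replica count for an RPS-based autoscaler is simple ceiling arithmetic. A sketch (the per-replica throughput is something you measure against your own workload, not a NIM constant):

```python
import math

def replicas_needed(target_rps: float, per_replica_rps: float, min_replicas: int = 2) -> int:
    """Replicas required to sustain target_rps, with a floor for availability."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return max(min_replicas, math.ceil(target_rps / per_replica_rps))

# e.g. a measured 10 req/s per replica against a 45 req/s target -> 5 replicas
print(replicas_needed(45, 10))  # 5
```

The `min_replicas` floor of 2 mirrors the manifest above, so a single pod restart never takes the service fully offline.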
For cluster setup and deployment, see Spheron documentation.
Cost Analysis: NIM on Spheron vs. NVIDIA Cloud Functions vs. Hyperscalers
Assumptions: Llama 3.1 70B requires 2 GPUs (2x H100 PCIe or 2x A100 80GB) at FP16. Running 24/7 for 30 days (720 hours).
| Option | GPU Config | Hourly Cost | Monthly Cost (720 hrs) | Notes |
|---|---|---|---|---|
| Spheron H100 PCIe x2 | 2x H100 PCIe | $4.02/hr | ~$2,894/mo | On-demand, no commitment |
| Spheron A100 80GB x2 | 2x A100 PCIe | $2.08/hr | ~$1,498/mo | Best price for 70B |
| AWS p4d.24xlarge | 8x A100 | ~$32.77/hr | ~$23,600/mo | 8 GPUs, significant excess capacity |
| Azure NC96ads A100 v4 | 4x A100 | ~$13.00/hr | ~$9,360/mo | Managed overhead included |
| NVIDIA Cloud Functions | Per-token | Varies | Typically 2-5x higher at scale | No infrastructure to manage |
AI Enterprise licensing adds $750/month for 2 GPUs ($4,500/GPU/year = $375/GPU/month). Total self-hosted cost for production NIM on Spheron A100 x2: roughly $1,498 + $750 = $2,248/month. That is roughly a quarter of the Azure figure and under a tenth of the AWS figure for the same workload.
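The monthly figures above are straightforward to reproduce. A helper for rerunning the math with current rates (the rates themselves come from the pricing table, which may have changed):

```python
def monthly_cost(hourly_rate: float, gpus: int, hours: int = 720,
                 license_per_gpu_year: float = 0.0) -> float:
    """Total monthly cost: GPU-hours plus amortized per-GPU licensing."""
    compute = hourly_rate * gpus * hours
    licensing = license_per_gpu_year / 12 * gpus
    return compute + licensing

# 2x A100 80GB at $1.04/hr plus AI Enterprise at $4,500/GPU/year
print(round(monthly_cost(1.04, 2, license_per_gpu_year=4500)))  # 2248

# 2x H100 PCIe at $2.01/hr, Developer Program tier (no license)
print(round(monthly_cost(2.01, 2)))  # 2894
```

Setting `license_per_gpu_year=0` models the Developer Program evaluation tier, where only GPU time is billed.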
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For evaluation and development (Developer Program tier): no AI Enterprise license needed. Your cost is GPU only: $2.08/hr for 2x A100 on Spheron.
For a broader cost comparison across providers, see the GPU cloud pricing comparison 2026. For strategies to reduce GPU spend over time, see the GPU cost optimization playbook.
NIM's pre-compiled TensorRT-LLM engines need direct GPU access - bare metal is the right infrastructure. Spheron's H100 and A100 bare metal instances give you that access without the hyperscaler markup.
