Tutorial

Self-Host NVIDIA NIM Microservices on GPU Cloud: Complete Deployment Guide (2026)

Written by Mitrasish, Co-founder · Mar 31, 2026
Tags: NVIDIA NIM, NIM Deployment, Self-Hosted AI, GPU Cloud, LLM Inference, H100, Bare Metal GPU, Kubernetes

NIM containers are now free for NVIDIA Developer Program members on up to 16 GPUs. The question is no longer "can you self-host" but "where to run it." This guide covers bare metal deployment on Spheron's H100 and A100 instances from first pull to production-ready multi-model serving.

For cost comparison: a single H100 PCIe on Spheron runs at $2.01/hr on-demand.

What Is NVIDIA NIM?

NIM stands for NVIDIA Inference Microservices. Each NIM container is a self-contained inference service: it packages model weights, an auto-selected inference backend (TensorRT-LLM, vLLM, or SGLang) optimized for the target GPU, and an OpenAI-compatible API endpoint. You pull the container, give it a GPU, and it serves inference requests.

NIM also supports agentic AI pipeline workflows via NVIDIA Blueprints, which launched in August 2024. Post-GTC 2026, the NIM catalog expanded with Rubin-optimized inference profiles and a free tier for NVIDIA Developer Program members covering up to 16 GPUs. That last change matters: evaluation and development no longer require an AI Enterprise license.

Why self-host instead of using NVIDIA Cloud Functions? Three reasons:

  1. Cost at scale. Cloud Functions charges per token. At sustained load, a self-hosted bare metal H100 at $2.01/hr is significantly cheaper than per-token pricing.
  2. Data privacy. Your prompts stay on your infrastructure. For healthcare, legal, and financial workloads, sending data to a third-party endpoint is not an option.
  3. Customization. Self-hosting lets you modify container startup parameters, mount custom model weights, and control GPU memory allocation. Cloud Functions is fixed.
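The cost-at-scale point can be made concrete with a quick break-even sketch. The hourly rate is Spheron's H100 PCIe price from above; the per-token rate is a placeholder assumption for illustration, since actual per-token pricing varies by model and provider.

```python
# Break-even sketch: tokens/hour above which a self-hosted H100 beats
# per-token billing. PER_MILLION_TOKENS is an assumed placeholder rate,
# not a quoted NVIDIA price.
HOURLY_GPU_COST = 2.01       # Spheron H100 PCIe on-demand, USD/hr
PER_MILLION_TOKENS = 0.60    # hypothetical blended per-token rate, USD per 1M tokens

def breakeven_tokens_per_hour(hourly=HOURLY_GPU_COST, per_million=PER_MILLION_TOKENS):
    """Throughput at which hourly GPU rental equals per-token spend."""
    return hourly / per_million * 1_000_000

print(f"{breakeven_tokens_per_hour():,.0f} tokens/hour")  # 3,350,000
```

Any sustained load above that threshold favors self-hosting; below it, per-token billing may win on idle time alone.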

NIM Architecture: What's Inside the Container

Understanding the container structure helps you debug startup issues and plan resource allocation.

Container layers:

  • Base CUDA image with CUDA 12.x and cuDNN
  • NIM's own serving runtime (wrapping TensorRT-LLM, vLLM, or SGLang depending on model and GPU)
  • Compiled inference engine for one or more target GPU architectures
  • Model weights fetched from NGC on first container start (cached locally after)

GPU optimization profiles: each NIM container ships with profiles for multiple GPU SKUs. On startup, NIM detects the GPU architecture and selects the matching TRT-LLM profile automatically. An H100 SXM gets a different compiled kernel than an H100 PCIe, even from the same container image.

API surface: LLM NIM containers expose /v1/completions and /v1/chat/completions. Embedding NIM containers (like nv-embedqa-e5-v5) expose /v1/embeddings. Any client that works with OpenAI's API works with NIM without code changes. For a full walkthrough of building an OpenAI-compatible self-hosted API with vLLM, see the dedicated vLLM guide.
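Because the API surface is OpenAI-compatible, a plain HTTP client is enough to talk to a NIM endpoint. A minimal stdlib-only sketch; the base URL and model name match the Llama 3.1 8B deployment used later in this guide:

```python
import json
import urllib.request

def chat(prompt, base_url="http://localhost:8000/v1",
         model="meta/llama-3.1-8b-instruct"):
    """Send an OpenAI-style chat completion request to a NIM endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same call works against any LLM NIM container; only the model name and port change.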

Prerequisites

GPU Requirements

Most LLM NIM containers require an A100 80GB or H100. Smaller containers like embedding models run on L40S or RTX-class GPUs. NIM selects the TRT-LLM profile at startup based on detected GPU architecture, so mismatched hardware causes a startup failure with a clear error message.

| GPU | VRAM | Suitable NIM models | Spheron on-demand price |
| --- | --- | --- | --- |
| H100 PCIe | 80GB | Llama 3.1 70B, Mistral Large, Nemotron 70B | $2.01/hr |
| A100 80GB PCIe | 80GB | Llama 3.1 70B, Mistral 7B/8x7B | $1.04/hr |
| A100 80GB SXM4 | 80GB | Same as PCIe, higher memory bandwidth | $1.14/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

For a broader GPU selection guide, see the GPU requirements cheat sheet.
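A rough rule of thumb for sizing: weight memory is parameter count times bytes per parameter, before KV cache and activation overhead. A quick sketch:

```python
def weight_gb(params_billion, bytes_per_param=2):
    """Approximate model weight footprint in GB (excludes KV cache and activations).

    1e9 params at N bytes each = N GB per billion parameters.
    """
    return params_billion * bytes_per_param

print(weight_gb(70))     # 140 GB at FP16 -> needs 2x 80GB GPUs
print(weight_gb(70, 1))  # 70 GB at FP8/INT8 -> weights fit a single 80GB GPU
print(weight_gb(8))      # 16 GB at FP16 -> fits comfortably on one GPU
```

KV cache grows with batch size and context length, so leave real headroom beyond the weight footprint.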

NGC Access and Licensing

NVIDIA Developer Program: free to join at ngc.nvidia.com. Grants access to NIM containers for evaluation and development on up to 16 GPUs. No production support, no SLA.

NVIDIA AI Enterprise: $4,500 per GPU per year. Required for production workloads beyond the Developer Program tier. Includes SLA, model versioning guarantees, and priority support.

Steps to get your NGC API key:

  1. Go to ngc.nvidia.com and sign in or create an account
  2. Navigate to Setup → API Key
  3. Click Generate API Key
  4. Copy and store it; you'll need it for docker login and as a container environment variable

System Requirements

  • Ubuntu 22.04 or 24.04 (bare metal strongly recommended over VMs for NIM)
  • Docker 24+ with NVIDIA Container Toolkit
  • NVIDIA driver 535+ for H100 and A100 (or 470+ for A100 legacy support)
  • At least 200GB local disk for model cache (weights are large; Llama 3.1 70B is ~140GB on disk at FP16, or ~70GB quantized)
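A quick preflight script catches most environment problems before the first container pull. A sketch under the assumptions above (GNU coreutils, as on Ubuntu; the 200GB threshold mirrors the disk requirement):

```shell
#!/usr/bin/env bash
# Preflight: check required tooling and free disk for the NIM model cache.
for cmd in nvidia-smi docker; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done

# Free space (GB) on the filesystem backing the cache directory.
cache_dir="${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
mkdir -p "$cache_dir"
avail_gb=$(df -BG --output=avail "$cache_dir" | tail -1 | tr -dc '0-9')
echo "free disk at $cache_dir: ${avail_gb}GB (need >= 200GB)"
```

Run it once after provisioning; a MISSING line or a low disk number is cheaper to fix now than mid-download.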

Step-by-Step: Deploy NIM on Spheron Bare Metal

Step 1: Provision a Bare Metal GPU Instance

Go to app.spheron.ai and select a bare metal H100 PCIe or A100 80GB instance. Bare metal is not optional here. NIM's TensorRT-LLM engines need direct PCIe and NVLink access; virtualization adds latency and can block certain GPU performance counters that TRT-LLM uses for kernel selection.

Rent H100 → | Rent A100 →

Once the instance is up, SSH in and verify GPU access:

bash
nvidia-smi
# Expected: your GPU(s) listed with full VRAM and driver version

Step 2: Install NVIDIA Container Toolkit

Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. If yours doesn't:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU access inside Docker:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Step 3: Authenticate with NGC

bash
export NGC_API_KEY="your_ngc_api_key_here"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Note: $oauthtoken is a literal string, not a variable. Use single quotes to prevent shell expansion.

Step 4: Pull and Run the NIM Container

We'll use meta/llama-3.1-8b-instruct as the worked example. It fits in 80GB and is available to all NGC users.

bash
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -d \
  --name nim-llama3 \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$LOCAL_NIM_CACHE:/root/.cache/nim" \
  -p 8000:8000 \
  --ipc=host \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

Check NGC for current image tags; latest may point to a different revision over time.

Step 5: Verify the Deployment

The first start takes 10 to 20 minutes. NIM downloads model weights from NGC, compiles TensorRT-LLM kernels for your GPU, and caches both. Subsequent starts use the cached artifacts and are much faster.

bash
# Watch startup logs
docker logs -f nim-llama3

# Test inference once the server is ready
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is NVIDIA NIM?"}],
    "max_tokens": 256
  }'

A JSON response with a choices array confirms the container is serving requests.
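Rather than watching logs by hand, you can poll until the server reports ready. NIM containers expose a readiness endpoint (current images document /v1/health/ready; verify the path against your image's docs). A stdlib-only polling sketch:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url="http://localhost:8000/v1/health/ready",
                     timeout=1800, interval=15):
    """Poll a NIM readiness endpoint until it returns HTTP 200 or timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False
```

Call it right after docker run; a False return means the timeout expired and the container logs are the next stop.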

Multi-Model NIM Serving: Running Multiple Containers on One Node

Each NIM container claims GPU resources via Docker's --gpus flag. On a multi-GPU node, you assign specific devices per container with device=N syntax. When listing multiple devices, quote the value (--gpus '"device=0,1"') so Docker's comma-separated option parser doesn't split it.

Example: 8x H100 node running 4 models, 2 GPUs each for 70B models:

bash
# Model 1 on GPU 0,1
docker run -d --name nim-llama70b --gpus '"device=0,1"' --ipc=host -p 8000:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

# Model 2 on GPU 2,3
docker run -d --name nim-mistral --gpus '"device=2,3"' --ipc=host -p 8001:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest

# Model 3 on GPU 4,5
docker run -d --name nim-nemotron70b --gpus '"device=4,5"' --ipc=host -p 8002:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct:latest

# Model 4 on GPU 6,7
docker run -d --name nim-embed --gpus '"device=6,7"' --ipc=host -p 8003:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest

Route to each container via different ports, or use nginx with path-based routing to expose a unified endpoint:

nginx
upstream llama70b { server 127.0.0.1:8000; }
upstream mistral  { server 127.0.0.1:8001; }

server {
    listen 443 ssl;
    location /llama/ { proxy_pass http://llama70b/; }
    location /mistral/ { proxy_pass http://mistral/; }
}

If you want to go further with GPU isolation, MIG partitioning lets you split a single H100 into multiple isolated instances. See running multiple LLMs on one GPU for MIG and time-slicing as an alternative to separate containers.
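If you skip the nginx layer, client-side routing is a one-dictionary affair. A minimal sketch matching the port layout above:

```python
# Model-name -> endpoint map for the four-container layout above.
MODEL_ENDPOINTS = {
    "meta/llama-3.1-70b-instruct": "http://localhost:8000/v1",
    "mistralai/mistral-7b-instruct-v0.3": "http://localhost:8001/v1",
    "nvidia/llama-3.1-nemotron-70b-instruct": "http://localhost:8002/v1",
    "nvidia/nv-embedqa-e5-v5": "http://localhost:8003/v1",
}

def base_url_for(model):
    """Return the base URL of the NIM container serving the given model."""
    try:
        return MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"no NIM container serving {model!r}") from None
```

Pass the returned URL as the base_url of any OpenAI-compatible client and keep the model name unchanged in the request body.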

NIM vs. vLLM vs. TensorRT-LLM: When to Use Each

| Criterion | NIM | vLLM | TensorRT-LLM |
| --- | --- | --- | --- |
| Setup time | Low (single docker run) | Medium (install and configure) | High (compile engines manually) |
| Model support | NVIDIA-curated catalog | Any HF model | NVIDIA-supported models |
| Customization | Low | High | Medium |
| License | NVIDIA AI Enterprise (production) | Apache 2.0 | Apache 2.0 |
| Agentic pipelines | Native (NIM Agent Blueprints) | Manual integration | Manual |
| Performance ceiling | High (auto-selected TRT-LLM, vLLM, or SGLang) | Very high (PagedAttention + FP8) | Highest (manual tuning) |

Use NIM when: you want a production deployment that is ready in one command, need NVIDIA enterprise support, or are building agentic pipelines that integrate with NIM Agent Blueprints. NIM's pre-compiled engines are also the right choice for organizations where the ops team shouldn't need to understand vLLM internals to keep inference running.

Use vLLM when: you need to serve models outside the NIM catalog, want maximum flexibility in quantization and memory management, or are running a cost-sensitive operation that can't absorb AI Enterprise licensing. The vLLM production deployment guide covers that path in detail.

Use TensorRT-LLM directly when: you need the absolute maximum throughput from a fixed model and are willing to manage engine compilation yourself. For benchmarks comparing all three, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

If you are building disaggregated prefill/decode pipelines, NVIDIA Dynamo is a related inference framework worth knowing about. See the NVIDIA Dynamo disaggregated inference guide for how it fits alongside NIM and TRT-LLM.

Scaling NIM with Kubernetes on GPU Cloud

For multi-replica deployments, Kubernetes with the NVIDIA GPU Operator is the standard pattern. The GPU Operator handles driver installation and the NVIDIA container runtime configuration across your cluster.

Minimal NIM deployment manifest:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
stringData:
  api-key: "your_ngc_api_key_here"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs  # use a storage class that supports RWX (e.g. NFS, CephFS, or cloud-provider equivalents)
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nim-llama3
  template:
    metadata:
      labels:
        app: nim-llama3
    spec:
      hostIPC: true
      containers:
      - name: nim
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-secret
              key: api-key
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: nim-cache
          mountPath: /root/.cache/nim
      volumes:
      - name: nim-cache
        persistentVolumeClaim:
          claimName: nim-model-cache

A few practical notes:

  • Shared PVC for model cache: multiple replicas sharing the same PVC avoid re-downloading and re-compiling TRT-LLM engines on each pod start. The PVC must use accessModes: [ReadWriteMany] (RWX) so replicas scheduled on different nodes can all mount it. Use a storage class that supports RWX, such as NFS, CephFS, or a cloud-provider equivalent (e.g. AWS EFS, Azure Files, GCP Filestore). The default ReadWriteOnce access mode will cause the second pod to remain in Pending state when the two replicas land on different nodes. Size the PVC generously; 70B model caches run 140GB or more at FP16.
  • NIM containers are stateless after engine compilation. Horizontal pod autoscaling on request-per-second metrics works well once the cache is warm.
  • GPU Operator installation: follow the NVIDIA GPU Operator quick-start guide. The operator handles everything from driver management to runtime configuration.
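As a sketch of the autoscaling note above, an HPA keyed to a per-pod request-rate metric could look like the following. It assumes a metrics pipeline (e.g. Prometheus Adapter) already exposes an http_requests_per_second pods metric, which is not part of the manifest above; the metric name and target value are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llama3-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llama3
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # assumes a metrics adapter exposes this
      target:
        type: AverageValue
        averageValue: "20"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # avoid thrashing while new pods warm the cache
```

The long scale-down window matters: a freshly scheduled pod still needs minutes to load weights even from a warm cache.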

For cluster setup and deployment, see Spheron documentation.

Cost Analysis: NIM on Spheron vs. NVIDIA Cloud Functions vs. Hyperscalers

Assumptions: Llama 3.1 70B requires 2 GPUs (2x H100 PCIe or 2x A100 80GB) at FP16. Running 24/7 for 30 days (720 hours).

| Option | GPU config | Hourly cost | Monthly cost (720 hrs) | Notes |
| --- | --- | --- | --- | --- |
| Spheron H100 PCIe x2 | 2x H100 PCIe | $4.02/hr | ~$2,894/mo | On-demand, no commitment |
| Spheron A100 80GB x2 | 2x A100 PCIe | $2.08/hr | ~$1,498/mo | Best price for 70B |
| AWS p4d.24xlarge | 8x A100 | ~$32.77/hr | ~$23,600/mo | 8 GPUs, significant excess capacity |
| Azure NC96ads A100 v4 | 4x A100 | ~$13.00/hr | ~$9,360/mo | Managed overhead included |
| NVIDIA Cloud Functions | Per-token | Varies | Typically 2-5x higher at scale | No infrastructure to manage |

AI Enterprise licensing adds $750/month for 2 GPUs ($4,500/GPU/year = $375/GPU/month). Total self-hosted cost for production NIM on Spheron A100 x2: roughly $1,498 + $750 = $2,248/month. That is still well below Azure or AWS for the same GPU config, and about half the cost of comparable hyperscaler options.
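The arithmetic above generalizes to a small helper. Rates here are per GPU (the $2.08/hr table line is 2x the $1.04 A100 rate), and the licensed flag toggles the AI Enterprise fee on or off:

```python
def monthly_cost(per_gpu_hourly, gpus, hours=720,
                 license_per_gpu_month=375.0, licensed=True):
    """GPU rental plus optional AI Enterprise licensing ($4,500/GPU/yr = $375/GPU/mo)."""
    gpu_cost = per_gpu_hourly * gpus * hours
    license_cost = license_per_gpu_month * gpus if licensed else 0.0
    return gpu_cost + license_cost

print(monthly_cost(1.04, 2))                  # production: 1497.6 + 750 = 2247.6
print(monthly_cost(1.04, 2, licensed=False))  # dev tier (no license): 1497.6
```

Plug in current rates from the pricing page before relying on any of the totals; the table values above are a point-in-time snapshot.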

Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

For evaluation and development (Developer Program tier): no AI Enterprise license needed. Your cost is GPU only: $2.08/hr for 2x A100 on Spheron.

For a broader cost comparison across providers, see the GPU cloud pricing comparison 2026. For strategies to reduce GPU spend over time, see the GPU cost optimization playbook.


NIM's pre-compiled TensorRT-LLM engines need direct GPU access; bare metal is the right infrastructure. Spheron's H100 and A100 bare metal instances give you that access without the hyperscaler markup.

Rent H100 → | Rent A100 → | View all GPU pricing →

Deploy NIM on Spheron →
