NIM containers are now free for NVIDIA Developer Program members on up to 16 GPUs. The question is no longer "can you self-host" but "where to run it." This guide covers bare metal deployment on Spheron's H100 and A100 instances from first pull to production-ready multi-model serving.
For cost comparison: a single H100 PCIe on Spheron runs at $2.01/hr on-demand.
What Is NVIDIA NIM?
NIM stands for NVIDIA Inference Microservices. Each NIM container is a self-contained inference service: it packages model weights, an auto-selected inference backend (TensorRT-LLM, vLLM, or SGLang) optimized for the target GPU, and an OpenAI-compatible API endpoint. You pull the container, give it a GPU, and it serves inference requests.
NIM also supports agentic AI pipeline workflows via NVIDIA Blueprints, which launched in August 2024. Post-GTC 2026, the NIM catalog expanded with anticipated Rubin-optimized inference profiles and a free tier for NVIDIA Developer Program members covering up to 16 GPUs. That last change matters: evaluation and development no longer require an AI Enterprise license.
Why self-host instead of using NVIDIA Cloud Functions? Three reasons:
- Cost at scale. Cloud Functions charges per token. At sustained load, a self-hosted bare metal H100 at $2.01/hr is significantly cheaper than per-token pricing.
- Data privacy. Your prompts stay on your infrastructure. For healthcare, legal, and financial workloads, sending data to a third-party endpoint is not an option.
- Customization. Self-hosting lets you modify container startup parameters, mount custom model weights, and control GPU memory allocation. Cloud Functions is fixed.
NIM Architecture: What's Inside the Container
Understanding the container structure helps you debug startup issues and plan resource allocation.
Container layers:
- Base CUDA image with CUDA 12.x and cuDNN
- NIM's own serving runtime (wrapping TensorRT-LLM, vLLM, or SGLang depending on model and GPU)
- Compiled inference engine for one or more target GPU architectures
- Model weights fetched from NGC on first container start (cached locally after)
GPU optimization profiles: each NIM container ships with profiles for multiple GPU SKUs. On startup, NIM detects the GPU architecture and selects the matching TRT-LLM profile automatically. An H100 SXM gets a different compiled kernel than an H100 PCIe, even from the same container image.
API surface: LLM NIM containers expose /v1/completions and /v1/chat/completions. Embedding NIM containers (like nv-embedqa-e5-v5) expose /v1/embeddings. Any client that works with OpenAI's API works with NIM without code changes. For a full walkthrough of building an OpenAI-compatible self-hosted API on vLLM, see the vLLM setup guide.
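Because the endpoints are OpenAI-compatible, any OpenAI SDK or plain HTTP client works unchanged. A minimal stdlib-only sketch (the helper names are mine, and it assumes a NIM container serving on localhost:8000 as in the deployment steps later in this guide):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST to a NIM /v1/chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running NIM container):
# print(chat("http://localhost:8000", "meta/llama-3.1-8b-instruct", "What is NVIDIA NIM?"))
```

Swapping `base_url` is the only change needed to move existing OpenAI client code onto a self-hosted NIM endpoint.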
Prerequisites
GPU Requirements
Most LLM NIM containers require an A100 80GB or H100. Smaller containers like embedding models run on L40S or RTX-class GPUs. NIM selects the TRT-LLM profile at startup based on detected GPU architecture, so mismatched hardware causes a startup failure with a clear error message.
| GPU | VRAM | Suitable NIM models | Spheron on-demand price |
|---|---|---|---|
| H100 PCIe | 80GB | Llama 3.1 70B, Mistral Large, Nemotron 70B | $2.01/hr |
| A100 80GB PCIe | 80GB | Llama 3.1 70B, Mistral 7B/8x7B | $1.04/hr |
| A100 80GB SXM4 | 80GB | Same as PCIe, higher memory bandwidth | $1.14/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For a broader GPU selection guide, see the GPU requirements cheat sheet.
NGC Access and Licensing
NVIDIA Developer Program: free to join at ngc.nvidia.com. Grants access to NIM containers for evaluation and development on up to 16 GPUs. No production support, no SLA.
NVIDIA AI Enterprise: $4,500 per GPU per year. Required for production workloads beyond the Developer Program tier. Includes SLA, model versioning guarantees, and priority support.
Steps to get your NGC API key:
- Go to ngc.nvidia.com and sign in or create an account
- Navigate to Setup → API Key
- Click Generate API Key
- Copy and store it; you'll need it for `docker login` and as a container environment variable
System Requirements
- Ubuntu 22.04 or 24.04 (bare metal strongly recommended over VMs for NIM)
- Docker 24+ with NVIDIA Container Toolkit
- NVIDIA driver 535+ for H100 and A100 (or 470+ for A100 legacy support)
- At least 200GB local disk for model cache (weights are large; Llama 3.1 70B is ~140GB on disk at FP16, or ~70GB quantized)
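The disk numbers follow directly from parameter count times bytes per parameter. A back-of-the-envelope helper (my own, not part of NIM tooling) for sizing the cache volume:

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate on-disk size of model weights in GB (1e9 params * N bytes = N GB)."""
    return params_billion * bytes_per_param

# Llama 3.1 70B at FP16 (2 bytes/param) -> ~140 GB
print(weights_gb(70, 2.0))  # 140.0
# The same model quantized to 8-bit (1 byte/param) -> ~70 GB
print(weights_gb(70, 1.0))  # 70.0
```

Leave headroom beyond the raw weights: NIM also caches compiled TensorRT-LLM engines alongside them.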
Step-by-Step: Deploy NIM on Spheron Bare Metal
Step 1: Provision a Bare Metal GPU Instance
Go to app.spheron.ai and select a bare metal H100 PCIe or A100 80GB instance. Bare metal is not optional here. NIM's TensorRT-LLM engines need direct PCIe and NVLink access; virtualization adds latency and can block certain GPU performance counters that TRT-LLM uses for kernel selection.
Once the instance is up, SSH in and verify GPU access:
```bash
nvidia-smi
# Expected: your GPU(s) listed with full VRAM and driver version
```

Step 2: Install NVIDIA Container Toolkit
Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. If yours doesn't:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access inside Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Step 3: Authenticate with NGC
```bash
export NGC_API_KEY="your_ngc_api_key_here"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```

Note: `$oauthtoken` is a literal string, not a variable. Use single quotes to prevent shell expansion.
Step 4: Pull and Run the NIM Container
We'll use meta/llama-3.1-8b-instruct as the worked example. It fits in 80GB and is available to all NGC users.
```bash
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d \
  --name nim-llama3 \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$LOCAL_NIM_CACHE:/root/.cache/nim" \
  -p 8000:8000 \
  --ipc=host \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```

Check NGC for current image tags; `latest` may point to a different revision over time.
Step 5: Verify the Deployment
The first start takes 10 to 20 minutes. NIM downloads model weights from NGC, compiles TensorRT-LLM kernels for your GPU, and caches both. Subsequent starts use the cached artifacts and are much faster.
```bash
# Watch startup logs
docker logs -f nim-llama3

# Test inference once the server is ready
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is NVIDIA NIM?"}],
    "max_tokens": 256
  }'
```

A JSON response with a `choices` array confirms the container is serving requests.
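You can script the readiness wait instead of watching logs. The sketch below separates the retry loop from the HTTP probe so the logic is testable; NIM containers expose a `/v1/health/ready` endpoint, though confirm the exact path against your container's documentation:

```python
import time
import urllib.request
from typing import Callable

def http_probe(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 1800,
                     interval_s: float = 10, sleep=time.sleep) -> bool:
    """Poll `probe` every `interval_s` seconds until it succeeds or `timeout_s` elapses."""
    waited = 0.0
    while waited <= timeout_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False

# Example (first start can take 10-20 minutes):
# wait_until_ready(lambda: http_probe("http://localhost:8000/v1/health/ready"))
```

The generous default timeout accounts for the first-start weight download and engine compilation described above.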
Multi-Model NIM Serving: Running Multiple Containers on One Node
Each NIM container claims GPU resources via Docker's `--gpus` flag. On a multi-GPU node, you assign specific devices per container with `device=` syntax; when listing multiple devices, the value must be quoted so Docker does not split on the comma (e.g. `--gpus '"device=0,1"'`).
Example: an 8x H100 node running four models, two GPUs each:
```bash
# Model 1 on GPUs 0,1
docker run -d --name nim-llama70b --gpus '"device=0,1"' --ipc=host -p 8000:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

# Model 2 on GPUs 2,3
docker run -d --name nim-mistral --gpus '"device=2,3"' --ipc=host -p 8001:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest

# Model 3 on GPUs 4,5
docker run -d --name nim-nemotron70b --gpus '"device=4,5"' --ipc=host -p 8002:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct:latest

# Model 4 on GPUs 6,7
docker run -d --name nim-embed --gpus '"device=6,7"' --ipc=host -p 8003:8000 \
  -e NGC_API_KEY="$NGC_API_KEY" -v ~/.cache/nim:/root/.cache/nim \
  nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest
```

Route to each container via different ports, or use nginx with path-based routing to expose a unified endpoint:
```nginx
upstream llama70b { server 127.0.0.1:8000; }
upstream mistral  { server 127.0.0.1:8001; }

server {
    listen 443 ssl;
    location /llama/   { proxy_pass http://llama70b/; }
    location /mistral/ { proxy_pass http://mistral/; }
}
```

If you want to go further with GPU isolation, MIG partitioning lets you split a single H100 into multiple isolated instances. See running multiple LLMs on one GPU for MIG and time-slicing as an alternative to separate containers.
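If you route in application code instead of nginx, a small model-to-port map keeps clients on one logical endpoint. A sketch matching the port layout above (the map and helper are my own illustration):

```python
# Map each served model to the host port its NIM container listens on,
# matching the docker run commands above.
MODEL_PORTS = {
    "meta/llama-3.1-70b-instruct": 8000,
    "mistralai/mistral-7b-instruct-v0.3": 8001,
    "nvidia/llama-3.1-nemotron-70b-instruct": 8002,
    "nvidia/nv-embedqa-e5-v5": 8003,
}

def endpoint_for(model: str, host: str = "127.0.0.1") -> str:
    """Return the base URL of the container serving `model`."""
    try:
        return f"http://{host}:{MODEL_PORTS[model]}"
    except KeyError:
        raise ValueError(f"no NIM container configured for model {model!r}")

print(endpoint_for("mistralai/mistral-7b-instruct-v0.3"))  # http://127.0.0.1:8001
```

Because every container speaks the same OpenAI-compatible API, the client code beyond this lookup is identical for all four models.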
NIM vs. vLLM vs. TensorRT-LLM: When to Use Each
| Criterion | NIM | vLLM | TensorRT-LLM |
|---|---|---|---|
| Setup time | Low (single docker run) | Medium (install and configure) | High (compile engines manually) |
| Model support | NVIDIA-curated catalog | Any HF model | NVIDIA-supported models |
| Customization | Low | High | Medium |
| License | NVIDIA AI Enterprise (production) | Apache 2.0 | Apache 2.0 |
| Agentic pipelines | Native (NIM Agent Blueprints) | Manual integration | Manual |
| Performance ceiling | High (auto-selected TRT-LLM, vLLM, or SGLang) | Very high (PagedAttention + FP8) | Highest (manual tuning) |
Use NIM when: you want a production deployment that is ready in one command, need NVIDIA enterprise support, or are building agentic pipelines that integrate with NIM Agent Blueprints. NIM's pre-compiled engines are also the right choice for organizations where the ops team shouldn't need to understand vLLM internals to keep inference running.
Use vLLM when: you need to serve models outside the NIM catalog, want maximum flexibility in quantization and memory management, or are running a cost-sensitive operation that can't absorb AI Enterprise licensing. The vLLM production deployment guide covers that path in detail.
Use TensorRT-LLM directly when: you need the absolute maximum throughput from a fixed model and are willing to manage engine compilation yourself. For benchmarks comparing all three, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
If you are building disaggregated prefill/decode pipelines, NVIDIA Dynamo is a related inference framework worth knowing about. See the NVIDIA Dynamo disaggregated inference guide for how it fits alongside NIM and TRT-LLM.
Scaling NIM with Kubernetes on GPU Cloud
For multi-replica deployments, Kubernetes with the NVIDIA GPU Operator is the standard pattern. The GPU Operator handles driver installation and the NVIDIA container runtime configuration across your cluster.
Minimal NIM deployment manifest:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
stringData:
  api-key: "your_ngc_api_key_here"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs  # use a storage class that supports RWX (e.g. NFS, CephFS, or cloud-provider equivalents)
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nim-llama3
  template:
    metadata:
      labels:
        app: nim-llama3
    spec:
      hostIPC: true
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: api-key
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: nim-cache
              mountPath: /root/.cache/nim
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-model-cache
```

A few practical notes:
- Shared PVC for model cache: multiple replicas sharing the same PVC avoid re-downloading and re-compiling TRT-LLM engines on each pod start. The PVC must use `accessModes: [ReadWriteMany]` (RWX) so replicas scheduled on different nodes can all mount it. Use a storage class that supports RWX, such as NFS, CephFS, or a cloud-provider equivalent (e.g. AWS EFS, Azure Files, GCP Filestore). The default `ReadWriteOnce` access mode will cause the second pod to remain in `Pending` state when the two replicas land on different nodes. Size the PVC generously; 70B model caches run 140GB or more at FP16.
- NIM containers are stateless after engine compilation. Horizontal pod autoscaling on request-per-second metrics works well once the cache is warm.
- GPU Operator installation: follow the NVIDIA GPU Operator quick-start guide. The operator handles everything from driver management to runtime configuration.
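Sizing the replica count for an RPS-based autoscaler is simple ceiling arithmetic. A sketch (the per-replica throughput is something you measure against your own workload, not a NIM constant):

```python
import math

def replicas_needed(target_rps: float, per_replica_rps: float, min_replicas: int = 2) -> int:
    """Replicas required to sustain target_rps, with a floor for availability."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return max(min_replicas, math.ceil(target_rps / per_replica_rps))

# e.g. a measured 10 req/s per replica against a 45 req/s target -> 5 replicas
print(replicas_needed(45, 10))  # 5
```

The `min_replicas` floor of 2 mirrors the manifest above, so a single pod restart never takes the service fully offline.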
For cluster setup and deployment, see Spheron documentation.
Cost Analysis: NIM on Spheron vs. NVIDIA Cloud Functions vs. Hyperscalers
Assumptions: Llama 3.1 70B requires 2 GPUs (2x H100 PCIe or 2x A100 80GB) at FP16. Running 24/7 for 30 days (720 hours).
| Option | GPU Config | Hourly Cost | Monthly Cost (720 hrs) | Notes |
|---|---|---|---|---|
| Spheron H100 PCIe x2 | 2x H100 PCIe | $4.02/hr | ~$2,894/mo | On-demand, no commitment |
| Spheron A100 80GB x2 | 2x A100 PCIe | $2.08/hr | ~$1,498/mo | Best price for 70B |
| AWS p4d.24xlarge | 8x A100 | ~$32.77/hr | ~$23,600/mo | 8 GPUs, significant excess capacity |
| Azure NC96ads A100 v4 | 4x A100 | ~$13.00/hr | ~$9,360/mo | Managed overhead included |
| NVIDIA Cloud Functions | Per-token | Varies | Typically 2-5x higher at scale | No infrastructure to manage |
AI Enterprise licensing adds $750/month for 2 GPUs ($4,500/GPU/year = $375/GPU/month). Total self-hosted cost for production NIM on Spheron A100 x2: roughly $1,498 + $750 = $2,248/month. That is roughly a quarter of the Azure figure and under a tenth of the AWS figure for the same workload.
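The monthly figures above are straightforward to reproduce. A helper for rerunning the math with current rates (the rates themselves come from the pricing table, which may have changed):

```python
def monthly_cost(hourly_rate: float, gpus: int, hours: int = 720,
                 license_per_gpu_year: float = 0.0) -> float:
    """Total monthly cost: GPU-hours plus amortized per-GPU licensing."""
    compute = hourly_rate * gpus * hours
    licensing = license_per_gpu_year / 12 * gpus
    return compute + licensing

# 2x A100 80GB at $1.04/hr plus AI Enterprise at $4,500/GPU/year
print(round(monthly_cost(1.04, 2, license_per_gpu_year=4500)))  # 2248

# 2x H100 PCIe at $2.01/hr, Developer Program tier (no license)
print(round(monthly_cost(2.01, 2)))  # 2894
```

Setting `license_per_gpu_year=0` models the Developer Program evaluation tier, where only GPU time is billed.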
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For evaluation and development (Developer Program tier): no AI Enterprise license needed. Your cost is GPU only: $2.08/hr for 2x A100 on Spheron.
For a broader cost comparison across providers, see the GPU cloud pricing comparison 2026. For strategies to reduce GPU spend over time, see the GPU cost optimization playbook.
NIM's pre-compiled TensorRT-LLM engines need direct GPU access - bare metal is the right infrastructure. Spheron's H100 and A100 bare metal instances give you that access without the hyperscaler markup.
