Deploy NVIDIA Nemotron 3 Nano Omni on GPU Cloud: Self-Host the 30B-A3B Omni-Modal Perception Model for Multimodal AI Agents (2026 Guide)

NVIDIA released Nemotron 3 Nano Omni on April 28, 2026 under the NVIDIA Nemotron Open Model License. The defining numbers are 30B total parameters and 3B active per token: a MoE design that gives you omni-modal perception at a compute cost close to a 3B dense model. It accepts text, images, video, and audio in a single inference loop and produces text output, with a 300K token context window and an approximately 16K token reasoning budget built in. For anyone building multimodal agent stacks, this changes the cost math on the perception layer considerably.

Before diving in, two related deployment guides are worth bookmarking. The Nemotron 3 Super deployment guide covers the 120B/12B hybrid Mamba-Transformer tier for single-GPU deployment, and the Qwen3.5-Omni guide is a useful comparison point for other omni models in the same weight class.

What Is NVIDIA Nemotron 3 Nano Omni

Nemotron 3 Nano Omni is a hybrid Mamba-Transformer MoE model with three distinct architectural components: Mamba SSM layers for recurrent state across the sequence, standard Transformer attention layers, and Conv3D layers that process video frames. The Conv3D layers are the distinguishing piece: the model keeps a vision encoder (C-RADIOv4-H) and uses Conv3D to fuse pairs of frames into tubelets before passing them to the encoder. Efficient Video Sampling (EVS) then prunes redundant static tokens. Together, the Conv3D + EVS path reduces encoder compute by roughly 9x compared to per-frame ViT-based approaches, per NVIDIA's published figures.

| Spec | Value |
| --- | --- |
| Total parameters | 30B |
| Active parameters (per token) | 3B |
| Architecture | Hybrid Mamba-Transformer MoE + Conv3D |
| Context window | 300K tokens |
| Reasoning budget | ~16K tokens (approximate) |
| Input modalities | Text, image, video, audio |
| Output modalities | Text |
| License | NVIDIA Nemotron Open Model License |

The NVIDIA Nemotron Open Model License is not Apache 2.0. It has specific terms around commercial use that differ from fully permissive open-source licenses. Check the model card license section before deploying commercially.

Why an Omni-Modal Perception Sub-Agent

The most practical use pattern for Nano Omni is not as a standalone assistant but as the perception layer in a two-model agent stack. Nano Omni receives raw video, audio, or images and returns structured text. A larger reasoning model gets that text and handles planning and decisions.

The numbers that make this work:

  • ~9.2x effective system capacity for video reasoning vs alternative open omni models at a fixed per-user interactivity threshold, per NVIDIA's published benchmarks.
  • ~9x lower compute for video reasoning because Conv3D fuses frame pairs into tubelets before the vision encoder and Efficient Video Sampling (EVS) prunes redundant static tokens, reducing total encoder compute compared to per-frame ViT-based approaches. This is NVIDIA's published figure for the Conv3D + EVS path.
  • Single inference loop: audio, video, and text pass through one model call instead of an ASR pipeline feeding a VLM feeding an LLM.

At 3B active parameters, each Nano Omni inference call is cheap enough that you can call it on every incoming video clip or audio segment without the per-call cost dominating your infrastructure budget.

For teams scaling to multi-GPU setups, the vLLM-Omni disaggregated serving guide covers multi-node deployment patterns for omni models.

GPU and VRAM Requirements

The 30B weight count is the primary constraint. Here is how it breaks down by quantization:

BF16 (2 bytes/param): ~60 GB for weights. Add 2-4 GB for Mamba recurrent state (scales with model width, not sequence length) and ~2 GB for KV cache headers. You need an 80GB GPU: H100 SXM5 or A100 80GB.

FP8 (1 byte/param): ~30 GB for weights plus the same Mamba state overhead, leaving roughly 14-16 GB free on a 48GB card for KV cache. The L40S 48GB is viable at FP8 for moderate concurrency.

4-bit GGUF (~0.5 bytes/param): ~15-18 GB. Fits RTX 4090 24GB for development and testing. Not recommended for production serving under concurrent load.

The Mamba state overhead is worth calling out separately. In a pure-attention model, KV cache VRAM grows with sequence length. The SSM recurrent state in the Mamba layers does not: it is a fixed-size buffer proportional to model width, regardless of context length. The Transformer attention layers in the hybrid still keep a KV cache that grows with context, so that budget does not disappear. Plan for ~2-4 GB of fixed VRAM reserved for Mamba state on top of your weight and KV cache budget.
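
The arithmetic in this section can be sketched as a small helper. This is a rough estimate only: the bytes-per-parameter values and the Mamba state range are the figures quoted above, the function names are made up for illustration, and real usage also depends on activation memory and the serving engine.

```python
# Bytes per parameter for each precision, as quoted in this section.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(total_params_billions: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * BYTES_PER_PARAM[precision]


def kv_headroom_gb(gpu_gb: float, total_params_billions: float,
                   precision: str, mamba_state_gb: float = 4.0) -> float:
    """GB left for KV cache after weights and the fixed Mamba state."""
    return gpu_gb - weights_gb(total_params_billions, precision) - mamba_state_gb


# Upper end of the 2-4 GB Mamba state range, so the headroom is conservative.
for precision, gpu_gb in [("bf16", 80), ("fp8", 48), ("int4", 24)]:
    free = kv_headroom_gb(gpu_gb, 30, precision)
    print(f"{precision} on a {gpu_gb} GB card: ~{free:.0f} GB left for KV cache")
```

With the lower 2 GB Mamba estimate, the FP8 case on a 48 GB card yields 16 GB, which is the top of the 14-16 GB range quoted above.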

| Use case | Recommended GPU | Quantization | Est. VRAM | On-demand $/hr |
| --- | --- | --- | --- | --- |
| Development / testing | L40S 48GB | FP8 | ~34 GB | $0.96 |
| Production single-user | A100 80GB | BF16 | ~64 GB | $1.43 |
| Production multi-user | H100 SXM5 80GB | BF16 | ~64 GB | $4.06 |
| High throughput | H100 SXM5 80GB | FP8 | ~34 GB | $4.06 |
| Edge / dev budget | RTX 4090 PCIe 24GB | INT4/GGUF | ~18 GB | $0.53 |

Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

For production multi-user serving, H100 SXM5 80GB on Spheron gives the most headroom at BF16 with per-minute billing and no contract. For FP8 development and testing, L40S 48GB instances are the cost-efficient option. For single-user production at BF16, the A100 80GB is the mid-tier choice.

Step-by-Step Deployment with vLLM

Step 1: Provision a GPU instance

Log into app.spheron.ai and provision your instance. For BF16 production, select H100 SXM5 80GB or A100 80GB. For FP8 development, L40S 48GB works. See docs.spheron.ai/getting-started for SSH setup instructions.

Verify your CUDA version after SSH:

bash
nvidia-smi
nvcc --version

You need CUDA 12.4+ for H100 (Hopper). If you are using B-series Blackwell GPUs, CUDA 12.8+ is required.

Attach persistent storage at /data/models/ before downloading weights. The BF16 checkpoint is roughly 60 GB and you do not want to re-download on every restart.

Step 2: Install vLLM

Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture and requires audio input support. The minimum vLLM version is 0.20.0, and the [audio] extra is required for the audio input modality:

bash
pip install "vllm[audio]==0.20.0"

# Verify installation
python -c "import vllm; print(vllm.__version__)"

Check NVIDIA's model card and vLLM release notes for any Nano Omni-specific notes before deploying, since Conv3D video layers are architecturally uncommon and may require version pinning.

Step 3: Download model weights

For BF16:

bash
huggingface-cli download nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --local-dir /data/models/nemotron-3-nano-omni

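For FP8 (download either the BF16 or the FP8 checkpoint, not both, because both commands write to the same --local-dir):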

bash
huggingface-cli download nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --local-dir /data/models/nemotron-3-nano-omni

Always verify the exact HuggingFace repository name against NVIDIA's release page at https://huggingface.co/nvidia before running. The Nemotron family has inconsistent naming conventions across releases.

Step 4: Launch vLLM

BF16 on H100 (production):

bash
vllm serve /data/models/nemotron-3-nano-omni \
  --dtype bfloat16 \
  --no-enable-chunked-prefill \
  --trust-remote-code \
  --port 8000

FP8 on H100 (higher throughput):

bash
vllm serve /data/models/nemotron-3-nano-omni \
  --dtype bfloat16 \
  --quantization fp8 \
  --no-enable-chunked-prefill \
  --trust-remote-code \
  --port 8000

FP8 on L40S (development):

bash
vllm serve /data/models/nemotron-3-nano-omni \
  --dtype bfloat16 \
  --quantization fp8 \
  --no-enable-chunked-prefill \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --trust-remote-code \
  --port 8000

The --no-enable-chunked-prefill flag is required for SSM layer correctness. This is the same requirement as Nemotron 3 Super: chunked prefill can produce incorrect outputs on models with Mamba layers. Re-enable only after you have validated correct outputs on your specific workload and confirmed support in your vLLM version.
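
Loading a roughly 60 GB checkpoint takes a while, so it helps to wait for the server before sending the first request. The following is a minimal sketch that polls vLLM's /health endpoint using only the standard library. The function names, the 600-second limit, and the 5-second interval are illustrative choices, not recommendations from NVIDIA or vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, max_wait_s: float = 600.0, interval_s: float = 5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage, assuming the server from Step 4 listens on port 8000: `wait_until_ready(lambda: server_is_healthy("http://localhost:8000"))`. The probe is passed in as a callable so the loop can be tested without a running server.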

Step 5: Send multimodal requests

The server exposes an OpenAI-compatible /v1/chat/completions endpoint. Pass multimodal content using the content array format.

Text-only (baseline):

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="/data/models/nemotron-3-nano-omni",
    messages=[{"role": "user", "content": "Describe the steps to train a vision model."}]
)
print(response.choices[0].message.content)

Image input:

python
import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("frame.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="/data/models/nemotron-3-nano-omni",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "What is happening in this image?"}
        ]
    }]
)

Audio input (base64 WAV):

python
import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("clip.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="/data/models/nemotron-3-nano-omni",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe and summarize this audio."}
        ]
    }]
)

Audio tokens accumulate at approximately 7 per second (illustrative; verify against your model card), so a 60-second clip generates roughly 420 audio tokens before any text context.
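
To make the audio budget concrete, here is a small sketch of the arithmetic. It assumes the illustrative rate of 7 tokens per second quoted above; the helper names are hypothetical and the rate should be replaced with the model card's figure.

```python
import math

AUDIO_TOKENS_PER_SECOND = 7.0  # illustrative rate from this guide, not a spec


def audio_token_estimate(duration_s: float,
                         tokens_per_second: float = AUDIO_TOKENS_PER_SECOND) -> int:
    """Approximate audio tokens for a clip, rounded up."""
    return math.ceil(duration_s * tokens_per_second)


def max_audio_seconds(token_budget: int,
                      tokens_per_second: float = AUDIO_TOKENS_PER_SECOND) -> float:
    """Longest clip that fits in token_budget if nothing else is in context."""
    return token_budget / tokens_per_second


print(audio_token_estimate(60))            # 60-second clip
print(round(max_audio_seconds(8192) / 60, 1), "minutes fit in 8192 tokens")
```

At this rate, the 8192-token limit used in the L40S launch command leaves room for about 19.5 minutes of audio before any text or reasoning tokens are counted.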

Video input:

For video, sample keyframes at 1-2fps and pass them as an image array, or use the video MIME type if the model card specifies support for it. The Conv3D processing happens inside the model, so the exact input format may differ from ViT-based models. Verify the video input convention in the Nano Omni model card before deploying.

python
import openai
from your_video_utils import extract_keyframes  # replace with your frame-extraction logic

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Recommended: extract keyframes at 1fps and pass as image array
# See model card for video MIME type support
frames = extract_keyframes("video.mp4", fps=1)
content = [{"type": "text", "text": "Describe what is happening in this video."}]
for frame_b64 in frames:
    content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}})

response = client.chat.completions.create(
    model="/data/models/nemotron-3-nano-omni",
    messages=[{"role": "user", "content": content}]
)
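
The extract_keyframes import above is a placeholder. One possible implementation is sketched below, assuming the opencv-python package is installed; the index-selection helper keyframe_indices is a hypothetical name introduced here. It returns base64-encoded JPEG strings in the shape the request code expects.

```python
import base64


def keyframe_indices(total_frames: int, native_fps: float,
                     target_fps: float = 1.0) -> list[int]:
    """Frame indices to keep when downsampling native_fps to target_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    step = max(native_fps / target_fps, 1.0)  # never sample above the native rate
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


def extract_keyframes(path: str, fps: float = 1.0) -> list[str]:
    """Return base64 JPEG frames sampled at roughly `fps` frames per second."""
    import cv2  # assumption: opencv-python is available on the client

    cap = cv2.VideoCapture(path)
    try:
        native_fps = cap.get(cv2.CAP_PROP_FPS)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in keyframe_indices(total, native_fps, fps):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(jpeg.tobytes()).decode())
        return frames
    finally:
        cap.release()
```

A 3-second clip at 30 fps sampled at 1 fps keeps frames 0, 30, and 60. Seeking by frame index can be slow or imprecise for some codecs, so decoding sequentially and skipping frames is a reasonable alternative for long videos.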

Step 6: Enable chain-of-thought reasoning

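Nano Omni has an approximately 16K token reasoning budget. To activate it through vLLM, pass enable_thinking: True as a chat-template argument via chat_template_kwargs in the extra body (the variable name comes from the model's chat template, so confirm it against NVIDIA's model card):

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="/data/models/nemotron-3-nano-omni",
    messages=[{"role": "user", "content": "Analyze this complex scene: [video input]"}],
    # vLLM forwards chat_template_kwargs to the model's chat template,
    # which is where the thinking switch is read.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)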

Monitor generation length for reasoning requests: chain-of-thought can produce significantly more tokens than non-reasoning responses. Set --max-model-len to a value that accommodates your expected reasoning trace length plus the multimodal input token count.
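
A rough way to size --max-model-len is to add up the pieces named above. The sketch below is illustrative: tokens_per_frame has no default because this guide does not state a per-frame token count, the 16,000 reasoning budget is the approximate figure quoted earlier, and the 1,000-token answer allowance is an arbitrary placeholder.

```python
import math


def required_context_len(text_tokens: int,
                         frames: int,
                         tokens_per_frame: int,
                         audio_seconds: float = 0.0,
                         audio_tokens_per_second: float = 7.0,
                         reasoning_budget: int = 16_000,
                         answer_tokens: int = 1_000) -> int:
    """Tokens needed for one request: inputs + reasoning trace + final answer."""
    media_tokens = (frames * tokens_per_frame
                    + math.ceil(audio_seconds * audio_tokens_per_second))
    return text_tokens + media_tokens + reasoning_budget + answer_tokens


# Example with a placeholder of 256 tokens per frame (not a model figure):
# 30 frames, 60 s of audio, and a 500-token text prompt.
print(required_context_len(500, 30, 256, audio_seconds=60))  # prints 25600
```

Note that the reasoning budget alone exceeds the --max-model-len 8192 used in the L40S launch command, so that profile suits non-reasoning requests or short traces unless the limit is raised.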

Deployment with SGLang

For teams already running SGLang infrastructure, the launch pattern is similar. Check the SGLang production deployment guide for general SGLang setup, and the vLLM vs SGLang comparison if you are choosing between the two for this workload.

SGLang launch for Nemotron 3 Nano Omni:

bash
python -m sglang.launch_server \
  --model-path /data/models/nemotron-3-nano-omni \
  --trust-remote-code \
  --port 8000

Conv3D video layer support in SGLang may require version verification against NVIDIA's compatibility matrix. Check SGLang release notes for Nemotron 3 Nano Omni support before deploying, and fall back to vLLM if Conv3D is not yet supported in your SGLang version.

Building a Multimodal Agent: Nano Omni as the Perception Layer

The two-model architecture separates concerns clearly:

  1. Nano Omni (30B/3B active) handles perception: takes raw video, audio, or images and returns structured text
  2. Larger reasoning model handles planning: receives Nano Omni's text and produces decisions or actions

python
import openai
from your_video_utils import extract_keyframes  # replace with your frame-extraction logic

nano_omni = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
reasoning_model = openai.OpenAI(base_url="http://reasoning-node:8001/v1", api_key="none")

def perceive(video_frames_b64: list[str]) -> str:
    content = [{"type": "text", "text": "Describe what is happening in this video in detail."}]
    for frame in video_frames_b64:
        content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})

    response = nano_omni.chat.completions.create(
        model="/data/models/nemotron-3-nano-omni",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content

def plan(perception_text: str, task: str) -> str:
    response = reasoning_model.chat.completions.create(
        model="/data/models/nemotron-3-ultra",
        messages=[
            {"role": "system", "content": "You are a planning agent. Use perception inputs to make decisions."},
            {"role": "user", "content": f"Scene description: {perception_text}\n\nTask: {task}\n\nWhat action should the agent take?"}
        ]
    )
    return response.choices[0].message.content

# Agent loop
frames = extract_keyframes("scene.mp4", fps=1)
perception = perceive(frames)
action = plan(perception, "Navigate around any obstacles and reach the exit.")

Nemotron 3 Ultra 550B and Nemotron Ultra 253B are both strong candidates for the reasoning tier, depending on your latency and cost requirements. For setups combining Nano Omni with retrieval-augmented generation, the agentic RAG infrastructure guide covers the storage and indexing side of that architecture.

The cost advantage of this pattern: at 3B active params, Nano Omni perception calls cost a fraction of running the same video through a 70B or 550B model. The perception step handles the expensive multimodal encoding; the reasoning step receives clean, compact text.

Cost Math: On-Demand vs Spot on Spheron

Monthly cost estimates assume 730 hours of continuous use.

| GPU | On-demand/hr | Spot/hr | Monthly on-demand | Monthly spot |
| --- | --- | --- | --- | --- |
| H100 SXM5 80GB | $4.06 | $2.91 | ~$2,964 | ~$2,124 |
| A100 80GB | $1.43 | $1.19 | ~$1,044 | ~$869 |
| L40S 48GB | $0.96 | N/A | ~$701 | N/A |

Break-even against hosted multimodal APIs. Nano Omni's 3B active parameter count drives very high tokens-per-dollar compared to comparable omni models. If you are processing video and audio at scale, the break-even point against pay-per-token multimodal APIs arrives faster than it does for text-only LLMs, because multimodal API pricing typically carries a premium per image frame or audio second.

For a rough estimate: if a hosted multimodal API charges $0.015 per image frame equivalent and you are processing 100 frames per minute continuously, that is $1.50/min or $90/hr. An A100 at $1.43/hr costs less per hour than one minute of that API spend, provided a single A100 can sustain the same frame rate for your workload.
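
The same estimate as a sketch you can rerun with your own numbers. The $0.015 per frame price is the hypothetical API rate from the paragraph above, and the function names are illustrative.

```python
def api_cost_per_hour(price_per_frame: float, frames_per_minute: float) -> float:
    """Hourly spend on a pay-per-frame API at a steady frame rate."""
    return price_per_frame * frames_per_minute * 60


def breakeven_frames_per_hour(gpu_cost_per_hour: float,
                              price_per_frame: float) -> float:
    """Frames per hour at which the API bill equals the GPU's hourly price."""
    return gpu_cost_per_hour / price_per_frame


api_hourly = api_cost_per_hour(0.015, 100)
print(f"API: ${api_hourly:.2f}/hr")
print(f"A100 break-even: ~{breakeven_frames_per_hour(1.43, 0.015):.0f} frames/hr")
```

Above roughly 95 frames per hour the self-hosted A100 is cheaper on paper. The comparison ignores engineering time and idle GPU hours, so treat it as a lower bound on the real break-even volume.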

Spot instances. H100 spot pricing drops to $2.91/hr vs $4.06 on-demand. For stateless perception tasks in a multi-agent pipeline, spot is viable: if a spot instance is reclaimed, you redirect perception calls to another instance. Build checkpointing into any stateful parts of your pipeline (the reasoning model session, not the Nano Omni calls) rather than relying on spot instance uptime. Spot can be reclaimed without notice.
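
Redirecting perception calls after a reclaim can be as simple as trying endpoints in order. This is a minimal sketch under the assumption that each perception call is stateless. The helper name and endpoint URLs are hypothetical, and the retriable exception types should match what your client raises (for the openai package, openai.APIConnectionError).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str],
                       call: Callable[[str], T],
                       retriable: tuple = (ConnectionError, TimeoutError)) -> T:
    """Try call(base_url) on each endpoint in order; return the first success."""
    last_error = None
    for base_url in endpoints:
        try:
            return call(base_url)
        except retriable as exc:
            # Instance likely reclaimed or unreachable: move on to the next one.
            last_error = exc
    raise RuntimeError("all perception endpoints failed") from last_error
```

A caller would wrap the perceive() request in a function that takes the base URL and pass a list such as a spot instance first and an on-demand instance second. Non-retriable errors (for example a malformed request) propagate immediately instead of being retried elsewhere.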

Production Notes

NIM microservice option. NVIDIA packages Nemotron models as containerized NIM inference endpoints. For teams who want a managed container rather than raw vLLM setup, see the NVIDIA NIM self-host deployment guide for the containerized path. NIM handles some of the engine configuration automatically.

Batching. Start with --max-num-seqs 8 and increase based on observed cache utilization. Video inputs inflate KV cache proportionally to frame count at the input token level (though Conv3D reduces downstream compute). Monitor vllm:gpu_cache_usage_perc in Prometheus. If it stays above 80% consistently, reduce --max-num-seqs rather than raising it.

Video latency. Conv3D preprocessing latency scales with frame count. At 1-2fps sampling for a 30-second clip, you are passing 30-60 frames. Measure time-to-first-token from the client side or from the vllm:time_to_first_token_seconds histogram, watch GPU utilization and memory alongside it with nvidia-smi dmon -s mu -d 2, and size your timeout accordingly.
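
Time-to-first-token is easiest to measure from the client with a streaming request. The sketch below keeps the timing logic separate from the HTTP client so it can be exercised without a server. measure_ttft is a hypothetical helper, and the commented wiring assumes the openai client and content list from the video example.

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft(start_stream: Callable[[], Iterable[str]],
                 clock: Callable[[], float] = time.perf_counter) -> Optional[float]:
    """Seconds from issuing the request to the first non-empty text chunk.

    start_stream must send the request and return an iterable of text chunks.
    Returns None if the stream ends without producing any text.
    """
    started = clock()
    for chunk in start_stream():
        if chunk:
            return clock() - started
    return None


# Hypothetical wiring for the OpenAI client used in this guide:
# def start_stream():
#     stream = client.chat.completions.create(
#         model="/data/models/nemotron-3-nano-omni",
#         messages=[{"role": "user", "content": content}],
#         stream=True,
#     )
#     return (c.choices[0].delta.content or "" for c in stream if c.choices)
#
# print(measure_ttft(start_stream))
```

Run it across clips of different lengths and set the client timeout comfortably above the slowest observed value.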

Context management for long sessions. At 300K context with approximately 7 audio tokens per second of audio (verify against your model card) plus video frame tokens, sessions accumulate tokens fast. Implement a sliding window over media tokens while preserving the full text history, or set a hard per-session cap on context length. Track token counts in your orchestration layer before they hit the model context limit.
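
One way to implement the sliding window is to drop older media parts from the message history while leaving every text part in place. The sketch below assumes OpenAI-style messages whose content is either a string or a list of typed parts, as in the request examples earlier. trim_media and the set of media part types are illustrative names; extend the set if your requests use other part types.

```python
MEDIA_TYPES = {"image_url", "audio_url"}  # part types used in this guide


def trim_media(messages: list[dict], keep_last: int) -> list[dict]:
    """Return a copy of messages keeping only the newest keep_last media parts.

    Text parts and plain-string messages are always preserved.
    """
    remaining = keep_last
    trimmed = []
    # Walk newest-to-oldest so the most recent media survives.
    for message in reversed(messages):
        content = message["content"]
        if isinstance(content, str):
            trimmed.append(message)
            continue
        kept_parts = []
        for part in reversed(content):
            if part.get("type") in MEDIA_TYPES:
                if remaining <= 0:
                    continue  # window is full: drop this older media part
                remaining -= 1
            kept_parts.append(part)
        trimmed.append({**message, "content": list(reversed(kept_parts))})
    return list(reversed(trimmed))
```

The function counts media parts rather than tokens, which is a simplification. A token-based window needs the per-frame and per-second token counts from the model card.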

Persistent storage. The BF16 checkpoint is roughly 60 GB. Always mount a persistent volume at /data/models/ so the model survives instance restarts without re-downloading. Use FP8 if storage cost is a constraint: the checkpoint drops to about 30 GB.

Monitoring stack. vLLM exposes Prometheus metrics at /metrics. Key metrics for Nano Omni: vllm:gpu_cache_usage_perc, vllm:num_requests_running, and vllm:time_to_first_token_seconds. Metric names have changed between vLLM releases, so confirm them against the /metrics output of your version. Track running requests separately for multimodal vs text-only requests if your workload mixes both, since multimodal requests carry more input tokens per request on average.
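
If you want a quick check without a full Prometheus setup, the /metrics text can be parsed directly. The sketch below reads gauge values from the Prometheus text exposition format. The helper names are hypothetical, and it assumes vllm:gpu_cache_usage_perc is reported as a fraction between 0 and 1, which you should confirm on your vLLM version.

```python
def metric_values(metrics_text: str, name: str) -> list[float]:
    """Return every sample value for metric `name` in Prometheus text format."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(name):
            continue  # also skips "# HELP" and "# TYPE" comment lines
        rest = line[len(name):]
        if rest.startswith("{"):
            rest = rest[rest.rfind("}") + 1:]  # drop the label set
        elif not rest.startswith(" "):
            continue  # a longer metric name that shares this prefix
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


def cache_pressure_high(metrics_text: str, threshold: float = 0.8) -> bool:
    """True if any reported KV cache usage sample exceeds threshold."""
    usage = metric_values(metrics_text, "vllm:gpu_cache_usage_perc")
    return bool(usage) and max(usage) > threshold
```

Fetch the text with any HTTP client from http://localhost:8000/metrics and pass it in. A single sample above the threshold is only a hint; the guidance above is about sustained usage, so sample repeatedly before changing --max-num-seqs.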

Setup documentation for Spheron instances is at docs.spheron.ai.


Nemotron 3 Nano Omni's 3B active parameter design makes it one of the most cost-efficient omni-modal models to self-host. Spheron's GPU cloud lets you go from weights on Hugging Face to a running multimodal endpoint without buying dedicated hardware.

Quick Setup Guide

  1. Choose a GPU tier based on quantization target

    For BF16 production serving, select an H100 SXM5 80GB or A100 80GB. For FP8 production with reduced VRAM cost, the L40S 48GB works. For development and testing with 4-bit GGUF, an RTX 4090 24GB is sufficient. Mamba state adds a fixed ~2-4 GB overhead on top of weight memory.

  2. Provision a GPU instance on Spheron

    Log into app.spheron.ai and provision your chosen GPU instance. Attach at least 70 GB of persistent storage for BF16 weights (or 35 GB for FP8). SSH in and verify your CUDA version with nvidia-smi: H100 requires CUDA 12.4+, and Blackwell GPUs require CUDA 12.8+.

  3. Install vLLM and download model weights from Hugging Face

    Install vLLM with SSM kernel and audio input support: pip install 'vllm[audio]==0.20.0'. The [audio] extra is required for audio input modality. Then download weights: huggingface-cli download nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 --local-dir /data/models/nemotron-3-nano-omni. For the FP8 checkpoint use nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 instead. Always verify the exact HuggingFace repository name against NVIDIA's release page at huggingface.co/nvidia before running, since the Nemotron family has inconsistent naming across releases.

  4. Launch the vLLM inference server with multimodal inputs enabled

    For BF16 on H100: vllm serve /data/models/nemotron-3-nano-omni --dtype bfloat16 --no-enable-chunked-prefill --trust-remote-code --port 8000. For FP8: add --quantization fp8. For FP8 on L40S: also add --gpu-memory-utilization 0.85 --max-model-len 8192.

  5. Send test requests with video, audio, and image inputs

    Use the OpenAI-compatible /v1/chat/completions endpoint with a content array. For image: include a dict with type image_url. For audio: base64-encode a WAV file and include it as audio_url. For video: pass keyframes as an image array at 1-2fps or use the video MIME type per the model card convention. Check NVIDIA's model card for the exact video input format.

  6. Wire Nano Omni as the perception layer for a reasoning agent

    Route multimodal inputs (video, audio, images) to Nano Omni first. Nano Omni returns structured text describing what it saw or heard. Pass that text to a larger reasoning model (Nemotron 3 Ultra 550B or Nemotron Ultra 253B) for planning and decision-making. This two-model pattern keeps perception costs low while preserving full reasoning quality in the planning step.

Frequently Asked Questions

Nemotron 3 Nano Omni is NVIDIA's 30B-total / 3B-active-per-token omni-modal MoE model, released on April 28, 2026 under the NVIDIA Nemotron Open Model License. It uses a hybrid Mamba-Transformer architecture with Conv3D video-native layers and supports text, image, video, and audio inputs with a 300K token context window and an approximately 16K token reasoning budget.

For BF16 precision you need an 80GB GPU (H100 SXM5 or A100 80GB) since the 30B weights occupy roughly 60 GB plus roughly 4-6 GB for Mamba state and KV cache headers. At FP8, weight memory drops to about 30 GB, making the L40S 48GB viable at moderate concurrency. For development-only use, 4-bit GGUF runs on a 24 GB consumer GPU like the RTX 4090.

Nemotron 3 Nano Omni processes video using a C-RADIOv4-H vision encoder with Conv3D layers that fuse pairs of frames into tubelets before passing them to the encoder, reducing the number of tokens fed into it. Efficient Video Sampling (EVS) then prunes redundant static tokens further. Together, the Conv3D + EVS path reduces encoder compute by roughly 9x compared to per-frame ViT-based approaches, per NVIDIA's published figures. Sample frames at 1-2fps when passing video to the model.

Nano Omni's 3B active parameter count means perception tasks (understanding video, audio, images) run at much lower cost per call than sending raw multimodal inputs to a 70B or larger model. The two-model pattern puts Nano Omni in the perception role (convert raw media to structured text) and a larger reasoning model like Nemotron 3 Ultra or Nemotron Ultra 253B in the planning role. The perception step handles expensive multimodal encoding; the reasoning step gets clean text at far lower latency.

vLLM supports the hybrid Mamba-Transformer MoE architecture as of version 0.20.0, which includes the Mamba SSM kernel. Install with pip install 'vllm[audio]==0.20.0' so the audio input extra required for Nemotron 3 Nano Omni is included. For Nemotron 3 Nano Omni specifically, launch with --no-enable-chunked-prefill until you have validated that chunked prefill produces correct outputs on your workload. Check NVIDIA's serving documentation and vLLM release notes for any Nano Omni-specific compatibility notes before deploying.
