NemoClaw runs Nemotron model inference inside sandboxed OpenShell environments on whatever GPU you point it at. For most teams, that GPU starts as a local workstation. When your Nemotron workload outgrows a single RTX PC or you need to test multi-agent scale before buying on-prem hardware, GPU cloud is the straightforward next step.
What Is NVIDIA NemoClaw: The Open-Source Agent Security Stack
NemoClaw is NVIDIA's open-source reference stack for running always-on AI agents more safely inside OpenShell sandboxes. It is published at github.com/NVIDIA/NemoClaw and installs three things in a single command: the Nemotron model weights, the OpenShell runtime, and the NemoClaw CLI that handles privacy enforcement, network policy controls, and agent lifecycle management.
NVIDIA announced NemoClaw at GTC 2026 as part of the RTX AI Garage initiative. The primary targets are local always-on compute platforms: RTX PCs, RTX PRO workstations, and DGX Spark. Running it on GPU cloud fits when your Nemotron workload exceeds what local hardware can handle, when you want to benchmark the full stack before committing to on-prem hardware, or when your agent deployment needs more GPU memory than a single workstation provides.
NemoClaw is not a rendering pipeline. It does not generate images, process video, or interface with simulation engines. It runs language model inference inside a sandboxed, policy-controlled environment for AI agents. That distinction matters for GPU sizing: the relevant metric is Nemotron inference throughput and VRAM headroom, not rendering FPS.
NemoClaw Architecture: OpenShell, Nemotron, and the CLI
Three components work together in a NemoClaw deployment.
OpenShell is the sandboxed runtime that agents run inside. It isolates each agent's compute, memory access, and network traffic. OpenShell provides the privacy guarantees that let you run multiple agents on shared GPU infrastructure without cross-contamination between agent contexts or data. The isolation level and network access rules are configurable per agent.
Nemotron models are NVIDIA's conversational AI models that serve as the inference backend. NemoClaw installs these models as part of its setup process and routes all inference requests through its request handling layer. The model size you choose determines your GPU requirements: a 4B-parameter Nemotron model runs on almost any GPU with 8GB+ VRAM, while a 70B model needs roughly 140GB for its BF16 weights alone.
NemoClaw's CLI handles three things: lifecycle management (starting, stopping, and restarting agents), routed inference (directing requests to the correct Nemotron model instance), and network policy controls (what each agent can reach on the network and what it cannot).
Two agent types ship with the NemoClaw reference stack:
- OpenClaw handles interactive, user-facing workloads. It is optimized for low first-token latency and works well for agent assistants and interactive query routing.
- Hermes handles background, task-oriented workloads. It is tuned for throughput over time and fits pipeline automation, document processing, and scheduled task agents.
Both run inside OpenShell sandboxes and communicate through NemoClaw's routing layer.
GPU Requirements for Nemotron Inference
NemoClaw's GPU requirement is set almost entirely by the Nemotron model size you choose. The OpenShell runtime and NemoClaw CLI add minimal overhead; VRAM use is driven by the model weights and the KV cache for active inference sessions.
NVIDIA's Nemotron family covers a wide range:
- Nemotron-Mini (4B): fits comfortably on any GPU with 8GB+ VRAM
- Nemotron-4-8B-Instruct: standard model for interactive agent workloads, needs ~16GB VRAM in BF16
- Nemotron-Super-49B: large interactive model, needs ~98GB in BF16 or ~28GB with 4-bit quantization
- Nemotron-4-70B: needs ~140GB in BF16 or ~40GB with 4-bit quantization
- Nemotron-4-340B: requires multi-GPU or aggressive quantization for any single-GPU setup
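The BF16 figures in this list follow a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter. A minimal sketch of that arithmetic is below. The function name is hypothetical, and KV cache and runtime overhead are deliberately left out, which is why the quantized figures quoted above sit a little higher than the raw result.

```python
def weight_footprint_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


# Parameter counts are taken from the model names in the list above.
for name, params in [("8B", 8), ("49B", 49), ("70B", 70)]:
    bf16 = weight_footprint_gb(params, 16)
    int4 = weight_footprint_gb(params, 4)
    print(f"Nemotron {name}: ~{bf16:.0f} GB in BF16, ~{int4:.1f} GB at 4-bit (weights only)")
```

Whatever VRAM is left after the weights is what remains for KV cache and concurrent sessions.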
| GPU | VRAM | Generation | NemoClaw Use Case | Spheron On-Demand | Spot |
|---|---|---|---|---|---|
| RTX PRO 6000 96GB | 96GB GDDR7 | Blackwell | Nemotron-4B/8B, multi-agent cost-optimized | $1.70/hr | $0.66/hr |
| H100 PCIe 80GB | 80GB HBM2e | Hopper | Nemotron-8B full, 49B quantized, dev/test | $2.01/hr | N/A |
| H100 SXM5 80GB | 80GB HBM3 | Hopper | Nemotron-49B quantized, high-throughput 8B | $3.90/hr | $1.73/hr |
| H200 SXM5 141GB | 141GB HBM3e | Hopper | Nemotron-70B full precision, large model agents | $2.51/hr | $1.40/hr |
| B200 SXM6 192GB | 192GB HBM3e | Blackwell | Nemotron-70B full, 340B quantized, maximum throughput | $7.00/hr | $2.14/hr |
Pricing fluctuates with GPU availability. The prices above are as of 31 May 2026 and may have changed. Check current GPU pricing for live rates.
A few notes on GPU selection for NemoClaw workloads:
RTX PRO 6000. 96GB GDDR7 handles Nemotron-8B and smaller with substantial headroom for KV cache during multi-turn agent sessions. At $1.70/hr on-demand or $0.66/hr on spot on Spheron, this is a solid option for smaller Nemotron deployments.
H100 PCIe vs SXM5. Both have 80GB VRAM. The SXM5 offers 3.35 TB/s memory bandwidth vs 2.0 TB/s for PCIe. For transformer inference, which is heavily memory-bandwidth-bound, SXM5 delivers meaningfully higher tokens per second at the same model size.
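One way to see why bandwidth matters: when decoding a single stream on a dense model, each generated token requires reading all the weights from VRAM once. Memory bandwidth divided by the weight footprint therefore gives a rough upper bound on tokens per second. The sketch below applies that approximation to the two H100 variants. It ignores KV cache reads, batching, and kernel overhead, so treat the numbers as ceilings rather than predictions. The function name is illustrative.

```python
def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Rough single-stream ceiling: bytes readable per second / bytes read per token."""
    bandwidth_gb_s = bandwidth_tb_s * 1000  # TB/s -> GB/s
    return bandwidth_gb_s / weights_gb


weights_gb = 16  # an 8B model in BF16, per the sizing figures in this guide
sxm5 = decode_ceiling_tokens_per_sec(3.35, weights_gb)
pcie = decode_ceiling_tokens_per_sec(2.0, weights_gb)
print(f"H100 SXM5 ceiling: ~{sxm5:.0f} tok/s, H100 PCIe ceiling: ~{pcie:.0f} tok/s")
print(f"SXM5 / PCIe ratio: {sxm5 / pcie:.2f}x")
```

The ratio between the two ceilings is simply the ratio of the bandwidths, so it holds at any model size.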
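H200 for 70B full precision. The 141GB of HBM3e fits Nemotron-70B in BF16 without quantization, though only just: with ~140GB taken by weights, little VRAM is left for KV cache, so expect to keep context lengths and concurrency modest. This matters when agent quality is more important than cost and you want to avoid the generation-quality tradeoffs of 4-bit quantization.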
B200 Blackwell advantage. FP8 support and 192GB of VRAM make the B200 the best single-GPU option for the largest Nemotron variants or extremely high-throughput multi-user agent deployments. Hopper also runs FP8 through its Transformer Engine, but the B200 delivers higher token throughput for the same model size.
Spheron's H100 GPU rental page lists current H100 availability. Teams running 70B+ models can rent H200 instances on Spheron, and the B200 and RTX PRO 6000 instances cover both Blackwell tiers.
Step-by-Step: Deploy NemoClaw on GPU Cloud
Prerequisites
- Spheron account with GPU instance access (docs.spheron.ai)
- NGC account and API key (ngc.nvidia.com/setup/api-key)
- NVIDIA driver 550+ and CUDA 12.4+
- Docker and NVIDIA Container Toolkit
Step 1: Provision and Verify Your GPU Instance
Rent an H100 or H200 instance through Spheron. Once the instance is running:
# Verify driver and CUDA versions
nvidia-smi
# Expected: driver 550+, CUDA 12.4+
# Verify NVIDIA Container Toolkit
nvidia-ctk --version
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Smoke test: confirm GPU passthrough to containers
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

CUDA 12.4 or later is required. Earlier CUDA versions lack kernel features that NemoClaw and Nemotron inference paths depend on.
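If you provision instances often, the driver check can be scripted rather than read off the `nvidia-smi` banner. The sketch below is one possible approach, not part of NemoClaw: the helper names are made up for illustration, and it relies only on the standard `nvidia-smi --query-gpu=driver_version --format=csv,noheader` query.

```python
import subprocess


def driver_major(version: str) -> int:
    """Extract the major version from a driver string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def driver_meets_minimum(version: str, minimum_major: int = 550) -> bool:
    """True if the driver's major version is at least the required one."""
    return driver_major(version) >= minimum_major


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU (needs a GPU host)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


# Example with literal version strings, so this runs anywhere:
print(driver_meets_minimum("550.54.15"))
print(driver_meets_minimum("535.129.03"))
```

On the instance itself, `driver_meets_minimum(installed_driver_version())` performs the same check against the real driver.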
Step 2: Clone NemoClaw from GitHub and Run the Installer
NemoClaw is published as an open-source reference stack on GitHub. There is no NIM container on NGC for NemoClaw itself.
# Set your NGC API key before running the installer
# (the installer uses this to download Nemotron weights from NGC)
export NGC_API_KEY=<your-ngc-api-key>
# Clone the NemoClaw repository
git clone https://github.com/NVIDIA/NemoClaw.git
cd NemoClaw
# Run the installer
chmod +x install.sh
./install.sh

The installer handles three things: downloading the selected Nemotron model weights from NGC or Hugging Face, setting up the OpenShell runtime, and installing the NemoClaw CLI binaries. Depending on which Nemotron model you configure, the download can range from a few gigabytes (4B model) to over 100GB (70B model). Make sure your instance has adequate storage before starting.
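A quick way to confirm storage before starting a large download is to check free space on the target filesystem. This is a generic sketch using Python's standard library. The helper name and the 150GB threshold are assumptions for illustration, not values from the installer.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Assumed margin for a 70B model: somewhat more than the "over 100GB" download.
if has_free_space(".", 150):
    print("Enough free space for the download")
else:
    print("Free up or attach more storage before running the installer")
```

Point `path` at the directory where the weights will actually be stored, since that may be a different volume from the one holding the repository.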
# Verify the install completed
nemoclaw --version
# Confirm Nemotron weights are present
nemoclaw models list

Step 3: Configure OpenShell and Select Nemotron Model
NemoClaw uses a YAML config file to define the OpenShell parameters and the Nemotron model to load:
# Copy the example config
cp config/nemoclaw.example.yaml config/nemoclaw.yaml

Key parameters to set before first run:
model:
  name: "nemotron-4-8b-instruct"   # options: nemotron-mini-4b, nemotron-4-8b-instruct,
                                   # nemotron-super-49b, nemotron-4-70b-instruct
  precision: "bf16"                # bf16, fp8 (Hopper/Blackwell), int8 or int4 (quantized)
openShell:
  isolation_level: "full"          # full or partial
  memory_limit_gb: 20              # max per-agent memory footprint in GB
  network_policy: "restricted"     # restricted (default) or permissive
agents:
  type: "openclaw"                 # openclaw (interactive) or hermes (background tasks)
  max_concurrent: 4                # concurrent agent sessions
output:
  log_level: "info"
  log_path: /var/log/nemoclaw

FP8 precision is available on Hopper (H100, H200) via the Transformer Engine and natively on Blackwell (B200, RTX PRO 6000). For development and testing on any modern GPU, bf16 is the safe default.
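The precision rule above is easy to get wrong when a config is copied between instance types. The following sketch shows a pre-flight check you could run on the values before starting agents. It is a hypothetical helper, not a NemoClaw feature, and the set of precision names is taken from the configs shown in this guide.

```python
FP8_GENERATIONS = {"hopper", "blackwell"}            # per the FP8 note above
KNOWN_PRECISIONS = {"bf16", "fp8", "int8", "int4"}   # values used in this guide's configs


def precision_problems(precision: str, gpu_generation: str) -> list[str]:
    """Return human-readable problems; an empty list means the pairing looks fine."""
    problems = []
    if precision not in KNOWN_PRECISIONS:
        problems.append(f"unknown precision '{precision}'")
    elif precision == "fp8" and gpu_generation.lower() not in FP8_GENERATIONS:
        problems.append(f"fp8 needs Hopper or Blackwell, got '{gpu_generation}'")
    return problems


print(precision_problems("bf16", "Ampere"))
print(precision_problems("fp8", "Ampere"))
print(precision_problems("fp8", "Hopper"))
```

Returning a list rather than raising makes it simple to collect every problem in a config and report them together.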
Step 4: Launch Your First NemoClaw Agent
# Start NemoClaw with OpenClaw for interactive workloads
nemoclaw start \
  --config config/nemoclaw.yaml \
  --agent-type openclaw
# Or with Hermes for background task agents
nemoclaw start \
  --config config/nemoclaw.yaml \
  --agent-type hermes
# Check that the agent is running
nemoclaw status
# Send a test request
nemoclaw test --prompt "What can you help me with today?"

Expected first-token latency on H100 SXM5 with Nemotron-8B-Instruct in BF16: under 200ms for a standard prompt. On RTX PRO 6000 with the same model: under 300ms. On H200 with Nemotron-70B in BF16: 400-700ms depending on prompt length.
Step 5: Configure Network Policies for Agent Isolation
OpenShell's network policy controls define what each agent can and cannot reach. NemoClaw enforces these at the sandbox level:
# List current network policies
nemoclaw policy list
# Apply the strict-inference-only template
# (blocks all outbound except the local Nemotron inference endpoint)
nemoclaw policy apply \
  --agent-type openclaw \
  --policy strict-inference-only
# For multi-agent deployments with per-agent policies
nemoclaw policy apply \
  --agent-id agent-001 \
  --policy config/custom-policy.yaml
# Verify policy enforcement
nemoclaw policy verify --agent-type openclaw

The strict-inference-only policy template ships with the NemoClaw reference stack and works for agent deployments that only need to call the local Nemotron inference endpoint. For production agents that require external API access, define a custom policy YAML that whitelists only the specific endpoints your agent needs.
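The idea behind an allowlist-style policy is default-deny: a destination is reachable only if it is explicitly listed. The sketch below illustrates that logic in plain Python. It does not reflect NemoClaw's actual policy file format, which this guide does not show, and the hosts and ports are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: (host, port) pairs the agent may reach.
ALLOWED_ENDPOINTS = {
    ("localhost", 8000),       # assumed local inference endpoint
    ("api.example.com", 443),  # placeholder external API
}


def is_allowed(url: str, allowed=ALLOWED_ENDPOINTS) -> bool:
    """Default-deny check: permit a URL only if its host and port are listed."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    port = parts.port or default_port
    return (parts.hostname, port) in allowed


print(is_allowed("http://localhost:8000/generate"))
print(is_allowed("https://api.example.com/v1/items"))
print(is_allowed("https://other.example.net/"))
```

Matching on host and port together, rather than host alone, keeps an allowed API host from also exposing other services running on the same machine.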
Production Use Cases: When to Use GPU Cloud for NemoClaw
NemoClaw's primary target hardware is local always-on compute: RTX PCs, RTX PRO workstations, and DGX Spark. GPU cloud fits in three specific scenarios:
Large Nemotron models. Nemotron-70B requires 140GB+ VRAM for full-precision BF16 inference. No single RTX PC can handle that. An H200 SXM5 (141GB) on cloud handles Nemotron-70B without quantization. The B200 SXM6 (192GB) handles Nemotron-70B with headroom for large KV caches and multi-session workloads.
Multi-agent scale. Running 10+ concurrent agents, each with its own OpenShell sandbox, multiplies GPU memory requirements quickly. A Nemotron-8B model takes ~16GB in BF16; eight concurrent sessions take ~128GB. A single H200 handles that; a single RTX PC workstation does not. Cloud GPU instances let you scale horizontally without waiting on hardware procurement.
Development and testing before on-prem commitment. Teams evaluating NemoClaw before buying DGX Spark hardware can run the full stack on cloud GPU instances first. Benchmarking Nemotron inference performance, testing your agent configuration, and validating OpenShell isolation behavior all work on cloud instances before you finalize an on-prem purchase.
For context on other NVIDIA AI workloads running on the same GPU cloud infrastructure, the Deploy NVIDIA Isaac GR00T N1 on GPU Cloud guide covers GPU requirements for the GR00T N1 robotics model. The Deploy NVIDIA Cosmos on GPU Cloud guide covers world model generation workloads.
Spheron Pricing for NemoClaw Inference Workloads
These estimates use per-GPU rates from the live Spheron pricing API (31 May 2026). The "Nemotron fit" column shows which Nemotron model sizes fit at full precision and which need quantization at each GPU tier.
| Config | GPU | Rate | Est. 8hr cost | Nemotron fit |
|---|---|---|---|---|
| Cost-optimized agents | RTX PRO 6000 (spot) | $0.66/hr | $5.28 | 4B, 8B full precision |
| Dev/test | H100 PCIe (on-demand) | $2.01/hr | $16.08 | 8B full, 49B quantized |
| Production spot | H100 SXM5 (spot) | $1.73/hr | $13.84 | 8B full, 49B/70B quantized |
| Production on-demand | H100 SXM5 (on-demand) | $3.90/hr | $31.20 | 8B full, 49B/70B quantized |
| Large model spot | H200 SXM5 (spot) | $1.40/hr | $11.20 | 70B full precision |
| Large model on-demand | H200 SXM5 (on-demand) | $2.51/hr | $20.08 | 70B full precision |
| Max throughput spot | B200 SXM6 (spot) | $2.14/hr | $17.12 | 70B full, 340B quantized |
| Max throughput on-demand | B200 SXM6 (on-demand) | $7.00/hr | $56.00 | 70B full, 340B quantized |
For teams running Nemotron-70B, H200 SXM5 spot at $1.40/hr is the strongest value: full-precision BF16 inference without quantization tradeoffs, at the lowest per-hour rate of any GPU that fits the full model. For smaller Nemotron models (8B and below) where you want the lowest cost, RTX PRO 6000 spot at $0.66/hr is hard to beat.
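The 8-hour estimates in the table are simply rate times hours, and the same arithmetic extends to an always-on month. A small sketch, using rates copied from the table above; the function name is illustrative and the 30-day month is an assumption.

```python
def run_cost(rate_per_hour: float, hours: float) -> float:
    """Cost in USD for running one GPU instance for the given number of hours."""
    return round(rate_per_hour * hours, 2)


# Rates copied from the pricing table (31 May 2026 snapshot).
rates = {
    "RTX PRO 6000 spot": 0.66,
    "H200 SXM5 spot": 1.40,
    "H200 SXM5 on-demand": 2.51,
}

for name, rate in rates.items():
    eight_hours = run_cost(rate, 8)
    month = run_cost(rate, 24 * 30)  # assumes a 30-day month, running continuously
    print(f"{name}: 8h = ${eight_hours:.2f}, 30 days always-on = ${month:.2f}")
```

Comparing the monthly figure against hardware cost is one way to decide when an on-prem purchase starts to pay off.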
NemoClaw's always-on agent pattern means the GPU is mostly idle, occasionally handling inference requests. Per-minute billing fits this usage profile better than hourly reserved instances. Spheron aggregates GPU availability across 5+ providers so instances scale up on-demand without a capacity reservation.
Performance and Tuning: Model Selection and Inference Throughput
Model Selection by Use Case
The right Nemotron model depends on the complexity of agent tasks and the latency requirements:
- Nemotron-Mini 4B: fast responses, simple task routing, lightweight always-on assistants. Runs on any GPU with 8GB+ VRAM. Use when cost and latency matter more than response quality.
- Nemotron-8B: better reasoning for code assistance, multi-step task completion, and information retrieval agents. Best cost-throughput balance for most OpenClaw interactive deployments.
- Nemotron-Super-49B: high-quality generation for demanding agent tasks. Use on H100 SXM5 with quantization (fits in 80GB) or H200 at full precision.
- Nemotron-70B: maximum quality for agents handling complex reasoning, long-context analysis, or nuanced instruction following. Use H200 for full BF16; H100 SXM5 with 4-bit quantization as a cost-reduction option where VRAM is the constraint.
- Nemotron-340B: use only when you need the top of NVIDIA's conversational quality and can accept multi-GPU setup or aggressive quantization tradeoffs.
Quantization Options
For teams running Nemotron-70B on H100 SXM5 (80GB), 4-bit quantization via GPTQ or AWQ brings VRAM requirements from ~140GB down to ~38GB. This fits on 80GB with room for KV cache. Generation quality drops slightly for complex reasoning tasks but is acceptable for most agent workloads. Use BF16 on H200 or B200 where VRAM is not the constraint.
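The choice between BF16 and a quantized format comes down to whether the weights plus a working reserve fit in VRAM. The sketch below picks the highest precision that fits. The helper is hypothetical, the 16GB reserve for KV cache and runtime overhead is an assumed figure, and real quantized checkpoints carry some extra overhead beyond the raw bit count.

```python
# Bits per weight for each precision name used in this guide's configs.
BITS = {"bf16": 16, "int8": 8, "int4": 4}


def best_precision(params_billions: float, vram_gb: float, reserve_gb: float):
    """Return the highest precision whose weights plus a reserve fit in VRAM, else None."""
    for precision in ("bf16", "int8", "int4"):  # highest quality first
        weights_gb = params_billions * BITS[precision] / 8
        if weights_gb + reserve_gb <= vram_gb:
            return precision
    return None


# 70B model with an assumed 16 GB reserve for KV cache and runtime overhead.
print(best_precision(70, 80, 16))   # 80 GB card such as H100 SXM5
print(best_precision(70, 192, 16))  # 192 GB card such as B200
```

Raising or lowering `reserve_gb` shows how quickly longer contexts or more concurrent sessions push a deployment toward a lower precision or a larger GPU.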
Setting quantization in the NemoClaw config:
# 4-bit quantization (GPTQ): ~38GB VRAM, fits on H100 SXM5 80GB
model:
  name: "nemotron-4-70b-instruct"
  precision: "int4"

# 8-bit quantization: higher quality, ~70GB VRAM
model:
  name: "nemotron-4-70b-instruct"
  precision: "int8"

Throughput for Always-On Agent Workloads
NemoClaw's always-on pattern means the agent spends most of its time idle in the OpenShell sandbox, processing inference requests as they arrive. The key metrics:
First-token latency: matters for OpenClaw interactive agents. Minimum target: under 500ms for user-facing response. On H100 SXM5 with Nemotron-8B in BF16, typical latency is 100-200ms. On H200 with Nemotron-70B in BF16, typical latency is 400-700ms.
Tokens per second: matters for Hermes background task agents processing long outputs or high request volume. H200 SXM5 running Nemotron-70B in BF16 delivers roughly 40-60 tokens/sec. B200 SXM6 with native FP8 delivers 70-100+ tokens/sec for the same model.
Concurrent session capacity: each concurrent agent session holds its KV cache in GPU memory. At ~16GB per session for Nemotron-8B, an H100 SXM5 (80GB) handles 4 concurrent sessions with headroom. An H200 (141GB) handles 8+. Plan your max_concurrent setting based on available VRAM minus model weight footprint.
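The session counts above can be reproduced with a one-line budget: usable VRAM divided by the per-session allowance. The sketch below uses this guide's ~16GB per Nemotron-8B session. The function name and the 10GB headroom are assumptions, and the result is a starting point for `max_concurrent` rather than a guarantee.

```python
import math


def max_concurrent_sessions(vram_gb: float, per_session_gb: float,
                            headroom_gb: float = 0.0) -> int:
    """Sessions that fit when each one is budgeted per_session_gb of VRAM."""
    usable_gb = vram_gb - headroom_gb
    return max(0, math.floor(usable_gb / per_session_gb))


# ~16 GB per Nemotron 8B session, as budgeted above; 10 GB headroom is assumed.
print(max_concurrent_sessions(80, 16, headroom_gb=10))   # H100 SXM5
print(max_concurrent_sessions(141, 16, headroom_gb=10))  # H200
```

Measure actual per-session memory under your own prompt lengths before relying on the result, since KV cache size grows with context length.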
NemoClaw agent inference is GPU-bound only during active requests. Per-minute billing on Spheron means you are charged only for the minutes an instance is running, so stopping instances between test runs keeps idle time off the bill. Spheron aggregates GPU availability across 5+ providers so instances start in minutes without a capacity reservation.
Quick Setup Guide
1. Rent an H100 or H200 instance on Spheron. SSH in and verify NVIDIA driver 550+ and CUDA 12.4+ are installed with nvidia-smi. Confirm Docker and the NVIDIA Container Toolkit are installed. Run: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker. Smoke-test with: docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
2. Set your NGC API key with: export NGC_API_KEY=<your-ngc-api-key>. Clone the repository: git clone https://github.com/NVIDIA/NemoClaw.git && cd NemoClaw. Make the installer executable and run it: chmod +x install.sh && ./install.sh. The installer downloads Nemotron model weights and sets up the OpenShell runtime and NemoClaw tooling. Verify with: nemoclaw --version
3. Copy the example config: cp config/nemoclaw.example.yaml config/nemoclaw.yaml. Open the file and set: model.name (e.g. nemotron-4-8b-instruct), model.precision (bf16, or fp8 on Hopper/Blackwell), openShell.isolation_level (full or partial), openShell.network_policy (restricted or permissive), and agents.max_concurrent. Save the config before starting agents.
4. Start NemoClaw with OpenClaw for interactive workloads: nemoclaw start --config config/nemoclaw.yaml --agent-type openclaw. Or use Hermes for background task agents: nemoclaw start --config config/nemoclaw.yaml --agent-type hermes. Check agent status with: nemoclaw status. Send a test request: nemoclaw test --prompt 'Hello, what can you help me with?'
5. List current policies with: nemoclaw policy list. Apply the strict-inference-only template to block all outbound network access except the local Nemotron inference endpoint: nemoclaw policy apply --agent-type openclaw --policy strict-inference-only. For multi-agent deployments with per-agent policies: nemoclaw policy apply --agent-id agent-001 --policy config/custom-policy.yaml. Verify enforcement with: nemoclaw policy verify --agent-type openclaw
Frequently Asked Questions
What is NVIDIA NemoClaw?
NVIDIA NemoClaw is an open-source reference stack for running always-on AI agents more safely inside NVIDIA OpenShell sandboxes. Published at github.com/NVIDIA/NemoClaw, it installs Nemotron model weights and the OpenShell runtime in a single command, and uses NemoClaw's built-in tooling to enforce privacy, network policy, and lifecycle management for agents. It was announced at GTC 2026 as part of the RTX AI Garage initiative.
What GPU do I need to run NemoClaw?
GPU requirements depend on which Nemotron model you run. For Nemotron-Mini (4B) or Nemotron-8B, an RTX PRO 6000 (96GB GDDR7) or H100 PCIe is sufficient. For Nemotron-49B, an H100 SXM5 80GB handles it with quantization. For Nemotron-70B at full BF16 precision, use H200 SXM5 (141GB). For Nemotron-340B or maximum-throughput deployments, B200 SXM6 (192GB) is the best single-GPU option.
What is the difference between OpenClaw and Hermes?
OpenClaw and Hermes are the two agent types that ship with the NemoClaw reference stack. OpenClaw handles interactive, user-facing workloads where low latency matters. Hermes handles background, task-oriented workloads where throughput over time is the priority. Both run inside OpenShell sandboxes and communicate through NemoClaw's routing layer.
How does OpenShell isolate agents?
OpenShell is a sandboxed runtime that isolates each agent's compute, memory access, and network traffic. NemoClaw enforces network policy controls that define what each agent can reach on the network. This lets you run multiple agents on shared GPU infrastructure without cross-contamination between agent contexts or data streams. The isolation level and network policy are both configurable per agent.
Can I run NemoClaw on GPU cloud?
Yes. While NemoClaw's primary targets are local always-on platforms (RTX PCs, RTX PRO workstations, DGX Spark), GPU cloud is the right choice when you need to run large Nemotron models (70B+) that exceed workstation VRAM, when you're testing at scale before committing to on-prem hardware, or when your multi-agent deployment requires more GPU memory than a single workstation provides.
