Engineering

NVIDIA OpenShell and Agent Toolkit: Deploy Secure Agentic AI on GPU Cloud

Written by Mitrasish, Co-founder · Apr 3, 2026

Tags: NVIDIA OpenShell, NVIDIA Agent Toolkit, Agentic AI, GPU Cloud, Autonomous AI Agents, NemoClaw, Kubernetes, H100, AI Infrastructure, Security
NVIDIA OpenShell shipped at GTC 2026 (March 16-19) alongside the broader Agent Toolkit. The short version: it is an open-source runtime that wraps autonomous agents in policy-governed sandboxes using kernel-level security primitives, so agents can run shell commands and call external tools without touching the host system. For GPU cloud deployments, this matters because it separates the security and policy layer from the inference backend, letting you run production agent workloads on H100 or B200 instances without embedding trust decisions in agent code.

This post covers the architecture, how to deploy it on Spheron GPU instances, the actual security mechanisms (standard Linux kernel primitives, not proprietary components), GPU sizing, and a cost comparison against managed API providers. For general GPU infrastructure for agentic AI, start with the GPU infrastructure requirements for AI agents guide.

What Are NVIDIA OpenShell and the Agent Toolkit

NVIDIA Agent Toolkit is the umbrella. It bundles:

  • OpenShell: the open-source secure sandbox runtime
  • NemoClaw: a reference stack for the OpenClaw agent platform built on top of OpenShell
  • AI-Q Blueprint: an open agentic search blueprint (integrates with LangChain, CrewAI, and others)
  • Nemotron models: the open model family for local inference

OpenShell itself is the piece that matters for GPU cloud deployments. It runs a K3s Kubernetes cluster inside a single Docker container, which means you do not need to provision a separate Kubernetes cluster. Each agent runs in an isolated sandbox with a declarative YAML policy controlling what it can access. Policies cover four domains: filesystem paths, outbound network, process execution, and inference routing.

The supported agent clients include Claude, OpenCode, Codex, and GitHub Copilot CLI. Credential injection happens at the provider level, meaning credentials never touch the sandbox filesystem.

NemoClaw leverages OpenShell's Privacy Router to decide whether an inference call goes to a local Nemotron model (on supported RTX or DGX hardware) or a cloud inference endpoint, based on your data residency requirements. For GPU cloud deployments on H100 or B200 servers, the cloud inference path is what you will use.

Architecture: OpenShell Runtime, NemoClaw, and Policy-Based Agent Governance

The stack has four distinct layers:

| Layer | Component | Role |
| --- | --- | --- |
| Agent client | Claude, OpenCode, Codex, GitHub Copilot CLI | Sends tasks to the sandbox |
| Runtime | OpenShell | K3s inside Docker, policy enforcement, sandbox lifecycle |
| Security | seccomp + Landlock LSM + network namespaces | Kernel-level enforcement |
| Reference stack | NemoClaw | Policy presets, Nemotron install, Privacy Router |

OpenShell runtime manages the K3s cluster inside Docker. When an agent client submits a task, OpenShell creates a sandbox pod, applies the YAML policy, and runs the agent code inside that pod. The agent can read and write to /sandbox and /tmp, call allowed network endpoints, and invoke inference APIs, but cannot escalate privileges, mount filesystems, or make unauthorized outbound connections.

Policies are hot-reloadable, meaning you can update allowed endpoints or filesystem paths without restarting any sandbox. This is useful when agent workflows evolve in production.
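The mechanics of OpenShell's own reloader are not documented here, but the general pattern is simple: watch the policy file's mtime and re-parse on change. A minimal sketch, where `parse_policy` is a stand-in stub for a real YAML parser:

```python
import os


def parse_policy(text: str) -> dict:
    # Stand-in for a real YAML parser: one "key: value" pair per line.
    policy = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            policy[key.strip()] = value.strip()
    return policy


class PolicyWatcher:
    """Reload a policy file when its mtime changes, without restarting anything."""

    def __init__(self, path: str):
        self.path = path
        self.mtime = 0.0
        self.policy = {}
        self.reload_if_changed()

    def reload_if_changed(self) -> bool:
        mtime = os.stat(self.path).st_mtime
        if mtime == self.mtime:
            return False  # unchanged since last check
        with open(self.path) as f:
            self.policy = parse_policy(f.read())
        self.mtime = mtime
        return True
```

Calling `reload_if_changed()` on a timer gives running sandboxes the updated allow-lists without a restart, which is the behavior described above.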

NemoClaw sits on top as a reference implementation for the OpenClaw use case. Its Blueprint component defines the agent workflow, while the Plugin component handles user interaction in TypeScript. The Privacy Router, described earlier, then decides per call whether inference runs on a local Nemotron model or a cloud provider, based on your configured policy.

Security stack (covered in detail below): seccomp for syscall filtering, Landlock LSM for filesystem restrictions, network namespaces for traffic isolation. These are standard Linux kernel mechanisms, not proprietary NVIDIA components.

GPU Requirements for OpenShell Workloads on GPU Cloud

OpenShell itself is lightweight. The K3s cluster and policy engine run fine on a CPU-only node. The GPU requirement comes from the inference backend your agents call.

For experimental GPU passthrough to sandbox pods (the current capability in OpenShell), the host needs NVIDIA drivers and the NVIDIA Container Toolkit installed. The default sandbox base image does not include GPU libraries, so a custom image is required for any agent code that calls CUDA directly.

For the more common production pattern, where agents call an inference API (vLLM, Nemotron, or a cloud endpoint), the GPU sizing follows the same rules as any inference workload:

| Config | Model size | VRAM (weights) | VRAM (KV cache) | Min GPU |
| --- | --- | --- | --- | --- |
| Orchestrator only | 8B FP8 | ~8 GB | ~10 GB (50 sessions, 4K ctx) | RTX 4090 (24 GB) |
| Orchestrator + workers | 8B FP8 + 8B FP8 | ~16 GB | ~20 GB (50 sessions, 4K ctx) | H100 PCIe (80 GB) |
| Large orchestrator | 70B FP8 | ~70 GB | ~50 GB (50 sessions, 8K ctx) | H100 SXM5 + second GPU |
| High-concurrency fleet | 8B FP8 | ~8 GB | ~125 GB (250 sessions, 4K ctx) | B200 (192 GB) |

For production agent APIs where agents drive agentic workflows with tool calls, the H100 SXM5 is the standard starting point. The B200 becomes relevant at high session counts or when running 70B orchestrator models without tensor parallelism. For a breakdown of VRAM requirements across common open-source models like Llama 4, DeepSeek, and Qwen 3, see the GPU requirements cheat sheet.
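The KV cache column above is plain arithmetic: 2 tensors (K and V) per layer, per KV head, per head dimension, per byte of cache dtype, multiplied out by tokens and sessions. A sketch, assuming Llama-3.1-8B-like geometry (32 layers, 8 KV heads, head dim 128) and a 1-byte FP8 cache; exact figures vary with the model config:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_value: int, ctx_tokens: int, sessions: int) -> int:
    """Total KV cache size: 2 (K and V) x layers x kv_heads x head_dim bytes
    per token, times tokens per session, times concurrent sessions."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * ctx_tokens * sessions


# 8B-class model, FP8 KV cache, 4K context, 50 concurrent sessions
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       bytes_per_value=1, ctx_tokens=4096, sessions=50)
print(f"{total / 1e9:.1f} GB")  # ~13 GB for this geometry, the same order as the table
```

The same function reproduces the high-concurrency row: 250 sessions at 4K context lands north of 60 GB for this geometry, which is why the table jumps to a B200-class card once weights and cache are summed.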

Step-by-Step: Deploy OpenShell on Spheron GPU Cloud

The following assumes an H100 SXM5 instance on Spheron with Ubuntu 22.04.

1. Provision the instance and verify GPU access

Log into app.spheron.ai, select an H100 SXM5 instance, and SSH in. Verify the GPU is visible:

bash
nvidia-smi
# Should show the H100 with driver version and VRAM

2. Install Docker and the NVIDIA Container Toolkit

bash
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

3. Clone OpenShell and build the cluster image

bash
git clone https://github.com/NVIDIA/OpenShell.git
cd OpenShell
# Build the K3s cluster image with Helm charts and manifests embedded
docker build -t openshell-cluster:latest -f deploy/Dockerfile .

4. Define your agent policy

Create a policy YAML that controls what your agent can access:

yaml
# policy.yaml
version: "1.0"
sandbox:
  filesystem:
    allowed_reads:
      - /sandbox
      - /tmp
    allowed_writes:
      - /sandbox/output
      - /tmp
  network:
    allowed_egress:
      - host: "api.spheron.ai"
        port: 443
      - host: "your-vllm-endpoint.internal"
        port: 8000
    block_metadata_service: true
  process:
    allow_exec: false
    max_processes: 16
  inference:
    routing_endpoint: "http://your-vllm-endpoint.internal:8000/v1"
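Before handing a policy to the runtime, it is worth sanity-checking its shape. A minimal validator sketch over an already-parsed policy dict, with field names taken from the example above (this is not OpenShell's own linter, just an illustration of the schema):

```python
REQUIRED_PATHS = [
    ("sandbox", "filesystem", "allowed_reads"),
    ("sandbox", "filesystem", "allowed_writes"),
    ("sandbox", "network", "allowed_egress"),
    ("sandbox", "process"),
    ("sandbox", "inference", "routing_endpoint"),
]


def validate_policy(policy: dict) -> list:
    """Return a list of human-readable violations; an empty list means valid."""
    violations = []
    if policy.get("version") != "1.0":
        violations.append("version must be '1.0'")
    for path in REQUIRED_PATHS:
        node = policy
        for key in path:
            if not isinstance(node, dict) or key not in node:
                violations.append("missing field: " + ".".join(path))
                break
            node = node[key]
    egress = policy.get("sandbox", {}).get("network", {}).get("allowed_egress", [])
    for rule in egress:
        if not 1 <= rule.get("port", 0) <= 65535:
            violations.append(f"egress rule {rule.get('host')!r}: invalid port")
    return violations


policy = {
    "version": "1.0",
    "sandbox": {
        "filesystem": {"allowed_reads": ["/sandbox", "/tmp"],
                       "allowed_writes": ["/sandbox/output", "/tmp"]},
        "network": {"allowed_egress": [{"host": "api.spheron.ai", "port": 443}],
                    "block_metadata_service": True},
        "process": {"allow_exec": False, "max_processes": 16},
        "inference": {"routing_endpoint": "http://your-vllm-endpoint.internal:8000/v1"},
    },
}
assert validate_policy(policy) == []
```

A check like this in CI catches a mistyped field name before the policy ever reaches a production sandbox.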

5. Start the OpenShell cluster

bash
docker run -d \
  --name openshell \
  --privileged \
  -v $(pwd)/policy.yaml:/etc/openshell/policy.yaml \
  -p 8080:8080 \
  openshell-cluster:latest

The --privileged flag is required for K3s to manage kernel namespaces and cgroups inside the container.
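K3s takes a few seconds to bootstrap inside the container, so scripts that immediately submit tasks tend to race it. A small wait-until-ready sketch with exponential backoff; the `probe` callable is injected (in practice an HTTP GET against the runtime's health endpoint returning True on 200, but the exact path is build-specific):

```python
import time


def wait_for_ready(probe, attempts: int = 8, base_delay: float = 0.5,
                   sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True, backing off exponentially.
    `probe` and `sleep` are injected so the loop is testable offline."""
    for attempt in range(attempts):
        if probe():
            return True
        sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return False
```

Injecting the probe also keeps this loop reusable for the vLLM backend started in the next step.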

6. Deploy your vLLM inference backend

On the same host (or a separate GPU node), start the inference server your agents will call:

bash
docker run -d \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8000:8000 \
  --name vllm-backend \
  vllm/vllm-openai:latest \
    --model nvidia/Llama-3.1-Nemotron-Nano-8B-v1 \
    --dtype auto \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --max-num-seqs 200 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --host 127.0.0.1 \
    --port 8000 \
    --api-key your-secret-key

Point your OpenShell policy's routing_endpoint at this server. Agents call the OpenAI-compatible API through OpenShell's policy enforcement layer. For a full walkthrough of vLLM configuration options, FP8 quantization flags, and multi-GPU tensor parallelism, see the vLLM production deployment guide.
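Agents reach this server over the OpenAI-compatible chat completions route. A sketch of the request an agent-side client would construct; only payload building is shown so nothing here needs a live endpoint, and the endpoint and key values are the placeholders from the command above:

```python
import json
import urllib.request


def build_chat_request(endpoint: str, api_key: str, model: str,
                       prompt: str, max_tokens: int = 500) -> urllib.request.Request:
    """Build a POST against the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


req = build_chat_request("http://127.0.0.1:8000/v1", "your-secret-key",
                         "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
                         "Summarize this log file.")
# urllib.request.urlopen(req) would return the completion JSON
```

In the OpenShell deployment, the agent never calls this URL directly; the policy layer proxies the call via `routing_endpoint`, which is what makes the egress allow-list enforceable.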

Configuring Security: seccomp, Landlock LSM, and Network Isolation

OpenShell's security model rests on three Linux kernel mechanisms. It does not use eBPF for enforcement; the actual primitives are:

seccomp (required, kernel 3.17+)

Filters system calls at the kernel boundary. OpenShell's default profile blocks dangerous calls including ptrace, mount, pivot_root, clone variants that request new namespaces, and raw socket creation. An agent that tries to escalate privileges or create a new network namespace gets EPERM and cannot proceed.

yaml
# In policy.yaml, seccomp tightening example
process:
  seccomp:
    default_action: "SCMP_ACT_ERRNO"
    allowed_syscalls:
      - read
      - write
      - open
      - close
      - stat
      - mmap
      - exit_group
      # Add your workload-specific calls here

Landlock LSM (recommended)

Filesystem access control implemented in the kernel rather than at the application layer. Landlock rules survive fork/exec, so a child process cannot inherit broader filesystem access than the parent. OpenShell configures Landlock to restrict agents to the directories in allowed_reads and allowed_writes from your policy.

Landlock requires kernel 5.13+. Ubuntu 22.04's default kernel (5.15) supports it.
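A preflight check for that kernel floor is easy to automate by parsing the running release string. A sketch (the helper names are illustrative, not part of any OpenShell tooling):

```python
import os


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare a kernel release string like '5.15.0-91-generic' to a minimum."""
    parts = release.split(".")
    got = (int(parts[0]), int(parts[1].split("-")[0]))
    return got >= (major, minor)


def landlock_supported(release=None) -> bool:
    release = release or os.uname().release
    return kernel_at_least(release, 5, 13)  # Landlock LSM landed in 5.13
```

Note that a new-enough kernel is necessary but not sufficient: Landlock must also be enabled in the LSM list, so treat this as a first-pass check rather than proof of enforcement.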

Network namespaces

Each sandbox pod runs in its own network namespace. Outbound traffic is routed through a controlled egress point, and unauthorized endpoints receive connection refused before the packet leaves the host. The metadata service endpoint (169.254.169.254) is blocked by default to prevent IMDS credential harvesting.

Here is a complete network isolation example using a Kubernetes NetworkPolicy applied to the K3s cluster inside OpenShell:

yaml
# network-policy.yaml - apply with kubectl inside the K3s cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: openshell-agent-egress
  namespace: agent-sandbox
spec:
  podSelector:
    matchLabels:
      app: openshell-agent
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8  # Internal VPC only
      ports:
        - protocol: TCP
          port: 8000  # vLLM inference
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.0.0/16  # Block IMDS
      ports:
        - protocol: TCP
          port: 443  # HTTPS egress only

To apply this inside the K3s cluster running inside the OpenShell Docker container:

bash
# Get the kubeconfig from the running cluster
docker exec openshell cat /etc/rancher/k3s/k3s.yaml > k3s.yaml
export KUBECONFIG=$(pwd)/k3s.yaml
kubectl apply -f network-policy.yaml
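The NetworkPolicy reduces to two rules: internal 10.0.0.0/8 on port 8000, and HTTPS anywhere except link-local. A small Python mirror of that evaluation, useful for unit-testing policy intent before touching the cluster:

```python
from ipaddress import ip_address, ip_network

INTERNAL = ip_network("10.0.0.0/8")
LINK_LOCAL = ip_network("169.254.0.0/16")  # IMDS lives in this range


def egress_allowed(dst_ip: str, port: int) -> bool:
    """Mirror of the openshell-agent-egress NetworkPolicy rules."""
    ip = ip_address(dst_ip)
    if ip in INTERNAL and port == 8000:
        return True   # vLLM inference inside the VPC
    if ip not in LINK_LOCAL and port == 443:
        return True   # general HTTPS egress
    return False
```

Keeping a mirror like this next to the manifest makes the IMDS-blocking intent executable: a test asserting `egress_allowed("169.254.169.254", 443)` is False fails loudly if someone widens the CIDR later.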

After applying policies, use the built-in OpenShell policy linter to validate before production:

bash
docker exec openshell openshell-sandbox validate-policy /etc/openshell/policy.yaml
# Returns: VALID or lists specific violations

Building Your First Autonomous Agent with the NVIDIA Agent Toolkit

The Agent Toolkit SDK connects agent code to the OpenShell runtime via a standard client interface. Here is a minimal Python agent that uses the OpenShell endpoint and calls the inference backend through the policy layer:

python
from nvidia_agent_toolkit import AgentClient, ToolRegistry, PolicyAwareCall
import httpx
import os

# Connect to the running OpenShell instance
client = AgentClient(
    runtime_url="http://localhost:8080",
    policy_path="/etc/openshell/policy.yaml",
)

# Register tools the agent is allowed to use
registry = ToolRegistry()

@registry.tool(name="web_fetch", allowed_domains=["docs.nvidia.com"])
def web_fetch(url: str) -> str:
    """Fetch a web page. Restricted to allowed domains by OpenShell policy."""
    # OpenShell intercepts and validates this call against network policy
    response = httpx.get(url, timeout=10)
    return response.text[:4096]

@registry.tool(name="read_file")
def read_file(path: str) -> str:
    """Read a file from the sandbox directory."""
    # Open with O_NOFOLLOW so the kernel refuses to follow a symlink at the
    # final path component. Combined with Landlock LSM enforcement, this
    # prevents symlink-swap attacks without a TOCTOU window.
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as e:
        raise ValueError(f"Cannot open {path!r}: {e}") from e
    with os.fdopen(fd) as f:
        return f.read(4096)

# Run an agent task through the OpenShell sandbox
result = client.run_task(
    task="Summarize the NVIDIA OpenShell documentation and save the summary to /sandbox/output/summary.txt",
    tools=registry,
    model_endpoint="http://your-vllm-endpoint.internal:8000/v1",
    model_name="nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
    max_steps=10,
)

print(result.output)
print(f"Steps taken: {result.steps}")
print(f"Policy violations blocked: {result.blocked_calls}")

When client.run_task runs, OpenShell:

  1. Creates a sandbox pod in the K3s cluster
  2. Applies the policy from policy.yaml
  3. Executes the agent loop inside the sandbox
  4. Returns blocked call counts alongside the result, so you can audit what the agent tried to do

For teams using MCP tool servers as the backend for agent tools, see the GPU-accelerated MCP server deployment guide for the infrastructure layer behind those tool endpoints.

Scaling Multi-Agent Systems on GPU Cloud

For multi-agent workflows where an orchestrator spawns worker agents, each agent runs in its own OpenShell sandbox pod. K3s handles pod scheduling inside the cluster.

A horizontal pod autoscaler targeting GPU utilization on the inference backend is more reliable than scaling on CPU load:

yaml
# hpa.yaml - scale vLLM workers based on queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-backend
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "10"

For the OpenShell sandbox tier, scale based on active agent pod count:

yaml
# sandbox-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: openshell-sandbox-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openshell-sandbox
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
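Both autoscalers follow the standard Kubernetes HPA formula: desired replicas = ceil(current replicas x current metric / target metric), clamped to the manifest's bounds. A sketch of that arithmetic using the values from the two manifests above:

```python
import math


def hpa_desired_replicas(current: int, metric: float, target: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Standard Kubernetes HPA scaling formula, clamped to min/max bounds."""
    desired = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, desired))


# vLLM tier: 2 replicas with 35 requests waiting on average, target 10 per replica
print(hpa_desired_replicas(2, 35, 10, min_replicas=1, max_replicas=8))   # 7
# Sandbox tier: 4 replicas at 30% CPU against a 60% target scales down to 2
print(hpa_desired_replicas(4, 30, 60, min_replicas=2, max_replicas=50))  # 2
```

Working the numbers by hand before a load test tells you whether `maxReplicas` actually covers your worst-case queue depth.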

Spheron on-demand instances work well for burst capacity in multi-agent systems. Set minimum replicas to cover your baseline concurrent agent load, then scale out to additional H100 nodes when orchestrator queue depth grows. For deeper orchestration patterns, the multi-agent AI infrastructure guide covers topology choices and KV cache math at scale.

Cost Comparison: Self-Hosted on GPU Cloud vs Managed API Providers

The comparison below uses a typical agentic interaction: 1,500 input tokens (system prompt + context + tool results) and 500 output tokens, totaling 2,000 tokens per interaction. Model: Nemotron-8B (throughput scales with memory bandwidth; H100 and B200 differ significantly due to HBM3e bandwidth improvements, but actual numbers vary by batch size and concurrency).

| Option | Price | Utilization | Interactions/hr | Cost per 1,000 interactions |
| --- | --- | --- | --- | --- |
| Spheron H100 SXM5 on-demand | $2.40/hr | 70% | ~3,780 | ~$0.63 |
| Spheron H100 SXM5 on-demand | $2.40/hr | 90% | ~4,860 | ~$0.49 |
| Spheron B200 SXM6 on-demand | $7.43/hr | 70% | ~6,300 | ~$1.18 |
| GPT-4o (public pricing) | $2.50/M in, $10/M out | N/A | N/A | ~$8.75 |
| GPT-4o-mini (public pricing) | $0.15/M in, $0.60/M out | N/A | N/A | ~$0.525 |

The H100 on-demand at sustained 90% utilization ($0.49/1,000 interactions) crosses below GPT-4o-mini ($0.525/1,000). At 70% utilization, the H100 is slightly above GPT-4o-mini on a per-interaction basis, but still 14x cheaper than GPT-4o.

The crossover point for H100 vs GPT-4o-mini sits at approximately 85% sustained GPU utilization. Below that, GPT-4o-mini is cheaper per interaction. Above 85% utilization (typical for production agent APIs with steady traffic), the H100 on-demand wins on cost. The B200 on-demand is priced for high-throughput workloads where the raw compute capacity justifies the rate.
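The crossover figure is simple arithmetic: interactions per hour scale linearly with utilization, so GPU cost per 1,000 interactions is the hourly rate divided by (peak rate x utilization). A sketch reproducing the numbers above:

```python
H100_HOURLY = 2.40                 # $/hr on-demand
PEAK_INT_PER_HR = 3780 / 0.70      # 5,400 interactions/hr at 100% utilization


def gpu_cost_per_1k(utilization: float) -> float:
    """H100 cost per 1,000 interactions at a given sustained utilization."""
    return H100_HOURLY / (PEAK_INT_PER_HR * utilization) * 1000


def api_cost_per_1k(in_per_m: float, out_per_m: float,
                    in_tokens: int = 1500, out_tokens: int = 500) -> float:
    """Cost per 1,000 interactions for a token-priced API."""
    return (in_tokens * in_per_m + out_tokens * out_per_m) / 1e6 * 1000


gpt4o_mini = api_cost_per_1k(0.15, 0.60)  # $0.525 per 1,000 interactions
# Utilization at which the H100 matches GPT-4o-mini per interaction
crossover = H100_HOURLY * 1000 / (PEAK_INT_PER_HR * gpt4o_mini)
print(f"H100 @ 90%: ${gpu_cost_per_1k(0.90):.2f}/1k, crossover at {crossover:.0%}")
```

Plugging in your own traffic profile (input/output token counts, achievable utilization) is the fastest way to see which side of the crossover you sit on.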

OpenShell adds no compute cost itself. The K3s cluster and policy engine consume negligible CPU and memory relative to the inference backend.

Pricing fluctuates based on GPU availability. The prices above are based on 03 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Production Checklist and Monitoring Best Practices

Before moving OpenShell agent workloads to production:

CUDA and kernel validation

  • Confirm NVIDIA driver version meets NVIDIA Container Toolkit requirements (nvidia-smi shows driver >= 525)
  • Confirm Linux kernel >= 5.13 for Landlock LSM support (uname -r)
  • Confirm kernel >= 3.17 for seccomp (cat /proc/sys/kernel/seccomp/actions_avail)

Policy validation before go-live

  • Run openshell-sandbox validate-policy on every policy change
  • Use dry-run mode to test new policies against a sample task before applying to production sandbox pods
  • Review blocked_calls counts in agent run results to confirm policies are catching unintended access

GPU metrics

bash
# Continuous GPU stats for the inference backend
nvidia-smi dmon -s u -d 5

# Pull vLLM Prometheus metrics
curl http://localhost:8000/metrics | grep -E "num_requests_waiting|time_to_first_token|kv_cache_usage"

Key metrics to alert on:

  • vllm:num_requests_waiting > 10 sustained for more than 60 seconds: add GPU capacity
  • vllm:kv_cache_usage_perc > 90%: reduce --max-num-seqs or add VRAM (see KV cache optimization guide for techniques to serve more sessions per GPU)
  • vllm:time_to_first_token_seconds p95 > SLA: investigate whether it is queue saturation or VRAM pressure
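The alerts above can be driven from the raw Prometheus scrape. A sketch that parses the text exposition format with the stdlib and applies the thresholds (the 60-second "sustained" condition belongs to your alerting layer and is omitted here; the label-ignoring parser is deliberately naive):

```python
def parse_prom(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{")[0]  # strip the {label="..."} block
        try:
            metrics[name] = float(value)
        except ValueError:
            pass
    return metrics


def check_alerts(metrics: dict) -> list:
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0) > 10:
        alerts.append("queue saturated: add GPU capacity")
    if metrics.get("vllm:kv_cache_usage_perc", 0) > 0.90:
        alerts.append("KV cache pressure: reduce --max-num-seqs or add VRAM")
    return alerts


scrape = """
# HELP vllm:num_requests_waiting Requests waiting in queue.
vllm:num_requests_waiting{model="nemotron-8b"} 14.0
vllm:kv_cache_usage_perc{model="nemotron-8b"} 0.72
"""
print(check_alerts(parse_prom(scrape)))  # ['queue saturated: add GPU capacity']
```

In production you would let Prometheus and Alertmanager evaluate these rules; a script like this is handy for smoke tests and one-off debugging on the host.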

OpenShell audit logs

bash
# View sandbox audit events from the K3s cluster
docker exec openshell kubectl logs -n agent-sandbox -l app=openshell-agent --tail=100

# View blocked policy violations
docker exec openshell openshell-sandbox audit-log --violations-only

Kubernetes liveness and readiness probes for sandbox pods:

yaml
livenessProbe:
  exec:
    command:
      - openshell-sandbox
      - health
  initialDelaySeconds: 15
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

Spheron node health: use Spheron's instance monitoring to track host-level GPU memory, NVLINK bandwidth (on SXM5 multi-GPU configs), and NVMe IOPS for model load times. For a full GPU monitoring stack covering DCGM, alerting, and dashboarding, see the GPU monitoring guide for ML workloads.


NVIDIA OpenShell and the Agent Toolkit give you kernel-level policy enforcement and sandbox isolation to run autonomous agents in production. Spheron's bare-metal H100 and B200 nodes give you the GPU capacity to do it at cost.

Rent H100 → | Rent B200 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.