NVIDIA's GR00T N1 model is publicly available, the architecture papers are out, and robotics teams are already running it on local workstations. What has been missing is a clear deployment guide for cloud GPU setups outside hyperscaler contracts. This post covers everything from instance provisioning to LoRA fine-tuning and sub-100ms inference. For GPU selection context across similar physical AI workloads, the GPU requirements cheat sheet for 2026 covers VRAM and compute needs for VLA models and adjacent robotics workloads.
The examples below are tested on Spheron H100 instances. Most commands apply to any Ubuntu 22.04 host with CUDA 12.4+ and the NVIDIA Container Toolkit installed.
What Is NVIDIA Isaac GR00T N1
GR00T N1, announced at GTC 2025, is NVIDIA's open foundation model for humanoid robotics. It is distributed under a non-commercial research license, meaning you can download the weights and run them freely for academic and research purposes but must contact NVIDIA before using them in a commercial product.
The architecture is a Vision-Language-Action (VLA) model with two main components. The first is a vision-language backbone that processes camera observations and optional language instructions to produce a high-level scene representation. The second is a flow-matching action diffusion head that takes the backbone's output and generates low-level joint commands through an iterative denoising process.
GR00T N1 is a generalist model with embodiment-conditioned action heads that support single-arm, bimanual, and full-humanoid configurations. The action space is configurable per embodiment ID rather than fixed. A full bimanual humanoid like the Fourier GR-1 uses a 52 DoF configuration; simpler embodiments use narrower action spaces. The diffusion head jointly reasons over the configured arms, wrist cameras, and hand pose for the active embodiment.
The model follows a two-system design borrowed from cognitive science. System 2 (the VLM backbone) handles slow, deliberate reasoning: parsing the task instruction, identifying objects, and planning the high-level motion sequence. System 1 (the diffusion head) runs the fast reactive control loop at 120 Hz per NVIDIA's GR00T N1 model card. The two systems communicate through a latent representation, which is why the VLM encoding pass and the denoising steps are architecturally separate and can be optimized independently.
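To make the split concrete, here is a toy sketch of the two-system forward pass in PyTorch. The class names, layer sizes, and denoising update are placeholders, not the Isaac-GR00T API; the point is that the backbone encoding and the action denoising loop are separate modules with separate latency profiles:
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, NUM_DENOISE_STEPS = 1024, 52, 10

class System2Backbone(nn.Module):
    """Slow path: encodes camera frames (plus instruction tokens) into a latent plan."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(LATENT_DIM), nn.GELU())
    def forward(self, images):
        return self.encoder(images)

class System1ActionHead(nn.Module):
    """Fast path: iteratively refines an action vector conditioned on the latent."""
    def __init__(self):
        super().__init__()
        self.denoiser = nn.Sequential(nn.Linear(LATENT_DIM + ACTION_DIM, 512), nn.GELU(),
                                      nn.Linear(512, ACTION_DIM))
    def forward(self, latent):
        actions = torch.randn(latent.shape[0], ACTION_DIM)   # start from noise
        for _ in range(NUM_DENOISE_STEPS):                    # iterative refinement
            actions = actions + self.denoiser(torch.cat([latent, actions], dim=-1))
        return actions                                        # joint position targets

frame = torch.rand(1, 3, 224, 224)
latent = System2Backbone()(frame)        # runs once per camera frame (the slow pass)
targets = System1ActionHead()(latent)    # can be optimized and re-run independently
print(targets.shape)                     # torch.Size([1, 52])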
Training data for GR00T N1 combines real teleoperation demonstrations with synthetic data generated by Isaac Sim and Cosmos. The sim-to-real pipeline is tight: Isaac Sim generates physically plausible demonstrations at scale, Cosmos adds photorealistic domain variation, and the combined dataset trains a model that transfers to real hardware with less fine-tuning than pure sim-trained policies.
Hardware Requirements: VRAM, NVLink, and Latency Budgets
GR00T N1 inference requires at least 16GB VRAM per NVIDIA's official documentation, which lists the RTX 4090, L40, H100, and Jetson AGX Orin as supported inference hardware. A 40GB+ floor applies to fine-tuning runs, not bare inference. At BF16, the roughly 2B-parameter model splits its footprint between the VLM backbone (roughly 4-6GB) and the diffusion head (another 4-6GB for the denoising buffer). A single H100 80GB comfortably handles both systems at full fine-tuning scale.
| GPU Model | VRAM | Use Case | Recommended Config |
|---|---|---|---|
| H100 PCIe 80GB | 80GB HBM2e | Inference + evaluation | Single GPU, standalone inference node |
| H100 SXM5 80GB | 80GB HBM3 | LoRA fine-tuning on small datasets (5K-20K demos) | 2-4x with NVLink |
| B200 SXM6 192GB | 192GB HBM3e | Full fine-tuning, large teleoperation datasets (50K+ demos) | Single or 2x |
| RTX PRO 6000 Blackwell 96GB | 96GB GDDR7 | Cost-effective closed-loop inference demos | Single GPU |
NVLink matters specifically for multi-GPU fine-tuning. Every optimizer step requires gradient all-reduces across both the backbone and the action diffusion head, and the head's iterative denoising keeps those synchronizations frequent. With PCIe-only interconnects (roughly 64 GB/s per direction on Gen5 x16), this synchronization becomes the bottleneck at batch sizes above 8. With NVLink 4.0 (900 GB/s on SXM5 nodes), the bottleneck shifts back to compute. For clusters without NVLink, see multi-node GPU training without InfiniBand for bandwidth workarounds.
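To check what interconnect bandwidth a rented multi-GPU node actually delivers before committing to a long fine-tuning run, a rough torch.distributed probe (a sketch; the 1 GiB bucket size and iteration count are arbitrary) can be launched with torchrun --nproc_per_node=2:
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    grads = torch.randn(512 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # ~1 GiB bucket
    for _ in range(5):                     # warm-up
        dist.all_reduce(grads)
    torch.cuda.synchronize()
    iters, start = 20, time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(grads)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    gb = grads.numel() * grads.element_size() / 1e9
    if rank == 0:
        print(f"~{iters * gb / elapsed:.0f} GB/s algorithmic all-reduce bandwidth")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()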
Spheron's bare-metal H100 instances include NVLink on SXM5 nodes and are available on demand without multi-month contracts.
Setting Up Isaac Lab and the GR00T Inference Stack on GPU Cloud
Step 1: Provision a GPU Instance
Rent an H100 SXM5 or H100 PCIe instance via the Spheron dashboard. For inference only, a single H100 PCIe is fine. For fine-tuning, use at least 2x H100 SXM5 with NVLink. The getting-started flow is documented at docs.spheron.ai/quick-guides/.
After SSH-ing in, verify your setup:
nvidia-smi
nvcc --version
nvidia-ctk --version
Confirm CUDA 12.4 or higher and that the NVIDIA Container Toolkit is installed. If the Container Toolkit is missing:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Step 2: Install Isaac Lab
Isaac Lab is under active development. This guide targets Isaac Lab 2.x (check the Isaac Lab changelog for API changes since the time of writing).
# Python 3.10 is required - newer versions have compatibility issues with some IsaacSim deps
sudo apt install python3.10 python3.10-venv python3.10-dev -y
python3.10 -m venv ~/isaaclab-env
source ~/isaaclab-env/bin/activate
# Clone Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git ~/IsaacLab
cd ~/IsaacLab
# Install with full extras (takes 10-15 minutes)
pip install -e ".[all]"
# Set the env variable used by the GR00T extension
export ISAAC_LAB_PATH=~/IsaacLab
echo 'export ISAAC_LAB_PATH=~/IsaacLab' >> ~/.bashrc
Step 3: Install the Isaac GR00T Extension
# Clone the GR00T repository
git clone https://github.com/NVIDIA/Isaac-GR00T.git ~/Isaac-GR00T
cd ~/Isaac-GR00T
# Install the IsaacLab.GR00T extension
pip install -e "."
Step 4: Download GR00T N1 Weights
The weights are gated on Hugging Face under a non-commercial research license. You must request access to nvidia/GR00T-N1-2B on the model page and wait for approval (typically 24-48 hours).
After approval:
pip install huggingface_hub
huggingface-cli login # paste your HF token
huggingface-cli download nvidia/GR00T-N1-2B --local-dir ~/gr00t-weights/GR00T-N1-2B
The base checkpoint is approximately 8GB. Plan for 20-30GB of total storage once you include the tokenizer, configuration files, and inference buffers.
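If you prefer to script the download instead of using the CLI, the same gated repository can be fetched with huggingface_hub (the token from huggingface-cli login is picked up automatically):
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path.home() / "gr00t-weights" / "GR00T-N1-2B"
snapshot_download(repo_id="nvidia/GR00T-N1-2B", local_dir=str(local_dir))
print(f"weights downloaded to {local_dir}")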
Step 5: Run GR00T N1 Inference with the Isaac ROS Bridge
NVIDIA provides an Isaac ROS bridge that handles the ROS 2 interface between your robot stack and the GR00T inference server:
# Launch the GR00T inference server
python ~/Isaac-GR00T/scripts/inference_server.py \
--model-path ~/gr00t-weights/GR00T-N1-2B \
--camera-topic /camera/color/image_raw \
--gripper-state-topic /gripper/joint_states \
--action-space bimanual_52dof \
--control-freq 120 \
--device cuda:0
The server exposes a ROS 2 action interface. Your robot's control loop subscribes to the output joint position targets and sends them to the hardware controller at the specified frequency.
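On the robot side, a minimal rclpy consumer for the published joint targets looks like the sketch below. The topic name is an assumption; use whatever topic the bridge advertises in your launch configuration:
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState

class JointTargetRelay(Node):
    def __init__(self):
        super().__init__("gr00t_joint_target_relay")
        # "/gr00t/joint_targets" is a placeholder topic name
        self.create_subscription(JointState, "/gr00t/joint_targets", self.on_targets, 10)

    def on_targets(self, msg: JointState):
        # Forward msg.position (one target per joint) to your hardware controller here
        self.get_logger().info(f"received {len(msg.position)} joint targets")

def main():
    rclpy.init()
    rclpy.spin(JointTargetRelay())
    rclpy.shutdown()

if __name__ == "__main__":
    main()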
Step 6: Validate with a Static Scene
Before connecting to hardware, validate the inference pipeline with a static camera feed:
# Test inference with a single image
python ~/Isaac-GR00T/scripts/test_inference.py \
--model-path ~/gr00t-weights/GR00T-N1-2B \
--image-path ~/test_image.jpg \
--task-instruction "pick up the red cube and place it in the bin" \
--output-path ~/inference_output.json
Check that the output JSON contains 52 joint position values and that inference latency is under 200ms. If latency is higher, proceed to the TensorRT export step in the inference pipeline section below.
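A quick script to automate that check (the JSON field names here are assumptions; adjust them to whatever your version of test_inference.py actually writes):
import json
from pathlib import Path

out = json.loads(Path("~/inference_output.json").expanduser().read_text())
actions = out.get("joint_positions", out.get("actions", []))
latency_ms = out.get("latency_ms")

assert len(actions) == 52, f"expected 52 joint targets, got {len(actions)}"
if latency_ms is not None:
    assert latency_ms < 200, f"latency {latency_ms:.1f}ms exceeds the 200ms budget"
print("inference output looks sane")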
Fine-Tuning GR00T N1 on Custom Teleoperation Data with LoRA
Dataset Format Requirements
GR00T N1 fine-tuning uses the LeRobot v2 parquet dataset format. The dataset layout is:
- data/chunk-000/: Parquet files containing state and action columns per timestep
- Video files (MP4): Camera frames organized per episode
- meta/modality.json: Schema file defining the observation and action modalities for the configured embodiment
If your teleoperation recordings are in a different format (ROS bags, custom pickle files, RLDS), use the conversion scripts in Isaac-GR00T/scripts/data_conversion/. The scripts handle common formats; for custom formats you need to write a thin adapter.
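For a custom format, the adapter only needs to emit per-episode parquet files with state and action columns. A minimal sketch with pandas (column names are illustrative; match them to meta/modality.json for your embodiment):
from pathlib import Path
import pandas as pd

def episodes_to_parquet(episodes, out_dir: Path):
    """episodes: iterable of dicts with per-timestep 'states' and 'actions' lists."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for ep_idx, ep in enumerate(episodes):
        rows = [
            {"episode_index": ep_idx, "frame_index": t,
             "observation.state": state, "action": action}
            for t, (state, action) in enumerate(zip(ep["states"], ep["actions"]))
        ]
        pd.DataFrame(rows).to_parquet(out_dir / f"episode_{ep_idx:06d}.parquet", index=False)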
LoRA Configuration
LoRA fine-tuning targets both the VLM backbone and the action diffusion head:
- Backbone target modules: Q, K, V projection matrices in the attention layers
- Diffusion head target modules: Linear layers in the denoising MLP
- LoRA rank: 16-32 (rank 16 for single-task adaptation, 32 for multi-task)
- LoRA alpha: 2x the rank value
For a single-task adapter (e.g., "pick and place a specific object in your lab"), rank 16 is sufficient and keeps the adapter checkpoint under 500MB.
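Expressed with Hugging Face PEFT, the equivalent configuration looks roughly like this (train_lora.py sets this up for you; the module names are illustrative and depend on the backbone's layer naming):
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=16,                                           # rank 16 for single-task, 32 for multi-task
    lora_alpha=32,                                  # alpha = 2x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention Q/K/V in the backbone
    bias="none",
)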
Multi-GPU Training Command
# Fine-tune on 2x H100 SXM5 with LoRA rank 16
torchrun \
--nproc_per_node=2 \
--nnodes=1 \
~/Isaac-GR00T/scripts/train_lora.py \
--base-model ~/gr00t-weights/GR00T-N1-2B \
--dataset ~/teleoperation-data/ \
--lora-rank 16 \
--lora-alpha 32 \
--num-gpus 2 \
--batch-size 4 \
--num-epochs 50 \
--val-interval 500 \
--output-dir ~/gr00t-lora-checkpoint/
Validate the checkpoint by running rollouts on held-out demonstrations and computing success rate. A fine-tuned adapter should hit 60-80% rollout success on in-distribution tasks within 5,000-10,000 demonstrations.
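A minimal sketch of that validation loop (run_rollout is a placeholder for however you replay a held-out demo through the policy in simulation or on hardware):
def evaluate_checkpoint(policy, held_out_demos, run_rollout):
    successes = sum(1 for demo in held_out_demos if run_rollout(policy, demo).success)
    rate = successes / max(len(held_out_demos), 1)
    print(f"rollout success: {rate:.1%} on {len(held_out_demos)} held-out demos")
    return rate  # aim for 60-80% in-distribution before moving to hardware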
For large teleoperation datasets (50K+ demonstrations), Spheron B200 instances cut fine-tuning wall-clock time roughly in half compared to H100 due to higher HBM3e bandwidth. The B200's 192GB VRAM also lets you run larger batch sizes without gradient checkpointing, which improves training stability on long-horizon tasks.
When B200 is out of stock, 4x H100 SXM5 with NVLink is the widely available fallback. Wall-clock time is longer but total GPU-hours and cost are similar.
Action Inference Loop: Sub-100ms Pipeline from Cameras to Joint Commands
The inference pipeline has five stages, each contributing to end-to-end latency:
1. Camera Input Preprocessing
Raw camera frames (typically 1080p or 720p RGB) are cropped and resized to the backbone's expected input resolution (224x224 for most VLM backbones). Normalization uses ImageNet statistics. This stage runs on CPU and takes 2-5ms at 720p with OpenCV.
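A representative preprocessing function (this assumes a center crop, 224x224 resize, and ImageNet normalization; verify the exact transform against the GR00T data config you are running):
import cv2
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """720p/1080p BGR frame -> normalized CHW float32 array for the backbone."""
    h, w = frame_bgr.shape[:2]
    side = min(h, w)                                   # center crop to a square
    y0, x0 = (h - side) // 2, (w - side) // 2
    crop = frame_bgr[y0:y0 + side, x0:x0 + side]
    rgb = cv2.cvtColor(cv2.resize(crop, (224, 224)), cv2.COLOR_BGR2RGB)
    img = (rgb.astype(np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
    return np.transpose(img, (2, 0, 1))                # HWC -> CHW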
2. VLM Visual Encoding
The preprocessed image passes through the VLM backbone (a large vision-language transformer). This is the slowest stage. At full FP16 precision on an H100, encoding a single 224x224 frame takes 30-60ms depending on the backbone size. With TensorRT export, this drops to 15-25ms.
To export the backbone with TensorRT:
python ~/Isaac-GR00T/scripts/export_tensorrt.py \
--model-path ~/gr00t-weights/GR00T-N1-2B \
--component backbone \
--precision fp16 \
--output-path ~/gr00t-trt/backbone.engine
3. Action Diffusion Denoising
The diffusion head runs N denoising steps (typically 10-50 in the default config). Each step is a forward pass through the denoising MLP. With CUDA graph capture, 10 denoising steps take 8-12ms on an H100.
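For reference, the capture-and-replay pattern that the --use-cuda-graphs flag relies on looks like this in plain PyTorch (a toy stand-in for one denoising step, not the GR00T head itself):
import torch

step = torch.nn.Linear(1024, 1024).cuda().eval()       # stand-in for one denoising step
static_in = torch.randn(1, 1024, device="cuda")

# Warm up on a side stream before capture (required for CUDA graph capture)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_out = step(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = step(static_in)                        # record the kernels once

# Replaying the graph skips per-kernel launch overhead on every denoising iteration
static_in.copy_(torch.randn(1, 1024, device="cuda"))
g.replay()
torch.cuda.synchronize()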
# Enable CUDA graph capture for the diffusion head
python ~/Isaac-GR00T/scripts/export_tensorrt.py \
--model-path ~/gr00t-weights/GR00T-N1-2B \
--component diffusion_head \
--num-denoising-steps 10 \
--use-cuda-graphs \
--output-path ~/gr00t-trt/diffusion_head.engine
4. Joint Command Output
After denoising, the output is a joint position target vector with dimensionality set by the active embodiment configuration (52 joints for a full bimanual humanoid). For position-controlled robots, this maps directly to the hardware controller. For velocity-controlled robots, compute the delta from the current state at the control frequency.
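For the velocity-controlled case, the conversion is a clamped finite difference at the control rate, as in this sketch:
import numpy as np

CONTROL_FREQ_HZ = 120.0

def position_to_velocity(target_pos: np.ndarray, current_pos: np.ndarray,
                         max_vel: float = 2.0) -> np.ndarray:
    """Joint position targets -> joint velocity commands, clamped to the robot's limits."""
    vel = (target_pos - current_pos) * CONTROL_FREQ_HZ   # rad/s per joint
    return np.clip(vel, -max_vel, max_vel)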
5. ROS 2 Publishing
The Isaac ROS GR00T bridge publishes joint targets as sensor_msgs/JointState messages. Round-trip time from message receipt to command publication is under 1ms.
Latency Budget
| GPU | Backbone Encoding (TRT) | Diffusion (10 steps, CUDA graphs) | Preprocessing | Total |
|---|---|---|---|---|
| H100 PCIe | ~18ms | ~10ms | ~3ms | ~31ms |
| RTX PRO 6000 Blackwell | ~28ms | ~15ms | ~3ms | ~46ms |
Both configurations achieve sub-100ms end-to-end latency with TensorRT and CUDA graphs. The RTX PRO 6000 Blackwell is slower per step but handles most manipulation tasks comfortably within its latency budget.
For closed-loop inference demos and policy evaluation, RTX PRO 6000 Blackwell rental on Spheron gives you 96GB VRAM at a lower per-hour cost than H100, sufficient for single-arm inference workloads.
Sim-to-Real with Isaac Sim and Cosmos Synthetic Data
The sim-to-real pipeline for GR00T N1 has four steps:
Scene authoring in Isaac Sim. Your manipulation workspace is built as an Omniverse USD scene. Isaac Sim runs physics simulation and domain randomization, generating physically plausible demonstration trajectories at scale. This is faster and cheaper than collecting all demonstrations with physical hardware.
Cosmos world model for photorealistic variation. Isaac Sim outputs are visually synthetic. Cosmos takes those synthetic clips and generates photorealistic variants with realistic lighting, material textures, and environmental noise. This closes the visual domain gap between simulation and the real deployment environment. The full Cosmos deployment setup for generating synthetic robotics data is covered in Deploy NVIDIA Cosmos World Foundation Models on GPU Cloud, including NGC authentication and Docker setup.
Mixed real + synthetic dataset. Combine a small real teleoperation dataset (500-2000 demonstrations from your actual hardware) with the larger synthetic dataset (10,000-50,000 clips from Isaac Sim + Cosmos). The mixing ratio depends on how closely your real environment matches what Cosmos can generate. Start with 80% synthetic, 20% real, and tune based on rollout success (see the sampler sketch after these steps).
Sim-to-real transfer validation. Run a held-out set of real-world rollouts after each fine-tuning checkpoint. Track success rate on a standard test task. If success rate stops improving, your synthetic dataset has saturated the distribution your model can generalize from, and you need more real data or more environment variation in your Isaac Sim scenes.
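The 80/20 mix from step 3 can be implemented as a weighted sampler over the two datasets. A minimal sketch (the dataset objects are placeholders; wire in your actual LeRobot datasets):
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(synthetic_ds, real_ds, synthetic_frac=0.8, batch_size=4):
    combined = ConcatDataset([synthetic_ds, real_ds])
    weights = torch.cat([
        torch.full((len(synthetic_ds),), synthetic_frac / len(synthetic_ds)),
        torch.full((len(real_ds),), (1 - synthetic_frac) / len(real_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)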
Setting up both Isaac Sim and Cosmos on the same GPU node is documented at docs.spheron.ai. Teams combining Isaac Lab with real-world environment captures can use 3D Gaussian Splatting for rendering simulation backgrounds from captured photos rather than hand-authored 3D assets.
Cost Analysis: GPU Hours Per Training Run and Per Million Inference Steps
The tables below use live-fetched prices from the Spheron pricing API (03 May 2026):
- H100 PCIe: $2.01/hr (on-demand)
- H100 SXM5: $0.80/hr (spot)
- B200 SXM6: $2.12/hr (spot, on-demand currently unavailable)
- RTX PRO 6000 Blackwell: $1.70/hr (on-demand)
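As a sanity check, the table figures below follow directly from these hourly rates:
def training_cost(gpu_count, hours, rate_per_gpu_hour):
    return gpu_count * hours * rate_per_gpu_hour

def cost_per_million_steps(rate_per_hour, action_hz):
    return rate_per_hour / (action_hz * 3600) * 1_000_000

print(training_cost(2, 12, 0.80))                   # 19.2  -> 2x H100 SXM5 spot, 20K demos
print(round(cost_per_million_steps(2.01, 32), 2))   # 17.45 -> H100 PCIe, cost per 1M steps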
Table 1: Training Cost by Dataset Size
| Dataset Size | GPU Config | Est. Training Time | GPU-Hours | Cost at Live Price |
|---|---|---|---|---|
| 5K demos | 1x H100 PCIe | ~8h | 8 | $16.08 |
| 20K demos | 2x H100 SXM5 | ~12h | 24 | $19.20 (spot) |
| 50K demos | 4x H100 SXM5 | ~16h | 64 | $51.20 (spot) |
| 50K demos | 1x B200 SXM6 | ~10h | 10 | $21.20 (spot) |
Pricing fluctuates based on GPU availability. The prices above are based on 03 May 2026 and may have changed. Check current GPU pricing → for live rates.
Table 2: Inference Cost per Million Action Steps
| GPU | Action Freq. (Hz) | Steps/Hour | Cost/1M Steps |
|---|---|---|---|
| H100 PCIe | 32 | 115,200 | $17.45 |
| RTX PRO 6000 Blackwell | 21 | 75,600 | $22.49 |
Pricing fluctuates based on GPU availability. The prices above are based on 03 May 2026 and may have changed. Check current GPU pricing → for live rates.
The cost picture for GR00T N1 compares favorably to alternatives. A 20K demonstration fine-tuning run on 2x H100 SXM5 at spot pricing costs under $20. That same compute budget on AWS p4d.24xlarge spot instances would run $60-90 with the associated contract overhead and availability constraints.
Comparing GR00T N1, Pi-0, OpenVLA, and Octo for Production Deployment
| Model | Arch | Action Space | VRAM (inference) | FT Format | License | Best For |
|---|---|---|---|---|---|---|
| GR00T N1 | VLM + Flow Diffusion | Variable per embodiment | 16GB+ | LeRobot v2 parquet | NC Research | NVIDIA humanoid platforms |
| Pi-0 | PaliGemma + Diffusion | General manipulation | 28GB+ | RLDS | Apache 2.0 | Cross-embodiment tasks |
| OpenVLA | Prismatic VLM | 7-DoF single arm | 14GB+ | RLDS | Apache 2.0 | Open-vocab pick-and-place |
| Octo | Transformer | 7-DoF single arm | 4GB+ | RLDS | Apache 2.0 | Low-resource fine-tuning |
The decision framework for production comes down to three questions: what hardware are you targeting, how much GPU do you have, and what license terms can you accept?
If you are on NVIDIA Isaac hardware (Isaac Sim, Isaac Lab, Jetson AGX Thor, or any platform in the Isaac ROS ecosystem), GR00T N1 is the right starting point. The integration is tight, the fine-tuning tooling is purpose-built, and the synthetic data pipeline through Cosmos and Isaac Sim is the most mature option for generating training data without physical hardware. The non-commercial license is the main constraint: check it carefully before planning a commercial deployment.
If you need cross-embodiment generalization across different robot morphologies or arm configurations, Pi-0 is the stronger choice. Its PaliGemma backbone gives it broader zero-shot capability on novel objects and tasks. The Apache 2.0 license also removes the commercial deployment uncertainty.
OpenVLA and Octo are the right choices if GPU budget is limited. OpenVLA runs on 14GB VRAM and handles a wide range of pick-and-place tasks out of the box. Octo runs on as little as 4GB and fine-tunes quickly on small datasets, making it useful for rapid prototyping on constrained hardware.
For deployments where the robot operates in low-latency closed-loop mode at the edge (on-board Jetson AGX Thor), see Hybrid Cloud and Edge AI Inference for the split-inference pattern that offloads VLM encoding to cloud while running the action head locally.
Robotics teams at universities and startups running GR00T N1 fine-tuning or closed-loop inference demos don't need to sign hyperscaler contracts. Spheron provides on-demand H100 and B200 nodes with NVLink, no minimum spend.
Rent H100 for GR00T fine-tuning → | Rent B200 → | View current pricing →
