Robotics teams spend more time collecting training data than training models. Real-world data collection for manipulation tasks or autonomous navigation costs hundreds of dollars per clip once you factor in operators, environments, and labeling. NVIDIA Cosmos changes the math by generating photorealistic synthetic video of physical environments at a fraction of that cost. On Spheron's on-demand H100 instances, you can run a full Cosmos inference pipeline without hyperscaler pricing. For a broader look at GPU selection for these workloads, see the GPU requirements cheat sheet for 2026.
What Is NVIDIA Cosmos: World Foundation Models for Physical AI
Cosmos is a family of world foundation models (WFMs) designed to generate physically plausible video of real-world environments. Unlike general-purpose video generation models, Cosmos is trained specifically on physical scenes: factories, warehouses, roads, outdoor environments, and manipulation workspaces. The outputs are used as synthetic training data for robots and autonomous vehicles.
Three model families ship in the current release. Cosmos-Predict handles video generation using both diffusion-based and autoregressive architectures. It takes text prompts or video conditioning inputs and produces photorealistic clips of physical environments. Cosmos-Transfer handles style and domain transfer, adapting existing video to target visual domains. Cosmos-Reason is a transformer-based video understanding model used for annotation and scene analysis. All families are distributed under the NVIDIA Open Model License (source code is Apache 2.0), which permits commercial use with attribution but is not a fully open license. You must accept the license on Hugging Face before downloading weights.
Models are available via Hugging Face (under the nvidia/ organization) and through NVIDIA's NGC registry. The NVIDIA Open Model License terms are shown during the HF access request flow.
GPU Requirements for Cosmos Model Variants: VRAM, Compute, and Storage
Cosmos-Predict models are VRAM-hungry. The 7B variant fits on a single 80GB GPU at full precision. The 14B variant typically needs either two 80GB GPUs running tensor parallelism or a single H200 with 141GB HBM3e memory at full precision. With aggressive model offloading you can reduce VRAM requirements to around 39GB, though inference speed drops significantly. Cosmos-Reason1 is less demanding and can run on 40GB A100s (note: architecture support may vary by version, with newer Reason releases targeting Hopper and Blackwell).
| Model Variant | VRAM Required | Recommended GPU | Minimum GPU | Storage |
|---|---|---|---|---|
| Cosmos-Predict1-7B-Text2World | 80GB | 1x H100 SXM5 | 1x H100 PCIe | ~50GB weights |
| Cosmos-Predict1-14B-Text2World | 80GB | 2x H100 SXM5 | 1x H100 80GB | ~100GB weights |
| Cosmos-Predict1-7B-Video2World | 80GB | 1x H100 SXM5 | 1x H100 PCIe | ~50GB weights |
| Cosmos-Predict1-14B-Video2World | 80GB | 2x H100 SXM5 | 1x H100 80GB | ~100GB weights |
| Cosmos-Reason1-7B | 40GB | 1x A100 80GB | 1x A100 40GB | ~15GB weights |
_14B fits on a single 80GB GPU with model offloading. Multi-GPU recommended for production throughput._
Multi-GPU 14B deployments use tensor parallelism. Inter-GPU bandwidth matters here. For multi-node setups, see the multi-node GPU training guide for networking considerations when NVLink is not available.
Storage is often overlooked. Model weights alone consume 50-100GB. Add output buffers for video frames and the temporary tensors during generation, and you need at least 200GB NVMe SSD per instance. Spin up H100 PCIe or H200 instances from Spheron's H100 GPU rental page with attached NVMe included.
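As a rough sketch of that budget (the per-clip size and scratch allowance here are illustrative assumptions, not measured Cosmos figures):

```python
# Rough NVMe budget for one Cosmos instance. gb_per_clip and scratch_gb
# are illustrative assumptions; adjust for your resolution and clip length.
def storage_budget_gb(weights_gb: float, buffered_clips: int,
                      gb_per_clip: float, scratch_gb: float = 50.0) -> float:
    """Weights + buffered output clips + temporary tensor scratch space."""
    return weights_gb + buffered_clips * gb_per_clip + scratch_gb

# 14B weights (~100 GB) plus 500 buffered 720p clips at ~0.2 GB each
print(storage_budget_gb(100, 500, 0.2))  # 250.0 -- comfortably above the 200GB floor
```

The point of the exercise: weights alone rarely dominate once you buffer a few hundred clips before shipping them to object storage.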
Self-Hosting Cosmos vs Using NVIDIA's API: Cost and Control Tradeoffs
NVIDIA offers Cosmos inference as a managed API, which is the fastest way to run your first generation. You submit a prompt via a REST call, get back a video, and pay per generation. No infrastructure to manage. As of April 2026, NVIDIA has not published per-call pricing for Cosmos on their API; it is available through NVIDIA Cloud Functions and early enterprise agreements. The tradeoff: prompts and output data leave your perimeter, environment customization is limited, and at scale the per-call cost will exceed self-hosted GPU-hours.
Self-hosting on GPU cloud gives you the opposite profile. Upfront setup takes a few hours. After that, every GPU-hour is fully utilized by your pipeline with no per-token or per-generation markup. Your prompts and outputs stay on your infrastructure. You can run custom environments, modify inference parameters, and batch at any scale. The only constraint is GPU availability, which Spheron handles with on-demand provisioning.
The cost case for self-hosting becomes clear quickly. At $2.01/hr for an H100 PCIe on Spheron, generating 1,000 clips with an average 30-minute generation time per clip uses 500 GPU-hours, costing roughly $1,005 total. For teams generating more than a few hundred clips per month, self-hosted on-demand GPU will consistently beat managed API pricing once NVIDIA publishes Cosmos API rates.
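That arithmetic generalizes to any clip count and generation time. A minimal sketch, using the rates quoted in this article:

```python
# Self-hosted generation cost: clips * per-clip GPU time, billed hourly.
def self_hosted_cost(clips: int, minutes_per_clip: float,
                     hourly_rate: float) -> float:
    gpu_hours = clips * minutes_per_clip / 60
    return gpu_hours * hourly_rate

# 1,000 clips at 30 min each on a $2.01/hr H100 PCIe
print(round(self_hosted_cost(1000, 30, 2.01), 2))  # 1005.0
```

Plug in a candidate per-generation API price once NVIDIA publishes one, and the break-even clip volume falls out directly.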
Step-by-Step: Deploy Cosmos on GPU Cloud with Docker and NVIDIA Container Toolkit
Prerequisites
Before starting, you need:
- GPU instance with NVIDIA driver 550+ and CUDA 12.4+ installed
- Docker 24.0+
- NVIDIA Container Toolkit
- NGC account with API key (ngc.nvidia.com)
- Hugging Face account with Cosmos model access approved (accept the NVIDIA Open Model License on the model page)
- 200GB+ NVMe storage for weights and outputs
Step 1: Provision a GPU Instance
Rent via the Spheron dashboard. For the 7B models, a single H100 PCIe 80GB is the minimum and works well. For the 14B models or faster generation, use 2x H100 SXM5 or a single H200 SXM5.
```bash
# After SSH-ing into your instance, verify the GPU is visible
nvidia-smi

# Check CUDA version
nvcc --version

# Confirm available NVMe storage
df -h /mnt
```

H200 SXM5 on-demand instances are subject to availability. Check current GPU pricing and availability before provisioning.
Step 2: Install NVIDIA Container Toolkit
```bash
# On Ubuntu 22.04
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
  -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify
nvidia-ctk --version
```

Step 3: Authenticate with NGC and Pull the Cosmos Container
```bash
# Log in to NVIDIA's container registry
# Username: $oauthtoken (literal string)
# Password: your NGC API key
docker login nvcr.io

docker pull nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0
```

This uses the NIM (NVIDIA Inference Microservices) container path with the /nim/ prefix and a model-specific name and version tag. Verify the latest tag against NVIDIA's NGC catalog at ngc.nvidia.com before pulling, as newer model variants may have different image names.
Step 4: Download Cosmos Model Weights
```bash
pip install huggingface_hub
huggingface-cli login  # enter your HF token when prompted

# 7B text-to-world model (~50GB)
huggingface-cli download nvidia/Cosmos-Predict1-7B-Text2World \
  --local-dir /mnt/weights/cosmos-7b

# Optional: 14B model (~100GB, requires 2x H100 SXM5 or 1x H200)
huggingface-cli download nvidia/Cosmos-Predict1-14B-Text2World \
  --local-dir /mnt/weights/cosmos-14b
```

The NVIDIA Open Model License terms are enforced at download time. If your HF account has not accepted the license, the download will fail with a 403 error. Accept the license on the model page on Hugging Face first.
Step 5: Run Cosmos Inference
NIM containers are microservices. You start the container, wait for it to become ready, then send HTTP requests to its REST API. Do not override the entrypoint.
Start the 7B container and expose its API on port 8000:
```bash
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v /mnt/weights:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0
```

In a separate terminal, wait for the service to be ready then send a generation request:
```bash
# Poll until the service is ready (may take a few minutes on first run)
curl --retry 20 --retry-delay 10 --retry-connrefused --fail-with-body --retry-all-errors \
  http://localhost:8000/v1/health/ready

# Generate a video via the REST API
curl -X POST http://localhost:8000/v1/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A warehouse floor with pallets moving on autonomous forklifts, overhead lighting, concrete floor",
    "resolution": "1280x720",
    "num_frames": 120
  }'
```

For the 14B model on a 2-GPU setup, use --gpus '"device=0,1"' to assign both GPUs explicitly and reference the 14B image:
```bash
docker run --rm --gpus '"device=0,1"' \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v /mnt/weights:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-predict1-14b-text2world:1.0.0
```

Then query the same API endpoint on port 8000 as shown above. The --gpus all flag works for single-GPU deployments. Use "device=0" for explicit single-GPU control.
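For batch pipelines, the same curl workflow can be driven from Python. This is a minimal standard-library sketch: the /v1/health/ready and /v1/infer routes and payload fields mirror the curl examples above, but verify them against your container version's API docs, since NIM request schemas can change between releases.

```python
# Minimal client for the NIM service started above. Routes and payload
# fields are taken from this guide's curl examples; confirm them against
# your container's API documentation before relying on them.
import json
from urllib import request

BASE_URL = "http://localhost:8000"

def build_payload(prompt: str, resolution: str = "1280x720",
                  num_frames: int = 120) -> dict:
    """Assemble the generation request body used by the curl example."""
    return {"prompt": prompt, "resolution": resolution, "num_frames": num_frames}

def generate(prompt: str):
    # Fail fast if the service is not ready yet
    request.urlopen(f"{BASE_URL}/v1/health/ready", timeout=10)
    req = request.Request(
        f"{BASE_URL}/v1/infer",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Video generation is slow; allow a long read timeout
    return request.urlopen(req, timeout=3600)

if __name__ == "__main__":
    # Example payload; call generate(...) once the container is running
    print(build_payload("A warehouse floor with pallets on autonomous forklifts"))
```

Wrapping the call this way makes it easy to fan out a prompt list across several instances with a worker pool.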
Generating Synthetic Training Data: Warehouse, Factory, and Driving Environments
Prompt engineering for Cosmos is different from general video generation. Physical accuracy matters more than aesthetic appeal. Describe the environment precisely: floor materials, lighting type, fixture positions, background objects, and any moving elements. Here are three prompt templates that work well:
Warehouse:

```
A high-bay warehouse with shelving racks, a mobile robot base navigating narrow aisles,
fluorescent overhead lighting, concrete floor with painted safety lanes, depth visible
in background shelving, ambient dust particles in light beams
```

Factory:

```
An automotive assembly line with robotic arm welders, overhead conveyors, parts bins,
bright industrial lighting, metal surfaces with reflections, sparks from welding,
yellow safety barriers visible at frame edges
```

Driving (suburban):

```
A suburban intersection at dawn, lane markings, traffic signs, parked vehicles on both
sides, wet road surface after rain, street lights still on, early morning light
from the east casting long shadows
```

The Video2World variant lets you condition generation on a real base clip. You provide a short real-world video and Cosmos generates photorealistic variants of the same scene. This is useful for sim-to-real transfer: anchor synthetic data to your actual deployment environment to reduce the domain gap.
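When generating prompt variants at scale, the structure recommended earlier (environment, floor material, lighting, moving elements) can be composed programmatically. This helper is purely illustrative, a convention for keeping prompts physically precise, not part of any Cosmos API:

```python
# Illustrative prompt builder following this guide's structure:
# environment, moving elements, lighting, floor material, extras.
def build_prompt(environment: str, floor: str, lighting: str,
                 moving: str, extras=()) -> str:
    parts = [environment, moving, lighting, floor, *extras]
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    environment="A high-bay warehouse with shelving racks",
    moving="a mobile robot base navigating narrow aisles",
    lighting="fluorescent overhead lighting",
    floor="concrete floor with painted safety lanes",
    extras=["ambient dust particles in light beams"],
)
print(prompt)
```

Sweeping one slot at a time (lighting, floor texture, weather) is a cheap way to build the appearance diversity robot policies need.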
Most robotics teams need between 10,000 and 100,000 clips per task type to see meaningful policy improvement. For general video AI GPU context, the AI video generation GPU guide covers VRAM and cost tradeoffs across generation models.
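To size a generation fleet against those volumes, a quick sketch (throughput figures here are illustrative, in the range of the cost table later in this article):

```python
# Wall-clock days to generate a dataset on a fixed-size GPU fleet.
def gpu_days(total_clips: int, clips_per_day_per_gpu: float,
             gpus: int = 1) -> float:
    return total_clips / (clips_per_day_per_gpu * gpus)

# 50,000 clips at ~200 clips/day per H200
print(gpu_days(50_000, 200))           # 250.0 days on one GPU
print(gpu_days(50_000, 200, gpus=8))   # 31.25 days on eight
```

The takeaway: dataset targets in the tens of thousands of clips imply multi-GPU fleets, which is where per-hour pricing differences compound.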
The Physical AI Data Factory Blueprint: End-to-End Pipeline Architecture
NVIDIA announced the Physical AI Data Factory Blueprint at GTC 2026. At its core, it connects Cosmos components (Curator for data curation, Cosmos-Transfer for domain adaptation, and Cosmos-Reason/Evaluator for quality assessment) with NVIDIA OSMO as the orchestration layer. Omniverse and Isaac Sim are part of NVIDIA's broader Physical AI ecosystem and can integrate with the pipeline, but the blueprint's primary components are the Cosmos modules and OSMO.
```
[Text / Scene Spec]
        |
        v
  Cosmos Curator      <-- data ingestion, filtering, processing
  (data pipeline)
        |
        v
  Cosmos-Transfer     <-- domain and style adaptation of video
  (video adaptation)
        |
        v
  Cosmos-Reason /     <-- quality evaluation, scene analysis,
  Evaluator               annotation for downstream training
```

Orchestration: NVIDIA OSMO manages compute orchestration across the pipeline stages, coordinating job scheduling and scaling on GPU clusters.
Integration with simulation: For teams using NVIDIA Omniverse and Isaac Sim, Cosmos outputs can feed into those tools for scene authoring, physics simulation, and reinforcement learning training loops. This broader integration is documented in NVIDIA's Physical AI Data Factory Blueprint documentation at developer.nvidia.com.
The entire pipeline runs on GPU cloud. No on-premise cluster is required. Each stage can scale independently: run more Cosmos generation instances when building a new dataset, then scale down while training runs on the same GPU budget.
Integrating Cosmos with Omniverse and Isaac Sim for Robotics Workflows
Cosmos and Omniverse operate at different points in the pipeline. Cosmos generates appearance: photorealistic video that looks like the real world. Omniverse handles physics and ground truth: it takes that appearance data and embeds it in physically simulated scenes where robot actions can be evaluated.
The handoff works in two directions. First, you use Cosmos to generate large volumes of photorealistic reference clips for an environment class (e.g., "warehouse with diverse lighting conditions"). These clips are imported into Omniverse as texture and appearance references, helping the simulated environment look realistic. Second, you can author a scene in Omniverse with precise asset placement and export that scene geometry as a video conditioning input to Cosmos-Predict's Video2World variant, generating appearance-varied versions of your specific Omniverse scene.
Isaac Sim connects after Omniverse: it runs physics simulation and domain randomization on top of the Omniverse scenes, generating the actual robot training trajectories. Isaac Lab provides the reinforcement learning scaffolding. Full documentation for this integration is available in NVIDIA's Physical AI Data Factory Blueprint documentation at developer.nvidia.com.
Cost Analysis: Synthetic Data Generation vs Real-World Data Collection on GPU Cloud
| Data Source | Cost per 10-sec clip | Clips per day (1x GPU) | Notes |
|---|---|---|---|
| Cosmos on H100 PCIe (Spheron, on-demand) | ~$1.00-$1.50 | ~32-48 | $2.01/hr, ~30-45 min/clip |
| Cosmos on H100 SXM5 (Spheron, spot) | ~$0.40-$0.60 | ~32-48 | $0.80/hr spot |
| Cosmos on H200 SXM5 (Spheron, on-demand) | ~$0.38-$0.76 | ~144-288 | $4.54/hr, ~5-10 min/clip |
| AWS p4d.24xlarge (8x A100) | ~$32/hr total | Faster, but 16-32x cost | On-demand only |
| Real-world robot data collection | $50-$500 per clip | N/A | Includes operators, environments, labeling |
Pricing fluctuates with GPU availability. The prices above were captured on 12 Apr 2026 and may have changed since. Check current GPU pricing for live rates.
The H200 justifies its higher hourly rate through faster generation. At ~5-10 minutes per clip versus 30-45 minutes on an H100 PCIe, you get roughly 4-6x more clips per hour. For teams generating 5,000+ clips per month, the H200 on-demand can cost less total despite the higher rate. For lighter pipelines generating under 500 clips per month, H100 PCIe on-demand is the simpler and more cost-effective option. Spot instances on H100 SXM5 at $0.80/hr are attractive for workloads that can tolerate preemption.
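The per-clip comparison behind that recommendation, computed from the midpoints of the ranges above:

```python
# Cost per clip = hourly rate * fraction of an hour each clip takes.
def cost_per_clip(hourly_rate: float, minutes_per_clip: float) -> float:
    return hourly_rate * minutes_per_clip / 60

h100_pcie = cost_per_clip(2.01, 37.5)  # midpoint of 30-45 min/clip
h200_sxm5 = cost_per_clip(4.54, 7.5)   # midpoint of 5-10 min/clip
print(f"H100 PCIe: ${h100_pcie:.2f}/clip, H200 SXM5: ${h200_sxm5:.2f}/clip")
```

Despite costing more than twice as much per hour, the H200's faster generation makes it roughly half the price per clip at these midpoints.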
Compared to AWS p4d.24xlarge at ~$32/hr for 8x A100s, running Cosmos on Spheron H100 instances costs 16-32x less for equivalent throughput. For a full breakdown of hyperscaler GPU pricing versus Spheron, see the AWS, GCP, and Azure GPU alternative guide. For strategies to further reduce GPU spend across your broader AI infrastructure, the GPU cost optimization playbook covers spot instance usage, checkpoint strategies, and right-sizing decisions that apply directly to synthetic data pipelines.
H200 on-demand availability varies. Check live inventory at Spheron's pricing page before planning a production pipeline around it.
Cosmos synthetic data pipelines run for GPU-hours at a time. Spheron's on-demand H100 and H200 instances let robotics teams spin up generation capacity when they need it and shut it down when they don't - no reserved capacity required.
