Deploy NVIDIA Cosmos 3 on GPU Cloud: Self-Host the Two-Tower MoT Physical AI Model (2026 Guide)

NVIDIA Cosmos 1 was a single-backbone video generation model. Cosmos 3 is architecturally different: it splits into two specialized transformer towers, each with a distinct job, and adds modalities that Cosmos 1 could not touch. For the prior generation baseline, see Deploy NVIDIA Cosmos World Foundation Models on GPU Cloud. This guide covers what changed, what GPU budget Cosmos 3 actually requires, and how to stand up both towers on an H200 or B200 instance.

What Is NVIDIA Cosmos 3: Two-Tower Mixture-of-Transformers for Physical AI

Cosmos 3 uses a two-tower Mixture-of-Transformers (MoT) design. The first tower is a reasoning transformer, an autoregressive VLM that handles scene understanding, instruction following, and spatial reasoning from text or image inputs. The second tower is the generation transformer, a diffusion model that synthesizes physical-world outputs from the scene representations the reasoning tower produces. Both towers share a transformer architecture with separate parameter sets within each layer that interact through joint attention, and use 3D mRoPE (3D multimodal Rotary Position Embeddings) per NVIDIA's technical report to align video, audio, and action trajectory tokens on a single temporal axis.

NVIDIA has released two scale variants: Nano (16B total, with a reported 8B + 8B per-tower split) and Super (64B total, reported as 32B + 32B per tower). A third variant, Edge (4B, targeting on-device/Jetson deployment), is announced but not yet released. Nano targets workstation-class hardware like the RTX PRO 6000 Blackwell; Super targets Hopper and Blackwell datacenter GPUs.

The architecture split is intentional. In Cosmos 1, a single backbone handled both scene understanding and generation, which forced the model to find shared representations for two fundamentally different tasks. Cosmos 3 separates them: the reasoning tower processes the prompt and produces a scene encoding, and the generation tower uses that encoding as a conditioning signal during diffusion. Modality routing is task-determined; both towers activate for each generation pass. Different output modalities (video, audio, action trajectories) are handled through task-specific model variants, not through separate expert sub-networks with a learned per-token router.

NVIDIA distributes Cosmos 3 weights under the OpenMDW-1.1 license (Linux Foundation's Open Model Development Weights license). OpenMDW-1.1 permits commercial use and derivatives. NVIDIA's model card additionally requires a "Built on NVIDIA Cosmos" attribution. The weights require accepting the license on Hugging Face before downloading. Released model checkpoints are available at nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super on Hugging Face and through NVIDIA's NGC registry. Task-specific variants include nvidia/Cosmos3-Super-Text2Image, nvidia/Cosmos3-Super-Image2Video, and nvidia/Cosmos3-Nano-Policy-DROID.

Omnimodel Capabilities: Text, Video, Audio, and Action in One Checkpoint

Cosmos 1 generated video. Cosmos 3 generates text, images, video, and action trajectories from a single checkpoint, and adds audio as a conditioning input to the reasoning tower:

Text. The VLM tower produces natural language scene descriptions and grounded Q&A responses. This is primarily useful for annotation pipelines: you can ask Cosmos 3 to describe what is happening in a synthetic clip, then use that description as a label without running a separate VLM.

Image. Single-frame photorealistic generation from text prompts. For robotics teams, this is useful for quickly validating that your prompt correctly describes the environment before committing to a multi-frame generation run.

Video. Multi-frame temporally consistent world simulation. This is the core Cosmos 1 capability, substantially improved in Cosmos 3 by the dedicated video path. The visual fidelity, temporal consistency, and physical plausibility of object interactions all improve when the video path does not compete for parameters with the audio or action variants.

Ambient audio (input conditioning). Cosmos 3's reasoning tower accepts audio as a conditioning input, allowing multimodal scene understanding that incorporates environment sounds alongside video and text observations. This is useful for robotics pipelines where audio observations (gripper contact sounds, conveyor belt noise, emergency stop signals) are part of the sensor stream the model reasons over. NVIDIA's marketing materials reference ambient audio generation as a Cosmos 3 capability, but current technical documentation lists audio primarily under reasoner inputs rather than generation outputs. Confirm audio generation support against current model release notes and Table 1 of the Cosmos 3 technical report before building audio generation pipelines.

Physics-grounded action trajectories. This is the modality that most directly affects robotics teams. The action trajectory path outputs robot policy primitives: joint position sequences, end-effector poses, or contact force profiles, depending on the output configuration. These are not planning outputs in the sense of a motion planner. They are learned trajectory primitives grounded in the physical dynamics the model has internalized from training data. They feed directly into reinforcement learning loops or simulator replay.

The combination means a single Cosmos 3 deployment can supply multiple arms of a robotics training pipeline simultaneously: video for visual domain adaptation, action trajectories for policy bootstrapping, and audio-conditioned scene understanding for multimodal sensor pipelines.

GPU and VRAM Requirements: Sizing the Two Towers

GPU sizing for Cosmos 3 depends on which scale variant you deploy. Nano (16B total, reportedly 8B + 8B per tower) is workstation-friendly; Super (64B total, reportedly 32B + 32B per tower) requires datacenter HBM. The table below covers the main configurations on Spheron.

Variant	Configuration	GPU	VRAM	On-Demand $/hr	Notes
Nano (16B)	Both towers	RTX PRO 6000 Blackwell	96GB	$4.50	Workstation target; RTX PRO 6000 on Spheron
Nano (16B)	Both towers	H100 PCIe	80GB	$2.98	Fits with headroom for KV cache
Super (64B)	Generation tower only	H200 SXM5	141GB	$3.70	Single GPU; add a second H200 for the reasoning tower
Super (64B)	Both towers	2×H200 SXM5	2×141GB	$7.40	Minimum per NVIDIA serving guidance; one tower per GPU
Super (64B)	Both towers	B200 SXM6	192GB	Spot only ($5.34)	Single GPU co-deployment; full HBM for multi-stream output

A note on B200 pricing: as of 29 Jun 2026, B200 SXM6 instances on Spheron are available only at spot pricing ($5.34/hr); no dedicated on-demand B200 offers are listed. Spot instances can be reclaimed and are not suitable for interactive or latency-sensitive pipelines. For production world generation pipelines that need guaranteed availability, run on H200 on-demand and treat B200 spot as the high-throughput batch option.

For next-generation scale, NVIDIA's Rubin-class R100 is available for pre-order on Spheron at /gpu-rental/r100/. Once available, R100's expanded HBM capacity will be the natural fit for Super at higher resolutions and multi-stream generation.

The co-deployment boundary for Super matters. A single H200 SXM5 (141GB) fits the Super generation tower (32B, roughly 64GB in BF16) with headroom for KV cache and diffusion activations, but not both towers together. The full 64B model weight set is approximately 128GB in BF16, and adding KV cache plus activation memory pushes beyond a 141GB card. NVIDIA's minimum serving configuration for co-deploying both Super towers is 2×H200 (one tower per GPU) or a B200 SXM6 (192GB) for single-GPU co-deployment.
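The sizing arithmetic above can be sketched in a few lines of shell. The helper names (weights_gb, fits_on_gpu) and the 20GB allowance for KV cache and activations are assumptions made for illustration, not NVIDIA figures; measure the real overhead on your own workload before committing to a configuration.

```bash
#!/usr/bin/env bash
# Rough VRAM sizing sketch. Parameter counts are in billions, so
# params_b * bytes_per_param gives the weight size in GB (BF16 = 2 bytes).

weights_gb() {
  local params_b=$1 bytes_per_param=$2
  echo $(( params_b * bytes_per_param ))
}

# Succeeds if BF16 weights plus an assumed overhead fit in the given VRAM.
# overhead_gb is a hypothetical allowance for KV cache and activations.
fits_on_gpu() {
  local params_b=$1 vram_gb=$2 overhead_gb=${3:-20}
  local need=$(( $(weights_gb "$params_b" 2) + overhead_gb ))
  [ "$need" -le "$vram_gb" ]
}

# Columns: parameters (billions), VRAM (GB), label.
while read -r params vram label; do
  if fits_on_gpu "$params" "$vram"; then verdict="fits"; else verdict="does not fit"; fi
  echo "$label: $(weights_gb "$params" 2)GB weights vs ${vram}GB VRAM -> $verdict"
done <<'EOF'
32 141 Super generation tower on H200
64 141 Super both towers on H200
64 192 Super both towers on B200
16 80 Nano both towers on H100
EOF
```

With the assumed 20GB overhead, the only configuration that fails is both Super towers on a single H200, which matches the co-deployment boundary described above.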

H200 SXM5 instances on Spheron are available on-demand with per-minute billing. B200 instances are listed on the B200 page with current spot availability.

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check the current GPU pricing page for live rates.

Step-by-Step Deployment: Reasoning Tower, Generation Tower, and Multi-GPU Planning

Prerequisites

Before starting:

GPU instance with NVIDIA driver 550+ and CUDA 12.4+ installed
Docker 24.0+
NVIDIA Container Toolkit
NGC account with API key (ngc.nvidia.com)
Hugging Face account with Cosmos 3 model access approved (accept the OpenMDW-1.1 license on the Nano or Super model page)
300GB+ NVMe storage: Nano variant (~32GB total weights), Super variant (~128GB total weights), plus container images and output buffers
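The tool prerequisites above can be checked with a short preflight script before you start. This is a minimal sketch: missing_cmds is a hypothetical helper name, and /weights is the weights directory used later in this guide. nvidia-ctk may legitimately be missing at this point, since Step 2 installs it.

```bash
#!/usr/bin/env bash
# Preflight sketch: report which required commands are missing.

missing_cmds() {
  local cmd missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  # Trim the leading space; prints an empty line if everything is present.
  echo "${missing# }"
}

missing=$(missing_cmds docker nvidia-smi nvidia-ctk)
if [ -n "$missing" ]; then
  echo "Missing tools: $missing"
else
  echo "All required tools found"
  docker --version
  nvidia-ctk --version
fi

# Show free space where the weights will live (falls back to / if absent).
weights_dir=/weights
[ -d "$weights_dir" ] || weights_dir=/
df -h "$weights_dir"
```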

Step 1: Provision a GPU Instance

For Nano (16B) deployments, an H100 80GB or RTX PRO 6000 Blackwell handles both towers on one GPU. For Super (64B), provision an instance with at least two H200 SXM5 GPUs (one per tower) so both containers can share a Docker network on the same host; a single H200 SXM5 fits the generation tower alone. For single-GPU Super co-deployment, use a B200 SXM6 (192GB). See Spheron's pricing page for current availability.

After SSH:

bash

nvidia-smi
# Confirm driver 550+ and GPU memory (141GB for H200, 192GB for B200)
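If you would rather script this check than read the nvidia-smi table by eye, the query interface returns machine-readable output. The sketch below assumes the CSV line format shown in the comment; gpu_line_ok is a hypothetical helper, and the 130000 MiB floor is an illustrative threshold for an H200, not an official value.

```bash
#!/usr/bin/env bash
# Sketch: validate one line of
#   nvidia-smi --query-gpu=driver_version,memory.total --format=csv,noheader,nounits
# Example line: "550.54.15, 143771" (memory.total is reported in MiB).

gpu_line_ok() {
  local line=$1 min_driver=$2 min_mem_mib=$3
  local driver mem major
  driver=${line%%,*}      # text before the first comma
  mem=${line##*,}         # text after the last comma
  mem=${mem# }            # drop the space that follows the comma
  major=${driver%%.*}     # driver major version, e.g. 550
  [ "$major" -ge "$min_driver" ] && [ "$mem" -ge "$min_mem_mib" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version,memory.total \
             --format=csv,noheader,nounits |
  while IFS= read -r line; do
    # Assumed floor for a 141GB H200; lower it for H100 or raise it for B200.
    if gpu_line_ok "$line" 550 130000; then
      echo "OK: $line"
    else
      echo "CHECK: $line"
    fi
  done
else
  echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```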

Step 2: Set Up the NVIDIA Container Toolkit

bash

# Install container toolkit if not pre-installed
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify
nvidia-ctk --version

Step 3: Pull the Cosmos 3 Containers

bash

# Authenticate with NGC
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

# Pull Cosmos 3 Nano (16B, for H100 / RTX PRO 6000 deployments)
docker pull nvcr.io/nim/nvidia/cosmos3-nano:latest

# Pull Cosmos 3 Super (64B, for H200 / B200 deployments)
docker pull nvcr.io/nim/nvidia/cosmos3-super:latest

Cosmos 3 container names follow the model variant naming (cosmos3-nano, cosmos3-super) rather than the tower-role naming used in Cosmos 1 (cosmos-predict1-*). Verify the exact version tag against the NGC catalog before pulling.
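For scripted provisioning, the interactive login prompt can be avoided by piping the key to docker login on stdin. This is a sketch that assumes the key is exported as NGC_API_KEY, the same variable the later docker run commands pass to the containers; ngc_login is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Sketch: non-interactive NGC login, reading the key from NGC_API_KEY.

ngc_login() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  # '$oauthtoken' is the literal username NGC expects; the single quotes
  # stop the shell from expanding it as a variable.
  printf '%s' "$NGC_API_KEY" |
    docker login nvcr.io --username '$oauthtoken' --password-stdin
}

# Usage, after exporting NGC_API_KEY:
#   ngc_login
```

Passing the key on stdin keeps it out of the shell history and the process list, which is why it is preferred over a --password argument.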

Step 4: Download Cosmos 3 Weights

bash

pip install huggingface_hub

# Authenticate
huggingface-cli login
# Enter your HF token when prompted

# Download Nano variant (16B total, ~8B per tower)
# Suitable for H100 80GB or RTX PRO 6000 Blackwell
huggingface-cli download nvidia/Cosmos3-Nano \
  --local-dir /weights/cosmos3-nano

# Download Super variant (64B total, ~32B per tower)
# Requires H200 SXM5 (141GB) or B200 SXM6 (192GB)
huggingface-cli download nvidia/Cosmos3-Super \
  --local-dir /weights/cosmos3-super

# Task-specific variants (optional):
# huggingface-cli download nvidia/Cosmos3-Super-Image2Video --local-dir /weights/cosmos3-super-i2v
# huggingface-cli download nvidia/Cosmos3-Nano-Policy-DROID --local-dir /weights/cosmos3-nano-policy

You must accept the OpenMDW-1.1 license on each model's Hugging Face page before the download will succeed. Nano and Super have separate license gates, as do the task-specific variants.

Step 5: Launch the Reasoning Tower

bash

docker network create cosmos-net 2>/dev/null || true

docker run -d \
  --gpus '"device=0"' \
  --name cosmos3-reasoning \
  --network cosmos-net \
  -v /weights/cosmos3-nano:/workspace/weights \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8080:8080 \
  nvcr.io/nim/nvidia/cosmos3-nano:latest \
  --tower reasoning \
  --weights /workspace/weights \
  --host 0.0.0.0 \
  --port 8080

The reasoning tower exposes a REST endpoint at port 8080. It accepts text prompts and image observations, and returns scene encodings in a format consumed by the generation tower. Verify it is healthy before launching the generation tower:

bash

curl http://localhost:8080/v1/health
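Loading tower weights can take a while, so a single curl may fail even when the container is starting normally. A small polling loop is one way to wait for readiness before moving on. The sketch below uses the /v1/health endpoint shown above; wait_for_health and its default retry counts are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: poll a health endpoint until it responds or attempts run out.

wait_for_health() {
  local url=$1 tries=${2:-30} delay=${3:-10}
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (4xx/5xx).
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  echo "no healthy response from $url after $tries attempt(s)" >&2
  return 1
}

# Usage: up to 30 attempts, 10 seconds apart.
#   wait_for_health http://localhost:8080/v1/health 30 10
```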

Step 6: Launch the Generation Tower and Configure Modality Router

The generation tower reads its routing configuration from a JSON file that directs each modality request to the appropriate task-specific output path. This is a NIM request-routing config, not a learned per-token router. Create this config first, then launch the container.

A minimal config:

json

{
  "modality_router": {
    "video": {
      "path": "video_generation",
      "resolution": "720p",
      "fps": 24,
      "num_frames": 60
    },
    "audio": {
      "path": "audio_conditioning",
      "sample_rate": 44100,
      "duration_sec": 2.5
    },
    "action": {
      "path": "action_generation",
      "output_format": "joint_positions",
      "horizon": 50
    }
  }
}

Save this as /weights/cosmos3-nano/router.json on the host. That directory is already mounted into the container as /workspace/weights, so the file is available at /workspace/weights/router.json without adding a second volume mount. Then start the generation tower:

bash

docker run -d \
  --gpus '"device=0"' \
  --name cosmos3-generation \
  --network cosmos-net \
  -v /weights/cosmos3-nano:/workspace/weights \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e REASONING_ENDPOINT=http://cosmos3-reasoning:8080 \
  -p 8081:8081 \
  nvcr.io/nim/nvidia/cosmos3-nano:latest \
  --tower generation \
  --weights /workspace/weights \
  --reasoning-endpoint http://cosmos3-reasoning:8080 \
  --router-config /workspace/weights/router.json \
  --host 0.0.0.0 \
  --port 8081

If the generation tower runs out of VRAM, lower resolution or num_frames in router.json to fit your budget.
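When changing num_frames or fps, it helps to keep track of the resulting clip length. In the sample config, 60 frames at 24 fps is 2.5 seconds, which equals the audio duration_sec value; that the two are meant to stay in sync is an inference from the sample values, not something the config documents. A sketch of the arithmetic, with hypothetical helper names:

```bash
#!/usr/bin/env bash
# Sketch: derive clip length from the router.json video settings.

# Seconds of video produced by num_frames at the given fps.
clip_seconds() {
  local num_frames=$1 fps=$2
  awk -v n="$num_frames" -v f="$fps" 'BEGIN { printf "%.1f", n / f }'
}

# Whole frames needed for a clip of the given length (fraction truncated).
frames_for() {
  local seconds=$1 fps=$2
  awk -v s="$seconds" -v f="$fps" 'BEGIN { printf "%d", s * f }'
}

echo "60 frames at 24 fps = $(clip_seconds 60 24)s"
echo "5s at 24 fps needs $(frames_for 5 24) frames"
```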

Submit a generation request to test:

bash

curl -X POST http://localhost:8081/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A robot arm picks up a red cube from a warehouse shelf",
    "modalities": ["video"],
    "resolution": "720p"
  }'

Step 7: Multi-GPU Memory Planning

For Nano deployments on H100, both towers (16B total) fit on one GPU with headroom. For Super on H200, both towers (64B total, approximately 128GB in BF16) do not fit on a single 141GB H200; run each tower on a dedicated GPU. The two-GPU H200 split below is the minimum config per NVIDIA's serving guidance. For single-GPU Super co-deployment, move to a B200 SXM6 (192GB).

Two-GPU H200 split (Super variant):

bash

docker network create cosmos-net 2>/dev/null || true

# Reasoning tower on GPU 0
docker run -d --gpus '"device=0"' --name cosmos3-reasoning --network cosmos-net \
  nvcr.io/nim/nvidia/cosmos3-super:latest --tower reasoning ...

# Generation tower on GPU 1
docker run -d --gpus '"device=1"' --name cosmos3-generation --network cosmos-net \
  -e REASONING_ENDPOINT=http://cosmos3-reasoning:8080 \
  nvcr.io/nim/nvidia/cosmos3-super:latest --tower generation ...

KV cache sharing between towers uses the NVLink interconnect on SXM form-factor GPUs. For NVLink-connected multi-GPU setups, scene encodings from the reasoning tower transfer to the generation tower at NVLink bandwidth (approximately 900GB/s on H200 NVLink). For PCIe-connected setups, the transfer goes over PCIe (approximately 128GB/s bidirectional on PCIe 5.0 x16), which creates a bottleneck when generating at high frequency. For multi-node setups or configurations without NVLink, see the multi-node GPU training guide for bandwidth planning.
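To get a feel for the gap between the two interconnects, the sketch below converts the peak bandwidth figures quoted above into transfer time per gigabyte. The 1GB payload is a hypothetical size chosen for easy comparison, not a measured scene-encoding size, and real transfers will not reach theoretical peak.

```bash
#!/usr/bin/env bash
# Sketch: time to move a payload at a given peak link bandwidth.

transfer_ms() {
  local size_gb=$1 bw_gb_per_s=$2
  awk -v s="$size_gb" -v b="$bw_gb_per_s" 'BEGIN { printf "%.2f", s / b * 1000 }'
}

# 1GB is a hypothetical payload. The PCIe figure is the bidirectional
# peak; a one-way transfer sees roughly half of it.
echo "NVLink (900GB/s): $(transfer_ms 1 900) ms per GB"
echo "PCIe 5.0 x16 (128GB/s): $(transfer_ms 1 128) ms per GB"

# To see how the GPUs on an instance are actually connected, run:
#   nvidia-smi topo -m
```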

Use Cases: Robot Policy Training, AV Simulation, Synthetic Data, and Embodied-Agent Evaluation

Robot Policy Training

The action trajectory task-specific variant is the feature that changes the Cosmos 3 value proposition for robotics teams. Cosmos 1 provided photorealistic video clips used for visual domain adaptation. Cosmos 3 provides those clips plus the action trajectories that produced them. A robotics team building a manipulation policy can now generate video of a task, get the corresponding joint positions and contact forces from the action trajectory path, and use both to seed a fine-tuning dataset.

These trajectory outputs feed directly into LeRobot v2 and Isaac Lab training loops. The format is configurable: joint position sequences, end-effector poses, or contact force profiles, depending on the output configuration. For the next step after generating these trajectories, Deploy NVIDIA Isaac GR00T N1 on GPU Cloud covers the Isaac Lab fine-tuning pipeline that consumes this kind of synthetic trajectory dataset.

AV Simulation

Autonomous vehicle teams use world models to generate rare edge cases at scale: occluded pedestrians, unusual lighting, adversarial scenarios. Cosmos 3's video path produces better temporal consistency at longer horizons than Cosmos 1, which matters for AV scenarios that need 10-30 second clips rather than the 2-3 second clips Cosmos 1 handled well. For stacks that fuse audio sensors, the reasoning tower can take road noise as a conditioning input; as noted in the capabilities section, confirm audio generation support against the current release notes before relying on generated road noise as training data. For the full picture of what GPUs power production world model pipelines, see GPU Infrastructure Behind World Models 2026.

Synthetic Data

Cosmos 3 improves on Cosmos 1 for robotics dataset generation in two ways. First, the dedicated video path removes the parameter-sharing penalty that Cosmos 1 paid when generating manipulation scenes: tool contact, object deformation, and grasping dynamics are now handled by a path that does nothing else. Second, the action trajectory output means your synthetic dataset includes both video and the policy primitives that generated it, which adds action supervision to the visual signal for imitation learning. For the Cosmos 1 pipeline comparison and container setup details, the prior-generation deployment guide (linked in the intro) covers that setup in depth.

Embodied-Agent Evaluation

The world-generation tower serves as a closed-loop simulator: you send an action, the model predicts the next world state as a video frame, and your policy observes that frame to pick the next action. This is the same loop as a physics simulator, but with photorealistic visual outputs instead of rendered primitives. It does not replace a physics simulator for high-frequency control (Cosmos 3 generates frames at well under real-time), but it is a useful evaluation tool for policies that need to respond to visual observations at lower frequencies.

For physics-accurate high-speed simulation, pair Cosmos 3 with Genesis for the fast physics loop and use Cosmos 3 for the visual domain evaluation step, as the Deploy Genesis Physics Engine on GPU Cloud guide describes for the Genesis-Cosmos pipeline. For evaluating VLA policies under realistic visual conditions, Deploy OpenVLA on GPU Cloud covers the OpenVLA inference setup that consumes world model outputs.

For online RL where the agent interacts with the world model as a simulator, Deploy RLinf on GPU Cloud for Embodied AI covers the distributed RL training infrastructure that runs the policy update loop against these kinds of generative environment backends.

Cost Breakdown on Spheron: On-Demand and Spot Pricing for Cosmos 3

GPU	Pricing Type	$/hr	720p fps (est.)	Cost per minute of generated video
H200 SXM5	On-demand	$3.70	~0.6 fps	~$2.47/min
H200 SXM5	Spot	$3.31	~0.6 fps	~$2.21/min
B200 SXM6	Spot only	$5.34	~1.1 fps	~$1.94/min

fps here means generated video frames per second of wall-clock time. At 0.6 fps on H200, each minute of 24fps video is 1,440 frames and requires about 40 minutes of GPU time. At 1.1 fps on B200 (roughly 1.8× the H200 estimate, helped by 8 TB/s vs 4.8 TB/s HBM bandwidth), the same minute of video takes about 22 minutes. Because the throughput gain outweighs the higher hourly rate, B200 spot produces a lower cost per minute of generated video.
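The figures in the table follow from a short calculation: GPU minutes per minute of video is the video frame rate divided by the generation rate, and the cost is that time multiplied by the hourly rate. A sketch you can rerun with live prices, where the function names are illustrative and the throughput numbers are the estimates from the table:

```bash
#!/usr/bin/env bash
# Sketch: cost per minute of generated video from hourly rate and throughput.

# GPU minutes needed per minute of output video:
# (video_fps * 60 frames) / gen_fps seconds, divided by 60 = video_fps / gen_fps.
gpu_minutes_per_video_minute() {
  local gen_fps=$1 video_fps=${2:-24}
  awk -v g="$gen_fps" -v v="$video_fps" 'BEGIN { printf "%.0f", v / g }'
}

# Dollars per minute of output video at the given hourly rate.
cost_per_video_minute() {
  local rate_per_hr=$1 gen_fps=$2 video_fps=${3:-24}
  awk -v r="$rate_per_hr" -v g="$gen_fps" -v v="$video_fps" \
    'BEGIN { printf "%.2f", r * (v / g) / 60 }'
}

echo "H200 on-demand: $(gpu_minutes_per_video_minute 0.6) GPU-min, \$$(cost_per_video_minute 3.70 0.6)/min"
echo "H200 spot:      $(gpu_minutes_per_video_minute 0.6) GPU-min, \$$(cost_per_video_minute 3.31 0.6)/min"
echo "B200 spot:      $(gpu_minutes_per_video_minute 1.1) GPU-min, \$$(cost_per_video_minute 5.34 1.1)/min"
```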

Spot instances on Spheron can be reclaimed without notice. They are the right choice for large batch synthetic data generation jobs that are checkpointable and do not have strict completion time requirements. For interactive pipelines, CI/CD workflows, or any use case where GPU preemption would break the generation run, use on-demand H200.
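One simple way to make a batch job checkpointable is to write one output file per prompt and skip prompts whose output already exists when the job restarts. The sketch below shows that pattern. The file layout (prompts.txt, out/clip_N.json) is hypothetical, and saving the response body as JSON is an assumption, since the response format of /v1/generate is not described here.

```bash
#!/usr/bin/env bash
# Sketch: make a batch job resumable after a spot reclaim by skipping
# prompts whose output file already exists.

# Prints the 1-based line numbers of prompts that still need generating.
pending_prompts() {
  local prompts_file=$1 out_dir=$2
  local n=0 line
  # The "|| [ -n ... ]" clause handles a final line with no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$(( n + 1 ))
    [ -e "$out_dir/clip_$n.json" ] || echo "$n"
  done < "$prompts_file"
}

# Usage (hypothetical paths; prompts must not contain double quotes or
# backslashes, because they are spliced into the JSON body unescaped):
#
# for n in $(pending_prompts prompts.txt out); do
#   prompt=$(sed -n "${n}p" prompts.txt)
#   curl -fsS -X POST http://localhost:8081/v1/generate \
#     -H "Content-Type: application/json" \
#     -d "{\"prompt\": \"$prompt\", \"modalities\": [\"video\"]}" \
#     -o "out/clip_$n.json.tmp" &&
#     mv "out/clip_$n.json.tmp" "out/clip_$n.json"
# done
```

Writing to a temporary file and renaming it only after curl succeeds means a clip interrupted mid-request is not mistaken for a finished one on the next run.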

Connecting Cosmos 3 to Your Embodied AI Stack

Cosmos 3 sits in the middle of the physical AI stack. Above it are the task-level components: the VLA policy you are training, the RL framework running the policy updates, and the simulation environment evaluating the policy. Below it are the compute primitives: the GPU instances, the HBM bandwidth budget, and the container infrastructure.

For deployment documentation including CUDA setup and storage configuration, see docs.spheron.ai. For the RL training infrastructure that consumes Cosmos 3 action trajectory outputs as training data, the RLinf distributed training guide (linked in the embodied-agent section above) covers that side. On the NVIDIA side, the Physical AI Data Factory Blueprint connects Cosmos 3 to Omniverse for scene authoring and Isaac Lab for physics validation, giving teams a fully documented pipeline from scene prompt to trained policy.

Cosmos 3 Super needs an H200-class GPU (141GB) per tower, or a single 192GB B200, plus sustained HBM bandwidth to handle multi-modality world generation. Spheron's H200 and B200 instances provide both at transparent pricing, with no hyperscaler lock-in.

Quick Setup Guide

1. Provision an H200 or B200 GPU instance
Log in to app.spheron.ai and select an H200 SXM5 (141GB HBM3e) or B200 SXM6 (192GB HBM3e) instance. Deploy with Ubuntu 22.04 and CUDA 12.4+. Verify GPU visibility with nvidia-smi before pulling Cosmos 3 containers.

2. Authenticate with NGC and pull the Cosmos 3 container
Generate an NGC API key at ngc.nvidia.com. Log in to nvcr.io with docker login nvcr.io (use $oauthtoken as username and your NGC key as password). Pull the Cosmos 3 NIM container for your variant (cosmos3-nano or cosmos3-super); the same image serves both towers. Verify the exact image tag against NGC's catalog before pulling, as Cosmos 3 image names differ from Cosmos 1.

3. Download Cosmos 3 model weights from Hugging Face
Accept the OpenMDW-1.1 license on the Cosmos 3 model pages on Hugging Face. Authenticate with huggingface-cli login, then download the Nano variant (nvidia/Cosmos3-Nano, 16B) for H100 or RTX PRO 6000 deployments, or the Super variant (nvidia/Cosmos3-Super, 64B) for H200 and B200 deployments. Task-specific repos such as nvidia/Cosmos3-Super-Image2Video and nvidia/Cosmos3-Nano-Policy-DROID are available for specialized use cases.

4. Deploy the reasoning tower (VLM tower)
Start the reasoning transformer container with GPU passthrough. This tower handles text prompts and scene understanding inputs. Pass the NGC_API_KEY environment variable and mount the weights volume. The reasoning tower exposes a REST endpoint that accepts text or image observations and returns scene encodings consumed by the generation tower.

5. Deploy the generation tower and configure modality routing
Launch the generation tower container linked to the reasoning tower endpoint. Configure the modality router to direct video, audio, or action trajectory requests to the appropriate task-specific output path. For Nano on H100, both towers fit on one GPU. For Super on H200, use at least two H200 GPUs (one per tower); a single 141GB H200 fits the generation tower alone. A B200 SXM6 (192GB) can host both Super towers on one GPU.

Frequently Asked Questions

What GPU does NVIDIA Cosmos 3 require?

GPU requirements depend on the Cosmos 3 variant. Nano (16B total, reported as 8B + 8B per tower) fits on an RTX PRO 6000 Blackwell (96GB) or a single H100 80GB with room for KV cache. Super (64B total, reported as 32B + 32B per tower) needs at least an H200 SXM5 (141GB HBM3e) for the generation tower alone; co-deploying both Super towers requires at minimum 2×H200 SXM5 (one per tower) per NVIDIA's serving guidance, or a B200 SXM6 (192GB HBM3e) for single-GPU co-deployment. Edge (4B) is announced for future on-device/Jetson deployment but not yet released. On Spheron, H200 SXM5 runs at $3.70/hr on-demand; B200 SXM6 is available at spot pricing.

How does Cosmos 3's two-tower MoT differ from Cosmos 1?

Cosmos 1 used a single autoregressive or diffusion backbone. Cosmos 3 separates reasoning and generation into two specialized towers: an autoregressive VLM (reasoning tower) and a diffusion model (generation tower). Both towers share a transformer architecture with separate parameter sets per layer interacting through joint attention, and use 3D mRoPE per NVIDIA's technical report to align video, audio, and action tokens on one temporal axis. Modality routing is task-determined, not a per-token learned router. Cosmos 3 ships as Nano (16B total, reportedly 8B + 8B per tower) and Super (64B total, reportedly 32B + 32B per tower). Edge (4B), targeting on-device/Jetson deployments, is announced but not yet released.

What modalities does Cosmos 3 generate?

Cosmos 3 generates text, images, video, and physics-grounded action trajectories from a single model checkpoint. The video modality produces photorealistic physical-world clips used for synthetic training data. The action trajectory modality outputs robot policy primitives directly usable in reinforcement learning loops or simulator replay. Audio serves as a conditioning input to the reasoning tower rather than a confirmed generation output in current checkpoints; verify audio generation support against current model documentation before building audio pipelines.

Can I self-host Cosmos 3 or do I need NVIDIA's API?

Cosmos 3 weights are distributed under the OpenMDW-1.1 license (Linux Foundation's Open Model Development Weights license), available on Hugging Face under the nvidia/ organization and NVIDIA's NGC registry. OpenMDW-1.1 permits commercial use and derivatives. NVIDIA's model card additionally requires a 'Built on NVIDIA Cosmos' attribution. Self-hosting gives full data control, lower per-clip cost at scale, and the ability to customize environment prompts. You must accept the OpenMDW-1.1 license on Hugging Face before downloading weights.

What is the cost to run Cosmos 3 world generation on Spheron?

At on-demand rates on Spheron, an H200 SXM5 runs at $3.70/hr for Cosmos 3 generation workloads. A B200 SXM6 offers spot pricing at $5.34/hr with roughly double the throughput due to its 8 TB/s HBM bandwidth, resulting in a lower cost per minute of generated video. Exact pricing depends on current availability - check live rates at spheron.network/pricing/.