World models need 8-32x the GPU compute of a comparably-sized LLM, and the bandwidth requirements flip inference from compute-bound to memory-bound at every resolution above 480p. This is not a marginal difference in hardware needs - it fundamentally changes which GPU tier makes sense and why HBM bandwidth matters more than raw CUDA core count. This post covers what's actually running these models, what the GPU requirements look like for open alternatives, and what it costs per frame at each tier. For the equivalent breakdown for coding tools and agent workloads, see the GPU infrastructure guide for AI coding tools and the GPU infrastructure guide for AI agents.
What World Models Are and Why 2026 Is Their Breakout Year
A world model is a generative model of physical environments. Rather than producing a text token sequence, it produces video frames or 3D state transitions from action inputs or text prompts. The defining property is temporal consistency: each output frame must be physically plausible given every preceding frame. This is the part that destroys GPU budgets.
Three models define the current landscape:
Genie 3 (Google DeepMind) builds on the Genie family's latent action model approach, learning a compressed action space from video without labeled action data, then generating interactive environments from that learned space. The result is video that responds coherently to user inputs in real time. Genie 3 is closed API only as of 2026.
Marble (World Labs) targets 3D environment generation and navigation tasks, launched commercially in November 2025 with freemium and paid tiers. It is not self-hostable.
NVIDIA Cosmos is the most important model for practitioners in 2026 because it is actually self-hostable. It generates photorealistic synthetic video of physical environments for robotics and autonomous vehicle training pipelines. Full deployment details are in the NVIDIA Cosmos GPU cloud guide.
Why 2026 is the breakout year comes down to three converging trends. Robotics sim-to-real is moving from physics-only simulation to photorealistic world models because visual realism closes the policy transfer gap. Autonomous vehicle teams are using world models to generate rare edge cases at scale instead of waiting to encounter them in real-world driving logs. Game studios are running world model inference for procedural level generation. Each of these demands sustained, high-throughput GPU compute at resolutions that exceed what LLM-era GPU stacks were designed for.
For the robotics simulation context, see Deploy Genesis Physics Engine on GPU Cloud.
How World-Model Inference Differs from LLM Inference
LLM inference generates a token sequence. At each step, the model attends over a context window of 1,000-100,000 tokens and produces one next token. A well-optimized H100 running a 7B model in FP16 produces 100-500 tokens per second. Memory bandwidth is the bottleneck at small batch sizes because weight loading dominates.
World model inference at 720p/24fps looks fundamentally different. At 720p with standard patch sizes (16x16), each frame contains roughly 2,700 spatial tokens (a 60x45 patch grid, which corresponds to a 960x720 frame; a 16:9 1280x720 frame would have 3,600). At 24fps, the model must generate 64,800 spatial tokens per second, sustained. Even an autoregressive model that generates one frame per step needs far more throughput than a typical single-stream LLM inference scenario. For diffusion-based world models (Cosmos-style), which iteratively denoise each frame over 20-50 denoising steps, the compute requirement is higher still.
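The token-rate arithmetic above can be reproduced with a short sketch. It assumes non-overlapping 16x16 pixel patches applied directly to the frame (real models may patchify a compressed latent instead); the function names are illustrative and not part of any world model's API.

```python
def patches_per_frame(width_px: int, height_px: int, patch_px: int = 16) -> int:
    """Count non-overlapping patch tokens in a single frame."""
    return (width_px // patch_px) * (height_px // patch_px)


def tokens_per_second(tokens_per_frame: int, fps: int) -> int:
    """Sustained token rate needed to emit `fps` frames every second."""
    return tokens_per_frame * fps


# 960x720 gives the 2,700-token figure quoted in the text.
tokens_4x3 = patches_per_frame(960, 720)
# A 16:9 720p frame (1280x720) has more patches under the same assumption.
tokens_16x9 = patches_per_frame(1280, 720)

print(tokens_4x3, tokens_per_second(tokens_4x3, 24))    # 2700 64800
print(tokens_16x9, tokens_per_second(tokens_16x9, 24))  # 3600 86400
```

Treat the result as a per-pass token count: a diffusion model that runs 20-50 denoising steps touches every one of these tokens on each step.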
The critical difference is attention scaling. LLM attention scales quadratically with context length but that context is flat. World model attention must scale across both spatial extent (H×W patches per frame) and temporal extent (all prior frames for temporal consistency). A 720p diffusion world model at 24fps with 16 frames of temporal context is attending over a token grid roughly 3x the size of a standard LLM context window, with spatial structure that prevents the KV cache savings that make long-context LLMs tractable.
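To make the scaling argument concrete, the sketch below counts attention score entries for 16 frames of context at 2,700 tokens per frame. It assumes dense self-attention over every spatiotemporal token, which is a simplification: production models may factorize spatial and temporal attention, and the constants are just the figures quoted in this post.

```python
TOKENS_PER_FRAME = 2_700  # figure quoted in the text
CONTEXT_FRAMES = 16       # temporal context used in the example above


def attention_pairs(num_tokens: int) -> int:
    """Entries in one dense self-attention score matrix (per head, per layer)."""
    return num_tokens * num_tokens


# Joint attention: every token attends to every token in every frame.
joint_tokens = TOKENS_PER_FRAME * CONTEXT_FRAMES
joint_pairs = attention_pairs(joint_tokens)

# Per-frame attention only: each frame attends within itself.
per_frame_pairs = CONTEXT_FRAMES * attention_pairs(TOKENS_PER_FRAME)

print(joint_tokens)                    # 43200
print(joint_pairs // per_frame_pairs)  # 16
```

Under these assumptions, attending jointly across the 16-frame window costs 16x the score entries of attending within each frame separately, which is the price of temporal consistency.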
The result: 8-32x more compute than LLM inference at equivalent "output rate". The upper end of that range applies to high-resolution diffusion world models; the lower end applies to autoregressive models at 480p.
| Workload | VRAM | HBM Bandwidth Needed | H100 SXM5 Throughput | H200 SXM5 Throughput |
|---|---|---|---|---|
| 7B LLM (FP16, batch 1) | 14 GB | ~0.5 TB/s | ~200 tokens/sec | ~280 tokens/sec |
| Video gen 720p (Wan 2.1) | 65-80 GB | ~2 TB/s | 1 frame/10 min | 1 frame/7 min |
| World model 720p (Cosmos-7B) | 80 GB | ~2.5-3 TB/s | 1 frame/2-3 sec | 1 frame/~1.7 sec |
| World model 1080p (Cosmos-14B) | 160+ GB | ~4-5 TB/s | OOM | 2x H200 required |
HBM bandwidth is the reason H200 and B200 are the relevant GPU choices for production world model inference above 480p. The H100 SXM5 delivers 3.35 TB/s of HBM3 bandwidth. The H200 SXM5 delivers 4.8 TB/s of HBM3e - a 43% improvement on the same silicon. The B200 delivers 8 TB/s. At world model token rates, that bandwidth difference maps almost linearly to throughput. For a detailed breakdown of HBM generations and what each means for inference, see the HBM3e vs HBM4 vs HBM4e LLM inference guide.
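The bandwidth ratios quoted above are easy to check. This sketch uses only the TB/s figures stated in the paragraph; the dictionary and function names are illustrative.

```python
# HBM bandwidth in TB/s, as quoted in the text.
HBM_TBPS = {"H100 SXM5": 3.35, "H200 SXM5": 4.8, "B200": 8.0}


def bandwidth_ratio(baseline: str, gpu: str) -> float:
    """How many times more HBM bandwidth `gpu` has than `baseline`."""
    return HBM_TBPS[gpu] / HBM_TBPS[baseline]


for gpu in HBM_TBPS:
    ratio = bandwidth_ratio("H100 SXM5", gpu)
    print(f"{gpu}: {ratio:.2f}x H100 SXM5 bandwidth")
# H100 SXM5: 1.00x H100 SXM5 bandwidth
# H200 SXM5: 1.43x H100 SXM5 bandwidth
# B200: 2.39x H100 SXM5 bandwidth
```

If throughput tracked bandwidth exactly, an H100 frame time of about 2.5 seconds would shrink to roughly 1.7 seconds on H200, which is in line with the Cosmos-7B row in the table above.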
For the video generation context (single-clip generation rather than temporally consistent world models), see the AI video generation GPU guide.
GPU Requirements for Real-Time World Model Generation
"Real-time" for a world model means generating one output frame faster than the frame period of the target frame rate. At 24fps, that is about 41.7 ms per frame. At 30fps, it is about 33.3 ms. No current open world model running on a single GPU achieves this for diffusion-based generation at 720p. Real-time performance at 720p currently requires multi-GPU inference or dedicated inference hardware from NVIDIA's Blackwell generation.
For practical deployment, the relevant threshold is near-real-time (under 10 seconds per frame) for interactive use, and throughput optimization for batch generation pipelines.
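A quick way to see how far current throughput is from real time is to compare the frame budget with a measured generation rate. The sketch below uses the roughly 0.6 generated frames per second that this post quotes for Cosmos-Predict 7B at 720p on an H200; the function names are illustrative.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Maximum time per frame, in milliseconds, to keep up with target_fps."""
    return 1000.0 / target_fps


def realtime_gap(target_fps: float, generated_fps: float) -> float:
    """Speedup still needed before generation keeps pace with playback."""
    return target_fps / generated_fps


print(round(frame_budget_ms(24), 1))  # 41.7
print(round(frame_budget_ms(30), 1))  # 33.3
print(round(realtime_gap(24, 0.6)))   # 40
print(round(1 / 0.6, 2))              # 1.67 seconds per frame
```

At about 1.67 seconds per frame, the H200 figure sits comfortably inside the 10-second near-real-time threshold while remaining roughly 40x short of true 24fps playback.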
VRAM requirements by resolution:
| Resolution | Cosmos-Predict 7B | Cosmos-Predict 14B | Notes |
|---|---|---|---|
| 480p | ~45 GB | ~85 GB | 7B fits H100 PCIe 80GB comfortably |
| 720p | ~70-80 GB | ~140-160 GB | 7B fits H100 SXM5 80GB at edge; 14B needs H200 |
| 1080p | OOM on H100/H200; B200 SXM6 (192GB) may fit | OOM (single GPU) | 7B: 1x B200 SXM6 or 2x H100/H200; 14B: 2x H200+ |
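As a rough sizing aid, the table can be turned into a lower bound on GPU count. The sketch assumes the per-GPU capacities mentioned in this post (80 GB H100, 141 GB H200, 192 GB B200) and ignores the extra memory that tensor or sequence parallelism adds, so the real requirement may be higher.

```python
import math

# Per-GPU memory in GB, as stated elsewhere in this post.
GPU_VRAM_GB = {"H100": 80, "H200": 141, "B200": 192}


def min_gpus(required_gb: float, gpu: str) -> int:
    """Smallest GPU count whose combined VRAM covers the requirement.

    Ignores parallelism overhead, so treat the result as a lower bound.
    """
    return math.ceil(required_gb / GPU_VRAM_GB[gpu])


print(min_gpus(45, "H100"))   # 1  (7B at 480p)
print(min_gpus(160, "H100"))  # 2  (14B at 720p, upper estimate)
print(min_gpus(160, "H200"))  # 2
print(min_gpus(160, "B200"))  # 1
```

The upper end of the 14B 720p estimate (160 GB) exceeds a single 141 GB H200, so plan for headroom when the table shows a range.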
GPU tier vs. throughput and cost:
| GPU | On-Demand $/hr | Cosmos-7B 720p (generated fps) | Cosmos-14B 720p (generated fps) | Cost/min of 24fps video (7B) |
|---|---|---|---|---|
| H100 PCIe | $2.01 | ~0.3 fps (batch only) | OOM | ~$2.68/min |
| H100 SXM5 | $2.54 | ~0.4 fps | OOM | ~$2.54/min |
| H200 SXM5 | $4.84 | ~0.6 fps | ~0.3 fps (est.) | ~$3.23/min |
| B200 SXM6 | $7.41 | ~1.1 fps | ~0.5 fps (est.) | ~$2.69/min |
fps here means generated video frames per second of wall-clock time, not frames per second in the output video. The B200's higher cost per hour produces a similar or better cost per minute of generated video compared to H200, because it generates more frames per GPU-hour.
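The cost-per-minute column follows directly from the hourly rate and the generated frame rate. The sketch below reproduces it using the on-demand prices and Cosmos-7B 720p throughput estimates from the table; the function name and the `TIERS` structure are illustrative.

```python
def cost_per_video_minute(usd_per_hour: float, generated_fps: float,
                          video_fps: int = 24) -> float:
    """USD of GPU time needed to generate one minute of output video."""
    frames_needed = 60 * video_fps                    # 1,440 frames at 24fps
    gpu_hours = frames_needed / generated_fps / 3600  # wall-clock hours
    return usd_per_hour * gpu_hours


# (on-demand $/hr, generated fps) taken from the table above.
TIERS = {
    "H100 PCIe": (2.01, 0.3),
    "H100 SXM5": (2.54, 0.4),
    "H200 SXM5": (4.84, 0.6),
    "B200 SXM6": (7.41, 1.1),
}

for name, (rate, fps) in TIERS.items():
    print(f"{name}: ${cost_per_video_minute(rate, fps):.2f}/min")
# H100 PCIe: $2.68/min
# H100 SXM5: $2.54/min
# H200 SXM5: $3.23/min
# B200 SXM6: $2.69/min
```

Because cost scales with price divided by throughput, a GPU that costs more per hour can still be cheaper per generated minute, which is what happens with B200 relative to H200 here.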
Autoregressive world models (Genie 3-style architectures) are faster per frame because the 20-50 denoising iterations a diffusion model runs are replaced by a single autoregressive forward pass. The tradeoff is lower visual quality and weaker temporal consistency at longer horizons. For most current open model deployments, diffusion-based architectures like Cosmos-Predict produce better results.
Both H200 SXM5 and B200 instances on Spheron can be provisioned on-demand with per-minute billing.
Open vs Closed World Models: What You Can Actually Run Today
The gap between open and closed world models is wider in 2026 than the equivalent gap in LLMs.
Closed models (API only):
- Genie 3 (Google DeepMind) - interactive world generation via Google's API
- Marble (World Labs) - 3D environment generation, commercial (not self-hostable)
- Sora (OpenAI) - text-to-video, no self-hosting
Open and self-hostable:
- NVIDIA Cosmos-Predict - Apache 2.0 source, NVIDIA Open Model License for weights. Distributed on Hugging Face and NGC. Commercial use permitted with attribution. This is the primary self-hostable option.
- UniSim - open research, not production-ready
| Model | License | Self-Hostable | Minimum GPU | Location |
|---|---|---|---|---|
| Genie 3 | Closed | No | N/A | Google API |
| Marble | Closed (commercial) | No | N/A | World Labs |
| Sora | Closed | No | N/A | OpenAI API |
| Cosmos-Predict 7B | NVIDIA Open Model | Yes | 1x H100 80GB | HF/NGC |
| Cosmos-Predict 14B | NVIDIA Open Model | Yes | 2x H100 or H200 | HF/NGC |
| UniSim | Research | Limited | H100 80GB | Research release |
The Cosmos-Predict 7B case is the most accessible entry point. A single H100 GPU rental at $2.01/hr (PCIe) or $2.54/hr (SXM5) handles 480p batch generation. For 720p inference without OOM risk, step up to H200.
Fine-tuning: Cosmos supports LoRA fine-tuning on custom environments. Minimum requirement is 2x H100 or a single H200. For fine-tuning on custom robotics environments, the approach mirrors the Cosmos-Transfer pipeline used in synthetic data generation - start with a base scene prompt and adapt the model to your target visual domain.
For the full Cosmos deployment guide including Docker setup, NGC authentication, and batch inference scripts, see Deploy NVIDIA Cosmos World Foundation Models on GPU Cloud.
Cost Economics of World-Model Inference
The cost-per-minute calculation depends on what you count as "generated video". For diffusion-based world models, the GPU generates frames during denoising iterations - the output is one polished frame, not a frame per iteration. At 720p with Cosmos-Predict 7B running on H200 SXM5, generating 1 minute of 24fps video requires 1,440 frames. At approximately 0.6 generated frames per second of wall-clock time, that takes about 40 minutes of GPU time.
| GPU | On-Demand $/hr | Minutes of 720p/24fps video per GPU-hour | Cost per minute of generated video |
|---|---|---|---|
| H100 PCIe | $2.01 | ~0.75 min | ~$2.68/min |
| H100 SXM5 | $2.54 | ~1.0 min | ~$2.54/min |
| H200 SXM5 | $4.84 | ~1.5 min | ~$3.23/min |
| B200 SXM6 | $7.41 | ~2.75 min | ~$2.69/min |
_Note: throughput estimates are for Cosmos-Predict 7B at 720p. Actual values vary with scene complexity, denoising step count, and prompt conditioning._
Compare this to AWS p5.48xlarge (8x H100 SXM5): the instance runs at approximately $55/hr on-demand. Eight H100 SXM5s on Spheron cost $20.32/hr on-demand - a ~63% reduction with bare-metal performance. For parallel generation jobs that can run across multiple independent GPU instances, the Spheron approach scales linearly without the per-instance markup that hyperscaler managed services add.
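The percentage above comes from a simple ratio. The sketch uses the approximate $55/hr and $2.54/hr figures quoted in this paragraph; both prices change over time, so treat the output as illustrative.

```python
def percent_savings(baseline_usd: float, alternative_usd: float) -> float:
    """Percentage saved by paying alternative_usd instead of baseline_usd."""
    return 100.0 * (1.0 - alternative_usd / baseline_usd)


aws_8x_h100 = 55.0          # approximate p5.48xlarge on-demand rate from the text
spheron_8x_h100 = 8 * 2.54  # eight H100 SXM5 at the quoted on-demand rate

print(round(spheron_8x_h100, 2))                             # 20.32
print(round(percent_savings(aws_8x_h100, spheron_8x_h100)))  # 63
```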
Spot pricing note: world model training and non-real-time batch inference jobs are batchable and checkpointable. These are good candidates for spot instances. Spot H200 on Spheron runs significantly below the on-demand rate, making it the right choice for large synthetic data generation pipelines where occasional preemption is acceptable.
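One way to make a batch generation job tolerate spot preemption is to record each finished clip in a manifest and skip completed work on restart. The following is a minimal sketch of that pattern, not a Spheron or Cosmos API: `run_resumable` and the `generate` callback are hypothetical names, and `generate` stands in for whatever actually invokes the model.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_resumable(prompts: Iterable[str], manifest_path: Path,
                  generate: Callable[[str], str]) -> Dict[str, str]:
    """Generate one clip per prompt, skipping prompts already in the manifest."""
    if manifest_path.exists():
        done = json.loads(manifest_path.read_text())
    else:
        done = {}

    for prompt in prompts:
        if prompt in done:
            continue  # finished before an earlier preemption
        done[prompt] = generate(prompt)  # returns the output file path
        # Write to a temp file and rename it over the manifest, so an
        # interruption mid-write cannot leave a truncated manifest behind.
        tmp_path = manifest_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(done, indent=2))
        tmp_path.replace(manifest_path)
    return done
```

Keeping the manifest on storage that survives the instance (for example a mounted volume) means a replacement spot instance can rerun the same command and continue where the previous one stopped.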
For broader AI inference cost economics including cost-per-million-token comparisons across GPU tiers, see the AI inference cost economics 2026 guide. For live rate comparison across all GPU tiers, check current GPU pricing.
Pricing fluctuates based on GPU availability. The prices above are as of 17 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Use Cases Driving GPU Demand for World Models in 2026
Robotics simulation
Physics simulators like Genesis give you accurate trajectories fast - Genesis reports 43M FPS on an RTX 4090 for a simple Franka arm scene. But they produce no visual realism. World models fill that gap by generating photorealistic synthetic frames from physics-accurate scene states. Teams combine Genesis for fast policy search with Cosmos-Transfer for visual domain adaptation before final deployment. See the Genesis Physics Engine deploy guide for the simulation setup.
Synthetic training data pipelines
NVIDIA's Physical AI Data Factory Blueprint connects Cosmos to Omniverse for scene authoring and Isaac Sim for physics validation. The pipeline generates photorealistic video from text-prompted environments, then annotates it with depth maps and semantic segmentation. For the full Cosmos synthetic data pipeline, see the NVIDIA Cosmos World Foundation Models guide.
3D and game content generation
World models are increasingly used alongside 3D scene reconstruction to accelerate game level design and virtual environment authoring. For teams integrating 3DGS scene captures into generative world model pipelines, see the 3D Gaussian Splatting on GPU cloud guide.
Autonomous vehicle simulation
AV teams use world models to generate rare edge cases - occluded pedestrians, unusual lighting conditions, adversarial scenarios - at a fraction of the cost of real-world data collection. A dataset of 10,000 edge-case clips that would cost $500,000 to collect in the real world costs roughly $15,000-20,000 in GPU-hours on self-hosted GPU cloud.
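Put in per-clip terms, the comparison looks like this. The sketch uses only the dollar figures and clip count from the paragraph above; the names are illustrative.

```python
def cost_per_clip(total_usd: float, clips: int) -> float:
    """Average cost of one clip in a dataset."""
    return total_usd / clips


CLIPS = 10_000
REAL_WORLD_USD = 500_000
SYNTHETIC_USD_RANGE = (15_000, 20_000)  # GPU-hour estimate from the text

print(cost_per_clip(REAL_WORLD_USD, CLIPS))  # 50.0
for synthetic_usd in SYNTHETIC_USD_RANGE:
    per_clip = cost_per_clip(synthetic_usd, CLIPS)
    ratio = REAL_WORLD_USD / synthetic_usd
    print(per_clip, round(ratio, 1))
# 1.5 33.3
# 2.0 25.0
```

That works out to roughly $1.50-2.00 per synthetic clip against about $50 per collected clip, a 25-33x difference under the stated estimates.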
Practical Path: Experimenting with Open World Models on Spheron
For teams starting out with Cosmos-Predict, the most direct path is:
- Provision an H100 PCIe 80GB instance for 7B model experiments, or H200 SXM5 for 14B and 720p work.
- Authenticate with NGC (ngc.nvidia.com) and generate an API key.
- Accept the Cosmos model license on Hugging Face for the model variant you want.
- Pull the NIM container and run inference.
# Pull the Cosmos-Predict 7B NIM container
docker pull nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0
# Run inference with a scene prompt
docker run --gpus all \
-v /path/to/weights:/workspace/weights \
-v /path/to/outputs:/workspace/outputs \
nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0 \
--prompt "A robot arm picking up a red cube from a warehouse shelf" \
--resolution 480p \
--num-frames 60 \
--output /workspace/outputs/scene.mp4

For instance provisioning documentation including CUDA setup and storage configuration, see docs.spheron.ai. For GPU tier selection and current availability, see the GPU pricing page. For a model-agnostic VRAM reference covering more than 50 models across precision levels, see the GPU requirements cheat sheet for 2026.
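For batch generation, the docker run invocation shown earlier can be templated over a list of prompts. The sketch below only builds and prints the commands. The image tag, mount paths, and flags are copied from the command in this post and have not been verified against the container's actual interface; `build_command` and the prompt list are hypothetical.

```python
import shlex
from typing import List

# Image tag copied from the command in this post (not independently verified).
IMAGE = "nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0"


def build_command(prompt: str, output_name: str,
                  resolution: str = "480p", num_frames: int = 60) -> List[str]:
    """Build the docker argv for one prompt, mirroring the flags shown above."""
    return [
        "docker", "run", "--gpus", "all",
        "-v", "/path/to/weights:/workspace/weights",
        "-v", "/path/to/outputs:/workspace/outputs",
        IMAGE,
        "--prompt", prompt,
        "--resolution", resolution,
        "--num-frames", str(num_frames),
        "--output", f"/workspace/outputs/{output_name}",
    ]


prompts = [
    "A robot arm picking up a red cube from a warehouse shelf",
    "A forklift reversing out of a loading bay at dusk",
]

for index, prompt in enumerate(prompts):
    command = build_command(prompt, f"scene_{index:03d}.mp4")
    # shlex.join quotes the prompt safely for copy-pasting into a shell.
    print(shlex.join(command))
```

Passing the same list to `subprocess.run(command, check=True)` instead of printing it would execute each job in sequence, avoiding shell quoting problems with prompts that contain spaces or quotes.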
World models do not run lean on hyperscaler pricing. Spheron's H200 SXM5 and B200 SXM6 instances give teams experimenting with open world models bare-metal bandwidth at GPU cloud rates - no reserved contracts, no platform markup.
Quick Setup Guide
For 480p batch inference (Cosmos-Predict 7B), a single H100 PCIe 80GB ($2.01/hr) is sufficient. For 720p near-real-time generation, use an H200 SXM5 ($4.84/hr); true real-time 720p is not achievable on a single GPU today. For 1080p or multi-stream world model serving, use a B200 SXM6 ($7.41/hr).
Log in to app.spheron.ai, select H200 SXM5 or B200 SXM6, choose on-demand for low-latency work or spot for batch generation jobs, and deploy with an Ubuntu 22.04 image with CUDA 12.4. Verify GPU visibility with nvidia-smi before pulling model weights.
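The nvidia-smi check can also be scripted so a provisioning job fails early if the expected GPU is missing. This sketch uses nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options; the helper names are illustrative, and the 80 GB threshold mentioned afterwards is an assumption based on the VRAM figures in this post.

```python
import subprocess
from typing import List, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpus(csv_text: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so the name is kept whole.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory_mib)))
    return gpus


def visible_gpus() -> List[Tuple[str, int]]:
    """Run nvidia-smi and return the GPUs the driver can see."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)
```

Calling `visible_gpus()` on the instance and checking that each reported memory size is at least about 80,000 MiB is a cheap guard before starting a multi-gigabyte weight download.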
For NVIDIA Cosmos, authenticate with NGC (ngc.nvidia.com), pull the Cosmos NIM container with docker pull nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0, download model weights via huggingface-cli, and launch with GPU passthrough. For the 14B variant, configure tensor parallelism across two GPUs with --tensor-parallel-size 2.
Frequently Asked Questions
What GPUs does Genie 3 run on, and what do you need to self-host a comparable model?
Genie 3 is a closed model available in limited preview via Project Genie (Google Labs) to Google AI Ultra subscribers. Internally, models of this type run on multi-node clusters of H100 or H200 GPUs with high-bandwidth NVLink interconnects. Self-hosting a comparable open world model like Cosmos-Predict 7B requires a minimum of one H100 80GB GPU; the 14B variant needs two H100s or a single H200 141GB.
How does world model inference differ from LLM inference?
LLM inference generates token sequences at 100-500 tokens per second and is primarily memory-bandwidth-bound. World model inference must generate spatially consistent video frames, where each 720p frame contains thousands of spatial patch tokens. At 720p/24fps, that translates to roughly 64,800 spatial tokens per second (about 2,700 tokens per frame), with attention scaling quadratically across both frame count and resolution. The result is 8-32x more compute than a comparably-sized LLM.
Which world models can you self-host today?
NVIDIA Cosmos-Predict (Apache 2.0 source, NVIDIA Open Model License for weights) is the primary self-hostable option. UniSim is available for research use. Genie 3, Marble, and Sora are closed API-only models. Cosmos-Predict 7B runs on a single H100 80GB; the 14B variant needs an H200 or two H100s.
How much does world model video generation cost?
At 720p/24fps with Cosmos-Predict 7B, a single H200 SXM5 ($4.84/hr on Spheron) can generate approximately 1.5 minutes of video per GPU-hour for diffusion-based world models (0.6 fps wall-clock, which means each minute of 24fps video takes about 40 minutes of GPU time). Autoregressive models (Genie 3-style architectures) generate faster per step but at lower quality. For non-real-time batch jobs, spot pricing on H200 runs significantly lower.
What does bandwidth-bound inference mean for world models?
Bandwidth-bound inference occurs when the GPU's memory bandwidth, not its compute throughput, is the bottleneck. For world models, spatial attention over high-resolution frames requires moving large activation tensors repeatedly across HBM. The H100 SXM5 provides 3.35 TB/s of HBM3 bandwidth; the H200 provides 4.8 TB/s; the B200 provides 8 TB/s. These differences translate directly to throughput at 720p and above.
