GPU Infrastructure Behind World Models: What Powers Genie 3, Marble, and Real-Time Interactive AI (2026 Guide)

World models need 8-32x the GPU compute of a comparably-sized LLM, and the bandwidth requirements flip inference from compute-bound to memory-bound at every resolution above 480p. This is not a marginal difference in hardware needs - it fundamentally changes which GPU tier makes sense and why HBM bandwidth matters more than raw CUDA core count. This post covers what's actually running these models, what the GPU requirements look like for open alternatives, and what it costs per frame at each tier. For the equivalent breakdown for coding tools and agent workloads, see the GPU infrastructure guide for AI coding tools and the GPU infrastructure guide for AI agents.

What World Models Are and Why 2026 Is Their Breakout Year

A world model is a generative model of physical environments. Rather than producing a text token sequence, it produces video frames or 3D state transitions from action inputs or text prompts. The defining property is temporal consistency: each output frame must be physically plausible given every preceding frame. This is the part that destroys GPU budgets.

Three models define the current landscape:

Genie 3 (Google DeepMind) builds on the Genie family's latent action model approach, learning a compressed action space from video without labeled action data, then generating interactive environments from that learned space. The result is video that responds coherently to user inputs in real time. Genie 3 is closed API only as of 2026.

Marble (World Labs) targets 3D environment generation and navigation tasks, launched commercially in November 2025 with freemium and paid tiers. It is not self-hostable.

NVIDIA Cosmos is the most important model for practitioners in 2026 because it is actually self-hostable. It generates photorealistic synthetic video of physical environments for robotics and autonomous vehicle training pipelines. Full deployment details are in the NVIDIA Cosmos GPU cloud guide.

Why 2026 is the breakout year comes down to three converging trends. Robotics sim-to-real is moving from physics-only simulation to photorealistic world models because visual realism closes the policy transfer gap. Autonomous vehicle teams are using world models to generate rare edge cases at scale instead of waiting to encounter them in real-world driving logs. Game studios are running world model inference for procedural level generation. Each of these demands sustained, high-throughput GPU compute at resolutions that exceed what LLM-era GPU stacks were designed for.

For the robotics simulation context, see Deploy Genesis Physics Engine on GPU Cloud.

How World-Model Inference Differs from LLM Inference

LLM inference generates a token sequence. At each step, the model attends over a context window of 1,000-100,000 tokens and produces one next token. A well-optimized H100 running a 7B model in FP16 produces 100-500 tokens per second. Memory bandwidth is the bottleneck at small batch sizes because weight loading dominates.

World model inference at 720p/24fps looks fundamentally different. At 720p with standard patch sizes (16x16), each frame contains roughly 2,700 spatial tokens (a 960x720 frame is 60 x 45 patches; a 16:9 1280x720 frame is 80 x 45, or 3,600). At 24fps, the model must generate 64,800 spatial tokens per second, sustained, against the 100-500 tokens per second of the LLM case above. Even an autoregressive model that generates one frame per step needs far more throughput than any LLM inference scenario. Diffusion-based world models (Cosmos-style) iteratively denoise each frame over 20-50 steps, so their compute requirement is higher still.
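The token arithmetic above is easy to check. The sketch below assumes non-overlapping 16x16 patches applied directly to pixels and ignores any latent-space compression a real model applies first; the function names are illustrative, not part of any library.

```python
def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Number of non-overlapping patch x patch tokens in one frame."""
    return (width // patch) * (height // patch)


def tokens_per_second(width: int, height: int, fps: int, patch: int = 16) -> int:
    """Spatial tokens that must be produced each second at a given frame rate."""
    return patch_tokens(width, height, patch) * fps


# Two common "720p" frame sizes; the 2,700 figure matches the 4:3 case.
for label, (w, h) in {"960x720 (4:3)": (960, 720), "1280x720 (16:9)": (1280, 720)}.items():
    print(f"{label}: {patch_tokens(w, h)} tokens/frame, "
          f"{tokens_per_second(w, h, 24)} tokens/s at 24fps")
```

Either way, the sustained token rate is two orders of magnitude above a single-stream LLM decode.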

The critical difference is attention scaling. LLM attention scales quadratically with context length but that context is flat. World model attention must scale across both spatial extent (H×W patches per frame) and temporal extent (all prior frames for temporal consistency). A 720p diffusion world model at 24fps with 16 frames of temporal context is attending over a token grid roughly 3x the size of a standard LLM context window, with spatial structure that prevents the KV cache savings that make long-context LLMs tractable.
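To put rough numbers on that comparison, here is a minimal sketch that counts tokens in the spatiotemporal window and the query-key pairs that dense self-attention would score. The 16,384-token LLM context is an assumption chosen for comparison only; real world models often use factorized or windowed attention, so treat this as an upper bound.

```python
def window_tokens(tokens_per_frame: int, context_frames: int) -> int:
    """Total tokens in a temporal window of full frames."""
    return tokens_per_frame * context_frames


def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over n tokens."""
    return n_tokens * n_tokens


TOKENS_PER_FRAME = 2_700  # figure from the text
CONTEXT_FRAMES = 16       # figure from the text
LLM_CONTEXT = 16_384      # assumption: a typical LLM context, for comparison

video = window_tokens(TOKENS_PER_FRAME, CONTEXT_FRAMES)
print(f"window tokens: {video}")
print(f"size vs LLM context: {video / LLM_CONTEXT:.1f}x")
print(f"attention pairs vs LLM context: "
      f"{attention_pairs(video) / attention_pairs(LLM_CONTEXT):.1f}x")
```

Because the cost is quadratic, a window about 2.6x larger in tokens needs roughly 7x the attention work.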

The result: 8-32x more compute than LLM inference at equivalent "output rate". The upper end of that range applies to high-resolution diffusion world models; the lower end applies to autoregressive models at 480p.

Workload	VRAM	HBM Bandwidth Needed	H100 SXM5 Throughput	H200 SXM5 Throughput
7B LLM (FP16, batch 1)	14 GB	~0.5 TB/s	~200 tokens/sec	~280 tokens/sec
Video gen 720p (Wan 2.1)	65-80 GB	~2 TB/s	1 frame/10 min	1 frame/7 min
World model 720p (Cosmos-7B)	80 GB	~2.5-3 TB/s	1 frame/2-3 sec	1 frame/~1.7 sec
World model 1080p (Cosmos-14B)	160+ GB	~4-5 TB/s	OOM	2x H200 required

HBM bandwidth is the reason H200 and B200 are the relevant GPU choices for production world model inference above 480p. The H100 SXM5 delivers 3.35 TB/s of HBM3 bandwidth. The H200 SXM5 delivers 4.8 TB/s of HBM3e - a 43% improvement on the same silicon. The B200 delivers 8 TB/s. At world model token rates, that bandwidth difference maps almost linearly to throughput. For a detailed breakdown of HBM generations and what each means for inference, see the HBM3e vs HBM4 vs HBM4e LLM inference guide.
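The bandwidth ratios can be reproduced in a few lines. The sketch below uses the bandwidth figures quoted above and assumes throughput scales linearly with HBM bandwidth, which is only an approximation for bandwidth-bound workloads; the function names are illustrative.

```python
# HBM bandwidth in TB/s, as quoted in the text.
HBM_TBPS = {"H100 SXM5": 3.35, "H200 SXM5": 4.8, "B200": 8.0}
BASELINE = "H100 SXM5"


def bandwidth_ratio(gpu: str, baseline: str = BASELINE) -> float:
    """Bandwidth of `gpu` relative to the baseline GPU."""
    return HBM_TBPS[gpu] / HBM_TBPS[baseline]


def projected_throughput(baseline_value: float, gpu: str, baseline: str = BASELINE) -> float:
    """Scale a baseline throughput linearly with bandwidth (an assumption)."""
    return baseline_value * bandwidth_ratio(gpu, baseline)


for gpu in HBM_TBPS:
    print(f"{gpu}: {bandwidth_ratio(gpu):.2f}x bandwidth, "
          f"~{projected_throughput(200, gpu):.0f} tokens/s projected for the 7B LLM row")
```

The H200 projection of about 287 tokens/sec lands close to the ~280 tokens/sec in the table, which is the sense in which bandwidth maps "almost linearly" to throughput.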

For the video generation context (single-clip generation rather than temporally consistent world models), see the AI video generation GPU guide.

GPU Requirements for Real-Time World Model Generation

"Real-time" for a world model means generating one output frame faster than the frame period of the target frame rate. At 24fps, that is about 41.7 ms per frame. At 30fps, it is about 33.3 ms. No current open world model running on a single GPU achieves this for diffusion-based generation at 720p. Real-time performance at 720p currently requires multi-GPU inference or dedicated inference hardware from NVIDIA's Blackwell generation.

For practical deployment, the relevant threshold is near-real-time (under 10 seconds per frame) for interactive use, and throughput optimization for batch generation pipelines.
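As a quick sketch of how far the near-real-time threshold sits from true real time, the helper names below are illustrative and the only inputs are the frame rates and the 10-second threshold mentioned above.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available to generate one frame at the target frame rate."""
    return 1000.0 / target_fps


def realtime_gap(target_fps: float, seconds_per_frame: float) -> float:
    """Speedup needed to go from `seconds_per_frame` to the target rate."""
    return target_fps * seconds_per_frame


for fps in (24, 30):
    print(f"{fps}fps budget: {frame_budget_ms(fps):.1f} ms/frame")

# A generator at the 10 s/frame near-real-time threshold, targeting 24fps.
print(f"speedup needed: {realtime_gap(24, 10):.0f}x")
```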

VRAM requirements by resolution:

Resolution	Cosmos-Predict 7B	Cosmos-Predict 14B	Notes
480p	~45 GB	~85 GB	7B fits H100 PCIe 80GB comfortably
720p	~70-80 GB	~140-160 GB	7B fits H100 SXM5 80GB at edge; 14B needs H200
1080p	OOM on H100/H200; B200 SXM6 (192GB) may fit	OOM (single GPU)	7B: 1x B200 SXM6 or 2x H100/H200; 14B: 2x H200+
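The table can be turned into a simple fit check. This is a sketch under stated assumptions: capacities are the nominal 80 GB, 141 GB, and 192 GB figures, and the 90% headroom rule for calling a fit comfortable is an arbitrary choice made here, not a published guideline.

```python
GPU_VRAM_GB = {"H100": 80, "H200": 141, "B200": 192}
HEADROOM = 0.9  # assumption: keep 10% free for activations and fragmentation


def fit_status(need_low_gb: float, need_high_gb: float, capacity_gb: float,
               headroom: float = HEADROOM) -> str:
    """Classify a VRAM requirement range against one GPU's capacity."""
    if need_high_gb <= capacity_gb * headroom:
        return "ok"
    if need_low_gb <= capacity_gb:
        return "edge"  # fits only at the low end of the estimate
    return "oom"


def placement(need_low_gb: float, need_high_gb: float) -> dict:
    """Fit status on each single GPU for a requirement range in GB."""
    return {gpu: fit_status(need_low_gb, need_high_gb, cap)
            for gpu, cap in GPU_VRAM_GB.items()}


print("7B @ 480p :", placement(45, 45))
print("7B @ 720p :", placement(70, 80))
print("14B @ 720p:", placement(140, 160))
```

With these assumptions the 7B model at 720p comes out as "edge" on an H100 and the 14B model as "edge" on an H200, matching the notes column.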

GPU tier vs. throughput and cost:

GPU	On-Demand $/hr	Cosmos-7B 720p fps	Cosmos-14B 720p fps	Cost/min of video
H100 PCIe	$2.01	~0.3 fps (batch only)	OOM	~$2.68/min
H100 SXM5	$2.54	~0.4 fps	OOM	~$2.54/min
H200 SXM5	$4.84	~0.6 fps	~0.3 fps (est.)	~$3.23/min
B200 SXM6	$7.41	~1.1 fps	~0.5 fps (est.)	~$2.69/min

fps here means generated video frames per second of wall-clock time, not frames per second in the output video. The B200's higher cost per hour produces a similar or better cost per minute of generated video compared to H200, because it generates more frames per GPU-hour.
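The cost-per-minute column follows directly from the hourly price and the wall-clock frame rate. A minimal sketch, using the table's prices and fps estimates as inputs (the function name is illustrative):

```python
def cost_per_video_minute(price_per_hour: float, wallclock_fps: float,
                          video_fps: int = 24) -> float:
    """Dollar cost of generating one minute of output video."""
    frames = 60 * video_fps                     # frames in one minute of video
    gpu_hours = frames / wallclock_fps / 3600   # wall-clock time to generate them
    return price_per_hour * gpu_hours


TIERS = {
    "H100 PCIe": (2.01, 0.3),
    "H100 SXM5": (2.54, 0.4),
    "H200 SXM5": (4.84, 0.6),
    "B200 SXM6": (7.41, 1.1),
}

for gpu, (price, fps) in TIERS.items():
    print(f"{gpu}: ${cost_per_video_minute(price, fps):.2f} per minute of 24fps video")
```

Swapping in your own measured fps is the useful part: the per-minute cost is inversely proportional to throughput, so small benchmark differences move the ranking.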

Autoregressive world models (Genie 3-style architectures) are faster per frame because the 20-50 denoising iterations are replaced by a single autoregressive forward pass. The tradeoff is lower visual quality and weaker temporal consistency at longer horizons. For most current open model deployments, diffusion-based architectures like Cosmos-Predict produce better results.

Both H200 SXM5 and B200 instances on Spheron support on-demand provisioning with per-minute billing.

Open vs Closed World Models: What You Can Actually Run Today

The gap between open and closed world models is wider in 2026 than the equivalent gap in LLMs.

Closed models (API only):

Genie 3 (Google DeepMind) - interactive world generation via Google's API
Marble (World Labs) - 3D environment generation, commercial (not self-hostable)
Sora (OpenAI) - text-to-video, no self-hosting

Open and self-hostable:

NVIDIA Cosmos-Predict - Apache 2.0 source, NVIDIA Open Model License for weights. Distributed on Hugging Face and NGC. Commercial use permitted with attribution. This is the primary self-hostable option.
UniSim - open research, not production-ready

Model	License	Self-Hostable	Minimum GPU	Location
Genie 3	Closed	No	N/A	Google API
Marble	Closed (commercial)	No	N/A	World Labs
Sora	Closed	No	N/A	OpenAI API
Cosmos-Predict 7B	NVIDIA Open Model	Yes	1x H100 80GB	HF/NGC
Cosmos-Predict 14B	NVIDIA Open Model	Yes	2x H100 or H200	HF/NGC
Cosmos-3 (two-tower MoT)	NVIDIA Open Model	Yes	1x H200	HF/NGC
UniSim	Research	Limited	H100 80GB	Research release

The Cosmos-Predict 7B case is the most accessible entry point. A single H100 GPU rental at $2.01/hr (PCIe) or $2.54/hr (SXM5) handles 480p batch generation. For 720p inference without OOM risk, step up to H200.

Fine-tuning: Cosmos supports LoRA fine-tuning on custom environments. Minimum requirement is 2x H100 or a single H200. For fine-tuning on custom robotics environments, the approach mirrors the Cosmos-Transfer pipeline used in synthetic data generation - start with a base scene prompt and adapt the model to your target visual domain.

For the full Cosmos deployment guide including Docker setup, NGC authentication, and batch inference scripts, see Deploy NVIDIA Cosmos World Foundation Models on GPU Cloud. For the Cosmos 3 two-tower architecture with action trajectory outputs and per-modality expert heads, see the Cosmos 3 two-tower deployment guide.

Cost Economics of World-Model Inference

The cost-per-minute calculation depends on what you count as "generated video". For diffusion-based world models, the GPU generates frames during denoising iterations - the output is one polished frame, not a frame per iteration. At 720p with Cosmos-Predict 7B running on H200 SXM5, generating 1 minute of 24fps video requires 1,440 frames. At approximately 0.6 generated frames per second of wall-clock time, that takes about 40 minutes of GPU time.

GPU	On-Demand $/hr	Minutes of 720p/24fps per GPU-hour	Cost per minute of generated video
H100 PCIe	$2.01	~0.75 min	~$2.68/min
H100 SXM5	$2.54	~1.0 min	~$2.54/min
H200 SXM5	$4.84	~1.5 min	~$3.23/min
B200 SXM6	$7.41	~2.75 min	~$2.69/min

_Note: throughput estimates are for Cosmos-Predict 7B at 720p. Actual values vary with scene complexity, denoising step count, and prompt conditioning._
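The throughput column and the 40-minute figure come from the same conversion between wall-clock fps and output video length. A sketch, assuming the wall-clock fps estimates used elsewhere in this post (0.3, 0.4, 0.6, and 1.1 fps):

```python
def video_minutes_per_gpu_hour(wallclock_fps: float, video_fps: int = 24) -> float:
    """Minutes of output video produced in one GPU-hour."""
    frames_per_hour = wallclock_fps * 3600
    return frames_per_hour / (video_fps * 60)


def gpu_minutes_per_video_minute(wallclock_fps: float, video_fps: int = 24) -> float:
    """GPU wall-clock minutes needed for one minute of output video."""
    return 60 / video_minutes_per_gpu_hour(wallclock_fps, video_fps)


for gpu, fps in {"H100 PCIe": 0.3, "H100 SXM5": 0.4,
                 "H200 SXM5": 0.6, "B200 SXM6": 1.1}.items():
    print(f"{gpu}: {video_minutes_per_gpu_hour(fps):.2f} video min per GPU-hour, "
          f"{gpu_minutes_per_video_minute(fps):.0f} GPU min per video min")
```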

Compare this to AWS p5.48xlarge (8x H100 SXM5): the instance runs at approximately $55/hr on-demand. Eight H100 SXM5s on Spheron cost $20.32/hr on-demand - a ~63% reduction with bare-metal performance. For parallel generation jobs that can run across multiple independent GPU instances, the Spheron approach scales linearly without the per-instance markup that hyperscaler managed services add.
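The comparison is simple arithmetic on the two hourly rates quoted above. The sketch below takes the approximate $55/hr figure as given; the function names are illustrative.

```python
def fleet_cost(price_per_gpu_hour: float, gpus: int) -> float:
    """Hourly cost of `gpus` independent single-GPU instances."""
    return price_per_gpu_hour * gpus


def savings_fraction(baseline_per_hour: float, alternative_per_hour: float) -> float:
    """Fractional reduction of the alternative relative to the baseline."""
    return 1 - alternative_per_hour / baseline_per_hour


spheron = fleet_cost(2.54, 8)   # eight H100 SXM5 at the on-demand rate above
aws = 55.0                      # approximate p5.48xlarge on-demand rate from the text
print(f"8x H100 SXM5: ${spheron:.2f}/hr, {savings_fraction(aws, spheron):.0%} below ${aws:.0f}/hr")
```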

Spot pricing note: world model training and non-real-time batch inference jobs are batchable and checkpointable. These are good candidates for spot instances. Spot H200 on Spheron runs significantly below the on-demand rate, making it the right choice for large synthetic data generation pipelines where occasional preemption is acceptable.
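What "checkpointable" means in practice is that a preempted job can restart without redoing finished work. Here is a minimal sketch of that pattern: a batch loop that records completed prompts in a JSON file and skips them on restart. The `generate` callable is a hypothetical stand-in for whatever inference call you use, and the file layout is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_batch(prompts: Iterable[str], generate: Callable[[str], None],
              checkpoint: Path) -> List[str]:
    """Generate each prompt once, recording progress so a restart can resume."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    newly_done = []
    for prompt in prompts:
        if prompt in done:
            continue  # finished before a previous preemption
        generate(prompt)
        done.add(prompt)
        newly_done.append(prompt)
        # Write to a temp file, then rename, so a kill mid-write cannot corrupt it.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(done)))
        tmp.replace(checkpoint)
    return newly_done


def demo():
    """Simulate one preemption followed by a resumed run."""
    state = {"preempted": False}

    def fake_generate(prompt: str) -> None:
        # Hypothetical stand-in for a real inference call.
        if prompt == "scene-c" and not state["preempted"]:
            state["preempted"] = True
            raise RuntimeError("spot instance preempted")

    with tempfile.TemporaryDirectory() as tmp_dir:
        ckpt = Path(tmp_dir) / "done.json"
        prompts = ["scene-a", "scene-b", "scene-c"]
        try:
            run_batch(prompts, fake_generate, ckpt)
        except RuntimeError:
            pass
        saved = json.loads(ckpt.read_text())
        resumed = run_batch(prompts, fake_generate, ckpt)
    return saved, resumed


print(demo())
```

On a real spot instance the checkpoint file would live on persistent storage rather than in a temporary directory, so it survives the preemption.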

For broader AI inference cost economics including cost-per-million-token comparisons across GPU tiers, see the AI inference cost economics 2026 guide. For live rate comparison across all GPU tiers, check current GPU pricing.

Pricing fluctuates based on GPU availability. The prices above are as of 17 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Use Cases Driving GPU Demand for World Models in 2026

Robotics simulation

Physics simulators like Genesis give you accurate trajectories fast - Genesis reports 43M FPS on an RTX 4090 for a simple Franka arm scene. But they produce no visual realism. World models fill that gap by generating photorealistic synthetic frames from physics-accurate scene states. Teams combine Genesis for fast policy search with Cosmos-Transfer for visual domain adaptation before final deployment. See the Genesis Physics Engine deploy guide for the simulation setup.

Synthetic training data pipelines

NVIDIA's Physical AI Data Factory Blueprint connects Cosmos to Omniverse for scene authoring and Isaac Sim for physics validation. The pipeline generates photorealistic video from text-prompted environments, then annotates it with depth maps and semantic segmentation. For the full Cosmos synthetic data pipeline, see the NVIDIA Cosmos World Foundation Models guide.

3D and game content generation

World models are increasingly used alongside 3D scene reconstruction to accelerate game level design and virtual environment authoring. For teams integrating 3DGS scene captures into generative world model pipelines, see the 3D Gaussian Splatting on GPU cloud guide.

Autonomous vehicle simulation

AV teams use world models to generate rare edge cases - occluded pedestrians, unusual lighting conditions, adversarial scenarios - at a fraction of the cost of real-world data collection. A dataset of 10,000 edge-case clips that would cost $500,000 to collect in the real world costs roughly $15,000-20,000 in GPU-hours on self-hosted GPU cloud.
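A back-of-envelope check on those figures: the sketch below derives the per-clip cost and the clip length implied by a given cost per generated minute. The $2.54/min rate is the H100 SXM5 estimate from the cost table; the implied clip length is derived here and is not a figure stated by the source.

```python
def cost_per_clip(total_cost: float, clips: int) -> float:
    """Average cost of one clip in a dataset."""
    return total_cost / clips


def implied_clip_seconds(clip_cost: float, cost_per_video_minute: float) -> float:
    """Clip length (seconds) that a per-clip budget buys at a per-minute rate."""
    return clip_cost / cost_per_video_minute * 60


CLIPS = 10_000
REAL_WORLD = 500_000
GENERATED = (15_000, 20_000)
RATE = 2.54  # $/min of 720p video, H100 SXM5 estimate from the cost table

print(f"real-world: ${cost_per_clip(REAL_WORLD, CLIPS):.2f} per clip")
for total in GENERATED:
    per_clip = cost_per_clip(total, CLIPS)
    print(f"generated at ${total:,}: ${per_clip:.2f} per clip, "
          f"~{implied_clip_seconds(per_clip, RATE):.0f} s of video each, "
          f"{1 - total / REAL_WORLD:.0%} cheaper")
```

The estimate is therefore consistent with clips of roughly 35-47 seconds at 720p; longer clips or higher resolutions would raise the total.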

Practical Path: Experimenting with Open World Models on Spheron

For teams starting out with Cosmos-Predict, the most direct path is:

Provision an H100 PCIe 80GB instance for 7B model experiments, or H200 SXM5 for 14B and 720p work.
Authenticate with NGC (ngc.nvidia.com) and generate an API key.
Accept the Cosmos model license on Hugging Face for the model variant you want.
Pull the NIM container and run inference.

```bash
# Pull the Cosmos-Predict 7B NIM container
docker pull nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0

# Run inference with a scene prompt
docker run --gpus all \
  -v /path/to/weights:/workspace/weights \
  -v /path/to/outputs:/workspace/outputs \
  nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0 \
  --prompt "A robot arm picking up a red cube from a warehouse shelf" \
  --resolution 480p \
  --num-frames 60 \
  --output /workspace/outputs/scene.mp4
```

For instance provisioning documentation including CUDA setup and storage configuration, see docs.spheron.ai. For GPU tier selection and current availability, see the GPU pricing page. For a model-agnostic VRAM reference covering more than 50 models across precision levels, see the GPU requirements cheat sheet for 2026.

World models are expensive to run at hyperscaler prices. Spheron's H200 SXM5 and B200 SXM6 instances give teams experimenting with open world models bare-metal bandwidth at GPU cloud rates - no reserved contracts, no platform markup.


Quick Setup Guide

Choose a GPU tier based on resolution and latency target
For 480p batch inference (Cosmos-Predict 7B), a single H100 PCIe 80GB ($2.01/hr) is sufficient. For 720p near-real-time generation, use an H200 SXM5 ($4.84/hr). For 1080p or multi-stream world model serving, use a B200 SXM6 ($7.41/hr).
Provision an H200 or B200 instance on Spheron
Log in to app.spheron.ai, select H200 SXM5 or B200 SXM6, choose on-demand for low-latency work or spot for batch generation jobs, and deploy with an Ubuntu 22.04 image with CUDA 12.4. Verify GPU visibility with nvidia-smi before pulling model weights.
Set up the inference container for your chosen open world model
For NVIDIA Cosmos, authenticate with NGC (ngc.nvidia.com), pull the Cosmos NIM container with docker pull nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:1.0.0, download model weights via huggingface-cli, and launch with GPU passthrough. For the 14B variant, configure tensor parallelism across two GPUs with --tensor-parallel-size 2.
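The GPU visibility check in the provisioning step can be scripted before pulling large weights. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; the function names are illustrative, and memory is reported in MiB, so an "80GB" card shows up as a little under 81,920.

```python
import subprocess
from typing import List, Tuple

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory_mib' lines into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem_mib)))
    return gpus


def visible_gpus() -> List[Tuple[str, int]]:
    """Query the driver; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)


def total_vram_gib(gpus: List[Tuple[str, int]]) -> float:
    """Sum of visible GPU memory in GiB."""
    return sum(mib for _, mib in gpus) / 1024
```

On the instance, `visible_gpus()` returns one tuple per GPU; an empty list or an exception means the driver or container runtime is not set up, which is worth catching before a multi-gigabyte download.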


Frequently Asked Questions

What GPU does Genie 3 need?

Genie 3 is a closed model available in limited preview via Project Genie (Google Labs) to Google AI Ultra subscribers. Internally, models of this type run on multi-node clusters of H100 or H200 GPUs with high-bandwidth NVLink interconnects. Self-hosting a comparable open world model like Cosmos-Predict 7B requires a minimum of one H100 80GB GPU; the 14B variant needs two H100s or a single H200 141GB.

How does world model inference differ from LLM inference?

LLM inference generates token sequences at 100-500 tokens per second and is primarily memory-bandwidth-bound. World model inference must generate spatially consistent video frames, where each 720p frame contains thousands of spatial patch tokens. At 720p/24fps, that translates to roughly 64,800 spatial tokens per second (about 2,700 tokens per frame), with attention scaling quadratically across both frame count and resolution. The result is 8-32x more compute than a comparably-sized LLM.

What open world models can I self-host in 2026?

NVIDIA Cosmos-Predict (Apache 2.0 source, NVIDIA Open Model License for weights) is the primary self-hostable option. UniSim is available for research use. Genie 3, Marble, and Sora are closed API-only models. Cosmos-Predict 7B runs on a single H100 80GB; the 14B variant needs an H200 or two H100s.

What does real-time 720p/24fps world model inference cost per hour?

At 720p/24fps with Cosmos-Predict 7B, a single H200 SXM5 ($4.84/hr on Spheron) can generate approximately 1.5 minutes of video per GPU-hour for diffusion-based world models (0.6 fps wall-clock, which means each minute of 24fps video takes about 40 minutes of GPU time). Autoregressive models (Genie 3-style architectures) generate faster per step but at lower quality. For non-real-time batch jobs, spot pricing on H200 runs significantly lower.

What is HBM bandwidth bound inference for world models?

Bandwidth-bound inference occurs when the GPU's memory bandwidth, not its compute throughput, is the bottleneck. For world models, spatial attention over high-resolution frames requires moving large activation tensors repeatedly across HBM. The H100 SXM5 provides 3.35 TB/s of HBM3 bandwidth; the H200 provides 4.8 TB/s; the B200 provides 8 TB/s. These differences translate directly to throughput at 720p and above.