Tutorial

Deploy Wan 2.5 on GPU Cloud: Production Video Generation Setup (2026)

Written by Mitrasish, Co-founder · Apr 28, 2026

Wan's 14B models at 720p require 65-80GB of VRAM. Consumer GPUs won't run them. If you want to self-host production-quality open-source video generation, you need datacenter hardware. This guide covers exactly that: GPU requirements for Wan 2.2 (the current publicly available version), step-by-step ComfyUI and diffusers setup on H100 and B200, benchmark numbers, and what each clip costs you. For teams upgrading from earlier setups, the Wan 2.1/2.2 deployment guide covers the shared ComfyUI infrastructure that Wan 2.2 builds on.

On Wan 2.5: Wan 2.5-Preview shipped in September 2025 as a multimodal audio-video model accessible only through Alibaba Cloud APIs. No public weights have been released through the Wan-AI GitHub or HuggingFace as of April 2026. This post covers self-hosted deployment using Wan 2.2 as the current production standard, with a full section on what Wan 2.5 changes and what the upgrade path looks like when weights ship.

What Changed in Wan 2.5 vs 2.1 and 2.2

The Wan model series has moved through three major architectural shifts since February 2025.

| Model | Architecture | Training Scale | Multimodal | VRAM (14B 720p) | Public Weights |
|---|---|---|---|---|---|
| Wan 2.1 | Dense transformer | Baseline | No | 65-80GB | Yes (Wan-AI/Wan2.1-T2V-14B) |
| Wan 2.2 | Mixture-of-Experts | +65.6% images, +83.2% videos | No | 65-80GB | Yes (Wan-AI/Wan2.2-T2V-A14B) |
| Wan 2.5-Preview | Unknown | Unknown | Yes (audio+video) | Unknown | No (API only) |

Wan 2.2 (July 2025) was a meaningful upgrade over 2.1. The architecture switch to MoE keeps active parameters at 14B (27B total) but applies separate expert networks for early and late denoising steps. The larger training dataset translates to better motion coherence, stronger instruction following, and fewer geometric artifacts. VRAM requirements stayed the same.

Wan 2.5-Preview introduced two things that 2.2 doesn't have: synchronized audio generation alongside video, and an agentic mode where the model can decompose multi-scene prompts into sub-tasks. The "Preview" label signals this is not a stable release. Alibaba has kept the weights proprietary, likely because the audio component involves licensed training data that complicates open distribution.

Wan 2.6 (December 2025) followed, also without public weights.

What this means for infrastructure: your Wan 2.2 ComfyUI setup is the foundation you'll run Wan 2.5 on when weights eventually ship. The upgrade is a checkpoint swap. Build the infrastructure now on Wan 2.2 and you're ready to flip to Wan 2.5 on day one of a public release.

For the Wan 2.1 setup guide and how the 2.1-to-2.2 migration works, see the Wan 2.1/2.2 deployment guide.

GPU Hardware Requirements for Wan 2.2/2.5

The VRAM math is driven by model weight size, activation memory, and attention overhead. Here's the breakdown:

Weight memory:

  • 14B parameters at FP16: ~28GB
  • 14B parameters at FP8: ~14GB
  • T5 text encoder: ~11GB (BF16)
  • VAE: ~0.5GB

Runtime overhead:

  • Activation memory during denoising: 15-20% of weight memory
  • Framework and CUDA context: ~2-4GB

The resolution amplifier is the attention mechanism. Going from 480p to 720p increases spatial token count by roughly 2.25x, and because the attention matrix grows quadratically with token count, VRAM climbs roughly 2-3x, more than the pixel increase alone would suggest.
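The memory arithmetic above can be folded into a quick estimator. This is illustrative only: the activation fraction and framework overhead are the rough rules of thumb from this section, and the resolution-dependent attention buffer is deliberately left out, which is why 720p figures in the table below run well above this baseline.

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     activation_frac: float = 0.18,
                     text_encoder_gb: float = 11.0, vae_gb: float = 0.5,
                     framework_gb: float = 3.0) -> float:
    """Back-of-envelope VRAM estimate using this section's rules of thumb.

    Excludes the quadratic attention buffer, which dominates at 720p.
    """
    weights = params_b * bytes_per_param        # 14B at FP16 (2 bytes/param) ~= 28GB
    activations = weights * activation_frac     # 15-20% of weight memory during denoising
    return weights + activations + text_encoder_gb + vae_gb + framework_gb

# 14B DiT at FP16, baseline before attention overhead:
print(round(estimate_vram_gb(14, 2.0), 1))
```

Halving `bytes_per_param` to 1.0 models FP8 weights, which is where the FP8 rows in the table get their headroom.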

| Precision | Resolution | Duration | VRAM Required | Min GPU |
|---|---|---|---|---|
| FP16 | 480p (832x480) | 5s | 55-65GB | H100 SXM5 |
| FP8 | 480p (832x480) | 5s | ~40-48GB | H100 PCIe |
| FP16 | 720p (1280x720) | 5s | 75-90GB | H200 SXM5 |
| FP8 | 720p (1280x720) | 5s | ~65-80GB | H100 PCIe (tight) |
| FP8 | 720p (1280x720) | 10s | 80GB+ | H200 SXM5 |

GPU selection guide:

  • H100 PCIe on Spheron (80GB, from $2.01/hr): The minimum for Wan 2.2 14B at 480p-720p with FP8. Tight VRAM margin at 720p. Prefer FP8 quantization; run nvidia-smi on the first job to confirm headroom before scaling. Single-GPU offers available.
  • H100 SXM5 on Spheron (80GB, ~$2.90/hr per GPU in 8-GPU bundles): Same VRAM as PCIe, but 3.35 TB/s memory bandwidth vs 2 TB/s cuts generation time by roughly 25% for attention-heavy video workloads. No spot pricing currently. More useful if you need multi-GPU setups.
  • B200 GPU rental on Spheron (192GB): The best GPU for Wan 2.2 production workloads. 192GB eliminates all VRAM margin concerns including 10-second 720p clips and FP16 precision. B200 SXM6 is currently available via spot pricing on Spheron at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).
  • H200 SXM5 on Spheron (141GB, from $9.76/hr per GPU): 141GB gives headroom for 720p 10-second clips and FP16 at 720p.

Multi-GPU note: Video generation on a single GPU does not benefit from tensor parallelism in standard ComfyUI or diffusers setups. One job runs on one GPU. Scale throughput by running multiple GPU instances in parallel, each processing an independent job. No NVLink or InfiniBand required for this pattern.
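The one-job-per-GPU pattern can be sketched with `CUDA_VISIBLE_DEVICES` pinning, one worker process per device. This is a sketch, and `run_generation.py` is a hypothetical entry point for whatever inference script you run:

```python
import os

def worker_env(gpu_index: int) -> dict:
    """Environment that pins one worker process to exactly one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

def launch_plan(num_gpus: int, script: str = "run_generation.py"):
    """One independent inference process per GPU: linear throughput scaling,
    no NVLink or InfiniBand coordination required."""
    return [(["python", script], worker_env(g)) for g in range(num_gpus)]

# Each (cmd, env) pair would be handed to subprocess.Popen(cmd, env=env).
plan = launch_plan(4)
```

Because each process only sees its own device, four H100s run four fully independent jobs with no cross-GPU communication.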

Step-by-Step: Deploy Wan 2.2 with ComfyUI on Spheron H100

This walkthrough uses ComfyUI with the WanVideoWrapper custom node. The ComfyUI on GPU cloud node-based interface lets you build reusable workflows, chain T2V and I2V pipelines, and iterate faster than CLI-only approaches.

Step 1: Launch an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance. For 720p 14B generation, H100 PCIe at $2.01/hr covers most use cases. Choose Ubuntu 22.04. Do not expose port 8188 in your network settings. ComfyUI has no built-in authentication; you'll access it through an SSH tunnel instead.

Step 2: Deploy ComfyUI via Docker

SSH into the instance, then run:

```bash
# latest-cuda is a floating tag. For supply-chain assurance, pin by digest:
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# then substitute the sha256 digest reference below.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

The -v flags persist model weights and outputs across container restarts. --ipc=host is required for PyTorch shared memory. -p 127.0.0.1:8188:8188 binds ComfyUI to localhost only.

Step 3: Install ComfyUI-WanVideoWrapper

Enter the running container:

```bash
docker exec -it comfyui bash
```

Navigate to custom nodes and clone the wrapper:

```bash
cd /opt/ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt
```

Exit and restart to register the new nodes:

```bash
exit
docker restart comfyui
```

If you hit dependency errors, check the WanVideoWrapper GitHub for current installation notes. The requirements file updates frequently.

Step 4: Download Wan 2.2 model weights

On the host (not inside the container), download into the mounted model directory:

```bash
pip install huggingface_hub

# Wan 2.2 14B text-to-video (~69GB: DiT ~57GB + T5 ~11GB + VAE ~0.5GB)
# Download takes 30-90 minutes depending on connection speed
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir ~/comfyui-models/wan-t2v-14b

# For image-to-video:
# huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
#   --local-dir ~/comfyui-models/wan-i2v-14b

# For the 1.3B variant (consumer-GPU friendly, lower quality):
# huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
#   --local-dir ~/comfyui-models/wan-t2v-1.3b
```

The weights are immediately available inside ComfyUI without re-entering the container because of the -v mount from Step 2.

Step 5: Access via SSH tunnel and run first generation

From your local machine:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

While the tunnel is open, navigate to http://localhost:8188. Load a Wan 2.2 workflow JSON from comfyworkflows.com or the WanVideoWrapper GitHub repository. Set your prompt, select the model checkpoint, and queue the generation.
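If you'd rather queue from a script than click through the UI, ComfyUI also exposes an HTTP API through the same tunnel. A minimal sketch, assuming stock ComfyUI's `/prompt` route and a workflow dict in the API-format JSON you can export from the UI:

```python
import json
import urllib.request

def build_prompt_request(workflow: dict, host: str = "http://localhost:8188"):
    """Build the POST /prompt request ComfyUI expects: {"prompt": <workflow graph>}."""
    body = json.dumps({"prompt": workflow}).encode()
    return urllib.request.Request(
        f"{host}/prompt", data=body,
        headers={"Content-Type": "application/json"},
    )

def queue_workflow(workflow: dict, host: str = "http://localhost:8188") -> str:
    """Submit the workflow through the SSH tunnel; returns ComfyUI's prompt_id."""
    with urllib.request.urlopen(build_prompt_request(workflow, host)) as resp:
        return json.loads(resp.read())["prompt_id"]
```

Poll `/history/<prompt_id>` afterwards to find the finished output; check current ComfyUI docs for the exact response fields, which shift between releases.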

Expected times on H100 PCIe:

  • 480p, 5 seconds, 14B: approximately 4 minutes
  • 720p, 5 seconds, 14B: approximately 10-12 minutes

Monitor VRAM usage with nvidia-smi in a second SSH session during the first run.

FP8 Quantization

FP8 reduces the 14B model's VRAM at 480p from ~55-65GB (FP16) to ~40-48GB, making 480p reliable on H100 PCIe. At 720p, FP8 still requires ~65-80GB due to attention overhead. In ComfyUI with WanVideoWrapper, enable FP8 in the WanVideoModelLoader node under the precision or dtype setting. Look for fp8_e4m3fn or equivalent. The exact option name changes with node releases; check the WanVideoWrapper GitHub for the current setting.

Quality impact: FP8 introduces minor visual degradation versus BF16, most visible on fine textures and small on-screen details. Generate a few comparison clips before committing to a pipeline.

Step-by-Step: Deploy Wan 2.2 with diffusers

For teams building programmatic pipelines or API wrappers without a UI layer, the diffusers library offers a cleaner Python-native interface.

Install dependencies:

```bash
pip install diffusers transformers accelerate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```

Basic T2V inference (BF16):

```python
import torch
from diffusers import WanPipeline

# The official Wan-AI/Wan2.2-T2V-A14B-Diffusers checkpoint stores weights in BF16.
# Use torch_dtype=torch.bfloat16 to match. FP8 inference requires a separately
# quantized community checkpoint (e.g. search Hugging Face for fp8 Wan2.2 variants)
# and a quantization library like torchao or bitsandbytes. It cannot be enabled
# via torch_dtype alone on a BF16 checkpoint.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

video = pipe(
    prompt="A slow-motion close-up of water droplets falling onto a dark stone surface",
    num_frames=25,       # 25 frames at 24fps = ~1s; scale up for longer clips
    height=480,
    width=832,
    num_inference_steps=40,
    guidance_scale=5.0,
).frames[0]

# Export with imageio
import imageio
imageio.mimsave("output.mp4", video, fps=24)
```

For a production API wrapper, add FastAPI on top:

```bash
pip install fastapi uvicorn python-multipart
```

Keep the pipeline loaded in memory between requests. Wan 2.2 14B takes 30-60 seconds to load; reloading per request kills throughput. Run the server behind an SSH tunnel or Spheron's public endpoint for remote access.

```bash
uvicorn app:app --host 127.0.0.1 --port 8080
ssh -L 8080:localhost:8080 user@your-server-ip
```
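The load-once pattern can be sketched with a cached accessor that your request handlers call. In the real server the cached body is the `WanPipeline.from_pretrained(...)` call from the snippet above; the stand-in object here only keeps the sketch runnable without a GPU:

```python
from functools import lru_cache

load_count = 0  # tracks how many times the expensive 30-60s load actually runs

@lru_cache(maxsize=1)
def get_pipeline():
    """Load once per process; every later request reuses the resident pipeline.

    Replace the body with the WanPipeline.from_pretrained(...) call for a
    real server. Each FastAPI handler calls get_pipeline() per request.
    """
    global load_count
    load_count += 1
    return object()  # stand-in for the loaded WanPipeline

first = get_pipeline()   # pays the load cost
second = get_pipeline()  # cache hit: same object, no reload
```

One process per GPU, one cached pipeline per process: requests never trigger a reload, which is what keeps effective throughput close to raw generation time.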

Wan 2.5 on B200: What the Upgrade Path Looks Like

When Wan 2.5 public weights ship, the infrastructure change is minimal. The ComfyUI and diffusers setups above will work unchanged; you swap the checkpoint reference from Wan-AI/Wan2.2-T2V-A14B to the Wan 2.5 repo once available.

The main hardware consideration: Wan 2.5-Preview includes audio generation alongside video. If the released weights include the audio component, VRAM requirements will likely increase from the Wan 2.2 baseline. Having a B200 bare-metal instance on Spheron with 192GB removes the need to re-evaluate hardware when the weights drop.

B200 also has native FP4 support (Blackwell B200 architecture). Video diffusion model libraries don't currently expose float4 inference paths through ComfyUI or standard diffusers, but when framework support lands, B200 users get it without a hardware swap.

Current B200 pricing on Spheron: B200 SXM6 is available via spot pricing at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).

Memory bandwidth comparison:

  • H100 SXM5: 3.35 TB/s
  • H200 SXM5: 4.8 TB/s
  • B200 SXM6: ~8 TB/s

Higher memory bandwidth means faster attention computation in video diffusion transformers. For the same 720p 5-second Wan 2.2 job, B200 should complete faster than H100 even at identical GPU count. Exact speedup depends on whether the workload is compute-bound or memory-bound; video transformers tend to be both at 720p, so B200 should show meaningful gains.
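For a purely memory-bound kernel, the upper bound on that speedup is just the bandwidth ratio. A quick sanity check (illustrative arithmetic only; real video transformer jobs are partly compute-bound, so observed gains land below this ceiling):

```python
# Peak memory bandwidth per GPU (TB/s), from the list above.
bandwidth_tbps = {"H100 SXM5": 3.35, "H200 SXM5": 4.8, "B200 SXM6": 8.0}

def max_memory_bound_speedup(slower: str, faster: str) -> float:
    """Best-case speedup if the kernel is entirely memory-bandwidth-bound."""
    return bandwidth_tbps[faster] / bandwidth_tbps[slower]

print(round(max_memory_bound_speedup("H100 SXM5", "B200 SXM6"), 2))  # -> 2.39
```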

Latency, Throughput, and Cost-Per-Second Benchmarks

Generation time estimates below are single-batch figures for Wan 2.2 14B. Actual performance varies by driver version, step count, and load. Run your own benchmarks on target hardware before provisioning at scale.

| GPU | Precision | Resolution | Duration | Est. Gen Time | $/hr (OD) | Cost/clip | Cost/sec output |
|---|---|---|---|---|---|---|---|
| H100 PCIe | FP8 | 480p | 5s | ~4 min | $2.01 | ~$0.13 | ~$0.027 |
| H100 PCIe | FP8 | 720p | 5s | ~10-12 min | $2.01 | ~$0.34-0.40 | ~$0.068-0.080 |
| H100 SXM5 | FP8 | 720p | 5s | ~8-10 min | $2.90 | ~$0.39-0.48 | ~$0.078-0.097 |
| H200 SXM5 | FP8 | 720p | 5s | ~7-9 min | $9.76 | ~$1.14-1.46 | ~$0.228-0.293 |
| B200 SXM6 | FP8 | 720p | 5s | ~5-7 min | $2.06 spot | ~$0.17-0.24 | ~$0.035-0.049 |
| RTX 5090 PCIe | FP8 | 480p | 5s (1.3B) | ~2-3 min | $0.86 | ~$0.03 | ~$0.006 |

Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Production Tips: Queuing, Batching, NVENC, and Safety Filters

Job queuing

Video generation jobs run 4-12 minutes per clip. A synchronous API won't work. The production pattern:

  1. API gateway receives image or prompt input, validates, returns a job ID immediately
  2. Redis queue holds pending jobs with priority tiers
  3. GPU worker pool runs one process per GPU, model weights loaded and resident in VRAM between jobs; workers pull from the queue
  4. Object storage (S3, Cloudflare R2, MinIO) holds generated clips; workers write output paths to a results store
  5. Webhook or poll endpoint notifies callers when the job completes

Keep workers alive between jobs. Wan 2.2 14B takes 30-60 seconds to load; reloading on every request cuts effective throughput in half.
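The queue-worker shape above can be sketched with the stdlib. Redis, object storage, and the real generation call are stubbed; `generate_clip` is a hypothetical placeholder for your pipeline call plus the upload:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the Redis queue
results = {}           # stands in for the results store a webhook/poll endpoint reads

def generate_clip(prompt: str) -> str:
    """Placeholder for the 4-12 minute Wan 2.2 generation plus object-storage upload."""
    return f"s3://clips/{abs(hash(prompt)) % 10000}.mp4"

def worker():
    """One worker per GPU: pulls jobs forever, weights stay resident between jobs."""
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = generate_clip(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put(("job-1", "water droplets on dark stone"))
jobs.join()  # a production gateway returns the job ID immediately and fires a webhook
```

The worker loop never exits, which is the point: the model load happens once at worker startup, not per job.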

Batching across clips

Video generation does not batch across clips on a single GPU. Each clip independently uses the full VRAM allocation. Submitting multiple jobs to one GPU causes OOM errors, not faster throughput. Scale by provisioning more instances, not by batching more clips per GPU. Four H100s run four concurrent generation jobs with linear throughput scaling and no inter-GPU coordination overhead.

NVENC handoff

After generation, use GPU-accelerated encoding to convert raw frames to delivery format:

```bash
# GPU hardware encoding via NVENC (no CPU bottleneck)
ffmpeg -hwaccel cuda -i input.mp4 \
  -c:v h264_nvenc -preset fast \
  -b:v 4M output.mp4
```

This keeps the entire generate-to-encode pipeline on the GPU and avoids CPU encoding becoming a throughput bottleneck at high clip volumes.

Watermarking

For adding a watermark overlay before delivery:

bash
ffmpeg -i clip.mp4 \
  -i watermark.png \
  -filter_complex "overlay=10:10" \
  output_watermarked.mp4

For safety filtering before delivery, run an NSFW classification model on the output frames before saving to object storage. This keeps the GPU pipeline separate from content moderation and lets you swap classifiers without touching the generation code.
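The moderation hook can be a thin predicate between generation and storage. A sketch; `nsfw_score` is a hypothetical stand-in for whatever classifier you run over sampled frames:

```python
def nsfw_score(frame) -> float:
    """Hypothetical classifier call; replace with a real NSFW model's per-frame score."""
    return 0.0

def passes_safety(frames, threshold: float = 0.85, sample_every: int = 8) -> bool:
    """Reject the clip if any sampled frame scores above the threshold.

    Sampling every Nth frame keeps the moderation pass cheap relative to
    the multi-minute generation step.
    """
    return all(nsfw_score(f) < threshold for f in frames[::sample_every])

# Only clips that pass are written to object storage; swapping the classifier
# never touches the generation code.
```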

Spheron vs Replicate and Fal.ai: Cost Per 5-Second Clip

For sustained Wan 2.2 workloads at 50+ clips per day, bare-metal hourly pricing on Spheron beats per-second API billing on Replicate and Fal.ai.

Break-even math at 720p, 5-second clips:

| Provider | Pricing Model | Cost per 5s 720p clip | 100 clips/day |
|---|---|---|---|
| Spheron H100 PCIe (on-demand) | Hourly compute | ~$0.34-0.40 | ~$34-40 |
| Replicate (Wan model) | Per-second GPU billing | ~$0.50-0.80 est. | ~$50-80 |
| Fal.ai (Wan model) | Per-generation or per-second | ~$0.60-1.00 est. | ~$60-100 |

Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Replicate bills per second of GPU compute, not per output second. For Wan 2.2 14B at 720p, 10-12 minutes of H100 time comes to ~$0.50-0.80 at typical Replicate GPU pricing. Fal.ai billing varies by model and tier.

At 50 clips per day, Spheron on-demand starts undercutting Replicate. At 100+ clips per day, the gap is substantial. Below 20 clips per day, the API services are cheaper because you pay nothing when idle.

The crossover point: if you are generating more than 40-60 clips per day consistently, self-hosted on Spheron is cheaper. Below that threshold, Replicate or Fal.ai avoids the infrastructure overhead without significant cost penalty.
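A sketch of that crossover under one explicit assumption, that the self-hosted H100 runs always-on so idle hours are billed too, paired with the per-clip API estimates from the table above:

```python
H100_HOURLY = 2.01  # Spheron H100 PCIe on-demand, $/hr

def self_hosted_daily(always_on_hours: float = 24.0) -> float:
    """Always-on instance: you pay for idle time as well as generation time."""
    return always_on_hours * H100_HOURLY

def crossover_clips(api_per_clip: float) -> float:
    """Clips/day at which an always-on H100 matches per-clip API billing."""
    return self_hosted_daily() / api_per_clip

print(round(crossover_clips(1.00)))  # Fal.ai upper estimate -> 48
print(round(crossover_clips(0.80)))  # Replicate upper estimate -> 60
```

Those bounds land in the 40-60 clips/day range quoted above; shutting the instance down overnight pushes the crossover lower.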

For the broader video model ecosystem comparison including HunyuanVideo and LTX-Video, see the AI video generation GPU guide. For image-to-video workflows covering Wan 2.2 I2V, LTX-Video, and Hunyuan Video Avatar, see the image-to-video deployment guide. For VRAM sizing across the full 2026 video AI landscape, see GPU cloud for video AI 2026.


Wan 2.2 production workloads on Spheron H100 and B200 bare-metal start at $2.01/hr on H100 PCIe and $2.06/hr per GPU on B200, with no per-output fees and full root access.

Rent H100 for Wan 2.2 → | Rent B200 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.