The Wan series 14B model at 720p requires 65-80GB of VRAM. Consumer GPUs won't run it. If you want to self-host production-quality open-source video generation, you need datacenter hardware. This guide covers exactly that: GPU requirements for Wan 2.2 (the current publicly available version), step-by-step ComfyUI and diffusers setup on H100 and B200, benchmark numbers, and what each clip costs you. For teams upgrading from earlier setups, the Wan 2.1/2.2 deployment guide covers the shared ComfyUI infrastructure that Wan 2.2 builds on.
On Wan 2.5: Wan 2.5-Preview shipped in September 2025 as a multimodal audio-video model accessible only through Alibaba Cloud APIs. No public weights have been released through the Wan-AI GitHub or HuggingFace as of April 2026. This post covers self-hosted deployment using Wan 2.2 as the current production standard, with a full section on what Wan 2.5 changes and what the upgrade path looks like when weights ship.
What Changed in Wan 2.5 vs 2.1 and 2.2
The Wan model series has moved through three major architectural shifts since February 2025.
| Model | Architecture | Training Scale | Multimodal | VRAM (14B 720p) | Public Weights |
|---|---|---|---|---|---|
| Wan 2.1 | Dense transformer | Baseline | No | 65-80GB | Yes (Wan-AI/Wan2.1-T2V-14B) |
| Wan 2.2 | Mixture-of-Experts | +65.6% images, +83.2% videos | No | 65-80GB | Yes (Wan-AI/Wan2.2-T2V-A14B) |
| Wan 2.5-Preview | Unknown | Unknown | Yes (audio+video) | Unknown | No (API only) |
Wan 2.2 (July 2025) was a meaningful upgrade over 2.1. The architecture switch to MoE keeps active parameters at 14B (27B total) but applies separate expert networks for early and late denoising steps. The larger training dataset translates to better motion coherence, stronger instruction following, and fewer geometric artifacts. VRAM requirements stayed the same.
Wan 2.5-Preview introduced two things that 2.2 doesn't have: synchronized audio generation alongside video, and an agentic mode where the model can decompose multi-scene prompts into sub-tasks. The "Preview" label signals this is not a stable release. Alibaba has kept the weights proprietary, likely because the audio component involves licensed training data that complicates open distribution.
Wan 2.6 (December 2025) followed, also without public weights.
What this means for infrastructure: your Wan 2.2 ComfyUI setup is the foundation you'll run Wan 2.5 on when weights eventually ship. The upgrade is a checkpoint swap. Build the infrastructure now on Wan 2.2 and you're ready to flip to Wan 2.5 on day one of a public release.
For the Wan 2.1 setup guide and how the 2.1-to-2.2 migration works, see the Wan 2.1/2.2 deployment guide.
GPU Hardware Requirements for Wan 2.2/2.5
The VRAM math is driven by model weight size, activation memory, and attention overhead. Here's the breakdown:
Weight memory:
- 14B parameters at FP16: ~28GB
- 14B parameters at FP8: ~14GB
- T5 text encoder: ~11GB (BF16)
- VAE: ~0.5GB
Runtime overhead:
- Activation memory during denoising: 15-20% of weight memory
- Framework and CUDA context: ~2-4GB
The resolution amplifier is the attention mechanism. Going from 480p to 720p increases spatial token count by roughly 2.25x, but attention memory grows quadratically with token count, so total VRAM climbs roughly 2-3x even though pixel count little more than doubles.
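Putting those numbers together, here is a quick sanity check of the FP16 budget (all inputs are the estimates above, not measured values):

```python
# Rough single-GPU VRAM budget for Wan 2.2 14B at FP16, from the estimates above
dit_weights = 28.0                # 14B params x 2 bytes (FP16), in GB
t5_encoder = 11.0                 # BF16 text encoder
vae = 0.5
activations = 0.20 * dit_weights  # upper end of the 15-20% range
framework = 4.0                   # CUDA context + framework overhead

base = dit_weights + t5_encoder + vae + activations + framework
print(f"Budget before attention scaling: ~{base:.0f} GB")
# Attention overhead, which grows quadratically with token count, is what
# pushes 720p FP16 into the 75-90GB range shown in the table below.
```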
| Precision | Resolution | Duration | VRAM Required | Min GPU |
|---|---|---|---|---|
| FP16 | 480p (832x480) | 5s | 55-65GB | H100 SXM5 |
| FP8 | 480p (832x480) | 5s | ~40-48GB | H100 PCIe |
| FP16 | 720p (1280x720) | 5s | 75-90GB | H200 SXM5 |
| FP8 | 720p (1280x720) | 5s | ~65-80GB | H100 PCIe (tight) |
| FP8 | 720p (1280x720) | 10s | 80GB+ | H200 SXM5 |
GPU selection guide:
- H100 PCIe on Spheron (80GB, from $2.01/hr): The minimum for Wan 2.2 14B at 480p-720p with FP8. Tight VRAM margin at 720p. Prefer FP8 quantization; run `nvidia-smi` on the first job to confirm headroom before scaling. Single-GPU offers available.
- H100 SXM5 on Spheron (80GB, ~$2.90/hr per GPU in 8-GPU bundles): Same VRAM as PCIe, but 3.35 TB/s memory bandwidth vs 2 TB/s cuts generation time by roughly 25% for attention-heavy video workloads. No spot pricing currently. More useful if you need multi-GPU setups.
- B200 GPU rental on Spheron (192GB): The best GPU for Wan 2.2 production workloads. 192GB eliminates all VRAM margin concerns including 10-second 720p clips and FP16 precision. B200 SXM6 is currently available via spot pricing on Spheron at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).
- H200 SXM5 on Spheron (141GB, from $9.76/hr per GPU): 141GB gives headroom for 720p 10-second clips and FP16 at 720p.
Multi-GPU note: Video generation on a single GPU does not benefit from tensor parallelism in standard ComfyUI or diffusers setups. One job runs on one GPU. Scale throughput by running multiple GPU instances in parallel, each processing an independent job. No NVLink or InfiniBand required for this pattern.
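On a multi-GPU node, the same pattern is one worker process pinned to each GPU; a minimal sketch (`worker.py` stands in for whatever single-GPU job runner you use):

```bash
# Launch one independent worker per GPU; no inter-GPU coordination needed
for gpu in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$gpu python worker.py &
done
wait
```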
Step-by-Step: Deploy Wan 2.2 with ComfyUI on Spheron H100
This walkthrough uses ComfyUI with the WanVideoWrapper custom node. Running ComfyUI on GPU cloud gives you a node-based interface to build reusable workflows, chain T2V and I2V pipelines, and iterate faster than CLI-only approaches.
Step 1: Launch an H100 instance
Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance. For 720p 14B generation, H100 PCIe at $2.01/hr covers most use cases. Choose Ubuntu 22.04. Do not expose port 8188 in your network settings. ComfyUI has no built-in authentication; you'll access it through an SSH tunnel instead.
Step 2: Deploy ComfyUI via Docker
SSH into the instance, then run:
```bash
# latest-cuda is a floating tag. For supply-chain assurance, pin by digest:
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# then substitute the sha256 digest reference below.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda
docker pull $IMAGE
docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

The `-v` flags persist model weights and outputs across container restarts. `--ipc=host` is required for PyTorch shared memory. `-p 127.0.0.1:8188:8188` binds ComfyUI to localhost only.
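Once the container is up, a quick sanity check from the host (ComfyUI exposes a small built-in HTTP API; `/system_stats` reports device and VRAM info):

```bash
# Tail startup logs, then hit ComfyUI's status endpoint
docker logs -f comfyui   # Ctrl-C to detach once the server reports ready
curl -s http://127.0.0.1:8188/system_stats
```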
Step 3: Install ComfyUI-WanVideoWrapper
Enter the running container:
```bash
docker exec -it comfyui bash
```

Navigate to custom nodes and clone the wrapper:
```bash
cd /opt/ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt
```

Exit and restart to register the new nodes:
```bash
exit
docker restart comfyui
```

If you hit dependency errors, check the WanVideoWrapper GitHub for current installation notes. The requirements file updates frequently.
Step 4: Download Wan 2.2 model weights
On the host (not inside the container), download into the mounted model directory:
```bash
pip install huggingface_hub

# Wan 2.2 14B text-to-video (~69GB: DiT ~57GB + T5 ~11GB + VAE ~0.5GB)
# Download takes 30-90 minutes depending on connection speed
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir ~/comfyui-models/wan-t2v-14b

# For image-to-video:
# huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
#   --local-dir ~/comfyui-models/wan-i2v-14b

# For the 1.3B variant (consumer-GPU friendly, lower quality):
# huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
#   --local-dir ~/comfyui-models/wan-t2v-1.3b
```

The weights are immediately available inside ComfyUI without re-entering the container because of the `-v` mount from Step 2.
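Before wiring up workflows, it's worth confirming the transfer actually completed; a quick check on the host:

```bash
# Expect ~69GB for the T2V repo; a partial download shows a smaller total
du -sh ~/comfyui-models/wan-t2v-14b
ls -lh ~/comfyui-models/wan-t2v-14b
```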
Step 5: Access via SSH tunnel and run first generation
From your local machine:
```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

While the tunnel is open, navigate to http://localhost:8188. Load a Wan 2.2 workflow JSON from comfyworkflows.com or the WanVideoWrapper GitHub repository. Set your prompt, select the model checkpoint, and queue the generation.
Expected times on H100 PCIe:
- 480p, 5 seconds, 14B: approximately 4 minutes
- 720p, 5 seconds, 14B: approximately 10-12 minutes
Monitor VRAM usage with `nvidia-smi` in a second SSH session during the first run.
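A convenient way to do that is polling a compact query instead of the full `nvidia-smi` dashboard:

```bash
# Refresh VRAM and utilization figures every 2 seconds
watch -n 2 "nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv"
```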
FP8 Quantization
FP8 reduces the 14B model's VRAM at 480p from ~55-65GB (FP16) to ~40-48GB, making 480p reliable on H100 PCIe. At 720p, FP8 still requires ~65-80GB due to attention overhead. In ComfyUI with WanVideoWrapper, enable FP8 in the `WanVideoModelLoader` node under the precision or dtype setting. Look for `fp8_e4m3fn` or equivalent. The exact option name changes with node releases; check the WanVideoWrapper GitHub for the current setting.
Quality impact: FP8 introduces minor visual degradation versus BF16, most visible on fine textures and small on-screen details. Generate a few comparison clips before committing to a pipeline.
Step-by-Step: Deploy Wan 2.2 with diffusers
For teams building programmatic pipelines or API wrappers without a UI layer, the diffusers library offers a cleaner Python-native interface.
Install dependencies:
```bash
pip install diffusers transformers accelerate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```

Basic T2V inference (BF16):
```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# The official Wan-AI/Wan2.2-T2V-A14B-Diffusers checkpoint stores weights in BF16.
# Use torch_dtype=torch.bfloat16 to match. FP8 inference requires a separately
# quantized community checkpoint (e.g. search Hugging Face for fp8 Wan2.2 variants)
# and a quantization library like torchao or bitsandbytes. It cannot be enabled
# via torch_dtype alone on a BF16 checkpoint.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

video = pipe(
    prompt="A slow-motion close-up of water droplets falling onto a dark stone surface",
    num_frames=25,  # 25 frames at 24fps = ~1s; scale up for longer clips
    height=480,
    width=832,
    num_inference_steps=40,
    guidance_scale=5.0,
).frames[0]

# Export to MP4 (export_to_video handles the imageio/ffmpeg plumbing)
export_to_video(video, "output.mp4", fps=24)
```

For a production API wrapper, add FastAPI on top:
```bash
pip install fastapi uvicorn python-multipart
```

Keep the pipeline loaded in memory between requests. Wan 2.2 14B takes 30-60 seconds to load; reloading per request kills throughput. Run the server behind an SSH tunnel or Spheron's public endpoint for remote access.
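A minimal sketch of that wrapper, reusing the checkpoint from the snippet above (the endpoint paths and in-memory job store are illustrative, not a production design):

```python
# app.py - sketch of a FastAPI wrapper that keeps the pipeline resident
import uuid

import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
jobs = {}  # in-memory job store; use Redis or a database in production

# Load once at startup so the weights stay resident in VRAM between requests
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

def run_job(job_id: str, prompt: str):
    # One GPU runs one job at a time; a real deployment serializes via a queue
    video = pipe(prompt=prompt, num_frames=25, height=480, width=832).frames[0]
    path = f"/tmp/{job_id}.mp4"
    export_to_video(video, path, fps=24)
    jobs[job_id] = {"status": "done", "path": path}

@app.post("/generate")
def generate(prompt: str, background_tasks: BackgroundTasks):
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "running"}
    background_tasks.add_task(run_job, job_id, prompt)
    return {"job_id": job_id}  # return immediately; caller polls for status

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    return jobs.get(job_id, {"status": "unknown"})
```

The job-ID pattern matters because a 720p generation outlives any sane HTTP timeout; the queuing section below extends this into a Redis-backed worker pool.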
```bash
uvicorn app:app --host 127.0.0.1 --port 8080
ssh -L 8080:localhost:8080 user@your-server-ip
```

Wan 2.5 on B200: What the Upgrade Path Looks Like
When Wan 2.5 public weights ship, the infrastructure change is minimal. The ComfyUI and diffusers setups above will work unchanged; you swap the checkpoint reference from Wan-AI/Wan2.2-T2V-A14B to the Wan 2.5 repo once available.
The main hardware consideration: Wan 2.5-Preview includes audio generation alongside video. If the released weights include the audio component, VRAM requirements will likely increase from the Wan 2.2 baseline. Having a B200 bare-metal instance on Spheron with 192GB removes the need to re-evaluate hardware when the weights drop.
B200 also has native FP4 support (Blackwell architecture). ComfyUI and standard diffusers don't currently expose FP4 inference paths for video diffusion models, but when framework support lands, B200 users get it without a hardware swap.
Current B200 pricing on Spheron: B200 SXM6 is available via spot pricing at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).
Memory bandwidth comparison:
- H100 SXM5: 3.35 TB/s
- H200 SXM5: 4.8 TB/s
- B200 SXM6: ~8 TB/s
Higher memory bandwidth means faster attention computation in video diffusion transformers. For the same 720p 5-second Wan 2.2 job, B200 should complete faster than H100 at identical GPU count. The exact speedup depends on whether the workload is compute-bound or memory-bound; video transformers at 720p tend to hit both limits, so B200 should show meaningful gains.
Latency, Throughput, and Cost-Per-Second Benchmarks
Generation time estimates below are single-batch figures for Wan 2.2 14B. Actual performance varies by driver version, step count, and load. Run your own benchmarks on target hardware before provisioning at scale.
| GPU | Precision | Resolution | Duration | Est. Gen Time | $/hr (OD) | Cost/clip | Cost/sec output |
|---|---|---|---|---|---|---|---|
| H100 PCIe | FP8 | 480p | 5s | ~4 min | $2.01 | ~$0.13 | ~$0.027 |
| H100 PCIe | FP8 | 720p | 5s | ~10-12 min | $2.01 | ~$0.34-0.40 | ~$0.068-0.080 |
| H100 SXM5 | FP8 | 720p | 5s | ~8-10 min | $2.90 | ~$0.39-0.48 | ~$0.078-0.097 |
| H200 SXM5 | FP8 | 720p | 5s | ~7-9 min | $9.76 | ~$1.14-1.46 | ~$0.228-0.293 |
| B200 SXM6 | FP8 | 720p | 5s | ~5-7 min | $2.06 spot | ~$0.17-0.24 | ~$0.035-0.049 |
| RTX 5090 PCIe | FP8 | 480p | 5s (1.3B) | ~2-3 min | $0.86 | ~$0.03 | ~$0.006 |
Pricing fluctuates based on GPU availability. The prices above were captured on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Production Tips: Queuing, Batching, NVENC, and Safety Filters
Job queuing
Video generation jobs run 4-12 minutes per clip. A synchronous API won't work. The production pattern:
- API gateway receives image or prompt input, validates, returns a job ID immediately
- Redis queue holds pending jobs with priority tiers
- GPU worker pool runs one process per GPU, model weights loaded and resident in VRAM between jobs; workers pull from the queue
- Object storage (S3, Cloudflare R2, MinIO) holds generated clips; workers write output paths to a results store
- Webhook or poll endpoint notifies callers when the job completes
Keep workers alive between jobs. Wan 2.2 14B takes 30-60 seconds to load; reloading on every request cuts effective throughput in half.
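A concrete sketch of the worker side, assuming redis-py and a Redis list named wan:jobs holding JSON payloads (queue and key names are illustrative; the pipeline setup matches the diffusers section above):

```python
# worker.py - one process per GPU, weights loaded once and held in VRAM
import json

import redis
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

r = redis.Redis(host="localhost", port=6379)

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

while True:
    # BLPOP blocks until a job is available, so idle workers cost no CPU
    _, payload = r.blpop("wan:jobs")
    job = json.loads(payload)
    video = pipe(
        prompt=job["prompt"], num_frames=25, height=480, width=832
    ).frames[0]
    out_path = f"/data/output/{job['id']}.mp4"
    export_to_video(video, out_path, fps=24)
    # A real worker would upload to object storage and fire a webhook here
    r.hset(f"wan:results:{job['id']}", mapping={"status": "done", "path": out_path})
```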
Batching across clips
Video generation does not batch across clips on a single GPU. Each clip independently uses the full VRAM allocation. Submitting multiple jobs to one GPU causes OOM errors, not faster throughput. Scale by provisioning more instances, not by batching more clips per GPU. Four H100s run four concurrent generation jobs with linear throughput scaling and no inter-GPU coordination overhead.
NVENC handoff
After generation, use GPU-accelerated encoding to convert raw frames to delivery format:
```bash
# GPU hardware encoding via NVENC (no CPU bottleneck)
ffmpeg -hwaccel cuda -i input.mp4 \
  -c:v h264_nvenc -preset fast \
  -b:v 4M output.mp4
```

This keeps the entire generate-to-encode pipeline on the GPU and avoids CPU encoding becoming a throughput bottleneck at high clip volumes.
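If your workflow writes individual frames rather than an intermediate MP4, NVENC can encode the sequence directly; a sketch (the frame filename pattern is an assumption about your output layout):

```bash
# Encode a numbered PNG sequence straight to H.264 on the GPU
ffmpeg -framerate 24 -i frames/frame_%04d.png \
  -c:v h264_nvenc -preset fast -b:v 4M output.mp4
```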
Watermarking
For adding a watermark overlay before delivery:
```bash
ffmpeg -i clip.mp4 \
  -i watermark.png \
  -filter_complex "overlay=10:10" \
  output_watermarked.mp4
```

For safety filtering before delivery, run an NSFW classification model on the output frames before saving to object storage. This keeps the GPU pipeline separate from content moderation and lets you swap classifiers without touching the generation code.
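A minimal sketch of that filter using the transformers image-classification pipeline; the model name is one community example from the Hugging Face Hub, not an endorsement, so swap in whatever classifier you've validated:

```python
# Frame-level safety check before upload to object storage
from PIL import Image
from transformers import pipeline

classifier = pipeline(
    "image-classification", model="Falconsai/nsfw_image_detection"
)

def clip_is_safe(frame_paths, threshold=0.5):
    """Return False if any sampled frame scores above the NSFW threshold."""
    for path in frame_paths:
        for result in classifier(Image.open(path)):
            if result["label"].lower() == "nsfw" and result["score"] > threshold:
                return False
    return True
```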
Spheron vs Replicate and Fal.ai: Cost Per 5-Second Clip
For sustained Wan 2.2 workloads at 50+ clips per day, bare-metal hourly pricing on Spheron beats per-second API billing on Replicate and Fal.ai.
Break-even math at 720p, 5-second clips:
| Provider | Pricing Model | Cost per 5s 720p clip | 100 clips/day |
|---|---|---|---|
| Spheron H100 PCIe (on-demand) | Hourly compute | ~$0.34-0.40 | ~$34-40 |
| Replicate (Wan model) | Per-second GPU billing | ~$0.50-0.80 est. | ~$50-80 |
| Fal.ai (Wan model) | Per-generation or per-second | ~$0.60-1.00 est. | ~$60-100 |
Pricing fluctuates based on GPU availability. The prices above were captured on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Replicate bills per second of GPU compute, not per output second. For Wan 2.2 14B at 720p, 10-12 minutes of H100 time works out to ~$0.50-0.80 at typical Replicate GPU pricing. Fal.ai billing varies by model and tier.
At 50 clips per day, Spheron on-demand starts undercutting Replicate. At 100+ clips per day, the gap is substantial. Below 20 clips per day, the API services are cheaper because you pay nothing when idle.
The crossover point: if you are generating more than 40-60 clips per day consistently, self-hosted on Spheron is cheaper. Below that threshold, Replicate or Fal.ai avoids the infrastructure overhead without significant cost penalty.
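The arithmetic behind that threshold, using the table's estimates and assuming an always-on instance (deprovisioning idle hours shifts the break-even lower):

```python
# Break-even clips/day for an always-on H100 PCIe vs per-clip API billing
hourly_rate = 2.01                   # Spheron H100 PCIe on-demand, $/hr
daily_instance = hourly_rate * 24    # ~$48/day, paid whether busy or idle

for api_cost_per_clip in (0.65, 0.80, 1.00):
    crossover = daily_instance / api_cost_per_clip
    print(f"API at ${api_cost_per_clip:.2f}/clip: break-even at ~{crossover:.0f} clips/day")
```

Against the higher API estimates the break-even lands near 48-60 clips/day; the low-end $0.65 estimate pushes it to ~74, so treat the 40-60 figure as a band rather than a line.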
For the broader video model ecosystem comparison including HunyuanVideo and LTX-Video, see the AI video generation GPU guide. For image-to-video workflows covering Wan 2.2 I2V, LTX-Video, and Hunyuan Video Avatar, see the image-to-video deployment guide. For VRAM sizing across the full 2026 video AI landscape, see GPU cloud for video AI 2026.
Wan 2.2 production workloads on Spheron H100 and B200 bare-metal start at $2.01/hr on H100 PCIe and $2.06/hr per GPU on B200, with no per-output fees and full root access.
Rent H100 for Wan 2.2 → | Rent B200 → | View all GPU pricing →
