Tutorial

Deploy Wan 2.5 on GPU Cloud: Production Video Generation Setup (2026)

Written by Mitrasish, Co-founder · Apr 28, 2026

Wan's 14B models at 720p require 65-80GB of VRAM. Consumer GPUs won't run them. If you want to self-host production-quality open-source video generation, you need datacenter hardware. This guide covers exactly that: GPU requirements for Wan 2.2 (the current publicly available version), step-by-step ComfyUI and diffusers setup on H100 and B200, benchmark numbers, and what each clip costs you. For teams upgrading from earlier setups, the Wan 2.1/2.2 deployment guide covers the shared ComfyUI infrastructure that Wan 2.2 builds on.

On Wan 2.5: Wan 2.5-Preview shipped in September 2025 as a multimodal audio-video model accessible only through Alibaba Cloud APIs. No public weights have been released through the Wan-AI GitHub or HuggingFace as of April 2026. This post covers self-hosted deployment using Wan 2.2 as the current production standard, with a full section on what Wan 2.5 changes and what the upgrade path looks like when weights ship.

What Changed in Wan 2.5 vs 2.1 and 2.2

The Wan model series has moved through three major architectural shifts since February 2025.

| Model | Architecture | Training Scale | Multimodal | VRAM (14B 720p) | Public Weights |
|---|---|---|---|---|---|
| Wan 2.1 | Dense transformer | Baseline | No | 65-80GB | Yes (Wan-AI/Wan2.1-T2V-14B) |
| Wan 2.2 | Mixture-of-Experts | +65.6% images, +83.2% videos | No | 65-80GB | Yes (Wan-AI/Wan2.2-T2V-A14B) |
| Wan 2.5-Preview | Unknown | Unknown | Yes (audio+video) | Unknown | No (API only) |

Wan 2.2 (July 2025) was a meaningful upgrade over 2.1. The architecture switch to MoE keeps active parameters at 14B (27B total) but applies separate expert networks for early and late denoising steps. The larger training dataset translates to better motion coherence, stronger instruction following, and fewer geometric artifacts. VRAM requirements stayed the same.

Wan 2.5-Preview introduced two things that 2.2 doesn't have: synchronized audio generation alongside video, and an agentic mode where the model can decompose multi-scene prompts into sub-tasks. The "Preview" label signals this is not a stable release. Alibaba has kept the weights proprietary, likely because the audio component involves licensed training data that complicates open distribution.

Wan 2.6 (December 2025) followed, also without public weights.

What this means for infrastructure: your Wan 2.2 ComfyUI setup is the foundation you'll run Wan 2.5 on when weights eventually ship. The upgrade is a checkpoint swap. Build the infrastructure now on Wan 2.2 and you're ready to flip to Wan 2.5 on day one of a public release.

For the Wan 2.1 setup guide and how the 2.1-to-2.2 migration works, see the Wan 2.1/2.2 deployment guide.

GPU Hardware Requirements for Wan 2.2/2.5

The VRAM math is driven by model weight size, activation memory, and attention overhead. Here's the breakdown:

Weight memory:

  • 14B parameters at FP16: ~28GB
  • 14B parameters at FP8: ~14GB
  • T5 text encoder: ~11GB (BF16)
  • VAE: ~0.5GB

Runtime overhead:

  • Activation memory during denoising: 15-20% of weight memory
  • Framework and CUDA context: ~2-4GB

The resolution amplifier is the attention mechanism. Going from 480p to 720p increases spatial token count by roughly 2.25x, and because the attention matrix grows quadratically with token count, VRAM climbs roughly 2-3x, more than the pixel increase alone would suggest.
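The memory arithmetic above can be folded into a quick estimator. This is illustrative only: the activation fraction and framework overhead are the rough rules of thumb from this section, and the resolution-dependent attention buffer is deliberately left out, which is why 720p figures in the table below run well above this baseline.

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     activation_frac: float = 0.18,
                     text_encoder_gb: float = 11.0, vae_gb: float = 0.5,
                     framework_gb: float = 3.0) -> float:
    """Back-of-envelope VRAM estimate using this section's rules of thumb.

    Excludes the quadratic attention buffer, which dominates at 720p.
    """
    weights = params_b * bytes_per_param        # 14B at FP16 (2 bytes/param) ~= 28GB
    activations = weights * activation_frac     # 15-20% of weight memory during denoising
    return weights + activations + text_encoder_gb + vae_gb + framework_gb

# 14B DiT at FP16, baseline before attention overhead:
print(round(estimate_vram_gb(14, 2.0), 1))
```

Halving `bytes_per_param` to 1.0 models FP8 weights, which is where the FP8 rows in the table get their headroom.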

| Precision | Resolution | Duration | VRAM Required | Min GPU |
|---|---|---|---|---|
| FP16 | 480p (832x480) | 5s | 55-65GB | H100 SXM5 |
| FP8 | 480p (832x480) | 5s | ~40-48GB | H100 PCIe |
| FP16 | 720p (1280x720) | 5s | 75-90GB | H200 SXM5 |
| FP8 | 720p (1280x720) | 5s | ~65-80GB | H100 PCIe (tight) |
| FP8 | 720p (1280x720) | 10s | 80GB+ | H200 SXM5 |

GPU selection guide:

  • H100 PCIe on Spheron (80GB, from $2.01/hr): The minimum for Wan 2.2 14B at 480p-720p with FP8. Tight VRAM margin at 720p. Prefer FP8 quantization; run nvidia-smi on the first job to confirm headroom before scaling. Single-GPU offers available.
  • H100 SXM5 on Spheron (80GB, ~$2.90/hr per GPU in 8-GPU bundles): Same VRAM as PCIe, but 3.35 TB/s memory bandwidth vs 2 TB/s cuts generation time by roughly 25% for attention-heavy video workloads. No spot pricing currently. More useful if you need multi-GPU setups.
  • B200 GPU rental on Spheron (192GB): The best GPU for Wan 2.2 production workloads. 192GB eliminates all VRAM margin concerns including 10-second 720p clips and FP16 precision. B200 SXM6 is currently available via spot pricing on Spheron at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).
  • H200 SXM5 on Spheron (141GB, from $9.76/hr per GPU): 141GB gives headroom for 720p 10-second clips and FP16 at 720p.

Multi-GPU note: Video generation on a single GPU does not benefit from tensor parallelism in standard ComfyUI or diffusers setups. One job runs on one GPU. Scale throughput by running multiple GPU instances in parallel, each processing an independent job. No NVLink or InfiniBand required for this pattern.
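The one-job-per-GPU pattern can be sketched with `CUDA_VISIBLE_DEVICES` pinning, one worker process per device. This is a sketch, and `run_generation.py` is a hypothetical entry point for whatever inference script you run:

```python
import os

def worker_env(gpu_index: int) -> dict:
    """Environment that pins one worker process to exactly one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

def launch_plan(num_gpus: int, script: str = "run_generation.py"):
    """One independent inference process per GPU: linear throughput scaling,
    no NVLink or InfiniBand coordination required."""
    return [(["python", script], worker_env(g)) for g in range(num_gpus)]

# Each (cmd, env) pair would be handed to subprocess.Popen(cmd, env=env).
plan = launch_plan(4)
```

Because each process only sees its own device, four H100s run four fully independent jobs with no cross-GPU communication.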

Step-by-Step: Deploy Wan 2.2 with ComfyUI on Spheron H100

This walkthrough uses ComfyUI with the WanVideoWrapper custom node. The ComfyUI on GPU cloud node-based interface lets you build reusable workflows, chain T2V and I2V pipelines, and iterate faster than CLI-only approaches.

Step 1: Launch an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance. For 720p 14B generation, H100 PCIe at $2.01/hr covers most use cases. Choose Ubuntu 22.04. Do not expose port 8188 in your network settings. ComfyUI has no built-in authentication; you'll access it through an SSH tunnel instead.

Step 2: Deploy ComfyUI via Docker

SSH into the instance, then run:

```bash
# latest-cuda is a floating tag. For supply-chain assurance, pin by digest:
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# then substitute the sha256 digest reference below.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

The -v flags persist model weights and outputs across container restarts. --ipc=host is required for PyTorch shared memory. -p 127.0.0.1:8188:8188 binds ComfyUI to localhost only.

Step 3: Install ComfyUI-WanVideoWrapper

Enter the running container:

```bash
docker exec -it comfyui bash
```

Navigate to custom nodes and clone the wrapper:

```bash
cd /opt/ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt
```

Exit and restart to register the new nodes:

```bash
exit
docker restart comfyui
```

If you hit dependency errors, check the WanVideoWrapper GitHub for current installation notes. The requirements file updates frequently.

Step 4: Download Wan 2.2 model weights

On the host (not inside the container), download into the mounted model directory:

```bash
pip install huggingface_hub

# Wan 2.2 14B text-to-video (~69GB: DiT ~57GB + T5 ~11GB + VAE ~0.5GB)
# Download takes 30-90 minutes depending on connection speed
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir ~/comfyui-models/wan-t2v-14b

# For image-to-video:
# huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
#   --local-dir ~/comfyui-models/wan-i2v-14b

# For the 1.3B variant (consumer-GPU friendly, lower quality):
# huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
#   --local-dir ~/comfyui-models/wan-t2v-1.3b
```

The weights are immediately available inside ComfyUI without re-entering the container because of the -v mount from Step 2.

Step 5: Access via SSH tunnel and run first generation

From your local machine:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

While the tunnel is open, navigate to http://localhost:8188. Load a Wan 2.2 workflow JSON from comfyworkflows.com or the WanVideoWrapper GitHub repository. Set your prompt, select the model checkpoint, and queue the generation.
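If you'd rather queue from a script than click through the UI, ComfyUI also exposes an HTTP API through the same tunnel. A minimal sketch, assuming stock ComfyUI's `/prompt` route and a workflow dict in the API-format JSON you can export from the UI:

```python
import json
import urllib.request

def build_prompt_request(workflow: dict, host: str = "http://localhost:8188"):
    """Build the POST /prompt request ComfyUI expects: {"prompt": <workflow graph>}."""
    body = json.dumps({"prompt": workflow}).encode()
    return urllib.request.Request(
        f"{host}/prompt", data=body,
        headers={"Content-Type": "application/json"},
    )

def queue_workflow(workflow: dict, host: str = "http://localhost:8188") -> str:
    """Submit the workflow through the SSH tunnel; returns ComfyUI's prompt_id."""
    with urllib.request.urlopen(build_prompt_request(workflow, host)) as resp:
        return json.loads(resp.read())["prompt_id"]
```

Poll `/history/<prompt_id>` afterwards to find the finished output; check current ComfyUI docs for the exact response fields, which shift between releases.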

Expected times on H100 PCIe:

  • 480p, 5 seconds, 14B: approximately 4 minutes
  • 720p, 5 seconds, 14B: approximately 10-12 minutes

Monitor VRAM usage with nvidia-smi in a second SSH session during the first run.

FP8 Quantization

FP8 reduces the 14B model's VRAM at 480p from ~55-65GB (FP16) to ~40-48GB, making 480p reliable on H100 PCIe. At 720p, FP8 still requires ~65-80GB due to attention overhead. In ComfyUI with WanVideoWrapper, enable FP8 in the WanVideoModelLoader node under the precision or dtype setting. Look for fp8_e4m3fn or equivalent. The exact option name changes with node releases; check the WanVideoWrapper GitHub for the current setting.

Quality impact: FP8 introduces minor visual degradation versus BF16, most visible on fine textures and small on-screen details. Generate a few comparison clips before committing to a pipeline.

Step-by-Step: Deploy Wan 2.2 with diffusers

For teams building programmatic pipelines or API wrappers without a UI layer, the diffusers library offers a cleaner Python-native interface.

Install dependencies:

```bash
pip install diffusers transformers accelerate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```

Basic T2V inference (BF16):

```python
import torch
from diffusers import WanPipeline

# The official Wan-AI/Wan2.2-T2V-A14B-Diffusers checkpoint stores weights in BF16.
# Use torch_dtype=torch.bfloat16 to match. FP8 inference requires a separately
# quantized community checkpoint (e.g. search Hugging Face for fp8 Wan2.2 variants)
# and a quantization library like torchao or bitsandbytes. It cannot be enabled
# via torch_dtype alone on a BF16 checkpoint.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

video = pipe(
    prompt="A slow-motion close-up of water droplets falling onto a dark stone surface",
    num_frames=25,       # 25 frames at 24fps = ~1s; scale up for longer clips
    height=480,
    width=832,
    num_inference_steps=40,
    guidance_scale=5.0,
).frames[0]

# Export with imageio
import imageio
imageio.mimsave("output.mp4", video, fps=24)
```

For a production API wrapper, add FastAPI on top:

```bash
pip install fastapi uvicorn python-multipart
```

Keep the pipeline loaded in memory between requests. Wan 2.2 14B takes 30-60 seconds to load; reloading per request kills throughput. Run the server behind an SSH tunnel or Spheron's public endpoint for remote access.

```bash
uvicorn app:app --host 127.0.0.1 --port 8080
ssh -L 8080:localhost:8080 user@your-server-ip
```
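The load-once pattern can be sketched with a cached accessor that your request handlers call. In the real server the cached body is the `WanPipeline.from_pretrained(...)` call from the snippet above; the stand-in object here only keeps the sketch runnable without a GPU:

```python
from functools import lru_cache

load_count = 0  # tracks how many times the expensive 30-60s load actually runs

@lru_cache(maxsize=1)
def get_pipeline():
    """Load once per process; every later request reuses the resident pipeline.

    Replace the body with the WanPipeline.from_pretrained(...) call for a
    real server. Each FastAPI handler calls get_pipeline() per request.
    """
    global load_count
    load_count += 1
    return object()  # stand-in for the loaded WanPipeline

first = get_pipeline()   # pays the load cost
second = get_pipeline()  # cache hit: same object, no reload
```

One process per GPU, one cached pipeline per process: requests never trigger a reload, which is what keeps effective throughput close to raw generation time.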

Wan 2.5 on B200: What the Upgrade Path Looks Like

When Wan 2.5 public weights ship, the infrastructure change is minimal. The ComfyUI and diffusers setups above will work unchanged; you swap the checkpoint reference from Wan-AI/Wan2.2-T2V-A14B to the Wan 2.5 repo once available.

The main hardware consideration: Wan 2.5-Preview includes audio generation alongside video. If the released weights include the audio component, VRAM requirements will likely increase from the Wan 2.2 baseline. Having a B200 bare-metal instance on Spheron with 192GB removes the need to re-evaluate hardware when the weights drop.

B200 also has native FP4 support (Blackwell B200 architecture). Video diffusion model libraries don't currently expose float4 inference paths through ComfyUI or standard diffusers, but when framework support lands, B200 users get it without a hardware swap.

Current B200 pricing on Spheron: B200 SXM6 is available via spot pricing at $2.06/hr per GPU (in 2-GPU and 8-GPU bundles).

Memory bandwidth comparison:

  • H100 SXM5: 3.35 TB/s
  • H200 SXM5: 4.8 TB/s
  • B200 SXM6: ~8 TB/s

Higher memory bandwidth means faster attention computation in video diffusion transformers. For the same 720p 5-second Wan 2.2 job, B200 should complete faster than H100 even at identical GPU count. Exact speedup depends on whether the workload is compute-bound or memory-bound; video transformers tend to be both at 720p, so B200 should show meaningful gains.
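For a purely memory-bound kernel, the upper bound on that speedup is just the bandwidth ratio. A quick sanity check (illustrative arithmetic only; real video transformer jobs are partly compute-bound, so observed gains land below this ceiling):

```python
# Peak memory bandwidth per GPU (TB/s), from the list above.
bandwidth_tbps = {"H100 SXM5": 3.35, "H200 SXM5": 4.8, "B200 SXM6": 8.0}

def max_memory_bound_speedup(slower: str, faster: str) -> float:
    """Best-case speedup if the kernel is entirely memory-bandwidth-bound."""
    return bandwidth_tbps[faster] / bandwidth_tbps[slower]

print(round(max_memory_bound_speedup("H100 SXM5", "B200 SXM6"), 2))  # -> 2.39
```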

Latency, Throughput, and Cost-Per-Second Benchmarks

Generation time estimates below are single-batch figures for Wan 2.2 14B. Actual performance varies by driver version, step count, and load. Run your own benchmarks on target hardware before provisioning at scale.

| GPU | Precision | Resolution | Duration | Est. Gen Time | $/hr (OD) | Cost/clip | Cost/sec output |
|---|---|---|---|---|---|---|---|
| H100 PCIe | FP8 | 480p | 5s | ~4 min | $2.01 | ~$0.13 | ~$0.027 |
| H100 PCIe | FP8 | 720p | 5s | ~10-12 min | $2.01 | ~$0.34-0.40 | ~$0.068-0.080 |
| H100 SXM5 | FP8 | 720p | 5s | ~8-10 min | $2.90 | ~$0.39-0.48 | ~$0.078-0.097 |
| H200 SXM5 | FP8 | 720p | 5s | ~7-9 min | $9.76 | ~$1.14-1.46 | ~$0.228-0.293 |
| B200 SXM6 | FP8 | 720p | 5s | ~5-7 min | $2.06 spot | ~$0.17-0.24 | ~$0.035-0.049 |
| RTX 5090 PCIe | FP8 | 480p | 5s (1.3B) | ~2-3 min | $0.86 | ~$0.03 | ~$0.006 |

Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Production Tips: Queuing, Batching, NVENC, and Safety Filters

Job queuing

Video generation jobs run 4-12 minutes per clip. A synchronous API won't work. The production pattern:

  1. API gateway receives image or prompt input, validates, returns a job ID immediately
  2. Redis queue holds pending jobs with priority tiers
  3. GPU worker pool runs one process per GPU, model weights loaded and resident in VRAM between jobs; workers pull from the queue
  4. Object storage (S3, Cloudflare R2, MinIO) holds generated clips; workers write output paths to a results store
  5. Webhook or poll endpoint notifies callers when the job completes

Keep workers alive between jobs. Wan 2.2 14B takes 30-60 seconds to load; reloading on every request cuts effective throughput in half.
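The queue-worker shape above can be sketched with the stdlib. Redis, object storage, and the real generation call are stubbed; `generate_clip` is a hypothetical placeholder for your pipeline call plus the upload:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the Redis queue
results = {}           # stands in for the results store a webhook/poll endpoint reads

def generate_clip(prompt: str) -> str:
    """Placeholder for the 4-12 minute Wan 2.2 generation plus object-storage upload."""
    return f"s3://clips/{abs(hash(prompt)) % 10000}.mp4"

def worker():
    """One worker per GPU: pulls jobs forever, weights stay resident between jobs."""
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = generate_clip(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put(("job-1", "water droplets on dark stone"))
jobs.join()  # a production gateway returns the job ID immediately and fires a webhook
```

The worker loop never exits, which is the point: the model load happens once at worker startup, not per job.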

Batching across clips

Video generation does not batch across clips on a single GPU. Each clip independently uses the full VRAM allocation. Submitting multiple jobs to one GPU causes OOM errors, not faster throughput. Scale by provisioning more instances, not by batching more clips per GPU. Four H100s run four concurrent generation jobs with linear throughput scaling and no inter-GPU coordination overhead.

NVENC handoff

After generation, use GPU-accelerated encoding to convert raw frames to delivery format:

```bash
# GPU hardware encoding via NVENC (no CPU bottleneck)
ffmpeg -hwaccel cuda -i input.mp4 \
  -c:v h264_nvenc -preset fast \
  -b:v 4M output.mp4
```

This keeps the entire generate-to-encode pipeline on the GPU and avoids CPU encoding becoming a throughput bottleneck at high clip volumes.

Watermarking

For adding a watermark overlay before delivery:

bash
ffmpeg -i clip.mp4 \
  -i watermark.png \
  -filter_complex "overlay=10:10" \
  output_watermarked.mp4

For safety filtering before delivery, run an NSFW classification model on the output frames before saving to object storage. This keeps the GPU pipeline separate from content moderation and lets you swap classifiers without touching the generation code.
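The moderation hook can be a thin predicate between generation and storage. A sketch; `nsfw_score` is a hypothetical stand-in for whatever classifier you run over sampled frames:

```python
def nsfw_score(frame) -> float:
    """Hypothetical classifier call; replace with a real NSFW model's per-frame score."""
    return 0.0

def passes_safety(frames, threshold: float = 0.85, sample_every: int = 8) -> bool:
    """Reject the clip if any sampled frame scores above the threshold.

    Sampling every Nth frame keeps the moderation pass cheap relative to
    the multi-minute generation step.
    """
    return all(nsfw_score(f) < threshold for f in frames[::sample_every])

# Only clips that pass are written to object storage; swapping the classifier
# never touches the generation code.
```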

Spheron vs Replicate and Fal.ai: Cost Per 5-Second Clip

For sustained Wan 2.2 workloads at 50+ clips per day, bare-metal hourly pricing on Spheron beats per-second API billing on Replicate and Fal.ai.

Break-even math at 720p, 5-second clips:

| Provider | Pricing Model | Cost per 5s 720p clip | 100 clips/day |
|---|---|---|---|
| Spheron H100 PCIe (on-demand) | Hourly compute | ~$0.34-0.40 | ~$34-40 |
| Replicate (Wan model) | Per-second GPU billing | ~$0.50-0.80 est. | ~$50-80 |
| Fal.ai (Wan model) | Per-generation or per-second | ~$0.60-1.00 est. | ~$60-100 |

Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Replicate bills per second of GPU compute, not per output second. For Wan 2.2 14B at 720p, 10-12 minutes of H100 time comes to ~$0.50-0.80 at typical Replicate GPU pricing. Fal.ai billing varies by model and tier.

At 50 clips per day, Spheron on-demand starts undercutting Replicate. At 100+ clips per day, the gap is substantial. Below 20 clips per day, the API services are cheaper because you pay nothing when idle.

The crossover point: if you are generating more than 40-60 clips per day consistently, self-hosted on Spheron is cheaper. Below that threshold, Replicate or Fal.ai avoids the infrastructure overhead without significant cost penalty.
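A sketch of that crossover under one explicit assumption, that the self-hosted H100 runs always-on so idle hours are billed too, paired with the per-clip API estimates from the table above:

```python
H100_HOURLY = 2.01  # Spheron H100 PCIe on-demand, $/hr

def self_hosted_daily(always_on_hours: float = 24.0) -> float:
    """Always-on instance: you pay for idle time as well as generation time."""
    return always_on_hours * H100_HOURLY

def crossover_clips(api_per_clip: float) -> float:
    """Clips/day at which an always-on H100 matches per-clip API billing."""
    return self_hosted_daily() / api_per_clip

print(round(crossover_clips(1.00)))  # Fal.ai upper estimate -> 48
print(round(crossover_clips(0.80)))  # Replicate upper estimate -> 60
```

Those bounds land in the 40-60 clips/day range quoted above; shutting the instance down overnight pushes the crossover lower.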

For the broader video model ecosystem comparison including HunyuanVideo and LTX-Video, see the AI video generation GPU guide. For image-to-video workflows covering Wan 2.2 I2V, LTX-Video, and Hunyuan Video Avatar, see the image-to-video deployment guide. For VRAM sizing across the full 2026 video AI landscape, see GPU cloud for video AI 2026.


Wan 2.2 production workloads on Spheron H100 and B200 bare-metal start at $2.01/hr on H100 PCIe and $2.06/hr per GPU on B200, with no per-output fees and full root access.

Rent H100 for Wan 2.2 → | Rent B200 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.