Tutorial

Deploy Wan 2.1/2.2 for AI Video: GPU Requirements and ComfyUI Setup

Written by Mitrasish, Co-founder · Mar 18, 2026

The Wan 2.1 14B model requires 65–80GB of VRAM at 720p. That rules out every consumer GPU including the RTX 5090 (32GB). If you want broadcast-quality AI video generation from an open-source model, you need datacenter hardware. This guide covers exactly which GPU to pick, how to set up a ComfyUI-based workflow using WanVideoWrapper, whether to run Wan 2.1 or Wan 2.2, and what each clip will actually cost you.

Wan 2.1 vs Wan 2.2: What Changed

Wan 2.2 (released July 28, 2025) is a meaningful architectural upgrade, not a minor patch. The key difference: Wan 2.1 uses a dense transformer, while Wan 2.2 switches to a Mixture-of-Experts (MoE) architecture. Wan 2.2's MoE is specific to the diffusion denoising process: a high-noise expert handles early denoising steps (overall layout and structure) and a low-noise expert takes over for later steps (fine detail refinement). The switch between experts is determined by signal-to-noise ratio (SNR) thresholds at each diffusion timestep, not per-token routing. Each expert has about 14B parameters, giving 27B total, but only 14B are active at any step. Inference compute and VRAM requirements stay nearly unchanged from Wan 2.1.
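
The SNR-threshold switching described above can be sketched in a few lines. This is an illustrative model only: the function names and the 0.875 boundary are hypothetical stand-ins, and the real switch point is derived from the model's diffusion noise schedule.

```python
# Illustrative sketch of Wan 2.2's timestep-based expert selection.
# The 0.875 boundary and function names are hypothetical stand-ins;
# the real switch point comes from the diffusion noise schedule.

def select_expert(timestep: int, total_steps: int, boundary: float = 0.875) -> str:
    """Pick the active ~14B expert for one denoising step.

    High-noise (early) steps shape layout and structure; low-noise
    (late) steps refine fine detail. Only one expert runs per step,
    which is why inference compute matches the dense Wan 2.1.
    """
    noise_level = timestep / total_steps  # diffusion counts timesteps down
    return "high_noise_expert" if noise_level >= boundary else "low_noise_expert"

# Walk a 50-step schedule and count how many steps each expert handles
counts = {"high_noise_expert": 0, "low_noise_expert": 0}
for t in range(50, 0, -1):
    counts[select_expert(t, total_steps=50)] += 1
```

With this boundary, the high-noise expert handles only the first few (structural) steps and the low-noise expert does the bulk of the refinement, but both are full ~14B models, which is where the 27B total parameter count comes from.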

Beyond the architecture change, Wan 2.2 was trained on a substantially larger dataset. Compared to Wan 2.1: 65.6% more images and 83.2% more videos. The result is noticeable in three areas:

  • Motion coherence: Objects and characters maintain consistent appearance across frames better in Wan 2.2.
  • Instruction following: Complex prompts with multiple subjects or specific motion descriptions produce more accurate output.
  • Structural stability: Camera motion and scene transitions are smoother with fewer geometric artifacts.

VRAM requirements are essentially unchanged between versions. Your existing H100 or H200 setup runs Wan 2.2 without any hardware changes. You just swap the model weights.

| Model | Architecture | Training Data (vs 2.1) | Quality Tier | VRAM (14B 720p) | Weights |
|---|---|---|---|---|---|
| Wan 2.1 | Dense transformer | Baseline | High | 65–80GB | Wan-AI/Wan2.1-T2V-14B |
| Wan 2.2 | Mixture-of-Experts | +65.6% images, +83.2% videos | High+ | 65–80GB | Wan-AI/Wan2.2-T2V-A14B |

For new self-hosted deployments, use Wan 2.2 weights. For existing setups, the upgrade is a weight swap with no infrastructure changes required. Alibaba has since released Wan 2.5-Preview (September 2025, a multimodal audio-video model accessed via Alibaba Cloud APIs) and Wan 2.6 (December 2025), but neither version published model weights through the official Wan-AI open-source channels. As of March 2026, Wan 2.2 remains the latest version with publicly available weights for self-deployment. Check the Wan 2.2 GitHub for the latest release notes before downloading.

Model Variants: 1.3B vs 14B

The model size decision drives everything else: which GPU you need, what the output quality will be, and what each clip will cost. The two variants are genuinely different products.

| Variant | VRAM (480p) | VRAM (720p) | Min GPU | Output Quality | Use Case |
|---|---|---|---|---|---|
| 1.3B T2V | 8–12GB | 16–20GB | RTX 4090 | Good | Local testing, rapid prototyping, cost-sensitive |
| 14B T2V | 40–48GB (FP8) | 65–80GB | H100 PCIe | Broadcast-quality | Production pipelines, commercial output |
| 14B I2V | 40–48GB (FP8) | 65–80GB | H100 PCIe | Broadcast-quality | Image-to-video, character consistency |

The 1.3B model fits on a consumer RTX 4090 (24GB) or RTX 5090 (32GB). It produces usable video, but the quality gap versus the 14B model is visible in motion clarity, temporal consistency, and fine detail. For prototyping workflows and testing prompt strategies, the 1.3B on a cheaper GPU makes sense. For anything shipping to users, use the 14B.

The I2V (image-to-video) variant has the same VRAM profile as T2V, so a pipeline that animates a specific reference image has identical hardware requirements.

GPU Requirements by Resolution and Duration

| Config | Resolution | Duration | VRAM Required | Min GPU | Notes |
|---|---|---|---|---|---|
| Wan 2.1/2.2 1.3B | 480p (832×480) | 5s | 8–12GB | RTX 4090 | Consumer-viable |
| Wan 2.1/2.2 1.3B | 720p (1280×720) | 5s | 16–20GB | RTX 4090 | Tight on 24GB |
| Wan 2.1/2.2 14B | 480p | 5s | ~40–48GB (FP8) | H100 PCIe | FP8 required |
| Wan 2.1/2.2 14B | 720p | 5s | ~65–80GB | H100 PCIe | Tight; OOM risk on PCIe |
| Wan 2.1/2.2 14B | 720p | 10s | 80GB+ | H200 | Exceeds H100 capacity |

The jump from 480p to 720p is significant. Pixel count increases roughly 2.3x, and while transformer attention memory grows quadratically with token count, weights and most activations grow linearly, so in practice VRAM requirements increase roughly 2–3x. Going from 5 seconds to 10 seconds at 720p pushes you past 80GB, which is why the H200 is the right GPU for longer clips.
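
The scaling argument can be made concrete with a back-of-envelope calculation. This is a sketch only: real usage depends on the attention implementation (FlashAttention-style kernels avoid materializing the full attention matrix) and on how much of the footprint is weights versus activations.

```python
# Back-of-envelope scaling from 480p (832×480) to 720p (1280×720).
# A sketch only: real memory depends on the attention kernel and on
# the weights-vs-activations split.

p480 = 832 * 480            # 399,360 pixels
p720 = 1280 * 720           # 921,600 pixels
pixel_ratio = p720 / p480   # ~2.31x more pixels

# With a fixed patch size, tokens scale linearly with pixels, so a
# naive attention matrix scales with the square of the pixel ratio.
attn_memory_ratio = pixel_ratio ** 2  # ~5.3x in the worst case

# Observed VRAM growth (~2-3x) sits between the linear and quadratic
# bounds because weights and most activations scale linearly.
```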

GPU selection guide:

  • RTX 5090 (32GB, $0.76/hr on-demand, no spot pricing): 1.3B model development and testing. Does not run the 14B at any resolution.
  • H100 PCIe (80GB, $2.01/hr on-demand): 14B model at 480p–720p (5s). Tight VRAM margin at 720p; FP8 quantization required.
  • H100 SXM5 (80GB, $2.50/hr on-demand, $0.99/hr spot): Same VRAM as PCIe, but 3.35 TB/s memory bandwidth vs 2 TB/s cuts generation time by ~25% for video workloads. Preferred for 14B at 720p.
  • H200 SXM (141GB, $4.54/hr on-demand, no spot pricing): 720p 10-second clips, reliable production runs with VRAM headroom, no OOM risk.

Step-by-Step: ComfyUI + Wan 2.1 on Spheron H100

This walkthrough uses the ComfyUI-WanVideoWrapper custom node package, which adds native Wan 2.1/2.2 support to ComfyUI. The node-based interface lets you build reusable workflows, chain image-to-video generation, and iterate faster than CLI-only approaches.

Step 1: Launch an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance. For 720p 14B generation, the SXM5 at $2.50/hr on-demand ($0.99/hr spot) is recommended for its higher memory bandwidth. For 480p work, the PCIe at $2.01/hr on-demand works fine.

Choose Ubuntu 22.04 as your OS. Do not expose port 8188 in your network settings. ComfyUI has no built-in authentication. You will access it via SSH tunnel instead.

Step 2: Deploy ComfyUI via Docker

SSH into your instance, then run:

```bash
# latest-cuda is a floating tag; the image can be updated by the maintainer at any time.
# For stronger supply-chain assurance, pin by digest:
#   docker pull ghcr.io/ai-dock/comfyui:latest-cuda
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# Then replace IMAGE below with the returned sha256 digest reference.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

The -v flags persist model files and outputs across container restarts. --ipc=host is required for PyTorch's shared memory. -p 127.0.0.1:8188:8188 binds ComfyUI to localhost only, so it is never reachable from outside the instance.

Step 3: Install ComfyUI-WanVideoWrapper

Enter the running container:

```bash
docker exec -it comfyui bash
```

Navigate to the custom nodes directory and clone the wrapper:

```bash
cd /opt/ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt
```

Exit the container and restart it to register the new nodes:

```bash
exit
docker restart comfyui
```

Custom node packages update frequently. If you hit installation errors, check the WanVideoWrapper GitHub for current installation instructions before debugging the requirements file.

Step 4: Download Wan 2.1 model weights

On the host (not inside the container), download the weights directly into the mounted model directory:

```bash
pip install huggingface_hub

# 14B text-to-video model (~69GB total: DiT weights ~57GB + T5 encoder ~11GB + VAE ~0.5GB)
# Download takes 30–90 minutes depending on connection speed
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
  --local-dir ~/comfyui-models/wan-t2v-14b

# For the 1.3B variant (smaller, consumer-GPU friendly):
# huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
#   --local-dir ~/comfyui-models/wan-t2v-1.3b
```

For Wan 2.2, the download pattern is the same:

```bash
# Wan 2.2 14B text-to-video model (same VRAM requirements as Wan 2.1)
# Check https://github.com/Wan-Video/Wan2.2 for the current HuggingFace repo name
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir ~/comfyui-models/wan-t2v-14b-v2
```

The weights are mounted into the container via the -v ~/comfyui-models:/opt/ComfyUI/models flag from Step 2, so they're immediately available inside ComfyUI without re-entering the container.
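
As a quick sanity check that the multi-hour download completed, you can total the directory size on the host. A minimal sketch; the path matches the --local-dir used above.

```python
# Sanity-check the download by totaling the directory size on the host.
# A minimal sketch; the path matches the --local-dir used above.
import os

def dir_size_gb(path: str) -> float:
    """Sum the size of every file under `path`, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

size = dir_size_gb(os.path.expanduser("~/comfyui-models/wan-t2v-14b"))
print(f"{size:.1f} GB on disk")  # ~69 GB once the 14B T2V download completes
```

If the total is well short of ~69GB, re-run the huggingface-cli download command; it resumes partial downloads.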

Step 5: Access via SSH tunnel and run first generation

On your local machine:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

Replace user and your-server-ip with your instance credentials from the Spheron dashboard. While the tunnel is open, navigate to http://localhost:8188 in your browser. ComfyUI's node graph interface will load.

Load a Wan 2.1 workflow JSON. Community sources include comfyworkflows.com and the WanVideoWrapper GitHub repository. Set your text prompt, select the model checkpoint from the dropdown (the 14B weights you downloaded will appear in the list), and queue the generation.

Expected generation times on H100 PCIe:

  • 480p, 5 seconds, 14B: approximately 4 minutes
  • 720p, 5 seconds, 14B: approximately 10–12 minutes

Watch VRAM usage during the first run. ComfyUI displays memory stats in its terminal output; you can also check from another SSH session with nvidia-smi.

FP8 Quantization for the 14B Model

Without quantization, the 14B model at 720p uses approximately 65–80GB. FP8 quantization reduces this to roughly 40–50GB, which makes 480p generation viable on the H100 PCIe and gives more margin for 720p.
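
The weight-level arithmetic behind those numbers is straightforward. This is a rough sketch: weights are only part of the footprint, and activations, the T5 encoder, and the VAE account for the rest of the 65–80GB total.

```python
# Weight-memory arithmetic for the 14B DiT at each precision. Weights
# are only part of the footprint: activations, the T5 encoder, and the
# VAE account for the rest of the 65-80GB total cited above.

PARAMS = 14e9  # 14B parameters

def weight_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

bf16_weights = weight_gb(PARAMS, 2)  # 28.0 GB at 2 bytes/param
fp8_weights = weight_gb(PARAMS, 1)   # 14.0 GB at 1 byte/param
weight_savings = 1 - fp8_weights / bf16_weights  # 50% on weights alone

# Total VRAM drops by less than 50% (65-80GB -> ~40-50GB) because
# activations and the attention workspace stay at higher precision.
```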

In ComfyUI with WanVideoWrapper, FP8 is enabled through the model loader node settings, not CLI flags. Look for a precision or dtype option in the WanVideoModelLoader node and set it to fp8_e4m3fn or equivalent. The exact option name evolves with node releases.

Note that the CLI flags referenced in some older guides (--dit_fsdp_num, --t5_fsdp_num) are for the Wan CLI, not ComfyUI. Do not conflate the two. Always check the WanVideoWrapper GitHub for current ComfyUI-specific quantization options.

Quality impact: FP8 introduces a minor visual quality reduction versus BF16, typically visible only on fine textures and very small on-screen details. For most production use, FP8 output is acceptable. Generate a few test clips at both precisions to evaluate before committing to a pipeline.

Cost Per Video at Different Resolutions

Prices as of 17 Mar 2026. Pricing can fluctuate over time based on availability of GPUs. Check current GPU pricing before building cost models.

| Model | Resolution | Duration | GPU | Rate | Gen Time | Cost per Clip |
|---|---|---|---|---|---|---|
| Wan 2.1 14B | 480p | 5s | H100 SXM5 | $2.50/hr OD | ~4–5 min | ~$0.17–0.21 |
| Wan 2.1 14B | 720p | 5s | H100 SXM5 | $2.50/hr OD | ~10–12 min | ~$0.42–0.50 |
| Wan 2.1 14B | 720p | 5s | H100 SXM5 | $0.99/hr Spot | ~10–12 min | ~$0.17–0.20 |
| Wan 2.1 14B | 720p | 5s | H200 SXM | $4.54/hr OD | ~8–10 min | ~$0.61–0.76 |
| Wan 2.1 1.3B | 480p | 5s | RTX 5090 | $0.76/hr OD | ~2–3 min | ~$0.03–0.04 |

Cost per second of output video at 720p on H100 SXM5: approximately $0.084–0.100.

Spot vs on-demand decision: Spot pricing cuts cost by approximately 60% on H100 SXM5 ($0.99/hr vs $2.50/hr on-demand). But video generation jobs typically take 10–25 minutes per clip, and a spot preemption mid-generation loses the entire job. Use spot instances for batch processing where you have checkpointing or can afford to retry. For interactive generation via ComfyUI, use on-demand.

A production pipeline generating 1,000 clips per day at 720p (5s each) on H100 SXM5 on-demand costs approximately $420–500 per day. At spot pricing, roughly $165–200 per day. Multiple concurrent GPU instances scale throughput linearly since each video generation job is fully independent.
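
The per-clip and per-day arithmetic behind those figures can be reproduced in a few lines, using the rates and generation times stated above:

```python
# Per-clip and per-day cost arithmetic, using the guide's Mar 2026
# H100 SXM5 on-demand rate and 720p generation times.

def clip_cost(rate_per_hr: float, gen_minutes: float) -> float:
    """Cost of one clip = hourly rate prorated over generation time."""
    return rate_per_hr * gen_minutes / 60

# 720p, 5s clip on H100 SXM5 on-demand at ~10-12 min per clip
per_clip_lo = clip_cost(2.50, 10)  # ~$0.42
per_clip_hi = clip_cost(2.50, 12)  # $0.50

# 1,000 clips per day
daily_lo = per_clip_lo * 1000  # ~$417
daily_hi = per_clip_hi * 1000  # $500
```

Swapping in the $0.99/hr spot rate gives the ~$165–200/day figure, with the preemption caveat noted above.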

Wan 2.1/2.2 vs HunyuanVideo vs LTX-2.3

| Model | Quality Tier | 720p VRAM | Gen Time (5s) | Cost/sec output | Best For |
|---|---|---|---|---|---|
| Wan 2.1/2.2 14B | High | 65–80GB | ~10–12 min (H100 SXM5) | ~$0.084–0.100 | Production, cost-efficient |
| HunyuanVideo 13B (original) | Highest | 60–80GB (80GB recommended) | ~15–25 min (H100 SXM5) | ~$0.14–0.21 | Max quality, motion realism |
| LTX-2.3 (22B) | High | 32GB+ | ~5–8 min (H100 SXM5) | ~$0.04–0.07 | Fastest at quality tier |
| Wan 2.1 1.3B | Medium | 16–20GB | ~3–4 min (RTX 5090) | ~$0.008–0.012 | Local/testing |

Which to use:

Wan 2.1/2.2 14B is the default for new production video AI projects. It produces broadcast-quality output at the best cost efficiency in the high-quality tier. The H100 PCIe covers most use cases, and the hardware is well-supported by both CLI and ComfyUI tooling.

HunyuanVideo (original 13B) benchmarks ahead on motion realism and scene coherence. If those are your primary quality metrics and you have H200 budget ($4.54/hr), it's worth evaluating. On H100, HunyuanVideo runs at exactly the recommended 80GB VRAM threshold. Longer clips or memory overhead during ComfyUI inference can push usage past that limit. For consistent production reliability, H200 is the safer choice. The cost per second of output is also 1.5–2x higher. Note: Tencent released HunyuanVideo-1.5 (8.3B parameters) in November 2025, which runs on consumer GPUs with a minimum 14GB VRAM (with model offloading enabled) at lower quality. That version is better suited for prototyping than production datacenter workloads.

LTX-2.3 (Lightricks, 22B parameters) requires at least 32GB VRAM as a baseline. With FP8 or GGUF quantization it can squeeze onto smaller cards, but official support starts at 32GB. At $0.04–0.07 per second of output and faster generation times than Wan 2.1, it is the pick when throughput matters more than top-tier motion quality at the high tier.

For a full comparison of video AI models and VRAM requirements across the entire open-source landscape, see GPU Cloud for Video AI 2026.

Optimizing Generation Speed and VRAM

FP8 quantization

Already covered in the ComfyUI section, but to summarize: FP8 reduces VRAM by roughly 20–40% versus BF16 at a minor quality cost. For the 14B model at 720p, this is the difference between fitting on H100 PCIe and exceeding it. The tradeoff is acceptable for most production pipelines. Exact ComfyUI node settings evolve with each release; always check the WanVideoWrapper GitHub for current options rather than hard-coding version-specific values.

Resolution staging

Generate at 480p for composition review, then re-run at 720p for the final output. A 480p clip on H100 SXM5 costs $0.17–0.21; the same 720p clip costs $0.42–0.50, roughly 2.4x more per iteration. For a pipeline that iterates 10 times before settling on a final clip, resolution staging cuts iteration cost from roughly $4–5 down to $1.70–2.10.

This is the highest-leverage optimization available. Most motion and composition issues are visible at 480p. Scale up only for finals.
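
The staging comparison above works out as follows, using the guide's per-clip costs on H100 SXM5 on-demand:

```python
# Iteration-cost comparison for resolution staging, using the guide's
# per-clip costs (480p: $0.17-0.21, 720p: $0.42-0.50, H100 SXM5 OD).

ITERATIONS = 10
COST_720 = (0.42, 0.50)  # (low, high) per clip
COST_480 = (0.17, 0.21)

all_720p = tuple(c * ITERATIONS for c in COST_720)  # $4.20-5.00
staged = tuple(c * ITERATIONS for c in COST_480)    # $1.70-2.10 for drafts

# Budget one extra 720p render for the final clip on top of the drafts:
staged_with_final = tuple(d + f for d, f in zip(staged, COST_720))  # ~$2.12-2.60
```

Even counting the final 720p render, staging roughly halves the cost of a 10-iteration cycle.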

Batch vs sequential

Video generation does not batch across clips the way image generation does. Each clip independently consumes the full VRAM allocation. Putting multiple video jobs on a single GPU does not increase throughput; it causes OOM errors.

For production pipelines: run one generation job per GPU instance, and scale by provisioning additional instances. Four H100s run four concurrent generation jobs with linear throughput scaling and no inter-GPU coordination overhead. See GPU Cloud for Video AI 2026 for the multi-GPU scaling architecture section.
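
The one-job-per-GPU dispatch pattern can be sketched with a simple instance pool. The instance names and the generate() call are hypothetical placeholders for your own ComfyUI API client; only the concurrency structure is the point.

```python
# One-job-per-instance dispatch sketch. Instance names and generate()
# are hypothetical placeholders for your own ComfyUI API client.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

INSTANCES = ["h100-a", "h100-b", "h100-c", "h100-d"]  # one GPU each

def generate(instance: str, prompt: str) -> str:
    # Placeholder: submit the prompt to ComfyUI on `instance` here.
    return f"{instance}:{prompt}"

def run_batch(prompts: list[str]) -> list[str]:
    free = Queue()
    for inst in INSTANCES:
        free.put(inst)

    def worker(prompt: str) -> str:
        inst = free.get()  # block until a GPU instance is free
        try:
            return generate(inst, prompt)
        finally:
            free.put(inst)  # release it for the next job

    # max_workers == instance count guarantees one job per GPU,
    # which avoids the OOM failure mode described above.
    with ThreadPoolExecutor(max_workers=len(INSTANCES)) as ex:
        return list(ex.map(worker, prompts))
```

Because each clip is independent, adding a fifth instance to INSTANCES raises throughput by exactly one job slot with no coordination overhead.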


Wan 2.1 and Wan 2.2 are running in production on Spheron's H100 and H200 GPUs today. Provision an instance in minutes, no contract required.

Rent an H100 for Wan 2.1 →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.