Tutorial

Deploy Wan 2.1/2.2 for AI Video: GPU Requirements and ComfyUI Setup

Written by Mitrasish, Co-founder · Mar 18, 2026

The Wan 2.1 14B model requires 65–80GB of VRAM at 720p. That rules out every consumer GPU including the RTX 5090 (32GB). If you want broadcast-quality AI video generation from an open-source model, you need datacenter hardware. This guide covers exactly which GPU to pick, how to set up a ComfyUI-based workflow using WanVideoWrapper, whether to run Wan 2.1 or Wan 2.2, and what each clip will actually cost you.

Wan 2.1 vs Wan 2.2: What Changed

Wan 2.2 (released July 28, 2025) is a meaningful architectural upgrade, not a minor patch. The key difference: Wan 2.1 uses a dense transformer, while Wan 2.2 switches to a Mixture-of-Experts (MoE) architecture. Wan 2.2's MoE is specific to the diffusion denoising process: a high-noise expert handles early denoising steps (overall layout and structure) and a low-noise expert takes over for later steps (fine detail refinement). The switch between experts is determined by signal-to-noise ratio (SNR) thresholds at each diffusion timestep, not per-token routing. Each expert has about 14B parameters, giving 27B total, but only 14B are active at any step. Inference compute and VRAM requirements stay nearly unchanged from Wan 2.1.
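
The SNR-threshold switching described above can be sketched in a few lines. This is an illustrative model only: the function names and the 0.875 boundary are hypothetical stand-ins, and the real switch point is derived from the model's diffusion noise schedule.

```python
# Illustrative sketch of Wan 2.2's timestep-based expert selection.
# The 0.875 boundary and function names are hypothetical stand-ins;
# the real switch point comes from the diffusion noise schedule.

def select_expert(timestep: int, total_steps: int, boundary: float = 0.875) -> str:
    """Pick the active ~14B expert for one denoising step.

    High-noise (early) steps shape layout and structure; low-noise
    (late) steps refine fine detail. Only one expert runs per step,
    which is why inference compute matches the dense Wan 2.1.
    """
    noise_level = timestep / total_steps  # diffusion counts timesteps down
    return "high_noise_expert" if noise_level >= boundary else "low_noise_expert"

# Walk a 50-step schedule and count how many steps each expert handles
counts = {"high_noise_expert": 0, "low_noise_expert": 0}
for t in range(50, 0, -1):
    counts[select_expert(t, total_steps=50)] += 1
```

With this boundary, the high-noise expert handles only the first few (structural) steps and the low-noise expert does the bulk of the refinement, but both are full ~14B models, which is where the 27B total parameter count comes from.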

Beyond the architecture change, Wan 2.2 was trained on a substantially larger dataset. Compared to Wan 2.1: 65.6% more images and 83.2% more videos. The result is noticeable in three areas:

  • Motion coherence: Objects and characters maintain consistent appearance across frames better in Wan 2.2.
  • Instruction following: Complex prompts with multiple subjects or specific motion descriptions produce more accurate output.
  • Structural stability: Camera motion and scene transitions are smoother with fewer geometric artifacts.

VRAM requirements are essentially unchanged between versions. Your existing H100 or H200 setup runs Wan 2.2 without any hardware changes. You just swap the model weights.

| Model | Architecture | Training Data (vs 2.1) | Quality Tier | VRAM (14B 720p) | Weights |
|---|---|---|---|---|---|
| Wan 2.1 | Dense transformer | Baseline | High | 65–80GB | Wan-AI/Wan2.1-T2V-14B |
| Wan 2.2 | Mixture-of-Experts | +65.6% images, +83.2% videos | High+ | 65–80GB | Wan-AI/Wan2.2-T2V-A14B |

For new self-hosted deployments, use Wan 2.2 weights. For existing setups, the upgrade is a weight swap with no infrastructure changes required. Alibaba has since released Wan 2.5-Preview (September 2025, a multimodal audio-video model accessed via Alibaba Cloud APIs) and Wan 2.6 (December 2025), but neither version published model weights through the official Wan-AI open-source channels. As of March 2026, Wan 2.2 remains the latest version with publicly available weights for self-deployment. Check the Wan 2.2 GitHub for the latest release notes before downloading.

Model Variants: 1.3B vs 14B

The model size decision drives everything else: which GPU you need, what the output quality will be, and what each clip will cost. The two variants are genuinely different products.

| Variant | VRAM (480p) | VRAM (720p) | Min GPU | Output Quality | Use Case |
|---|---|---|---|---|---|
| 1.3B T2V | 8–12GB | 16–20GB | RTX 4090 | Good | Local testing, rapid prototyping, cost-sensitive |
| 14B T2V | 40–48GB (FP8) | 65–80GB | H100 PCIe | Broadcast-quality | Production pipelines, commercial output |
| 14B I2V | 40–48GB (FP8) | 65–80GB | H100 PCIe | Broadcast-quality | Image-to-video, character consistency |

The 1.3B model fits on a consumer RTX 4090 (24GB) or RTX 5090 (32GB). It produces usable video, but the quality gap versus the 14B model is visible in motion clarity, temporal consistency, and fine detail. For prototyping workflows and testing prompt strategies, the 1.3B on a cheaper GPU makes sense. For anything shipping to users, use the 14B.

The I2V (image-to-video) variant has the same VRAM profile as T2V, so a pipeline that animates a specific reference image has identical hardware requirements.

GPU Requirements by Resolution and Duration

| Config | Resolution | Duration | VRAM Required | Min GPU | Notes |
|---|---|---|---|---|---|
| Wan 2.1/2.2 1.3B | 480p (832×480) | 5s | 8–12GB | RTX 4090 | Consumer-viable |
| Wan 2.1/2.2 1.3B | 720p (1280×720) | 5s | 16–20GB | RTX 4090 | Tight on 24GB |
| Wan 2.1/2.2 14B | 480p | 5s | ~40–48GB (FP8) | H100 PCIe | FP8 required |
| Wan 2.1/2.2 14B | 720p | 5s | ~65–80GB | H100 PCIe | Tight; OOM risk on PCIe |
| Wan 2.1/2.2 14B | 720p | 10s | 80GB+ | H200 | Exceeds H100 capacity |

The jump from 480p to 720p is significant. Pixel count increases roughly 2.3x, and while transformer attention memory grows quadratically with token count, weights and most activations grow linearly, so in practice VRAM requirements increase roughly 2–3x. Going from 5 seconds to 10 seconds at 720p pushes you past 80GB, which is why the H200 is the right GPU for longer clips.
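
The scaling argument can be made concrete with a back-of-envelope calculation. This is a sketch only: real usage depends on the attention implementation (FlashAttention-style kernels avoid materializing the full attention matrix) and on how much of the footprint is weights versus activations.

```python
# Back-of-envelope scaling from 480p (832×480) to 720p (1280×720).
# A sketch only: real memory depends on the attention kernel and on
# the weights-vs-activations split.

p480 = 832 * 480            # 399,360 pixels
p720 = 1280 * 720           # 921,600 pixels
pixel_ratio = p720 / p480   # ~2.31x more pixels

# With a fixed patch size, tokens scale linearly with pixels, so a
# naive attention matrix scales with the square of the pixel ratio.
attn_memory_ratio = pixel_ratio ** 2  # ~5.3x in the worst case

# Observed VRAM growth (~2-3x) sits between the linear and quadratic
# bounds because weights and most activations scale linearly.
```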

GPU selection guide:

  • RTX 5090 (32GB, $0.76/hr on-demand, no spot pricing): 1.3B model development and testing. Does not run the 14B at any resolution.
  • H100 PCIe (80GB, $2.01/hr on-demand): 14B model at 480p–720p (5s). Tight VRAM margin at 720p; FP8 quantization required.
  • H100 SXM5 (80GB, $2.50/hr on-demand, $0.99/hr spot): Same VRAM as PCIe, but 3.35 TB/s memory bandwidth vs 2 TB/s cuts generation time by ~25% for video workloads. Preferred for 14B at 720p.
  • H200 SXM (141GB, $4.54/hr on-demand, no spot pricing): 720p 10-second clips, reliable production runs with VRAM headroom, no OOM risk.

Step-by-Step: ComfyUI + Wan 2.1 on Spheron H100

This walkthrough uses the ComfyUI-WanVideoWrapper custom node package, which adds native Wan 2.1/2.2 support to ComfyUI. The node-based interface lets you build reusable workflows, chain image-to-video generation, and iterate faster than CLI-only approaches.

Step 1: Launch an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance. For 720p 14B generation, the SXM5 at $2.50/hr on-demand ($0.99/hr spot) is recommended for its higher memory bandwidth. For 480p work, the PCIe at $2.01/hr on-demand works fine.

Choose Ubuntu 22.04 as your OS. Do not expose port 8188 in your network settings. ComfyUI has no built-in authentication. You will access it via SSH tunnel instead.

Step 2: Deploy ComfyUI via Docker

SSH into your instance, then run:

```bash
# latest-cuda is a floating tag; the image can be updated by the maintainer at any time.
# For stronger supply-chain assurance, pin by digest:
#   docker pull ghcr.io/ai-dock/comfyui:latest-cuda
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# Then replace IMAGE below with the returned sha256 digest reference.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

The -v flags persist model files and outputs across container restarts. --ipc=host is required for PyTorch's shared memory. -p 127.0.0.1:8188:8188 binds ComfyUI to localhost only, so it is never reachable from outside the instance.

Step 3: Install ComfyUI-WanVideoWrapper

Enter the running container:

```bash
docker exec -it comfyui bash
```

Navigate to the custom nodes directory and clone the wrapper:

```bash
cd /opt/ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt
```

Exit the container and restart it to register the new nodes:

```bash
exit
docker restart comfyui
```

Custom node packages update frequently. If you hit installation errors, check the WanVideoWrapper GitHub for current installation instructions before debugging the requirements file.

Step 4: Download Wan 2.1 model weights

On the host (not inside the container), download the weights directly into the mounted model directory:

```bash
pip install huggingface_hub

# 14B text-to-video model (~69GB total: DiT weights ~57GB + T5 encoder ~11GB + VAE ~0.5GB)
# Download takes 30–90 minutes depending on connection speed
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
  --local-dir ~/comfyui-models/wan-t2v-14b

# For the 1.3B variant (smaller, consumer-GPU friendly):
# huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
#   --local-dir ~/comfyui-models/wan-t2v-1.3b
```

For Wan 2.2, the download pattern is the same:

```bash
# Wan 2.2 14B text-to-video model (same VRAM requirements as Wan 2.1)
# Check https://github.com/Wan-Video/Wan2.2 for the current HuggingFace repo name
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir ~/comfyui-models/wan-t2v-14b-v2
```

The weights are mounted into the container via the -v ~/comfyui-models:/opt/ComfyUI/models flag from Step 2, so they're immediately available inside ComfyUI without re-entering the container.
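
As a quick sanity check that the multi-hour download completed, you can total the directory size on the host. A minimal sketch; the path matches the --local-dir used above.

```python
# Sanity-check the download by totaling the directory size on the host.
# A minimal sketch; the path matches the --local-dir used above.
import os

def dir_size_gb(path: str) -> float:
    """Sum the size of every file under `path`, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

size = dir_size_gb(os.path.expanduser("~/comfyui-models/wan-t2v-14b"))
print(f"{size:.1f} GB on disk")  # ~69 GB once the 14B T2V download completes
```

If the total is well short of ~69GB, re-run the huggingface-cli download command; it resumes partial downloads.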

Step 5: Access via SSH tunnel and run first generation

On your local machine:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

Replace user and your-server-ip with your instance credentials from the Spheron dashboard. While the tunnel is open, navigate to http://localhost:8188 in your browser. ComfyUI's node graph interface will load.

Load a Wan 2.1 workflow JSON. Community sources include comfyworkflows.com and the WanVideoWrapper GitHub repository. Set your text prompt, select the model checkpoint from the dropdown (the 14B weights you downloaded will appear in the list), and queue the generation.

Expected generation times on H100 PCIe:

  • 480p, 5 seconds, 14B: approximately 4 minutes
  • 720p, 5 seconds, 14B: approximately 10–12 minutes

Watch VRAM usage during the first run. ComfyUI displays memory stats in its terminal output; you can also check from another SSH session with nvidia-smi.

FP8 Quantization for the 14B Model

Without quantization, the 14B model at 720p uses approximately 65–80GB. FP8 quantization reduces this to roughly 40–50GB, which makes 480p generation viable on the H100 PCIe and gives more margin for 720p.
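
The weight-level arithmetic behind those numbers is straightforward. This is a rough sketch: weights are only part of the footprint, and activations, the T5 encoder, and the VAE account for the rest of the 65–80GB total.

```python
# Weight-memory arithmetic for the 14B DiT at each precision. Weights
# are only part of the footprint: activations, the T5 encoder, and the
# VAE account for the rest of the 65-80GB total cited above.

PARAMS = 14e9  # 14B parameters

def weight_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

bf16_weights = weight_gb(PARAMS, 2)  # 28.0 GB at 2 bytes/param
fp8_weights = weight_gb(PARAMS, 1)   # 14.0 GB at 1 byte/param
weight_savings = 1 - fp8_weights / bf16_weights  # 50% on weights alone

# Total VRAM drops by less than 50% (65-80GB -> ~40-50GB) because
# activations and the attention workspace stay at higher precision.
```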

In ComfyUI with WanVideoWrapper, FP8 is enabled through the model loader node settings, not CLI flags. Look for a precision or dtype option in the WanVideoModelLoader node and set it to fp8_e4m3fn or equivalent. The exact option name evolves with node releases.

Note that the CLI flags referenced in some older guides (--dit_fsdp_num, --t5_fsdp_num) are for the Wan CLI, not ComfyUI. Do not conflate the two. Always check the WanVideoWrapper GitHub for current ComfyUI-specific quantization options.

Quality impact: FP8 introduces a minor visual quality reduction versus BF16, typically visible only on fine textures and very small on-screen details. For most production use, FP8 output is acceptable. Generate a few test clips at both precisions to evaluate before committing to a pipeline.

Cost Per Video at Different Resolutions

Prices as of 17 Mar 2026. Pricing can fluctuate over time based on availability of GPUs. Check current GPU pricing before building cost models.

| Model | Resolution | Duration | GPU | Rate | Gen Time | Cost per Clip |
|---|---|---|---|---|---|---|
| Wan 2.1 14B | 480p | 5s | H100 SXM5 | $2.50/hr OD | ~4–5 min | ~$0.17–0.21 |
| Wan 2.1 14B | 720p | 5s | H100 SXM5 | $2.50/hr OD | ~10–12 min | ~$0.42–0.50 |
| Wan 2.1 14B | 720p | 5s | H100 SXM5 | $0.99/hr Spot | ~10–12 min | ~$0.17–0.20 |
| Wan 2.1 14B | 720p | 5s | H200 SXM | $4.54/hr OD | ~8–10 min | ~$0.61–0.76 |
| Wan 2.1 1.3B | 480p | 5s | RTX 5090 | $0.76/hr OD | ~2–3 min | ~$0.03–0.04 |

Cost per second of output video at 720p on H100 SXM5: approximately $0.084–0.100.

Spot vs on-demand decision: Spot pricing cuts cost by approximately 60% on H100 SXM5 ($0.99/hr vs $2.50/hr on-demand). But video generation jobs typically take 10–25 minutes per clip, and a spot preemption mid-generation loses the entire job. Use spot instances for batch processing where you have checkpointing or can afford to retry. For interactive generation via ComfyUI, use on-demand.

A production pipeline generating 1,000 clips per day at 720p (5s each) on H100 SXM5 on-demand costs approximately $420–500 per day. At spot pricing, roughly $165–200 per day. Multiple concurrent GPU instances scale throughput linearly since each video generation job is fully independent.
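
The per-clip and per-day arithmetic behind those figures can be reproduced in a few lines, using the rates and generation times stated above:

```python
# Per-clip and per-day cost arithmetic, using the guide's Mar 2026
# H100 SXM5 on-demand rate and 720p generation times.

def clip_cost(rate_per_hr: float, gen_minutes: float) -> float:
    """Cost of one clip = hourly rate prorated over generation time."""
    return rate_per_hr * gen_minutes / 60

# 720p, 5s clip on H100 SXM5 on-demand at ~10-12 min per clip
per_clip_lo = clip_cost(2.50, 10)  # ~$0.42
per_clip_hi = clip_cost(2.50, 12)  # $0.50

# 1,000 clips per day
daily_lo = per_clip_lo * 1000  # ~$417
daily_hi = per_clip_hi * 1000  # $500
```

Swapping in the $0.99/hr spot rate gives the ~$165–200/day figure, with the preemption caveat noted above.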

Wan 2.1/2.2 vs HunyuanVideo vs LTX-2.3

| Model | Quality Tier | 720p VRAM | Gen Time (5s) | Cost/sec output | Best For |
|---|---|---|---|---|---|
| Wan 2.1/2.2 14B | High | 65–80GB | ~10–12 min (H100 SXM5) | ~$0.084–0.100 | Production, cost-efficient |
| HunyuanVideo 13B (original) | Highest | 60–80GB (80GB recommended) | ~15–25 min (H100 SXM5) | ~$0.14–0.21 | Max quality, motion realism |
| LTX-2.3 (22B) | High | 32GB+ | ~5–8 min (H100 SXM5) | ~$0.04–0.07 | Fastest at quality tier |
| Wan 2.1 1.3B | Medium | 16–20GB | ~3–4 min (RTX 5090) | ~$0.008–0.012 | Local/testing |

Which to use:

Wan 2.1/2.2 14B is the default for new production video AI projects. It produces broadcast-quality output at the best cost efficiency in the high-quality tier. The H100 PCIe covers most use cases, and the hardware is well-supported by both CLI and ComfyUI tooling.

HunyuanVideo (original 13B) benchmarks ahead on motion realism and scene coherence. If those are your primary quality metrics and you have H200 budget ($4.54/hr), it's worth evaluating. On H100, HunyuanVideo runs at exactly the recommended 80GB VRAM threshold. Longer clips or memory overhead during ComfyUI inference can push usage past that limit. For consistent production reliability, H200 is the safer choice. The cost per second of output is also 1.5–2x higher. Note: Tencent released HunyuanVideo-1.5 (8.3B parameters) in November 2025, which runs on consumer GPUs with a minimum 14GB VRAM (with model offloading enabled) at lower quality. That version is better suited for prototyping than production datacenter workloads.

LTX-2.3 (Lightricks, 22B parameters) requires at least 32GB VRAM as a baseline. With FP8 or GGUF quantization it can squeeze onto smaller cards, but official support starts at 32GB. At $0.04–0.07 per second of output and faster generation times than Wan 2.1, it is the pick when throughput matters more than top-tier motion quality at the high tier.

For a full comparison of video AI models and VRAM requirements across the entire open-source landscape, see GPU Cloud for Video AI 2026.

Optimizing Generation Speed and VRAM

FP8 quantization

Already covered in the ComfyUI section, but to summarize: FP8 reduces VRAM by roughly 20–40% versus BF16 at a minor quality cost. For the 14B model at 720p, this is the difference between fitting on H100 PCIe and exceeding it. The tradeoff is acceptable for most production pipelines. Exact ComfyUI node settings evolve with each release; always check the WanVideoWrapper GitHub for current options rather than hard-coding version-specific values.

Resolution staging

Generate at 480p for composition review, then re-run at 720p for the final output. A 480p clip on H100 SXM5 costs $0.17–0.21; the same 720p clip costs $0.42–0.50, roughly 2.4x more per iteration. For a pipeline that iterates 10 times before settling on a final clip, resolution staging cuts iteration cost from roughly $4–5 down to $1.70–2.10.

This is the highest-leverage optimization available. Most motion and composition issues are visible at 480p. Scale up only for finals.
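
The staging comparison above works out as follows, using the guide's per-clip costs on H100 SXM5 on-demand:

```python
# Iteration-cost comparison for resolution staging, using the guide's
# per-clip costs (480p: $0.17-0.21, 720p: $0.42-0.50, H100 SXM5 OD).

ITERATIONS = 10
COST_720 = (0.42, 0.50)  # (low, high) per clip
COST_480 = (0.17, 0.21)

all_720p = tuple(c * ITERATIONS for c in COST_720)  # $4.20-5.00
staged = tuple(c * ITERATIONS for c in COST_480)    # $1.70-2.10 for drafts

# Budget one extra 720p render for the final clip on top of the drafts:
staged_with_final = tuple(d + f for d, f in zip(staged, COST_720))  # ~$2.12-2.60
```

Even counting the final 720p render, staging roughly halves the cost of a 10-iteration cycle.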

Batch vs sequential

Video generation does not batch across clips the way image generation does. Each clip independently consumes the full VRAM allocation. Putting multiple video jobs on a single GPU does not increase throughput; it causes OOM errors.

For production pipelines: run one generation job per GPU instance, and scale by provisioning additional instances. Four H100s run four concurrent generation jobs with linear throughput scaling and no inter-GPU coordination overhead. See GPU Cloud for Video AI 2026 for the multi-GPU scaling architecture section.
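
The one-job-per-GPU dispatch pattern can be sketched with a simple instance pool. The instance names and the generate() call are hypothetical placeholders for your own ComfyUI API client; only the concurrency structure is the point.

```python
# One-job-per-instance dispatch sketch. Instance names and generate()
# are hypothetical placeholders for your own ComfyUI API client.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

INSTANCES = ["h100-a", "h100-b", "h100-c", "h100-d"]  # one GPU each

def generate(instance: str, prompt: str) -> str:
    # Placeholder: submit the prompt to ComfyUI on `instance` here.
    return f"{instance}:{prompt}"

def run_batch(prompts: list[str]) -> list[str]:
    free = Queue()
    for inst in INSTANCES:
        free.put(inst)

    def worker(prompt: str) -> str:
        inst = free.get()  # block until a GPU instance is free
        try:
            return generate(inst, prompt)
        finally:
            free.put(inst)  # release it for the next job

    # max_workers == instance count guarantees one job per GPU,
    # which avoids the OOM failure mode described above.
    with ThreadPoolExecutor(max_workers=len(INSTANCES)) as ex:
        return list(ex.map(worker, prompts))
```

Because each clip is independent, adding a fifth instance to INSTANCES raises throughput by exactly one job slot with no coordination overhead.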


Wan 2.1 and Wan 2.2 are running in production on Spheron's H100 and H200 GPUs today. Provision an instance in minutes, no contract required.

Rent an H100 for Wan 2.1 →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.