GPU prices in this article are based on Spheron marketplace data as of March 16, 2026. GPU pricing fluctuates over time based on availability - check current GPU pricing for live on-demand and spot rates before provisioning.
Generating a 5-second 720p video with Wan 2.1 on an H100 PCIe takes approximately 10–12 minutes and requires 65–80GB of VRAM. On a consumer RTX 4090 (24GB), it won't run at all. On a consumer RTX 5090 (32GB), it still won't run. The only practical path to running Wan 2.1 or HunyuanVideo is datacenter hardware - and even there, you need the 80GB tier minimum.
This is not a limitation that will disappear with optimization. It is a fundamental property of how video generation models work: they must maintain temporal consistency across dozens or hundreds of frames simultaneously, and that requires the full frame sequence to reside in VRAM at once. More frames means more VRAM. Higher resolution means more VRAM. These relationships do not compress away.
This guide covers the actual VRAM requirements for every major open-source video model as of March 2026, which cloud GPUs can run them, what generation times to expect, and how to deploy Wan 2.1 on Spheron's H100 infrastructure today.
Why Video AI Is So Much More Demanding Than Image AI
The VRAM gap between image and video generation is not incremental - it's structural. Understanding why helps you make better infrastructure decisions.
Temporal dimension means vastly more computation. An image model processes a single frame. A video model processes N frames and maintains consistency between them through temporal attention mechanisms. For a 5-second clip at 24 fps, that's 120 frames, each requiring individual processing plus cross-frame attention. The attention matrix grows quadratically with the number of tokens, which scales with both frame count and resolution.
VRAM scales with clip length linearly, but attention memory scales worse. A 5-second clip at 24 fps holds approximately 120 frames in the attention window simultaneously. Doubling clip length to 10 seconds roughly doubles the frame count, which more than doubles VRAM requirements once the attention overhead is factored in. This is why 10-second Wan 2.1 clips at 720p require 80GB or more - beyond what even an H100 can handle comfortably.
Resolution scaling is quadratic, not linear. Going from 480p (832×480) to 720p (1280×720) is a 2.25× increase in pixel count per frame. But the transformer attention matrix grows quadratically with token count, so the VRAM increase is closer to 2–3× even though pixel count grew only 2.25×. This is the single biggest reason most users should start at 480p.
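The scaling above can be sketched numerically. The patch sizes below (16×16 spatial, 4-frame temporal) are illustrative assumptions, not any particular model's actual values, but the quadratic relationship holds regardless of the exact patching:

```python
def latent_tokens(width, height, frames, spatial_patch=16, temporal_patch=4):
    """Rough token count for a video transformer (illustrative patch sizes)."""
    return (width // spatial_patch) * (height // spatial_patch) * (frames // temporal_patch)

t_480 = latent_tokens(832, 480, 120)   # 5 seconds at 24 fps
t_720 = latent_tokens(1280, 720, 120)

print(f"tokens 480p: {t_480:,}, 720p: {t_720:,}")
print(f"token ratio: {t_720 / t_480:.2f}x")                  # ~2.31x, tracking the 2.25x pixel growth
print(f"attention cost ratio: {(t_720 / t_480) ** 2:.1f}x")  # ~5.3x - why VRAM grows faster than pixels
```

The token count grows in step with pixel count, but the attention matrix is tokens squared, which is the component that pushes total VRAM growth past linear.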
Generation time is orders of magnitude longer than image generation. The table below makes the contrast concrete:
| Content Type | Model Example | Typical Generation Time (H100) | VRAM Needed |
|---|---|---|---|
| 512×512 image | SDXL | 3–5 seconds | 8–12GB |
| 1024×1024 image | Flux.1 Dev | 10–20 seconds | 20–25GB |
| 5s 480p video | Wan 2.1 | ~4 minutes | ~40–48GB |
| 5s 720p video | Wan 2.1 | ~10–12 minutes | ~65–80GB |
| 5s 720p video | HunyuanVideo | ~20 minutes | ~60–80GB+ |
Image generation is measured in seconds. Video generation is measured in minutes - even on the best available datacenter hardware. Set your expectations accordingly before provisioning infrastructure.
VRAM Requirements - The Complete Model Guide
Verified as of March 2026. VRAM requirements change with new releases and quantization improvements. Always check the model's GitHub repository for current requirements before provisioning.
| Model | Clip Length | Resolution | VRAM Required | Min GPU | Notes |
|---|---|---|---|---|---|
| Wan 2.1 | 5s | 832×480 | ~40–48GB | H100 PCIe | With float8 quantization |
| Wan 2.1 | 5s | 1280×720 | ~65–80GB | H100 PCIe | Full quality, tight on 80GB |
| Wan 2.1 | 10s | 1280×720 | ~80GB+ | H200 | Barely fits on H100, H200 recommended |
| HunyuanVideo | 5s | 720p | ~60–80GB | H100 PCIe (tight) | 60GB min (official); 80GB recommended; OOM risk at 80GB due to spikes |
| HunyuanVideo | 5s | 1080p | ~100–120GB+ | H200 | Community-tested; 1080p not in official specs; H200 141GB minimum |
| AnimateDiff v3 | 3s | 512×512 | ~18–24GB | RTX 5090 | Short clips only; limited motion range |
| LTX-2.3 | 5s | 720p | ~24–32GB | RTX 5090 | 4K at 50 FPS; 32GB+ official minimum; H100 recommended |
| CogVideoX-1.5-5B | 10s | 720p | ~24–32GB | RTX 4090 | Nov 2024 release; 8-bit quant reduces to ~16GB |
| Mochi 1 | 5s | 480p | ~22GB | RTX 5090 | With bfloat16 variant; full precision needs 60GB+ |
A few clarifications on the table:
Wan 2.1 with float8 quantization brings the 480p requirement down to approximately 40GB. Without quantization, expect 48–55GB. The Wan 2.1 GitHub repository provides current quantization instructions and community-verified VRAM measurements.
HunyuanVideo at 720p on H100 PCIe is technically possible but practically risky. With exactly 80GB available, system overhead and VRAM fragmentation during generation can cause OOM errors. The H200 with 141GB is the recommended minimum for reliable production use.
LTX-2.3 (from Lightricks, released March 8, 2026) is the current model in the LTX series and the best option for new projects. LTX-2 was announced October 2025 and open-sourced January 6, 2026; LTX-2.3 followed in early March with a rebuilt VAE, 4x larger text connector, and native audio generation. Both support native 4K at 50 FPS. The official minimum is 32GB+ VRAM, with 48GB+ recommended for stable 4K generation. In practice, 720p runs on 12-24GB with fp8 quantization, and 1080p on 24-32GB. Note that the earlier LTX-Video series (2B and 13B variants, 2024-early 2025) used a different architecture and required 8-40GB depending on model size and resolution. If you are evaluating LTX for a new project, use LTX-2.3 and check the LTX-2 GitHub for the latest model weights and requirements. For a broader look at model hardware requirements, see the GPU requirements cheat sheet 2026.
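The table above can be encoded as a simple pre-provisioning check. The VRAM numbers below are copied from the upper end of the table's ranges; the 10% headroom margin is an assumption to account for system overhead and VRAM fragmentation, not an official figure:

```python
# (model, resolution) -> upper end of the table's VRAM requirement range, in GB
REQUIREMENTS = {
    ("wan2.1", "480p"): 48,
    ("wan2.1", "720p"): 80,
    ("hunyuanvideo", "720p"): 80,
    ("hunyuanvideo", "1080p"): 120,
    ("ltx-2.3", "720p"): 32,
}
# Sorted smallest-first so we pick the cheapest GPU that fits
GPUS = [("RTX 5090", 32), ("H100", 80), ("H200", 141), ("B200", 192)]

def pick_gpu(model, resolution, headroom=1.10):
    """Smallest GPU whose VRAM covers the requirement plus a safety margin."""
    needed = REQUIREMENTS[(model, resolution)] * headroom
    for name, vram in GPUS:
        if vram >= needed:
            return name
    raise ValueError("no single GPU fits; consider multi-GPU or a smaller model")

print(pick_gpu("wan2.1", "480p"))        # H100 (48GB * 1.1 = 52.8GB needed)
print(pick_gpu("hunyuanvideo", "720p"))  # H200 - 80GB * 1.1 exceeds the H100's 80GB
```

Note how the headroom margin reproduces the article's advice: HunyuanVideo at 720p nominally fits an 80GB H100, but any safety margin pushes the recommendation to the H200.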
GPU Recommendations for Video AI
For getting started / testing models - RTX 5090 (32GB)
The RTX 5090 runs AnimateDiff, Mochi 1, and LTX-2.3 at lower resolutions (LTX-2.3 fits 720p only at tight margins). It is the entry point for video AI experimentation on cloud infrastructure, but it does not run the headline models (Wan 2.1, HunyuanVideo) at any useful quality level.
- On-demand: $0.76/hr on Spheron (as of March 16, 2026)
- Use for: AnimateDiff clips, Mochi 1 at 480p, LTX-2.3 at 720p (fp8 quantized), evaluating video AI pipelines before committing to H100 budget
For production video generation (standard quality) - H100 PCIe or SXM (80GB)
The H100 is the practical minimum for Wan 2.1 at 720p and HunyuanVideo at 720p. The SXM variant's higher memory bandwidth (3.35 TB/s vs 2 TB/s on PCIe) reduces generation times meaningfully for video workloads.
- On-demand (SXM5): $2.50/hr on Spheron (as of March 16, 2026)
- Spot (SXM5): $0.99/hr on Spheron (preemptible; not suitable for long video generation jobs)
- Use for: Wan 2.1 at 480p–720p, HunyuanVideo at 720p, CogVideoX-1.5-5B, LTX-2.3 at full 1080p quality
For high-resolution and longer clips - H200 SXM (141GB)
The H200 is the right tier for HunyuanVideo 1080p, Wan 2.1 10-second clips at 720p, and production workloads where OOM errors are unacceptable. Its 141GB of HBM3e at 4.8 TB/s makes it the most practical single-GPU option for any current video model. Learn more at the NVIDIA H200 GPU rental guide.
- On-demand: $3.49/hr on Spheron (as of March 16, 2026)
- Spot: $2.85/hr on Spheron
- Use for: HunyuanVideo 1080p, Wan 2.1 10s+ clips at 720p, any workload where VRAM margin matters
For more detail on how VRAM architecture affects AI workloads, see Dedicated vs Shared GPU Memory.
Full comparison table:
| GPU | VRAM | Best Video AI Use Case | On-Demand (Spheron) | Spot (Spheron) |
|---|---|---|---|---|
| RTX 5090 | 32GB | AnimateDiff, Mochi 1, LTX-2.3 (720p fp8) | $0.76/hr | N/A |
| H100 PCIe | 80GB | Wan 2.1 (480–720p), CogVideoX-1.5-5B | From $2.01/hr | N/A |
| H100 SXM5 | 80GB | Wan 2.1 (720p), HunyuanVideo (720p) | $2.50/hr | $0.99/hr |
| H200 SXM | 141GB | All models; HunyuanVideo 1080p, Wan 2.1 10s | $3.49/hr | $2.85/hr |
| B200 | 192GB | Maximum quality; batch generation | From ~$6.03/hr | N/A |
Pricing as of March 16, 2026. GPU pricing fluctuates over time based on availability. Check GPU pricing for live rates.
Wan 2.1 - Setup Guide on Spheron
Wan 2.1 (Alibaba) is one of the most widely deployed open-source video models and remains a strong production choice. Note that Wan 2.2 was released in July 2025, bringing a Mixture-of-Experts architecture with 65.6% more images and 83.2% more videos in its training data, while maintaining similar VRAM requirements. The setup instructions and VRAM guidance below apply to both versions. This walkthrough uses the Wan 2.1 14B text-to-video model.
Step 1: Launch an H100 instance on Spheron
Go to Spheron's GPU rental page and provision an H100 PCIe or SXM instance. For 720p output, the SXM5 variant at $2.50/hr on-demand is recommended for its higher memory bandwidth. For 480p, the PCIe variant works well.
Step 2: Install requirements
git clone https://github.com/Wan-Video/Wan2.1
cd Wan2.1
pip install -r requirements.txt
Step 3: Download model weights
# Install huggingface-cli
pip install huggingface_hub
# Download the 14B T2V model weights (~30GB)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
--local-dir ./Wan2.1-T2V-14B
Weight download takes 15–30 minutes depending on connection speed. The 14B model is approximately 29GB on disk in BF16 format.
Step 4: Generate a video
# 480p generation - approximately 4 minutes on H100 PCIe, ~40GB VRAM
python generate.py \
--task t2v-14B \
--size 832*480 \
--ckpt_dir ./Wan2.1-T2V-14B \
--prompt "A cat walking through a sunlit garden, cinematic lighting" \
--save_file output_480p.mp4
# 720p generation - approximately 10–12 minutes on H100 PCIe, ~65–80GB VRAM
python generate.py \
--task t2v-14B \
--size 1280*720 \
--ckpt_dir ./Wan2.1-T2V-14B \
--prompt "A cat walking through a sunlit garden, cinematic lighting" \
--save_file output_720p.mp4
What to expect: The first run downloads and caches model components - expect 5–10 minutes of overhead before generation starts. Subsequent runs begin immediately. Peak VRAM usage is reported in the terminal output; watch this to confirm you're within the GPU's capacity before a long generation job.
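For scripted batch runs, the CLI invocation above can be wrapped in a small launcher. build_wan_command is a hypothetical helper - the flag names are taken directly from the invocation above, but verify them against the Wan 2.1 repo before relying on this:

```python
import subprocess

def build_wan_command(size, prompt, outfile, ckpt_dir="./Wan2.1-T2V-14B"):
    """Assemble the generate.py invocation shown above as an argument list."""
    return ["python", "generate.py",
            "--task", "t2v-14B",
            "--size", size,
            "--ckpt_dir", ckpt_dir,
            "--prompt", prompt,
            "--save_file", outfile]

def generate_clip(size, prompt, outfile, timeout_min=30):
    # Passing a list (not a shell string) avoids quoting issues in prompts;
    # the timeout guards against a hung job tying up an expensive GPU.
    cmd = build_wan_command(size, prompt, outfile)
    return subprocess.run(cmd, timeout=timeout_min * 60, check=True)

cmd = build_wan_command("832*480", "A cat walking through a sunlit garden", "out.mp4")
print(" ".join(cmd))
```

The 30-minute default timeout is an assumption sized for 720p generation times; tighten it for 480p draft runs.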
For float8 quantization (reduces VRAM by roughly 20% at a minor quality cost), check the Wan 2.1 GitHub for the current quantization flags - they change between releases, so third-party flag lists go stale quickly.
HunyuanVideo - The High-Quality Option
HunyuanVideo (Tencent) delivers among the highest output quality of open-source video models as of March 2026. HunyuanVideo 1.5 benchmarks ahead of Wan 2.2 on instruction following, structural stability, and motion clarity, while the original 13B parameter model remains the most capable single-GPU option for motion realism. Both HunyuanVideo and Wan 2.2 occupy the top quality tier, with different strengths depending on your use case. HunyuanVideo's tradeoff is extreme VRAM requirements and slower generation times.
Hardware requirements:
- 720p, 5s: ~60–80GB VRAM (60GB min, 80GB recommended; H100 PCIe at 80GB is near capacity with OOM risk; H200 recommended for reliable production use)
- 1080p, 5s: ~100–120GB+ VRAM (community-tested configuration; 1080p is not the official documented resolution, but is achievable; H200 141GB is the minimum; B200 for comfortable headroom)
- Generation time: 15–25 minutes per 5-second 720p clip on H100; 12–18 minutes on H200
Why HunyuanVideo over Wan 2.1? Quality, particularly in motion realism and scene coherence across frames. For teams building production video AI products where quality is the primary metric and generation time is secondary, HunyuanVideo is the right choice. For teams optimizing cost-per-video-minute, Wan 2.1 delivers better economics.
ComfyUI support: HunyuanVideo has ComfyUI nodes available, which makes it accessible through the same workflow interface as image generation. If you're building a ComfyUI-based pipeline, HunyuanVideo integrates without requiring a separate deployment stack.
Setup: See the HunyuanVideo GitHub for current installation instructions. Requirements change with new releases; always use the GitHub documentation rather than third-party setup guides for current configuration.
HunyuanVideo 1.5 (November 2025): Tencent released HunyuanVideo 1.5 as a lighter 8.3B parameter variant that runs on consumer GPUs with 14GB+ VRAM. If your workload does not require the full-quality original model and you want to target consumer-tier hardware, HunyuanVideo 1.5 is worth evaluating. See the HunyuanVideo 1.5 GitHub for setup instructions.
Rent an H200 on Spheron for comfortable full-quality HunyuanVideo headroom at $3.49/hr on-demand or $2.85/hr spot.
Cost Per Video Minute - Cloud vs Local
This section helps production teams and studios build realistic cost models.
| Model | Resolution | Duration | Gen Time | GPU Cost | Cost per clip |
|---|---|---|---|---|---|
| AnimateDiff v3 | 512×512 | 3s | ~3–5 min | $0.76/hr (RTX 5090) | ~$0.04–0.06 |
| Wan 2.1 | 832×480 | 5s | ~4–5 min | $2.50/hr (H100 SXM5) | ~$0.17–0.21 |
| Wan 2.1 | 1280×720 | 5s | ~10–12 min | $2.50/hr (H100 SXM5) | ~$0.42–0.50 |
| Wan 2.1 | 1280×720 | 5s | ~8–10 min | $3.49/hr (H200 SXM) | ~$0.47–0.58 |
| HunyuanVideo | 720p | 5s | ~12–18 min (H200 SXM) | $3.49/hr (H200 SXM) | ~$0.70–1.05 |
| LTX-2.3 | 720p | 5s | ~5–8 min | $2.50/hr (H100 SXM5) | ~$0.21–0.33 |
Gen time estimates based on community benchmarks as of March 2026. Costs calculated at Spheron on-demand prices as of March 16, 2026. GPU pricing fluctuates over time based on availability. Spot pricing (H100 SXM5: $0.99/hr, H200 SXM: $2.85/hr) reduces costs significantly for fault-tolerant workloads.
Translating to cost per second of output video:
- AnimateDiff (512p, 3s clip): ~$0.013–0.020 per second of output
- Wan 2.1 (480p, 5s clip, H100 SXM5): ~$0.034–0.042 per second of output
- Wan 2.1 (720p, 5s clip, H100 SXM5): ~$0.084–0.100 per second of output
- HunyuanVideo (720p, 5s clip, H200): ~$0.14–0.21 per second of output
Cost per second of output video is the metric that matters for production teams. A video product that generates 1,000 clips per day at 720p Wan 2.1 quality produces 5,000 seconds of output and is spending approximately $420–500 per day at H100 SXM5 on-demand rates ($2.50/hr). At H100 SXM5 spot rates ($0.99/hr), this drops to approximately $165–200 per day. GPU pricing fluctuates over time based on availability, so verify current rates before building cost models.
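The arithmetic above generalizes to a small cost model. Rates and generation times are the assumptions from the table; substitute current pricing before using this for planning:

```python
def clip_cost(gen_minutes, hourly_rate):
    """GPU cost of one clip: generation time in hours times the hourly rate."""
    return gen_minutes / 60 * hourly_rate

def daily_cost(clips_per_day, gen_minutes, hourly_rate):
    return clips_per_day * clip_cost(gen_minutes, hourly_rate)

# Wan 2.1 at 720p on H100 SXM5 on-demand: 10-12 min per 5-second clip at $2.50/hr
low, high = clip_cost(10, 2.50), clip_cost(12, 2.50)
print(f"per clip: ${low:.2f}-${high:.2f}")                    # $0.42-$0.50
print(f"per output second: ${low / 5:.3f}-${high / 5:.3f}")
print(f"1,000 clips/day: ${daily_cost(1000, 10, 2.50):.0f}-${daily_cost(1000, 12, 2.50):.0f}")  # $417-$500
```

Re-running the same model with the spot rate ($0.99/hr) reproduces the article's $165–200/day figure.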
Local GPU comparison: On a consumer RTX 4090 (24GB), Wan 2.1 and HunyuanVideo will not run. On an RTX 5090 (32GB), they still won't run at usable quality settings. For video AI at 720p, cloud infrastructure is not an optimization - it is the only viable path.
Optimizations to Reduce VRAM and Speed Up Generation
Float8 / FP8 quantization
Most video models support BF16 by default. Running at FP8 on H100 (which has native FP8 Tensor Core support) reduces VRAM by approximately 20–40% and speeds up generation proportionally. Check each model's GitHub for current FP8 support flags - they vary by model version. Wan 2.1 has documented FP8 options; HunyuanVideo's FP8 support varies by release.
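A back-of-envelope way to see where the 20–40% range comes from: FP8 halves the footprint of every tensor that is safely castable (2 bytes per element down to 1), while norms, embeddings, and accumulators stay in BF16. The castable fractions below are illustrative assumptions, not measured values for any specific model:

```python
def fp8_vram_estimate(bf16_total_gb, castable_fraction=0.6):
    """Estimate VRAM after FP8 casting: castable tensors halve in size,
    the remainder stays at BF16 precision."""
    saved = bf16_total_gb * castable_fraction * 0.5
    return bf16_total_gb - saved

# A hypothetical 70GB BF16 workload at different castable fractions
for frac in (0.4, 0.6, 0.8):
    est = fp8_vram_estimate(70, castable_fraction=frac)
    print(f"castable {frac:.0%}: ~{est:.0f}GB ({(70 - est) / 70:.0%} saved)")
```

With 40–80% of tensors castable, total savings land at 20–40%, matching the range quoted above.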
VAE tiling
During the final VAE decoding step, high-resolution videos spike VRAM significantly above the generation baseline. Enabling VAE tiling in ComfyUI or the model's CLI decodes the video in spatial tiles, eliminating this spike. This is particularly important for 1080p HunyuanVideo on H200, where the decoding step can push you over the 141GB limit without tiling.
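The idea can be sketched as a tile scheduler: decode each spatial region independently, with overlap between tiles to hide seams, so peak decode memory scales with the tile size rather than the full frame. The tile and overlap values below are illustrative, not any framework's defaults:

```python
def tile_boxes(height, width, tile=256, overlap=32):
    """Yield (top, left, bottom, right) boxes covering the frame with overlap."""
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes

boxes = tile_boxes(720, 1280)
full_pixels = 720 * 1280
peak_pixels = max((b - t) * (r - l) for t, l, b, r in boxes)
print(f"{len(boxes)} tiles; peak decode footprint {peak_pixels / full_pixels:.1%} of a full frame")
```

The overlapping margins are blended after decoding; the cost is a modest amount of redundant computation in exchange for a much lower memory peak.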
Inference step reduction
The default inference step count for most video models is 30–50 steps. Reducing to 20–25 steps approximately halves generation time at a noticeable but often acceptable quality reduction. Use step reduction for draft previews - generate at 25 steps to evaluate the composition, then at full steps for the final output.
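Under the rough assumption that generation time scales linearly with step count, the draft-then-final workflow can be compared against iterating at full quality:

```python
def iteration_time(full_steps, full_time_min, draft_steps, num_drafts):
    """Total GPU minutes: drafts at reduced steps plus one full-quality render.
    Assumes time scales linearly with inference steps (a rough approximation)."""
    draft_time = full_time_min * draft_steps / full_steps
    return num_drafts * draft_time + full_time_min

# 3 composition iterations on a 12-minute 720p clip: 50 steps full, 25 steps draft
with_drafts = iteration_time(50, 12, 25, 3)   # 3 * 6 + 12 = 30 min
all_full = 3 * 12 + 12                        # 48 min
print(f"draft workflow: {with_drafts:.0f} min vs {all_full} min at full steps")
```

At on-demand H100 rates that difference compounds quickly across a day of prompt iteration.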
Resolution and duration tradeoffs
The most effective VRAM optimization is simply generating at lower resolution first. Start every new prompt at 480p - if the composition and motion look correct, scale up to 720p for the final render. This saves significant GPU time during iteration.
For clip length, 5 seconds is the practical sweet spot for most use cases. VRAM requirements increase non-linearly above 5 seconds: a 10-second Wan 2.1 clip at 720p often requires the full 80GB H100 capacity (or exceeds it), while the 5-second equivalent uses 65–70GB comfortably.
Building a Video AI Production Pipeline
For teams shipping a video generation product, the architecture differs from image generation in a few important ways.
Job queue architecture is mandatory. Unlike image generation (which can return in seconds and can be handled synchronously), video generation takes 5–25 minutes per clip. Build around an async job queue from the start: submit a generation job, poll for status, retrieve the completed video. Redis + Celery, BullMQ, or a cloud-native queue service all work. Do not attempt synchronous video generation in a web request.
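A minimal in-process sketch of the submit/poll pattern looks like this. A real deployment would back it with Redis + Celery or similar so jobs survive process restarts; all names here are illustrative:

```python
import queue
import threading
import uuid

jobs = {}                 # job_id -> {"status": ..., "result": ...}
work = queue.Queue()

def submit(prompt):
    """Enqueue a generation job and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    work.put((job_id, prompt))
    return job_id

def worker():
    while True:
        job_id, prompt = work.get()
        jobs[job_id]["status"] = "running"
        # Placeholder for the real 5-25 minute generation call, which would
        # write the finished clip to object storage and record its URL.
        jobs[job_id].update(status="done", result=f"s3://clips/{job_id}.mp4")
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("A cat walking through a sunlit garden")
work.join()                    # in an API server the client polls status instead
print(jobs[job_id]["status"])  # done
```

The web tier only touches submit and the jobs dict; the worker tier owns the GPU, which is what lets you scale them independently.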
Storage for video outputs. A 5-second 720p clip from Wan 2.1 is typically 50–200MB depending on codec settings. A 1080p HunyuanVideo clip can exceed 500MB. Plan for object storage (any S3-compatible service) from day one. Local disk storage does not scale.
GPU utilization target. During active generation, GPU utilization should be consistently above 90%. If you observe sustained periods below 70%, there is a CPU bottleneck in the data pipeline - typically in prompt tokenization, weight loading, or output encoding. Profile with nvidia-smi dmon to identify gaps between generation jobs.
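nvidia-smi dmon emits one line per GPU per sampling interval, so gaps show up as low values in the sm (compute utilization) column. The sample below uses the typical dmon column layout, but verify the header on your driver version before parsing real output:

```python
SAMPLE = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec
# Idx      W      C      C      %      %      %      %
    0    310     61     55     97     82      0      0
    0    305     60     54     31     20      0      0
    0    312     62     55     98     84      0      0
"""

def sm_utilization(dmon_output):
    """Extract the 'sm' (compute utilization) column from dmon-style output."""
    lines = dmon_output.strip().splitlines()
    header = lines[0].lstrip("# ").split()   # column names from the first header line
    sm_col = header.index("sm")
    return [int(line.split()[sm_col]) for line in lines if not line.startswith("#")]

util = sm_utilization(SAMPLE)
dips = [u for u in util if u < 70]
print(f"samples: {util}; below-70% dips: {len(dips)}")
```

In production you would feed this from `nvidia-smi dmon` via a subprocess pipe and alert when the dip count climbs.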
Multi-GPU scaling is simple for video generation. Unlike multi-GPU LLM serving (which requires tensor parallelism and NVLink), video generation pipelines scale trivially: each GPU handles one generation job independently, with no inter-GPU communication. Four H100s run four concurrent generation jobs; throughput scales linearly. This makes video generation an ideal workload for GPU pools.
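This scaling pattern is a few lines of orchestration: pin each job to one GPU via CUDA_VISIBLE_DEVICES and run jobs in parallel. The generate function below is a placeholder for launching generate.py as a subprocess; prompts and GPU count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_GPUS = 4

def generate(gpu_id, prompt):
    """Placeholder: in production, launch the model as a subprocess pinned
    to a single GPU, e.g.
      subprocess.run(["python", "generate.py", "--prompt", prompt],
                     env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)})
    """
    return f"gpu{gpu_id}: {prompt}"

prompts = ["a cat in a garden", "waves at sunset", "city timelapse",
           "forest in fog", "a dog on a beach", "northern lights"]

# Round-robin assignment; each GPU runs one job at a time, no inter-GPU traffic.
with ThreadPoolExecutor(max_workers=NUM_GPUS) as pool:
    futures = [pool.submit(generate, i % NUM_GPUS, p) for i, p in enumerate(prompts)]
    results = [f.result() for f in futures]

print(results[0])  # gpu0: a cat in a garden
```

Setting the environment per subprocess (rather than mutating os.environ in threads) is what keeps the GPU pinning race-free.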
Wan 2.1, HunyuanVideo, and other open-source video AI models are running on Spheron's H100 and H200 GPUs today. No contracts, no waitlists. Start generating on the hardware that actually fits these models.
