GPU prices in this article are based on Spheron marketplace data as of March 16, 2026. GPU pricing fluctuates over time based on availability - check current GPU pricing for live on-demand and spot rates before provisioning.
Generating a 5-second 720p video with Wan 2.1 on an H100 PCIe takes approximately 10–12 minutes and requires 65–80GB of VRAM. On a consumer RTX 4090 (24GB), it won't run at all. On a consumer RTX 5090 (32GB), it still won't run. The only practical path to running Wan 2.1 or HunyuanVideo is datacenter hardware - and even there, you need the 80GB tier minimum.
This is not a limitation that will disappear with optimization. It is a fundamental property of how video generation models work: they must maintain temporal consistency across dozens or hundreds of frames simultaneously, and that requires the full frame sequence to reside in VRAM at once. More frames means more VRAM. Higher resolution means more VRAM. These relationships do not compress away.
This guide covers the actual VRAM requirements for every major open-source video model as of March 2026, which cloud GPUs can run them, what generation times to expect, and how to deploy Wan 2.1 on Spheron's H100 infrastructure today.
Why Video AI Is So Much More Demanding Than Image AI
The VRAM gap between image and video generation is not incremental - it's structural. Understanding why helps you make better infrastructure decisions.
Temporal dimension means vastly more computation. An image model processes a single frame. A video model processes N frames and maintains consistency between them through temporal attention mechanisms. For a 5-second clip at 24 fps, that's 120 frames, each requiring individual processing plus cross-frame attention. The attention matrix grows quadratically with the number of tokens, which scales with both frame count and resolution.
VRAM scales with clip length linearly, but attention memory scales worse. A 5-second clip at 24 fps holds approximately 120 frames in the attention window simultaneously. Doubling clip length to 10 seconds roughly doubles the frame count, which more than doubles VRAM requirements once the attention overhead is factored in. This is why 10-second Wan 2.1 clips at 720p require 80GB or more - beyond what even an H100 can handle comfortably.
Resolution scaling is quadratic, not linear. Going from 480p (832×480) to 720p (1280×720) is a 2.25× increase in pixel count per frame. But the transformer attention matrix grows quadratically with token count, so the VRAM increase is closer to 2–3× even though pixel count grew only 2.25×. This is the single biggest reason most users should start at 480p.
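The scaling above can be sketched numerically. The patch sizes below (16×16 spatial, 4-frame temporal) are illustrative assumptions, not any particular model's actual values, but the quadratic relationship holds regardless of the exact patching:

```python
def latent_tokens(width, height, frames, spatial_patch=16, temporal_patch=4):
    """Rough token count for a video transformer (illustrative patch sizes)."""
    return (width // spatial_patch) * (height // spatial_patch) * (frames // temporal_patch)

t_480 = latent_tokens(832, 480, 120)   # 5 seconds at 24 fps
t_720 = latent_tokens(1280, 720, 120)

print(f"tokens 480p: {t_480:,}, 720p: {t_720:,}")
print(f"token ratio: {t_720 / t_480:.2f}x")                  # ~2.31x, tracking the 2.25x pixel growth
print(f"attention cost ratio: {(t_720 / t_480) ** 2:.1f}x")  # ~5.3x - why VRAM grows faster than pixels
```

The token count grows in step with pixel count, but the attention matrix is tokens squared, which is the component that pushes total VRAM growth past linear.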
Generation time is orders of magnitude longer than image generation. The table below makes the contrast concrete:
| Content Type | Model Example | Typical Generation Time (H100) | VRAM Needed |
|---|---|---|---|
| 512×512 image | SDXL | 3–5 seconds | 8–12GB |
| 1024×1024 image | Flux.1 Dev | 10–20 seconds | 20–25GB |
| 5s 480p video | Wan 2.1 | ~4 minutes | ~40–48GB |
| 5s 720p video | Wan 2.1 | ~10–12 minutes | ~65–80GB |
| 5s 720p video | HunyuanVideo | ~20 minutes | ~60–80GB+ |
Image generation is measured in seconds. Video generation is measured in minutes - even on the best available datacenter hardware. Set your expectations accordingly before provisioning infrastructure.
VRAM Requirements - The Complete Model Guide
Verified as of March 2026. VRAM requirements change with new releases and quantization improvements. Always check the model's GitHub repository for current requirements before provisioning.
| Model | Clip Length | Resolution | VRAM Required | Min GPU | Notes |
|---|---|---|---|---|---|
| Wan 2.1 | 5s | 832×480 | ~40–48GB | H100 PCIe | With float8 quantization |
| Wan 2.1 | 5s | 1280×720 | ~65–80GB | H100 PCIe | Full quality, tight on 80GB |
| Wan 2.1 | 10s | 1280×720 | ~80GB+ | H200 | Barely fits on H100, H200 recommended |
| HunyuanVideo | 5s | 720p | ~60–80GB | H100 PCIe (tight) | 60GB min (official); 80GB recommended; OOM risk at 80GB due to spikes |
| HunyuanVideo | 5s | 1080p | ~100–120GB+ | H200 | Community-tested; 1080p not in official specs; H200 141GB minimum |
| AnimateDiff v3 | 3s | 512×512 | ~18–24GB | RTX 5090 | Short clips only; limited motion range |
| LTX-2.3 | 5s | 720p | ~24–32GB | RTX 5090 | 4K at 50 FPS; 32GB+ official minimum; H100 recommended |
| CogVideoX-1.5-5B | 10s | 720p | ~24–32GB | RTX 4090 | Nov 2024 release; 8-bit quant reduces to ~16GB |
| Mochi 1 | 5s | 480p | ~22GB | RTX 5090 | With bfloat16 variant; full precision needs 60GB+ |
A few clarifications on the table:
Wan 2.1 with float8 quantization brings the 480p requirement down to approximately 40GB. Without quantization, expect 48–55GB. The Wan 2.1 GitHub repository provides current quantization instructions and community-verified VRAM measurements.
HunyuanVideo at 720p on H100 PCIe is technically possible but practically risky. With exactly 80GB available, system overhead and VRAM fragmentation during generation can cause OOM errors. The H200 with 141GB is the recommended minimum for reliable production use.
LTX-2.3 (from Lightricks, released March 8, 2026) is the current model in the LTX series and the best option for new projects. LTX-2 was announced October 2025 and open-sourced January 6, 2026; LTX-2.3 followed in early March with a rebuilt VAE, 4x larger text connector, and native audio generation. Both support native 4K at 50 FPS. The official minimum is 32GB+ VRAM, with 48GB+ recommended for stable 4K generation. In practice, 720p runs on 12-24GB with fp8 quantization, and 1080p on 24-32GB. Note that the earlier LTX-Video series (2B and 13B variants, 2024-early 2025) used a different architecture and required 8-40GB depending on model size and resolution. If you are evaluating LTX for a new project, use LTX-2.3 and check the LTX-2 GitHub for the latest model weights and requirements. For a broader look at model hardware requirements, see the GPU requirements cheat sheet 2026.
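The table above can be encoded as a simple pre-provisioning check. The VRAM numbers below are copied from the upper end of the table's ranges; the 10% headroom margin is an assumption to account for system overhead and VRAM fragmentation, not an official figure:

```python
# (model, resolution) -> upper end of the table's VRAM requirement range, in GB
REQUIREMENTS = {
    ("wan2.1", "480p"): 48,
    ("wan2.1", "720p"): 80,
    ("hunyuanvideo", "720p"): 80,
    ("hunyuanvideo", "1080p"): 120,
    ("ltx-2.3", "720p"): 32,
}
# Sorted smallest-first so we pick the cheapest GPU that fits
GPUS = [("RTX 5090", 32), ("H100", 80), ("H200", 141), ("B200", 192)]

def pick_gpu(model, resolution, headroom=1.10):
    """Smallest GPU whose VRAM covers the requirement plus a safety margin."""
    needed = REQUIREMENTS[(model, resolution)] * headroom
    for name, vram in GPUS:
        if vram >= needed:
            return name
    raise ValueError("no single GPU fits; consider multi-GPU or a smaller model")

print(pick_gpu("wan2.1", "480p"))        # H100 (48GB * 1.1 = 52.8GB needed)
print(pick_gpu("hunyuanvideo", "720p"))  # H200 - 80GB * 1.1 exceeds the H100's 80GB
```

Note how the headroom margin reproduces the article's advice: HunyuanVideo at 720p nominally fits an 80GB H100, but any safety margin pushes the recommendation to the H200.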
GPU Recommendations for Video AI
For getting started / testing models - RTX 5090 (32GB)
The RTX 5090 runs AnimateDiff, Mochi 1, and LTX-2.3 at lower resolutions (LTX-2.3 fits 720p only at tight margins). It is the entry point for video AI experimentation on cloud infrastructure, but it does not run the headline models (Wan 2.1, HunyuanVideo) at any useful quality level.
- On-demand: $0.76/hr on Spheron (as of March 16, 2026)
- Use for: AnimateDiff clips, Mochi 1 at 480p, LTX-2.3 at 720p (fp8 quantized), evaluating video AI pipelines before committing to H100 budget
For production video generation (standard quality) - H100 PCIe or SXM (80GB)
The H100 is the practical minimum for Wan 2.1 at 720p and HunyuanVideo at 720p. The SXM variant's higher memory bandwidth (3.35 TB/s vs 2 TB/s on PCIe) reduces generation times meaningfully for video workloads.
- On-demand (SXM5): $2.50/hr on Spheron (as of March 16, 2026)
- Spot (SXM5): $0.99/hr on Spheron (preemptible; not suitable for long video generation jobs)
- Use for: Wan 2.1 at 480p–720p, HunyuanVideo at 720p, CogVideoX-1.5-5B, LTX-2.3 at full 1080p quality
For high-resolution and longer clips - H200 SXM (141GB)
The H200 is the right tier for HunyuanVideo 1080p, Wan 2.1 10-second clips at 720p, and production workloads where OOM errors are unacceptable. Its 141GB of HBM3e at 4.8 TB/s makes it the most practical single-GPU option for any current video model. Learn more at the NVIDIA H200 GPU rental guide.
- On-demand: $3.49/hr on Spheron (as of March 16, 2026)
- Spot: $2.85/hr on Spheron
- Use for: HunyuanVideo 1080p, Wan 2.1 10s+ clips at 720p, any workload where VRAM margin matters
For more detail on how VRAM architecture affects AI workloads, see Dedicated vs Shared GPU Memory.
Full comparison table:
| GPU | VRAM | Best Video AI Use Case | On-Demand (Spheron) | Spot (Spheron) |
|---|---|---|---|---|
| RTX 5090 | 32GB | AnimateDiff, Mochi 1, LTX-2.3 (720p fp8) | $0.76/hr | N/A |
| H100 PCIe | 80GB | Wan 2.1 (480–720p), CogVideoX-1.5-5B | From $2.01/hr | N/A |
| H100 SXM5 | 80GB | Wan 2.1 (720p), HunyuanVideo (720p) | $2.50/hr | $0.99/hr |
| H200 SXM | 141GB | All models; HunyuanVideo 1080p, Wan 2.1 10s | $3.49/hr | $2.85/hr |
| B200 | 192GB | Maximum quality; batch generation | From ~$6.03/hr | N/A |
Pricing as of March 16, 2026. GPU pricing fluctuates over time based on availability. Check GPU pricing for live rates.
Wan 2.1 - Setup Guide on Spheron
Wan 2.1 (Alibaba) is one of the most widely deployed open-source video models and remains a strong production choice. Note that Wan 2.2 was released in July 2025, bringing a Mixture-of-Experts architecture with 65.6% more images and 83.2% more videos in its training data, while maintaining similar VRAM requirements. The setup instructions and VRAM guidance below apply to both versions. This walkthrough uses the Wan 2.1 14B text-to-video model.
Step 1: Launch an H100 instance on Spheron
Go to Spheron's GPU rental page and provision an H100 PCIe or SXM instance. For 720p output, the SXM5 variant at $2.50/hr on-demand is recommended for its higher memory bandwidth. For 480p, the PCIe variant works well.
Step 2: Install requirements
git clone https://github.com/Wan-Video/Wan2.1
cd Wan2.1
pip install -r requirements.txt
Step 3: Download model weights
# Install huggingface-cli
pip install huggingface_hub
# Download the 14B T2V model weights (~30GB)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
--local-dir ./Wan2.1-T2V-14B
Weight download takes 15–30 minutes depending on connection speed. The 14B model is approximately 29GB on disk in BF16 format.
Step 4: Generate a video
# 480p generation - approximately 4 minutes on H100 PCIe, ~40GB VRAM
python generate.py \
--task t2v-14B \
--size 832*480 \
--ckpt_dir ./Wan2.1-T2V-14B \
--prompt "A cat walking through a sunlit garden, cinematic lighting" \
--save_file output_480p.mp4
# 720p generation - approximately 10–12 minutes on H100 PCIe, ~65–80GB VRAM
python generate.py \
--task t2v-14B \
--size 1280*720 \
--ckpt_dir ./Wan2.1-T2V-14B \
--prompt "A cat walking through a sunlit garden, cinematic lighting" \
--save_file output_720p.mp4
What to expect: The first run downloads and caches model components - expect 5–10 minutes of overhead before generation starts. Subsequent runs begin immediately. Peak VRAM usage is reported in the terminal output; watch this to confirm you're within the GPU's capacity before a long generation job.
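For scripted batch runs, the CLI invocation above can be wrapped in a small launcher. build_wan_command is a hypothetical helper - the flag names are taken directly from the invocation above, but verify them against the Wan 2.1 repo before relying on this:

```python
import subprocess

def build_wan_command(size, prompt, outfile, ckpt_dir="./Wan2.1-T2V-14B"):
    """Assemble the generate.py invocation shown above as an argument list."""
    return ["python", "generate.py",
            "--task", "t2v-14B",
            "--size", size,
            "--ckpt_dir", ckpt_dir,
            "--prompt", prompt,
            "--save_file", outfile]

def generate_clip(size, prompt, outfile, timeout_min=30):
    # Passing a list (not a shell string) avoids quoting issues in prompts;
    # the timeout guards against a hung job tying up an expensive GPU.
    cmd = build_wan_command(size, prompt, outfile)
    return subprocess.run(cmd, timeout=timeout_min * 60, check=True)

cmd = build_wan_command("832*480", "A cat walking through a sunlit garden", "out.mp4")
print(" ".join(cmd))
```

The 30-minute default timeout is an assumption sized for 720p generation times; tighten it for 480p draft runs.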
For float8 quantization (reduces VRAM by roughly 20% at a minor quality cost), check the Wan 2.1 GitHub for the current quantization flags - they change between releases, so third-party flag lists go stale quickly.
HunyuanVideo - The High-Quality Option
HunyuanVideo (Tencent) delivers among the highest output quality of open-source video models as of March 2026. HunyuanVideo 1.5 benchmarks ahead of Wan 2.2 on instruction following, structural stability, and motion clarity, while the original 13B parameter model remains the most capable single-GPU option for motion realism. Both HunyuanVideo and Wan 2.2 occupy the top quality tier, with different strengths depending on your use case. HunyuanVideo's tradeoff is extreme VRAM requirements and slower generation times.
Hardware requirements:
- 720p, 5s: ~60–80GB VRAM (60GB min, 80GB recommended; H100 PCIe at 80GB is near capacity with OOM risk; H200 recommended for reliable production use)
- 1080p, 5s: ~100–120GB+ VRAM (community-tested configuration; 1080p is not the official documented resolution, but is achievable; H200 141GB is the minimum; B200 for comfortable headroom)
- Generation time: 15–25 minutes per 5-second 720p clip on H100; 12–18 minutes on H200
Why HunyuanVideo over Wan 2.1? Quality, particularly in motion realism and scene coherence across frames. For teams building production video AI products where quality is the primary metric and generation time is secondary, HunyuanVideo is the right choice. For teams optimizing cost-per-video-minute, Wan 2.1 delivers better economics.
ComfyUI support: HunyuanVideo has ComfyUI nodes available, which makes it accessible through the same workflow interface as image generation. If you're building a ComfyUI-based pipeline, HunyuanVideo integrates without requiring a separate deployment stack.
Setup: See the HunyuanVideo GitHub for current installation instructions. Requirements change with new releases; always use the GitHub documentation rather than third-party setup guides for current configuration.
HunyuanVideo 1.5 (November 2025): Tencent released HunyuanVideo 1.5 as a lighter 8.3B parameter variant that runs on consumer GPUs with 14GB+ VRAM. If your workload does not require the full-quality original model and you want to target consumer-tier hardware, HunyuanVideo 1.5 is worth evaluating. See the HunyuanVideo 1.5 GitHub for setup instructions.
Rent an H200 on Spheron for comfortable full-quality HunyuanVideo headroom at $3.49/hr on-demand or $2.85/hr spot.
Cost Per Video Minute - Cloud vs Local
This section helps production teams and studios build realistic cost models.
| Model | Resolution | Duration | Gen Time | GPU Cost | Cost per clip |
|---|---|---|---|---|---|
| AnimateDiff v3 | 512×512 | 3s | ~3–5 min | $0.76/hr (RTX 5090) | ~$0.04–0.06 |
| Wan 2.1 | 832×480 | 5s | ~4–5 min | $2.50/hr (H100 SXM5) | ~$0.17–0.21 |
| Wan 2.1 | 1280×720 | 5s | ~10–12 min | $2.50/hr (H100 SXM5) | ~$0.42–0.50 |
| Wan 2.1 | 1280×720 | 5s | ~8–10 min | $3.49/hr (H200 SXM) | ~$0.47–0.58 |
| HunyuanVideo | 720p | 5s | ~12–18 min (H200 SXM) | $3.49/hr (H200 SXM) | ~$0.70–1.05 |
| LTX-2.3 | 720p | 5s | ~5–8 min | $2.50/hr (H100 SXM5) | ~$0.21–0.33 |
Gen time estimates based on community benchmarks as of March 2026. Costs calculated at Spheron on-demand prices as of March 16, 2026. GPU pricing fluctuates over time based on availability. Spot pricing (H100 SXM5: $0.99/hr, H200 SXM: $2.85/hr) reduces costs significantly for fault-tolerant workloads.
Translating to cost per second of output video:
- AnimateDiff (512p, 3s clip): ~$0.013–0.020 per second of output
- Wan 2.1 (480p, 5s clip, H100 SXM5): ~$0.034–0.042 per second of output
- Wan 2.1 (720p, 5s clip, H100 SXM5): ~$0.084–0.100 per second of output
- HunyuanVideo (720p, 5s clip, H200): ~$0.14–0.21 per second of output
Cost per second of output video is the metric that matters for production teams. A video product that generates 1,000 clips per day at 720p Wan 2.1 quality produces 5,000 seconds of output and is spending approximately $420–500 per day at H100 SXM5 on-demand rates ($2.50/hr). At H100 SXM5 spot rates ($0.99/hr), this drops to approximately $165–200 per day. GPU pricing fluctuates over time based on availability, so verify current rates before building cost models.
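The arithmetic above generalizes to a small cost model. Rates and generation times are the assumptions from the table; substitute current pricing before using this for planning:

```python
def clip_cost(gen_minutes, hourly_rate):
    """GPU cost of one clip: generation time in hours times the hourly rate."""
    return gen_minutes / 60 * hourly_rate

def daily_cost(clips_per_day, gen_minutes, hourly_rate):
    return clips_per_day * clip_cost(gen_minutes, hourly_rate)

# Wan 2.1 at 720p on H100 SXM5 on-demand: 10-12 min per 5-second clip at $2.50/hr
low, high = clip_cost(10, 2.50), clip_cost(12, 2.50)
print(f"per clip: ${low:.2f}-${high:.2f}")                    # $0.42-$0.50
print(f"per output second: ${low / 5:.3f}-${high / 5:.3f}")
print(f"1,000 clips/day: ${daily_cost(1000, 10, 2.50):.0f}-${daily_cost(1000, 12, 2.50):.0f}")  # $417-$500
```

Re-running the same model with the spot rate ($0.99/hr) reproduces the article's $165–200/day figure.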
Local GPU comparison: On a consumer RTX 4090 (24GB), Wan 2.1 and HunyuanVideo will not run. On an RTX 5090 (32GB), they still won't run at usable quality settings. For video AI at 720p, cloud infrastructure is not an optimization - it is the only viable path.
Optimizations to Reduce VRAM and Speed Up Generation
Float8 / FP8 quantization
Most video models support BF16 by default. Running at FP8 on H100 (which has native FP8 Tensor Core support) reduces VRAM by approximately 20–40% and speeds up generation proportionally. Check each model's GitHub for current FP8 support flags - they vary by model version. Wan 2.1 has documented FP8 options; HunyuanVideo's FP8 support varies by release.
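A back-of-envelope way to see where the 20–40% range comes from: FP8 halves the footprint of every tensor that is safely castable (2 bytes per element down to 1), while norms, embeddings, and accumulators stay in BF16. The castable fractions below are illustrative assumptions, not measured values for any specific model:

```python
def fp8_vram_estimate(bf16_total_gb, castable_fraction=0.6):
    """Estimate VRAM after FP8 casting: castable tensors halve in size,
    the remainder stays at BF16 precision."""
    saved = bf16_total_gb * castable_fraction * 0.5
    return bf16_total_gb - saved

# A hypothetical 70GB BF16 workload at different castable fractions
for frac in (0.4, 0.6, 0.8):
    est = fp8_vram_estimate(70, castable_fraction=frac)
    print(f"castable {frac:.0%}: ~{est:.0f}GB ({(70 - est) / 70:.0%} saved)")
```

With 40–80% of tensors castable, total savings land at 20–40%, matching the range quoted above.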
VAE tiling
During the final VAE decoding step, high-resolution videos spike VRAM significantly above the generation baseline. Enabling VAE tiling in ComfyUI or the model's CLI decodes the video in spatial tiles, eliminating this spike. This is particularly important for 1080p HunyuanVideo on H200, where the decoding step can push you over the 141GB limit without tiling.
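The idea can be sketched as a tile scheduler: decode each spatial region independently, with overlap between tiles to hide seams, so peak decode memory scales with the tile size rather than the full frame. The tile and overlap values below are illustrative, not any framework's defaults:

```python
def tile_boxes(height, width, tile=256, overlap=32):
    """Yield (top, left, bottom, right) boxes covering the frame with overlap."""
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes

boxes = tile_boxes(720, 1280)
full_pixels = 720 * 1280
peak_pixels = max((b - t) * (r - l) for t, l, b, r in boxes)
print(f"{len(boxes)} tiles; peak decode footprint {peak_pixels / full_pixels:.1%} of a full frame")
```

The overlapping margins are blended after decoding; the cost is a modest amount of redundant computation in exchange for a much lower memory peak.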
Inference step reduction
The default inference step count for most video models is 30–50 steps. Reducing to 20–25 steps approximately halves generation time at a noticeable but often acceptable quality reduction. Use step reduction for draft previews - generate at 25 steps to evaluate the composition, then at full steps for the final output.
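Under the rough assumption that generation time scales linearly with step count, the draft-then-final workflow can be compared against iterating at full quality:

```python
def iteration_time(full_steps, full_time_min, draft_steps, num_drafts):
    """Total GPU minutes: drafts at reduced steps plus one full-quality render.
    Assumes time scales linearly with inference steps (a rough approximation)."""
    draft_time = full_time_min * draft_steps / full_steps
    return num_drafts * draft_time + full_time_min

# 3 composition iterations on a 12-minute 720p clip: 50 steps full, 25 steps draft
with_drafts = iteration_time(50, 12, 25, 3)   # 3 * 6 + 12 = 30 min
all_full = 3 * 12 + 12                        # 48 min
print(f"draft workflow: {with_drafts:.0f} min vs {all_full} min at full steps")
```

At on-demand H100 rates that difference compounds quickly across a day of prompt iteration.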
Resolution and duration tradeoffs
The most effective VRAM optimization is simply generating at lower resolution first. Start every new prompt at 480p - if the composition and motion look correct, scale up to 720p for the final render. This saves significant GPU time during iteration.
For clip length, 5 seconds is the practical sweet spot for most use cases. VRAM requirements increase non-linearly above 5 seconds: a 10-second Wan 2.1 clip at 720p often requires the full 80GB H100 capacity (or exceeds it), while the 5-second equivalent uses 65–70GB comfortably.
Building a Video AI Production Pipeline
For teams shipping a video generation product, the architecture differs from image generation in a few important ways.
Job queue architecture is mandatory. Unlike image generation (which can return in seconds and can be handled synchronously), video generation takes 5–25 minutes per clip. Build around an async job queue from the start: submit a generation job, poll for status, retrieve the completed video. Redis + Celery, BullMQ, or a cloud-native queue service all work. Do not attempt synchronous video generation in a web request.
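A minimal in-process sketch of the submit/poll pattern looks like this. A real deployment would back it with Redis + Celery or similar so jobs survive process restarts; all names here are illustrative:

```python
import queue
import threading
import uuid

jobs = {}                 # job_id -> {"status": ..., "result": ...}
work = queue.Queue()

def submit(prompt):
    """Enqueue a generation job and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    work.put((job_id, prompt))
    return job_id

def worker():
    while True:
        job_id, prompt = work.get()
        jobs[job_id]["status"] = "running"
        # Placeholder for the real 5-25 minute generation call, which would
        # write the finished clip to object storage and record its URL.
        jobs[job_id].update(status="done", result=f"s3://clips/{job_id}.mp4")
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("A cat walking through a sunlit garden")
work.join()                    # in an API server the client polls status instead
print(jobs[job_id]["status"])  # done
```

The web tier only touches submit and the jobs dict; the worker tier owns the GPU, which is what lets you scale them independently.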
Storage for video outputs. A 5-second 720p clip from Wan 2.1 is typically 50–200MB depending on codec settings. A 1080p HunyuanVideo clip can exceed 500MB. Plan for object storage (any S3-compatible service) from day one. Local disk storage does not scale.
GPU utilization target. During active generation, GPU utilization should be consistently above 90%. If you observe sustained periods below 70%, there is a CPU bottleneck in the data pipeline - typically in prompt tokenization, weight loading, or output encoding. Profile with nvidia-smi dmon to identify gaps between generation jobs.
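nvidia-smi dmon emits one line per GPU per sampling interval, so gaps show up as low values in the sm (compute utilization) column. The sample below uses the typical dmon column layout, but verify the header on your driver version before parsing real output:

```python
SAMPLE = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec
# Idx      W      C      C      %      %      %      %
    0    310     61     55     97     82      0      0
    0    305     60     54     31     20      0      0
    0    312     62     55     98     84      0      0
"""

def sm_utilization(dmon_output):
    """Extract the 'sm' (compute utilization) column from dmon-style output."""
    lines = dmon_output.strip().splitlines()
    header = lines[0].lstrip("# ").split()   # column names from the first header line
    sm_col = header.index("sm")
    return [int(line.split()[sm_col]) for line in lines if not line.startswith("#")]

util = sm_utilization(SAMPLE)
dips = [u for u in util if u < 70]
print(f"samples: {util}; below-70% dips: {len(dips)}")
```

In production you would feed this from `nvidia-smi dmon` via a subprocess pipe and alert when the dip count climbs.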
Multi-GPU scaling is simple for video generation. Unlike multi-GPU LLM serving (which requires tensor parallelism and NVLink), video generation pipelines scale trivially: each GPU handles one generation job independently, with no inter-GPU communication. Four H100s run four concurrent generation jobs; throughput scales linearly. This makes video generation an ideal workload for GPU pools.
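This scaling pattern is a few lines of orchestration: pin each job to one GPU via CUDA_VISIBLE_DEVICES and run jobs in parallel. The generate function below is a placeholder for launching generate.py as a subprocess; prompts and GPU count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_GPUS = 4

def generate(gpu_id, prompt):
    """Placeholder: in production, launch the model as a subprocess pinned
    to a single GPU, e.g.
      subprocess.run(["python", "generate.py", "--prompt", prompt],
                     env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)})
    """
    return f"gpu{gpu_id}: {prompt}"

prompts = ["a cat in a garden", "waves at sunset", "city timelapse",
           "forest in fog", "a dog on a beach", "northern lights"]

# Round-robin assignment; each GPU runs one job at a time, no inter-GPU traffic.
with ThreadPoolExecutor(max_workers=NUM_GPUS) as pool:
    futures = [pool.submit(generate, i % NUM_GPUS, p) for i, p in enumerate(prompts)]
    results = [f.result() for f in futures]

print(results[0])  # gpu0: a cat in a garden
```

Setting the environment per subprocess (rather than mutating os.environ in threads) is what keeps the GPU pinning race-free.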
Wan 2.1, HunyuanVideo, and other open-source video AI models are running on Spheron's H100 and H200 GPUs today. No contracts, no waitlists. Start generating on the hardware that actually fits these models.
