Tutorial

Image-to-Video AI on GPU Cloud: Deploy LTX-Video, Wan 2.2 I2V, and Hunyuan Video Avatar (2026)

Written by Mitrasish, Co-founder | Apr 28, 2026
Tags: Image to Video Model Deployment, LTX Video GPU Cloud, Wan 2.2 I2V Deployment, Hunyuan Video Avatar Self Host, I2V Production Pipeline GPU, Self-Hosted Video Generation, Image to Video AI, GPU Cloud, CogVideoX I2V, Stable Video Diffusion, Portrait Animation, Runway Gen-3, RTX 4090, RTX 5090, H100, H200, L40S

Text-to-video models generate motion from scratch. Image-to-video models take a still frame you already own and animate it - which is exactly what ad-tech and creator teams need when they have product shots, character artwork, or portrait photos and want motion without a full generative pipeline.

This guide covers the three main self-hostable I2V models available today: LTX-Video (the lighter, faster option that runs on consumer GPUs), Wan 2.2 I2V (the production-grade choice for quality-first pipelines), and Hunyuan Video Avatar (for portrait and identity-anchored animation). For a broader text-to-video model comparison, the AI video generation GPU guide covers the full model landscape including HunyuanVideo, CogVideoX, and Runway. For VRAM sizing and infrastructure decisions, see GPU cloud for video AI 2026.

Image-to-Video vs Text-to-Video: When Each Workflow Wins

The technical difference between T2V and I2V is simple: T2V takes a text prompt, I2V takes a text prompt plus a reference image. The practical difference for production teams is significant.

| Dimension | Text-to-Video | Image-to-Video |
| --- | --- | --- |
| Input | Text prompt only | Reference image + motion prompt |
| Identity consistency | Model-dependent | High (anchored to input frame) |
| Best for | Creative ideation, b-roll | Product animation, portrait motion, ad assets |
| Typical workflow | Storyboard → generate | Asset library → animate |
| Example models | Wan 2.1/2.2 T2V, CogVideoX | LTX-Video I2V, Wan I2V, Hunyuan Avatar |

For ad-tech teams, the appeal is obvious: you have a product shot, a brand character, or a logo, and you want it to move. T2V models might generate a similar-looking object from scratch, but it won't match your existing asset. I2V models anchor to the frame you give them.

For creator platforms, the use case is portrait animation: take a user's profile photo and generate a short animated clip from it. This is the niche Hunyuan Video Avatar fills. Standard T2V models can't maintain identity across frames without explicit conditioning.

The tradeoff: I2V models are generally harder to prompt for complex motion than T2V, because the motion must be coherent with a specific starting frame. The image constrains what the model can reasonably generate next.

The I2V Model Landscape in 2026

LTX-Video (Lightricks)

LTX-Video is a DiT-based (diffusion transformer) model from Lightricks. The key advantage over Wan 2.2 and other larger models is weight size: LTX-Video runs on 16-24GB VRAM at 720p, making it the only production-quality I2V model that fits on a consumer RTX 4090.

The model supports both text-to-video and image-to-video modes. In I2V mode, you pass a conditioning frame and a motion prompt. The model tends to handle camera motion and object animation well, though identity preservation for portraits is weaker than Hunyuan Video Avatar.

  • Weights: Public at Lightricks/LTX-Video on HuggingFace
  • Min VRAM: 16GB at 720p (FP8 quantization), 24GB at 720p full precision
  • Best GPU: RTX 4090 (24GB) for real-time/interactive, H100 for batch production
  • Architecture: DiT transformer diffusion

Wan 2.2 I2V (Alibaba) [Self-Hosted] / Wan 2.5 [API-Only]

Wan 2.2 is the current version with publicly available weights for self-hosting. Wan 2.2 I2V A14B (MoE, 14B active) supports 720p I2V generation and produces broadcast-quality output with strong motion coherence and identity preservation from the input frame.

Important note on Wan 2.5: Wan 2.5-Preview (released September 2025) is a multimodal audio-video model from Alibaba and is only accessible via Alibaba Cloud APIs. There are no public weights for self-deployment. The same applies to Wan 2.6. For any self-hosted I2V workflow today, you need Wan 2.2 I2V weights. The existing Wan 2.1/2.2 deployment guide covers the shared infrastructure setup for both T2V and I2V modes.

  • Weights: Public at Wan-AI/Wan2.2-I2V-A14B on ModelScope and HuggingFace
  • Min VRAM: 40-48GB at 480p (FP8), 65-80GB at 720p
  • Minimum GPU: H100 PCIe (80GB) for 720p
  • I2V mode: Image frame + motion prompt, strong identity retention from input

Hunyuan Video Avatar (Tencent)

Hunyuan Video Avatar is a specialized model built on top of the HunyuanVideo base, with added facial landmark conditioning. It keeps the subject's identity stable across all generated frames while driving expression and movement from a motion reference. This makes it useful for portrait animation, talking head generation, and character-driven ad content.

The key difference from standard HunyuanVideo: the Avatar model does not generate a subject from scratch. It takes an identity image as input and preserves that person's face and features throughout the clip. For use cases where identity fidelity matters (customer-facing content, branded characters), this conditioning is essential.

  • Weights: Available at tencent/HunyuanVideo-Avatar on HuggingFace (check the repo for current license terms before deployment)
  • Min VRAM: 40GB+ (L40S or H100 minimum)
  • Architecture: HunyuanVideo base with facial landmark conditioning
  • Best for: Portrait animation, talking head generation, identity-anchored clips

CogVideoX-I2V (ZhipuAI)

CogVideoX-5B-I2V is a 5B parameter model with strong prompt adherence in I2V mode. It needs H100 or H200 hardware for 720p at usable speeds and trades some motion coherence for better instruction following.

Stable Video Diffusion 1.1 (Stability AI)

The lightest option in this list. SVD 1.1 runs on 8-16GB VRAM, which makes it accessible on consumer GPUs, but the quality ceiling is noticeably lower than LTX-Video and Wan 2.2. Good for preview generation and low-compute environments.

Model Comparison

| Model | Weights | Min VRAM | 720p Gen (H100 est.) | 720p Gen (RTX 4090) | Identity Preservation | Prompt Adherence |
| --- | --- | --- | --- | --- | --- | --- |
| LTX-Video I2V | Public | 16GB | ~20-30s | ~90-120s | Medium | High |
| Wan 2.2 14B I2V | Public | 40GB (FP8) | ~45-60s | OOM | High | High |
| Hunyuan Video Avatar | Public (gated) | 40GB | ~90-120s | OOM | Very High | Medium |
| CogVideoX-5B-I2V | Public | 24GB | ~60-90s | OOM at 720p | High | Very High |
| SVD 1.1 | Public | 8GB | ~15-20s | ~25-40s | Low | Low |

All timing figures are approximate for single-batch generation with no concurrent load. Actual performance varies by driver version, step count, and hardware configuration. Treat every number as an estimate and run your own benchmarks on target hardware before provisioning at scale.

VRAM and Step-Time Tables by Resolution

720p (1280x720)

| Model | GPU | VRAM Used | Steps | Approx. Time |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | RTX 4090 | ~22GB (FP8) | 30 | ~90-120s |
| LTX-Video I2V | RTX 5090 | ~24GB | 30 | ~45-60s |
| LTX-Video I2V | H100 PCIe | ~24GB | 30 | ~20-30s |
| Wan 2.2 I2V 14B | H100 SXM5 | ~65-80GB | 40 | ~45-60s |
| CogVideoX-I2V | H100 PCIe | ~28GB | 50 | ~60-90s |
| Hunyuan Avatar | H100 PCIe | ~42-48GB | 50 | ~90-120s |

1080p (1920x1080)

| Model | GPU | VRAM Used | Steps | Approx. Time |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | H100 PCIe | ~32-40GB | 30 | ~60-90s |
| LTX-Video I2V | H200 SXM5 | ~36GB | 30 | ~30-45s |
| Wan 2.2 I2V 14B | H200 SXM5 | ~80GB+ | 40 | ~90-120s |
| Hunyuan Avatar | H200 SXM5 | ~80-100GB | 50 | ~150-200s |

Note: All VRAM figures assume BF16 precision unless otherwise noted. FP8 quantization reduces VRAM by 30-40% with minor quality loss. "~" indicates approximations.

Real-Time vs Offline: Choosing the Right GPU

LTX-Video is the only I2V model with a meaningful real-time use case on GPU cloud. At FP8 quantization on an RTX 4090 or RTX 5090, it generates 720p clips fast enough to support interactive applications where users submit an image and wait a short time for output.

For everything else, you are in batch processing territory.

Interactive/live generation (LTX-Video): An RTX 4090 with FP8 quantization produces 720p output in roughly 90-120 seconds per clip. For interactive tools where users tolerate short waits, this works. The RTX 5090 at 32GB cuts that time roughly in half.

Batch render farm (high volume, any model): The H100 SXM5 at $2.90/hr on-demand is the most cost-efficient option for high-throughput LTX-Video work. For Wan 2.2 I2V and Hunyuan Avatar, H100 is the minimum viable GPU.

Quality-first production (Wan 2.2, CogVideoX): H100 PCIe or SXM5 for 720p. H200 SXM5 for 1080p and Wan 2.2 10-second clips.

Portrait/avatar work (Hunyuan Video Avatar): H100 PCIe minimum (40GB+ requirement). H200 for comfortable headroom and longer clips.

L40S: Good middle ground for LTX-Video and lighter I2V workloads. Check current GPU pricing for L40S availability.

Production I2V Pipeline Architecture

For teams shipping I2V as a product feature, a synchronous API is not viable. A 90-second LTX-Video job or a 60-second Wan 2.2 job cannot sit in a web request.

A practical production setup has these components:

  1. API gateway - receives the input image, motion prompt, and generation parameters; validates inputs; returns a job ID immediately
  2. Redis queue - stores pending jobs with priority tiers (paid users, free tier, batch jobs)
  3. GPU worker pool - one worker process per GPU, each with the model loaded and resident in VRAM between jobs; workers pull from the queue and post results to object storage
  4. Object storage - S3-compatible storage (MinIO, Cloudflare R2, or AWS S3) for generated video files; workers write output paths back to a results store
  5. Post-processing step - optional ffmpeg watermarking, format conversion, or quality check
  6. Webhook or polling endpoint - caller polls for job status or receives a webhook when the clip is ready

Each GPU worker keeps the model weights resident in VRAM between jobs. Loading LTX-Video weights takes roughly 10-15 seconds; Wan 2.2 14B takes 30-60 seconds. Reloading on every job kills throughput. Keep workers alive and warm.
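A minimal sketch of this warm-worker pattern, assuming a Redis list as the job queue and hypothetical load_pipeline, generate_clip, and upload_to_object_storage helpers wrapping the model's inference code and your object storage (none of these names come from an official SDK):

python
# worker.py - one process per GPU; the model stays resident in VRAM between jobs
import json
import redis  # pip install redis

from my_ltx_wrapper import load_pipeline, generate_clip  # hypothetical wrapper around the model's inference script
from my_storage import upload_to_object_storage          # hypothetical S3/MinIO/R2 helper

QUEUE_KEY = "i2v:jobs"       # pending jobs, pushed by the API gateway
RESULTS_KEY = "i2v:results"  # job_id -> output URL or error detail

def main() -> None:
    r = redis.Redis(host="localhost", port=6379)
    pipeline = load_pipeline("./models/ltx-video")  # load once (~10-15s), reuse for every job

    while True:
        # BLPOP blocks until a job arrives, so an idle worker costs nothing beyond the GPU reservation
        _, raw = r.blpop(QUEUE_KEY)
        job = json.loads(raw)
        try:
            video_path = generate_clip(
                pipeline,
                image_path=job["image_path"],
                prompt=job["prompt"],
                num_inference_steps=job.get("steps", 30),
            )
            url = upload_to_object_storage(video_path)
            r.hset(RESULTS_KEY, job["id"], json.dumps({"status": "done", "url": url}))
        except Exception as exc:  # record the failure instead of crashing the warm worker
            r.hset(RESULTS_KEY, job["id"], json.dumps({"status": "error", "detail": str(exc)}))

if __name__ == "__main__":
    main()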

Multi-GPU scaling is trivial for I2V: each GPU handles one job independently, no inter-GPU communication needed. Four H100s give four concurrent generation slots; throughput scales linearly with GPU count.
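A rough launcher for that one-worker-per-GPU layout, pinning each process to a single device with CUDA_VISIBLE_DEVICES (worker.py is the hypothetical script sketched above):

python
# launch_workers.py - start one warm worker per GPU on the node
import os
import subprocess

NUM_GPUS = 4  # e.g. a 4x H100 node

procs = []
for gpu_id in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # each worker sees exactly one GPU
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()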

Step-by-Step: Deploy LTX-Video on Spheron

LTX-Video runs on RTX 4090 hardware (24GB VRAM) with FP8 quantization at 720p. This makes it the most accessible I2V model to self-host.

Step 1: Provision an RTX 4090 instance

Go to Spheron's RTX 4090 rental page and provision an Ubuntu 22.04 instance. Verify CUDA 12.1 or higher is installed after boot.

bash
nvidia-smi
nvcc --version

Step 2: Install dependencies

bash
# Python environment
python3 -m venv ltxvideo-env
source ltxvideo-env/bin/activate

# PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Clone the LTX-Video repo
git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video
pip install -r requirements.txt

Step 3: Download model weights

bash
pip install huggingface_hub
huggingface-cli download Lightricks/LTX-Video \
  --local-dir ./models/ltx-video \
  --include "*.safetensors" "*.json" "*.txt"

Step 4: Test with an I2V generation

bash
python inference.py \
  --ckpt_dir ./models/ltx-video \
  --input_image_path /path/to/your/image.jpg \
  --prompt "The subject slowly turns to face the camera" \
  --height 720 \
  --width 1280 \
  --num_frames 25 \
  --fps 24 \
  --num_inference_steps 30 \
  --output_path ./output/clip.mp4

Step 5: Set up an inference wrapper for production

For production use, wrap the model in a FastAPI service so workers can accept jobs from a queue:

bash
pip install fastapi uvicorn python-multipart

# Start the API server on port 8080
uvicorn app:app --host 0.0.0.0 --port 8080
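A rough sketch of what app.py could look like, loading the model once at process startup and reusing the hypothetical load_pipeline/generate_clip helpers from the worker sketch earlier; the route and request fields are illustrative assumptions, not the repo's API:

python
# app.py - minimal inference wrapper: load the model once, serve /generate
import shutil
import tempfile

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import FileResponse

from my_ltx_wrapper import load_pipeline, generate_clip  # hypothetical wrapper around inference.py

app = FastAPI()
pipeline = load_pipeline("./models/ltx-video")  # resident in VRAM for the lifetime of the process

@app.post("/generate")
def generate(image: UploadFile = File(...), prompt: str = Form(...), steps: int = Form(30)):
    # Persist the upload so the pipeline can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
        shutil.copyfileobj(image.file, tmp)
        image_path = tmp.name

    video_path = generate_clip(pipeline, image_path=image_path, prompt=prompt, num_inference_steps=steps)
    return FileResponse(video_path, media_type="video/mp4")

In a full pipeline this endpoint sits behind the queue described earlier rather than being called synchronously by end users.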

Access from outside the instance via SSH tunnel:

bash
ssh -L 8080:localhost:8080 user@your-server-ip

Step-by-Step: Deploy Wan 2.2 I2V on Spheron

Wan 2.2 I2V 14B requires an H100 PCIe (80GB) for 720p generation. The setup is similar to the T2V workflow described in the full Wan setup guide. The main difference is the model checkpoint and the generation task flag.

Step 1: Provision an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance with Ubuntu 22.04. The SXM5 at $2.90/hr on-demand has higher memory bandwidth than the PCIe variant, which reduces generation time for 720p work.

Step 2: Install dependencies

bash
git clone https://github.com/Wan-AI/Wan2.2.git
cd Wan2.2
python3 -m venv wan-env
source wan-env/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Step 3: Download I2V model weights

bash
pip install modelscope
modelscope download --model Wan-AI/Wan2.2-I2V-A14B \
  --local_dir ./models/wan2.2-i2v-a14b

Or via HuggingFace:

bash
pip install huggingface_hub
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
  --local-dir ./models/wan2.2-i2v-a14b

Step 4: Generate an I2V clip

bash
python generate.py \
  --task i2v-14B \
  --size 1280*720 \
  --ckpt_dir ./models/wan2.2-i2v-a14b \
  --image /path/to/input.jpg \
  --prompt "The camera slowly zooms out as the subject smiles" \
  --sample_steps 40 \
  --sample_shift 5.0 \
  --save_file ./output/wan-i2v-clip.mp4

For FP8 quantization (reduces VRAM to ~40-48GB, enabling H100 PCIe for some 480p I2V work):

bash
python generate.py \
  --task i2v-14B \
  --ckpt_dir ./models/wan2.2-i2v-a14b \
  --image /path/to/input.jpg \
  --prompt "your motion prompt" \
  --quantize fp8 \
  --size 832*480 \
  --save_file ./output/clip_480p.mp4

Wan 2.2 I2V uses the same ComfyUI-WanVideoWrapper node package as T2V. If you already have a ComfyUI deployment from the T2V guide, switch to the I2V checkpoint and update your workflow node to use the image input. No infrastructure changes required.

Avatar Deployment: Hunyuan Video Avatar

Hunyuan Video Avatar generates portrait animations with identity preservation. The model takes a subject photo and drives expression and movement from a reference motion source, keeping the person's face consistent across all frames.

Weight access: The weights are available at tencent/HunyuanVideo-Avatar on HuggingFace. Check the repository for the current license terms before deploying for commercial use. Some Tencent model releases use a non-commercial research license.

Hardware: H100 PCIe minimum (40GB+ required). H200 SXM5 gives more headroom for longer clips and higher resolutions.

Setup:

bash
python3 -m venv hunyuan-avatar-env
source hunyuan-avatar-env/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Download weights (includes inference code and model files)
pip install huggingface_hub
huggingface-cli download tencent/HunyuanVideo-Avatar \
  --local-dir ./ckpts/HunyuanVideo-Avatar

Identity image requirements:

  • Single face, front-facing or slight angle (up to ~30 degrees)
  • Clear lighting, no heavy shadows across the face
  • Minimum 256x256 resolution, 512x512 or higher preferred

Running a portrait animation job:

bash
python sample_video.py \
  --config ./configs/hunyuanvideo_avatar_720p.yaml \
  --identity_image /path/to/portrait.jpg \
  --driving_video /path/to/motion_reference.mp4 \
  --output_path ./output/avatar_clip.mp4 \
  --num_frames 49 \
  --fps 24

The driving video provides the motion reference. For talking head applications, this is typically a short clip of a neutral face performing the target expressions. The model maps the motion from the reference onto the identity image's subject.

For teams building portrait animation products at scale, Hunyuan Video Avatar is currently the most capable open-source option. The identity preservation noticeably outperforms standard HunyuanVideo I2V mode, where the face can drift across frames.

Spheron GPU Pricing: Cost Per Finished Second of Video

Pricing below is based on live Spheron API data as of April 28, 2026. "720p sec/hr" is an estimated output rate for 4-second 720p clips with default settings on LTX-Video; treat all throughput figures as approximate single-batch estimates.

| GPU | On-Demand ($/hr) | Spot ($/hr) | 720p sec/hr (LTX est.) | $/finished sec (LTX) |
| --- | --- | --- | --- | --- |
| RTX 4090 | $0.79 | N/A | ~120 | ~$0.007 |
| RTX 5090 PCIe | $0.86 | N/A | ~200 | ~$0.004 |
| L40S | $0.72 | $0.32 | ~180 | ~$0.004 |
| H100 PCIe | $2.01 | N/A | ~450 | ~$0.004 |
| H100 SXM5 | $2.90 | $0.80 | ~500 | ~$0.006 |
| H200 SXM5 | $3.96 | $1.19 | ~600 | ~$0.007 |

The H100 SXM5 at spot pricing ($0.80/hr) is the best value for high-throughput LTX-Video work: roughly $0.002 per finished second of 720p output. That's 25x cheaper than Runway Gen-3 Alpha ($0.05/sec) for the same output quality tier.

For Wan 2.2 I2V (heavier model, requires H100+), the math changes:

  • H100 SXM5 at $2.90/hr, generating roughly 80-100 seconds of Wan 2.2 video per hour: ~$0.029-0.036/finished sec
  • Still cheaper than Runway at $0.05/sec, with significantly better identity preservation

For a team generating 1,000 four-second I2V clips per day on H100 SXM5 spot pricing:

  • 4,000 seconds of output, at $0.80/hr, H100 SXM5 can produce ~500 sec/hr (LTX-Video)
  • Total GPU hours needed: 8 hours
  • Total cost: ~$6.40/day for LTX-Video output, vs $200/day on Runway Gen-3 at $0.05/sec
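The same arithmetic applies to any model/GPU pair; a quick sketch using the illustrative figures above:

python
# i2v_cost.py - back-of-envelope cost per finished second and daily GPU spend
def cost_per_finished_sec(gpu_hourly_usd: float, output_sec_per_hour: float) -> float:
    return gpu_hourly_usd / output_sec_per_hour

def daily_gpu_cost(gpu_hourly_usd: float, output_sec_per_hour: float,
                   clips_per_day: int, clip_len_sec: float) -> float:
    gpu_hours = (clips_per_day * clip_len_sec) / output_sec_per_hour
    return gpu_hours * gpu_hourly_usd

# H100 SXM5 spot running LTX-Video at 720p: ~$0.0016 per finished second,
# ~$6.40/day for 1,000 four-second clips
print(cost_per_finished_sec(0.80, 500))
print(daily_gpu_cost(0.80, 500, clips_per_day=1000, clip_len_sec=4))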

Pricing fluctuates based on GPU availability. The prices above reflect rates as of April 28, 2026 and may have changed. Check current GPU pricing → for live rates.

Quality Benchmarks: Motion Coherence, Identity Preservation, Prompt Adherence

These ratings are qualitative, based on community testing and published model cards. For official VBench or EvalCrafter scores, check each model's paper or GitHub repository directly.

| Model | Motion Coherence | Identity Preservation | Prompt Adherence | Artifacts |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | Good | Medium | High | Low |
| Wan 2.2 14B I2V | Very Good | High | High | Low |
| Hunyuan Video Avatar | Medium | Very High | Medium | Medium |
| CogVideoX-5B-I2V | Good | High | Very High | Low |
| SVD 1.1 | Medium | Low | Low | Medium |

A few practical notes:

LTX-Video handles camera motion and smooth transitions well. Where it struggles: complex scene changes, multiple subjects, and detailed face preservation over more than a few seconds.

Wan 2.2 I2V shows the most consistent results across varied inputs. The motion stays coherent even for clips with multiple moving elements or significant camera movement. The quality jump over LTX-Video is visible, particularly for clips longer than 3 seconds.

Hunyuan Video Avatar optimizes for identity at the expense of motion complexity. Clips tend to show subtle, naturalistic motion rather than large movements. For talking-head applications, this is ideal. For action sequences, it's limiting.

CogVideoX-I2V follows text prompts more precisely than the other models. If your pipeline needs the output to match a specific motion description, CogVideoX handles this better. The tradeoff is slightly lower motion realism compared to Wan 2.2.

Migrating from Runway Gen-3 or Sora API to Self-Hosted I2V

Break-even math:

At $0.05/sec (Runway Gen-3 Alpha), a team generating 1,000 clips/day at 4 seconds each spends $200/day or roughly $6,000/month. A single H100 SXM5 at spot pricing ($0.80/hr) costs roughly $576/month running 24/7. At 500 sec/hr throughput for LTX-Video, one H100 handles about 12,000 sec/day of capacity, far more than 4,000 sec/day of demand - meaning you likely only need to run the GPU during active generation windows.

Break-even is typically at 50-150 clips per day, depending on clip length and model choice.
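A rough way to sanity-check that range for your own numbers, assuming the conservative case where the GPU is reserved around the clock and ignoring ops overhead:

python
# breakeven.py - clips/day at which a 24/7 GPU reservation beats a per-second API
def breakeven_clips_per_day(api_usd_per_sec: float, clip_len_sec: float, gpu_hourly_usd: float) -> float:
    fixed_daily_gpu = gpu_hourly_usd * 24
    api_cost_per_clip = api_usd_per_sec * clip_len_sec
    return fixed_daily_gpu / api_cost_per_clip

# $0.05/sec API, 4-second clips, H100 SXM5 spot at $0.80/hr -> ~96 clips/day
print(breakeven_clips_per_day(0.05, 4, 0.80))

On pure GPU-hour cost, paying only during active generation windows pushes the break-even point lower still.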

API vs self-hosted comparison:

| Factor | Runway / Sora API | Self-Hosted on Spheron |
| --- | --- | --- |
| Setup time | Minutes | Hours to days |
| Infrastructure ops | None | You manage |
| Cost at scale (1,000 clips/day) | $200/day | $6-25/day |
| Model control | Limited | Full |
| Fine-tuning | Not available | Possible |
| Latency | 30-60 seconds | 30-120 seconds |
| Uptime SLA | Commercial SLA | Your responsibility |
| Identity preservation | Good (Gen-3) | Very Good (Wan 2.2) |

Migration checklist:

  • Audit current prompt format: Runway prompts tend to be shorter and more abstract than what open-source I2V models respond to best. Wan 2.2 and CogVideoX benefit from more explicit motion descriptions.
  • Match output resolution: Runway Gen-3 outputs 1280x768 or 1280x720 by default. Configure your model to match for easy A/B comparisons.
  • Latency expectations: If your product shows users a progress indicator, self-hosted latency (30-120 sec) is comparable to Runway. If you have a synchronous API call with a timeout under 30 seconds, you need to restructure to async job queuing first.
  • Output format: Both Runway and self-hosted pipelines output MP4. No conversion needed. Bitrate and codec settings may differ; standardize with ffmpeg post-processing if needed.
  • Test on your actual input distribution: If your users submit varied portrait angles, lighting conditions, and image qualities, test the self-hosted model on a representative sample before cutting over production traffic.
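For that last checklist item, a small harness along these lines can push a representative sample through the self-hosted service and record latency for side-by-side review; the /generate route and field names are the assumptions from the earlier FastAPI sketch:

python
# ab_sample.py - run a sample of real user images through the self-hosted endpoint
import csv
import time
from pathlib import Path

import requests  # pip install requests

ENDPOINT = "http://localhost:8080/generate"  # FastAPI wrapper from the LTX-Video section
SAMPLE_DIR = Path("./sample_inputs")         # representative user images
OUTPUT_DIR = Path("./ab_outputs")
PROMPT = "subtle camera push-in, natural motion"

OUTPUT_DIR.mkdir(exist_ok=True)
with open("ab_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "latency_sec", "status"])
    for img in sorted(SAMPLE_DIR.glob("*.jpg")):
        start = time.time()
        with open(img, "rb") as fh:
            resp = requests.post(ENDPOINT, files={"image": fh}, data={"prompt": PROMPT}, timeout=600)
        latency = time.time() - start
        if resp.ok:
            (OUTPUT_DIR / f"{img.stem}.mp4").write_bytes(resp.content)
        writer.writerow([img.name, f"{latency:.1f}", resp.status_code])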

Running I2V at scale on Runway or Sora API costs $0.04-0.05 per finished second. On Spheron, the same workload on an H100 SXM5 at spot pricing runs far cheaper, with bare-metal performance and no per-output fees.

Rent RTX 4090 → | Rent H100 → | View all GPU pricing →

Launch your I2V pipeline on Spheron →
