Tutorial

Image-to-Video AI on GPU Cloud: Deploy LTX-Video, Wan 2.2 I2V, and Hunyuan Video Avatar (2026)

Written by Mitrasish, Co-founder | Apr 28, 2026
Tags: Image to Video Model Deployment, LTX Video GPU Cloud, Wan 2.2 I2V Deployment, Hunyuan Video Avatar Self Host, I2V Production Pipeline GPU, Self-Hosted Video Generation, Image to Video AI, GPU Cloud, CogVideoX I2V, Stable Video Diffusion, Portrait Animation, Runway Gen-3, RTX 4090, RTX 5090, H100, H200, L40S

Text-to-video models generate motion from scratch. Image-to-video models take a still frame you already own and animate it - which is exactly what ad-tech and creator teams need when they have product shots, character artwork, or portrait photos and want motion without a full generative pipeline.

This guide covers the three main self-hostable I2V models available today: LTX-Video (the lighter, faster option that runs on consumer GPUs), Wan 2.2 I2V (the production-grade choice for quality-first pipelines), and Hunyuan Video Avatar (for portrait and identity-anchored animation). For a broader text-to-video model comparison, the AI video generation GPU guide covers the full model landscape including HunyuanVideo, CogVideoX, and Runway. For VRAM sizing and infrastructure decisions, see GPU cloud for video AI 2026.

Image-to-Video vs Text-to-Video: When Each Workflow Wins

The technical difference between T2V and I2V is simple: T2V takes a text prompt, I2V takes a text prompt plus a reference image. The practical difference for production teams is significant.

| Dimension | Text-to-Video | Image-to-Video |
| --- | --- | --- |
| Input | Text prompt only | Reference image + motion prompt |
| Identity consistency | Model-dependent | High (anchored to input frame) |
| Best for | Creative ideation, b-roll | Product animation, portrait motion, ad assets |
| Typical workflow | Storyboard → generate | Asset library → animate |
| Example models | Wan 2.1/2.2 T2V, CogVideoX | LTX-Video I2V, Wan I2V, Hunyuan Avatar |

For ad-tech teams, the appeal is obvious: you have a product shot, a brand character, or a logo, and you want it to move. T2V models might generate a similar-looking object from scratch, but it won't match your existing asset. I2V models anchor to the frame you give them.

For creator platforms, the use case is portrait animation: take a user's profile photo and generate a short animated clip from it. This is the niche Hunyuan Video Avatar fills. Standard T2V models can't maintain identity across frames without explicit conditioning.

The tradeoff: I2V models are generally harder to prompt for complex motion than T2V, because the motion must be coherent with a specific starting frame. The image constrains what the model can reasonably generate next.

The I2V Model Landscape in 2026

LTX-Video (Lightricks)

LTX-Video is a DiT-based (diffusion transformer) model from Lightricks. The key advantage over Wan 2.2 and other larger models is weight size: LTX-Video runs on 16-24GB VRAM at 720p, making it the only production-quality I2V model that fits on a consumer RTX 4090.

The model supports both text-to-video and image-to-video modes. In I2V mode, you pass a conditioning frame and a motion prompt. The model tends to handle camera motion and object animation well, though identity preservation for portraits is weaker than Hunyuan Video Avatar.

  • Weights: Public at Lightricks/LTX-Video on HuggingFace
  • Min VRAM: 16GB at 720p (FP8 quantization), 24GB at 720p full precision
  • Best GPU: RTX 4090 (24GB) for real-time/interactive, H100 for batch production
  • Architecture: DiT transformer diffusion

Wan 2.2 I2V (Alibaba) [Self-Hosted] / Wan 2.5 [API-Only]

Wan 2.2 is the current version with publicly available weights for self-hosting. Wan 2.2 I2V A14B (MoE, 14B active) supports 720p I2V generation and produces broadcast-quality output with strong motion coherence and identity preservation from the input frame.

Important note on Wan 2.5: Wan 2.5-Preview (released September 2025) is a multimodal audio-video model from Alibaba and is only accessible via Alibaba Cloud APIs. There are no public weights for self-deployment. The same applies to Wan 2.6. For any self-hosted I2V workflow today, you need Wan 2.2 I2V weights. The existing Wan 2.1/2.2 deployment guide covers the shared infrastructure setup for both T2V and I2V modes.

  • Weights: Public at Wan-AI/Wan2.2-I2V-A14B on ModelScope and HuggingFace
  • Min VRAM: 40-48GB at 480p (FP8), 65-80GB at 720p
  • Minimum GPU: H100 PCIe (80GB) for 720p
  • I2V mode: Image frame + motion prompt, strong identity retention from input

Hunyuan Video Avatar (Tencent)

Hunyuan Video Avatar is a specialized model built on top of the HunyuanVideo base, with added facial landmark conditioning. It keeps the subject's identity stable across all generated frames while driving expression and movement from a motion reference. This makes it useful for portrait animation, talking head generation, and character-driven ad content.

The key difference from standard HunyuanVideo: the Avatar model does not generate a subject from scratch. It takes an identity image as input and preserves that person's face and features throughout the clip. For use cases where identity fidelity matters (customer-facing content, branded characters), this conditioning is essential.

  • Weights: Available at tencent/HunyuanVideo-Avatar on HuggingFace (check the repo for current license terms before deployment)
  • Min VRAM: 40GB+ (L40S or H100 minimum)
  • Architecture: HunyuanVideo base with facial landmark conditioning
  • Best for: Portrait animation, talking head generation, identity-anchored clips

CogVideoX-I2V (ZhipuAI)

CogVideoX-5B-I2V is a 5B parameter model with strong prompt adherence in I2V mode. It needs H100 or H200 hardware for 720p at usable speeds and trades some motion coherence for better instruction following.

Stable Video Diffusion 1.1 (Stability AI)

The lightest option in this list. SVD 1.1 runs on 8-16GB VRAM, which makes it accessible on consumer GPUs, but the quality ceiling is noticeably lower than LTX-Video and Wan 2.2. Good for preview generation and low-compute environments.

Model Comparison

| Model | Weights | Min VRAM | 720p Gen (H100 est.) | 720p Gen (RTX 4090) | Identity Preservation | Prompt Adherence |
| --- | --- | --- | --- | --- | --- | --- |
| LTX-Video I2V | Public | 16GB | ~20-30s | ~90-120s | Medium | High |
| Wan 2.2 14B I2V | Public | 40GB (FP8) | ~45-60s | OOM | High | High |
| Hunyuan Video Avatar | Public (gated) | 40GB | ~90-120s | OOM | Very High | Medium |
| CogVideoX-5B-I2V | Public | 24GB | ~60-90s | OOM at 720p | High | Very High |
| SVD 1.1 | Public | 8GB | ~15-20s | ~25-40s | Low | Low |

All timing figures are approximate for single-batch generation with no concurrent load. Actual performance varies by driver version, step count, and hardware configuration. Treat every number as an estimate and run your own benchmarks on target hardware before provisioning at scale.

VRAM and Step-Time Tables by Resolution

720p (1280x720)

| Model | GPU | VRAM Used | Steps | Approx. Time |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | RTX 4090 | ~22GB (FP8) | 30 | ~90-120s |
| LTX-Video I2V | RTX 5090 | ~24GB | 30 | ~45-60s |
| LTX-Video I2V | H100 PCIe | ~24GB | 30 | ~20-30s |
| Wan 2.2 I2V 14B | H100 SXM5 | ~65-80GB | 40 | ~45-60s |
| CogVideoX-I2V | H100 PCIe | ~28GB | 50 | ~60-90s |
| Hunyuan Avatar | H100 PCIe | ~42-48GB | 50 | ~90-120s |

1080p (1920x1080)

| Model | GPU | VRAM Used | Steps | Approx. Time |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | H100 PCIe | ~32-40GB | 30 | ~60-90s |
| LTX-Video I2V | H200 SXM5 | ~36GB | 30 | ~30-45s |
| Wan 2.2 I2V 14B | H200 SXM5 | ~80GB+ | 40 | ~90-120s |
| Hunyuan Avatar | H200 SXM5 | ~80-100GB | 50 | ~150-200s |

Note: All VRAM figures assume BF16 precision unless otherwise noted. FP8 quantization reduces VRAM by 30-40% with minor quality loss. "~" indicates approximations.

Real-Time vs Offline: Choosing the Right GPU

LTX-Video is the only I2V model with a meaningful real-time use case on GPU cloud. At FP8 quantization on an RTX 4090 or RTX 5090, it generates 720p clips fast enough to support interactive applications where users submit an image and wait a short time for output.

For everything else, you are in batch processing territory.

Interactive/live generation (LTX-Video): An RTX 4090 with FP8 quantization produces 720p output in roughly 90-120 seconds per clip. For interactive tools where users tolerate short waits, this works. The RTX 5090 at 32GB cuts that time roughly in half.

Batch render farm (high volume, any model): The H100 SXM5 at $2.90/hr on-demand is the most cost-efficient option for high-throughput LTX-Video work. For Wan 2.2 I2V and Hunyuan Avatar, H100 is the minimum viable GPU.

Quality-first production (Wan 2.2, CogVideoX): H100 PCIe or SXM5 for 720p. H200 SXM5 for 1080p and Wan 2.2 10-second clips.

Portrait/avatar work (Hunyuan Video Avatar): H100 PCIe minimum (40GB+ requirement). H200 for comfortable headroom and longer clips.

L40S: Good middle ground for LTX-Video and lighter I2V workloads. Check current GPU pricing for L40S availability.

Production I2V Pipeline Architecture

For teams shipping I2V as a product feature, a synchronous API is not viable. A 90-second LTX-Video job or a 60-second Wan 2.2 job cannot sit in a web request.

A practical production setup has these components:

  1. API gateway - receives the input image, motion prompt, and generation parameters; validates inputs; returns a job ID immediately
  2. Redis queue - stores pending jobs with priority tiers (paid users, free tier, batch jobs)
  3. GPU worker pool - one worker process per GPU, each with the model loaded and resident in VRAM between jobs; workers pull from the queue and post results to object storage
  4. Object storage - S3-compatible storage (MinIO, Cloudflare R2, or AWS S3) for generated video files; workers write output paths back to a results store
  5. Post-processing step - optional ffmpeg watermarking, format conversion, or quality check
  6. Webhook or polling endpoint - caller polls for job status or receives a webhook when the clip is ready

Each GPU worker keeps the model weights resident in VRAM between jobs. Loading LTX-Video weights takes roughly 10-15 seconds; Wan 2.2 14B takes 30-60 seconds. Reloading on every job kills throughput. Keep workers alive and warm.
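A minimal sketch of this warm-worker pattern, assuming a Redis list as the job queue and hypothetical load_pipeline, generate_clip, and upload_to_object_storage helpers wrapping the model's inference code and your object storage (none of these names come from an official SDK):

python
# worker.py - one process per GPU; the model stays resident in VRAM between jobs
import json
import redis  # pip install redis

from my_ltx_wrapper import load_pipeline, generate_clip  # hypothetical wrapper around the model's inference script
from my_storage import upload_to_object_storage          # hypothetical S3/MinIO/R2 helper

QUEUE_KEY = "i2v:jobs"       # pending jobs, pushed by the API gateway
RESULTS_KEY = "i2v:results"  # job_id -> output URL or error detail

def main() -> None:
    r = redis.Redis(host="localhost", port=6379)
    pipeline = load_pipeline("./models/ltx-video")  # load once (~10-15s), reuse for every job

    while True:
        # BLPOP blocks until a job arrives, so an idle worker costs nothing beyond the GPU reservation
        _, raw = r.blpop(QUEUE_KEY)
        job = json.loads(raw)
        try:
            video_path = generate_clip(
                pipeline,
                image_path=job["image_path"],
                prompt=job["prompt"],
                num_inference_steps=job.get("steps", 30),
            )
            url = upload_to_object_storage(video_path)
            r.hset(RESULTS_KEY, job["id"], json.dumps({"status": "done", "url": url}))
        except Exception as exc:  # record the failure instead of crashing the warm worker
            r.hset(RESULTS_KEY, job["id"], json.dumps({"status": "error", "detail": str(exc)}))

if __name__ == "__main__":
    main()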

Multi-GPU scaling is trivial for I2V: each GPU handles one job independently, no inter-GPU communication needed. Four H100s give four concurrent generation slots; throughput scales linearly with GPU count.
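A rough launcher for that one-worker-per-GPU layout, pinning each process to a single device with CUDA_VISIBLE_DEVICES (worker.py is the hypothetical script sketched above):

python
# launch_workers.py - start one warm worker per GPU on the node
import os
import subprocess

NUM_GPUS = 4  # e.g. a 4x H100 node

procs = []
for gpu_id in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # each worker sees exactly one GPU
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()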

Step-by-Step: Deploy LTX-Video on Spheron

LTX-Video runs on RTX 4090 hardware (24GB VRAM) with FP8 quantization at 720p. This makes it the most accessible I2V model to self-host.

Step 1: Provision an RTX 4090 instance

Go to Spheron's RTX 4090 rental page and provision an Ubuntu 22.04 instance. Verify CUDA 12.1 or higher is installed after boot.

bash
nvidia-smi
nvcc --version

Step 2: Install dependencies

bash
# Python environment
python3 -m venv ltxvideo-env
source ltxvideo-env/bin/activate

# PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Clone the LTX-Video repo
git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video
pip install -r requirements.txt

Step 3: Download model weights

bash
pip install huggingface_hub
huggingface-cli download Lightricks/LTX-Video \
  --local-dir ./models/ltx-video \
  --include "*.safetensors" "*.json" "*.txt"

Step 4: Test with an I2V generation

bash
python inference.py \
  --ckpt_dir ./models/ltx-video \
  --input_image_path /path/to/your/image.jpg \
  --prompt "The subject slowly turns to face the camera" \
  --height 720 \
  --width 1280 \
  --num_frames 25 \
  --fps 24 \
  --num_inference_steps 30 \
  --output_path ./output/clip.mp4

Step 5: Set up an inference wrapper for production

For production use, wrap the model in a FastAPI service so workers can accept jobs from a queue:

bash
pip install fastapi uvicorn python-multipart

# Start the API server on port 8080
uvicorn app:app --host 0.0.0.0 --port 8080
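A rough sketch of what app.py could look like, loading the model once at process startup and reusing the hypothetical load_pipeline/generate_clip helpers from the worker sketch earlier; the route and request fields are illustrative assumptions, not the repo's API:

python
# app.py - minimal inference wrapper: load the model once, serve /generate
import shutil
import tempfile

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import FileResponse

from my_ltx_wrapper import load_pipeline, generate_clip  # hypothetical wrapper around inference.py

app = FastAPI()
pipeline = load_pipeline("./models/ltx-video")  # resident in VRAM for the lifetime of the process

@app.post("/generate")
def generate(image: UploadFile = File(...), prompt: str = Form(...), steps: int = Form(30)):
    # Persist the upload so the pipeline can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
        shutil.copyfileobj(image.file, tmp)
        image_path = tmp.name

    video_path = generate_clip(pipeline, image_path=image_path, prompt=prompt, num_inference_steps=steps)
    return FileResponse(video_path, media_type="video/mp4")

In a full pipeline this endpoint sits behind the queue described earlier rather than being called synchronously by end users.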

Access from outside the instance via SSH tunnel:

bash
ssh -L 8080:localhost:8080 user@your-server-ip

Step-by-Step: Deploy Wan 2.2 I2V on Spheron

Wan 2.2 I2V 14B requires an H100 PCIe (80GB) for 720p generation. The setup is similar to the T2V workflow described in the full Wan setup guide. The main difference is the model checkpoint and the generation task flag.

Step 1: Provision an H100 instance

Go to Spheron's H100 GPU rental page and provision an H100 PCIe or SXM5 instance with Ubuntu 22.04. The SXM5 at $2.90/hr on-demand has higher memory bandwidth than the PCIe variant, which reduces generation time for 720p work.

Step 2: Install dependencies

bash
git clone https://github.com/Wan-AI/Wan2.2.git
cd Wan2.2
python3 -m venv wan-env
source wan-env/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Step 3: Download I2V model weights

bash
pip install modelscope
modelscope download --model Wan-AI/Wan2.2-I2V-A14B \
  --local_dir ./models/wan2.2-i2v-a14b

Or via HuggingFace:

bash
pip install huggingface_hub
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
  --local-dir ./models/wan2.2-i2v-a14b

Step 4: Generate an I2V clip

bash
python generate.py \
  --task i2v-14B \
  --size 1280*720 \
  --ckpt_dir ./models/wan2.2-i2v-a14b \
  --image /path/to/input.jpg \
  --prompt "The camera slowly zooms out as the subject smiles" \
  --sample_steps 40 \
  --sample_shift 5.0 \
  --save_file ./output/wan-i2v-clip.mp4

For FP8 quantization (reduces VRAM to ~40-48GB, enabling H100 PCIe for some 480p I2V work):

bash
python generate.py \
  --task i2v-14B \
  --ckpt_dir ./models/wan2.2-i2v-a14b \
  --image /path/to/input.jpg \
  --prompt "your motion prompt" \
  --quantize fp8 \
  --size 832*480 \
  --save_file ./output/clip_480p.mp4

Wan 2.2 I2V uses the same ComfyUI-WanVideoWrapper node package as T2V. If you already have a ComfyUI deployment from the T2V guide, switch to the I2V checkpoint and update your workflow node to use the image input. No infrastructure changes required.

Avatar Deployment: Hunyuan Video Avatar

Hunyuan Video Avatar generates portrait animations with identity preservation. The model takes a subject photo and drives expression and movement from a reference motion source, keeping the person's face consistent across all frames.

Weight access: The weights are available at tencent/HunyuanVideo-Avatar on HuggingFace. Check the repository for the current license terms before deploying for commercial use. Some Tencent model releases use a non-commercial research license.

Hardware: H100 PCIe minimum (40GB+ required). H200 SXM5 gives more headroom for longer clips and higher resolutions.

Setup:

bash
python3 -m venv hunyuan-avatar-env
source hunyuan-avatar-env/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Download weights (includes inference code and model files)
pip install huggingface_hub
huggingface-cli download tencent/HunyuanVideo-Avatar \
  --local-dir ./ckpts/HunyuanVideo-Avatar

Identity image requirements:

  • Single face, front-facing or slight angle (up to ~30 degrees)
  • Clear lighting, no heavy shadows across the face
  • Minimum 256x256 resolution, 512x512 or higher preferred

Running a portrait animation job:

bash
python sample_video.py \
  --config ./configs/hunyuanvideo_avatar_720p.yaml \
  --identity_image /path/to/portrait.jpg \
  --driving_video /path/to/motion_reference.mp4 \
  --output_path ./output/avatar_clip.mp4 \
  --num_frames 49 \
  --fps 24

The driving video provides the motion reference. For talking head applications, this is typically a short clip of a neutral face performing the target expressions. The model maps the motion from the reference onto the identity image's subject.

For teams building portrait animation products at scale, Hunyuan Video Avatar is currently the most capable open-source option. The identity preservation noticeably outperforms standard HunyuanVideo I2V mode, where the face can drift across frames.

Spheron GPU Pricing: Cost Per Finished Second of Video

Pricing below is based on live Spheron API data as of April 28, 2026. "720p sec/hr" is an estimated output rate for 4-second 720p clips with default settings on LTX-Video; treat all throughput figures as approximate single-batch estimates.

| GPU | On-Demand ($/hr) | Spot ($/hr) | 720p sec/hr (LTX est.) | $/finished sec (LTX) |
| --- | --- | --- | --- | --- |
| RTX 4090 | $0.79 | N/A | ~120 | ~$0.007 |
| RTX 5090 PCIe | $0.86 | N/A | ~200 | ~$0.004 |
| L40S | $0.72 | $0.32 | ~180 | ~$0.004 |
| H100 PCIe | $2.01 | N/A | ~450 | ~$0.004 |
| H100 SXM5 | $2.90 | $0.80 | ~500 | ~$0.006 |
| H200 SXM5 | $3.96 | $1.19 | ~600 | ~$0.007 |

The H100 SXM5 at spot pricing ($0.80/hr) is the best value for high-throughput LTX-Video work: roughly $0.002 per finished second of 720p output. That's 25x cheaper than Runway Gen-3 Alpha ($0.05/sec) for the same output quality tier.

For Wan 2.2 I2V (heavier model, requires H100+), the math changes:

  • H100 SXM5 at $2.90/hr, generating roughly 80-100 seconds of Wan 2.2 video per hour: ~$0.029-0.036/finished sec
  • Still cheaper than Runway at $0.05/sec, with significantly better identity preservation

For a team generating 1,000 four-second I2V clips per day on H100 SXM5 spot pricing:

  • 4,000 seconds of output, at $0.80/hr, H100 SXM5 can produce ~500 sec/hr (LTX-Video)
  • Total GPU hours needed: 8 hours
  • Total cost: ~$6.40/day for LTX-Video output, vs $200/day on Runway Gen-3 at $0.05/sec
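The same arithmetic applies to any model/GPU pair; a quick sketch using the illustrative figures above:

python
# i2v_cost.py - back-of-envelope cost per finished second and daily GPU spend
def cost_per_finished_sec(gpu_hourly_usd: float, output_sec_per_hour: float) -> float:
    return gpu_hourly_usd / output_sec_per_hour

def daily_gpu_cost(gpu_hourly_usd: float, output_sec_per_hour: float,
                   clips_per_day: int, clip_len_sec: float) -> float:
    gpu_hours = (clips_per_day * clip_len_sec) / output_sec_per_hour
    return gpu_hours * gpu_hourly_usd

# H100 SXM5 spot running LTX-Video at 720p: ~$0.0016 per finished second,
# ~$6.40/day for 1,000 four-second clips
print(cost_per_finished_sec(0.80, 500))
print(daily_gpu_cost(0.80, 500, clips_per_day=1000, clip_len_sec=4))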

Pricing fluctuates based on GPU availability. The prices above reflect rates as of April 28, 2026 and may have changed. Check current GPU pricing → for live rates.

Quality Benchmarks: Motion Coherence, Identity Preservation, Prompt Adherence

These ratings are qualitative, based on community testing and published model cards. For official VBench or EvalCrafter scores, check each model's paper or GitHub repository directly.

| Model | Motion Coherence | Identity Preservation | Prompt Adherence | Artifacts |
| --- | --- | --- | --- | --- |
| LTX-Video I2V | Good | Medium | High | Low |
| Wan 2.2 14B I2V | Very Good | High | High | Low |
| Hunyuan Video Avatar | Medium | Very High | Medium | Medium |
| CogVideoX-5B-I2V | Good | High | Very High | Low |
| SVD 1.1 | Medium | Low | Low | Medium |

A few practical notes:

LTX-Video handles camera motion and smooth transitions well. Where it struggles: complex scene changes, multiple subjects, and detailed face preservation over more than a few seconds.

Wan 2.2 I2V shows the most consistent results across varied inputs. The motion stays coherent even for clips with multiple moving elements or significant camera movement. The quality jump over LTX-Video is visible, particularly for clips longer than 3 seconds.

Hunyuan Video Avatar optimizes for identity at the expense of motion complexity. Clips tend to show subtle, naturalistic motion rather than large movements. For talking-head applications, this is ideal. For action sequences, it's limiting.

CogVideoX-I2V follows text prompts more precisely than the other models. If your pipeline needs the output to match a specific motion description, CogVideoX handles this better. The tradeoff is slightly lower motion realism compared to Wan 2.2.

Migrating from Runway Gen-3 or Sora API to Self-Hosted I2V

Break-even math:

At $0.05/sec (Runway Gen-3 Alpha), a team generating 1,000 clips/day at 4 seconds each spends $200/day or roughly $6,000/month. A single H100 SXM5 at spot pricing ($0.80/hr) costs roughly $576/month running 24/7. At 500 sec/hr throughput for LTX-Video, one H100 handles about 12,000 sec/day of capacity, far more than 4,000 sec/day of demand - meaning you likely only need to run the GPU during active generation windows.

Break-even is typically at 50-150 clips per day, depending on clip length and model choice.
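A rough way to sanity-check that range for your own numbers, assuming the conservative case where the GPU is reserved around the clock and ignoring ops overhead:

python
# breakeven.py - clips/day at which a 24/7 GPU reservation beats a per-second API
def breakeven_clips_per_day(api_usd_per_sec: float, clip_len_sec: float, gpu_hourly_usd: float) -> float:
    fixed_daily_gpu = gpu_hourly_usd * 24
    api_cost_per_clip = api_usd_per_sec * clip_len_sec
    return fixed_daily_gpu / api_cost_per_clip

# $0.05/sec API, 4-second clips, H100 SXM5 spot at $0.80/hr -> ~96 clips/day
print(breakeven_clips_per_day(0.05, 4, 0.80))

On pure GPU-hour cost, paying only during active generation windows pushes the break-even point lower still.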

API vs self-hosted comparison:

| Factor | Runway / Sora API | Self-Hosted on Spheron |
| --- | --- | --- |
| Setup time | Minutes | Hours to days |
| Infrastructure ops | None | You manage |
| Cost at scale (1,000 clips/day) | $200/day | $6-25/day |
| Model control | Limited | Full |
| Fine-tuning | Not available | Possible |
| Latency | 30-60 seconds | 30-120 seconds |
| Uptime SLA | Commercial SLA | Your responsibility |
| Identity preservation | Good (Gen-3) | Very Good (Wan 2.2) |

Migration checklist:

  • Audit current prompt format: Runway prompts tend to be shorter and more abstract than what open-source I2V models respond to best. Wan 2.2 and CogVideoX benefit from more explicit motion descriptions.
  • Match output resolution: Runway Gen-3 outputs 1280x768 or 1280x720 by default. Configure your model to match for easy A/B comparisons.
  • Latency expectations: If your product shows users a progress indicator, self-hosted latency (30-120 sec) is comparable to Runway. If you have a synchronous API call with a timeout under 30 seconds, you need to restructure to async job queuing first.
  • Output format: Both Runway and self-hosted pipelines output MP4. No conversion needed. Bitrate and codec settings may differ; standardize with ffmpeg post-processing if needed.
  • Test on your actual input distribution: If your users submit varied portrait angles, lighting conditions, and image qualities, test the self-hosted model on a representative sample before cutting over production traffic.
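For that last checklist item, a small harness along these lines can push a representative sample through the self-hosted service and record latency for side-by-side review; the /generate route and field names are the assumptions from the earlier FastAPI sketch:

python
# ab_sample.py - run a sample of real user images through the self-hosted endpoint
import csv
import time
from pathlib import Path

import requests  # pip install requests

ENDPOINT = "http://localhost:8080/generate"  # FastAPI wrapper from the LTX-Video section
SAMPLE_DIR = Path("./sample_inputs")         # representative user images
OUTPUT_DIR = Path("./ab_outputs")
PROMPT = "subtle camera push-in, natural motion"

OUTPUT_DIR.mkdir(exist_ok=True)
with open("ab_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "latency_sec", "status"])
    for img in sorted(SAMPLE_DIR.glob("*.jpg")):
        start = time.time()
        with open(img, "rb") as fh:
            resp = requests.post(ENDPOINT, files={"image": fh}, data={"prompt": PROMPT}, timeout=600)
        latency = time.time() - start
        if resp.ok:
            (OUTPUT_DIR / f"{img.stem}.mp4").write_bytes(resp.content)
        writer.writerow([img.name, f"{latency:.1f}", resp.status_code])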

Running I2V at scale on Runway or Sora API costs $0.04-0.05 per finished second. On Spheron, the same workload on an H100 SXM5 at spot pricing runs far cheaper, with bare-metal performance and no per-output fees.

Rent RTX 4090 → | Rent H100 → | View all GPU pricing →

Launch your I2V pipeline on Spheron →
