ComfyLab
Wan 2.2 Video Generation in ComfyUI: Image-to-Video Workflow Guide

12GB VRAM · Advanced · 11 min · Wan 2.2 I2V (1.3B or 14B)
Savien

Alibaba’s Wan 2.2 model is a solid step up for local video generation in ComfyUI. Compared with Wan 2.1, it handles temporal coherence better, with less flicker between frames, and it reads motion prompts with more nuance.

This guide walks you through the complete image to video ComfyUI pipeline, breaks down the hardware trade-offs between model sizes, and shows you the motion prompt techniques that actually produce fluid motion.

Wan 2.2 vs. Wan 2.1: What Improved

| Aspect | Wan 2.1 | Wan 2.2 |
| --- | --- | --- |
| Temporal coherence | Noticeable flicker between frames | Smooth transitions, minimal jitter |
| Motion prompt understanding | Basic directional prompts | Complex, layered descriptions work reliably |
| Frame artifacts | Color banding at object edges | Reduced distortion and color shifts |
| Node architecture | WanVideoModelLoader, WanVideoSampler, etc. | Same nodes, improved model weights |
| Migration effort | N/A | Drop-in replacement; no workflow rebuild needed |

The node names stay the same: WanVideoModelLoader, WanVideoTextEncode, WanVideoImageEncode, WanVideoSampler, and WanVideoVAEDecode. Upgrading from Wan 2.1? Just download the new model weights, drop them in your models folder, and select the new file in WanVideoModelLoader. Your existing graphs don’t need touching.

👉 Quick takeaway: Wan 2.2 is a direct upgrade with better temporal smoothness and prompt interpretation. If you already use Wan 2.1, migration is seamless—just swap the model file.

Model Sizes: 1.3B vs. 14B

The Wan 2.2 ComfyUI implementation comes in two flavors. Your choice hinges on GPU memory, how much time you’re willing to wait, and what quality threshold you need.

| Aspect | 1.3B Model | 14B Model |
| --- | --- | --- |
| File size | 2.7GB | ~30GB (single merged file) |
| VRAM required (no offload) | 12GB | 24GB |
| VRAM with sequential_cpu_offload | N/A | 14–16GB |
| Render time (33 frames, 16 fps) | 2–4 minutes | 3–6 minutes (no offload) |
| Render time with offload | N/A | 8–15 minutes |
| Motion quality | Good, straightforward scenes | Noticeably more fluid, detailed motion |
| Prompt interpretation | Reliable for simple/medium prompts | Handles complex, multi-layered descriptions |
| Best for | RTX 3060/4060 Ti, rapid iteration | RTX 4090, quality-first workflows |

The 1.3B Model: Speed and Accessibility

Grab this if you’re running 12GB VRAM or less, or if you need to test motion prompts without waiting around. The quality is honest—it nails basic camera movements, walking, object motion, and simple scene dynamics without breaking a sweat. The 2–4 minute render window makes it practical to test three or four different prompts in a single session.

The catch: complex, multi-layered prompts (like “camera tracks smoothly while the subject walks and wind blows leaves across the frame”) tend to simplify into less nuanced motion. For straightforward effects and rapid testing, it’s excellent.

💡 Tip: Use the 1.3B model to validate motion prompts at 480p before scaling up to higher resolutions. A quick test here saves you 10+ minutes on the 14B model if the prompt doesn’t work.

The 14B Model: Quality and Complexity

This is the full-featured version. Motion is noticeably more fluid, intricate prompts land better, and spatial coherence holds up over longer sequences. The trade-off is real: you need either 24GB VRAM without offload or 14–16GB with sequential_cpu_offload turned on.

With offload enabled, the model shuffles data between GPU and CPU as needed. Render time climbs to 8–15 minutes for a 33-frame sequence, but it becomes doable on 16–20GB systems. Without offload and with 24GB VRAM, you’re looking at 3–6 minute renders, which is practical for actual production work.

👉 Quick takeaway: The 14B model delivers superior motion fluidity and prompt interpretation. Enable sequential_cpu_offload if you have 14–20GB VRAM; skip it if you have 24GB+.

Installation and Setup

Step 1: Install the ComfyUI-WanVideo Custom Node

  1. Open ComfyUI Manager in your ComfyUI interface.
  2. Search for “WanVideo” in the custom nodes browser.
  3. Click Install on the ComfyUI-WanVideo node pack.
  4. Restart ComfyUI.

New nodes appear under the Video category. You’ll see WanVideoModelLoader, WanVideoTextEncode, WanVideoImageEncode, WanVideoSampler, WanVideoVAEDecode, and related utilities.

Step 2: Download and Place Model Weights

  1. Search Hugging Face for the official Alibaba Wan 2.2 repository and confirm you’re on the verified org page before downloading anything.
  2. Download the .safetensors file for your chosen model size — the 1.3B variant is around 2.7GB, the 14B variant is a single merged file around 30GB. Exact filenames vary by upload, so match against the VRAM figures in the table above rather than a specific filename.
  3. Drop the file directly in ComfyUI/models/diffusion_models/ (no nested folders).
  4. ComfyUI auto-detects it on the next startup.
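
If the model does not show up after a restart, the placement rule from step 3 is easy to check with a few lines of Python. This is a minimal sketch, assuming the folder layout described above; the function name `find_weights` is hypothetical and not part of ComfyUI.

```python
from pathlib import Path


def find_weights(comfy_root: str) -> tuple[list[Path], list[Path]]:
    """Split .safetensors files into correctly placed and nested ones.

    Assumes the layout from this guide:
    <comfy_root>/models/diffusion_models/<file>.safetensors
    """
    model_dir = Path(comfy_root) / "models" / "diffusion_models"
    if not model_dir.is_dir():
        return [], []
    # Files sitting directly in diffusion_models/ follow the guide's rule.
    placed = sorted(model_dir.glob("*.safetensors"))
    # Files found deeper than the top level sit in a nested folder.
    nested = sorted(
        p for p in model_dir.rglob("*.safetensors") if p.parent != model_dir
    )
    return placed, nested


if __name__ == "__main__":
    placed, nested = find_weights("ComfyUI")  # path is an assumption
    print("placed:", [p.name for p in placed])
    print("nested (move these up one level):", [str(p) for p in nested])
```

Anything reported as nested should be moved directly into diffusion_models/ before the next restart.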

Building the Image-to-Video Workflow

Here’s the complete node graph for an image to video ComfyUI pipeline:

Node 1: Load Image

  • Input your starting frame (PNG, JPG, or WebP).
  • Resolution requirements:
    • 1.3B model: 480×832 (vertical) or 832×480 (horizontal)
    • 14B model: up to 720×1280
    • All dimensions must be divisible by 16 (e.g., 480, 496, 512, 528 are valid; 500 is not).
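
The divisible-by-16 rule can be checked before you load an image. The helper names below are hypothetical and only illustrate the arithmetic; they are not part of any node pack.

```python
def is_valid_dimension(value: int, multiple: int = 16) -> bool:
    """True when a width or height is a positive multiple of 16."""
    return value > 0 and value % multiple == 0


def snap_down(value: int, multiple: int = 16) -> int:
    """Round a dimension down to the nearest multiple of 16."""
    return (value // multiple) * multiple


if __name__ == "__main__":
    for size in (480, 496, 500, 512, 528):
        print(size, is_valid_dimension(size))
    print(snap_down(500))  # 496
```

Rounding down rather than up keeps the resized image inside the original frame, so you crop or scale slightly instead of padding.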

Node 2: WanVideoModelLoader

model_name: [select your .safetensors file]
sequential_cpu_offload: [enable for the 14B model with less than 24GB VRAM; disable for the 1.3B model or with 24GB+]
  • Using the 14B model on 16–20GB VRAM? Enable sequential_cpu_offload.
  • Leave it off for the 1.3B model or if you have 24GB+ VRAM.

Node 3: WanVideoImageEncode

  • Connect the Load Image output here.
  • This node preps your starting frame as the video’s anchor point. No parameters to tweak.

Node 4: WanVideoTextEncode

This is where the magic happens. Write your prompt in English; the model was trained primarily on English descriptions.

Prompts that actually work:

  • "the person walks slowly forward, camera pans right"
  • "ocean waves crash gently on the shore, soft foam movement"
  • "clouds drift slowly across the sky, wind-blown motion"
  • "the camera follows from behind as the person walks forward, smooth tracking shot"

The key rule: Specify both subject motion and camera motion. Generic stuff like "movement" or "action" produces either static frames or incoherent results. Specific, directional language is what gets you fluid motion.

Wan 2.2 doesn’t use a motion_bucket_id parameter (that’s Stable Video Diffusion territory). Motion intensity comes from the prompt wording and sampler settings.

Node 5: WanVideoSampler

This is where Wan 2.2 video generation happens. The settings that matter:

  • num_frames: 33 (sweet spot: 25–65). More frames aren’t automatically better; 33 balances quality and speed nicely.
  • fps: 16–24
    • 16 fps: slow motion, cinematic feel
    • 20–24 fps: standard action, smooth playback
    • 33 frames at 16 fps = ~2 seconds of video
  • steps: 20–30 for quality/speed balance. Go to 40+ only if you’re chasing maximum quality.
  • seed: Fixed value for reproducibility, or -1 for random variation.
  • cfg_scale: 7–9 works well. Push to 12 if you need stronger prompt adherence; avoid anything above 13 (causes artifacts).
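
To make the ranges above concrete, here is a small sketch that computes clip length and flags values outside the recommendations in this list. The `SamplerSettings` class is a hypothetical container for illustration; it mirrors the parameter names used in this guide and is not the node's actual interface.

```python
from dataclasses import dataclass


@dataclass
class SamplerSettings:
    """Hypothetical holder for the sampler values discussed in this guide."""

    num_frames: int = 33
    fps: int = 16
    steps: int = 25
    cfg_scale: float = 8.0
    seed: int = -1  # -1 means random, as described above

    def duration_seconds(self) -> float:
        # Clip length is simply frames divided by frames per second.
        return self.num_frames / self.fps

    def warnings(self) -> list[str]:
        """Return notes for values outside this guide's suggested ranges."""
        notes = []
        if not 25 <= self.num_frames <= 65:
            notes.append("num_frames is outside the 25-65 sweet spot")
        if not 16 <= self.fps <= 24:
            notes.append("fps is outside the 16-24 range")
        if self.steps < 20:
            notes.append("steps below 20 may reduce quality")
        if self.cfg_scale > 13:
            notes.append("cfg_scale above 13 tends to cause artifacts")
        return notes


if __name__ == "__main__":
    settings = SamplerSettings()
    print(f"{settings.duration_seconds():.2f} s")  # 33 / 16 = 2.06 s
    print(settings.warnings())
```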

Node 6: WanVideoVAEDecode

  • No parameters. Converts the latent frames from the sampler into actual pixel data.
  • Connect directly from WanVideoSampler.

Node 7: VHS_VideoCombine

Exports your final video:

  • format: MP4 (recommended) or WebM
  • fps: Must match the sampler’s fps setting
  • quality: 95 for maximum quality

📌 Keep in mind: The Wan 2.2 I2V workflow is linear and straightforward: Load Image → Encode Image → Encode Motion Prompt → Sample → Decode → Export. Each node does one thing well.
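
The wiring can also be written down as a dependency map and sorted with Python's standard `graphlib` module. This is a sketch only: the Load Image → WanVideoImageEncode, WanVideoSampler → WanVideoVAEDecode, and decode → export links come from the steps above, while the three inputs feeding WanVideoSampler are an assumption about how the graph is connected.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it takes input from.
# The sampler's three inputs are assumed, not confirmed by the node pack.
GRAPH = {
    "Load Image": set(),
    "WanVideoModelLoader": set(),
    "WanVideoTextEncode": set(),
    "WanVideoImageEncode": {"Load Image"},
    "WanVideoSampler": {
        "WanVideoModelLoader",
        "WanVideoTextEncode",
        "WanVideoImageEncode",
    },
    "WanVideoVAEDecode": {"WanVideoSampler"},
    "VHS_VideoCombine": {"WanVideoVAEDecode"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the nodes can run."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(GRAPH)))
```

Whatever order the loaders and encoders come out in, the last three entries are always the sampler, the decoder, and the export node, which is the linear tail described above.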

Motion Prompt Techniques That Work

Motion prompts are what separate a static frame from genuinely fluid movement. Here’s what actually makes a difference:

Be Directional

Instead of: "the person moves"
Try: "the person walks forward slowly, camera follows from behind"

Direction words (forward, backward, left, right, up, down, toward, away) trigger coherent motion. Vague terms just produce random jitter.
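
A quick way to catch vague prompts before rendering is to look for the direction words listed above. The sketch below is a plain keyword check, not anything the model itself does; the function name is hypothetical.

```python
import re

# The direction words named in this guide.
DIRECTION_WORDS = {
    "forward", "backward", "left", "right", "up", "down", "toward", "away",
}


def direction_words_in(prompt: str) -> set[str]:
    """Return the direction words that appear in a motion prompt."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return DIRECTION_WORDS & tokens


if __name__ == "__main__":
    for prompt in (
        "the person moves",
        "the person walks forward slowly, camera follows from behind",
    ):
        found = direction_words_in(prompt)
        print(prompt, "->", sorted(found) or "no direction words")
```

An empty result is a hint to rewrite the prompt, not proof that it will fail.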

Combine Subject and Camera

The best prompts describe both what moves in the scene and how the camera moves:

  • "the car drives down the road, camera pans left to follow"
  • "the dancer spins, camera circles around them"
  • "the waves crash, camera slowly zooms in on the foam"

Use Speed Modifiers

Words like “slowly,” “gently,” “rapidly,” “fast,” and “dynamic” shift motion intensity:

  • Slow prompts → subtle, controlled movement
  • Fast/rapid prompts → more aggressive frame-to-frame change
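
If you test many variations, it can help to assemble prompts from the three pieces discussed so far: subject motion, camera motion, and a speed modifier. The helper below is a hypothetical convenience, a minimal sketch that only formats text.

```python
def build_motion_prompt(
    subject_motion: str, camera_motion: str, speed: str = ""
) -> str:
    """Join subject motion, an optional speed word, and camera motion."""
    subject = subject_motion.strip()
    camera = camera_motion.strip()
    if not subject or not camera:
        # The guide's key rule: always describe both kinds of motion.
        raise ValueError("describe both subject motion and camera motion")
    if speed.strip():
        subject = f"{subject} {speed.strip()}"
    return f"{subject}, {camera}"


if __name__ == "__main__":
    for speed in ("slowly", "rapidly"):
        print(build_motion_prompt(
            "the person walks forward", "camera follows from behind", speed
        ))
```

Keeping the subject and camera text fixed while swapping only the speed word makes it easier to see what each modifier changes.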

Test Early, Scale Late

Always start with 33 frames at 480p to validate your motion prompt before jumping to higher resolutions or longer sequences. A failed test at 720p wastes far more time than a quick validation at lower resolution.

⚠️ Important: This single practice—testing at low resolution first—saves hours of wasted renders.

Troubleshooting Common Issues

Out of Memory with the 14B Model

Solution: Enable sequential_cpu_offload in WanVideoModelLoader. Renders take 8–15 minutes for 33 frames, but the model becomes usable on 14–16GB VRAM.

Video Looks Too Static

Cause: Motion prompt is too generic or contradictory.

Fix:

  • Rewrite with specific direction and speed: "rapidly spinning, dynamic camera rotation" instead of "moving".
  • Check that num_frames is at least 33. With 16 frames, motion is barely visible.
  • Bump cfg_scale to 9–10 to strengthen prompt adherence.

Flickering or Color Banding

Cause: Model struggling with the prompt or resolution.

Fix:

  • Reduce num_frames to 25–30.
  • Lower cfg_scale to 7.
  • Simplify the motion prompt.
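
The numeric parts of the two fix lists above can be summarized as a small lookup. This is a sketch of the guide's advice only: the symptom labels "static" and "flicker" and the function name are hypothetical, and rewriting the prompt itself still has to be done by hand.

```python
def suggest_settings(symptom: str, num_frames: int, cfg_scale: float) -> dict:
    """Apply this guide's numeric fixes for a given symptom."""
    if symptom == "static":
        # At least 33 frames, and cfg_scale raised to 9 if it was lower.
        return {
            "num_frames": max(num_frames, 33),
            "cfg_scale": max(cfg_scale, 9.0),
        }
    if symptom == "flicker":
        # No more than 30 frames, and cfg_scale lowered to 7 if it was higher.
        return {
            "num_frames": min(num_frames, 30),
            "cfg_scale": min(cfg_scale, 7.0),
        }
    raise ValueError(f"unknown symptom: {symptom!r}")


if __name__ == "__main__":
    print(suggest_settings("static", num_frames=16, cfg_scale=7.0))
    print(suggest_settings("flicker", num_frames=49, cfg_scale=9.0))
```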

Slow Renders on the 14B Model

This is expected behavior with sequential_cpu_offload enabled. If you need more speed, pick one of these options:

  • Upgrade to 24GB+ VRAM
  • Use the 1.3B model
  • Drop steps to 20 (quality loss is minimal)

GPU Strategy by VRAM

  • 12GB VRAM: Use the 1.3B model without offload. Expect 2–4 minutes per 33-frame video.
  • 16–20GB VRAM: Use the 14B model with sequential_cpu_offload enabled. Roughly 8–15 minutes per video; best quality/speed balance for your hardware.
  • 24GB+ VRAM: Use the 14B model without offload. Roughly 3–6 minutes per video; maximum speed and quality.
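
The same decision can be written as a short function. It is a sketch of the three tiers above with hypothetical names; cards between the listed tiers fall into the nearest lower tier, which is an assumption rather than something the guide states.

```python
def pick_setup(vram_gb: float) -> dict:
    """Map available VRAM to the model and offload setting from this guide."""
    if vram_gb >= 24:
        return {"model": "14B", "sequential_cpu_offload": False}
    if vram_gb >= 16:
        return {"model": "14B", "sequential_cpu_offload": True}
    # Below 16GB, fall back to the small model (assumed for in-between cards).
    return {"model": "1.3B", "sequential_cpu_offload": False}


if __name__ == "__main__":
    for vram in (12, 16, 24):
        print(f"{vram}GB ->", pick_setup(vram))
```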

Don’t have the VRAM locally? Cloud GPU rentals (Vast.ai, RunPod) are realistic here — the 14B model is one of the most VRAM-intensive workloads covered on this site, so renting a 24GB card for a single test session is a cheap way to try it before committing to hardware. Check current hourly rates on either platform.

FAQ

Q: What’s the difference between Wan 2.1 and Wan 2.2 for image-to-video?

A: Wan 2.2 improves temporal coherence (less flicker between frames) and motion-prompt understanding. The nodes stay the same (WanVideoModelLoader, etc.) but the model weights differ. Already running Wan 2.1? You can keep using it; 2.2 is an incremental upgrade, not a complete architecture overhaul.

Q: How many frames should I generate to start?

A: Always start with 33 frames at 480p (832×480 horizontal or 480×832 vertical, both divisible by 16). This validates your motion prompt and overall behavior in under 5 minutes. Only scale up to 49+ frames and higher resolution once the motion is working correctly. Change one parameter at a time.

Q: Can I animate images generated with Flux or SDXL?

A: Yes. Wan 2.2 accepts any image as input regardless of how it was generated. The input image defines the first frame; the motion prompt describes how it should move. Images with clear composition and a simple background perform better.

Q: Why is the generated video frozen or barely moving?

A: The motion prompt is probably too generic or contradictory. Be specific: instead of ‘moving’, write ‘the person raises their right hand slowly, camera stays fixed’. Also verify num_frames is at least 33—with 16 frames, motion is barely perceptible.

Keep Reading

Don’t have 24GB of local VRAM for the 14B model? See our RunPod vs Vast.ai cloud GPU guide for renting one by the hour. And if you’re running the 1.3B model on limited VRAM, our complete guide to reducing VRAM usage covers offloading techniques that apply to video workflows too.


🏆 Our recommendation

If you have 12GB VRAM or less, use the 1.3B model for fast iteration and accessible performance. If you have 16–20GB VRAM, use the 14B model with sequential_cpu_offload enabled for the best quality-to-speed balance. If you have 24GB+ VRAM, use the 14B model without offload to maximize quality and minimize render time. Start every workflow at 480p and 33 frames, validate your motion prompt, then scale up once you’re confident in the result.
