Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

BibTeX
@misc{chu2025wanmovemotioncontrollablevideo,
      title={Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance},
      author={Ruihang Chu and Yefei He and Zhekai Chen and Shiwei Zhang and Xiaogang Xu and Bin Xia and Dingdong Wang and Hongwei Yi and Xihui Liu and Hengshuang Zhao and Yu Liu and Yingya Zhang and Yujiu Yang},
      year={2025},
      eprint={2512.08765},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.08765},
}
GitHub: Wan-Move
https://github.com/ali-vilab/Wan-Move
AI Audio Lecture + Q&A
Transcript
John: In our seminar on Advanced Topics in Generative Models, we've seen a lot of recent work on motion control. Papers like 'Motion Prompting' established the use of point trajectories, while others have explored complex auxiliary modules. Today's lecture is on 'Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance' from researchers at Alibaba and Tsinghua University. This work is notable because it directly challenges the trend of adding architectural complexity for control. It suggests a simpler, more direct method for guiding motion. Yes, Noah?

Noah: Excuse me, Professor. When you say it simplifies the architecture, are you referring to how it avoids modules like ControlNet, which many other methods use to inject conditional information?

John: Exactly. Instead of adding new networks or encoders that need to be trained and integrated, Wan-Move's core idea is to directly edit the latent representation of the initial image condition. They call this 'latent trajectory guidance.' The goal is to achieve fine-grained motion control without the extra computational and architectural overhead.

Noah: So how does it guide the motion without a dedicated motion encoder?

John: It's a surprisingly direct mechanism. The model takes user-provided point trajectories: basically, a series of coordinates showing where a point moves over time. It then finds the latent feature vector corresponding to the starting point of a trajectory in the very first frame. This single feature vector, which contains rich information about appearance and local context, is then copied and pasted along the entire path of that trajectory in the latent space for all subsequent frames. This modified latent code, now embedded with both the initial appearance and the desired motion path, becomes the new condition for the image-to-video model. The diffusion model then learns to generate a video consistent with this edited guide.

Noah: That sounds very efficient. But does propagating a static feature from the first frame cause problems? For instance, if an object is supposed to rotate or change its appearance slightly as it moves, wouldn't this method struggle since it's just copying the initial state?

John: That's a very sharp question. It's a potential limitation. The method relies on the powerful diffusion backbone to interpret this guidance and render natural motion, including plausible rotations and minor appearance shifts. The ablation studies in the paper show this 'latent feature replication' works significantly better than alternatives, like just copying pixel values or using random embeddings. The richness of the latent feature provides enough context for the model to work with. However, for dramatic transformations or lighting changes, this approach might be less effective than a method that dynamically encodes motion.

Noah: So, what are the primary applications for this kind of precise control?

John: The flexibility of point trajectories opens up quite a few practical uses. In creative industries, an animator could precisely guide the movement of a character's arm or a prop in a scene without complex rigging. You can also control the virtual camera, simulating dolly shots or pans simply by defining trajectories for the entire frame. The paper even demonstrates motion transfer, where they extract trajectories from a source video and apply them to a completely different static image to animate it. For research, it could be used to generate specific datasets for training other computer vision models, like action recognition systems that need examples of very specific movements.

Noah: What about evaluation? The report mentioned the authors introduced a new benchmark called MoveBench. Why was that necessary?

John: That's another one of its key contributions. The field lacked a standardized, high-quality benchmark for this specific task. Existing datasets were often small, had short video durations, or lacked precise, verified annotations. This made it difficult to compare methods fairly. MoveBench was created to solve this. It provides longer, high-resolution videos across many categories, with meticulously annotated point trajectories and segmentation masks created with a human-in-the-loop process. This allows for a much more rigorous and reliable evaluation of motion control, which is critical for measuring real progress in the field.

Noah: So the paper's significance is both methodological and infrastructural.

John: Correct. The methodological shift is its architectural simplicity and scalability, showing that you can achieve state-of-the-art control without adding complex modules. This makes it easier to adapt to newer, more powerful base models in the future. The infrastructural contribution is MoveBench, which provides a tool for the whole community to build upon. And importantly, by achieving performance on par with closed, commercial systems, this work helps democratize access to high-end controllable video generation tools, which will likely accelerate research and creative applications.

John: So, to wrap up, the main takeaway from Wan-Move is that effective and precise motion control in video generation may not require increasing architectural complexity. By directly manipulating latent features, this work presents a scalable and efficient framework that achieves top-tier results. Paired with the introduction of a robust benchmark, it sets a new standard for both creating and evaluating controllable video. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
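
To make the latent feature replication John describes concrete, here is a minimal sketch of how such a trajectory-guided latent condition could be assembled. It is illustrative only: the function name replicate_latent_features, the tensor layout, and the assumption that trajectories arrive as integer (x, y) coordinates already rescaled to the latent grid are simplifications for this sketch, not the Wan-Move implementation.

import torch

def replicate_latent_features(first_frame_latent, trajectories, num_frames):
    """
    first_frame_latent: (C, H, W) VAE latent of the conditioning image.
    trajectories:       (N, T, 2) integer (x, y) positions on the latent grid,
                        one row per user-drawn point trajectory, with T >= num_frames.
    Returns a (num_frames, C, H, W) guidance latent: frame 0 keeps the full
    image latent; later frames are empty except along each trajectory, where
    the feature vector from that trajectory's starting point is pasted.
    """
    C, H, W = first_frame_latent.shape
    guidance = torch.zeros(num_frames, C, H, W,
                           dtype=first_frame_latent.dtype,
                           device=first_frame_latent.device)
    guidance[0] = first_frame_latent                      # full appearance in frame 0

    for traj in trajectories:                             # traj: (T, 2)
        x0, y0 = traj[0].tolist()
        start_feature = first_frame_latent[:, y0, x0]     # (C,) vector at the start point
        for t in range(1, num_frames):
            x, y = traj[t].tolist()
            if 0 <= x < W and 0 <= y < H:                 # skip points that leave the frame
                guidance[t, :, y, x] = start_feature      # same feature along the whole path
    return guidance

# Hypothetical usage: this guidance tensor would stand in for the usual
# first-frame-only condition of an image-to-video diffusion model.
# image_latent = vae_encode(first_frame)           # (C, H, W), assumed helper
# tracks = torch.tensor(user_trajectories)         # (N, T, 2) latent-grid coordinates
# cond = replicate_latent_features(image_latent, tracks, num_frames=tracks.shape[1])

In practice a video VAE downsamples space (and often time), so pixel-space trajectory coordinates would first need to be mapped onto the latent grid; the sketch assumes that mapping has already been done.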