Follow-Your-Click: Open-domain Regional Image Animation via Motion Prompts

Authors: Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments compared with 7 baselines, including both commercial tools and research methods on 8 metrics, suggest the superiority of our approach." "We conducted extensive experiments and user studies to evaluate our approach, which shows our method achieves state-of-the-art performance." "In this section, we introduce our detailed implementation in Sec. 4.1. Then we evaluate our approach with various baselines to comprehensively evaluate our performance in Sec. 4.2. We then ablate our key components to show their effectiveness in Sec. 4.3."
Researcher Affiliation | Collaboration | (1) The Hong Kong University of Science and Technology, Hong Kong; (2) Tencent Hunyuan, China; (3) Tsinghua University, China
Pseudocode | No | The paper describes steps in regular paragraph text within sections like "3 Follow-Your-Click" and its subsections, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://follow-your-click.github.io/
Open Datasets | Yes | "We train our model for 60k steps on the WebVid-10M (Bain et al. 2021) and then finetune it for 30k steps on the reconstructed WebVid-Motion dataset. Training on public datasets such as WebVid (Bain et al. 2021) and HD-VILA (Xue et al. 2022) directly is challenging..."
Dataset Splits | No | The paper mentions using WebVid-10M, WebVid-Motion, UCF-101, and MSR-VTT datasets but does not explicitly provide details about specific training, validation, or test splits (e.g., percentages or sample counts) used for these datasets in their experiments, nor does it refer to standard splits for their specific evaluation setup.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run its experiments. It only mentions the software modules used: "In our experiments, the spatial modules are based on Stable Diffusion (SD) V1.5 (Rombach et al. 2022), and motion modules use the corresponding AnimateDiff (Guo et al. 2023) checkpoint V2."
Software Dependencies | Yes | "In our experiments, the spatial modules are based on Stable Diffusion (SD) V1.5 (Rombach et al. 2022), and motion modules use the corresponding AnimateDiff (Guo et al. 2023) checkpoint V2."
Experiment Setup | Yes | "We train our model for 60k steps on the WebVid-10M (Bain et al. 2021) and then finetune it for 30k steps on the reconstructed WebVid-Motion dataset. We measure these metrics at the resolution of 256 × 256 with 16 frames. In Sec. 4.3, we conduct a detailed analysis of the selection of mask ratio."
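The setup details the review extracts (SD v1.5 spatial modules, AnimateDiff V2 motion modules, 60k pretraining steps, 30k fine-tuning steps, 256×256 evaluation at 16 frames) can be collected into a minimal sketch. This is illustrative only, not the authors' released code (that is linked above); the Hugging Face checkpoint ids (`runwayml/stable-diffusion-v1-5`, `guoyww/animatediff-motion-adapter-v1-5-2`) are assumptions standing in for "SD V1.5" and "AnimateDiff checkpoint V2".

```python
# Sketch of the experiment setup quoted in the review, expressed as a
# config plus an (assumed) diffusers-based pipeline loader.

TRAIN_CONFIG = {
    "pretrain_steps": 60_000,   # on WebVid-10M (Bain et al. 2021)
    "finetune_steps": 30_000,   # on the reconstructed WebVid-Motion dataset
    "resolution": (256, 256),   # evaluation resolution stated in the paper
    "num_frames": 16,           # frames per clip at evaluation
}

def build_pipeline():
    """Load SD v1.5 with an AnimateDiff motion adapter.

    Hypothetical checkpoint ids; requires `pip install diffusers` and a
    model download, so this is not executed here.
    """
    from diffusers import AnimateDiffPipeline, MotionAdapter
    adapter = MotionAdapter.from_pretrained(
        "guoyww/animatediff-motion-adapter-v1-5-2"  # assumed "V2" checkpoint
    )
    return AnimateDiffPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", motion_adapter=adapter
    )
```

The config mirrors only what the quoted text states; anything else (optimizer, learning rate, batch size) is not reported in the paper and is deliberately omitted.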