Follow-Your-Click: Open-domain Regional Image Animation via Motion Prompts
Authors: Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments compared with 7 baselines, including both commercial tools and research methods on 8 metrics, suggest the superiority of our approach. We conducted extensive experiments and user studies to evaluate our approach, which shows our method achieves state-of-the-art performance. In this section, we introduce our detailed implementation in Sec. 4.1. Then we evaluate our approach with various baselines to comprehensively evaluate our performance in Sec. 4.2. We then ablate our key components to show their effectiveness in Sec. 4.3. |
| Researcher Affiliation | Collaboration | 1The Hong Kong University of Science and Technology, Hong Kong 2Tencent, Hunyuan, China 3Tsinghua University, China |
| Pseudocode | No | The paper describes steps in regular paragraph text within sections like '3 Follow-Your-Click' and its subsections, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://follow-your-click.github.io/ |
| Open Datasets | Yes | We train our model for 60k steps on the WebVid-10M (Bain et al. 2021) and then finetune it for 30k steps on the reconstructed WebVid-Motion dataset. Training on public datasets such as WebVid (Bain et al. 2021) and HD-VILA (Xue et al. 2022) directly is challenging... |
| Dataset Splits | No | The paper mentions using WebVid-10M, WebVid-Motion, UCF-101, and MSRVTT datasets but does not explicitly provide details about specific training, validation, or test splits (e.g., percentages or sample counts) used for these datasets in their experiments, nor does it refer to standard splits for their specific evaluation setup. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run its experiments. It only mentions the software modules used: 'In our experiments, the spatial modules are based on Stable Diffusion (SD) V1.5 (Rombach et al. 2022), and motion modules use the corresponding AnimateDiff (Guo et al. 2023) checkpoint V2.' |
| Software Dependencies | Yes | In our experiments, the spatial modules are based on Stable Diffusion (SD) V1.5 (Rombach et al. 2022), and motion modules use the corresponding AnimateDiff (Guo et al. 2023) checkpoint V2. |
| Experiment Setup | Yes | We train our model for 60k steps on the WebVid-10M (Bain et al. 2021) and then finetune it for 30k steps on the reconstructed WebVid-Motion dataset. We measure these metrics at the resolution of 256 x 256 with 16 frames. In Sec. 4.3, we conduct a detailed analysis of the selection of mask ratio. |
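The reported setup amounts to a two-stage training schedule. A minimal sketch of that schedule is below; the paper reports only the step counts, datasets, resolution, and frame count, so the `StageConfig` structure and field names here are hypothetical, not the authors' actual training code.

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    # Hypothetical config container; only the values below are reported in the paper.
    dataset: str
    steps: int
    resolution: int = 256  # evaluated at 256 x 256
    num_frames: int = 16   # 16-frame clips


def training_schedule() -> list[StageConfig]:
    """Two-stage schedule as reported: 60k pretraining steps on WebVid-10M,
    then 30k finetuning steps on the reconstructed WebVid-Motion dataset."""
    return [
        StageConfig(dataset="WebVid-10M", steps=60_000),
        StageConfig(dataset="WebVid-Motion", steps=30_000),
    ]
```

Note that the mask-ratio choice the paper ablates in Sec. 4.3 is not captured here, since no concrete value is quoted in this row.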