Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Authors: Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, Ming Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority and effectiveness of Animate-X compared to state-of-the-art methods.
Researcher Affiliation | Industry | 1 Ant Group, 2 Alibaba Group
Pseudocode | No | The paper describes methods and processes in detail but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Project Page: https://lucaria-academy.github.io/Animate-X/
Open Datasets | Yes | We collect approximately 9,000 human videos from the internet and supplement this with the TikTok dataset (Jafarian & Park, 2021) and the Fashion dataset (Zablotskaia et al., 2019a) for training. [...] Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench), which includes 500 anthropomorphic characters along with corresponding dance videos, to evaluate the performance of Animate-X on universal and widely applicable animation images.
Dataset Splits | No | The paper mentions using 10 and 100 videos for qualitative and quantitative comparisons from the TikTok and Fashion datasets, respectively, and manually screening 100 videos from A2Bench as test videos. However, it does not provide explicit training/validation/test splits (e.g., percentages or exact counts for all splits) for the main training data or combined datasets.
Hardware Specification | Yes | The experiments are carried out using 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | We use the visual encoder of the multi-modal CLIP-Huge model (Radford et al., 2021) in Stable Diffusion v2.1 (Rombach et al., 2022) to encode the CLIP embedding of the reference image and driving videos. [...] For the driven video I_d^{1:F}, we detect the pose keypoints p_d and the CLIP feature I_d via DWPose (Yang et al., 2023) and the CLIP image encoder Φ.
Experiment Setup | Yes | During training, videos are resized to a spatial resolution of 768×512 pixels, and we feed the model uniformly sampled video segments of 32 frames to ensure temporal consistency. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rates of 5e-7 for the implicit pose indicator and 5e-5 for the other modules. For noise sampling, DDPM (Ho et al., 2020) with 1000 steps is applied during training. In the inference phase, we adjust the length of the driving pose to align roughly with the reference pose and use the DDIM sampler (Song et al., 2021) with 50 steps for faster sampling. To improve the model's robustness against pose and reference image misalignments, we adopt two key training schemes. First, we set a high transformation probability λ (over 98%) in the EPI, enabling the model to handle a wide range of misalignment scenarios. Second, we apply random dropout to the input conditions at a predefined rate (Wang et al., 2024b).
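The quoted setup mixes a 1000-step DDPM training schedule with a 50-step DDIM inference schedule, plus two robustness schemes (high transformation probability and condition dropout). Since the paper's code is not released, the following is only a minimal stdlib sketch of those pieces: the function names, the even-stride timestep spacing, and the default condition-dropout rate are assumptions, not the authors' implementation.

```python
import random

DDPM_STEPS = 1000  # training noise schedule (DDPM, Ho et al., 2020)
DDIM_STEPS = 50    # inference sampler (DDIM, Song et al., 2021)


def ddim_timesteps(ddpm_steps: int = DDPM_STEPS,
                   ddim_steps: int = DDIM_STEPS) -> list:
    """Evenly subsample the DDPM timesteps down to the DDIM step count,
    returned in descending order for denoising (even spacing is an
    assumption; other spacings are possible)."""
    stride = ddpm_steps // ddim_steps
    steps = list(range(0, ddpm_steps, stride))
    steps.reverse()
    return steps


def apply_training_schemes(conditions,
                           transform_prob: float = 0.98,
                           drop_rate: float = 0.1):
    """Sketch of the two training schemes quoted above:
    - with probability > 0.98, mark the pose for an EPI transformation;
    - randomly drop input conditions (the paper only says 'a predefined
      rate'; 0.1 here is a hypothetical placeholder)."""
    transformed = random.random() < transform_prob
    kept = [c for c in conditions if random.random() >= drop_rate]
    return transformed, kept
```

For example, `ddim_timesteps()` yields 50 timesteps from 980 down to 0 in strides of 20, so each DDIM denoising step covers 20 steps of the original DDPM schedule.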