Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Authors: Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, Ming Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority and effectiveness of Animate-X compared to state-of-the-art methods.
Researcher Affiliation | Industry | 1 Ant Group, 2 Alibaba Group
Pseudocode | No | The paper describes methods and processes in detail but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Project Page: https://lucaria-academy.github.io/Animate-X/
Open Datasets | Yes | We collect approximately 9,000 human videos from the internet and supplement this with the TikTok dataset (Jafarian & Park, 2021) and the Fashion dataset (Zablotskaia et al., 2019a) for training. [...] Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench), which includes 500 anthropomorphic characters along with corresponding dance videos, to evaluate the performance of Animate-X on universal and widely applicable animation images.
Dataset Splits | No | The paper mentions using 10 and 100 videos for qualitative and quantitative comparisons from the TikTok and Fashion datasets, respectively, and manually screening 100 videos from A2Bench as test videos. However, it does not provide explicit training/validation/test splits (e.g., percentages or exact counts for all splits) for the main training data or combined datasets.
Hardware Specification | Yes | The experiments are carried out using 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | We use the visual encoder of the multi-modal CLIP-Huge model (Radford et al., 2021) in Stable Diffusion v2.1 (Rombach et al., 2022) to encode the CLIP embedding of the reference image and driving videos. [...] For the driven video I_d^{1:F}, we detect the pose keypoints p_d and the CLIP feature I_d via DWPose (Yang et al., 2023) and the CLIP image encoder Φ.
Experiment Setup | Yes | During training, videos are resized to a spatial resolution of 768×512 pixels, and we feed the model uniformly sampled video segments of 32 frames to ensure temporal consistency. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rates of 5e-7 for the implicit pose indicator and 5e-5 for the other modules. For noise sampling, DDPM (Ho et al., 2020) with 1000 steps is applied during training. In the inference phase, we adjust the length of the driving pose to align roughly with the reference pose and use the DDIM sampler (Song et al., 2021) with 50 steps for faster sampling. To improve the model's robustness against pose and reference image misalignments, we adopt two key training schemes. First, we set a high transformation probability λ (over 98%) in the EPI, enabling the model to handle a wide range of misalignment scenarios. Second, we apply random dropout to the input conditions at a predefined rate (Wang et al., 2024b).
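The quoted setup mixes a 1000-step DDPM training schedule with a 50-step DDIM inference schedule, plus two robustness schemes (high transformation probability and condition dropout). Since the paper's code is not released, the following is only a minimal stdlib sketch of those pieces: the function names, the even-stride timestep spacing, and the default condition-dropout rate are assumptions, not the authors' implementation.

```python
import random

DDPM_STEPS = 1000  # training noise schedule (DDPM, Ho et al., 2020)
DDIM_STEPS = 50    # inference sampler (DDIM, Song et al., 2021)


def ddim_timesteps(ddpm_steps: int = DDPM_STEPS,
                   ddim_steps: int = DDIM_STEPS) -> list:
    """Evenly subsample the DDPM timesteps down to the DDIM step count,
    returned in descending order for denoising (even spacing is an
    assumption; other spacings are possible)."""
    stride = ddpm_steps // ddim_steps
    steps = list(range(0, ddpm_steps, stride))
    steps.reverse()
    return steps


def apply_training_schemes(conditions,
                           transform_prob: float = 0.98,
                           drop_rate: float = 0.1):
    """Sketch of the two training schemes quoted above:
    - with probability > 0.98, mark the pose for an EPI transformation;
    - randomly drop input conditions (the paper only says 'a predefined
      rate'; 0.1 here is a hypothetical placeholder)."""
    transformed = random.random() < transform_prob
    kept = [c for c in conditions if random.random() >= drop_rate]
    return transformed, kept
```

For example, `ddim_timesteps()` yields 50 timesteps from 980 down to 0 in strides of 20, so each DDIM denoising step covers 20 steps of the original DDPM schedule.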