Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling

Authors: Jingyun Xue, WANG HongFa, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang LIU, Wenhan Luo

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive qualitative and quantitative evaluations demonstrate that our method excels in generating high-quality character animations, especially in scenarios of complex backgrounds and multiple characters. ... We conduct extensive quantitative and qualitative experiments to illustrate the superiority of our approach. ... 5 EXPERIMENT: Dataset. Training strategy. Implementation Details. Comparisons. Dataset and metrics. Evaluation on Tik Tok dataset. Evaluation on TED-talks dataset. Evaluation on Multi-Character bench. Ablation study.
Researcher Affiliation | Collaboration | 1Shenzhen Campus of Sun Yat-sen University, 2Tencent Hunyuan, 3Tsinghua University, 4HKUST, 5Harbin Institute of Technology, Shenzhen
Pseudocode | Yes | B.1 PSEUDOCODE OF DEPTH ORDER GUIDER. Algorithm 1: Pseudocode for depth order map mask extraction.
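The paper's Algorithm 1 itself is not reproduced in this review, but the idea named by its title ("depth order map mask extraction") can be illustrated with a minimal sketch: given a per-frame depth map and one binary mask per character, rank the characters by their median depth and paint each character's pixels with its depth rank. Everything below (function name, the smaller-depth-is-nearer convention, median aggregation) is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def depth_order_map(depth: np.ndarray, masks: list) -> np.ndarray:
    """Hypothetical sketch: label the nearest character 1, the next 2, ...

    depth: (H, W) float depth map (assumed: smaller value = nearer camera).
    masks: list of (H, W) boolean masks, one per character.
    """
    # Aggregate each character's depth with the median inside its mask.
    med = [np.median(depth[m]) for m in masks]
    order = np.argsort(med)  # indices of characters, nearest first
    out = np.zeros(depth.shape, dtype=np.int32)  # 0 = background
    for rank, idx in enumerate(order, start=1):
        out[masks[idx]] = rank  # farther characters get larger ranks
    return out

# Toy example: top rows are near (depth 0.2), bottom rows are far (0.9).
depth = np.array([[0.2, 0.2], [0.9, 0.9]])
near = np.array([[True, True], [False, False]])
far = np.array([[False, False], [True, True]])
order_map = depth_order_map(depth, [far, near])  # → [[1, 1], [2, 2]]
```

Overwriting in back-to-front order is one simple way such a map could resolve overlapping masks; the paper's actual occlusion handling may differ.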
Open Source Code | No | The paper provides a project page URL (https://multi-animation.github.io/) but does not contain an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper. The reference to 'Moore-AnimateAnyone' is for a counterpart method, not their own code.
Open Datasets | Yes | Additionally, to fill the gap of fair evaluation of multi-character image animation, we propose a new benchmark comprising about 4,000 frames. ... We collect 20 multiple-character dancing videos with 3917 frames, named Multi-Character. ... Table 6: The source of Multi-character benchmark. ... Following the previous methods (Wang et al., 2023), we evaluate our method on Tik Tok videos (Wang et al., 2023) and TED-talks (Siarohin et al., 2021).
Dataset Splits | No | We collect 4000 character-action videos of 2M frames as our training set. ... Following the previous methods (Wang et al., 2023), we evaluate our method on Tik Tok videos (Wang et al., 2023) and TED-talks (Siarohin et al., 2021). Additionally, ... we collect 20 multiple-character dancing videos with 3917 frames, named Multi-Character. This dataset serves as a benchmark for evaluating models' capabilities... While the paper identifies distinct training and evaluation sets, it does not provide specific percentages or sample counts for how the evaluation datasets are split, or whether they adhere to predefined splits beyond being 'evaluated on'.
Hardware Specification | Yes | Experiments are conducted on 8 NVIDIA A800 GPUs.
Software Dependencies | No | We utilize the DWPose (Yang et al., 2023) to extract pose sequence from videos, and PWC-Net (Sun et al., 2018) from the open-source toolbox MMFlow (Open MMLab, 2021) to calculate optical flow vectors. Additionally, we use the Depth Anything (Yang et al., 2024) to extract depth maps from videos. ... We utilize the weights of Stable Diffusion v1.5 (Rombach et al., 2022) to initialize this training stage. ... we use the weights of AnimateDiff v2 (Guo et al., 2023b) to initialize this training stage. While specific models and some tools are mentioned, explicit version numbers for the overall software environment (e.g., Python, PyTorch) or the mentioned tools are not provided.
Experiment Setup | Yes | We sample 16 frames of video, resize and center-crop them to a resolution of 896 × 640. Experiments are conducted on 8 NVIDIA A800 GPUs. Both stages are optimized using Adam with a learning rate of 1 × 10⁻⁵. In the first stage, we train our model for 60K steps with a batch size of 4, and in the second stage, we train for 60K steps with a batch size of 1. At inference, we apply the DDIM (Song et al., 2020) sampler for 50 denoising steps, with a classifier-free guidance (Ho & Salimans, 2022) scale of 1.5.
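The frame preprocessing quoted above (sample 16 frames, resize and center-crop to 896 × 640) can be sketched as follows. The paper does not state the axis order, interpolation method, or crop/resize order, so this is a dependency-free illustration under assumed conventions (height × width = 896 × 640, nearest-neighbour resampling), not the authors' pipeline.

```python
import numpy as np

TARGET_H, TARGET_W = 896, 640  # assumed H x W interpretation of "896 x 640"

def center_crop_resize(frame: np.ndarray) -> np.ndarray:
    """Crop to the target aspect ratio, then nearest-neighbour resize."""
    h, w = frame.shape[:2]
    target_ratio = TARGET_W / TARGET_H
    if w / h > target_ratio:              # frame too wide: crop width
        new_w = int(h * target_ratio)
        x0 = (w - new_w) // 2
        frame = frame[:, x0:x0 + new_w]
    else:                                 # frame too tall: crop height
        new_h = int(w / target_ratio)
        y0 = (h - new_h) // 2
        frame = frame[y0:y0 + new_h]
    h, w = frame.shape[:2]
    # Nearest-neighbour index maps for the resize step.
    rows = np.arange(TARGET_H) * h // TARGET_H
    cols = np.arange(TARGET_W) * w // TARGET_W
    return frame[rows][:, cols]

# 16 sampled frames of a hypothetical 720p landscape clip.
clip = np.zeros((16, 720, 1280, 3), dtype=np.uint8)
processed = np.stack([center_crop_resize(f) for f in clip])
# processed.shape == (16, 896, 640, 3)
```

In practice a real pipeline would use an interpolating resize (e.g. bilinear) rather than index-map nearest-neighbour; the numpy version here just keeps the example self-contained.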