Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Authors: Fei Shen, Cong Wang, Junyao Gao, Qin Guo, Jisheng Dang, Jinhui Tang, Tat-Seng Chua

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation.
Researcher Affiliation | Academia | ¹Nanjing University of Science and Technology, ²Nanjing University, ³Tongji University, ⁴Peking University, ⁵Sun Yat-sen University, ⁶National University of Singapore. Correspondence to: Jinhui Tang <EMAIL>.
Pseudocode | No | The paper describes the architecture and methodology in Sections 3.1–3.4 using descriptive text and a figure (Figure 1), but provides no explicit pseudocode or algorithm block.
Open Source Code | No | The paper contains no explicit statement announcing a source-code release for the described methodology, nor a direct link to a code repository for the authors' implementation. Footnote 4 links to a company website (https://www.guiji.ai/), and a third-party tool's GitHub repository is mentioned (https://github.com/MooreThreads/Moore-AnimateAnyone), but neither is the authors' own code release.
Open Datasets | Yes | Additionally, we present the TalkingFace-Wild dataset, a high-quality, multilingual video dataset with over 200 hours of footage in 10 languages, offering a valuable resource for further research in TalkingFace generation.
Dataset Splits | Yes | Following prior work (Chen et al., 2024; Tian et al., 2024; Xu et al., 2024), we split HDTF into training and testing sets with a 9:1 ratio.
Hardware Specification | Yes | The experiments are conducted on a computing platform equipped with 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using Stable Diffusion v1.5 and models such as Wav2Vec and CLIP, but does not give version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | Training is performed in three stages, with each stage consisting of 30,000 iterations and a batch size of 4. Video data is processed at a resolution of 512 × 512. The learning rate is fixed at 1 × 10⁻⁵ across all stages, and the AdamW optimizer is employed to stabilize training. Each training clip comprises 16 video frames. In the archived-clip motion-prior module, we set α = 16, m = 256, and n = 16. In the present-clip motion-prior diffusion model, the number of layers L is set to 8, and the weighting factor α in Eq. 5 is configured to 0.1 to balance the influence of prior motion information.
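The 9:1 HDTF train/test split reported above could be reproduced deterministically with a sketch like the following; the paper does not specify the split procedure, so the shuffling, seed, and clip identifiers here are illustrative assumptions, not the authors' protocol:

```python
import random

def split_hdtf(video_ids, ratio=0.9, seed=0):
    """Split a list of video identifiers into train/test sets at the given ratio.

    A fixed seed makes the (assumed) random split reproducible.
    """
    ids = sorted(video_ids)          # canonical order before shuffling
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * ratio)
    return ids[:cut], ids[cut:]

# Hypothetical clip names; HDTF's actual file naming is not quoted in the paper.
train, test = split_hdtf([f"clip_{i:03d}" for i in range(100)])
print(len(train), len(test))  # 90 10
```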
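The quoted experiment setup can be summarized as a single configuration object. The numeric values below are the ones stated in the paper; all field names are illustrative and not taken from the authors' code (which is not released):

```python
from dataclasses import dataclass

@dataclass
class MCDMTrainingConfig:
    # Training schedule (values quoted from the paper)
    stages: int = 3
    iterations_per_stage: int = 30_000
    batch_size: int = 4
    resolution: int = 512            # frames processed at 512 x 512
    learning_rate: float = 1e-5      # fixed across all stages
    optimizer: str = "AdamW"
    clip_frames: int = 16            # video frames per training clip
    # Archived-clip motion-prior module
    archived_alpha: int = 16         # alpha = 16
    archived_m: int = 256
    archived_n: int = 16
    # Present-clip motion-prior diffusion model
    num_layers: int = 8              # L in the paper
    prior_weight: float = 0.1        # weighting factor alpha in Eq. 5

cfg = MCDMTrainingConfig()
print(cfg.stages * cfg.iterations_per_stage)  # 90000 iterations in total
```

Note that the paper reuses the symbol α for two different quantities (the archived-clip module setting of 16 and the Eq. 5 weighting factor of 0.1), so the sketch keeps them as separate fields.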