MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Authors: Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, Seungryong Kim

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on standard benchmarks demonstrate that our model outperforms existing GAN-based and diffusion-based models. We also provide comprehensive ablation studies and user study results. In experiments, our framework achieves state-of-the-art performance on HDTF dataset (Zhang et al. 2021), surpassing GAN-based (Prajwal et al. 2020; Zhou et al. 2021) and diffusion-based (Ma et al. 2023; Wei, Yang, and Wang 2024) approaches.
Researcher Affiliation Collaboration Seyeon Kim 1, 2*, Siyoon Jin 1*, Jihye Park 1, 2*, Kihong Kim 3, Jiyoung Kim 1, Jisu Nam 4, Seungryong Kim 4 1Korea University 2Samsung Electronics 3VIVE STUDIOS 4KAIST
Pseudocode No The paper describes its methodology in prose and mathematical formulations but does not include any distinct pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/cvlab-kaist/MoDiTalker
Open Datasets Yes We used the LRS3-TED (Afouras, Chung, and Zisserman 2018) and HDTF (Zhang et al. 2021) datasets to train our AToM and MToV models, respectively.
Dataset Splits Yes For MToV, we randomly selected 312 videos from the HDTF dataset for training, using the remaining 98 videos for testing.
Hardware Specification Yes For all experiments, we used a single NVIDIA RTX 3090 GPU.
Software Dependencies No The paper mentions software components like HuBERT and 3DMM but does not provide specific version numbers for these or other key software dependencies required for replication.
Experiment Setup Yes For AToM, we train the model for 300k iterations with a learning rate of 1e-4. For MToV, we train the model for 600k iterations with a learning rate of 1e-4. To alleviate jittering, we employed a blending technique using Gaussian blur, as described in (Chen et al. 2020). Additional implementation details are provided in Appendix 1.
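The reported training hyperparameters can be collected into a small configuration sketch. The dictionary layout and helper function below are illustrative only; the authors' actual configuration format is not published in the excerpt above, and only the iteration counts and learning rates come from the paper:

```python
# Hyperparameters quoted in the Experiment Setup row; the dict structure
# and helper are hypothetical, not taken from the authors' code.
TRAIN_CONFIG = {
    "AToM": {"iterations": 300_000, "learning_rate": 1e-4},  # audio-to-motion stage
    "MToV": {"iterations": 600_000, "learning_rate": 1e-4},  # motion-to-video stage
}

def stage_schedule(stage: str) -> tuple[int, float]:
    """Return (iterations, learning_rate) reported for a training stage."""
    cfg = TRAIN_CONFIG[stage]
    return cfg["iterations"], cfg["learning_rate"]
```

Note that the Gaussian-blur blending step used to reduce jittering is a post-processing detail and is not reflected in this configuration sketch.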