MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation
Authors: Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, Seungryong Kim
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on standard benchmarks demonstrate that our model outperforms existing GAN-based and diffusion-based models. We also provide comprehensive ablation studies and user study results. In experiments, our framework achieves state-of-the-art performance on HDTF dataset (Zhang et al. 2021), surpassing GAN-based (Prajwal et al. 2020; Zhou et al. 2021) and diffusion-based (Ma et al. 2023; Wei, Yang, and Wang 2024) approaches. |
| Researcher Affiliation | Collaboration | Seyeon Kim 1, 2*, Siyoon Jin 1*, Jihye Park 1, 2*, Kihong Kim 3, Jiyoung Kim 1, Jisu Nam 4, Seungryong Kim 4 1Korea University 2Samsung Electronics 3VIVE STUDIOS 4KAIST |
| Pseudocode | No | The paper describes its methodology in prose and mathematical formulations but does not include any distinct pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/cvlab-kaist/MoDiTalker |
| Open Datasets | Yes | We used the LRS3-TED (Afouras, Chung, and Zisserman 2018) and HDTF (Zhang et al. 2021) datasets to train our AToM and MToV models, respectively. |
| Dataset Splits | Yes | For MToV, we randomly selected 312 videos from the HDTF dataset for training, using the remaining 98 videos for testing. |
| Hardware Specification | Yes | For all experiments, we used a single NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions software components like HuBERT and 3DMM but does not provide specific version numbers for these or other key software dependencies required for replication. |
| Experiment Setup | Yes | For AToM, we train the model for 300k iterations with a learning rate of 1e-4. For MToV, we train the model for 600k iterations with a learning rate of 1e-4. To alleviate jittering, we employed a blending technique using Gaussian blur, as described in (Chen et al. 2020). Additional implementation details are provided in the Appendix 1. |
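The jitter-reduction step quoted above blends the generated region back into the source frame with a Gaussian-blur mask. A minimal numpy sketch of this kind of Gaussian-feathered blending is shown below; the function name, the radial-mask formulation, and all parameters are illustrative assumptions, not the exact procedure of Chen et al. 2020.

```python
import numpy as np

def gaussian_blend(frame, generated, center, sigma):
    """Blend a generated region into a frame using a Gaussian-feathered
    mask, softening seams that would otherwise flicker between frames.
    Hypothetical illustration only; not the paper's exact blending code."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    # Gaussian weight: ~1 at the region centre, decaying smoothly to 0,
    # so the generated content fades into the original frame.
    mask = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    mask = mask[..., None]  # broadcast the weight over colour channels
    return mask * generated + (1.0 - mask) * frame

# Usage: paste a synthetic patch (all ones) into a black frame.
frame = np.zeros((64, 64, 3))
generated = np.ones((64, 64, 3))
blended = gaussian_blend(frame, generated, center=(32, 32), sigma=8.0)
```

Near the centre the output follows the generated pixels; far from it, the original frame dominates, which is what suppresses hard boundary jitter.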