AnyTalk: Multi-modal Driven Multi-domain Talking Head Generation
Authors: Yu Wang, Yunfei Liu, Fa-Ting Hong, Meng Cao, Lijian Lin, Yu Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that AnyTalk excels at generating high-quality, multimodal talking head videos, showcasing remarkable generalization capabilities across diverse domains. We conduct extensive experiments to evaluate our AnyTalk in the cross-domain setup. Experiments demonstrate that our AnyTalk can generate high-quality and diverse videos across multiple domains, including across species. We utilize the Fréchet Inception Distance (FID) to measure the realism of our generated outputs. To assess identity preservation, we follow previous works (Gong et al. 2023; Hong et al. 2022) and utilize the cosine similarity (CSIM) between synthetic and source images through ArcFace (Deng et al. 2019). Meanwhile, we use the cosine similarity of expression embeddings (CEIM) to quantify the subtle yet significant facial expression differences between the driving and generated images. We compare our AnyTalk under the cross-domain face reenactment setting with several state-of-the-art face reenactment methods... The quantitative results of four cross-domain reenactment tasks are reported in Tab. 2. Our AnyTalk outperforms all baselines in terms of FID across the four tasks... Additionally, our AnyTalk also performs the best in identity preservation and expression consistency, i.e., the highest CSIM and CEIM. To verify the effectiveness of our proposed expression consistency loss L_exp, we perform an ablation study by removing L_exp from our method. We conduct a user study to further evaluate the performance of all the methods. |
| Researcher Affiliation | Academia | Yu Wang, Yunfei Liu, Fa-Ting Hong, Meng Cao, Lijian Lin, Yu Li* International Digital Economy Academy (IDEA) |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We first pretrain our AnyTalk on VoxCeleb1 (Nagrani, Chung, and Zisserman 2017), a popular talking head generation dataset. E_exp is pretrained on AffectNet (Mollahosseini, Hasani, and Mahoor 2019), a large-scale annotated emotion dataset. |
| Dataset Splits | No | The paper mentions using the VoxCeleb1 and AniTalk datasets and states, 'For each video, we sample two frames: one as the source image s and the other as the driving image d, respectively.' However, it does not provide specific training/test/validation split percentages, absolute sample counts for splits, or references to predefined splits for these datasets required for reproduction. |
| Hardware Specification | No | The paper states, 'AnyTalk runs at 42 fps using a naive PyTorch implementation,' but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for training or inference. |
| Software Dependencies | No | The paper mentions a 'naive PyTorch implementation' but does not specify the version number of PyTorch or any other software libraries or dependencies. |
| Experiment Setup | No | The paper describes a total loss function `L_total = λ_exp L_exp + λ_P L_P + λ_G L_G + λ_E L_E + λ_dist L_dist + λ_M L_M + λ L` and states that 'λ_P, λ_G, λ_E, λ_dist, λ_M, λ and λ_exp are hyperparameters to balance these losses.' However, it defers the specific values of these hyperparameters, stating, 'More details refer to Appendix.' Thus, concrete hyperparameter values are not provided in the main text. |
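The CSIM metric quoted in the Research Type row is the cosine similarity between identity embeddings of the source and generated images. A minimal sketch of that computation, assuming the embeddings (e.g. from ArcFace) are already extracted as NumPy vectors; the function name and interface are illustrative, not from the paper:

```python
import numpy as np

def csim(src_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Cosine similarity between two identity embeddings.

    Both inputs are 1-D feature vectors (e.g. ArcFace identity
    embeddings of the source and the generated frame).
    """
    src = src_emb / np.linalg.norm(src_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(np.dot(src, gen))

# Identical embeddings give 1.0; orthogonal embeddings give 0.0.
print(csim(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])))  # -> 1.0
print(csim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # -> 0.0
```

CEIM follows the same formula, only with expression embeddings in place of identity embeddings.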
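The Experiment Setup row quotes a total loss that is a weighted sum of individual terms, L_total = Σ_k λ_k L_k. A minimal sketch of how such a combination is typically assembled, using plain floats for brevity (in practice the values would be framework tensors); the key names mirror the paper's notation, but the weight values are placeholders since the paper defers them to its appendix:

```python
def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum L_total = sum_k weights[k] * losses[k].

    `losses` maps each term name (e.g. 'exp', 'P', 'G') to its value;
    `weights` maps the same names to the balancing hyperparameters λ_k.
    """
    return sum(weights[k] * losses[k] for k in losses)

# Placeholder values only -- the paper does not give the λ's in the main text.
losses = {"exp": 2.0, "P": 1.0, "G": 0.5}
weights = {"exp": 0.5, "P": 2.0, "G": 1.0}
print(total_loss(losses, weights))  # -> 3.5
```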