AnyTalk: Multi-modal Driven Multi-domain Talking Head Generation

Authors: Yu Wang, Yunfei Liu, Fa-Ting Hong, Meng Cao, Lijian Lin, Yu Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that AnyTalk excels at generating high-quality, multimodal talking head videos, showcasing remarkable generalization capabilities across diverse domains. We conduct extensive experiments to evaluate our AnyTalk in the cross-domain setup. Experiments demonstrate that our AnyTalk can generate high-quality and diverse videos across multiple domains, including across species. We utilize the Fréchet Inception Distance (FID) to measure the realism of our generated outputs. To assess identity preservation, we follow previous works (Gong et al. 2023; Hong et al. 2022) and utilize the cosine similarity (CSIM) between synthetic and source images through ArcFace (Deng et al. 2019). Meanwhile, we use the cosine similarity of expression embeddings (CEIM) to quantify the subtle yet significant facial expressions between the driving and generated images. We compare our AnyTalk under the cross-domain face reenactment setting with several state-of-the-art face reenactment methods... The quantitative results of four cross-domain reenactment tasks are reported in Tab. 2. Our AnyTalk outperforms all baselines in terms of FID across the four tasks... Additionally, our AnyTalk also performs the best in identity preservation and expression consistency, i.e., the highest CSIM and CEIM. To verify the effectiveness of our proposed expression consistency loss L_exp, we perform an ablation study by removing L_exp from our method. We conduct a user study to further evaluate the performance of all the methods.
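The identity (CSIM) and expression (CEIM) metrics quoted above are both cosine similarities between embedding vectors of two frames. A minimal sketch of that computation, assuming the embeddings are already extracted (the arrays below are placeholders, not real ArcFace or expression-encoder outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# CSIM: similarity between identity embeddings of the source and generated frames.
# CEIM: similarity between expression embeddings of the driving and generated frames.
source_id = np.array([0.6, 0.8, 0.0])     # placeholder identity embedding
generated_id = np.array([0.6, 0.8, 0.0])  # identical -> similarity of 1.0
csim = cosine_similarity(source_id, generated_id)
```

Higher values indicate better identity preservation (CSIM) or expression consistency (CEIM); identical embeddings score 1.0 and orthogonal embeddings score 0.0.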
Researcher Affiliation | Academia | Yu Wang, Yunfei Liu, Fa-Ting Hong, Meng Cao, Lijian Lin, Yu Li* — International Digital Economy Academy (IDEA)
Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | We first pretrain our AnyTalk on VoxCeleb1 (Nagrani, Chung, and Zisserman 2017), a popular talking head generation dataset. E_exp is pretrained on AffectNet (Mollahosseini, Hasani, and Mahoor 2019), a large-scale annotated emotion dataset.
Dataset Splits | No | The paper mentions using the VoxCeleb1 and AniTalk datasets and states, 'For each video, we sample two frames: one as the source image s and the other as the driving image d, respectively.' However, it does not provide specific training/test/validation split percentages, absolute sample counts for splits, or references to predefined splits for these datasets required for reproduction.
Hardware Specification | No | The paper states, 'AnyTalk runs at 42 fps using a naive PyTorch implementation,' but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for training or inference.
Software Dependencies | No | The paper mentions a 'naive PyTorch implementation' but does not specify the version number of PyTorch or any other software libraries or dependencies.
Experiment Setup | No | The paper describes a total loss function `L_total = λ_exp L_exp + λ_P L_P + λ_G L_G + λ_E L_E + λ_dist L_dist + λ_M L_M + λ L` and states that 'λ_P, λ_G, λ_E, λ_dist, λ_M, λ, and λ_exp are hyperparameters to balance these losses.' However, it defers the specific values of these hyperparameters, stating, 'More details refer to Appendix.' Thus, concrete hyperparameter values are not provided in the main text.
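The total loss quoted above is a weighted sum of named loss terms. A minimal sketch of that assembly, assuming each component loss has already been computed; the individual loss values and lambda weights below are illustrative placeholders only, since the paper defers the actual hyperparameter values to its appendix:

```python
def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum L_total = sum over terms k of lambda_k * L_k."""
    return sum(weights[k] * losses[k] for k in losses)

# Placeholder component losses and weights (NOT the paper's values).
losses = {"exp": 0.2, "P": 1.0, "G": 0.5, "E": 0.3, "dist": 0.1, "M": 0.4}
weights = {"exp": 1.0, "P": 10.0, "G": 1.0, "E": 1.0, "dist": 1.0, "M": 1.0}
l_total = total_loss(losses, weights)
```

Reporting these lambda values in the main text (rather than the appendix) is what the reproducibility criterion asks for.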