Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Authors: Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, zhangyuan, Pengfei Wan, Di ZHANG, Shuai Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We thoroughly validate the effectiveness of our proposed controllable talking face generation method through extensive experiments. To the best of our knowledge, we are the first to address the fine-grained 3D talking face control task. Thus, we compare the lip-synchronization and coarse-grained expression control capabilities with existing methods and achieve state-of-the-art performance. Additionally, we conduct ablation experiments to demonstrate the importance of each proposed module in achieving fine-grained control. Finally, we conduct a user study, which shows wide acceptance of the effectiveness of our control.
Researcher Affiliation | Collaboration | 1 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; 2 Kuaishou Technology; 3 Zhongguancun Laboratory
Pseudocode | Yes | Algorithm 1: Binarization Algorithm Process
Open Source Code | No | Project page: https://harryxd2018.github.io/cafe-talk/
Open Datasets | Yes | We use the publicly available emotional talking face datasets MEAD (Wang et al., 2020) and RAVDESS (Livingstone & Russo, 2018) to train our model.
Dataset Splits | Yes | We split out two participants from MEAD for the validation and test sets each, and split RAVDESS following Peng et al. (2023b). ... We randomly split the dataset into training, validation, and testing sets in proportions of 70%, 15%, and 15% to ensure in-domain classification ability.
Hardware Specification | Yes | Both stages are trained with a batch size of 16 on 8 Nvidia V100 GPUs for 400k and 300k iterations (~4 days each). ... The inference latency is primarily caused by the sampling and denoising process of the current diffusion architecture, which could be partially alleviated with a more advanced GPU. Our method's high controllability, precise manipulation, and diverse outputs are compatible with flexible offline facial editing and time-tolerant AIGC talking-animation generation in real-world applications. Table 7: Size and inference time cost of related methods. Models | Num of parameters (M) | Inference time (sec) ... Cafe-Talk ... 14.71 sec (inference time for 5-second audio on an Nvidia 2080 Ti GPU)
Software Dependencies | No | The paper mentions software such as "Wav2Vec2-xslr-300m" (an audio encoder model), the "AdamW optimizer", and "CLIP", but does not provide specific version numbers for these or other software libraries/frameworks. "Unreal Engine MetaHuman 2" is mentioned for rendering and carries a version number, but it is an external tool used for visualization, not a core software dependency of the methodology's implementation.
Experiment Setup | Yes | We utilize an AdamW optimizer with a learning rate of 0.0001, and both stages are trained with a batch size of 16 on 8 Nvidia V100 GPUs for 400k and 300k iterations (~4 days each). ... For the training objective, we adopt the simple loss (Ho et al., 2020) (Eq. 4). The speech audio and coarse-grained conditions are independently masked with a probability of 20% during training for the CFG technique (Ho & Salimans, 2021), which enhances the controllability (shown in App. B).
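The 70%/15%/15% random split quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustrative sketch under the assumption of a uniform random shuffle; the function name and seed are hypothetical and not taken from the authors' code.

```python
import random

# Hypothetical sketch of a 70/15/15 train/val/test split,
# as described for the expression-classifier dataset.
# Not the authors' released implementation.
def split_dataset(items, seed=0):
    """Shuffle items and split them into train/val/test (70/15/15)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(round(0.70 * n))
    n_val = int(round(0.15 * n))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when the same split must be reused to report validation and test numbers.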
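The 20% independent masking of the speech audio and coarse-grained conditions for classifier-free guidance could look roughly like the sketch below. `NULL_COND`, the function name, and the use of `None` as a stand-in for a learned null embedding are assumptions for illustration, not the authors' implementation.

```python
import random

# Illustrative sketch of CFG condition dropout during training:
# each condition is independently replaced by a null condition
# with probability 0.2, so the model also learns an unconditional
# denoiser. NULL_COND is a stand-in for a learned null embedding.
NULL_COND = None
P_MASK = 0.2

def mask_conditions(audio_feat, coarse_cond, p=P_MASK, rng=random):
    """Independently drop each condition with probability p."""
    if rng.random() < p:
        audio_feat = NULL_COND
    if rng.random() < p:
        coarse_cond = NULL_COND
    return audio_feat, coarse_cond
```

At sampling time, the conditional and unconditional predictions learned this way are combined with a guidance weight, which is what yields the enhanced controllability reported in the setup.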