Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Authors: Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, zhangyuan, Pengfei Wan, Di ZHANG, Shuai Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We thoroughly validate the effectiveness of our proposed controllable talking face generation method through extensive experiments. To the best of our knowledge, we are the first to address the fine-grained 3D talking face control task. Thus, we compare the lip-synchronization and coarse-grained expression control capabilities with existing methods and achieve state-of-the-art performance. Additionally, we conduct ablation experiments to demonstrate the importance of each proposed module in achieving fine-grained control. Finally, we conduct a user study, which shows wide acceptance of the effectiveness of our control.
Researcher Affiliation | Collaboration | 1 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; 2 Kuaishou Technology; 3 Zhongguancun Laboratory
Pseudocode | Yes | Algorithm 1: Binarization Algorithm Process
Open Source Code | No | Project page: https://harryxd2018.github.io/cafe-talk/
Open Datasets | Yes | We use the publicly available emotional talking face datasets MEAD (Wang et al., 2020) and RAVDESS (Livingstone & Russo, 2018) to train our model.
Dataset Splits | Yes | We split out two participants from MEAD for the validation and test sets each, and split RAVDESS following Peng et al. (2023b). ... We randomly split the dataset into training, validation, and testing sets in proportions of 70%, 15%, and 15% to ensure in-domain classification ability.
Hardware Specification | Yes | Both stages are trained with a batch size of 16 on 8 Nvidia V100 GPUs for 400k and 300k iterations (~4 days each). ... The inference latency is primarily caused by the sampling and denoising process of the current diffusion architecture, which could be partially alleviated with a more advanced GPU. Our method's high controllability, precise manipulation, and diverse outputs are compatible with flexible offline facial editing and time-tolerant AIGC talking-animation generation in real-world applications. Table 7: Size and inference time cost of related methods. Models | Num of parameters (M) | Inference time (sec) ... Cafe-Talk ... 14.71 sec (inference time for 5-second audio on an Nvidia 2080 Ti GPU)
Software Dependencies | No | The paper mentions software such as "Wav2Vec2-xslr-300m" (an audio encoder model), the "AdamW optimizer", and "CLIP", but does not provide specific version numbers for these or other software libraries/frameworks. "Unreal Engine MetaHuman 2" is mentioned for rendering and carries a version number, but it is an external tool used for visualization, not a core software dependency of the methodology's implementation.
Experiment Setup | Yes | We utilize an AdamW optimizer with a learning rate of 0.0001, and both stages are trained with a batch size of 16 on 8 Nvidia V100 GPUs for 400k and 300k iterations (~4 days each). ... For the training objective, we adopt the simple loss (Ho et al., 2020) (Eq. 4). The speech audio and coarse-grained conditions are independently masked with a probability of 20% during training for the CFG technique (Ho & Salimans, 2021), which enhances the controllability (shown in App. B).
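The 70%/15%/15% random split quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustrative sketch under the assumption of a uniform random shuffle; the function name and seed are hypothetical and not taken from the authors' code.

```python
import random

# Hypothetical sketch of a 70/15/15 train/val/test split,
# as described for the expression-classifier dataset.
# Not the authors' released implementation.
def split_dataset(items, seed=0):
    """Shuffle items and split them into train/val/test (70/15/15)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(round(0.70 * n))
    n_val = int(round(0.15 * n))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when the same split must be reused to report validation and test numbers.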
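The 20% independent masking of the speech audio and coarse-grained conditions for classifier-free guidance could look roughly like the sketch below. `NULL_COND`, the function name, and the use of `None` as a stand-in for a learned null embedding are assumptions for illustration, not the authors' implementation.

```python
import random

# Illustrative sketch of CFG condition dropout during training:
# each condition is independently replaced by a null condition
# with probability 0.2, so the model also learns an unconditional
# denoiser. NULL_COND is a stand-in for a learned null embedding.
NULL_COND = None
P_MASK = 0.2

def mask_conditions(audio_feat, coarse_cond, p=P_MASK, rng=random):
    """Independently drop each condition with probability p."""
    if rng.random() < p:
        audio_feat = NULL_COND
    if rng.random() < p:
        coarse_cond = NULL_COND
    return audio_feat, coarse_cond
```

At sampling time, the conditional and unconditional predictions learned this way are combined with a guidance weight, which is what yields the enhanced controllability reported in the setup.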