GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Authors: Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Xiandong Li, Wenxiong Kang, Liang Peng, Songju Lei, Huang Xu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity. ... 4 Experiments; 4.1 Datasets and Implementations; 4.2 Quantitative Evaluation; 4.3 Qualitative Evaluation; 4.4 User Study; 4.5 Ablation Study
Researcher Affiliation | Collaboration | 1. South China University of Technology; 2. Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; 3. Hangzhou International Innovation Institute, Beihang University; 4. Huawei Cloud; 5. Nanjing University
Pseudocode | No | The paper describes the methodology using architectural diagrams and textual explanations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct abundant experiments on two public 3D facial datasets, BIWI [Fanelli et al., 2010] and VOCASET [Cudeiro et al., 2019], both of which have 4D face scans along with audio recordings.
Dataset Splits | Yes | We follow the data splits of the previous work [Fan et al., 2022] and only use the emotional data for fair comparisons. Specifically, the training set (BIWI-Train) contains 192 sentences, the validation set (BIWI-Val) contains 24 sentences, and the testing set is divided into two subsets, in which BIWI-Test-A contains 24 sentences spoken by 6 subjects seen during training and BIWI-Test-B contains 32 sentences spoken by 8 subjects unseen during training. ... Similar to [Fan et al., 2022], we adopt the same training (VOCA-Train), validation (VOCA-Val) and testing (VOCA-Test) splits for qualitative testing.
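The BIWI sentence counts quoted above can be captured in a small configuration sketch for anyone re-implementing the evaluation protocol. The dictionary layout and function name below are illustrative conventions, not the authors' code; the counts are taken directly from the quoted passage.

```python
# BIWI sentence counts per split, as reported in the quoted passage.
# The structure is a hypothetical convention, not the authors' code.
BIWI_SPLITS = {
    "BIWI-Train": 192,
    "BIWI-Val": 24,
    "BIWI-Test-A": 24,  # 6 subjects seen during training
    "BIWI-Test-B": 32,  # 8 subjects unseen during training
}

def total_sentences(splits):
    """Sum sentence counts across all splits."""
    return sum(splits.values())
```

Under this tally, the splits cover 272 sentences in total, with Test-B held out entirely for subjects unseen during training.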
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | Yes | Audio Encoder Ea uses the released hubert-base-ls960 version of the HuBERT architecture pre-trained on 960 hours of 16 kHz sampled speech audio.
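For reproduction, the named checkpoint is commonly obtained through Hugging Face Transformers. This is a configuration sketch under the assumption that the paper's checkpoint corresponds to the standard `facebook/hubert-base-ls960` model id; the paper does not specify a loading mechanism, and the partial freezing mirrors the experiment-setup quote below.

```python
# Configuration sketch (not the authors' code): load the pretrained
# HuBERT checkpoint named in the paper via Hugging Face Transformers.
# Assumes the hub id "facebook/hubert-base-ls960"; HuBERT expects
# 16 kHz mono audio input.
from transformers import HubertModel

audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# Per the experiment-setup row: freeze the feature extractor, the
# feature projection layer, and the first two encoder layers.
frozen_modules = [
    audio_encoder.feature_extractor,
    audio_encoder.feature_projection,
    *audio_encoder.encoder.layers[:2],
]
for module in frozen_modules:
    for param in module.parameters():
        param.requires_grad = False
```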
Experiment Setup | Yes | L_stage1 = λ_rec1 · L_rec1 + λ_quant · L_quant (Eq. 6), where λ_rec1 = λ_quant = 1. ... where β denotes a weighting hyperparameter, which is 0.25 in all our experiments. ... L_stage2 = λ_rec2 · L_rec2 + λ_vel · L_vel (Eq. 13), where λ_rec2 = λ_vel = 1. ... The feature extractor, feature projection layer, and the initial two layers of the encoder are frozen, while the remaining parameters are set to be trainable.
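The two training objectives quoted above are plain weighted sums, so they reduce to a few lines of code. A minimal sketch follows; the function names are illustrative, the individual loss terms are assumed to be precomputed scalars, and the VQ commitment loss that β weights is not reproduced here.

```python
def stage1_loss(l_rec1, l_quant, lam_rec1=1.0, lam_quant=1.0):
    """Stage-1 objective: L_stage1 = lam_rec1 * L_rec1 + lam_quant * L_quant (Eq. 6)."""
    return lam_rec1 * l_rec1 + lam_quant * l_quant

def stage2_loss(l_rec2, l_vel, lam_rec2=1.0, lam_vel=1.0):
    """Stage-2 objective: L_stage2 = lam_rec2 * L_rec2 + lam_vel * L_vel (Eq. 13)."""
    return lam_rec2 * l_rec2 + lam_vel * l_vel

# Commitment weight inside the quantization loss, per the paper.
BETA = 0.25
```

With the paper's settings (all λ weights equal to 1), each stage's objective is simply the unweighted sum of its two terms.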