GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
Authors: Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Xiandong Li, Wenxiong Kang, Liang Peng, Songju Lei, Huang Xu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity. ... 4 Experiments 4.1 Datasets and Implementations 4.2 Quantitative Evaluation 4.3 Qualitative Evaluation 4.4 User Study 4.5 Ablation Study |
| Researcher Affiliation | Collaboration | 1South China University of Technology, 2Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University, 3Hangzhou International Innovation Institute, Beihang University, 4Huawei Cloud, 5Nanjing University |
| Pseudocode | No | The paper describes the methodology using architectural diagrams and textual explanations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct abundant experiments on two public 3D facial datasets, BIWI [Fanelli et al., 2010] and VOCASET [Cudeiro et al., 2019], both of which have 4D face scans along with audio recordings. |
| Dataset Splits | Yes | We follow the data splits of the previous work [Fan et al., 2022] and only use the emotional data for fair comparisons. Specifically, the training set (BIWI-Train) contains 192 sentences, the validation set (BIWI-Val) contains 24 sentences, and the testing set is divided into two subsets, in which BIWI-Test-A contains 24 sentences spoken by 6 subjects seen during training and BIWI-Test-B contains 32 sentences spoken by 8 subjects unseen during training. ... Similar to [Fan et al., 2022], we adopt the same training (VOCA-Train), validation (VOCA-Val) and testing (VOCA-Test) splits for qualitative testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | Yes | Audio Encoder Ea uses the released hubert-base-ls960 version of the HuBERT architecture pre-trained on 960 hours of 16 kHz sampled speech audio. |
| Experiment Setup | Yes | L_stage1 = λ_rec1·L_rec1 + λ_quant·L_quant (Eq. 6), where λ_rec1 = λ_quant = 1. ... where β denotes a weighting hyperparameter, set to 0.25 in all our experiments. ... L_stage2 = λ_rec2·L_rec2 + λ_vel·L_vel (Eq. 13), where λ_rec2 = λ_vel = 1. ... The feature extractor, feature projection layer and the initial two layers of the encoder are frozen, while the remaining parameters are set to be trainable. |
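The loss weighting quoted in the Experiment Setup row can be sketched as plain functions. This is a minimal illustration of the reported hyperparameters (λ terms = 1, β = 0.25); the function names and the VQ-style split of the quantization loss into codebook and commitment terms are assumptions, not taken from the authors' code.

```python
# Hedged sketch of the two-stage loss weighting reported in the paper.
# Function and argument names are illustrative, not the authors' identifiers.

def stage1_loss(l_rec1, l_quant, lam_rec1=1.0, lam_quant=1.0):
    """Stage-1 objective: L_stage1 = lam_rec1 * L_rec1 + lam_quant * L_quant.

    The paper reports lam_rec1 = lam_quant = 1 (Eq. 6).
    """
    return lam_rec1 * l_rec1 + lam_quant * l_quant


def quant_loss(codebook_term, commitment_term, beta=0.25):
    """Assumed VQ-style quantization loss where beta = 0.25 weights the
    commitment term, matching the reported hyperparameter."""
    return codebook_term + beta * commitment_term


def stage2_loss(l_rec2, l_vel, lam_rec2=1.0, lam_vel=1.0):
    """Stage-2 objective: L_stage2 = lam_rec2 * L_rec2 + lam_vel * L_vel.

    The paper reports lam_rec2 = lam_vel = 1 (Eq. 13).
    """
    return lam_rec2 * l_rec2 + lam_vel * l_vel
```

With the default weights these reduce to simple sums, e.g. `quant_loss(1.0, 2.0)` gives `1.5` since the commitment term is down-weighted by β = 0.25.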