VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization

Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings.
Researcher Affiliation | Collaboration | Tao Liu1*, Ziyang Ma1, Qi Chen1, Feilong Chen2, Shuai Fan2, Xie Chen1, Kai Yu1; 1 X-LANCE Lab, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University; 2 AISpeech Ltd
Pseudocode | Yes | Algorithm 1: Group-Residual FSQ (a hedged sketch of this quantizer appears after the table).
Open Source Code | No | The paper provides a link for viewing synthetic results (https://x-lance.github.io/VQTalker), but does not explicitly state that source code for the methodology is available at this link or in supplementary materials.
Open Datasets | Yes | We utilized three publicly available datasets: VoxCeleb (Nagrani, Chung, and Zisserman 2017), HDTF (Zhang et al. 2021), and VFHQ (Xie et al. 2022).
Dataset Splits | Yes | To evaluate performance in Indo-European languages and video reconstruction tasks, we use HDTF (Zhang et al. 2021) as our test set, which follows DINet (Zhang et al. 2023b).
Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor speeds, memory amounts, or other system specifications) used for the experiments are reported in the paper.
Software Dependencies | No | The paper mentions using a 12-layer BERT network and a pre-trained speech tokenizer from CosyVoice, but does not provide version numbers for these or for other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | In the second stage, we employed a 12-layer BERT (Devlin et al. 2019) network to iteratively generate a four-layer residual codebook for the face tokenizer. The maximum length is 4096. ... Our approach employs 12 group layers, 4 residual layers, and 625 codebook entries per group, with a sampling rate of 25 fps...
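
The Pseudocode and Experiment Setup rows reference the paper's Group-Residual FSQ tokenizer and its reported configuration (12 group layers, 4 residual layers, 625 codebook entries per group, 25 fps). The following is a minimal PyTorch sketch of how such a quantizer could be structured, not a reproduction of the paper's Algorithm 1: the per-group FSQ levels [5, 5, 5, 5] are an assumption (5^4 = 625 matches the reported per-group codebook size), and the class name, residual rescaling scheme, and tensor shapes are illustrative only.

```python
# Minimal sketch of a Group-Residual FSQ quantizer (assumes PyTorch).
# Assumption: each group uses FSQ levels [5, 5, 5, 5], since 5^4 = 625 matches
# the reported 625 codebook entries per group; the latent is therefore taken to
# be 12 groups x 4 dims = 48 channels. The paper's Algorithm 1 may differ.
import torch
import torch.nn as nn


class GroupResidualFSQ(nn.Module):
    def __init__(self, groups=12, residual_layers=4, levels=(5, 5, 5, 5)):
        super().__init__()
        self.groups = groups
        self.residual_layers = residual_layers
        # Per-dimension level counts, shared by every group (assumption).
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def _fsq(self, z):
        # Finite scalar quantization: bound each dimension to
        # [-(L-1)/2, (L-1)/2], round to the nearest integer, and use a
        # straight-through estimator so gradients flow through the rounding.
        half = (self.levels - 1) / 2
        z = torch.tanh(z) * half
        z_q = torch.round(z)
        return z + (z_q - z).detach()

    def forward(self, z):
        # z: (batch, frames, groups * len(levels)) continuous motion latents.
        b, t, _ = z.shape
        z = z.view(b, t, self.groups, -1)
        quantized = torch.zeros_like(z)
        residual = z
        for i in range(self.residual_layers):
            # Each residual stage quantizes what the previous stages left
            # unexplained, on a finer grid (assumed rescaling scheme).
            scale = 0.5 ** i
            q = self._fsq(residual / scale) * scale
            quantized = quantized + q
            residual = residual - q
        return quantized.view(b, t, -1)


if __name__ == "__main__":
    quantizer = GroupResidualFSQ()
    motion = torch.randn(2, 25, 48)  # one second of 25 fps latents (assumed dim)
    tokens = quantizer(motion)
    print(tokens.shape)  # torch.Size([2, 25, 48])
```

The general design intent of group plus residual quantization is to keep each per-group codebook small (625 entries here) while letting successive residual stages refine the quantization error; how VQTalker combines the group and residual indices into facial motion tokens is specified by the paper's Algorithm 1, not by this sketch.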