VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization
Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. |
| Researcher Affiliation | Collaboration | Tao Liu¹*, Ziyang Ma¹, Qi Chen¹, Feilong Chen², Shuai Fan², Xie Chen¹, Kai Yu¹ — ¹X-LANCE Lab, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University; ²AISpeech Ltd |
| Pseudocode | Yes | Algorithm 1: Group-Residual FSQ |
| Open Source Code | No | The paper provides a link for viewing synthetic results (https://x-lance.github.io/VQTalker), but does not explicitly state that source code for the methodology is available at this link or in supplementary materials. |
| Open Datasets | Yes | We utilized three publicly available datasets: VoxCeleb (Nagrani, Chung, and Zisserman 2017), HDTF (Zhang et al. 2021), and VFHQ (Xie et al. 2022). |
| Dataset Splits | Yes | To evaluate performance in Indo-European languages and on video reconstruction tasks, we use HDTF (Zhang et al. 2021) as our test set, following the split used by DINet (Zhang et al. 2023b). |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments were found in the paper. |
| Software Dependencies | No | The paper mentions using a 12-layer BERT network and a pre-trained speech tokenizer from CosyVoice, but does not provide specific version numbers for these or for other software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | In the second stage, we employed a 12-layer BERT (Devlin et al. 2019) network to iteratively generate a four-layer residual codebook for the face tokenizer. The maximum length is 4096. ... Our approach employs 12 group layers, 4 residual layers, and 625 codebook entries per group, with a sampling rate of 25 fps... |
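The Experiment Setup row describes a group-residual FSQ tokenizer (12 group layers, 4 residual layers, 625 codebook entries per group). The sketch below illustrates how such a scheme could work, assuming the 625 entries per group arise from 4 scalar channels quantized to 5 levels each (5⁴ = 625); the function names are hypothetical, and details from the paper's Algorithm 1 (e.g. straight-through gradients, residual scaling) are omitted.

```python
import numpy as np

def fsq_quantize(z, levels=5):
    # Finite scalar quantization: bound each scalar to
    # [-(levels-1)/2, (levels-1)/2] with tanh, then round to the
    # nearest integer level.
    half = (levels - 1) / 2.0
    return np.round(half * np.tanh(z))

def group_residual_fsq(z, num_residual=4, levels=5):
    # z: (num_groups, dims_per_group), e.g. (12, 4) as in the paper's setup.
    # Each residual layer quantizes what the previous layers left over,
    # so the sum of the per-layer codes approximates z.
    residual = z.astype(float).copy()
    quantized = np.zeros_like(residual)
    codes = []
    for _ in range(num_residual):
        q = fsq_quantize(residual, levels)
        quantized += q
        codes.append(q.astype(int))
        residual = residual - q
    return quantized, codes

def code_index(q, levels=5):
    # Map one group's code vector (integers in [-half, half]) to a
    # single index in [0, levels**dims), i.e. [0, 625) for 4 dims.
    half = (levels - 1) // 2
    return sum(int(d + half) * levels**i for i, d in enumerate(q))

rng = np.random.default_rng(0)
z = rng.normal(size=(12, 4))              # 12 groups of 4 channels
quantized, codes = group_residual_fsq(z)  # 4 residual code layers
print(code_index(codes[0][0]))            # index of group 0, layer 0
```

Since every per-group code maps to an index below 625, the tokenizer's discrete vocabulary per group matches the reported codebook size; the 12 groups and 4 residual layers then give 48 codes per frame at 25 fps.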