Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Authors: Yan Rong, Li Liu

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. "Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity."
Researcher Affiliation — Academia. "Yan Rong, Li Liu* — The Hong Kong University of Science and Technology (Guangzhou), EMAIL, EMAIL"
Pseudocode — No. "The paper describes the methodology using textual explanations and mathematical equations (Eqs. (1)–(8)) but does not include any clearly labeled pseudocode or algorithm blocks."
Open Source Code — No. Project website: https://id-facevc.github.io; extended version: https://arxiv.org/pdf/2409.00700
Open Datasets — Yes. "To the best of our knowledge, current ZS-FVC methods utilized the LRS3 (Afouras, Chung, and Zisserman 2018) dataset, which comprises over 400 hours of TED talks collected from YouTube, for training."
Dataset Splits — Yes. "More precisely, we selected the paired data from the top 200 speakers by video count, resulting in 11,430 videos for training and 5,173 videos for validation. For testing, we randomly selected 16 previously unseen speakers, including 8 target speakers (4 male, 4 female) and 8 source speakers (4 male, 4 female)."
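The speaker-selection step quoted above can be sketched as a simple frequency count. The helper name, speaker IDs, and toy data below are illustrative assumptions, not the paper's actual code:

```python
from collections import Counter

def select_training_speakers(video_speaker_ids, top_k=200):
    """Pick the top-k speakers by video count, as in the LRS3 pairing step."""
    counts = Counter(video_speaker_ids)
    return [speaker for speaker, _ in counts.most_common(top_k)]

# Toy example: speaker "a" has 3 videos, "b" has 2, "c" has 1.
videos = ["a", "b", "a", "c", "a", "b"]
print(select_training_speakers(videos, top_k=2))  # -> ['a', 'b']
```

`Counter.most_common` orders speakers by descending video count, which matches the "top 200 speakers by video count" criterion.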
Hardware Specification — Yes. "Training is conducted on a single NVIDIA A800 GPU with a batch size of 256 for 2000 epochs."
Software Dependencies — No. "Facial features are extracted using the ViT-B/32 from CLIP, with outputs from the penultimate layer utilized to enhance generalization over the final layer. Audio is extracted from video clips via FFmpeg, and the HTSAT-base from CLAP serves as the speaker feature extractor. For the vocoder, we utilize a pretrained Parallel WaveGAN (Yamamoto, Song, and Kim 2020). We select the VITS (Kim, Kong, and Son 2021) model as the base speaker TTS."
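As a hedged illustration of the FFmpeg audio-extraction step, the snippet below assembles a command that strips a mono WAV from a video clip. The 16 kHz sample rate, file names, and function name are assumptions; the paper does not report its exact FFmpeg invocation:

```python
def build_ffmpeg_cmd(video_path, wav_path, sample_rate=16000):
    """Assemble an ffmpeg invocation that extracts mono PCM audio from a video."""
    return [
        "ffmpeg", "-y",            # overwrite output without prompting
        "-i", video_path,          # input video clip
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample to the target rate
        wav_path,
    ]

print(" ".join(build_ffmpeg_cmd("clip.mp4", "clip.wav")))
```

The command could then be executed with `subprocess.run(build_ffmpeg_cmd(...), check=True)`; all flags used here (`-i`, `-vn`, `-ac`, `-ar`, `-y`) are standard FFmpeg options.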
Experiment Setup — Yes. "Training is conducted on a single NVIDIA A800 GPU with a batch size of 256 for 2000 epochs. Loss weights specified in Eq. (8) are set at λ1 = 0.1, λ2 = 0.01, λ3 = 0.1, λ4 = 0.1, and λ5 = 1."
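The Eq. (8) weighting amounts to a weighted sum of five component losses. The sketch below uses the λ values quoted above; the component loss values are placeholders, since the paper defines the actual loss terms in its Eqs. (1)–(8):

```python
# Loss weights for Eq. (8) as reported in the paper.
LAMBDAS = {"l1": 0.1, "l2": 0.01, "l3": 0.1, "l4": 0.1, "l5": 1.0}

def total_loss(losses):
    """Weighted sum of the five component losses, per Eq. (8)."""
    return sum(LAMBDAS[name] * value for name, value in losses.items())

# Placeholder component losses, purely illustrative:
example = {"l1": 2.0, "l2": 5.0, "l3": 1.0, "l4": 1.0, "l5": 0.5}
print(total_loss(example))  # -> approximately 0.95
```

Note that λ5 = 1 makes the fifth term dominant at equal magnitudes, while λ2 = 0.01 down-weights the second term by two orders of magnitude.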