Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion
Authors: Yan Rong, Li Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. |
| Researcher Affiliation | Academia | Yan Rong, Li Liu* The Hong Kong University of Science and Technology (Guangzhou) EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., equations (1) to (8)) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project website https://id-facevc.github.io Extended version https://arxiv.org/pdf/2409.00700 |
| Open Datasets | Yes | To the best of our knowledge, current ZS-FVC methods utilized the LRS3 (Afouras, Chung, and Zisserman 2018) dataset, which comprises over 400 hours of TED talks collected from YouTube, for training. |
| Dataset Splits | Yes | More precisely, we selected the paired data from the top 200 speakers by video count, resulting in 11,430 videos for training and 5,173 videos for validation. For testing, we randomly selected 16 previously unseen speakers, including 8 target speakers (4 male, 4 female) and 8 source speakers (4 male, 4 female). |
| Hardware Specification | Yes | Training is conducted on a single Nvidia-A800 GPU with a batch size of 256 for 2000 epochs. |
| Software Dependencies | No | Facial features are extracted using the ViT-B/32 from CLIP, with outputs from the penultimate layer utilized to enhance generalization over the final layer. Audio is extracted from video clips via FFmpeg, and the HTSAT-base from CLAP serves as the speaker feature extractor. For the vocoder, we utilize a pretrained Parallel WaveGAN (Yamamoto, Song, and Kim 2020). We select the VITS (Kim, Kong, and Son 2021) model as the base speaker TTS. |
| Experiment Setup | Yes | Training is conducted on a single Nvidia-A800 GPU with a batch size of 256 for 2000 epochs. Loss weights specified in Eq. (8) are set at λ1 = 0.1, λ2 = 0.01, λ3 = 0.1, λ4 = 0.1, and λ5 = 1. |
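The reported experiment setup pins down the loss weighting but not the code. A minimal sketch of how the five weighted terms of Eq. (8) would combine into a total training loss, assuming a plain weighted sum (the individual loss names below are hypothetical placeholders, since no source code is released):

```python
# Hedged sketch of the Eq. (8) loss combination with the weights
# reported in the paper: λ1=0.1, λ2=0.01, λ3=0.1, λ4=0.1, λ5=1.
# The term names are placeholders, not the authors' identifiers.

LOSS_WEIGHTS = [0.1, 0.01, 0.1, 0.1, 1.0]  # λ1..λ5 from the paper

def total_loss(term_values):
    """Weighted sum of the five per-term loss values, ordered λ1..λ5."""
    assert len(term_values) == len(LOSS_WEIGHTS)
    return sum(w * v for w, v in zip(LOSS_WEIGHTS, term_values))

# Example: if every individual loss term evaluated to 1.0,
# the total would be 0.1 + 0.01 + 0.1 + 0.1 + 1.0 = 1.31.
print(total_loss([1.0, 1.0, 1.0, 1.0, 1.0]))
```

This only illustrates the stated weighting; the actual composition of each term (reconstruction, disentanglement, adversarial, etc.) is defined in the paper's equations (1)-(8).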