Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Authors: Yan Rong, Li Liu

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. "Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity."
Researcher Affiliation — Academia. "Yan Rong, Li Liu* — The Hong Kong University of Science and Technology (Guangzhou), EMAIL, EMAIL"
Pseudocode — No. "The paper describes the methodology using textual explanations and mathematical equations (Eqs. (1)–(8)) but does not include any clearly labeled pseudocode or algorithm blocks."
Open Source Code — No. Project website: https://id-facevc.github.io; extended version: https://arxiv.org/pdf/2409.00700
Open Datasets — Yes. "To the best of our knowledge, current ZS-FVC methods utilized the LRS3 (Afouras, Chung, and Zisserman 2018) dataset, which comprises over 400 hours of TED talks collected from YouTube, for training."
Dataset Splits — Yes. "More precisely, we selected the paired data from the top 200 speakers by video count, resulting in 11,430 videos for training and 5,173 videos for validation. For testing, we randomly selected 16 previously unseen speakers, including 8 target speakers (4 male, 4 female) and 8 source speakers (4 male, 4 female)."
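The speaker-selection step quoted above can be sketched as a simple frequency count. The helper name, speaker IDs, and toy data below are illustrative assumptions, not the paper's actual code:

```python
from collections import Counter

def select_training_speakers(video_speaker_ids, top_k=200):
    """Pick the top-k speakers by video count, as in the LRS3 pairing step."""
    counts = Counter(video_speaker_ids)
    return [speaker for speaker, _ in counts.most_common(top_k)]

# Toy example: speaker "a" has 3 videos, "b" has 2, "c" has 1.
videos = ["a", "b", "a", "c", "a", "b"]
print(select_training_speakers(videos, top_k=2))  # -> ['a', 'b']
```

`Counter.most_common` orders speakers by descending video count, which matches the "top 200 speakers by video count" criterion.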
Hardware Specification — Yes. "Training is conducted on a single NVIDIA A800 GPU with a batch size of 256 for 2000 epochs."
Software Dependencies — No. "Facial features are extracted using the ViT-B/32 from CLIP, with outputs from the penultimate layer utilized to enhance generalization over the final layer. Audio is extracted from video clips via FFmpeg, and the HTSAT-base from CLAP serves as the speaker feature extractor. For the vocoder, we utilize a pretrained Parallel WaveGAN (Yamamoto, Song, and Kim 2020). We select the VITS (Kim, Kong, and Son 2021) model as the base speaker TTS."
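As a hedged illustration of the FFmpeg audio-extraction step, the snippet below assembles a command that strips a mono WAV from a video clip. The 16 kHz sample rate, file names, and function name are assumptions; the paper does not report its exact FFmpeg invocation:

```python
def build_ffmpeg_cmd(video_path, wav_path, sample_rate=16000):
    """Assemble an ffmpeg invocation that extracts mono PCM audio from a video."""
    return [
        "ffmpeg", "-y",            # overwrite output without prompting
        "-i", video_path,          # input video clip
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample to the target rate
        wav_path,
    ]

print(" ".join(build_ffmpeg_cmd("clip.mp4", "clip.wav")))
```

The command could then be executed with `subprocess.run(build_ffmpeg_cmd(...), check=True)`; all flags used here (`-i`, `-vn`, `-ac`, `-ar`, `-y`) are standard FFmpeg options.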
Experiment Setup — Yes. "Training is conducted on a single NVIDIA A800 GPU with a batch size of 256 for 2000 epochs. Loss weights specified in Eq. (8) are set at λ1 = 0.1, λ2 = 0.01, λ3 = 0.1, λ4 = 0.1, and λ5 = 1."
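The Eq. (8) weighting amounts to a weighted sum of five component losses. The sketch below uses the λ values quoted above; the component loss values are placeholders, since the paper defines the actual loss terms in its Eqs. (1)–(8):

```python
# Loss weights for Eq. (8) as reported in the paper.
LAMBDAS = {"l1": 0.1, "l2": 0.01, "l3": 0.1, "l4": 0.1, "l5": 1.0}

def total_loss(losses):
    """Weighted sum of the five component losses, per Eq. (8)."""
    return sum(LAMBDAS[name] * value for name, value in losses.items())

# Placeholder component losses, purely illustrative:
example = {"l1": 2.0, "l2": 5.0, "l3": 1.0, "l4": 1.0, "l5": 0.5}
print(total_loss(example))  # -> approximately 0.95
```

Note that λ5 = 1 makes the fifth term dominant at equal magnitudes, while λ2 = 0.01 down-weights the second term by two orders of magnitude.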