BridgeVoC: Neural Vocoder with Schrödinger Bridge

Authors: Tong Lei, Zhiyu Zhang, Rilin Chen, Meng Yu, Jing Lu, Chengshi Zheng, Dong Yu, Andong Li

IJCAI 2025

| Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Quantitative experiments on LJSpeech and LibriTTS show that BridgeVoC achieves faster inference and surpasses existing diffusion-based vocoder baselines, while also matching or exceeding non-diffusion state-of-the-art methods across evaluation metrics. |
| Researcher Affiliation | Collaboration | 1. Key Laboratory of Modern Acoustics, Nanjing University; 2. Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; 3. Tencent AI Lab; 4. National Mobile Communications Research Laboratory, Southeast University |
| Pseudocode | No | The paper describes mathematical models (SDEs, the Schrödinger bridge) and loss functions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor a link to a code repository. |
| Open Datasets | Yes | "Two benchmarks are used in this study: LJSpeech [Keith and Linda, 2017] and LibriTTS [Heiga et al., 2019]. ... To evaluate the generalization capability of neural vocoders, the VCTK dataset [Yamagishi, 2012] is utilized for out-of-distribution evaluations." |
| Dataset Splits | Yes | "LJSpeech contains 13,100 clean speech clips from a single female speaker at 22.05 kHz, partitioned into 12,500/100/500 clips for training, validation, and testing, following the VITS repository. LibriTTS, sampled at 24 kHz, includes diverse recording conditions; we use the {train-clean-100, train-clean-360, train-other-500} subsets for training, dev-clean + dev-other for objective evaluation, and test-clean + test-other for subjective evaluation, as in [Lee et al., 2023]." |
| Hardware Specification | Yes | "real-time factor (RTF), which is measured on a single Tesla V100 GPU." |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., NCSN++, BigVGAN, HiFi-GAN, WaveNet) but does not provide specific version numbers for the software dependencies used in the implementation. |
| Experiment Setup | Yes | "For the weight hyperparameters in Eq. (22), λmel, λg, and λfm are 0.1, 10.0, and 10.0, respectively. ... We train all models for 1 million steps, except for BigVGAN, which is trained for 5 million steps. ... For feature extraction, we employ a 1024-point FFT, a Hann window of length 1024, and a hop size of 256. For the LJSpeech dataset, we utilize 80 mel bands with the upper-bound frequency fmax set to 8 kHz. ... For LibriTTS, the mel bands and upper-bound frequency are set to 100 and 12 kHz, respectively." |
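The quoted feature-extraction settings (1024-point FFT, 1024-sample Hann window, hop size 256, 80 mel bands with fmax = 8 kHz for LJSpeech at 22.05 kHz) can be sketched in NumPy. This is a minimal illustration, not the authors' (unreleased) implementation: the helper names `hz_to_mel`, `mel_filterbank`, and `mel_spectrogram` are made up here, and the HTK mel scale and log compression are assumptions the paper does not confirm.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> mel conversion (an assumption; the paper does not
    # state which mel scale it uses).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmax=8000.0):
    # Triangular mel filters mapping the n_fft//2 + 1 FFT bins to n_mels bands.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fmax), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80, fmax=8000.0):
    # Frame -> Hann window -> rFFT magnitude -> mel projection -> log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, 513)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels, fmax).T
    return np.log(np.clip(mel, 1e-5, None))              # (n_frames, n_mels)

# One second of noise at 22.05 kHz yields ~sr/hop frames of 80 mel bands.
feats = mel_spectrogram(np.random.randn(22050))
print(feats.shape)  # (83, 80)
```

For the LibriTTS configuration one would instead pass `sr=24000`, `n_mels=100`, and `fmax=12000.0` with the same FFT, window, and hop settings.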