BridgeVoC: Neural Vocoder with Schrödinger Bridge

Authors: Tong Lei, Zhiyu Zhang, Rilin Chen, Meng Yu, Jing Lu, Chengshi Zheng, Dong Yu, Andong Li

IJCAI 2025

| Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Quantitative experiments on LJSpeech and LibriTTS show that BridgeVoC achieves faster inference and surpasses existing diffusion-based vocoder baselines, while also matching or exceeding non-diffusion state-of-the-art methods across evaluation metrics. |
| Researcher Affiliation | Collaboration | 1. Key Laboratory of Modern Acoustics, Nanjing University; 2. Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; 3. Tencent AI Lab; 4. National Mobile Communications Research Laboratory, Southeast University |
| Pseudocode | No | The paper describes mathematical models (SDEs, the Schrödinger bridge) and loss functions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor a link to a code repository. |
| Open Datasets | Yes | "Two benchmarks are used in this study: LJSpeech [Keith and Linda, 2017] and LibriTTS [Heiga et al., 2019]. ... To evaluate the generalization capability of neural vocoders, the VCTK dataset [Yamagishi, 2012] is utilized for out-of-distribution evaluations." |
| Dataset Splits | Yes | "LJSpeech contains 13,100 clean speech clips from a single female speaker at 22.05 kHz, partitioned into 12,500/100/500 clips for training, validation, and testing, following the VITS repository. LibriTTS, sampled at 24 kHz, includes diverse recording conditions; we use the {train-clean-100, train-clean-360, train-other-500} subsets for training, dev-clean + dev-other for objective evaluation, and test-clean + test-other for subjective evaluation, as in [Lee et al., 2023]." |
| Hardware Specification | Yes | "real-time factor (RTF), which is measured on a single Tesla V100 GPU." |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., NCSN++, BigVGAN, HiFi-GAN, WaveNet) but does not provide specific version numbers for the software dependencies used in the implementation. |
| Experiment Setup | Yes | "For the weight hyperparameters in Eq. (22), λmel, λg, and λfm are 0.1, 10.0, and 10.0, respectively. ... We train all models for 1 million steps, except for BigVGAN, which is trained for 5 million steps. ... For feature extraction, we employ a 1024-point FFT, a Hann window of length 1024, and a hop size of 256. For the LJSpeech dataset, we utilize 80 mel bands with the upper-bound frequency fmax set to 8 kHz. ... For LibriTTS, the mel bands and upper-bound frequency are set to 100 and 12 kHz, respectively." |
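The quoted feature-extraction settings (1024-point FFT, 1024-sample Hann window, hop size 256, 80 mel bands with fmax = 8 kHz for LJSpeech at 22.05 kHz) can be sketched in NumPy. This is a minimal illustration, not the authors' (unreleased) implementation: the helper names `hz_to_mel`, `mel_filterbank`, and `mel_spectrogram` are made up here, and the HTK mel scale and log compression are assumptions the paper does not confirm.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> mel conversion (an assumption; the paper does not
    # state which mel scale it uses).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmax=8000.0):
    # Triangular mel filters mapping the n_fft//2 + 1 FFT bins to n_mels bands.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fmax), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80, fmax=8000.0):
    # Frame -> Hann window -> rFFT magnitude -> mel projection -> log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, 513)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels, fmax).T
    return np.log(np.clip(mel, 1e-5, None))              # (n_frames, n_mels)

# One second of noise at 22.05 kHz yields ~sr/hop frames of 80 mel bands.
feats = mel_spectrogram(np.random.randn(22050))
print(feats.shape)  # (83, 80)
```

For the LibriTTS configuration one would instead pass `sr=24000`, `n_mels=100`, and `fmax=12000.0` with the same FFT, window, and hop settings.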