BridgeVoC: Neural Vocoder with Schrödinger Bridge
Authors: Tong Lei, Zhiyu Zhang, Rilin Chen, Meng Yu, Jing Lu, Chengshi Zheng, Dong Yu, Andong Li
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative experiments on LJSpeech and LibriTTS show that BridgeVoC achieves faster inference and surpasses existing diffusion-based vocoder baselines, while also matching or exceeding non-diffusion state-of-the-art methods across evaluation metrics. |
| Researcher Affiliation | Collaboration | 1Key Laboratory of Modern Acoustics, Nanjing University; 2Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; 3Tencent AI Lab; 4National Mobile Communications Research Laboratory, Southeast University |
| Pseudocode | No | The paper describes mathematical models (SDEs, SB) and loss functions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | Two benchmarks are used in this study: LJSpeech [Keith and Linda, 2017] and LibriTTS [Heiga et al., 2019]. ... To evaluate the generalization capability of neural vocoders, the VCTK dataset [Yamagishi, 2012] is utilized for out-of-distribution evaluations |
| Dataset Splits | Yes | LJSpeech contains 13,100 clean speech clips from a single female speaker at 22.05 kHz, partitioned into 12,500/100/500 clips for training, validation, and testing, following the VITS repository. LibriTTS, sampled at 24 kHz, includes diverse recording conditions; we use the {train-clean-100, train-clean-360, train-other-500} subsets for training, dev-clean+dev-other for objective evaluation, and test-clean+test-other for subjective evaluation, as in [Lee et al., 2023]. |
| Hardware Specification | Yes | real-time factor (RTF), which is measured on a single Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., NCSN++, BigVGAN, HiFi-GAN, WaveNet) but does not provide specific version numbers for software dependencies used in their implementation. |
| Experiment Setup | Yes | For the weight hyperparameters in Eq. (22), λmel, λg and λfm are 0.1, 10.0 and 10.0, respectively. ... We train all models for 1 million steps, except for BigVGAN, which is trained for 5 million steps. ... For feature extraction, we employ a 1024-point FFT, a Hann window of length 1024, and a hop size of 256. For the LJSpeech dataset, we utilize 80 mel-bands with the upper-bound frequency fmax set to 8 kHz... For LibriTTS, the mel-bands and upper-bound frequency are set to 100 and 12 kHz, respectively. |
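The reported feature-extraction settings (1024-point FFT, Hann window of length 1024, hop size 256, 80 mel-bands with fmax = 8 kHz for LJSpeech at 22.05 kHz) can be sketched as a minimal NumPy log-mel extractor. This is an illustrative reconstruction, not the authors' code: the paper does not specify the mel scale or log floor, so the HTK-style mel mapping and the 1e-5 floor below are assumptions.

```python
import numpy as np

def log_mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80, fmax=8000):
    # Hann window of length n_fft (matches the paper's 1024-point setup).
    window = np.hanning(n_fft)
    # Frame the waveform and take the magnitude STFT.
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank up to fmax (HTK-style mel scale; an assumption).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log compression with a small floor (assumed; the paper gives no floor value).
    return np.log(mag @ fb.T + 1e-5)  # (n_frames, n_mels)

# One second of noise at 22.05 kHz -> 1 + (22050 - 1024)//256 = 83 frames of 80 mels.
mel = log_mel_spectrogram(np.random.randn(22050))
print(mel.shape)  # → (83, 80)
```

For LibriTTS the report's settings would correspond to `sr=24000`, `n_mels=100`, `fmax=12000` with the same FFT, window, and hop parameters.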