Learning Neural Vocoder from Range-Null Space Decomposition

Authors: Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Erwei Yin, Xiaodong Li, Chengshi Zheng

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that, while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods.
Researcher Affiliation | Collaboration | Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tencent AI Lab; Nanjing University; Defense Innovation Institute, Academy of Military Sciences (AMS); Tianjin Artificial Intelligence Innovation Center (TAIIC)
Pseudocode | No | The paper describes network architectures and processes in detail (e.g., in Section 3.3 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
Open Datasets | Yes | Two benchmarks are employed in this study, namely LJSpeech [Keith and Linda, 2017] and LibriTTS [Zen et al., 2019].
Dataset Splits | Yes | The LJSpeech dataset includes 13,100 clean speech clips by a single female speaker, at a sampling rate of 22.05 kHz. Following the division in the open-sourced VITS repository, {12500, 100, 500} clips are used for training, validation, and testing, respectively. The LibriTTS dataset covers diverse recording environments at a sampling rate of 24 kHz. Following the division in [Lee et al., 2023], {train-clean-100, train-clean-360, train-other-500} are used for model training. The dev-clean + dev-other subsets are used for objective comparisons, and test-clean + test-other for subjective evaluations.
Hardware Specification | Yes | Inference speed on CPU is evaluated on an Intel(R) Core(TM) i7-14700F; GPU inference is evaluated on an NVIDIA GeForce RTX 4060 Ti.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer but does not specify version numbers for key software components or libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | A batch size of 16, a segment size of 16384, and an initial learning rate of 2e-4 are used for training. The AdamW optimizer [Loshchilov and Hutter, 2017] is employed, with {β1 = 0.8, β2 = 0.99}. The generator and discriminator are each updated for 1 million steps.
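The LJSpeech division quoted in the Dataset Splits row can be sketched as a simple ordered slicing of the 13,100 clip IDs into {12500, 100, 500}. This is a minimal illustration of the split sizes only; the clip ID format and function name are illustrative, not taken from the VITS repository.

```python
# Sketch of the LJSpeech train/validation/test division described above:
# {12500, 100, 500} clips out of 13,100, following the VITS repository split.
def split_ljspeech(clip_ids):
    assert len(clip_ids) == 13_100, "LJSpeech has 13,100 clips"
    train = clip_ids[:12_500]
    valid = clip_ids[12_500:12_600]
    test = clip_ids[12_600:]
    return train, valid, test

# Illustrative clip IDs; real LJSpeech IDs look like "LJ001-0001".
clips = [f"clip-{i:05d}" for i in range(13_100)]
train, valid, test = split_ljspeech(clips)
print(len(train), len(valid), len(test))  # → 12500 100 500
```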
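For context on the optimizer settings in the Experiment Setup row, below is a textbook single-parameter sketch of one AdamW update [Loshchilov and Hutter, 2017] using the quoted hyperparameters (lr = 2e-4, β1 = 0.8, β2 = 0.99). The `eps` and `weight_decay` values are common defaults assumed for illustration; this is not the paper's training code.

```python
import math

# One AdamW update on a scalar parameter: decoupled weight decay plus
# bias-corrected first/second moment estimates.
def adamw_step(theta, grad, m, v, step, lr=2e-4, beta1=0.8, beta2=0.99,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    # Weight decay is applied directly to theta, decoupled from the gradient.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, step=1)
```

With β1 = 0.8 the first-moment average reacts faster than the common default of 0.9, a choice shared by many GAN-based vocoder recipes.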