GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

Authors: Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, Eng Siong Chng, Lei Xie

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability. Codes and demos are publicly available at https://yaoxunji.github.io/gen-se.
Researcher Affiliation | Academia | Jixun Yao (1), Hexin Liu (2), Chen Chen (2), Yuchen Hu (2), Eng Siong Chng (2), Lei Xie (1); 1. Northwestern Polytechnical University, 2. Nanyang Technological University
Pseudocode | No | The paper describes its methods using mathematical formulas and textual descriptions of processes, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code | Yes | Codes and demos are publicly available at https://yaoxunji.github.io/gen-se.
Open Datasets | Yes | Dataset: Following previous works (Wang et al., 2024c; Tai et al., 2024), the clean speech data consists of subsets from Libri-Light (Kahn et al., 2020), LibriTTS (Zen et al., 2019), VoiceBank (Veaux et al., 2013), and the deep noise suppression (DNS) challenge datasets (Reddy et al., 2021). The noise datasets used are WHAM! (Wichern et al., 2019) and DEMAND (Thiemann et al., 2013). Room impulse responses (RIRs) from openSLR26 and openSLR28 (Ko et al., 2017) are randomly selected to simulate reverberation.
Dataset Splits | No | All training data are generated on the fly, with an 80% probability of adding noise at a signal-to-noise ratio (SNR) ranging from -5 dB to 20 dB, and a 50% probability of convolving the speech with RIRs. The test set from the publicly available DNS Challenge (Reddy et al., 2021) is used to compare GenSE with existing state-of-the-art baseline systems. Following Tai et al. (2024), the CHiME-4 dataset (Du et al., 2016) serves as an additional test set to evaluate the generalization ability of the models. While specific test sets are mentioned, explicit training/validation splits (e.g., percentages or counts for the combined dataset) are not provided, as training data is 'generated on the fly'.
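The on-the-fly corruption pipeline described above (80% chance of additive noise at an SNR drawn from -5 to 20 dB, 50% chance of RIR convolution) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `add_noise`, `apply_rir`, and `corrupt` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the requested signal-to-noise ratio (in dB)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr_db/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def apply_rir(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve speech with a room impulse response, truncated to input length."""
    return np.convolve(speech, rir)[: len(speech)]

def corrupt(speech: np.ndarray, noise: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """One on-the-fly training example, following the probabilities in the paper."""
    out = speech
    if rng.random() < 0.5:                    # 50%: add reverberation
        out = apply_rir(out, rir)
    if rng.random() < 0.8:                    # 80%: add noise
        snr_db = rng.uniform(-5.0, 20.0)      # SNR uniform in [-5, 20] dB
        out = add_noise(out, noise[: len(out)], snr_db)
    return out
```

The paper does not specify the SNR sampling distribution; a uniform draw over the stated range is assumed here.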
Hardware Specification | Yes | The training is conducted using 2 A100 GPUs with a batch size of 128. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps.
Software Dependencies | No | The paper does not explicitly state any software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | The SimCodec model is trained for 50k steps in the first stage and 10k steps in the second stage. We employ the AdamW optimizer with a learning rate of 1e-4 to optimize the codec model. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps. We employ the AdamW optimizer with a learning rate of 1e-4 and 5k warmup steps, following the inverse square root learning schedule to adjust the learning rate dynamically during training.
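The learning-rate schedule above (peak 1e-4, 5k warmup steps, then inverse-square-root decay) can be sketched as below. The exact formulation is not given in the paper; this follows the common Transformer-style variant with linear warmup, anchored so the rate equals the peak at the end of warmup.

```python
import math

PEAK_LR = 1e-4        # learning rate stated in the paper
WARMUP_STEPS = 5_000  # warmup steps stated in the paper

def inverse_sqrt_lr(step: int) -> float:
    """Learning rate at a given training step (1-indexed)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # After warmup, decay proportionally to 1/sqrt(step).
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)
```

With these values, the rate peaks at 1e-4 at step 5k and halves by step 20k (since sqrt(5000/20000) = 0.5).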