GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
Authors: Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, Eng Siong Chng, Lei Xie
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability. Codes and demos are publicly available at https://yaoxunji.github.io/gen-se. |
| Researcher Affiliation | Academia | Jixun Yao1, Hexin Liu2, Chen Chen2, Yuchen Hu2, Eng Siong Chng2, Lei Xie1 1. Northwestern Polytechnical University 2. Nanyang Technological University |
| Pseudocode | No | The paper describes methods using mathematical formulas and textual descriptions of processes, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Codes and demos are publicly available at https://yaoxunji.github.io/gen-se. |
| Open Datasets | Yes | Dataset: Following previous works (Wang et al., 2024c; Tai et al., 2024), the clean speech data consists of subsets from Libri-Light (Kahn et al., 2020), LibriTTS (Zen et al., 2019), VoiceBank (Veaux et al., 2013), and the deep noise suppression (DNS) challenge datasets (Reddy et al., 2021). The noise datasets used are WHAM! (Wichern et al., 2019) and DEMAND (Thiemann et al., 2013). Room impulse responses (RIRs) from openSLR26 and openSLR28 (Ko et al., 2017) are randomly selected to simulate reverberation. |
| Dataset Splits | No | All training data are generated on the fly, with an 80% probability of adding noise at a signal-to-noise ratio (SNR) ranging from -5 dB to 20 dB, and a 50% probability of convolving the speech with RIRs. We use the test set from the publicly available DNS Challenge (Reddy et al., 2021) to compare GenSE with existing state-of-the-art baseline systems. Following Tai et al. (2024), we use the CHiME-4 dataset (Du et al., 2016) as an additional test set to evaluate the generalization ability of the models. While specific test sets are mentioned, explicit training/validation splits (e.g., percentages or counts for the combined dataset) are not provided, as training data is 'generated on the fly'. |
| Hardware Specification | Yes | The training is conducted using 2 A100 GPUs with a batch size of 128. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps. |
| Software Dependencies | No | The paper does not explicitly state any software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The SimCodec model is trained for 50k steps in the first stage and 10k steps in the second stage. We employ the AdamW optimizer with a learning rate of 1e-4 to optimize the codec model. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps. We employ the AdamW optimizer with a learning rate of 1e-4 and 5k warmup steps, following the inverse square root learning schedule to adjust the learning rate dynamically during training. |
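The on-the-fly corruption recipe reported under Dataset Splits (80% chance of additive noise at an SNR drawn from -5 to 20 dB, 50% chance of RIR convolution) can be sketched as follows. This is a minimal illustration, not the paper's actual data pipeline; the function names and array inputs are hypothetical.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Solve 10*log10(p_clean / (scale^2 * p_noise)) == snr_db for scale.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise[: len(clean)]

def simulate_sample(clean: np.ndarray, noise: np.ndarray, rir: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Generate one noisy training sample on the fly, per the reported probabilities."""
    out = clean.copy()
    if rng.random() < 0.5:                      # 50%: convolve with an RIR
        out = np.convolve(out, rir)[: len(clean)]
    if rng.random() < 0.8:                      # 80%: add noise at a random SNR
        out = mix_at_snr(out, noise, rng.uniform(-5.0, 20.0))
    return out
```

Because mixing happens at load time, each epoch sees a fresh draw of SNRs and reverberation conditions rather than a fixed pre-simulated set.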
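The learning-rate schedule named under Experiment Setup (peak 1e-4, 5k warmup steps, inverse square root decay) can be sketched as below. The paper only names the schedule, so the linear-warmup shape is an assumption; the common convention (e.g., the Transformer "Noam" schedule) warms up linearly and then decays proportionally to 1/sqrt(step).

```python
def inv_sqrt_lr(step: int, peak_lr: float = 1e-4, warmup: int = 5000) -> float:
    """Linear warmup to `peak_lr` over `warmup` steps, then 1/sqrt(step) decay."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup          # linear ramp-up
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)  # continuous at step == warmup
```

For example, the rate reaches its 1e-4 peak at step 5000 and has halved (to 5e-5) by step 20000, since sqrt(5000/20000) = 0.5.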