GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

Authors: Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, Eng Siong Chng, Lei Xie

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability. Codes and demos are publicly available at https://yaoxunji.github.io/gen-se.
Researcher Affiliation | Academia | Jixun Yao (1), Hexin Liu (2), Chen Chen (2), Yuchen Hu (2), Eng Siong Chng (2), Lei Xie (1); 1. Northwestern Polytechnical University, 2. Nanyang Technological University
Pseudocode | No | The paper describes its methods using mathematical formulas and textual descriptions of processes, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code | Yes | Codes and demos are publicly available at https://yaoxunji.github.io/gen-se.
Open Datasets | Yes | Dataset: Following previous works (Wang et al., 2024c; Tai et al., 2024), the clean speech data consists of subsets from Libri-Light (Kahn et al., 2020), LibriTTS (Zen et al., 2019), VoiceBank (Veaux et al., 2013), and the deep noise suppression (DNS) challenge datasets (Reddy et al., 2021). The noise datasets used are WHAM! (Wichern et al., 2019) and DEMAND (Thiemann et al., 2013). Room impulse responses (RIRs) from openSLR26 and openSLR28 (Ko et al., 2017) are randomly selected to simulate reverberation.
Dataset Splits | No | All training data are generated on the fly, with an 80% probability of adding noise at a signal-to-noise ratio (SNR) ranging from -5 dB to 20 dB, and a 50% probability of convolving the speech with RIRs. The test set from the publicly available DNS Challenge (Reddy et al., 2021) is used to compare GenSE with existing state-of-the-art baseline systems. Following Tai et al. (2024), the CHiME-4 dataset (Du et al., 2016) serves as an additional test set to evaluate the generalization ability of the models. While specific test sets are mentioned, explicit training/validation splits (e.g., percentages or counts for the combined dataset) are not provided, as training data is 'generated on the fly'.
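The on-the-fly corruption pipeline described above (80% chance of additive noise at an SNR drawn from -5 to 20 dB, 50% chance of RIR convolution) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `add_noise`, `apply_rir`, and `corrupt` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the requested signal-to-noise ratio (in dB)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr_db/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def apply_rir(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve speech with a room impulse response, truncated to input length."""
    return np.convolve(speech, rir)[: len(speech)]

def corrupt(speech: np.ndarray, noise: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """One on-the-fly training example, following the probabilities in the paper."""
    out = speech
    if rng.random() < 0.5:                    # 50%: add reverberation
        out = apply_rir(out, rir)
    if rng.random() < 0.8:                    # 80%: add noise
        snr_db = rng.uniform(-5.0, 20.0)      # SNR uniform in [-5, 20] dB
        out = add_noise(out, noise[: len(out)], snr_db)
    return out
```

The paper does not specify the SNR sampling distribution; a uniform draw over the stated range is assumed here.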
Hardware Specification | Yes | The training is conducted using 2 A100 GPUs with a batch size of 128. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps.
Software Dependencies | No | The paper does not explicitly state any software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | The SimCodec model is trained for 50k steps in the first stage and 10k steps in the second stage. We employ the AdamW optimizer with a learning rate of 1e-4 to optimize the codec model. For language model training, we use 8 A100 GPUs with a batch size of 256, training for 1 million steps. We employ the AdamW optimizer with a learning rate of 1e-4 and 5k warmup steps, following the inverse square root learning schedule to adjust the learning rate dynamically during training.
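The learning-rate schedule above (peak 1e-4, 5k warmup steps, then inverse-square-root decay) can be sketched as below. The exact formulation is not given in the paper; this follows the common Transformer-style variant with linear warmup, anchored so the rate equals the peak at the end of warmup.

```python
import math

PEAK_LR = 1e-4        # learning rate stated in the paper
WARMUP_STEPS = 5_000  # warmup steps stated in the paper

def inverse_sqrt_lr(step: int) -> float:
    """Learning rate at a given training step (1-indexed)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # After warmup, decay proportionally to 1/sqrt(step).
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)
```

With these values, the rate peaks at 1e-4 at step 5k and halves by step 20k (since sqrt(5000/20000) = 0.5).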