Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
Authors: Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University 2NVIDIA 3Tsinghua University 4Johns Hopkins University. Corresponding authors: {cchen1,hucky}@nvidia.com |
| Pseudocode | No | The paper describes the methodology including equations and a framework diagram, but it does not contain a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | No | The paper discusses various models like SALMONN, Qwen-Audio, Qwen2-Audio, Wav2vec2, and WavLM, but it does not provide any explicit statement about releasing the source code for the ALLD methodology described in this paper, nor does it include a link to a code repository. |
| Open Datasets | Yes | We used the NISQA (Mittag et al., 2021) that contains more than 97,000 human ratings... as well as the overall MOS. ...an ASR task on Common Voice (Ardila et al., 2019), speaker-related age and gender prediction tasks on Fair-Speech (Veliche et al., 2024), and a nonspeech automatic audio captioning task on (Drossos et al., 2020). We utilize LibriSpeech for data generation, with further details provided in Appendix D. |
| Dataset Splits | Yes | To formulate the training set for ALLD, we utilize the LLaMA3.1-70B-Instruct model to generate a total of 20k training examples for MOS prediction (10k) and A/B test (10k), which includes 2,322 speakers based on the largest subset NISQA TRAIN SIM. Meanwhile, NISQA TRAIN SIM with 938 speakers is constructed as a 5k in-domain test set for these two tasks. ...Half of the training examples are used for warm-up finetuning... For SWD tasks... Qwen2-Audio is trained with 30k examples... Then the model is evaluated on a 3k (500 × 6) test set |
| Hardware Specification | No | The paper mentions training and evaluating models like LLaMA3.1-70B-Instruct and Qwen2-Audio, but it does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper refers to using specific LLMs like LLaMA3.1-70B-Instruct and Qwen2-72B-Instruct, but it does not list any other software dependencies, libraries, or programming language versions with specific version numbers. |
| Experiment Setup | Yes | For ALLD, β is set as 0.4 to enhance the distillation, and the learning rate is set as 5e-6. ... For the second generation, we adjusted the temperature to 1.1 and set top-p to 0.9 to encourage greater diversity. |
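
The headline numbers in the Research Type row (MSE of 0.17 for MOS prediction, 98.6% A/B accuracy) correspond to two standard metrics. The sketch below shows how they could be computed; the function names and toy data are illustrative assumptions, not taken from the paper.

```python
def mos_mse(predicted, reference):
    """Mean squared error between predicted and human MOS ratings."""
    assert len(predicted) == len(reference)
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)

def ab_accuracy(choices, ground_truth):
    """Fraction of A/B comparisons where the model's pick matches the label."""
    assert len(choices) == len(ground_truth)
    return sum(c == g for c, g in zip(choices, ground_truth)) / len(choices)

# Toy data only; the paper reports MSE 0.17 and 98.6% on its real test sets.
print(mos_mse([3.2, 4.1, 2.5], [3.0, 4.0, 2.9]))
print(ab_accuracy(["A", "B", "A"], ["A", "B", "B"]))
```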
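
The Experiment Setup row reports sampling with temperature 1.1 and top-p (nucleus) 0.9 for the second generation pass. As a minimal sketch of what that decoding configuration does, here is a self-contained top-p sampler over a logit vector; this is a generic illustration of the technique, not the paper's implementation.

```python
import math
import random

def sample_top_p(logits, temperature=1.1, top_p=0.9, rng=None):
    """Sample a token index: temperature-scale, softmax, keep the smallest
    set of tokens whose cumulative probability reaches top_p, then draw
    from that renormalized nucleus."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: highest-probability tokens until cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Draw proportionally within the kept mass.
    mass = sum(probs[i] for i in keep)
    r = (rng or random).random() * mass
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]
```

With one strongly dominant logit the nucleus collapses to a single token, so the sampler becomes deterministic regardless of the random draw.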