Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
Authors: Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University 2NVIDIA 3Tsinghua University 4Johns Hopkins University. Corresponding authors: {cchen1,hucky}@nvidia.com |
| Pseudocode | No | The paper describes the methodology including equations and a framework diagram, but it does not contain a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | No | The paper discusses various models like SALMONN, Qwen-Audio, Qwen2-Audio, Wav2vec2, and WavLM, but it does not provide any explicit statement about releasing the source code for the ALLD methodology described in this paper, nor does it include a link to a code repository. |
| Open Datasets | Yes | We used the NISQA (Mittag et al., 2021) that contains more than 97,000 human ratings... as well as the overall MOS. ...an ASR task on Common Voice (Ardila et al., 2019), speaker-related age and gender prediction tasks on Fair-Speech (Veliche et al., 2024), and a nonspeech automatic audio captioning task on (Drossos et al., 2020). We utilize LibriSpeech for data generation, with further details provided in Appendix D. |
| Dataset Splits | Yes | To formulate the training set for ALLD, we utilize the LLaMA3.1-70B-Instruct model to generate a total of 20k training examples for MOS prediction (10k) and A/B test (10k), which includes 2,322 speakers based on the largest subset NISQA TRAIN SIM. Meanwhile, NISQA TRAIN SIM with 938 speakers is constructed as a 5k in-domain test set for these two tasks. ...Half of the training examples are used for warm-up finetuning... For SWD tasks... Qwen2-Audio is trained with 30k examples... Then the model is evaluated on a 3k (500 × 6) test set |
| Hardware Specification | No | The paper mentions training and evaluating models like LLaMA3.1-70B-Instruct and Qwen2-Audio, but it does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper refers to using specific LLMs like LLaMA3.1-70B-Instruct and Qwen2-72B-Instruct, but it does not list any other software dependencies, libraries, or programming language versions with specific version numbers. |
| Experiment Setup | Yes | For ALLD, β is set as 0.4 to enhance the distillation, and the learning rate is set as 5e-6. ... For the second generation, we adjusted the temperature to 1.1 and set top-p to 0.9 to encourage greater diversity. |
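
The headline numbers in the Research Type row (MSE of 0.17 for MOS prediction, 98.6% A/B accuracy) correspond to two standard metrics. The sketch below shows how they could be computed; the function names and toy data are illustrative assumptions, not taken from the paper.

```python
def mos_mse(predicted, reference):
    """Mean squared error between predicted and human MOS ratings."""
    assert len(predicted) == len(reference)
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)

def ab_accuracy(choices, ground_truth):
    """Fraction of A/B comparisons where the model's pick matches the label."""
    assert len(choices) == len(ground_truth)
    return sum(c == g for c, g in zip(choices, ground_truth)) / len(choices)

# Toy data only; the paper reports MSE 0.17 and 98.6% on its real test sets.
print(mos_mse([3.2, 4.1, 2.5], [3.0, 4.0, 2.9]))
print(ab_accuracy(["A", "B", "A"], ["A", "B", "B"]))
```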
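
The Experiment Setup row reports sampling with temperature 1.1 and top-p (nucleus) 0.9 for the second generation pass. As a minimal sketch of what that decoding configuration does, here is a self-contained top-p sampler over a logit vector; this is a generic illustration of the technique, not the paper's implementation.

```python
import math
import random

def sample_top_p(logits, temperature=1.1, top_p=0.9, rng=None):
    """Sample a token index: temperature-scale, softmax, keep the smallest
    set of tokens whose cumulative probability reaches top_p, then draw
    from that renormalized nucleus."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: highest-probability tokens until cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Draw proportionally within the kept mass.
    mass = sum(probs[i] for i in keep)
    r = (rng or random).random() * mass
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]
```

With one strongly dominant logit the nucleus collapses to a single token, so the sampler becomes deterministic regardless of the random draw.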