Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation
Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M Khapra
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. ... We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems. |
| Researcher Affiliation | Collaboration | ¹AI4Bharat, Indian Institute of Technology Madras; ²Gan.AI. |
| Pseudocode | No | The paper describes methods and calculations (e.g., in Section 5, formula for SM) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO." but provides no link to, or explicit statement about, the release of source code for the methodology described in the paper. |
| Open Datasets | Yes | We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems. The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO. |
| Dataset Splits | No | The paper mentions training TTS systems "on the train-test splits" in Section 3.3, but does not provide specific details on the percentages, counts, or methodology for these splits, nor for any splits related to the Mango human rating dataset itself. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions using 'TorchAudio-Squim' for objective metrics and various TTS models, but does not provide specific version numbers for these tools or for any other key software dependencies such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | We train FastSpeech2 (FS2) (Ren et al., 2021) with HiFi-GAN v1 (Kong et al., 2020) and VITS (Kim et al., 2021) from scratch on the train-test splits using hyper-parameters suggested in a recent study (Kumar et al., 2023b). We finetune StyleTTS2 (ST2) (Li et al., 2023) from the LibriTTS checkpoint. We finetune XTTSv2 (Coqui AI, 2023) starting from the multilingual checkpoint with the hyper-parameters from their original implementations on the same splits described in Section 3.3. ... We analytically derive a MUSHRA naturalness score for raters using an intuitive formula with weights (provided in Appendix A.6) for the different dimensions listed above. ... The MUSHRA score (S_M) for a system is given by S_M = (L + VQ + R)/3 - min(MP, 15) - 5·min(SP, 7) - 10·US - 5·DA - 5·WS - 25·SEF - 5 |
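The score formula quoted in the last row can be expressed as a short function. This is a minimal sketch, assuming the quoted abbreviations (L, VQ, R for quality dimensions averaged in the first term; MP, SP, US, DA, WS, SEF for penalty counts) map directly to function arguments, and assuming the grouping (L + VQ + R)/3 for the averaged term; the exact dimension definitions and weights live in the paper's Appendix A.6.

```python
def mushra_score(L, VQ, R, MP, SP, US, DA, WS, SEF):
    """Sketch of the analytically derived MUSHRA naturalness score S_M.

    Implements the quoted formula:
        S_M = (L + VQ + R)/3 - min(MP, 15) - 5*min(SP, 7)
              - 10*US - 5*DA - 5*WS - 25*SEF - 5
    Argument names follow the paper's abbreviations; their precise
    meanings (and the grouping of the first term) are assumptions here.
    """
    return ((L + VQ + R) / 3      # averaged quality dimensions
            - min(MP, 15)         # mispronunciation penalty, capped at 15
            - 5 * min(SP, 7)      # capped penalty, weight 5
            - 10 * US             # weight-10 penalty
            - 5 * DA              # weight-5 penalty
            - 5 * WS              # weight-5 penalty
            - 25 * SEF            # heavy weight-25 penalty
            - 5)                  # constant offset, as quoted
```

For example, a clip rated 90 on all three quality dimensions with no penalty flags would score 90 - 5 = 85 under this reading of the formula.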