Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M Khapra

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. ... We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.
Researcher Affiliation | Collaboration | (1) AI4Bharat, Indian Institute of Technology Madras; (2) Gan.AI.
Pseudocode | No | The paper describes methods and calculations (e.g., the formula for SM in Section 5) but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO." but provides no link or explicit statement releasing source code for the methodology described in the paper.
Open Datasets | Yes | We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems. The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO.
Dataset Splits | No | The paper mentions training TTS systems "on the train-test splits" in Section 3.3, but does not provide specific percentages, counts, or methodology for these splits, nor for any splits of the MANGO human-rating dataset itself.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments or training the models.
Software Dependencies | No | The paper mentions using TorchAudio-SQUIM for objective metrics and various TTS models, but does not provide specific version numbers for these tools or for other key software dependencies such as PyTorch or TensorFlow.
Experiment Setup | Yes | We train FastSpeech2 (FS2) (Ren et al., 2021) with HiFi-GAN v1 (Kong et al., 2020) and VITS (Kim et al., 2021) from scratch on the train-test splits using hyper-parameters suggested in a recent study (Kumar et al., 2023b). We finetune StyleTTS2 (ST2) (Li et al., 2023) from the LibriTTS checkpoint. We finetune XTTSv2 (Coqui AI, 2023) starting from the multilingual checkpoint with the hyper-parameters from their original implementations on the same splits described in Section 3.3. ... We analytically derive a MUSHRA naturalness score for raters using an intuitive formula with weights (provided in Appendix A.6) for the dimensions listed above. ... The MUSHRA score (SM) for a system is given by: SM = (L + VQ + R)/3 - min(MP, 15) - 5 * min(SP, 7) - 10 * US - 5 * DA - 5 * WS - 25 * SEF - 5
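The quoted scoring formula can be sketched as a small helper. This is a minimal illustration, not the paper's implementation: it assumes L, VQ, and R are 100-point dimension ratings averaged before the penalties are subtracted, and the meanings of the penalty terms (MP, SP, US, DA, WS, SEF) are guesses — the actual definitions and weights live in the paper's Appendix A.6.

```python
def mushra_naturalness_score(L, VQ, R, MP=0, SP=0, US=0, DA=0, WS=0, SEF=0):
    """Hypothetical sketch of the derived MUSHRA score SM.

    Assumptions (not confirmed by the excerpt):
      L, VQ, R : 100-point ratings for three quality dimensions
      MP, SP   : penalty counts, capped at 15 and 7 respectively
      US, DA, WS, SEF : further penalty counts/flags from the rubric
    The grouping (L + VQ + R) / 3 and the weights follow the quoted
    formula; only its term order and coefficients are taken as given.
    """
    base = (L + VQ + R) / 3
    penalty = (min(MP, 15) + 5 * min(SP, 7) + 10 * US
               + 5 * DA + 5 * WS + 25 * SEF + 5)
    return base - penalty
```

For example, a sample with perfect 100-point ratings and no penalty events would score (100 + 100 + 100)/3 - 5 = 95 under this reading, since the formula ends with a constant 5-point deduction.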