Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M Khapra

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. ... We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.
Researcher Affiliation | Collaboration | (1) AI4Bharat, Indian Institute of Technology Madras; (2) Gan.AI.
Pseudocode | No | The paper describes methods and calculations (e.g., the formula for SM in Section 5) but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO." but provides no link or explicit statement releasing source code for the methodology described in the paper.
Open Datasets | Yes | We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems. The dataset is publicly available at https://huggingface.co/datasets/ai4bharat/MANGO.
Dataset Splits | No | The paper mentions training TTS systems "on the train-test splits" in Section 3.3, but does not provide specific percentages, counts, or methodology for these splits, nor for any splits of the MANGO human-rating dataset itself.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments or training the models.
Software Dependencies | No | The paper mentions using TorchAudio-SQUIM for objective metrics and various TTS models, but does not provide specific version numbers for these tools or for other key software dependencies such as PyTorch or TensorFlow.
Experiment Setup | Yes | We train FastSpeech2 (FS2) (Ren et al., 2021) with HiFi-GAN v1 (Kong et al., 2020) and VITS (Kim et al., 2021) from scratch on the train-test splits using hyper-parameters suggested in a recent study (Kumar et al., 2023b). We finetune StyleTTS2 (ST2) (Li et al., 2023) from the LibriTTS checkpoint. We finetune XTTSv2 (Coqui AI, 2023) starting from the multilingual checkpoint with the hyper-parameters from their original implementations on the same splits described in Section 3.3. ... We analytically derive a MUSHRA naturalness score for raters using an intuitive formula with weights (provided in Appendix A.6) for the dimensions listed above. ... The MUSHRA score (SM) for a system is given by: SM = (L + VQ + R)/3 - min(MP, 15) - 5 * min(SP, 7) - 10 * US - 5 * DA - 5 * WS - 25 * SEF - 5
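The quoted scoring formula can be sketched as a small helper. This is a minimal illustration, not the paper's implementation: it assumes L, VQ, and R are 100-point dimension ratings averaged before the penalties are subtracted, and the meanings of the penalty terms (MP, SP, US, DA, WS, SEF) are guesses — the actual definitions and weights live in the paper's Appendix A.6.

```python
def mushra_naturalness_score(L, VQ, R, MP=0, SP=0, US=0, DA=0, WS=0, SEF=0):
    """Hypothetical sketch of the derived MUSHRA score SM.

    Assumptions (not confirmed by the excerpt):
      L, VQ, R : 100-point ratings for three quality dimensions
      MP, SP   : penalty counts, capped at 15 and 7 respectively
      US, DA, WS, SEF : further penalty counts/flags from the rubric
    The grouping (L + VQ + R) / 3 and the weights follow the quoted
    formula; only its term order and coefficients are taken as given.
    """
    base = (L + VQ + R) / 3
    penalty = (min(MP, 15) + 5 * min(SP, 7) + 10 * US
               + 5 * DA + 5 * WS + 25 * SEF + 5)
    return base - penalty
```

For example, a sample with perfect 100-point ratings and no penalty events would score (100 + 100 + 100)/3 - 5 = 95 under this reading, since the formula ends with a constant 5-point deduction.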