ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn Schuller

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement further brings ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation."
Researcher Affiliation: Academia. "Xiangheng He1, Junjie Chen3, Zixing Zhang4*, Björn Schuller1,2. 1GLAM Group on Language, Audio, & Music, Imperial College London, UK; 2CHI Chair of Health Informatics, MRI, Technical University of Munich, Germany; 3Department of Computer Science, The University of Tokyo, Japan; 4College of Computer Science and Electronic Engineering, Hunan University, China. EMAIL, EMAIL, EMAIL"
Pseudocode: No. The paper describes its methods through text and diagrams (Figure 2, Figure 3) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "Code and demo: https://github.com/XianghengHee/ProsodyFM"
Open Datasets: Yes. "We perform the experiments in Table 1, Table 2, Table 4, and Figure 4 on the LibriTTS corpus (Zen et al. 2019). For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019)."
Dataset Splits: Yes. "We randomly split (speaker-independent) the audio samples in the train-clean-100, dev-clean, and test-clean sections of LibriTTS into 40421, 839, and 839 samples for our training, validation, and testing sets, respectively. The whole dataset has in total 71 hours of audio signals and 326 speakers. For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019) with the same training set as in (Kim, Kong, and Son 2021)."
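The speaker-independent split described above can be sketched as follows. Only the reported facts (speaker-disjoint sets, sizes 40421/839/839, 326 speakers) come from the paper; the function name, fraction parameters, and grouping logic are illustrative assumptions.

```python
# Sketch of a speaker-independent split: group utterances by speaker, then
# assign whole speakers to train/val/test so no speaker crosses sets.
# The fractions and seed are hypothetical, not from the released code.
import random
from collections import defaultdict

def speaker_independent_split(utterances, speakers, seed=0,
                              val_frac=0.02, test_frac=0.02):
    """Split utterances so that no speaker appears in more than one set."""
    by_speaker = defaultdict(list)
    for utt, spk in zip(utterances, speakers):
        by_speaker[spk].append(utt)

    spk_ids = sorted(by_speaker)
    random.Random(seed).shuffle(spk_ids)

    n_val = max(1, int(len(spk_ids) * val_frac))
    n_test = max(1, int(len(spk_ids) * test_frac))
    val_spk = set(spk_ids[:n_val])
    test_spk = set(spk_ids[n_val:n_val + n_test])

    train, val, test = [], [], []
    for spk, utts in by_speaker.items():
        bucket = val if spk in val_spk else test if spk in test_spk else train
        bucket.extend(utts)
    return train, val, test
```

Assigning whole speakers (rather than individual utterances) to each set is what makes the test set genuinely unseen-speaker, which matters for the out-of-distribution claims above.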
Hardware Specification: Yes. "ProsodyFM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4."
Software Dependencies: No. "For the Phrase Break Predictor, we fine-tune T5 (Ni et al. 2022) independently from ProsodyFM using LoRA (Hu et al. 2022) with rank 16... For the Text-Pitch Aligner, we initialize BERT with pretrained weights... For the Pitch Processor, we use Praat (Boersma 2001)... For a fair comparison, we utilize the same model architecture and hyperparameters as Matcha-TTS (Mehta et al. 2024)..." No specific software version numbers (e.g., Python, PyTorch) are provided for these tools, only their originating papers.
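The rank-16 LoRA fine-tuning of T5 mentioned above could be configured roughly as below with Hugging Face `peft`. This is a configuration sketch, not the authors' code: the checkpoint name, alpha, dropout, and target modules are all assumptions; only the rank (16) is reported.

```python
# Hypothetical LoRA setup for the Phrase Break Predictor's T5 fine-tuning.
# r=16 matches the paper; every other value here is an assumed default.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("t5-base")  # checkpoint assumed

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank reported in the paper
    lora_alpha=32,              # assumed
    lora_dropout=0.1,           # assumed
    target_modules=["q", "v"],  # common choice for T5 attention; assumed
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```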
Experiment Setup: Yes. "ProsodyFM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4."
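The reported training setup can be collected into a minimal config; the values match the paper, but the dict layout itself is an illustrative assumption, not the released configuration format.

```python
# ProsodyFM training hyperparameters as reported in the paper.
# The structure of this dict is illustrative only.
train_cfg = {
    "epochs": 350,                  # reported
    "batch_size": 64,               # reported
    "learning_rate": 1e-4,          # reported
    "hardware": "1x NVIDIA A100",   # reported
}
```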