ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn Schuller

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement further brings ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation."
Researcher Affiliation: Academia. "Xiangheng He1, Junjie Chen3, Zixing Zhang4*, Björn Schuller1,2. 1GLAM Group on Language, Audio, & Music, Imperial College London, UK; 2CHI Chair of Health Informatics, MRI, Technical University of Munich, Germany; 3Department of Computer Science, The University of Tokyo, Japan; 4College of Computer Science and Electronic Engineering, Hunan University, China. EMAIL, EMAIL, EMAIL"
Pseudocode: No. The paper describes its methods through text and diagrams (Figure 2, Figure 3) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "Code and demo: https://github.com/XianghengHee/ProsodyFM"
Open Datasets: Yes. "We perform the experiments in Table 1, Table 2, Table 4, and Figure 4 on the LibriTTS corpus (Zen et al. 2019). For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019)."
Dataset Splits: Yes. "We randomly split (speaker-independent) the audio samples in the train-clean-100, dev-clean, and test-clean sections of LibriTTS into 40421, 839, and 839 samples for our training, validation, and testing sets, respectively. The whole dataset has in total 71 hours of audio signals and 326 speakers. For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019) with the same training set as in (Kim, Kong, and Son 2021)."
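The speaker-independent split described above can be sketched as follows. Only the reported facts (speaker-disjoint sets, sizes 40421/839/839, 326 speakers) come from the paper; the function name, fraction parameters, and grouping logic are illustrative assumptions.

```python
# Sketch of a speaker-independent split: group utterances by speaker, then
# assign whole speakers to train/val/test so no speaker crosses sets.
# The fractions and seed are hypothetical, not from the released code.
import random
from collections import defaultdict

def speaker_independent_split(utterances, speakers, seed=0,
                              val_frac=0.02, test_frac=0.02):
    """Split utterances so that no speaker appears in more than one set."""
    by_speaker = defaultdict(list)
    for utt, spk in zip(utterances, speakers):
        by_speaker[spk].append(utt)

    spk_ids = sorted(by_speaker)
    random.Random(seed).shuffle(spk_ids)

    n_val = max(1, int(len(spk_ids) * val_frac))
    n_test = max(1, int(len(spk_ids) * test_frac))
    val_spk = set(spk_ids[:n_val])
    test_spk = set(spk_ids[n_val:n_val + n_test])

    train, val, test = [], [], []
    for spk, utts in by_speaker.items():
        bucket = val if spk in val_spk else test if spk in test_spk else train
        bucket.extend(utts)
    return train, val, test
```

Assigning whole speakers (rather than individual utterances) to each set is what makes the test set genuinely unseen-speaker, which matters for the out-of-distribution claims above.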
Hardware Specification: Yes. "ProsodyFM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4."
Software Dependencies: No. "For the Phrase Break Predictor, we fine-tune T5 (Ni et al. 2022) independently from ProsodyFM using LoRA (Hu et al. 2022) with rank 16... For the Text-Pitch Aligner, we initialize BERT with pretrained weights... For the Pitch Processor, we use Praat (Boersma 2001)... For a fair comparison, we utilize the same model architecture and hyperparameters as Matcha-TTS (Mehta et al. 2024)..." No specific software version numbers (e.g., Python, PyTorch) are provided for these tools, only their originating papers.
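The rank-16 LoRA fine-tuning of T5 mentioned above could be configured roughly as below with Hugging Face `peft`. This is a configuration sketch, not the authors' code: the checkpoint name, alpha, dropout, and target modules are all assumptions; only the rank (16) is reported.

```python
# Hypothetical LoRA setup for the Phrase Break Predictor's T5 fine-tuning.
# r=16 matches the paper; every other value here is an assumed default.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("t5-base")  # checkpoint assumed

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank reported in the paper
    lora_alpha=32,              # assumed
    lora_dropout=0.1,           # assumed
    target_modules=["q", "v"],  # common choice for T5 attention; assumed
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```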
Experiment Setup: Yes. "ProsodyFM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4."
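The reported training setup can be collected into a minimal config; the values match the paper, but the dict layout itself is an illustrative assumption, not the released configuration format.

```python
# ProsodyFM training hyperparameters as reported in the paper.
# The structure of this dict is illustrative only.
train_cfg = {
    "epochs": 350,                  # reported
    "batch_size": 64,               # reported
    "learning_rate": 1e-4,          # reported
    "hardware": "1x NVIDIA A100",   # reported
}
```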