ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn Schuller
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation. |
| Researcher Affiliation | Academia | Xiangheng He1, Junjie Chen3, Zixing Zhang4*, Björn Schuller1,2 1GLAM Group on Language, Audio, & Music, Imperial College London, UK 2CHI Chair of Health Informatics, MRI, Technical University of Munich, Germany 3Department of Computer Science, The University of Tokyo, Japan 4College of Computer Science and Electronic Engineering, Hunan University, China EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods through text and diagrams (Figure 2, Figure 3) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and demo: https://github.com/XianghengHee/ProsodyFM |
| Open Datasets | Yes | We perform the experiments in Table 1, Table 2, Table 4, and Figure 4 on the LibriTTS corpus (Zen et al. 2019). For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019). |
| Dataset Splits | Yes | We randomly split (speaker-independent) the audio samples in the train-clean-100, dev-clean, and test-clean sections of LibriTTS into 40421, 839, and 839 samples for our training, validation, and testing sets, respectively. The whole dataset has in total 71 hours of audio signals and 326 speakers. For the experiments in Table 3, we train the models on the VCTK corpus (Yamagishi, Veaux, and MacDonald 2019) with the same training set as in (Kim, Kong, and Son 2021). |
| Hardware Specification | Yes | Prosody FM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4. |
| Software Dependencies | No | For the Phrase Break Predictor, we fine-tune T5 (Ni et al. 2022) independent from ProsodyFM using LoRA (Hu et al. 2022) with 16 ranks... For the Text-Pitch Aligner, we initialize BERT with pretrained weights... For the Pitch Processor, we use Praat (Boersma 2001)... For a fair comparison, we utilize the same model architecture and hyperparameters as Matcha-TTS (Mehta et al. 2024)... No specific software version numbers (e.g., Python, PyTorch) are provided for these tools, only their originating papers/versions. |
| Experiment Setup | Yes | ProsodyFM and its ablated variants in Table 4 are trained for 350 epochs on an NVIDIA A100 GPU with batch size 64 and learning rate 1e-4. |
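The Dataset Splits row describes a random speaker-independent split (no speaker appears in more than one of the 40421/839/839 LibriTTS partitions). The paper does not publish the split code, so the sketch below is only a hypothetical illustration of that procedure: all function names, the `"speaker"` key, and the greedy quota-filling heuristic are assumptions, not the authors' implementation.

```python
import random

def speaker_independent_split(samples, train_n, val_n, test_n, seed=0):
    """Sketch of a speaker-independent random split: speakers are
    shuffled and assigned wholesale to train/val/test, so no speaker
    leaks across partitions. Sizes are approximate targets, since a
    speaker's utterances always stay together."""
    rng = random.Random(seed)
    speakers = sorted({s["speaker"] for s in samples})
    rng.shuffle(speakers)
    splits = {"train": [], "val": [], "test": []}
    quotas = {"train": train_n, "val": val_n, "test": test_n}
    for spk in speakers:
        utts = [s for s in samples if s["speaker"] == spk]
        # Assign the whole speaker to the split with the most room left.
        target = max(quotas, key=lambda k: quotas[k] - len(splits[k]))
        splits[target].extend(utts)
    return splits
```

A split produced this way can be verified by checking that the speaker sets of the three partitions are pairwise disjoint, which is the property "speaker-independent" guarantees for the generalization experiments on unseen speakers.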