Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Authors: Weiwei Lin, Chenhang He
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts. ... 5 EXPERIMENT RESULTS |
| Researcher Affiliation | Academia | Weiwei Lin Department of Electrical and Electronic Engineering The Hong Kong Polytechnic University Hong Kong SAR, China EMAIL Chenhang He Department of Computing The Hong Kong Polytechnic University Hong Kong SAR, China |
| Pseudocode | Yes | Algorithm 1 provides a detailed explanation of how encoder and decoder features are monotonically aligned and fed into the decoder to compute the negative log-likelihood. |
| Open Source Code | No | Additionally, we plan to release the code and pre-trained models soon. |
| Open Datasets | Yes | All models were trained on the Libri-Light corpus (Kahn et al., 2020), which contains 60k hours of unlabeled 16 kHz audiobook speech. ... We used the LibriSpeech test set for evaluation (Panayotov et al., 2015). |
| Dataset Splits | Yes | We used the LibriSpeech test set for evaluation (Panayotov et al., 2015). ... To study the effect of prompt length on speech synthesis, we divided the prompts into three groups: 3 seconds, 8 seconds, and 15 seconds. |
| Hardware Specification | No | For GMM-VAE training, we used 8 GPUs, with a total batch size of 680, and each segment length was 8960. ... For GMM-LM training, we used 8 GPUs with a total batch size of 240. |
| Software Dependencies | No | Our implementations are based on publicly available codebases (descript-audio-codec and SpeechBrain) and datasets. ... Standard deep learning libraries, such as PyTorch, already support this estimation, and backpropagation through mixture sampling is derived in (Graves, 2016). ... We used schedule-free AdamW (Defazio et al., 2024). |
| Experiment Setup | Yes | For GMM-VAE training, we used 8 GPUs, with a total batch size of 680, and each segment length was 8960. We used schedule-free AdamW (Defazio et al., 2024) with a learning rate of 0.001, training for a total of 1,000k steps. For GMM-LM training, we used 8 GPUs with a total batch size of 240. We used the same schedule-free AdamW with a learning rate of 0.01 and a gradient norm clip of 0.005, training for a total of 200k steps. ... For models using GMM-VAE features, the GMM-VAE model was trained with 3 mixture components, constrained by λ = 50. For the Mel-spectrogram model, we used a 120-dimensional Mel-spectrogram with a hop length of 240. |
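For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a configuration sketch. Only the numeric values come from the paper excerpt above; the dictionary layout and key names are our own illustration and do not reflect the authors' code, which has not yet been released.

```python
# Hedged sketch of the reported training setups, transcribed from the
# quoted excerpt. Key names are assumptions for illustration only.

gmm_vae_config = {
    "num_gpus": 8,
    "total_batch_size": 680,
    "segment_length": 8960,
    "optimizer": "schedule-free AdamW",  # Defazio et al., 2024
    "learning_rate": 1e-3,
    "train_steps": 1_000_000,
    "mixture_components": 3,
    "lambda_constraint": 50,
}

gmm_lm_config = {
    "num_gpus": 8,
    "total_batch_size": 240,
    "optimizer": "schedule-free AdamW",
    "learning_rate": 1e-2,
    "grad_norm_clip": 0.005,
    "train_steps": 200_000,
}

# The per-GPU batch size is implied rather than stated: total batch / GPUs.
per_gpu_batch = gmm_lm_config["total_batch_size"] // gmm_lm_config["num_gpus"]
print(per_gpu_batch)  # 30
```

The per-GPU figure is a derived quantity, not one reported in the paper; it assumes an even split of the total batch across the 8 GPUs.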