Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Authors: Weiwei Lin, Chenhang He
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts. ... 5 EXPERIMENT RESULTS |
| Researcher Affiliation | Academia | Weiwei Lin Department of Electrical and Electronic Engineering The Hong Kong Polytechnic University Hong Kong SAR, China EMAIL Chenhang He Department of Computing The Hong Kong Polytechnic University Hong Kong SAR, China |
| Pseudocode | Yes | Algorithm 1 provides a detailed explanation of how encoder and decoder features are monotonically aligned and fed into the decoder to compute the negative log-likelihood. |
| Open Source Code | No | Additionally, we plan to release the code and pre-trained models soon. |
| Open Datasets | Yes | All models were trained on the Libri-Light corpus (Kahn et al., 2020), which contains 60k hours of unlabeled 16 kHz audiobook speech. ... We used the LibriSpeech test set for evaluation (Panayotov et al., 2015). |
| Dataset Splits | Yes | We used the LibriSpeech test set for evaluation (Panayotov et al., 2015). ... To study the effect of prompt length on speech synthesis, we divided the prompts into three groups: 3 seconds, 8 seconds, and 15 seconds. |
| Hardware Specification | No | For GMM-VAE training, we used 8 GPUs, with a total batch size of 680, and each segment length was 8960. ... For GMM-LM training, we used 8 GPUs with a total batch size of 240. |
| Software Dependencies | No | Our implementations are based on publicly available codebases (descript-audio-codec and SpeechBrain) and datasets. ... Standard deep learning libraries, such as PyTorch, already support this estimation, and backpropagation through mixture sampling is derived in (Graves, 2016). ... We used schedule-free AdamW (Defazio et al., 2024). |
| Experiment Setup | Yes | For GMM-VAE training, we used 8 GPUs, with a total batch size of 680, and each segment length was 8960. We used schedule-free AdamW (Defazio et al., 2024) with a learning rate of 0.001, training for a total of 1,000k steps. For GMM-LM training, we used 8 GPUs with a total batch size of 240. We used the same schedule-free AdamW with a learning rate of 0.01 and a gradient norm clip of 0.005, training for a total of 200k steps. ... For models using GMM-VAE features, the GMM-VAE model was trained with 3 mixture components, constrained by λ = 50. For the Mel-spectrogram model, we used a 120-dimensional Mel-spectrogram with a hop length of 240. |
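For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a configuration sketch. Only the numeric values come from the paper excerpt above; the dictionary layout and key names are our own illustration and do not reflect the authors' code, which has not yet been released.

```python
# Hedged sketch of the reported training setups, transcribed from the
# quoted excerpt. Key names are assumptions for illustration only.

gmm_vae_config = {
    "num_gpus": 8,
    "total_batch_size": 680,
    "segment_length": 8960,
    "optimizer": "schedule-free AdamW",  # Defazio et al., 2024
    "learning_rate": 1e-3,
    "train_steps": 1_000_000,
    "mixture_components": 3,
    "lambda_constraint": 50,
}

gmm_lm_config = {
    "num_gpus": 8,
    "total_batch_size": 240,
    "optimizer": "schedule-free AdamW",
    "learning_rate": 1e-2,
    "grad_norm_clip": 0.005,
    "train_steps": 200_000,
}

# The per-GPU batch size is implied rather than stated: total batch / GPUs.
per_gpu_batch = gmm_lm_config["total_batch_size"] // gmm_lm_config["num_gpus"]
print(per_gpu_batch)  # 30
```

The per-GPU figure is a derived quantity, not one reported in the paper; it assumes an even split of the total batch across the 8 GPUs.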