ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering
Authors: Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental findings reveal that our model outperforms baselines in terms of accuracy and delivers more stable results under both greedy and sampling-based decoding strategies. Experimental results, using comparable model configurations and 960 hours of speech data from LibriSpeech (Panayotov et al. 2015) as a training set, demonstrate the superiority of ELLA-V. We further conducted ablation experiments to investigate the effects of our proposed modifications. We present the evaluation results in Table 2. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, Shanghai, China 2Byte Dance Inc., USA 3Microsoft, One Microsoft Way, Redmond, USA |
| Pseudocode | No | The paper describes the methodology in narrative text and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Demo & Code https://ereboas.github.io/ELLAV/ |
| Open Datasets | Yes | We trained ELLA-V using the publicly available LibriSpeech (Panayotov et al. 2015) 960h training dataset. The open-source 24 kHz checkpoint of EnCodec (Défossez et al. 2023) was used as the codec to generate discrete acoustic tokens. YourTTS (Casanova et al. 2022) was trained on the 585-hour LibriSpeech subset LibriTTS (Zen et al. 2019). |
| Dataset Splits | Yes | We trained ELLA-V using the publicly available LibriSpeech (Panayotov et al. 2015) 960h training dataset. For the zero-shot TTS continuation task, we adhered to methodologies established by previous works (Wang et al. 2023a,c), selecting examples ranging from 4 to 10 seconds from the LibriSpeech test-clean dataset as our test set. For the zero-shot TTS cross-speaker task, we designed a hard-case set comprising 100 hard sentences. |
| Hardware Specification | Yes | All models were trained in parallel using 8 NVIDIA Tesla V100 GPUs with a batch size of 16384 tokens for GAR and 12288 tokens for NAR per GPU, respectively, training for a total of 320k steps. |
| Software Dependencies | No | The paper mentions the 'Montreal Forced Aligner (MFA)', 'EnCodec', and the 'AdamW optimizer', but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Training Configuration: For both GAR and NAR models, we stacked 12 Transformer decoder layers with an embedding dimension of 1024, a hidden state dimension of 1024, and a feed-forward layer dimension of 4096. All models were trained in parallel using 8 NVIDIA Tesla V100 GPUs with a batch size of 16384 tokens for GAR and 12288 tokens for NAR per GPU, respectively, training for a total of 320k steps. We used the AdamW optimizer with β1 = 0.9, β2 = 0.999, ε = 10⁻⁹. We employed an inverse-sqrt learning rate scheduler with warm-up. For the first 32000 updates, we linearly increased the learning rate from 10⁻⁷ to a peak of 5×10⁻⁴. The weight decay was 0.01. |
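The learning-rate schedule quoted above (linear warm-up from 10⁻⁷ to a peak of 5×10⁻⁴ over the first 32000 updates, then inverse-sqrt decay) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the exact step-offset conventions are assumptions.

```python
def inverse_sqrt_lr(step: int,
                    peak_lr: float = 5e-4,
                    init_lr: float = 1e-7,
                    warmup_steps: int = 32000) -> float:
    """Warm-up + inverse-sqrt schedule as described in the paper's
    training configuration (defaults match the reported values)."""
    if step < warmup_steps:
        # Linear warm-up: 1e-7 -> 5e-4 over the first 32000 updates.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warm-up: lr decays proportionally to 1/sqrt(step),
    # anchored so that lr(warmup_steps) == peak_lr.
    return peak_lr * (warmup_steps / step) ** 0.5
```

For example, at step 128000 (4× the warm-up length) the rate has halved to 2.5×10⁻⁴.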