ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering
Authors: Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental findings reveal that our model outperforms baselines in terms of accuracy and delivers more stable results under both greedy and sampling-based decoding strategies. Experimental results, using comparable model configurations and 960 hours of speech data from LibriSpeech (Panayotov et al. 2015) as a training set, demonstrate the superiority of ELLA-V. We further conducted ablation experiments to investigate the effects of our proposed modifications. We present the evaluation results in Table 2. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, Shanghai, China 2Byte Dance Inc., USA 3Microsoft, One Microsoft Way, Redmond, USA |
| Pseudocode | No | The paper describes the methodology in narrative text and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Demo & Code https://ereboas.github.io/ELLAV/ |
| Open Datasets | Yes | We trained ELLA-V using the publicly available LibriSpeech (Panayotov et al. 2015) 960h training dataset. The open-source 24 kHz checkpoint of EnCodec (Défossez et al. 2023) was used as the codec to generate discrete acoustic tokens. YourTTS (Casanova et al. 2022) was trained on the 585-hour LibriSpeech subset LibriTTS (Zen et al. 2019). |
| Dataset Splits | Yes | We trained ELLA-V using the publicly available LibriSpeech (Panayotov et al. 2015) 960h training dataset. For the zero-shot TTS continuation task, we adhered to methodologies established by previous works (Wang et al. 2023a,c), selecting examples ranging from 4 to 10 seconds from the LibriSpeech test-clean dataset as our test set. For the zero-shot TTS cross-speaker task, we designed a hard-case set comprising 100 hard sentences. |
| Hardware Specification | Yes | All models were trained in parallel using 8 NVIDIA Tesla V100 GPUs with a batch size of 16384 tokens for GAR and 12288 tokens for NAR per GPU, respectively, training for a total of 320k steps. |
| Software Dependencies | No | The paper mentions the 'Montreal Forced Aligner (MFA)', 'EnCodec', and the 'AdamW optimizer', but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Training Configuration: For both GAR and NAR models, we stacked 12 Transformer decoder layers with an embedding dimension of 1024, a hidden state dimension of 1024, and a feed-forward layer dimension of 4096. All models were trained in parallel using 8 NVIDIA Tesla V100 GPUs with a batch size of 16384 tokens for GAR and 12288 tokens for NAR per GPU, respectively, training for a total of 320k steps. We used the AdamW optimizer with β1 = 0.9, β2 = 0.999, ε = 10⁻⁹. We employed an inverse-sqrt learning rate scheduler with warm-up. For the first 32000 updates, we linearly increased the learning rate from 10⁻⁷ to a peak of 5×10⁻⁴. The weight decay was 0.01. |
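The learning-rate schedule quoted above (linear warm-up from 10⁻⁷ to a peak of 5×10⁻⁴ over the first 32000 updates, then inverse-sqrt decay) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the exact step-offset conventions are assumptions.

```python
def inverse_sqrt_lr(step: int,
                    peak_lr: float = 5e-4,
                    init_lr: float = 1e-7,
                    warmup_steps: int = 32000) -> float:
    """Warm-up + inverse-sqrt schedule as described in the paper's
    training configuration (defaults match the reported values)."""
    if step < warmup_steps:
        # Linear warm-up: 1e-7 -> 5e-4 over the first 32000 updates.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warm-up: lr decays proportionally to 1/sqrt(step),
    # anchored so that lr(warmup_steps) == peak_lr.
    return peak_lr * (warmup_steps / step) ** 0.5
```

For example, at step 128000 (4× the warm-up length) the rate has halved to 2.5×10⁻⁴.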