HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

Authors: Yuto Nishimura, Takumi Hirose, Masanari Ohi, Hideki Nakayama, Nakamasa Inoue

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we demonstrated the effectiveness of our approaches by applying our post-training framework to VALL-E. We reduced the frame rate to as low as 8 Hz, enabling stable minute-long speech synthesis in a single inference step. Audio samples, dataset, codes and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO.
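A quick sanity check of what the 8 Hz frame rate buys: the sketch below compares per-quantizer frame counts for a 180 s clip against the 48 Hz rate mentioned in the dataset-splits row. The helper name `num_frames` is ours, not from the paper.

```python
# Back-of-the-envelope token-sequence lengths per quantizer level.
# 48 Hz is the baseline frame rate referenced in the dataset-splits row;
# 8 Hz is the reduced rate reported by HALL-E.
def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames (tokens per quantizer) for a clip."""
    return int(duration_s * frame_rate_hz)

baseline = num_frames(180, 48)   # frames at 48 Hz
reduced = num_frames(180, 8)     # frames at 8 Hz
print(baseline, reduced, baseline // reduced)
```

The 6x shorter sequence is what makes a single autoregressive inference pass over a 180 s utterance tractable.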
Researcher Affiliation | Academia | Yuto Nishimura1,2, Takumi Hirose2, Masanari Ohi2, Hideki Nakayama1, Nakamasa Inoue2; 1The University of Tokyo, 2Institute of Science Tokyo. EMAIL
Pseudocode | No | The paper describes the proposed methods (MReQ and HALL-E) using mathematical formulations and descriptive text, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Audio samples, dataset, codes and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO.
Open Datasets | Yes | Furthermore, to promote TTS research, we create MinutesSpeech, a new benchmark dataset consisting of 40k hours of filtered speech data for training and evaluating speech synthesis ranging from 3 s up to 180 s. In experiments, we demonstrated the effectiveness of our approaches by applying our post-training framework to VALL-E. Audio samples, dataset, codes and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO.
Dataset Splits | Yes | We provide two subsets for benchmarking: MinutesSpeech-90s and MinutesSpeech-180s, consisting of speech segments ranging from 3 seconds to 90 seconds and 3 seconds to 180 seconds, respectively. All test audio files are under Creative Commons licenses. For each speech segment, we provide transcriptions created by two professional native transcribers. ... We also provide a training set consisting of 40,000 hours of audio data. ... For training, we used the MinutesSpeech training set, specifically train-90s and -180s for HALL-E, and train-28s, -54s, -90s, and -180s for VALL-E. The train-28s and -54s are designed for 48 Hz to match the token length in train-90s and -180s at 8 Hz (see Table 2). For evaluation, MinutesSpeech test-90s, test-180s, and the LibriSpeech test-clean set are used. The minimum audio length was set to 4 s, while the maximum audio lengths were set to 90 s, 180 s, and 35 s, respectively. The audio length for the prompt was consistently set to 3 s.
Hardware Specification | Yes | Encodec and SpeechTokenizer were pre-trained using the Adam optimizer for 100k iters with a batch size of 704 for Encodec and 674 for SpeechTokenizer on four H100 GPUs, and a learning rate of 9 × 10⁻⁴. MReQ post-training was performed on a single H100 GPU for 160k iters with a batch size of 160 and a learning rate of 3 × 10⁻⁴. VALL-E was trained using the AdamW optimizer for 100k iters on four H100 GPUs. ... Table 5 compares the real-time factor (RTF) of VALL-E and HALL-E, measured on an RTX 4090 GPU using the 4 s to 10 s segments from LibriSpeech.
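The real-time factor (RTF) mentioned for Table 5 is conventionally the wall-clock synthesis time divided by the duration of the generated audio, with values below 1.0 meaning faster than real time. The sketch below shows how such a measurement is typically taken; the `synthesis_fn` stand-in (a sleep call) is hypothetical, not the paper's model.

```python
import time

def real_time_factor(synthesis_fn, audio_duration_s: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1.0 means the system synthesizes faster than real time."""
    start = time.perf_counter()
    synthesis_fn()
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# Toy stand-in: "synthesizing" 10 s of audio takes 0.05 s of wall-clock time.
rtf = real_time_factor(lambda: time.sleep(0.05), audio_duration_s=10.0)
print(f"RTF = {rtf:.3f}")
```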
Software Dependencies | Yes | WER is calculated using the conformer-transducer. ... SIM is calculated using WavLM-TDNN. ... DNSMOS using ITU-T P.808 (ITU, 2018). ... Specifically, Distil-Whisper Large v3 was used for automatic transcriptions, and Pyannote was employed for speaker diarization. ... PESQ score estimated using TorchAudio-Squim (Kumar et al., 2023a).
Experiment Setup | Yes | Encodec and SpeechTokenizer were pre-trained using the Adam optimizer for 100k iters with a batch size of 704 for Encodec and 674 for SpeechTokenizer on four H100 GPUs, and a learning rate of 9 × 10⁻⁴. MReQ post-training was performed on a single H100 GPU for 160k iters with a batch size of 160 and a learning rate of 3 × 10⁻⁴. VALL-E was trained using the AdamW optimizer for 100k iters on four H100 GPUs. To fully utilize GPU memory, the batch size was adjusted based on the audio length of the training samples. A cosine annealing learning rate schedule was employed with an initial learning rate of 1 × 10⁻⁴. HALL-E was trained using the same settings, with VALL-E used as the pre-trained model. More details are provided in Appendix A.
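The setup only states "cosine annealing with an initial learning rate of 1 × 10⁻⁴ over 100k iterations"; a minimal sketch of the standard cosine-annealing schedule under those numbers follows. The minimum learning rate and the exact schedule variant (no warmup, no restarts) are assumptions on our part.

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       lr_init: float, lr_min: float = 0.0) -> float:
    """Standard cosine annealing from lr_init down to lr_min.
    lr_min and the warmup-free form are assumed, not from the paper."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos

total = 100_000  # 100k iterations, as in the reported setup
print(cosine_annealed_lr(0, total, 1e-4))       # starts at 1e-4
print(cosine_annealed_lr(total, total, 1e-4))   # decays to lr_min
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=100_000`.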