HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Authors: Yuto Nishimura, Takumi Hirose, Masanari Ohi, Hideki Nakayama, Nakamasa Inoue
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we demonstrated the effectiveness of our approaches by applying our post-training framework to VALL-E. We reduced the frame rate to as low as 8 Hz, enabling stable minute-long speech synthesis in a single inference step. Audio samples, dataset, code, and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO. |
| Researcher Affiliation | Academia | Yuto Nishimura¹,², Takumi Hirose², Masanari Ohi², Hideki Nakayama¹, Nakamasa Inoue² — ¹The University of Tokyo, ²Institute of Science Tokyo |
| Pseudocode | No | The paper describes the proposed methods (MReQ and HALL-E) using mathematical formulations and descriptive text, but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Audio samples, dataset, code, and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO. |
| Open Datasets | Yes | Furthermore, to promote TTS research, we create MinutesSpeech, a new benchmark dataset consisting of 40k hours of filtered speech data for training and evaluating speech synthesis ranging from 3s up to 180s. In experiments, we demonstrated the effectiveness of our approaches by applying our post-training framework to VALL-E. Audio samples, dataset, code, and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO. |
| Dataset Splits | Yes | We provide two subsets for benchmarking: MinutesSpeech-90s and MinutesSpeech-180s, consisting of speech segments ranging from 3 seconds to 90 seconds and 3 seconds to 180 seconds, respectively. All test audio files are under Creative Commons licenses. For each speech segment, we provide transcriptions created by two professional native transcribers. ... We also provide a training set consisting of 40,000 hours of audio data. ... For training, we used the MinutesSpeech training set, specifically train-90s and -180s for HALL-E, and train-28s, -54s, -90s, and -180s for VALL-E. The train-28s and -54s sets are designed for 48 Hz to match the token length of train-90s and -180s at 8 Hz (see Table 2). For evaluation, MinutesSpeech test-90s, test-180s, and the LibriSpeech test-clean set are used. The minimum audio length was set to 4s, while the maximum audio lengths were set to 90s, 180s, and 35s, respectively. The audio length for the prompt was consistently set to 3s. |
| Hardware Specification | Yes | EnCodec and SpeechTokenizer were pre-trained using the Adam optimizer for 100k iters with a batch size of 704 for EnCodec and 674 for SpeechTokenizer on four H100 GPUs, and a learning rate of 9 × 10⁻⁴. MReQ post-training was performed on a single H100 GPU for 160k iters with a batch size of 160 and a learning rate of 3 × 10⁻⁴. VALL-E was trained using the AdamW optimizer for 100k iters on four H100 GPUs. ... Table 5 compares the real-time factor (RTF) of VALL-E and HALL-E, measured on an RTX 4090 GPU using the 4s to 10s segments from LibriSpeech. |
| Software Dependencies | Yes | WER is calculated using the conformer-transducer. ... SIM is calculated using WavLM-TDNN. ... DNSMOS using ITU-T P.808 (ITU, 2018). ... Specifically, Whisper distil-large-v3 was used for automatic transcriptions, and Pyannote was employed for speaker diarization. ... PESQ score estimated using TorchAudio-Squim (Kumar et al., 2023a). |
| Experiment Setup | Yes | EnCodec and SpeechTokenizer were pre-trained using the Adam optimizer for 100k iters with a batch size of 704 for EnCodec and 674 for SpeechTokenizer on four H100 GPUs, and a learning rate of 9 × 10⁻⁴. MReQ post-training was performed on a single H100 GPU for 160k iters with a batch size of 160 and a learning rate of 3 × 10⁻⁴. VALL-E was trained using the AdamW optimizer for 100k iters on four H100 GPUs. To fully utilize GPU memory, the batch size was adjusted based on the audio length of the training samples. A cosine annealing learning rate schedule was employed with an initial learning rate of 1 × 10⁻⁴. HALL-E was trained using the same settings, with VALL-E used as the pre-trained model. More details are provided in Appendix A. |
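The experiment-setup row mentions a cosine annealing learning rate schedule with an initial learning rate of 1 × 10⁻⁴ over 100k iterations. A minimal pure-Python sketch of the standard cosine annealing formula is below; note the paper excerpt does not specify a minimum learning rate or warmup, so `lr_min = 0.0` and the absence of warmup are assumptions, and the function name is illustrative.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init=1e-4, lr_min=0.0):
    """Cosine-annealed learning rate at a given training step.

    Implements the standard schedule:
        lr(t) = lr_min + 0.5 * (lr_init - lr_min) * (1 + cos(pi * t / T))
    The LR starts at lr_init, decays smoothly, and reaches lr_min at step T.
    """
    t = min(step, total_steps)  # clamp so the LR stays at lr_min past T
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)

# Schedule over 100k iterations with the quoted initial LR of 1e-4.
total = 100_000
print(cosine_annealing_lr(0, total))            # start of training: 1e-4
print(cosine_annealing_lr(total // 2, total))   # midpoint: ~5e-5
print(cosine_annealing_lr(total, total))        # end of training: lr_min
```

In practice this corresponds to PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the total iteration count; the sketch above just makes the decay curve explicit.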