Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Authors: Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves performance competitive with existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain. |
| Researcher Affiliation | Collaboration | Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang (Tsinghua University; Zhipu.AI) |
| Pseudocode | No | The paper describes its methods and processes in prose but provides no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: https://github.com/THUDM/GLM-4-Voice |
| Open Datasets | Yes | We evaluate the content preservation and quality of generated speech by our speech decoder on LibriSpeech (Panayotov et al., 2015). We evaluated the performance of the text-to-token model on the VCTK (Yamagishi et al., 2019) dataset and the interleaved data using word error rate (WER) as the evaluation metric. We selected high-quality text datasets (FineWeb-Edu (Penedo et al., 2024) for English and Chinese-Fineweb-Edu (OpenCSG Community) for Chinese) to apply the previously mentioned synthesis process, generating a total of 600B tokens, with a 2:1 ratio of English to Chinese. We utilize four data types, each serving a specific purpose: ... Unsupervised speech data: Using the Emilia pipeline (He et al., 2024), we collected 700k hours of high-quality English and Chinese speech data... |
| Dataset Splits | Yes | We evaluate the finetuned Whisper on LibriSpeech (Panayotov et al., 2015) and AISHELL-1 (Bu et al., 2017), along with the original Whisper model. The results are shown in Table 10. Overall, all the tokenizers preserve enough semantic information to achieve accurate ASR performance. |
| Hardware Specification | Yes | To speed up the speech token generation process, we deployed the model using the SGLang framework (Zheng et al., 2024), achieving a generation speed of 25k tokens per second on a single H800 instance. |
| Software Dependencies | No | The paper mentions the use of frameworks and optimizers such as the SGLang framework (Zheng et al., 2024), AdamW, GPT-4, Whisper-large-v3, FunASR, and MeloTTS, but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | For training hyper-parameters, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs with a learning rate decaying from 5e-5 to 5e-6. We use the AdamW optimizer. Our speech-text pre-training stage processes a total of 1T tokens, with a fixed sampling of 30% text data, one epoch each of unsupervised speech and supervised speech-text data, and the remainder consisting of interleaved data. Throughout the pre-training stage, we maintain a sequence length of 8192 tokens and use a learning rate that linearly decays from 6e-5 to 6e-6. For the fine-tuning phase, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs on the fine-tuning dataset with a learning rate decaying from 5e-5 to 5e-6. We use the AdamW optimizer for both pre-training and fine-tuning stages. |
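The fine-tuning schedule quoted in the Experiment Setup row (AdamW, batch size 64, sequence length 4096, 10 epochs, learning rate decaying linearly from 5e-5 to 5e-6) can be sketched as a simple schedule function. This is an illustrative sketch only: the function name and the steps-per-epoch value are assumptions, not details from the paper.

```python
def lr_at(step: int, total_steps: int,
          lr_start: float = 5e-5, lr_end: float = 5e-6) -> float:
    """Learning rate after `step` optimizer steps, decaying linearly
    from lr_start to lr_end over total_steps (matching the quoted
    fine-tuning recipe: 5e-5 -> 5e-6)."""
    frac = min(step / total_steps, 1.0)  # clamp past the end of training
    return lr_start + frac * (lr_end - lr_start)

# 10 epochs; 1,000 steps per epoch is a placeholder, the real count
# depends on dataset size and the batch size of 64.
total = 10 * 1_000
print(lr_at(0, total))      # starts at 5e-5
print(lr_at(total, total))  # ends at 5e-6
```

The same function covers the pre-training schedule by swapping in `lr_start=6e-5, lr_end=6e-6`; frameworks such as PyTorch express this pattern via `torch.optim.lr_scheduler.LambdaLR` wrapped around an `AdamW` optimizer.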