Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Authors: Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves performance competitive with existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain. |
| Researcher Affiliation | Collaboration | Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang (Tsinghua University; Zhipu.AI) |
| Pseudocode | No | The paper describes its methods and processes in prose but provides no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: https://github.com/THUDM/GLM-4-Voice |
| Open Datasets | Yes | We evaluate the content preservation and quality of generated speech by our speech decoder on LibriSpeech (Panayotov et al., 2015). We evaluated the performance of the text-to-token model on the VCTK (Yamagishi et al., 2019) dataset and the interleaved data using word error rate (WER) as the evaluation metric. We selected high-quality text datasets (FineWeb-Edu (Penedo et al., 2024) for English and Chinese-Fineweb-Edu (OpenCSG Community) for Chinese) to apply the previously mentioned synthesis process, generating a total of 600B tokens, with a 2:1 ratio of English to Chinese. We utilize four data types, each serving a specific purpose: ... Unsupervised speech data: Using the Emilia pipeline (He et al., 2024), we collected 700k hours of high-quality English and Chinese speech data... |
| Dataset Splits | Yes | We evaluate the finetuned Whisper on LibriSpeech (Panayotov et al., 2015) and AISHELL-1 (Bu et al., 2017), along with the original Whisper model. The results are shown in Table 10. Overall, all the tokenizers preserve enough semantic information to achieve accurate ASR performance. |
| Hardware Specification | Yes | To speed up the speech token generation process, we deployed the model using the SGLang framework (Zheng et al., 2024), achieving a generation speed of 25k tokens per second on a single H800 instance. |
| Software Dependencies | No | The paper mentions the use of frameworks and optimizers such as the SGLang framework (Zheng et al., 2024), AdamW, GPT-4, Whisper-large-v3, FunASR, and MeloTTS, but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | For training hyper-parameters, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs with a learning rate decaying from 5e-5 to 5e-6. We use the AdamW optimizer. Our speech-text pre-training stage processes a total of 1T tokens, with a fixed sampling of 30% text data, one epoch each of unsupervised speech and supervised speech-text data, and the remainder consisting of interleaved data. Throughout the pre-training stage, we maintain a sequence length of 8192 tokens and use a learning rate that linearly decays from 6e-5 to 6e-6. For the fine-tuning phase, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs on the fine-tuning dataset with a learning rate decaying from 5e-5 to 5e-6. We use the AdamW optimizer for both pre-training and fine-tuning stages. |
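The fine-tuning schedule quoted in the Experiment Setup row (AdamW, batch size 64, sequence length 4096, 10 epochs, learning rate decaying linearly from 5e-5 to 5e-6) can be sketched as a simple schedule function. This is an illustrative sketch only: the function name and the steps-per-epoch value are assumptions, not details from the paper.

```python
def lr_at(step: int, total_steps: int,
          lr_start: float = 5e-5, lr_end: float = 5e-6) -> float:
    """Learning rate after `step` optimizer steps, decaying linearly
    from lr_start to lr_end over total_steps (matching the quoted
    fine-tuning recipe: 5e-5 -> 5e-6)."""
    frac = min(step / total_steps, 1.0)  # clamp past the end of training
    return lr_start + frac * (lr_end - lr_start)

# 10 epochs; 1,000 steps per epoch is a placeholder, the real count
# depends on dataset size and the batch size of 64.
total = 10 * 1_000
print(lr_at(0, total))      # starts at 5e-5
print(lr_at(total, total))  # ends at 5e-6
```

The same function covers the pre-training schedule by swapping in `lr_start=6e-5, lr_end=6e-6`; frameworks such as PyTorch express this pattern via `torch.optim.lr_scheduler.LambdaLR` wrapped around an `AdamW` optimizer.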