Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration

Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address these challenges, we conduct a comprehensive series of experiments that explore various aspects, leading to the optimal LLM-based ASR system. We found that delicate designs are not necessary, while a clean setup with little task-specific design is competent. The models achieve strong performance on the Librispeech and Gigaspeech datasets, compared to both LLM-based models and non-LLM-based models.
Researcher Affiliation | Collaboration | (1) MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University; (2) Alibaba Group
Pseudocode | No | The paper describes the model architecture and training process using mathematical equations and textual descriptions (e.g., Equations 1-7 and Sections 2.3 and 3.3), but it does not contain any explicitly labeled pseudocode blocks or algorithms.
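The architecture the paper describes with equations (a speech encoder whose features pass through a simple linear projector into the embedding space of an LLM) can be sketched as below. This is a minimal illustration, not the authors' implementation: the module name, dimensions, and frame-concatenation downsampling factor are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Illustrative projector: concatenates every `downsample` consecutive
    speech-encoder frames and maps them into the LLM embedding space with
    a single linear layer. Dimensions here are placeholders."""

    def __init__(self, encoder_dim=1280, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(encoder_dim * downsample, llm_dim)

    def forward(self, feats):
        # feats: (batch, time, encoder_dim) from the speech encoder
        b, t, d = feats.shape
        t = t - t % self.downsample  # drop remainder frames
        feats = feats[:, :t, :].reshape(
            b, t // self.downsample, d * self.downsample
        )
        return self.proj(feats)  # (batch, time // downsample, llm_dim)
```

The projected features would then be concatenated with the text-prompt embeddings before being fed to the (typically frozen) LLM.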
Open Source Code | Yes | Codes & Checkpoints: https://github.com/X-LANCE/SLAM-LLM
Open Datasets | Yes | To evaluate the capabilities of the LLM-based ASR models, we use the most widely used benchmark for the ASR task, the standard Librispeech (Panayotov et al. 2015) benchmark with 960 hours of training data without any data augmentation or splicing. ... We also test our findings on a more diverse, noisy, and challenging dataset, the Gigaspeech (Chen et al. 2021) dataset.
Dataset Splits | Yes | We use the dev-other subset as the validation set and test-clean/test-other as the test sets, each of which contains 10 hours of speech. We train the model with Gigaspeech-M with 1,000 hours, select on the DEV set with 10 hours, and test on the TEST set with 40 hours.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types (e.g., Intel Xeon), or memory amounts used for the experiments. It only mentions training parameters like batch size.
Software Dependencies | No | The paper mentions optimizers like AdamW and various models (Whisper, LLaMA, Vicuna, HuBERT, WavLM), but it does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation (e.g., PyTorch 1.x, Python 3.x).
Experiment Setup | Yes | For the optimizing strategy, we use AdamW (Loshchilov and Hutter 2019) with a max learning rate of 1 × 10^-4 without a weight decay. For the learning rate scheduler, we conduct warmup at the first 1,000 steps and then keep the maximum learning rate for training all the time. The max training step is set to 100,000, but we will stop early if the loss on the validation set does not decrease. For the audio embedding provided by the Whisper family of models, we found that not padding would affect the performance. As a result, we pad the speech to 30 seconds for all Whisper models and the batch size is set to 4. For other models, the length of the input audio remains consistent with the original length in the temporal dimension, and the batch size is set to 6.
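The optimizer recipe quoted above (AdamW with a max learning rate of 1 × 10^-4, no weight decay, 1,000 linear warmup steps, then a constant learning rate) can be sketched in PyTorch as follows. This is a hedged reconstruction from the quoted text, not the authors' training script; the stand-in model is a placeholder.

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the trainable parameters

# AdamW with max LR 1e-4 and no weight decay, as stated in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)

WARMUP_STEPS = 1_000

def lr_lambda(step):
    # Linear warmup over the first 1,000 steps, then hold the max LR
    # constant for the remainder of training.
    return min((step + 1) / WARMUP_STEPS, 1.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer step; early stopping on validation loss (also described in the quote) would sit outside this snippet.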