Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration
Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address these challenges, we conduct a comprehensive series of experiments that explore various aspects, leading to the optimal LLM-based ASR system. We found that delicate designs are not necessary: a clean setup with little task-specific design is competent. The models achieve strong performance on the Librispeech and Gigaspeech datasets, compared to both LLM-based models and non-LLM-based models. |
| Researcher Affiliation | Collaboration | 1MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University; 2Alibaba Group |
| Pseudocode | No | The paper describes the model architecture and training process using mathematical equations and textual descriptions (e.g., Equations 1-7 and sections 2.3 and 3.3), but it does not contain any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Codes & Checkpoints https://github.com/X-LANCE/SLAM-LLM |
| Open Datasets | Yes | To evaluate the capabilities of the LLM-based ASR models, we use the most widely used benchmark for the ASR task, the standard Librispeech (Panayotov et al. 2015) benchmark with 960 hours of training data without any data augmentation or splicing. ... We also test our findings on a more diverse, noisy, and challenging dataset, the Gigaspeech (Chen et al. 2021) dataset. |
| Dataset Splits | Yes | We use the dev-other subset as the validation set and test-clean/test-other as the test sets, each of which contains 10 hours of speech. We train the model with Gigaspeech-M with 1,000 hours, select on the DEV set with 10 hours, and test on the TEST set with 40 hours. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types (e.g., Intel Xeon), or memory amounts used for the experiments. It only mentions training parameters like batch size. |
| Software Dependencies | No | The paper mentions optimizers like AdamW and various models (Whisper, LLaMA, Vicuna, HuBERT, WavLM), but it does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | For the optimizing strategy, we use AdamW (Loshchilov and Hutter 2019) with a max learning rate of 1×10⁻⁴ without weight decay. For the learning rate scheduler, we conduct warmup for the first 1,000 steps and then keep the maximum learning rate for the rest of training. The max training step is set to 100,000, but we stop early if the loss on the validation set does not decrease. For the audio embedding provided by the Whisper family of models, we found that not padding degrades performance. As a result, we pad the speech to 30 seconds for all Whisper models and set the batch size to 4. For other models, the length of the input audio remains consistent with the original length in the temporal dimension, and the batch size is set to 6. |
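The optimizer configuration quoted in the Experiment Setup row (AdamW, max LR 1×10⁻⁴, no weight decay, 1,000 warmup steps, then a constant rate) can be sketched as follows. This is a minimal illustration, not the authors' training script; the toy `Linear` module merely stands in for the trainable projector parameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Toy module standing in for the trainable parameters (e.g., the projector).
model = torch.nn.Linear(1280, 4096)

# Hyperparameters as reported in the paper.
MAX_LR = 1e-4
WARMUP_STEPS = 1_000

optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=0.0)

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 1,000 steps, then hold the max LR.
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer step; after 1,000 steps the learning rate stays flat at 1×10⁻⁴, matching the "keep the maximum learning rate for the rest of training" description.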
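The same row notes that speech is padded to 30 seconds for all Whisper models. A hedged sketch of that preprocessing step, assuming mono 16 kHz waveforms (Whisper's input rate) and a hypothetical helper name `pad_to_30s`:

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
TARGET_SECONDS = 30    # fixed input window, per the paper's setup

def pad_to_30s(wav: torch.Tensor) -> torch.Tensor:
    """Right-pad (or truncate) a mono waveform to exactly 30 seconds,
    matching the fixed-length input used for the Whisper encoders."""
    target = SAMPLE_RATE * TARGET_SECONDS
    if wav.shape[-1] >= target:
        return wav[..., :target]
    return F.pad(wav, (0, target - wav.shape[-1]))
```

For the non-Whisper encoders (HuBERT, WavLM), the quoted setup instead keeps the original audio length, so no such padding would be applied.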