On The Landscape of Spoken Language Models: A Comprehensive Survey

Authors: Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-yi Lee, Karen Livescu, Shinji Watanabe

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. The survey categorizes work in this area by model architecture, training, and evaluation choices, and describes key challenges and directions for future work. |
| Researcher Affiliation | Academia | 1 Carnegie Mellon University, USA; 2 National Taiwan University, Taiwan; 3 Toyota Technological Institute at Chicago, USA; 4 Hebrew University of Jerusalem, Israel; 5 ENS PSL, EHESS, CNRS, France |
| Pseudocode | No | The paper describes methods and architectures using textual explanations and diagrams (e.g., Figure 2: overview of the SLM architecture; Figure 3: a general pipeline for speech encoders; Figure 4: hierarchical generation strategies). It does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | As a survey, the paper does not introduce a new method that would come with its own source code. While it notes the availability of open-source models and toolkits among the surveyed SLMs, it does not provide source code for the survey itself. |
| Open Datasets | No | The paper is a comprehensive survey of spoken language models. It discusses datasets used by the surveyed models for training and evaluation (e.g., LLaMA-Questions, WebQuestions (Berant et al., 2013), TriviaQA (Joshi et al., 2017), and Dynamic-SUPERB (Huang et al., 2024)), but the survey authors do not present or provide access to a dataset of their own. |
| Dataset Splits | No | As a survey, the paper conducts no experiments of its own and introduces no new datasets, so it provides no training/validation/test splits. |
| Hardware Specification | No | The paper does not describe experimental work by its authors and therefore specifies no hardware for its own experiments. The GPUs listed in a latency comparison (e.g., A40 and L40 in Table 3) refer to hardware used by the surveyed works, not by the authors of this paper. |
| Software Dependencies | No | As a survey, the paper presents no experiments of its own that would require specific software dependencies. It mentions software and models from the surveyed literature (e.g., BERT (Devlin et al., 2019), GPT-2, LLaMA (Touvron et al., 2023a)), but lists no versioned software dependencies for its own work. |
| Experiment Setup | No | The paper involves no original experimental work by its authors, so it provides no hyperparameters or system-level training settings. |