Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Authors: Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer this, we propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities using a supervised model as a judge, trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events, and identify significant room for improvement.
Researcher Affiliation | Collaboration | Siddhant Arora1, Zhiyun Lu2, Chung-Cheng Chiu2, Ruoming Pang2, Shinji Watanabe1. 1 Carnegie Mellon University, USA; 2 Apple.
Pseudocode | No | The paper includes diagrams illustrating the model architecture (Figure 7) and conceptual flows, but no explicit pseudocode block or algorithm steps are presented.
Open Source Code | Yes | To ensure reproducibility of our results and provide researchers with the ability to evaluate their own pre-trained audio FMs using our evaluation platform, we will publicly release our full codebase as part of the ESPnet (Watanabe et al., 2018) toolkit. The release can be followed here: https://github.com/espnet/espnet/pull/5948.
Open Datasets | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Following prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test, respectively. We evaluated our model on the in-domain Switchboard test set and additionally on two out-of-domain (OOD) datasets: the Columbia Games Corpus (Gravano & Hirschberg, 2011) and the Fisher Corpus (Cieri et al., 2004).
Dataset Splits | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Following prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test, respectively.
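The conversation-level split quoted above can be sketched as follows. This is an illustrative sketch only: the conversation IDs, seed, and helper name are hypothetical, and the actual Switchboard partition follows Wang et al. (2024) rather than a random shuffle.

```python
import random

def split_conversations(conv_ids, sizes=(2000, 300, 138), seed=0):
    """Partition conversation IDs into train/valid/test splits by conversation,
    so no conversation contributes segments to more than one split."""
    assert sum(sizes) == len(conv_ids), "sizes must cover all conversations"
    ids = list(conv_ids)
    random.Random(seed).shuffle(ids)  # hypothetical: the paper uses a fixed split
    n_train, n_valid, _ = sizes
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test

# Hypothetical IDs: 2438 conversations total (2000 + 300 + 138).
conv_ids = [f"sw{idx:05d}" for idx in range(2438)]
train, valid, test = split_conversations(conv_ids)
print(len(train), len(valid), len(test))  # 2000 300 138
```

Splitting by conversation (rather than by utterance) matters here because turn-taking labels within one conversation are highly correlated; a per-utterance split would leak test speakers into training.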
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It mentions using the 'Whisper medium encoder' but does not specify the hardware on which the encoder was run or the experiments were conducted.
Software Dependencies | No | The conversation is passed through a VAD model using the pyannote (Bredin et al., 2020) library. We use the Whisper medium encoder to generate acoustic representations.
Experiment Setup | Yes | The size of the chunk, i.e., Nblock, is 40 msec. We use the Whisper medium encoder to generate acoustic representations. The context window of the supervised turn-taking model W is 30 seconds. ... For our proposed metrics (Sec. 4.4-4.8), we use the following threshold values: threshold1 = 0, threshold2 = 0.1, threshold3 = -0.45, threshold4 = -0.1. ... For single-label evaluations (Sec. 4.9), the operating points (thresholds) for the predicted likelihood of each label are C = 0.2, NA = 0.45, I = 0.4, BC = 0.4, T = 0.4.
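The quoted setup can be sketched numerically: 40 ms blocks tiled over a 30 s context window, plus the per-label operating points for the single-label evaluations. This is a sketch under stated assumptions, not the paper's implementation: the 16 kHz sampling rate, the `likelihoods` input, and the `active_labels` helper are all assumptions for illustration; the released code lives in ESPnet.

```python
BLOCK_MS = 40        # N_block: chunk size in milliseconds (from the paper)
CONTEXT_S = 30       # context window of the supervised turn-taking model W
SAMPLE_RATE = 16000  # assumed sampling rate; not stated in this excerpt

# 40 ms of 16 kHz audio is 640 samples; a 30 s window holds 750 such blocks.
samples_per_block = SAMPLE_RATE * BLOCK_MS // 1000
blocks_per_window = CONTEXT_S * 1000 // BLOCK_MS

# Operating points for single-label evaluations (Sec. 4.9 of the paper).
THRESHOLDS = {"C": 0.2, "NA": 0.45, "I": 0.4, "BC": 0.4, "T": 0.4}

def active_labels(likelihoods):
    """Return the labels whose predicted likelihood clears its threshold."""
    return [lab for lab, p in likelihoods.items() if p >= THRESHOLDS[lab]]

print(samples_per_block, blocks_per_window)  # 640 750
print(active_labels({"C": 0.3, "NA": 0.1, "I": 0.5, "BC": 0.2, "T": 0.45}))
```

The per-label thresholds (rather than a single argmax) reflect that turn-taking events such as backchannels (BC) and interruptions (I) are rare and need separately tuned operating points.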