Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
Authors: Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this, we propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveal many interesting insights, such as that they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events and identify significant room for improvement. |
| Researcher Affiliation | Collaboration | Siddhant Arora (1), Zhiyun Lu (2), Chung-Cheng Chiu (2), Ruoming Pang (2), Shinji Watanabe (1); (1) Carnegie Mellon University, USA; (2) Apple |
| Pseudocode | No | The paper includes diagrams illustrating the model architecture (Figure 7) and conceptual flows, but no explicit pseudocode block or algorithm steps are presented. |
| Open Source Code | Yes | To ensure reproducibility of our results and provide researchers with the ability to evaluate their own pre-trained audio FMs using our evaluation platform, we will publicly release our full codebase as part of the ESPnet (Watanabe et al., 2018) toolkit. The release can be followed here: https://github.com/espnet/espnet/pull/5948. |
| Open Datasets | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Similar to prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test respectively. We evaluated our model on the in-domain Switchboard test set and additionally on 2 out-of-domain (OOD) datasets: the Columbia Games Corpus (Gravano & Hirschberg, 2011) and the Fisher Corpus (Cieri et al., 2004). |
| Dataset Splits | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Similar to prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test respectively. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using 'Whisper medium encoder' but does not specify the hardware on which this encoder was run or the experiments were conducted. |
| Software Dependencies | No | The conversation is passed through a VAD model using pyannotate (Bredin et al., 2020) library. We use the Whisper medium encoder to generate acoustic representations. |
| Experiment Setup | Yes | The size of the chunk, i.e., N_block, is 40 ms. We use the Whisper medium encoder to generate acoustic representations. The context window of the supervised turn-taking model W is 30 seconds. ... For our proposed metrics (Sec. 4.4-4.8), we get the following values for threshold, i.e. threshold1 = 0, threshold2 = 0.1, threshold3 = -0.45, threshold4 = -0.1. ... For single-label evaluations (Sec. 4.9), operating points or thresholds for the predicted likelihood of label C = 0.2, NA = 0.45, I = 0.4, BC = 0.4, T = 0.4. |
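The per-label operating points quoted above can be illustrated with a minimal sketch. This is not the paper's implementation (which is released via ESPnet); the function name and input format are hypothetical, and the label abbreviations (C, NA, I, BC, T) are taken from the paper without further interpretation:

```python
# Hypothetical sketch: apply the paper's per-label operating points
# (thresholds on predicted likelihoods) for single-label evaluation.
THRESHOLDS = {"C": 0.2, "NA": 0.45, "I": 0.4, "BC": 0.4, "T": 0.4}

def predicted_events(likelihoods: dict) -> list:
    """Return the turn-taking labels whose predicted likelihood
    meets or exceeds that label's operating point."""
    return [label for label, p in likelihoods.items()
            if p >= THRESHOLDS[label]]

# Example frame: only BC and T clear their thresholds.
events = predicted_events({"C": 0.1, "NA": 0.2, "I": 0.35, "BC": 0.6, "T": 0.5})
print(events)  # → ['BC', 'T']
```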