Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Authors: Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer this, we propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities using a supervised model as a judge, trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events, and identify significant room for improvement.
Researcher Affiliation | Collaboration | Siddhant Arora1, Zhiyun Lu2, Chung-Cheng Chiu2, Ruoming Pang2, Shinji Watanabe1. 1 Carnegie Mellon University, USA; 2 Apple.
Pseudocode | No | The paper includes diagrams illustrating the model architecture (Figure 7) and conceptual flows, but no explicit pseudocode block or algorithm steps are presented.
Open Source Code | Yes | To ensure reproducibility of our results and provide researchers with the ability to evaluate their own pre-trained audio FMs using our evaluation platform, we will publicly release our full codebase as part of the ESPnet (Watanabe et al., 2018) toolkit. The release can be followed here: https://github.com/espnet/espnet/pull/5948.
Open Datasets | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Following prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test, respectively. We evaluated our model on the in-domain Switchboard test set and additionally on two out-of-domain (OOD) datasets: the Columbia Games Corpus (Gravano & Hirschberg, 2011) and the Fisher Corpus (Cieri et al., 2004).
Dataset Splits | Yes | Dataset details: We train our turn-taking prediction model on the Switchboard dataset. Following prior work (Wang et al., 2024), we split the dataset by conversations into 2000:300:138 for train, validation, and test, respectively.
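The conversation-level split quoted above can be sketched as follows. This is an illustrative sketch only: the conversation IDs, seed, and helper name are hypothetical, and the actual Switchboard partition follows Wang et al. (2024) rather than a random shuffle.

```python
import random

def split_conversations(conv_ids, sizes=(2000, 300, 138), seed=0):
    """Partition conversation IDs into train/valid/test splits by conversation,
    so no conversation contributes segments to more than one split."""
    assert sum(sizes) == len(conv_ids), "sizes must cover all conversations"
    ids = list(conv_ids)
    random.Random(seed).shuffle(ids)  # hypothetical: the paper uses a fixed split
    n_train, n_valid, _ = sizes
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test

# Hypothetical IDs: 2438 conversations total (2000 + 300 + 138).
conv_ids = [f"sw{idx:05d}" for idx in range(2438)]
train, valid, test = split_conversations(conv_ids)
print(len(train), len(valid), len(test))  # 2000 300 138
```

Splitting by conversation (rather than by utterance) matters here because turn-taking labels within one conversation are highly correlated; a per-utterance split would leak test speakers into training.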
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It mentions using the 'Whisper medium encoder' but does not specify the hardware on which the encoder was run or the experiments were conducted.
Software Dependencies | No | The conversation is passed through a VAD model using the pyannote (Bredin et al., 2020) library. We use the Whisper medium encoder to generate acoustic representations.
Experiment Setup | Yes | The size of the chunk, i.e., Nblock, is 40 msec. We use the Whisper medium encoder to generate acoustic representations. The context window of the supervised turn-taking model W is 30 seconds. ... For our proposed metrics (Sec. 4.4-4.8), we use the following threshold values: threshold1 = 0, threshold2 = 0.1, threshold3 = -0.45, threshold4 = -0.1. ... For single-label evaluations (Sec. 4.9), the operating points (thresholds) for the predicted likelihood of each label are C = 0.2, NA = 0.45, I = 0.4, BC = 0.4, T = 0.4.
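The quoted setup can be sketched numerically: 40 ms blocks tiled over a 30 s context window, plus the per-label operating points for the single-label evaluations. This is a sketch under stated assumptions, not the paper's implementation: the 16 kHz sampling rate, the `likelihoods` input, and the `active_labels` helper are all assumptions for illustration; the released code lives in ESPnet.

```python
BLOCK_MS = 40        # N_block: chunk size in milliseconds (from the paper)
CONTEXT_S = 30       # context window of the supervised turn-taking model W
SAMPLE_RATE = 16000  # assumed sampling rate; not stated in this excerpt

# 40 ms of 16 kHz audio is 640 samples; a 30 s window holds 750 such blocks.
samples_per_block = SAMPLE_RATE * BLOCK_MS // 1000
blocks_per_window = CONTEXT_S * 1000 // BLOCK_MS

# Operating points for single-label evaluations (Sec. 4.9 of the paper).
THRESHOLDS = {"C": 0.2, "NA": 0.45, "I": 0.4, "BC": 0.4, "T": 0.4}

def active_labels(likelihoods):
    """Return the labels whose predicted likelihood clears its threshold."""
    return [lab for lab, p in likelihoods.items() if p >= THRESHOLDS[lab]]

print(samples_per_block, blocks_per_window)  # 640 750
print(active_labels({"C": 0.3, "NA": 0.1, "I": 0.5, "BC": 0.2, "T": 0.45}))
```

The per-label thresholds (rather than a single argmax) reflect that turn-taking events such as backchannels (BC) and interruptions (I) are rare and need separately tuned operating points.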