VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?

Authors: Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, Zhou Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we evaluated several existing spoken dialogue models, analyzing their performance on the 12 attribute subsets of VoxDialogue. Experiments have shown that in spoken dialogue scenarios, many acoustic cues cannot be conveyed through textual information and must be directly interpreted from the audio input."
Researcher Affiliation | Academia | Zhejiang University
Pseudocode | No | The paper describes methods like "Stage1: Dialogue Script Synthesis" and "Stage2: Spoken Dialogue Generation" in Section 3.2, but these are explained in paragraph form and do not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Data: https://voxdialogue.github.io/
Open Datasets | Yes | Code & Data: https://voxdialogue.github.io/
Dataset Splits | No | The paper mentions evaluating on "a subset of VoxDialogue" and "12 different attribute-specific test sets" in Section 4.2 and Figure 2, but it does not specify training, validation, or test splits, split percentages, or a methodology for partitioning the data.
Hardware Specification | No | The paper does not report specific hardware, such as the GPU or CPU models used to run the experiments.
Software Dependencies | Yes | "Specifically, we used the Whisper model (Radford et al., 2023) to filter out all sentences with a word error rate (WER) greater than 5%, and applied speaker-diarization-3.1 (Plaquet & Bredin, 2023; Bredin, 2023) to eliminate samples with timbre inconsistencies in the speech of the same speaker throughout the dialogue sequence."
Experiment Setup | No | The paper defines the task in Section 4.1 and elaborates on the evaluation metrics in Section 4.2, including quantitative (BLEU, ROUGE-L, METEOR, BERTScore) and qualitative (GPT-based) measures. However, it does not provide experimental setup details such as hyperparameters, training configurations, or system-level settings for the evaluated models.
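The Software Dependencies row quotes the paper's filtering step: samples whose ASR transcript diverges from the dialogue script by more than 5% WER are discarded. A minimal sketch of that check, assuming the transcript has already been produced by an ASR model such as Whisper (the `wer` and `keep_sample` names here are illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (0 if match)
            )
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)


def keep_sample(script_text: str, asr_transcript: str, threshold: float = 0.05) -> bool:
    """Keep synthesized speech only if its transcript stays within the WER threshold."""
    return wer(script_text, asr_transcript) <= threshold
```

A matching transcript gives `wer == 0.0` and is kept; any sample whose transcript drifts past the 5% threshold is filtered out, mirroring the quality gate described in the paper.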
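Of the quantitative metrics listed in the Experiment Setup row, ROUGE-L is the one defined directly by a longest-common-subsequence recurrence. A self-contained sketch over whitespace tokens (real evaluations typically use a library implementation with proper tokenization; `beta` is the conventional recall weighting, assumed here rather than taken from the paper):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value diagonally up-left of the current cell
        for j in range(1, len(b) + 1):
            cur = prev + 1 if x == b[j - 1] else max(dp[j], dp[j - 1])
            prev, dp[j] = dp[j], cur
    return dp[-1]


def rouge_l(reference: str, candidate: str, beta: float = 1.2) -> float:
    """ROUGE-L F-measure between a reference and a candidate response."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    prec = lcs / len(cand)
    rec = lcs / len(ref)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

Identical strings score 1.0 and disjoint strings score 0.0; BLEU, METEOR, and BERTScore follow different formulations and are usually taken from their reference implementations.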