VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?

Authors: Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, Zhou Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we evaluated several existing spoken dialogue models, analyzing their performance on the 12 attribute subsets of VoxDialogue. Experiments have shown that in spoken dialogue scenarios, many acoustic cues cannot be conveyed through textual information and must be directly interpreted from the audio input."
Researcher Affiliation | Academia | Zhejiang University
Pseudocode | No | The paper describes methods like "Stage1: Dialogue Script Synthesis" and "Stage2: Spoken Dialogue Generation" in Section 3.2, but these are explained in paragraph form and do not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Data: https://voxdialogue.github.io/
Open Datasets | Yes | Code & Data: https://voxdialogue.github.io/
Dataset Splits | No | The paper mentions evaluating on "a subset of VoxDialogue" and "12 different attribute-specific test sets" in Section 4.2 and Figure 2, but it does not specify training, validation, or test splits, split percentages, or a methodology for partitioning the data.
Hardware Specification | No | The paper does not report specific hardware, such as the GPU or CPU models used to run the experiments.
Software Dependencies | Yes | "Specifically, we used the Whisper model (Radford et al., 2023) to filter out all sentences with a word error rate (WER) greater than 5%, and applied speaker-diarization-3.1 (Plaquet & Bredin, 2023; Bredin, 2023) to eliminate samples with timbre inconsistencies in the speech of the same speaker throughout the dialogue sequence."
Experiment Setup | No | The paper defines the task in Section 4.1 and elaborates on the evaluation metrics in Section 4.2, including quantitative (BLEU, ROUGE-L, METEOR, BERTScore) and qualitative (GPT-based) measures. However, it does not provide experimental setup details such as hyperparameters, training configurations, or system-level settings for the evaluated models.
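The Software Dependencies row quotes the paper's filtering step: samples whose ASR transcript diverges from the dialogue script by more than 5% WER are discarded. A minimal sketch of that check, assuming the transcript has already been produced by an ASR model such as Whisper (the `wer` and `keep_sample` names here are illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (0 if match)
            )
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)


def keep_sample(script_text: str, asr_transcript: str, threshold: float = 0.05) -> bool:
    """Keep synthesized speech only if its transcript stays within the WER threshold."""
    return wer(script_text, asr_transcript) <= threshold
```

A matching transcript gives `wer == 0.0` and is kept; any sample whose transcript drifts past the 5% threshold is filtered out, mirroring the quality gate described in the paper.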
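Of the quantitative metrics listed in the Experiment Setup row, ROUGE-L is the one defined directly by a longest-common-subsequence recurrence. A self-contained sketch over whitespace tokens (real evaluations typically use a library implementation with proper tokenization; `beta` is the conventional recall weighting, assumed here rather than taken from the paper):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value diagonally up-left of the current cell
        for j in range(1, len(b) + 1):
            cur = prev + 1 if x == b[j - 1] else max(dp[j], dp[j - 1])
            prev, dp[j] = dp[j], cur
    return dp[-1]


def rouge_l(reference: str, candidate: str, beta: float = 1.2) -> float:
    """ROUGE-L F-measure between a reference and a candidate response."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    prec = lcs / len(cand)
    rec = lcs / len(ref)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

Identical strings score 1.0 and disjoint strings score 0.0; BLEU, METEOR, and BERTScore follow different formulations and are usually taken from their reference implementations.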