Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration

Authors: Peng Wang, Shuo Li, Ruoxi Zhou, Qiguang Chen, Xiao Xu, Hao Fei, Dagang Li, Wanxiang Che, Libo Qin

IJCAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models after training on the CI-ToD dataset.
Researcher Affiliation | Academia | 1 School of Computer Science and Engineering, Central South University, China; 2 Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities, Soochow University, China; 3 Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China; 4 School of Computing, National University of Singapore, Singapore; 5 School of Computer Science and Engineering, Macau University of Science and Technology, China
Pseudocode | No | The paper describes the model architecture and collaboration paradigms using mathematical formulas and descriptive text, but no distinct pseudocode or algorithm blocks are provided.
Open Source Code | Yes | To facilitate further research, our code will be available at https://github.com/WPENGxs/MAC-CIToD.
Open Datasets | Yes | Following previous work [Qin et al., 2021; Qin et al., 2022; Ding et al., 2024], we use the standard CI-ToD benchmark for experiments.
Dataset Splits | No | The paper cites previous work [Qin et al., 2021; Qin et al., 2022; Ding et al., 2024] for the "standard CI-ToD benchmark," but its main text does not explicitly provide dataset split information (percentages, sample counts, or instructions for reproducing the data partitioning).
Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, processor types, or memory amounts) used for its experiments. It only mentions various LLM backbones and, in the acknowledgments, the "High Performance Computing Center of Central South University," without further specification.
Software Dependencies | No | The paper states that "All open source models are obtained from Hugging Face Library [Wolf et al., 2020]." While it names a library, it does not specify version numbers for it or for other critical software dependencies (e.g., Python or PyTorch) that would be necessary for replication.
Experiment Setup | Yes | For the GPT series models, the temperature is 0.3, the top p is 1, and the output max token length is 512. For the open source models, the temperature is 0.7, the top p is 0.8, and the output max token length is 512.
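The decoding hyperparameters reported above can be captured in a small config sketch. This is a hypothetical illustration, not code from the paper's repository: the `decoding_config` helper and the name-prefix check for GPT-series models are assumptions made here for clarity.

```python
# Decoding settings as reported in the paper's experiment setup.
GPT_SERIES_CONFIG = {"temperature": 0.3, "top_p": 1.0, "max_tokens": 512}
OPEN_SOURCE_CONFIG = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 512}


def decoding_config(model_name: str) -> dict:
    """Return the reported decoding hyperparameters for a backbone.

    Assumption: GPT-series backbones are identified by a "gpt" name
    prefix; all other backbones use the open-source settings.
    """
    if model_name.lower().startswith("gpt"):
        return GPT_SERIES_CONFIG
    return OPEN_SOURCE_CONFIG
```

These dictionaries match the keyword arguments of typical OpenAI-style chat completion APIs, so they could be unpacked directly into such a call (e.g., `client.chat.completions.create(model=..., messages=..., **decoding_config(model))`).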