Synergistic Multi-Agent Framework with Trajectory Learning for Knowledge-Intensive Tasks
Authors: Shengbin Yue, Siyuan Wang, Wei Chen, Xuanjing Huang, Zhongyu Wei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five knowledge-intensive tasks demonstrate SMART's superior performance compared to widely adopted knowledge internalization and knowledge enhancement methods. Our framework can extend beyond knowledge-intensive tasks to more complex scenarios. We conduct experiments on five knowledge-intensive tasks, including fact verification, multiple-choice reasoning, open-domain question answering and long-form generation. Results demonstrate that our framework significantly outperforms pre-trained and instruction-tuned LLMs with more parameters (knowledge internalization methods), and widely adopted knowledge enhancement methods. The paper also reports ablation studies. |
| Researcher Affiliation | Academia | 1 Fudan University, Shanghai, China 2 University of Southern California, Los Angeles, USA 3 Huazhong University of Science and Technology, Wuhan, China |
| Pseudocode | Yes | The pseudo-code for inference is referenced in the Appendix. |
| Open Source Code | Yes | Code https://github.com/yueshengbin/SMART |
| Open Datasets | Yes | We conduct experiments on five knowledge-intensive tasks, including fact verification, multiple-choice reasoning, open-domain question answering and long-form generation. These include (1) Fact verification: PubHealth (Akhtar, Cocarascu, and Simperl 2022), a fact verification dataset about public health; (2) Multiple-choice reasoning: ARC-Challenge (Clark et al. 2018), a multiple-choice question dataset drawn from science exams; (3) Open-domain question answering: two short-form QA datasets, PopQA (Mallen et al. 2022) and SQuAD 1.1 (Rajpurkar et al. 2016); (4) Ambiguous question answering: ASQA (Gao et al. 2023), ambiguous factoid questions requiring long-form responses. |
| Dataset Splits | No | The paper mentions collecting a "Trajectory dataset" consisting of a "long-trajectory subset" and a "short-trajectory subset", and for ablation studies it mentions "randomly selected subsets of 8k, 20k, 60k, and 121k instances from the initial 140k training instances". However, it does not explicitly provide the train/test/validation splits for the specific datasets (PubHealth, ARC-Challenge, PopQA, SQuAD 1.1, ASQA) used in the main experimental results, nor does it specify the methodology for such splits in the main text. It states "Details of evaluation data, including size, and evaluation metrics are available in Appendix Sec. B.1.", implying this information is not in the main body. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only acknowledges general computational support: "The project's computational resources are supported by CFFF platform of Fudan University." |
| Software Dependencies | No | The paper mentions using pre-trained LLMs and an off-the-shelf retrieval model, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers that would be required to reproduce the experiments. |
| Experiment Setup | No | The paper states "Due to page limitations, details of our training and evaluation are in Appendix Sec. B.3." The main text provides neither specific hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) nor detailed system-level training configurations. |