Synergistic Multi-Agent Framework with Trajectory Learning for Knowledge-Intensive Tasks
Authors: Shengbin Yue, Siyuan Wang, Wei Chen, Xuanjing Huang, Zhongyu Wei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five knowledge-intensive tasks demonstrate SMART's superior performance compared to widely adopted knowledge internalization and knowledge enhancement methods. Our framework can extend beyond knowledge-intensive tasks to more complex scenarios. We conduct experiments on five knowledge-intensive tasks, including fact verification, multiple-choice reasoning, open-domain question answering and long-form generation. Results demonstrate that our framework significantly outperforms pre-trained and instruction-tuned LLMs with more parameters (knowledge internalization methods), and widely adopted knowledge enhancement methods. The paper also reports ablation studies. |
| Researcher Affiliation | Academia | 1 Fudan University, Shanghai, China 2 University of Southern California, Los Angeles, USA 3 Huazhong University of Science and Technology, Wuhan, China |
| Pseudocode | Yes | The pseudo-code for inference is referenced in the Appendix. |
| Open Source Code | Yes | Code https://github.com/yueshengbin/SMART |
| Open Datasets | Yes | We conduct experiments on five knowledge-intensive tasks, including fact verification, multiple-choice reasoning, open-domain question answering and long-form generation. These include (1) Fact verification: PubHealth (Akhtar, Cocarascu, and Simperl 2022), a fact verification dataset about public health; (2) Multiple-choice reasoning: ARC-Challenge (Clark et al. 2018), a multiple-choice question dataset drawn from science exams; (3) Open-domain question answering: two short-form QA datasets, PopQA (Mallen et al. 2022) and SQuAD 1.1 (Rajpurkar et al. 2016); (4) Ambiguous question answering: ASQA (Gao et al. 2023), ambiguous factoid questions requiring long-form responses. |
| Dataset Splits | No | The paper mentions collecting a "Trajectory dataset" consisting of a "long-trajectory subset" and a "short-trajectory subset", and for ablation studies it mentions "randomly selected subsets of 8k, 20k, 60k, and 121k instances from the initial 140k training instances". However, it does not explicitly provide the train/test/validation splits for the specific datasets (PubHealth, ARC-Challenge, PopQA, SQuAD 1.1, ASQA) used in the main experimental results, nor does it specify the methodology for such splits in the main text. It states "Details of evaluation data, including size, and evaluation metrics are available in Appendix Sec. B.1.", implying this information is not in the main body. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only acknowledges general computational support: "The project's computational resources are supported by CFFF platform of Fudan University." |
| Software Dependencies | No | The paper mentions using pre-trained LLMs and an off-the-shelf retrieval model, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers that would be required to reproduce the experiments. |
| Experiment Setup | No | The paper states "Due to page limitations, details of our training and evaluation are in Appendix Sec. B.3." The main text provides neither specific hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) nor detailed system-level training configurations. |