Agent-Oriented Planning in Multi-Agent Systems

Authors: Ao LI, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, Yaliang Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate the advancement of AOP in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems. Extensive experiments are conducted based on several reasoning datasets that require collaboration among multiple LLM-empowered agents. Comparisons between AOP and baseline methods demonstrate the remarkable advancements achieved by the proposed framework. Furthermore, we conduct an ablation study to show the contributions of different components in AOP."
Researcher Affiliation | Collaboration | Ao Li (1,2), Yuexiang Xie (3), Songze Li (4,5), Fugee Tsung (1,2), Bolin Ding (3), Yaliang Li (3); affiliations: 1 The Hong Kong University of Science and Technology (Guangzhou); 2 The Hong Kong University of Science and Technology; 3 Alibaba Group; 4 Southeast University; 5 Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education
Pseudocode | No | The paper describes the AOP framework in Section 4, detailing its components and processes in prose. It includes figures such as "Overall architecture of AOP" (Figure 2) and prompt examples in the appendix, but no explicit pseudocode or algorithm blocks appear in the main body or appendices.
Open Source Code | Yes | "The source code is available at https://github.com/lalaliat/Agent-Oriented-Planning."
Open Datasets | Yes | "We conduct experiments based on a numerical reasoning dataset (Kim et al., 2024), which necessitates the collaboration of multiple agents in resolving the queries. Following a previous study (Kim et al., 2024), we adopt Husky QA, which consists of 1,440 queries in the training data and 292 queries in the test data. Besides, we also provide more experimental results on the decontextualized versions of a subset of DROP (Dua et al., 2019) and IIRC (Ferguson et al., 2020) in Appendix D.1."
Dataset Splits | Yes | "Following a previous study (Kim et al., 2024), we adopt Husky QA, which consists of 1,440 queries in the training data and 292 queries in the test data."
Hardware Specification | Yes | "We train the reward model for 50 epochs on one Tesla V100-SXM2-32GB GPU."
Software Dependencies | No | The paper mentions using all-MiniLM-L6-v2 as the embedding layer for the reward model and GPT-4o as the LLM for agents, along with Python for code generation, but it does not specify version numbers for these components or for any other libraries used.
Experiment Setup | Yes | "The batch size is set to 32, and the learning rate is 1e-3. We train the reward model for 50 epochs on one Tesla V100-SXM2-32GB GPU."
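To make the reported hyperparameters concrete, the following is a minimal sketch of a reward-model training loop under the stated settings (batch size 32, learning rate 1e-3, 50 epochs). The paper does not disclose the reward model's architecture, so a simple logistic-regression head over fixed embeddings is a stand-in assumption here; the 384-dimensional input matches the output size of all-MiniLM-L6-v2, and the synthetic data and label rule are purely illustrative.

```python
import numpy as np

# Hyperparameters as reported in the paper's experiment setup.
EMBED_DIM = 384      # output dimension of all-MiniLM-L6-v2 embeddings
BATCH_SIZE = 32
LEARNING_RATE = 1e-3
EPOCHS = 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_reward_head(X, y, epochs=EPOCHS, lr=LEARNING_RATE, batch=BATCH_SIZE):
    """Fit a linear reward head (w, b) on embeddings X with binary labels y
    via mini-batch gradient descent on the binary cross-entropy loss."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            pred = sigmoid(X[idx] @ w + b)
            grad = pred - y[idx]              # dL/dlogit for cross-entropy
            w -= lr * X[idx].T @ grad / len(idx)
            b -= lr * grad.mean()
    return w, b

# Synthetic stand-in data in place of real plan embeddings and reward labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(256, EMBED_DIM))
y = (X[:, 0] > 0).astype(float)               # toy labeling rule
w, b = train_reward_head(X, y)
acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

This is only a shape-level illustration of the reported configuration, not the authors' implementation; the actual reward model, its loss, and its training data are described in the paper and released code.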