Rational Decision-Making Agent with Learning Internal Utility Judgment

Authors: Yining Ye, Xin Cong, Shizuo Tian, Yujia Qin, Chong Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent's superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent can also reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
Researcher Affiliation | Academia | 1Tsinghua University 2Renmin University of China
Pseudocode | Yes | Algorithm 1 RaDAgent
Open Source Code | Yes | Our source code is released in https://github.com/OpenBMB/RaD-Agent.
Open Datasets | Yes | We conduct extensive experiments on Game of 24 (Yao et al., 2023), WebShop (Yao et al., 2022a), and ToolBench (Qin et al., 2023c) datasets. ... To verify that our method is robust and applicable to real-world environments, we expand our evaluation to the RestBench dataset (Song et al., 2023)
Dataset Splits | Yes | We use 100, 500, and 500 instances for Game of 24, WebShop, and ToolBench to evaluate the decision-making ability respectively.
Hardware Specification | No | The paper mentions using OpenAI ChatGPT models (gpt-3.5-turbo-0613-16k, GPT-4o-mini, GPT-4) for implementation and experiments, which are accessed via API calls. It does not provide details of the local hardware used to run these experiments or interact with the APIs.
Software Dependencies | Yes | We use OpenAI ChatGPT gpt-3.5-turbo-0613-16k to implement our approach (our designed prompt can refer to Appendix A). ... We implement our method and baseline based on GPT-4o-mini to reduce the API cost ... To validate the effectiveness of different LLMs, we have conducted additional experiments integrating GPT-4 into our RaDAgent
Experiment Setup | Yes | Our approach involves conducting a decision-exploration process 20 times and finally selecting the decision sequence with the highest Elo score as the final decision. For Elo-based Utility Learning, the initial Elo score of the decision step is set as 0.0 and the Elo coefficient r is set as 173.72 according to the vanilla Elo rating system (Elo, 1967). The Elo score of d̂ in Equation 5 is set as 0.0. K in Equation 3 is set as 50. To manage the computational cost of ChatGPT API calls, we set a maximum limit of 12 steps for each decision sequence in a decision-searching process.
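The experiment-setup row fully pins down the Elo machinery (initial score 0.0, coefficient r = 173.72, update factor K = 50). Since the report does not reproduce Equations 3 and 5 themselves, the sketch below assumes the standard logistic Elo expected-score formula with base e, which is consistent with r = 173.72 ≈ 400/ln 10 from the vanilla Elo system; function names are illustrative, not from the paper's code.

```python
import math

R_COEFF = 173.72  # Elo coefficient r from the paper's setup
K = 50            # update factor K (Equation 3 in the paper)

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that decision step a beats decision step b.

    Assumes the base-e logistic form; with r = 400/ln(10) this matches
    the familiar base-10 Elo curve 1 / (1 + 10**((r_b - r_a) / 400)).
    """
    return 1.0 / (1.0 + math.exp((r_b - r_a) / R_COEFF))

def elo_update(r_a: float, r_b: float, outcome_a: float) -> tuple[float, float]:
    """Update both scores after one pairwise comparison.

    outcome_a is 1.0 if a wins, 0.0 if a loses, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + K * (outcome_a - e_a)
    r_b_new = r_b + K * ((1.0 - outcome_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Both decision steps start at the paper's initial score of 0.0;
# at equal ratings the expected score is 0.5, so a win moves the
# winner up by K * 0.5 = 25 points and the loser down by the same.
a, b = elo_update(0.0, 0.0, 1.0)
# a == 25.0, b == -25.0
```

Under this scheme, repeated pairwise comparisons during the 20 decision-exploration runs gradually separate the scores of good and bad decision steps, and the sequence with the highest final Elo score is returned.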