ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning

Authors: Xiangru Tang, Tianyu Hu, Muyang Ye, Daniel Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our experiments are conducted on four chemical reasoning datasets from SciBench (Wang et al., 2024a) with GPT-3.5, GPT-4 (OpenAI et al., 2024), and open-source models like Llama3 (Llama Team, 2024). |
| Researcher Affiliation | Academia | 1 Yale University, 2 UIUC, 3 Stanford University, 4 Shanghai Jiao Tong University |
| Pseudocode | Yes | Section 2.4 (Library Construction), Algorithm 1: Library Construction. Input: development set D, LLM F, prompts {p_split, p_ref, p_rank}. Output: static memory M consisting of units U = {condition, question, solution}. for (P, S) in D do |
| Open Source Code | Yes | Our code can be found at https://github.com/gersteinlab/chemagent. |
| Open Datasets | Yes | Our experiments are conducted on four chemical reasoning datasets from SciBench (Wang et al., 2024a) with GPT-3.5, GPT-4 (OpenAI et al., 2024), and open-source models like Llama3 (Llama Team, 2024). |
| Dataset Splits | Yes | Each dataset is divided into a development set (Dd) and a test set (Dt), with exact sizes provided in Table 6. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-3.5, GPT-4, and Llama3 as models but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation of ChemAgent. |
| Experiment Setup | Yes | During the reasoning stage, we configure the planning memory (Mp) to provide a maximum of two related memory instances (2-shot) for each query, and the execution memory (Me) to provide up to four related instances (4-shot). However, during the construction of the library, only the knowledge memory (Mk) is used, as the standard solutions are already available in the development set (Dd). We evaluate the accuracy by comparing their outputs with the correct answers, using a relative tolerance of 0.01. |
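The library-construction loop quoted in the Pseudocode row can be sketched in Python as follows. This is only a hedged illustration of Algorithm 1's structure: `fake_llm`, `build_library`, and the way the prompts are applied are assumptions, since the paper specifies only the inputs (development set D, LLM F, prompts p_split, p_ref, p_rank) and the output (static memory M of {condition, question, solution} units), not the implementation.

```python
# Hedged sketch of Algorithm 1 (library construction), NOT the authors' code.
# `fake_llm` is a stand-in for the paper's LLM F; a real system would prompt
# a model with p_split / p_ref / p_rank instead of echoing its input.

def fake_llm(prompt: str, text: str) -> str:
    # Stub: returns the text unchanged; a real LLM would transform it
    # according to `prompt`.
    return text

def build_library(dev_set, llm=fake_llm):
    """Build static memory M: a list of units {condition, question, solution}."""
    memory = []
    for problem, solution in dev_set:  # "for (P, S) in D do"
        # p_split: decompose the worked example into a sub-question.
        question = llm("p_split", problem)
        # p_ref: refine the pair into a reusable condition/solution unit.
        condition = llm("p_ref", problem)
        refined_solution = llm("p_ref", solution)
        # p_rank would score/filter units here; this stub keeps everything.
        memory.append({
            "condition": condition,
            "question": question,
            "solution": refined_solution,
        })
    return memory

library = build_library([("What is the molar mass of H2O?", "18.02 g/mol")])
```

The stub keeps the control flow (one pass over the development set, one memory unit per example) while leaving the three prompt-driven transformations as placeholders.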
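The evaluation rule in the Experiment Setup row (answers compared to the reference with a relative tolerance of 0.01) can be made concrete with a short sketch. The paper states only the tolerance, not the comparison function, so using `math.isclose` here is an assumption about the exact check.

```python
import math

def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Count an answer as correct if it is within 1% relative tolerance
    of the reference value (assumed comparison; the paper only gives 0.01)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

def accuracy(predictions, references, rel_tol: float = 0.01) -> float:
    """Fraction of predictions within the relative tolerance of their references."""
    hits = sum(is_correct(p, r, rel_tol) for p, r in zip(predictions, references))
    return hits / len(references)

# 1.005 is within 1% of 1.0 (correct); 2.5 is not within 1% of 3.0 (incorrect).
score = accuracy([1.005, 2.5], [1.0, 3.0])  # -> 0.5
```

`math.isclose` treats `rel_tol` relative to the larger magnitude of the two values, which is a common and symmetric choice for numeric answer grading.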