ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning

Authors: Xiangru Tang, Tianyu Hu, Muyang Ye, Daniel Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our experiments are conducted on four chemical reasoning datasets from SciBench (Wang et al., 2024a) with GPT-3.5, GPT-4 (OpenAI et al., 2024), and open-source models like Llama3 (Llama Team, 2024). |
| Researcher Affiliation | Academia | 1 Yale University, 2 UIUC, 3 Stanford University, 4 Shanghai Jiao Tong University |
| Pseudocode | Yes | Section 2.4 (Library Construction), Algorithm 1: Library Construction. Input: development set D, LLM F, prompts {p_split, p_ref, p_rank}. Output: static memory M consisting of units U = {condition, question, solution}. for (P, S) in D do |
| Open Source Code | Yes | Our code can be found at https://github.com/gersteinlab/chemagent. |
| Open Datasets | Yes | Our experiments are conducted on four chemical reasoning datasets from SciBench (Wang et al., 2024a) with GPT-3.5, GPT-4 (OpenAI et al., 2024), and open-source models like Llama3 (Llama Team, 2024). |
| Dataset Splits | Yes | Each dataset is divided into a development set (Dd) and a test set (Dt), with exact sizes provided in Table 6. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-3.5, GPT-4, and Llama3 as models but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation of ChemAgent. |
| Experiment Setup | Yes | During the reasoning stage, we configure the planning memory (Mp) to provide a maximum of two related memory instances (2-shot) for each query, and the execution memory (Me) to provide up to four related instances (4-shot). However, during the construction of the library, only the knowledge memory (Mk) is used, as the standard solutions are already available in the development set (Dd). We evaluate the accuracy by comparing their outputs with the correct answers, using a relative tolerance of 0.01. |
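The library-construction loop quoted in the Pseudocode row can be sketched in Python as follows. This is only a hedged illustration of Algorithm 1's structure: `fake_llm`, `build_library`, and the way the prompts are applied are assumptions, since the paper specifies only the inputs (development set D, LLM F, prompts p_split, p_ref, p_rank) and the output (static memory M of {condition, question, solution} units), not the implementation.

```python
# Hedged sketch of Algorithm 1 (library construction), NOT the authors' code.
# `fake_llm` is a stand-in for the paper's LLM F; a real system would prompt
# a model with p_split / p_ref / p_rank instead of echoing its input.

def fake_llm(prompt: str, text: str) -> str:
    # Stub: returns the text unchanged; a real LLM would transform it
    # according to `prompt`.
    return text

def build_library(dev_set, llm=fake_llm):
    """Build static memory M: a list of units {condition, question, solution}."""
    memory = []
    for problem, solution in dev_set:  # "for (P, S) in D do"
        # p_split: decompose the worked example into a sub-question.
        question = llm("p_split", problem)
        # p_ref: refine the pair into a reusable condition/solution unit.
        condition = llm("p_ref", problem)
        refined_solution = llm("p_ref", solution)
        # p_rank would score/filter units here; this stub keeps everything.
        memory.append({
            "condition": condition,
            "question": question,
            "solution": refined_solution,
        })
    return memory

library = build_library([("What is the molar mass of H2O?", "18.02 g/mol")])
```

The stub keeps the control flow (one pass over the development set, one memory unit per example) while leaving the three prompt-driven transformations as placeholders.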
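The evaluation rule in the Experiment Setup row (answers compared to the reference with a relative tolerance of 0.01) can be made concrete with a short sketch. The paper states only the tolerance, not the comparison function, so using `math.isclose` here is an assumption about the exact check.

```python
import math

def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Count an answer as correct if it is within 1% relative tolerance
    of the reference value (assumed comparison; the paper only gives 0.01)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

def accuracy(predictions, references, rel_tol: float = 0.01) -> float:
    """Fraction of predictions within the relative tolerance of their references."""
    hits = sum(is_correct(p, r, rel_tol) for p, r in zip(predictions, references))
    return hits / len(references)

# 1.005 is within 1% of 1.0 (correct); 2.5 is not within 1% of 3.0 (incorrect).
score = accuracy([1.005, 2.5], [1.0, 3.0])  # -> 0.5
```

`math.isclose` treats `rel_tol` relative to the larger magnitude of the two values, which is a common and symmetric choice for numeric answer grading.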