ReactGPT: Understanding of Chemical Reactions via In-Context Tuning

Authors: Zhe Chen, Zhe Fang, Wenhao Tian, Zhaoguang Long, Changzhi Sun, Yuefeng Chen, Hao Yuan, Honglin Li, Man Lan

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the effectiveness of ReactGPT on reaction captioning and experimental procedure prediction, both of which reflect the understanding of chemical reactions. Experimental results show that, compared to previous models, ReactGPT exhibits competitive capabilities in resolving chemical reactions and generating high-quality text with correct structure. We conduct comprehensive experiments comparing our proposed approaches with existing methods on two tasks: reaction captioning and experimental procedure prediction. Experimental Setting Data We employ the OpenExp dataset (Liu et al. 2024b) for fine-tuning and evaluation. Evaluation Metrics To evaluate the understanding of chemical reactions, we employ BLEU (Papineni et al. 2002), ROUGE (Lin 2004), METEOR (Banerjee and Lavie 2005), and the normalized Levenshtein similarity (Levenshtein et al. 1966) to assess the quality of generations. Table 1 displays the comparison results between our model and other baselines. Ablation Study To evaluate the effectiveness of different components, we compare ReactGPT with its variants on the reaction captioning task.
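The normalized Levenshtein similarity cited above can be computed from the classic edit distance divided by the longer string's length. The sketch below is an illustrative implementation of this standard formulation; the function names are hypothetical and not taken from the paper's code.

```python
# Illustrative sketch of normalized Levenshtein similarity (an assumption
# about the standard definition, not the paper's evaluation code).

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```

For example, `normalized_levenshtein_similarity("kitten", "sitting")` yields 1 - 3/7, since the edit distance between the two strings is 3.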
Researcher Affiliation Collaboration 1School of Computer Science and Technology, East China Normal University, Shanghai, China 2Innovation Center for Artificial Intelligence and Drug Discovery, East China Normal University, Shanghai, China 3Institute of Artificial Intelligence (TeleAI), China Telecom 4Shenzhen Transsion Holdings Co., Ltd. EMAIL EMAIL, EMAIL, EMAIL EMAIL, EMAIL
Pseudocode No The paper describes the methodology in prose and utilizes a flowchart in Figure 2 to illustrate the framework, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code No The paper references third-party tools and models such as LLaMA3 (https://github.com/meta-llama/llama3), Hugging Face Transformers (https://github.com/huggingface/transformers), and LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory), but it does not provide an explicit statement or a link to the source code for the ReactGPT methodology described in this paper.
Open Datasets Yes We employ the OpenExp dataset (Liu et al. 2024b) for fine-tuning and evaluation. It consists of 274,439 chemical reactions with the corresponding captions and experimental procedures, which have been filtered and processed from the chemical reaction databases USPTO-Applications (Lowe 2017) and ORD (Kearnes et al. 2021).
Dataset Splits No The paper states: 'For our evaluation, we focus on the test split while using the training set as the local database to retrieve k-shot context examples for In-Context Tuning.' However, it does not specify the exact percentages, absolute sample counts, or detailed methodology for these training and test splits, nor does it cite where these splits are defined in detail within the OpenExp dataset.
Hardware Specification Yes All our experiments are performed on 2 NVIDIA A100-80G GPUs.
Software Dependencies No The paper mentions using 'LLaMA3-8B-Instruct' and that 'The entire project is based on the LLaMA-Factory', as well as 'Hugging Face Transformers' and 'DeepSpeed ZeRO stage 2', but it does not provide specific version numbers for these software libraries or frameworks required for replication.
Experiment Setup Yes We adopt the AdamW optimizer, set the learning rate to 5e-5, the batch size to 4, and the maximum input length to 4096 tokens. In the decoding strategy, the temperature is set to 0.95, top-p to 0.95, and top-k to 5.
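The reported decoding settings (temperature 0.95, top-p 0.95, top-k 5) correspond to standard temperature-scaled nucleus sampling. The sketch below shows how such logit filtering works in plain Python; it is an assumption about the conventional sampling pipeline, not the paper's actual inference code, and the function name is hypothetical.

```python
# Sketch of temperature + top-k + top-p (nucleus) logit filtering,
# using the settings quoted in the paper. Assumed standard sampling
# behavior, not the authors' implementation.
import math

def filter_logits(logits, temperature=0.95, top_k=5, top_p=0.95):
    """Return a token-index -> probability dict after temperature
    scaling, top-k truncation, and top-p (nucleus) truncation."""
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    ranked = sorted(((p / z, i) for i, p in enumerate(exps)), reverse=True)
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in ranked:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens, then sample from this dict.
    total = sum(p for _, p in kept)
    return {i: p / total for i, p in kept}
```

A token would then be drawn from the returned distribution at each generation step; with a sharply peaked logit vector, the nucleus often collapses to just one or two candidates.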