MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate this framework, we construct a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by Ph.D. chemists with background, inspirations, and hypothesis. The framework is able to rediscover many hypotheses with high similarity to the ground truth, successfully capturing the core innovations while ensuring no data contamination since it uses an LLM with knowledge cutoff date prior to 2024. ... We design experiments with the benchmark to test the three fundamental questions and find that LLMs are highly capable. |
| Researcher Affiliation | Collaboration | Zonglin Yang1,2, Wanhao Liu2,3, Ben Gao2,4, Tong Xie5,6, Yuqiang Li2, Wanli Ouyang2, Soujanya Poria7, Erik Cambria1, Dongzhan Zhou2 — 1 Nanyang Technological University; 2 Shanghai Artificial Intelligence Laboratory; 3 University of Science and Technology of China; 4 Wuhan University; 5 University of New South Wales; 6 Green Dynamics; 7 Singapore University of Technology and Design |
| Pseudocode | No | The paper describes the MOOSE-Chem framework's methodology using descriptive text and figures (Figure 1 and Figure 2), detailing its stages and components like the 'evolutionary unit'. However, it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | No | To evaluate this framework, we construct a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by Ph.D. chemists with background, inspirations, and hypothesis. ... The benchmark consists of 51 chemistry and material science papers and is constructed by multiple chemistry Ph.D. students. |
| Dataset Splits | No | The paper describes the construction of a benchmark (TOMATO-Chem) used for evaluation and how a literature corpus (I) is created and sampled using 'screening windows'. For example, it states, 'I is constructed by first adding the ground truth inspiration papers (around 120), then randomly selecting the remaining papers from the 3000 papers, and finally randomizing the order of all the collected papers.' However, it does not specify traditional training/validation/test splits: the benchmark is used only to evaluate the MOOSE-Chem framework, not to train it, so no such split of the benchmark exists. |
| Hardware Specification | No | All experiments are performed by GPT-4o (its training data is up to October 2023). ... Table 5 compares LLMs in different scales on inspiration retrieval ability. |
| Software Dependencies | No | All experiments are performed by GPT-4o (its training data is up to October 2023). ... To investigate whether the results and corresponding conclusions in the main text are caused by the usage of GPT-4o for automatic evaluation, here we use Claude-3.5-Sonnet and Gemini-1.5-Pro to evaluate all of the results that have been evaluated by GPT-4o. |
| Experiment Setup | Yes | The default setting for MOOSE-Chem is to perform three rounds for each b. In every other round, the number of i and h can expand exponentially. Here, we adopt beam search to select a fixed size of the top-ranked h to enter the next round. The default beam size is 15. ... Specifically, for each inference, we (1) sequentially select a fixed number of papers from I, where the fixed number is called the screening window size (default is 15); (2) set up a prompt consisting of b, the title and abstract of the selected papers from I, and the previous h (if it is not ); and (3) instruct the LLM to generate three titles from the input that can best serve as i for b (and optionally previous h), and give reasons. |
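The experiment setup quoted above describes an iterative loop: slide a fixed-size screening window over the corpus I, ask the model for the 3 best inspiration titles per window, compose new hypotheses, and use beam search to carry only the top-ranked hypotheses into the next round. A minimal sketch of that control flow, with the paper's default sizes (window 15, beam 15, 3 rounds); all function names are hypothetical and a token-overlap heuristic stands in for the GPT-4o calls the paper actually uses:

```python
# Hypothetical sketch of the screening-window + beam-search loop described in
# the paper's experiment setup. The scoring heuristic is a stand-in for the
# LLM prompts; it is NOT the authors' implementation.
from typing import List, Tuple

SCREEN_WINDOW = 15   # papers shown per inference (paper default)
BEAM_SIZE = 15       # top-ranked hypotheses kept between rounds (paper default)
N_ROUNDS = 3         # rounds per background question b (paper default)


def overlap(a: str, b: str) -> int:
    """Mock relevance score: shared lowercase tokens (stand-in for an LLM)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def screen_inspirations(background: str, corpus: List[str]) -> List[str]:
    """Slide a fixed-size window over the corpus; from each window keep the
    3 titles the (mocked) model rates as best inspirations for b."""
    selected: List[str] = []
    for start in range(0, len(corpus), SCREEN_WINDOW):
        window = corpus[start:start + SCREEN_WINDOW]
        ranked = sorted(window, key=lambda t: overlap(background, t), reverse=True)
        selected.extend(ranked[:3])
    return selected


def run_rounds(background: str, corpus: List[str]) -> List[Tuple[int, str]]:
    """Hypotheses can multiply each round, so beam search keeps only the
    top BEAM_SIZE before the next round begins."""
    hypotheses = [""]  # start with no prior hypothesis
    for _ in range(N_ROUNDS):
        candidates: List[Tuple[int, str]] = []
        for h in hypotheses:
            for insp in screen_inspirations(background, corpus):
                new_h = (h + " " + insp).strip()  # mock "compose hypothesis" step
                candidates.append((overlap(background, new_h), new_h))
        candidates.sort(key=lambda c: c[0], reverse=True)
        hypotheses = [h for _, h in candidates[:BEAM_SIZE]]
    return [(overlap(background, h), h) for h in hypotheses]
```

The exponential growth the paper mentions is visible here: each surviving hypothesis spawns one candidate per retrieved inspiration, and the beam cap is what keeps the search tractable.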