Efficient Evolutionary Search Over Chemical Space with Large Language Models

Authors: Haorui Wang, Marta Skreta, Cher-Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alan Aspuru-Guzik, Kirill Neklyudov, Chao Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings.
Researcher Affiliation Collaboration 1Georgia Institute of Technology, 2University of Toronto, 3Vector Institute, 4Massachusetts Institute of Technology, 5University of Wuppertal, 6Deep Principle Inc., 7University of California, Los Angeles, 8Cornell University, 9Université de Montréal, 10Mila Quebec AI Institute
Pseudocode Yes This process is outlined in Algorithm 1.
Open Source Code Yes Our code is available at https://github.com/zoom-wang112358/MOLLEO.
Open Datasets Yes We evaluate MOLLEO on 26 tasks from two molecular generation benchmarks, Practical Molecular Optimization (PMO) (Gao et al., 2022) and Therapeutics Data Commons (TDC) (Huang et al., 2021). [...] For the initial population of molecules, we randomly sample 120 molecules from ZINC 250K (Sterling & Irwin, 2015). [...] We took ZINC20 (Irwin et al., 2020), a database of 1.4 billion compounds that were used to generate the training set for BioT5, and PubChem (Kim et al., 2023) (250K molecules), which was used to generate the training set for MoleculeSTM, and checked if the final molecules for the JNK3 task from each model appeared in the respective datasets.
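The dataset-membership check described above (whether final JNK3 molecules already appear in ZINC20 or PubChem) can be sketched as a canonical-SMILES set lookup. This is an illustrative sketch, not the paper's code: `membership_report` and the `canonicalize` hook are hypothetical names, and in practice one would canonicalize with a cheminformatics toolkit such as RDKit (`Chem.MolToSmiles`) rather than the identity default used here.

```python
def membership_report(final_smiles, dataset_smiles, canonicalize=lambda s: s):
    """Split generated molecules into those found in a reference dataset
    and those not found.

    `canonicalize` should map every SMILES string to a canonical form so
    that syntactically different strings for the same molecule compare
    equal; the identity default is only valid if both inputs are already
    canonicalized with the same toolkit.
    """
    reference = {canonicalize(s) for s in dataset_smiles}
    seen = [s for s in final_smiles if canonicalize(s) in reference]
    novel = [s for s in final_smiles if canonicalize(s) not in reference]
    return seen, novel
```

With pre-canonicalized strings, `membership_report(["CCO", "c1ccccc1"], ["CCO", "CCN"])` reports ethanol as seen and benzene as novel.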
Dataset Splits No The paper describes an optimization algorithm that uses datasets like ZINC 250K for initial molecule pools and PMO/TDC tasks for evaluation. However, it does not specify conventional training/test/validation splits (e.g., percentages or exact counts) for reproducing dataset partitioning in a machine learning model training context. The focus is on the optimization process and its performance over oracle calls, rather than on fixed data splits for model generalization.
Hardware Specification Yes Our experiments were computed on NVIDIA A100-SXM4-80GB and T4V2 GPUs.
Software Dependencies Yes All GPT-4 checkpoints were hosted on Microsoft Azure (*.openai.azure.com); model versions are documented at https://platform.openai.com/docs/models.
Experiment Setup Yes For the choice of hyperparameters, we use the best practices from Graph-GA (Jensen, 2019), the baseline genetic algorithm upon which we build our method. We kept the best hyperparameters determined in Gao et al. (2022). In each iteration, Graph-GA samples two molecules with probability proportional to their fitnesses for crossover and mutation, and then randomly mutates each offspring with probability pm = 0.067. This process is repeated to generate 70 offspring. The fitnesses of the offspring are measured, and the top 120 most fit molecules in the entire pool are kept for the next generation. For docking experiments, we reduce the number of generated offspring to 7 and the population size to 12 due to long experiment runtimes. We set the maximum number of oracle calls to 10,000 for all experiments except docking, where we set it to 1,000. We kept the default early-stopping criterion the same as in PMO (Gao et al., 2022): we terminate the algorithm if the mean score of the top 100 molecules does not increase by at least 1e-3 within five epochs.
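The generation loop and early-stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `crossover` and `mutate` are hypothetical stand-ins for Graph-GA's molecular graph operators, and `step`/`should_stop` are names introduced here for clarity. The default hyperparameters mirror those quoted in the setup (70 offspring, mutation probability 0.067, population size 120, 1e-3 improvement threshold over five epochs).

```python
import random

# Hypothetical stand-ins for Graph-GA's crossover and mutation operators,
# which act on molecular graphs rather than plain strings.
def crossover(a, b):
    return a[: len(a) // 2] + b[len(b) // 2 :]

def mutate(m):
    return m[::-1]

def step(pool, fitness, n_offspring=70, p_mut=0.067, pop_size=120):
    """One generation: sample parent pairs with probability proportional
    to fitness, cross them over, mutate each child with probability
    p_mut, then keep the top pop_size molecules from the combined pool."""
    weights = [fitness(m) for m in pool]
    children = []
    for _ in range(n_offspring):
        p1, p2 = random.choices(pool, weights=weights, k=2)
        child = crossover(p1, p2)
        if random.random() < p_mut:
            child = mutate(child)
        children.append(child)
    combined = list(dict.fromkeys(pool + children))  # deduplicate, keep order
    return sorted(combined, key=fitness, reverse=True)[:pop_size]

def should_stop(top100_means, min_delta=1e-3, patience=5):
    """PMO-style early stopping: terminate once the mean score of the
    top-100 molecules has not improved by at least min_delta within
    `patience` epochs."""
    if len(top100_means) <= patience:
        return False
    return top100_means[-1] - top100_means[-1 - patience] < min_delta
```

For the docking runs, the same loop would be called with `n_offspring=7` and `pop_size=12`, per the reduced settings quoted above.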