LLM-Augmented Chemical Synthesis and Design Decision Programs

Authors: Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, Chao Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design." "4. Experiments ... We present the retrosynthesis planning results in Table 2. ... We conducted several ablation studies to evaluate different design choices: route formats, the use of molecule RAG, reward signals, EA parameters, and prompt robustness. The results are shown in Table 3."
Researcher Affiliation | Academia | "1 Georgia Tech, 2 École Polytechnique Fédérale de Lausanne (EPFL), 3 National Centre of Competence in Research (NCCR) Catalysis, 4 Harvard University, 5 Cornell University. Correspondence to: Haorui Wang <EMAIL>."
Pseudocode | Yes | "Algorithm 1: LLM-Syn-Planner. Data: the target molecule T; the reward function F; the evaluation function E; the population size n_c; the retrieval size n_o; the route retrieval set O; the maximum number of attempts (budget). Result: found synthesis route population P."
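The algorithm's data/result signature suggests an evolutionary loop: seed a population of candidate routes from retrieved references, then iteratively propose variants and keep the highest-reward ones until a valid route is found or the attempt budget runs out. A minimal sketch under that reading; `propose_route` and `mutate_route` are hypothetical stand-ins for the paper's LLM prompting steps, not its actual interface:

```python
import random

def llm_syn_planner(T, F, E, n_c, n_o, O, budget, propose_route, mutate_route):
    """Evolutionary search for synthesis routes (sketch of Algorithm 1).

    T: target molecule; F: reward function; E: evaluation (route-validity) function;
    n_c: population size; n_o: retrieval size; O: route retrieval set;
    budget: maximum number of attempts.
    propose_route / mutate_route: stand-ins for the LLM calls.
    """
    references = random.sample(O, min(n_o, len(O)))   # retrieve reference routes
    population = [propose_route(T, references) for _ in range(n_c)]
    for _ in range(budget):
        if any(E(route) for route in population):     # a solved route exists: stop
            break
        children = [mutate_route(route, references) for route in population]
        # Keep the n_c highest-reward routes among parents and children.
        population = sorted(population + children, key=F, reverse=True)[:n_c]
    solved = [route for route in population if E(route)]
    return solved or population
```

With toy stand-ins (routes as integers climbing toward a target), the loop converges to the solved population as expected; the real planner would replace these with LLM-generated and LLM-mutated routes scored by the paper's reward function.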
Open Source Code | Yes | "Our code is available at https://github.com/zoom-wang112358/LLM-Syn-Planner."
Open Datasets | Yes | "Dataset. We conduct experiments using the USPTO (Schneider et al., 2016; Dai et al., 2019) and Pistachio (pis) datasets. For USPTO, we utilize USPTO-190 (Chen et al., 2020) and a simplified subset, USPTO-EASY, which is randomly sampled from the test set used in Retro* single-step model training. For the Pistachio dataset, we adopt the version from (Yu et al., 2024) but remove the starting material constraints."
Dataset Splits | No | "The route database is constructed using the training and validation sets from Retro*, while the reaction database is a processed version of USPTO-Full, as used in (Yu et al., 2024). For the building block set, we canonicalize all SMILES strings from the 23 million purchasable building blocks available in eMolecules, following the approach of (Chen et al., 2020). We show the statistics of the datasets in Appendix A.1." The paper does not provide explicit training/validation/test split percentages or sample counts for the datasets used in its experiments.
Hardware Specification | No | "Our experiments utilized the GPT-4o model and the DeepSeek-V3 model. The GPT-4o model refers to the GPT-4o checkpoint from 2024-08-06. All GPT-4o checkpoints were hosted on Microsoft Azure." This describes the models used and where they were hosted, but does not provide specific hardware details such as GPU/CPU models, memory, or processor types.
Software Dependencies | No | "At the molecule level, we validate whether the molecules in the molecule set are both valid (RDKit parsable) and purchasable. For single-step models, we use the checkpoints from syntheseus. We utilize GPT-4o (Hurst et al., 2024) and DeepSeek-V3 (Guo et al., 2025) as our LLMs." The paper mentions RDKit, syntheseus, GPT-4o, and DeepSeek-V3 but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "Configuration. We utilize GPT-4o (Hurst et al., 2024) and DeepSeek-V3 (Guo et al., 2025) as our LLMs and set the temperature to 0.7 for all queries, ensuring a balanced trade-off between creativity and reliability. To maintain efficiency, we impose a maximum search time of 60 minutes per molecule. N denotes the model call limit. In the MCTS algorithm, we employ a basic reward function: a state receives a reward of 1.0 if all molecules are purchasable (i.e., the state is solved), and 0.0 otherwise. The value function is set to a constant 0.5. For the policy, we use softmax values derived from the single-step reaction model, scaled by a temperature of 3.0 and normalized across the total number of reactions. In the Retro* algorithm, we follow the retro*-0 variant described in the original paper (Chen et al., 2020). The OR-node cost function assigns a cost of 0 to purchasable molecules and infinity otherwise. The AND-node cost function defines the reaction cost as -log(softmax) of the reaction model output, thresholded at a minimum value. For the search heuristic (value function), we use a constant value of 0, consistent with the retro*-0 algorithm."
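The baseline policy and cost definitions quoted above can be written out directly. A sketch, assuming the single-step model returns raw scores (logits) per candidate reaction; the minimum-cost threshold value here is illustrative, since the paper's quote does not state it:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over single-step model scores."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mcts_policy(logits):
    """MCTS prior: softmax of the reaction model scores at temperature 3.0,
    normalized across all candidate reactions."""
    return softmax(logits, temperature=3.0)

def and_node_cost(logits, idx, min_cost=1e-3):
    """Retro* AND-node cost: -log(softmax) of the chosen reaction,
    thresholded at a minimum value (threshold chosen here for illustration)."""
    probs = softmax(logits)
    return max(-math.log(probs[idx]), min_cost)

def or_node_cost(is_purchasable):
    """Retro* OR-node cost: 0 for purchasable molecules, infinity otherwise."""
    return 0.0 if is_purchasable else math.inf
```

The temperature of 3.0 flattens the policy relative to a plain softmax, spreading exploration across more candidate reactions, while the -log(softmax) cost makes low-probability reactions expensive for the Retro* search.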