Advancing Retrosynthesis with Retrieval-Augmented Graph Generation

Authors: Anjie Qiao, Zhen Wang, Jiahua Rao, Yuedong Yang, Zhewei Wei

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | RARB demonstrates state-of-the-art performance on standard benchmarks, achieving a 14.8% relative improvement in top-1 accuracy over its base generative model, highlighting the effectiveness of retrieval augmentation. Additionally, RARB excels in handling out-of-distribution molecules, and its advantages remain significant even with smaller models or fewer denoising steps. These strengths make RARB highly valuable for real-world retrosynthesis applications, where extrapolation to novel molecules and high-throughput prediction are essential. Section 4: Experiments. We conduct our experiments using the USPTO-50k dataset.
Researcher Affiliation | Academia | (1) Sun Yat-sen University, Guangzhou, Guangdong, China; (2) Guangdong Province Key Laboratory of Computational Science; (3) Renmin University of China, Beijing, China. Email domains: EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and components but does not include any explicit pseudocode blocks or algorithms formatted as such.
Open Source Code | Yes | Our code is available at this repository: https://github.com/anjie-qiao/RARB.
Open Datasets | Yes | We conduct our experiments using the USPTO-50k dataset (Schneider, Stiefl, and Landrum 2016), adhering to the standard train/validation/test splits (Dai et al. 2019; Somnath et al. 2021). For the external dataset, we utilize data from USPTO 2001-2016 applications, comprising 1,939,254 raw reactions.
Dataset Splits | Yes | We conduct our experiments using the USPTO-50k dataset (Schneider, Stiefl, and Landrum 2016), adhering to the standard train/validation/test splits (Dai et al. 2019; Somnath et al. 2021). To assess RARB's ability to handle out-of-distribution data, we construct a more challenging dataset by applying a cluster splitting strategy (Zheng et al. 2019) to USPTO-50k. Specifically, we first use Morgan fingerprints to measure the scaffold similarities between products and employ the Butina algorithm (Butina 1999) to cluster them with a similarity threshold of 0.6. Then, we sort the clusters in descending order by size, and we split these sorted clusters into train/validation/test splits with an 8/1/1 ratio, resulting in our USPTO-50K-cluster dataset.
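The cluster-split recipe quoted above can be sketched in pure Python. This is a simplified illustration, not the paper's code: fingerprints are represented as plain sets of on-bits rather than RDKit Morgan fingerprints, and the sequential fill rule for the 8/1/1 split is an assumption (the paper specifies only the ratio and the descending size order).

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def butina_cluster(fps, sim_threshold=0.6):
    """Butina-style clustering: repeatedly promote the unassigned molecule
    with the most unassigned neighbors to a cluster centroid and assign
    those neighbors to its cluster."""
    n = len(fps)
    neighbors = {i: {j for j in range(n)
                     if j != i and tanimoto(fps[i], fps[j]) >= sim_threshold}
                 for i in range(n)}
    unassigned, clusters = set(range(n)), []
    while unassigned:
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbors[i] & unassigned))
        cluster = {centroid} | (neighbors[centroid] & unassigned)
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters

def cluster_split(fps, ratios=(8, 1, 1), sim_threshold=0.6):
    """Sort clusters by size (descending) and fill train/valid/test in
    order, so structurally similar products never straddle a split."""
    clusters = sorted(butina_cluster(fps, sim_threshold), key=len, reverse=True)
    n = sum(len(c) for c in clusters)
    total = sum(ratios)
    train_cap = ratios[0] / total * n
    valid_cap = (ratios[0] + ratios[1]) / total * n
    train, valid, test = [], [], []
    for c in clusters:
        if len(train) < train_cap:
            train.extend(c)
        elif len(train) + len(valid) < valid_cap:
            valid.extend(c)
        else:
            test.extend(c)
    return train, valid, test
```

Because whole clusters (not individual molecules) are assigned to splits, test products share at most 0.6 Tanimoto similarity with any training product, which is what makes the resulting USPTO-50K-cluster benchmark out-of-distribution.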
Hardware Specification | No | The paper mentions that "This work is conducted in part on RTAI cluster, which is supported by School of Computer Science and Engineering and Institute of Artificial Intelligence, Sun Yat-sen University." However, it does not provide specific hardware details like GPU/CPU models or memory specifications of the cluster.
Software Dependencies | No | The paper states, "We implement RARB based on the open-sourced code of Retro Bridge," but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA version).
Experiment Setup | Yes | For each input product, we sample 100 reactant sets from RARB and rank them based on their confidence scores, determined by the frequency of their occurrences. Next, we use the forward reaction prediction model Molecular Transformer (Schwaller et al. 2019) to predict the products of these top-k samples. (RQ3) Do smaller diffusion models or those with fewer denoising steps still benefit from this augmentation? To validate this, we train RARB in two conditions: (i) using only about 60% of the base model's parameters, and (ii) reducing the base generative model's denoising steps from 500 to 200. (RQ4) Diversity and Efficiency. As discussed in Sec. 3.3, we are concerned that the denoiser's overreliance on retrieval results may reduce the diversity of its generated samples. As shown in Table 5, RARB, without any specific strategy to improve diversity, generates molecules that are less diverse than those generated by its base generative model, confirming our concern. Increasing the dropout rate for the prompt extractor to 0.5 (i.e., our first strategy, S1) not only enhances RARB's diversity but also improves its accuracy, supporting our rationale of reducing the denoiser's reliance on shortcuts. Additionally, we further add our second strategy (S2) to RARB, namely, randomly selecting 3 out of the top-5 ranked molecules as the retrieval results.
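The frequency-based confidence ranking and the S2 retrieval subsampling described above can be sketched as follows. The function names and SMILES strings are illustrative, not taken from the paper's code; the only assumptions carried over from the quoted setup are "rank by occurrence frequency" and "randomly keep 3 of the top-5 retrievals."

```python
import random
from collections import Counter

def rank_by_frequency(samples, k=10):
    """Rank sampled reactant sets by occurrence count: a set the model
    draws more often gets a higher confidence score (its empirical
    frequency among all draws)."""
    counts = Counter(samples)
    total = len(samples)
    return [(reactants, count / total)
            for reactants, count in counts.most_common(k)]

def subsample_retrievals(ranked, m=5, n=3, rng=random):
    """Strategy S2: randomly keep n of the top-m ranked retrieval results,
    so the denoiser cannot always copy the single best-ranked molecule."""
    pool = ranked[:m]
    return rng.sample(pool, min(n, len(pool)))

# Toy usage: 100 draws from a model where one reactant set dominates.
samples = ["CCO.CC(=O)O"] * 60 + ["CCO.CCBr"] * 30 + ["CCN"] * 10
top = rank_by_frequency(samples, k=2)
# top[0] is the most frequent candidate with confidence 0.6
```

In the paper's pipeline, the top-k candidates ranked this way are then passed to Molecular Transformer for round-trip validation; that step is omitted here.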