Neural Graph Matching Improves Retrieval Augmented Generation in Molecular Machine Learning

Authors: Runzhong Wang, Rui-Xi Wang, Mrunali Manjrekar, Connor W. Coley

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results highlight the effectiveness of our design, with MARASON achieving 28% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches. Our experimental evaluation on standard benchmarks demonstrates state-of-the-art accuracy on the mass spectrum simulation task, outperforming both RAG and non-RAG baselines, validating the effectiveness of our design strategy.
Researcher Affiliation | Academia | Massachusetts Institute of Technology, Cambridge, MA, United States. Correspondence to: Connor W. Coley <EMAIL>.
Pseudocode | No | The paper describes the methodology in regular paragraph text without explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is publicly available at https://github.com/coleygroup/ms-pred.
Open Datasets | Yes | We trained our models on the NIST (2020) dataset, with 530,640 high-energy collision-induced dissociation (HCD) spectra and 25,541 unique molecular structures. We further retrain MARASON on the recently developed open-source dataset MassSpecGym (Bushuiev et al., 2024), where we achieve state-of-the-art retrieval accuracy, as shown in Table 2.
Dataset Splits | Yes | The dataset is split into structurally disjoint 80%-10%-10% train-validate-test subsets. Following Goldman et al. (2024), we evaluate on two different splits: (1) a random split that separates distinct InChIKeys, and (2) a Murcko scaffold split that clusters molecular scaffolds, requiring more generalization to out-of-distribution structures.
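A structurally disjoint split like the one described in this row can be sketched by grouping molecules under a structural key (an InChIKey or a Murcko scaffold string) and assigning whole groups to subsets. This is a minimal stdlib-only sketch, not the authors' implementation; `disjoint_split` and `mol_to_key` are hypothetical names, and the greedy fill can leave the smaller subsets short when structural groups are large:

```python
import random
from collections import defaultdict

def disjoint_split(mol_to_key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split molecule IDs into train/val/test so that no structural key
    (e.g. InChIKey or Murcko scaffold) appears in more than one subset."""
    # Group molecule IDs by their structural key.
    groups = defaultdict(list)
    for mol_id, key in mol_to_key.items():
        groups[key].append(mol_id)

    # Shuffle keys deterministically so the split is reproducible.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n = len(mol_to_key)
    targets = [f * n for f in fractions]  # desired subset sizes
    subsets = [[], [], []]
    i = 0
    for key in keys:
        # Advance to the next subset once the current one meets its target.
        while i < 2 and len(subsets[i]) >= targets[i]:
            i += 1
        subsets[i].extend(groups[key])
    return subsets  # (train, val, test)
```

Because entire groups move together, a scaffold seen during training can never leak into validation or test, which is what makes the split "structurally disjoint."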
Hardware Specification | Yes | All experiments are conducted on a workstation with an AMD 3995WX CPU, 4 NVIDIA A5000 GPUs, and 512 GB RAM.
Software Dependencies | No | The paper mentions software such as PyTorch and pygmtools but does not provide version numbers for these dependencies, which are needed for a fully reproducible description of the ancillary software.
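When a codebase omits pinned versions, they can at least be captured from the running environment. A minimal stdlib sketch (the helper name `pin_versions` and the package list are illustrative, not from the paper):

```python
from importlib import metadata

def pin_versions(packages):
    """Return 'name==version' lines for installed distributions,
    marking any package that is not installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines

# Example: pin_versions(["torch", "pygmtools"]) would yield lines such
# as "torch==<installed version>" in an environment where they exist.
```

Writing these lines into a requirements file alongside the results is one lightweight way to make the software environment reproducible.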
Experiment Setup | Yes | We conduct an ablation study to compare matching algorithms and GNN designs on the NIST (2020) dataset under a random split, as shown in Table 3. A possible explanation for the superiority of Softmax over Sinkhorn is that Softmax is sufficient for the many-to-one aggregation path in Eq. (7) and provides better gradients because it takes fewer iterations. Sarlin et al. (2020) likewise report that Softmax outperforms Sinkhorn as the matching layer for larger graphs.
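The contrast between the two matching layers in this ablation can be illustrated directly. As a hedged, stdlib-only sketch (not the paper's implementation): row-wise Softmax normalizes the score matrix once per row, which suffices for many-to-one aggregation, while Sinkhorn alternates row and column normalization to approach a doubly stochastic matrix at the cost of extra iterations:

```python
import math

def softmax_rows(scores):
    """One-shot row-wise softmax: each row of the score matrix becomes
    a distribution over target nodes (many-to-one aggregation)."""
    out = []
    for row in scores:
        m = max(row)                       # subtract max for stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def sinkhorn(scores, n_iters=20):
    """Sinkhorn normalization: alternately normalize rows and columns of
    exp(scores) to approximate a doubly stochastic matching matrix."""
    m = [[math.exp(s) for s in row] for row in scores]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iters):
        # Row normalization.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization.
        col_sums = [sum(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
        m = [[m[i][j] / col_sums[j] for j in range(n_cols)] for i in range(n_rows)]
    return m
```

The one-shot normalization in `softmax_rows` also avoids backpropagating through repeated normalization steps, consistent with the gradient argument quoted above.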