Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages

Authors: Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Jie Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows. Code is available at https://github.com/shiningsunnyday/induction. ... Demonstrating that FMG outperforms existing state-of-the-art methods on popular molecular generation benchmarks in terms of superior data efficiency, diversity, and synthesizability ... Evaluating FMG's step-by-step reasoning via comprehensive case studies and quantitative analysis.
Researcher Affiliation | Collaboration | 1MIT CSAIL, 2MIT Chemistry, 3University of Notre Dame, 4MIT-IBM Watson AI Lab, IBM Research. Correspondence to: Michael Sun <EMAIL>.
Pseudocode | No | The paper describes the FMG algorithm and its steps (e.g., in Section 3 and Figure 1) with detailed explanations in prose and diagrams. However, it does not provide any explicitly labeled pseudocode blocks or formal algorithm structures.
Open Source Code | Yes | Code is available at https://github.com/shiningsunnyday/induction.
Open Datasets | Yes | Datasets. We evaluate on three small monomer datasets used by Guo et al. (2022b) curated from literature, as well as two real-world datasets from the photovoltaic and toxicology domains used by Sun et al. (2024). ... We trained FMG on a 1k subset (0.05%) of the refined ZINC dataset used by the MOSES benchmark (Polykovskiy et al., 2020).
Dataset Splits | Yes | We do an 80-20 train-val split of the dataset and finetune until the validation loss converges. ... For our Small Datasets, there are as few as 11 samples, making (FT) extremely difficult. We instead adapt pretrained checkpoints to sample in the posterior distribution of the dataset. ... We trained FMG on a 1k subset (0.05%) of the refined ZINC dataset used by the MOSES benchmark (Polykovskiy et al., 2020).
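The 80-20 train-val split quoted above can be sketched generically; this is a minimal illustration, not the paper's code, and the function name and seeded shuffle are assumptions for reproducibility's sake:

```python
import random

def train_val_split(samples, val_frac=0.2, seed=0):
    """Shuffle a dataset and split it 80-20 into train/validation sets.

    A generic sketch of the split described in the paper; the actual
    FMG implementation may differ (see the linked repository).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_val = max(1, int(len(samples) * val_frac))
    val = [samples[i] for i in idx[:n_val]]
    train = [samples[i] for i in idx[n_val:]]
    return train, val

train, val = train_val_split(list(range(100)))
# 100 samples -> 80 train / 20 validation
```

Seeding the shuffle matters for a reproducibility audit: without it, rerunning the split yields a different train/val partition and validation-loss curves are not comparable across runs.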
Hardware Specification | No | The paper mentions 'MMFMs such as GPT-4o' as the base model, indicating the type of model used, but does not specify any particular hardware (e.g., GPU models, CPU, or cloud computing instances with their specifications) on which the experiments were run or the models were trained.
Software Dependencies | No | The paper mentions several software components like 'MMFMs such as GPT-4o', 'rdkit', and 'matplotlib.pyplot' but does not provide specific version numbers for any of them, which is necessary for reproducibility.
Experiment Setup | Yes | We generate 10000 for small datasets and 1000 for HOPV/PTC, use the same Retro* parameters and adopt the same membership criteria as Guo et al. (2022b); Sun et al. (2024). ... We set K = 10 and study the performance of Top-k FMG as k increases from 1 to K. ... We use a batch size of 32 to accommodate our smaller datasets. ... We set the maximum generation length to 512.
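The Top-k sweep quoted above ("K = 10 ... as k increases from 1 to K") follows a standard pattern: score each sampled candidate, then evaluate the best k as k grows. A minimal sketch, assuming a generic `score_fn` stand-in for FMG's actual selection criterion (which the paper defines, not this snippet):

```python
def top_k_candidates(candidates, score_fn, k):
    """Rank candidates by score (descending) and keep the best k.

    Generic Top-k selection; the real FMG criterion for ranking
    generated molecules is specified in the paper.
    """
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:k]

# Sweep k from 1 to K = 10 as in the reported study.
K = 10
candidates = [0.3, 0.9, 0.1, 0.7, 0.5]  # hypothetical candidate scores
for k in range(1, K + 1):
    best = top_k_candidates(candidates, score_fn=lambda x: x, k=k)
```

Note that when k exceeds the number of candidates, the slice simply returns all of them, so the sweep is safe at the upper end.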