DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Authors: Montgomery Bohde, Mrunali Manjrekar, Runzhong Wang, Shuiwang Ji, Connor W. Coley

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size." |
| Researcher Affiliation | Academia | "1 Massachusetts Institute of Technology, Cambridge, MA, United States; 2 Texas A&M University, College Station, TX, United States." |
| Pseudocode | No | The paper describes the diffusion process using mathematical formulations (e.g., Equations 1–5) and prose, but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | "DiffMS code is publicly available at https://github.com/coleygroup/DiffMS." |
| Open Datasets | Yes | "We evaluate DiffMS on two common open-source de novo generation benchmark datasets, NPLIB1 (Dührkop et al., 2021a) and MassSpecGym (Bushuiev et al., 2024). ... To this end, we build a pretraining dataset consisting of 2.8M fingerprint-molecule pairs sampled from the DSSTox (CCTE, 2019), HMDB (Wishart et al., 2021), COCONUT (Sorokina et al., 2021), and MOSES (Polykovskiy et al., 2020) datasets." |
| Dataset Splits | No | The paper mentions removing NPLIB1 and MassSpecGym test and validation molecules from the pretraining dataset and discusses properties of the MassSpecGym test set, but it does not provide the specific percentages or sample counts for the training, validation, and test splits needed to reproduce the data partitioning. |
| Hardware Specification | Yes | "DiffMS is a relatively lightweight model, and all experiments were run on NVIDIA 2080ti GPUs with 12 GB of memory." |
| Software Dependencies | No | The paper mentions specific optimizers (AdamW, RAdam) and a cosine annealing learning-rate scheduler with citations, but does not provide version numbers for these components or for other libraries (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | "We pretrain the decoder for 100 epochs using the AdamW optimizer ... We pretrain the encoder for 100 epochs ... We finetune DiffMS for 50 epochs on NPLIB1 and 15 epochs on MassSpecGym." |
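The reported training recipe (AdamW with a cosine annealing learning-rate schedule, 100 pretraining epochs, and 50/15 finetuning epochs) can be sketched as below. This is a minimal illustration of the cited schedule, not code from the DiffMS repository; the learning-rate bounds `lr_max` and `lr_min` are illustrative assumptions, as the paper excerpt does not state them.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-6):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs.

    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
    """
    progress = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Epoch counts reported in the paper; lr bounds are illustrative.
PRETRAIN_EPOCHS = 100                       # decoder and encoder pretraining
FINETUNE_EPOCHS = {"NPLIB1": 50, "MassSpecGym": 15}

pretrain_schedule = [
    cosine_annealing_lr(e, PRETRAIN_EPOCHS) for e in range(PRETRAIN_EPOCHS + 1)
]
```

In practice this schedule would be handed to an optimizer step-by-step (e.g., via a framework's built-in cosine annealing scheduler); the closed form above just makes the reported setup concrete.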