DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra
Authors: Montgomery Bohde, Mrunali Manjrekar, Runzhong Wang, Shuiwang Ji, Connor W. Coley
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology, Cambridge, MA, United States 2Texas A&M University, College Station, TX, United States. |
| Pseudocode | No | The paper describes the diffusion process using mathematical formulations (e.g., equations 1-5) and prose, but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | DiffMS code is publicly available at https://github.com/coleygroup/DiffMS. |
| Open Datasets | Yes | We evaluate DiffMS on two common open-source de novo generation benchmark datasets, NPLIB1 (Dührkop et al., 2021a) and MassSpecGym (Bushuiev et al., 2024). ... To this end, we build a pretraining dataset consisting of 2.8M fingerprint-molecule pairs sampled from DSSTox (CCTE, 2019), HMDB (Wishart et al., 2021), COCONUT (Sorokina et al., 2021), and MOSES (Polykovskiy et al., 2020) datasets. |
| Dataset Splits | No | The paper mentions removing NPLIB1 and MassSpecGym test and validation molecules from the pretraining dataset and discusses properties of the MassSpecGym test set, but it does not provide specific percentages or sample counts for training, validation, and test splits needed for reproducibility of data partitioning. |
| Hardware Specification | Yes | DiffMS is a relatively lightweight model, and all experiments were run on NVIDIA 2080ti GPUs with 12 GB of memory. |
| Software Dependencies | No | The paper mentions using specific optimizers (AdamW, RAdam) and a cosine annealing learning rate scheduler with citations, but does not provide specific version numbers for these software components or other libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We pretrain the decoder for 100 epochs using the AdamW optimizer ... We pretrain the encoder for 100 epochs ... We finetune DiffMS for 50 epochs on NPLIB1 and 15 epochs on MassSpecGym. |
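The optimizer and schedule named in the Experiment Setup row can be sketched as follows. This is a minimal illustration assuming PyTorch, with a stand-in linear module in place of the DiffMS decoder; the learning rate and the scheduler's `T_max` are assumptions, not values reported by the paper.

```python
# Minimal sketch of the reported training configuration: AdamW with a
# cosine annealing learning-rate schedule, run for the 100 pretraining
# epochs the paper states. The model, lr, and T_max are illustrative.
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the DiffMS decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):  # "pretrain the decoder for 100 epochs"
    # ... one epoch of training would go here ...
    optimizer.step()      # placeholder step so the scheduler can advance
    scheduler.step()      # anneal the learning rate once per epoch
```

With `T_max=100`, the learning rate follows a cosine curve from its initial value down toward zero over the 100 epochs.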