MADGEN: Mass-Spec attends to De Novo Molecular generation

Authors: Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's performance with a predictive scaffold retriever and with an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
Researcher Affiliation | Academia | Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun, Department of Computer Science, Tufts University, EMAIL
Pseudocode | No | The paper describes the methodology and model architectures using text, mathematical formulations, and diagrams (Figure 1, Figure 2) but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Our code is available at https://github.com/HassounLab/MADGEN
Open Datasets | Yes | We evaluate the performance of MADGEN on three datasets (Table 1). The NIST23 dataset (National Institute of Standards and Technology (NIST), 2023)... The CANOPUS dataset... The newly developed MassSpecGym benchmark dataset (Bushuiev et al., 2024) is collected from many public reference spectral databases and curated uniformly.
Dataset Splits | Yes | The NIST23 and CANOPUS datasets were split into training, validation, and test sets based on the scaffold, ensuring that scaffolds are unique to each split. This split prevents data leakage and ensures robust evaluation of model performance. For MassSpecGym, we utilized the split suggested by the benchmark (Bushuiev et al., 2024), which is based on the Maximum Common Edge Substructure (MCES).
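The scaffold-unique split described in this row can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `scaffold_split` helper is our own name, and it assumes scaffold strings have already been computed (in practice one would use e.g. RDKit's Murcko scaffold utilities).

```python
import random
from collections import defaultdict

def scaffold_split(records, frac_train=0.8, frac_valid=0.1, seed=0):
    """Split (smiles, scaffold) pairs so each scaffold lands in exactly one split.

    The scaffold string is assumed to be precomputed (e.g. via RDKit's
    MurckoScaffold module); only the grouping logic is shown here.
    """
    groups = defaultdict(list)
    for smiles, scaffold in records:
        groups[scaffold].append(smiles)

    scaffolds = sorted(groups)                 # deterministic base order
    random.Random(seed).shuffle(scaffolds)     # seeded shuffle of scaffold groups

    n = len(records)
    train, valid, test = [], [], []
    for sc in scaffolds:
        # Assign the whole scaffold group to a single split, so no scaffold
        # ever appears in more than one of train/valid/test.
        if len(train) < frac_train * n:
            train.extend(groups[sc])
        elif len(valid) < frac_valid * n:
            valid.extend(groups[sc])
        else:
            test.extend(groups[sc])
    return train, valid, test
```

Because assignment happens per scaffold group rather than per molecule, the split fractions are approximate, but no scaffold can leak across splits.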
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions tools like RDKit but does not provide specific version numbers for software dependencies or libraries used for implementation.
Experiment Setup | Yes | The model was trained using a graph transformer with 5 layers and 50 diffusion steps. We employed the AdamW optimizer with a learning rate of 1×10^-5. Full training details and hyperparameters can be found in Appendix A.2. Appendix A.2: The model is trained with a batch size of 64 and employed 47 workers for data loading. The learning rate is set to 2×10^-4, while weight decay is configured at 1×10^-12. Training proceeds for 2000 epochs, with the model logging progress every 40 steps. A Markov bridge process with 100 steps is employed during training, and a cosine noise schedule is employed. The model consists of 5 layers, with node, edge, and spectral features set at 64 dimensions each. The MLP hidden dimensions are configured to 256 for nodes, 128 for edges, and 256 for spectral features. The model also employs 8 attention heads for cross-attention and self-attention mechanisms. The feedforward dimensions are set to 256 for nodes, 128 for edges, and 128 for global features.
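The Appendix A.2 hyperparameters quoted in this row can be collected into a single configuration object for reference. This is a sketch only: the class and field names are our own, not taken from the MADGEN codebase, and the values simply transcribe what the appendix reports.

```python
from dataclasses import dataclass

@dataclass
class MADGenTrainConfig:
    """Hyperparameters as reported in MADGEN's Appendix A.2 (names illustrative)."""
    batch_size: int = 64
    num_workers: int = 47          # data-loading workers
    lr: float = 2e-4               # appendix value; main text reports 1e-5
    weight_decay: float = 1e-12
    epochs: int = 2000
    log_every: int = 40            # logging interval in steps
    bridge_steps: int = 100        # Markov bridge steps, cosine noise schedule
    n_layers: int = 5
    node_dim: int = 64             # node feature dimension
    edge_dim: int = 64             # edge feature dimension
    spectral_dim: int = 64         # spectral feature dimension
    node_mlp_dim: int = 256
    edge_mlp_dim: int = 128
    spectral_mlp_dim: int = 256
    n_heads: int = 8               # cross- and self-attention heads
    node_ff_dim: int = 256
    edge_ff_dim: int = 128
    global_ff_dim: int = 128
```

Note the discrepancy the quote itself carries: the main text mentions 50 diffusion steps and a learning rate of 1×10^-5, while the appendix specifies 100 Markov bridge steps and 2×10^-4; the sketch follows the appendix.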