MADGEN: Mass-Spec attends to De Novo Molecular generation
Authors: Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's performance with a predictive scaffold retriever and with an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever. |
| Researcher Affiliation | Academia | Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun Department of Computer Science Tufts University EMAIL |
| Pseudocode | No | The paper describes the methodology and model architectures using text, mathematical formulations, and diagrams (Figure 1, Figure 2) but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Our code is available at https://github.com/HassounLab/MADGEN |
| Open Datasets | Yes | We evaluate the performance of MADGEN on three datasets (Table 1). The NIST23 dataset (National Institute of Standards and Technology (NIST), 2023)... The CANOPUS dataset... The newly developed MassSpecGym benchmark dataset (Bushuiev et al., 2024) is collected from many public reference spectral databases and curated uniformly. |
| Dataset Splits | Yes | The NIST23 and CANOPUS datasets were split into training, validation, and test sets based on the scaffold, ensuring that scaffolds are unique to each split. This split prevents data leakage and ensures robust evaluation of model performance. For MassSpecGym, we utilized the split suggested by the benchmark (Bushuiev et al., 2024), which is based on the Maximum Common Edge Substructure (MCES). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions tools like RDKit but does not provide specific version numbers for software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | The model was trained using a graph transformer with 5 layers and 50 diffusion steps. We employed the AdamW optimizer with a learning rate of 1×10⁻⁵. Full training details and hyperparameters can be found in Appendix A.2. Appendix A.2: The model is trained with a batch size of 64 and employed 47 workers for data loading. The learning rate is set to 2×10⁻⁴, while weight decay is configured at 1×10⁻¹². Training proceeds for 2000 epochs, with the model logging progress every 40 steps. A Markov bridge process with 100 steps is employed during training, and a cosine noise schedule is employed. The model consists of 5 layers, with node, edge, and spectral features set at 64 dimensions each. The MLP hidden dimensions are configured to 256 for nodes, 128 for edges, and 256 for spectral features. The model also employs 8 attention heads for cross-attention and self-attention mechanisms. The feedforward dimensions are set to 256 for nodes, 128 for edges, and 128 for global features. |
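The scaffold-based split described in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `scaffold_fn` stands in for whatever maps a molecule to its scaffold key (in practice something like RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, since the paper mentions RDKit); the greedy group assignment is a common way to keep each scaffold in exactly one split.

```python
# Hedged sketch of a scaffold-based train/val/test split (illustrative only;
# function and parameter names are assumptions, not taken from the paper).
from collections import defaultdict


def scaffold_split(items, scaffold_fn, frac_train=0.8, frac_val=0.1):
    """Group items by scaffold so each scaffold lands in exactly one split,
    preventing the leakage the report describes."""
    groups = defaultdict(list)
    for idx, item in enumerate(items):
        groups[scaffold_fn(item)].append(idx)
    # Assign the largest scaffold groups first to keep splits near target sizes.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(items)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

Because assignment happens per scaffold group rather than per molecule, no scaffold can appear in more than one split, which is the property the report highlights.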
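The Experiment Setup row states that a cosine noise schedule is used with a 100-step Markov bridge but does not give its exact form. A minimal sketch, assuming the widely used cosine schedule of Nichol and Dhariwal (squared-cosine cumulative coefficient with a small offset `s`):

```python
# Hedged sketch: a cosine noise schedule over T discrete steps. The exact
# schedule used by MADGEN is not specified in the report; this is a standard
# formulation chosen as an assumption.
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal-retention coefficient at step t of T."""
    return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(T=100, max_beta=0.999):
    """Per-step noise rates derived from consecutive alpha_bar ratios,
    clipped at max_beta to avoid a degenerate final step."""
    betas = []
    for t in range(1, T + 1):
        beta = 1 - cosine_alpha_bar(t, T) / cosine_alpha_bar(t - 1, T)
        betas.append(min(beta, max_beta))
    return betas
```

With T=100 this yields small noise rates early in the bridge that grow smoothly toward the final step, the qualitative behavior a cosine schedule is chosen for.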