MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra
Authors: Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra. We conduct experiments to evaluate MolSpectra... We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. We conducted an ablation study on them. |
| Researcher Affiliation | Collaboration | 1New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA) 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3DAMO Academy, Alibaba Group |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies in narrative text and mathematical equations. |
| Open Source Code | Yes | The code is released at https://github.com/AzureLeon1/MolSpectra |
| Open Datasets | Yes | As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules... The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol. |
| Dataset Splits | Yes | The QM9 dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining molecules (over 10k). ... We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest. |
| Hardware Specification | Yes | Our experiments are conducted on Linux servers equipped with 184 Intel Xeon Platinum 8469C CPUs, 920GB RAM, and 8 NVIDIA H20 96GB GPUs. |
| Software Dependencies | Yes | Our model is implemented in PyTorch version 2.3.1, PyTorch Geometric version 2.6.1 (https://pyg.org/) with CUDA version 12.1, and Python 3.10.14. |
| Experiment Setup | Yes | We tune the mask ratio (i.e., α) in {0.05, 0.10, 0.15, 0.20, 0.25, 0.30}, tune the stride/patch length pair (i.e., Di/Pi) in {5/20, 10/20, 15/20, 20/20, 8/16, 15/30}, and tune the weights of sub-objectives (i.e., βDenoising, βMPR, and βContrast) in {0.01, 0.1, 1, 10}. Based on the results of hyper-parameter tuning, we adopt α = 0.10, Di = 10, Pi = 20, βDenoising = 1.0, βMPR = 1.0, and βContrast = 1.0. In our method, the batch size bs = 128. The noise is added to atom positions as a scaled mixture of isotropic Gaussian noise, with a scaling factor of 0.04. |
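The experiment-setup row above can be sketched in code. This is a hedged illustration under stated assumptions, not the authors' released implementation: it shows (1) perturbing atom positions with isotropic Gaussian noise using the reported scaling factor 0.04 (the paper describes a "scaled mixture" of such noise, whose exact mixing scheme is not detailed here), and (2) slicing a 1D spectrum into overlapping patches with patch length Pi = 20 and stride Di = 10, then masking a fraction α = 0.10 of the patches for the masked-patch reconstruction objective. Function names (`add_coordinate_noise`, `patchify_and_mask`) are illustrative.

```python
import torch


def add_coordinate_noise(pos: torch.Tensor, scale: float = 0.04):
    """Perturb atom positions with isotropic Gaussian noise.

    Returns the noised positions and the noise itself, which serves
    as the regression target in denoising pre-training.
    """
    noise = torch.randn_like(pos) * scale
    return pos + noise, noise


def patchify_and_mask(spectrum: torch.Tensor, patch_len: int = 20,
                      stride: int = 10, mask_ratio: float = 0.10):
    """Split a 1D spectrum into overlapping patches and mask a subset.

    unfold(0, patch_len, stride) yields (num_patches, patch_len);
    masked patches are zeroed, and the originals are the targets.
    """
    patches = spectrum.unfold(0, patch_len, stride)
    num_patches = patches.size(0)
    num_masked = max(1, int(round(mask_ratio * num_patches)))
    masked_idx = torch.randperm(num_patches)[:num_masked]
    masked = patches.clone()
    masked[masked_idx] = 0.0
    return masked, patches, masked_idx


if __name__ == "__main__":
    pos = torch.zeros(5, 3)           # 5 atoms in 3D
    noised, noise = add_coordinate_noise(pos)
    print(noised.shape)               # torch.Size([5, 3])

    spec = torch.linspace(0, 1, 200)  # toy 200-point spectrum
    masked, patches, idx = patchify_and_mask(spec)
    print(patches.shape)              # (200 - 20) // 10 + 1 = 19 patches of length 20
```

With the adopted settings, a 200-point spectrum yields 19 overlapping patches, of which round(0.10 × 19) = 2 are masked per sample.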