MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra

Authors: Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra. We conduct experiments to evaluate MolSpectra... We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. We conducted an ablation study on them.
Researcher Affiliation | Collaboration | (1) New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA); (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) DAMO Academy, Alibaba Group
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies in narrative text and mathematical equations.
Open Source Code | Yes | The code is released at https://github.com/AzureLeon1/MolSpectra
Open Datasets | Yes | As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules... The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol.
Dataset Splits | Yes | The QM9 dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining over 10k molecules. ... We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest.
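The 110k/10k/rest QM9 split quoted above can be sketched as an index partition. This is a minimal illustration, not the authors' code: the random seed, the shuffling scheme, and the function name `qm9_split` are assumptions, and the exact index assignment in the released repository may differ.

```python
import numpy as np

def qm9_split(n_total=134000, n_train=110000, n_valid=10000, seed=0):
    """Partition molecule indices into train/valid/test sets.

    Mirrors the 110k/10k/remainder split described in the paper; the
    seed and random permutation are illustrative, not the paper's
    exact split.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

train, valid, test = qm9_split()
print(len(train), len(valid), len(test))  # 110000 10000 14000
```

With the full QM9 size of roughly 134k molecules, the leftover test set is on the order of 10k+, consistent with "the remaining over 10k molecules".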
Hardware Specification | Yes | Our experiments are conducted on Linux servers equipped with 184 Intel Xeon Platinum 8469C CPUs, 920GB RAM, and 8 NVIDIA H20 96GB GPUs.
Software Dependencies | Yes | Our model is implemented in PyTorch version 2.3.1, PyTorch Geometric version 2.6.1 (https://pyg.org/) with CUDA version 12.1, and Python 3.10.14.
Experiment Setup | Yes | We tune the mask ratio (i.e., α) in {0.05, 0.10, 0.15, 0.20, 0.25, 0.30}, tune the stride/patch length pair (i.e., Di/Pi) in {5/20, 10/20, 15/20, 20/20, 8/16, 15/30}, and tune the weights of sub-objectives (i.e., βDenoising, βMPR, and βContrast) in {0.01, 0.1, 1, 10}. Based on the results of hyper-parameter tuning, we adopt α = 0.10, Di = 10, Pi = 20, βDenoising = 1.0, βMPR = 1.0, and βContrast = 1.0. In our method, the batch size is bs = 128. The noise is added to atom positions as a scaled mixture of isotropic Gaussian noise, with a scaling factor of 0.04.
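The coordinate-noising step quoted above can be sketched as follows. This is a simplified single-Gaussian version of the "scaled mixture of isotropic Gaussian noise" (scaling factor 0.04); the function name `add_coordinate_noise` and the NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def add_coordinate_noise(positions, scale=0.04, rng=None):
    """Perturb 3D atom positions with scaled isotropic Gaussian noise.

    `positions` is an (n_atoms, 3) coordinate array. A single Gaussian
    component is used here as a simplification of the paper's mixture;
    the 0.04 scaling factor matches the quoted setup. The returned
    noise is the regression target for denoising pre-training.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = scale * rng.standard_normal(positions.shape)
    return positions + noise, noise

pos = np.zeros((5, 3))                      # toy 5-atom molecule
noisy, target = add_coordinate_noise(pos, rng=np.random.default_rng(0))
```

In the denoising objective, the model sees `noisy` and is trained to predict `target` (equivalently, to recover the clean positions).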