MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra

Authors: Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra. We conduct experiments to evaluate MolSpectra... We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. We conducted an ablation study on them.
Researcher Affiliation | Collaboration | (1) New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA); (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) DAMO Academy, Alibaba Group
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies in narrative text and mathematical equations.
Open Source Code | Yes | The code is released at https://github.com/AzureLeon1/MolSpectra
Open Datasets | Yes | As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules... The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol.
Dataset Splits | Yes | The QM9 dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining over 10k molecules. ... We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest.
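The 110k/10k/rest QM9 split quoted above can be sketched as an index partition. This is a minimal illustration, not the authors' code: the random seed, the shuffling scheme, and the function name `qm9_split` are assumptions, and the exact index assignment in the released repository may differ.

```python
import numpy as np

def qm9_split(n_total=134000, n_train=110000, n_valid=10000, seed=0):
    """Partition molecule indices into train/valid/test sets.

    Mirrors the 110k/10k/remainder split described in the paper; the
    seed and random permutation are illustrative, not the paper's
    exact split.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

train, valid, test = qm9_split()
print(len(train), len(valid), len(test))  # 110000 10000 14000
```

With the full QM9 size of roughly 134k molecules, the leftover test set is on the order of 10k+, consistent with "the remaining over 10k molecules".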
Hardware Specification | Yes | Our experiments are conducted on Linux servers equipped with 184 Intel Xeon Platinum 8469C CPUs, 920GB RAM, and 8 NVIDIA H20 96GB GPUs.
Software Dependencies | Yes | Our model is implemented in PyTorch version 2.3.1, PyTorch Geometric version 2.6.1 (https://pyg.org/) with CUDA version 12.1, and Python 3.10.14.
Experiment Setup | Yes | We tune the mask ratio (i.e., α) in {0.05, 0.10, 0.15, 0.20, 0.25, 0.30}, tune the stride/patch length pair (i.e., Di/Pi) in {5/20, 10/20, 15/20, 20/20, 8/16, 15/30}, and tune the weights of sub-objectives (i.e., βDenoising, βMPR, and βContrast) in {0.01, 0.1, 1, 10}. Based on the results of hyper-parameter tuning, we adopt α = 0.10, Di = 10, Pi = 20, βDenoising = 1.0, βMPR = 1.0, and βContrast = 1.0. In our method, the batch size is bs = 128. The noise is added to atom positions as a scaled mixture of isotropic Gaussian noise, with a scaling factor of 0.04.
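The coordinate-noising step quoted above can be sketched as follows. This is a simplified single-Gaussian version of the "scaled mixture of isotropic Gaussian noise" (scaling factor 0.04); the function name `add_coordinate_noise` and the NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def add_coordinate_noise(positions, scale=0.04, rng=None):
    """Perturb 3D atom positions with scaled isotropic Gaussian noise.

    `positions` is an (n_atoms, 3) coordinate array. A single Gaussian
    component is used here as a simplification of the paper's mixture;
    the 0.04 scaling factor matches the quoted setup. The returned
    noise is the regression target for denoising pre-training.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = scale * rng.standard_normal(positions.shape)
    return positions + noise, noise

pos = np.zeros((5, 3))                      # toy 5-atom molecule
noisy, target = add_coordinate_noise(pos, rng=np.random.default_rng(0))
```

In the denoising objective, the model sees `noisy` and is trained to predict `target` (equivalently, to recover the clean positions).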