DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training
Authors: Yurou Liu, Jiahao Chen, Rui Jiao, Jiangmeng Li, Wenbing Huang, Bing Su
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that DenoiseVAE outperforms current state-of-the-art methods on various molecular property prediction tasks, demonstrating its effectiveness. (Evidence from Section 5: Experiments, 5.1 Settings, 5.2 Main Results, 5.3 Ablation Studies) |
| Researcher Affiliation | Academia | 1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 Department of Computer Science and Technology, Tsinghua University, Beijing, China; 3 Institute for AI Industry Research, Tsinghua University, Beijing, China; 4 Institute of Software, Chinese Academy of Sciences, Beijing, China |
| Pseudocode | Yes | We provide the pseudocode in Appendix A.6. Algorithm 1: Algorithm of our DenoiseVAE |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the DenoiseVAE methodology or a link to a code repository. |
| Open Datasets | Yes | We leverage a large-scale molecular dataset PCQM4Mv2 (Nakata & Shimazaki, 2017) as our pre-training dataset. For downstream tasks, we evaluate our method both on molecular and complex property prediction. For the former, we test on QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014), MD17 (Chmiela et al., 2017) and PCQM4Mv2 (Nakata & Shimazaki, 2017). For the latter, we adopt the widely recognized PDBBind dataset (v2019) for the ligand binding affinity (LBA) prediction, adhering to the 30% and 60% protein sequence identity splits and preprocessing methods outlined in Atom3D (Townshend et al., 2020). |
| Dataset Splits | Yes | QM9 contains 12 chemical properties of small molecules with stable 3D structures. We follow previous work (Jiao et al., 2023) and split the dataset into training, validation, and test sets, which contain 100k, 18k, and 13k conformations, respectively. MD17 contains the simulated dynamical trajectories of 8 small organic molecules, with the recorded energy and force at each frame; we select 9,500 and 500 frames as the training and validation sets, respectively. PCQM4Mv2 (Nakata & Shimazaki, 2017) provides predefined validation and test sets; we report performance on the validation set following the standard protocol (see Appendix A.7 for details). For PDBBind (v2019), we adhere to the 30% and 60% protein sequence identity splits and the preprocessing outlined in Atom3D (Townshend et al., 2020). |
| Hardware Specification | Yes | For training resources, all experiments are conducted on Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz with a single RTX A3090 GPU. |
| Software Dependencies | No | The paper mentions the RDKit library (Landrum, 2006) for energy calculation but does not specify a version number for it or any other software dependencies. |
| Experiment Setup | Yes | Experimental setup: We set the prior distribution of each noisy coordinate as a Gaussian with mean xi and standard deviation σ. Unless otherwise noted, we set σ = 0.1 for all experiments. Table 13 (hyper-parameters for pre-training): Dataset: PCQM4Mv2; Batch size: 128; Optimizer: AdamW; Max learning rate: 0.0005; Learning rate decay policy: Cosine; Network architecture: Equivariant Graph Neural Network (EGNN); Noise Generator layers: 4; Denoising Module layers: 7 |
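The fixed-σ Gaussian prior described in the experiment setup (mean at the clean coordinate, standard deviation σ = 0.1) can be sketched as a coordinate-perturbation step. This is an illustrative NumPy sketch, not the paper's implementation; the function name `perturb_coordinates` is an assumption.

```python
import numpy as np

def perturb_coordinates(coords, sigma=0.1, rng=None):
    """Add isotropic Gaussian noise to 3D atom coordinates.

    Sketch of a fixed-sigma Gaussian prior N(x_i, sigma^2) over each
    noisy coordinate (sigma = 0.1 by default, as in the paper's setup).
    Returns the noisy coordinates and the sampled noise, which a
    denoising module would be trained to predict.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=coords.shape)
    return coords + noise, noise

# Example: perturb a hypothetical 5-atom molecule.
coords = np.zeros((5, 3))
noisy, noise = perturb_coordinates(coords, sigma=0.1)
```

Note that DenoiseVAE's contribution is precisely to replace this fixed σ with a learned, molecule-adaptive noise distribution; the fixed-σ version above is the baseline prior stated in the setup.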
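The QM9 split reported above (100k train, 18k validation, 13k test conformations) can be reproduced by an index split of this shape. This is a minimal sketch under the assumption of a seeded random permutation; the function name, seed, and permutation strategy are illustrative, not from the paper.

```python
import numpy as np

def random_split(n, n_train, n_valid, seed=42):
    """Split indices 0..n-1 into train/valid/test index arrays.

    Sizes follow the reported QM9 protocol: 100k train, 18k valid,
    and the remaining 13k as test. The seeded permutation is an
    illustrative assumption, not the paper's exact procedure.
    """
    perm = np.random.default_rng(seed).permutation(n)
    train = perm[:n_train]
    valid = perm[n_train:n_train + n_valid]
    test = perm[n_train + n_valid:]
    return train, valid, test

# QM9 has ~131k conformations: 100k + 18k + 13k.
train_idx, valid_idx, test_idx = random_split(131_000, 100_000, 18_000)
```

The three index arrays are disjoint by construction, since they are non-overlapping slices of one permutation.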