Self-Supervised Diffusion Models for Electron-Aware Molecular Representation Learning

Authors: Gyoung S. Na, Chanyoung Park

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiments, we focus on evaluating the prediction capabilities of the machine learning methods on biased and relatively small experimental datasets rather than simulated datasets (e.g., QM9 dataset (Ramakrishnan et al., 2014)). Although the simulated datasets are useful for analyzing rough statistics on small molecules, they are not appropriate for evaluating the prediction capabilities of the machine learning methods on real-world molecular physics for two reasons: 1) The simulated datasets do not contain complex and large molecules due to the large time complexity of the quantum mechanical calculations. 2) The simulated datasets do not sufficiently reflect the quantum mechanical uncertainty in real-world molecules (Sim et al., 2018). For these reasons, we used experimentally collected molecular datasets from physicochemistry, toxicity, pharmacokinetics, and optical applications to evaluate the practical potential of DELID. For all benchmark molecular datasets, DELID achieved state-of-the-art performance in predicting experimentally observed properties of real-world complex molecules.
Researcher Affiliation Academia Gyoung S. Na (KRICT, Republic of Korea); Chanyoung Park (KAIST, Republic of Korea)
Pseudocode Yes Algorithm 1 shows an algorithmic description of the forward and training processes of DELID.
Open Source Code Yes The source code of DELID is publicly available at https://github.com/ngs00/DELID.
Open Datasets Yes We employed nine benchmark molecular datasets constructed by real-world chemical experiments. The benchmark molecular datasets were selected from well-known databases in molecular science (Wu et al., 2018; Wu & Wei, 2018; Mendez et al., 2019; Joung et al., 2020).
Dataset Splits Yes For all datasets, the R² scores were measured by 5-fold cross-validation.
Hardware Specification Yes The execution time was measured on a machine with an Intel i9-12900K CPU, 128 GB of memory, and an NVIDIA GeForce RTX 3090 Ti GPU.
Software Dependencies Yes DELID and experiment scripts were implemented with PyTorch 2.0.0+cu117 and PyTorch Geometric 2.3.1 under Python 3.9.
Experiment Setup Yes The model parameters of DELID were optimized by the AdamW optimizer (Loshchilov & Hutter, 2017) for all experiments in this paper. The initial learning rate and L2 regularization coefficient were fixed to 5e-4 and 5e-6, respectively, for all benchmark datasets. The batch size was also fixed to 64 for all benchmark datasets. The GNN-based embedding networks were constructed from two node aggregation layers and one dense layer with 64 output channels.
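
The evaluation protocol quoted under Dataset Splits (R² scores measured by 5-fold cross-validation) can be illustrated with a minimal sketch in plain Python. This is not the DELID evaluation code: the helper names (`r2_score`, `k_fold_indices`, `cross_validate`), the shuffling seed, and the one-dimensional least-squares stand-in model are all assumptions made for illustration, and the toy data is synthetic.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices once and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative assumption
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Train on k-1 folds, score R2 on the held-out fold, repeat for each fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in held_out]
        scores.append(r2_score([ys[i] for i in held_out], preds))
    return scores


# Hypothetical stand-in model (NOT DELID): a 1-D least-squares line.
def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Synthetic toy data, used only to exercise the protocol.
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
folds = k_fold_indices(len(xs))
scores = cross_validate(xs, ys, fit_line, predict_line)
print("per-fold R2:", scores)
print("mean R2:", mean(scores))
```

Reporting the mean over the five held-out folds means every sample is used for testing exactly once, which matters for the small experimental datasets the paper targets.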
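
To make the optimizer settings under Experiment Setup concrete, the following is a minimal sketch of a single AdamW update in plain Python, using the reported learning rate (5e-4) and treating the reported L2 regularization coefficient (5e-6) as AdamW's decoupled weight decay. That mapping is an assumption, as are the `betas` and `eps` values (common defaults, not stated in the source) and the function name `adamw_step`. The authors' implementation uses PyTorch; this sketch only shows the update rule.

```python
import math


def adamw_step(params, grads, state, lr=5e-4, weight_decay=5e-6,
               betas=(0.9, 0.999), eps=1e-8):
    """Apply one AdamW update in place to a list of float parameters.

    lr and weight_decay follow the values quoted in the source;
    betas and eps are assumed defaults, not taken from the paper.
    """
    beta1, beta2 = betas
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # Decoupled weight decay: shrink the weight directly,
        # instead of adding an L2 term to the gradient.
        p = p * (1.0 - lr * weight_decay)
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


def new_state(n_params):
    """Fresh optimizer state: step counter plus first and second moments."""
    return {"t": 0, "m": [0.0] * n_params, "v": [0.0] * n_params}


# Toy objective (w - 3)^2 with gradient 2 * (w - 3), for illustration only.
w = [0.0]
state = new_state(1)
adamw_step(w, [2.0 * (w[0] - 3.0)], state)
print("w after one step:", w[0])
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.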