Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations

Authors: Pengcheng Jiang, Cao Xiao, Tianfan Fu, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, GODE surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks. To evaluate GODE's performance, we conducted experiments across 11 chemical property prediction tasks. We benchmarked GODE against state-of-the-art methods, including GROVER (Rong et al. 2020), MolCLR (Wang et al. 2021a), and KANO (Fang et al. 2023).
Researcher Affiliation — Collaboration — Pengcheng Jiang1, Cao Xiao2, Tianfan Fu3, Parminder Bhatia2, Taha Kass-Hout2, Jimeng Sun1, Jiawei Han1; 1University of Illinois Urbana-Champaign, 2GE HealthCare, 3Rensselaer Polytechnic Institute
Pseudocode — No — The paper describes the GODE framework using definitions, descriptions, and mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 4) and a diagram in Figure 1, but does not contain a dedicated pseudocode or algorithm block.
Open Source Code — No — The paper does not explicitly state that the source code for the GODE methodology is open-source, nor does it provide a specific link to a code repository or mention code in supplementary materials.
Open Datasets — Yes — The pre-training data for our molecule-level M-GNN is derived from the same unlabelled dataset of 11 million molecules utilized by GROVER. This dataset encompasses sources such as ZINC15 (Sterling and Irwin 2015) and ChEMBL (Gaulton et al. 2012). For the KG-level pre-training, we retrieve KG triples related to the molecules from PubChem RDF and PrimeKG. These include various subdomains and properties from PubChem RDF, as well as 3-hop sub-graphs for all 7957 drugs from PrimeKG. The effectiveness of our model is tested utilizing the comprehensive MoleculeNet dataset (Wu et al. 2018; Huang et al. 2021), which contains 6 classification and 5 regression datasets for molecular property prediction.
Dataset Splits — Yes — We randomly split this dataset into two subsets with a 9:1 ratio for training and validation. The dataset is divided into training and validation sets with a 9:1 ratio. Training and validation samples are in a 0.95:0.05 ratio. Scaffold splitting with three random seeds was employed with a training/validation/testing ratio of 8:1:1 across all datasets, aligning with previous studies (Rong et al. 2020; Fang et al. 2023).
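The 8:1:1 scaffold split quoted above can be sketched as follows. This is a generic sketch, not the paper's exact procedure: `scaffold_fn` is a hypothetical stand-in for a real Bemis–Murcko scaffold computation (typically done with RDKit), which the paper does not detail.

```python
import random
from collections import defaultdict

def scaffold_split(smiles_list, scaffold_fn, frac=(0.8, 0.1, 0.1), seed=0):
    """Split molecule indices so that no scaffold group is shared across
    train/valid/test. scaffold_fn maps a SMILES string to a scaffold key
    (hypothetical placeholder for a Bemis-Murcko scaffold)."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_fn(smi)].append(idx)

    # Shuffle (seeded, mirroring the paper's three random seeds), then place
    # larger scaffold groups first so the target fractions are respected.
    buckets = list(groups.values())
    random.Random(seed).shuffle(buckets)
    buckets.sort(key=len, reverse=True)

    n = len(smiles_list)
    train_cap, valid_cap = frac[0] * n, frac[1] * n
    train, valid, test = [], [], []
    for bucket in buckets:
        if len(train) + len(bucket) <= train_cap:
            train.extend(bucket)
        elif len(valid) + len(bucket) <= valid_cap:
            valid.extend(bucket)
        else:
            test.extend(bucket)
    return train, valid, test
```

With every molecule in its own scaffold group, a 10-molecule dataset lands in an 8/1/1 split; with shared scaffolds, whole groups move together, which is the point of scaffold splitting.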
Hardware Specification — Yes — All tests are performed with two AMD EPYC 7513 32-core Processors, 528GB RAM, 8 NVIDIA A6000 GPUs, and CUDA 11.7.
Software Dependencies — No — The paper mentions "CUDA 11.7" with a specific version number. However, other key software components, such as the programming language, the deep learning framework (e.g., PyTorch, TensorFlow), or specific versions for libraries like RDKit, GROVER, GINE, or TransE, are not provided.
Experiment Setup — Yes — Our settings include λe = 1.5, λm = 1.8, and λn = 1.5. Both M-GNN and K-GNN have a hidden size of 1,200. We adopt a temperature τ = 1.0 for contrastive learning. Early stopping is anchored to validation loss. During fine-tuning, embeddings from K-GNN remain fixed, updating only the parameters of M-GNN. We use the Adam optimizer with the Noam learning rate scheduler (Vaswani et al. 2017).
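For reference on the τ = 1.0 setting, a minimal temperature-scaled contrastive (InfoNCE-style) loss for a single anchor could look like the sketch below. This is a generic illustration of temperature scaling, not the paper's bi-level objective; the similarity function and single-anchor form are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=1.0):
    """Contrastive loss for one anchor: -log of the softmax probability
    (at temperature tau) assigned to the positive over all candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)  # max-shift for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)
```

Lowering τ below 1.0 sharpens the softmax and penalizes hard negatives more heavily; at τ = 1.0, as quoted above, the logits are the raw similarities.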