Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space

Authors: Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann

ICLR 2025

Reproducibility assessment. Each entry below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental. LLM response: "EDM-SYCO achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 26% on ZINC250K and 16% on the GuacaMol dataset while improving conditional generation by up to 3.9 times."
Researcher Affiliation: Academia. LLM response: "Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann, Department of Computer Science & Munich Data Science Institute, Technical University of Munich."
Pseudocode: Yes. LLM response: "Algorithm 1 (Training). Input: a dataset of molecular graphs G = (h, A). Initial: encoder network Eϕ, decoder network Dξ, denoising network ϵθ ... Algorithm 2 (Sampling). Input: decoder network Dξ, denoising network ϵθ."
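The two-stage structure of these algorithms (autoencoder, then latent diffusion) can be sketched in plain Python. This is a structural sketch only: the encoder, decoder, and denoiser are stand-in callables, the latent is a single scalar, and the reverse step is schematic, none of which reflects the paper's actual implementation.

```python
import random

def train(dataset, encoder, decoder, denoiser, sigma=lambda t: t):
    """Sketch of Algorithm 1: fit the autoencoder, then the latent denoiser."""
    for graph in dataset:                       # Stage 1 (autoencoder)
        z = encoder(graph)
        recon = decoder(z)
        # ... minimize cross-entropy between `graph` and `recon` ...
    for graph in dataset:                       # Stage 2 (latent diffusion)
        z = encoder(graph)
        t = random.random()                     # diffusion time in (0, 1)
        noise = random.gauss(0.0, 1.0)
        z_t = z + sigma(t) * noise              # noised latent (scalar sketch)
        # ... minimize || denoiser(z_t, t) - noise ||^2 ...

def sample(decoder, denoiser, steps=10):
    """Sketch of Algorithm 2: denoise a Gaussian latent, then decode a graph."""
    z = random.gauss(0.0, 1.0)
    for k in range(steps, 0, -1):
        t = k / steps
        z = z - (1.0 / steps) * denoiser(z, t)  # schematic reverse step
    return decoder(z)
```

With identity/zero stubs for the networks, `sample` simply returns its initial Gaussian draw, which makes the control flow easy to check in isolation.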
Open Source Code: Yes. LLM response: "Our code is available at https://github.com/ketatam/SyCo."
Open Datasets: Yes. LLM response: "We use two datasets of different sizes and complexities: ZINC250K (Irwin et al., 2012), containing 250K molecules with up to 40 atoms, and GuacaMol (Brown et al., 2019), containing 1.5M drug-like molecules with up to 88 atoms."
Dataset Splits: Yes. LLM response: "We use the original train/validation/test splits of the datasets used (Irwin et al., 2012; Brown et al., 2019)."
Hardware Specification: Yes. LLM response: "We train all models on ZINC250K on a single Nvidia A100 GPU, and on GuacaMol, we use multi-GPU training on 4 Nvidia A100 GPUs."
Software Dependencies: No. LLM response: "RDKit (Landrum et al., 2006) (BSD 3-Clause License), PyTorch (Paszke et al., 2019) (BSD 3-Clause License), EDM (Hoogeboom et al., 2022) (MIT License), GeoLDM (Xu et al., 2023) (MIT License)." The paper lists these packages but does not give specific version numbers.
Experiment Setup: Yes. LLM response: "All models are trained with a batch size of 64 using the recent Prodigy optimizer (Mishchenko & Defazio, 2023) with dcoef = 0.1, which we found to be a very important hyperparameter for the stability of training. The autoencoder is trained in the first stage to minimize the cross-entropy loss between the ground-truth and predicted graphs for a maximum of 100 epochs, with early stopping if the validation accuracy does not improve for 10 epochs. In the second training stage, the EDM model is trained for 1000 epochs on ZINC250K and approximately 300 epochs on GuacaMol. The regressor is trained with the L1 loss ... for 500 epochs, with early stopping if the MAE on the validation set does not improve after 50 epochs."
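The early-stopping rule described in the setup (stop when the validation metric fails to improve for a fixed number of consecutive epochs) can be sketched in plain Python. The `EarlyStopper` class and the example metric trace are illustrative, not taken from the paper's code.

```python
class EarlyStopper:
    """Stop training when the validation metric plateaus for `patience` epochs."""

    def __init__(self, patience, mode="max"):
        self.patience = patience
        self.mode = mode          # "max" for accuracy, "min" for MAE
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's validation metric; return True when training should stop."""
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Autoencoder stage per the setup: patience 10 on validation accuracy ("max");
# the regressor stage would use patience 50 on validation MAE ("min").
stopper = EarlyStopper(patience=10, mode="max")
accs = [0.80, 0.85, 0.85] + [0.84] * 10   # synthetic trace: plateau after epoch 1
stopped_at = next(i for i, a in enumerate(accs) if stopper.step(a))
```

On this synthetic trace the counter starts accumulating at the first non-improving epoch and triggers once ten such epochs have passed, while `stopper.best` retains the best accuracy seen.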