Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space
Authors: Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | EDM-SYCO achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 26% on ZINC250K and 16% on the GuacaMol dataset while improving conditional generation by up to 3.9 times. |
| Researcher Affiliation | Academia | Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann, Department of Computer Science & Munich Data Science Institute, Technical University of Munich |
| Pseudocode | Yes | Algorithm 1 Training Algorithm Input: a dataset of molecular graphs G = (h, A) Initial: Encoder network Eϕ, Decoder network Dξ, denoising network ϵθ ... Algorithm 2 Sampling Algorithm Input: Decoder network Dξ, denoising network ϵθ |
| Open Source Code | Yes | 1Our code is available at https://github.com/ketatam/SyCo. |
| Open Datasets | Yes | We use two datasets of different sizes and complexities: ZINC250K (Irwin et al., 2012) containing 250K molecules with up to 40 atoms, and GuacaMol (Brown et al., 2019) containing 1.5M drug-like molecules with up to 88 atoms. |
| Dataset Splits | Yes | We use the original train/validation/test splits of the used datasets (Irwin et al., 2012; Brown et al., 2019). |
| Hardware Specification | Yes | We train all models on ZINC250K on a single Nvidia A100 GPU, and on GuacaMol, we use multi-GPU training on 4 Nvidia A100 GPUs. |
| Software Dependencies | No | RDKit (Landrum et al., 2006) (BSD 3-Clause License), PyTorch (Paszke et al., 2019) (BSD 3-Clause License), EDM (Hoogeboom et al., 2022) (MIT license), GeoLDM (Xu et al., 2023) (MIT license). The paper lists software but does not provide specific version numbers. |
| Experiment Setup | Yes | All models are trained with a batch size of 64 and using the recent Prodigy optimizer (Mishchenko & Defazio, 2023) with d_coef = 0.1, which we found to be a very important hyperparameter for the stability of training. The autoencoder is trained in the first stage to minimize the cross-entropy loss between the ground truth and predicted graphs for a maximum of 100 epochs, with early stopping if the validation accuracy does not improve for 10 epochs. In the second training stage, the EDM model is trained for 1000 epochs on ZINC250K and approximately 300 epochs on GuacaMol. The regressor is trained with the L1 loss ... for 500 epochs, with early stopping if the MAE on the validation set does not improve after 50 epochs. |
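The early-stopping rule quoted in the Experiment Setup row (stop training if the validation metric does not improve for a fixed number of epochs) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; `run_epoch` is a hypothetical stand-in for one epoch of training plus validation.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """Run up to `max_epochs` epochs; stop early once the validation
    metric returned by `run_epoch(epoch)` has not improved for
    `patience` consecutive epochs (e.g. 100/10 for the autoencoder,
    500/50 for the regressor in the paper's setup).

    Returns the best metric and the epoch at which it was reached.
    """
    best_metric, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        metric = run_epoch(epoch)  # hypothetical: train one epoch, return val accuracy
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_metric, best_epoch
```

For a loss-based criterion such as the regressor's validation MAE, the same loop applies with the comparison inverted (or by passing the negated loss as the metric).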