Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space
Authors: Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | EDM-SYCO achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 26% on ZINC250K and 16% on the GuacaMol dataset while improving conditional generation by up to 3.9 times. |
| Researcher Affiliation | Academia | Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann, Department of Computer Science & Munich Data Science Institute, Technical University of Munich |
| Pseudocode | Yes | Algorithm 1 Training Algorithm Input: a dataset of molecular graphs G = (h, A) Initial: Encoder network Eϕ, Decoder network Dξ, denoising network ϵθ ... Algorithm 2 Sampling Algorithm Input: Decoder network Dξ, denoising network ϵθ |
| Open Source Code | Yes | 1Our code is available at https://github.com/ketatam/SyCo. |
| Open Datasets | Yes | We use two datasets of different sizes and complexities: ZINC250K (Irwin et al., 2012) containing 250K molecules with up to 40 atoms, and GuacaMol (Brown et al., 2019) containing 1.5M drug-like molecules with up to 88 atoms. |
| Dataset Splits | Yes | We use the original train/validation/test splits of the used datasets (Irwin et al., 2012; Brown et al., 2019). |
| Hardware Specification | Yes | We train all models on ZINC250K on a single Nvidia A100 GPU, and on GuacaMol, we use multi-GPU training on 4 Nvidia A100 GPUs. |
| Software Dependencies | No | RDKit (Landrum et al., 2006) (BSD 3-Clause License), PyTorch (Paszke et al., 2019) (BSD 3-Clause License), EDM (Hoogeboom et al., 2022) (MIT license), GeoLDM (Xu et al., 2023) (MIT license). The paper lists software but does not provide specific version numbers. |
| Experiment Setup | Yes | All models are trained with a batch size of 64 and using the recent Prodigy optimizer (Mishchenko & Defazio, 2023) with d_coef = 0.1, which we found to be a very important hyperparameter for the stability of training. The autoencoder is trained in the first stage to minimize the cross-entropy loss between the ground truth and predicted graphs for a maximum of 100 epochs, with early stopping if the validation accuracy does not improve for 10 epochs. In the second training stage, the EDM model is trained for 1000 epochs on ZINC250K and approximately 300 epochs on GuacaMol. The regressor is trained with the L1 loss ... for 500 epochs, with early stopping if the MAE on the validation set does not improve after 50 epochs. |
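The early-stopping rule quoted in the Experiment Setup row (stop training if the validation metric does not improve for a fixed number of epochs) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; `run_epoch` is a hypothetical stand-in for one epoch of training plus validation.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """Run up to `max_epochs` epochs; stop early once the validation
    metric returned by `run_epoch(epoch)` has not improved for
    `patience` consecutive epochs (e.g. 100/10 for the autoencoder,
    500/50 for the regressor in the paper's setup).

    Returns the best metric and the epoch at which it was reached.
    """
    best_metric, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        metric = run_epoch(epoch)  # hypothetical: train one epoch, return val accuracy
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_metric, best_epoch
```

For a loss-based criterion such as the regressor's validation MAE, the same loop applies with the comparison inverted (or by passing the negated loss as the metric).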