Attribute Graphs Underlying Molecular Generative Models: Path to Learning with Limited Data

Authors: Samuel C Hoffman, Payel Das, Karthikeyan Shanmugam, Kahini Wadhawan, Prasanna Sattigeri

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on nine pharmacokinetic property prediction problems, an area of science where data limitations are prevalent and label acquisition is costly and time-consuming. Using a pre-trained generative autoencoder trained on a large dataset of small molecules, we demonstrate that the graphical model between various molecular attributes and latent codes learned by our algorithm can be used to predict a specific property for molecules which are drawn from a different distribution. Results show empirically that the predictor that relies on our Markov blanket attributes is robust to distribution shifts when transferred or fine-tuned with a few samples from the new distribution, especially when training data is limited.
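The "Markov blanket attributes" mentioned above are the variables that render the target property conditionally independent of the rest of the attribute graph: its parents, its children, and its children's other parents. A minimal sketch of that selection step, using a hypothetical toy graph (the node names and graph are illustrative, not the paper's learned graph):

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {child: set_of_parents}:
    the node's parents, its children, and its children's other parents."""
    blanket = set(parents.get(node, set()))                   # parents
    children = {c for c, ps in parents.items() if node in ps}
    blanket |= children                                       # children
    for child in children:                                    # co-parents
        blanket |= parents[child] - {node}
    return blanket

# Hypothetical attribute graph over latent codes (z*) and attributes (a*, y):
toy_graph = {
    "y":  {"z1", "a1"},
    "a1": {"z2"},
    "a2": {"y", "z2"},
}

# A robust predictor for y would condition only on this blanket:
print(sorted(markov_blanket(toy_graph, "y")))  # -> ['a1', 'a2', 'z1', 'z2']
```

Restricting the predictor's inputs to this set is what the paper credits for robustness under distribution shift: variables outside the blanket carry no additional information about the target once the blanket is observed.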
Researcher Affiliation | Industry | Samuel C. Hoffman (shoffman@ibm.com), IBM Research; Payel Das (EMAIL), IBM Research; Karthikeyan Shanmugam (EMAIL), IBM Research; Kahini Wadhawan (EMAIL), IBM Research; Prasanna Sattigeri (EMAIL), IBM Research
Pseudocode | Yes | Algorithm 1 Perturb: Learn perturbations to influence weights; Algorithm 2 Perturb: Learn sparse weights to attribute graph
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | Therapeutics Data Commons (TDC) is a platform for AI-powered drug discovery which contains a number of datasets with prediction tasks for drug-relevant properties (Huang et al., 2021). We use all of the regression tasks in the pharmacokinetics domain, i.e., drug absorption, distribution, metabolism, and excretion (ADME). For small molecule representation, we use the VAE from Chenthamarakshan et al. (2020). This model was trained primarily on the MOSES dataset (Polykovskiy et al., 2020), which is a subset of the ZINC Clean Leads dataset (Irwin et al., 2012).
Dataset Splits | Yes | All datasets were divided into training, validation, and testing splits according to a ratio of 70%, 10%, and 20%, respectively, based on scaffold (the core structure of the molecule) in order to separate structurally distinct compounds.
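The key property of a scaffold split is that no scaffold group spans two splits, so the test set contains structurally novel compounds. A minimal sketch of that grouping logic, assuming scaffold keys (e.g., Bemis-Murcko SMILES) have already been computed by some tool such as RDKit; the greedy largest-group-first assignment is a common convention, not necessarily the exact procedure TDC uses:

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.7, frac_valid=0.1):
    """Split molecules ~70/10/20 while keeping each scaffold group intact.
    `scaffolds` maps molecule id -> scaffold key."""
    groups = defaultdict(list)
    for mol, scaf in scaffolds.items():
        groups[scaf].append(mol)
    # Assign largest scaffold groups first so the big groups land in train.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because whole groups are assigned at once, the realized fractions only approximate 70/10/20; the guarantee traded for that slack is that train and test never share a scaffold.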
Hardware Specification | Yes | For all experiments, we use machines with Intel Xeon Gold 6258R CPUs and NVIDIA Tesla V100 GPUs with up to 32 GB of RAM.
Software Dependencies | No | The paper mentions using RDKit for molecular descriptors and L-BFGS-B for optimization, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | The MLP hyperparameters are tuned using 5-fold cross-validation on the training set where the search space is a grid of combinations of: hidden layer size 64, 128, or 512 (2 hidden layers chosen independently); dropout rate 0.25 or 0.5; and training duration 100 or 500 epochs. All models use a mean squared error (MSE) loss with a batch size of 256 (or the size of the dataset, if smaller), rectified linear unit (ReLU) activations, and Adam optimization with a learning rate of 0.001.
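Since the two hidden layers are chosen independently, the grid described above contains 3 × 3 × 2 × 2 = 36 configurations. A short sketch enumerating it (the dict field names are illustrative; the fixed settings such as MSE loss, batch size 256, ReLU, and Adam with lr 0.001 are per the paper and not part of the search):

```python
from itertools import product

HIDDEN = [64, 128, 512]   # per-layer width, two layers chosen independently
DROPOUT = [0.25, 0.5]
EPOCHS = [100, 500]

# |HIDDEN|^2 * |DROPOUT| * |EPOCHS| = 9 * 2 * 2 = 36 configurations,
# each evaluated with 5-fold cross-validation on the training set.
grid = [
    {"hidden": (h1, h2), "dropout": p, "epochs": e}
    for h1, h2, p, e in product(HIDDEN, HIDDEN, DROPOUT, EPOCHS)
]
print(len(grid))  # -> 36
```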