Attribute Graphs Underlying Molecular Generative Models: Path to Learning with Limited Data

Authors: Samuel C Hoffman, Payel Das, Karthikeyan Shanmugam, Kahini Wadhawan, Prasanna Sattigeri

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on nine pharmacokinetic property prediction problems, an area of science where data limitations are prevalent and label acquisition is costly and time-consuming. Using a pre-trained generative autoencoder trained on a large dataset of small molecules, we demonstrate that the graphical model between various molecular attributes and latent codes learned by our algorithm can be used to predict a specific property for molecules which are drawn from a different distribution. Results show empirically that the predictor that relies on our Markov blanket attributes is robust to distribution shifts when transferred or fine-tuned with a few samples from the new distribution, especially when training data is limited.
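The "Markov blanket attributes" mentioned above are the variables that render the target property conditionally independent of the rest of the attribute graph: its parents, its children, and its children's other parents. A minimal sketch of that selection step, using a hypothetical toy graph (the node names and graph are illustrative, not the paper's learned graph):

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {child: set_of_parents}:
    the node's parents, its children, and its children's other parents."""
    blanket = set(parents.get(node, set()))                   # parents
    children = {c for c, ps in parents.items() if node in ps}
    blanket |= children                                       # children
    for child in children:                                    # co-parents
        blanket |= parents[child] - {node}
    return blanket

# Hypothetical attribute graph over latent codes (z*) and attributes (a*, y):
toy_graph = {
    "y":  {"z1", "a1"},
    "a1": {"z2"},
    "a2": {"y", "z2"},
}

# A robust predictor for y would condition only on this blanket:
print(sorted(markov_blanket(toy_graph, "y")))  # -> ['a1', 'a2', 'z1', 'z2']
```

Restricting the predictor's inputs to this set is what the paper credits for robustness under distribution shift: variables outside the blanket carry no additional information about the target once the blanket is observed.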
Researcher Affiliation | Industry | Samuel C. Hoffman (shoffman@ibm.com), IBM Research; Payel Das (EMAIL), IBM Research; Karthikeyan Shanmugam (EMAIL), IBM Research; Kahini Wadhawan (EMAIL), IBM Research; Prasanna Sattigeri (EMAIL), IBM Research
Pseudocode | Yes | Algorithm 1 Perturb: Learn perturbations to influence weights; Algorithm 2 Perturb: Learn sparse weights to attribute graph
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | Therapeutics Data Commons (TDC) is a platform for AI-powered drug discovery which contains a number of datasets with prediction tasks for drug-relevant properties (Huang et al., 2021). We use all of the regression tasks in the pharmacokinetics domain, i.e., drug absorption, distribution, metabolism, and excretion (ADME). For small molecule representation, we use the VAE from Chenthamarakshan et al. (2020). This model was trained primarily on the MOSES dataset (Polykovskiy et al., 2020), which is a subset of the ZINC Clean Leads dataset (Irwin et al., 2012).
Dataset Splits | Yes | All datasets were divided into training, validation, and testing splits according to a ratio of 70%, 10%, and 20%, respectively, based on scaffold (the core structure of the molecule) in order to separate structurally distinct compounds.
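The key property of a scaffold split is that no scaffold group spans two splits, so the test set contains structurally novel compounds. A minimal sketch of that grouping logic, assuming scaffold keys (e.g., Bemis-Murcko SMILES) have already been computed by some tool such as RDKit; the greedy largest-group-first assignment is a common convention, not necessarily the exact procedure TDC uses:

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.7, frac_valid=0.1):
    """Split molecules ~70/10/20 while keeping each scaffold group intact.
    `scaffolds` maps molecule id -> scaffold key."""
    groups = defaultdict(list)
    for mol, scaf in scaffolds.items():
        groups[scaf].append(mol)
    # Assign largest scaffold groups first so the big groups land in train.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because whole groups are assigned at once, the realized fractions only approximate 70/10/20; the guarantee traded for that slack is that train and test never share a scaffold.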
Hardware Specification | Yes | For all experiments, we use machines with Intel Xeon Gold 6258R CPUs and NVIDIA Tesla V100 GPUs with up to 32 GB of RAM.
Software Dependencies | No | The paper mentions using RDKit for molecular descriptors and L-BFGS-B for optimization, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | The MLP hyperparameters are tuned using 5-fold cross-validation on the training set where the search space is a grid of combinations of: hidden layer size 64, 128, or 512 (2 hidden layers chosen independently); dropout rate 0.25 or 0.5; and training duration 100 or 500 epochs. All models use a mean squared error (MSE) loss with a batch size of 256 (or the size of the dataset, if smaller), rectified linear unit (ReLU) activations, and Adam optimization with a learning rate of 0.001.
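Since the two hidden layers are chosen independently, the grid described above contains 3 × 3 × 2 × 2 = 36 configurations. A short sketch enumerating it (the dict field names are illustrative; the fixed settings such as MSE loss, batch size 256, ReLU, and Adam with lr 0.001 are per the paper and not part of the search):

```python
from itertools import product

HIDDEN = [64, 128, 512]   # per-layer width, two layers chosen independently
DROPOUT = [0.25, 0.5]
EPOCHS = [100, 500]

# |HIDDEN|^2 * |DROPOUT| * |EPOCHS| = 9 * 2 * 2 = 36 configurations,
# each evaluated with 5-fold cross-validation on the training set.
grid = [
    {"hidden": (h1, h2), "dropout": p, "epochs": e}
    for h1, h2, p, e in product(HIDDEN, HIDDEN, DROPOUT, EPOCHS)
]
print(len(grid))  # -> 36
```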