Learning Molecular Representation in a Cell

Authors: Gang Liu, Srijit Seal, John Arevalo, Zhenwen Liang, Anne Carpenter, Meng Jiang, Shantanu Singh

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching. The code and model are available at https://github.com/liugangcode/InfoAlign. 6 EXPERIMENTS We demonstrate the effectiveness of InfoAlign’s representation in (1) molecular property prediction, (2) molecule-morphology matching, and (3) analyze the performance of InfoAlign. These lead to three research questions (RQs).
Researcher Affiliation Academia 1University of Notre Dame 2Broad Institute of MIT and Harvard
Pseudocode No The paper describes the methodology using figures (Figure 1, Figure 2) and narrative text within sections like '4 MULTI-MODAL ALIGNMENT WITH INFOALIGN' and '5 IMPLEMENTATION OF CONTEXT GRAPH AND PRETRAINING SETTING'. However, it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code Yes Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching. The code and model are available at https://github.com/liugangcode/InfoAlign.
Open Datasets Yes We select datasets for important tasks in drug discovery, including activity classification for various assays in ChEMBL2K (Gaulton et al., 2012) and Broad6K (Moshkov et al., 2023), drug toxicity classification using ToxCast (Richard et al., 2016), and absorption, distribution, metabolism, and excretion (ADME) regression using Biogen3K (Fang et al., 2023). We create the context graph based on (1) two Cell Painting datasets (Bray et al., 2017; Chandrasekaran et al., 2023), containing around 140K molecule perturbations (molecule and cell morphology pairs) and 15K genetic perturbations (gene and cell morphology pairs) across 1.6 billion human cells; (2) Hetionet (Himmelstein et al., 2017), which captures gene-gene and gene-molecule relationships from millions of biomedical studies; and (3) a dataset reporting differential gene expression values for 978 landmark genes (Wang et al., 2016) for chemical perturbations (molecule and gene expression pairs) (Subramanian et al., 2017).
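The context graph described above links three node modalities (molecules, genes, cell-morphology profiles) through relations drawn from the listed datasets. A minimal sketch of such a heterogeneous graph, using illustrative node IDs and relation names that are assumptions rather than the authors' actual schema:

```python
from collections import defaultdict

class ContextGraph:
    """Toy heterogeneous graph: nodes are (type, id) pairs with feature
    vectors; undirected edges are annotated with the relation that
    produced them (e.g. a Cell Painting or Hetionet link)."""

    def __init__(self):
        self.adj = defaultdict(list)   # node -> list of (neighbor, relation)
        self.features = {}             # node -> feature vector

    def add_node(self, node_type, node_id, feature):
        self.features[(node_type, node_id)] = feature

    def add_edge(self, src, dst, relation):
        self.adj[src].append((dst, relation))
        self.adj[dst].append((src, relation))

graph = ContextGraph()
# Molecule perturbation from Cell Painting: molecule paired with a
# cell-morphology profile (node IDs and features are made up).
graph.add_node("molecule", "CHEMBL25", [0.1, 0.3])
graph.add_node("morphology", "profile_001", [0.7, 0.2])
graph.add_edge(("molecule", "CHEMBL25"), ("morphology", "profile_001"),
               "molecule-morphology")
# Gene-molecule relation, as captured in Hetionet.
graph.add_node("gene", "TP53", [0.5, 0.9])
graph.add_edge(("gene", "TP53"), ("molecule", "CHEMBL25"), "gene-molecule")
```

This dict-of-lists layout is only for illustration; the paper's actual graph spans ~140K molecule and ~15K genetic perturbations plus Hetionet and gene-expression edges.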
Dataset Splits Yes We apply scaffold-splitting for all datasets. We follow a 0.6:0.15:0.25 ratio for training, validation, and test sets for all datasets.
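For concreteness, a scaffold split with the stated 0.6:0.15:0.25 ratio can be sketched as below. This is a minimal illustration, not the authors' code: `scaffold_of` is a hypothetical mapping from molecule ID to a precomputed scaffold key (in practice a Bemis-Murcko scaffold from a cheminformatics toolkit such as RDKit), and the greedy largest-group-first assignment is one common way to realize the ratio while keeping each scaffold in a single subset.

```python
def scaffold_split(scaffold_of, molecules, frac=(0.6, 0.15, 0.25)):
    """Group molecules by scaffold, then greedily assign whole groups to
    train/valid/test so the test set contains unseen scaffolds."""
    groups = {}
    for mol in molecules:
        groups.setdefault(scaffold_of[mol], []).append(mol)
    # Assign larger scaffold groups first, a common heuristic.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(molecules)
    train_cut, valid_cut = frac[0] * n, (frac[0] + frac[1]) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups move together, the realized sizes only approximate 0.6:0.15:0.25 on small datasets.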
Hardware Specification Yes All experiments were run on a single 32G V100.
Software Dependencies No The paper does not explicitly mention specific software dependencies with version numbers. It refers to architectures and models like Graph Neural Networks (GNNs) and Multi-Layer Perceptrons (MLPs), but without specifying the software libraries and their versions (e.g., PyTorch, TensorFlow, scikit-learn versions) used for their implementation.
Experiment Setup Yes We use a five-layer Graph Isomorphism Network (GIN) (Xu et al., 2019) with sum readout as the molecule encoder. All molecules on the context graph are used to pretrain the encoder. Since we extract feature vectors as decoding targets in different modalities, we efficiently use a Multi-Layer Perceptron (MLP) as modality decoders. We set the hidden dimension to 300, β = 4, and L = 4. In each training batch, random walks start from the molecule node to extract the walk path. The decoders are then pretrained to reconstruct the corresponding node features along the path.
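The walk-extraction step above (a random walk of length L = 4 from each molecule node, whose nodes the MLP decoders then reconstruct) can be sketched as follows. The adjacency-list layout and toy graph are assumptions for illustration, not the authors' implementation; only the walk length matches the paper's setting.

```python
import random

def random_walk(adj, start, length):
    """Sample a random-walk path of up to `length` steps from `start`
    on an adjacency-list graph; decoders would be trained to
    reconstruct the feature vector of each node along this path."""
    path = [start]
    node = start
    for _ in range(length):
        neighbors = adj.get(node, [])
        if not neighbors:
            break  # dead end: stop the walk early
        node = random.choice(neighbors)
        path.append(node)
    return path

# Toy context graph: molecule M linked to morphology and gene nodes.
adj = {
    "M": ["morph1", "geneA"],
    "morph1": ["M"],
    "geneA": ["M", "geneB"],
    "geneB": ["geneA"],
}
random.seed(0)
path = random_walk(adj, "M", length=4)  # L = 4 as in the paper
```

In the paper, each node on the path contributes a modality-specific feature vector (molecule, gene, or morphology) as a decoding target for the corresponding MLP decoder.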