Learning Molecular Representation in a Cell
Authors: Gang Liu, Srijit Seal, John Arevalo, Zhenwen Liang, Anne Carpenter, Meng Jiang, Shantanu Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching. The code and model are available at https://github.com/liugangcode/InfoAlign. 6 EXPERIMENTS We demonstrate the effectiveness of InfoAlign's representation in (1) molecular property prediction, (2) molecule-morphology matching, and (3) an analysis of InfoAlign's performance. These lead to three research questions (RQs). |
| Researcher Affiliation | Academia | ¹University of Notre Dame, ²Broad Institute of MIT and Harvard |
| Pseudocode | No | The paper describes the methodology using figures (Figure 1, Figure 2) and narrative text within sections like '4 MULTI-MODAL ALIGNMENT WITH INFOALIGN' and '5 IMPLEMENTATION OF CONTEXT GRAPH AND PRETRAINING SETTING'. However, it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching. The code and model are available at https://github.com/liugangcode/InfoAlign. |
| Open Datasets | Yes | We select datasets for important tasks in drug discovery, including activity classification for various assays in ChEMBL2K (Gaulton et al., 2012) and Broad6K (Moshkov et al., 2023), drug toxicity classification using ToxCast (Richard et al., 2016), and absorption, distribution, metabolism, and excretion (ADME) regression using Biogen3K (Fang et al., 2023). We create the context graph based on (1) two Cell Painting datasets (Bray et al., 2017; Chandrasekaran et al., 2023), containing around 140K molecule perturbations (molecule and cell morphology pairs) and 15K genetic perturbations (gene and cell morphology pairs) across 1.6 billion human cells; (2) Hetionet (Himmelstein et al., 2017), which captures gene-gene and gene-molecule relationships from millions of biomedical studies; and (3) a dataset reporting differential gene expression values for 978 landmark genes (Wang et al., 2016) for chemical perturbations (molecule and gene expression pairs) (Subramanian et al., 2017). |
| Dataset Splits | Yes | We apply scaffold-splitting for all datasets. We follow a 0.6:0.15:0.25 ratio for training, validation, and test sets for all datasets. |
| Hardware Specification | Yes | All experiments were run on a single 32 GB V100 GPU. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers. It refers to architectures and models like Graph Neural Networks (GNNs) and Multi-Layer Perceptrons (MLPs), but without specifying the software libraries and their versions (e.g., PyTorch, TensorFlow, scikit-learn versions) used for their implementation. |
| Experiment Setup | Yes | We use a five-layer Graph Isomorphism Network (GIN) (Xu et al., 2019) with a sum readout as the molecule encoder. All molecules on the context graph are used to pretrain the encoder. Since we extract feature vectors as decoding targets in different modalities, we use efficient Multi-Layer Perceptron (MLP) modality decoders. We set the hidden dimension to 300, β = 4, and L = 4. In each training batch, random walks start from the molecule node to extract the walk path. The decoders are then pretrained to reconstruct the corresponding node features along the path. |
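The scaffold split with a 0.6:0.15:0.25 ratio described above can be sketched as follows. This is a minimal illustration, not the authors' code: scaffold keys are assumed to be precomputed (in practice they are typically Bemis-Murcko scaffolds, e.g. via RDKit's `MurckoScaffold`), and the greedy largest-group-first assignment is a common convention, assumed here rather than taken from the paper.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.6, frac_valid=0.15):
    """Split molecule indices so that all molecules sharing a scaffold
    key land in the same subset (train/valid/test); the remainder after
    train and valid are filled becomes the test set."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Assign the largest scaffold groups first (a common convention).
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are assigned atomically, the realized fractions only approximate 0.6:0.15:0.25; test molecules come from scaffolds unseen in training, which is what makes the split a harder generalization benchmark than a random split.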
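The pretraining step quoted above, where a length-L random walk starts from a molecule node on the context graph and the decoders reconstruct the node features along the path, can be sketched in minimal form. The adjacency-dict graph representation and node names below are illustrative assumptions, not the authors' implementation.

```python
import random

def random_walk(adjacency, start, walk_length=4):
    """Sample a walk of up to `walk_length` steps from `start` on the
    context graph. Each visited node (molecule, gene, or cell-morphology)
    becomes a reconstruction target for the corresponding MLP decoder."""
    path = [start]
    node = start
    for _ in range(walk_length):
        neighbors = adjacency.get(node, [])
        if not neighbors:  # dead end: stop the walk early
            break
        node = random.choice(neighbors)
        path.append(node)
    return path
```

With L = 4 as in the paper, each walk yields at most five nodes (the starting molecule plus four steps), so each pretraining example asks the encoder's representation to carry enough information to reconstruct a small, connected neighborhood spanning multiple modalities.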