Interpretable Causal Representation Learning for Biological Data in the Pathway Space

Authors: Jesus de la Fuente Cedeño, Robert Lehmann, Carlos Ruiz-Arenas, Jan Voges, Irene Marín-Goñi, Xabier Martinez de Morentin, David Gomez-Cabrero, Idoia Ochoa, Jesper Tegnér, Vincenzo Lagani, Mikel Hernaez

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that SENA-discrepancy-VAE achieves predictive performances on unseen combinations of interventions that are comparable with its original, non-interpretable counterpart, while inferring causal latent factors that are biologically meaningful. [...] We employ two large-scale Perturb-seq datasets, one collected on leukemia lymphoblast cells (K562 cell line) (Norman et al., 2019), termed the Norman2019 dataset, and a second one collected on acute myeloid leukemia cells (THP1 cell line) (Wessels et al., 2022), termed the Wessels2023 dataset. [...] Section 5 ABLATION STUDY. [...] Section 6 LEARNING INTERPRETABLE LATENT CAUSAL FACTORS. [...] Table 1: Benchmarking SENA-discrepancy-VAE and discrepancy-VAE on double perturbations prediction.
Researcher Affiliation | Academia | 1 CIMA University of Navarra, CCUN, IdiSNA, Pamplona, Spain. 2 TECNUN, University of Navarra, San Sebastián, Spain. 3 Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. 4 Dept. of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, MN, USA. 5 Institute of Chemical Biology, Ilia State University, Tbilisi 0162, Georgia. 6 Center for Data Science (DATAI), University of Navarra, 31008, Pamplona, Spain.
Pseudocode | No | The paper describes the SENA-discrepancy-VAE model and its components using mathematical equations and descriptive text, but does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | Python package, including data and code for reproducibility: github.com/ML4BM-Lab/SENA
Open Datasets | Yes | We employ two large-scale Perturb-seq datasets, one collected on leukemia lymphoblast cells (K562 cell line) (Norman et al., 2019), termed the Norman2019 dataset, and a second one collected on acute myeloid leukemia cells (THP1 cell line) (Wessels et al., 2022), termed the Wessels2023 dataset.
Dataset Splits | Yes | The Norman2019 dataset underwent standard preprocessing steps for single cell data (filtering, normalization, and log-transformation (Wolf et al., 2018)), leading to a total of 8,907 unperturbed cells (controls), 57,831 cells under the 105 single-gene perturbations, and 41,759 cells under the 131 double-gene perturbations. [...] For both datasets, we trained both models on the unperturbed and single-gene perturbations samples from Norman et al. (2019)... Double-gene perturbations were set aside for evaluation purposes.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions several software tools and packages such as 'Python package', 'statannotations package (Charlier et al., 2022)', 'Seurat', and 'Scanpy', but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Given the good results (in interpretability and reconstruction performance) obtained in the ablation study (Section 5), we varied the number of latent factors within {5, 10, 35, 70, 105}, and the λ for the SENA-discrepancy-VAE in {0, 0.1} (Appendix VII Fig. 12 shows gradients and mask (M) distribution across several λ values). [...] The parameter N is set to 100 in our analyses. [...] We evaluated the aforementioned architectures for several values of λ: {0, 0.1, 0.01, 10^-3}. [...] enforcing that every BP contains at least 5 genes.
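
As a rough illustration of the Dataset Splits and Experiment Setup rows, the sketch below shows one way the protocol could be expressed in Python: controls and single-gene perturbations form the training set, double-gene perturbations are held out, BPs with fewer than 5 genes are dropped, and the reported grid of latent factors and λ values is enumerated. It is not taken from the SENA repository. The toy records, the gene and BP labels, and the helper names `split_by_perturbation` and `filter_pathways` are hypothetical; only the grid values and the 5-gene threshold come from the rows above.

```python
from itertools import product

# Hypothetical toy records: (cell_id, perturbed_genes).
# An empty tuple marks an unperturbed (control) cell.
cells = [
    ("cell_1", ()),
    ("cell_2", ("GENE_A",)),
    ("cell_3", ("GENE_A", "GENE_B")),
    ("cell_4", ("GENE_C",)),
    ("cell_5", ()),
]


def split_by_perturbation(records):
    """Train on controls and single-gene perturbations; hold out double-gene ones."""
    train = [r for r in records if len(r[1]) <= 1]
    held_out = [r for r in records if len(r[1]) == 2]
    return train, held_out


def filter_pathways(pathways, min_genes=5):
    """Keep only biological processes (BPs) annotated with at least `min_genes` genes."""
    return {bp: genes for bp, genes in pathways.items() if len(genes) >= min_genes}


# Grid reported in the Experiment Setup row.
LATENT_FACTORS = [5, 10, 35, 70, 105]
LAMBDAS = [0, 0.1]
grid = list(product(LATENT_FACTORS, LAMBDAS))

train, held_out = split_by_perturbation(cells)

# Hypothetical BP-to-gene annotation, used only to exercise the filter.
toy_pathways = {
    "BP_large": {"g1", "g2", "g3", "g4", "g5"},
    "BP_small": {"g1", "g2"},
}
kept = filter_pathways(toy_pathways)

print(f"{len(train)} training cells, {len(held_out)} held-out cells")
print(f"{len(grid)} (latent factors, lambda) configurations")
print(f"BPs kept: {sorted(kept)}")
```

Splitting on the number of perturbed genes means every double-gene combination reaches the model only at evaluation time, which is the setting that the Table 1 benchmark on double perturbations describes.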