Contextualizing biological perturbation experiments through language

Authors: Menghua (Rachel) Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Hütter

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose PERTURBQA, a benchmark for structured reasoning over perturbation experiments. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PERTURBQA. Ground-truth labels are derived from five high-quality single-cell RNA sequencing datasets with CRISPR interference (CRISPRi) perturbations (Replogle et al., 2022; Nadig et al., 2024), based on strict statistical considerations. Evaluation of state-of-the-art statistical, graph, and language-based methods reveals that these tasks are still far from solved.
Researcher Affiliation | Collaboration | Menghua Wu, Massachusetts Institute of Technology, Cambridge, MA, USA; Russell Littman, Jacob Levine, Biology Research & AI Development, Genentech, South San Francisco, CA, USA; Lin Qiu, Meta AI, Menlo Park, CA, USA; David Richmond, Tommaso Biancalani, Jan-Christian Hütter, Biology Research & AI Development, Genentech, South San Francisco, CA, USA
Pseudocode | No | The paper describes the steps of the SUMMER framework (Summarize, Retrieve, Answer) and provides prompt templates in the appendix, but it does not present a formal pseudocode or algorithm block.
Open Source Code | Yes | Our code and data are publicly available at https://github.com/genentech/PerturbQA.
Open Datasets | Yes | Our code and data are publicly available at https://github.com/genentech/PerturbQA. We constructed our benchmark based on five Perturb-seq datasets, derived from Replogle et al. (2022) and Nadig et al. (2024).
Dataset Splits | Yes | Datasets are split 75:25 into train and test along the perturbation axis, with similar distributions of the number of DEGs. Validation data were sampled at random during training (10% of the training set). Further details regarding dataset and data-split statistics may be found in Tables 4 and 5.
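The split described above (75:25 along the perturbation axis, with 10% of training held out for validation) can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the function name `split_perturbations` and the gene identifiers are assumptions for the example.

```python
import random

def split_perturbations(perturbations, seed=0):
    """Sketch of a 75:25 train/test split along the perturbation axis,
    with 10% of the training perturbations held out for validation."""
    rng = random.Random(seed)
    perts = list(perturbations)
    rng.shuffle(perts)
    n_train = int(0.75 * len(perts))          # 75% to the training pool
    train_pool, test = perts[:n_train], perts[n_train:]
    n_val = int(0.10 * len(train_pool))       # 10% of training for validation
    val, train = train_pool[:n_val], train_pool[n_val:]
    return train, val, test

# With 100 perturbations: 25 test, 7 validation, 68 train.
train, val, test = split_perturbations([f"gene_{i}" for i in range(100)])
```

Splitting on perturbations (rather than cells) ensures that held-out perturbations are entirely unseen at training time.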
Hardware Specification | No | The paper mentions running experiments with Llama3 (70B and 8B models) using the LMDeploy framework, but it does not specify the underlying hardware, such as GPU models or CPU types, used for these experiments.
Software Dependencies | No | The paper mentions using Llama3 models and the LMDeploy framework, but does not provide specific version numbers for these or for other key software dependencies (e.g., Python or PyTorch versions).
Experiment Setup | Yes | We ran all experiments with Llama3 (Dubey et al., 2024) with default parameters of top-p 0.9 and temperature 0.6, using the LMDeploy framework (Contributors, 2023). For GAT, we grid-searched over the number of layers (1, 2, 4, 8) and hidden dimension (64, 128, 256). We used FFN dimension 1024 (memory constraint), GELU activation, dropout of 0.1, weight decay 1e-6, learning rate 1e-4, and residual connections.
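The GAT grid search described above covers 4 depths x 3 widths = 12 configurations, with the remaining hyperparameters fixed. A minimal sketch of enumerating that grid (the dictionary keys are illustrative names, not taken from the paper's code):

```python
from itertools import product

# Fixed hyperparameters reported in the setup above.
fixed = dict(
    ffn_dim=1024,        # limited by memory
    activation="gelu",
    dropout=0.1,
    weight_decay=1e-6,
    lr=1e-4,
    residual=True,
)

# Grid-searched axes: number of layers and hidden dimension.
grid = [
    dict(num_layers=n_layers, hidden_dim=hidden, **fixed)
    for n_layers, hidden in product([1, 2, 4, 8], [64, 128, 256])
]

assert len(grid) == 12  # 4 depths x 3 widths
```

Each configuration in `grid` would then be trained and compared on the validation split.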