BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments

Authors: Yusuf Roohani, Andrew Lee, Qian Huang, Jian Vora, Zachary Steinhart, Kexin Huang, Alexander Marson, Percy Liang, Jure Leskovec

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. We demonstrate our agent on the problem of designing genetic perturbation experiments, where the aim is to find a small subset out of many possible genes that, when perturbed, result in a specific phenotype (e.g., cell growth). Utilizing its biological knowledge, BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model or explicitly design an acquisition function as in Bayesian optimization. Moreover, BioDiscoveryAgent using Claude 3.5 Sonnet achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets, and a 46% improvement in the harder task of non-essential gene perturbation, compared to existing Bayesian optimization baselines specifically trained for this task. Our evaluation includes one dataset that is unpublished, ensuring it is not part of the language model's training data. Additionally, BioDiscoveryAgent predicts gene combinations to perturb more than twice as accurately as a random baseline, a task so far not explored in the context of closed-loop experiment design.
Researcher Affiliation Academia Yusuf Roohani (1,2), Andrew Lee (1), Qian Huang (1), Jian Vora (1), Zachary Steinhart (3,4), Kexin Huang (1), Alexander Marson (3,4,5,6), Percy Liang (1), Jure Leskovec (1). 1 Department of Computer Science, Stanford University; 2 Arc Institute; 3 Gladstone-UCSF Institute of Genomic Immunology; 4 Department of Medicine, University of California, San Francisco; 5 Department of Microbiology and Immunology, University of California, San Francisco; 6 UCSF Helen Diller Family Comprehensive Cancer Center
Pseudocode Yes Algorithm 1 BioDiscoveryAgent: AI Agent for Biological Experiment Design (using all tools). Input: Experiment description, number of rounds T, number of genes to perturb in each round b. Output: Set of genes to perturb. for t = 1 to T do ...
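The quoted algorithm is truncated after the loop header, but its shape (a fixed number of rounds, a fixed batch of genes per round, observations fed back into the next proposal) can be sketched as a minimal closed loop. This is an illustrative sketch, not the paper's implementation: `propose_genes` is a hypothetical stand-in for the LLM query, and here it merely samples untried genes at random.

```python
import random

def propose_genes(history, candidate_pool, batch_size):
    """Hypothetical stand-in for the LLM proposal step.
    BioDiscoveryAgent prompts a language model here; this stub
    just samples genes that have not been perturbed yet."""
    tried = {g for batch, _ in history for g in batch}
    remaining = [g for g in candidate_pool if g not in tried]
    return random.sample(remaining, min(batch_size, len(remaining)))

def run_agent(candidate_pool, measure_phenotype, rounds=5, batch_size=128):
    """Closed-loop experiment design: propose a batch of genes,
    observe the phenotypic readout for each, and carry the results
    forward as context for the next round's proposal."""
    history = []
    for _ in range(rounds):  # "for t = 1 to T do"
        batch = propose_genes(history, candidate_pool, batch_size)
        results = {g: measure_phenotype(g) for g in batch}
        history.append((batch, results))
    return history
```

The defaults mirror the evaluation setting reported in the paper (5 rounds of 128 genes each).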
Open Source Code Yes Code is available at: www.github.com/snap-stanford/BioDiscoveryAgent
Open Datasets Yes For the single-gene perturbation setting, we make use of six different datasets spread across different cell types, publication dates and data generation sites. Each of the datasets contains the phenotypic response of knocking-down over 18,000 individual genes in distinct cells, with the exception of Scharenberg et al. (2023) which contains data for 1061 perturbations. All datasets were released after 2021, apart from one dataset (CAR-T 1) which is so far unpublished. ... The Schmidt et al. (2022) dataset measures the changes in the production of two key cytokines involved in immune signaling: Interferon-γ (IFNG) and Interleukin-2 (IL-2) under different genetic perturbations performed in primary human T-cells. The Carnevale et al. (2022) dataset includes perturbation screens for identifying genes that render T cells resistant to inhibitory signals encountered in the tumor microenvironment. Unpublished data (CAR-T dataset) studies the impact of genome-wide perturbations on CAR-T cell proliferation. The Scharenberg et al. (2023) dataset measures the effect of perturbation on mediating lysosomal choline recycling in pancreatic cells, and the Sanchez et al. (2021) dataset studies the change in expression of endogenous tau protein levels in neurons. For the two-gene perturbation task, we use a dataset from a screen that knocked down 100,576 gene pairs in K562 cells (Horlbeck et al., 2018).
Dataset Splits No In every experimental round we perturb 128 genes, representing a reasonably sized small-scale biological screen. Since each round of experimentation can incur additional costs and introduce unwanted experimental variation, we focus our evaluations on fewer experimental rounds (5) to more accurately reflect a real biological setting. For each dataset, after each round, we calculate the hit ratio as the proportion of discovered hits out of the total true hits for that dataset. ... For Scharenberg et al. (2023), a batch size of 32 was used due to its smaller size of 1061 perturbations.
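The hit-ratio metric described above (proportion of discovered hits out of the total true hits for a dataset, accumulated over rounds) is simple enough to state directly. A minimal sketch, assuming hits are identified by gene name:

```python
def hit_ratio(discovered, true_hits):
    """Fraction of a dataset's true hits recovered so far.
    `discovered` is the cumulative set of genes perturbed across
    all rounds; `true_hits` is the dataset's ground-truth hit set."""
    true_hits = set(true_hits)
    if not true_hits:
        return 0.0
    return len(set(discovered) & true_hits) / len(true_hits)
```

For example, recovering 2 of 4 true hits after a round gives a hit ratio of 0.5, regardless of how many non-hit genes were also perturbed.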
Hardware Specification No The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) used for running its experiments.
Software Dependencies No We tested 9 different LLMs across varying levels of complexity for use in BioDiscoveryAgent (Claude v1 (Anthropic, 2023), Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus (Anthropic, 2024b), Claude 3.5 Sonnet (Anthropic, 2024a), GPT-3.5-Turbo (OpenAI, 2023), GPT-4o (OpenAI, 2024a), o1-mini (OpenAI, 2024b), o1-preview (OpenAI, 2024c)). ... In this case, the agent uses the PubMed API (Wobben, 2020) to search for papers containing the most pertinent literature. ... We provide the agent with the ability to query databases to search for other genes with similar biological properties as hit genes from previous experimental rounds (Appendix Figure 5d). First, the API is called to perform enrichment analysis for biological processes on the Reactome 2022 database (Gillespie et al., 2022) to identify the most relevant biological pathways. ... The agent accesses the KEGG (Kanehisa et al., 2017) enrichment database...
Experiment Setup Yes In every experimental round we perturb 128 genes, representing a reasonably sized small-scale biological screen. Since each round of experimentation can incur additional costs and introduce unwanted experimental variation, we focus our evaluations on fewer experimental rounds (5) to more accurately reflect a real biological setting. ... The agent receives a prompt that describes general information about the experimental setup and the biological hypothesis being tested (Figure 1b, Appendix A, B). The results from each experiment are incorporated into the next prompt, along with the same information about the experimental setup. ... To ensure interpretability and to guide the agent's thought process, a consistent response format is defined across all prompts. We direct the LLM to structure its responses into several parts: Reflection, Research Plan, Solution (Appendix A, Figure 1b), similar to (Huang et al., 2023b).
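The structured response format (Reflection, Research Plan, Solution) implies a parsing step that extracts each section from the LLM's reply before the gene list can be acted on. A minimal sketch, assuming the sections are introduced by their names followed by a colon (the exact delimiter format in the paper's prompts is an assumption here):

```python
import re

def parse_response(text):
    """Split an LLM reply into the three sections named in the paper:
    Reflection, Research Plan, Solution. The `Name:` delimiter
    convention is assumed for illustration."""
    pattern = (r"(Reflection|Research Plan|Solution):\s*"
               r"(.*?)(?=(?:Reflection|Research Plan|Solution):|\Z)")
    return {name: body.strip()
            for name, body in re.findall(pattern, text, flags=re.S)}
```

Parsing the Solution section into a concrete gene batch is what closes the loop: those genes are "run" in the next experimental round and their results folded into the following prompt.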