Active Learning for Efficient Discovery of Optimal Combinatorial Perturbations
Authors: Jason Qin, Hans-Hermann Wessels, Carlos Fernandez-Granda, Yuhan Hao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on four CRISPR datasets with over 350,000 interactions, NAIAD, when trained on small datasets, outperforms existing models by up to 40%. Its recommendation system prioritizes gene pairs with maximum predicted effects, accelerating discovery with fewer experiments. |
| Researcher Affiliation | Collaboration | ¹Neptune Bio, New York, NY, USA; ²Center for Data Science, New York University, New York, NY, USA; ³Courant Institute of Mathematical Sciences, New York University, New York, NY, USA. Correspondence to: Yuhan Hao <EMAIL>. |
| Pseudocode | No | The paper describes the model formulation mathematically in Section 3.1 and various sampling strategies in Appendix C, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/NeptuneBio/NAIAD |
| Open Datasets | Yes | We evaluate our models on cell-viability measurements across two cell types from four bulk combinatorial CRISPR perturbation screening datasets and one drug combination screening dataset (Norman et al., 2019; Simpson et al., 2023; Horlbeck et al., 2018; Zheng et al., 2021; Bertin et al., 2023). |
| Dataset Splits | Yes | In our downsampling experiments in Section 5.2, we used [100, 200, 350, 500, 750, 1000, 1250, and 1500] samples during training for the Norman et al. (2019) dataset (6,328 combinations), and [100, 500, 1000, 2000, 3000, 4000, 5000, 6000] samples for training on the Simpson et al. (2023) (147,658 combinations) and Horlbeck et al. (2018) datasets (100,576 combinations for K562; 95,703 combinations for Jurkat T), along with 10% and 30% of each dataset for validation and testing, respectively. |
| Hardware Specification | Yes | All model training was done on a single Paperspace A100-80G server with 100GB of RAM. |
| Software Dependencies | No | The paper mentions learning rate, batch size, weight decay, and other hyperparameters, but does not specify any software names with version numbers (e.g., Python, PyTorch, CUDA versions) that were used to implement the models. |
| Experiment Setup | Yes | For all training of the NAIAD and MLP models, we use learning rate = 10^-2 and batch size = 1024. We also used a linear rate scheduler with 10% of training steps used for warm-up, and weight decay = 0. To identify these optimal hyperparameters, we tested hyperparameters across the following ranges: n_epoch: [50, 100, 200, 500, 1000, 2000]; batch size: [512, 1024, 2048, 4096]; learning rate: [10^-4, 10^-3, 10^-2, 10^-1]; d_embed: [2, 4, 8, 16, 32, 64, 128, 256]; d_single-gene: [8, 16, 32, 64, 128, 256, 512]; weight decay: [0, 10^-4, 10^-3] |
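The split protocol reported in the Dataset Splits row (a fixed number of training samples plus 10% validation and 30% test) can be sketched as follows. This is a minimal reconstruction, not the authors' code: the function name `downsample_split`, the random seed, and the assumption that training samples are drawn from the remaining 60% pool are ours.

```python
import random

def downsample_split(combinations, n_train, val_frac=0.10, test_frac=0.30, seed=0):
    """Hold out fixed validation (10%) and test (30%) fractions, then
    draw n_train training samples from the remaining pool, mirroring the
    paper's downsampling experiments (Section 5.2)."""
    rng = random.Random(seed)
    idx = list(range(len(combinations)))
    rng.shuffle(idx)
    n_val = int(val_frac * len(idx))
    n_test = int(test_frac * len(idx))
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    pool = idx[n_val + n_test:]  # remaining ~60% available for training
    assert n_train <= len(pool), "requested more training samples than remain"
    train = pool[:n_train]
    return ([combinations[i] for i in train],
            [combinations[i] for i in val],
            [combinations[i] for i in test])

# Example at the scale of the Norman et al. (2019) dataset (6,328 combinations),
# with one of the reported training sizes:
pairs = [f"pair_{i}" for i in range(6328)]
train, val, test = downsample_split(pairs, n_train=500)
```

With this protocol the validation and test sets stay fixed in size while only the training budget varies, which is what makes the reported sample-efficiency comparisons across [100, 200, ..., 1500] meaningful.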
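The Experiment Setup row mentions a linear rate scheduler with 10% of steps used for warm-up. A plain-Python sketch of such a schedule is below; the paper only states "linear rate scheduler with 10% ... warm up", so the decay-to-zero shape after warm-up and the function name are our assumptions.

```python
def linear_warmup_lr(step, total_steps, peak_lr=1e-2, warmup_frac=0.10):
    """Learning rate at a given step: ramp linearly from 0 to peak_lr
    over the first warmup_frac of steps, then decay linearly back to 0
    over the remaining steps (assumed decay shape)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, with `total_steps=1000` the rate rises to the peak of 10^-2 (the table's reported learning rate) at step 100 and returns to 0 at step 1000; an equivalent schedule could be built in PyTorch with `torch.optim.lr_scheduler.LambdaLR`.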