Efficient Biological Data Acquisition through Inference Set Design

Authors: Ihor Neporozhnii, Julien Roy, Emmanuel Bengio, Jason Hartford

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies on image and molecular datasets, as well as a real-world large-scale biological assay, show that active learning for inference set design leads to significant reduction in experimental cost while retaining high system performance.
Researcher Affiliation | Collaboration | (1) Valence Labs; (2) University of Toronto; (3) University of Manchester
Pseudocode | Yes | A pseudo-code is available in Appendix B.
Open Source Code | Yes | The code is available at https://github.com/ineporozhnii/inference_set_design. All datasets to reproduce our results are publicly available, except one proprietary dataset for the results in Figure 8.
Open Datasets | Yes | The whole MNIST training set is used as the target set from which agents can acquire samples. The MNIST test set is split 50-50 into a validation set used for early stopping and a test set used for measuring model performance on held-out data inaccessible by agents. We use the Quantum Machine 9 (QM9) (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). For our experiments, we start by using the publicly available RxRx3 dataset (Fay et al., 2023). To evaluate the inference set design paradigm on a regression task we use the Molecules3D dataset (Xu et al., 2021).
Dataset Splits | Yes | Both datasets are split into inference, validation, and test sets with 80%, 5%, 15% fractions.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. It mentions 'HTS platforms' but this is a general term and not a specific hardware specification (e.g., GPU/CPU models, memory details).
Software Dependencies | Yes | As a first data processing step, we use the RDKit (Landrum et al., 2024) and Molfeat (Noutahi et al., 2023) libraries to convert molecular structures into SMILES strings and compute their Extended Connectivity Fingerprints (ECFPs).
Experiment Setup | Yes | Table 2: Hyperparameters for experiments.
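
The Dataset Splits entry reports inference, validation, and test sets with 80%, 5%, and 15% fractions. A minimal sketch of one way such a split could be produced is given below. The function name `split_indices`, the random shuffling, and the seed are assumptions made for illustration; they are not taken from the paper or its repository, whose actual splitting procedure may differ.

```python
import random


def split_indices(n_samples, fractions=(0.80, 0.05, 0.15), seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into three disjoint lists.

    Hypothetical helper: returns (inference, validation, test) index lists
    whose sizes follow `fractions`. The last list takes the remainder, so
    rounding never drops or duplicates a sample.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)

    n_inference = round(fractions[0] * n_samples)
    n_validation = round(fractions[1] * n_samples)

    inference = indices[:n_inference]
    validation = indices[n_inference:n_inference + n_validation]
    test = indices[n_inference + n_validation:]
    return inference, validation, test
```

Calling `split_indices(1000)` yields lists of 800, 50, and 150 indices. In the setting described above, the inference set is the pool from which samples can be acquired, while the validation and test sets stay held out.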