Efficient Biological Data Acquisition through Inference Set Design
Authors: Ihor Neporozhnii, Julien Roy, Emmanuel Bengio, Jason Hartford
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies on image and molecular datasets, as well as a real-world large-scale biological assay, show that active learning for inference set design leads to significant reduction in experimental cost while retaining high system performance. |
| Researcher Affiliation | Collaboration | Valence Labs; University of Toronto; University of Manchester |
| Pseudocode | Yes | Pseudocode is available in Appendix B. |
| Open Source Code | Yes | The code is available at https://github.com/ineporozhnii/inference_set_design. All datasets to reproduce our results are publicly available, except one proprietary dataset for the results in Figure 8. |
| Open Datasets | Yes | The whole MNIST training set is used as the target set from which agents can acquire samples. The MNIST test set is split 50-50 into a validation set used for early stopping and a test set used for measuring model performance on held-out data inaccessible by agents. We use the Quantum Machine 9 (QM9) dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). For our experiments, we start by using the publicly available RxRx3 dataset (Fay et al., 2023). To evaluate the inference set design paradigm on a regression task we use the Molecules3D dataset (Xu et al., 2021). |
| Dataset Splits | Yes | Both datasets are split into inference, validation, and test sets with 80%, 5%, 15% fractions. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. It mentions 'HTS platforms' but this is a general term and not a specific hardware specification (e.g., GPU/CPU models, memory details). |
| Software Dependencies | Yes | As a first data processing step, we use the RDKit (Landrum et al., 2024) and Molfeat (Noutahi et al., 2023) libraries to convert molecular structures into SMILES strings and compute their Extended Connectivity Fingerprints (ECFPs). |
| Experiment Setup | Yes | Table 2: Hyperparameters for experiments. |
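
The Dataset Splits row reports inference, validation, and test fractions of 80%, 5%, and 15%. The following is a minimal sketch of how such a split could be produced, not the authors' implementation: the function name `split_indices`, the uniform random shuffle, and the `seed` parameter are all assumptions made for illustration, since the table states only the fractions.

```python
import random


def split_indices(n_samples, fractions, seed=0):
    """Shuffle range(n_samples) and cut it into consecutive parts sized by `fractions`.

    Hypothetical helper: the shuffle strategy and seed are illustrative
    assumptions, not details taken from the paper.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so every index is assigned exactly once.
        is_last = i == len(fractions) - 1
        end = n_samples if is_last else start + round(frac * n_samples)
        parts.append(indices[start:end])
        start = end
    return parts


# 80% inference, 5% validation, 15% test, as in the Dataset Splits row.
inference_idx, val_idx, test_idx = split_indices(1000, (0.80, 0.05, 0.15))
print(len(inference_idx), len(val_idx), len(test_idx))
```

The same helper would cover the 50-50 division of the MNIST test set described in the Open Datasets row by passing `(0.5, 0.5)` as the fractions.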