apricot: Submodular selection for data summarization in Python
Authors: Jacob Schreiber, Jeffrey Bilmes, William Stafford Noble
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. To demonstrate the practical utility of the selected examples, we evaluated logistic regression models trained on subsets of examples from the two data sets. |
| Researcher Affiliation | Academia | Jacob Schreiber EMAIL Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195-4322, USA; Jeffrey Bilmes EMAIL Department of Electrical & Computer Engineering, University of Washington, Seattle, WA 98195-4322, USA; William Stafford Noble EMAIL Department of Genome Science, University of Washington, Seattle, WA 98195-4322, USA |
| Pseudocode | No | The paper describes various algorithms (e.g., the greedy algorithm, accelerated greedy algorithm, stochastic greedy, sample greedy, approximate lazy greedy, bidirectional greedy, and GreeDi) but does not provide structured pseudocode or algorithm blocks for them. |
| Open Source Code | Yes | The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot |
| Open Datasets | Yes | To illustrate this approach in apricot, we consider two data sets: classifying digits from images in the MNIST data set (LeCun et al., 1998) and classifying articles of clothing from images in the Fashion MNIST data set (Xiao et al., 2017). |
| Dataset Splits | Yes | The subsets were chosen solely from the training sets (of 60,000 examples each) using either a facility location function or 20 iterations of random selection. The model is evaluated on the full test set each time. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like "Python", "numba (Lam et al., 2015)", "scikit-learn transformers", and "keras (Chollet et al., 2015)", but it does not specify explicit version numbers for any of these components. |
| Experiment Setup | No | The paper mentions evaluating "logistic regression models" and using "subsets of varying sizes" but does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or other detailed training configurations for these models. |
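The facility location selection referenced in the table can be sketched as a plain greedy maximizer of the submodular function f(S) = Σᵢ maxⱼ∈S sim(xᵢ, xⱼ). The RBF-style similarity kernel and the toy data below are illustrative assumptions, not the paper's configuration; apricot itself ships an optimized implementation of this selection.

```python
import math


def similarity(a, b):
    # RBF-style similarity from squared Euclidean distance
    # (an illustrative kernel choice, not the paper's).
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)))


def facility_location_greedy(X, k):
    """Greedily select k example indices maximizing the facility
    location function f(S) = sum_i max_{j in S} sim(x_i, x_j)."""
    n = len(X)
    sims = [[similarity(X[i], X[j]) for j in range(n)] for i in range(n)]
    selected = []
    # best[i] tracks each point's max similarity to the selected set.
    best = [0.0] * n
    for _ in range(k):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sims[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sims[i][best_j]) for i in range(n)]
    return selected


# Toy usage: two tight clusters; selecting k=2 picks one representative
# from each, which is what makes the subset "representative".
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
subset = facility_location_greedy(X, 2)
```

Because the marginal gains are recomputed against the running `best` coverage vector, this naive loop runs in O(k·n²) time; the accelerated and stochastic variants named in the Pseudocode row exist precisely to cut that cost on data sets the size of MNIST.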