apricot: Submodular selection for data summarization in Python

Authors: Jacob Schreiber, Jeffrey Bilmes, William Stafford Noble

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. To demonstrate the practical utility of the selected examples, we evaluated logistic regression models trained on subsets of examples from the two data sets.
Researcher Affiliation | Academia | Jacob Schreiber EMAIL Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195-4322, USA; Jeffrey Bilmes EMAIL Department of Electrical & Computer Engineering, University of Washington, Seattle, WA 98195-4322, USA; William Stafford Noble EMAIL Department of Genome Sciences, University of Washington, Seattle, WA 98195-4322, USA
Pseudocode | No | The paper describes several algorithms (e.g., the greedy algorithm, the accelerated greedy algorithm, stochastic greedy, sample greedy, approximate lazy greedy, bidirectional greedy, and GreeDi) but does not provide structured pseudocode or algorithm blocks for any of them.
Open Source Code | Yes | The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot
Open Datasets | Yes | To illustrate this approach in apricot, we consider two data sets: classifying digits from images in the MNIST data set (LeCun et al., 1998) and classifying articles of clothing from images in the Fashion MNIST data set (Xiao et al., 2017).
Dataset Splits | Yes | The subsets were chosen solely from the training sets (60,000 examples each) using either a facility location function or 20 iterations of random selection; the model is evaluated on the full test set each time.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software such as Python, numba (Lam et al., 2015), scikit-learn transformers, and keras (Chollet et al., 2015), but does not specify version numbers for any of these components.
Experiment Setup | No | The paper mentions evaluating logistic regression models on subsets of varying sizes, but does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or other detailed training configurations for these models.
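Although the paper itself supplies no pseudocode, the naive greedy algorithm it references is standard for submodular maximization: repeatedly add the element with the largest marginal gain. A minimal sketch for a facility-location objective follows; this is an illustration of the algorithm family, not apricot's implementation, and the Gaussian-kernel similarity and function names here are the author-of-this-note's assumptions.

```python
import numpy as np

def greedy_facility_location(X, k):
    """Greedily pick k rows of X maximizing a facility-location objective.

    The objective is sum_i max_{j in S} sim(i, j): every point should be
    well represented by its most similar selected exemplar.
    """
    # Illustrative similarity: Gaussian kernel on pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = np.exp(-dist)

    n = X.shape[0]
    selected = []
    best = np.zeros(n)  # per-point similarity to its closest selected exemplar
    for _ in range(k):
        # Marginal gain of each candidate; already-selected items get -inf.
        gains = np.array([
            np.maximum(best, sim[:, j]).sum() - best.sum()
            if j not in selected else -np.inf
            for j in range(n)
        ])
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```

The accelerated ("lazy") greedy variant the paper mentions exploits submodularity — marginal gains only shrink as the set grows — to skip re-evaluating most candidates via a priority queue, but the selected set is identical.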
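The evaluation protocol described above — train a logistic regression model on a selected subset of the training set, then score it on the full test set — can be sketched as follows. This uses scikit-learn's small bundled digits data set as a stand-in for MNIST, and a uniform random subset as a stand-in for apricot's facility-location selection (per the repository, apricot exposes a scikit-learn-style `FacilityLocationSelection` selector that could replace the random indices); the subset size and `max_iter` value are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8-digit stand-in for MNIST, split into train and test sets.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stand-in subset: uniform random indices in place of submodular selection.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_tr), size=250, replace=False)

# Train on the full training set vs. the subset; evaluate both on the
# full test set, mirroring the paper's comparison.
acc_full = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
acc_sub = LogisticRegression(max_iter=2000).fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
print(f"full training set: {acc_full:.3f}, 250-example subset: {acc_sub:.3f}")
```

The paper's claim is that a well-chosen representative subset narrows the gap between these two accuracies relative to random selection of the same size.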