ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
Authors: Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan Ö. Arık, Somesh Jha, Tomas Pfister
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on numerous image, text and structured datasets, which suffer from domain shifts, demonstrate that ASPEST can significantly outperform prior work on selective prediction and active learning (e.g. on the MNIST→SVHN benchmark with the labeling budget of 100, ASPEST improves the AUACC metric from 79.36% to 88.84%) and achieves more optimal utilization of humans in the loop. |
| Researcher Affiliation | Collaboration | Jiefeng Chen (University of Wisconsin-Madison), Jinsung Yoon (Google), Sayna Ebrahimi (Google), Sercan Ö. Arık (Google), Somesh Jha (University of Wisconsin-Madison; Google), Tomas Pfister (tpfister@google.com, Google) |
| Pseudocode | Yes | Algorithm 1 Softmax Response with Active Learning Algorithm 2 Deep Ensembles with Active Learning Algorithm 3 Active Selective Prediction using Ensembles and Self-Training |
| Open Source Code | Yes | Our code is available at: https://github.com/google-research/google-research/tree/master/active_selective_prediction. |
| Open Datasets | Yes | Specifically, we use the following datasets with distribution shift: (i) MNIST→SVHN (LeCun, 1998; Netzer et al., 2011), (ii) CIFAR-10→CINIC-10 (Krizhevsky et al., 2009; Darlow et al., 2018), (iii) FMoW (Koh et al., 2021), (iv) Amazon Review (Koh et al., 2021), (v) DomainNet (Peng et al., 2019) and (vi) Otto (Benjamin Bossan, 2015). |
| Dataset Splits | Yes | MNIST consists of 28×28 grayscale images of handwritten digits, containing in total 55,000 training images and 10,000 test images. We resize each image to 32×32 resolution and convert them to color. We use the training set of MNIST as D_tr and the test set of MNIST as the source validation dataset. SVHN consists of 32×32 colored images of digits obtained from house numbers in Google Street View images. The training set has 73,257 images and the test set has 26,032 images. We use the test set of SVHN as U_X. |
| Hardware Specification | Yes | We run all experiments with TensorFlow 2.0 on NVIDIA A100 GPUs in the Debian GNU/Linux 10 system. |
| Software Dependencies | Yes | We run all experiments with TensorFlow 2.0 on NVIDIA A100 GPUs in the Debian GNU/Linux 10 system. |
| Experiment Setup | Yes | Active learning hyper-parameters. We evaluate different methods with different labeling budget M values on each dataset. By default, we set the number of rounds T = 10 for all methods (Appendix F.6 presents the effect of T). During the active learning process, we fine-tune the model on the selected labeled test data. During fine-tuning, we don't apply any data augmentation to the test data. We use the same fine-tuning hyper-parameters for different methods to ensure a fair comparison. More details on the fine-tuning hyper-parameters can be found in Appendix E.4. Hyper-parameters of ASPEST. Table 2 comprehensively lists all the hyper-parameters used in ASPEST, along with their respective default values. We set λ = 1, ns = 1000 and N = 5 (see Appendix F.7 for the effect of N), which are the same as those for Deep Ensembles, for fair comparison. For all datasets, we use cs = 200, p = 0.1, η = 0.9, the number of self-training epochs to be 20 and ce = 5. |
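The Experiment Setup row can be collected into a small configuration sketch: the labeling budget M is spent over T = 10 rounds, with ensemble size N = 5 and the listed ASPEST defaults. This is an illustrative reconstruction only; the class and method names (`AspestConfig`, `per_round_budget`) are hypothetical and do not come from the released google-research code, and the even per-round budget split is an assumption.

```python
# Hypothetical sketch of the hyper-parameter defaults quoted above.
# Symbol names mirror the paper's notation (λ, ns, N, cs, p, η, ce);
# their roles are defined in the paper's Table 2, not restated here.
from dataclasses import dataclass


@dataclass
class AspestConfig:
    labeling_budget: int            # M, varies per dataset
    num_rounds: int = 10            # T, default for all methods
    ensemble_size: int = 5          # N, same as Deep Ensembles
    lam: float = 1.0                # λ
    ns: int = 1000
    cs: int = 200
    p: float = 0.1
    eta: float = 0.9                # η
    self_training_epochs: int = 20
    ce: int = 5

    def per_round_budget(self) -> int:
        """Labels queried per round, assuming M is split evenly over T rounds."""
        return self.labeling_budget // self.num_rounds


# E.g. the MNIST→SVHN benchmark with labeling budget M = 100:
cfg = AspestConfig(labeling_budget=100)
print(cfg.per_round_budget())  # 10 labels queried per round
```

With these defaults the only per-dataset knob left is the labeling budget M, which matches the report's note that the same fine-tuning hyper-parameters are reused across methods for a fair comparison.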