Selective Classification Under Distribution Shifts

Authors: Hengyue Liang, Le Peng, Ju Sun

TMLR 2024

Reproducibility Variable — Result — Supporting excerpt
Research Type — Experimental. Supporting excerpt: "Through extensive analysis and experiments, we show that our proposed score functions are more effective and reliable than the existing ones for generalized SC on a variety of classification tasks and DL classifiers. The code is available at https://github.com/sun-umn/sc_with_distshift." Also, Section 4 (Experiments): "In this section, we experiment with various multiclass classification tasks and recent DNN classifiers to verify the effectiveness of our margin-based score functions for generalized SC."
Researcher Affiliation — Academia. Supporting excerpt: "Hengyue Liang (EMAIL), Department of Electrical and Computer Engineering, University of Minnesota; Le Peng (EMAIL), Department of Computer Science and Engineering, University of Minnesota; Ju Sun (EMAIL), Department of Computer Science and Engineering, University of Minnesota."
Pseudocode — Yes. Supporting excerpt: "Algorithm 1: Non-training-based selective classification. Algorithm 2: Typical OOD detection pipeline (e.g., Sun et al. (2021))."
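Algorithm 1 is only named here, not reproduced. As a rough illustration of what non-training-based selective classification does, the sketch below accepts or abstains on a prediction by thresholding a confidence score; it uses the maximum softmax probability as a stand-in score function, whereas the paper proposes margin-based scores that are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def selective_classify(logits, threshold):
    """Predict the argmax class if the confidence score clears the
    threshold; otherwise abstain (return None).

    Score used here: maximum softmax probability (an illustrative
    choice, not the paper's margin-based score).
    """
    probs = softmax(logits)
    score = max(probs)
    return probs.index(score) if score >= threshold else None
```

For example, `selective_classify([2.0, 0.5, 0.1], threshold=0.7)` accepts and returns class 0, while raising the threshold to 0.9 makes the classifier abstain on the same input.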
Open Source Code — Yes. Supporting excerpt: "The code is available at https://github.com/sun-umn/sc_with_distshift."
Open Datasets — Yes. Supporting excerpt: "our evaluation tasks include (i) ImageNet (Russakovsky et al., 2015), the most widely used testbed for image classification, with a covariate-shifted version ImageNet-C (Hendrycks & Dietterich, 2018) composed of synthetic perturbations, and OpenImage-O (Wang et al., 2022) composed of natural images similar to ImageNet but with disjoint labels, i.e., label-shifted samples; (ii) iWildCam (Beery et al., 2020) test set provides two subsets of animal images taken at different geo-locations; (iii) Amazon (Ni et al., 2019) test set provides two subsets of review comments by different users; (iv) CIFAR-10 (Krizhevsky et al., 2009), a small image classification dataset commonly used in previous training-based SC works, together with CIFAR-10-C (perturbed CIFAR-10) and CIFAR-100 (with disjoint labels from CIFAR-10)."
Dataset Splits — Yes. Supporting excerpt, Table 2: Summary of In-D and distribution-shifted datasets used for SC evaluation.

Task     | In-D (split)       | Classes | In-D samples | Shift-Cov samples               | Shift-Label samples
ImageNet | ILSVRC-2012 (val)  | 1000    | 50,000       | ImageNet-C (severity 3), 50,000 | OpenImage-O, 17,256
iWildCam | iWildCam (id_test) | 178     | 8,154        | iWildCam (ood_test), 42,791     | —
Amazon   | Amazon (id_test)   | 5       | 46,950       | Amazon (test), 100,050          | —
CIFAR    | CIFAR-10 (val)     | 10      | 10,000       | CIFAR-10-C (severity 3), 10,000 | CIFAR-100, 10,000
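The In-D and shifted splits above are scored with selective-classification metrics. A minimal sketch of the standard quantities, coverage (fraction of samples accepted at a score threshold) and selective risk (error rate among accepted samples), is below; the exact metrics the paper reports, such as full risk-coverage curves, may differ in detail.

```python
def selective_metrics(scores, correct, threshold):
    """Return (coverage, selective_risk) at a given score threshold.

    scores  : confidence score per sample (higher = more confident)
    correct : whether the classifier's prediction is correct per sample
    """
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    if not accepted:
        return 0.0, 0.0  # nothing accepted: zero coverage, risk undefined -> 0
    coverage = len(accepted) / len(scores)
    risk = sum(1 for c in accepted if not c) / len(accepted)
    return coverage, risk
```

Sweeping the threshold over all observed score values traces out the risk-coverage trade-off used to compare score functions.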
Hardware Specification — No. Supporting excerpt: "The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported in this article."
Software Dependencies — No. The paper mentions 'timm' (Wightman, 2019) for model retrieval and 'PyTorch' for reimplementing ScNet. However, it does not specify exact version numbers for these software libraries, nor for Python, CUDA, or other dependencies.
Experiment Setup — Yes. Supporting excerpt, Table 7: Key hyperparameters for the ScNet training used in this paper.

Dataset     | Model architecture | Dropout prob. | Target coverage | Batch size | Total epochs | LR (base) | Scheduler
CIFAR-10    | VGG                | 0.3           | 0.7             | 128        | 300          | 0.1       | StepLR
ImageNet-1k | resnet34           | N/A           | 0.7             | 768        | 250          | 0.1       | CosineAnnealingLR
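The two schedulers named in Table 7 are standard learning-rate decay rules. A minimal sketch of each follows; the step size, gamma, and minimum LR are illustrative defaults and are not specified in the excerpt above.

```python
import math

def step_lr(base_lr, epoch, step_size=100, gamma=0.1):
    """StepLR: multiply the LR by `gamma` once every `step_size` epochs.
    step_size and gamma are illustrative, not values from the paper."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_annealing_lr(base_lr, epoch, total_epochs, eta_min=0.0):
    """CosineAnnealingLR: decay from base_lr to eta_min along a half cosine."""
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

With the Table 7 settings, for example, `cosine_annealing_lr(0.1, epoch, 250)` starts at 0.1 and anneals to 0 over the 250 ImageNet-1k training epochs.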