reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Principled Out-of-Distribution Detection via Multiple Testing

Authors: Akshayaa Magesh, Venugopal V. Veeravalli, Anirban Roy, Susmit Jha

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks architectures.
Researcher Affiliation	Collaboration	Akshayaa Magesh (EMAIL), Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Venugopal V. Veeravalli (EMAIL), Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Anirban Roy (EMAIL), Computer Science Laboratory, SRI International, Menlo Park, CA 94061; Susmit Jha (EMAIL), Computer Science Laboratory, SRI International, Menlo Park, CA 94061
Pseudocode	Yes	Algorithm 1: BH based OOD detection test with conformal p-values; Algorithm 2: Bonferroni based OOD detection test with conformal p-values
Open Source Code	No	The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets	Yes	For CIFAR10 as the in-distribution dataset, we study SVHN, LSUN, ImageNet, and iSUN as OOD datasets. For SVHN as the in-distribution dataset, we study LSUN, ImageNet, CIFAR10 and iSUN as OOD datasets.
Dataset Splits	Yes	The calibration dataset in each case is a subset of 5000 samples of the in-distribution training dataset. We use a subset of 45000 points from the training dataset (with no overlap with the calibration dataset) to calculate the class-wise empirical means and shared covariance for the Mahalanobis scores, and the minimum and maximum correlations for the Gram scores.
Hardware Specification	Yes	All experiments presented in this paper were run on a single NVIDIA GTX-1080Ti GPU with PyTorch.
Software Dependencies	No	The paper mentions 'PyTorch' but does not specify a version number or other software dependencies with version numbers.
Experiment Setup	Yes	In our experiments, we set the temperature parameter T to 100 for all in-distribution datasets, DNN architectures and OOD datasets (as stated by Liu et al. (2020), the energy score is not sensitive to the temperature parameter).
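
The table names three building blocks without showing them: the energy score with temperature T, conformal p-values computed from a calibration set, and the BH and Bonferroni tests that combine several p-values. Since no source code is available, the following is only a minimal sketch of the textbook versions of these pieces, not the authors' implementation. The function names, the toy calibration values, and the convention that low scores indicate OOD are all assumptions made for illustration, and the exact thresholds in the paper's Algorithm 1 and Algorithm 2 may differ from the standard ones used here.

```python
import math


def energy_score(logits, T=100.0):
    """Negative free energy T * logsumexp(logits / T), as in Liu et al. (2020)."""
    scaled = [v / T for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return T * (m + math.log(sum(math.exp(v - m) for v in scaled)))


def conformal_p_value(calib_scores, test_score):
    """Conformal p-value from in-distribution calibration scores.

    Assumption: LOW scores indicate OOD, so the p-value counts the
    calibration scores that are at most the test score.
    """
    n = len(calib_scores)
    return (1 + sum(1 for s in calib_scores if s <= test_score)) / (n + 1)


def bh_is_ood(p_values, alpha):
    """Textbook Benjamini-Hochberg test of the global null.

    Declares OOD if any sorted p-value p_(i) is at most alpha * i / K.
    """
    K = len(p_values)
    ranked = enumerate(sorted(p_values), start=1)
    return any(p <= alpha * i / K for i, p in ranked)


def bonferroni_is_ood(p_values, alpha):
    """Textbook Bonferroni test: declares OOD if min p is at most alpha / K."""
    return min(p_values) <= alpha / len(p_values)


# Toy calibration scores for two hypothetical statistics (illustrative values).
calibration = {
    "stat_a": [1.0, 2.0, 3.0, 4.0],
    "stat_b": [10.0, 20.0, 30.0, 40.0],
}
test_scores = {"stat_a": 0.5, "stat_b": 25.0}

p_values = [
    conformal_p_value(calibration[name], test_scores[name])
    for name in calibration
]
```

With only four calibration points the smallest attainable p-value is 1/5, so a realistic significance level needs a much larger calibration set, such as the 5000 samples reported under Dataset Splits.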