Can We Ignore Labels in Out of Distribution Detection?

Authors: Hong Yang, Qi Yu, Travis Desell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct the following experiments to verify the existence of label blindness in unlabeled OOD detection methods. All hyperparameters and configurations were the best performing from their respective original paper implementations, unless noted otherwise. Experiments are repeated 3 times. Note that code for fully replicating experiments of this work can be found at https://github.com/hyang0129/Problematic_Self_Supervised_OOD
Researcher Affiliation | Academia | Hong Yang, Qi Yu, Travis Desell; Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623, USA; EMAIL
Pseudocode | No | The paper includes theoretical proofs (Theorem 3.1, Lemma 3.2, Corollary 3.3, Theorem 3.4, Theorem 3.5, and all proofs in Appendix D) but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Note that code for fully replicating experiments of this work can be found at https://github.com/hyang0129/Problematic_Self_Supervised_OOD
Open Datasets | Yes | The ICML Facial Expressions dataset (Erhan et al., 2013) contains seven facial expressions split across 28,709 faces in the train set and 7,178 in the test set. The Stanford Cars dataset (Krause et al., 2013) contains 16,185 images taken from 196 classes of cars. The Food 101 dataset (Bossard et al., 2014) consists of 101 food categories and 101,000 images. In Appendix F.1, we show adjacent OOD results for CIFAR10 and CIFAR100.
Dataset Splits | Yes | To create the Adjacent OOD detection task, we randomly split 25% of all classes into the OOD set and retain 75% as the ID set. We also repeat our experiments three times with different seeds to account for different splits of the ID and OOD set. The Stanford Cars dataset (Krause et al., 2013) ... The data is split into 8,144 training images and 8,041 testing images, with each class being split roughly 50-50. The Food 101 dataset (Bossard et al., 2014) ... There are 250 manually reviewed test images and 750 training images for each class.
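The class-level split described above (25% of classes held out as OOD, 75% kept as ID, varied across seeds) can be sketched as follows. This is a minimal illustration of the stated protocol, not the authors' code; `adjacent_ood_split` and its arguments are hypothetical names.

```python
import random

def adjacent_ood_split(class_names, ood_fraction=0.25, seed=0):
    """Randomly assign a fraction of classes to the OOD set.

    Hypothetical helper sketching the Adjacent OOD protocol: a random
    25% of classes become OOD and the remaining 75% stay in-distribution.
    Repeating with different seeds yields different ID/OOD splits.
    """
    rng = random.Random(seed)
    classes = list(class_names)
    rng.shuffle(classes)
    n_ood = round(len(classes) * ood_fraction)
    ood_classes = set(classes[:n_ood])
    id_classes = set(classes[n_ood:])
    return id_classes, ood_classes

# Example: the 196 Stanford Cars classes -> 49 OOD classes, 147 ID classes
id_classes, ood_classes = adjacent_ood_split(range(196), seed=1)
```

Images are then routed to the ID or OOD evaluation set according to their class membership.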
Hardware Specification | No | The authors acknowledge Research Computing at the Rochester Institute of Technology for providing computational resources and support that have contributed to the research results reported in this publication (RIT, 2024).
Software Dependencies | No | The paper mentions specific software components like the ResNet50 architecture, SimCLR, Rotation Loss, the diffusion inpainting OOD detection method, Grad-CAM, and the CLIPN model, but does not provide specific version numbers for any of these or any underlying libraries/frameworks.
Experiment Setup | Yes | All hyperparameters and configurations were the best performing from their respective original paper implementations, unless noted otherwise. Supervised Baseline. We augment the training data using random rotation, horizontal flip, random crop, gray scale, and color jitter. Images are resized to 64×64. We train using stochastic gradient descent with momentum and a cosine annealing learning schedule. We train for 10 warm-up epochs followed by 150 regular epochs, selecting the weights with the highest validation accuracy. We use a standard ResNet50 architecture. Self-supervised Baselines. Images are resized to 64×64 for both cases. For SimCLR, we augment the training data using random rotation, horizontal flip, random crop, gray scale, and color jitter. For Rotation Loss, we use only random crop and horizontal flip. We train using stochastic gradient descent with momentum (and a cosine annealing learning schedule) and employ a standard ResNet50 architecture, training for 10 warm-up epochs followed by 500 regular epochs and selecting the weights with the best-learned representations. Unsupervised Baseline. We utilize the training configuration that generated the paper's main results, which involved an alternating 8×8 checkerboard mask, an LPIPS distance metric to calculate the OOD score, and 10 reconstructions per image. We modify only the input image size to be 64×64 for all datasets and run additional experiments to evaluate performance on their alternative MSE distance metric.
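The warm-up-then-cosine-annealing schedule used by the supervised baseline (10 warm-up epochs, then 150 regular epochs) can be sketched as a pure function of the epoch index. The linear warm-up shape and the `base_lr=0.1` value are assumptions for illustration; the paper does not specify them.

```python
import math

def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=10, total_epochs=160):
    """Sketch of a warm-up + cosine-annealing learning-rate schedule.

    Assumed form (not taken from the paper's code): the learning rate
    ramps linearly from base_lr/warmup_epochs to base_lr over the first
    `warmup_epochs`, then follows a cosine decay toward zero over the
    remaining epochs (150 regular epochs for the supervised baseline).
    """
    if epoch < warmup_epochs:
        # Linear warm-up: epoch 0 -> base_lr/10, epoch 9 -> base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing over the regular epochs: progress in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The self-supervised baselines would use the same shape with `total_epochs=510` (10 warm-up plus 500 regular epochs).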