Can We Ignore Labels in Out of Distribution Detection?

Authors: Hong Yang, Qi Yu, Travis Desell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct the following experiments to verify the existence of label blindness in unlabeled OOD detection methods. All hyperparameters and configurations were the best performing from their respective original paper implementations, unless noted otherwise. Experiments are repeated 3 times. Note that code for fully replicating experiments of this work can be found at https://github.com/hyang0129/Problematic_Self_Supervised_OOD
Researcher Affiliation | Academia | Hong Yang, Qi Yu, Travis Desell; Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623, USA; EMAIL
Pseudocode | No | The paper includes theoretical proofs (Theorem 3.1, Lemma 3.2, Corollary 3.3, Theorem 3.4, Theorem 3.5, and all proofs in Appendix D) but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Note that code for fully replicating experiments of this work can be found at https://github.com/hyang0129/Problematic_Self_Supervised_OOD
Open Datasets | Yes | The ICML Facial Expressions dataset (Erhan et al., 2013) contains seven facial expressions split across 28,709 faces in the train set and 7,178 in the test set. The Stanford Cars dataset (Krause et al., 2013) contains 16,185 images taken from 196 classes of cars. The Food 101 dataset (Bossard et al., 2014) consists of 101 food categories and 101,000 images. In Appendix F.1, we show adjacent OOD results for CIFAR10 and CIFAR100.
Dataset Splits | Yes | To create the Adjacent OOD detection task, we randomly split 25% of all classes into the OOD set and retain 75% as the ID set. We also repeat our experiments three times with different seeds to account for different splits of the ID and OOD set. The Stanford Cars dataset (Krause et al., 2013) ... The data is split into 8,144 training images and 8,041 testing images, with each class being split roughly 50-50. The Food 101 dataset (Bossard et al., 2014) ... There are 250 manually reviewed test images and 750 training images for each class.
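The class-level split described above (25% of classes held out as OOD, 75% kept as ID, varied across seeds) can be sketched as follows. This is a minimal illustration of the stated protocol, not the authors' code; `adjacent_ood_split` and its arguments are hypothetical names.

```python
import random

def adjacent_ood_split(class_names, ood_fraction=0.25, seed=0):
    """Randomly assign a fraction of classes to the OOD set.

    Hypothetical helper sketching the Adjacent OOD protocol: a random
    25% of classes become OOD and the remaining 75% stay in-distribution.
    Repeating with different seeds yields different ID/OOD splits.
    """
    rng = random.Random(seed)
    classes = list(class_names)
    rng.shuffle(classes)
    n_ood = round(len(classes) * ood_fraction)
    ood_classes = set(classes[:n_ood])
    id_classes = set(classes[n_ood:])
    return id_classes, ood_classes

# Example: the 196 Stanford Cars classes -> 49 OOD classes, 147 ID classes
id_classes, ood_classes = adjacent_ood_split(range(196), seed=1)
```

Images are then routed to the ID or OOD evaluation set according to their class membership.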
Hardware Specification | No | The authors acknowledge Research Computing at the Rochester Institute of Technology for providing computational resources and support that have contributed to the research results reported in this publication (RIT, 2024).
Software Dependencies | No | The paper mentions specific software components like the ResNet50 architecture, SimCLR, Rotation Loss, the diffusion inpainting OOD detection method, Grad-CAM, and the CLIPN model, but does not provide specific version numbers for any of these or any underlying libraries/frameworks.
Experiment Setup | Yes | All hyperparameters and configurations were the best performing from their respective original paper implementations, unless noted otherwise. Supervised Baseline. We augment the training data using random rotation, horizontal flip, random crop, gray scale, and color jitter. Images are resized to 64×64. We train using stochastic gradient descent with momentum and a cosine annealing learning schedule. We train for 10 warm-up epochs followed by 150 regular epochs, selecting the weights with the highest validation accuracy. We use a standard ResNet50 architecture. Self-supervised Baselines. Images are resized to 64×64 for both cases. For SimCLR, we augment the training data using random rotation, horizontal flip, random crop, gray scale, and color jitter. For Rotation Loss, we use only random crop and horizontal flip. We train using stochastic gradient descent with momentum (and a cosine annealing learning schedule) and employ a standard ResNet50 architecture, training for 10 warm-up epochs followed by 500 regular epochs and selecting the weights with the best-learned representations. Unsupervised Baseline. We utilize the training configuration that generated the paper's main results, which involved an alternating 8×8 checkerboard mask, an LPIPS distance metric to calculate the OOD score, and 10 reconstructions per image. We modify only the input image size to be 64×64 for all datasets and run additional experiments to evaluate performance on their alternative MSE distance metric.
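The warm-up-then-cosine-annealing schedule used by the supervised baseline (10 warm-up epochs, then 150 regular epochs) can be sketched as a pure function of the epoch index. The linear warm-up shape and the `base_lr=0.1` value are assumptions for illustration; the paper does not specify them.

```python
import math

def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=10, total_epochs=160):
    """Sketch of a warm-up + cosine-annealing learning-rate schedule.

    Assumed form (not taken from the paper's code): the learning rate
    ramps linearly from base_lr/warmup_epochs to base_lr over the first
    `warmup_epochs`, then follows a cosine decay toward zero over the
    remaining epochs (150 regular epochs for the supervised baseline).
    """
    if epoch < warmup_epochs:
        # Linear warm-up: epoch 0 -> base_lr/10, epoch 9 -> base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing over the regular epochs: progress in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The self-supervised baselines would use the same shape with `total_epochs=510` (10 warm-up plus 500 regular epochs).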