Position: Supervised Classifiers Answer the Wrong Questions for OOD Detection

Authors: Yucen Lily Li, Daohan Lu, Polina Kirichenko, Shikai Qiu, Tim G. J. Rudner, C. Bayan Bruss, Andrew Gordon Wilson

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate this lack of separability between ID and OOD features, we study four different models trained on ImageNet-1k: ResNet-18, ResNet-50, ViT-S/16, and ViT-B/16, with the OOD datasets of ImageNet-OOD (Yang et al., 2024b), Textures (Cimpoi et al., 2014), and iNaturalist (Van Horn et al., 2018). For each setting, we train an Oracle, a binary linear classifier, to differentiate between examples of ID features and OOD features and report its performance on held-out ID and OOD features.
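The "Oracle" described above is a binary linear classifier fit on extracted features. The paper does not specify its training procedure, so the following is a minimal sketch using plain logistic regression on NumPy feature arrays; the function names (`train_oracle`, `oracle_accuracy`) are hypothetical, not from the released code.

```python
import numpy as np

def train_oracle(id_feats, ood_feats, epochs=200, lr=0.1):
    """Fit a binary linear classifier (logistic regression) that labels
    ID features 0 and OOD features 1, via full-batch gradient descent."""
    X = np.vstack([id_feats, ood_feats])
    y = np.concatenate([np.zeros(len(id_feats)), np.ones(len(ood_feats))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of mean BCE loss
        b -= lr * np.mean(p - y)
    return w, b

def oracle_accuracy(w, b, id_feats, ood_feats):
    """Accuracy of the oracle on held-out ID (label 0) / OOD (label 1) features."""
    X = np.vstack([id_feats, ood_feats])
    y = np.concatenate([np.zeros(len(id_feats)), np.ones(len(ood_feats))])
    return float(np.mean(((X @ w + b) > 0).astype(float) == y))
```

High held-out Oracle accuracy would indicate that ID and OOD features are linearly separable; the paper's point is that this separability is often lacking.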
Researcher Affiliation | Collaboration | ¹New York University, ²Capital One. Correspondence to: Yucen Lily Li <EMAIL>, Andrew Gordon Wilson <EMAIL>.
Pseudocode | No | The paper describes methods and analyses but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/yucenli/ood-pathologies.
Open Datasets | Yes | To demonstrate this lack of separability between ID and OOD features, we study four different models trained on ImageNet-1k: ResNet-18, ResNet-50, ViT-S/16, and ViT-B/16, with the OOD datasets of ImageNet-OOD (Yang et al., 2024b), Textures (Cimpoi et al., 2014), and iNaturalist (Van Horn et al., 2018).
Dataset Splits | No | The paper frequently refers to 'training data', 'test data', 'in-distribution data', and 'OOD data', and mentions using subsets or specific classes from datasets (e.g., 'To evaluate the model on STL-10, we only use the 9 classes which overlap with CIFAR-10 classes'), but it does not provide explicit numerical splits (percentages or exact train/validation/test counts) beyond the use of standard benchmark datasets.
Hardware Specification | No | The paper acknowledges 'NYU IT High Performance Computing resources, services, and staff expertise' but does not specify any particular hardware, such as GPU or CPU models or memory, used for the experiments.
Software Dependencies | No | The paper mentions adapting the 'OpenOOD codebase (Zhang et al., 2023; Yang et al., 2022)' but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We train models for 100 epochs with batch size 128 for ID data and batch size 256 for the outlier dataset, using SGD with momentum, initial learning rate 0.1, and weight decay 5×10⁻⁴, and we set the coefficient before the OE loss to α = 0.5 (overall, we use standard training hyper-parameters as in Zhang et al. (2023)).
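The OE (Outlier Exposure) term referenced above penalizes confident predictions on outlier data. The standard formulation (Hendrycks et al., 2019) adds α times the cross-entropy between the model's outlier predictions and the uniform distribution to the usual ID cross-entropy. Below is a minimal NumPy sketch of that objective with α = 0.5 as in the setup; the function names are hypothetical and this is not the paper's released implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class dimension."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def oe_loss(id_logits, id_labels, outlier_logits, alpha=0.5):
    """Outlier Exposure objective: cross-entropy on ID data plus
    alpha * cross-entropy(outlier predictions, uniform distribution)."""
    ls_id = log_softmax(id_logits)
    ce = -np.mean(ls_id[np.arange(len(id_labels)), id_labels])
    # CE to the uniform target 1/K equals the negative mean log-probability
    # over all (example, class) entries.
    oe = -np.mean(log_softmax(outlier_logits))
    return ce + alpha * oe
```

The OE term is minimized when the model's outlier predictions are uniform, so confident predictions on outliers are penalized while ID classification is unaffected.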