Out-of-Distribution Learning with Human Feedback

Authors: Haoyue Bai, Xuefeng Du, Katie Rainey, Shibin Parameswaran, Yixuan Li

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a significant margin. Code is publicly available at https://github.com/HaoyueBaiZJU/ood-hf. Lastly, we provide extensive experiments showing that this human-centered approach can effectively improve both OOD generalization and detection under a small annotation budget (Section 4).
Researcher Affiliation Collaboration Haoyue Bai (EMAIL), Department of Computer Sciences, University of Wisconsin-Madison; Katie Rainey (EMAIL), Naval Information Warfare Center Pacific
Pseudocode Yes An end-to-end algorithm is fully specified in Appendix A.
Open Source Code Yes Code is publicly available at https://github.com/HaoyueBaiZJU/ood-hf.
Open Datasets Yes Datasets and benchmarks. Following the setup in Bai et al. (2023), we employ CIFAR-10 (Krizhevsky et al., 2009) as P_in and CIFAR-10-C (Hendrycks & Dietterich, 2018) with Gaussian additive noise as P_out^covariate. For P_out^semantic, we leverage SVHN (Netzer et al., 2011), Textures (Cimpoi et al., 2014), Places365 (Zhou et al., 2017), and LSUN (Yu et al., 2015). Detailed descriptions of the datasets and data mixture can be found in Appendix B. To demonstrate the adaptability and robustness of our proposed method, we extend the framework to more diverse settings and datasets. Additional results on other types of covariate shifts can be found in Appendix E. Additional results on PACS benchmark. In Table 2, we report results on the PACS dataset (Li et al., 2017) from DomainBed.
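The unlabeled "wild" data described above mixes in-distribution, covariate-shifted, and semantic OOD examples. A minimal sketch of how such a mixture could be composed, assuming hypothetical mixing fractions pi_c and pi_s (the report does not state the actual ratios, and `make_wild_mixture` is an illustrative helper, not the authors' code):

```python
import random

def make_wild_mixture(id_pool, covariate_pool, semantic_pool,
                      pi_c=0.5, pi_s=0.1, n=1000, seed=0):
    """Compose an unlabeled wild set from ID, covariate-shifted, and
    semantic OOD pools. pi_c and pi_s are hypothetical mixing fractions."""
    rng = random.Random(seed)
    n_c = int(n * pi_c)           # covariate-shift share (e.g. CIFAR-10-C)
    n_s = int(n * pi_s)           # semantic OOD share (e.g. SVHN, LSUN)
    n_id = n - n_c - n_s          # the rest stays in-distribution
    wild = (rng.sample(id_pool, n_id)
            + rng.sample(covariate_pool, n_c)
            + rng.sample(semantic_pool, n_s))
    rng.shuffle(wild)             # the mixture is unlabeled, so shuffle it
    return wild
```

The same helper also serves for building the mixture validation set mentioned in the split description.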
Dataset Splits Yes We divide the CIFAR-10 training set into 50% labeled as ID and 50% unlabeled. Details of data split for OOD datasets: for datasets with a standard train-test split (e.g., SVHN), we use the original test split for evaluation. For other OOD datasets (e.g., LSUN-C), we use 70% of the data for creating the wild mixture training data as well as the mixture validation dataset, and reserve the remaining examples for test-time evaluation. Within that 70%, we use 30% for validation and the rest for training.
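The split arithmetic above can be made concrete with a short sketch (`ood_dataset_splits` is a hypothetical helper that only computes the split sizes described in the report):

```python
def ood_dataset_splits(n_examples, has_standard_split=False):
    """Split sizes for an OOD dataset, per the description above:
    70% goes to the wild-mixture pool, which is further divided 70/30
    into training/validation; the remaining 30% is held out for test."""
    if has_standard_split:
        return {"test": n_examples}          # use the original test split as-is
    n_mix = int(n_examples * 0.7)            # wild-mixture pool
    n_val = int(n_mix * 0.3)                 # 30% of the pool for validation
    return {"train": n_mix - n_val, "val": n_val, "test": n_examples - n_mix}
```

For a 10,000-example OOD set this yields 4,900 training, 2,100 validation, and 3,000 test examples.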
Hardware Specification Yes All experiments are performed using NVIDIA GeForce RTX 2080 Ti.
Software Dependencies Yes Our implementation is based on PyTorch 1.8.1. We run all experiments with Python 3.8.5 and PyTorch 1.13.1, using NVIDIA GeForce RTX 2080 Ti GPUs.
Experiment Setup Yes Experimental details. To ensure a fair comparison with prior works (Bai et al., 2023; Liu et al., 2020; Katz-Samuels et al., 2022), we adopt WideResNet with 40 layers and a widen factor of 2 (Zagoruyko & Komodakis, 2016). We use stochastic gradient descent with Nesterov momentum (Duchi et al., 2011), with weight decay 0.0005 and momentum 0.09. The model is initialized with a pre-trained network on CIFAR-10, and then trained for 100 epochs using our objective in Equation 7, with α = 10. We use a batch size of 128 and an initial learning rate of 0.1 with cosine learning rate decay. We default k to 1000 and provide analysis of different labeling budgets k ∈ {100, 500, 1000, 2000} in Section 4.3.
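The cosine learning-rate decay mentioned in the setup follows the standard annealing formula; a minimal sketch matching the stated hyperparameters (initial rate 0.1, 100 epochs — `cosine_lr` is an illustrative function, not the authors' code):

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.1):
    """Standard cosine annealing: decay from lr0 at epoch 0
    down to 0 at the final epoch."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

At epoch 0 this gives 0.1, at the halfway point 0.05, and at epoch 100 it reaches 0, matching the schedule described above.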