Out-of-Distribution Learning with Human Feedback

Authors: Haoyue Bai, Xuefeng Du, Katie Rainey, Shibin Parameswaran, Yixuan Li

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a significant margin. Code is publicly available at https://github.com/HaoyueBaiZJU/ood-hf. Lastly, we provide extensive experiments showing that this human-centered approach can effectively improve both OOD generalization and detection under a small annotation budget (Section 4).
Researcher Affiliation Collaboration Haoyue Bai (EMAIL), Department of Computer Sciences, University of Wisconsin-Madison; Katie Rainey (EMAIL), Naval Information Warfare Center Pacific
Pseudocode Yes An end-to-end algorithm is fully specified in Appendix A.
Open Source Code Yes Code is publicly available at https://github.com/HaoyueBaiZJU/ood-hf.
Open Datasets Yes Datasets and benchmarks. Following the setup in Bai et al. (2023), we employ CIFAR-10 (Krizhevsky et al., 2009) as P_in and CIFAR-10-C (Hendrycks & Dietterich, 2018) with Gaussian additive noise as P_out^covariate. For P_out^semantic, we leverage SVHN (Netzer et al., 2011), Textures (Cimpoi et al., 2014), Places365 (Zhou et al., 2017), and LSUN (Yu et al., 2015). Detailed descriptions of the datasets and data mixture can be found in Appendix B. To demonstrate the adaptability and robustness of our proposed method, we extend the framework to more diverse settings and datasets. Additional results on other types of covariate shifts can be found in Appendix E. Additional results on PACS benchmark. In Table 2, we report results on the PACS dataset (Li et al., 2017) from DomainBed.
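The unlabeled "wild" data described above mixes in-distribution, covariate-shifted, and semantic OOD examples. A minimal sketch of how such a mixture could be composed, assuming hypothetical mixing fractions pi_c and pi_s (the report does not state the actual ratios, and `make_wild_mixture` is an illustrative helper, not the authors' code):

```python
import random

def make_wild_mixture(id_pool, covariate_pool, semantic_pool,
                      pi_c=0.5, pi_s=0.1, n=1000, seed=0):
    """Compose an unlabeled wild set from ID, covariate-shifted, and
    semantic OOD pools. pi_c and pi_s are hypothetical mixing fractions."""
    rng = random.Random(seed)
    n_c = int(n * pi_c)           # covariate-shift share (e.g. CIFAR-10-C)
    n_s = int(n * pi_s)           # semantic OOD share (e.g. SVHN, LSUN)
    n_id = n - n_c - n_s          # the rest stays in-distribution
    wild = (rng.sample(id_pool, n_id)
            + rng.sample(covariate_pool, n_c)
            + rng.sample(semantic_pool, n_s))
    rng.shuffle(wild)             # the mixture is unlabeled, so shuffle it
    return wild
```

The same helper also serves for building the mixture validation set mentioned in the split description.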
Dataset Splits Yes We divide the CIFAR-10 training set into 50% labeled as ID and 50% unlabeled. Details of data split for OOD datasets: for datasets with a standard train-test split (e.g., SVHN), we use the original test split for evaluation. For other OOD datasets (e.g., LSUN-C), we use 70% of the data for creating the wild mixture training data as well as the mixture validation dataset, and reserve the remaining examples for test-time evaluation. Within that 70%, we use 30% for validation and the rest for training.
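The split arithmetic above can be made concrete with a short sketch (`ood_dataset_splits` is a hypothetical helper that only computes the split sizes described in the report):

```python
def ood_dataset_splits(n_examples, has_standard_split=False):
    """Split sizes for an OOD dataset, per the description above:
    70% goes to the wild-mixture pool, which is further divided 70/30
    into training/validation; the remaining 30% is held out for test."""
    if has_standard_split:
        return {"test": n_examples}          # use the original test split as-is
    n_mix = int(n_examples * 0.7)            # wild-mixture pool
    n_val = int(n_mix * 0.3)                 # 30% of the pool for validation
    return {"train": n_mix - n_val, "val": n_val, "test": n_examples - n_mix}
```

For a 10,000-example OOD set this yields 4,900 training, 2,100 validation, and 3,000 test examples.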
Hardware Specification Yes All experiments are performed using NVIDIA GeForce RTX 2080 Ti.
Software Dependencies Yes Our implementation is based on PyTorch 1.8.1. We run all experiments with Python 3.8.5 and PyTorch 1.13.1, using NVIDIA GeForce RTX 2080 Ti GPUs.
Experiment Setup Yes Experimental details. To ensure a fair comparison with prior works (Bai et al., 2023; Liu et al., 2020; Katz-Samuels et al., 2022), we adopt WideResNet with 40 layers and a widen factor of 2 (Zagoruyko & Komodakis, 2016). We use stochastic gradient descent with Nesterov momentum (Duchi et al., 2011), with weight decay 0.0005 and momentum 0.09. The model is initialized with a pre-trained network on CIFAR-10, and then trained for 100 epochs using our objective in Equation 7, with α = 10. We use a batch size of 128 and an initial learning rate of 0.1 with cosine learning rate decay. We default k to 1000 and provide analysis of different labeling budgets k ∈ {100, 500, 1000, 2000} in Section 4.3.
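The cosine learning-rate decay mentioned in the setup follows the standard annealing formula; a minimal sketch matching the stated hyperparameters (initial rate 0.1, 100 epochs — `cosine_lr` is an illustrative function, not the authors' code):

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.1):
    """Standard cosine annealing: decay from lr0 at epoch 0
    down to 0 at the final epoch."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

At epoch 0 this gives 0.1, at the halfway point 0.05, and at epoch 100 it reaches 0, matching the schedule described above.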