Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
Authors: Hiroki Naganuma, Kartik Ahuja, Shiro Takagi, Tetsuya Motokawa, Rio Yokota, Kohta Ishikawa, Ikuro Sato, Ioannis Mitliagkas
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and the Backgrounds Challenge as testbeds for studying different types of shifts, namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. |
| Researcher Affiliation | Collaboration | 1. Mila Quebec AI Institute; 2. Université de Montréal; 3. Independent Researcher; 4. University of Tsukuba; 5. Tokyo Institute of Technology; 6. Denso IT Laboratory Inc.; 7. Canada CIFAR AI Chair |
| Pseudocode | Yes | Algorithm 1 Generic adaptive optimization method setup. |
| Open Source Code | Yes | Our code can be found at the link below. https://github.com/Hiroki11x/Optimizer_Comparison_OOD |
| Open Datasets | Yes | We evaluate the OOD generalization performance of these optimizers on 10 different benchmarks: DomainBed (which includes seven image datasets) (Gulrajani & Lopez-Paz, 2021), the Backgrounds Challenge dataset (Xiao et al., 2021), and CivilComments-WILDS (Koh et al., 2021). Image Classification Datasets: DomainBed consists of a set of benchmark datasets for domain generalization, which includes PACS (Fang et al., 2013), VLCS (Li et al., 2017), Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), Rotated MNIST (Ghifary et al., 2015), and Colored MNIST (Arjovsky et al., 2019). The Backgrounds Challenge dataset measures a model's robustness against background shift (Xiao et al., 2021). To further strengthen our claim, we also performed experiments on CIFAR10-C and CIFAR10-P, which can be cast as image corruption and perturbation shift. Natural Language Processing (NLP) Datasets: The CivilComments-WILDS dataset is cast as a subpopulation shift problem. |
| Dataset Splits | Yes | Our first approach partitions the data from the training domains into a training set and a validation set, then selects the model with the highest average accuracy on the training-domain validation data. In the training phase of the DomainBed datasets, we do not access the data in the test domain but split data from the training domains into a training set and a validation set; the split ratio is 80% for training and 20% for validation. In CivilComments-WILDS, we divide the data into training, validation, and test sets and maximize worst-group accuracy on the validation data (and by association, maximize the average accuracy over all domains). |
| Hardware Specification | No | We perform our experiments with ABCI (AI Bridging Cloud Infrastructure), a supercomputer owned by the National Institute of Advanced Industrial Science and Technology (AIST), and TSUBAME 3.0, a supercomputer owned by the Tokyo Institute of Technology. The computational resources instrumental to this study were provided under the auspices of the "ABCI Grand Challenge" Program (AIST) and the TSUBAME Grand Challenge Program (Tokyo Institute of Technology). |
| Software Dependencies | No | All code for the experiments is a modification of the code provided by the authors who introduced the datasets (Gulrajani & Lopez-Paz, 2021; Koh et al., 2021; Xiao et al., 2021). The DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDS (Koh et al., 2021) code is released under the MIT license; the Backgrounds Challenge code does not indicate a license. Our code can be found at the link below. https://github.com/Hiroki11x/Optimizer_Comparison_OOD |
| Experiment Setup | Yes | We describe the configurations of hyperparameters and the protocol for the experiments in further detail in Appendix E and Appendix D, respectively. Hyperparameter Tuning: The hyperparameters are tuned using the Bayesian optimization functionality of Weights & Biases, evaluating in-distribution validation accuracy. Table 5: DomainBed: Workloads; Table 6: DomainBed: ResNet-50; Table 7: DomainBed: MNIST ConvNet (Gulrajani & Lopez-Paz, 2021) |
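The "Algorithm 1" cited in the Pseudocode row is a generic first-order template that the paper instantiates as SGD, momentum SGD, Adam, and related adaptive methods. As a rough illustration only (not the paper's actual code; `adaptive_step` is a hypothetical name), the Adam instantiation of such a template looks like:

```python
import math

def adaptive_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a generic adaptive method, instantiated here as Adam.

    Other members of the family fall out of the same template, e.g.
    beta1 = 0 drops momentum and leaves a sign-normalized SGD-style update.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (scale) estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) for a few steps as a smoke test
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    w, m, v = adaptive_step(w, 2 * w, m, v, t, lr=0.1)
```

Because the first-step update magnitude of Adam is roughly `lr` regardless of gradient scale, ten steps at `lr=0.1` move `w` from 1.0 to near the minimum at 0.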
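The training-domain validation protocol quoted in the Dataset Splits row (80/20 split of training-domain data, then pick the model with the best average validation accuracy) can be sketched as follows; `split_train_val` and `select_best` are illustrative names, not functions from the paper's repository:

```python
import random

def split_train_val(examples, val_frac=0.2, seed=0):
    """Shuffle one training domain's examples and split them 80/20 (the paper's ratio)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_val = int(len(examples) * val_frac)
    # validation takes the first n_val shuffled indices, training takes the rest
    return [examples[i] for i in idx[n_val:]], [examples[i] for i in idx[:n_val]]

def select_best(val_acc_per_model):
    """Training-domain model selection: highest validation accuracy averaged over domains."""
    return max(val_acc_per_model,
               key=lambda name: sum(val_acc_per_model[name]) / len(val_acc_per_model[name]))

train, val = split_train_val(list(range(100)))
# per-model validation accuracy on two training domains (made-up numbers)
best = select_best({"adam": [0.71, 0.65], "sgd": [0.68, 0.70]})
```

Note that the test domain never enters either function, matching the quoted constraint that test-domain data is not accessed during training or selection.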