Intermediate Layer Classifiers for OOD generalization

Authors: Arnas Uselis, Seong Joon Oh

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform an extensive study over 9 datasets, covering various scenarios, including subpopulation shifts, conditional shifts, noise-level perturbations, and natural image shifts. Detailed definitions of shift types and corresponding datasets are listed in Table 2 below.
Researcher Affiliation Academia Arnas Uselis, Tübingen AI Center, University of Tübingen (EMAIL); Seong Joon Oh, Tübingen AI Center, University of Tübingen
Pseudocode Yes Algorithm 1 Training Intermediate Layer Classifiers (ILCs) Algorithm 2 Inference with ILC for l-th layer
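The two algorithms could be sketched as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the helper names (`extract_features`, `train_ilc`, `predict_ilc`) and the pooling/flattening choice are illustrative.

```python
import torch
import torch.nn as nn

def extract_features(backbone, layer, x):
    """Return flattened activations of one intermediate `layer` for batch x.
    The backbone is frozen; a forward hook captures the layer's output."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.setdefault("z", o))
    with torch.no_grad():
        backbone(x)
    handle.remove()
    return feats["z"].flatten(1)  # (batch, feature_dim)

def train_ilc(backbone, layer, loader, num_classes, lr=1e-3, l1=0.0, epochs=100):
    """Algorithm 1 (sketch): fit a linear head on frozen intermediate features,
    with optional l1 regularization as in the paper's hyperparameter search."""
    dim = extract_features(backbone, layer, next(iter(loader))[0]).shape[1]
    head = nn.Linear(dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            z = extract_features(backbone, layer, x)
            loss = nn.functional.cross_entropy(head(z), y)
            loss = loss + l1 * sum(p.abs().sum() for p in head.parameters())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head

def predict_ilc(backbone, layer, head, x):
    """Algorithm 2 (sketch): predict from layer l's features alone."""
    return head(extract_features(backbone, layer, x)).argmax(1)
```

Only the linear head receives gradient updates; the pretrained backbone is used purely as a frozen feature extractor, in the spirit of last-layer retraining applied at an arbitrary depth.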
Open Source Code Yes Code is available at https://github.com/oshapio/intermediate-layer-generalization.
Open Datasets Yes We perform an extensive study over 9 datasets, covering various scenarios... Datasets were selected based on the availability of distribution shifts and their compatibility with publicly available pre-trained model weights. (Table 2 mentions CMNIST (Arjovsky et al., 2020; Bahng et al., 2020), CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2020a), Multi-CelebA (Kim et al., 2023), CIFAR-10C, CIFAR-100C (Hendrycks & Dietterich, 2019), ImageNet-A, ImageNet-R, ImageNet-Cue Conflict, ImageNet-Silhouette (Hendrycks et al., 2021b;a; Geirhos et al., 2022))
Dataset Splits Yes In 3.1, we have introduced the few-shot and zero-shot settings for the OOD generalization. Below, we explain how we adopt each dataset for the required data splits, Dtrain, Dprobe, Dvalid, and Dtest. Training split Dtrain. For all datasets, we assume DNN models were trained on the given training split. Probe-training split Dprobe. For the zero-shot setting, we use the Dtrain split. For the few-shot setting, we use a subset of the OOD splits of each dataset. Validation split Dvalid. In all settings, we use the original held-out validation set whenever available in the datasets (Waterbirds, CelebA, Multi-CelebA, ImageNet). When unavailable (CMNIST, CIFAR-10C, CIFAR-100C), we use a random half of the test splits of the datasets. Test split Dtest. In all settings, we use the original test set. When half of it was used for validation due to a lack of a validation split, then we use the other half for evaluating the models.
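The split protocol quoted above could be sketched as below. The function and argument names are assumptions; in particular, the few-shot probe set is drawn here from the validation half to keep it disjoint from Dtest, while the paper's exact sampling may differ.

```python
import random

def make_splits(train_set, ood_test_set, val_set=None, few_shot=False,
                probe_size=100, seed=0):
    """Sketch of the split construction. Returns (D_probe, D_valid, D_test).

    - D_valid: the dataset's own validation split when available, otherwise
      a random half of the OOD test split (e.g. CMNIST, CIFAR-10C/100C).
    - D_test: the original test split, or the remaining half.
    - D_probe: D_train in the zero-shot setting; a small OOD subset in the
      few-shot setting (taken here from the validation half; assumption).
    """
    rng = random.Random(seed)
    idx = list(range(len(ood_test_set)))
    rng.shuffle(idx)
    if val_set is None:
        half = len(idx) // 2
        valid = [ood_test_set[i] for i in idx[:half]]
        test = [ood_test_set[i] for i in idx[half:]]
    else:
        valid, test = list(val_set), list(ood_test_set)
    probe = valid[:probe_size] if few_shot else list(train_set)
    return probe, valid, test
```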
Hardware Specification Yes Depending on availability, we use either Nvidia 2080 or A100 GPUs.
Software Dependencies No The paper mentions using the "Torch Vision library (maintainers & contributors, 2016)" and "Adam optimizer (Kingma & Ba, 2017)", but does not provide specific version numbers for the software or libraries used in their experiments (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We perform a minimal hyperparameter search over the learning rate and ℓ1 regularization strength (similarly to (Kirichenko et al., 2023)) when training ILCs. For the zero-shot case, we tune the learning rate η and ℓ1 regularization strength according to (η, ℓ1) ∈ H_zero-shot = {10^-4, 10^-3, 10^-2} × {0, 10^-3, 10^-2}. For the few-shot case, we use (η, ℓ1) ∈ H_few-shot = {10^-4, 10^-3, 10^-2} × {0, 10^-4, 10^-3, 10^-2}, since we found higher regularization rates to help last-layer retraining. For all experiments, we use the Adam optimizer (Kingma & Ba, 2017), training ILCs and performing last-layer retraining for 100 epochs. We repeat each experiment at least three times, varying the DNN model's seed, the initialization of the ILCs, and the data splits, depending on the specific setting.
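The two hyperparameter grids above can be written out explicitly; the sketch below uses the paper's (η, ℓ1) values, while `train_and_eval` is a hypothetical stand-in for training an ILC with one setting and returning its validation accuracy.

```python
from itertools import product

# Cartesian grids from the paper: learning rate eta x l1 strength.
H_ZERO_SHOT = list(product([1e-4, 1e-3, 1e-2], [0.0, 1e-3, 1e-2]))       # 9 settings
H_FEW_SHOT = list(product([1e-4, 1e-3, 1e-2], [0.0, 1e-4, 1e-3, 1e-2]))  # 12 settings

def grid_search(train_and_eval, grid):
    """Return the (lr, l1) pair whose validation score is highest.
    `train_and_eval(lr, l1)` is assumed to train an ILC and return a score."""
    return max(grid, key=lambda hp: train_and_eval(*hp))
```

The few-shot grid adds 10^-4 to the ℓ1 values because, per the quoted setup, stronger regularization helped last-layer retraining in that regime.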