Intermediate Layer Classifiers for OOD generalization

Authors: Arnas Uselis, Seong Joon Oh

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform an extensive study over 9 datasets, covering various scenarios, including subpopulation shifts, conditional shifts, noise-level perturbations, and natural image shifts. Detailed definitions of shift types and corresponding datasets are listed in Table 2 below.
Researcher Affiliation Academia Arnas Uselis, Tübingen AI Center, University of Tübingen (EMAIL); Seong Joon Oh, Tübingen AI Center, University of Tübingen
Pseudocode Yes Algorithm 1 Training Intermediate Layer Classifiers (ILCs) Algorithm 2 Inference with ILC for l-th layer
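The two algorithms could be sketched as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the helper names (`extract_features`, `train_ilc`, `predict_ilc`) and the pooling/flattening choice are illustrative.

```python
import torch
import torch.nn as nn

def extract_features(backbone, layer, x):
    """Return flattened activations of one intermediate `layer` for batch x.
    The backbone is frozen; a forward hook captures the layer's output."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.setdefault("z", o))
    with torch.no_grad():
        backbone(x)
    handle.remove()
    return feats["z"].flatten(1)  # (batch, feature_dim)

def train_ilc(backbone, layer, loader, num_classes, lr=1e-3, l1=0.0, epochs=100):
    """Algorithm 1 (sketch): fit a linear head on frozen intermediate features,
    with optional l1 regularization as in the paper's hyperparameter search."""
    dim = extract_features(backbone, layer, next(iter(loader))[0]).shape[1]
    head = nn.Linear(dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            z = extract_features(backbone, layer, x)
            loss = nn.functional.cross_entropy(head(z), y)
            loss = loss + l1 * sum(p.abs().sum() for p in head.parameters())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head

def predict_ilc(backbone, layer, head, x):
    """Algorithm 2 (sketch): predict from layer l's features alone."""
    return head(extract_features(backbone, layer, x)).argmax(1)
```

Only the linear head receives gradient updates; the pretrained backbone is used purely as a frozen feature extractor, in the spirit of last-layer retraining applied at an arbitrary depth.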
Open Source Code Yes Code is available at https://github.com/oshapio/intermediate-layer-generalization.
Open Datasets Yes We perform an extensive study over 9 datasets, covering various scenarios... Datasets were selected based on the availability of distribution shifts and their compatibility with publicly available pre-trained model weights. (Table 2 mentions CMNIST (Arjovsky et al., 2020; Bahng et al., 2020), CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2020a), Multi-CelebA (Kim et al., 2023), CIFAR-10C, CIFAR-100C (Hendrycks & Dietterich, 2019), ImageNet-A, ImageNet-R, ImageNet-Cue Conflict, ImageNet-Silhouette (Hendrycks et al., 2021b;a; Geirhos et al., 2022))
Dataset Splits Yes In 3.1, we have introduced the few-shot and zero-shot settings for the OOD generalization. Below, we explain how we adopt each dataset for the required data splits, Dtrain, Dprobe, Dvalid, and Dtest. Training split Dtrain. For all datasets, we assume DNN models were trained on the given training split. Probe-training split Dprobe. For the zero-shot setting, we use the Dtrain split. For the few-shot setting, we use a subset of the OOD splits of each dataset. Validation split Dvalid. In all settings, we use the original held-out validation set whenever available in the datasets (Waterbirds, CelebA, Multi-CelebA, ImageNet). When unavailable (CMNIST, CIFAR-10C, CIFAR-100C), we use a random half of the test splits of the datasets. Test split Dtest. In all settings, we use the original test set. When half of it was used for validation due to a lack of a validation split, then we use the other half for evaluating the models.
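The split protocol quoted above could be sketched as below. The function and argument names are assumptions; in particular, the few-shot probe set is drawn here from the validation half to keep it disjoint from Dtest, while the paper's exact sampling may differ.

```python
import random

def make_splits(train_set, ood_test_set, val_set=None, few_shot=False,
                probe_size=100, seed=0):
    """Sketch of the split construction. Returns (D_probe, D_valid, D_test).

    - D_valid: the dataset's own validation split when available, otherwise
      a random half of the OOD test split (e.g. CMNIST, CIFAR-10C/100C).
    - D_test: the original test split, or the remaining half.
    - D_probe: D_train in the zero-shot setting; a small OOD subset in the
      few-shot setting (taken here from the validation half; assumption).
    """
    rng = random.Random(seed)
    idx = list(range(len(ood_test_set)))
    rng.shuffle(idx)
    if val_set is None:
        half = len(idx) // 2
        valid = [ood_test_set[i] for i in idx[:half]]
        test = [ood_test_set[i] for i in idx[half:]]
    else:
        valid, test = list(val_set), list(ood_test_set)
    probe = valid[:probe_size] if few_shot else list(train_set)
    return probe, valid, test
```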
Hardware Specification Yes Depending on availability, we use either Nvidia 2080 or A100 GPUs.
Software Dependencies No The paper mentions using the "Torch Vision library (maintainers & contributors, 2016)" and "Adam optimizer (Kingma & Ba, 2017)", but does not provide specific version numbers for the software or libraries used in their experiments (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We perform a minimal hyperparameter search over the learning rate and ℓ1 regularization strength (similarly to (Kirichenko et al., 2023)) when training ILCs. For the zero-shot case, we tune the learning rate η and ℓ1 regularization strength according to (η, ℓ1) ∈ H_zero-shot = {10^-4, 10^-3, 10^-2} × {0, 10^-3, 10^-2}. For the few-shot case, we use (η, ℓ1) ∈ H_few-shot = {10^-4, 10^-3, 10^-2} × {0, 10^-4, 10^-3, 10^-2}, since we found higher regularization rates to help last-layer retraining. For all experiments, we use the Adam optimizer (Kingma & Ba, 2017), training ILCs and performing last-layer retraining for 100 epochs. We repeat each experiment at least three times, varying the DNN model's seed, the initialization of the ILCs, and the data splits, depending on the specific setting.
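The two hyperparameter grids above can be written out explicitly; the sketch below uses the paper's (η, ℓ1) values, while `train_and_eval` is a hypothetical stand-in for training an ILC with one setting and returning its validation accuracy.

```python
from itertools import product

# Cartesian grids from the paper: learning rate eta x l1 strength.
H_ZERO_SHOT = list(product([1e-4, 1e-3, 1e-2], [0.0, 1e-3, 1e-2]))       # 9 settings
H_FEW_SHOT = list(product([1e-4, 1e-3, 1e-2], [0.0, 1e-4, 1e-3, 1e-2]))  # 12 settings

def grid_search(train_and_eval, grid):
    """Return the (lr, l1) pair whose validation score is highest.
    `train_and_eval(lr, l1)` is assumed to train an ILC and return a score."""
    return max(grid, key=lambda hp: train_and_eval(*hp))
```

The few-shot grid adds 10^-4 to the ℓ1 values because, per the quoted setup, stronger regularization helped last-layer retraining in that regime.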