Beyond Distribution Shift: Spurious Features Through the Lens of Training Dynamics

Authors: Nihal Murali, Aahlad Manas Puli, Ke Yu, Rajesh Ranganath, Kayhan Batmanghelich

TMLR 2023

Reproducibility Variable / Result / LLM Response
Research Type: Experimental. While previous works highlight the harmful effects of spurious features on the generalization ability of DNNs, we emphasize that not all spurious features are harmful. ... We empirically show that the harmful spurious features can be detected by observing the learning dynamics of the DNN's early layers. ... We verify our claims on medical and vision datasets, both simulated and real...
Researcher Affiliation: Academia. Nihal Murali, Intelligent Systems Program, University of Pittsburgh; Aahlad Puli, Department of Computer Science, New York University; Ke Yu, Intelligent Systems Program, University of Pittsburgh; Rajesh Ranganath, Department of Computer Science, New York University; Kayhan Batmanghelich, Department of Electrical and Computer Engineering, Boston University
Pseudocode: No. The paper describes the methodology using mathematical definitions and textual descriptions, for example the prediction-depth definition PD(x) = min_k {k | f^k_knn(x) = f^i_knn(x) ∀ i > k} and the layer-wise k-NN probe g_knn(ϕ^k_q; {ϕ^k_i}_{i∈{1,2,...,m}}), but no structured pseudocode or algorithm blocks are explicitly labeled or formatted as such.
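Read charitably, the probe in that definition assigns each example the earliest layer at which a k-NN classifier fit on that layer's features already agrees with every deeper layer. A minimal pure-NumPy sketch of such a prediction-depth probe (hypothetical helper names and a simple Euclidean k-NN; the paper's own implementation lives in the linked repository):

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feat, k=5):
    # Majority vote among the k nearest training features (Euclidean distance).
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nearest = train_labels[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

def prediction_depth(per_layer_train_feats, train_labels, per_layer_query_feats, k=5):
    # One k-NN probe per layer; PD(x) is the earliest layer whose
    # prediction already agrees with every deeper probe.
    preds = [knn_predict(f, train_labels, q, k)
             for f, q in zip(per_layer_train_feats, per_layer_query_feats)]
    final = preds[-1]
    for depth in range(len(preds)):
        if all(p == final for p in preds[depth:]):
            return depth
    return len(preds) - 1
```

Examples whose prediction stabilizes in early layers (small PD) are the ones the paper associates with easy-to-learn, potentially spurious features.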
Open Source Code: Yes. The code for this project is publicly available at: https://github.com/batmanlab/TMLR23_Dynamics_of_Spurious_Features
Open Datasets: Yes. We use the Dominoes binary classification dataset (formed by concatenating two datasets vertically; see Fig-4), similar to the setup of Kirichenko et al. (2022). The bottom (top) image acts as the core (spurious) feature. Images are of size 64×32. We construct three pairs of domino datasets such that each pair has both a hard and an easy spurious feature with respect to the common core feature (see Table-1). We use classes {0,1} for MNIST and SVHN, {coat, dress} for FMNIST, and {airplane, automobile} for CIFAR10. We also include two classes from Kuzushiji-MNIST (or KMNIST) and construct a modification of this dataset called KMNpatch, which has a spurious patch feature (a 5x5 white patch in the top-left corner) for one of the two classes of KMNIST. ... We follow the procedure of DeGrave et al. (2021) to create the ChestX-ray14/GitHub-COVID dataset. This dataset comprises COVID-19-positive images from GitHub COVID repositories and negative images from the ChestX-ray14 dataset (Wang et al., 2017b). In addition, we also create the Chex-MIMIC dataset following the procedure of Puli et al. (2022). This dataset comprises 90% images of pneumonia from Chexpert (Irvin et al., 2019) and 90% healthy images from MIMIC-CXR (Johnson et al., 2019). ... For this experiment, we use the NIH dataset (Wang et al., 2017a)... We use the NICO++ (Non-I.I.D. Image dataset with Contexts) dataset (Zhang et al., 2022)...
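The Dominoes and KMNpatch constructions quoted above are simple enough to sketch. The NumPy fragment below illustrates both, under the assumption of 32×32 single-channel inputs scaled to [0, 1] (the actual preprocessing in the repository may differ):

```python
import numpy as np

def make_domino(top_img, bottom_img):
    # Vertical concatenation: top image carries the spurious feature,
    # bottom image the core feature, giving a 64x32 domino.
    assert top_img.shape == (32, 32) and bottom_img.shape == (32, 32)
    return np.concatenate([top_img, bottom_img], axis=0)

def add_kmn_patch(img, patch_size=5, value=1.0):
    # KMNpatch-style cue: a white patch in the top-left corner,
    # added to images of only one of the two KMNIST classes.
    out = img.copy()
    out[:patch_size, :patch_size] = value
    return out
```

Because the patch is written into a fixed corner of one class only, it becomes a perfectly predictive, easy-to-learn spurious feature by construction.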
Dataset Splits: No. The paper refers to 'validation' and 'core-only (test) accuracy' and discusses constructing training data with '90% prevalence' of spurious correlations for some datasets. It also mentions 'a held-out dataset sampled from the same distribution' for validation. However, specific split percentages (e.g., 80% train, 10% validation, 10% test) or explicit sample counts for the training, validation, and test sets are not provided consistently in the main text.
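For illustration, a '90% prevalence' training set can be sampled by letting the spurious attribute (e.g., data source, as in the Chex-MIMIC setup) agree with the core label 90% of the time. This is a hypothetical sketch, since the paper does not publish its exact split code:

```python
import numpy as np

def assign_spurious(labels, prevalence=0.9, seed=0):
    # With probability `prevalence`, the spurious attribute matches
    # the core label; otherwise it is flipped (binary labels assumed).
    rng = np.random.default_rng(seed)
    match = rng.random(len(labels)) < prevalence
    return np.where(match, labels, 1 - labels)
```

A held-out validation set drawn the same way keeps the spurious correlation, which is why the paper also measures 'core-only (test) accuracy' on data where the correlation is broken.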
Hardware Specification: No. The paper discusses various deep learning models (ResNet-18, VGG16, DenseNet121) and training procedures, but it does not specify the hardware used for the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies: No. The paper mentions using an 'Adam optimizer' and VGG16, ResNet-18, and DenseNet121 models, but it does not specify software versions for libraries such as PyTorch or TensorFlow, a Python version, or any other dependencies.
Experiment Setup: Yes. We train two VGG16 models, one on KMNIST with a spurious patch (Msh) and another on the original KMNIST without the patch (Morig). ... We train our models for 30 epochs using an Adam optimizer and a base learning rate of 0.01. We choose the best checkpoint using early stopping. ... We train a VGG16 model on these datasets for ten epochs using an Adam optimizer and a base learning rate of 0.01.
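The 'best checkpoint using early stopping' step can be captured by a small framework-agnostic tracker; the patience value below is an assumption, since the paper does not state one:

```python
class EarlyStopper:
    # Tracks the best validation loss and its epoch so the matching
    # checkpoint can be restored after training.
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        # Returns True once `patience` epochs pass without improvement.
        if val_loss < self.best:
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop this would wrap the per-epoch validation pass (here with VGG16, Adam, and a 0.01 base learning rate, per the quoted setup), saving a checkpoint whenever `best_epoch` updates.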