Severing Spurious Correlations with Data Pruning
Authors: Varun Mulchandani, Jung-Eun Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we identify new settings where the strength of the spurious signal is relatively weaker, making it difficult to detect any spurious information while continuing to have catastrophic consequences. We also discover that spurious correlations are learned primarily due to only a handful of all the samples containing the spurious feature and develop a novel data pruning technique that identifies and prunes small subsets of the training data that contain these samples. Our proposed technique does not require inferred domain knowledge, information regarding the sample-wise presence or nature of spurious information, or human intervention. Finally, we show that such data pruning attains state-of-the-art performance on previously studied settings where spurious information is identifiable. |
| Researcher Affiliation | Academia | Varun Mulchandani & Jung-Eun Kim Department of Computer Science North Carolina State University EMAIL |
| Pseudocode | No | The paper describes the methodology in prose, without including any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/JEKimLab/ICLR2025_SpuriousDataPruning |
| Open Datasets | Yes | Suppose one wants to build a gender classifier using the CelebA dataset (Liu et al., 2015). We create a testbed based on the CIFAR-10 dataset (Krizhevsky, 2009) where we synthetically introduce spurious features in a fraction of one class (c1) training samples. Hard ImageNet. (Moayeri et al., 2022) The Hard ImageNet dataset is a 15-class classification task where all classes have a spurious feature associated with them and certain classes share similar spurious features. Waterbirds. (Sagawa et al., 2020a) The Waterbirds task is a binary image classification task where the goal is to classify an image of a bird as landbird or waterbird. MultiNLI. (Williams et al., 2018) The MultiNLI task is a classification task with three classes where the goal is to classify the second sentence in a pair of sentences as entailed by, neutral with, or contradicting the first sentence. |
| Dataset Splits | Yes | Waterbirds. We use the original Waterbirds setting commonly used in practice (Sagawa et al., 2020a; Liu et al., 2021; Zhang et al., 2022; Kirichenko et al., 2023). MultiNLI. We use the original MultiNLI setting commonly used in practice (Sagawa et al., 2020a; Liu et al., 2021; Kirichenko et al., 2023). CIFAR-10S. We create a testbed based on the CIFAR-10 dataset (Krizhevsky, 2009). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions the models and frameworks used for training. |
| Software Dependencies | No | We use an ImageNet pre-trained ResNet-50 from PyTorch (Paszke et al., 2019) that we train for 25 epochs. The network is optimized using SGD with a static learning rate 1e-3 and weight decay 1e-4. We maintain a batch size of 64. Sample difficulty is computed after the 10th epoch. |
| Experiment Setup | Yes | A.1 TRAINING DETAILS: CIFAR-10S. We use the ResNet20 implementation from Liu et al. (2019) that we train for 160 epochs. The network is optimized using SGD with an initial learning rate 1e-1 and weight decay 1e-4. The learning rate drops to 1e-2 and 1e-3 at epochs 80 and 120 respectively. We maintain a batch size of 64. Sample difficulty is computed after the 10th epoch. CelebA. We use an ImageNet pre-trained ResNet-50 from PyTorch (Paszke et al., 2019) that we train for 25 epochs. The network is optimized using SGD with a static learning rate 1e-3 and weight decay 1e-4. We maintain a batch size of 64. Sample difficulty is computed after the 10th epoch. Hard ImageNet. We use an ImageNet pre-trained ResNet-50 from PyTorch (Paszke et al., 2019) that we train for 50 epochs. The network is optimized using SGD with a static learning rate 1e-3 and weight decay 1e-4. We maintain a batch size of 128. Sample difficulty is computed after the 1st epoch. Waterbirds. We use an ImageNet pre-trained ResNet-50 from PyTorch (Paszke et al., 2019) that we train for 100 epochs. The network is optimized using SGD with a static learning rate 1e-3 and weight decay 1e-3. We maintain a batch size of 128. Sample difficulty is computed after the 1st epoch. MultiNLI. We use a pre-trained BERT model that we train for 20 epochs. The network is optimized using AdamW with a linearly decaying learning rate starting at 2e-5. We maintain a batch size of 32. Sample difficulty is computed after the 5th epoch. |
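
The pruning step summarized in the Research Type row — identify a small subset of samples by difficulty and remove it — can be sketched as below. This is a minimal illustration, not the authors' implementation: using the early-epoch per-sample loss as the difficulty score, the `prune_by_difficulty` name, and the choice of which end of the ranking to prune are all assumptions here; the paper only states that small subsets responsible for the spurious signal are identified and pruned.

```python
def prune_by_difficulty(losses, prune_fraction, prune_hardest=True):
    """Return indices of training samples to KEEP after pruning.

    losses: one early-epoch loss per sample, used as a difficulty
        proxy (an assumption -- the paper's exact score is not quoted).
    prune_fraction: fraction of the dataset to remove.
    prune_hardest: which end of the difficulty ranking is removed;
        the paper only says a small subset is pruned.
    """
    n = len(losses)
    n_prune = int(n * prune_fraction)
    order = sorted(range(n), key=lambda i: losses[i])  # easiest -> hardest
    pruned = set(order[n - n_prune:] if prune_hardest else order[:n_prune])
    return [i for i in range(n) if i not in pruned]

# Toy example: four samples, prune the hardest 25% (index 1, loss 2.50).
keep = prune_by_difficulty([0.10, 2.50, 0.30, 0.20], 0.25)  # -> [0, 2, 3]
```

Note that the method requires no group labels or sample-wise annotations of the spurious feature, consistent with the paper's claim of not needing inferred domain knowledge or human intervention.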
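
The CIFAR-10S recipe quoted in the Experiment Setup row can be written down as a configuration plus its step learning-rate schedule. A sketch under the stated hyperparameters only: the multiply-by-0.1-at-each-milestone form is reconstructed from "drops to 1e-2 and 1e-3 at epochs 80 and 120", and the names `CIFAR10S_CONFIG` and `lr_at` are hypothetical, not from the paper or its code.

```python
# Hyperparameters as stated in A.1 for CIFAR-10S (ResNet20, SGD).
CIFAR10S_CONFIG = {
    "epochs": 160,
    "batch_size": 64,
    "optimizer": "SGD",
    "lr": 1e-1,
    "weight_decay": 1e-4,
    "milestones": (80, 120),   # epochs where the learning rate drops
    "gamma": 0.1,              # each milestone multiplies the lr by 0.1
    "difficulty_epoch": 10,    # sample difficulty computed after this epoch
}

def lr_at(epoch, cfg=CIFAR10S_CONFIG):
    """Learning rate in effect at a given (0-indexed) training epoch."""
    drops = sum(epoch >= m for m in cfg["milestones"])
    return cfg["lr"] * cfg["gamma"] ** drops
```

In a PyTorch training loop this schedule would correspond to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)`, though the paper does not name the scheduler class it used.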