reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

EvA: Erasing Spurious Correlations with Activations

Authors: Qiyuan He, Kai Xu, Angela Yao

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We verify our proposed framework with the datasets listed in Table 1. The datasets can be characterized by the conflicting ratio (Tiwari & Shenoy, 2023), or the proportion of samples that counter the spurious correlation within the entire dataset. A detailed description is given in Appendix D. For each dataset, we report the mean and standard deviation over ten runs on the average top-1 Accuracy unless otherwise indicated. ... Table 2 shows that EvA easily controls the feature it relies on to make predictions. ... As shown in Table 3, both EvA-E and EvA-C achieve high performance even with a very small fraction of unbiased dataset (0.7% of the training set). Additional results on the Waterbirds dataset, compared with DFR, are presented in Figure 2 (c).
Researcher Affiliation	Academia	Qiyuan He, Kai Xu, Angela Yao National University of Singapore EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Section B: Algorithm of EvA with pseudo code. ... Algorithm 1: Detect(Φ, D_train, ϵ) / (Φ, D_train, D_ex, ϵ) ... Algorithm 2: Reweight
Open Source Code	No	The paper does not explicitly provide a link to source code, nor does it state that code is available in supplementary materials or upon publication. While algorithms are detailed, there is no concrete access information for the implementation code.
Open Datasets	Yes	CMNIST. Color-MNIST (Tiwari & Shenoy, 2023) is a synthesized dataset that colors the digits 0 and 1 from MNIST (Deng, 2012). ... BAR. The Biased Activity Recognition dataset (Nam et al., 2020), features six human actions spuriously correlated with the background... Waterbirds. Waterbirds (Sagawa et al., 2019) is a two-class image dataset of water- and ground-birds... CelebA Hair. CelebA is a large human face benchmark (Liu et al., 2018).
Dataset Splits	Yes	For CMNIST, Waterbirds and CelebA, we use the same validation and test dataset as (Sagawa et al., 2019; Ye et al., 2023; Tiwari & Shenoy, 2023). BAR has no provided validation set, so we randomly split the testing dataset into two equal halves across the ten experiments to form a validation set that follows the same testing distribution. ... Table 1: Summary of dataset details. Val ratio indicates the proportion of the validation dataset size relative to the training dataset size. Conflicting ratio denotes the proportion of images that counter the spurious correlation within the entire dataset.
Hardware Specification	Yes	In our experiments, aimed at selecting the best hyperparameters from 90 candidates, EvA-E significantly reduces computation time while achieving higher accuracy within 10 minutes, compared to the 6 days required by SiFER (Tiwari & Shenoy, 2023) on a single RTX 3080 GPU.
Software Dependencies	No	The paper mentions using ResNet18 as the base model and SGD optimization, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	For fair comparison with previous works (Nam et al., 2020; Tiwari & Shenoy, 2023; Li et al., 2022; Teney et al., 2022), we use the ResNet18 (He et al., 2016) as the base model initialized with weights pre-trained on ImageNet and the same settings of (Nam et al., 2020; Tiwari & Shenoy, 2023; Li et al., 2022; Teney et al., 2022). We use SGD optimization with a fixed learning rate of 0.001. For CMNIST, Waterbirds and CelebA, we use the same validation and test dataset as (Sagawa et al., 2019; Ye et al., 2023; Tiwari & Shenoy, 2023). ... To select the erase ratio ϵ, we retrain the linear layer with different erase ratio candidates and select the one with the highest accuracy on D_unbiased.
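
The BAR validation split and the erase-ratio selection quoted in the Dataset Splits and Experiment Setup rows can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the entry above records that no source code is linked). The names split_in_half, select_erase_ratio, and accuracy_fn are hypothetical; accuracy_fn stands in for "retrain the linear layer with a given erase ratio and measure accuracy on the unbiased set", which is not implemented here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_in_half(n_items: int, seed: int) -> Tuple[List[int], List[int]]:
    """Randomly split indices 0..n_items-1 into (validation, test) halves.

    Assumption: one seed per experiment run, so each run gets its own split.
    If n_items is odd, the test half receives the extra item.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    half = n_items // 2
    return indices[:half], indices[half:]


def select_erase_ratio(
    candidates: Sequence[float],
    accuracy_fn: Callable[[float], float],
) -> float:
    """Return the candidate erase ratio with the highest accuracy.

    accuracy_fn is a hypothetical callback: given an erase ratio, it should
    retrain the linear layer and return accuracy on the unbiased set.
    """
    return max(candidates, key=accuracy_fn)


if __name__ == "__main__":
    # Toy stand-in for retraining and evaluation; not real results.
    def toy_accuracy(eps: float) -> float:
        return 1.0 - abs(eps - 0.3)

    val_idx, test_idx = split_in_half(n_items=10, seed=0)
    print(len(val_idx), len(test_idx))
    print(select_erase_ratio([0.1, 0.2, 0.3, 0.4], toy_accuracy))
```

Passing the evaluation step in as a callback keeps the selection loop independent of any particular model or training framework, which matters here because the paper does not name its software dependencies.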