Identifying Spurious Correlations using Counterfactual Alignment

Authors: Joseph Paul Cohen, Louis Blankemeier, Akshay S Chaudhari

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments in this work are performed on the CelebA-HQ dataset (Karras et al., 2018), which contains over 200k celebrity images with 40 facial attribute labels per image. The resolution of the images is 1024 × 1024. Experiments are also performed on the lower-resolution (178 × 178) CelebA dataset (Liu et al., 2015).
Researcher Affiliation | Academia | Joseph Paul Cohen (EMAIL), Stanford University; Louis Blankemeier, Stanford University; Akshay Chaudhari, Stanford University
Pseudocode | No | The paper describes the methodology using equations and figures, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "The source code and model weights for all experiments will be released publicly online." https://github.com/ieee8023/latentshift
Open Datasets | Yes | The experiments in this work are performed on the CelebA-HQ dataset (Karras et al., 2018), which contains over 200k celebrity images with 40 facial attribute labels per image. Experiments are also performed on the lower-resolution (178 × 178) CelebA dataset (Liu et al., 2015). We leverage the VQ-GAN autoencoder from Esser et al. (2021), trained on the FacesHQ dataset, which combines the CelebA-HQ dataset (Karras et al., 2018) and the Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019). Work on Group DRO (Sagawa et al., 2020) constructed the Waterbirds task such that there is a spurious correlation between birds (from the CUB dataset (Wah et al., 2011)) and backgrounds that contain water or land. A VQ-GAN trained on Open Images (Krasin et al., 2017) is also used.
Dataset Splits | Yes | Experiments in this section are performed on held-out test sets, sampled so that positive and negative examples are balanced with respect to the sensitive attribute (e.g., in the CelebA dataset, 1/4 of the samples are labeled blond_hair and male, 1/4 blond_hair and not male, 1/4 not blond_hair and not male, and 1/4 not blond_hair and male). Aggregate results over 4097 samples from the test set of the Waterbirds dataset reveal a reduction in mean relative change using DRO, shown in Table 1. Evaluations are performed on 1024 samples from the test set. This experiment uses train, validation, and held-out test sets in order to demonstrate the generalization of the unbiasing process to unseen data.
Hardware Specification | Yes | The CF alignment algorithm requires a few seconds per image on an NVIDIA V100 16GB GPU.
Software Dependencies | No | The Captum library (Kokhlikyan et al., 2020) is used for baseline attribution methods; PyTorch (Paszke et al., 2019) is used for efficient tensor computation and automatic differentiation. The paper cites these software packages and their publications, but it does not specify exact version numbers.
Experiment Setup | Yes | The parameter λ is determined through an iterative search, where its value is systematically adjusted in steps. The objective is to find a λ such that the classifier's prediction is either reduced by 0.6 or starts to increase; 0.6 is chosen as a difference that should cross the decision boundary. To achieve this, the average relative change between pairs of classifiers (N = 400 images per class) is computed, shown in Figure 2a. The biased classifier is constructed as f_biased(x) = f_smiling(x) + 0.3 f_arched_eyebrows(x). Optimization to compute each β is performed using a pseudo gradient descent, where the gradient is approximated by the mean relative change between the target classifier and the big_nose classifier. By subtracting the relative change from β, scaled by a learning rate, the relative change (ψ) with that classifier is reduced. All together, the training objective (including momentum) is β_n = −0.001 ψ(f_target, f_big_nose) + 0.1 β_{n−1}. Small minibatches of 10 samples work well because the computation time for each CF can exceed 1 second on a GPU. Additionally, training with samples that induce only a small change in the base classifier during CF generation can be challenging: as the classifiers are modified, this base change is reduced, which causes the relative change to become erratic and prevents the optimization from converging. To prevent this, only samples with a base change > 0.6 are used.
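The balanced evaluation protocol quoted in the Dataset Splits row (equal quarters over the label/attribute combinations) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the dict-based sample representation are assumptions.

```python
import random

def balanced_subset(samples, n_total, seed=0):
    """Draw a subset with equal quarters over (blond_hair, male) combinations.

    samples: list of dicts with boolean 'blond_hair' and 'male' keys
    (a hypothetical representation of CelebA attribute labels).
    """
    rng = random.Random(seed)
    per_cell = n_total // 4
    out = []
    for blond in (True, False):
        for male in (True, False):
            # Collect all samples in this (label, attribute) cell,
            # then draw an equal share from it.
            cell = [s for s in samples
                    if s["blond_hair"] == blond and s["male"] == male]
            out.extend(rng.sample(cell, per_cell))
    rng.shuffle(out)
    return out
```

The same four-way stratification applies to any classifier/attribute pair; only the two keys change.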
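The iterative λ search described in the Experiment Setup row can be sketched as below: step λ until the classifier's prediction either drops by 0.6 from its base value or rebounds. This is a hedged sketch; `classify`, `decode`, and the step schedule are placeholders, not the paper's implementation.

```python
def search_lambda(classify, decode, z, direction, step=-10.0, max_steps=100):
    """Iteratively adjust lambda along a latent-space direction.

    classify: image -> scalar prediction; decode: latent -> image.
    z: latent code of the input; direction: perturbation direction
    (in the paper, derived from the classifier's gradient).
    Stops when the prediction is reduced by 0.6 (crossing the decision
    boundary) or when it starts to increase again.
    """
    base = classify(decode(z))
    lam = 0.0
    prev = base
    for _ in range(max_steps):
        lam += step
        pred = classify(decode(z + lam * direction))
        if base - pred >= 0.6 or pred > prev:
            break
        prev = pred
    return lam
```

With a real model, `decode(z + lam * direction)` would be the counterfactual image whose prediction is tracked.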
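The β update in the Experiment Setup row (β_n = −0.001 ψ(f_target, f_big_nose) + 0.1 β_{n−1}) can be written as a one-line recurrence. The sketch below assumes ψ is available as a callable returning the mean relative change for the current β; in the paper this comes from running CF alignment on a minibatch, which is stubbed out here.

```python
def unbias(psi, steps=50, lr=0.001, momentum=0.1):
    """Pseudo gradient descent on beta with momentum.

    psi: callable beta -> mean relative change between the target
    classifier and the big_nose classifier (placeholder for the
    CF-alignment measurement on a minibatch).
    Implements beta_n = -lr * psi(beta_{n-1}) + momentum * beta_{n-1}.
    """
    beta = 0.0
    for _ in range(steps):
        beta = -lr * psi(beta) + momentum * beta
    return beta
```

Because the learning rate (0.001) and momentum (0.1) are both small, the recurrence is a contraction and β settles quickly, consistent with the small minibatches (10 samples) the row describes.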