Towards Formalizing Spuriousness of Biased Datasets Using Partial Information Decomposition

Authors: Barproda Halder, Faisal Hamman, Pasan Dissanayake, Qiuyi Zhang, Ilia Sucholutsky, Sanghamitra Dutta

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we also perform empirical evaluation to demonstrate the trends of unique, redundant, and synergistic information, as well as our proposed spuriousness measure across 6 benchmark datasets under various experimental settings. We observe an agreement between our preemptive measure of dataset spuriousness and post-training model generalization metrics such as worst-group accuracy, further supporting our proposition.
Researcher Affiliation | Collaboration | Barproda Halder EMAIL Department of Electrical and Computer Engineering, University of Maryland, College Park [...] Qiuyi Zhang EMAIL Google Research [...] Ilia Sucholutsky EMAIL Department of Computer Science, Princeton University
Pseudocode | Yes | Algorithm 1: Spuriousness Disentangler: An Autoencoder-Based Explainability Framework
Open Source Code | Yes | The code is available at https://github.com/Barproda/spuriousness-disentangler.
Open Datasets | Yes | Our evaluation spans six datasets: Waterbird (Wah et al., 2011), Adult (Becker & Kohavi, 1996), CelebA (Lee et al., 2020), Dominoes (Shah et al., 2020), Spawrious (Lynch et al., 2023), and Colored MNIST (Arjovsky et al., 2019).
Dataset Splits | Yes | Table 5: Summary of the datasets. Waterbird: Train 3,498 / 184 / 56 / 1,057; Validation 467 / 466 / 133 / 133; Test 2,255 / 2,255 / 642 / 642.
Hardware Specification | Yes | All experiments are executed on an NVIDIA RTX A4500 GPU.
Software Dependencies | No | The paper mentions the 'DIT package (James et al., 2018)' but does not specify a version number for it or for any other software component.
Experiment Setup | Yes | The hyperparameters are as follows: a batch size of 64, a learning rate of 0.001, a Cosine Annealing LR scheduler, an Adam optimizer with a weight decay of 0.0001, 50 pretraining epochs, followed by 100 epochs of additional training. When fine-tuning ResNet-50 we use the following hyperparameters: batch size of 64, learning rate of 0.0001, Cosine Annealing LR scheduler, stochastic gradient descent (SGD) optimizer with a weight decay of 0.0001, binary cross-entropy as the loss function, and 100 epochs.
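The paper decomposes information about a target into unique, redundant, and synergistic parts (Partial Information Decomposition), computed with the `dit` package. As a self-contained illustration of what a PID redundancy term measures, the sketch below implements the classic Williams-Beer `I_min` redundancy in pure Python. Note this is not necessarily the measure the paper uses (their spuriousness measure is built on unique information); the joint distributions `dup` and `xor` are illustrative toy examples, not data from the paper.

```python
import math
from collections import defaultdict

def williams_beer_redundancy(joint):
    """Williams-Beer redundant information I_min(Y; A, B), in bits.

    joint: dict mapping (a, b, y) -> probability (must sum to 1).
    Redundancy is the expected minimum, over the two sources A and B,
    of the specific information each source carries about Y = y.
    """
    p_y = defaultdict(float)
    p_ay = defaultdict(float)  # joint of (a, y)
    p_by = defaultdict(float)  # joint of (b, y)
    p_a = defaultdict(float)
    p_b = defaultdict(float)
    for (a, b, y), p in joint.items():
        p_y[y] += p
        p_ay[(a, y)] += p
        p_by[(b, y)] += p
        p_a[a] += p
        p_b[b] += p

    def specific_info(y, p_xy, p_x):
        # I(Y=y; X) = sum_x p(x|y) * log2( p(y|x) / p(y) )
        total = 0.0
        for (x, yy), pxy in p_xy.items():
            if yy != y or pxy == 0.0:
                continue
            p_x_given_y = pxy / p_y[y]
            p_y_given_x = pxy / p_x[x]
            total += p_x_given_y * math.log2(p_y_given_x / p_y[y])
        return total

    return sum(
        p_y[y] * min(specific_info(y, p_ay, p_a), specific_info(y, p_by, p_b))
        for y in p_y
    )

# Duplicated sources: B = A and Y = A, so both sources carry the same bit.
dup = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# XOR: Y = A ^ B with independent uniform sources -> purely synergistic.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
print(williams_beer_redundancy(dup))  # 1.0 bit of redundancy
print(williams_beer_redundancy(xor))  # 0.0 (all information is synergistic)
```

The two toy cases bracket the behaviors the paper studies: a fully redundant pair of sources versus a fully synergistic one, with a spurious-feature dataset sitting somewhere in between.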
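The training setup above names a cosine-annealing learning-rate schedule but leaves its horizon implicit. As a minimal sketch, the closed form below mirrors PyTorch's `CosineAnnealingLR` using the reported base learning rate of 0.001; `T_MAX = 100` (the number of training epochs) and `ETA_MIN = 0.0` are assumptions, since the paper's excerpt does not state them.

```python
import math

BASE_LR = 1e-3   # learning rate reported in the paper
ETA_MIN = 0.0    # assumed floor of the schedule
T_MAX = 100      # assumed annealing horizon (the 100 training epochs)

def cosine_annealing_lr(epoch, base_lr=BASE_LR, eta_min=ETA_MIN, t_max=T_MAX):
    """Closed-form cosine annealing (Loshchilov & Hutter, 2017):
    decays from base_lr at epoch 0 to eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

print(cosine_annealing_lr(0))    # 0.001 -> starts at the base learning rate
print(cosine_annealing_lr(100))  # 0.0   -> fully annealed to eta_min
```

With these assumptions, the schedule sweeps smoothly from 0.001 to 0, which matches the common pattern of pairing cosine annealing with a fixed epoch budget.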