Bias Amplification Enhances Minority Group Performance

Authors: Gaotang Li, Jiarui Liu, Wei Hu

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, BAM achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm in varying class and group imbalance ratios.
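The group-annotation-free stopping criterion mentioned above selects the checkpoint that minimizes the gap between per-class accuracies. A minimal sketch of that quantity (the function name and data are illustrative, not from the paper's code):

```python
def class_accuracy_difference(preds, labels):
    """Max minus min per-class accuracy. A BAM-style stopping rule
    (sketch) would pick the checkpoint minimizing this gap."""
    accuracies = []
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        correct = sum(1 for i in idx if preds[i] == labels[i])
        accuracies.append(correct / len(idx))
    return max(accuracies) - min(accuracies)
```

For example, with class 0 predicted correctly 2/3 of the time and class 1 only 1/2 of the time, the criterion value is 1/6; training would prefer checkpoints where this gap shrinks.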
Researcher Affiliation | Academia | Gaotang Li (EMAIL), University of Michigan, Ann Arbor, MI; Jiarui Liu (EMAIL), Carnegie Mellon University, Pittsburgh, PA; Wei Hu (EMAIL), University of Michigan, Ann Arbor, MI
Pseudocode | Yes | Algorithm 1 (BAM). Input: training dataset D, number of epochs T in Stage 1, auxiliary coefficient λ, and upweight factor µ
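Of the inputs listed above, the upweight factor µ enters in Stage 2, where examples misclassified by the Stage 1 model are upweighted before retraining. A sketch of that reweighting step alone (names are ours; Stage 1's auxiliary-loss training with λ and T is omitted):

```python
def stage2_weights(stage1_preds, labels, mu):
    """Assign weight mu to examples the Stage 1 model misclassified
    and weight 1 otherwise (sketch of BAM's Stage 2 upweighting)."""
    return [mu if pred != label else 1.0
            for pred, label in zip(stage1_preds, labels)]
```

The upweighted examples disproportionately come from minority groups that the biased Stage 1 model gets wrong, which is why the reweighted Stage 2 objective improves worst-group accuracy.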
Open Source Code | Yes | Our code is available at https://github.com/motivationss/BAM
Open Datasets | Yes | We conduct our experiments on four popular benchmark datasets containing spurious correlations. Two of them are image datasets: Waterbirds (Wah et al., 2011; Sagawa et al., 2019) and CelebA (Liu et al., 2015; Sagawa et al., 2019); the other two are NLP datasets: MultiNLI (Williams et al., 2018; Sagawa et al., 2019) and CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021).
Dataset Splits | Yes | The train/validation/test splits follow Sagawa et al. (2019). We shuffle the original data and regenerate the dataset splits with train/validation/test sizes = 0.7/0.15/0.15.
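The regenerated 0.7/0.15/0.15 splits amount to a shuffle-and-cut of the pooled data; this is an illustrative sketch, not the authors' script:

```python
import random

def make_splits(items, seed=0, fracs=(0.7, 0.15, 0.15)):
    """Shuffle items with a fixed seed and cut into
    train/validation/test = 0.7/0.15/0.15 (sketch)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(fracs[0] * len(items))
    n_val = int(fracs[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Fixing the seed makes the regenerated splits reproducible across runs.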
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using pre-trained models (ResNet-50, BERT) but no specific GPU/CPU models or other hardware details.
Software Dependencies | No | The paper mentions using the "PyTorch implementation for ResNet50 and the Hugging Face implementation for BERT" but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Table 6: Hyperparameters tuned over 4 datasets.
Dataset | Auxiliary coefficient (λ) | #Epochs in Stage 1 (T) | Upweight factor (µ)
Waterbirds | {0.5, 5, 50} | {10, 15, 20} | {50, 100, 140}
CelebA | {0.5, 5, 50} | {1, 2} | {50, 70, 100}
MultiNLI | {0.5, 5, 50} | {1, 2} | {4, 5, 6}
CivilComments | {0.5, 5, 50} | {1, 2} | {4, 5, 6}
In general, our setting closely follows Liu et al. (2021), with some minor discrepancies. For the major hyperparameters, we tuned over λ = {0.5, 5, 50}, T = {1, 2, 10, 15, 60} and µ = {4, 5, 6, 50, 70, 100, 140} for BAM.
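The per-dataset grids in Table 6 can be enumerated with a standard Cartesian product; a sketch (the `GRIDS` and `configs` names are ours, with the values copied from the table):

```python
from itertools import product

# Per-dataset search spaces as reported in Table 6.
GRIDS = {
    "Waterbirds":    {"lam": [0.5, 5, 50], "T": [10, 15, 20], "mu": [50, 100, 140]},
    "CelebA":        {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [50, 70, 100]},
    "MultiNLI":      {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [4, 5, 6]},
    "CivilComments": {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [4, 5, 6]},
}

def configs(dataset):
    """Enumerate every (lam, T, mu) combination for one dataset."""
    grid = GRIDS[dataset]
    return [dict(zip(grid, values)) for values in product(*grid.values())]
```

For Waterbirds this yields 3 × 3 × 3 = 27 configurations; the three smaller grids yield 3 × 2 × 3 = 18 each.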