Bias Amplification Enhances Minority Group Performance

Authors: Gaotang Li, Jiarui Liu, Wei Hu

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, BAM achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm in varying class and group imbalance ratios.
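The group-annotation-free stopping criterion mentioned above selects the checkpoint that minimizes the gap between per-class accuracies. A minimal sketch of that quantity (the function name and data are illustrative, not from the paper's code):

```python
def class_accuracy_difference(preds, labels):
    """Max minus min per-class accuracy. A BAM-style stopping rule
    (sketch) would pick the checkpoint minimizing this gap."""
    accuracies = []
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        correct = sum(1 for i in idx if preds[i] == labels[i])
        accuracies.append(correct / len(idx))
    return max(accuracies) - min(accuracies)
```

For example, with class 0 predicted correctly 2/3 of the time and class 1 only 1/2 of the time, the criterion value is 1/6; training would prefer checkpoints where this gap shrinks.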
Researcher Affiliation | Academia | Gaotang Li (EMAIL), University of Michigan, Ann Arbor, MI; Jiarui Liu (EMAIL), Carnegie Mellon University, Pittsburgh, PA; Wei Hu (EMAIL), University of Michigan, Ann Arbor, MI
Pseudocode | Yes | Algorithm 1 (BAM). Input: training dataset D, number of epochs T in Stage 1, auxiliary coefficient λ, and upweight factor µ
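Of the inputs listed above, the upweight factor µ enters in Stage 2, where examples misclassified by the Stage 1 model are upweighted before retraining. A sketch of that reweighting step alone (names are ours; Stage 1's auxiliary-loss training with λ and T is omitted):

```python
def stage2_weights(stage1_preds, labels, mu):
    """Assign weight mu to examples the Stage 1 model misclassified
    and weight 1 otherwise (sketch of BAM's Stage 2 upweighting)."""
    return [mu if pred != label else 1.0
            for pred, label in zip(stage1_preds, labels)]
```

The upweighted examples disproportionately come from minority groups that the biased Stage 1 model gets wrong, which is why the reweighted Stage 2 objective improves worst-group accuracy.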
Open Source Code | Yes | Our code is available at https://github.com/motivationss/BAM
Open Datasets | Yes | We conduct our experiments on four popular benchmark datasets containing spurious correlations. Two of them are image datasets: Waterbirds (Wah et al., 2011; Sagawa et al., 2019) and CelebA (Liu et al., 2015; Sagawa et al., 2019); the other two are NLP datasets: MultiNLI (Williams et al., 2018; Sagawa et al., 2019) and CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021).
Dataset Splits | Yes | The train/validation/test splits follow Sagawa et al. (2019). We shuffle the original data and regenerate the dataset splits with train/validation/test sizes = 0.7/0.15/0.15.
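The regenerated 0.7/0.15/0.15 splits amount to a shuffle-and-cut of the pooled data; this is an illustrative sketch, not the authors' script:

```python
import random

def make_splits(items, seed=0, fracs=(0.7, 0.15, 0.15)):
    """Shuffle items with a fixed seed and cut into
    train/validation/test = 0.7/0.15/0.15 (sketch)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(fracs[0] * len(items))
    n_val = int(fracs[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Fixing the seed makes the regenerated splits reproducible across runs.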
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using pre-trained models (ResNet-50, BERT) but no specific GPU/CPU models or other hardware details.
Software Dependencies | No | The paper mentions using the "PyTorch implementation for ResNet50 and the Hugging Face implementation for BERT" but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Table 6: Hyperparameters tuned over 4 datasets.
Dataset | Auxiliary coefficient (λ) | #Epochs in Stage 1 (T) | Upweight factor (µ)
Waterbirds | {0.5, 5, 50} | {10, 15, 20} | {50, 100, 140}
CelebA | {0.5, 5, 50} | {1, 2} | {50, 70, 100}
MultiNLI | {0.5, 5, 50} | {1, 2} | {4, 5, 6}
CivilComments | {0.5, 5, 50} | {1, 2} | {4, 5, 6}
In general, our setting closely follows Liu et al. (2021), with some minor discrepancies. For the major hyperparameters, we tuned over λ = {0.5, 5, 50}, T = {1, 2, 10, 15, 60} and µ = {4, 5, 6, 50, 70, 100, 140} for BAM.
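The per-dataset grids in Table 6 can be enumerated with a standard Cartesian product; a sketch (the `GRIDS` and `configs` names are ours, with the values copied from the table):

```python
from itertools import product

# Per-dataset search spaces as reported in Table 6.
GRIDS = {
    "Waterbirds":    {"lam": [0.5, 5, 50], "T": [10, 15, 20], "mu": [50, 100, 140]},
    "CelebA":        {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [50, 70, 100]},
    "MultiNLI":      {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [4, 5, 6]},
    "CivilComments": {"lam": [0.5, 5, 50], "T": [1, 2],       "mu": [4, 5, 6]},
}

def configs(dataset):
    """Enumerate every (lam, T, mu) combination for one dataset."""
    grid = GRIDS[dataset]
    return [dict(zip(grid, values)) for values in product(*grid.values())]
```

For Waterbirds this yields 3 × 3 × 3 = 27 configurations; the three smaller grids yield 3 × 2 × 3 = 18 each.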