Bias Amplification Enhances Minority Group Performance
Authors: Gaotang Li, Jiarui Liu, Wei Hu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, Bam achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm in varying class and group imbalance ratios. |
| Researcher Affiliation | Academia | Gaotang Li EMAIL University of Michigan Ann Arbor, MI Jiarui Liu EMAIL Carnegie Mellon University Pittsburgh, PA Wei Hu EMAIL University of Michigan Ann Arbor, MI |
| Pseudocode | Yes | Algorithm 1 (Bam). Input: training dataset D, number of epochs T in Stage 1, auxiliary coefficient λ, and upweight factor µ. |
| Open Source Code | Yes | Our code is available at https://github.com/motivationss/BAM |
| Open Datasets | Yes | We conduct our experiments on four popular benchmark datasets containing spurious correlations. Two of them are image datasets: Waterbirds (Wah et al., 2011; Sagawa et al., 2019) and CelebA (Liu et al., 2015; Sagawa et al., 2019), and the other two are NLP datasets: MultiNLI (Williams et al., 2018; Sagawa et al., 2019) and CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021). |
| Dataset Splits | Yes | The train/validation/test splits follow Sagawa et al. (2019). We shuffle the original data and regenerate the dataset splits with train/validation/test sizes = 0.7/0.15/0.15. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using pre-trained models (ResNet-50, BERT) but no specific GPU/CPU models or other hardware details. |
| Software Dependencies | No | The paper mentions using the "PyTorch implementation for ResNet-50 and the Hugging Face implementation for BERT" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Table 6: Hyperparameters tuned over the 4 datasets. Waterbirds: λ ∈ {0.5, 5, 50}, T ∈ {10, 15, 20}, µ ∈ {50, 100, 140}; CelebA: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {50, 70, 100}; MultiNLI: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {4, 5, 6}; CivilComments: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {4, 5, 6}. In general, our setting follows closely from Liu et al. (2021), with some minor discrepancies. For the major hyperparameters, we tuned over λ ∈ {0.5, 5, 50}, T ∈ {1, 2, 10, 15, 20}, and µ ∈ {4, 5, 6, 50, 70, 100, 140} for Bam. |
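Two ingredients from the summary above lend themselves to a brief sketch: the Stage-2 step that upweights (by factor µ) the examples the Stage-1 bias-amplified model misclassifies, and the annotation-free stopping criterion based on minimum class accuracy difference. The minimal Python below is an illustration only; the function names and the exact form of the class-difference statistic are assumptions, not code taken from the BAM repository.

```python
# Hedged sketch of two ideas described in the paper's abstract/algorithm:
# (1) Stage-2 upweighting of examples misclassified by the Stage-1 model;
# (2) selecting a checkpoint by minimum class accuracy difference,
#     which needs only class labels, not group annotations.

def stage2_weights(stage1_preds, labels, mu):
    """Per-example loss weights: mu where Stage 1 erred, else 1."""
    return [mu if p != y else 1.0 for p, y in zip(stage1_preds, labels)]

def class_accuracy_difference(preds, labels):
    """Max minus min per-class accuracy (a ClassDiff-style statistic)."""
    per_class = {}  # label -> (correct count, total count)
    for p, y in zip(preds, labels):
        hit, tot = per_class.get(y, (0, 0))
        per_class[y] = (hit + (p == y), tot + 1)
    accs = [hit / tot for hit, tot in per_class.values()]
    return max(accs) - min(accs)

def select_checkpoint(checkpoint_preds, labels):
    """Index of the checkpoint minimizing class accuracy difference."""
    return min(range(len(checkpoint_preds)),
               key=lambda i: class_accuracy_difference(checkpoint_preds[i], labels))
```

For example, with validation labels `[0, 0, 1, 1]`, a checkpoint predicting `[0, 1, 1, 1]` has class accuracies 0.5 and 1.0 (difference 0.5), so a checkpoint predicting all four labels correctly (difference 0) would be selected instead.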