Subgroups Matter for Robust Bias Mitigation
Authors: Anissa Alloula, Charles Jones, Ben Glocker, Bartlomiej Papiez
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple vision and language classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our results reveal that subgroup choice significantly impacts performance... We evaluate performance in image classification tasks in four datasets which we construct to satisfy the distributions specified in Figure 1. ...We implement and train models with each of the gDRO, resampling, DomainInd, and CFair bias mitigation methods. We apply each method to each of our generated subgroups and average the results over three random seeds. We repeat this process for the four datasets, comparing performance of the bias mitigation methods with the baseline ERM method. |
| Researcher Affiliation | Academia | 1University of Oxford, UK; 2Imperial College London, UK. Correspondence to: Anissa Alloula <EMAIL>. |
| Pseudocode | No | The paper describes methods textually (e.g., in Section 3.1) and does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured code-like procedures. |
| Open Source Code | Yes | The code is available here. |
| Open Datasets | Yes | We adapt the MNIST dataset (Lecun et al., 1998) by binarising the classification task... We repeat the experiments with chest X-ray images from the CheXpert dataset (CXP) (Irvin et al., 2019)... We explore another real vision dataset commonly used in fair ML research, CelebA (Liu et al., 2015a)... Finally, we explore whether our findings extend to the text modality through the use of the CivilComments dataset, also commonly used in fair ML (Borkan et al., 2019). |
| Dataset Splits | No | The paper describes the construction of semi-synthetic datasets to satisfy specific probability distributions (Ptrain, Punbiased) and mentions using a validation set for hyperparameter tuning. However, it does not provide explicit percentages, sample counts, or methodologies for splitting the original raw MNIST, CheXpert, CelebA, or CivilComments datasets into training, validation, and test sets. |
| Hardware Specification | Yes | In total, we train 306 models, with 40 NVIDIA A100 hours of compute. |
| Software Dependencies | No | Table 7 lists 'Backbone' architectures such as 'DenseNet121' and 'BERTClassifier' and mentions 'Optimiser Adam' and 'Loss Binary cross-entropy', but it does not specify version numbers for general software dependencies like Python, PyTorch, or specific libraries used (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | The training strategy, hyperparameters, architectures etc. are the same across all models, as detailed in Table A7, except for necessary adjustments to apply each bias mitigation method. Table A7 provides detailed 'Implementation details' including 'Backbone', 'Pre-training', 'Batch size', 'Image size', 'Augmentation', 'Optimiser', 'Loss', 'Learning rate', 'Learning scheduler', 'Weight decay', and 'Max epochs'. We also specify additional hyperparameters for the mitigation methods: a step size of 0.01 and a size adjustment factor of 1 were used for gDRO, and a µ coefficient of 0.1 was used for the adversarial loss of CFair. |
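The gDRO hyperparameters reported above (step size 0.01, size adjustment factor 1) can be made concrete with a minimal sketch of the standard group-DRO weight update. This is an illustrative re-implementation of the general technique, not the paper's code; the function name and its interface are assumptions.

```python
import math

def gdro_step(group_losses, group_counts, q, step_size=0.01, adjustment=1.0):
    """One group-DRO update: exponentiated-gradient ascent on subgroup weights.

    group_losses: average loss per subgroup for the current batch
    group_counts: number of training samples in each subgroup
    q:            current per-subgroup weights (non-negative, sum to 1)
    step_size:    ascent step (the paper reports 0.01)
    adjustment:   size adjustment factor C (the paper reports 1)
    """
    # Size-adjusted losses penalise small subgroups: L_g + C / sqrt(n_g)
    adjusted = [l + adjustment / math.sqrt(n)
                for l, n in zip(group_losses, group_counts)]
    # Multiplicative update, then renormalise so the weights stay on the simplex
    new_q = [w * math.exp(step_size * a) for w, a in zip(q, adjusted)]
    total = sum(new_q)
    new_q = [w / total for w in new_q]
    # Robust objective: subgroup losses weighted by the updated weights
    robust_loss = sum(w, l) if False else sum(w * l for w, l in zip(new_q, group_losses))
    return new_q, robust_loss
```

With equal-sized subgroups, the weight mass shifts toward whichever subgroup currently has the higher loss, which is how gDRO prioritises the worst-performing subgroup during training.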
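Of the mitigation methods listed, resampling is the simplest to illustrate: each example is weighted inversely to its subgroup's size so that all subgroups are drawn equally often. A minimal sketch (the function name is illustrative, not from the paper):

```python
from collections import Counter

def balanced_sampling_weights(group_labels):
    """Per-example sampling weights that equalise subgroup frequency.

    group_labels: one subgroup identifier per training example.
    Returns a weight per example, 1 / |subgroup|, so each subgroup
    contributes equal total mass to a weighted sampler.
    """
    counts = Counter(group_labels)
    return [1.0 / counts[g] for g in group_labels]
```

These weights can be passed to a weighted random sampler (e.g. PyTorch's `WeightedRandomSampler` with replacement) so minority subgroups are oversampled during training.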