Subgroups Matter for Robust Bias Mitigation
Authors: Anissa Alloula, Charles Jones, Ben Glocker, Bartlomiej Papiez
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple vision and language classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our results reveal that subgroup choice significantly impacts performance... We evaluate performance in image classification tasks in four datasets which we construct to satisfy the distributions specified in Figure 1. ...We implement and train models with each of the gDRO, resampling, DomainInd, and CFair bias mitigation methods. We apply each method to each of our generated subgroups and average the results over three random seeds. We repeat this process for the four datasets, comparing performance of the bias mitigation methods with the baseline ERM method. |
| Researcher Affiliation | Academia | 1University of Oxford, UK; 2Imperial College London, UK. Correspondence to: Anissa Alloula <EMAIL>. |
| Pseudocode | No | The paper describes methods textually (e.g., in Section 3.1) and does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured code-like procedures. |
| Open Source Code | Yes | The code is available here. |
| Open Datasets | Yes | We adapt the MNIST dataset (Lecun et al., 1998) by binarising the classification task... We repeat the experiments with chest X-ray images from the CheXpert dataset (CXP) (Irvin et al., 2019)... We explore another real vision dataset commonly used in fair ML research, CelebA (Liu et al., 2015a)... Finally, we explore whether our findings extend to the text modality through the use of the CivilComments dataset, also commonly used in fair ML (Borkan et al., 2019). |
| Dataset Splits | No | The paper describes the construction of semi-synthetic datasets to satisfy specific probability distributions (Ptrain, Punbiased) and mentions using a validation set for hyperparameter tuning. However, it does not provide explicit percentages, sample counts, or methodologies for splitting the original raw MNIST, CheXpert, CelebA, or CivilComments datasets into training, validation, and test sets. |
| Hardware Specification | Yes | In total, we train 306 models, with 40 NVIDIA A100 hours of compute. |
| Software Dependencies | No | Table 7 lists 'Backbone' architectures such as 'DenseNet121' and 'BERTClassifier' and mentions 'Optimiser Adam' and 'Loss Binary cross-entropy', but it does not specify version numbers for general software dependencies like Python, PyTorch, or specific libraries used (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | The training strategy, hyperparameters, architectures etc. are the same across all models, as detailed in Table A7, except for necessary adjustments to apply each bias mitigation method. Table A7 provides detailed 'Implementation details' including 'Backbone', 'Pre-training', 'Batch size', 'Image size', 'Augmentation', 'Optimiser', 'Loss', 'Learning rate', 'Learning scheduler', 'Weight decay', and 'Max epochs'. We also specify additional hyperparameters for the mitigation methods: a step size of 0.01 and a size adjustment factor of 1 were used for gDRO, and a µ coefficient of 0.1 was used for the adversarial loss of CFair. |
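The gDRO hyperparameters reported above (step size 0.01, size adjustment factor 1) can be made concrete with a minimal sketch of the standard group-DRO weight update. This is an illustrative re-implementation of the general technique, not the paper's code; the function name and its interface are assumptions.

```python
import math

def gdro_step(group_losses, group_counts, q, step_size=0.01, adjustment=1.0):
    """One group-DRO update: exponentiated-gradient ascent on subgroup weights.

    group_losses: average loss per subgroup for the current batch
    group_counts: number of training samples in each subgroup
    q:            current per-subgroup weights (non-negative, sum to 1)
    step_size:    ascent step (the paper reports 0.01)
    adjustment:   size adjustment factor C (the paper reports 1)
    """
    # Size-adjusted losses penalise small subgroups: L_g + C / sqrt(n_g)
    adjusted = [l + adjustment / math.sqrt(n)
                for l, n in zip(group_losses, group_counts)]
    # Multiplicative update, then renormalise so the weights stay on the simplex
    new_q = [w * math.exp(step_size * a) for w, a in zip(q, adjusted)]
    total = sum(new_q)
    new_q = [w / total for w in new_q]
    # Robust objective: subgroup losses weighted by the updated weights
    robust_loss = sum(w, l) if False else sum(w * l for w, l in zip(new_q, group_losses))
    return new_q, robust_loss
```

With equal-sized subgroups, the weight mass shifts toward whichever subgroup currently has the higher loss, which is how gDRO prioritises the worst-performing subgroup during training.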
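Of the mitigation methods listed, resampling is the simplest to illustrate: each example is weighted inversely to its subgroup's size so that all subgroups are drawn equally often. A minimal sketch (the function name is illustrative, not from the paper):

```python
from collections import Counter

def balanced_sampling_weights(group_labels):
    """Per-example sampling weights that equalise subgroup frequency.

    group_labels: one subgroup identifier per training example.
    Returns a weight per example, 1 / |subgroup|, so each subgroup
    contributes equal total mass to a weighted sampler.
    """
    counts = Counter(group_labels)
    return [1.0 / counts[g] for g in group_labels]
```

These weights can be passed to a weighted random sampler (e.g. PyTorch's `WeightedRandomSampler` with replacement) so minority subgroups are oversampled during training.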