Preserving AUC Fairness in Learning with Noisy Protected Groups
Authors: Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups. To demonstrate the impact of noisy protected group levels on AUC fairness, we conduct experiments on the tabular Adult dataset (for socioeconomic analysis) (Asuncion et al., 2007) and the image-based FF++ dataset (for deepfake detection) (Rossler et al., 2019; Lin et al., 2024). |
| Researcher Affiliation | Collaboration | 1Department of Computer and Information Technology, Purdue University, West Lafayette, USA 2Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, USA 3Department of Epidemiology and Biostatistics, University at Albany, State University of New York, New York, USA 4Amazon, New York, USA. |
| Pseudocode | Yes | Algorithm 1 Robust AUC Fairness Algorithm 2 Sampler(Dataset: S, batch size: b) |
| Open Source Code | Yes | The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups. |
| Open Datasets | Yes | To demonstrate the impact of noisy protected group levels on AUC fairness, we conduct experiments on the tabular Adult dataset (for socioeconomic analysis) (Asuncion et al., 2007) and the image-based FF++ dataset (for deepfake detection) (Rossler et al., 2019; Lin et al., 2024). For tabular data, we conduct socioeconomic analysis on three widely used datasets in fair machine learning research (Donini et al., 2018): Adult (protected attribute: gender), Bank (protected attribute: age), and Default (protected attribute: gender). For image data, we focus on the deepfake detection task using datasets from Lin et al. (2024). Specifically, we train models on the FF++ (Rossler et al., 2019) training set (protected attribute: gender) and evaluate them on the test sets of FF++, DFDC (dee), DFD (Google & Jigsaw, 2019), and Celeb-DF (Li et al., 2020). |
| Dataset Splits | Yes | Each dataset is randomly split into training, validation, and test sets in a 60%/20%/20% ratio. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch and trained on an NVIDIA RTX A6000. |
| Software Dependencies | No | All experiments are implemented in PyTorch and trained on an NVIDIA RTX A6000. These prompts are fed into the text encoder of the CLIP model (Radford et al., 2021) to generate text feature representations T P and T N. The paper mentions PyTorch and CLIP but does not specify version numbers for either. |
| Experiment Setup | Yes | For training, we set the batch size to 10,000 for socioeconomic analysis and 32 for deepfake detection, with 1,000 and 100 training epochs, respectively. We use the SGD optimizer. For socioeconomic analysis, we use a 3-layer multilayer perceptron (MLP) as the model. γ is selected from {0.1, 0.2, 0.3, 0.4, 0.5}. For deepfake detection, we use noisy group labels, which are common in datasets like FF++ where demographic attributes are inferred... We use Xception (Chollet, 2017) and EfficientNet-B4 (Tan & Le, 2019) as the detector backbones. γ = 0.02 is estimated using Eq. (10). See Appendix F.1 for details. Appendix F.1 provides specific hyperparameter ranges for learning rates (ηθ, ηλ, ηp) and perturbation magnitude (ν) for both tabular and image data. |
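The reported data protocol (random 60%/20%/20% train/validation/test split) and the tabular-experiment hyperparameters can be sketched as below. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the seed, and the `TABULAR_CONFIG` dict are illustrative assumptions, not taken from the authors' released code.

```python
import random

def split_dataset(examples, seed=0):
    """Randomly split a list into 60% train / 20% val / 20% test,
    mirroring the split ratio reported in the paper."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    n = len(examples)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Hyperparameters quoted from the paper for the tabular
# (socioeconomic analysis) setting; the dict itself is hypothetical.
TABULAR_CONFIG = {
    "batch_size": 10_000,
    "epochs": 1_000,
    "optimizer": "SGD",
    "gamma_grid": [0.1, 0.2, 0.3, 0.4, 0.5],  # candidate values for γ
}

if __name__ == "__main__":
    data = list(range(1_000))
    train, val, test = split_dataset(data)
    print(len(train), len(val), len(test))  # 600 200 200
```

For the image experiments, the paper instead fixes γ = 0.02 via its Eq. (10) rather than selecting it from a grid, and trains Xception or EfficientNet-B4 backbones with batch size 32 for 100 epochs.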