Preserving AUC Fairness in Learning with Noisy Protected Groups
Authors: Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups. To demonstrate the impact of noisy protected group levels on AUC fairness, we conduct experiments on the tabular Adult dataset (for socioeconomic analysis) (Asuncion et al., 2007) and the image-based FF++ dataset (for deepfake detection) (Rossler et al., 2019; Lin et al., 2024). |
| Researcher Affiliation | Collaboration | 1Department of Computer and Information Technology, Purdue University, West Lafayette, USA 2Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, USA 3Department of Epidemiology and Biostatistics, University at Albany, State University of New York, New York, USA 4Amazon, New York, USA. |
| Pseudocode | Yes | Algorithm 1 Robust AUC Fairness Algorithm 2 Sampler(Dataset: S, batch size: b) |
| Open Source Code | Yes | The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups. |
| Open Datasets | Yes | To demonstrate the impact of noisy protected group levels on AUC fairness, we conduct experiments on the tabular Adult dataset (for socioeconomic analysis) (Asuncion et al., 2007) and the image-based FF++ dataset (for deepfake detection) (Rossler et al., 2019; Lin et al., 2024). For tabular data, we conduct socioeconomic analysis on three widely used datasets in fair machine learning research (Donini et al., 2018): Adult (protected attribute: gender), Bank (protected attribute: age), and Default (protected attribute: gender). For image data, we focus on the deepfake detection task using datasets from Lin et al. (2024). Specifically, we train models on the FF++ (Rossler et al., 2019) training set (protected attribute: gender) and evaluate them on the test sets of FF++, DFDC (dee), DFD (Google & Jigsaw, 2019), and Celeb-DF (Li et al., 2020). |
| Dataset Splits | Yes | Each dataset is randomly split into training, validation, and test sets in a 60%/20%/20% ratio. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch and trained on an NVIDIA RTX A6000. |
| Software Dependencies | No | All experiments are implemented in PyTorch and trained on an NVIDIA RTX A6000. These prompts are fed into the text encoder of the CLIP model (Radford et al., 2021) to generate text feature representations T P and T N. The paper mentions PyTorch and CLIP but does not specify version numbers for either. |
| Experiment Setup | Yes | For training, we set the batch size to 10,000 for socioeconomic analysis and 32 for deepfake detection, with 1,000 and 100 training epochs, respectively. We use the SGD optimizer. For socioeconomic analysis, we use a 3-layer multilayer perceptron (MLP) as the model. γ is selected from {0.1, 0.2, 0.3, 0.4, 0.5}. For deepfake detection, we use noisy group labels, which are common in datasets like FF++ where demographic attributes are inferred... We use Xception (Chollet, 2017) and EfficientNet-B4 (Tan & Le, 2019) as the detector backbones. γ = 0.02 is estimated using Eq. (10). See Appendix F.1 for details. Appendix F.1 provides specific hyperparameter ranges for learning rates (ηθ, ηλ, ηp) and perturbation magnitude (ν) for both tabular and image data. |
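The reported data protocol (random 60%/20%/20% train/validation/test split) and the tabular-experiment hyperparameters can be sketched as below. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the seed, and the `TABULAR_CONFIG` dict are illustrative assumptions, not taken from the authors' released code.

```python
import random

def split_dataset(examples, seed=0):
    """Randomly split a list into 60% train / 20% val / 20% test,
    mirroring the split ratio reported in the paper."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    n = len(examples)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Hyperparameters quoted from the paper for the tabular
# (socioeconomic analysis) setting; the dict itself is hypothetical.
TABULAR_CONFIG = {
    "batch_size": 10_000,
    "epochs": 1_000,
    "optimizer": "SGD",
    "gamma_grid": [0.1, 0.2, 0.3, 0.4, 0.5],  # candidate values for γ
}

if __name__ == "__main__":
    data = list(range(1_000))
    train, val, test = split_dataset(data)
    print(len(train), len(val), len(test))  # 600 200 200
```

For the image experiments, the paper instead fixes γ = 0.02 via its Eq. (10) rather than selecting it from a grid, and trains Xception or EfficientNet-B4 backbones with batch size 32 for 100 epochs.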