Multi-Output Distributional Fairness via Post-Processing

Authors: Gang Li, Qihang Lin, Ayush Ghosh, Tianbao Yang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies evaluate the proposed approach against various baselines on multi-task/multi-class classification and representation learning tasks, demonstrating the effectiveness of the proposed approach.
Researcher Affiliation | Academia | Gang Li (EMAIL), Texas A&M University; Qihang Lin (EMAIL), The University of Iowa; Ayush Ghosh (EMAIL), The University of Iowa; Tianbao Yang (EMAIL), Texas A&M University
Pseudocode | Yes | Algorithm 1: Approximate Barycenter; Algorithm 2: Post-Processing Method by Transporting to Approximate Barycenter (TAB)
Open Source Code | Yes | Code is available at: https://github.com/GangLii/TAB
Open Datasets | Yes | Datasets. In our experiments, we include four datasets from various domains: marketing (Customer dataset, footnote 3), medical diagnosis (CheXpert dataset (Irvin et al., 2019)), and face recognition (CelebA dataset (Liu et al., 2015) and UTKFace dataset (Zhang & Qi, 2017)). The details of these datasets are provided in Appendix B. [Footnote 3: https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation]
Dataset Splits | Yes | The Customer dataset has 8,068 training samples and 2,627 testing samples, and the task is to classify customers into anonymous customer categories for target marketing. The CheXpert dataset contains 224,316 training instances, and the task is to detect five chest and lung diseases based on X-ray images. Due to the high computational complexity of solving optimal transportation between large datasets, we sample 5% of instances from the original training data as the training set and another 5% as the testing set. The CelebA dataset contains 162,770 training instances and 39,829 testing instances, and the task is to detect ten attributes (chosen based on Ramaswamy et al. (2021)) of the person in an image: being attractive, having bags under the eyes, having black hair, having bangs, wearing glasses, having high cheekbones, smiling, wearing a hat, having a slightly open mouth, and having a pointy nose. For the same computational reason, we sample 5% of instances from the original training data as the training set and 20% of the original testing data as the testing set. The UTKFace dataset consists of 23,705 face images with five groups in terms of race (i.e., White, Black, Asian, Indian, and Others), and we randomly split it into training and testing (8:2) sets.
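The paper does not release split-generation code in this excerpt; the sampling protocol above can be sketched as follows, assuming uniform sampling without replacement and a hypothetical fixed seed (the paper reports five runs with different seeds):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # hypothetical seed, one of the five reported runs

# CheXpert: a disjoint 5% of the 224,316 training instances for train and test each
perm = rng.permutation(224316)
k = int(224316 * 0.05)
chexpert_train, chexpert_test = perm[:k], perm[k:2 * k]

# UTKFace: random 8:2 train/test split of the 23,705 images
perm_utk = rng.permutation(23705)
cut = int(23705 * 0.8)
utk_train, utk_test = perm_utk[:cut], perm_utk[cut:]
```

The disjointness of the two 5% CheXpert samples follows from slicing a single permutation; whether the paper samples them jointly or independently is not specified here.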
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions several models and optimizers, such as ResNet50, DenseNet121, the Adam optimizer, CLIP (ViT-B/16), and a Gaussian kernel, but does not specify software library versions (e.g., PyTorch 1.9, TensorFlow 2.x) that are crucial for reproducibility.
Experiment Setup | Yes | For the baselines FRAPPE, SimFair, and AdvDebiasing, we train the models for 60 epochs with the Adam optimizer and a batch size of 64, and tune the learning rate in {1e-3, 1e-4}. For f-FERM, we follow their paper to vary the weight parameter λ in {0.1, 1, 10, 50, 100, 150} and tune the learning rate in {0.1, 0.01, 0.001} with their proposed optimization algorithm. For our method, we experiment with a Gaussian kernel, and h is chosen from {0.02, 0.04, 0.5, 1} based on the input dimension, as a smaller h theoretically and empirically leads to better performance but an h that is too small may cause numerical issues. We vary α for our method in {0, 0.2, 0.4, 0.6, 0.8, 1.0}. All experiments are run five times with different seeds.
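The hyperparameter grids above can be enumerated explicitly; this is a minimal sketch of the reported search spaces, not the authors' tuning code, and the pairing of each bandwidth h with a particular input dimension is left unspecified here as in the paper:

```python
from itertools import product

# Grids as reported in the experiment setup
baseline_lrs = [1e-3, 1e-4]                  # FRAPPE / SimFair / AdvDebiasing (Adam, 60 epochs, batch 64)
fferm_lambdas = [0.1, 1, 10, 50, 100, 150]   # f-FERM weight parameter lambda
fferm_lrs = [0.1, 0.01, 0.001]
tab_h = [0.02, 0.04, 0.5, 1]                 # Gaussian-kernel bandwidth, chosen by input dimension
tab_alpha = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # fairness/accuracy trade-off parameter
seeds = range(5)                             # five runs with different seeds

fferm_configs = list(product(fferm_lambdas, fferm_lrs))   # 6 x 3 = 18 settings
tab_runs = list(product(tab_h, tab_alpha, seeds))         # 4 x 6 x 5 = 120 runs
```

In practice only one h is used per dataset (matched to its input dimension), so the effective TAB grid per dataset is 6 values of α times 5 seeds.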