Multi-Output Distributional Fairness via Post-Processing

Authors: Gang Li, Qihang Lin, Ayush Ghosh, Tianbao Yang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies evaluate the proposed approach against various baselines on multi-task/multi-class classification and representation learning tasks, demonstrating the effectiveness of the proposed approach.
Researcher Affiliation | Academia | Gang Li (EMAIL), Texas A&M University; Qihang Lin (EMAIL), The University of Iowa; Ayush Ghosh (EMAIL), The University of Iowa; Tianbao Yang (EMAIL), Texas A&M University
Pseudocode | Yes | Algorithm 1: Approximate Barycenter; Algorithm 2: Post-Processing Method by Transporting to Approximate Barycenter (TAB)
Open Source Code | Yes | Code is available at: https://github.com/GangLii/TAB
Open Datasets | Yes | Datasets. In our experiments, we include four datasets from various domains: marketing (Customer dataset, footnote 3), medical diagnosis (CheXpert dataset (Irvin et al., 2019)), and face recognition (CelebA dataset (Liu et al., 2015) and UTKFace dataset (Zhang & Qi, 2017)). The details of these datasets are provided in Appendix B. [Footnote 3: https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation]
Dataset Splits | Yes | The Customer dataset has 8,068 training samples and 2,627 testing samples, and the task is to classify customers into anonymous customer categories for target marketing. The CheXpert dataset contains 224,316 training instances, and the task is to detect five chest and lung diseases based on X-ray images. Due to the high computational complexity of solving optimal transportation between large datasets, we sample 5% of instances from the original training data as the training set and another 5% as the testing set. The CelebA dataset contains 162,770 training instances and 39,829 testing instances, and the task is to detect ten attributes (chosen based on Ramaswamy et al. (2021)) of the person in an image: being attractive, having bags under the eyes, having black hair, having bangs, wearing glasses, having high cheekbones, smiling, wearing a hat, having a slightly open mouth, and having a pointy nose. For the same computational reason, we sample 5% of instances from the original training data as the training set and 20% of the original testing data as the testing set. The UTKFace dataset consists of 23,705 face images with five groups in terms of race (i.e., White, Black, Asian, Indian, and Others), and we randomly split it into training and testing (8:2) sets.
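The paper does not release split-generation code in this excerpt; the sampling protocol above can be sketched as follows, assuming uniform sampling without replacement and a hypothetical fixed seed (the paper reports five runs with different seeds):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # hypothetical seed, one of the five reported runs

# CheXpert: a disjoint 5% of the 224,316 training instances for train and test each
perm = rng.permutation(224316)
k = int(224316 * 0.05)
chexpert_train, chexpert_test = perm[:k], perm[k:2 * k]

# UTKFace: random 8:2 train/test split of the 23,705 images
perm_utk = rng.permutation(23705)
cut = int(23705 * 0.8)
utk_train, utk_test = perm_utk[:cut], perm_utk[cut:]
```

The disjointness of the two 5% CheXpert samples follows from slicing a single permutation; whether the paper samples them jointly or independently is not specified here.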
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions several models and optimizers, such as ResNet50, DenseNet121, the Adam optimizer, CLIP (ViT-B/16), and a Gaussian kernel, but does not specify software library versions (e.g., PyTorch 1.9, TensorFlow 2.x) that are crucial for reproducibility.
Experiment Setup | Yes | For the baselines FRAPPE, SimFair, and AdvDebiasing, we train the models for 60 epochs with the Adam optimizer and a batch size of 64, and tune the learning rate in {1e-3, 1e-4}. For f-FERM, we follow their paper to vary the weight parameter λ in {0.1, 1, 10, 50, 100, 150} and tune the learning rate in {0.1, 0.01, 0.001} with their proposed optimization algorithm. For our method, we experiment with a Gaussian kernel, and h is chosen from {0.02, 0.04, 0.5, 1} based on the input dimension, as a smaller h theoretically and empirically leads to better performance but an h that is too small may cause numerical issues. We vary α for our method in {0, 0.2, 0.4, 0.6, 0.8, 1.0}. All experiments are run five times with different seeds.
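The hyperparameter grids above can be enumerated explicitly; this is a minimal sketch of the reported search spaces, not the authors' tuning code, and the pairing of each bandwidth h with a particular input dimension is left unspecified here as in the paper:

```python
from itertools import product

# Grids as reported in the experiment setup
baseline_lrs = [1e-3, 1e-4]                  # FRAPPE / SimFair / AdvDebiasing (Adam, 60 epochs, batch 64)
fferm_lambdas = [0.1, 1, 10, 50, 100, 150]   # f-FERM weight parameter lambda
fferm_lrs = [0.1, 0.01, 0.001]
tab_h = [0.02, 0.04, 0.5, 1]                 # Gaussian-kernel bandwidth, chosen by input dimension
tab_alpha = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # fairness/accuracy trade-off parameter
seeds = range(5)                             # five runs with different seeds

fferm_configs = list(product(fferm_lambdas, fferm_lrs))   # 6 x 3 = 18 settings
tab_runs = list(product(tab_h, tab_alpha, seeds))         # 4 x 6 x 5 = 120 runs
```

In practice only one h is used per dataset (matched to its input dimension), so the effective TAB grid per dataset is 6 values of α times 5 seeds.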