FairGrad: Fairness Aware Gradient Descent

Authors: Gaurav Maheshwari, Michaël Perrot

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we also show that, in addition to its versatility, FairGrad is competitive with several standard baselines in fairness on both standard datasets as well as complex natural language processing and computer vision tasks. In this section, we present several experiments that demonstrate the competitiveness of FairGrad as a procedure to learn fair models for classification."
Researcher Affiliation | Academia | Gaurav Maheshwari and Michaël Perrot, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.
Pseudocode | Yes | Algorithm 1: FairGrad for Exact Fairness.
Open Source Code | Yes | "FairGrad is available as a PyPI package at https://pypi.org/project/fairgrad"
Open Datasets | Yes | "We consider commonly used fairness datasets, namely Adult Income (Kohavi, 1996) and CelebA (Liu et al., 2015). Both are binary classification datasets with binary sensitive attributes (gender)... To showcase the wide applicability of FairGrad, we consider the Twitter Sentiment (Blodgett et al., 2016) dataset... We also employ the UTKFace dataset (Zhang et al., 2017)... We provide detailed descriptions of these datasets as well as the pre-processing steps in Appendix E.2."
Dataset Splits | Yes | "For both datasets, we use 20% of the data as a test set and the remaining 80% as a train set. We further divide the train set into two and keep 25% of the training examples as a validation set. For each repetition, we randomly shuffle the data before splitting it, and thus we have unique splits for each random seed. We use the following seeds: 10, 20, 30, 40, 50 for all our experiments."
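The split protocol quoted above (20% test, then 25% of the remaining train as validation, shuffled per seed) can be sketched in plain Python. This is a minimal illustration of our reading of that protocol; the function and variable names are ours, not from the paper or its released code:

```python
import random

SEEDS = [10, 20, 30, 40, 50]  # the seeds reported in the paper

def split_indices(n, seed):
    """Shuffle n example indices with the given seed, carve out 20% as test,
    then keep 25% of the remaining train set as validation (60/20/20 overall)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fresh shuffle per seed -> unique splits
    n_test = int(0.2 * n)
    test, train = idx[:n_test], idx[n_test:]
    n_val = int(0.25 * len(train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test

train, val, test = split_indices(1000, seed=SEEDS[0])
print(len(train), len(val), len(test))  # 600 200 200
```

Because the shuffle is re-seeded for each repetition, the five seeds yield five distinct train/validation/test partitions of the same data.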
Hardware Specification | Yes | "We used an Intel Xeon E5-2680 CPU to train. We consider a large convolutional neural network (ResNet18 (He et al., 2016)) fine-tuned over the UTKFace dataset... We trained the model using a Tesla P100 GPU. We experiment with a large transformer (bert-base-uncased (Devlin et al., 2019)) fine-tuned over the Twitter Sentiment dataset... We trained it using a Tesla P100 GPU."
Software Dependencies | No | "From a practitioner's point of view, it means that using FairGrad is as simple as replacing their existing PyTorch loss with our custom loss and passing along some metadata, while the rest of the training loop remains identical. For Constraints, we based our implementation on the publicly available authors' library but were only able to reliably handle linear models, and thus we do not consider this baseline for non-linear models. We use the implementation available in the TensorFlow Constrained Optimization library with default hyper-parameters. We use the implementation available in Fairlearn with default hyper-parameters."
Experiment Setup | Yes | "Apart from common hyper-parameters such as dropout, several baselines come with their own set of hyper-parameters. For instance, BiFair has the inner loop length, which controls the number of iterations in its inner loop, while Adversarial has the scaling, which re-weights the adversarial branch loss and the task loss. We provide details of the common and approach-specific hyper-parameters with their ranges in Appendix E.1. For all our experiments, apart from BiFair, we use Batch Gradient Descent as the optimizer with a learning rate of 0.1 and gradient clipping at 0.05 to avoid exploding gradients. For BiFair, we employ the Adam optimizer as suggested by the authors, with a learning rate of 0.001. For each method, we consider all the X possible hyper-parameter combinations and run the training procedure for 50 epochs for each combination. In our setting, the encoder is a Multi-Layer Perceptron with two hidden layers of sizes 64 and 128, respectively, and the task classifier is another Multi-Layer Perceptron with a single hidden layer of size 32. We use ReLU as the activation function with the dropout set to 0.2, and employ batch normalization with default PyTorch parameters. As part of the hyper-parameter tuning, we did a grid search over λ, varying it from 0.1 to 3.0 with an interval of 0.2."
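The quoted λ grid ("from 0.1 to 3.0 with an interval of 0.2") can be enumerated as below. Note that 0.1 + 0.2k never lands exactly on 3.0, so under a literal reading the largest value actually tried is 2.9; that interpretation, and the rounding used to avoid float drift, are our assumptions:

```python
# Enumerate the λ grid: start 0.1, step 0.2, staying within 3.0.
# round(..., 1) cancels floating-point drift (e.g. 0.1 + 0.2*2 = 0.5000...01).
lambda_grid = [round(0.1 + 0.2 * i, 1) for i in range(15)]
print(lambda_grid)  # [0.1, 0.3, 0.5, ..., 2.9] -- 15 candidate values
```

Each of these 15 values would then be crossed with the other hyper-parameters when forming the full search grid described in Appendix E.1.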