FairGrad: Fairness Aware Gradient Descent

Authors: Gaurav Maheshwari, Michaël Perrot

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we also show that, in addition to its versatility, FairGrad is competitive with several standard baselines in fairness on both standard datasets as well as complex natural language processing and computer vision tasks. In this section, we present several experiments that demonstrate the competitiveness of FairGrad as a procedure to learn fair models for classification."
Researcher Affiliation | Academia | Gaurav Maheshwari and Michaël Perrot, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.
Pseudocode | Yes | Algorithm 1: FairGrad for Exact Fairness.
Open Source Code | Yes | "FairGrad is available as a PyPI package at https://pypi.org/project/fairgrad"
Open Datasets | Yes | "We consider commonly used fairness datasets, namely Adult Income (Kohavi, 1996) and CelebA (Liu et al., 2015). Both are binary classification datasets with binary sensitive attributes (gender)... To showcase the wide applicability of FairGrad, we consider the Twitter Sentiment (Blodgett et al., 2016) dataset... We also employ the UTKFace dataset (Zhang et al., 2017)... We provide detailed descriptions of these datasets as well as the pre-processing steps in Appendix E.2."
Dataset Splits | Yes | "For both datasets, we use 20% of the data as a test set and the remaining 80% as a train set. We further divide the train set into two and keep 25% of the training examples as a validation set. For each repetition, we randomly shuffle the data before splitting it, and thus we have unique splits for each random seed. We use the following seeds: 10, 20, 30, 40, 50 for all our experiments."
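The split protocol quoted above (20% test, then 25% of the remaining train as validation, shuffled per seed) can be sketched in plain Python. This is a minimal illustration of our reading of that protocol; the function and variable names are ours, not from the paper or its released code:

```python
import random

SEEDS = [10, 20, 30, 40, 50]  # the seeds reported in the paper

def split_indices(n, seed):
    """Shuffle n example indices with the given seed, carve out 20% as test,
    then keep 25% of the remaining train set as validation (60/20/20 overall)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fresh shuffle per seed -> unique splits
    n_test = int(0.2 * n)
    test, train = idx[:n_test], idx[n_test:]
    n_val = int(0.25 * len(train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test

train, val, test = split_indices(1000, seed=SEEDS[0])
print(len(train), len(val), len(test))  # 600 200 200
```

Because the shuffle is re-seeded for each repetition, the five seeds yield five distinct train/validation/test partitions of the same data.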
Hardware Specification | Yes | "We used an Intel Xeon E5-2680 CPU to train. We consider a large convolutional neural network (ResNet18 (He et al., 2016)) fine-tuned over the UTKFace dataset... We trained the model using a Tesla P100 GPU. We experiment with a large transformer (bert-base-uncased (Devlin et al., 2019)) fine-tuned over the Twitter Sentiment dataset... We trained it using a Tesla P100 GPU."
Software Dependencies | No | "From a practitioner's point of view, it means that using FairGrad is as simple as replacing their existing PyTorch loss with our custom loss and passing along some metadata, while the rest of the training loop remains identical. For Constraints, we based our implementation on the publicly available authors' library but were only able to reliably handle linear models, and thus we do not consider this baseline for non-linear models. We use the implementation available in the TensorFlow Constrained Optimization library with default hyper-parameters. We use the implementation available in Fairlearn with default hyper-parameters."
Experiment Setup | Yes | "Apart from common hyper-parameters such as dropout, several baselines come with their own set of hyper-parameters. For instance, BiFair has the inner loop length, which controls the number of iterations in its inner loop, while Adversarial has the scaling, which re-weights the adversarial branch loss and the task loss. We provide details of the common and approach-specific hyper-parameters with their ranges in Appendix E.1. For all our experiments, apart from BiFair, we use Batch Gradient Descent as the optimizer with a learning rate of 0.1 and gradient clipping at 0.05 to avoid exploding gradients. For BiFair, we employ the Adam optimizer as suggested by the authors, with a learning rate of 0.001. For each method, we consider all the X possible hyper-parameter combinations and run the training procedure for 50 epochs for each combination. In our setting, the encoder is a Multi-Layer Perceptron with two hidden layers of sizes 64 and 128, respectively, and the task classifier is another Multi-Layer Perceptron with a single hidden layer of size 32. We use ReLU as the activation function with the dropout set to 0.2, and employ batch normalization with default PyTorch parameters. As part of the hyper-parameter tuning, we did a grid search over λ, varying it from 0.1 to 3.0 with an interval of 0.2."
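The quoted λ grid ("from 0.1 to 3.0 with an interval of 0.2") can be enumerated as below. Note that 0.1 + 0.2k never lands exactly on 3.0, so under a literal reading the largest value actually tried is 2.9; that interpretation, and the rounding used to avoid float drift, are our assumptions:

```python
# Enumerate the λ grid: start 0.1, step 0.2, staying within 3.0.
# round(..., 1) cancels floating-point drift (e.g. 0.1 + 0.2*2 = 0.5000...01).
lambda_grid = [round(0.1 + 0.2 * i, 1) for i in range(15)]
print(lambda_grid)  # [0.1, 0.3, 0.5, ..., 2.9] -- 15 candidate values
```

Each of these 15 values would then be crossed with the other hyper-parameters when forming the full search grid described in Appendix E.1.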