Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access
Authors: Akshaj Kumar Veldanda, Ivan Brugere, Sanghamitra Dutta, Alan Mishler, Siddharth Garg
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the ERM model in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels. We demonstrate that Antigone outperforms existing methods on CelebA, Waterbirds, and UCI datasets. ... Empirically, we find that: (1) Antigone produces more accurate PSA labels on validation data compared to GEORGE's unsupervised clustering approach (Table 1); (2) used with JTT (AFR), Antigone comes close to matching the fairness of JTT (AFR) tuned with ground-truth SA as shown in Table 2 (Table 5); and (3) improves the fairness of both GEORGE and ARL when Antigone's PSA labels are used instead of their own hyperparameter tuning methods (Table 3, Table 4). |
| Researcher Affiliation | Collaboration | Akshaj Kumar Veldanda, Electrical and Computer Engineering Department, New York University; Ivan Brugere, JP Morgan Chase AI Research; Sanghamitra Dutta, Electrical and Computer Engineering Department, University of Maryland College Park; Alan Mishler, JP Morgan Chase AI Research; Siddharth Garg, Electrical and Computer Engineering Department, New York University |
| Pseudocode | No | The paper describes the 'Antigone Algorithm' in Section 2.2 using descriptive text and numbered steps, but it is presented as continuous prose and mathematical formulas rather than a structured pseudocode block or algorithm chart with distinct keywords like 'Input', 'Output', 'For loop', 'If-else' statements, etc. Therefore, it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code with README.txt file is available at: https://github.com/akshajkumarv/fairness_without_demographics |
| Open Datasets | Yes | We evaluate Antigone in conjunction with four state-of-the-art methods, JTT (Liu et al., 2021), AFR (Qiu et al., 2023), GEORGE (Sohoni et al., 2020) and ARL (Lahoti et al., 2020), on binary SA using demographic parity, equal opportunity, and worst sub-group accuracy as fairness metrics across the CelebA, Waterbirds and Adult datasets. ... CelebA (Liu et al., 2015) is an image dataset... Waterbirds is a synthetically generated dataset... Adult dataset (Dua & Graff, 2017) is used to predict if an individual's annual income is <=50K (Y = 0) or > 50K (Y = 1)... |
| Dataset Splits | Yes | CelebA Dataset: ...The dataset is split into training, validation and test sets with 162770, 19867 and 19962 images, respectively. Waterbirds Dataset: ...The dataset is split into training, validation and test sets with 4795, 1199 and 5794 images, respectively. UCI Adult Dataset: ...The dataset consists of 45,000 instances and is split into training, validation and test sets with 21112, 9049 and 15060 instances, respectively. |
| Hardware Specification | Yes | We trained all the models employing the JTT approach using Quadro RTX 8000 (48 GB) NVIDIA GPU cards, whereas, for both GEORGE and ARL approaches, we used GeForce RTX 3090 (24 GB) NVIDIA GPU cards. |
| Software Dependencies | No | The paper mentions using |
| Experiment Setup | Yes | CelebA Dataset: ...we fine-tune a pre-trained ResNet50 architecture for a total of 50 epochs using SGD optimizer and a batch size of 128. We tune JTT over the same hyper-parameters as in their paper: three pairs of learning rates and weight decays, (1e-04, 1e-04), (1e-04, 1e-02), (1e-05, 1e-01) for both stages, and over ten early stopping points up to T = 50 and λ ∈ {20, 50, 100} for stage 2. For Antigone, we explore over the same learning rate and weight decay values, as well as early stopping at any of the 50 training epochs. Waterbirds Dataset: ...we fine-tune ResNet50 architecture for a total of 300 epochs using the SGD optimizer and a batch size of 64. We tune JTT over the same hyper-parameters as in their paper: three pairs of learning rates and weight decays, (1e-03, 1e-04), (1e-04, 1e-01), (1e-05, 1.0) for both stages, and over 14 early stopping points up to T = 300 and λ ∈ {20, 50, 100} for stage 2. UCI Adult Dataset: ...we train a multi-layer neural network with one hidden layer consisting of 64 neurons. We train for a total of 100 epochs using the SGD optimizer and a batch size of 256. We tune Antigone and JTT by performing grid search over learning rates {1e-03, 1e-04, 1e-05} and weight decays {1e-01, 1e-03}. |
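The quoted experiment setup implies concrete search-space sizes for the reported grid searches. The sketch below enumerates those grids to make the sizes explicit; it is a minimal reconstruction, not the authors' code — the variable names and the exact spacing of the CelebA early-stopping points are our assumptions (the paper states only "ten early stopping points up to T = 50").

```python
from itertools import product

# CelebA JTT stage 2: three (learning rate, weight decay) pairs,
# ten early-stopping points up to T = 50, and lambda in {20, 50, 100}.
celeba_lr_wd = [(1e-4, 1e-4), (1e-4, 1e-2), (1e-5, 1e-1)]
celeba_stops = list(range(5, 51, 5))  # assumed even spacing of the 10 points
lambdas = [20, 50, 100]
celeba_grid = list(product(celeba_lr_wd, celeba_stops, lambdas))

# Waterbirds JTT stage 2: three pairs, 14 early-stopping points up to
# T = 300 (exact epochs not specified in the excerpt), same lambdas.
waterbirds_lr_wd = [(1e-3, 1e-4), (1e-4, 1e-1), (1e-5, 1.0)]
n_waterbirds_stops = 14
waterbirds_grid_size = len(waterbirds_lr_wd) * n_waterbirds_stops * len(lambdas)

# UCI Adult: full cross-product of learning rates and weight decays.
adult_grid = list(product([1e-3, 1e-4, 1e-5], [1e-1, 1e-3]))

print(len(celeba_grid), waterbirds_grid_size, len(adult_grid))
# → 90 126 6
```

These counts (90, 126, and 6 configurations before any per-epoch model selection) give a rough sense of the compute budget a reproduction would need per dataset.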