Ensuring Fairness Beyond the Training Data

Authors: Debmalya Mandal, Samuel Deng, Suman Jana, Jeannette Wing, Daniel J. Hsu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on standard machine learning fairness datasets suggest that, compared to the state-of-the-art fair classifiers, our classifier retains fairness guarantees and test accuracy for a large class of perturbations on the test set. Furthermore, our experiments show that there is an inherent trade-off between fairness robustness and accuracy of such classifiers.
Researcher Affiliation | Academia | Debmalya Mandal (EMAIL), Columbia University; Samuel Deng (EMAIL), Columbia University; Suman Jana (EMAIL), Columbia University; Jeannette M. Wing (EMAIL), Columbia University; Daniel Hsu (EMAIL), Columbia University
Pseudocode | Yes | ALGORITHM 1: Meta-Algorithm; ALGORITHM 2: Best Response of the -player; ALGORITHM 3: Approximate Fair Classifier (ApxFair)
Open Source Code | Yes | Our code is available at this GitHub repo: https://github.com/essdeee/Ensuring-Fairness-Beyond-the-Training-Data.
Open Datasets | Yes | We used the following four datasets for our experiments. Adult: in this dataset [24], each example represents an adult individual... Communities and Crime: in this dataset from the UCI repository [29]... Law School: we used a preprocessed and balanced subset with 1,823 examples and 17 features [33]. COMPAS: we used a 2,000-example sample from the full dataset. For Adult, Communities and Crime, and Law School we used the preprocessed versions found in the accompanying GitHub repo of [22]. For COMPAS, we used a sample from the original dataset [1].
Dataset Splits | Yes | In order to evaluate different fair classifiers, we first split each dataset into five different random 80%-20% train-test splits. Then, we split each training set further into an 80%-20% training and validation set. Therefore, there were five random sets of 64%-16%-20% train-validation-test splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, cloud instances) used for running experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions using "scikit-learn's logistic regression [27]" but does not provide specific version numbers for scikit-learn or other software dependencies.
Experiment Setup | Yes | To find the correct hyper-parameters (B, , T, and Tm) for our algorithm, we fixed T = 10 for EO, and T = 5 for DP, and used grid search for the hyper-parameters B, , and Tm. The tested values were {0.1, 0.2, . . . , 1} for B, {0, 0.05, . . . , 1} for , and {100, 200, . . . , 2000} for Tm.
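The reported splitting procedure (five random 64%-16%-20% train-validation-test splits) can be sketched with scikit-learn's `train_test_split`. The seeds and the synthetic data shapes below are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(X, y, n_splits=5, seed=0):
    """Produce five random 64%-16%-20% train/validation/test splits.

    The per-split seeds are an assumption; the paper does not report them.
    """
    splits = []
    for i in range(n_splits):
        # First split off 20% of the full data as the test set.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.20, random_state=seed + i)
        # Then split the remaining 80% again 80/20, giving
        # 64% train and 16% validation of the full data.
        X_train, X_val, y_train, y_val = train_test_split(
            X_tr, y_tr, test_size=0.20, random_state=seed + i)
        splits.append((X_train, X_val, X_te, y_train, y_val, y_te))
    return splits

# Synthetic stand-in data (1,000 examples, 17 features, binary labels).
X = np.random.rand(1000, 17)
y = np.random.randint(0, 2, 1000)
splits = make_splits(X, y)
```

With 1,000 examples, each of the five splits contains 640 training, 160 validation, and 200 test examples.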
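The hyper-parameter search described in the Experiment Setup row is a plain grid search over the reported value ranges. A minimal sketch follows; `train_and_eval` is a hypothetical callable (the paper's actual training and validation routine is not shown here), and `hp2` stands in for the remaining hyper-parameter whose symbol is not reproduced above:

```python
from itertools import product

# Grids taken from the reported setup.
B_grid = [round(0.1 * k, 1) for k in range(1, 11)]     # {0.1, 0.2, ..., 1}
hp2_grid = [round(0.05 * k, 2) for k in range(0, 21)]  # {0, 0.05, ..., 1}
Tm_grid = list(range(100, 2001, 100))                  # {100, 200, ..., 2000}

def grid_search(train_and_eval, T=10):
    """Exhaustively evaluate the grid and keep the best configuration.

    train_and_eval(B, hp2, T, Tm) -> validation score to maximize.
    T is fixed (10 for EO, 5 for DP in the paper) rather than searched.
    """
    best, best_score = None, float("-inf")
    for B, hp2, Tm in product(B_grid, hp2_grid, Tm_grid):
        score = train_and_eval(B, hp2, T, Tm)
        if score > best_score:
            best, best_score = (B, hp2, T, Tm), score
    return best, best_score
```

For example, with a toy objective peaked at B = 0.5, hp2 = 0.5, Tm = 1000, the search returns exactly that configuration after evaluating all 10 x 21 x 20 = 4,200 grid points.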