Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach

Authors: Alessandro Fabris, Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani

JAIR 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. In this work, we propose a novel solution to the problem of measuring classifier fairness under unawareness by using techniques from quantification (Esuli et al., 2023; González et al., 2017), a supervised learning task concerned with estimating, rather than the class labels of individual data points, the class prevalence values for samples of such data points, i.e., group-level quantities, such as the percentage of women in a given sample. Quantification methods address two pressing facets of the fairness under unawareness problem: (1) their estimates are robust to distribution shift (i.e., to the fact that the distribution of the labels in the unlabeled data may significantly differ from the analogous distribution in the training data), which is often inevitable since populations evolve, and demographic data are unlikely to be representative of every condition encountered at deployment time; (2) they allow the estimation of group-level quantities but do not allow the inference of sensitive attributes at the individual level, which is beneficial since the latter might lead to the inappropriate and nonconsensual utilization of this sensitive information, reducing individuals' agency over data (Andrus and Villeneuve, 2022). Quantification methods achieve these goals by directly targeting group-level prevalence estimates. They do so through a variety of approaches, including, e.g., dedicated loss functions, task-specific adjustments, and ad hoc model selection procedures. Overall, we make the following contributions: Quantifying fairness under unawareness. We show that measuring fairness under unawareness can be cast as a quantification problem and solved with approaches of proven consistency established in the quantification literature (Section 4). We propose and demonstrate several high-accuracy fairness estimators for both vanilla and fairness-aware classifiers.
Experimental protocols for five major challenges. Drawing from the algorithmic fairness literature, we identify five important challenges that arise in estimating fairness under unawareness. These challenges are encountered in real-world applications, and include the nonstationarity of the processes generating the data and the variable cardinality of the available samples. For each such challenge, we define and formalise a precise experimental protocol, through which we compare the performance of quantifiers (i.e., group-level prevalence estimators) generated by six different quantification methods (Sections 5.3–5.7).
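The contrast between individual-level attribute inference and group-level prevalence estimation can be illustrated with two standard quantification baselines, classify-and-count and its adjusted variant (a hedged sketch for illustration only; the function names and the use of hard predictions are assumptions, not the paper's code):

```python
import numpy as np

def classify_and_count(predictions):
    # Naive group-level estimate: the fraction of instances the
    # sensitive-attribute classifier labels as positive.
    return float(np.mean(predictions))

def adjusted_count(predictions, tpr, fpr):
    # Standard adjusted count: correct classify-and-count using the
    # classifier's true/false positive rates, estimated on held-out
    # labelled data; this correction is consistent even under
    # prior-probability shift, while never committing to any
    # individual's sensitive attribute.
    cc = classify_and_count(predictions)
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))
```

With tpr = 0.8 and fpr = 0.2, a raw positive-prediction rate of 0.5 is corrected to (0.5 − 0.2)/(0.8 − 0.2) = 0.5: the output is a single group-level prevalence, not per-individual labels.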
Researcher Affiliation: Academia. Alessandro Fabris (EMAIL), Max Planck Institute for Security and Privacy, Universitätsstraße 140, 44799 Bochum, Germany; Department of Information Engineering, University of Padova, Via Giovanni Gradenigo 6B, Padua, 35131, Italy. Andrea Esuli (EMAIL), Alejandro Moreo (EMAIL), Fabrizio Sebastiani (EMAIL), Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi 1, Pisa, 56124, Italy.
Pseudocode: Yes. Pseudocode describing the SLD algorithm can be found in Appendix A. We consider HDy (González-Castro et al., 2013), a probabilistic binary quantification method that views quantification as the problem of minimizing the divergence (measured in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier, one coming from the unlabelled examples and the other coming from a validation set. HDy looks for the mixture parameter α that best fits the validation distribution (consisting of a mixture of a positive and a negative distribution) to the unlabelled distribution, and returns α as the estimated prevalence of the positive class. Here, robustness to distribution shift is achieved by analysing the distribution of the posterior probabilities in the unlabelled set, which reveals how conditions have changed with respect to the training data. A more detailed description of HDy can be found in Appendix B. Appendix A. The SLD Method. SLD (Saerens et al., 2002) produces prevalence estimates p̂_σ^SLD(s) iteratively, using the EM algorithm. In detail, it operates on two sets, L and U, where the former is the labelled set (training set) and the latter is the unlabelled set (test set). The method iterates until convergence (i.e., until the difference between the prevalences estimated across two consecutive iterations is less than a tolerance factor ϵ; we use ϵ = 10⁻⁴) or until a maximum number of iterations is reached. The pseudocode describing SLD is as follows:
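The paper's pseudocode itself is not reproduced in this excerpt. Based on the description above, the EM procedure of Saerens et al. (2002) can be sketched as follows (a minimal illustration under the stated tolerance ϵ = 10⁻⁴; the `sld` name and array shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def sld(posteriors, train_prevalence, epsilon=1e-4, max_iter=1000):
    """EM-based prevalence estimation (Saerens et al., 2002).

    posteriors: (n, c) classifier posteriors on the unlabelled set U.
    train_prevalence: (c,) class prevalences observed in the labelled set L.
    """
    p0 = np.asarray(train_prevalence, dtype=float)
    prev = p0.copy()
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio of the current
        # prevalence estimate to the training prevalence, then renormalize
        scaled = posteriors * (prev / p0)
        scaled /= scaled.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean rescaled posterior
        new_prev = scaled.mean(axis=0)
        # Stop when consecutive estimates differ by less than epsilon
        if np.abs(new_prev - prev).max() < epsilon:
            return new_prev
        prev = new_prev
    return prev
```

Each iteration nudges the posteriors toward consistency with the current prevalence estimate, which is why the method adapts to prior-probability shift between L and U.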
Open Source Code: Yes. Our code is available at https://github.com/alessandro-fabris/ql4facct.
Open Datasets: Yes. We compare the performance of each estimation technique on three datasets (Adult, COMPAS, and Credit Card). The datasets and respective preprocessing are described in detail in Section 5.2. We focus our discussion (and we present plots; see Figures 1–8) on the experiments carried out on the Adult dataset, while we summarise numerically the results on COMPAS and Credit Card (Tables 4–8), discussing them only when significant differences from Adult arise. Adult. One of the most popular resources in the UCI Machine Learning Repository, the Adult dataset was curated to benchmark the performance of machine learning algorithms. It was extracted from the March 1994 US Current Population Survey and represents respondents along demographic and socioeconomic dimensions, reporting, e.g., their sex, race, educational attainment, and occupation. Each instance comes with a binary label, encoding whether their income exceeds $50,000, which is the target of the associated classification task. We consider sex the sensitive attribute S, with a binary categorization of respondents as "Female" or "Male". From the non-sensitive attributes X, we remove education-num (a redundant feature), relationship (where the values "husband" and "wife" are near-perfect predictors of sex), and fnlwgt (a variable released by the US Census Bureau to encode how representative each instance is of the overall population). Categorical variables are dummy-encoded and instances with missing values (7%) are removed. COMPAS. This dataset was curated to audit racial biases in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment tool, which estimates the likelihood of a defendant becoming a recidivist (Angwin et al., 2016; Larson et al., 2016). The dataset represents defendants who were scored for risk of recidivism by COMPAS in Broward County, Florida between 2013 and 2014, summarizing their demographics, criminal record, custody, and COMPAS scores.
We consider the compas-scores-two-years subset published by ProPublica on GitHub, consisting of defendants who were observed for two years after screening, for whom a binary recidivism ground truth is available. We follow standard pre-processing to remove noisy instances (ProPublica, 2016). We focus on race as a protected attribute S, restricting the data to defendants labelled "African-American" or "Caucasian". Our attributes X are the age of the defendant (age, an integer), the number of juvenile felonies, misdemeanours, and other convictions (juv_fel_count, juv_misd_count, juv_other_count, all integers), the number of prior crimes (priors_count, an integer), and the degree of the current charge (c_charge_degree, felony or misdemeanour, dummy-encoded). Credit Card. This resource was curated to study automated credit card default prediction, following a wave of defaults in Taiwan. The dataset summarizes the payment history of customers of an important Taiwanese bank, from April to October 2005. Demographics, marital status, and education of customers are also provided, along with the amount of credit given and a binary variable encoding default on payment within the next month, which is the associated prediction task. We consider sex (binarily encoded) as the sensitive attribute S and keep every other variable in X, preprocessing categorical ones via dummy-encoding (education, marriage, pay_0, pay_2, pay_3, pay_4, pay_5, pay_6). Differently from Adult, we keep marital status, as its values are not trivial predictors of the sensitive attribute. A summary of these datasets and related statistics is reported in Table 3.
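The Adult preprocessing described above (drop education-num, relationship, and fnlwgt; remove rows with missing values; dummy-encode categoricals) might be sketched with pandas as follows (an illustrative sketch assuming the standard UCI column names, with "?" marking missing values; the paper's actual code may differ):

```python
import pandas as pd

def preprocess_adult(df: pd.DataFrame):
    # Drop the redundant, leaking, and survey-weight columns listed above
    df = df.drop(columns=["education-num", "relationship", "fnlwgt"])
    # Remove instances with missing values (about 7% of the data)
    df = df.replace("?", pd.NA).dropna()
    # Sensitive attribute S (sex) and classification target Y (income)
    s = (df.pop("sex") == "Female").astype(int)
    y = (df.pop("income") == ">50K").astype(int)
    # Dummy-encode the remaining categorical variables
    x = pd.get_dummies(df)
    return x, s, y
```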
Dataset Splits: Yes. We divide a given dataset into three subsets D_A, D_B, D_C of identical sizes and identical joint distribution over (S, Y). We perform five random such splits; in order to test each estimator under the same conditions, these splits are the same for every method. For each split, we permute the roles of the stratified subsets D_A, D_B, D_C, so that each subset alternately serves as the training set (D1), the auxiliary set (D2), or the test set (D3). We test all (six) such permutations. Whenever an experimental protocol requires sampling from a set, for instance when artificially altering a class prevalence value, we perform 10 different samplings. To perform extensive experiments at a reasonable computational cost, every time an experimental protocol requires changing a dataset D into a version D′ characterized by distribution shift, we also reduce its cardinality to |D′| = 500. Further details and implications of this choice on each experimental protocol are provided in the context of the protocol's setup (e.g., Section 5.6.1).
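The splitting scheme quoted above, a three-way split stratified over (S, Y) followed by all six role permutations, could be sketched as follows (a simplified illustration; function names and the index-based representation are assumptions, not the paper's code):

```python
import itertools
import numpy as np

def stratified_thirds(s, y, rng):
    """Split indices into three subsets of (approximately) identical
    size and identical joint distribution over (S, Y)."""
    idx = np.arange(len(s))
    parts = [[], [], []]
    # Stratify: shuffle within each (S, Y) group, then deal into thirds
    for key in set(zip(s, y)):
        members = idx[[(a, b) == key for a, b in zip(s, y)]]
        members = rng.permutation(members)
        for i, chunk in enumerate(np.array_split(members, 3)):
            parts[i].extend(chunk.tolist())
    return parts

def role_permutations(parts):
    # Each subset alternately serves as training (D1), auxiliary (D2),
    # or test (D3) set: 3! = 6 permutations per split
    return list(itertools.permutations(parts, 3))
```

Repeating this for five random seeds, shared across all methods, reproduces the 5 splits × 6 permutations grid described in the protocol.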
Hardware Specification: No. The paper does not mention the specific hardware used for running the experiments; it only refers to classifiers trained via LR or SVM.
Software Dependencies: No. The paper mentions using Logistic Regression (LR) and Support Vector Machines (SVMs) as classifiers, and Python is implied by the GitHub link, but no specific version numbers for any software dependencies or libraries are provided.
Experiment Setup: Yes. In this section, we carry out an evaluation of different estimators of demographic disparity. We propose five experimental protocols (Sections 5.3–5.7) summarized in Table 2. Each protocol addresses a major challenge that may arise in estimating fairness under unawareness, and does so by varying the size and the mutual distribution shift of the training, auxiliary, and test sets. Protocol names are in the form action-characteristic-dataset, as they act on datasets (D1, D2, or D3), modifying their characteristics (size or class prevalence) through one of two actions (sampling or flipping of labels). We investigate the performance of six estimators of demographic disparity in each of the five challenges/protocols, keeping the remaining factors constant. For every protocol, we perform an extensive empirical evaluation as follows: We compare the performance of each estimation technique on three datasets (Adult, COMPAS, and Credit Card). The datasets and respective preprocessing are described in detail in Section 5.2. We focus our discussion (and we present plots; see Figures 1–8) on the experiments carried out on the Adult dataset, while we summarise numerically the results on COMPAS and Credit Card (Tables 4–8), discussing them only when significant differences from Adult arise. We divide a given dataset into three subsets D_A, D_B, D_C of identical sizes and identical joint distribution over (S, Y). We perform five random such splits; in order to test each estimator under the same conditions, these splits are the same for every method. For each split, we permute the roles of the stratified subsets D_A, D_B, D_C, so that each subset alternately serves as the training set (D1), the auxiliary set (D2), or the test set (D3). We test all (six) such permutations. Whenever an experimental protocol requires sampling from a set, for instance when artificially altering a class prevalence value, we perform 10 different samplings.
To perform extensive experiments at a reasonable computational cost, every time an experimental protocol requires changing a dataset D into a version D′ characterized by distribution shift, we also reduce its cardinality to |D′| = 500. Further details and implications of this choice on each experimental protocol are provided in the context of the protocol's setup (e.g., Section 5.6.1). Different learning approaches can be used to train the sensitive-attribute classifier k_s underlying the quantification methods. We test Logistic Regression (LR) and Support Vector Machines (SVMs). Sections 5.3–5.7 report results of quantification algorithms wrapped around a classifier trained via LR. Analogous results obtained with SVMs are reported in Appendix D. We train the classifier h, whose demographic disparity we aim to estimate, using LR with balanced class weights (i.e., loss weights inversely proportional to class frequencies). To measure the performance of different quantifiers, we report the signed estimation error, derived from Equations (10) and (14) as

e = δ̂_h^S − δ_h^S = [μ̂(1) − μ̂(0)] − [μ(1) − μ(0)]    (16)

We refer to |e| as the Absolute Error (AE), and evaluate the results of our experiments by Mean Absolute Error (MAE) and Mean Squared Error (MSE), defined as

MAE(E) = (1/|E|) Σ_{e_i ∈ E} |e_i|    (17)

MSE(E) = (1/|E|) Σ_{e_i ∈ E} e_i²    (18)

where the mean of the signed estimation errors e_i is computed over multiple experiments E. Overall, our experiments consist of over 700,000 separate estimations of demographic disparity.
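The signed error and the two aggregate metrics in Equations (16)–(18) are straightforward to compute; a minimal sketch (function names are assumed, and μ values are indexed by group, 0 and 1):

```python
import numpy as np

def signed_error(mu_hat, mu):
    # Eq. (16): e = [mu_hat(1) - mu_hat(0)] - [mu(1) - mu(0)],
    # i.e., estimated minus true demographic disparity
    return (mu_hat[1] - mu_hat[0]) - (mu[1] - mu[0])

def mae(errors):
    # Eq. (17): mean of absolute signed errors over experiments E
    return float(np.mean(np.abs(errors)))

def mse(errors):
    # Eq. (18): mean of squared signed errors over experiments E
    return float(np.mean(np.square(errors)))
```

Note that MAE and MSE average the unsigned magnitudes, so individual over- and under-estimations do not cancel out across the experiments in E.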