Fair Data Representation for Machine Learning at the Pareto Frontier

Authors: Shizhou Xu, Thomas Strohmer

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.
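The third claim quoted above, that the optimal affine maps remain cheap even for high-dimensional data, rests on the closed-form linear part of the W2-optimal map between Gaussians. A minimal numpy sketch (our illustration under that Gaussian assumption, not the authors' released code; `sqrtm_psd` and `gaussian_ot_matrix` are hypothetical helper names):

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_ot_matrix(S1, S2):
    """Linear part A of the W2-optimal affine map between N(m1, S1) and
    N(m2, S2): T(x) = m2 + A (x - m1), with
    A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2} (symmetric PSD)."""
    R = sqrtm_psd(S1)
    R_inv = np.linalg.inv(R)
    return R_inv @ sqrtm_psd(R @ S2 @ R) @ R_inv

# Example: moments of two hypothetical sensitive groups.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, 0.2], [0.2, 3.0]])
A = gaussian_ot_matrix(S1, S2)
# Pushforward covariance A S1 A equals the target covariance S2.
```

The cost is dominated by a few eigendecompositions of d-by-d matrices, which is what makes the affine construction tractable in high dimension.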
Researcher Affiliation | Academia | Shizhou Xu (EMAIL), Department of Mathematics, University of California, Davis, Davis, CA 95616-5270, USA; Thomas Strohmer (EMAIL), Department of Mathematics, Center of Data Science and Artificial Intelligence Research, University of California, Davis, Davis, CA 95616-5270, USA
Pseudocode | Yes | Algorithm 1: Pseudo-Barycenter Geodesics for Independent Variable; Algorithm 2: Dependent (or Post-processing) Pseudo-Barycenter Geodesics
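The geodesic idea behind both algorithms can be illustrated in one dimension, where the Gaussian barycenter moments are simply the weighted averages of the group means and standard deviations. A simplified one-dimensional sketch (our reading of the pseudo-barycenter geodesic, not the paper's full multivariate algorithm; the function name is hypothetical): each group is moved a fraction t of the way along the W2 geodesic toward the barycenter, so t trades accuracy (t = 0, original data) against statistical parity (t = 1, fully fair representation).

```python
import numpy as np

def pseudo_barycenter_geodesic_1d(x, z, t=1.0):
    """Move each sensitive group's values a fraction t of the way along the
    W2 geodesic toward the (1-D, Gaussian-moment) barycenter of the groups."""
    groups = np.unique(z)
    w = np.array([(z == g).mean() for g in groups])   # group weights
    m = np.array([x[z == g].mean() for g in groups])  # group means
    s = np.array([x[z == g].std() for g in groups])   # group stds
    m_bar, s_bar = w @ m, w @ s  # 1-D barycenter: averaged moments
    out = x.astype(float).copy()
    for g, mg, sg in zip(groups, m, s):
        T = m_bar + (s_bar / sg) * (x[z == g] - mg)   # group -> barycenter map
        out[z == g] = (1 - t) * x[z == g] + t * T     # McCann interpolation
    return out
```

At t = 1 every group ends up with the same first two moments, which is the parity end of the Pareto frontier in this toy setting.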
Open Source Code | Yes | The code for the results of our experiments is available online at: github.com/xushizhou/fair_data_representation
Open Datasets | Yes | We adopt the following metrics of accuracy and discrimination that are frequently used in fair machine learning experiments on various data sets: (1) For fair classification, prediction accuracy and statistical disparity are quantified respectively by AUC (area under the Receiver Operating Characteristic curve) and $\mathrm{Discrimination} = \max_{z, z' \in Z} |P(\hat Y_z = 1) - P(\hat Y_{z'} = 1)|$, as defined in [13]. (2) For univariate supervised learning, prediction error and statistical disparity are quantified respectively by MSE (mean squared error, equivalent to the squared L2 norm on the sample probability space) and the KS (Kolmogorov-Smirnov) distance, as in [18], so that readers can compare the proposed methods indirectly with other methods tested in [13, 18, 44] and their references. (3) For univariate and multivariate supervised learning, prediction error and statistical disparity are quantified respectively by the L2 and W2 (Wasserstein) distances, the quantifications the current work adopts to prove the Pareto frontier in the sections above. In addition, we perform tests on four benchmark data sets: CRIME, LSAC, Adult, and COMPAS, which are also frequently used in fair learning experiments.
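The two disparity metrics quoted above are straightforward to compute from predictions and group labels. A minimal numpy sketch (ours, under the usual empirical-CDF definitions; function names are hypothetical):

```python
import numpy as np

def discrimination(y_hat, z):
    """max over group pairs of |P(Yhat = 1 | Z = z) - P(Yhat = 1 | Z = z')|."""
    rates = [y_hat[z == g].mean() for g in np.unique(z)]
    return max(rates) - min(rates)

def ks_distance(a, b):
    """Kolmogorov-Smirnov distance between the empirical CDFs of samples a, b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf = lambda s: np.searchsorted(np.sort(s), grid, side="right") / len(s)
    return np.abs(cdf(a) - cdf(b)).max()
```

For KS between group-wise prediction distributions, pass the predictions of each group as `a` and `b`.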
Dataset Splits | Yes | For all the test results, we apply 5-fold cross-validation with a 50% training and 50% testing split, except for a 90% training and 10% testing split in the linear regression test on LSAC due to the high computational cost of the post-processing Wasserstein barycenter method [18].
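We read "5-fold cross-validation with 50% training and 50% testing split" as five repetitions of a random holdout split (standard k-fold would imply an 80/20 split at k = 5). A hypothetical sketch of that protocol in numpy, with `train_frac=0.9` covering the LSAC linear regression case:

```python
import numpy as np

def repeated_holdout_splits(n, n_splits=5, train_frac=0.5, seed=0):
    """Index pairs for n_splits repeated random train/test holdout splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * n))
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n)  # fresh shuffle per repetition
        splits.append((perm[:n_train], perm[n_train:]))
    return splits
```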
Hardware Specification | Yes | As shown in the table above, the computational cost of the pseudo-barycenter method is significantly lower than the cost of the known post-processing methods: on average 7836 times faster on LSAC and 21 times faster on CRIME in a single train-test cycle for a single supervised learning model. Furthermore, in model selection or composition, the pre-processing time is a fixed one-time cost while the post-processing time is additive. (See point 4 below for a more detailed explanation.) Now, we show the major advantages of the proposed method compared to the post-processing ones, such as [18, 28, 24]:

Acknowledgement

The authors want to thank the referees, whose profound and detailed feedback greatly enhanced the quality and clarity of this paper. The authors acknowledge support from NSF DMS-2027248, NSF DMS-2208356, NSF CCF-1934568, NIH R01HL16351, and DE-SC0023490.

Appendix A. Appendix: Proof of Results in Section 2

A.1 Proof of Lemma 2.1

$$W_2^2(\mu, \nu) = \int \|x - y\|^2 \, d\gamma^*(x, y) = \int \|((x - m_\mu) - (y - m_\nu)) + (m_\mu - m_\nu)\|^2 \, d\gamma^*(x, y)$$
$$= \int \|(x - m_\mu) - (y - m_\nu)\|^2 \, d\gamma^*(x, y) + \|m_\mu - m_\nu\|^2 \geq W_2^2(\bar\mu, \bar\nu) + \|m_\mu - m_\nu\|^2$$
$$= \int \|x - y\|^2 \, d(\bar\gamma)^*(x, y) + \|m_\mu - m_\nu\|^2 = \int \|(x + m_\mu) - (y + m_\nu)\|^2 \, d(\bar\gamma)^*(x, y) \geq W_2^2(\mu, \nu),$$

where $\gamma^*$ and $(\bar\gamma)^*$ denote the optimal transport plans for $(\mu, \nu)$ and $(\bar\mu, \bar\nu)$, respectively. The first inequality results from the fact that $\bar\gamma(x, y) := \gamma^*(x - m_\mu, y - m_\nu) \in Q(\bar\mu, \bar\nu)$, the second inequality from $\gamma(x, y) := (\bar\gamma)^*(x + m_\mu, y + m_\nu) \in Q(\mu, \nu)$, and the equalities from direct expansion.

A.2 Proof of Lemma 2.3

Proof Existence and uniqueness follow directly from Theorem 2.1. For the equivalent multi-marginal coupling problem, there exists an optimal solution $\gamma^* = \mathcal{L}(\{X_z^*\}_z)$. It follows from Remark 2.3 that $\bar X = T(\{X_z^*\}_z)$, where $\mathcal{L}(\bar X)$ is the Wasserstein barycenter. Therefore, the Gaussianity of the barycenter results from the linearity of $T$ in the finite $|Z|$ case, and from the fact that the set of Gaussian distributions is closed in $(\mathcal{P}_{2,ac}, W_2)$ when $|Z|$ is infinite.
The characterization equation is proved in the case of finite $|Z|$ in [2]. For infinite $|Z|$, the equation still holds due to the continuity of the covariance function on $(\mathcal{P}_{2,ac}, W_2)$. The sufficiency and necessity of the equation follow from the following characterization of the barycenter via Brenier's maps $\{T^{\bar X \to X_z}\}_z$ derived in [2]:

$$\int_Z T^{\bar X \to X_z} \, d\lambda(z) = \mathrm{Id}. \tag{96}$$

It follows from the explicit form of $\{T^{\bar X \to X_z}\}_z$ in Lemma 2.2 that

$$\int_Z T^{\bar X \to X_z} \, d\lambda(z) = \int_Z \Sigma_{\bar X}^{-\frac12} \big(\Sigma_{\bar X}^{\frac12} \Sigma_{X_z} \Sigma_{\bar X}^{\frac12}\big)^{\frac12} \Sigma_{\bar X}^{-\frac12} \, d\lambda(z) = \mathrm{Id} \iff \int_Z \big(\Sigma_{\bar X}^{\frac12} \Sigma_{X_z} \Sigma_{\bar X}^{\frac12}\big)^{\frac12} \, d\lambda(z) = \Sigma_{\bar X}.$$

Appendix B. Appendix: Proof of Results in Section 4

B.1 Proof of Lemma 4.1

Proof First, it follows from the triangle inequality that $W_2(\mu_0, \mu_1) \leq W_2(\mu_0, \mu_s) + W_2(\mu_s, \mu_t) + W_2(\mu_t, \mu_1)$ for any $s, t \in [0, 1]$ with $s \leq t$. On the other hand, it follows from the definition of $\mu_t$ that, for $s, t \in [0, 1]$,

$$W_2^2(\mu_s, \mu_t) \leq \int_{(\mathbb{R}^d)^2} \|\pi_s(x, y) - \pi_t(x, y)\|^2 \, d\gamma(x, y) = \int_{(\mathbb{R}^d)^2} \|(1 - s)x + sy - (1 - t)x - ty\|^2 \, d\gamma(x, y)$$
$$= \int_{(\mathbb{R}^d)^2} \|(t - s)x - (t - s)y\|^2 \, d\gamma(x, y) = |t - s|^2 \int_{(\mathbb{R}^d)^2} \|x - y\|^2 \, d\gamma(x, y) = |t - s|^2 \, W_2^2(\mu_0, \mu_1),$$

where the first inequality results from the definition of $W_2$ and the fact that $(\pi_s, \pi_t)_\# \gamma$ is a coupling of $(\mu_s, \mu_t)$. Given the above two facts, we complete the proof by contradiction. Assume there exist $s \leq t$ in $[0, 1]$ such that $W_2(\mu_s, \mu_t) < |t - s| \, W_2(\mu_0, \mu_1)$; then

$$W_2(\mu_0, \mu_1) \leq W_2(\mu_0, \mu_s) + W_2(\mu_s, \mu_t) + W_2(\mu_t, \mu_1) < s \, W_2(\mu_0, \mu_1) + (t - s) \, W_2(\mu_0, \mu_1) + (1 - t) \, W_2(\mu_0, \mu_1) = W_2(\mu_0, \mu_1),$$

a contradiction.

B.2 Proof of Theorem 4.1

Proof First, we derive the inequality from the triangle inequality and the optimality of $\{T(\cdot, z)\}_z$. Let $f : X \times Z \to Y$ be an arbitrary measurable function, and let $\bar f(X, Z)_z$ denote the optimal (Brenier) pushforward of $f(X, Z)_z$ onto the Wasserstein barycenter of $\{\mathcal{L}(f(X, Z)_z)\}_z$. It follows that

$$\Big(\int_Z \|E(Y|X, Z)_z - \bar f(X, Z)_z\|_2^2 \, d\lambda(z)\Big)^{\frac12} \leq L(f(X, Z)) + \Big(\int_Z \|f(X, Z)_z - \bar f(X, Z)_z\|_2^2 \, d\lambda(z)\Big)^{\frac12}$$
$$= L(f(X, Z)) + \Big(\int_Z W_2^2\big(\mathcal{L}(f(X, Z)_z), \mathcal{L}(\bar f(X, Z)_z)\big) \, d\lambda(z)\Big)^{\frac12}$$
$$= L(f(X, Z)) + \Big(\tfrac12 \int_{Z^2} W_2^2\big(\mathcal{L}(f(X, Z)_{z_1}), \mathcal{L}(f(X, Z)_{z_2})\big) \, d\lambda(z_1) \, d\lambda(z_2)\Big)^{\frac12} = L(f(X, Z)) + \tfrac{1}{\sqrt 2} D(f(X, Z)).$$
Here, the penultimate equation results from the fact that, for any $\{\nu_z\}_z \subset \mathcal{P}_{2,ac}(\mathbb{R}^d)$,

$$\int_{Z^2} W_2^2(\nu_{z_1}, \nu_{z_2}) \, d\lambda(z_1) \, d\lambda(z_2) = 2 \int_Z W_2^2(\nu_z, \bar\nu) \, d\lambda(z), \tag{97}$$

where $\bar\nu$ is the Wasserstein barycenter of $\{\nu_z\}_z$. Now, we show that the lower bound is achieved if and only if $f(X, Z) = T(t)(E(Y|X, Z), Z)$ for some $t \in [0, 1]$. Let $t \in [0, 1]$, $T_z := T(\cdot, z)$, and $\mu_z := \mathcal{L}(E(Y|X, Z)_z)$, and write $V := (\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z))^{1/2}$. It follows from Lemma 4.1 and Remark 4.1 that

$$\Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} \leq \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12}$$
$$\leq \Big(t^2 \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} + \Big((1 - t)^2 \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = t V + (1 - t) V = V.$$

Therefore, both inequalities are equalities, where the first term is $L(T(t))$:

$$L(T(t)) = \Big(\int_Z \|E(Y|X, Z)_z - T_z(t)(E(Y|X, Z)_z)\|_2^2 \, d\lambda(z)\Big)^{\frac12} = \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} = t V.$$

For the second term, we claim that it equals $\tfrac{1}{\sqrt 2} D(T(t))$. To see this, we need to first show that the barycenter of $\{T_z(t)_\# \mu_z\}_z$, denoted $\overline{T(t)_\# \mu}$, equals $\bar\mu$. Indeed, if not, then $\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)$ is strictly less than $\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)$ by the definition and uniqueness of the barycenter. It follows that

$$\Big(\int_Z W_2^2(\mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12} \leq \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12}$$
$$< L(T(t)) + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12},$$

which contradicts the definition and uniqueness of $\bar\mu$. Therefore,

$$D(T(t)) = \Big(\int_{Z^2} W_2^2(T_{z_1}(t)_\# \mu_{z_1}, T_{z_2}(t)_\# \mu_{z_2}) \, d\lambda(z_1) \, d\lambda(z_2)\Big)^{\frac12} = \sqrt 2 \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12}$$
$$= \sqrt 2 \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \sqrt 2 \, (1 - t) \Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \sqrt 2 \, (1 - t) V.$$

That completes the proof.

Appendix C. Appendix: Proof of Results in Section 5

C.1 Proof of $\tilde X \perp Z$ implies $E(Y_z | \tilde X) = E(Y | \tilde X, Z)_z$

Proof Let $\tilde X \perp Z$ and assume for contradiction that $E(Y_z | \tilde X) \neq E(Y | \tilde X, Z)_z$.
Then, we have

$$\|Y - E(Y | \tilde X, Z)\|_2^2 = \int_Z \|Y_z - f^*(\tilde X, Z)_z\|_2^2 \, d\lambda = \int_Z \|Y_z - f^*(\tilde X, z)\|_2^2 \, d\lambda > \int_Z \|Y_z - E(Y_z | \tilde X)\|_2^2 \, d\lambda = \int_Z \|Y_z - f_z(\tilde X)\|_2^2 \, d\lambda,$$

where the first line follows from disintegration and the fact that there exists a measurable function $f^* : X \times Z \to Y$ such that $f^*(\tilde X, Z) = E(Y | \tilde X, Z)$, the second from $\tilde X \perp Z$, the third line follows from the orthogonal projection property of the conditional expectation and the assumption, and the fourth from the fact that there exists a measurable function $f_z : X \to Y$ such that $f_z(\tilde X) = E(Y_z | \tilde X)$. Now, define $f : X \times Z \to Y$ by $f(\cdot, z) := f_z$ for $\lambda$-a.e. $z \in Z$. It follows that

$$\|Y - E(Y | \tilde X, Z)\|_2^2 > \int_Z \|Y_z - f_z(\tilde X)\|_2^2 \, d\lambda = \int_Z \|Y_z - f(\tilde X, z)\|_2^2 \, d\lambda = \|Y - f(\tilde X, Z)\|_2^2 = \|Y - E(Y | \tilde X, Z)\|_2^2 + \|E(Y | \tilde X, Z) - f(\tilde X, Z)\|_2^2.$$

That implies $\|E(Y | \tilde X, Z) - f(\tilde X, Z)\|_2^2 < 0$, a contradiction. This completes the proof.

C.2 Proof of Lemma 5.1

Proof Let $\tilde X, \tilde X' \in \{\tilde X \in D_X : \tilde X \perp Z\}$ satisfy $\sigma(\tilde X') \subsetneq \sigma(\tilde X)$. We claim that

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 < 0.$$

Notice that

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 = \|E(Y|X, Z) - E(Y | \tilde X, Z)\|_2^2 + \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda,$$

where $\mu_z := \mathcal{L}(E(Y | \tilde X, Z)_z)$ and $\bar\mu := \mathcal{L}(E(\tilde Y | \tilde X, Z))$. Also, defining $\mu'_z$ and $\bar\mu'$ analogously, we have

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 = \|E(Y|X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda$$
$$= \|E(Y|X, Z) - E(Y | \tilde X, Z)\|_2^2 + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda.$$

Combining the above, we have

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 = \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda - \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda - \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2.$$

It remains to show that $\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda < \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2$. Indeed, assume for contradiction that $\int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 \leq \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda$; then we have

$$\int_Z W_2^2(\mu_z, \bar\mu') \, d\lambda \leq \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda \leq \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda.$$
This contradicts the optimality and uniqueness of $\bar\mu$ by Lemma 3.1. Therefore, we have proved by contradiction that $\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda < \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2$ and, hence, $\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 < 0$. That completes the proof.

C.3 Proof of Lemma 5.2

Proof We first prove $\sigma((\bar X, Z)) = \sigma((X, Z))$. Since $\mathcal{L}(X_z) \in \mathcal{P}_{2,ac}$, it follows from Lemma 3.1 that there exists a measurable map $T : X \times Z \to X$ such that $T(X_z, z) = \bar X_z$ $\lambda$-a.e., where $\bar X$ denotes the Wasserstein barycenter of $\{X_z\}_z$. Defining $T \otimes \mathrm{Id}|_Z : X \times Z \to X \times Z$, we have that $T \otimes \mathrm{Id}|_Z$ is $\mathcal{X} \otimes \mathcal{Z} / \mathcal{X} \otimes \mathcal{Z}$-measurable and satisfies $T \otimes \mathrm{Id}|_Z((X, Z)) = (\bar X, Z)$. That implies $\sigma((\bar X, Z)) \subset \sigma((X, Z))$. Furthermore, since $\mathcal{L}(\bar X) \in \mathcal{P}_{2,ac}$, it follows from Brenier's theorem [11] that there exists $T^{-1}(\cdot, z)$ such that $T^{-1}(\bar X_z, z) = X_z$. Therefore, $(T \otimes \mathrm{Id}|_Z)^{-1} = T^{-1} \otimes \mathrm{Id}|_Z$ is $\mathcal{X} \otimes \mathcal{Z} / \mathcal{X} \otimes \mathcal{Z}$-measurable and satisfies $(T \otimes \mathrm{Id}|_Z)^{-1}((\bar X, Z)) = (X, Z)$. That implies $\sigma((X, Z)) \subset \sigma((\bar X, Z))$, which completes the proof of $\sigma((\bar X, Z)) = \sigma((X, Z))$.

Now, we show $\sigma(\tilde X) \subset \sigma(\bar X)$. From the construction of $\tilde X$, we have $\sigma((\tilde X, Z)) \subset \sigma((X, Z)) = \sigma((\bar X, Z))$. But $\tilde X \perp Z$ implies that, for any $B_{\tilde X} \in \mathcal{B}_X$, we can construct $B_{\tilde X Z} := B_{\tilde X} \times Z \in \mathcal{B}_X \otimes \mathcal{B}_Z$. In addition, due to $\sigma((\tilde X, Z)) \subset \sigma((\bar X, Z))$, there exists $B'_{XZ} \in \mathcal{B}_X \otimes \mathcal{B}_Z$ such that $(\bar X, Z)^{-1}(B'_{XZ}) = (\tilde X, Z)^{-1}(B_{\tilde X Z})$. Lastly, $\bar X \perp Z$ also implies that there exists $B'_X \in \mathcal{B}_X$ satisfying $B'_{XZ} = B'_X \times Z$. It follows that

$$\tilde X^{-1}(B_{\tilde X}) = (\tilde X, Z)^{-1}(B_{\tilde X Z}) = (\bar X, Z)^{-1}(B'_{XZ}) = \bar X^{-1}(B'_X). \tag{98}$$

Since our choice of $B_{\tilde X} \in \mathcal{B}_X$ is arbitrary, it follows that $\sigma(\tilde X) \subset \sigma(\bar X)$. Finally, since our choice of $\tilde X \in \{\tilde X \in D_X : \tilde X \perp Z\}$ is arbitrary, we are done.

References

[1] B. L. Adamson. Ricci v. DeStefano: Procedural Activism(?). National Black Law Journal (University of California, Los Angeles), 24:11-01, 2011.
[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.
[3] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179-203, 2022.
[4] P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744-762, 2016.
[5] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine Bias. In Ethics of Data and Analytics, pages 254-264. Auerbach Publications, 2022.
[6] Aristotle et al. The Complete Works of Aristotle, volume 2. Princeton University Press, Princeton, 1984.
[7] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[8] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
[9] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2009.
[10] A. W. Blumrosen. Strangers in paradise: Griggs v. Duke Power Co. and the concept of employment discrimination. Michigan Law Review, 71:59, 1972.
[11] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375-417, 1991.
[12] T. Calders and I. Žliobaitė. Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and Privacy in the Information Society, pages 43-57. Springer, 2013.
[13] F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. Advances in Neural Information Processing Systems, 30, 2017.
[14] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pages 463-480. IEEE, 2015.
[15] G. Carlier and I. Ekeland. Matching for teams. Economic Theory, 42:397-418, 2010.
[16] A. Chouldechova and A. Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018.
[17] B. Christian. The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, 2020.
[18] E. Chzhen, C. Denis, M. Hebiri, L. Oneto, and M. Pontil. Fair regression with Wasserstein barycenters. Advances in Neural Information Processing Systems, 33:7321-7331, 2020.
[19] J. M. Cooper, D. S. Hutchinson, et al. Plato: Complete Works. Hackett Publishing, 1997.
[20] J. A. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Díaz. On lower bounds for the L2-Wasserstein metric in a Hilbert space. Journal of Theoretical Probability, 9(2):263-283, 1996.
[21] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.
[22] I. Ekeland. Existence, uniqueness and efficiency of equilibrium in hedonic markets with multidimensional types. Economic Theory, 42:275-315, 2010.
[23] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259-268, 2015.
[24] T. Le Gouic, J.-M. Loubes, and P. Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.
[25] S. Hajian and J. Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445-1459, 2012.
[26] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.
[27] L. Hu and Y. Chen. A short-term intervention for long-term fairness in the labor market. In Proceedings of the 2018 World Wide Web Conference, pages 1389-1398, 2018.
[28] R. Jiang, A. Pacchiano, T. Stepleton, H. Jiang, and S. Chiappa. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pages 862-872. PMLR, 2020.
[29] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.
[30] Y.-H. Kim and B. Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640-683, 2017.
[31] T. Le Gouic and J.-M. Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168:901-917, 2017.
[32] R. J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153-179, 1997.
[33] Executive Office of the President. Big data: Seizing opportunities, preserving values. 2014.
[34] B. Pass. Optimal transportation with infinitely many marginals. Journal of Functional Analysis, 264(4):947-963, 2013.
[35] M. Redmond and A. Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660-678, 2002.
[36] F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, New York, 2015.
[37] S. Chiappa, R. Jiang, T. Stepleton, A. Pacchiano, H. Jiang, and J. Aslanides. A general approach to fairness with optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(04), pages 3633-3640, 2020.
[38] L. Sweeney. Discrimination in online ad delivery: Google ads, black names and white names, racial discrimination, and click advertising. Queue, 11(3):10-29, 2013.
[39] E. G. Tabak and G. Trigila. Explanation of variability and removal of confounding factors from data through optimal transport. Communications on Pure and Applied Mathematics, 71(1):163-199, 2018.
[40] C. Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021.
[41] C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
[42] L. F. Wightman. LSAC National Longitudinal Bar Passage Study.
LSAC Research Report Series, 1998.
[43] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180, 2017.
[44] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325-333. PMLR, 2013.
[45] N. Zhou, Z. Zhang, V. N. Nair, H. Singhal, J. Chen, and A. Sudjianto. Bias, Fairness, and Accountability with AI and ML Algorithms. arXiv preprint arXiv:2105.06558, 2021.
Software Dependencies | No | The paper mentions software components like "logistic regression", "random forest", "linear regression", and "artificial neural networks (ANN)" but does not provide specific version numbers for these or for other libraries/frameworks used.
Experiment Setup | Yes | The two supervised learning methods we use are linear regression and artificial neural networks (an ANN with 4 linearly stacked layers, where each of the first three layers has 32 units with ReLU activation and the last has 1 unit with linear activation).
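The quoted architecture is fully specified except for initialization and training details, which the excerpt does not give. A minimal numpy forward-pass sketch of that network (our illustration; the He-style initialization and function names are assumptions, not the paper's code):

```python
import numpy as np

def init_ann(d_in, hidden=(32, 32, 32), seed=0):
    """Weights for the described ANN: three 32-unit ReLU layers, 1 linear unit."""
    rng = np.random.default_rng(seed)
    dims = (d_in, *hidden, 1)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def ann_forward(params, x):
    """Forward pass: ReLU on the first three layers, linear on the last."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU activation
    return h
```

Any standard regression loss and optimizer could then be attached; those choices are not stated in the quoted setup.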