Fair Data Representation for Machine Learning at the Pareto Frontier

Authors: Shizhou Xu, Thomas Strohmer

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.
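The third claim quoted above, that the optimal affine maps remain cheap even for high-dimensional data, rests on the closed-form linear part of the W2-optimal map between Gaussians. A minimal numpy sketch (our illustration under that Gaussian assumption, not the authors' released code; `sqrtm_psd` and `gaussian_ot_matrix` are hypothetical helper names):

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_ot_matrix(S1, S2):
    """Linear part A of the W2-optimal affine map between N(m1, S1) and
    N(m2, S2): T(x) = m2 + A (x - m1), with
    A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2} (symmetric PSD)."""
    R = sqrtm_psd(S1)
    R_inv = np.linalg.inv(R)
    return R_inv @ sqrtm_psd(R @ S2 @ R) @ R_inv

# Example: moments of two hypothetical sensitive groups.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, 0.2], [0.2, 3.0]])
A = gaussian_ot_matrix(S1, S2)
# Pushforward covariance A S1 A equals the target covariance S2.
```

The cost is dominated by a few eigendecompositions of d-by-d matrices, which is what makes the affine construction tractable in high dimension.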
Researcher Affiliation | Academia | Shizhou Xu (EMAIL), Department of Mathematics, University of California, Davis, Davis, CA 95616-5270, USA; Thomas Strohmer (EMAIL), Department of Mathematics, Center of Data Science and Artificial Intelligence Research, University of California, Davis, Davis, CA 95616-5270, USA
Pseudocode | Yes | Algorithm 1: Pseudo-Barycenter Geodesics for Independent Variable; Algorithm 2: Dependent (or Post-processing) Pseudo-Barycenter Geodesics
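The geodesic idea behind both algorithms can be illustrated in one dimension, where the Gaussian barycenter moments are simply the weighted averages of the group means and standard deviations. A simplified one-dimensional sketch (our reading of the pseudo-barycenter geodesic, not the paper's full multivariate algorithm; the function name is hypothetical): each group is moved a fraction t of the way along the W2 geodesic toward the barycenter, so t trades accuracy (t = 0, original data) against statistical parity (t = 1, fully fair representation).

```python
import numpy as np

def pseudo_barycenter_geodesic_1d(x, z, t=1.0):
    """Move each sensitive group's values a fraction t of the way along the
    W2 geodesic toward the (1-D, Gaussian-moment) barycenter of the groups."""
    groups = np.unique(z)
    w = np.array([(z == g).mean() for g in groups])   # group weights
    m = np.array([x[z == g].mean() for g in groups])  # group means
    s = np.array([x[z == g].std() for g in groups])   # group stds
    m_bar, s_bar = w @ m, w @ s  # 1-D barycenter: averaged moments
    out = x.astype(float).copy()
    for g, mg, sg in zip(groups, m, s):
        T = m_bar + (s_bar / sg) * (x[z == g] - mg)   # group -> barycenter map
        out[z == g] = (1 - t) * x[z == g] + t * T     # McCann interpolation
    return out
```

At t = 1 every group ends up with the same first two moments, which is the parity end of the Pareto frontier in this toy setting.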
Open Source Code | Yes | The code for the results of our experiments is available online at: github.com/xushizhou/fair_data_representation
Open Datasets | Yes | We adopt the following metrics of accuracy and discrimination that are frequently used in fair machine learning experiments on various data sets: (1) For fair classification, prediction accuracy and statistical disparity are quantified respectively by AUC (area under the Receiver Operating Characteristic curve) and $\mathrm{Discrimination} = \max_{z, z' \in Z} |P(\hat Y_z = 1) - P(\hat Y_{z'} = 1)|$, as defined in [13]. (2) For univariate supervised learning, prediction error and statistical disparity are quantified respectively by MSE (mean squared error, equivalent to the squared L2 norm on the sample probability space) and the KS (Kolmogorov-Smirnov) distance, as in [18], so that readers can compare the proposed methods indirectly with other methods tested in [13, 18, 44] and their references. (3) For univariate and multivariate supervised learning, prediction error and statistical disparity are quantified respectively by the L2 and W2 (Wasserstein) distances, the quantifications the current work adopts to prove the Pareto frontier in the sections above. In addition, we perform tests on four benchmark data sets: CRIME, LSAC, Adult, and COMPAS, which are also frequently used in fair learning experiments.
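The two disparity metrics quoted above are straightforward to compute from predictions and group labels. A minimal numpy sketch (ours, under the usual empirical-CDF definitions; function names are hypothetical):

```python
import numpy as np

def discrimination(y_hat, z):
    """max over group pairs of |P(Yhat = 1 | Z = z) - P(Yhat = 1 | Z = z')|."""
    rates = [y_hat[z == g].mean() for g in np.unique(z)]
    return max(rates) - min(rates)

def ks_distance(a, b):
    """Kolmogorov-Smirnov distance between the empirical CDFs of samples a, b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf = lambda s: np.searchsorted(np.sort(s), grid, side="right") / len(s)
    return np.abs(cdf(a) - cdf(b)).max()
```

For KS between group-wise prediction distributions, pass the predictions of each group as `a` and `b`.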
Dataset Splits | Yes | For all the test results, we apply 5-fold cross-validation with a 50% training and 50% testing split, except for a 90% training and 10% testing split in the linear regression test on LSAC due to the high computational cost of the post-processing Wasserstein barycenter method [18].
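We read "5-fold cross-validation with 50% training and 50% testing split" as five repetitions of a random holdout split (standard k-fold would imply an 80/20 split at k = 5). A hypothetical sketch of that protocol in numpy, with `train_frac=0.9` covering the LSAC linear regression case:

```python
import numpy as np

def repeated_holdout_splits(n, n_splits=5, train_frac=0.5, seed=0):
    """Index pairs for n_splits repeated random train/test holdout splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * n))
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n)  # fresh shuffle per repetition
        splits.append((perm[:n_train], perm[n_train:]))
    return splits
```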
Hardware Specification | Yes | As shown in the table above, the computational cost of the pseudo-barycenter method is significantly lower than the cost of the known post-processing methods: on average 7836 times faster on LSAC and 21 times faster on CRIME in a single train-test cycle for a single supervised learning model. Furthermore, in model selection or composition, the pre-processing time is a fixed one-time cost while the post-processing time is additive. (See point 4 below for a more detailed explanation.) Now, we show the major advantages of the proposed method compared to the post-processing ones, such as [18, 28, 24]:

Acknowledgement

The authors want to thank the referees, whose profound and detailed feedback greatly enhanced the quality and clarity of this paper. The authors acknowledge support from NSF DMS-2027248, NSF DMS-2208356, NSF CCF-1934568, NIH R01HL16351, and DE-SC0023490.

Appendix A. Appendix: Proof of Results in Section 2

A.1 Proof of Lemma 2.1

$$W_2^2(\mu, \nu) = \int \|x - y\|^2 \, d\gamma^*(x, y) = \int \|((x - m_\mu) - (y - m_\nu)) + (m_\mu - m_\nu)\|^2 \, d\gamma^*(x, y)$$
$$= \int \|(x - m_\mu) - (y - m_\nu)\|^2 \, d\gamma^*(x, y) + \|m_\mu - m_\nu\|^2 \geq W_2^2(\bar\mu, \bar\nu) + \|m_\mu - m_\nu\|^2$$
$$= \int \|x - y\|^2 \, d(\bar\gamma)^*(x, y) + \|m_\mu - m_\nu\|^2 = \int \|(x + m_\mu) - (y + m_\nu)\|^2 \, d(\bar\gamma)^*(x, y) \geq W_2^2(\mu, \nu),$$

where $\gamma^*$ and $(\bar\gamma)^*$ denote the optimal transport plans for $(\mu, \nu)$ and $(\bar\mu, \bar\nu)$, respectively. The first inequality results from the fact that $\bar\gamma(x, y) := \gamma^*(x - m_\mu, y - m_\nu) \in Q(\bar\mu, \bar\nu)$, the second inequality from $\gamma(x, y) := (\bar\gamma)^*(x + m_\mu, y + m_\nu) \in Q(\mu, \nu)$, and the equalities from direct expansion.

A.2 Proof of Lemma 2.3

Proof Existence and uniqueness follow directly from Theorem 2.1. For the equivalent multi-marginal coupling problem, there exists an optimal solution $\gamma^* = \mathcal{L}(\{X_z^*\}_z)$. It follows from Remark 2.3 that $\bar X = T(\{X_z^*\}_z)$, where $\mathcal{L}(\bar X)$ is the Wasserstein barycenter. Therefore, the Gaussianity of the barycenter results from the linearity of $T$ in the finite $|Z|$ case, and from the fact that the set of Gaussian distributions is closed in $(\mathcal{P}_{2,ac}, W_2)$ when $|Z|$ is infinite.
The characterization equation is proved in the case of finite $|Z|$ in [2]. For infinite $|Z|$, the equation still holds due to the continuity of the covariance function on $(\mathcal{P}_{2,ac}, W_2)$. The sufficiency and necessity of the equation follow from the following characterization of the barycenter via Brenier's maps $\{T^{\bar X \to X_z}\}_z$ derived in [2]:

$$\int_Z T^{\bar X \to X_z} \, d\lambda(z) = \mathrm{Id}. \tag{96}$$

It follows from the explicit form of $\{T^{\bar X \to X_z}\}_z$ in Lemma 2.2 that

$$\int_Z T^{\bar X \to X_z} \, d\lambda(z) = \int_Z \Sigma_{\bar X}^{-\frac12} \big(\Sigma_{\bar X}^{\frac12} \Sigma_{X_z} \Sigma_{\bar X}^{\frac12}\big)^{\frac12} \Sigma_{\bar X}^{-\frac12} \, d\lambda(z) = \mathrm{Id} \iff \int_Z \big(\Sigma_{\bar X}^{\frac12} \Sigma_{X_z} \Sigma_{\bar X}^{\frac12}\big)^{\frac12} \, d\lambda(z) = \Sigma_{\bar X}.$$

Appendix B. Appendix: Proof of Results in Section 4

B.1 Proof of Lemma 4.1

Proof First, it follows from the triangle inequality that $W_2(\mu_0, \mu_1) \leq W_2(\mu_0, \mu_s) + W_2(\mu_s, \mu_t) + W_2(\mu_t, \mu_1)$ for any $s, t \in [0, 1]$ with $s \leq t$. On the other hand, it follows from the definition of $\mu_t$ that, for $s, t \in [0, 1]$,

$$W_2^2(\mu_s, \mu_t) \leq \int_{(\mathbb{R}^d)^2} \|\pi_s(x, y) - \pi_t(x, y)\|^2 \, d\gamma(x, y) = \int_{(\mathbb{R}^d)^2} \|(1 - s)x + sy - (1 - t)x - ty\|^2 \, d\gamma(x, y)$$
$$= \int_{(\mathbb{R}^d)^2} \|(t - s)x - (t - s)y\|^2 \, d\gamma(x, y) = |t - s|^2 \int_{(\mathbb{R}^d)^2} \|x - y\|^2 \, d\gamma(x, y) = |t - s|^2 \, W_2^2(\mu_0, \mu_1),$$

where the first inequality results from the definition of $W_2$ and the fact that $(\pi_s, \pi_t)_\# \gamma$ is a coupling of $(\mu_s, \mu_t)$. Given the above two facts, we complete the proof by contradiction. Assume there exist $s \leq t$ in $[0, 1]$ such that $W_2(\mu_s, \mu_t) < |t - s| \, W_2(\mu_0, \mu_1)$; then

$$W_2(\mu_0, \mu_1) \leq W_2(\mu_0, \mu_s) + W_2(\mu_s, \mu_t) + W_2(\mu_t, \mu_1) < s \, W_2(\mu_0, \mu_1) + (t - s) \, W_2(\mu_0, \mu_1) + (1 - t) \, W_2(\mu_0, \mu_1) = W_2(\mu_0, \mu_1),$$

a contradiction.

B.2 Proof of Theorem 4.1

Proof First, we derive the inequality from the triangle inequality and the optimality of $\{T(\cdot, z)\}_z$. Let $f : X \times Z \to Y$ be an arbitrary measurable function, and let $\bar f(X, Z)_z$ denote the optimal (Brenier) pushforward of $f(X, Z)_z$ onto the Wasserstein barycenter of $\{\mathcal{L}(f(X, Z)_z)\}_z$. It follows that

$$\Big(\int_Z \|E(Y|X, Z)_z - \bar f(X, Z)_z\|_2^2 \, d\lambda(z)\Big)^{\frac12} \leq L(f(X, Z)) + \Big(\int_Z \|f(X, Z)_z - \bar f(X, Z)_z\|_2^2 \, d\lambda(z)\Big)^{\frac12}$$
$$= L(f(X, Z)) + \Big(\int_Z W_2^2\big(\mathcal{L}(f(X, Z)_z), \mathcal{L}(\bar f(X, Z)_z)\big) \, d\lambda(z)\Big)^{\frac12}$$
$$= L(f(X, Z)) + \Big(\tfrac12 \int_{Z^2} W_2^2\big(\mathcal{L}(f(X, Z)_{z_1}), \mathcal{L}(f(X, Z)_{z_2})\big) \, d\lambda(z_1) \, d\lambda(z_2)\Big)^{\frac12} = L(f(X, Z)) + \tfrac{1}{\sqrt 2} D(f(X, Z)).$$
Here, the penultimate equation results from the fact that, for any $\{\nu_z\}_z \subset \mathcal{P}_{2,ac}(\mathbb{R}^d)$,

$$\int_{Z^2} W_2^2(\nu_{z_1}, \nu_{z_2}) \, d\lambda(z_1) \, d\lambda(z_2) = 2 \int_Z W_2^2(\nu_z, \bar\nu) \, d\lambda(z), \tag{97}$$

where $\bar\nu$ is the Wasserstein barycenter of $\{\nu_z\}_z$. Now, we show that the lower bound is achieved if and only if $f(X, Z) = T(t)(E(Y|X, Z), Z)$ for some $t \in [0, 1]$. Let $t \in [0, 1]$, $T_z := T(\cdot, z)$, and $\mu_z := \mathcal{L}(E(Y|X, Z)_z)$, and write $V := (\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z))^{1/2}$. It follows from Lemma 4.1 and Remark 4.1 that

$$\Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} \leq \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12}$$
$$\leq \Big(t^2 \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} + \Big((1 - t)^2 \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = t V + (1 - t) V = V.$$

Therefore, both inequalities are equalities, where the first term is $L(T(t))$:

$$L(T(t)) = \Big(\int_Z \|E(Y|X, Z)_z - T_z(t)(E(Y|X, Z)_z)\|_2^2 \, d\lambda(z)\Big)^{\frac12} = \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} = t V.$$

For the second term, we claim that it equals $\tfrac{1}{\sqrt 2} D(T(t))$. To see this, we need to first show that the barycenter of $\{T_z(t)_\# \mu_z\}_z$, denoted $\overline{T(t)_\# \mu}$, equals $\bar\mu$. Indeed, if not, then $\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)$ is strictly less than $\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)$ by the definition and uniqueness of the barycenter. It follows that

$$\Big(\int_Z W_2^2(\mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12} \leq \Big(\int_Z W_2^2(\mu_z, T_z(t)_\# \mu_z) \, d\lambda(z)\Big)^{\frac12} + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12}$$
$$< L(T(t)) + \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12},$$

which contradicts the definition and uniqueness of $\bar\mu$. Therefore,

$$D(T(t)) = \Big(\int_{Z^2} W_2^2(T_{z_1}(t)_\# \mu_{z_1}, T_{z_2}(t)_\# \mu_{z_2}) \, d\lambda(z_1) \, d\lambda(z_2)\Big)^{\frac12} = \sqrt 2 \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \overline{T(t)_\# \mu}) \, d\lambda(z)\Big)^{\frac12}$$
$$= \sqrt 2 \Big(\int_Z W_2^2(T_z(t)_\# \mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \sqrt 2 \, (1 - t) \Big(\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda(z)\Big)^{\frac12} = \sqrt 2 \, (1 - t) V.$$

That completes the proof.

Appendix C. Appendix: Proof of Results in Section 5

C.1 Proof of $\tilde X \perp Z$ implies $E(Y_z | \tilde X) = E(Y | \tilde X, Z)_z$

Proof Let $\tilde X \perp Z$ and assume for contradiction that $E(Y_z | \tilde X) \neq E(Y | \tilde X, Z)_z$.
Then, we have

$$\|Y - E(Y | \tilde X, Z)\|_2^2 = \int_Z \|Y_z - f^*(\tilde X, Z)_z\|_2^2 \, d\lambda = \int_Z \|Y_z - f^*(\tilde X, z)\|_2^2 \, d\lambda > \int_Z \|Y_z - E(Y_z | \tilde X)\|_2^2 \, d\lambda = \int_Z \|Y_z - f_z(\tilde X)\|_2^2 \, d\lambda,$$

where the first line follows from disintegration and the fact that there exists a measurable function $f^* : X \times Z \to Y$ such that $f^*(\tilde X, Z) = E(Y | \tilde X, Z)$, the second from $\tilde X \perp Z$, the third line follows from the orthogonal projection property of the conditional expectation and the assumption, and the fourth from the fact that there exists a measurable function $f_z : X \to Y$ such that $f_z(\tilde X) = E(Y_z | \tilde X)$. Now, define $f : X \times Z \to Y$ by $f(\cdot, z) := f_z$ for $\lambda$-a.e. $z \in Z$. It follows that

$$\|Y - E(Y | \tilde X, Z)\|_2^2 > \int_Z \|Y_z - f_z(\tilde X)\|_2^2 \, d\lambda = \int_Z \|Y_z - f(\tilde X, z)\|_2^2 \, d\lambda = \|Y - f(\tilde X, Z)\|_2^2 = \|Y - E(Y | \tilde X, Z)\|_2^2 + \|E(Y | \tilde X, Z) - f(\tilde X, Z)\|_2^2.$$

That implies $\|E(Y | \tilde X, Z) - f(\tilde X, Z)\|_2^2 < 0$, a contradiction. This completes the proof.

C.2 Proof of Lemma 5.1

Proof Let $\tilde X, \tilde X' \in \{\tilde X \in D_X : \tilde X \perp Z\}$ satisfy $\sigma(\tilde X') \subsetneq \sigma(\tilde X)$. We claim that

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 < 0.$$

Notice that

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 = \|E(Y|X, Z) - E(Y | \tilde X, Z)\|_2^2 + \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda,$$

where $\mu_z := \mathcal{L}(E(Y | \tilde X, Z)_z)$ and $\bar\mu := \mathcal{L}(E(\tilde Y | \tilde X, Z))$. Also, defining $\mu'_z$ and $\bar\mu'$ analogously, we have

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 = \|E(Y|X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda$$
$$= \|E(Y|X, Z) - E(Y | \tilde X, Z)\|_2^2 + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda.$$

Combining the above, we have

$$\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 = \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda - \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda - \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2.$$

It remains to show that $\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda < \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2$. Indeed, assume for contradiction that $\int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 \leq \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda$; then we have

$$\int_Z W_2^2(\mu_z, \bar\mu') \, d\lambda \leq \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2 + \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda \leq \int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda.$$
This contradicts the optimality and uniqueness of $\bar\mu$ by Lemma 3.1. Therefore, we have proved by contradiction that $\int_Z W_2^2(\mu_z, \bar\mu) \, d\lambda < \int_Z W_2^2(\mu'_z, \bar\mu') \, d\lambda + \|E(Y | \tilde X, Z) - E(Y | \tilde X', Z)\|_2^2$ and, hence, $\|E(Y|X, Z) - E(\tilde Y | \tilde X, Z)\|_2^2 - \|E(Y|X, Z) - E(\tilde Y | \tilde X', Z)\|_2^2 < 0$. That completes the proof.

C.3 Proof of Lemma 5.2

Proof We first prove $\sigma((\bar X, Z)) = \sigma((X, Z))$. Since $\mathcal{L}(X_z) \in \mathcal{P}_{2,ac}$, it follows from Lemma 3.1 that there exists a measurable map $T : X \times Z \to X$ such that $T(X_z, z) = \bar X_z$ $\lambda$-a.e., where $\bar X$ denotes the Wasserstein barycenter of $\{X_z\}_z$. Defining $T \otimes \mathrm{Id}|_Z : X \times Z \to X \times Z$, we have that $T \otimes \mathrm{Id}|_Z$ is $\mathcal{X} \otimes \mathcal{Z} / \mathcal{X} \otimes \mathcal{Z}$-measurable and satisfies $T \otimes \mathrm{Id}|_Z((X, Z)) = (\bar X, Z)$. That implies $\sigma((\bar X, Z)) \subset \sigma((X, Z))$. Furthermore, since $\mathcal{L}(\bar X) \in \mathcal{P}_{2,ac}$, it follows from Brenier's theorem [11] that there exists $T^{-1}(\cdot, z)$ such that $T^{-1}(\bar X_z, z) = X_z$. Therefore, $(T \otimes \mathrm{Id}|_Z)^{-1} = T^{-1} \otimes \mathrm{Id}|_Z$ is $\mathcal{X} \otimes \mathcal{Z} / \mathcal{X} \otimes \mathcal{Z}$-measurable and satisfies $(T \otimes \mathrm{Id}|_Z)^{-1}((\bar X, Z)) = (X, Z)$. That implies $\sigma((X, Z)) \subset \sigma((\bar X, Z))$, which completes the proof of $\sigma((\bar X, Z)) = \sigma((X, Z))$.

Now, we show $\sigma(\tilde X) \subset \sigma(\bar X)$. From the construction of $\tilde X$, we have $\sigma((\tilde X, Z)) \subset \sigma((X, Z)) = \sigma((\bar X, Z))$. But $\tilde X \perp Z$ implies that, for any $B_{\tilde X} \in \mathcal{B}_X$, we can construct $B_{\tilde X Z} := B_{\tilde X} \times Z \in \mathcal{B}_X \otimes \mathcal{B}_Z$. In addition, due to $\sigma((\tilde X, Z)) \subset \sigma((\bar X, Z))$, there exists $B'_{XZ} \in \mathcal{B}_X \otimes \mathcal{B}_Z$ such that $(\bar X, Z)^{-1}(B'_{XZ}) = (\tilde X, Z)^{-1}(B_{\tilde X Z})$. Lastly, $\bar X \perp Z$ also implies that there exists $B'_X \in \mathcal{B}_X$ satisfying $B'_{XZ} = B'_X \times Z$. It follows that

$$\tilde X^{-1}(B_{\tilde X}) = (\tilde X, Z)^{-1}(B_{\tilde X Z}) = (\bar X, Z)^{-1}(B'_{XZ}) = \bar X^{-1}(B'_X). \tag{98}$$

Since our choice of $B_{\tilde X} \in \mathcal{B}_X$ is arbitrary, it follows that $\sigma(\tilde X) \subset \sigma(\bar X)$. Finally, since our choice of $\tilde X \in \{\tilde X \in D_X : \tilde X \perp Z\}$ is arbitrary, we are done.

References

[1] B. L. Adamson. Ricci v. DeStefano: Procedural Activism(?). National Black Law Journal (University of California, Los Angeles), 24:11-01, 2011.
[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.
[3] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179-203, 2022.
[4] P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744-762, 2016.
[5] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine Bias. In Ethics of Data and Analytics, pages 254-264. Auerbach Publications, 2022.
[6] Aristotle et al. The Complete Works of Aristotle, volume 2. Princeton University Press, Princeton, 1984.
[7] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[8] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
[9] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2009.
[10] A. W. Blumrosen. Strangers in paradise: Griggs v. Duke Power Co. and the concept of employment discrimination. Michigan Law Review, 71:59, 1972.
[11] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375-417, 1991.
[12] T. Calders and I. Žliobaitė. Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and Privacy in the Information Society, pages 43-57. Springer, 2013.
[13] F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. Advances in Neural Information Processing Systems, 30, 2017.
[14] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pages 463-480. IEEE, 2015.
[15] G. Carlier and I. Ekeland. Matching for teams. Economic Theory, 42:397-418, 2010.
[16] A. Chouldechova and A. Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018.
[17] B. Christian. The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, 2020.
[18] E. Chzhen, C. Denis, M. Hebiri, L. Oneto, and M. Pontil. Fair regression with Wasserstein barycenters. Advances in Neural Information Processing Systems, 33:7321-7331, 2020.
[19] J. M. Cooper, D. S. Hutchinson, et al. Plato: Complete Works. Hackett Publishing, 1997.
[20] J. A. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Díaz. On lower bounds for the L2-Wasserstein metric in a Hilbert space. Journal of Theoretical Probability, 9(2):263-283, 1996.
[21] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.
[22] I. Ekeland. Existence, uniqueness and efficiency of equilibrium in hedonic markets with multidimensional types. Economic Theory, 42:275-315, 2010.
[23] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259-268, 2015.
[24] T. Le Gouic, J.-M. Loubes, and P. Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.
[25] S. Hajian and J. Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445-1459, 2012.
[26] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.
[27] L. Hu and Y. Chen. A short-term intervention for long-term fairness in the labor market. In Proceedings of the 2018 World Wide Web Conference, pages 1389-1398, 2018.
[28] R. Jiang, A. Pacchiano, T. Stepleton, H. Jiang, and S. Chiappa. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pages 862-872. PMLR, 2020.
[29] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.
[30] Y.-H. Kim and B. Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640-683, 2017.
[31] T. Le Gouic and J.-M. Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168:901-917, 2017.
[32] R. J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153-179, 1997.
[33] Executive Office of the President. Big data: Seizing opportunities, preserving values. 2014.
[34] B. Pass. Optimal transportation with infinitely many marginals. Journal of Functional Analysis, 264(4):947-963, 2013.
[35] M. Redmond and A. Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660-678, 2002.
[36] F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, New York, 2015.
[37] S. Chiappa, R. Jiang, T. Stepleton, A. Pacchiano, H. Jiang, and J. Aslanides. A general approach to fairness with optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(04), pages 3633-3640, 2020.
[38] L. Sweeney. Discrimination in online ad delivery: Google ads, black names and white names, racial discrimination, and click advertising. Queue, 11(3):10-29, 2013.
[39] E. G. Tabak and G. Trigila. Explanation of variability and removal of confounding factors from data through optimal transport. Communications on Pure and Applied Mathematics, 71(1):163-199, 2018.
[40] C. Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021.
[41] C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
[42] L. F. Wightman. LSAC National Longitudinal Bar Passage Study.
LSAC Research Report Series, 1998.
[43] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180, 2017.
[44] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325-333. PMLR, 2013.
[45] N. Zhou, Z. Zhang, V. N. Nair, H. Singhal, J. Chen, and A. Sudjianto. Bias, Fairness, and Accountability with AI and ML Algorithms. arXiv preprint arXiv:2105.06558, 2021.
Software Dependencies | No | The paper mentions software components like "logistic regression", "random forest", "linear regression", and "artificial neural networks (ANN)" but does not provide specific version numbers for these or for other libraries/frameworks used.
Experiment Setup | Yes | The two supervised learning methods we use are linear regression and artificial neural networks (an ANN with 4 linearly stacked layers, where each of the first three layers has 32 units with ReLU activation and the last has 1 unit with linear activation).
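The quoted architecture is fully specified except for initialization and training details, which the excerpt does not give. A minimal numpy forward-pass sketch of that network (our illustration; the He-style initialization and function names are assumptions, not the paper's code):

```python
import numpy as np

def init_ann(d_in, hidden=(32, 32, 32), seed=0):
    """Weights for the described ANN: three 32-unit ReLU layers, 1 linear unit."""
    rng = np.random.default_rng(seed)
    dims = (d_in, *hidden, 1)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def ann_forward(params, x):
    """Forward pass: ReLU on the first three layers, linear on the last."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU activation
    return h
```

Any standard regression loss and optimizer could then be attached; those choices are not stated in the quoted setup.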