Metric-Agnostic Continual Learning for Sustainable Group Fairness

Authors: Heng Lian, Chen Zhao, Zhong Chen, Xingquan Zhu, My T. Thai, Yi He

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical and empirical studies substantiate that MacFRL excels among its GFCL competitors in terms of prediction accuracy and group fairness metrics. Our empirical studies on eight benchmark datasets substantiate that MacFRL outperforms its five state-of-the-art competitors on average by 12.7%, 42.8%, and 28.4% in terms of prediction accuracy, demographic parity, and equalized odds, respectively.
Researcher Affiliation | Academia | 1 School of Computing, Data Sciences & Physics, William & Mary, Williamsburg, VA, USA; 2 School of Engineering and Computer Science, Baylor University, Waco, TX, USA; 3 School of Computing, Southern Illinois University, Carbondale, IL, USA; 4 College of Engineering & Computer Science, Florida Atlantic University, Boca Raton, FL, USA; 5 Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA. EMAIL, chen EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using text and mathematical equations but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code: https://github.com/X1aoLian/MacFRL
Open Datasets | Yes | Eight real-world datasets from various domains set up the benchmark, with their statistics summarized in the table below. We follow (Le Quy et al. 2022) to define the protected features. Details of the studied datasets are deferred to Section 3 of supplementary material.

Table 1: Statistics of the 8 datasets.
No.  Dataset            # Samples  # Features  # Tasks  y|0:1      p|0:1
1    Adult              30010      15          12       75:25      32:68
2    KDD Census-Income  199523     41          9        94:6       52:48
3    Bank marketing     31647      17          12       88:12      40:60
4    Dutch census       42125      12          10       52:48      50:50
5    Diabetes           71236      50          9        54:46      46:54
6    Law School         14298      23          6        5:95       16:84
7    Bias-MNIST         60000      28×28×3     5        10:...:10  68:32
8    CelebA             100000     178×218×3   5        49:51      42:58
Dataset Splits | No | The paper describes the experimental setup in terms of a sequence of tasks {Ti | i = 0, 1, . . . , N}, where 'only the first task T0 = (X0, y0, p0) has labeled data, and the other tasks {Ti = (Xi, pi)} for i = 1, . . . , N remain unlabeled.' While this defines how data is presented across tasks in the continual learning setting, it does not provide specific train/validation/test splits (e.g., percentages or counts) within each individual task or for the overall dataset used for model evaluation. The 'Metrics' section mentions 'Acc(Ti) returns the accuracy on Ti' but doesn't specify how the test set for Ti is derived or what portion of Ti is used for evaluation versus learning.
Hardware Specification | Yes | All experiments are conducted on virtual machines configured with 4 x Intel(R) Xeon(R) Gold 6148 CPUs, one Nvidia V100 GPU, and 16GB of RAM.
Software Dependencies | No | The model for FaDL is implemented using the Fairlearn package (Bird et al. 2020). This mentions a software package but does not provide a specific version number for Fairlearn or any other dependencies.
Experiment Setup | Yes | We first evaluate the accuracy-fairness tradeoff on Bank marketing by sweeping λ2 over [0.1, 0.08, 0.06, 0.04, 0.02]. Figure 4 shows the tradeoff curves (left to right) for all three methods as λ2 decreases. The same range is used. Second, the experimental results of Bank marketing shown in Table 3 demonstrate the impact of λ1. As stated in Eq. (5), increasing λ1 will also make the model more inclined toward fairness requirements. Therefore, we can observe that increasing λ1 from 0.01 to 0.05 improves fairness, dropping DP from 8.4% to 4.6% and EO from 7.9% to 6.4%.

Table 3: Results of accuracy-fairness tradeoffs on Bank marketing, sweeping over a range of λ1 (increasing left to right).
Acc  .745  .740  .730  .721  .713  .650
DP   .084  .055  .046  .122  .173  .598
EO   .079  .070  .064  .142  .181  .605
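The DP and EO columns above are the standard group-fairness gaps (lower is fairer). As a reference point for readers, here is a minimal NumPy sketch of how these two metrics are conventionally computed for binary predictions and a binary protected attribute. This is an illustrative sketch of the generic definitions, not the paper's implementation (which uses Fairlearn for the FaDL baseline):

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(y_pred=1 | group=0) - P(y_pred=1 | group=1)|:
    the gap in positive-prediction rates between the two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    r0 = y_pred[group == 0].mean()
    r1 = y_pred[group == 1].mean()
    return abs(r0 - r1)

def equalized_odds_gap(y_true, y_pred, group):
    """Max over y in {0,1} of the between-group gap in
    P(y_pred=1 | y_true=y, group): compares TPR and FPR across groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = []
    for y in (0, 1):
        mask = y_true == y
        r0 = y_pred[mask & (group == 0)].mean()
        r1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(r0 - r1))
    return max(gaps)
```

Under this convention, a DP of .046 in Table 3 means the two protected groups' positive-prediction rates differ by 4.6 percentage points. The sketch assumes both groups (and both label values, for EO) are non-empty; a production implementation would guard against empty subgroups.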