Fairness Overfitting in Machine Learning: An Information-Theoretic Perspective
Authors: Firas Laakom, Haobo Chen, Jürgen Schmidhuber, Yuheng Bu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results validate the tightness and practical relevance of these bounds across diverse fairness-aware learning algorithms. Our framework offers valuable insights to guide the design of algorithms improving fairness generalization. ... We conduct extensive empirical experiments on standard fairness datasets, including COMPAS and Adult. These experiments highlight the tightness of our bounds and their ability to capture the complex behavior of the fairness generalization error, providing valuable insights for future algorithm design. |
| Researcher Affiliation | Academia | 1Center of Excellence for Generative AI, KAUST, Saudi Arabia 2University of Florida, Gainesville, USA 3The Swiss AI Lab, IDSIA, USI & SUPSI, Switzerland. Correspondence to: Firas Laakom <EMAIL>, Yuheng Bu <EMAIL>. |
| Pseudocode | No | The paper describes theoretical frameworks and empirical evaluations but does not contain any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on the COMPAS dataset (Larson et al., 2016). ... We conduct extensive empirical experiments on standard fairness datasets, including COMPAS and Adult. ... The COMPAS (Larson et al., 2016) dataset, which involves recidivism prediction based on criminal and demographic records. ... The Adult (Kohavi & Becker, 1996) dataset, derived from U.S. Census data, which focuses on income prediction. |
| Dataset Splits | No | The paper mentions varying the 'number of training samples' and describes a super-sample framework for evaluating bounds ('we draw m2 = 50 different train/test splits' for the evaluation of CMI terms), but it does not specify conventional train/validation/test splits (e.g., percentages or fixed counts) used to train the fairness algorithms themselves on the COMPAS or Adult datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions methods for CMI estimation, such as 'Ross (2014)', but does not list specific software libraries with version numbers (e.g., Python 3.8, PyTorch 1.9) that were used in the experiments. |
| Experiment Setup | No | The paper states: 'All approaches follow the same training protocol (architectures, hyperparameters, etc.) as in Han et al. (2024).' This defers specific experimental setup details like hyperparameters and architectures to another paper, rather than providing them directly in the main text. |
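The super-sample protocol mentioned under "Dataset Splits" (drawing m2 = 50 random train/test splits of a fixed pool of examples and measuring the train-vs-test fairness gap on each) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the toy data, the threshold "learner", and the demographic-parity metric are all assumptions standing in for the fairness-aware algorithms the paper actually evaluates.

```python
# Hypothetical sketch of a super-sample evaluation: draw m2 = 50 random
# train/test splits of a 2n-point pool and record the gap between the
# train and test fairness measures on each split.
import numpy as np

rng = np.random.default_rng(0)

def demographic_parity_gap(y_pred, group):
    # |P(yhat = 1 | group = 0) - P(yhat = 1 | group = 1)|
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# Toy "super-sample" of 2n examples: features, labels, sensitive attribute.
n = 100
X = rng.normal(size=(2 * n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2 * n) > 0).astype(int)
g = rng.integers(0, 2, size=2 * n)

m2 = 50  # number of random train/test splits, as in the paper's evaluation
gaps = []
for _ in range(m2):
    # Random assignment of each of the 2n points to train (True) or test.
    mask = rng.permutation(np.repeat([True, False], n))
    # Stand-in learner: threshold on the first feature (placeholder for a
    # fairness-aware training algorithm).
    threshold = np.median(X[mask, 0])
    pred = (X[:, 0] > threshold).astype(int)
    # Fairness generalization error on this split: train vs. test gap.
    gaps.append(abs(demographic_parity_gap(pred[mask], g[mask])
                    - demographic_parity_gap(pred[~mask], g[~mask])))

mean_gap = float(np.mean(gaps))  # averaged over the m2 splits
```

Averaging the per-split gaps over the m2 draws is what lets the empirical fairness generalization error be compared against the paper's information-theoretic bounds.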