FairPFN: A Tabular Foundation Model for Causal Fairness
Authors: Jake Robertson, Noah Hollmann, Samuel Müller, Noor Awad, Frank Hutter
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section assesses FairPFN's performance on synthetic and real-world benchmarks, highlighting its ability to remove the causal influence of protected attributes without user-specified knowledge of the causal model while maintaining high predictive accuracy. FairPFN is first evaluated on a series of synthetic causal case studies of increasing difficulty, an experimental setting in which the data-generating processes and all causal quantities are known, to assess its capacity to remove various sources of bias in causally generated data. |
| Researcher Affiliation | Collaboration | 1. ELLIS Institute Tübingen, 2. University of Freiburg, 3. Charité University Medicine Berlin, 4. Prior Labs, 5. Meta. |
| Pseudocode | Yes | We provide pseudocode for our pre-training algorithm in Algorithm 2, and outline the steps below. |
| Open Source Code | Yes | We provide a prediction interface to evaluate and assess our pre-trained model, as well as code to generate and visualize our pre-training data at https://github.com/jr2021/FairPFN. |
| Open Datasets | Yes | The first dataset is the Law School Admissions dataset from the 1998 LSAC National Longitudinal Bar Passage Study (Wightman, 1998), which includes admissions data for approximately 30,000 US law school applicants, revealing disparities in bar passage rates and first-year averages by ethnicity. The second dataset, derived from the 1994 US Census, is the Adult Census Income problem (Dua & Graff, 2017), containing demographic and income outcome data (INC 50K) for nearly 50,000 individuals. |
| Dataset Splits | Yes | After generating D_bias and D_fair, we partition them into training and validation sets: D_bias^train, D_bias^val, D_fair^train, and D_fair^val. Figure 6 shows the mean prediction average treatment effect (ATE) and predictive error (1-AUC) across 5 K-fold cross-validation iterations. |
| Hardware Specification | Yes | The transformer is trained for approximately 3 days on an RTX-2080 GPU on approximately 1.5 million different synthetic data-generating mechanisms, in which we vary the MLP architecture, the number of features m, the sample size n, and the non-linearities z. |
| Software Dependencies | No | The paper mentions 'XGBoost (Chen & Guestrin, 2016)' as a base model for EGR, but does not specify a version number for it or any other software dependencies like Python, PyTorch, or specific libraries used for the Fair PFN implementation. |
| Experiment Setup | Yes | The transformer is trained for approximately 3 days on an RTX-2080 GPU on approximately 1.5 million different synthetic data-generating mechanisms, in which we vary the MLP architecture, the number of features m, the sample size n, and the non-linearities z. For a robust evaluation, we generate 100 datasets per case study, varying the causal weights of protected attributes w_A, sample sizes n ∈ [100, 10000] (sampled on a log scale), and the standard deviation σ ∈ (0, 1) (log scale) of the additive noise terms. |
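The experiment-setup row describes drawing sample sizes and additive-noise scales on a log scale when generating the 100 datasets per case study. A minimal sketch of such log-uniform sampling, assuming a lower bound of 1e-3 for σ (a log scale cannot start exactly at 0; the bound and the helper name `sample_config` are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config(rng):
    # Sample size n drawn log-uniformly from [100, 10000], per the setup row.
    n = int(round(np.exp(rng.uniform(np.log(100), np.log(10_000)))))
    # Noise standard deviation drawn log-uniformly from (0, 1); the lower
    # bound 1e-3 is an assumption needed to make the log scale well-defined.
    sigma = float(np.exp(rng.uniform(np.log(1e-3), np.log(1.0))))
    return n, sigma

# One configuration per synthetic dataset; 100 datasets per case study.
configs = [sample_config(rng) for _ in range(100)]
```

Drawing `exp(uniform(log(a), log(b)))` rather than `uniform(a, b)` spreads the draws evenly across orders of magnitude, so small sample sizes and near-zero noise levels are as well represented as large ones.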
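The dataset-splits row reports the mean prediction average treatment effect (ATE) and predictive error (1-AUC) over 5 cross-validation folds. A hedged numpy sketch of how those two metrics might be computed; `kfold_metrics`, `fit_predict`, and the rank-based `auc` helper are hypothetical stand-ins, not the authors' code:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC (Mann-Whitney U statistic) for binary labels."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def kfold_metrics(X, y, a, fit_predict, k=5, seed=0):
    """Mean prediction ATE and predictive error (1 - AUC) over k folds.

    `fit_predict(X_tr, y_tr, X_te)` is a hypothetical stand-in for any
    probabilistic classifier; `a` holds the binary protected attribute.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    ates, errs = [], []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        p = fit_predict(X[tr], y[tr], X[te])
        # Prediction ATE: mean predicted-outcome gap between protected groups.
        ates.append(abs(p[a[te] == 1].mean() - p[a[te] == 0].mean()))
        errs.append(1 - auc(y[te], p))
    return float(np.mean(ates)), float(np.mean(errs))
```

Under this reading, a fairness-oriented predictor should drive the prediction ATE toward zero while keeping 1-AUC low, which matches the trade-off the checklist excerpt says Figure 6 visualizes.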