Uncertainty as a Fairness Measure

Authors: Selim Kuzucu, Jiaee Cheong, Hatice Gunes, Sinan Kalkan

JAIR 2024

Reproducibility assessment — each item below lists the reproducibility variable, the result, and the supporting LLM response.
Research Type: Experimental
We demonstrate on many datasets that (i) our uncertainty-based measures are complementary to existing measures of fairness, and (ii) they provide more insights about the underlying issues leading to bias.
Researcher Affiliation: Academia
Selim Kuzucu — Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Türkiye
Jiaee Cheong — Department of Computer Science, University of Cambridge, Cambridge CB3 0FD, United Kingdom; The Alan Turing Institute, London NW1 2DB, United Kingdom
Hatice Gunes — Department of Computer Science, University of Cambridge, Cambridge CB3 0FD, United Kingdom
Sinan Kalkan — Department of Computer Engineering & ROMER Robotics-AI Center, Middle East Technical University, 06800 Ankara, Türkiye
Pseudocode: No
The paper describes its methodology in prose in Section 4 'Methodology' and its subsections, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code: No
The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets: Yes
We adopt the approach of (Zafar, Valera, Gomez Rodriguez, & Gummadi, 2017) for all synthetic dataset curation. Each synthetic dataset has 320 samples with 20% reserved for testing. The COMPAS Recidivism Dataset contains criminal offender records generally used to predict recidivism (binary classification) (Angwin, Larson, Mattu, & Kirchner, 2022). The Adult Income Dataset contains 48K+ samples with 14 features (Becker & Kohavi, 1996; UCI Machine Learning Repository, DOI: https://doi.org/10.24432/C5XW20). The D-Vlog Depression Detection Dataset contains visual and acoustic features from YouTube videos of 555 depressed and 406 non-depressed samples belonging to 639 females and 322 males (Yoon, Kang, Kim, & Han, 2022).
Dataset Splits: Yes
Each synthetic dataset has 320 samples with 20% reserved for testing. We follow (Zafar et al., 2017) in terms of the considered attributes and dataset splits. For the remaining datasets, we adhere to the training and testing splits provided by the respective authors.
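As an illustration of the 80/20 split described for the synthetic datasets, here is a minimal sketch. The feature dimensions and random data below are placeholders, not the paper's actual curation procedure from (Zafar et al., 2017):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder synthetic dataset: 320 samples, 2 features, binary labels.
X = rng.normal(size=(320, 2))
y = rng.integers(0, 2, size=320)

# Reserve 20% for testing, as in the paper's synthetic setup.
n_test = int(0.2 * len(X))                    # 64 samples
perm = rng.permutation(len(X))
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))              # 256 64
```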
Hardware Specification: No
The paper states: "We gratefully acknowledge the computational resources provided by METU Center for Robotics and Artificial Intelligence (METU-ROMER) and METU Image Processing Laboratory." However, this does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for the experiments.
Software Dependencies: No
The paper mentions using the Adam optimizer and the Bayes by Backprop method but does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
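For context on the Bayes by Backprop method mentioned above, a minimal sketch of its scaled mixture-of-Gaussians prior follows. Note an assumption: in (Blundell et al., 2015) the two scale hyperparameters are specified as negative log standard deviations, so the "0 and 6" values reported in the experiment setup are read here as σ1 = e⁻⁰ and σ2 = e⁻⁶; the function names are illustrative, not from the paper:

```python
import math

def gaussian_pdf(w, sigma):
    """Density of a zero-mean Gaussian with scale sigma, evaluated at w."""
    return math.exp(-w * w / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def scale_mixture_log_prior(w, pi=0.5, neg_log_sigma1=0.0, neg_log_sigma2=6.0):
    """Log-density of the scaled mixture-of-Gaussians prior from Bayes by
    Backprop: p(w) = pi * N(w; 0, s1^2) + (1 - pi) * N(w; 0, s2^2).
    Assumes the Blundell et al. convention sigma_i = exp(-neg_log_sigma_i)."""
    s1 = math.exp(-neg_log_sigma1)
    s2 = math.exp(-neg_log_sigma2)
    return math.log(pi * gaussian_pdf(w, s1) + (1 - pi) * gaussian_pdf(w, s2))
```

The narrow component (σ2 = e⁻⁶) concentrates mass near zero, giving the prior a heavy spike at the origin with broad tails.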
Experiment Setup: Yes
For all experiments, we use the Adam optimizer (Kingma & Ba, 2017). Following (Kwon et al., 2020), we set T = 10 (the number of Monte Carlo samples for uncertainty quantification as defined in Section 4.1). Furthermore, following one of the settings provided in (Blundell et al., 2015), we use 10 Monte Carlo samples to approximate the variational posterior, qθ(ω), and sample the initial mean of the posterior from a Gaussian with µ = 0 and σ = 1. The π value, the weighting factor for the prior, is set to 0.5, and the two values σ1 and σ2 for the scaled mixture of Gaussians are set to 0 and 6, respectively. We consider λ from the BNN training objective to be 2000. We utilize early stopping to determine the number of training iterations for all experiments.
- Synthetic Datasets: We train all models for 5 epochs with a batch size of 8.
- COMPAS Recidivism Dataset: We employ a BNN with a single hidden layer of size 100 and train the model for 10 epochs with a batch size of 256.
- Adult Income Dataset: We employ a BNN with no hidden layers, where the intermediate size is 25, and train the model for 5 epochs with a batch size of 256.
- D-Vlog Depression Detection Dataset: We choose T = 5, as existing work indicates that performance tends to peak at that number (Havasi et al., 2020). For all training configurations, we directly use the setting of (Yoon et al., 2022) with a learning rate of 0.0002 and a batch size of 32, optimized for 50 epochs through the Adam optimizer (Kingma & Ba, 2017). For the dropout rate, we empirically choose 0.1.
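To illustrate how T Monte Carlo samples yield a predictive uncertainty in this setup, here is a minimal NumPy sketch. Everything below is a hypothetical stand-in: a tiny linear classifier with a Gaussian variational posterior over its weights, not the paper's actual BNN architectures, and predictive entropy is used as one common example of an uncertainty measure:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical learned variational posterior q(w) = N(mu, sigma^2)
# over the weights of a tiny linear classifier.
d_in, n_classes = 4, 2
mu = rng.normal(size=(d_in, n_classes))
sigma = 0.1 * np.ones((d_in, n_classes))

def predict_with_uncertainty(x, T=10):
    """Draw T Monte Carlo weight samples from q(w), average the softmax
    outputs, and report predictive entropy as the uncertainty."""
    probs = []
    for _ in range(T):
        w = mu + sigma * rng.normal(size=mu.shape)   # one posterior sample
        probs.append(softmax(x @ w))
    p_mean = np.mean(probs, axis=0)                  # averaged prediction
    entropy = -np.sum(p_mean * np.log(p_mean + 1e-12), axis=-1)
    return p_mean, entropy

x = rng.normal(size=(3, d_in))                       # a small batch of inputs
p_mean, uncertainty = predict_with_uncertainty(x, T=10)
```

Per-sample uncertainties like these can then be aggregated per demographic group, which is the kind of quantity the paper's uncertainty-based fairness measures are built on.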