The Disparate Benefits of Deep Ensembles

Authors: Kajetan Schweighofer, Adrian Arnaiz-Rodriguez, Sepp Hochreiter, Nuria M Oliver

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our analysis reveals that they unevenly favor different groups, a phenomenon that we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that the per-group differences in predictive diversity of ensemble members can explain this effect.
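The group fairness metrics named above have standard definitions: statistical parity compares positive-prediction rates across groups, and equal opportunity compares true-positive rates. Below is a minimal pure-Python sketch of how these gaps could be computed for a deep ensemble's thresholded predictions; the function names, the binary two-group encoding, and the averaging-then-thresholding ensemble rule are illustrative assumptions, not taken from the paper.

```python
def ensemble_predict(member_probs, threshold=0.5):
    # Illustrative ensemble rule: average member probabilities, then threshold.
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    avg = [sum(p[i] for p in member_probs) / n_members for i in range(n_samples)]
    return [1 if a >= threshold else 0 for a in avg]

def statistical_parity_diff(y_pred, group):
    # Gap in positive-prediction rates: P(yhat=1 | g=0) - P(yhat=1 | g=1).
    rates = {}
    for g in (0, 1):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = sum(y_pred[i] for i in idx) / len(idx)
    return rates[0] - rates[1]

def equal_opportunity_diff(y_pred, y_true, group):
    # Gap in true-positive rates: P(yhat=1 | y=1, g=0) - P(yhat=1 | y=1, g=1).
    tprs = {}
    for g in (0, 1):
        idx = [i for i in range(len(y_true)) if group[i] == g and y_true[i] == 1]
        tprs[g] = sum(y_pred[i] for i in idx) / len(idx)
    return tprs[0] - tprs[1]
```

A nonzero gap under either metric for the ensemble, relative to its individual members, is what the paper's disparate benefits effect refers to.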
Researcher Affiliation | Collaboration | (1) ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; (2) ELLIS Alicante, Alicante, Spain; (3) NXAI GmbH, Linz, Austria.
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our experiments is available at https://github.com/ml-jku/disparate-benefits.
Open Datasets | Yes | First, two facial analysis datasets, namely FairFace (Karkkainen & Joo, 2021) and UTKFace (Zhang et al., 2017). ... Second, the CheXpert medical imaging dataset (Irvin et al., 2019) using the recommended targets provided by Jain et al. (2021) and protected group attributes provided by Gichoya et al. (2022).
Dataset Splits | Yes | For these datasets, all models were trained on the training split of FairFace and evaluated on the official test split of FairFace and the full UTKFace dataset. Protected group attributes were binarized... A random subset of 1/8 was split off as the test dataset. Protected group attributes were binarized as for the facial analysis datasets.
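The "random subset of 1/8" test split could be reproduced along the following lines; this is a minimal sketch under assumed details (seeded shuffle of indices, integer division for the test size), since the paper's quoted text does not specify the exact mechanism.

```python
import random

def split_eighth(indices, seed=0):
    # Shuffle a copy of the indices with a fixed seed, then hold out
    # the first 1/8 as the test set and keep the rest for training.
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_test = len(idx) // 8
    return idx[n_test:], idx[:n_test]  # (train_indices, test_indices)
```

With a fixed seed the split is deterministic, which matters when 10 independently trained models per configuration must share the same held-out test set.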
Hardware Specification | Yes | For training the models, we utilized a mixture of P100, RTX 3090, A40 and A100 GPUs, depending on availability in our cluster.
Software Dependencies | No | We used the ResNet18/24/50, RegNet-Y 800MF and EfficientNetV2-S implementations of PyTorch (Paszke et al., 2019). The paper mentions PyTorch and cites a paper but does not provide a specific version number for PyTorch or any other software.
Experiment Setup | Yes | The models that were trained on the FairFace training dataset were trained for 100 epochs using SGD with momentum of 0.9, a batch size of 256 and a learning rate of 1e-2. Furthermore, a standard combination of linear (from factor 1 to 0.1) and cosine annealing schedulers was used. The models that were trained on the CheXpert training dataset were trained for 30 epochs... We independently trained 10 models for 5 architectures on 4 target variables with 5 seeds.
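The quoted description of the learning-rate schedule is ambiguous about how the two schedulers are combined. One plausible reading, sketched below in pure Python, multiplies a linear factor decaying from 1.0 to 0.1 with a cosine-annealing factor over the 100 FairFace epochs; the function name and this multiplicative combination are assumptions for illustration, not confirmed by the paper (whose released code would be authoritative).

```python
import math

def lr_at_epoch(epoch, total_epochs=100, base_lr=1e-2):
    # Fraction of training completed, in [0, 1].
    t = epoch / max(total_epochs - 1, 1)
    # Linear factor decaying from 1.0 to 0.1, as stated in the paper.
    linear = 1.0 + (0.1 - 1.0) * t
    # Cosine annealing factor decaying from 1.0 to 0.0.
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))
    return base_lr * linear * cosine
```

Under this reading the schedule starts at the base learning rate of 1e-2 and decays smoothly to zero; in PyTorch, such combinations are typically assembled from the schedulers in `torch.optim.lr_scheduler`.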