The Disparate Benefits of Deep Ensembles

Authors: Kajetan Schweighofer, Adrian Arnaiz-Rodriguez, Sepp Hochreiter, Nuria M Oliver

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our analysis reveals that they unevenly favor different groups, a phenomenon that we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that the per-group differences in predictive diversity of ensemble members can explain this effect.
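The group fairness metrics named above have standard definitions: statistical parity compares positive-prediction rates across groups, and equal opportunity compares true-positive rates. Below is a minimal pure-Python sketch of how these gaps could be computed for a deep ensemble's thresholded predictions; the function names, the binary two-group encoding, and the averaging-then-thresholding ensemble rule are illustrative assumptions, not taken from the paper.

```python
def ensemble_predict(member_probs, threshold=0.5):
    # Illustrative ensemble rule: average member probabilities, then threshold.
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    avg = [sum(p[i] for p in member_probs) / n_members for i in range(n_samples)]
    return [1 if a >= threshold else 0 for a in avg]

def statistical_parity_diff(y_pred, group):
    # Gap in positive-prediction rates: P(yhat=1 | g=0) - P(yhat=1 | g=1).
    rates = {}
    for g in (0, 1):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = sum(y_pred[i] for i in idx) / len(idx)
    return rates[0] - rates[1]

def equal_opportunity_diff(y_pred, y_true, group):
    # Gap in true-positive rates: P(yhat=1 | y=1, g=0) - P(yhat=1 | y=1, g=1).
    tprs = {}
    for g in (0, 1):
        idx = [i for i in range(len(y_true)) if group[i] == g and y_true[i] == 1]
        tprs[g] = sum(y_pred[i] for i in idx) / len(idx)
    return tprs[0] - tprs[1]
```

A nonzero gap under either metric for the ensemble, relative to its individual members, is what the paper's disparate benefits effect refers to.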
Researcher Affiliation | Collaboration | (1) ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; (2) ELLIS Alicante, Alicante, Spain; (3) NXAI GmbH, Linz, Austria.
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our experiments is available at https://github.com/ml-jku/disparate-benefits.
Open Datasets | Yes | First, two facial analysis datasets, namely FairFace (Karkkainen & Joo, 2021) and UTKFace (Zhang et al., 2017). ... Second, the CheXpert medical imaging dataset (Irvin et al., 2019) using the recommended targets provided by Jain et al. (2021) and protected group attributes provided by Gichoya et al. (2022).
Dataset Splits | Yes | For these datasets, all models were trained on the training split of FairFace and evaluated on the official test split of FairFace and the full UTKFace dataset. Protected group attributes were binarized... A random subset of 1/8 was split off as the test dataset. Protected group attributes were binarized as for the facial analysis datasets.
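The "random subset of 1/8" test split could be reproduced along the following lines; this is a minimal sketch under assumed details (seeded shuffle of indices, integer division for the test size), since the paper's quoted text does not specify the exact mechanism.

```python
import random

def split_eighth(indices, seed=0):
    # Shuffle a copy of the indices with a fixed seed, then hold out
    # the first 1/8 as the test set and keep the rest for training.
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_test = len(idx) // 8
    return idx[n_test:], idx[:n_test]  # (train_indices, test_indices)
```

With a fixed seed the split is deterministic, which matters when 10 independently trained models per configuration must share the same held-out test set.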
Hardware Specification | Yes | For training the models, we utilized a mixture of P100, RTX 3090, A40 and A100 GPUs, depending on availability in our cluster.
Software Dependencies | No | We used the ResNet18/24/50, RegNet-Y 800MF and EfficientNetV2-S implementations of PyTorch (Paszke et al., 2019). The paper mentions PyTorch and cites a paper but does not provide a specific version number for PyTorch or any other software.
Experiment Setup | Yes | The models that were trained on the FairFace training dataset were trained for 100 epochs using SGD with momentum of 0.9, a batch size of 256 and a learning rate of 1e-2. Furthermore, a standard combination of linear (from factor 1 to 0.1) and cosine annealing schedulers was used. The models that were trained on the CheXpert training dataset were trained for 30 epochs... We independently trained 10 models for 5 architectures on 4 target variables with 5 seeds.
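The quoted description of the learning-rate schedule is ambiguous about how the two schedulers are combined. One plausible reading, sketched below in pure Python, multiplies a linear factor decaying from 1.0 to 0.1 with a cosine-annealing factor over the 100 FairFace epochs; the function name and this multiplicative combination are assumptions for illustration, not confirmed by the paper (whose released code would be authoritative).

```python
import math

def lr_at_epoch(epoch, total_epochs=100, base_lr=1e-2):
    # Fraction of training completed, in [0, 1].
    t = epoch / max(total_epochs - 1, 1)
    # Linear factor decaying from 1.0 to 0.1, as stated in the paper.
    linear = 1.0 + (0.1 - 1.0) * t
    # Cosine annealing factor decaying from 1.0 to 0.0.
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))
    return base_lr * linear * cosine
```

Under this reading the schedule starts at the base learning rate of 1e-2 and decays smoothly to zero; in PyTorch, such combinations are typically assembled from the schedulers in `torch.optim.lr_scheduler`.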