Agree to Disagree: Demystifying Homogeneous Deep Ensembles through Distributional Equivalence

Authors: Yipei Wang, Xiaoqian Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we demonstrate that Jensen's inequality is not responsible for the effectiveness of deep ensembles, and convexity is not a necessary condition. Instead, the Jensen gap focuses on the average loss of individual models, which provides no practical meaning. Thus it fails to explain the core phenomena of deep ensembles, such as their superiority to any single ensemble member, the decreasing loss with the number of ensemble members, etc. Regarding this mystery, we provide theoretical analysis and comprehensive empirical results from a statistical perspective that reveal the true mechanism of deep ensembles. Our results highlight that the effectiveness of deep ensembles originates from the homogeneous output distribution across all ensemble members.
Researcher Affiliation Academia Yipei Wang, Xiaoqian Wang Elmore Family School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47907, USA EMAIL
Pseudocode No The paper describes methods and proofs in text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code to re-implement all the experiments is open source at https://github.com/yipei-wang/Deep Ensemble Demystified.
Open Datasets Yes The experiments are carried out on three datasets, CIFAR-10/100 (Krizhevsky et al., 2009) and Tiny ImageNet (Deng et al., 2009).
Dataset Splits Yes To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. Each model structure determines a model family F and the distribution pF over F. Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 pF in our experiments. For each pF, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. The training parameters follow the suggestions in Nakkiran et al. (2021). The setup for experiments is detailed in appendix A. ... The models are evaluated on (Xtest, Ytest) throughout the paper. ... In fig. 2(a), the kernel density estimation of ℓ|F = f(i) for all 10000 samples of CIFAR-10 is visualized. ... we carry out the KS test over the first 1000 testing samples.
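The two-sample Kolmogorov-Smirnov comparison described above can be sketched end-to-end. This is a minimal illustration with synthetic stand-in losses (the paper compares real per-sample loss distributions across trained ensemble members); the statistic is computed directly from the empirical CDFs rather than via `scipy.stats.ks_2samp`:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    gap between the empirical CDFs of the two samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-sample test losses of two independently
# trained members of the same model family (not real results).
losses_a = rng.lognormal(mean=-1.0, sigma=0.5, size=1000)
losses_b = rng.lognormal(mean=-1.0, sigma=0.5, size=1000)

# A small statistic is consistent with the homogeneity claim: both
# members' losses look drawn from the same distribution.
print(f"KS statistic: {ks_statistic(losses_a, losses_b):.4f}")
```

Identical samples give a statistic of 0, fully disjoint samples give 1, so the value directly measures how far apart the two loss distributions are.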
Hardware Specification Yes Experiments are carried out on Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with NVIDIA RTX A5000 GPUs.
Software Dependencies No The paper mentions training parameters like SGD, learning rate, momentum, and weight decay, but it does not specify any software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. ... Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 pF in our experiments. For each pF, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. ... SGD is used as the solver with no momentum or data augmentation. Let t denote the number of epochs; we use the learning rate λ(t) = λ0 / sqrt(t), where λ0 = 0.1 is the initial learning rate. ... We set the learning rate as lr=1e-3, and momentum as 0.9. For another variant, we further include a weight decay at 5e-4.
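The learning-rate schedule quoted above can be sketched as a small function. This reads the schedule as inverse square-root decay, λ(t) = λ0/√t, which is the schedule suggested in Nakkiran et al. (2021) that the setup cites; the function name is illustrative, not from the paper's code:

```python
import math

def learning_rate(epoch: int, lr0: float = 0.1) -> float:
    """Inverse square-root decay, lambda(t) = lambda0 / sqrt(t),
    with lambda0 = 0.1 as the stated initial learning rate.
    `epoch` is 1-indexed, so learning_rate(1) == lr0."""
    return lr0 / math.sqrt(epoch)

# The rate decays smoothly: 0.1 at epoch 1, 0.05 at epoch 4,
# 0.01 at epoch 100, and so on.
for t in (1, 4, 100):
    print(f"epoch {t:3d}: lr = {learning_rate(t):.4f}")
```

In a PyTorch-style training loop this would typically be wired up through a per-epoch scheduler (e.g. `torch.optim.lr_scheduler.LambdaLR`) wrapping the plain SGD optimizer the setup describes.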