Agree to Disagree: Demystifying Homogeneous Deep Ensembles through Distributional Equivalence
Authors: Yipei Wang, Xiaoqian Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we demonstrate that Jensen's inequality is not responsible for the effectiveness of deep ensembles, and convexity is not a necessary condition. Instead, the Jensen gap focuses on the average loss of individual models, which provides no practical meaning. Thus it fails to explain the core phenomena of deep ensembles, such as their superiority to any single ensemble member, the decreasing loss with the number of ensemble members, etc. Regarding this mystery, we provide theoretical analysis and comprehensive empirical results from a statistical perspective that reveal the true mechanism of deep ensembles. Our results highlight that deep ensembles originate from the homogeneous output distribution across all ensemble members. |
| Researcher Affiliation | Academia | Yipei Wang, Xiaoqian Wang, Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA |
| Pseudocode | No | The paper describes methods and proofs in text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to re-implement all the experiments is open source at https://github.com/yipei-wang/DeepEnsembleDemystified. |
| Open Datasets | Yes | The experiments are carried out on three datasets, CIFAR-10/100 (Krizhevsky et al., 2009) and Tiny ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. Each model structure determines a model family F and the distribution p_F over F. Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 p_F in our experiments. For each p_F, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. The training parameters follow the suggestions in Nakkiran et al. (2021). The setup for experiments is detailed in appendix A. ... The models are evaluated on (Xtest, Ytest) throughout the paper. ... In fig. 2(a), the kernel density estimation of ℓ | F = f^(i) for all 10000 samples of CIFAR-10 is visualized. ... we carry out the KS test over the first 1000 testing samples. |
| Hardware Specification | Yes | Experiments are carried out on Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with NVIDIA RTX A5000 GPUs. |
| Software Dependencies | No | The paper mentions training parameters like SGD, learning rate, momentum, and weight decay, but it does not specify any software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. ... Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 p_F in our experiments. For each p_F, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. ... SGD is used as the solver with no momentum or data augmentation. Let t denote the number of epochs; we use the learning rate λ(t) = λ0 / √t, where λ0 = 0.1 is the initial learning rate. ... We set the learning rate as lr=1e-3, and momentum as 0.9. For another variant, we further include a weight decay at 5e-4. |
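The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal sketch, assuming the schedule is the inverse-square-root decay λ(t) = λ0 / √t with λ0 = 0.1 (the extracted quote is ambiguous, but a rate growing as √t would be unusual, and the Nakkiran et al. (2021) setup the paper follows uses inverse-square-root decay). The function name `lr_schedule` is illustrative, not taken from the paper's code.

```python
# Sketch of an inverse-square-root learning-rate schedule,
# lambda(t) = lambda_0 / sqrt(t), with lambda_0 = 0.1 as in the paper.
import math

def lr_schedule(t: int, lr0: float = 0.1) -> float:
    """Learning rate at epoch t >= 1, decaying as 1/sqrt(t)."""
    return lr0 / math.sqrt(t)

# The rate starts at lr0 and shrinks smoothly with the epoch count.
print([round(lr_schedule(t), 4) for t in (1, 4, 100)])
```

In practice, such a schedule would be attached to the optimizer once per epoch; the paper's exact epoch indexing may differ from this sketch.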
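The Dataset Splits row mentions a Kolmogorov–Smirnov test over per-sample losses of the trained models. The distributional-equivalence check it describes can be sketched as a two-sample KS test; the gamma-distributed loss arrays below are synthetic stand-ins for two ensemble members' per-sample losses, not the authors' data.

```python
# Sketch of a two-sample KS test on per-sample losses of two
# independently trained models (synthetic stand-in data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Placeholder per-sample losses for two ensemble members; both are drawn
# from the same distribution to mimic the homogeneity hypothesis.
losses_model_a = rng.gamma(shape=2.0, scale=0.5, size=1000)
losses_model_b = rng.gamma(shape=2.0, scale=0.5, size=1000)

stat, p_value = ks_2samp(losses_model_a, losses_model_b)
# A large p-value is consistent with the two members sharing one
# output-loss distribution, as the paper's homogeneity claim predicts.
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

Run pairwise over all 100 models per family, this kind of test is what the quoted "KS test over the first 1000 testing samples" would amount to in practice.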