Agree to Disagree: Demystifying Homogeneous Deep Ensembles through Distributional Equivalence

Authors: Yipei Wang, Xiaoqian Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we demonstrate that Jensen's inequality is not responsible for the effectiveness of deep ensembles, and convexity is not a necessary condition. Instead, the Jensen gap focuses on the average loss of individual models, which provides no practical meaning. Thus it fails to explain the core phenomena of deep ensembles, such as their superiority to any single ensemble member, the decreasing loss with the number of ensemble members, etc. Regarding this mystery, we provide theoretical analysis and comprehensive empirical results from a statistical perspective that reveal the true mechanism of deep ensembles. Our results highlight that the effectiveness of deep ensembles originates from the homogeneous output distribution across all ensemble members.
Researcher Affiliation Academia Yipei Wang, Xiaoqian Wang Elmore Family School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47907, USA EMAIL
Pseudocode No The paper describes methods and proofs in text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code to re-implement all the experiments is open source at https://github.com/yipei-wang/Deep Ensemble Demystified.
Open Datasets Yes The experiments are carried out on three datasets, CIFAR-10/100 (Krizhevsky et al., 2009) and Tiny ImageNet (Deng et al., 2009).
Dataset Splits Yes To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. Each model structure determines a model family F and the distribution pF over F. Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 pF in our experiments. For each pF, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. The training parameters follow the suggestions in Nakkiran et al. (2021). The setup for experiments is detailed in appendix A. ... The models are evaluated on (Xtest, Ytest) throughout the paper. ... In fig. 2(a), the kernel density estimation of ℓ|F = f(i) for all 10000 samples of CIFAR-10 is visualized. ... we carry out the KS test over the first 1000 testing samples.
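The two-sample Kolmogorov-Smirnov comparison described above can be sketched end-to-end. This is a minimal illustration with synthetic stand-in losses (the paper compares real per-sample loss distributions across trained ensemble members); the statistic is computed directly from the empirical CDFs rather than via `scipy.stats.ks_2samp`:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    gap between the empirical CDFs of the two samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-sample test losses of two independently
# trained members of the same model family (not real results).
losses_a = rng.lognormal(mean=-1.0, sigma=0.5, size=1000)
losses_b = rng.lognormal(mean=-1.0, sigma=0.5, size=1000)

# A small statistic is consistent with the homogeneity claim: both
# members' losses look drawn from the same distribution.
print(f"KS statistic: {ks_statistic(losses_a, losses_b):.4f}")
```

Identical samples give a statistic of 0, fully disjoint samples give 1, so the value directly measures how far apart the two loss distributions are.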
Hardware Specification Yes Experiments are carried out on Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with NVIDIA RTX A5000 GPUs.
Software Dependencies No The paper mentions training parameters like SGD, learning rate, momentum, and weight decay, but it does not specify any software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes To empirically validate our theoretical findings, we carry out comprehensive experiments, which are presented along with theoretical results. ... Varying the model architectures in CNNs and ResNets and capacities in width determined by k ∈ {10, 20, 40, 80, 160}, we include a total of 10 pF in our experiments. For each pF, we train M = 100 models for three datasets: CIFAR-10/100 and Tiny ImageNet, i.e., a total of 2 × 5 × 3 × M = 3000 trained models. ... SGD is used as the solver with no momentum or data augmentation. Let t denote the number of epochs; we use the learning rate λ(t) = λ0 / sqrt(t), where λ0 = 0.1 is the initial learning rate. ... We set the learning rate as lr=1e-3, and momentum as 0.9. For another variant, we further include a weight decay at 5e-4.
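The learning-rate schedule quoted above can be sketched as a small function. This reads the schedule as inverse square-root decay, λ(t) = λ0/√t, which is the schedule suggested in Nakkiran et al. (2021) that the setup cites; the function name is illustrative, not from the paper's code:

```python
import math

def learning_rate(epoch: int, lr0: float = 0.1) -> float:
    """Inverse square-root decay, lambda(t) = lambda0 / sqrt(t),
    with lambda0 = 0.1 as the stated initial learning rate.
    `epoch` is 1-indexed, so learning_rate(1) == lr0."""
    return lr0 / math.sqrt(epoch)

# The rate decays smoothly: 0.1 at epoch 1, 0.05 at epoch 4,
# 0.01 at epoch 100, and so on.
for t in (1, 4, 100):
    print(f"epoch {t:3d}: lr = {learning_rate(t):.4f}")
```

In a PyTorch-style training loop this would typically be wired up through a per-epoch scheduler (e.g. `torch.optim.lr_scheduler.LambdaLR`) wrapping the plain SGD optimizer the setup describes.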