Pathologies of Predictive Diversity in Deep Ensembles
Authors: Taiga Abe, E. Kelly Buchanan, Geoff Pleiss, John Patrick Cunningham
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a large scale study of nearly 600 neural network classification ensembles, we examine a variety of interventions that trade off component model performance for predictive diversity. We demonstrate this result on over 600 deep ensembles, across various architectures, datasets, and training objectives, and show that our findings reconcile differences between the traditional ensembling literature and various methods proposed to improve deep ensembles (Sec. 4). |
| Researcher Affiliation | Academia | Taiga Abe EMAIL Center for Theoretical Neuroscience Department of Neuroscience Columbia University, E. Kelly Buchanan EMAIL Center for Theoretical Neuroscience Department of Neuroscience Columbia University, Geoff Pleiss EMAIL Department of Statistics University of British Columbia Vector Institute, John Cunningham EMAIL Department of Statistics Columbia University |
| Pseudocode | No | The paper describes methods and experiments using mathematical equations and prose but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used to train deep ensembles is provided here: https://github.com/cellistigs/ensemble_attention/tree/dkl. Additional code for data analysis, as well as training non-deep neural network ensembles is located here: https://github.com/cellistigs/interp_ensembles. |
| Open Datasets | Yes | We train deep ensembles on the CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Tiny ImageNet (Le and Yang, 2015), and ImageNet (Deng et al., 2009) datasets. Using the covertype (Blackard, 1998) dataset and the Jensen gap regularizer... |
| Dataset Splits | No | The paper refers to training, validation, and test sets for datasets like CIFAR10, CIFAR100, Tiny ImageNet, ImageNet, and Covertype, but does not specify the exact percentages or sample counts for these splits. For example, it states 'The best performing checkpoint on validation data was selected after training and used for further evaluation' but provides no further details on the split. |
| Hardware Specification | Yes | The large majority of our experiments were performed on cloud based computational resources (AWS and GCP). Ensembles trained with diversity regularizers in Figs. 2, 3, 6, 12, 14, 15 and 17 to 22 were trained with EC2 instances from the P3 family. We used p3.xlarge instances for model training, unless we experienced issues with GPU memory. In these cases we trained models on p3.8xlarge instances. Separately, ensembles trained with diversity regularizers in Fig. 8 and Table 1 were trained on 4 x NVIDIA V100 compute instances. |
| Software Dependencies | No | The paper mentions using a 'PyTorch implementation' and 'scikit-learn (Pedregosa et al., 2011)' for various tasks, but does not provide specific version numbers for these libraries or any other software dependencies. |
| Experiment Setup | Yes | For CIFAR10, we chose 100 epochs of training, with batch size 256, base learning rate 1e-2, weight decay 1e-2. The optimizer used was SGD with momentum, with a linear warmup of 30 epochs and cosine decay. ... For CIFAR100, we ran training for 160 epochs, with batch size 128, base learning rate 1e-1, weight decay 5e-4. The optimizer used was SGD with momentum, with a tenfold decrease in the learning rate at 60 and 120 epochs. |
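The learning-rate schedules quoted in the Experiment Setup row can be made concrete. The paper does not give the exact warmup/decay formulas, so the following is a minimal sketch under common conventions: a per-epoch linear warmup into cosine decay for the CIFAR10 settings, and a tenfold step decay at the stated milestones for CIFAR100. Function names and the per-epoch granularity are our assumptions, not from the paper.

```python
import math


def cifar10_lr(epoch, base_lr=1e-2, warmup_epochs=30, total_epochs=100):
    """Sketch of the CIFAR10 schedule: linear warmup, then cosine decay.

    NOTE: assumes a per-epoch schedule; the paper does not specify
    whether warmup/decay are applied per step or per epoch.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warmup period.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def cifar100_lr(epoch, base_lr=1e-1, milestones=(60, 120)):
    """Sketch of the CIFAR100 schedule: tenfold decrease at each milestone."""
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)
```

In a PyTorch training loop these would correspond to `torch.optim.lr_scheduler.LambdaLR` (cosine with warmup) and `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1`, attached to the SGD-with-momentum optimizer the paper describes.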