Pathologies of Predictive Diversity in Deep Ensembles
Authors: Taiga Abe, E. Kelly Buchanan, Geoff Pleiss, John Patrick Cunningham
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a large scale study of nearly 600 neural network classification ensembles, we examine a variety of interventions that trade off component model performance for predictive diversity. We demonstrate this result on over 600 deep ensembles, across various architectures, datasets, and training objectives, and show that our findings reconcile differences between the traditional ensembling literature and various methods proposed to improve deep ensembles (Sec. 4). |
| Researcher Affiliation | Academia | Taiga Abe EMAIL Center for Theoretical Neuroscience Department of Neuroscience Columbia University, E. Kelly Buchanan EMAIL Center for Theoretical Neuroscience Department of Neuroscience Columbia University, Geoff Pleiss EMAIL Department of Statistics University of British Columbia Vector Institute, John Cunningham EMAIL Department of Statistics Columbia University |
| Pseudocode | No | The paper describes methods and experiments using mathematical equations and prose but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used to train deep ensembles is provided here: https://github.com/cellistigs/ensemble_attention/tree/dkl. Additional code for data analysis, as well as training non-deep neural network ensembles is located here: https://github.com/cellistigs/interp_ensembles. |
| Open Datasets | Yes | We train deep ensembles on the CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Tiny ImageNet (Le and Yang, 2015), and ImageNet (Deng et al., 2009) datasets. Using the covertype (Blackard, 1998) dataset and the Jensen gap regularizer... |
| Dataset Splits | No | The paper refers to training, validation, and test sets for datasets like CIFAR10, CIFAR100, Tiny ImageNet, ImageNet, and Covertype, but does not specify the exact percentages or sample counts for these splits. For example, it states 'The best performing checkpoint on validation data was selected after training and used for further evaluation' but provides no further details on the split. |
| Hardware Specification | Yes | The large majority of our experiments were performed on cloud based computational resources (AWS and GCP). Ensembles trained with diversity regularizers in Figs. 2, 3, 6, 12, 14, 15 and 17 to 22 were trained with EC2 instances from the P3 family. We used p3.xlarge instances for model training, unless we experienced issues with GPU memory. In these cases we trained models on p3.8xlarge instances. Separately, ensembles trained with diversity regularizers in Fig. 8 and Table 1 were trained on 4 x NVIDIA V100 compute instances. |
| Software Dependencies | No | The paper mentions using a 'PyTorch implementation' and 'scikit-learn (Pedregosa et al., 2011)' for various tasks, but does not provide specific version numbers for these libraries or any other software dependencies. |
| Experiment Setup | Yes | For CIFAR10, we chose 100 epochs of training, with batch size 256, base learning rate 1e-2, weight decay 1e-2. The optimizer used was SGD with momentum, with a linear warmup of 30 epochs and cosine decay. ... For CIFAR100, we ran training for 160 epochs, with batch size 128, base learning rate 1e-1, weight decay 5e-4. The optimizer used was SGD with momentum, with a tenfold decrease in the learning rate at 60 and 120 epochs. |
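The learning-rate schedules quoted in the Experiment Setup row can be made concrete. The paper does not give the exact warmup/decay formulas, so the following is a minimal sketch under common conventions: a per-epoch linear warmup into cosine decay for the CIFAR10 settings, and a tenfold step decay at the stated milestones for CIFAR100. Function names and the per-epoch granularity are our assumptions, not from the paper.

```python
import math


def cifar10_lr(epoch, base_lr=1e-2, warmup_epochs=30, total_epochs=100):
    """Sketch of the CIFAR10 schedule: linear warmup, then cosine decay.

    NOTE: assumes a per-epoch schedule; the paper does not specify
    whether warmup/decay are applied per step or per epoch.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warmup period.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def cifar100_lr(epoch, base_lr=1e-1, milestones=(60, 120)):
    """Sketch of the CIFAR100 schedule: tenfold decrease at each milestone."""
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)
```

In a PyTorch training loop these would correspond to `torch.optim.lr_scheduler.LambdaLR` (cosine with warmup) and `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1`, attached to the SGD-with-momentum optimizer the paper describes.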