Sparse MoEs meet Efficient Ensembles

Authors: James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, Rodolphe Jenatton

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — Extensive experiments demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty improvements of e3 over several challenging vision Transformer-based baselines. e3 not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models. Section 5 is titled 'Evaluation' and presents quantitative results using metrics such as NLL, ECE, and OOD detection on various datasets.
Researcher Affiliation: Collaboration — Google Research, Brain Team; University of Cambridge; Waymo; ETH Zürich.
Pseudocode: No — The paper describes its methods using mathematical equations and figures, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes — Code is available at https://github.com/google-research/vmoe.
Open Datasets: Yes — All results correspond to the average over 8 (for {S, B, L} single models) or 5 (for H single models and all up/downstream ensembles) replications; Appendices L and M provide results for additional datasets and metrics, as well as standard errors. Following Riquelme et al. (2021), the paper compares the predictive-performance vs. compute-cost trade-offs for each method across a range of ViT families. In the reported results, e3 uses (K, M) = (1, 2), single V-MoE models use K = 2, V-MoE ensembles use K = 1, and all use E = 32. Experimental details, including upstream training, downstream fine-tuning, hyperparameter sweeps, and (linear) few-shot evaluation, appear in Appendix A. Datasets used: JFT-300M (Sun et al., 2017), ImageNet (Deng et al., 2009), CIFAR-10/100 (Krizhevsky, 2009), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2017), Caltech-UCSD Birds 200 (Wah et al., 2011), Caltech 101 (Bansal et al., 2021), Cars196 (Krause et al., 2013), Colorectal Histology (Kather et al., 2016), Describable Textures Dataset (Cimpoi et al., 2014), and UC Merced (Yang & Newsam, 2010).
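Since results are averaged over 8 or 5 replications with standard errors reported in the appendices, the aggregation step is easy to reproduce. A minimal sketch follows; the function name and the example NLL values are hypothetical, not taken from the paper.

```python
import math

def mean_and_stderr(values):
    """Mean and standard error of the mean over independent replications."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction, then standard error of the mean.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

# Hypothetical NLL values from 5 downstream-ensemble replications.
nll_runs = [0.652, 0.648, 0.655, 0.650, 0.651]
mean_nll, stderr_nll = mean_and_stderr(nll_runs)
```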
Dataset Splits: Yes — Train/validation splits depend on the dataset: ImageNet 99%/1%, CIFAR-10 98%/2%, CIFAR-100 98%/2%, Oxford-IIIT Pets 90%/10%, Oxford Flowers-102 90%/10%. These design choices follow Riquelme et al. (2021) and Dosovitskiy et al. (2021). As commonly defined in few-shot learning, "s shots" means a setting in which s training images are available per class label in each dataset.
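The s-shot definition above (s training images per class label) can be sketched as a simple per-class subsampling step; the function and variable names here are illustrative, not from the vmoe codebase.

```python
from collections import defaultdict

def select_s_shot(examples, s):
    """Keep the first s examples per class label (an s-shot training subset).

    `examples` is a list of (image_id, label) pairs; names are illustrative.
    """
    per_class = defaultdict(list)
    for image_id, label in examples:
        if len(per_class[label]) < s:
            per_class[label].append(image_id)
    return dict(per_class)

data = [("img0", "cat"), ("img1", "dog"), ("img2", "cat"),
        ("img3", "cat"), ("img4", "dog"), ("img5", "dog")]
subset = select_s_shot(data, s=2)  # 2 images per class remain
```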
Hardware Specification: No — The paper does not explicitly describe the hardware (e.g., GPU models, CPU models, or TPU versions) used to run the experiments.
Software Dependencies: No — The references cite 'JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax', but the paper does not provide version numbers for any software dependencies used in the experiments.
Experiment Setup: Yes — Fine-tuning applies a number of common design choices: image resolution 384; gradient-norm clipping at 10.0; SGD with momentum (half precision, β = 0.9); batch size 512. V-MoE models are fine-tuned with capacity ratio C = 1.5 and evaluated with C = 8. Table 7 lists per-dataset fine-tuning hyperparameters; for ImageNet: 20 000 steps, {0.5, 1.0, 1.5, 2.0}, base learning rate in {0.0024, 0.003, 0.01, 0.03}, expert dropout 0.1. The regularization parameter controlling the strength of Ωpartition was set to 0.01 throughout the experiments. Regularization includes Mixup (Zhang et al., 2018), weight decay, and Dropout (Srivastava et al., 2014) on expert MLPs. The optimal V-MoE setting was "medium2", i.e., mixup ratio 0.5 and RandAugment (Cubuk et al., 2020) parameters 2 and 15 (2 augmentations of magnitude 15), alongside expert dropout 0.2. For e3, these regularization-related hyperparameters were kept fixed at medium2 and not further tuned; only the learning rate was swept over {0.001, 0.003}.
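The optimizer recipe above (global gradient-norm clipping at 10.0 followed by SGD with momentum β = 0.9) can be sketched in pure Python. This is a minimal illustration, not the paper's implementation; the toy gradient and parameter values are hypothetical, and the learning rate 0.003 is one of the swept values mentioned above.

```python
import math

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale the whole gradient list so its global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

def sgd_momentum_step(params, grads, velocity, lr=0.003, beta=0.9):
    """One SGD-with-momentum update (beta = 0.9, as in the fine-tuning setup)."""
    velocity = [beta * v + g for v, g in zip(velocity, grads)]
    params = [p - lr * v for p, v in zip(params, velocity)]
    return params, velocity

grads = clip_by_global_norm([30.0, 40.0])  # global norm 50 -> scaled down to 10
params, velocity = sgd_momentum_step([1.0, 2.0], grads, [0.0, 0.0])
```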