Sparse MoEs meet Efficient Ensembles

Authors: James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, Rodolphe Jenatton

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — Extensive experiments demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty improvements of e3 over several challenging vision Transformer-based baselines. e3 not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models. Section 5 is titled 'Evaluation' and presents quantitative results using metrics such as NLL, ECE, and OOD detection on various datasets.
Researcher Affiliation: Collaboration — Google Research, Brain Team; University of Cambridge; Waymo; ETH Zürich.
Pseudocode: No — The paper describes its methods using mathematical equations and figures, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes — Code is available at https://github.com/google-research/vmoe.
Open Datasets: Yes — All results correspond to the average over 8 (for {S, B, L} single models) or 5 (for H single models and all up/downstream ensembles) replications; Appendices L and M provide results for additional datasets and metrics, as well as standard errors. Following Riquelme et al. (2021), the paper compares the predictive-performance vs. compute-cost trade-offs for each method across a range of ViT families. In the reported results, e3 uses (K, M) = (1, 2), single V-MoE models use K = 2, V-MoE ensembles use K = 1, and all use E = 32. Experimental details, including upstream training, downstream fine-tuning, hyperparameter sweeps, and (linear) few-shot evaluation, appear in Appendix A. Datasets used: JFT-300M (Sun et al., 2017), ImageNet (Deng et al., 2009), CIFAR-10/100 (Krizhevsky, 2009), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2017), Caltech-UCSD Birds 200 (Wah et al., 2011), Caltech 101 (Bansal et al., 2021), Cars196 (Krause et al., 2013), Colorectal Histology (Kather et al., 2016), Describable Textures Dataset (Cimpoi et al., 2014), and UC Merced (Yang & Newsam, 2010).
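Since results are averaged over 8 or 5 replications with standard errors reported in the appendices, the aggregation step is easy to reproduce. A minimal sketch follows; the function name and the example NLL values are hypothetical, not taken from the paper.

```python
import math

def mean_and_stderr(values):
    """Mean and standard error of the mean over independent replications."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction, then standard error of the mean.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

# Hypothetical NLL values from 5 downstream-ensemble replications.
nll_runs = [0.652, 0.648, 0.655, 0.650, 0.651]
mean_nll, stderr_nll = mean_and_stderr(nll_runs)
```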
Dataset Splits: Yes — Train/validation splits depend on the dataset: ImageNet 99%/1%, CIFAR-10 98%/2%, CIFAR-100 98%/2%, Oxford-IIIT Pets 90%/10%, Oxford Flowers-102 90%/10%. These design choices follow Riquelme et al. (2021) and Dosovitskiy et al. (2021). As commonly defined in few-shot learning, "s shots" means a setting in which s training images are available per class label in each dataset.
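The s-shot definition above (s training images per class label) can be sketched as a simple per-class subsampling step; the function and variable names here are illustrative, not from the vmoe codebase.

```python
from collections import defaultdict

def select_s_shot(examples, s):
    """Keep the first s examples per class label (an s-shot training subset).

    `examples` is a list of (image_id, label) pairs; names are illustrative.
    """
    per_class = defaultdict(list)
    for image_id, label in examples:
        if len(per_class[label]) < s:
            per_class[label].append(image_id)
    return dict(per_class)

data = [("img0", "cat"), ("img1", "dog"), ("img2", "cat"),
        ("img3", "cat"), ("img4", "dog"), ("img5", "dog")]
subset = select_s_shot(data, s=2)  # 2 images per class remain
```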
Hardware Specification: No — The paper does not explicitly describe the hardware (e.g., GPU models, CPU models, or TPU versions) used to run the experiments.
Software Dependencies: No — The references cite 'JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax', but the paper does not provide version numbers for any software dependencies used in the experiments.
Experiment Setup: Yes — Fine-tuning applies a number of common design choices: image resolution 384; gradient-norm clipping at 10.0; SGD with momentum (half precision, β = 0.9); batch size 512. V-MoE models are fine-tuned with capacity ratio C = 1.5 and evaluated with C = 8. Table 7 lists per-dataset fine-tuning hyperparameters; for ImageNet: 20 000 steps, {0.5, 1.0, 1.5, 2.0}, base learning rate in {0.0024, 0.003, 0.01, 0.03}, expert dropout 0.1. The regularization parameter controlling the strength of Ωpartition was set to 0.01 throughout the experiments. Regularization includes Mixup (Zhang et al., 2018), weight decay, and Dropout (Srivastava et al., 2014) on expert MLPs. The optimal V-MoE setting was "medium2", i.e., mixup ratio 0.5 and RandAugment (Cubuk et al., 2020) parameters 2 and 15 (2 augmentations of magnitude 15), alongside expert dropout 0.2. For e3, these regularization-related hyperparameters were kept fixed at medium2 and not further tuned; only the learning rate was swept over {0.001, 0.003}.
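The optimizer recipe above (global gradient-norm clipping at 10.0 followed by SGD with momentum β = 0.9) can be sketched in pure Python. This is a minimal illustration, not the paper's implementation; the toy gradient and parameter values are hypothetical, and the learning rate 0.003 is one of the swept values mentioned above.

```python
import math

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale the whole gradient list so its global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

def sgd_momentum_step(params, grads, velocity, lr=0.003, beta=0.9):
    """One SGD-with-momentum update (beta = 0.9, as in the fine-tuning setup)."""
    velocity = [beta * v + g for v, g in zip(velocity, grads)]
    params = [p - lr * v for p, v in zip(params, velocity)]
    return params, velocity

grads = clip_by_global_norm([30.0, 40.0])  # global norm 50 -> scaled down to 10
params, velocity = sgd_momentum_step([1.0, 2.0], grads, [0.0, 0.0])
```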