Routers in Vision Mixture of Experts: An Empirical Study
Authors: Tianlin Liu, Mathieu Blondel, Carlos Riquelme Ruiz, Joan Puigcerver
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models. |
| Researcher Affiliation | Collaboration | Tianlin Liu, University of Basel; Mathieu Blondel, Google DeepMind; Carlos Riquelme, Stability AI; Joan Puigcerver, Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Token Choice allocation; Algorithm 2: Expert Choice allocation |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | For the pre-training experiments, all models were trained on JFT-300M (Sun et al., 2017), which contains about 305 million training images and 50,000 validation images, organized in a hierarchy of 18,291 different classes. To avoid overlap with the validation and test sets of JFT-300M, the images in the dataset were deduplicated, as done in Kolesnikov et al. (2020). To assess how well pre-trained MoE models adapt to new tasks, we conducted few-shot adaptation experiments using the ImageNet-1k dataset (Deng et al., 2009). |
| Dataset Splits | Yes | For the pre-training experiments, all models were trained on JFT-300M (Sun et al., 2017), which contains about 305 million training images and 50,000 validation images... To assess how well pre-trained MoE models adapt to new tasks, we conducted few-shot adaptation experiments using the ImageNet-1k dataset (Deng et al., 2009). In these experiments, we used 10 image samples per class from ImageNet-1k. The pre-trained model extracts a fixed feature embedding for each image, which is then used to train a linear regression model. This linear model maps the extracted features to the one-hot encoded target labels. This procedure is in line with the 10-shot evaluation procedure described by Dosovitskiy et al. (2021); Riquelme et al. (2021). |
| Hardware Specification | Yes | Training cost is reported in TPUv3-days (column header in Tables 1, 2, and 3, which compare routers in the B32, B16, and L16 architectures on the JFT dataset), indicating training on TPUv3 hardware. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We fix the total number of experts to be 32; that is, E = 32. For Softmax Token Choice and Sinkhorn Token Choice routers that process each token with k experts, we experiment with k = 1 and k = 2. In this way, the buffer capacity (the maximum number of tokens an expert can process in a batch) of these variants is C = round(k T/E). For Softmax Expert Choice, Sinkhorn Expert Choice, and sparsity-constrained variants, we control the buffer capacity C through a capacity factor c, which plays a role similar to k. The buffer capacity C is defined through C = round(c T/E). We experiment with c = 1 or 2, which matches the buffer capacity of k = 1 and k = 2 in the Token Choice cases. For Soft MoE routers, we set the number of slots per expert to be C = round(c T/E) with c = 1 or 2, just like in the Expert Choice case. |
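The buffer-capacity rule quoted in the setup row, C = round(k·T/E) for Token Choice and C = round(c·T/E) for Expert Choice and Soft MoE, is simple enough to sketch directly. The function below is an illustrative helper, not code from the paper; the numbers T = 1024 and E = 32 are assumptions chosen to match the paper's fixed expert count.

```python
def buffer_capacity(num_tokens: int, num_experts: int, factor: float) -> int:
    """Buffer capacity C = round(factor * T / E).

    `factor` plays the role of k for Token Choice routers and of the
    capacity factor c for Expert Choice and Soft MoE routers.
    """
    return round(factor * num_tokens / num_experts)

# With E = 32 experts and a hypothetical batch of T = 1024 tokens:
print(buffer_capacity(1024, 32, 1))  # k = 1 (or c = 1) -> 32 tokens per expert
print(buffer_capacity(1024, 32, 2))  # k = 2 (or c = 2) -> 64 tokens per expert
```

This makes concrete why c = 1 and c = 2 "match the buffer capacity of k = 1 and k = 2": both plug the same factor into the same formula.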
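The pseudocode row points to Algorithm 1 (Token Choice allocation) and Algorithm 2 (Expert Choice allocation). A minimal sketch of the distinction, under the assumption that routing is a top-k selection over a (tokens × experts) score matrix (this is an illustration of the general idea, not the paper's exact algorithms, which additionally handle capacity limits and tie-breaking):

```python
import numpy as np

def token_choice(scores: np.ndarray, k: int) -> np.ndarray:
    """Token Choice: each token selects its k highest-scoring experts.

    scores: (T, E) router scores. Returns (T, k) expert indices.
    Every token is routed, but a popular expert may exceed its buffer.
    """
    return np.argsort(-scores, axis=1)[:, :k]

def expert_choice(scores: np.ndarray, capacity: int) -> np.ndarray:
    """Expert Choice: each expert selects its `capacity` highest-scoring tokens.

    scores: (T, E) router scores. Returns (E, capacity) token indices.
    Load is balanced by construction, but some tokens may go unrouted.
    """
    return np.argsort(-scores, axis=0)[:capacity, :].T

rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 4))          # 8 tokens, 4 experts (toy sizes)
print(token_choice(scores, k=1).shape)        # (8, 1): one expert per token
print(expert_choice(scores, capacity=2).shape)  # (4, 2): two tokens per expert
```

The shapes capture the asymmetry the paper studies: Token Choice fixes the per-token budget, Expert Choice fixes the per-expert budget.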