Routers in Vision Mixture of Experts: An Empirical Study
Authors: Tianlin Liu, Mathieu Blondel, Carlos Riquelme Ruiz, Joan Puigcerver
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models. |
| Researcher Affiliation | Collaboration | Tianlin Liu, University of Basel; Mathieu Blondel, Google DeepMind; Carlos Riquelme, Stability AI; Joan Puigcerver, Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Token Choice allocation; Algorithm 2: Expert Choice allocation |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | For the pre-training experiments, all models were trained on JFT-300M (Sun et al., 2017), which contains about 305 million training images and 50,000 validation images, organized in a hierarchy of 18,291 different classes. To avoid overlap with the validation and test sets of JFT-300M, the images in the dataset were deduplicated, as done in Kolesnikov et al. (2020). To assess how well pre-trained MoE models adapt to new tasks, we conducted few-shot adaptation experiments using the ImageNet-1k dataset (Deng et al., 2009). |
| Dataset Splits | Yes | For the pre-training experiments, all models were trained on JFT-300M (Sun et al., 2017), which contains about 305 million training images and 50,000 validation images... To assess how well pre-trained MoE models adapt to new tasks, we conducted few-shot adaptation experiments using the ImageNet-1k dataset (Deng et al., 2009). In these experiments, we used 10 image samples per class from ImageNet-1k. The pre-trained model extracts a fixed feature embedding for each image, which is then used to train a linear regression model. This linear model maps the extracted features to the one-hot encoded target labels. This procedure is in line with the 10-shot evaluation procedure described by Dosovitskiy et al. (2021); Riquelme et al. (2021). |
| Hardware Specification | Yes | Training cost is reported in TPUv3-days (column header in Tables 1, 2, and 3, which compare routers in the B32, B16, and L16 architectures on the JFT dataset), indicating training on TPUv3 hardware. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We fix the total number of experts to be 32; that is, E = 32. For Softmax Token Choice and Sinkhorn Token Choice routers that process each token with k experts, we experiment with k = 1 and k = 2. In this way, the buffer capacity (the maximum number of tokens an expert can process in a batch) of these variants is C = round(k T/E). For Softmax Expert Choice, Sinkhorn Expert Choice, and sparsity-constrained variants, we control the buffer capacity C through a capacity factor c, which plays a role similar to k. The buffer capacity C is defined through C = round(c T/E). We experiment with c = 1 or 2, which matches the buffer capacity of k = 1 and k = 2 in the Token Choice cases. For Soft MoE routers, we set the number of slots per expert to be C = round(c T/E) with c = 1 or 2, just like in the Expert Choice case. |
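The buffer-capacity rule quoted in the setup row, C = round(k·T/E) for Token Choice and C = round(c·T/E) for Expert Choice and Soft MoE, is simple enough to sketch directly. The function below is an illustrative helper, not code from the paper; the numbers T = 1024 and E = 32 are assumptions chosen to match the paper's fixed expert count.

```python
def buffer_capacity(num_tokens: int, num_experts: int, factor: float) -> int:
    """Buffer capacity C = round(factor * T / E).

    `factor` plays the role of k for Token Choice routers and of the
    capacity factor c for Expert Choice and Soft MoE routers.
    """
    return round(factor * num_tokens / num_experts)

# With E = 32 experts and a hypothetical batch of T = 1024 tokens:
print(buffer_capacity(1024, 32, 1))  # k = 1 (or c = 1) -> 32 tokens per expert
print(buffer_capacity(1024, 32, 2))  # k = 2 (or c = 2) -> 64 tokens per expert
```

This makes concrete why c = 1 and c = 2 "match the buffer capacity of k = 1 and k = 2": both plug the same factor into the same formula.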
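The pseudocode row points to Algorithm 1 (Token Choice allocation) and Algorithm 2 (Expert Choice allocation). A minimal sketch of the distinction, under the assumption that routing is a top-k selection over a (tokens × experts) score matrix (this is an illustration of the general idea, not the paper's exact algorithms, which additionally handle capacity limits and tie-breaking):

```python
import numpy as np

def token_choice(scores: np.ndarray, k: int) -> np.ndarray:
    """Token Choice: each token selects its k highest-scoring experts.

    scores: (T, E) router scores. Returns (T, k) expert indices.
    Every token is routed, but a popular expert may exceed its buffer.
    """
    return np.argsort(-scores, axis=1)[:, :k]

def expert_choice(scores: np.ndarray, capacity: int) -> np.ndarray:
    """Expert Choice: each expert selects its `capacity` highest-scoring tokens.

    scores: (T, E) router scores. Returns (E, capacity) token indices.
    Load is balanced by construction, but some tokens may go unrouted.
    """
    return np.argsort(-scores, axis=0)[:capacity, :].T

rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 4))          # 8 tokens, 4 experts (toy sizes)
print(token_choice(scores, k=1).shape)        # (8, 1): one expert per token
print(expert_choice(scores, capacity=2).shape)  # (4, 2): two tokens per expert
```

The shapes capture the asymmetry the paper studies: Token Choice fixes the per-token budget, Expert Choice fixes the per-expert budget.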