Mixture of Experts for Image Classification: What's the Sweet Spot?

Authors: Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We conduct a series of experiments considering various architecture configurations. Likewise, we investigate the impact of various components, including the number of experts and their sizes, the gate design, and the layer positions, among others.
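The design space the quoted passage describes (number of experts, expert size, gate design) can be illustrated with a minimal top-1 gated MoE layer. This is a generic NumPy sketch of the standard technique, not the authors' implementation; all dimensions and the ReLU expert MLP are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the paper.
d_model, d_hidden, n_experts = 16, 32, 4

# Per-expert two-layer MLPs (the usual expert shape in vision MoE layers).
W1 = rng.standard_normal((n_experts, d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((n_experts, d_hidden, d_model)) * 0.02
# Linear gate producing one logit per expert.
Wg = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-1 expert, weighted by the gate probability."""
    logits = x @ Wg                                   # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    top1 = probs.argmax(axis=-1)                      # chosen expert per token
    out = np.zeros_like(x)
    for e in range(n_experts):
        idx = np.where(top1 == e)[0]                  # tokens routed to expert e
        if idx.size == 0:
            continue
        h = np.maximum(x[idx] @ W1[e], 0.0)           # expert MLP with ReLU
        out[idx] = (h @ W2[e]) * probs[idx, e:e + 1]  # scale by gate weight
    return out

tokens = rng.standard_normal((8, d_model))
y = moe_layer(tokens)
print(y.shape)  # (8, 16): output keeps the token representation size
```

Varying `n_experts`, `d_hidden`, the gate (top-1 vs. top-k), and where such a layer is placed in the network are exactly the axes the paper's ablations sweep.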
Researcher Affiliation | Collaboration | Mathurin Videau EMAIL Meta AI, TAU, INRIA, and LISN (CNRS & Univ. Paris-Saclay); Alessandro Leite EMAIL INSA Rouen Normandy, University of Rouen Normandy, LITIS UR 4108; Marc Schoenauer EMAIL TAU, INRIA and LISN (CNRS & Univ. Paris-Saclay); Olivier Teytaud Thales, CortAIx-Labs
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies in narrative text and uses diagrams to illustrate architectural components (e.g., Figure 1).
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is released, nor does it provide a link to a code repository. It only provides a link to its OpenReview page.
Open Datasets | Yes | In this work, we focus on leveraging the potential of MoE models for image classification on ImageNet-1k and ImageNet-21k (Russakovsky et al., 2015).
Dataset Splits | Yes | Tab. 12 presents the results obtained on the ImageNet-1k validation set by a model that has been entirely trained on ImageNet-1k, for isotropic architectures (e.g., ViT, ConvNeXt iso.) and a hierarchical architecture, namely ConvNeXt. Tab. 2 presents the results of models that are pre-trained on ImageNet-21k and tested on the same ImageNet-1k validation set as above.
Hardware Specification | Yes | Throughput is measured on V100 GPUs, following (Touvron et al., 2021).
Software Dependencies | No | The paper refers to training hyperparameters similar to other works (Touvron et al., 2022; Liu et al., 2022) but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Furthermore, when working with the ImageNet-1k dataset, we use a strong data-augmentation pipeline, including Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), RandAugment (Cubuk et al., 2020), and Random Erasing (Zhong et al., 2020), over 300 epochs. Likewise, we utilize drop path, weight decay, and expert-specific weight decay as regularization strategies. Comprehensive details of all the hyperparameters are provided in Tab. 10 in App. A.
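Of the augmentations quoted above, Mixup is simple enough to sketch directly. This is a generic NumPy version of Mixup (Zhang et al., 2018), not the paper's actual pipeline; the `alpha=0.8` default is a common ImageNet-scale choice assumed here, not necessarily the authors' value.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(images, labels_onehot, alpha=0.8):
    """Mixup: blend each example in a batch with a randomly chosen partner.

    A single mixing coefficient lam ~ Beta(alpha, alpha) is drawn for the
    batch; inputs and one-hot targets are interpolated with the same lam.
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))            # random partner for each example
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y

x = rng.random((4, 3, 32, 32))                     # toy batch of 4 RGB images
y = np.eye(10)[rng.integers(0, 10, size=4)]        # one-hot labels, 10 classes
mx, my = mixup(x, y)
print(mx.shape)                                    # (4, 3, 32, 32): shape preserved
```

The soft targets `my` still sum to 1 per example, so the usual cross-entropy loss applies unchanged; CutMix follows the same target-mixing idea but splices rectangular image regions instead of blending pixels.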