Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts
Authors: Youngseog Chung, Dhruv Malik, Jeff Schneider, Yuanzhi Li, Aarti Singh
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE’s success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). [...] We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference. For example, using our method on ImageNet, one can perform inference using only 1/8 of the experts and still retain 99% of the test accuracy of using all experts. |
| Researcher Affiliation | Academia | Youngseog Chung EMAIL Dhruv Malik EMAIL Jeff Schneider EMAIL Yuanzhi Li EMAIL Aarti Singh EMAIL Machine Learning Department Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Best Expert Subset Selection; Algorithm 2: Best Expert Subset Selection for Batch of Inputs |
| Open Source Code | No | The paper relies on publicly available third-party implementations and codebases (e.g., a PyTorch implementation of the Soft MoE variant of ViT, the public Astroformer-1 codebase) but does not provide an explicit statement or link to the authors' own source code for the methodology described in the paper, such as Algorithm 1 or 2. For example, 'In implementing these models, we relied on a publicly available PyTorch implementation of Soft MoE variant of ViT (https://github.com/bwconrad/soft-moe)' and 'To train our Soft MoE variant of the Astroformer-1 models, we followed the exact same training procedure as that provided in the Astroformer paper (Dagli, 2023) and their public codebase'. |
| Open Datasets | Yes | on the MNIST dataset (LeCun et al., 2010). [...] We experiment on 4 datasets: MNIST, CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet-1k (Deng et al., 2009). |
| Dataset Splits | Yes | We trained with the standard train set of 60,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] We trained with the standard train set of 50,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] We trained with the standard train set of 1.3M datapoints and used the validation set of 50,000 datapoints for evaluation. |
| Hardware Specification | Yes | We had access to 8 NVIDIA A6000 GPUs to train all of our models. [...] The experiment was done on a single NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch (Ansel et al., 2024), NumPy (Harris et al., 2020), TensorFlow (Abadi et al., 2015), JAX (Bradbury et al., 2018), and the Adam optimizer (Kingma & Ba, 2015). However, it does not provide specific version numbers for any of these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Tables 4 and 5 provide a summary of the model and training procedure. To elaborate further, each model is trained with the Adam optimizer (Kingma & Ba, 2015) using default optimizer parameters for 15 epochs. We trained with the standard train set of 60,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] Table 7: Key hyperparameters used for CIFAR10 and CIFAR100 experiments in Section 4.4. [...] Table 9: Key hyperparameter settings for ImageNet-1k experiments in Section 4.4. |
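The table above references the paper's "Best Expert Subset Selection" (Algorithms 1 and 2), which lets Soft MoE inference run with a fraction of the experts (e.g., 1/8 on ImageNet at 99% of full accuracy). The paper's exact procedure is not reproduced here; below is a minimal, hypothetical sketch of one plausible variant: rank experts by their total routing (dispatch) weight over a batch and keep only the top-k. The function name, shapes, and scoring rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_expert_subset(dispatch_weights, k):
    """Illustrative sketch: pick the k experts with the largest total
    dispatch weight over a batch of tokens.

    dispatch_weights: array of shape (num_tokens, num_experts); each row is
    a softmax over experts, as in Soft MoE routing. This aggregate-weight
    scoring rule is an assumption, not the paper's Algorithm 1.
    """
    totals = dispatch_weights.sum(axis=0)       # total weight per expert
    return np.argsort(totals)[::-1][:k]         # indices of the top-k experts

# Toy usage: 6 tokens routed over 8 experts, keep 2 (analogous to 1/8 pruning).
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 8))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
subset = select_expert_subset(weights, k=2)
```

At inference time, one would then evaluate only the selected experts and renormalize the routing weights over that subset; the quoted abstract suggests this kind of pruning preserves most of the test accuracy.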