Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts
Authors: Youngseog Chung, Dhruv Malik, Jeff Schneider, Yuanzhi Li, Aarti Singh
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE’s success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). [...] We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference. For example, using our method on ImageNet, one can perform inference using only 1/8 of the experts and still retain 99% of the test accuracy of using all experts. |
| Researcher Affiliation | Academia | Youngseog Chung EMAIL Dhruv Malik EMAIL Jeff Schneider EMAIL Yuanzhi Li EMAIL Aarti Singh EMAIL Machine Learning Department Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Best Expert Subset Selection; Algorithm 2: Best Expert Subset Selection for Batch of Inputs |
| Open Source Code | No | The paper relies on publicly available third-party implementations and codebases (e.g., a PyTorch implementation of the Soft MoE variant of ViT, the public Astroformer-1 codebase) but does not provide an explicit statement or link to the authors' own source code for the methodology described in the paper, such as Algorithm 1 or 2. For example, 'In implementing these models, we relied on a publicly available PyTorch implementation of Soft MoE variant of ViT (https://github.com/bwconrad/soft-moe)' and 'To train our Soft MoE variant of the Astroformer-1 models, we followed the exact same training procedure as that provided in the Astroformer paper (Dagli, 2023) and their public codebase'. |
| Open Datasets | Yes | on the MNIST dataset (LeCun et al., 2010). [...] We experiment on 4 datasets: MNIST, CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet-1k (Deng et al., 2009). |
| Dataset Splits | Yes | We trained with the standard train set of 60,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] We trained with the standard train set of 50,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] We trained with the standard train set of 1.3M datapoints and used the validation set of 50,000 datapoints for evaluation. |
| Hardware Specification | Yes | We had access to 8 NVIDIA A6000 GPUs to train all of our models. [...] The experiment was done on a single NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch (Ansel et al., 2024), NumPy (Harris et al., 2020), TensorFlow (Abadi et al., 2015), JAX (Bradbury et al., 2018), and the Adam optimizer (Kingma & Ba, 2015). However, it does not provide specific version numbers for any of these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Tables 4 and 5 provide a summary of the model and training procedure. To elaborate further, each model is trained with the Adam optimizer (Kingma & Ba, 2015) using default optimizer parameters for 15 epochs. We trained with the standard train set of 60,000 datapoints and used the test set of 10,000 datapoints for evaluation. [...] Table 7: Key hyperparameters used for CIFAR10 and CIFAR100 experiments in Section 4.4. [...] Table 9: Key hyperparameter settings for ImageNet-1k experiments in Section 4.4. |
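The table above references the paper's "Best Expert Subset Selection" (Algorithms 1 and 2), which lets Soft MoE inference run with a fraction of the experts (e.g., 1/8 on ImageNet at 99% of full accuracy). The paper's exact procedure is not reproduced here; below is a minimal, hypothetical sketch of one plausible variant: rank experts by their total routing (dispatch) weight over a batch and keep only the top-k. The function name, shapes, and scoring rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_expert_subset(dispatch_weights, k):
    """Illustrative sketch: pick the k experts with the largest total
    dispatch weight over a batch of tokens.

    dispatch_weights: array of shape (num_tokens, num_experts); each row is
    a softmax over experts, as in Soft MoE routing. This aggregate-weight
    scoring rule is an assumption, not the paper's Algorithm 1.
    """
    totals = dispatch_weights.sum(axis=0)       # total weight per expert
    return np.argsort(totals)[::-1][:k]         # indices of the top-k experts

# Toy usage: 6 tokens routed over 8 experts, keep 2 (analogous to 1/8 pruning).
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 8))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
subset = select_expert_subset(weights, k=2)
```

At inference time, one would then evaluate only the selected experts and renormalize the routing weights over that subset; the quoted abstract suggests this kind of pruning preserves most of the test accuracy.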