On the Role of Discrete Representation in Sparse Mixture of Experts
Authors: Giang Do, Kha Pham, Hung Le, Truyen Tran
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical support and empirical evidence demonstrating VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods while maintaining strong performance in fine-tuning tasks. We conduct experiments to investigate the following hypotheses: (i) VQMoE offers an effective training algorithm for Sparse Mixture-of-Experts (SMoE) in large language models (LLMs); (ii) VQMoE enables efficient fine-tuning; and (iii) VQMoE outperforms other routing methods across multiple domains. To evaluate the three hypotheses, we conduct experiments across both vision and language tasks. |
| Researcher Affiliation | Academia | Giang Do (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Kha Pham (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Hung Le (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Truyen Tran (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University |
| Pseudocode | No | The paper describes methods using mathematical equations and descriptive text, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | The code is publicly available at https://github.com/giangdip2410/VQMoE. |
| Open Datasets | Yes | For pretraining language models, we assess two standard benchmarks: (i) character-level language modeling using enwik8 and text8 (Mahoney, 2011), and (ii) word-level language modeling using WikiText-103 (Merity et al., 2016) and the more challenging One Billion Word (lm1b) dataset (Chelba et al., 2014). For parameter-efficient fine-tuning, we fine-tune models pre-trained on enwik8 using four widely used NLP datasets: SST-2, SST-5 (Socher et al., 2013), IMDB (Maas et al., 2011), and BANKING77 (Casanueva et al., 2020). For vision tasks, we employ the Vision Transformer (ViT) (Dosovitskiy et al., 2021) and compare our routing method with state-of-the-art alternatives on five benchmark image classification datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), SVHN (Netzer et al., 2011), and ImageNet-1K (Deng et al., 2009). |
| Dataset Splits | Yes | All experiments use the standard training, validation, and test split with a 90:5:5 ratio, as in Child et al. (2019). |
| Hardware Specification | Yes | The pre-training was conducted on two H100 GPUs, so results might differ when using parallel training on multiple GPUs. |
| Software Dependencies | No | The paper mentions the use of an Adam (Kingma & Ba, 2017) optimizer, a Cosine Annealing learning rate schedule (Loshchilov & Hutter, 2017), and the SMoE-Dropout implementation (Chen et al., 2023a), but does not provide specific version numbers for software libraries or programming languages. |
| Experiment Setup | Yes | We use an Adam (Kingma & Ba, 2017) optimizer with a Cosine Annealing learning rate schedule (Loshchilov & Hutter, 2017). The lowest validation loss checkpoint is used to report the final performance on the test set. For the language modeling experiments, we optimize the base models and the large models for 100,000 steps. Table 9: Hyperparameter settings for pre-training experiments on enwik8, text8, WikiText-103, and One Billion Word (columns: Dataset, Input length, Batch size, Optimizer, Lr, # Training Steps, # Experts, Top-K). Table 10: Detailed settings for fine-tuning experiments on the evaluation datasets (columns: Dataset, Input length, Batch size, Optimizer, Lr, # Epochs). |
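The Experiment Setup row reports Adam with a cosine annealing schedule over 100,000 pre-training steps. A minimal, framework-free sketch of that schedule follows; the peak learning rate `3e-4` is an illustrative assumption, not a value taken from the paper's Table 9:

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate (Loshchilov & Hutter, 2017):
    decays smoothly from lr_max at step 0 to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Illustrative peak LR; the paper's actual hyperparameters are in its Table 9.
LR_MAX, TOTAL_STEPS = 3e-4, 100_000
for s in (0, 25_000, 50_000, 100_000):
    print(s, cosine_annealing_lr(s, TOTAL_STEPS, LR_MAX))
```

At step 0 the rate equals `lr_max`, at the halfway point it has decayed to the midpoint between `lr_max` and `lr_min`, and at `total_steps` it reaches `lr_min`.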
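The Dataset Splits row cites a 90:5:5 train/validation/test ratio following Child et al. (2019). A plain-Python sketch of one way to take such a split is below; the contiguous slicing is an assumption about the split procedure, which the paper itself does not spell out:

```python
def split_90_5_5(data):
    """Split a sequence into contiguous train/valid/test chunks
    with a 90:5:5 ratio (assumed contiguous, as is typical for
    character- and word-level language-modeling corpora)."""
    n = len(data)
    n_train = int(n * 0.90)
    n_valid = int(n * 0.05)
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test

# A 1000-item toy corpus yields 900 / 50 / 50 items.
train, valid, test = split_90_5_5(list(range(1000)))
```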