On the Role of Discrete Representation in Sparse Mixture of Experts
Authors: Giang Do, Kha Pham, Hung Le, Truyen Tran
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical support and empirical evidence demonstrating VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods while maintaining strong performance in fine-tuning tasks. We conduct experiments to investigate the following hypotheses: (i) VQMoE offers an effective training algorithm for Sparse Mixture-of-Experts (SMoE) in large language models (LLMs); (ii) VQMoE enables efficient fine-tuning; and (iii) VQMoE outperforms other routing methods across multiple domains. To evaluate the three hypotheses, we conduct experiments across both vision and language tasks. |
| Researcher Affiliation | Academia | Giang Do (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Kha Pham (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Hung Le (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University; Truyen Tran (EMAIL), Applied Artificial Intelligence Initiative (A2I2), Deakin University |
| Pseudocode | No | The paper describes methods using mathematical equations and descriptive text, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | The code is publicly available at https://github.com/giangdip2410/VQMoE. |
| Open Datasets | Yes | For pretraining language models, we assess two standard benchmarks: (i) character-level language modeling using enwik8 and text8 (Mahoney, 2011), and (ii) word-level language modeling using WikiText-103 (Merity et al., 2016) and the more challenging One Billion Word (lm1b) dataset (Chelba et al., 2014). For parameter-efficient fine-tuning, we fine-tune models pre-trained on enwik8 using four widely used NLP datasets: SST-2, SST-5 (Socher et al., 2013), IMDB (Maas et al., 2011), and BANKING77 (Casanueva et al., 2020). For vision tasks, we employ the Vision Transformer (ViT) (Dosovitskiy et al., 2021) and compare our routing method with state-of-the-art alternatives on five benchmark image classification datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), SVHN (Netzer et al., 2011), and ImageNet-1K (Deng et al., 2009). |
| Dataset Splits | Yes | All experiments use the standard training, validation, and test split with a 90:5:5 ratio, as in Child et al. (2019). |
| Hardware Specification | Yes | The pre-training was conducted on two H100 GPUs, so results might differ when using parallel training on multiple GPUs. |
| Software Dependencies | No | The paper mentions the use of an Adam (Kingma & Ba, 2017) optimizer, a Cosine Annealing learning rate schedule (Loshchilov & Hutter, 2017), and the SMoE-Dropout implementation (Chen et al., 2023a), but does not provide specific version numbers for software libraries or programming languages. |
| Experiment Setup | Yes | We use an Adam (Kingma & Ba, 2017) optimizer with a Cosine Annealing learning rate schedule (Loshchilov & Hutter, 2017). The lowest validation loss checkpoint is used to report the final performance on the test set. For the language modeling experiments, we optimize the base models and the large models for 100,000 steps. Table 9: Hyperparameter settings for pre-training experiments on enwik8, text8, WikiText-103, and One Billion Word (columns: Dataset, Input length, Batch size, Optimizer, Lr, # Training Steps, # Experts, Top-K). Table 10: Detailed settings for fine-tuning experiments on the evaluation datasets (columns: Dataset, Input length, Batch size, Optimizer, Lr, # Epochs). |
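The Experiment Setup row reports Adam with a cosine annealing schedule over 100,000 pre-training steps. A minimal, framework-free sketch of that schedule follows; the peak learning rate `3e-4` is an illustrative assumption, not a value taken from the paper's Table 9:

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate (Loshchilov & Hutter, 2017):
    decays smoothly from lr_max at step 0 to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Illustrative peak LR; the paper's actual hyperparameters are in its Table 9.
LR_MAX, TOTAL_STEPS = 3e-4, 100_000
for s in (0, 25_000, 50_000, 100_000):
    print(s, cosine_annealing_lr(s, TOTAL_STEPS, LR_MAX))
```

At step 0 the rate equals `lr_max`, at the halfway point it has decayed to the midpoint between `lr_max` and `lr_min`, and at `total_steps` it reaches `lr_min`.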
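The Dataset Splits row cites a 90:5:5 train/validation/test ratio following Child et al. (2019). A plain-Python sketch of one way to take such a split is below; the contiguous slicing is an assumption about the split procedure, which the paper itself does not spell out:

```python
def split_90_5_5(data):
    """Split a sequence into contiguous train/valid/test chunks
    with a 90:5:5 ratio (assumed contiguous, as is typical for
    character- and word-level language-modeling corpora)."""
    n = len(data)
    n_train = int(n * 0.90)
    n_valid = int(n * 0.05)
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test

# A 1000-item toy corpus yields 900 / 50 / 50 items.
train, valid, test = split_90_5_5(list(range(1000)))
```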