Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Authors: Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James R Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements (6.5%p on average) over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity, the applicability of Self-MoE to multiple base LLMs, and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.
Researcher Affiliation | Collaboration | Junmo Kang (Georgia Tech), Leonid Karlinsky (MIT-IBM Watson AI Lab), Hongyin Luo (MIT), Zhen Wang (UCSD)
Pseudocode | No | The paper describes the method in Section 3 and illustrates it conceptually in Figure 2, but it does not contain a formal pseudocode block or algorithm.
Open Source Code | No | We use Huggingface PEFT (Mangrulkar et al., 2022) and XLoRA (Buehler & Buehler, 2024) for the implementation of MoE compatible with LoRA.
Open Datasets | Yes | Datasets. We evaluate Self-MoE across diverse domains categorized into knowledge, reasoning, math, and coding: MMLU (0 & 5-shot) (Hendrycks et al., 2021a), BBH (3-shot) (Suzgun et al., 2022), GSM8K (8-shot) (Cobbe et al., 2021), and HumanEval (0-shot) (Chen et al., 2021), respectively. For MMLU, we primarily employ the 0-shot setting unless otherwise specified, based on established observations (Dettmers et al., 2023; Lin et al., 2024) that tuning yields only marginal effects in the 5-shot setting for this task. To test generalization (Section 4.4), we additionally evaluate on MATH (4-shot) (Hendrycks et al., 2021b), MBPP (3-shot) (Austin et al., 2021), Natural Questions (5-shot) (Kwiatkowski et al., 2019), TriviaQA (5-shot) (Joshi et al., 2017), HellaSwag (0-shot) (Zellers et al., 2019), PIQA (0-shot) (Bisk et al., 2020), and TruthfulQA (0-shot) (Lin et al., 2022).
Dataset Splits | No | The paper specifies using 0-shot, 3-shot, 4-shot, 5-shot, and 8-shot settings for evaluation on standard benchmarks, and mentions sampling 100 training instances as seed data for synthetic data generation. However, it does not explicitly provide the train/validation/test splits for its own fine-tuning process using the generated synthetic data.
Hardware Specification | Yes | We train each module and MiXSE using a standard Alpaca (Taori et al., 2023) prompt template on a single A100-80GB, which takes only a few hours.
Software Dependencies | No | We use Huggingface PEFT (Mangrulkar et al., 2022) and XLoRA (Buehler & Buehler, 2024) for the implementation of MoE compatible with LoRA.
Experiment Setup | Yes | For specialization, we use LoRA applied to all modules with a rank of 8 and alpha of 16, and train it using a learning rate of 3e-4, epochs of 3, and batch size of 32.
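The rows above name the building blocks (PEFT, XLoRA, LoRA rank 8 / alpha 16) but, as noted, the paper ships no pseudocode. The following is a minimal numpy sketch of what a routed mixture of LoRA experts computes at a single linear layer: a frozen base weight plus a gated sum of low-rank adapter deltas. The layer shapes, the soft (dense) routing, and all function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, RANK, ALPHA, N_EXPERTS = 64, 8, 16, 4  # rank 8 / alpha 16 from the paper's setup

# Frozen base weight, standing in for one linear layer of the base LLM.
W_base = rng.normal(scale=0.02, size=(D, D))

# One LoRA adapter (A, B) per self-specialized expert:
#   effective delta = (alpha / rank) * B @ A
experts = []
for _ in range(N_EXPERTS):
    A = rng.normal(scale=0.02, size=(RANK, D))
    B = np.zeros((D, RANK))  # B starts at zero, as in standard LoRA init
    experts.append((A, B))

# Lightweight router producing one logit per expert per token.
W_router = rng.normal(scale=0.02, size=(D, N_EXPERTS))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixse_layer(x):
    """Token-wise soft routing over LoRA experts on top of a frozen base layer."""
    base_out = x @ W_base.T
    gates = softmax(x @ W_router)            # (tokens, N_EXPERTS)
    delta = np.zeros_like(base_out)
    for i, (A, B) in enumerate(experts):
        lora_out = (x @ A.T) @ B.T * (ALPHA / RANK)
        delta += gates[:, i:i + 1] * lora_out
    return base_out + delta

x = rng.normal(size=(5, D))  # 5 tokens
y = mixse_layer(x)
```

Because each `B` is initialized to zero, the routed deltas vanish and the layer initially reproduces the base model's output; training the adapters and router is what differentiates the semantic experts.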
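The hardware row states that training uses the standard Alpaca prompt template. For reference, the no-input variant of that template (from the public Alpaca repository; the wrapper function name is ours, not the paper's) looks like:

```python
# Standard Alpaca prompt template, no-input variant (Taori et al., 2023).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def format_alpaca(instruction: str) -> str:
    """Wrap a raw instruction in the Alpaca prompt (illustrative helper)."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

prompt = format_alpaca("Solve 2 + 2.")
```

During fine-tuning, the target response is appended after the `### Response:` marker and the loss is typically computed only on the response tokens.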