Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Authors: Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James R Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements (6.5%p on average) over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity, the applicability of Self-MoE to multiple base LLMs, and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.
Researcher Affiliation | Collaboration | Junmo Kang (Georgia Tech), Leonid Karlinsky (MIT-IBM Watson AI Lab), Hongyin Luo (MIT), Zhen Wang (UCSD)
Pseudocode | No | The paper describes the method in Section 3 and illustrates it conceptually in Figure 2, but it does not contain a formal pseudocode block or algorithm.
Open Source Code | No | We use Huggingface PEFT (Mangrulkar et al., 2022) and XLoRA (Buehler & Buehler, 2024) for the implementation of MoE compatible with LoRA.
Open Datasets | Yes | Datasets. We evaluate Self-MoE across diverse domains categorized into knowledge, reasoning, math, and coding: MMLU (0 & 5-shot) (Hendrycks et al., 2021a), BBH (3-shot) (Suzgun et al., 2022), GSM8K (8-shot) (Cobbe et al., 2021), and HumanEval (0-shot) (Chen et al., 2021), respectively. For MMLU, we primarily employ the 0-shot setting unless otherwise specified, based on established observations (Dettmers et al., 2023; Lin et al., 2024) that tuning yields only marginal effects in the 5-shot setting for this task. To test generalization (Section 4.4), we additionally evaluate on MATH (4-shot) (Hendrycks et al., 2021b), MBPP (3-shot) (Austin et al., 2021), Natural Questions (5-shot) (Kwiatkowski et al., 2019), TriviaQA (5-shot) (Joshi et al., 2017), HellaSwag (0-shot) (Zellers et al., 2019), PIQA (0-shot) (Bisk et al., 2020), and TruthfulQA (0-shot) (Lin et al., 2022).
Dataset Splits | No | The paper specifies using 0-shot, 3-shot, 4-shot, 5-shot, and 8-shot settings for evaluation on standard benchmarks, and mentions sampling 100 training instances as seed data for synthetic data generation. However, it does not explicitly provide the train/validation/test splits for its own fine-tuning process using the generated synthetic data.
Hardware Specification | Yes | We train each module and MiXSE using a standard Alpaca (Taori et al., 2023) prompt template on a single A100-80GB, which takes only a few hours.
Software Dependencies | No | We use Huggingface PEFT (Mangrulkar et al., 2022) and XLoRA (Buehler & Buehler, 2024) for the implementation of MoE compatible with LoRA.
Experiment Setup | Yes | For specialization, we use LoRA applied to all modules with a rank of 8 and alpha of 16, and train it using a learning rate of 3e-4, epochs of 3, and batch size of 32.
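The rows above name the building blocks (PEFT, XLoRA, LoRA rank 8 / alpha 16) but, as noted, the paper ships no pseudocode. The following is a minimal numpy sketch of what a routed mixture of LoRA experts computes at a single linear layer: a frozen base weight plus a gated sum of low-rank adapter deltas. The layer shapes, the soft (dense) routing, and all function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, RANK, ALPHA, N_EXPERTS = 64, 8, 16, 4  # rank 8 / alpha 16 from the paper's setup

# Frozen base weight, standing in for one linear layer of the base LLM.
W_base = rng.normal(scale=0.02, size=(D, D))

# One LoRA adapter (A, B) per self-specialized expert:
#   effective delta = (alpha / rank) * B @ A
experts = []
for _ in range(N_EXPERTS):
    A = rng.normal(scale=0.02, size=(RANK, D))
    B = np.zeros((D, RANK))  # B starts at zero, as in standard LoRA init
    experts.append((A, B))

# Lightweight router producing one logit per expert per token.
W_router = rng.normal(scale=0.02, size=(D, N_EXPERTS))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixse_layer(x):
    """Token-wise soft routing over LoRA experts on top of a frozen base layer."""
    base_out = x @ W_base.T
    gates = softmax(x @ W_router)            # (tokens, N_EXPERTS)
    delta = np.zeros_like(base_out)
    for i, (A, B) in enumerate(experts):
        lora_out = (x @ A.T) @ B.T * (ALPHA / RANK)
        delta += gates[:, i:i + 1] * lora_out
    return base_out + delta

x = rng.normal(size=(5, D))  # 5 tokens
y = mixse_layer(x)
```

Because each `B` is initialized to zero, the routed deltas vanish and the layer initially reproduces the base model's output; training the adapters and router is what differentiates the semantic experts.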
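The hardware row states that training uses the standard Alpaca prompt template. For reference, the no-input variant of that template (from the public Alpaca repository; the wrapper function name is ours, not the paper's) looks like:

```python
# Standard Alpaca prompt template, no-input variant (Taori et al., 2023).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def format_alpaca(instruction: str) -> str:
    """Wrap a raw instruction in the Alpaca prompt (illustrative helper)."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

prompt = format_alpaca("Solve 2 + 2.")
```

During fine-tuning, the target response is appended after the `### Response:` marker and the loss is typically computed only on the response tokens.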