Autonomy-of-Experts Models

Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency. The code is available at https://github.com/trestad/Autonomy-of-Experts. ... 4. Experiments
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Large Language Model Department, Tencent; 3 Southeast University, China; 4 University of Macau; 5 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; 6 School of Artificial Intelligence, Wuhan University. Correspondence to: Ruobing Xie <EMAIL>, Rui Yan <EMAIL>.
Pseudocode | Yes | Algorithm 1: A working pipeline of an MoE layer ... Algorithm 2: A working pipeline of an AoE layer
Open Source Code | Yes | The code is available at https://github.com/trestad/Autonomy-of-Experts
Open Datasets | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... We train models on 100 billion tokens from RedPajama (Computer, 2023)... We conduct a comprehensive evaluation of language models across a range of widely used tasks, including ARC-easy (Clark et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), Winogrande (Sakaguchi et al., 2019), HellaSwag (Zellers et al., 2019), MNLI (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), QNLI (Wang et al., 2019), QQP (Wang et al., 2019), and SST-2 (Socher et al., 2013). ... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019).
Dataset Splits | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... The first five tasks are evaluated zero-shot, while the remaining tasks are tested three-shot because models exhibit unstable performance in zero-shot scenarios, with most errors arising from incorrect answer formats. ... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019).
Hardware Specification | Yes | The accuracy on two challenging tasks is reported, along with the time cost (in minutes) for 8 A800-80G GPUs, which is given in parentheses.
Software Dependencies | No | No specific software versions for key dependencies such as Python, PyTorch, or CUDA are provided in the paper. It mentions using the LM Evaluation Harness and the AdamW optimizer, but without version numbers.
Experiment Setup | Yes | We train small language models consisting of 12 layers, each containing 12 attention heads. Each layer contains 8 experts, with the top-K = 2 experts selected. Models use the Llama (Touvron et al., 2023) vocabulary of size 32,000 and the same pre-RMSNorm (Zhang & Sennrich, 2019) module. We set d_model = 768 and d_ffn = 3,072 for traditional MoE models, while the values of d_low and d_wide for AoE models are variable... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens, a learning rate of 2 × 10^-4, and a linear warmup over the first 4,800 steps, followed by a cosine decay schedule that reduces the learning rate to 1.28 × 10^-5 (Tow et al., 2024). The AdamW optimizer (Loshchilov & Hutter, 2019) is employed with (β1, β2) = (0.9, 0.95), a gradient norm clipping threshold of 1, and a weight decay of 0.1.
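The pseudocode row above notes that the paper gives working pipelines for an MoE layer (Algorithm 1) and an AoE layer (Algorithm 2). The sketch below illustrates the AoE idea for a single token under stated assumptions: each expert's first projection is factorized into a low-rank pair (here named A and B, hypothetical identifiers), every expert computes its low-dimensional activation, and the activation norm serves as the selection score in place of a separate router. The ReLU stand-in, the score normalization into mixing weights, and the toy sizes are this sketch's assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_LOW, D_FFN = 8, 4, 16  # toy sizes; the setup above uses 768 / variable d_low / 3072
N_EXPERTS, TOP_K = 8, 2           # matches the "8 experts, top-K = 2" setup quoted above

# Per-expert factorized weights: the first projection is split into
# A (d_model -> d_low) and B (d_low -> d_ffn), so every expert can cheaply
# compute a low-dimensional activation before any expert is selected.
A = rng.standard_normal((N_EXPERTS, D_MODEL, D_LOW)) / np.sqrt(D_MODEL)
B = rng.standard_normal((N_EXPERTS, D_LOW, D_FFN)) / np.sqrt(D_LOW)
W2 = rng.standard_normal((N_EXPERTS, D_FFN, D_MODEL)) / np.sqrt(D_FFN)

def aoe_layer(x):
    """AoE-style forward pass for one token (illustrative sketch only)."""
    low = np.einsum("d,edl->el", x, A)       # every expert's low-dim activation
    scores = np.linalg.norm(low, axis=-1)    # activation norms act as routing scores
    top = np.argsort(scores)[-TOP_K:]        # the K most strongly activated experts
    gates = scores[top] / scores[top].sum()  # normalize scores into mixing weights (assumption)
    out = np.zeros(D_MODEL)
    for g, e in zip(gates, top):
        h = np.maximum(low[e] @ B[e], 0.0)   # ReLU stands in for the real activation
        out += g * (h @ W2[e])
    return out, top

y, chosen = aoe_layer(rng.standard_normal(D_MODEL))
```

The contrast with a traditional MoE layer is that here no separate gating network is trained: only the low-rank activations of the unselected experts are discarded, which is what makes the selection "autonomous".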
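The learning-rate schedule quoted in the setup row (linear warmup over 4,800 steps to 2 × 10^-4, then cosine decay to 1.28 × 10^-5) can be sketched as follows. The total step count is an inference from the quoted 100B tokens and 4.2M-token batch size; the paper does not state it explicitly, and the function name is hypothetical.

```python
import math

PEAK_LR, FINAL_LR = 2e-4, 1.28e-5  # peak and final learning rates quoted above
WARMUP_STEPS = 4_800
# Assumed: 100B training tokens / 4.2M-token batches ~= 23,809 steps.
TOTAL_STEPS = 100_000_000_000 // 4_200_000

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR (sketch)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))
```

This is the common warmup-plus-cosine shape (e.g. PyTorch's `CosineAnnealingLR` after a warmup phase); the exact warmup starting value and step-vs-token granularity are not specified in the quoted text.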