Autonomy-of-Experts Models
Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency. The code is available at https://github.com/trestad/Autonomy-of-Experts. ... 4. Experiments |
| Researcher Affiliation | Collaboration | 1Gaoling School of Artificial Intelligence, Renmin University of China 2Large Language Model Department, Tencent 3Southeast University, China 4University of Macau 5Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education 6School of Artificial Intelligence, Wuhan University. Correspondence to: Ruobing Xie <EMAIL>, Rui Yan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 A working pipeline of an MoE layer ... Algorithm 2 A working pipeline of an AoE layer |
| Open Source Code | Yes | The code is available at https://github.com/trestad/Autonomy-of-Experts |
| Open Datasets | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... We train models on 100 billion tokens from RedPajama (Computer, 2023)... We conduct a comprehensive evaluation of language models across a range of widely used tasks, including ARC-Easy (Clark et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), Winogrande (Sakaguchi et al., 2019), HellaSwag (Zellers et al., 2019), MNLI (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), QNLI (Wang et al., 2019), QQP (Wang et al., 2019), and SST-2 (Socher et al., 2013). ... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019). |
| Dataset Splits | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... The first five tasks are evaluated zero-shot, while the remaining tasks are tested three-shot because models exhibit unstable performance in zero-shot scenarios, with most errors arising from incorrect answer formats. ... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019). |
| Hardware Specification | Yes | The accuracy on two challenging tasks is reported, along with the time cost (in minutes) for 8 A800-80G GPUs, which is given in parentheses. |
| Software Dependencies | No | No specific software versions for key dependencies like Python, PyTorch, or CUDA are provided in the paper. It mentions using 'LM Evaluation Harness' and 'AdamW optimizer' but without version numbers. |
| Experiment Setup | Yes | We train small language models consisting of 12 layers, each containing 12 attention heads. Each layer contains 8 experts, with the top-K = 2 experts selected. Models use the Llama (Touvron et al., 2023) vocabulary of size 32,000 and the same pre-RMSNorm (Zhang & Sennrich, 2019) module. We set d_model = 768 and d_ffn = 3,072 for traditional MoE models, while the values of d_low and d_wide for AoE models are variable... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens, a learning rate of 2 × 10⁻⁴, and a linear warmup over the first 4,800 steps, followed by a cosine decay schedule that reduces the learning rate to 1.28 × 10⁻⁵ (Tow et al., 2024). The AdamW optimizer (Loshchilov & Hutter, 2019) is employed with (β₁, β₂) = (0.9, 0.95), a gradient norm clipping threshold of 1, and a weight decay of 0.1. |
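The pseudocode row above refers to the paper's Algorithms 1 and 2 (an MoE layer vs. an AoE layer). A minimal NumPy sketch of the router-free AoE idea as reported, where every expert caches a cheap low-rank pre-activation and the top-K activation norms replace a learned router; the exact factorization shapes, the SiLU nonlinearity, and the norm-based mixing weights are illustrative assumptions, not the paper's definitive implementation:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def aoe_layer(x, w_low, w_up, w_down, top_k=2):
    """Sketch of an AoE forward pass: every expert computes a cheap
    low-rank pre-activation; only the top-K experts ranked by the
    norm of that activation complete their forward pass."""
    tokens = x.shape[0]
    out = np.zeros_like(x)
    # Every expert's low-rank cache: (n_experts, tokens, d_low).
    low = np.einsum("td,edh->eth", x, w_low)
    # Activation norms act as the router-free selection signal.
    norms = np.linalg.norm(low, axis=-1)          # (n_experts, tokens)
    for t in range(tokens):
        top = np.argsort(norms[:, t])[-top_k:]    # experts with largest norms
        w = norms[top, t] / norms[top, t].sum()   # norms -> mixing weights
        for k, e in enumerate(top):
            h = silu(low[e, t] @ w_up[e])         # only winners finish the FFN
            out[t] += w[k] * (h @ w_down[e])
    return out
```

The per-token loop keeps the selection logic explicit; a real implementation would batch the selected experts' computation.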
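The experiment-setup row quotes a linear warmup over the first 4,800 steps to a peak of 2 × 10⁻⁴, followed by cosine decay to 1.28 × 10⁻⁵. A small helper reproducing that schedule; the total step count is an assumption derived from the quoted budget (100B tokens at 4.2M tokens per batch is roughly 24,000 steps), and the function name is illustrative:

```python
import math

def lr_at_step(step, total_steps, peak=2e-4, floor=1.28e-5, warmup=4800):
    """Linear warmup to `peak` over `warmup` steps, then cosine
    decay down to `floor` by `total_steps`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

At step 4,800 the schedule sits exactly at the peak, and at the final step it bottoms out at the quoted 1.28 × 10⁻⁵.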