Autonomy-of-Experts Models

Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency. The code is available at https://github.com/trestad/Autonomy-of-Experts. ... 4. Experiments
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Large Language Model Department, Tencent; 3 Southeast University, China; 4 University of Macau; 5 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; 6 School of Artificial Intelligence, Wuhan University. Correspondence to: Ruobing Xie <EMAIL>, Rui Yan <EMAIL>.
Pseudocode | Yes | Algorithm 1: A working pipeline of an MoE layer ... Algorithm 2: A working pipeline of an AoE layer
Open Source Code | Yes | The code is available at https://github.com/trestad/Autonomy-of-Experts
Open Datasets | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... We train models on 100 billion tokens from RedPajama (Computer, 2023)... We conduct a comprehensive evaluation of language models across a range of widely used tasks, including ARC-easy (Clark et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), Winogrande (Sakaguchi et al., 2019), HellaSwag (Zellers et al., 2019), MNLI (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), QNLI (Wang et al., 2019), QQP (Wang et al., 2019), and SST-2 (Socher et al., 2013). ... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019).
Dataset Splits | Yes | We conducted 5-shot tests on Mixtral 8×7B (Jiang et al., 2024) and Phi-3.5-MoE-instruct (Abdin et al., 2024) using MMLU (Hendrycks et al., 2021) and ARC-Challenge (Clark et al., 2018)... The first five tasks are evaluated zero-shot, while the remaining tasks are tested three-shot because models exhibit unstable performance in zero-shot scenarios, with most errors arising from incorrect answer formats. ... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens... a validation set comprising 5 billion tokens from (Gokaslan & Cohen, 2019).
Hardware Specification | Yes | The accuracy on two challenging tasks is reported, along with the time cost (in minutes) for 8 A800-80G GPUs, which is given in parentheses.
Software Dependencies | No | No specific software versions for key dependencies such as Python, PyTorch, or CUDA are provided in the paper. It mentions using the LM Evaluation Harness and the AdamW optimizer, but without version numbers.
Experiment Setup | Yes | We train small language models consisting of 12 layers, each containing 12 attention heads. Each layer contains 8 experts, with the top-K = 2 experts selected. Models use the Llama (Touvron et al., 2023) vocabulary of size 32,000 and the same pre-RMSNorm (Zhang & Sennrich, 2019) module. We set d_model = 768 and d_ffn = 3,072 for traditional MoE models, while the values of d_low and d_wide for AoE models are variable... We train models on 100 billion tokens from RedPajama (Computer, 2023), with a batch size of 4.2 million tokens, a learning rate of 2 × 10^-4, and a linear warmup over the first 4,800 steps, followed by a cosine decay schedule that reduces the learning rate to 1.28 × 10^-5 (Tow et al., 2024). The AdamW optimizer (Loshchilov & Hutter, 2019) is employed with (β1, β2) = (0.9, 0.95), a gradient norm clipping threshold of 1, and a weight decay of 0.1.
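The pseudocode row above notes that the paper gives working pipelines for an MoE layer (Algorithm 1) and an AoE layer (Algorithm 2). The sketch below illustrates the AoE idea for a single token under stated assumptions: each expert's first projection is factorized into a low-rank pair (here named A and B, hypothetical identifiers), every expert computes its low-dimensional activation, and the activation norm serves as the selection score in place of a separate router. The ReLU stand-in, the score normalization into mixing weights, and the toy sizes are this sketch's assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_LOW, D_FFN = 8, 4, 16  # toy sizes; the setup above uses 768 / variable d_low / 3072
N_EXPERTS, TOP_K = 8, 2           # matches the "8 experts, top-K = 2" setup quoted above

# Per-expert factorized weights: the first projection is split into
# A (d_model -> d_low) and B (d_low -> d_ffn), so every expert can cheaply
# compute a low-dimensional activation before any expert is selected.
A = rng.standard_normal((N_EXPERTS, D_MODEL, D_LOW)) / np.sqrt(D_MODEL)
B = rng.standard_normal((N_EXPERTS, D_LOW, D_FFN)) / np.sqrt(D_LOW)
W2 = rng.standard_normal((N_EXPERTS, D_FFN, D_MODEL)) / np.sqrt(D_FFN)

def aoe_layer(x):
    """AoE-style forward pass for one token (illustrative sketch only)."""
    low = np.einsum("d,edl->el", x, A)       # every expert's low-dim activation
    scores = np.linalg.norm(low, axis=-1)    # activation norms act as routing scores
    top = np.argsort(scores)[-TOP_K:]        # the K most strongly activated experts
    gates = scores[top] / scores[top].sum()  # normalize scores into mixing weights (assumption)
    out = np.zeros(D_MODEL)
    for g, e in zip(gates, top):
        h = np.maximum(low[e] @ B[e], 0.0)   # ReLU stands in for the real activation
        out += g * (h @ W2[e])
    return out, top

y, chosen = aoe_layer(rng.standard_normal(D_MODEL))
```

The contrast with a traditional MoE layer is that here no separate gating network is trained: only the low-rank activations of the unselected experts are discarded, which is what makes the selection "autonomous".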
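The learning-rate schedule quoted in the setup row (linear warmup over 4,800 steps to 2 × 10^-4, then cosine decay to 1.28 × 10^-5) can be sketched as follows. The total step count is an inference from the quoted 100B tokens and 4.2M-token batch size; the paper does not state it explicitly, and the function name is hypothetical.

```python
import math

PEAK_LR, FINAL_LR = 2e-4, 1.28e-5  # peak and final learning rates quoted above
WARMUP_STEPS = 4_800
# Assumed: 100B training tokens / 4.2M-token batches ~= 23,809 steps.
TOTAL_STEPS = 100_000_000_000 // 4_200_000

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR (sketch)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))
```

This is the common warmup-plus-cosine shape (e.g. PyTorch's `CosineAnnealingLR` after a warmup phase); the exact warmup starting value and step-vs-token granularity are not specified in the quoted text.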