Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mixture of Experts Made Intrinsically Interpretable

Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. Our experiments on chess and language tasks confirm that MoE-X matches or exceeds the performance of dense transformers while eliminating the need for expensive post-hoc interpretability methods. In this section, we conduct experiments on the chess and language datasets to validate the design of MoE-X, focusing on both performance and interpretability. We present the interpretability scores of different models in Table 1.
Researcher Affiliation Academia 1University of Oxford 2National University of Singapore. All listed affiliations are universities.
Pseudocode No The paper describes methods using mathematical formulas and textual explanations but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code No The paper mentions 'https://github.com/EleutherAI/sae-auto-interp' in a footnote, but this is for a third-party auto-interpretability pipeline used for evaluation, not the authors' own implementation code for MoE-X. There is no explicit statement about releasing the code for the methodology described in this paper.
Open Datasets Yes For chess experiments, we train models on lichess_6gb (Karvonen, 2024), a collection of 16 million games from the public Lichess chess games database. The input to the model is a chess PGN string (1.e4 e5 2.Nf3 ...) of a maximum length of 1023 characters, with each character representing an input token. For natural language models, we pretrain on the 10BT subset of FineWeb (Penedo et al., 2024). We use a batch size of 320, a context length of 1024 tokens per sentence, and train all models for 100k gradient steps. We evaluate the models on OpenWebText (Gokaslan et al., 2019), LAMBADA (Paperno et al., 2016), WikiText103, and WikiText2 (Merity et al., 2016), and report the perplexity (PPL) score to show the performance. We collected the activations of the target MLP over 10M tokens from RedPajama-v2 (Weber et al., 2024).
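The chess input encoding described above (one token per PGN character, capped at 1023 characters) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the vocabulary-building step and the `tokenize_pgn` helper are assumptions for the sketch.

```python
# Sketch of character-level PGN tokenization: each character of the game
# string is one token, truncated to the maximum context length of 1023.
# (Illustrative only; the paper's actual tokenizer code is not released.)

MAX_LEN = 1023

def tokenize_pgn(pgn: str, stoi: dict) -> list:
    """Map each character of a PGN move string to an integer token id."""
    pgn = pgn[:MAX_LEN]  # enforce the maximum sequence length
    return [stoi[ch] for ch in pgn]

# Build a toy vocabulary from an example game prefix.
example = "1.e4 e5 2.Nf3"
stoi = {ch: i for i, ch in enumerate(sorted(set(example)))}
ids = tokenize_pgn(example, stoi)  # one id per character
```

In practice the vocabulary would be built once over the full corpus; the per-character scheme keeps it very small (digits, piece letters, punctuation, whitespace).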
Dataset Splits Yes We split the dataset into 99% for training and 1% for validation, and report validation loss to test performance.
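A minimal sketch of the 99%/1% train/validation split described above. The paper does not specify the exact split procedure (e.g., whether it is shuffled or by document), so the shuffling and seed here are assumptions for illustration.

```python
import random

def split_corpus(docs, val_frac=0.01, seed=0):
    """Shuffle documents and hold out val_frac (1%) for validation.
    (Assumed procedure; the paper only states the 99%/1% proportions.)"""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_val = max(1, int(len(docs) * val_frac))
    return docs[n_val:], docs[:n_val]  # train, validation

train, val = split_corpus(range(1000))  # 990 train docs, 10 validation docs
```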
Hardware Specification Yes All experiments were conducted on 4 NVIDIA A40 GPUs.
Software Dependencies No The paper mentions using the 'AdamW optimizer' but does not specify its version or any other software dependencies with version numbers.
Experiment Setup Yes All models have 8 layers and are trained for 60k iterations with a batch size of 100. We use the AdamW optimizer with an initial learning rate of 3e-4 and cosine scheduling to reduce the learning rate to 1e-4 by the end of training. For natural language models, we pretrain on the 10BT subset of FineWeb (Penedo et al., 2024). We use a batch size of 320, a context length of 1024 tokens per sentence, and train all models for 100k gradient steps. The training configuration and hyperparameters are presented in Table 6 and Table 7.
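The learning-rate schedule quoted above (cosine decay from 3e-4 to 1e-4 over training) can be written out as a small function. This is a standard cosine-decay formula consistent with the stated endpoints, not the authors' implementation; the function name and step granularity are assumptions.

```python
import math

def cosine_lr(step, total_steps=60_000, lr_max=3e-4, lr_min=1e-4):
    """Cosine decay: lr_max at step 0, lr_min at total_steps.
    (Standard schedule matching the paper's stated endpoints.)"""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# cosine_lr(0) gives 3e-4; cosine_lr(60_000) gives 1e-4.
```

For the language models, `total_steps` would be 100k rather than 60k, per the setup quoted above.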