Mixture of Parrots: Experts improve memorization more than reasoning

Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham Kakade, Eran Malach

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically validate these findings on synthetic graph problems and memory-intensive closed-book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language.
Researcher Affiliation Collaboration Samy Jelassi (Harvard University); Clara Mohri (Harvard University); David Brandfonbrener (Harvard University, Kempner Institute); Alex Gu (MIT); Nikhil Vyas (Harvard University); Nikhil Anand (Harvard University, Kempner Institute); David Alvarez-Melis (Harvard University, Kempner Institute); Yuanzhi Li (Microsoft Research); Sham M. Kakade (Harvard University, Kempner Institute); Eran Malach (Harvard University, Kempner Institute)
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets Yes The natural language dataset is a mixture constituted of FineWeb-Edu (Penedo et al., 2024), Cosmopedia (Ben Allal et al., 2024), Wikipedia and the training sets of the downstream tasks we evaluate on. The math dataset is a mixture made of Proof-Pile-2 (Azerbayev et al., 2023) and instruction datasets such as OpenMathInstruct (Toshniwal et al., 2024) and MetaMathQA (Yu et al., 2023).
Dataset Splits Yes For the graph experiments, the training set size is 1e6 and the test set consists of 1e3 held-out examples that are sampled from the same distribution as the training examples. For the phone-book experiments, we vary the training set size over {1e5, 5e5, 1e6, 1.5e6, 2e6, 2.5e6, 3e6} and the test set consists of 1e3 queries from the training set. We measure the validation perplexity on 5,000 held-out sequences sampled from the training distribution.
Hardware Specification No The paper mentions 'Kempner Institute computing resources enabled this work' but does not provide specific hardware details like GPU/CPU models, memory, or detailed cluster specifications.
Software Dependencies No The paper mentions software like the AdamW optimizer, Mistral and Mixtral architectures, the OLMoE codebase, the MegaBlocks package, and FSDP, but does not specify their version numbers.
Experiment Setup Yes We set the number of layers L = 20 and vary the width d ∈ {256, 512, 1024, 2048, 4096} for dense transformers and d ∈ {256, 512, 1024} for MoEs. Similarly to Muennighoff et al. (2024), we consistently set the intermediate dimension in the FFN/MoE blocks to d (and not 4d). For MoEs, we vary the number of experts E ∈ {8, 16, 32, 64}. ... We use the AdamW optimizer (Loshchilov et al., 2017) with a weight decay equal to 0.1. We set the learning rate to 0.001, train on 63B tokens (60k steps) with batch size 512 and sequence length of 2048. We use warmup during the 20% first training steps and a linear decay scheduler.
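The quoted schedule (peak learning rate 1e-3, linear warmup over the first 20% of 60k steps, then linear decay) can be sketched as a plain step-to-rate function. This is a minimal illustrative sketch, not the authors' code; the function name and the assumption that decay runs linearly to zero at the final step are ours.

```python
# Sketch of the described schedule: 20% linear warmup, then linear decay.
# Constants come from the quoted setup; decay-to-zero endpoint is assumed.
TOTAL_STEPS = 60_000
WARMUP_STEPS = int(0.2 * TOTAL_STEPS)  # first 20% of training steps
PEAK_LR = 1e-3

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from ~0 up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Linear decay from the peak down to 0 at the final step.
    remaining = TOTAL_STEPS - step
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)
```

In a framework such as PyTorch this per-step function would typically be wired into the optimizer via a lambda-based scheduler, with AdamW configured with weight decay 0.1 as quoted.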