Tight Clusters Make Specialized Experts

Authors: Stefan Nielsen, Rachel Teo, Laziz Abdullaev, Tan Nguyen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
Researcher Affiliation | Collaboration | Stefan K. Nielsen (FPT Software AI Center), Rachel S.Y. Teo (Department of Mathematics, National University of Singapore)
Pseudocode | No | The paper describes methods and equations for the Adaptive Clustering router but does not include a distinct block labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/ACMoE.
Open Datasets | Yes | We evaluate our method on large-scale tasks including WikiText-103 (Merity et al., 2016) language modeling and ImageNet (Deng et al., 2009) object classification.
Dataset Splits | Yes | The WikiText-103 dataset... The validation and test sets consist of 60 articles with 218K and 246K tokens respectively. EnWik8 contains 90M characters for training, 5M for validation, and 5M for testing. We use the full ImageNet dataset, which contains 1.28M training images and 50K validation images.
Hardware Specification | Yes | All models are trained, evaluated, and finetuned on four NVIDIA A100 SXM4 40GB GPUs.
Software Dependencies | No | The paper mentions using the 'Adam' and 'AdamW' optimizers but does not specify versions for these or any other software libraries or dependencies.
Experiment Setup | Yes | All experiments use Adam with a base learning rate of 0.0007. Small configurations use 3000 iterations of learning rate warmup, while medium configurations use 4000 iterations. For WikiText-103 pretraining, small Switch backbones are trained for 40 epochs with a batch size of 96, and medium Switch backbones are trained for 80 epochs with a batch size of 48. We use an auxiliary load-balancing loss with coefficient 0.01.
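The setup row above fixes the base learning rate (0.0007), the warmup length (3000 or 4000 steps), and the auxiliary load-balancing coefficient (0.01). A minimal sketch of how those numbers combine, assuming a linear warmup to the base rate (the paper's post-warmup decay schedule is not stated in this row, so the rate is simply held constant afterward; function names here are illustrative, not from the paper's code):

```python
def warmup_lr(step, base_lr=7e-4, warmup_steps=3000):
    """Linearly warm the learning rate up to base_lr over warmup_steps,
    then hold it constant (post-warmup decay is an unstated assumption)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def total_loss(task_loss, aux_loss, aux_coef=0.01):
    """Combine the task loss with the auxiliary load-balancing loss,
    weighted by the 0.01 coefficient reported in the setup row."""
    return task_loss + aux_coef * aux_loss
```

For the medium configurations, the same sketch applies with `warmup_steps=4000`.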