Tight Clusters Make Specialized Experts
Authors: Stefan Nielsen, Rachel Teo, Laziz Abdullaev, Tan Nguyen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings. |
| Researcher Affiliation | Collaboration | Stefan K. Nielsen, FPT Software AI Center; Rachel S.Y. Teo, Department of Mathematics, National University of Singapore |
| Pseudocode | No | The paper describes methods and equations for the Adaptive Clustering router but does not include a distinct block labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/ACMoE. |
| Open Datasets | Yes | We evaluate our method on large-scale tasks including WikiText-103 (Merity et al., 2016) language modeling and ImageNet (Deng et al., 2009) object classification. |
| Dataset Splits | Yes | The WikiText-103 dataset... The validation and test sets consist of 60 articles with 218K and 246K tokens respectively. EnWik8 contains 90M characters for training, 5M for validation, and 5M for testing. We use the full ImageNet dataset that contains 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All models are trained, evaluated, and finetuned on four NVIDIA A100 SXM4 40GB GPUs. |
| Software Dependencies | No | The paper mentions using the 'Adam' and 'AdamW' optimizers but does not specify versions for these or any other software libraries or dependencies. |
| Experiment Setup | Yes | All experiments use Adam with a base learning rate of 0.0007. Small configurations use 3000 iterations of learning rate warmup while medium configurations use 4000 iterations. For WikiText-103 pretraining, small Switch backbones are trained for 40 epochs with a batch size of 96 and medium Switch backbones are trained for 80 epochs with a batch size of 48. We use an auxiliary load-balancing loss with coefficient 0.01. |
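The warmup schedule in the setup row can be sketched as follows. This is a minimal illustration of a linear learning-rate warmup with the table's stated values (base rate 0.0007, 3000 warmup iterations for the small configuration); the paper's post-warmup decay is not described here, so the sketch simply holds the base rate after warmup, and the function name is hypothetical.

```python
def warmup_lr(step: int, base_lr: float = 7e-4, warmup_steps: int = 3000) -> float:
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps.

    After warmup the base rate is held constant, since the report does not
    specify the decay schedule used afterwards.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Example: the rate reached halfway through and at the end of warmup.
print(warmup_lr(1500))  # halfway: 0.00035
print(warmup_lr(3000))  # warmup complete: 0.0007
```

In a typical PyTorch setup this function would be passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor on the optimizer's base rate; the medium configuration would use `warmup_steps=4000`.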