Tight Clusters Make Specialized Experts

Authors: Stefan Nielsen, Rachel Teo, Laziz Abdullaev, Tan Nguyen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
Researcher Affiliation | Collaboration | Stefan K. Nielsen (FPT Software AI Center), Rachel S.Y. Teo (Department of Mathematics, National University of Singapore)
Pseudocode | No | The paper describes methods and equations for the Adaptive Clustering router but does not include a distinct block labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/ACMoE.
Open Datasets | Yes | We evaluate our method on large-scale tasks including WikiText-103 (Merity et al., 2016) language modeling and ImageNet (Deng et al., 2009) object classification.
Dataset Splits | Yes | The WikiText-103 dataset... The validation and test sets consist of 60 articles with 218K and 246K tokens respectively. EnWik8 contains 90M characters for training, 5M for validation, and 5M for testing. We use the full ImageNet dataset, which contains 1.28M training images and 50K validation images.
Hardware Specification | Yes | All models are trained, evaluated, and finetuned on four NVIDIA A100 SXM4 40GB GPUs.
Software Dependencies | No | The paper mentions using the 'Adam' and 'AdamW' optimizers but does not specify versions for these or any other software libraries or dependencies.
Experiment Setup | Yes | All experiments use Adam with a base learning rate of 0.0007. Small configurations use 3000 iterations of learning rate warmup, while medium configurations use 4000 iterations. For WikiText-103 pretraining, small Switch backbones are trained for 40 epochs with a batch size of 96, and medium Switch backbones are trained for 80 epochs with a batch size of 48. We use an auxiliary load-balancing loss with coefficient 0.01.
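The setup row above fixes the base learning rate (0.0007), the warmup length (3000 or 4000 steps), and the auxiliary load-balancing coefficient (0.01). A minimal sketch of how those numbers combine, assuming a linear warmup to the base rate (the paper's post-warmup decay schedule is not stated in this row, so the rate is simply held constant afterward; function names here are illustrative, not from the paper's code):

```python
def warmup_lr(step, base_lr=7e-4, warmup_steps=3000):
    """Linearly warm the learning rate up to base_lr over warmup_steps,
    then hold it constant (post-warmup decay is an unstated assumption)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def total_loss(task_loss, aux_loss, aux_coef=0.01):
    """Combine the task loss with the auxiliary load-balancing loss,
    weighted by the 0.01 coefficient reported in the setup row."""
    return task_loss + aux_coef * aux_loss
```

For the medium configurations, the same sketch applies with `warmup_steps=4000`.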