Mixture of Lookup Experts

Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE. Through extensive experiments, we validated the effectiveness of MoLE at scales of 160M, 410M, and 1B parameters.
Researcher Affiliation | Collaboration | 1 State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2 Huawei Noah's Ark Lab; 3 Consumer Business Group, Huawei. Correspondence to: Yunhe Wang <EMAIL>, Zhi-Hong Deng <EMAIL>, Yehui Tang <EMAIL>.
Pseudocode | Yes | Appendix A provides pseudocode for both the training and inference phases.

A.1. Training Phase

```python
class MoleDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = Attention(config)
        self.shared_expert = MLP(config)
        self.router = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.routed_expert = nn.ModuleList(
            [MLP(config) for _ in range(config.num_experts)]
        )
        self.input_layernorm = RMSNorm(config.hidden_size)
        self.post_attention_layernorm = RMSNorm(config.hidden_size)
        self.expert_layernorm = RMSNorm(config.hidden_size)

    def forward(self, hidden_states, embedding_states):
        # Attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states
        # Shared expert
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        shared_output = self.shared_expert(hidden_states)
        # Routed experts: during training they operate on the embedding states
        router_value = nn.functional.softmax(self.router(hidden_states), dim=-1)
        embedding_states = self.expert_layernorm(embedding_states)
        routed_output = torch.stack(
            [expert(embedding_states) for expert in self.routed_expert], dim=2
        )
        routed_output = (routed_output * router_value.unsqueeze(-1)).sum(dim=2)
        hidden_states = residual + shared_output + routed_output
        return hidden_states
```

A.2. Inference Phase

```python
class MoleDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = Attention(config)
        self.shared_expert = MLP(config)
        self.router = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.lut = LookupTable(
            config.vocab_size, config.num_experts * config.hidden_size
        )
        self.input_layernorm = RMSNorm(config.hidden_size)
        self.post_attention_layernorm = RMSNorm(config.hidden_size)

    def forward(self, hidden_states, input_ids):
        # Lookup: fetch precomputed expert outputs asynchronously
        lookup_results = self.lut(input_ids).to(hidden_states.device, non_blocking=True)
        # Attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states
        # Shared expert
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        shared_output = self.shared_expert(hidden_states)
        # Routed experts: a weighted sum over precomputed lookup entries
        router_value = nn.functional.softmax(self.router(hidden_states), dim=-1)
        lookup_results = lookup_results.view(
            *input_ids.shape, config.num_experts, config.hidden_size
        )
        routed_output = (lookup_results * router_value.unsqueeze(-1)).sum(dim=-2)
        hidden_states = residual + shared_output + routed_output
        return hidden_states
```
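The key property the two phases rely on is that the routed experts take only the token embeddings as input, so each expert's output depends solely on the token id and can be precomputed into a lookup table offline. The toy NumPy sketch below illustrates this re-parameterization with linear stand-in experts and a softmax stand-in router; all shapes and names here are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, num_experts = 16, 8, 4

# Toy stand-ins: an embedding table and linear "experts"
embedding = rng.standard_normal((vocab_size, hidden))
experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]

# Offline re-parameterization: precompute every expert's output for every vocab id
lut = np.stack([embedding @ W for W in experts], axis=1)  # (vocab, experts, hidden)

input_ids = np.array([3, 7, 11])
router_value = rng.random((len(input_ids), num_experts))
router_value /= router_value.sum(-1, keepdims=True)  # normalized routing weights

# Training-phase compute: run every expert on the token embeddings
train_out = np.einsum(
    "te,teh->th",
    router_value,
    np.stack([embedding[input_ids] @ W for W in experts], axis=1),
)

# Inference-phase compute: a weighted sum over precomputed lookup rows
infer_out = (lut[input_ids] * router_value[..., None]).sum(axis=1)

assert np.allclose(train_out, infer_out)
```

Because the two paths are numerically identical, the routed-expert FLOPs at decode time are replaced by a memory lookup, which is what lets the table live in slower storage without stalling the GPU.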
Open Source Code | Yes | Code: https://github.com/JieShibo/MoLE.
Open Datasets | Yes | We train all models on a 100B-token subset of the deduped Pile dataset (Gao et al., 2021), using the GPT-NeoX tokenizer employed by Pythia, with a vocabulary size of 50k.
Dataset Splits | No | We train all models on a 100B-token subset of the deduped Pile dataset (Gao et al., 2021), using the GPT-NeoX tokenizer employed by Pythia, with a vocabulary size of 50k. We use the lm-evaluation-harness package for evaluation. The benchmarks used include ARC-C (Clark et al., 2018), ARC-E (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), RACE (Lai et al., 2017), SIQA (Sap et al., 2019), and LAMBADA (Paperno et al., 2016). For all these benchmarks, we report zero-shot accuracy. The paper does not specify train/validation/test splits for the Pile dataset, nor explicit splits for the evaluation benchmarks, since it evaluates zero-shot.
Hardware Specification | Yes | We measure the per-step decoding latency of models with 410M activated parameters on an NVIDIA V100 GPU using Hugging Face's transformers package.
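Per-step decoding latency can be measured by timing each single-token decode step after a few warm-up iterations. The helper below is a hypothetical sketch of that measurement loop, not the paper's benchmark code; a real benchmark would replace the stand-in step with a one-token forward pass (and synchronize the GPU before reading the clock).

```python
import time

def measure_decode_latency(step_fn, num_steps=50, warmup=5):
    """Average wall-clock time per decode step (hypothetical helper)."""
    for _ in range(warmup):
        step_fn()  # warm-up steps, excluded from timing
    latencies = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()
        latencies.append(time.perf_counter() - t0)
    return sum(latencies) / len(latencies)

# Stand-in decode step; a real run would generate one token per call
avg_latency = measure_decode_latency(lambda: sum(range(1000)))
```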
Software Dependencies | No | We measure the per-step decoding latency of models with 410M activated parameters on an NVIDIA V100 GPU using Hugging Face's transformers package. We use the lm-evaluation-harness package for evaluation. The paper mentions software packages but does not provide specific version numbers.
Experiment Setup | Yes | Hyper-Parameters. We follow the learning rate settings used by Pythia, specifically 6.0 × 10⁻⁴ for the models with 160M activated parameters, and 3.0 × 10⁻⁴ for the models with 410M and 1B activated parameters. For the MoE model, the coefficients for the z-loss and load balance loss are set to 0.001 and 0.01, respectively, as suggested by Muennighoff et al. (2024). From Appendix B: global-batch-size 1024, gradient-clipping 1.0, lr-decay-style cosine, optimizer.type Adam, train-iters 50000, weight-decay 0.01.
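The excerpt specifies the peak learning rate, cosine decay style, and 50000 training iterations; the sketch below puts those together into a standard cosine decay schedule. The warmup-free shape and `min_lr` floor of 0 are assumptions on our part, since the report does not state them.

```python
import math

def cosine_lr(step, max_lr, total_steps, min_lr=0.0):
    """Cosine decay from max_lr to min_lr over total_steps.
    min_lr=0 and the absence of warmup are assumptions."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

peak = 6.0e-4  # 160M models; 3.0e-4 would be used for 410M and 1B
lr_start = cosine_lr(0, peak, 50000)      # starts at the peak rate
lr_mid = cosine_lr(25000, peak, 50000)    # half the peak at the midpoint
lr_end = cosine_lr(50000, peak, 50000)    # decays to min_lr
```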