Mixture of Lookup Experts

Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE. Through extensive experiments, we validated the effectiveness of MoLE at scales of 160M, 410M, and 1B parameters.
Researcher Affiliation | Collaboration | 1 State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2 Huawei Noah's Ark Lab; 3 Consumer Business Group, Huawei. Correspondence to: Yunhe Wang <EMAIL>, Zhi-Hong Deng <EMAIL>, Yehui Tang <EMAIL>.
Pseudocode | Yes | Appendix A provides pseudocode for both the training and inference phases.

A.1. Training Phase

```python
class MoleDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = Attention(config)
        self.shared_expert = MLP(config)
        self.router = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.routed_expert = nn.ModuleList(
            [MLP(config) for _ in range(config.num_experts)]
        )
        self.input_layernorm = RMSNorm(config.hidden_size)
        self.post_attention_layernorm = RMSNorm(config.hidden_size)
        self.expert_layernorm = RMSNorm(config.hidden_size)

    def forward(self, hidden_states, embedding_states):
        # Attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states
        # Shared expert
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        shared_output = self.shared_expert(hidden_states)
        # Routed experts: during training they operate on the embedding states
        router_value = nn.functional.softmax(self.router(hidden_states), dim=-1)
        embedding_states = self.expert_layernorm(embedding_states)
        routed_output = torch.stack(
            [expert(embedding_states) for expert in self.routed_expert], dim=2
        )
        routed_output = (routed_output * router_value.unsqueeze(-1)).sum(dim=2)
        hidden_states = residual + shared_output + routed_output
        return hidden_states
```

A.2. Inference Phase

```python
class MoleDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = Attention(config)
        self.shared_expert = MLP(config)
        self.router = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.lut = LookupTable(
            config.vocab_size, config.num_experts * config.hidden_size
        )
        self.input_layernorm = RMSNorm(config.hidden_size)
        self.post_attention_layernorm = RMSNorm(config.hidden_size)

    def forward(self, hidden_states, input_ids):
        # Lookup: fetch precomputed expert outputs asynchronously
        lookup_results = self.lut(input_ids).to(hidden_states.device, non_blocking=True)
        # Attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states
        # Shared expert
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        shared_output = self.shared_expert(hidden_states)
        # Routed experts: a weighted sum over precomputed lookup entries
        router_value = nn.functional.softmax(self.router(hidden_states), dim=-1)
        lookup_results = lookup_results.view(
            *input_ids.shape, config.num_experts, config.hidden_size
        )
        routed_output = (lookup_results * router_value.unsqueeze(-1)).sum(dim=-2)
        hidden_states = residual + shared_output + routed_output
        return hidden_states
```
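The key property the two phases rely on is that the routed experts take only the token embeddings as input, so each expert's output depends solely on the token id and can be precomputed into a lookup table offline. The toy NumPy sketch below illustrates this re-parameterization with linear stand-in experts and a softmax stand-in router; all shapes and names here are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, num_experts = 16, 8, 4

# Toy stand-ins: an embedding table and linear "experts"
embedding = rng.standard_normal((vocab_size, hidden))
experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]

# Offline re-parameterization: precompute every expert's output for every vocab id
lut = np.stack([embedding @ W for W in experts], axis=1)  # (vocab, experts, hidden)

input_ids = np.array([3, 7, 11])
router_value = rng.random((len(input_ids), num_experts))
router_value /= router_value.sum(-1, keepdims=True)  # normalized routing weights

# Training-phase compute: run every expert on the token embeddings
train_out = np.einsum(
    "te,teh->th",
    router_value,
    np.stack([embedding[input_ids] @ W for W in experts], axis=1),
)

# Inference-phase compute: a weighted sum over precomputed lookup rows
infer_out = (lut[input_ids] * router_value[..., None]).sum(axis=1)

assert np.allclose(train_out, infer_out)
```

Because the two paths are numerically identical, the routed-expert FLOPs at decode time are replaced by a memory lookup, which is what lets the table live in slower storage without stalling the GPU.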
Open Source Code | Yes | Code: https://github.com/JieShibo/MoLE.
Open Datasets | Yes | We train all models on a 100B-token subset of the deduped Pile dataset (Gao et al., 2021), using the GPT-NeoX tokenizer employed by Pythia, with a vocabulary size of 50k.
Dataset Splits | No | We train all models on a 100B-token subset of the deduped Pile dataset (Gao et al., 2021), using the GPT-NeoX tokenizer employed by Pythia, with a vocabulary size of 50k. We use the lm-evaluation-harness package for evaluation. The benchmarks used include ARC-C (Clark et al., 2018), ARC-E (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), RACE (Lai et al., 2017), SIQA (Sap et al., 2019), and LAMBADA (Paperno et al., 2016). For all these benchmarks, we report zero-shot accuracy. The paper does not specify train/validation/test splits for the Pile dataset, nor explicit splits for the evaluation benchmarks, since it evaluates zero-shot.
Hardware Specification | Yes | We measure the per-step decoding latency of models with 410M activated parameters on an NVIDIA V100 GPU using Hugging Face's transformers package.
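Per-step decoding latency can be measured by timing each single-token decode step after a few warm-up iterations. The helper below is a hypothetical sketch of that measurement loop, not the paper's benchmark code; a real benchmark would replace the stand-in step with a one-token forward pass (and synchronize the GPU before reading the clock).

```python
import time

def measure_decode_latency(step_fn, num_steps=50, warmup=5):
    """Average wall-clock time per decode step (hypothetical helper)."""
    for _ in range(warmup):
        step_fn()  # warm-up steps, excluded from timing
    latencies = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()
        latencies.append(time.perf_counter() - t0)
    return sum(latencies) / len(latencies)

# Stand-in decode step; a real run would generate one token per call
avg_latency = measure_decode_latency(lambda: sum(range(1000)))
```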
Software Dependencies | No | We measure the per-step decoding latency of models with 410M activated parameters on an NVIDIA V100 GPU using Hugging Face's transformers package. We use the lm-evaluation-harness package for evaluation. The paper mentions software packages but does not provide specific version numbers.
Experiment Setup | Yes | Hyper-Parameters. We follow the learning rate settings used by Pythia, specifically 6.0 × 10⁻⁴ for the models with 160M activated parameters, and 3.0 × 10⁻⁴ for the models with 410M and 1B activated parameters. For the MoE model, the coefficients for the z-loss and load balance loss are set to 0.001 and 0.01, respectively, as suggested by Muennighoff et al. (2024). From Appendix B: global-batch-size 1024, gradient-clipping 1.0, lr-decay-style cosine, optimizer.type Adam, train-iters 50000, weight-decay 0.01.
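The excerpt specifies the peak learning rate, cosine decay style, and 50000 training iterations; the sketch below puts those together into a standard cosine decay schedule. The warmup-free shape and `min_lr` floor of 0 are assumptions on our part, since the report does not state them.

```python
import math

def cosine_lr(step, max_lr, total_steps, min_lr=0.0):
    """Cosine decay from max_lr to min_lr over total_steps.
    min_lr=0 and the absence of warmup are assumptions."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

peak = 6.0e-4  # 160M models; 3.0e-4 would be used for 410M and 1B
lr_start = cosine_lr(0, peak, 50000)      # starts at the peak rate
lr_mid = cosine_lr(25000, peak, 50000)    # half the peak at the midpoint
lr_end = cosine_lr(50000, peak, 50000)    # decays to min_lr
```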