MoH: Multi-Head Attention as Mixture-of-Head Attention

Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads.
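The result above describes MoH as activating only a subset of attention heads per token via a routing mechanism. A minimal NumPy sketch of that idea, assuming an MoE-style top-k router over heads; all names and shapes here (e.g. the router weights `Wr`) are illustrative, not the paper's exact formulation in Eq. 1-8:

```python
import numpy as np

def moh_attention_sketch(x, Wq, Wk, Wv, Wr, k_active):
    """Illustrative Mixture-of-Head attention: a router scores all H
    heads per token, keeps the top-k, and the output is a gated sum
    over only the active heads (inactive heads contribute nothing)."""
    n, d = x.shape
    H, d_h = Wq.shape[0], Wq.shape[2]   # number of heads, head dim

    # Standard per-head scaled dot-product attention.
    head_out = np.zeros((H, n, d_h))
    for h in range(H):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_h)
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        head_out[h] = attn @ v

    # Router: per-token logits over heads; keep top-k, softmax-renormalize.
    logits = x @ Wr                                  # (n, H)
    topk = np.argsort(-logits, axis=1)[:, :k_active]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)
    gate = np.where(mask, np.exp(logits), 0.0)
    gate /= gate.sum(1, keepdims=True)

    # Gated sum over heads; only the k_active heads per token are used.
    return np.einsum("nh,hnd->nd", gate, head_out)
```

With `k_active` set to half the heads, this mirrors the "only 50%-90% of the attention heads" regime quoted above; a real implementation would also need the paper's shared-head and load-balancing details.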
Researcher Affiliation | Collaboration | 1 School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China; 2 Pengcheng Laboratory, Shenzhen, China; 3 School of AI for Science, Shenzhen Graduate School, Peking University, Shenzhen, China; 4 Skywork AI, Singapore; 5 Rabbitpre Intelligence, Shenzhen, China; 6 National University of Singapore, Singapore.
Pseudocode | No | The paper describes the methodology using mathematical formulations (Eq. 1-8) and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/SkyworkAI/MoH.
Open Datasets | Yes | MoH-ViT models, based on TransNeXt (Shi, 2024), are trained for 300 epochs using a resolution of 224×224. To ensure a fair comparison, we only replace the standard multi-head attention with our Mixture-of-Head attention (MoH), keeping all other training parameters identical to TransNeXt. ... trained from scratch on the ImageNet-1K dataset (Deng et al., 2009) ... We only use public datasets for training, ensuring accessibility for academic research. Specifically, we sample from the RedPajama (Computer, 2023), Dolma (Soldaini et al., 2024), and Pile (Gao et al., 2020) datasets according to different sampling probabilities.
Dataset Splits | Yes | The ImageNet-1K dataset (Deng et al., 2009), which contains over 1.2 million images in 1,000 categories. ... We only use public datasets for training, ensuring accessibility for academic research. Specifically, we sample from the RedPajama (Computer, 2023), Dolma (Soldaini et al., 2024), and Pile (Gao et al., 2020) datasets according to different sampling probabilities. Please refer to the Appendix for detailed sample ratios. ... Tab. D shows the detailed sample ratios of different open-source datasets for MoH-LLMs. Specifically, we sample from the following datasets according to different sampling probabilities: the RedPajama (Computer, 2023)..., the Dolma (Soldaini et al., 2024)..., and the Pile (Gao et al., 2020)...
Hardware Specification | No | Our MoH-ViT models are trained for 300 epochs using automatic mixed precision across 8 GPUs.
Software Dependencies | No | We optimize our models using the AdamW optimizer (Loshchilov & Hutter, 2017)... We use Megatron (Shoeybi et al., 2019), an open-source training codebase, as the training framework. ... We utilize the tokenizer from LLaMA2 (Touvron et al., 2023), which contains 65,536 vocabulary tokens.
Experiment Setup | Yes | Our MoH-ViT models are trained for 300 epochs using automatic mixed precision across 8 GPUs. We follow the training strategy of TransNeXt, which includes various data augmentation techniques, including Random Augmentation (Cubuk et al., 2020), Mixup (Zhang, 2017), CutMix (Yun et al., 2019), and Random Erasing (Zhong et al., 2020). We also apply Label Smoothing (Szegedy et al., 2016) and DropPath (Huang et al., 2016) to regularize our models. We optimize our models using the AdamW optimizer (Loshchilov & Hutter, 2017) with a gradient clipping norm of 1.0 and a weight decay of 0.05. The initial learning rate is set to 1e-3, with a 5-epoch warm-up starting at 1e-6. A cosine learning rate scheduler (Loshchilov & Hutter, 2016) is employed to decay the learning rate. During training, images are randomly cropped to a size of 224×224. ... Tab. C shows the detailed hyper-parameter settings of various MoH-LLMs. ... Tab. E shows the detailed training hyper-parameters of MoH-LLMs.
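The learning-rate recipe quoted above (base LR 1e-3, 5-epoch linear warm-up from 1e-6, cosine decay over 300 epochs) can be sketched as a plain-Python schedule. This is a reconstruction from the stated hyper-parameters, not the paper's code; the decay floor `min_lr=0.0` is an assumption, since the quoted setup does not state one:

```python
import math

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=5,
                base_lr=1e-3, warmup_start_lr=1e-6, min_lr=0.0):
    """Cosine learning-rate schedule with linear warm-up, following the
    stated recipe: 5 epochs of linear warm-up from 1e-6 to 1e-3, then
    cosine decay over the remaining epochs (min_lr is assumed)."""
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_start_lr to base_lr.
        t = epoch / warmup_epochs
        return warmup_start_lr + t * (base_lr - warmup_start_lr)
    # Cosine decay from base_lr toward min_lr.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

For example, `lr_at_epoch(0)` returns the warm-up start of 1e-6 and `lr_at_epoch(5)` returns the base rate of 1e-3, after which the rate decays smoothly to the floor at epoch 300.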