KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Bingning Wang, Weipeng Chen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pretrained models with over 10 billion parameters. ... 4. Experiments
Researcher Affiliation Industry Baichuan-inc, China. Correspondence to: Bingning Wang <EMAIL>.
Pseudocode Yes M. PyTorch code for KV shifting attention We provide the following Python code that can easily implement KV shifting attention with rotary embedding. In this example, we used a convolution operation to perform the shifting.
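The appendix's code is not reproduced in this report. As a minimal sketch of the mechanism the paper describes (each key/value is a learnable mix of itself and its predecessor), assuming scalar mixing weights `alpha1, alpha2, beta1, beta2` (names are this sketch's, not necessarily the paper's) and omitting rotary embedding and multi-head splitting; the paper shifts via a convolution, while this sketch zero-pads, which is equivalent for a length-2 kernel:

```python
import torch
import torch.nn.functional as F

def kv_shift(x, w_cur, w_prev):
    """out_i = w_cur * x_i + w_prev * x_{i-1}, with x_{-1} taken as zero.
    x: (batch, seq, dim)."""
    shifted = F.pad(x, (0, 0, 1, 0))[:, :-1, :]  # shift right along the sequence axis
    return w_cur * x + w_prev * shifted

def kv_shifting_attention(q, k, v, alpha1, alpha2, beta1, beta2):
    """Causal attention over shifted keys/values.
    q, k, v: (batch, seq, dim); alpha*/beta* are learnable scalars in practice."""
    k = kv_shift(k, alpha1, alpha2)
    v = kv_shift(v, beta1, beta2)
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    causal = torch.triu(torch.ones(q.shape[1], q.shape[1], dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

With the previous-token weights set to zero this reduces exactly to standard causal attention, which makes the standard transformer a special case of the parameterization.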
Open Source Code Yes We provide the training and inference code for the PyTorch implementation in Appendix M.
Open Datasets Yes To validate the performance of different models, we used several benchmarks for the 3B and 19B models that were trained on more tokens, including Lambada (Paperno et al., 2016), Winogrande (Sakaguchi et al., 2021), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), CMMLU (Li et al., 2023a), MMLU (Hendrycks et al.), and Math (Hendrycks et al., 2021).
Dataset Splits Yes We used a large vocabulary with 8000 tokens to randomly generate sentences and ensure that the sequences in them satisfy the condition that when the jth token is the same as the ith token, then the (j + 1)th token is the same as the (i + 1)th token (i < j). We present the accuracy of next-token prediction in Figure 1a. ... In addition, under different evaluation metrics, KV shifting attention achieved better results, which reflects the robustness of KV shifting attention. ... In this section, we conducted MMLU evaluation under three conditions: few shot (5 shot), zero shot, and cloze (zero shot).
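The induction-head property quoted above (if token j repeats token i, the token after j repeats the token after i) can be enforced generatively by recording, for each token, the token that first followed it. A hedged sketch of such a generator (the function name and interface are this sketch's assumptions, not the paper's):

```python
import random

def gen_induction_sequence(length, vocab_size=8000, seed=0):
    """Random token sequence satisfying: if seq[j] == seq[i] (i < j),
    then seq[j + 1] == seq[i + 1]."""
    rng = random.Random(seed)
    seq = []
    successor = {}  # token -> the token that must always follow it
    for _ in range(length):
        if seq and seq[-1] in successor:
            # the previous token has appeared before: repeat its successor
            seq.append(successor[seq[-1]])
            continue
        tok = rng.randrange(vocab_size)
        if seq:
            successor[seq[-1]] = tok  # fix this pairing for all later repeats
        seq.append(tok)
    return seq
```

A model with a working induction head can predict every repeated bigram's continuation exactly, so next-token accuracy on such data directly measures induction-head strength.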
Hardware Specification Yes The experiments trained from scratch are conducted on Nvidia H800-80G GPUs, while others are conducted on Nvidia A100-80G GPUs. ... We conducted toy models for induction heads on 8 Nvidia A100-80G GPUs...
Software Dependencies No We provide the training and inference code for the PyTorch implementation in Appendix M.
Experiment Setup Yes Hyperparameters We used a constant learning rate with a linear warmup of 1000 steps. The learning rate for the 1.4B / 3B / 7B / 13B / 19B model is 2e-4 / 8e-4 / 2e-4 / 2e-4 / 2e-4, and the batch size is 1M / 16M / 1M / 2M / 3M tokens. For optimization, we apply the AdamW optimizer with β1 = 0.9 and β2 = 0.95, and weight decay = 0.1.
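The optimizer and schedule in this row map directly onto standard PyTorch components. A hedged sketch of that setup (the `Linear` module is a stand-in for the actual transformer, and the 2e-4 learning rate is the 1.4B configuration):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in; the paper trains transformers up to 19B
opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

# Linear warmup over 1000 steps, then a constant learning rate.
warmup_steps = 1000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps))
```

The `LambdaLR` multiplier ramps from 1/1000 of the base rate up to 1.0 and stays there, matching "constant learning rate with a linear warmup of 1000 steps".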