KV Shifting Attention Enhances Language Modeling
Authors: Mingyu Xu, Bingning Wang, Weipeng Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pretrained models with over 10 billion parameters. ... 4. Experiments |
| Researcher Affiliation | Industry | 1Baichuan-inc, China. Correspondence to: Bingning Wang <EMAIL>. |
| Pseudocode | Yes | M. Pytorch code for KV shifting attention We provide the following Python code that can easily implement KV shifting attention with rotary embedding. In this example, we used convolution operation to perform shifting operations. |
| Open Source Code | Yes | We provide the training and inference code for the PyTorch implementation in Appendix M. |
| Open Datasets | Yes | To validate the performance of different models, we used some benchmarks for the 3B and 19B models that were trained on more tokens, including Lambada (Paperno et al., 2016), Winogrande (Sakaguchi et al., 2021), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), CMMLU (Li et al., 2023a), MMLU (Hendrycks et al.), and MATH (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We used a vocabulary of 8000 tokens to randomly generate sentences and ensure that the sequences in them satisfy the condition that when the jth token is the same as the ith token, then the (j + 1)th token is the same as the (i + 1)th token (i < j). We present the accuracy of next-token prediction in Figure 1a. ... In addition, under different evaluation metrics, KV shifting attention achieved better results, which reflects the robustness of KV shifting attention. ... In this section, we conducted MMLU evaluation under three conditions: few shot (5 shot), zero shot, and cloze (zero shot). |
| Hardware Specification | Yes | The experiments trained from scratch are conducted on Nvidia H800-80G GPUs, while others are conducted on Nvidia A100-80G GPUs. ... We conducted toy models for induction heads on 8 Nvidia A100-80G GPUs... |
| Software Dependencies | No | We provide the training and inference code for the PyTorch implementation in Appendix M. |
| Experiment Setup | Yes | Hyperparameters: We used a constant learning rate with a linear warmup of 1000 steps. The learning rate for the 1.4B / 3B / 7B / 13B / 19B model is 2e-4 / 8e-4 / 2e-4 / 2e-4 / 2e-4, and the batch size is 1M / 16M / 1M / 2M / 3M. For optimization, we apply the AdamW optimizer with β1 = 0.9 and β2 = 0.95, and weight decay = 0.1. |
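The paper's Appendix M implements KV shifting attention with a convolution over the key/value sequences. The minimal PyTorch sketch below expresses the same idea with an explicit pad-and-shift instead of a convolution: each key (and value) at position i is replaced by a learnable mix of itself and its predecessor. The function names, the tensor layout, and the scalar mixing weights `a_cur`/`a_prev` are our illustration, not the paper's released code (which also handles rotary embeddings and per-head parameters).

```python
import torch
import torch.nn.functional as F

def kv_shift(x, a_cur, a_prev):
    # x: (batch, seq, dim). Mix each position with its predecessor:
    #   x'_i = a_cur * x_i + a_prev * x_{i-1}   (x_{-1} treated as zeros)
    # F.pad pads trailing dims first: (0, 0) for dim, (1, 0) for seq.
    shifted = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
    return a_cur * x + a_prev * shifted

def kv_shifting_attention(q, k, v, alphas, betas):
    # q, k, v: (batch, seq, dim); alphas/betas: learnable scalar pairs
    # mixing the keys and values respectively.
    k2 = kv_shift(k, alphas[0], alphas[1])
    v2 = kv_shift(v, betas[0], betas[1])
    scale = k2.shape[-1] ** 0.5
    scores = q @ k2.transpose(-2, -1) / scale
    # Causal mask: position i may only attend to positions <= i.
    seq = q.shape[1]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v2
```

With `alphas = betas = (1, 0)` the shift is a no-op and this reduces to standard causal attention, which is a convenient sanity check when wiring it into a model.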
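The synthetic induction-head data described in the Dataset Splits row (sequences where a repeated token is always followed by the same continuation as its first occurrence) can be generated by recording the first-seen successor of each token and reusing it on every repeat. This generator is our reconstruction of that constraint, not the authors' data pipeline; the function name and sampling scheme are assumptions.

```python
import random

def make_induction_sequence(length, vocab_size=8000, seed=None):
    # Generate a token sequence satisfying the induction-head property:
    # if tokens[j] == tokens[i] for some i < j, then tokens[j+1] == tokens[i+1].
    # Enforced by giving every token a single, fixed successor.
    rng = random.Random(seed)
    tokens = []
    successor = {}  # first-seen successor of each token
    while len(tokens) < length:
        if tokens and tokens[-1] in successor:
            # The previous token appeared before: repeat its continuation.
            tokens.append(successor[tokens[-1]])
        else:
            # Fresh context: sample a new token and record the transition.
            t = rng.randrange(vocab_size)
            if tokens:
                successor[tokens[-1]] = t
            tokens.append(t)
    return tokens
```

Because every token maps to exactly one successor, an induction head that copies the continuation of the earlier occurrence achieves perfect next-token accuracy on repeats, which is what Figure 1a of the paper measures.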