KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Bingning Wang, Weipeng Chen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pretrained models with over 10 billion parameters. ... 4. Experiments
Researcher Affiliation Industry Baichuan-inc, China. Correspondence to: Bingning Wang <EMAIL>.
Pseudocode Yes M. PyTorch code for KV shifting attention We provide the following Python code that can easily implement KV shifting attention with rotary embedding. In this example, we used a convolution operation to perform the shifting.
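The appendix's code is not reproduced in this report. As a minimal sketch of the mechanism the paper describes (each key/value is a learnable mix of itself and its predecessor), assuming scalar mixing weights `alpha1, alpha2, beta1, beta2` (names are this sketch's, not necessarily the paper's) and omitting rotary embedding and multi-head splitting; the paper shifts via a convolution, while this sketch zero-pads, which is equivalent for a length-2 kernel:

```python
import torch
import torch.nn.functional as F

def kv_shift(x, w_cur, w_prev):
    """out_i = w_cur * x_i + w_prev * x_{i-1}, with x_{-1} taken as zero.
    x: (batch, seq, dim)."""
    shifted = F.pad(x, (0, 0, 1, 0))[:, :-1, :]  # shift right along the sequence axis
    return w_cur * x + w_prev * shifted

def kv_shifting_attention(q, k, v, alpha1, alpha2, beta1, beta2):
    """Causal attention over shifted keys/values.
    q, k, v: (batch, seq, dim); alpha*/beta* are learnable scalars in practice."""
    k = kv_shift(k, alpha1, alpha2)
    v = kv_shift(v, beta1, beta2)
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    causal = torch.triu(torch.ones(q.shape[1], q.shape[1], dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

With the previous-token weights set to zero this reduces exactly to standard causal attention, which makes the standard transformer a special case of the parameterization.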
Open Source Code Yes We provide the training and inference code for the PyTorch implementation in Appendix M.
Open Datasets Yes To validate the performance of different models, we used several benchmarks for the 3B and 19B models that were trained on more tokens, including Lambada (Paperno et al., 2016), Winogrande (Sakaguchi et al., 2021), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), CMMLU (Li et al., 2023a), MMLU (Hendrycks et al.), and Math (Hendrycks et al., 2021).
Dataset Splits Yes We used a large vocabulary with 8000 tokens to randomly generate sentences and ensure that the sequences in them satisfy the condition that when the jth token is the same as the ith token, then the (j + 1)th token is the same as the (i + 1)th token (i < j). We present the accuracy of next-token prediction in Figure 1a. ... In addition, under different evaluation metrics, KV shifting attention achieved better results, which reflects the robustness of KV shifting attention. ... In this section, we conducted MMLU evaluation under three conditions: few shot (5 shot), zero shot, and cloze (zero shot).
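The induction-head property quoted above (if token j repeats token i, the token after j repeats the token after i) can be enforced generatively by recording, for each token, the token that first followed it. A hedged sketch of such a generator (the function name and interface are this sketch's assumptions, not the paper's):

```python
import random

def gen_induction_sequence(length, vocab_size=8000, seed=0):
    """Random token sequence satisfying: if seq[j] == seq[i] (i < j),
    then seq[j + 1] == seq[i + 1]."""
    rng = random.Random(seed)
    seq = []
    successor = {}  # token -> the token that must always follow it
    for _ in range(length):
        if seq and seq[-1] in successor:
            # the previous token has appeared before: repeat its successor
            seq.append(successor[seq[-1]])
            continue
        tok = rng.randrange(vocab_size)
        if seq:
            successor[seq[-1]] = tok  # fix this pairing for all later repeats
        seq.append(tok)
    return seq
```

A model with a working induction head can predict every repeated bigram's continuation exactly, so next-token accuracy on such data directly measures induction-head strength.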
Hardware Specification Yes The experiments trained from scratch are conducted on Nvidia H800-80G GPUs, while others are conducted on Nvidia A100-80G GPUs. ... We conducted toy models for induction heads on 8 Nvidia A100-80G GPUs...
Software Dependencies No We provide the training and inference code for the PyTorch implementation in Appendix M.
Experiment Setup Yes Hyperparameters We used a constant learning rate with a linear warmup of 1000 steps. The learning rate for the 1.4B / 3B / 7B / 13B / 19B model is 2e-4 / 8e-4 / 2e-4 / 2e-4 / 2e-4, and the batch size is 1M / 16M / 1M / 2M / 3M tokens. For optimization, we apply the AdamW optimizer with β1 = 0.9 and β2 = 0.95, and weight decay = 0.1.
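The optimizer and schedule in this row map directly onto standard PyTorch components. A hedged sketch of that setup (the `Linear` module is a stand-in for the actual transformer, and the 2e-4 learning rate is the 1.4B configuration):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in; the paper trains transformers up to 19B
opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

# Linear warmup over 1000 steps, then a constant learning rate.
warmup_steps = 1000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps))
```

The `LambdaLR` multiplier ramps from 1/1000 of the base rate up to 1.0 and stays there, matching "constant learning rate with a linear warmup of 1000 steps".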