Enhancing Elusive Clues in Knowledge Learning by Contrasting Attention of Language Models
Authors: Jian Gao, Xiao Zhang, Miao Li, Ji Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented on both synthetic and real-world corpora and show that the proposed method outperforms other forms of data augmentation, and that boosting elusive clues universally helps both large and small models. [...] Results in Table 2 show that the proposed token-dropout augmentation based on attention difference significantly outperforms other data augmentation methods. |
| Researcher Affiliation | Academia | Jian Gao (2), Xiao Zhang (1), Miao Li (1, corresponding author), Ji Wu (1,3,4). Affiliations: 1. Department of Electronic Engineering, Tsinghua University; 2. Department of Energy and Power Engineering, Tsinghua University; 3. College of AI, Tsinghua University; 4. Beijing National Research Center for Information Science and Technology. |
| Pseudocode | No | We use the following function to calculate dropout probability for each token: p(r) = α(1 − e^(−βr)) (1) [...] Figure 4 illustrates the process of the proposed augmentation method. |
| Open Source Code | Yes | Code https://github.com/tsinghua-msiip/contrasting_attention [...] We release the code and data used in this paper for reproducibility and further research. |
| Open Datasets | Yes | Zhu and Li (2023) introduced a synthetic biography dataset for evaluating the efficiency of knowledge learning in language models. [...] Aside from the biography dataset, we also evaluate the proposed method on Wikipedia text to verify if the method helps knowledge learning on general text. Specifically, we evaluate on the Paragraph-Level Wikipedia Question-Answering dataset (Du and Cardie 2018). |
| Dataset Splits | No | The task is to finetune (continual pretraining) a language model on the biographies to let it memorize the factual information about the individuals. After training, the model is evaluated on a question-answering task [...] For each experiment, we trained the model from 10 to 30 epochs with learning rates in [5e-5, 1e-3] and selected the model with the best performance. |
| Hardware Specification | Yes | We finetune models with Hugging Face's Transformers library (Wolf et al. 2020) on NVIDIA 4090 GPUs. |
| Software Dependencies | No | We use low-rank adaptation (LoRA) (Hu et al. 2022) to facilitate finetuning of models up to 70 billion parameters. [...] We finetune models with Hugging Face's Transformers library (Wolf et al. 2020) on NVIDIA 4090 GPUs. |
| Experiment Setup | Yes | We use low-rank adaptation (LoRA) (Hu et al. 2022) to facilitate finetuning of models up to 70 billion parameters. As the corpus size is limited, we use a rank of 16 for the LoRA adapters. Adapters are added to all of the model's weights except for the embedding and the output layer. [...] For each experiment, we trained the model from 10 to 30 epochs with learning rates in [5e-5, 1e-3] and selected the model with the best performance. [...] For each of the augmentation methods, we generate 10 augmented versions of each training example and combine them with the original examples. [...] We also searched for the best hyperparameters α and β individually for each method. |
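The dropout-probability formula quoted in the Pseudocode row, p(r) = α(1 − e^(−βr)), can be sketched as a short Python function. This is an illustrative reconstruction, not the authors' released code: the interpretation of r as a per-token attention-difference score, and the default α and β values, are assumptions.

```python
import math
import random

def dropout_prob(r, alpha=0.5, beta=1.0):
    """Eq. 1 from the paper: p(r) = alpha * (1 - exp(-beta * r)).

    p(0) = 0 and p(r) rises monotonically toward alpha as r grows,
    so tokens with larger attention-difference scores are dropped
    more often, but never with probability above alpha.
    """
    return alpha * (1.0 - math.exp(-beta * r))

def token_dropout(tokens, scores, alpha=0.5, beta=1.0, seed=0):
    """Build one augmented example by dropping each token
    independently with probability dropout_prob(score)."""
    rng = random.Random(seed)
    return [
        tok for tok, score in zip(tokens, scores)
        if rng.random() >= dropout_prob(score, alpha, beta)
    ]
```

The saturating form means α caps the maximum dropout rate while β controls how quickly the cap is approached, which matches the paper's report of tuning both hyperparameters per method.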
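The Experiment Setup row describes rank-16 LoRA adapters on all weights except the embedding and output layers. A minimal numerical sketch of the low-rank update itself is below; the dimensions, initialization scale, and function names are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def lora_update(W, A, B, scaling=1.0):
    """LoRA replaces a frozen weight W with W + scaling * (B @ A),
    where A (r x d_in) and B (d_out x r) have rank r << min(d_out, d_in)."""
    return W + scaling * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 16          # rank 16, as in the paper's setup

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))             # B starts at zero, so the adapter
                                     # initially leaves the model unchanged
W_adapted = lora_update(W, A, B)
```

Only A and B (2 * r * d parameters per weight matrix) are trained, which is what makes finetuning a 70B-parameter model tractable on 4090-class GPUs.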